Newsletters and Journals
Newsletters
Live News Feed from World Health Organization
Open Access Journals
Nucleic Acids Research
A peer-reviewed journal focusing on leading-edge research into physical, chemical, biochemical and biological aspects of nucleic acids and proteins involved in nucleic acid metabolism and/or interactions, published monthly by Oxford University Press.
Emerging Infectious Diseases
A peer-reviewed journal tracking and analyzing disease trends, published monthly by the National Center for Infectious Diseases, Centers for Disease Control and Prevention (CDC).
PLoS Pathogens
A peer-reviewed journal featuring important new ideas on bacteria, fungi, parasites, prions, and viruses that contribute to our understanding of the biology of pathogens and pathogen-host interactions, published by the nonprofit organization Public Library of Science.
PLoS Genetics
A peer-reviewed journal covering the full breadth and interdisciplinary nature of genetics and genomics research from mice and flies, to plants and bacteria, published by the nonprofit organization Public Library of Science.
PLoS Computational Biology
A peer-reviewed journal featuring works of exceptional significance that further our understanding of living systems at all scales through the application of computational methods. Published monthly by the nonprofit organization Public Library of Science in association with the International Society for Computational Biology.
The Journal of Biological Chemistry
A peer-reviewed journal covering all areas of biochemistry or molecular biology, published weekly by the American Society for Biochemistry and Molecular Biology, Inc.
RSS feeds of Abstracts from Subscription Journals
Bioinformatics
A peer-reviewed journal focusing on new developments in genome bioinformatics and computational biology, published monthly by Oxford University Press.
| KEPE--a motif frequently superimposed on sumoylation sites in metazoan chromatin proteins and transcription factors |
|
Motivation: We noted that the sumoylation site in C/EBP homologues is conserved beyond the canonical consensus sequence for sumoylation. Therefore, we investigated whether this pattern might define a more general protein motif. Results: We undertook a survey of the human proteome using a regular expression based on the C/EBP motif. This revealed significant enrichment of the motif using different Gene Ontology terms (e.g. ‘transcription’) that pertain to the nucleus. When considering requirements for the motif to be functional (evolutionary conservation, structural accessibility of the motif and proper cell localization of the protein), more than 130 human proteins were retrieved from the UniProt/Swiss-Prot database. These candidates were particularly enriched in transcription factors, including FOS, JUN, Hif-1, MLL2 and members of the KLF, MAF and NFATC families; chromatin modifiers like CHD-8, HDAC4 and DNA Top1; and the transcriptional regulatory kinases HIPK1 and HIPK2. The KEPE motif appears to be restricted to the metazoan lineage and has three length variants (short, medium and long), which do not appear to interchange. Contact: Supplementary information: |
| Slider--maximum use of probability information for alignment of short sequence reads and SNP detection |
|
Motivation: A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output. In this article, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base's probabilities given in the prb files. Results: Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNP predictions are more accurate than other SNP prediction approaches used today that start from the most probable sequence, including those using base quality. Contact: Supplementary information and availability: |
| Discovery of phosphorylation motif mixtures in phosphoproteomics data |
|
Motivation: Modification of proteins via phosphorylation is a primary mechanism for signal transduction in cells. Phosphorylation sites on proteins are determined in part through particular patterns, or motifs, present in the amino acid sequence. Results: We describe an algorithm that simultaneously discovers multiple motifs in a set of peptides that were phosphorylated by several different kinases. Such sets of peptides are routinely produced in proteomics experiments. Our motif-finding algorithm uses the principle of minimum description length to determine a mixture of sequence motifs that distinguish a foreground set of phosphopeptides from a background set of unphosphorylated peptides. We show that our algorithm outperforms existing motif-finding algorithms on synthetic datasets consisting of mixtures of known phosphorylation sites. We also derive a motif specificity score that quantifies whether or not the phosphoproteins containing an instance of a motif have a significant number of known interactions. Application of our motif-finding algorithm to recently published human and mouse proteomic studies recovers several known phosphorylation motifs and reveals a number of novel motifs that are enriched for interactions with a particular kinase or phosphatase. Our tools provide a new approach for uncovering the sequence specificities of uncharacterized kinases or phosphatases. Availability: Software is available at Contact: Supplementary information: |
| Predicting DNA recognition by Cys2His2 zinc finger proteins |
|
Motivation: Cys2His2 zinc finger (ZF) proteins represent the largest class of eukaryotic transcription factors. Their modular structure and well-conserved protein-DNA interface allow the development of computational approaches for predicting their DNA-binding preferences even when no binding sites are known for a particular protein. The ‘canonical model’ for ZF protein-DNA interaction consists of only four amino acid nucleotide contacts per zinc finger domain. Results: We present an approach for predicting ZF binding based on support vector machines (SVMs). While most previous computational approaches have been based solely on examples of known ZF protein–DNA interactions, ours additionally incorporates information about protein–DNA pairs known to bind weakly or not at all. Moreover, SVMs with a linear kernel can naturally incorporate constraints about the relative binding affinities of protein-DNA pairs; this type of information has not been used previously in predicting ZF protein-DNA binding. Here, we build a high-quality literature-derived experimental database of ZF–DNA binding examples and utilize it to test both linear and polynomial kernels for predicting ZF protein–DNA binding on the basis of the canonical binding model. The polynomial SVM outperforms previously published prediction procedures as well as the linear SVM. This may indicate the presence of dependencies between contacts in the canonical binding model and suggests that modification of the underlying structural model may result in further improved performance in predicting ZF protein–DNA binding. Overall, this work demonstrates that methods incorporating information about non-binding and relative binding of protein–DNA pairs have great potential for effective prediction of protein–DNA interactions. Availability: An online tool for predicting ZF DNA binding is available at Contact: Supplementary information: |
| Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature |
|
Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information which reflects the characteristics of 20 kinds of amino acids for two physical–chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class. Results: The results show that the RF model achieves 91.41% overall accuracy with a Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein–DNA interactions. Availability: DBindR web server implementation is freely available at Contact: Supplementary information: |
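The abstract above reports both overall accuracy and the Matthews correlation coefficient for a dataset in which binding residues are far outnumbered by non-binding ones. As a minimal sketch of why the two are reported together, the following computes both from a confusion matrix. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the paper or from the DBindR server.

```python
import math


def accuracy_from_counts(tp, tn, fp, fn):
    """Fraction of residues classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined 0/0 case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced set: 10 binding residues, 90 non-binding residues.
# A classifier that labels every residue "non-binding" looks accurate,
# but its MCC shows it has no predictive value.
trivial_accuracy = accuracy_from_counts(tp=0, tn=90, fp=0, fn=10)
trivial_mcc = mcc_from_counts(tp=0, tn=90, fp=0, fn=10)

# A perfect classifier on the same hypothetical set.
perfect_mcc = mcc_from_counts(tp=10, tn=90, fp=0, fn=0)
```

Here the trivial classifier reaches 90% accuracy with an MCC of zero, which is the kind of distortion that balancing the classes is meant to avoid.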
| Model-based analysis of non-specific binding for background correction of high-density oligonucleotide microarrays |
|
Motivation: High-density DNA microarrays provide us with useful tools for analyzing DNA and RNA comprehensively. However, the background signal caused by the non-specific binding (NSB) between probe and target makes it difficult to obtain accurate measurements. To remove the background signal, there is a set of background probes on Affymetrix Exon arrays to represent the amount of non-specific signals, and an accurate estimation of non-specific signals using these background probes is desirable for improvement of microarray analyses. Results: We developed a thermodynamic model of NSB on short nucleotide microarrays in which the NSBs are modeled by duplex formation of probes and multiple hypothetical targets. We fitted the observed signal intensities of the background probes with those expected by the model to obtain the model parameters. As a result, we found that the presented model can improve the accuracy of prediction of non-specific signals in comparison with previously proposed methods. This result will provide a useful method to correct for the background signal in oligonucleotide microarray analysis. Availability: The software is implemented in the R language and can be downloaded from our website ( Contact: Supplementary information: |
| Shortest path analysis using partial correlations for classifying gene functions from gene expression data |
|
Motivation: Gaussian graphical models (GGMs) are a popular tool for representing gene association structures. We propose using estimated partial correlations from these models to attach lengths to the edges of the GGM, where the length of an edge is inversely related to the partial correlation between the gene pair. Graphical lasso is used to fit the GGMs and obtain partial correlations. The shortest paths between pairs of genes are found. Where terminal genes have the same biological function, intermediate genes on the path are classified as having the same function. We validate the method on genes of known function using the Rosetta Compendium of yeast (Saccharomyces cerevisiae) gene expression profiles. We also compare our results with those obtained using a graph constructed using correlations. Results: Using a partial correlation graph, we are able to classify approximately twice as many genes to the same level of accuracy as when using a correlation graph. More importantly, when both methods are tuned to classify a similar number of genes, the partial correlation approach can increase the accuracy of the classifications. Contact: |
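The procedure described in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the precision matrix is supplied directly rather than fitted with graphical lasso, the edge length is taken to be 1/|partial correlation| (the abstract says only that the two are inversely related), and all function names, the three-gene matrix and the "ribosome" label are hypothetical.

```python
import heapq
import math


def partial_correlations(precision):
    """Partial correlations from a precision (inverse covariance) matrix."""
    n = len(precision)
    rho = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                rho[i][j] = -precision[i][j] / scale
    return rho


def build_graph(rho, tol=1e-12):
    """Assumed edge length 1/|rho|; a zero partial correlation means no edge."""
    n = len(rho)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and abs(rho[i][j]) > tol:
                graph[i][j] = 1.0 / abs(rho[i][j])
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (length, path) or (inf, []) if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return math.inf, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


def classify_intermediates(path, labels):
    """Give unlabeled intermediate genes the label shared by both terminal genes."""
    if len(path) < 3:
        return {}
    first, last = labels.get(path[0]), labels.get(path[-1])
    if first is None or first != last:
        return {}
    return {gene: first for gene in path[1:-1] if gene not in labels}


# Hypothetical three-gene chain: genes 0 and 2 are conditionally independent.
precision = [[2.0, -1.0, 0.0],
             [-1.0, 2.0, -1.0],
             [0.0, -1.0, 2.0]]
graph = build_graph(partial_correlations(precision))
length, path = shortest_path(graph, 0, 2)
predicted = classify_intermediates(path, {0: "ribosome", 2: "ribosome"})
```

In this toy example the only route from gene 0 to gene 2 runs through gene 1, so gene 1 inherits the label shared by the two terminal genes.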
| Power enhancement via multivariate outlier testing with gene expression arrays |
|
Motivation: As the use of microarrays in human studies continues to increase, stringent quality assurance is necessary to ensure accurate experimental interpretation. We present a formal approach for microarray quality assessment that is based on dimension reduction of established measures of signal and noise components of expression followed by parametric multivariate outlier testing. Results: We applied our approach to several data resources. First, as a negative control, we found that the Affymetrix and Illumina contributions to MAQC data were free from outliers at a nominal outlier flagging rate of α = 0.01. Second, we created a tunable framework for artificially corrupting intensity data from the Affymetrix Latin Square spike-in experiment to allow investigation of sensitivity and specificity of quality assurance (QA) criteria. Third, we applied the procedure to 507 Affymetrix microarray GeneChips processed with RNA from human peripheral blood samples. We show that exclusion of arrays by this approach substantially increases inferential power, or the ability to detect differential expression, in large clinical studies. Availability: Contact: |
| The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis |
|
Motivation: Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. Results: Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors. Contact: |
| Conditional random pattern algorithm for LOH inference and segmentation |
|
Motivation: Loss of heterozygosity (LOH) is one of the most important mechanisms in tumor evolution. LOH can be detected from the genotypes of the tumor samples with or without paired normal samples. In paired sample cases, LOH detection for informative single nucleotide polymorphisms (SNPs) is straightforward if there is no genotyping error. However, genotyping errors are unavoidable, and about 70% of SNPs are non-informative, so their LOH status can only be inferred from the neighboring informative SNPs. Results: This article presents a novel LOH inference and segmentation algorithm based on the conditional random pattern (CRP) model. The new model explicitly considers the distance between two neighboring SNPs, as well as the genotyping error rate and the heterozygous rate. This new method is tested on the simulated and real data of the Affymetrix Human Mapping 500K SNP arrays. The experimental results show that the CRP method outperforms the conventional methods based on the hidden Markov model (HMM). Availability: Software is available upon request. Contact: Supplementary information: |
| How frugal is mother nature with haplotypes? |
|
Motivation: Inference of haplotypes from genotype data is crucial and challenging for many vitally important studies. The first, and most critical step, is the ascertainment of a biologically sound model to be optimized. Many models that have been proposed rely partially or entirely on reducing the number of unique haplotypes in the solution. Results: This article examines the parsimony of haplotypes using known haplotypes as well as genotypes from the HapMap project. Our study reveals that there are relatively few unique haplotypes, but not always the least possible, for the datasets with known solutions. Furthermore, we show that there are frequently very large numbers of parsimonious solutions, and the number increases exponentially with increasing cardinality. Moreover, these solutions are quite varied, most of which are not consistent with the true solutions. These results quantify the limitations of the Pure Parsimony model and demonstrate the imperative need to consider additional properties for haplotype inference models. At a higher level, and with broad applicability, this article illustrates the power of combinatorial methods to tease out imperfections in a given biological model. Contact: |
| A novel signaling pathway impact analysis |
|
Motivation: Gene expression class comparison studies may identify hundreds or thousands of genes as differentially expressed (DE) between sample groups. Gaining biological insight from the result of such experiments can be approached, for instance, by identifying the signaling pathways impacted by the observed changes. Most of the existing pathway analysis methods focus on either the number of DE genes observed in a given pathway (enrichment analysis methods), or on the correlation between the pathway genes and the class of the samples (functional class scoring methods). Both approaches treat the pathways as simple sets of genes, disregarding the complex gene interactions that these pathways are built to describe. Results: We describe a novel signaling pathway impact analysis (SPIA) that combines the evidence obtained from the classical enrichment analysis with a novel type of evidence, which measures the actual perturbation on a given pathway under a given condition. A bootstrap procedure is used to assess the significance of the observed total pathway perturbation. Using simulations we show that the evidence derived from perturbations is independent of the pathway enrichment evidence. This allows us to calculate a global pathway significance P-value, which combines the enrichment and perturbation P-values. We illustrate the capabilities of the novel method on four real datasets. The results obtained on these data show that SPIA has better specificity and more sensitivity than several widely used pathway analysis methods. Availability: SPIA was implemented as an R package available at Contact: Supplementary information: |
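The abstract above says that the enrichment and perturbation P-values are combined into a global pathway P-value, and that simulations show the two kinds of evidence to be independent. The abstract does not spell out the combination rule, so the sketch below assumes one standard choice for two independent P-values, Fisher's product method. For two P-values it has the closed form P = c - c ln(c), where c is their product. The function name and the example values are hypothetical.

```python
import math


def combine_two_pvalues(p_enrichment, p_perturbation):
    """Combine two independent P-values with Fisher's product method.

    If both P-values are uniform on [0, 1] under the null hypothesis,
    the probability that their product is at most c is c - c*ln(c).
    """
    for p in (p_enrichment, p_perturbation):
        if not 0.0 <= p <= 1.0:
            raise ValueError("P-values must lie in [0, 1]")
    c = p_enrichment * p_perturbation
    if c == 0.0:
        return 0.0  # limit of c - c*ln(c) as c approaches 0
    return c - c * math.log(c)


# Hypothetical pathway with moderate evidence of both kinds.
global_p = combine_two_pvalues(0.05, 0.05)
```

With these inputs the combined value is smaller than either individual P-value, so two moderately significant results reinforce each other, and the combined value never falls below the raw product.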
| Pan-specific MHC class I predictors: a benchmark of HLA class I pan-specific prediction methods |
|
Motivation: MHC:peptide binding plays a central role in activating the immune surveillance. Computational approaches to determine T-cell epitopes restricted to any given major histocompatibility complex (MHC) molecule are of special practical value in the development of, for instance, vaccines with broad population coverage against emerging pathogens. Methods have recently been published that are able to predict peptide binding to any human MHC class I molecule. In contrast to conventional allele-specific methods, these methods do allow for extrapolation to uncharacterized MHC molecules. These pan-specific human leukocyte antigen (HLA) predictors have not previously been compared using independent evaluation sets. Results: A diverse set of quantitative peptide binding affinity measurements was collected from the Immune Epitope Database (IEDB), together with a large set of HLA class I ligands from the SYFPEITHI database. Based on these datasets, three different pan-specific HLA web-accessible predictors NetMHCpan, adaptive double threading (ADT) and kernel-based inter-allele peptide binding prediction system (KISS) were evaluated. The performance of the pan-specific predictors was also compared with a well performing allele-specific MHC class I predictor, NetMHC, as well as a consensus approach integrating the predictions from the NetMHC and NetMHCpan methods. Conclusions: The benchmark demonstrated that pan-specific methods do provide accurate predictions also for previously uncharacterized MHC molecules. The NetMHCpan method trained to predict actual binding affinities was consistently top ranking both on quantitative (affinity) and binary (ligand) data. However, the KISS method trained to predict binary data was one of the best performing methods when benchmarked on binary data. Finally, a consensus method integrating predictions from the two best performing methods was shown to improve the prediction accuracy. Contact: Supplementary information: |
| Decomposition of complex microbial behaviors into resource-based stress responses |
|
Motivation: Highly redundant metabolic networks and experimental data from cultures likely adapting simultaneously to multiple stresses can complicate the analysis of cellular behaviors. It is proposed that the explicit consideration of these factors is critical to understanding the competitive basis of microbial strategies. Results: Wide ranging, seemingly unrelated Escherichia coli physiological fluxes can be simply and accurately described as linear combinations of a few ecologically relevant stress adaptations. These strategies were identified by decomposing the central metabolism of E. coli into elementary modes (mathematically defined biochemical pathways) and assessing the resource investment cost–benefit properties for each pathway. The approach capitalizes on the inherent tradeoffs related to investing finite resources like nitrogen into different pathway enzymes when the pathways have varying metabolic efficiencies. The subset of ecologically competitive pathways represented 0.02% of the total permissible pathways. The biological relevance of the assembled strategies was tested against 10 000 randomly constructed pathway subsets. None of the randomly assembled collections were able to describe all of the considered experimental data as accurately as the cost-based subset. The results suggest these metabolic strategies are biologically significant. The current descriptions were compared with linear programming (LP)-based flux descriptions using the Euclidean distance metric. The current study's pathway subset described the experimental fluxes with better accuracy than the LP results without having to test multiple objective functions or constraints and while providing additional ecological insight into microbial behavior. The assembled pathways seem to represent a generalized set of strategies that can describe a wide range of microbial responses and hint at evolutionary processes where a handful of successful metabolic strategies are utilized simultaneously in different combinations to adapt to diverse conditions. Contact: Supplementary information: |
| Align human interactome with phenome to identify causative genes and networks underlying disease families |
|
Motivation: Understanding the complexity in gene–phenotype relationship is vital for revealing the genetic basis of common diseases. Recent studies on the basis of human interactome and phenome not only uncover prevalent phenotypic overlap and genetic overlap between diseases, but also reveal a modular organization of the genetic landscape of human diseases, providing new opportunities to reduce the complexity in dissecting the gene–phenotype association. Results: We provide systematic and quantitative evidence that phenotypic overlap implies genetic overlap. With these results, we perform the first heterogeneous alignment of human interactome and phenome via a network alignment technique and identify 39 disease families with corresponding causative gene networks. Finally, we propose AlignPI, an alignment-based framework to predict disease genes, and identify plausible candidates for 70 diseases. Our method scales well to the whole genome, as demonstrated by prioritizing 6154 genes across 37 chromosome regions for Crohn's disease (CD). Results are consistent with a recent meta-analysis of genome-wide association studies for CD. Availability: Bi-modules and disease gene predictions are freely available at the URL Contact: Supplementary information: |