Newsletters and Journals

 

Newsletters

Live News Feed from World Health Organization

Disease Outbreak News


Open Access Journals

Nucleic Acids Research

A peer-reviewed journal focusing on leading edge research into physical, chemical, biochemical and biological aspects of nucleic acids, and on proteins involved in nucleic acid metabolism and/or interactions. Published monthly by Oxford University Press.

Emerging Infectious Diseases

A peer-reviewed journal tracking and analyzing disease trends. Published monthly by the National Center for Infectious Diseases, Centers for Disease Control and Prevention (CDC).

Public Library of Science Journals

PLoS Pathogens

A peer-reviewed journal featuring important new ideas on bacteria, fungi, parasites, prions, and viruses that contribute to our understanding of the biology of pathogens and pathogen-host interactions.

PLoS Genetics

A peer-reviewed journal covering the full breadth and interdisciplinary nature of genetics and genomics research from mice and flies, to plants and bacteria.

PLoS Computational Biology

A peer-reviewed journal featuring works of exceptional significance that further our understanding of living systems at all scales through the application of computational methods. Published in association with the International Society for Computational Biology.

BioMed Central Journals

BMC Microbiology

A peer-reviewed journal featuring original research articles in analytical and functional studies of prokaryotic and eukaryotic microorganisms, viruses and small parasites, as well as host and therapeutic responses to them.

BMC Genomics

A peer-reviewed journal featuring research articles in all aspects of gene mapping, sequencing and analysis, functional genomics, and proteomics.

BMC Bioinformatics

A peer-reviewed journal featuring research articles in all aspects of computational methods used in the analysis and annotation of sequences and structures, as well as all other areas of computational biology.

The Journal of Biological Chemistry

A peer-reviewed journal covering all areas of biochemistry and molecular biology, published weekly by the American Society for Biochemistry and Molecular Biology, Inc.

RSS feeds of Abstracts from Subscription Journals

Bioinformatics

A peer-reviewed journal focusing on new developments in genome bioinformatics and computational biology, published monthly by Oxford University Press.


Small RNA gene identification and mRNA target predictions in bacteria

Motivation: Bacterial small ribonucleic acids (sRNAs) that are not ribosomal, transfer or messenger RNAs were initially identified in the sixties, whereas their molecular functions are still under active investigation today. It is now widely accepted that most play central roles in gene expression regulation in response to environmental changes. Interestingly, some are also implicated in bacterial virulence. Functional studies revealed that a large subset of these sRNAs act by an antisense mechanism thanks to pairing interactions with dedicated mRNA targets, usually around their translation start sites, to modulate gene expression at the posttranscriptional level. Some sRNAs modulate protein activity or mimic the structure of other macromolecules. In the last few years, in silico methods have been developed to detect more bacterial sRNAs. Among these, computational analyses of the bacterial genomes by comparative genomics have predicted the existence of a plethora of sRNAs, some of which were confirmed to be expressed in vivo. The prediction accuracy of these computational tools is highly variable and can be improved. Here we review the computational studies that have contributed to detecting sRNA genes and mRNA targets in bacteria and the methods for their experimental testing. In addition, the remaining challenges are discussed.

Contact: bfelden@univ-rennes1.fr


Modularity of cellular networks shows general center-periphery polarization

The modular biology is supposed to be a bridge from the molecular to the systems biology. Using a new approach, it is shown here that the protein interaction networks of yeast Saccharomyces cerevisiae and bacteria Escherichia coli consist of two large-scale modularity layers, central and peripheral, separated by a zone of depressed modularity. This finding based on the analysis of network topology is further supported by the discovery that there are many more Gene Ontology categories (terms) and KEGG biochemical pathways that are overrepresented in the central and peripheral layers than in the intermediate zone. The categories of the central layer are mostly related to nuclear information processing, regulation and cell cycle, whereas the peripheral layer is dealing with various metabolic and energetic processes, transport and cell communication. A similar center-periphery polarization of modularity is found in the protein domain networks (‘built-in interactome’) and in a powergrid (as a non-biological example). These data suggest a ‘polarized modularity’ model of cellular networks where the central layer seems to be regulatory and to use information storage of the nucleus, whereas the peripheral layer seems devoted to more specialized tasks and environmental interactions, with a complex ‘bus’ between the layers.

Contact: aevin@mail.cytspb.rssi.ru

Supplementary information: Supplementary data are available at Bioinformatics online.


Aggressive assembly of pyrosequencing reads with mates

Motivation: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a ‘hybrid’ approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data.

Results: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data.

Availability: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU General Public License.

Contact: jmiller@jcvi.org

Supplementary information: Supplementary data are available at Bioinformatics online.


Poisson approximation for significance in genome-wide ChIP-chip tiling arrays

Motivation: A genome-wide ChIP-chip tiling array study requires millions of simultaneous comparisons of hybridization for significance. Controlling the false positive rate in genome-wide tiling array studies is very important, because the number of computationally identified regions can easily go beyond the capability of experimental verification. No accurate and efficient method exists for evaluating statistical significance in tiling arrays. The Bonferroni method is overly conservative and the permutation test is time consuming for genome-wide studies.

Result: Motivated by the Poisson clumping heuristic, we propose an accurate and efficient method for evaluating statistical significance in genome-wide ChIP-chip tiling arrays. The method works accurately for any large number of multiple comparisons, and the computational cost for evaluating P-values does not increase with the total number of tests. Based on a moving window approach, we demonstrate how to combine results using various window sizes to increase the detection power while maintaining a specified type I error rate. We further introduce a new false discovery rate control that is more appropriate in measuring the false proportion of binding intervals in tiling array analysis. Our method is general and can be applied to many large-scale genomic and genetic studies.

Availability: http://www.stat.psu.edu/~yuzhang/pass.tar

Contact: yuzhang@stat.psu.edu
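
The contrast between the Bonferroni bound and a Poisson-style approximation can be illustrated with a minimal sketch. This is not the authors' method: it assumes the tests are independent, whereas the abstract's approach is motivated by the Poisson clumping heuristic for moving windows, which this sketch does not model. All function names are hypothetical.

```python
import math


def bonferroni_fwer_bound(n_tests, alpha_per_test):
    """Bonferroni upper bound on the family-wise error rate, capped at 1."""
    return min(1.0, n_tests * alpha_per_test)


def poisson_fwer(n_tests, alpha_per_test):
    """Poisson approximation (assumes independent tests): the number of false
    positives is roughly Poisson(n * alpha), so
    P(at least one false positive) = 1 - exp(-n * alpha)."""
    expected_false_positives = n_tests * alpha_per_test
    return -math.expm1(-expected_false_positives)


def per_test_alpha_for_fwer(n_tests, target_fwer):
    """Invert the Poisson approximation to obtain a per-test threshold."""
    return -math.log1p(-target_fwer) / n_tests


if __name__ == "__main__":
    n = 1_000_000
    print(bonferroni_fwer_bound(n, 1e-6), poisson_fwer(n, 1e-6))
    print(0.05 / n, per_test_alpha_for_fwer(n, 0.05))
```

Under independence the two per-test thresholds are close for small error targets; the larger gains described in the abstract would come from accounting for dependence between neighbouring windows, which is outside the scope of this sketch.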


Identifying molecular markers associated with classification of genotypes by External Logistic Biplots

For characterization of genetic diversity in genotypes several molecular techniques, usually resulting in a binary data matrix, have been used. Despite the fact that in Cluster Analysis (CA) and Principal Coordinates Analysis (PCoA) the interpretation of the variables responsible for grouping is not straightforward, these methods are commonly used to classify genotypes using DNA molecular markers. In this article, we present a novel algorithm that uses a combination of PCoA, CA and Logistic Regression (LR), as a better way to interpret the variables (alleles or bands) associated to the classification of genotypes. The combination of three standard techniques with some new ideas about the geometry of the procedures, allows constructing an External Logistic Biplot (ELB) that helps in the interpretation of the variables responsible for the classification or ordination. An application of the method to study the genetic diversity of four populations from Africa, Asia and Europe, using the HapMap data is included.

Availability: The Matlab code for implementing the methods may be obtained from the web site: http://biplot.usal.es.

Contact: jhonny.demey@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.


Faster exact Markovian probability functions for motif occurrences: a DFA-only approach

Background: The computation of the statistical properties of motif occurrences has an obviously relevant application: patterns that are significantly over- or under-represented in genomes or proteins are interesting candidates for biological roles. However, the problem is computationally hard; as a result, virtually all the existing motif finders use fast but approximate scoring functions, in spite of the fact that they have been shown to produce systematically incorrect results. A few interesting exact approaches are known, but they are very slow and hence not practical in the case of realistic sequences.

Results: We give an exact solution, solely based on deterministic finite-state automata (DFA), to the problem of finding the whole relevant part of the probability distribution function of a simple-word motif in a homogeneous (biological) sequence. Out of that, the z-value can always be computed, while the P-value can be obtained either when it is not too extreme with respect to the number of floating-point digits available in the implementation, or when the number of pattern occurrences is moderately low. In particular, the time complexity of the algorithms for Markov models of moderate order (0≤m≤2) is far better than that of Nuel, which was the fastest similar exact algorithm known to date; in many cases, even approximate methods are outperformed.

Conclusions: DFA are a standard tool of computer science for the study of patterns; previous works in biology propose algorithms involving automata, but there they are used, respectively, as a first step to write a generating function, or to build a finite Markov-chain imbedding (FMCI). In contrast, we directly rely on DFA to perform the calculations; thus we manage to obtain an algorithm which is both easily interpretable and efficient. This approach can be used for exact statistical studies of very long genomes and protein sequences, as we illustrate with some examples on the scale of the human genome.

Contact: paolo.ribeca@gmail.com
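
As a rough illustration of computing an exact occurrence distribution directly on a DFA, the sketch below handles only the simplest case: a single word in an i.i.d. (order-0) sequence, with overlapping occurrences counted. It is a toy dynamic programme, not the authors' optimised algorithm, and the function names are hypothetical.

```python
def build_dfa(word, alphabet):
    """KMP-style DFA: state k means the last k letters read equal word[:k].
    Reaching state len(word) signals one occurrence of the word."""
    m = len(word)
    dfa = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            candidate = word[:state] + ch
            # Longest prefix of `word` that is a suffix of `candidate`.
            k = min(m, len(candidate))
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            dfa[state][ch] = k
    return dfa


def occurrence_distribution(word, n, probs):
    """Exact P(count = c) for overlapping occurrences of `word` in an i.i.d.
    sequence of length n whose letter probabilities are given by `probs`."""
    if not word:
        raise ValueError("word must be non-empty")
    m = len(word)
    dfa = build_dfa(word, list(probs))
    max_count = max(0, n - m + 1)
    # dist[state][count] = probability of that (state, count) pair so far.
    dist = [[0.0] * (max_count + 1) for _ in range(m + 1)]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (max_count + 1) for _ in range(m + 1)]
        for state in range(m + 1):
            for count, p in enumerate(dist[state]):
                if p == 0.0:
                    continue
                for ch, p_ch in probs.items():
                    nxt = dfa[state][ch]
                    new[nxt][count + (1 if nxt == m else 0)] += p * p_ch
        dist = new
    return [sum(dist[s][c] for s in range(m + 1)) for c in range(max_count + 1)]
```

For example, with uniform letter probabilities over A, C, G and T, the word AA in a sequence of length 3 occurs twice only for AAA, so the returned distribution assigns probability 1/64 to a count of two.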


IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions

Motivation: During the last few years, several new small regulatory RNAs (sRNAs) have been discovered in bacteria. Most of them act as post-transcriptional regulators by base pairing to a target mRNA, causing translational repression or activation, or mRNA degradation. Numerous sRNAs have already been identified, but the number of experimentally verified targets is considerably lower. Consequently, computational target prediction is in great demand. Many existing target prediction programs neglect the accessibility of target sites and the existence of a seed, while other approaches are either specialized to certain types of RNAs or too slow for genome-wide searches.

Results: We introduce INTARNA, a new general and fast approach to the prediction of RNA–RNA interactions incorporating accessibility of target sites as well as the existence of a user-definable seed. We successfully applied INTARNA to the prediction of bacterial sRNA targets and determined the exact locations of the interactions with a higher accuracy than competing programs.

Availability: http://www.bioinf.uni-freiburg.de/Software/

Contact: IntaRNA@informatik.uni-freiburg.de

Supplementary information: Supplementary data are available at Bioinformatics online.


Prediction of kinase-specific phosphorylation sites using conditional random fields

Motivation: Phosphorylation is a crucial post-translational protein modification mechanism with important regulatory functions in biological systems. It is catalyzed by a group of enzymes called kinases, each of which recognizes certain target sites in its substrate proteins. Several authors have built computational models trained from sets of experimentally validated phosphorylation sites to predict these target sites for each given kinase. All of these models suffer from certain limitations, such as the fact that they do not take into account the dependencies between amino acid motifs within protein sequences in a global fashion.

Results: We propose a novel approach to predict phosphorylation sites from the protein sequence. The method uses a positive dataset to train a conditional random field (CRF) model. The negative training dataset is used to specify the decision threshold corresponding to a desired false positive rate. Application of the method to experimentally verified benchmark phosphorylation data (Phospho.ELM) shows that it performs well compared to existing methods for most kinases. This is, to our knowledge, the first report of the use of CRFs to predict post-translational modification sites in protein sequences.

Availability: The source code of the implementation, called CRPhos, is available from http://www.ptools.ua.ac.be/CRPhos/

Contact: kris.laukens@ua.ac.be

Supplementary information: Supplementary data are available at http://www.ptools.ua.ac.be/CRPhos/
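
The step of choosing a decision threshold from negative examples for a desired false positive rate can be sketched as follows. This is an illustrative assumption about how such a threshold might be picked, not the CRPhos implementation; the function names and the rule "predict positive when the score is strictly above the threshold" are hypothetical.

```python
def threshold_for_fpr(negative_scores, target_fpr):
    """Return a threshold such that at most a fraction `target_fpr` of the
    negative examples score strictly above it."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    scores = sorted(negative_scores, reverse=True)
    # Largest number of negatives we may tolerate above the threshold.
    allowed = int(target_fpr * len(scores))
    if allowed >= len(scores):
        return float("-inf")
    # The (allowed + 1)-th highest negative score: only the `allowed`
    # higher-scoring negatives (fewer if there are ties) exceed it.
    return scores[allowed]


def false_positive_rate(negative_scores, threshold):
    """Fraction of negatives that would be predicted positive."""
    hits = sum(1 for s in negative_scores if s > threshold)
    return hits / len(negative_scores)
```

A site would then be predicted as phosphorylated when its model score exceeds the returned threshold; on the negatives used to choose the threshold, the observed false positive rate does not exceed the target.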


Predicting small ligand binding sites in proteins using backbone structure

Motivation: Specific non-covalent binding of metal ions and ligands, such as nucleotides and cofactors, is essential for the function of many proteins. Computational methods are useful for predicting the location of such binding sites when experimental information is lacking. Methods that use structural information, when available, are particularly promising since they can potentially identify non-contiguous binding motifs that cannot be found using only the amino acid sequence. Furthermore, a prediction method that can utilize low-resolution models is advantageous because high-resolution structures are available for only a relatively small fraction of proteins.

Results: SitePredict is a machine learning-based method for predicting binding sites in protein structures for specific metal ions or small molecules. The method uses Random Forest classifiers trained on diverse residue-based site properties including spatial clustering of residue types and evolutionary conservation. SitePredict was tested by cross-validation on a set of known binding sites for six different metal ions and five different small molecules in a non-redundant set of protein–ligand complex structures. The prediction performance was good for all ligands considered, as reflected by AUC values of at least 0.8. Furthermore, a more realistic test on unbound structures showed only a slight decrease in the accuracy. The properties that contribute the most to the prediction accuracy of each ligand were also examined. Finally, examples of predicted binding sites in homology models and uncharacterized proteins are discussed.

Availability: Binding site prediction results for all PDB protein structures and human protein homology models are available at http://sitepredict.org/.

Contact: bordner.andrew@mayo.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Integrated search and alignment of protein structures

Motivation: Identification and comparison of similar three-dimensional (3D) protein structures has become an even greater challenge in the face of the rapidly growing structure databases. Here, we introduce Vorometric, a new method that provides efficient search and alignment of a query protein against a database of protein structures. Voronoi contacts of the protein residues are enriched with the secondary structure information and a metric substitution matrix is developed to allow efficient indexing. The contact hits obtained from a distance-based indexing method are extended to obtain high-scoring segment pairs, which are then used to generate structural alignments.

Results: Vorometric is the first to address both search and alignment problems in the protein structure databases. The experimental results show that Vorometric is simultaneously effective in retrieving similar protein structures, producing high-quality structure alignments, and identifying cross-fold similarities. Vorometric outperforms current structure retrieval methods in search accuracy, while requiring comparable running times. Furthermore, the structural superpositions produced are shown to have better quality and coverage, when compared with those of the popular structure alignment tools.

Availability: Vorometric is available as a web service at http://bio.cse.ohio-state.edu/Vorometric

Contact: sacan@cse.ohio-state.edu


Reconstructing tumor-wise protein expression in tissue microarray studies using a Bayesian cell mixture model

Motivation: Tissue microarrays (TMAs) quantify tissue-specific protein expression of cancer biomarkers via high-density immuno-histochemical staining assays. The standard analysis approach estimates a sample mean expression in the tumor, ignoring the complex tissue-specific staining patterns observed on tissue arrays.

Methods: In this article, a cell mixture model (CMM) is proposed to reconstruct tumor expression patterns in TMA experiments. The concept is to assemble the whole-tumor expression pattern by aggregating over the subpopulation of tissue specimens sampled by needle biopsies. The expression pattern in each individual tissue element is assumed to be a zero-augmented Gamma distribution to assimilate the non-staining areas and the staining areas. A hierarchical Bayes model is imposed to borrow strength across tissue specimens and across tumors. A joint model is presented to link the CMM expression model with a survival model for censored failure time observations. The implementation involves imputation steps within each Markov chain Monte Carlo iteration and Monte Carlo integration technique.

Results: The model-based approach provides estimates for various tumor expression characteristics including the percentage of staining, mean intensity of staining and a composite mean staining to associate with patient survival outcome.

Availability: An R package to fit the CMM is available at http://www.mskcc.org/mskcc/html/85130.cfm

Contact: shenr@mskcc.org

Supplementary information: Supplementary data are available at Bioinformatics online.
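
The zero-augmented Gamma distribution mentioned in the Methods can be pictured as a two-part mixture: a point mass at zero for non-staining areas and a Gamma distribution for staining intensity. The sketch below only simulates and summarises such a mixture; it does not implement the hierarchical Bayes or survival model, and the parameter names (`p_zero`, `shape`, `scale`) are assumptions made for illustration.

```python
import random


def sample_zero_augmented_gamma(p_zero, shape, scale, rng=random):
    """Draw one value: 0 with probability p_zero (non-staining),
    otherwise a Gamma(shape, scale) staining intensity."""
    if rng.random() < p_zero:
        return 0.0
    return rng.gammavariate(shape, scale)


def zero_augmented_gamma_mean(p_zero, shape, scale):
    """Overall mean of the mixture: (1 - p_zero) * shape * scale."""
    return (1.0 - p_zero) * shape * scale


def summarize(values):
    """Return (percentage staining, mean intensity among stained values,
    composite mean over all values)."""
    stained = [v for v in values if v > 0.0]
    pct_staining = 100.0 * len(stained) / len(values)
    mean_intensity = sum(stained) / len(stained) if stained else 0.0
    composite_mean = sum(values) / len(values)
    return pct_staining, mean_intensity, composite_mean


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_zero_augmented_gamma(0.3, 2.0, 1.5, rng) for _ in range(20000)]
    print(summarize(draws), zero_augmented_gamma_mean(0.3, 2.0, 1.5))
```

The three summaries loosely mirror the characteristics named in the Results (percentage of staining, mean intensity of staining, and a composite mean staining), here computed from simulated values rather than estimated from data.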


Cross-hybridization modeling on Affymetrix exon arrays

Motivation: Microarray designs have become increasingly probe-rich, enabling targeting of specific features, such as individual exons or single nucleotide polymorphisms. These arrays have the potential to achieve quantitative high-throughput estimates of transcript abundances, but currently these estimates are affected by biases due to cross-hybridization, in which probes hybridize to off-target transcripts.

Results: To study cross-hybridization, we map Affymetrix exon array probes to a set of annotated mRNA transcripts, allowing a small number of mismatches or insertion/deletions between the two sequences. Based on a systematic study of the degree to which probes with a given match type to a transcript are affected by cross-hybridization, we developed a strategy to correct for cross-hybridization biases of gene-level expression estimates. Comparison with Solexa ultra high-throughput sequencing data demonstrates that correction for cross-hybridization leads to a significant improvement of gene expression estimates.

Availability: We provide mappings between human and mouse exon array probes and off-target transcripts, and software extending the GeneBASE program for generating gene-level expression estimates including the cross-hybridization correction, at http://biogibbs.stanford.edu/~kkapur/GeneBase/.

Contact: whwong@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models

Motivation: Modern transcriptomics and proteomics enable us to survey the expression of RNAs and proteins at large scales. While these data are usually generated and analyzed separately, there is an increasing interest in comparing and co-analyzing transcriptome and proteome expression data. A major open question is whether transcriptome and proteome expression is linked and how it is coordinated.

Results: Here we have developed a probabilistic clustering model that permits analysis of the links between transcriptomic and proteomic profiles in a sensible and flexible manner. Our coupled mixture model defines a prior probability distribution over the component to which a protein profile should be assigned conditioned on which component the associated mRNA profile belongs to. We apply this approach to a large dataset of quantitative transcriptomic and proteomic expression data obtained from a human breast epithelial cell line (HMEC). The results reveal a complex relationship between transcriptome and proteome with most mRNA clusters linked to at least two protein clusters, and vice versa. A more detailed analysis incorporating information on gene function from the Gene Ontology database shows that a high correlation of mRNA and protein expression is limited to the components of some molecular machines, such as the ribosome, cell adhesion complexes and the TCP-1 chaperonin involved in protein folding.

Availability: Matlab code is available from the authors on request.

Contact: srogers@dcs.gla.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


Reconstruction of transcriptional dynamics from gene reporter data using differential equations

Motivation: Promoter-driven reporter genes, notably luciferase and green fluorescent protein, provide a tool for the generation of a vast array of time-course data sets from living cells and organisms. The aim of this study is to introduce a modeling framework based on stochastic differential equations (SDEs) and ordinary differential equations (ODEs) that addresses the problem of reconstructing transcription time-course profiles and associated degradation rates. The dynamical model is embedded into a Bayesian framework and inference is performed using Markov chain Monte Carlo algorithms.

Results: We present three case studies where the methodology is used to reconstruct unobserved transcription profiles and to estimate associated degradation rates. We discuss advantages and limits of fitting either SDEs or ODEs and address the problem of parameter identifiability when model variables are unobserved. We also suggest functional forms, such as on/off switches and stimulus response functions, to model transcriptional dynamics and present results of fitting these to experimental data.

Contact: b.f.finkenstadt@warwick.ac.uk

Supplementary Information: Supplementary data are available at Bioinformatics online.
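
To give a feel for the kind of ODE model involved, the sketch below simulates a transcript level driven by an on/off transcription switch and first-order degradation. The model form dM/dt = tau(t) - delta * M is an assumption made for illustration (the abstract does not state its equations), no Bayesian inference is performed, and all names are hypothetical.

```python
def on_off_switch(t, t_on, t_off, rate):
    """Transcription rate equal to `rate` on [t_on, t_off) and 0 otherwise."""
    return rate if t_on <= t < t_off else 0.0


def simulate_mrna(transcription, degradation_rate, t_end, dt=0.01, m0=0.0):
    """Forward-Euler integration of the assumed model
    dM/dt = transcription(t) - degradation_rate * M."""
    steps = int(round(t_end / dt))
    m = m0
    trajectory = [m0]
    for k in range(steps):
        t = k * dt
        m += dt * (transcription(t) - degradation_rate * m)
        trajectory.append(m)
    return trajectory


if __name__ == "__main__":
    # Transcription switched on between t = 0 and t = 10, then off.
    traj = simulate_mrna(lambda t: on_off_switch(t, 0.0, 10.0, 2.0), 0.5, 60.0)
    print(max(traj), traj[-1])
```

With constant transcription the simulated level approaches the ratio of transcription rate to degradation rate, which is one way to see why the two parameters can be hard to identify separately when the transcript itself is unobserved.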


A new rule-based algorithm for identifying metabolic markers in prostate cancer using tandem mass spectrometry

Motivation: Prostate cancer is the most prevalent tumor in males and its incidence is expected to increase as the population ages. Prostate cancer is treatable by excision if detected at an early enough stage. The challenges of early diagnosis require the discovery of novel biomarkers and tools for prostate cancer management.

Results: We developed a novel feature selection algorithm termed as associative voting (AV) for identifying biomarker candidates in prostate cancer data measured via targeted metabolite profiling MS/MS analysis. We benchmarked our algorithm against two standard entropy-based and correlation-based feature selection methods [Information Gain (IG) and ReliefF (RF)] and observed that, on a variety of classification tasks in prostate cancer diagnosis, our algorithm identified subsets of biomarker candidates that are both smaller and show higher discriminatory power than the subsets identified by IG and RF. A literature study confirms that the highest ranked biomarker candidates identified by AV have independently been identified as important factors in prostate cancer development.

Availability: The algorithm can be downloaded from http://biomed.umit.at/page.cfm?pageid=516

Contact: melanie.osl@umit.at

