Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction
Open Access
- 16 March 2007
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 3 (3), e54
- https://doi.org/10.1371/journal.pcbi.0030054
Abstract
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVMs) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
We describe a new approach to statistical learning for sequence data that is broadly applicable to computational biology problems and that has experimentally demonstrated advantages over current hidden Markov model (HMM)-based methods for sequence analysis. The methods we describe in this paper, implemented in the CRAIG program, allow researchers to modularly specify and train sequence analysis models that combine a wide range of weakly informative features into globally optimal predictions.
Our results for the gene prediction problem show significant improvements over existing ab initio gene predictors on a variety of tests, including the especially challenging ENCODE regions. Such improved predictions, particularly on initial and single exons, could benefit researchers who are seeking more accurate means of recognizing such important features as signal peptides and regulatory regions. More generally, we believe that our method, by combining the structure-describing capabilities of HMMs with the accuracy of margin-based classification methods, provides a general tool for statistical learning in biological sequences that will replace HMMs in any sequence modeling task for which there is annotated training data.
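The abstract names an online large-margin training algorithm but gives no details of it. As a rough illustration of the general idea only, the sketch below shows one passive-aggressive style update for a linear scoring model: when the predicted structure scores too close to (or above) the annotated one, the weights move just far enough to restore a margin equal to the loss. This is a generic textbook-style step, not CRAIG's actual procedure; the function names, the feature vectors, the loss value, and the `step_cap` parameter are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def large_margin_update(weights, gold_feats, pred_feats, loss, step_cap=1.0):
    """One passive-aggressive style step (illustrative assumption, not CRAIG).

    gold_feats / pred_feats are global feature vectors of the annotated and
    the currently predicted structure; loss is how wrong the prediction is.
    """
    diff = [g - p for g, p in zip(gold_feats, pred_feats)]
    # Hinge violation: required margin minus the margin currently achieved.
    violation = loss - dot(weights, diff)
    norm_sq = dot(diff, diff)
    if violation <= 0 or norm_sq == 0:
        return list(weights)  # margin already satisfied, or nothing to move along
    tau = min(step_cap, violation / norm_sq)  # capped step size
    return [w + tau * d for w, d in zip(weights, diff)]


# Toy data (made up): three features, prediction differs from annotation.
weights = [0.0, 0.0, 0.0]
gold = [1.0, 0.0, 1.0]
pred = [0.0, 1.0, 1.0]
updated = large_margin_update(weights, gold, pred, loss=2.0)
```

After this single step the annotated structure outscores the predicted one by exactly the requested margin, so repeating the update on the same example leaves the weights unchanged. In a structured setting such as gene prediction, the predicted structure would come from a decoding pass over the sequence, which this sketch leaves out.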