Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction

Open Access

16 March 2007

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 3 (3), e54
https://doi.org/10.1371/journal.pcbi.0030054

Abstract

Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model (HMM), to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, this kind of piecewise training does not optimize prediction accuracy and has difficulty accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVMs) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.

We describe a new approach to statistical learning for sequence data that is broadly applicable to computational biology problems and that has experimentally demonstrated advantages over current HMM-based methods for sequence analysis. The methods we describe in this paper, implemented in the CRAIG program, allow researchers to modularly specify and train sequence analysis models that combine a wide range of weakly informative features into globally optimal predictions. Our results for the gene prediction problem show significant improvements over existing ab initio gene predictors on a variety of tests, including the especially challenging ENCODE regions. Such improved predictions, particularly on initial and single exons, could benefit researchers who are seeking more accurate means of recognizing such important features as signal peptides and regulatory regions. More generally, we believe that our method, by combining the structure-describing capabilities of HMMs with the accuracy of margin-based classification methods, provides a general tool for statistical learning in biological sequences that will replace HMMs in any sequence modeling task for which there is annotated training data.
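
To make the two ingredients named in the abstract more concrete (a linear model scored over whole segments, and online large-margin training), the following is a minimal toy sketch in Python. It is not CRAIG's model, feature set, or training procedure: the two labels, the base-composition and transition features, the segment-length cap `MAX_LEN`, the Hamming loss, and the 1-best passive-aggressive (MIRA-style) update are all illustrative assumptions, and every function name is hypothetical.

```python
from collections import Counter

LABELS = ("coding", "noncoding")   # toy label set (assumption)
MAX_LEN = 8                        # hypothetical cap on segment length

def seg_features(x, start, end, label, prev):
    """Sparse features of one labeled segment x[start:end]."""
    seg = x[start:end]
    f = Counter()
    f[("gc", label)] = sum(c in "GC" for c in seg)
    f[("at", label)] = sum(c in "AT" for c in seg)
    f[("trans", prev, label)] = 1
    return f

def score(w, f):
    """Dot product of a weight dict with a sparse feature vector."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def decode(w, x):
    """Semi-Markov Viterbi: highest-scoring split of x into labeled segments."""
    n = len(x)
    if n == 0:
        return []
    best = {(0, None): (0.0, None)}  # (end, label) -> (score, backpointer)
    for end in range(1, n + 1):
        for label in LABELS:
            cands = []
            for start in range(max(0, end - MAX_LEN), end):
                for prev in ([None] if start == 0 else LABELS):
                    # adjacent segments must carry different labels
                    if prev == label or (start, prev) not in best:
                        continue
                    s = best[(start, prev)][0] + score(
                        w, seg_features(x, start, end, label, prev))
                    cands.append((s, (start, prev)))
            if cands:
                best[(end, label)] = max(cands, key=lambda c: c[0])
    end, label = max(((n, l) for l in LABELS if (n, l) in best),
                     key=lambda k: best[k][0])
    segs = []
    while end > 0:                   # follow backpointers to position 0
        start, prev = best[(end, label)][1]
        segs.append((start, end, label))
        end, label = start, prev
    return segs[::-1]

def global_features(x, segs):
    """Sum of segment features over a whole segmentation."""
    f, prev = Counter(), None
    for start, end, label in segs:
        f.update(seg_features(x, start, end, label, prev))
        prev = label
    return f

def hamming(gold, pred):
    """Number of positions whose label differs between two segmentations."""
    per_base = lambda segs: [l for s, e, l in segs for _ in range(e - s)]
    return sum(a != b for a, b in zip(per_base(gold), per_base(pred)))

def train(data, epochs=10, C=1.0):
    """Online large-margin training with a 1-best passive-aggressive update."""
    w = {}
    for _ in range(epochs):
        for x, gold in data:
            pred = decode(w, x)
            loss = hamming(gold, pred)
            if loss == 0:
                continue
            delta = global_features(x, gold)
            delta.subtract(global_features(x, pred))
            norm = sum(v * v for v in delta.values())
            if norm == 0:
                continue
            # smallest step that makes gold beat pred by a margin of `loss`,
            # clipped at C
            tau = min(C, (loss - score(w, delta)) / norm)
            for k, v in delta.items():
                w[k] = w.get(k, 0.0) + tau * v
    return w

# Toy data (invented for illustration): GC-rich "coding", AT-rich "noncoding".
TOY_DATA = [
    ("ATATGCGCGCATAT", [(0, 4, "noncoding"), (4, 10, "coding"), (10, 14, "noncoding")]),
    ("GCGGCATTTA", [(0, 5, "coding"), (5, 10, "noncoding")]),
]
weights = train(TOY_DATA)
print(decode(weights, "ATGCGCAT"))
```

The point of the sketch is the contrast the abstract draws: all weights are adjusted jointly against the error of the full decoded segmentation, rather than fitting each signal or content model separately and combining them afterwards.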
