Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA

Open Access

7 August 2006

journal article
research article
Published by Springer Nature in Genome Biology

Vol. 7 (S1), S7.1-10
https://doi.org/10.1186/gb-2006-7-s1-s7

Abstract

Background: Accurate, automatic gene identification in eukaryotic genomic DNA is more crucial than ever for efficiently exploiting the large volume of assembled genome sequences available to the community. Automatic methods have always been considered less reliable than human expertise. This is illustrated in the EGASP project, where the reference annotations against which all automatic methods are measured are generated by human annotators and experimentally verified. We hypothesized that an automatic method could replicate the accuracy of human annotators by expressing the rules and decisions they use in a mathematical formalism.

Results: We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNAs, ESTs, protein alignments, exons) and the relationships between them. The graphs are analyzed to process this information according to rules that replicate those used by human annotators. The simple individual objects given as input to Exogean are thus combined and synthesized into complex objects such as protein-coding transcripts.

Conclusion: We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein-coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement.
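To make the idea of a directed acyclic colored multigraph more concrete, the following is a minimal Python sketch. It is an illustration only, not Exogean's implementation: the class name `DACM`, its methods, the edge colors `"same_molecule"` and `"precedes"`, and the toy alignment blocks are all hypothetical, since the abstract does not specify the actual data model or rules. Nodes stand for biological objects, each edge carries a color naming a relationship, parallel edges of different colors are allowed, and any edge that would close a cycle is rejected.

```python
from collections import defaultdict


class DACM:
    """Toy directed acyclic colored multigraph (hypothetical, for illustration)."""

    def __init__(self):
        # edges[src] is a list of (dst, color) pairs; several edges with
        # different colors may link the same pair of nodes (multigraph).
        self.edges = defaultdict(list)
        self.nodes = set()

    def _reachable(self, start, target):
        # Depth-first search over edges of any color.
        stack, seen = [start], set()
        while stack:
            cur = stack.pop()
            if cur == target:
                return True
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(dst for dst, _ in self.edges[cur])
        return False

    def add_edge(self, src, dst, color):
        # Refuse any edge that would close a cycle, so the graph stays acyclic.
        if self._reachable(dst, src):
            raise ValueError(f"edge {src} -> {dst} would create a cycle")
        self.nodes.update((src, dst))
        self.edges[src].append((dst, color))

    def paths(self, color):
        """Return the maximal paths that follow edges of a single color."""
        succ = {n: [d for d, c in self.edges[n] if c == color] for n in self.nodes}
        has_pred = {d for dsts in succ.values() for d in dsts}
        starts = sorted(n for n in self.nodes if n not in has_pred and succ[n])
        result = []

        def walk(node, path):
            if not succ[node]:
                result.append(path)
                return
            for nxt in succ[node]:
                walk(nxt, path + [nxt])

        for s in starts:
            walk(s, [s])
        return result


# Hypothetical input objects: (name, start, end) alignment blocks on the genome.
e1 = ("mRNA1_block1", 100, 200)
e2 = ("mRNA1_block2", 300, 400)
e3 = ("mRNA1_block3", 500, 650)
p1 = ("prot1_block1", 120, 200)

g = DACM()
g.add_edge(e1, e2, "same_molecule")  # consecutive blocks of one mRNA alignment
g.add_edge(e2, e3, "same_molecule")
g.add_edge(e1, e2, "precedes")       # parallel edge with a second color
g.add_edge(p1, e2, "precedes")       # block from a different evidence source

# A toy "rule": chain blocks of the same molecule into one transcript model.
transcripts = g.paths("same_molecule")
```

Here the single-color traversal plays the role of one annotation rule: it combines simple starting objects (alignment blocks) into a more complex object (an ordered chain of exons). A real system would apply many such rules over several relationship types.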
