Full-length messenger RNA sequences greatly improve genome annotation
Open Access
- 30 May 2002
- journal article
- Published by Springer Nature in Genome Biology
Abstract
Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism. Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation. Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.
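To make the comparison between mapped transcripts and existing gene models concrete, the following is a minimal sketch of how a transcript alignment might be sorted into the three outcomes the abstract mentions (consistent with the annotation, indicating a needed modification, or representing a new gene). It is not the authors' pipeline. The function names, the intron-chain criterion, and the coordinate convention (1-based, inclusive, one chromosome, strand ignored) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: an exon is a (start, end) pair,
# 1-based and inclusive, all on the same chromosome. Strand is ignored.
Exon = Tuple[int, int]


def overlaps(a: List[Exon], b: List[Exon]) -> bool:
    """Return True if any exon in a shares at least one base with an exon in b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def introns(exons: List[Exon]) -> List[Exon]:
    """Derive intron intervals from the gaps between consecutive sorted exons."""
    ex = sorted(exons)
    return [(ex[i][1] + 1, ex[i + 1][0] - 1) for i in range(len(ex) - 1)]


def classify(transcript: List[Exon], annotation: List[List[Exon]]) -> str:
    """Classify a mapped transcript against annotated gene models.

    Assumed criterion (a simplification): a transcript confirms a model when
    their intron chains are identical, so differences confined to the outer
    ends of the terminal exons (for example, untranslated regions missing
    from the annotation) are tolerated. Single-exon transcripts therefore
    confirm any single-exon model they overlap.
    """
    hits = [model for model in annotation if overlaps(transcript, model)]
    if not hits:
        return "novel"
    if any(introns(model) == introns(transcript) for model in hits):
        return "confirmed"
    return "needs_modification"


if __name__ == "__main__":
    # Made-up coordinates, purely for demonstration.
    annotation = [[(100, 200), (300, 400)], [(1000, 1200)]]
    print(classify([(90, 200), (300, 450)], annotation))   # same introns, longer ends
    print(classify([(100, 200), (320, 400)], annotation))  # different splice site
    print(classify([(5000, 5100)], annotation))            # no overlapping model
```

Comparing intron chains rather than exact exon boundaries is one possible design choice: it keeps transcripts that merely extend a model into untranslated regions from being counted as structural disagreements. A real comparison would also need to account for strand, chromosome, and alternative splice forms.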