Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes
Open Access
- 1 June 2002
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 30 (11), 2515-2523
- https://doi.org/10.1093/nar/30.11.2515
Abstract
Based on searches for disabled homologs to known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes: the worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between that of genes and that of translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases it is most evident for the amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes holds even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into ‘ancient’ and ‘modern’ subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a sequence composition much closer to that of genes than ancient pseudogenes do. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards that of non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as in assessing the rate at which intergenic DNA accumulates mutations. Our compositional analyses, together with an interactive viewer, are available over the web at http://genecensus.org/pseudogene.
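The notion of an intermediate composition can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `composition` and `intermediacy`, the 0-to-1 scale and the toy sequences are all assumptions made for illustration. For each amino acid, the sketch places the pseudogene frequency on the axis running from the gene frequency (0) to the translated intergenic frequency (1).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seqs):
    """Return the fractional frequency of each standard amino acid."""
    counts = Counter()
    for seq in seqs:
        # Ignore stop symbols, gaps and ambiguous residues.
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def intermediacy(gene, pseudo, intergenic):
    """Position of each pseudogene frequency between gene and intergenic.

    Hypothetical measure: 0 means gene-like, 1 means intergenic-like.
    Residues with identical gene and intergenic frequencies are skipped,
    because the axis is undefined for them.
    """
    result = {}
    for aa in AMINO_ACIDS:
        span = intergenic[aa] - gene[aa]
        if span == 0:
            continue
        result[aa] = (pseudo[aa] - gene[aa]) / span
    return result


# Toy sequences (invented, not real data).
gene_comp = composition(["AAAL"])
pseudo_comp = composition(["AALL"])
intergenic_comp = composition(["ALLL"])
scores = intermediacy(gene_comp, pseudo_comp, intergenic_comp)
```

Values strictly between 0 and 1 correspond to the intermediate case described in the abstract. Under the drift hypothesis, an ancient pseudogene set would be expected to score closer to 1 than a modern set.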