Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes

Open Access

1 June 2002

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 30 (11), 2515-2523
https://doi.org/10.1093/nar/30.11.2515

Abstract

Based on searches for disabled homologs to known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes—the worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between genes and translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases, it is most evident for amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes applies even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into ‘ancient’ and ‘modern’ subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a much closer sequence composition to genes than ancient pseudogenes. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as to assess the rate at which intergenic DNA accumulates mutations. Our compositional analyses with the interactive viewer are available over the web at http://genecensus.org/pseudogene.
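To make the notion of an "intermediate" composition concrete, the following is a minimal sketch, not the authors' pipeline. It assumes the three sequence sets are already available as amino acid strings, and it ignores any character outside the 20 standard residues, such as the stop symbols that appear when intergenic DNA is translated. The function names, the `min_gap` cutoff and the toy sequences are all hypothetical and chosen only for illustration. For each residue type, the score places the pseudogene frequency on a scale from 0 (identical to genes) to 1 (identical to translated intergenic DNA).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequences):
    """Return the fractional frequency of each standard amino acid."""
    counts = Counter()
    for seq in sequences:
        # Skip stop symbols, gaps and ambiguity codes.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def intermediacy(gene, pseudo, inter, min_gap=0.01):
    """Position of the pseudogene frequency between genes (0) and intergenic (1).

    Residues whose gene and intergenic frequencies differ by less than
    min_gap (an assumed cutoff) are skipped, because the ratio is unstable
    when the two reference frequencies nearly coincide.
    """
    scores = {}
    for aa in AMINO_ACIDS:
        gap = inter[aa] - gene[aa]
        if abs(gap) >= min_gap:
            scores[aa] = (pseudo[aa] - gene[aa]) / gap
    return scores


# Toy sequences, invented for illustration only (not real data).
genes = ["AAAAKKKKLL"]
pseudogenes = ["AAAKKKLLLL"]
intergenic = ["AAKKLLLLLL"]

scores = intermediacy(
    composition(genes), composition(pseudogenes), composition(intergenic)
)
for aa, score in sorted(scores.items()):
    print(f"{aa}: {score:.2f}")
```

A score between 0 and 1 corresponds to the intermediacy described in the abstract. Restricting the calculation to residues with a large gene-to-intergenic gap mirrors the observation that the effect is most evident for amino acid types that differ most between the two reference sets.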
