A structural census of the current population of protein sequences

Open Access

28 October 1997

journal article
research article
Published in Proceedings of the National Academy of Sciences

Vol. 94 (22), 11911-11916
https://doi.org/10.1073/pnas.94.22.11911

Abstract

We examine the occurrence of the ≈300 known protein folds in different groups of organisms. To do this, we characterize a large fraction of the currently known protein sequences (≈140,000) in structural terms, by matching them to known structures via sequence comparison (or by secondary-structure class prediction for those without structural homologues). Overall, we find that an appreciable fraction of the known folds are present in each of the major groups of organisms (e.g., bacteria and eukaryotes share 156 of 275 folds), and most of the common folds are associated with many families of nonhomologous sequences (i.e., >10 sequence families for each common fold). However, different groups of organisms have characteristically distinct distributions of folds. So, for instance, some of the most common folds in vertebrates, such as globins or zinc fingers, are rare or absent in bacteria. Many of these differences in fold usage are biologically reasonable, such as the folds of metabolic enzymes being common in bacteria and those associated with extracellular transport and communication being common in animals. They also have important implications for database-based methods for fold recognition, suggesting that an unknown sequence from a plant is more likely to have a certain fold (e.g., a TIM barrel) than an unknown sequence from an animal.
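The census described above comes down to tallying fold assignments per organism group and comparing the resulting sets and distributions. The following is a minimal sketch of that bookkeeping, not the authors' pipeline: the toy `assignments` list, the function names, and the choice of the union of both groups' folds as the denominator are all assumptions made for illustration. The abstract does not say how its total of 275 folds was defined.

```python
from collections import Counter

# Hypothetical toy data: one (organism group, fold) pair per sequence.
# In the study these assignments come from matching sequences to known
# structures; here they are invented purely to exercise the code.
assignments = [
    ("bacteria", "TIM barrel"),
    ("bacteria", "TIM barrel"),
    ("bacteria", "Rossmann fold"),
    ("eukaryotes", "TIM barrel"),
    ("eukaryotes", "globin"),
    ("eukaryotes", "zinc finger"),
    ("eukaryotes", "zinc finger"),
]


def fold_census(assignments):
    """Count how often each fold occurs in each organism group."""
    census = {}
    for group, fold in assignments:
        census.setdefault(group, Counter())[fold] += 1
    return census


def shared_folds(census, group_a, group_b):
    """Return the set of folds observed in both groups."""
    return set(census[group_a]) & set(census[group_b])


def fold_frequencies(census, group):
    """Return each fold's share of the group's assigned sequences."""
    counts = census[group]
    total = sum(counts.values())
    return {fold: n / total for fold, n in counts.items()}


census = fold_census(assignments)
shared = shared_folds(census, "bacteria", "eukaryotes")
# Assumed denominator: every fold seen in either group.
all_folds = set(census["bacteria"]) | set(census["eukaryotes"])

print(f"shared {len(shared)} of {len(all_folds)} folds: {sorted(shared)}")
print(fold_frequencies(census, "eukaryotes"))
```

Comparing the output of `fold_frequencies` across groups is one simple way to see the kind of difference the abstract reports, such as a fold that is common in one group and rare or absent in another.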
