Growth of novel protein structural data
- 27 February 2007
- journal article
- Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences
- Vol. 104 (9), 3183-3188
- https://doi.org/10.1073/pnas.0611678104
Abstract
Contrary to popular assumption, the rate of growth of structural data has slowed, and the Protein Data Bank (PDB) has not been growing exponentially since 1995. Reaching such a dramatic conclusion requires careful measurement of growth of novel structures, which can be achieved by clustering entry sequences, or by using a novel index to down-weight entries with a higher number of sequence neighbors. These measures agree, and growth rates are very similar for entire PDB files, clusters, and weighted chains. The overall sizes of Structural Classification of Proteins (SCOP) categories (number of families, superfamilies, and folds) appear to be directly proportional to the number of deposited PDB files. Using our weighted chain count, which is most correlated to the change in the size of each SCOP category in any time period, shows that the rate of increase of SCOP categories is actually slowing down. This enables the final size of each of these SCOP categories to be predicted without examining or comparing protein structures. In the last 3 years, structures solved by structural genomics (SG) initiatives, especially the United States National Institutes of Health Protein Structure Initiative, have begun to redress the slowing growth of the PDB. Structures solved by SG are 3.8 times less sequence-redundant than typical PDB structures. Since mid-2004, SG programs have contributed half the novel structures measured by weighted chain counts. Our analysis does not rely on visual inspection of coordinate sets: it is done automatically, providing an accurate, up-to-date measure of the growth of novel protein structural data.Keywords
This publication has 42 references indexed in Scilit:
- Comprehensive Evaluation of Protein Structure Alignment Methods: Scoring by Geometric MeasuresJournal of Molecular Biology, 2004
- Toward Consistent Assignment of Structural Domains in ProteinsJournal of Molecular Biology, 2004
- Estimating the number of protein folds and families from complete genome data 1 1Edited by J. ThorntonJournal of Molecular Biology, 2000
- The Protein Data BankNucleic Acids Research, 2000
- Estimating the number of protein foldsJournal of Molecular Biology, 1998
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programsNucleic Acids Research, 1997
- CATH – a hierarchic classification of protein domain structuresStructure, 1997
- SCOP: A structural classification of proteins database for the investigation of sequences and structuresJournal of Molecular Biology, 1995
- One thousand families for the molecular biologistNature, 1992
- A possible three-dimensional structure of bovine α-lactalbumin based on that of hen's egg-white lysozymeJournal of Molecular Biology, 1969