The protein structure prediction problem could be solved using the current PDB library

14 January 2005

journal article
Published in Proceedings of the National Academy of Sciences

Vol. 102 (4), 1029-1034
https://doi.org/10.1073/pnas.0407152101

Abstract

For single-domain proteins, we examine the completeness of the structures in the current Protein Data Bank (PDB) library for use in full-length model construction of unknown sequences. To address this issue, we employ a comprehensive benchmark set of 1,489 medium-size proteins that cover the PDB at the level of 35% sequence identity and identify templates by structure alignment. With homologous proteins excluded, we can always find folds similar to the native structure, with an average root-mean-square deviation (RMSD) from native of 2.5 Å and approximately 82% alignment coverage. These template structures often contain a significant number of insertions/deletions. The TASSER algorithm was applied to build full-length models, where continuous fragments are excised from the top-scoring templates and reassembled under the guidance of an optimized force field, which includes consensus restraints taken from the templates and knowledge-based statistical potentials. For almost all targets (all but 2 of 1,489), the resultant full-length models have an RMSD to native below 6 Å (97% of them below 4 Å). On average, the RMSD of full-length models is 2.25 Å, with aligned regions improved from 2.5 Å to 1.88 Å, comparable with the accuracy of low-resolution experimental structures. Furthermore, starting from state-of-the-art structural alignments, we demonstrate a methodology that can consistently bring template-based alignments closer to native. These results strongly suggest that the protein-folding problem can in principle be solved based on the current PDB library by developing efficient fold recognition algorithms that can recover such initial alignments.
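The two quality measures quoted in the abstract, RMSD from native and alignment coverage, can be illustrated with a minimal sketch. This is not the authors' code: the function names `rmsd` and `alignment_coverage` and the toy coordinates are hypothetical, and the sketch assumes the two coordinate sets have already been optimally superimposed and paired residue by residue, so the superposition step itself is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumption: the structures are already superimposed and the points
    are paired in alignment order (hypothetical helper, not from the paper).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


def alignment_coverage(n_aligned, target_length):
    """Fraction of target residues that are covered by the alignment."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return n_aligned / target_length


# Toy data: a 4-residue "native" chain and a template aligned to its
# first 3 residues, each aligned atom displaced by 1.0 along y.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
template = [(0.0, 1.0, 0.0), (3.8, -1.0, 0.0), (7.6, 1.0, 0.0)]

aligned_native = native[: len(template)]
aligned_rmsd = rmsd(aligned_native, template)
coverage = alignment_coverage(len(template), len(native))

print(f"RMSD over aligned residues: {aligned_rmsd:.2f}")
print(f"Alignment coverage: {coverage:.0%}")
```

In this reading, the abstract's "2.5 Å with approximately 82% alignment coverage" corresponds to an RMSD computed only over the aligned residues, with the coverage reporting how much of the target those residues span.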

