Thoroughly sampling sequence space: Large‐scale protein design of structural ensembles
- 1 December 2002
- journal article
- research article
- Published by Wiley in Protein Science
- Vol. 11 (12), 2804-2813
- https://doi.org/10.1110/ps.0203902
Abstract
Modeling the inherent flexibility of the protein backbone as part of computational protein design is necessary to capture the behavior of real proteins and is a prerequisite for the accurate exploration of protein sequence space. We present the results of a broad exploration of sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. A distributed computing architecture has allowed us to generate hundreds of thousands of diverse sequences for a set of 253 naturally occurring proteins, allowing exciting insights into the nature of protein sequence space. Designing to a structural ensemble produces a much greater diversity of sequences than previous studies have reported, and homology searches using profiles derived from the designed sequences against the Protein Data Bank show that the relevance and quality of the sequences are not diminished. The designed sequences have greater overall diversity than corresponding natural sequence alignments, and no direct correlations are seen between the diversity of natural sequence alignments and the diversity of the corresponding designed sequences. For structures in the same fold, the sequence entropies of the designed sequences cluster together tightly. This tight clustering of sequence entropies within a fold and the separation of sequence entropy distributions for different folds suggest that the diversity of designed sequences is primarily determined by a structure's overall fold, and that the designability principle postulated from studies of simple models holds in real proteins. This has important implications for experimental protein design and engineering, as well as providing insight into protein evolution.
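The abstract compares the "sequence entropy" of designed sequences across folds but does not define it. As a rough illustration only, the sketch below assumes the common per-position Shannon entropy of an ungapped, equal-length alignment, averaged over positions. The function names `column_entropy` and `mean_sequence_entropy` are hypothetical, and this is not necessarily the measure the authors used.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_sequence_entropy(sequences):
    """Average per-position entropy over a set of equal-length sequences.

    Assumption: sequences are already aligned with no gaps, as would be
    the case for sequences designed onto the same backbone.
    """
    if not sequences or not sequences[0]:
        raise ValueError("need at least one non-empty sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must all have the same length")
    # zip(*sequences) transposes the alignment into per-position columns.
    return sum(column_entropy(col) for col in zip(*sequences)) / length


if __name__ == "__main__":
    designed = ["ACDE", "ACDF", "ACDE", "ACDF"]  # toy data, not from the paper
    print(mean_sequence_entropy(designed))
```

Under this definition, a set of identical sequences scores zero and more varied positions raise the average, so comparing the value across sequence sets designed for different backbones gives one simple way to look for the fold-level clustering the abstract describes.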