Predicting protein secondary structure with probabilistic schemata of evolutionarily derived information
- 1 September 1997
- journal article
- Published by Wiley in Protein Science
- Vol. 6 (9), 1963-1975
- https://doi.org/10.1002/pro.5560060917
Abstract
We demonstrate the applicability of our previously developed Bayesian probabilistic approach for predicting residue solvent accessibility to the problem of predicting secondary structure. Using only single‐sequence data, this method achieves a three‐state accuracy of 67% over a database of 473 non‐homologous proteins. This approach is more amenable to inspection and less likely to overlearn specifics of a dataset than “black box” methods such as neural networks. It is also conceptually simpler and less computationally costly. We also introduce a novel method for representing and incorporating multiple‐sequence alignment information within the prediction algorithm, achieving 72% accuracy over a dataset of 304 non‐homologous proteins. This is accomplished by creating a statistical model of the evolutionarily derived correlations between patterns of amino acid substitution and local protein structure. This model consists of parameter vectors, termed “substitution schemata,” which probabilistically encode the structure‐based heterogeneity in the distributions of amino acid substitutions found in alignments of homologous proteins. The model is optimized for structure prediction by maximizing the mutual information between the set of schemata and the database of secondary structures. Unlike “expert heuristic” methods, this approach has been demonstrated to work well over large datasets. Unlike the opaque neural network algorithms, this approach is physicochemically intelligible. Moreover, the model optimization procedure, the formalism for predicting one‐dimensional structural features, and our previously developed method for tertiary structure recognition all share a common Bayesian probabilistic basis. This consistency starkly contrasts with the hybrid and ad hoc nature of methods that have dominated this field in recent years.
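The abstract relies on two quantities: a three‐state accuracy and the mutual information between schemata and secondary structures. They can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `q3_accuracy` and `mutual_information`, the H/E/C state labels, and the representation of each residue position by a single discrete schema index are assumptions made for illustration only. The sketch uses plain empirical frequencies rather than the paper's Bayesian formalism.

```python
import math
from collections import Counter


def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state.

    Both arguments are equal-length sequences of state labels,
    e.g. strings over an assumed alphabet H (helix), E (strand), C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(observed) == 0:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def mutual_information(labels, states):
    """Empirical mutual information (in bits) between two paired sequences.

    `labels` stands in for a hypothetical per-position schema assignment,
    `states` for the observed secondary-structure state at that position.
    """
    if len(labels) != len(states):
        raise ValueError("labels and states must have the same length")
    n = len(labels)
    if n == 0:
        raise ValueError("cannot estimate mutual information from no data")
    joint = Counter(zip(labels, states))
    label_counts = Counter(labels)
    state_counts = Counter(states)
    mi = 0.0
    for (label, state), count in joint.items():
        # p(l, s) * log2( p(l, s) / (p(l) * p(s)) ), written with raw counts
        ratio = count * n / (label_counts[label] * state_counts[state])
        mi += (count / n) * math.log2(ratio)
    return mi
```

Under these assumptions, a labelling that perfectly separates two equally frequent states carries 1 bit of information about the structure, while a labelling that is independent of the structure carries 0 bits. A model tuned to maximize this quantity is therefore pushed toward schemata that discriminate between structural states.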