Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.

1 March 1990

journal article
research article
Published in Proceedings of the National Academy of Sciences

Vol. 87 (6), 2264-2268
https://doi.org/10.1073/pnas.87.6.2264

Abstract

An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
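As a rough illustration of the quantities this kind of theory works with, the sketch below computes two things for a single sequence scored residue by residue: the highest aggregate score over all contiguous segments, and the scale parameter usually written λ, taken here as the positive solution of Σ p_i·exp(λ·s_i) = 1 under a random model with independent residues. The abstract itself gives no formulas, so the function names, the two-letter example, and the bisection solver are assumptions made for illustration only. The sketch also assumes the expected score per residue is negative and at least one score is positive. A full significance estimate additionally needs a second constant (commonly written K), which is not computed here.

```python
import math


def max_segment_score(scores):
    """Highest aggregate score over all contiguous segments (0 if none is positive)."""
    best = running = 0.0
    for s in scores:
        # Restart the segment whenever the running total drops below zero.
        running = max(0.0, running + s)
        best = max(best, running)
    return best


def solve_lambda(probs, scores, tol=1e-12):
    """Positive root of sum_i p_i * exp(lambda * s_i) = 1, found by bisection.

    Illustrative helper: assumes independent residues with probabilities
    `probs`, a negative expected score, and at least one positive score.
    """
    expected = sum(p * s for p, s in zip(probs, scores))
    if expected >= 0 or max(scores) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in zip(probs, scores)) - 1.0

    # f is convex with f(0) = 0 and f'(0) < 0, so it is negative just
    # above zero and crosses zero exactly once for lambda > 0.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical two-letter scoring scheme: +1 with probability 0.25, -1 otherwise.
    print(max_segment_score([1, -1, 1, 1, -1, 1]))
    print(solve_lambda([0.25, 0.75], [1, -1]))
```

For the hypothetical two-letter scheme, the defining equation reduces to a quadratic in exp(λ) whose nontrivial root is 3, so the solver should return a value close to ln 3.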


Cited by 1129 articles