Calibrating E-values for hidden Markov models using reverse-sequence null models

Open Access

25 August 2005

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 21 (22), 4107-4115
https://doi.org/10.1093/bioinformatics/bti629

Abstract

Motivation: Hidden Markov models (HMMs) calculate the probability that a sequence was generated by a given model. Log-odds scoring provides a context for evaluating this probability, by considering it in relation to a null hypothesis. We have found that using a reverse-sequence null model effectively removes biases owing to sequence length and composition and reduces the number of false positives in a database search. Any scoring system is an arbitrary measure of the quality of database matches. Significance estimates of scores are essential, because they eliminate model- and method-dependent scaling factors, and because they quantify the importance of each match. Accurate computation of the significance of reverse-sequence null model scores presents a problem, because the scores do not fit the extreme-value (Gumbel) distribution commonly used to estimate HMM scores' significance.

Results: To get a better estimate of the significance of reverse-sequence null model scores, we derive a theoretical distribution based on the assumption of a Gumbel distribution for raw HMM scores and compare estimates based on this and other distribution families. We derive estimation methods for the parameters of the distributions based on maximum likelihood and on moment matching (least-squares fit for Student's t-distribution). We evaluate the modeled distributions of scores, based on how well they fit the tail of the observed distribution for data not used in the fitting and on the effects of the improved E-values on our HMM-based fold-recognition methods. The theoretical distribution provides some improvement in fitting the tail and in providing fewer false positives in the fold-recognition test. An ad hoc distribution based on assuming a stretched exponential tail does an even better job. The use of Student's t to model the distribution fits well in the middle of the distribution, but provides too heavy a tail. The moment-matching methods fit the tails better than maximum-likelihood methods.

Availability: Information on obtaining the SAM program suite (free for academic use), as well as a server interface, is available at and the open-source random sequence generator with varying compositional biases is available at

Contact: karplus@soe.ucsc.edu
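As a rough illustration of the ideas in the Results paragraph, the Python sketch below assumes that the raw scores of a sequence and of its reversal are independent Gumbel variables with a shared scale. Under that assumption their difference follows a zero-centered logistic distribution, whose scale can be estimated by moment matching and then used to turn a score into an E-value. This is a simplifying assumption made here for illustration and may differ from the theoretical distribution derived in the paper. The function names (`gumbel_sample`, `fit_logistic_scale_by_moments`, `e_value`) and the simulated scores are hypothetical and are not part of SAM.

```python
import math
import random


def gumbel_sample(rng, mu=0.0, beta=1.0):
    """Draw one Gumbel(mu, beta) value by inverting the CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu - beta * math.log(-math.log(u))


def fit_logistic_scale_by_moments(scores):
    """Moment-matching estimate of the scale of a zero-centered logistic.

    A logistic distribution with scale s has variance (pi^2 / 3) * s^2,
    so s = sqrt(3 * E[x^2]) / pi when the mean is assumed to be zero.
    """
    second_moment = sum(x * x for x in scores) / len(scores)
    return math.sqrt(3.0 * second_moment) / math.pi


def e_value(score, scale, database_size):
    """Expected number of sequences scoring above `score` by chance.

    Uses the logistic tail P(S > score) = 1 / (1 + exp(score / scale)),
    written in a form that does not overflow for large scores.
    """
    z = score / scale
    if z > 0:
        tail = math.exp(-z) / (1.0 + math.exp(-z))
    else:
        tail = 1.0 / (1.0 + math.exp(z))
    return database_size * tail


# Simulated reverse-sequence null scores: the difference between the raw
# score of a sequence and the raw score of its reversal (assumed i.i.d. Gumbel).
rng = random.Random(0)
null_scores = [gumbel_sample(rng) - gumbel_sample(rng) for _ in range(20000)]

scale = fit_logistic_scale_by_moments(null_scores)
print("fitted scale:", scale)
print("E-value for score 10 in a database of 10000:", e_value(10.0, scale, 10000))
```

In this sketch larger scores mean better matches. A real calibration would fit the parameters on scores from known non-homologs and check the fit in the tail on held-out data, as the abstract describes.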

Cited by 41 articles