Abstract
MOTIVATION: Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea. RESULTS: One good method scales sequence weights by an exponential that ranges between 0.1 for the best-scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman (PSW), training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction). AVAILABILITY: A WWW server, as well as information on obtaining the Sequence Alignment and Modeling (SAM) software suite and additional data from this work, can be found at http://www.cse.ucsc.edu/research/compbio/sam.html
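To make the weighting scheme concrete, the sketch below shows one plausible reading of it: weights interpolated exponentially between 0.1 for the best-scoring training sequence and 1.0 for the worst, so that sequences already well explained by the model are down-weighted. The function name and the exact interpolation formula are assumptions for illustration, not the paper's actual implementation.

```python
def discrimination_weights(scores, lo=0.1, hi=1.0):
    """Hypothetical sketch of exponential sequence weighting.

    `scores` are model scores for the training sequences (higher = better
    fit). The returned weight is `lo` (0.1) for the best-scoring sequence
    and `hi` (1.0) for the worst, varying exponentially in between so that
    over-represented, well-scoring sequences contribute less to training.
    """
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        # All sequences score equally: weight them uniformly.
        return [hi] * len(scores)
    weights = []
    for s in scores:
        # Normalize score to t in [0, 1]: 0 for the worst, 1 for the best.
        t = (s - s_min) / (s_max - s_min)
        # Exponential interpolation: hi at t=0, lo at t=1.
        weights.append(hi * (lo / hi) ** t)
    return weights
```

For example, with scores `[0.0, 1.0, 2.0]` the worst sequence gets weight 1.0, the best gets 0.1, and the middle one falls geometrically between the two.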