Identification of sequence motifs from a set of proteins with related function

Abstract
The automatic identification of motifs associated with a given function is an important challenge for molecular sequence analysis. A method is presented for the extraction of such patterns from large sets of unaligned sequences with related but general function, for example, a set of heat shock proteins. Such a set of proteins often contains several subfamilies, each characterized by one or more distinct motifs. The aim is to develop computational tools to identify these motifs. The algorithm presented locates high-frequency words of length k in which a given number of positions, r, are fixed. Binomial statistics are used to assess the significance of the words. The high-frequency words are clustered and the highly populated clusters are retained. The composition of the clusters is displayed graphically. A set of motifs associated with the sequence family can then be extracted automatically. The method is benchmarked on a set of 106 heat shock sequences and a set of 257 toxin sequences, and is shown to recover previously identified motifs.
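The word-counting and significance steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the clustering and graphical display steps are not covered. The function names, the use of "." to mark unfixed positions, the choice to count each word at most once per sequence, and the background model (independent residue frequencies, with a per-sequence probability approximated from the mean sequence length) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gapped_words(seq, k, r):
    """Return the set of length-k words with r fixed positions found in seq.

    Unfixed positions are written as '.' (an illustrative convention).
    """
    masks = [set(m) for m in combinations(range(k), r)]
    found = set()
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        for fixed in masks:
            found.add("".join(c if i in fixed else "."
                              for i, c in enumerate(window)))
    return found


def sequence_support(seqs, k, r):
    """Count, for each word, the number of sequences that contain it."""
    counts = Counter()
    for seq in seqs:
        # Assumption: a word is counted at most once per sequence.
        counts.update(gapped_words(seq, k, r))
    return counts


def pattern_probability(pattern, freqs, mean_len):
    """Approximate chance that a random sequence contains the pattern.

    Assumes independent residues with background frequencies `freqs`.
    """
    q = 1.0
    for c in pattern:
        if c != ".":
            q *= freqs[c]
    windows = max(mean_len - len(pattern) + 1, 1)
    return 1.0 - (1.0 - q) ** windows


def binomial_tail(n, total, p):
    """P(X >= n) for X ~ Binomial(total, p)."""
    return sum(comb(total, i) * p**i * (1 - p) ** (total - i)
               for i in range(n, total + 1))
```

For a word seen in n of N sequences, `binomial_tail(n, N, pattern_probability(word, freqs, mean_len))` gives a tail probability under this assumed background model. Smaller values suggest the word is over-represented in the set, and only words passing a chosen threshold would go on to clustering.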