Finding Motifs Using Random Projections

1 April 2002

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 9 (2), 225-242
https://doi.org/10.1089/10665270252935430

Abstract

The DNA motif discovery problem abstracts the task of discovering short, conserved sites in genomic DNA. Pevzner and Sze recently described a precise combinatorial formulation of motif discovery that motivates the following algorithmic challenge: find twenty planted occurrences of a motif of length fifteen in roughly twelve kilobases of genomic sequence, where each occurrence of the motif differs from its consensus in four randomly chosen positions. Such "subtle" motifs, though statistically highly significant, expose a weakness in existing motif-finding algorithms, which typically fail to discover them. Pevzner and Sze introduced new algorithms to solve their (15,4)-motif challenge, but these methods do not scale efficiently to more difficult problems in the same family, such as the (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif-discovery algorithm, PROJECTION, designed to enhance the performance of existing motif finders using random projections of the input's substrings. Experiments on synthetic data demonstrate that PROJECTION remedies the weakness observed in existing algorithms, typically solving the difficult (14,4)-, (16,5)-, and (18,6)-motif problems. Our algorithm is robust to nonuniform background sequence distributions and scales to larger amounts of sequence than that specified in the original challenge. A probabilistic estimate suggests that related motif-finding problems that PROJECTION fails to solve are in all likelihood inherently intractable. We also test the performance of our algorithm on realistic biological examples, including transcription factor binding sites in eukaryotes and ribosome binding sites in prokaryotes.
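The challenge instance and the projection idea described above can be illustrated with a minimal sketch. The abstract does not spell out the algorithm's details, so several things here are assumptions: splitting the roughly twelve kilobases into 20 sequences of 600 bases with one occurrence each, a uniform background, the projection size k = 7, and the hypothetical function names `plant_instance` and `project_and_bucket`. The sketch only shows the hashing step, in which substrings that agree on k randomly chosen positions land in the same bucket. It is not the full PROJECTION algorithm, which the abstract describes as a way to enhance existing motif finders.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def plant_instance(t=20, n=600, l=15, d=4, rng=None):
    """Build a synthetic (l, d) instance: t sequences of length n, each holding
    one copy of a random consensus mutated in exactly d positions.
    Assumption: uniform background and one occurrence per sequence."""
    rng = rng or random.Random(0)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    seqs, sites = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        occurrence = list(consensus)
        for pos in rng.sample(range(l), d):
            # Substitute a base that differs from the consensus at this position.
            occurrence[pos] = rng.choice([c for c in ALPHABET if c != consensus[pos]])
        start = rng.randrange(n - l + 1)
        background[start:start + l] = occurrence
        seqs.append("".join(background))
        sites.append(start)
    return consensus, seqs, sites


def project_and_bucket(seqs, l, k, rng):
    """Hash every length-l substring by its letters at k random positions.
    Returns the chosen positions and a map: projected key -> [(seq index, offset)]."""
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for i, s in enumerate(seqs):
        for j in range(len(s) - l + 1):
            key = "".join(s[j + p] for p in positions)
            buckets[key].append((i, j))
    return positions, buckets


if __name__ == "__main__":
    rng = random.Random(1)
    consensus, seqs, sites = plant_instance(rng=rng)
    positions, buckets = project_and_bucket(seqs, l=15, k=7, rng=rng)
    key, members = max(buckets.items(), key=lambda kv: len(kv[1]))
    planted = sum(1 for i, j in members if sites[i] == j)
    print(f"largest bucket {key}: {len(members)} substrings, {planted} planted")
```

A planted occurrence falls into the consensus's bucket only when none of its d mutated positions is among the k sampled ones, so a single projection usually captures just a few occurrences. Repeating the step with fresh random positions and passing the well-populated buckets to a refinement procedure is one plausible way to use it.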
