Finding Motifs Using Random Projections
- 1 April 2002
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 9 (2), 225-242
- https://doi.org/10.1089/10665270252935430
Abstract
The DNA motif discovery problem abstracts the task of discovering short, conserved sites in genomic DNA. Pevzner and Sze recently described a precise combinatorial formulation of motif discovery that motivates the following algorithmic challenge: find twenty planted occurrences of a motif of length fifteen in roughly twelve kilobases of genomic sequence, where each occurrence of the motif differs from its consensus in four randomly chosen positions. Such "subtle" motifs, though statistically highly significant, expose a weakness in existing motif-finding algorithms, which typically fail to discover them. Pevzner and Sze introduced new algorithms to solve their (15,4)-motif challenge, but these methods do not scale efficiently to more difficult problems in the same family, such as the (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif-discovery algorithm, PROJECTION, designed to enhance the performance of existing motif finders using random projections of the input's substrings. Experiments on synthetic data demonstrate that PROJECTION remedies the weakness observed in existing algorithms, typically solving the difficult (14,4)-, (16,5)-, and (18,6)-motif problems. Our algorithm is robust to nonuniform background sequence distributions and scales to larger amounts of sequence than that specified in the original challenge. A probabilistic estimate suggests that related motif-finding problems that PROJECTION fails to solve are in all likelihood inherently intractable. We also test the performance of our algorithm on realistic biological examples, including transcription factor binding sites in eukaryotes and ribosome binding sites in prokaryotes.Keywords
This publication has 26 references indexed in Scilit:
- Algorithms for Phylogenetic FootprintingJournal of Computational Biology, 2002
- Efficient large-scale sequence comparison by locality-sensitive hashingBioinformatics, 2001
- Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies 1 1Edited by G. von HeijneJournal of Molecular Biology, 1998
- A novel Mcm1-dependent element in the SWI4, CLN3, CDC6, and CDC47 promoters activates M/G1-specific transcription.Genes & Development, 1997
- Global self-organization of all known protein sequences reveals inherent biological signatures 1 1Edited by F. E. CohenJournal of Molecular Biology, 1997
- Geometric hashing: an overviewIEEE Computational Science and Engineering, 1997
- Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple AlignmentScience, 1993
- Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragmentsJournal of Molecular Biology, 1992
- Rigorous pattern-recognition methods for DNA sequencesJournal of Molecular Biology, 1985
- Extensions of Lipschitz mappings into a Hilbert spacePublished by American Mathematical Society (AMS) ,1984