Limitations and potentials of current motif discovery algorithms

Open Access

1 January 2005

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 33 (15), 4899-4913
https://doi.org/10.1093/nar/gki791

Abstract

Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. We characterize the factors that affect prediction accuracy, scalability and reliability. Accuracy at the nucleotide level and the binding site level is very low, while accuracy at the motif level is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, we developed a consensus ensemble algorithm, which achieved a 6–45% improvement over the base algorithms by increasing both sensitivity and specificity. Our study illustrates the limitations and potentials of existing sequence-based motif discovery algorithms, and we discuss several promising directions for further improvement that take advantage of the revealed potentials. Since sequence-based algorithms are the baseline of most modern motif discovery algorithms, this paper suggests that substantial improvements are possible for them.
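
The gap the abstract describes between nucleotide-level and motif-level accuracy can be illustrated with a minimal sketch. The definitions below are generic assumptions (sensitivity and positive predictive value over covered positions, and a sequence counted as a motif-level hit if any predicted site overlaps any true site); they are not necessarily the measures defined in the paper, and the function names and interval convention are hypothetical.

```python
def nucleotide_level_scores(true_sites, predicted_sites):
    """Score predictions position by position on one sequence.

    Sites are (start, end) half-open intervals; this convention is an
    assumption made for the sketch.
    """
    true_pos = {p for s, e in true_sites for p in range(s, e)}
    pred_pos = {p for s, e in predicted_sites for p in range(s, e)}
    tp = len(true_pos & pred_pos)   # positions correctly predicted
    fn = len(true_pos - pred_pos)   # true positions that were missed
    fp = len(pred_pos - true_pos)   # predicted positions that are not true
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


def motif_level_hit(true_sites, predicted_sites):
    """Return True if any predicted site overlaps any true site (assumed criterion)."""
    return any(ps < te and ts < pe
               for ts, te in true_sites
               for ps, pe in predicted_sites)


# One true site, one partially overlapping prediction, one spurious prediction.
true_sites = [(10, 20)]
predicted_sites = [(15, 25), (40, 50)]
print(nucleotide_level_scores(true_sites, predicted_sites))  # (0.5, 0.25)
print(motif_level_hit(true_sites, predicted_sites))          # True
```

In this toy case the sequence counts as a motif-level hit even though only half of the true positions are recovered and three quarters of the predicted positions are wrong, which is the kind of discrepancy the abstract reports.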

Cited by 176 articles