CMfinder—a covariance model based RNA motif finding algorithm

15 December 2005

journal article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 22 (4), 445-452
https://doi.org/10.1093/bioinformatics/btk008

Abstract

The recent discoveries of large numbers of non-coding RNAs and computational advances in genome-scale RNA search create a need for tools for automatic, high-quality identification and characterization of conserved RNA motifs that can be readily used for database search. Previous tools fall short of this goal. CMfinder is a new tool to predict RNA motifs in unaligned sequences. It is an expectation maximization algorithm using covariance models for motif description, featuring novel integration of multiple techniques for effective search of motif space, and a Bayesian framework that blends mutual information-based and folding energy-based approaches to predict structure in a principled way. Extensive tests show that our method works well on datasets with either low or high sequence similarity, is robust to inclusion of lengthy extraneous flanking sequence and/or completely unrelated sequences, and is reasonably fast and scalable. In testing on 19 known ncRNA families, including some difficult cases with poor sequence conservation and large indels, our method demonstrates excellent average per-base-pair accuracy: 79%, compared with at most 60% for alternative methods. More importantly, the resulting probabilistic model can be directly used for homology search, allowing iterative refinement of structural models based on additional homologs. We have used this approach to obtain highly accurate covariance models of known RNA motifs based on small numbers of related sequences, which identified homologs in deeply diverged species.
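
The mutual information signal mentioned in the abstract can be illustrated with a small sketch. This is not CMfinder's implementation; it only shows the standard mutual information between two alignment columns, which is high when the columns covary in the way paired bases do. The function name `column_mutual_information`, the toy alignment, and the choice to skip rows with a gap in either column are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length strings. Rows with a gap ('-') in
    either column are skipped; that gap policy is an assumption of this sketch.
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0

    joint = Counter(pairs)                  # counts of (base_i, base_j)
    left = Counter(a for a, _ in pairs)     # counts of base in column i
    right = Counter(b for _, b in pairs)    # counts of base in column j

    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 4 covary as Watson-Crick pairs,
# while columns 1 and 2 are perfectly conserved.
toy = ["GACUC", "CACUG", "AACUU", "UACUA"]
covarying = column_mutual_information(toy, 0, 4)
conserved = column_mutual_information(toy, 1, 2)
```

In this toy case the covarying pair of columns scores higher than the conserved pair, which carries no covariation signal at all. That blind spot for conserved columns is one reason a method might combine mutual information with folding energy, as the abstract describes.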

