ESPERR: Learning strong and weak signals in genomic sequence alignments to identify functional elements

19 October 2006

journal article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 16 (12), 1596-1604
https://doi.org/10.1101/gr.4537706

Abstract

Genomic sequence signals—such as base composition, presence of particular motifs, or evolutionary constraint—have been used effectively to identify functional elements. However, approaches based only on specific signals known to correlate with function can be quite limiting. When training data are available, application of computational learning algorithms to multispecies alignments has the potential to capture broader and more informative sequence and evolutionary patterns that better characterize a class of elements. However, effective exploitation of patterns in multispecies alignments is impeded by the vast number of possible alignment columns and by a limited understanding of which particular strings of columns may characterize a given class. We have developed a computational method, called ESPERR (evolutionary and sequence pattern extraction through reduced representations), which uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR produces a greatly improved Regulatory Potential score, which can discriminate regulatory regions from neutral sites with excellent accuracy (∼94%). This score captures strong signals (GC content and conservation), as well as subtler signals (with small contributions from many different alignment patterns) that characterize the regulatory elements in our training set. ESPERR is also effective for predicting other classes of functional elements, as we show for DNaseI hypersensitive sites and highly conserved regions with developmental enhancer activity. Our software, training data, and genome-wide predictions are available from our Web site (http://www.bx.psu.edu/projects/esperr).
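The core idea in the abstract, encoding alignment columns into a reduced alphabet and then scoring a region by how much better a "functional" model explains it than a "neutral" one, can be illustrated with a toy sketch. This is not the ESPERR implementation: the three-symbol mapping in `reduce_column` is fixed by hand (ESPERR learns its encoding from training examples), and the first-order Markov models, the pseudocount, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def reduce_column(col):
    """Hand-picked toy encoding of one alignment column (an assumption,
    not a learned ESPERR encoding):
    0 = gapped or mismatched, 1 = identical and not G/C, 2 = identical G/C."""
    if "-" in col or len(set(col)) != 1:
        return 0
    return 2 if col[0].upper() in "GC" else 1

def encode(alignment):
    """Turn equal-length aligned rows (one per species) into reduced symbols."""
    return [reduce_column(col) for col in zip(*alignment)]

def train_markov(sequences, alphabet_size, order=1, pseudocount=1.0):
    """Return a function giving smoothed log P(symbol | preceding context)."""
    counts, context_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(order, len(seq)):
            ctx = tuple(seq[i - order:i])
            counts[(ctx, seq[i])] += 1
            context_counts[ctx] += 1

    def log_prob(ctx, sym):
        num = counts[(ctx, sym)] + pseudocount
        den = context_counts[ctx] + pseudocount * alphabet_size
        return math.log(num / den)

    return log_prob

def log_odds_score(seq, pos_model, neg_model, order=1):
    """Average per-position log-odds of the positive vs. negative model."""
    total = 0.0
    for i in range(order, len(seq)):
        ctx = tuple(seq[i - order:i])
        total += pos_model(ctx, seq[i]) - neg_model(ctx, seq[i])
    return total / max(1, len(seq) - order)

# Made-up three-species training alignments, for illustration only.
positive = [encode(["GCGCGGCC", "GCGCGGCC", "GCGCGGCC"])]
negative = [encode(["ATGCATTA", "ACGTTTGA", "TTGCAATA"])]
pos_model = train_markov(positive, alphabet_size=3)
neg_model = train_markov(negative, alphabet_size=3)

query = encode(["GGCC", "GGCC", "GGCC"])
print(log_odds_score(query, pos_model, neg_model))
```

A positive score means the reduced sequence looks more like the positive training set than the negative one. With real data, the choice of encoding is the hard part, which is the problem the abstract says ESPERR addresses by learning it from training examples.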

Cited by 103 articles