Extracting sequence features to predict protein–DNA interactions: a comparative study

Open Access

13 June 2008

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 36 (12), 4137-4148
https://doi.org/10.1093/nar/gkn361

Abstract

Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present here a systematic study of predictive modeling approaches to the TF–DNA binding problem, which have frequently been shown to be more efficient than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine several state-of-the-art learning methods, including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to simulated datasets and to two whole-genome ChIP-chip datasets, on the TFs Oct4 and Sox2 respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve predictive power and identify more biologically interesting features, such as TF–TF interactions, than the PWM approach. In particular, BART and boosting show the best and most robust overall performance among all the methods.
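The regression-plus-variable-selection idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes k-mer counts as stand-in sequence features, a simulated binding intensity driven by two arbitrarily chosen 4-mers, and a plain greedy forward stepwise linear regression. The function names (`kmer_counts`, `forward_stepwise`), the candidate k-mers and the simulation settings are all hypothetical.

```python
import random

def kmer_counts(seq, kmers):
    """Count overlapping occurrences of each candidate k-mer in seq."""
    return [sum(seq[i:i + len(k)] == k for i in range(len(seq) - len(k) + 1))
            for k in kmers]

def center(v):
    """Subtract the mean, which absorbs the regression intercept."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward_stepwise(columns, y, n_steps):
    """Greedy forward selection for linear regression.

    At each step, add the feature that gives the largest drop in the
    residual sum of squares, then orthogonalize the residual and the
    remaining features against it.
    """
    resid = center(y)
    cols = {name: center(c) for name, c in columns.items()}
    selected = []
    for _ in range(n_steps):
        best, best_gain = None, 1e-12
        for name, z in cols.items():
            zz = dot(z, z)
            if zz < 1e-12:          # constant or already-explained feature
                continue
            gain = dot(resid, z) ** 2 / zz   # RSS reduction from adding z
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break
        z = cols.pop(best)
        zz = dot(z, z)
        b = dot(resid, z) / zz
        resid = [r - b * zi for r, zi in zip(resid, z)]
        for name in list(cols):
            g = dot(cols[name], z) / zz
            cols[name] = [ci - g * zi for ci, zi in zip(cols[name], z)]
        selected.append(best)
    return selected

# Simulated data (assumption): random sequences whose "binding intensity"
# depends on the counts of two arbitrary 4-mers plus Gaussian noise.
random.seed(0)
candidates = ["ACGT", "TTGA", "GGCC", "CATG", "AATT", "GCGC"]
seqs = ["".join(random.choice("ACGT") for _ in range(300)) for _ in range(200)]
X = [kmer_counts(s, candidates) for s in seqs]
y = [2.0 * row[0] + 1.0 * row[2] + random.gauss(0.0, 0.5) for row in X]

columns = {k: [row[j] for row in X] for j, k in enumerate(candidates)}
selected = forward_stepwise(columns, y, n_steps=2)
print("features selected first:", selected)
```

In this toy setting the two features that generate the response should enter the model first. The other learners named in the abstract (MARS, neural networks, SVMs, boosting, BART) would replace the linear fit while keeping the same features-to-intensity setup.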
