Predicting Human Nucleosome Occupancy from Primary Sequence

Abstract
Nucleosomes are the fundamental repeating unit of chromatin and constitute the structural building blocks of the living eukaryotic genome. Micrococcal nuclease (MNase) has long been used to delineate nucleosomal organization. Microarray-based nucleosome mapping experiments in yeast chromatin have revealed regularly spaced translational phasing of nucleosomes. These data have been used to train computational models of sequence-directed nucleosome positioning, which have identified ubiquitous strong intrinsic nucleosome positioning signals. Here, we successfully apply this approach to nucleosome positioning experiments from human chromatin. The predictions made by the human-trained and yeast-trained models are strongly correlated, suggesting a shared mechanism for sequence-based determination of nucleosome occupancy. In addition, we observed striking complementarity between classifiers trained on experimental data from weakly versus heavily digested MNase samples. In the former case, the resulting model accurately identifies nucleosome-forming sequences; in the latter, the classifier excels at identifying nucleosome-free regions. Using these models, we identify several characteristics of nucleosome-forming and nucleosome-disfavoring sequences. First, combining the results of each classifier applied de novo across the human ENCODE regions reveals distinct sequence composition and periodicity features of the two sequence classes. Short runs of dinucleotide repeats appear as a hallmark of nucleosome-disfavoring sequences, while nucleosome-forming sequences contain short periodic runs of GC base pairs. Second, we show that nucleosome phasing is most frequently predicted flanking nucleosome-free regions. The results suggest that the major mechanism of nucleosome positioning in vivo is boundary-event-driven and affirm the classical statistical positioning theory of nucleosome organization.
Inside the nucleus, DNA is wrapped into a complex molecular structure called chromatin, whose fundamental unit is the nucleosome: ∼150 bp of DNA organized around an eight-histone protein complex. Understanding the local organization of nucleosomes is critical for understanding how chromatin impacts gene regulation. Here, we describe a computational model that predicts nucleosome placement from DNA sequence. We train the model using data derived from human cell lines, and we apply the model systematically to 1% of the human genome. We show that previously described models trained on yeast data correlate strongly with the human-trained model, suggesting a common mechanism for sequence-based determination of nucleosome occupancy. In addition, we observe a striking complementarity between models trained using data from weakly and strongly digested samples: one type of model identifies well-positioned nucleosomes, whereas the other recognizes nucleosome-free regions. Finally, our analysis of predicted nucleosome positions in the human genome allows us to identify common features of nucleosome-forming and nucleosome-inhibitory sequences. Overall, our results are consistent with the classical statistical positioning theory of nucleosome organization.
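The sequence composition features mentioned above (dinucleotide content and short runs of dinucleotide repeats) can be explored with very simple tools. The following is a minimal illustrative sketch, not the classifier described in this work: the function names `kmer_frequencies` and `longest_dinucleotide_run` are hypothetical, and the example sequence is made up for demonstration.

```python
from collections import Counter


def kmer_frequencies(seq, k=2):
    """Relative frequencies of overlapping k-mers in a DNA sequence.

    Hypothetical helper for illustration; k=2 gives dinucleotide content.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def longest_dinucleotide_run(seq):
    """Length in bp of the longest period-2 repeat, e.g. TATATA -> 6.

    Homopolymer runs such as AAAA are deliberately not counted.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    best = 0
    run = 0  # consecutive positions i where seq[i] repeats seq[i - 2]
    for i in range(2, len(seq)):
        if seq[i] == seq[i - 2] and seq[i] != seq[i - 1]:
            run += 1
            best = max(best, run + 2)
        else:
            run = 0
    return best


if __name__ == "__main__":
    example = "GGTATATACC"  # made-up sequence containing a TA repeat
    print(kmer_frequencies(example))
    print(longest_dinucleotide_run(example))
```

Summary statistics of this kind could serve as a first check of whether a candidate region carries the repeat signature associated here with nucleosome-disfavoring sequences; they are not a substitute for a trained model.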