Detection and Removal of Biases in the Analysis of Next-Generation Sequencing Reads
Open Access
- 31 January 2011
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLOS ONE
- Vol. 6 (1), e16685
- https://doi.org/10.1371/journal.pone.0016685
Abstract
Since the emergence of next-generation sequencing (NGS) technologies, great effort has been put into the development of tools for the analysis of short reads. In parallel, knowledge is increasing regarding biases inherent in these technologies. Here we discuss four different biases we encountered while analyzing various Illumina datasets. These biases are due to both biological and statistical effects that in particular affect comparisons between different genomic regions. Specifically, we encountered biases pertaining to the distributions of nucleotides across sequencing cycles, to mappability, to contamination of pre-mRNA with mRNA, and to non-uniform hydrolysis of RNA. Most of these biases are not specific to one analyzed dataset, but are present across a variety of datasets and within a variety of genomic contexts. Importantly, some of these biases correlated in a highly significant manner with biological features, including transcript length, gene expression levels, conservation levels, and exon-intron architecture, which misleadingly lends credibility to results that arise from them. We also demonstrate the relevance of these biases when analyzing an NGS dataset that maps transcriptionally engaged RNA polymerase II (RNAPII) in the context of exon-intron architecture, and show that elimination of these biases is crucial for avoiding erroneous interpretation of the data. Collectively, our results highlight several important pitfalls, challenges and approaches in the analysis of NGS reads.