Characterizing and measuring bias in sequence data

Open Access

29 May 2013

journal article
Published by Springer Nature in Genome Biology

Vol. 14 (5), R51
https://doi.org/10.1186/gb-2013-14-5-r51

Abstract

Background: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias.

Results: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage.

Conclusions: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.
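To make the "covered at less than 10% of the genome-wide average" statistic concrete, the following is a minimal sketch of how such a fraction could be computed from per-base depths. It is an illustration only and not the authors' pipeline: the function name `fraction_poorly_covered`, the `threshold` parameter, and the toy depth values are all assumptions introduced here.

```python
from typing import Sequence


def fraction_poorly_covered(depths: Sequence[float], threshold: float = 0.10) -> float:
    """Return the fraction of positions whose depth is below
    `threshold` times the mean depth across all positions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    mean_depth = sum(depths) / len(depths)
    cutoff = threshold * mean_depth
    # Count positions strictly below the cutoff.
    poorly_covered = sum(1 for d in depths if d < cutoff)
    return poorly_covered / len(depths)


# Toy data (invented): 95 positions at depth 120 and 5 positions at depth 5.
toy_depths = [120] * 95 + [5] * 5
print(f"{fraction_poorly_covered(toy_depths):.2%}")
```

On real data the per-base depths would come from an aligned read set, and the mean would be taken over the autosomes only, as the abstract specifies.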

This publication has 133 references indexed in Scilit.

Cited by 755 articles