Using quality scores and longer reads improves accuracy of Solexa read mapping
Open Access
- 28 February 2008
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 9 (1), 1-8
- https://doi.org/10.1186/1471-2105-9-128
Abstract
Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25–50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores. To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths, and varying criteria for use of quality scores. RMAP is freely available for downloading at http://rulai.cshl.edu/rmap/. Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.
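The idea of letting base-call quality scores decide which read positions matter can be illustrated with a small sketch. This is not RMAP's implementation: the function names, the `min_qual` cutoff, the `max_mismatches` limit, and the brute-force scan over the reference are all assumptions made for illustration. The sketch counts mismatches only at positions whose quality meets the cutoff, so low-quality bases at the 3' end of a longer read do not count against an otherwise good alignment.

```python
def quality_aware_mismatches(read, quals, ref_window, min_qual=10):
    """Count mismatches, ignoring positions with quality below min_qual.

    Hypothetical helper: min_qual is an assumed cutoff, not an RMAP setting.
    """
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, quals and ref_window must have equal length")
    return sum(
        1
        for base, qual, ref_base in zip(read, quals, ref_window)
        if qual >= min_qual and base != ref_base
    )


def map_read(read, quals, reference, max_mismatches=2, min_qual=10):
    """Return (position, mismatches) for a unique best hit, else None.

    Brute-force scan of every reference offset; a real mapper would use
    an index or filtration step instead.
    """
    n = len(read)
    best_positions = []
    best_score = max_mismatches + 1  # anything above the limit is rejected
    for pos in range(len(reference) - n + 1):
        score = quality_aware_mismatches(
            read, quals, reference[pos:pos + n], min_qual
        )
        if score < best_score:
            best_score, best_positions = score, [pos]
        elif score == best_score and score <= max_mismatches:
            best_positions.append(pos)
    if len(best_positions) == 1:
        return best_positions[0], best_score
    return None  # unmapped, or ambiguous (several equally good hits)


if __name__ == "__main__":
    reference = "GATTACAGGCTTAACCGT"
    read = "GGCTTCC"                     # last two bases are miscalled
    quals = [30, 30, 30, 30, 30, 5, 5]   # low quality at the 3' end
    print(map_read(read, quals, reference, max_mismatches=1, min_qual=10))
    print(map_read(read, quals, reference, max_mismatches=1, min_qual=0))
```

In this toy case the read maps uniquely when the two low-quality 3' bases are masked, and fails to map under the same mismatch limit when every position is treated as equally reliable.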