An Integrated Pipeline for de Novo Assembly of Microbial Genomes
Open Access
- 13 September 2012
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLOS ONE
- Vol. 7 (9), e42304
- https://doi.org/10.1371/journal.pone.0042304
Abstract
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.