AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly
- 1 January 1999
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 6 (2), 163-186
- https://doi.org/10.1089/cmb.1999.6.163
Abstract
In this paper, we propose an efficient, reliable shotgun sequence assembly algorithm based on a fingerprinting scheme that is robust to both noise and repetitive sequences in the data, two primary roadblocks to effective whole-genome shotgun sequencing. Our algorithm uses exact matches of short patterns randomly selected from fragment data to identify fragment overlaps, construct an overlap map, and deliver a consensus sequence. We show how statistical clues made explicit in our approach can easily be exploited to correctly assemble results even in the presence of extensive repetitive sequences. Our approach is both accurate and exceptionally fast in practice: e.g., we have correctly assembled the whole Mycoplasma genitalium genome (approximately 580 kbp) in roughly 8 min of 64-MB 200-MHz Pentium Pro CPU time from real shotgun data, where most existing algorithms can be expected to run for several hours to a day on the same data. Moreover, experiments with artificially shotgunned data prepared from real DNA sequences from a wide range of organisms (including human DNA) and containing complex repeating regions demonstrate our algorithm's robustness to input noise and the presence of repetitive sequences. For example, we have correctly assembled a 238-kbp human DNA sequence in less than 3 min of 64-MB 200-MHz Pentium Pro CPU time.
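The core idea stated in the abstract, using exact matches of short, randomly selected patterns to propose fragment overlaps, can be illustrated with a toy sketch. This is not the AMASS implementation, which the abstract does not describe in detail. The function names (`sample_patterns`, `find_occurrences`, `candidate_overlaps`), the pattern length `k`, and the `min_support` vote threshold are all assumptions introduced here for illustration, and the sketch ignores reverse complements, sequencing errors, and consensus construction.

```python
import random
from collections import defaultdict


def sample_patterns(fragments, k, per_fragment, rng):
    """Pick length-k substrings at random positions in each fragment."""
    patterns = set()
    for frag in fragments:
        if len(frag) < k:
            continue  # fragment too short to contain a pattern
        for _ in range(per_fragment):
            start = rng.randrange(len(frag) - k + 1)
            patterns.add(frag[start:start + k])
    return patterns


def find_occurrences(fragments, patterns, k):
    """Map each pattern to every (fragment index, position) of an exact match."""
    hits = defaultdict(list)
    for idx, frag in enumerate(fragments):
        for pos in range(len(frag) - k + 1):
            kmer = frag[pos:pos + k]
            if kmer in patterns:
                hits[kmer].append((idx, pos))
    return hits


def candidate_overlaps(hits, min_support=2):
    """Count how many patterns agree on each (frag a, frag b, offset) triple.

    The offset is where fragment b would start in fragment a's coordinates.
    Triples supported by fewer than min_support patterns are discarded
    (a hypothetical threshold, not a value taken from the paper).
    """
    votes = defaultdict(int)
    for occurrences in hits.values():
        for a, pos_a in occurrences:
            for b, pos_b in occurrences:
                if a < b:
                    votes[(a, b, pos_a - pos_b)] += 1
    return {key: n for key, n in votes.items() if n >= min_support}


if __name__ == "__main__":
    frags = ["AAACCCGGGTTTACGT", "GGGTTTACGTCATCAT"]
    rng = random.Random(0)  # seeded only to make the run repeatable
    patterns = sample_patterns(frags, k=5, per_fragment=8, rng=rng)
    hits = find_occurrences(frags, patterns, k=5)
    print(candidate_overlaps(hits))
```

In this sketch, several independent patterns agreeing on the same relative offset serve as a crude stand-in for the kind of statistical evidence the abstract alludes to: a pattern inside a repeat tends to vote for many conflicting offsets, whereas a true overlap accumulates consistent votes for one.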