SSAHA: A Fast Search Method for Large DNA Databases

Open Access

1 October 2001

journal article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 11 (10), 1725-1729
https://doi.org/10.1101/gr.194201

Abstract

We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the “hits” for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.
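
The preprocessing and search steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "consecutive" means non-overlapping k-tuples in the database (step size k), that the query is scanned at every position, and that hits are sorted by sequence and by the shift between database offset and query position so that hits from one ungapped match become adjacent. The names `build_index`, `search`, and `group_runs` are hypothetical.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Hash each non-overlapping k-tuple to its (sequence id, offset) occurrences.

    Assumption: database tuples are taken with step k (no overlap).
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(0, len(seq) - k + 1, k):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def search(index, query, k):
    """Collect hits for every k-tuple of the query, then sort them.

    Each hit is (sequence id, shift, offset), where shift = offset - t and
    t is the position of the k-tuple in the query. Assumption: sorting on
    this triple places hits from the same ungapped match next to each other.
    """
    hits = []
    for t in range(len(query) - k + 1):
        for seq_id, offset in index.get(query[t:t + k], ()):
            hits.append((seq_id, offset - t, offset))
    hits.sort()
    return hits


def group_runs(hits):
    """Group sorted hits by (sequence id, shift); each group is a candidate match."""
    runs = defaultdict(list)
    for seq_id, shift, offset in hits:
        runs[(seq_id, shift)].append(offset)
    return dict(runs)


if __name__ == "__main__":
    database = ["ACGTTGCA", "TTGCAAGG"]  # toy data, not from the paper
    k = 2
    index = build_index(database, k)
    hits = search(index, "TTGCA", k)
    print(hits)
    print(group_runs(hits))
```

In this toy run, the query `TTGCA` yields two hits with shift 3 in sequence 0 and two hits with shift 0 in sequence 1, matching the positions where the query actually occurs. A larger k shrinks the number of stored positions per sequence and speeds up lookups, but a match shorter than roughly 2k - 1 bases can fall between tuple boundaries and be missed, which is the sensitivity trade-off the abstract refers to.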

Cited by 830 articles