A fast bit-vector algorithm for approximate string matching based on dynamic programming
- 1 May 1999
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in Journal of the ACM
- Vol. 46 (3), 395-415
- https://doi.org/10.1145/316542.316550
Abstract
The approximate string matching problem is to find all locations at which a query of lengthm matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nm/w) or O(nm log σ/w) time where w is the word size of the machine (e.g., 32 or 64 in practice), and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and smallm. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al.(1996). This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic progr amming (d.p.) matrx w entries at a time using the basic algorithm as a subroutine is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.Keywords
This publication has 15 references indexed in Scilit:
- A subquadratic algorithm for approximate limited expression matchingAlgorithmica, 1996
- Multiple filtration and approximate pattern matchingAlgorithmica, 1995
- A sublinear algorithm for approximate keyword searchingAlgorithmica, 1994
- Sublinear approximate string matching and biological applicationsAlgorithmica, 1994
- Approximate string matching using withinword parallelismSoftware: Practice and Experience, 1994
- Recent Developments in Linear-Space Alignment Methods: A SurveyJournal of Computational Biology, 1994
- Fast text searchingCommunications of the ACM, 1992
- Approximate string-matching with q-grams and maximal matchesTheoretical Computer Science, 1992
- Finding approximate patterns in stringsJournal of Algorithms, 1985
- The theory and computation of evolutionary distances: Pattern recognitionJournal of Algorithms, 1980