The Noisy Substring Matching Problem

1 May 1983

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Software Engineering

Vol. SE-9 (3), 365-370
https://doi.org/10.1109/tse.1983.237018

Abstract

Let T(U) be the set of words in the dictionary H which contains U as a substring. The problem considered here is the estimation of the set T(U) when U is not known, but Y, a noisy version of U is available. The suggested set estimate S*(Y) of T(U) is a proper subset of H such that its every element contains at least one substring which resembles Y most according to the Levenshtein metric. The proposed algorithm for-the computation of S*(Y) requires cubic time. The algorithm uses the recursively computable dissimilarity measure Dk(X, Y), termed as the kth distance between two strings X and Y which is a dissimilarity measure between Y and a certain subset of the set of contiguous substrings of X. Another estimate of T(U), namely SM(Y) is also suggested. The accuracy of SM(Y) is only slightly less than that of S*(Y), but the computation time of SM(Y) is substantially less than that of S*(Y). Experimental results involving 1900 noisy substrings and dictionaries which are subsets of 1023 most common English words [11] indicate that the accuracy of the estimate S*(Y) is around 99 percent and that of SM(Y) is about 98 percent.

Keywords

This publication has 9 references indexed in Scilit:

An effective algorithm for string correction using generalized edit distance—II. Computational complexity of the algorithm and some applications
Information Sciences, 1981
An effective algorithm for string correction using generalized edit distances—I. Description of the algorithm and its optimality
Information Sciences, 1981
Syntactic Decision Rules for Recognition of Spoken Words and Phrases Using a Stochastic Automaton
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1979
Algorithms for the Longest Common Subsequence Problem
Journal of the ACM, 1977
A fast string searching algorithm
Communications of the ACM, 1977
Fast Pattern Matching in Strings
SIAM Journal on Computing, 1977
A Method for the Correction of Garbled Words Based on the Levenshtein Metric
IEEE Transactions on Computers, 1976
The String-to-String Correction Problem
Journal of the ACM, 1974
Relativ Frequency of English Speech Sounds
Published by Harvard University Press ,1923

Cited by 20 articles