Nearest neighbour searching in serial files using text signatures

Abstract
A nearest neighbour search procedure is described for use with serial files of textual data. The procedure involves the grouping of records into blocks, each of which is characterised by a fixed-length bit string. A comparable query bit string may then be matched against each of these bit strings, and an upper bound calculation used to identify those blocks which need to be inspected in detail if the document that is most similar to the query is to be identified. Experiments with three small collections of documents and queries are used to test the efficiency of the approach. The experiments show that reduc tions in computation are possible, although the precise savings are crucially dependent upon a range of factors including the frequency characteristics of the documents and queries, the similarity coefficients, and the sizes of the bit strings and of the blocks.