Principles of hash-based text retrieval

23 July 2007

conference paper
conference paper
Published by Association for Computing Machinery (ACM)

p. 527-534
https://doi.org/10.1145/1277741.1277832

Abstract

Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered as similar if they are mapped on the same hash key. From its runtime performance this principle is unequaled--while being unaffected by dimensionality concerns at the same time. Similarity hashing is applied with great success for near similarity search in large document collections, and it is considered as a key technology for near-duplicate detection and plagiarism analysis. This papers reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyze the performance of hash-based search methods, and we explain the rationale of their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study between different hash-based search methods.

Keywords

This publication has 17 references indexed in Scilit:

Finding near-duplicate web pages
Published by Association for Computing Machinery (ACM) ,2006
Verification of Positive Definiteness
BIT Numerical Mathematics, 2006
Near Similarity Search and Plagiarism Analysis
Studies in Classification, Data Analysis, and Knowledge Organization, 2006
LSH forest
Published by Association for Computing Machinery (ACM) ,2005
KEYWORD EXTRACTION FROM A SINGLE DOCUMENT USING WORD CO-OCCURRENCE STATISTICAL INFORMATION
International Journal on Artificial Intelligence Tools, 2004
Similarity estimation techniques from rounding algorithms
Published by Association for Computing Machinery (ACM) ,2002
Indexing by latent semantic analysis
Journal of the American Society for Information Science, 1990
Computing a nearest symmetric positive semidefinite matrix
Linear Algebra and its Applications, 1988
Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis
Psychometrika, 1964
The approximation of one matrix by another of lower rank
Psychometrika, 1936

Cited by 46 articles