Principles of hash-based text retrieval
- 23 July 2007
- conference paper
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 527-534
- https://doi.org/10.1145/1277741.1277832
Abstract
Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered as similar if they are mapped on the same hash key. From its runtime performance this principle is unequaled--while being unaffected by dimensionality concerns at the same time. Similarity hashing is applied with great success for near similarity search in large document collections, and it is considered as a key technology for near-duplicate detection and plagiarism analysis. This papers reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyze the performance of hash-based search methods, and we explain the rationale of their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study between different hash-based search methods.Keywords
This publication has 17 references indexed in Scilit:
- Finding near-duplicate web pagesPublished by Association for Computing Machinery (ACM) ,2006
- Verification of Positive DefinitenessBIT Numerical Mathematics, 2006
- Near Similarity Search and Plagiarism AnalysisStudies in Classification, Data Analysis, and Knowledge Organization, 2006
- LSH forestPublished by Association for Computing Machinery (ACM) ,2005
- KEYWORD EXTRACTION FROM A SINGLE DOCUMENT USING WORD CO-OCCURRENCE STATISTICAL INFORMATIONInternational Journal on Artificial Intelligence Tools, 2004
- Similarity estimation techniques from rounding algorithmsPublished by Association for Computing Machinery (ACM) ,2002
- Indexing by latent semantic analysisJournal of the American Society for Information Science, 1990
- Computing a nearest symmetric positive semidefinite matrixLinear Algebra and its Applications, 1988
- Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesisPsychometrika, 1964
- The approximation of one matrix by another of lower rankPsychometrika, 1936