Detecting near-duplicates for web crawling
- 8 May 2007
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 141-150
- https://doi.org/10.1145/1242572.1242592
Abstract
No abstract available
This publication has 37 references indexed in Scilit:
- Efficient phrase-based document indexing for Web document clustering. IEEE Transactions on Knowledge and Data Engineering, 2004
- Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 2003
- Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems, 2002
- Searching the Web. ACM Transactions on Internet Technology, 2001
- Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999
- The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998
- Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 1998
- Dictionary Look-Up with One Error. Journal of Algorithms, 1997
- Syntactic clustering of the Web. Computer Networks and ISDN Systems, 1997
- Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990