Detecting near-duplicates for web crawling

This publication has 37 references indexed in Scilit:

Efficient phrase-based document indexing for Web document clustering
IEEE Transactions on Knowledge and Data Engineering, 2004
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology, 2003
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems, 2002
Searching the Web
ACM Transactions on Internet Technology, 2001
Authoritative sources in a hyperlinked environment
Journal of the ACM, 1999
The anatomy of a large-scale hypertextual Web search engine
Computer Networks and ISDN Systems, 1998
Efficient crawling through URL ordering
Computer Networks and ISDN Systems, 1998
Dictionary Look-Up with One Error
Journal of Algorithms, 1997
Syntactic clustering of the Web
Computer Networks and ISDN Systems, 1997
Indexing by latent semantic analysis
Journal of the American Society for Information Science, 1990