Finding replicated Web collections
- 16 May 2000
- conference paper
- Published by Association for Computing Machinery (ACM)
- ACM SIGMOD Record, Vol. 29 (2), 355-366
- https://doi.org/10.1145/342009.335429
Abstract
Many web documents (such as Java FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and the ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas in an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies in which we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
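To give a sense of the fingerprint-based near-duplicate detection that underlies replica identification at this scale, here is a minimal sketch of word-shingling with hashed fingerprints. This is an illustration only, not the paper's algorithm: the shingle size, function names, and similarity threshold below are assumptions, and the paper additionally handles hyperlinked collections rather than single pages.

```python
# Minimal sketch: shingle-based page similarity for replica detection.
# Assumptions (not from the paper): 5-word shingles, SHA-1 fingerprints
# truncated to 64 bits, and Jaccard overlap as the similarity measure.
import hashlib

SHINGLE_SIZE = 5  # assumed: number of words per shingle

def shingles(text: str, k: int = SHINGLE_SIZE) -> set[int]:
    """Hash every k-word window of the document into a 64-bit fingerprint."""
    words = text.split()
    fingerprints = set()
    for i in range(max(len(words) - k + 1, 1)):
        window = " ".join(words[i:i + k])
        digest = hashlib.sha1(window.encode("utf-8")).digest()
        fingerprints.add(int.from_bytes(digest[:8], "big"))
    return fingerprints

def resemblance(a: set[int], b: set[int]) -> float:
    """Jaccard overlap of two fingerprint sets; 1.0 means identical shingles."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two pages are flagged as likely replicas when their shingle sets
# overlap above some threshold (0.9 here is an arbitrary choice).
page1 = "how to install linux kernel modules step by step"
page2 = "how to install linux kernel modules step by step guide"
print(resemblance(shingles(page1), shingles(page2)) > 0.9)
```

At the scale the abstract describes (tens of millions of pages), pairwise comparison is infeasible; approaches in this family instead sort or group the fingerprints so that pages sharing shingles are brought together in a small number of sequential passes.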