Customized Information Extraction as a Basis for Resource Discovery

Abstract
Indexing file contents is a powerful means of helping users locate documents, software, and other types of data in large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. In this paper we present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates specialized indexing methods within files. By exploiting the semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed tar files). Essence interoperates with the Wide Area Information Servers (WAIS) system, allowing WAIS users to take advantage of the Essence information extraction methods.
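As a rough illustration of the model described in the abstract, the following Python sketch selects an extraction method by file type and recurses into tar archives, including compressed ones. It is not the Essence implementation: the names `EXTRACTORS`, `extract_c`, `extract_text`, and `summarize` are hypothetical, and choosing a method from the filename suffix alone is a simplifying assumption.

```python
import io
import posixpath
import re
import tarfile
from typing import Callable, Dict, List

# Hypothetical extractor signature: decoded file text in, summary terms out.
Extractor = Callable[[str], List[str]]


def extract_c(text: str) -> List[str]:
    # For C source, keep only the names of the included headers.
    return re.findall(r'#include\s*[<"]([^>"]+)[>"]', text)


def extract_text(text: str) -> List[str]:
    # Fallback for unrecognized types: keep the first non-empty line.
    for line in text.splitlines():
        if line.strip():
            return [line.strip()]
    return []


# Assumed registry mapping a filename suffix to a type-specific method.
# Users could customize extraction by adding entries here.
EXTRACTORS: Dict[str, Extractor] = {".c": extract_c, ".h": extract_c}

ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")


def summarize(name: str, data: bytes) -> Dict[str, List[str]]:
    """Return {path: summary terms}, unpacking nested tar archives."""
    if name.endswith(ARCHIVE_SUFFIXES):
        summaries: Dict[str, List[str]] = {}
        # Mode "r:*" lets tarfile detect the compression transparently.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
            for member in archive.getmembers():
                if not member.isfile():
                    continue
                inner = archive.extractfile(member).read()
                # Recurse so archives inside archives are also summarized.
                summaries.update(summarize(f"{name}/{member.name}", inner))
        return summaries
    suffix = posixpath.splitext(name)[1]
    extractor = EXTRACTORS.get(suffix, extract_text)
    return {name: extractor(data.decode("utf-8", errors="replace"))}
```

Calling `summarize` on the bytes of a compressed tar file returns one compact summary per contained file, keyed by its path inside the archive. Such summaries could then be handed to an indexer in place of the full file contents.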