Integration of heterogeneous databases without common domains using queries based on textual similarity

1 June 1998

journal article
Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record

Vol. 27 (2), 201-212
https://doi.org/10.1145/276305.276323

Abstract

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
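
The similarity measure mentioned in the abstract can be illustrated with a small sketch. What follows is not WHIRL and not the authors' implementation; it is a minimal, assumed TF-IDF and cosine-similarity scorer over name strings, using the naive all-pairs comparison that the abstract contrasts WHIRL against. The function names (tokenize, tfidf_vectors, cosine, similarity_join), the smoothed weighting formula, and the sample movie titles are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (a dict of term -> weight) per name.

    Assumed weighting: (1 + log tf) * log(1 + N / df). This is one common
    smoothed variant, chosen here for illustration only.
    """
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency: count each term once per name
    vectors = []
    for doc in docs:
        vec = {t: (1 + math.log(tf)) * math.log(1 + n_docs / df[t])
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, top_k=3):
    """Naively score every (left, right) pair and return the top_k by similarity."""
    vecs = tfidf_vectors(left + right)  # term statistics come from both name lists
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), x, y)
             for x, a in zip(left, lv)
             for y, b in zip(right, rv)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:top_k]


# Hypothetical name constants from two sources with no shared key.
left_names = ["Lost World, The", "Twelve Monkeys"]
right_names = ["The Lost World: Jurassic Park", "12 Monkeys", "Casablanca"]

for score, a, b in similarity_join(left_names, right_names, top_k=2):
    print(f"{score:.2f}  {a!r} ~ {b!r}")
```

Names that share rare tokens score close to 1, and names with no tokens in common score 0, so a ranked list of pairs can stand in for an exact-match join when no global domain is available. Scoring every pair is quadratic in the number of names, which is the cost an efficient implementation would aim to avoid.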

Cited by 53 articles