Efficient phrase-based document indexing for Web document clustering

24 August 2004

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering

Vol. 16 (10), 1279-1296
https://doi.org/10.1109/tkde.2004.58

Abstract

Document clustering techniques mostly rely on single-term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features, including phrases and their weights, are particularly important. Document clustering is useful in many applications, such as automatic categorization of documents, grouping search engine results, and building a taxonomy of documents. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pairwise document similarity distribution inside clusters. Together, these two components form an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
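
To make the idea of incremental phrase matching concrete, the following is a minimal sketch, not the article's document index graph. It assumes a simplified structure in which each pair of adjacent words is an edge annotated with the documents and positions where it occurs, and it reports the maximal word sequences a new document shares with earlier ones. The class name `PhraseIndex`, the method `add_document`, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class PhraseIndex:
    """Toy incremental phrase index (illustrative only, not the paper's DIG).

    Each edge is a pair of adjacent words; it stores, per document,
    the positions at which that pair occurs.
    """

    def __init__(self):
        # (word, next_word) -> {doc_id: [position of word in that document]}
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, text):
        """Index `text` and return {earlier_doc_id: set of shared phrases}."""
        words = text.lower().split()  # assumed tokenization: lowercase, whitespace
        shared = defaultdict(set)
        # (other_doc, position in other_doc) -> start index of the match in `words`
        active = {}
        for i in range(len(words) - 1):
            edge = (words[i], words[i + 1])
            new_active, extended = {}, set()
            for other, positions in self.edges.get(edge, {}).items():
                for p in positions:
                    prev = (other, p - 1)
                    if prev in active:
                        # The previous edge matched right before p: extend that run.
                        new_active[(other, p)] = active[prev]
                        extended.add(prev)
                    else:
                        # Start a new run at the current word.
                        new_active[(other, p)] = i
            # Runs that could not be extended are complete shared phrases.
            for key, start in active.items():
                if key not in extended:
                    shared[key[0]].add(" ".join(words[start:i + 1]))
            active = new_active
        # Runs still open at the end of the document reach its last word.
        for (other, _), start in active.items():
            shared[other].add(" ".join(words[start:]))
        # Only now add the new document's edges, so it never matches itself.
        for i in range(len(words) - 1):
            self.edges[(words[i], words[i + 1])][doc_id].append(i)
        return dict(shared)


index = PhraseIndex()
index.add_document("d1", "river rafting vacation plan")
index.add_document("d2", "wild river rafting trips")
matches = index.add_document(
    "d3", "river rafting vacation plan for wild river fishing"
)
print(matches)
```

In this sketch, adding `d3` finds the phrase "river rafting vacation plan" shared with `d1` and the phrases "river rafting" and "wild river" shared with `d2`. How such matches are weighted into a similarity score, and how the clustering step uses that score, is left to the article itself.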

