Web document clustering based on Global-Best Harmony Search, K-means, Frequent Term Sets and Bayesian Information Criterion

1 July 2010

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

Abstract

This paper introduces a new description-centric algorithm for web document clustering based on the hybridization of the Global-Best Harmony Search with the K-means algorithm, Frequent Term Sets and the Bayesian Information Criterion. The new algorithm determines the number of clusters automatically. The Global-Best Harmony Search, which combines Harmony Search with the concept of swarm intelligence, provides the global search strategy over the solution space. The K-means algorithm is used to find the optimum within a local region of that space. The Bayesian Information Criterion is used as the fitness function, while FP-Growth is used to reduce the high dimensionality of the vocabulary. The resulting algorithm, called IGBHSK, was tested with data sets based on Reuters-21578 and DMOZ, obtaining promising results (better precision than a Singular Value Decomposition algorithm). It was also evaluated by a group of users.
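To make the global search step more concrete, the following is a minimal sketch of a Global-Best Harmony Search loop on a toy problem. It is not the IGBHSK implementation from the paper: the objective `sphere`, the parameter values (`hms`, `hmcr`, `par`, `improvisations`), the fixed pitch adjustment rate and the shared bounds for all dimensions are assumptions chosen for illustration. In the paper, each harmony would encode a clustering, K-means would refine it locally, and the Bayesian Information Criterion would serve as the fitness.

```python
import random


def sphere(x):
    """Toy objective to minimize (hypothetical stand-in for the clustering fitness)."""
    return sum(v * v for v in x)


def global_best_harmony_search(fitness, dim, lower, upper, hms=10, hmcr=0.9,
                               par=0.3, improvisations=2000, seed=0):
    """Minimize `fitness` over [lower, upper]^dim; parameter values are assumptions."""
    rng = random.Random(seed)
    # Harmony memory: hms random candidate solutions and their fitness values.
    memory = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(hms)]
    scores = [fitness(h) for h in memory]
    history = []
    for _ in range(improvisations):
        best = memory[scores.index(min(scores))]
        new = []
        for i in range(dim):
            if rng.random() < hmcr:
                # Memory consideration: reuse a value stored for this dimension.
                value = rng.choice(memory)[i]
                if rng.random() < par:
                    # Global-best step: copy a randomly chosen component of the best harmony.
                    value = best[rng.randrange(dim)]
            else:
                # Random selection within the bounds.
                value = rng.uniform(lower, upper)
            new.append(value)
        new_score = fitness(new)
        worst = scores.index(max(scores))
        # Replace the worst harmony only if the new one is better.
        if new_score < scores[worst]:
            memory[worst], scores[worst] = new, new_score
        history.append(min(scores))
    winner = scores.index(min(scores))
    return memory[winner], scores[winner], history


if __name__ == "__main__":
    best, score, _ = global_best_harmony_search(sphere, dim=3, lower=-5.0, upper=5.0)
    print(best, score)
```

Because the worst harmony is replaced only by a better one, the best fitness in memory never gets worse between improvisations. A hybrid in the spirit of the abstract would apply a local refinement such as K-means to each new harmony before scoring it.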

Cited by 13 articles