Web document clustering based on Global-Best Harmony Search, K-means, Frequent Term Sets and Bayesian Information Criterion
- 1 July 2010
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
This paper introduces a new description-centric algorithm for web document clustering based on the hybridization of the Global-Best Harmony Search with the K-means algorithm, Frequent Term Sets and the Bayesian Information Criterion. The new algorithm defines the number of clusters automatically. The Global-Best Harmony Search provides a global strategy for searching the solution space, based on the Harmony Search and the concept of swarm intelligence. The K-means algorithm is used to find the optimum value in a local search space. The Bayesian Information Criterion is used as a fitness function, while FP-Growth is used to reduce the high dimensionality of the vocabulary. The resulting algorithm, called IGBHSK, was tested with data sets based on Reuters-21578 and DMOZ, obtaining promising results (better precision than a Singular Value Decomposition algorithm). It was then also evaluated by a group of users.
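The interplay of the three main components described above (a global harmony search, K-means as a local optimizer, and BIC as the fitness function) can be illustrated with a minimal Python sketch. This is not the authors' IGBHSK implementation. The abstract gives no formulas or parameter values, so everything concrete here is an assumption: the BIC form `n * ln(SSE / n) + k * ln(n)`, the encoding of a harmony as a list of centroids, the simplified global-best pitch adjustment, and all function and parameter names (`kmeans`, `bic`, `gbhs_cluster`, `hms`, `hmcr`, `par`, `k_max`). The FP-Growth vocabulary reduction step is omitted, and points are plain numeric vectors rather than term vectors.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iters=10):
    """Plain Lloyd iterations from the given centroids; returns (centroids, sse)."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def bic(sse, n, k):
    """Assumed BIC form (lower is better); the floor avoids log(0)."""
    return n * math.log(max(sse, 1e-12) / n) + k * math.log(n)


def gbhs_cluster(points, k_max=6, hms=5, hmcr=0.9, par=0.3, iters=30, seed=0):
    """Simplified global-best harmony search over centroid sets of varying size."""
    rng = random.Random(seed)
    n = len(points)

    def evaluate(centroids):
        # Local search: refine the candidate with K-means, then score it with BIC.
        centroids, sse = kmeans(points, centroids)
        return bic(sse, n, len(centroids)), centroids

    def random_centroids():
        return rng.sample(points, rng.randint(2, k_max))

    # Harmony memory, kept sorted so that index 0 is the best (lowest BIC).
    memory = sorted((evaluate(random_centroids()) for _ in range(hms)),
                    key=lambda s: s[0])
    for _ in range(iters):
        best = memory[0][1]
        if rng.random() < hmcr:
            # Memory consideration: start from a stored harmony.
            base = rng.choice(memory)[1]
            # Global-best pitch adjustment: copy components from the best harmony.
            new = [best[i] if i < len(best) and rng.random() < par else c
                   for i, c in enumerate(base)]
        else:
            # Random consideration: a fresh set of centroids.
            new = random_centroids()
        candidate = evaluate(new)
        if candidate[0] < memory[-1][0]:
            memory[-1] = candidate  # replace the worst harmony
            memory.sort(key=lambda s: s[0])
    return memory[0]
```

Calling `gbhs_cluster(points)` on a list of equal-length numeric vectors (with at least `k_max` points) returns a `(score, centroids)` pair, so the number of clusters falls out of the search as `len(centroids)` instead of being fixed in advance.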