A SOM-based document clustering using phrases

Abstract
Most existing techniques for document clustering rely on a "bag of words" document representation, in which each word in the document is treated as a separate feature and word order is ignored. We investigate the use of phrases rather than words as document features for document clustering. We present a phrase grammar extraction technique and use the extracted phrases as features in a self-organizing map (SOM) based document clustering algorithm. We present clustering results on the REUTERS corpus and show that phrase features improve clustering performance under both the entropy and F-measure evaluation measures.
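The two evaluation measures named above can be illustrated with a minimal sketch. It assumes the definitions commonly used in the document clustering literature: entropy is the size-weighted average of the class-label entropy inside each cluster (lower is better), and the overall F-measure is the class-size-weighted average of the best per-class F1 score over all clusters (higher is better). The abstract does not state the paper's exact formulas, and the function names `cluster_entropy` and `overall_f_measure` are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels, clusters):
    """Size-weighted average entropy of true labels within each cluster.

    Assumed definition (not taken from the paper); 0.0 means every
    cluster contains documents of a single class.
    """
    n = len(labels)
    members = {}
    for label, cluster in zip(labels, clusters):
        members.setdefault(cluster, []).append(label)

    total = 0.0
    for cluster_labels in members.values():
        size = len(cluster_labels)
        counts = Counter(cluster_labels)
        # Shannon entropy (base 2) of the class distribution in this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


def overall_f_measure(labels, clusters):
    """Class-size-weighted average of the best F1 per class over clusters.

    Assumed definition (not taken from the paper); 1.0 means every class
    is matched exactly by some cluster.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_clu
            recall = n_ij / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total
```

With these definitions, a clustering that separates two classes perfectly scores an entropy of 0 and an F-measure of 1, while a clustering that mixes the two classes evenly in every cluster scores an entropy of 1 bit.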
