Machine learning in automated text categorization

1 March 2002

journal article
Published by Association for Computing Machinery (ACM) in ACM Computing Surveys

Vol. 34 (1), 1-47
https://doi.org/10.1145/505282.505283

Abstract

The automated categorization (or classification) of texts into predefined categories has attracted booming interest over the last 10 years, owing to the increased availability of documents in digital form and the ensuing need to organize them. In the research community, the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. It discusses in detail issues pertaining to three different problems: document representation, classifier construction, and classifier evaluation.
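
The three problems named in the abstract can be illustrated with a minimal sketch. It is not taken from the survey: it assumes a bag-of-words representation, a multinomial naive Bayes classifier with add-one smoothing, and precision/recall as the evaluation measures, chosen only because they are compact. The function names, the category labels, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Classifier construction: collect the counts naive Bayes needs
    from a set of preclassified (text, label) pairs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        bag = tokenize(text)
        class_doc_counts[label] += 1
        term_counts[label].update(bag)
        vocab.update(bag)
    return class_doc_counts, term_counts, vocab


def classify(model, text):
    """Return the category with the highest log-probability score."""
    class_doc_counts, term_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    bag = tokenize(text)
    best_label, best_score = None, -math.inf
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(term_counts[label].values()) + len(vocab)
        for term, freq in bag.items():
            if term in vocab:  # terms unseen in training are ignored
                score += freq * math.log((term_counts[label][term] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall(model, test_docs, target):
    """Classifier evaluation: precision and recall for one category."""
    tp = fp = fn = 0
    for text, gold in test_docs:
        pred = classify(model, text)
        if pred == target and gold == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif gold == target:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data invented for illustration only.
TRAIN = [
    ("wheat harvest grain export", "grain"),
    ("grain prices wheat corn", "grain"),
    ("bank interest rates loan", "finance"),
    ("loan default bank credit", "finance"),
]
TEST = [
    ("corn and wheat export", "grain"),
    ("bank credit rates", "finance"),
]

model = train(TRAIN)
print(classify(model, "corn and wheat export"))
print(precision_recall(model, TEST, "grain"))
```

Moving such a classifier to a different domain only requires a new set of preclassified documents, which is the portability advantage the abstract attributes to the machine learning approach.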

All Related Versions

Version 1, 2001-10-26, ArXiv (Unconfirmed version)

This publication has 90 references indexed in Scilit.

Cited by 4815 articles