PATTERN CLUSTERING BY MULTIVARIATE MIXTURE ANALYSIS

1 April 1970

journal article
Published by Taylor & Francis in Multivariate Behavioral Research

Vol. 5 (3), 329-350
https://doi.org/10.1207/s15327906mbr0503_6

Abstract

Cluster analysis is reformulated as a problem of estimating the parameters of a mixture of multivariate distributions. The maximum-likelihood theory and numerical solution techniques are developed for a fairly general class of distributions. The theory is applied to mixtures of multivariate normals ("NORMIX") and mixtures of multivariate Bernoulli distributions ("Latent Classes"). The feasibility of the procedures is demonstrated by two examples of computer solutions for normal mixture models of the Fisher Iris data and of artificially generated clusters with unequal covariance matrices.

This paper is addressed to the problem which has been variously called cluster analysis, Q-analysis, typology, grouping, clumping, classification, numerical taxonomy, and unsupervised pattern recognition. The variety of nomenclature may be due to the importance of the subject in such diverse fields as psychology, biology, signal detection, artificial intelligence, and information retrieval. Perhaps this multiplicity of names also indicates a certain confusion in the basic definition of the problem. This paper attempts to reformulate cluster analysis, with a resulting improvement in conceptual simplicity and statistical rigor. In this formulation cluster analysis will be viewed as a form of mixture analysis for finite mixtures of multivariate distributions.

In clustering methodology, one is generally given a sample of N objects or individuals, each of which is measured on m variables. From this information alone, one must devise a classification scheme for grouping the objects into r classes. The number of classes and the characteristics of the classes are to be determined. If all the objects in a given class were identical to one another, the problem would be simple. However, in the usual situation the objects in a class differ on most or all of the measures.

Most cluster analysis procedures try to measure the "similarity" between any two objects, and then try to group the objects so as to maximize within-class similarity. Unfortunately, the appropriate measure of similarity is a subject of some controversy. It would be desirable to derive a cluster analysis system without arbitrary assumptions about similarity. Such a system will be presented in this paper.
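
To make the mixture formulation concrete, the following is a minimal sketch of maximum-likelihood fitting for the multivariate Bernoulli ("Latent Classes") case: N objects, m binary variables, r classes. It is not the paper's program. The abstract does not specify the numerical scheme, so the EM-style fixed-point iteration used here is an assumption, as are the within-class independence of the variables, the random starting values, and the hypothetical names `class_posteriors` and `fit_bernoulli_mixture`.

```python
import math
import random


def class_posteriors(data, weights, probs):
    """Return (posteriors, log-likelihood) under the current parameters."""
    resp, loglik = [], 0.0
    for x in data:
        # Log of (mixing weight * class-conditional probability) per class.
        logp = []
        for w, p_k in zip(weights, probs):
            lp = math.log(w)
            for xj, p in zip(x, p_k):
                lp += math.log(p) if xj else math.log(1.0 - p)
            logp.append(lp)
        # Log-sum-exp normalisation for numerical stability.
        top = max(logp)
        terms = [math.exp(lp - top) for lp in logp]
        total = sum(terms)
        resp.append([t / total for t in terms])
        loglik += top + math.log(total)
    return resp, loglik


def fit_bernoulli_mixture(data, r, n_iter=200, seed=0):
    """Fit an r-class mixture of independent Bernoulli variables.

    data is a list of N equal-length lists of 0/1 values.
    Returns (weights, probs, posteriors, log-likelihood history).
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    eps = 1e-6  # assumption: clamp keeps every logarithm finite
    weights = [1.0 / r] * r
    # Assumed initialisation: random class profiles to break symmetry.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(m)] for _ in range(r)]
    history = []
    for _ in range(n_iter):
        resp, loglik = class_posteriors(data, weights, probs)
        history.append(loglik)
        # Re-estimate each class from posterior-weighted counts.
        sizes = [max(sum(row[k] for row in resp), eps) for k in range(r)]
        size_total = sum(sizes)
        weights = [s / size_total for s in sizes]
        for k in range(r):
            for j in range(m):
                hits = sum(row[k] * x[j] for row, x in zip(resp, data))
                probs[k][j] = min(max(hits / sizes[k], eps), 1.0 - eps)
    # Final posteriors and log-likelihood for the returned parameters.
    resp, loglik = class_posteriors(data, weights, probs)
    history.append(loglik)
    return weights, probs, resp, history
```

Calling `fit_bernoulli_mixture(data, 2)` on a list of 0/1 rows returns the estimated mixing proportions, the per-class response probabilities, and for each object its posterior probability of belonging to each class. Assigning every object to its most probable class gives a grouping that needs no separately chosen similarity measure, which is the point of the reformulation. The number of classes r is fixed in advance in this sketch, whereas the text treats it as something to be determined.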

This publication has 8 references indexed in Scilit:

On Some Invariant Criteria for Grouping Data
Journal of the American Statistical Association, 1967
Estimation in Mixtures of Two Normal Distributions
Technometrics, 1967
Estimation of Parameters for a Mixture of Normal Distributions
Technometrics, 1966
A Computer Program for the Maximum Likelihood Analysis of Types
Defense Technical Information Center (DTIC), 1965
On the Solution of Likelihood Equations by Iteration Processes. The Multiparametric Case
Biometrika, 1962
Three multivariate models: Factor analysis, latent structure analysis, and latent profile analysis
Psychometrika, 1959
The Use of Multiple Measurements in Taxonomic Problems
Annals of Eugenics, 1936
III. Contributions to the mathematical theory of evolution
Philosophical Transactions of the Royal Society of London. (A.), 1894

Cited by 420 articles