PATTERN CLUSTERING BY MULTIVARIATE MIXTURE ANALYSIS
- 1 April 1970
- journal article
- Published by Taylor & Francis in Multivariate Behavioral Research
- Vol. 5 (3), 329-350
- https://doi.org/10.1207/s15327906mbr0503_6
Abstract
Cluster analysis is reformulated as a problem of estimating the parameters of a mixture of multivariate distributions. The maximum-likelihood theory and numerical solution techniques are developed for a fairly general class of distributions. The theory is applied to mixtures of multivariate normals ("NORMIX") and mixtures of multivariate Bernoulli distributions ("Latent Classes"). The feasibility of the procedures is demonstrated by two examples of computer solutions for normal mixture models of the Fisher Iris data and of artificially generated clusters with unequal covariance matrices.
This paper is addressed to the problem which has been variously called cluster analysis, Q-analysis, typology, grouping, clumping, classification, numerical taxonomy, and unsupervised pattern recognition. The variety of nomenclature may be due to the importance of the subject in such diverse fields as psychology, biology, signal detection, artificial intelligence, and information retrieval. Perhaps this multiplicity of names also indicates a certain confusion in the basic definition of the problem. This paper attempts to reformulate cluster analysis, with a resulting improvement in conceptual simplicity and statistical rigor. In this formulation cluster analysis will be viewed as a form of mixture analysis for finite mixtures of multivariate distributions.
In clustering methodology, one is generally given a sample of N objects or individuals, each of which is measured on m variables. From this information alone, one must devise a classification scheme for grouping the objects into r classes. The number of classes and the characteristics of the classes are to be determined. If all the objects in a given class were identical to one another, the problem would be simple. However, in the usual situation the objects in a class differ on most or all of the measures.
Most cluster analysis procedures try to measure the "similarity" between any two objects, and then try to group the objects so as to maximize within-class similarity. Unfortunately, the appropriate measure of similarity is a subject of some controversy. It would be desirable to derive a cluster analysis system without arbitrary assumptions about similarity. Such a system will be presented in this paper.
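The mixture view described above can be illustrated with a small sketch. It fits a mixture of multivariate Bernoulli distributions (the "Latent Classes" case) to N binary objects measured on m variables, with r classes assumed known. The function name `fit_bernoulli_mixture`, its parameters, the random initialization, and the use of the EM iteration are all assumptions made for illustration; EM is a standard later technique for maximum-likelihood mixture fitting and is not claimed to be the numerical procedure of the paper. The variables within a class are assumed independent.

```python
import math
import random


def fit_bernoulli_mixture(data, r, n_iter=100, seed=0):
    """Fit an r-class mixture of independent Bernoulli variables by EM.

    data: list of N rows, each a list of m values in {0, 1}.
    Returns (weights, probs, resp): mixing proportions, per-class
    success probabilities, and the class membership probabilities
    computed in the last E-step.
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    eps = 1e-9  # keeps logarithms and divisions finite
    weights = [1.0 / r] * r
    # Hypothetical starting point: random probabilities away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(m)] for _ in range(r)]
    resp = []
    for _ in range(n_iter):
        # E-step: probability of each class given each object.
        resp = []
        for x in data:
            logp = []
            for k in range(r):
                lp = math.log(max(weights[k], eps))
                for j in range(m):
                    p = min(max(probs[k][j], eps), 1.0 - eps)
                    lp += math.log(p) if x[j] else math.log(1.0 - p)
                logp.append(lp)
            top = max(logp)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate parameters from the weighted counts.
        for k in range(r):
            nk = max(sum(row[k] for row in resp), eps)
            weights[k] = nk / n
            probs[k] = [
                sum(resp[i][k] * data[i][j] for i in range(n)) / nk
                for j in range(m)
            ]
    return weights, probs, resp


if __name__ == "__main__":
    # Two artificial response patterns, 20 objects each.
    sample = [[1, 1, 1, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1]] * 20
    w, p, z = fit_bernoulli_mixture(sample, r=2)
    print("mixing proportions:", w)
```

Each object can then be assigned to the class with the largest membership probability, so no separate similarity measure between objects has to be chosen. The class labels themselves are arbitrary: a different seed may swap which class is numbered first.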