Abstract
Cluster analysis is reformulated as a problem of estimating the parameters of a mixture of multivariate distributions. The maximum-likelihood theory and numerical solution techniques are developed for a fairly general class of distributions. The theory is applied to mixtures of multivariate normals ("NORMIX") and mixtures of multivariate Bernoulli distributions ("Latent Classes"). The feasibility of the procedures is demonstrated by two examples of computer solutions for normal mixture models of the Fisher Iris data and of artificially generated clusters with unequal covariance matrices.
This paper is addressed to the problem which has been variously called cluster analysis, Q-analysis, typology, grouping, clumping, classification, numerical taxonomy, and unsupervised pattern recognition. The variety of nomenclature may be due to the importance of the subject in such diverse fields as psychology, biology, signal detection, artificial intelligence, and information retrieval. Perhaps this multiplicity of names also indicates a certain confusion in the basic definition of the problem. This paper attempts to reformulate cluster analysis, with a resulting improvement in conceptual simplicity and statistical rigor. In this formulation cluster analysis will be viewed as a form of mixture analysis for finite mixtures of multivariate distributions.
In clustering methodology, one is generally given a sample of N objects or individuals, each of which is measured on m variables. From this information alone, one must devise a classification scheme for grouping the objects into r classes. The number of classes and the characteristics of the classes are to be determined. If all the objects in a given class were identical to one another, the problem would be simple. However, in the usual situation the objects in a class differ on most or all of the measures.
Most cluster analysis procedures try to measure the "similarity" between any two objects, and then try to group the objects so as to maximize within-class similarity. Unfortunately, the appropriate measure of similarity is a subject of some controversy. It would be desirable to derive a cluster analysis system without arbitrary assumptions about similarity. Such a system will be presented in this paper.
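The mixture view referred to above can be written down compactly. What follows is a minimal sketch in standard finite-mixture notation, not notation taken from this paper: the symbols x_k, \lambda_s, g_s, \theta_s, and L are assumptions introduced for illustration, while N, m, and r are the sample size, number of variables, and number of classes defined above.

```latex
% Assumed notation: x_k is the m-vector of measurements on object k,
% \lambda_s is the proportion of class s, and g_s(x;\theta_s) is the
% density of class s with parameters \theta_s.

% Density of an observation under a mixture of r classes:
f(x) = \sum_{s=1}^{r} \lambda_s \, g_s(x;\theta_s),
\qquad \lambda_s \ge 0, \quad \sum_{s=1}^{r} \lambda_s = 1 .

% Log-likelihood of the N sampled objects:
L = \sum_{k=1}^{N} \log f(x_k) .

% Probability that object k belongs to class s, given its measurements:
P(s \mid x_k) = \frac{\lambda_s \, g_s(x_k;\theta_s)}{f(x_k)} .
```

Under this sketch, an object would be assigned to the class with the largest membership probability, so no pairwise similarity measure has to be chosen in advance.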