Some Classification Problems with Multivariate Qualitative Data

Abstract
This paper deals with the problem of assigning specimens to one of two or more universes when the measurements on each specimen are qualitative, each taking a small number of states. After presenting the optimum rule for classifying the specimens, three problems are considered. The construction of the rule requires initial data on a number of specimens known to be classified correctly. Standard classification theory assumes that these initial samples are infinite in size, although in practice they may be only moderate. The principal effects of the finite sizes of the initial samples are that the probability of misclassification of the rule derived from them is underestimated and that this rule may be inferior to the theoretical optimum rule that we could construct if we had infinite samples. Methods are proposed for obtaining reasonably unbiased estimates of the performance of rules derived from finite samples and for estimating the difference between the actual and the theoretical optimum probability of misclassification. It appears that initial samples of size 50 from each of two universes should be adequate if there are not more than 8 multivariate states. With a greater number of states, larger sample sizes are needed to ensure that the actual rule will be almost as good as the theoretical optimum. If most of the variates are qualitative but a few are continuous, one possibility is to transform the continuous variates into qualitative ones, particularly since classification is easier with qualitative than with continuous variates. Asymptotic results are obtained for the best points of partition and the probabilities of misclassification when a large number of independent normal variates are partitioned to form qualitative variates. For qualitative variates with 2, 3, 4, 5 and 6 states the relative efficiencies are 64, 81, 88, 92 and 94 percent respectively.
Computations for small numbers of variates show that the asymptotic points of partition remain satisfactory, although the relative efficiencies are in general lower. The optimum rule depends on the relative frequencies with which specimens to be classified present themselves from the different universes. Initial estimates of these frequencies must be made in order to set up the rule. With two universes, maximum likelihood estimates of these frequencies are given, computed from the data for specimens that have been classified by the rule. These estimates enable the rule to be improved if the initial estimates differ from the frequencies that apply when the rule is being used.
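The effect of finite initial samples can be illustrated with a small numerical sketch. It assumes that the optimum rule takes the usual form of assigning each multivariate state to the universe for which the product of relative frequency and state probability is largest; the abstract does not spell the rule out, so this form is an assumption. The function names, state probabilities and sample counts below are hypothetical and chosen only for illustration.

```python
def optimum_rule(priors, state_probs):
    """For each multivariate state, pick the universe that maximises
    prior * P(state | universe). Assumed form of the rule, not quoted from the paper."""
    n_states = len(state_probs[0])
    return [
        max(range(len(priors)), key=lambda i: priors[i] * state_probs[i][s])
        for s in range(n_states)
    ]


def misclassification_probability(rule, priors, state_probs):
    """Probability that a specimen is assigned to a universe other than its own,
    evaluated under the given priors and state probabilities."""
    correct = sum(
        priors[rule[s]] * state_probs[rule[s]][s] for s in range(len(rule))
    )
    return 1.0 - correct


def estimate_state_probs(counts):
    """Relative frequency of each state within each initial sample."""
    return [[c / sum(row) for c in row] for row in counts]


# Hypothetical example: two universes, four multivariate states.
priors = [0.5, 0.5]
true_probs = [[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.2, 0.3, 0.4]]
sample_counts = [[10, 4, 5, 1],   # initial sample of 20 from universe 0
                 [2, 6, 3, 9]]    # initial sample of 20 from universe 1

best_rule = optimum_rule(priors, true_probs)
sample_probs = estimate_state_probs(sample_counts)
sample_rule = optimum_rule(priors, sample_probs)

# Theoretical optimum: best rule evaluated on the true probabilities.
optimum_error = misclassification_probability(best_rule, priors, true_probs)
# Apparent error: sample rule evaluated on the same samples that built it.
apparent_error = misclassification_probability(sample_rule, priors, sample_probs)
# Actual error: sample rule evaluated on the true probabilities.
actual_error = misclassification_probability(sample_rule, priors, true_probs)

print(f"optimum {optimum_error:.2f}, apparent {apparent_error:.2f}, "
      f"actual {actual_error:.2f}")
```

In this made-up example the apparent error of the sample-based rule (about 0.25) falls below the theoretical optimum (about 0.30), while its actual error (about 0.40) lies above it, which is the pattern of underestimation and inferiority described above.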