Estimating the Number of Classes via Sample Coverage

Abstract
Assume that a random sample is drawn from a population with an unknown number of classes and possibly unequal class probabilities. A nonparametric technique is proposed to estimate the number of classes using the idea of sample coverage, which is defined as the sum of the cell probabilities of the observed classes. Because the expected sample coverage can be well estimated, we are motivated to investigate its role in the estimation of the number of classes. This work generalizes the result of Esty to a nonparametric approach and extends that of Darroch and Ratcliff to incorporate heterogeneity in the class probabilities. The coefficient of variation of the class sizes is shown to play an important role in the recommended estimation procedures. The performance of the proposed estimators is investigated by means of Monte Carlo simulations.
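
To make the definition of sample coverage concrete, the following minimal Python sketch computes it for a toy population and compares it with an estimate obtained from the sample alone. The function names, the toy probabilities, and the sample are hypothetical and chosen only for illustration. The coverage estimate used here is the standard Good-Turing form, 1 - f1/n, where f1 is the number of classes observed exactly once and n is the sample size. The last function divides the number of observed classes by the estimated coverage, a simple adjustment that is reasonable when classes are roughly equally likely; it is not presented as the estimator recommended in this paper, which additionally accounts for heterogeneity.

```python
from collections import Counter

def true_coverage(sample, probs):
    """Sample coverage: sum of the probabilities of the classes seen in the sample.
    Requires the true class probabilities, so it is only computable in simulations."""
    return sum(probs[c] for c in set(sample))

def estimated_coverage(sample):
    """Good-Turing coverage estimate: 1 - f1/n, with f1 = number of singleton classes."""
    n = len(sample)
    counts = Counter(sample)
    f1 = sum(1 for k in counts.values() if k == 1)
    return 1.0 - f1 / n

def coverage_adjusted_count(sample):
    """Observed classes divided by estimated coverage (illustrative assumption:
    a simple adjustment suited to roughly equal class probabilities)."""
    d = len(set(sample))
    c_hat = estimated_coverage(sample)
    if c_hat == 0.0:
        return float("inf")  # every observation is a singleton; no usable estimate
    return d / c_hat

# Hypothetical toy population with four classes; class "d" is never observed.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
sample = ["a", "a", "b", "a", "c", "b", "a", "b"]

print(true_coverage(sample, probs))     # coverage of the observed classes a, b, c
print(estimated_coverage(sample))       # estimate from the sample alone
print(coverage_adjusted_count(sample))  # adjusted count of classes
```

In this toy sample only class "c" is a singleton, so the estimated coverage is 1 - 1/8, while the true coverage of the observed classes is about 0.95; the adjusted count accordingly exceeds the three classes actually observed.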