Large Sample Theory of Empirical Distributions in Biased Sampling Models

Abstract
Vardi (1985a) introduced an $s$-sample model for biased sampling, gave conditions that guarantee the existence and uniqueness of the nonparametric maximum likelihood estimator (NPMLE) $\mathbb{G}_n$ of the common underlying distribution $G$, and discussed numerical methods for calculating the estimator. Here we examine the large sample behavior of the NPMLE $\mathbb{G}_n$, including results on uniform consistency of $\mathbb{G}_n$, convergence of $\sqrt{n} (\mathbb{G}_n - G)$ to a Gaussian process, and asymptotic efficiency of $\mathbb{G}_n$ as an estimator of $G$. The proofs are based on convexity arguments and on recent results for empirical processes indexed by sets and functions. We also give a careful proof of identifiability of the underlying distribution $G$ under connectedness of a certain graph $\mathbf{G}$. Examples and applications include length-biased sampling, stratified sampling, "enriched" stratified sampling, "choice-based" sampling in econometrics, and "case-control" studies in biostatistics. A final section discusses design issues and further problems.
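
For orientation, the $s$-sample biased sampling model named in the abstract can be sketched as follows. This is a sketch in standard notation only: the weight functions $w_i$ and the normalizing constants $W_i(G)$ are symbols assumed here and are not defined in the abstract itself. The $i$th sample is drawn from the biased distribution

$$F_i(x) = \frac{1}{W_i(G)} \int_{-\infty}^{x} w_i(y)\, dG(y), \qquad W_i(G) = \int w_i(y)\, dG(y), \qquad i = 1, \dots, s,$$

where each $w_i \ge 0$ is a known weight function with $0 < W_i(G) < \infty$. Under this notation, length-biased sampling would correspond to $s = 1$ with $w_1(x) = x$, and stratified sampling to indicator weights $w_i(x) = 1_{D_i}(x)$ for strata $D_1, \dots, D_s$ (the sets $D_i$ are likewise illustrative names).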