Abstract
Most prevalence studies using health records are likely to miss some affected cases and thus to be biased toward underestimates. An adjustment for underascertainment is often necessary, but to our knowledge no validity studies of proposed methods have been done. Using a data set on Down syndrome which gives distributions by five different sources, the number listed in, say, source X, i.e., the known “prevalence” (KP) of those in X, was compared with estimates of this prevalence derived (using only information on the intersections of X with other sources) from several different models: 1) truncated β-binomial or Skellam (TS); 2) truncated binomial (TB); 3) Bernoulli census-independent sources (IS); 4) Bernoulli census-merged sources (MS); and 5) log-linear (LL). Up to three of the following assumptions are required by at least one of the models: (I) for each specific source X, each case in the population has the same probability of being listed by that source; (II) there is no variation between sources in these probabilities, i.e., the ascertainment probability is the same for all sources; and (III) the sources are independent. The TB model makes all three assumptions, the TS model makes assumptions II and III, the IS model makes assumptions I and III, and the MS and LL models make only assumption I. Estimates derived from the TS model must always be greater than or equal to those from the TB model, and these in turn must be greater than or equal to those from the IS model. No such systematic relationship holds for estimates from the MS or LL models with regard to the others. Results by source were: mental hygiene records: KP = 263, estimates (as % of KP) were TS = 85%, TB = 84%, IS = 79%, MS = 81%, LL = 87%; schools: KP = 252, TS = 108%, TB = 95%, IS = 90%, MS = 95%, LL = 104%; hospital records: KP = 215, TS = 108%, TB = 108%, IS = 102%, MS = 105%, LL = 97%; obstetrical records: KP = 183, TS = 110%, TB = 109%, IS = 106%, MS = 121%, LL = 103%.
(Department of Health Records: KP = 36, no estimates made.) The estimates derived from the log-linear models had in general the best agreement with the values of the known prevalences. In addition, for each source the 95% confidence intervals of the log-linear estimates include the known prevalence. The truncated β-binomial (Skellam) model (TS) was the only other model for which all confidence intervals include the known prevalences, but these intervals are so wide and so asymmetric around the known prevalences as to render this approach much less attractive. Thus, in general, the log-linear model, of those considered, appears preferable for prevalence estimation. The analyses presented here illustrate the need for and value of collection and reporting of data by all source intersections in multiple source investigations.
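
The ordering between the TB and IS estimates stated above can be illustrated with a minimal sketch. It assumes the standard maximum-likelihood forms of the two models, which may differ in detail from the procedures used in the study. With k independent sources listing n_1, ..., n_k cases and M distinct cases observed overall, the IS estimate of the total N solves 1 - M/N = (1 - n_1/N)...(1 - n_k/N). The TB estimate solves the same equation with every n_i replaced by their mean, because TB also assumes equal ascertainment probabilities across sources. The function names and all counts below are hypothetical and are not taken from the Down syndrome data.

```python
from math import prod


def solve_total(effective_counts, distinct):
    """Solve 1 - distinct/N = prod(1 - n_i/N) for N >= distinct by bisection.

    effective_counts: number of cases listed by each source (hypothetical input).
    distinct: number of distinct cases listed by at least one source.
    """
    if max(effective_counts) > distinct:
        raise ValueError("a source cannot list more cases than the distinct total")
    if sum(effective_counts) <= distinct:
        raise ValueError("no overlap between sources; total is not estimable")

    def g(n_total):
        # Positive below the root, negative above it.
        missed = prod(1 - n / n_total for n in effective_counts)
        return missed - (1 - distinct / n_total)

    lo, hi = float(distinct), 2.0 * distinct
    while g(hi) > 0:          # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):      # plain bisection
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def is_estimate(source_counts, distinct):
    """Independent-sources estimate: source probabilities may differ."""
    return solve_total(list(source_counts), distinct)


def tb_estimate(source_counts, distinct):
    """Truncated-binomial estimate: one common source probability."""
    k = len(source_counts)
    mean_count = sum(source_counts) / k
    return solve_total([mean_count] * k, distinct)


if __name__ == "__main__":
    counts, distinct = [60, 50, 40], 100   # invented numbers for illustration
    print("IS:", round(is_estimate(counts, distinct), 1))
    print("TB:", round(tb_estimate(counts, distinct), 1))
```

With two sources the IS equation reduces to the familiar n_1 n_2 / n_12 form, where n_12 is the number of cases listed by both sources. Because the product of the terms (1 - n_i/N) is never larger than the k-th power of their mean, the TB root cannot fall below the IS root, which matches the ordering described in the abstract.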