Determining the value of additional surrogate exposure data for improving the estimate of an odds ratio

Abstract
We consider the design of both cohort and case‐control studies in which an initial (‘stage 1’) sample provides complete data on an error‐free disease indicator (D), a correct (‘gold‐standard’) dichotomous exposure measurement (X) and an error‐prone exposure measurement (Z). We calculate the amount of additional information on the odds ratio relating D to X that one can obtain from a second (‘stage 2’) sample of measurements only on D and Z. If one allows for differential measurement error in Z, there is often little advantage in having more than four times as much data in stage 2 as in stage 1. If a non‐differential measurement error model is reasonable, larger amounts of stage 2 data can be useful. Simulations indicate that stage 1 samples of modest size (50 cases in case‐control studies and 50 failures in cohort studies) yield sufficiently reliable estimates of the needed parameters to assist in determining an appropriate size for the stage 2 sample. These ideas apply in settings either where the amount of stage 1 data is limited and fixed by external constraints or where one has gathered stage 1 data in advance to avoid collecting superfluous stage 2 data.
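
To illustrate why measurements on Z alone carry only partial information about the odds ratio relating D to X, the following minimal sketch computes the odds ratio one would expect to observe when X is replaced by a non‐differentially misclassified Z. It is not the method of this paper: the counts, the sensitivity and specificity values, and the function names are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: all counts and error rates below are hypothetical.

def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds ratio from the four cells of a 2x2 disease-by-exposure table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)

def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected counts classified as exposed / unexposed by the error-prone Z."""
    z_pos = sensitivity * exposed + (1 - specificity) * unexposed
    z_neg = (1 - sensitivity) * exposed + specificity * unexposed
    return z_pos, z_neg

# Hypothetical true exposure (X) counts: (exposed, unexposed).
cases = (60, 40)
controls = (30, 70)
true_or = odds_ratio(*cases, *controls)

# Non-differential error: the same sensitivity and specificity
# apply to cases and controls (assumed values).
sens, spec = 0.8, 0.9
cases_z = misclassify(*cases, sens, spec)
controls_z = misclassify(*controls, sens, spec)
observed_or = odds_ratio(*cases_z, *controls_z)

print(f"odds ratio based on X: {true_or:.2f}")
print(f"odds ratio based on Z: {observed_or:.2f}")
```

With these assumed values the odds ratio based on Z lies between 1 and the odds ratio based on X, which is the familiar attenuation under non‐differential misclassification of a dichotomous exposure. Correcting for it requires the misclassification rates, which can be estimated only from data in which both X and Z are observed, that is, from stage 1.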