Abstract
There has been growing emphasis on written examinations that assess physicians' and teachers' ability to make decisions in the absence of protocols for action—a crucial aspect of professional competence. A characteristic of such tests is controversy, even among experts, about what constitutes the correct response to some of the items. This paper examined the impact of variability in answer keys constructed using the aggregate method on total measurement error. Results indicated that using several scorers contributed sizably to reducing measurement error, and that scorers or groups of scorers who each developed the answer key for a subset of items produced better results than a single group that developed the answer key for all items. Implications of scoring judgment tests by the aggregate method are discussed for teacher and physician certification.