PURPOSE: To compare the psychometric properties of checklists, global rating scales preceded by a checklist, and global rating scales alone in assessing surgery residents' performance on an OSCE-like technical skills examination.

METHOD: In 1996, 53 general surgery residents with one to six years of postgraduate training participated in a performance-based examination of technical skills consisting of eight 15-minute stations (bench-model simulations of operative procedures in general surgery). At each station, two qualified surgeons scored each resident: one used a task-specific checklist (C) followed by a global rating scale (Gc), and the other used a global rating scale only (G).

RESULTS: Inter-station reliabilities measured by Cronbach's alpha were .79 for C, .89 for Gc, and .85 for G. A series of multiple regressions predicting level of training from test scores revealed an R² of .584 for C alone, which increased to .711 when Gc was entered after C (p < .001) and to .704 when G was entered after C (p < .001). However, R² was .711 for Gc alone and .704 for G alone, and neither changed significantly when C was added to the prediction (p > .10). The R² for Gc and G together predicting level of training (.725) was not significantly greater than that of either Gc or G alone. A very similar pattern of results was seen when C, Gc, and G were used to predict independent evaluations of the operative outcomes.

CONCLUSIONS: Global rating scales scored by experts showed higher inter-station reliability, better construct validity, and better concurrent validity than did checklists. Further, preceding the global rating scale with a checklist did not improve its reliability or validity over that of the global rating scale used alone. These results suggest that global rating scales administered by experts are a more appropriate summative measure when assessing candidates on performance-based examinations.
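
The two statistics reported above, Cronbach's alpha across stations and the change in R² when a second score is entered into a regression, can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names (`cronbach_alpha`, `r_squared`, `f_change`) are hypothetical, the data are synthetic, and the F test for the R² increment is the standard nested-model test, which is assumed here because the abstract does not state which test was used.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (n_candidates, n_stations) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    station_var = scores.var(axis=0, ddof=1).sum()   # sum of per-station variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of candidates' total scores
    return k / (k - 1) * (1 - station_var / total_var)

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added here)."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def f_change(r2_reduced, r2_full, n, p_full, p_added):
    """Nested-model F statistic for the R^2 gained by adding p_added predictors."""
    return ((r2_full - r2_reduced) / p_added) / ((1 - r2_full) / (n - p_full - 1))

# Synthetic illustration only: 53 candidates, 8 stations, made-up noise levels.
rng = np.random.default_rng(0)
n, k = 53, 8
year = rng.integers(1, 7, size=n)                        # training level, 1 to 6
stations = year[:, None] + rng.normal(0, 1.5, (n, k))    # hypothetical station scores
checklist = stations.mean(axis=1) + rng.normal(0, 1.0, n)
global_rating = stations.mean(axis=1) + rng.normal(0, 0.5, n)

alpha = cronbach_alpha(stations)
r2_c = r_squared(checklist, year)
r2_both = r_squared(np.column_stack([checklist, global_rating]), year)
print("alpha:", alpha)
print("R^2 C alone:", r2_c, "R^2 C + G:", r2_both)
print("F for change:", f_change(r2_c, r2_both, n, p_full=2, p_added=1))
```

Under the same assumptions (n = 53, one predictor added to a one-predictor model), calling `f_change(0.584, 0.711, 53, 2, 1)` applies this test to the R² values reported for C alone and for C followed by Gc.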