Station-length requirements for reliable performance-based examination scores

Abstract
This study directly compared the generalizability of medical students' performance scores under systematically varied station times in two surgery end-of-clerkship performance-based examinations. The participants were 36 third-year students randomly assigned to the first two rotations of the core surgery clerkship during 1991-92 at Southern Illinois University School of Medicine. The students rotated through a 12-station examination that employed standardized patients (SPs). In the first rotation, each student took six five-minute stations and six ten-minute stations; in the second rotation, the time lengths were reversed for the same stations. The students' total scores were based on (1) subscores from checklists completed by the SPs and (2) subscores on the students' written responses to short questions about each station (these responses were provided at station couplets that lasted five minutes regardless of station length). Generalizability coefficients were computed from the pooled rotation results to estimate the reliabilities of scores from the two station lengths. Generalizability decreased in the ten-minute stations, mostly because of reduced variability among students' performances. The checklist subscores accounted for most of this decrease, while couplet subscores remained stable across station lengths. Thus, the longer station length actually decreased the generalizability of the scores by decreasing the variability among students' performances; allocating different times to stations can affect the score reliability, as well as the overall testing time, of performance-based examinations.
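The generalizability analysis described above can be illustrated in code. The sketch below is not the authors' actual analysis; it assumes a simple fully crossed students x stations design with one score per cell, estimates the variance components from the two-way ANOVA mean squares, and returns the generalizability coefficient for relative decisions. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def g_coefficient(scores):
    """Generalizability coefficient for a crossed students x stations
    design with one observation per cell (relative decisions).

    scores: 2-D array, rows = students, columns = stations.
    """
    n_p, n_s = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    station_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication
    ms_p = n_s * np.sum((person_means - grand) ** 2) / (n_p - 1)
    resid = scores - person_means[:, None] - station_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_s - 1))

    # Estimated variance components (negative estimates truncated to 0)
    var_p = max((ms_p - ms_res) / n_s, 0.0)  # student (universe-score) variance
    var_res = ms_res                          # station-by-student interaction + error

    # G coefficient for a test composed of n_s stations
    return var_p / (var_p + var_res / n_s)
```

The key point the study turns on is visible in the formula: if station length reduces the spread of student performances, the universe-score variance shrinks relative to the residual variance, and the coefficient drops even though nothing else about the examination has changed.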