Current elbow-scoring systems are based on the observer-derived assessment of a variety of clinical and functional criteria, which are scored separately and then aggregated. The aggregate score is then assigned a categorical ranking that ranges from excellent to poor. The developers of different elbow-scoring systems have chosen different outcome criteria, assigned different weights to each criterion, and accorded different ranges of values to each categorical ranking.

Five different elbow-scoring systems (the Mayo elbow-performance index and the systems of Broberg and Morrey, Ewald et al., The Hospital for Special Surgery, and Pritchard) were used to evaluate the same group of patients. The validity of the scoring systems was determined with use of visual-analog scales for the assessment of pain and function, patient-derived and physician-derived ratings of the severity of impairment of the elbow, and two functional questionnaires completed by the patient (the Disabilities of the Arm, Shoulder and Hand questionnaire and the Modified American Shoulder and Elbow Surgeons patient self-evaluation form). The study sample consisted of sixty-nine patients who had sought treatment at one of two tertiary referral clinics because of problems related to the elbow. Pearson product-moment correlation coefficients were used to compare the raw aggregate scores, and kappa statistics were used to determine the level of agreement among the categorical rankings (excellent, good, fair, and poor).

Examination of the five scoring systems revealed a remarkable lack of concordance with regard to the aspects of elbow function that were assessed. Good correlation was observed when the systems were compared on the basis of raw scores (Pearson product-moment correlation coefficients, 0.79 to 0.90), but only slight-to-moderate agreement was noted when the systems were compared on the basis of categorical rankings (quadratic weighted kappa coefficients, 0.18 to 0.49).
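The contrast between the two statistics can be illustrated with a minimal Python sketch. All of the data below are hypothetical and are not taken from the study: eight invented patients, two invented systems ("A" and "B") whose raw scores differ by a constant eight points, and an invented set of cutoffs for the categorical rankings. The helper names (`pearson`, `quadratic_weighted_kappa`, `categorize`) are likewise assumptions made for illustration. The sketch shows only that raw scores can be perfectly correlated while the resulting categorical rankings agree less well.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def quadratic_weighted_kappa(ranks_1, ranks_2, k):
    """Quadratic weighted kappa for two lists of ordinal ranks in 0..k-1."""
    n = len(ranks_1)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ranks_1, ranks_2):
        observed[a][b] += 1
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the squared distance between ranks.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


def categorize(score, cutoffs):
    """Map a raw score to a rank: 0 = excellent, 1 = good, 2 = fair, 3 = poor."""
    for rank, lower_bound in enumerate(cutoffs):
        if score >= lower_bound:
            return rank
    return len(cutoffs)


# Hypothetical raw scores: system B rates every patient 8 points lower than A.
scores_a = [95, 88, 82, 76, 70, 64, 58, 50]
scores_b = [s - 8 for s in scores_a]

# Hypothetical lower bounds for excellent, good, and fair (not from any real system).
cutoffs = (90, 75, 60)
ranks_a = [categorize(s, cutoffs) for s in scores_a]
ranks_b = [categorize(s, cutoffs) for s in scores_b]

r = pearson(scores_a, scores_b)
kappa = quadratic_weighted_kappa(ranks_a, ranks_b, k=4)
print(f"Pearson r = {r:.2f}, quadratic weighted kappa = {kappa:.2f}")
```

In this invented example the raw scores are related by a perfect linear shift, so the Pearson coefficient is at its maximum, yet several patients fall into different categories under the two systems and the weighted kappa is lower. Differences in the cutoffs themselves, as among the systems studied here, would be expected to have a similar effect.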
Validity testing showed the system of Ewald et al. and the Mayo elbow-performance index to be the most discriminating, the system of Pritchard to be the least discriminating, and the system of The Hospital for Special Surgery and the system of Broberg and Morrey to be intermediate. The scores determined with the elbow-scoring systems demonstrated only moderate correlation with the score for function on the visual-analog scale (Pearson product-moment correlation coefficients, 0.44 to 0.66), whereas those derived from the functional questionnaires completed by the patient demonstrated moderate-to-good correlation with the score for function (Pearson product-moment correlation coefficients, 0.72 and 0.80).

CLINICAL RELEVANCE: We observed a remarkable lack of agreement when five different elbow-scoring systems were used to determine categorical rankings for the same cohort of patients. The correlations between the raw aggregate scores were better. On the basis of these findings, we believe that outcomes should be expressed as raw scores rather than as categorical rankings. We also found that scores derived from patient-completed functional questionnaires correlated more closely with perceived functional loss than did those determined with aggregate elbow-scoring systems. It must be recognized that comparisons between studies that are based on different scoring systems are not valid and that the categorical rankings of different systems are not interchangeable. The outcome of therapies designed for the treatment of the elbow should be determined on the basis of a patient-derived assessment of function, a clinical examination, and an assessment of pain.