Assessing the Reliability of Two Toxicity Scales: Implications for Interpreting Toxicity Data

Abstract
Background: The toxicity of a given cancer therapy is an important end point in clinical trials examining the potential costs and benefits of that therapy. Treatment-related toxicity is conventionally measured with one of several toxicity criteria grading scales, even though the reliability and validity of these scales have not been established.

Purpose: We determined the reliability of the National Cancer Institute of Canada Clinical Trials Group (NCIC-CTG) expanded toxicity scale and the World Health Organization (WHO) standard toxicity scale by use of a clinical simulation of actual patients.

Methods: Seven experienced data managers each interviewed 12 simulated patients and scored their respective acute toxic effects. Inter-rater agreement (agreement between multiple raters of the same case) was calculated using the kappa (κ) statistic across all seven randomly assigned raters for each of 18 toxicity categories (13 NCIC-CTG and five WHO categories). Intra-rater agreement (agreement within the same rater on one case rated on separate occasions) was calculated using κ over repeated cases (where raters were blinded to the repeated nature of the subjects). Proportions of agreement (estimate of the probability of two randomly selected raters assigning the same toxicity grade to a given case) were also calculated for inter-rater agreement. Since minor lack of agreement might have adversely affected these statistics of agreement, both κ and proportion of agreement analyses were repeated for the following condensed grading categories: none (0) versus low-grade (1 or 2) versus high-grade (3 or 4) toxicity present.

Results: Modest levels of inter-rater reliability were demonstrated in this study with κ values that ranged from 0.50 to 1.00 in laboratory-based categories and from −0.04 to 0.82 for clinically based categories. Proportions of agreement for clinical categories ranged from 0.52 to 0.98. Condensing the toxicity grades improved statistics of agreement, but substantial lack of agreement remained (κ range, −0.04 to 0.82; proportions of agreement range, 0.67 to 0.98).

Conclusions: Experienced data managers, when interviewing patients, draw varying conclusions regarding toxic effects experienced by such patients. Neither the NCIC-CTG expanded toxicity scale nor the WHO standard toxicity scale demonstrated a clear superiority in reliability, although the breadth of toxic effects recorded differed. (J Natl Cancer Inst 85: 1138–1148, 1993)
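
The agreement statistics described in the Methods can be illustrated with a minimal sketch. It assumes a Fleiss-type multi-rater kappa, which the abstract does not specify, and takes the proportion of agreement to be the fraction of rater pairs that assign the same grade to a case, averaged over cases. The function names and the example grades are hypothetical and are not data from the study.

```python
from collections import Counter


def agreement_stats(ratings):
    """Return (proportion_of_agreement, kappa) for a list of cases.

    ratings: one list per case, holding the grade given by each rater.
    Every case must be scored by the same number of raters.
    Assumption: kappa is computed in the Fleiss (multi-rater) form.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    ordered_pairs = n_raters * (n_raters - 1)

    category_totals = Counter()
    per_case_agreement = []
    for case in ratings:
        counts = Counter(case)
        category_totals.update(counts)
        # Fraction of ordered rater pairs that gave this case the same grade.
        agreeing = sum(c * (c - 1) for c in counts.values())
        per_case_agreement.append(agreeing / ordered_pairs)

    # Observed agreement: chance that two randomly chosen raters agree.
    p_observed = sum(per_case_agreement) / n_cases
    # Agreement expected by chance, from the overall grade frequencies.
    total = n_cases * n_raters
    p_expected = sum((c / total) ** 2 for c in category_totals.values())

    if p_expected >= 1.0:
        # Only one grade was ever used, so kappa is undefined.
        return p_observed, float("nan")
    return p_observed, (p_observed - p_expected) / (1.0 - p_expected)


def condense(grade):
    """Map a 0-4 grade to none / low-grade (1 or 2) / high-grade (3 or 4)."""
    if grade == 0:
        return "none"
    return "low" if grade <= 2 else "high"


# Hypothetical grades: 4 cases, 3 raters each (not data from the study).
example = [[0, 0, 0], [1, 2, 1], [3, 4, 3], [2, 2, 2]]
full_agreement, full_kappa = agreement_stats(example)

condensed = [[condense(g) for g in case] for case in example]
condensed_agreement, condensed_kappa = agreement_stats(condensed)
```

In this hypothetical example every disagreement is between adjacent grades within the same condensed band, so condensing removes it entirely. The study reports that substantial disagreement remained after condensing, which means its raters also differed across bands.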