A tutorial on confidence intervals for proportions in diagnostic radiology.

Abstract
Research in diagnostic radiology often aims to establish the safety and the accuracy of a new procedure or to compare it with other procedures. Frequently, the diagnostic performance of a test can be summarized by proportions such as accuracy, sensitivity, and specificity. Safety may be reflected by the proportion of patients experiencing unpleasant or adverse effects. The confidence interval is useful for summarizing data on proportions. However, the confidence intervals presented in most elementary statistics texts are inappropriate for diagnostic research.
Estimates of proportions can differ considerably from the actual proportions. A proportion based on a large number of subjects is obviously more accurate than one based on a small number, but just how accurate are estimated proportions? News reports of public opinion surveys, which typically include thousands of subjects, frequently cite an accuracy of ±3% for the percentages that they estimate. Few studies in diagnostic radiology include nearly so many subjects, which suggests an accuracy that is far lower than the ±3% of the ordinary public opinion survey.
One way to indicate the precision of an estimated proportion is to give a range of values that is consistent with the data. As an example, suppose that a test correctly identified nine of 10 patients with a particular diagnosis, whereas in another series it correctly identified 90 of 100 patients. In each series, 90% of the cases were correctly classified, but it is obvious that the latter data are consistent with a much smaller range of values. According to one method (described later), the range from 55% to 99.7% is consistent with nine of 10, and the range from 82% to 95% is consistent with 90 of 100. In many situations, the conclusions drawn from one of these series would differ from those drawn from the other, even though 90% of the cases studied were correctly classified in each.
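The ranges quoted for nine of 10 and 90 of 100 can be checked numerically. The sketch below assumes that the method in question is the "exact" binomial (Clopper-Pearson) interval at the 95% level; the text does not name the method at this point, so that choice is an assumption, as are the function names `binom_cdf`, `exact_interval`, and `survey_margin`. The last function uses the usual normal approximation at a proportion of 0.5 to show how the margin quoted for opinion surveys depends on the number of subjects.

```python
from math import comb, sqrt


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_interval(x, n, level=0.95):
    """Clopper-Pearson interval for x successes in n trials (assumed method)."""
    alpha = 1 - level

    def solve(f, target):
        # f(p) decreases as p goes from 0 to 1; bisect until f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: the p at which P(X >= x) = alpha / 2.
    lower = 0.0 if x == 0 else solve(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which P(X <= x) = alpha / 2.
    upper = 1.0 if x == n else solve(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


def survey_margin(n, z=1.96):
    """Worst-case (p = 0.5) normal-approximation margin of error."""
    return z * sqrt(0.25 / n)


if __name__ == "__main__":
    for x, n in [(9, 10), (90, 100)]:
        lo, hi = exact_interval(x, n)
        print(f"{x}/{n}: {100 * lo:.1f}% to {100 * hi:.1f}%")
    for n in (10, 100, 1000):
        print(f"n = {n}: about ±{100 * survey_margin(n):.1f}%")
```

Under these assumptions the limits for nine of 10 come out near 55% and 99.7%, and those for 90 of 100 near 82% and 95%, in line with the ranges quoted in the text. The normal approximation in `survey_margin` is included only to illustrate the effect of sample size; it should not be relied on for series as small as those typical of diagnostic studies.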
The comparison of two proportions suffers from a related difficulty. When one diagnostic test correctly identifies nine of 10 cases studied and another correctly identifies the same nine of the 10, the observed difference in the proportion correctly identified is exactly zero. At an intuitive level, one might feel that these results demonstrate equal accuracy of the two tests; if one has lower cost or greater convenience, it might be used in preference to the other. However, it is possible that if both tests were applied to a larger series, one test would correctly identify many more patients than the other. It would be useful to know how large a difference between the two tests is consistent with the data. According to one method, the difference of proportions could range from -.257 to +.257. In other words, the data are consistent with situations in which one test correctly identifies more patients than the other by a margin of 25.7%. Thus, it would generally be inappropriate to consider the two tests as being even approximately equal.
Confidence intervals are simple statistical tools for summarizing data on proportions or differences of proportions