DSM-5: How Reliable Is Reliable Enough?
by Helena Chmura Kraemer, David J. Kupfer, Diana E. Clarke, William E. Narrow, and Darrel A. Regier
American Journal of Psychiatry 2012 169:13-15.
… We previously commented in these pages on the need for field trials. Our purpose here is to set out realistic expectations concerning that assessment. In setting those expectations, one contentious issue is whether it is important that the prevalence for diagnoses based on proposed criteria for DSM-5 match the prevalence for the corresponding DSM-IV diagnoses. However, to require that the prevalence remain unchanged is to require that any existing difference between true and DSM-IV prevalence be reproduced in DSM-5. Any effort to improve the sensitivity of DSM-IV criteria will result in higher prevalence rates, and any effort to improve the specificity of DSM-IV criteria will result in lower prevalence rates. Thus, there are no specific expectations about the prevalence of disorders in DSM-5. The evaluations primarily address reliability.
… Reliability will be assessed using the intraclass kappa coefficient κI. For a categorical diagnosis with prevalence P, among subjects with an initial positive diagnosis, the probability of a second positive diagnosis is κI+P[1–κI], and among the remaining, it is P[1–κI]. The difference between these probabilities is κI. Thus κI=0 means that the first diagnosis has no predictive value for a second diagnosis, and κI=1 means that the first diagnosis is perfectly predictive of a second diagnosis.
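The retest probabilities described above can be sketched in a few lines of code. This is an illustration of the stated formulas only, not anything from the field trials; the function name and the example numbers (10% prevalence, κI=0.5) are my own.

```python
# Sketch of the retest probabilities implied by intraclass kappa (kappa_i)
# for a binary diagnosis with prevalence P, per the formulas in the text.
# The example values below are hypothetical, not field-trial data.

def retest_probabilities(kappa_i: float, prevalence: float) -> tuple[float, float]:
    """Return (P(second + | first +), P(second + | first -))."""
    p_pos_given_pos = kappa_i + prevalence * (1 - kappa_i)
    p_pos_given_neg = prevalence * (1 - kappa_i)
    return p_pos_given_pos, p_pos_given_neg

# Example: a disorder with 10% prevalence and kappa_i = 0.5.
given_pos, given_neg = retest_probabilities(0.5, 0.10)
print(round(given_pos, 3))              # 0.55
print(round(given_neg, 3))              # 0.05
print(round(given_pos - given_neg, 3))  # 0.5 -- the difference recovers kappa_i
```

Note that at κI=0 both probabilities collapse to the prevalence P, and at κI=1 they become 1 and 0, matching the interpretation given in the excerpt.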
Reliability is essentially a signal-to-noise ratio indicator. In diagnosis, there are two major sources of “noise”: the inconsistency with which patients express the diagnostic criteria and the inconsistency with which clinicians apply those criteria. It is all too easy to exaggerate reliability by removing some of that noise by design. Instead of a representative sample, as in the DSM-5 field trials, one might select “case subjects” who are unequivocally symptomatic and “control subjects” who are unequivocally asymptomatic, omitting the ambiguous middle of the population for whom diagnostic errors are the most common and most costly. That approach would hide much of the patient-generated noise.
… It is unrealistic to expect that the quality of psychiatric diagnoses can be much greater than that of diagnoses in other areas of medicine, where diagnoses are largely based on evidence that can be directly observed. Psychiatric diagnoses continue to be based on inferences derived from patient self-reports or observations of patient behavior. Nevertheless, we propose that the standard of evaluation of the test-retest reliability of DSM-5 be consistent with what is known about the reliability of diagnoses in other areas of medicine.
… From these results, to see a κI for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see κI between 0.6 and 0.8 would be cause for celebration. A realistic goal is κI between 0.4 and 0.6, while κI between 0.2 and 0.4 would be acceptable. We expect that the reliability [intraclass correlation coefficient] of DSM-5 dimensional measures will be larger, and we will aim for between 0.6 and 0.8 and accept between 0.4 and 0.6. The validity criteria in each case mirror those for reliability…
Standards for DSM-5 Reliability
by Robert L. Spitzer, M.D.; Janet B.W. Williams, Ph.D.; and Jean Endicott, Ph.D.
American Journal of Psychiatry 2012 169:537-537.
To the Editor: In the January issue of the Journal, Helena Chmura Kraemer, Ph.D., and colleagues ask, in anticipation of the results of the DSM-5 field trial reliability study, how much reliability is reasonable to expect. They argue that standards for interpreting kappa reliability, which have been widely accepted by psychiatric researchers, are unrealistically high. Historically, psychiatric reliability studies have adopted the Fleiss standard, in which kappas below 0.4 have been considered poor. Kraemer and colleagues propose that kappas from 0.2 to 0.4 be considered “acceptable.” After reviewing the results of three test-retest studies in different areas of medicine [diagnosis of anemia based on conjunctival inspection, diagnosis of pediatric skin and soft tissue infections, and bimanual pelvic examinations] in which kappas fall within ranges of 0.36–0.60, 0.39–0.43, and 0.07–0.26, respectively, Kraemer et al. conclude that “to see κI for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see κI between 0.6 and 0.8 would be cause for celebration.” Therefore, they note that for psychiatric diagnoses, “a realistic goal is κI between 0.4 and 0.6, while κI between 0.2 and 0.4 would be acceptable.”
When we conducted the DSM-III field trial, following the Fleiss standard, we considered kappas above 0.7 to be “good agreement as to whether or not the patient has a disorder within that diagnostic class”. According to the Kraemer et al. commentary, the DSM-III field trial results should be cause for celebration: the overall kappa for axis I disorders in the test-retest cohort (the one most comparable methodologically to the DSM-5 sample) was 0.66. Therefore, test-retest diagnostic reliability of at least 0.6 is achievable by clinicians in a real-world practice setting, and any results below that standard are a cause for concern.
Kraemer and colleagues’ central argument for these diagnostic reliability standards is to ensure that “our expectations of DSM-5 diagnoses…not be set unrealistically high, exceeding the standards that pertain to the rest of medicine.” Although the few cited test-retest studies have kappas averaging around 0.4, it is misleading to depict these as the “standards” of what is acceptable reliability in medicine. For example, the authors of the pediatric skin lesion study characterized their measured test-retest reliability of 0.39–0.43 as “poor.” Calling for psychiatry to accept kappa values that are characterized as unreliable in other fields of medicine is taking a step backward. One hopes that the DSM-5 reliability results are at least as good as the DSM-III results, if not better.
Response to Spitzer et al. Letter
by Helena Chmura Kraemer, Ph.D.; David J. Kupfer, M.D.; Diana E. Clarke, Ph.D.; William E. Narrow, M.D., M.P.H.; and Darrel A. Regier, M.D., M.P.H.
American Journal of Psychiatry 2012 169:537-537.
Homage must be paid to the DSM-III field trials that strongly influenced the design of the DSM-5 field trials. It could hardly be otherwise, since methods for evaluating categorical diagnoses were developed for DSM-III by Dr. Spitzer and his colleagues, Drs. Fleiss and Cohen. However, in the 30 years after 1979, the methodology and the understanding of kappa have advanced, and DSM-5 reflects that as well.
Like DSM-III, DSM-5 field trials sampled typical clinic patients. However, in the DSM-III field trials, participating clinicians were allowed to select the patients to evaluate and were trusted to report all results. In the DSM-5 field trials, symptomatic patients at each site were referred to a research associate for consent, assigned to an appropriate stratum, and randomly assigned to two participating clinicians for evaluation, with electronic data entry. In DSM-III field trials, the necessary independence of the two clinicians evaluating each patient was taken on trust. Stronger blinding protections were implemented in the DSM-5 field trials. Selection bias and lack of blindness tend to inflate kappas.
The sample sizes used in DSM-III, by current standards, were small. There appear to be only three diagnoses for which 25 or more cases were seen: any axis II personality disorder (kappa=0.54), all affective disorders (kappa=0.59), and the subcategory of major affective disorders (kappa=0.65). Four kappas of 1.00 were reported, each based on three or fewer cases; two kappas below zero were also reported based on 0–1 cases. In the absence of confidence intervals, other kappas may have been badly under- or overestimated. Since the kappas differ from one diagnosis to another, the overall kappa cited is uninterpretable.
Standards should reflect not what we ideally hope to achieve but the reliabilities of diagnoses that are actually useful in practice. Recognizing the possible inflation in the DSM-III and DSM-IV results, DSM-5 did not base its standards for kappa entirely on their findings. Fleiss articulated his standards before 1979, when there was little experience using kappa. Are the experience-based standards we proposed unreasonable? There seems to be major disagreement only about kappas between 0.2 and 0.4. We indicated that such kappas might be acceptable with low-prevalence disorders, where a small amount of random error can overwhelm a weak signal. Higher kappas may be achievable in such cases only under certain conditions: when we do longitudinal follow-up rather than a single interview; when we use biological markers not yet known; when we use specialists in that particular disorder; when we deal more effectively with comorbidity; and when we accept that “one size does not fit all” and develop personalized diagnostic procedures.
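The low-prevalence point above can be made concrete with a toy calculation. Under a simple assumed model (mine, not the authors'): a patient truly has the disorder with probability P, and each of two independent ratings is flipped by random error with probability e. Holding the error rate fixed, the implied kappa falls sharply as prevalence drops.

```python
# Toy model (an assumption for illustration, not from the letters):
# true prevalence P, and each of two independent binary ratings is
# flipped with probability e. Compute the kappa this model implies.

def implied_kappa(true_prev: float, error_rate: float) -> float:
    p, e = true_prev, error_rate
    obs_prev = p * (1 - e) + (1 - p) * e            # observed prevalence per rating
    both_pos = p * (1 - e) ** 2 + (1 - p) * e ** 2  # both ratings positive
    both_neg = p * e ** 2 + (1 - p) * (1 - e) ** 2  # both ratings negative
    p_o = both_pos + both_neg                       # observed agreement
    p_e = obs_prev ** 2 + (1 - obs_prev) ** 2       # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Same 5% per-rating error everywhere, yet kappa collapses with prevalence:
# roughly 0.81 at P=0.5, 0.61 at P=0.1, and only about 0.14 at P=0.01.
for prev in (0.5, 0.1, 0.01):
    print(prev, round(implied_kappa(prev, 0.05), 2))
```

The same amount of rater noise that leaves kappa in the "celebration" range for a common disorder pushes it into the contested 0.2-and-below territory for a rare one, which is the sense in which random error overwhelms a weak signal.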
Greater validity may be achievable only with a small decrease in reliability. The goal of DSM-5 is to maintain acceptable reliability while increasing validity based on the accumulated research and clinical experience since DSM-IV. The goal of the DSM-5 field trials is to present accurate and precise estimates of reliability when used for real patients in real clinics by real clinicians trained in DSM-5 criteria.
They wrote this after they knew their results but before they released them at the APA. It sounds to me like they’re going to go ahead and try to publish the book on time: as if these results [to take us seriously…] aren’t there and aren’t disastrous; as if the skipped second set of Field Trials aren’t needed; as if Dr. Spitzer is off base with his comments; as if Dr. Frances’ predictions weren’t confirmed by their own Field Trials; as if the term "evidence-based" is only for show and not a principle to live by. I can’t find any way to view this as anything but their "true colors" – and not very pretty colors at that.