kappa karma…

Posted on Friday 25 May 2012

When I read this in January, my first thought was that they’d seen the Field Trial results and were preparing us for the bad news. I guess everybody thought that [don’t expect too much…]:
DSM-5: How Reliable Is Reliable Enough?
by Helena Chmura Kraemer, David J. Kupfer, Diana E. Clarke, William E. Narrow, and Darrel A. Regier
American Journal of Psychiatry 2012 169:13-15.

… We previously commented in these pages on the need for field trials. Our purpose here is to set out realistic expectations concerning that assessment. In setting those expectations, one contentious issue is whether it is important that the prevalence for diagnoses based on proposed criteria for DSM-5 match the prevalence for the corresponding DSM-IV diagnoses. However, to require that the prevalence remain unchanged is to require that any existing difference between true and DSM-IV prevalence be reproduced in DSM-5. Any effort to improve the sensitivity of DSM-IV criteria will result in higher prevalence rates, and any effort to improve the specificity of DSM-IV criteria will result in lower prevalence rates. Thus, there are no specific expectations about the prevalence of disorders in DSM-5. The evaluations primarily address reliability.

… Reliability will be assessed using the intraclass kappa coefficient κI. For a categorical diagnosis with prevalence P, among subjects with an initial positive diagnosis, the probability of a second positive diagnosis is κI+P[1–κI], and among the remaining, it is P[1–κI]. The difference between these probabilities is κI. Thus κI=0 means that the first diagnosis has no predictive value for a second diagnosis, and κI=1 means that the first diagnosis is perfectly predictive of a second diagnosis

Reliability is essentially a signal-to-noise ratio indicator. In diagnosis, there are two major sources of “noise”: the inconsistency of expression of the diagnostic criteria by patients and the application of those criteria by the clinicians. It is all too easy to exaggerate reliability by removing some of that noise by design. Instead of a representative sample, as in DSM-5 field trials, one might select “case subjects” who are unequivocally symptomatic and “control subjects” who are unequivocally asymptomatic, omitting the ambiguous middle of the population for whom diagnostic errors are the most common and most costly. That approach would hide much of the patient-generated noise

… It is unrealistic to expect that the quality of psychiatric diagnoses can be much greater than that of diagnoses in other areas of medicine, where diagnoses are largely based on evidence that can be directly observed. Psychiatric diagnoses continue to be based on inferences derived from patient self-reports or observations of patient behavior. Nevertheless, we propose that the standard of evaluation of the test-retest reliability of DSM-5 be consistent with what is known about the reliability of diagnoses in other areas of medicine.

… From these results, to see a κI for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see κI between 0.6 and 0.8 would be cause for celebration. A realistic goal is κI between 0.4 and 0.6, while κI between 0.2 and 0.4 would be acceptable. We expect that the reliability [intraclass correlation coefficient] of DSM-5 dimensional measures will be larger, and we will aim for between 0.6 and 0.8 and accept between 0.4 and 0.6. The validity criteria in each case mirror those for reliability…
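To make the arithmetic in the excerpt concrete, here is a minimal Python sketch of the two probabilities it describes (κI + P[1–κI] after a first positive diagnosis, P[1–κI] otherwise) along with the interpretive bands the authors propose. The function names, the example values of kappa and prevalence, and the handling of values that fall exactly on a band boundary are my own assumptions for illustration; the article does not name a category for kappas below 0.2, so that label is mine as well.

```python
def retest_probabilities(kappa, prevalence):
    """Chance of a positive second diagnosis, given the first diagnosis.

    Uses the expressions quoted from Kraemer et al.:
      after a first positive: kappa + P * (1 - kappa)
      after a first negative: P * (1 - kappa)
    """
    after_positive = kappa + prevalence * (1.0 - kappa)
    after_negative = prevalence * (1.0 - kappa)
    return after_positive, after_negative


def kraemer_band(kappa):
    """Label a kappa using the bands proposed in the quoted commentary.

    Boundary handling (which band a value of exactly 0.4 falls in, etc.)
    is an assumption; the article only says "between".
    """
    if kappa > 0.8:
        return "almost miraculous"
    if kappa >= 0.6:
        return "cause for celebration"
    if kappa >= 0.4:
        return "realistic goal"
    if kappa >= 0.2:
        return "acceptable"
    return "below the proposed bands"  # my wording, not the authors'


# Hypothetical numbers: kappa of 0.3 for a diagnosis with 20% prevalence.
after_pos, after_neg = retest_probabilities(0.3, 0.2)
print(f"second positive after a first positive: {after_pos:.2f}")
print(f"second positive after a first negative: {after_neg:.2f}")
print(f"difference (recovers kappa): {after_pos - after_neg:.2f}")
print(f"proposed label: {kraemer_band(0.3)}")
```

With those hypothetical inputs, a patient diagnosed positive the first time has less than an even chance of being diagnosed positive the second time, and that is a kappa the proposed scale would still label "acceptable."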

In the May AJP, they published Robert Spitzer’s response to their absurdly low scale for Kappa values. I think you might summarize it with, "He scoffed.":
Standards for DSM-5 Reliability
by Robert L. Spitzer, M.D.; Janet B.W. Williams, Ph.D.; and Jean Endicott, Ph.D.
American Journal of Psychiatry 2012 169:537-537.

To the Editor: In the January issue of the Journal, Helena Chmura Kraemer, Ph.D., and colleagues ask, in anticipation of the results of the DSM-5 field trial reliability study, how much reliability is reasonable to expect. They argue that standards for interpreting kappa reliability, which have been widely accepted by psychiatric researchers, are unrealistically high. Historically, psychiatric reliability studies have adopted the Fleiss standard, in which kappas below 0.4 have been considered poor. Kraemer and colleagues propose that kappas from 0.2 to 0.4 be considered “acceptable.” After reviewing the results of three test-retest studies in different areas of medicine [diagnosis of anemia based on conjunctival inspection, diagnosis of pediatric skin and soft tissue infections, and bimanual pelvic examinations] in which kappas fall within ranges of 0.36–0.60, 0.39–0.43, and 0.07–0.26, respectively, Kraemer et al. conclude that “to see κI for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see κI between 0.6 and 0.8 would be cause for celebration.” Therefore, they note that for psychiatric diagnoses, “a realistic goal is κI between 0.4 and 0.6, while κI between 0.2 and 0.4 would be acceptable.”

When we conducted the DSM-III field trial, following the Fleiss standard, we considered kappas above 0.7 to be “good agreement as to whether or not the patient has a disorder within that diagnostic class”. According to the Kraemer et al. commentary, the DSM-III field trial results should be cause for celebration: the overall kappa for axis I disorders in the test-retest cohort (the one most comparable methodologically to the DSM-5 sample) was 0.66. Therefore, test-retest diagnostic reliability of at least 0.6 is achievable by clinicians in a real-world practice setting, and any results below that standard are a cause for concern.

Kraemer and colleagues’ central argument for these diagnostic reliability standards is to ensure that “our expectations of DSM-5 diagnoses…not be set unrealistically high, exceeding the standards that pertain to the rest of medicine.” Although the few cited test-retest studies have kappas averaging around 0.4, it is misleading to depict these as the “standards” of what is acceptable reliability in medicine. For example, the authors of the pediatric skin lesion study characterized their measured test-retest reliability of 0.39–0.43 as “poor.” Calling for psychiatry to accept kappa values that are characterized as unreliable in other fields of medicine is taking a step backward. One hopes that the DSM-5 reliability results are at least as good as the DSM-III results, if not better.

And then the DSM-5 Group’s response. These were published days before the results came out at the APA Meetings [they’re still not published on the DSM-5 website]:
Response to Spitzer et al. Letter
by Helena Chmura Kraemer, Ph.D.; David J. Kupfer, M.D.; Diana E. Clarke, Ph.D.; William E. Narrow, M.D., M.P.H.; and Darrel A. Regier, M.D., M.P.H.
American Journal of Psychiatry 2012 169:537-537.

Homage must be paid to the DSM-III field trials that strongly influenced the design of the DSM-5 field trials. It could hardly be otherwise, since methods for evaluating categorical diagnoses were developed for DSM-III by Dr. Spitzer and his colleagues, Drs. Fleiss and Cohen. However, in the 30 years after 1979, the methodology and the understanding of kappa have advanced, and DSM-5 reflects that as well.

Like DSM-III, DSM-5 field trials sampled typical clinic patients. However, in the DSM-III field trials, participating clinicians were allowed to select the patients to evaluate and were trusted to report all results. In the DSM-5 field trials, symptomatic patients at each site were referred to a research associate for consent, assigned to an appropriate stratum, and randomly assigned to two participating clinicians for evaluation, with electronic data entry. In DSM-III field trials, the necessary independence of the two clinicians evaluating each patient was taken on trust. Stronger blinding protections were implemented in the DSM-5 field trials. Selection bias and lack of blindness tend to inflate kappas.

The sample sizes used in DSM-III, by current standards, were small. There appear to be only three diagnoses for which 25 or more cases were seen: any axis II personality disorder (kappa=0.54), all affective disorders (kappa=0.59), and the subcategory of major affective disorders (kappa=0.65). Four kappas of 1.00 were reported, each based on three or fewer cases; two kappas below zero were also reported based on 0–1 cases. In the absence of confidence intervals, other kappas may have been badly under- or overestimated. Since the kappas differ from one diagnosis to another, the overall kappa cited is uninterpretable.

Standards reflect not what we hope ideally to achieve but what the reliabilities are of diagnoses that are actually useful in practice. Recognizing the possible inflation in DSM-III and DSM-IV results, DSM-5 did not base its standards for kappa entirely on their findings. Fleiss articulated his standards before 1979 when there was little experience using kappa. Are the experience-based standards we proposed unreasonable? There seems to be major disagreement only about kappas between 0.2 and 0.4. We indicated that such kappas might be acceptable with low-prevalence disorders, where a small amount of random error can overwhelm a weak signal. Higher kappas may, in such cases, be achievable only in the following cases: when we do longitudinal follow-up, not with a single interview; when we use unknown biological markers; when we use specialists in that particular disorder; when we deal more effectively with comorbidity; and when we accept that “one size does not fit all” and develop personalized diagnostic procedures.

Greater validity may be achievable only with a small decrease in reliability. The goal of DSM-5 is to maintain acceptable reliability while increasing validity based on the accumulated research and clinical experience since DSM-IV. The goal of the DSM-5 field trials is to present accurate and precise estimates of reliability when used for real patients in real clinics by real clinicians trained in DSM-5 criteria.

I find this response offensive [as in on the offense] for two reasons. They essentially attack the DSM-III and DSM-IV Field Trials as inflated. They certainly had ample time to say that before they had their own results – and didn’t say it. Then they say:
    There seems to be major disagreement only about kappas between 0.2 and 0.4. We indicated that such kappas might be acceptable with low-prevalence disorders, where a small amount of random error can overwhelm a weak signal.
But their results between 0.2 and 0.4 are in Generalized Anxiety Disorder and Major Depressive Disorder, hardly "low-prevalence disorders." Did they think we wouldn’t notice?
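For what the low-prevalence claim in that quote amounts to, here is a rough Python sketch under a deliberately simple model that is my assumption, not anything taken from the DSM-5 papers: two clinicians who err independently of each other, each with the same sensitivity and specificity against the patient's true status. Under that model the intraclass kappa can be written down directly, and the same error rates give a lower kappa as the disorder gets rarer. The function name and the 90% figures are hypothetical.

```python
def intraclass_kappa(prevalence, sensitivity, specificity):
    """Intraclass kappa for two exchangeable raters under a toy model.

    Assumption (mine): given the patient's true status, the two raters
    make independent errors with identical sensitivity and specificity.
    """
    # Chance that any one rater gives a positive diagnosis.
    q = prevalence * sensitivity + (1.0 - prevalence) * (1.0 - specificity)
    # Chance that both raters give a positive diagnosis.
    both_positive = (prevalence * sensitivity ** 2
                     + (1.0 - prevalence) * (1.0 - specificity) ** 2)
    # Agreement beyond chance, scaled by the maximum possible.
    return (both_positive - q * q) / (q * (1.0 - q))


# Same hypothetical rater accuracy, different true prevalence.
for prevalence in (0.50, 0.20, 0.05):
    kappa = intraclass_kappa(prevalence, sensitivity=0.9, specificity=0.9)
    print(f"prevalence {prevalence:.2f}: kappa {kappa:.2f}")
```

In this toy model, kappa drops from about 0.64 at 50% prevalence to about 0.25 at 5%, with no change in how accurate the clinicians are. That is the effect the quote describes, and it only speaks to disorders that are rare in the sample being tested.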

They wrote this after they knew their results but before they released them at the APA. It sounds to me like they’re going to go ahead and try to publish the book on time: as if these results [to take us seriously…] aren’t there and aren’t disastrous; as if the skipped second set of Field Trials aren’t needed; as if Dr. Spitzer is off base with his comments; as if Dr. Frances’ predictions weren’t confirmed by their own Field Trials; as if the term "evidence-based" is only for show and not a principle to live by. I can’t find any way to view this as anything but their "true colors" – and not very pretty colors at that.

I sometimes worry that we’re being too harsh with these people, but then they do something like this and I wish I’d been marching around carrying a poster years ago. Why even have Field Trials if you’re not going to pay attention to the results, except to re-write the measurement scale to fit what you want them to say? Why would you argue with the guy that essentially invented Kappa and used it to lead psychiatry out of the wilderness about what Kappa means? Why would you ignore the petition signed by the people you hope will be your customers? There’s no shame in being wrong. The shame is in pressing forward when your own results prove you wrong.
    May 25, 2012 | 7:11 PM

    Thank you once again.

    May 25, 2012 | 9:28 PM

    Don’t be such a Debbie Downer! Who cares about major depression or anxiety anyway! At least we can agree on Autism Spectrum Disorder! This will surely advance clinical neuroscientific psychiatry! Tom Insel will be proud!

    May 25, 2012 | 9:40 PM

    I don’t see why accurate diagnosis is such a big deal. Treatment is still arbitrary sequences or combinations of psychiatric drugs no matter what the diagnosis.

    May 25, 2012 | 10:21 PM

    Alto: You are so right. A psychiatrist colleague of mine ran into me in the gym. She said: “Hell, all we do is throw pills at people. We don’t even have to ask about symptoms or lives. We just see what pill sticks.”

    May 26, 2012 | 12:47 PM

    Thanks, Tom

    From the Kraemer article: “….any effort to improve the specificity of DSM-IV criteria will result in lower prevalence rates….” Excuse me? Isn’t it desirable to decrease the rate of false positives?

    May 26, 2012 | 12:57 PM

    Excuse me? Isn’t it desirable to decrease the rate of false positives?
    Great point!
