One might think that I would tire of complaining about the DSM-5 process, but that doesn’t seem to be the case. This particular complaint is actually bigger than just the DSM-5; it’s about a brand of misguided science that comes up frequently in psychiatry in general. Here’s the DSM-5 example from January 2012:
Q: Are the standards set for the reliability of DSM-5 diagnoses set too low, given the very high reliabilities of DSM-IV and DSM-III diagnoses?
A: Many field trials done in the past have used selected samples, rather than representative samples, have muddled sites having very different patient pools, have used expert clinicians involved in designing the diagnosis, have used sample sizes too small to guarantee reasonable precision [and have presented no measures of precision], have not guaranteed complete "blindness" between test and retest, have focused the attention of clinicians on a limited set of diagnoses rather than asking for diagnosis as it would be done in practice. All of these past field trials have consequently reported inflated kappas. This has had the unfortunate effect of setting expectations unreasonably high. The AJP article by Kraemer et al. attempts to set expectations of true reliability in a clinical setting more realistically.
This comment and the referenced AJP article were written after they knew the results of their Field Trials [but we didn’t]. In the article they said:
… From these results, to see a κI for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see κI between 0.6 and 0.8 would be cause for celebration. A realistic goal is κI between 0.4 and 0.6, while κI between 0.2 and 0.4 would be acceptable. We expect that the reliability [intraclass correlation coefficient] of DSM-5 dimensional measures will be larger, and we will aim for between 0.6 and 0.8 and accept between 0.4 and 0.6. The validity criteria in each case mirror those for reliability.
But the DSM-5 web site, and what they said before the field trials, told a different story. This is from their protocol in April 2010 …
Reliability: The test-retest and inter-rater reliabilities of the categorical diagnoses and the dimensional measures that are incorporated into the diagnostic scheme for DSM-5 will be assessed in independent analyses. Test-retest and inter-rater reliabilities for dichotomous and categorical items will be computed using the kappa statistic (k) for categorical data; for ordinal and continuous measures, test-retest reliability will be computed using the intraclass correlation coefficient (ICC). In both cases, sampling weights will be incorporated into the computation. The 95% confidence intervals will be obtained using bootstrap methods. Reliability will be rated as fair (.30 to .49), moderate (.50 to .69), or high (.70 to 1.00) for the purposes of comparison. Post hoc exploratory analyses will be conducted to determine whether reliabilities differ by gender or due to comorbid conditions.
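For concreteness, here is a minimal sketch of the machinery that paragraph describes: a test-retest kappa with a percentile-bootstrap 95% confidence interval. The data and the roughly 20% disagreement rate are invented for illustration, and `cohen_kappa_score` stands in for whatever the Field Trials actually used — this is a sketch of the approach, not their code:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)

# Hypothetical test/retest diagnoses for 200 patients (1 = disorder present).
test = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.2                  # retest disagrees ~20% of the time
retest = np.where(flip, 1 - test, test)

kappa = cohen_kappa_score(test, retest)

# Percentile bootstrap for the 95% CI, resampling patients with replacement.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(test), size=len(test))
    boot.append(cohen_kappa_score(test[idx], retest[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"kappa = {kappa:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```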
The standard scientific method includes setting the outcome parameters before doing the study: what is the primary outcome variable? how will it be interpreted? Resetting the variables or their meanings after knowing the results isn’t kosher. If they thought that "All of these past field trials have consequently reported inflated kappas," they needed to say that before they knew their own results. They didn’t. If they thought "…to see a κI for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see κI between 0.6 and 0.8 would be cause for celebration. A realistic goal is κI between 0.4 and 0.6, while κI between 0.2 and 0.4 would be acceptable," they needed to say that before the study instead of "Reliability will be rated as fair (.30 to .49), moderate (.50 to .69), or high (.70 to 1.00) for the purposes of comparison." The reasons for declaring such things before doing a study are obvious – otherwise, you’re just spinning your lackluster results. This kind of after-the-fact inflation of results is common in psychiatric studies. A familiar example is the report of "near significance" or a result that "approached significance" – seen a lot in clinical trials. Dr. Frances gave these values as traditional interpretations [before we had any results]:
Kappa is a statistic that measures agreement among raters, corrected for chance agreement. Historically, kappas above 0.8 are considered good, above 0.6 fair, and under 0.6 poor. Before this AJP commentary, no one has ever felt comfortable endorsing kappas so low as 0.2-0.4. As a comparison, the personality section in DSM III was widely derided when its kappas were around 0.5. A kappa between 0.2-0.4 comes dangerously close to no agreement.
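To see what "corrected for chance agreement" means in practice, here is a worked toy example with counts I made up for illustration: two raters each diagnose 100 patients, agree on 75 of them, and still end up with a kappa of only 0.50 once the agreement expected by chance is subtracted out.

```python
# Toy 2x2 agreement table for two raters on a yes/no diagnosis
# (counts are hypothetical, chosen only to show the chance correction):
#                rater B: yes   rater B: no
# rater A: yes       40             10
# rater A: no        15             35
a, b, c, d = 40, 10, 15, 35
n = a + b + c + d

p_observed = (a + d) / n                      # raw agreement = 0.75
p_both_yes = ((a + b) / n) * ((a + c) / n)    # chance both say yes
p_both_no  = ((c + d) / n) * ((b + d) / n)    # chance both say no
p_chance   = p_both_yes + p_both_no           # expected chance agreement = 0.50

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"{kappa:.2f}")  # 0.50: 75% raw agreement shrinks to kappa 0.50
```

That shrinkage is the whole point of the statistic, and it is why a kappa in the 0.2–0.4 range means the raters are barely outperforming chance.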
Here are the DSM-5 Field Trial results I know about, each kappa interpreted by the three standards: Dr. Frances’ standard, the DSM-5 Task Force’s standard from BEFORE they had the results, and the DSM-5 Task Force’s standard from AFTER they had the results:
| Disorder | Kappa | Frances | BEFORE | AFTER |
|----------|-------|---------|--------|-------|
| Major neurocognitive disorder | .78 | fair | high | celebration |
| Autism spectrum disorder | .69 | fair | moderate | celebration |
| Post traumatic stress disorder | .67 | fair | moderate | celebration |
| Child attention deficit disorder | .61 | fair | moderate | celebration |
| Complex somatic disorder | .60 | fair | moderate | celebration |
| Bipolar disorder | .54 | poor | moderate | realistic |
| Oppositional defiant disorder | .41 | poor | fair | realistic |
| Major Depressive Disorder (in adults) | .32 | no agreement | fair | acceptable |
| Generalized anxiety disorder | .20 | no agreement | unacceptable | acceptable |
| Disruptive mood dysregulation disorder | .50 | poor | moderate | realistic |
| Schizophrenia | .46 | poor | fair | realistic |
| Mild neurocognitive disorder | .50 | poor | moderate | realistic |
| Schizoaffective Disorder | .50 | poor | moderate | realistic |
| Mild traumatic brain injury | .46 | poor | fair | realistic |
| Alcohol use disorder | .40 | no agreement | fair | realistic |
| Hoarding | .59 | poor | moderate | realistic |
| Binge Eating | .56 | poor | moderate | realistic |
| Major Depressive Disorder (in kids) | .29 | no agreement | unacceptable | acceptable |
| Borderline personality disorder | .58 | poor | moderate | realistic |
| Mixed anxiety/depressive disorder | .06 | no agreement | unacceptable | unacceptable |
| Conduct Disorder | .48 | poor | fair | realistic |
| Antisocial Personality Disorder | .22 | no agreement | unacceptable | acceptable |
| Obsessive Compulsive Disorder | .31 | no agreement | fair | acceptable |
| Attenuated Psychosis Syndrome | .46 | poor | fair | realistic |
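To make the relabeling concrete, here is a small sketch that grades the same kappa under all three standards. The thresholds come from the passages quoted above; the function names and my handling of values that land exactly on a boundary are my own reading:

```python
def frances(k):
    # Historical rule of thumb per Dr. Frances: above 0.8 good, above 0.6 fair,
    # under 0.6 poor; 0.2-0.4 "comes dangerously close to no agreement".
    if k >= 0.8:
        return "good"
    if k >= 0.6:
        return "fair"
    if k > 0.4:
        return "poor"
    return "no agreement"

def before(k):
    # DSM-5 protocol, April 2010: high .70-1.00, moderate .50-.69, fair .30-.49.
    if k >= 0.70:
        return "high"
    if k >= 0.50:
        return "moderate"
    if k >= 0.30:
        return "fair"
    return "unacceptable"

def after(k):
    # Kraemer et al., once the results were in: >0.8 miraculous,
    # 0.6-0.8 celebration, 0.4-0.6 realistic, 0.2-0.4 acceptable.
    if k > 0.8:
        return "miraculous"
    if k >= 0.6:
        return "celebration"
    if k >= 0.4:
        return "realistic"
    if k >= 0.2:
        return "acceptable"
    return "unacceptable"

for name, k in [("Schizophrenia", 0.46), ("Major Depressive Disorder", 0.32)]:
    print(f"{name}: {frances(k)} / {before(k)} / {after(k)}")
# Schizophrenia: poor / fair / realistic
# Major Depressive Disorder: no agreement / fair / acceptable
```

Same numbers, three verdicts. The only thing that moved was the grading scale.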
While it is particularly disturbing to see this kind of thing coming from our professional organization, the practice is fairly widespread. The call is for clear declarations of how the results will be interpreted before they are in hand. The reasons are patently obvious. In science, a study usually seeks the answer to a question framed in advance as a hypothesis. The conditions for the "yes", "no", and "undetermined" answers to that question are part of the hypothesis itself and need to be clearly stated. Outcome is not a toy to be played with…
And once aware of a previous psychiatric diagnosis, a psychiatrist is likely to interpret the patient “out of hand” before the patient even opens their mouth. Once the patient has been diagnosed with a mood disorder, there is nothing they can say or do to convince the psychiatrist that they are anything else; all of the stress and pain in that patient’s life becomes a result of their disorder, fixable if the patient would just stay on those four or five medications.
The degree to which this profession has become both more arrogant and more glib in equal measure is stultifying and worthy of psychoanalysis.
It’s always interesting to watch the evolution of “acceptable”. I gather it will eventually come down to what the greater society is willing to accept with these half-baked attempts at selling mediocrity and outright deception.
This is what’s known in research as “moving the end point” and is a major reason why psychiatry studies are laughingstocks among serious medical researchers.
Looking at it another way, it’s very Zen: “Whatever you hit, call it the target.”