{"id":23736,"date":"2012-05-25T18:17:36","date_gmt":"2012-05-25T22:17:36","guid":{"rendered":"http:\/\/1boringoldman.com\/?p=23736"},"modified":"2012-05-25T18:20:54","modified_gmt":"2012-05-25T22:20:54","slug":"kappa-karma","status":"publish","type":"post","link":"https:\/\/1boringoldman.com\/index.php\/2012\/05\/25\/kappa-karma\/","title":{"rendered":"kappa karma&#8230;"},"content":{"rendered":"<div align=\"justify\">When I read this in January, my first thought was that they&#8217;d seen the Field Trial results and were preparing us for the bad news. I guess everybody thought that [<a target=\"_blank\" href=\"http:\/\/1boringoldman.com\/index.php\/2012\/01\/10\/18099\/\"><u><strong><font color=\"#200020\">don&rsquo;t expect too much&hellip;<\/font><\/strong><\/u><\/a>]:    <\/div>\n<blockquote>\n<div align=\"center\"><a href=\"http:\/\/ajp.psychiatryonline.org\/data\/Journals\/AJP\/4396\/appi.ajp.2011.11010050.pdf\" target=\"_blank\"><u><strong><font color=\"#200020\">DSM-5: How Reliable Is Reliable Enough?<\/font><\/strong><\/u><\/a><br \/>                           <sup>by Helena  Chmura Kraemer, David J. Kupfer, Diana E. Clarke,  William E. Narrow, and Darrel A. Regier<\/sup><br \/>                           <strong><font color=\"#004400\">American Journal of Psychiatry<\/font><\/strong> 2012 169:13-15.<\/div>\n<p>     <\/p>\n<div align=\"justify\"><sup>&hellip; We previously commented in these pages on the  need for field trials. Our purpose here is to set out realistic  expectations concerning that assessment. In setting those expectations,  one  contentious issue is whether it is important that the prevalence  for  diagnoses based on proposed criteria for DSM-5 match the prevalence  for  the corresponding DSM-IV diagnoses. However, to require that the   prevalence remain unchanged is to require that any existing difference   between true and DSM-IV prevalence be reproduced in DSM-5. 
Any effort to   improve the sensitivity of DSM-IV criteria will result in higher   prevalence rates, and any effort to improve the specificity of DSM-IV   criteria will result in lower prevalence rates. Thus, there are no   specific expectations about the prevalence of disorders in DSM-5. The   evaluations primarily address reliability.<\/p>\n<p>            &hellip; Reliability will be assessed using the intraclass kappa  coefficient &kappa;<sub>I<\/sub>.   For a categorical diagnosis with prevalence P, among subjects with an   initial positive diagnosis, the probability of a second positive   diagnosis is &kappa;<sub>I<\/sub>+P[1&ndash;&kappa;<sub>I<\/sub>], and among the remaining, it is P[1&ndash;&kappa;<sub>I<\/sub>]. The difference between these probabilities is &kappa;<sub>I<\/sub>. Thus &kappa;<sub>I<\/sub>=0 means that the first diagnosis has no predictive value for a second diagnosis, and &kappa;<sub>I<\/sub>=1 means that the first diagnosis is perfectly predictive of a second diagnosis.<\/p>\n<p>            Reliability is essentially a  signal-to-noise  ratio indicator. In diagnosis, there are two major  sources of &ldquo;noise&rdquo;:  the inconsistency of expression of the diagnostic  criteria by patients  and the application of those criteria by the  clinicians. It is all too  easy to exaggerate reliability by removing  some of that noise by  design. Instead of a representative sample, as in  DSM-5 field trials,  one might select &ldquo;case subjects&rdquo; who are  unequivocally symptomatic and  &ldquo;control subjects&rdquo; who are unequivocally  asymptomatic, omitting the  ambiguous middle of the population for whom  diagnostic errors are the  most common and most costly. 
That approach  would hide much of the  patient-generated noise.<\/p>\n<p>            &hellip; It is unrealistic to expect that the  quality  of psychiatric diagnoses can be much greater than that of  diagnoses in  other areas of medicine, where diagnoses are largely based  on evidence  that can be directly observed. Psychiatric diagnoses  continue to be  based on inferences derived from patient self-reports or  observations  of patient behavior. Nevertheless, we propose that the  standard of  evaluation of the test-retest reliability of DSM-5 be  consistent with  what is known about the reliability of diagnoses in  other areas of  medicine.<\/p>\n<p>            &hellip; From these results, to see a &kappa;<sub>I<\/sub> for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see &kappa;<sub>I<\/sub> between 0.6 and 0.8 would be cause for celebration. A realistic goal is &kappa;<sub>I<\/sub> between 0.4 and 0.6, while &kappa;<sub>I<\/sub> between 0.2 and 0.4 would be acceptable. We  expect that the reliability  [intraclass correlation coefficient] of  DSM-5 dimensional measures will  be larger, and we will aim for between  0.6 and 0.8 and accept between  0.4 and 0.6. The validity criteria in  each case mirror those for  reliability&hellip;<\/sup><\/div>\n<\/blockquote>\n<div align=\"justify\">In the May AJP, they published Robert Spitzer&#8217;s response to their absurdly low scale for Kappa values. I think you might summarize it as: &quot;He scoffed.&quot;   <\/div>\n<blockquote>\n<div align=\"center\"><a target=\"_blank\" href=\"http:\/\/psychiatryonline.org\/article.aspx?articleID=1109031\"><u><strong><font color=\"#200020\">Standards for DSM-5 Reliability<\/font><\/strong><\/u><\/a><br \/>          <sup>by Robert L. Spitzer, M.D.; Janet B.W. 
Williams, Ph.D.; and Jean Endicott, Ph.D<\/sup><br \/>          <strong><font color=\"#004400\">American Journal of Psychiatry<\/font><\/strong> 2012 169:537-537.<\/div>\n<p>         <\/p>\n<div align=\"justify\"><sup>To the Editor: In the January issue of the <em>Journal<\/em>, Helena Chmura Kraemer, Ph.D., and colleagues  ask, in anticipation of the results of the DSM-5 field trial  reliability study, how much reliability is reasonable to expect. They  argue that standards for interpreting kappa reliability, which have been  widely accepted by psychiatric researchers, are unrealistically high.  Historically, psychiatric reliability studies have adopted the Fleiss  standard, in which kappas below 0.4 have been considered poor.  Kraemer and colleagues propose that kappas from 0.2 to 0.4 be  considered &ldquo;acceptable.&rdquo; After reviewing the results of three  test-retest studies in different areas of medicine [diagnosis of anemia  based on conjunctival inspection, diagnosis of pediatric skin and soft  tissue infections, and bimanual pelvic examinations] in which kappas  fall within ranges of 0.36&ndash;0.60, 0.39&ndash;0.43, and 0.07&ndash;0.26, respectively,  Kraemer et al. conclude that &ldquo;to see &kappa;<sub>I<\/sub> for a DSM-5 diagnosis above 0.8 would be almost miraculous; to see &kappa;<sub>I<\/sub>  between 0.6 and 0.8 would be cause for celebration.&rdquo; Therefore, they  note that for psychiatric diagnoses, &ldquo;a realistic goal is &kappa;<sub>I<\/sub> between 0.4 and 0.6, while &kappa;<sub>I<\/sub> between 0.2 and 0.4 would be acceptable.&rdquo;<\/p>\n<p>          When we conducted the  DSM-III field trial, following the Fleiss standard, we considered kappas  above 0.7 to be &ldquo;good agreement as to whether or not the patient has a  disorder within that diagnostic class&rdquo;.  According to the Kraemer et al. 
commentary, the DSM-III field trial  results should be cause for celebration: the overall kappa for axis I  disorders in the test-retest cohort (the one most comparable  methodologically to the DSM-5 sample) was 0.66.  Therefore, test-retest diagnostic reliability of at least 0.6 is  achievable by clinicians in a real-world practice setting, and any  results below that standard are a cause for concern.<\/p>\n<p>          Kraemer and colleagues&#8217; central argument for  these diagnostic reliability standards is to ensure that &ldquo;our  expectations of DSM-5 diagnoses&hellip;not be set unrealistically high,  exceeding the standards that pertain to the rest of medicine.&rdquo; Although  the few cited test-retest studies have kappas averaging around 0.4, it  is misleading to depict these as the &ldquo;standards&rdquo; of what is acceptable  reliability in medicine. For example, the authors of the pediatric skin  lesion study  characterized their measured test-retest reliability of 0.39&ndash;0.43 as  &ldquo;poor.&rdquo; Calling for psychiatry to accept kappa values that are  characterized as unreliable in other fields of medicine is taking a step  backward. One hopes that the DSM-5 reliability results are at least as  good as the DSM-III results, if not better.<\/sup><\/div>\n<\/blockquote>\n<div align=\"justify\">And then the DSM-5 Group&#8217;s response. These were published days before the results came out at the APA Meetings [they&#8217;re still not published on the DSM-5 website]:   <\/div>\n<blockquote>\n<div align=\"center\"><a target=\"_blank\" href=\"http:\/\/psychiatryonline.org\/article.aspx?articleID=1109032\"><u><strong><font color=\"#200020\">Response to Spitzer et al. Letter<\/font><\/strong><\/u><\/a><br \/>          <sup>by Helena  Chmura Kraemer, Ph.D.; David J. Kupfer, M.D.; Diana E. Clarke, Ph.D.;  William E. Narrow, M.D., M.P.H.; and Darrel A. 
Regier, M.D., M.P.H.<\/sup><br \/>          <strong><font color=\"#004400\">American Journal of Psychiatry<\/font><\/strong> 2012 169:537-537.<\/div>\n<p>         <\/p>\n<div align=\"justify\"><sup>Homage must be paid to the DSM-III field trials  that strongly influenced the design of the DSM-5 field trials. It could  hardly be otherwise, since methods for evaluating categorical diagnoses  were developed for DSM-III by Dr. Spitzer and his colleagues, Drs.  Fleiss and Cohen. However, in the 30 years after 1979, the methodology  and the understanding of kappa have advanced, and DSM-5 reflects that as well.<\/p>\n<p>          Like DSM-III, DSM-5 field trials sampled  typical clinic patients. However, in the DSM-III field trials,  participating clinicians were allowed to select the patients to evaluate  and were trusted to report all results. In the DSM-5 field trials,  symptomatic patients at each site were referred to a research associate  for consent, assigned to an appropriate stratum, and randomly assigned  to two participating clinicians for evaluation, with electronic data  entry. In DSM-III field trials, the necessary independence of the two  clinicians evaluating each patient was taken on trust. Stronger blinding  protections were implemented in the DSM-5 field trials. Selection bias  and lack of blindness tend to inflate kappas. <\/p>\n<p>          The sample sizes used in DSM-III, by current  standards, were small. There appear to be only three diagnoses for  which 25 or more cases were seen: any axis II personality disorder  (kappa=0.54), all affective disorders (kappa=0.59), and the subcategory  of major affective disorders (kappa=0.65). Four kappas of 1.00 were  reported, each based on three or fewer cases; two kappas below zero were  also reported based on 0&ndash;1 cases. In the absence of confidence  intervals, other kappas may have been badly under- or overestimated.  
Since the kappas differ from one diagnosis to another, the overall kappa  cited is uninterpretable.<\/p>\n<p>          Standards reflect not what we hope ideally  to achieve but what the reliabilities are of diagnoses that are actually  useful in practice. Recognizing the possible inflation in DSM-III and  DSM-IV results, DSM-5 did not base its standards for kappa entirely on  their findings. Fleiss articulated his standards before 1979 when there  was little experience using kappa. Are the experience-based standards  we proposed unreasonable? There seems to be major disagreement only  about kappas between 0.2 and 0.4. We indicated that such kappas might be  acceptable with low-prevalence disorders, where a small amount of  random error can overwhelm a weak signal. Higher kappas may, in such  cases, be achievable only in the following cases: when we do  longitudinal follow-up, not with a single interview; when we use unknown  biological markers; when we use specialists in that particular  disorder; when we deal more effectively with comorbidity; and when we  accept that &ldquo;one size does not fit all&rdquo; and develop personalized  diagnostic procedures. <\/p>\n<p>          Greater validity may be achievable only with  a small decrease in reliability. The goal of DSM-5 is to maintain  acceptable reliability while increasing validity based on the  accumulated research and clinical experience since DSM-IV. The goal of  the DSM-5 field trials is to present accurate and precise estimates of  reliability when used for real patients in real clinics by real  clinicians trained in DSM-5 criteria.<\/sup><\/div>\n<\/blockquote>\n<div align=\"justify\">I find this response offensive [as in on the offense] for two reasons. They essentially attack the DSM-III and DSM-IV Field Trials as inflated. They certainly had ample time to say that before they had their own results &#8211; and didn&#8217;t say it. 
Then they say:<\/div>\n<ul>\n<div align=\"justify\"><sup>There seems to be major disagreement only  about kappas between 0.2 and  0.4. We indicated that such kappas might be  acceptable with  low-prevalence disorders, where a small amount of  random error can  overwhelm a weak signal.<\/sup><\/div>\n<\/ul>\n<div align=\"justify\">But their results between 0.2 and 0.4 are in Generalized Anxiety Disorder and Major Depressive Disorder, hardly &quot;low-prevalence disorders.&quot; Did they think we wouldn&#8217;t notice?<\/div>\n<p align=\"justify\">They wrote this <em>after<\/em> they knew their results but <em>before<\/em> they released them at the APA. <strong><font color=\"#200020\">It sounds to me like they&#8217;re going to go ahead and try to publish the book on time<\/font><\/strong>: as if these results [<u><strong><a target=\"_blank\" href=\"http:\/\/1boringoldman.com\/index.php\/2012\/05\/22\/to-take-us-seriously\/\"><font color=\"#200020\">to take us seriously&hellip;<\/font><\/a><\/strong><\/u>] aren&#8217;t there and aren&#8217;t disastrous; as if the skipped second set of Field Trials isn&#8217;t needed; as if Dr. Spitzer is off base with his comments; as if Dr. Frances&#8217; predictions weren&#8217;t confirmed by their own Field Trials; as if the term &quot;evidence-based&quot; is only for show and not a principle to live by. I can&#8217;t find any way to view this as anything but their &quot;true colors&quot; &#8211; and not very pretty colors at that. <\/p>\n<div align=\"justify\">I sometimes worry that we&#8217;re being too harsh with these people, but then they do something like this and I wish I&#8217;d been marching around carrying a poster years ago. Why even have Field Trials if you&#8217;re not going to pay attention to the results &#8211; except to re-write the measurement scale to fit what you want them to say? 
Why would you argue with the guy that essentially invented Kappa and used it to lead psychiatry out of the wilderness about what Kappa means? Why would you ignore the petition signed by the people you hope will be your customers? There&#8217;s no shame in being wrong. The shame is in pressing forward when your own results prove you wrong.<\/div>\n","protected":false},"excerpt":{"rendered":"<p>When I read this in January, my first thought was that they&#8217;d seen the Field Trial results and were preparing us for the bad news. I guess everybody thought that [don&rsquo;t expect too much&hellip;]: DSM-5: How Reliable Is Reliable Enough? by Helena Chmura Kraemer, David J. Kupfer, Diana E. Clarke, William E. Narrow, and Darrel [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[2],"tags":[],"class_list":["post-23736","post","type-post","status-publish","format-standard","hentry","category-politics"],"_links":{"self":[{"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/posts\/23736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/comments?post=23736"}],"version-history":[{"count":13,"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/posts\/23736\/revisions"}],"predecessor-version":[{"id":23749,"href":"https:
\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/posts\/23736\/revisions\/23749"}],"wp:attachment":[{"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/media?parent=23736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/categories?post=23736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/1boringoldman.com\/index.php\/wp-json\/wp\/v2\/tags?post=23736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}