by SD Targum. Journal of Clinical Psychopharmacology. 2006;26[3]:308-310.
Clinical trials rely on ratings accuracy to document a beneficial drug effect. This study examined rater competency with clinical nervous system rating instruments relative to previous clinical experience and participation in specific rater training programs. One thousand two hundred forty-one raters scored videotaped interviews of the Hamilton Anxiety Scale [HAM-A], Hamilton Depression Scale [HAM-D], and Young Mania Rating Scale [YMRS] during rater training programs conducted at 9 different investigator meetings. Scoring deviations relative to established acceptable scores were used to evaluate individual rater competency. Rater competency was not achieved by clinical experience alone. Previous clinical experience with mood-disordered patients ranged from none at all [18%] to 40 years in 1 rater. However, raters attending their first-ever training session [n = 485] were not differentiated on the basis of clinical experience on the HAM-A [P = 0.054], HAM-D [P = 0.06], or YMRS [P = 0.66]. Alternatively, participation in repeated rater training sessions significantly improved rater competency on the HAM-A [P = 0.002], HAM-D [P < 0.001], and YMRS [P < 0.001]. Furthermore, raters with clinical experience still improved with rater training. Using 5 years of clinical experience as a minimum cutoff [n = 795], raters who had participated in 5 or more training sessions significantly outperformed comparably experienced raters attending their first-ever training session on the HAM-A [P = 0.003], HAM-D [P < 0.001], and YMRS [P < 0.001]. The findings show that rater training improves rater competency at all levels of clinical experience. Furthermore, more stringent criteria for rater eligibility and comprehensive rater training programs can improve ratings competency.
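The abstract doesn't say exactly what counted as an unacceptable scoring deviation, so the criterion below is a guess for illustration only – but the bookkeeping itself is trivial, something like this Python sketch [the gold-standard score, the tolerance band, and the rater scores are all made-up numbers, not data from the Targum study]:

    # Hypothetical illustration of flagging rater deviations against a
    # gold-standard consensus score for a videotaped interview.
    # The +/-2-point tolerance is an assumption for illustration, not
    # the criterion used in the Targum study.
    GOLD_STANDARD_HAMD = 24   # consensus score for the taped interview
    TOLERANCE = 2             # acceptable deviation (illustrative)

    rater_scores = {"rater_001": 23, "rater_002": 29, "rater_003": 25}

    for rater, score in rater_scores.items():
        deviation = abs(score - GOLD_STANDARD_HAMD)
        status = "acceptable" if deviation <= TOLERANCE else "flag for retraining"
        print(f"{rater}: scored {score}, deviation {deviation} -> {status}")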
There are sort of two levels for evaluating the outcome of clinical trials of psychopharmacologic treatments. One is statistics. That’s the FDA standard. Their charge is to make sure that a medicine has medicinal properties, that it isn’t inert like many of the patent medicines of old. And in most cases, that’s that for the FDA – p < 0.05. Clinical significance isn’t their job. A second level might be thought of as the way the Cochrane Collaboration approaches evaluation – not just is it medicinal, but how strong is it? They display and report the Effect Sizes – things like Cohen’s d, Hedges’ g, the Standardized Mean Difference, the Odds Ratio, NNT, NNH. Then they combine these strength-of-effect measures with the 95% Confidence Intervals [a probability measure] in their familiar forest plots, which I find invaluable. But what about what the subjects say? Many of the Observer-Rated Metrics have Subject Self-Rated versions that cover the same ground [HAM-D-SR, IDS-SR, QIDS-SR, etc.]. And there are others that focus on other areas of subjective experience.
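Those strength-of-effect numbers are worth being able to compute by hand, because it takes so little arithmetic. Here's a minimal Python sketch – the means, standard deviations, group sizes, and response rates are invented for illustration, not taken from any actual trial:

    import math

    # Cohen's d from summary statistics (illustrative numbers only):
    # mean change, SD, and n for the drug and placebo arms.
    mean_drug, sd_drug, n_drug = -8.3, 6.0, 100
    mean_plac, sd_plac, n_plac = -5.2, 6.2, 100

    # Pooled standard deviation across the two arms
    sd_pooled = math.sqrt(((n_drug - 1) * sd_drug**2 + (n_plac - 1) * sd_plac**2)
                          / (n_drug + n_plac - 2))
    d = (mean_drug - mean_plac) / sd_pooled
    print(f"Cohen's d = {d:.2f}")     # ~ -0.51, a medium-sized effect

    # NNT from response rates: 1 / absolute risk difference
    resp_drug, resp_plac = 0.45, 0.30
    nnt = 1 / (resp_drug - resp_plac)
    print(f"NNT = {nnt:.1f}")         # ~ 6.7 -> treat 7 to get 1 extra responder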
When we sit in our offices, all we have to go on is what our patients have to say about what the medications are doing and how they look when they walk in the door. The scale is simple: "It really helped," "I think it might be helping," "It’s not helping." But the subject as self-rater isn’t so prominently mentioned in the published clinical reports [unless it’s a positive report]. Take, for example, the recent clinical trials of Brexpiprazole [Rexulti®] in treatment-resistant depression that I can’t seem to stop talking about. Remember that there are two sets of efficacy data – a jury-rigged set and the real data [in an appendix]. Here’s some summary info from the real data [the paper graphs the primary outcome and tabulates the secondary outcomes]:
"Brexpiprazole 3 mg showed greater efficacy than placebo (P < .05) on MADRS-defined response rate, CGI-I–defined response rate, and CGI-I at week 6 and in mean change from baseline at week 6 in CGI-S, HDRS-17, HARS, and IDS-SR."
As prescribing physicians, we have access to more information than our patients. All they get is what they see in the media [the actors in the ads]. We at least have the papers, but we have to do more these days than just read what’s handed to us. Accepting the deceptive and selective reporting in the published articles just can’t be justified in the climate of our current literature. So it behooves practitioners to go the extra mile: to make some simple calculations that regularly go missing, and to take note of metrics like the subject self-ratings that may be mentioned in the Methods but don’t make it to the Results except buried in a table.
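One of those missing calculations is the confidence interval around the effect size itself, which can be approximated from nothing more than d and the two group sizes using the standard large-sample formula. Again a sketch, continuing the invented numbers from the example above:

    import math

    # Approximate 95% CI for Cohen's d via its large-sample standard error;
    # d and the group sizes continue the illustrative example above.
    d, n_drug, n_plac = -0.51, 100, 100

    se_d = math.sqrt((n_drug + n_plac) / (n_drug * n_plac)
                     + d**2 / (2 * (n_drug + n_plac)))
    lo, hi = d - 1.96 * se_d, d + 1.96 * se_d
    print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")   # ~ [-0.79, -0.23]

If that interval crosses zero, the "significant" result in the abstract is doing less work than it appears to – which is exactly the kind of thing the published article rarely shows you.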