study 329 ix – mystic statistics…

Posted on Thursday 17 September 2015

Most of us have an incomplete knowledge of Statistical Analysis unless we’ve had formal training and hands-on experience, yet we tend to accept the output from the computer’s statistical packages as if it were dogma. In academic and commercial laboratories, we count on Statisticians [or trained SAS Programmers] to generate those abstract lettered indices that we discuss as if they’re absolutes – p, d, k, SEM, SD, NNT, OR, etc. And even the experts can’t check things with a pad and pencil. So we’re vulnerable to subtle [and even not so subtle] deceptions. In our RIAT Team’s reanalysis of Study 329, we had decided to follow the a priori protocol, which meant sticking to the protocol-defined outcome variables and ignoring those later exploratory variables [in blue in Keller et al's Table 2] as discussed earlier.
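
For orientation, most of those indices are simple arithmetic once you have the group summaries. Here’s a rough sketch in R of three of them [Cohen’s d, NNT, and the odds ratio], using made-up numbers purely for illustration – nothing here comes from Study 329:

    # Made-up summary numbers, only to show the arithmetic behind the indices
    mean_drug <- 10.2; mean_placebo <- 8.9; sd_pooled <- 8.0
    cohens_d  <- (mean_drug - mean_placebo) / sd_pooled        # effect size [d]

    resp_drug <- 0.60; resp_placebo <- 0.45                    # hypothetical responder rates
    NNT <- 1 / (resp_drug - resp_placebo)                      # number needed to treat

    odds_ratio <- (resp_drug / (1 - resp_drug)) /
                  (resp_placebo / (1 - resp_placebo))          # odds ratio [OR]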

The Study 329 protocol is clear and precise about statistical testing: parametric Analysis of Variance for the continuous variables and Logistic Regression for the categorical [yes/no] variables. They specified a model containing treatment and investigator, with contingencies for interactions between them [since I’ve already put the non-stat-savvy set to sleep, I’m going to dumb this down a bit going forward]. We noticed that our p values differed from those in both the Keller et al paper and the CSR [Full study report acute], even though our open source statistical package [R] is equivalent to their commercial package [SAS] – both available in the Secure Data Portal provided by GSK. While the results for the protocol-defined variables were not significant, the numbers should still have been close to the same. And there was something else. They were reporting statistics for Paroxetine vs Placebo and Imipramine vs Placebo, and saying that the study was not powered to test Paroxetine vs Imipramine – all pairwise comparisons. Why this was important takes a little explaining.
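
For anyone who wants to see what that protocol model amounts to in practice, here’s a minimal sketch in R [the same open source package we used]. The data frame "d" and the variable names are made-up stand-ins for illustration, not the actual Study 329 dataset names:

    # Continuous outcome: parametric ANOVA with treatment, investigator,
    # and a contingency term for a treatment-by-investigator interaction
    fit_continuous <- aov(hamd_change ~ treatment * investigator, data = d)
    summary(fit_continuous)                  # overall F-tests for each term

    # Categorical [yes/no] outcome: logistic regression with the same factors
    fit_categorical <- glm(responder ~ treatment + investigator,
                           family = binomial, data = d)
    anova(fit_categorical, test = "Chisq")   # likelihood-ratio tests for each term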

When a dataset has only two groups [as in a study of Paroxetine vs Placebo alone], pairwise statistical comparisons with something like the familiar t-test are perfectly appropriate. But when you run statistical comparisons on datasets with more than two groups, there’s a two-step process. First you test the whole dataset using an OMNIBUS statistical test like Analysis of Variance [ANOVA]. If the whole dataset shows a significant effect, then you can run pairwise tests between the various groups to find where the significance lies. But if the OMNIBUS test is not significant, it means there are no detectable differences among the groups – and that’s the end of that. The pairwise tests are immaterial no matter how they come out. Keller et al had skipped the OMNIBUS tests altogether [they’re never mentioned in the protocol, the paper, or the CSR]. Our results were the OMNIBUS statistics, and that’s why they were different. With the protocol-defined variables under consideration, it didn’t matter, since nothing was significant no matter which method you used. So the question became, "Why would they skip the OMNIBUS statistical tests?"
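
If that’s easier to follow as code than as words, here’s the two-step logic in a few lines of R. Again, "d" and its columns are hypothetical names, not the actual dataset:

    # Step 1: OMNIBUS test across all three arms [Paroxetine, Imipramine, Placebo]
    omnibus   <- aov(outcome ~ group, data = d)
    omnibus_p <- summary(omnibus)[[1]][["Pr(>F)"]][1]   # the overall F-test p value

    # Step 2: pairwise comparisons are only warranted if Step 1 is significant
    if (omnibus_p < 0.05) {
      print(TukeyHSD(omnibus))    # e.g., Tukey's HSD for the pairwise contrasts
    } else {
      message("Omnibus test not significant; the pairwise tests are moot")
    }

The point is simply that Step 2 never happens unless Step 1 clears the bar.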

Since we had decided to drop those non-protocol variables because they were declared post hoc [see the last two posts], we never ran the full statistical model analysis on them. But I remembered a spreadsheet from a rough first pass through this data when we were just getting started. The results are shown here [the OMNIBUS tests are in the far right column and all significant values are shown in red]:

The protocol-specified variables [white background] are not significant, just as Keller et al reported. But look at the non-protocol variables [gray background]. Only two were OMNIBUS-significant. And look at the columns measuring strength of effect [EFFECT SIZE, NNT, ODDS RATIO]. Except for the HAM-D DEPRESSED ITEM, those exploratory variables are pretty lame [weak]. While this was a crude first take without considering the investigator covariate, it suggests that the OMNIBUS statistics didn’t help their cause and so were conveniently ignored. That offers a plausible explanation for why they skipped the OMNIBUS statistical test altogether [in fact, it’s the only explanation I can think of].

Recalling that spreadsheet, I went back and ran the "full monty" model on these variables, and three of them came in at or near the p<0.05 wire after all: as expected, the HAM-D DEPRESSED ITEM yielded p=0.0032; but the others only barely made the cut, with HAM-D REMISSION at p=0.0504 and CGI IMPROVEMENT at p=0.0493. Those last two hover right at the edge of statistical significance and hardly seem clinically relevant. And there was something else [see below]. The LOCF dataset for K-SADS-L was very difficult to judge since it was an every-other-week metric and a number of subjects got off schedule, but for what it’s worth, I could never find the reported significance with various shots at defining the LOCF dataset. Running the full model, I got p=0.0833 for the OMNIBUS test and p=0.0662 for Paroxetine vs Placebo.
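
For the curious, LOCF just means last observation carried forward: each subject’s last available assessment stands in for the endpoint. Here’s a minimal base R sketch of how such a dataset can be built – "visits" and its columns are hypothetical, and with an every-other-week measure like K-SADS-L, the answer shifts depending on which visits you decide count as "last":

    # Build an LOCF endpoint: keep each subject's last available assessment.
    # "visits" is a hypothetical long-format data frame [subject, week, score].
    locf <- do.call(rbind, lapply(split(visits, visits$subject), function(x) {
      x[which.max(x$week), ]
    }))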

Just one more piece of techno-babble. There’s something more to say about those two minimally significant exploratory variables:

Both of the non-protocol categorical variables were significant only in week 8, suggesting to me that they were probably outliers [flukes]. And, as mentioned earlier, even if you include the rogue non-protocol exploratory variables, applying any correction for multiple variables would wipe out statistical significance for three of the four. That leaves the HAM-D DEPRESSED ITEM as the only statistically significant finding in this entire study – one question on a multi-item rating scale! So in order for Keller et al to reach the conclusion "Paroxetine is generally well tolerated and effective for major depression in adolescents," three things all had a part to play: no correction for multiple variables; redefining a priori to mean before the blind was broken rather than before the study began; and ignoring the OMNIBUS statistical test.
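
To make that multiplicity point concrete, here’s what base R’s p.adjust does to the three exploratory p values quoted above [the labels are mine, and only three of the four exploratory p values are quoted in this post]:

    # Apply standard multiplicity corrections to the quoted exploratory p values
    p_raw <- c(HAMD_depressed_item = 0.0032,
               HAMD_remission      = 0.0504,
               CGI_improvement     = 0.0493)
    p.adjust(p_raw, method = "bonferroni")   # multiplies each p value by the number of tests
    p.adjust(p_raw, method = "holm")         # a less conservative step-down correction

Only the HAM-D DEPRESSED ITEM survives either correction; the other two adjusted values land well above 0.05.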

I know these posts are TMI [too much information], so this is the end of all my number chatter. To my way of thinking, Study 329 has become a paradigm, emblematic of the widespread, subtle distortion of the tools of scientific analysis in the service of commercial gain in Clinical Trials. We wrote this RIAT paper to correct the existing scientific literature, but also to send a clear message: if you publish Clinical Trials that disseminate misinformation to physicians and patients, they might just be coming right back at you. And, in the future, with greater Data Transparency and awareness, it won’t take anything like fourteen years to make the circuit…
  1. September 17, 2015 | 6:38 PM

    Mickey, read this earlier:
    “The researchers were also able to look at only 93 of the 275 case reports, because they had insufficient time or resources. It is possible that a full re-analysis might change the overall message.” http://www.nhs.uk/news/2015/09September/Pages/Antidepressant-paroxetine-study-under-reported-data-on-harms.aspx

    Prior to this, I hadn’t heard that the re-analysis was a partial analysis.
    This surprises me – any truth in this?

  2. September 17, 2015 | 6:54 PM

    Hi Mick,

    Yes, it’s true. The CRFs were overwhelming, viewed one page at a time. I don’t think it would’ve made any difference, in that we focused on the records identified in Keller et al, drop-outs, etc. Dr. Healy’s first research assistant finally threw up her hands and he actually lost the position. Author Jo took over and put forth a herculean effort. If we could’ve printed them out, we could’ve spread the work among many raters, but we only had a limited number of portals, each tied to a single computer. I had one for efficacy, and the Bangor group had one for harms. It was just not feasible to do them all. Somewhere down the line, we’ll have something to say about that. Being the “first” anything is hard work. In response to your point, the only direction we could go would be to find even more harms. But I feel good about what we were able to get done [Jo probably ought to be knighted]…

  3. alen – September 18, 2015 | 5:20 AM

    Dr Nardo, as I understand it, your team was granted access to a GSK website through which they gathered the data, but without the right to print or download. My question is – why didn’t someone just take screenshots or photograph the computer screen and then extract the text from the images with appropriate software (for example, ABBYY FineReader)? I’m pretty sure that would be illegal, but it would be impossible to prove unless the computer was infected, kept in a monitored room, or remotely accessed by someone else.

  4. September 18, 2015 | 11:35 AM

    Thanks Mickey. Oh yes, wonderful effort all round 🙂
    Just keeping my nose to the ground to sniff out the inevitable attempts to undermine the findings of the re-analysis.
