exploratory analyses…

Posted on Monday 26 October 2015

Most of us are aware that a number of the articles in our peer-reviewed journals reporting industry-funded clinical trials of CNS drugs have been distorted to a greater or lesser degree. I think it’s important to look at how they’ve been "spun" to play up efficacy and downplay harms. The distortions were no accident, but rather deliberate and sometimes skillful moves to maximize the commercial desirability of the drugs. In our paper on Paxil Study 329 [Restoring Study 329: efficacy and harms of paroxetine and imipramine in treatment of major depression in adolescence], we didn’t speculate on such things directly, though I did address them in some blogs [study 329 vi – revisited…, study 329 vii – variable variables?…, study 329 viii – variable variables decoded…, study 329 ix – mystic statistics…, study 329 x – “it wasn’t sin – it was spin”…, study 329 xi – week 8…, and study 329 xii – premature hair loss…].

In the original article, Keller et al had elevated four post hoc exploratory variables not specified in the protocol to the status of secondary outcomes and used them as the basis for a positive conclusion. In one Rapid Response comment, Drs. Hieronymus and Eriksson argued for the legitimacy of Keller et al’s approach, particularly looking at the HAM-D Depressed Mood item [Study 329 did detect an antidepressant signal from paroxetine]. I responded on technical grounds [Response to Eriksson & Hieronymus], but one of my coauthors [Elia Abi-Jaoude] responded to the general importance of an a priori, protocol-driven analysis. I think it’s an important essay in its own right and reproduce it here for emphasis [see also POM·posity… showing how common this kind of maneuver has become]:
by Elia Abi-Jaoude
British Medical Journal 2015;351:h4320.

In scientific inquiry, there is a role for both hypothesis-driven and exploratory research. Research trainees have often heard one form or another of the saying, ‘If you torture your data long enough, it will tell you what you want to hear.’ Nevertheless, the practice of inappropriate data interrogation with the aim of obtaining a p value that is less than 0.05 pervades the literature[1–12]. Hence, prespecifying hypotheses helps protect against spurious findings, and also helps ensure that research pursuits are based on a rationale that is informed by known scientific evidence.

Nevertheless, clinical research endeavours typically demand considerable time and other resources, and the knowledge and interpretation of scientific evidence that may provide a basis for hypotheses is constantly evolving; thus, it is imperative that data from studies are adequately interrogated. The key, however, is that this is done in a transparent manner, both in terms of reporting the full extent of exploratory analyses, and in tempering interpretations of findings arising from such exploratory pursuits.

The study published by Keller and colleagues[13] is not transparent about the distinction between the prespecified outcome measures and the additional analyses carried out. In fact, one such exploratory variable is misleadingly labelled “[i.e., primary outcome measure]”[13][page 765]. An internal SKB memo describes the aim to “effectively manage the dissemination of these data in order to minimise any potential negative commercial impact”[14]. The memo noted “no plans to publish data from Study 377” [a trial similar to Study 329 with similarly negative results][14] [pdf page 1], and that “Positive data from Study 329 will be published”[14] [pdf page 5]. Despite repeated requests from us, GSK was not able to produce adequate evidence for an analytical plan outlining the rationale for the exploratory analyses. Thus, there is nothing to indicate that the additional exploratory measures came about from a renewed understanding of their scientific merits.

Professor Eriksson and Dr. Hieronymus go on to propose the “depressed mood” item as a more sensitive and appropriate measure of antidepressant efficacy than the total sum of the Hamilton Depression Rating Scale [HDRS][15]. They refer to their recently published analyses of pharmaceutical company studies of SSRIs for adult depression, in which “whereas 56% of 32 comparisons failed to reveal a significant difference between groups when HDRS sum was used as effect parameter, only 9% failed to detect a significant superiority of the active drug with respect to the ‘depressed mood’ item”[15,16]. However, whether the single “depressed mood” item is a more appropriate measure of antidepressant efficacy is debatable.

What can be made of the finding that statistical significance is reached on a single item – depressed mood – but not on the sum total of items representing the constellation of symptoms that we presently refer to as major depression[16]? Perusing the results presented in Table 2, almost all HDRS endpoint mean scores for the depressed mood item fall between ratings ‘1’ and ‘2’: the placebo arms’ mean scores are closer to ‘2’, i.e., ‘spontaneously reported verbally’, and the SSRI arms’ mean scores are closer to ‘1’, i.e., ‘indicated only on questioning’[16][Table 2]. This finding could be readily explained by SSRI-induced apathy, a common yet underappreciated effect of these drugs[17–23]. Thus, patients experiencing SSRI-induced apathy could be less likely to spontaneously report a depressed mood than patients on placebo, even while there is no substantial difference between the two in terms of their overall symptoms of depression.

Furthermore, while the effect size based on this single “depressed mood” item is described as moderate, the change of much greater magnitude is that of the endpoint versus baseline mean scores, for both the SSRI and placebo arms[16][Table 2]. This highlights the important role of placebo and non-specific factors in the SSRI response. The additional effect from SSRI versus placebo could be partly a result of unblinding due to adverse effects. Both clinician and patient participants can tell with a high degree of accuracy whether they have been assigned to a drug or placebo arm in a trial[24,25], and this is rarely reported in clinical studies[26]. Further, adverse events have been shown to correlate with effect size in antidepressant trials[27,28]. Of note, in an early study by Thomson, whereas 43 of 68 trials showed tricyclic agents to be superior to inert placebo, only 1 out of 7 trials showed the antidepressant to be superior relative to atropine as an active placebo[28].

As an alternative to symptom-based scales, more meaningful, patient-relevant measures include those that assess function and quality of life. In Study 329, none of the protocol-defined measures of this kind showed paroxetine to be more efficacious than placebo, including the clinical global impression mean score, autonomous function checklist change, self-perception profile change, or the sickness impact profile[29].

In conclusion, while exploratory analyses can yield useful information, they can be – and very often are – used to fish for statistically significant results that are presented in a misleading manner[1–12]. It is worthwhile to explore more appropriate and meaningful alternatives to current popular measures to capture patient response to intervention. However, this necessitates full transparency, including access to clinical trial protocols and raw data. Otherwise, we will continue to subject our patients to interventions with a distorted impression of benefits and harms.
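The multiple-comparisons problem the essay describes can be made concrete with a small simulation [an illustrative sketch added here, not part of the original analysis, with arbitrary parameter choices]: if a trial in which the drug has no true effect is scored on eight outcome measures, the chance that at least one reaches p < 0.05 is roughly 1 − 0.95^8 ≈ 34%, not 5%.

```python
import math
import random

def two_sample_p(xs, ys):
    """Two-sided z-test p-value for a difference in means (known sd = 1)."""
    n = len(xs)
    z = (sum(xs) / n - sum(ys) / n) / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))

def simulate(n_trials=1000, n_per_arm=50, n_outcomes=8, alpha=0.05, seed=329):
    """Simulate null trials: drug and placebo drawn from the same distribution."""
    rng = random.Random(seed)
    first_hit = 0   # prespecified single outcome "significant"
    any_hit = 0     # at least one of the n_outcomes "significant"
    for _ in range(n_trials):
        ps = []
        for _ in range(n_outcomes):
            drug = [rng.gauss(0, 1) for _ in range(n_per_arm)]     # no true effect
            placebo = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            ps.append(two_sample_p(drug, placebo))
        if ps[0] < alpha:
            first_hit += 1
        if min(ps) < alpha:
            any_hit += 1
    return first_hit / n_trials, any_hit / n_trials

single, family = simulate()
print(f"prespecified outcome 'significant':        {single:.3f}")
print(f"at least one of 8 outcomes 'significant':  {family:.3f}")
```

The single prespecified outcome comes in near the nominal 5%, while "best of eight" comes in several-fold higher, which is why an unreported switch from protocol-specified to post hoc outcomes is not a neutral act.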
    NIH researcher
    October 27, 2015 | 4:16 PM

    This is a very interesting essay. Another consideration in drug research is that, if the study uses a washout design, the placebo group is in abrupt withdrawal from their previous medication. Anyone who has tried to get patients to stop taking SSRIs knows that withdrawal effects are not over after 10 days. So most studies are comparing abrupt withdrawal in the placebo group with getting back a drug similar to the previous one.

    So since the placebo group is not in any way getting a neutral treatment, research with a washout design has no possibility of saying anything about the efficacy of the drug.

    It is not very difficult for a drug to beat cold-turkey withdrawal. It is therefore extremely interesting that the difference between drug and placebo is so small in most studies.
    One of the reasons why Study 329 did not get any good results may be that this young population had not taken any drugs prior to the study, so it did not compare cold-turkey withdrawal with getting back on a similar drug.
