Most of us are aware that a number of the articles in our peer-reviewed journals reporting industry-funded clinical trials of CNS drugs have been distorted to a greater or lesser degree. I think it’s important to look at how they’ve been "spun" to play up efficacy and downplay harms. The distortions were no accident, but rather deliberate and sometimes skillful moves to maximize the commercial desirability of the drugs. In our paper on Paxil Study 329 [Restoring Study 329: efficacy and harms of paroxetine and imipramine in treatment of major depression in adolescence], we didn’t speculate on such things directly, though I did address them in some blogs [study 329 vi – revisited…, study 329 vii – variable variables?…, study 329 viii – variable variables decoded…, study 329 ix – mystic statistics…, study 329 x – “it wasn’t sin – it was spin”…, study 329 xi – week 8…, and study 329 xii – premature hair loss…].
Restoring Study 329: efficacy and harms of paroxetine and imipramine in treatment of major depression in adolescence: Response to Drs. Eriksson and Hieronymus, by Elia Abi-Jaoude. British Medical Journal 2015; 351:h4320.
In scientific inquiry, there is a role for both hypothesis-driven and exploratory research. Research trainees have often heard one form or another of the saying, ‘If you torture your data long enough, it will tell you what you want to hear.’ Nevertheless, the practice of inappropriate data interrogation with the aim of obtaining a p value that is less than 0.05 pervades the literature[1–12]. Hence, prespecifying hypotheses helps protect against spurious findings, and also helps ensure that research pursuits are based on a rationale that is informed by known scientific evidence.
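The arithmetic behind this warning is easy to demonstrate. As a minimal sketch (the number of endpoints is a hypothetical choice for illustration, not a figure from Study 329): if a trial tests 20 independent outcome measures on a drug with no true effect, the chance that at least one comes out "significant" at p < 0.05 is about 64%, since under the null hypothesis p-values are uniformly distributed.

```python
import random

# Under the null hypothesis, a p-value is uniformly distributed on [0, 1],
# so each test has an ALPHA chance of a spurious "significant" result.
# Simulate many trials, each testing N_OUTCOMES independent endpoints,
# and count how often at least one endpoint reaches p < ALPHA by chance.
random.seed(0)

N_OUTCOMES = 20      # hypothetical number of exploratory endpoints
N_TRIALS = 100_000   # number of simulated trials
ALPHA = 0.05

false_positive_trials = sum(
    any(random.random() < ALPHA for _ in range(N_OUTCOMES))
    for _ in range(N_TRIALS)
)

rate = false_positive_trials / N_TRIALS
print(f"P(at least one 'significant' endpoint) ~= {rate:.3f}")
# Analytically: 1 - 0.95**20 = 0.642
```

The simulated rate converges on the closed-form value 1 − 0.95²⁰ ≈ 0.64, which is why unreported exploratory testing so reliably produces publishable-looking p-values from null data.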
At the same time, clinical research endeavours typically demand considerable time and other resources, and the knowledge and interpretation of scientific evidence that may provide a basis for hypotheses is constantly evolving; thus, it is imperative that data from studies are adequately interrogated. The key, however, is that this is done in a transparent manner, both in terms of reporting the full extent of exploratory analyses, and in tempering interpretations of findings arising from such exploratory pursuits.
The study published by Keller and colleagues is not transparent about the distinction between the prespecified outcome measures and the additional analyses carried out. In fact, one such exploratory variable is misleadingly described as a “[i.e., primary outcome measure]”[page 765]. An internal SKB memo describes the aim to “effectively manage the dissemination of these data in order to minimise any potential negative commercial impact”. It described “no plans to publish data from Study 377” [a trial similar to Study 329 with similarly negative results] [pdf page 1], and that “Positive data from Study 329 will be published” [pdf page 5]. Despite repeated requests from us, GSK was not able to produce adequate evidence for an analytical plan outlining the rationale for the exploratory analyses. Thus, there is nothing to indicate that the additional exploratory measures came about from a renewed understanding of their scientific merits.
Professor Eriksson and Dr. Hieronymus go on to propose the “depressed mood” item as a more sensitive and appropriate measure of antidepressant efficacy than the total sum of the Hamilton Depression Rating Scale [HDRS]. They refer to their recently published analyses of pharmaceutical company studies of SSRIs for adult depression, in which “whereas 56% of 32 comparisons failed to reveal a significant difference between groups when HDRS sum was used as effect parameter, only 9% failed to detect a significant superiority of the active drug with respect to the “depressed mood” item”[15,16]. However, whether the single “depressed mood” item is a more appropriate measure of antidepressant efficacy is debatable.
What can be made of the finding that statistical significance is reached on a single item – depressed mood – but not on the sum total of items representing the constellation of symptoms that we presently refer to as major depression? A perusal of the results shows that almost all HDRS endpoint mean scores for the depressed mood item fall between ratings ‘1’ and ‘2’: the placebo arms’ mean scores are closer to ‘2’, i.e., ‘spontaneously reported verbally’, and the SSRI arms’ mean scores are closer to ‘1’, i.e., ‘indicated only on questioning’[Table 2]. This finding could be readily explained by SSRI-induced apathy, a common yet underappreciated effect of these drugs[17–23]. Thus, patients experiencing SSRI-induced apathy could be less likely to spontaneously report a depressed mood than patients on placebo, all the while there is no substantial difference between the two in terms of their overall symptoms of depression.
Furthermore, while the effect size based on this single “depressed mood” item is described as moderate, the change of much greater magnitude is that of the endpoint versus baseline mean scores, for both the SSRI and placebo arms[Table 2]. This highlights the important role of placebo and non-specific factors in the SSRI response. The additional effect of SSRI over placebo could be partly a result of unblinding due to adverse effects. Both clinician and patient participants can tell with a high degree of accuracy whether they have been assigned to the drug or placebo arm in a trial[24,25], yet this is rarely reported in clinical studies. Further, adverse events have been shown to correlate with effect size in antidepressant trials[27,28]. Of note, in an early study by Thomson, whereas 43 of 68 trials showed tricyclic agents to be superior to inert placebo, only 1 of 7 trials showed the antidepressant to be superior to atropine as an active placebo.
As an alternative to symptom-based scales, more meaningful, patient-relevant measures include those that assess function and quality of life. In Study 329, none of the protocol-defined measures of this kind showed paroxetine to be more efficacious than placebo, including the clinical global impression mean score, autonomous function check list change, self perception profile change, or the sickness impact profile.

In conclusion, while exploratory analyses can yield useful information, they can be – and very often are – used to fish for statistically significant results that are presented in a misleading manner[1–12]. It is worthwhile to explore more appropriate and meaningful alternatives to current popular measures to capture patient response to intervention. However, this necessitates full transparency, including access to clinical trial protocols and raw data. Otherwise, we will continue to subject our patients to interventions with a distorted impression of benefits and harms.