paxil study 352: sins of omission…

Posted on Wednesday 5 December 2012

Randomized Double-Blind Placebo-Controlled Clinical Trials are nothing new. They’ve been around for a long time and there are conventions about how they’re conducted and reported. The outcome parameter is measured at intervals over the duration of the study, then the results for the placebo and the treatment arms are displayed graphically [or sometimes in a table], usually with some method to deal with drop-outs [LOCF or a Mixed Model], and placebo vs treatment is tested statistically at each interval. Often, they compare response and/or remission rates against predefined criteria. Some studies also calculate effect size, number needed to treat, or the odds ratio as a measure of the strength of the drug effect. There’s one more up-front calculation of note: the Clinical Trial mavens estimate in advance how many subjects will be needed to detect the expected effect one way or the other – the power calculation for the trial.
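Those strength-of-effect measures are just arithmetic once the response rates are in hand. A minimal sketch, using made-up response rates purely for illustration [not data from Study 352]:

```python
# Illustration only: made-up response rates, not data from Study 352.
# Given a treatment response rate and a placebo response rate, the usual
# strength-of-effect measures are simple arithmetic.

def effect_measures(p_treat, p_placebo):
    """Return (risk difference, number needed to treat, odds ratio)."""
    risk_diff = p_treat - p_placebo                       # absolute risk difference
    nnt = 1.0 / risk_diff if risk_diff else float("inf")  # number needed to treat
    odds_treat = p_treat / (1.0 - p_treat)
    odds_placebo = p_placebo / (1.0 - p_placebo)
    return risk_diff, nnt, odds_treat / odds_placebo

rd, nnt, oratio = effect_measures(0.50, 0.35)   # hypothetical 50% vs 35%
print(f"risk difference: {rd:.2f}")             # 0.15
print(f"number needed to treat: {nnt:.1f}")     # ~6.7
print(f"odds ratio: {oratio:.2f}")              # ~1.86
```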

So, back to Paxil Study 352.  Paxil Study 352 is unique [paxil study 352 revisited…] in that there is neither graph nor table showing the data over time. In fact, the whole article follows few of the conventions of Clinical Trial reporting – visible from across the room: one table [spread over two pages] and one gratuitous graph [Lithium Levels?]:
[Table 1 and the Lithium Levels graph from the article were displayed here]

So what about displayed graphically, tested statistically, response and/or remission rates, effect size, number needed to treat, odds ratio?

On first reading, I missed that they even described when they measured outcomes, but it’s there [briefly]:
During the 10-week study period, patients were assessed for both efficacy and adverse events at baseline and at weeks 1–6, 8, and 10.
There was no presentation of the sequential data – no graph, no table. They had an odd primary outcome variable: the HAM-D and CGI scores on the last available observation [at the time of drop-out or the end of the study]. That’s what the Table was about. Since over a third dropped out and there’s no notation of when, showing the means and standard deviations of that data is pretty odd. In Clinical Trials, these scores always decline over time, so your guess is as good as mine what an average of this type means [since the scores are obviously a function of the missing variable – time]:
Mean changes in score on the Hamilton depression scale and CGI severity of illness scale from baseline to endpoint for the paroxetine and imipramine groups were not significantly different than those of the placebo treated group [Table 1].
They defined response as HAMD≤7 or CGI≤2.0 and there was no significant difference for either treatment:
For the total intent-to-treat population, there were no statistically significant differences in response rates among those receiving paroxetine, imipramine, or placebo (per Hamilton criterion: 45.5% [N=15], 38.9%, [N=14], and 34.9% [N=15], respectively; per CGI criterion: 54.5% [N= 18], 58.3% [N=21]; and 46.5% [N=20]). Among the study completers, Hamilton depression scale scores ≤7 were achieved by 56.0% (N=14 of 25) of the paroxetine-treated patients, 47.8% (N=11 of 23) of the imipramine-treated patients, and 53.8% (N=14 of 26) of the placebo-treated patients. Similarly, CGI global improvement scores ≤2 were achieved by 68.0% (N=17) of the paroxetine-treated patients, 73.9% (N=17) of the imipramine-treated patients, and 69.2% (N=18) of the placebo-treated patients.
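Those numbers can be checked directly. A quick sketch, taking the intent-to-treat Hamilton-criterion responder counts from the quote and the arm sizes implied by the quoted percentages [45.5% of 33, 38.9% of 36, 34.9% of 43], and running a garden-variety chi-square test:

```python
# Responder / non-responder counts per arm, using the denominators implied
# by the quoted intent-to-treat Hamilton-criterion response rates.
from scipy.stats import chi2_contingency

table = [
    [15, 33 - 15],  # paroxetine: 45.5% of 33
    [14, 36 - 14],  # imipramine: 38.9% of 36
    [15, 43 - 15],  # placebo:    34.9% of 43
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.2f}")  # nowhere near p<0.05
```

No surprise: the test agrees with the paper that the response rates don’t differ.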
So here we are at the midway point. What we have is a negative study with no display of the scores vs time, an odd [but not significant] primary outcome variable, and a nonsignificant secondary outcome variable [response]. The strength of the effect is not calculated because there’s no effect to calculate the strength of. Looks to me like if you’re Bipolar, stabilized on Lithium, and you get depressed, neither Imipramine nor Paroxetine is right for you.

While there’s a Result Summary for Study 352 in the GSK Clinical Trials Register, the Protocol is missing. I was able to locate a copy, but it’s a later version, having been amended four times [two times after the study started]. It has this to say about the stratification by Lithium Levels:
[the Protocol’s stratification language was displayed here as an image]

So they separated the subjects into Lithium Level Strata prior to Randomization in order to ensure that the Placebo and Treatment Groups weren’t skewed by Lithium Level. Good for them. There was no further mention of the Lithium Level Strata in the Protocol. As we know [paxil study 352 revisited…, paxil study 352 – what’s ghost-writing?, paxil study 352 – more about ghost-writing…], they then separated their results using this Lithium Stratification to compare the Low Lithium Group [≤0.8] to the High Lithium Group [>0.8]. This comparison was not mentioned in the protocol. In the paper, they said:
The group was stratified on the basis of serum lithium level at the screening examination (high: >0.8 meq/liter, low: ≤0.8 meq/liter). Lithium stratification criteria were determined a priori. The proportion of patients achieving dichotomous response was analyzed by the Cochran-Mantel-Haenszel test adjusting for lithium stratification or by Fisher’s exact test. The chi-square test was used for analyses within lithium strata.
This is our first piece of sleight of hand. The highlighted a priori implies that the Lithium Stratification they used to re-analyze their results was determined before the Study was done as part of the analytic plan. That’s literally true as to timing, but the stratification was simply to ensure parity among the groups at Randomization, not for data analysis. You wouldn’t know that without the Protocol in hand. So it is a post hoc [after the fact] analysis, as pointed out in the recent article by Amsterdam and McHenry [below].

So, as noted previously, they separated and compared the subjects by Lithium Level Strata [paxil study 352 revisited…], and reported significance [colored pink in Table 1 above]. By separating their subjects into High and Low Lithium Strata, they recorded a significant difference from Placebo for both Paroxetine and Imipramine in the Low Lithium Stratum in their endpoint HAMD and CGI scores. But there’s something not quite right with that finding either. For starters, there was no significant difference in the response rate:
Among patients with high serum lithium levels, similar response rates were noted among those receiving paroxetine, imipramine, or placebo (per Hamilton criterion: 35.7% [N=5], 41.2%, [N=7], and 38.1% [N=8], respectively; per CGI criterion: 57.1% [N=8], 47.1% [N=8]; and 52.4% [N=11]). For those with low serum lithium levels, no statistically significant differences in response rates were seen among those receiving paroxetine, imipramine, or placebo (per Hamilton criterion: 52.6% [N=10], 36.8%, [N=7], and 31.8% [N=7], respectively; per CGI criterion: 52.6% [N=10], 68.4% [N=13]; and 40.9% [N=9]).
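The paper says the chi-square test was used for analyses within lithium strata, and that claim is easy to replicate. A sketch for the Hamilton-criterion response rates in the Low Lithium stratum, with denominators implied by the quoted percentages [52.6% of 19, 36.8% of 19, 31.8% of 22]:

```python
# Low-Lithium-stratum responder counts, using the denominators implied by
# the quoted percentages (52.6% [N=10] -> 19, 36.8% [N=7] -> 19,
# 31.8% [N=7] -> 22).
from scipy.stats import chi2_contingency

low_stratum = [
    [10, 19 - 10],  # paroxetine
    [7, 19 - 7],    # imipramine
    [7, 22 - 7],    # placebo
]
chi2_low, p_low, dof_low, _ = chi2_contingency(low_stratum)
print(f"chi2 = {chi2_low:.2f}, p = {p_low:.2f}")  # nonsignificant, as reported
```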
And for that matter, the statistical test that should pick up an effect of the Lithium Level showed no significance:
The general linear model procedure of SAS (Cary, N.C.) was used to perform the analysis with a model that included effects for treatment and lithium strata for scores on the Hamilton depression scale (first 17 items) and CGI severity of illness scale… The treatment-by-lithium strata interaction was found to be nonsignificant and was not included in the model.
So, as pointed out by Amsterdam and McHenry, there was no statistical justification for testing within the strata because the treatment-by-lithium interaction showed no significance. Undaunted, they did it anyway. But since they added another set of comparisons [the Lithium Level Strata], they should’ve used the Bonferroni correction for multiple comparisons, but:
Because all other statistical comparisons were considered to be secondary, no adjustments for multiple comparisons were made. Therefore, the achievement of statistical significance for the primary efficacy variables at endpoint (i.e., changes from baseline in scores on the Hamilton depression scale and CGI severity of illness scale) was set at p≤0.05.
… as if the correction should be based on their intent rather than the fact that they made multiple comparisons [a remarkable bit of illogic]. As Amsterdam and McHenry and I pointed out [paxil study 352 revisited…], the more logical explanation for why they left out the correction is that it would have eliminated their meager significance and left them with a study that was as solidly negative as it really was. Amsterdam and McHenry added something only pros would’ve seen – the study failed recruitment and was badly underpowered:
The paroxetine 352 bipolar trial: A study in medical ghostwriting
by Jay D. Amsterdam and Leemon B. McHenry
International Journal of Risk & Safety in Medicine 2012 24:221–231.

The original protocol sample size estimate of 0.9 (1-β) or 62 subjects per treatment group was officially amended downward to 0.8 (1-β) or 46 subjects per group during the study. The latter value was the sample size described in the GSK Clinical Trials Website Result Summary. No explanation was provided for this change in sample size in the amended protocol. However, we suspect that this reduction in power might have resulted from exceedingly slow subject recruitment into the study, which ultimately led GSK to add a 19th investigative site. By the time GSK decided to halt subject enrollment prematurely and terminate the study, only 117 (of the originally projected 186 subjects) were enrolled, resulting in final sample sizes for paroxetine (n = 35), imipramine (n = 39), and placebo (n = 43). By the time the study was published in June 2001 in the American Journal of Psychiatry, however, the declared sample size estimate had again changed with the article stating: “The study was designed (sic) to enroll 35 patients per arm, which would allow 70% power to detect a 5-point difference on the Hamilton depression scale score (SD = 8.5) between treatment groups”.

Although the published article noted that statistical power was estimated at only 70%, the article did not inform the reader that this value represented an unconventionally low power for a clinical trial. The article did not inform the reader that the original power estimate was 62 subjects per group or that the original power estimate had been officially reduced during the study. Moreover, the article made no mention of the fact that the final power estimate was determined after the study was completed, and that this post hoc power estimate most likely occurred as an ‘extra-regulatory’ protocol change in order to allow the final sample size estimate of 35 subjects per group to comport with the final sample size of the paroxetine group (i.e., n = 35). The published article failed to acknowledge clearly that the study failed to recruit the projected sample size necessary to test the primary study hypothesis, and only hinted by its published sample size estimate that the study had insufficient statistical power to test the primary study aims.
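The sample-size figures in that quote hold up under a standard two-sample power calculation. A sketch using statsmodels [assuming the conventional two-sided α of 0.05, which the quote doesn’t state], reproducing the 62, 46, and ~70% numbers from the 5-point difference and SD of 8.5:

```python
# Checking the quoted sample-size figures with a standard two-sample
# power calculation: 5-point HAM-D difference, SD = 8.5, two-sided
# alpha = 0.05 (the alpha is an assumption; the quote doesn't state it).
from statsmodels.stats.power import TTestIndPower

d = 5 / 8.5                 # standardized effect size, ~0.59
analysis = TTestIndPower()

# Original protocol: 90% power -> ~62 subjects per group
n_90 = analysis.solve_power(effect_size=d, power=0.90, alpha=0.05)
# Amended protocol: 80% power -> ~46 subjects per group
n_80 = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
# Published article: 35 per group -> ~70% power
power_35 = analysis.solve_power(effect_size=d, nobs1=35, alpha=0.05)

print(f"n per group for 90% power: {n_90:.0f}")      # ~62
print(f"n per group for 80% power: {n_80:.0f}")      # ~46
print(f"power with n=35 per group: {power_35:.2f}")  # ~0.70
```

In other words, each successive “power estimate” was just the arithmetic consequence of the shrinking sample, worked backwards.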
In case you haven’t noticed, this is an abysmal article that’s published in one of the primo psychiatric journals in the world – the American Journal of Psychiatry. I summarized its shortcomings again after the holiday and a trip out of town so I can pick back up where I left off – an exploration of ghost-writing. But that’s enough for now. More to follow…
