Statistical errors in neuroscience: how a mouse turned into an elephant

Analyzing a large corpus of the neuroscience literature, we found that the same statistical error – comparing effect sizes by comparing their significance levels – appears throughout even the most prestigious journals in neuroscience.

Early in 2011 I reviewed two manuscripts for Nature Neuroscience. In both manuscripts some of the main conclusions were based on a statistical error, which I realized I had seen many times before: the researchers wanted to claim that one effect (for example, a practice effect on neural activity in mutant mice) was larger or smaller than the other effect (the practice effect in control mice). To support this claim, they needed to report a statistically significant interaction (between amount of practice and type of mice), but instead they reported that one effect was statistically significant, whereas the other effect was not. Although superficially compelling, the latter type of statistical reasoning is erroneous because the difference between significant and not significant need not itself be statistically significant.
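The fallacy is easy to reproduce in simulation. The sketch below uses purely hypothetical numbers (not data from any study): two practice effects of similar true size are drawn with small samples, so one effect may clear the p < .05 threshold while the other does not, even though the correct test – a direct comparison of the two effects, i.e. the interaction – finds no reliable difference between them.

```python
# Hypothetical illustration of "significant vs. not significant" reasoning.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20  # small samples, as is common in such experiments

# Practice effects (gain scores) in two groups; the true effects are similar.
mutant_gain = rng.normal(0.5, 1.0, n)    # practice effect in "mutant" mice
control_gain = rng.normal(0.25, 1.0, n)  # practice effect in "control" mice

# Each effect tested separately against zero (the tempting but wrong route):
t_mut, p_mut = stats.ttest_1samp(mutant_gain, 0.0)
t_con, p_con = stats.ttest_1samp(control_gain, 0.0)

# The correct question: do the two effects differ from EACH OTHER?
t_diff, p_diff = stats.ttest_ind(mutant_gain, control_gain)

print(f"mutant effect vs. zero:   p = {p_mut:.3f}")
print(f"control effect vs. zero:  p = {p_con:.3f}")
print(f"difference of effects:    p = {p_diff:.3f}")
```

Depending on the random draw, the first test may be significant and the second not, while the third – the only one that licenses a claim about a *difference* between effects – is far from significant.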

I decided to ask the editor if Nature Neuroscience would be interested in publishing a brief ‘letter to the editor’ in which I would point out this common mistake. To my surprise, she wrote back to say she would prefer a full-length opinion article if an extensive literature analysis confirmed my claim about the ubiquity of the error. Wow, a rare opportunity to write an article for a Nature journal—and about statistics, something I had never written about! Quite a challenge. I soon realized I needed to ask some colleagues, one of them a statistical expert, to help me carry out the literature analysis.

Together, we reviewed 513 recent neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience), and found that 78 used the correct procedure and 79 used the incorrect procedure: the error was even more common than I’d expected. In our article we reported the literature analysis and various scenarios in which the erroneous procedure is particularly alluring. The reviewers of our manuscript understood the urgency of the topic and supported publication. In September 2011, the article was published.

What we had not anticipated was the huge amount of attention our article would receive outside academia. Our research was featured in several prominent European newspapers, in less prominent newspapers such as de Pers and the Leidsch Dagblad, in numerous science blogs, and in countless Twitter messages. I just googled the exact title of our article and received 75,000 hits. The article was well-timed, given that the media and general public have become increasingly interested in science and, in particular, bad science… But who would have thought an article on statistics would ever draw so much attention? It was certainly my first experience with massive media attention, which, I realized, requires rather different skills than those I'd needed so far in my career.


Annelies de Haan

[In response to Micheal on February 11, 2013 at 17:51]:
"Nice article, but how do 78 and 79 add up to 513?"

Nicely spotted, but presumably not all of the articles examined needed to make this comparison (in such a way that an interaction SHOULD have been reported, whether it was or wasn't). To quote the article in question: "In 157 of these 513 articles (31%), the authors describe at least one situation in which they might be tempted to make the error." (Nieuwenhuis, Forstmann, & Wagenmakers, 2011, pg. 1105).


"Together, we reviewed 513 recent neuroscience articles ... and found that 78 used the correct procedure and 79 used the incorrect procedure"

Nice article, but how do 78 and 79 add up to 513?

Eefje Poppelaars

I really enjoyed reading this; an inside look at how one could get to publish in Nature.

Mark de Rooij

[In reaction to 'confused' on February 7, 2013 at 17:35]

Well, you are seriously confused!

In your set-up, where you want to see whether treatments A and B differ, there cannot be an interaction (an interaction of group (A/B) with what?).

So the actual question is not whether groups A and B differ but whether the relationship between variable X and Y is different for groups A and B. To answer this question you need an interaction and you cannot just look at the relationship between X and Y separately for groups A and B.
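In the X–Y case described above, the interaction can be tested in a single regression: include group, X, and their product, and test the product term. Below is a minimal sketch using the statsmodels formula interface; all variable names and numbers are hypothetical, chosen only to illustrate the test.

```python
# Hypothetical data in which the X-Y slope differs between groups A and B.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 50  # per group

x = rng.normal(size=2 * n)
group = np.repeat([0, 1], n)  # 0 = group A, 1 = group B
# True slopes: 0.2 in group A, 0.8 in group B (interaction of 0.6).
y = 0.2 * x + 0.6 * x * group + rng.normal(size=2 * n)

df = pd.DataFrame({"x": x, "group": group, "y": y})

# "y ~ x * group" expands to x + group + x:group;
# the x:group coefficient tests whether the slope differs by group.
fit = smf.ols("y ~ x * group", data=df).fit()
print(fit.pvalues["x:group"])
```

Fitting X–Y regressions separately per group and noting that one slope is significant while the other is not would be exactly the error the article describes; the x:group term is the test that actually addresses the question.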

To do justice to science, I think it is appropriate to say that the idea comes from Gelman and Stern in their paper "The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant", The American Statistician, 2006, pp. 328-331.


Why would you need to use an interaction? Why can't you just compare the two? For example, by testing whether the effect of treatment A (taken as a baseline) is statistically significantly different from the effect of treatment B.
A test for an interaction effect would do the job, but a direct comparison is, well, more direct...