Saturday, 15 September 2012

Must we really accept a 1-in-20 false positive rate in science?

There has been some very interesting and extremely important discussion recently addressing a fundamental problem in science: can we believe what we read?

After a spate of high-profile cases of scientific misdemeanours and outright fraud (see Alok Jha's piece in the Guardian), people are rightly looking for solutions to restore credibility to the scientific process [e.g., see Chris Chambers and Petroc Sumner's Guardian response here].

These include more transparency (especially pre-registering experiments), encouraging replication, promoting the dissemination of null effects, shifting career rewards from new findings (neophilia) to genuine discoveries, abolishing the cult of impact factors, etc. All these are important ideas, and many are more or less feasible to implement, especially with the right top-down influence. However, it seems to me that one of the most basic problems is staring us right in the face, and would require absolutely no structural change to correct. The fix is as simple as re-drawing a line in the sand.

Critical p-value: line in the sand

Probability estimates are inherently continuous, yet we typically divide our observations into two classes: significant (i.e., true, real, bona fide, etc.) and non-significant (i.e., the rest). This reduces the mental burden of assessing experimental results - all we need to know is whether an effect is real, i.e., whether it passes a statistical threshold. And so there are conventions, the most widely used being p<.05. If our statistical test yields a p-value below 5%, we may assert that our conclusion is justified. Ideally, this threshold ensures that when there is no real effect, we will wrongly declare one no more than one time in twenty. But turn this around, and it also means that at worst, one in every twenty such assertions could be a false positive (about the same odds as being awarded a research grant in the current climate). Those already seem pretty high odds for accepting false positive claims in science. Worse still, this is only the ideal theoretical case. There are many dubious scientific practices that dramatically inflate the false positive rate, such as cherry picking and peeking during data collection (see here).
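To see how much peeking alone can inflate the nominal 5% rate, here is a minimal simulation sketch (the function names and parameter values are illustrative choices of mine, not from any particular study): every experiment samples pure noise, but one strategy tests once at the end, while the peeking strategy runs a test after every ten data points and stops at the first p<.05.

```python
import math
import random

random.seed(1)

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_experiments=2000, max_n=100, looks=(100,), alpha=0.05):
    """Fraction of null experiments declared 'significant' at any look.
    The data have a true mean of zero, so every hit is a false positive."""
    hits = 0
    for _ in range(n_experiments):
        data = [random.gauss(0, 1) for _ in range(max_n)]
        for n in looks:
            z = sum(data[:n]) / math.sqrt(n)  # z-test with known sd = 1
            if two_sided_p(z) < alpha:
                hits += 1
                break
    return hits / n_experiments

honest = false_positive_rate(looks=(100,))               # one test, at the end
peeking = false_positive_rate(looks=range(10, 101, 10))  # test after every 10 points
print(f"single look: {honest:.3f}, peeking: {peeking:.3f}")
```

In runs like this, the single-look strategy hovers around the nominal 5%, while peeking pushes the false positive rate several times higher - without a single fabricated data point.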

These kinds of fishy goings-on leave statistical anomalies, such as the preponderance of just-significant effects reported in the literature (see here for a blog review of an empirical paper). Although it is difficult to estimate the true false positive rate out there, it can only be higher than the ideal one-in-twenty rate assumed by our statistical convention. So, even before worrying about outright fraud, it is quite likely that many of the results we read in the peer-reviewed literature are in fact false positives.

Boosting the buffer zone

The obvious solution is to tighten the accepted statistical threshold. Take physics, for example. Those folk only accept a new particle into their textbooks if the evidence reaches a statistical threshold of 5 sigma (i.e., p<0.0000003). Although the search for the Higgs boson involved plenty of peeking along the way, at 5 sigma the resultant inflation of the false positive rate hardly matters. We can still believe the effect. A strict threshold provides a more comfortable buffer between false positive and true effect. Although there are good and proper ways to correct for peeking, multiple comparisons, etc., all of these assume full disclosure. It would clearly be safer just to adopt a conservative threshold. Perhaps not one quite as heroic as 5 sigma (after all, we aren't trying to find the God particle), but surely we can do better than one-in-twenty as the minimal, ideal false positive rate.
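For reference, the sigma convention maps onto p-values via the tail of the standard normal distribution. A small sketch (the function name is mine) recovers the figure quoted above - 5 sigma, one-sided, comes out at about 0.0000003:

```python
import math

def sigma_to_p(sigma, two_sided=False):
    """p-value for an observation `sigma` standard deviations above the
    null mean (one-sided by default, as in the particle physics convention)."""
    tail = 0.5 * math.erfc(sigma / math.sqrt(2))  # upper-tail area beyond sigma
    return 2 * tail if two_sided else tail

for s in (2, 3, 5):
    print(f"{s} sigma -> p ~ {sigma_to_p(s):.1e}")
```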

Too conservative?

Of course, tightening the statistical threshold would necessarily increase the number of failures to detect a true effect, so-called type II errors. However, it is probably fair to say that most fields in science are suffering more from false positives (type I errors) than type II errors. False positives are more influential than false negatives, and harder to dispel. In fact, we are probably more likely to consider a null effect as a real effect cloaked in noise, especially if there is already a false positive lurking about somewhere in the published literature. It is notoriously difficult to convince your peers that your non-significant test indicates a true null effect. Increasingly, Bayesian methods are being developed to test for sameness between distributions, but this is another story.

The main point is that we can easily afford to be more conservative when bestowing statistical significance on putative effects, without stifling scientific progress. Sure, it would be harder to demonstrate evidence for really small effects, but not impossible if they are important enough to pursue. After all, the effect that betrayed the Higgs particle was very small indeed, but that didn't stop them from finding it. Valuable research could focus on validating trends of interest (i.e., strongly predicted results), rather than chasing down the next new positive effect and leaving a catalogue of potentially suspect "significant effects" in its wake. Science cannot progress as a house of cards.

Too expensive?

Probably not. Currently, we are almost certainly wasting research money chasing down the dead ends opened up by false positives. A smaller but more reliable corpus of results would almost certainly increase the productivity of many scientific fields. At present, the pressure to publish has precipitated a flood of peer-reviewed scientific papers reporting any number of significant effects, many of which will almost certainly not stand the test of time. It would seem a far more sensible use of resources to focus on producing fewer, but more reliable, scientific manuscripts. Interim findings and observations could be made readily available via any number of suggested internet-based initiatives. These more numerous 'leads' could provide a valuable source of possible research directions, without yet entering the venerable category of immutable (i.e., citable) scientific fact. Like conference proceedings, they could retain a more provisional status until they are robustly validated.

Raise the bar for outright fraud

Complete falsification is hard to detect in the absence of actual whistleblowers. In Simonsohn's words: "outright fraud is somewhat impossible to estimate, because if you're really good at it you wouldn't be detectable" (from Alok Jha). Even publishing the raw data is no guarantee of catching out the fraudster, as there are clever ways to generate plausible-looking data sets that would pass veracity testing.

However, fraudsters presumably start their life of crime in the grey area of routine misdemeanour: a bit of peeking here, some cherry picking there, before actually making up data points. Moreover, they know that even if their massaged results fail to replicate, the benefit of the doubt should reasonably allow them to claim to be unwitting victims of an innocent false positive. After all, at p<0.05 there is already a 1-in-20 chance of a false positive, even if you do everything to the letter!

Like rogue traders, scientific fraudsters presumably start with a small, spur-of-the-moment act that they reasonably believe they can get away with. If we increase the threshold that needs to be crossed, fewer unscrupulous researchers will be tempted down the dark and ruinous path of scientific fraud. And if they did, it would be much harder for them to claim innocence after their 5 sigma results fail to replicate.

Why impose any statistical threshold at all?

Finally, it is worth noting the argument that the statistical threshold should be abolished altogether. Maybe we should be more interested in the continua of effect sizes and confidence intervals than in discrete hypothesis testing [e.g., see here]. I have a lot of sympathy for this argument. A more quantitative approach to inferential statistics would more accurately reflect the continuous nature of evidence and certainty, and would also suit meta-analyses more readily. However, it is also useful to have a standard against which we can hold up putative facts for the ultimate test: true or false.