Naomi Oreskes has an op-ed in The New York Times today which sets out to defend the severe standards of evidence scientists employ, with special applicability to climate science and its account of causation (greenhouse gases produce radiative forcing), attribution (most of the increase in greenhouse gases is due to human sources, principally extraction and burning of fossil fuels), and effects (continuing emissions without drastic mitigation will produce severe health, social, economic, political, and military effects). Alas, Dr Oreskes centers her argument on significance testing and, in doing so, from one perspective weakens her illustration to the point of evaporation and, from another, reinforces a mode of statistical thought which is downright harmful, especially in medicine.
Consider a typical medical research study, for example one designed to test the efficacy of a drug, in which a null hypothesis H0 ('no effect') is tested against an alternative hypothesis H1 ('some effect'). Suppose that the study results pass a test of statistical significance (that is, the p-value is less than 0.05) in favor of H1. What has been shown?
1. H0 is false.
2. H1 is true.
3. H0 is probably false.
4. H1 is probably true.
5. Both (1) and (2).
6. Both (3) and (4).
7. None of the above.
This question was actually posed as part of a study conducted by Westover, Westover, and Bianchi on how medical specialists draw such inferences. (See their paper for details.) They obtained these results in their sample:
Quiz answer profile.
Answer    (1)   (2)   (3)   (4)   (5)   (6)   (7)
Number      8     0    58    37     6    69    12
Percent   4.2     0  30.5  19.5   3.2  36.3   6.3
Lest medical scientists appear to be singled out, use of such tests is also common in climate geophysics. The prevalence of such statistical practice is one reason why Ioannidis could write “Why most published research findings are false”.
What’s the answer? It is (7), none of the above. As Westover, Westover, and Bianchi’s Figure 8 shows, knowing only that a test at the 0.05 significance level produced a p-value less than 0.05 says very little about what the result means. Determining that also demands knowing the pre-test probability of the event (its prior) and the statistical power of the test, that is, the probability that the data favor the alternative hypothesis given that the alternative is true. Neither was specified in the question, so the conclusion is indeterminate.
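To see how strongly the conclusion depends on those two missing quantities, here is a minimal sketch of the Bayes' rule calculation. It is not taken from the Westover, Westover, and Bianchi paper. It assumes that H0 and H1 are the only two possibilities, that the false positive rate equals the significance level alpha, and that the power is a single known number. The function name prob_h1_given_significant and the example priors and powers are hypothetical, chosen only for illustration.

```python
def prob_h1_given_significant(prior, power, alpha=0.05):
    """Probability that H1 is true given a 'significant' result.

    prior: pre-test probability that H1 is true (assumed known)
    power: P(significant | H1 true)
    alpha: P(significant | H0 true), assumed equal to the test's level
    """
    for name, value in (("prior", prior), ("power", power), ("alpha", alpha)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    true_positive = power * prior            # H1 true and test significant
    false_positive = alpha * (1.0 - prior)   # H0 true and test significant
    total = true_positive + false_positive   # P(significant)
    if total == 0.0:
        raise ValueError("a significant result has probability zero here")
    return true_positive / total


if __name__ == "__main__":
    # Hypothetical scenarios: (prior, power) pairs, all at alpha = 0.05.
    for prior, power in [(0.5, 0.8), (0.1, 0.8), (0.1, 0.2), (0.01, 0.8)]:
        posterior = prob_h1_given_significant(prior, power)
        print(f"prior={prior:<5} power={power:<4} P(H1 | significant)={posterior:.3f}")
```

Under these assumptions, the same "p < 0.05" corresponds to a probability of about 0.94 that H1 is true when the prior is 0.5 and the power is 0.8, but only about 0.64 when the prior is 0.1. That spread is why neither "probably true" nor "probably false" can be read off the significance test alone.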