There are many valid criticisms of P-values but the criticism that they are largely responsible for the reproducibility crisis has been accepted rather lightly in some quarters. Whatever the inferential statistic that is used, it is quite illogical to assume that as the sample size increases it will tend to show more evidence against the null hypothesis. This applies to Bayesian posterior probabilities as much as it does to P-values. In the context of P-values it can be referred to as the trend towards significance fallacy but more generally, for reasons I shall explain, it could be referred to as the anticipated evidence fallacy.
The anticipated evidence fallacy is itself an example of the overstated evidence fallacy. I shall also discuss this fallacy and other relevant matters affecting reproducible science including the problem of false negatives.
2. Senn: Trends toward significance 2
“Many of the test procedures for
examining model adequacy that are
provided in standard software are best
regarded as defined directly by the test
statistics used rather than by a family of
alternatives.” (p37)
“With large amounts of data, small
departures from H0, of no subject-matter
importance, may be highly significant.”
(p42)
David Cox (1924-2022)
4. Statement of the problem
Trends toward significance and replication
Senn: Trends toward significance 4
5. Trend toward Significance
Examples
The results showed a trend towards significance
P=0.07
The result was not significant due to the trial
being too small
As more trials become available the meta-
analysis will be even more significant
This trial was significant so the next one should
be also
Senn: Trends toward significance 5
When faced with a P value that has failed to
reach some specific threshold (generally
P<0.05), authors of scientific articles may
imply a “trend towards statistical
significance” or otherwise suggest that the
failure to achieve statistical significance was
due to insufficient data. This paper presents a
quantitative analysis to show that such
descriptions give a misleading impression and
undermine the principle of accurate reporting.
Wood J, Freemantle N, King M, Nazareth I.
Trap of trends to statistical significance:
likelihood of near significant P value
becoming more significant with extra
data. BMJ. Mar 31 2014;348:g2215.
doi:10.1136/bmj.g2215
6. Dealing with a red herring
Significance testing versus hypothesis testing etc. etc.
Senn: Trends toward significance 6
7. Red Herrings
Supposed differences between frequentists
Fisher
• Significance testing
• Stresses P-values
• Calculates tail area probabilities
• Inferences not decisions
• Has no way of addressing
sensitivity of experiments
• ???
Neyman-Pearson
• Hypothesis testing
• Uses fixed type I error rates
• Determines critical values
• Decisions not inferences
• Solves problem of sensitivity
using alternative hypothesis
• Power
Senn: Trends toward significance Stephen Senn 7
8. Are P-values really a difference between the
two schools?
Fisher introduces significance levels A disciple of Neyman’s advocates P-values
Senn: Trends toward significance 8
Fisher RA. Statistical Methods for Research Workers.
Oliver and Boyd; 1925.
Lehmann EL. Testing Statistical Hypotheses.
Second ed. Chapman and Hall; 1994.
9. The real differences
Issue Neyman Pearson Fisher
Power versus likelihood The justification of likelihood lies in
power
Likelihood is fundamental and
power is irrelevant
Alternative hypotheses and test
statistic
The alternative hypothesis dictates
choice of test statistic
H1 Test statistic
The tests statistic is chosen on the
basis of experience
Test statistic hypotheses
Conditioning Forward-looking properties of the
test are what matters
Conditioning on relevant subsets (if
they exist) is crucial
Sensitivity of experiment Power for given alternative Precision (standard error etc)
Senn: Trends toward significance 9
10. Some properties of P-values
Power to the P-value?
Senn: Trends toward significance 10
11. Problems with P-Values as evidence?
• Other things being equal smaller
P-values indicate weaker support
for H0
• and stronger for “H1”
• But be careful! Other things may
not be equal.
• Even if the P-value is small it could
be more likely under H0 than
under H1
• But be doubly careful! Who
knows what “H1”is?
Senn: Trends toward significance 11
12. My view
• P-values are more useful than significance, even if you want significance
• They provide a useful but limited function as one form of a surprise index
• But there are others
• And although they don’t require much input they often don’t deliver much output
• If you can do better you should
• But can you?
• You should use other approaches also
• However a P-value is a (vaguely defined) position it is not a movement
• There is no such thing as a trend towards significance
• This is a fallacy but also made (implicitly) in some discussions of the ‘replication
crisis’
• The anticipated evidence fallacy: ‘This trial was significant and so the next one should be too’
Senn: Trends toward significance 12
14. Replication probabilities
from a position of
(Bayesian) ignorance
• Q1. What is the probability that in a future
experiment, taking that experiment’s results
on its own, the results would favour 1 rather
than 2?
• Q2. What is the probability, having conducted
this experiment, and pooled its results with the
current one, the results would favour 1 rather
than 2?
• Q3. What is the probability that having
conducted a future experiment and then
calculated a Bayesian posterior using a
uninformative prior and the results of this
second experiment alone, the probability that
2 would be worse than 1 would be less than or
equal to 0.05?
Senn: Trends toward significance 14
15. Why if this is a problem it cannot uniquely be
a problem of P-values
• Imagine two similar experiments
considering the same question
• Choose any statistic you like to
express the results of experiments
• Calculate this statistic for
experiment one (independently of
experiment 2) and call it S1
• Calculate this statistic for
experiment two (independently of
experiment 1) and call it S2
• Assume that nothing else is known
(for example you have not yet seen
the results)
Senn: Trends toward significance 15
P(S1 > S2) = P(S2 > S1)
16. Senn: Trends toward significance 16
Sauce for the Goose and Sauce for the Gander
• This property is shared by Bayesian statements
• It follows from the Martingale property of Bayesian forecasts
• Hence, either
• The property is undesirable and hence is a criticism of Bayesian methods also
• Or it is desirable and is a point in favour of frequentist methods
• My own view is that it is much less important than many suppose
• We are interested in the truth and replication is only a means of
getting there
• A trial that is no more precise than the first one is only a very
imperfect judge
18. Senn: Trends toward significance 18
Parameter H0 H1
Mean, 0 2.8
Variance of
true effect
0 1
Standard error
Of statistic
1 1
Probability of
significance
0.05 0.80
1000 trials. In half the cases H0 is true
19. Senn: Trends toward significance 19
As before but magnified to
concentrate more easily on the
‘significance’ boundary of 0.05
20. Senn: Trends toward significance 20
NB Bayes factor calculated at non-centrality of 2.8, which corresponds to the value for planning
Magnified plot only
21. Comparison of Two Two-Trial Strategies
Statistics Frequentist P<0.05 Bayesian BF<1/10
Number of trials run in total 1372 1299
Number of false positives 0 1
Number of false negatives 68 133
Number of false decisions 68 134
Number of correct decisions 932 866
Trials per correct decision 1.47 1.5
Senn: Trends toward significance 21
Simulation values are as previously
Second trial is only run if first is ‘positive’
Efficacy declared if both trials are ‘positive’
Figures are per 1000 trials for which H0 is true in 500 and H1 in 500
22. Lessons
• This does not show that significance is better than Bayes factors
• Different parameter settings could show the opposite
• Different thresholds could show the opposite
• It does show that reproducibility is not necessarily the issue
• The strategy with poorer reproducibility is (marginally) better
• NB That does not have to be the case
• But it shows it can be the case
• It raises the issue of calibration
• And it also raises the issue as to whether reproducibility is a useful
goal in itself
• And how it should be measured
Senn: Trends toward significance 22
24. Standard errors reported as being smaller
than they are
Block main effect problems
• Incorrect unit of experimental
inference (Pseudoreplication,
Hurlbert, 1984)
• For example allocation of treatments to
cages but counting rats as the ‘n’
• Inappropriate experimental analogue
for observational studies
• Historical data treated as if they were
concurrent
• Thus parallel group trial is used as
inappropriate model rather than cluster
randomised trial
• This may apply more widely to
epidemiological studies
Block interaction problems
• Mistaking causal inference for
predictive inference
• Causal: What happened to the patients in
this trial?
• Interaction does not contribute to the error
• Fixed effects meta-analysis is an example
• Predictive what will happen to patients in
future?
• Interaction should contribute to the error
• Random effects meta-analysis is an
example
Senn: Trends toward significance 24
25. An example
• Peters et al 2005, describe ways of
combining different
epidemiological (humans) and
toxicological studies (animals)
using Bayesian approaches
• Trihalomethanes and low
birthweight used as an illustration
• However, their analysis of the
toxicology data from rats treats the
pups as independent, which makes
very strong (implicit) and
implausible assumptions
• Pseudoreplication
Discussed in Grieve and Senn (2005)
(Unpublished)
For pseudoreplication See Hurlbert
SH. Pseudoreplication and the design
of ecological field experiments.
Ecological monographs.
1984;54(2):187-211.
Senn: Trends toward significance 25
26. Other Problems
• Publication bias
• Hacking
• Multiplicity
• In the sense of only concentrating on the good news
• Surrogate endpoints
• Naïve expectations
Senn: Trends toward significance 26
28. My ‘Conclusions’
• P-values per se are not the problem
• There may be a harmful culture of ‘significance’ however this is defined
• P-values have a limited use as rough and ready tools using little structure
• Where you have more structure you can often do better
• Likelihood, Bayes etc
• Point estimates and standard errors are extremely useful for future research
synthesizers and should be provided regularly
• We need to make sure that estimates are reported honestly and standard
errors calculate appropriately
• We should not confuse the evidence from a study with the conclusion that
should be made
Senn: Trends toward significance 28
29. Finally, a thought about laws (or hypotheses)
from The Laws of Thought
Senn: Trends toward significance 29
When the defect of data* is supplied by
hypothesis, the solutions will, in general, vary
with the nature of hypotheses assumed.
Boole (1854) An Investigation of the Laws of Thought: Macmillan.
. P375
Quoted by R. A. Fisher (1956) Statistical methods and scientific inference
In Statistical Methods, Experimental Design and Scientific Inference (ed
J. H. Bennet), Oxford: Oxford University. p23
*Fisher interprets defect of data is supplied by hypothesis as meaning
supplying by hypothesis what is lacking in the data.
In worrying about the hypotheses, let’s not overlook the
defects of data