Why the EPV≥10 sample size rule is rubbish and what to use instead

Maarten van Smeden, PhD
2 November 2020

M.vanSmeden@umcutrecht.nl | Twitter: @MaartenvSmeden
• Statistician at Julius Center for Health Sciences and Primary Care
• Main interests (but not limited to):
• prognostic and diagnostic modeling
• measurement error
• missing data
Today’s topic:
The EPV≥10 sample size rule (aka the 1 in 10 rule) has been one of the leading
sample size rules in prognostic/diagnostic prediction modeling

Outline
• The EPV≥10 rule-of-thumb: where does it come from?
• Evidence the EPV≥10 rule has no rationale
• Evidence that sample size is important (even if you use the fancier methods)
• Actual sample size calculations for prediction models

Ever wondered if AD/BC gives the “best” estimate of the odds ratio?
What if I told you that AD/BC is biased?

Let’s say we have fitted a logistic regression model to a dataset, and obtain
$$\ln\left(\frac{p_i}{1 - p_i}\right) = \hat{\alpha} + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_k X_{ki}$$
I’m very sorry, but $\hat{\beta}_1$ is a biased estimator, and $\hat{\beta}_2$ too, …
… actually they are all finite sample biased

Epidemiology text-books:
• Confounding bias
• Information bias
• Selection bias
… nothing about finite sample bias

Important: bias vs consistency
• Consistency ≈ as sample size increases, the estimate converges to the truth
• Unbiasedness ≈ over repeated samples of the same size, the average estimate equals the truth

Log(odds) is consistent but finite sample biased

Illustration by simulation
• Simulate 4 normal covariates with equal multivariable log-odds-ratios of 2
• 1,000 simulation samples of N = 50
• Consistency: create 1,000 meta-datasets of increasing size: meta-dataset
r consists of all created datasets up to r
• Bias: calculate the difference between the estimated exposure effect and the
true value, averaged over the created datasets up to r
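The contrast between pooling (consistency) and averaging (bias) can be sketched in a much simpler setting than the one on the slide. The example below is a stand-in, not the simulation described above: it uses one binary exposure (so the estimate is the log of AD/BC) instead of 4 normal covariates, plain Python instead of R, and the event probabilities, number of studies and seed are all illustrative assumptions. Studies with an empty cell are skipped because AD/BC is undefined there, which is a crude analogue of the "separation" handling mentioned later.

```python
import math
import random

def log_odds_ratio(a, b, c, d):
    """Log of the AD/BC odds ratio for a 2x2 table with cells a, b, c, d."""
    return math.log((a * d) / (b * c))

def simulate_study(n_per_arm, p_exposed, p_unexposed, rng):
    """Draw one small study and return its 2x2 table (a, b, c, d)."""
    a = sum(rng.random() < p_exposed for _ in range(n_per_arm))    # exposed, event
    c = sum(rng.random() < p_unexposed for _ in range(n_per_arm))  # unexposed, event
    return a, n_per_arm - a, c, n_per_arm - c

def run(n_studies=10000, n_per_arm=25, p_exposed=0.7, p_unexposed=0.3, seed=2020):
    """Return (true log OR, mean of small-study estimates, pooled estimate)."""
    rng = random.Random(seed)
    truth = math.log((p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed)))
    pooled = [0, 0, 0, 0]
    estimates = []
    for _ in range(n_studies):
        table = simulate_study(n_per_arm, p_exposed, p_unexposed, rng)
        pooled = [x + y for x, y in zip(pooled, table)]
        if 0 not in table:  # AD/BC is undefined when a cell is empty
            estimates.append(log_odds_ratio(*table))
    mean_small = sum(estimates) / len(estimates)
    return truth, mean_small, log_odds_ratio(*pooled)

truth, mean_small, pooled_estimate = run()
print(f"true log OR:                      {truth:.3f}")
print(f"average of many N = 50 studies:   {mean_small:.3f}")
print(f"one pooled dataset (same people): {pooled_estimate:.3f}")
```

Averaging many small studies does not remove the bias, whereas analysing the same individuals as one large dataset lands close to the true value.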

[Figure: average of 400 studies with N = 50 vs 1 study with N = 20,000]

With decreasing sample size: how we usually think

With decreasing sample size: what actually happens with odds ratios (and other ratios)

The origin of the 1 in 10 rule
“For EPV values of 10 or greater, no major problems occurred. For EPV
values less than 10, however, the regression coefficients were biased in
both positive and negative directions”

Source: Peduzzi et al. 1996
?

More simulation studies
Citations based on Google Scholar, Oct 30 2020
• “For EPV values of 10 or greater, no major problems” (Peduzzi et al. 1996; citations: 5,736)
• “a minimum of 10 EPV […] may be too conservative” (citations: 2,438)
• “substantial problems even if the number of EPV exceeds 10” (citations: 216)

!?!

• Examine the reasons for substantial differences between the earlier EPV
simulation studies (simulation technicality: handling of “separation”)
• Evaluate a possible solution to reduce the finite sample bias

• Firth’s “correction” aims to reduce finite sample bias in maximum
likelihood estimates, and is applicable to logistic regression
• It makes clever use of the “Jeffreys prior” (from the Bayesian literature) to
penalize the log-likelihood, which shrinks the estimated coefficients
• It has a nice theoretical justification, but does it work well?
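For intuition, the effect of the penalty can be computed exactly in the simplest possible case. The sketch below assumes a single binary exposure with equal group sizes, and relies on the known special case that, for such a saturated 2x2 model, Firth's penalized estimate amounts to adding 0.5 to every cell; that equivalence is taken as given here rather than derived, and the event probabilities are illustrative. It is written in plain Python and is not a replacement for a general implementation such as logistf.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability, computed on the log scale for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def expected_logit(n, p, add):
    """E[log((k + add) / (n - k + add))] for k ~ Binomial(n, p).

    With add = 0 (maximum likelihood) the estimate is undefined at k = 0 or
    k = n, so those outcomes are dropped and the rest is renormalised.
    """
    total = weight = 0.0
    for k in range(n + 1):
        if add == 0 and k in (0, n):
            continue
        w = binom_pmf(k, n, p)
        total += w * math.log((k + add) / (n - k + add))
        weight += w
    return total / weight

def bias_log_or(n_per_arm, p1, p0, add):
    """Exact finite sample bias of the log odds ratio estimate."""
    truth = math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))
    estimate = expected_logit(n_per_arm, p1, add) - expected_logit(n_per_arm, p0, add)
    return estimate - truth

for n in (25, 100, 400):
    ml = bias_log_or(n, 0.7, 0.3, add=0.0)     # standard maximum likelihood
    firth = bias_log_or(n, 0.7, 0.3, add=0.5)  # Firth-type 0.5 correction
    print(f"n per arm = {n:3d}  bias ML = {ml:+.4f}  bias Firth = {firth:+.4f}")
```

Both biases vanish as n grows, but the corrected estimate is already close to unbiased at sizes where the standard estimate is clearly off.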

[Figure: standard (maximum likelihood) estimation vs Firth’s correction, averaged over 465 simulation conditions with 10,000 replications each]

Firth’s correction
Difficult? No
Example R code:
> require("logistf")
> logistf(Y~X1+X2+X3+X4, firth=TRUE, data=df)
Compared to default (maximum likelihood) logistic regression, Firth’s correction generally gives:
• Narrower confidence intervals
• Lower MSE
• Better predictions*
*requires adjustment of the intercept using the flic=TRUE option in logistf

Sample size issue solved?
… not quite!
• Precision of regression coefficients
• Variable selection and functional form
• Ensure predictions are adequate

• Why would a one-size-fits-all rule of thumb be appropriate?
• Think of sample size for a randomized clinical trial
It would be odd to suggest that all trials should have 100 patients in each arm

TRIPOD Item 8. Explain how the study size was arrived at
Moons et al. Ann Intern Med 2015 (TRIPOD Explanation & Elaboration)
“Although there is a consensus on the importance
of having an adequate sample size for model
development, how to determine what counts as
‘adequate’ is not clear …”

Why is sample size important?
• We want to have a large enough sample size to develop a model that
provides accurate risk predictions in new individuals from target
population
• Many (most?) models do not perform well when checked in new data
• small sample sizes
• overfitting
• lack of (internal) validation
• …

Recent example
• Reviewed 232 prediction models
• “All models were rated at high or
unclear risk of bias”
• Sample size: median 338; IQR 134 to 707
• Number of events: median 69; IQR 37 to 160
Living review, doi: 10.1136/bmj.m1328 (these numbers are from a soon-to-appear review update)

Recent example
• External validation of 22 COVID-19-related
prognostic models
• Performance: poor to very poor
• “Admission oxygen saturation on room air and patient age are strong
predictors of deterioration and mortality among hospitalised adults with
COVID-19, respectively. None of the prognostic models evaluated here
offered incremental value for patient stratification to these univariable
predictors.”

Small sample size and overfitting
• Spurious predictor-outcome associations
• Important predictors can be missed
• Unimportant predictors can be selected
• Regression coefficients too large and uncertain
• Model doesn’t predict well in new data
• Disappointing discrimination
• Often calibration slope < 1
https://twitter.com/LesGuessing/status/997146590442799105

With small N: calibration slope often < 1
Predictions too extreme
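What a calibration slope below 1 means can be illustrated with a small sketch. It does not fit an overfitted model; instead it mimics one by multiplying the true linear predictor by 1.5, so that predictions are too extreme by construction, and then estimates the calibration slope by regressing the outcome on that linear predictor. The inflation factor, sample size, seed and the hand-written Newton-Raphson fit are all illustrative assumptions, in plain Python.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_logistic_1d(x, y, n_iter=25):
    """Maximum likelihood fit of logit P(y = 1) = a + b * x by Newton-Raphson."""
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = expit(a + b * xi)
            w = p * (1 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w              # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < 1e-10:
            break
    return a, b

def calibration_slope(inflation, n=10000, seed=1):
    """Slope from regressing new outcomes on an inflated linear predictor."""
    rng = random.Random(seed)
    true_lp = [rng.gauss(0, 1) for _ in range(n)]
    y = [1 if rng.random() < expit(lp) else 0 for lp in true_lp]
    predicted_lp = [inflation * lp for lp in true_lp]  # too extreme if inflation > 1
    _, slope = fit_logistic_1d(predicted_lp, y)
    return slope

print("well calibrated model:  ", round(calibration_slope(1.0), 2))
print("predictions too extreme:", round(calibration_slope(1.5), 2))
```

With coefficients that are 1.5 times too large, the slope should come out near 1/1.5, which is the kind of value seen when a model developed in a small sample is checked in new data.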

“Modern” methods aim to circumvent overfitting
• Penalised regression: e.g. lasso, ridge regression, elastic net
• Standard regression followed by uniform (global) shrinkage
• Target calibrated predicted risks in new data: shrinkage and penalty
terms estimated using bootstrapping or cross-validation
• Sample size problem solved?

“shrinkage works on the average but may fail in the particular unique
problem on which the statistician is working.”
• Required shrinkage is hard to estimate
• Often large uncertainty about the correct value to use, especially in small datasets (!)

“We conclude that, despite improved performance on average, shrinkage often
worked poorly in individual datasets, in particular when it was most needed.
The results imply that shrinkage methods do not solve problems associated
with small sample size or low number of events per variable.”

Our proposal
• Calculate sample size that is needed to
• minimise potential overfitting
• estimate probability (risk) precisely
• Sample size formulas for
• Continuous outcomes
• Time-to-event outcomes
• Binary outcomes (focus today)

Example
• Prognosis of hospitalized COVID-19
patients
• Composite outcome: “deterioration”
(in-hospital death, ventilator support,
ICU)
A priori expectations
• Event fraction at least 30%
• 40 candidate predictor parameters
• C-statistic of 0.71 (conservative estimate)
-> Cox-Snell R2 of 0.24
MedRxiv Preprint (not peer reviewed): 10.1101/2020.10.09.20209957

Restricted cubic splines with 4 knots: 3 degrees of freedom
Note: the EPV rule also counts degrees of freedom of candidate predictors, not variables!

Calculate required sample size
Criterion 1. Shrinkage: expected heuristic shrinkage factor, S ≥ 0.9
(calibration slope, target < 10% overfitting)
Criterion 2. Optimism: Cox-Snell R2 apparent - Cox-Snell R2 validation < 0.05
(overfitting)
Criterion 3. Precision: margin of error in the overall risk estimate < 0.05 (absolute)
(precision of the estimated baseline risk)
(Criterion 4. A small margin of absolute error in the estimated risks)

Calculation
R code:
> require(pmsampsize)
> pmsampsize(type="b",rsquared=0.24,parameters=40,prevalence=0.3)

A few alternative scenarios
• rsquared=0.24,parameters=40,prevalence=0.3 -> EPV≥9.7
• rsquared=0.36,parameters=40,prevalence=0.2 -> EPV≥5
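The arithmetic behind these scenarios can be sketched by hand for criteria 1 to 3. The formulas below are my recollection of the closed-form expressions published with this approach and are an assumption, not the internals of pmsampsize; in particular, the 0.05 optimism margin is applied on Nagelkerke's scale (Cox-Snell R2 divided by its maximum for the given prevalence), which is the version that reproduces the EPV of about 5 in the second scenario. The function names are hypothetical, and the sketch is in plain Python rather than R.

```python
import math

def max_cox_snell_r2(prevalence):
    """Largest Cox-Snell R2 attainable for a binary outcome with this prevalence."""
    ll_null_per_obs = (prevalence * math.log(prevalence)
                       + (1 - prevalence) * math.log(1 - prevalence))
    return 1 - math.exp(2 * ll_null_per_obs)

def n_for_shrinkage(r2, parameters, shrinkage):
    """Sample size at which the expected uniform shrinkage factor equals `shrinkage`."""
    return parameters / ((shrinkage - 1) * math.log(1 - r2 / shrinkage))

def min_sample_size(r2, parameters, prevalence,
                    shrinkage=0.9, optimism=0.05, margin=0.05):
    """Return (minimum n, implied EPV) over criteria 1 to 3."""
    # Criterion 1: expected shrinkage factor of at least 0.9.
    n1 = n_for_shrinkage(r2, parameters, shrinkage)
    # Criterion 2: small optimism in R2, translated into a shrinkage target
    # (assumed to be defined relative to the maximum Cox-Snell R2).
    s2 = r2 / (r2 + optimism * max_cox_snell_r2(prevalence))
    n2 = n_for_shrinkage(r2, parameters, s2)
    # Criterion 3: overall risk estimated to within +/- margin (95% interval).
    n3 = (1.96 / margin) ** 2 * prevalence * (1 - prevalence)
    n = math.ceil(max(n1, n2, n3))
    return n, n * prevalence / parameters

for r2, prevalence in ((0.24, 0.3), (0.36, 0.2)):
    n, epv = min_sample_size(r2, parameters=40, prevalence=prevalence)
    print(f"rsquared={r2}, prevalence={prevalence}: n >= {n}, EPV >= {epv:.1f}")
```

Under these assumptions the two scenarios give EPVs that round to the 9.7 and 5 listed above. The first is driven by criterion 1 and the second by criterion 2, which is why no single EPV threshold can stand in for the calculation.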

The sample size that meets all criteria is the MINIMUM required
• Why minimum? Other criteria may be important
e.g. missing data, clustering, variable selection
• May raise required sample size further
• Simulation based approaches
Preprint (not peer reviewed) doi: 10.21203/rs.3.rs-87100/v1

Summary
• Default logistic regression produces finite sample biased estimates
• Finite sample bias can be substantial; easily solved using Firth’s correction
• “Modern” approaches (e.g. Firth, Lasso, Ridge) do not compensate for low N
• New sample size criteria to replace the one-size-fits-all EPV≥10 rule

https://www.prognosisresearch.com/
New website by Richard Riley and Kym Snell

Work in collaboration with:
• Carl Moons
• Hans Reitsma
• Richard Riley (Keele, materials for this presentation)
• Gary Collins (Oxford)
• Ben Van Calster (Leuven)
• Ewout Steyerberg (Leiden)
• Rishi Gupta (UCL)
• Many others
Contact: M.vanSmeden@umcutrecht.nl
