Statistical hypothesis testing in e-commerce
1. The basics of statistical hypothesis
testing in e-commerce.
By Anatoly Vuets
2. Agenda
• Why do we use (and why should we use) statistical hypothesis testing in e-commerce?
• Statistical tests: how they work and their main parameters
• Key features for e-commerce
3. Why do we need statistical
testing in e-commerce?
4. We need the right decisions
• A/B tests
• Ad-hoc analyses
• Building models
5. We need the right decisions
• A question: which of these groups makes more profit?
• What is missing here?
6. We need the right decisions
• A/B test: which version is better?
8. • Random variable (discrete or continuous)
• Probability distribution function (PMF(x), PDF(x))
• Mean M or μ
• Standard deviation SD or σ
Basics of statistics
12. We want to draw conclusions about the statistical population based on the single sample that we have
observed
Statistical population Observed sample Possible samples
Why is this important?
15. • We want to test a statement (typically the existence of an effect).
• We have a set of observations (a sample) from which we draw a conclusion about the statement.
• The scenario in which the statement is TRUE is called the alternative hypothesis H1.
• The scenario in which the statement is FALSE is called the null hypothesis H0.
• Estimate the probability of observing a sample at least as extreme as ours under H0 (the p-value).
• If this probability is high enough, we conclude that H1 cannot be accepted. Otherwise, we
accept H1.
Idea
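The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's method: the data (10 conversions out of 100 visitors), the baseline rate of 5%, and the helper name `binomial_p_value` are all hypothetical. It computes the exact one-sided probability, under H0, of seeing at least as many conversions as observed, and compares it with a chosen significance level.

```python
from math import comb


def binomial_p_value(successes: int, n: int, p0: float) -> float:
    """One-sided p-value: P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


# Hypothetical observation: 10 conversions out of 100 visitors.
# H0: conversion rate is 5%. H1: conversion rate is higher than 5%.
p_value = binomial_p_value(successes=10, n=100, p0=0.05)

alpha = 0.05  # assumed significance level
decision = "accept H1" if p_value < alpha else "H1 cannot be accepted"
print(f"p-value = {p_value:.4f}, decision: {decision}")
```

A small p-value means the observed sample would be unlikely if H0 were true, which is the reason for accepting H1 in that case.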
17.                Test T(s): H0          Test T(s): H1
Truth: H0          Correct, P: 1 - α      Error T1, P: α
Truth: H1          Error T2, P: β         Correct, P: 1 - β
• Error T1 - accept H1 when H0 is true (false detection).
• Error T2 - accept H0 when H1 is true (missed effect).
• We would like to have a perfect test (α = 0, β = 0).
However, as we shall see later, this is impossible in
practice. Because of this, test design and result
interpretation are crucial for proper decision
making.
Statistical test parameters
18. A detector can be considered a binary classifier: the passenger either does not have (H0) or has (H1) metal
objects (a weapon, etc.).
The detector has a sensitivity knob (decision boundary).
If the sensitivity is low, the detector falsely detects metal in α = 5% of cases but misses metal in β =
67% of cases.
If the sensitivity is high, it falsely detects metal in α = 50% of cases but misses it in only β = 0.3% of cases.
Intermediate sensitivity values allow choosing a trade-off between missing a passenger
who has hidden metal objects (which increases the probability of an incident) and the service speed
(additional airport costs and lower passenger satisfaction).
Statistical test parameters: metal
detector in airport
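The effect of the sensitivity knob can be illustrated with a toy model. Everything here is an assumption made for illustration: the detector signal is modelled as a normal distribution centred at 0 without metal and at 2 with metal, and the thresholds are arbitrary. The resulting numbers are not meant to reproduce the 5% / 67% and 50% / 0.3% figures on the slide; the sketch only shows that moving the decision boundary lowers one error rate while raising the other.

```python
from statistics import NormalDist

# Hypothetical signal distributions (not from the slides).
no_metal = NormalDist(mu=0.0, sigma=1.0)  # H0: passenger has no metal
metal = NormalDist(mu=2.0, sigma=1.0)     # H1: passenger has metal


def error_rates(threshold: float) -> tuple[float, float]:
    """Return (alpha, beta) when the alarm fires for signal > threshold."""
    alpha = 1 - no_metal.cdf(threshold)  # false alarm: H0 true, alarm fires
    beta = metal.cdf(threshold)          # miss: H1 true, alarm stays silent
    return alpha, beta


# A lower threshold means higher sensitivity.
thresholds = [0.0, 0.5, 1.0, 1.5, 2.0]
rates = [error_rates(t) for t in thresholds]
for t, (alpha, beta) in zip(thresholds, rates):
    print(f"threshold={t:.1f}  alpha={alpha:.3f}  beta={beta:.3f}")
```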
19. A statistical test based on data obtained from an A/B test can be treated as a classifier that is
supposed to tell whether the conversion rate increased (H1) or remained the same (H0).
Question: which trade-off between α and β would you choose?
Statistical test parameters:
increasing web-page conversion rate
20. • H0: C = 5%, H1: C > 5%
• T(s) = c/n, n = 3600
• significance level = 5%
• P(T|H0) - ?
Obtained from theory or from simulation (bootstrap).
How does a statistical test work:
distribution P(T|H0)
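Both routes to P(T|H0) can be sketched with the slide's numbers (C = 5%, n = 3600, significance level 5%). The theoretical route below uses the normal approximation to the binomial, which is an assumption of this sketch. The simulated route draws samples directly from H0 (a plain Monte Carlo simulation, used here in place of a bootstrap because no observed data set is available). The helper name `simulate_rate`, the seed, and the number of simulations are hypothetical choices.

```python
import random
from statistics import NormalDist

N = 3600      # visitors per test (from the slide)
C0 = 0.05     # conversion rate under H0 (from the slide)
ALPHA = 0.05  # significance level (from the slide)

# Theory: T = c/n is approximately normal under H0.
se0 = (C0 * (1 - C0) / N) ** 0.5
critical_theory = NormalDist(C0, se0).inv_cdf(1 - ALPHA)

# Simulation: draw many samples of size N assuming H0 is true.
rng = random.Random(0)


def simulate_rate(p: float, n: int) -> float:
    """Observed conversion rate in one simulated sample of n visitors."""
    return sum(rng.random() < p for _ in range(n)) / n


sims = sorted(simulate_rate(C0, N) for _ in range(2000))
critical_sim = sims[int((1 - ALPHA) * len(sims))]  # empirical 95th percentile

print(f"critical value (theory):     {critical_theory:.4f}")
print(f"critical value (simulation): {critical_sim:.4f}")
```

An observed conversion rate above the critical value falls into the 5% tail of P(T|H0), so H0 would be rejected at this significance level.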
22. • H0: C = 5%, H1: C > 5%
• T(s) = c/n, n = 3600
• significance level = 5%
• P(T|H1) - ?
Hypothesis H1 consists of
an infinite number of
hypotheses: C = 5.1%, C =
5.2%, … Which one should
we consider?
• H1: C = 5.5%
(+10% relative, the minimum expected boost)
How does a statistical test work:
distribution P(T|H1)
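Once a specific alternative is fixed, the probability of missing the effect (β) and the power (1 - β) follow from P(T|H1). The sketch below uses the slide's numbers (H0: C = 5%, H1: C = 5.5%, n = 3600, significance level 5%) and assumes the normal approximation to the binomial. The helper name `standard_error` is hypothetical.

```python
from statistics import NormalDist

N = 3600      # visitors per test (from the slide)
C0 = 0.05     # conversion rate under H0
C1 = 0.055    # conversion rate under the chosen H1 (+10% relative)
ALPHA = 0.05  # significance level


def standard_error(p: float, n: int) -> float:
    """Standard deviation of the observed rate c/n for true rate p."""
    return (p * (1 - p) / n) ** 0.5


# Reject H0 when the observed rate exceeds the 95th percentile of P(T|H0).
critical = NormalDist(C0, standard_error(C0, N)).inv_cdf(1 - ALPHA)

# Power: probability that T lands above the critical value when H1 is true.
power = 1 - NormalDist(C1, standard_error(C1, N)).cdf(critical)
beta = 1 - power

print(f"critical value = {critical:.4f}")
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

Under these assumptions the power comes out at roughly 0.4, so a true +10% boost would be missed more often than it is detected with this sample size.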
29. Question: what should we do if we choose α = 10% but get a p-value of 12%?
Uncertainty of p-value
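One way to see how uncertain a single p-value is: repeat the same experiment many times in simulation and look at the spread of the resulting p-values. The sketch below assumes the true conversion rate is 5.5% with n = 3600 visitors per experiment, tests against H0: C = 5%, and uses the normal approximation. The seed, the number of repetitions, and the helper name `one_experiment_p_value` are hypothetical.

```python
import random
from statistics import NormalDist

N = 3600        # visitors per experiment
C0 = 0.05       # conversion rate under H0
C_TRUE = 0.055  # assumed true conversion rate (H1 holds)

null_dist = NormalDist(C0, (C0 * (1 - C0) / N) ** 0.5)
rng = random.Random(1)


def one_experiment_p_value() -> float:
    """Run one simulated experiment and return its one-sided p-value."""
    conversions = sum(rng.random() < C_TRUE for _ in range(N))
    return 1 - null_dist.cdf(conversions / N)


p_values = sorted(one_experiment_p_value() for _ in range(500))
low = p_values[int(0.05 * len(p_values))]     # 5th percentile
median = p_values[len(p_values) // 2]
high = p_values[int(0.95 * len(p_values))]    # 95th percentile

print(f"p-value 5th pct = {low:.4f}, median = {median:.4f}, 95th pct = {high:.4f}")
```

In this setting identical experiments give p-values that differ by orders of magnitude, which suggests that a gap as small as 12% versus 10% should not be read as a sharp boundary.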
30. • The key parameters of a statistical test are the significance level and the power, which correspond to the
probability of a false detection and the probability of not missing an effect.
• Test power can be increased in two ways: by increasing the sample size or by increasing
the effect size.
• Keep in mind that the p-value is a random statistic! It is important to account for its uncertainty.
• Keep in mind that some metrics (like conversion from registration to buyer) may take significant time
to measure.
• Anomalies in the data may dramatically affect test results.
Summary
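The link between sample size, effect size and power can be made concrete with the standard sample-size formula for a one-sided test of a single proportion under the normal approximation. This is a sketch under stated assumptions: a known baseline rate (as in the H0: C = 5% example), a target power of 80% chosen only for illustration, and a hypothetical helper name `required_sample_size`. A two-group A/B comparison would need a different formula.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(p0: float, p1: float, alpha: float, power: float) -> int:
    """Sample size for a one-sided one-sample proportion test (normal approx.)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    numerator = z_alpha * (p0 * (1 - p0)) ** 0.5 + z_power * (p1 * (1 - p1)) ** 0.5
    return ceil((numerator / (p1 - p0)) ** 2)


# Baseline 5%, significance level 5%, target power 80% (assumed target).
n_small_effect = required_sample_size(0.05, 0.055, alpha=0.05, power=0.80)
n_large_effect = required_sample_size(0.05, 0.060, alpha=0.05, power=0.80)

print(f"detect 5.0% -> 5.5%: n = {n_small_effect}")
print(f"detect 5.0% -> 6.0%: n = {n_large_effect}")
```

Doubling the effect size cuts the required sample to roughly a quarter, which is why low-traffic businesses benefit from focusing on changes with potentially big effects.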
31. Conclusions
• In e-commerce, test power (the probability of not missing an effect) is often the most important parameter.
• For a high-traffic business: the required trade-off between significance level and
power can be easily achieved by increasing the sample size.
• For a low-traffic business: focus on features that:
1) are cheap, easy to implement and not risky, or
2) have potentially big effects.