Statistical hypothesis testing in e-commerce

The basics of statistical hypothesis
testing in e-commerce
By Anatoly Vuets

Agenda
• Why do we use (and why should we use) statistical hypothesis testing in e-commerce?
• Statistical test: how it works and its main parameters
• Key features for e-commerce

Why do we need statistical
testing in e-commerce?

We need to make the right decisions
• A/B tests
• Ad-hoc analyses
• Building models

• A question: which of these groups makes more profit?
• What is missing here?

• A/B test: which version is better?

Statistical test: let’s recall
the basics!

• Random variable (discrete or continuous)
• Probability distribution: mass function PMF(x) for discrete variables, density function PDF(x) for continuous ones
• Mean M or μ
• Standard deviation SD or σ
Basics of statistics

Basics of statistics: standard
distribution

Statistical test: uncertainty.

[Figure: a statistical population with its true metric value; one observed sample among many possible samples; the observed value among other possible values (distribution)]
Uncertainty

We want to draw conclusions about the statistical population based on the single sample that we
have observed.
[Figure: statistical population, observed sample, possible samples]
Why is this important?

Distribution of the metric estimate

Statistical test: basic idea
and main parameters.

• We want to test a statement (typically, the existence of an effect).
• We have a set of observations (a sample) from which we draw a conclusion about the statement.
• The scenario in which the statement is TRUE is called the alternative hypothesis H1.
• The scenario in which the statement is FALSE is called the null hypothesis H0.
• Estimate the probability of observing the sample we have (or a more extreme one) under H0.
• If this probability is high enough, we conclude that H1 cannot be accepted. Otherwise, we
accept H1.
Idea
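
The idea above can be sketched in a few lines of Python. This is only an illustration: the values n = 3600 and C = 5% under H0 are borrowed from the later slides, while the observed count of 205 conversions and the 5% threshold for a "low" probability are assumptions made up for the example. The helper names are hypothetical.

```python
import math

def binom_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), computed in log space."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def upper_tail_p_value(c, n, p0):
    """P(X >= c | H0): probability of a result at least as extreme as c."""
    return sum(math.exp(binom_log_pmf(k, n, p0)) for k in range(c, n + 1))

n = 3600          # visitors in the sample (value used later in the slides)
p0 = 0.05         # conversion rate under H0
c_observed = 205  # hypothetical number of conversions, for illustration only
alpha = 0.05      # assumed threshold below which the probability counts as low

p_value = upper_tail_p_value(c_observed, n, p0)
decision = "accept H1" if p_value < alpha else "H1 cannot be accepted"
print(f"p-value = {p_value:.4f} -> {decision}")
```

The tail sum is exact for a binomial count; a normal approximation would give a similar value at this sample size.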

[Diagram: sample S → test T(S) → decision H0/H1]
H0: C = 5%, H1: C > 5%
Statistical test

Truth \ Test T(s)    H0                   H1
H0                   Correct, P: 1 - α    Error T1, P: α
H1                   Error T2, P: β       Correct, P: 1 - β
• Error T1 (type I): accepting H1 when H0 is true.
• Error T2 (type II): accepting H0 when H1 is true.
• We would like to have a perfect test (α = 0, β = 0).
However, as we shall see later, this is impossible in
practice. Because of this, test design and result
interpretation are crucial for proper decision
making.
Statistical test parameters

A metal detector can be considered a binary classifier: the passenger either does not have
metal objects (H0) or has them (H1) (weapons etc.).
The detector has a sensitivity knob (decision boundary).
If the sensitivity is low, the detector falsely detects metal in α = 5% of cases, but misses metal
in β = 67% of cases.
If the sensitivity is high, it falsely detects metal in α = 50% of cases, but misses it in only β = 0.3% of cases.
Intermediate sensitivity values allow choosing the trade-off between missing a passenger
who has hidden metal objects (which increases the probability of an incident) and the service
speed (additional airport costs and lower passenger satisfaction).
Statistical test parameters: metal
detector in airport

A statistical test based on data obtained from an A/B test can be treated as a classifier that is
supposed to tell whether the conversion rate increased (H1) or remained the same (H0).
Question: which trade-off between α and β would you choose?
Statistical test parameters:
increasing web-page conversion rate

• H0: C = 5%, H1: C > 5%
• T(s) = c/n, n = 3600
• significance level = 5%
• P(T|H0) - ?
[Figure: distribution P(T|H0) obtained from theory and from simulation (bootstrap)]
How does a statistical test work:
distribution P(T|H0)
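
A minimal sketch of how P(T|H0) could be obtained both ways, using the slide's values (C = 5%, n = 3600). Two assumptions are worth stating: the "theory" curve here is the normal approximation to the binomial, and the simulation is a plain Monte Carlo draw from H0, used as a stand-in for the bootstrap named on the slide (a bootstrap would resample observed data instead). The function name `simulate_T` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

n, p0 = 3600, 0.05  # sample size and conversion rate under H0 (from the slide)

# Theory (assumed normal approximation): T = c/n ~ N(p0, p0 * (1 - p0) / n).
theory = NormalDist(mu=p0, sigma=math.sqrt(p0 * (1 - p0) / n))

rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_T(p, n, rng):
    """Simulate one sample of n visitors converting with probability p; return c/n."""
    conversions = sum(rng.random() < p for _ in range(n))
    return conversions / n

# Simulation: repeat the experiment many times under H0 and collect T.
sims = [simulate_T(p0, n, rng) for _ in range(1000)]

print(f"theory:     mean = {theory.mean:.4f}, SD = {theory.stdev:.4f}")
print(f"simulation: mean = {mean(sims):.4f}, SD = {stdev(sims):.4f}")
```

If the approximation is adequate, the two lines should agree closely; a histogram of `sims` would trace the same bell shape as the theoretical density.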

How does a statistical test work:
significance level and decision boundary
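
The link between significance level and decision boundary can be sketched as follows, assuming the normal approximation for T under H0 and the slide's values (C = 5%, n = 3600, significance level 5%). The `decide` helper is a hypothetical name for illustration.

```python
import math
from statistics import NormalDist

n, p0, alpha = 3600, 0.05, 0.05  # values from the slides

# Standard deviation of T = c/n under H0 (assumed normal approximation).
se0 = math.sqrt(p0 * (1 - p0) / n)

# The boundary is the (1 - alpha) quantile of P(T|H0): under H0, T exceeds it
# with probability alpha, which is exactly the type I error rate.
boundary = NormalDist(mu=p0, sigma=se0).inv_cdf(1 - alpha)

def decide(c, n, boundary):
    """Return the hypothesis the test accepts for c conversions out of n."""
    return "H1" if c / n > boundary else "H0"

print(f"decision boundary: T > {boundary:.4f}")
print("180 conversions ->", decide(180, n, boundary))
print("220 conversions ->", decide(220, n, boundary))
```

Lowering alpha moves the boundary to the right, so fewer samples are classified as H1.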

• H0: C = 5%, H1: C > 5%
• T(s) = c/n, n = 3600
• significance level = 5%
• P(T|H1) - ?
Hypothesis H1 consists of an infinite number of hypotheses: C = 5.1%, C = 5.2%, … Which one
should we consider?
• H1: C = 5.5%
(+10% relative, the minimum expected boost)
How does a statistical test work:
distribution P(T|H1)

How does a statistical test work:
significance level vs power
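
A rough sketch of the trade-off, assuming normal approximations for both P(T|H0) and P(T|H1) and the slide's values (H0: C = 5%, H1: C = 5.5%, n = 3600). The function name `power` is hypothetical.

```python
import math
from statistics import NormalDist

def power(n, p0, p1, alpha):
    """Approximate P(accept H1 | H1 is true) for the one-sided test on T = c/n."""
    se0 = math.sqrt(p0 * (1 - p0) / n)  # SD of T under H0
    se1 = math.sqrt(p1 * (1 - p1) / n)  # SD of T under H1
    boundary = NormalDist(p0, se0).inv_cdf(1 - alpha)
    # Power is the share of P(T|H1) lying to the right of the boundary.
    return 1 - NormalDist(p1, se1).cdf(boundary)

n, p0, p1 = 3600, 0.05, 0.055  # values from the slides

for alpha in (0.01, 0.05, 0.10, 0.20):
    pw = power(n, p0, p1, alpha)
    print(f"alpha = {alpha:.2f}  power = {pw:.2f}  beta = {1 - pw:.2f}")
```

Each row is one position of the "sensitivity knob": a larger alpha buys more power, and a smaller alpha costs power, as long as n and the effect size stay fixed.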

Important features of statistical
testing in e-commerce

Significance level vs power trade-off
improvement: sample size
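
To see how sample size improves the trade-off, one can estimate the n needed to reach a chosen power. The sketch below uses the standard normal-approximation formula for a one-sided, one-sample proportion test; the 80% and 90% power targets are assumptions for illustration, not values from the slides, and `required_sample_size` is a hypothetical name.

```python
import math
from statistics import NormalDist

def required_sample_size(p0, p1, alpha, target_power):
    """Smallest n (normal approximation) giving the target power at level alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(target_power)
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)

p0, p1, alpha = 0.05, 0.055, 0.05  # H0, H1 and significance level from the slides

n_80 = required_sample_size(p0, p1, alpha, 0.80)  # assumed power target
n_90 = required_sample_size(p0, p1, alpha, 0.90)  # assumed power target
print(f"n for 80% power: {n_80}")
print(f"n for 90% power: {n_90}")
```

Both results are well above the n = 3600 used in the example slides, which suggests why that sample leaves the test underpowered for a +10% boost.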

Significance level vs power trade-off
improvement: effect size
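
The effect-size side of the trade-off can be illustrated by asking the reverse question: with n fixed at 3600, how large must the true conversion rate be for the test to reach a given power? This is a sketch under the normal approximation; the 80% power target and the function names are assumptions, and the search is a simple bisection.

```python
import math
from statistics import NormalDist

def power(n, p0, p1, alpha):
    """Approximate power of the one-sided test on T = c/n."""
    se0 = math.sqrt(p0 * (1 - p0) / n)
    se1 = math.sqrt(p1 * (1 - p1) / n)
    boundary = NormalDist(p0, se0).inv_cdf(1 - alpha)
    return 1 - NormalDist(p1, se1).cdf(boundary)

def minimum_detectable_rate(n, p0, alpha, target_power, tol=1e-9):
    """Bisection for the smallest p1 in (p0, 0.5) that reaches the target power."""
    lo, hi = p0, 0.5  # power grows with p1 on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power(n, p0, mid, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi

n, p0, alpha = 3600, 0.05, 0.05  # values from the slides
mde = minimum_detectable_rate(n, p0, alpha, target_power=0.80)  # assumed target
print(f"minimum detectable rate: {mde:.4f} "
      f"(relative boost {100 * (mde / p0 - 1):.1f}%)")
```

A change with a larger expected effect needs less traffic to be detected reliably, which is the motivation for prioritising potentially big effects when traffic is limited.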

Question: what should we do if we chose α = 10% but got a p-value of 12%?
Uncertainty of p-value
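
One way to get a feel for this uncertainty is to repeat the same experiment many times in simulation and look at how much the p-value varies. The sketch below assumes the true conversion rate is 5.5% (the H1 value from the earlier slides), uses the normal approximation for the p-value, and takes n = 3600; all of these are illustrative assumptions, and `one_experiment_p_value` is a hypothetical name.

```python
import math
import random
from statistics import NormalDist, quantiles

n, p0, alpha = 3600, 0.05, 0.10
p_true = 0.055  # assumed true conversion rate, for illustration only

null = NormalDist(mu=p0, sigma=math.sqrt(p0 * (1 - p0) / n))
rng = random.Random(1)  # fixed seed so the run is repeatable

def one_experiment_p_value():
    """Run one simulated experiment and return its one-sided p-value."""
    c = sum(rng.random() < p_true for _ in range(n))
    return 1 - null.cdf(c / n)

p_values = [one_experiment_p_value() for _ in range(1000)]
deciles = quantiles(p_values, n=10)  # 9 cut points: 10th, 20th, ..., 90th percentile
share_significant = sum(p < alpha for p in p_values) / len(p_values)

print(f"10th / 50th / 90th percentile of p-value: "
      f"{deciles[0]:.3f} / {deciles[4]:.3f} / {deciles[8]:.3f}")
print(f"share of experiments with p < {alpha}: {share_significant:.2f}")
```

Even with a real effect present, identical experiments produce p-values spread over a wide range, so a single value slightly above or below α should be read with that spread in mind.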

• The key parameters of a statistical test are the significance level (the probability of a false
detection) and the power (one minus the probability of missing an effect).
• Test power can be increased in two ways: by increasing the sample size or by increasing the
effect size.
• Keep in mind that the p-value is a random statistic! It is important to account for its uncertainty.
• Keep in mind that some metrics (like conversion from registration to buyer) may take significant
time to measure.
• Anomalies in the data may dramatically impact test results.
Summary

Conclusions
• In e-commerce, test power (the probability of not missing an effect) is often the most
important parameter.
• For a high-traffic business: the required trade-off between significance level and
power can be easily achieved by increasing the sample size.
• For a low-traffic business: focus on features which:
1) are cheap, easy to implement and not risky, or
2) have potentially big effects.
