Hands-On Algorithms for Predictive Modeling
(Penalized) Logistic Regression, Boosting, Neural Nets, SVM, Trees, Forests
Arthur Charpentier (Université du Québec à Montréal, Canada)
Third International Congress on Actuarial Science and Quantitative Finance
Manizales, Colombia, June 2019
Modeling a 0/1 random variable
Myocardial infarction of patients admitted to the E.R.
◦ heart rate (FRCAR) : x1
◦ heart index (INCAR) : x2
◦ stroke index (INSYS) : x3
◦ diastolic pressure (PRDIA) : x4
◦ pulmonary arterial pressure (PAPUL) : x5
◦ ventricular pressure (PVENT) : x6
◦ lung resistance (REPUL) : x7
◦ death or survival (PRONO) : y
dataset $\{(x_i, y_i)\}$, $i = 1,\dots,n$, with $n = 73$.
    myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
                          head=TRUE, sep=";")
Classification : Logistic Regression (and GLM)
The logistic regression is based on the assumption that
$$Y|X=x \sim \mathcal{B}(p_x), \quad\text{where } p_x = \frac{\exp[x^T\beta]}{1+\exp[x^T\beta]} = \left(1+\exp[-x^T\beta]\right)^{-1}$$
The goal is to estimate the parameter β.
Recall that the heuristic for using this function for the probability is that
$$\log[\text{odds}(Y=1)] = \log\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]} = x^T\beta$$
• Maximum of the (log-)likelihood function
The log-likelihood is here
$$\log\mathcal{L} = \sum_{i=1}^n y_i\log p_i + (1-y_i)\log(1-p_i),$$
where $p_i = (1+\exp[-x_i^T\beta])^{-1}$.
Mathematical Statistics Courses in a Nutshell
Consider observations $\{y_1,\dots,y_n\}$ from iid random variables $Y_i \sim F_\theta$ (with "density" $f_\theta$). The likelihood is
$$\mathcal{L}(\theta; y) = \prod_{i=1}^n f_\theta(y_i)$$
The maximum likelihood estimate is
$$\hat\theta^{\text{mle}} \in \underset{\theta\in\Theta}{\text{argmax}}\,\{\mathcal{L}(\theta; y)\} \quad\text{or}\quad \hat\theta^{\text{mle}} \in \underset{\theta\in\Theta}{\text{argmax}}\,\{\log\mathcal{L}(\theta; y)\}$$
Fisher (1912, On an absolute criterion for fitting frequency curves).
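As a quick numerical illustration (a minimal sketch, not from the original deck), one can always maximize a log-likelihood with optim(); here for a Gaussian sample, where the MLE of the mean is the sample mean:

    # Minimal sketch: numerical MLE for a Gaussian sample; the standard
    # deviation is parametrized on the log scale to keep it positive
    set.seed(1)
    y = rnorm(100, mean = 2, sd = 1)
    negLogLik = function(theta) -sum(dnorm(y, mean = theta[1], sd = exp(theta[2]), log = TRUE))
    optim(c(0, 0), negLogLik)$par[1]   # close to mean(y)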
Under standard assumptions (identification of the model, compactness, continuity and dominance), the maximum likelihood estimator is consistent, $\hat\theta^{\text{mle}} \overset{\mathbb{P}}{\longrightarrow} \theta$. With additional assumptions, the maximum likelihood estimator is asymptotically normal,
$$\sqrt{n}\left(\hat\theta^{\text{mle}} - \theta\right) \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, I^{-1})$$
where $I$ is the Fisher information matrix (i.e. $\hat\theta^{\text{mle}}$ is asymptotically efficient).
E.g. if $Y \sim \mathcal{N}(\theta,\sigma^2)$, $\log\mathcal{L} \propto -\sum_{i=1}^n \left(\frac{y_i-\theta}{\sigma}\right)^2$, and $\hat\theta^{\text{mle}} = \overline{y}$ (see also the method of moments):
$$\max\,\log\mathcal{L} \iff \min \sum_{i=1}^n \varepsilon_i^2 = \text{least squares}$$
Classification : Logistic Regression
• Numerical Optimization
Using optim() might not be a great idea.
The BFGS algorithm (Broyden, Fletcher, Goldfarb & Shanno) is based on Newton's algorithm.
To find a zero of a function g, i.e. $g(x_0)=0$, use Taylor's expansion,
$$0 = g(x_0) \approx g(x) + g'(x)(x_0 - x)$$
i.e. (with $x \to x_{\text{old}}$ and $x_0 \to x_{\text{new}}$)
$$x_{\text{new}} = x_{\text{old}} - \frac{g(x_{\text{old}})}{g'(x_{\text{old}})}$$
from some starting point, or, in higher dimension,
$$x_{\text{new}} = x_{\text{old}} - \nabla g(x_{\text{old}})^{-1}\cdot g(x_{\text{old}})$$
from some starting point. Here g is the gradient of the log-likelihood,
$$\beta_{\text{new}} = \beta_{\text{old}} - \left(\frac{\partial^2\log\mathcal{L}(\beta_{\text{old}})}{\partial\beta\,\partial\beta^T}\right)^{-1}\cdot\frac{\partial\log\mathcal{L}(\beta_{\text{old}})}{\partial\beta}$$
Numerically, since only g is given, consider some small h > 0 and use a finite-difference approximation of the derivative,
$$x_{\text{new}} = x_{\text{old}} - \frac{2h\cdot g(x_{\text{old}})}{g(x_{\text{old}}+h) - g(x_{\text{old}}-h)}$$
from some starting point.

    y = myocarde$PRONO
    X = cbind(1, as.matrix(myocarde[,1:7]))
    negLogLik = function(beta){
      -sum(-y*log(1 + exp(-(X%*%beta))) - (1-y)*log(1 + exp(X%*%beta)))
    }
    beta_init = lm(PRONO~., data=myocarde)$coefficients
    logistic_opt = optim(par = beta_init, negLogLik, hessian=TRUE,
                         method = "BFGS", control=list(abstol=1e-9))
• iteratively reweighted least squares
We want to compute something like
$$\beta_{\text{new}} = (X^T\Delta_{\text{old}}X)^{-1}X^T\Delta_{\text{old}}z$$
(if we substitute matrices in the analytical expressions) where
$$z = X\beta_{\text{old}} + \Delta_{\text{old}}^{-1}[y - p_{\text{old}}]$$
But actually, that is simply a standard (weighted) least-squares problem,
$$\beta_{\text{new}} = \text{argmin}\left\{(z - X\beta)^T\Delta_{\text{old}}^{-1}(z - X\beta)\right\}$$
The only problem here is that the weights $\Delta_{\text{old}}$ are functions of the unknown $\beta_{\text{old}}$. But if we keep iterating, we should be able to solve it: given β we get the weights, and with the weights we can use weighted OLS to get an updated β. That is the idea of iteratively reweighted least squares.

    df = myocarde
    beta_init = lm(PRONO~., data=df)$coefficients
    X = cbind(1, as.matrix(myocarde[,1:7]))
    beta = beta_init
    for(s in 1:1000){
      # loop body reconstructed from the weighted-least-squares update above
      p = as.numeric(1/(1 + exp(-X %*% beta)))  # current fitted probabilities
      w = p*(1-p)                               # weights (diagonal of Delta)
      z = X %*% beta + (y - p)/w                # working response
      beta = lm(z ~ 0 + X, weights = w)$coefficients
    }
Classification : Regression with Kernels and k-NN
We no longer use the logit/probit framework here: this approach is purely non-parametric. The moving regressogram is
$$\hat m(x) = \frac{\sum_{i=1}^n y_i\cdot \mathbf{1}(x_i\in[x\pm h])}{\sum_{i=1}^n \mathbf{1}(x_i\in[x\pm h])}$$
Consider
$$\tilde m(x) = \frac{\sum_{i=1}^n y_i\cdot k_h(x-x_i)}{\sum_{i=1}^n k_h(x-x_i)}$$
where (classically), for some kernel k,
$$k_h(x) = k\left(\frac{x}{h}\right)$$
    mean_x = function(x, bw){
      w = dnorm((myocarde$INSYS - x)/bw, mean=0, sd=1)
      weighted.mean(myocarde$PRONO, w)}
    u = seq(5, 55, length=201)
    v = Vectorize(function(x) mean_x(x, 3))(u)
    plot(u, v, ylim=0:1, type="l", col="red")
    points(myocarde$INSYS, myocarde$PRONO, pch=19)

    v = Vectorize(function(x) mean_x(x, 2))(u)
    plot(u, v, ylim=0:1, type="l", col="red")
    points(myocarde$INSYS, myocarde$PRONO, pch=19)
One can also use the ksmooth() function of R,
    reg = ksmooth(myocarde$INSYS, myocarde$PRONO, "normal", bandwidth = 2*exp(1))
    plot(reg$x, reg$y, ylim=0:1, type="l", col="red", lwd=2, xlab="INSYS", ylab="")
    points(myocarde$INSYS, myocarde$PRONO, pch=19)

Classical bias / variance tradeoff (below, $m_h(15)$ and $m_h(25)$).
[Figures: kernel smoothers of PRONO against INSYS, and density estimates]
One can also use k-nearest neighbors,
$$\tilde m_k(x) = \frac{1}{n}\sum_{i=1}^n \omega_{i,k}(x)\,y_i$$
where $\omega_{i,k}(x) = n/k$ if $i\in\mathcal{I}_x^k$, with
$$\mathcal{I}_x^k = \{i : x_i \text{ one of the } k \text{ nearest observations to } x\}$$
Classically, use the Mahalanobis distance,

    Sigma = var(myocarde[,1:7])
    Sigma_Inv = solve(Sigma)
    d2_mahalanobis = function(x, y, Sinv){as.numeric(x-y) %*% Sinv %*% t(x-y)}
    k_closest = function(i, k){
      vect_dist = function(j) d2_mahalanobis(myocarde[i,1:7], myocarde[j,1:7], Sigma_Inv)
      vect = Vectorize(vect_dist)(1:nrow(myocarde))
      which(rank(vect) <= k)}   # indices of the k closest observations
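A minimal usage sketch (assuming the objects above; knn_predict is a hypothetical helper, not in the original deck): the k-NN estimate at observation i is the average of PRONO over its k nearest neighbors,

    knn_predict = function(i, k = 10){
      mean(myocarde$PRONO[k_closest(i, k)])   # estimated P[Y=1] near observation i
    }
    knn_predict(1, k = 10)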
Goodness of Fit: ROC Curve
Confusion matrix
Given a sample $(y_i, x_i)$ and a model m, the confusion matrix is the contingency table with dimensions observed $y_i\in\{0,1\}$ and predicted $\hat y_i\in\{0,1\}$:

                 y = 0 (-)   y = 1 (+)
    ŷ = 0 (-)       TN          FN
    ŷ = 1 (+)       FP          TP

with FPR = FP/(TN+FP) and TPR = TP/(FN+TP). Classical measures are
◦ true positive rate (TPR), or sensitivity
◦ false positive rate (FPR), or fall-out
◦ true negative rate (TNR), or specificity; TNR = 1 - FPR
among others (see Wikipedia), e.g.

    ROCR::performance(prediction(Score, Y), "tpr", "fpr")
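For instance, a minimal sketch (assuming the myocarde data and a plain logistic fit, choices made here for illustration):

    library(ROCR)
    fit = glm(PRONO ~ ., data = myocarde, family = binomial)
    Score = predict(fit, type = "response")
    pred = prediction(Score, myocarde$PRONO)
    plot(performance(pred, "tpr", "fpr"), col = "blue", lwd = 2)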
ROC (Receiver Operating Characteristic) Curve
Assume that $m_t$ is defined from a score function s, with $m_t(x) = \mathbf{1}(s(x) > t)$ for some threshold t. The ROC curve is the curve $(\text{FPR}_t, \text{TPR}_t)$ obtained from the confusion matrices of the $m_t$'s.
Consider n = 100 individuals, 50 with $y_i = 0$ and 50 with $y_i = 1$ (a well balanced sample).
For one threshold, the confusion table here is

             Y = 0   Y = 1
    Ŷ = 0      25      25
    Ŷ = 1      25      25

i.e. FPR = 25/50 ~ 0.5 and TPR = 25/50 ~ 0.5, a point on the diagonal of the ROC curve.
29 deaths (y = 0) and 42 survivals (y = 1). With threshold 0%,
$$\hat y = \begin{cases}1 & \text{if } \mathbb{P}[Y=1|X] > 0\% \\ 0 & \text{if } \mathbb{P}[Y=1|X] \le 0\%\end{cases}$$

             y = 0   y = 1
    ŷ = 0       0       0
    ŷ = 1      29      42

FPR = 29/(29+0) = 100%, TPR = 42/(42+0) = 100%.
29 deaths (y = 0) and 42 survivals (y = 1). With threshold 15%,
$$\hat y = \begin{cases}1 & \text{if } \mathbb{P}[Y=1|X] > 15\% \\ 0 & \text{if } \mathbb{P}[Y=1|X] \le 15\%\end{cases}$$

             y = 0   y = 1
    ŷ = 0      17       2
    ŷ = 1      12      40

FPR = 12/(17+12) ~ 41.4%, TPR = 40/(2+40) ~ 95.2%.
29 deaths (y = 0) and 42 survivals (y = 1). With threshold 50%,
$$\hat y = \begin{cases}1 & \text{if } \mathbb{P}[Y=1|X] > 50\% \\ 0 & \text{if } \mathbb{P}[Y=1|X] \le 50\%\end{cases}$$

             y = 0   y = 1
    ŷ = 0      25       3
    ŷ = 1       4      39

FPR = 4/(25+4) ~ 13.8%, TPR = 39/(3+39) ~ 92.9%.
29 deaths (y = 0) and 42 survivals (y = 1). With threshold 85%,
$$\hat y = \begin{cases}1 & \text{if } \mathbb{P}[Y=1|X] > 85\% \\ 0 & \text{if } \mathbb{P}[Y=1|X] \le 85\%\end{cases}$$

             y = 0   y = 1
    ŷ = 0      28      13
    ŷ = 1       1      29

FPR = 1/(28+1) ~ 3.4%, TPR = 29/(13+29) ~ 69.0%.
29 deaths (y = 0) and 42 survivals (y = 1). With threshold 100%,
$$\hat y = \begin{cases}1 & \text{if } \mathbb{P}[Y=1|X] > 100\% \\ 0 & \text{if } \mathbb{P}[Y=1|X] \le 100\%\end{cases}$$

             y = 0   y = 1
    ŷ = 0      29      42
    ŷ = 1       0       0

FPR = 0/(29+0) = 0.0%, TPR = 0/(42+0) = 0.0%.
Goodness of Fit
Be careful of overfitting... it is important to use training / validation samples.
[Figures: ROC curves on training and validation samples, for a classification tree and a boosting classifier]
Goodness of Fit: ROC Curve
With categorical variables, we have a collection of points rather than a curve. We need to interpolate between those points to get a curve; see the convexification of ROC curves, e.g. Flach (2012, Machine Learning).
[Figures: ROC points ("false spam rate" vs. "true spam rate") and their interpolation]
Goodness of Fit
One can derive a confidence interval for the ROC curve using bootstrap techniques, see pROC::ci.se().
Various measures can be used, see the hmeasure package, with the Gini index, the Area Under the Curve (AUC), or the Kolmogorov-Smirnov statistic
$$\text{KS} = \sup_{t\in\mathbb{R}}\left|F_1(t) - F_0(t)\right| = \sup_{t\in\mathbb{R}}\left|\frac{1}{n_1}\sum_{i:y_i=1}\mathbf{1}_{s(x_i)\le t} - \frac{1}{n_0}\sum_{i:y_i=0}\mathbf{1}_{s(x_i)\le t}\right|$$
[Figures: ROC curve with a bootstrap confidence band, and empirical cdfs of the score in the survival and death groups]
The Area Under the Curve (AUC) can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, see Swets, Dawes & Monahan (2000, Psychological Science Can Improve Diagnostic Decisions).
The kappa statistic κ compares an observed accuracy with an expected accuracy (random chance), see Landis & Koch (1977, The Measurement of Observer Agreement for Categorical Data).

             Y = 0    Y = 1
    Ŷ = 0     TN       FN      TN+FN
    Ŷ = 1     FP       TP      FP+TP
             TN+FP    FN+TP      n

See also the observed and random confusion tables:

    observed   Y = 0   Y = 1          random   Y = 0   Y = 1
    Ŷ = 0        25       3   28      Ŷ = 0    11.44   16.56   28
    Ŷ = 1         4      39   43      Ŷ = 1    17.56   25.44   43
                 29      42   71                  29      42   71
From the tables above,
$$\text{total accuracy} = \frac{TP+TN}{n} \sim 90.14\%$$
$$\text{random accuracy} = \frac{[TN+FP]\cdot[TN+FN] + [TP+FP]\cdot[TP+FN]}{n^2} \sim 51.93\%$$
$$\kappa = \frac{\text{total accuracy} - \text{random accuracy}}{1 - \text{random accuracy}} \sim 79.48\%$$
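As a minimal numerical check (a sketch, not on the original slide), with the observed table above:

    M  = matrix(c(25, 4, 3, 39), 2, 2)     # rows: predicted, columns: observed
    n  = sum(M)
    p0 = sum(diag(M))/n                    # total accuracy, ~ 0.9014
    pe = sum(rowSums(M)*colSums(M))/n^2    # random accuracy, ~ 0.5193
    (p0 - pe)/(1 - pe)                     # kappa, ~ 0.7948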
Penalized Estimation
Consider a parametric model with true (unknown) parameter θ; then
$$\text{mse}(\hat\theta) = \mathbb{E}\left[(\hat\theta-\theta)^2\right] = \underbrace{\mathbb{E}\left[(\hat\theta-\mathbb{E}[\hat\theta])^2\right]}_{\text{variance}} + \underbrace{\left(\mathbb{E}[\hat\theta]-\theta\right)^2}_{\text{bias}^2}$$
One can think of a shrinkage of an unbiased estimator. Let $\tilde\theta$ denote an unbiased estimator of θ. Then
$$\hat\theta = \frac{\theta^2}{\theta^2+\text{mse}(\tilde\theta)}\cdot\tilde\theta = \tilde\theta - \underbrace{\frac{\text{mse}(\tilde\theta)}{\theta^2+\text{mse}(\tilde\theta)}\cdot\tilde\theta}_{\text{penalty}}$$
satisfies $\text{mse}(\hat\theta) \le \text{mse}(\tilde\theta)$.
Linear Regression Shortcoming
Least squares estimator: $\hat\beta = (X^TX)^{-1}X^Ty$
Unbiased estimator: $\mathbb{E}[\hat\beta] = \beta$
Variance: $\text{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}$, which can be (extremely) large when $\det[(X^TX)] \sim 0$. E.g.
$$X = \begin{pmatrix}1 & -1 & 2\\ 1 & 0 & 1\\ 1 & 2 & -1\\ 1 & 1 & 0\end{pmatrix},\quad X^TX = \begin{pmatrix}4 & 2 & 2\\ 2 & 6 & -4\\ 2 & -4 & 6\end{pmatrix},\quad\text{while } X^TX + I = \begin{pmatrix}5 & 2 & 2\\ 2 & 7 & -4\\ 2 & -4 & 7\end{pmatrix}$$
with eigenvalues {10, 6, 0} and {11, 7, 1} respectively.
Ad-hoc strategy: use $X^TX + \lambda I$.
Normalization : Euclidean ℓ2 vs. Mahalanobis
We want to penalize complicated models: if $\beta_k$ is "too small", we prefer to have $\beta_k = 0$.
Instead of $d(x,y) = (x-y)^T(x-y)$, use $d_\Sigma(x,y) = (x-y)^T\Sigma^{-1}(x-y)$.
[Figures: contours of the objective in the $(\beta_1,\beta_2)$ plane, for the Euclidean and Mahalanobis geometries]
Ridge Regression
... like least squares, but it shrinks estimated coefficients towards 0,
$$\hat\beta_\lambda^{\text{ridge}} = \text{argmin}\left\{\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\} = \text{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}$$
where $\lambda \ge 0$ is a tuning parameter. The constant is usually unpenalized; the true equation is
$$\hat\beta_\lambda^{\text{ridge}} = \text{argmin}\Big\{\underbrace{\|y - (\beta_0 + X\beta)\|_2^2}_{\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}$$
$$\hat\beta_\lambda^{\text{ridge}} = \text{argmin}\left\{\|y - (\beta_0 + X\beta)\|_2^2 + \lambda\|\beta\|_2^2\right\}$$
can be seen as a constrained optimization problem,
$$\hat\beta_\lambda^{\text{ridge}} = \underset{\|\beta\|_2^2 \le h_\lambda}{\text{argmin}}\left\{\|y - (\beta_0 + X\beta)\|_2^2\right\}$$
Explicit solution:
$$\hat\beta_\lambda = (X^TX + \lambda I)^{-1}X^Ty$$
If $\lambda \to 0$, $\hat\beta_0^{\text{ridge}} = \hat\beta^{\text{ols}}$; if $\lambda \to \infty$, $\hat\beta_\infty^{\text{ridge}} = 0$.
[Figures: least-squares contours with the ℓ2 constraint region]
This penalty can be seen as rather unfair if the components of x are not expressed on the same scale, so
• center: $\overline{x}_j = 0$, then $\hat\beta_0 = \overline{y}$
• scale: $x_j^Tx_j = 1$
Then compute
$$\hat\beta_\lambda^{\text{ridge}} = \text{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{\text{loss}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}$$
[Figures: the same contours after centering and scaling the covariates]
Observe that if $x_{j_1} \perp x_{j_2}$, then
$$\hat\beta_\lambda^{\text{ridge}} = [1+\lambda]^{-1}\hat\beta^{\text{ols}}$$
which explains the relationship with shrinkage. But generally, it is not the case...
Theorem. There exists λ such that $\text{mse}[\hat\beta_\lambda^{\text{ridge}}] \le \text{mse}[\hat\beta^{\text{ols}}]$.
$$\mathcal{L}_\lambda(\beta) = \sum_{i=1}^n (y_i - \beta_0 - x_i^T\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2$$
$$\frac{\partial \mathcal{L}_\lambda(\beta)}{\partial\beta} = -2X^Ty + 2(X^TX+\lambda I)\beta,\qquad \frac{\partial^2 \mathcal{L}_\lambda(\beta)}{\partial\beta\,\partial\beta^T} = 2(X^TX+\lambda I)$$
where $X^TX$ is a positive semi-definite matrix and $\lambda I$ is a positive definite matrix, so
$$\hat\beta_\lambda = (X^TX+\lambda I)^{-1}X^Ty$$
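A minimal sketch of this closed-form estimator in R (with centered and scaled covariates, as discussed above; not code from the original deck):

    Xs = scale(as.matrix(myocarde[,1:7]))       # centered and scaled covariates
    yc = myocarde$PRONO - mean(myocarde$PRONO)  # centered response, unpenalized intercept
    lambda = 1
    beta_ridge = solve(t(Xs) %*% Xs + lambda*diag(ncol(Xs)), t(Xs) %*% yc)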
The Bayesian Interpretation
From a Bayesian perspective,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}}\cdot\underbrace{\mathbb{P}[\theta]}_{\text{prior}}$$
i.e. (up to an additive constant)
$$\log\mathbb{P}[\theta|y] = \underbrace{\log\mathbb{P}[y|\theta]}_{\text{log-likelihood}} + \underbrace{\log\mathbb{P}[\theta]}_{\text{penalty}}$$
If β has a $\mathcal{N}(0,\tau^2 I)$ prior distribution, then its posterior distribution has mean
$$\mathbb{E}[\beta|y,X] = \left(X^TX + \frac{\sigma^2}{\tau^2}I\right)^{-1}X^Ty.$$
Properties of the Ridge Estimator
$$\hat\beta_\lambda = (X^TX+\lambda I)^{-1}X^Ty$$
$$\mathbb{E}[\hat\beta_\lambda] = X^TX(\lambda I + X^TX)^{-1}\beta \ne \beta$$
i.e. $\hat\beta_\lambda$ is biased. Observe that $\mathbb{E}[\hat\beta_\lambda] \to 0$ as $\lambda\to\infty$.
Assume that X is an orthogonal design matrix, i.e. $X^TX = I$; then
$$\hat\beta_\lambda = (1+\lambda)^{-1}\hat\beta^{\text{ols}}.$$
Set $W_\lambda = (I + \lambda[X^TX]^{-1})^{-1}$. One can prove that
$$W_\lambda\hat\beta^{\text{ols}} = \hat\beta_\lambda$$
Thus,
$$\text{Var}[\hat\beta_\lambda] = W_\lambda\text{Var}[\hat\beta^{\text{ols}}]W_\lambda^T = \sigma^2(X^TX+\lambda I)^{-1}X^TX[(X^TX+\lambda I)^{-1}]^T.$$
Observe that
$$\text{Var}[\hat\beta^{\text{ols}}] - \text{Var}[\hat\beta_\lambda] = \sigma^2 W_\lambda\left[2\lambda(X^TX)^{-2} + \lambda^2(X^TX)^{-3}\right]W_\lambda^T \ge 0.$$
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than that of the OLS. If X is an orthogonal design matrix,
$$\text{Var}[\hat\beta_\lambda] = \sigma^2(1+\lambda)^{-2}I.$$
$$\text{mse}[\hat\beta_\lambda] = \sigma^2\,\text{trace}\big(W_\lambda(X^TX)^{-1}W_\lambda^T\big) + \beta^T(W_\lambda - I)^T(W_\lambda - I)\beta.$$
If X is an orthogonal design matrix,
$$\text{mse}[\hat\beta_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^T\beta$$
[Figure: trajectory of $(\hat\beta_1,\hat\beta_2)$ as λ increases]
The quantity
$$\text{mse}[\hat\beta_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^T\beta$$
is minimal for
$$\lambda^\star = \frac{p\sigma^2}{\beta^T\beta}$$
Note that there exists $\lambda > 0$ such that $\text{mse}[\hat\beta_\lambda] < \text{mse}[\hat\beta_0] = \text{mse}[\hat\beta^{\text{ols}}]$.
Sparsity Issues
In several applications, k can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many j's. Let s denote the number of relevant features, with $s \ll k$, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),
$$s = \text{card}\{\mathcal{S}\}\quad\text{where}\quad \mathcal{S} = \{j;\ \beta_j \ne 0\}$$
The model is now $y = X_{\mathcal{S}}\beta_{\mathcal{S}} + \varepsilon$, where $X_{\mathcal{S}}^TX_{\mathcal{S}}$ is a full rank matrix.
Going further on sparsity issues
The Ridge regression problem was to solve
$$\hat\beta = \underset{\beta\in\{\|\beta\|_2\le s\}}{\text{argmin}}\left\{\|y - X\beta\|_2^2\right\}$$
Define $\|a\|_0 = \sum_i \mathbf{1}(|a_i| > 0)$. Here $\dim(\beta) = k$ but $\|\beta\|_0 = s$. We wish we could solve
$$\hat\beta = \underset{\beta\in\{\|\beta\|_0 = s\}}{\text{argmin}}\left\{\|y - X\beta\|_2^2\right\}$$
Problem: it is usually not possible to enumerate all possible constraints, since $\binom{k}{s}$ sets of coefficients should be considered here (with k (very) large).
In a convex problem, one can solve the dual problem; e.g. in the Ridge regression, the primal problem is
$$\min_{\beta\in\{\|\beta\|_2\le s\}}\left\{\|y - X\beta\|_2^2\right\}$$
and the dual problem is
$$\min_{\beta\in\{\|y - X\beta\|_2\le t\}}\left\{\|\beta\|_2^2\right\}$$
[Figures: contours of the objective with the primal and dual constraint regions]
Idea: solve the dual problem
$$\hat\beta = \underset{\beta\in\{\|y-X\beta\|_2\le h\}}{\text{argmin}}\left\{\|\beta\|_0\right\}$$
where we might convexify the $\ell_0$ norm, $\|\cdot\|_0$.
On $[-1,+1]^k$, the convex hull of $\|\beta\|_0$ is $\|\beta\|_1$; on $[-a,+a]^k$, the convex hull of $\|\beta\|_0$ is $a^{-1}\|\beta\|_1$. Hence, why not solve
$$\hat\beta = \underset{\beta;\ \|\beta\|_1\le \tilde s}{\text{argmin}}\left\{\|y - X\beta\|_2^2\right\}$$
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$$\hat\beta = \text{argmin}\left\{\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\right\}$$
lasso: Least Absolute Shrinkage and Selection Operator
$$\hat\beta \in \text{argmin}\left\{\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\right\}$$
is a convex problem (several algorithms exist), but not a strictly convex one (no uniqueness of the minimizer). Nevertheless, predictions $\hat y = x^T\hat\beta$ are unique.
Algorithms: MM (majorize-minimize) and coordinate descent; see Hunter & Lange (2003, A Tutorial on MM Algorithms).
lasso Regression
For some λ, there are k's such that $\hat\beta_{k,\lambda}^{\text{lasso}} = 0$. Further, $\lambda \mapsto \hat\beta_{k,\lambda}^{\text{lasso}}$ is piecewise linear.
[Figures: least-squares contours with the ℓ1 constraint region]
In the orthogonal case, $X^TX = I$,
$$\hat\beta_{k,\lambda}^{\text{lasso}} = \text{sign}(\hat\beta_k^{\text{ols}})\left(|\hat\beta_k^{\text{ols}}| - \frac{\lambda}{2}\right)_+$$
i.e. the lasso estimate is related to the soft-thresholding function...
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
We want to solve something like
$$\hat\beta_\lambda = \text{argmin}\left\{-\log\mathcal{L}(\beta|x,y) + \lambda\|\beta\|_2^2\right\}$$
In the case where we consider the log-likelihood of some Gaussian variable, we get the sum of the squares of the residuals, and we can obtain an explicit solution. But not in the context of a logistic regression.
The heuristics about Ridge regression is the following graph. In the background, we can visualize the (two-dimensional) log-likelihood of the logistic regression, and the blue circle is the constraint we have, if we rewrite the optimization problem as a constrained optimization problem:
$$\min_{\beta:\|\beta\|_2^2\le s}\ \sum_{i=1}^n -\log\mathcal{L}(y_i, \beta_0 + x_i^T\beta)$$
It can be written equivalently (it is a strictly convex problem)
$$\min_{\beta,\lambda}\ -\sum_{i=1}^n \log\mathcal{L}(y_i, \beta_0 + x_i^T\beta) + \lambda\|\beta\|_2^2$$
Thus, the constrained maximum should lie in the blue disk.

    PennegLogLik = function(bbeta, lambda=0){
      b0 = bbeta[1]
      beta = bbeta[-1]
      -sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta))) +
        lambda*sum(beta^2)}
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
It can be written equivalently
$$\min_{\beta,\lambda}\ -\sum_{i=1}^n \log\mathcal{L}(y_i, \beta_0 + x_i^T\beta) + \lambda\|\beta\|_1$$
Thus, the constrained maximum should lie in the blue diamond (the ℓ1 ball drawn by the polygon below).
    y = myocarde$PRONO
    X = myocarde[,1:7]
    for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
    X = as.matrix(X)
    LogLik = function(bbeta){
      b0 = bbeta[1]
      beta = bbeta[-1]
      sum(-y*log(1 + exp(-(b0+X%*%beta))) -
          (1-y)*log(1 + exp(b0+X%*%beta)))}
    # for this two-dimensional visualization, X should contain two covariates only
    u = seq(-4, 4, length=251)
    v = outer(u, u, function(x,y) LogLik(c(1,x,y)))
    image(u, u, v, col=rev(heat.colors(25)))
    contour(u, u, v, add=TRUE)
    polygon(c(-1,0,1,0), c(0,1,0,-1), border="blue")
Observe that some coefficients are now (strictly) null
    library(glmnet)
    glm_lasso = glmnet(X, y, alpha=1)
    plot(glm_lasso, xvar="lambda", col=colrs, lwd=2)
    plot(glm_lasso, col=colrs, lwd=2)
    glmnet(X, y, alpha=1, lambda=exp(-4))$beta
    # 7x1 sparse Matrix of class "dgCMatrix"
    #                s0
    # FRCAR  .
    # INCAR  0.11005070
    # INSYS  0.03231929
    # PRDIA  .
    # PAPUL  .
    # PVENT -0.03138089
    # REPUL -0.20962611
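In practice λ is typically chosen by cross-validation; a minimal sketch (assuming X and y as above; the binomial family and AUC criterion are choices made here, not taken from the slide):

    cv = cv.glmnet(X, y, alpha = 1, family = "binomial", type.measure = "auc")
    cv$lambda.min                # lambda with the best cross-validated AUC
    coef(cv, s = "lambda.min")   # coefficients at that lambda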
With orthogonal covariates
    library(factoextra)
    pca = princomp(X)
    pca_X = get_pca_ind(pca)$coord
    glm_lasso = glmnet(pca_X, y, alpha=1)
    plot(glm_lasso, xvar="lambda", col=colrs)
    plot(glm_lasso, col=colrs)

Before focusing on the lasso for binary variables, consider the standard OLS-lasso... But first, to illustrate, let us get back to quantile (ℓ1) regression...
Optimization with linear problems (application to ℓ1-regression)
Consider the case of the median, for a sample $\{y_1,\dots,y_n\}$. To compute the median, solve
$$\min_\mu\ \sum_{i=1}^n |y_i - \mu|$$
which can be solved using linear programming techniques (see e.g. the simplex method). More precisely, this problem is equivalent to
$$\min_{\mu,a,b}\ \sum_{i=1}^n a_i + b_i\quad\text{with } a_i, b_i \ge 0 \text{ and } y_i - \mu = a_i - b_i,\ \forall i = 1,\dots,n.$$
Consider a sample obtained from a lognormal distribution,

    n = 101
    set.seed(1)
    y = rlnorm(n)
    median(y)
    # [1] 1.077415
Here, one can use a standard optimization routine,

    md = Vectorize(function(m) sum(abs(y-m)))
    optim(mean(y), md)$par
    # [1] 1.077416

or a linear programming technique: use the matrix form, with 3n constraints and 2n+1 parameters,

    library(lpSolve)
    A1 = cbind(diag(2*n), 0)
    A2 = cbind(diag(n), -diag(n), 1)
    r = lp("min", c(rep(1,2*n), 0),
           rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
    tail(r$solution, 1)
    # [1] 1.077415
More generally, consider some quantile,

    tau = .3
    quantile(y, tau)
    #       30%
    # 0.6741586

The linear program is now
$$\min_{\mu,a,b}\ \sum_{i=1}^n \tau a_i + (1-\tau)b_i\quad\text{with } a_i, b_i \ge 0 \text{ and } y_i - \mu = a_i - b_i,\ \forall i = 1,\dots,n.$$

    A1 = cbind(diag(2*n), 0)
    A2 = cbind(diag(n), -diag(n), 1)
    r = lp("min", c(rep(tau,n), rep(1-tau,n), 0),
           rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
    tail(r$solution, 1)
    # [1] 0.6741586
To illustrate numerical computations, use

    base = read.table("http://freakonometrics.free.fr/rent98_00.txt", header=TRUE)

The linear program for the quantile regression is now
$$\min_{\mu,a,b}\ \sum_{i=1}^n \tau a_i + (1-\tau)b_i\quad\text{with } a_i, b_i \ge 0 \text{ and } y_i - [\beta_0^\tau + \beta_1^\tau x_i] = a_i - b_i,\ \forall i = 1,\dots,n.$$

    require(lpSolve)
    tau = .3
    n = nrow(base)
    X = cbind(1, base$area)
    y = base$rent_euro
    A1 = cbind(diag(2*n), 0, 0)
    A2 = cbind(diag(n), -diag(n), X)
1 r = lp("min",
2 c(rep(tau ,n), rep(1-tau ,n) ,0,0), rbind(A1 , A2),
3 c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
4 tail(r$solution ,2)
5 [1] 148.946864 3.289674
see
1 library(quantreg)
2 rq(rent_euro˜area , tau=tau , data=base)
3 Coefficients :
4 (Intercept) area
5 148.946864 3.289674
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
If we get back to the original lasso approach, the goal was to solve
$$\min\ \frac{1}{2n}\sum_{i=1}^n \left[y_i - (\beta_0 + x_i^T\beta)\right]^2 + \lambda\sum_j |\beta_j|$$
since the intercept is not subject to the penalty. The first order condition is then
$$\frac{\partial}{\partial\beta_0}\,\|y - X\beta - \beta_0\mathbf{1}\|^2 = 2(X\beta - y)^T\mathbf{1} + 2\beta_0\|\mathbf{1}\|^2 = 0$$
i.e.
$$\beta_0 = \frac{1}{n}(y - X\beta)^T\mathbf{1}$$
For the other parameters, assuming that the KKT conditions are satisfied, we can check whether 0 belongs to the subdifferential at the minimum,
$$0 \in \partial\left\{\frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1\right\} = \nabla\left\{\frac{1}{2}\|y - X\beta\|^2\right\} + \partial\big(\lambda\|\beta\|_1\big)$$
For the term on the left, we recognize
$$\nabla\left\{\frac{1}{2}\|y - X\beta\|^2\right\} = -X^T(y - X\beta) = -g$$
so that the previous equation can be written, coordinate-wise,
$$g_k \in \partial(\lambda|\beta_k|) = \begin{cases}\{+\lambda\} & \text{if } \beta_k > 0\\ \{-\lambda\} & \text{if } \beta_k < 0\\ (-\lambda,+\lambda) & \text{if } \beta_k = 0\end{cases}$$
i.e. if $\beta_k \ne 0$, then $g_k = \text{sign}(\beta_k)\cdot\lambda$.
We can split $\beta_j$ into a sum of its positive and negative parts by replacing $\beta_j$ with $\beta_j^+ - \beta_j^-$ where $\beta_j^+, \beta_j^- \ge 0$.
Then the lasso problem becomes
$$-\log\mathcal{L}(\beta) + \lambda\sum_j (\beta_j^+ + \beta_j^-)$$
with constraints $\beta_j^+, \beta_j^- \ge 0$. Let $\alpha_j^+, \alpha_j^-$ denote the Lagrange multipliers for $\beta_j^+$ and $\beta_j^-$, respectively, and consider the Lagrangian
$$-\log\mathcal{L}(\beta) + \lambda\sum_j (\beta_j^+ + \beta_j^-) - \sum_j \alpha_j^+\beta_j^+ - \sum_j \alpha_j^-\beta_j^-.$$
To satisfy the stationarity condition, we take the gradient of the Lagrangian with respect to $\beta_j^+$ and set it to zero, to obtain
$$\nabla\big(-\log\mathcal{L}(\beta)\big)_j + \lambda - \alpha_j^+ = 0$$
We do the same with respect to $\beta_j^-$ to obtain
$$-\nabla\big(-\log\mathcal{L}(\beta)\big)_j + \lambda - \alpha_j^- = 0.$$
For convenience, introduce the soft-thresholding function
$$S(z,\gamma) = \text{sign}(z)\cdot(|z|-\gamma)_+ = \begin{cases}z-\gamma & \text{if } \gamma < |z| \text{ and } z > 0\\ z+\gamma & \text{if } \gamma < |z| \text{ and } z < 0\\ 0 & \text{if } \gamma \ge |z|\end{cases}$$
Noticing that the optimization problem
$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
can also be written, coordinate by coordinate (in the orthonormal case),
$$\min\ \sum_{j=1}^p \left\{-\hat\beta_j^{\text{ols}}\cdot\beta_j + \frac{1}{2}\beta_j^2 + \lambda|\beta_j|\right\}$$
observe that $\hat\beta_{j,\lambda} = S(\hat\beta_j^{\text{ols}},\lambda)$, which is a coordinate-wise update.
Now, if we consider a (slightly) more general problem, with weights in the first part,
$$\min\ \frac{1}{2n}\sum_{i=1}^n \omega_i\left[y_i - (\beta_0 + x_i^T\beta)\right]^2 + \lambda\sum_j |\beta_j|$$
the coordinate-wise update becomes
$$\hat\beta_{j,\lambda,\omega} = S(\hat\beta_j^{\omega\text{-ols}},\lambda)$$
An alternative is to set the partial residuals
$$r_j = y - \Big(\hat\beta_0\mathbf{1} + \sum_{k\ne j}\hat\beta_k x_k\Big) = y - \hat y^{(j)}$$
so that the optimization problem can be written, equivalently, coordinate by coordinate,
$$\min\ \frac{1}{2n}\,\|r_j - \beta_j x_j\|^2 + \lambda|\beta_j|$$
hence
$$\min\ \frac{1}{2n}\left[\beta_j^2\|x_j\|^2 - 2\beta_j\, r_j^Tx_j\right] + \lambda|\beta_j|$$
and one gets
$$\hat\beta_{j,\lambda} = \frac{1}{\|x_j\|^2}\,S(r_j^Tx_j,\, n\lambda)$$
or, if we develop,
$$\hat\beta_{j,\lambda} = \frac{1}{\sum_i x_{ij}^2}\,S\Big(\sum_i x_{i,j}\big[y_i - \hat y_i^{(j)}\big],\ n\lambda\Big)$$
Again, if there are weights $\omega = (\omega_i)$, the coordinate-wise update becomes
$$\hat\beta_{j,\lambda,\omega} = \frac{1}{\sum_i \omega_i x_{ij}^2}\,S\Big(\sum_i \omega_i x_{i,j}\big[y_i - \hat y_i^{(j)}\big],\ n\lambda\Big)$$
The code to compute this componentwise descent uses

    soft_thresholding = function(x, a){
      result = numeric(length(x))
      result[which(x > a)]  = x[which(x > a)] - a   # shrink positive values by a
      result[which(x < -a)] = x[which(x < -a)] + a  # shrink negative values by a
      return(result)
    }

and the function lasso_coord_desc, shown below in two parts (the end of its loop first, then the function header).
    # end of lasso_coord_desc (inside the loop over iterations j)
      beta0list[j+1] = beta0
      betalist[[j+1]] = beta
      obj[j] = (1/2)*(1/length(y))*norm(omega*(y - X%*%beta -
                 beta0*rep(1,length(y))),'F')^2 + lambda*sum(abs(beta))
      if (norm(rbind(beta0list[j],betalist[[j]]) - rbind(beta0,beta),
               'F') < tol) { break }
      }
      return(list(obj=obj[1:j], beta=beta, intercept=beta0)) }

If we get back to our logistic problem, the log-likelihood is here
$$\log\mathcal{L} = \frac{1}{n}\sum_{i=1}^n y_i\cdot(\beta_0 + x_i^T\beta) - \log\big[1+\exp(\beta_0 + x_i^T\beta)\big]$$
which is a concave function of the parameters. Hence, one can use a quadratic approximation of the log-likelihood, using a Taylor expansion,
$$\log\mathcal{L} \approx \log\widetilde{\mathcal{L}} = -\frac{1}{2n}\sum_{i=1}^n \omega_i\cdot\big[z_i - (\beta_0 + x_i^T\beta)\big]^2 + \text{constant}$$
where $z_i$ is the working response
$$z_i = (\beta_0 + x_i^T\beta) + \frac{y_i - p_i}{p_i[1-p_i]},$$
$p_i$ is the prediction
$$p_i = \frac{\exp[\beta_0 + x_i^T\beta]}{1+\exp[\beta_0 + x_i^T\beta]}$$
and the $\omega_i$'s are the weights, $\omega_i = p_i[1-p_i]$. Thus, we obtain a penalized least-squares problem.

    lasso_coord_desc = function(X, y, beta, lambda, tol=1e-6, maxiter=1000){
      beta = as.matrix(beta)
      X = as.matrix(X)
      obj = numeric(length=(maxiter+1))
      betalist = list(length(maxiter+1))
      betalist[[1]] = beta
      beta0 = sum(y-X%*%beta)/(length(y))
      p = exp(beta0*rep(1,length(y)) + X%*%beta)/(1+exp(beta0*rep(1,length(y)) + X%*%beta))
      z = beta0*rep(1,length(y)) + X%*%beta + (y-p)/(p*(1-p))
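The slides then jump to the end of the function (shown just above); as a hedged sketch, the missing middle is essentially the coordinate-wise soft-thresholding loop derived earlier (unweighted version, assuming soft_thresholding() as defined; a reconstruction, not verbatim from the deck):

    for(j in 1:ncol(X)){
      r = y - X[,-j] %*% beta[-j] - beta0      # partial residual y - yhat^(j)
      beta[j] = soft_thresholding(sum(X[,j]*r), length(y)*lambda) / sum(X[,j]^2)
    }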
Classification : (Linear) Fisher’s Discrimination
Consider the following naive classification rule
$$m^\star(x) = \underset{y}{\text{argmax}}\{\mathbb{P}[Y=y|X=x]\}$$
or
$$m^\star(x) = \underset{y}{\text{argmax}}\left\{\frac{\mathbb{P}[X=x|Y=y]\,\mathbb{P}[Y=y]}{\mathbb{P}[X=x]}\right\}$$
(where $\mathbb{P}[X=x]$ is the density in the continuous case). In the case where y takes two values, the standard {0,1} here, one can rewrite the latter as
$$m^\star(x) = \begin{cases}1 & \text{if } \mathbb{E}(Y|X=x) > \frac{1}{2}\\ 0 & \text{otherwise}\end{cases}$$
The set
$$\mathcal{DS} = \left\{x,\ \mathbb{E}(Y|X=x) = \frac{1}{2}\right\}$$
is called the decision boundary.
Assume that $X|Y=0 \sim \mathcal{N}(\mu_0,\Sigma_0)$ and $X|Y=1 \sim \mathcal{N}(\mu_1,\Sigma_1)$; then explicit expressions can be derived,
$$m^\star(x) = \begin{cases}1 & \text{if } r_1^2 < r_0^2 + 2\log\dfrac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)} + \log\dfrac{|\Sigma_0|}{|\Sigma_1|}\\ 0 & \text{otherwise}\end{cases}$$
where $r_y^2$ is the Mahalanobis distance,
$$r_y^2 = [x-\mu_y]^T\Sigma_y^{-1}[x-\mu_y]$$
Let $\delta_y$ be defined as
$$\delta_y(x) = -\frac{1}{2}\log|\Sigma_y| - \frac{1}{2}[x-\mu_y]^T\Sigma_y^{-1}[x-\mu_y] + \log\mathbb{P}(Y=y)$$
The decision boundary of this classifier is
$$\{x \text{ such that } \delta_0(x) = \delta_1(x)\}$$
which is quadratic in x. This is quadratic discriminant analysis (QDA).
In Fisher (1936, The use of multiple measurements in taxonomic problems), it was assumed that $\Sigma_0 = \Sigma_1 = \Sigma$. In that case, actually,
$$\delta_y(x) = x^T\Sigma^{-1}\mu_y - \frac{1}{2}\mu_y^T\Sigma^{-1}\mu_y + \log\mathbb{P}(Y=y)$$
and the decision boundary is now linear in x: this is linear discriminant analysis (LDA).
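A minimal sketch in R (using MASS, a standard choice not named on this slide):

    library(MASS)
    fit_lda = lda(PRONO ~ ., data = myocarde)   # common covariance matrix (linear boundary)
    fit_qda = qda(PRONO ~ ., data = myocarde)   # group-specific covariances (quadratic boundary)
    head(predict(fit_lda, myocarde)$posterior)  # posterior probabilities P[Y=y|X=x]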
Note that there are strong links with the logistic regression. Assume, as previously, that $X|Y=0 \sim \mathcal{N}(\mu_0,\Sigma)$ and $X|Y=1 \sim \mathcal{N}(\mu_1,\Sigma)$; then
$$\log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=0|X=x)}$$
is equal to
$$x^T\Sigma^{-1}[\mu_1-\mu_0] - \frac{1}{2}[\mu_1+\mu_0]^T\Sigma^{-1}[\mu_1-\mu_0] + \log\frac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)}$$
which is linear in x,
$$\log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=0|X=x)} = x^T\beta$$
Hence, when each group has a Gaussian distribution with identical covariance matrix, LDA and the logistic regression lead to the same classification rule.
Observe furthermore that the slope is proportional to $\Sigma^{-1}[\mu_1-\mu_0]$, as stated in Fisher's article. But to obtain such a relationship, he observed that the ratio of between- and within-group variances was
$$\frac{\text{variance between}}{\text{variance within}} = \frac{[\omega^T\mu_1 - \omega^T\mu_0]^2}{\omega^T\Sigma_1\omega + \omega^T\Sigma_0\omega}$$
which is maximal when ω is proportional to $\Sigma^{-1}[\mu_1-\mu_0]$.
Classification : Neural Nets
The most interesting part is not ’neural’ but ’network’.
Networks have nodes, and edges that connect nodes; or maybe, to be more specific, some sort of flow network.
In such a network, we usually have (here multiple) sources (here $\{s_1,s_2,s_3\}$) on the left, and a sink (here $\{t\}$) on the right. To continue with this metaphorical introduction, information from the sources should reach the sink. And usually, sources are the explanatory variables $\{x_1,\dots,x_p\}$, and the sink is our variable of interest y. The output here is a binary variable, $y\in\{0,1\}$.
Consider the sigmoid function
$$f(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1}$$
which is actually the logistic function,

    sigmoid = function(x) 1 / (1 + exp(-x))

This function f is called the activation function.
If $y\in\{-1,+1\}$, one can consider the hyperbolic tangent
$$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
or the inverse tangent function, $f(x) = \tan^{-1}(x)$.
As input for such a function, we consider a weighted sum of the incoming nodes,
$$y_i = f\Big(\sum_{j=1}^p \omega_j x_{j,i}\Big)$$
We can also add a constant,
$$y_i = f\Big(\omega_0 + \sum_{j=1}^p \omega_j x_{j,i}\Big)$$
The error of that model can be computed using the quadratic loss function,
$$\sum_{i=1}^n (y_i - p(x_i))^2$$
or the cross-entropy,
$$-\sum_{i=1}^n \big(y_i\log p(x_i) + [1-y_i]\log[1-p(x_i)]\big)$$

    loss = function(weights){
      mean( (myocarde$PRONO - sigmoid(X %*% weights))^2 ) }

We want to solve
$$\omega^\star = \text{argmin}\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,\ f(\omega_0 + x_i^T\omega)\big)\right\}$$
    weights_1 = optim(weights_0, loss)$par
    y_5_2 = sigmoid(X %*% weights_1)
    pred = ROCR::prediction(y_5_2, myocarde$PRONO)
    perf = ROCR::performance(pred, "tpr", "fpr")
    plot(perf, col="blue", lwd=2)
    plot(perf0, add=TRUE, col="red")

Consider a single hidden layer, with 3 different neurons, so that
$$p(x) = f\Big(\omega_0 + \sum_{h=1}^3 \omega_h f_h\Big(\omega_{h,0} + \sum_{j=1}^p \omega_{h,j}x_j\Big)\Big)$$
or
$$p(x) = f\Big(\omega_0 + \sum_{h=1}^3 \omega_h f_h\big(\omega_{h,0} + x^T\omega_h\big)\Big)$$
If
$$p(x) = f\Big(\omega_0 + \sum_{h=1}^3 \omega_h f_h\big(\omega_{h,0} + x^T\omega_h\big)\Big)$$
we want to solve, with back propagation,
$$\omega^\star = \text{argmin}\left\{\frac{1}{n}\sum_{i=1}^n \ell(y_i, p(x_i))\right\}$$
For practical issues, let us center all variables,

    center = function(z) (z-mean(z))/sd(z)

The loss function is here

    loss = function(weights){
      weights_1 = weights[0+(1:7)]
      weights_2 = weights[7+(1:7)]
      weights_3 = weights[14+(1:7)]
      weights_  = weights[21+1:4]
      X1 = X2 = X3 = as.matrix(myocarde[,1:7])
      Z1 = center(X1 %*% weights_1)
Of course, one can consider a multiple layer network,
$$p(x) = f\Big(\omega_0 + \sum_{h=1}^3 \omega_h f_h\big(\omega_{h,0} + z_h^T\omega_h\big)\Big)$$
where
$$z_h = f\Big(\omega_{h,0} + \sum_{j=1}^{k_h} \omega_{h,j} f_{h,j}\big(\omega_{h,j,0} + x^T\omega_{h,j}\big)\Big)$$

    library(neuralnet)
    model_nnet = neuralnet(formula(glm(PRONO~., data=myocarde_minmax)),
                           myocarde_minmax, hidden=3, act.fct = sigmoid)
    plot(model_nnet)
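Predictions can then be obtained from the fitted network, e.g. (a minimal sketch, assuming myocarde_minmax as above):

    pred = neuralnet::compute(model_nnet, myocarde_minmax[,1:7])$net.result
    head(pred)   # fitted probabilities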
SVM : Support Vector Machine
Linearly separable sample [econometric notations]: data $(y_1,x_1),\dots,(y_n,x_n)$, with $y\in\{0,1\}$, are linearly separable if there are $(\beta_0,\beta)$ such that
- $y_i = 1$ if $\beta_0 + x^T\beta > 0$
- $y_i = 0$ if $\beta_0 + x^T\beta < 0$
Linearly separable sample [ML notations]: data $(y_1,x_1),\dots,(y_n,x_n)$, with $y\in\{-1,+1\}$, are linearly separable if there are $(b,\omega)$ such that
- $y_i = +1$ if $b + \langle x,\omega\rangle > 0$
- $y_i = -1$ if $b + \langle x,\omega\rangle < 0$
or equivalently $y_i\cdot(b + \langle x,\omega\rangle) > 0$, $\forall i$.
$b + \langle x,\omega\rangle = 0$ is a hyperplane (in $\mathbb{R}^p$) orthogonal to ω.
Use $m^\star(x) = \mathbf{1}_{b+\langle x,\omega\rangle \ge 0} - \mathbf{1}_{b+\langle x,\omega\rangle < 0}$ as classifier.
Problem: the equation (i.e. $(b,\omega)$) is not unique! Canonical form:
$$\min_{i=1,\dots,n} |b + \langle x_i,\omega\rangle| = 1$$
Problem: the solution here is still not unique! Idea: use the widest (safety) margin γ, see Vapnik & Lerner (1963, Pattern recognition using generalized portrait method) or Cover (1965, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition).
Consider two points, $x_{-1}$ and $x_{+1}$, one on each side of the margin,
$$\gamma = \frac{1}{2}\left\langle \frac{\omega}{\|\omega\|},\ x_{+1} - x_{-1}\right\rangle$$
It is minimal when $b + \langle x_{-1},\omega\rangle = -1$ and $b + \langle x_{+1},\omega\rangle = +1$, and therefore
$$\gamma = \frac{1}{\|\omega\|}$$
The optimization problem becomes
$$\min_{(b,\omega)}\ \frac{1}{2}\|\omega\|_2^2\quad\text{s.t. } y_i\cdot(b + \langle x_i,\omega\rangle) \ge 1,\ \forall i,$$
a convex optimization problem with linear constraints.
Consider the following problem: $\min_{u\in\mathbb{R}^p} h(u)$ s.t. $g_i(u) \ge 0$, $\forall i = 1,\dots,n$, where h is quadratic and the $g_i$'s are linear. The Lagrangian is $L:\mathbb{R}^p\times\mathbb{R}^n\to\mathbb{R}$, defined as
$$L(u,\alpha) = h(u) - \sum_{i=1}^n \alpha_i g_i(u)$$
where the α's are dual variables, and the dual function is
$$\Lambda : \alpha \mapsto L(u_\alpha,\alpha) = \min_u\{L(u,\alpha)\}\quad\text{where } u_\alpha = \text{argmin}_u\, L(u,\alpha)$$
One can solve the dual problem, $\max\{\Lambda(\alpha)\}$ s.t. $\alpha \ge 0$. The solution is $u^\star = u_{\alpha^\star}$.
If $g_i(u_{\alpha^\star}) > 0$, then necessarily $\alpha_i^\star = 0$ (see the Karush-Kuhn-Tucker (KKT) condition, $\alpha_i^\star\cdot g_i(u_{\alpha^\star}) = 0$).
Here,
$$L(b,\omega,\alpha) = \frac{1}{2}\|\omega\|^2 - \sum_{i=1}^n \alpha_i\cdot\big[y_i\cdot(b + \langle x_i,\omega\rangle) - 1\big]$$
From the first order conditions,
$$\frac{\partial L(b,\omega,\alpha)}{\partial\omega} = \omega - \sum_{i=1}^n \alpha_i\cdot y_i x_i = 0,\quad\text{i.e. } \omega^\star = \sum_{i=1}^n \alpha_i^\star y_i x_i$$
$$\frac{\partial L(b,\omega,\alpha)}{\partial b} = -\sum_{i=1}^n \alpha_i\cdot y_i = 0,\quad\text{i.e. } \sum_{i=1}^n \alpha_i\cdot y_i = 0$$
and
$$\Lambda(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle$$
$$\min_\alpha\ \frac{1}{2}\alpha^TQ\alpha - \mathbf{1}^T\alpha\quad\text{s.t. } \alpha_i \ge 0\ \forall i,\ \ y^T\alpha = 0$$
where $Q = [Q_{i,j}]$ and $Q_{i,j} = y_iy_j\langle x_i,x_j\rangle$, and then
$$\omega^\star = \sum_{i=1}^n \alpha_i^\star y_i x_i\quad\text{and}\quad b^\star = -\frac{1}{2}\Big(\min_{i:y_i=+1}\{\langle x_i,\omega^\star\rangle\} + \max_{i:y_i=-1}\{\langle x_i,\omega^\star\rangle\}\Big)$$
Points $x_i$ such that $\alpha_i^\star > 0$ are called support vectors: $y_i\cdot(b^\star + \langle x_i,\omega^\star\rangle) = 1$.
Use $m^\star(x) = \mathbf{1}_{b^\star+\langle x,\omega^\star\rangle \ge 0} - \mathbf{1}_{b^\star+\langle x,\omega^\star\rangle < 0}$ as classifier. Observe that
$$\gamma = \Big(\sum_{i=1}^n \alpha_i^\star\Big)^{-1/2}$$
The optimization problem was
$$\min_{(b,\omega)}\ \frac{1}{2}\|\omega\|_2^2\quad\text{s.t. } y_i\cdot(b + \langle x_i,\omega\rangle) \ge 1,\ \forall i,$$
which became
$$\min_{(b,\omega)}\ \frac{1}{2}\|\omega\|_2^2 + \sum_{i=1}^n \alpha_i\cdot\big[1 - y_i\cdot(b + \langle x_i,\omega\rangle)\big],$$
or
$$\min_{(b,\omega)}\ \frac{1}{2}\|\omega\|_2^2 + \text{penalty}.$$
Consider here the more general case where the space is not linearly separable:
$$(\langle\omega,x_i\rangle + b)y_i \ge 1\quad\text{becomes}\quad (\langle\omega,x_i\rangle + b)y_i \ge 1 - \xi_i$$
for some slack variables $\xi_i$. We penalize large slack variables, i.e. solve (for some cost C)
$$\min_{\omega,b}\ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^n \xi_i$$
subject to $\forall i$, $\xi_i \ge 0$ and $(\langle x_i,\omega\rangle + b)y_i \ge 1 - \xi_i$.
This is the soft-margin extension, see e1071::svm() or kernlab::ksvm().
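For instance, a minimal sketch with e1071 (the linear kernel and the cost value are illustrative choices made here):

    library(e1071)
    fit_svm = svm(factor(PRONO) ~ ., data = myocarde, kernel = "linear", cost = .5)
    table(predicted = predict(fit_svm), observed = myocarde$PRONO)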
The dual optimization problem is now
$$\min_\alpha\ \frac{1}{2}\alpha^TQ\alpha - \mathbf{1}^T\alpha\quad\text{s.t. } 0 \le \alpha_i \le C\ \forall i,\ \ y^T\alpha = 0$$
where $Q = [Q_{i,j}]$ and $Q_{i,j} = y_iy_j\langle x_i,x_j\rangle$, and then
$$\omega^\star = \sum_{i=1}^n \alpha_i^\star y_i x_i\quad\text{and}\quad b^\star = -\frac{1}{2}\Big(\min_{i:y_i=+1}\{\langle x_i,\omega^\star\rangle\} + \max_{i:y_i=-1}\{\langle x_i,\omega^\star\rangle\}\Big)$$
Note further that the (primal) optimization problem can be written
$$\min_{(b,\omega)}\ \frac{1}{2}\|\omega\|_2^2 + \sum_{i=1}^n \big(1 - y_i\cdot(b + \langle x_i,\omega\rangle)\big)_+,$$
where $(1-z)_+$ is a convex upper bound of the empirical error $\mathbf{1}_{z\le 0}$.
One can also consider the kernel trick: $x_i^Tx_j$ is replaced by $\varphi(x_i)^T\varphi(x_j)$ for some mapping φ,
$$K(x_i,x_j) = \varphi(x_i)^T\varphi(x_j)$$
For instance, $K(a,b) = (a^Tb)^3 = \varphi(a)^T\varphi(b)$ where
$$\varphi(a_1,a_2) = \big(a_1^3,\ \sqrt{3}\,a_1^2a_2,\ \sqrt{3}\,a_1a_2^2,\ a_2^3\big)$$
Consider polynomial kernels $K(a,b) = (1 + a^Tb)^p$, or a Gaussian kernel $K(a,b) = \exp(-(a-b)^T(a-b))$. Solve
$$\max_{\alpha_i\ge 0}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n y_iy_j\alpha_i\alpha_j K(x_i,x_j)$$
[Figures: SVM classifiers on the myocarde data, with a linear kernel, polynomial kernels of degree 2 to 6, and a radial kernel; impact of the cost C and of the tuning parameter γ for the radial kernel]
The radial kernel is formed by taking an infinite sum over polynomial kernels...
$$K(x,y) = \exp\big(-\gamma\|x-y\|^2\big) = \langle\psi(x),\psi(y)\rangle$$
where ψ is some $\mathbb{R}^n\to\mathbb{R}^\infty$ function, since
$$K(x,y) = \exp\big(-\gamma\|x-y\|^2\big) = \underbrace{\exp\big(-\gamma\|x\|^2-\gamma\|y\|^2\big)}_{\text{constant}}\cdot\exp\big(2\gamma\langle x,y\rangle\big)$$
i.e.
$$K(x,y) \propto \exp\big(2\gamma\langle x,y\rangle\big) = \sum_{k=0}^\infty \frac{(2\gamma)^k\,\langle x,y\rangle^k}{k!} = \sum_{k=0}^\infty \frac{(2\gamma)^k}{k!}\,K_k(x,y)$$
where $K_k$ is the polynomial kernel of degree k.
If $K = K_1 + K_2$ with $\psi_j : \mathbb{R}^n\to\mathbb{R}^{d_j}$, then $\psi : \mathbb{R}^n\to\mathbb{R}^d$ with $d \sim d_1 + d_2$.
A kernel is a measure of similarity between vectors: the larger the value of γ, the closer two vectors must be to have a high similarity.
Is there a probabilistic interpretation? Platt (2000, Probabilities for SV Machines) suggested using a logistic function over the SVM scores,
$$p(x) = \frac{\exp[b + \langle x,\omega\rangle]}{1+\exp[b + \langle x,\omega\rangle]}$$
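In e1071, such calibrated probabilities are available via the probability argument (a minimal sketch; the radial kernel is a choice made here):

    fit_svm = e1071::svm(factor(PRONO) ~ ., data = myocarde,
                         kernel = "radial", probability = TRUE)
    head(attr(predict(fit_svm, myocarde, probability = TRUE), "probabilities"))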
Let C denote the cost of misclassification. The optimization problem becomes
$$\min_{\omega\in\mathbb{R}^d,\, b\in\mathbb{R},\, \xi\in\mathbb{R}^n}\ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^n \xi_i$$
subject to $y_i\cdot(\omega^Tx_i + b) \ge 1 - \xi_i$, with $\xi_i \ge 0$, $\forall i = 1,\dots,n$.

    n = length(myocarde[,"PRONO"])
    myocarde0 = myocarde
    myocarde0$PRONO = myocarde$PRONO*2-1
    C = .5
    f = function(param){
      w  = param[1:7]
      b  = param[8]
      xi = param[8+1:nrow(myocarde)]
      .5*sum(w^2) + C*sum(xi)}
    grad_f = function(param){
      w  = param[1:7]
      b  = param[8]
      xi = param[8+1:nrow(myocarde)]
      c(w, 0, rep(C, length(xi)))}   # d/dw of .5*sum(w^2) is w

The (linear) constraints are written as $U\theta - c \ge 0$,

    # named Ui/Ci so that the cost C defined above is not overwritten
    Ui = rbind(cbind(myocarde0[,"PRONO"]*as.matrix(myocarde[,1:7]), diag(n),
                     myocarde0[,"PRONO"]),
               cbind(matrix(0,n,7), diag(n,n), matrix(0,n,1)))
    Ci = c(rep(1,n), rep(0,n))

then use

    constrOptim(theta=p_init, f, grad_f, ui = Ui, ci = Ci)

One can recognize, in the separable case but also in the non-separable case, a classic quadratic program,
$$\min_{z\in\mathbb{R}^d}\ \frac{1}{2}z^TDz - d^Tz\quad\text{subject to } Az \ge b.$$

    library(quadprog)
    eps = 5e-4
    y = myocarde[,"PRONO"]*2-1
    X = as.matrix(cbind(1, myocarde[,1:7]))
    n = length(y)
    D = diag(n+7+1)
    diag(D)[8+0:n] = 0
    d = matrix(c(rep(0,7), 0, rep(C,n)), nrow=n+7+1)
    A = Ui
    b = Ci
Classification : Support Vector Machine (SVM) - Primal Problem
    sol = solve.QP(D+eps*diag(n+7+1), d, t(A), b, meq=1)
    qpsol = sol$solution
    (omega = qpsol[1:7])
    # [1] -0.106642005446 -0.002026198103 -0.022513312261 -0.018958578746
    # [5] -0.023105767847 -0.018958578746 -1.080638988521
    (b = qpsol[n+7+1])
    # [1] 997.6289927

Given an observation x, the prediction is $\hat y = \text{sign}[\omega^Tx + b]$,

    y_pred = 2*((as.matrix(myocarde0[,1:7]) %*% omega + b) > 0) - 1
Classification : Support Vector Machine (SVM) - Dual Problem
The Lagrangian of the separable problem could be written, introducing Lagrange multipliers $\alpha\in\mathbb{R}^n$, $\alpha\ge 0$, as
$$L(\omega,b,\alpha) = \frac{1}{2}\|\omega\|^2 - \sum_{i=1}^n \alpha_i\big[y_i(\omega^Tx_i + b) - 1\big]$$
Somehow, $\alpha_i$ represents the influence of the observation $(y_i,x_i)$. Consider the dual problem, with $G = [G_{ij}]$ and $G_{ij} = y_iy_jx_j^Tx_i$:
$$\min_{\alpha\in\mathbb{R}^n}\ \frac{1}{2}\alpha^TG\alpha - \mathbf{1}^T\alpha\quad\text{subject to } y^T\alpha = 0,\ \alpha\ge 0.$$
The Lagrangian of the non-separable problem could be written introducing Lagrange multipliers $\alpha,\beta\in\mathbb{R}^n$, $\alpha,\beta\ge 0$,
and define the Lagrangian $L(\omega,b,\xi,\alpha,\beta)$ as
$$\frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\big[y_i(\omega^Tx_i + b) - 1 + \xi_i\big] - \sum_{i=1}^n \beta_i\xi_i$$
Somehow, $\alpha_i$ represents the influence of the observation $(y_i,x_i)$. The dual problem becomes, with $G = [G_{ij}]$ and $G_{ij} = y_iy_jx_j^Tx_i$:
$$\min_{\alpha\in\mathbb{R}^n}\ \frac{1}{2}\alpha^TG\alpha - \mathbf{1}^T\alpha\quad\text{subject to } y^T\alpha = 0,\ \alpha\ge 0 \text{ and } \alpha\le C.$$
As previously, one can also use quadratic programming,
    library(quadprog)
    eps = 5e-4
    y = myocarde[,"PRONO"]*2-1
    X = as.matrix(cbind(1, myocarde[,1:7]))
    n = length(y)
    Q = sapply(1:n, function(i) y[i]*t(X)[,i])
    D = t(Q)%*%Q
    d = matrix(1, nrow=n)
    A = rbind(y, diag(n), -diag(n))
    C = .5
    b = c(0, rep(0,n), rep(-C,n))
    sol = solve.QP(D+eps*diag(n), d, t(A), b, meq=1, factorized=FALSE)
    qpsol = sol$solution
The two problems are connected, in the sense that for all x,
$$\omega^Tx + b = \sum_{i=1}^n \alpha_iy_i(x^Tx_i) + b$$
To recover the solution of the primal problem, $\omega = \sum_{i=1}^n \alpha_iy_ix_i$, thus

    omega = apply(qpsol*y*X, 2, sum)
    omega
    #                      1         FRCAR        INCAR           INSYS
    #  0.000000000000000243907  0.0550138658 -0.092016323 0.3609571899422
    #                    PRDIA         PAPUL        PVENT           REPUL
    # -0.109401796528869235669 -0.0485213403 -0.066005864 0.0010093656567

More generally,

    svm.fit = function(X, y, C=NULL) {
      n.samples = nrow(X)
      n.features = ncol(X)
      K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
      for (i in 1:n.samples){
        for (j in 1:n.samples){
          K[i,j] = X[i,] %*% X[j,] }}
To relate this dual optimization problem to OLS, recall that $y = x^T\omega + \varepsilon$, so that $\hat y = x^T\hat\omega$, where $\hat\omega = [X^TX]^{-1}X^Ty$. But one can also write
$$\hat y = x^T\hat\omega = \sum_{i=1}^n \hat\alpha_i\cdot x^Tx_i\quad\text{where } \hat\alpha = X[X^TX]^{-1}\hat\omega,$$
or conversely, $\hat\omega = X^T\hat\alpha$.
Classification : Support Vector Machine (SVM) - Kernel Approach
A positive kernel on $\mathcal{X}$ is a function $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$, symmetric, and such that for any n, $\forall \alpha_1,\dots,\alpha_n$ and $\forall x_1,\dots,x_n$,
$$\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j k(x_i,x_j) \ge 0.$$
For example, the linear kernel is $k(x_i,x_j) = x_i^Tx_j$. That's what we've been using here, so far. One can also define the product kernel $k(x_i,x_j) = \kappa(x_i)\cdot\kappa(x_j)$, where κ is some function $\mathcal{X}\to\mathbb{R}$.
Finally, the Gaussian kernel is $k(x_i,x_j) = \exp[-\|x_i - x_j\|^2]$. Since it is a function of $\|x_i - x_j\|$, it is also called a radial kernel.

    linear.kernel = function(x1, x2) {
      return(x1 %*% x2)
    }
    svm.fit = function(X, y, FUN=linear.kernel, C=NULL) {
      n.samples = nrow(X)
      n.features = ncol(X)
      # Gram matrix, K[i,j] = FUN(x_i, x_j)
      K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
      for (i in 1:n.samples){
        for (j in 1:n.samples){
          K[i,j] = FUN(X[i,], X[j,])
        }
      }
      Dmat = outer(y,y) * K
      Dmat = as.matrix(nearPD(Dmat)$mat)   # requires library(Matrix)
      dvec = rep(1, n.samples)
      Amat = rbind(y, diag(n.samples), -1*diag(n.samples))
      bvec = c(0, rep(0, n.samples), rep(-C, n.samples))
      res = solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)  # requires library(quadprog)
      a = res$solution                  # estimated alpha's
      bomega = apply(a*y*X, 2, sum)     # recover omega from the dual solution
      return(bomega)
    }
Classification : Classification Trees
    library(rpart)
    cart = rpart(PRONO~., data=myocarde)
    library(rpart.plot)
    prp(cart, type=2, extra=1)

A (binary) split is based on one specific variable, say $x_j$, and a cutoff, say s. Then there are two options:
• either $x_{i,j} \le s$, then observation i goes on the left, in $\mathcal{I}_L$
• or $x_{i,j} > s$, then observation i goes on the right, in $\mathcal{I}_R$
Thus, $\mathcal{I} = \mathcal{I}_L \cup \mathcal{I}_R$.
The Gini index for node $\mathcal{I}$ is defined as
$$G(\mathcal{I}) = -\sum_{y\in\{0,1\}} p_y(1-p_y)$$
where $p_y$ is the proportion of individuals of type y in the leaf,
$$G(\mathcal{I}) = -\sum_{y\in\{0,1\}} \frac{n_{y,\mathcal{I}}}{n_{\mathcal{I}}}\left(1 - \frac{n_{y,\mathcal{I}}}{n_{\mathcal{I}}}\right)$$

    gini = function(y, classe){
      T. = table(y, classe)
      nx = apply(T., 2, sum)                       # size of each candidate leaf
      n. = sum(T.)
      pxy = T./matrix(rep(nx, each=2), nrow=2)     # within-leaf proportions
      omega = matrix(rep(nx, each=2), nrow=2)/n.   # leaf weights
      g. = -sum(omega*pxy*(1-pxy))
      return(g.)}
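A minimal usage sketch (the cutoff 19 on INSYS is a hypothetical value, chosen here only for illustration):

    gini(y = myocarde$PRONO, classe = myocarde$INSYS <= 19)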