Hands-On Algorithms for Predictive Modeling*
(Penalized) Logistic Regression, Boosting, Neural Nets, SVM, Trees, Forests
Arthur Charpentier (Université du Québec à Montréal, Canada)
Third International Congress on Actuarial Science and Quantitative Finance
Manizales, Colombia, June 2019
Modeling a 0/1 random variable
Myocardial infarction of patients admitted in the E.R.
◦ heart rate (FRCAR) : x1
◦ heart index (INCAR) : x2
◦ stroke index (INSYS) : x3
◦ diastolic pressure (PRDIA) : x4
◦ pulmonary arterial pressure (PAPUL) : x5
◦ ventricular pressure (PVENT) : x6
◦ lung resistance (REPUL) : x7
◦ death or survival (PRONO) : y
dataset {(xi, yi)} where i = 1, . . . , n, n = 73.
myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
                      header = TRUE, sep = ";")
head(myocarde)
  FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL  PRONO
1    90  1.71  19.0    16  19.5  16.0   912 SURVIE
2    90  1.68  18.7    24  31.0  14.0  1476  DECES
3   120  1.40  11.7    23  29.0   8.0  1657  DECES
4    82  1.79  21.8    14  17.5  10.0   782 SURVIE
5    80  1.58  19.7    21  28.0  18.5  1418  DECES
6    80  1.13  14.1    18  23.5   9.0  1664  DECES
myocarde$PRONO = (myocarde$PRONO == "SURVIE")*1
Classification : Logistic Regression (and GLM)
Simulated data, y ∈ {0, 1}
P(Y = 1|X = x) = exp[x^T β] / (1 + exp[x^T β])

Inference using maximum likelihood techniques,

β̂ = argmax_β { Σ_{i=1}^n log P(Y = y_i|X = x_i) }

and the score model is then

s(x) = exp[x^T β̂] / (1 + exp[x^T β̂])
x1 = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
x2 = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
y  = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x1, x2=x2, y=as.factor(y))
[Figure: scatterplot of the simulated data (x1, x2) on [0,1]², colored by class y]
Classification : Logistic Regression (and GLM)
The logistic regression is based on the assumption that

Y|X = x ~ B(p_x), where p_x = exp[x^T β] / (1 + exp[x^T β]) = (1 + exp[−x^T β])^{−1}
The goal is to estimate parameter β.
Recall the heuristic behind using this function for the probability:

log[odds(Y = 1)] = log( P[Y = 1] / P[Y = 0] ) = x^T β
• Maximum of the (log)-likelihood function
The log-likelihood is here

log L = Σ_{i=1}^n { y_i log p_i + (1 − y_i) log(1 − p_i) }

where p_i = (1 + exp[−x_i^T β])^{−1}.
Mathematical Statistics Courses in a Nutshell
Consider observations {y_1, ..., y_n} from iid random variables Y_i ~ F_θ (with "density" f_θ). The likelihood is

L(θ; y) = Π_{i=1}^n f_θ(y_i)

The maximum likelihood estimate is

θ̂^mle ∈ argmax_{θ∈Θ} { L(θ; y) }  or  θ̂^mle ∈ argmax_{θ∈Θ} { log L(θ; y) }

Fisher (1912, On an absolute criterion for fitting frequency curves).
Under standard assumptions (identification of the model, compactness, continuity and dominance), the maximum likelihood estimator is consistent, θ̂^mle →^P θ. With additional assumptions, it can be shown that the maximum likelihood estimator is asymptotically normal,

√n (θ̂^mle − θ) →^L N(0, I^{−1})

where I is the Fisher information matrix (i.e. θ̂^mle is asymptotically efficient).

E.g. if Y ~ N(θ, σ²), log L ∝ −Σ_{i=1}^n ((y_i − θ)/σ)², and θ̂^mle = ȳ (see also the method of moments). Here

max log L ⟺ min Σ_{i=1}^n ε_i²  (least squares)
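As a quick sanity check (a minimal sketch with simulated data, σ assumed known), the numerical maximum likelihood estimate coincides with the sample mean:

set.seed(1)
y0 = rnorm(50, mean=2, sd=1)
negLogLik0 = function(theta) -sum(dnorm(y0, mean=theta, sd=1, log=TRUE))
optimize(negLogLik0, c(-10,10))$minimum   # numerical mle
mean(y0)                                  # closed form, same value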
Classification : Logistic Regression
• Numerical Optimization
Using optim() might not be a great idea.
The BFGS algorithm (Broyden, Fletcher, Goldfarb & Shanno) is based on Newton's algorithm. To find a zero x_0 of a function g, i.e. g(x_0) = 0, use Taylor's expansion,

0 = g(x_0) ≈ g(x) + g′(x)(x_0 − x)

i.e. (with x → x_old and x_0 → x_new)

x_new = x_old − g(x_old)/g′(x_old), from some starting point,

or, in higher dimension,

x_new = x_old − ∇g(x_old)^{−1} · g(x_old), from some starting point.

Here g is the gradient of the log-likelihood, so

β_new = β_old − ( ∂² log L(β_old)/∂β∂β^T )^{−1} · ( ∂ log L(β_old)/∂β )
Numerically, since only g is given, consider some small h > 0, approximate the derivative by a central difference, and set

x_new = x_old − 2h · g(x_old) / ( g(x_old + h) − g(x_old − h) ), from some starting point.
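Before applying it to the logistic likelihood, a minimal one-dimensional illustration of this finite-difference Newton update (the function g and the starting point are chosen here only for the example):

g = function(x) x^2 - 2      # we look for a zero of g, i.e. sqrt(2)
x = 3; h = 1e-5
for(s in 1:20) x = x - 2*h*g(x)/(g(x+h) - g(x-h))
x                            # converges to 1.414214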
y = myocarde$PRONO
X = cbind(1, as.matrix(myocarde[,1:7]))
negLogLik = function(beta){
  -sum(-y*log(1 + exp(-(X%*%beta))) - (1-y)*log(1 + exp(X%*%beta)))
}
beta_init = lm(PRONO~., data=myocarde)$coefficients
logistic_opt = optim(par = beta_init, negLogLik, hessian = TRUE,
                     method = "BFGS", control = list(abstol = 1e-9))
Here, we obtain
logistic_opt$par
 (Intercept)        FRCAR        INCAR        INSYS
 1.656926397  0.045234029 -2.119441743  0.204023835
       PRDIA        PAPUL        PVENT        REPUL
-0.102420095  0.165823647 -0.081047525 -0.005992238
simu = function(i){
  logistic_opt_i = optim(par = rnorm(8,0,3)*beta_init,
                         negLogLik, hessian = TRUE, method = "BFGS",
                         control = list(abstol = 1e-9))
  logistic_opt_i$par[2:3]
}
v_beta = t(Vectorize(simu)(1:1000))
plot(v_beta)
par(mfrow = c(1,2))
hist(v_beta[,1], xlab = names(myocarde)[1])
hist(v_beta[,2], xlab = names(myocarde)[2])
• Fisher Algorithm
We want to use Newton's algorithm,

β_new = β_old − ( ∂² log L(β_old)/∂β∂β^T )^{−1} · ( ∂ log L(β_old)/∂β )

but here, the gradient and the Hessian matrix have explicit expressions:

∂ log L(β_old)/∂β = X^T (y − p_old)

∂² log L(β_old)/∂β∂β^T = −X^T Δ_old X

where Δ_old is the diagonal matrix with entries p_{old,i}(1 − p_{old,i}).
Note that it is fast and robust (to the starting point)
Y = myocarde$PRONO
X = cbind(1, as.matrix(myocarde[,1:7]))
colnames(X) = c("Inter", names(myocarde[,1:7]))
beta = as.matrix(lm(Y~0+X)$coefficients, ncol=1)
for(s in 1:9){
  pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
  gradient = t(X)%*%(Y-pi)
  omega = matrix(0, nrow(X), nrow(X)); diag(omega) = (pi*(1-pi))
  Hessian = -t(X)%*%omega%*%X
  beta = cbind(beta, beta[,s]-solve(Hessian)%*%gradient)}
beta[,8:10]
                [,1]          [,2]          [,3]
XInter -10.187641685 -10.187641696 -10.187641696
XFRCAR   0.138178119   0.138178119   0.138178119
XINCAR  -5.862429035  -5.862429037  -5.862429037
XINSYS   0.717084018   0.717084018   0.717084018
XPRDIA  -0.073668171  -0.073668171  -0.073668171
XPAPUL   0.016756506   0.016756506   0.016756506
XPVENT  -0.106776012  -0.106776012  -0.106776012
XREPUL  -0.003154187  -0.003154187  -0.003154187
• Iteratively Reweighted Least Squares

We want to compute something like

β_new = (X^T Δ_old X)^{−1} X^T Δ_old z

(if we substitute matrices in the analytical expressions) where

z = X β_old + Δ_old^{−1} (y − p_old)

But actually, that is simply a standard weighted least-squares problem,

β_new = argmin { (z − Xβ)^T Δ_old (z − Xβ) }

The only problem here is that the weights Δ_old are functions of the unknown β_old. But if we keep iterating, we should be able to solve it: given β we get the weights, and with the weights we can use weighted OLS to get an updated β. That is the idea of iteratively reweighted least squares.
df = myocarde
beta_init = lm(PRONO~., data=df)$coefficients
X = cbind(1, as.matrix(myocarde[,1:7]))
beta = beta_init
for(s in 1:1000){
  p = exp(X %*% beta) / (1+exp(X %*% beta))
  omega = diag(nrow(df))
  diag(omega) = (p*(1-p))
  df$Z = X %*% beta + solve(omega) %*% (df$PRONO - p)
  beta = lm(Z~., data=df[,-8], weights=diag(omega))$coefficients
}
beta
 (Intercept)         FRCAR         INCAR         INSYS         PRDIA
-10.187641696   0.138178119  -5.862429037   0.717084018  -0.073668171
        PAPUL         PVENT         REPUL
  0.016756506  -0.106776012  -0.003154187
Classification : Logistic Regression
summary(lm(Z~., data=df[,-8], weights=diag(omega)))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.187642  10.668138  -0.955    0.343
FRCAR         0.138178   0.102340   1.350    0.182
INCAR        -5.862429   6.052560  -0.969    0.336
INSYS         0.717084   0.503527   1.424    0.159
PRDIA        -0.073668   0.261549  -0.282    0.779
PAPUL         0.016757   0.306666   0.055    0.957
PVENT        -0.106776   0.099145  -1.077    0.286
REPUL        -0.003154   0.004386  -0.719    0.475
Compare with R's built-in glm():

summary(glm(PRONO~., data=myocarde, family=binomial(link="logit")))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.187642  11.895227  -0.856    0.392
FRCAR         0.138178   0.114112   1.211    0.226
INCAR        -5.862429   6.748785  -0.869    0.385
INSYS         0.717084   0.561445   1.277    0.202
PRDIA        -0.073668   0.291636  -0.253    0.801
PAPUL         0.016757   0.341942   0.049    0.961
PVENT        -0.106776   0.110550  -0.966    0.334
REPUL        -0.003154   0.004891  -0.645    0.519
Classification : Logistic Regression with Splines (and GAM)
To get home-made linear splines, use β1·x + β2·(x − s)_+ :

pos = function(x,s) (x-s)*(x>=s)
reg = glm(PRONO~INSYS+pos(INSYS,15)+pos(INSYS,25), data=myocarde,
          family=binomial)
summary(reg)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.1109      3.278  -0.034    0.973
INSYS          -0.1751      0.252  -0.693    0.488
pos(INSYS,15)   0.7900      0.374   2.109    0.034 *
pos(INSYS,25)  -0.5797      0.290  -1.997    0.045 *
Classification : Logistic Regression with Splines
We can define spline functions with support (5, 55) and with knots {15, 25}
library(splines)
clr6 = c("#1b9e77","#d95f02","#7570b3","#e7298a","#66a61e","#e6ab02")
x = seq(0, 60, by=.25)
B = bs(x, knots=c(15,25), Boundary.knots=c(5,55), degree=1)
matplot(x, B, type="l", lty=1, lwd=2, col=clr6)
Classification : Logistic Regression with Splines
reg = glm(PRONO~bs(INSYS, knots=c(15,25),
                   Boundary.knots=c(5,55), degree=1),
          data=myocarde, family=binomial)
summary(reg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9863      2.055  -0.480    0.631
bs(INSYS)1   -1.7507      2.526  -0.693    0.488
bs(INSYS)2    4.3989      2.061   2.133    0.032 *
bs(INSYS)3    5.4572      5.414   1.008    0.313
Classification : Logistic Regression with Splines
For quadratic splines, consider a decomposition on x, x², (x − s1)²_+, (x − s2)²_+

pos2 = function(x,s) (x-s)^2*(x>=s)
reg = glm(PRONO~poly(INSYS,2)+pos2(INSYS,15)+pos2(INSYS,25),
          data=myocarde, family=binomial)
summary(reg)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      29.9842     15.236   1.968    0.049 *
poly(INSYS,2)1  408.7851    202.419   2.019    0.043 *
poly(INSYS,2)2  199.1628    101.589   1.960    0.049 *
pos2(INSYS,15)   -0.2281      0.126  -1.805    0.071 .
pos2(INSYS,25)    0.0439      0.080   0.545    0.585
Classification : Logistic Regression with Splines
x = seq(0, 60, by=.25)
B = bs(x, knots=c(15,25), Boundary.knots=c(5,55), degree=2)
matplot(x, B, type="l", xlab="INSYS", col=clr6)
reg = glm(PRONO~bs(INSYS, knots=c(15,25),
                   Boundary.knots=c(5,55), degree=2),
          data=myocarde, family=binomial)
summary(reg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    7.186      5.261   1.366    0.172
bs(INSYS)1   -14.656      7.923  -1.850    0.064 .
bs(INSYS)2    -5.692      4.638  -1.227    0.219
bs(INSYS)3    -2.454      8.780  -0.279    0.779
bs(INSYS)4     6.429     41.675   0.154    0.877
Finally, for cubic splines, use x, x², x³, (x − s1)³_+, (x − s2)³_+, etc.
B = bs(x, knots=c(15,25), Boundary.knots=c(5,55), degree=3)
matplot(x, B, type="l", lwd=2, col=clr6, lty=1, ylim=c(-.2,1.2))
abline(v=c(5,15,25,55), lty=2)

reg = glm(PRONO~1+bs(INSYS, degree=1, df=4), data=myocarde, family=binomial)
attr(reg$terms, "predvars")[[3]]
bs(INSYS, degree = 1L, knots = c(15.8, 21.4, 27.15),
   Boundary.knots = c(8.7, 54), intercept = FALSE)
quantile(myocarde$INSYS, (0:4)/4)
   0%   25%   50%   75%  100%
 8.70 15.80 21.40 27.15 54.00
B = bs(x, degree=1, df=4)
B = cbind(1, B)
y = B %*% coefficients(reg)
plot(x, y, type="l", col="red", lwd=2)
abline(v=quantile(myocarde$INSYS, (0:4)/4), lty=2)
Classification : Logistic Regression with Splines
reg = glm(y~bs(x1, degree=1, df=3)+bs(x2, degree=1, df=3),
          data=df, family=binomial(link="logit"))
u = seq(0, 1, length=101)
p = function(x,y) predict.glm(reg, newdata=data.frame(x1=x, x2=y),
                              type="response")
v = outer(u, u, p)
image(u, u, v, xlab="Variable 1", ylab="Variable 2",
      col=clr10, breaks=(0:10)/10)
points(df$x1, df$x2, pch=19, cex=1.5, col="white")
points(df$x1, df$x2, pch=c(1,19)[1+(df$y=="1")], cex=1.5)
contour(u, u, v, levels=.5, add=TRUE)
Classification : Regression with Kernels and k-NN
We do not use the probit framework here; the approach is purely non-parametric.
The moving regressogram is

m̂(x) = Σ_{i=1}^n y_i · 1(x_i ∈ [x ± h]) / Σ_{i=1}^n 1(x_i ∈ [x ± h])

Consider the kernel version

m̃(x) = Σ_{i=1}^n y_i · k_h(x − x_i) / Σ_{i=1}^n k_h(x − x_i)

where (classically), for some kernel k,

k_h(x) = k(x/h)
mean_x = function(x, bw){
  w = dnorm((myocarde$INSYS-x)/bw, mean=0, sd=1)
  weighted.mean(myocarde$PRONO, w)}
u = seq(5, 55, length=201)
v = Vectorize(function(x) mean_x(x,3))(u)
plot(u, v, ylim=0:1, type="l", col="red")
points(myocarde$INSYS, myocarde$PRONO, pch=19)

v = Vectorize(function(x) mean_x(x,2))(u)
plot(u, v, ylim=0:1, type="l", col="red")
points(myocarde$INSYS, myocarde$PRONO, pch=19)
Classification : Regression with Kernels and k-NN
One can also use the ksmooth() function of R,
reg = ksmooth(myocarde$INSYS, myocarde$PRONO, "normal", bandwidth=2*exp(1))
plot(reg$x, reg$y, ylim=0:1, type="l", col="red", lwd=2,
     xlab="INSYS", ylab="")
points(myocarde$INSYS, myocarde$PRONO, pch=19)
Classical bias / variance tradeoff (below, the sampling distributions of m̂_h(15) and m̂_h(25))
[Figure: kernel regression of PRONO on INSYS, with histograms of the sampling distributions of m̂_h(15) and m̂_h(25)]
Classification : Regression with Kernels and k-NN
u = seq(0, 1, length=101)
p = function(x,y){
  bw1 = .2; bw2 = .2
  w = dnorm((df$x1-x)/bw1, mean=0, sd=1)*
      dnorm((df$x2-y)/bw2, mean=0, sd=1)
  weighted.mean(df$y=="1", w)
}
v = outer(u, u, Vectorize(p))
image(u, u, v, col=clr10, breaks=(0:10)/10)
points(df$x1, df$x2, pch=19, cex=1.5, col="white")
points(df$x1, df$x2, pch=c(1,19)[1+(df$y=="1")], cex=1.5)
contour(u, u, v, levels=.5, add=TRUE)
Classification : Regression with Kernels and k-NN
One can also use k-nearest neighbors,

m̃_k(x) = (1/n) Σ_{i=1}^n ω_{i,k}(x) y_i

where ω_{i,k}(x) = n/k if i ∈ I_x^k, with

I_x^k = {i : x_i is one of the k nearest observations to x}

Classically, use the Mahalanobis distance.
Sigma = var(myocarde[,1:7])
Sigma_Inv = solve(Sigma)
d2_mahalanobis = function(x,y,Sinv){as.numeric(x-y)%*%Sinv%*%t(x-y)}
k_closest = function(i,k){
  vect_dist = function(j) d2_mahalanobis(myocarde[i,1:7],
                                         myocarde[j,1:7], Sigma_Inv)
  vect = Vectorize(vect_dist)((1:nrow(myocarde)))
  which(rank(vect) <= k)}
k_majority = function(k){
  Y = rep(NA, nrow(myocarde))
  for(i in 1:length(Y)) Y[i] = sort(myocarde$PRONO[k_closest(i,k)])[(k+1)/2]
  return(Y)}

k_mean = function(k){
  Y = rep(NA, nrow(myocarde))
  for(i in 1:length(Y)) Y[i] = mean(myocarde$PRONO[k_closest(i,k)])
  return(Y)}
Sigma_Inv = solve(var(df[,c("x1","x2")]))
u = seq(0, 1, length=51)
p = function(x,y){
  k = 6
  vect_dist = function(j) d2_mahalanobis(c(x,y), df[j,c("x1","x2")],
                                         Sigma_Inv)
  vect = Vectorize(vect_dist)(1:nrow(df))
  idx = which(rank(vect) <= k)
  return(mean((df$y==1)[idx]))}
v = outer(u, u, Vectorize(p))
image(u, u, v, col=clr10, breaks=(0:10)/10)
points(df$x1, df$x2, pch=19, cex=1.5, col="white")
points(df$x1, df$x2, pch=c(1,19)[1+(df$y=="1")], cex=1.5)
contour(u, u, v, levels=.5, add=TRUE)
Goodness of Fit: ROC Curve
Confusion matrix

Given a sample (y_i, x_i) and a model m, the confusion matrix is the contingency table, with dimensions observed y_i ∈ {0,1} and predicted ŷ_i ∈ {0,1}:

            observed y
             0 (-)   1 (+)
ŷ = 0 (-)     TN      FN
ŷ = 1 (+)     FP      TP

FPR = FP/(TN+FP),  TPR = TP/(FN+TP)

Classical measures are
◦ true positive rate (TPR), or sensitivity
◦ false positive rate (FPR), or fall-out
◦ true negative rate (TNR), or specificity, with TNR = 1 − FPR
among others (see Wikipedia).

ROCR::performance(prediction(Score,Y), "tpr", "fpr")
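For instance, the confusion matrix and the two rates at a threshold t can also be computed by hand (a minimal sketch, assuming the logistic model fitted earlier on the myocarde data):

reg = glm(PRONO~., data=myocarde, family=binomial)
S = predict(reg, type="response")    # score s(x)
Y = myocarde$PRONO
t = .5
Yhat = (S > t)*1
table(Yhat, Y)                       # confusion matrix at threshold t
FPR = sum((Yhat==1)&(Y==0))/sum(Y==0)
TPR = sum((Yhat==1)&(Y==1))/sum(Y==1)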
Goodness of Fit: ROC Curve
ROC (Receiver Operating Characteristic) Curve

Assume that m_t is defined from a score function s, with m_t(x) = 1(s(x) > t) for some threshold t. The ROC curve is the curve (FPR_t, TPR_t) obtained from the confusion matrices of the m_t's.
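With the score S and labels Y from the previous sketch, the whole curve is obtained by letting t vary over a grid:

roc_point = function(t){
  Yhat = (S > t)*1
  c(FPR = sum((Yhat==1)&(Y==0))/sum(Y==0),
    TPR = sum((Yhat==1)&(Y==1))/sum(Y==1))}
ROC = t(Vectorize(roc_point)(seq(0, 1, by=.01)))
plot(ROC[,1], ROC[,2], type="s",
     xlab="FALSE POSITIVE RATE", ylab="TRUE POSITIVE RATE")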
Example: n = 100 individuals, 50 with y_i = 0 and 50 with y_i = 1 (well balanced).
[Figure: with a non-informative score, all four cells of the confusion matrix equal 25, so FPR = 25/50 ≈ 0.5 and TPR = 25/50 ≈ 0.5, a point on the diagonal of the ROC plot]
Goodness of Fit: ROC Curve
29 deaths (y = 0) and 42 survivals (y = 1). With threshold 0%,

ŷ = 1 if P[Y=1|X] > 0%,  ŷ = 0 if P[Y=1|X] ≤ 0%

         observed y
          0    1
ŷ = 0     0    0
ŷ = 1    29   42

FPR = 29/(29+0) = 100%,  TPR = 42/(42+0) = 100%
With threshold 15%,

ŷ = 1 if P[Y=1|X] > 15%,  ŷ = 0 if P[Y=1|X] ≤ 15%

         observed y
          0    1
ŷ = 0    17    2
ŷ = 1    12   40

FPR = 12/(17+12) ≈ 41.4%,  TPR = 40/(40+2) ≈ 95.2%
With threshold 50%,

ŷ = 1 if P[Y=1|X] > 50%,  ŷ = 0 if P[Y=1|X] ≤ 50%

         observed y
          0    1
ŷ = 0    25    3
ŷ = 1     4   39

FPR = 4/(25+4) ≈ 13.8%,  TPR = 39/(39+3) ≈ 92.8%
With threshold 85%,

ŷ = 1 if P[Y=1|X] > 85%,  ŷ = 0 if P[Y=1|X] ≤ 85%

         observed y
          0    1
ŷ = 0    28   13
ŷ = 1     1   29

FPR = 1/(28+1) ≈ 3.4%,  TPR = 29/(29+13) ≈ 69.0%
With threshold 100%,

ŷ = 1 if P[Y=1|X] > 100%,  ŷ = 0 if P[Y=1|X] ≤ 100%

         observed y
          0    1
ŷ = 0    29   42
ŷ = 1     0    0

FPR = 0/(29+0) = 0.0%,  TPR = 0/(42+0) = 0.0%
Goodness of Fit
Be careful of overfitting... It is important to use training / validation samples, e.g. with
• a classification tree
• a boosting classifier
Goodness of Fit: ROC Curve
ROC (Receiver Operating Characteristic) Curve: again, with m_t(x) = 1(s(x) > t) for some threshold t, the ROC curve is the set of points (FPR_t, TPR_t) obtained from the confusion matrices of the m_t's, illustrated below.
[Figure: observed classes Y = 0/1 plotted against the score s(x), and the corresponding ROC curve (true positive rate against false positive rate)]
Goodness of Fit: ROC Curve
With categorical variables, we only have a collection of points. One needs to interpolate between those points to have a curve; see the convexification of ROC curves, e.g. Flach (2012, Machine Learning).
[Figure: ROC points for a discrete score (false spam rate vs. true spam rate), before and after convexification]
Goodness of Fit
One can derive a confidence interval for the ROC curve using bootstrap techniques, pROC::ci.se(). Various measures can be used (see the hmeasure package), such as the Gini index, the area under the curve, or the Kolmogorov-Smirnov statistic

KS = sup_t |F_1(t) − F_0(t)|

estimated by

KS = sup_t | (1/n_1) Σ_{i: y_i=1} 1(s(x_i) ≤ t) − (1/n_0) Σ_{i: y_i=0} 1(s(x_i) ≤ t) |
[Figure: ROC curve (sensitivity against specificity) and empirical cdfs of the score in the survival and death groups; the KS statistic is the maximal gap between the two cdfs]
Goodness of Fit
The area under the curve, AUC, can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one; see Swets, Dawes & Monahan (2000, Psychological Science Can Improve Diagnostic Decisions).
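That probabilistic interpretation suggests a direct estimate (a sketch reusing the score S and labels Y computed on the myocarde data earlier, ties counted for one half):

mean(outer(S[Y==1], S[Y==0], function(u,v) (u>v) + .5*(u==v)))  # AUC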
The kappa statistic κ compares an observed accuracy with an expected accuracy (random chance); see Landis & Koch (1977, The Measurement of Observer Agreement for Categorical Data).
        Y = 0   Y = 1
Ŷ = 0    TN      FN     TN+FN
Ŷ = 1    FP      TP     FP+TP
         TN+FP   FN+TP    n

See also the observed and random confusion tables:

observed  Y = 0   Y = 1          random  Y = 0   Y = 1
Ŷ = 0       25       3    28     Ŷ = 0   11.44   16.56   28
Ŷ = 1        4      39    43     Ŷ = 1   17.56   25.44   43
            29      42    71              29      42     71
Goodness of Fit
With the observed and random confusion tables above,

total accuracy = (TP + TN)/n ≈ 90.14%

random accuracy = ( [TN+FN]·[TN+FP] + [FP+TP]·[FN+TP] ) / n² ≈ 51.93%

κ = (total accuracy − random accuracy)/(1 − random accuracy) ≈ 79.48%
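The computation can be replicated in a few lines (a sketch, with the numbers hard-coded from the table above):

TN=25; FN=3; FP=4; TP=39; n=TN+FN+FP+TP
total_acc  = (TP+TN)/n                                # 0.9014
random_acc = ((TN+FN)*(TN+FP) + (FP+TP)*(FN+TP))/n^2  # 0.5193
(total_acc - random_acc)/(1 - random_acc)             # kappa, 0.7948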
Penalized Estimation
Consider a parametric model, with true (unknown) parameter θ; then

mse(θ̂) = E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²,  i.e. variance + bias².

One can think of a shrinkage of an unbiased estimator: let θ̃ denote an unbiased estimator of θ. Then

θ̂ = θ²/(θ² + mse(θ̃)) · θ̃ = θ̃ − [ mse(θ̃)/(θ² + mse(θ̃)) ] · θ̃,  the last term acting as a penalty,

satisfies mse(θ̂) ≤ mse(θ̃).
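A small simulation illustrating the inequality (a sketch, with an assumed true θ = 2 and mse(θ̃) = 1):

theta = 2
theta_tilde = rnorm(1e5, mean=theta, sd=1)     # unbiased, mse = 1
theta_hat = theta^2/(theta^2+1)*theta_tilde    # shrunk version
c(mean((theta_tilde-theta)^2),
  mean((theta_hat-theta)^2))                   # ~1 vs ~0.8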
Linear Regression Shortcoming
Least squares estimator β̂ = (X^T X)^{−1} X^T y

Unbiased estimator, E[β̂] = β

Variance Var[β̂] = σ²(X^T X)^{−1}, which can be (extremely) large when det(X^T X) ∼ 0. E.g. with

X = [ 1 −1  2 ;  1  0  1 ;  1  2 −1 ;  1  1  0 ]

X^T X = [ 4 2 2 ; 2 6 −4 ; 2 −4 6 ],  with eigenvalues {10, 6, 0},

while

X^T X + I = [ 5 2 2 ; 2 7 −4 ; 2 −4 7 ],  with eigenvalues {11, 7, 1}.

Ad-hoc strategy: use X^T X + λI.
Linear Regression Shortcoming
Evolution of (β1, β2) ↦ Σ_{i=1}^n [y_i − (β1 x_{1,i} + β2 x_{2,i})]² when cor(X1, X2) = r ∈ [0,1], on top.

Below, ridge regression,

(β1, β2) ↦ Σ_{i=1}^n [y_i − (β1 x_{1,i} + β2 x_{2,i})]² + λ(β1² + β2²)

where λ ∈ [0, ∞), when cor(X1, X2) ∼ 1 (collinearity).
[Figure: least-squares and ridge objective contours in the (β1, β2) plane]
Normalization : Euclidean ℓ2 vs. Mahalanobis
We want to penalize complicated models: if βk is "too small", we prefer to have βk = 0.

Instead of the Euclidean distance d(x, y)² = (x − y)^T (x − y), use the Mahalanobis distance d_Σ(x, y)² = (x − y)^T Σ^{−1} (x − y).

[Figure: objective contours in the (β1, β2) plane, with Euclidean and Mahalanobis geometries]
Ridge Regression
... like least squares, but it shrinks the estimated coefficients towards 0:

β̂_λ^ridge = argmin { Σ_{i=1}^n (y_i − x_i^T β)² + λ Σ_{j=1}^p β_j² }

i.e.

β̂_λ^ridge = argmin { ‖y − Xβ‖₂² (criteria) + λ‖β‖₂² (penalty) }

λ ≥ 0 is a tuning parameter. The constant is usually unpenalized; the true equation is

β̂_λ^ridge = argmin { ‖y − (β0 + Xβ)‖₂² (criteria) + λ‖β‖₂² (penalty) }
Ridge Regression
β̂_λ^ridge = argmin { ‖y − (β0 + Xβ)‖₂² + λ‖β‖₂² }

can be seen as a constrained optimization problem,

β̂_λ^ridge = argmin_{‖β‖₂² ≤ h_λ} { ‖y − (β0 + Xβ)‖₂² }

Explicit solution:

β̂_λ = (X^T X + λI)^{−1} X^T y

If λ → 0, β̂_0^ridge = β̂^ols; if λ → ∞, β̂_∞^ridge = 0.
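The explicit solution makes the ridge path easy to compute by hand (a sketch on the standardized myocarde covariates, treating PRONO as a numerical response):

Xs = scale(as.matrix(myocarde[,1:7]))
ys = myocarde$PRONO - mean(myocarde$PRONO)
ridge_beta = function(lambda)
  solve(t(Xs)%*%Xs + lambda*diag(ncol(Xs)), t(Xs)%*%ys)
cbind(ridge_beta(.001), ridge_beta(1000))  # ~OLS vs ~0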
[Figure: ridge constraint ball ‖β‖₂² ≤ h and objective contours, for two values of λ]
Ridge Regression
This penalty can be seen as rather unfair if the components of x are not expressed on the same scale, so
• center: x̄_j = 0, then β̂0 = ȳ
• scale: x_j^T x_j = 1

Then compute

β̂_λ^ridge = argmin { ‖y − Xβ‖₂² (loss) + λ‖β‖₂² (penalty) }
[Figure: ridge objective contours before and after standardization of the covariates]
Ridge Regression
Observe that if the covariates are orthonormal (x_{j1} ⊥ x_{j2}, with unit norms), then

β̂_λ^ridge = [1 + λ]^{−1} β̂^ols

which explains the relationship with shrinkage. But generally, it is not the case...

Theorem. There exists λ such that mse[β̂_λ^ridge] ≤ mse[β̂^ols].
Ridge Regression
L_λ(β) = Σ_{i=1}^n (y_i − β0 − x_i^T β)² + λ Σ_{j=1}^p β_j²

∂L_λ(β)/∂β = −2X^T y + 2(X^T X + λI)β

∂²L_λ(β)/∂β∂β^T = 2(X^T X + λI)

where X^T X is a positive semi-definite matrix and λI is a positive definite matrix, so

β̂_λ = (X^T X + λI)^{−1} X^T y
The Bayesian Interpretation
From a Bayesian perspective,

P[θ|y] (posterior) ∝ P[y|θ] (likelihood) · P[θ] (prior)

i.e.

log P[θ|y] = log P[y|θ] (log-likelihood) + log P[θ] (penalty)

If β has a N(0, τ²I) prior distribution, then its posterior distribution has mean

E[β|y, X] = ( X^T X + (σ²/τ²) I )^{−1} X^T y.
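A quick numerical check of this formula (a sketch with simulated data, σ² = 1 assumed known): with a diffuse prior the posterior mean is close to OLS, while a tight prior shrinks it towards 0.

set.seed(123)
xb = rnorm(50); yb = 1.5*xb + rnorm(50)
Xb = cbind(xb)
post_mean = function(tau2, sigma2=1)
  solve(t(Xb)%*%Xb + (sigma2/tau2)*diag(1), t(Xb)%*%yb)
c(post_mean(100), post_mean(.01))   # ~1.5 (OLS-like) vs ~0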
Properties of the Ridge Estimator
β̂_λ = (X^T X + λI)^{−1} X^T y

E[β̂_λ] = X^T X(λI + X^T X)^{−1} β

i.e. E[β̂_λ] ≠ β: the ridge estimator is biased. Observe that E[β̂_λ] → 0 as λ → ∞.

Assume that X is an orthogonal design matrix, i.e. X^T X = I; then

β̂_λ = (1 + λ)^{−1} β̂^ols.
Properties of the Ridge Estimator
Set W_λ = (I + λ[X^T X]^{−1})^{−1}. One can prove that

W_λ β̂^ols = β̂_λ

Thus,

Var[β̂_λ] = W_λ Var[β̂^ols] W_λ^T

and

Var[β̂_λ] = σ²(X^T X + λI)^{−1} X^T X [(X^T X + λI)^{−1}]^T.

Observe that

Var[β̂^ols] − Var[β̂_λ] = σ² W_λ [2λ(X^T X)^{−2} + λ²(X^T X)^{−3}] W_λ^T ≥ 0.
Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than that of the OLS.

If X is an orthogonal design matrix,

Var[β̂_λ] = σ²(1 + λ)^{−2} I.

In general,

mse[β̂_λ] = σ² trace( W_λ(X^T X)^{−1} W_λ^T ) + β^T (W_λ − I)^T (W_λ − I) β.

If X is an orthogonal design matrix,

mse[β̂_λ] = pσ²/(1 + λ)² + λ²/(1 + λ)² · β^T β
Properties of the Ridge Estimator
mse[β̂_λ] = pσ²/(1 + λ)² + λ²/(1 + λ)² · β^T β

is minimal for

λ* = pσ² / (β^T β)

Note that there exists λ > 0 such that mse[β̂_λ] < mse[β̂_0] = mse[β̂^ols].
Sparsity Issues
In several applications, k can be (very) large, but a lot of features are just noise: βj = 0 for many j's. Let s denote the number of relevant features, with s ≪ k, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),

s = card{S} where S = {j; βj ≠ 0}

The model is now y = X_S^T β_S + ε, where X_S^T X_S is a full-rank matrix.
Going further on sparsity issues
The ridge regression problem was to solve

β̂ = argmin_{β: ‖β‖₂ ≤ s} { ‖Y − X^T β‖₂² }

Define ‖a‖₀ = Σ 1(|a_i| > 0); here dim(β) = k but ‖β‖₀ = s. We wish we could solve

β̂ = argmin_{β: ‖β‖₀ = s} { ‖Y − X^T β‖₂² }

Problem: it is usually not possible to explore all possible constraints, since (k choose s) sets of coefficients should be considered here (with k (very) large).
Going further on sparsity issues
In a convex problem, one can solve the dual problem; e.g., in ridge regression, the primal problem is

min_{β: ‖β‖₂ ≤ s} { ‖Y − X^T β‖₂² }

and the dual problem is

min_{β: ‖Y − X^T β‖₂ ≤ t} { ‖β‖₂² }
Going further on sparsity issues
Idea: solve the dual problem

β̂ = argmin_{β: ‖Y − X^T β‖₂ ≤ h} { ‖β‖₀ }

where we might convexify the ℓ0 norm, ‖·‖₀.
Going further on sparsity issues
On [−1, +1]^k, the convex hull of ‖β‖₀ is ‖β‖₁. On [−a, +a]^k, the convex hull of ‖β‖₀ is a^{−1}‖β‖₁. Hence, why not solve

β̂ = argmin_{β: ‖β‖₁ ≤ s̃} { ‖Y − X^T β‖₂ }

which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem

β̂ = argmin { ‖Y − X^T β‖₂² + λ‖β‖₁ }
lasso: Least Absolute Shrinkage and Selection Operator
β̂ ∈ argmin { ‖Y − X^T β‖₂² + λ‖β‖₁ }

is a convex problem (several algorithms exist), but not strictly convex (no unicity of the minimum). Nevertheless, predictions ŷ = x^T β̂ are unique. Algorithms: MM (minimize by majorization) and coordinate descent; see Hunter & Lange (2003, A Tutorial on MM Algorithms).
lasso Regression
No explicit solution... If λ → 0, β̂_0^lasso = β̂^ols; if λ → ∞, β̂_∞^lasso = 0.
lasso Regression
For some λ, there are k's such that β̂_{k,λ}^lasso = 0. Further, λ ↦ β̂_{k,λ}^lasso is piecewise linear.
lasso Regression
In the orthogonal case, X^T X = I,

β̂_{k,λ}^lasso = sign(β̂_k^ols) · ( |β̂_k^ols| − λ/2 )_+

i.e. the lasso estimate is related to the soft-thresholding function...
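The map β̂^ols ↦ β̂^lasso can be visualized directly (a small sketch of the soft-thresholding function):

soft = function(b, lambda) sign(b)*pmax(abs(b) - lambda/2, 0)
b_ols = seq(-2, 2, by=.01)
plot(b_ols, soft(b_ols, 1), type="l", lwd=2)  # flat on (-1/2, 1/2)
abline(0, 1, lty=2)                           # no-shrinkage line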
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
We want to solve something like

β̂_λ = argmin { −log L(β|x, y) + λ‖β‖₂² }

In the case where we consider the log-likelihood of some Gaussian variable, we get the sum of squared residuals, and we can obtain an explicit solution. But not in the context of a logistic regression.

The heuristics about ridge regression is the following graph. In the background, we can visualize the (two-dimensional) log-likelihood of the logistic regression, and the blue circle is the constraint we have, if we rewrite the optimization problem as a constrained optimization problem:

min_{β: ‖β‖₂² ≤ s} Σ_{i=1}^n −log L(y_i, β0 + x^T β)
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
It can be written equivalently (it is a strictly convex problem)

min_{β,λ} { −Σ_{i=1}^n log L(y_i, β0 + x^T β) + λ‖β‖₂² }

Thus, the constrained maximum should lie in the blue disk.

PennegLogLik = function(bbeta, lambda=0){
  b0 = bbeta[1]
  beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*
       log(1 + exp(b0+X%*%beta))) + lambda*sum(beta^2)}
opt_ridge = function(lambda){
  beta_init = lm(PRONO~., data=myocarde)$coefficients
  logistic_opt = optim(par = beta_init*0,
                       function(x) PennegLogLik(x, lambda),
                       method = "BFGS", control = list(abstol = 1e-9))
  logistic_opt$par[-1]}
v_lambda = c(exp(seq(-2, 5, length=61)))
est_ridge = Vectorize(opt_ridge)(v_lambda)
plot(v_lambda, est_ridge[1,])
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
• Using Fisher’s (penalized) score
For the penalized problem, we can easily prove that

∂ log L_p(β_{λ,old})/∂β = ∂ log L(β_{λ,old})/∂β − 2λβ_old

while

∂² log L_p(β_{λ,old})/∂β∂β^T = ∂² log L(β_{λ,old})/∂β∂β^T − 2λI

Hence

β_{λ,new} = (X^T Δ_old X + 2λI)^{−1} X^T Δ_old z

And interestingly, with that algorithm, we can also derive the variance of the estimator,

Var[β̂_λ] = [X^T ΔX + 2λI]^{−1} X^T Δ Var[z] ΔX [X^T ΔX + 2λI]^{−1}

where Var[z] = Δ^{−1}.
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
The code to compute β̂_λ (for a given lambda) is then

Y = myocarde$PRONO
X = myocarde[,1:7]
for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)
X = cbind(1, X)
colnames(X) = c("Inter", names(myocarde[,1:7]))
beta = as.matrix(lm(Y~0+X)$coefficients, ncol=1)
for(s in 1:9){
  pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
  Delta = matrix(0, nrow(X), nrow(X)); diag(Delta) = (pi*(1-pi))
  z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
  B = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*%
      (t(X)%*%Delta%*%z)
  beta = cbind(beta, B)}
beta[,8:10]
             [,1]        [,2]        [,3]
XInter  0.59619654  0.59619654  0.59619654
XFRCAR  0.09217848  0.09217848  0.09217848
XINCAR  0.77165707  0.77165707  0.77165707
XINSYS  0.69678521  0.69678521  0.69678521
XPRDIA -0.29575642 -0.29575642 -0.29575642
XPAPUL -0.23921101 -0.23921101 -0.23921101
XPVENT -0.33120792 -0.33120792 -0.33120792
XREPUL -0.84308972 -0.84308972 -0.84308972
Use

Var[β̂_λ] = [X^T ΔX + 2λI]^{−1} X^T Δ Var[z] ΔX [X^T ΔX + 2λI]^{−1}

where Var[z] = Δ^{−1}.
newton_ridge = function(lambda=1){
  beta = as.matrix(lm(Y~0+X)$coefficients, ncol=1)*runif(8)
  for(s in 1:20){
    pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
    Delta = matrix(0, nrow(X), nrow(X)); diag(Delta) = (pi*(1-pi))
    z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
    B = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*%
        (t(X)%*%Delta%*%z)
    beta = cbind(beta, B)}
  Varz = solve(Delta)
  Varb = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*% t(X)%*%
         Delta %*% Varz %*% Delta %*% X %*%
         solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X)))
  return(list(beta=beta[,ncol(beta)], sd=sqrt(diag(Varb))))}

v_lambda = c(exp(seq(-2, 5, length=61)))
est_ridge = Vectorize(function(x) newton_ridge(x)$beta)(v_lambda)
library("RColorBrewer")
colrs = brewer.pal(7, "Set1")
plot(v_lambda, est_ridge[1,], col=colrs[1], type="l")
for(i in 2:7) lines(v_lambda, est_ridge[i,], col=colrs[i])
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
And for the variance
v_lambda = c(exp(seq(-2, 5, length=61)))
est_ridge = Vectorize(function(x) newton_ridge(x)$sd)(v_lambda)
library("RColorBrewer")
colrs = brewer.pal(7, "Set1")
plot(v_lambda, est_ridge[1,], col=colrs[1], type="l")
for(i in 2:7) lines(v_lambda, est_ridge[i,], col=colrs[i], lwd=2)
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
y = myocarde$PRONO
X = myocarde[,1:7]
for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)
library(glmnet)
glm_ridge = glmnet(X, y, alpha=0)
plot(glm_ridge, xvar="lambda", col=colrs, lwd=2)
Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
• Ridge Regression with Orthogonal Covariates
library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord

library(glmnet)
glm_ridge = glmnet(pca_X, y, alpha=0)
plot(glm_ridge, xvar="lambda", col=colrs, lwd=2)
plot(glm_ridge, col=colrs, lwd=2)
df0 = df
df0$y = as.numeric(df$y)-1
m = apply(df0, 2, mean)
s = apply(df0, 2, sd)
for(j in 1:2) df0[,j] = (df0[,j]-m[j])/s[j]
reg = glmnet(cbind(df0$x1, df0$x2), df0$y==1, alpha=0)
par(mfrow=c(1,2))
plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=log(.2))
plot_lambda(.2)
plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=log(1.2))
plot_lambda(1.2)

(plot_lambda, which draws the fitted probability level curve at .5 for a given λ, is defined as on the lasso slides below, with alpha = 0.)
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
It can be written equivalently (it is a strictly convex problem)

min_{β,λ} { −Σ_{i=1}^n log L(y_i, β0 + x^T β) + λ‖β‖₁ }

Thus, the constrained maximum should lie in the blue diamond (the ℓ1 ball).
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
y = myocarde$PRONO
X = myocarde[,1:7]
for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)
LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0+X%*%beta))) -
      (1-y)*log(1 + exp(b0+X%*%beta)))}
u = seq(-4, 4, length=251)
v = outer(u, u, function(x,y) LogLik(c(1,x,y)))
image(u, u, v, col=rev(heat.colors(25)))
contour(u, u, v, add=TRUE)
polygon(c(-1,0,1,0), c(0,1,0,-1), border="blue")
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
PennegLogLik = function(bbeta, lambda=0){
  b0 = bbeta[1]
  beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0+X%*%beta))) -
       (1-y)*log(1 + exp(b0+X%*%beta))) + lambda*sum(abs(beta))}
opt_lasso = function(lambda){
  beta_init = lm(PRONO~., data=myocarde)$coefficients
  logistic_opt = optim(par = beta_init*0,
                       function(x) PennegLogLik(x, lambda),
                       hessian=TRUE, method="BFGS",
                       control=list(abstol=1e-9))
  logistic_opt$par[-1]}
v_lambda = c(exp(seq(-4, 2, length=61)))
est_lasso = Vectorize(opt_lasso)(v_lambda)
plot(v_lambda, est_lasso[1,], col=colrs[1])
for(i in 2:7) lines(v_lambda, est_lasso[i,], col=colrs[i], lwd=2)
Observe that, with this numerical optimization, the estimated coefficients are never exactly null; with glmnet, some are:
library(glmnet)
glm_lasso = glmnet(X, y, alpha=1)
plot(glm_lasso, xvar="lambda", col=colrs, lwd=2)
plot(glm_lasso, col=colrs, lwd=2)
glmnet(X, y, alpha=1, lambda=exp(-4))$beta
7x1 sparse Matrix of class "dgCMatrix"
               s0
FRCAR  .
INCAR  0.11005070
INSYS  0.03231929
PRDIA  .
PAPUL  .
PVENT -0.03138089
REPUL -0.20962611
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
With orthogonal covariates,

library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord
glm_lasso = glmnet(pca_X, y, alpha=1)
plot(glm_lasso, xvar="lambda", col=colrs)
plot(glm_lasso, col=colrs)

Before focusing on the lasso for binary variables, consider the standard ols-lasso... But first, to illustrate, let us get back to quantile (ℓ1) regression...
Optimization with linear problems (application to ℓ1-regression)
Consider the case of the median, for a sample {y_1, ..., y_n}. To compute the median, solve

min_μ Σ_{i=1}^n |y_i − μ|

which can be solved using linear programming techniques (see e.g. the simplex method). More precisely, this problem is equivalent to

min_{μ,a,b} Σ_{i=1}^n (a_i + b_i)  with a_i, b_i ≥ 0 and y_i − μ = a_i − b_i, ∀i = 1, ..., n.
Consider a sample obtained from a lognormal distribution,

n = 101
set.seed(1)
y = rlnorm(n)
median(y)
[1] 1.077415
Optimization with linear problems (application to ℓ1-regression)
Here, one can use a standard optimization routine

md = Vectorize(function(m) sum(abs(y-m)))
optim(mean(y), md)
$par
[1] 1.077416

or a linear programming technique: use the matrix form, with 3n constraints and 2n+1 parameters,

library(lpSolve)
A1 = cbind(diag(2*n), 0)
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(1,2*n), 0),
       rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution, 1)
[1] 1.077415
Optimization with linear problems (application to ℓ1-regression)
More generally, consider some quantile level τ,

tau = .3
quantile(y, tau)
      30%
0.6741586

The linear program is now

min_{μ,a,b} Σ_{i=1}^n ( τ a_i + (1 − τ) b_i )  with a_i, b_i ≥ 0 and y_i − μ = a_i − b_i, ∀i = 1, ..., n.

A1 = cbind(diag(2*n), 0)
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(tau,n), rep(1-tau,n), 0),
       rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution, 1)
[1] 0.6741586
Optimization with linear problems (application to ℓ1-regression)
To illustrate numerical computations, use

base = read.table("http://freakonometrics.free.fr/rent98_00.txt",
                  header=TRUE)

The linear program for the quantile regression is now

min_{μ,a,b} Σ_{i=1}^n ( τ a_i + (1 − τ) b_i )

with a_i, b_i ≥ 0 and y_i − [β0^τ + β1^τ x_i] = a_i − b_i, ∀i = 1, ..., n.

require(lpSolve)
tau = .3
n = nrow(base)
X = cbind(1, base$area)
y = base$rent_euro
A1 = cbind(diag(2*n), 0, 0)
A2 = cbind(diag(n), -diag(n), X)
Optimization with linear problems (application to ℓ1-regression)
r = lp("min",
       c(rep(tau,n), rep(1-tau,n), 0, 0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution, 2)
[1] 148.946864   3.289674

Compare with

library(quantreg)
rq(rent_euro~area, tau=tau, data=base)
Coefficients:
(Intercept)        area
 148.946864    3.289674
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
If we get back to the original lasso approach, the goal was to solve

min { (1/2n) Σ_{i=1}^n [y_i − (β0 + x_i^T β)]² + λ Σ_j |β_j| }

The intercept is not subject to the penalty. The first-order condition for β0 is

∂/∂β0 ‖y − Xβ − β0 1‖² = −2·1^T (y − Xβ − β0 1) = 0

i.e.

β0 = (1/n) 1^T (y − Xβ)

For the other parameters, assuming that the KKT conditions are satisfied, we can check whether 0 belongs to the subdifferential at the minimum,

0 ∈ ∂( ½‖y − Xβ‖² + λ‖β‖₁ ) = ∇( ½‖y − Xβ‖² ) + ∂( λ‖β‖₁ )
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
For the term on the left, we recognize

∇( ½‖y − Xβ‖² ) = −X^T (y − Xβ) = −g

so that the previous equation can be written

g_k ∈ ∂(λ|β_k|) = { {+λ} if β_k > 0 ;  {−λ} if β_k < 0 ;  (−λ, +λ) if β_k = 0 }

i.e. if β_k ≠ 0, then g_k = sign(β_k)·λ.

We can split β_j into a sum of its positive and negative parts by replacing β_j with β_j^+ − β_j^−, where β_j^+, β_j^− ≥ 0.
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
Then the lasso problem becomes

min { −log L(β) + λ Σ_j (β_j^+ + β_j^−) }

with constraints β_j^+, β_j^− ≥ 0. Let α_j^+, α_j^− denote the Lagrange multipliers for β_j^+ and β_j^−, respectively; the Lagrangian is

−log L(β) + λ Σ_j (β_j^+ + β_j^−) − Σ_j α_j^+ β_j^+ − Σ_j α_j^− β_j^−.

To satisfy the stationarity condition, we take the gradient of the Lagrangian with respect to β_j^+ and set it to zero,

∇(−log L(β))_j + λ − α_j^+ = 0

and do the same with respect to β_j^− to obtain

−∇(−log L(β))_j + λ − α_j^− = 0.
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
For convenience, introduce the soft-thresholding function

S(z, γ) = sign(z)·(|z| − γ)_+ = { z − γ if γ < |z| and z > 0 ;  z + γ if γ < |z| and z < 0 ;  0 if γ ≥ |z| }

Noticing that the optimization problem

½‖y − Xβ‖₂² + λ‖β‖₁

can also be written, coordinate by coordinate,

min Σ_{j=1}^p { −β_j^ols · β_j + ½β_j² + λ|β_j| }

observe that β_{j,λ} = S(β_j^ols, λ), which is a coordinate-wise update.
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
Now, if we consider a (slightly) more general problem, with weights in the first part,

min { (1/2n) Σ_{i=1}^n ω_i [y_i − (β0 + x_i^T β)]² + λ Σ_j |β_j| }

the coordinate-wise update becomes

β_{j,λ,ω} = S(β_j^{ω-ols}, λ)

An alternative is to set

r_j = y − β0 1 − Σ_{k≠j} β_k x_k = y − ŷ^{(j)}
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
so that the optimization problem can be written, equivalently,

min (1/2n) Σ_{j=1}^p { [r_j − β_j x_j]² + λ|β_j| }

hence

min (1/2n) Σ_{j=1}^p { β_j² ‖x_j‖² − 2 β_j r_j^T x_j + λ|β_j| }

and one gets

β_{j,λ} = (1/‖x_j‖²) S( r_j^T x_j, nλ )

or, if we develop,

β_{j,λ} = (1/Σ_i x_{ij}²) S( Σ_i x_{i,j}[y_i − ŷ_i^{(j)}], nλ )
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
Again, if there are weights ω = (ω_i), the coordinate-wise update becomes

β_{j,λ,ω} = (1/Σ_i ω_i x_{ij}²) S( Σ_i ω_i x_{i,j}[y_i − ŷ_i^{(j)}], nλ )

The code to compute this componentwise descent is

soft_thresholding = function(x, a){
  result = numeric(length(x))
  result[which(x > a)] = x[which(x > a)] - a
  result[which(x < -a)] = x[which(x < -a)] + a
  return(result)
}

and the code
Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
lasso_coord_desc = function(X, y, beta, lambda, tol=1e-6, maxiter=1000){
  beta = as.matrix(beta)
  X = as.matrix(X)
  omega = rep(1/length(y), length(y))
  obj = numeric(length=(maxiter+1))
  betalist = list(length(maxiter+1))
  betalist[[1]] = beta
  beta0list = numeric(length(maxiter+1))
  beta0 = sum(y-X%*%beta)/(length(y))
  beta0list[1] = beta0
  for (j in 1:maxiter){
    for (k in 1:length(beta)){
      r = y - X[,-k]%*%beta[-k] - beta0*rep(1, length(y))
      beta[k] = (1/sum(omega*X[,k]^2))*
        soft_thresholding(t(omega*r)%*%X[,k], length(y)*lambda)
    }
    beta0 = sum(y-X%*%beta)/(length(y))
    beta0list[j+1] = beta0
    betalist[[j+1]] = beta
    obj[j] = (1/2)*(1/length(y))*norm(omega*(y - X%*%beta -
               beta0*rep(1, length(y))), 'F')^2 + lambda*sum(abs(beta))
    if (norm(rbind(beta0list[j], betalist[[j]]) - rbind(beta0, beta),
             'F') < tol) { break }
  }
  return(list(obj=obj[1:j], beta=beta, intercept=beta0)) }
If we get back to our logistic problem, the log-likelihood is here

log L = (1/n) Σ_{i=1}^n { y_i · (β0 + x_i^T β) − log[1 + exp(β0 + x_i^T β)] }

which is a concave function of the parameters. Hence, one can use a quadratic approximation of the log-likelihood (a Taylor expansion around the current estimate), leading to the weighted least-squares criterion

(1/n) Σ_{i=1}^n ω_i · [z_i − (β0 + x_i^T β)]²
where z_i is the working response

z_i = (β0 + x_i^T β) + (y_i − p_i)/(p_i[1 − p_i]),

p_i is the prediction

p_i = exp[β0 + x_i^T β] / (1 + exp[β0 + x_i^T β]),

and ω_i = p_i[1 − p_i] are weights. Thus, we obtain a penalized least-squares problem.
lasso_coord_desc = function(X, y, beta, lambda, tol=1e-6, maxiter=1000){
  beta = as.matrix(beta)
  X = as.matrix(X)
  obj = numeric(length=(maxiter+1))
  betalist = vector("list", maxiter+1)
  betalist[[1]] = beta
  beta0 = sum(y - X%*%beta)/(length(y))
  # IRLS ingredients: predictions p, working response z, weights omega
  p = exp(beta0*rep(1,length(y)) + X%*%beta) /
      (1 + exp(beta0*rep(1,length(y)) + X%*%beta))
  z = beta0*rep(1,length(y)) + X%*%beta + (y-p)/(p*(1-p))
  omega = p*(1-p)/(sum((p*(1-p))))
  beta0list = numeric(maxiter+1)
  beta0 = sum(y - X%*%beta)/(length(y))
  beta0list[1] = beta0
  for (j in 1:maxiter){
    for (k in 1:length(beta)){
      r = z - X[,-k]%*%beta[-k] - beta0*rep(1,length(y))
      beta[k] = (1/sum(omega*X[,k]^2)) *
        soft_thresholding(t(omega*r)%*%X[,k], length(y)*lambda)
    }
    beta0 = sum(y - X%*%beta)/(length(y))
    beta0list[j+1] = beta0
    betalist[[j+1]] = beta
    obj[j] = (1/2)*(1/length(y))*norm(omega*(z - X%*%beta -
      beta0*rep(1,length(y))), "F")^2 + lambda*sum(abs(beta))
    # update the quadratic approximation at the new parameters
    p = exp(beta0*rep(1,length(y)) + X%*%beta) /
        (1 + exp(beta0*rep(1,length(y)) + X%*%beta))
    z = beta0*rep(1,length(y)) + X%*%beta + (y-p)/(p*(1-p))
    omega = p*(1-p)/(sum((p*(1-p))))
    if (norm(rbind(beta0list[j], betalist[[j]]) - rbind(beta0, beta), "F") < tol){ break }
  }
  return(list(obj=obj[1:j], beta=beta, intercept=beta0)) }
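A minimal usage sketch, with the same (assumed) standardized design X and response y as above; the penalty λ = 0.05 is again an arbitrary illustrative value:

fit_logit = lasso_coord_desc(X, y, beta=rep(0,7), lambda=.05)
fit_logit$beta
# compare with glmnet's penalized logistic regression
library(glmnet)
coef(glmnet(X, y, family="binomial", alpha=1, lambda=.05))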
library(glmnet)
df0 = df
df0$y = as.numeric(df$y)-1
plot_lambda = function(lambda){
  m = apply(df0, 2, mean)
  s = apply(df0, 2, sd)
  for(j in 1:2) df0[,j] <- (df0[,j]-m[j])/s[j]
  reg = glmnet(cbind(df0$x1, df0$x2), df0$y==1, alpha=1, lambda=lambda)
  u = seq(0, 1, length=101)
  p = function(x,y){
    xt = (x-m[1])/s[1]
    yt = (y-m[2])/s[2]
    predict(reg, newx=cbind(x1=xt, x2=yt), type="response")}
  # (the listing is truncated on the slide; presumably the function then draws
  # the sample and the .5 level curve of the score, along these lines)
  plot(df$x1, df$x2, pch=19, col=c("blue","red")[1+(df0$y==1)])
  vv = outer(u, u, p)
  contour(u, u, vv, add=TRUE, lwd=2, levels=.5)
}
reg = glmnet(cbind(df0$x1, df0$x2), df0$y==1, alpha=1)
par(mfrow=c(1,2))
plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=exp(-2.8))
plot_lambda(exp(-2.8))
and with a larger penalty, λ = e−2.1,

reg = glmnet(cbind(df0$x1, df0$x2), df0$y==1, alpha=1)
par(mfrow=c(1,2))
plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=exp(-2.1))
plot_lambda(exp(-2.1))
Classification : (Linear) Fisher’s Discrimination
Consider the following naive classification rule

$$m^\star(x)=\underset{y}{\text{argmax}}\,\{\mathbb{P}[Y=y|X=x]\}$$

or

$$m^\star(x)=\underset{y}{\text{argmax}}\left\{\frac{\mathbb{P}[X=x|Y=y]}{\mathbb{P}[X=x]}\cdot\mathbb{P}[Y=y]\right\}$$

(where P[X = x] is the density in the continuous case).

In the case where y takes two values, that will be the standard {0, 1} here, one can rewrite the latter as

$$m^\star(x)=\begin{cases}1 & \text{if }\ \mathbb{E}(Y|X=x)>\frac{1}{2}\\[2pt] 0 & \text{otherwise}\end{cases}$$
The set

$$DS=\left\{x,\ \mathbb{E}(Y|X=x)=\frac{1}{2}\right\}$$

is called the decision boundary.

Assume that X|Y = 0 ∼ N(μ0, Σ0) and X|Y = 1 ∼ N(μ1, Σ1); then explicit expressions can be derived,

$$m^\star(x)=\begin{cases}1 & \text{if }\ r_1^2<r_0^2+2\log\frac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)}+\log\frac{|\Sigma_0|}{|\Sigma_1|}\\[2pt] 0 & \text{otherwise}\end{cases}$$

where r_y^2 is the Mahalanobis distance,

$$r_y^2=[x-\mu_y]^\top\Sigma_y^{-1}[x-\mu_y]$$
Let δy be defined as

$$\delta_y(x)=-\frac{1}{2}\log|\Sigma_y|-\frac{1}{2}[x-\mu_y]^\top\Sigma_y^{-1}[x-\mu_y]+\log\mathbb{P}(Y=y)$$

The decision boundary of this classifier is

$$\{x\ \text{such that}\ \delta_0(x)=\delta_1(x)\}$$

which is quadratic in x. This is the quadratic discriminant analysis.

In Fisher (1936, The use of multiple measurements in taxonomic problems), it was assumed that Σ0 = Σ1. In that case, actually,

$$\delta_y(x)=x^\top\Sigma^{-1}\mu_y-\frac{1}{2}\mu_y^\top\Sigma^{-1}\mu_y+\log\mathbb{P}(Y=y)$$

and the decision frontier is now linear in x.
m0 = apply(myocarde[myocarde$PRONO=="0", 1:7], 2, mean)
m1 = apply(myocarde[myocarde$PRONO=="1", 1:7], 2, mean)
Sigma = var(myocarde[,1:7])
omega = solve(Sigma) %*% (m1-m0)
omega
                 [,1]
FRCAR -0.012909708542
INCAR  1.088582058796
INSYS -0.019390084344
PRDIA -0.025817110020
PAPUL  0.020441287970
PVENT -0.038298291091
REPUL -0.001371677757
b = (t(m1)%*%solve(Sigma)%*%m1 - t(m0)%*%solve(Sigma)%*%m0)/2
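From these two quantities one can classify directly; a minimal sketch (not on the slides), comparing the linear score x⊤ω with the threshold b, and ignoring the prior term log P(Y=1)/P(Y=0):

# linear Fisher score for each patient, against the threshold b (priors ignored)
score = as.matrix(myocarde[,1:7]) %*% omega
pred  = (score > as.numeric(b)) * 1
table(pred, myocarde$PRONO)   # confusion table of the linear discriminant rule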
x = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x, x2=y, y=as.factor(z))
m0 = apply(df[df$y=="0", 1:2], 2, mean)
m1 = apply(df[df$y=="1", 1:2], 2, mean)
Sigma = var(df[,1:2])
omega = solve(Sigma) %*% (m1-m0)
omega
           [,1]
x1 -2.640613174
x2  4.858705676
library(MASS)
fit_lda = lda(y ~ x1+x2, data=df)
fit_lda

Coefficients of linear discriminants:
            LD1
x1 -2.588389554
x2  4.762614663

and for the intercept

b = (t(m1)%*%solve(Sigma)%*%m1 - t(m0)%*%solve(Sigma)%*%m0)/2
vu = seq(0, 1, length=101)   # plotting grid (assumed; defined on an earlier slide)
predlda = function(x,y) predict(fit_lda, data.frame(x1=x, x2=y))$class==1
vv = outer(vu, vu, predlda)
contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5)

fit_qda = qda(y ~ x1+x2, data=df)
plot(df$x1, df$x2, pch=19,
     col=c("blue","red")[1+(df$y=="1")])
predqda = function(x,y) predict(fit_qda, data.frame(x1=x, x2=y))$class==1
vv = outer(vu, vu, predqda)   # (the slide called predlda here, a copy-paste slip)
contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5)
Note that there are strong links with the logistic regression.

Assume as previously that X|Y = 0 ∼ N(μ0, Σ) and X|Y = 1 ∼ N(μ1, Σ); then

$$\log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=0|X=x)}$$

is equal to

$$x^\top\Sigma^{-1}[\mu_1-\mu_0]-\frac{1}{2}[\mu_1-\mu_0]^\top\Sigma^{-1}[\mu_1+\mu_0]+\log\frac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)}$$

which is linear in x,

$$\log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=0|X=x)}=x^\top\beta$$

Hence, when each group has a Gaussian distribution with the same variance matrix, LDA and the logistic regression lead to the same classification rule.
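One can check this empirically; a small sketch (assuming the toy dataset df and the lda fit above; the name fit_glm is ours), comparing the slopes of the two linear decision boundaries, which are close even though the coefficients differ:

# logistic regression on the same toy data
fit_glm = glm(y ~ x1+x2, data=df, family=binomial)
coef(fit_glm)      # logistic direction (beta1, beta2)
fit_lda$scaling    # LDA direction, proportional to Sigma^{-1}(mu1 - mu0)
# slopes of the two boundaries
-coef(fit_glm)["x1"]/coef(fit_glm)["x2"]
-fit_lda$scaling["x1",1]/fit_lda$scaling["x2",1]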
Observe furthermore that the slope is proportional to Σ−1[μ1 − μ0], as stated in Fisher's article. But to obtain such a relationship, he observed that the ratio of between and within variances (in the two groups) was

$$\frac{\text{variance between}}{\text{variance within}}=\frac{[\omega^\top\mu_1-\omega^\top\mu_0]^2}{\omega^\top\Sigma_1\omega+\omega^\top\Sigma_0\omega}$$

which is maximal when ω is proportional to Σ−1[μ1 − μ0].
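A quick numerical illustration, a sketch only, on the toy data above: with within-group variance matrices S0 and S1 (our names), maximize the ratio over directions and compare with the closed form:

# within-group variance matrices on the toy data
S0 = var(df[df$y=="0", 1:2])
S1 = var(df[df$y=="1", 1:2])
# minus the Fisher ratio, to be minimized over directions w
neg_ratio = function(w) -as.numeric((sum(w*(m1-m0)))^2 /
                                    (t(w)%*%S1%*%w + t(w)%*%S0%*%w))
w_opt = optim(c(1,1), neg_ratio)$par
w_opt / w_opt[2]                                  # optimal direction (normalized)
w_cf = solve(S0+S1) %*% (m1-m0); w_cf / w_cf[2]   # closed-form direction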
Classification : Neural Nets
The most interesting part is not ’neural’ but ’network’.
Networks have nodes, and edges (possibly directed) that connect nodes; or maybe, to be more specific, some sort of flow network,
In such a network, we usually have (here multiple) sources (here {s1, s2, s3}) on the left, and a sink (here {t}) on the right. To continue with this metaphorical introduction, information from the sources should reach the sink. And usually, sources are explanatory variables, {x1, · · · , xp}, and the sink is our variable of interest y.

The output here is a binary variable y ∈ {0, 1}.

Consider the sigmoid function

$$f(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}$$

which is actually the logistic function.

sigmoid = function(x) 1 / (1 + exp(-x))

This function f is called the activation function.
If y ∈ {−1, +1}, one can consider the hyperbolic tangent

$$f(x)=\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$

or the inverse tangent function f(x) = tan−1(x).

And as input for such a function, we consider a weighted sum of incoming nodes. So here

$$y_i=f\Big(\sum_{j=1}^p \omega_j x_{j,i}\Big)$$

We can also add a constant, actually,

$$y_i=f\Big(\omega_0+\sum_{j=1}^p \omega_j x_{j,i}\Big)$$
weights_0 = lm(PRONO~., data=myocarde)$coefficients
X = as.matrix(cbind(1, myocarde[,1:7]))
y_5_1 = sigmoid(X %*% weights_0)
library(ROCR)
pred = ROCR::prediction(y_5_1, myocarde$PRONO)
perf = ROCR::performance(pred, "tpr", "fpr")
plot(perf, col="blue", lwd=2)
reg = glm(PRONO~., data=myocarde, family=binomial(link="logit"))
y_0 = predict(reg, type="response")
pred0 = ROCR::prediction(y_0, myocarde$PRONO)
perf0 = ROCR::performance(pred0, "tpr", "fpr")
plot(perf0, add=TRUE, col="red")
The error of that model can be computed using the quadratic loss function,

$$\sum_{i=1}^n \big(y_i-p(x_i)\big)^2$$

or the cross-entropy

$$-\sum_{i=1}^n \big(y_i\log p(x_i)+[1-y_i]\log[1-p(x_i)]\big)$$

loss = function(weights){
  mean( (myocarde$PRONO - sigmoid(X %*% weights))^2 ) }

We want to solve

$$\omega^\star=\underset{\omega}{\text{argmin}}\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,\,f(\omega_0+x_i^\top\omega)\big)\right\}$$
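The same optimization can be run with the cross-entropy loss instead of the quadratic one; a minimal sketch (the name loss_entropy and the small offset eps, used to avoid log(0), are ours):

# cross-entropy loss; eps guards against log(0) for saturated predictions
loss_entropy = function(weights, eps=1e-12){
  p = pmin(pmax(sigmoid(X %*% weights), eps), 1-eps)
  -mean( myocarde$PRONO*log(p) + (1-myocarde$PRONO)*log(1-p) ) }
weights_ce = optim(weights_0, loss_entropy)$par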
weights_1 = optim(weights_0, loss)$par
y_5_2 = sigmoid(X %*% weights_1)
pred = ROCR::prediction(y_5_2, myocarde$PRONO)
perf = ROCR::performance(pred, "tpr", "fpr")
plot(perf, col="blue", lwd=2)
plot(perf0, add=TRUE, col="red")

Consider a single hidden layer, with 3 different neurons, so that

$$p(x)=f\Big(\omega_0+\sum_{h=1}^3 \omega_h f_h\Big(\omega_{h,0}+\sum_{j=1}^p \omega_{h,j}x_j\Big)\Big)$$

or

$$p(x)=f\Big(\omega_0+\sum_{h=1}^3 \omega_h f_h\big(\omega_{h,0}+x^\top\omega_h\big)\Big)$$
weights_1 = lm(PRONO~1+FRCAR+INCAR+INSYS+PAPUL+PVENT, data=myocarde)$coefficients
X1 = as.matrix(cbind(1, myocarde[,c("FRCAR","INCAR","INSYS","PAPUL","PVENT")]))
weights_2 = lm(PRONO~1+INSYS+PRDIA, data=myocarde)$coefficients
X2 = as.matrix(cbind(1, myocarde[,c("INSYS","PRDIA")]))
weights_3 = lm(PRONO~1+PAPUL+PVENT+REPUL, data=myocarde)$coefficients
X3 = as.matrix(cbind(1, myocarde[,c("PAPUL","PVENT","REPUL")]))
X = cbind(sigmoid(X1 %*% weights_1), sigmoid(X2 %*% weights_2),
          sigmoid(X3 %*% weights_3))
weights = c(1/3, 1/3, 1/3)
y_5_3 = sigmoid(X %*% weights)
pred = ROCR::prediction(y_5_3, myocarde$PRONO)
perf = ROCR::performance(pred, "tpr", "fpr")
plot(perf, col="blue", lwd=2)
If

$$p(x)=f\Big(\omega_0+\sum_{h=1}^3 \omega_h f_h\big(\omega_{h,0}+x^\top\omega_h\big)\Big)$$

we want to solve, with back propagation,

$$\omega^\star=\underset{\omega}{\text{argmin}}\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,p(x_i)\big)\right\}$$

For practical purposes, let us center and scale all variables

center = function(z) (z-mean(z))/sd(z)

The loss function is here
loss = function(weights){
  weights_1 = weights[0+(1:7)]
  weights_2 = weights[7+(1:7)]
  weights_3 = weights[14+(1:7)]
  weights_  = weights[21+1:4]
  X1 = X2 = X3 = as.matrix(myocarde[,1:7])
  Z1 = center(X1 %*% weights_1)
  Z2 = center(X2 %*% weights_2)
  Z3 = center(X3 %*% weights_3)
  X = cbind(1, sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
  mean( (myocarde$PRONO - sigmoid(X %*% weights_))^2 )}
library(factoextra)   # for get_pca_var()
pca = princomp(myocarde[,1:7])
W = get_pca_var(pca)$contrib
weights_0 = c(W[,1], W[,2], W[,3], c(-1, rep(1,3)/3))
weights_opt = optim(weights_0, loss)$par
weights_1 = weights_opt[0+(1:7)]
weights_2 = weights_opt[7+(1:7)]
weights_3 = weights_opt[14+(1:7)]
weights_  = weights_opt[21+1:4]
X1 = X2 = X3 = as.matrix(myocarde[,1:7])
Z1 = center(X1 %*% weights_1)
Z2 = center(X2 %*% weights_2)
Z3 = center(X3 %*% weights_3)
X = cbind(1, sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
y_5_4 = sigmoid(X %*% weights_)
pred = ROCR::prediction(y_5_4, myocarde$PRONO)
perf = ROCR::performance(pred, "tpr", "fpr")
plot(perf, col="blue", lwd=2)
plot(perf0, add=TRUE, col="red")   # (the slide plotted perf twice; perf0 is the logistic benchmark)
library(nnet)
myocarde_minmax = myocarde
minmax = function(z) (z-min(z))/(max(z)-min(z))
for(j in 1:7) myocarde_minmax[,j] = minmax(myocarde_minmax[,j])
model_nnet = nnet(PRONO~., data=myocarde_minmax, size=3)
summary(model_nnet)
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
plot.nnet(model_nnet)
library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO~., data=myocarde_minmax)),
                       myocarde_minmax, hidden=3, act.fct=sigmoid)
plot(model_nnet)
Of course, one can consider a multiple layer network,

$$p(x)=f\Big(\omega_0+\sum_{h=1}^3 \omega_h f_h\big(\omega_{h,0}+z_h^\top\omega_h\big)\Big)$$

where

$$z_h=f\Big(\omega_{h,0}+\sum_{j=1}^{k_h} \omega_{h,j} f_{h,j}\big(\omega_{h,j,0}+x^\top\omega_{h,j}\big)\Big)$$
library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO~., data=myocarde_minmax)),
                       myocarde_minmax, hidden=3, act.fct=sigmoid)
plot(model_nnet)
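Note that the code above still has a single hidden layer; with neuralnet, a genuinely multilayer architecture is obtained by passing a vector to hidden. A sketch (the sizes c(3,2) are an arbitrary choice):

# two hidden layers, with 3 and 2 neurons respectively
model_nnet2 = neuralnet(formula(glm(PRONO~., data=myocarde_minmax)),
                        myocarde_minmax, hidden=c(3,2), act.fct=sigmoid)
plot(model_nnet2)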
SVM : Support Vector Machine
Linearly Separable sample [econometric notations]

Data (y1, x1), · · · , (yn, xn), with y ∈ {0, 1}, are linearly separable if there are (β0, β) such that
- yi = 1 if β0 + xiᵀβ > 0
- yi = 0 if β0 + xiᵀβ < 0

Linearly Separable sample [ML notations]

Data (y1, x1), · · · , (yn, xn), with y ∈ {−1, +1}, are linearly separable if there are (b, ω) such that
- yi = +1 if b + ⟨xi, ω⟩ > 0
- yi = −1 if b + ⟨xi, ω⟩ < 0

or equivalently yi · (b + ⟨xi, ω⟩) > 0, ∀i.
b + ⟨x, ω⟩ = 0 is a hyperplane (in Rp), orthogonal to ω.

Use m(x) = 1 if b + ⟨x, ω⟩ ≥ 0, −1 otherwise, as classifier.

Problem : the equation (i.e. (b, ω)) is not unique !

Canonical form : $\min_{i=1,\cdots,n}|b+\langle x_i,\omega\rangle|=1$

Problem : the solution here is not unique !

Idea : use the widest (safety) margin γ, see Vapnik & Lerner (1963, Pattern recognition using generalized portrait method) or Cover (1965, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition).
Consider two points, x−1 and x+1, one on each side of the margin; then

$$\gamma=\frac{1}{2}\,\frac{\langle\omega,\,x_{+1}-x_{-1}\rangle}{\|\omega\|}$$

It is minimal when b + ⟨x−1, ω⟩ = −1 and b + ⟨x+1, ω⟩ = +1, and therefore

$$\gamma=\frac{1}{\|\omega\|}$$

The optimization problem becomes

$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2\right\}\quad\text{s.t.}\quad y_i\cdot(b+\langle x_i,\omega\rangle)\ge 1,\ \forall i,$$

a convex optimization problem with linear constraints.
Consider the following problem : $\min_{u\in\mathbb{R}^p}\{h(u)\}$ s.t. $g_i(u)\ge 0$, $\forall i=1,\cdots,n$, where h is quadratic and the gi's are linear.

The Lagrangian is L : Rp × Rn → R, defined as

$$L(u,\alpha)=h(u)-\sum_{i=1}^n \alpha_i g_i(u)$$

where the αi's are dual variables, and the dual function is

$$\Lambda:\alpha\mapsto L(u_\alpha,\alpha)=\min_u\{L(u,\alpha)\}\quad\text{where}\quad u_\alpha=\underset{u}{\text{argmin}}\,\{L(u,\alpha)\}$$

One can solve the dual problem, max{Λ(α)} s.t. α ≥ 0. The solution is u⋆ = uα⋆.

If gi(uα⋆) > 0, then necessarily αi⋆ = 0 (see the Karush-Kuhn-Tucker (KKT) condition, αi⋆ · gi(uα⋆) = 0).
Here,

$$L(b,\omega,\alpha)=\frac{1}{2}\|\omega\|^2-\sum_{i=1}^n \alpha_i\cdot\big[y_i\cdot(b+\langle x_i,\omega\rangle)-1\big]$$

From the first order conditions,

$$\frac{\partial L(b,\omega,\alpha)}{\partial \omega}=\omega-\sum_{i=1}^n \alpha_i\cdot y_i x_i=0,\quad\text{i.e.}\quad \omega^\star=\sum_{i=1}^n \alpha_i^\star y_i x_i$$

$$\frac{\partial L(b,\omega,\alpha)}{\partial b}=-\sum_{i=1}^n \alpha_i\cdot y_i=0,\quad\text{i.e.}\quad \sum_{i=1}^n \alpha_i\cdot y_i=0$$

and

$$\Lambda(\alpha)=\sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i,x_j\rangle$$
$$\min_{\alpha}\left\{\frac{1}{2}\alpha^\top Q\alpha-\mathbf{1}^\top\alpha\right\}\quad\text{s.t.}\quad \alpha_i\ge 0\ \forall i,\quad y^\top\alpha=0$$

where Q = [Qi,j] with Qi,j = yiyj⟨xi, xj⟩, and then

$$\omega^\star=\sum_{i=1}^n \alpha_i^\star y_i x_i\quad\text{and}\quad b^\star=-\frac{1}{2}\Big[\min_{i:y_i=+1}\{\langle x_i,\omega^\star\rangle\}+\max_{i:y_i=-1}\{\langle x_i,\omega^\star\rangle\}\Big]$$

Points xi such that αi⋆ > 0 are called support vectors : they satisfy

$$y_i\cdot(b^\star+\langle x_i,\omega^\star\rangle)=1$$

Use m⋆(x) = 1 if b⋆ + ⟨x, ω⋆⟩ ≥ 0, −1 otherwise, as classifier.

Observe that the margin is then

$$\gamma^\star=\Big(\sum_{i=1}^n \alpha_i^\star\Big)^{-1/2}$$
The optimization problem was

$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2\right\}\quad\text{s.t.}\quad y_i\cdot(b+\langle x_i,\omega\rangle)\ge 1,\ \forall i,$$

which became

$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2+\sum_{i=1}^n \alpha_i\cdot\big[1-y_i\cdot(b+\langle x_i,\omega\rangle)\big]\right\}$$

or

$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2+\text{penalty}\right\}$$
Consider here the more general case where the space is not linearly separable :

$$(\langle\omega,x_i\rangle+b)\,y_i\ge 1\quad\text{becomes}\quad (\langle\omega,x_i\rangle+b)\,y_i\ge 1-\xi_i$$

for some slack variables ξi, and penalize large slack variables, i.e. solve (for some cost C)

$$\min_{\omega,b}\left\{\frac{1}{2}\|\omega\|^2+C\sum_{i=1}^n \xi_i\right\}$$

subject to ∀i, ξi ≥ 0 and (⟨ω, xi⟩ + b)yi ≥ 1 − ξi.
This is the soft-margin extension, see e1071::svm() or kernlab::ksvm()
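For instance, a minimal soft-margin fit on the toy dataset above (a sketch; the cost value cost=1 is an arbitrary choice):

library(e1071)
# linear soft-margin SVM on the toy data; cost is the slack penalty C
fit_svm = svm(y ~ x1 + x2, data=df, kernel="linear", cost=1, scale=FALSE)
fit_svm$index          # indices of the support vectors
predict(fit_svm, df)   # fitted classes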
The dual optimization problem is now

$$\min_{\alpha}\left\{\frac{1}{2}\alpha^\top Q\alpha-\mathbf{1}^\top\alpha\right\}\quad\text{s.t.}\quad 0\le\alpha_i\le C\ \forall i,\quad y^\top\alpha=0$$

where Q = [Qi,j] with Qi,j = yiyj⟨xi, xj⟩, and then

$$\omega^\star=\sum_{i=1}^n \alpha_i^\star y_i x_i\quad\text{and}\quad b^\star=-\frac{1}{2}\Big[\min_{i:y_i=+1}\{\langle x_i,\omega^\star\rangle\}+\max_{i:y_i=-1}\{\langle x_i,\omega^\star\rangle\}\Big]$$

Note further that the (primal) optimization problem can be written

$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2+\sum_{i=1}^n \big[1-y_i\cdot(b+\langle x_i,\omega\rangle)\big]_+\right\}$$

where (1 − z)+ is a convex upper bound of the empirical error 1z≤0.
One can also consider the kernel trick : ⟨xi, xj⟩ is replaced by ⟨ϕ(xi), ϕ(xj)⟩ for some mapping ϕ,

$$K(x_i,x_j)=\phi(x_i)^\top\phi(x_j)$$

For instance

$$K(a,b)=(a^\top b)^3=\phi(a)^\top\phi(b)$$

where

$$\phi(a_1,a_2)=\big(a_1^3,\ \sqrt{3}\,a_1^2a_2,\ \sqrt{3}\,a_1a_2^2,\ a_2^3\big)$$

Consider polynomial kernels

$$K(a,b)=(1+a^\top b)^p$$

or a Gaussian kernel

$$K(a,b)=\exp\big(-(a-b)^\top(a-b)\big)$$

and solve

$$\max_{\alpha_i\ge 0}\left\{\sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n y_iy_j\alpha_i\alpha_j K(x_i,x_j)\right\}$$
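The cubic-kernel factorization above is easy to check numerically; a small sketch (phi is our name for the explicit feature map):

# explicit feature map of the cubic kernel in dimension 2
phi = function(a) c(a[1]^3, sqrt(3)*a[1]^2*a[2], sqrt(3)*a[1]*a[2]^2, a[2]^3)
a = c(.5, 2); b = c(1, -1)
(sum(a*b))^3         # (a'b)^3
sum(phi(a)*phi(b))   # <phi(a), phi(b)> : same value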
[Figures : SVM decision boundaries on the toy dataset, for the linear kernel, polynomial kernels of degree 2 to 6, and the radial kernel; then the radial kernel for several values of the cost C, and for several values of the tuning parameter γ.]
SVM : Support Vector Machine
The radial kernel is formed by taking an infinite sum over polynomial kernels...

$$K(x,y)=\exp\big(-\gamma\|x-y\|^2\big)=\langle\psi(x),\psi(y)\rangle$$

where ψ is some Rn → R∞ function, since

$$K(x,y)=\exp\big(-\gamma\|x-y\|^2\big)=\underbrace{\exp\big(-\gamma\|x\|^2-\gamma\|y\|^2\big)}_{\text{normalizing factor}}\cdot\exp\big(2\gamma\langle x,y\rangle\big)$$

i.e.

$$K(x,y)\propto\exp\big(2\gamma\langle x,y\rangle\big)=\sum_{k=0}^{\infty}\frac{(2\gamma)^k\langle x,y\rangle^k}{k!}=\sum_{k=0}^{\infty}\frac{(2\gamma)^k}{k!}\,K_k(x,y)$$

where Kk is the polynomial kernel of degree k.

If K = K1 + K2 with ψj : Rn → Rdj , then ψ : Rn → Rd with d ∼ d1 + d2.
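Again, one can check the expansion numerically; a sketch truncating the series at a modest order (the names and the order 25 are ours):

# compare exp(2*gamma*<x,y>) with a truncated sum of polynomial kernels <x,y>^k
gamma = .5; x = c(.2, .7); y = c(1, -.3)
K_poly = function(k) sum(x*y)^k
exp(2*gamma*sum(x*y))
sum(sapply(0:25, function(k) (2*gamma)^k / factorial(k) * K_poly(k)))  # same value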
A kernel is a measure of similarity between vectors : the larger the value of γ, the closer two vectors have to be for the kernel to consider them similar.

Is there a probabilistic interpretation ? Platt (2000, Probabilities for SVM) suggested to use a logistic function over the SVM scores,

$$p(x)=\frac{\exp[b+\langle x,\omega\rangle]}{1+\exp[b+\langle x,\omega\rangle]}$$
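Platt's calibration can be sketched in a few lines: fit an SVM, extract the decision values, and regress the labels on them with a logistic link. A sketch, reusing the e1071 fit suggested earlier (decision.values is e1071's name for the scores b + ⟨x, ω⟩):

# decision values (SVM scores) on the training sample
scores = attributes(predict(fit_svm, df, decision.values=TRUE))$decision.values
# logistic regression of the labels on the scores: Platt's probabilities
platt = glm((df$y=="1") ~ scores, family=binomial)
fitted(platt)   # calibrated probabilities p(x)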
Let C denote the cost of mis-classification. The optimization problem becomes

$$\min_{\omega\in\mathbb{R}^d,\,b\in\mathbb{R},\,\xi\in\mathbb{R}^n}\left\{\frac{1}{2}\|\omega\|^2+C\sum_{i=1}^n \xi_i\right\}$$

subject to yi · (ωᵀxi + b) ≥ 1 − ξi, with ξi ≥ 0, ∀i = 1, · · · , n.
n = length(myocarde[,"PRONO"])
myocarde0 = myocarde
myocarde0$PRONO = myocarde$PRONO*2-1
C = .5
# parameters are stacked as param = (omega (7), xi (n), b (1)),
# consistently with the constraint matrix below
f = function(param){
  w  = param[1:7]
  xi = param[7+1:n]
  b  = param[7+n+1]
  .5*sum(w^2) + C*sum(xi)}
grad_f = function(param){
  w  = param[1:7]
  xi = param[7+1:n]
  b  = param[7+n+1]
  c(w, rep(C, length(xi)), 0)}

The (linear) constraints are written as Uθ − c ≥ 0,

Ui = rbind(cbind(myocarde0[,"PRONO"]*as.matrix(myocarde[,1:7]), diag(n),
                 myocarde0[,"PRONO"]),
           cbind(matrix(0,n,7), diag(n,n), matrix(0,n,1)))
Ci = c(rep(1,n), rep(0,n))

then use

constrOptim(theta=p_init, f, grad_f, ui=Ui, ci=Ci)
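The starting point p_init is not given on the slides; constrOptim() needs a strictly feasible one. A sketch of a valid choice under the (ω, ξ, b) stacking above: ω = 0, b = 0 and slacks slightly above 1, so every constraint holds strictly:

# strictly feasible start: y_i*(0 + 0) = 0 > 1 - 1.1, and xi_i = 1.1 > 0
p_init = c(rep(0,7), rep(1.1,n), 0)
opt = constrOptim(theta=p_init, f=f, grad=grad_f, ui=Ui, ci=Ci)
opt$par[1:7]   # estimated omega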
One can recognize, in the separable case but also in the non-separable case, a classic quadratic program

$$\min_{z\in\mathbb{R}^d}\left\{\frac{1}{2}z^\top Dz-d^\top z\right\}\quad\text{subject to}\quad Az\ge b.$$
library(quadprog)
eps = 5e-4
y = myocarde[,"PRONO"]*2-1
X = as.matrix(cbind(1, myocarde[,1:7]))
n = length(y)
D = diag(n+7+1)
diag(D)[8+0:n] = 0   # no quadratic term on the slacks and the intercept
# linear term: -C on the slacks, since solve.QP minimizes (1/2)z'Dz - d'z
d = matrix(c(rep(0,7), rep(-C,n), 0), nrow=n+7+1)
A = Ui
b = Ci
Classification : Support Vector Machine (SVM) - Primal Problem
sol = solve.QP(D+eps*diag(n+7+1), d, t(A), b, meq=1)
qpsol = sol$solution
(omega = qpsol[1:7])
[1] -0.106642005446 -0.002026198103 -0.022513312261 -0.018958578746
    -0.023105767847 -0.018958578746 -1.080638988521
(b = qpsol[n+7+1])
[1] 997.6289927

Given an observation x, the prediction is y = sign[ωᵀx + b].

y_pred = 2*((as.matrix(myocarde0[,1:7])%*%omega + b) > 0) - 1
Classification : Support Vector Machine (SVM) - Dual Problem
The Lagrangian of the separable problem can be written, introducing Lagrange multipliers α ∈ Rn, α ≥ 0, as

$$L(\omega,b,\alpha)=\frac{1}{2}\|\omega\|^2-\sum_{i=1}^n \alpha_i\big[y_i(\omega^\top x_i+b)-1\big]$$

Somehow, αi represents the influence of the observation (yi, xi). Consider the Dual Problem, with G = [Gij] where Gij = yiyjxjᵀxi :

$$\min_{\alpha\in\mathbb{R}^n}\left\{\frac{1}{2}\alpha^\top G\alpha-\mathbf{1}^\top\alpha\right\}\quad\text{subject to}\quad y^\top\alpha=0\ \text{and}\ \alpha\ge 0.$$

The Lagrangian of the non-separable problem can be written, introducing Lagrange multipliers α, β ∈ Rn, α, β ≥ 0,
and define the Lagrangian L(ω, b, ξ, α, β) as

$$\frac{1}{2}\|\omega\|^2+C\sum_{i=1}^n \xi_i-\sum_{i=1}^n \alpha_i\big[y_i(\omega^\top x_i+b)-1+\xi_i\big]-\sum_{i=1}^n \beta_i\xi_i$$

Somehow, αi represents the influence of the observation (yi, xi). The Dual Problem becomes, with G = [Gij] where Gij = yiyjxjᵀxi :

$$\min_{\alpha\in\mathbb{R}^n}\left\{\frac{1}{2}\alpha^\top G\alpha-\mathbf{1}^\top\alpha\right\}\quad\text{subject to}\quad y^\top\alpha=0,\ \alpha\ge 0\ \text{and}\ \alpha\le C.$$

As previously, one can also use quadratic programming
library(quadprog)
eps = 5e-4
y = myocarde[,"PRONO"]*2-1
X = as.matrix(cbind(1, myocarde[,1:7]))
n = length(y)
Q = sapply(1:n, function(i) y[i]*t(X)[,i])
D = t(Q)%*%Q
d = matrix(1, nrow=n)
A = rbind(y, diag(n), -diag(n))
C = .5
b = c(0, rep(0,n), rep(-C,n))
sol = solve.QP(D+eps*diag(n), d, t(A), b, meq=1, factorized=FALSE)
qpsol = sol$solution
The two problems are related in the sense that, for all x,

$$\omega^\top x+b=\sum_{i=1}^n \alpha_i y_i(x^\top x_i)+b$$

To recover the solution of the primal problem, ω = Σᵢ αᵢyᵢxᵢ, thus

omega = apply(qpsol*y*X, 2, sum)
omega
                       1        FRCAR        INCAR           INSYS
 0.000000000000000243907 0.0550138658 -0.092016323 0.3609571899422
                   PRDIA        PAPUL        PVENT           REPUL
-0.109401796528869235669 -0.0485213403 -0.066005864 0.0010093656567
More generally
svm.fit = function(X, y, C=NULL) {
  n.samples = nrow(X)
  n.features = ncol(X)
  # Gram matrix of the linear kernel
  K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
  for (i in 1:n.samples){
    for (j in 1:n.samples){
      K[i,j] = X[i,] %*% X[j,] }}
  Dmat = outer(y,y) * K
  Dmat = as.matrix(nearPD(Dmat)$mat)   # nearest positive definite matrix (Matrix::nearPD)
  dvec = rep(1, n.samples)
  Amat = rbind(y, diag(n.samples), -1*diag(n.samples))
  bvec = c(0, rep(0, n.samples), rep(-C, n.samples))
  res = solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)
  a = res$solution       # Lagrange multipliers alpha
  bomega = apply(a*y*X, 2, sum)
  return(bomega)
}
To relate this dual optimization problem to OLS, recall that y = xᵀω + ε, so that ŷ = xᵀω̂, where ω̂ = [XᵀX]−1Xᵀy. But one can also write

$$\widehat{y}=x^\top\widehat{\omega}=\sum_{i=1}^n \widehat{\alpha}_i\cdot x^\top x_i\quad\text{where}\quad \widehat{\alpha}=X[X^\top X]^{-1}\widehat{\omega},$$

or conversely, ω̂ = Xᵀα̂.
Classification : Support Vector Machine (SVM) - Kernel Approach
A positive kernel on X is a function K : X × X → R, symmetric, and such that for any n, ∀α1, · · · , αn and ∀x1, · · · , xn,

$$\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j\, k(x_i,x_j)\ge 0.$$

For example, the linear kernel is k(xi, xj) = xiᵀxj. That's what we've been using here, so far. One can also define the product kernel k(xi, xj) = κ(xi) · κ(xj), where κ is some function X → R.
Finally, the Gaussian kernel is k(xi, xj) = exp[−‖xi − xj‖²]. Since it is a function of ‖xi − xj‖, it is also called a radial kernel.

linear.kernel = function(x1, x2) {
  return (x1%*%x2)
}
svm.fit = function(X, y, FUN=linear.kernel, C=NULL) {
  n.samples = nrow(X)
  n.features = ncol(X)
  # Gram matrix for the chosen kernel FUN
  K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
  for (i in 1:n.samples){
    for (j in 1:n.samples){
      K[i,j] = FUN(X[i,], X[j,])
    }
  }
  Dmat = outer(y,y) * K
  Dmat = as.matrix(nearPD(Dmat)$mat)
  dvec = rep(1, n.samples)
  Amat = rbind(y, diag(n.samples), -1*diag(n.samples))
  bvec = c(0, rep(0, n.samples), rep(-C, n.samples))
  res = solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)
  a = res$solution
  bomega = apply(a*y*X, 2, sum)
  return(bomega)
}
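A minimal usage sketch (assuming the myocarde design above, with quadprog and Matrix loaded; the cost C = .5 matches the earlier snippets):

library(quadprog)   # solve.QP
library(Matrix)     # nearPD
X = as.matrix(cbind(1, myocarde[,1:7]))
y = myocarde[,"PRONO"]*2-1
svm.fit(X, y, FUN=linear.kernel, C=.5)   # returns the (b, omega) vector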
library(kernlab)
SVM2 = ksvm(y ~ x1 + x2, data=df0, C=2, kernel="vanilladot",
            prob.model=TRUE, type="C-svc")
pred_SVM2 = function(x,y){
  return(predict(SVM2, newdata=data.frame(x1=x, x2=y), type="probabilities")[,2])}
plot(df$x1, df$x2, pch=c(1,19)[1+(df$y=="1")],
     cex=1.5, xlab="", ylab="", xlim=c(0,1), ylim=c(0,1))
vu = seq(-.1, 1.1, length=251)
vv = outer(vu, vu, function(x,y) pred_SVM2(x,y))
contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5, col="red")
SVM3 = ksvm(y ~ x1 + x2, data=df0, C=2, kernel="rbfdot",
            prob.model=TRUE, type="C-svc")
pred_SVM3 = function(x,y){
  return(predict(SVM3, newdata=data.frame(x1=x, x2=y), type="probabilities")[,2])}
plot(df$x1, df$x2, pch=c(1,19)[1+(df$y=="1")],
     cex=1.5, xlab="", ylab="", xlim=c(0,1), ylim=c(0,1))
vu = seq(-.1, 1.1, length=251)
vv = outer(vu, vu, function(x,y) pred_SVM3(x,y))   # (the slide reused pred_SVM2 here)
contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5, col="red")
Classification : Classification Trees
library(rpart)
cart = rpart(PRONO~., data=myocarde)
library(rpart.plot)
prp(cart, type=2, extra=1)

A (binary) split is based on one specific variable, say xj, and a cutoff, say s. Then, there are two options:
• either xi,j ≤ s, then observation i goes on the left, in IL
• or xi,j > s, then observation i goes on the right, in IR

Thus, I = IL ∪ IR.
The Gini index for node I is defined as

$$G(I)=-\sum_{y\in\{0,1\}} p_y(1-p_y)$$

where py is the proportion of individuals of type y in the leaf,

$$G(I)=-\sum_{y\in\{0,1\}} \frac{n_{y,I}}{n_I}\left(1-\frac{n_{y,I}}{n_I}\right)$$

gini = function(y, classe){
  T = table(y, classe)
  nx = apply(T, 2, sum)
  n = sum(T)
  pxy = T/matrix(rep(nx, each=2), nrow=2)
  omega = matrix(rep(nx, each=2), nrow=2)/n
  g = -sum(omega*pxy*(1-pxy))
  return(g)}
-2*mean(myocarde$PRONO)*(1-mean(myocarde$PRONO))
[1] -0.4832375
gini(y=myocarde$PRONO, classe=myocarde$PRONO<Inf)
[1] -0.4832375
gini(y=myocarde$PRONO, classe=myocarde[,1]<=100)
[1] -0.4640415
If we split, define the index

$$G(I_L,I_R)=-\sum_{x\in\{L,R\}} \frac{n_{I_x}}{n_I}\sum_{y\in\{0,1\}} \frac{n_{y,I_x}}{n_{I_x}}\left(1-\frac{n_{y,I_x}}{n_{I_x}}\right)$$

The entropic measure is

$$E(I)=-\sum_{y\in\{0,1\}} \frac{n_{y,I}}{n_I}\log\left(\frac{n_{y,I}}{n_I}\right)$$

entropy = function(y, classe){
  T = table(y, classe)
  nx = apply(T, 2, sum)
  pxy = T/matrix(rep(nx, each=2), nrow=2)
  omega = matrix(rep(nx, each=2), nrow=2)/sum(T)
  # same sign convention as gini() above: the function returns -E
  g = sum(omega*pxy*log(pxy))
  return(g)}
mat_gini = mat_v = matrix(NA, 7, 101)
for(v in 1:7){
  variable = myocarde[,v]
  v_seuil = seq(quantile(myocarde[,v], 6/length(myocarde[,v])),
                quantile(myocarde[,v], 1-6/length(myocarde[,v])), length=101)
  mat_v[v,] = v_seuil
  for(i in 1:101){
    CLASSE = variable<=v_seuil[i]
    mat_gini[v,i] = gini(y=myocarde$PRONO, classe=CLASSE)}}
-(gini(y=myocarde$PRONO, classe=(myocarde[,3]<19)) -
  gini(y=myocarde$PRONO, classe=(myocarde[,3]<Inf))) /
  gini(y=myocarde$PRONO, classe=(myocarde[,3]<Inf))
[1] 0.5862131
idx = which(myocarde$INSYS<19)
mat_gini = mat_v = matrix(NA, 7, 101)
for(v in 1:7){
  variable = myocarde[idx,v]
  v_seuil = seq(quantile(myocarde[idx,v], 7/length(myocarde[idx,v])),
                quantile(myocarde[idx,v], 1-7/length(myocarde[idx,v])), length=101)
  mat_v[v,] = v_seuil
  for(i in 1:101){
    CLASSE = variable<=v_seuil[i]
    mat_gini[v,i] = gini(y=myocarde$PRONO[idx], classe=CLASSE)}}
par(mfrow=c(3,2))
for(v in 2:7){
  plot(mat_v[v,], mat_gini[v,])
}
idx = which(myocarde$INSYS>=19)
mat_gini = mat_v = matrix(NA, 7, 101)
for(v in 1:7){
  variable = myocarde[idx,v]
  v_seuil = seq(quantile(myocarde[idx,v], 6/length(myocarde[idx,v])),
                quantile(myocarde[idx,v], 1-6/length(myocarde[idx,v])), length=101)
  mat_v[v,] = v_seuil
  for(i in 1:101){
    CLASSE = variable<=v_seuil[i]
    mat_gini[v,i] = gini(y=myocarde$PRONO[idx], classe=CLASSE)}}
par(mfrow=c(3,2))
for(v in 2:7){
  plot(mat_v[v,], mat_gini[v,])
}
@freakonometrics freakonometrics freakonometrics.hypotheses.org 171
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling

More Related Content

What's hot (20)

Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Side 2019, part 1
Side 2019, part 1Side 2019, part 1
Side 2019, part 1
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
 
Side 2019 #7
Side 2019 #7Side 2019 #7
Side 2019 #7
 
Side 2019 #12
Side 2019 #12Side 2019 #12
Side 2019 #12
 
Side 2019 #9
Side 2019 #9Side 2019 #9
Side 2019 #9
 
Side 2019 #6
Side 2019 #6Side 2019 #6
Side 2019 #6
 
Reinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and FinanceReinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and Finance
 
Side 2019 #3
Side 2019 #3Side 2019 #3
Side 2019 #3
 
Side 2019 #10
Side 2019 #10Side 2019 #10
Side 2019 #10
 
Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4
 
Slides econ-lm
Slides econ-lmSlides econ-lm
Slides econ-lm
 
transformations and nonparametric inference
transformations and nonparametric inferencetransformations and nonparametric inference
transformations and nonparametric inference
 
Sildes buenos aires
Sildes buenos airesSildes buenos aires
Sildes buenos aires
 
Graduate Econometrics Course, part 4, 2017
Graduate Econometrics Course, part 4, 2017Graduate Econometrics Course, part 4, 2017
Graduate Econometrics Course, part 4, 2017
 
Slides univ-van-amsterdam
Slides univ-van-amsterdamSlides univ-van-amsterdam
Slides univ-van-amsterdam
 
Slides networks-2017-2
Slides networks-2017-2Slides networks-2017-2
Slides networks-2017-2
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
Slides simplexe
Slides simplexeSlides simplexe
Slides simplexe
 

Similar to Hands-On Algorithms for Predictive Modeling

Slides barcelona Machine Learning
Slides barcelona Machine LearningSlides barcelona Machine Learning
Slides barcelona Machine LearningArthur Charpentier
 
Tensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsTensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsAlexander Litvinenko
 
Interpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptxInterpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptxGairuzazmiMGhani
 
Using R Tool for Probability and Statistics
Using R Tool for Probability and Statistics Using R Tool for Probability and Statistics
Using R Tool for Probability and Statistics nazlitemu
 
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920Karl Rudeen
 
Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Asma Ben Slimene
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Asma Ben Slimene
 
A series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyA series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyFrank Nielsen
 
ABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space modelsABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space modelsUmberto Picchini
 
Introduction
IntroductionIntroduction
Introductionbutest
 
SIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsSIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsJagadeeswaran Rathinavel
 
A nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaA nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaAlexander Litvinenko
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big DataChristian Robert
 

Similar to Hands-On Algorithms for Predictive Modeling (20)

Slides barcelona Machine Learning
Slides barcelona Machine LearningSlides barcelona Machine Learning
Slides barcelona Machine Learning
 
Slides ensae-2016-9
Slides ensae-2016-9Slides ensae-2016-9
Slides ensae-2016-9
 
Tensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsTensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEs
 
Slides ensae 9
Slides ensae 9Slides ensae 9
Slides ensae 9
 
Slides ACTINFO 2016
Slides ACTINFO 2016Slides ACTINFO 2016
Slides ACTINFO 2016
 
Interpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptxInterpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptx
 
Using R Tool for Probability and Statistics
Using R Tool for Probability and Statistics Using R Tool for Probability and Statistics
Using R Tool for Probability and Statistics
 
Classification
ClassificationClassification
Classification
 
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
fb69b412-97cb-4e8d-8a28-574c09557d35-160618025920
 
Project Paper
Project PaperProject Paper
Project Paper
 
Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Slides ub-7
Slides ub-7Slides ub-7
Slides ub-7
 
A series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyA series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropy
 
ABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space modelsABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space models
 
Introduction
IntroductionIntroduction
Introduction
 
SIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsSIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithms
 
A nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaA nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formula
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 

More from Arthur Charpentier

More from Arthur Charpentier (12)

Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
ACT6100 introduction
ACT6100 introductionACT6100 introduction
ACT6100 introduction
 
Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)
 
Control epidemics
Control epidemics Control epidemics
Control epidemics
 
STT5100 Automne 2020, introduction
STT5100 Automne 2020, introductionSTT5100 Automne 2020, introduction
STT5100 Automne 2020, introduction
 
Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
Optimal Control and COVID-19
Optimal Control and COVID-19Optimal Control and COVID-19
Optimal Control and COVID-19
 
Slides OICA 2020
Slides OICA 2020Slides OICA 2020
Slides OICA 2020
 
Lausanne 2019 #3
Lausanne 2019 #3Lausanne 2019 #3
Lausanne 2019 #3
 
Side 2019 #11
Side 2019 #11Side 2019 #11
Side 2019 #11
 
Side 2019 #8
Side 2019 #8Side 2019 #8
Side 2019 #8
 
Mutualisation et Segmentation
Mutualisation et SegmentationMutualisation et Segmentation
Mutualisation et Segmentation
 

Recently uploaded

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 

Recently uploaded (20)

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 

Hands-On Algorithms for Predictive Modeling

  • 1. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Hands-On Algorithms for Predictive Modeling* (Penalized) Logistic Regression, Boosting, Neural Nets, SVM, Trees, Forests Arthur Charpentier (Universit´e du Qu´ebec `a Montr´eal, Canada) Third International Congress on Actuarial Science and Quantitative Finance Manizales, Colombia, June 2019 @freakonometrics freakonometrics freakonometrics.hypotheses.org 1
  • 2. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Modeling a 0/1 random variable Myocardial infarction of patients admited in E.R. ◦ heart rate (FRCAR) : x1 ◦ heart index (INCAR) : x2 ◦ stroke index (INSYS) : x3 ◦ diastolic pressure (PRDIA) : x4 ◦ pulmonary arterial pressure (PAPUL) : x5 ◦ ventricular pressure (PVENT) : x6 ◦ lung resistance (REPUL) : x7 ◦ death or survival (PRONO) : y dataset {(xi, yi)} where i = 1, . . . , n, n = 73. 1 > myocarde=read.table("http:// freakonometrics .free.fr/myocarde.csv", head=TRUE ,sep=";") @freakonometrics freakonometrics freakonometrics.hypotheses.org 2
  • 3. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Modeling a 0/1 random variable 1 > myocarde = read.table("http:// freakonometrics .free.fr/myocarde.csv" ,head=TRUE , sep=";") 2 > head(myocarde) 3 FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL PRONO 4 1 90 1.71 19.0 16 19.5 16.0 912 SURVIE 5 2 90 1.68 18.7 24 31.0 14.0 1476 DECES 6 3 120 1.40 11.7 23 29.0 8.0 1657 DECES 7 4 82 1.79 21.8 14 17.5 10.0 782 SURVIE 8 5 80 1.58 19.7 21 28.0 18.5 1418 DECES 9 6 80 1.13 14.1 18 23.5 9.0 1664 DECES 10 > myocarde$PRONO = (myocarde$PRONO =="SURVIE")*1 @freakonometrics freakonometrics freakonometrics.hypotheses.org 3
  • 4. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Classification : Logistic Regression (and GLM) Simulated data, y ∈ {0, 1} P(Y = 1|X = x) = exp[xTβ] 1 + exp[xTβ] Inference using maximum likelihood techniques β = argmin n i=1 log[P(Y = 1|X = x)] and the score model is then s(x) = exp[xTβ] 1 + exp[xTβ] 1 x1 = c(.4 ,.55 ,.65 ,.9 ,.1 ,.35 ,.5 ,.15 ,.2 ,.85) 2 x2 = c(.85 ,.95 ,.8 ,.87 ,.5 ,.55 ,.5 ,.2 ,.1 ,.3) 3 y = c(1,1,1,1,1,0,0,1,0,0) 4 df = data.frame(x1=x1 ,x2=x2 ,y=as.factor(y)) q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 0.00.20.40.60.81.0 q q q q q q q q q q @freakonometrics freakonometrics freakonometrics.hypotheses.org 4
  • 5. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Classification : Logistic Regression (and GLM) The logistic regression is based on the assumption that Y |X = x ∼ B(px), where px = exp[xTβ] 1 + exp[xTβ] = 1 + exp[−xT β] −1 The goal is to estimate parameter β. Recall that the heuristics for the use of that function for the probability is that log[odds(Y = 1)] = log P[Y = 1] P[Y = 0] = xT β • Maximum of the (log)-likelihood function The log-likelihood is here log L = n i=1 yi log pi + (1 − yi) log(1 − pi) where pi = (1 + exp[−xi Tβ])−1. @freakonometrics freakonometrics freakonometrics.hypotheses.org 5
  • 6. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Mathematical Statistics Courses in a Nutshell Consider observations {y1, · · · , yn} from iid random variables Yi ∼ Fθ (with “density” fθ). Likelihood is L(θ ; y) → n i=1 fθ(yi) Maximum likelihood estimate is θ mle ∈ arg max θ∈Θ {L(θ; y)} or θ mle ∈ arg max θ∈Θ {log L(θ; y)} Fisher (1912, On an absolute criterion for fitting frequency curves). @freakonometrics freakonometrics freakonometrics.hypotheses.org 6
  • 7. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Mathematical Statistics Courses in a Nutshell Under standard assumptions (Identification of the model, Compactness, Continuity and Dominance), the maximum likelihood estimator is consistent θ mle P −→ θ. With additional assumptions, it can be shown that the maximum likelihood estimator converges to a normal distribution √ n θ mle − θ L −→N(0, I−1 ) where I is Fisher information matrix (i.e. θ mle is asymptotically efficient). Eg. if Y ∼ N(θ, σ2), log L ∝ − n i=1 yi − θ σ 2 , and θmle = y (see also method of moments). max log L ⇐⇒ min n i=1 ε2 i = least squares @freakonometrics freakonometrics freakonometrics.hypotheses.org 7
• 8. Classification : Logistic Regression
• Numerical Optimization
Using optim() might not be a great idea. The BFGS algorithm (Broyden, Fletcher, Goldfarb & Shanno) is based on Newton's algorithm. To find a zero of a function $g$, i.e. $g(x_0) = 0$, use Taylor's expansion,
$$0 = g(x_0) \approx g(x) + g'(x)(x_0 - x)$$
i.e. (with $x \to x_{\text{old}}$ and $x_0 \to x_{\text{new}}$)
$$x_{\text{new}} = x_{\text{old}} - \frac{g(x_{\text{old}})}{g'(x_{\text{old}})}$$
from some starting point, or, in higher dimension,
$$x_{\text{new}} = x_{\text{old}} - \nabla g(x_{\text{old}})^{-1} \cdot g(x_{\text{old}})$$
from some starting point. Here $g$ is the gradient of the log-likelihood,
$$\beta_{\text{new}} = \beta_{\text{old}} - \left(\frac{\partial^2 \log\mathcal{L}(\beta_{\text{old}})}{\partial\beta\,\partial\beta^{\mathsf{T}}}\right)^{-1} \cdot \frac{\partial \log\mathcal{L}(\beta_{\text{old}})}{\partial\beta}$$
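A one-dimensional sketch of that Newton iteration, on an illustrative function (not from the slides):

g = function(x) x^3 - 2        # we look for the root 2^(1/3)
gprime = function(x) 3 * x^2
x = 2                          # starting point
for (s in 1:10) x = x - g(x) / gprime(x)
x                              # approximately 1.259921 = 2^(1/3)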
• 9. Classification : Logistic Regression
Numerically, since only $g$ is given, consider some small $h > 0$, approximate the derivative by a central difference, and set
$$x_{\text{new}} = x_{\text{old}} - \frac{2h \cdot g(x_{\text{old}})}{g(x_{\text{old}} + h) - g(x_{\text{old}} - h)}$$
from some starting point.

y = myocarde$PRONO
X = cbind(1, as.matrix(myocarde[, 1:7]))
negLogLik = function(beta){
  -sum(-y*log(1 + exp(-(X %*% beta))) - (1 - y)*log(1 + exp(X %*% beta)))
}
beta_init = lm(PRONO ~ ., data = myocarde)$coefficients
logistic_opt = optim(par = beta_init, negLogLik, hessian = TRUE,
                     method = "BFGS", control = list(abstol = 1e-9))
• 10. Classification : Logistic Regression
Here, we obtain

logistic_opt$par
 (Intercept)        FRCAR        INCAR        INSYS
 1.656926397  0.045234029 -2.119441743  0.204023835
       PRDIA        PAPUL        PVENT        REPUL
-0.102420095  0.165823647 -0.081047525 -0.005992238

simu = function(i){
  logistic_opt_i = optim(par = rnorm(8, 0, 3)*beta_init,
                         negLogLik, hessian = TRUE, method = "BFGS",
                         control = list(abstol = 1e-9))
  logistic_opt_i$par[2:3]
}
v_beta = t(Vectorize(simu)(1:1000))
• 11. Classification : Logistic Regression

plot(v_beta)
par(mfrow = c(1, 2))
hist(v_beta[, 1], xlab = names(myocarde)[1])
hist(v_beta[, 2], xlab = names(myocarde)[2])
• 12. Classification : Logistic Regression
• Fisher Algorithm
We want to use Newton's algorithm
$$\beta_{\text{new}} = \beta_{\text{old}} - \left(\frac{\partial^2 \log\mathcal{L}(\beta_{\text{old}})}{\partial\beta\,\partial\beta^{\mathsf{T}}}\right)^{-1} \cdot \frac{\partial \log\mathcal{L}(\beta_{\text{old}})}{\partial\beta}$$
but here, the gradient and the Hessian matrix have explicit expressions
$$\frac{\partial \log\mathcal{L}(\beta_{\text{old}})}{\partial\beta} = X^{\mathsf{T}}(y - p_{\text{old}}), \qquad \frac{\partial^2 \log\mathcal{L}(\beta_{\text{old}})}{\partial\beta\,\partial\beta^{\mathsf{T}}} = -X^{\mathsf{T}}\Delta_{\text{old}}X$$
Note that it is fast and robust (to the starting point)

Y = myocarde$PRONO
X = cbind(1, as.matrix(myocarde[, 1:7]))
colnames(X) = c("Inter", names(myocarde[, 1:7]))
beta = as.matrix(lm(Y ~ 0 + X)$coefficients, ncol = 1)
for(s in 1:9){
  pi = exp(X %*% beta[, s])/(1 + exp(X %*% beta[, s]))
  gradient = t(X) %*% (Y - pi)
• 13.
  omega = matrix(0, nrow(X), nrow(X)); diag(omega) = pi*(1 - pi)
  Hessian = -t(X) %*% omega %*% X
  beta = cbind(beta, beta[, s] - solve(Hessian) %*% gradient)}
beta[, 8:10]
                [,1]          [,2]          [,3]
XInter -10.187641685 -10.187641696 -10.187641696
XFRCAR   0.138178119   0.138178119   0.138178119
XINCAR  -5.862429035  -5.862429037  -5.862429037
XINSYS   0.717084018   0.717084018   0.717084018
XPRDIA  -0.073668171  -0.073668171  -0.073668171
XPAPUL   0.016756506   0.016756506   0.016756506
XPVENT  -0.106776012  -0.106776012  -0.106776012
XREPUL  -0.003154187  -0.003154187  -0.003154187
• 14. Iteratively reweighted least squares
We want to compute something like
$$\beta_{\text{new}} = (X^{\mathsf{T}}\Delta_{\text{old}}X)^{-1} X^{\mathsf{T}}\Delta_{\text{old}}z$$
(if we do substitute matrices in the analytical expressions) where
$$z = X\beta_{\text{old}} + \Delta_{\text{old}}^{-1}[y - p_{\text{old}}]$$
But actually, that is simply a standard weighted least-squares problem,
$$\beta_{\text{new}} = \text{argmin}\left\{(z - X\beta)^{\mathsf{T}}\Delta_{\text{old}}(z - X\beta)\right\}$$
(the weight matrix $\Delta_{\text{old}}$ being the inverse of $\text{Var}[z]$). The only problem here is that the weights $\Delta_{\text{old}}$ are functions of the unknown $\beta_{\text{old}}$. But actually, if we keep iterating, we should be able to solve it: given $\beta$ we get the weights, and with the weights we can use weighted OLS to get an updated $\beta$. That is the idea of iteratively reweighted least squares.

df = myocarde
beta_init = lm(PRONO ~ ., data = df)$coefficients
X = cbind(1, as.matrix(myocarde[, 1:7]))
beta = beta_init
for(s in 1:1000){
• 15.
  p = exp(X %*% beta)/(1 + exp(X %*% beta))
  omega = diag(nrow(df))
  diag(omega) = p*(1 - p)
  df$Z = X %*% beta + solve(omega) %*% (df$PRONO - p)
  beta = lm(Z ~ ., data = df[, -8], weights = diag(omega))$coefficients
}
beta
  (Intercept)         FRCAR         INCAR         INSYS         PRDIA
-10.187641696   0.138178119  -5.862429037   0.717084018  -0.073668171
        PAPUL         PVENT         REPUL
  0.016756506  -0.106776012  -0.003154187
• 16. Classification : Logistic Regression

summary(lm(Z ~ ., data = df[, -8], weights = diag(omega)))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.187642  10.668138  -0.955    0.343
FRCAR         0.138178   0.102340   1.350    0.182
INCAR        -5.862429   6.052560  -0.969    0.336
INSYS         0.717084   0.503527   1.424    0.159
PRDIA        -0.073668   0.261549  -0.282    0.779
PAPUL         0.016757   0.306666   0.055    0.957
PVENT        -0.106776   0.099145  -1.077    0.286
REPUL        -0.003154   0.004386  -0.719    0.475
• 17. Classification : Logistic Regression

summary(glm(PRONO ~ ., data = myocarde, family = binomial(link = "logit")))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.187642  11.895227  -0.856    0.392
FRCAR         0.138178   0.114112   1.211    0.226
INCAR        -5.862429   6.748785  -0.869    0.385
INSYS         0.717084   0.561445   1.277    0.202
PRDIA        -0.073668   0.291636  -0.253    0.801
PAPUL         0.016757   0.341942   0.049    0.961
PVENT        -0.106776   0.110550  -0.966    0.334
REPUL        -0.003154   0.004891  -0.645    0.519
• 18. Classification : Logistic Regression with Splines (and GAM)
To get home-made linear splines, use $\beta_1 x + \beta_2 (x - s)_+$

pos = function(x, s) (x - s)*(x >= s)
reg = glm(PRONO ~ INSYS + pos(INSYS, 15) + pos(INSYS, 25),
          data = myocarde, family = binomial)
summary(reg)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      -0.1109      3.278  -0.034    0.973
INSYS            -0.1751      0.252  -0.693    0.488
pos(INSYS, 15)    0.7900      0.374   2.109    0.034 *
pos(INSYS, 25)   -0.5797      0.290  -1.997    0.045 *
• 19. Classification : Logistic Regression with Splines
We can define spline functions with support $(5, 55)$ and with knots $\{15, 25\}$

library(splines)
clr6 = c("#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e", "#e6ab02")
x = seq(0, 60, by = .25)
B = bs(x, knots = c(15, 25), Boundary.knots = c(5, 55), degree = 1)
matplot(x, B, type = "l", lty = 1, lwd = 2, col = clr6)
• 20. Classification : Logistic Regression with Splines

reg = glm(PRONO ~ bs(INSYS, knots = c(15, 25),
                     Boundary.knots = c(5, 55), degree = 1),
          data = myocarde, family = binomial)
summary(reg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9863      2.055  -0.480    0.631
bs(INSYS)1   -1.7507      2.526  -0.693    0.488
bs(INSYS)2    4.3989      2.061   2.133    0.032 *
bs(INSYS)3    5.4572      5.414   1.008    0.313
• 21. Classification : Logistic Regression with Splines
For quadratic splines, consider a decomposition on $x$, $x^2$, $(x - s_1)_+^2$, $(x - s_2)_+^2$

pos2 = function(x, s) (x - s)^2*(x >= s)
reg = glm(PRONO ~ poly(INSYS, 2) + pos2(INSYS, 15) + pos2(INSYS, 25),
          data = myocarde, family = binomial)
summary(reg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  29.9842     15.236   1.968    0.049 *
poly(2)1    408.7851    202.419   2.019    0.043 *
poly(2)2    199.1628    101.589   1.960    0.049 *
pos2(15)     -0.2281      0.126  -1.805    0.071 .
pos2(25)      0.0439      0.080   0.545    0.585
• 22. Classification : Logistic Regression with Splines

x = seq(0, 60, by = .25)
B = bs(x, knots = c(15, 25), Boundary.knots = c(5, 55), degree = 2)
matplot(x, B, type = "l", xlab = "INSYS", col = clr6)
• 23.

reg = glm(PRONO ~ bs(INSYS, knots = c(15, 25),
                     Boundary.knots = c(5, 55), degree = 2),
          data = myocarde, family = binomial)
summary(reg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    7.186      5.261   1.366    0.172
bs(INSYS)1   -14.656      7.923  -1.850    0.064 .
bs(INSYS)2    -5.692      4.638  -1.227    0.219
bs(INSYS)3    -2.454      8.780  -0.279    0.779
bs(INSYS)4     6.429     41.675   0.154    0.877

Finally, for cubic splines, use $x$, $x^2$, $x^3$, $(x - s_1)_+^3$, $(x - s_2)_+^3$, etc.
• 24.

B = bs(x, knots = c(15, 25), Boundary.knots = c(5, 55), degree = 3)
matplot(x, B, type = "l", lwd = 2, col = clr6, lty = 1, ylim = c(-.2, 1.2))
abline(v = c(5, 15, 25, 55), lty = 2)

reg = glm(PRONO ~ 1 + bs(INSYS, degree = 1, df = 4),
          data = myocarde, family = binomial)
attr(reg$terms, "predvars")[[3]]
bs(INSYS, degree = 1L, knots = c(15.8, 21.4, 27.15),
   Boundary.knots = c(8.7, 54), intercept = FALSE)
quantile(myocarde$INSYS, (0:4)/4)
   0%   25%   50%   75%  100%
 8.70 15.80 21.40 27.15 54.00
• 25.

B = bs(x, degree = 1, df = 4)
B = cbind(1, B)
y = B %*% coefficients(reg)
plot(x, y, type = "l", col = "red", lwd = 2)
abline(v = quantile(myocarde$INSYS, (0:4)/4), lty = 2)
• 26. Classification : Logistic Regression with Splines

reg = glm(y ~ bs(x1, degree = 1, df = 3) + bs(x2, degree = 1, df = 3),
          data = df, family = binomial(link = "logit"))
u = seq(0, 1, length = 101)
p = function(x, y) predict.glm(reg, newdata = data.frame(x1 = x, x2 = y),
                               type = "response")
v = outer(u, u, p)
image(u, u, v, xlab = "Variable 1", ylab = "Variable 2",
      col = clr10, breaks = (0:10)/10)
points(df$x1, df$x2, pch = 19, cex = 1.5, col = "white")
points(df$x1, df$x2, pch = c(1, 19)[1 + (df$y == "1")], cex = 1.5)
contour(u, u, v, levels = .5, add = TRUE)
• 27. Classification : Regression with Kernels and k-NN
We no longer use the logit/probit framework here; this approach is purely non-parametric. The moving regressogram is
$$\widehat{m}(x) = \frac{\sum_{i=1}^n y_i \cdot \mathbf{1}(x_i \in [x \pm h])}{\sum_{i=1}^n \mathbf{1}(x_i \in [x \pm h])}$$
Consider
$$\widetilde{m}(x) = \frac{\sum_{i=1}^n y_i \cdot k_h(x - x_i)}{\sum_{i=1}^n k_h(x - x_i)}$$
where (classically), for some kernel $k$, $k_h(x) = k\left(\frac{x}{h}\right)$.
(Figure: a kernel $k$ plotted on $[-2, 2]$.)
• 28. Classification : Regression with Kernels and k-NN

mean_x = function(x, bw){
  w = dnorm((myocarde$INSYS - x)/bw, mean = 0, sd = 1)
  weighted.mean(myocarde$PRONO, w)}
u = seq(5, 55, length = 201)
v = Vectorize(function(x) mean_x(x, 3))(u)
plot(u, v, ylim = 0:1, type = "l", col = "red")
points(myocarde$INSYS, myocarde$PRONO, pch = 19)

v = Vectorize(function(x) mean_x(x, 2))(u)
plot(u, v, ylim = 0:1, type = "l", col = "red")
points(myocarde$INSYS, myocarde$PRONO, pch = 19)
• 29. Classification : Regression with Kernels and k-NN
One can also use the ksmooth() function of R,

reg = ksmooth(myocarde$INSYS, myocarde$PRONO, "normal", bandwidth = 2*exp(1))
plot(reg$x, reg$y, ylim = 0:1, type = "l", col = "red", lwd = 2,
     xlab = "INSYS", ylab = "")
points(myocarde$INSYS, myocarde$PRONO, pch = 19)

Classical bias / variance tradeoff (below, the distributions of $\widehat{m}_h(15)$ and $\widehat{m}_h(25)$).
(Figures: fitted curve against INSYS; two density plots.)
• 30. Classification : Regression with Kernels and k-NN

u = seq(0, 1, length = 101)
p = function(x, y){
  bw1 = .2; bw2 = .2
  w = dnorm((df$x1 - x)/bw1, mean = 0, sd = 1)*
      dnorm((df$x2 - y)/bw2, mean = 0, sd = 1)
  weighted.mean(df$y == "1", w)
}
v = outer(u, u, Vectorize(p))
image(u, u, v, col = clr10, breaks = (0:10)/10)
points(df$x1, df$x2, pch = 19, cex = 1.5, col = "white")
points(df$x1, df$x2, pch = c(1, 19)[1 + (df$y == "1")], cex = 1.5)
contour(u, u, v, levels = .5, add = TRUE)
• 31. Classification : Regression with Kernels and k-NN
One can also use k-nearest neighbors,
$$\widetilde{m}_k(x) = \frac{1}{n}\sum_{i=1}^n \omega_{i,k}(x)\,y_i \quad\text{where}\quad \omega_{i,k}(x) = \frac{n}{k}\,\mathbf{1}(i \in \mathcal{I}_x^k)$$
with $\mathcal{I}_x^k = \{i : x_i \text{ is one of the } k \text{ nearest observations to } x\}$.
Classically, use the Mahalanobis distance

Sigma = var(myocarde[, 1:7])
Sigma_Inv = solve(Sigma)
d2_mahalanobis = function(x, y, Sinv){ as.numeric(x - y) %*% Sinv %*% t(x - y) }
k_closest = function(i, k){
  vect_dist = function(j) d2_mahalanobis(myocarde[i, 1:7], myocarde[j, 1:7], Sigma_Inv)
  vect = Vectorize(vect_dist)((1:nrow(myocarde)))
  which(rank(vect) <= k)}   # indices of the k closest observations
• 32.

k_majority = function(k){
  Y = rep(NA, nrow(myocarde))
  for(i in 1:length(Y)) Y[i] = sort(myocarde$PRONO[k_closest(i, k)])[(k + 1)/2]
  return(Y)}

k_mean = function(k){
  Y = rep(NA, nrow(myocarde))
  for(i in 1:length(Y)) Y[i] = mean(myocarde$PRONO[k_closest(i, k)])
  return(Y)}

Sigma_Inv = solve(var(df[, c("x1", "x2")]))
u = seq(0, 1, length = 51)
p = function(x, y){
  k = 6
  vect_dist = function(j) d2_mahalanobis(c(x, y), df[j, c("x1", "x2")], Sigma_Inv)
  vect = Vectorize(vect_dist)(1:nrow(df))
  idx = which(rank(vect) <= k)
• 33.
  return(mean((df$y == 1)[idx]))}

v = outer(u, u, Vectorize(p))
image(u, u, v, col = clr10, breaks = (0:10)/10)
points(df$x1, df$x2, pch = 19, cex = 1.5, col = "white")
points(df$x1, df$x2, pch = c(1, 19)[1 + (df$y == "1")], cex = 1.5)
contour(u, u, v, levels = .5, add = TRUE)
• 34. Goodness of Fit: ROC Curve
Confusion matrix. Given a sample $(y_i, \boldsymbol{x}_i)$ and a model $m$, the confusion matrix is the contingency table, with dimensions observed $y_i \in \{0,1\}$ and predicted $\widehat{y}_i \in \{0,1\}$:

                   observed y = 0 (-)   observed y = 1 (+)
 predicted 0 (-)          TN                   FN
 predicted 1 (+)          FP                   TP

 FPR = FP / (TN + FP),  TPR = TP / (FN + TP)

Classical measures are the true positive rate (TPR), or sensitivity; the false positive rate (FPR), or fall-out; and the true negative rate (TNR), or specificity, with TNR = 1 - FPR, among others (see wikipedia).

ROCR::performance(prediction(Score, Y), "tpr", "fpr")
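A minimal sketch of these quantities on the myocarde data (the .5 threshold and the object names S, Yhat, tab are illustrative):

reg = glm(PRONO ~ ., data = myocarde, family = binomial)
S = predict(reg, type = "response")          # scores
Yhat = (S > .5)*1
(tab = table(Predicted = Yhat, Observed = myocarde$PRONO))
FPR = tab["1", "0"] / sum(tab[, "0"])        # FP / (TN + FP)
TPR = tab["1", "1"] / sum(tab[, "1"])        # TP / (FN + TP)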
• 35. Goodness of Fit: ROC Curve
ROC (Receiver Operating Characteristic) Curve. Assume that $m_t$ is defined from a score function $s$, with $m_t(\boldsymbol{x}) = \mathbf{1}(s(\boldsymbol{x}) > t)$ for some threshold $t$. The ROC curve is the curve $(\text{FPR}_t, \text{TPR}_t)$ obtained from the confusion matrices of the $m_t$'s.
Example: $n = 100$ individuals, 50 with $y_i = 0$ and 50 with $y_i = 1$ (well balanced); each cell of the confusion matrix is about 25, so FPR $\sim 0.5$ and TPR $\sim 0.5$.
(Figure: the corresponding point on the ROC curve.)
• 36. Goodness of Fit: ROC Curve
29 deaths ($y = 0$) and 42 survivals ($y = 1$), with $\widehat{y} = 1$ if $\mathbb{P}[Y = 1|X] > 0\%$, and $0$ otherwise. Threshold 0%:

                 y = 0   y = 1
 predicted 0        0       0
 predicted 1       29      42

 FPR = 29/29 = 100%,  TPR = 42/42 = 100%
• 37. Goodness of Fit: ROC Curve
29 deaths ($y = 0$) and 42 survivals ($y = 1$), with $\widehat{y} = 1$ if $\mathbb{P}[Y = 1|X] > 15\%$, and $0$ otherwise. Threshold 15%:

                 y = 0   y = 1
 predicted 0       17       2
 predicted 1       12      40

 FPR = 12/29 ~ 41.4%,  TPR = 40/42 ~ 95.2%
• 38. Goodness of Fit: ROC Curve
29 deaths ($y = 0$) and 42 survivals ($y = 1$), with $\widehat{y} = 1$ if $\mathbb{P}[Y = 1|X] > 50\%$, and $0$ otherwise. Threshold 50%:

                 y = 0   y = 1
 predicted 0       25       3
 predicted 1        4      39

 FPR = 4/29 ~ 13.8%,  TPR = 39/42 ~ 92.9%
• 39. Goodness of Fit: ROC Curve
29 deaths ($y = 0$) and 42 survivals ($y = 1$), with $\widehat{y} = 1$ if $\mathbb{P}[Y = 1|X] > 85\%$, and $0$ otherwise. Threshold 85%:

                 y = 0   y = 1
 predicted 0       28      13
 predicted 1        1      29

 FPR = 1/29 ~ 3.4%,  TPR = 29/42 ~ 69.0%
• 40. Goodness of Fit: ROC Curve
29 deaths ($y = 0$) and 42 survivals ($y = 1$), with $\widehat{y} = 1$ if $\mathbb{P}[Y = 1|X] > 100\%$, and $0$ otherwise. Threshold 100%:

                 y = 0   y = 1
 predicted 0       29      42
 predicted 1        0       0

 FPR = 0/29 = 0.0%,  TPR = 0/42 = 0.0%
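The whole curve can be traced by sweeping the threshold; a sketch, reusing the scores S from the earlier snippet (grid size and names are illustrative):

roc_point = function(t){
  Yhat = (S > t)*1
  c(FPR = sum(Yhat == 1 & myocarde$PRONO == 0)/sum(myocarde$PRONO == 0),
    TPR = sum(Yhat == 1 & myocarde$PRONO == 1)/sum(myocarde$PRONO == 1))
}
roc = t(sapply(seq(0, 1, by = .01), roc_point))
plot(roc[, "FPR"], roc[, "TPR"], type = "s", xlab = "FPR", ylab = "TPR")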
• 41. Goodness of Fit
Be careful of overfit... It is important to use training / validation samples
• classification tree
• boosting classifier
• 42. Goodness of Fit: ROC Curve
ROC (Receiver Operating Characteristic) Curve. Assume that $m_t$ is defined from a score function $s$, with $m_t(\boldsymbol{x}) = \mathbf{1}(s(\boldsymbol{x}) > t)$ for some threshold $t$. The ROC curve is the curve $(\text{FPR}_t, \text{TPR}_t)$ obtained from the confusion matrices of the $m_t$'s.
(Figures: scores $s(\boldsymbol{x}_i)$ by observed class, and the resulting ROC curve.)
• 43. Goodness of Fit: ROC Curve
With categorical variables, we have a collection of points. We need to interpolate between those points to get a curve; see the convexification of ROC curves, e.g. Flach (2012, Machine Learning).
(Figures: false spam rate vs. true spam rate, before and after interpolation.)
• 44. Goodness of Fit
One can derive a confidence interval of the ROC curve using bootstrap techniques, pROC::ci.se(). Various measures can be used, see the hmeasure library, with the Gini index, the Area Under the Curve (AUC), or the Kolmogorov-Smirnov statistic
$$ks = \sup_{t\in\mathbb{R}} |F_1(t) - F_0(t)| = \sup_{t\in\mathbb{R}} \Big|\frac{1}{n_1}\sum_{i:y_i=1} \mathbf{1}_{s(\boldsymbol{x}_i)\le t} - \frac{1}{n_0}\sum_{i:y_i=0} \mathbf{1}_{s(\boldsymbol{x}_i)\le t}\Big|$$
(Figures: ROC curve with confidence band; empirical cdfs of the scores, survival vs. death.)
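A direct computation of that KS distance on a grid (a sketch reusing the scores S; the grid size is arbitrary):

F0 = ecdf(S[myocarde$PRONO == 0])
F1 = ecdf(S[myocarde$PRONO == 1])
t_grid = seq(0, 1, length = 501)
max(abs(F1(t_grid) - F0(t_grid)))   # grid approximation of the supremum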
• 45. Goodness of Fit
The Area Under the Curve, AUC, can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, see Swets, Dawes & Monahan (2000, Psychological Science Can Improve Diagnostic Decisions).
The Kappa statistic $\kappa$ compares an Observed Accuracy with an Expected Accuracy (random chance), see Landis & Koch (1977, The Measurement of Observer Agreement for Categorical Data).

               Y = 0     Y = 1
 Yhat = 0       TN        FN      TN+FN
 Yhat = 1       FP        TP      FP+TP
               TN+FP     FN+TP      n

See also the Observed and Random Confusion Tables

 Observed:                      Random:
           Y = 0  Y = 1                    Y = 0   Y = 1
 Yhat = 0     25      3   28    Yhat = 0   11.44   16.56   28
 Yhat = 1      4     39   43    Yhat = 1   17.56   25.44   43
              29     42   71               29      42      71
• 46. Goodness of Fit
From the observed and random confusion tables above,
$$\text{total accuracy} = \frac{TP + TN}{n} \sim 90.14\%$$
$$\text{random accuracy} = \frac{[TN + FP]\cdot[TN + FN] + [FN + TP]\cdot[FP + TP]}{n^2} \sim 51.93\%$$
(the agreement expected by chance, i.e. the products of the row and column margins, summed over the two classes), and
$$\kappa = \frac{\text{total accuracy} - \text{random accuracy}}{1 - \text{random accuracy}} \sim 79.48\%$$
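A minimal computation of $\kappa$ from the observed table on the slide (a sketch; rows are predicted, columns observed):

tab = matrix(c(25, 4, 3, 39), 2, 2)   # observed confusion table
n = sum(tab)
total_acc = sum(diag(tab))/n                                     # ~ 0.9014
random_acc = (sum(tab[1, ])*sum(tab[, 1]) +
              sum(tab[2, ])*sum(tab[, 2]))/n^2                   # ~ 0.5193
(kappa = (total_acc - random_acc)/(1 - random_acc))              # ~ 0.7948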
• 47. Penalized Estimation
Consider a parametric model, with true (unknown) parameter $\theta$, then
$$\text{mse}(\widehat{\theta}) = \mathbb{E}\big[(\widehat{\theta} - \theta)^2\big] = \underbrace{\mathbb{E}\big[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}])^2\big]}_{\text{variance}} + \underbrace{\big(\mathbb{E}[\widehat{\theta}] - \theta\big)^2}_{\text{bias}^2}$$
One can think of a shrinkage of an unbiased estimator. Let $\widetilde{\theta}$ denote an unbiased estimator of $\theta$. Then
$$\widehat{\theta} = \frac{\theta^2}{\theta^2 + \text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta} = \widetilde{\theta} - \underbrace{\frac{\text{mse}(\widetilde{\theta})}{\theta^2 + \text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta}}_{\text{penalty}}$$
satisfies $\text{mse}(\widehat{\theta}) \le \text{mse}(\widetilde{\theta})$.
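A quick Monte Carlo illustration of that inequality (a sketch with an oracle shrinkage weight, since the factor involves the true $\theta$; all values are illustrative):

theta = 2
tilde = rnorm(1e4, mean = theta, sd = 1)      # unbiased, mse = 1
hat = theta^2/(theta^2 + 1)*tilde             # shrunk towards 0
c(mse_tilde = mean((tilde - theta)^2),
  mse_hat   = mean((hat - theta)^2))          # ~ 1 vs ~ 0.8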
• 48. Linear Regression Shortcoming
Least Squares Estimator $\widehat{\beta} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y$
Unbiased Estimator $\mathbb{E}[\widehat{\beta}] = \beta$
Variance $\text{Var}[\widehat{\beta}] = \sigma^2(X^{\mathsf{T}}X)^{-1}$, which can be (extremely) large when $\det[X^{\mathsf{T}}X] \sim 0$.
$$X = \begin{pmatrix} 1 & -1 & 2 \\ 1 & 0 & 1 \\ 1 & 2 & -1 \\ 1 & 1 & 0 \end{pmatrix} \quad\text{then}\quad X^{\mathsf{T}}X = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 6 & -4 \\ 2 & -4 & 6 \end{pmatrix} \quad\text{while}\quad X^{\mathsf{T}}X + I = \begin{pmatrix} 5 & 2 & 2 \\ 2 & 7 & -4 \\ 2 & -4 & 7 \end{pmatrix}$$
with eigenvalues $\{10, 6, 0\}$ and $\{11, 7, 1\}$ respectively.
Ad-hoc strategy: use $X^{\mathsf{T}}X + \lambda I$.
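The eigenvalues on the slide can be checked directly (a short sketch):

X = matrix(c(1, -1, 2,  1, 0, 1,  1, 2, -1,  1, 1, 0), 4, 3, byrow = TRUE)
eigen(t(X) %*% X)$values             # 10, 6, 0 : singular matrix
eigen(t(X) %*% X + diag(3))$values   # 11, 7, 1 : invertible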
• 49. Linear Regression Shortcoming
Evolution of
$$(\beta_1, \beta_2) \mapsto \sum_{i=1}^n \big[y_i - (\beta_1 x_{1,i} + \beta_2 x_{2,i})\big]^2$$
when $\text{cor}(X_1, X_2) = r \in [0, 1]$, on top. Below, Ridge regression,
$$(\beta_1, \beta_2) \mapsto \sum_{i=1}^n \big[y_i - (\beta_1 x_{1,i} + \beta_2 x_{2,i})\big]^2 + \lambda(\beta_1^2 + \beta_2^2)$$
where $\lambda \in [0, \infty)$, when $\text{cor}(X_1, X_2) \sim 1$ (colinearity).
(Figures: contour plots of the two objective functions.)
• 50. Normalization : Euclidean ℓ2 vs. Mahalanobis
We want to penalize complicated models: if $\beta_k$ is "too small", we prefer to have $\beta_k = 0$.
Instead of $d(x, y) = \sqrt{(x - y)^{\mathsf{T}}(x - y)}$, use $d_{\Sigma}(x, y) = \sqrt{(x - y)^{\mathsf{T}}\Sigma^{-1}(x - y)}$.
(Figures: contour plots of the two distances in the $(\beta_1, \beta_2)$ plane.)
• 51. Ridge Regression
... like least squares, but it shrinks estimated coefficients towards 0,
$$\widehat{\beta}^{\text{ridge}}_{\lambda} = \text{argmin}\Big\{\sum_{i=1}^n (y_i - \boldsymbol{x}_i^{\mathsf{T}}\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\Big\} = \text{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}$$
$\lambda \ge 0$ is a tuning parameter. The constant is usually unpenalized; the true equation is
$$\widehat{\beta}^{\text{ridge}}_{\lambda} = \text{argmin}\Big\{\underbrace{\|y - (\beta_0 + X\beta)\|_2^2}_{\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}$$
• 52. Ridge Regression
$$\widehat{\beta}^{\text{ridge}}_{\lambda} = \text{argmin}\big\{\|y - (\beta_0 + X\beta)\|_2^2 + \lambda\|\beta\|_2^2\big\}$$
can be seen as a constrained optimization problem,
$$\widehat{\beta}^{\text{ridge}}_{\lambda} = \underset{\|\beta\|_2^2 \le h_{\lambda}}{\text{argmin}}\big\{\|y - (\beta_0 + X\beta)\|_2^2\big\}$$
Explicit solution $\widehat{\beta}_{\lambda} = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$.
If $\lambda \to 0$, $\widehat{\beta}^{\text{ridge}}_{0} = \widehat{\beta}^{\text{ols}}$; if $\lambda \to \infty$, $\widehat{\beta}^{\text{ridge}}_{\infty} = 0$.
(Figures: contours of the loss with the ℓ2 constraint ball.)
• 53. Ridge Regression
This penalty can be seen as rather unfair if components of $\boldsymbol{x}$ are not expressed on the same scale
• center: $\overline{x}_j = 0$, then $\widehat{\beta}_0 = \overline{y}$
• scale: $\boldsymbol{x}_j^{\mathsf{T}}\boldsymbol{x}_j = 1$
Then compute
$$\widehat{\beta}^{\text{ridge}}_{\lambda} = \text{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{\text{loss}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}$$
• 54. Ridge Regression
Observe that if $\boldsymbol{x}_{j_1} \perp \boldsymbol{x}_{j_2}$, then
$$\widehat{\beta}^{\text{ridge}}_{\lambda} = [1 + \lambda]^{-1}\widehat{\beta}^{\text{ols}}$$
which explains the relationship with shrinkage. But generally, it is not the case...
Theorem. There exists $\lambda$ such that $\text{mse}[\widehat{\beta}^{\text{ridge}}_{\lambda}] \le \text{mse}[\widehat{\beta}^{\text{ols}}]$.
• 55. Ridge Regression
$$\mathcal{L}_{\lambda}(\beta) = \sum_{i=1}^n (y_i - \beta_0 - \boldsymbol{x}_i^{\mathsf{T}}\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2$$
$$\frac{\partial \mathcal{L}_{\lambda}(\beta)}{\partial\beta} = -2X^{\mathsf{T}}y + 2(X^{\mathsf{T}}X + \lambda I)\beta, \qquad \frac{\partial^2 \mathcal{L}_{\lambda}(\beta)}{\partial\beta\,\partial\beta^{\mathsf{T}}} = 2(X^{\mathsf{T}}X + \lambda I)$$
where $X^{\mathsf{T}}X$ is a semi-positive definite matrix, and $\lambda I$ is a positive definite matrix, and
$$\widehat{\beta}_{\lambda} = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
• 56. The Bayesian Interpretation
From a Bayesian perspective,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{prior}}$$
i.e. (up to an additive constant)
$$\log\mathbb{P}[\theta|y] = \underbrace{\log\mathbb{P}[y|\theta]}_{\text{log likelihood}} + \underbrace{\log\mathbb{P}[\theta]}_{\text{penalty}}$$
If $\beta$ has a $\mathcal{N}(0, \tau^2 I)$ prior distribution, then its posterior distribution has mean
$$\mathbb{E}[\beta|y, X] = \Big(X^{\mathsf{T}}X + \frac{\sigma^2}{\tau^2}I\Big)^{-1}X^{\mathsf{T}}y.$$
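A short sketch checking that correspondence on simulated data (the design, prior variance and all names are illustrative):

set.seed(1)
n = 50
X = matrix(rnorm(n*2), n, 2)
y = X %*% c(1, -1) + rnorm(n)
sigma2 = 1; tau2 = 1/2    # prior beta ~ N(0, tau2 I)
# posterior mean = ridge estimate with lambda = sigma2/tau2
solve(t(X) %*% X + (sigma2/tau2)*diag(2)) %*% t(X) %*% y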
• 57. Properties of the Ridge Estimator
$$\widehat{\beta}_{\lambda} = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
$$\mathbb{E}[\widehat{\beta}_{\lambda}] = X^{\mathsf{T}}X(\lambda I + X^{\mathsf{T}}X)^{-1}\beta$$
i.e. $\mathbb{E}[\widehat{\beta}_{\lambda}] \neq \beta$ (the ridge estimator is biased). Observe that $\mathbb{E}[\widehat{\beta}_{\lambda}] \to 0$ as $\lambda \to \infty$.
Assume that $X$ is an orthogonal design matrix, i.e. $X^{\mathsf{T}}X = I$, then $\widehat{\beta}_{\lambda} = (1 + \lambda)^{-1}\widehat{\beta}^{\text{ols}}$.
• 58. Properties of the Ridge Estimator
Set $W_{\lambda} = (I + \lambda[X^{\mathsf{T}}X]^{-1})^{-1}$. One can prove that $W_{\lambda}\widehat{\beta}^{\text{ols}} = \widehat{\beta}_{\lambda}$. Thus, $\text{Var}[\widehat{\beta}_{\lambda}] = W_{\lambda}\text{Var}[\widehat{\beta}^{\text{ols}}]W_{\lambda}^{\mathsf{T}}$ and
$$\text{Var}[\widehat{\beta}_{\lambda}] = \sigma^2(X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}X[(X^{\mathsf{T}}X + \lambda I)^{-1}]^{\mathsf{T}}.$$
Observe that
$$\text{Var}[\widehat{\beta}^{\text{ols}}] - \text{Var}[\widehat{\beta}_{\lambda}] = \sigma^2 W_{\lambda}[2\lambda(X^{\mathsf{T}}X)^{-2} + \lambda^2(X^{\mathsf{T}}X)^{-3}]W_{\lambda}^{\mathsf{T}} \ge 0.$$
• 59. Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than the OLS one. If $X$ is an orthogonal design matrix, $\text{Var}[\widehat{\beta}_{\lambda}] = \sigma^2(1 + \lambda)^{-2}I$.
$$\text{mse}[\widehat{\beta}_{\lambda}] = \sigma^2\,\text{trace}\big(W_{\lambda}(X^{\mathsf{T}}X)^{-1}W_{\lambda}^{\mathsf{T}}\big) + \beta^{\mathsf{T}}(W_{\lambda} - I)^{\mathsf{T}}(W_{\lambda} - I)\beta.$$
If $X$ is an orthogonal design matrix,
$$\text{mse}[\widehat{\beta}_{\lambda}] = \frac{p\sigma^2}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}\beta^{\mathsf{T}}\beta$$
(Figure: confidence ellipsoids for $(\beta_1, \beta_2)$ as $\lambda$ increases.)
• 60. Properties of the Ridge Estimator
$$\text{mse}[\widehat{\beta}_{\lambda}] = \frac{p\sigma^2}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}\beta^{\mathsf{T}}\beta$$
is minimal for
$$\lambda^{\star} = \frac{p\sigma^2}{\beta^{\mathsf{T}}\beta}$$
Note that there exists $\lambda > 0$ such that $\text{mse}[\widehat{\beta}_{\lambda}] < \text{mse}[\widehat{\beta}_{0}] = \text{mse}[\widehat{\beta}^{\text{ols}}]$.
• 61. Sparsity Issues
In several applications, $k$ can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s \ll k$, cf. Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity,
$$s = \text{card}\{\mathcal{S}\} \quad\text{where}\quad \mathcal{S} = \{j : \beta_j \neq 0\}$$
The model is now $y = X^{\mathsf{T}}_{\mathcal{S}}\beta_{\mathcal{S}} + \varepsilon$, where $X^{\mathsf{T}}_{\mathcal{S}}X_{\mathcal{S}}$ is a full rank matrix.
• 62. Going further on sparsity issues
The Ridge regression problem was to solve
$$\widehat{\beta} = \underset{\beta \in \{\|\beta\|_2 \le s\}}{\text{argmin}}\big\{\|Y - X^{\mathsf{T}}\beta\|_2^2\big\}$$
Define $\|a\|_0 = \sum \mathbf{1}(|a_i| > 0)$. Here $\dim(\beta) = k$ but $\|\beta\|_0 = s$. We wish we could solve
$$\widehat{\beta} = \underset{\beta \in \{\|\beta\|_0 = s\}}{\text{argmin}}\big\{\|Y - X^{\mathsf{T}}\beta\|_2^2\big\}$$
Problem: it is usually not possible to describe all possible constraints, since $\binom{k}{s}$ subsets of coefficients should be considered here (with $k$ (very) large).
• 63. Going further on sparsity issues
In a convex problem, solve the dual problem, e.g. in the Ridge regression: primal problem
$$\min_{\beta \in \{\|\beta\|_2 \le s\}}\big\{\|Y - X^{\mathsf{T}}\beta\|_2^2\big\}$$
and the dual problem
$$\min_{\beta \in \{\|Y - X^{\mathsf{T}}\beta\|_2 \le t\}}\big\{\|\beta\|_2^2\big\}$$
(Figures: contours of the loss and the two constraint sets.)
• 64. Going further on sparsity issues
Idea: solve the dual problem
$$\widehat{\beta} = \underset{\beta \in \{\|Y - X^{\mathsf{T}}\beta\|_2 \le h\}}{\text{argmin}}\big\{\|\beta\|_0\big\}$$
where we might convexify the $\ell_0$ norm, $\|\cdot\|_0$.
• 65. Going further on sparsity issues
On $[-1, +1]^k$, the convex hull of $\beta \mapsto \|\beta\|_0$ is $\beta \mapsto \|\beta\|_1$.
On $[-a, +a]^k$, the convex hull of $\beta \mapsto \|\beta\|_0$ is $\beta \mapsto a^{-1}\|\beta\|_1$.
Hence, why not solve
$$\widehat{\beta} = \underset{\beta : \|\beta\|_1 \le \tilde{s}}{\text{argmin}}\big\{\|Y - X^{\mathsf{T}}\beta\|_2\big\}$$
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$$\widehat{\beta} = \text{argmin}\big\{\|Y - X^{\mathsf{T}}\beta\|_2^2 + \lambda\|\beta\|_1\big\}$$
• 66. lasso Least Absolute Shrinkage and Selection Operator
$$\widehat{\beta} \in \text{argmin}\big\{\|Y - X^{\mathsf{T}}\beta\|_2^2 + \lambda\|\beta\|_1\big\}$$
is a convex problem (several algorithms exist), but not strictly convex (no uniqueness of the minimum). Nevertheless, predictions $\widehat{y} = \boldsymbol{x}^{\mathsf{T}}\widehat{\beta}$ are unique.
MM (minimize a majorization), coordinate descent: see Hunter & Lange (2003, A Tutorial on MM Algorithms).
• 67. lasso Regression
No explicit solution... If $\lambda \to 0$, $\widehat{\beta}^{\text{lasso}}_{0} = \widehat{\beta}^{\text{ols}}$; if $\lambda \to \infty$, $\widehat{\beta}^{\text{lasso}}_{\infty} = 0$.
(Figures: contours of the loss with the ℓ1 constraint set.)
• 68. lasso Regression
For some $\lambda$, there are $k$'s such that $\widehat{\beta}^{\text{lasso}}_{k,\lambda} = 0$. Further, $\lambda \mapsto \widehat{\beta}^{\text{lasso}}_{k,\lambda}$ is piecewise linear.
(Figures: the ℓ1 constraint set; solutions hit the corners, which yields exact zeros.)
• 69. lasso Regression
In the orthogonal case, $X^{\mathsf{T}}X = I$,
$$\widehat{\beta}^{\text{lasso}}_{k,\lambda} = \text{sign}\big(\widehat{\beta}^{\text{ols}}_{k}\big)\Big(|\widehat{\beta}^{\text{ols}}_{k}| - \frac{\lambda}{2}\Big)_+$$
i.e. the LASSO estimate is related to the soft threshold function...
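A minimal sketch of that soft threshold function (the function name and grid are illustrative):

soft = function(b, lambda) sign(b)*pmax(abs(b) - lambda/2, 0)
b = seq(-2, 2, length = 101)
plot(b, soft(b, 1), type = "l", col = "red")
abline(a = 0, b = 1, lty = 2)   # the OLS (45-degree) line, for comparison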
• 70. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
We want to solve something like
$$\widehat{\beta}_{\lambda} = \text{argmin}\big\{-\log\mathcal{L}(\beta|\boldsymbol{x}, y) + \lambda\|\beta\|_2^2\big\}$$
In the case where we consider the log-likelihood of some Gaussian variable, we get the sum of the squares of the residuals, and we can obtain an explicit solution. But not in the context of a logistic regression.
The heuristics about Ridge regression is the following graph. In the background, we can visualize the (two-dimensional) log-likelihood of the logistic regression, and the blue circle is the constraint we have, if we rewrite the optimization problem as a constrained optimization problem:
$$\min_{\beta : \|\beta\|_2^2 \le s} \sum_{i=1}^n -\log\mathcal{L}(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\beta)$$
• 71. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
It can be written equivalently (it is a strictly convex problem)
$$\min_{\beta}\Big\{-\sum_{i=1}^n \log\mathcal{L}(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\beta) + \lambda\|\beta\|_2^2\Big\}$$
Thus, the constrained maximum should lie in the blue disk.

PennegLogLik = function(bbeta, lambda = 0){
  b0 = bbeta[1]
  beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0 + X %*% beta))) -
       (1 - y)*log(1 + exp(b0 + X %*% beta))) + lambda*sum(beta^2)}
• 72. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)

opt_ridge = function(lambda){
  beta_init = lm(PRONO ~ ., data = myocarde)$coefficients
  logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x, lambda),
                       method = "BFGS", control = list(abstol = 1e-9))
  logistic_opt$par[-1]}
v_lambda = c(exp(seq(-2, 5, length = 61)))
est_ridge = Vectorize(opt_ridge)(v_lambda)
plot(v_lambda, est_ridge[1, ])
• 73. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
• Using Fisher's (penalized) score
On the penalized problem, we can easily prove that
$$\frac{\partial \log\mathcal{L}_p(\beta_{\lambda,\text{old}})}{\partial\beta} = \frac{\partial \log\mathcal{L}(\beta_{\lambda,\text{old}})}{\partial\beta} - 2\lambda\beta_{\text{old}}$$
while
$$\frac{\partial^2 \log\mathcal{L}_p(\beta_{\lambda,\text{old}})}{\partial\beta\,\partial\beta^{\mathsf{T}}} = \frac{\partial^2 \log\mathcal{L}(\beta_{\lambda,\text{old}})}{\partial\beta\,\partial\beta^{\mathsf{T}}} - 2\lambda I$$
Hence
$$\beta_{\lambda,\text{new}} = (X^{\mathsf{T}}\Delta_{\text{old}}X + 2\lambda I)^{-1}X^{\mathsf{T}}\Delta_{\text{old}}z$$
And interestingly, with that algorithm, we can also derive the variance of the estimator,
$$\text{Var}[\widehat{\beta}_{\lambda}] = [X^{\mathsf{T}}\Delta X + 2\lambda I]^{-1}X^{\mathsf{T}}\Delta\,\text{Var}[z]\,\Delta X\,[X^{\mathsf{T}}\Delta X + 2\lambda I]^{-1}$$
where $\text{Var}[z] = \Delta^{-1}$.
• 74. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
The code to compute $\widehat{\beta}_{\lambda}$ is then

Y = myocarde$PRONO
X = myocarde[, 1:7]
for(j in 1:7) X[, j] = (X[, j] - mean(X[, j]))/sd(X[, j])
X = as.matrix(X)
X = cbind(1, X)
colnames(X) = c("Inter", names(myocarde[, 1:7]))
beta = as.matrix(lm(Y ~ 0 + X)$coefficients, ncol = 1)
for(s in 1:9){
  pi = exp(X %*% beta[, s])/(1 + exp(X %*% beta[, s]))
  Delta = matrix(0, nrow(X), nrow(X)); diag(Delta) = pi*(1 - pi)
  z = X %*% beta[, s] + solve(Delta) %*% (Y - pi)
  B = solve(t(X) %*% Delta %*% X + 2*lambda*diag(ncol(X))) %*% (t(X) %*% Delta %*% z)
  beta = cbind(beta, B)}
beta[, 8:10]
• 75.
              [,1]        [,2]        [,3]
XInter  0.59619654  0.59619654  0.59619654
XFRCAR  0.09217848  0.09217848  0.09217848
XINCAR  0.77165707  0.77165707  0.77165707
XINSYS  0.69678521  0.69678521  0.69678521
XPRDIA -0.29575642 -0.29575642 -0.29575642
XPAPUL -0.23921101 -0.23921101 -0.23921101
XPVENT -0.33120792 -0.33120792 -0.33120792
XREPUL -0.84308972 -0.84308972 -0.84308972

Use $\text{Var}[\widehat{\beta}_{\lambda}] = [X^{\mathsf{T}}\Delta X + 2\lambda I]^{-1}X^{\mathsf{T}}\Delta\,\text{Var}[z]\,\Delta X\,[X^{\mathsf{T}}\Delta X + 2\lambda I]^{-1}$ where $\text{Var}[z] = \Delta^{-1}$.

newton_ridge = function(lambda = 1){
  beta = as.matrix(lm(Y ~ 0 + X)$coefficients, ncol = 1)*runif(8)
  for(s in 1:20){
    pi = exp(X %*% beta[, s])/(1 + exp(X %*% beta[, s]))
    Delta = matrix(0, nrow(X), nrow(X)); diag(Delta) = pi*(1 - pi)
    z = X %*% beta[, s] + solve(Delta) %*% (Y - pi)
• 76.
    B = solve(t(X) %*% Delta %*% X + 2*lambda*diag(ncol(X))) %*% (t(X) %*% Delta %*% z)
    beta = cbind(beta, B)}
  Varz = solve(Delta)
  Varb = solve(t(X) %*% Delta %*% X + 2*lambda*diag(ncol(X))) %*% t(X) %*% Delta %*%
         Varz %*% Delta %*% X %*% solve(t(X) %*% Delta %*% X + 2*lambda*diag(ncol(X)))
  return(list(beta = beta[, ncol(beta)], sd = sqrt(diag(Varb))))}

v_lambda = c(exp(seq(-2, 5, length = 61)))
est_ridge = Vectorize(function(x) newton_ridge(x)$beta)(v_lambda)
library("RColorBrewer")
colrs = brewer.pal(7, "Set1")
plot(v_lambda, est_ridge[1, ], col = colrs[1], type = "l")
for(i in 2:7) lines(v_lambda, est_ridge[i, ], col = colrs[i])
• 77. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
And for the variance

v_lambda = c(exp(seq(-2, 5, length = 61)))
est_ridge = Vectorize(function(x) newton_ridge(x)$sd)(v_lambda)
library("RColorBrewer")
colrs = brewer.pal(7, "Set1")
plot(v_lambda, est_ridge[1, ], col = colrs[1], type = "l")
for(i in 2:7) lines(v_lambda, est_ridge[i, ], col = colrs[i], lwd = 2)
• 78. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)

y = myocarde$PRONO
X = myocarde[, 1:7]
for(j in 1:7) X[, j] = (X[, j] - mean(X[, j]))/sd(X[, j])
X = as.matrix(X)
library(glmnet)
glm_ridge = glmnet(X, y, alpha = 0)
plot(glm_ridge, xvar = "lambda", col = colrs, lwd = 2)
• 79. Classification : Penalized Logistic Regression, ℓ2 norm (Ridge)
• Ridge Regression with Orthogonal Covariates

library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord

library(glmnet)
glm_ridge = glmnet(pca_X, y, alpha = 0)
plot(glm_ridge, xvar = "lambda", col = colrs, lwd = 2)
plot(glm_ridge, col = colrs, lwd = 2)
• 80.

df0 = df
df0$y = as.numeric(df$y) - 1
plot_lambda = function(lambda){
  m = apply(df0, 2, mean)
  s = apply(df0, 2, sd)
  for(j in 1:2) df0[, j] = (df0[, j] - m[j])/s[j]
  reg = glmnet(cbind(df0$x1, df0$x2), df0$y == 1, alpha = 0)
  # ... (the boundary-drawing part of the function is truncated on the slide;
  #      the lasso version on slide 103 shows the full pattern)
}
par(mfrow = c(1, 2))
plot(reg, xvar = "lambda", col = c("blue", "red"), lwd = 2)
abline(v = log(.2))
plot_lambda(.2)
plot(reg, xvar = "lambda", col = c("blue", "red"), lwd = 2)
abline(v = log(1.2))
plot_lambda(1.2)
• 81. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
It can be written equivalently (it is a strictly convex problem)
$$\min_{\beta}\Big\{-\sum_{i=1}^n \log\mathcal{L}(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\beta) + \lambda\|\beta\|_1\Big\}$$
Thus, the constrained maximum should lie in the blue diamond (the ℓ1 ball).
• 82. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)

y = myocarde$PRONO
X = myocarde[, 1:7]
for(j in 1:7) X[, j] = (X[, j] - mean(X[, j]))/sd(X[, j])
X = as.matrix(X)
LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0 + X %*% beta))) -
      (1 - y)*log(1 + exp(b0 + X %*% beta)))}
u = seq(-4, 4, length = 251)
v = outer(u, u, function(x, y) LogLik(c(1, x, y)))
image(u, u, v, col = rev(heat.colors(25)))
contour(u, u, v, add = TRUE)
polygon(c(-1, 0, 1, 0), c(0, 1, 0, -1), border = "blue")
• 83. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)

PennegLogLik = function(bbeta, lambda = 0){
  b0 = bbeta[1]
  beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0 + X %*% beta))) -
       (1 - y)*log(1 + exp(b0 + X %*% beta))) + lambda*sum(abs(beta))}
opt_lasso = function(lambda){
  beta_init = lm(PRONO ~ ., data = myocarde)$coefficients
  logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x, lambda),
                       hessian = TRUE, method = "BFGS",
                       control = list(abstol = 1e-9))
  logistic_opt$par[-1]}
v_lambda = c(exp(seq(-4, 2, length = 61)))
est_lasso = Vectorize(opt_lasso)(v_lambda)
plot(v_lambda, est_lasso[1, ], col = colrs[1])
for(i in 2:7) lines(v_lambda, est_lasso[i, ], col = colrs[i], lwd = 2)
• 84. Observe that, with this general-purpose optimizer, coefficients are small but not (strictly) null, while glmnet yields exact zeros

library(glmnet)
glm_lasso = glmnet(X, y, alpha = 1)
plot(glm_lasso, xvar = "lambda", col = colrs, lwd = 2)
plot(glm_lasso, col = colrs, lwd = 2)
glmnet(X, y, alpha = 1, lambda = exp(-4))$beta
7x1 sparse Matrix of class "dgCMatrix"
               s0
FRCAR   .
INCAR   0.11005070
INSYS   0.03231929
PRDIA   .
PAPUL   .
PVENT  -0.03138089
REPUL  -0.20962611
• 85. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
With orthogonal covariates

library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord
glm_lasso = glmnet(pca_X, y, alpha = 1)
plot(glm_lasso, xvar = "lambda", col = colrs)
plot(glm_lasso, col = colrs)

Before focusing on the lasso for binary variables, consider the standard ols-lasso... But first, to illustrate, let us get back to quantile (ℓ1) regression...
• 86. Optimization with linear problems (application to ℓ1-regression)
Consider the case of the median. Consider a sample $\{y_1, \cdots, y_n\}$. To compute the median, solve
$$\min_{\mu} \sum_{i=1}^n |y_i - \mu|$$
which can be solved using linear programming techniques (see e.g. the simplex method). More precisely, this problem is equivalent to
$$\min_{\mu, \boldsymbol{a}, \boldsymbol{b}} \sum_{i=1}^n (a_i + b_i) \quad\text{with } a_i, b_i \ge 0 \text{ and } y_i - \mu = a_i - b_i,\ \forall i = 1, \cdots, n.$$
Consider a sample obtained from a lognormal distribution,

n = 101
set.seed(1)
y = rlnorm(n)
median(y)
[1] 1.077415
• 87. Optimization with linear problems (application to ℓ1-regression)
Here, one can use a standard optimization routine

md = Vectorize(function(m) sum(abs(y - m)))
optim(mean(y), md)
$par
[1] 1.077416

or a linear programming technique: use the matrix form, with 3n constraints, and 2n + 1 parameters,

library(lpSolve)
A1 = cbind(diag(2*n), 0)
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(1, 2*n), 0),
       rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 1)
[1] 1.077415
• 88. Optimization with linear problems (application to ℓ1-regression)
More generally, consider here some quantile,

tau = .3
quantile(y, tau)
      30%
0.6741586

The linear program is now
$$\min_{\mu, \boldsymbol{a}, \boldsymbol{b}} \sum_{i=1}^n \big(\tau a_i + (1 - \tau)b_i\big) \quad\text{with } a_i, b_i \ge 0 \text{ and } y_i - \mu = a_i - b_i,\ \forall i = 1, \cdots, n.$$

A1 = cbind(diag(2*n), 0)
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(tau, n), rep(1 - tau, n), 0),
       rbind(A1, A2), c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 1)
[1] 0.6741586
• 89. Optimization with linear problems (application to ℓ1-regression)
To illustrate numerical computations, use

base = read.table("http://freakonometrics.free.fr/rent98_00.txt", header = TRUE)

The linear program for the quantile regression is now
$$\min_{\mu, \boldsymbol{a}, \boldsymbol{b}} \sum_{i=1}^n \big(\tau a_i + (1 - \tau)b_i\big) \quad\text{with } a_i, b_i \ge 0 \text{ and } y_i - [\beta_0^{\tau} + \beta_1^{\tau}x_i] = a_i - b_i,\ \forall i = 1, \cdots, n.$$

require(lpSolve)
tau = .3
n = nrow(base)
X = cbind(1, base$area)
y = base$rent_euro
A1 = cbind(diag(2*n), 0, 0)
A2 = cbind(diag(n), -diag(n), X)
• 90. Optimization with linear problems (application to ℓ1-regression)

r = lp("min",
       c(rep(tau, n), rep(1 - tau, n), 0, 0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 2)
[1] 148.946864   3.289674

see

library(quantreg)
rq(rent_euro ~ area, tau = tau, data = base)
Coefficients:
(Intercept)        area
 148.946864    3.289674
• 91. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
If we get back to the original Lasso approach, the goal was to solve
$$\min\Big\{\frac{1}{2n}\sum_{i=1}^n \big[y_i - (\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta)\big]^2 + \lambda\sum_j |\beta_j|\Big\}$$
since the intercept is not subject to the penalty. The first order condition is then
$$\frac{\partial}{\partial\beta_0}\,\|y - X\beta - \beta_0\mathbf{1}\|^2 = 2(X\beta - y)^{\mathsf{T}}\mathbf{1} + 2\beta_0\|\mathbf{1}\|^2 = 0$$
i.e. (since $\|\mathbf{1}\|^2 = n$)
$$\beta_0 = \frac{1}{n}(y - X\beta)^{\mathsf{T}}\mathbf{1}$$
For the other parameters, assuming that the KKT conditions are satisfied, we can check whether $0$ belongs to the subdifferential at the minimum,
$$0 \in \partial\Big(\frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1\Big) = \nabla\Big(\frac{1}{2}\|y - X\beta\|^2\Big) + \partial\big(\lambda\|\beta\|_1\big)$$
• 92. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
For the term on the left, we recognize
$$\nabla\Big(\frac{1}{2}\|y - X\beta\|^2\Big) = -X^{\mathsf{T}}(y - X\beta) = -\boldsymbol{g}$$
so that the previous equation can be written
$$g_k \in \partial(\lambda|\beta_k|) = \begin{cases} \{+\lambda\} & \text{if } \beta_k > 0 \\ \{-\lambda\} & \text{if } \beta_k < 0 \\ (-\lambda, +\lambda) & \text{if } \beta_k = 0 \end{cases}$$
i.e. if $\beta_k \neq 0$, then $g_k = \text{sign}(\beta_k)\cdot\lambda$.
We can split $\beta_j$ into a sum of its positive and negative parts by replacing $\beta_j$ with $\beta_j^+ - \beta_j^-$ where $\beta_j^+, \beta_j^- \ge 0$.
• 93. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
Then the Lasso problem becomes
$$-\log\mathcal{L}(\beta) + \lambda\sum_j (\beta_j^+ + \beta_j^-)$$
with constraints $\beta_j^+, \beta_j^- \ge 0$. Let $\alpha_j^+, \alpha_j^-$ denote the Lagrange multipliers for $\beta_j^+$ and $\beta_j^-$, respectively, and consider the Lagrangian
$$-\log\mathcal{L}(\beta) + \lambda\sum_j (\beta_j^+ + \beta_j^-) - \sum_j \alpha_j^+\beta_j^+ - \sum_j \alpha_j^-\beta_j^-.$$
To satisfy the stationarity condition, we take the gradient of the Lagrangian with respect to $\beta_j^+$ and set it to zero to obtain
$$\nabla\log\mathcal{L}(\beta)_j + \lambda - \alpha_j^+ = 0$$
We do the same with respect to $\beta_j^-$ to obtain
$$-\nabla\log\mathcal{L}(\beta)_j + \lambda - \alpha_j^- = 0.$$
• 94. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
For convenience, introduce the soft-thresholding function
$$S(z, \gamma) = \text{sign}(z)\cdot(|z| - \gamma)_+ = \begin{cases} z - \gamma & \text{if } \gamma < |z| \text{ and } z > 0 \\ z + \gamma & \text{if } \gamma < |z| \text{ and } z < 0 \\ 0 & \text{if } \gamma \ge |z| \end{cases}$$
Noticing that the optimization problem
$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
can also be written, coordinate by coordinate (in the orthonormal case), as
$$\min \sum_{j=1}^p \Big\{-\widehat{\beta}^{\text{ols}}_j\cdot\beta_j + \frac{1}{2}\beta_j^2 + \lambda|\beta_j|\Big\}$$
observe that
$$\widehat{\beta}_{j,\lambda} = S\big(\widehat{\beta}^{\text{ols}}_j, \lambda\big)$$
which is a coordinate-wise update.
• 95. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
Now, if we consider a (slightly) more general problem, with weights in the first part,
$$\min\Big\{\frac{1}{2n}\sum_{i=1}^n \omega_i\big[y_i - (\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta)\big]^2 + \lambda\sum_j |\beta_j|\Big\}$$
the coordinate-wise update becomes
$$\widehat{\beta}_{j,\lambda,\omega} = S\big(\widehat{\beta}^{\omega\text{-ols}}_j, \lambda\big)$$
An alternative is to set
$$\boldsymbol{r}_j = y - \Big(\beta_0\mathbf{1} + \sum_{k\neq j}\beta_k\boldsymbol{x}_k\Big) = y - \widehat{y}^{(j)}$$
• 96. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
so that the optimization problem can be written, equivalently,
$$\min \frac{1}{2n}\sum_{j=1}^p \big\{\|\boldsymbol{r}_j - \beta_j\boldsymbol{x}_j\|^2 + \lambda|\beta_j|\big\}$$
hence
$$\min \frac{1}{2n}\sum_{j=1}^p \big\{\beta_j^2\|\boldsymbol{x}_j\|^2 - 2\beta_j\boldsymbol{r}_j^{\mathsf{T}}\boldsymbol{x}_j + \lambda|\beta_j|\big\}$$
and one gets
$$\widehat{\beta}_{j,\lambda} = \frac{1}{\|\boldsymbol{x}_j\|^2}S\big(\boldsymbol{r}_j^{\mathsf{T}}\boldsymbol{x}_j, n\lambda\big)$$
or, if we develop,
$$\widehat{\beta}_{j,\lambda} = \frac{1}{\sum_i x_{ij}^2}S\Big(\sum_i x_{i,j}\big[y_i - \widehat{y}_i^{(j)}\big], n\lambda\Big)$$
• 97. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)
Again, if there are weights $\omega = (\omega_i)$, the coordinate-wise update becomes
$$\widehat{\beta}_{j,\lambda,\omega} = \frac{1}{\sum_i \omega_i x_{ij}^2}S\Big(\sum_i \omega_i x_{i,j}\big[y_i - \widehat{y}_i^{(j)}\big], n\lambda\Big)$$
The code to compute this componentwise descent is

soft_thresholding = function(x, a){
  result = numeric(length(x))
  result[which(x > a)] = x[which(x > a)] - a
  result[which(x < -a)] = x[which(x < -a)] + a
  return(result)
}

and the code of the coordinate descent loop follows.
• 98. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)

lasso_coord_desc = function(X, y, beta, lambda, tol = 1e-6, maxiter = 1000){
  beta = as.matrix(beta)
  X = as.matrix(X)
  omega = rep(1/length(y), length(y))
  obj = numeric(length = (maxiter + 1))
  betalist = list(length(maxiter + 1))
  betalist[[1]] = beta
  beta0list = numeric(length(maxiter + 1))
  beta0 = sum(y - X %*% beta)/(length(y))
  beta0list[1] = beta0
  for (j in 1:maxiter){
    for (k in 1:length(beta)){
      r = y - X[, -k] %*% beta[-k] - beta0*rep(1, length(y))
      beta[k] = (1/sum(omega*X[, k]^2))*
        soft_thresholding(t(omega*r) %*% X[, k], length(y)*lambda)
    }
    beta0 = sum(y - X %*% beta)/(length(y))
• 99.
    beta0list[j + 1] = beta0
    betalist[[j + 1]] = beta
    obj[j] = (1/2)*(1/length(y))*norm(omega*(y - X %*% beta -
             beta0*rep(1, length(y))), 'F')^2 + lambda*sum(abs(beta))
    if (norm(rbind(beta0list[j], betalist[[j]]) - rbind(beta0, beta), 'F') < tol) { break }
  }
  return(list(obj = obj[1:j], beta = beta, intercept = beta0)) }

If we get back to our logistic problem, the log-likelihood is here
$$\log\mathcal{L} = \frac{1}{n}\sum_{i=1}^n y_i\cdot(\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta) - \log\big[1 + \exp(\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta)\big]$$
which is a concave function of the parameters. Hence, one can use a quadratic approximation of the log-likelihood, based on a Taylor expansion,
$$\log\mathcal{L} \approx \widetilde{\log\mathcal{L}} = \frac{1}{n}\sum_{i=1}^n \omega_i\cdot\big[z_i - (\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta)\big]^2$$
• 100.
where $z_i$ is the working response
$$z_i = (\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta) + \frac{y_i - p_i}{p_i[1 - p_i]}$$
$p_i$ is the prediction
$$p_i = \frac{\exp[\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta]}{1 + \exp[\beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\beta]}$$
and $\omega_i$ are weights $\omega_i = p_i[1 - p_i]$. Thus, we obtain a penalized least-squares problem.

lasso_coord_desc = function(X, y, beta, lambda, tol = 1e-6, maxiter = 1000){
  beta = as.matrix(beta)
  X = as.matrix(X)
  obj = numeric(length = (maxiter + 1))
  betalist = list(length(maxiter + 1))
  betalist[[1]] = beta
  beta0 = sum(y - X %*% beta)/(length(y))
  p = exp(beta0*rep(1, length(y)) + X %*% beta)/
      (1 + exp(beta0*rep(1, length(y)) + X %*% beta))
  z = beta0*rep(1, length(y)) + X %*% beta + (y - p)/(p*(1 - p))
• 101.
  omega = p*(1 - p)/(sum((p*(1 - p))))
  beta0list = numeric(length(maxiter + 1))
  beta0 = sum(y - X %*% beta)/(length(y))
  beta0list[1] = beta0
  for (j in 1:maxiter){
    for (k in 1:length(beta)){
      r = z - X[, -k] %*% beta[-k] - beta0*rep(1, length(y))
      beta[k] = (1/sum(omega*X[, k]^2))*
        soft_thresholding(t(omega*r) %*% X[, k], length(y)*lambda)
    }
    beta0 = sum(y - X %*% beta)/(length(y))
    beta0list[j + 1] = beta0
    betalist[[j + 1]] = beta
    obj[j] = (1/2)*(1/length(y))*norm(omega*(z - X %*% beta -
             beta0*rep(1, length(y))), 'F')^2 + lambda*sum(abs(beta))
    p = exp(beta0*rep(1, length(y)) + X %*% beta)/
        (1 + exp(beta0*rep(1, length(y)) + X %*% beta))
    z = beta0*rep(1, length(y)) + X %*% beta + (y - p)/(p*(1 - p))
• 102.
    omega = p*(1 - p)/(sum((p*(1 - p))))
    if (norm(rbind(beta0list[j], betalist[[j]]) -
             rbind(beta0, beta), 'F') < tol) { break }
  }
  return(list(obj = obj[1:j], beta = beta, intercept = beta0)) }
• 103. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)

df0 = df
df0$y = as.numeric(df$y) - 1
plot_lambda = function(lambda){
  m = apply(df0, 2, mean)
  s = apply(df0, 2, sd)
  for(j in 1:2) df0[, j] = (df0[, j] - m[j])/s[j]
  reg = glmnet(cbind(df0$x1, df0$x2), df0$y == 1, alpha = 1, lambda = lambda)
  u = seq(0, 1, length = 101)
  p = function(x, y){
    xt = (x - m[1])/s[1]
    yt = (y - m[2])/s[2]
    predict(reg, newx = cbind(x1 = xt, x2 = yt), type = "response")}
• 104. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)

reg = glmnet(cbind(df0$x1, df0$x2), df0$y == 1, alpha = 1)
par(mfrow = c(1, 2))
plot(reg, xvar = "lambda", col = c("blue", "red"), lwd = 2)
abline(v = exp(-2.8))
plot_lambda(exp(-2.8))
• 105. Classification : Penalized Logistic Regression, ℓ1 norm (lasso)

reg = glmnet(cbind(df0$x1, df0$x2), df0$y == 1, alpha = 1)
par(mfrow = c(1, 2))
plot(reg, xvar = "lambda", col = c("blue", "red"), lwd = 2)
abline(v = exp(-2.1))
plot_lambda(exp(-2.1))
• 106. Classification : (Linear) Fisher's Discrimination
Consider the following naive classification rule
$$m^{\star}(x) = \underset{y}{\text{argmax}}\,\{\mathbb{P}[Y = y|X = x]\}$$
or
$$m^{\star}(x) = \underset{y}{\text{argmax}}\,\Big\{\frac{\mathbb{P}[X = x|Y = y]\,\mathbb{P}[Y = y]}{\mathbb{P}[X = x]}\Big\}$$
(where $\mathbb{P}[X = x]$ is the density in the continuous case).
In the case where $y$ takes two values, that will be standard $\{0, 1\}$ here, one can rewrite the latter as
$$m^{\star}(x) = \begin{cases} 1 & \text{if } \mathbb{E}(Y|X = x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$
• 107. Classification : (Linear) Fisher's Discrimination
The set
$$\mathcal{DS} = \Big\{x,\ \mathbb{E}(Y|X = x) = \frac{1}{2}\Big\}$$
is called the decision boundary.
Assume that $X|Y = 0 \sim \mathcal{N}(\mu_0, \Sigma_0)$ and $X|Y = 1 \sim \mathcal{N}(\mu_1, \Sigma_1)$; then explicit expressions can be derived,
$$m^{\star}(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2\log\dfrac{\mathbb{P}(Y = 1)}{\mathbb{P}(Y = 0)} + \log\dfrac{|\Sigma_0|}{|\Sigma_1|} \\ 0 & \text{otherwise} \end{cases}$$
where $r_y^2$ is the Mahalanobis distance,
$$r_y^2 = [X - \mu_y]^{\mathsf{T}}\Sigma_y^{-1}[X - \mu_y]$$
• 108. Classification : (Linear) Fisher's Discrimination
Let $\delta_y$ be defined as
$$\delta_y(x) = -\frac{1}{2}\log|\Sigma_y| - \frac{1}{2}[x - \mu_y]^{\mathsf{T}}\Sigma_y^{-1}[x - \mu_y] + \log\mathbb{P}(Y = y)$$
The decision boundary of this classifier is $\{x \text{ such that } \delta_0(x) = \delta_1(x)\}$, which is quadratic in $x$. This is the quadratic discriminant analysis (QDA).
In Fisher (1936, The use of multiple measurements in taxonomic problems), it was assumed that $\Sigma_0 = \Sigma_1$. In that case, actually,
$$\delta_y(x) = x^{\mathsf{T}}\Sigma^{-1}\mu_y - \frac{1}{2}\mu_y^{\mathsf{T}}\Sigma^{-1}\mu_y + \log\mathbb{P}(Y = y)$$
and the decision frontier is now linear in $x$.
  • 109. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Classification : (Linear) Fisher’s Discrimination 1 m0 = apply(myocarde[myocarde$PRONO =="0" ,1:7],2, mean) 2 m1 = apply(myocarde[myocarde$PRONO =="1" ,1:7],2, mean) 3 Sigma = var(myocarde [ ,1:7]) 4 omega = solve(Sigma)%*%(m1 -m0) 5 omega 6 [,1] 7 FRCAR -0.012909708542 8 INCAR 1.088582058796 9 INSYS -0.019390084344 10 PRDIA -0.025817110020 11 PAPUL 0.020441287970 12 PVENT -0.038298291091 13 REPUL -0.001371677757 14 b = (t(m1)%*%solve(Sigma)%*%m1 -t(m0)%*%solve(Sigma)%*%m0)/2 @freakonometrics freakonometrics freakonometrics.hypotheses.org 109
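With the ω and b just computed (a sketch, reusing those objects; the prior term log P(Y=1)/P(Y=0) is dropped here, as on the slide), the linear score x^T ω can be compared with the threshold b to classify:

score = as.matrix(myocarde[,1:7]) %*% omega
y_pred = (score > c(b))*1        # classify as 1 when the score exceeds the threshold
table(y_pred, myocarde$PRONO)    # confusion table on the training sample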
  • 110. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Classification : (Linear) Fisher’s Discrimination 1 x = c(.4 ,.55 ,.65 ,.9 ,.1 ,.35 ,.5 ,.15 ,.2 ,.85) 2 y = c(.85 ,.95 ,.8 ,.87 ,.5 ,.55 ,.5 ,.2 ,.1 ,.3) 3 z = c(1,1,1,1,1,0,0,1,0,0) 4 df = data.frame(x1=x,x2=y,y=as.factor(z)) 5 m0 = apply(df[df$y=="0" ,1:2],2, mean) 6 m1 = apply(df[df$y=="1" ,1:2],2, mean) 7 Sigma = var(df [ ,1:2]) 8 omega = solve(Sigma)%*%(m1 -m0) 9 omega 10 [,1] 11 x1 -2.640613174 12 x2 4.858705676 @freakonometrics freakonometrics freakonometrics.hypotheses.org 110
  • 111. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 1 library(MASS) 2 fit_lda = lda(y ˜x1+x2 , data=df) 3 fit_lda 4 5 Coefficients of linear discriminants : 6 LD1 7 x1 -2.588389554 8 x2 4.762614663 and for the intercept 1 b = (t(m1)%*%solve(Sigma)%*%m1 -t(m0)%*%solve( Sigma)%*%m0)/2 @freakonometrics freakonometrics freakonometrics.hypotheses.org 111
• 112. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 predlda = function(x,y) predict(fit_lda, data.frame(x1=x,x2=y))$class==1
2 vv = outer(vu, vu, predlda)
3 contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5)

1 fit_qda = qda(y ~ x1+x2, data=df)
2 plot(df$x1, df$x2, pch=19,
3 col=c("blue","red")[1+(df$y=="1")])
4 predqda = function(x,y) predict(fit_qda, data.frame(x1=x,x2=y))$class==1
5 vv = outer(vu, vu, predqda)
6 contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 112
• 113. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Note that there are strong links with the logistic regression. Assume as previously that X|Y=0 ~ N(µ0, Σ) and X|Y=1 ~ N(µ1, Σ); then
log [ P(Y=1|X=x) / P(Y=0|X=x) ]
is equal to
x^T Σ^{-1} (µ1 − µ0) − (1/2) (µ1 + µ0)^T Σ^{-1} (µ1 − µ0) + log [ P(Y=1)/P(Y=0) ]
which is linear (affine) in x,
log [ P(Y=1|X=x) / P(Y=0|X=x) ] = β0 + x^T β
Hence, when the two groups have Gaussian distributions with identical covariance matrix, LDA and the logistic regression lead to the same classification rule.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 113
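The equivalence can be checked numerically (a sketch): on the myocarde data, the posterior probabilities fitted by lda() and the probabilities fitted by the logistic regression are typically almost perfectly correlated, even though the two estimation strategies differ.

library(MASS)
p_lda = predict(lda(PRONO ~ ., data=myocarde))$posterior[,2]   # P(Y=1|x) from LDA
p_glm = predict(glm(PRONO ~ ., data=myocarde, family=binomial),
                type="response")                               # P(Y=1|x) from logistic
cor(p_lda, p_glm)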
• 114. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : (Linear) Fisher's Discrimination
Observe furthermore that the slope is proportional to Σ^{-1}[µ1 − µ0], as stated in Fisher's article. But to obtain such a relationship, he observed that the ratio of between-group to within-group variance,
variance between / variance within = [ω^T(µ1 − µ0)]² / (ω^T Σ1 ω + ω^T Σ0 ω)
is maximal when ω is proportional to Σ^{-1}[µ1 − µ0].
@freakonometrics freakonometrics freakonometrics.hypotheses.org 114
  • 115. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Classification : Neural Nets The most interesting part is not ’neural’ but ’network’. @freakonometrics freakonometrics freakonometrics.hypotheses.org 115
• 116. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Neural Nets
Networks have nodes, and edges (possibly directed) that connect nodes; or, to be more specific, some sort of flow network.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 116
• 117. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Neural Nets
In such a network, we usually have sources (here multiple, {s1, s2, s3}) on the left, and a sink (here {t}) on the right. To continue with this metaphorical introduction, information from the sources should reach the sink. And usually, the sources are the explanatory variables {x1, ..., xp}, and the sink is our variable of interest y. The output here is a binary variable y ∈ {0,1}.
Consider the sigmoid function
f(x) = 1/(1 + e^{-x}) = e^x/(e^x + 1)
which is actually the logistic function.
1 sigmoid = function(x) 1/(1 + exp(-x))
This function f is called the activation function.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 117
• 118. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
If y ∈ {−1,+1}, one can consider the hyperbolic tangent
f(x) = tanh(x) = (e^x − e^{-x})/(e^x + e^{-x})
or the inverse tangent function f(x) = tan^{-1}(x).
As input to such a function, we consider a weighted sum of the incoming nodes,
y_i = f( Σ_{j=1}^p ω_j x_{j,i} )
We can also add a constant (bias) term,
y_i = f( ω_0 + Σ_{j=1}^p ω_j x_{j,i} )
@freakonometrics freakonometrics freakonometrics.hypotheses.org 118
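A quick plot of the three activation functions (a sketch; the inverse tangent is rescaled by 2/π here so that all three map into comparable ranges):

sigmoid = function(x) 1/(1+exp(-x))
curve(sigmoid(x), -4, 4, col="blue", lwd=2, ylim=c(-1,1), ylab="f(x)")
curve(tanh(x), col="red", lwd=2, add=TRUE)
curve(2/pi*atan(x), col="darkgreen", lwd=2, add=TRUE)   # rescaled to (-1,1)
legend("bottomright", c("sigmoid","tanh","2/pi atan"),
       col=c("blue","red","darkgreen"), lwd=2)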
  • 119. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 Classification : Neural Nets 1 weights_0 = lm(PRONO˜.,data=myocarde)$ coefficients 2 X = as.matrix(cbind (1, myocarde [ ,1:7])) 3 y_5_1 = sigmoid(X %*% weights_0) 4 library(ROCR) 5 pred = ROCR :: prediction(y_5_1,myocarde$PRONO) 6 perf = ROCR :: performance (pred ,"tpr", "fpr") 7 plot(perf ,col="blue",lwd =2) 8 reg = glm(PRONO˜.,data=myocarde ,family= binomial(link = "logit")) 9 y_0 = predict(reg ,type="response") 10 pred0 = ROCR :: prediction(y_0,myocarde$PRONO) 11 perf0 = ROCR :: performance(pred0 ,"tpr", "fpr") 12 plot(perf0 ,add=TRUE ,col="red") @freakonometrics freakonometrics freakonometrics.hypotheses.org 119
• 120. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Neural Nets
The error of that model can be computed using the quadratic loss function,
Σ_{i=1}^n (y_i − p(x_i))²
or the cross-entropy,
− Σ_{i=1}^n ( y_i log p(x_i) + [1 − y_i] log[1 − p(x_i)] )
1 loss = function(weights){
2 mean((myocarde$PRONO - sigmoid(X %*% weights))^2) }
We want to solve
ω* = argmin { (1/n) Σ_{i=1}^n ℓ(y_i, f(ω_0 + x_i^T ω)) }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 120
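A cross-entropy version of the loss, mirroring the quadratic one above (a sketch; probabilities are clipped away from 0 and 1 to avoid log(0)):

loss_entropy = function(weights){
  p = sigmoid(X %*% weights)
  p = pmin(pmax(p, 1e-12), 1-1e-12)   # numerical safety
  -mean(myocarde$PRONO*log(p) + (1-myocarde$PRONO)*log(1-p)) }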
• 121. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 weights_1 = optim(weights_0, loss)$par
2 y_5_2 = sigmoid(X %*% weights_1)
3 pred = ROCR::prediction(y_5_2, myocarde$PRONO)
4 perf = ROCR::performance(pred, "tpr", "fpr")
5 plot(perf, col="blue", lwd=2)
6 plot(perf0, add=TRUE, col="red")
Consider a single hidden layer, with 3 different neurons, so that
p(x) = f( ω_0 + Σ_{h=1}^3 ω_h f_h( ω_{h,0} + Σ_{j=1}^p ω_{h,j} x_j ) )
or
p(x) = f( ω_0 + Σ_{h=1}^3 ω_h f_h( ω_{h,0} + x^T ω_h ) )
@freakonometrics freakonometrics freakonometrics.hypotheses.org 121
• 122. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 weights_1 = lm(PRONO~1+FRCAR+INCAR+INSYS+PAPUL+PVENT, data=myocarde)$coefficients
2 X1 = as.matrix(cbind(1, myocarde[,c("FRCAR","INCAR","INSYS","PAPUL","PVENT")]))
3 weights_2 = lm(PRONO~1+INSYS+PRDIA, data=myocarde)$coefficients
4 X2 = as.matrix(cbind(1, myocarde[,c("INSYS","PRDIA")]))
5 weights_3 = lm(PRONO~1+PAPUL+PVENT+REPUL, data=myocarde)$coefficients
6 X3 = as.matrix(cbind(1, myocarde[,c("PAPUL","PVENT","REPUL")]))
7 X = cbind(sigmoid(X1 %*% weights_1), sigmoid(X2 %*% weights_2), sigmoid(X3 %*% weights_3))
8 weights = c(1/3,1/3,1/3)
9 y_5_3 = sigmoid(X %*% weights)
10 pred = ROCR::prediction(y_5_3, myocarde$PRONO)
11 perf = ROCR::performance(pred, "tpr", "fpr")
12 plot(perf, col="blue", lwd=2)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 122
• 123. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
If p(x) = f( ω_0 + Σ_{h=1}^3 ω_h f_h( ω_{h,0} + x^T ω_h ) ), we want to solve, with back propagation,
ω* = argmin { (1/n) Σ_{i=1}^n ℓ(y_i, p(x_i)) }
For practical purposes, let us center all the variables
1 center = function(z) (z-mean(z))/sd(z)
The loss function is here
1 loss = function(weights){
2 weights_1 = weights[0+(1:7)]
3 weights_2 = weights[7+(1:7)]
4 weights_3 = weights[14+(1:7)]
5 weights_ = weights[21+1:4]
6 X1=X2=X3=as.matrix(myocarde[,1:7])
7 Z1 = center(X1 %*% weights_1)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 123
• 124. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
8 Z2 = center(X2 %*% weights_2)
9 Z3 = center(X3 %*% weights_3)
10 X = cbind(1, sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
11 mean((myocarde$PRONO - sigmoid(X %*% weights_))^2)}
@freakonometrics freakonometrics freakonometrics.hypotheses.org 124
• 125. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 library(factoextra)
2 pca = princomp(myocarde[,1:7])
3 W = get_pca_var(pca)$contrib
4 weights_0 = c(W[,1], W[,2], W[,3], c(-1, rep(1,3)/3))
5 weights_opt = optim(weights_0, loss)$par
6 weights_1 = weights_opt[0+(1:7)]
7 weights_2 = weights_opt[7+(1:7)]
8 weights_3 = weights_opt[14+(1:7)]
9 weights_ = weights_opt[21+1:4]
10 X1=X2=X3=as.matrix(myocarde[,1:7])
11 Z1 = center(X1 %*% weights_1)
12 Z2 = center(X2 %*% weights_2)
13 Z3 = center(X3 %*% weights_3)
14 X = cbind(1, sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
15 y_5_4 = sigmoid(X %*% weights_)
16 pred = ROCR::prediction(y_5_4, myocarde$PRONO)
17 perf = ROCR::performance(pred, "tpr", "fpr")
18 plot(perf, col="blue", lwd=2)
19 plot(perf0, add=TRUE, col="red")
@freakonometrics freakonometrics freakonometrics.hypotheses.org 125
• 126. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 library(nnet)
2 myocarde_minmax = myocarde
3 minmax = function(z) (z-min(z))/(max(z)-min(z))
4 for(j in 1:7) myocarde_minmax[,j] = minmax(myocarde_minmax[,j])
5 model_nnet = nnet(PRONO~., data=myocarde_minmax, size=3)
6 summary(model_nnet)
7 library(devtools)
8 source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
9 plot.nnet(model_nnet)
10 library(neuralnet)
11 model_nnet = neuralnet(formula(glm(PRONO~., data=myocarde_minmax)),
12 myocarde_minmax, hidden=3, act.fct = sigmoid)
13 plot(model_nnet)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 126
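To put the fitted network on the same ROC plot as before, one can refit the nnet model under a name of our own (model_nnet is overwritten by the neuralnet fit above); a sketch:

model_nnet1 = nnet(PRONO ~ ., data=myocarde_minmax, size=3)  # hypothetical name
y_nn = predict(model_nnet1, myocarde_minmax)
pred_nn = ROCR::prediction(y_nn, myocarde_minmax$PRONO)
perf_nn = ROCR::performance(pred_nn, "tpr", "fpr")
plot(perf_nn, col="blue", lwd=2)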
• 127. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Neural Nets
Of course, one can consider a multiple layer network,
p(x) = f( ω_0 + Σ_{h=1}^3 ω_h f_h( ω_{h,0} + z_h^T ω_h ) )
where
z_h = f( ω_{h,0} + Σ_{j=1}^{k_h} ω_{h,j} f_{h,j}( ω_{h,j,0} + x^T ω_{h,j} ) )
1 library(neuralnet)
2 model_nnet = neuralnet(formula(glm(PRONO~., data=myocarde_minmax)),
3 myocarde_minmax, hidden=3, act.fct = sigmoid)
4 plot(model_nnet)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 127
• 128. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
Linearly Separable sample [econometric notations]
Data (y1, x1), ..., (yn, xn) - with y ∈ {0,1} - are linearly separable if there are (β0, β) such that
- yi = 1 if β0 + xi^T β > 0
- yi = 0 if β0 + xi^T β < 0
Linearly Separable sample [ML notations]
Data (y1, x1), ..., (yn, xn) - with y ∈ {−1,+1} - are linearly separable if there are (b, ω) such that
- yi = +1 if b + ⟨xi, ω⟩ > 0
- yi = −1 if b + ⟨xi, ω⟩ < 0
or equivalently yi · (b + ⟨xi, ω⟩) > 0, ∀i.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 128
• 129. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
b + ⟨x, ω⟩ = 0 is a hyperplane (in R^p), orthogonal to ω.
Use m(x) = 1_{b+⟨x,ω⟩ ≥ 0} − 1_{b+⟨x,ω⟩ < 0} as classifier.
Problem : the equation (i.e. (b, ω)) is not unique!
Canonical form : min_{i=1,...,n} |b + ⟨xi, ω⟩| = 1
Problem : the solution here is still not unique!
Idea : use the widest (safety) margin γ.
Vapnik & Lerner (1963, Pattern recognition using generalized portrait method) or Cover (1965, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 129
• 130. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
Consider two points x−1 and x+1, one on each margin; then
γ = (1/2) ⟨ω/∥ω∥, x+1 − x−1⟩
It is minimal when b + ⟨x−1, ω⟩ = −1 and b + ⟨x+1, ω⟩ = +1, and therefore
γ = 1/∥ω∥
The optimization problem becomes
min_{(b,ω)} (1/2)∥ω∥²₂ s.t. yi · (b + ⟨xi, ω⟩) ≥ 1, ∀i,
a convex optimization problem with linear constraints.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 130
• 131. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
Consider the following problem:
min_{u∈R^p} h(u) s.t. gi(u) ≥ 0, ∀i = 1,...,n
where h is quadratic and the gi's are linear. The Lagrangian is L : R^p × R^n → R, defined as
L(u, α) = h(u) − Σ_{i=1}^n αi gi(u)
where the α's are the dual variables, and the dual function is Λ : α → L(u_α, α) = min_u {L(u, α)}, where u_α = argmin_u L(u, α).
One can solve the dual problem, max_α {Λ(α)} s.t. α ≥ 0. The solution is u* = u_{α*}. If gi(u_{α*}) > 0, then necessarily α*_i = 0 (see the Karush-Kuhn-Tucker (KKT) condition, α*_i · gi(u_{α*}) = 0).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 131
• 132. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
Here,
L(b, ω, α) = (1/2)∥ω∥² − Σ_{i=1}^n αi · [yi · (b + ⟨xi, ω⟩) − 1]
From the first order conditions,
∂L(b, ω, α)/∂ω = ω − Σ_{i=1}^n αi yi xi = 0, i.e. ω* = Σ_{i=1}^n α*_i yi xi
∂L(b, ω, α)/∂b = − Σ_{i=1}^n αi yi = 0, i.e. Σ_{i=1}^n αi yi = 0
and
Λ(α) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
@freakonometrics freakonometrics freakonometrics.hypotheses.org 132
• 133. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
min_α { (1/2) α^T Q α − 1^T α } s.t. αi ≥ 0 ∀i, and y^T α = 0
where Q = [Q_{i,j}] and Q_{i,j} = yi yj ⟨xi, xj⟩, and then
ω* = Σ_{i=1}^n α*_i yi xi and b* = −(1/2) [ min_{i:yi=+1} {⟨xi, ω*⟩} + max_{i:yi=−1} {⟨xi, ω*⟩} ]
Points xi such that α*_i > 0 are called support vectors: they satisfy yi · (b* + ⟨xi, ω*⟩) = 1.
Use m*(x) = 1_{b*+⟨x,ω*⟩ ≥ 0} − 1_{b*+⟨x,ω*⟩ < 0} as classifier.
Observe that γ = ( Σ_{i=1}^n α*_i )^{-1/2}
@freakonometrics freakonometrics freakonometrics.hypotheses.org 133
• 134. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
The optimization problem was
min_{(b,ω)} (1/2)∥ω∥²₂ s.t. yi · (b + ⟨xi, ω⟩) ≥ 1, ∀i,
which became
min_{(b,ω)} (1/2)∥ω∥²₂ + Σ_{i=1}^n αi · [1 − yi · (b + ⟨xi, ω⟩)],
or
min_{(b,ω)} (1/2)∥ω∥²₂ + penalty
@freakonometrics freakonometrics freakonometrics.hypotheses.org 134
• 135. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
Consider now the more general case where the sample is not linearly separable:
(⟨ω, xi⟩ + b) yi ≥ 1 becomes (⟨ω, xi⟩ + b) yi ≥ 1 − ξi
for some slack variables ξi, and penalize large slack variables, i.e. solve (for some cost C)
min_{ω,b} (1/2)∥ω∥² + C Σ_{i=1}^n ξi subject to ∀i, ξi ≥ 0 and (⟨ω, xi⟩ + b) yi ≥ 1 − ξi.
This is the soft-margin extension, see e1071::svm() or kernlab::ksvm()
@freakonometrics freakonometrics freakonometrics.hypotheses.org 135
• 136. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
The dual optimization problem is now
min_α { (1/2) α^T Q α − 1^T α } s.t. 0 ≤ αi ≤ C ∀i, and y^T α = 0
where Q = [Q_{i,j}] and Q_{i,j} = yi yj ⟨xi, xj⟩, and then
ω* = Σ_{i=1}^n α*_i yi xi and b* = −(1/2) [ min_{i:yi=+1} {⟨xi, ω*⟩} + max_{i:yi=−1} {⟨xi, ω*⟩} ]
Note further that the (primal) optimization problem can be written
min_{(b,ω)} (1/2)∥ω∥²₂ + Σ_{i=1}^n [1 − yi · (b + ⟨xi, ω⟩)]₊
where (1 − z)₊, the hinge loss, is a convex upper bound for the empirical error 1_{z ≤ 0}.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 136
• 137. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
One can also consider the kernel trick : xi^T xj is replaced by φ(xi)^T φ(xj) for some mapping φ,
K(xi, xj) = φ(xi)^T φ(xj)
For instance K(a, b) = (a^T b)³ = φ(a)^T φ(b), where φ(a1, a2) = (a1³, √3 a1² a2, √3 a1 a2², a2³).
Consider polynomial kernels K(a, b) = (1 + a^T b)^p, or a Gaussian kernel K(a, b) = exp(−(a − b)^T (a − b)).
Solve
max_{α ≥ 0} { Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n yi yj αi αj K(xi, xj) }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 137
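The feature map for the cubic kernel can be verified numerically (a sketch):

phi = function(a) c(a[1]^3, sqrt(3)*a[1]^2*a[2], sqrt(3)*a[1]*a[2]^2, a[2]^3)
a = c(1, 2); b = c(3, 1)
sum(a*b)^3           # kernel evaluation (a'b)^3, here 125
sum(phi(a)*phi(b))   # inner product in feature space, also 125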
  • 138. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine @freakonometrics freakonometrics freakonometrics.hypotheses.org 138
  • 139. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine Linear kernel @freakonometrics freakonometrics freakonometrics.hypotheses.org 139
  • 140. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine Polynomial kernel (degree 2) @freakonometrics freakonometrics freakonometrics.hypotheses.org 140
  • 141. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine Polynomial kernel (degree 3) @freakonometrics freakonometrics freakonometrics.hypotheses.org 141
  • 142. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine Polynomial kernel (degree 4) @freakonometrics freakonometrics freakonometrics.hypotheses.org 142
  • 143. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine Polynomial kernel (degree 5) @freakonometrics freakonometrics freakonometrics.hypotheses.org 143
  • 144. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine Polynomial kernel (degree 6) @freakonometrics freakonometrics freakonometrics.hypotheses.org 144
  • 145. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine Radial kernel @freakonometrics freakonometrics freakonometrics.hypotheses.org 145
  • 146. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine - Radial Kernel, impact of the cost C @freakonometrics freakonometrics freakonometrics.hypotheses.org 146
  • 147. Arthur Charpentier, Algorithms for Predictive Modeling - 2019 SVM : Support Vector Machine - Radial Kernel, tuning parameter γ @freakonometrics freakonometrics freakonometrics.hypotheses.org 147
• 148. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
The radial kernel is formed by taking an infinite sum over polynomial kernels...
K(x, y) = exp(−γ ∥x − y∥²) = ⟨ψ(x), ψ(y)⟩
where ψ is some R^n → R^∞ function, since
K(x, y) = exp(−γ ∥x − y∥²) = exp(−γ ∥x∥²) · exp(−γ ∥y∥²) · exp(2γ ⟨x, y⟩) = constant · exp(2γ ⟨x, y⟩)
i.e.
K(x, y) ∝ exp(2γ ⟨x, y⟩) = Σ_{k=0}^∞ (2γ ⟨x, y⟩)^k / k! = Σ_{k=0}^∞ [(2γ)^k / k!] Kk(x, y)
where Kk(x, y) = ⟨x, y⟩^k is the polynomial kernel of degree k.
If K = K1 + K2 with ψj : R^n → R^{dj}, then ψ : R^n → R^d with d ~ d1 + d2.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 148
• 149. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
A kernel is a measure of similarity between vectors: the larger the value of γ, the faster exp(−γ∥x − y∥²) decays, i.e. the closer two vectors must be to get a non-negligible similarity measure.
Is there a probabilistic interpretation? Platt (2000, Probabilities for SVM) suggested to use a logistic function over the SVM scores,
p(x) = exp[b + ⟨x, ω⟩] / (1 + exp[b + ⟨x, ω⟩])
@freakonometrics freakonometrics freakonometrics.hypotheses.org 149
• 150. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
SVM : Support Vector Machine
Let C denote the cost of mis-classification. The optimization problem becomes
min_{ω∈R^d, b∈R, ξ∈R^n} (1/2)∥ω∥² + C Σ_{i=1}^n ξi
subject to yi · (ω^T xi + b) ≥ 1 − ξi, with ξi ≥ 0, ∀i = 1,...,n
1 n = length(myocarde[,"PRONO"])
2 myocarde0 = myocarde
3 myocarde0$PRONO = myocarde$PRONO*2-1
4 C = .5
5 f = function(param){
6 w = param[1:7]      # parameters ordered (w, xi, b), matching the constraint matrix below
7 xi = param[7+1:n]
8 b = param[7+n+1]
9 .5*sum(w^2) + C*sum(xi)}
10 grad_f = function(param){
11 w = param[1:7]
12 xi = param[7+1:n]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 150
• 151. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
13 b = param[7+n+1]
14 c(w, rep(C,n), 0)}  # gradient of .5*sum(w^2) is w; each slack contributes C; b contributes 0
The (linear) constraints are written as Uθ − c ≥ 0
1 Ui = rbind(cbind(myocarde0[,"PRONO"]*as.matrix(myocarde[,1:7]), diag(n), myocarde0[,"PRONO"]),
2 cbind(matrix(0,n,7), diag(n), matrix(0,n,1)))
3 Ci = c(rep(1,n), rep(0,n))
then use
1 constrOptim(theta=p_init, f, grad_f, ui=Ui, ci=Ci)
One can recognize, in the separable case but also in the non-separable case, a classic quadratic program
min_{z∈R^d} (1/2) z^T D z − d^T z subject to A z ≥ b.
1 library(quadprog)
2 eps = 5e-4
@freakonometrics freakonometrics freakonometrics.hypotheses.org 151
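The starting point p_init is not shown on the slides; constrOptim() requires a strictly feasible start. A hypothetical choice (ω = 0, b = 0 and all slacks ξi = 2, which satisfies every constraint strictly):

p_init = c(rep(0,7), rep(2,n), 0)   # hypothetical interior point, ordered (w, xi, b)
opt = constrOptim(theta=p_init, f=f, grad=grad_f, ui=Ui, ci=Ci)
opt$par[1:7]                        # estimated omega
opt$par[7+n+1]                      # estimated intercept b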
• 152. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
3 y = myocarde[,"PRONO"]*2-1
4 X = as.matrix(cbind(1, myocarde[,1:7]))
5 n = length(y)
6 D = diag(n+7+1)
7 diag(D)[8+0:n] = 0
8 d = matrix(c(rep(0,7), 0, rep(C,n)), nrow=n+7+1)
9 A = Ui
10 b = Ci
@freakonometrics freakonometrics freakonometrics.hypotheses.org 152
• 153. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Primal Problem
1 sol = solve.QP(D+eps*diag(n+7+1), d, t(A), b, meq=1)
2 qpsol = sol$solution
3 (omega = qpsol[1:7])
4 [1] -0.106642005446 -0.002026198103 -0.022513312261 -0.018958578746 -0.023105767847 -0.018958578746 -1.080638988521
5 (b = qpsol[n+7+1])
6 [1] 997.6289927
Given an observation x, the prediction is y = sign(ω^T x + b)
1 y_pred = 2*((as.matrix(myocarde0[,1:7])%*%omega + b) > 0) - 1
@freakonometrics freakonometrics freakonometrics.hypotheses.org 153
• 154. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Dual Problem
The Lagrangian of the separable problem can be written, introducing Lagrange multipliers α ∈ R^n, α ≥ 0, as
L(ω, b, α) = (1/2)∥ω∥² − Σ_{i=1}^n αi [yi(ω^T xi + b) − 1]
Somehow, αi represents the influence of the observation (yi, xi). Consider the dual problem, with G = [G_{ij}] and G_{ij} = yi yj xj^T xi:
min_{α∈R^n} (1/2) α^T G α − 1^T α subject to y^T α = 0, α ≥ 0.
The Lagrangian of the non-separable problem can be written introducing Lagrange multipliers α, β ∈ R^n, α, β ≥ 0,
@freakonometrics freakonometrics freakonometrics.hypotheses.org 154
• 155. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Dual Problem
and define the Lagrangian L(ω, b, ξ, α, β) as
(1/2)∥ω∥² + C Σ_{i=1}^n ξi − Σ_{i=1}^n αi [yi(ω^T xi + b) − 1 + ξi] − Σ_{i=1}^n βi ξi
Somehow, αi represents the influence of the observation (yi, xi). The dual problem becomes, with G = [G_{ij}] and G_{ij} = yi yj xj^T xi,
min_{α∈R^n} (1/2) α^T G α − 1^T α subject to y^T α = 0, α ≥ 0 and α ≤ C.
As previously, one can also use quadratic programming.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 155
• 156. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Dual Problem
1 library(quadprog)
2 eps = 5e-4
3 y = myocarde[,"PRONO"]*2-1
4 X = as.matrix(cbind(1, myocarde[,1:7]))
5 n = length(y)
6 Q = sapply(1:n, function(i) y[i]*t(X)[,i])
7 D = t(Q)%*%Q
8 d = matrix(1, nrow=n)
9 A = rbind(y, diag(n), -diag(n))
10 C = .5
11 b = c(0, rep(0,n), rep(-C,n))
12 sol = solve.QP(D+eps*diag(n), d, t(A), b, meq=1, factorized=FALSE)
13 qpsol = sol$solution
@freakonometrics freakonometrics freakonometrics.hypotheses.org 156
• 157. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Dual Problem
The two problems are related, in the sense that, for all x,
ω^T x + b = Σ_{i=1}^n αi yi (x^T xi) + b
To recover the solution of the primal problem, ω = Σ_{i=1}^n αi yi xi, thus
1 omega = apply(qpsol*y*X, 2, sum)
2 omega
3 1 FRCAR INCAR INSYS
4 0.000000000000000243907 0.0550138658 -0.092016323 0.3609571899422
5 PRDIA PAPUL PVENT REPUL
6 -0.109401796528869235669 -0.0485213403 -0.066005864 0.0010093656567
More generally,
1 svm.fit = function(X, y, C=NULL) {
2 n.samples = nrow(X)
3 n.features = ncol(X)
4 K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
5 for (i in 1:n.samples){
6 for (j in 1:n.samples){
7 K[i,j] = X[i,] %*% X[j,] }}
@freakonometrics freakonometrics freakonometrics.hypotheses.org 157
• 158. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
8 Dmat = outer(y,y) * K
9 Dmat = as.matrix(nearPD(Dmat)$mat)
10 dvec = rep(1, n.samples)
11 Amat = rbind(y, diag(n.samples), -1*diag(n.samples))
12 bvec = c(0, rep(0, n.samples), rep(-C, n.samples))
13 res = solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)
14 a = res$solution
15 bomega = apply(a*y*X, 2, sum)
16 return(bomega)
17 }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 158
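A sketch of a call (nearPD() is in the Matrix package, which must be loaded; quadprog was loaded above; the intercept is handled through the column of ones, as earlier):

library(Matrix)
X = as.matrix(cbind(1, myocarde[,1:7]))
y = myocarde$PRONO*2 - 1
bomega = svm.fit(X, y, C=.5)
y_pred = sign(X %*% bomega)   # predicted labels in {-1,+1}
table(y_pred, y)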
• 159. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Dual Problem
To relate this dual optimization problem to OLS, recall that y = x^T ω + ε, so that ŷ = x^T ω̂, where ω̂ = [X^T X]^{-1} X^T y.
But one can also write
ŷ = x^T ω̂ = Σ_{i=1}^n α̂i · x^T xi
where α̂ = X[X^T X]^{-1} ω̂, or conversely, ω̂ = X^T α̂.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 159
• 160. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Kernel Approach
A positive kernel on X is a function K : X × X → R, symmetric, and such that for any n, ∀α1, ..., αn and ∀x1, ..., xn,
Σ_{i=1}^n Σ_{j=1}^n αi αj K(xi, xj) ≥ 0
For example, the linear kernel is K(xi, xj) = xi^T xj. That's what we've been using here, so far. One can also define the product kernel K(xi, xj) = κ(xi) · κ(xj), where κ is some function X → R.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 160
• 161. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Support Vector Machine (SVM) - Kernel Approach
Finally, the Gaussian kernel is K(xi, xj) = exp(−∥xi − xj∥²). Since it is a function of ∥xi − xj∥, it is also called a radial kernel.
1 linear.kernel = function(x1, x2) {
2 return(x1%*%x2)
3 }
4 svm.fit = function(X, y, FUN=linear.kernel, C=NULL) {
5 n.samples = nrow(X)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 161
• 162. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
6 n.features = ncol(X)
7 K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
8 for (i in 1:n.samples){
9 for (j in 1:n.samples){
10 K[i,j] = FUN(X[i,], X[j,])
11 }
12 }
13 Dmat = outer(y,y) * K
14 Dmat = as.matrix(nearPD(Dmat)$mat)
15 dvec = rep(1, n.samples)
16 Amat = rbind(y, diag(n.samples), -1*diag(n.samples))
17 bvec = c(0, rep(0, n.samples), rep(-C, n.samples))
18 res = solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)
19 a = res$solution
20 bomega = apply(a*y*X, 2, sum)
21 return(bomega)
22 }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 162
• 163. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 SVM2 = ksvm(y ~ x1 + x2, data = df0, C=2, kernel = "vanilladot", prob.model=TRUE, type="C-svc")
2 pred_SVM2 = function(x,y){
3 return(predict(SVM2, newdata=data.frame(x1=x, x2=y), type="probabilities")[,2])}
4 plot(df$x1, df$x2, pch=c(1,19)[1+(df$y=="1")],
5 cex=1.5, xlab="",
6 ylab="", xlim=c(0,1), ylim=c(0,1))
7 vu = seq(-.1, 1.1, length=251)
8 vv = outer(vu, vu, function(x,y) pred_SVM2(x,y))
9 contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5, col="red")
@freakonometrics freakonometrics freakonometrics.hypotheses.org 163
• 164. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 SVM3 = ksvm(y ~ x1 + x2, data = df0, C=2, kernel = "rbfdot", prob.model=TRUE, type="C-svc")
2 pred_SVM3 = function(x,y){
3 return(predict(SVM3, newdata=data.frame(x1=x, x2=y), type="probabilities")[,2])}
4 plot(df$x1, df$x2, pch=c(1,19)[1+(df$y=="1")],
5 cex=1.5, xlab="",
6 ylab="", xlim=c(0,1), ylim=c(0,1))
7 vu = seq(-.1, 1.1, length=251)
8 vv = outer(vu, vu, function(x,y) pred_SVM3(x,y))
9 contour(vu, vu, vv, add=TRUE, lwd=2, levels=.5, col="red")
@freakonometrics freakonometrics freakonometrics.hypotheses.org 164
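kernlab can also cross-validate directly, via the cross argument of ksvm() (a sketch; with only ten points the estimate is very noisy):

SVM_lin = ksvm(y ~ x1 + x2, data=df0, C=2, kernel="vanilladot", type="C-svc", cross=5)
SVM_rbf = ksvm(y ~ x1 + x2, data=df0, C=2, kernel="rbfdot", type="C-svc", cross=5)
cross(SVM_lin)   # 5-fold cross-validated error, linear kernel
cross(SVM_rbf)   # 5-fold cross-validated error, radial kernel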
• 165. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Classification Trees
1 library(rpart)
2 cart = rpart(PRONO~., data=myocarde)
3 library(rpart.plot)
4 prp(cart, type=2, extra=1)
A (binary) split is based on one specific variable - say xj - and a cutoff, say s. Then, there are two options:
• either xi,j ≤ s, then observation i goes on the left, in IL
• or xi,j > s, then observation i goes on the right, in IR
Thus, I = IL ∪ IR.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 165
• 166. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Classification Trees
The Gini index for node I is defined as
G(I) = − Σ_{y∈{0,1}} p_y (1 − p_y)
where p_y is the proportion of individuals of type y in the leaf,
G(I) = − Σ_{y∈{0,1}} (n_{y,I}/n_I) (1 − n_{y,I}/n_I)
1 gini = function(y, classe){
2 T = table(y, classe)
3 nx = apply(T, 2, sum)
4 n = sum(T)
5 pxy = T/matrix(rep(nx, each=2), nrow=2)
6 omega = matrix(rep(nx, each=2), nrow=2)/n
7 g = -sum(omega*pxy*(1-pxy))
8 return(g)}
@freakonometrics freakonometrics freakonometrics.hypotheses.org 166
• 167. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Classification Trees
1 -2*mean(myocarde$PRONO)*(1-mean(myocarde$PRONO))
2 [1] -0.4832375
3 gini(y=myocarde$PRONO, classe=myocarde$PRONO<Inf)
4 [1] -0.4832375
5 gini(y=myocarde$PRONO, classe=myocarde[,1]<=100)
6 [1] -0.4640415
@freakonometrics freakonometrics freakonometrics.hypotheses.org 167
• 168. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
Classification : Classification Trees
If we split, define the index
G(IL, IR) = − Σ_{x∈{L,R}} (n_{Ix}/n_I) Σ_{y∈{0,1}} (n_{y,Ix}/n_{Ix}) (1 − n_{y,Ix}/n_{Ix})
The entropic measure is
E(I) = − Σ_{y∈{0,1}} (n_{y,I}/n_I) log(n_{y,I}/n_I)
1 entropy = function(y, classe){
2 T = table(y, classe)
3 nx = apply(T, 2, sum)
4 pxy = T/matrix(rep(nx, each=2), nrow=2)
5 omega = matrix(rep(nx, each=2), nrow=2)/sum(T)
6 g = sum(omega*pxy*log(pxy))
7 return(g)}
@freakonometrics freakonometrics freakonometrics.hypotheses.org 168
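The entropy criterion can be evaluated on the same kind of splits as the Gini index (a sketch; note that the function above returns NaN on a pure leaf, since 0*log(0) is NaN in R, so a guard such as pxy = pmax(pxy, 1e-12) may be needed):

entropy(y=myocarde$PRONO, classe=(myocarde$PRONO < Inf))  # single-leaf baseline
entropy(y=myocarde$PRONO, classe=(myocarde$INSYS < 19))   # the split studied below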
• 169. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 mat_gini = mat_v = matrix(NA, 7, 101)
2 for(v in 1:7){
3 variable = myocarde[,v]
4 v_seuil = seq(quantile(myocarde[,v], 6/length(myocarde[,v])),
5 quantile(myocarde[,v], 1-6/length(myocarde[,v])), length=101)
6 mat_v[v,] = v_seuil
7 for(i in 1:101){
8 CLASSE = variable <= v_seuil[i]
9 mat_gini[v,i] = gini(y=myocarde$PRONO, classe=CLASSE)}}
10 -(gini(y=myocarde$PRONO, classe=(myocarde[,3]<19)) -
11 gini(y=myocarde$PRONO, classe=(myocarde[,3]<Inf)))/
12 gini(y=myocarde$PRONO, classe=(myocarde[,3]<Inf))
13 [1] 0.5862131
@freakonometrics freakonometrics freakonometrics.hypotheses.org 169
• 170. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 idx = which(myocarde$INSYS<19)
2 mat_gini = mat_v = matrix(NA, 7, 101)
3 for(v in 1:7){
4 variable = myocarde[idx,v]
5 v_seuil = seq(quantile(myocarde[idx,v], 7/length(myocarde[idx,v])),
6 quantile(myocarde[idx,v], 1-7/length(myocarde[idx,v])), length=101)
7 mat_v[v,] = v_seuil
8 for(i in 1:101){
9 CLASSE = variable <= v_seuil[i]
10 mat_gini[v,i] = gini(y=myocarde$PRONO[idx], classe=CLASSE)}}
11 par(mfrow=c(3,2))
12 for(v in 2:7){
13 plot(mat_v[v,], mat_gini[v,])
14 }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 170
• 171. Arthur Charpentier, Algorithms for Predictive Modeling - 2019
1 idx = which(myocarde$INSYS>=19)
2 mat_gini = mat_v = matrix(NA, 7, 101)
3 for(v in 1:7){
4 variable = myocarde[idx,v]
5 v_seuil = seq(quantile(myocarde[idx,v], 6/length(myocarde[idx,v])),
6 quantile(myocarde[idx,v], 1-6/length(myocarde[idx,v])), length=101)
7 mat_v[v,] = v_seuil
8 for(i in 1:101){
9 CLASSE = variable <= v_seuil[i]
10 mat_gini[v,i] = gini(y=myocarde$PRONO[idx], classe=CLASSE)}}
11 par(mfrow=c(3,2))
12 for(v in 2:7){
13 plot(mat_v[v,], mat_gini[v,])
14 }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 171