Arthur Charpentier, SIDE Summer School, July 2019
# 3 Regularization & Penalized Regression
Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
@freakonometrics freakonometrics freakonometrics.hypotheses.org 1
Arthur Charpentier, SIDE Summer School, July 2019
Linear Model and Variable Selection
Let $s$ denote a subset of $\{0, 1, \cdots, p\}$, with cardinality $|s|$.
$X_s$ is the matrix whose columns are the $x_j$ with $j \in s$.
Consider the model $Y = X_s\beta_s + \eta$, so that $\widehat{\beta}_s = (X_s^\top X_s)^{-1} X_s^\top y$.
In general, $\widehat{\beta}_s \neq (\widehat{\beta})_s$.
$R^2$ is usually not a good measure since $R^2(s) \leq R^2(t)$ when $s \subset t$.
Some use the adjusted $R^2$, $\overline{R}^2(s) = 1 - \dfrac{n-1}{n-|s|}\big(1 - R^2(s)\big)$
The mean squared error is $\mathrm{mse}(s) = \mathbb{E}\big[\|X\beta - X_s\widehat{\beta}_s\|^2\big] = \mathbb{E}\big[\mathrm{RSS}(s)\big] - n\sigma^2 + 2|s|\sigma^2$
Define Mallows' $C_p$ as $C_p(s) = \dfrac{\mathrm{RSS}(s)}{\widehat{\sigma}^2} - n + 2|s|$
Rule of thumb: the model with variables $s$ is valid if $C_p(s) \leq |s|$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 2
Arthur Charpentier, SIDE Summer School, July 2019
Linear Model and Variable Selection
In a linear model,
$\log L(\beta, \sigma^2) = -\dfrac{n}{2}\log\sigma^2 - \dfrac{n}{2}\log(2\pi) - \dfrac{1}{2\sigma^2}\,\|y - X\beta\|^2$
and
$\log L(\widehat{\beta}_s, \widehat{\sigma}^2_s) = -\dfrac{n}{2}\log\dfrac{\mathrm{RSS}(s)}{n} - \dfrac{n}{2}\big[1 + \log(2\pi)\big]$
It is necessary to penalize overly complex models.
Akaike's AIC: $\mathrm{AIC}(s) = \dfrac{n}{2}\log\dfrac{\mathrm{RSS}(s)}{n} + \dfrac{n}{2}\big[1 + \log(2\pi)\big] + 2|s|$
Schwarz's BIC: $\mathrm{BIC}(s) = \dfrac{n}{2}\log\dfrac{\mathrm{RSS}(s)}{n} + \dfrac{n}{2}\big[1 + \log(2\pi)\big] + |s|\log n$
Exhaustive search over all $2^{p+1}$ models... too complicated.
Stepwise procedures, forward or backward... neither very stable nor very satisfactory.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 3
Arthur Charpentier, SIDE Summer School, July 2019
Linear Model and Variable Selection
For variable selection, use the classical stats::step function, or leaps::regsubsets for best subset, forward stepwise and backward stepwise selection.
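A minimal R sketch of these tools; the data frame df and the response y are placeholders, not objects from the slides.
library(leaps)
# best-subset search, up to 10 variables; summary() reports Mallows' Cp for each model size
best <- regsubsets(y ~ ., data = df, nvmax = 10)
summary(best)$cp
# backward stepwise with a BIC penalty (k = log(n)); k = 2 would give AIC
full <- lm(y ~ ., data = df)
step(full, direction = "backward", k = log(nrow(df)))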
For leave-one-out cross validation, we can write
$\dfrac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_{(i)})^2 = \dfrac{1}{n}\sum_{i=1}^n \left(\dfrac{y_i - \widehat{y}_i}{1 - H_{i,i}}\right)^2$
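A quick numerical check of this identity on simulated data (base R only; none of it comes from the slides).
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
y <- X %*% c(1, -2, 0.5) + rnorm(n)
fit <- lm(y ~ X)
h <- hatvalues(fit)                                # diagonal of the hat matrix H
loo_shortcut <- mean((residuals(fit) / (1 - h))^2)
# brute-force leave-one-out, for comparison
loo_brute <- mean(sapply(1:n, function(i) {
  f <- lm(y[-i] ~ X[-i, ])
  (y[i] - sum(coef(f) * c(1, X[i, ])))^2
}))
c(loo_shortcut, loo_brute)                         # identical up to numerical error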
Heuristically, $(y_i - \widehat{y}_i)^2$ underestimates the true prediction error, and the underestimation is large when the correlation between $y_i$ and $\widehat{y}_i$ is high.
One can use $\mathrm{Cov}[\widehat{y}, y]$, e.g. in Mallows' $C_p$,
$C_p = \dfrac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2 + \dfrac{2}{n}\,\widehat{\sigma}^2\,\widehat{p}\quad\text{where}\quad \widehat{p} = \dfrac{1}{\widehat{\sigma}^2}\,\mathrm{trace}\big[\mathrm{Cov}(\widehat{y}, y)\big]$
With Gaussian errors, AIC and Mallows' $C_p$ are asymptotically equivalent.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 4
Arthur Charpentier, SIDE Summer School, July 2019
Penalized Inference and Shrinkage
Consider a parametric model, with true (unknown) parameter $\theta$, then
$\mathrm{mse}(\widehat{\theta}) = \mathbb{E}\big[(\widehat{\theta} - \theta)^2\big] = \underbrace{\mathbb{E}\big[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}])^2\big]}_{\text{variance}} + \underbrace{\mathbb{E}\big[(\mathbb{E}[\widehat{\theta}] - \theta)^2\big]}_{\text{bias}^2}$
One can think of a shrinkage of an unbiased estimator.
Let $\widetilde{\theta}$ denote an unbiased estimator of $\theta$. Then
$\widehat{\theta} = \dfrac{\theta^2}{\theta^2 + \mathrm{mse}(\widetilde{\theta})}\cdot\widetilde{\theta} = \widetilde{\theta} - \underbrace{\dfrac{\mathrm{mse}(\widetilde{\theta})}{\theta^2 + \mathrm{mse}(\widetilde{\theta})}\cdot\widetilde{\theta}}_{\text{penalty}}$
satisfies $\mathrm{mse}(\widehat{\theta}) \leq \mathrm{mse}(\widetilde{\theta})$.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 5
[Figure: variance]
Arthur Charpentier, SIDE Summer School, July 2019
Normalization: Euclidean $\ell_2$ vs. Mahalanobis
We want to penalize complicated models: if $\beta_k$ is "too small", we prefer to have $\widehat{\beta}_k = 0$.
Instead of $d(x, y) = \sqrt{(x - y)^\top (x - y)}$, use $d_\Sigma(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$
[Figure: contours of the objective in the (beta1, beta2) plane, Euclidean vs. Mahalanobis geometry]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 6
Arthur Charpentier, SIDE Summer School, July 2019
Linear Regression Shortcoming
Least Squares Estimator $\widehat{\beta} = (X^\top X)^{-1} X^\top y$
Unbiased Estimator $\mathbb{E}[\widehat{\beta}] = \beta$
Variance $\mathrm{Var}[\widehat{\beta}] = \sigma^2 (X^\top X)^{-1}$,
which can be (extremely) large when $\det[X^\top X] \sim 0$.
$X = \begin{pmatrix} 1 & -1 & 2 \\ 1 & 0 & 1 \\ 1 & 2 & -1 \\ 1 & 1 & 0 \end{pmatrix}$ then $X^\top X = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 6 & -4 \\ 2 & -4 & 6 \end{pmatrix}$ while $X^\top X + \mathbb{I} = \begin{pmatrix} 5 & 2 & 2 \\ 2 & 7 & -4 \\ 2 & -4 & 7 \end{pmatrix}$
eigenvalues: $\{10, 6, 0\}$ and $\{11, 7, 1\}$
Ad-hoc strategy: use $X^\top X + \lambda\mathbb{I}$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 7
Arthur Charpentier, SIDE Summer School, July 2019
Ridge Regression
... like least squares, but it shrinks the estimated coefficients towards 0.
$\widehat{\beta}^{\text{ridge}}_\lambda = \operatorname{argmin}\left\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}$
$\widehat{\beta}^{\text{ridge}}_\lambda = \operatorname{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{=\text{criteria}} + \underbrace{\lambda\,\|\beta\|_2^2}_{=\text{penalty}}\Big\}$
$\lambda \geq 0$ is a tuning parameter.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 8
Arthur Charpentier, SIDE Summer School, July 2019
Ridge Regression
van Wieringen (2018, Lecture Notes on Ridge Regression)
Ridge Estimator (OLS)
$\widehat{\beta}^{\text{ridge}}_\lambda = \operatorname{argmin}\left\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}$
Ridge Estimator (GLM)
$\widehat{\beta}^{\text{ridge}}_\lambda = \operatorname{argmin}\left\{-\sum_{i=1}^n \log f\big(y_i \mid \mu_i = g^{-1}(x_i^\top\beta)\big) + \dfrac{\lambda}{2}\sum_{j=1}^p \beta_j^2\right\}$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 9
Arthur Charpentier, SIDE Summer School, July 2019
Ridge Regression
$\widehat{\beta}^{\text{ridge}}_\lambda = \operatorname{argmin}\big\{\|y - (\beta_0 + X\beta)\|_2^2 + \lambda\,\|\beta\|_2^2\big\}$
can be seen as a constrained optimization problem
$\widehat{\beta}^{\text{ridge}}_\lambda = \operatorname{argmin}_{\|\beta\|_2^2 \leq h_\lambda}\big\{\|y - (\beta_0 + X\beta)\|_2^2\big\}$
Explicit solution
$\widehat{\beta}_\lambda = (X^\top X + \lambda\mathbb{I})^{-1} X^\top y$
If $\lambda \to 0$, $\widehat{\beta}^{\text{ridge}}_0 = \widehat{\beta}^{\text{ols}}$; if $\lambda \to \infty$, $\widehat{\beta}^{\text{ridge}}_\infty = 0$.
[Figures: ridge constraint and contours of the objective in the (beta1, beta2) plane]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 10
Arthur Charpentier, SIDE Summer School, July 2019
Ridge Regression
This penalty can be seen as rather unfair if components of $x$ are not expressed on the same scale
• center: $\overline{x}_j = 0$, then $\widehat{\beta}_0 = \overline{y}$
• scale: $x_j^\top x_j = 1$
Then compute
$\widehat{\beta}^{\text{ridge}}_\lambda = \operatorname{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{=\text{loss}} + \underbrace{\lambda\,\|\beta\|_2^2}_{=\text{penalty}}\Big\}$
[Figures: contours of the objective in the (beta1, beta2) plane]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 11
Arthur Charpentier, SIDE Summer School, July 2019
Ridge Regression
Observe that if $x_{j_1} \perp x_{j_2}$, then
$\widehat{\beta}^{\text{ridge}}_\lambda = [1 + \lambda]^{-1}\,\widehat{\beta}^{\text{ols}}$
which explains the relationship with shrinkage. But generally, it is not the case...
Smaller mse
There exists $\lambda$ such that $\mathrm{mse}[\widehat{\beta}^{\text{ridge}}_\lambda] \leq \mathrm{mse}[\widehat{\beta}^{\text{ols}}]$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 12
Arthur Charpentier, SIDE Summer School, July 2019
Ridge Regression
$L_\lambda(\beta) = \sum_{i=1}^n (y_i - \beta_0 - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2$
$\dfrac{\partial L_\lambda(\beta)}{\partial\beta} = -2X^\top y + 2(X^\top X + \lambda\mathbb{I})\beta$
$\dfrac{\partial^2 L_\lambda(\beta)}{\partial\beta\,\partial\beta^\top} = 2(X^\top X + \lambda\mathbb{I})$
where $X^\top X$ is a positive semi-definite matrix and $\lambda\mathbb{I}$ is a positive definite matrix, and
$\widehat{\beta}_\lambda = (X^\top X + \lambda\mathbb{I})^{-1} X^\top y$
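A short sketch of this closed form on simulated data (not from the slides), showing how the coefficients shrink as $\lambda$ grows.
set.seed(1)
n <- 100; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))                  # centered and scaled design
y <- X %*% c(3, -2, 0, 1) + rnorm(n)
ridge_beta <- function(lambda)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
round(sapply(c(0, 1, 10, 100, 1000), ridge_beta), 3)    # columns shrink towards 0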
@freakonometrics freakonometrics freakonometrics.hypotheses.org 13
Arthur Charpentier, SIDE Summer School, July 2019
The Bayesian Interpretation
From a Bayesian perspective,
$\underbrace{\mathbb{P}[\theta\mid y]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[y\mid\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{prior}}$
i.e. $\log\mathbb{P}[\theta\mid y] = \underbrace{\log\mathbb{P}[y\mid\theta]}_{\text{log likelihood}} + \underbrace{\log\mathbb{P}[\theta]}_{\text{penalty}}$ (up to an additive constant)
If $\beta$ has a $\mathcal{N}(0, \tau^2\mathbb{I})$ prior distribution, then its posterior distribution has mean
$\mathbb{E}[\beta\mid y, X] = \Big(X^\top X + \dfrac{\sigma^2}{\tau^2}\,\mathbb{I}\Big)^{-1} X^\top y.$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 14
Arthur Charpentier, SIDE Summer School, July 2019
Properties of the Ridge Estimator
$\widehat{\beta}_\lambda = (X^\top X + \lambda\mathbb{I})^{-1} X^\top y$
$\mathbb{E}[\widehat{\beta}_\lambda] = X^\top X(\lambda\mathbb{I} + X^\top X)^{-1}\beta,$
i.e. $\mathbb{E}[\widehat{\beta}_\lambda] \neq \beta$ in general.
Observe that $\mathbb{E}[\widehat{\beta}_\lambda] \to 0$ as $\lambda \to \infty$.
Ridge & Shrinkage
Assume that $X$ is an orthogonal design matrix, i.e. $X^\top X = \mathbb{I}$, then $\widehat{\beta}_\lambda = (1 + \lambda)^{-1}\,\widehat{\beta}^{\text{ols}}$.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 15
Arthur Charpentier, SIDE Summer School, July 2019
Properties of the Ridge Estimator
Set $W_\lambda = (\mathbb{I} + \lambda[X^\top X]^{-1})^{-1}$. One can prove that
$W_\lambda\,\widehat{\beta}^{\text{ols}} = \widehat{\beta}_\lambda.$
Thus,
$\mathrm{Var}[\widehat{\beta}_\lambda] = W_\lambda\,\mathrm{Var}[\widehat{\beta}^{\text{ols}}]\,W_\lambda^\top$
and
$\mathrm{Var}[\widehat{\beta}_\lambda] = \sigma^2 (X^\top X + \lambda\mathbb{I})^{-1} X^\top X\,\big[(X^\top X + \lambda\mathbb{I})^{-1}\big]^\top.$
Observe that
$\mathrm{Var}[\widehat{\beta}^{\text{ols}}] - \mathrm{Var}[\widehat{\beta}_\lambda] = \sigma^2\,W_\lambda\big[2\lambda(X^\top X)^{-2} + \lambda^2(X^\top X)^{-3}\big]W_\lambda^\top \geq 0.$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 16
Arthur Charpentier, SIDE Summer School, July 2019
Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than that of the OLS estimator.
If $X$ is an orthogonal design matrix, $\mathrm{Var}[\widehat{\beta}_\lambda] = \sigma^2(1 + \lambda)^{-2}\,\mathbb{I}$.
$\mathrm{mse}[\widehat{\beta}_\lambda] = \sigma^2\,\mathrm{trace}\big(W_\lambda (X^\top X)^{-1} W_\lambda^\top\big) + \beta^\top(W_\lambda - \mathbb{I})^\top(W_\lambda - \mathbb{I})\beta.$
If $X$ is an orthogonal design matrix,
$\mathrm{mse}[\widehat{\beta}_\lambda] = \dfrac{p\sigma^2}{(1 + \lambda)^2} + \dfrac{\lambda^2}{(1 + \lambda)^2}\,\beta^\top\beta$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 17
[Figure: confidence ellipsoids in the (β1, β2) plane]
Arthur Charpentier, SIDE Summer School, July 2019
Properties of the Ridge Estimator
$\mathrm{mse}[\widehat{\beta}_\lambda] = \dfrac{p\sigma^2}{(1 + \lambda)^2} + \dfrac{\lambda^2}{(1 + \lambda)^2}\,\beta^\top\beta$
is minimal for
$\lambda^\star = \dfrac{p\sigma^2}{\beta^\top\beta}$
Note that there exists $\lambda > 0$ such that $\mathrm{mse}[\widehat{\beta}_\lambda] < \mathrm{mse}[\widehat{\beta}_0] = \mathrm{mse}[\widehat{\beta}^{\text{ols}}]$.
Ridge regression is obtained using glmnet::glmnet(..., alpha = 0) - and
glmnet::cv.glmnet for cross validation
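A usage sketch; x (an n × p numeric matrix) and y (a numeric vector) are placeholders.
library(glmnet)
cv_ridge <- cv.glmnet(x, y, alpha = 0)     # alpha = 0 : ridge penalty
cv_ridge$lambda.min                        # CV-optimal lambda
coef(cv_ridge, s = "lambda.min")
plot(cv_ridge)                             # CV error as a function of log(lambda)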
@freakonometrics freakonometrics freakonometrics.hypotheses.org 18
Arthur Charpentier, SIDE Summer School, July 2019
SVD decomposition
For any $m \times n$ matrix $A$, there are orthogonal matrices $U$ ($m \times m$), $V$ ($n \times n$) and a "diagonal" matrix $\Sigma$ ($m \times n$) such that $A = U\Sigma V^\top$, or $AV = U\Sigma$.
Hence, there exists a special orthonormal set of vectors (i.e. the columns of $V$) that is mapped by the matrix $A$ into an orthonormal set of vectors (i.e. the columns of $U$).
Let $r = \mathrm{rank}(A)$, then $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$ (called the dyadic decomposition of $A$).
Observe that it can be used to compute (e.g.) the Frobenius norm of $A$,
$\|A\| = \sqrt{\textstyle\sum a_{i,j}^2} = \sqrt{\sigma_1^2 + \cdots + \sigma_{\min\{m,n\}}^2}.$
Further $A^\top A = V\Sigma^\top\Sigma V^\top$ while $AA^\top = U\Sigma\Sigma^\top U^\top$.
Hence, the $\sigma_i^2$'s are related to the eigenvalues of $A^\top A$ and $AA^\top$, and $u_i$, $v_i$ are the associated eigenvectors.
Golub & Reinsch (1970, Singular Value Decomposition and Least Squares Solutions)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 19
Arthur Charpentier, SIDE Summer School, July 2019
SVD decomposition
Consider the singular value decomposition of $X$, $X = UDV^\top$. Then
$\widehat{\beta}^{\text{ols}} = V D^{-2} D\, U^\top y$
$\widehat{\beta}_\lambda = V (D^2 + \lambda\mathbb{I})^{-1} D\, U^\top y$
Observe that
$D_{i,i}^{-1} \geq \dfrac{D_{i,i}}{D_{i,i}^2 + \lambda}$
hence, the ridge penalty shrinks singular values.
Set now $R = UD$ ($n \times n$ matrix), so that $X = RV^\top$,
$\widehat{\beta}_\lambda = V (R^\top R + \lambda\mathbb{I})^{-1} R^\top y$
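A sketch checking this SVD representation numerically (simulated X and y, not from the slides).
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3)); y <- rnorm(100)
lambda <- 2
sv <- svd(X)                                            # X = U D V'
beta_svd   <- sv$v %*% ((sv$d / (sv$d^2 + lambda)) * crossprod(sv$u, y))
beta_solve <- solve(crossprod(X) + lambda * diag(3), crossprod(X, y))
cbind(beta_svd, beta_solve)                             # equal up to numerical error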
@freakonometrics freakonometrics freakonometrics.hypotheses.org 20
Arthur Charpentier, SIDE Summer School, July 2019
Hat matrix and Degrees of Freedom
Recall that $\widehat{Y} = HY$ with
$H = X(X^\top X)^{-1} X^\top$
Similarly
$H_\lambda = X(X^\top X + \lambda\mathbb{I})^{-1} X^\top$
$\mathrm{trace}[H_\lambda] = \sum_{j=1}^p \dfrac{d_{j,j}^2}{d_{j,j}^2 + \lambda} \to 0,\ \text{as}\ \lambda \to \infty.$
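A small sketch of this effective-degrees-of-freedom formula (simulated design, not from the slides).
set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
df_ridge <- function(lambda) {
  d <- svd(X)$d                              # singular values d_j
  sum(d^2 / (d^2 + lambda))                  # trace(H_lambda)
}
sapply(c(0, 1, 10, 100, 1000), df_ridge)     # decreases from p = 5 towards 0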
@freakonometrics freakonometrics freakonometrics.hypotheses.org 21
Arthur Charpentier, SIDE Summer School, July 2019
Sparsity Issues
In several applications, $k$ can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s \ll k$, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),
$s = \mathrm{card}\{\mathcal{S}\}\quad\text{where}\quad \mathcal{S} = \{j;\ \beta_j \neq 0\}$
The model is now $y = X_{\mathcal{S}}^\top\beta_{\mathcal{S}} + \varepsilon$, where $X_{\mathcal{S}}^\top X_{\mathcal{S}}$ is a full-rank matrix.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 22
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
The Ridge regression problem was to solve
$\widehat{\beta} = \operatorname{argmin}_{\beta \in \{\|\beta\|_2 \leq s\}}\big\{\|Y - X^\top\beta\|_2^2\big\}$
Define $\|a\|_0 = \sum 1(|a_i| > 0)$. Here $\dim(\beta) = k$ but $\|\beta\|_0 = s$.
We wish we could solve
$\widehat{\beta} = \operatorname{argmin}_{\beta \in \{\|\beta\|_0 = s\}}\big\{\|Y - X^\top\beta\|_2^2\big\}$
Problem: it is usually not possible to explore all possible constraints, since $\binom{k}{s}$ subsets of coefficients should be considered here (with $k$ (very) large).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 23
[Figure: contours of the objective in the (beta1, beta2) plane]
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
In a convex problem, solve the dual problem,
e.g. in the Ridge regression: primal problem
$\min_{\beta \in \{\|\beta\|_2 \leq s\}}\big\{\|Y - X^\top\beta\|_2^2\big\}$
and the dual problem
$\min_{\beta \in \{\|Y - X^\top\beta\|_2 \leq t\}}\big\{\|\beta\|_2^2\big\}$
[Figures: primal and dual constraint regions with contours of the objective in the (beta1, beta2) plane]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 24
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
Idea: solve the dual problem
$\widehat{\beta} = \operatorname{argmin}_{\beta \in \{\|Y - X^\top\beta\|_2 \leq h\}}\big\{\|\beta\|_0\big\}$
where we might convexify the $\ell_0$ norm, $\|\cdot\|_0$.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 25
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
On $[-1, +1]^k$, the convex hull of $\|\beta\|_0$ is $\|\beta\|_1$.
On $[-a, +a]^k$, the convex hull of $\|\beta\|_0$ is $a^{-1}\|\beta\|_1$.
Hence, why not solve
$\widehat{\beta} = \operatorname{argmin}_{\beta;\ \|\beta\|_1 \leq \tilde{s}}\big\{\|Y - X^\top\beta\|_2\big\}$
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$\widehat{\beta} = \operatorname{argmin}\big\{\|Y - X^\top\beta\|_2^2 + \lambda\,\|\beta\|_1\big\}$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 26
Arthur Charpentier, SIDE Summer School, July 2019
lasso: Least Absolute Shrinkage and Selection Operator
lasso Estimator (OLS)
$\widehat{\beta}^{\text{lasso}}_\lambda = \operatorname{argmin}\left\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|\right\}$
lasso Estimator (GLM)
$\widehat{\beta}^{\text{lasso}}_\lambda = \operatorname{argmin}\left\{-\sum_{i=1}^n \log f\big(y_i \mid \mu_i = g^{-1}(x_i^\top\beta)\big) + \dfrac{\lambda}{2}\sum_{j=1}^p |\beta_j|\right\}$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 27
Arthur Charpentier, SIDE Summer School, July 2019
lasso Regression
No explicit solution...
If $\lambda \to 0$, $\widehat{\beta}^{\text{lasso}}_0 = \widehat{\beta}^{\text{ols}}$; if $\lambda \to \infty$, $\widehat{\beta}^{\text{lasso}}_\infty = 0$.
[Figures: lasso constraint and contours of the objective in the (beta1, beta2) plane]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 28
Arthur Charpentier, SIDE Summer School, July 2019
lasso Regression
For some $\lambda$, there are $k$'s such that $\widehat{\beta}^{\text{lasso}}_{k,\lambda} = 0$.
Further, $\lambda \mapsto \widehat{\beta}^{\text{lasso}}_{k,\lambda}$ is piecewise linear.
[Figures: contours of the objective in the (beta1, beta2) plane]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 29
Arthur Charpentier, SIDE Summer School, July 2019
lasso Regression
In the orthogonal case, $X^\top X = \mathbb{I}$,
$\widehat{\beta}^{\text{lasso}}_{k,\lambda} = \mathrm{sign}\big(\widehat{\beta}^{\text{ols}}_k\big)\,\Big(|\widehat{\beta}^{\text{ols}}_k| - \dfrac{\lambda}{2}\Big)_+$
i.e. the lasso estimate is related to the soft-threshold function...
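A one-line sketch of that soft-thresholding operator (with the slides' parametrization, the threshold is λ/2).
soft <- function(b, t) sign(b) * pmax(abs(b) - t, 0)   # soft-threshold at t
soft(c(-2, -0.3, 0.1, 1.5), t = 0.5)                   # small coefficients are set exactly to 0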
@freakonometrics freakonometrics freakonometrics.hypotheses.org 30
Arthur Charpentier, SIDE Summer School, July 2019
Optimal lasso Penalty
Use cross validation, e.g. K-fold,
$\widehat{\beta}_{(-k)}(\lambda) = \operatorname{argmin}\left\{\sum_{i \notin I_k} [y_i - x_i^\top\beta]^2 + \lambda\,\|\beta\|_1\right\}$
then compute the sum of the squared errors,
$Q_k(\lambda) = \sum_{i \in I_k} [y_i - x_i^\top\widehat{\beta}_{(-k)}(\lambda)]^2$
and finally solve
$\lambda^\star = \operatorname{argmin}\left\{\overline{Q}(\lambda) = \dfrac{1}{K}\sum_k Q_k(\lambda)\right\}$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 31
Arthur Charpentier, SIDE Summer School, July 2019
Optimal lasso Penalty
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009, The Elements of Statistical Learning) suggest the largest $\lambda$ such that
$\overline{Q}(\lambda) \leq \overline{Q}(\lambda^\star) + \mathrm{se}[\lambda^\star]\quad\text{with}\quad \mathrm{se}[\lambda]^2 = \dfrac{1}{K^2}\sum_{k=1}^K \big[Q_k(\lambda) - \overline{Q}(\lambda)\big]^2$
lasso regression is obtained using glmnet::glmnet(..., alpha = 1) - and glmnet::cv.glmnet for cross validation.
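A usage sketch of this one-standard-error rule with glmnet; x and y are placeholders.
library(glmnet)
cv_lasso <- cv.glmnet(x, y, alpha = 1)           # alpha = 1 : lasso penalty
c(cv_lasso$lambda.min, cv_lasso$lambda.1se)      # lambda.1se is the larger, more parsimonious choice
coef(cv_lasso, s = "lambda.1se")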
@freakonometrics freakonometrics freakonometrics.hypotheses.org 32
Arthur Charpentier, SIDE Summer School, July 2019
LASSO and Ridge, with R
library(glmnet)
chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header = TRUE, sep = ";")
standardize <- function(x) (x - mean(x)) / sd(x)
z0 <- standardize(chicago[, 1])
z1 <- standardize(chicago[, 3])
z2 <- standardize(chicago[, 4])
ridge   <- glmnet(cbind(z1, z2), z0, alpha = 0,   intercept = FALSE, lambda = 1)
lasso   <- glmnet(cbind(z1, z2), z0, alpha = 1,   intercept = FALSE, lambda = 1)
elastic <- glmnet(cbind(z1, z2), z0, alpha = 0.5, intercept = FALSE, lambda = 1)
Elastic net: $\lambda_1\,\|\beta\|_1 + \lambda_2\,\|\beta\|_2^2$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 33
Arthur Charpentier, SIDE Summer School, July 2019
lasso and lar (Least-Angle Regression)
lasso estimation can be seen as an adaptation of the LAR procedure
Least Angle Regression
(i) set $\eta > 0$ (small)
(ii) start with initial residual $\varepsilon = y$, and $\widehat{\beta} = 0$
(iii) find the predictor $x_j$ with the highest correlation with $\varepsilon$
(iv) update $\widehat{\beta}_j = \widehat{\beta}_j + \delta_j = \widehat{\beta}_j + \eta\cdot\mathrm{sign}[\varepsilon^\top x_j]$
(v) set $\varepsilon = \varepsilon - \delta_j x_j$ and go to (iii)
see Efron et al. (2004, Least Angle Regression)
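A minimal sketch of the iteration above; η denotes the small step size, and the columns of X are assumed standardized.
stagewise <- function(X, y, eta = 0.01, n_iter = 5000) {
  beta <- rep(0, ncol(X))
  r <- y                                    # current residual
  for (k in seq_len(n_iter)) {
    cors <- drop(crossprod(X, r))           # x_j' r for each predictor
    j <- which.max(abs(cors))               # most correlated predictor
    delta <- eta * sign(cors[j])
    beta[j] <- beta[j] + delta
    r <- r - delta * X[, j]
  }
  beta
}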
@freakonometrics freakonometrics freakonometrics.hypotheses.org 34
Arthur Charpentier, SIDE Summer School, July 2019
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Define
$\|a\|_0 = \sum_{i=1}^d 1(a_i \neq 0),\qquad \|a\|_1 = \sum_{i=1}^d |a_i|\qquad\text{and}\qquad \|a\|_2 = \Big(\sum_{i=1}^d a_i^2\Big)^{1/2},\quad\text{for } a \in \mathbb{R}^d.$
Constrained optimization vs. penalized optimization:
($\ell_0$) $\operatorname{argmin}_{\beta;\,\|\beta\|_0 \leq s} \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta)$ vs. $\operatorname{argmin}_{\beta,\lambda} \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) + \lambda\,\|\beta\|_0$
($\ell_1$) $\operatorname{argmin}_{\beta;\,\|\beta\|_1 \leq s} \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta)$ vs. $\operatorname{argmin}_{\beta,\lambda} \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) + \lambda\,\|\beta\|_1$
($\ell_2$) $\operatorname{argmin}_{\beta;\,\|\beta\|_2 \leq s} \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta)$ vs. $\operatorname{argmin}_{\beta,\lambda} \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) + \lambda\,\|\beta\|_2$
Assume that $\ell$ is the quadratic loss.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 35
Arthur Charpentier, SIDE Summer School, July 2019
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
The two problems ($\ell_2$) are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left (constrained) problem, $\exists\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the right (penalized) problem. And conversely.
The two problems ($\ell_1$) are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left problem, $\exists\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the right problem. And conversely. Nevertheless, even if there is a theoretical equivalence, there might be numerical issues since the solution is not necessarily unique.
The two problems ($\ell_0$) are not equivalent: if $(\beta^\star, \lambda^\star)$ is a solution of the right problem, $\exists s^\star$ such that $\beta^\star$ is a solution of the left problem. But the converse is not true.
More generally, consider an $\ell_p$ norm,
• sparsity is obtained when $p \leq 1$
• convexity is obtained when $p \geq 1$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 36
Arthur Charpentier, SIDE Summer School, July 2019
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve directly the penalized problem ($\ell_0$).
But it is a complex combinatorial problem in high dimension (Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is an NP-hard problem).
One can prove that if $\lambda \sim \sigma^2\log(p)$, then
$\mathbb{E}\big[(x^\top\widehat{\beta} - x^\top\beta_0)^2\big] \leq \underbrace{\mathbb{E}\big[(x_{\mathcal{S}}^\top\widehat{\beta}_{\mathcal{S}} - x^\top\beta_0)^2\big]}_{=\sigma^2\#\mathcal{S}} \cdot \big(4\log p + 2 + o(1)\big).$
In that case
$\widehat{\beta}^{\text{sub}}_{\lambda,j} = \begin{cases} 0 & \text{if } j \notin \mathcal{S}_\lambda(\widehat{\beta}) \\ \widehat{\beta}^{\text{ols}}_j & \text{if } j \in \mathcal{S}_\lambda(\widehat{\beta}), \end{cases}$
where $\mathcal{S}_\lambda(\widehat{\beta})$ is the set of non-null components in the solution of ($\ell_0$).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 37
Arthur Charpentier, SIDE Summer School, July 2019
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
If $\ell$ is no longer the quadratic loss but the $\ell_1$ loss, problem ($\ell_1$) is not always strictly convex, and the optimum is not always unique (e.g. if $X^\top X$ is singular).
But in the quadratic case, the problem is strictly convex in $X\beta$, and at least $X\widehat{\beta}$ is unique.
Further, note that solutions are necessarily coherent (signs of coefficients): it is not possible to have $\widehat{\beta}_j < 0$ for one solution and $\widehat{\beta}_j > 0$ for another one.
In many cases, problem ($\ell_1$) yields a corner-type solution, which can be seen as a "best subset" solution, as in ($\ell_0$).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 38
Arthur Charpentier, SIDE Summer School, July 2019
@freakonometrics freakonometrics freakonometrics.hypotheses.org 39
Arthur Charpentier, SIDE Summer School, July 2019
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Consider a simple regression $y_i = x_i\beta + \varepsilon_i$, with an $\ell_1$ penalty and an $\ell_2$ loss function. ($\ell_1$) becomes
$\min\big\{y^\top y - 2y^\top x\,\beta + \beta\,x^\top x\,\beta + 2\lambda|\beta|\big\}$
The first order condition can be written
$-2y^\top x + 2x^\top x\,\beta \pm 2\lambda = 0$
(the sign in $\pm$ being the sign of $\beta$). Assume that the least-squares estimate ($\lambda = 0$) is (strictly) positive, i.e. $y^\top x > 0$. If $\lambda$ is not too large, $\widehat{\beta}$ and $\widehat{\beta}^{\text{ols}}$ have the same sign, and
$-2y^\top x + 2x^\top x\,\beta + 2\lambda = 0,$
with solution $\widehat{\beta}^{\text{lasso}}_\lambda = \dfrac{y^\top x - \lambda}{x^\top x}.$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 40
Arthur Charpentier, SIDE Summer School, July 2019
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Increase $\lambda$ so that $\widehat{\beta}_\lambda = 0$.
If we increase it slightly more, $\widehat{\beta}_\lambda$ cannot become negative, because the sign of the first order condition would change, and we would have to solve
$-2y^\top x + 2x^\top x\,\beta - 2\lambda = 0,$
with solution $\widehat{\beta}^{\text{lasso}}_\lambda = \dfrac{y^\top x + \lambda}{x^\top x}$. But that solution is positive (we assumed that $y^\top x > 0$), whereas we should have $\widehat{\beta}_\lambda < 0$.
Thus, beyond some point, $\widehat{\beta}_\lambda = 0$, which is a corner solution.
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-ideal model selection by $\ell_1$ minimization).
Under some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of $\widehat{\beta}^{\text{lasso}}_\lambda$ is the same as that of $\beta$.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 41
Arthur Charpentier, SIDE Summer School, July 2019
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Thus, lasso can be used for variable selection - see Hastie et al. (2001, The Elements of Statistical Learning).
Generally, $\widehat{\beta}^{\text{lasso}}_\lambda$ is a biased estimator, but its variance can be small enough to yield a smaller mean squared error than the OLS estimate.
With orthonormal covariates, one can prove that
$\widehat{\beta}^{\text{sub}}_{\lambda,j} = \widehat{\beta}^{\text{ols}}_j\,1_{|\widehat{\beta}^{\text{sub}}_{\lambda,j}| > b},\qquad \widehat{\beta}^{\text{ridge}}_{\lambda,j} = \dfrac{\widehat{\beta}^{\text{ols}}_j}{1 + \lambda}\qquad\text{and}\qquad \widehat{\beta}^{\text{lasso}}_{\lambda,j} = \mathrm{sign}\big[\widehat{\beta}^{\text{ols}}_j\big]\cdot\big(|\widehat{\beta}^{\text{ols}}_j| - \lambda\big)_+.$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 42
Arthur Charpentier, SIDE Summer School, July 2019
lasso for Autoregressive Time Series
Consider some AR(p) autoregressive time series,
$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_{p-1} X_{t-p+1} + \phi_p X_{t-p} + \varepsilon_t,$
for some white noise $(\varepsilon_t)$, with a causal type representation. Write $y = x^\top\phi + \varepsilon$.
The lasso estimator $\widehat{\phi}$ is a minimizer of
$\dfrac{1}{2T}\,\|y - x^\top\phi\|^2 + \lambda\sum_{i=1}^p \lambda_i|\phi_i|,$
for some tuning parameters $(\lambda, \lambda_1, \cdots, \lambda_p)$.
See Nardi & Rinaldo (2011, Autoregressive process modeling via the Lasso procedure).
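A sketch of such a fit with glmnet on a lagged design; the series x is a placeholder, and a single tuning parameter λ is used (i.e. λ_i ≡ 1).
library(glmnet)
p <- 12
X_lag <- embed(x, p + 1)                   # column 1 is x_t, columns 2..(p+1) are lags 1..p
fit <- cv.glmnet(X_lag[, -1], X_lag[, 1], alpha = 1)
coef(fit, s = "lambda.min")                # many lag coefficients are typically set exactly to 0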
@freakonometrics freakonometrics freakonometrics.hypotheses.org 43
Arthur Charpentier, SIDE Summer School, July 2019
lasso and Non-Linearities
Consider knots $k_1, \cdots, k_m$; we want a function $m$ which is a cubic polynomial between every pair of knots, continuous at each knot, and with continuous first and second derivatives at each knot.
We can write $m$ as
$m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - k_1)_+^3 + \cdots + \beta_{m+3}(x - k_m)_+^3$
One strategy is the following:
• fix the number of knots $m$ ($m < n$)
• find the natural cubic spline $\widehat{m}$ which minimizes $\sum_{i=1}^n (y_i - m(x_i))^2$
• then choose $m$ by cross validation
An alternative is to use a penalty-based approach (Ridge type) to avoid overfitting (since with $m = n$, the residual sum of squares is null).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 44
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
Consider a univariate nonlinear regression problem, so that $\mathbb{E}[Y \mid X = x] = m(x)$. Given a sample $\{(y_1, x_1), \cdots, (y_n, x_n)\}$, consider the following penalized problem
$\widehat{m} = \operatorname{argmin}_{m \in \mathcal{C}^2}\left\{\sum_{i=1}^n (y_i - m(x_i))^2 + \lambda\int_{\mathbb{R}} m''(x)^2\,dx\right\}$
with the residual sum of squares on the left, and a penalty for the roughness of the function on the right.
The solution is a natural cubic spline with knots at the unique values of $x$ (see Eubank (1999, Nonparametric Regression and Spline Smoothing)).
Consider some spline basis $\{h_1, \cdots, h_n\}$, and let $m(x) = \sum_{i=1}^n \beta_i h_i(x)$.
Let $H$ and $\Omega$ be the $n \times n$ matrices $H_{i,j} = h_j(x_i)$ and $\Omega_{i,j} = \int_{\mathbb{R}} h_i''(x)\,h_j''(x)\,dx$.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 45
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
Then the objective function can be written
$(y - H\beta)^\top (y - H\beta) + \lambda\,\beta^\top\Omega\beta$
Recognize here a generalized Ridge regression, with solution
$\widehat{\beta}_\lambda = \big(H^\top H + \lambda\Omega\big)^{-1} H^\top y.$
Note that predicted values are linear functions of the observed values since
$\widehat{y} = H\big(H^\top H + \lambda\Omega\big)^{-1} H^\top y = S_\lambda y,$
with degrees of freedom $\mathrm{trace}(S_\lambda)$.
One can obtain the so-called Reinsch form by considering the singular value decomposition of $H = UDV^\top$.
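A sketch of such a penalized fit with stats::smooth.spline, on simulated data (not from the slides).
set.seed(1)
x <- runif(200)
y <- sin(8 * x) + rnorm(200, sd = 0.3)
fit <- smooth.spline(x, y)       # smoothing parameter chosen by (generalized) cross validation
fit$df                           # effective degrees of freedom, trace(S_lambda)
plot(x, y, col = "grey"); lines(fit, col = "red", lwd = 2)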
@freakonometrics freakonometrics freakonometrics.hypotheses.org 46
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
Here $U$ is orthogonal since $H$ is square ($n \times n$), and $D$ is here invertible. Then
$S_\lambda = (\mathbb{I} + \lambda\,U^\top D^{-1} V^\top \Omega V D^{-1} U)^{-1} = (\mathbb{I} + \lambda K)^{-1}$
where $K$ is a positive semidefinite matrix, $K = B\Delta B^\top$, where the columns of $B$ are known as the Demmler-Reinsch basis.
In that (orthonormal) basis, $S_\lambda$ is a diagonal matrix,
$S_\lambda = B\big(\mathbb{I} + \lambda\Delta\big)^{-1} B^\top$
Observe that $S_\lambda B_k = \dfrac{1}{1 + \lambda\Delta_{k,k}}\,B_k$.
Here again, eigenvalues are shrinkage coefficients of basis vectors.
With more covariates, consider an additive problem
$(\widehat{h}_1, \cdots, \widehat{h}_p) = \operatorname{argmin}_{h_1,\cdots,h_p \in \mathcal{C}^2}\left\{\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p h_j(x_{i,j})\Big)^2 + \lambda\sum_{j=1}^p \int_{\mathbb{R}} h_j''(x)^2\,dx\right\}$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 47
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
which can be written
$\min\left\{\Big(y - \sum_{j=1}^p H_j\beta_j\Big)^\top\Big(y - \sum_{j=1}^p H_j\beta_j\Big) + \lambda\sum_{j=1}^p \beta_j^\top\Omega_j\beta_j\right\}$
where each matrix $H_j$ is a Demmler-Reinsch basis for variable $x_j$.
Chouldechova & Hastie (2015, Generalized Additive Model Selection)
Assume that the mean function for the $j$th variable is $m_j(x) = \alpha_j x + h_j(x)^\top\beta_j$. One can write
$\min\Big\{\Big(y - \alpha_0 - \sum_{j=1}^p \alpha_j x_j - \sum_{j=1}^p H_j\beta_j\Big)^\top\Big(y - \alpha_0 - \sum_{j=1}^p \alpha_j x_j - \sum_{j=1}^p H_j\beta_j\Big)$
$\qquad + \lambda\sum_{j=1}^p\big[\gamma\,|\alpha_j| + (1 - \gamma)\,\|\beta_j\|_{\Omega_j}\big] + \psi_1\,\beta_1^\top\Omega_1\beta_1 + \cdots + \psi_p\,\beta_p^\top\Omega_p\beta_p\Big\}$
where $\|\beta_j\|_{\Omega_j} = \sqrt{\beta_j^\top\Omega_j\beta_j}$.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 48
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
The second term is the selection penalty, with a mixture of $\ell_1$- and $\ell_2$-type norm-based penalties.
The third term is the end-to-path penalty (GAM type when $\lambda = 0$).
For each predictor $x_j$, there are three possibilities
• zero, $\widehat{\alpha}_j = 0$ and $\widehat{\beta}_j = 0$
• linear, $\widehat{\alpha}_j \neq 0$ and $\widehat{\beta}_j = 0$
• nonlinear, $\widehat{\beta}_j \neq 0$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 49
Arthur Charpentier, SIDE Summer School, July 2019
[Figures: estimated smooth functions for variables 1-3, paths of the linear and non-linear components as functions of λ, and cross-validated mean-squared error]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 50
Arthur Charpentier, SIDE Summer School, July 2019
Coordinate Descent
LASSO Coordinate Descent Algorithm
1. Start from some initial value $\beta^{(0)} = \widehat{\beta}$
2. For $k = 1, \cdots$: for $j = 1, \cdots, p$
(i) compute $R_j = x_j^\top\big(y - X_{-j}\,\beta^{(k-1)}_{(-j)}\big)$
(ii) set $\beta^{(k)}_j = R_j\cdot\Big(1 - \dfrac{\lambda}{2|R_j|}\Big)_+$
3. The final estimate $\beta^{(\kappa)}$ is $\widehat{\beta}_\lambda$
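A direct sketch of this update, assuming standardized columns ($x_j^\top x_j = 1$, as on the normalization slide).
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (k in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r_j <- drop(crossprod(X[, j], y - X[, -j, drop = FALSE] %*% beta[-j]))
      beta[j] <- sign(r_j) * max(abs(r_j) - lambda / 2, 0)   # R_j (1 - lambda/(2|R_j|))_+
    }
  }
  beta
}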
@freakonometrics freakonometrics freakonometrics.hypotheses.org 51
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Dantzig Selection
Candès & Tao (2007, The Dantzig selector: Statistical estimation when p is much larger than n) defined
$\widehat{\beta}^{\text{dantzig}}_\lambda \in \operatorname{argmin}_{\beta \in \mathbb{R}^p}\big\{\|\beta\|_1\big\}\quad\text{s.t.}\quad \|X^\top(y - X\beta)\|_\infty \leq \lambda$
@freakonometrics freakonometrics freakonometrics.hypotheses.org 52
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Adaptive Lasso
Zou (2006, The Adaptive Lasso)
$\widehat{\beta}^{\text{a-lasso}}_\lambda \in \operatorname{argmin}_{\beta \in \mathbb{R}^p}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^p \dfrac{|\beta_j|}{|\widehat{\beta}^{\gamma\text{-lasso}}_{\lambda,j}|}\right\}$
where $\widehat{\beta}^{\gamma\text{-lasso}}_\lambda = \Pi_{X_{s(\lambda)}}\,y$, where $s(\lambda)$ is the set of non-null components of $\widehat{\beta}^{\text{lasso}}_\lambda$.
See the libraries lqa or lassogrp.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 53
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Group Lasso
Assume that the variables $x \in \mathbb{R}^p$ can be grouped in $L$ subgroups, $x = (x_1, \cdots, x_L)$, where $\dim[x_l] = p_l$.
Yuan & Lin (2007, Model selection and estimation in the Gaussian graphical model) defined, for some positive definite $n_l \times n_l$ matrices $K_l$,
$\widehat{\beta}^{\text{g-lasso}}_\lambda \in \operatorname{argmin}_{\beta \in \mathbb{R}^p}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{\beta_l^\top K_l\,\beta_l}\right\}$
or, if $K_l = p_l\,\mathbb{I}$,
$\widehat{\beta}^{\text{g-lasso}}_\lambda \in \operatorname{argmin}_{\beta \in \mathbb{R}^p}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{p_l}\,\|\beta_l\|_2\right\}$
See the library gglasso.
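A usage sketch with the gglasso package; x, y and the group index are placeholders.
library(gglasso)
grp <- rep(1:3, times = c(2, 3, 2))                # hypothetical grouping of 7 predictors
fit <- gglasso(x, y, group = grp, loss = "ls")
cv  <- cv.gglasso(x, y, group = grp, loss = "ls")
coef(fit, s = cv$lambda.min)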
@freakonometrics freakonometrics freakonometrics.hypotheses.org 54
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Sparse-Group Lasso
Assume that the variables $x \in \mathbb{R}^p$ can be grouped in $L$ subgroups, $x = (x_1, \cdots, x_L)$, where $\dim[x_l] = p_l$.
Simon et al. (2013, A Sparse-Group LASSO) defined, for some positive definite $n_l \times n_l$ matrices $K_l$,
$\widehat{\beta}^{\text{sg-lasso}}_{\lambda,\mu} \in \operatorname{argmin}_{\beta \in \mathbb{R}^p}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{\beta_l^\top K_l\,\beta_l} + \mu\,\|\beta\|_1\right\}$
See the library SGL.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 55
More Related Content

What's hot (20)

Slides erm-cea-ia
Slides erm-cea-iaSlides erm-cea-ia
Slides erm-cea-ia
 
Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2
 
Graduate Econometrics Course, part 4, 2017
Graduate Econometrics Course, part 4, 2017Graduate Econometrics Course, part 4, 2017
Graduate Econometrics Course, part 4, 2017
 
Side 2019 #7
Side 2019 #7Side 2019 #7
Side 2019 #7
 
Side 2019 #5
Side 2019 #5Side 2019 #5
Side 2019 #5
 
Econometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 NonlinearitiesEconometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 Nonlinearities
 
Reinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and FinanceReinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and Finance
 
Side 2019 #10
Side 2019 #10Side 2019 #10
Side 2019 #10
 
Slides sales-forecasting-session2-web
Slides sales-forecasting-session2-webSlides sales-forecasting-session2-web
Slides sales-forecasting-session2-web
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
Slides ineq-4
Slides ineq-4Slides ineq-4
Slides ineq-4
 
Slides edf-2
Slides edf-2Slides edf-2
Slides edf-2
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Slides ACTINFO 2016
Slides ACTINFO 2016Slides ACTINFO 2016
Slides ACTINFO 2016
 
Side 2019 #4
Side 2019 #4Side 2019 #4
Side 2019 #4
 
transformations and nonparametric inference
transformations and nonparametric inferencetransformations and nonparametric inference
transformations and nonparametric inference
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
 
Side 2019 #12
Side 2019 #12Side 2019 #12
Side 2019 #12
 
Slides univ-van-amsterdam
Slides univ-van-amsterdamSlides univ-van-amsterdam
Slides univ-van-amsterdam
 
Side 2019 #8
Side 2019 #8Side 2019 #8
Side 2019 #8
 

Similar to Side 2019 #3

The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...
The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...
The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...ijtsrd
 
Approximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelApproximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelMatt Moores
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learningSteve Nouri
 
Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...
Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...
Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...Yoshihiro Mizoguchi
 
Litv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdfLitv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdfAlexander Litvinenko
 
Approximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsApproximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsStefano Cabras
 
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...Marco Benini
 
Estimation Theory, PhD Course, Ghent University, Belgium
Estimation Theory, PhD Course, Ghent University, BelgiumEstimation Theory, PhD Course, Ghent University, Belgium
Estimation Theory, PhD Course, Ghent University, BelgiumStijn De Vuyst
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision TheorySangwoo Mo
 
Ch1 sets and_logic(1)
Ch1 sets and_logic(1)Ch1 sets and_logic(1)
Ch1 sets and_logic(1)Kwonpyo Ko
 
Talk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesTalk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesPierre Jacob
 

Similar to Side 2019 #3 (20)

Slides ensae-2016-9
Slides ensae-2016-9Slides ensae-2016-9
Slides ensae-2016-9
 
Slides ensae 9
Slides ensae 9Slides ensae 9
Slides ensae 9
 
Econometrics 2017-graduate-3
Econometrics 2017-graduate-3Econometrics 2017-graduate-3
Econometrics 2017-graduate-3
 
Ica group 3[1]
Ica group 3[1]Ica group 3[1]
Ica group 3[1]
 
Slides ub-3
Slides ub-3Slides ub-3
Slides ub-3
 
Math Analysis I
Math Analysis I Math Analysis I
Math Analysis I
 
Intro to ABC
Intro to ABCIntro to ABC
Intro to ABC
 
The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...
The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...
The Estimation for the Eigenvalue of Quaternion Doubly Stochastic Matrices us...
 
Approximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelApproximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts model
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learning
 
Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...
Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...
Graph partitioning and characteristic polynomials of Laplacian matrics of Roa...
 
Litv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdfLitv_Denmark_Weak_Supervised_Learning.pdf
Litv_Denmark_Weak_Supervised_Learning.pdf
 
Approximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-LikelihoodsApproximate Bayesian Computation with Quasi-Likelihoods
Approximate Bayesian Computation with Quasi-Likelihoods
 
MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...
MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...
MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...
 
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...
 
Estimation Theory, PhD Course, Ghent University, Belgium
Estimation Theory, PhD Course, Ghent University, BelgiumEstimation Theory, PhD Course, Ghent University, Belgium
Estimation Theory, PhD Course, Ghent University, Belgium
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision Theory
 
The Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal FunctionThe Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal Function
 
Ch1 sets and_logic(1)
Ch1 sets and_logic(1)Ch1 sets and_logic(1)
Ch1 sets and_logic(1)
 
Talk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesTalk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniques
 

More from Arthur Charpentier

More from Arthur Charpentier (14)

Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
ACT6100 introduction
ACT6100 introductionACT6100 introduction
ACT6100 introduction
 
Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)
 
Control epidemics
Control epidemics Control epidemics
Control epidemics
 
STT5100 Automne 2020, introduction
STT5100 Automne 2020, introductionSTT5100 Automne 2020, introduction
STT5100 Automne 2020, introduction
 
Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
Optimal Control and COVID-19
Optimal Control and COVID-19Optimal Control and COVID-19
Optimal Control and COVID-19
 
Slides OICA 2020
Slides OICA 2020Slides OICA 2020
Slides OICA 2020
 
Lausanne 2019 #3
Lausanne 2019 #3Lausanne 2019 #3
Lausanne 2019 #3
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
 
Side 2019 #11
Side 2019 #11Side 2019 #11
Side 2019 #11
 
Pareto Models, Slides EQUINEQ
Pareto Models, Slides EQUINEQPareto Models, Slides EQUINEQ
Pareto Models, Slides EQUINEQ
 
Econ. Seminar Uqam
Econ. Seminar UqamEcon. Seminar Uqam
Econ. Seminar Uqam
 
Mutualisation et Segmentation
Mutualisation et SegmentationMutualisation et Segmentation
Mutualisation et Segmentation
 

Recently uploaded

Instant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School DesignsInstant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School Designsegoetzinger
 
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With RoomVIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Roomdivyansh0kumar0
 
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...Suhani Kapoor
 
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...Pooja Nehwal
 
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdfFinTech Belgium
 
00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptx00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptxFinTech Belgium
 
Pooja 9892124323 : Call Girl in Juhu Escorts Service Free Home Delivery
Pooja 9892124323 : Call Girl in Juhu Escorts Service Free Home DeliveryPooja 9892124323 : Call Girl in Juhu Escorts Service Free Home Delivery
Pooja 9892124323 : Call Girl in Juhu Escorts Service Free Home DeliveryPooja Nehwal
 
The Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfThe Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfGale Pooley
 
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...Pooja Nehwal
 
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service NashikHigh Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsHigh Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Q3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast SlidesQ3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast SlidesMarketing847413
 
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...ssifa0344
 
CALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual serviceCALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual serviceanilsa9823
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptxFinTech Belgium
 
Dividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxDividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxanshikagoel52
 
The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfGale Pooley
 
Andheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot ModelsAndheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot Modelshematsharma006
 

Recently uploaded (20)

Instant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School DesignsInstant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School Designs
 
Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024
 
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With RoomVIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
 
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
 
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
Independent Call Girl Number in Kurla Mumbai📲 Pooja Nehwal 9892124323 💞 Full ...
 
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
 
00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptx00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptx
 
Veritas Interim Report 1 January–31 March 2024
Veritas Interim Report 1 January–31 March 2024Veritas Interim Report 1 January–31 March 2024
Veritas Interim Report 1 January–31 March 2024
 
Pooja 9892124323 : Call Girl in Juhu Escorts Service Free Home Delivery
Pooja 9892124323 : Call Girl in Juhu Escorts Service Free Home DeliveryPooja 9892124323 : Call Girl in Juhu Escorts Service Free Home Delivery
Pooja 9892124323 : Call Girl in Juhu Escorts Service Free Home Delivery
 
The Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfThe Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdf
 
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...
 
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service NashikHigh Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
High Class Call Girls Nashik Maya 7001305949 Independent Escort Service Nashik
 
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsHigh Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
High Class Call Girls Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Q3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast SlidesQ3 2024 Earnings Conference Call and Webcast Slides
Q3 2024 Earnings Conference Call and Webcast Slides
 
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
 
CALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual serviceCALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gomti Nagar Lucknow best sexual service
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
 
Dividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxDividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptx
 
The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdf
 
Andheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot ModelsAndheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot Models
 

Side 2019 #3

  • 1. Arthur Charpentier, SIDE Summer School, July 2019 # 3 Regularization & Penalized Regression Arthur Charpentier (Universit´e du Qu´ebec `a Montr´eal) Machine Learning & Econometrics SIDE Summer School - July 2019 @freakonometrics freakonometrics freakonometrics.hypotheses.org 1
  • 2. Arthur Charpentier, SIDE Summer School, July 2019 Linear Model and Variable Selection Let s denote a subset of {0, 1, · · · , p}, with cardinal |s|. Xs is the matrice with columns xj where j ∈ s. Consider the model Y = Xsβs + η, so that βs = Xs Xs −1 Xs y In general, βs = (β)s R2 is usually not a good measure since R2 (s) ≤ R2 (t) when s ⊂ t. Some use the adjusted R2 , R 2 (s) = 1 − n − 1 n − |s| 1 − R2 (s) The mean square error is mse(s) = E (Xβ − Xsβs)2 = E RSS(s) − nσ2 + 2|s|σ2 Define Mallows’Cp as Cp(s) = RSS(s) σ2 − n + 2|s| Rule of thumb: model with variables s is valid if Cp(s) ≤ |s| @freakonometrics freakonometrics freakonometrics.hypotheses.org 2
  • 3. Arthur Charpentier, SIDE Summer School, July 2019 Linear Model and Variable Selection In a linear model, log L(β, σ2 ) = − n 2 log σ2 − n 2 log(2π) − 1 2σ2 y − Xβ 2 and log L(βs, σ2 s ) = − n 2 log RSS(s) n − n 2 [1 + log(2π)] It is necessary to penalize too complex models Akaike’s AIC : AIC(s) = n 2 log RSS(s) n + n 2 [1 + log(2π)] + 2|s| Schwarz’s BIC : BIC(s) = n 2 log RSS(s) n + n 2 [1 + log(2π)] + |s| log n Exhaustive search of all models, 2p+1 ... too complicated. Stepwise procedure, forward or backward... not very stable and satisfactory. @freakonometrics freakonometrics freakonometrics.hypotheses.org 3
  • 4. Arthur Charpentier, SIDE Summer School, July 2019 Linear Model and Variable Selection For variable selection, use the classical stats::step function, or leaps::regsubset for best subset, forward stepwide and backward stepwise. For leave-one-out-cross validation, we can write 1 n n i=1 (yi − y(i))2 = 1 n n i=1 yi − yi 1 − Hi,i 2 Heuristically, (yi − yi)2 underestimates the true prediction error High underestimation if the correlation between yi and yi is high One can use Cov[y, y], e.g. in Mallows’ Cp, Cp = 1 n n i=1 (yi − yi)2 + 2 n σ2 p where p = 1 σ2 trace[Cov(y, y]) with Gaussian errors, AIC and Mallows’ Cp are asymptotically equivalent. @freakonometrics freakonometrics freakonometrics.hypotheses.org 4
  • 5. Arthur Charpentier, SIDE Summer School, July 2019 Penalized Inference and Shrinkage Consider a parametric model, with true (unknown) parameter θ, then mse(ˆθ) = E (ˆθ − θ)2 = E (ˆθ − E ˆθ )2 variance + E (E ˆθ − θ)2 bias2 One can think of a shrinkage of an unbiased estimator, Let θ denote an unbiased estimator of θ. Then ˆθ = θ2 θ2 + mse(θ) · θ = θ − mse(θ) θ2 + mse(θ) · θ penalty satisfies mse(ˆθ) ≤ mse(θ). @freakonometrics freakonometrics freakonometrics.hypotheses.org 5 −2 −1 0 1 2 3 4 0.00.20.40.60.8 variance
  • 6. Arthur Charpentier, SIDE Summer School, July 2019 Normalization : Euclidean 2 vs. Mahalonobis We want to penalize complicated models : if βk is “too small”, we prefer to have βk = 0. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 Instead of d(x, y) = (x − y)T (x − y) use dΣ(x, y) = (x − y)TΣ−1 (x − y) beta1 beta2 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics freakonometrics freakonometrics.hypotheses.org 6
  • 7. Arthur Charpentier, SIDE Summer School, July 2019 Linear Regression Shortcoming Least Squares Estimator β = (XT X)−1 XT y Unbiased Estimator E[β] = β Variance Var[β] = σ2 (XT X)−1 which can be (extremely) large when det[(XT X)] ∼ 0. X =        1 −1 2 1 0 1 1 2 −1 1 1 0        then XT X =     4 2 2 2 6 −4 2 −4 6     while XT X+I =     5 2 2 2 7 −4 2 −4 7     eigenvalues : {10, 6, 0} {11, 7, 1} Ad-hoc strategy: use XT X + λI @freakonometrics freakonometrics freakonometrics.hypotheses.org 7
  • 8. Arthur Charpentier, SIDE Summer School, July 2019 Ridge Regression ... like the least square, but it shrinks estimated coefficients towards 0. β ridge λ = argmin    n i=1 (yi − xT i β)2 + λ p j=1 β2 j    β ridge λ = argmin    y − Xβ 2 2 =criteria + λ β 2 2 =penalty    λ ≥ 0 is a tuning parameter. @freakonometrics freakonometrics freakonometrics.hypotheses.org 8
  • 9. Arthur Charpentier, SIDE Summer School, July 2019 Ridge Regression an Wieringen (2018 Lecture notes on ridge regression Ridge Estimator (OLS) β ridge λ = argmin    n i=1 (yi − xi β)2 + λ p j=1 β2 j    Ridge Estimator (GLM) β ridge λ = argmin    − n i=1 log f(yi|µi = g−1 (xi β)) + λ 2 p j=1 β2 j    @freakonometrics freakonometrics freakonometrics.hypotheses.org 9
  • 10. Arthur Charpentier, SIDE Summer School, July 2019 Ridge Regression β ridge λ = argmin y − (β0 + Xβ) 2 2 + λ β 2 2 can be seen as a constrained optimization problem β ridge λ = argmin β 2 2 ≤hλ y − (β0 + Xβ) 2 2 Explicit solution βλ = (XT X + λI)−1 XT y If λ → 0, β ridge 0 = β ols If λ → ∞, β ridge ∞ = 0. beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics freakonometrics freakonometrics.hypotheses.org 10
  • 11. Arthur Charpentier, SIDE Summer School, July 2019 Ridge Regression This penalty can be seen as rather unfair if compo- nents of x are not expressed on the same scale • center: xj = 0, then β0 = y • scale: xT j xj = 1 Then compute β ridge λ = argmin    y − Xβ 2 2 =loss + λ β 2 2 =penalty    beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 40 40 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics freakonometrics freakonometrics.hypotheses.org 11
  • 12. Arthur Charpentier, SIDE Summer School, July 2019 Ridge Regression Observe that if xj1 ⊥ xj2 , then β ridge λ = [1 + λ]−1 β ols λ which explain relationship with shrinkage. But generally, it is not the case... Smaller mse There exists λ such that mse[β ridge λ ] ≤ mse[β ols λ ] q q @freakonometrics freakonometrics freakonometrics.hypotheses.org 12
  • 13. Arthur Charpentier, SIDE Summer School, July 2019 Ridge Regression Lλ(β) = n i=1 (yi − β0 − xT i β)2 + λ p j=1 β2 j ∂Lλ(β) ∂β = −2XT y + 2(XT X + λI)β ∂2 Lλ(β) ∂β∂βT = 2(XT X + λI) where XT X is a semi-positive definite matrix, and λI is a positive definite matrix, and βλ = (XT X + λI)−1 XT y @freakonometrics freakonometrics freakonometrics.hypotheses.org 13
  • 14. Arthur Charpentier, SIDE Summer School, July 2019 The Bayesian Interpretation From a Bayesian perspective, P[θ|y] posterior ∝ P[y|θ] likelihood · P[θ] prior i.e. log P[θ|y] = log P[y|θ] log likelihood + log P[θ] penalty If β has a prior N(0, τ2 I) distribution, then its posterior distribution has mean E[β|y, X] = XT X + σ2 τ2 I −1 XT y. @freakonometrics freakonometrics freakonometrics.hypotheses.org 14
  • 15. Arthur Charpentier, SIDE Summer School, July 2019 Properties of the Ridge Estimator βλ = (XT X + λI)−1 XT y E[βλ] = XT X(λI + XT X)−1 β. i.e. E[βλ] = β. Observe that E[βλ] → 0 as λ → ∞. Ridge & Shrinkage Assume that X is an orthogonal design matrix, i.e. XT X = I, then βλ = (1 + λ)−1 β ols . @freakonometrics freakonometrics freakonometrics.hypotheses.org 15
  • 16. Arthur Charpentier, SIDE Summer School, July 2019 Properties of the Ridge Estimator Set W λ = (I + λ[XT X]−1 )−1 . One can prove that W λβ ols = βλ. Thus, Var[βλ] = W λVar[β ols ]W T λ and Var[βλ] = σ2 (XT X + λI)−1 XT X[(XT X + λI)−1 ]T . Observe that Var[β ols ] − Var[βλ] = σ2 W λ[2λ(XT X)−2 + λ2 (XT X)−3 ]W T λ ≥ 0. @freakonometrics freakonometrics freakonometrics.hypotheses.org 16
  • 17. Arthur Charpentier, SIDE Summer School, July 2019 Properties of the Ridge Estimator Hence, the confidence ellipsoid of ridge estimator is indeed smaller than the OLS, If X is an orthogonal design matrix, Var[βλ] = σ2 (1 + λ)−2 I. mse[βλ] = σ2 trace(W λ(XT X)−1 W T λ) + βT (W λ − I)T (W λ − I)β. If X is an orthogonal design matrix, mse[βλ] = pσ2 (1 + λ)2 + λ2 (1 + λ)2 βT β @freakonometrics freakonometrics freakonometrics.hypotheses.org 17 0.0 0.2 0.4 0.6 0.8 −1.0−0.8−0.6−0.4−0.2 β1 β2 1 2 3 4 5 6 7
Arthur Charpentier, SIDE Summer School, July 2019
Properties of the Ridge Estimator
mse[β_λ] = pσ²/(1 + λ)² + λ²/(1 + λ)² · β^T β is minimal for λ* = pσ²/(β^T β).
Note that there exists λ > 0 such that mse[β_λ] < mse[β_0] = mse[β^ols].
Ridge regression is obtained using glmnet::glmnet(..., alpha = 0) - and glmnet::cv.glmnet for cross validation.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 18
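A quick numerical check of the optimal-λ formula in the orthogonal case (the values of p, σ² and β below are made up for the example):

# Sketch: check that lambda* = p sigma^2 / (beta' beta) minimizes
# mse(lambda) = p sigma^2/(1+lambda)^2 + lambda^2/(1+lambda)^2 * beta'beta.
p <- 4; sigma2 <- 2; beta <- c(1, -1, 0.5, 0.25)
mse_ridge <- function(lambda) p * sigma2 / (1 + lambda)^2 + lambda^2 / (1 + lambda)^2 * sum(beta^2)
optimize(mse_ridge, interval = c(0, 100))$minimum   # numerical minimizer
p * sigma2 / sum(beta^2)                            # closed-form lambda*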
Arthur Charpentier, SIDE Summer School, July 2019
SVD decomposition
For any m × n matrix A, there are orthogonal matrices U (m × m), V (n × n) and a "diagonal" matrix Σ (m × n) such that A = UΣV^T, or AV = UΣ.
Hence, there exists a special orthonormal set of vectors (the columns of V) that is mapped by the matrix A into an orthonormal set of vectors (the columns of U).
Let r = rank(A); then A = Σ_{i=1}^r σ_i u_i v_i^T (called the dyadic decomposition of A).
Observe that it can be used to compute (e.g.) the Frobenius norm of A,
‖A‖ = (Σ_{i,j} a²_{i,j})^{1/2} = (σ²₁ + · · · + σ²_{min{m,n}})^{1/2}.
Further, A^T A = V Σ^T Σ V^T while A A^T = U Σ Σ^T U^T. Hence, the σ²_i's are related to the eigenvalues of A^T A and A A^T, and u_i, v_i are associated eigenvectors.
Golub & Reinsch (1970, Singular Value Decomposition and Least Squares Solutions)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 19
Arthur Charpentier, SIDE Summer School, July 2019
SVD decomposition
Consider the singular value decomposition of X, X = UDV^T. Then
β^ols = V D^{−2} D U^T y
β_λ = V (D² + λI)^{−1} D U^T y
Observe that
D^{−1}_{i,i} ≥ D_{i,i} / (D²_{i,i} + λ)
hence, the ridge penalty shrinks singular values.
Set now R = UD (an n × n matrix), so that X = RV^T, and
β_λ = V (R^T R + λI)^{−1} R^T y
@freakonometrics freakonometrics freakonometrics.hypotheses.org 20
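A small sketch (simulated data, illustrative only) checking that the SVD expression and the direct formula agree:

# Sketch: V (D^2 + lambda I)^{-1} D U'y matches (X'X + lambda I)^{-1} X'y numerically.
set.seed(3)
X <- matrix(rnorm(60), 20, 3); y <- rnorm(20)
lambda <- 2
s <- svd(X)                                   # X = U D V'
b_svd    <- s$v %*% (diag(1 / (s$d^2 + lambda)) %*% (s$d * crossprod(s$u, y)))
b_direct <- solve(crossprod(X) + lambda * diag(3), crossprod(X, y))
max(abs(b_svd - b_direct))                    # ~ 0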
Arthur Charpentier, SIDE Summer School, July 2019
Hat matrix and Degrees of Freedom
Recall that Ŷ = HY with H = X(X^T X)^{−1} X^T.
Similarly, H_λ = X(X^T X + λI)^{−1} X^T and
trace[H_λ] = Σ_{j=1}^p d²_{j,j} / (d²_{j,j} + λ) → 0, as λ → ∞.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 21
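A minimal sketch of the effective degrees of freedom, computed both as the trace of H_λ and from the singular values (simulated data; the helper names are illustrative):

# Sketch: effective degrees of freedom of ridge, trace(H_lambda), two equivalent ways.
set.seed(4)
X <- matrix(rnorm(100), 25, 4)
d <- svd(X)$d
df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
H_lambda <- function(lambda) X %*% solve(crossprod(X) + lambda * diag(4)) %*% t(X)
c(trace = sum(diag(H_lambda(3))), formula = df_ridge(3))   # both equal, below p = 4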
Arthur Charpentier, SIDE Summer School, July 2019
Sparsity Issues
In several applications, k can be (very) large, but a lot of features are just noise: βj = 0 for many j's. Let s denote the number of relevant features, with s << k, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),
s = card{S} where S = {j; βj ≠ 0}
The model is now y = X_S β_S + ε, where X_S^T X_S is a full rank matrix.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 22
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
The Ridge regression problem was to solve
β = argmin_{β ∈ {‖β‖₂ ≤ s}} { ‖Y − X^T β‖²₂ }
Define ‖a‖₀ = Σ 1(|a_i| > 0). Here dim(β) = k but ‖β‖₀ = s.
We wish we could solve
β = argmin_{β ∈ {‖β‖₀ = s}} { ‖Y − X^T β‖²₂ }
Problem: it is usually not possible to go through all possible constraints, since s coefficients out of k should be chosen here (with k (very) large).
[Figure: contour plot of the loss with the ℓ0 constraint]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 23
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
In a convex problem, solve the dual problem, e.g. in the Ridge regression:
primal problem: min_{β ∈ {‖β‖₂ ≤ s}} { ‖Y − X^T β‖²₂ }
dual problem: min_{β ∈ {‖Y − X^T β‖₂ ≤ t}} { ‖β‖²₂ }
[Figure: contour plots illustrating the primal and dual constraint regions]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 24
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
Idea: solve the dual problem
β = argmin_{β ∈ {‖Y − X^T β‖₂ ≤ h}} { ‖β‖₀ }
where we might convexify the ℓ0 "norm", ‖ · ‖₀.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 25
Arthur Charpentier, SIDE Summer School, July 2019
Going further on sparsity issues
On [−1, +1]^k, the convex hull of ‖β‖₀ is ‖β‖₁.
On [−a, +a]^k, the convex hull of ‖β‖₀ is a^{−1}‖β‖₁.
Hence, why not solve
β = argmin_{β; ‖β‖₁ ≤ s̃} { ‖Y − X^T β‖₂ }
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
β = argmin{ ‖Y − X^T β‖²₂ + λ‖β‖₁ }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 26
Arthur Charpentier, SIDE Summer School, July 2019
lasso: Least Absolute Shrinkage and Selection Operator
lasso Estimator (OLS)
β^lasso_λ = argmin{ Σ_{i=1}^n (y_i − x_i^T β)² + λ Σ_{j=1}^p |β_j| }
lasso Estimator (GLM)
β^lasso_λ = argmin{ −Σ_{i=1}^n log f(y_i | μ_i = g^{−1}(x_i^T β)) + λ/2 Σ_{j=1}^p |β_j| }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 27
Arthur Charpentier, SIDE Summer School, July 2019
lasso Regression
No explicit solution...
If λ → 0, β^lasso_0 = β^ols; if λ → ∞, β^lasso_∞ = 0.
[Figure: contour plots of the loss with the ℓ1 constraint region]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 28
Arthur Charpentier, SIDE Summer School, July 2019
lasso Regression
For some λ, there are k's such that β^lasso_{k,λ} = 0.
Further, λ → β^lasso_{k,λ} is piecewise linear.
[Figure: contour plots of the loss with ℓ1 constraint regions of different sizes]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 29
Arthur Charpentier, SIDE Summer School, July 2019
lasso Regression
In the orthogonal case, X^T X = I,
β^lasso_{k,λ} = sign(β^ols_k) · ( |β^ols_k| − λ/2 )₊
i.e. the lasso estimate is related to the soft threshold function...
@freakonometrics freakonometrics freakonometrics.hypotheses.org 30
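The soft threshold function is straightforward to write down; a minimal sketch ('soft' is just a helper name used here):

# Sketch: the soft-threshold function underlying the lasso in the orthogonal case.
soft <- function(b, threshold) sign(b) * pmax(abs(b) - threshold, 0)
b_ols <- seq(-3, 3, by = 0.5)
cbind(ols = b_ols, lasso = soft(b_ols, threshold = 1))   # shrunk towards 0, exactly 0 near it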
Arthur Charpentier, SIDE Summer School, July 2019
Optimal lasso Penalty
Use cross validation, e.g. K-fold,
β_{(−k)}(λ) = argmin{ Σ_{i∉I_k} [y_i − x_i^T β]² + λ‖β‖₁ }
then compute the sum of the squared errors on the held-out fold,
Q_k(λ) = Σ_{i∈I_k} [y_i − x_i^T β_{(−k)}(λ)]²
and finally solve
λ* = argmin{ Q(λ) = (1/K) Σ_k Q_k(λ) }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 31
Arthur Charpentier, SIDE Summer School, July 2019
Optimal lasso Penalty
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009, Elements of Statistical Learning) suggest the largest λ such that
Q(λ) ≤ Q(λ*) + se[λ*], with se[λ]² = (1/K²) Σ_{k=1}^K [Q_k(λ) − Q(λ)]²
lasso regression is obtained using glmnet::glmnet(..., alpha = 1) - and glmnet::cv.glmnet for cross validation.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 32
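A minimal sketch of cross-validated lasso with glmnet (simulated data; lambda.min is the CV minimizer and lambda.1se implements the one-standard-error rule mentioned above):

# Sketch: cross-validated lasso with glmnet.
library(glmnet)
set.seed(5)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)          # only two relevant features
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
c(lambda.min = cvfit$lambda.min, lambda.1se = cvfit$lambda.1se)
coef(cvfit, s = "lambda.1se")                   # sparse coefficient vector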
Arthur Charpentier, SIDE Summer School, July 2019
LASSO and Ridge, with R
library(glmnet)
chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header = TRUE, sep = ";")
standardize <- function(x) (x - mean(x)) / sd(x)
z0 <- standardize(chicago[, 1])
z1 <- standardize(chicago[, 3])
z2 <- standardize(chicago[, 4])
# alpha = 0: ridge, alpha = 1: lasso, 0 < alpha < 1: elastic net
ridge   <- glmnet(cbind(z1, z2), z0, alpha = 0,  intercept = FALSE, lambda = 1)
lasso   <- glmnet(cbind(z1, z2), z0, alpha = 1,  intercept = FALSE, lambda = 1)
elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net: λ₁‖β‖₁ + λ₂‖β‖²₂
[Figure: coefficient paths for ridge, lasso and elastic net]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 33
Arthur Charpentier, SIDE Summer School, July 2019
lasso and lar (Least-Angle Regression)
lasso estimation can be seen as an adaptation of the LAR procedure.
Least Angle Regression
(i) set γ > 0 (small)
(ii) start with initial residual ε = y, and β = 0
(iii) find the predictor x_j with the highest correlation with ε
(iv) update β_j = β_j + δ_j, where δ_j = γ · sign[ε^T x_j]
(v) set ε = ε − δ_j x_j and go to (iii)
see Efron et al. (2004, Least Angle Regression)
@freakonometrics freakonometrics freakonometrics.hypotheses.org 34
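A minimal sketch of the iterative loop described above (simulated data; the step size γ, the number of iterations and the variable names are chosen for the example - this is the simple stagewise loop of the slide, not the full LAR algorithm of Efron et al.):

# Sketch: stagewise updates as on the slide, simulated data.
set.seed(6)
n <- 200; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(X %*% c(2, -1, 0, 0, 0.5) + rnorm(n))
gamma <- 0.01                      # small step size
beta  <- rep(0, p)
eps   <- y                         # initial residual
for (it in 1:2000) {
  j <- which.max(abs(cor(X, eps)))           # most correlated predictor
  delta <- gamma * sign(sum(eps * X[, j]))
  beta[j] <- beta[j] + delta
  eps <- eps - delta * X[, j]
}
round(beta, 2)                     # after many small steps, beta approaches the least-squares fit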
Arthur Charpentier, SIDE Summer School, July 2019
Going further, ℓ0, ℓ1 and ℓ2 penalty
Define ‖a‖₀ = Σ_{i=1}^d 1(a_i ≠ 0), ‖a‖₁ = Σ_{i=1}^d |a_i| and ‖a‖₂ = (Σ_{i=1}^d a²_i)^{1/2}, for a ∈ R^d.
Constrained optimization vs. penalized optimization:
argmin_{β; ‖β‖₀ ≤ s} Σ_{i=1}^n ℓ(y_i, β₀ + x^T β)   vs.   argmin_{β,λ} Σ_{i=1}^n ℓ(y_i, β₀ + x^T β) + λ‖β‖₀   (ℓ0)
argmin_{β; ‖β‖₁ ≤ s} Σ_{i=1}^n ℓ(y_i, β₀ + x^T β)   vs.   argmin_{β,λ} Σ_{i=1}^n ℓ(y_i, β₀ + x^T β) + λ‖β‖₁   (ℓ1)
argmin_{β; ‖β‖₂ ≤ s} Σ_{i=1}^n ℓ(y_i, β₀ + x^T β)   vs.   argmin_{β,λ} Σ_{i=1}^n ℓ(y_i, β₀ + x^T β) + λ‖β‖₂   (ℓ2)
Assume that ℓ is the quadratic loss.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 35
Arthur Charpentier, SIDE Summer School, July 2019
Going further, ℓ0, ℓ1 and ℓ2 penalty
The two problems (ℓ2) are equivalent: ∀(β*, s*) solution of the left (constrained) problem, ∃λ* such that (β*, λ*) is a solution of the right (penalized) problem. And conversely.
The two problems (ℓ1) are equivalent: ∀(β*, s*) solution of the left problem, ∃λ* such that (β*, λ*) is a solution of the right problem. And conversely. Nevertheless, even if there is a theoretical equivalence, there might be numerical issues, since the solution is not necessarily unique.
The two problems (ℓ0) are not equivalent: if (β*, λ*) is a solution of the right problem, ∃s* such that β* is a solution of the left problem. But the converse is not true.
More generally, consider an ℓp norm:
• sparsity is obtained when p ≤ 1
• convexity is obtained when p ≥ 1
@freakonometrics freakonometrics freakonometrics.hypotheses.org 36
Arthur Charpentier, SIDE Summer School, July 2019
Going further, ℓ0, ℓ1 and ℓ2 penalty
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve directly the penalized problem (ℓ0). But it is a complex combinatorial problem in high dimension (Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is an NP-hard problem).
One can prove that if λ ∼ σ² log(p), then
E([x^T β − x^T β₀]²) ≤ E([x_S^T β_S − x^T β₀]²) (= σ² #S) · (4 log p + 2 + o(1)).
In that case,
β^sub_{λ,j} = 0 if j ∉ S_λ(β), and β^sub_{λ,j} = β^ols_j if j ∈ S_λ(β),
where S_λ(β) is the set of non-null values of the solutions of (ℓ0).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 37
Arthur Charpentier, SIDE Summer School, July 2019
Going further, ℓ0, ℓ1 and ℓ2 penalty
If the penalty is no longer the quadratic norm but the ℓ1 norm, problem (ℓ1) is not always strictly convex, and the optimum is not always unique (e.g. if X^T X is singular).
But when the loss is quadratic, the problem is strictly convex in Xβ, and at least Xβ is unique.
Further, note that solutions are necessarily coherent (signs of coefficients): it is not possible to have β_j < 0 for one solution and β_j > 0 for another one.
In many cases, problem (ℓ1) yields a corner-type solution, which can be seen as a "best subset" solution - like in (ℓ0).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 38
Arthur Charpentier, SIDE Summer School, July 2019
[Figure]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 39
Arthur Charpentier, SIDE Summer School, July 2019
Going further, ℓ0, ℓ1 and ℓ2 penalty
Consider a simple regression y_i = x_i β + ε, with an ℓ1 penalty and an ℓ2 loss function. (ℓ1) becomes
min{ y^T y − 2y^T x β + β x^T x β + 2λ|β| }
The first order condition can be written
−2y^T x + 2x^T x β ± 2λ = 0
(the sign in ± being the sign of β).
Assume that the least-squares estimate (λ = 0) is (strictly) positive, i.e. y^T x > 0. If λ is not too large, β and β^ols have the same sign, and
−2y^T x + 2x^T x β + 2λ = 0
with solution β^lasso_λ = (y^T x − λ) / (x^T x).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 40
Arthur Charpentier, SIDE Summer School, July 2019
Going further, ℓ0, ℓ1 and ℓ2 penalty
Increase λ so that β_λ = 0. Increase it slightly more: β_λ cannot become negative, because the sign in the first order condition would change, and we should then solve
−2y^T x + 2x^T x β − 2λ = 0
whose solution would be β^lasso_λ = (y^T x + λ) / (x^T x). But that solution is positive (we assumed y^T x > 0), whereas we require β_λ < 0: contradiction. Thus, at some point β_λ = 0, which is a corner solution.
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-ideal model selection by ℓ1 minimization).
With some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of β^lasso_λ is the same as that of β.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 41
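A small numerical check of this one-dimensional corner solution (simulated data; the objective and helper names below are made up for the example, and the factor 2λ matches the formulation of the previous slide):

# Sketch: beta_lambda = (y'x - lambda)/(x'x) for small lambda, and 0 once lambda exceeds y'x.
set.seed(7)
x <- rnorm(100); y <- 0.4 * x + rnorm(100)     # y'x > 0 here
obj <- function(b, lambda) sum((y - x * b)^2) + 2 * lambda * abs(b)
beta_num <- function(lambda) optimize(obj, interval = c(-2, 2), lambda = lambda)$minimum
lambda_grid <- c(0, 5, 10, 20, 50)
rbind(numerical   = sapply(lambda_grid, beta_num),
      closed_form = pmax((sum(y * x) - lambda_grid) / sum(x^2), 0))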
Arthur Charpentier, SIDE Summer School, July 2019
Going further, ℓ0, ℓ1 and ℓ2 penalty
Thus, lasso can be used for variable selection - see Hastie et al. (2001, The Elements of Statistical Learning).
Generally, β^lasso_λ is a biased estimator, but its variance can be small enough to yield a smaller mean squared error than the OLS estimate.
With orthonormal covariates, one can prove that
β^sub_{λ,j} = β^ols_j · 1(|β^ols_j| > b),   β^ridge_{λ,j} = β^ols_j / (1 + λ)   and   β^lasso_{λ,j} = sign[β^ols_j] · (|β^ols_j| − λ)₊.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 42
Arthur Charpentier, SIDE Summer School, July 2019
lasso for Autoregressive Time Series
Consider some AR(p) autoregressive time series,
X_t = φ₁X_{t−1} + φ₂X_{t−2} + · · · + φ_{p−1}X_{t−p+1} + φ_p X_{t−p} + ε_t,
for some white noise (ε_t), with a causal type representation. Write y = x^T φ + ε. The lasso estimator φ is a minimizer of
(1/2T) ‖y − x^T φ‖² + λ Σ_{i=1}^p λ_i|φ_i|,
for some tuning parameters (λ, λ₁, · · · , λ_p). See Nardi & Rinaldo (2011, Autoregressive process modeling via the Lasso procedure).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 43
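A minimal sketch of this idea on a simulated AR(2) series (the use of embed() to build the lagged design and glmnet's penalty.factor argument for the weights λ_i are choices made here for the illustration, not prescriptions from the slides):

# Sketch: lasso on the lagged design of an AR process (simulated AR(2), fitted as AR(6)).
library(glmnet)
set.seed(8)
x <- arima.sim(model = list(ar = c(0.5, -0.3)), n = 500)
p <- 6
Z <- embed(as.numeric(x), p + 1)      # columns: x_t, x_{t-1}, ..., x_{t-p}
y <- Z[, 1]; X <- Z[, -1]
fit <- glmnet(X, y, alpha = 1, penalty.factor = rep(1, p))   # lambda_i weights, all equal here
coef(fit, s = 0.05)                   # higher-order lags are typically shrunk to zero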
Arthur Charpentier, SIDE Summer School, July 2019
lasso and Non-Linearities
Consider knots k₁, · · · , k_m; we want a function m which is a cubic polynomial between every pair of knots, continuous at each knot, and with continuous first and second derivatives at each knot. We can write m as
m(x) = β₀ + β₁x + β₂x² + β₃x³ + β₄(x − k₁)³₊ + · · · + β_{m+3}(x − k_m)³₊
One strategy is the following:
• fix the number of knots m (m < n)
• find the natural cubic spline m which minimizes Σ_{i=1}^n (y_i − m(x_i))²
• then choose m by cross validation
An alternative is to use a penalty based approach (Ridge type) to avoid overfitting (since with m = n, the residual sum of squares is null).
@freakonometrics freakonometrics freakonometrics.hypotheses.org 44
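A minimal sketch of the truncated power basis above, fitted by plain least squares (simulated data; the knot locations and helper names are chosen only for the example):

# Sketch: truncated power basis for a cubic spline with a handful of knots.
set.seed(9)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)
knots <- c(2.5, 5, 7.5)
tp <- function(x, k) pmax(x - k, 0)^3               # (x - k)^3_+
B <- cbind(x, x^2, x^3, sapply(knots, tp, x = x))   # basis; intercept added by lm()
fit <- lm(y ~ B)
# plot(x, y); lines(x, fitted(fit), lwd = 2)        # fitted cubic spline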
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
Consider a univariate nonlinear regression problem, so that E[Y |X = x] = m(x). Given a sample {(y₁, x₁), · · · , (y_n, x_n)}, consider the following penalized problem
m = argmin_{m ∈ C²} { Σ_{i=1}^n (y_i − m(x_i))² + λ ∫_R m″(x)² dx }
with the residual sum of squares on the left, and a penalty for the roughness of the function. The solution is a natural cubic spline with knots at the unique values of x (see Eubank (1999, Nonparametric Regression and Spline Smoothing)).
Consider some spline basis {h₁, · · · , h_n}, and let m(x) = Σ_{i=1}^n β_i h_i(x). Let H and Ω be the n × n matrices H_{i,j} = h_j(x_i) and Ω_{i,j} = ∫_R h_i″(x) h_j″(x) dx.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 45
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
Then the objective function can be written
(y − Hβ)^T (y − Hβ) + λβ^T Ωβ
Recognize here a generalized Ridge regression, with solution
β_λ = (H^T H + λΩ)^{−1} H^T y.
Note that predicted values are linear functions of the observed values, since
ŷ = H(H^T H + λΩ)^{−1} H^T y = S_λ y,
with degrees of freedom trace(S_λ). One can obtain the so-called Reinsch form by considering the singular value decomposition of H = UDV^T.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 46
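In base R, smooth.spline() fits exactly this kind of penalized spline; a minimal sketch on simulated data (the reported df is the trace of the smoother matrix S_λ):

# Sketch: penalized (smoothing) spline in base R.
set.seed(10)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)
fit_rough  <- smooth.spline(x, y, df = 20)   # small penalty, wiggly fit
fit_smooth <- smooth.spline(x, y)            # lambda chosen by (generalized) cross validation
c(df_rough = fit_rough$df, df_cv = fit_smooth$df, lambda_cv = fit_smooth$lambda)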
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
Here U is orthogonal since H is square (n × n), and D is here invertible. Then
S_λ = (I + λ U D^{−1} V^T Ω V D^{−1} U^T)^{−1} = (I + λK)^{−1}
where K is a positive semidefinite matrix, K = B∆B^T, and the columns of B are known as the Demmler-Reinsch basis. In that (orthonormal) basis, S_λ is a diagonal matrix,
S_λ = B(I + λ∆)^{−1} B^T
Observe that S_λ B_k = (1 + λ∆_{k,k})^{−1} B_k. Here again, eigenvalues are shrinkage coefficients of basis vectors.
With more covariates, consider an additive problem
(m₁, · · · , m_p) = argmin_{m₁,··· ,m_p ∈ C²} { Σ_{i=1}^n (y_i − Σ_{j=1}^p m_j(x_{i,j}))² + λ Σ_{j=1}^p ∫_R m_j″(x)² dx }
@freakonometrics freakonometrics freakonometrics.hypotheses.org 47
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
which can be written
min { (y − Σ_{j=1}^p H_j β_j)^T (y − Σ_{j=1}^p H_j β_j) + λ Σ_{j=1}^p β_j^T Ω_j β_j }
where each matrix H_j is a Demmler-Reinsch basis for variable x_j.
Chouldechova & Hastie (2015, Generalized Additive Model Selection): assume that the mean function for the jth variable is m_j(x) = α_j x + h_j(x)^T β_j. One can write
min { (y − α₀ − Σ_{j=1}^p α_j x_j − Σ_{j=1}^p H_j β_j)^T (y − α₀ − Σ_{j=1}^p α_j x_j − Σ_{j=1}^p H_j β_j)
  + λ Σ_j ( γ|α_j| + (1 − γ)‖β_j‖_{Ω_j} ) + ψ₁β₁^T Ω₁β₁ + · · · + ψ_p β_p^T Ω_p β_p }
where ‖β_j‖_{Ω_j} = β_j^T Ω_j β_j.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 48
Arthur Charpentier, SIDE Summer School, July 2019
GAM, splines and Ridge regression
The second term is the selection penalty, with a mixture of ℓ1 and ℓ2 (type) norm-based penalties.
The third term is the end-to-path penalty (GAM type when λ = 0).
For each predictor x_j, there are three possibilities
• zero, α_j = 0 and β_j = 0
• linear, α_j ≠ 0 and β_j = 0
• nonlinear, β_j ≠ 0
@freakonometrics freakonometrics freakonometrics.hypotheses.org 49
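Chouldechova & Hastie's procedure is implemented in the gamsel package; a minimal sketch, assuming gamsel is installed and using its default bases (the simulated data and seed are made up for the example):

# Sketch (assumes the gamsel package): each fitted term ends up zero, linear or nonlinear.
library(gamsel)
set.seed(11)
n <- 200
X <- matrix(runif(n * 3), n, 3)
y <- sin(2 * pi * X[, 1]) + X[, 2] + rnorm(n, sd = 0.3)   # nonlinear, linear, irrelevant
fit <- gamsel(X, y)    # path of additive models; each term is zero, linear or nonlinear
# plot(fit, newx = X, index = 20)   # (optional) fitted functions at one point of the path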
Arthur Charpentier, SIDE Summer School, July 2019
[Figures: estimated smooth functions of the three variables, paths of the linear and non-linear components as functions of λ, and cross-validated mean-squared error]
@freakonometrics freakonometrics freakonometrics.hypotheses.org 50
Arthur Charpentier, SIDE Summer School, July 2019
Coordinate Descent
LASSO Coordinate Descent Algorithm
1. Set an initial value β^(0).
2. For k = 1, 2, · · · , for j = 1, · · · , p
(i) compute R_j = x_j^T (y − X_{−j} β^(k−1)_{(−j)})
(ii) set β_{k,j} = R_j · (1 − λ/(2|R_j|))₊
3. The final estimate β^(κ) is β_λ.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 51
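A minimal sketch of this loop on simulated data, with columns scaled so that x_j^T x_j = 1 (the normalization under which update (ii) is the exact coordinate minimizer); the data, λ and iteration count are chosen only for the example:

# Sketch: coordinate descent for ||y - X b||^2 + lambda * ||b||_1, unit-norm columns.
set.seed(12)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")           # x_j' x_j = 1
y <- as.vector(X[, 1:2] %*% c(5, -3) + rnorm(n))
lambda <- 2
beta <- rep(0, p)                                   # initial value beta^(0)
for (k in 1:100) {
  for (j in 1:p) {
    Rj <- sum(X[, j] * (y - X[, -j] %*% beta[-j]))  # x_j'(y - X_{-j} beta_{-j})
    beta[j] <- Rj * max(1 - lambda / (2 * abs(Rj)), 0)
  }
}
round(beta, 3)                                      # typically a sparse solution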
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Dantzig Selection
Candès & Tao (2007, The Dantzig selector: Statistical estimation when p is much larger than n) defined
β^dantzig_λ ∈ argmin_{β∈R^p} { ‖β‖₁ } s.t. ‖X^T(y − Xβ)‖_∞ ≤ λ
@freakonometrics freakonometrics freakonometrics.hypotheses.org 52
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Adaptive Lasso
Zou (2006, The Adaptive Lasso) defined
β^a-lasso_λ ∈ argmin_{β∈R^p} { ‖y − Xβ‖²₂ + λ Σ_{j=1}^p |β_j| / |β^γ-lasso_{λ,j}| }
where β^γ-lasso_λ = Π_{X_{s(λ)}} y, and s(λ) is the set of non-null components of β^lasso_λ.
See libraries lqa or lassogrp.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 53
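A common way to fit an adaptive lasso in practice (not mentioned on the slide) is glmnet's penalty.factor argument, with weights 1/|β̃_j| taken from a pilot estimator; a minimal sketch, using OLS as the pilot on simulated data:

# Sketch: adaptive lasso via glmnet's penalty.factor (weights 1/|beta_pilot|, OLS pilot).
library(glmnet)
set.seed(13)
X <- matrix(rnorm(200 * 8), 200, 8)
y <- X[, 1] - 2 * X[, 3] + rnorm(200)
b_pilot <- coef(lm(y ~ X))[-1]                      # pilot estimator (OLS, drop intercept)
w <- 1 / abs(b_pilot)                               # adaptive weights
afit <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
coef(afit, s = "lambda.1se")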
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Group Lasso
Assume that the variables x ∈ R^p can be grouped in L subgroups, x = (x₁, · · · , x_L), where dim[x_l] = p_l. Yuan & Lin (2006, Model selection and estimation in regression with grouped variables) defined, for some positive definite p_l × p_l matrices K_l,
β^g-lasso_λ ∈ argmin_{β∈R^p} { ‖y − Xβ‖²₂ + λ Σ_{l=1}^L (β_l^T K_l β_l)^{1/2} }
or, if K_l = p_l I,
β^g-lasso_λ ∈ argmin_{β∈R^p} { ‖y − Xβ‖²₂ + λ Σ_{l=1}^L √p_l ‖β_l‖₂ }
See library gglasso.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 54
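A minimal sketch with the gglasso package (the group indices, simulated data and the chosen point on the λ path are made up for the illustration):

# Sketch (assumes the gglasso package): group lasso with two groups of predictors.
library(gglasso)
set.seed(14)
X <- matrix(rnorm(150 * 6), 150, 6)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(150)             # only the first group matters
grp <- c(1, 1, 1, 2, 2, 2)                          # group membership of each column
gfit <- gglasso(X, y, group = grp, loss = "ls")
coef(gfit)[, 10]                                    # coefficients at one value of lambda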
Arthur Charpentier, SIDE Summer School, July 2019
From LASSO to Sparse-Group Lasso
Assume that the variables x ∈ R^p can be grouped in L subgroups, x = (x₁, · · · , x_L), where dim[x_l] = p_l. Simon et al. (2013, A Sparse-Group Lasso) defined, for some positive definite p_l × p_l matrices K_l,
β^sg-lasso_{λ,μ} ∈ argmin_{β∈R^p} { ‖y − Xβ‖²₂ + λ Σ_{l=1}^L (β_l^T K_l β_l)^{1/2} + μ‖β‖₁ }
See library SGL.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 55