# 3 Regularization & Penalized Regression
Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
Linear Model and Variable Selection
Let $s$ denote a subset of $\{0, 1, \cdots, p\}$, with cardinal $|s|$. $X_s$ is the matrix with columns $x_j$ where $j \in s$.
Consider the model $Y = X_s\beta_s + \eta$, so that $\widehat{\beta}_s = (X_s^TX_s)^{-1}X_s^Ty$.
In general, $\widehat{\beta}_s \neq (\widehat{\beta})_s$.
$R^2$ is usually not a good measure, since $R^2(s) \leq R^2(t)$ when $s \subset t$. Some use the adjusted $R^2$,
$$\bar{R}^2(s) = 1 - \frac{n-1}{n-|s|}\left(1 - R^2(s)\right)$$
The mean square error is
$$\text{mse}(s) = E\left(\|X\beta - X_s\widehat{\beta}_s\|^2\right) = E\left(\text{RSS}(s)\right) - n\sigma^2 + 2|s|\sigma^2$$
Define Mallows' $C_p$ as
$$C_p(s) = \frac{\text{RSS}(s)}{\widehat{\sigma}^2} - n + 2|s|$$
Rule of thumb: the model with variables $s$ is valid if $C_p(s) \leq |s|$.
In a linear model,
$$\log\mathcal{L}(\beta, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\|y - X\beta\|^2$$
and
$$\log\mathcal{L}(\widehat{\beta}_s, \widehat{\sigma}^2_s) = -\frac{n}{2}\log\frac{\text{RSS}(s)}{n} - \frac{n}{2}\left[1 + \log(2\pi)\right]$$
Overly complex models must therefore be penalized.
Akaike's AIC: $\text{AIC}(s) = \dfrac{n}{2}\log\dfrac{\text{RSS}(s)}{n} + \dfrac{n}{2}\left[1 + \log(2\pi)\right] + 2|s|$
Schwarz's BIC: $\text{BIC}(s) = \dfrac{n}{2}\log\dfrac{\text{RSS}(s)}{n} + \dfrac{n}{2}\left[1 + \log(2\pi)\right] + |s|\log n$
An exhaustive search over all $2^{p+1}$ models is too complicated. Stepwise procedures, forward or backward, are faster, but neither very stable nor fully satisfactory.
For variable selection, use the classical stats::step function, or leaps::regsubsets for best subset, forward stepwise and backward stepwise selection.
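For instance, a minimal sketch on a built-in dataset (mtcars is purely illustrative, not from the slides):

    # stepwise search, penalizing with AIC (k = 2) or BIC (k = log(n))
    library(leaps)
    fit <- lm(mpg ~ ., data = mtcars)
    step(fit, direction = "backward", k = 2)                  # AIC
    step(fit, direction = "backward", k = log(nrow(mtcars)))  # BIC
    # best subset search
    best <- regsubsets(mpg ~ ., data = mtcars, method = "exhaustive")
    summary(best)$bic   # BIC of the best model of each size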
For leave-one-out cross validation, we can write
$$\frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_{(i)})^2 = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \widehat{y}_i}{1 - H_{i,i}}\right)^2$$
Heuristically, $(y_i - \widehat{y}_i)^2$ underestimates the true prediction error, and the underestimation is larger when the correlation between $y_i$ and $\widehat{y}_i$ is high.
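A quick sketch of the leave-one-out shortcut above, using the diagonal of the hat matrix rather than refitting $n$ models (again on an illustrative dataset):

    fit <- lm(mpg ~ ., data = mtcars)
    h   <- hatvalues(fit)                # diagonal of H
    mean((residuals(fit) / (1 - h))^2)   # LOO cross-validated MSE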
One can use $\text{Cov}[\widehat{y}, y]$, e.g. in Mallows' $C_p$,
$$C_p = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2 + \frac{2}{n}\sigma^2\,\widehat{p}, \quad \text{where } \widehat{p} = \frac{1}{\sigma^2}\text{trace}\left[\text{Cov}(\widehat{y}, y)\right]$$
With Gaussian errors, AIC and Mallows' $C_p$ are asymptotically equivalent.
Penalized Inference and Shrinkage
Consider a parametric model, with true (unknown) parameter $\theta$; then
$$\text{mse}(\widehat{\theta}) = E\left[(\widehat{\theta} - \theta)^2\right] = \underbrace{E\left[(\widehat{\theta} - E[\widehat{\theta}])^2\right]}_{\text{variance}} + \underbrace{\left(E[\widehat{\theta}] - \theta\right)^2}_{\text{bias}^2}$$
One can think of a shrinkage of an unbiased estimator. Let $\widetilde{\theta}$ denote an unbiased estimator of $\theta$. Then
$$\widehat{\theta} = \frac{\theta^2}{\theta^2 + \text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta} = \widetilde{\theta} - \underbrace{\frac{\text{mse}(\widetilde{\theta})}{\theta^2 + \text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta}}_{\text{penalty}}$$
satisfies $\text{mse}(\widehat{\theta}) \leq \text{mse}(\widetilde{\theta})$.
Normalization: Euclidean $\ell_2$ vs. Mahalanobis
We want to penalize complicated models: if $\beta_k$ is "too small", we prefer to have $\beta_k = 0$.
Instead of $d(x, y) = \sqrt{(x - y)^T(x - y)}$, use $d_\Sigma(x, y) = \sqrt{(x - y)^T\Sigma^{-1}(x - y)}$.
[Figure: level curves of the Euclidean and Mahalanobis penalties in the $(\beta_1, \beta_2)$ plane]
Linear Regression Shortcoming
Least Squares Estimator $\widehat{\beta} = (X^TX)^{-1}X^Ty$
Unbiased Estimator: $E[\widehat{\beta}] = \beta$
Variance: $\text{Var}[\widehat{\beta}] = \sigma^2(X^TX)^{-1}$, which can be (extremely) large when $\det[(X^TX)] \sim 0$.
$$X = \begin{pmatrix} 1 & -1 & 2 \\ 1 & 0 & 1 \\ 1 & 2 & -1 \\ 1 & 1 & 0 \end{pmatrix} \text{ then } X^TX = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 6 & -4 \\ 2 & -4 & 6 \end{pmatrix} \text{ while } X^TX + I = \begin{pmatrix} 5 & 2 & 2 \\ 2 & 7 & -4 \\ 2 & -4 & 7 \end{pmatrix}$$
eigenvalues: $\{10, 6, 0\}$ and $\{11, 7, 1\}$ respectively.
Ad-hoc strategy: use $X^TX + \lambda I$.
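A small numerical check of the eigenvalues above (matrices taken from this slide):

    # the ridge fix X'X + lambda*I makes a singular design invertible
    X <- matrix(c(1, -1,  2,
                  1,  0,  1,
                  1,  2, -1,
                  1,  1,  0), nrow = 4, byrow = TRUE)
    eigen(t(X) %*% X)$values            # 10, 6, 0  -> singular
    eigen(t(X) %*% X + diag(3))$values  # 11, 7, 1  -> invertible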
Ridge Regression
... like least squares, but it shrinks estimated coefficients towards 0.
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\} = \underset{\beta}{\text{argmin}}\left\{\underbrace{\|y - X\beta\|_2^2}_{\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\right\}$$
$\lambda \geq 0$ is a tuning parameter.
van Wieringen (2018, Lecture Notes on Ridge Regression)
Ridge Estimator (OLS)
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}$$
Ridge Estimator (GLM)
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{-\sum_{i=1}^n \log f\left(y_i|\mu_i = g^{-1}(x_i^T\beta)\right) + \frac{\lambda}{2}\sum_{j=1}^p \beta_j^2\right\}$$
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{\|y - (\beta_0 + X\beta)\|_2^2 + \lambda\|\beta\|_2^2\right\}$$
can be seen as a constrained optimization problem:
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta:\ \|\beta\|_2^2 \leq h_\lambda}{\text{argmin}}\left\{\|y - (\beta_0 + X\beta)\|_2^2\right\}$$
Explicit solution:
$$\widehat{\beta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty$$
If $\lambda \to 0$, $\widehat{\beta}_0^{\text{ridge}} = \widehat{\beta}^{\text{ols}}$; if $\lambda \to \infty$, $\widehat{\beta}_\infty^{\text{ridge}} = 0$.
[Figure: level curves of the least-squares objective in the $(\beta_1, \beta_2)$ plane, with the $\ell_2$ constraint ball and the ridge solution]
This penalty can be seen as rather unfair if the components of $x$ are not expressed on the same scale, so first
• center: $\bar{x}_j = 0$, then $\widehat{\beta}_0 = \bar{y}$
• scale: $x_j^Tx_j = 1$
Then compute
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{\underbrace{\|y - X\beta\|_2^2}_{\text{loss}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\right\}$$
Observe that if $x_{j_1} \perp x_{j_2}$, then
$$\widehat{\beta}_\lambda^{\text{ridge}} = [1 + \lambda]^{-1}\widehat{\beta}^{\text{ols}}$$
which explains the relationship with shrinkage. But in general this is not the case...
Smaller mse: there exists $\lambda$ such that $\text{mse}[\widehat{\beta}_\lambda^{\text{ridge}}] \leq \text{mse}[\widehat{\beta}^{\text{ols}}]$.
$$\mathcal{L}_\lambda(\beta) = \sum_{i=1}^n (y_i - \beta_0 - x_i^T\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2$$
$$\frac{\partial\mathcal{L}_\lambda(\beta)}{\partial\beta} = -2X^Ty + 2(X^TX + \lambda I)\beta, \qquad \frac{\partial^2\mathcal{L}_\lambda(\beta)}{\partial\beta\,\partial\beta^T} = 2(X^TX + \lambda I)$$
where $X^TX$ is a positive semi-definite matrix and $\lambda I$ is a positive definite matrix, and
$$\widehat{\beta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty$$
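A minimal sketch of the explicit solution, assuming centered and scaled data (the dataset is illustrative):

    # explicit ridge solution (X'X + lambda I)^{-1} X'y
    X <- scale(as.matrix(mtcars[, -1]))
    y <- mtcars$mpg - mean(mtcars$mpg)
    lambda <- 1
    solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)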
The Bayesian Interpretation
From a Bayesian perspective,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{prior}}$$
i.e.
$$\log\mathbb{P}[\theta|y] = \underbrace{\log\mathbb{P}[y|\theta]}_{\text{log likelihood}} + \underbrace{\log\mathbb{P}[\theta]}_{\text{penalty}}$$
If $\beta$ has a prior $\mathcal{N}(0, \tau^2I)$ distribution, then its posterior distribution has mean
$$E[\beta|y, X] = \left(X^TX + \frac{\sigma^2}{\tau^2}I\right)^{-1}X^Ty.$$
Properties of the Ridge Estimator
$$\widehat{\beta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty$$
$$E[\widehat{\beta}_\lambda] = X^TX(\lambda I + X^TX)^{-1}\beta,$$
i.e. $E[\widehat{\beta}_\lambda] \neq \beta$: the ridge estimator is biased. Observe that $E[\widehat{\beta}_\lambda] \to 0$ as $\lambda \to \infty$.
Ridge & Shrinkage: assume that $X$ is an orthogonal design matrix, i.e. $X^TX = I$; then
$$\widehat{\beta}_\lambda = (1 + \lambda)^{-1}\widehat{\beta}^{\text{ols}}.$$
Set $W_\lambda = (I + \lambda[X^TX]^{-1})^{-1}$. One can prove that
$$W_\lambda\widehat{\beta}^{\text{ols}} = \widehat{\beta}_\lambda.$$
Thus,
$$\text{Var}[\widehat{\beta}_\lambda] = W_\lambda\text{Var}[\widehat{\beta}^{\text{ols}}]W_\lambda^T$$
and
$$\text{Var}[\widehat{\beta}_\lambda] = \sigma^2(X^TX + \lambda I)^{-1}X^TX[(X^TX + \lambda I)^{-1}]^T.$$
Observe that
$$\text{Var}[\widehat{\beta}^{\text{ols}}] - \text{Var}[\widehat{\beta}_\lambda] = \sigma^2W_\lambda\left[2\lambda(X^TX)^{-2} + \lambda^2(X^TX)^{-3}\right]W_\lambda^T \geq 0.$$
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than that of the OLS.
If $X$ is an orthogonal design matrix,
$$\text{Var}[\widehat{\beta}_\lambda] = \sigma^2(1 + \lambda)^{-2}I.$$
In general,
$$\text{mse}[\widehat{\beta}_\lambda] = \sigma^2\text{trace}\left(W_\lambda(X^TX)^{-1}W_\lambda^T\right) + \beta^T(W_\lambda - I)^T(W_\lambda - I)\beta.$$
If $X$ is an orthogonal design matrix,
$$\text{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^T\beta$$
[Figure: confidence ellipsoids of the OLS and ridge estimators in the $(\beta_1, \beta_2)$ plane]
The mse
$$\text{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^T\beta$$
is minimal for
$$\lambda^\star = \frac{p\sigma^2}{\beta^T\beta}$$
Note that there exists $\lambda > 0$ such that $\text{mse}[\widehat{\beta}_\lambda] < \text{mse}[\widehat{\beta}_0] = \text{mse}[\widehat{\beta}^{\text{ols}}]$.
Ridge regression is obtained using glmnet::glmnet(..., alpha = 0), with glmnet::cv.glmnet for cross validation.
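For instance, a minimal sketch (dataset and variable names are illustrative):

    library(glmnet)
    X <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
    cv <- cv.glmnet(X, y, alpha = 0)   # cross-validated ridge path
    cv$lambda.min                      # lambda minimizing CV error
    coef(cv, s = "lambda.min")         # shrunken coefficients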
SVD decomposition
For any $m \times n$ matrix $A$, there are orthogonal matrices $U$ ($m \times m$), $V$ ($n \times n$) and a "diagonal" matrix $\Sigma$ ($m \times n$) such that $A = U\Sigma V^T$, or $AV = U\Sigma$.
Hence, there exists a special orthonormal set of vectors (the columns of $V$) that is mapped by the matrix $A$ into an orthonormal set of vectors (the columns of $U$).
Let $r = \text{rank}(A)$; then
$$A = \sum_{i=1}^r \sigma_iu_iv_i^T$$
(called the dyadic decomposition of $A$). Observe that it can be used to compute (e.g.) the Frobenius norm of $A$,
$$\|A\| = \sqrt{\sum a_{i,j}^2} = \sqrt{\sigma_1^2 + \cdots + \sigma_{\min\{m,n\}}^2}.$$
Further $A^TA = V\Sigma^T\Sigma V^T$ while $AA^T = U\Sigma\Sigma^TU^T$. Hence, the $\sigma_i^2$'s are related to the eigenvalues of $A^TA$ and $AA^T$, and $u_i$, $v_i$ are the associated eigenvectors.
Golub & Reinsch (1970, Singular Value Decomposition and Least Squares Solutions)
Consider the singular value decomposition of $X$, $X = UDV^T$. Then
$$\widehat{\beta}^{\text{ols}} = VD^{-2}D\,U^Ty \qquad \text{and} \qquad \widehat{\beta}_\lambda = V(D^2 + \lambda I)^{-1}D\,U^Ty$$
Observe that
$$D_{i,i}^{-1} \geq \frac{D_{i,i}}{D_{i,i}^2 + \lambda}$$
hence, the ridge penalty shrinks singular values.
Set now $R = UD$ (an $n \times n$ matrix), so that $X = RV^T$; then
$$\widehat{\beta}_\lambda = V(R^TR + \lambda I)^{-1}R^Ty$$
Hat matrix and Degrees of Freedom
Recall that $\widehat{Y} = HY$ with
$$H = X(X^TX)^{-1}X^T$$
Similarly
$$H_\lambda = X(X^TX + \lambda I)^{-1}X^T$$
$$\text{trace}[H_\lambda] = \sum_{j=1}^p \frac{d_{j,j}^2}{d_{j,j}^2 + \lambda} \longrightarrow 0, \text{ as } \lambda \to \infty.$$
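A short sketch of $\text{trace}[H_\lambda]$ computed from the singular values (illustrative data; the effective degrees of freedom shrink as $\lambda$ grows):

    d <- svd(scale(as.matrix(mtcars[, -1])))$d
    df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
    sapply(c(0, 1, 10, 100), df_ridge)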
Sparsity Issues
In several applications, $k$ can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s \ll k$, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),
$$s = \text{card}\{\mathcal{S}\} \text{ where } \mathcal{S} = \{j;\ \beta_j \neq 0\}$$
The model is now $y = X_{\mathcal{S}}\beta_{\mathcal{S}} + \varepsilon$, where $X_{\mathcal{S}}^TX_{\mathcal{S}}$ is a full rank matrix.
Going further on sparsity issues
The Ridge regression problem was to solve
$$\widehat{\beta} = \underset{\beta \in \{\|\beta\|_2 \leq s\}}{\text{argmin}}\left\{\|Y - X^T\beta\|_2^2\right\}$$
Define $\|a\|_0 = \sum 1(|a_i| > 0)$. Here $\dim(\beta) = k$ but $\|\beta\|_0 = s$.
We wish we could solve
$$\widehat{\beta} = \underset{\beta \in \{\|\beta\|_0 = s\}}{\text{argmin}}\left\{\|Y - X^T\beta\|_2^2\right\}$$
Problem: it is usually not possible to describe all possible constraints, since $\binom{k}{s}$ sets of coefficients should be considered here (with $k$ (very) large).
In a convex problem, one can solve the dual problem; e.g., in the Ridge regression, the primal problem is
$$\min_{\beta \in \{\|\beta\|_2 \leq s\}}\left\{\|Y - X^T\beta\|_2^2\right\}$$
and the dual problem is
$$\min_{\beta \in \{\|Y - X^T\beta\|_2 \leq t\}}\left\{\|\beta\|_2^2\right\}$$
Idea: solve the dual problem
$$\widehat{\beta} = \underset{\beta \in \{\|Y - X^T\beta\|_2 \leq h\}}{\text{argmin}}\left\{\|\beta\|_0\right\}$$
where we might convexify the $\ell_0$ norm, $\|\cdot\|_0$.
On $[-1, +1]^k$, the convex hull of $\|\beta\|_0$ is $\|\beta\|_1$. On $[-a, +a]^k$, the convex hull of $\|\beta\|_0$ is $a^{-1}\|\beta\|_1$. Hence, why not solve
$$\widehat{\beta} = \underset{\beta;\ \|\beta\|_1 \leq \tilde{s}}{\text{argmin}}\left\{\|Y - X^T\beta\|_2\right\}$$
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$$\widehat{\beta} = \underset{\beta}{\text{argmin}}\left\{\|Y - X^T\beta\|_2^2 + \lambda\|\beta\|_1\right\}$$
lasso: Least Absolute Shrinkage and Selection Operator
lasso Estimator (OLS)
$$\widehat{\beta}_\lambda^{\text{lasso}} = \underset{\beta}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|\right\}$$
lasso Estimator (GLM)
$$\widehat{\beta}_\lambda^{\text{lasso}} = \underset{\beta}{\text{argmin}}\left\{-\sum_{i=1}^n \log f\left(y_i|\mu_i = g^{-1}(x_i^T\beta)\right) + \frac{\lambda}{2}\sum_{j=1}^p |\beta_j|\right\}$$
lasso Regression
For some $\lambda$, there are $k$'s such that $\widehat{\beta}_{k,\lambda}^{\text{lasso}} = 0$. Further, $\lambda \mapsto \widehat{\beta}_{k,\lambda}^{\text{lasso}}$ is piecewise linear.
[Figure: level curves of the least-squares objective in the $(\beta_1, \beta_2)$ plane, with the $\ell_1$ constraint region and the corresponding lasso solutions]
In the orthogonal case, $X^TX = I$,
$$\widehat{\beta}_{k,\lambda}^{\text{lasso}} = \text{sign}(\widehat{\beta}_k^{\text{ols}})\left(|\widehat{\beta}_k^{\text{ols}}| - \frac{\lambda}{2}\right)_+$$
i.e. the LASSO estimate is related to the soft-threshold function...
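The soft-thresholding operator is one line of code; a minimal sketch:

    soft <- function(b, t) sign(b) * pmax(abs(b) - t, 0)
    soft(c(-2, -0.3, 0.1, 1.5), t = 0.5)  # small coefficients are set to 0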
Optimal lasso Penalty
Use cross validation, e.g. K-fold: estimate on all folds but the $k$th,
$$\widehat{\beta}_{(-k)}(\lambda) = \underset{\beta}{\text{argmin}}\left\{\sum_{i \notin I_k} [y_i - x_i^T\beta]^2 + \lambda\|\beta\|_1\right\}$$
then compute the sum of squared errors on the held-out fold,
$$Q_k(\lambda) = \sum_{i \in I_k} [y_i - x_i^T\widehat{\beta}_{(-k)}(\lambda)]^2$$
and finally solve
$$\lambda^\star = \underset{\lambda}{\text{argmin}}\left\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\right\}$$
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009, The Elements of Statistical Learning) suggest the largest $\lambda$ such that
$$\overline{Q}(\lambda) \leq \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star] \quad \text{with} \quad \text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K \left[Q_k(\lambda) - \overline{Q}(\lambda)\right]^2$$
lasso regression is obtained using glmnet::glmnet(..., alpha = 1), with glmnet::cv.glmnet for cross validation.
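A minimal sketch of the one-standard-error rule with cv.glmnet, whose lambda.1se is precisely the largest $\lambda$ within one standard error of the CV minimum (illustrative data):

    library(glmnet)
    X <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
    cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
    coef(cv, s = "lambda.1se")   # sparser than s = "lambda.min"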
lasso and lar (Least-Angle Regression)
lasso estimation can be seen as an adaptation of the LAR procedure; a code sketch follows the list below.
Least Angle Regression
(i) fix a (small) step size $\delta > 0$
(ii) start with initial residual $\varepsilon = y$, and $\widehat{\beta} = 0$
(iii) find the predictor $x_j$ with the highest correlation with $\varepsilon$
(iv) update $\widehat{\beta}_j = \widehat{\beta}_j + \delta_j$, where $\delta_j = \delta \cdot \text{sign}[\varepsilon^Tx_j]$
(v) set $\varepsilon = \varepsilon - \delta_jx_j$ and go to (iii)
see Efron et al. (2004, Least Angle Regression)
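A stripped-down sketch of the stagewise loop above; the step size delta and the iteration count are arbitrary choices, and the data is illustrative:

    X <- scale(as.matrix(mtcars[, -1])); y <- scale(mtcars$mpg)
    beta <- rep(0, ncol(X)); eps <- y; delta <- 0.01
    for (it in 1:1000) {
      j <- which.max(abs(cor(eps, X)))      # most correlated predictor
      d <- delta * sign(sum(eps * X[, j]))  # signed step
      beta[j] <- beta[j] + d
      eps <- eps - d * X[, j]               # update residual
    }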
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Define
$$\|a\|_0 = \sum_{i=1}^d 1(a_i \neq 0), \quad \|a\|_1 = \sum_{i=1}^d |a_i| \quad \text{and} \quad \|a\|_2 = \left(\sum_{i=1}^d a_i^2\right)^{1/2}, \quad \text{for } a \in \mathbb{R}^d.$$
Each constrained optimization problem (left) has a penalized counterpart (right):
($\ell_0$): $\underset{\beta;\ \|\beta\|_0 \leq s}{\text{argmin}} \sum_{i=1}^n \ell(y_i, \beta_0 + x^T\beta)$ vs. $\underset{\beta,\lambda}{\text{argmin}} \sum_{i=1}^n \ell(y_i, \beta_0 + x^T\beta) + \lambda\|\beta\|_0$
($\ell_1$): $\underset{\beta;\ \|\beta\|_1 \leq s}{\text{argmin}} \sum_{i=1}^n \ell(y_i, \beta_0 + x^T\beta)$ vs. $\underset{\beta,\lambda}{\text{argmin}} \sum_{i=1}^n \ell(y_i, \beta_0 + x^T\beta) + \lambda\|\beta\|_1$
($\ell_2$): $\underset{\beta;\ \|\beta\|_2 \leq s}{\text{argmin}} \sum_{i=1}^n \ell(y_i, \beta_0 + x^T\beta)$ vs. $\underset{\beta,\lambda}{\text{argmin}} \sum_{i=1}^n \ell(y_i, \beta_0 + x^T\beta) + \lambda\|\beta\|_2$
Assume that $\ell$ is the quadratic loss.
The two problems ($\ell_2$) are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left (constrained) problem, $\exists\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the right (penalized) problem, and conversely.
The two problems ($\ell_1$) are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left problem, $\exists\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the right problem, and conversely. Nevertheless, despite this theoretical equivalence, there may be numerical issues, since the solution is not necessarily unique.
The two problems ($\ell_0$) are not equivalent: if $(\beta^\star, \lambda^\star)$ is a solution of the right problem, $\exists s^\star$ such that $\beta^\star$ is a solution of the left problem, but the converse is not true.
More generally, consider an $\ell_p$ norm:
• sparsity is obtained when $p \leq 1$
• convexity is obtained when $p \geq 1$
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve the penalized problem of ($\ell_0$) directly. But it is a complex combinatorial problem in high dimension: Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is NP-hard.
One can prove that if $\lambda \sim \sigma^2\log(p)$, then
$$E\left(\left[x^T\widehat{\beta} - x^T\beta_0\right]^2\right) \leq \underbrace{E\left(\left[x_{\mathcal{S}}^T\widehat{\beta}_{\mathcal{S}} - x^T\beta_0\right]^2\right)}_{=\sigma^2\#\mathcal{S}} \cdot \left(4\log p + 2 + o(1)\right).$$
In that case
$$\widehat{\beta}_{\lambda,j}^{\text{sub}} = \begin{cases} 0 & \text{if } j \notin \mathcal{S}_\lambda(\widehat{\beta}) \\ \widehat{\beta}_j^{\text{ols}} & \text{if } j \in \mathcal{S}_\lambda(\widehat{\beta}) \end{cases}$$
where $\mathcal{S}_\lambda(\widehat{\beta})$ is the set of non-null components of solutions of ($\ell_0$).
If the penalty is no longer the quadratic norm but the $\ell_1$ norm, problem ($\ell_1$) is not always strictly convex, and the optimum is not always unique (e.g. if $X^TX$ is singular).
But the quadratic loss is strictly convex in $X\beta$, so at least $X\widehat{\beta}$ is unique.
Further, note that solutions are necessarily coherent (signs of coefficients): it is not possible to have $\widehat{\beta}_j < 0$ for one solution and $\widehat{\beta}_j > 0$ for another one.
In many cases, problem ($\ell_1$) yields a corner-type solution, which can be seen as a "best subset" solution, like in ($\ell_0$).
Consider a simple regression $y_i = x_i\beta + \varepsilon$, with an $\ell_1$ penalty and an $\ell_2$ loss function. ($\ell_1$) becomes
$$\min\left\{y^Ty - 2y^Tx\beta + \beta x^Tx\beta + 2\lambda|\beta|\right\}$$
The first order condition can be written
$$-2y^Tx + 2x^Tx\beta \pm 2\lambda = 0$$
(the sign in $\pm$ being the sign of $\beta$). Assume that the least-squares estimate ($\lambda = 0$) is (strictly) positive, i.e. $y^Tx > 0$. If $\lambda$ is not too large, $\widehat{\beta}$ and $\widehat{\beta}^{\text{ols}}$ have the same sign, and
$$-2y^Tx + 2x^Tx\beta + 2\lambda = 0$$
with solution
$$\widehat{\beta}_\lambda^{\text{lasso}} = \frac{y^Tx - \lambda}{x^Tx}.$$
Increase $\lambda$ so that $\widehat{\beta}_\lambda = 0$. If we increase it slightly more, $\widehat{\beta}_\lambda$ cannot become negative: the sign in the first order condition would change, and we would have to solve
$$-2y^Tx + 2x^Tx\beta - 2\lambda = 0$$
whose solution is $\widehat{\beta}_\lambda^{\text{lasso}} = \frac{y^Tx + \lambda}{x^Tx}$. But that solution is positive (we assumed that $y^Tx > 0$), whereas we would need $\widehat{\beta}_\lambda < 0$. Thus, at some point $\widehat{\beta}_\lambda = 0$, which is a corner solution.
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-ideal Model Selection by $\ell_1$ Minimization).
With some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of $\widehat{\beta}_\lambda^{\text{lasso}}$ is the same as that of $\beta$.
Thus, lasso can be used for variable selection; see Hastie et al. (2001, The Elements of Statistical Learning).
Generally, $\widehat{\beta}_\lambda^{\text{lasso}}$ is a biased estimator, but its variance can be small enough to yield a smaller mean squared error than the OLS estimate.
With orthonormal covariates, one can prove that
$$\widehat{\beta}_{\lambda,j}^{\text{sub}} = \widehat{\beta}_j^{\text{ols}}\,1_{|\widehat{\beta}_j^{\text{ols}}| > b}, \qquad \widehat{\beta}_{\lambda,j}^{\text{ridge}} = \frac{\widehat{\beta}_j^{\text{ols}}}{1 + \lambda} \qquad \text{and} \qquad \widehat{\beta}_{\lambda,j}^{\text{lasso}} = \text{sign}[\widehat{\beta}_j^{\text{ols}}]\cdot\left(|\widehat{\beta}_j^{\text{ols}}| - \lambda\right)_+.$$
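A small sketch comparing the three shrinkage rules above as functions of the OLS coefficient (the thresholds b and lambda are arbitrary illustrative values):

    b <- 1; lambda <- 1
    hard  <- function(x) x * (abs(x) > b)                     # best subset
    ridge <- function(x) x / (1 + lambda)                     # ridge
    soft  <- function(x) sign(x) * pmax(abs(x) - lambda, 0)   # lasso
    x <- seq(-3, 3, by = 0.5)
    cbind(ols = x, hard = hard(x), ridge = ridge(x), lasso = soft(x))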
lasso for Autoregressive Time Series
Consider some AR($p$) autoregressive time series,
$$X_t = \phi_1X_{t-1} + \phi_2X_{t-2} + \cdots + \phi_{p-1}X_{t-p+1} + \phi_pX_{t-p} + \varepsilon_t,$$
for some white noise $(\varepsilon_t)$, with a causal-type representation. Write $y = x^T\phi + \varepsilon$. The lasso estimator $\widehat{\phi}$ is a minimizer of
$$\frac{1}{2T}\|y - x^T\phi\|^2 + \lambda\sum_{i=1}^p \lambda_i|\phi_i|,$$
for some tuning parameters $(\lambda, \lambda_1, \cdots, \lambda_p)$.
See Nardi & Rinaldo (2011, Autoregressive Process Modeling via the Lasso Procedure).
lasso and Non-Linearities
Consider knots $k_1, \cdots, k_m$; we want a function $m$ which is a cubic polynomial between every pair of knots, continuous at each knot, and with continuous first and second derivatives at each knot. We can write $m$ as
$$m(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - k_1)_+^3 + \cdots + \beta_{m+3}(x - k_m)_+^3$$
One strategy is the following:
• fix the number of knots $m$ ($m < n$)
• find the natural cubic spline $\widehat{m}$ which minimizes $\sum_{i=1}^n (y_i - m(x_i))^2$
• then choose $m$ by cross validation
An alternative is to use a penalty-based approach (Ridge type) to avoid overfitting (since with $m = n$, the residual sum of squares is null).
GAM, splines and Ridge regression
Consider a univariate nonlinear regression problem, so that $E[Y|X = x] = m(x)$. Given a sample $\{(y_1, x_1), \cdots, (y_n, x_n)\}$, consider the following penalized problem:
$$\widehat{m} = \underset{m \in \mathcal{C}^2}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - m(x_i))^2 + \lambda\int_{\mathbb{R}} m''(x)^2dx\right\}$$
with the residual sum of squares on the left, and a penalty for the roughness of the function on the right.
The solution is a natural cubic spline with knots at the unique values of $x$ (see Eubank (1999, Nonparametric Regression and Spline Smoothing)).
Consider some spline basis $\{h_1, \cdots, h_n\}$, and let $m(x) = \sum_{i=1}^n \beta_ih_i(x)$. Let $H$ and $\Omega$ be the $n \times n$ matrices with $H_{i,j} = h_j(x_i)$ and $\Omega_{i,j} = \int_{\mathbb{R}} h_i''(x)h_j''(x)dx$.
Then the objective function can be written
$$(y - H\beta)^T(y - H\beta) + \lambda\beta^T\Omega\beta$$
Recognize here a generalized Ridge regression, with solution
$$\widehat{\beta}_\lambda = \left(H^TH + \lambda\Omega\right)^{-1}H^Ty.$$
Note that predicted values are linear functions of the observed values, since
$$\widehat{y} = H\left(H^TH + \lambda\Omega\right)^{-1}H^Ty = S_\lambda y,$$
with degrees of freedom $\text{trace}(S_\lambda)$.
One can obtain the so-called Reinsch form by considering the singular value decomposition of $H = UDV^T$.
Here $U$ is orthogonal since $H$ is square ($n \times n$), and $D$ is here invertible. Then
$$S_\lambda = (I + \lambda U^TD^{-1}V^T\Omega VD^{-1}U)^{-1} = (I + \lambda K)^{-1}$$
where $K$ is a positive semidefinite matrix, $K = B\Delta B^T$, where the columns of $B$ are known as the Demmler-Reinsch basis.
In that (orthonormal) basis, $S_\lambda$ is a diagonal matrix,
$$S_\lambda = B\left(I + \lambda\Delta\right)^{-1}B^T$$
Observe that
$$S_\lambda B_k = \frac{1}{1 + \lambda\Delta_{k,k}}B_k.$$
Here again, eigenvalues are shrinkage coefficients of basis vectors.
With more covariates, consider an additive problem:
$$(\widehat{m}_1, \cdots, \widehat{m}_p) = \underset{m_1, \cdots, m_p \in \mathcal{C}^2}{\text{argmin}}\left\{\sum_{i=1}^n \left(y_i - \sum_{j=1}^p m_j(x_{i,j})\right)^2 + \lambda\sum_{j=1}^p \int_{\mathbb{R}} m_j''(x)^2dx\right\}$$
which can be written
$$\min\left\{\left(y - \sum_{j=1}^p H_j\beta_j\right)^T\left(y - \sum_{j=1}^p H_j\beta_j\right) + \lambda\sum_{j=1}^p \beta_j^T\Omega_j\beta_j\right\}$$
where each matrix $H_j$ is a Demmler-Reinsch basis for variable $x_j$.
Chouldechova & Hastie (2015, Generalized Additive Model Selection) assume that the mean function for the $j$th variable is $m_j(x) = \alpha_jx + m_j(x)^T\beta_j$. One can write
$$\min\left\{\left(y - \alpha_0 - \sum_{j=1}^p \alpha_jx_j - \sum_{j=1}^p H_j\beta_j\right)^T\left(y - \alpha_0 - \sum_{j=1}^p \alpha_jx_j - \sum_{j=1}^p H_j\beta_j\right)\right.$$
$$\left. + \lambda\sum_{j=1}^p\left[\gamma|\alpha_j| + (1 - \gamma)\|\beta_j\|_{\Omega_j}\right] + \psi_1\beta_1^T\Omega_1\beta_1 + \cdots + \psi_p\beta_p^T\Omega_p\beta_p\right\}$$
where $\|\beta_j\|_{\Omega_j} = \sqrt{\beta_j^T\Omega_j\beta_j}$.
The second term is the selection penalty, a mixture of $\ell_1$ and $\ell_2$ (type) norm-based penalties.
The third term is the end-to-path penalty (GAM type when $\lambda = 0$).
For each predictor $x_j$, there are three possibilities:
• zero: $\alpha_j = 0$ and $\beta_j = 0$
• linear: $\alpha_j \neq 0$ and $\beta_j = 0$
• nonlinear: $\beta_j \neq 0$
Coordinate Descent
LASSO Coordinate Descent Algorithm
1. Set $\beta^{(0)} = \widehat{\beta}$ (some initial estimate)
2. For $k = 1, \cdots$: for $j = 1, \cdots, p$,
(i) compute the partial residual correlation $R_j = x_j^T\left(y - X_{-j}\beta^{(k-1)}_{(-j)}\right)$
(ii) set $\beta_{k,j} = R_j\cdot\left(1 - \frac{\lambda}{2|R_j|}\right)_+$
3. The final estimate $\beta^{(\kappa)}$ is $\widehat{\beta}_\lambda$
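A direct sketch of this loop, assuming centered y and columns of X scaled so that $x_j^Tx_j = 1$; note that step (ii) is exactly a soft threshold at $\lambda/2$ (the iteration count is arbitrary):

    soft <- function(r, t) sign(r) * pmax(abs(r) - t, 0)
    lasso_cd <- function(X, y, lambda, niter = 100) {
      beta <- rep(0, ncol(X))
      for (k in 1:niter) {
        for (j in 1:ncol(X)) {
          R_j <- sum(X[, j] * (y - X[, -j, drop = FALSE] %*% beta[-j]))
          beta[j] <- soft(R_j, lambda / 2)   # step (ii)
        }
      }
      beta
    }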
From LASSO to Dantzig Selection
Candès & Tao (2007, The Dantzig Selector: Statistical Estimation When p Is Much Larger than n) defined
$$\widehat{\beta}_\lambda^{\text{dantzig}} \in \underset{\beta \in \mathbb{R}^p}{\text{argmin}}\left\{\|\beta\|_1\right\} \quad \text{s.t.} \quad \|X^T(y - X\beta)\|_\infty \leq \lambda$$
From LASSO to Adaptive Lasso
Zou (2006, The Adaptive Lasso and Its Oracle Properties) defined
$$\widehat{\beta}_\lambda^{\text{a-lasso}} \in \underset{\beta \in \mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^p \frac{|\beta_j|}{|\widehat{\beta}_{\lambda,j}^{\gamma\text{-lasso}}|}\right\}$$
where $\widehat{\beta}_\lambda^{\gamma\text{-lasso}} = \Pi_{X_{s(\lambda)}}y$, where $s(\lambda)$ is the set of non-null components of $\widehat{\beta}_\lambda^{\text{lasso}}$.
See libraries lqa or lassogrp.
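The adaptive weights can also be mimicked with glmnet's penalty.factor argument; a minimal sketch, using a ridge fit for the initial estimates (illustrative data):

    library(glmnet)
    X <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
    init <- as.numeric(coef(cv.glmnet(X, y, alpha = 0), s = "lambda.min")[-1])
    w <- 1 / abs(init)                 # adaptive weights
    fit <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
    coef(fit, s = "lambda.1se")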
From LASSO to Group Lasso
Assume that the variables $x \in \mathbb{R}^p$ can be grouped in $L$ subgroups, $x = (x_1, \cdots, x_L)$, where $\dim[x_l] = p_l$.
Yuan & Lin (2006, Model Selection and Estimation in Regression with Grouped Variables) defined, for some positive definite $n_l \times n_l$ matrices $K_l$,
$$\widehat{\beta}_\lambda^{\text{g-lasso}} \in \underset{\beta \in \mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{\beta_l^TK_l\beta_l}\right\}$$
or, if $K_l = p_lI$,
$$\widehat{\beta}_\lambda^{\text{g-lasso}} \in \underset{\beta \in \mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{p_l}\,\|\beta_l\|_2\right\}$$
See library gglasso.
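A minimal sketch with gglasso, assuming its glmnet-style API; the group labels below are arbitrary (in practice they encode the actual grouping of the columns of X):

    library(gglasso)
    X <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
    grp <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)   # one group label per column
    cv <- cv.gglasso(X, y, group = grp)
    coef(cv, s = "lambda.min")               # zeroed out group by group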
From LASSO to Sparse-Group Lasso
Assume again that the variables $x \in \mathbb{R}^p$ can be grouped in $L$ subgroups, $x = (x_1, \cdots, x_L)$, where $\dim[x_l] = p_l$.
Simon et al. (2013, A Sparse-Group LASSO) defined, for some positive definite $n_l \times n_l$ matrices $K_l$,
$$\widehat{\beta}_{\lambda,\mu}^{\text{sg-lasso}} \in \underset{\beta \in \mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{\beta_l^TK_l\beta_l} + \mu\|\beta\|_1\right\}$$
See library SGL.