

Machine Learning Using Python

Lesson 2: Regularised Linear Models

Marcel Scharth

The University of Sydney Business School


Lesson 2: Regularised Linear Models

1. Introduction

2. Ridge Regression

3. The Lasso

4. Elastic Net

5. Which method to use?

6. Regularised risk minimisation


Introduction


Linear Methods for Regression

In this lesson we again focus on the linear regression model for prediction, but now move beyond least squares estimation to consider other training methods.

The motivation for studying these methods is that using many predictors in a linear regression model typically leads to overfitting. We will therefore accept some bias in order to reduce variance.


Linear regression

The linear regression model is a special case of the additive error model

$$Y = f(x) + \varepsilon,$$

where we specify a linear regression function of the form

$$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p.$$
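To make the notation concrete, here is a minimal sketch of the linear prediction function in NumPy; the coefficient values are illustrative, not from the lesson.

```python
# A minimal sketch of f(x) = beta0 + beta1*x1 + ... + betap*xp in NumPy;
# the coefficient values below are illustrative.
import numpy as np

beta0 = 1.0                         # intercept
beta = np.array([0.5, -2.0, 3.0])   # slope coefficients beta_1, ..., beta_p

def f(x):
    """Linear prediction function: beta0 plus the dot product of x and beta."""
    return beta0 + x @ beta

x_new = np.array([1.0, 0.0, 2.0])
print(f(x_new))  # 1.0 + 0.5*1 - 2.0*0 + 3.0*2 = 7.5
```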


Review: OLS

In the OLS method, we select the coefficient values that minimise the residual sum of squares

$$\hat{\beta}^{\text{ols}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$

The solution has the matrix formula

$$\hat{\beta}^{\text{ols}} = (X^T X)^{-1} X^T y.$$
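A minimal sketch of the matrix formula on simulated data; the data-generating values are illustrative, and np.linalg.solve is used rather than an explicit inverse for numerical stability.

```python
# OLS via the normal equations (X^T X) beta = X^T y on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # prepend intercept column
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Solving the normal equations; np.linalg.lstsq is an alternative that
# avoids forming X^T X explicitly.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)  # close to beta_true
```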


Why are we not satisfied with OLS?

Prediction accuracy. Low bias (if the linearity assumption is approximately correct), but potentially high variance. We can improve performance by setting some coefficients to zero or shrinking them.

Interpretability. A regression estimated with too many predictors and high variance is hard or impossible to interpret. In order to understand the big picture, we are willing to sacrifice some of the small details.


Regularised linear models (key concept)

Regularisation (shrinkage) methods fit a model involving all the p predictors, but shrink the coefficients towards zero relative to OLS.

Depending on the type of regularisation, some estimated coefficients may be zero, in which case we say that the method also performs variable selection.


Ridge Regression


Ridge regression (key concept)

The ridge regression method solves the penalised estimation problem

$$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

for a tuning parameter λ.

The penalty term has the effect of shrinking the coefficients relative to OLS. We refer to this procedure as ℓ2 regularisation because of the form of the penalty.
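A minimal sketch of ridge regression with scikit-learn, assuming simulated data; note that scikit-learn names the tuning parameter alpha rather than λ.

```python
# A minimal ridge fit with scikit-learn; the data and penalty are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.normal(size=100)

ridge = Ridge(alpha=1.0)  # alpha plays the role of lambda
ridge.fit(X, y)
print(ridge.coef_)  # coefficients shrunk towards zero relative to OLS
```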


Ridge regression

Penalised estimation problem:

$$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The hyperparameter λ controls the strength of regularisation. A higher λ (stronger penalty) leads to smaller coefficients relative to OLS. We select λ through cross-validation.
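A sketch of selecting λ by cross validation with scikit-learn's RidgeCV; the grid of candidate penalties and the data are illustrative.

```python
# Selecting the ridge penalty by cross validation; the grid is illustrative.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.normal(size=100)

alphas = np.logspace(-3, 3, 50)          # candidate values of lambda
ridge_cv = RidgeCV(alphas=alphas, cv=5)  # 5-fold cross validation
ridge_cv.fit(X, y)
print(ridge_cv.alpha_)                   # the selected penalty
```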


Ridge regression: practical details

1. We do not penalise the intercept.

2. The method is not invariant to the scale of the inputs. We standardise the predictors before solving the minimisation problem.
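A minimal sketch of both details in scikit-learn: Ridge leaves the intercept unpenalised by default, and a Pipeline with StandardScaler standardises the predictors before the penalised fit. The data is illustrative.

```python
# Standardising predictors before ridge, using a Pipeline so that the
# scaling is learned from the training data only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 0.1, 1.0, 100.0])  # mixed scales
y = X @ np.array([3.0, 0.2, 5.0, 1.0, 0.01]) + rng.normal(size=100)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # intercept not penalised
model.fit(X, y)
print(model.named_steps["ridge"].coef_)  # coefficients on the standardised scale
```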


Ridge Regression: properties

• The ridge regression algorithm shrinks all coefficients relative to OLS for λ > 0.

• Furthermore, it shrinks the coefficients of correlated predictors toward each other. In the extreme case where two predictors are perfectly positively correlated, their coefficients will become identical for a sufficiently large penalty λ.


Illustration: Ridge coefficient profiles


Ridge regression

The ridge estimator has an equivalent formulation as a constrained minimisation problem

$$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq t,$$

for some t > 0.

This will be relevant for comparing it with the Lasso.


The Lasso


The Lasso

The Lasso (least absolute shrinkage and selection operator) method solves the penalised estimation problem

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

for a tuning parameter λ.

The Lasso therefore performs ℓ1 regularisation.
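A minimal sketch of the lasso with scikit-learn on simulated data. Note that scikit-learn's Lasso minimises the residual sum of squares scaled by 1/(2N) plus alpha times the ℓ1 penalty, so its alpha is a rescaled version of λ above.

```python
# A minimal lasso fit with scikit-learn; the data and penalty are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(size=100)  # only two true predictors

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # many coefficients are exactly zero
```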


The Lasso

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Exactly as before, the hyperparameter λ controls the strength of regularisation. A higher λ (stronger penalty) leads to smaller coefficients relative to OLS. We select λ through cross-validation.


The Lasso: properties

• The lasso shrinks the regression coefficients towards zero relative to OLS. However, the nature of this shrinkage is somewhat different compared to a ridge regression algorithm, as we will illustrate below.

• In addition to shrinkage, the lasso also performs variable selection. With λ sufficiently large, some estimated coefficients will be exactly zero, leading to sparse models. This is a key difference from ridge.
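A sketch of the variable selection property, assuming simulated sparse data: LassoCV picks λ by cross validation, and we then inspect which coefficients survive.

```python
# Cross-validated lasso and a check of the resulting sparsity pattern.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(size=200)

lasso_cv = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso_cv.coef_)  # indices of nonzero coefficients
print(lasso_cv.alpha_)  # the selected penalty
print(selected)         # the predictors kept in the model
```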


The Lasso

The equivalent formulation of the lasso as a constrained minimisation problem is

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq t,$$

for some t > 0.


The Lasso: variable selection property

Estimation picture for the lasso (left) and ridge regression (right). [Figure not included in the transcript.]


Illustration: Lasso coefficient profiles


Elastic Net


Elastic Net

The elastic net is a compromise between ridge regression and the lasso:

$$\hat{\beta}^{\text{EN}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right),$$

for tuning parameters λ ≥ 0 and 0 < α < 1.

The elastic net performs variable selection like the lasso, and shrinks the coefficients of correlated predictors toward each other like ridge regression.
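A sketch of the elastic net with scikit-learn on illustrative data. Caution: scikit-learn parameterises the penalty with l1_ratio, the weight on the ℓ1 term, which corresponds roughly to 1 − α in the notation above.

```python
# Cross-validated elastic net; the l1_ratio grid and data are illustrative.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 + X[:, 1] * 3.0 + rng.normal(size=200)

enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
print(enet.alpha_, enet.l1_ratio_)  # the selected penalty and l1/l2 mix
print(enet.coef_)                   # some coefficients set exactly to zero
```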


Which method to use?


Which method to use?

• Recall the no free lunch theorem: neither ridge regression nor the lasso universally outperforms the other. The choice of method should be data-driven and problem-specific, as in the sketch after this list.

• In general terms, we can expect the lasso to perform better when a small subset of predictors have important coefficients, while the remaining predictors have small or zero coefficients (sparse problems).

• Ridge regression will tend to perform better when the predictors all have similar importance.

• The lasso may have better interpretability since it performs variable selection.
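A sketch of such a data-driven comparison, assuming simulated sparse data and fixed illustrative penalties: ridge and the lasso are compared by cross-validated mean squared error.

```python
# Comparing ridge and lasso by cross-validated mean squared error.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(size=200)  # a sparse problem

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(type(model).__name__, -scores.mean())  # lower MSE is better
```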


Regularised risk minimisation


All of machine learning in one equation

• Regularised linear methods are special cases of a principle known as regularised risk minimisation.

• The regularised risk minimisation framework summarises the big picture for a large number of machine learning algorithms.


Empirical risk minimisation

Let D = {(y_i, x_i)}, i = 1, …, N, be training data and f(·; θ) a prediction function that depends on the parameter vector θ. The empirical risk minimisation principle estimates θ as

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i; \theta))$$

Example: least squares estimation.
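A sketch of this example: minimising the empirical risk under a squared error loss recovers least squares numerically. The use of scipy.optimize.minimize and the simulated data are illustrative choices, not part of the slides.

```python
# Empirical risk minimisation with a squared error loss on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

def empirical_risk(theta):
    # (1/N) * sum over the training data of L(y_i, f(x_i; theta))
    return np.mean((y - X @ theta) ** 2)

theta_hat = minimize(empirical_risk, x0=np.zeros(3)).x
print(theta_hat)  # numerically close to the least squares solution
```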


Regularised risk minimisation

Minimising the empirical risk will typically lead to overfitting. In regularised risk minimisation, we estimate the model by solving the optimisation problem

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \left[ \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i; \theta)) \right] + \lambda\, C(f(\cdot; \theta)),$$

where C(f(·; θ)) measures the complexity of the prediction function and λ is a complexity penalty.
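A sketch extending the previous one: adding an ℓ2 complexity penalty to the same empirical risk yields a regularised risk whose minimiser is a ridge-type estimator (no intercept here, for simplicity). The value of λ and the data are illustrative.

```python
# Regularised risk minimisation: empirical risk plus an l2 complexity penalty.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
lam = 0.5  # the complexity penalty lambda (illustrative)

def regularised_risk(theta):
    empirical_risk = np.mean((y - X @ theta) ** 2)  # average loss on the data
    complexity = np.sum(theta ** 2)                 # C(f): sum of squared coefficients
    return empirical_risk + lam * complexity

theta_hat = minimize(regularised_risk, x0=np.zeros(3)).x
print(theta_hat)  # shrunk towards zero relative to the unpenalised solution
```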
