

Machine Learning Using Python

Lesson 2: Regularised Linear Models

Marcel Scharth

The University of Sydney Business School


Lesson 2: Regularised Linear Models

1. Introduction

2. Ridge Regression

3. The Lasso

4. Elastic Net

5. Which method to use?

6. Regularised risk minimisation


Introduction


Linear Methods for Regression

In this lesson we again focus on the linear regression model for prediction, but now move beyond least squares estimation to consider other training methods.

The motivation for studying these methods is that using many predictors in a linear regression model typically leads to overfitting. We will therefore accept some bias in order to reduce variance.


Linear regression

The linear regression model is a special case of the additive error model

$$Y = f(x) + \varepsilon,$$

where we specify a linear regression function of the form

$$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p.$$
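To make the notation concrete, here is a minimal sketch of the linear prediction function in NumPy; the coefficient values are illustrative, not from the lesson.

```python
# A minimal sketch of f(x) = beta0 + beta1*x1 + ... + betap*xp in NumPy;
# the coefficient values below are illustrative.
import numpy as np

beta0 = 1.0                         # intercept
beta = np.array([0.5, -2.0, 3.0])   # slope coefficients beta_1, ..., beta_p

def f(x):
    """Linear prediction function: beta0 plus the dot product of x and beta."""
    return beta0 + x @ beta

x_new = np.array([1.0, 0.0, 2.0])
print(f(x_new))  # 1.0 + 0.5*1 - 2.0*0 + 3.0*2 = 7.5
```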


Review: OLS

In the OLS method, we select the coefficient values that minimise the residual sum of squares

$$\hat{\beta}^{\text{ols}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$

The solution has the matrix formula

$$\hat{\beta}^{\text{ols}} = (X^T X)^{-1} X^T y.$$
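A minimal sketch of the matrix formula on simulated data; the data-generating values are illustrative, and np.linalg.solve is used rather than an explicit inverse for numerical stability.

```python
# OLS via the normal equations (X^T X) beta = X^T y on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # prepend intercept column
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Solving the normal equations; np.linalg.lstsq is an alternative that
# avoids forming X^T X explicitly.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)  # close to beta_true
```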


Why are we not satisfied with OLS?

Prediction accuracy. Low bias (if the linearity assumption is approximately correct), but potentially high variance. We can improve performance by setting some coefficients to zero or shrinking them.

Interpretability. A regression estimated with too many predictors and high variance is hard or impossible to interpret. In order to understand the big picture, we are willing to sacrifice some of the small details.


Regularised linear models (key concept)

Regularisation (shrinkage) methods fit a model involving all the p predictors, but shrink the coefficients towards zero relative to OLS.

Depending on the type of regularisation, some estimated coefficients may be zero, in which case we say that the method also performs variable selection.


Ridge Regression


Ridge regression (key concept)

The ridge regression method solves the penalised estimation problem

$$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

for a tuning parameter λ.

The penalty term has the effect of shrinking the coefficients relative to OLS. We refer to this procedure as ℓ2 regularisation because of the form of the penalty.
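A minimal sketch of ridge regression with scikit-learn, assuming simulated data; note that scikit-learn names the tuning parameter alpha rather than λ.

```python
# A minimal ridge fit with scikit-learn; the data and penalty are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.normal(size=100)

ridge = Ridge(alpha=1.0)  # alpha plays the role of lambda
ridge.fit(X, y)
print(ridge.coef_)  # coefficients shrunk towards zero relative to OLS
```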


Ridge regression

Penalised estimation problem:

$$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The hyperparameter λ controls the strength of regularisation. A higher λ (stronger penalty) leads to smaller coefficients relative to OLS. We select λ through cross-validation.
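A sketch of selecting λ by cross validation with scikit-learn's RidgeCV; the grid of candidate penalties and the data are illustrative.

```python
# Selecting the ridge penalty by cross validation; the grid is illustrative.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.normal(size=100)

alphas = np.logspace(-3, 3, 50)          # candidate values of lambda
ridge_cv = RidgeCV(alphas=alphas, cv=5)  # 5-fold cross validation
ridge_cv.fit(X, y)
print(ridge_cv.alpha_)                   # the selected penalty
```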


Ridge regression: practical details

1. We do not penalise the intercept.

2. The method is not invariant to the scale of the inputs. We standardise the predictors before solving the minimisation problem.
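A minimal sketch of both details in scikit-learn: Ridge leaves the intercept unpenalised by default, and a Pipeline with StandardScaler standardises the predictors before the penalised fit. The data is illustrative.

```python
# Standardising predictors before ridge, using a Pipeline so that the
# scaling is learned from the training data only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 0.1, 1.0, 100.0])  # mixed scales
y = X @ np.array([3.0, 0.2, 5.0, 1.0, 0.01]) + rng.normal(size=100)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # intercept not penalised
model.fit(X, y)
print(model.named_steps["ridge"].coef_)  # coefficients on the standardised scale
```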


Ridge Regression: properties

• The ridge regression algorithm shrinks all coefficients relative to OLS for λ > 0.

• Furthermore, it shrinks the coefficients of correlated predictors toward each other. In the extreme case where two predictors are perfectly positively correlated, their coefficients will become identical for a sufficiently large penalty λ.


Illustration: Ridge coefficient profiles


Ridge regression

The ridge estimator has an equivalent formulation as a constrained minimisation problem

$$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq t,$$

for some t > 0.

This will be relevant for comparing it with the Lasso.


The Lasso


The Lasso

The Lasso (least absolute shrinkage and selection operator) method solves the penalised estimation problem

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

for a tuning parameter λ.

The Lasso therefore performs ℓ1 regularisation.
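A minimal sketch of the lasso with scikit-learn on simulated data. Note that scikit-learn's Lasso minimises the residual sum of squares scaled by 1/(2N) plus alpha times the ℓ1 penalty, so its alpha is a rescaled version of λ above.

```python
# A minimal lasso fit with scikit-learn; the data and penalty are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(size=100)  # only two true predictors

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # many coefficients are exactly zero
```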


The Lasso

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Exactly as before, the hyperparameter λ controls the strength of regularisation. A higher λ (stronger penalty) leads to smaller coefficients relative to OLS. We select λ through cross-validation.


The Lasso: properties

• The lasso shrinks the regression coefficients towards zero relative to OLS. However, the nature of this shrinkage is somewhat different compared to a ridge regression algorithm, as we will illustrate below.

• In addition to shrinkage, the lasso also performs variable selection. With λ sufficiently large, some estimated coefficients will be exactly zero, leading to sparse models. This is a key difference from ridge.
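A sketch of the variable selection property, assuming simulated sparse data: LassoCV picks λ by cross validation, and we then inspect which coefficients survive.

```python
# Cross-validated lasso and a check of the resulting sparsity pattern.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(size=200)

lasso_cv = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso_cv.coef_)  # indices of nonzero coefficients
print(lasso_cv.alpha_)  # the selected penalty
print(selected)         # the predictors kept in the model
```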


The Lasso

The equivalent formulation of the lasso as a constrained minimisation problem is

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq t,$$

for some t > 0.


The Lasso: variable selection property

Estimation picture for the lasso (left) and ridge regression (right). [Figure not included in the transcript.]


Illustration: Lasso coefficient profiles


Elastic Net


Elastic Net

The elastic net is a compromise between ridge regression and the lasso:

$$\hat{\beta}^{\text{EN}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right),$$

for tuning parameters λ ≥ 0 and 0 < α < 1.

The elastic net performs variable selection like the lasso, and shrinks the coefficients of correlated predictors toward each other like ridge regression.
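A sketch of the elastic net with scikit-learn on illustrative data. Caution: scikit-learn parameterises the penalty with l1_ratio, the weight on the ℓ1 term, which corresponds roughly to 1 − α in the notation above.

```python
# Cross-validated elastic net; the l1_ratio grid and data are illustrative.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 + X[:, 1] * 3.0 + rng.normal(size=200)

enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
print(enet.alpha_, enet.l1_ratio_)  # the selected penalty and l1/l2 mix
print(enet.coef_)                   # some coefficients set exactly to zero
```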


Which method to use?


Which method to use?

• Recall the no free lunch theorem: neither ridge regression nor the lasso universally outperforms the other. The choice of method should be data-driven and problem-specific, as in the sketch after this list.

• In general terms, we can expect the lasso to perform better when a small subset of predictors have important coefficients, while the remaining predictors have small or zero coefficients (sparse problems).

• Ridge regression will tend to perform better when the predictors all have similar importance.

• The lasso may have better interpretability since it performs variable selection.
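A sketch of such a data-driven comparison, assuming simulated sparse data and fixed illustrative penalties: ridge and the lasso are compared by cross-validated mean squared error.

```python
# Comparing ridge and lasso by cross-validated mean squared error.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(size=200)  # a sparse problem

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(type(model).__name__, -scores.mean())  # lower MSE is better
```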


Regularised risk minimisation


All of machine learning in one equation

• Regularised linear methods are special cases of a principle known as regularised risk minimisation.

• The regularised risk minimisation framework summarises the big picture for a large number of machine learning algorithms.


Empirical risk minimisation

Let D = {(y_i, x_i)}, i = 1, …, N, be training data and f(·; θ) a prediction function that depends on the parameter vector θ. The empirical risk minimisation principle estimates θ as

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i; \theta))$$

Example: least squares estimation.
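A sketch of this example: minimising the empirical risk under a squared error loss recovers least squares numerically. The use of scipy.optimize.minimize and the simulated data are illustrative choices, not part of the slides.

```python
# Empirical risk minimisation with a squared error loss on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

def empirical_risk(theta):
    # (1/N) * sum over the training data of L(y_i, f(x_i; theta))
    return np.mean((y - X @ theta) ** 2)

theta_hat = minimize(empirical_risk, x0=np.zeros(3)).x
print(theta_hat)  # numerically close to the least squares solution
```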


Regularised risk minimisation

Minimising the empirical risk will typically lead to overfitting. In regularised risk minimisation, we estimate the model by solving the optimisation problem

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \left[ \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i; \theta)) \right] + \lambda\, C(f(\cdot; \theta)),$$

where C(f(·; θ)) measures the complexity of the prediction function and λ is a complexity penalty.
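A sketch extending the previous one: adding an ℓ2 complexity penalty to the same empirical risk yields a regularised risk whose minimiser is a ridge-type estimator (no intercept here, for simplicity). The value of λ and the data are illustrative.

```python
# Regularised risk minimisation: empirical risk plus an l2 complexity penalty.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
lam = 0.5  # the complexity penalty lambda (illustrative)

def regularised_risk(theta):
    empirical_risk = np.mean((y - X @ theta) ** 2)  # average loss on the data
    complexity = np.sum(theta ** 2)                 # C(f): sum of squared coefficients
    return empirical_risk + lam * complexity

theta_hat = minimize(regularised_risk, x0=np.zeros(3)).x
print(theta_hat)  # shrunk towards zero relative to the unpenalised solution
```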
