Ridge Regression - University of Washington



CSE 446: Machine Learning
Emily Fox, University of Washington
January 13, 2017

©2017 Emily Fox

Ridge Regression: Regulating overfitting when using many features


Training, true, & test error vs. model complexity

[Figure: training, true, and test error curves plotted against model complexity, alongside example fits on (x, y) data. Overfitting if: training error is small but true error is large.]


Error vs. amount of data

[Figure: error vs. # data points in the training set.]


Overfitting of polynomial regression


Flexibility of high-order polynomials

[Figure: price ($) vs. square feet (sq.ft.), showing data (x, y) and the fitted function fŵ for a high-order polynomial.]

yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi    → OVERFIT


Symptom of overfitting

Often, overfitting is associated with very large estimated parameters ŵ.
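As an illustration (not from the slides), a minimal NumPy sketch on made-up data: fit a low- and a high-order polynomial to a handful of noisy points and compare the largest coefficient magnitudes, which are typically far larger for the near-interpolating fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 10))                        # 10 synthetic inputs
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)    # noisy targets

for degree in (2, 9):
    w_hat = np.polyfit(x, y, degree)                      # least squares polynomial fit
    print(degree, np.abs(w_hat).max())                    # max |coefficient|, typically huge at degree 9
```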


Overfitting of linear regression models more generically


Overfitting with many features

Overfitting is not unique to polynomial regression; it also arises with lots of inputs (d large) or, more generically, lots of features (D large):

yi = Σ_{j=0}^{D} wj hj(xi) + εi

- Square feet

- # bathrooms

- # bedrooms

- Lot size

- Year built

- …
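To make the generic model concrete, here is a minimal sketch (not from the slides) of assembling the feature matrix whose columns are h0(x) = 1 plus the raw house attributes listed above; the numbers are made up.

```python
import numpy as np

# hypothetical rows: [square feet, # bathrooms, # bedrooms, lot size, year built]
X = np.array([[1180, 1.00, 3,  5650, 1955],
              [2570, 2.25, 4,  7242, 1951],
              [ 770, 1.00, 2, 10000, 1933]], dtype=float)

# H[i, j] = hj(xi), with h0(x) = 1 (constant feature) and hj(x) = x[j] for j >= 1
H = np.column_stack([np.ones(X.shape[0]), X])
```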


How does # of observations influence overfitting?

Few observations (N small) → rapidly overfit as model complexity increases

Many observations (N very large) → harder to overfit

[Figure: price ($) vs. square feet (sq.ft.) with the fitted function fŵ, for a small dataset and a large dataset.]


How does # of inputs influence overfitting?

1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting

[Figure: price ($) vs. square feet (sq.ft.); covering even this single input with representative data is HARD.]


How does # of inputs influence overfitting?

d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …):

Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting

[Figure: price ($) as a function of x[1] (square feet) and x[2]; covering a d-dimensional input space with representative data is MUCH HARDER.]


Adding term to cost-of-fit to prefer small coefficients


Desired total cost format

Want to balance:

i.  How well function fits data

ii.  Magnitude of coefficients

Total cost = measure of fit + measure of magnitude of coefficients

A small measure of fit means a good fit to the training data; a small measure of magnitude of coefficients means the model is not overfit. The total cost balances these two measures of the quality of the fit.


Measure of fit to training data

RSS(w) = Σ_{i=1}^{N} (yi - h(xi)^T w)²

[Figure: price ($) vs. x[1] (square feet, sq.ft.) and x[2] (# bathrooms).]


Measure of magnitude of regression coefficients

What summary # is indicative of the size of the regression coefficients?

-  Sum?

-  Sum of absolute value?

-  Sum of squares (L2 norm)




Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients
           = RSS(w) + ||w||₂²


Consider resulting objective

What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

where the tuning parameter λ sets the balance of fit and magnitude?

If λ=0: the objective reduces to RSS(w), i.e., the standard least squares fit.
If λ=∞: the only way to keep the cost finite is ŵ = 0.
If λ in between: the solution lies between these extremes, with 0 ≤ ||ŵ||₂² ≤ ||ŵLS||₂² (ŵLS = the least squares fit).


This objective is called ridge regression (a.k.a. L2 regularization): minimize RSS(w) + λ||w||₂², with tuning parameter λ = balance of fit and magnitude.


Bias-variance tradeoff

Large λ: high bias, low variance (e.g., ŵ = 0 for λ = ∞)

Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of a high-order polynomial for λ = 0)

In essence, λ controls model complexity.


Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using ridge regression?

Will consider a few settings of λ …


Coefficient path

[Figure: ridge coefficient paths; coefficients ŵj plotted against λ for the features bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront.]
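A minimal sketch (not part of the lecture) of how such a coefficient path can be traced with scikit-learn's Ridge estimator; the data here are synthetic stand-ins for the eight house features in the legend, and scikit-learn names the tuning parameter alpha rather than λ.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                     # stand-in for the 8 house features
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=100)

lambdas = np.logspace(-2, 5, 30)
paths = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])
# paths has shape (30, 8): each column traces one ŵj, shrinking toward 0 as λ grows
```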


Fitting the ridge regression model (for given λ value)


Step 1: Rewrite total cost in matrix notation


Recall matrix form of RSS

Model for all N observations together

y = Hw + ε


RSS(w) = Σ_{i=1}^{N} (yi - h(xi)^T w)²
       = (y - Hw)^T (y - Hw)


Rewrite magnitude of coefficients in vector notation

||w||₂² = w0² + w1² + w2² + … + wD² = w^T w


Putting it all together


In matrix form, ridge regression cost is:

RSS(w) + λ||w||₂² = (y - Hw)^T (y - Hw) + λ w^T w
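As a quick cross-check (not in the slides), this cost is a few lines of NumPy; H, y, w, and lam stand for the feature matrix, targets, coefficients, and λ.

```python
import numpy as np

def ridge_cost(H, y, w, lam):
    """Ridge cost: (y - Hw)^T (y - Hw) + lam * w^T w."""
    r = y - H @ w                 # residual vector
    return r @ r + lam * (w @ w)
```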


Step 2: Compute the gradient


Gradient of ridge regression cost

∇[RSS(w) + λ||w||₂²] = ∇[(y - Hw)^T (y - Hw) + λ w^T w]
                     = ∇[(y - Hw)^T (y - Hw)] + λ ∇[w^T w]
                     = -2H^T (y - Hw) + λ(2w)

Why is ∇[(y - Hw)^T (y - Hw)] = -2H^T (y - Hw)? Same as in the RSS (least squares) derivation. Why is ∇[w^T w] = 2w? By analogy to the 1d case: w^T w is analogous to w², and the derivative of w² is 2w.
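A minimal sketch (not from the slides) of this gradient, with a finite-difference spot check on a tiny random problem; the sizes and the value of λ are arbitrary.

```python
import numpy as np

def ridge_grad(H, y, w, lam):
    """Gradient of (y - Hw)^T (y - Hw) + lam * w^T w."""
    return -2 * H.T @ (y - H @ w) + 2 * lam * w

rng = np.random.default_rng(1)
H, y, w, lam = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3), 0.7
cost = lambda w: (y - H @ w) @ (y - H @ w) + lam * w @ w

eps = 1e-6
numeric = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(numeric, ridge_grad(H, y, w, lam), atol=1e-4)  # analytic gradient matches
```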


Step 3, Approach 1: Set the gradient = 0


Ridge closed-form solution

∇cost(w) = -2H^T (y - Hw) + 2λIw = 0

Solve for w: -2H^T y + 2H^T Hw + 2λIw = 0  ⟹  (H^T H + λI)w = H^T y  ⟹  ŵ = (H^T H + λI)^{-1} H^T y


Interpreting ridge closed-form solution

ŵ = (H^T H + λI)^{-1} H^T y

If λ=0: ŵ = (H^T H)^{-1} H^T y, the standard least squares solution.
If λ=∞: ŵ = 0.


Recall discussion on previous closed-form solution

ŵ = (H^T H)^{-1} H^T y

Invertible if: # linearly independent observations N > D (in general)
Complexity of inverse: O(D³)


Discussion of ridge closed-form solution

ŵ = (H^T H + λI)^{-1} H^T y

Invertible if: always, provided λ > 0, even if N < D
Complexity of inverse: O(D³) … big for large D!
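A minimal NumPy sketch (not from the slides) of this closed form; it solves the linear system rather than forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    """Solve (H^T H + lam * I) w = H^T y for the ridge estimate."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
```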


Step 3, Approach 2: Gradient descent


Elementwise ridge regression gradient descent algorithm

∇cost(w) = -2H^T (y - Hw) + 2λw

Update to jth feature weight:

wj(t+1) ← wj(t) - η [ -2 Σ_{i=1}^{N} hj(xi)(yi - ŷi(w(t))) + 2λ wj(t) ]


Recall previous algorithm

init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = -2 Σ_{i=1}^{N} hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← wj(t) - η partial[j]
  t ← t + 1


Summary of ridge regression algorithm

init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = -2 Σ_{i=1}^{N} hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← (1 - 2ηλ) wj(t) - η partial[j]
  t ← t + 1
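A minimal NumPy translation of this algorithm (not from the slides); the step size η, tolerance ε, and iteration cap are illustrative choices, and the update is vectorized over all coordinates at once.

```python
import numpy as np

def ridge_gradient_descent(H, y, lam, eta=1e-3, epsilon=1e-6, max_iter=100_000):
    """Gradient descent on RSS(w) + lam * ||w||^2 using the (1 - 2*eta*lam) shrinkage form."""
    w = np.zeros(H.shape[1])                    # init w = 0
    for _ in range(max_iter):
        partial = -2 * H.T @ (y - H @ w)        # gradient of RSS at w(t)
        if np.linalg.norm(partial) < epsilon:   # stop once the RSS gradient is tiny
            break
        w = (1 - 2 * eta * lam) * w - eta * partial
    return w
```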


How to choose λ


The regression/ML workflow

1. Model selection: choose the tuning parameter λ controlling model complexity

2. Model assessment: having selected a model, assess its generalization error


Hypothetical implementation

1. Model selection: for each considered λ:
   i. Estimate parameters ŵλ on training data
   ii. Assess performance of ŵλ on test data
   iii. Choose λ* to be the λ with lowest test error
2. Model assessment: compute the test error of ŵλ* (the fitted model for the selected λ*) to approximate generalization error


[Data split: Training set | Test set]

Overly optimistic!


Hypothetical implementation

Issue: this is just like fitting ŵ and assessing its performance on the same training data.
• λ* was selected to minimize test error (i.e., λ* was fit on the test data)
• If the test data are not representative of the whole world, then ŵλ* will typically perform worse than the test error indicates

[Data split: Training set | Test set]


Practical implementation

Solution: create two "test" sets!
1. Select λ* such that ŵλ* minimizes error on the validation set
2. Approximate the generalization error of ŵλ* using the test set

[Data split: Training set | Validation set | Test set]


[Data split: Training set | Validation set | Test set. Training set: fit ŵλ. Validation set: test performance of ŵλ to select λ*. Test set: assess generalization error of ŵλ*.]


Typical splits

Training set | Validation set | Test set
80% | 10% | 10%
50% | 25% | 25%
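A minimal end-to-end sketch (not from the lecture) of this workflow on synthetic data with an 80%/10%/10% split; the λ grid, RSS error metric, and the closed-form fit are illustrative choices.

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def rss(H, y, w):
    r = y - H @ w
    return r @ r

rng = np.random.default_rng(2)
H = np.column_stack([np.ones(200), rng.normal(size=(200, 5))])   # constant feature + 5 inputs
y = H @ rng.normal(size=6) + rng.normal(scale=0.3, size=200)

idx = rng.permutation(200)                      # 80% / 10% / 10% split
train, valid, test = idx[:160], idx[160:180], idx[180:]

best_lam, best_err = None, np.inf
for lam in np.logspace(-3, 3, 13):
    w = ridge_closed_form(H[train], y[train], lam)   # fit ŵλ on the training set
    err = rss(H[valid], y[valid], w)                 # validation error selects λ*
    if err < best_err:
        best_lam, best_err = lam, err

w_star = ridge_closed_form(H[train], y[train], best_lam)
test_error = rss(H[test], y[test], w_star)           # approximate generalization error
```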


How to handle the intercept


OPTIONAL


Recall multiple regression model

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ_{j=0}^{D} wj hj(xi) + εi

feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x[1]
feature 3 = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]


If constant feature…

yi = w0 + w1 h1(xi) + … + wD hD(xi) + εi

In matrix notation for N observations: y = Hw + ε, where the first column of H is a column of 1s (the constant feature multiplying w0).


Do we penalize intercept?

Standard ridge regression cost: RSS(w) + λ||w||₂², where λ is the strength of the penalty. This encourages the intercept w0 to also be small. Do we want a small intercept? Conceptually, the size of the intercept is not indicative of overfitting…


Option 1: Don’t penalize intercept

Modified ridge regression cost:

RSS(w0, wrest) + λ||wrest||₂²

How to implement this in practice?


Option 1: Don’t penalize intercept – Closed-form solution –

ŵ = (H^T H + λ Imod)^{-1} H^T y

where Imod is the identity matrix with the entry corresponding to w0 set to 0, so the intercept is not shrunk.


Option 1: Don’t penalize intercept – Gradient descent algorithm –

while ||∇RSS(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = -2 Σ_{i=1}^{N} hj(xi)(yi - ŷi(w(t)))
    if j == 0:
      w0(t+1) ← w0(t) - η partial[j]
    else:
      wj(t+1) ← (1 - 2ηλ) wj(t) - η partial[j]
  t ← t + 1
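A minimal NumPy sketch of this variant (not from the slides), mirroring the earlier gradient descent sketch but exempting w0 from the shrinkage; column 0 of H is assumed to be the constant feature.

```python
import numpy as np

def ridge_gd_unpenalized_intercept(H, y, lam, eta=1e-3, epsilon=1e-6, max_iter=100_000):
    """Gradient descent on RSS(w) + lam * ||w_rest||^2, leaving the intercept w0 unpenalized."""
    w = np.zeros(H.shape[1])
    shrink = np.full(H.shape[1], 1 - 2 * eta * lam)
    shrink[0] = 1.0                             # do not shrink the intercept coordinate
    for _ in range(max_iter):
        partial = -2 * H.T @ (y - H @ w)        # gradient of RSS
        if np.linalg.norm(partial) < epsilon:
            break
        w = shrink * w - eta * partial
    return w
```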


Option 2: Center data first

If the data are first centered about 0, then favoring a small intercept is not so worrisome.

Step 1: Transform y to have 0 mean

Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
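One way to realize these two steps, as a small illustrative sketch (not from the slides): fit ridge on the centered targets and add the mean back at prediction time.

```python
import numpy as np

def ridge_centered(H, y, lam):
    """Step 1: center y; Step 2: run ridge regression as normal on the centered targets."""
    y_mean = y.mean()
    y_centered = y - y_mean
    w = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y_centered)
    return w, y_mean        # predict with H_new @ w + y_mean
```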



Summary for ridge regression


What you can do now…
•  Describe what happens to the magnitude of estimated coefficients when a model is overfit

•  Motivate form of ridge regression cost function

•  Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied

•  Interpret coefficient path plot

•  Estimate ridge regression parameters:
-  In closed form
-  Using an iterative gradient descent algorithm

•  Use a validation set to select the ridge regression tuning parameter λ
