Regularized Linear Regression, Nonlinear Regression (Aarti Singh, Machine Learning 10-315, Oct 30, 2019)


Page 1:

Regularized Linear Regression, Nonlinear Regression

Aarti Singh

Machine Learning 10-315, Oct 30, 2019

Page 2:

Least Squares and M(C)LE


Intuition: Signal plus (zero-mean) Noise model

The least squares estimate is the same as the maximum conditional likelihood estimate under a Gaussian noise model!

Signal plus noise model: $Y = X\beta^* + \epsilon$, with zero-mean Gaussian noise $\epsilon$.

Conditional log likelihood: $\log p(\{Y_i\}_{i=1}^n \mid \beta, \sigma^2, \{X_i\}_{i=1}^n)$
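A minimal NumPy sketch of this equivalence (not from the slides; the data and variable names are invented for illustration): the $\beta$ that solves the normal equations also maximizes the Gaussian conditional log likelihood.

```python
import numpy as np

# Sketch (illustrative data): under zero-mean Gaussian noise, the beta that
# solves the normal equations also maximizes the conditional log likelihood.
rng = np.random.default_rng(0)
n, p = 100, 3
A = rng.normal(size=(n, p))                 # design matrix, rows are X_i
beta_star = np.array([1.0, -2.0, 0.5])      # "true" signal coefficients
sigma = 0.3
Y = A @ beta_star + sigma * rng.normal(size=n)

# Least squares estimate from the normal equations
beta_ls = np.linalg.solve(A.T @ A, A.T @ Y)

def cond_log_lik(beta):
    # Gaussian conditional log likelihood, dropping beta-independent constants
    resid = Y - A @ beta
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

# Any perturbation of beta_ls can only lower the likelihood
print(cond_log_lik(beta_ls) >= cond_log_lik(beta_ls + 0.01))   # True
```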

Page 3:

Regularized Least Squares and M(C)AP


What if $A^\top A$ is not invertible?

Objective: conditional log likelihood + log prior

I) Gaussian prior with mean 0

Ridge Regression: $\hat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$

Likelihood term: $p(\{Y_i\}_{i=1}^n \mid \beta, \sigma^2, \{X_i\}_{i=1}^n)$
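A minimal NumPy sketch (not from the slides) of why the $\lambda I$ term matters: with more features than samples, $A^\top A$ is singular, but the ridge/MAP estimate still exists.

```python
import numpy as np

# Sketch (illustrative data): when p > n, A^T A is rank deficient, but
# A^T A + lam * I is invertible, so the ridge / MAP estimate always exists.
rng = np.random.default_rng(0)
n, p = 20, 50                                # more features than samples
A = rng.normal(size=(n, p))
Y = rng.normal(size=n)
lam = 0.1

# Solving with A.T @ A alone would typically fail here (rank <= n < p)
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)
print(beta_ridge.shape)                      # (50,)
```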

Page 4:

Regularized Least Squares and M(C)AP


What if $A^\top A$ is not invertible?

Prior belief that β is Gaussian with zero-mean biases solution to “small” β

I) Gaussian prior with mean 0

Ridge Regression

Objective: conditional log likelihood $\log p(\{Y_i\}_{i=1}^n \mid \beta, \sigma^2, \{X_i\}_{i=1}^n)$ + log prior

Page 5:

Regularized Least Squares and M(C)AP


What if $A^\top A$ is not invertible?

Prior belief that β is Laplace with zero-mean biases solution to “sparse” β

Lasso

II) Laplace Prior

Objective: conditional log likelihood $\log p(\{Y_i\}_{i=1}^n \mid \beta, \sigma^2, \{X_i\}_{i=1}^n)$ + log prior
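A hedged scikit-learn sketch (not from the slides; `alpha` plays the role of $\lambda$, and the data are invented) contrasting the two priors: the L1/Laplace penalty drives many coefficients exactly to zero, while the L2/Gaussian penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Sketch (illustrative data): a sparse ground truth recovered by Lasso vs Ridge.
rng = np.random.default_rng(0)
n, p = 100, 20
A = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.0, 0.5]             # sparse ground truth
Y = A @ beta_star + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(A, Y)
ridge = Ridge(alpha=0.1).fit(A, Y)
print("exact zeros in lasso coef:", np.sum(lasso.coef_ == 0))   # many
print("exact zeros in ridge coef:", np.sum(ridge.coef_ == 0))   # typically none
```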

Page 6:

Beyond Linear Regression


Polynomial regression

Regression with nonlinear features

Kernelized Ridge Regression

Local Kernel Regression

Page 7:

Polynomial Regression


Univariate (1-dim) case:

$f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_m X^m$, a polynomial of degree m

Multivariate (p-dim) case:

$f(X) = \beta_0 + \beta_1 X^{(1)} + \beta_2 X^{(2)} + \cdots + \beta_p X^{(p)} + \sum_{i=1}^{p}\sum_{j=1}^{p} \beta_{ij} X^{(i)} X^{(j)} + \sum_{i=1}^{p}\sum_{j=1}^{p}\sum_{k=1}^{p} \beta_{ijk} X^{(i)} X^{(j)} X^{(k)} + \ldots$ terms up to degree m

Either way, $\hat{\beta} = (A^\top A)^{-1} A^\top Y$ or $\hat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$, where $A$ is the matrix whose rows hold the polynomial features of each input $X_i$.
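A minimal NumPy sketch (not from the slides; the sine data and $\lambda$ value are invented) showing that polynomial regression is still linear regression once the design matrix holds powers of $X$:

```python
import numpy as np

# Sketch (illustrative data): degree-m univariate polynomial regression.
rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.uniform(0, 1, size=n)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=n)

# Rows of A are [1, X_i, X_i^2, ..., X_i^m]
A = np.vander(X, N=m + 1, increasing=True)

lam = 1e-3
beta = np.linalg.solve(A.T @ A + lam * np.eye(m + 1), A.T @ Y)

def f_hat(x):
    # Evaluate the fitted degree-m polynomial at new points x
    return np.vander(np.atleast_1d(x), N=m + 1, increasing=True) @ beta

print(f_hat(0.25))
```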

Page 8:


Polynomial Regression

[Figure: polynomial fits of order k = 1, 2, 3, and 7 to the same data.]

Polynomial of order k, equivalently of degree up to k-1

What is the right order? Recall overfitting!
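One common answer, sketched below with invented data (not part of the slides), is to pick the order by held-out error, which is typically U-shaped in k while training error keeps decreasing:

```python
import numpy as np

# Sketch (illustrative data): choose the polynomial order k by held-out error.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=60)
Y = np.cos(2 * np.pi * X) + 0.2 * rng.normal(size=60)
X_tr, Y_tr, X_te, Y_te = X[:40], Y[:40], X[40:], Y[40:]

def fit_poly(X, Y, k):
    # Order k = degree up to k-1; least squares fit of the coefficients
    A = np.vander(X, N=k, increasing=True)
    return np.linalg.lstsq(A, Y, rcond=None)[0]

for k in (1, 2, 3, 7):
    beta = fit_poly(X_tr, Y_tr, k)
    pred = np.vander(X_te, N=k, increasing=True) @ beta
    print(k, round(float(np.mean((pred - Y_te) ** 2)), 3))   # held-out MSE
```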

Page 9:

Bias – Variance Tradeoff

Large bias, Small variance – poor approximation but robust/stable

Small bias, Large variance – good approximation but unstable

[Figure: fits on 3 independent training datasets, illustrating the bias-variance tradeoff.]

Page 10:


Bias – Variance Decomposition

It can be shown that

$\mathbb{E}[(f(X) - f^*(X))^2] = \text{Bias}^2 + \text{Variance}$

Bias $= \mathbb{E}[f(X)] - f^*(X)$: how far the learned model is from the best model, on average.

Variance $= \mathbb{E}[(f(X) - \mathbb{E}[f(X)])^2]$: how variable the learned model is across training sets.
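The decomposition follows by adding and subtracting $\mathbb{E}[f(X)]$ inside the square, a one-line step not spelled out on the slide (expectations are over the draw of the training set):

```latex
\begin{aligned}
\mathbb{E}\big[(f(X) - f^*(X))^2\big]
  &= \mathbb{E}\big[\big(f(X) - \mathbb{E}[f(X)] + \mathbb{E}[f(X)] - f^*(X)\big)^2\big] \\
  &= \underbrace{\big(\mathbb{E}[f(X)] - f^*(X)\big)^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}\big[\big(f(X) - \mathbb{E}[f(X)]\big)^2\big]}_{\text{Variance}}
\end{aligned}
```

The cross term vanishes because $\mathbb{E}\big[f(X) - \mathbb{E}[f(X)]\big] = 0$.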

Page 11:

Effect of Model Complexity

[Figure: test error, bias, and variance as a function of model complexity.]

Page 12:

Effect of Model Complexity

[Figure: training error, test error, bias, and variance as a function of model complexity.]

Page 13:


Regression with nonlinear features

In general, use any nonlinear features, e.g. $e^X$, $\log X$, $1/X$, $\sin(X)$, …

Nonlinear feature map: $\phi(X) = [\phi_0(X)\ \ \phi_1(X)\ \ \ldots\ \ \phi_m(X)]$, so $f(X) = \phi(X)\,\beta$, where $\beta$ gives the weight of each feature.

Feature matrix:
$A = \begin{bmatrix} \phi_0(X_1) & \phi_1(X_1) & \ldots & \phi_m(X_1) \\ \vdots & & & \vdots \\ \phi_0(X_n) & \phi_1(X_n) & \ldots & \phi_m(X_n) \end{bmatrix}$

$\hat{\beta} = (A^\top A)^{-1} A^\top Y$ or $\hat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$
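A minimal NumPy sketch (not from the slides; the particular features chosen here are arbitrary, not prescribed by the lecture) of the same closed form with hand-picked nonlinear features:

```python
import numpy as np

# Sketch (illustrative data and features): ridge regression on nonlinear
# features phi_0(X), ..., phi_3(X) stacked as columns of A.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=100)
Y = 2.0 * np.sin(X) + 0.1 * rng.normal(size=100)

features = [np.ones_like, np.exp, np.log, np.sin]        # phi_0, ..., phi_3
A = np.column_stack([phi(X) for phi in features])

lam = 1e-2
beta = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
print(beta)                                              # weight of each feature
```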

Page 14:


Can we use kernels?

Page 15:


Ridge regression (dual)

Similarity with SVMs.

Ridge regression primal problem:
$\min_{\beta, z_i} \ \sum_{i=1}^n z_i^2 + \lambda \|\beta\|_2^2 \quad \text{s.t.} \quad z_i = Y_i - X_i \beta$
(whose closed-form solution is $\hat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$)

SVM primal problem:
$\min_{w, \xi_i} \ C \sum_{i=1}^n \xi_i + \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad \xi_i = \max(1 - Y_i\, X_i \cdot w,\ 0)$

Lagrangian:
$\sum_{i=1}^n z_i^2 + \lambda \|\beta\|_2^2 + \sum_{i=1}^n \alpha_i (z_i - Y_i + X_i \beta)$

$\alpha_i$ is a Lagrange parameter, one per training point.

Page 16:


Ridge regression (dual)

Dual problem:
$\max_{\alpha}\ \min_{\beta, z_i} \ \sum_{i=1}^n z_i^2 + \lambda \|\beta\|_2^2 + \sum_{i=1}^n \alpha_i (z_i - Y_i + X_i \beta)$, with $\alpha = \{\alpha_i\}$ for $i = 1, \ldots, n$.

Taking derivatives of the Lagrangian with respect to $\beta$ and $z_i$ we get:
$\beta = -\frac{1}{2\lambda} A^\top \alpha, \qquad z_i = -\frac{\alpha_i}{2}$

Substituting back, the dual problem becomes:
$\max_{\alpha}\ -\frac{\alpha^\top \alpha}{4} - \frac{1}{4\lambda}\, \alpha^\top A A^\top \alpha - \alpha^\top Y$

This is an n-dimensional optimization problem. (Recall the primal solution $\hat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$.)

Page 17:


Ridge regression (dual)

Dual problem:
$\max_{\alpha}\ -\frac{\alpha^\top \alpha}{4} - \frac{1}{4\lambda}\, \alpha^\top A A^\top \alpha - \alpha^\top Y \ \Rightarrow\ \hat{\alpha} = -2\left(\frac{AA^\top}{\lambda} + I\right)^{-1} Y$

From $\hat{\alpha}$ we can get back
$\hat{\beta} = -\frac{1}{2\lambda} A^\top \hat{\alpha} = -\frac{1}{2\lambda} \sum_{i=1}^n \hat{\alpha}_i X_i = A^\top (A A^\top + \lambda I)^{-1} Y$

$\hat{\alpha}_i$ is the weight of each training point (but typically not sparse), so $\hat{\beta}$ is a weighted average of training points. Compare with the primal form $\hat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$.
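A quick numerical check (a NumPy sketch, not from the slides) that the primal and dual closed forms agree:

```python
import numpy as np

# Sketch (illustrative data): primal (p x p solve) vs dual (n x n solve) ridge.
rng = np.random.default_rng(0)
n, p = 30, 5
A = rng.normal(size=(n, p))
Y = rng.normal(size=n)
lam = 0.5

beta_primal = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)   # (A^T A + lam I)^{-1} A^T Y
beta_dual = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), Y)     # A^T (A A^T + lam I)^{-1} Y
print(np.allclose(beta_primal, beta_dual))                          # True
```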

Page 18:


Kernelized ridge regression

Using the dual, we can re-write the solution as:

$\hat{f}_n(X) = K_X\,(K + \lambda I)^{-1} Y$, where $K_X(i) = \phi(X) \cdot \phi(X_i)$ and $K(i, j) = \phi(X_i) \cdot \phi(X_j)$

How does this help?
• Only need to invert an n x n matrix (instead of p x p or m x m)
• More importantly, the kernel trick: work with kernels, never need to write out the high-dimensional feature vectors

(Recall the primal and dual forms: $\hat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$ and $\hat{\beta} = A^\top (AA^\top + \lambda I)^{-1} Y$.)
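A minimal NumPy sketch (not from the slides; the Gaussian kernel and the `gamma`, `lam` values are illustrative choices) of kernelized ridge regression on 1-D inputs:

```python
import numpy as np

# Sketch (illustrative data): f_n(x) = K_x (K + lam I)^{-1} Y with an RBF kernel.
def rbf(a, b, gamma=10.0):
    # k(a_i, b_j) = exp(-gamma * (a_i - b_j)^2)
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=40)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=40)
lam = 0.1

K = rbf(X, X)                                        # n x n kernel matrix K(i, j)
coef = np.linalg.solve(K + lam * np.eye(len(X)), Y)  # (K + lam I)^{-1} Y

def f_hat(x_new):
    # K_x has entries k(x, X_i); never need the explicit feature vectors
    return rbf(np.atleast_1d(x_new), X) @ coef

print(f_hat(0.25))
```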

Page 19:


Kernelized ridge regression

$\hat{f}_n(X) = K_X\,(K + \lambda I)^{-1} Y$, where $K_X(i) = \phi(X) \cdot \phi(X_i)$ and $K(i, j) = \phi(X_i) \cdot \phi(X_j)$

Work with kernels, never need to write out the high-dimensional feature vectors.

Examples of kernels:
• Polynomials of degree exactly d
• Polynomials of degree up to d
• Gaussian/Radial kernels

This is Ridge Regression with (implicit) nonlinear features: $f(X) = \phi(X)\,\beta$.

Page 20:

Local Kernel Regression

• What is the temperature in the room? (Average)
• What is the temperature at location x? ("Local" average around x)

Page 21:

Local Kernel Regression

• Nonparametric estimator, akin to kNN
• Nadaraya-Watson Kernel Estimator (see the sketch below):

$\hat{f}_n(x) = \sum_{i=1}^n w_i(x)\, Y_i$, where $w_i(x) = \dfrac{K\!\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x - X_j}{h}\right)}$

• Weight each training point based on its distance to the test point
• A boxcar kernel yields the local average

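A minimal NumPy sketch (not from the slides; the Gaussian kernel and bandwidth h = 0.05 are illustrative choices) of the Nadaraya-Watson estimator:

```python
import numpy as np

# Sketch (illustrative data): Nadaraya-Watson kernel regression at a point x.
def nadaraya_watson(x, X, Y, h):
    w = np.exp(-0.5 * ((x - X) / h) ** 2)    # kernel weights K((x - X_i) / h)
    return np.sum(w * Y) / np.sum(w)         # kernel-weighted local average

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=100)
print(nadaraya_watson(0.5, X, Y, h=0.05))
```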

Page 22:

Kernels


Page 23:

Choice of kernel bandwidth h


Image source: Larry's book, All of Nonparametric Statistics

[Figure: kernel regression fits with bandwidths h = 1, 10, 50, 200, ranging from too small to just right to too large.]

Choice of kernel is not that important.

Page 24:

Kernel Regression as Weighted Least Squares


Weighted Least Squares: $\min_{f} \ \sum_{i=1}^n w_i\, (Y_i - f(X_i))^2$

Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares with kernel weights $w_i = K\!\left(\frac{x - X_i}{h}\right)$,

i.e. set $f(X_i) = b$ (a constant).

Page 25:

Kernel Regression as Weighted Least Squares

Notice that setting $f(X_i) = b$ (a constant) in the locally weighted least squares objective recovers the Nadaraya-Watson estimator; the algebra is sketched below.
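A sketch of that algebra (standard, but not reproduced on the slide): minimizing the kernel-weighted objective over the constant $b$ gives the kernel-weighted average.

```latex
\hat{b}(x) = \arg\min_{b} \sum_{i=1}^{n} K\!\Big(\tfrac{x - X_i}{h}\Big)\,(Y_i - b)^2
\;\Longrightarrow\;
0 = -2\sum_{i=1}^{n} K\!\Big(\tfrac{x - X_i}{h}\Big)\,(Y_i - \hat{b})
\;\Longrightarrow\;
\hat{b}(x) = \frac{\sum_{i=1}^{n} K\!\big(\tfrac{x - X_i}{h}\big)\, Y_i}{\sum_{i=1}^{n} K\!\big(\tfrac{x - X_i}{h}\big)}
```

This is exactly the Nadaraya-Watson estimator from the earlier slide.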

Page 26:

Local Linear/Polynomial Regression


Weighted Least Squares

Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares,

i.e. set $f(X_i) = b_0 + b_1 (X_i - x) + \cdots + b_p (X_i - x)^p$ (a local polynomial of degree p around x).
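A minimal NumPy sketch (not from the slides; the Gaussian weights and bandwidth are illustrative) of the degree-1 case, local linear regression, at a single query point:

```python
import numpy as np

# Sketch (illustrative data): local linear regression solves a kernel-weighted
# least squares problem around each query point x0.
def local_linear(x0, X, Y, h):
    w = np.exp(-0.5 * ((x0 - X) / h) ** 2)           # kernel weights
    A = np.column_stack([np.ones_like(X), X - x0])   # local design [1, X_i - x0]
    W = np.diag(w)
    b = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)    # weighted least squares
    return b[0]                                      # intercept = estimate of f(x0)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=100)
print(local_linear(0.5, X, Y, h=0.1))
```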

Page 27:

What you should know

Linear Regression
• Least Squares Estimator
• Normal Equations
• Gradient Descent
• Probabilistic Interpretation (connection to MCLE)

Regularized Linear Regression (connection to MCAP)
• Ridge Regression, Lasso

Beyond Linear
• Polynomial regression, Regression with non-linear features, Bias-variance tradeoff, Kernelized ridge regression, Local Kernel Regression and Weighted Least Squares