
Page 1

Simple Linear Regression (single variable)
Introduction to Machine Learning

Marek Petrik

January 31, 2017

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Page 2

Last Class

1. Basic machine learning framework

Y = f(X)

2. Prediction vs inference: predict Y vs understand f

3. Parametric vs non-parametric: linear regression vs k-NN

4. Classification vs. regression: k-NN vs. linear regression

5. Why we need a test set: overfitting

Page 3

What is Machine Learning

- Discover the unknown function f:

Y = f(X)

- X = set of features, or inputs
- Y = target, or response

[Figure: scatter plots of Sales against the TV, Radio, and Newspaper advertising budgets.]

Sales = f(TV, Radio, Newspaper)

Page 4

Errors in Machine Learning: World is Noisy

- The world is too complex to model precisely
- Many features are not captured in the data sets
- Need to allow for errors ε in f:

Y = f(X) + ε

Page 5

How Good are Predictions?

- Learned function f̂
- Test data: (x1, y1), (x2, y2), . . .
- Mean Squared Error (MSE):

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{f}(x_i) \big)^2 \]

- This is an estimate of (for a finite sample space Ω with equally likely outcomes):

\[ \mathrm{MSE} = \mathbb{E}\big[ (Y - \hat{f}(X))^2 \big] = \frac{1}{|\Omega|} \sum_{\omega \in \Omega} \big( Y(\omega) - \hat{f}(X(\omega)) \big)^2 \]

- Important: the samples x_i are i.i.d.
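As a concrete illustration, here is a minimal R sketch (synthetic data and variable names of my own, since the slides attach no dataset): fit on a training half, then estimate the MSE on the held-out half.

    set.seed(1)
    # Simulate a known world: Y = f(X) + eps with f(X) = 2 + 3*X
    n <- 200
    x <- runif(n, 0, 10)
    y <- 2 + 3 * x + rnorm(n, sd = 2)

    # Hold out half of the (i.i.d.) samples as test data
    train <- sample(n, n / 2)
    fit <- lm(y ~ x, subset = train)

    # Test MSE: average squared error on the held-out points
    pred <- predict(fit, newdata = data.frame(x = x[-train]))
    mean((y[-train] - pred)^2)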

Page 6

Do We Need Test Data?

- Why not just test on the training data?

[Figure: left, simulated data with polynomial fits of increasing flexibility; right, mean squared error as a function of flexibility.]

- Flexibility is the degree of the polynomial being fit
- Gray line: training error; red line: testing error

Page 7

Types of Function f

Regression: continuous target

f : X → ℝ

[Figure: Income as a function of Years of Education and Seniority.]

Classification: discrete target

f : X → {1, 2, 3, . . . , k}

[Figure: two-class scatter plot in features X1 and X2 with a decision boundary.]

Page 8

Bias-Variance Decomposition

Y = f(X) + ε

The mean squared error can be decomposed as:

\[ \mathrm{MSE} = \mathbb{E}\big[ (Y - \hat{f}(X))^2 \big] = \underbrace{\mathrm{Var}(\hat{f}(X))}_{\text{Variance}} + \underbrace{\big( \mathbb{E}[\hat{f}(X)] - f(X) \big)^2}_{\text{Bias}^2} + \mathrm{Var}(\varepsilon) \]

- Bias: how well the method would work with infinite data
- Variance: how much the output changes with different data sets
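To make the decomposition concrete, here is a small R simulation of my own (not from the slides): refit a fixed-form model on many independent training sets and estimate the variance and squared-bias terms at a single point x0.

    set.seed(1)
    f <- function(x) sin(2 * x)   # true regression function (known here by construction)
    sigma <- 0.3                  # noise sd, so Var(eps) = sigma^2
    x0 <- 1.0                     # point at which we evaluate the decomposition

    # Predictions at x0 from a degree-2 polynomial fit on 2000 fresh training sets
    preds <- replicate(2000, {
      x <- runif(50, 0, 3)
      y <- f(x) + rnorm(50, sd = sigma)
      predict(lm(y ~ poly(x, 2)), newdata = data.frame(x = x0))
    })

    variance <- var(preds)                 # Var(f_hat(x0))
    bias2 <- (mean(preds) - f(x0))^2       # (E[f_hat(x0)] - f(x0))^2
    c(variance = variance, bias2 = bias2, noise = sigma^2,
      total = variance + bias2 + sigma^2)  # approximates the expected test MSE at x0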

Page 9

Today

- Basics of linear regression
- Why linear regression
- How to compute it
- Why compute it

Page 10

Simple Linear Regression

- We have only one feature:

\[ Y \approx \beta_0 + \beta_1 X \qquad\qquad Y = \beta_0 + \beta_1 X + \varepsilon \]

- Example:

[Figure: Sales vs. TV advertising budget with a fitted regression line.]

Sales ≈ β0 + β1 × TV

Page 11

How To Estimate Coefficients

- No line can fit all the data points (x_i, y_i) without error
- Prediction:

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

- Errors (y_i are the true values):

\[ e_i = y_i - \hat{y}_i \]

[Figure: Sales vs. TV with the least-squares line and vertical segments showing the residuals.]

Page 12

Residual Sum of Squares

- Residual Sum of Squares:

\[ \mathrm{RSS} = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2 \]

- Equivalently:

\[ \mathrm{RSS} = \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_i \big)^2 \]

Page 13

Minimizing Residual Sum of Squares

\[ \min_{\beta_0, \beta_1} \mathrm{RSS} = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} e_i^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_i \big)^2 \]

[Figure: RSS as a surface over (β0, β1).]

Page 14

Minimizing Residual Sum of Squares

\[ \min_{\beta_0, \beta_1} \mathrm{RSS} = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} e_i^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_i \big)^2 \]

[Figure: contour plot of RSS over (β0, β1); the minimum lies at the center of the contours.]

Page 15

Solving for Minimal RSS

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_i \big)^2 \]

- RSS is a convex function of β0, β1
- The minimum is achieved when (recall the chain rule):

\[ \frac{\partial\, \mathrm{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_i \big) = 0 \]

\[ \frac{\partial\, \mathrm{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \big( y_i - \beta_0 - \beta_1 x_i \big) = 0 \]
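Filling in the algebra between this slide and the next: dividing the first equation by −2n gives the intercept, and substituting it into the second gives the slope.

\[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \quad\Rightarrow\quad \beta_0 = \bar{y} - \beta_1 \bar{x} \]

\[ \sum_{i=1}^{n} x_i \big( (y_i - \bar{y}) - \beta_1 (x_i - \bar{x}) \big) = 0 \quad\Rightarrow\quad \beta_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} \]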

Page 16

Linear Regression Coefficients

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_i \big)^2 \]

Solution:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} \]

where

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]
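A quick R check that the closed form matches lm(); the data here are synthetic and the coefficient values are invented for illustration.

    set.seed(1)
    x <- runif(100, 0, 300)
    y <- 7 + 0.05 * x + rnorm(100, sd = 3)

    # Closed-form least-squares coefficients
    beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    beta0 <- mean(y) - beta1 * mean(x)

    c(beta0, beta1)
    coef(lm(y ~ x))   # same values from R's built-in fit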

Page 17

Why Minimize RSS

1. Maximizes the likelihood when Y = β0 + β1X + ε with ε ∼ N(0, σ²)

2. Best Linear Unbiased Estimator (BLUE): Gauss-Markov Theorem (ESL 3.2.2)

3. It is convenient: it can be solved in closed form
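Connecting point 1 to the RSS objective: under ε ∼ N(0, σ²) the log-likelihood of the training data is

\[ \log L(\beta_0, \beta_1) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - \beta_0 - \beta_1 x_i \big)^2, \]

so maximizing it over (β0, β1) is exactly minimizing RSS.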


Page 21

Bias in Estimation

- Assume a true value µ*
- An estimate µ̂ is unbiased when E[µ̂] = µ*
- The standard mean estimate is unbiased (e.g. for X ∼ N(0, 1)):

\[ \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = 0 \]

- The standard variance estimate is biased (e.g. for X ∼ N(0, 1)):

\[ \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} \big( X_i - \bar{X} \big)^2 \right] \neq 1 \]
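A short R simulation of the last bullet (a sketch of my own): with n = 5 draws from N(0, 1), the 1/n variance estimate averages (n − 1)/n = 0.8 rather than 1, and dividing by n − 1 removes the bias.

    set.seed(1)
    n <- 5
    est <- replicate(1e5, {
      x <- rnorm(n)                # sample of size n from N(0, 1), true variance 1
      mean((x - mean(x))^2)        # the 1/n variance estimate
    })
    mean(est)                      # about 0.8 = (n - 1)/n: biased low
    mean(est) * n / (n - 1)        # Bessel-corrected: about 1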

Page 22

Linear Regression is Unbiased

[Figure: simulated data with the true population regression line and least-squares lines fit to different random samples; the fits vary around the truth but are correct on average.]

Gauss-Markov Theorem (ESL 3.2.2)

Page 23

Solution of Linear Regression

[Figure: the least-squares fit of Sales on the TV advertising budget.]

Page 24

How Good is the Fit

- How well is linear regression predicting the training data?
- Can we be sure that TV advertising really influences the sales?
- What is the probability that we just got lucky?

Page 25

R² Statistic

\[ R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]

- RSS: residual sum of squares; TSS: total sum of squares
- R² measures the goodness of fit as a proportion
- Proportion of the data variance explained by the model
- Extreme values:
  - 0: model does not explain the data
  - 1: model explains the data perfectly
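In R (illustrative synthetic data), R² computed from the definition matches the value lm() reports:

    set.seed(1)
    x <- runif(100, 0, 300)
    y <- 7 + 0.05 * x + rnorm(100, sd = 3)
    fit <- lm(y ~ x)

    rss <- sum(residuals(fit)^2)   # residual sum of squares
    tss <- sum((y - mean(y))^2)    # total sum of squares
    1 - rss / tss                  # R^2 from the definition
    summary(fit)$r.squared         # the same value from lm()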

Page 26

Example: TV Impact on Sales

[Figure: least-squares fit of Sales on TV.]

R² = 0.61


Page 28

Example: Radio Impact on Sales

[Figure: least-squares fit of Sales on Radio.]

R² = 0.33


Page 30

Example: Newspaper Impact on Sales

[Figure: least-squares fit of Sales on Newspaper.]

R² = 0.05


Page 32

Correlation Coefficient

- Measures the dependence between two random variables X and Y:

\[ r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}} \]

- The correlation coefficient r lies in [−1, 1]:
  - 0: variables are not linearly related
  - 1: variables are perfectly positively related (move together)
  - −1: variables are perfectly negatively related (move oppositely)
- For simple linear regression, R² = r²
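The last bullet is easy to verify in R on synthetic data:

    set.seed(1)
    x <- runif(100, 0, 300)
    y <- 7 + 0.05 * x + rnorm(100, sd = 3)

    cor(x, y)^2                    # squared correlation coefficient r^2
    summary(lm(y ~ x))$r.squared   # equals R^2 of the simple regression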


Page 34

Hypothesis Testing

- Null hypothesis H0: there is no relationship between X and Y

β1 = 0

- Alternative hypothesis H1: there is some relationship between X and Y

β1 ≠ 0

- We seek to reject H0 with a small "probability" (p-value) of making a mistake
- An important topic, but beyond the scope of this course

Page 35

Multiple Linear Regression

- Usually more than one feature is available:

\[ \mathrm{sales} = \beta_0 + \beta_1 \times \mathrm{TV} + \beta_2 \times \mathrm{radio} + \beta_3 \times \mathrm{newspaper} + \varepsilon \]

- In general:

\[ Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon \]

Page 36

Multiple Linear Regression

[Figure: regression plane for Y as a function of X1 and X2.]

Page 37

Estimating Coefficients

- Prediction:

\[ \hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij} \]

- Errors (y_i are the true values):

\[ e_i = y_i - \hat{y}_i \]

- Residual Sum of Squares:

\[ \mathrm{RSS} = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2 \]

- How to minimize RSS? Linear algebra! (See the sketch below.)
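A sketch of the linear-algebra solution the slide alludes to: with X the n × (p + 1) design matrix whose first column is all ones, the minimizer of RSS solves the normal equations XᵀXβ = Xᵀy. In R (variable names and coefficients invented for illustration):

    set.seed(1)
    n <- 100
    x1 <- runif(n); x2 <- runif(n)
    y <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

    X <- cbind(1, x1, x2)                         # design matrix with intercept column
    beta <- solve(crossprod(X), crossprod(X, y))  # solve (X'X) beta = X'y
    drop(beta)
    coef(lm(y ~ x1 + x2))                         # lm() gives the same coefficients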

Page 38

Inference from Linear Regression

1. Are the predictors X1, X2, . . . , Xp really predicting Y?

2. Is only a subset of predictors useful?

3. How well does the linear model fit the data?

4. What Y should we predict, and how accurate is it?

Page 39

Inference 1

“Are the predictors X1, X2, . . . , Xp really predicting Y?”

- Null hypothesis H0: there is no relationship between the predictors and Y

β1 = β2 = · · · = βp = 0

- Alternative hypothesis H1: there is some relationship, i.e. at least one βj ≠ 0
- We seek to reject H0 with a small "probability" (p-value) of making a mistake
- See ISL 3.2.2 on how to compute the F-statistic and reject H0
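In R, summary() of a fitted model reports this F-statistic; a sketch on synthetic data (variable names are mine):

    set.seed(1)
    n <- 100
    x1 <- runif(n); x2 <- runif(n)
    y <- 1 + 2 * x1 + rnorm(n)     # x2 is pure noise here

    fit <- lm(y ~ x1 + x2)
    fs <- summary(fit)$fstatistic  # F value with its degrees of freedom
    fs
    pf(fs[1], fs[2], fs[3], lower.tail = FALSE)  # p-value of the overall F-test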

Page 40

Inference 2

“Is only a subset of predictors useful?”

- Compare prediction accuracy with only a subset of the features
- RSS always decreases with more features!
- Other measures control for the number of variables:
  1. Mallows Cp
  2. Akaike information criterion (AIC)
  3. Bayesian information criterion (BIC)
  4. Adjusted R²
- Testing all subsets of features is impractical: 2^p options!
- More on how to do this later (a sketch of these criteria follows)
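A small R illustration of my own (only x1 truly matters in this synthetic setup): RSS always favors the bigger model, while the penalized criteria need not.

    set.seed(1)
    n <- 100
    x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
    y <- 1 + 2 * x1 + rnorm(n)

    small <- lm(y ~ x1)
    big <- lm(y ~ x1 + x2 + x3)

    c(sum(residuals(small)^2), sum(residuals(big)^2))  # RSS drops with more features
    c(AIC(small), AIC(big))        # penalized criteria can prefer the small model
    c(BIC(small), BIC(big))
    c(summary(small)$adj.r.squared, summary(big)$adj.r.squared)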


Page 45

Inference 3

“How well does the linear model fit the data?”

- R² also always increases with more features (just like RSS always decreases)
- Is the model linear? Plot it:

[Figure: Sales as a function of TV and Radio with the fitted regression surface.]

- More on this later

Page 46

Inference 4

“What Y should we predict, and how accurate is it?”

- The linear model is used to make predictions:

\[ y_{\mathrm{predicted}} = \hat{\beta}_0 + \hat{\beta}_1 x_{\mathrm{new}} \]

- We can also predict a confidence interval (based on an estimate of ε)
- Example: spend $100,000 on TV and $20,000 on Radio advertising
- Confidence interval: predicts f(X) (the average response):

f(x) ∈ [10.985, 11.528]

- Prediction interval: predicts f(X) + ε (the response plus possible noise):

f(x) + ε ∈ [7.930, 14.580]
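The intervals on this slide come from ISL's Advertising data, which is not attached here; on synthetic data the corresponding R calls are as below. Note the prediction interval is wider than the confidence interval.

    set.seed(1)
    tv <- runif(200, 0, 300)
    sales <- 7 + 0.05 * tv + rnorm(200, sd = 2)
    fit <- lm(sales ~ tv)

    new <- data.frame(tv = 100)
    predict(fit, new, interval = "confidence")  # interval for the mean response f(x)
    predict(fit, new, interval = "prediction")  # wider: also covers the noise eps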


Page 48

R notebook