
COMP 551 - Applied Machine Learning
Lecture 3 --- Linear Regression (cont.)
William L. Hamilton
(with slides and content from Joelle Pineau)

* Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

William L. Hamilton, McGill University and Mila 1


Self-assessment/practice quizzes

§ Quiz 0 – Attempt 1 is out and due tonight at 11:59pm.

§ Quiz 0 – Attempt 2 will be out Thursday at 12am.

§ You have one hour to complete it once you begin.

§ This week’s quizzes do not count towards your grade. They are for self-assessment and practice.

§ If you are really struggling or confused by the quizzes this week, it might be worthwhile to wait and take this course in a later semester.

§ (Ideally you should be able to get 100% on your second attempt)

William L. Hamilton, McGill University and Mila 2


Recall: Linear regression

§ The linear regression problem: fw(x) = w0 + ∑j=1:m wj xj, where m = the dimension of the observation space, i.e. the number of features.

§ Goal: Find the best linear model (i.e., weights) given the data.
§ Many different possible evaluation criteria!

§ Most common choice is to find the w that minimizes:

Err(w) = ∑i=1:n (yi - wTxi)2

(A note on notation: Here w and x are vectors of size m+1.)

William L. Hamilton, McGill University and Mila 3


Is linear regression always easy to solve?

§ Recall the least-square solution:

ŵ = (XTX)-1XTY

§ Assuming for now that X is reasonably small and that we can find a closed-form solution in poly-time.

§ Does this always work (well)?

William L. Hamilton, McGill University and Mila 4
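As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the closed-form least-squares solution; the data are made up, and a column of ones is appended so that w0 acts as the intercept.

```python
import numpy as np

# Toy data (made up for illustration): n = 5 examples, m = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.0, 10.1])

# Append a column of ones so the last weight plays the role of w0.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form least squares: w = (X^T X)^{-1} X^T Y.
# Solving the linear system is preferable to forming the inverse explicitly.
w = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Err(w) = sum_i (y_i - w^T x_i)^2
err = np.sum((y - X1 @ w) ** 2)
print(w, err)
```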


Is linear regression always easy to solve?

§ Recall the least-square solution: ŵ=(XTX)-1XTY

§ Two main failure modes:
§ 1) (XTX) is singular, so the inverse is undefined.
§ 2) A simple linear function is a bad fit… (e.g., the figure to the right)

William L. Hamilton, McGill University and Mila 5


Failure mode 1: Avoiding singularities

§ Recall the least-square solution: ŵ=(XTX)-1XTY

§ To have a unique solution, we need XTX to be nonsingular.

§ That means X must have full column rank (i.e., no feature is an exact linear combination of the others).

William L. Hamilton, McGill University and Mila 6


Failure mode 1: Avoiding singularities

§ Recall the least-square solution: ŵ=(XTX)-1XTY

§ To have a unique solution, we need XTX to be nonsingular.

§ That means X must have full column rank (i.e., no feature is an exact linear combination of the others).

Exercise: What if X does not have full column rank? When would this happen?

Design an example. Try to solve it.

William L. Hamilton, McGill University and Mila 7

Dealing with difficult cases of (XTX)-1

§ Case #1: The weights are not uniquely defined.
§ Example: One feature is a linear function of the other, e.g., Xm = 2Xm-1 + 2.
§ Solution: Re-code or drop some redundant columns of X.

§ Case #2: The number of features/weights (m) exceeds the number of training examples (n).
§ Solution: Reduce the number of features using various techniques (to be studied later).

William L. Hamilton, McGill University and Mila 8
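A small sketch (all numbers made up) of failure mode 1: the second feature below is an exact linear function of the first (Case #1 above), so X lacks full column rank and XTX is singular; dropping the redundant column or using the pseudoinverse restores a usable solution.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=6)
x2 = 2 * x1 + 2                              # exact linear dependence: X_m = 2*X_{m-1} + 2
X = np.column_stack([x1, x2, np.ones(6)])    # features plus intercept column
y = rng.normal(size=6)

print(np.linalg.matrix_rank(X))              # 2, not 3: X does not have full column rank
print(np.linalg.cond(X.T @ X))               # enormous condition number: X^T X is (numerically) singular

# Fix (a): drop the redundant column and use the ordinary closed form.
X_fixed = X[:, [0, 2]]
w_fixed = np.linalg.solve(X_fixed.T @ X_fixed, X_fixed.T @ y)

# Fix (b): the pseudoinverse returns the minimum-norm least-squares solution
# even when X is rank-deficient.
w_pinv = np.linalg.pinv(X) @ y
print(w_fixed, w_pinv)
```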


Failure mode 2: When linear is a bad fit

§ This function looks complicated, and a linear hypothesis does not seem very good.

§ What should we do?

William L. Hamilton, McGill University and Mila 9


Failure mode 2: When linear is a bad fit

§ This function looks complicated, and a linear hypothesis does not seem very good.

§ What should we do?
§ Pick a better function?

§ Use more features?

§ Get more data?

William L. Hamilton, McGill University and Mila 10


Input variables for linear regression

§ Original quantitative variables X1, …, Xm

William L. Hamilton, McGill University and Mila 11


Input variables for linear regression

§ Original quantitative variables X1, …, Xm
§ Transformations of variables, e.g. Xm+1 = log(Xi)

William L. Hamilton, McGill University and Mila 12


Input variables for linear regression

§ Original quantitative variables X1, …, Xm
§ Transformations of variables, e.g. Xm+1 = log(Xi)
§ Basis expansions, e.g. Xm+1 = Xi^2, Xm+2 = Xi^3, …

William L. Hamilton, McGill University and Mila 13


Input variables for linear regression

§ Original quantitative variables X1, …, Xm
§ Transformations of variables, e.g. Xm+1 = log(Xi)
§ Basis expansions, e.g. Xm+1 = Xi^2, Xm+2 = Xi^3, …
§ Interaction terms, e.g. Xm+1 = Xi Xj

William L. Hamilton, McGill University and Mila 14


Input variables for linear regression

§ Original quantitative variables X1, …, Xm
§ Transformations of variables, e.g. Xm+1 = log(Xi)
§ Basis expansions, e.g. Xm+1 = Xi^2, Xm+2 = Xi^3, …
§ Interaction terms, e.g. Xm+1 = Xi Xj
§ Numeric coding of qualitative variables, e.g. Xm+1 = 1 if Xi is true and 0 otherwise.

William L. Hamilton, McGill University and Mila 15


Input variables for linear regression

§ Original quantitative variables X1, …, Xm
§ Transformations of variables, e.g. Xm+1 = log(Xi)
§ Basis expansions, e.g. Xm+1 = Xi^2, Xm+2 = Xi^3, …
§ Interaction terms, e.g. Xm+1 = Xi Xj
§ Numeric coding of qualitative variables, e.g. Xm+1 = 1 if Xi is true and 0 otherwise.

In all cases, we can add Xm+1,…,Xm+k to the list of original variables and perform the linear regression.

William L. Hamilton, McGill University and Mila 16
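A small sketch (made-up data) of how such derived inputs are simply appended as extra columns of X before running the same linear regression:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.5, 2.0, size=(8, 2))             # original quantitative variables X1, X2
y = rng.normal(size=8)

log_col   = np.log(X[:, [0]])                      # transformation:   X3 = log(X1)
sq_col    = X[:, [0]] ** 2                         # basis expansion:  X4 = X1^2
inter_col = X[:, [0]] * X[:, [1]]                  # interaction term: X5 = X1 * X2
coded_col = (X[:, [1]] > 1.0).astype(float)        # numeric coding of a qualitative condition

# Append the derived columns (plus an intercept column) and regress exactly as before.
X_expanded = np.hstack([X, log_col, sq_col, inter_col, coded_col, np.ones((8, 1))])
w = np.linalg.pinv(X_expanded) @ y                 # least-squares fit on the expanded design matrix
```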


Variable coding is not always straightforward!

§ Suppose we have a categorical variable with four possible values: Xcat = {blue, red, green, yellow}
§ How do we encode this?

William L. Hamilton, McGill University and Mila 17


Variable coding is not always straightforward!

§ Suppose we have a categorical variable with four possible values: Xcat = {blue, red, green, yellow}
§ How do we encode this?
§ Solution 1: Four separate binary variables (e.g., Xblue = 1 if Xcat = blue and Xblue = 0 otherwise).

William L. Hamilton, McGill University and Mila 18


Variable coding is not always straightforward!

§ Suppose we have a categorical variable with four possible values: Xcat = {blue, red, green, yellow}
§ How do we encode this?
§ Solution 1: Four separate binary variables (e.g., Xblue = 1 if Xcat = blue and Xblue = 0 otherwise).
§ Careful… this can lead to singularities, since these four variables are perfectly collinear (together with the intercept): if we know Xcat is not {red, green, yellow} then it must be blue!

William L. Hamilton, McGill University and Mila 19


Variable coding is not always straightforward!

§ Suppose we have a categorical variable with four possible values: Xcat = {blue, red, green, yellow}
§ How do we encode this?
§ Solution 1: Four separate binary variables (e.g., Xblue = 1 if Xcat = blue and Xblue = 0 otherwise).
§ Careful… this can lead to singularities, since these four variables are perfectly collinear (together with the intercept): i.e., if we know Xcat is not {red, green, yellow} then it must be blue!
§ Solution 2: Three separate binary variables for {red, green, yellow}, with the blue category represented by the intercept.
§ This is a better solution!

William L. Hamilton, McGill University and Mila 20
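A minimal sketch (sample values made up) contrasting the two encodings: four indicators plus an intercept are rank-deficient, while three indicators with blue absorbed into the intercept give a full-rank design.

```python
import numpy as np

colors = np.array(["blue", "red", "green", "yellow", "blue", "green"])

# Solution 1: one indicator column per category, plus an intercept column.
all_levels = ["blue", "red", "green", "yellow"]
D_all = np.column_stack([(colors == c).astype(float) for c in all_levels] + [np.ones(len(colors))])
print(np.linalg.matrix_rank(D_all), D_all.shape[1])   # rank 4 < 5 columns: the indicators sum to the intercept

# Solution 2: drop one category (blue); it is represented by the intercept.
D_ref = np.column_stack([(colors == c).astype(float) for c in ["red", "green", "yellow"]]
                        + [np.ones(len(colors))])
print(np.linalg.matrix_rank(D_ref), D_ref.shape[1])   # rank 4 == 4 columns: full column rank
```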


Variable coding is not always straightforward!

§ What if our input is just raw text (or images)…?

§ We need to get creative! E.g.,

§ word counts

§ average word length

§ number of words per sentence

§ etc…

William L. Hamilton, McGill University and Mila 21


Variable coding is not always straightforward!

§ What if our input is just raw text (or images)…?

§ We need to get creative! E.g.,

§ word counts

§ average word length

§ number of words per sentence

§ etc…

§ This is called feature design, and it will be a recurring theme throughout the course!

William L. Hamilton, McGill University and Mila 22
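A tiny sketch (not from the slides; the documents are made up) of hand-designed features for raw text, along the lines listed above:

```python
import numpy as np

docs = [
    "the cat sat on the mat",
    "machine learning is fun. linear regression is a good place to start.",
]

def text_features(doc):
    """Map a raw document to a small, hand-designed feature vector."""
    words = doc.replace(".", " ").split()
    sentences = [s for s in doc.split(".") if s.strip()]
    return [
        len(words),                               # word count
        float(np.mean([len(w) for w in words])),  # average word length
        len(words) / max(len(sentences), 1),      # number of words per sentence
    ]

X = np.array([text_features(d) for d in docs])    # each document becomes one row of the design matrix
print(X)
```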


Input variables for linear regression

§ Original quantitative variables X1, …, Xm
§ Transformations of variables, e.g. Xm+1 = log(Xi)
§ Basis expansions, e.g. Xm+1 = Xi^2, Xm+2 = Xi^3, …
§ Interaction terms, e.g. Xm+1 = Xi Xj
§ Numeric coding of qualitative variables, e.g. Xm+1 = 1 if Xi is true and 0 otherwise.

In all cases, we can add Xm+1,…,Xm+k to the list of original variables and perform the linear regression.

William L. Hamilton, McGill University and Mila 23


Ex: Linear regression with polynomial terms

Answer: Polynomial regression

• Given data: (x1, y1), (x2, y2), . . . , (xm, ym).
• Suppose we want a degree-d polynomial fit.
• Let Y be as before and let

X =
[ x1^d  ...  x1^2  x1  1 ]
[ x2^d  ...  x2^2  x2  1 ]
[  ...        ...  ...   ]
[ xm^d  ...  xm^2  xm  1 ]

• Solve the linear regression Xw ≈ Y.

Example of quadratic regression: Data matrices

Here fw(x) = w0 + w1 x + w2 x^2, so the design matrix has columns x^2, x, 1:

X =
[ 0.75   0.86  1 ]
[ 0.01   0.09  1 ]
[ 0.73  -0.85  1 ]
[ 0.76   0.87  1 ]
[ 0.19  -0.44  1 ]
[ 0.18  -0.43  1 ]
[ 1.22  -1.10  1 ]
[ 0.16   0.40  1 ]
[ 0.93  -0.96  1 ]
[ 0.03   0.17  1 ]

Y =
[  2.49 ]
[  0.83 ]
[ -0.25 ]
[  3.10 ]
[  0.87 ]
[  0.02 ]
[ -0.12 ]
[  1.81 ]
[ -0.83 ]
[  0.43 ]

[From COMP-652, Lecture 1, September 6, 2012]

William L. Hamilton, McGill University and Mila 24


Data matrices: XTX and XTY

With X the 10×3 quadratic design matrix from the previous slide (columns x^2, x, 1):

XTX =
[  4.11  -1.64   4.95 ]
[ -1.64   4.95  -1.39 ]
[  4.95  -1.39  10    ]

XTY =
[ 3.60 ]
[ 6.49 ]
[ 8.34 ]

[From COMP-652, Lecture 1, September 6, 2012]

William L. Hamilton, McGill University and Mila 25



Solving the problem

Solving for w:

w = (XTX)-1 XTY
  = [  4.11  -1.64   4.95 ]^-1  [ 3.60 ]     [ 0.68 ]
    [ -1.64   4.95  -1.39 ]     [ 6.49 ]  =  [ 1.74 ]
    [  4.95  -1.39  10    ]     [ 8.34 ]     [ 0.73 ]

So the best order-2 polynomial is y = 0.68x^2 + 1.74x + 0.73.

[Figure: order-2 fit plotted over the data. Is this a better fit to the data?]

Compared to y = 1.60x + 1.05 for the order-1 polynomial:

w = (XTX)-1 XTY
  = [  4.95  -1.39 ]^-1  [ 6.49 ]     [ 1.60 ]
    [ -1.39  10    ]     [ 8.34 ]  =  [ 1.05 ]

So the best fit line is y = 1.60x + 1.05.

[Figure: data and the line y = 1.60x + 1.05.]

[From COMP-652, Lecture 1, September 6, 2012]

William L. Hamilton, McGill University and Mila 27
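A NumPy sketch that rebuilds the quadratic design matrix from the ten (x, y) pairs above and solves the closed form. Because the slide's matrices were rounded to two decimals, the recovered weights should come out close to (0.68, 1.74, 0.73), not exactly equal.

```python
import numpy as np

# The ten training points from the quadratic-regression example.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

def poly_design(x, d):
    """Design matrix with columns [x^d, ..., x^2, x, 1], matching the slide's layout."""
    return np.column_stack([x ** p for p in range(d, -1, -1)])

X2 = poly_design(x, 2)
w2 = np.linalg.solve(X2.T @ X2, X2.T @ y)   # approx. [0.68, 1.74, 0.73]
print("order-2:", w2)

X1 = poly_design(x, 1)
w1 = np.linalg.solve(X1.T @ X1, X1.T @ y)   # approx. [1.60, 1.05]
print("order-1:", w1)
```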


Order-3 fit: Is this better?

William L. Hamilton, McGill University and Mila 28


Order-4 fit

William L. Hamilton, McGill University and Mila 29


Order-5 fit

William L. Hamilton, McGill University and Mila 30


Order-6 fit

William L. Hamilton, McGill University and Mila 31


Order-7 fit

William L. Hamilton, McGill University and Mila 32


Order-8 fit

William L. Hamilton, McGill University and Mila 33


Order-9 fit

PROBLEM!

William L. Hamilton, McGill University and Mila 34


This is overfitting!

§ We can find a hypothesis that explains the training data perfectly, but does not generalize well to new data.

§ In this example: we have a lot of parameters (weights), so the model matches the data points exactly, but is wild everywhere else.

§ A very important problem in machine learning.

William L. Hamilton, McGill University and Mila 35


Overfitting

§ Every hypothesis has a true error measured on all possible data items we could ever encounter (e.g. fw(xi) - yi).

§ Since we don’t have all possible data, in order to decide what is a good hypothesis, we measure error over the training set.

§ Formally: Suppose we compare hypotheses f1 and f2.
§ Assume f1 has lower error on the training set.

§ If f2 has lower true error, then our algorithm is overfitting.

William L. Hamilton, McGill University and Mila 36


Generalization

§ In machine learning our goal is to develop models that generalize.

§ With lots of parameters it is easy to perfectly fit a training set!
§ For example, if my model has more parameters than there are training points, then the model can just memorize the training data!

§ Performing well on previously unseen data (i.e., generalization) is the real challenge.

§ But how do we evaluate generalization?

William L. Hamilton, McGill University and Mila 37


Overfitting

William L. Hamilton, McGill University and Mila 38

Which hypothesis has the lowest true error?


Evaluating on held out data

William L. Hamilton, McGill University and Mila 39

§ Partition your data into a training set, validation set, and test set.

§ The proportions in each set can vary.

§ Training set is used to fit a model (find the best hypothesis in the class).

§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes. (E.g., compare different order polynomials).

§ Test set is what you report the final accuracy on.
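A minimal sketch (proportions chosen arbitrarily) of such a three-way partition with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))          # made-up data
y = rng.normal(size=n)

# Shuffle the indices, then carve out e.g. 70% train, 15% validation, 15% test.
idx = rng.permutation(n)
n_train, n_val = int(0.7 * n), int(0.15 * n)
train_idx = idx[:n_train]
val_idx   = idx[n_train:n_train + n_val]
test_idx  = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]   # used to fit the weights
X_val,   y_val   = X[val_idx],   y[val_idx]     # used to compare models / hyperparameters
X_test,  y_test  = X[test_idx],  y[test_idx]    # used once, to report final accuracy
```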


Evaluating on held out data

William L. Hamilton, McGill University and Mila 40

[The slide repeats the polynomial-regression recipe and the quadratic-regression data matrices X and Y from the earlier example, with the ten rows colour-coded into a training set, a validation set, and a test set; the exact row assignments are not recoverable from this transcript.]


Validation set vs. test set

William L. Hamilton, McGill University and Mila 41

§ Validation set is used to compare different models during development.

§ Compare hypothesis/model classes. E.g., should I use a first- or second-order polynomial fit?

§ Compare hyperparameters (i.e., a parameter that is not learned but that could impact performance). E.g., what learning rate should I use?

§ Test set is used to evaluate the final performance of your model (e.g., to compare between research groups).
§ Why separate them? If you try too many models or hyperparameter settings, you might start to overfit the validation set! (A kind of "meta"-overfitting.)


k-fold cross validation

§ Instead of just one validation set, we can evaluate on many splits!

§ Consider k partitions of the training/non-test data (usually of equal size).

§ Train with k-1 subsets, validate on kth subset. Repeat k times.

§ Average the prediction error over the k rounds/folds.

William L. Hamilton, McGill University and Mila 42

Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn

(increases computation time by a factor of k)
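A sketch of k-fold cross-validation for choosing the polynomial order, written in plain NumPy (the same fold logic could also be done with a library such as scikit-learn); the data and settings are made up.

```python
import numpy as np

def poly_design(x, d):
    return np.column_stack([x ** p for p in range(d, -1, -1)])

def kfold_error(x, y, degree, k=5, seed=0):
    """Average validation squared error over k folds for a degree-d polynomial fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.pinv(poly_design(x[train], degree)) @ y[train]   # train on k-1 folds
        pred = poly_design(x[val], degree) @ w
        errors.append(np.mean((y[val] - pred) ** 2))                   # validate on the held-out fold
    return np.mean(errors)                                             # average over the k rounds

# Made-up 1D regression data.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=40)
y = 0.7 * x**2 + 1.7 * x + 0.7 + 0.3 * rng.normal(size=40)

for d in range(1, 6):
    print(d, kfold_error(x, y, d))   # choose the degree with the lowest average validation error
```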


Leave-one-out cross validation

§ Let k = n, the size of the training set.
§ For each model / hyperparameter setting:
§ Repeat n times:
§ Set aside one instance <xi, yi> from the training set.
§ Use all other data points to find w (optimization).
§ Measure prediction error on the held-out <xi, yi>.
§ Average the prediction error over all n subsets.

§ Choose the setting with lowest estimated true prediction error.

William L. Hamilton, McGill University and Mila 43
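Leave-one-out is just the k = n special case; a direct sketch (same made-up polynomial setup as in the k-fold example above):

```python
import numpy as np

def loo_error(x, y, degree):
    """Leave-one-out CV: hold out each point in turn, fit on the rest, average the errors."""
    n = len(x)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i                       # all points except <x_i, y_i>
        X_tr = np.column_stack([x[keep] ** p for p in range(degree, -1, -1)])
        w = np.linalg.pinv(X_tr) @ y[keep]             # fit w on the other n-1 points
        x_i = np.array([x[i] ** p for p in range(degree, -1, -1)])
        errs.append((y[i] - x_i @ w) ** 2)             # prediction error on the held-out point
    return np.mean(errs)                               # average over all n held-out points

# e.g., choose the polynomial degree with the lowest estimated true error:
# best_d = min(range(1, 6), key=lambda d: loo_error(x, y, d))
```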


Generalization: test vs. train error

William L. Hamilton, McGill University and Mila 44

[Figure 2.11 from Hastie et al., The Elements of Statistical Learning: test and training error as a function of model complexity. Training error decreases steadily as model complexity increases, while test error first decreases and then rises again once the model fits the training data too closely; low-complexity models have high bias and low variance, high-complexity models have low bias and high variance.]

[From Hastie et al. textbook]

§ Overly simple model: high training error and high test error.
§ Overly complex model: low training error but high test error.
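A sketch (data and noise level made up) that traces the typical curves in the figure: as the polynomial order grows, training error keeps falling while held-out error eventually turns back up.

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(-1, 1, size=30)
y_train = np.sin(3 * x_train) + 0.3 * rng.normal(size=30)   # some nonlinear target plus noise
x_test = rng.uniform(-1, 1, size=200)
y_test = np.sin(3 * x_test) + 0.3 * rng.normal(size=200)

def design(x, d):
    return np.column_stack([x ** p for p in range(d, -1, -1)])

for d in range(1, 10):
    w = np.linalg.pinv(design(x_train, d)) @ y_train
    train_err = np.mean((y_train - design(x_train, d) @ w) ** 2)
    test_err = np.mean((y_test - design(x_test, d) @ w) ** 2)
    # training error decreases with d; test error typically rises again once the model overfits
    print(d, round(train_err, 3), round(test_err, 3))
```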


Traditional statistics vs. machine learning

William L. Hamilton, McGill University and Mila 45

§ Traditional statistics: how can we accurately estimate the effect of x on y?

§ Machine learning: how can we design a function that predicts y from x and generalizes to unseen data?


Recap: Evaluating on held out data

William L. Hamilton, McGill University and Mila 46

§ Partition your data into a training set, validation set, and test set.

§ The proportions in each set can vary.

§ Training set is used to fit a model (find the best hypothesis in the class).

§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes. (E.g., compare different order polynomials).

§ Test set is what you report the final accuracy on.


What you should know

§ Linear regression (hypothesis class, error function, algorithm).

§ How linear regression can fail

§ Gradient descent method (algorithm, properties).

§ Train vs. validation vs. test sets.

§ The notion of generalization.

William L. Hamilton, McGill University and Mila 47