TRANSCRIPT
COMP 551 - Applied Machine Learning
Lecture 3 --- Linear Regression (cont.)
William L. Hamilton (with slides and content from Joelle Pineau)
* Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor’s written permission.
William L. Hamilton, McGill University and Mila 1
Self-assessment/practice quizzes
§ Quiz 0 – Attempt 1 is out and due tonight at 11:59pm.
§ Quiz 0 – Attempt 2 will be out Thursday at 12am.
§ You have one hour to complete it once you begin.
§ This week’s quizzes do not count towards your grade. They are for self-assessment and practice.
§ If you are really struggling or confused by the quizzes this week, it might be worthwhile to wait and take this course in a later semester.
§ (Ideally you should be able to get 100% on your second attempt)
Recall: Linear regression
§ The linear regression problem: fw(x) = w0 + ∑j=1:m wj xj, where m = the dimension of the observation space, i.e., the number of features.
§ Goal: Find the best linear model (i.e., weights) given the data.
§ Many different possible evaluation criteria!
§ Most common choice is to find the w that minimizes:
Err(w) = ∑i=1:n (yi − wTxi)^2
(A note on notation: Here w and x are vectors of size m+1.)
Is linear regression always easy to solve?
§ Recall the least-square solution:
ŵ = (XTX)^-1 XTY
§ Assuming for now that X is reasonably small and that we can find a closed-form solution in poly-time.
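As a concrete sketch of what this closed-form solution looks like in code (the data below is invented for illustration, not from the slides):

```python
import numpy as np

# Toy data: n = 5 examples, one feature, plus a column of ones for w0.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])    # exactly y = 2x + 1

X = np.column_stack([np.ones_like(x), x])  # shape (n, m+1)

# Closed-form least-squares solution w = (X^T X)^{-1} X^T Y.
# Solving the linear system is preferred over forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

err = np.sum((y - X @ w) ** 2)             # Err(w) from the previous slide
```

Here `w` recovers the intercept and slope (1.0, 2.0), and `err` is essentially zero, since the toy data is exactly linear.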
§ Does this always work (well)?
§ Two main failure modes:
§ 1) (XTX) is singular, so the inverse is undefined.
§ 2) A simple linear function is a bad fit… (e.g., figure to the right)
Failure mode 1: Avoiding singularities
§ Recall the least-square solution: ŵ = (XTX)^-1 XTY
§ To have a unique solution, we need XTX to be nonsingular.
§ That means X must have full column rank (i.e., no feature is an exact linear combination of the others).
Exercise: What if X does not have full column rank? When would this happen?
Design an example. Try to solve it.
Dealing with difficult cases of (XTX)^-1
§ Case #1: The weights are not uniquely defined.
§ Example: One feature is a linear function of the other, e.g., Xm = 2Xm-1 + 2.
§ Solution: Re-code or drop some redundant columns of X.
§ Case #2: The number of features/weights (m) exceeds the number of training examples (n).
§ Solution: Reduce the number of features using various techniques (to be studied later).
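A minimal sketch of Case #1 (numbers invented): the redundant column Xm = 2Xm-1 + 2 makes X rank-deficient, so XTX is singular; dropping the redundant column restores a unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=8)
x2 = 2 * x1 + 2        # one feature is a linear function of the other (Case #1)
X = np.column_stack([np.ones(8), x1, x2])
y = rng.normal(size=8)

# X has 3 columns but only rank 2, so X^T X is singular (inverse undefined).
rank = np.linalg.matrix_rank(X)

# Solution: drop the redundant column; the normal equations become solvable.
X_fixed = X[:, :2]
w = np.linalg.solve(X_fixed.T @ X_fixed, X_fixed.T @ y)
```

Note that the intercept column matters here: x2 depends on x1 plus a constant, so the dependence only shows up once the column of ones is included.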
Failure mode 2: When linear is a bad fit
§ This function looks complicated, and a linear hypothesis does not seem very good.
§ What should we do?
§ Pick a better function?
§ Use more features?
§ Get more data?
Input variables for linear regression
§ Original quantitative variables X1, …, Xm
§ Transformations of variables, e.g., Xm+1 = log(Xi)
§ Basis expansions, e.g., Xm+1 = Xi^2, Xm+2 = Xi^3, …
§ Interaction terms, e.g., Xm+1 = Xi Xj
§ Numeric coding of qualitative variables, e.g., Xm+1 = 1 if Xi is true and 0 otherwise.
In all cases, we can add Xm+1,…,Xm+k to the list of original variables and perform the linear regression.
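For instance, a design matrix with several of these derived inputs can be assembled directly (feature values invented for illustration); the model remains linear in w even though it is nonlinear in the original inputs:

```python
import numpy as np

# Two original quantitative variables (toy values).
X1 = np.array([1.0, 2.0, 3.0])
X2 = np.array([0.5, 0.25, 4.0])

X = np.column_stack([
    np.ones_like(X1),   # intercept term
    X1, X2,             # original quantitative variables
    np.log(X2),         # transformation of a variable
    X1 ** 2, X1 ** 3,   # basis expansion
    X1 * X2,            # interaction term
])
```

The same least-squares machinery then applies unchanged to this expanded X.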
Variable coding is not always straightforward!
§ Suppose we have a categorical variable with four possible values: Xcat = {blue, red, green, yellow}. How do we encode this?
§ Solution 1: Four separate binary variables (e.g., Xblue = 1 if Xcat = blue and Xblue = 0 otherwise).
§ Careful… this can lead to singularities, since these four variables will be perfectly correlated. I.e., if we know Xcat is not {red, green, yellow}, then it must be blue!
§ Solution 2: Three separate binary variables for {red, green, yellow}, with the blue category represented by the intercept.
§ This is a better solution!
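A sketch of Solution 2 (category values from the slide; the data rows are invented): three binary columns for {red, green, yellow}, with blue absorbed into the intercept.

```python
import numpy as np

categories = ["red", "green", "yellow"]      # blue is the reference category
X_cat = ["blue", "red", "green", "blue", "yellow"]

# One binary indicator column per non-reference category.
dummies = np.array([[1.0 if c == cat else 0.0 for cat in categories]
                    for c in X_cat])
X = np.column_stack([np.ones(len(X_cat)), dummies])  # intercept + indicators
```

A blue example is encoded as the row [1, 0, 0, 0]: the intercept alone carries its effect, and no column is a linear combination of the others, so the singularity from Solution 1 is avoided.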
Variable coding is not always straightforward!
§ What about if our input is just raw text (or images)…?
§ We need to get creative! E.g.,
§ word counts
§ average word length
§ number of words per sentence
§ etc…
§ This is called feature design, and it will be a recurring theme throughout the course!
Ex: Linear regression with polynomial terms
Answer: Polynomial regression
• Given data: (x1, y1), (x2, y2), . . . , (xm, ym). (Note: these borrowed slides use m for the number of examples.)
• Suppose we want a degree-d polynomial fit.
• Let Y be as before and let

    X = [ x1^d … x1^2 x1 1
          x2^d … x2^2 x2 1
           ⋮          ⋮   ⋮
          xm^d … xm^2 xm 1 ]

• Solve the linear regression Xw ≈ Y.
COMP-652, Lecture 1 - September 6, 2012 44
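A design matrix with columns in this order, x^d, …, x^2, x, 1, is exactly what NumPy's `np.vander` produces. A small sketch using a few of the x-values from the example that follows:

```python
import numpy as np

x = np.array([0.86, 0.09, -0.85])   # a few sample x-values
d = 2                               # degree of the polynomial fit

# Columns in decreasing powers: x^2, x, 1 (matching the slide's X).
X = np.vander(x, d + 1)
```

Fitting the polynomial is then just ordinary linear regression on this X.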
Example of quadratic regression: Data matrices

fw(x) = w0 + w1 x + w2 x^2, so the columns of X are x^2, x, and 1:

    X = [ 0.75  0.86  1
          0.01  0.09  1
          0.73 -0.85  1
          0.76  0.87  1
          0.19 -0.44  1
          0.18 -0.43  1
          1.22 -1.10  1
          0.16  0.40  1
          0.93 -0.96  1
          0.03  0.17  1 ]

    Y = [ 2.49  0.83  -0.25  3.10  0.87  0.02  -0.12  1.81  -0.83  0.43 ]^T
Data matrices: XTX

XTX = (the 3×10 matrix X^T times the 10×3 matrix X)

    = [  4.11  -1.64   4.95
        -1.64   4.95  -1.39
         4.95  -1.39  10    ]
XTY

XTY = [ 3.60
        6.49
        8.34 ]
Solving the problem: Solving for w

w = (XTX)^-1 XTY
  = [  4.11  -1.64   4.95 ]^-1 [ 3.60 ]   [ 0.68 ]
    [ -1.64   4.95  -1.39 ]    [ 6.49 ] = [ 1.74 ]
    [  4.95  -1.39  10    ]    [ 8.34 ]   [ 0.73 ]

So the best order-2 polynomial is y = 0.68x^2 + 1.74x + 0.73.
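The worked example can be checked numerically. A sketch using the ten (x, y) pairs from the data matrices (rounded to two decimals, so the result matches the slide's weights only approximately):

```python
import numpy as np

# The ten (x, y) pairs from the quadratic regression example.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44,
              -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87,
              0.02, -0.12, 1.81, -0.83, 0.43])

X = np.vander(x, 3)                    # columns: x^2, x, 1
w = np.linalg.solve(X.T @ X, X.T @ y)  # w = (X^T X)^{-1} X^T Y
```

This recovers roughly w ≈ (0.68, 1.74, 0.73), i.e., y ≈ 0.68x^2 + 1.74x + 0.73 as on the slide.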
Order-2 fit

(Figure: the data with the order-2 polynomial fit. Is this a better fit to the data?)

Compared to y = 1.60x + 1.05 for the order-1 polynomial:

w = (XTX)^-1 XTY
  = [  4.95  -1.39 ]^-1 [ 6.49 ]   [ 1.60 ]
    [ -1.39  10    ]    [ 8.34 ] = [ 1.05 ]

So the best fit line is y = 1.60x + 1.05.

(Figure: the data and the line y = 1.60x + 1.05.)
Order-3 fit: Is this better?

Order-4 fit

Order-5 fit

Order-6 fit

Order-7 fit

Order-8 fit

Order-9 fit: PROBLEM!
This is overfitting!
§ We can find a hypothesis that perfectly explains the training data but does not generalize well to new data.
§ In this example: we have a lot of parameters (weights), so the model matches the data points exactly, but is wild everywhere else.
§ A very important problem in machine learning.
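This failure can be reproduced in a few lines. A sketch with synthetic data (a quadratic trend plus noise, invented for illustration): as the polynomial degree grows, the training error can only go down, while the held-out error eventually blows up.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 10)
x_test = np.linspace(-0.95, 0.95, 50)

def noisy_quadratic(x):
    # True trend plus Gaussian noise.
    return 0.7 * x**2 + 1.7 * x + 0.7 + 0.3 * rng.normal(size=len(x))

y_train = noisy_quadratic(x_train)
y_test = noisy_quadratic(x_test)

def fit_errors(d):
    """Fit a degree-d polynomial by least squares; return (train, test) MSE."""
    X = np.vander(x_train, d + 1)
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    train = np.mean((y_train - X @ w) ** 2)
    test = np.mean((y_test - np.vander(x_test, d + 1) @ w) ** 2)
    return train, test

train2, test2 = fit_errors(2)
train9, test9 = fit_errors(9)   # 10 parameters, 10 points: exact interpolation
```

The degree-9 fit drives the training error to essentially zero by interpolating all ten noisy points, but in runs like this its test error is typically far larger than the degree-2 model's.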
Overfitting
§ Every hypothesis has a true error measured on all possible data items we could ever encounter (e.g., fw(xi) − yi).
§ Since we don’t have all possible data, in order to decide what is a good hypothesis, we measure error over the training set.
§ Formally: Suppose we compare hypotheses f1 and f2.
§ Assume f1 has lower error on the training set.
§ If f2 has lower true error, then our algorithm is overfitting.
Generalization
§ In machine learning our goal is to develop models that generalize.
§ With lots of parameters it is easy to perfectly fit a training set!
§ For example, if my model has more parameters than there are training points, then the model can just memorize the training data!
§ Performing well on previously unseen data (i.e., generalization) is the real challenge.
§ But how do we evaluate generalization?
Overfitting
Which hypothesis has the lowest true error?
Evaluating on held out data
§ Partition your data into a training set, validation set, and test set.
§ The proportions in each set can vary.
§ Training set is used to fit a model (find the best hypothesis in the class).
§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes. (E.g., compare different-order polynomials.)
§ Test set is what you report the final accuracy on.
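A sketch of such a partition (the 70/15/15 proportions are an arbitrary choice for illustration), shuffling before splitting so each set is representative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                          # total number of examples
indices = rng.permutation(n)     # shuffle before splitting

train_idx = indices[:70]         # fit the model here
val_idx = indices[70:85]         # model selection / hyperparameters
test_idx = indices[85:]          # final reported accuracy only
```

Each example lands in exactly one of the three sets.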
(Figure: the quadratic regression data matrices from earlier, with rows split into a training set, a validation set, and a test set.)
Validation set vs. test set
§ Validation set is used to compare different models during development.
§ Compare hypothesis/model classes. E.g., should I use a first- or second-order polynomial fit?
§ Compare hyperparameters (i.e., parameters that are not learned but that could impact performance). E.g., what learning rate should I use?
§ Test set is used to evaluate the final performance of your model (e.g., to compare between research groups).
§ Why separate them? If you try too many models or hyperparameter settings, you might start to overfit the validation set! (A kind of “meta”-overfitting.)
k-fold cross validation
§ Instead of just one validation set, we can evaluate on many splits!
§ Consider k partitions of the training/non-test data (usually of equal size).
§ Train with k-1 subsets, validate on kth subset. Repeat k times.
§ Average the prediction error over the k rounds/folds.
Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn
(increases computation time by a factor of k)
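The steps above can be sketched as follows (synthetic linear data, invented for illustration; a real pipeline would shuffle the data before forming the folds):

```python
import numpy as np

def k_fold_cv_error(x, y, d, k=5):
    """Average validation MSE of a degree-d polynomial fit over k folds."""
    folds = np.array_split(np.arange(len(x)), k)   # k partitions of the data
    errs = []
    for i in range(k):
        val = folds[i]                             # validate on the i-th subset
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        X = np.vander(x[train], d + 1)
        w, *_ = np.linalg.lstsq(X, y[train], rcond=None)
        pred = np.vander(x[val], d + 1) @ w
        errs.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errs)                           # average over the k folds

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 1.6 * x + 1.05 + 0.1 * rng.normal(size=30)
cv_err = k_fold_cv_error(x, y, d=1, k=5)
```

Comparing `k_fold_cv_error` across candidate degrees d gives an estimate of which hypothesis class generalizes best.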
Leave-one-out cross validation
§ Let k = n, the size of the training set.
§ For each model / hyperparameter setting,
§ Repeat n times:
§ Set aside one instance <xi,yi>from the training set.
§ Use all other data points to find w (optimization).
§ Measure prediction error on the held-out <xi, yi>.
§ Average the prediction error over all n subsets.
§ Choose the setting with lowest estimated true prediction error.
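The same loop with k = n gives leave-one-out cross validation; a sketch (synthetic data again invented for illustration):

```python
import numpy as np

def loo_cv_error(x, y, d):
    """Leave-one-out CV: hold out each <xi, yi> in turn and average the error."""
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                  # all points except <xi, yi>
        X = np.vander(x[mask], d + 1)
        w, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        pred = np.vander(x[i:i + 1], d + 1) @ w   # predict the held-out point
        errs.append((y[i] - pred[0]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 20)
y = 1.6 * x + 1.05 + 0.1 * rng.normal(size=20)
err_line = loo_cv_error(x, y, d=1)   # candidate setting: an order-1 fit
```

Running this for each candidate degree and picking the lowest `loo_cv_error` implements the selection rule above, at the cost of n model fits per setting.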
Generalization: test vs. train error
(Figure 2.11 from Hastie et al.: test and training error as a function of model complexity. x-axis: model complexity, low to high; y-axis: prediction error; curves for the training sample and the test sample. Low complexity: high bias, low variance. High complexity: low bias, high variance.)
[From Hastie et al. textbook]
§ Overly simple model: high training error and high test error.
§ Overly complex model: low training error but high test error.
Traditional statistics vs. machine learning
§ Traditional statistics: how can we accurately estimate the effect of x on y?
§ Machine learning: how can we design a function that predicts y from x and generalizes to unseen data?
Recap: Evaluating on held out data
§ Partition your data into a training set, validation set, and test set.
§ The proportions in each set can vary.
§ Training set is used to fit a model (find the best hypothesis in the class).
§ Validation set is used for model selection, i.e., to estimate true error and compare hypothesis classes. (E.g., compare different-order polynomials.)
§ Test set is what you report the final accuracy on.
What you should know
§ Linear regression (hypothesis class, error function, algorithm).
§ How linear regression can fail.
§ Gradient descent method (algorithm, properties).
§ Train vs. validation vs. test sets.
§ The notion of generalization.