Over-fitting and Regularization
Chapter 4 textbook
Lectures 11 and 12 on amlbook.com
Over-fitting is easy to recognize in 1D
Parabolic target function
4th-order hypothesis
5 data points -> Ein = 0
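A minimal numpy sketch of this situation; the parabolic target, noise level, and sample locations below are illustrative stand-ins, not the slide's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parabolic target with a little noise
f = lambda x: x**2
x_train = rng.uniform(-1, 1, 5)
y_train = f(x_train) + 0.1 * rng.standard_normal(5)

# 4th-order hypothesis: 5 coefficients interpolate 5 points exactly -> Ein = 0
w = np.polyfit(x_train, y_train, deg=4)
E_in = np.mean((np.polyval(w, x_train) - y_train) ** 2)

# Out-of-sample error on fresh points from the same target
x_test = rng.uniform(-1, 1, 1000)
E_out = np.mean((np.polyval(w, x_test) - f(x_test)) ** 2)

print(E_in, E_out)   # Ein is ~0, Eout is typically much larger
```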
The origin of over-fitting can be analyzed in 1D: the bias/variance dilemma. How does this apply to the case on the previous slide?
The shape of the fit is very sensitive to noise in the data. Out-of-sample error will vary greatly from one dataset to another.
Over-fitting is easy to avoid in 1D: results from HW1
[Figure: sum of squared deviations vs. degree of polynomial, comparing Eval and Ein]
Using Eval to avoid over-fitting works in all dimensions, but computation grows rapidly for large d
[Figure: Ein, Ecv, and Eval vs. number of terms included]
Digit recognition: one vs. not-one
d = 2 (intensity and symmetry)
Terms in F5(x) added successively
500 points in the training set
Validation set needs to be large; 8798 in this case
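A minimal numpy sketch of the model-selection procedure this slide describes: fit models of increasing complexity on the training set and pick the one with the smallest Eval. The synthetic data, 1D polynomial family, and split sizes below are placeholders (the slide's actual experiment uses the digits data with terms of F5(x) and a validation set of 8798 points):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data
x = rng.uniform(-1, 1, 600)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Split: 500 points for training, the rest held out for validation
x_tr, y_tr = x[:500], y[:500]
x_va, y_va = x[500:], y[500:]

best_deg, best_eval = None, np.inf
for deg in range(1, 11):                                  # candidate model complexities
    w = np.polyfit(x_tr, y_tr, deg)
    e_val = np.mean((np.polyval(w, x_va) - y_va) ** 2)    # Eval for this model
    if e_val < best_eval:
        best_deg, best_eval = deg, e_val

print(best_deg, best_eval)                                # complexity chosen by Eval
```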
What if we want to add higher order terms to a linear model but don’t have enough data for a validation set?
Solution: Augment the error function used to optimize weights
Example: an augmented error that penalizes choices with large |w|. Called "weight decay".
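The slide's equation is not in the transcript; assuming the textbook's standard weight-decay form, the augmented error is

```latex
E_{\text{aug}}(\mathbf{w}) \;=\; E_{\text{in}}(\mathbf{w}) \;+\; \frac{\lambda}{N}\,\mathbf{w}^{\mathsf T}\mathbf{w}
```

where λ ≥ 0 controls how strongly large weights are penalized.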
Normal equations with weight decay are essentially unchanged:
(ZᵀZ + λI) w_reg = Zᵀy
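A minimal numpy sketch of solving these regularized normal equations; Z, y, and λ below are placeholders:

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Solve (Z^T Z + lambda*I) w_reg = Z^T y for the regularized weights."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# Toy usage with made-up transformed data Z and targets y
rng = np.random.default_rng(2)
Z = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
w_reg = weight_decay_fit(Z, y, lam=1e-4)
```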
The best value of λ is subjective.
In this case λ = 0.0001 is large enough to suppress swings, but the data are still important in determining the optimum weights.
Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
Two classes are distinguished by a threshold value of a linear combination of d attributes. Explain how h(w|x) = sign(wᵀx) becomes a hypothesis set for linear binary classification.
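A minimal numpy sketch of one such hypothesis, using the usual convention of prepending x0 = 1 so that the threshold is absorbed into w; the weights and points below are illustrative:

```python
import numpy as np

def h(w, X):
    """Linear classifier: +1 on one side of the hyperplane w^T x = 0, -1 on the other."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend x0 = 1 (threshold term)
    return np.sign(X1 @ w)

# Each choice of w is one hypothesis; the set of all such w is the hypothesis set.
w = np.array([-0.5, 1.0, 2.0])                   # illustrative weights for d = 2 attributes
X = np.array([[0.3, 0.1], [-1.0, 0.2]])
print(h(w, X))
```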
More Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
We have used 1-step optimization in 4 ways:
• polynomial regression in 1D (curve fitting)
• multivariate linear regression
• extending linear models by transformation
• regularization by weight decay
Two of these are equivalent; which ones?
More Review for Quiz 2
Topics:
• linear models
• extending linear models by transformation
• dimensionality reduction
• over-fitting and regularization
1-step optimization requires the in-sample error to be the sum of squared residuals. Define the in-sample error for each of the following (for reference, see the sketch after this list):
• multivariate linear regression
• extending linear models by transformation
• regularization by weight decay
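For reference, one way to write these in the textbook's notation (X is the data matrix with a leading column of 1s, Z = Φ(X) the transformed data, λ the regularization parameter):

```latex
\begin{aligned}
\text{multivariate linear regression:}\quad
  & E_{\text{in}}(\mathbf{w}) = \tfrac{1}{N}\,\lVert X\mathbf{w}-\mathbf{y}\rVert^{2} \\
\text{after a transformation } \mathbf{z}=\Phi(\mathbf{x}):\quad
  & E_{\text{in}}(\mathbf{w}) = \tfrac{1}{N}\,\lVert Z\mathbf{w}-\mathbf{y}\rVert^{2} \\
\text{with weight decay:}\quad
  & E_{\text{aug}}(\mathbf{w}) = \tfrac{1}{N}\,\lVert Z\mathbf{w}-\mathbf{y}\rVert^{2}
      + \tfrac{\lambda}{N}\,\mathbf{w}^{\mathsf T}\mathbf{w}
\end{aligned}
```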
For multivariate linear regression
Derive the normal equations for extended linear regression with weight decay
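A brief sketch of the expected derivation, assuming the augmented error above with Z the transformed data matrix:

```latex
\begin{aligned}
E_{\text{aug}}(\mathbf{w})
  &= \tfrac{1}{N}\,(Z\mathbf{w}-\mathbf{y})^{\mathsf T}(Z\mathbf{w}-\mathbf{y})
     + \tfrac{\lambda}{N}\,\mathbf{w}^{\mathsf T}\mathbf{w} \\
\nabla_{\mathbf{w}} E_{\text{aug}}
  &= \tfrac{2}{N}\,\bigl(Z^{\mathsf T}Z\,\mathbf{w} - Z^{\mathsf T}\mathbf{y} + \lambda\,\mathbf{w}\bigr) = \mathbf{0}
  \;\;\Longrightarrow\;\;
  (Z^{\mathsf T}Z + \lambda I)\,\mathbf{w}_{\text{reg}} = Z^{\mathsf T}\mathbf{y}
\end{aligned}
```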
Interpret the “learning curve” for multivariate linear regression when training data has normally distributed noise
• Why does Eout approach σ² from above?
• Why does Ein approach σ² from below?
• Why is Ein not defined for N < d+1?
What do these learning curves say about simple vs. complex models?
Still larger than the bound set by noise
How do we estimate a good level of complexity without sacrificing training data?
Why choose 3 rather than 4?
Review: Maximum Likelihood Estimation
• Estimate the parameters θ of a probability distribution given a sample X drawn from that distribution
Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e © The MIT Press (V1.0)
Form the likelihood function
• Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
• Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
• Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X), the value of θ that maximizes L(θ|X)
How was MLE used in logistic regression to derive an expression for in-sample error?
In logistic regression, the parameters are the weights w
Likelihood of w given the sample: l(w|X) = p(X|w) = ∏_n p(y_n|x_n, w)
Log likelihood: L(w|X) = log l(w|X) = ∑_n log p(y_n|x_n, w)
In logistic regression, p(y_n|x_n, w) = θ(y_n wᵀx_n), where θ(s) = 1/(1 + e^(-s)) is the logistic function
Since log is a monotone increasing function, maximizing the log-likelihood is equivalent to minimizing -log(likelihood). The text also normalizes by dividing by N; hence the error function becomes
Ein(w) = (1/N) ∑_n ln(1 + exp(-y_n wᵀx_n))
How?
Derive the log-likelihood function for a 1D Gaussian distribution
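A brief sketch of the expected answer, for a sample X = {x^t}, t = 1..N, drawn i.i.d. from N(μ, σ²):

```latex
\begin{aligned}
l(\mu,\sigma^{2}\mid X) &= \prod_{t=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}
   \exp\!\Bigl(-\frac{(x^{t}-\mu)^{2}}{2\sigma^{2}}\Bigr) \\
L(\mu,\sigma^{2}\mid X) &= \log l(\mu,\sigma^{2}\mid X)
   = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^{2}}\sum_{t=1}^{N}(x^{t}-\mu)^{2}
\end{aligned}
```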
Given e_in(h(x_n), y_n) = ln(1 + exp(-y_n wᵀx_n)),
derive ∇e_in = -y_n x_n / (1 + exp(y_n wᵀx_n))
Stochastic gradient descent: correct the weights using the error from each data point.
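A minimal numpy sketch of one stochastic-gradient-descent pass using exactly this per-point gradient; the learning rate, data, and initialization are placeholders:

```python
import numpy as np

def sgd_epoch(w, X, y, eta=0.1):
    """One SGD pass for logistic regression.

    Per-point gradient: grad e_in = -y_n * x_n / (1 + exp(y_n * w^T x_n))
    """
    for n in np.random.permutation(len(y)):
        grad = -y[n] * X[n] / (1.0 + np.exp(y[n] * (w @ X[n])))
        w = w - eta * grad                      # step against the gradient
    return w

# Toy usage with made-up linearly separable data, labels y in {-1, +1}
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
w = np.zeros(3)
for _ in range(50):
    w = sgd_epoch(w, X, y)
```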
I want to perform PCA on a dataset. What must I assume about the noise in the data?
PCA
The correlation coefficients of normally distributed attributes x are zero. What can we say about the covariance of x?
More PCA
Attributes x are normally distributed with mean m and covariance S.
z = Mx is a linear transformation to feature space defined by matrix M.
What are the mean and covariance of these features?
More PCA
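A brief sketch of the expected answer, keeping the slide's symbols m and S for the mean and covariance of x:

```latex
\mathbb{E}[\mathbf{z}] = \mathbb{E}[M\mathbf{x}] = M\mathbf{m},
\qquad
\operatorname{Cov}(\mathbf{z})
  = \mathbb{E}\bigl[(M\mathbf{x}-M\mathbf{m})(M\mathbf{x}-M\mathbf{m})^{\mathsf T}\bigr]
  = M\,S\,M^{\mathsf T}
```

so the features are again normally distributed, with mean Mm and covariance MSMᵀ.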
z_k is the feature defined by the projection of the attributes in the direction of the eigenvector w_k of the covariance matrix.
Prove that the eigenvalue λ_k is the variance of z_k.
More PCA
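A brief sketch of the expected proof, with S the covariance matrix and w_k a unit-length eigenvector (S w_k = λ_k w_k, w_kᵀ w_k = 1):

```latex
\operatorname{Var}(z_k)
  = \operatorname{Var}(\mathbf{w}_k^{\mathsf T}\mathbf{x})
  = \mathbf{w}_k^{\mathsf T} S\,\mathbf{w}_k
  = \mathbf{w}_k^{\mathsf T}(\lambda_k \mathbf{w}_k)
  = \lambda_k\,\mathbf{w}_k^{\mathsf T}\mathbf{w}_k
  = \lambda_k
```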
How do we find values of x1 and x2 that minimize f(x1, x2) subject to the constraint g(x1, x2) = c?
Constrained optimization
Find the stationary points of f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 = 1
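A brief sketch of the Lagrange-multiplier solution for this example:

```latex
\begin{aligned}
\mathcal{L}(x_1,x_2,\lambda) &= 1 - x_1^{2} - x_2^{2} - \lambda\,(x_1 + x_2 - 1) \\
\frac{\partial\mathcal{L}}{\partial x_1} = -2x_1 - \lambda = 0,\qquad
\frac{\partial\mathcal{L}}{\partial x_2} &= -2x_2 - \lambda = 0
  \;\;\Rightarrow\;\; x_1 = x_2 \\
x_1 + x_2 = 1 &\;\;\Rightarrow\;\; x_1 = x_2 = \tfrac12,\qquad \lambda = -1
\end{aligned}
```

The single stationary point is (1/2, 1/2), where f = 1/2.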