Regularization - Virginia Tech (jbhuang/teaching/ECE5424-CS5824/sp19)
TRANSCRIPT
Regularization
Jia-Bin Huang
Virginia Tech Spring 2019ECE-5424G / CS-5824
Administrative
• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center
• Welcome, 3:30 - 3:40, Assembly hall
• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
• Career Panel, 4:05 - 5:00, Assembly hall
• Break, 5:00 - 5:20, Grand hall
• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
• Dinner with breakout discussion groups, 5:45 - 7:00, Museum
• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room
k-NN (Classification/Regression)
• Model: the training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$
• Cost function: None
• Learning: Do nothing
• Inference: $\hat{y} = h(x_{\text{test}}) = y^{(k)}$, where $k = \arg\min_i D(x_{\text{test}}, x^{(i)})$
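The inference rule (here with $k = 1$) fits in a few lines; this is an illustrative sketch on toy data, with Euclidean distance standing in for $D$:

```python
import numpy as np

# 1-NN inference sketch: "learning" just stores the data; all work happens
# at inference time (toy data; Euclidean distance stands in for D).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
y_train = np.array([0, 0, 1])

def knn_predict(x_test, X, y):
    # k = argmin_i D(x_test, x^(i)); predict y^(k)
    dists = np.linalg.norm(X - x_test, axis=1)
    return y[np.argmin(dists)]

pred = knn_predict(np.array([3.5, 3.9]), X_train, y_train)
```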
Linear regression (Regression)
• Model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning:
  1) Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference: $\hat{y} = h_\theta(x_{\text{test}}) = \theta^\top x_{\text{test}}$
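Both learning routes can be sketched on a tiny synthetic problem; the data, learning rate, and iteration count below are illustrative assumptions:

```python
import numpy as np

# Fit h_theta(x) = theta_0 + theta_1 * x two ways (synthetic, noise-free data).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.uniform(0.0, 10.0, (50, 1))])  # intercept + feature
theta_true = np.array([2.0, 3.0])
y = X @ theta_true

# 1) Gradient descent: theta_j := theta_j - alpha * (1/m) * sum_i (h - y) * x_j
theta = np.zeros(2)
alpha, m = 0.02, len(y)
for _ in range(5000):
    theta -= alpha * (1.0 / m) * (X.T @ (X @ theta - y))

# 2) Normal equation: theta = (X^T X)^{-1} X^T y (solve, rather than an explicit inverse)
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

Both routes should agree on this problem; `np.linalg.solve` is preferred over forming the inverse explicitly for numerical stability.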
Naïve Bayes (Classification)
• Model: $h_\theta(x) = P(Y \mid X_1, X_2, \cdots, X_n) \propto P(Y) \prod_i P(X_i \mid Y)$
• Cost function: Maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$; Maximum a posteriori estimation: $J(\theta) = -\log P(\text{Data} \mid \theta) P(\theta)$
• Learning: $\pi_k = P(Y = y_k)$
  (Discrete $X_i$) $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
  (Continuous $X_i$) mean $\mu_{ik}$, variance $\sigma_{ik}^2$, $P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik}^2)$
• Inference: $Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\text{test}} \mid Y = y_k)$
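The learning/inference recipe above can be sketched for one continuous feature and two classes; the toy data below is illustrative:

```python
import numpy as np

# Minimal Gaussian naive Bayes: estimate priors and per-class mean/variance,
# then predict argmax_k P(Y=y_k) * P(X | Y=y_k) (toy data).
X0 = np.array([1.0, 1.2, 0.8])   # feature values observed for class 0
X1 = np.array([3.0, 3.2, 2.8])   # feature values observed for class 1

def gauss(x, mu, var):
    # N(x | mu, sigma^2)
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Learning: class priors pi_k and per-class Gaussian parameters
pi = np.array([0.5, 0.5])
mu = np.array([X0.mean(), X1.mean()])
var = np.array([X0.var(), X1.var()])

def predict(x):
    # Inference: argmax_k P(Y = y_k) * P(X = x | Y = y_k)
    return int(np.argmax([pi[k] * gauss(x, mu[k], var[k]) for k in range(2)]))
```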
Logistic regression (Classification)
• Model: $h_\theta(x) = P(Y = 1 \mid X_1, X_2, \cdots, X_n) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})$, where
  $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Learning: Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Inference: $\hat{y} = h_\theta(x_{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x_{\text{test}}}}$
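The hypothesis and per-example cost translate directly to code; the values below are toy examples:

```python
import numpy as np

# Sigmoid hypothesis and per-example logistic cost from the slide.
def h(theta, x):
    # h_theta(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def cost(h_x, y):
    # -log h(x) if y = 1, -log(1 - h(x)) if y = 0
    return -np.log(h_x) if y == 1 else -np.log(1.0 - h_x)

theta = np.array([0.0, 1.0])
p = h(theta, np.array([1.0, 0.0]))  # theta^T x = 0, so the model outputs 0.5
```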
Logistic Regression
• Hypothesis representation: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function: $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Logistic regression with gradient descent: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Regularization
• Multi-class classification
How about MAP?
• Maximum conditional likelihood estimate (MCLE): $\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
• Maximum conditional a posteriori estimate (MCAP): $\theta_{\text{MCAP}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) \, P(\theta)$
Prior $P(\theta)$
• Common choice of $P(\theta)$: Normal distribution, zero mean, identity covariance
• "Pushes" parameters towards zero
• Corresponds to regularization; helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE):
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Maximum conditional a posteriori estimate (MCAP):
$\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
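The two updates differ only by the shrinkage term $-\alpha \lambda \theta_j$; a one-step numerical sketch (toy data and constants are illustrative):

```python
import numpy as np

# One MCLE step vs. one MCAP step for logistic regression: the MCAP update
# adds the extra -alpha*lambda*theta_j shrinkage term (toy data).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.5, -0.5])
alpha, lam, m = 0.1, 0.8, len(y)

grad = (1.0 / m) * X.T @ (sigmoid(X @ theta) - y)
theta_mcle = theta - alpha * grad
theta_mcap = theta - alpha * lam * theta - alpha * grad  # extra shrinkage toward 0
```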
Logistic Regression
•Hypothesis representation
•Cost function
• Logistic regression with gradient descent
•Regularization
•Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
Binary classification vs. multiclass classification
[Figure: scatter plots in the $(x_1, x_2)$ feature plane: two classes on the left, three classes on the right]
One-vs-all (one-vs-rest)
[Figure: the three-class dataset in the $(x_1, x_2)$ plane is split into three binary problems; each classifier $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$ separates one class from the rest]

$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \quad (i = 1, 2, 3)$
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
Slide credit: Andrew Ng
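The recipe above can be sketched end to end; the data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

# One-vs-rest sketch: fit one logistic regression classifier per class, then
# predict the argmax of the per-class probabilities (toy 1-D data).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 4.0], [1.0, 5.0], [1.0, 9.0], [1.0, 10.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

thetas = []
for c in range(3):
    yc = (labels == c).astype(float)          # class c vs. rest
    theta = np.zeros(2)
    for _ in range(5000):                      # plain gradient descent
        theta -= 0.1 * X.T @ (sigmoid(X @ theta) - yc) / len(yc)
    thetas.append(theta)

def predict(x):
    # pick the class i that maximizes h_theta^(i)(x)
    return int(np.argmax([sigmoid(t @ x) for t in thetas]))
```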
Generative Approach (Ex: Naïve Bayes)
• Estimate $P(Y)$ and $P(X \mid Y)$
• Prediction: $\hat{y} = \arg\max_y P(Y = y) P(X = x \mid Y = y)$

Discriminative Approach (Ex: Logistic regression)
• Estimate $P(Y \mid X)$ directly (or a discriminant function, e.g., SVM)
• Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$
Further readings
• Tom M. Mitchell, Generative and discriminative classifiers: Naïve Bayes and Logistic Regression
  http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
  http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Example: Linear regression (price in $1000's vs. size in feet²)
• Underfitting: $h_\theta(x) = \theta_0 + \theta_1 x$
• Just right: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
• Overfitting: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$
Slide credit: Andrew Ng
Overfitting
• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well,

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0,$

but fail to generalize to new examples (e.g., predicting prices of new houses).
Slide credit: Andrew Ng
Example: Linear regression (price in $1000's vs. size in feet²)
• Underfitting (high bias): $h_\theta(x) = \theta_0 + \theta_1 x$
• Just right: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
• Overfitting (high variance): $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$
Slide credit: Andrew Ng
Bias-Variance Tradeoff
• Bias: difference between what you expect to learn and the truth
  • Measures how well you expect to represent the true solution
  • Decreases with more complex models
• Variance: difference between what you expect to learn and what you learn from a particular dataset
  • Measures how sensitive the learner is to a specific dataset
  • Increases with more complex models
[Figure: illustration of the four combinations of low/high bias and low/high variance]
Bias-variance decomposition
• Training set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$
• $y = f(x) + \varepsilon$
• We want $\hat{f}(x)$ that minimizes $E\!\left[ \big(y - \hat{f}(x)\big)^2 \right]$

$E\!\left[ \big(y - \hat{f}(x)\big)^2 \right] = \text{Bias}\!\left[ \hat{f}(x) \right]^2 + \text{Var}\!\left[ \hat{f}(x) \right] + \sigma^2$

$\text{Bias}\!\left[ \hat{f}(x) \right] = E\!\left[ \hat{f}(x) \right] - f(x)$

$\text{Var}\!\left[ \hat{f}(x) \right] = E\!\left[ \hat{f}(x)^2 \right] - E\!\left[ \hat{f}(x) \right]^2$
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
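The decomposition can be checked numerically at a fixed $x$; the estimator and all constants below are synthetic, chosen only to make the bias and variance visible:

```python
import numpy as np

# Monte Carlo check of E[(y - fhat)^2] = Bias^2 + Var + sigma^2 at a fixed x,
# using a deliberately biased, noisy estimator (all numbers synthetic).
rng = np.random.default_rng(1)
f_x, sigma = 2.0, 0.5                  # true f(x) and noise standard deviation
n = 200_000
fhat = 2.3 + 0.4 * rng.normal(size=n)  # estimator draws across hypothetical datasets
y = f_x + sigma * rng.normal(size=n)   # independent noisy targets y = f(x) + eps

mse = np.mean((y - fhat) ** 2)
bias2 = (np.mean(fhat) - f_x) ** 2     # Bias^2, approximately 0.3^2
var = np.var(fhat)                     # Var, approximately 0.4^2
# mse should be close to bias2 + var + sigma^2
```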
Overfitting (logistic regression: tumor size vs. age)
• Underfitting: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
• Just right: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$
• Overfitting: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$
Slide credit: Andrew Ng
Addressing overfitting (housing price vs. size example)
• $x_1$ = size of house
• $x_2$ = no. of bedrooms
• $x_3$ = no. of floors
• $x_4$ = age of house
• $x_5$ = average income in neighborhood
• $x_6$ = kitchen size
• ⋮
• $x_{100}$
Slide credit: Andrew Ng
Addressing overfitting
1. Reduce the number of features.
  • Manually select which features to keep.
  • Model selection algorithm (later in course).
2. Regularization.
  • Keep all the features, but reduce the magnitude/values of parameters $\theta_j$.
  • Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Slide credit: Andrew Ng
Overfitting Thriller
• https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Intuition
• Suppose we penalize and make $\theta_3$, $\theta_4$ really small:

$\min_\theta \ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$

• $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ versus $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
[Figure: price vs. size fits for the quadratic hypothesis and for the quartic hypothesis with $\theta_3, \theta_4$ penalized]
Slide credit: Andrew Ng
Regularization
• Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
  • "Simpler" hypothesis
  • Less prone to overfitting
• Housing:
  • Features: $x_1, x_2, \cdots, x_{100}$
  • Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$
Slide credit: Andrew Ng
Regularization

$\min_\theta \ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$

$\lambda$: Regularization parameter
[Figure: regularized fit of price vs. size]
Slide credit: Andrew Ng
Question

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$

What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
1. Algorithm works fine; setting $\lambda$ to be very large can't hurt it.
2. Algorithm fails to eliminate overfitting.
3. Algorithm results in underfitting (fails to fit even the training data well).
4. Gradient descent will fail to converge.
Slide credit: Andrew Ng
Question

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$

What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)? The penalty forces $\theta_1, \theta_2, \cdots, \theta_n \approx 0$, so $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x \approx \theta_0$: a flat line that underfits the price-vs.-size data.
Slide credit: Andrew Ng
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized linear regression

$\min_\theta \ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$

$n$: Number of features; $\theta_0$ is not penalized.
Slide credit: Andrew Ng
Gradient descent (Previously)
Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$  $(j = 0)$
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$  $(j = 1, 2, 3, \cdots, n)$
}
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}
Equivalently: $\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Slide credit: Andrew Ng
Comparison
• Regularized linear regression:
$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Un-regularized linear regression:
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• $1 - \alpha \frac{\lambda}{m} < 1$: Weight decay
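The penalized-gradient form and the weight-decay form are the same step algebraically; a quick numerical check (toy values):

```python
import numpy as np

# Verify the two equivalent forms of the regularized update agree.
rng = np.random.default_rng(2)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.normal(size=20)
theta = rng.normal(size=3)
alpha, lam, m = 0.05, 1.5, len(y)

grad_data = (1.0 / m) * X.T @ (X @ theta - y)
reg = (lam / m) * theta
reg[0] = 0.0                                   # theta_0 is not penalized

step_a = theta - alpha * (grad_data + reg)     # theta_j - alpha*[grad + (lam/m)*theta_j]
decay = np.full(3, 1.0 - alpha * lam / m)
decay[0] = 1.0                                 # no decay on the intercept
step_b = decay * theta - alpha * grad_data     # theta_j*(1 - alpha*lam/m) - alpha*grad
```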
Normal equation

$X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m$

$\min_\theta J(\theta)$:

$\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^\top y$

The regularization matrix is $(n+1) \times (n+1)$.
Slide credit: Andrew Ng
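The regularized normal equation can be sketched directly; the data and $\lambda$ are toy values, and the matrix `D` is the identity with its $(0,0)$ entry zeroed so the intercept is not penalized:

```python
import numpy as np

# Regularized normal equation: theta = (X^T X + lambda * D)^{-1} X^T y,
# solved with np.linalg.solve rather than an explicit inverse (toy data).
rng = np.random.default_rng(3)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
y = rng.normal(size=30)
lam = 2.0

D = np.eye(X.shape[1])
D[0, 0] = 0.0                       # do not penalize the intercept
theta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
```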
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized logistic regression
• Cost function:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

• Decision boundary example (tumor size vs. age):
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$
Slide credit: Andrew Ng
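The regularized cost translates directly; the data below is a toy example, and the intercept $\theta_0$ is excluded from the penalty:

```python
import numpy as np

# Regularized logistic cost: cross-entropy plus (lambda/(2m)) * sum_j theta_j^2,
# with theta_0 (first entry) excluded from the penalty (toy data).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    ce = -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return ce + (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)

X = np.array([[1.0, 2.0], [1.0, -2.0]])  # first column is the intercept feature
y = np.array([1.0, 0.0])
```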
Gradient descent (Regularized)
Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}
Same form as for linear regression, but now with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$; the bracketed term is $\frac{\partial}{\partial \theta_j} J(\theta)$.
Slide credit: Andrew Ng
$\lVert \theta \rVert_1$: Lasso regularization

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$

LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding

$\text{minimize}_\theta \ \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2 + \lambda |\theta|$

(assuming the predictor is normalized so that $\frac{1}{m} \sum_i (x^{(i)})^2 = 1$)

$\hat{\theta} = \begin{cases} \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle - \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle > \lambda \\ 0 & \text{if } \frac{1}{m} |\langle \boldsymbol{x}, \boldsymbol{y} \rangle| \leq \lambda \\ \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle + \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle < -\lambda \end{cases}$

$\hat{\theta} = S_\lambda\!\left( \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right)$, where the soft thresholding operator is $S_\lambda(x) = \text{sign}(x)\,(|x| - \lambda)_+$
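The operator and the closed-form solution are a few lines of code; the synthetic data is illustrative, and the predictor is normalized because the closed form requires $\frac{1}{m}\sum_i (x^{(i)})^2 = 1$:

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lambda(x) = sign(x) * max(|x| - lambda, 0)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Single-predictor lasso via the closed form (synthetic data).
rng = np.random.default_rng(4)
m = 100
x = rng.normal(size=m)
x /= np.sqrt(np.mean(x ** 2))           # enforce (1/m) <x, x> = 1
y = 0.7 * x + 0.1 * rng.normal(size=m)
lam = 0.2
theta_hat = soft_threshold(np.mean(x * y), lam)  # S_lambda((1/m) <x, y>)
```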
Multiple predictors: Cyclic Coordinate Descent

$\text{minimize}_{\theta_j} \ \frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j + \sum_{k \neq j} x_k^{(i)} \theta_k - y^{(i)} \right)^2 + \lambda \sum_{k \neq j} |\theta_k| + \lambda |\theta_j|$

For each $j$, update $\theta_j$ with

$\text{minimize}_{\theta_j} \ \frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j - r_j^{(i)} \right)^2 + \lambda |\theta_j|$

where $r_j^{(i)} = y^{(i)} - \sum_{k \neq j} x_k^{(i)} \theta_k$ is the partial residual.
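Combining the partial-residual update with soft thresholding gives a small coordinate-descent lasso solver; the data, $\lambda$, and sweep count below are illustrative, and columns are normalized so each one-dimensional subproblem is solved exactly by soft thresholding:

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_cd(X, y, lam, sweeps=200):
    # Cyclic coordinate descent for the lasso (assumes normalized columns).
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(sweeps):
        for j in range(n):
            # partial residual r_j^(i) = y^(i) - sum_{k != j} x_k^(i) theta_k
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(np.mean(X[:, j] * r_j), lam)
    return theta

rng = np.random.default_rng(5)
m = 200
X = rng.normal(size=(m, 3))
X /= np.sqrt(np.mean(X ** 2, axis=0))           # normalize each column
y = X @ np.array([1.5, 0.0, -2.0]) + 0.05 * rng.normal(size=m)
theta = lasso_cd(X, y, lam=0.1)
```

The middle coefficient, whose true value is zero, is driven to (or very near) zero, while the active coefficients are shrunk slightly toward zero.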
L1 and L2 balls
Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology

| Regularization function | Name | Solver |
| --- | --- | --- |
| $\lVert \theta \rVert_2^2 = \sum_{j=1}^{n} \theta_j^2$ | Tikhonov regularization / Ridge regression | Closed form |
| $\lVert \theta \rVert_1 = \sum_{j=1}^{n} \lvert \theta_j \rvert$ | LASSO regression | Proximal gradient descent, least angle regression |
| $\alpha \lVert \theta \rVert_1 + (1 - \alpha) \lVert \theta \rVert_2^2$ | Elastic net regularization | Proximal gradient descent |
Things to remember
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression