Optimization for Machine Learning
Introduction into supervised learning
Lecturer: Robert M. Gower
Master IASD: AI Systems and Data Science, 2019
Core Info
● Where: ENS: 07/11 amphi Langevin, 03/12 U209, 05/12 amphi Langevin.
● Online: Teaching materials for these 3 classes: https://gowerrobert.github.io/
● Google docs with course info: Can also be found on https://gowerrobert.github.io/
Outline of my three classes
● 07/11/19 Foundations and the empirical risk problem, revision probability, SGD (Stochastic Gradient Descent) for ridge regression
● 03/12/19 SGD for convex optimization. Theory and variants
● 05/12/19 Lab on SGD and variants
Detailed Outline today
● 13:30 – 14:00: Introduction to empirical risk minimization and classification and SGD
● 14:00 – 15:00: Revision on probability
● 15:00 – 15:30: Tea time! Break
● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression
An Introduction to Supervised Learning
References for today's classes
● Convex Optimization, Stephen Boyd and Lieven Vandenberghe, pages 67 to 79
● Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David, Chapter 2
Is There a Cat in the Photo?
[A sequence of example photos, each answered Yes or No.]
Find mapping h that assigns the “correct” target to each input
x: Input/Feature; y: Output/Target
Labeled Data: The training set
(y = -1 means no/false)
The labeled examples are fed into a Learning Algorithm, which outputs a prediction, e.g. -1.
Example: Linear Regression for Height
Labelled data (Male = 0, Female = 1):
Sex 0, Age 30, Height 1.72 m
Sex 1, Age 70, Height 1.52 m
Example Hypothesis: Linear Model
Example Training Problem
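The linear hypothesis above can be fitted by least squares. A minimal sketch, assuming a feature vector x = [sex, age] and hypothesis h(x) = ⟨w, x⟩ + b; the rows beyond the two on the slide are invented for illustration:

```python
import numpy as np

# Features: [sex, age]; targets: height in meters.
# The first two rows are from the slide; the rest are made up for illustration.
X = np.array([[0, 30], [1, 70], [0, 50], [1, 40], [0, 70]], dtype=float)
y = np.array([1.72, 1.52, 1.78, 1.63, 1.70])

# Append a column of ones so the bias b is learned jointly with w.
A = np.hstack([X, np.ones((X.shape[0], 1))])
w = np.linalg.lstsq(A, y, rcond=None)[0]  # minimizes ||A w - y||^2

def h(x):
    """Linear hypothesis h(x) = <w, x> + b."""
    return np.dot(w[:2], x) + w[2]
```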
Linear Regression for Height
The Training Algorithm
[Figure: height vs. age for Sex = 0, with a fitted line.]
Other options aside from linear?
Parametrizing the Hypothesis
Linear:
Polynomial:
Neural Net:
[Figures: height vs. age fits under each model class.]
Loss Functions
Why a squared loss?
The Training Problem
Typically a convex function
Choosing the Loss Function
Quadratic Loss
Binary Loss
Hinge Loss
(y = 1 in all figures)
EXE: Plot the binary and hinge loss functions when y = 1.
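As a sketch of the three losses named above, written as functions of the prediction t for a label y, with the slide's y = 1 setting (function names are mine; the binary loss is the usual 0-1 loss):

```python
import numpy as np

def quadratic_loss(t, y):
    """Quadratic (squared) loss: (t - y)^2 / 2."""
    return 0.5 * (t - y) ** 2

def binary_loss(t, y):
    """Binary (0-1) loss: 1 if the sign of t disagrees with y, else 0."""
    return 0.0 if np.sign(t) == np.sign(y) else 1.0

def hinge_loss(t, y):
    """Hinge loss: max(0, 1 - y t); zero once the margin y t exceeds 1."""
    return max(0.0, 1.0 - y * t)

# Evaluate at a few points with y = 1, as in the slide's figures.
for t in [-1.0, 0.5, 2.0]:
    print(t, quadratic_loss(t, 1), binary_loss(t, 1), hinge_loss(t, 1))
```

Note that the quadratic and hinge losses are convex in t while the binary loss is not, which is why the convex losses are the ones used in the training problem.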
Loss Functions: The Training Problem
Is a notion of loss enough?
What happens when we do not have enough data?
Overfitting and Model Complexity
Fitting polynomials of order 1, 2, 3 and 9
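The polynomial build can be reproduced numerically: training error keeps falling as the degree grows, which is exactly why a low training loss alone does not guarantee a good fit. A sketch on synthetic data (the data-generating function and noise level are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy synthetic data

def train_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Training error is non-increasing in the degree;
# a degree-9 polynomial interpolates the 10 points.
errors = [train_error(k) for k in (1, 2, 3, 9)]
```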
Regularizer Functions
General Training Problem
Regularization
● Goodness of fit, fidelity term, etc.
● Penalizes complexity
● λ controls the tradeoff between fit and complexity
Exe:
Overfitting and Model Complexity
Fitting kth order polynomial
For λ big enough, the solution is a 2nd order polynomial
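The shrinking effect of λ can be checked numerically: solving the L2-regularized polynomial fit for a large λ drives the coefficients toward zero. A sketch, with data generated from a quadratic (setup is mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(20)  # quadratic + noise

# Degree-9 polynomial features: columns x^0 ... x^9.
A = np.vander(x, 10, increasing=True)

def ridge_fit(lam):
    """Closed-form ridge solution: (A^T A + lam I)^{-1} A^T y."""
    return np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)

w_small = ridge_fit(1e-8)  # essentially unregularized: fits the noise
w_big = ridge_fit(10.0)    # heavily regularized: coefficients shrunk toward zero
```

The norm of the ridge solution is non-increasing in λ, so the large-λ fit is always the "simpler" one in this sense.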
Exe: Ridge Regression
● Linear hypothesis
● L2 loss
● L2 regularizer
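For the ridge regression exercise, a sketch of the objective and its minimizer; the 1/(2n) scaling is my convention (the slide's exact normalization is not in the transcript):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.1

def ridge_objective(w):
    """f(w) = 1/(2n) ||Xw - y||^2 + lam/2 ||w||^2 (L2 loss + L2 regularizer)."""
    return np.sum((X @ w - y) ** 2) / (2 * n) + lam * np.dot(w, w) / 2

# Setting the gradient X^T(Xw - y)/n + lam * w to zero gives the closed form:
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```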
Exe: Support Vector Machines (SVM with soft margin)
● Linear hypothesis
● Hinge loss
● L2 regularizer
Exe: Logistic Regression
● Linear hypothesis
● Logistic loss
● L2 regularizer
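A sketch of the logistic loss named above, in the standard form log(1 + e^{−yt}) for labels y ∈ {−1, +1} (the slide's exact expression is not in the transcript), written to avoid overflow:

```python
import numpy as np

def logistic_loss(t, y):
    """Logistic loss log(1 + exp(-y t)) for y in {-1, +1}, computed stably."""
    m = -y * t
    # log(1 + e^m) without overflow: max(m, 0) + log(1 + e^{-|m|}).
    return np.maximum(m, 0.0) + np.log1p(np.exp(-np.abs(m)))
```

Unlike the hinge loss, the logistic loss is smooth everywhere, which makes its gradients convenient for the SGD methods discussed later.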
The Machine Learner's Job
[Figure-driven build repeated over several slides.]
A Datum Function
Finite Sum Training Problem
Re-writing the Training Problem as a Sum of Terms
Can we use this sum structure?
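The sum structure referred to above can be written out; the symbols (loss ℓ, hypothesis h_w, regularizer R, parameter λ) follow the earlier slides, and the 1/n normalization is assumed:

```latex
\min_{w} \; f(w), \qquad
f(w) \;=\; \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad
f_i(w) \;=\; \ell\big(h_w(x_i),\, y_i\big) \;+\; \lambda R(w).
```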
The Training Problem
Stochastic Gradient Descent
Is it possible to design a method that uses only the gradient of a single data function at each iteration?
Unbiased Estimate: let j be an index sampled from {1, …, n} uniformly at random. Then E_j[∇f_j(w)] = (1/n) Σ_{i=1}^n ∇f_i(w) = ∇f(w).
EXE:
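The unbiased-estimate claim is easy to verify numerically: with j uniform on {1, …, n}, the expectation of ∇f_j is just the average of the n data-function gradients, which equals ∇f exactly. A sketch with toy quadratic data functions (setup is mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(i, w):
    """Gradient of the i-th data function f_i(w) = (a_i^T w - b_i)^2 / 2."""
    return (A[i] @ w - b[i]) * A[i]

def grad(w):
    """Gradient of the full objective f(w) = (1/n) sum_i f_i(w)."""
    return A.T @ (A @ w - b) / n

w = rng.standard_normal(d)
# E_j[grad_j(w)] for uniform j is the plain average of the n gradients.
expected = np.mean([grad_i(i, w) for i in range(n)], axis=0)
```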
Stochastic Gradient Descent
[Figure: SGD iterates approaching the optimal point.]
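A sketch of SGD on a ridge regression problem, as in the slides: at each step sample one data function and step along the negative of its gradient. The normalization f_i(w) = (x_i^T w − y_i)²/2 + λ||w||²/2, the stepsize α and the iteration count are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(n)
lam, alpha = 0.1, 0.05

def sgd(num_iters):
    """SGD: sample j uniformly, then step w <- w - alpha * grad f_j(w)."""
    w = np.zeros(d)
    for _ in range(num_iters):
        j = rng.integers(n)
        grad_j = (X[j] @ w - y[j]) * X[j] + lam * w
        w -= alpha * grad_j
    return w

w_sgd = sgd(5000)
# Compare against the closed-form ridge minimizer.
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

With a fixed stepsize the iterates do not converge exactly but hover in a neighborhood of w_star, as the convergence theory below quantifies.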
Assumptions for Convergence
● Strong Convexity
● Expected Bounded Stochastic Gradients
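In their standard form (the constants μ > 0 and B and the exact statements are assumed, matching the usual definitions of these two assumptions), they read:

```latex
% Strong convexity of f with constant mu:
f(w') \;\ge\; f(w) + \langle \nabla f(w),\, w' - w\rangle
  + \tfrac{\mu}{2}\,\lVert w' - w\rVert^2
  \quad \text{for all } w, w'.

% Expected bounded stochastic gradients:
\mathbb{E}_j\big[\lVert \nabla f_j(w) \rVert^2\big] \;\le\; B^2 .
```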
Complexity / Convergence
Theorem
EXE: Do exercises on convergence of random sequences.
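A standard convergence result of this type for fixed-stepsize SGD, under strong convexity and bounded stochastic gradients (constants μ and B as above, stepsize α ≤ 1/(2μ); the notation is assumed, not taken from the slides), is:

```latex
\mathbb{E}\big[\lVert w^t - w^\star \rVert^2\big]
\;\le\; (1 - 2\alpha\mu)^t \,\lVert w^0 - w^\star \rVert^2
\;+\; \frac{\alpha B^2}{2\mu}.
```

The first term is the transient, which decays geometrically; the second is a noise floor proportional to the stepsize, which is why smaller α yields a more accurate (but slower) method.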
Stochastic Gradient Descent with different stepsizes
[Figures: SGD runs with α = 0.01, 0.1, 0.2 and 0.5.]
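The four runs above can be reproduced on a toy ridge problem: a larger fixed α makes faster initial progress but settles at a larger noise floor, so after many iterations the small-stepsize runs end up closer to the minimizer. The problem setup and scaling below are mine:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 5
X = rng.standard_normal((n, d)) / np.sqrt(d)  # scaled so large stepsizes stay stable
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)
lam = 0.1
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def run_sgd(alpha, num_iters=2000, seed=0):
    """Run fixed-stepsize SGD on the ridge problem; return final ||w - w_star||."""
    local = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(num_iters):
        j = local.integers(n)
        w -= alpha * ((X[j] @ w - y[j]) * X[j] + lam * w)
    return np.linalg.norm(w - w_star)

# Final distance to the minimizer for the slide's four stepsizes.
dists = {a: run_sgd(a) for a in (0.01, 0.1, 0.2, 0.5)}
```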