CS540 Introduction to Artificial Intelligence, Lecture 4
Stochastic Gradient, Multi-Class Classification, Regularization

Young Wu
Based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer

June 15, 2020

Source: pages.cs.wisc.edu/~yw/CS540/CS540_Lecture_4_P.pdf
Perceptron Algorithm vs Logistic Regression (Motivation)

For LTU perceptrons, w is updated for each instance x_i sequentially:

w \leftarrow w - \alpha (a_i - y_i) x_i

For logistic perceptrons, w is updated using the gradient, which involves all instances in the training data:

w \leftarrow w - \alpha \sum_{i=1}^{n} (a_i - y_i) x_i
Stochastic Gradient Descent (Motivation)

Each gradient descent step requires computing the gradients for all training instances i = 1, 2, ..., n, which is very costly.

Stochastic gradient descent instead picks one instance x_i at random, computes its gradient, and updates the weights and biases.

When a small random subset of instances is selected each time, the method is called mini-batch gradient descent.
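The three variants above differ only in which instances enter the gradient sum. As a rough Python sketch (not from the original slides), for a single logistic unit with the (a_i - y_i) x_i gradient from the previous slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, y, alpha, batch):
    """One update using only the rows of X listed in `batch`:
    one index -> stochastic, a few -> mini-batch, all -> full gradient."""
    a = sigmoid(X[batch] @ w + b)        # activations for the chosen instances
    err = a - y[batch]                   # the (a_i - y_i) terms
    w = w - alpha * (X[batch].T @ err)   # gradient with respect to the weights
    b = b - alpha * err.sum()            # gradient with respect to the bias
    return w, b
```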
Stochastic Gradient Descent Diagram 1 (Motivation)
Stochastic Gradient Descent Diagram 2 (Motivation)
Stochastic Gradient Descent, Part 1 (Algorithm)

Inputs and outputs: same as backpropagation.

Initialize the weights.

Randomly permute (shuffle) the training set. Evaluate the activation functions at one instance at a time.

Compute the gradient using the chain rule:

\frac{\partial C}{\partial w^{(l)}_{j'j}} = \delta^{(l)}_{ij} a^{(l-1)}_{ij'}, \quad \frac{\partial C}{\partial b^{(l)}_{j}} = \delta^{(l)}_{ij}
Stochastic Gradient Descent, Part 2 (Algorithm)

Update the weights and biases using gradient descent. For l = 1, 2, ..., L:

w^{(l)}_{j'j} \leftarrow w^{(l)}_{j'j} - \alpha \frac{\partial C}{\partial w^{(l)}_{j'j}}, \quad j' = 1, 2, ..., m^{(l-1)}, \; j = 1, 2, ..., m^{(l)}

b^{(l)}_{j} \leftarrow b^{(l)}_{j} - \alpha \frac{\partial C}{\partial b^{(l)}_{j}}, \quad j = 1, 2, ..., m^{(l)}

Repeat the process until convergence:

|C - C_{prev}| < \varepsilon
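The loop can be sketched in Python. As a hypothetical stand-in (a single logistic unit rather than the full network from the slides), here is the same procedure with the shuffle, the per-instance updates, and the |C - C_prev| < ε stopping rule:

```python
import numpy as np

def train_sgd(X, y, alpha=0.5, eps=1e-6, max_epochs=1000, seed=0):
    """Stochastic gradient descent for one logistic unit: shuffle the
    training set, update on one instance at a time, and stop when the
    cost changes by less than eps between passes."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    cost_prev = np.inf
    for _ in range(max_epochs):
        for i in rng.permutation(n):               # random shuffle each pass
            a = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))
            w -= alpha * (a - y[i]) * X[i]         # per-instance gradient step
            b -= alpha * (a - y[i])
        a_all = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        cost = np.sum((a_all - y) ** 2)            # squared-error cost C
        if abs(cost - cost_prev) < eps:            # |C - C_prev| < eps
            break
        cost_prev = cost
    return w, b
```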
Choice of Learning Rate (Discussion)

Decreasing the learning rate α as the weights get closer to the optimal weights can speed up convergence.

Popular choices of learning rate include \alpha / \sqrt{t} and \alpha / t, where t is the current iteration number.

Other methods of choosing the step size use second-derivative (Hessian) information, such as Newton's method and BFGS, or use information about the gradients in previous steps, as the adaptive gradient methods AdaGrad and Adam do.
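The two decaying schedules are one line each; as a small sketch (not from the slides), with t the iteration count starting at 1:

```python
import math

def lr_inv_sqrt(alpha0, t):
    """Learning rate alpha / sqrt(t) at iteration t = 1, 2, ..."""
    return alpha0 / math.sqrt(t)

def lr_inv(alpha0, t):
    """Learning rate alpha / t at iteration t = 1, 2, ..."""
    return alpha0 / t
```

Both keep the early steps large and shrink them as the weights approach the optimum; alpha / t shrinks faster.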
Stochastic vs Full Gradient Descent (Quiz)

Given the same initial weights and biases, stochastic gradient descent with instances picked randomly without replacement and full gradient descent lead to the same updated weights.
A: Do not choose this.
B: True.
C: Do not choose this.
D: False.
E: Do not choose this.
Multi-Class Classification (Motivation)

When there are K categories to classify, the labels can take K different values, y_i \in \{1, 2, ..., K\}.

Binary logistic regression and binary neural network classifiers cannot be directly applied to these problems.
Method 1, One VS AllDiscussion
Train a binary classification model with labels y 1i � 1tyi�ju foreach j � 1, 2, ...,K .
Given a new test instance xi , evaluate the activation functionapjqi from model j .
yi � arg maxj
apjqi
One problem is that the scale of apjqi may be different for
different j .
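A minimal one-vs-all sketch in Python (hypothetical interface: `fit_binary` stands for any binary trainer that returns a scoring function a(x)):

```python
import numpy as np

def one_vs_all_train(X, y, K, fit_binary):
    """Train K binary models; model j gets the labels 1{y_i == j}."""
    return [fit_binary(X, (y == j).astype(float)) for j in range(K)]

def one_vs_all_predict(models, x):
    """Predict the class whose model gives the largest activation."""
    scores = [model(x) for model in models]   # a_i^(j) from each model j
    return int(np.argmax(scores))
```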
Method 2, One vs One (Discussion)

Train a binary classification model for each of the K(K-1)/2 pairs of labels.

Given a new test instance x_i, apply all K(K-1)/2 models and output the class that receives the largest number of votes:

y_i = \arg\max_{j} \sum_{j' \neq j} y^{(j \text{ vs } j')}_i

One problem is that it is not clear what to do when multiple classes receive the same number of votes.
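The voting step can be sketched as follows (hypothetical interface: `classifiers[(j, k)]` is a trained pairwise model that returns the winning class, j or k):

```python
import numpy as np

def one_vs_one_predict(classifiers, x, K):
    """Tally the K(K-1)/2 pairwise votes and return the class with the
    most votes; ties are broken arbitrarily by argmax, which is exactly
    the ambiguity noted in the slide."""
    votes = np.zeros(K, dtype=int)
    for (j, k), clf in classifiers.items():
        votes[clf(x)] += 1          # each pairwise model casts one vote
    return int(np.argmax(votes))
```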
One Hot Encoding (Discussion)

If y is not binary, use one-hot encoding for y.

For example, if y has three categories, then

y_i \in \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\}
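A sketch of the encoding in Python, mapping labels 1, ..., K to indicator vectors:

```python
import numpy as np

def one_hot(y, K):
    """Encode labels y_i in {1, ..., K} as rows of a K-column indicator matrix."""
    y = np.asarray(y)
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y - 1] = 1.0   # label j sets component j to 1
    return Y
```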
Method 3, Softmax Function (Discussion)

For both logistic regression and neural networks, the last layer has K units a_{ij}, for j = 1, 2, ..., K, and the softmax function is used instead of the sigmoid function:

a_{ij} = g\left(w_j^T x_i + b_j\right) = \frac{\exp\left(w_j^T x_i + b_j\right)}{\sum_{j'=1}^{K} \exp\left(w_{j'}^T x_i + b_{j'}\right)}, \quad j = 1, 2, ..., K
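The softmax function itself is a few lines; subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    """Map a vector of K scores w_j^T x + b_j to probabilities summing to 1."""
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()
```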
Softmax Derivatives (Discussion)

The cross-entropy loss is also commonly used with the softmax activation function.

The gradient of the cross-entropy loss with respect to a_{ij}, component j of the output-layer activation for instance i, has the same form as the one for logistic regression:

\frac{\partial C}{\partial a_{ij}} = a_{ij} - y_{ij} \quad \Rightarrow \quad \nabla_{a_i} C = a_i - y_i

The gradient with respect to the weights can then be found using the chain rule.
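The a - y form can be checked numerically. In this sketch (not from the slides) the derivative is taken with respect to the pre-activations z, the inputs to the softmax, where the same clean form softmax(z) - y holds:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    """Cross-entropy cost for one instance with one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

def numeric_grad(z, y, h=1e-6):
    """Central finite differences of the cost with respect to z."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[k] += h
        zm[k] -= h
        g[k] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * h)
    return g
```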
Generalization Error (Motivation)

With a large number of hidden units and a small enough learning rate α, a multi-layer neural network can fit any finite training set perfectly.

This does not imply that the performance on the test set will be good.

This problem is called overfitting.
Generalization Error Diagram (Motivation)
Method 1, Validation Set (Discussion)

Set aside a subset of the training set as the validation set.

During training, the cost on the training set typically keeps decreasing until it reaches 0 (equivalently, the training accuracy keeps increasing).

Train the network only until the cost on the validation set begins to increase (or the validation accuracy begins to decrease).
Validation Set Diagram (Discussion)
Method 2, Drop Out (Discussion)

During training, at each hidden layer, a random set of units from that layer is set to 0.

For example, each unit is retained with probability p = 0.5. At test time, the activations are scaled by p = 0.5 (that is, by 50 percent) to match their expected values during training.

The intuition is that if a hidden unit has to work well with many different combinations of other units, it cannot rely on any particular other unit and is therefore likely to be individually useful.
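A minimal sketch of the two phases, assuming the retain-probability convention from this slide (drop units during training, scale by p at test time):

```python
import numpy as np

def dropout_train(a, p, rng):
    """Training: keep each unit with probability p, zero it otherwise."""
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p):
    """Test: keep every unit but scale by p, matching the training-time
    expected activation p * a."""
    return a * p
```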
Drop Out Diagram (Discussion)
Method 3, L1 and L2 Regularization (Discussion)

The idea is to include an additional cost for non-zero weights.

Models are simpler when many weights are zero.

For example, if a logistic regression model has only a few non-zero weights, then only a few features are relevant, and only those features are used for prediction.
Method 3, L1 Regularization (Discussion)

For L1 regularization, add the 1-norm of the weights (and bias) to the cost:

C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_1 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} |w_j| + |b| \right)

Linear regression with L1 regularization is called LASSO (least absolute shrinkage and selection operator).
Method 3, L2 Regularization (Discussion)

For L2 regularization, add the squared 2-norm of the weights (and bias) to the cost:

C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_2^2 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} w_j^2 + b^2 \right)
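Both penalties are one line each on top of the squared-error cost. A sketch (not from the slides), with λ and the weight vector as arguments:

```python
import numpy as np

def cost_l1(a, y, w, b, lam):
    """Squared error plus lambda * (sum of |w_j| plus |b|)."""
    return np.sum((a - y) ** 2) + lam * (np.sum(np.abs(w)) + abs(b))

def cost_l2(a, y, w, b, lam):
    """Squared error plus lambda * (sum of w_j^2 plus b^2)."""
    return np.sum((a - y) ** 2) + lam * (np.sum(w ** 2) + b ** 2)
```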
L1 and L2 Regularization Comparison (Discussion)

L1 regularization leads to more weights that are exactly 0, which is useful for feature selection.

L2 regularization leads to more weights that are close to 0. Gradient descent is easier with the 2-norm because the 1-norm is not differentiable at 0.
L1 and L2 Regularization Diagram (Discussion)
Method 4, Data Augmentation (Discussion)

More training data can be created from the existing instances, for example by translating or rotating handwritten digits.
HyperparametersDiscussion
It is not clear how to choose the learning rate α, the stoppingcriterion ε, and the regularization parameters.
For neural networks, it is also not clear how to choose thenumber of hidden layers and the number of hidden units ineach layer.
The parameters that are not parameters of the functions inthe hypothesis space are called hyperparameters.
K Fold Cross Validation (Discussion)

Partition the training set into K groups.

Pick one group as the validation set.

Train the model on the remaining K - 1 groups.

Repeat the process for each of the K groups.

Compare the accuracy (or cost) of models with different hyperparameters and select the best one.
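The splitting step can be sketched in Python (a sketch, not from the slides; training and scoring on each fold is left to the caller):

```python
import numpy as np

def k_fold_splits(n, K, seed=0):
    """Shuffle the indices 0..n-1, cut them into K groups, and return
    (train_indices, validation_indices) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    splits = []
    for k in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        splits.append((train, folds[k]))
    return splits
```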
5 Fold Cross Validation Example (Discussion)

Partition the training set S into 5 subsets S_1, S_2, S_3, S_4, S_5 with

S_i \cap S_j = \emptyset \text{ for } i \neq j, \quad \bigcup_{i=1}^{5} S_i = S

Iteration   Training set                    Validation set
1           S_2 ∪ S_3 ∪ S_4 ∪ S_5           S_1
2           S_1 ∪ S_3 ∪ S_4 ∪ S_5           S_2
3           S_1 ∪ S_2 ∪ S_4 ∪ S_5           S_3
4           S_1 ∪ S_2 ∪ S_3 ∪ S_5           S_4
5           S_1 ∪ S_2 ∪ S_3 ∪ S_4           S_5
Leave One Out Cross Validation (Discussion)

If K = n, then each time exactly one training instance is left out as the validation set. This special case is called Leave One Out Cross Validation (LOOCV).