Review of ‘Deep Learning’ book, Chapter 5, Sec. 5.8-5.9 [1]
Ana Karen Roldan
Department of Mathematics and Statistics, University of Calgary
Lunch at the Lab, Fall 2019
October 22, 2019
Introduction
This series of presentations reviews the book Deep Learning.
Chapter 5 covers the most popular and general techniques used in machine learning.
A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Sections 5.8 and 5.9 deal with the experience E.
Outline
1. 5.8 Unsupervised Learning Algorithms
   Definition
   Principal Components Analysis (PCA)
   k-means Clustering
2. 5.9 Stochastic Gradient Descent
   Definition
   SGD Example
Definition
Recall from the previous section: supervised learning algorithms learn to associate some input with some output, e.g. linear regression.
Unsupervised algorithms are those that experience only "features" (inputs) but no supervision signal (outputs).
Examples include extracting information from a distribution, density estimation, learning to draw samples from a distribution, and clustering the data into groups.
Simple representation
These algorithms look for a simpler representation of the data; the most common criteria, which are not mutually exclusive, are the following:
Lower-dimensional representations (latent variables)
Sparse representations (typically increase the dimensionality)
Independent representations (disentangle sources of variation)
Figure: Dimensionality reduction
Figure: Sparse matrix
Principal Components Analysis (PCA)
PCA is an unsupervised learning algorithm that learns a representation of data based on two of the criteria of simple representation: lower dimensionality and independent representation.
Figure: PCA representation
PCA is a linear projection that aligns the direction of greatest variance with the axes of the new space.
Principal Components Analysis (PCA)
PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z = xᵀW, where W describes a rotation of the input space.
Getting the transformation z involves the following steps (sketched in code below):
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
The eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.
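A minimal NumPy sketch of these steps; the function name and interface are illustrative, not from the book:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (one example per row) onto the top-k principal components."""
    # Center the data so the covariance matrix is taken about the mean.
    Xc = X - X.mean(axis=0)
    # Construct the covariance matrix of the data.
    cov = np.cov(Xc, rowvar=False)
    # Compute the eigenvectors of this symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the k largest.
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # z = xᵀW, applied to every row of the centered data.
    return Xc @ W
```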
k-means Clustering
k-means clustering is another example of a simple representation learning algorithm; it divides the training set into k different clusters of examples that are near each other.
This is an example of sparse representation. If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.
Figure: k-means Clustering
The k-means algorithm works by initializing k different centroids µ(1), ..., µ(k) and then alternating two steps until convergence: each training example is assigned to cluster i, where i is the index of the nearest centroid µ(i), and each centroid µ(i) is then updated to the mean of the training examples assigned to cluster i.
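A minimal sketch of these two alternating steps, assuming Euclidean distance; all names and defaults here are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups; returns centroids and the one-hot representation h."""
    rng = np.random.default_rng(seed)
    # Initialize k different centroids at randomly chosen training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each example.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned examples.
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    # Sparse representation: h_i = 1 iff x belongs to cluster i.
    h = np.eye(k)[assign]
    return centroids, h
```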
Definition
Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent (SGD).
A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.
The main purpose is to minimize the cost function used by a machine learning algorithm; to keep the cost of each update manageable, we can sample a minibatch of examples drawn uniformly from the training set.
Figure: Gradient descent algorithm
SGD Example
For example, the negative conditional log-likelihood of the training data, L(x, y, θ) = −log p(y | x; θ), can be written as

J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) ∑_{i=1}^{m} L(x^(i), y^(i), θ)
Gradient descent requires computing

∇_θ J(θ) = (1/m) ∑_{i=1}^{m} ∇_θ L(x^(i), y^(i), θ)
The minibatch size m′ is typically chosen to be relatively small, giving the gradient estimate

g = (1/m′) ∑_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)
Finally, the stochastic gradient descent algorithm follows the estimated gradient downhill: θ ← θ − εg, where ε is the learning rate.
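A minimal sketch of these updates, taking logistic regression as a concrete model for p(y | x; θ); the model choice and all names are illustrative, not from the book:

```python
import numpy as np

def grad_nll(theta, X, y):
    """∇_θ of the average negative log-likelihood −log p(y | x; θ) for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))  # p(y = 1 | x; θ)
    return X.T @ (p - y) / len(y)

def sgd(theta, X, y, eps=0.1, m_prime=32, n_steps=1000, seed=0):
    """Minimize J(θ) with minibatch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        # Sample a minibatch of m′ examples drawn uniformly from the training set.
        idx = rng.choice(len(X), size=m_prime)
        # g = (1/m′) ∑ ∇_θ L(x^(i), y^(i), θ) over the minibatch.
        g = grad_nll(theta, X[idx], y[idx])
        # Follow the estimated gradient downhill.
        theta = theta - eps * g
    return theta
```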
Thank you!
References
[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. The MIT Press, 2016.