Review of ‘Deep Learning’ book, Chapter 5, Sec. 5.8-5.9 [1]
Ana Karen Roldan
Department of Mathematics and Statistics, University of Calgary
Lunch at the Lab, Fall 2019
October 22, 2019
Introduction
This series of presentations reviews the book Deep Learning.
Chapter 5 covers the most popular and general techniques used in machine learning.
A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Sections 5.8 and 5.9 deal with the experience E.
Outline
1. 5.8 Unsupervised Learning Algorithms
   Definition
   Principal Components Analysis (PCA)
   k-means Clustering
2. 5.9 Stochastic Gradient Descent
   Definition
   SGD Example
Definition
Recall from the previous section: supervised learning algorithms learn to associate some input with some output, e.g. linear regression.
Unsupervised algorithms are those that experience only "features" (inputs) but no supervision signal (outputs).
Examples include extracting information from a distribution, density estimation, learning to draw samples from a distribution, and clustering the data into groups.
Simple representation
These algorithms look for a simpler representation of the data; the most common criteria, which are not mutually exclusive, are the following:
Lower-dimensional representations (latent variables)
Sparse representations (typically increase the dimensionality)
Independent representations (disentangle sources of variation)
Figure: Dimensionality reduction
Figure: Sparse matrix
Principal Components Analysis (PCA)
PCA is an unsupervised learning algorithm that learns a representation of data based on two of the criteria of simple representation: lower dimensionality and independent representation.
Figure: PCA representation
PCA is a linear projection that aligns the direction of greatest variance with the axes of the new space.
Principal Components Analysis (PCA)
PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z = xᵀW, where W describes a rotation of the input space.
Getting the transformation z involves the following steps (sketched in code below):
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
The eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.
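A minimal NumPy sketch of these steps; the function name and interface are illustrative, not from the book:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (one example per row) onto the top-k principal components."""
    # Center the data so the covariance matrix is taken about the mean.
    Xc = X - X.mean(axis=0)
    # Construct the covariance matrix of the data.
    cov = np.cov(Xc, rowvar=False)
    # Compute the eigenvectors of this symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the k largest.
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # z = xᵀW, applied to every row of the centered data.
    return Xc @ W
```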
k-means Clustering
k-means clustering is another example of a simple representation learning algorithm; it divides the training set into k different clusters of examples that are near each other.
This is an example of sparse representation. If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.
Figure: k-means Clustering
The k-means algorithm works by initializing k different centroids µ(1), ..., µ(k) and then alternating two steps until convergence: each training example is assigned to cluster i, where i is the index of the nearest centroid µ(i), and each centroid µ(i) is then updated to the mean of the training examples assigned to cluster i.
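A minimal sketch of these two alternating steps, assuming Euclidean distance; all names and defaults here are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups; returns centroids and the one-hot representation h."""
    rng = np.random.default_rng(seed)
    # Initialize k different centroids at randomly chosen training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each example.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned examples.
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    # Sparse representation: h_i = 1 iff x belongs to cluster i.
    h = np.eye(k)[assign]
    return centroids, h
```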
Definition
Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent (SGD).
A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.
The main purpose is to minimize the cost function used by a machine learning algorithm; to keep the cost of each update manageable, we can sample a minibatch of examples drawn uniformly from the training set.
Figure: Gradient descent algorithm
SGD Example
For example, the negative conditional log-likelihood of the training data, L(x, y, θ) = −log p(y | x; θ), can be written as

J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) ∑_{i=1}^{m} L(x^(i), y^(i), θ)
Gradient descent requires computing

∇_θ J(θ) = (1/m) ∑_{i=1}^{m} ∇_θ L(x^(i), y^(i), θ)
The minibatch size m′ is typically chosen to be relatively small, giving the gradient estimate

g = (1/m′) ∑_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)
Finally, the stochastic gradient descent algorithm follows the estimated gradient downhill: θ ← θ − εg, where ε is the learning rate.
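A minimal sketch of these updates, taking logistic regression as a concrete model for p(y | x; θ); the model choice and all names are illustrative, not from the book:

```python
import numpy as np

def grad_nll(theta, X, y):
    """∇_θ of the average negative log-likelihood −log p(y | x; θ) for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))  # p(y = 1 | x; θ)
    return X.T @ (p - y) / len(y)

def sgd(theta, X, y, eps=0.1, m_prime=32, n_steps=1000, seed=0):
    """Minimize J(θ) with minibatch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        # Sample a minibatch of m′ examples drawn uniformly from the training set.
        idx = rng.choice(len(X), size=m_prime)
        # g = (1/m′) ∑ ∇_θ L(x^(i), y^(i), θ) over the minibatch.
        g = grad_nll(theta, X[idx], y[idx])
        # Follow the estimated gradient downhill.
        theta = theta - eps * g
    return theta
```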
Thank you!
References
[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. The MIT Press, 2016.