Deep Learning Algorithms: Sparse AutoEncoders
Gabriel Broadwin Nongsiej (13CS60R02)
Under the guidance of
Prof. Sudeshna Sarkar
1. Introduction
Supervised learning is one of the most powerful tools of AI, and has led to a number of
innovative applications over the years. Despite its significant successes, supervised learning
today is still severely limited. Specifically, most applications of it still require that we manually
specify the input features x given to the algorithm. Once a good feature representation is given, a
supervised learning algorithm can do well. But good features are not easy to identify and
formulate, and this difficult feature-engineering work does not scale well to new problems.
This seminar report describes the sparse autoencoder learning algorithm, which is one
approach to learning features automatically from unlabeled data. The features produced by a
sparse autoencoder have been found to perform surprisingly well, proving competitive with and
sometimes superior to the best hand-engineered features.
2. Artificial Neural Networks (ANN)
Consider a supervised learning problem where we have access to labeled training
examples (x^{(i)}, y^{(i)}). Neural networks give a way of defining a complex, non-linear form of
hypotheses h_{W,b}(x), with parameters W, b that we can fit to our data.
An artificial neural network (ANN) comprises one or more “neurons”. A neuron is a
computational unit that takes some set of inputs, calculates a linear combination of these inputs,
and outputs h_{W,b}(x) = f(W^T x) = f(Σ_{i=1}^{3} W_i x_i + b), where f : R → R is called the
activation function.
Figure 1 : Simple Neuron
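As a concrete illustration of this computation, here is a minimal Python/NumPy sketch assuming
a sigmoid activation function; the weight and input values are illustrative, not from the report:

    import numpy as np

    def sigmoid(z):
        # one common choice of activation function f
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, W, b):
        # h_{W,b}(x) = f(W^T x + b) = f(sum_i W_i * x_i + b)
        return sigmoid(np.dot(W, x) + b)

    # three inputs, as in Figure 1 (illustrative values)
    print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.2, 0.3]), 0.0))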
A neural network is put together by hooking together many of our simple “neurons,” so
that the output of a neuron can be the input of another. For example, here is a small neural
network:
Figure 2 : Artificial Neural Network
In this figure, we have used circles to also denote the inputs to the network. The circles
labelled “+1” are called bias units, and correspond to the intercept term. The leftmost layer of the
network is called the input layer, and the rightmost layer the output layer (which, in this
example, has only one node). The middle layer of nodes is called the hidden layer, because its
values are not observed in the training set. We also say that our example neural network has 3
input units (not counting the bias unit), 3 hidden units, and 1 output unit.
ANNs may contain one or more hidden layers between the input layer and the output
layer. The most common choice is an n_l-layered network in which layer 1 is the input layer,
layer n_l is the output layer, and each layer l is densely connected to layer l + 1. This is one
example of a feedforward neural network, since its connectivity graph has no directed loops or
cycles. Neural networks can also have multiple output units.
Figure 3 : Feedforward Network
Any ANN is characterized by the following:
The interconnection pattern between the different layers of neurons
The learning process for updating the weights of the interconnections
The activation function that converts a neuron's weighted input to its output activation.
Suppose we have a fixed training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})} of m training
examples. We can train our neural network using batch gradient descent. In detail, for a single
training example (x, y), we define the cost function with respect to that single example to be

J(W, b; x, y) = (1/2) ‖h_{W,b}(x) − y‖².
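For concreteness, a minimal NumPy sketch of this per-example cost; the output and target
vectors are illustrative, with the network output h assumed to come from a forward pass like the
one above:

    import numpy as np

    def cost(h, y):
        # J(W, b; x, y) = (1/2) * ||h_{W,b}(x) - y||^2
        return 0.5 * np.sum((h - y) ** 2)

    # illustrative values: network output vs. desired output
    print(cost(np.array([0.8, 0.1]), np.array([1.0, 0.0])))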
3. BackPropagation Algorithm:
Some input and output patterns can be easily learned by single-layer neural networks (i.e.
perceptrons). However, these single-layer perceptrons cannot learn some relatively simple
patterns, such as those that are not linearly separable. A single-layer network, moreover, must
learn a function that outputs a label using only the raw features of the data; it has no way to
learn abstract features of the input, since it is limited to a single layer. A multi-layered network
overcomes this limitation, as it can create internal representations and learn different features in
each layer. Each higher layer learns progressively more abstract features that can be used to
describe the data. Each layer finds patterns in the layer below it, and it is this ability to create
internal representations independent of outside input that gives multi-layered networks their
power. The goal and motivation for developing the backpropagation algorithm is to find a way
to train multi-layered neural networks so that they can learn the internal representations needed
to represent any arbitrary mapping from input to output.
Algorithm for a 3-layer network (only one hidden layer):

initialize network weights (often to small random values)
do
    for each training example x
        run a feedforward pass to predict what the ANN will output (activations)
        compute the error (prediction - actual) at the output units
        compute Δw_h for all weights from the hidden layer to the output layer
        compute Δw_i for all weights from the input layer to the hidden layer
        update the network weights
until stopping criterion is satisfied
return the network
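A runnable version of this algorithm might look as follows. This is a minimal NumPy sketch
under assumptions the report leaves open (sigmoid activations, squared-error cost, a fixed
number of epochs as the stopping criterion), not the report's own code; the XOR data at the end
illustrates a pattern that a single-layer perceptron cannot learn:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(X, Y, n_hidden, lr=0.5, epochs=1000):
        # initialize network weights (small random values)
        rng = np.random.default_rng(0)
        W1 = rng.normal(0.0, 0.1, (n_hidden, X.shape[1]))
        b1 = np.zeros(n_hidden)
        W2 = rng.normal(0.0, 0.1, (Y.shape[1], n_hidden))
        b2 = np.zeros(Y.shape[1])
        for _ in range(epochs):              # until stopping criterion satisfied
            for x, y in zip(X, Y):           # for each training example x
                # feedforward pass to predict what the ANN will output
                a1 = sigmoid(W1 @ x + b1)
                a2 = sigmoid(W2 @ a1 + b2)
                # output delta: (prediction - actual) times sigmoid derivative
                d2 = (a2 - y) * a2 * (1.0 - a2)
                # propagate the error back to the hidden layer
                d1 = (W2.T @ d2) * a1 * (1.0 - a1)
                # update network weights (gradient descent step)
                W2 -= lr * np.outer(d2, a1); b2 -= lr * d2
                W1 -= lr * np.outer(d1, x);  b1 -= lr * d1
        return W1, b1, W2, b2

    # e.g. XOR, which is not linearly separable
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)
    train(X, Y, n_hidden=3)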
The backpropagation algorithm, however, has several problems which can cause it to give
sub-optimal results. Some of these problems are:
The gradient becomes progressively “diluted” (it vanishes) as it is propagated back through many layers
Training can get stuck in poor local minima
In the usual setting, only labelled data can be used
4. Deep Learning:
Deep learning is a set of algorithms in machine learning that attempt to model high-level
abstractions in data by using architectures composed of multiple non-linear transformations. The
underlying assumption is that observed data is generated by the interactions of many different
factors on different levels. Some of the reasons to use deep learning are:
Performs far better than its predecessors
Is simple to construct
Allows abstraction to develop naturally
Helps the network initialize with good parameters
Allows the features to be refined so that they become more relevant to the task
Trades space for time: more layers, but less hardware
5. AutoEncoders:
An autoencoder is an artificial neural network used for learning efficient codings. The
aim of an autoencoder is to learn a compressed, distributed representation (encoding) for a set of
data, which makes it useful for dimensionality reduction. An autoencoder is trained to encode
the input into some representation from which the input can be reconstructed; the target output
is the input itself.
The autoencoder tries to learn a function h_{W,b}(x) ≈ x. In other words, it is trying to learn
an approximation to the identity function, so as to produce an output that is similar to x (see Fig. 4).
The first hidden layer is trained to replicate the input. After the error has been reduced to
an acceptable range, the next layer can be introduced, with the output of the first hidden layer
treated as the input for the next hidden layer. We thus always train one hidden layer at a time,
keeping all previous layers fixed. For each layer, we try to minimize a cost function so that
the output does not deviate too much from the input. Here, h_{W,b}(x) = f(W^T x), where
f : R → R is the activation function; f can be a sigmoid or tanh function.
Figure 4 : AutoEncoder
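A minimal sketch of such a network in NumPy; the 64-dimensional input and 25 hidden units
are illustrative assumptions, not values from the report:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0.0, 0.1, (25, 64)), np.zeros(25)   # encoder
    W2, b2 = rng.normal(0.0, 0.1, (64, 25)), np.zeros(64)   # decoder

    def reconstruct(x):
        h = sigmoid(W1 @ x + b1)      # compressed representation (encoding)
        return sigmoid(W2 @ h + b2)   # reconstruction: h_{W,b}(x) ≈ x

    # training proceeds exactly as in the backpropagation sketch above,
    # except that the target output y is the input x itself
    x = rng.random(64)
    print(np.sum((reconstruct(x) - x) ** 2))   # reconstruction error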
6. Sparsity:
The number of hidden units in an ANN can vary. But even when the number of hidden
units is large, we can impose a sparsity constraint on the hidden units: we can require that the
average activation of each hidden neuron be close to 0. This constraint makes most of the
neurons inactive most of the time. This technique of constraining the autoencoder so that only a
few of the neurons are “activated” at any given time is called sparsity. We introduce a sparsity
parameter ρ, the desired average activation of each hidden unit, and set ρ to a small value close
to 0 (for example, ρ = 0.05).
Let a_j^{(2)}(x) denote the activation of hidden unit j (in layer 2) when the network is given a
specific input x. Let

ρ̂_j = (1/m) Σ_{i=1}^{m} [ a_j^{(2)}(x^{(i)}) ]

be the average activation of hidden unit j (averaged over the training set).
A penalty term is used to penalize those hidden units whose average activation ρ̂_j deviates
from the sparsity parameter ρ. One penalty term that can be used is the Kullback–Leibler (KL)
divergence:

KL(ρ ‖ ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j))

So the penalty term in our equations becomes Σ_{j=1}^{s_2} KL(ρ ‖ ρ̂_j), where s_2 is the number
of hidden units, and our overall cost function becomes:

J_sparse(W, b) = J(W, b) + β Σ_{j=1}^{s_2} KL(ρ ‖ ρ̂_j)

where β controls the weight of the sparsity penalty term.
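A minimal NumPy sketch of this penalty; the ρ value and the average activations are
illustrative:

    import numpy as np

    def kl_penalty(rho, rho_hat):
        # sum over hidden units j of KL(rho || rho_hat_j)
        return np.sum(rho * np.log(rho / rho_hat)
                      + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

    rho_hat = np.array([0.05, 0.10, 0.50])   # average activations of 3 hidden units
    print(kl_penalty(0.05, rho_hat))         # 0 when rho_hat == rho; grows with deviation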
Our derivative calculation for the hidden-layer deltas changes from

δ_i^{(2)} = ( Σ_j W_{ji}^{(2)} δ_j^{(3)} ) f′(z_i^{(2)})

to

δ_i^{(2)} = ( ( Σ_j W_{ji}^{(2)} δ_j^{(3)} ) + β ( −ρ/ρ̂_i + (1 − ρ)/(1 − ρ̂_i) ) ) f′(z_i^{(2)}),

where z_i^{(2)} is the total weighted input to hidden unit i and the sum runs over the units j of
the output layer.
We will need to know the average activations ρ̂_i to compute this term. So we first run a
forward pass on all the training examples to compute these average activations, before
computing backpropagation on any example.
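The resulting two-pass gradient computation might look as follows in NumPy; the shapes,
ρ = 0.05, and β = 3 are illustrative assumptions carried over from the earlier sketches, not
values from the report:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.random((100, 64))                     # m = 100 training examples
    W1, b1 = rng.normal(0.0, 0.1, (25, 64)), np.zeros(25)
    W2, b2 = rng.normal(0.0, 0.1, (64, 25)), np.zeros(64)
    rho, beta = 0.05, 3.0                         # sparsity target, penalty weight

    # first pass: average activation rho_hat of each hidden unit
    A1 = sigmoid(X @ W1.T + b1)                   # (100, 25) hidden activations
    rho_hat = A1.mean(axis=0)

    # second pass (shown for one example): sparsity term enters the hidden delta
    x = X[0]
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    d2 = (a2 - x) * a2 * (1.0 - a2)               # output delta (target is x itself)
    sparsity = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
    d1 = (W2.T @ d2 + sparsity) * a1 * (1.0 - a1)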
A sparse representation uses more features, but at any given time a significant number of
those features will have values close to 0. This leads to more localized features, where a
particular node (or small group of nodes) taking the value 1 signifies the presence of a feature.
A sparse autoencoder therefore uses more hidden nodes. It can also learn from corrupted
training instances, decoding only the uncorrupted parts and learning the conditional
dependencies between features.
We have shown how to train one layer of the autoencoder. Using the output of the first
hidden layer as input, we can add another hidden layer after it and train it in the same fashion.
This addition of layers can be repeated many times to create a “deep” network. Supervised
training (backpropagation) is then performed on the last layer using the final features,
followed by supervised training of the entire network to fine-tune all the weights.
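The stacking procedure can be sketched as follows, assuming a hypothetical helper
train_autoencoder(X, n_hidden) that trains one autoencoder as described above and returns its
encoder weights, bias, and the hidden activations it produces on its input; the layer sizes are
illustrative:

    # greedy layer-wise pretraining (sketch)
    # train_autoencoder is a hypothetical helper, and X is a training
    # matrix of raw inputs, as in the earlier sketches
    features = X                                  # inputs for the first layer
    stack = []
    for n_hidden in (256, 128, 64):               # one autoencoder per new layer
        W, b, features = train_autoencoder(features, n_hidden)
        stack.append((W, b))                      # keep the encoder; its output
                                                  # becomes the next layer's input
    # finally: supervised training of the last layer on the final features,
    # then fine-tuning of the whole stack with backpropagation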
7. Visualization:
Let us take an example of an image processor.
Figure 5 : Sample Image
Given a picture, the autoencoder selects a grid of pixels to encode; let the grid be a
640 × 480 grid. Each pixel is used as an input x_i. The autoencoder tries to output the pixels so
that they look similar to the original pixels. To do this, the autoencoder tries to learn the
features of the image in each layer.
Figure 6 : Image in the selected grid after whitening.
On passing the image grid through the autoencoder, the output of the autoencoder is
similar to the image shown below:
Figure 7 : Output of the autoencoder for image grid
Each square in the figure shows the input image that maximally activates one of the many
hidden units. We see that the different hidden units have learned to detect edges at different
positions and orientations in the image.
8. Applications:
Image Processing
Computer Vision
Automated systems
Natural Language Processing
Sound Processing
Tactile Recognition
Data Processing
References
“Deep Learning and Unsupervised Feature Learning Winter 2011”, Stanford University,
Stanford, California 94305. https://www.stanford.edu/class/cs294a/
“Stanford’s Unsupervised Feature and Deep Learning tutorial”, Stanford University,
Stanford, California 94305.
http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
Nando de Freitas, “Deep learning with autoencoders”, Department of Computer Science,
University of British Columbia, Vancouver, Canada (March 2013, Lecture for CPSC540).
Tom M. Mitchell, “Machine Learning”, McGraw-Hill Publications, March 1997,
ISBN: 0070428077.
Honglak Lee, “Tutorial on Deep Learning and Applications”, NIPS 2010 Workshop on
Deep Learning and Unsupervised Feature Learning, University of Michigan, USA.
Itamar Arel, Derek C. Rose, and Thomas P. Karnowski, “Deep Machine Learning — A New
Frontier in Artificial Intelligence Research”, The University of Tennessee, USA, IEEE
Computational Intelligence Magazine, November 2010.
Yoshua Bengio, “Learning Deep Architectures for AI”, Foundations and Trends in Machine
Learning, Vol. 2, No. 1 (2009).