Deep Learning Algorithms: Sparse AutoEncoders
Gabriel Broadwin Nongsiej (13CS60R02)
Under the guidance of
Prof. Sudeshna Sarkar
1. Introduction
Supervised learning is one of the most powerful tools of AI, and has led to a number of
innovative applications over the years. Despite its significant successes, supervised learning
today is still severely limited. Specifically, most applications of it still require that we manually
specify the input features x given to the algorithm. Once a good feature representation is given, a
supervised learning algorithm can do well. But good features are not easy to identify and
formulate, and this difficult feature-engineering work does not scale well to new problems.
This seminar report describes the sparse autoencoder learning algorithm, which is one
approach to learning features automatically from unlabeled data. The features produced by a
sparse autoencoder have been found to perform surprisingly well, proving competitive with and
sometimes superior to the best hand-engineered features.
2. Artificial Neural Networks (ANN)
Consider a supervised learning problem where we have access to labeled training
examples (x^{(i)}, y^{(i)}). Neural networks give a way of defining a complex, non-linear form of
hypotheses h_{W,b}(x), with parameters W, b that we can fit to our data.
An artificial neural network (ANN) comprises one or more “neurons”. A neuron is a
computational unit that takes some set of inputs, calculates a linear combination of these inputs,
and outputs h_{W,b}(x) = f(W^T x) = f(Σ_{i=1}^{3} W_i x_i + b), where f : R → R is called the
activation function.
Figure 1 : Simple Neuron
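As a concrete illustration of this computation, here is a minimal Python/NumPy sketch assuming
a sigmoid activation function; the weight and input values are illustrative, not from the report:

    import numpy as np

    def sigmoid(z):
        # one common choice of activation function f
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, W, b):
        # h_{W,b}(x) = f(W^T x + b) = f(sum_i W_i * x_i + b)
        return sigmoid(np.dot(W, x) + b)

    # three inputs, as in Figure 1 (illustrative values)
    print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.2, 0.3]), 0.0))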
A neural network is put together by hooking together many of our simple “neurons,” so
that the output of a neuron can be the input of another. For example, here is a small neural
network:
Figure 2 : Artificial Neural Network
In this figure, we have used circles to also denote the inputs to the network. The circles
labelled “+1” are called bias units, and correspond to the intercept term. The leftmost layer of the
network is called the input layer, and the rightmost layer the output layer (which, in this
example, has only one node). The middle layer of nodes is called the hidden layer, because its
values are not observed in the training set. We also say that our example neural network has 3
input units (not counting the bias unit), 3 hidden units, and 1 output unit.
ANNs may contain one or more hidden layers between the input layer and the output
layer. The most common choice is an n_l-layered network in which layer 1 is the input layer,
layer n_l is the output layer, and each layer l is densely connected to layer l + 1. This is one
example of a feedforward neural network, since its connectivity graph has no directed loops or
cycles. Neural networks can also have multiple output units.
Figure 3 : Feedforward Network
Any ANN is characterized by the following:
The interconnection pattern between the different layers of neurons
The learning process for updating the weights of the interconnections
The activation function that converts a neuron's weighted input to its output activation.
Suppose we have a fixed training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})} of m training
examples. We can train our neural network using batch gradient descent. In detail, for a single
training example (x, y), we define the cost function with respect to that single example to be

J(W, b; x, y) = (1/2) ‖h_{W,b}(x) − y‖².
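For concreteness, a minimal NumPy sketch of this per-example cost; the output and target
vectors are illustrative, with the network output h assumed to come from a forward pass like the
one above:

    import numpy as np

    def cost(h, y):
        # J(W, b; x, y) = (1/2) * ||h_{W,b}(x) - y||^2
        return 0.5 * np.sum((h - y) ** 2)

    # illustrative values: network output vs. desired output
    print(cost(np.array([0.8, 0.1]), np.array([1.0, 0.0])))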
3. BackPropagation Algorithm:
Some input and output patterns can be easily learned by single-layer neural networks (i.e.
perceptrons). However, these single-layer perceptrons cannot learn some relatively simple
patterns, such as those that are not linearly separable. A single-layer network, moreover, must
learn a function that outputs a label using only the raw features of the data; it has no way to
learn abstract features of the input, since it is limited to a single layer. A multi-layered network
overcomes this limitation, as it can create internal representations and learn different features in
each layer. Each higher layer learns progressively more abstract features that can be used to
describe the data. Each layer finds patterns in the layer below it, and it is this ability to create
internal representations independent of outside input that gives multi-layered networks their
power. The goal and motivation for developing the backpropagation algorithm is to find a way
to train multi-layered neural networks so that they can learn the internal representations needed
to represent any arbitrary mapping from input to output.
Algorithm for a 3-layer network (only one hidden layer):

initialize network weights (often to small random values)
do
    for each training example x
        run a feedforward pass to predict what the ANN will output (activations)
        compute the error (prediction - actual) at the output units
        compute Δw_h for all weights from the hidden layer to the output layer
        compute Δw_i for all weights from the input layer to the hidden layer
        update the network weights
until stopping criterion is satisfied
return the network
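A runnable version of this algorithm might look as follows. This is a minimal NumPy sketch
under assumptions the report leaves open (sigmoid activations, squared-error cost, a fixed
number of epochs as the stopping criterion), not the report's own code; the XOR data at the end
illustrates a pattern that a single-layer perceptron cannot learn:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(X, Y, n_hidden, lr=0.5, epochs=1000):
        # initialize network weights (small random values)
        rng = np.random.default_rng(0)
        W1 = rng.normal(0.0, 0.1, (n_hidden, X.shape[1]))
        b1 = np.zeros(n_hidden)
        W2 = rng.normal(0.0, 0.1, (Y.shape[1], n_hidden))
        b2 = np.zeros(Y.shape[1])
        for _ in range(epochs):              # until stopping criterion satisfied
            for x, y in zip(X, Y):           # for each training example x
                # feedforward pass to predict what the ANN will output
                a1 = sigmoid(W1 @ x + b1)
                a2 = sigmoid(W2 @ a1 + b2)
                # output delta: (prediction - actual) times sigmoid derivative
                d2 = (a2 - y) * a2 * (1.0 - a2)
                # propagate the error back to the hidden layer
                d1 = (W2.T @ d2) * a1 * (1.0 - a1)
                # update network weights (gradient descent step)
                W2 -= lr * np.outer(d2, a1); b2 -= lr * d2
                W1 -= lr * np.outer(d1, x);  b1 -= lr * d1
        return W1, b1, W2, b2

    # e.g. XOR, which is not linearly separable
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)
    train(X, Y, n_hidden=3)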
The backpropagation algorithm, however, has several problems which can cause it to give
sub-optimal results. Some of these problems are:
The gradient becomes progressively “diluted” (it vanishes) as it is propagated back through many layers
Training can get stuck in poor local minima
In the usual setting, only labelled data can be used
4. Deep Learning:
Deep learning is a set of algorithms in machine learning that attempt to model high-level
abstractions in data by using architectures composed of multiple non-linear transformations. The
underlying assumption is that observed data is generated by the interactions of many different
factors on different levels. Some of the reasons to use deep learning are:
Performs far better than its predecessors
Is simple to construct
Allows abstraction to develop naturally
Helps the network initialize with good parameters
Allows the features to be refined so that they become more relevant to the task
Trades space for time: more layers, but less hardware
5. AutoEncoders:
An autoencoder is an artificial neural network used for learning efficient codings. The
aim of an autoencoder is to learn a compressed, distributed representation (encoding) for a set of
data, which makes it useful for dimensionality reduction. An autoencoder is trained to encode
the input into some representation from which the input can be reconstructed; the target output
is the input itself.
The autoencoder tries to learn a function h_{W,b}(x) ≈ x. In other words, it is trying to learn
an approximation to the identity function, so as to produce an output that is similar to x (see Fig. 4).
The first hidden layer is trained to replicate the input. After the error has been reduced to
an acceptable range, the next layer can be introduced, with the output of the first hidden layer
treated as the input for the next hidden layer. We thus always train one hidden layer at a time,
keeping all previous layers fixed. For each layer, we try to minimize a cost function so that
the output does not deviate too much from the input. Here, h_{W,b}(x) = f(W^T x), where
f : R → R is the activation function; f can be a sigmoid or tanh function.
Figure 4 : AutoEncoder
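A minimal sketch of such a network in NumPy; the 64-dimensional input and 25 hidden units
are illustrative assumptions, not values from the report:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0.0, 0.1, (25, 64)), np.zeros(25)   # encoder
    W2, b2 = rng.normal(0.0, 0.1, (64, 25)), np.zeros(64)   # decoder

    def reconstruct(x):
        h = sigmoid(W1 @ x + b1)      # compressed representation (encoding)
        return sigmoid(W2 @ h + b2)   # reconstruction: h_{W,b}(x) ≈ x

    # training proceeds exactly as in the backpropagation sketch above,
    # except that the target output y is the input x itself
    x = rng.random(64)
    print(np.sum((reconstruct(x) - x) ** 2))   # reconstruction error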
6. Sparsity:
The number of hidden units in an ANN can vary. But even when the number of hidden
units is large, we can impose a sparsity constraint on the hidden units: we can require that the
average activation of each hidden neuron be close to 0. This constraint makes most of the
neurons inactive most of the time. This technique of constraining the autoencoder so that only a
few of the neurons are “activated” at any given time is called sparsity. We introduce a sparsity
parameter ρ, the desired average activation of each hidden unit, and set ρ to a small value close
to 0 (for example, ρ = 0.05).
Let a_j^{(2)}(x) denote the activation of hidden unit j (in layer 2) when the network is given a
specific input x. Let

ρ̂_j = (1/m) Σ_{i=1}^{m} [ a_j^{(2)}(x^{(i)}) ]

be the average activation of hidden unit j (averaged over the training set).
A penalty term is used to penalize those hidden units whose average activation ρ̂_j deviates
from the sparsity parameter ρ. One penalty term that can be used is the Kullback–Leibler (KL)
divergence:

KL(ρ ‖ ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j))

So the penalty term in our equations becomes Σ_{j=1}^{s_2} KL(ρ ‖ ρ̂_j), where s_2 is the number
of hidden units, and our overall cost function becomes:

J_sparse(W, b) = J(W, b) + β Σ_{j=1}^{s_2} KL(ρ ‖ ρ̂_j)

where β controls the weight of the sparsity penalty term.
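A minimal NumPy sketch of this penalty; the ρ value and the average activations are
illustrative:

    import numpy as np

    def kl_penalty(rho, rho_hat):
        # sum over hidden units j of KL(rho || rho_hat_j)
        return np.sum(rho * np.log(rho / rho_hat)
                      + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

    rho_hat = np.array([0.05, 0.10, 0.50])   # average activations of 3 hidden units
    print(kl_penalty(0.05, rho_hat))         # 0 when rho_hat == rho; grows with deviation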
Our derivative calculation for the hidden-layer deltas changes from

δ_i^{(2)} = ( Σ_j W_{ji}^{(2)} δ_j^{(3)} ) f′(z_i^{(2)})

to

δ_i^{(2)} = ( ( Σ_j W_{ji}^{(2)} δ_j^{(3)} ) + β ( −ρ/ρ̂_i + (1 − ρ)/(1 − ρ̂_i) ) ) f′(z_i^{(2)}),

where z_i^{(2)} is the total weighted input to hidden unit i and the sum runs over the units j of
the output layer.
We will need to know the average activations ρ̂_i to compute this term. So we first run a
forward pass on all the training examples to compute these average activations, before
computing backpropagation on any example.
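The resulting two-pass gradient computation might look as follows in NumPy; the shapes,
ρ = 0.05, and β = 3 are illustrative assumptions carried over from the earlier sketches, not
values from the report:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.random((100, 64))                     # m = 100 training examples
    W1, b1 = rng.normal(0.0, 0.1, (25, 64)), np.zeros(25)
    W2, b2 = rng.normal(0.0, 0.1, (64, 25)), np.zeros(64)
    rho, beta = 0.05, 3.0                         # sparsity target, penalty weight

    # first pass: average activation rho_hat of each hidden unit
    A1 = sigmoid(X @ W1.T + b1)                   # (100, 25) hidden activations
    rho_hat = A1.mean(axis=0)

    # second pass (shown for one example): sparsity term enters the hidden delta
    x = X[0]
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    d2 = (a2 - x) * a2 * (1.0 - a2)               # output delta (target is x itself)
    sparsity = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
    d1 = (W2.T @ d2 + sparsity) * a1 * (1.0 - a1)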
A sparse representation uses more features, but at any given time a significant number of
those features will have values close to 0. This leads to more localized features, where a
particular node (or small group of nodes) taking the value 1 signifies the presence of a feature.
A sparse autoencoder therefore uses more hidden nodes. It can also learn from corrupted
training instances, decoding only the uncorrupted parts and learning the conditional
dependencies between features.
We have shown how to train one layer of the autoencoder. Using the output of the first
hidden layer as input, we can add another hidden layer after it and train it in the same fashion.
This addition of layers can be repeated many times to create a “deep” network. Supervised
training (backpropagation) is then performed on the last layer using the final features,
followed by supervised training of the entire network to fine-tune all the weights.
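The stacking procedure can be sketched as follows, assuming a hypothetical helper
train_autoencoder(X, n_hidden) that trains one autoencoder as described above and returns its
encoder weights, bias, and the hidden activations it produces on its input; the layer sizes are
illustrative:

    # greedy layer-wise pretraining (sketch)
    # train_autoencoder is a hypothetical helper, and X is a training
    # matrix of raw inputs, as in the earlier sketches
    features = X                                  # inputs for the first layer
    stack = []
    for n_hidden in (256, 128, 64):               # one autoencoder per new layer
        W, b, features = train_autoencoder(features, n_hidden)
        stack.append((W, b))                      # keep the encoder; its output
                                                  # becomes the next layer's input
    # finally: supervised training of the last layer on the final features,
    # then fine-tuning of the whole stack with backpropagation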
7. Visualization:
Let us take an example of an image processor.
Figure 5 : Sample Image
Given a picture, the autoencoder selects a grid of pixels to encode; let the grid be a
640 × 480 grid. Each pixel is used as an input x_i. The autoencoder tries to output the pixels so
that they look similar to the original pixels. To do this, the autoencoder tries to learn the
features of the image in each layer.
Figure 6 : Image in the selected grid after whitening.
On passing the image grid through the autoencoder, the output of the autoencoder is
similar to the image shown below:
Figure 7 : Output of the autoencoder for image grid
Each square in the figure shows the input image that maximally activates one of the many
hidden units. We see that the different hidden units have learned to detect edges at different
positions and orientations in the image.
8. Applications:
Image Processing
Computer Vision
Automated systems
Natural Language Processing
Sound Processing
Tactile Recognition
Data Processing
References
“Deep Learning and Unsupervised Feature Learning Winter 2011”, Stanford University,
Stanford, California 94305. https://www.stanford.edu/class/cs294a/
“Stanford’s Unsupervised Feature and Deep Learning tutorial”, Stanford University,
Stanford, California 94305.
http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
Nando de Freitas, “Deep learning with autoencoders”, Department of Computer Science,
University of British Columbia, Vancouver, Canada (March 2013, Lecture for CPSC540).
Tom M. Mitchell, “Machine Learning”, McGraw-Hill Publications, March 1997,
ISBN: 0070428077.
Honglak Lee, “Tutorial on Deep Learning and Applications”, NIPS 2010 Workshop on
Deep Learning and Unsupervised Feature Learning, University of Michigan, USA.
Itamar Arel, Derek C. Rose, and Thomas P. Karnowski, “Deep Machine Learning — A New
Frontier in Artificial Intelligence Research”, The University of Tennessee, USA, IEEE
Computational Intelligence Magazine, November 2010.
Yoshua Bengio, “Learning Deep Architectures for AI”, Foundations and Trends in Machine
Learning, Vol. 2, No. 1 (2009).