
Page 1:

Unsupervised Learning: Autoencoders

Yunsheng Bai

Page 2:

Roadmap
1. Introduction to Autoencoders
2. Sparse Autoencoders (SAE) (2008)
3. Denoising Autoencoders (DAE) (2008)
4. Contractive Autoencoders (CAE) (2011)
5. Stacked Convolutional Autoencoders (SCAE) (2011)
6. Recursive Autoencoders (RAE) (2011)
7. Variational Autoencoders (VAE) (2013)
8. Adversarial Autoencoders (AAE) (2015)
9. Wasserstein Autoencoders (WAE) (2017)
10. Autoencoders for Graphs

Page 3:

Introduction to Autoencoders

Page 4:

Page 5:

https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:GaussianScatterPCA.svg

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Page 6:

https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

Page 7:

https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

Inner product between them

Change of basis

Page 8:

PCA ≈ Autoencoder with Linear Activation Function

Not necessarily orthogonal

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

Page 9:

Could have many layers, but as long as the activation is linear, the whole network collapses to a single W and a single V.

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA ≈ Autoencoder with Linear Activation Function
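A minimal sketch of this equivalence (assuming TensorFlow/Keras and scikit-learn, which the slides do not show): a linear autoencoder trained with MSE recovers essentially the same 2-D subspace as PCA, although its weights are not necessarily orthogonal.

```python
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

# Toy data: 1000 points in 10-D that actually live near a 3-D subspace.
rng = np.random.default_rng(0)
X = (rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 10))).astype("float32")
X -= X.mean(axis=0)

# Linear autoencoder: 10 -> 2 -> 10, no activation functions, MSE loss.
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(2, use_bias=False),    # encoder W
    tf.keras.layers.Dense(10, use_bias=False),   # decoder V
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=200, batch_size=64, verbose=0)

# PCA with 2 components spans (approximately) the same subspace.
pca = PCA(n_components=2).fit(X)
print("PCA reconstruction MSE:",
      np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2))
print("Linear AE reconstruction MSE:", autoencoder.evaluate(X, X, verbose=0))
```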

Page 10:

https://towardsdatascience.com/autoencoders-are-essential-in-deep-neural-nets-f0365b2d1d7c
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

PCA vs Autoencoder
- Autoencoders are much more flexible than PCA.
- NN activation functions introduce "non-linearities" in the encoding, but PCA only does a linear transformation.
- We can stack autoencoders to form a deep autoencoder network.

Page 11:

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

[Figure: a stacked autoencoder with Layer 1, Layer 2, Layer 3, Layer 4.]

Page 12:

Goal: Learn Useful Features from Data

We've seen that autoencoders can do PCA, but fundamentally, why does an autoencoder work?

https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694

Page 13:

Goal: Feature/Representation Learning

Why can't an autoencoder simply copy input to output through identity functions?

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

[Figure: with 6×6 identity weight matrices in both the encoder and the decoder, the autoencoder simply copies the input to the output.]

Overcomplete

Objective: min_{f,g} ||x − g(f(x))||^2, with encoder f and decoder g.

Page 14:

To Achieve Feature Learning, Conflicting Goals

Autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Page 15:

Undercomplete Autoencoders

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
http://rgraphgallery.blogspot.com/2013/04/rg-3d-scatter-plots-with-vertical-lines.html

Encoders and decoders are too powerful :(
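One standard remedy is an undercomplete architecture: make the code much smaller than the input so the network cannot simply copy. A minimal sketch (tf.keras assumed; the 784 → 32 sizes are illustrative, e.g. flattened 28×28 images):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))                         # flattened input
code = tf.keras.layers.Dense(32, activation="relu")(inputs)   # bottleneck h: 784 -> 32
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(code)
autoencoder = tf.keras.Model(inputs, outputs)

# Pure reconstruction objective: min ||x - g(f(x))||^2
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)  # x_train: your data
```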

“If you could speak only a few words per month, you would probably try to make them worth listening to.”

Page 16:

Regularized Autoencoders

Regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.

→ introduce new things to the loss

→ they are just different regularizers

2008: Sparse Autoencoders (SAE)

2008: Denoising Autoencoders (DAE)

2011: Contractive Autoencoders (CAE)

2011: Stacked Convolutional Autoencoders (SCAE)

2011: Recursive Autoencoders (RAE)

2013: Variational Autoencoders (VAE)

2015: Adversarial Autoencoders (AAE)

2017: Wasserstein Autoencoders (WAE)

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Page 17:

Properties of Autoencoders (Ideally)

1. Learn useful features from data (effective representations)
   a. Capture the intrinsic properties of data → feed them into downstream applications
   b. Can be thought of as patterns in data → generate new data

2. Produce low-dimensional vectors (efficient/compact representations)
   a. Efficient for storage
   b. Efficient for downstream models
   c. May be free of noise in the input
   d. Easier to visualize than high-dimensional data

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Page 18:

Properties of Autoencoders (Ideally)

3. Are flexible: can be modified/guided/regularized in various ways:
   a. Input data, e.g. add noise
   b. Output data, e.g. something different from the input
   c. Architecture, e.g. fully connected layer → convolutional layer
   d. Loss, e.g. add additional loss terms → capture other useful information from the input
   e. Latent space, e.g. Gaussian (more later in VAE)
      i. Enforce certain prior knowledge, usually through additional loss terms
      ii. Analyzing the latent space/representations is a trend (?), e.g. debiasing word embeddings
   f. … (Be creative! This is where research comes from)

Page 19:

History of Autoencoders

10 years ago, we thought that deep nets would also need an unsupervised cost, like the autoencoder cost, to regularize them.

Today, we know we are able to recognize images just by using backprop on the supervised cost as long as there is enough labeled data.

(Humans can learn from very few labeled examples. Why? One popular hypothesis: Brain can leverage unsupervised or semi-supervised learning.)

There are other tasks where we do still use autoencoders, but they’re not the fundamental solution to training deep nets that people once thought they were going to be.

(Ian Goodfellow, 2016)
https://www.quora.com/Why-are-autoencoders-considered-a-failure-What-are-their-alternatives
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XGlorot2011
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Page 20:

Applications of Autoencoders

1. Data Compression for Storage
   a. Difficult to train an autoencoder better than a basic algorithm like JPEG
   b. Autoencoders are data-specific: may be hard to generalize to unseen data

2. Dimensionality Reduction for Data Visualization
   a. t-SNE is good, but typically requires relatively low-dimensional data
      i. For high-dimensional data, first use an autoencoder, then use t-SNE (see the sketch below)
   b. Latent space visualization (more later)

https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008
https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df
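A brief sketch of the autoencoder-then-t-SNE pipeline from item 2.a.i (tf.keras and scikit-learn assumed; the random array stands in for real high-dimensional data):

```python
import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE

x = np.random.rand(500, 784).astype("float32")   # stand-in for real high-dim data

# Train a small autoencoder, then keep only the encoder half.
inp = tf.keras.Input(shape=(784,))
code = tf.keras.layers.Dense(32, activation="relu")(inp)
out = tf.keras.layers.Dense(784, activation="sigmoid")(code)
ae = tf.keras.Model(inp, out)
ae.compile(optimizer="adam", loss="mse")
ae.fit(x, x, epochs=5, verbose=0)
encoder = tf.keras.Model(inp, code)

# Run t-SNE on the 32-D codes instead of on the raw 784-D inputs.
xy = TSNE(n_components=2).fit_transform(encoder.predict(x, verbose=0))
```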

Page 21:

Applications of Autoencoders

3. Unsupervised Pretraining
   a. Greedy Layer-Wise Unsupervised Pretraining: train each layer of a feedforward net greedily; continue stacking layers; the output of prior layers is the input for the next one; fine-tune
   b. Today, we have random weight initialization, rectified linear units (ReLUs) (2011), dropout (2012), batch normalization (2014), residual learning (2015) + large labeled datasets
   c. Still useful
      i. Train a deep autoencoder
      ii. Train an autoencoder on an unlabeled dataset, and reuse the lower layers to create a new network trained on the labeled data (~supervised pretraining; see the sketch below)
      iii. Train an autoencoder on an unlabeled dataset, and use the learned representations in downstream tasks (see more in 4)

https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008
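A sketch of item 3.c.ii (tf.keras assumed; x_unlabeled, x_labeled, and y_labeled are hypothetical placeholders for your datasets):

```python
import tensorflow as tf

# Phase 1: train an autoencoder on the (large) unlabeled set.
inp = tf.keras.Input(shape=(784,))
h1 = tf.keras.layers.Dense(256, activation="relu")(inp)
h2 = tf.keras.layers.Dense(64, activation="relu")(h1)
rec = tf.keras.layers.Dense(784, activation="sigmoid")(h2)
ae = tf.keras.Model(inp, rec)
ae.compile(optimizer="adam", loss="mse")
# ae.fit(x_unlabeled, x_unlabeled, epochs=20)

# Phase 2: reuse the lower (encoder) layers, add a classification head,
# and train/fine-tune on the (small) labeled set.
clf_out = tf.keras.layers.Dense(10, activation="softmax")(h2)
classifier = tf.keras.Model(inp, clf_out)      # shares the pretrained encoder weights
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(x_labeled, y_labeled, epochs=10)
```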

Page 22:

Greedy Layer-Wise Unsupervised Pretraining for Training Deep Autoencoders

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Page 23:

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Unsupervised Pretraining for Supervised Tasks

Page 24:

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Unsupervised Pretraining for Supervised Tasks

Downside: two-stage training → hyperparameter tuning :(

Page 25:

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Supervised Pretraining

Page 26:

https://www.youtube.com/watch?v=R3DNKE3zKFk

Multi-Task Learning

Transfer Learning Domain Adaptation

supervised pretraining

Page 27:

https://www.youtube.com/watch?v=R3DNKE3zKFk

Multi-Task Learning

Page 28:

Applications of Autoencoders

4. Generate Representations for Downstream Tasks
   a. Special case of unsupervised pretraining (3.c.iii)
   b. Useful when the initial representation is poor, and there is a lot of unlabeled data
      i. Word embeddings (better than one-hot representations)
      ii. Graph node embeddings
      iii. Image embeddings (Images already lie in a rich vector space? Check out puppy image embeddings!)
      iv. Semantic hashing: turn database entries (text, image, etc.) into low-dimensional and binary codes → information retrieval
   c. Question: If there are labels, is there any reason to use a decoder with a reconstruction loss?

5. Generate New Data (Generative Model)
   a. Especially Variational Autoencoders (VAE) and Adversarial Autoencoders (AAE) (more later)
   b. Creative applications (more later)

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Page 29:

[Figure: graph node embedding. Copy the output of hidden layer 2 (the embedding) and use it as input to a downstream model, e.g. logistic regression, an SVM, or another classifier.]

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Page 30:

[Figure: semantic hashing. Copy the output of layer 2 (the code) for the query and for the entries in the database, then compare the codes to retrieve matches.]

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
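A toy sketch of the code-comparison step (NumPy only; random vectors stand in for real layer-2 outputs):

```python
import numpy as np

def to_binary_code(codes, threshold=0.5):
    """Turn real-valued hidden representations into binary codes."""
    return (codes > threshold).astype(np.uint8)

def hamming_search(query_code, db_codes, k=5):
    """Return indices of the k database entries whose codes are closest to the query."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
db = to_binary_code(rng.random((1000, 32)))    # 1000 database entries, 32-bit codes
q = to_binary_code(rng.random(32))             # one query
print(hamming_search(q, db))                   # indices of the 5 nearest entries
```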

Page 31:

Applications of Autoencoders

6. Self-supervised Learning
   a. ∈ supervised learning where the targets are generated from the input data
   b. Merely learning to reconstruct the input might not be enough to learn abstract features of the kind that label-supervised learning induces (where targets are "dog", "car", ...)
      i. Data denoising
      ii. Jigsaw puzzle solver
      iii. ...

https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008

Page 32:

Skipgram vs Autoencoders

1. In NLP word embeddings, why is Skipgram more popular than autoencoders?
   a. Simpler
   b. More efficient
   c. Works well already

2. When does Skipgram no longer suffice? Additional goals, e.g.
   a. Denoising
   b. Complex characteristics of word use + polysemy → use a bidirectional LSTM with attention as the encoder!
   c. Generative setting (generate new data)
   d. Inductive setting (embed unseen words)

3. Can Skipgram be viewed as a special case of some autoencoder model?
   a. In fact, encoding and decoding are very general concepts and are used in many places

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).

Page 33:

Roadmap (repeated; see Page 2)

Page 34:

Sparse Autoencoders (SAE) (2008)

Page 35:

Motivation 1: Sparse Coding

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

An image should be represented by only a few bases.

Page 36:

Motivation 1: Sparse Coding

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

A document should be about only a few topics.

Page 37:

Motivation 1: Sparse Coding

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

Page 38:

Motivation 1: Sparse Coding

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

[Figure: change of basis with a dictionary D. Encode: h = D^T x. Decode: x = D h = D D^T x → D D^T = I.]

Change of basis + sparsity constraint on h.

“If you could speak only a few words per month, you would probably try to make them worth listening to.”

Page 39:

Inner product between them

https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

Change of basis

Recall PCA

Page 40:

Motivation 2: Prevent Identity Transform

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Encoder: f(x) = W^T x = h. Decoder: g(h) = W h = W W^T x = x → W W^T = I (fine).

[Figure: in the overcomplete case, however, the weight matrices can degenerate to W ≈ I (identity-like matrices padded with zeros), and the autoencoder just copies the input.]

Page 41:

Motivation 2: Prevent Identity Transform

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

[Figure: the input reconstructed as 0.2·(basis 1) + 0.3·(basis 2) + 0.1·(basis 3) + ..., i.e. the same as the input x, where the weight matrix W is a 16×16 identity.]

In the case of images, we can think of W as a set of convolution filters (each with the same size as the input, e.g. 4×4).

Page 42:

Motivation 2: Prevent Identity Transform

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

[Figure: one-hot basis vectors (1000000000000000, 0100000000000000, 0010000000000000, ...) combined as 1·(basis 1) + 0·(basis 2) + 1·(basis 3) + ...; each basis selects a single pixel of the input.]

Page 43:

Sparse Autoencoders (encoder f, decoder g)

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf

Loss (following the Stanford sparse-autoencoder notes cited above) = (1/m) Σ_i ||x_i − g(f(x_i))||^2 (reconstruction loss over the m training samples) + λ ||W||^2 (regularization term) + β Σ_j KL(ρ || ρ̂_j) (sparsity penalty), where ρ̂_j is the average activation of hidden unit j of layer 2 (assuming two layers in the encoder); see the sketch below.

This results in sparse activation of hidden units across training points, but does not guarantee that each input has a sparse representation. (Makhzani, Alireza, and Brendan Frey. "K-sparse autoencoders." arXiv preprint arXiv:1312.5663 (2013).)
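A sketch of that loss in tf.keras (assumed; the per-batch average activation stands in for the average over the whole training set):

```python
import tensorflow as tf

rho, beta, lam = 0.05, 3.0, 1e-4   # target sparsity, sparsity weight, weight decay

def kl_sparsity(h):
    """beta * sum_j KL(rho || rho_hat_j), rho_hat_j = mean activation of hidden unit j."""
    rho_hat = tf.reduce_mean(h, axis=0)
    kl = (rho * tf.math.log(rho / (rho_hat + 1e-10)) +
          (1 - rho) * tf.math.log((1 - rho) / (1 - rho_hat + 1e-10)))
    return beta * tf.reduce_sum(kl)

inp = tf.keras.Input(shape=(784,))
h = tf.keras.layers.Dense(128, activation="sigmoid",
                          kernel_regularizer=tf.keras.regularizers.l2(lam),
                          activity_regularizer=kl_sparsity)(inp)
out = tf.keras.layers.Dense(784, activation="sigmoid")(h)
sae = tf.keras.Model(inp, out)
sae.compile(optimizer="adam", loss="mse")   # reconstruction loss + the two penalties
# sae.fit(x_train, x_train, epochs=20, batch_size=256)
```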

Page 44:

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Results

Page 45:

Techniques to Interpret Autoencoders

1. Visualize the weight matrix W
   a. Each column of W corresponds to the weights of a particular neuron
   b. When there is a natural interpretation of the weights, we can visualize them
      i. Especially true in the case of images, as seen previously (~convolution filters)
      ii. Especially true for the top hidden layers, since they often capture relatively large features

2. Visualize the most exciting input per neuron (see the sketch below)
   a. Treat each neuron as a feature detector. To find the feature a particular neuron is looking for:
      i. Feed a random input
      ii. Measure the activation of the neuron you are interested in
      iii. Perform backpropagation to tweak the input so that the neuron will activate even more (gradient ascent)
      iv. Iterate several times

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
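A sketch of step 2 (tf.keras assumed; written for a dense hidden layer, using gradient ascent on the input):

```python
import tensorflow as tf

def most_exciting_input(model, layer_name, unit, steps=100, lr=0.1):
    """Gradient-ascend a random input so that one hidden unit activates strongly."""
    feature_model = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    x = tf.Variable(tf.random.uniform((1,) + model.input_shape[1:]))   # random input
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_model(x)[0, unit]    # the neuron of interest
        x.assign_add(lr * tape.gradient(activation, x))  # step uphill (gradient ascent)
    return x.numpy()

# Usage (hypothetical layer name): img = most_exciting_input(autoencoder, "dense_1", unit=7)
```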

Page 46:

Roadmap (repeated; see Page 2)

Page 47:

Denoising Autoencoders (DAE) (2008)

Page 48:

Sparse Coding Could Also Handle Image Denoising

Key: the use of sparse and redundant representations over trained dictionaries.

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf
Elad, Michael, and Michal Aharon. "Image denoising via learned dictionaries and sparse representation." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 1. IEEE, 2006.

Page 49:

Denoising Autoencoders: Implementation-level

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
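A minimal denoising-autoencoder sketch (tf.keras assumed; random data stands in for real inputs): corrupt the input, but reconstruct the clean target.

```python
import numpy as np
import tensorflow as tf

inp = tf.keras.Input(shape=(784,))
h = tf.keras.layers.Dense(64, activation="relu")(inp)
out = tf.keras.layers.Dense(784, activation="sigmoid")(h)
dae = tf.keras.Model(inp, out)
dae.compile(optimizer="adam", loss="mse")

x_clean = np.random.rand(1000, 784).astype("float32")           # stand-in for real data
x_noisy = x_clean + 0.3 * np.random.normal(size=x_clean.shape).astype("float32")
x_noisy = np.clip(x_noisy, 0.0, 1.0)                             # Gaussian corruption

dae.fit(x_noisy, x_clean, epochs=5, batch_size=128, verbose=0)   # noisy in, clean out
```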

Page 50:

Denoising Autoencoders: Results

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Gaussian noise

Page 51:

Denoising Autoencoders: Results

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Salt and pepper noise

Page 52:

Denoising Autoencoders: Research-level

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Why is this equivalent to a reconstruction loss? (1) Intuitively. (2) Recall that the least-squares estimate is the same as the maximum-likelihood estimate under a Gaussian model.

Page 53:

Roadmap (repeated; see Page 2)

Page 54:

Contractive Autoencoders (CAE) (2011)

Page 55:

CAE: Resist Infinitesimal Perturbations of the Input

All autoencoder training procedures involve a compromise between two opposing forces: being data-specific and being data-insensitive.

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

CAE and DAE are equivalent under certain conditions.
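The CAE penalizes the Frobenius norm of the encoder's Jacobian (the "smallness of the derivative of the representation" mentioned on Page 16). A sketch of that loss (tf.keras assumed; minimize it with a standard optimizer loop, which requires second-order gradients):

```python
import tensorflow as tf

encoder = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="sigmoid")])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(784, activation="sigmoid")])
lam = 1e-4   # weight of the contractive penalty

def cae_loss(x):
    """Reconstruction loss + lam * ||Jacobian of the encoder at x||_F^2."""
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)
    jac = tape.batch_jacobian(h, x)                        # shape: (batch, 64, 784)
    contractive = tf.reduce_mean(tf.reduce_sum(tf.square(jac), axis=[1, 2]))
    recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - decoder(h)), axis=1))
    return recon + lam * contractive
```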

Page 56:

Roadmap (repeated; see Page 2)

Page 57:

Stacked Convolutional Autoencoders (SCAE) (2011)

Page 58:

SCAE

Use convolutional + pooling layers instead of fully connected layers.

Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2016): 295-307.
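A sketch of a convolutional autoencoder in the spirit of the Keras blog post cited earlier (tf.keras assumed; 28×28 grayscale inputs):

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(28, 28, 1))
# Encoder: convolution + pooling instead of fully connected layers.
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
code = layers.MaxPooling2D(2)(x)                        # 7 x 7 x 8 feature maps
# Decoder: convolution + upsampling back to the input resolution.
x = layers.Conv2D(8, 3, activation="relu", padding="same")(code)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)
out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

scae = tf.keras.Model(inp, out)
scae.compile(optimizer="adam", loss="binary_crossentropy")
# scae.fit(x_train, x_train, epochs=20, batch_size=128)
```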

Page 59:

Image Deblurring/Denoising/Super-Resolution and Image Colorization

Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2016): 295-307.
https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694

Page 60:

Roadmap (repeated; see Page 2)

Page 61:

Recursive Autoencoders (RAE) (2011)

Page 62:

Sentence Representation

Socher, Richard, et al. "Semi-supervised recursive autoencoders for predicting sentiment distributions." Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2011.

Why not simple average? “white blood cells destroying an infection” ≠ “an infection destroying white blood cells”

Page 63:

Sentence Representation

https://www.doc.ic.ac.uk/~js4416/163/website/nlp/recursive.html
Socher, Richard, et al. "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection." Advances in neural information processing systems. 2011.

Could use parse tree

Could introduce a supervised loss

Could penalize top-level nodes more heavily, which contain more children

Could use many layers

Could normalize the hidden representations

Could predict all children underneath → unfolding RAE
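A toy sketch of the basic RAE step (NumPy; random weights, no training loop, biases omitted): two child vectors are composed into a parent, and the parent is scored by how well it reconstructs its children.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                            # word/phrase embedding dimension
W_enc = rng.normal(scale=0.1, size=(d, 2 * d))    # encoder (compose) weights
W_dec = rng.normal(scale=0.1, size=(2 * d, d))    # decoder (reconstruct) weights

def compose(c1, c2):
    """Merge two child embeddings into one parent embedding."""
    parent = np.tanh(W_enc @ np.concatenate([c1, c2]))
    return parent / np.linalg.norm(parent)        # optional normalization (see above)

def reconstruction_loss(c1, c2):
    """Reconstruct both children from the parent; sum of squared errors."""
    c1_hat, c2_hat = np.split(np.tanh(W_dec @ compose(c1, c2)), 2)
    return np.sum((c1 - c1_hat) ** 2) + np.sum((c2 - c2_hat) ** 2)

# Greedily merging the adjacent pair with the lowest reconstruction loss (or following
# a parse tree) bottom-up yields a vector for the whole sentence.
```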

Page 64:

Roadmap (repeated; see Page 2)

Page 65:

Variational Autoencoders (VAE) (2013)

Page 66:

VAE: Intuition

https://www.jeremyjordan.me/variational-autoencoders/

Encoder Outputs Statistical Distributions; Feed Samples into Decoder → Add Noise at All Times; Generate New Data After Training

Page 67:

VAE: Implementation-level

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Probabilistic (outputs are partly determined by chance, even after training) + generative autoencoders.

Assume the prior distribution of z, i.e. p(z), to be Gaussian → encourage the learned posterior q(z|x) to be similar to p(z) through an additional loss term measuring their KL divergence.
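A sketch of the reparameterization trick and the extra KL loss term (classic tf.keras functional-API VAE; layer sizes are illustrative):

```python
import tensorflow as tf

latent_dim = 2
inp = tf.keras.Input(shape=(784,))
h = tf.keras.layers.Dense(256, activation="relu")(inp)
z_mean = tf.keras.layers.Dense(latent_dim)(h)        # encoder outputs a distribution:
z_log_var = tf.keras.layers.Dense(latent_dim)(h)     # mean and log-variance of q(z|x)

def sample(args):
    """Reparameterization trick: z = mean + sigma * epsilon, epsilon ~ N(0, I)."""
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = tf.keras.layers.Lambda(sample)([z_mean, z_log_var])   # noise at all times
h_dec = tf.keras.layers.Dense(256, activation="relu")(z)
out = tf.keras.layers.Dense(784, activation="sigmoid")(h_dec)
vae = tf.keras.Model(inp, out)

# Additional loss term: KL(q(z|x) || p(z)) with a standard Gaussian prior p(z).
kl = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
vae.add_loss(kl)
vae.compile(optimizer="adam", loss="binary_crossentropy")   # reconstruction + KL
# vae.fit(x_train, x_train, epochs=30, batch_size=128)
```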

Page 68:

VAE: Research-level

Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
https://www.jeremyjordan.me/variational-autoencoders/

[Figure: the VAE graphical model and its ingredients:
- generative model (probabilistic decoder)
- probabilistic encoder (recognition model): a variational approximation to the intractable true posterior
- latent representation or code
→ Variational Bayesian Inference]

Page 69:

MusicVAE: Generative Model → Creative Artists

https://magenta.tensorflow.org/music-vae
Roberts, Adam, et al. "A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music." arXiv preprint arXiv:1803.05428 (2018).

The desirable properties of a latent space can be summarized as follows:

1. Expression: Any real example can be mapped to some point in the latent space and reconstructed from it.

2. Realism: Any point in this space represents some realistic example, including ones not in the training set.

3. Smoothness: Examples from nearby points in latent space have similar qualities to one another.

https://experiments.withgoogle.com/ai/beat-blender/view/

Page 70:

Key: Design Latent Space Properties

https://www.jeremyjordan.me/variational-autoencoders/

Learn smooth latent state representations of the input data. Good for interpolation, sampling, generation, downstream classification, etc.

“holes” :(

Page 71:

Interpolation → Smooth Transformation

https://www.jeremyjordan.me/variational-autoencoders/

Page 72:

SketchRNN: Seq2seq + Variational Autoencoder

https://research.googleblog.com/2017/04/teaching-machines-to-draw.html

A sequence-to-sequence (seq2seq) autoencoder framework with variational inference.
A sketch → a sequence of motor actions controlling a pen (how about text or a graph as a sequence?).
By adding noise to the latent vector, the model cannot reproduce the input sketch exactly.

Arithmetic operations on sketch embeddings!

Smoothness of latent space

Page 73:

Roadmap (repeated; see Page 2)

Page 74:

Adversarial Autoencoders (AAE) (2015)

Page 75:

AAE: Regularized by an Adversarial Network, Which Guides the Posterior q(z|x) to Match Any Arbitrary Prior p(z)

[Figure: VAE vs AAE. The VAE pulls q(z|x) toward the prior p(z) with an additional D_KL loss term; the AAE replaces that term with an adversarial network.]

Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).

Page 76:

AAE: Design Arbitrary Prior

Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).

Page 77:

AAE: Labels Can Further Guide (Semi-Supervised)

Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).

Page 78:

Roadmap (repeated; see Page 2)

Page 79:

Wasserstein Autoencoders (WAE) (2017)

Page 80:

WAE: Motivation

VAE
Pros:
1. Theoretically elegant
2. Stable training
3. Encoder-decoder architecture
4. Nice latent manifold structure
Cons:
1. Tends to generate blurry samples

GAN
Pros:
1. Good visual quality of images
Cons:
1. Harder to train
2. No encoder; only a decoder/generator and a discriminator
3. "Mode collapse" problem
4. ~JS divergence, "worse" than the Wasserstein distance (see details in the paper)

Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).

Page 81:

Combine VAE + GAN in a Principled Way?

[Figure: VAE = encoder + decoder, with an additional D_KL loss term against the prior p(z); GAN = decoder/generator + discriminator.]

Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).

Page 82:

AAE

WAE: a generalization of AAE; minimizes the Wasserstein distance between the model and the target distribution.

Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).
https://openreview.net/forum?id=HkL7n1-0b

Page 83:

Roadmap (repeated; see Page 2)

Page 84:

Autoencoders for Graphs

Page 85:

Graphs Are Different

1. Are there smooth linear interpolations? Arithmetic operations?
2. A graph is composed of correlated substructures
   a. E.g. two triangles → a rectangle
   b. Hierarchy: pixels (atomic) → patterns → images; words (atomic) → phrases → sentences → paragraphs/documents; nodes (atomic) → substructures → graphs (transfer learning)
3. Graphs are of different sizes
4. Graph nodes lack order
5. How to detect substructures?
   a. For images, convolutional layers → SCAE
   b. For graphs, graph convolutional layers → node/substructure/graph?
   c. Some people treat a graph as sequences/random walks → "deconstruction" view
      i. ~Parse sentences into trees instead of feeding them into an LSTM
   d. How about decomposing graphs into equal-size subgraphs?

Page 86:

Simonovsky, Martin, and Nikos Komodakis. "GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders." arXiv preprint arXiv:1802.03480 (2018).

GraphVAE

Page 87:

Simonovsky, Martin, and Nikos Komodakis. "GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders." arXiv preprint arXiv:1802.03480 (2018).

Page 88:

Thank you!