
Page 1: Dnn Tutorial

Deep Learning: Past, Present and Future (?)

Kyunghyun Cho

Laboratoire d’Informatique des Systèmes Adaptatifs, Département d’informatique et de recherche opérationnelle,

Faculté des arts et des sciences, Université de Montréal

[email protected] ([email protected])

Page 2: Dnn Tutorial

Machine Learning

[Diagram: a model M and data D connected by arrows labeled Learning and Inference]

1. Let the model M learn the data D
2. Let the model M infer unknown quantities

Page 3: Dnn Tutorial

Machine Learning & Perception


Perception is the organization, identification, and interpretation of sensory information in order to represent and understand the environment.

– Wikipedia

(Farabet et al., 2013)

Page 4: Dnn Tutorial

Machine Learning & Perception: Examples


Data                 Sensory Information              Query
Labeled Images       An image                         Is there a cat in the image?
Transcribed Speech   A speech segment                 What is this person saying?
Paraphrases          A pair of sentences              Is this sentence a paraphrase?
Movie Ratings        Ratings of movies Y by users X   Will user X like movie Y?
Parallel Corpora     A Finnish sentence               What is “moi” in English?

Page 5: Dnn Tutorial

A human possesses the best machinery for perception, called a brain.

But, how does our brain do it?

Page 6: Dnn Tutorial

Deep Learning: Motivated from Human Learning

(Van Essen&Gallant, 1994)

- Learn from massive data
- Simple functions
- Multi-layered

(Krizhevsky et al., 2012)

Page 7: Dnn Tutorial

Boltzmann machines? I remember working on them in the ’80s and ’90s...

– Anonymous interviewer, 2011 (paraphrased)

Page 8: Dnn Tutorial

Deep Learning: History

1958 Rosenblatt proposed perceptrons
1980 Neocognitron (Fukushima, 1980)

1982 Hopfield network, SOM (Kohonen, 1982), Neural PCA (Oja, 1982)

1985 Boltzmann machines (Ackley et al., 1985)

1986 Multilayer perceptrons and backpropagation (Rumelhart et al., 1986)

1988 RBF networks (Broomhead&Lowe, 1988)

1989 Autoencoders (Baldi&Hornik, 1989), Convolutional network (LeCun, 1989)

1992 Sigmoid belief network (Neal, 1992)

1993 Sparse coding (Field, 1993)

Page 9: Dnn Tutorial
Page 10: Dnn Tutorial
Page 11: Dnn Tutorial

Why all this fuss about deep learning now?

Page 12: Dnn Tutorial

ImageNet: ILSVRC 2012 – Classification Task

Top Rankers
1. SuperVision (0.153): Deep Conv. Neural Network (Krizhevsky et al.)

2. ISI (0.262): Features + FV + Linear classifier (Gunji et al.)

3. OXFORD_VGG (0.270): Features + FV + SVM (Simonyan et al.)

4. XRCE/INRIA (0.271): SIFT + FV + PQ + SVM (Perronin et al.)

5. University of Amsterdam (0.300): Color desc. + SVM (van de Sande et al.)

(Krizhevsky et al., 2012)

Page 13: Dnn Tutorial

ImageNet: ILSVRC 2013 – Classification Task

Top Rankers
1. Clarifai (0.117): Deep Convolutional Neural Networks (Zeiler)
2. NUS: Deep Convolutional Neural Networks
3. ZF: Deep Convolutional Neural Networks
4. Andrew Howard: Deep Convolutional Neural Networks
5. OverFeat: Deep Convolutional Neural Networks
6. UvA-Euvision: Deep Convolutional Neural Networks
7. Adobe: Deep Convolutional Neural Networks
8. VGG: Deep Convolutional Neural Networks
9. CognitiveVision: Deep Convolutional Neural Networks
10. decaf: Deep Convolutional Neural Networks
11. IBM Multimedia Team: Deep Convolutional Neural Networks
12. Deep Punx (0.209): Deep Convolutional Neural Networks
13. MIL (0.244): Local image descriptors + FV + linear classifier (Hidaka et al.)
14. Minerva-MSRA: Deep Convolutional Neural Networks
15. Orange: Deep Convolutional Neural Networks
16. BUPT-Orange: Deep Convolutional Neural Networks
17. Trimps-Soushen1: Deep Convolutional Neural Networks
18. QuantumLeap: 15 features + RVM (Shu&Shu)

(Sermanet et al., 2013)

Page 14: Dnn Tutorial

You’re already using deep learning!

Page 15: Dnn Tutorial

How do you tell deep learning from not-so-deep machine learning?

Page 16: Dnn Tutorial

Not-So-Deep Machine Learning

1. Feature engineering ← not learned!
2. Learning
3. Inference

[Diagram: inputs x_1, x_2, ..., x_N → features, annotated with the steps (1), (2), (3) above]

Separation between domain knowledge and general machine learning

Page 17: Dnn Tutorial

Unsupervised Learning of Representation

1a. Feature engineering
1b. Feature/Representation learning
2. Learning
3. Inference

[Diagram: inputs x_1, x_2, ..., x_N → features, annotated with the steps (1a), (1b), (2), (3) above]

Page 18: Dnn Tutorial

Deep Learning: toward the Ultimate Machine Learning?

1. Jointly learn everything
2. Inference

“The data decides.” – Yoshua Bengio

Page 19: Dnn Tutorial

Why now? Why not 20 years ago?

Page 20: Dnn Tutorial

What has happened in the last 20+ years?

- We have connected the dots, e.g.,
  - PCA ⇔ Neural PCA ⇔ Probabilistic PCA ⇔ Autoencoder
  - Autoencoder ⇔ Belief network ⇔ Restricted Boltzmann machine
- We understand learning better
  - Model structure matters a lot
  - Learning is, but is not, optimization
  - No need to be scared of non-convex optimization
- We understand how learning and inference interact
- Exponential growth of the amount of data and computational power

Page 21: Dnn Tutorial

And Beyond. . .

(Goodfellow, 2013)

Page 22: Dnn Tutorial

Today’s Tutorial: Introduction to Deep Learning

1. Deep Learning: Past, Present and Future
2. Supervised Neural Networks
   - Multilayer Perceptron and Learning
   - Regularization
   - Practical Recipe
   - Task-specific Neural Networks
3. Unsupervised Neural Networks
   - Unsupervised Learning
   - Generative Modeling with Neural Networks
     - Density/Distribution Learning
     - Learning to Infer
   - Semi-Supervised Learning and Pretraining
4. Advanced Topics
   - Beyond Computer Vision
   - Advanced Topics

Page 23: Dnn Tutorial

Supervised Neural Networks

Kyunghyun Cho

Laboratoire d’Informatique des Systèmes Adaptatifs, Département d’informatique et de recherche opérationnelle,

Faculté des arts et des sciences, Université de Montréal

[email protected], [email protected]

Page 24: Dnn Tutorial

Warning!!!

The next 9–10 slides may be extremely boring.

Page 25: Dnn Tutorial

Supervised Learning: Rough Picture

Data:

D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}

Assumption:

y_n = f*(x_n)

Find a function f, using D, that emulates f* as well as possible on samples potentially not included in D.

Page 26: Dnn Tutorial

Supervised Learning: Probabilistic Picture

Underlying distributions:

X and Y | X

Data:

D = {(x, y) | x ∼ X and y ∼ Y | X = x}

Find a distribution p(y | x), using D, that emulates Y | X as well as possible on new samples from p(x) potentially not included in D.

Page 27: Dnn Tutorial

Supervised Learning: Evaluation and Generalization

Evaluation:

E_{x∼p(x)} [‖f(x) − f*(x)‖_p] ≈ (1/|D_test|) Σ_{x∈D_test} ‖f(x) − f*(x)‖_p

or

E_{x∼p(x)} [KL(p(y | x) ‖ p*(y | x))] ≈ (1/|D_test|) Σ_{x∈D_test} KL(p(y | x) ‖ p*(y | x)),

where D ≠ D_test

Why do we test the found solution f or p on D_test, not on D?

Page 28: Dnn Tutorial

Linear/Ridge Regression

Underlying assumption:

y = W*x + b* + ε,

where ε is white Gaussian noise.

Data:

D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)},

where y_n = W*x_n + b* + ε.

Learning:

W, b = argmin_{W,b} (1/N) Σ_{n=1}^N ‖W x_n + b − y_n‖² + λ(‖W‖²_F + ‖b‖²)
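This objective has a closed-form minimizer. Below is a minimal sketch (my own code, not from the tutorial) that fits W and b on synthetic data by appending a constant 1 to the inputs and solving the regularized normal equations:

```python
# Ridge regression via the closed-form solution of the penalized least-squares
# objective above (the bias is absorbed into an augmented input column).
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
W_true, b_true = rng.normal(size=(1, d)), rng.normal(size=1)
X = rng.normal(size=(N, d))
y = X @ W_true.T + b_true + 0.1 * rng.normal(size=(N, 1))   # y = W*x + b* + noise

lam = 0.1
Xa = np.hstack([X, np.ones((N, 1))])             # augment with a bias column
A = Xa.T @ Xa / N + lam * np.eye(d + 1)          # (1/N) Xa^T Xa + lambda I
Wb = np.linalg.solve(A, Xa.T @ y / N)            # minimizer of the penalized loss
W_hat, b_hat = Wb[:-1].T, Wb[-1]
print("W_hat :", W_hat.round(2))
print("W_true:", W_true.round(2))
```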

Page 29: Dnn Tutorial

Multilayer Perceptron: (Binary) Classification

Underlying assumption:

y ∼ B(p = f_θ*(x)),

where f_θ* is a nonlinear function parameterized by θ* and B(p) is a Bernoulli distribution with mean p.

Data:

D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)},

where y_n ∼ B(p = f_θ*(x_n)).

Learning:

θ = argmin_θ −(1/N) Σ_{n=1}^N [y_n log f_θ(x_n) + (1 − y_n) log(1 − f_θ(x_n))] + λΩ(θ, D)
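A minimal sketch (mine, not the tutorial's code) of evaluating this objective, with f_θ taken to be a tiny one-hidden-layer network and Ω a simple weight penalty (both are assumptions for illustration):

```python
# Regularized negative Bernoulli log-likelihood for binary classification.
import numpy as np

def f_theta(x, W, U):
    """Predicted Bernoulli mean p = sigmoid(U . tanh(W^T x))."""
    h = np.tanh(W.T @ x)
    return 1.0 / (1.0 + np.exp(-U @ h))

def regularized_nll(data, W, U, lam=1e-3):
    eps = 1e-12                                   # avoid log(0)
    nll = 0.0
    for x, y in data:
        p = f_theta(x, W, U)
        nll -= y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)
    omega = (W ** 2).sum() + (U ** 2).sum()       # a simple weight-decay Omega
    return nll / len(data) + lam * omega

rng = np.random.default_rng(0)
W, U = rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=4)
data = [(rng.normal(size=3), rng.integers(0, 2)) for _ in range(8)]
print(regularized_nll(data, W, U))
```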

Page 30: Dnn Tutorial

Learning as an Optimization

Ultimately, learning is (mostly)

θ = argmin_θ (1/N) Σ_{n=1}^N c((x_n, y_n) | θ) + λΩ(θ, D),

where c((x, y) | θ) is a per-sample cost function.

Page 31: Dnn Tutorial

Gradient Descent

Gradient-descent Algorithm:

θ_t = θ_{t−1} − η ∇L(θ_{t−1}),

where, in our case,

L(θ) = (1/N) Σ_{n=1}^N l((x_n, y_n) | θ).

Let us assume that Ω(θ, D) = 0.

Page 32: Dnn Tutorial

Stochastic Gradient Descent

Often, it is too costly to compute L(θ) due to a large training set.

Stochastic gradient descent algorithm:

θ_t = θ_{t−1} − η_t ∇l((x′, y′) | θ_{t−1}),

where (x′, y′) is a randomly chosen sample from D, and

Σ_{t=1}^∞ η_t → ∞ and Σ_{t=1}^∞ (η_t)² < ∞.

Let us assume that Ω(θ, D) = 0.
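A minimal stochastic-gradient-descent sketch (my own toy, not the tutorial's code): least-squares linear regression trained one randomly chosen sample at a time, with a decaying step size chosen so that Σ η_t diverges while Σ η_t² stays finite:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 3
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.05 * rng.normal(size=N)

def grad_sample(theta, x, target):
    """Gradient of the per-sample cost l = 0.5 * (theta^T x - target)^2."""
    return (theta @ x - target) * x

theta = np.zeros(d)
eta0 = 0.1
for step in range(1, 20001):
    i = rng.integers(N)                          # pick (x', y') at random from D
    eta_t = eta0 / (1.0 + 0.01 * step)           # ~1/t decay: sum diverges, sum of squares converges
    theta -= eta_t * grad_sample(theta, X[i], y[i])

print("theta :", theta.round(3))
print("w_true:", w_true.round(3))
```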

Page 33: Dnn Tutorial

Any question so far?

Page 34: Dnn Tutorial

Almost there. . .

How do we compute the gradient efficiently for deep neural networks?

Page 35: Dnn Tutorial

Backpropagation Algorithm – (1) Forward Pass

[Computational graph: inputs x1, x2 → hidden nodes h1, h2 → output f → loss L]

Forward Computation:

L(f(h1(x1, x2, θ_h1), h2(x1, x2, θ_h2), θ_f), y)

Multilayer perceptron with a single hidden layer:

L(x, y, θ) = ½ (y − U^T φ(W^T x))²
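A minimal sketch (my code, not the tutorial's) of this forward pass and the corresponding backward pass: the gradients of L = ½ (y − U^T φ(W^T x))² with respect to W and U, taking φ = tanh, checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 4, 3
x, y = rng.normal(size=d), 1.0
W, U = rng.normal(size=(d, H)), rng.normal(size=H)

def forward(W, U):
    a = W.T @ x                  # pre-activations
    h = np.tanh(a)               # phi(W^T x)
    f = U @ h                    # scalar output
    return 0.5 * (y - f) ** 2, a, h, f

def backward(W, U):
    L, a, h, f = forward(W, U)
    dL_df = -(y - f)                         # dL/df
    dL_dU = dL_df * h                        # dL/dU = dL/df * df/dU
    dL_dh = dL_df * U                        # chain rule through f
    dL_da = dL_dh * (1 - np.tanh(a) ** 2)    # through phi = tanh
    dL_dW = np.outer(x, dL_da)               # through a = W^T x
    return dL_dW, dL_dU

gW, gU = backward(W, U)
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
num = (forward(Wp, U)[0] - forward(W, U)[0]) / eps
print(gW[1, 2], num)   # the two numbers should agree to roughly 5 decimal places
```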

Page 36: Dnn Tutorial

Backpropagation Algorithm – (2) Chain Rule

[Computational graph annotated with the local derivatives ∂L/∂f, ∂f/∂h1, ∂f/∂h2, ∂h1/∂x1, ∂h2/∂x1, ...]

Chain rule of derivatives:

∂L/∂x1 = (∂L/∂f)(∂f/∂x1) = (∂L/∂f) [(∂f/∂h1)(∂h1/∂x1) + (∂f/∂h2)(∂h2/∂x1)]

Page 37: Dnn Tutorial

Backpropagation Algorithm – (3) Shared Derivatives

[Computational graph: the local derivatives ∂f/∂h1, ∂f/∂h2 and ∂L/∂f are computed once and reused]

Local derivatives are shared:

∂L/∂x1 = (∂L/∂f) [(∂f/∂h1)(∂h1/∂x1) + (∂f/∂h2)(∂h2/∂x1)]

∂L/∂x2 = (∂L/∂f) [(∂f/∂h1)(∂h1/∂x2) + (∂f/∂h2)(∂h2/∂x2)]

Page 38: Dnn Tutorial

Backpropagation Algorithm – (4) Local Computation

[Diagram: a node h with inputs a1, a2, ..., aq; incoming gradient ∂L/∂h, local derivatives ∂h/∂a1, ..., ∂h/∂aq, outgoing gradients ∂L/∂a1, ..., ∂L/∂aq]

Each node computes
- Forward: h(a1, a2, ..., aq)
- Backward: ∂h/∂a1, ∂h/∂a2, ..., ∂h/∂aq

Page 39: Dnn Tutorial

Backpropagation Algorithm – Requirements

[Diagram: a node with its local forward function and partial derivatives, as on the previous slide]

- Each node computes a differentiable function¹
- Directed Acyclic Graph²

¹ Well... ?
² Well... ?

Page 40: Dnn Tutorial

Backpropagation Algorithm – Automatic Differentiation

[Computational graph: inputs x1, x2 → h1, h2 → f → L, with the backward derivatives ∂f/∂h1, ∂f/∂h2, ∂L/∂f]

- Generalized approach to computing partial derivatives
- As long as your neural network fits the requirements, you do not need to derive the derivatives yourself!
- Theano, Torch, ...
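A tiny reverse-mode automatic-differentiation sketch (my own illustration, not Theano or Torch code): each node of the DAG stores its value, its parents and the local partial derivatives, and backward() pushes gradients through the graph by the chain rule:

```python
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents        # list of (parent_node, local_derivative)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)   # chain rule, path by path

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

# L = (x1 * x2) + x2, so dL/dx1 = x2 and dL/dx2 = x1 + 1.
x1, x2 = Node(3.0), Node(4.0)
L = add(mul(x1, x2), x2)
L.backward()
print(L.value, x1.grad, x2.grad)   # 16.0 4.0 4.0
```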

Page 41: Dnn Tutorial

Any question on backpropagation and automatic differentiation?

Page 42: Dnn Tutorial

Regularization – (1) Maximum a Posteriori

Probabilistic Perspective: Find a model M by...
- Maximum Likelihood (ML): argmax_θ p(D | M)
- Maximum a Posteriori (MAP): argmax_θ p(M | D)

What is the probability of θ given the current data D?

p(M | D) = p(D | M) p(M) / Σ_M p(D, M) ∝ p(D | M) p(M)

P(M): What do we think a good model should be?

Page 43: Dnn Tutorial

Regularization – (2) Weight Decay

P(M): What do we think a good model should be?

Weight-Decay Regularization
- Prior Distribution: θ_j ∼ N(0, (Mλ)⁻¹)

Maximum a Posteriori Estimation

θ = argmax_θ Σ_{n=1}^N log p(y_n | x_n, θ) − λ Σ_{j=1}^M θ_j²

Page 44: Dnn Tutorial

Regularization – (3) Smoothness and Noise Injection

Prior on a Model: Smoothness
- f(x) ≈ f(x + ε): the model should be insensitive to small changes/noise ⇐⇒ minimize Σ_{n=1}^N |∂f(x_n)/∂x|²

Regularizing Σ_{n=1}^N |∂f(x_n)/∂x|² is equivalent to adding random Gaussian noise to the input (Bishop, 1995):

argmin_θ Σ_{n=1}^N (‖f_θ(x_n) − y_n‖² + λ |∂f_θ(x_n)/∂x|²)

≈ argmin_θ Σ_ε p(ε) Σ_{n=1}^N ‖f_θ(x_n + ε) − y_n‖²
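A short sketch (my own, assuming a simple 1-D least-squares model) of the right-hand side in practice: instead of penalizing the input Jacobian of f_θ, perturb each input with small Gaussian noise at every training step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.sin(3 * X) + 0.1 * rng.normal(size=200)

def features(x):                      # simple polynomial features for f_theta
    return np.stack([x ** k for k in range(6)], axis=-1)

theta = np.zeros(6)
eta, sigma = 0.05, 0.1                # sigma controls the strength of the noise injection
for step in range(5000):
    i = rng.integers(len(X))
    x_noisy = X[i] + sigma * rng.normal()    # inject Gaussian noise into the input
    phi = features(x_noisy)
    err = phi @ theta - y[i]
    theta -= eta * err * phi                 # plain squared-error SGD step

print(theta.round(3))
```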

Page 45: Dnn Tutorial

Regularization – (4a) Ensemble Learning and Dropout

Wisdom of the Crowd: Train M classifiers and let them vote

f(x) = (1/M) Σ_{m=1}^M f_m(x)

(Ciresan et al., 2012)

Page 46: Dnn Tutorial

Regularization – (4b) Ensemble Learning and Dropout

Dropout: Train one, but exponentially many classifiers (Hinton et al., 2012)

L(θ) = (1/N) Σ_{n=1}^N log E_m[p(y_n, m | x_n, θ)] ≥ (1/N) Σ_{n=1}^N E_m[log p(y_n, m | x_n, θ)]

[Network diagram: inputs x1, x2 → hidden units h1, h2, h3, ..., hH → output f]

Each update samples one network out of exponentially many classifiers.

θ_t = θ_{t−1} − η_t ∇l((x′, y′), m | θ_{t−1}),

where m_{i,l} ∼ B(0.5).

Page 47: Dnn Tutorial

Regularization – (4b) Ensemble Learning and Dropout

Dropout: When testing, halve the activations

L(θ) ≥ (1/N) Σ_{n=1}^N E_m[log p(y_n, m | x_n, θ)] ≈ (1/N) Σ_{n=1}^N log p(y_n, m = ½ | x_n, θ)

[Network diagram: the same network with every hidden unit's activation multiplied by ½ at test time]
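A minimal dropout sketch (my own illustration, following the two slides above): during training each hidden unit is kept with a Bernoulli(0.5) mask m, and at test time no units are dropped but the activations are halved, so the test-time output matches the average over sampled sub-networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden(x, W):
    return np.maximum(0.0, W @ x)            # rectifier hidden layer

def forward_train(x, W, u):
    h = hidden(x, W)
    m = rng.integers(0, 2, size=h.shape)     # m_i ~ Bernoulli(0.5): one sampled sub-network
    return u @ (m * h)

def forward_test(x, W, u):
    h = hidden(x, W)
    return u @ (0.5 * h)                     # halve activations instead of sampling masks

d, H = 5, 16
W, u, x = rng.normal(size=(H, d)), rng.normal(size=H), rng.normal(size=d)
samples = [forward_train(x, W, u) for _ in range(2000)]
print(np.mean(samples), forward_test(x, W, u))   # the two values should be close
```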

Page 48: Dnn Tutorial

Do you see why I spent so much time on regularization?

Page 49: Dnn Tutorial

Common Recipe for Deep Neural Networks

1. Use a piecewise linear hidden unit (see the sketch after this list)
   - Rectifier: h(x) = max{0, x} (Glorot&Bengio, 2011)
   - Maxout: h(x_1, ..., x_p) = max{x_1, ..., x_p} (Goodfellow et al., 2013)
2. Preprocess data and choose features carefully
   - Images: Whitening? Local contrast normalization? Raw? SIFT? HoG?
   - Speech: Raw? Spectrum?
   - Text: Characters? Words? Tree?
   - General: z-Normalization?
3. Use Dropout and other regularization methods
4. Unsupervised Pretraining (Hinton&Salakhutdinov, 2006)
   - Few labeled samples, but a lot of unlabeled samples
5. Carefully search for hyperparameters
   - Random search, Bayesian optimization (Bergstra&Bengio, 2013; Bergstra et al., 2011; Snoek et al., 2012)
6. Often, the deeper the better
7. Build an ensemble of neural networks
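A tiny sketch (mine) of the two piecewise-linear hidden units named in item 1: the rectifier max{0, x} and a maxout unit that takes the maximum over p linear pieces:

```python
import numpy as np

def rectifier(x):
    return np.maximum(0.0, x)

def maxout(x, W, b):
    """Maxout unit: max over p linear projections w_k^T x + b_k."""
    return np.max(W @ x + b, axis=0)

x = np.array([0.5, -1.0, 2.0])
print(rectifier(np.array([-2.0, 0.3])))              # [0.  0.3]
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)   # p = 4 linear pieces
print(maxout(x, W, b))
```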

Page 50: Dnn Tutorial

But, nobody seems to use a vanilla multilayer perceptron, right?

Page 51: Dnn Tutorial

How to Encode Prior/Domain Knowledge?

Data Preprocessing and Feature Extraction:
- Object recognition from images
  - Lighting conditions shouldn't matter → Contrast normalization
- Gesture recognition from skeleton
  - Relative positions of joints are important → Relative coordinate system w.r.t. the body center
- Language Processing
  - Word counts are important → Bag-of-Words representation

Model Architecture Design

Page 52: Dnn Tutorial

Convolutional Neural Networks – (1)

Suitable for Images, Videos and Speech

Prior/Domain Knowledge
- Translation invariance
- Rotation invariance: Images and Videos
- Temporal invariance: Videos and Speech
- Frequency invariance: Speech

Page 53: Dnn Tutorial

Convolutional Neural Networks – (2) Convolution and Pooling

Convolution
- Global

Pooling
- Local

[Diagram: max taken over local windows]
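A small 1-D sketch (mine, not from the tutorial) of the two operations above: a convolution that slides one shared filter over the whole signal, followed by non-overlapping local max pooling (as in CNNs, the filter is not flipped):

```python
import numpy as np

def conv1d(x, w):
    """Valid convolution: the same filter w is applied at every position."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def max_pool1d(x, width):
    """Non-overlapping max pooling over local windows."""
    n = len(x) // width
    return x[:n * width].reshape(n, width).max(axis=1)

signal = np.array([0.0, 1.0, 0.0, 2.0, 3.0, 0.0, 1.0, 0.0, 4.0])
filt = np.array([1.0, -1.0])          # a simple edge-like filter
feature_map = conv1d(signal, filt)
print(feature_map)                    # [-1.  1. -2. -1.  3. -1.  1. -4.]
print(max_pool1d(feature_map, 2))     # [ 1. -1.  3.  1.]
```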

Page 54: Dnn Tutorial

Convolutional Neural Networks – (3) Convolutional Neural Network

(TeraDeep, 2013)

Convolutional Layer
1. Contrast Normalization
2. Convolution
3. Pooling
4. Nonlinearity

Page 55: Dnn Tutorial

Convolutional Neural Networks – (3) Deep Convolutional Neural Network

(Krizhevsky et al., 2012)

Page 56: Dnn Tutorial

Recursive Neural Networks – (1)

Suitable for Text and Variable-Length Sequences

Prior/Domain Knowledge
- Compositionality ≈ Tree-based Grammar (?)
- Location invariance
- Variable Length

Page 57: Dnn Tutorial

Recursive Neural Networks – (2)

Compositional Structure

A small crowd quietly enters the historic church

Small (local) pieces are glued together to form a global structure

Page 58: Dnn Tutorial

Recursive Neural Networks – (3)

Finding a good, compact representation of a variable-length sequence

(Socher et al., 2011)

Page 59: Dnn Tutorial

What other architectures can you think of?

Page 60: Dnn Tutorial

Further Topics

- Is learning solved for supervised neural networks?
- Recurrent neural networks: cope with variable-length inputs/outputs
- Beyond sigmoid and rectifier functions

Page 61: Dnn Tutorial

Unsupervised Neural Networks

Kyunghyun Cho

Laboratoire d’Informatique des Systèmes Adaptatifs, Département d’informatique et de recherche opérationnelle,

Faculté des arts et des sciences, Université de Montréal

[email protected] ([email protected])

Page 62: Dnn Tutorial

Warning!!!

The first half of this session can be boring.

Page 63: Dnn Tutorial

Unsupervised Learning

No more labels!

D = {x_1, x_2, ..., x_N}

What can we do?

Page 64: Dnn Tutorial

(Exploratory) Data Analysis

The most important step in machine learning


Human Gestures Visualization by DNN

(Cho&Chen, 2014)

Page 65: Dnn Tutorial

Feature Extraction

With domain knowledge → Engineered Features
Without domain knowledge → Learned Features

[Diagram: x → f → ?]

Page 66: Dnn Tutorial

Generative Model: Probabilistic Picture

Underlying distribution:

X ∼ P_D

Data:

D = {x | x ∼ P_D}

Find a distribution p(x), using D, that emulates P_D as well as possible on new samples drawn from P_D but potentially not included in D.

Page 67: Dnn Tutorial

What should we do with data D = {x_n}_{n=1}^N?

Example target tasks
- Classification: p(x_class | x_ins)
- Missing value reconstruction: p(x_m | x_o)
- Denoising: p(x | x̃)
- Structured Output Prediction: p(x_out | x_in)
- Outlier Detection: p(x) > τ?

Ultimately, it comes down to learning a distribution of x.

Page 68: Dnn Tutorial

Density/Distribution Estimation – (1)

Latent Variable Models p_θ(x)

θ* = argmax_θ (1/N) Σ_{n=1}^N log Σ_h p(x_n | h) p(h)

1. Define a parametric form of the joint distribution p_θ(x, h)
2. Derive a learning rule for θ

Page 69: Dnn Tutorial

Density/Distribution Estimation – (2) Restricted Boltzmann Machines

[Diagram: RBM with visible units x1, ..., xp and hidden units h1, ..., hq]

1. Joint distribution

p_θ(x, h) = (1/Z(θ)) exp( Σ_{i=1}^p Σ_{j=1}^q x_i h_j w_ij )

2. Marginal distribution

Σ_h p_θ(x, h) = (1/Z(θ)) Π_{j=1}^q ( 1 + exp( Σ_{i=1}^p w_ij x_i ) )

3. Learning rule: Maximum Likelihood

L(θ) = E_{x∼P_d} [ log Π_{j=1}^q ( 1 + exp( Σ_{i=1}^p w_ij x_i ) ) − log Z(θ) ]

∇_{w_ij} = ⟨x_i h_j⟩_d − ⟨x_i h_j⟩_m

(Smolensky, 1986)
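A compact sketch (my own code) of training this binary RBM with the gradient ⟨x_i h_j⟩_d − ⟨x_i h_j⟩_m, where the model term is approximated by one step of block Gibbs sampling (the standard contrastive-divergence approximation, which the slide does not spell out). No bias terms, matching the energy above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, lr = 6, 4, 0.05
W = 0.01 * rng.normal(size=(p, q))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(prob):
    return (rng.uniform(size=prob.shape) < prob).astype(float)

# Toy binary data: two repeating patterns.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

for epoch in range(200):
    for x in data:
        ph_data = sigmoid(x @ W)              # p(h_j = 1 | x)
        h = sample(ph_data)
        x_model = sample(sigmoid(h @ W.T))    # one Gibbs step back to the visibles
        ph_model = sigmoid(x_model @ W)
        # <x_i h_j>_data - <x_i h_j>_model, with h marginalized where possible
        W += lr * (np.outer(x, ph_data) - np.outer(x_model, ph_model))

print(sigmoid(data[0] @ W).round(2))          # hidden probabilities for pattern 1
print(sigmoid(data[1] @ W).round(2))          # should differ from those of pattern 2
```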

Page 70: Dnn Tutorial

Density/Distribution Estimation – (3) Belief Networks

[Diagram: a belief network with several layers of hidden units above the visible layer]

1. Joint distribution

p_θ(x, h) = p(x | h^1) p(h^1 | h^2) ⋯ p(h^L)

2. Marginal distribution

Σ_h p_θ(x, h) = ?

3. Learning rule: Maximum Likelihood

L(θ) = E_{x∼P_d} [ log Σ_h p_θ(x, h) ]

(Neal, 1996)

Page 71: Dnn Tutorial

Density/Distribution Estimation – (4) NADE

1. Joint distribution

p_θ(x) = p_θ(x_1) p_θ(x_2 | x_1) ⋯ p_θ(x_d | x_1, ..., x_{d−1})

2. Marginal distribution: no latent variable h

3. Learning rule: Maximum Likelihood (fixed order), Order-agnostic (all orders)

(Larochelle&Murray, 2011)
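A minimal sketch (mine, following the NADE factorization above) of evaluating log p(x) for a binary vector: each conditional p(x_d | x_<d) is a logistic regression on a hidden state built incrementally from the already-seen dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_prob(x, W, V, b, c):
    """log p(x) = sum_d log p(x_d | x_1, ..., x_{d-1})."""
    d_dim, H = W.shape
    a = c.copy()                        # running pre-activation of the hidden state
    log_p = 0.0
    for d in range(d_dim):
        h = sigmoid(a)                  # hidden state summarizing x_1, ..., x_{d-1}
        p_d = sigmoid(V[d] @ h + b[d])  # p(x_d = 1 | x_<d)
        log_p += x[d] * np.log(p_d) + (1 - x[d]) * np.log(1 - p_d)
        a += W[d] * x[d]                # fold x_d in for the next conditional
    return log_p

rng = np.random.default_rng(0)
D, H = 8, 5
W, V = rng.normal(scale=0.1, size=(D, H)), rng.normal(scale=0.1, size=(D, H))
b, c = np.zeros(D), np.zeros(H)
x = rng.integers(0, 2, size=D).astype(float)
print(nade_log_prob(x, W, V, b, c))     # exact log p(x): no normalization constant needed
```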

Page 72: Dnn Tutorial

Density/Distribution Estimation – (4) Issues

Intractability! Intractability! Intractability!

(General) Boltzmann Machines
- Normalization Constant Z(θ)
- Marginal Probability Σ_h p(x, h)
- Posterior Probability p(h | x)
- Conditional Probability p(x_mis | x_obs)

Restricted Boltzmann Machines
- Normalization Constant Z(θ)
- Marginal Probability Σ_h p(x, h)
- Posterior Probability p(h | x)
- Conditional Probability p(x_mis | x_obs)

Belief Networks
- Normalization Constant Z(θ)
- Marginal Probability Σ_h p(x, h)
- Posterior Probability p(h | x)
- Conditional Probability p(x_mis | x_obs)

NADE
- Normalization Constant Z(θ)
- Marginal Probability Σ_h p(x, h)
- Posterior Probability p(h | x)
- Conditional Probability p(x_mis | x_obs)
- Somewhat unsatisfactory performance

Page 73: Dnn Tutorial

Do we want to learn the distribution?

Page 74: Dnn Tutorial

Generative Model – (1) Learn to Infer

Example target tasks
- Classification: p(x_class | x_ins)
- Missing value reconstruction: p(x_m | x_o)
- Denoising: p(x | x̃)
- Structured Output Prediction: p(x_out | x_in)
- Outlier Detection: p(x) > τ?

All we want is to infer the conditional distribution of unknown variables

(Goodfellow et al., 2013; Brakel et al., 2013; Stoyanov et al., 2011, Raiko et al., 2014)

Page 75: Dnn Tutorial

Generative Model – (2) Learn to Infer

Approximate Inference in a Graphical Model

p(x_mis | x_obs) ≈ Q(x_mis | x_obs)

Methods:
- Loopy belief propagation
- Variational inference/message-passing

At the end of the day...

Q^k(x_mis | x_obs) = f(Q^{k−1}(x_mis | x_obs)) until convergence

Page 76: Dnn Tutorial

Generative Model – (3) Learn to Infer

[Diagram: mean-field iterations unrolled as a chain x⟨0⟩ → h⟨1⟩ → x⟨1⟩ → h⟨2⟩ → x⟨2⟩ → ... → h⟨k⟩ → x⟨k⟩]

Approximate Inference in a Restricted Boltzmann Machine

p(x_mis | x_obs) ≈ Q(x_mis | x_obs)

Mean-field Fixed-point Iteration

μ_x^k = σ(W σ(W^T μ_x^{k−1} + c) + b)

At the end of the day, a multilayer perceptron with k − 1 hidden layers → use backpropagation and stochastic gradient descent!
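A small sketch (my own) of the fixed-point iteration above: fill in the missing dimensions of a binary vector under an RBM with weights W, visible biases b and hidden biases c, by repeatedly applying μ ← σ(W σ(W^T μ + c) + b) while clamping the observed dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field_impute(x_obs, observed, W, b, c, k=20):
    mu = np.where(observed, x_obs, 0.5)           # initialize missing entries at 0.5
    for _ in range(k):
        mu = sigmoid(W @ sigmoid(W.T @ mu + c) + b)
        mu = np.where(observed, x_obs, mu)        # keep observed values clamped
    return mu

rng = np.random.default_rng(0)
p, q = 6, 4
W, b, c = rng.normal(size=(p, q)), np.zeros(p), np.zeros(q)
x_obs = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0])
observed = np.array([True, True, True, False, False, False])
print(mean_field_impute(x_obs, observed, W, b, c).round(2))
```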

Page 77: Dnn Tutorial

Generative Model – (4) Learn to Infer – NADE-k

[Diagram: NADE-k unrolls k inference steps, alternating hidden states h⟨i⟩ and reconstructions v⟨i⟩ through weight matrices W, U, V]

Further Generalization with Deep Neural Networks

p(x_mis | x_obs) ≈ Q(x_mis | x_obs) = f_θ(x_obs)

Interpret the model as a mixture of NADEs with different orderings of the variables
- Exact computation of p(x) possible
- Fast inference of p(x_mis | x_obs)
- Flexible

(Raiko et al., 2014)

Page 78: Dnn Tutorial

Generative Model – (5) Learn to Infer

Lesson:

- Do not maximize log-likelihood, but minimize the actual cost!

⇐⇒

- Don’t do what a model tells you to do, but do what you aim to do.

Page 79: Dnn Tutorial

But, popular science journalists don’t care about generative models..

Page 80: Dnn Tutorial

Manifold Learning – Semi-Supervised Learning (1)

???

- Which class does the dot belong to, red or blue?

Page 81: Dnn Tutorial

Manifold Learning – Semi-Supervised Learning (2)

???

- Now, which class does the dot belong to, red or blue?
- The black dots are unlabeled

Page 82: Dnn Tutorial

Manifold Learning – New Representation (1)

Representation φ on the data manifold?

[Diagram: data space with a curved manifold mapped to a hidden space]

1. φ should reflect changes along the manifold

   φ(x_i) ≠ φ(x_j), for all x_i, x_j ∈ D

2. φ should not reflect any change orthogonal to the manifold

   φ(x_i + ε) = φ(x_i)

Page 83: Dnn Tutorial

Manifold Learning – New Representation (2): Denoising Autoencoder

(Vincent et al., 2011)

Representations that capture the manifold
1. φ(x_i) ≠ φ(x_j), for all x_i, x_j ∈ D
2. φ(x_i + ε) = φ(x_i)

The denoising autoencoder achieves this by

min_{θ,θ′} ‖x − g_θ′(f_θ(x + ε))‖²

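A minimal denoising-autoencoder sketch (mine, not the tutorial's code): a one-hidden-layer encoder f and linear decoder g trained by SGD on the objective above, i.e. reconstructing the clean x from a Gaussian-corrupted x + ε:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H, sigma, eta = 10, 4, 0.3, 0.05
W, Wp = 0.1 * rng.normal(size=(H, d)), 0.1 * rng.normal(size=(d, H))

def encode(x):            # f_theta
    return np.tanh(W @ x)

def decode(h):            # g_theta'
    return Wp @ h

# Toy data living near a low-dimensional manifold (a random 3-D subspace of R^10).
basis = rng.normal(size=(3, d))
data = rng.normal(size=(500, 3)) @ basis / np.sqrt(d)

for epoch in range(30):
    for x in data:
        x_tilde = x + sigma * rng.normal(size=d)       # corrupt the input
        h = encode(x_tilde)
        err = decode(h) - x                            # reconstruct the *clean* x
        grad_Wp = np.outer(err, h)                     # backprop through the decoder
        grad_h = Wp.T @ err
        grad_W = np.outer(grad_h * (1 - h ** 2), x_tilde)   # ... and the tanh encoder
        Wp -= eta * grad_Wp
        W -= eta * grad_W

x = data[0]
print(np.linalg.norm(x - decode(encode(x + sigma * rng.normal(size=d)))),
      np.linalg.norm(x))    # reconstruction error compared with ||x||
```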

Page 84: Dnn Tutorial

Semi-Supervised Learning in Action (1): Layer-wise Pretraining

[Diagram: a network x → h1 → h2 → h3 → y to be trained]

Page 85: Dnn Tutorial

Semi-Supervised Learning in Action (2): Layer-wise Pretraining

[Diagram: pretraining the 1st layer h[1] on x, then the 2nd layer h[2] on h[1], before the supervised network x → h[1] → h[2] → y is fine-tuned]

(Hinton&Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007)

Page 86: Dnn Tutorial

Manifold Embedding and Visualization – (1)

- Manifold Embedding: M ⊂ R^d → R^q, q ≪ d
- If q = 2 or 3, data visualization

[Diagram: data mapped to a two-dimensional code (z1, z2)]

(Oja, 1991; Kramer, 1991; Hinton & Salakhutdinov, 2006)

Page 87: Dnn Tutorial

Manifold Embedding and Visualization – (2)

Handwritten Digits: [0, 1]^196 → R²

[Scatter plot with digit classes 0-9]

Pose Frame Data: R^30 → R²

[Scatter plot with gesture classes rotateArmsLBack, rotateArmsRBack, rotateArmsBBack]

Page 88: Dnn Tutorial

What other applications can you think of?

Page 89: Dnn Tutorial

Advanced Topics

Kyunghyun Cho

Laboratoire d’Informatique des Systèmes Adaptatifs, Département d’informatique et de recherche opérationnelle,

Faculté des arts et des sciences, Université de Montréal

[email protected] ([email protected])

Page 90: Dnn Tutorial

Is deep learning all about computer vision and speech recognition?

Page 91: Dnn Tutorial

Deep Reinforcement Learning

[Diagram: a network mapping the state s through hidden layers h1, h2, h3 to the action values a]

Q-Learning
- Q(s, a): state-action function
- Action at time t = argmax_{a∈[1,j]} Q(s, a)
- Update Q on-the-fly

Deep Q-Learning
- Model Q with a deep neural network
- Predict Q(s, a_j) for all j at once
- State s: visual perception, not internal states!

(Mnih et al., 2013)
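A simplified deep-Q-learning sketch (my own toy, only loosely in the spirit of Mnih et al., 2013, and not their implementation): the slide's visual-input setting is replaced by a tiny 5-state chain with one-hot states, and a linear "network" maps the state to Q(s, a) for every action at once, updated toward the one-step TD target r + γ max_a' Q(s', a'):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, eps, eta = 5, 2, 0.9, 0.1, 0.05
W = 0.01 * rng.normal(size=(n_actions, n_states))   # Q(s, .) = W @ one_hot(s)

def one_hot(s):
    v = np.zeros(n_states); v[s] = 1.0
    return v

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

s = 0
for t in range(20000):
    q = W @ one_hot(s)
    a = rng.integers(n_actions) if rng.uniform() < eps else int(np.argmax(q))
    s_next, r = step(s, a)
    target = r + gamma * np.max(W @ one_hot(s_next))   # TD target
    W[a] -= eta * (q[a] - target) * one_hot(s)         # gradient step on 0.5 (q - target)^2
    s = 0 if s_next == n_states - 1 else s_next        # restart after reaching the goal

print(np.argmax(W[:, :-1], axis=0))   # greedy action in states 0..3: should all be 1 (move right)
```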

Page 92: Dnn Tutorial

Natural Language Processing

In neuropsychology, linguistics and the philosophy of language, a natural language or ordinary language is any language which arises, unpremeditated, in the brains of human beings.

–Wikipedia

Page 93: Dnn Tutorial

Natural Language Processing

To machine learning researchers:

Natural Language is a huge set of variable-length sequences of high-dimensional vectors.

Page 94: Dnn Tutorial

Natural Language Processing – (1) How should we represent a linguistic symbol?

Say, we have four symbols (words):

[EU], [3], [France], [three]

Most naïve, uninformative coding:

[EU] = [1, 0, 0, 0]

[3] = [0, 1, 0, 0]

[France] = [0, 0, 1, 0]

[three] = [0, 0, 0, 1]

Not satisfying..
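A short sketch (mine) of why this coding is unsatisfying: under one-hot codes every pair of distinct words is equally far apart, whereas a continuous embedding table (hand-picked here purely for illustration, not learned) can place [EU] near [France] and [3] near [three]:

```python
import numpy as np

words = ["EU", "3", "France", "three"]
one_hot = {w: np.eye(4)[i] for i, w in enumerate(words)}
embed = {                          # hypothetical 2-D embeddings, for illustration only
    "EU": np.array([0.9, 0.1]),
    "France": np.array([0.8, 0.2]),
    "3": np.array([0.1, 0.9]),
    "three": np.array([0.15, 0.85]),
}

def dist(rep, a, b):
    return np.linalg.norm(rep[a] - rep[b])

for a, b in [("EU", "France"), ("EU", "3"), ("3", "three")]:
    print(f"{a:>6}-{b:<6}  one-hot: {dist(one_hot, a, b):.2f}   embedding: {dist(embed, a, b):.2f}")
```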

Page 95: Dnn Tutorial

Natural Language Processing – (2) How should we represent a linguistic symbol?

Say, we have four symbols (words):

[EU], [3], [France], [three]

Is there a representation that preserves the similarities of meanings of symbols?

D([EU], [France]) < D([EU], [3]),

D([3], [three]) < D([France], [three]),

D([3], [three]) < ε

...

Page 96: Dnn Tutorial

Natural Language Processing – (3) Continuous-Space Representation

Sample sentences:
1. There are three teams left for the qualification.
2. 3 teams have passed the first round.

Task: Predict the following word given the current word [three]

Naïve approach: build a table (so-called n-gram)
- (three, teams), (3, teams)
- The table can grow arbitrarily.

Machine learning: compress the table into a continuous function
- Map three and 3 to nearby points x in a continuous space
- From x, map to [teams].

Page 97: Dnn Tutorial

Natural Language Processing – (4) Continuous-Space Representation

(Cho et al., 2014)

Page 98: Dnn Tutorial

Natural Language Processing – (5) Beyond Word Representation

(Cho et al., 2014)

Page 99: Dnn Tutorial

NN: I am very powerful and can model anything as long as I'm fed enough computational resources.

SVM: But, you have to optimize a high-dimensional, non-convex function which has many, many local minima!

NN: Really?

Page 100: Dnn Tutorial

Advanced Optimization – (1) Statistical Physics says. . .

Not really

Page 101: Dnn Tutorial

Advanced Optimization – (2) Local Minima? Saddle Points?

[3-D surface plots illustrating local minima and saddle points of loss surfaces]

(Dauphin et al., 2014; Pascanu et al., 2014)

Page 102: Dnn Tutorial

Advanced Optimization – (3) Beyond the 2nd-order Method

(Quasi-)Newton Method

θ ← θ − H⁻¹ ∇L(θ)

How well does the quadratic approximation hold when training neural networks?

Saddle-Free Newton Method (very new!!) (Dauphin et al., 2014)

θ ← θ − |H|⁻¹ ∇L(θ),

where |H| is constructed by

|H| = U |Σ| V when H = U Σ V.
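A small sketch (mine) of the |H| construction for a symmetric Hessian: decompose H, take the absolute values of the eigenvalues, and compare a Newton step with a saddle-free Newton step at a saddle point of f(x, y) = x² − y²:

```python
import numpy as np

grad = np.array([0.2, 0.2])              # gradient of f at the point (0.1, -0.1)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])              # indefinite Hessian: a saddle point

eigval, U = np.linalg.eigh(H)            # H = U diag(eigval) U^T
H_abs = U @ np.diag(np.abs(eigval)) @ U.T

newton_step = -np.linalg.solve(H, grad)           # moves *toward* the saddle along y
saddle_free_step = -np.linalg.solve(H_abs, grad)  # descends in both directions
print(newton_step, saddle_free_step)              # [-0.1  0.1] vs. [-0.1 -0.1]
```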

Page 103: Dnn Tutorial

Lastly but not at all least, is there any theoretical ground for using deep neural networks?

Page 104: Dnn Tutorial

Theoretical Analysis – Deep Rectifier Networks Fold the Space

[Figure: (a) folding the input space along the horizontal and vertical axes; (b, c) regions S1-S4 of the input space are mapped onto overlapping regions S′1-S′4 in the first- and second-layer spaces]

(Montufar et al., 2014; Pascanu et al., 2014)

Page 105: Dnn Tutorial

Is it the beginning of deep learning or the end of deep learning?