Probabilistic Graphical Models (10-708), Lecture 25: Deep Learning and Graphical Models. Eric Xing, CMU, April 15, 2015.

Page 1:

Deep Learning and Graphical Models

Eric Xing

Lecture 25, April 15, 2015

Reading:

Probabilistic Graphical Models

Page 2:

From biological neuron to artificial neuron (perceptron)

From biological neural networks to artificial neural networks

Perceptron and Neural Nets

[Figure: a biological neuron (soma, dendrites, axon, synapses) next to a perceptron: inputs x1, x2 with weights w1, w2 feed a linear combiner and a hard limiter (threshold) to produce output Y. A second diagram shows a multilayer network passing input signals through an input layer, a middle layer, and an output layer to produce output signals.]

Page 3:

A perceptron learning algorithm

Recall the nice property of the sigmoid function.

Consider a regression problem f: X → Y, for scalar Y:

We used to maximize the conditional data likelihood

Here …
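For reference, the property alluded to is the sigmoid's self-similar derivative, and the objective is the conditional log-likelihood of the training data (standard facts, stated in my own notation rather than quoted from the slide):

$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\sigma(z)}{dz} = \sigma(z)\big(1 - \sigma(z)\big), \qquad \ell(\mathbf{w}) = \sum_d \log p(y_d \mid \mathbf{x}_d, \mathbf{w})$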

Page 4:

Gradient Descent

x_d = input
t_d = target output
o_d = observed unit output
w_i = weight i
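For reference, the squared-error objective and its gradient in their standard form for this setting (a sketch, not quoted from the slide):

$E_D[\mathbf{w}] = \tfrac{1}{2} \sum_{d \in D} (t_d - o_d)^2, \qquad \frac{\partial E_D}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, \frac{\partial o_d}{\partial w_i}$

For a sigmoid unit $o_d = \sigma(\mathbf{w} \cdot \mathbf{x}_d)$ this becomes $\partial E_D / \partial w_i = -\sum_{d} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}$.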

Page 5:

The perceptron learning rules

x_d = input
t_d = target output
o_d = observed unit output
w_i = weight i

Batch mode: Do until converged:
1. Compute the gradient $\nabla E_D[\mathbf{w}]$
2. Update $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla E_D[\mathbf{w}]$

Incremental mode: Do until converged:
For each training example d in D:
1. Compute the gradient $\nabla E_d[\mathbf{w}]$
2. Update $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla E_d[\mathbf{w}]$

where $E_D[\mathbf{w}] = \tfrac{1}{2}\sum_{d \in D}(t_d - o_d)^2$ and $E_d[\mathbf{w}] = \tfrac{1}{2}(t_d - o_d)^2$.
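A minimal sketch of the two modes in code, assuming a single sigmoid unit trained on the squared error above (NumPy; the names sigmoid, batch_update, incremental_update, and the learning rate eta are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(w, X, t, eta=0.1):
    """Batch mode: accumulate the gradient of E_D[w] over all of D, then update once."""
    o = sigmoid(X @ w)                       # outputs o_d for every training example
    grad = -X.T @ ((t - o) * o * (1 - o))    # gradient of E_D[w] = 1/2 * sum_d (t_d - o_d)^2
    return w - eta * grad

def incremental_update(w, X, t, eta=0.1):
    """Incremental (stochastic) mode: update after each training example d in D."""
    for x_d, t_d in zip(X, t):
        o_d = sigmoid(x_d @ w)
        grad_d = -(t_d - o_d) * o_d * (1 - o_d) * x_d   # gradient of E_d[w]
        w = w - eta * grad_d
    return w
```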

Page 6:

Neural Network Model

[Figure: inputs (the independent variables Age = 34, Gender = 2, Stage = 4) feed through weights (.6, .5, .8, .2, .1, .3, .7) into a hidden layer; the hidden-layer outputs are combined through further weights (.4, .2) into the output, the dependent variable / prediction: "Probability of being Alive" = 0.6.]

Page 7:

[Figure: the same network, highlighting a single hidden unit: the inputs (Age = 34, Gender = 2, Stage = 4) feed it through weights (.6, .5, .8), and it contributes to the output "Probability of being Alive" = 0.6.]

“Combined logistic models”
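A small sketch of what "combined logistic models" means operationally: each hidden unit is a logistic regression of the inputs, and the output unit is a logistic regression of the hidden values. The numbers below only loosely echo the figure and are not claimed to reproduce its 0.6 output:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights in the spirit of the figure: three inputs (age, gender, stage),
# two hidden logistic units, one output unit ("probability of being alive").
W_hidden = np.array([[0.6, 0.5],
                     [0.8, 0.2],
                     [0.1, 0.7]])
w_output = np.array([0.4, 0.2])

x = np.array([34.0, 2.0, 4.0])         # Age = 34, Gender = 2, Stage = 4
hidden = logistic(x @ W_hidden)        # each hidden unit is itself a logistic model
p_alive = logistic(hidden @ w_output)  # the output combines the hidden logistic models
print(p_alive)
```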

Page 8:

[Figure: the same network, highlighting another hidden unit and its weights (.5, .8, .2, .3, .2); output "Probability of being Alive" = 0.6.]

Page 9:

[Figure: the full network again, now with Gender = 1; all weights are shown (.6, .5, .8, .2, .1, .3, .7) and the output "Probability of being Alive" = 0.6.]

Page 10:

[Figure: the same network and weights as before.]

Not really, no target for hidden units...

Page 11:

Backpropagation Algorithm

Initialize all weights to small random numbers.

Until convergence, do:
Input the training example to the network and compute the network outputs, then:
1. For each output unit k, compute its error term δ_k
2. For each hidden unit h, compute its error term δ_h
3. Update each network weight w_{i,j}

where
x_d = input
t_d = target output
o_d = observed unit output
w_i = weight i
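A minimal sketch of one such update for a single hidden layer of sigmoid units under squared error (the standard delta rules; the function and variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_in, W_out, eta=0.05):
    """One stochastic backpropagation update for a 1-hidden-layer sigmoid network."""
    # 1. Input the training example and compute the network outputs.
    h = sigmoid(W_in @ x)            # hidden unit outputs
    o = sigmoid(W_out @ h)           # output unit outputs

    # 2. Error term for each output unit k, then for each hidden unit h.
    delta_k = o * (1 - o) * (t - o)
    delta_h = h * (1 - h) * (W_out.T @ delta_k)

    # 3. Update each network weight w_{i,j}: w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}.
    W_out += eta * np.outer(delta_k, h)
    W_in  += eta * np.outer(delta_h, x)
    return W_in, W_out
```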

Page 12:

More on Backpropagation

 It is doing gradient descent over the entire network weight vector
 Easily generalized to arbitrary directed graphs
 Will find a local, not necessarily global, error minimum

In practice, it often works well (can run multiple times)

Often include weight momentum
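The momentum term usually added here has the standard form (with $0 \le \alpha < 1$ the momentum coefficient; notation mine):

$\Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)$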

Minimizes error over training examples. Will it generalize well to subsequent testing examples?

Training can take thousands of iterations, very slow! Using the network after training is very fast.

Page 13:

Learning Hidden Layer Representations

A network:

A target function:

Can this be learned?

Page 14:

Learning Hidden Layer Representations

A network:

Learned hidden layer representation:

Page 15:

Training

Page 16:

Training

Page 17:

Modern ANN topics: “Deep” Learning

Page 18:

Non-linear LR vs. ANN

[Figure: two models over inputs X1, X2, X3. A non-linear logistic regression expands the inputs into all (2³ − 1) possible combinations (X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3) feeding Y, i.e. Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...; an ANN instead learns hidden units such as "X1", "X2", "X1X3", "X1X2X3" feeding Y.]

Page 19:

Courtesy: Lee and Ng

Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH

Drawbacks of feature engineering:
1. Needs expert knowledge
2. Time-consuming hand-tuning

Page 20:

Using ANN to capture hierarchical representation

Good Representations are hierarchical

• In Language: hierarchy in syntax and semantics
  – Words → Parts of Speech → Sentences → Text
  – Objects, Actions, Attributes... → Phrases → Statements → Stories

• In Vision: part-whole hierarchy
  – Pixels → Edges → Textons → Parts → Objects → Scenes

[Figure: a pipeline of Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier.]

Page 21:

“Deep” learning: learning hierarchical representations

• Deep Learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to object identities
• Representations are increasingly invariant as we go up the layers
• Using multiple stages gets around the specificity/invariance dilemma

[Figure: Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier, with the intermediate outputs labeled "Learned Internal Representation".]

Page 22:

“Deep” models

 Neural Networks: feed-forward* (you have seen it)
 “Deep” Restricted Boltzmann Machines (RBMs). Probabilistic, undirected: MRFs and RBMs*
 Autoencoders (multilayer NNs with output = input). Non-probabilistic, directed: PCA, sparse coding
 Recurrent Neural Networks*
 Convolutional Neural Nets

 Different loss function designs
 Different architecture designs

Page 23:

Example I: The Restricted Boltzmann Machine, a.k.a. the “Harmonium”

hidden units

visible units

The Harmonium (Smolensky –’86)

History:
 Smolensky (’86): proposed the architecture.
 Freund & Haussler (’92): the “Combination Machine” (binary), learning with projection pursuit.
 Hinton (’02): the “Restricted Boltzmann Machine” (binary), learning with contrastive divergence.
 Marks & Movellan (’02): Diffusion Networks (Gaussian).
 Welling, Hinton, Osindero (’02): “Product of Student-T Distributions” (super-Gaussian).

Page 24:

A Two-layer MRF

hidden units

visible units

Boltzmann machines:

$p(\mathbf{x}, \mathbf{h} \mid \theta) = \exp\Big\{ \sum_i \theta_i(x_i) + \sum_j \theta_j(h_j) + \sum_{i,j} \theta_{i,j}(x_i, h_j) - A(\theta) \Big\}$

Page 25:

A Constructive Definition

How do we couple them?

Independent exponential-family marginals over the visible units $x_i$ and hidden units $h_j$:

$p_{\mathrm{ind}}(\mathbf{x}) \propto \prod_i \exp\{ f_i(x_i) \}, \qquad p_{\mathrm{ind}}(\mathbf{h}) \propto \prod_j \exp\{ g_j(h_j) \}$

Coupled model:

$p(\mathbf{x}, \mathbf{h} \mid W) = \exp\Big\{ \sum_i f_i(x_i) + \sum_j g_j(h_j) + \sum_{i,j} f_i(x_i)^{T} W_{ij}\, g_j(h_j) \Big\}$

Page 26:

A Constructive Definition

They map to the harmonium random field, whose conditionals factorize with shifted parameters:

$p(\mathbf{x} \mid \mathbf{h}) = \prod_i p(x_i \mid \mathbf{h}), \qquad p(x_i \mid \mathbf{h}) = \exp\big\{ \hat{\theta}_{ia}\, f_{ia}(x_i) - A_i(\{\hat{\theta}_{ia}\}) \big\}, \qquad \hat{\theta}_{ia} = \theta_{ia} + \sum_{jb} W_{ia}^{jb}\, g_{jb}(h_j)$

$p(\mathbf{h} \mid \mathbf{x}) = \prod_j p(h_j \mid \mathbf{x}), \qquad p(h_j \mid \mathbf{x}) = \exp\big\{ \hat{\lambda}_{jb}\, g_{jb}(h_j) - B_j(\{\hat{\lambda}_{jb}\}) \big\}, \qquad \hat{\lambda}_{jb} = \lambda_{jb} + \sum_{ia} W_{ia}^{jb}\, f_{ia}(x_i)$

Joint model over visible units $x_i$ and hidden units $h_j$:

$p(\mathbf{x}, \mathbf{h} \mid W) = \exp\Big\{ \sum_i f_i(x_i) + \sum_j g_j(h_j) + \sum_{i,j} f_i(x_i)^{T} W_{ij}\, g_j(h_j) \Big\}$

Here the $f_{ia}$ and $g_{jb}$ are vectors of local sufficient statistics (features), coupled in the log-domain with shifted parameters.

Page 27:

The Computational Trade-off

Undirected model: Learning is hard, inference is easy.

Directed Model: Learning is "easier", inference is hard.

Example: Document Retrieval.

[Figure: a two-layer model with topics (hidden) over words (visible).]

Retrieval is based on comparing (posterior) topic distributions of documents.
- Directed models: inference is slow; learning is relatively “easy”.
- Undirected models: inference is fast; learning is slow but can be done offline.

Page 28:

An RBM for text/image

[Figure: topics (hidden layer) over word counts (visible layer).]

EFH or DWH (Welling et al. 2005, Xing et al. 2005):

$p(\mathbf{x} \mid \mathbf{h}) = \prod_i \mathrm{Softmax}\Big[ \sum_j W_{ij} h_j + \alpha_i \Big], \qquad p(\mathbf{h} \mid \mathbf{x}) = \prod_j \mathrm{Normal}\Big[ \sum_i W_{ij} \cdot x_i,\; 1 \Big]$

$x_{i,n} \in \{0,1\},\ \sum_n x_{i,n} = 1$: $x_{i,n} = 1$ means word $i$ has count $n$
$h_j \in \mathbb{R}$, $h_j = \sum_i W_{ij} \cdot x_i$: e.g. $h_j = 3$ means topic $j$ has strength 3
$x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,N_{\max}}]$
$\mathrm{Softmax}\big[\sum_j W_{ij} h_j + \alpha_i\big] \propto \exp\big\{ \big(\sum_j W_{ij} h_j + \alpha_i\big)^{T} x_i \big\}$

Note the parameterization cost: $W_{ij} = [w_{ij,1}, \ldots, w_{ij,N_{\max}}]$ for all $i, j$.

In practice, only 1-0 counting is used!

Page 29:

Recall: Properties of Directed Networks

hidden units

visible units

Semantic naturalness: intuitive causal structure, easy to design, comprehend, and manipulate

Bayesian extensions: straightforward to set up, conjugate ..., but hyperparameter fitting is non-trivial

Computational properties:

Page 30:

Properties of Directed Networks

 Factors are marginally independent.
 Factors are conditionally dependent given observations on the visible nodes:
 $p(\mathbf{h} \mid \mathbf{v}) = \dfrac{p(\mathbf{v} \mid \mathbf{h})\, p(\mathbf{h})}{p(\mathbf{v})}$ (does not factorize)

 Easy ancestral sampling: $h \sim p(h)$, then $x \sim p(x \mid h)$.

 Learning with (variational) EM: each iteration needs the posterior $p(\mathbf{h} \mid \mathbf{v})$ to maximize $Q(\theta \mid \theta^{t-1})$.

Page 31:

Properties of RBMs

 Factors are marginally dependent.
 Factors are conditionally independent given observations on the visible nodes:
 $p(\mathbf{h} \mid \mathbf{v}) = \prod_i p(h_i \mid \mathbf{v})$

 Iterative Gibbs sampling: $h \sim p(h \mid x)$, $x \sim p(x \mid h)$.

 Learning with contrastive divergence.
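A minimal sketch of one contrastive-divergence (CD-1) update for a binary RBM, the usual Hinton (2002) recipe; the names and the learning rate are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, eta=0.01, rng=None):
    """One CD-1 update. v0: observed visible vector; W: visible-by-hidden weights;
    b, c: visible and hidden biases."""
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: p(h | v) factorizes, so hidden units are sampled independently.
    ph0 = sigmoid(c + W.T @ v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # One step of block Gibbs sampling: reconstruct v, then recompute p(h | v).
    pv1 = sigmoid(b + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + W.T @ v1)

    # CD-1 gradient estimate: data statistics minus reconstruction statistics.
    W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += eta * (v0 - v1)
    c += eta * (ph0 - ph1)
    return W, b, c
```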

Page 32:

Deep RBMs

Page 33:

Example II: Convolutional Networks

[Figure: a stack of stages: convolutions/filtering → pooling/subsampling → convolutions/filtering → pooling/subsampling → convolutions/filtering → convolutions/classification.]

 Hierarchical architecture: representations are more global, more invariant, and more abstract as we go up the layers.
 Alternated layers of filtering and spatial pooling: filtering detects conjunctions of features; pooling computes local disjunctions of features.
 Fully trainable: all the layers are trainable.
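A small NumPy sketch of one such stage: filtering with a bank of kernels, an elementwise nonlinearity, then spatial pooling. Max-pooling is used here for concreteness (the LeNet-style nets below used averaging/subsampling); all names are mine:

```python
import numpy as np

def conv_stage(image, kernels, pool=2):
    """One convolutional stage: filtering, nonlinearity, then pooling/subsampling."""
    H, W = image.shape
    k = kernels.shape[-1]                       # square kernels, 'valid' convolution
    maps = []
    for kern in kernels:                        # filtering: one feature map per kernel
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + k, j:j + k] * kern)
        out = np.tanh(out)                      # nonlinearity ("simple cells")
        # Pooling: local max over pool x pool neighborhoods ("complex cells").
        ph, pw = out.shape[0] // pool, out.shape[1] // pool
        pooled = out[:ph * pool, :pw * pool].reshape(ph, pool, pw, pool).max(axis=(1, 3))
        maps.append(pooled)
    return np.stack(maps)
```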

Page 34:

Filtering + NonLinearity + Pooling = 1 stage of a Convolutional Net

• [Hubel & Wiesel 1962]:
  – simple cells detect local features
  – complex cells “pool” the outputs of simple cells within a retinotopic neighborhood

[Figure: multiple convolutions ("simple cells") produce retinotopic feature maps, followed by pooling/subsampling ("complex cells").]

Page 35:

Convolutional Net Architecture for Handwriting Recognition

[Architecture: input 1@32x32 → 5x5 convolution → Layer 1: 6@28x28 → 2x2 pooling/subsampling → Layer 2: 6@14x14 → 5x5 convolution → Layer 3: 12@10x10 → 2x2 pooling/subsampling → Layer 4: 12@5x5 → 5x5 convolution → Layer 5: 100@1x1 → Layer 6: 10 outputs.]

 Convolutional net for handwriting recognition (400,000 synapses)
 Convolutional layers (simple cells): all units in a feature plane share the same weights
 Pooling/subsampling layers (complex cells): for invariance to small distortions
 Supervised gradient-descent learning using back-propagation
 The entire network is trained end-to-end; all the layers are trained simultaneously. [LeCun et al., Proc. IEEE, 1998]

Page 36:

How to train?

But this is very slow!!!

Page 37:

input ...

Layer-wise Unsupervised Pre-training

Page 38:

input ...

features ...

Layer-wise Unsupervised Pre-training

Page 39:

input ...

features ...

Reconstruction of input =? input

Layer-wise Unsupervised Pre-training

Page 40:

input ...

features ...

Layer-wise Unsupervised Pre-training

Page 41:

input ...

features ...

More abstract features

...

Layer-wise Unsupervised Pre-training

Page 42:

input ...

features ...

More abstract features

...

Reconstruction of features =? features

Layer-wise Unsupervised Pre-training

Page 43:

input ...

features ...

More abstract features

...

Layer-wise Unsupervised Pre-training

Page 44:

input ...

features ...

More abstract features

...

Even more abstract features

...

Layer-wise Unsupervised Pre-training

Page 45:

input ...

features ...

More abstract features

...

Even more abstract features

...

Output f(X) =? Target Y

Layer-wise Unsupervised Pre-training
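A compact sketch of the greedy layer-wise procedure illustrated above, using tied-weight autoencoder layers as the unsupervised building block (RBMs are the other common choice); every function and parameter name here is mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, epochs=20, eta=0.1, seed=0):
    """Train one autoencoder layer: learn features whose reconstruction matches the input."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_hidden))  # tied weights: decoder is W.T
    for _ in range(epochs):
        H = sigmoid(X @ W)                 # features
        R = sigmoid(H @ W.T)               # reconstruction of input (compare: R =? X)
        dR = (R - X) * R * (1 - R)         # error at the reconstruction layer
        dH = (dR @ W) * H * (1 - H)        # error propagated back to the feature layer
        W -= eta * (X.T @ dH + dR.T @ H) / X.shape[0]   # gradient for tied weights
    return W, sigmoid(X @ W)

def greedy_pretrain(X, layer_sizes):
    """Stack layers: the features of one layer become the input of the next."""
    weights, inputs = [], X
    for n_hidden in layer_sizes:
        W, inputs = pretrain_layer(inputs, n_hidden)
        weights.append(W)
    # Afterwards: add an output layer f(X) predicting the target Y and
    # fine-tune the whole stack with supervised backpropagation.
    return weights
```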

Page 46:

Feature learning Successful learning of intermediate representations

[Lee et al ICML 2009, Lee et al NIPS 2009]

Page 47:

Deep Learning is Amazing!!!

Page 48:

A Few Thoughts on How We May Want to Further Study DNN

Page 49:

What makes it work? Why?

Page 50:

An MLer’s View of the World

 Loss functions (likelihood, reconstruction, margin, …)
 Constraints (normality, sparsity, label, prior, KL, sum, …)
 Structures (graphical, group, chain, tree, iid, …)
 Algorithms: MC (MCMC, importance sampling), Opt (gradient, IP), …
 Stopping criteria: change in objective, change in update, …

 Empirical performance?

Page 51:

DL vs. ML (e.g., GM)

Empirical goal:   DL: e.g., classification, feature learning.  GM: e.g., transfer learning, latent variable inference.
Structure:        DL: graphical.  GM: graphical.
Objective:        DL: something aggregated from local functions.  GM: something aggregated from local functions.
Vocabulary:       DL: neuron, activation/gate function, …  GM: variables, potential functions.
Algorithm:        DL: a single, unchallenged inference algorithm (BP).  GM: a major focus of open research; many algorithms, and more to come.
Evaluation:       DL: on a black-box score (end performance).  GM: on almost every intermediate quantity.
Implementation:   DL: many untold tricks.  GM: more or less standardized.
Experiments:      DL: massive, real data (ground truth unknown).  GM: modest, often simulated data (ground truth known).

Page 52:

A slippery slope to mythology?

 How do we conclusively determine where an improvement in performance comes from:
 a better model (architecture, activation, loss, size)?
 a better algorithm (more accurate, faster convergence)?
 better training data?

 Current research in DL seems to get everything above mixed together, by evaluating on a black-box “performance score” that does not directly reflect:
 correctness of inference
 achievability/usefulness of the model
 variance due to stochasticity

Page 53:

Although a single dimension (# of layers) is compared, many other dimensions may also change, to name a few:

• Per training-iteration time• Tolerance to inaccurate inference• Identifiability• …

An Example

Page 54:

Inference quality

 Training error is the old concept for a classifier with no hidden states: no inference is involved, so inference accuracy is not an issue.

 But a DNN is not just a classifier; some DNNs are not even fully supervised, and there are MANY hidden states. Why is their inference quality not taken seriously?

 In DNNs, inference accuracy = visualizing features
 Study of inference accuracy is badly discouraged
 Loss/accuracy is not monitored

Page 55:

Inference/Learning Algorithm, and their evaluation

Page 56:

Learning a GM with Hidden Variables – the thought process

 In fully observed iid settings, the log likelihood decomposes into a sum of local terms (at least for directed models):

 $\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$

 With latent variables, all the parameters become coupled together via marginalization:

 $\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$

Page 57:

Gradient Learning for Mixture Models

 We can learn mixture densities using gradient descent on the log likelihood. The gradients are quite interesting:
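For a mixture density $p(x \mid \theta) = \sum_k \pi_k\, p_k(x \mid \theta_k)$, the gradient takes the standard responsibility-weighted form (a sketch of the usual result, not quoted from the slide):

$\frac{\partial \ell}{\partial \theta_k} = \sum_n r_{nk}\, \frac{\partial \log p_k(x_n \mid \theta_k)}{\partial \theta_k}, \qquad r_{nk} = \frac{\pi_k\, p_k(x_n \mid \theta_k)}{\sum_{k'} \pi_{k'}\, p_{k'}(x_n \mid \theta_{k'})}$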

 In other words, the gradient is aggregated from many intermediate states. Implication: costly iterations, heavy coupling between parameters.

 Other issues: imposing constraints, identifiability, …

Page 58:

Then Alternative Approaches Were Proposed

 The EM algorithm
   M-step: a convex problem
   E-step: approximate constrained optimization (mean field, BP/LBP, marginal polytope)
 Spectral algorithms: redefine the intermediate states, convexify the original problem

Page 59:

Learning a DNN

Page 60:

Learning a DNN

 In a nutshell, sequentially and recursively apply:

 Things can get hairy when locally defined losses are introduced (e.g., the auto-encoder), which breaks a loss-driven global optimization formulation

 Depending on the starting point, BP converges or diverges with probability 1: a serious problem in large-scale DNNs
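A sketch of the recursion being applied, written in standard backpropagation notation of my choosing (with $z^{(l)} = W^{(l)} a^{(l-1)}$ and $a^{(l)} = \sigma(z^{(l)})$):

$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'(z^{(L)}), \qquad \delta^{(l)} = \big(W^{(l+1)}\big)^{T} \delta^{(l+1)} \odot \sigma'(z^{(l)}), \qquad \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \big(a^{(l-1)}\big)^{T}$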

Page 61:

Some new ideas to speed up

 Stacking from smaller building blocks: layers → blocks
 Approximate inference:
   undirected connections for all layers (Markov net) [related work: Salakhutdinov and Hinton, 2009]
   block Gibbs sampling or mean-field
   hierarchical probabilistic inference
 Layer-wise unsupervised learning

Page 62:

Page 63:

DL

 Utility of the network:
   a vehicle to conceptually synthesize complex decision hypotheses (stage-wise projection and aggregation)
   a vehicle for organizing computing operations (stage-wise update of latent states)
   a vehicle for designing processing steps / computing modules (layer-wise parallelization)
   no obvious utility in evaluating DL algorithms
 Utility of the loss function:
   global loss? It is non-convex anyway, why bother?

GM

 A vehicle for synthesizing a global loss function from local structure (potential functions, feature functions)
 A vehicle for designing sound and efficient inference algorithms (sum-product, mean-field)
 A vehicle to inspire approximation and penalization (structured MF, tree approximations)
 A vehicle for monitoring theoretical and empirical behavior and accuracy of inference
 A major measure of the quality of the algorithm and the model

Page 64:

An Old Study of DL as GM Learning

[Figure: a sigmoid belief network viewed as a GM, with mean-field partitions; results compare GMFr, GMFb, and BP.]

The study focused only on inference/learning accuracy, speed, and partitioning.

[Xing, Russell, Jordan, UAI 2003]

Now we can ask: with a correctly learned DN, is it doing well on the desired task?

Page 65:

Why a Graphical Model formulation of DL might be fruitful

 Modular design: easy to incorporate knowledge and interpret, easy to integrate feature learning with high-level tasks, easy to build on existing (partial) solutions
 Defines an explicit and natural objective
 Guides strategies for systematic study of inference, parallelization, evaluation, and theoretical analysis
 A clear path to further upgrades:
   structured prediction
   integration of multiple data modalities
   modeling complex data: time series, missing data, online data, …
   big DL on distributed architectures, where things can get messy everywhere due to incorrect parallel computations

Page 66:

Easy to incorporate knowledge and interpret

[Figure: a dynamic graphical model for speech with variables S_1 … S_K, t_1 … t_K, z_1 … z_K, o_1 … o_K, y_1 … y_K, n_1 … n_K, N_1 … N_K and h, annotated as: targets, articulation, distortion-free acoustics, distorted acoustics, distortion factors & feedback to articulation.]

Slides courtesy: Li Deng

Page 67:

Easy to integrate feature learning with high level tasks

 Hidden Markov Model + Gaussian Mixture Model: jointly trained, but shallow
 Hidden Markov Model + Deep Neural Network: deep, but separately trained
 Hidden Markov Model + Deep Graphical Models: jointly trained and deep

Page 68:

Conclusion

 In GM: lots of effort is directed at improving inference accuracy and convergence speed
   An advanced tutorial would survey dozens of inference algorithms/theories, but few use cases on empirical tasks

 In DL: most effort is directed at comparing different architectures and gate functions (based on empirical performance on a downstream task)
   An advanced tutorial typically consists of a list of all designs of nets and many use cases, but a single named algorithm: backprop with SGD

 The two fields are similar at the beginning (energy, structure, etc.), and soon diverge to their own signature pipelines
 A convergence might be necessary and fruitful
