The Deep Learning Approach to Structured Probabilistic Models
Sargur N. Srihari ([email protected])
Topics in Structured PGMs for Deep Learning
0. Overview
1. Challenge of Unstructured Modeling
2. Using graphs to describe model structure
3. Sampling from graphical models
4. Advantages of structured modeling
5. Learning about Dependencies
6. Inference and Approximate Inference
7. Deep learning approach to structured models
   1. Example: The Restricted Boltzmann Machine
   2. Training an RBM
Special nature of PGMs in Deep Learning
• General PGMs (e.g., an HMM) vs. PGMs in deep learning (e.g., an RBM)
• Same computational tools are used
  – But different design decisions on how the tools are combined
• Very different flavor from traditional PGMs
[Figures: an HMM and an RBM; shaded nodes are observed]
Differences in PGM treatment
• Traditional PGMs vs. PGMs in deep learning differ in:
  1. Depth
  2. Proportion of observed to latent variables
  3. Latent semantics (meaning of a latent variable)
  4. Connectivity and inference algorithm
  5. Intractability and approximation
1. Depth of PGM
• Depth in a PGM:
  – Latent variable h_i is at depth j if the shortest path from h_i to an observed variable is j steps
  – The depth of the model is the greatest such depth
• Depth in a computational graph (deep learning):
  – Generative models often have no latent variables, or only one layer of latent variables
  – But they use deep computational graphs to define the conditional distributions within the model
• Thus PGMs used in deep learning are often not deep PGMs, even though their computational graphs are deep
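As an illustration of this notion of depth (not from the slides), here is a minimal Python sketch: it computes each latent variable's depth as the shortest-path distance to any observed variable via breadth-first search. The tiny graph and the variable names v1, v2, h1, h2 are made-up examples.

```python
from collections import deque

# Hypothetical undirected graph as adjacency lists over variable names.
# 'v*' are observed variables, 'h*' are latent variables.
edges = {
    "v1": ["h1"], "v2": ["h1"],
    "h1": ["v1", "v2", "h2"],
    "h2": ["h1"],
}
observed = {"v1", "v2"}

def latent_depths(edges, observed):
    """BFS from all observed nodes at once.

    depth[x] = length of the shortest path from x to any observed variable.
    """
    depth = {o: 0 for o in observed}
    queue = deque(observed)
    while queue:
        node = queue.popleft()
        for nbr in edges.get(node, []):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return {x: d for x, d in depth.items() if x not in observed}

depths = latent_depths(edges, observed)
print(depths)                                 # {'h1': 1, 'h2': 2}
print("model depth:", max(depths.values()))   # 2
```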
2. Proportion of Observed/Latent Variables
• Deep learning has more latent variables than observed variables
  – Since it always uses distributed representations
• Even shallow models (such as those used for pretraining) have a single large layer of latent variables
  – Complicated nonlinear interactions between variables are accomplished via indirect connections that flow through multiple latent variables
• Traditional PGMs contain mostly variables that are observed (i.e., few latent variables)
3. Latent variable semantics
• In deep learning:
  – Latent variables do not take on any specific semantics ahead of time
  – The training algorithm invents the concepts it needs to model the data
  – Latent variables are not easy to interpret after the fact
• In traditional PGMs:
  – Latent variables are designed with specific semantics in mind
  – e.g., the topic of a document, a student's intelligence, a disease causing symptoms
4. Connectivity
• Deep learning PGMs have large groups of units connected to other large groups of units
  – So that the interactions between two groups can be described by a single matrix
• Traditional PGMs have few connections, and the choice of connections is individually designed for each variable
• The design of the model structure is tightly linked to the choice of inference algorithm
5. Inference
• Traditional PGMs aim for tractability of exact inference
  – When this is too limiting, a popular approximate approach is loopy belief propagation
  – Both approaches work well with sparsely connected graphs
• Deep learning PGMs: graphs are not sparse
  – Use either Gibbs sampling or variational inference
  – Rather than simplifying the model until exact inference is feasible, keep the model as complex as needed, as long as we can compute a gradient
Example: RBMs (Harmoniums)
• RBMs are a quintessential example of how graphical models are used for deep learning
• The RBM itself is not a deep model
  – It has a single layer of latent units that may be used to learn a representation for the input
  – RBMs can be used to build many deeper models
Boltzmann Machine and RBM
• A general Boltzmann machine can have arbitrary connections
• RBM: no visible-visible and no hidden-hidden connections
  – Bipartite graph
• Used to learn features for input to neural networks in deep learning
[Figures: a general Boltzmann machine and an RBM]
RBM Characteristics
• Units are organized into large groups called layers
• Connectivity between layers is described by a matrix
• Connectivity is relatively dense
• Allows efficient Gibbs sampling
• Learns latent variables whose semantics are not defined by the designer
Canonical RBM
• A general energy-based model is defined as
    p̃(x) = exp(−E(x)),   p(x) = (1/Z) p̃(x),   Z = ∫ p̃(x) dx (a sum for discrete x)
  – Factors are of the form exp(a), and exp(a)·exp(b) = exp(a+b)
• The RBM is an energy-based model with binary visible and hidden units, x = {v, h}:
    E(v,h) = −bᵀv − cᵀh − vᵀWh
  – where b, c and W are unconstrained, real-valued learnable parameters
  – The model is divided into two groups of units, v and h, and the interaction between them is described by the matrix W
• The model is depicted graphically on the next slide
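The energy function above is straightforward to compute. The following minimal NumPy sketch (not from the slides; shapes and parameter values are made up) evaluates E(v,h) and the unnormalized probability p̃(v,h) = exp(−E(v,h)) for one binary configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # interaction matrix
b = np.zeros(n_visible)                                 # visible bias
c = np.zeros(n_hidden)                                  # hidden bias

def energy(v, h):
    """E(v,h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

v = rng.integers(0, 2, size=n_visible).astype(float)    # a binary visible vector
h = rng.integers(0, 2, size=n_hidden).astype(float)     # a binary hidden vector

E = energy(v, h)
p_tilde = np.exp(-E)   # unnormalized probability; the partition function Z is intractable in general
print(E, p_tilde)
```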
An RBM drawn as a Markov network
• The model is depicted graphically as a bipartite Markov network
  [Figure: RBM as a Markov network, with a layer of visible units v and a layer of hidden units h]
• There are no edges between visible units or between hidden units
  – Hence "restricted"; a general Boltzmann machine has arbitrary connections
• With x = {v,h}:
    p̃(v,h) = exp(−E(v,h)),   E(v,h) = −bᵀv − cᵀh − vᵀWh
    p(x) = (1/Z) p̃(x),   Z = Σ_{v,h} p̃(v,h)
• This is different from a neural network layer h = g(Wᵀx + c), where g(z) = max{0,z} is the ReLU activation
• Two representational forms of a neural network: y = hᵀw + b
  – The second form is used for the RBM on the next slide
Network view of RBM
[Figure: the RBM drawn as a network, with visible units v connected to hidden units h through weights W]
E(v,h) = −bᵀv − cᵀh − vᵀWh
Properties of RBMs
• Restrictions of the RBM structure yield nice properties:
    p(h|v) = Π_i p(h_i|v)   and   p(v|h) = Π_i p(v_i|h)
  – since nodes in the same layer are conditionally independent given the other layer
• Individual conditionals are simple to compute
  – For a binary RBM, v_j, h_i ∈ {0,1}, and from E(v,h) = −bᵀv − cᵀh − vᵀWh and p̃(v,h) = exp(−E(v,h)) we obtain
    p(h_i = 1|v) = σ(vᵀW_{:,i} + c_i),   p(h_i = 0|v) = 1 − σ(vᵀW_{:,i} + c_i)
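To make these conditionals concrete, here is a minimal NumPy sketch (not from the slides; parameter values are made up) that computes p(h_i = 1 | v) and p(v_j = 1 | h) for a binary RBM with the energy defined above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)   # visible bias
c = np.zeros(n_hidden)    # hidden bias

v = rng.integers(0, 2, size=n_visible).astype(float)
h = rng.integers(0, 2, size=n_hidden).astype(float)

# Conditionals factorize over the units within a layer:
p_h_given_v = sigmoid(v @ W + c)   # vector of p(h_i = 1 | v), length n_hidden
p_v_given_h = sigmoid(W @ h + b)   # vector of p(v_j = 1 | h), length n_visible
print(p_h_given_v, p_v_given_h)
```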
Block Gibbs Sampling with RBM
• Gibbs sampling with M variables:
  – Initialize the first sample {z_i, i = 1,...,M}
  – For t = 1,...,T (T = number of samples):
    • Sample z_1^(t+1) ~ p(z_1 | z_2^(t), z_3^(t), ..., z_M^(t))
    • Sample z_2^(t+1) ~ p(z_2 | z_1^(t+1), z_3^(t), ..., z_M^(t))
    • ...
    • Sample z_j^(t+1) ~ p(z_j | z_1^(t+1), ..., z_{j−1}^(t+1), z_{j+1}^(t), ..., z_M^(t))
    • ...
    • Sample z_M^(t+1) ~ p(z_M | z_1^(t+1), z_2^(t+1), ..., z_{M−1}^(t+1))
• RBM properties allow for block Gibbs sampling
  – Alternate between sampling all of h simultaneously and all of v simultaneously
• Gibbs sampling and the samples obtained are shown next
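A minimal block Gibbs sampler for a binary RBM might look like the following NumPy sketch (not from the slides; the parameters W, b, c and the number of steps are placeholders). Each step samples all hidden units given the visible units, then all visible units given the hidden units.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_gibbs(W, b, c, v0, n_steps, rng):
    """Alternate h ~ p(h|v) and v ~ p(v|h); returns the final (v, h)."""
    v = v0.copy()
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + c)                        # p(h_i = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(W @ h + b)                        # p(v_j = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v0 = rng.integers(0, 2, size=n_visible).astype(float)   # random starting state

v_sample, h_sample = block_gibbs(W, b, c, v0, n_steps=1000, rng=rng)
print(v_sample, h_sample)
```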
Samples from an RBM
• RBM Gibbs samples
  – Model trained on MNIST data
• Each column is a separate Gibbs process
• Each row represents the output of another 1000 steps of Gibbs sampling; successive samples are highly correlated
[Figure: MNIST Gibbs samples and the corresponding weight vectors]
RBM: Comparison with Sparse Coding
• RBM Gibbs samples (model trained on MNIST data)
• Linear factor model (e.g., PCA) samples
  – Sparse coding samples are blurry
[Figure: RBM samples vs. sparse coding samples, with the corresponding weight vectors]
Derivatives of Energy Function
• Energy function: E(v,h) = −bᵀv − cᵀh − vᵀWh
  – where b, c and W are unconstrained, real-valued learnable parameters
• Since the energy function is a linear function of its parameters, it is easy to take derivatives
  – E.g.,  ∂E(v,h)/∂W_{i,j} = −v_i h_j
• These two properties, efficient Gibbs sampling and efficient derivatives, make training convenient
  – Described next
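As a sanity check on that derivative (a minimal sketch, not from the slides; parameter values are made up): the gradient of E(v,h) with respect to W is simply the negative outer product −v hᵀ, which can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 5, 3
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v = rng.integers(0, 2, size=n_visible).astype(float)
h = rng.integers(0, 2, size=n_hidden).astype(float)

def energy(W):
    return -b @ v - c @ h - v @ W @ h

grad_analytic = -np.outer(v, h)           # dE/dW[i, j] = -v_i * h_j

# Finite-difference check for one entry (i, j) = (0, 0)
eps = 1e-6
W_plus = W.copy();  W_plus[0, 0] += eps
W_minus = W.copy(); W_minus[0, 0] -= eps
grad_numeric = (energy(W_plus) - energy(W_minus)) / (2 * eps)

print(grad_analytic[0, 0], grad_numeric)  # should agree closely
```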
Training Undirected Models
• Training undirected models is accomplished by computing derivatives (e.g., ∂E(v,h)/∂W_{i,j} = −v_i h_j)
  – And applying them to samples from the model
• Training induces a representation h of the data v
  – We can use E_{h~p(h|v)}[h] as a set of features to describe v
• The RBM demonstrates the typical deep learning approach to graphical models:
  – Representation learning accomplished via layers of latent variables, combined with efficient interactions between layers parameterized by matrices
RBM with visible and hidden units
• A joint configuration (v,h) of the visible and hidden units has an energy (Hopfield 1982):
    E(v,h) = −Σ_{i∈visible} a_i v_i − Σ_{j∈hidden} b_j h_j − Σ_{i,j} v_i h_j w_{ij}
  – (note the change of notation: a is now the visible bias, b the hidden bias, W the weights)
• The network assigns a probability to every pair of visible and hidden vectors:
    p(v,h) = (1/Z) exp(−E(v,h))
  – where the partition function Z is a sum over all possible pairs of visible/hidden vectors:
    Z = Σ_{v,h} exp(−E(v,h))
• The probability that the network assigns to a visible vector v is
    p(v) = (1/Z) Σ_h exp(−E(v,h))
[Figure: stochastic binary pixels v (biases a_i) connected to stochastic binary feature detectors h (biases b_j) by symmetrically weighted connections W]
Changing probability of v
• The probability the network assigns to a training vector v is raised by adjusting the weights and biases
  – Lower the energy of that image and raise the energy of other images
  – Especially those that have low energies and make a large contribution to the partition function
• Maximum likelihood approach to determine W, a, b:
    Likelihood:  P({v^(1),...,v^(M)}) = Π_m p(v^(m))
    Log-likelihood:
    ln P({v^(1),...,v^(M)}) = Σ_m ln p(v^(m)) = Σ_m ln [ (1/Z) Σ_h exp(−E(v^(m),h)) ]
                            = Σ_m ln Σ_h exp(−E(v^(m),h)) − Σ_m ln Σ_{v,h} exp(−E(v,h))
• Derivative of the log-probability of a training vector with respect to a weight (using d/dx ln x = 1/x and ∂E(v,h)/∂w_{ij} = −v_i h_j):
    ∂ ln p(v)/∂w_{ij} = E_data[v_i h_j] − E_model[v_i h_j]
• Learning rule for stochastic steepest ascent:
    Δw_{ij} = ε ( E_data[v_i h_j] − E_model[v_i h_j] ),   where ε is the learning rate
  – The error for a gradient step is E_data[v_i h_j] − E_model[v_i h_j]
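In code, this learning rule amounts to comparing outer products v hᵀ averaged over data-driven and model-driven samples. The following minimal sketch (not from the slides) applies one stochastic ascent step; the arrays v_data, h_data, v_model, h_model are placeholders standing in for samples obtained by the procedures described on the following slide.

```python
import numpy as np

def rbm_weight_update(W, v_data, h_data, v_model, h_model, lr=0.01):
    """One stochastic steepest-ascent step:
    delta_w_ij = lr * ( E_data[v_i h_j] - E_model[v_i h_j] ),
    with the expectations estimated by mini-batch averages (rows are samples)."""
    positive = v_data.T @ h_data / len(v_data)      # estimate of E_data[v_i h_j]
    negative = v_model.T @ h_model / len(v_model)   # estimate of E_model[v_i h_j]
    return W + lr * (positive - negative)

rng = np.random.default_rng(0)
W = np.zeros((6, 4))
v_data = rng.integers(0, 2, size=(10, 6)).astype(float)    # placeholder data batch
h_data = rng.integers(0, 2, size=(10, 4)).astype(float)    # placeholder hidden samples
v_model = rng.integers(0, 2, size=(10, 6)).astype(float)   # placeholder model samples
h_model = rng.integers(0, 2, size=(10, 4)).astype(float)
W = rbm_weight_update(W, v_data, h_data, v_model, h_model)
```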
Samples for Computing Expectations
• Getting unbiased samples for E_data[v_i h_j]:
  – h_j: given a random training image v, the binary state h_j of each hidden unit is set to 1 with probability
      p(h_j = 1 | v) = σ( b_j + Σ_i v_i w_{ij} )
  – v_i: given the hidden states h, the binary state v_i of each visible unit is set to 1 with probability
      p(v_i = 1 | h) = σ( a_i + Σ_j h_j w_{ij} )
• Getting unbiased samples for E_model[v_i h_j]:
  – Can be done by starting at a random state of the visible units and performing Gibbs sampling for a long time
  – One iteration of alternating Gibbs sampling consists of updating all hidden units in parallel, followed by updating all visible units
Summary of RBM training
• Probability distribution of an undirected (Gibbs) model:
    p(x;θ) = (1/Z(θ)) p̃(x;θ),   Z(θ) = Σ_x p̃(x;θ)   (intractable partition function)
• Determine parameters θ that maximize the log-likelihood (negative loss):
    max_θ L({x^(1),...,x^(M)};θ) = Σ_m log p(x^(m);θ)
    L({x^(1),...,x^(M)};θ) = Σ_m log p̃(x^(m);θ) − Σ_m log Z(θ)
• For stochastic gradient ascent, take derivatives:
    g_m = ∇_θ log p(x^(m);θ) = ∇_θ log p̃(x^(m);θ) − ∇_θ log Z(θ)
  – Derivative of the positive phase: (1/M) Σ_{m=1}^M ∇_θ log p̃(x^(m);θ)
    • Summation is over samples from the training set; since it is summed over M samples, the 1/M factor has no effect on the maximizer
  – Derivative of the negative phase uses the identity ∇_θ log Z(θ) = E_{x~p(x)} ∇_θ log p̃(x), estimated as
    E_{x~p(x)} ∇_θ log p̃(x) ≈ (1/M) Σ_{m=1}^M ∇_θ log p̃(x^(m);θ)
    • Summation is over samples drawn from the RBM
  – Update: θ ← θ + ε g
• For an RBM: x = {v,h}, θ = {W,a,b}
    E(v,h) = −hᵀWv − aᵀv − bᵀh = −Σ_{i,j} W_{i,j} v_i h_j − Σ_i a_i v_i − Σ_j b_j h_j
    p(v,h) = (1/Z) exp(−E(v,h)),   p̃(v,h) = exp(−E(v,h)),   Z = Σ_{v,h} exp(−E(v,h))
    ∂E(v,h)/∂W_{i,j} = −v_i h_j
[Figure: RBM with binary visible units v (biases a) and binary hidden units h (biases b), connected by weights W]
RBM Training Algorithm
• Contrastive Divergence
  – A method to overcome the exponential complexity of dealing with the partition function
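The slides stop here, but to tie the earlier pieces together: contrastive divergence replaces the intractable model expectation E_model[v_i h_j] with an estimate from a few block Gibbs steps started at the training data. The following minimal NumPy sketch of CD-1 (an illustrative reconstruction, not the author's code; hyperparameters and data are placeholders) combines the conditionals, block Gibbs step, and learning rule from the preceding slides, using the a/b bias notation of the later slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, a, b, v_data, rng, lr=0.05):
    """One CD-1 update for a binary RBM (a: visible biases, b: hidden biases)."""
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(v_data @ W + b)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one block Gibbs step, h -> v' -> h'.
    pv_model = sigmoid(h_data @ W.T + a)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + b)

    m = len(v_data)
    W += lr * (v_data.T @ ph_data - v_model.T @ ph_model) / m
    a += lr * (v_data - v_model).mean(axis=0)
    b += lr * (ph_data - ph_model).mean(axis=0)
    return W, a, b

# Placeholder "data": random binary vectors standing in for e.g. MNIST pixels.
rng = np.random.default_rng(0)
n_visible, n_hidden = 20, 8
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
data = rng.integers(0, 2, size=(100, n_visible)).astype(float)

for epoch in range(10):
    for start in range(0, len(data), 10):              # mini-batches of 10
        W, a, b = cd1_step(W, a, b, data[start:start + 10], rng)
```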