The Deep Learning Approach to Structured Probabilistic Models
Sargur N. Srihari ([email protected])
Topics in Structured PGMs for Deep Learning
0. Overview
1. Challenge of Unstructured Modeling
2. Using graphs to describe model structure
3. Sampling from graphical models
4. Advantages of structured modeling
5. Learning about Dependencies
6. Inference and Approximate Inference
7. Deep learning approach to structured models
   1. Example: The Restricted Boltzmann Machine
   2. Training an RBM
Special nature of PGMs in Deep Learning
• General PGMs (e.g., an HMM) vs. PGMs in deep learning (e.g., an RBM)
• Same computational tools are used
  – But different design decisions on how the tools are combined
• Very different flavor from traditional PGMs
[Figures: an HMM and an RBM; shaded nodes are observed]
Differences in PGM treatment
• Traditional PGMs vs. PGMs in deep learning differ in:
  1. Depth
  2. Proportion of observed to latent variables
  3. Latent semantics (meaning of a latent variable)
  4. Connectivity and inference algorithm
  5. Intractability and approximation
1. Depth of PGM
• Depth in a PGM:
  – Latent variable h_i is at depth j if the shortest path from h_i to an observed variable is j steps
  – The depth of the model is the greatest such depth
• Depth in a computational graph (deep learning):
  – Generative models often have no latent variables, or only one layer of latent variables
  – But they use deep computational graphs to define the conditional distributions within the model
• Thus PGMs used in deep learning are often not deep PGMs, even though their computational graphs are deep
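As an illustration of this notion of depth (not from the slides), here is a minimal Python sketch: it computes each latent variable's depth as the shortest-path distance to any observed variable via breadth-first search. The tiny graph and the variable names v1, v2, h1, h2 are made-up examples.

```python
from collections import deque

# Hypothetical undirected graph as adjacency lists over variable names.
# 'v*' are observed variables, 'h*' are latent variables.
edges = {
    "v1": ["h1"], "v2": ["h1"],
    "h1": ["v1", "v2", "h2"],
    "h2": ["h1"],
}
observed = {"v1", "v2"}

def latent_depths(edges, observed):
    """BFS from all observed nodes at once.

    depth[x] = length of the shortest path from x to any observed variable.
    """
    depth = {o: 0 for o in observed}
    queue = deque(observed)
    while queue:
        node = queue.popleft()
        for nbr in edges.get(node, []):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return {x: d for x, d in depth.items() if x not in observed}

depths = latent_depths(edges, observed)
print(depths)                                 # {'h1': 1, 'h2': 2}
print("model depth:", max(depths.values()))   # 2
```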
2. Proportion of Observed/Latent Variables
• Deep learning has more latent variables than observed variables
  – Since it always uses distributed representations
• Even shallow models (such as those used for pretraining) have a single large layer of latent variables
  – Complicated nonlinear interactions between variables are accomplished via indirect connections that flow through multiple latent variables
• Traditional PGMs contain mostly variables that are observed (i.e., few latent variables)
3. Latent variable semantics
• In deep learning:
  – Latent variables do not take on any specific semantics ahead of time
  – The training algorithm invents the concepts it needs to model the data
  – Latent variables are not easy to interpret after the fact
• In traditional PGMs:
  – Latent variables are designed with specific semantics in mind
  – e.g., the topic of a document, a student's intelligence, a disease causing symptoms
4. Connectivity
• Deep learning PGMs have large groups of units connected to other large groups of units
  – So that the interactions between two groups can be described by a single matrix
• Traditional PGMs have few connections, and the choice of connections is individually designed for each variable
• The design of the model structure is tightly linked to the choice of inference algorithm
5. Inference
• Traditional PGMs aim for tractability of exact inference
  – When this is too limiting, a popular approximate approach is loopy belief propagation
  – Both approaches work well with sparsely connected graphs
• Deep learning PGMs: graphs are not sparse
  – Use either Gibbs sampling or variational inference
  – Rather than simplifying the model until exact inference is feasible, keep the model as complex as needed, as long as we can compute a gradient
Example: RBMs (Harmoniums)
• RBMs are a quintessential example of how graphical models are used for deep learning
• The RBM itself is not a deep model
  – It has a single layer of latent units that may be used to learn a representation for the input
  – RBMs can be used to build many deeper models
Boltzmann Machine and RBM
• A general Boltzmann machine can have arbitrary connections
• RBM: no visible-visible and no hidden-hidden connections
  – Bipartite graph
• Used to learn features for input to neural networks in deep learning
[Figures: a general Boltzmann machine and an RBM]
RBM Characteristics
• Units are organized into large groups called layers
• Connectivity between layers is described by a matrix
• Connectivity is relatively dense
• Allows efficient Gibbs sampling
• Learns latent variables whose semantics are not defined by the designer
Canonical RBM
• A general energy-based model is defined as
    p̃(x) = exp(−E(x)),   p(x) = (1/Z) p̃(x),   Z = ∫ p̃(x) dx (a sum for discrete x)
  – Factors are of the form exp(a), and exp(a)·exp(b) = exp(a+b)
• The RBM is an energy-based model with binary visible and hidden units, x = {v, h}:
    E(v,h) = −bᵀv − cᵀh − vᵀWh
  – where b, c and W are unconstrained, real-valued learnable parameters
  – The model is divided into two groups of units, v and h, and the interaction between them is described by the matrix W
• The model is depicted graphically on the next slide
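The energy function above is straightforward to compute. The following minimal NumPy sketch (not from the slides; shapes and parameter values are made up) evaluates E(v,h) and the unnormalized probability p̃(v,h) = exp(−E(v,h)) for one binary configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # interaction matrix
b = np.zeros(n_visible)                                 # visible bias
c = np.zeros(n_hidden)                                  # hidden bias

def energy(v, h):
    """E(v,h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

v = rng.integers(0, 2, size=n_visible).astype(float)    # a binary visible vector
h = rng.integers(0, 2, size=n_hidden).astype(float)     # a binary hidden vector

E = energy(v, h)
p_tilde = np.exp(-E)   # unnormalized probability; the partition function Z is intractable in general
print(E, p_tilde)
```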
An RBM drawn as a Markov network
• The model is depicted graphically as a bipartite Markov network
  [Figure: RBM as a Markov network, with a layer of visible units v and a layer of hidden units h]
• There are no edges between visible units or between hidden units
  – Hence "restricted"; a general Boltzmann machine has arbitrary connections
• With x = {v,h}:
    p̃(v,h) = exp(−E(v,h)),   E(v,h) = −bᵀv − cᵀh − vᵀWh
    p(x) = (1/Z) p̃(x),   Z = Σ_{v,h} p̃(v,h)
• This is different from a neural network layer h = g(Wᵀx + c), where g(z) = max{0,z} is the ReLU activation
• Two representational forms of a neural network: y = hᵀw + b
  – The second form is used for the RBM on the next slide
Network view of RBM
[Figure: the RBM drawn as a network, with visible units v connected to hidden units h through weights W]
E(v,h) = −bᵀv − cᵀh − vᵀWh
Properties of RBMs
• Restrictions of the RBM structure yield nice properties:
    p(h|v) = Π_i p(h_i|v)   and   p(v|h) = Π_i p(v_i|h)
  – since nodes in the same layer are conditionally independent given the other layer
• Individual conditionals are simple to compute
  – For a binary RBM, v_j, h_i ∈ {0,1}, and from E(v,h) = −bᵀv − cᵀh − vᵀWh and p̃(v,h) = exp(−E(v,h)) we obtain
    p(h_i = 1|v) = σ(vᵀW_{:,i} + c_i),   p(h_i = 0|v) = 1 − σ(vᵀW_{:,i} + c_i)
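To make these conditionals concrete, here is a minimal NumPy sketch (not from the slides; parameter values are made up) that computes p(h_i = 1 | v) and p(v_j = 1 | h) for a binary RBM with the energy defined above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)   # visible bias
c = np.zeros(n_hidden)    # hidden bias

v = rng.integers(0, 2, size=n_visible).astype(float)
h = rng.integers(0, 2, size=n_hidden).astype(float)

# Conditionals factorize over the units within a layer:
p_h_given_v = sigmoid(v @ W + c)   # vector of p(h_i = 1 | v), length n_hidden
p_v_given_h = sigmoid(W @ h + b)   # vector of p(v_j = 1 | h), length n_visible
print(p_h_given_v, p_v_given_h)
```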
Block Gibbs Sampling with RBM
• Gibbs sampling with M variables:
  – Initialize the first sample {z_i, i = 1,...,M}
  – For t = 1,...,T (T = number of samples):
    • Sample z_1^(t+1) ~ p(z_1 | z_2^(t), z_3^(t), ..., z_M^(t))
    • Sample z_2^(t+1) ~ p(z_2 | z_1^(t+1), z_3^(t), ..., z_M^(t))
    • ...
    • Sample z_j^(t+1) ~ p(z_j | z_1^(t+1), ..., z_{j−1}^(t+1), z_{j+1}^(t), ..., z_M^(t))
    • ...
    • Sample z_M^(t+1) ~ p(z_M | z_1^(t+1), z_2^(t+1), ..., z_{M−1}^(t+1))
• RBM properties allow for block Gibbs sampling
  – Alternate between sampling all of h simultaneously and all of v simultaneously
• Gibbs sampling and the samples obtained are shown next
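A minimal block Gibbs sampler for a binary RBM might look like the following NumPy sketch (not from the slides; the parameters W, b, c and the number of steps are placeholders). Each step samples all hidden units given the visible units, then all visible units given the hidden units.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_gibbs(W, b, c, v0, n_steps, rng):
    """Alternate h ~ p(h|v) and v ~ p(v|h); returns the final (v, h)."""
    v = v0.copy()
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + c)                        # p(h_i = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(W @ h + b)                        # p(v_j = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v0 = rng.integers(0, 2, size=n_visible).astype(float)   # random starting state

v_sample, h_sample = block_gibbs(W, b, c, v0, n_steps=1000, rng=rng)
print(v_sample, h_sample)
```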
Samples from an RBM
• RBM Gibbs samples
  – Model trained on MNIST data
• Each column is a separate Gibbs process
• Each row represents the output of another 1000 steps of Gibbs sampling; successive samples are highly correlated
[Figure: MNIST Gibbs samples and the corresponding weight vectors]
RBM: Comparison with Sparse Coding
• RBM Gibbs samples (model trained on MNIST data)
• Linear factor model (e.g., PCA) samples
  – Sparse coding samples are blurry
[Figure: RBM samples vs. sparse coding samples, with the corresponding weight vectors]
Derivatives of Energy Function
• Energy function: E(v,h) = −bᵀv − cᵀh − vᵀWh
  – where b, c and W are unconstrained, real-valued learnable parameters
• Since the energy function is a linear function of its parameters, it is easy to take derivatives
  – E.g.,  ∂E(v,h)/∂W_{i,j} = −v_i h_j
• These two properties, efficient Gibbs sampling and efficient derivatives, make training convenient
  – Described next
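As a sanity check on that derivative (a minimal sketch, not from the slides; parameter values are made up): the gradient of E(v,h) with respect to W is simply the negative outer product −v hᵀ, which can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 5, 3
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v = rng.integers(0, 2, size=n_visible).astype(float)
h = rng.integers(0, 2, size=n_hidden).astype(float)

def energy(W):
    return -b @ v - c @ h - v @ W @ h

grad_analytic = -np.outer(v, h)           # dE/dW[i, j] = -v_i * h_j

# Finite-difference check for one entry (i, j) = (0, 0)
eps = 1e-6
W_plus = W.copy();  W_plus[0, 0] += eps
W_minus = W.copy(); W_minus[0, 0] -= eps
grad_numeric = (energy(W_plus) - energy(W_minus)) / (2 * eps)

print(grad_analytic[0, 0], grad_numeric)  # should agree closely
```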
Training Undirected Models
• Training undirected models is accomplished by computing derivatives (e.g., ∂E(v,h)/∂W_{i,j} = −v_i h_j)
  – And applying them to samples from the model
• Training induces a representation h of the data v
  – We can use E_{h~p(h|v)}[h] as a set of features to describe v
• The RBM demonstrates the typical deep learning approach to graphical models:
  – Representation learning accomplished via layers of latent variables, combined with efficient interactions between layers parameterized by matrices
RBM with visible and hidden units
• A joint configuration (v,h) of the visible and hidden units has an energy (Hopfield 1982):
    E(v,h) = −Σ_{i∈visible} a_i v_i − Σ_{j∈hidden} b_j h_j − Σ_{i,j} v_i h_j w_{ij}
  – (note the change of notation: a is now the visible bias, b the hidden bias, W the weights)
• The network assigns a probability to every pair of visible and hidden vectors:
    p(v,h) = (1/Z) exp(−E(v,h))
  – where the partition function Z is a sum over all possible pairs of visible/hidden vectors:
    Z = Σ_{v,h} exp(−E(v,h))
• The probability that the network assigns to a visible vector v is
    p(v) = (1/Z) Σ_h exp(−E(v,h))
[Figure: stochastic binary pixels v (biases a_i) connected to stochastic binary feature detectors h (biases b_j) by symmetrically weighted connections W]
Changing probability of v
• The probability the network assigns to a training vector v is raised by adjusting the weights and biases
  – Lower the energy of that image and raise the energy of other images
  – Especially those that have low energies and make a large contribution to the partition function
• Maximum likelihood approach to determine W, a, b:
    Likelihood:  P({v^(1),...,v^(M)}) = Π_m p(v^(m))
    Log-likelihood:
    ln P({v^(1),...,v^(M)}) = Σ_m ln p(v^(m)) = Σ_m ln [ (1/Z) Σ_h exp(−E(v^(m),h)) ]
                            = Σ_m ln Σ_h exp(−E(v^(m),h)) − Σ_m ln Σ_{v,h} exp(−E(v,h))
• Derivative of the log-probability of a training vector with respect to a weight (using d/dx ln x = 1/x and ∂E(v,h)/∂w_{ij} = −v_i h_j):
    ∂ ln p(v)/∂w_{ij} = E_data[v_i h_j] − E_model[v_i h_j]
• Learning rule for stochastic steepest ascent:
    Δw_{ij} = ε ( E_data[v_i h_j] − E_model[v_i h_j] ),   where ε is the learning rate
  – The error for a gradient step is E_data[v_i h_j] − E_model[v_i h_j]
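In code, this learning rule amounts to comparing outer products v hᵀ averaged over data-driven and model-driven samples. The following minimal sketch (not from the slides) applies one stochastic ascent step; the arrays v_data, h_data, v_model, h_model are placeholders standing in for samples obtained by the procedures described on the following slide.

```python
import numpy as np

def rbm_weight_update(W, v_data, h_data, v_model, h_model, lr=0.01):
    """One stochastic steepest-ascent step:
    delta_w_ij = lr * ( E_data[v_i h_j] - E_model[v_i h_j] ),
    with the expectations estimated by mini-batch averages (rows are samples)."""
    positive = v_data.T @ h_data / len(v_data)      # estimate of E_data[v_i h_j]
    negative = v_model.T @ h_model / len(v_model)   # estimate of E_model[v_i h_j]
    return W + lr * (positive - negative)

rng = np.random.default_rng(0)
W = np.zeros((6, 4))
v_data = rng.integers(0, 2, size=(10, 6)).astype(float)    # placeholder data batch
h_data = rng.integers(0, 2, size=(10, 4)).astype(float)    # placeholder hidden samples
v_model = rng.integers(0, 2, size=(10, 6)).astype(float)   # placeholder model samples
h_model = rng.integers(0, 2, size=(10, 4)).astype(float)
W = rbm_weight_update(W, v_data, h_data, v_model, h_model)
```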
Samples for Computing Expectations
• Getting unbiased samples for E_data[v_i h_j]:
  – h_j: given a random training image v, the binary state h_j of each hidden unit is set to 1 with probability
      p(h_j = 1 | v) = σ( b_j + Σ_i v_i w_{ij} )
  – v_i: given the hidden states h, the binary state v_i of each visible unit is set to 1 with probability
      p(v_i = 1 | h) = σ( a_i + Σ_j h_j w_{ij} )
• Getting unbiased samples for E_model[v_i h_j]:
  – Can be done by starting at a random state of the visible units and performing Gibbs sampling for a long time
  – One iteration of alternating Gibbs sampling consists of updating all hidden units in parallel, followed by updating all visible units
Summary of RBM training
• Probability distribution of an undirected (Gibbs) model:
    p(x;θ) = (1/Z(θ)) p̃(x;θ),   Z(θ) = Σ_x p̃(x;θ)   (intractable partition function)
• Determine parameters θ that maximize the log-likelihood (negative loss):
    max_θ L({x^(1),...,x^(M)};θ) = Σ_m log p(x^(m);θ)
    L({x^(1),...,x^(M)};θ) = Σ_m log p̃(x^(m);θ) − Σ_m log Z(θ)
• For stochastic gradient ascent, take derivatives:
    g_m = ∇_θ log p(x^(m);θ) = ∇_θ log p̃(x^(m);θ) − ∇_θ log Z(θ)
  – Derivative of the positive phase: (1/M) Σ_{m=1}^M ∇_θ log p̃(x^(m);θ)
    • Summation is over samples from the training set; since it is summed over M samples, the 1/M factor has no effect on the maximizer
  – Derivative of the negative phase uses the identity ∇_θ log Z(θ) = E_{x~p(x)} ∇_θ log p̃(x), estimated as
    E_{x~p(x)} ∇_θ log p̃(x) ≈ (1/M) Σ_{m=1}^M ∇_θ log p̃(x^(m);θ)
    • Summation is over samples drawn from the RBM
  – Update: θ ← θ + ε g
• For an RBM: x = {v,h}, θ = {W,a,b}
    E(v,h) = −hᵀWv − aᵀv − bᵀh = −Σ_{i,j} W_{i,j} v_i h_j − Σ_i a_i v_i − Σ_j b_j h_j
    p(v,h) = (1/Z) exp(−E(v,h)),   p̃(v,h) = exp(−E(v,h)),   Z = Σ_{v,h} exp(−E(v,h))
    ∂E(v,h)/∂W_{i,j} = −v_i h_j
[Figure: RBM with binary visible units v (biases a) and binary hidden units h (biases b), connected by weights W]
RBM Training Algorithm
• Contrastive Divergence
  – A method to overcome the exponential complexity of dealing with the partition function
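The slides stop here, but to tie the earlier pieces together: contrastive divergence replaces the intractable model expectation E_model[v_i h_j] with an estimate from a few block Gibbs steps started at the training data. The following minimal NumPy sketch of CD-1 (an illustrative reconstruction, not the author's code; hyperparameters and data are placeholders) combines the conditionals, block Gibbs step, and learning rule from the preceding slides, using the a/b bias notation of the later slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, a, b, v_data, rng, lr=0.05):
    """One CD-1 update for a binary RBM (a: visible biases, b: hidden biases)."""
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(v_data @ W + b)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one block Gibbs step, h -> v' -> h'.
    pv_model = sigmoid(h_data @ W.T + a)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + b)

    m = len(v_data)
    W += lr * (v_data.T @ ph_data - v_model.T @ ph_model) / m
    a += lr * (v_data - v_model).mean(axis=0)
    b += lr * (ph_data - ph_model).mean(axis=0)
    return W, a, b

# Placeholder "data": random binary vectors standing in for e.g. MNIST pixels.
rng = np.random.default_rng(0)
n_visible, n_hidden = 20, 8
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
data = rng.integers(0, 2, size=(100, n_visible)).astype(float)

for epoch in range(10):
    for start in range(0, len(data), 10):              # mini-batches of 10
        W, a, b = cd1_step(W, a, b, data[start:start + 10], rng)
```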