
Graphical Models

Daniel Khashabi

Spring 2014. Last Update: March 19, 2015

1 Independence Rules

Here is a notation: X − Y − Z is equivalent to

X ⊥⊥ Z | Y,

which is also equivalent to

µ(X, Y, Z) ∝ f(X, Y) g(Y, Z) (for strictly positive terms).

Definition 1 (Graph Separation). C separates A and B if any path starting in A and terminating in B has at least one node in C.

Definition 2 (Global Markov Property). A distribution µ(·) on X^V satisfies the Global Markov Property with respect to a graph G if for any partition (A, B, C) such that C separates A and B we have:

µ(xA, xB|xC) ∝ µ(xA|xC)µ(xB|xC)

Definition 3 (Local Markov Property). A distribution µ(·) on X^V satisfies the Local Markov Property with respect to a graph G if for any i ∈ V:

µ(xi, xV \i,∂i|x∂i) ∝ µ(xi|x∂i)µ(xV \i,∂i|x∂i)

Definition 4 (Pairwise Markov Property). A distribution µ(·) on X^V satisfies the Pairwise Markov Property with respect to a graph G if for any non-adjacent i, j ∈ V:

µ(xi, xj |xV \i,j) ∝ µ(xi|xV \i,j)µ(xj |xV \i,j)

• Global Markov Property: (G)

• Local Markov Property: (L)

• Pairwise Markov Property: (P)


Lemma 1. We have the following implications between the independence conditions:

(G)⇒ (L)⇒ (P)

The reverse implications are dependent on µ(.) being strictly positive.

Proof. • Proof of (G) ⇒ (L): easy to show using just the definitions. Choose A = {i}, C = ∂i, and B = V \ ({i} ∪ ∂i); then C separates A and B.

• Proof of (L)⇒ (P):

x_i ⊥⊥ x_{V\{i,∂i}} | x_{∂i}  ⇒  x_i ⊥⊥ x_{V\{i,∂i}} | x_{V\{i,j}}  ⇒  x_i ⊥⊥ x_j | x_{V\{i,j}}

• Proof of (P) ⇒ (G): by induction.

Remark 1. In general (P) ⇒ (G) does not hold. It holds if the following condition holds for all disjoint subsets A, B, C, D ⊂ V:

if x_A ⊥⊥ x_B | x_C, x_D and x_A ⊥⊥ x_C | x_B, x_D, then x_A ⊥⊥ (x_B, x_C) | x_D.

For strictly positive µ(.) the above holds.

Remark 2. In general, a distribution that is pairwise Markov with respect to G is not necessarily locally Markov with respect to the same graph.

1.1 Independence Map (I-map)

Let

• I(G) denote the set of all conditional independencies implied by graph G.

• I(µ) denote the set of all conditional independencies implied by a distribution µ(.).

Following the definitions of the Local and Pairwise Markov properties we can define I_L(µ) and I_P(µ). Again, similar to Lemma 1, for general distributions I_P(µ) is a strict subset of I_L(µ), and I_L(µ) is a strict subset of I_G(µ) = I(µ). These sets are exactly the same if µ(·) is strictly positive.

G is an I-map of µ if µ satisfies all of the independencies of the graph G, i.e.

I(G) ⊂ I(µ)

Remark 3. A complete graph is always an I-map, since it does not express any independence; therefore I(G) = ∅. It can then be observed that removing edges adds more independencies to I(G). In the case where the graph G has no edges, I(G) contains all possible independencies between the vertices.

2

Page 3: Graphical Models - Daniel Khashabidanielkhashabi.com/learn/graph.pdf · T 5 T 6 T 7 T 8 T 5 T 6 T 7 T 8 T 5 T 6 T 7 T Figure 1: Example graphs. Figure 2: Di erent classes of perfect

Following Remark 3, we look for a graph G that expresses as many independencies as possible while still being an I-map. A graph G is a minimal I-map for µ(x) if it is an I-map and removing any single edge from G causes it to no longer be an I-map.

Example 1. Consider the following distribution:

P (x1 = x2 = x3 = x4 = 1) = 0.5, P (x1 = x2 = x3 = x4 = 0) = 0.5

The graphs in Figure 1 (left and middle) are minimal I-maps. This can be proven by showing that the graphs are I-maps and that removing any edge adds independence assertions that are not satisfied by the distribution.

Here is a directed-graph construction of minimal I-maps:

1. Choose an arbitrary ordering x_1, . . . , x_n.

2. Factorize µ(x) using the chain rule:

µ(x) = µ(x_1) µ(x_2|x_1) µ(x_3|x_1, x_2) · · · µ(x_n|x_1, x_2, . . . , x_{n−1})

3. For each i, choose its parent set π(i) to be a minimal subset of {x_1, . . . , x_{i−1}} such that

x_i − x_{π(i)} − ({x_1, . . . , x_{i−1}} \ x_{π(i)})

The following is the construction of an undirected minimal I-map. The construction gives a unique solution if µ(·) > 0.

1. Create a complete graph.

2. Remove edge eij ∈ E if i and j are conditionally independent given the rest of the nodes.

Example 2. Consider the distribution in Example 1. If we construct a graph based on the pairwise independence rule of the distribution, we end up with the graph in Figure 1 (right), which is not an I-map. In other words, the construction does not give an I-map.

The same example shows that the pairwise Markov property does not imply the local Markov property: the graph in Figure 1 (right) is pairwise Markov with respect to the distribution, but since no node has any neighbors, the local Markov property (which conditions only on the neighbors) does not hold.

A graph G is called a Perfect Map for µ(x) if I(G) = I(µ).


Figure 1: Example graphs.

Figure 2: Different classes of perfect maps.

1.2 Moralization

Moralization is the procedure of converting a Bayesian network D into an MRF G such that I(G) ⊆ I(D): (1) retain all edges of D and make them undirected; (2) connect every pair of parents of a common child with an edge. It can be proven that the resulting G is a minimal (undirected) I-map of I(D).

Lemma 2. There exists a graph G such that I(D) = I(G) if and only if moralization does not add any edges.
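To make the two-step procedure concrete, here is a minimal Python sketch of moralization. The dictionary representation (each node mapped to its list of parents) and the function name are illustrative choices, not notation from these notes.

```python
from itertools import combinations

def moralize(parents):
    """Moralize a DAG given as {node: [parents]}: drop directions and marry parents."""
    nodes = set(parents)
    for ps in parents.values():
        nodes.update(ps)
    edges = set()
    # step 1: retain all edges of D, made undirected
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))
    # step 2: connect every pair of parents of a common child
    for ps in parents.values():
        for u, v in combinations(ps, 2):
            edges.add(frozenset((u, v)))
    return nodes, edges

# v-structure 1 -> 3 <- 2: moralization adds the extra edge {1, 2}
print(moralize({3: [1, 2]}))
```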

2 Undirected Graphical Models (Markov Random Fields)

Consider an undirected graph G = (V, E) whose vertices correspond to the set of variables X = (X_1, . . . , X_n). In this notation n = |V| and m = |E|.

The graph shows the connections (interactions/dependencies) between the variables in the model. Thus it is more accurate to model sets of variables that are connected to each other jointly. The most accurate modelling would define a joint distribution over all the variables in a connected graph. But such


Figure 3: An example of moralization of a graph.


a modelling strategy would create very big and cumbersome distributions which are hard to train and do inference with. Thus we prefer to make approximations. One such approximation is the Markov property in writing the joint distribution over the variables in the graph.

Definition 5 (Markov property in undirected graphical models). Say we have a graph G = (V, E) in which the vertices represent the variables and the edges represent the connections between the variables. The variables have the Markov property if, for any two sets of variables X_A and X_B (corresponding to the vertex sets A and B respectively) and any set of variables X_S such that S separates A and B, the variables X_A and X_B are independent conditioned on X_S. Here "S separates A and B" means that any path from a vertex in A to a vertex in B uses at least one vertex in S. The mathematical notation for this definition is the following,

XA ⊥⊥ XB|XS

Based on the above definition, the following two properties follow from the Markov assumption.

Corollary 1. Pairwise Markov property: any two vertices that are not neighbours are independent given all the other vertices:

X_a ⊥⊥ X_b | X_{V\{a,b}}   whenever (a, b) ∉ E

Local Markov property: any vertex is independent of all vertices other than its neighbours and itself, conditioned on all of its neighbours:

X_a ⊥⊥ X_{V\{a, ne(a)}} | X_{ne(a)}

Note that the local Markov property is stronger than the pairwise one.

To define a proper factorization we need to be more careful: the factorization should be as simple as possible, respect the given graph with its edge connections, and obey the Markov independence property. Let us define some more graph terminology. A clique C is a set of vertices that are all connected to each other. For example, in Figure 4 three cliques are depicted (red, purple and green). The easiest way to define the factorization of the distribution is to use cliques,

p(X) = ∏_{C ∈ clique(G)} p(X_C).     (1)

This is a valid factorization since it has the local Markov property (proof?).

Definition 6 (Gibbs Random Field). When the probability factors defined for each of the cliques of the Markov random field are strictly positive, we call it a Gibbs random field.

Theorem 1 (Hammersley-Clifford theorem). A positive probability distribution with continuous density satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G.

In other words, Hammersley-Clifford is saying that for any strictly positive distribution p(.),the Markov property is equivalent to the factorization in Equation 1.


Figure 4: An undirected graphical model

Example 3 (Discrete Markov Random Field). Let G = (V, E) be an undirected graph. Each vertex is associated with a random variable x_s ∈ {0, 1, . . . , d − 1}. Assume the following exponential density,

p(x; θ) ∝ exp( Σ_{v∈V} θ_v(x_v) + Σ_{(v,u)∈E} θ_{vu}(x_v, x_u) ),

where the factors used in the definition of the above function are

θ_v(x_v) = Σ_{j=0}^{d−1} α_{v;j} I_j(x_v)

θ_{vu}(x_v, x_u) = Σ_{j=0}^{d−1} Σ_{k=0}^{d−1} α_{vu;jk} I_j(x_v) I_k(x_u)

where I_·(·) is the indicator function,

I_j(x_v) = 1 if x_v = j, and 0 otherwise,

and the parameters of the model are

θ = {α_{v;j} ∈ R : v ∈ V, j ∈ X} ∪ {α_{vu;jk} ∈ R : (v, u) ∈ E, (j, k) ∈ X × X}.

The cumulant generating function (log normalization function) is,

A(θ) = log Σ_{x∈X^V} exp( Σ_{v∈V} θ_v(x_v) + Σ_{(v,u)∈E} θ_{vu}(x_v, x_u) ).

Example 4 (Ising Model). The Ising model is a special case of the discrete Markov random field, where each variable can take only two values, x_s ∈ {0, 1} (d = 2). The probability distribution can be written in a simpler form,

p(x; θ) ∝ exp( Σ_{v∈V} θ_v x_v + Σ_{(v,u)∈E} θ_{vu} x_v x_u ),

and the parameters of the model are

θ = {θ_v ∈ R : v ∈ V} ∪ {θ_{vu} ∈ R : (v, u) ∈ E}.

In this exponential representation the sufficient-statistics vector is φ(x) = [. . . x_v . . . x_v x_u . . .]^⊤, with length |V| + |E|. This exponential family is regular with Ω = R^{|V|+|E|}, and the representation is minimal (if you don't know what these properties mean, refer to Wiki).
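As a small illustration of this exponential-family form, the sketch below evaluates the unnormalized probability and the log-partition A(θ) by brute-force enumeration on a toy graph; the graph, parameter values and helper names are made up for illustration.

```python
import itertools, math

def log_weight(x, theta_v, theta_e):
    """Unnormalized log-probability of a configuration x (dict node -> value in {0,1})."""
    s = sum(theta_v[v] * x[v] for v in theta_v)
    s += sum(t * x[u] * x[v] for (u, v), t in theta_e.items())
    return s

def log_partition(nodes, theta_v, theta_e):
    """A(theta) = log sum_x exp(...) by enumerating all 2^|V| configurations."""
    total = 0.0
    for vals in itertools.product([0, 1], repeat=len(nodes)):
        x = dict(zip(nodes, vals))
        total += math.exp(log_weight(x, theta_v, theta_e))
    return math.log(total)

nodes = [1, 2, 3]
theta_v = {1: 0.5, 2: -0.2, 3: 0.1}
theta_e = {(1, 2): 1.0, (2, 3): -0.5}
A = log_partition(nodes, theta_v, theta_e)
x = {1: 1, 2: 1, 3: 0}
print(math.exp(log_weight(x, theta_v, theta_e) - A))  # normalized p(x; theta)
```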


Figure 5: A directed graphical model; the parents of the node in orange are depicted in a green oval.

Figure 6: Structure of a Hidden Markov Model.

3 Directed Graphical Models

Directed graphical models are directed acyclic graphs (DAGs), which admit a factorization of the distribution based on the child-parent structure,

p(X) = ∏_{u∈V} p(u | pr(u)),

where pr(u) is the set of parent vertices of u. An example directed graphical model is depicted in Figure 5, in which the set of parents of the orange vertex is denoted by a green oval.

Example 5 (HMM). An example of a directed graphical model is the Hidden Markov Model (HMM). The joint distribution of the observations and the latent variables shown in Figure 6 is:

p(X, Z) = p(Z_1) p(X_1|Z_1) · p(Z_2|Z_1) p(X_2|Z_2) · · · = p(X_1|Z_1) p(Z_1) ∏_{i=2}^{n} p(X_i|Z_i) p(Z_i|Z_{i−1}).

It should be clear that the above decomposition of the probability distribution for an HMM exactly follows the definition of a directed graphical model.
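A direct transcription of this factorization into code, with a made-up two-state HMM as the example; all tables below are illustrative, not taken from the figure.

```python
def hmm_joint(z, x, p_z1, p_trans, p_emit):
    """p(X, Z) = p(Z1) p(X1|Z1) * prod_{i>=2} p(Zi|Z_{i-1}) p(Xi|Zi)."""
    p = p_z1[z[0]] * p_emit[z[0]][x[0]]
    for i in range(1, len(z)):
        p *= p_trans[z[i - 1]][z[i]] * p_emit[z[i]][x[i]]
    return p

# toy two-state HMM with binary observations
p_z1 = {'a': 0.6, 'b': 0.4}
p_trans = {'a': {'a': 0.7, 'b': 0.3}, 'b': {'a': 0.2, 'b': 0.8}}
p_emit = {'a': {0: 0.9, 1: 0.1}, 'b': {0: 0.3, 1: 0.7}}
print(hmm_joint(['a', 'a', 'b'], [0, 1, 1], p_z1, p_trans, p_emit))
```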

3.1 Naive Bayes

In the Naive Bayes classifier, we assume independence of the input features given the class; that is, we can write their joint distribution in the following way:

f(X_1, X_2, . . . , X_n | G = g) = ∏_{i=1}^{n} f(X_i | G = g)


Note that in the above formulation, X_i refers to each input feature, each of which could be of several dimensions (and should not be confused with the dimensions of one single feature, though in some cases they can be thought of as the same thing). Applying this assumption to the posterior log-odds, we have:

log [ Pr(G = l | X = x) / Pr(G = k | X = x) ]

= log [ π_l f(X_1, X_2, . . . , X_n | G = l) / ( π_k f(X_1, X_2, . . . , X_n | G = k) ) ]

= log [ π_l ∏_{i=1}^{n} f(X_i | G = l) / ( π_k ∏_{i=1}^{n} f(X_i | G = k) ) ]

= log (π_l / π_k) + Σ_{i=1}^{n} log [ f(X_i | G = l) / f(X_i | G = k) ]
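The resulting classifier only needs the class priors and the per-feature class-conditional densities; here is a minimal sketch with made-up discrete likelihood tables (all names and numbers are illustrative).

```python
import math

def log_odds(x, prior, lik, l, k):
    """log Pr(G=l|x)/Pr(G=k|x) = log(pi_l/pi_k) + sum_i log f(x_i|l)/f(x_i|k)."""
    s = math.log(prior[l] / prior[k])
    for i, xi in enumerate(x):
        s += math.log(lik[l][i][xi] / lik[k][i][xi])
    return s

prior = {'spam': 0.3, 'ham': 0.7}
# lik[class][feature index][feature value]
lik = {'spam': [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
       'ham':  [{0: 0.9, 1: 0.1}, {0: 0.6, 1: 0.4}]}
print(log_odds([1, 0], prior, lik, 'spam', 'ham'))
```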

Example 6. We know that one intuition behind Gaussian BP is computing the minimizer of a quadratic function:

x̂ = arg min_{x∈R^n} (1/2)⟨x, Qx⟩ + ⟨b, x⟩     (2)

for some positive definite Q ∈ R^{n×n}.

This can be extended to a more general case, where Q is not positive definite but is full rank. In this case the inverse is still defined and therefore x̂ = −Q^{-1}b is well-defined (it is a stationary/saddle point of the above quadratic program). The claim is that BP can still compute the correct result in this case.

We consider a specific model:

y = A s_0 + w_0

where s_0 ∈ R^n is an unknown signal, y ∈ R^m is the vector of observations, A ∈ R^{m×n} is the measurement matrix, and w_0 ∈ R^m is a vector of i.i.d. Gaussian noise with entries w_{0,i} ∼ N(0, σ²). The goal is to reconstruct s_0 and w_0 given A and y.

A popular way to do this is Ridge regression:

ŝ = arg min_{s∈R^n} (1/2)‖y − As‖²₂ + (λ/2)‖s‖²₂     (3)

Drawing inspiration from the Ridge model, we define the following loss function:

C_{A,y}(x = (z, s)) = −(1/2)‖z‖²₂ + (λ/2)‖s‖²₂ + ⟨z, y − As⟩     (4)

1. We simplify the cost in Eq. 4 and write it in the form of Eq. 2:

C_{A,y}(x = (z, s)) = (1/2) [z^⊤, s^⊤] Q [z; s] + b^⊤ [z; s]


where

Q = [[ −I_m , −A ], [ −A^⊤ , λ I_n ]],    b = [ y ; 0_{n×1} ]     (5)

2. Since we have shown that Eq. 4 can be written in the form of Eq. 2, its stationary point is

x̂ = (ẑ^⊤, ŝ^⊤)^⊤ = −Q^{-1} b,

where Q and b are defined in (5). Substituting their definitions,

x̂ = [ẑ; ŝ] = [[ I_m , A ], [ A^⊤ , −λ I_n ]]^{-1} [ y ; 0_{n×1} ].

Since we only want ŝ, we can use the blockwise inversion trick:

ŝ = (λ I_n + A^⊤A)^{-1} A^⊤ y = λ^{-1} A^⊤ (I_m + λ^{-1} A A^⊤)^{-1} y

To verify that this corresponds to the solution of Ridge regression, it is enough to find the minimizer of the Ridge objective directly:

f(s) = (1/2)‖y − As‖²₂ + (λ/2)‖s‖²₂.

Then

∇_s f(s) = −A^⊤(y − As) + λ s = 0  ⇒  (A^⊤A + λ I_n) s = A^⊤ y  ⇒  ŝ_ridge = (A^⊤A + λ I_n)^{-1} A^⊤ y,

which proves the equivalence between ŝ and ŝ_ridge (this equivalence is also checked numerically in the sketch at the end of this example).

3. To solve this we can use updates similar to Gaussian BP. Suppose the dimension of Q is n × n. Then the updates are:

b_{i→j} = b_i − Σ_{k∈[n]\j} Q_{ik} b_{k→i} / Q_{k→i}

Q_{i→j} = Q_{ii} − Σ_{k∈[n]\j} Q_{ik}² / Q_{k→i}

4. The proof of convergence follows the proof of convergence of standard Gaussian BP, which is done on a computation tree, by showing that BP converges within at most two times the tree-width of the computation tree iterations (if it converges to any fixed point).
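Assuming the reconstruction of the block system above is right, the equivalence between its stationary point and the Ridge solution can be checked numerically; a small NumPy sketch with random A, y and an arbitrary λ:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 5, 3, 0.7
A = rng.normal(size=(m, n))
y = rng.normal(size=m)

# stationarity system of Eq. (4): [[I, A], [A^T, -lam*I]] [z; s] = [y; 0]
Q = np.block([[np.eye(m), A], [A.T, -lam * np.eye(n)]])
b = np.concatenate([y, np.zeros(n)])
s_block = np.linalg.solve(Q, b)[m:]

# ridge solution (A^T A + lam*I)^{-1} A^T y
s_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
print(np.allclose(s_block, s_ridge))  # True
```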

4 Factor graphs

Factor graphs unify the directed and undirected representations of graphical models [3]. As shown in Figure 7, there are two types of nodes: factor nodes and variable nodes. In addition to vertices that represent random variables (as was the case in directed and undirected graphical models), we introduce rectangular vertices to the representation. Two examples are depicted in Figure 7. The circular nodes stand for random variables, and the rectangles are joint functions of the variables they are connected to. This representation can give a more accurate decomposition of the probability factors over the variables. The reason factor graphs subsume undirected and directed graphical models is that the function/distribution in a factor can be anything,


e.g. a joint distribution or a conditional distribution. To have a general representation we denote the function/distribution defined on each factor by f(x), where x is the set of variables connected to that factor. For example, for the factor graph in Figure 7, the expansion of the distribution is the following,

p(x1, x2, x3) ∝ f(x1, x2)f(x2, x3)f(x1, x3)

Figure 7: Representation of factor graphs. As it is shown in the right figure, there are two types ofnodes, factor nodes, and variable nodes.

Example 7 (HMM as a factor graph). In Figure 8 an HMM model is shown. Let us define the shorthand P(Y = y | X = x) = P(y|x). The probability of the observations is

P(y|x) ∝ ∏_i P(y_i|x_i) P(y_i|y_{i−1})

One can convert this into the factor graph, and eliminate the observed variables. The vertical factorsP(yi|xi) represent the probability of observations and the horizontal factors P(yi|yi−1) represent theprobabilities of transitions.


Figure 8: HMM in factor graph representation.

Example 8 (Chain CRF as a factor graph). In Figure 9 a chain-CRF [5] is represented. Assume the input is X = (X_1, . . . , X_n). We know this CRF can be formulated as follows,

P(y|x) ∝ exp( Σ_i w^⊤ f_i(y_i, x) + Σ_i w^⊤ f_i(y_i, y_{i−1}, x) )


The equivalent factor graph is also shown. The conversion is very simple: the vertical factors are exp(w^⊤ f_i(y_i, x)) and the horizontal factors are exp(w^⊤ f_i(y_i, y_{i−1}, x)). In general any CRF can be represented as a factor graph.


Figure 9: Chain-CRF as factor graph.

Example 9 (Dependency Parsing). In [7] the dependency parsing problem is modelled as a factor graph; see Figures 10 and 11. For each sentence, we have a triangle of variables, in which each variable can take 3 possible values:

y_i ∈ {left, right, off}

[Figure 10 content: grids of L/R/O arc labels over the words of the two sentences "He ate dinner with salad" and "He ate dinner with spoon".]

Figure 10: Dependency Parse for two example sentences.


Figure 11: Modelling dependency parsing with a factor graph.

Example 10 (Joint inference: sequence labelling). In Figure 12 a joint sequence labelling is shownwith a factor graph.


Figure 12: Joint inference with factor graph.


Figure 13: Alignment of sentences with factor graphs.

5 Sum-product algorithm

Say we have an acyclic graph and we want to calculate the marginals. Suppose we have the following factorization on an arbitrary acyclic graph,

p(x) = (1/Z) ∏_{s∈V} ψ(x_s) ∏_{(u,v)∈E} ψ(x_u, x_v)     (6)

Say we are given a graph and we want to marginalize over a leaf variable x_v (Figure 14, left),

p(x_{\v}) ∝ Σ_{x_v} ∏_{s∈V} ψ(x_s) ∏_{(a,b)∈E} ψ(x_a, x_b),

which can be simplified as,

p(x_{\v}) ∝ [ ∏_{s∈V\v} ψ(x_s) ∏_{(a,b)∈E, a,b≠v} ψ(x_a, x_b) ] · Σ_{x_v} ψ(x_v) ∏_{(a,v)∈E} ψ(x_a, x_v).


Figure 14: The elimination of a leaf node.


As in the case of Figure 14, if we assume that v (the leaf) has only one parent u, we can simplify the equation further,

p(x_{\v}) ∝ [ ∏_{s∈V\v} ψ(x_s) ∏_{(a,b)∈E, a,b≠v} ψ(x_a, x_b) ] · Σ_{x_v} ψ(x_v) ψ(x_u, x_v).     (7)

Thus marginalizing over a leaf variable is equivalent to marginalizing over the factor of the leaf variable and its mutual factor with its parent, and multiplying the result into the rest of the product. This idea makes the problem of inference on acyclic graphs much easier: instead of global summations (integrations) we only need local summations, by carefully ordering the variables. Now we make the notation more concrete by defining messages. We define M_{v→u}(x_u) as the message from v (a leaf) to u, which is the marginalization over the sender's own factor and the mutual factor, and is only a function of the recipient:

M_{v→u}(x_u) = Σ_{x_v} ψ(x_v) ψ(x_u, x_v).

So we can say that when marginalizing leaf variables, we only need to keep the messages and throw everything else away.

Now consider a more general case, as in Figure 14 (right), where we are marginalizing over a set of variables including v, which is not a leaf. For simplicity assume the others are three leaf nodes a, b, c. Since we want to marginalize over v, a, b, c, we can start by calculating the messages from the leaves, M_{a→v}(x_v), M_{b→v}(x_v) and M_{c→v}(x_v), and multiplying them into the joint distribution of the remaining variables. Now we only need to marginalize over v:

p(x_{\v}) ∝ [ ∏_{s∈V\v} ψ(x_s) ∏_{(a,b)∈E, a,b≠v} ψ(x_a, x_b) ] · Σ_{x_v} ψ(x_v) ψ(x_u, x_v) [ M_{a→v}(x_v) M_{b→v}(x_v) M_{c→v}(x_v) ],

where the bracketed term collects the messages from the children.

The only difference between the above equation and Equation 7 is the last term, the messages from the children. In other words, to marginalize over a non-leaf variable v, we only need to calculate the messages from its children and then sum, over x_v, the product of:

• the messages from its children (here a, b and c),

• the factor on v itself,

• the joint factor of v with its parent u.

Now we can give the general definition of a message,

M_{v→u}(x_u) = Σ_{x_v} ψ(x_v) ψ(x_u, x_v) ∏_{a∈N(v)\u} M_{a→v}(x_v).

In the above definition N(v) is the set of all vertices which neighbour v. Based on the above message formula we can calculate the marginals as follows,

p(x_s) ∝ ψ_s(x_s) ∏_{t∈N(s)} M_{t→s}(x_s)
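A compact sketch of these tree message updates for pairwise models; the dictionary layout (unary potentials psi[v], edge potentials psi_e[(u, v)] indexed as [x_u, x_v]) is an illustrative choice, not notation from the notes.

```python
import numpy as np

def message(v, u, nbrs, psi, psi_e):
    """M_{v->u}(x_u) = sum_{x_v} psi_v(x_v) psi_{uv}(x_u, x_v) prod_{a in N(v)\\u} M_{a->v}(x_v)."""
    prod = psi[v].copy()
    for a in nbrs[v]:
        if a != u:
            prod *= message(a, v, nbrs, psi, psi_e)
    pairwise = psi_e[(u, v)] if (u, v) in psi_e else psi_e[(v, u)].T  # pairwise[x_u, x_v]
    return pairwise @ prod

def marginal(s, nbrs, psi, psi_e):
    p = psi[s].copy()
    for t in nbrs[s]:
        p *= message(t, s, nbrs, psi, psi_e)
    return p / p.sum()

# chain 0 - 1 - 2 with binary variables
nbrs = {0: [1], 1: [0, 2], 2: [1]}
psi = {v: np.ones(2) for v in nbrs}
psi_e = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
         (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
print(marginal(1, nbrs, psi, psi_e))
```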


Corollary 2. Since any tree is an acyclic graph, the marginals can be calculated in one sweep over the nodes of the tree, with a proper ordering of the vertices. In other words, the sum-product algorithm can give the exact marginal distribution of every node in a tree in O(n = |V|) complexity.

Consider Figure 15. In general, for any vertex y_i the beliefs come from the neighbouring subgraphs (in this case A_2, A_3, A_4) and are passed to another subgraph (in this case A_1).


Figure 15: Transmission of belief from subgraphs.

5.1 Belief-Propagation on factor-graphs

It is easy to generalize the belief-propagation scheme of the previous section to factor graphs. A nice discussion of factor graphs can be found in [4]. The messages from variables to factors are

M_{x_i→f}(x_i) ≜ ∏_{f′∈N(x_i)\f} M_{f′→x_i}(x_i).

And the messages from factors to variables are

M_{f→x_i}(x_i) ≜ Σ_{x\x_i} f(x) ∏_{x′∈N(f)\x_i} M_{x′→f}(x′).

Note that in the above notation, N(f) is the set of all variables adjacent to factor f , and N(x) isthe set of all factors adjacent to variable x. The belief propagation for factor graphs is depicted inFigure 16.

After the propagation of the beliefs is done, the belief of a factor can be calculated by,

b(x) ∝ f(x) ∏_{x′∈N(f)} M_{x′→f}(x′)

The marginal belief of a single variable (the probability distribution over its outcomes) can then be written as,

b(x_i) = Σ_{x\x_i} b(x)
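Spelled out for a single factor over two binary variables, one round of these factor-graph updates looks as follows; the factor table and message values are arbitrary.

```python
import numpy as np

# a single factor f(x1, x2) over two binary variables
f = np.array([[1.0, 2.0],
              [3.0, 1.0]])          # f[x1, x2]
m_x2_to_f = np.array([0.9, 0.1])    # incoming variable-to-factor message from x2

# factor-to-variable: M_{f->x1}(x1) = sum_{x2} f(x1, x2) * M_{x2->f}(x2)
m_f_to_x1 = f @ m_x2_to_f
print(m_f_to_x1)                    # [1.1, 2.8]

# variable-to-factor: x1 multiplies the messages from its *other* factors;
# if f is x1's only factor, that empty product is the all-ones message
m_x1_to_f = np.ones(2)
```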


Figure 16: Message passing on a factor graph.


Figure 17: A grid of size l × l. An Ising model is defined on such a grid.

Usually people initialize M_{f→x_i} = 1 and M_{x_i→f} = 1, though if the graph has a tree structure, the initialization does not have much effect.

Example 11 (BP for the Ising Model). For l ∈ N let the graph G_l = (V_l, E_l) be an l × l grid (Figure 17). The parameters of this Ising model are Θ = {θ_ij, θ_i : (i, j) ∈ E_l, i ∈ V_l}. The joint probability distribution over x ∈ {−1, +1}^{|V_l|} is defined as:

µ(x) = (1/Z_G) exp( Σ_{(i,j)∈E_l} θ_ij x_i x_j + Σ_{i∈V_l} θ_i x_i )

The belief update (up to a constant) is:

ν^{(t+1)}_{i→j}(x_i) ∝ exp(θ_i x_i) ∏_{k∈∂i\j} Σ_{x_k} exp(θ_ik x_i x_k) ν^{(t)}_{k→i}(x_k)

Since each variable is binary, each message takes only two values, so it makes sense to write the update in the form of a log-likelihood ratio:

L^{(t+1)}_{i→j} = (1/2) log( ν^{(t+1)}_{i→j}(+1) / ν^{(t+1)}_{i→j}(−1) )

= (1/2) log[ exp(+θ_i) ∏_{k∈∂i\j} Σ_{x_k} exp(+θ_ik x_k) ν^{(t)}_{k→i}(x_k)  /  ( exp(−θ_i) ∏_{k∈∂i\j} Σ_{x_k} exp(−θ_ik x_k) ν^{(t)}_{k→i}(x_k) ) ]

= θ_i + (1/2) Σ_{k∈∂i\j} log[ Σ_{x_k} exp(+θ_ik x_k) ν^{(t)}_{k→i}(x_k)  /  Σ_{x_k} exp(−θ_ik x_k) ν^{(t)}_{k→i}(x_k) ]

= θ_i + (1/2) Σ_{k∈∂i\j} log[ ( exp(+θ_ik) ν^{(t)}_{k→i}(+1) + exp(−θ_ik) ν^{(t)}_{k→i}(−1) ) / ( exp(−θ_ik) ν^{(t)}_{k→i}(+1) + exp(+θ_ik) ν^{(t)}_{k→i}(−1) ) ]

= θ_i + (1/2) Σ_{k∈∂i\j} log[ ( exp(+θ_ik) exp(2L^{(t)}_{k→i}) + exp(−θ_ik) ) / ( exp(−θ_ik) exp(2L^{(t)}_{k→i}) + exp(+θ_ik) ) ]

= θ_i + (1/2) Σ_{k∈∂i\j} log[ ( exp(θ_ik + L^{(t)}_{k→i}) + exp(−(θ_ik + L^{(t)}_{k→i})) ) / ( exp(−(θ_ik − L^{(t)}_{k→i})) + exp(θ_ik − L^{(t)}_{k→i}) ) ]

To simplify the final result, we use the following facts:

tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}),        (1/2) log z = arctanh( (z − 1) / (z + 1) )

If we define:

z ≜ ( exp(θ_ik + L^{(t)}_{k→i}) + exp(−(θ_ik + L^{(t)}_{k→i})) ) / ( exp(−(θ_ik − L^{(t)}_{k→i})) + exp(θ_ik − L^{(t)}_{k→i}) )

then it can be shown that:

(z − 1)/(z + 1) = [ (exp(+θ_ik) − exp(−θ_ik)) / (exp(+θ_ik) + exp(−θ_ik)) ] · [ (exp(+L^{(t)}_{k→i}) − exp(−L^{(t)}_{k→i})) / (exp(+L^{(t)}_{k→i}) + exp(−L^{(t)}_{k→i})) ] = tanh(θ_ik) tanh(L^{(t)}_{k→i})

which gives the following result:

L^{(t+1)}_{i→j} = θ_i + Σ_{k∈∂i\j} arctanh( tanh(θ_ik) tanh(L^{(t)}_{k→i}) )
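The final update can be implemented in a couple of lines; the sketch below computes the log-ratio message for one directed edge, leaving out the grid bookkeeping (the function and argument names are mine).

```python
import math

def llr_update(theta_i, theta_ik, L_in):
    """L^{(t+1)}_{i->j} = theta_i + sum_k atanh(tanh(theta_ik[k]) * tanh(L_in[k])),
    where k ranges over the neighbors of i other than j."""
    return theta_i + sum(math.atanh(math.tanh(t) * math.tanh(L))
                         for t, L in zip(theta_ik, L_in))

# node potential, couplings to two in-neighbors, and their incoming log-ratios
print(llr_update(0.1, [0.5, -0.3], [0.2, 0.4]))
```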

Here we do some experiments on these updates. Suppose the following is our criterion for convergence:

∆(t) = (1/|E⃗_l|) Σ_{(i,j)∈E⃗_l} | ν^{(t+1)}_{i→j}(+1) − ν^{(t)}_{i→j}(+1) |


Figure 18: Convergence of BP for Ising model. Left: Initialization with random samples from [0, β].Right: Initialization with random samples from [−β, β].

where E⃗_l is the set of directed edges and |E⃗_l| = 2|E_l|. We plot ∆(t = 15) and ∆(t = 25), with uniformly random initialization of the parameters in [0, β], in Figure 18 (left), as a function of β. It can be observed that for medium values of β the convergence is slower.

If we instead initialize the parameters randomly from [−β, β], the convergence is depicted in Figure 18 (right). In this case the convergence for small values of β is relatively fast, although for large β the convergence becomes very slow.

Lemma 3. The (parallel) sum-product algorithm converges in at most diameter-of-the-graph iterations. (Note: the diameter of the graph is the length of the longest path in the graph.)

Proof. The goal is to prove inductively that the (parallel) sum-product algorithm will not iterate more than D times, where D is the diameter of the graph (defined here as the length of the longest path in the graph).

The proof is by induction on the diameter, with a few additional steps at the end. Basically, given a tree T, we construct another tree T′ whose diameter is one less than that of T. To construct T′, we strip the leaves of the tree T and replace the unary potentials with modified potentials:

ψ′_i(x_i) = ψ_i(x_i) ∏_{k∈∂i∩L} Σ_{x_k} ψ_ik(x_i, x_k) ν^{(0)}_{k→i}(x_k)     (8)

and the rest of the potential functions remain the same (Figure 19). The goal is to prove that, if we run two separate BPs, one on T and one on T′, with initialization ν′^{(0)}_{i→j}(x_i) = ν^{(1)}_{i→j}(x_i), then:

ν′^{(t)}_{i→j}(x_i) = ν^{(t+1)}_{i→j}(x_i),   for all (i, j) ∈ E′, ∀t ≥ 0     (9)

We know that the sum-product update is

ν^{(t+1)}_{i→j}(x_i) = ψ_i(x_i) ∏_{k∈∂i\j} Σ_{x_k} ψ_ik(x_i, x_k) ν^{(t)}_{k→i}(x_k)


The most important thing to observe is that, for any outgoing message from a leaf i in T′, ν′^{(t)}_{i→j}(x_i) will be the same as the corresponding message on the same edge in T, ν^{(t+1)}_{i→j}(x_i), given that all of the incoming messages are the same. This can be proved by induction on time.

At t = 0, based on the initialization, the outgoing message on edge (i, j) (Figure 19) is:

ν^{(1)}_{i→j}(x_i) = ψ_i(x_i) ∏_{k∈∂i\j} Σ_{x_k} ψ_ik(x_i, x_k) ν^{(0)}_{k→i}(x_k)

and the corresponding message ν′^{(1)}_{i→j}(x_i) in T′ is the same as its unary potential, which, based on the definition of the modified unary potential in Equation 8, is the same as ν^{(0)}_{i→j}(x_i). Note that the equality between ν′^{(1)}_{i→j}(x_i) and ν^{(0)}_{i→j}(x_i) for non-leaf vertices trivially holds, as their potentials are not changed.

Now suppose (9) holds for t and we want to prove it for t + 1. By a similar argument to the previous case, the outgoing message from i ∈ L′ is the same as its counterpart message in T: ν′^{(t)}_{i→j}(x_i) = ν^{(t+1)}_{i→j}(x_i). Also note that, based on the assumption of the base case for t = 1, the equality (9) holds for the non-leaf messages.

Another step of the argument is to show that the diameter of T′ is at most D − 1 (where D is the diameter of T). The point to notice is that the longest path in a tree necessarily ends at leaf nodes (proof by contradiction: suppose there is a non-leaf node at one end of the longest path; then the path can be extended to one of the leaves connected to this node).

Based on the construction, at each step we remove the leaves of the tree, and therefore the diameter of the new tree is (at least) one less than the diameter of the original tree.

Since each time we construct a modified tree the diameter decreases by at least one, this construction can be repeated at most D − 1 times. To state it more accurately, if sum-product on T converges in d < D iterations, sum-product on T′ converges in d − 1 < D − 1 steps. The proof is simply by observing that the time indices of the messages in T′ are one ahead of the corresponding messages in T (which can also be written as an induction).

Also, note that adding the leaves back to T′ adds at most one more iteration to the BP updates (at the beginning; equivalent to modifying the leaf potentials) until convergence to a fixed point (if it ever converges). Therefore BP converges in at most d < D iterations.

Example 12 (Forward-Backward Algorithm for HMMs). It can be shown that the forward-backward algorithm is a special case of the sum-product algorithm.

Remark 4. The forward-backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely sequence of states.


Figure 19: Stripping leaves of a given tree.


Figure 20: An undirected graphical model and its corresponding factor graph.

Example 13. An undirected graphical model is shown in Figure 20 (left), along with its factor graph representation. Running the BP updates on the undirected graph is not guaranteed to give a correct answer since the graph contains a loop. However, in the factor graph representation the loop has been collapsed into one factor, so there is no loop in the factor graph and the BP updates are guaranteed to converge to the correct answer.

Here we write the message updates in the factor graph. Messages from factor a:

ν_{a→1}(x_1) = Σ_{x_2,x_3} ψ_{1,2,3}(x_1, x_2, x_3) ν_{2→a}(x_2) ν_{3→a}(x_3)
ν_{a→2}(x_2) = Σ_{x_1,x_3} ψ_{1,2,3}(x_1, x_2, x_3) ν_{1→a}(x_1) ν_{3→a}(x_3)
ν_{a→3}(x_3) = Σ_{x_1,x_2} ψ_{1,2,3}(x_1, x_2, x_3) ν_{1→a}(x_1) ν_{2→a}(x_2)

Messages from factor b:

ν_{b→2}(x_2) = Σ_{x_4} ψ_{2,4}(x_2, x_4) ν_{4→b}(x_4)
ν_{b→4}(x_4) = Σ_{x_2} ψ_{2,4}(x_2, x_4) ν_{2→b}(x_2)

Messages from factor c:

ν_{c→3}(x_3) = Σ_{x_5} ψ_{3,5}(x_3, x_5) ν_{5→c}(x_5)
ν_{c→5}(x_5) = Σ_{x_3} ψ_{3,5}(x_3, x_5) ν_{3→c}(x_3)


Figure 21: The undirected graph in Figure 20 after grouping the variables x1, x2, x3, and its corre-sponding factor graph.

Messages from variables:

ν_{2→b}(x_2) = ν_{a→2}(x_2)
ν_{2→a}(x_2) = ν_{b→2}(x_2)
ν_{3→c}(x_3) = ν_{a→3}(x_3)
ν_{3→a}(x_3) = ν_{c→3}(x_3)
ν_{1→a}(x_1) = 1/|X|
ν_{4→b}(x_4) = 1/|X|
ν_{5→c}(x_5) = 1/|X|

By grouping the variables x_1, x_2, x_3 into one variable x_6 we get the tree graphical model in Figure 21, on which the BP updates are guaranteed to converge to the correct fixed point. The new potential functions are:

ψ_6(x_1, x_2, x_3) = ψ_{1,2,3}(x_1, x_2, x_3)
ψ_{65}(x_1, x_2, x_3, x_5) = ψ_{3,5}(x_3, x_5)
ψ_{64}(x_1, x_2, x_3, x_4) = ψ_{2,4}(x_2, x_4)

Messages from factor A:

ν_{A→6}(x_1, x_2, x_3) = ψ_6(x_1, x_2, x_3)

Messages from factor B:

ν_{B→6}(x_1, x_2, x_3) = Σ_{x_4} ψ_{64}(x_1, x_2, x_3, x_4) ν_{4→B}(x_4)
ν_{B→4}(x_4) = Σ_{x_1,x_2,x_3} ψ_{64}(x_1, x_2, x_3, x_4) ν_{6→B}(x_1, x_2, x_3)

Messages from factor C:

ν_{C→6}(x_1, x_2, x_3) = Σ_{x_5} ψ_{65}(x_1, x_2, x_3, x_5) ν_{5→C}(x_5)
ν_{C→5}(x_5) = Σ_{x_1,x_2,x_3} ψ_{65}(x_1, x_2, x_3, x_5) ν_{6→C}(x_1, x_2, x_3)


Figure 22: A second order HMM.

Messages from variables:

ν_{5→C}(x_5) = 1/|X|
ν_{4→B}(x_4) = 1/|X|
ν_{6→A}(x_1, x_2, x_3) = ν_{B→6}(x_1, x_2, x_3) ν_{C→6}(x_1, x_2, x_3)
ν_{6→B}(x_1, x_2, x_3) = ν_{A→6}(x_1, x_2, x_3) ν_{C→6}(x_1, x_2, x_3)
ν_{6→C}(x_1, x_2, x_3) = ν_{A→6}(x_1, x_2, x_3) ν_{B→6}(x_1, x_2, x_3)

We could continue the collapsing further and merge all of the variables into a single variable. In that case BP does not do anything: if we are interested in any marginal, it has to be calculated by looping through all possible configurations, which is exponential. This shows that although one can gain accuracy by smart grouping of variables, it increases the computational burden; in the extreme case where everything is grouped into one variable, exponential computation is needed to calculate the marginals.

Example 14. Consider a random process that transitions over a finite set of states {s_1, . . . , s_k}. At each time step i ∈ [N] = {1, . . . , N}, X_i represents the state of the system. Here is the full generative description:

• Sample the initial state X_1 from an initial distribution p_1.

• Then, repeatedly:

– Sample a duration d from a duration distribution p_D over the integer set [M].

– Remain in the current state for the next d time steps.

– Sample a successor state from a transition distribution p_T(s′|s) over states s′ ≠ s.

With this description, the probability of observing the state sequence s_1, s_1, s_1, s_2, s_3, s_3 is:

p1(s1)pD(3)pT (s2|s1)pD(1)pT (s3|s2)pD(2)

The observations are a set of vectors y_i sampled from a distribution p(Y_i = y_i | X_i = s).

We assume M = 2 and p_D(1) = 0.6, p_D(2) = 0.4, and that each X_i takes values in {a, b}.

1. The dependence described in the model can be represented as Fig. 22.

2. We define the augmented state representation as (x, t), where x represents the state of the system and t the time elapsed in that state. For example, for our previous example, the



Figure 23: Smart state updates.

augmented representation is (s_1, 1), (s_1, 2), (s_1, 3), (s_2, 1), (s_3, 1), (s_3, 2). We define an HMM on the augmented state representation:

p_{X_{i+1},T_{i+1}|X_i,T_i}(x_{i+1}, t_{i+1} | x_i, t_i) =
    φ(x_i, x_{i+1}, t_i)   if t_{i+1} = 1 and x_{i+1} ≠ x_i,
    ξ(x_i, t_i)            if t_{i+1} = t_i + 1 and x_{i+1} = x_i,

where

ξ(x_i, t_i) = P(D ≥ t_i + 1 | D ≥ t_i) = Σ_{d≥t_i+1} p_D(d) / Σ_{d≥t_i} p_D(d)

φ(x_i, x_{i+1}, t_i) = P(D = t_i | D ≥ t_i) p_T(x_{i+1}|x_i) = [ p_D(t_i) / Σ_{d≥t_i} p_D(d) ] · p_T(x_{i+1}|x_i)

(ξ and φ are computed numerically in the sketch at the end of this example.)

The observation distribution is:

p_{Y_i|X_i,T_i}(y_i | x_i, t_i) = p(y_i | x_i)

3. Suppose S = {s_1, . . . , s_k}. Here are the naive forward-backward updates:

ν_{i→(i+1)}(x_i, t_i) ∝ Σ_{x_{i−1}∈S, t_{i−1}∈[M]} p_{X_i,T_i|X_{i−1},T_{i−1}}(x_i, t_i | x_{i−1}, t_{i−1}) p_{Y_{i−1}|X_{i−1},T_{i−1}}(y_{i−1} | x_{i−1}, t_{i−1}) ν_{(i−1)→i}(x_{i−1}, t_{i−1})

ν_{(i+1)→i}(x_i, t_i) ∝ Σ_{x_{i+1}∈S, t_{i+1}∈[M]} p_{X_{i+1},T_{i+1}|X_i,T_i}(x_{i+1}, t_{i+1} | x_i, t_i) p_{Y_{i+1}|X_{i+1},T_{i+1}}(y_{i+1} | x_{i+1}, t_{i+1}) ν_{(i+2)→(i+1)}(x_{i+1}, t_{i+1})

The complexity of these updates is O(N(kM)²). However, the structure of the problem can be exploited to make the computation faster. In other words, rather than repeating the above updates for every possible (x_i, t_i), we do the updates in a smarter way. First we fix t_i = 1 and update all states (just like in a traditional HMM); this is shown in Figure 23 (left; yellow column) and has complexity O(Nk²). Then, for a fixed state x_i, we fill in the entries corresponding to the different durations (x_i, 1 ≤ t ≤ M) (Figure 23, right; green row). Since in the original model the durations do not depend on the state, we can reuse this result for every state (copying the green row to the other rows). This adds O(NkM) complexity, so in total the complexity is O(N(k² + kM)).
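For concreteness, here is a small sketch computing ξ and φ from p_D for the M = 2 example above; the transition table p_T is an arbitrary placeholder (with only two states the successor is forced).

```python
def tail(pD, t):
    """P(D >= t) for a duration distribution given as {d: prob}."""
    return sum(p for d, p in pD.items() if d >= t)

def xi(pD, t):
    """P(D >= t+1 | D >= t): probability of staying in the current state."""
    return tail(pD, t + 1) / tail(pD, t)

def phi(pD, pT, x, x_next, t):
    """P(D = t | D >= t) * pT(x_next | x): probability of switching after t steps."""
    return pD.get(t, 0.0) / tail(pD, t) * pT[x][x_next]

pD = {1: 0.6, 2: 0.4}
pT = {'a': {'b': 1.0}, 'b': {'a': 1.0}}
print(xi(pD, 1), phi(pD, pT, 'a', 'b', 1))  # 0.4 and 0.6
```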

6 Max-product algorithm

Before anything, we motivate the need for finding maximum-probability assignments of a set of random variables.

Example 15 (Inference in graphical models). Consider a general undirected graphical model with a set of cliques C,

p(x) = (1/Z) ∏_{C∈C} ψ(x_C)


Figure 24: Example graphs for matching.

with a set of observed variables x_E. We want to maximize the posterior probability of the variables given some observations O,

x̂ = arg max_x p(x | x_E = O)

The max-product algorithm is very similar to the sum-product algorithm we discussed. In the max-product algorithm we assume that we have a joint distribution over a group of variables, and we want to find its mode with respect to several variables. Say we have the same factorization as in Equation 6. It is easy to observe that maximization distributes over factors, just like summation in the sum-product algorithm. Thus, if in the sum-product algorithm we substitute the summation with maximization, we end up with the desired algorithm,

M_{v→u}(x_u) = max_{x′_v∈X_v} [ ψ(x′_v) ψ(x_u, x′_v) ∏_{a∈N(v)\u} M_{a→v}(x′_v) ].

And the max-marginal can be found using the following,

p(x_u) ∝ ψ_u(x_u) ∏_{v∈N(u)} M_{v→u}(x_u)
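The same recursive tree-message sketch used earlier for sum-product works here with the sum replaced by a max; everything below mirrors that earlier sketch and is equally illustrative.

```python
import numpy as np

def max_message(v, u, nbrs, psi, psi_e):
    """M_{v->u}(x_u) = max_{x_v} psi_v(x_v) psi_{uv}(x_u, x_v) prod_{a in N(v)\\u} M_{a->v}(x_v)."""
    prod = psi[v].copy()
    for a in nbrs[v]:
        if a != u:
            prod *= max_message(a, v, nbrs, psi, psi_e)
    pairwise = psi_e[(u, v)] if (u, v) in psi_e else psi_e[(v, u)].T  # pairwise[x_u, x_v]
    return (pairwise * prod[None, :]).max(axis=1)  # max over x_v for each x_u

# same toy chain 0 - 1 - 2 as before; max-marginal of node 1
nbrs = {0: [1], 1: [0, 2], 2: [1]}
psi = {v: np.ones(2) for v in nbrs}
psi_e = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
         (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
print(psi[1] * max_message(0, 1, nbrs, psi, psi_e) * max_message(2, 1, nbrs, psi, psi_e))
```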

Corollary 3. Since any tree is an acyclic graph, the max-marginals can be calculated in one sweep over the nodes of the tree. In other words, the max-product algorithm can give the exact max-marginal of every node in a tree in O(n = |V|) complexity.

Example 16. We want to solve the Maximum Weight Matching (MWM) problem in a bipartite graph. Given a bipartite graph G(X, Y, E), the variables X and Y are the vertices on the left and right side, respectively, with |X| = |Y| = n. A variable in X can only be connected to variables in Y. For every edge e_ij ∈ E there is an associated weight w_ij. A perfect matching is an assignment in which each variable in X is matched to exactly one variable in Y; it can be captured by a permutation π = (π(1), . . . , π(n)) of the nodes, and the total weight of the matching is W_π = Σ_i w_{i,π(i)}. The problem of maximum weight matching is to find arg max_π W_π. We want to solve this by introducing a graphical model with probability proportional to the weight of the matching,

µ(π) ∝ e^{W_π} I(π is a perfect matching)

1. One way to realize the distribution is the following form:

µ(x, y) = ∏_{(i,j)∈E} ψ_{i,j}(x_i, y_j) ∏_i e^{w_{i,x_i}} ∏_j e^{w_{y_j,j}}


where x, y ∈ {1, . . . , n}^n encode the assignments of the nodes and

ψ_{i,j}(x_i, y_j) =
    0   if x_i = j and y_j ≠ i,
    0   if y_j = i and x_i ≠ j,
    1   otherwise.

To see the correctness, note that having a perfect matching is equivalent to having x_i = j ⇔ y_j = i for all i, j, since in this case each node on the X side is connected to exactly one node on the Y side. It is also easy to observe that, when we have a valid assignment (i.e. a perfect matching), the value of the unnormalized probability is exp( Σ_i w_{i,x_i} + Σ_j w_{y_j,j} ) = exp(2W_π).

2. We have shown that µ(x, y) is nonzero iff the assignment is a valid perfect matching. To find the maximum weight matching it is enough to find (x*, y*) = arg max_{x,y} µ(x, y). This essentially gives the correct answer since, over valid assignments, µ(x, y) is increasing in W_π; moreover such an assignment is necessarily a perfect matching (by the definition of µ(x, y)).

3. We want to derive an algorithm based on max-product for finding the maximum of µ(x, y). Denote the variables on the left side by l_i and the variables on the right side by r_j. A message from right to left is denoted by ν^{(t)}_{r_j→l_i}(y_j) and a message from left to right by ν^{(t)}_{l_i→r_j}(x_i). We initialize the messages as:

ν^{(0)}_{r_j→l_i}(y_j) = e^{w_{y_j,j}},    ν^{(0)}_{l_i→r_j}(x_i) = e^{w_{i,x_i}}

The update rules for the messages are:

ν^{(t+1)}_{r_j→l_i}(y_j) = ∏_{k∈[n]\i} max_{x_k} ψ_{k,j}(x_k, y_j) ν^{(t)}_{l_k→r_j}(x_k)

ν^{(t+1)}_{l_i→r_j}(x_i) = ∏_{k∈[n]\j} max_{y_k} ψ_{i,k}(x_i, y_k) ν^{(t)}_{r_k→l_i}(y_k)

To find the max-marginals we can compute

µ_i(x_i) = ∏_{k∈[n]} max_{y_k} ψ_{i,k}(x_i, y_k) ν^{(t_max)}_{r_k→l_i}(y_k)

µ_j(y_j) = ∏_{k∈[n]} max_{x_k} ψ_{k,j}(x_k, y_j) ν^{(t_max)}_{l_k→r_j}(x_k)

The optimal configuration can then be found by back-tracking the maximizers (back-pointers).

7 Belief Propagation for graphs with loops

In general, a graph with cycles might contain cliques, i.e. groups of vertices that are all connected to each other. In this case, unlike sampling, running the belief updates is not guaranteed to converge to the optimal answer; and unlike mean-field approaches, it is not even guaranteed to converge to a value (even a suboptimal one). In practice, however, it can give good results, though this depends largely on the structure of the graph.

Loopy BP tends to work better when the loops in the graph are long and few. It is known that if the graph has only one loop, the answers will be exact (reference?).

Example 17 (An example of loopy BP). Consider the loopy binary graphical model in Figure 25. We want to perform BP step by step in this graph. Since the whole graph is a loop, there is no specific ordering, and since it might take many iterations until convergence, we just perform several initial steps. We initialize the beliefs with

M_{f→x_i}(x_i) = [0.5, 0.5]^⊤

M_{x_i→f}(x_i) = [0.5, 0.5]^⊤

(a note on the notation: since the variables are binary, a function of one variable takes only two possible values, which we represent as a vector).

1. From x_1 to x_4:

M_{x_1→f_{4,1}}(x_1) = M_{f_{1,2}→x_1}(x_1) = [0.5, 0.5]^⊤

M_{f_{4,1}→x_4}(x_4) = Σ_{x_1} f_{4,1}(x_4, x_1) M_{x_1→f_{4,1}}(x_1) = [4/12, 2/12]^⊤

2. From x_4 to x_1:

M_{x_4→f_{4,1}}(x_4) = M_{f_{3,4}→x_4}(x_4) = [0.5, 0.5]^⊤

M_{f_{4,1}→x_1}(x_1) = Σ_{x_4} f_{4,1}(x_4, x_1) M_{x_4→f_{4,1}}(x_4) = [4/12, 2/12]^⊤

3. From x_1 to x_2:

M_{x_1→f_{1,2}}(x_1) = M_{f_{4,1}→x_1}(x_1) = [4/12, 2/12]^⊤

M_{f_{1,2}→x_2}(x_2) = Σ_{x_1} f_{1,2}(x_1, x_2) M_{x_1→f_{1,2}}(x_1) = [1/22, 5/33]^⊤

We don’t continue further.

The factors of the model (a single loop x_1 - x_2 - x_3 - x_4 - x_1) are given by the following tables:

f_{1,2}(x_1, x_2):            x_2 = 0    x_2 = 1
                  x_1 = 0      1/11       1/11
                  x_1 = 1      1/11       8/11

f_{2,3}(x_2, x_3):            x_3 = 0    x_3 = 1
                  x_2 = 0      1/19       9/19
                  x_2 = 1      8/19       1/19

f_{4,1}(x_4, x_1):            x_1 = 0    x_1 = 1
                  x_4 = 0      7/12       1/12
                  x_4 = 1      1/12       3/12

f_{3,4}(x_3, x_4):            x_4 = 0    x_4 = 1
                  x_3 = 0      6/12       1/12
                  x_3 = 1      2/12       3/12

Figure 25: A loopy graphical model in factor-graph notation.
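The three message computations above can be reproduced directly from the factor tables of Figure 25; a short NumPy check (array names are mine):

```python
import numpy as np

f12 = np.array([[1, 1], [1, 8]]) / 11.0   # f_{1,2}[x1, x2]
f41 = np.array([[7, 1], [1, 3]]) / 12.0   # f_{4,1}[x4, x1]
uniform = np.array([0.5, 0.5])

# step 1: M_{f41 -> x4}(x4) = sum_{x1} f41[x4, x1] * M_{x1 -> f41}(x1)
m_f41_x4 = f41 @ uniform                  # [1/3, 1/6] = [4/12, 2/12]
# step 2: M_{f41 -> x1}(x1) = sum_{x4} f41[x4, x1] * M_{x4 -> f41}(x4)
m_f41_x1 = f41.T @ uniform                # [1/3, 1/6] = [4/12, 2/12]
# step 3: M_{f12 -> x2}(x2) = sum_{x1} f12[x1, x2] * M_{x1 -> f12}(x1)
m_f12_x2 = f12.T @ m_f41_x1               # [1/22, 5/33]

print(m_f41_x4, m_f41_x1, m_f12_x2)
```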



8 Gaussian Graphical Models

First, a brief review of the properties of the Gaussian distribution.

Two representations of the Gaussian distribution:

1. Covariance form: if x ∼ N(m, Λ), then

µ(x) = 1 / ((2π)^{n/2} |Λ|^{1/2}) · exp( −(1/2)(x − m)^⊤ Λ^{−1} (x − m) )

2. Information form: if x ∼ N^{−1}(h, J), then

µ(x) ∝ exp( −(1/2) x^⊤ J x + h^⊤ x )

where h is the potential vector and J is the precision/information matrix.

A few important points:

• Equivalence between the two representations:

J = Λ^{−1},    h = Λ^{−1} m = J m

• Marginalization: given the joint distribution of (x_1, x_2) in either form, we want to find the distribution of x_1.

x = [x_1; x_2] ∼ N( [m_1; m_2], [[Λ_11, Λ_12], [Λ_21, Λ_22]] ) = N^{−1}( [h_1; h_2], [[J_11, J_12], [J_21, J_22]] )

If the marginal is x_1 ∼ N(m̂, Λ̂) = N^{−1}(ĥ, Ĵ), then

– m̂ = m_1 and Λ̂ = Λ_11

– Ĵ = Λ_11^{−1} = J_11 − J_12 J_22^{−1} J_21 and ĥ = Ĵ m_1 = h_1 − J_12 J_22^{−1} h_2

• Conditioning: the goal is to find

x_1 | x_2 ∼ N(m′, Λ′) = N^{−1}(h′, J′)

– h′ = h_1 − J_12 x_2 and J′ = J_11

– m′ = m_1 + Λ_12 Λ_22^{−1} (x_2 − m_2) and Λ′ = Λ_11 − Λ_12 Λ_22^{−1} Λ_21

(The marginalization identities are checked numerically in the sketch after this list.)

• Independence in the covariance form: if x ∼ N(m, Λ), then x_i and x_j are independent if and only if Λ_ij = 0.

• Pairwise independence in the information form: if x ∼ N^{−1}(h, J), we have the pairwise independence x_i − x_{V\{i,j}} − x_j if and only if J_ij = 0.
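A quick numerical check of the marginalization formulas above (random positive definite J; the block sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 2, 3
M = rng.normal(size=(n1 + n2, n1 + n2))
J = M @ M.T + (n1 + n2) * np.eye(n1 + n2)   # positive definite information matrix
h = rng.normal(size=n1 + n2)
Lam = np.linalg.inv(J)
m = Lam @ h

J11, J12, J21, J22 = J[:n1, :n1], J[:n1, n1:], J[n1:, :n1], J[n1:, n1:]
h1, h2 = h[:n1], h[n1:]

# marginal of x1 in information form
J_marg = J11 - J12 @ np.linalg.inv(J22) @ J21
h_marg = h1 - J12 @ np.linalg.inv(J22) @ h2
print(np.allclose(np.linalg.inv(J_marg), Lam[:n1, :n1]))   # J_marg = Lam11^{-1}
print(np.allclose(np.linalg.solve(J_marg, h_marg), m[:n1])) # its mean is m1
```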


Figure 26: A sample graphical model.

If the messages are in the information form,

ν_{k→i}(x_k) = N^{−1}(h_{k→i}, J_{k→i}),

then the Gaussian belief updates are:

h_{i→j} = h_i − Σ_{k∈∂i\j} J_{ik} J_{k→i}^{−1} h_{k→i}

J_{i→j} = J_{ii} − Σ_{k∈∂i\j} J_{ik} J_{k→i}^{−1} J_{ki}

And the marginals can be computed as x_i ∼ N^{−1}(ĥ_i, Ĵ_i), where

ĥ_i = h_i − Σ_{k∈∂i} J_{ik} J_{k→i}^{−1} h_{k→i}

Ĵ_i = J_{ii} − Σ_{k∈∂i} J_{ik} J_{k→i}^{−1} J_{ki}

The complexity of these updates is O(n·d³). Compare this to the complexity of naive marginalization, which is O((n·d)³).
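A sketch of one such update for scalar variables, with the information matrix and messages stored in plain dictionaries; the chain, its numbers, and the function name are illustrative.

```python
def gabp_update(i, j, J, h, msgs, nbrs):
    """Compute the Gaussian BP message (h_{i->j}, J_{i->j}) for scalar variables.
    msgs[(k, i)] = (h_{k->i}, J_{k->i}) are the current incoming messages."""
    h_out = h[i]
    J_out = J[i][i]
    for k in nbrs[i]:
        if k == j:
            continue
        h_ki, J_ki = msgs[(k, i)]
        h_out -= J[i][k] / J_ki * h_ki
        J_out -= J[i][k] / J_ki * J[k][i]
    return h_out, J_out

# tiny chain 0 - 1 - 2: message 1 -> 2 built from the leaf message 0 -> 1
J = {0: {0: 2.0, 1: -0.5}, 1: {0: -0.5, 1: 2.0, 2: -0.5}, 2: {1: -0.5, 2: 2.0}}
h = {0: 1.0, 1: 0.0, 2: 0.5}
nbrs = {0: [1], 1: [0, 2], 2: [1]}
msgs = {(0, 1): (h[0], J[0][0])}   # a leaf just forwards its own parameters
print(gabp_update(1, 2, J, h, msgs, nbrs))   # (0.25, 1.875)
```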

Example 18. Let x ∼ N^{−1}(h_x, J_x) and y = Cx + v, where v ∼ N(0, R).

a. We want to find the potential vector and the information matrix of the joint distribution p(x, y).

p(x, y) = p(y|x) p(x) ∝ exp( −(1/2)(y − Cx)^⊤ R^{−1} (y − Cx) ) · exp( −(1/2) x^⊤ J_x x + h_x^⊤ x )

= exp( −(1/2) y^⊤ R^{−1} y + (1/2) x^⊤ C^⊤ R^{−1} y + (1/2) y^⊤ R^{−1} C x − (1/2) x^⊤ (J_x + C^⊤ R^{−1} C) x + h_x^⊤ x )

Then, ordering the variables as (x, y),

h_{x,y} = [h_x; 0],    J_{x,y} = [[ J_x + C^⊤ R^{−1} C , −C^⊤ R^{−1} ], [ −R^{−1} C , R^{−1} ]]

b. Given the joint distribution we can find the conditional x|y:

h_{x|y} = h_x + C^⊤ R^{−1} y

J_{x|y} = J_x + C^⊤ R^{−1} C     (10)

c. Suppose we have R = I and

y_1 = x_1 + v_1
y_2 = x_3 + v_2

The corresponding graphical model is shown in Figure 26. The matrix C can be written as:

C = [[1, 0, 0, 0], [0, 0, 1, 0]]

Based on Equation 10, plugging in this definition of C gives:

h_{x_3|y} = h_{x_3} + y_2

J_{x_3|y} = J_{x_3} + 1

d. Now assume an additional measurement y_3 = x_3 + v_3, where v_3 is a zero-mean, unit-variance Gaussian variable independent of all other variables. Similar to the previous part, we replace C in Equation 10 with

C = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]

which gives the following updates:

h_{x_3|y} = h_{x_3} + y_2 + y_3

J_{x_3|y} = J_{x_3} + 2

e. Naturally, as more observations are received, the (message) distributions are expected to concentrate around the empirical mean with decreasing variance. We can verify this by checking the parameters of the message 3 → 2 for one observation y_2 versus two observations y_2 and y_3.

One observation:

N( (h_{x_3} + y_2) / (J_{x_3,x_3} + 1),  1 / (J_{x_3,x_3} + 1) )

Two observations:

N( (h_{x_3} + y_2 + y_3) / (J_{x_3,x_3} + 2),  1 / (J_{x_3,x_3} + 2) )

In the case of n observations, we expect the mean to get closer to (1/n) ∑_i y_i.
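The computations in parts (c) and (d) are easy to check numerically with the conditioning formulas of Equation 10. The snippet below is only an illustration (the numerical values of h_x, J_x and of the observations are made up, and the helper name condition_on_y is my own):

import numpy as np

def condition_on_y(h_x, J_x, C, R, y):
    # Equation 10: h_{x|y} = h_x + C^T R^{-1} y,  J_{x|y} = J_x + C^T R^{-1} C.
    Rinv = np.linalg.inv(R)
    return h_x + C.T @ Rinv @ y, J_x + C.T @ Rinv @ C

# An arbitrary prior over x = (x_1, ..., x_4) in information form.
J_x = np.diag([2.0, 2.0, 2.0, 2.0])
h_x = np.zeros(4)

# Part (c): y_1 = x_1 + v_1 and y_2 = x_3 + v_2 with R = I.
C_c = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
y_c = np.array([0.4, 1.3])
h_c, J_c = condition_on_y(h_x, J_x, C_c, np.eye(2), y_c)
print(h_c[2], J_c[2, 2])   # h_{x3} + y_2 and J_{x3,x3} + 1

# Part (d): an additional observation y_3 = x_3 + v_3 of the same variable.
C_d = np.vstack([C_c, [0.0, 0.0, 1.0, 0.0]])
y_d = np.array([0.4, 1.3, 1.1])
h_d, J_d = condition_on_y(h_x, J_x, C_d, np.eye(3), y_d)
print(h_d[2], J_d[2, 2])   # h_{x3} + y_2 + y_3 and J_{x3,x3} + 2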

9 Submodularity and tractable inference

We want to investigate the connection between the min-cut problem and pairwise graphical models. Consider a graphical model defined on G(V, E):

µ(x) ∝ exp( − ∑_{i∈V} φ_i(x_i) − ∑_{(i,j)∈E} φ_{ij}(x_i, x_j) )

for x ∈ {0, 1}^n. Also, we assume that φ_{ij}(0, 0) = φ_{ij}(1, 1) = 0 for all (i, j) ∈ E:

φ_i(·) = [ φ_i(0)
           φ_i(1) ],    φ_{ij}(·, ·) = [ 0             φ_{ij}(0, 1)
                                         φ_{ij}(1, 0)  0            ]

The goal is to find the assignment that maximizes the above distribution (equivalently, minimizes the energy E(x) = ∑_{i∈V} φ_i(x_i) + ∑_{(i,j)∈E} φ_{ij}(x_i, x_j)), which, as we will show, can be cast as a min-cut problem. To this end, given a pairwise MRF on G(V, E) with compatibility functions φ_{ij}(·, ·), we create a directed, weighted graph G̃(Ṽ, Ẽ) as follows:


1. For every node in V , add one node to Ṽ.

2. Add a source node s and a sink node t to Ṽ.

3. For every undirected edge (i, j) ∈ E, add the two directed edges (i, j) and (j, i) to Ẽ, with weights φ_{ij}(1, 0) and φ_{ij}(0, 1), respectively.

4. Add a directed edge from the source s to every node i (except the sink), with weight φ_i(0), to Ẽ.

5. Add a directed edge from every node i (except the source) to the sink t, with weight φ_i(1), to Ẽ.

An s–t cut in G̃ then corresponds to an assignment: x_i = 1 if node i stays on the source side and x_i = 0 if it stays on the sink side, and the weight of the cut equals the energy of that assignment. Hence the MAP assignment is obtained from a minimum cut (for instance via max-flow), once the following details are taken care of.

1. If any of the potentials is negative, we add a positive value to the corresponding edges so that they are no longer negative; the minimizer of the cut (and therefore of the energy function) is not changed. For example, if φ_1(0) < 0, we add a big enough α > 0 to the two edges incident to node 1: the edge out of the source, with weight φ_1(0), and the edge into the sink, with weight φ_1(1). Since every s–t cut contains exactly one of these two edges, this corresponds to adding a constant α to the energy function:

E′(x) = (α + φ_1(x_1)) + ∑_{i∈V\{1}} φ_i(x_i) + ∑_{(i,j)∈E} φ_{ij}(x_i, x_j)

      = α + ∑_{i∈V} φ_i(x_i) + ∑_{(i,j)∈E} φ_{ij}(x_i, x_j)

2. If the diagonal elements of φ_{ij} (i.e. φ_{ij}(0, 0) and φ_{ij}(1, 1)) are not zero, we can still add them into the graph and get the correct result from the min-cut algorithm. For any (i, j) = (a, b) ∈ E such that φ_{ab}(0, 0) or φ_{ab}(1, 1) is nonzero:

• We add the extra weight φ_{ab}(0, 0) to the edge (s, a) ∈ Ẽ. This ensures that when the cut contains both (s, a) and (s, b) (i.e. both variables are assigned zero), it also takes φ_{ab}(0, 0) into account.

• We add the extra weight φ_{ab}(1, 1) to the edge (a, t) ∈ Ẽ. Similarly to the previous part, if the cut contains both (a, t) and (b, t) (both assigned 1), it takes the cost φ_{ab}(1, 1) into account.

• What if the cut separates a and b, i.e. they receive different labels? We subtract φ_{ab}(1, 1) + φ_{ab}(0, 0) from the edge (a, b) ∈ Ẽ to remove the extra costs.

3. In order to solve the problem with a max-flow algorithm, we need a graph with nonnegative weights. The only edges that might have negative weights are the pairwise edges, whose weights (after the previous step) are

(a, b):  φ_{ab}(1, 0) − φ_{ab}(0, 0) − φ_{ab}(1, 1)

(b, a):  φ_{ab}(0, 1)    (11)

If any of these weights is negative, we can use the previous trick of adding a positive constant (point 1), but with a slight difference:

E′(x) = α + ∑_{i∈V} φ_i(x_i) + ∑_{(i,j)∈E} φ_{ij}(x_i, x_j)

      = (α + φ_{ab}(x_a, x_b)) + ∑_{i∈V} φ_i(x_i) + ∑_{(i,j)∈E\{(a,b)}} φ_{ij}(x_i, x_j)


Basically, the extra constant α is added to each of φ_{ab}(1, 0), φ_{ab}(0, 0), φ_{ab}(1, 1) and φ_{ab}(0, 1) in Equation 11, which gives the following modified weights:

(a, b):  φ_{ab}(1, 0) − φ_{ab}(0, 0) − φ_{ab}(1, 1) − α

(b, a):  φ_{ab}(0, 1) + α

Requiring both weights to be nonnegative, we get

−φ_{ab}(0, 1) ≤ α ≤ φ_{ab}(1, 0) − φ_{ab}(0, 0) − φ_{ab}(1, 1)

and such an α exists if and only if

φ_{ab}(0, 0) + φ_{ab}(1, 1) ≤ φ_{ab}(1, 0) + φ_{ab}(0, 1)

which is known as the submodularity property.
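To make the reduction concrete, here is a small sketch (not from the notes) that builds the directed graph described above for the simple case of nonnegative potentials with zero diagonals, and reads the MAP labeling off a minimum s–t cut. It assumes the networkx library is available; the toy potentials at the bottom are made up for illustration.

import networkx as nx

def map_by_mincut(phi_node, phi_edge):
    # MAP of mu(x) ∝ exp(- sum_i phi_i(x_i) - sum_{(i,j)} phi_ij(x_i, x_j)) via min-cut.
    # phi_node[i] = (phi_i(0), phi_i(1)); phi_edge[(i, j)] = (phi_ij(0, 1), phi_ij(1, 0)).
    # Assumes nonnegative potentials and phi_ij(0, 0) = phi_ij(1, 1) = 0 (the zero-diagonal case).
    G = nx.DiGraph()
    for i, (p0, p1) in phi_node.items():
        G.add_edge("s", i, capacity=p0)   # cutting (s, i) corresponds to x_i = 0, cost phi_i(0)
        G.add_edge(i, "t", capacity=p1)   # cutting (i, t) corresponds to x_i = 1, cost phi_i(1)
    for (i, j), (p01, p10) in phi_edge.items():
        G.add_edge(i, j, capacity=p10)    # cut when x_i = 1, x_j = 0, cost phi_ij(1, 0)
        G.add_edge(j, i, capacity=p01)    # cut when x_j = 1, x_i = 0, cost phi_ij(0, 1)
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    # Nodes that stay on the source side take label 1, the rest take label 0.
    return {i: int(i in source_side) for i in phi_node}, cut_value

# Toy 3-node chain x_1 - x_2 - x_3 with made-up potentials.
phi_node = {1: (2.0, 0.5), 2: (1.0, 1.0), 3: (0.2, 2.0)}
phi_edge = {(1, 2): (0.8, 0.8), (2, 3): (0.8, 0.8)}
labels, energy = map_by_mincut(phi_node, phi_edge)
print(labels, energy)   # the min-cut value equals the minimum energy

When the diagonal terms are nonzero or some potentials are negative, the adjustments in points 1–3 above are applied to the capacities first; all capacities can be made nonnegative precisely when the submodularity property holds.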

10 On convergence of BP

In general there is no guarantee that loopy BP converges. Sufficient conditions for convergence are discussed in [9, 8].

11 Some general tips

Here are some general tips:

1. In real-world problems, some of the factors are usually expensive to compute. If there is no required order for computing the beliefs, try to update the expensive beliefs (the slowest messages) less frequently.

2. In some structured problems, where there must be some structure between the outcomes of the variables, plain BP updates can be inefficient. Instead, one can exploit the structure of the problem and perform intermediate dynamic-programming steps to make the BP updates more efficient. There are nice examples of this in [10, 6, 2, 1].

3. You usually do not need to run BP until convergence. It has often been observed that the results can already be promising before convergence.

12 Bibliographical notes

In preparing these notes I have used Martin J. Wainwright’s slides and David Burkett’s ACL tutorial. The nice paper [11] is a very important resource on BP in factor graphs. Many of the graphical models are drawn with the tikz-bayesnet package (https://github.com/jluttine/tikz-bayesnet). Some of the images are from Sewoong Oh’s course on Graphical Models.

References

[1] David Burkett and Dan Klein. Fast inference in phrase extraction models with belief propagation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 29–38. Association for Computational Linguistics, 2012.


[2] Fabien Cromieres and Sadao Kurohashi. An alignment algorithm using belief propagation and a structure-based distortion model. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 166–174. Association for Computational Linguistics, 2009.

[3] Brendan J. Frey. Extending factor graphs so as to unify directed and undirected graphical models. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 257–264. Morgan Kaufmann Publishers Inc., 2002.

[4] Frank R. Kschischang, Brendan J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[5] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001.

[6] Andre F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mario A. T. Figueiredo. Turbo parsers: Dependency parsing by approximate variational inference. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 34–44. Association for Computational Linguistics, 2010.

[7] Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530. Association for Computational Linguistics, 2005.

[8] Joris M. Mooij and Hilbert J. Kappen. Sufficient conditions for convergence of loopy belief propagation. In F. Bacchus and T. Jaakkola, editors, Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 396–403, Corvallis, Oregon, 2005. AUAI Press.

[9] Joris M. Mooij and Hilbert J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422–4437, December 2007.

[10] David A. Smith and Jason Eisner. Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 145–156. Association for Computational Linguistics, 2008.

[11] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.
