

Adaptive Belief Propagation
https://github.com/geopapa11/adabp

Georgios Papachristoudis ([email protected]) and John W. Fisher III ([email protected])

Massachusetts Institute of Technology

Motivation

- Graphical models are commonly used to represent large-scale inference problems.

- At any given time, only a small subset of all variables may be of interest.

- Observations may arrive at different times, resulting in a change of distribution.

- The marginals of the variables of interest change after the addition of new observations.

- It is desirable to evaluate the desired statistics efficiently and avoid redundant computations.

[Figure: four snapshots (#1–#4) of a tree; at each step ℓ a new measurement arrives at node w_ℓ (bold) and the marginal of node v_ℓ (?) is queried.]

Model parameters (bold nodes) change due to the addition of new observations, while only a small subset of latent nodes is of interest (?).

Temperature measurements (from sensors) are added sequentially, while only a part of the latent variables (the server) is of interest.

Contributions

- We develop an adaptive inference approach that gives the exact marginals on trees and whose average-case performance is significantly better than that of standard BP.

- We provide an extension to Gaussian loopy graphs, where the results are exact.

- Implementation is straightforward; code is publicly available.

Problem statement

- X = {X_1, ..., X_N}: N latent variables that are the focus of inference.

- Direct dependencies between latent variables are represented by the edge set E.

- The neighbors of X_k are denoted by N(k).

- Each latent node X_k is linked to m_k ≥ 0 measurements/observations: Y_{k,1}, ..., Y_{k,m_k}.

[Figure: latent graph X_1, ..., X_N with m_k observations Y_{k,1}, ..., Y_{k,m_k} attached to each latent node X_k.]

- Discrete setting: X_k ∈ X; Gaussian setting: X_k ∈ R^d.

Updating node potentials

- ϕ^(0)_k(x_k): node potentials of latent variables.

- χ_{kℓ}(x_k, y_ℓ): pairwise potential between a latent and an observed variable.

- ψ_{ij}(x_i, x_j): pairwise potential between latent variables.

[Figure: the same latent/observation graph, with the node potential of X_w updated by a new measurement.]

- A new measurement Y_{w_ℓ,u} = y_u at iteration ℓ changes the node potential of X_{w_ℓ} as

  ϕ^(ℓ)_{w_ℓ}(x_{w_ℓ}) = ϕ^(ℓ−1)_{w_ℓ}(x_{w_ℓ}) · χ_{w_ℓ u}(x_{w_ℓ}, y_u).

- We only consider the graph of latent nodes.
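On the discrete side, this update is a pointwise product of the old potential with the observation column. A minimal NumPy sketch, assuming tabular potentials; the function and array names here are illustrative, not from the released code:

```python
import numpy as np

def update_node_potential(phi_w, chi_w, y_obs):
    """Fold a new discrete measurement into the node potential of X_w.

    phi_w : (K,) current node potential of X_w over its K states
    chi_w : (K, L) pairwise potential chi_w(x_w, y) with the new
            observation variable (L observable values)
    y_obs : index of the observed value y_u
    """
    # phi^(l)_w(x_w) = phi^(l-1)_w(x_w) * chi_w(x_w, y_u)
    return phi_w * chi_w[:, y_obs]

phi = np.array([0.5, 0.5])                  # uninformative prior, 2 states
chi = np.array([[0.9, 0.1], [0.2, 0.8]])    # observation model
phi_new = update_node_potential(phi, chi, y_obs=0)
```

In the Gaussian setting the same step amounts to adding the measurement's information-form contribution to the node's information parameters.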

Measurement and marginal orders

Measurement order w = {w_1, ..., w_M}: the order in which measurements are acquired.

Marginal order v = {v_1, ..., v_M}: the order of (latent) nodes whose marginal is of interest at each step.

[Figure: three snapshots of the latent graph; at step ℓ the measurement arrives at w_ℓ and the marginal of v_ℓ (?) is queried.]

Adaptive BP

Belief Propagation

- Belief propagation is a message-passing algorithm that runs in time linear in the number of latent nodes and computes the node marginals.

- A message is given by m_{i→j}(x_j) = Σ_{x_i} ϕ_i(x_i) ψ_{ij}(x_i, x_j) Π_{k∈N(i)\j} m_{k→i}(x_i).

[Figure: message m_{i→j} sent from the subtree T_i across edge (i, j).]
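In code, the sum-product message above is a product of incoming messages followed by a matrix–vector contraction. A minimal sketch for discrete tree MRFs; the data layout (dicts keyed by nodes and directed edges) is illustrative:

```python
import numpy as np

def bp_message(i, j, phi, psi, msgs, nbrs):
    """Sum-product message m_{i->j}(x_j) on a discrete MRF.

    phi[i]      : (K,) node potential of X_i
    psi[(i, j)] : (K, K) pairwise potential, rows index x_i, cols x_j
    msgs[(k,i)] : (K,) incoming message from neighbor k to i
    nbrs[i]     : list of neighbors of i
    """
    prod = phi[i].copy()
    for k in nbrs[i]:
        if k != j:
            prod = prod * msgs[(k, i)]   # product over N(i) \ {j}
    return prod @ psi[(i, j)]            # sum out x_i

# Two-node chain: the message from X_0 carries its potential to X_1.
phi = {0: np.array([0.6, 0.4]), 1: np.array([1.0, 1.0])}
psi = {(0, 1): np.array([[0.9, 0.1], [0.1, 0.9]])}
m01 = bp_message(0, 1, phi, psi, {}, {0: [1], 1: [0]})
```

The marginal at a node is then the (normalized) product of its potential with all incoming messages.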

Lowest Common Ancestor (LCA)

- The lca of two nodes w, v is the lowest (deepest) node that has both w and v as descendants.

- For trees, the path between two nodes is uniquely determined from their lca.

- The lca of two nodes is found in constant time by reduction to the Range Minimum Query (RMQ) problem [Czumaj et al., 2007].

- This requires building the so-called RMQ structure in O(N log N) time and space.

[Figure: nodes w and v, their lca(w, v), and the path(w, v) through it.]
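A sketch of the reduction: take an Euler tour of the rooted tree; an lca query then becomes a range-minimum query on the depth sequence between the two first-visit positions. The RMQ below is naive for brevity; a sparse-table or ±1-RMQ structure gives the O(N log N) preprocessing and O(1) queries cited above. All names are illustrative:

```python
def build_euler(tree, root):
    """Euler tour of a rooted tree: visit order, depth sequence,
    and first-visit index of each node."""
    euler, depth, first = [], [], {}
    def dfs(u, parent, d):
        first.setdefault(u, len(euler))
        euler.append(u); depth.append(d)
        for w in tree[u]:
            if w != parent:
                dfs(w, u, d + 1)
                euler.append(u); depth.append(d)  # re-visit u on return
    dfs(root, None, 0)
    return euler, depth, first

def lca(u, v, euler, depth, first):
    """lca(u, v) as a range-minimum query on the depth sequence."""
    lo, hi = sorted((first[u], first[v]))
    i = min(range(lo, hi + 1), key=depth.__getitem__)  # naive RMQ
    return euler[i]

tree = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1]}
euler, depth, first = build_euler(tree, 0)
```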

Adaptive BP (AdaBP)

- Denote by M(w → v) the directed path from node w to node v, which is unique for trees.

[Figure: step ℓ on a tree. First send messages from w_{ℓ−1} to w_ℓ; then send messages from w_ℓ to v_ℓ.]

Theorem. AdaBP provides the exact marginals on the path M(w_ℓ → v_ℓ), for all ℓ, for tree MRFs.

Preprocessing: Build the RMQ structure.
Initialization: Initialize the node potentials, pairwise potentials, and messages.
for ℓ = 1, 2, ... do
    Determine the path M(w_{ℓ−1} → w_ℓ).
    Compute the messages in M(w_{ℓ−1} → w_ℓ).
    Update the node potential at X_{w_ℓ}.
    Determine the path M(w_ℓ → v_ℓ).
    Compute the messages in M(w_ℓ → v_ℓ).
    Compute the marginal of interest p_{X_{v_ℓ}}(x_{v_ℓ}).
end for
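The loop body can be sketched as follows; `send` and `update_potential` stand in for the message and potential updates above, and the `parent`/`depth` maps come from rooting the tree (all names are illustrative, not from the released code):

```python
def adabp_step(w_prev, w, v, parent, depth, send, update_potential):
    """One AdaBP iteration on a rooted tree (sketch).

    send(a, b)          : recompute the message a -> b
    update_potential(w) : fold the new measurement into phi_w
    """
    def path(a, b):
        # Walk both endpoints up to their lca, then splice the halves.
        up_a, up_b = [a], [b]
        while depth[a] > depth[b]:
            a = parent[a]; up_a.append(a)
        while depth[b] > depth[a]:
            b = parent[b]; up_b.append(b)
        while a != b:
            a, b = parent[a], parent[b]
            up_a.append(a); up_b.append(b)
        return up_a + up_b[-2::-1]          # a -> lca -> b

    p = path(w_prev, w)
    for a, b in zip(p, p[1:]):
        send(a, b)                          # refresh messages toward w
    update_potential(w)
    q = path(w, v)
    for a, b in zip(q, q[1:]):
        send(a, b)                          # refresh messages toward v
```

Only the messages along the two paths are recomputed, which is what makes the per-update cost proportional to dist(w_{ℓ−1}, w_ℓ) rather than N.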

Extension to Max-Product

- Find the most likely sequence: x* ∈ arg max_x p(x).

- m_{i→j}(x_j) = max_{x_i} ϕ_i(x_i) ψ_{ij}(x_i, x_j) Π_{k∈N(i)\j} m_{k→i}(x_i).

- δ_{i→j}(x_j) = arg max_{x_i} ϕ_i(x_i) ψ_{ij}(x_i, x_j) Π_{k∈N(i)\j} m_{k→i}(x_i).

- Propagate the m and δ messages in M(w_{ℓ−1} → w_ℓ).
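Both quantities fall out of one score table per edge: maximize over x_i for the message, record the maximizer for the backtracking table δ. A minimal sketch with illustrative names:

```python
import numpy as np

def maxprod_message(phi_i, psi_ij, incoming):
    """Max-product message m_{i->j} and argmax table delta_{i->j}.

    phi_i    : (K,) node potential of X_i
    psi_ij   : (K, K) pairwise potential, rows x_i, cols x_j
    incoming : messages m_{k->i} for k in N(i) \\ {j}
    """
    prod = phi_i.copy()
    for m in incoming:
        prod = prod * m
    scores = prod[:, None] * psi_ij      # scores[x_i, x_j]
    return scores.max(axis=0), scores.argmax(axis=0)

m01, d01 = maxprod_message(np.array([0.6, 0.4]),
                           np.array([[0.9, 0.1], [0.1, 0.9]]), [])
```

Backtracking through the δ tables from the maximizing root state recovers the MAP assignment x*.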


Extension to Gaussian Loopy MRFs

- Feedback Message Passing (FMP) by [Liu et al., 2012] is a belief-propagation-like algorithm that provides the exact marginal means and variances in Gaussian loopy graphs.

- F: the FVS nodes, a set of nodes whose removal breaks all loops (here: 16, 17, 18).

- T: the remaining acyclic graph.

- A: the anchors (neighbors of FVS nodes).

- w^T_ℓ = w_ℓ if w_ℓ ∈ T, and w^T_ℓ = w^T_{ℓ−1} otherwise.

[Figure: a loopy Gaussian MRF on nodes 1–18; removing the FVS nodes 16, 17, 18 leaves an acyclic graph.]

Extension of AdaBP to Gaussian loopy graphs

[Figures: First phase — send messages from w^T_{ℓ−1} to w_ℓ (directly along the tree path when w_{ℓ−1}, w_ℓ ∈ T; starting from w^T_{ℓ−1} when w_{ℓ−1} ∈ F). Second phase — send messages from w^T_ℓ to the anchors, M(w^T_ℓ → A), and then on to v_ℓ, M(A → v_ℓ).]

Experiments

- Comparison against standard BP and RCTreeBP by [Sümer et al., 2011].

- The method of [Sümer et al., 2011] is an adaptive inference approach that constructs a balanced representation of an elimination tree to evaluate marginals in logarithmic time.

- Preprocessing time (for trees): O(|X|^3 N) for RCTreeBP / O(N log N) for AdaBP.

- Time per update (for trees): O(|X|^3 log N) for RCTreeBP / O(|X|^2 dist(w_{ℓ−1}, w_ℓ)) for AdaBP.

Synthetic data

- We construct unbalanced trees of varying sizes (N ∈ {10, 10^2, 10^3, 10^4}).

- We generate the measurement order w either randomly (column (a)) or such that E[dist(w_{ℓ−1}, w_ℓ)] ≤ |X| log N (column (b)), and compute marginals at each step.

- AdaBP is orders of magnitude faster than standard BP.

- AdaBP is 1.3–4.7× faster than RCTreeBP when E[dist(w_{ℓ−1}, w_ℓ)] ≤ |X| log N (column (b)).

[Plots of speedup ratios t(BP)/t(AdaBP) and t(RCTreeBP)/t(AdaBP) vs. N (for |X| = 2 and |X| = 10): (a) unconstrained w; (b) constrained w; (c) E[dist(w_{ℓ−1}, w_ℓ)] ∼ N; (d) E[dist(w_{ℓ−1}, w_ℓ)] fixed.]

Real data

- We explore the effect of basepair mutations on the birth/death of CpG islands.

- AdaMP is up to 8 times faster than RCTreeMP.

- The update time per iteration is much smaller for AdaMP when dist(w_{ℓ−1}, w_ℓ) is small.

- Neither method is sensitive to changes in the MAP sequence.

[Plots: (a) speedups of AdaMP over RCTreeMP; (b) update times of AdaMP and RCTreeMP; (c) sensitivity to changes in the MAP sequence (RCTreeMP, ρ = 0.09; AdaMP, ρ = 0.14).]

- We analyze temperature measurements collected from 53 wireless sensors at the Intel Berkeley Research Lab.

- We assume the temperature evolution follows a Gaussian distribution.

- We collect measurements over a 6-hour window in a random order and are interested in computing the marginals of selected areas.

- AdaBP is up to 4–6 times faster than standard Kalman filtering/smoothing techniques.

[Plots: speedup ratio t(KF)/t(AdaBP) and dist(w_ℓ, w_{ℓ−1}) per iteration j; update times of AdaBP vs. KF.]

[Sümer et al., 2011] O. Sümer, U. A. Acar, A. T. Ihler and R. R. Mettu. Adaptive Exact Inference in Graphical Models. Journal of Machine Learning Research (JMLR), 12:3147–3186, November 2011.

[Czumaj et al., 2007] A. Czumaj, M. Kowaluk and A. Lingas. Faster algorithms for finding lowest common ancestors in directed acyclic graphs. Theoretical Computer Science, 380:37–46, July 2007.

[Liu et al., 2012] Y. Liu, V. Chandrasekaran, A. Anandkumar and A. S. Willsky. Feedback Message Passing for Inference in Gaussian Graphical Models. IEEE Transactions on Signal Processing, 60(8):4135–4150, August 2012.