query-specific learning and inference for probabilistic graphical models
DESCRIPTION
Query-Specific Learning and Inference for Probabilistic Graphical Models. Anton Chechetka.
TRANSCRIPT
Carnegie Mellon
Query-Specific Learning and Inference for Probabilistic Graphical Models
Thesis committee: Carlos Guestrin Eric Xing J. Andrew Bagnell Pedro Domingos (University of Washington)
14 June 2011
Anton Chechetka
2
Motivation
Fundamental problem: to reason accurately about noisy, high-dimensional data with local interactions
3
Sensor networks
• noisy: sensors fail, noise in readings
• high-dimensional: many sensors, several readings (temperature, humidity, …) per sensor
• local interactions: nearby locations have high correlations
4
Hypertext classification
• noisy: automated text understanding is far from perfect
• high-dimensional: a variable for every webpage
• local interactions: directly linked pages have correlated topics
5
Image segmentation
• noisy: local information is not enough; camera sensor noise, compression artifacts
• high-dimensional: a variable for every patch
• local interactions: cows are next to grass, airplanes next to sky
6
Probabilistic graphical models
Noisy, high-dimensional data with local interactions
A graph to encode only direct interactions
Probabilistic inference over many variables:
P(Q | E) = P(Q, E) / P(E)
(Q: query, E: evidence)
7
Graphical models semantics
Factorized distributions:
P(X) = (1/Z) ∏_{α ∈ F} f_α(X_α)
Graph structure: a graph over the variables (X1, …, X7) encoding only direct interactions
Example factor scope: X_α = {X3, X4, X5}
The X_α are small subsets of X, so the representation is compact
[Figure: example graph over X1–X7 with one highlighted factor scope]
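To make the factorized form concrete, here is a minimal sketch (illustrative only, not part of the thesis) of evaluating an unnormalized probability as a product of small factor tables; the variable names and the two example factors are made up for the example.

```python
# Sketch: P(X) = (1/Z) * prod_alpha f_alpha(X_alpha). Each factor only touches a
# small subset of variables, so an unnormalized probability is a product of
# small table lookups. Illustrative only; names and factors are assumptions.
import numpy as np

def unnormalized_prob(assignment, factors):
    """assignment: dict var -> value; factors: list of (scope, table), where scope
    is a tuple of variable names and table is a numpy array indexed by their values."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(assignment[v] for v in scope)]
    return p

# Example: three binary variables with two pairwise factors.
factors = [(("X1", "X2"), np.array([[2.0, 1.0], [1.0, 2.0]])),
           (("X2", "X3"), np.array([[1.0, 3.0], [3.0, 1.0]]))]
print(unnormalized_prob({"X1": 0, "X2": 1, "X3": 0}, factors))  # 1.0 * 3.0 = 3.0
```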
8
Graphical models workflow
Factorized distributions P(X) = (1/Z) ∏_{α ∈ F} f_α(X_α), with a graph structure over the variables
Workflow: learn/construct structure, then learn/define parameters, then run inference for P(Q | E=E)
[Figure: example graph over X1–X7]
9
Graphical models: fundamental problems
Learn/construct structure: NP-complete
Learn/define parameters: exp(|X|)
Inference P(Q | E=E): #P-complete (exact), NP-complete (approximate)
Compounding errors across the stages
10
Domain knowledge structures don’t help
[Figure: webpage link graph]
Domain knowledge-based structures do not support tractable inference
11
This thesis: general directions
Emphasizing the computational aspects of the graph:
Learn accurate and tractable models
Compensate for reduced expressive power with exact inference and optimal parameters
Gain significant speedups
Inference speedups via better prioritization of computation:
Estimate the long-term effects of propagating information through the graph
Use long-term estimates to prioritize updates
New algorithms for learning and inference in graphical models, to make answering queries better
12
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
13
Generative learning
P(Q | E) = P(Q, E) / P(E)
(query goal: P(Q | E); learning goal: the joint P(Q, E))
Useful when E is not known in advance
Sensors fail unpredictably
Measurements are expensive (e.g. user time), want adaptive evidence selection
14
Tractable vs intractable models workflow
Tractable models: learn a simple tractable structure from domain knowledge + data; optimal parameters, exact inference; approximate P(Q | E=E)
Intractable models: construct an intractable structure from domain knowledge, or learn an intractable structure from data; approximate algorithms with no quality guarantees; approximate P(Q | E=E)
Tractability via low treewidth
Exact inference is exponential in treewidth (sum-product)
Treewidth is NP-complete to compute in general
Low-treewidth graphs are easy to construct
Convenient representation: junction tree
Other tractable model classes exist too
15
[Figure: example graph over nodes 1–7 and its triangulation]
Treewidth: the size of the largest clique in a triangulated graph, minus one
16
Junction trees
Cliques connected by edges with separators
Running intersection property
Finding the most likely junction tree of given treewidth >1 is NP-complete
We will look for good approximations
[Figure: junction tree for the example graph; cliques C1–C5 = {X1,X2,X7}, {X1,X2,X5}, {X1,X3,X5}, {X1,X4,X5}, {X4,X5,X6}, connected by separators such as {X1,X2}, {X1,X5}, {X4,X5}]
17
Independencies in low-treewidth distributions
If P(X) factorizes according to a JT, then conditional independencies hold across every separator:
I(X_C, X_C' | S) = 0   (conditional mutual information between the two sides of S)
Example: for separator S = {X1, X5}, X_C = {X2, X3, X7} and X_C' = {X4, X6}
It works in the other way too:
KL( P || P_JT ) ≤ Σ_{separators S} I(X_C, X_C' | S)
so small conditional mutual information across every separator means the JT approximates P well
[Figure: the example junction tree with cliques {X1,X2,X7}, {X1,X2,X5}, {X1,X3,X5}, {X1,X4,X5}, {X4,X5,X6}]
18
Constraint-based structure learning
Since KL( P || P_JT ) ≤ Σ_{separators S} I(X_C, X_C' | S), look for JTs where this sum is small (constraint-based structure learning):
For every candidate separator S (X1X2, X1X3, X1X4, …, Xn-1Xn), partition the remaining variables into weakly dependent subsets, i.e. groups X_α, X_β with I(X_α, X_β | S) < ε
Then find a junction tree consistent with those partitions
[Figure: all candidate separators over all variables X; partitions of the remaining variables; a consistent junction tree with cliques C1–C5]
19
Mutual information complexity
I(X_A, X_B | S) = H(X_A | S) − H(X_A | X_B, S),   where X_B is everything except X_A and S, and H is conditional entropy
I(X_A, X_B | S) depends on all assignments to X: exp(|X|) complexity in general
Our contribution: polynomial-time upper bound
20
Mutual info upper bound: intuition
Computing I(A, B | C) directly is hard.
Only look at small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k: each I(D, F | C) is easy
Polynomially many small subsets, polynomial complexity for every pair
Any conclusions about I(A, B | C)?
In general, no. If a good junction tree exists, yes.
21
Contribution: mutual info upper bound
Theorem: Suppose an ε-JT of treewidth k for P(A ∪ B ∪ C) exists.
Let δ = max I(D, F | C) over D ⊆ A, F ⊆ B with |D ∪ F| ≤ treewidth + 1.
Then I(A, B | C) ≤ |A ∪ B ∪ C| · (δ + ε).
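The theorem suggests a direct way to compute the bound from data. A minimal sketch (not the thesis implementation), assuming discrete data as an integer numpy array with rows as samples and columns as variables, and taking the ε of the assumed ε-JT as an input:

```python
# Sketch of the polynomial-time upper bound on I(A, B | C): enumerate small
# subsets D of A and F of B with |D| + |F| <= treewidth + 1, take the largest
# empirical conditional mutual information, and apply the theorem.
from itertools import combinations
import numpy as np

def empirical_entropy(data, cols):
    """Empirical joint entropy (nats) of the given columns of a discrete dataset."""
    if not cols:
        return 0.0
    _, counts = np.unique(data[:, cols], axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def cond_mutual_info(data, d, f, c):
    """Empirical conditional mutual information I(D; F | C)."""
    H = lambda cols: empirical_entropy(data, sorted(cols))
    return H(d + c) + H(f + c) - H(d + f + c) - H(c)

def mutual_info_upper_bound(data, A, B, C, treewidth, eps):
    """I(A, B | C) <= |A u B u C| * (delta + eps), with delta the largest
    I(D; F | C) over D in A, F in B, |D| + |F| <= treewidth + 1."""
    delta = 0.0
    for size in range(2, treewidth + 2):
        for d_size in range(1, size):
            for D in combinations(A, d_size):
                for F in combinations(B, size - d_size):
                    delta = max(delta,
                                cond_mutual_info(data, list(D), list(F), list(C)))
    return len(set(A) | set(B) | set(C)) * (delta + eps)
```

Each subset pair touches at most |C| + treewidth + 1 variables, which matches the complexity accounting on the next slide.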
22
Mutual info upper bound: complexity
Direct computation: exp(|ABC|) complexity
Our upper bound:
O(|A ∪ B|^(treewidth + 1)) small subsets D, F with |D ∪ F| ≤ treewidth + 1
exp(|C| + treewidth) time for each
|C| = treewidth for structure learning
Overall: polynomial(|ABC|) complexity
23
Guarantees on learned model quality
Theorem: Suppose a strongly connected ε-JT of treewidth k for P(X) exists.
Then our algorithm will, with probability at least (1 − γ), find a JT such that
KL( P || P_JT ) ≤ |X| · k · (ε + 2δ)   (quality guarantee)
using O( log(|X| / γ) / δ² ) samples (poly samples) and |X|^O(k) time (poly time).
Corollary: strongly connected junction trees are PAC-learnable
24
Related work
Reference                   Model                           Guarantees     Time
[Bach+Jordan:2002]          tractable                       local          poly(n)
[Chow+Liu:1968]             tree                            global         O(n² log n)
[Meila+Jordan:2001]         tree mix                        local          O(n² log n)
[Teyssier+Koller:2005]      compact                         local          poly(n)
[Singh+Moore:2005]          all                             global         exp(n)
[Karger+Srebro:2001]        tractable                       const-factor   poly(n)
[Abbeel+al:2006]            compact                         PAC            poly(n)
[Narasimhan+Bilmes:2004]    tractable                       PAC            exp(n)
our work                    tractable                       PAC            poly(n)
[Gogate+al:2010]            tractable with high treewidth   PAC            poly(n)
25
Results – typical convergence time
Good results early on in practice
[Plot: test log-likelihood vs. training time; higher is better]
26
Results – log-likelihood
[Plot: test log-likelihood of our method vs. baselines; higher is better]
OBS: local search in limited in-degree Bayes nets
Chow-Liu: most likely JTs of treewidth 1
Karger-Srebro: constant-factor approximation JTs
27
Conclusions
A tractable upper bound on conditional mutual information
Graceful quality degradation and PAC learnability guarantees
Analysis of when dynamic programming works [in the thesis]
Dealing with an unknown mutual information threshold [in the thesis]
Speedups preserving the guarantees
Further speedups without guarantees
28
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
29
Discriminative learning
P(Q | E) = P(Q, E) / P(E)
(query goal and learning goal: P(Q | E) directly)
Useful when the variables E are always the same
Non-adaptive, one-shot observation
Image pixels to scene description; document text to topic and named entities
Better accuracy than generative models
30
Discriminative log-linear models
P(Q | E, w) = (1/Z(E, w)) exp( Σ_i w_i f_i(Q, E) )
f_i: features (domain knowledge); w_i: weights (learned from data); Z(E, w): evidence-dependent normalization
Don't sum over all values of E; don't model P(E)
No need for structure over E
[Figure: query and evidence variables connected by features such as f12, f34]
31
Model tractability still important
Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting
Tractability is determined by the structure over query
32
Simple local models: motivation
[Figure: query Q as an approximate function of the evidence, Q = f(E), locally almost linear]
Exploiting evidence values overcomes the expressive power deficit of simple models
We will learn local tractable models
33
Context-specific independence
Observation #2: use evidence values at test time to tune the structure of the models, do not commit to a single tractable model
[Figure: depending on the evidence values, no edge is needed between some query variables]
34
Low-dimensional dependencies in generative structure learning
Generative structure learning often relies only on low-dimensional marginals:
Junction trees have decomposable scores:
LLH(JT) = Σ_{separators S} H(S) − Σ_{cliques C} H(C)   (up to constants)
Low-dimensional independence tests: I(A, B | S)
Small changes to structure allow quick score recomputation
Discriminative structure learning: need inference in the full model for every datapoint, even for small changes in structure
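As an illustration of why decomposable scores make generative structure search cheap, here is a minimal sketch (assumed representation: discrete data as an integer numpy array, cliques and separators as collections of column indices); it is not the thesis code:

```python
# Sketch: the per-sample log-likelihood of the maximum-likelihood JT projection,
# up to a structure-independent constant, is sum_S H(S) - sum_C H(C). Changing
# one clique only requires recomputing the entropies it touches.
import numpy as np

def empirical_entropy(data, cols):
    """Empirical joint entropy (nats) of the given columns of a discrete dataset."""
    _, counts = np.unique(data[:, list(cols)], axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def jt_score(data, cliques, separators):
    """Decomposable junction tree score (higher is better)."""
    return (sum(empirical_entropy(data, sorted(s)) for s in separators)
            - sum(empirical_entropy(data, sorted(c)) for c in cliques))
```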
35
Leverage generative learning
Observation #3: generative structure learning algorithms have very useful properties, can we leverage them?
36
Observations so far
The discriminative setting has extra information, including evidence values at test time
Want to use it to learn local tractable models
Good structure learning algorithms exist for the generative setting that only require low-dimensional marginals P(Q_β)
Approach:
1. use local conditionals P(Q_β | E=E) as “fake marginals” to learn local tractable structures
2. learn exact discriminative feature weights
37
Evidence-specific CRF overview
Approach:
1. use local conditionals P(Q_β | E=E) as “fake marginals” to learn local tractable structures
2. learn exact discriminative feature weights
[Flow: evidence value E=E feeds the local conditional density estimators P(Q_β | E); the resulting P(Q_β | E=E) feed a generative structure learning algorithm, which outputs a tractable structure for E=E; combined with the feature weights w, this gives a tractable evidence-specific CRF]
Evidence-specific CRF formalism
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_i w_i f_i(Q, E) I_i(E, u) )
Observation: a feature that is identically zero does not affect the model
Evidence-specific structure: I(E, u) ∈ {0, 1}; u are extra “structural” parameters
[Figure: (fixed dense model) × (evidence-specific tree “mask”) = (evidence-specific model)]
38
[Figure: the evidence-specific feature values differ for E=E1, E=E2, E=E3]
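A minimal sketch of how the mask enters the model score; the function name and the flat vector representation of features are assumptions for illustration, not the thesis implementation:

```python
# Sketch: unnormalized log-score of the evidence-specific CRF,
# sum_i w_i * f_i(Q, E) * I_i(E, u). Only features on the evidence-specific
# tree (mask = 1) contribute.
import numpy as np

def masked_log_score(w, feature_values, mask):
    w, feature_values, mask = (np.asarray(a) for a in (w, feature_values, mask))
    return float(np.sum(w * feature_values * mask))
```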
39
Evidence-specific CRF learning
Learning proceeds in the same order as testing
[Flow: evidence value E=E feeds the local conditional density estimators P(Q_β | E); the resulting P(Q_β | E=E) feed a generative structure learning algorithm, which outputs a tractable structure for E=E; combined with the feature weights w, this gives a tractable evidence-specific CRF]
40
Plug in generative structure learning
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_i w_i f_i(Q, E) I_i(E, u) )
I(E, u) encodes the output of the chosen structure learning algorithm
Directly generalize generative algorithms:
Generative: P(Q_i, Q_j) (pairwise marginals) + Chow-Liu algorithm = optimal tree
Discriminative: P(Q_i, Q_j | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E
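As a sketch of the discriminative column above (assumed interfaces, not the thesis code): score every candidate edge by its pairwise conditional mutual information under the learned low-dimensional estimators for the current evidence, then take a maximum-weight spanning tree, exactly as Chow-Liu does with marginals:

```python
# Sketch: evidence-specific Chow-Liu. `pairwise_cond_mi(i, j, evidence)` is an
# assumed callback wrapping the learned estimators P_hat(Q_i, Q_j | E=e, u) and
# returning a conditional mutual information score for the current evidence.
def evidence_specific_chow_liu(n_query_vars, evidence, pairwise_cond_mi):
    """Return tree edges (i, j) maximizing total conditional MI given the evidence."""
    weights = {(i, j): pairwise_cond_mi(i, j, evidence)
               for i in range(n_query_vars) for j in range(i + 1, n_query_vars)}
    # Prim's algorithm for a maximum-weight spanning tree.
    in_tree, edges = {0}, []
    while len(in_tree) < n_query_vars:
        best = max(((i, j) for (i, j) in weights
                    if (i in in_tree) ^ (j in in_tree)),
                   key=lambda e: weights[e])
        edges.append(best)
        in_tree.update(best)
    return edges
```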
41
Evidence-specific CRF learning: structure
Choose a generative structure learning algorithm A
Identify the low-dimensional subsets Q_β that A may need
Chow-Liu: all pairs (Q_i, Q_j)
[Figure: the original problem over (E, Q) decomposes into low-dimensional pairwise problems over (E, Q1Q2), (E, Q1Q3), (E, Q3Q4), …, with estimators P̂(Q1,Q2 | E, u), P̂(Q1,Q3 | E, u), P̂(Q3,Q4 | E, u)]
42
Estimating low-dimensional conditionals
Use the same features as the baseline high-treewidth model
Baseline CRF: P(Q | E, w) = (1/Z(E, w)) exp( Σ_i w_i f_i(Q, E) )
Low-dimensional model: P(Q_β | E, u_β) = (1/Z_β(E, u_β)) exp( Σ_i u_β,i f_i(Q_β, E) ), using only features whose query scope falls within Q_β (scope restriction)
End result: optimal u
43
Evidence-specific CRF learning: weights
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_i w_i f_i(Q, E) I_i(E, u) )
Already chose the algorithm behind I(E, u)
Already learned the structural parameters u
Only need to learn the feature weights w
log P(Q | E, w, u) is concave in w: unique global optimum
The products f_i(Q, E) I_i(E, u) act as “effective features”
44
Evidence-specific CRF learning: weights
∇_w log P(Q | E, w, u) = I(E, u) ∘ ( f(Q, E) − E_{Q′ ~ P(Q′ | E, w, u)} [ f(Q′, E) ] )
With the mask fixed, P(Q | E, w, u) is a tree-structured distribution, so the expected features (and hence exact gradients with respect to w) can be computed exactly for every datapoint
[Figure: (fixed dense model) × (evidence-specific tree “mask” for E=E1, E2, E3) gives exact tree-structured gradients for Q=Q1, Q2, Q3, which sum to the overall (dense) gradient]
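A minimal sketch of the resulting per-datapoint gradient (assumed flat feature vectors; the expected features would come from exact inference in the masked tree):

```python
# Sketch: d/dw log P(Q | E, w, u) = mask * (observed features - expected features)
# for one datapoint, where mask = I(E, u) in {0, 1}^n.
import numpy as np

def weight_gradient(observed_feats, expected_feats, mask):
    observed_feats, expected_feats, mask = (np.asarray(a) for a in
                                            (observed_feats, expected_feats, mask))
    return mask * (observed_feats - expected_feats)
```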
45
Results – WebKB
Text + links predict webpage topic
[Charts: prediction error and training time for SVM (ignores links), RMN (standard dense CRF), ESS-CRF (our work), and M3N (max-margin model); lower is better]
46
Image segmentation - accuracy
Local segment features + neighbor segments predict the type of object
[Chart: accuracy for logistic regression (ignores links), dense CRF (standard), and ESS-CRF (our work); higher is better]
47
Image segmentation - time
[Charts, log scale: train time and test time for logistic regression (ignores links), dense CRF (standard), and ESS-CRF (our work); lower is better]
48
Conclusions
Using evidence values to tune low-treewidth model structure
Compensates for the reduced expressive power
Order of magnitude speedup at test time (sometimes train time too)
General framework for plugging in existing generative structure learners
Straightforward relational extension [in the thesis]
49
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
50
Why high-treewidth models?
A dense model expressing laws of nature
Protein folding
Max-margin parameters don't work well (yet?) with evidence-specific structures
51
Query-Specific inference problem
[Figure: a model with evidence variables, query variables, and variables that are not interesting in themselves]
Using information about the query to speed up convergence of belief propagation for the query marginals
P(X) ∝ ∏_{(i,j) ∈ E} f_ij(X_i, X_j)
52
(Loopy) Belief Propagation
Passing messages along edges
Update rule: m_ij^(t+1)(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ E_i \ j} m_ki^(t)(x_i)
Variable belief: P^(t)(x_i) ∝ ∏_{j ∈ E_i} m_ji^(t)(x_i)
Result: all single-variable beliefs
[Figure: messages m_ki flowing along the edges of the graph]
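A minimal sketch (illustrative, not the thesis code) of the sum-product update above, for a pairwise factor stored as a numpy array:

```python
# Sketch: one BP message update m_ij(x_j) = sum_{x_i} f_ij(x_i, x_j) *
#         prod_{k in N(i) \ j} m_ki(x_i), followed by normalization.
import numpy as np

def bp_update(f_ij, incoming):
    """f_ij: array of shape [vals_i, vals_j]; incoming: list of messages m_ki(x_i)
    from i's neighbors other than j, each of shape [vals_i]."""
    prod_in = np.ones(f_ij.shape[0])
    for m in incoming:
        prod_in = prod_in * m
    msg = f_ij.T @ prod_in          # sum over x_i
    return msg / msg.sum()          # normalize for numerical stability
```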
53
(Loopy) Belief Propagation
Message dependencies are local: each message depends only on the messages flowing into its source node
Freedom in scheduling updates
Round-robin schedule: fix a message order, apply updates in that order until convergence
[Figure: the local dependence of a message on its upstream messages]
54
Dynamic update prioritization
A fixed update sequence is not the best option
Dynamic update scheduling can speed up convergence
Tree-Reweighted BP [Wainwright et al., AISTATS 2003]
Residual BP [Elidan et al., UAI 2006]
Residual BP: apply the largest change first
[Figure: propagating a large change is an informative update; propagating small changes is wasted computation]
55
Residual BP [Elidan et al., UAI 2006]
Update rule: m_ij^(NEW)(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ E_i \ j} m_ki^(OLD)(x_i)
Pick the edge with the largest residual: max_{(i,j)} || m_ij^(NEW) − m_ij^(OLD) ||
Update: m_ij^(OLD) ← m_ij^(NEW)
More effort on the difficult parts of the model
But no query
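A minimal sketch of residual BP scheduling (assumed interfaces: a messages dict and a compute_message callback implementing the sum-product update; not the thesis implementation):

```python
# Sketch: keep directed edges in a priority queue keyed by the residual
# ||m_new - m_old|| and always apply the largest pending change; after applying
# (i, j), only the messages leaving j need to be re-scored.
import heapq
import numpy as np

def residual_bp(messages, out_edges, compute_message, max_updates=10000, tol=1e-6):
    """messages: dict edge -> current message (e.g. uniform init);
    out_edges[j]: list of directed edges (j, k) leaving node j;
    compute_message(i, j, messages): sum-product update for m_ij (assumed callback)."""
    def residual(e):
        return float(np.max(np.abs(compute_message(*e, messages) - messages[e])))

    queue = [(-residual(e), e) for e in messages]
    heapq.heapify(queue)
    for _ in range(max_updates):
        neg_res, (i, j) = heapq.heappop(queue)
        if -neg_res < tol:
            break
        messages[(i, j)] = compute_message(i, j, messages)   # apply largest change
        for e in out_edges[j]:      # these messages depend on m_ij: re-score them
            heapq.heappush(queue, (-residual(e), e))
    return messages
```

Stale queue entries are tolerated in this sketch; popping one simply re-applies an already up-to-date message, which keeps the bookkeeping short.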
56
Why edge importance weights?
[Figure: two candidate edge updates, one near and one far from the query; the far edge has the larger residual. Which to update?]
• Residual BP updates the larger residual: no influence on the query, wasted computation
• We want to update the edge that will influence the query in the future
Residual BP: max immediate residual reduction
Our work: max approximate eventual effect on P(query)
57
Query-Specific BP
Update rule: m_ij^(NEW)(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ E_i \ j} m_ki^(OLD)(x_i)
Pick the edge with the largest weighted residual: max_{(i,j)} A_ij · || m_ij^(NEW) − m_ij^(OLD) ||
Update: m_ij^(OLD) ← m_ij^(NEW)
A_ij is the edge importance: the only change!
Rest of the talk: defining and computing edge importance
Edge importance base case
Goal: approximate the eventual effect of an update on P(Q)
Priority of an edge: A_ij · || m_ij^(NEW) − m_ij^(OLD) ||
Base case: an edge (j → i) directly connected to the query. A_ji = ?
|| P^(NEW)(Q) − P^(OLD)(Q) || ≤ 1 · || m_ji^(NEW) − m_ji^(OLD) ||   (change in query belief ≤ change in message; the bound is tight)
So A_ji = 1 for edges directly connected to the query
Edge one step away from the query: A_rj = ?
Need sup || ∂m_ji / ∂m_rj ||, the supremum taken over the values of all other messages
[Figure: the query and the nearby edges in the model]
Edge importance one step away
|| change in P(Q) || ≤ || change in m_ji || ≤ sup || ∂m_ji / ∂m_rj || · || change in m_rj ||
The message importance sup || ∂m_ji / ∂m_rj || can be computed in closed form, looking only at f_ji [Mooij, Kappen; 2007]
One step away: A_rj = sup || ∂m_ji / ∂m_rj ||
[Figure: the query, the edge (j → i) into it, and the edge (r → j) one step away]
Edge importance general case
For an edge (s → h) farther away, the effect on P(Q) propagates along a chain of messages:
|| change in P(Q) || ≤ sup || ∂m_ji / ∂m_rj || · sup || ∂m_rj / ∂m_hr || · sup || ∂m_hr / ∂m_sh || · || change in m_sh ||
sensitivity(path): the maximal impact along the path, i.e. the product of the per-edge sup terms (base case: A_ji = 1 for edges into the query)
A direct generalization via sup || ∂P(Q) / ∂m_sh || is expensive to compute, and the bound may be infinite
[Figure: a chain of edges s → h → r → j → i leading to the query]
Edge importance general case
sensitivity(π) = the product of the sup || ∂m_next / ∂m_prev || terms along the path π
A_sh = max over all paths π from (s → h) to the query of sensitivity(π)
There are a lot of paths in a graph; trying out every one is intractable
[Figure: multiple paths from the edge (s → h) to the query]
62
Efficient edge importance computation
A_e = max over all paths π from edge e to the query of sensitivity(π)
There are a lot of paths in a graph; trying out every one is intractable
But sensitivity decomposes into individual edge contributions:
sensitivity of the path through h, r, j, i = sup || ∂m_hr / ∂m_sh || · sup || ∂m_rj / ∂m_hr || · sup || ∂m_ji / ∂m_rj ||
Each term is always ≤ 1, so sensitivity always decreases as the path grows
Dijkstra's (shortest paths) algorithm will therefore efficiently find the max-sensitivity paths for every edge
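A minimal sketch of that computation (assumed inputs, not the thesis code): because each per-edge term is at most 1, taking −log turns max-product sensitivities into non-negative path lengths, so a Dijkstra-style pass starting from the query yields every A at once:

```python
# Sketch: edge importance A[e] = max over paths from e to the query of the product
# of per-edge sensitivities. With -log weights this is a shortest-path problem.
import heapq
import math

def edge_importance(query_edges, predecessors, edge_sensitivity):
    """query_edges: directed edges (j, i) pointing into query variables (A = 1);
    predecessors[e]: edges whose change can affect the message on e;
    edge_sensitivity[(prev, e)]: sup ||dm_e / dm_prev||, assumed in (0, 1]."""
    dist = {e: 0.0 for e in query_edges}        # -log importance; 0 means A = 1
    heap = [(0.0, e) for e in query_edges]
    while heap:
        d, e = heapq.heappop(heap)
        if d > dist.get(e, math.inf):
            continue                             # stale queue entry
        for prev in predecessors.get(e, []):
            nd = d - math.log(edge_sensitivity[(prev, e)])
            if nd < dist.get(prev, math.inf):
                dist[prev] = nd
                heapq.heappush(heap, (nd, prev))
    return {e: math.exp(-d) for e, d in dist.items()}
```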
63
A_ji = max over all paths π from (j → i) to the query of sensitivity(π)
Query-Specific BP
Run Dijkstra's algorithm starting at the query to get the edge weights A
Pick the edge with the largest weighted residual: max_{(i,j)} A_ij · || m_ij^(NEW) − m_ij^(OLD) ||
Update: m_ij^(OLD) ← m_ij^(NEW)
More effort on the difficult parts of the model
Takes into account not only graphical structure, but also the strength of dependencies and relevance to the query
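The change relative to residual BP is literally the priority function; a minimal sketch:

```python
# Sketch: QSBP priority of edge (i, j) = A_ij * ||m_new - m_old|| instead of the
# raw residual; A is computed once by the Dijkstra-style pass sketched above.
def qsbp_priority(residual, importance):
    return importance * residual
```

In the residual BP sketch shown after slide 55, this amounts to pushing (-qsbp_priority(residual(e), A[e]), e) onto the queue instead of the raw negative residual.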
64
Experiments – single query
[Plots: convergence for an easy model (sparse connectivity, weak interactions) and a hard model (dense connectivity, strong interactions); standard residual BP vs. our work; faster is better]
Faster convergence, but the long initialization is still a problem
65
Anytime query-specific BP
Query-specific BP: run Dijkstra's algorithm fully, then BP updates
Anytime QSBP: interleave Dijkstra's algorithm with BP updates; same BP update sequence!
[Figure: timeline of Dijkstra's expansions interleaved with BP updates near the query]
66
Experiments – anytime QSBP
[Plots: convergence for an easy model (sparse connectivity, weak interactions) and a hard model (dense connectivity, strong interactions); standard residual BP vs. our work vs. our work + anytime; faster is better]
Much shorter initialization
67
Experiments – multiquery
[Plots: convergence for an easy model (sparse connectivity, weak interactions) and a hard model (dense connectivity, strong interactions); standard residual BP vs. our work vs. our work + anytime; faster is better]
68
Conclusions
Weighting edges is a simple and effective way to improve prioritization
We introduce a principled notion of edge importance based on both structure and parameters of the model
Robust speedups in the query-specific setting
Don't spend computation on nuisance variables unless needed for the query marginal
Deferring BP initialization has a large impact
69
Thesis contributions
Learn accurate and tractable models
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
70
Future work
More practical JT learning
SAT solvers to construct structure, pruning heuristics, …
Evidence-specific learning
Trade efficiency for accuracy
Max-margin evidence-specific models
Theory on ES structures too
Inference:
Beyond query-specific: better prioritization in general
Beyond BP: query-specific Gibbs sampling?
71
Thesis conclusions
Graphical models are a regularization technique for high-dimensional distributions
Representation-based structure is well understood
Conditional independencies
Right now, structured computation is a “consequence” of representation
Major issues with tractability, approximation quality
Logical next step: structured computation as a primary basis of regularization
This thesis: computation-centric approaches have better efficiency and do not sacrifice accuracy
72
Thank you!
Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
73
Mutual info upper bound: quality
Upper bound: suppose an ε-JT exists, and δ is the largest mutual information over small subsets. Then I(A, B | C) ≤ |ABC| (δ + ε).
No need to know the ε-JT, only that it exists
No connection between C and the JT separators
C can be of any size, no connection to the JT treewidth
The bound is loose only when there is no hope to learn a good JT
74
Typical graphical models workflow
Learn/construct structure: a reasonable intractable structure from domain knowledge (the graph is primarily a representation tool)
Learn/define parameters
Inference: approximate algorithms, no quality guarantees
Result: approximate P(Q | E=e)
75
Contributions – tractable models
Learn accurate and tractable models
In the generative setting [NIPS 2007]:
Polynomial-time conditional mutual information upper bound
First PAC-learning result for strongly connected junction trees
Graceful degradation guarantees
Speedup heuristics
In the discriminative setting [NIPS 2010]:
General framework for learning CRF structure that depends on evidence values at test time
Extensions to the relational setting
Empirical: order of magnitude speedups with the same accuracy as high-treewidth models
76
Contributions – faster inference
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
A framework of importance-weighted residual belief propagation
A principled measure of the eventual impact of an edge update on the query belief
Prioritize updates by importance for the query instead of absolute magnitude
An anytime modification to defer much of the initialization
Initial inference results available much sooner
Often much faster eventual convergence
The same fixed points as the full model
77
Future work
Two main bottlenecks:
Constructing JTs given mutual information values, esp. with non-uniform treewidth and dependence strength
Large sample: learnability guarantees for non-uniform treewidth
Small sample: non-uniform treewidth for regularization
Constraint satisfaction, SAT solvers, etc.?
Relax the strong connectivity requirement?
Evaluating mutual information: need to look at 2k+1 variables instead of k+1, a large penalty
Branch on features instead of sets of variables? [Gogate+al:2010]
Speedups without guarantees
Local search, greedy separator construction, …
78
Log-linear parameter learning
Conditional log-likelihood:
LLH(w | D) = Σ_{(Q,E) ∈ D} log P(Q | E, w)
Convex optimization: unique global maximum
Gradient: features − [expected features]
∇_w log P(Q | E, w) = f(Q, E) − E_{Q′ ~ P(Q′ | E, w)} [ f(Q′, E) ]
Need inference for every E given w
79
Log-linear parameter learning
              Generative (E = ∅)             Discriminative
Tractable     closed-form                    exact gradient-based
Intractable   approximate gradient-based     approximate gradient-based
              (no guarantees)                (no guarantees)
Generative: inference once per weights update
Discriminative: inference for every datapoint (Q, E) once per weights update, a “manageable” slowdown by the number of datapoints
Complexity “phase transition”
80
Plug in generative structure learning
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_i w_i f_i(Q, E) I_i(E, u) )
I(E, u) encodes the output of the chosen structure learning algorithm:
Chow-Liu for optimal trees
Our thin junction tree learning from part 1
Karger-Srebro for high-quality low-diameter junction trees
Local search, etc.
Fix the algorithm to always get structures with the desired properties (e.g. treewidth):
replace the marginals P(Q_β) with approximate conditionals P(Q_β | E=E, u) everywhere
81
Evidence-specific CRF learning: weights
P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_i w_i f_i(Q, E) I_i(E, u) )
Already know the algorithm behind I(E, u); already learned u
Only need to learn w
The structure induced by I(E, u) is always tractable
Can find the evidence-specific structure I(E=E, u) for every training datapoint (Q, E)
So the optimal w can be learned exactly:
∇_w log P(Q | E, w, u) = I(E, u) ∘ ( f(Q, E) − E_{Q′ ~ P(Q′ | E, w, u)} [ f(Q′, E) ] )
(a tree-structured distribution, so the expectation is exact)
82
Relational evidence-specific CRF
Relational models: templated features + shared weights
[Figure: the LinksTo(webpage, webpage) relation; every grounding gets a copy of the single learned weight wLINK]
83
Relational evidence-specific CRF
Relational models: templated features + shared weights
Every grounding is a separate datapoint for structure training
Use the propositional approach + shared weights
[Figure: a grounded model over x1, …, x5; each pair of groundings (x1x2, x1x3, …, x4x5) becomes a training dataset for the “structural” parameters u]
84
Future work
Faster learning: pseudolikelihood is really fast, need to compete
Larger treewidth: trade time for accuracy
Theory on learning “structural parameters” u
Max-margin learning
Inference is a basic step in max-margin learning too, so tractable models are useful beyond log-likelihood
Optimizing feature weights w given local trees is straightforward
Optimizing “structural parameters” u for max-margin is hard
What is the right objective?
Almost tractable structures, other tractable models
Make sure loops don't hurt too much
85
Query versus nuisance variables
We may actually care about only a few variables:
What are the topics of the webpages on the first page of Google search results for my query?
Smart heating control: is anybody going to be at home for the next hour?
Does the patient need immediate doctor attention?
But the model may need a lot of other variables to be accurate enough
Don't care about them per se, but necessary to look at to get the query right
Both query and nuisance variables are unknown; inference algorithms don't see a difference
Speed up inference by focusing on the query
Only look at nuisance variables to the extent needed to answer the query
86
Our contributions
Using weighted residuals to prioritize updates
Define message weights reflecting the importance of the message to the query
Computing importance weights efficiently
Experiments: faster convergence on large relational models
87
Interleaving
Dijkstra's algorithm expands the highest-weight edges first, so any not-yet-expanded edge has A ≤ (min A over expanded edges)
Suppose M ≥ max over all edges of || m_ij^(NEW) − m_ij^(OLD) ||
Then M · (min A over expanded edges) is an upper bound on the priority of any not-yet-expanded edge
If the actual priority of the best expanded edge, max over expanded edges of A_ij · || m_ij^(NEW) − m_ij^(OLD) ||, reaches this upper bound, there is no need to expand further at this point
[Figure: edges expanded on the previous iteration, the edge just expanded, and edges not yet expanded]
88
Deferring BP initialization
Observation: Dijkstra’s alg. expands the most important edges first
Do we really need to look at every low-importance edge before applying BP updates?
No! Can use upper bounds on priority instead.
89
Upper bounds in priority queue
Observation: for edges low in the priority queue, an upper bound on the priority is enough
[Figure: the updates priority queue; exact priority is needed for the top element, a priority upper bound is enough lower down]
90
Priority upper bound for not-yet-seen edges
priority(edge) = residual(edge) × importance weight(edge)
residual(edge) ≤ || factor(edge) ||
importance weight(edge) ≤ importance weight of any already-expanded edge
So a component-wise upper bound on the priority is available without looking at the edge!
Expand several edges with Dijkstra's: for those, (residual) × (weight) = exact priority
For all the other edges, use the upper bound
91
Interleaving BP and Dijkstra's
Alternate Dijkstra expansions and BP updates: Dijkstra, BP, Dijkstra, Dijkstra, BP, BP, …
If the exact priority of the best expanded edge exceeds the upper bound for unexpanded edges: apply a BP update
If the exact priority is below the upper bound: let Dijkstra's expand another edge
[Figure: the expanded region growing from the query toward the full model]