plenary - willsky
TRANSCRIPT
-
Learning and Inference in Graphical and Hierarchical Models: A Personal Journey
Alan S. Willsky
http://lids.mit.edu
http://ssg.mit.edu
July 2012
-
Undirected graphical models: G = (V, E) (V = vertices; E ⊂ V × V = edges); Markovianity on G
Hammersley-Clifford (NASC for positive distributions):
  p(x) = (1/Z) ∏_{c ∈ C} ψ_c(x_c)
  C = set of all cliques in G; ψ_c = clique potential; Z = partition function
Pairwise graphical models:
  p(x) ∝ ∏_{s ∈ V} ψ_s(x_s) ∏_{(s,t) ∈ E} ψ_st(x_s, x_t)
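As a concrete illustration (a toy example of my own, not from the talk), the sketch below evaluates a pairwise model on a 3-node binary chain by brute force, making the clique potentials and the partition function Z explicit:

```python
import itertools
import numpy as np

V = [0, 1, 2]
E = [(0, 1), (1, 2)]                                           # a 3-node chain
node_pot = {s: np.array([1.0, 2.0]) for s in V}                # psi_s(x_s)
edge_pot = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in E}  # psi_st(x_s, x_t)

def unnormalized(x):
    """Product of all clique potentials for one joint configuration x."""
    p = 1.0
    for s in V:
        p *= node_pot[s][x[s]]
    for (s, t) in E:
        p *= edge_pot[(s, t)][x[s], x[t]]
    return p

states = list(itertools.product([0, 1], repeat=len(V)))
Z = sum(unnormalized(x) for x in states)              # partition function
probs = {x: unnormalized(x) / Z for x in states}
print(Z, sum(probs.values()))                         # sanity check: probabilities sum to 1
```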
-
Directing/undirecting the graph
Undirected models: Factors are not typically probabilities
  Although they can be for cycle-free graphs (trees)
Directed models: Specified in terms of transition probabilities (parents to children)
Directed to undirected: Easy (after moralization)
Undirected to directed: Hard (and often a mess), unless the graph is a tree
-
Gaussian models: X ~ N(μ, P) or N⁻¹(h, J), with J = P⁻¹ and h = Jμ
The sparsity structure of J determines graph structure
  I.e., J_st = 0 if (s, t) is not an edge
Directed model (0-mean for simplicity): AX = w, w ~ N(0, I), A lower triangular, A = J^{1/2}
In general we get lots of fill, unless the structure is a tree
And it's even more complicated for non-Gaussian models: higher-order cliques are introduced
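A small numerical sketch (my own example, with assumed numbers) of the fill-in point: the triangular factor of a chain-structured J stays sparse, while closing a single loop produces fill:

```python
import numpy as np

# J for a 5-node Gauss-Markov chain: J_st = 0 whenever (s, t) is not an edge
J = (np.diag(np.full(5, 2.0))
     + np.diag(np.full(4, -0.5), 1)
     + np.diag(np.full(4, -0.5), -1))
A = np.linalg.cholesky(J)                      # a lower-triangular square root of J
print(np.count_nonzero(np.round(A, 12)))       # 9 nonzeros: no fill on a tree

# Close a loop with edge (0, 4): the factor of the loopy model fills in
J_loop = J.copy()
J_loop[0, 4] = J_loop[4, 0] = -0.3
A_loop = np.linalg.cholesky(J_loop)
print(np.count_nonzero(np.round(A_loop, 12)))  # 12 nonzeros: fill-in from the loop
```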
-
Belief Propagation: Message passing for pairwise models on trees
Fixed-point equations for likelihoods from disjoint parts of the graph:
  m_{t→s}(x_s) ∝ Σ_{x_t} ψ_st(x_s, x_t) ψ_t(x_t) ∏_{u ∈ N(t)\s} m_{u→t}(x_t)
-
BP on trees
Gives a factored form for the distribution in terms of pairwise and single-node marginal distributions
Great flexibility in message scheduling:
  Leaves-to-root, then root-to-leaves = Rauch-Tung-Striebel
  Completely parallel messaging: convergence in a number of steps equal to the diameter of the graph
What do we do if we don't have a tree? Turn it into one via junction trees
  Not usually a good idea. OK, then what?
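The sketch below (my own toy example, matching the chain above) runs sum-product BP with one leaves-to-root and one root-to-leaves sweep; on a tree this recovers the exact marginals:

```python
import numpy as np

# potentials for the 3-node binary chain 0 - 1 - 2
psi_node = [np.array([1.0, 2.0]) for _ in range(3)]
psi_edge = np.array([[2.0, 1.0], [1.0, 2.0]])        # shared, symmetric psi_st
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# leaves-to-root (root = 2), then root-to-leaves: diameter-many sweeps suffice
m = {}
for (t, s) in [(0, 1), (1, 2), (2, 1), (1, 0)]:      # message t -> s
    incoming = np.ones(2)
    for u in neighbors[t]:
        if u != s:
            incoming *= m[(u, t)]
    msg = psi_edge @ (psi_node[t] * incoming)        # sum over x_t
    m[(t, s)] = msg / msg.sum()

def marginal(s):
    b = psi_node[s].copy()
    for t in neighbors[s]:
        b *= m[(t, s)]
    return b / b.sum()

print([marginal(s) for s in range(3)])               # exact tree marginals
```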
-
Modeling of structure
Four questions:
  What should the structure of the graph be?
  Which variables go where?
  What about adding hidden (unobserved) nodes (and variables)?
  What should the dimensions of the various variables be?
-
Our initial foray (with thanks to Michèle Basseville and Albert Benveniste)
What can be multiresolution?
  The phenomenon being modeled
  The data that are collected
  The objectives of statistical inference
  The algorithms that result
Some applications that motivated us (and others):
  Oceanography
  Groundwater hydrology
  Fractal priors in regularization formulations (computer vision, mathematical physics, …)
  Texture discrimination
  Helioseismology (?)
-
Specifying MR models
MR synthesis leads, as with Markov chains, to thinking about directed trees:
  E.g.: Hidden Markov trees; midpoint deflection
Note that the dimension of the variables comes into play
But let's assume we pre-specify the tree structure
  This looks closer to what we know how to handle
-
A control theorist's idea
Internal models: Variables at coarser nodes are linear functionals of finer-scale variables
  Some of these may be specified, measured, or imposed; the rest are to be designed
  To approximate the condition for tree-Markovianity (this emphasizes the state-like nature of variables in the tree)
  To yield a model that is close to the true fine-scale model
Scale-recursive algebraic design
  Criterion used: canonical correlations or predictive efficiency
  Alternate using midpoint deflection or wavelets
Confounding the control theorist: Internal models need not have minimal state dimension
-
So, what if we want a tree, but not necessarily a hierarchy?
One approach: Construct the maximum-likelihood tree given sample data (or the full second-order statistics)
NOTE: This is quite different from what system theorists typically consider
  There is no state to identify: all of the variables we desire are observed
  It is the index set that we get to play with
Chow-Liu found a very efficient algorithm (sketched below):
  Form a graph with each observed variable at a different node
  Edge weight between any two variables is their mutual information
  Compute the max-weight spanning tree
What if we want hidden/latent states?
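A minimal sketch (my own, with synthetic Gaussian data): for jointly Gaussian variables the pairwise mutual information is −½ log(1 − ρ²), so Chow-Liu reduces to a max-weight spanning tree under those weights:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 4))
X[:, 1] += X[:, 0]                 # chain dependence 0 -> 1 -> 2 -> 3
X[:, 2] += X[:, 1]
X[:, 3] += X[:, 2]

rho = np.corrcoef(X, rowvar=False)
mi = -0.5 * np.log(1.0 - rho**2 + np.eye(4))     # Gaussian MI; +I zeroes the diagonal
tree = minimum_spanning_tree(-mi)                # max-weight tree == MST of -MI
print(np.transpose(np.nonzero(tree.toarray())))  # recovered edges: the chain 0-1-2-3
```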
-
Reconstruction of a Latent Tree
Reconstruct a latent tree using exact statistics (first) or samples (to follow) of the observed variables only
Exact/consistent recovery of minimal latent trees:
  Each hidden node has at least three neighbors
  Observed variables are neither fully dependent nor independent
Other objectives:
  Computational efficiency
  Low sample complexity
-
Information Distance
Gaussian distributions: d_ij = −log |ρ_ij| (ρ_ij = correlation coefficient)
Discrete distributions: d_ij = −log ( |det P_ij| / √(det M_i · det M_j) ), where P_ij is the joint probability matrix and M_i is the diagonal matrix of marginal probabilities
Additivity on trees: d_kl = Σ_{(i,j) ∈ Path(k,l)} d_ij
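A one-line numerical check (my own) of the additivity property on a 3-node Gaussian chain, using the fact that correlations multiply along tree paths:

```python
import numpy as np

rho_12, rho_23 = 0.8, 0.6          # correlations on the two edges of a chain
rho_13 = rho_12 * rho_23           # on a tree, correlations multiply along paths
d = lambda rho: -np.log(abs(rho))  # Gaussian information distance
print(np.isclose(d(rho_13), d(rho_12) + d(rho_23)))   # True: additivity holds
```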
-
Testing Node Relationships
Define Φ_ijk = d_ik − d_jk.
Node j a leaf node, node i its parent: Φ_ijk = −d_ij for all k ≠ i, j
  Can identify (parent, leaf child) pairs
Nodes i and j leaf nodes with the same parent (sibling nodes): Φ_ijk constant, with |Φ_ijk| < d_ij, for all k ≠ i, j
  Can identify leaf-sibling pairs
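A toy check (my own construction) of the sibling test on exact distances for a latent star: a hidden node h with observed leaves 1-4. Along tree paths d_ij = d_ih + d_hj, so Φ_ijk is constant over witnesses k:

```python
# exact distances for a latent star: hidden h with observed leaves 1..4
d_h = {1: 0.5, 2: 0.7, 3: 0.9, 4: 1.1}                  # edge distances to h
d = {(i, j): d_h[i] + d_h[j] for i in d_h for j in d_h if i != j}

phi = lambda i, j, k: d[(i, k)] - d[(j, k)]             # Phi_ijk = d_ik - d_jk

print(phi(1, 2, 3), phi(1, 2, 4))     # -0.2, -0.2: constant over witnesses k
print(abs(phi(1, 2, 3)) < d[(1, 2)])  # True: |Phi| < d_12, so 1 and 2 are leaf siblings
```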
-
Recursive Grouping (exact statistics)
Step 1. Compute d_ij for all observed node pairs.
Step 2. Identify (parent, leaf child) or (leaf sibling) groups.
Step 3. Introduce a hidden parent node for each sibling group without a parent.
Step 4. Compute the information distances for the new hidden nodes. E.g., for a new parent h of siblings i and j: d_ih = ½ (d_ij + Φ_ijk).
Step 5. Remove the identified child nodes and repeat.
-
Recursive Grouping
Identifies a group of family nodes at each step
Introduces hidden nodes recursively
Correctly recovers all minimal latent trees
Computational complexity: O(diam(T) · m³) worst case
-
Chow-Liu Tree
Minimum spanning tree using D as edge weights:
  V = set of observed nodes
  D = information distances
Computational complexity: O(m² log m)
-
Surrogate Nodes and the Chow-Liu Tree
V = set of observed nodes
Surrogate node of i: Sg(i) = the observed node closest to i in information distance
If (i, j) is an edge in the latent tree, then (Sg(i), Sg(j)) is an edge in the Chow-Liu tree MST(V; D) (or Sg(i) = Sg(j))
-
CLGrouping Algorithm
Step 1. Using information distances of observed nodes, construct MST(V; D). Identify the set of internal nodes.
Step 2. Select an internal node and its neighbors, and apply the recursive-grouping (RG) algorithm.
Step 3. Replace the sub-tree spanning the neighborhood with the output of RG.
Repeat Steps 2-3 until all internal nodes have been processed.
Computational complexity:
  O(m² log m + (#internal nodes) · (max degree)³)
-
Sample-based Algorithms
Compute the ML estimates of the information distances
Relaxed constraints for testing node relationships
Consistent (only in structure for discrete distributions)
Regularized CLGrouping for learning latent tree approximations
-
The dark side of trees
Bright side: No loops. So, what do we do?
Try #1: Turn it into a junction tree
  Variable dimension/cardinality can be prohibitive
  And learning/finding good models of this type can be hard
Try #2: Pretend the problem isn't there and use the tree anyway
  If the real objectives are at coarse scales, then fine-scale artifacts may not matter
Try #3: Pretend it's a tree and use (Loopy) BP
Try #4: Think!
  What does LBP do? Better algorithms? Other graphs for which inference is scalable?
-
Borrowing/adapting a first idea from the numerical solution of PDEs
Nested dissection:
  Cut the PDE graph via a nested set of separators
  Solve the problem from small to large
  Top-down has a problem: high-dimensional variables for large separators
Recursive cavity models:
  Start from the bottom rather than the top of the dissection hierarchy
  Construct approximate models (i.e., inverses) recursively
  Do this by variable elimination and model thinning:
    Variable elimination introduces new edges due to fill
    Remove edges so that the thinned model is optimally close to the unthinned model (convex optimization)
-
RCM in pictures
  Cavity thinning
  Collision
  Reversing the process (bring your own …)
-
RCM in action
This is the information form of RTS, with a thinning approximation at each stage
How do the thinning errors propagate? A control-theoretic stability question
We can iteratively refine estimates using Richardson iterations
-
Walk-sums, BP, and efficient message scheduling for Gaussian models
Information form: p(x) ∝ exp(−½ xᵀJx + hᵀx)
Inference: Go from (h, J) to means and variances: μ = J⁻¹h, P = J⁻¹
Assume J normalized to have unit diagonal: J = I − R
R is the matrix of partial correlation coefficients, with sparsity pattern identical to J
(J⁻¹)_st corresponds to a sum over weighted walks from s to t in the graph
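A small numerical sketch (my own numbers): in a walk-summable model, i.e., spectral radius of |R| below 1, the covariance (I − R)⁻¹ equals the absolutely convergent series Σ_k Rᵏ, whose entries are sums of weighted walks:

```python
import numpy as np

R = np.array([[0.0, 0.4, 0.0],
              [0.4, 0.0, 0.4],
              [0.0, 0.4, 0.0]])                  # partial correlations on a chain
assert np.max(np.abs(np.linalg.eigvalsh(np.abs(R)))) < 1   # walk-summability

J = np.eye(3) - R
series = sum(np.linalg.matrix_power(R, k) for k in range(200))   # sum_k R^k
print(np.allclose(series, np.linalg.inv(J)))     # True: walk-sums give J^{-1}
```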
-
Walk-summable models
Inference algorithms may compute walks in different orders (different message schedules)
Would like convergence regardless of order: the condition of walk-summability
For BP:
  Walk-summability guarantees convergence
  If BP converges, it collects all walks for the means μ_i, but only the back-tracking self-return walks required for the variances P_ii
The computation tree provides insight about what can happen for non-WS models
-
A computation tree
BP includes the walk (1,2,3,2,1); BP does not include the walk (1,2,3,1)
Note that back-tracking walks traverse each edge an even number of times
  This implies that the sign of the partial correlations has no effect on variance computations
BUT: For non-WS models, the computation tree may not correspond to a valid model
-
A simple example
-
Preconditioned Richardson iterations
Stationary: x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ + S⁻¹(h − J x⁽ᵏ⁾)
  Efficient if S corresponds to some tractable subgraph
Non-stationary variations: Gauss-Seidel-like iterations, where we fix some variables and update others
For WS models: We get exact answers if all walks are eventually collected
Embedded Trees: an adaptive max-weight spanning tree algorithm chooses the tree structure to use at each iteration to reduce the equation error by collecting the best additional walks
  Can provide exponential speed-up in many cases
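A minimal sketch (my own construction): a stationary preconditioned Richardson iteration for a walk-summable model, with S taken as the tridiagonal (chain) part of J as a stand-in for an embedded spanning tree:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
mask = np.triu(rng.random((n, n)) < 0.4, 1)      # random sparse edge set
R = 0.1 * (mask + mask.T)                        # small partial correlations
J = np.eye(n) - R                                # walk-summable by construction
h = rng.standard_normal(n)

S = np.triu(np.tril(J, 1), -1)   # tridiagonal "embedded chain" part of J
x = np.zeros(n)
for _ in range(100):
    # each solve would cost O(n) with tree-structured elimination
    x = x + np.linalg.solve(S, h - J @ x)
print(np.allclose(x, np.linalg.solve(J, h)))     # True: exact means recovered
```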
-
An alternate approach: Use a (Pseudo-) Feedback Vertex Set (FVS)
1. Provide additional potentials to allow computation of quantities needed in mean/variance/covariance computations
2. Run BP with both the original potentials and the added ones
3. Feed back information to the FVS to allow computation of the variance and mean within the FVS
4. Send modified information potentials to the neighbors of the FVS
5. Run BP with the modified information potentials:
   Yields exact means immediately
   Combining with the results from Steps 2, 3 yields exact variances
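In matrix form, the procedure amounts to block elimination of the FVS, as in this sketch (my own, hypothetical example; the dense solves on T below stand in for the BP runs of Steps 2 and 5, which is where the savings come from when T is a tree):

```python
import numpy as np

# J, h for a small Gaussian model; F = {0} is the (pseudo-)FVS, T = the rest
rng = np.random.default_rng(2)
n, F, T = 6, [0], list(range(1, 6))
R = 0.1 * rng.random((n, n)); R = np.triu(R, 1); R = R + R.T
J = np.eye(n) - R
h = rng.standard_normal(n)

JT, JTF, JF = J[np.ix_(T, T)], J[np.ix_(T, F)], J[np.ix_(F, F)]
y0 = np.linalg.solve(JT, h[T])        # one BP run on T (original potentials)
Y = np.linalg.solve(JT, JTF)          # k more runs, one per added potential
J_hat = JF - JTF.T @ Y                # feedback: k x k system on the FVS
mu_F = np.linalg.solve(J_hat, h[F] - JTF.T @ y0)   # exact mean on F
mu_T = y0 - Y @ mu_F                  # rerun on T with modified potentials
mu = np.zeros(n); mu[F] = mu_F; mu[T] = mu_T
print(np.allclose(mu, np.linalg.solve(J, h)))      # True: exact means
```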
-
Approximate FVS
Complexity is O(k²n), where k = |F|
If k is too large:
  Use a pseudo- (i.e., partial) FVS, breaking only some of the loops
  On the remaining graph, run LBP or some other algorithm
Assuming convergence (which does not require walk-summability):
  Always get the correct means and variances on F, exact means on T, and (for LBP) approximate variances on T
  The approximate variances collect more walks than LBP alone
Local (fast) method for choosing nodes for the pseudo-FVS:
  Enhance convergence
  Collect the most important walks
Theoretical and empirical evidence show k = O(log n) suffices
-
Motivation from PDEs: multipole models
Motivation from methods for efficient preconditioners for PDEs:
  The influence of variables at a distance is well approximated by a coarser approximation
  We then only need to do LOCAL smoothing/correction
The idea for statistical models:
  Pyramidal structure in scale
  However, when conditioned on neighboring scales, the remaining correlation structure at a given scale is sparse and local
-
Models on graphs and conjugate graphs
Garden-variety graphical model: sparse inverse covariance
Conjugate models: sparse covariance
-
Sparse In-scale Conditional Covariance Multiresolution Model (SIM Model)
Conditioned on the scales above and below (e.g., scale 2 given scales 1 and 3), the variables within a scale have a sparse in-scale conditional covariance
Learned via dual optimization
-
Multipole Estimation
Richardson iteration to solve (J_h + (Σ_c)⁻¹) x = h:
  Global tree-based inference: J_h x_new = h − (Σ_c)⁻¹ x_old
    Compute the last term via the sparse solve Σ_c z = x_old
  Sparse matrix multiplication for in-scale correction: x_new = Σ_c (h − J_h x_old)
-
Stock Returns Example
Monthly returns of 84 companies in the S&P 100 index (19…)
Hierarchy based on the Standard Industrial Classification system: market, 6 divisions, 26 industries, and 84 individual companies
Conjugate edges find strong residual correlations:
  Oil service companies (Schlumberger, …) and oil companies
  Computer companies, software companies, electrical …
-
What If Some Phenomena Are Hidden?
Hidden variables: hedge fund i…, patent acceptance, geopolitical factors, regulatory issues
Many dependencies among observed variables
Less concise model
[Figure: dependency graphs among auto makers (Ford, GM, Chrysler) and airlines (JetBlue, United, Delta, American, Continental), with and without the hidden variable]
-
Graphical Models With Hidden Variables: Sparse Modeling Meets PCA
Marginal concentration matrix = sparse + low-rank
  (From the Schur complement J̃_O = J_O − J_{O,H} J_H⁻¹ J_{H,O}: a sparse term plus a term whose rank is at most the number of hidden variables)
-
Convex Optimization for Model Selection
Samples of the observed variables give Σ̂_O:
  (Ŝ, L̂) = argmin_{S, L} −log det(S − L) + tr(Σ̂_O (S − L)) + λ (γ ‖S‖₁ + tr(L)), subject to S − L ≻ 0, L ⪰ 0
  S sparse, L low-rank
The last two terms provide convex regularization for sparsity in S and low rank in L
Weights allow a tradeoff between sparsity and rank
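A hedged sketch of this program using cvxpy (my choice of tooling; the talk does not specify an implementation), with placeholder random data and assumed weight values:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.cov(rng.standard_normal((500, 6)), rowvar=False)  # placeholder data

n = Sigma.shape[0]
S = cp.Variable((n, n), symmetric=True)   # sparse part (observed interactions)
L = cp.Variable((n, n), PSD=True)         # low-rank part (hidden-variable effect)
lam, gamma = 0.2, 1.0                     # assumed tradeoff weights

obj = (-cp.log_det(S - L) + cp.trace(Sigma @ (S - L))
       + lam * (gamma * cp.sum(cp.abs(S)) + cp.trace(L)))
prob = cp.Problem(cp.Minimize(obj), [S - L >> 0])
prob.solve()
print("rank of L:", np.linalg.matrix_rank(L.value, tol=1e-6))
```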
-
When does this work? Identifiability conditions:
  The sparse part can't be low rank, and the low-rank part can't be sparse
There are precise conditions (including on the regularization weights) that guarantee:
  Exact recovery if exact statistics are given
  Consistency results if samples are available (with high-probability scaling laws on problem dimension and available sample size)
-
On the way: Constructing richer classes of models for which inference is tractable
We have:
  Methods to learn hidden trees with fixed structure but unknown variable dimensions
  Methods to learn hidden trees with unknown structure but fixed variable dimension
Can we do both at the same time?
  Certainly ill-posed: Can convex optimization save the day?
We have a method for recovering sparse plus low-rank
  How about tree plus low-rank? This would yield models with small FVSs
How about building the desire for tractable models into the learning objective function?
-
M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky, "Tree-Based Reparametrization Framework for Approximate Estimation on Graphs with Cycles," IEEE Trans. on Information Theory, Vol. 49, No. 5, May 2003, pp. 1120-1146.
E. Sudderth, M.J. Wainwright, and A.S. Willsky, "Embedded Trees: Estimation of Gaussian Processes on Graphs with Cycles," IEEE Transactions on Signal Processing, Vol. 52, No. 11, Nov. 2004, pp. 3136-3150.
D.M. Malioutov, J. Johnson, and A.S. Willsky, "Walk-Sums and Belief Propagation in Gaussian Graphical Models," Journal of Machine Learning Research, Vol. 7, Oct. 2006, pp. 2031-2064.
V. Chandrasekaran, J. Johnson, and A.S. Willsky, "Estimation in Gaussian Graphical Models Using Tractable Subgraphs: A Walk-Sum Analysis," IEEE Transactions on Signal Processing, Vol. 56, No. 5, May 2008, pp. 1916-1930.
M.J. Choi, V. Chandrasekaran, and A.S. Willsky, "Gaussian Multiresolution Models: Exploiting Sparse Markov and Covariance Structure," IEEE Trans. on Signal Processing, Vol. 58, March 2010, pp. 1012-1024.
Y. Liu, V. Chandrasekaran, A. Anandkumar, and A.S. Willsky, "Feedback Message Passing for Inference in Gaussian Graphical Models," accepted for publication in IEEE Trans. on Signal Processing.