Plenary - Willsky


TRANSCRIPT

Slide 1/46

Learning and Inference in Graphical and Hierarchical Models: A Personal Journey

    Alan S. Willsky

    [email protected]

    http://lids.mit.edu

    http://ssg.mit.edu

    July 2012

Slide 2/46

Undirected graphical models: G = (V, E), with V = vertices and E ⊆ V × V = edges; Markovianity on G

Hammersley-Clifford theorem (necessary and sufficient condition for positive distributions): C = set of all cliques in G

ψ_c = clique potential, Z = partition function

Pairwise graphical models
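The factorization these definitions refer to did not survive extraction; a standard statement, in the notation above (the pairwise case is the special form used in the BP slides that follow):

```latex
% Hammersley-Clifford factorization over the set of cliques C of G
p(x) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(x_c)

% Pairwise graphical model: only node and edge potentials
p(x) = \frac{1}{Z} \prod_{s \in V} \psi_s(x_s) \prod_{(s,t) \in E} \psi_{st}(x_s, x_t)
```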

Slide 3/46

Directing/undirecting the graph

Undirected models: factors are not typically probabilities, although they can be for cycle-free graphs

Directed models: specified in terms of transition probabilities (parents to children)

Directed to undirected: easy (after moralization)

Undirected to directed: hard (and often a mess) unless the graph is a tree

Slide 4/46

Gaussian models: X ~ N(μ, P) or N⁻¹(h, J), with J = P⁻¹ and h = Jμ

The sparsity structure of J determines the graph structure, i.e., J_st = 0 if (s, t) is not an edge

Directed model (0-mean for simplicity):

AX = W, with W ~ N(0, I) and A lower triangular

A = J^(1/2): in general we get lots of fill, unless the structure is a tree

And it's even more complicated for non-Gaussian models, where higher-order cliques are introduced
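For the BP-scheduling and walk-sum slides later, it helps to have the information form written out; this is the standard identity implied by the definitions above:

```latex
% Information (canonical) form of a Gaussian model
p(x) \propto \exp\!\left( -\tfrac{1}{2} x^\top J x + h^\top x \right),
\qquad J = P^{-1}, \quad h = J\mu

% Inference = recovering the moment parameters from (h, J)
\mu = J^{-1} h, \qquad P = J^{-1}
```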

Slide 5/46

Belief Propagation: message passing

for pairwise models on trees

Fixed-point equations for likelihoods from disjoint parts of the graph:
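The slide's equations did not survive extraction; the standard sum-product fixed-point equations for a pairwise model are (sums become integrals for continuous variables; the slide's own notation may differ):

```latex
% Message from node t to neighbor s
m_{t \to s}(x_s) \propto \sum_{x_t} \psi_{st}(x_s, x_t)\, \psi_t(x_t)
  \prod_{u \in N(t) \setminus \{s\}} m_{u \to t}(x_t)

% Marginal at node s from incoming messages
p(x_s) \propto \psi_s(x_s) \prod_{t \in N(s)} m_{t \to s}(x_s)
```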

Slide 6/46

BP on trees gives a factored form for the distribution in terms of node and pairwise marginal

distributions

Great flexibility in message scheduling: leaves-root-leaves = Rauch-Tung-Striebel; completely parallel messaging converges in a number of steps equal to the

diameter of the graph

What do we do if we don't have a tree? Turn it into one via junction trees

Not usually a good idea. OK, then what?

Slide 7/46

    Modeling of structure

Four questions: What should the structure of the graph be? Which variables go where? What about adding hidden (unobserved)

nodes (and variables)?

What should the dimensions of the various variables be?

Slide 8/46

Our initial foray (with thanks to Michèle Basseville and Albert Benveniste)

What can be multiresolution?

The phenomenon being modeled, the data that are collected, the objectives of statistical inference, the algorithms that result

Some applications that motivated us (and others): oceanography, groundwater hydrology, fractal priors in regularization formulations (vision, mathematical physics, ...),

texture discrimination, helioseismology (?)

Slide 9/46

Specifying MR models: MR synthesis leads, as with Markov chains, to thinking

about directed trees:

E.g.: hidden Markov trees, midpoint deflection. Note that the dimension of the variables comes into play, but let's assume we pre-specify the tree structure

This looks closer to what we know how to handle

Slide 10/46

A control theorist's idea

Internal models: variables at coarser nodes are linear functionals of finer-

scale variables**

Some of these may be specified, measured, or imposed; the rest are to be designed

To approximate the condition for tree-Markovianity (this emphasizes the state-like nature of variables in these models)

To yield a model that is close to the true fine-scale statistics: scale-recursive algebraic design**

Criterion used: canonical correlations or predictive efficiency

Alternatives: midpoint deflection or wavelets**. Confounding the control theorist:

Internal models need not have minimal state dimension

Slide 11/46

So, what if we want a tree model,

not necessarily a hierarchical one? One approach: construct the maximum likelihood tree

given sample data (or the full second-order statistics)

NOTE: This is quite different from what system theorists typically consider. There is no state to identify: all of the variables we

desire are observed

It is the index set that we get to play with. Chow and Liu found a very efficient algorithm:

Form a graph with each observed variable at a different node; the edge weight between any two variables is their mutual information; compute the max-weight spanning tree (a minimal sketch follows)
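A minimal sketch of that procedure for the jointly Gaussian case. The function name, the use of the Gaussian pairwise mutual-information formula I(s,t) = -1/2 log(1 - ρ_st²), and the plain Kruskal construction are illustrative assumptions, not the talk's own code:

```python
import numpy as np

def chow_liu_tree(samples):
    """Chow-Liu: max-weight spanning tree with mutual-information edge weights.

    `samples` is an (n, m) array: n observations of m jointly Gaussian
    variables.  Returns the learned tree as a list of (s, t) edges.
    """
    m = samples.shape[1]
    rho = np.corrcoef(samples, rowvar=False)

    def mutual_info(s, t):
        # Pairwise mutual information of two jointly Gaussian variables.
        return -0.5 * np.log(1.0 - rho[s, t] ** 2)

    # All candidate edges, most informative first.
    edges = sorted(
        ((s, t) for s in range(m) for t in range(s + 1, m)),
        key=lambda e: mutual_info(*e),
        reverse=True,
    )

    # Kruskal with a small union-find: greedily keep edges joining components.
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for s, t in edges:
        rs, rt = find(s), find(t)
        if rs != rt:
            parent[rs] = rt
            tree.append((s, t))
    return tree
```

For discrete variables one would swap in the empirical mutual information; the greedy max-weight spanning-tree step is unchanged.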

    What if we want hidden/latent states?

Slide 12/46

Reconstruction of a Latent Tree

Reconstruct a latent tree using exact statistics (first) or samples (to follow) of the observed variables only

Exact/consistent recovery of minimal latent trees: each hidden node has at least three neighbors; observed variables are neither perfectly dependent nor independent

Other objectives: computational efficiency, low sample complexity

Slide 13/46

Information distance: Gaussian distributions

Discrete distributions: joint probability matrix, marginal probabilities

    Additivity
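The definitions behind these bullets (the slide's formulas did not survive extraction; these are the standard ones used in this line of work, stated for scalar variables):

```latex
% Gaussian: information distance from the correlation coefficient
d_{ij} = -\log \lvert \rho_{ij} \rvert

% Discrete: J^{ij} = joint probability matrix, M^i, M^j = diagonal marginal matrices
d_{ij} = -\log \frac{\lvert \det J^{ij} \rvert}{\sqrt{\det M^{i}\, \det M^{j}}}

% Additivity along the path between i and j in the latent tree
d_{ij} = \sum_{(k,l) \in \mathrm{Path}(i,j)} d_{kl}
```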

Slide 14/46

Testing node relationships

Node j a leaf node and node i its parent: a condition on d_ik - d_jk (reconstructed below) holds for all k ≠ i, j

Can identify (parent, leaf child) pairs

Node i and j leaf nodes sharing the same parent (sibling nodes): the corresponding condition holds for all k ≠ i, j

Can identify leaf-sibling pairs
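The tests themselves, reconstructed from the standard formulation (Φ is the difference of information distances; the slide's own notation may differ slightly):

```latex
% Difference of information distances, for any other observed node k
\Phi_{ijk} = d_{ik} - d_{jk}

% j is a leaf with parent i: the path from j to any k passes through i, so
\Phi_{ijk} = -\,d_{ij} \quad \text{for all } k \neq i, j

% i and j are leaves with a common parent (siblings): Phi is constant in k, with
-\,d_{ij} < \Phi_{ijk} < d_{ij} \quad \text{for all } k \neq i, j
```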

Slide 15/46

Recursive Grouping (exact statistics)

Step 1. Compute information distances for all observed node pairs.

Step 2. Identify (parent, leaf child) or (leaf sibling) groups.

Step 3. Introduce a hidden parent node for each sibling

group without a parent.

Step 4. Compute the information distances for the new

hidden nodes. E.g.:

Step 5. Remove the identified child nodes and repeat.
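The Step 4 example did not survive extraction; in the standard formulation (with Φ as in the previous reconstruction, h the new hidden parent of leaf siblings i and j, and k any other node) it reads:

```latex
d_{ih} = \tfrac{1}{2}\left( d_{ij} + \Phi_{ijk} \right),
\qquad
d_{jh} = d_{ij} - d_{ih},
\qquad
d_{kh} = d_{ik} - d_{ih}
```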

Slide 16/46

Recursive Grouping: identifies a group of family nodes at each step, introduces hidden nodes recursively, and correctly recovers all minimal latent trees.

Computational complexity: O(diam(T) · m³)

    Worst case

Slide 17/46

    Chow-Liu Tree

Minimum spanning tree MST(V; D),

using D as edge weights

V = set of observed nodes

D = information distances

Computational complexity: O(m² log m)

Slide 18/46

Surrogate nodes and the Chow-Liu tree: V = set of observed nodes

Surrogate node of i: the observed node closest to i in information distance (see the formula below)

If (i, j) is an edge in the latent tree, then

(Sg(i), Sg(j)) is an edge in the Chow-Liu tree MST(V; D)
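The surrogate-node formula, restated from the standard formulation since the slide's equation did not survive:

```latex
\mathrm{Sg}(i) = \operatorname*{arg\,min}_{j \in V} \; d_{ij}
```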

Slide 19/46

CLGrouping algorithm

Step 1. Using information distances of observed nodes, construct

MST(V; D). Identify the set of internal nodes.

Step 2. Select an internal node and its neighborhood, and apply the recursive-grouping (RG) algorithm.

Step 3. Replace the sub-tree spanning the neighborhood with

the output of RG.

Repeat Steps 2-3 until all internal nodes are processed.

Computational complexity:

O(m² log m + (#internal nodes) · (max ...))

Slide 20/46

Sample-based algorithms: compute the ML estimates of the information distances; relaxed constraints for testing node relationships; consistent (only in structure for discrete distributions)

Regularized CLGrouping for learning latent tree approximations.

Slide 21/46

Slide 22/46

Slide 23/46

Slide 24/46

Slide 25/46

    The dark side of trees

The bright side: no loops. So, what do we do?

Try #1: Turn it into a junction tree

Variable dimension/cardinality can be prohibitive, and learning/finding good models of this type can be hard

Try #2: Pretend the problem isn't there. If the real objectives are at coarse scales, then fine-scale artifacts

may not matter

Try #3: Pretend it's a tree and use (Loopy) BP. Try #4: Think!

    What does LBP do? Better algorithms? Other graphs for which inference is scalable?

Slide 26/46

Borrowing/adapting a first idea from the

numerical solution of PDEs

Nested dissection: cut the PDE graph via a nested set of separators; solve the problem from small to large. Top-down has a problem:

High-dimensional variables for large separators

Recursive cavity models: start from the bottom rather than the top of the dissection

Construct approximate models (i.e., inverse-covariance/information-form models); do this by variable elimination and model thinning

Variable elimination introduces new edges due to fill; remove edges so that the thinned model is optimally close to the

unthinned model (convex optimization)

Slide 27/46

RCM in pictures: cavity thinning

Collision

Reversing the process (bring your own ...)

Slide 28/46

RCM in action: We are th...

This is the information-form version of RTS, with a thinning approximation at each stage

How do the thinning errors propagate? A control-theoretic stability question

We can iteratively refine estimates using Richardson iterations

Slide 29/46

Walk-sums, BP, and efficient message

scheduling for Gaussian models

Information form:

Inference: go from (h, J) to means and covariances

Assume J normalized to have unit diagonal

R is the matrix of partial correlation coefficients, with sparsity pattern identical to J

(J⁻¹)_st corresponds to a sum over weighted walks from s to t in the graph
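The walk-sum identities behind this statement, in the form used in the Malioutov-Johnson-Willsky paper cited at the end (restated here because the slide's equations did not survive extraction):

```latex
% Split off the partial-correlation matrix from the unit-diagonal J
J = I - R, \qquad J^{-1} = \sum_{k \ge 0} R^{k} \quad \text{(when the series converges)}

% Covariance entry (s,t) = sum of the weights of all walks w from s to t,
% where a walk's weight is the product of the partial correlations on its steps
(J^{-1})_{st} = \sum_{w:\, s \to t} \phi(w), \qquad \phi(w) = \prod_{(u,v) \in w} R_{uv}

% Walk-summability: absolute convergence of these sums
\rho(\bar{R}) < 1, \qquad \bar{R} = [\,\lvert R_{uv} \rvert\,]
```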

Slide 30/46

Inference algorithms may compute walks in different orders (different message schedules). Would like convergence regardless of order: the condition of walk-summability

For BP: walk-summability guarantees convergence. If BP converges, it collects all walks for the means, but only

some of the self-return walks required for P_ii

The computation tree provides insight about what can happen for non-WS models

    Walk-summable models

Slide 31/46

    A computation tree

BP includes the walk (1,2,3,2,1); BP does not include the walk (1,2,3,1). Note that back-tracking walks traverse each edge an even number of times

This implies that the sign of the partial correlations has no effect on variance computations

BUT: For non-WS models, the computation tree may not be valid

Slide 32/46

    A simple example

Slide 33/46

Preconditioned Richardson iterations. Stationary:

Efficient if S corresponds to some tractable subgraph. Non-stationary variations: Gauss-Seidel-like iterations, where we fix some

variables and update others

For WS models: we get exact answers if all walks are

collected

Adaptive max-weight spanning tree algorithm: choose the tree structure to use at each iteration to reduce the equation error by collecting the best additional walks. Can provide exponential speed-up in many cases

    Embedded Trees
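A minimal numerical sketch of the stationary preconditioned Richardson iteration described above. The splitting matrix S stands in for a tractable (e.g., embedded-tree) approximation of J; for brevity the sketch uses a dense solve and a simple diagonal splitting, which are illustrative assumptions rather than the talk's algorithm:

```python
import numpy as np

def preconditioned_richardson(J, h, S, num_iters=50):
    """Stationary preconditioned Richardson iteration for J x = h.

    S is the splitting/preconditioner; in the embedded-trees setting it would
    be the information matrix of a spanning tree of the model, so that solving
    S z = r is cheap (here a dense solve is used for clarity).
    """
    x = np.zeros_like(h)
    for _ in range(num_iters):
        residual = h - J @ x                    # current equation error
        x = x + np.linalg.solve(S, residual)    # "tree solve" step
    return x

# Tiny walk-summable example; the simplest splitting is the diagonal of J.
J = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.4],
              [0.0, 0.4, 1.0]])
h = np.array([1.0, 0.0, -1.0])
x_hat = preconditioned_richardson(J, h, S=np.diag(np.diag(J)))
print(x_hat, np.linalg.solve(J, h))             # iterate approaches the exact mean
```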

Slide 34/46

An alternative approach: using a

(pseudo-) feedback vertex set (FVS)

1. Provide additional potentials to allow the computations needed for mean/variance/covariance calculations
2. Run BP with both the original potentials and the additional ones
3. Feed back information to the FVS to allow computation of the variance and mean within the FVS
4. Send modified information potentials to neighbors of the FVS
5. Run BP with the modified information potentials:
   1. Yields exact means immediately
   2. Combining with results from Steps 2 and 3 yields exact variances

Slide 35/46

Approximate FVS: complexity is O(k²n), where k = |F|. If k is too large:

Use a pseudo- (i.e., partial) FVS, breaking only some of the loops. On the remaining graph, run LBP or some other algorithm

Assuming convergence (which does not require WS): always get the correct means and variances on F, exact means on T,

and (for LBP) approximate variances on T

The approximate variances collect more walks than LBP alone

Local (fast) method for choosing nodes for the pseudo-FVS: enhance convergence, collect the most important walks

Theoretical and empirical evidence show k = O(log n) suffices

Slide 36/46

    Motivation from PDEs,

Multipole models: motivation from methods for efficient

preconditioners for PDEs

The influence of variables at a distance is well approximated by a coarser approximation

We then only need to do LOCAL smoothing/correction

The idea for statistical models: pyramidal structure in scale. However, when conditioned on neighboring scales, the

remaining correlation structure at a given scale is

sparse and local

Slide 37/46

    Models on graphs and

conjugate graphs. Garden-variety graphical model: sparse inverse covariance

    Conjugate models: sparse covariance

Slide 38/46

Sparse In-scale Conditional Covariance

    Multiresolution Model (SIM Model)

[Figure: the SIM model across scales (Scale 1, ...); conditioned on neighboring scales, the in-scale conditional covariance is sparse]

Slide 39/46

Multipole Estimation: Richardson iteration to solve (J_h + (Σ_c)⁻¹) x = h, alternating between:

Global tree-based inference: J_h x_new = h − (Σ_c)⁻¹ x_old

Compute the last term via the sparse system Σ_c z = x_old

Sparse matrix multiplication for the in-scale correction:

x_new = Σ_c (h − J_h x_old)

Slide 40/46

    Stock Returns Example

Monthly returns of 84 companies in the S&P 100 index (19...). Hierarchy based on the Standard Industrial Classification system: market, 6 divisions, 26 industries, and 84 individual companies. Conjugate edges find strong residual correlations:

Oil service companies (Schlumberger, ...) and oil companies; computer companies, software companies, electrical ...

Slide 41/46

What If Some Phenomena Are

Hidden?

Hidden variables: hedge fund investments, patent acceptances, geopolitical factors, regulatory issues, ...

Many dependencies among observed variables

    Less concise model

[Figure: dependencies among auto makers (Ford, GM, Chrysler) and airlines (JetBlue, United, Delta, American, Continental), shown with and without the hidden variable]

Slide 42/46

Graphical Models With Hidden Variables:

    Sparse Modeling Meets PCA

Marginal concentration matrix = sparse + low-rank

Slide 43/46

Convex Optimization for Model Selection

Samples of obs. vars.:

The (estimated) marginal concentration matrix of the observed variables decomposes into a sparse part S and a low-rank part L

The last two terms provide convex regularization for sparsity in S and low rank in L

    Weights allow tradeoff between sparsity and rank
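For reference, the convex program this slide summarizes, in the form published by Chandrasekaran, Parrilo, and Willsky (there the low-rank term enters with a minus sign); Σ̂_O denotes the sample covariance of the observed variables and ℓ the Gaussian log-likelihood:

```latex
(\hat{S}, \hat{L}) \;=\; \operatorname*{arg\,min}_{S,\,L}\;
  -\,\ell\!\left(S - L;\; \widehat{\Sigma}_O\right)
  \;+\; \lambda_n \left( \gamma \lVert S \rVert_1 + \mathrm{tr}(L) \right)
\quad \text{s.t.} \quad S - L \succ 0,\; L \succeq 0
```

The ℓ₁ term promotes sparsity in S, the trace term (the nuclear norm for L ⪰ 0) promotes low rank, and γ trades the two off, matching the "last two terms" and "weights" bullets above.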

Slide 44/46

When does this work? Identifiability conditions:

The sparse part can't be low rank and the low-rank part can't be sparse

There are precise conditions (including on the regularization weights) that guarantee:

Exact recovery if exact statistics are given; consistency results if samples are available (with

probability scaling laws on problem dimension and available sample size)

Slide 45/46

On the way: constructing richer classes

of models for which inference is tractable. We have:

Methods to learn hidden trees with fixed structure but unknown variable dimensions

Methods to learn hidden trees with unknown structure but fixed variable dimension. Can we do both at the same time?

Certainly ill-posed: can convex optimization save the day? We have a method for recovering sparse plus low-rank

How about tree plus low rank?

This would yield models with small FVSs. How about building the desire for tractable models into the

    learning objective function?

Slide 46/46

M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky, "Tree-Based Reparametrization Framework for Approximate Estimation on Graphs with Cycles," IEEE Trans. on Information Theory, Vol. 49, No. 5, May 2003, pp. 1120-1146.

E. Sudderth, M.J. Wainwright, and A.S. Willsky, "Embedded Trees: Estimation of Gaussian Processes on Graphs with Cycles," IEEE Transactions on Signal Processing, Vol. 52, No. 11, Nov. 2004, pp. 3136-3150.

D.M. Malioutov, J. Johnson, and A.S. Willsky, "Walk-Sums and Belief Propagation in Gaussian Graphical Models," Journal of Machine Learning Research, Vol. 7, Oct. 2006, pp. 2031-2064.

V. Chandrasekaran, J. Johnson, and A.S. Willsky, "Estimation in Gaussian Graphical Models Using Tractable Subgraphs: A Walk-Sum Analysis," IEEE Transactions on Signal Processing, Vol. 56, No. 5, May 2008, pp. 1916-1930.

M.J. Choi, V. Chandrasekaran, and A.S. Willsky, "Gaussian Multiresolution Models: Exploiting Sparse Markov and Covariance Structure," IEEE Trans. on Signal Processing, Vol. 58, March 2010, pp. 1012-1024.

Y. Liu, V. Chandrasekaran, A. Anandkumar, and A.S. Willsky, "Feedback Message Passing for Inference in Gaussian Graphical Models," accepted for publication in IEEE Trans. on Signal Processing.