Plenary - Willsky


TRANSCRIPT

Slide 1/46

Learning and Inference in Graphical and Hierarchical Models: A Personal Journey

    Alan S. Willsky

    [email protected]

    http://lids.mit.edu

    http://ssg.mit.edu

    July 2012

Slide 2/46

Undirected graphical models: G = (V, E), with V = vertices and E ⊆ V × V = edges; Markovianity on G

Hammersley-Clifford theorem (necessary and sufficient condition for positive distributions): C = set of all cliques in G

ψ_c = clique potential, Z = partition function

Pairwise graphical models
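The factorization these definitions refer to did not survive extraction; a standard statement, in the notation above (the pairwise case is the special form used in the BP slides that follow):

```latex
% Hammersley-Clifford factorization over the set of cliques C of G
p(x) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(x_c)

% Pairwise graphical model: only node and edge potentials
p(x) = \frac{1}{Z} \prod_{s \in V} \psi_s(x_s) \prod_{(s,t) \in E} \psi_{st}(x_s, x_t)
```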

Slide 3/46

Directing/undirecting the graph

Undirected models: factors are not typically probabilities, although they can be for cycle-free graphs

Directed models: specified in terms of transition probabilities (parents to children)

Directed to undirected: easy (after moralization)

Undirected to directed: hard (and often a mess) unless the graph is a tree

Slide 4/46

Gaussian models: X ~ N(μ, P) or N⁻¹(h, J), with J = P⁻¹ and h = Jμ

The sparsity structure of J determines the graph structure, i.e., J_st = 0 if (s, t) is not an edge

Directed model (0-mean for simplicity):

AX = W, with W ~ N(0, I) and A lower triangular

A = J^(1/2): in general we get lots of fill, unless the structure is a tree

And it's even more complicated for non-Gaussian models, where higher-order cliques are introduced
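For the BP-scheduling and walk-sum slides later, it helps to have the information form written out; this is the standard identity implied by the definitions above:

```latex
% Information (canonical) form of a Gaussian model
p(x) \propto \exp\!\left( -\tfrac{1}{2} x^\top J x + h^\top x \right),
\qquad J = P^{-1}, \quad h = J\mu

% Inference = recovering the moment parameters from (h, J)
\mu = J^{-1} h, \qquad P = J^{-1}
```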

Slide 5/46

Belief Propagation: message passing

for pairwise models on trees

Fixed-point equations for likelihoods from disjoint parts of the graph:
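The slide's equations did not survive extraction; the standard sum-product fixed-point equations for a pairwise model are (sums become integrals for continuous variables; the slide's own notation may differ):

```latex
% Message from node t to neighbor s
m_{t \to s}(x_s) \propto \sum_{x_t} \psi_{st}(x_s, x_t)\, \psi_t(x_t)
  \prod_{u \in N(t) \setminus \{s\}} m_{u \to t}(x_t)

% Marginal at node s from incoming messages
p(x_s) \propto \psi_s(x_s) \prod_{t \in N(s)} m_{t \to s}(x_s)
```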

Slide 6/46

BP on trees gives a factored form for the distribution in terms of node and pairwise marginal

distributions

Great flexibility in message scheduling: leaves-root-leaves = Rauch-Tung-Striebel; completely parallel messaging converges in a number of steps equal to the

diameter of the graph

What do we do if we don't have a tree? Turn it into one via junction trees

Not usually a good idea. OK, then what?

Slide 7/46

    Modeling of structure

Four questions: What should the structure of the graph be? Which variables go where? What about adding hidden (unobserved)

nodes (and variables)?

What should the dimensions of the various variables be?

Slide 8/46

Our initial foray (with thanks to Michèle Basseville and Albert Benveniste)

What can be multiresolution?

The phenomenon being modeled, the data that are collected, the objectives of statistical inference, the algorithms that result

Some applications that motivated us (and others): oceanography, groundwater hydrology, fractal priors in regularization formulations (vision, mathematical physics, ...),

texture discrimination, helioseismology (?)

Slide 9/46

Specifying MR models: MR synthesis leads, as with Markov chains, to thinking

about directed trees:

E.g.: hidden Markov trees, midpoint deflection. Note that the dimension of the variables comes into play, but let's assume we pre-specify the tree structure

This looks closer to what we know how to handle

Slide 10/46

A control theorist's idea

Internal models: variables at coarser nodes are linear functionals of finer-

scale variables**

Some of these may be specified, measured, or imposed; the rest are to be designed

To approximate the condition for tree-Markovianity (this emphasizes the state-like nature of variables in these models)

To yield a model that is close to the true fine-scale statistics: scale-recursive algebraic design**

Criterion used: canonical correlations or predictive efficiency

Alternatives: midpoint deflection or wavelets**. Confounding the control theorist:

Internal models need not have minimal state dimension

Slide 11/46

So, what if we want a tree model,

not necessarily a hierarchical one? One approach: construct the maximum likelihood tree

given sample data (or the full second-order statistics)

NOTE: This is quite different from what system theorists typically consider. There is no state to identify: all of the variables we

desire are observed

It is the index set that we get to play with. Chow and Liu found a very efficient algorithm:

Form a graph with each observed variable at a different node; the edge weight between any two variables is their mutual information; compute the max-weight spanning tree (a minimal sketch follows)
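A minimal sketch of that procedure for the jointly Gaussian case. The function name, the use of the Gaussian pairwise mutual-information formula I(s,t) = -1/2 log(1 - ρ_st²), and the plain Kruskal construction are illustrative assumptions, not the talk's own code:

```python
import numpy as np

def chow_liu_tree(samples):
    """Chow-Liu: max-weight spanning tree with mutual-information edge weights.

    `samples` is an (n, m) array: n observations of m jointly Gaussian
    variables.  Returns the learned tree as a list of (s, t) edges.
    """
    m = samples.shape[1]
    rho = np.corrcoef(samples, rowvar=False)

    def mutual_info(s, t):
        # Pairwise mutual information of two jointly Gaussian variables.
        return -0.5 * np.log(1.0 - rho[s, t] ** 2)

    # All candidate edges, most informative first.
    edges = sorted(
        ((s, t) for s in range(m) for t in range(s + 1, m)),
        key=lambda e: mutual_info(*e),
        reverse=True,
    )

    # Kruskal with a small union-find: greedily keep edges joining components.
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for s, t in edges:
        rs, rt = find(s), find(t)
        if rs != rt:
            parent[rs] = rt
            tree.append((s, t))
    return tree
```

For discrete variables one would swap in the empirical mutual information; the greedy max-weight spanning-tree step is unchanged.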

    What if we want hidden/latent states?

Slide 12/46

Reconstruction of a Latent Tree

Reconstruct a latent tree using exact statistics (first) or samples (to follow) of the observed variables only

Exact/consistent recovery of minimal latent trees: each hidden node has at least three neighbors; observed variables are neither perfectly dependent nor independent

Other objectives: computational efficiency, low sample complexity

Slide 13/46

Information distance: Gaussian distributions

Discrete distributions: joint probability matrix, marginal probabilities

    Additivity
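The definitions behind these bullets (the slide's formulas did not survive extraction; these are the standard ones used in this line of work, stated for scalar variables):

```latex
% Gaussian: information distance from the correlation coefficient
d_{ij} = -\log \lvert \rho_{ij} \rvert

% Discrete: J^{ij} = joint probability matrix, M^i, M^j = diagonal marginal matrices
d_{ij} = -\log \frac{\lvert \det J^{ij} \rvert}{\sqrt{\det M^{i}\, \det M^{j}}}

% Additivity along the path between i and j in the latent tree
d_{ij} = \sum_{(k,l) \in \mathrm{Path}(i,j)} d_{kl}
```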

Slide 14/46

Testing node relationships

Node j a leaf node and node i its parent: a condition on d_ik - d_jk (reconstructed below) holds for all k ≠ i, j

Can identify (parent, leaf child) pairs

Node i and j leaf nodes sharing the same parent (sibling nodes): the corresponding condition holds for all k ≠ i, j

Can identify leaf-sibling pairs
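The tests themselves, reconstructed from the standard formulation (Φ is the difference of information distances; the slide's own notation may differ slightly):

```latex
% Difference of information distances, for any other observed node k
\Phi_{ijk} = d_{ik} - d_{jk}

% j is a leaf with parent i: the path from j to any k passes through i, so
\Phi_{ijk} = -\,d_{ij} \quad \text{for all } k \neq i, j

% i and j are leaves with a common parent (siblings): Phi is constant in k, with
-\,d_{ij} < \Phi_{ijk} < d_{ij} \quad \text{for all } k \neq i, j
```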

Slide 15/46

Recursive Grouping (exact statistics)

Step 1. Compute information distances for all observed node pairs.

Step 2. Identify (parent, leaf child) or (leaf sibling) groups.

Step 3. Introduce a hidden parent node for each sibling

group without a parent.

Step 4. Compute the information distances for the new

hidden nodes. E.g.:

Step 5. Remove the identified child nodes and repeat.
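The Step 4 example did not survive extraction; in the standard formulation (with Φ as in the previous reconstruction, h the new hidden parent of leaf siblings i and j, and k any other node) it reads:

```latex
d_{ih} = \tfrac{1}{2}\left( d_{ij} + \Phi_{ijk} \right),
\qquad
d_{jh} = d_{ij} - d_{ih},
\qquad
d_{kh} = d_{ik} - d_{ih}
```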

Slide 16/46

Recursive Grouping: identifies a group of family nodes at each step, introduces hidden nodes recursively, and correctly recovers all minimal latent trees.

Computational complexity: O(diam(T) · m³)

    Worst case

Slide 17/46

    Chow-Liu Tree

Minimum spanning tree MST(V; D),

using D as edge weights

V = set of observed nodes

D = information distances

Computational complexity: O(m² log m)

Slide 18/46

Surrogate nodes and the Chow-Liu tree: V = set of observed nodes

Surrogate node of i: the observed node closest to i in information distance (see the formula below)

If (i, j) is an edge in the latent tree, then

(Sg(i), Sg(j)) is an edge in the Chow-Liu tree MST(V; D)
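The surrogate-node formula, restated from the standard formulation since the slide's equation did not survive:

```latex
\mathrm{Sg}(i) = \operatorname*{arg\,min}_{j \in V} \; d_{ij}
```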

Slide 19/46

CLGrouping algorithm

Step 1. Using information distances of observed nodes, construct

MST(V; D). Identify the set of internal nodes.

Step 2. Select an internal node and its neighborhood, and apply the recursive-grouping (RG) algorithm.

Step 3. Replace the sub-tree spanning the neighborhood with

the output of RG.

Repeat Steps 2-3 until all internal nodes are processed.

Computational complexity:

O(m² log m + (#internal nodes) · (max ...))

Slide 20/46

Sample-based algorithms: compute the ML estimates of the information distances; relaxed constraints for testing node relationships; consistent (only in structure for discrete distributions)

Regularized CLGrouping for learning latent tree approximations.

Slide 21/46

Slide 22/46

Slide 23/46

Slide 24/46

Slide 25/46

    The dark side of trees

The bright side: no loops. So, what do we do?

Try #1: Turn it into a junction tree

Variable dimension/cardinality can be prohibitive, and learning/finding good models of this type can be hard

Try #2: Pretend the problem isn't there. If the real objectives are at coarse scales, then fine-scale artifacts

may not matter

Try #3: Pretend it's a tree and use (Loopy) BP. Try #4: Think!

    What does LBP do? Better algorithms? Other graphs for which inference is scalable?

Slide 26/46

Borrowing/adapting a first idea from the

numerical solution of PDEs

Nested dissection: cut the PDE graph via a nested set of separators; solve the problem from small to large. Top-down has a problem:

High-dimensional variables for large separators

Recursive cavity models: start from the bottom rather than the top of the dissection

Construct approximate models (i.e., inverse-covariance/information-form models); do this by variable elimination and model thinning

Variable elimination introduces new edges due to fill; remove edges so that the thinned model is optimally close to the

unthinned model (convex optimization)

Slide 27/46

RCM in pictures: cavity thinning

Collision

Reversing the process (bring your own ...)

Slide 28/46

RCM in action: We are th...

This is the information-form version of RTS, with a thinning approximation at each stage

How do the thinning errors propagate? A control-theoretic stability question

We can iteratively refine estimates using Richardson iterations

Slide 29/46

Walk-sums, BP, and efficient message

scheduling for Gaussian models

Information form:

Inference: go from (h, J) to means and covariances

Assume J normalized to have unit diagonal

R is the matrix of partial correlation coefficients, with sparsity pattern identical to J

(J⁻¹)_st corresponds to a sum over weighted walks from s to t in the graph
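The walk-sum identities behind this statement, in the form used in the Malioutov-Johnson-Willsky paper cited at the end (restated here because the slide's equations did not survive extraction):

```latex
% Split off the partial-correlation matrix from the unit-diagonal J
J = I - R, \qquad J^{-1} = \sum_{k \ge 0} R^{k} \quad \text{(when the series converges)}

% Covariance entry (s,t) = sum of the weights of all walks w from s to t,
% where a walk's weight is the product of the partial correlations on its steps
(J^{-1})_{st} = \sum_{w:\, s \to t} \phi(w), \qquad \phi(w) = \prod_{(u,v) \in w} R_{uv}

% Walk-summability: absolute convergence of these sums
\rho(\bar{R}) < 1, \qquad \bar{R} = [\,\lvert R_{uv} \rvert\,]
```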

Slide 30/46

Inference algorithms may compute walks in different orders (different message schedules). Would like convergence regardless of order: the condition of walk-summability

For BP: walk-summability guarantees convergence. If BP converges, it collects all walks for the means, but only

some of the self-return walks required for P_ii

The computation tree provides insight about what can happen for non-WS models

    Walk-summable models

Slide 31/46

    A computation tree

BP includes the walk (1,2,3,2,1); BP does not include the walk (1,2,3,1). Note that back-tracking walks traverse each edge an even number of times

This implies that the sign of the partial correlations has no effect on variance computations

BUT: For non-WS models, the computation tree may not be valid

Slide 32/46

    A simple example

Slide 33/46

Preconditioned Richardson iterations. Stationary:

Efficient if S corresponds to some tractable subgraph. Non-stationary variations: Gauss-Seidel-like iterations, where we fix some

variables and update others

For WS models: we get exact answers if all walks are

collected

Adaptive max-weight spanning tree algorithm: choose the tree structure to use at each iteration to reduce the equation error by collecting the best additional walks. Can provide exponential speed-up in many cases

    Embedded Trees
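A minimal numerical sketch of the stationary preconditioned Richardson iteration described above. The splitting matrix S stands in for a tractable (e.g., embedded-tree) approximation of J; for brevity the sketch uses a dense solve and a simple diagonal splitting, which are illustrative assumptions rather than the talk's algorithm:

```python
import numpy as np

def preconditioned_richardson(J, h, S, num_iters=50):
    """Stationary preconditioned Richardson iteration for J x = h.

    S is the splitting/preconditioner; in the embedded-trees setting it would
    be the information matrix of a spanning tree of the model, so that solving
    S z = r is cheap (here a dense solve is used for clarity).
    """
    x = np.zeros_like(h)
    for _ in range(num_iters):
        residual = h - J @ x                    # current equation error
        x = x + np.linalg.solve(S, residual)    # "tree solve" step
    return x

# Tiny walk-summable example; the simplest splitting is the diagonal of J.
J = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.4],
              [0.0, 0.4, 1.0]])
h = np.array([1.0, 0.0, -1.0])
x_hat = preconditioned_richardson(J, h, S=np.diag(np.diag(J)))
print(x_hat, np.linalg.solve(J, h))             # iterate approaches the exact mean
```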

Slide 34/46

An alternative approach: using a

(pseudo-) feedback vertex set (FVS)

1. Provide additional potentials to allow the computations needed for mean/variance/covariance calculations
2. Run BP with both the original potentials and the additional ones
3. Feed back information to the FVS to allow computation of the variance and mean within the FVS
4. Send modified information potentials to neighbors of the FVS
5. Run BP with the modified information potentials:
   1. Yields exact means immediately
   2. Combining with results from Steps 2 and 3 yields exact variances

Slide 35/46

Approximate FVS: complexity is O(k²n), where k = |F|. If k is too large:

Use a pseudo- (i.e., partial) FVS, breaking only some of the loops. On the remaining graph, run LBP or some other algorithm

Assuming convergence (which does not require WS): always get the correct means and variances on F, exact means on T,

and (for LBP) approximate variances on T

The approximate variances collect more walks than LBP alone

Local (fast) method for choosing nodes for the pseudo-FVS: enhance convergence, collect the most important walks

Theoretical and empirical evidence show k = O(log n) suffices

Slide 36/46

    Motivation from PDEs,

Multipole models: motivation from methods for efficient

preconditioners for PDEs

The influence of variables at a distance is well approximated by a coarser approximation

We then only need to do LOCAL smoothing/correction

The idea for statistical models: pyramidal structure in scale. However, when conditioned on neighboring scales, the

remaining correlation structure at a given scale is

sparse and local

Slide 37/46

    Models on graphs and

conjugate graphs. Garden-variety graphical model: sparse inverse covariance

    Conjugate models: sparse covariance

Slide 38/46

Sparse In-scale Conditional Covariance

    Multiresolution Model (SIM Model)

[Figure: the SIM model across scales (Scale 1, ...); conditioned on neighboring scales, the in-scale conditional covariance is sparse]

Slide 39/46

Multipole Estimation: Richardson iteration to solve (J_h + (Σ_c)⁻¹) x = h, alternating between:

Global tree-based inference: J_h x_new = h − (Σ_c)⁻¹ x_old

Compute the last term via the sparse system Σ_c z = x_old

Sparse matrix multiplication for the in-scale correction:

x_new = Σ_c (h − J_h x_old)

Slide 40/46

    Stock Returns Example

Monthly returns of 84 companies in the S&P 100 index (19...). Hierarchy based on the Standard Industrial Classification system: market, 6 divisions, 26 industries, and 84 individual companies. Conjugate edges find strong residual correlations:

Oil service companies (Schlumberger, ...) and oil companies; computer companies, software companies, electrical ...

Slide 41/46

What If Some Phenomena Are

Hidden?

Hidden variables: hedge fund investments, patent acceptances, geopolitical factors, regulatory issues, ...

Many dependencies among observed variables

    Less concise model

[Figure: dependencies among auto makers (Ford, GM, Chrysler) and airlines (JetBlue, United, Delta, American, Continental), shown with and without the hidden variable]

Slide 42/46

Graphical Models With Hidden Variables:

    Sparse Modeling Meets PCA

Marginal concentration matrix = sparse + low-rank

Slide 43/46

Convex Optimization for Model Selection

Samples of obs. vars.:

The (estimated) marginal concentration matrix of the observed variables decomposes into a sparse part S and a low-rank part L

The last two terms provide convex regularization for sparsity in S and low rank in L

    Weights allow tradeoff between sparsity and rank
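For reference, the convex program this slide summarizes, in the form published by Chandrasekaran, Parrilo, and Willsky (there the low-rank term enters with a minus sign); Σ̂_O denotes the sample covariance of the observed variables and ℓ the Gaussian log-likelihood:

```latex
(\hat{S}, \hat{L}) \;=\; \operatorname*{arg\,min}_{S,\,L}\;
  -\,\ell\!\left(S - L;\; \widehat{\Sigma}_O\right)
  \;+\; \lambda_n \left( \gamma \lVert S \rVert_1 + \mathrm{tr}(L) \right)
\quad \text{s.t.} \quad S - L \succ 0,\; L \succeq 0
```

The ℓ₁ term promotes sparsity in S, the trace term (the nuclear norm for L ⪰ 0) promotes low rank, and γ trades the two off, matching the "last two terms" and "weights" bullets above.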

Slide 44/46

When does this work? Identifiability conditions:

The sparse part can't be low rank and the low-rank part can't be sparse

There are precise conditions (including on the regularization weights) that guarantee:

Exact recovery if exact statistics are given; consistency results if samples are available (with

probability scaling laws on problem dimension and available sample size)

Slide 45/46

On the way: constructing richer classes

of models for which inference is tractable. We have:

Methods to learn hidden trees with fixed structure but unknown variable dimensions

Methods to learn hidden trees with unknown structure but fixed variable dimension. Can we do both at the same time?

Certainly ill-posed: can convex optimization save the day? We have a method for recovering sparse plus low-rank

How about tree plus low rank?

This would yield models with small FVSs. How about building the desire for tractable models into the

    learning objective function?

Slide 46/46

M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky, "Tree-Based Reparametrization Framework for Approximate Estimation on Graphs with Cycles," IEEE Trans. on Information Theory, Vol. 49, No. 5, May 2003, pp. 1120-1146.

E. Sudderth, M.J. Wainwright, and A.S. Willsky, "Embedded Trees: Estimation of Gaussian Processes on Graphs with Cycles," IEEE Transactions on Signal Processing, Vol. 52, No. 11, Nov. 2004, pp. 3136-3150.

D.M. Malioutov, J. Johnson, and A.S. Willsky, "Walk-Sums and Belief Propagation in Gaussian Graphical Models," Journal of Machine Learning Research, Vol. 7, Oct. 2006, pp. 2031-2064.

V. Chandrasekaran, J. Johnson, and A.S. Willsky, "Estimation in Gaussian Graphical Models Using Tractable Subgraphs: A Walk-Sum Analysis," IEEE Transactions on Signal Processing, Vol. 56, No. 5, May 2008, pp. 1916-1930.

M.J. Choi, V. Chandrasekaran, and A.S. Willsky, "Gaussian Multiresolution Models: Exploiting Sparse Markov and Covariance Structure," IEEE Trans. on Signal Processing, Vol. 58, March 2010, pp. 1012-1024.

Y. Liu, V. Chandrasekaran, A. Anandkumar, and A.S. Willsky, "Feedback Message Passing for Inference in Gaussian Graphical Models," accepted for publication in IEEE Trans. on Signal Processing.