lecture 17 slides may 30 th , 2006

Lec 17: May 30th, 2006 EE512 - Graphical Models - J. Bilmes

University of Washington, Department of Electrical Engineering

EE512 Spring, 2006 Graphical Models

Jeff A. Bilmes <[email protected]>

Lecture 17 Slides

May 30th, 2006


• READING:
  – M. Jordan: Chapters 13, 14, 15 (on Gaussians and Kalman)

• Reminder: TA discussions and office hours:
  – Office hours: Thursdays 3:30-4:30, Sieg Ground Floor Tutorial Center
  – Discussion Sections: Fridays 9:30-10:30, Sieg Ground Floor Tutorial Center Lecture Room

• No more homework this quarter, concentrate on final projects!!

• Makeup class, tomorrow Wednesday, 5-7pm, room TBA (watch email).

Announcements


• L1: Tues, 3/28: Overview, GMs, Intro BNs.
• L2: Thur, 3/30: semantics of BNs + UGMs
• L3: Tues, 4/4: elimination, probs, chordal I
• L4: Thur, 4/6: chordal, sep, decomp, elim
• L5: Tue, 4/11: chdl/elim, mcs, triang, ci props.
• L6: Thur, 4/13: MST, CI axioms, Markov prps.
• L7: Tues, 4/18: Mobius, HC-thm, (F)=(G)
• L8: Thur, 4/20: phylogenetic trees, HMMs
• L9: Tue, 4/25: HMMs, inference on trees
• L10: Thur, 4/27: Inference on trees, start poly
• L11: Tues, 5/2: polytrees, start JT inference
• L12: Thur, 5/4: Inference in JTs
• Tues, 5/9: away
• Thur, 5/11: away
• L13: Tue, 5/16: JT, GDL, Shenoy-Shafer
• L14: Thur, 5/18: GDL, Search, Gaussians I
• L15: Mon, 5/22: laptop crash
• L16: Tues, 5/23: search, Gaussians I
• L17: Thur, 5/25: Gaussians
• Mon, 5/29: Holiday
• L18: Tue, 5/30
• L19: Thur, 6/1: final presentations

Class Road Map


• L1: Tues, 3/28:
• L2: Thur, 3/30:
• L3: Tues, 4/4:
• L4: Thur, 4/6:
• L5: Tue, 4/11:
• L6: Thur, 4/13:
• L7: Tues, 4/18:
• L8: Thur, 4/20: Team Lists, short abstracts I
• L9: Tue, 4/25:
• L10: Thur, 4/27: short abstracts II
• L11: Tues, 5/2:
• L12: Thur, 5/4: abstract II + progress
• L--: Tues, 5/9
• L--: Thur, 5/11: 1 page progress report
• L13: Tue, 5/16:
• L14: Thur, 5/18: 1 page progress report
• L15: Tues, 5/23
• L16: Thur, 5/25: 1 page progress report
• L17: Tue, 5/30: Today
• L18: Wed, 5/31:
• L19: Thur, 6/1: final presentations
• L20: Tue, 6/6: 4-page papers due (like a conference paper). Only .pdf versions accepted.

Final Project Milestone Due Dates

• Team lists, abstracts, and progress reports must be turned in during class, on paper (dead tree versions only).

• Final reports must be turned in electronically in PDF (no other formats accepted).

• No need to repeat what was on previous progress reports/abstracts, I have those available to refer to.

• Progress reports must report who did what so far!!


• Gaussian Graphical Models

Summary of Last Time


• Other forms of inference.
• Structure learning in graphical models

Outline of Today’s Lecture


Books and Sources for Today

• Jordan chapters 13-15
• Other references contained in presentation …


Graphical Models

1. We start with some probability distribution P.
   1. Could be specified as a given, or more likely we have training data of some number of samples. Goal is to learn P or some approximation to it (training) and then use P in some way (inference for making decisions, such as most probable assignment, max-product semi-ring, etc.)

2. The graph G=(V,E) represents “structure” in P.

3. The graph can provide an efficient representation of P and efficient computational inference for P.

4. There can be multiple graphs that represent a given P (e.g., the complete graph represents all P).

5. Goal: find a computationally cheap exact or approximate graph cover for P.

6. Once we do this, we just compute probabilities using the junction tree algorithm, a search algorithm, etc.


Graphical Models & Tree-width

1. The complexity parameter for G=(V,E)

2. Def: k-tree: with k nodes, a k-tree is a clique of size k. With n>k nodes, connect the nth node to k previous nodes that are fully connected to each other.

3. Example: 4-tree

note: all separators are of size 4

[Figures: 4-tree with 4 nodes; 4-tree with 5 nodes; 4-tree with 6 nodes]
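The recursive definition above can be sketched in a few lines of Python. This is only an illustration of the slide's definition, not code from the lecture: the function name `build_k_tree`, the random choice of which existing k-clique to attach to, and the `seed` parameter are all assumptions made for the example.

```python
import itertools
import random

def build_k_tree(n, k, seed=0):
    """Return the edge set of a random k-tree on nodes 0..n-1.

    Follows the slide's definition: the first k nodes form a clique, and
    every later node is connected to k earlier nodes that form a clique.
    """
    if n < k:
        raise ValueError("need at least k nodes")
    rng = random.Random(seed)
    # Base case: the first k nodes are fully connected.
    edges = {frozenset(p) for p in itertools.combinations(range(k), 2)}
    cliques = [tuple(range(k))]  # k-cliques that a new node may attach to
    for v in range(k, n):
        base = rng.choice(cliques)  # hypothetical policy: pick one at random
        edges.update(frozenset((u, v)) for u in base)
        # Each (k-1)-subset of the base, together with v, is a new k-clique.
        for sub in itertools.combinations(base, k - 1):
            cliques.append(sub + (v,))
    return edges

four_tree = build_k_tree(n=6, k=4)
print(len(four_tree), "edges in a 4-tree with 6 nodes")
```

Each added node contributes exactly k edges, so the graph has k(k-1)/2 + (n-k)k edges regardless of which cliques are chosen; with k=1 the construction reduces to an ordinary tree with n-1 edges.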


Graphical Models & Tree-width

1. Def: partial k-tree: any sub-graph of a k-tree

2. Def: tree-width of a graph G is smallest k such that G is a partial k-tree.

3. Thm: The tree-width decision problem is NP-complete
   1. We mentioned this before, proven by Arnborg,

4. Thm: exact probabilistic inference (computing probabilities, etc.) is exponential in the tree-width

1. Time-space tradeoffs can help here, but what if all of the points in the achievable region are intolerably computationally expensive?

5. The big question: what if exact inference is too expensive?


When exact inference is too expensive

1. Two general approaches: either an exact solution to an approximate problem, or an approximate solution to an exact problem.

2. Exact solution to approximate problem
   1. Structure learning: find a low tree-width (or “cheap” in some way) graphical model that is still “high-quality” in some way, and then perform exact inference on the approximate model.
   2. This can be easy or hard depending on the tree-width, on the measure of “high-quality”, and on the learning paradigm.

3. Approximate solution to an exact problem
   1. Approximate inference tries to approximate in some way what must be computed: Loopy Belief Propagation, Sampling/Pruning, Variational/Mean-field, and hybrids between the above


Finding k-trees

1. How do we score a k-tree?
   1. Maximum likelihood, or conditional score

2. May we assume that truth itself is a k-tree?
   1. Sometimes simplifications can be made if we assume that truth is part of a known model class, such as a k-tree for some fixed constant k independent of n=|V|, the number of nodes.

3. How to find the best 1-tree?


Finding 1-trees

1. Given P, goal is to find best 1-tree approximation of P in a maximum likelihood sense.
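The classical answer to this problem is the Chow-Liu procedure: weight every pair of variables by its mutual information and take a maximum-weight spanning tree. The sketch below illustrates that procedure on discrete samples; it is not taken from the slides, and the function names, the use of empirical counts in place of P, and the Kruskal-style union-find are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    # sum over observed pairs of p(a,b) * log( p(a,b) / (p(a) p(b)) )
    return sum(
        (c / n) * math.log(c * n / (pi[a] * pj[b]))
        for (a, b), c in pij.items()
    )

def chow_liu_tree(data):
    """Edges of a maximum-weight spanning tree under pairwise mutual information."""
    d = len(data[0])
    weighted = sorted(
        ((mutual_information(data, i, j), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True,
    )
    parent = list(range(d))  # union-find forest for Kruskal's algorithm

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: column 1 copies column 0, column 2 is unrelated to both.
samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(chow_liu_tree(samples))
```

On the toy data the strongly dependent pair (0, 1) is always selected; the remaining edge is chosen among zero-weight ties, so any spanning tree containing (0, 1) is equally good in the maximum likelihood sense.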


Finding 1-trees


Plethora of negative results

• Chickering1996, Chickering/Meek/Heckerman2003: learning Bayesian networks in ML sense is NP-hard (“is there a BN with fixed upper bound on in-degree that achieves a given ML score?”)

• Dasgupta1999: learning polytrees in ML sense is NP-hard (“is there a polytree with fixed upper bound on in-degree that achieves a given ML score?”) and worse, there is a constant c such that it is NP-complete to decide if there is a polytree with score <= c*OPT_score.

• Meek2001: learning even a path (sub-class of trees) in ML sense is NP-hard.


Plethora of negative results

• Srebro/Karger2001: learning k-trees in ML sense is hard.
• So, generative model structure learning is likely to be a difficult problem (unless k=1, or P=NP).
• We next spend a bit of time talking about the Srebro/Karger result.


Optimal ML k-trees is NP-complete


Some good news …

• PAC framework: key difference, assume graph is in concept class (learn the class of k-trees). This means that if we have sampled data, we assume that the sampled data is from truth which itself is a k-tree.

• Hoeffgen’93: Can robustly (polynomial samples in n, 1/ε, 1/δ) PAC learn bounded tree-width graphical models, and can robustly and efficiently (algorithm polynomial in same) PAC learn 1-trees.

• Narasimhan&Bilmes2004: Can robustly and efficiently PAC learn bounded tree-width graphical models.


More good news …

• Abbeel,Koller,Ng2005: Can robustly and efficiently PAC learn bounded-degree factor graphs

– note: this does not come with a complexity guarantee. E.g., n x n grids have bounded degree but unbounded tree-width. A star has unbounded degree but bounded tree-width. Tree-width is crucial for computation in general.


How to PAC-learn such graphs …

• Mutual information is symmetric submodular
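For reference, the property can be written out as follows. The notation here is assumed rather than recovered from the slide: $V$ is the set of variable indices, $X_A$ is the set of variables indexed by $A$, and $f$ is a name introduced only for this illustration.

```latex
\begin{align*}
f(A) &= I\!\left(X_A ;\, X_{V \setminus A}\right), \qquad A \subseteq V \\
f(A) &= f(V \setminus A) && \text{(symmetric)} \\
f(A) + f(B) &\ge f(A \cup B) + f(A \cap B) \quad \text{for all } A, B \subseteq V && \text{(submodular)}
\end{align*}
```

Submodularity follows from writing $f(A) = H(X_A) + H(X_{V \setminus A}) - H(X_V)$: entropy is a submodular set function, and so is its complement-composed version, while the last term is constant.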


How to PAC-learn such graphs …

• Submodularity and Optimization

(Narasimhan&Bilmes,2004)


Another positive result

• Since mutual information is symmetric-submodular, we can find optimal partitions:

• where

• This has implications for clustering (Narasimhan,Jojic,Bilmes’05) and also for structure learning (can find optimal 1-step graph decomposition by finding the optimal k-separator).


Finding ML decompositions …

• Optimal to one level


Discriminative structure

• Goal might be classification using a generative model.

• Distinction between parameters & structure

• Two possible goals:
  – 1) find one global structure that classifies well
  – 2) find class-specific structure (one per class)

• In either case, finding a good discriminative structure may render discriminative parameter learning less necessary.


Optimal discriminative structure procedure …

• choose (for now, let’s just assume =1)
• Find tree that best satisfies:


Properties

• Options:
  – Can fix structure and train parameters using either maximum likelihood (generative) or maximum conditional likelihood (discriminative)
  – Can learn discriminative structure, and can train either generatively or discriminatively
  – In all cases, assume appropriate regularization.

• Bad news: KL-divergence is not decomposable w.r.t. the tree in the discriminative case.

• Goal: identify a local discriminative measure on edges in a graph (analogous to mutual information for generative case).


EAR measure

• EAR (explaining away residual) measure (Bilmes,’98).

• Goal is to maximize EAR:
  – Intuition: dependence class-conditionally, but otherwise independent

• EAR is an approximation to the expected log conditional posterior. It is exact for independent “auxiliary” variables.
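The slide's equation did not survive extraction. A minimal sketch, assuming the form in which the EAR measure is usually stated, I(X;Y|C) − I(X;Y) for two features X, Y and class C, is given below. The function names and the dictionary representation of the joint distribution are hypothetical and chosen only for the example.

```python
import math
from collections import defaultdict

def marginal(joint, keep):
    """Marginalize a dict {(x, y, c): prob} onto the index positions in keep."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[tuple(assignment[k] for k in keep)] += p
    return out

def ear(joint):
    """Assumed EAR form: I(X;Y|C) - I(X;Y), in nats, for a joint over (x, y, c)."""
    pxy = marginal(joint, (0, 1))
    px, py = marginal(joint, (0,)), marginal(joint, (1,))
    pxc, pyc = marginal(joint, (0, 2)), marginal(joint, (1, 2))
    pc = marginal(joint, (2,))
    # Unconditional mutual information I(X;Y).
    mi = sum(
        p * math.log(p / (px[(x,)] * py[(y,)]))
        for (x, y), p in pxy.items() if p > 0
    )
    # Conditional mutual information I(X;Y|C).
    cmi = sum(
        p * math.log(p * pc[(c,)] / (pxc[(x, c)] * pyc[(y, c)]))
        for (x, y, c), p in joint.items() if p > 0
    )
    return cmi - mi

# X equals Y under class 0 and differs from Y under class 1: the features are
# dependent given the class but independent once the class is marginalized out.
joint = {(0, 0, 0): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25}
print(ear(joint))
```

The toy distribution matches the stated intuition: the conditional term equals log 2 while the unconditional term is zero, so the pair scores as highly as a binary pair can under this measure.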


Conditional mutual information?

• Conditional mutual information is not guaranteed to discriminate well.

• Building an MST using I(X_i; X_j | C) as edge weights will not necessarily produce a tree with good classification properties. EAR fixes this in certain cases.

• Example: 3 features (X_1, X_2, X_3) and a class C


Generative training/structure


General Structure Learning
