Parameter Learning 2
Structure Learning 1: The good
Graphical Models – 10708, Carlos Guestrin, Carnegie Mellon University, September 27th, 2006
Readings: K&F: 14.1, 14.2, 14.3, 14.4, 15.1, 15.2, 15.3.1, 15.4.1


Page 1:

Parameter Learning 2

Structure Learning 1: The good

Graphical Models – 10708

Carlos Guestrin

Carnegie Mellon University

September 27th, 2006

Readings: K&F: 14.1, 14.2, 14.3, 14.4, 15.1, 15.2, 15.3.1, 15.4.1

Page 2:

Your first learning algorithm

Set derivative to zero:
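The algebra on this slide was handwritten; a worked version of the step, assuming the thumbtack (binomial) likelihood from the previous lecture with mH heads and mT tails:

\[
\log P(D \mid \theta) = m_H \log\theta + m_T \log(1-\theta),
\qquad
\frac{\partial}{\partial\theta}\log P(D\mid\theta) = \frac{m_H}{\theta} - \frac{m_T}{1-\theta} = 0
\;\;\Rightarrow\;\;
\hat\theta_{\mathrm{MLE}} = \frac{m_H}{m_H + m_T}.
\]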

Page 3:

Learning the CPTs

Data: x(1), …, x(m)
For each discrete variable Xi

Page 4:

Maximum likelihood estimation (MLE) of BN parameters – General case
Data: x(1), …, x(m)

Restriction: x(j)[PaXi] → assignment to PaXi in x(j)

Given structure, log likelihood of data:
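The likelihood itself was handwritten; in the notation above it is

\[
\log P(D \mid \theta, G) \;=\; \sum_{j=1}^{m} \sum_{i=1}^{n} \log P\!\left(x_i^{(j)} \,\middle|\, \mathbf{x}^{(j)}[\mathrm{Pa}_{X_i}],\; \theta_{X_i \mid \mathrm{Pa}_{X_i}}\right),
\]

a sum of independent terms, one per CPT, which is what makes the optimization decompose.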

Page 5:

Taking derivatives of MLE of BN parameters – General case

Page 6:

General MLE for a CPT
Take a CPT: P(X|U)
Log likelihood term for this CPT:

Parameter θX=x|U=u:
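The solution was worked on the board; setting the derivative to zero under the constraint that θX=x|U=u sums to 1 over x gives the familiar count ratio:

\[
\hat\theta_{X=x \mid \mathbf{U}=\mathbf{u}}
\;=\; \frac{\mathrm{Count}(X=x,\ \mathbf{U}=\mathbf{u})}{\sum_{x'} \mathrm{Count}(X=x',\ \mathbf{U}=\mathbf{u})}
\;=\; \frac{\mathrm{Count}(x,\mathbf{u})}{\mathrm{Count}(\mathbf{u})}.
\]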

Page 7:

Announcements

Late homeworks: 3 late days for the semester
One late day corresponds to 24 hours (i.e., using all 3 late days makes a homework due Saturday by noon)
Give late homeworks to Monica Hopes, Wean Hall 4619; if she is not in her office, time stamp (date and time) your homework, sign it, and put it under her door
After late days are used up: half credit within 48 hours, zero credit after 48 hours
All homeworks must be handed in, even for zero credit
Homework 2 out later today
Recitation tomorrow: review perfect maps, parameter learning

Page 8:


Can we really trust MLE?

What is better?
3 heads, 2 tails
30 heads, 20 tails
3×10^23 heads, 2×10^23 tails

Many possible answers, we need distributions over possible parameters

Page 9:

Bayesian Learning

Use Bayes rule:

Or equivalently:
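Both forms, which were handwritten on the slide, spelled out for parameters θ and data D:

\[
P(\theta \mid D) = \frac{P(D\mid\theta)\,P(\theta)}{P(D)}
\qquad\text{or equivalently}\qquad
P(\theta\mid D) \;\propto\; P(D\mid\theta)\,P(\theta).
\]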

Page 10:

Bayesian Learning for Thumbtack

Likelihood function is simply Binomial:

What about the prior?
Represent expert knowledge
Simple posterior form
Conjugate priors:
Closed-form representation of posterior (more details soon)
For Binomial, conjugate prior is the Beta distribution

Page 11:

Beta prior distribution – P(θ)

Likelihood function:
Posterior:
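Writing the handwritten densities out, with the Beta hyperparameters denoted βH, βT as on the later equivalent-sample-size slide:

\[
P(\theta) = \mathrm{Beta}(\theta;\ \beta_H,\beta_T) \propto \theta^{\beta_H-1}(1-\theta)^{\beta_T-1},
\qquad
P(D\mid\theta) = \theta^{m_H}(1-\theta)^{m_T},
\]
\[
P(\theta\mid D) \;\propto\; \theta^{\beta_H+m_H-1}(1-\theta)^{\beta_T+m_T-1}
\;=\; \mathrm{Beta}(\theta;\ \beta_H+m_H,\ \beta_T+m_T).
\]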

Page 12:

Posterior distribution

Prior: Beta(βH, βT). Data: mH heads and mT tails

Posterior distribution:

Page 13:

Conjugate prior

Given likelihood function P(D|θ)

(Parametric) prior of the form P(θ|α) is conjugate to the likelihood function if the posterior is of the same parametric family and can be written as P(θ|α′), for some new set of parameters α′

Prior: Beta(βH, βT). Data: mH heads and mT tails (binomial likelihood). Posterior distribution: Beta(βH + mH, βT + mT)

Page 14:

Using Bayesian posterior

Posterior distribution:

Bayesian inference: no longer a single parameter:
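Written out, the prediction for a new observation x given data D averages over the posterior rather than plugging in a single estimate:

\[
P(x \mid D) \;=\; \int_0^1 P(x\mid\theta)\,P(\theta\mid D)\,d\theta .
\]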

Integral is often hard to compute

Page 15:

Bayesian prediction of a new coin flip

Prior: Beta(βH, βT). Observed mH heads, mT tails; what is the probability that the (m+1)-st flip is heads?
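For the Beta posterior this integral has a closed form (the posterior mean), which is the result worked on the slide:

\[
P\!\left(x^{(m+1)} = H \mid D\right)
= \int_0^1 \theta\,\mathrm{Beta}\!\left(\theta;\ \beta_H+m_H,\ \beta_T+m_T\right)d\theta
= \frac{\beta_H+m_H}{\beta_H+\beta_T+m_H+m_T}.
\]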

Page 16:

Asymptotic behavior and equivalent sample size

Beta prior equivalent to extra thumbtack flips:
As m → ∞, prior is “forgotten”
But, for small sample size, prior is important!
Equivalent sample size:
Prior parameterized by βH, βT, or by m′ (equivalent sample size) and the prior mean βH/m′
Fix m′, change the prior mean
Fix the prior mean, change m′
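A worked identity that makes the equivalent-sample-size reading explicit: with m = mH + mT observed flips and m′ = βH + βT prior flips, the predictive probability is a convex combination of the prior mean and the MLE,

\[
\frac{\beta_H+m_H}{m'+m}
= \frac{m'}{m'+m}\cdot\frac{\beta_H}{m'}
\;+\; \frac{m}{m'+m}\cdot\frac{m_H}{m},
\]

so as m grows the data term dominates and the prior is forgotten.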

Page 17:

Bayesian learning corresponds to smoothing

m = 0 ⇒ prior parameter; m → ∞ ⇒ MLE


Page 18:

Bayesian learning for multinomial

What if you have a k-sided coin? Likelihood function is multinomial:

Conjugate prior for multinomial is Dirichlet:

Observe m data points, mi from assignment i, posterior:

Prediction:
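Spelling out the handwritten pieces, with Dirichlet hyperparameters α1, …, αk:

\[
P(D\mid\theta) = \prod_{i=1}^{k}\theta_i^{m_i},
\qquad
P(\theta) = \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_k) \propto \prod_{i=1}^{k}\theta_i^{\alpha_i-1},
\]
\[
P(\theta\mid D) = \mathrm{Dirichlet}(\alpha_1+m_1,\dots,\alpha_k+m_k),
\qquad
P\!\left(x^{(m+1)}=i \mid D\right) = \frac{\alpha_i+m_i}{\sum_j(\alpha_j+m_j)}.
\]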

Page 19:

Bayesian learning for two-node BN

Parameters θX, θY|X
Priors: P(θX):
P(θY|X):

Page 20:

Very important assumption on prior: Global parameter independence

Global parameter independence: prior over parameters is the product of priors over the CPTs
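In symbols, for a BN G over X1, …, Xn:

\[
P(\theta_G) \;=\; \prod_{i=1}^{n} P\!\left(\theta_{X_i\mid \mathrm{Pa}_{X_i}}\right).
\]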

Page 21:

Global parameter independence, d-separation and local prediction

[Figure: example BN over Flu, Allergy, Sinus, Headache, Nose]

Independencies in meta BN:

Proposition: For fully observable data D, if the prior satisfies global parameter independence, then so does the posterior: P(θG|D) = ∏i P(θXi|PaXi | D)

Page 22:

Within a CPT

Meta BN including CPT parameters:

Are θY|X=t and θY|X=f d-separated given D?
Are θY|X=t and θY|X=f independent given D? Context-specific independence!!!

Posterior decomposes:

Page 23:

Priors for BN CPTs (more when we talk about structure learning)

Consider each CPT: P(X|U=u)
Conjugate prior: Dirichlet(αX=1|U=u, …, αX=k|U=u)
More intuitive: “prior data set” D′ with m′ equivalent sample size
“prior counts”:
Prediction:
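A minimal sketch of the “prior counts” idea in plain Python; the function name, the dict-based data layout, and the uniform split of m′ across all (x, u) cells (a BDeu-style choice) are illustrative assumptions, not the slide's notation:

```python
from collections import Counter
from itertools import product

def bayesian_cpt(data, child, parents, card, m_prime=1.0):
    """Estimate P(child | parents) from data with Dirichlet 'prior counts'.

    data    : list of dicts mapping variable name -> value in 0..card-1
    child   : name of X
    parents : list of parent names U
    card    : dict mapping variable name -> number of values
    m_prime : equivalent sample size, spread uniformly over all (x, u) cells
    """
    # uniform prior count for every (x, u) cell
    n_cells = card[child]
    for p in parents:
        n_cells *= card[p]
    alpha = m_prime / n_cells

    counts = Counter((tuple(d[p] for p in parents), d[child]) for d in data)

    cpt = {}
    for u in product(*(range(card[p]) for p in parents)):
        denom = sum(counts[(u, x)] + alpha for x in range(card[child]))
        for x in range(card[child]):
            # Bayesian prediction: (prior count + data count) / total count
            cpt[(x, u)] = (counts[(u, x)] + alpha) / denom
    return cpt

# toy usage: X -> Y, both binary
data = [{"X": 1, "Y": 1}, {"X": 1, "Y": 0}, {"X": 0, "Y": 0}]
print(bayesian_cpt(data, "Y", ["X"], {"X": 2, "Y": 2}, m_prime=2.0))
```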

Page 24:

An example

Page 25:

What you need to know about parameter learning

MLE:
score decomposes according to CPTs
optimize each CPT separately
Bayesian parameter learning:
motivation for Bayesian approach
Bayesian prediction
conjugate priors, equivalent sample size
Bayesian learning ⇒ smoothing
Bayesian learning for BN parameters:
global parameter independence
decomposition of prediction according to CPTs
decomposition within a CPT

Page 26:

Where are we with learning BNs?

Given structure, estimate parameters:
Maximum likelihood estimation
Bayesian learning

What about learning structure?

Page 27:

Learning the structure of a BN

Constraint-based approach:
BN encodes conditional independencies
Test conditional independencies in data
Find an I-map
Score-based approach:
Finding a structure and parameters is a density estimation task
Evaluate model as we evaluated parameters:
Maximum likelihood, Bayesian, etc.

Data: <x1(1), …, xn(1)>, …, <x1(m), …, xn(m)>
[Figure: example BN over Flu, Allergy, Sinus, Headache, Nose]
Learn structure and parameters

Page 28:

Remember: Obtaining a P-map?

Given the independence assertions that are true for P:
Obtain skeleton
Obtain immoralities
From skeleton and immoralities, obtain every (and any) BN structure from the equivalence class
Constraint-based approach: use the Learn-PDAG algorithm
Key question: independence test

Page 29:

Independence tests

Statistically difficult task!
Intuitive approach: mutual information
Mutual information and independence: Xi and Xj are independent if and only if I(Xi, Xj) = 0
Conditional mutual information:
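The definitions, which were handwritten on the slide:

\[
I(X_i, X_j) = \sum_{x_i,x_j} P(x_i,x_j)\,\log\frac{P(x_i,x_j)}{P(x_i)\,P(x_j)},
\qquad
I(X_i, X_j \mid \mathbf{U}) = \sum_{x_i,x_j,\mathbf{u}} P(x_i,x_j,\mathbf{u})\,\log\frac{P(x_i,x_j\mid\mathbf{u})}{P(x_i\mid\mathbf{u})\,P(x_j\mid\mathbf{u})}.
\]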

Page 30:

Independence tests and the constraint based approach

Using the data D:
Empirical distribution:
Mutual information:
Similarly for conditional MI
Use the Learn-PDAG algorithm: when the algorithm asks (Xi ⊥ Xj | U)?
Many other types of independence tests; see reading…
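A minimal sketch of the empirical MI test described above, assuming discrete data stored as integer columns of a NumPy array; the threshold t is an illustrative knob, not a value from the lecture, and the conditional version would add an outer loop over assignments to U:

```python
import numpy as np

def empirical_mi(data, i, j):
    """I_hat(X_i, X_j) from samples, using count-based (empirical) probabilities."""
    xi, xj = data[:, i], data[:, j]
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(xj):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def looks_independent(data, i, j, t=0.01):
    """Answer the Learn-PDAG query (X_i independent of X_j)? by thresholding empirical MI."""
    return empirical_mi(data, i, j) < t

# toy data: X1 is a noisy copy of X0, X2 is unrelated
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 1000)
x1 = np.where(rng.random(1000) < 0.9, x0, 1 - x0)
x2 = rng.integers(0, 2, 1000)
data = np.column_stack([x0, x1, x2])
print(looks_independent(data, 0, 1), looks_independent(data, 0, 2))
```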

Page 31:

Score-based approach

Data: <x1(1), …, xn(1)>, …, <x1(m), …, xn(m)>
[Figure: possible structures over Flu, Allergy, Sinus, Headache, Nose]
For each possible structure: learn parameters, then score the structure

Page 32:

Information-theoretic interpretation of maximum likelihood

Given structure, log likelihood of data:
[Figure: example BN over Flu, Allergy, Sinus, Headache, Nose]

Page 33:

Information-theoretic interpretation of maximum likelihood 2

Given structure, log likelihood of data:
[Figure: example BN over Flu, Allergy, Sinus, Headache, Nose]
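The derivation on these two slides was handwritten; its end result, with Î and Ĥ denoting mutual information and entropy under the empirical distribution of D and θ̂ the MLE parameters, is

\[
\log P(D \mid \hat\theta, G)
\;=\; m\sum_{i=1}^{n} \hat I\!\left(X_i;\ \mathrm{Pa}^{G}_{X_i}\right)
\;-\; m\sum_{i=1}^{n} \hat H(X_i).
\]

The entropy term does not depend on the structure G, so maximizing likelihood over structures means choosing parent sets with high empirical mutual information.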

Page 34:

Decomposable score

Log data likelihood

Decomposable score: decomposes over families in the BN (a node and its parents)
Will lead to significant computational efficiency!!!
Score(G : D) = Σi FamScore(Xi | PaXi : D)

Page 35:

How many trees are there?
Nonetheless – efficient optimal algorithm finds the best tree
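The count itself was left to the board; presumably the point is Cayley's formula, by which the number of distinct labeled (undirected) trees on n nodes is already super-exponential:

\[
n^{\,n-2}.
\]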

Page 36:

Scoring a tree 1: I-equivalent trees

Page 37:

Scoring a tree 2: similar trees

Page 38:

Chow-Liu tree learning algorithm 1

For each pair of variables Xi, Xj:
Compute empirical distribution:
Compute mutual information:
Define a graph:
Nodes X1, …, Xn
Edge (i, j) gets weight Î(Xi, Xj)
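A short sketch of this step, reusing the hypothetical empirical_mi helper and NumPy data layout from the independence-testing sketch above:

```python
import numpy as np  # assumes empirical_mi(data, i, j) from the earlier sketch is in scope

def chow_liu_weights(data):
    """w[i, j] = I_hat(X_i, X_j): pairwise edge weights for the Chow-Liu graph."""
    n = data.shape[1]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = w[j, i] = empirical_mi(data, i, j)
    return w
```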

Page 39:

Chow-Liu tree learning algorithm 2

Optimal tree BN:
Compute maximum weight spanning tree
Directions in BN: pick any node as root; breadth-first search defines directions
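A minimal sketch of both steps under the same assumptions: a dense Prim-style maximum spanning tree over the weight matrix from the previous sketch, then BFS from an arbitrary root to direct the edges (node 0 as root is an arbitrary choice, not something the slide specifies):

```python
from collections import deque

def max_spanning_tree(w):
    """Prim-style maximum weight spanning tree on a dense weight matrix."""
    n = len(w)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # pick the heaviest edge leaving the current tree
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: w[e[0]][e[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

def orient_from_root(edges, n, root=0):
    """Breadth-first search from the root; each traversed edge becomes parent -> child."""
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in seen:
                directed.append((u, v))  # u is v's parent in the tree BN
                seen.add(v)
                queue.append(v)
    return directed

# usage with the weights from the previous sketch:
# edges = max_spanning_tree(chow_liu_weights(data))
# directed_edges = orient_from_root(edges, data.shape[1], root=0)
```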

Page 40:

Can we extend Chow-Liu 1

Tree-augmented naïve Bayes (TAN) [Friedman et al. ’97]
Naïve Bayes model overcounts, because correlation between features is not considered
Same as Chow-Liu, but score edges with:
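The edge weight, which was handwritten, is the class-conditional mutual information, with C the class variable:

\[
w_{ij} \;=\; \hat I(X_i, X_j \mid C)
\;=\; \sum_{x_i,x_j,c}\hat P(x_i,x_j,c)\,\log\frac{\hat P(x_i,x_j\mid c)}{\hat P(x_i\mid c)\,\hat P(x_j\mid c)}.
\]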

Page 41:

Can we extend Chow-Liu 2

(Approximately) learning models with tree-width up to k [Narasimhan & Bilmes ’04]
But, O(n^(k+1))… and more subtleties

Page 42:

What you need to know about learning BN structures so far

Decomposable scores:
Maximum likelihood
Information-theoretic interpretation
Best tree (Chow-Liu)
Best TAN
Nearly best k-treewidth (in O(n^(k+1)))