
Page 1: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Probabilistic Graphical Models:

Bayesian Networks

Li Xiong

Slide credits: Page (Wisconsin) CS760, Zhu (Wisconsin) KDD ’12 tutorial

Page 2: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Outline

Graphical models

Bayesian networks - definition

Bayesian networks - inference

Bayesian networks - learning


Page 3: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Overview

The envelope quiz

There are two envelopes: one contains a red ball (worth $100) and a black ball; the other contains two black balls.

You randomly picked an envelope and randomly took out a ball – it was black.

At this point, you are given the option to switch envelopes. Should you?

Page 4: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Overview

The envelope quiz

Random variables E ∈ {1, 0}, B ∈ {r, b}

P (E = 1) = P (E = 0) = 1/2

P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0

We ask: P (E = 1 | B = b)

Page 5: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Overview

The envelope quiz

Random variables E ∈ {1, 0}, B ∈ {r, b}

P (E = 1) = P (E = 0) = 1/2

P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0

We ask: P (E = 1 | B = b)

P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 × 1/2) / (3/4) = 1/3

The graphical model: E → B
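As a sanity check on the arithmetic above, here is a short Python sketch (not part of the slides) that enumerates the four equally likely (envelope, ball) outcomes and conditions on drawing a black ball.

from fractions import Fraction

# Enumerate (envelope, ball) outcomes with their probabilities and condition on
# drawing a black ball; confirms P(E = 1 | B = b) = 1/3.
outcomes = []
for envelope, balls in [(1, ["r", "b"]), (0, ["b", "b"])]:
    for ball in balls:
        # 1/2 for the envelope choice, 1/2 for the ball within it
        outcomes.append((envelope, ball, Fraction(1, 2) * Fraction(1, 2)))

p_black = sum(w for _, ball, w in outcomes if ball == "b")
p_e1_and_black = sum(w for env, ball, w in outcomes if env == 1 and ball == "b")
print(p_e1_and_black / p_black)  # 1/3, so switching wins the $100 with probability 2/3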

Page 6: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Overview

Reasoning with uncertainty

• A set of random variables x1, . . . , xn

e.g. (x1, . . . , xn−1) a feature vector, xn ≡ y the class label

• Inference: given joint distribution p(x1, . . . , xn), compute p(XQ | XE) where XQ ∪ XE ⊆ {x1, . . . , xn}

e.g. Q = {n}, E = {1, . . . , n − 1}; by the definition of conditional probability,

p(xn | x1, . . . , xn−1) = p(x1, . . . , xn−1, xn) / Σv p(x1, . . . , xn−1, xn = v)

• Learning: estimate p(x1, . . . , xn) from training data X(1), . . . , X(N), where X(i) = (x1(i), . . . , xn(i))

Page 7: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Overview

It is difficult to reason with uncertainty

• joint distribution p(x1, . . . , xn)

exponential naive storage (2^n for binary r.v.)

hard to interpret (conditional independence)

• inference p(XQ |XE )

Often can’t afford to do it by brute force

• If p(x1, . . . , xn) not given, estimate it from data

Often can’t afford to do it by brute force

• Graphical model: efficient representation, inference, and learning on p(x1, . . . , xn), exactly or approximately

Page 8: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Definitions

Graphical-Model-Nots

Graphical modeling is the study of probabilistic models.

Just because there are nodes and edges doesn't mean it's a graphical model.

These are not graphical models: neural network, decision tree, network flow, HMM template.

Page 9: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Graphical Models

Bayesian networks – directed

Markov networks – undirected


Page 10: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Outline

Graphical models

Bayesian networks - definition

Bayesian networks - inference

Bayesian networks - learning


Page 11: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian Networks: Intuition

• A graphical representation for a joint probability distribution

Nodes are random variables

Directed edges between nodes reflect dependence

• Some informal examples:

[diagrams: UnderstoodMaterial → {AssignmentGrade, ExamGrade};  {Fire, SmokingAtSensor} → Alarm]

Page 12: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian networks

• a BN consists of a Directed Acyclic Graph (DAG) and a set of conditional probability distributions

• in the DAG

– each node denotes a random variable

– each edge from X to Y represents that X directly influences Y

– formally: each variable X is independent of its non-descendants given its parents

• each node X has a conditional probability distribution (CPD) representing P(X | Parents(X))

Page 13: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Definitions Directed Graphical Models

Conditional Independence

Two r.v.s A, B are independent if P(A, B) = P(A)P(B) or P(A | B) = P(A) (the two are equivalent)

Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C) or P(A | B, C) = P(A | C) (the two are equivalent)

Page 14: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian networks

• a BN provides a compact representation of a joint probability distribution:

P(X1, … , Xn) = ∏i=1..n P(Xi | Parents(Xi))

• using the chain rule, a joint probability distribution can be expressed as

P(X1, … , Xn) = P(X1) ∏i=2..n P(Xi | X1, … , Xi−1)
Page 15: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian network example

• Consider the following 5 binary random variables:

B = a burglary occurs at your house

E = an earthquake occurs at your house

A = the alarm goes off

J = John calls to report the alarm

M = Mary calls to report the alarm

• Suppose we want to answer queries like: what is P(B | M, J)?

Page 16: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian network example

Network structure: Burglary → Alarm ← Earthquake;  Alarm → JohnCalls;  Alarm → MaryCalls

P(B):
  t      f
  0.001  0.999

P(E):
  t      f
  0.001  0.999

P(A | B, E):
  B  E    t      f
  t  t    0.95   0.05
  t  f    0.94   0.06
  f  t    0.29   0.71
  f  f    0.001  0.999

P(J | A):
  A    t     f
  t    0.9   0.1
  f    0.05  0.95

P(M | A):
  A    t     f
  t    0.7   0.3
  f    0.01  0.99

Page 17: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian networks

P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)

• a standard representation of the joint distribution for the Alarm example has 2^5 = 32 parameters

• the BN representation of this distribution has 20 parameters

Burglary Earthquake

Alarm

JohnCalls MaryCalls
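To make the factorization concrete, here is a small Python sketch of the Alarm network using the CPT values from the slides; the variable names and dictionary layout are illustrative choices, not from the lecture.

P_B = {True: 0.001, False: 0.999}                      # P(B)
P_E = {True: 0.001, False: 0.999}                      # P(E)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}     # P(A = true | B, E)
P_J = {True: 0.9, False: 0.05}                         # P(J = true | A)
P_M = {True: 0.7, False: 0.01}                         # P(M = true | A)

def bern(p_true, value):
    # probability of a binary variable taking `value` when P(true) = p_true
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    # P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
    return (bern(P_B[True], b) * bern(P_E[True], e) * bern(P_A[(b, e)], a) *
            bern(P_J[a], j) * bern(P_M[a], m))

print(joint(True, False, True, True, True))            # e.g. P(b, ¬e, a, j, m)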

Page 18: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian networks

• consider a case with 10 binary random variables

• How many parameters does a BN with the following graph structure have?

[figure: a 10-node DAG; the per-node CPTs have 2, 4, 4, 4, 4, 4, 4, 4, 8, 4 parameters, for a total of 42]

• How many parameters does the standard table representation of the joint distribution have? 2^10 = 1024

Page 19: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Advantages of the Bayesian

network representation

• Captures independence and conditional independence where they exist

• Encodes the relevant portion of the full joint among

variables where dependencies exist

• Uses a graphical representation which lends insight into

the complexity of inference

Page 20: Probabilistic Graphical Models: Bayesian Networks Li Xiong


Bayesian Networks

Graphical models

Bayesian networks - definition

Bayesian networks – inference

Exact inference

Approximate inference

Bayesian networks – learning

Parameter learning

Network learning

Page 21: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The inference task in Bayesian networks

Given: values for some variables in the network (evidence), and a set of query variables

Do: compute the posterior distribution over the query variables

• variables that are neither evidence variables nor query variables are other variables

• the BN representation is flexible enough that any set can be the evidence variables and any set can be the query variables

Page 22: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Recall Naïve Bayesian Classifier

Derive the maximum posterior:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

Independence assumption:

P(X | Ci) = ∏k=1..n P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

Simplified network: the class node C is the only parent of each attribute node xk

Page 23: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Exact Inference

Exact Inference by Enumeration

Let X = (XQ, XE, XO) for query, evidence, and other variables.

Infer P(XQ | XE)

By definition,

P(XQ | XE) = P(XQ, XE) / P(XE) = ΣXO P(XQ, XE, XO) / ΣXQ,XO P(XQ, XE, XO)

Page 24: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference by enumeration example

• let a denote A=true, and ¬a denote A=false

• suppose we’re given the query: P(b | j, m)

“probability the house is being burglarized given that John and Mary both called”

• from the graph structure we can first compute:

P(b, j, m) = Σe Σa P(b) P(e) P(a | b, e) P(j | a) P(m | a)

sum over possible values for the E and A variables (e, ¬e, a, ¬a)

A

B E

MJ

Page 25: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference by enumeration example

P(B) = 0.001      P(E) = 0.001

P(A | B, E):
  B  E  P(A)
  t  t  0.95
  t  f  0.94
  f  t  0.29
  f  f  0.001

P(J | A):
  A  P(J)
  t  0.9
  f  0.05

P(M | A):
  A  P(M)
  t  0.7
  f  0.01

P(b, j, m) = Σe Σa P(b) P(e) P(a | b, e) P(j | a) P(m | a)
           = P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)
           = 0.001 × ( 0.001 × 0.95 × 0.9 × 0.7        [e, a]
                     + 0.001 × 0.05 × 0.05 × 0.01      [e, ¬a]
                     + 0.999 × 0.94 × 0.9 × 0.7        [¬e, a]
                     + 0.999 × 0.06 × 0.05 × 0.01 )    [¬e, ¬a]

[network: B, E → A → J, M]

Page 26: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference by enumeration example

• now do equivalent calculation for P(¬b, j, m)

• and determine P(b | j, m)

P(b | j, m) = P(b, j, m) / P(j, m) = P(b, j, m) / ( P(b, j, m) + P(¬b, j, m) )
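A minimal enumeration sketch of this query, reusing the joint() function from the earlier block; it simply sums out E and A with J and M clamped to true. With the CPT values shown on these slides the result is roughly 0.31 (my own calculation, not a figure from the slides).

from itertools import product

def prob_b_given_jm():
    num = den = 0.0
    for b, e, a in product([True, False], repeat=3):
        p = joint(b, e, a, True, True)     # evidence: J = true, M = true
        den += p                           # accumulates P(j, m)
        if b:
            num += p                       # accumulates P(b, j, m)
    return num / den

print(prob_b_given_jm())                   # ≈ 0.31 with the CPTs above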

Page 27: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Exact Inference

Exact Inference by Enumeration

Let X = (XQ, XE, XO) for query, evidence, and other variables.

Infer P(XQ | XE)

By definition,

P(XQ | XE) = P(XQ, XE) / P(XE) = ΣXO P(XQ, XE, XO) / ΣXQ,XO P(XQ, XE, XO)

Computational issue: summing an exponential number of terms – with k variables in XO each taking r values, there are r^k terms

Page 28: Probabilistic Graphical Models: Bayesian Networks Li Xiong


Bayesian Networks

Graphical models

Bayesian networks - definition

Bayesian networks – inference

Exact inference

Approximate inference

Bayesian networks – learning

Parameter learning

Network learning

Page 29: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Approximate (Monte Carlo) Inference

in Bayes Nets

• Basic idea: repeatedly generate data samples according to the distribution represented by the Bayes Net

• Estimate the probability P(XQ | XE)

A

B E

MJ

Page 30: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Markov Chain Monte Carlo

Forward Sampling: Example

P(A | B, E) = 0.95

P(A | B, ~E) = 0.94

P(A | ~B, E) = 0.29

P(A | ~B, ~E) =0.001

P(J | A) =0.9

P(J | ~A) = 0.05 P(M | ~A) = 0.01

A

J M

P(M | A) = 0.7

B E

P(E)=0.002P(B)=0.001

To generate a sample X = (B, E, A, J,M ):

Sample B ∼ Ber(0.001): r ∼ U(0, 1); if (r < 0.001) then B = 1 else B = 0

Sample E ∼ Ber(0.002)

If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on

If A = 1 sample J ∼ Ber(0.9) else J ∼ Ber(0.05)

If A = 1 sample M ∼ Ber(0.7) else M ∼ Ber(0.01)
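The sampling procedure above translates directly into code. A rough Python sketch, using the CPT values shown on this slide (P(B) = 0.001, P(E) = 0.002); the function name and tuple layout are illustrative.

import random

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p):
    # one Bernoulli draw: r ~ U(0, 1), return 1 if r < p else 0
    return 1 if random.random() < p else 0

def forward_sample():
    # sample each node after its parents (topological order)
    b = bern(0.001)
    e = bern(0.002)
    a = bern(P_A[(b, e)])
    j = bern(0.9 if a else 0.05)
    m = bern(0.7 if a else 0.01)
    return b, e, a, j, m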

Page 31: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Markov Chain Monte Carlo

Inference with Forward Sampling

Given inference task: P(B = 1 | E = 1, M = 1)

Throw away all samples except those with (E = 1, M = 1)

p(B = 1 | E = 1, M = 1) ≈ (1/m) Σi=1..m 1(B(i) = 1)

where m is the number of surviving samples

Issue: can be highly inefficient (note P(E = 1) is tiny), so few samples agree with the evidence
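The rejection-style estimate above can be written as a loop over forward samples; this sketch reuses forward_sample() from the previous block. Because P(E = 1) = 0.002, almost all samples are thrown away, which is exactly the inefficiency noted on the slide.

def estimate_b_given_e_m(n_samples=1_000_000):
    kept = hits = 0
    for _ in range(n_samples):
        b, e, a, j, m = forward_sample()
        if e == 1 and m == 1:        # keep only samples consistent with the evidence
            kept += 1
            hits += b                # count surviving samples with B = 1
    return hits / kept if kept else float("nan")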

Page 32: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Markov Chain Monte Carlo

• Fix evidence variables, sample non-evidence variables

• Generate the next setting probabilistically based on the current setting (Markov chain)

• Gibbs Sampling for Bayes Networks

Page 33: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Markov Chain Monte Carlo

Gibbs Sampler: Example P (B = 1 |E = 1,M = 1)

Initialization: fix evidence; randomly set other variables, e.g. X(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)

P(A | B, E) = 0.95

P(A | B, ~E) = 0.94

P(A | ~B, E) = 0.29

P(A | ~B, ~E) =0.001

P(J | A) = 0.9

P(J | ~A) =0.05

A

J

B

P(E)=0.002P(B)=0.001

E=1

M=1

P(M | A) = 0.7

P(M | ~A) =0.01

Page 34: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Markov Chain Monte Carlo

Gibbs Update

• For each non-evidence variable xi, fixing all other nodes X−i, resample xi ∼ P(xi | X−i), which is equivalent to xi ∼ P(xi | MarkovBlanket(xi))

• MarkovBlanket(xi) includes xi's parents, spouses, and children:

P(xi | MarkovBlanket(xi)) ∝ P(xi | Pa(xi)) ∏y∈C(xi) P(y | Pa(y))

where Pa(x) are the parents of x, and C(x) the children of x.

Example: B ∼ P(B | E = 1, A = 0, J = 0, M = 1) ∝ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)

P(J | A) = 0.9 P(M | A) =0. 7

P(B)=0.001 P(E)=0.002

B E=1

P(A | B, E) = 0.95

P(A | B, ~E) = 0.94

P(A | ~B, E) = 0.29

P(A | ~B, ~E) =0.001

A

J M=1

Page 35: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Markov Chain Monte Carlo

Gibbs Update

• Say we sampled B = 1. Then

X(1) = (B = 1, E = 1, A = 0, J = 0, M = 1)

• Sample A ∼ P (A | B = 1, E = 1, J = 0, M = 1) to get X(2)

• Move on to J, then repeat B, A, J, B, A, J, . . .

• Keep all later samples; P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.

P(A | B, E) = 0.95

P(A | B, ~E) = 0.94

P(A | ~B, E) = 0.29

P(A | ~B, ~E) =0.001

P(J | A) = 0.9

P(J | ~A) =0.05

A

J

B

P(E)=0.002P(B)=0.001

E=1

M=1

P(M | A) = 0.7

P(M | ~A) =0.01

Page 36: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inference Markov Chain Monte Carlo

Gibbs Sampling as a Markov Chain

• A Markov chain is defined by a transition matrix T(X′ | X)

• Certain Markov chains have a stationary distribution π such that π = Tπ

• The Gibbs sampler is such a Markov chain, with Ti((X−i, x′i) | (X−i, xi)) = p(x′i | X−i), and its stationary distribution is p(XQ | XE)

• It takes time for the chain to reach stationary distribution

• In practice: discard early ones
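Putting the last few slides together, here is a Gibbs-sampling sketch for P(B = 1 | E = 1, M = 1) in the alarm network, with E and M clamped and B, A, J resampled from their Markov blankets. The CPT values are the ones shown above; the burn-in length and iteration count are arbitrary illustrative choices.

import random

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_B, P_J, P_M = 0.001, {1: 0.9, 0: 0.05}, {1: 0.7, 0: 0.01}

def bern(p):
    return 1 if random.random() < p else 0

def gibbs_b_given_e_m(iters=50_000, burn_in=1_000):
    e, m = 1, 1                        # evidence, never resampled
    b, a, j = 0, 0, 0                  # arbitrary initialization of the free variables
    hits = kept = 0
    for t in range(iters):
        # resample B from its Markov blanket {E, A}: P(B) P(A | B, E)
        w1 = P_B * (P_A[(1, e)] if a else 1 - P_A[(1, e)])
        w0 = (1 - P_B) * (P_A[(0, e)] if a else 1 - P_A[(0, e)])
        b = bern(w1 / (w1 + w0))
        # resample A from its Markov blanket {B, E, J, M}: P(A | B, E) P(J | A) P(M | A)
        pj1, pj0 = (P_J[1], P_J[0]) if j else (1 - P_J[1], 1 - P_J[0])
        pm1, pm0 = (P_M[1], P_M[0]) if m else (1 - P_M[1], 1 - P_M[0])
        w1 = P_A[(b, e)] * pj1 * pm1
        w0 = (1 - P_A[(b, e)]) * pj0 * pm0
        a = bern(w1 / (w1 + w0))
        # resample J from its Markov blanket {A}
        j = bern(P_J[a])
        if t >= burn_in:               # discard early samples before the chain mixes
            kept += 1
            hits += b
    return hits / kept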

Page 37: Probabilistic Graphical Models: Bayesian Networks Li Xiong


Bayesian Networks

Graphical models

Bayesian networks - definition

Bayesian networks – inference

Exact inference

Approximate inference

Bayesian networks – learning

Parameter learning

Parameter learning and inference with partial data

Network learning

Page 38: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The parameter learning task

• Given: a set of training instances, the graph structure of a BN

• Do: infer the parameters of the CPDs

B E A J M

f f f t f

f t f f f

f f t f t

Burglary Earthquake

Alarm

JohnCalls MaryCalls

Page 39: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The parameter and data learning task

• Given: a set of training instances (with some missing or

unobservable data), the graph structure of a BN

• Do: infer the parameters of the CPDs

infer the missing data values

B E A J M

f f ? t f

f t ? f f

f f ? f t

Burglary Earthquake

Alarm

JohnCalls MaryCalls

Page 40: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The structure learning task

• Given: a set of training instances

• Do: infer the graph structure (and perhaps the parameters of the CPDs too)

B E A J M

f f f t f

f t f f f

f f t f t

Page 41: Probabilistic Graphical Models: Bayesian Networks Li Xiong


Bayesian Networks

Graphical models

Bayesian networks - definition

Bayesian networks – inference

Exact inference

Approximate inference

Bayesian networks – learning

Parameter learning

Parameter and data learning with partial data

Network learning

Page 42: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The parameter learning task

• Given: a set of training instances, the graph structure of a BN

• Do: infer the parameters of the CPDs

B E A J M

f f f t f

f t f f f

f f t f t

Burglary Earthquake

Alarm

JohnCalls MaryCalls

Page 43: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Parameter learning and maximum

likelihood estimation

• maximum likelihood estimation (MLE)

– given a model structure (e.g. a Bayes net graph) and a set of data D

– set the model parameters θ to maximize P(D | θ)

• i.e. make the data D look as likely as possible under the model P(D | θ)

Page 44: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Maximum likelihood estimation

consider trying to estimate the parameter θ (probability of heads) of a biased coin from a sequence of flips

x = 1, 1, 1, 0, 1, 0, 0, 1, 0, 1

the likelihood function for θ is given by:

L(θ) = P(x | θ) = θ^x1 (1 − θ)^(1−x1) × … × θ^xn (1 − θ)^(1−xn) = θ^(Σ xi) (1 − θ)^(n − Σ xi)

for h heads in n flips, the MLE is θ = h/n
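A quick numerical check of this result (not part of the slides): for the flip sequence above, a grid search over θ peaks at the same value as the closed-form MLE h/n = 6/10.

flips = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
h, n = sum(flips), len(flips)

def likelihood(theta):
    # L(theta) = theta^h (1 - theta)^(n - h)
    return theta ** h * (1 - theta) ** (n - h)

best = max((i / 1000 for i in range(1001)), key=likelihood)
print(best, h / n)    # both are 0.6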

Page 45: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Maximum likelihood estimation

consider estimating the CPD parameters for B and J in the alarm network given the following data set:

  B  E  A  J  M
  f  f  f  t  f
  f  t  f  f  f
  f  f  f  t  t
  t  f  f  f  t
  f  f  t  t  f
  f  f  t  f  t
  f  f  t  t  t
  f  f  t  t  t

[network: B, E → A → J, M]

P(b) = 1/8 = 0.125       P(¬b) = 7/8 = 0.875
P(j | a) = 3/4 = 0.75    P(¬j | a) = 1/4 = 0.25
P(j | ¬a) = 2/4 = 0.5    P(¬j | ¬a) = 2/4 = 0.5

Page 46: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Maximum likelihood estimation

suppose instead, our data set was this…

  B  E  A  J  M
  f  f  f  t  f
  f  t  f  f  f
  f  f  f  t  t
  f  f  f  f  t
  f  f  t  t  f
  f  f  t  f  t
  f  f  t  t  t
  f  f  t  t  t

P(b) = 0/8 = 0      P(¬b) = 8/8 = 1

do we really want to set this to 0?

Page 47: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Maximum a posteriori (MAP) estimation

• instead of estimating parameters strictly from the data, we could start with some prior belief for each

• for example, we could use Laplace estimates

• where nv represents the number of occurrences of value v

P(X = x) = (nx + 1) / Σv∈Values(X) (nv + 1)

(the added 1s are pseudocounts)

Page 48: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Maximum a posteriori estimation

a more general form: m-estimates

P(X = x) = (nx + px·m) / ( Σv∈Values(X) nv + m )

where m is the number of "virtual" instances and px is the prior probability of value x

Page 49: Probabilistic Graphical Models: Bayesian Networks Li Xiong

M-estimates example

B E A J M

f f f t f

f t f f f

f f f t t

f f f f t

f f t t f

f f t f t

f f t t t

f f t t t

A

B E

MJ

now let’s estimate parameters for B using m=4 and pb=0.25

P(b) = (0 + 0.25 × 4) / (8 + 4) = 1/12 ≈ 0.08

P(¬b) = (8 + 0.75 × 4) / (8 + 4) = 11/12 ≈ 0.92
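A small helper for m-estimates, checked against this example; the function name and Counter layout are illustrative choices, not from the slides.

from collections import Counter

def m_estimate(counts, value, m, prior):
    # P(X = x) = (n_x + p_x * m) / (sum_v n_v + m)
    n_total = sum(counts.values())
    return (counts[value] + prior[value] * m) / (n_total + m)

counts = Counter({"b": 0, "not_b": 8})
prior = {"b": 0.25, "not_b": 0.75}
print(m_estimate(counts, "b", 4, prior))      # 1/12 ≈ 0.08
print(m_estimate(counts, "not_b", 4, prior))  # 11/12 ≈ 0.92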

Page 50: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Incorporating a Prior

• The pseudo-counts are really parameters of a beta distribution

• beta distribution (a, b) is the conjugate prior for the parameter p given the likelihood function

• We can specify initial parameters a and b such that a/(a+b)=p, and specify confidence in this belief with high initial values for a and b

• After h heads in n trials, the posterior estimate is P(heads) = (a + h)/(a + b + n).

Page 51: Probabilistic Graphical Models: Bayesian Networks Li Xiong
Page 52: Probabilistic Graphical Models: Bayesian Networks Li Xiong


Bayesian Networks

Graphical models

Bayesian networks - definition

Bayesian networks – inference

Exact inference

Approximate inference

Bayesian networks – learning

Parameter learning

Parameter learning + inference

Network learning

Page 53: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The parameter learning task + inference

• Given: a set of training instances (with some missing or

unobservable data), the graph structure of a BN

• Do: infer the parameters of the CPDs

infer the missing data values

B E A J M

f f ? t f

f t ? f f

f f ? f t

Burglary Earthquake

Alarm

JohnCalls MaryCalls

Page 54: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Inferring missing data and

parameter learning with EM

Given:

• data set with some missing values

• model structure, initial model parameters

Repeat until convergence

• Expectation (E) step: using current model, compute expectation over missing values

• Maximization (M) step: given the expectations,

compute/update maximum likelihood (MLE) or

maximum posterior probability (MAP, maximum a

posteriori) parameters

Page 55: Probabilistic Graphical Models: Bayesian Networks Li Xiong

example: EM for parameter learning

B E A J M

f f ? f f

f f ? t f

t f ? t t

f f ? f t

f t ? t f

f f ? f t

t t ? t t

f f ? f f

f f ? t f

f f ? f t

A

B E

MJ

B E P(A)

t t 0.9

t f 0.6

f t 0.3

f f 0.2

P(B)

0.1

P(E)

0.2

A P(J)

t

f

0.9

0.2

A P(M)

t

f

0.8

0.1

suppose we’re given the following initial BN and training set

Page 56: Probabilistic Graphical Models: Bayesian Networks Li Xiong

example: E-step

under the current parameters, each missing value of A is replaced by its posterior P(a | b, e, j, m) (the initial BN parameters are those from the previous slide):

  B  E  A                       J  M
  f  f  t: 0.0069  f: 0.9931    f  f
  f  f  t: 0.2     f: 0.8       t  f
  t  f  t: 0.98    f: 0.02      t  t
  f  f  t: 0.2     f: 0.8       f  t
  f  t  t: 0.3     f: 0.7       t  f
  f  f  t: 0.2     f: 0.8       f  t
  t  t  t: 0.997   f: 0.003     t  t
  f  f  t: 0.0069  f: 0.9931    f  f
  f  f  t: 0.2     f: 0.8       t  f
  f  f  t: 0.2     f: 0.8       f  t

Page 57: Probabilistic Graphical Models: Bayesian Networks Li Xiong

example: E-step

P(a | ¬b, ¬e, ¬j, ¬m) = P(¬b, ¬e, a, ¬j, ¬m) / [ P(¬b, ¬e, a, ¬j, ¬m) + P(¬b, ¬e, ¬a, ¬j, ¬m) ]
  = (0.9 × 0.8 × 0.2 × 0.1 × 0.2) / (0.9 × 0.8 × 0.2 × 0.1 × 0.2 + 0.9 × 0.8 × 0.8 × 0.8 × 0.9)
  = 0.00288 / 0.4176 ≈ 0.0069

P(a | ¬b, ¬e, j, ¬m) = P(¬b, ¬e, a, j, ¬m) / [ P(¬b, ¬e, a, j, ¬m) + P(¬b, ¬e, ¬a, j, ¬m) ]
  = (0.9 × 0.8 × 0.2 × 0.9 × 0.2) / (0.9 × 0.8 × 0.2 × 0.9 × 0.2 + 0.9 × 0.8 × 0.8 × 0.2 × 0.9)
  = 0.02592 / 0.1296 = 0.2

P(a | b, ¬e, j, m) = P(b, ¬e, a, j, m) / [ P(b, ¬e, a, j, m) + P(b, ¬e, ¬a, j, m) ]
  = (0.1 × 0.8 × 0.6 × 0.9 × 0.8) / (0.1 × 0.8 × 0.6 × 0.9 × 0.8 + 0.1 × 0.8 × 0.4 × 0.2 × 0.1)
  = 0.03456 / 0.0352 ≈ 0.98
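These posteriors can be verified with a few lines of Python using the initial parameters from the previous slides; a quick sketch, not part of the lecture.

P_A0 = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.3, (0, 0): 0.2}   # initial P(A = t | B, E)

def pr(p_true, value):
    return p_true if value else 1 - p_true

def posterior_a(b, e, j, m):
    # P(a | b, e, j, m) under the current (initial) parameters
    def joint(a):
        return (pr(0.1, b) * pr(0.2, e) * pr(P_A0[(b, e)], a) *
                pr(0.9 if a else 0.2, j) * pr(0.8 if a else 0.1, m))
    return joint(1) / (joint(1) + joint(0))

print(posterior_a(0, 0, 0, 0))   # ≈ 0.0069
print(posterior_a(0, 0, 1, 0))   # = 0.2
print(posterior_a(1, 0, 1, 1))   # ≈ 0.98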

Page 58: Probabilistic Graphical Models: Bayesian Networks Li Xiong

example: M-step

re-estimate probabilities using expected counts: P(a | b, e) = E[#(a, b, e)] / E[#(b, e)]

  B  E  A                       J  M
  f  f  t: 0.0069  f: 0.9931    f  f
  f  f  t: 0.2     f: 0.8       t  f
  t  f  t: 0.98    f: 0.02      t  t
  f  f  t: 0.2     f: 0.8       f  t
  f  t  t: 0.3     f: 0.7       t  f
  f  f  t: 0.2     f: 0.8       f  t
  t  t  t: 0.997   f: 0.003     t  t
  f  f  t: 0.0069  f: 0.9931    f  f
  f  f  t: 0.2     f: 0.8       t  f
  f  f  t: 0.2     f: 0.8       f  t

P(a | b, e)   = 0.997 / 1 = 0.997
P(a | b, ¬e)  = 0.98 / 1 = 0.98
P(a | ¬b, e)  = 0.3 / 1 = 0.3
P(a | ¬b, ¬e) = (0.0069 + 0.2 + 0.2 + 0.2 + 0.0069 + 0.2 + 0.2) / 7 ≈ 0.145

updated CPT:
  B  E  P(A)
  t  t  0.997
  t  f  0.98
  f  t  0.3
  f  f  0.145

re-estimate probabilities for P(J | A) and P(M | A) in the same way

Page 59: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Convergence of EM

• E and M steps are iterated until probabilities converge

• will converge to a maximum in the data likelihood

(MLE or MAP)

• the maximum may be a local optimum, however

• the optimum found depends on starting conditions

(initial estimated probability parameters)

Page 60: Probabilistic Graphical Models: Bayesian Networks Li Xiong


Bayesian Networks

Graphical models

Bayesian networks - definition

Bayesian networks – inference

Exact inference

Approximate inference

Bayesian networks – learning

Parameter learning

Network learning

Page 61: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Learning structure + parameters

• number of structures is super-exponential in the number of variables

• finding optimal structure is NP-complete problem

• two common options:

– search very restricted space of possible structures

(e.g. networks with tree DAGs)

– use heuristic search (e.g. sparse candidate)

Page 62: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The Chow-Liu algorithm (Chow & Liu, 1968)

• learns a BN with a tree structure that maximizes the likelihood of the training data

• algorithm

1. compute weight I(Xi, Xj) of each possible edge (Xi, Xj)

2. find maximum weight spanning tree (MST)

3. assign edge directions in MST

Page 63: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The Chow-Liu algorithm

1. use mutual information to calculate edge weights

I(X, Y) = Σx∈values(X) Σy∈values(Y) P(x, y) log2 [ P(x, y) / ( P(x) P(y) ) ]
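Step 1 of Chow-Liu amounts to estimating these mutual informations from counts. A sketch, where data is assumed to be a list of dicts mapping variable name to value (an illustrative layout, not from the slides):

from collections import Counter
from math import log2

def mutual_information(data, x, y):
    n = len(data)
    cx, cy, cxy = Counter(), Counter(), Counter()
    for row in data:
        cx[row[x]] += 1
        cy[row[y]] += 1
        cxy[(row[x], row[y])] += 1
    mi = 0.0
    for (vx, vy), c in cxy.items():
        pxy = c / n                               # empirical P(x, y)
        mi += pxy * log2(pxy / ((cx[vx] / n) * (cy[vy] / n)))
    return mi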

Pages 64–68: [figure slides; only the label "Joint" is recoverable from this transcript]
Page 69: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The Chow-Liu algorithm

2. find maximum weight spanning tree: a maximal-weight tree that connects all vertices in a graph

[figure: an undirected graph over nodes A–G with mutual-information edge weights; the maximum-weight spanning tree is the subset of edges selected in this step]

Page 70: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Prim’s algorithm for finding an MST

given: graph with vertices V and edges E

Vnew ← { v } where v is an arbitrary vertex from V
Enew ← { }
repeat until Vnew = V
{
  choose an edge (u, v) in E with max weight where u is in Vnew and v is not
  add v to Vnew and (u, v) to Enew
}
return Vnew and Enew which represent an MST
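A compact Python rendering of this pseudocode for the maximum-weight spanning tree; the edge representation (frozenset pairs mapped to weights) and the toy weights in the usage line are illustrative assumptions.

def prim_max_spanning_tree(vertices, edges):
    v_new = {next(iter(vertices))}       # start from an arbitrary vertex
    e_new = []
    while v_new != set(vertices):
        # edges with exactly one endpoint already in the tree
        frontier = [(w, e) for e, w in edges.items() if len(e & v_new) == 1]
        w, e = max(frontier, key=lambda we: we[0])   # heaviest crossing edge
        v_new |= e
        e_new.append(e)
    return e_new

edges = {frozenset({"A", "B"}): 15, frozenset({"B", "C"}): 7, frozenset({"A", "C"}): 5}
print(prim_max_spanning_tree({"A", "B", "C"}, edges))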

Page 71: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Kruskal’s algorithm for finding an MST

given: graph with vertices V and edges E

Enew ← { }

for each (u, v) in E ordered by weight (from high to low)

{

remove (u, v) from E

if adding (u, v) to Enew does not create a cycle

add (u, v) to Enew

}

return V and Enew which represent an MST
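The same in Kruskal's style, with a tiny union-find to detect cycles; the (u, v, weight) tuple layout is an illustrative choice, not from the slides.

def kruskal_max_spanning_tree(vertices, edges):
    parent = {v: v for v in vertices}

    def find(v):
        # follow parent pointers to the set representative (with path halving)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    e_new = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                     # adding (u, v) does not create a cycle
            parent[ru] = rv
            e_new.append((u, v, w))
    return e_new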

Page 72: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Returning directed graph in Chow-Liu

3. pick a node for the root, and assign edge directions

[figure: the maximum-weight spanning tree over nodes A–G shown undirected, and the directed tree obtained after choosing a root and orienting edges away from it]

Page 73: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The Chow-Liu algorithm

• How do we know that Chow-Liu will find a tree that maximizes the data likelihood?

• Two key questions:

– Why can we represent data likelihood as sum of I(X;Y)

over edges?

– Why can we pick any direction for edges in the tree?

Page 74: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Why Chow-Liu maximizes likelihood (for a tree)

data likelihood of D given the directed edges of G, with best-fit parameters θG:

log P(D | G, θG) = Σd∈D Σi log2 P(xi(d) | Pa(Xi))

  = |D| Σi Σvalues(Xi, Pa(Xi)) P(Xi, Pa(Xi)) log2 P(Xi | Pa(Xi))

  (since summing over all examples is equivalent to computing the average over all examples and then multiplying by the total number of examples |D|)

  = |D| Σi Σvalues(Xi, Pa(Xi)) P(Xi, Pa(Xi)) log2 [ P(Xi, Pa(Xi)) / P(Pa(Xi)) ]

  = |D| Σi [ Σvalues P(Xi, Pa(Xi)) log2 ( P(Xi, Pa(Xi)) / ( P(Xi) P(Pa(Xi)) ) ) + Σvalues P(Xi, Pa(Xi)) log2 P(Xi) ]

  = |D| Σi ( I(Xi, Pa(Xi)) − H(Xi) )

Page 75: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Why Chow-Liu maximizes likelihood (for a tree)

data likelihood given the directed edges:

log P(D | G, θG) = Σd∈D Σi log2 P(xi(d) | Parents(Xi)) = |D| Σi ( I(Xi, Parents(Xi)) − H(Xi) )

we're interested in finding the graph G that maximizes this:

argmaxG log P(D | G, θG) = argmaxG Σi I(Xi, Parents(Xi))

if we assume a tree, one node has no parents, all others have exactly one:

argmaxG log P(D | G, θG) = argmaxG Σ(Xi, Xj)∈edges I(Xi, Xj)

edge directions don't matter for likelihood, because MI is symmetric: I(Xi, Xj) = I(Xj, Xi)

Page 76: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Learning structure + parameters

• number of structures is super-exponential in the number of variables

• finding optimal structure is NP-complete problem

• two common options:

– search very restricted space of possible structures

(e.g. networks with tree DAGs)

– use heuristic search (e.g. sparse candidate)

Page 77: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Heuristic search for structure learning

• each state in the search space represents a DAG Bayes net structure

• to instantiate a search approach, we need to specify

– state transition operators

– scoring function for states

– search algorithm (how to move through state space)

Page 78: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The typical structure search operators

given the current network at some stage of the search, we can add an edge, delete an edge, or reverse an edge

[figure: four small DAGs over nodes A, B, C, D showing the current network and the result of each operator]

Page 79: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Scoring function decomposability

• If score is likelihood, and all instances in D are complete,

then score can be decomposed as follows (and so can

some other scores we’ll see later)

score(G, D) = Σi score(Xi, Parents(Xi) : D)

• thus we can

– score a network by summing terms over the nodes in

the network

– efficiently score changes in a local search procedure

Page 80: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian network search:

hill-climbing

given: data set D, initial network B0

i = 0

Bbest←B0

while stopping criteria not met

{

for each possible operator application a

{

Bnew ← apply(a, Bi)

if score(Bnew) > score(Bbest)

Bbest ← Bnew

}

++i

Bi ← Bbest

}

return Bi
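A runnable sketch of this search, mirroring the pseudocode; score is assumed to be a caller-supplied (decomposable) scoring function over edge sets, and the representation of a network as a frozenset of directed edges is an illustrative choice.

from itertools import permutations

def creates_cycle(edges, new_edge):
    # adding u -> v creates a cycle iff v can already reach u
    u, v = new_edge
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for (x, w) in edges if x == node)
    return False

def neighbors(variables, edges):
    # all structures reachable by one add / delete / reverse operation
    for u, v in permutations(variables, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                               # delete edge
            if not creates_cycle(edges - {(u, v)}, (v, u)):
                yield (edges - {(u, v)}) | {(v, u)}              # reverse edge
        elif not creates_cycle(edges, (u, v)):
            yield edges | {(u, v)}                               # add edge

def hill_climb(variables, score, initial=frozenset(), max_steps=100):
    best = frozenset(initial)
    for _ in range(max_steps):
        candidate = max((frozenset(e) for e in neighbors(variables, best)),
                        key=score, default=best)
        if score(candidate) <= score(best):
            return best                                          # local optimum reached
        best = candidate
    return best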

Page 81: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Bayesian network search:

the Sparse Candidate algorithm[Friedman et al., UAI 1999]

given: data set D, initial network B0, parameter k

i = 0

repeat

{

++i

// restrict step

select for each variable Xj a set Cj(i) (|Cj(i)| ≤ k) of candidate parents

// maximize step

find network Bi maximizing the score among networks where ∀Xj, Parents(Xj) ⊆ Cj(i)

} until convergence

return Bi

Page 82: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The restrict step in Sparse Candidate

• to identify candidate parents in the first iteration, compute the mutual information between pairs of variables and select the top k:

I(X, Y) = Σx,y P(x, y) log [ P(x, y) / ( P(x) P(y) ) ]

• in subsequent iterations, condition on the current parents using conditional mutual information:

I(X, Y | Z) = Σx,y,z P(x, y, z) log [ P(x, y | z) / ( P(x | z) P(y | z) ) ]
Page 83: Probabilistic Graphical Models: Bayesian Networks Li Xiong

The maximize step in Sparse Candidate

• hill-climbing search with add-edge, delete-edge, reverse-edge operators

• test to ensure that cycles aren’t introduced into the graph

Page 84: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Scoring functions for structure learning

• Can we find a good structure just by trying to maximize the likelihood of the data?

argmaxG, θG log P(D | G, θG)

Page 85: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Scoring functions for structure learning

• Can we find a good structure just by trying to maximize the likelihood of the data?

argmaxG, θG log P(D | G, θG)

• If we have a strong restriction on the structures allowed (e.g. a tree), then maybe.

• Otherwise, no! Adding an edge will never decrease likelihood. Overfitting likely.

Page 86: Probabilistic Graphical Models: Bayesian Networks Li Xiong

Scoring functions for structure learning

• one general approach (where n is the number of data points):

argminG, θG   f(n) |θG| − log P(D | G, θG)

where f(n) |θG| is a complexity penalty on the number of parameters

Akaike Information Criterion (AIC): f(n) = 1

Bayesian Information Criterion (BIC): f(n) = (1/2) log(n)
Page 87: Probabilistic Graphical Models: Bayesian Networks Li Xiong


Bayesian Networks

Graphical models

Bayesian networks - definition

Bayesian networks – inference

Exact inference

Approximate inference

Bayesian networks – learning

Parameter learning

Parameter learning + inference

Network learning