
An Introduction to Bayesian Networks: Concepts and

Learning from Data

Jeong-Ho ChangSeoul National University

[email protected]

2005-04-16 (Sat.) An Introduction to Bayesian Networks

- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks
  - Parameter Learning
  - Structural Learning
- Applications
  - DNA microarray
  - Classification
  - Dependency Analysis
- Summary

Bayesian Networks
A compact representation of a joint probability distribution over a set of variables, based on the concept of conditional independence.
- Qualitative part (graph theory): a directed acyclic graph. Nodes: variables. Edges: dependency or influence of one variable on another.
- Quantitative part (probability theory): a set of conditional probabilities for all variables.
Bayesian networks naturally handle the problems of complexity and uncertainty.

Bayesian Networks (Cont'd)
A BN encodes probabilistic relationships among a set of objects or variables. It is useful in that it:
- encodes the dependencies among all variables: a modular representation of knowledge;
- can be used for the learning of causal relationships, which is helpful in understanding a problem domain;
- has both a causal and a probabilistic semantics, and so can naturally combine prior knowledge and data;
- provides an efficient and principled approach for avoiding overfitting of data, in conjunction with Bayesian statistical methods.

BN as a Probabilistic Graphical Model
Graphical models come in two flavors:
- directed graph: Bayesian networks
- undirected graph: Markov random fields
(Figure: the same five nodes A, B, C, D, E drawn once as a directed graph, a Bayesian network, and once as an undirected graph, a Markov random field.)

Representation of Joint Probability

Markov random field (undirected): the joint distribution factors into clique potentials φ_i with a normalization constant Z:
P(A, B, C, D, E) = (1/Z) ∏_i φ_i(Z_i) = (1/Z) φ_1(A, B) φ_2(B, C, D) φ_3(C, D, E)

Bayesian network (directed): the joint distribution factors into conditional probabilities:
P(A, B, C, D, E) = P(A | Pa(A)) P(B | Pa(B)) P(C | Pa(C)) P(D | Pa(D)) P(E | Pa(E))

Joint Probability as a Product of Conditional Probabilities
This factorization can dramatically reduce the number of parameters needed for data modeling in Bayesian networks.

(Figure: the ALARM monitoring network with nodes such as MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, PCWP, CO, HRBP, CATECHOL, SAO2, LVFAILURE, HYPOVOLEMIA, CVP, BP, and so on; from the NIPS'01 tutorial by N. Friedman and D. Koller.)
37 variables in total; the factored network needs only 509 parameters, versus on the order of 2^54 entries for the full joint table.

Real-World Applications of BN
- Intelligent agents: Microsoft Office assistant (Bayesian user modeling)
- Medical diagnosis: PATHFINDER (Heckerman, 1992), diagnosis of lymph-node diseases, commercialized as INTELLIPATH (http://www.intellipath.com/)
- Control decision support systems
- Speech recognition (HMMs)
- Genome data analysis: gene expression, DNA sequences, combined analysis of heterogeneous data
- Turbocodes (channel coding)


Causal Networks
- Node: an event.
- Arc: a causal relationship between two nodes; A → B means A causes B.
Causal network for the car start problem [Jensen 01]:
(Figure: Fuel → Fuel Meter Standing, Fuel → Start, Clean Spark Plugs → Start.)

Reasoning with Causal Networks
- My car does not start: this increases the certainty of no fuel and of dirty spark plugs, and increases the certainty of the fuel meter standing on empty.
- The fuel meter stands on half: this decreases the certainty of no fuel and increases the certainty of dirty spark plugs.
(Figure: the car start network as above.)

d-separation
Connections in causal networks: serial (A → C → B), diverging (A ← C → B), and converging (A → C ← B).
Definition [Jensen 01]: two nodes in a causal network are d-separated if, for every path between them, there is an intermediate node V such that
- the connection at V is serial or diverging and the state of V is known, or
- the connection at V is converging and neither V nor any of V's descendants has received evidence.

d-separation (Cont'd)
- Serial or diverging connection through C, with C unobserved: A and B are marginally dependent.
- Serial or diverging connection through C, with C observed: A and B are conditionally independent given C.
- Converging connection A → C ← B, with C unobserved: A and B are marginally independent.
- Converging connection A → C ← B, with C observed: A and B are conditionally dependent given C.

d-separation: Car Start Problem
1. 'Start' and 'Fuel' are dependent on each other.
2. 'Start' and 'Clean Spark Plugs' are dependent on each other.
3. 'Fuel' and 'Fuel Meter Standing' are dependent on each other.
4. 'Fuel' and 'Clean Spark Plugs' are conditionally dependent on each other given the value of 'Start'.
5. 'Fuel Meter Standing' and 'Start' are conditionally independent given the value of 'Fuel'.
(Figure: the car start network.)

Bayesian Networks: Revisited
A graphical model for the probabilistic relationships among a set of variables; a compact representation of a joint probability distribution on the basis of conditional probabilities. It consists of:
- Qualitative part: a set of n variables X = {X1, X2, …, Xn} and a set of directed edges between variables; the variables (nodes) together with the directed edges form a directed acyclic graph (DAG) structure.
- Quantitative part: for each variable Xi and its parents Pa(Xi), a conditional probability table for P(Xi | Pa(Xi)).
Modeling of continuous variables is also possible.

Quantitative Specification by Probability Calculus
Fundamentals:
- Conditional probability: P(B | A) = P(A, B) / P(A)
- Product rule: P(A, B) = P(B | A) P(A) = P(A | B) P(B)
- Chain rule (a successive application of the product rule):
  P(X1, X2, …, Xn) = P(Xn | X1, …, Xn−1) P(X1, …, Xn−1)
                   = P(Xn | X1, …, Xn−1) P(Xn−1 | X1, …, Xn−2) P(X1, …, Xn−2)
                   = … = P(X1) ∏_{i=2}^{n} P(Xi | X1, …, Xi−1)

Independence of Two Events
- Marginal independence: A ⊥ B  ⇔  P(A | B) = P(A) and P(B | A) = P(B)
- Conditional independence: A ⊥ B | C  ⇔  P(A | B, C) = P(A | C) and P(B | A, C) = P(B | C)
(Figure: the serial, diverging, and converging three-node structures from the d-separation slides.)
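The diverging (common-cause) case can be checked numerically. A minimal sketch with hypothetical CPTs for the structure A ← C → B: by construction A and B are conditionally independent given C, yet marginally dependent.

```python
# Hypothetical CPTs for the common-cause structure A <- C -> B
pC = {0: 0.5, 1: 0.5}
pA_C = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(A | C)
pB_C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P(B | C)

def joint(a, b, c):
    return pC[c] * pA_C[c][a] * pB_C[c][b]

def pAB(a, b):                       # marginal P(A = a, B = b)
    return sum(joint(a, b, c) for c in pC)

pA = {a: sum(pAB(a, b) for b in (0, 1)) for a in (0, 1)}
pB = {b: sum(pAB(a, b) for a in (0, 1)) for b in (0, 1)}

# Marginally dependent: P(A, B) != P(A) P(B)
print(round(pAB(0, 0) - pA[0] * pB[0], 4))   # 0.0875, clearly nonzero
# Conditionally independent: P(A, B | C = c) == P(A | C = c) P(B | C = c)
for c in (0, 1):
    print(abs(joint(0, 0, c) / pC[c] - pA_C[c][0] * pB_C[c][0]) < 1e-12)   # True
```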

A Bayesian network over X = {X1, X2, …, X10}:
Pa(X5): the parents of X5; Ch(X5): the children of X5; De(X5): the descendants of X5.
P(X) = P(X1, …, X10) = ∏_i P(Xi | Pa(Xi))
A topological sort of the Xi: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
Applying the chain rule in reverse topological order, each factor conditions only on the node's parents:
P(X \ {X10}, X10) = P(X \ {X10}) P(X10 | X \ {X10}) = P(X \ {X10}) P(X10 | X8)
Writing X' = X \ {X10}:
P(X' \ {X9}, X9) = P(X' \ {X9}) P(X9 | X' \ {X9}) = P(X' \ {X9}) P(X9 | X7)
Writing X'' = X' \ {X9}:
P(X'' \ {X8}, X8) = P(X'' \ {X8}) P(X8 | X'' \ {X8}) = P(X'' \ {X8}) P(X8 | X5, X6)
(Figure: the ten-node example network.)


Bayesian Network for the Car Start Problem

P(Fu = Yes) = 0.98    P(CSP = Yes) = 0.96

P(FMS | Fu):
            FMS = Full   FMS = Half   FMS = Empty
Fu = Yes    0.39         0.60         0.01
Fu = No     0.001        0.001        0.998

P(St | Fu, CSP):
(Fu, CSP)    Start = Yes   Start = No
(Yes, Yes)   0.99          0.01
(Yes, No)    0.01          0.99
(No, Yes)    0             1
(No, No)     0             1

An Illustration of Conditional Independence in BN
(Demonstrated with Netica, http://www.norsys.com.)

Main Issues in BN
- Inference in Bayesian networks: given an assignment of a subset of variables (the evidence) in a BN, estimate the posterior distribution over another subset of unobserved variables of interest:
  P(x_un | x_obs) = P(x_un, x_obs) / P(x_obs)
- Learning Bayesian networks from data:
  - Parameter learning: given a data set, estimate the local probability distributions P(Xi | Pa(Xi)) for all variables (nodes) comprising the BN.
  - Structure learning: for a data set, search for a network structure G (dependency structure) that is best, or at least plausible.
This tutorial focuses on the learning issue.
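The posterior P(x_un | x_obs) can be computed by brute-force enumeration over the joint. A minimal sketch using the car start CPTs from the earlier slide (the variable names Fu, CSP, FMS, St are shorthand for the four nodes):

```python
# CPTs of the car start network (values as on the earlier slide)
P_Fu = {'Yes': 0.98, 'No': 0.02}
P_CSP = {'Yes': 0.96, 'No': 0.04}
P_FMS = {'Yes': {'Full': 0.39, 'Half': 0.60, 'Empty': 0.01},       # P(FMS | Fu)
         'No':  {'Full': 0.001, 'Half': 0.001, 'Empty': 0.998}}
P_St = {('Yes', 'Yes'): {'Yes': 0.99, 'No': 0.01},                  # P(St | Fu, CSP)
        ('Yes', 'No'):  {'Yes': 0.01, 'No': 0.99},
        ('No', 'Yes'):  {'Yes': 0.0, 'No': 1.0},
        ('No', 'No'):   {'Yes': 0.0, 'No': 1.0}}

def joint(fu, csp, fms, st):
    # P(Fu, CSP, FMS, St) = P(Fu) P(CSP) P(FMS | Fu) P(St | Fu, CSP)
    return P_Fu[fu] * P_CSP[csp] * P_FMS[fu][fms] * P_St[(fu, csp)][st]

# P(Fu = No | St = No) = P(Fu = No, St = No) / P(St = No), by enumeration
num = sum(joint('No', csp, fms, 'No')
          for csp in P_CSP for fms in ('Full', 'Half', 'Empty'))
den = sum(joint(fu, csp, fms, 'No')
          for fu in P_Fu for csp in P_CSP for fms in ('Full', 'Half', 'Empty'))
print(round(num / den, 3))   # 0.293: the car not starting raises P(no fuel) from 0.02
```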


Learning Bayesian Networks
(Diagram: data acquisition → preprocessing → Bayesian network learning, comprising structure search, a score metric, and parameter learning, guided by prior knowledge.)
BN = structure + local probability distributions.

Learning Bayesian Networks (Cont'd)
Bayesian network learning consists of:
- structure learning (the DAG structure), and
- parameter learning (the local probability distributions).
Four situations:
1. Known structure, complete data.
2. Unknown structure, complete data.
3. Known structure, incomplete data.
4. Unknown structure, incomplete data.

Parameter Learning
Task: given a network structure, estimate the parameters of the model from data.
Example: data D = {s1, …, sM} over binary variables A, B, C, D (values H/L), with structure A → C, B → C, B → D. Estimated local distributions (illustrative values):

P(A): H 0.99, L 0.01
P(B): H 0.93, L 0.07

P(C | A, B):
(A, B)   C = H   C = L
(H, H)   0.4     0.6
(H, L)   0.2     0.8
(L, H)   0.3     0.7
(L, L)   0.8     0.2

P(D | B):
B    D = H   D = L
H    0.9     0.1
L    0.1     0.9
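With complete data, the maximum-likelihood parameters are simply relative frequencies, counted separately per node and per parent configuration. A minimal sketch on a hypothetical H/L data set (the samples below are made up for illustration):

```python
from collections import Counter

# Hypothetical complete data over (A, B, C, D): columns 0..3, values 'H'/'L'
data = [('H', 'L', 'L', 'L'), ('H', 'H', 'H', 'H'), ('L', 'H', 'H', 'L'),
        ('H', 'H', 'L', 'H'), ('L', 'L', 'L', 'L'), ('H', 'H', 'H', 'H')]

def mle_marginal(col):
    """MLE for a root node: relative frequencies of its values."""
    counts = Counter(s[col] for s in data)
    return {v: n / len(data) for v, n in counts.items()}

def mle_conditional(child, parent):
    """MLE for P(child | parent): frequencies within each parent configuration."""
    pairs = Counter((s[parent], s[child]) for s in data)
    parent_counts = Counter(s[parent] for s in data)
    return {(pv, cv): n / parent_counts[pv] for (pv, cv), n in pairs.items()}

print(mle_marginal(0))         # P(A): H 4/6, L 2/6
print(mle_conditional(3, 1))   # P(D | B): e.g. P(D=H | B=H) = 0.75
```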

Key point: independence of parameter estimation.
D = {s1, s2, …, sM}, where si = (ai, bi, ci, di) is an instance of the random vector variable S = (A, B, C, D).
Assumption: the samples si are independent and identically distributed (i.i.d.).
L(Θ_G; D) = P(D | Θ_G) = ∏_{i=1}^{M} P(si | Θ_G)
          = ∏_{i=1}^{M} P(ai | Θ_G) P(bi | Θ_G) P(ci | ai, bi, Θ_G) P(di | bi, Θ_G)
          = (∏_i P(ai | Θ_G)) (∏_i P(bi | Θ_G)) (∏_i P(ci | ai, bi, Θ_G)) (∏_i P(di | bi, Θ_G))
The likelihood decomposes by node, so the parameters of each node (variable) can be estimated independently.

One can estimate the parameters for P(A), P(B), P(C | A, B), and P(D | B) in an independent manner:
P(A, B, C, D) = P(A) × P(B) × P(C | A, B) × P(D | B)
If A, B, C, and D are all binary-valued, the number of free parameters is reduced from 15 (2^4 − 1) for the full joint to 8 (1 + 1 + 4 + 2) for the factored form.
(Figure: the data columns (a1 … aM), (b1 … bM), (c1 … cM), (d1 … dM) feed the per-node estimates.)
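The parameter count of the factored form is Σ_i (|Xi| − 1) · ∏_{p ∈ Pa(Xi)} |p|. A small sketch that reproduces the 15-versus-8 comparison for the A, B, C, D network above:

```python
from math import prod

# Full joint vs factored BN parameter counts, all variables binary.
# Structure from the slide: A and B are roots, C has parents {A, B}, D has parent {B}.
card = {'A': 2, 'B': 2, 'C': 2, 'D': 2}
parents = {'A': [], 'B': [], 'C': ['A', 'B'], 'D': ['B']}

full_joint = prod(card.values()) - 1    # 2**4 - 1 = 15 free parameters
bn = sum((card[x] - 1) * prod(card[p] for p in parents[x]) for x in card)
print(full_joint, bn)   # 15 8
```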


Methods for Parameter Estimation
- Maximum likelihood estimation: choose the value of Θ that maximizes the likelihood of the observed data D:
  Θ̂ = argmax_Θ L(Θ; D) = argmax_Θ P(D | Θ)
- Bayesian estimation: represent the uncertainty about the parameters using a probability distribution over Θ; Θ is a random variable rather than a point value:
  P(Θ | D) = P(Θ) P(D | Θ) / P(D) ∝ P(Θ) P(D | Θ)
  (posterior ∝ prior × likelihood)

Bayes Rule, MAP and ML
Bayes' rule (h: hypothesis, i.e. models or parameters; D: data):
P(h | D) = P(D | h) P(h) / P(D)
- ML (maximum likelihood) estimation: h* = argmax_h P(D | h)
- MAP (maximum a posteriori) estimation: h* = argmax_h P(h | D)
- Bayesian learning: not a point estimate, but the full posterior distribution P(h | D).
(From the NIPS'99 tutorial by Z. Ghahramani.)

Bayesian Estimation (for a Multinomial Distribution)
Prior (Dirichlet): θ ~ Dir(α1, α2, …, αK), P(θ) ∝ ∏_k θ_k^(αk − 1)
(the αk encode prior knowledge, acting as pseudo-counts)
Posterior: P(θ | D) ∝ P(θ) P(D | θ) ∝ ∏_k θ_k^(αk + Nk − 1)
(the counts Nk are the sufficient statistics of the data)
Predictive distribution:
P(s_{M+1} = k | D) = ∫ P(k | θ) P(θ | D) dθ = E_{P(θ|D)}[θ_k] = (αk + Nk) / Σ_l (αl + Nl)
This is a smoothed version of the MLE.


An Example: Coin Toss
Data D = H H T H H H, so (N_H, N_T) = (5, 1).
Maximum likelihood estimation:
P(D | θ) = θ_H^5 θ_T = θ^5 (1 − θ)
θ̂ = argmax_θ P(D | θ) = 5 / (5 + 1) = 5/6
P(S = H) = θ̂ ≈ 0.833
Bayesian inference, with a uniform prior θ ~ Dir(1, 1), i.e. P(θ) ∝ θ^(1−1) (1 − θ)^(1−1):
P(θ | D) ∝ P(θ) P(D | θ) = θ^5 (1 − θ)
P(H | D) = (5 + 1) / ((5 + 1) + (1 + 1)) = 3/4 = 0.75


Sequential Updates of P(H) under Different Dirichlet Priors

Data         H     HH    HHT   HHTH  HHTHH  HHTHHH  HHTHHHTTHH  (N_H, N_T) = (100, 50)
MLE          1.00  1.00  0.67  0.75  0.80   0.83    0.70        0.67
B(0.5, 0.5)  0.75  0.83  0.63  0.70  0.75   0.79    0.68        0.67
B(1, 1)      0.67  0.75  0.60  0.67  0.71   0.75    0.67        0.66
B(2, 2)      0.60  0.67  0.57  0.63  0.67   0.70    0.64        0.66
B(5, 5)      0.55  0.58  0.54  0.57  0.60   0.63    0.60        0.65
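The table entries follow from the closed-form predictive probability: under a Beta(a, a) prior, P(H | data) = (N_H + a) / (N + 2a). A minimal sketch:

```python
# Predictive P(H) after observing a sequence of coin tosses.
def mle(seq):
    return seq.count('H') / len(seq)

def bayes(seq, a):
    # Beta(a, a) prior; the posterior predictive is (N_H + a) / (N + 2a)
    return (seq.count('H') + a) / (len(seq) + 2 * a)

print(round(mle('HHTHH'), 2))        # 0.80, the MLE column for HHTHH
print(round(bayes('H', 1), 2))       # 0.67, the B(1, 1) entry after one head
print(round(bayes('HHTHHH', 1), 2))  # 0.75, the B(1, 1) entry for HHTHHH
```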

(Figure: posterior distributions of the parameter θ under the B(0.5, 0.5), B(1, 1), B(2, 2), and B(5, 5) priors, showing how the posterior varies with the prior's strength.)

- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks
  - Parameter Learning
  - Structural Learning
    - Scoring Metric
    - Search Strategy
- Practical Applications
  - DNA microarray
  - Classification
  - Dependency Analysis
- Summary

Structural Learning
Task: given a data set, search for the most plausible network structure underlying the generation of the data.
Metric (score)-based approach: use a scoring metric to measure how well a particular structure fits the observed set of cases, and a search strategy to explore candidate structures.
(Figure: a data table over A, B, C, D is scored against several candidate network structures; the scoring metric ranks the candidates and the search strategy proposes new ones.)

Scoring Metric
P(G | D) = P(G) P(D | G) / P(D) ∝ P(G) P(D | G)
where P(G) is the prior over network structures and P(D | G) is the marginal likelihood.
Likelihood score:
Score(G; D) = log P(D | G, Θ_MLE) ∝ Σ_{i=1}^{N} I(Xi; Pa(Xi)) − Σ_{i=1}^{N} H(Xi)
Nodes with high mutual information (dependency) with their parents get a higher score. Since I(X; Y) ≤ I(X; {Y, Z}), the fully connected network is obtained in the unrestricted case, so the likelihood score is prone to overfitting.
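The likelihood score is driven by the empirical mutual information between each node and its parents. A minimal sketch that estimates I(X; Y) in bits from a list of samples:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Empirical I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A perfectly correlated pair carries 1 bit; an independent-looking one, 0 bits.
print(mutual_information([(0, 0), (1, 1)] * 50))                  # 1.0
print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)] * 25))  # 0.0
```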

Likelihood Score in Relation to Information Theory
log L(Θ̂; D) = Σ_{i=1}^{N} Σ_j Σ_{k=1}^{|Xi|} N_ijk log (N_ijk / N_ij)
            = M Σ_{i=1}^{N} Σ_j Σ_{k=1}^{|Xi|} (N_ijk / M) log (N_ijk / N_ij)
            = −M Σ_{i=1}^{N} H(Xi | Pa(Xi))
            = M ( Σ_{i=1}^{N} [H(Xi) − H(Xi | Pa(Xi))] − Σ_{i=1}^{N} H(Xi) )
            ∝ Σ_{i=1}^{N} I(Xi; Pa(Xi)) − Σ_{i=1}^{N} H(Xi)
using the identity I(X; Y) = H(X) − H(X | Y).
(Figure: the information diagram relating H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y), and a candidate network over A, B, C, D compared against the empty network.)

Bayesian Score
The Bayesian score accounts for the uncertainty in parameter estimation:
Score(G; D) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ
Assuming complete data and parameter independence, the marginal likelihood can be rewritten as a product over nodes:
P(D | G) = ∏_{i=1}^{N} ∫ P(D(Xi; Pa(Xi)) | θi, G) P(θi | G) dθi
i.e. a marginal likelihood for each pair (Xi, Pa(Xi)).

Bayesian Dirichlet Score
For the multinomial case, assume a Dirichlet prior for each parameter (Heckerman, 1995):
θij ~ Dir(αij1, αij2, …, αij|Xi|),   αij = Σ_{k=1}^{|Xi|} αijk,   Nij = Σ_{k=1}^{|Xi|} Nijk,
Nijk = #{Xi = xi^k and Pa(Xi) = pai^j}
Then each node's marginal likelihood has the closed form
∫ P(D(Xi; Pa(Xi)) | θi, G) P(θi | G) dθi = ∏_j [Γ(αij) / Γ(αij + Nij)] ∏_{k=1}^{|Xi|} [Γ(αijk + Nijk) / Γ(αijk)]
where Γ is the gamma function: Γ(x) = ∫_0^∞ t^(x−1) e^(−t) dt, with Γ(n + 1) = n Γ(n) = n! and Γ(1) = 1.

Bayesian Dirichlet Score (Cont'd)
Example: D = H H T H, with prior θ ~ Dir(α_H, α_T) and α = α_H + α_T. The marginal likelihood unrolls sequentially:
P(H | ∅) = α_H / α
P(H | H) = (α_H + 1) / (α + 1)
P(T | H H) = α_T / (α + 2)
P(H | H H T) = (α_H + 2) / (α + 3)
P(D) = [α_H (α_H + 1) (α_H + 2) × α_T] / [α (α + 1) (α + 2) (α + 3)]
     = [Γ(α) / Γ(α + 4)] × [Γ(α_H + 3) / Γ(α_H)] × [Γ(α_T + 1) / Γ(α_T)]
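The closed form above is convenient to evaluate via log-gamma to avoid overflow. A minimal sketch for the single-coin case, checked against the sequential product (1/2)(2/3)(1/4)(3/5) = 1/20 for D = H H T H under Dir(1, 1):

```python
from math import lgamma, exp

def bd_marginal_likelihood(n_h, n_t, a_h, a_t):
    """P(D) for a binary variable with counts (n_h, n_t) under a Dir(a_h, a_t) prior."""
    a = a_h + a_t
    return exp(lgamma(a) - lgamma(a + n_h + n_t)
               + lgamma(a_h + n_h) - lgamma(a_h)
               + lgamma(a_t + n_t) - lgamma(a_t))

# D = H H T H under Dir(1, 1): the sequential product gives 1/20
print(round(bd_marginal_likelihood(3, 1, 1, 1), 6))   # 0.05
```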


Bayesian Dirichlet Score (Cont'd)
Putting the node-wise terms together:
P(D | G) = ∏_{i=1}^{N} ∏_j [Γ(αij) / Γ(αij + Nij)] ∏_{k=1}^{|Xi|} [Γ(αijk + Nijk) / Γ(αijk)]
log P(D | G) in the Bayesian score is asymptotically equivalent to BIC (Schwarz, 1978) and to minus the MDL criterion (Rissanen, 1987):
log P(D | G) ≈ BIC(G; D) = log P(D | Θ̂_G, G) − (dim(G) / 2) log M
where dim(G) is the number of parameters in G.

Structure Search
Given a data set, a score metric, and a set of possible structures, find the network structure with maximal score: a discrete optimization problem.
One can exploit the decomposability of the score over the pairs (Xi, Pa(Xi)):
P(D | G) = ∏_{i=1}^{N} P(D(Xi; Pa(Xi)) | G)
Score(G; D) = log P(D | G) = Σ_{i=1}^{N} Score(Xi; Pa(Xi))

Tree-Structured Networks
Definition: each node has at most one parent. For this class an effective exact search algorithm exists.
Score(G | D) = Σ_i Score(Xi | Pai) = Σ_i [Score(Xi | Pai) − Score(Xi)] + Σ_i Score(Xi)
i.e. the improvement over the empty network plus the score of the empty network.
Chow and Liu (1968):
1. Construct the complete undirected graph with edge weights E(Xi, Xj) = I(Xi; Xj).
2. Build a maximum-weight spanning tree.
3. Transform it into a directed tree by choosing an arbitrary root node.

Example: from a data set over A, B, C, D, the pairwise mutual-information weights are
I(A;B) = 0.0652, I(A;C) = 0.0880, I(A;D) = 0.0834,
I(B;C) = 0.1913, I(B;D) = 0.1967, I(C;D) = 0.1980.
(Figure: complete undirected graph → maximum spanning tree → directed tree with an arbitrary root.)
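Step 2 of Chow-Liu on the weights above can be sketched as a Kruskal-style maximum-weight spanning tree, with a union-find structure for the cycle check:

```python
# Maximum-weight spanning tree over the pairwise mutual-information weights
weights = {('A', 'B'): 0.0652, ('A', 'C'): 0.0880, ('A', 'D'): 0.0834,
           ('B', 'C'): 0.1913, ('B', 'D'): 0.1967, ('C', 'D'): 0.1980}

parent = {v: v for v in 'ABCD'}
def find(v):                      # union-find with path compression
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

tree = []
for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    ru, rv = find(u), find(v)
    if ru != rv:                  # adding the edge does not create a cycle
        parent[ru] = rv
        tree.append((u, v))

print(sorted(tree))   # [('A', 'C'), ('B', 'D'), ('C', 'D')]
```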

Search Strategies for General Bayesian Networks
With more than one parent per node, finding the optimal structure is NP-hard (Chickering, 1996), so heuristic search methods are usually employed:
- greedy hill-climbing (local search)
- greedy hill-climbing with random restarts
- simulated annealing
- tabu search
- …

Greedy Local Search Algorithm
INPUT: data set, scoring metric, set of possible structures.
1. Initialize the structure G0: an empty network, a random network, a tree-structured network, etc.
2. Apply the local edge operation (INSERT, DELETE, or REVERSE an edge) that yields the largest improvement in score among all legal operations.
3. Repeat step 2 while Score(G_{t+1}; D) > Score(G_t; D); otherwise stop with G_final = G_t.
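The candidate moves at each step (INSERT, DELETE, REVERSE, keeping the graph acyclic) can be enumerated directly. A minimal sketch of the neighborhood generation only; the scoring of each neighbor is left out:

```python
from itertools import permutations

def is_dag(nodes, edges):
    """Kahn's algorithm: True iff the directed graph contains no cycle."""
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for u, v in edges:
            if u == n:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return seen == len(nodes)

def neighbors(nodes, edges):
    """All structures one legal edge operation away from `edges`."""
    out = []
    for u, v in permutations(nodes, 2):                  # INSERT
        if (u, v) not in edges and (v, u) not in edges:
            cand = edges | {(u, v)}
            if is_dag(nodes, cand):
                out.append(cand)
    for e in edges:
        out.append(edges - {e})                          # DELETE (always acyclic)
        rev = (edges - {e}) | {(e[1], e[0])}             # REVERSE
        if is_dag(nodes, rev):
            out.append(rev)
    return out

print(len(neighbors({'A', 'B', 'C'}, frozenset())))   # 6: one per possible arc
```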

Enhanced Search
Greedy local search can get stuck in local maxima or on plateaux. Standard heuristics to escape them include:
- search with random restarts,
- simulated annealing,
- tabu search,
- genetic algorithms (population-based search).
(Figure: score landscapes illustrating greedy search stopping at a local maximum, greedy search with random restarts, the probabilistic uphill/downhill moves of simulated annealing, and a population-based (GA) search.)

Learning Bayesian Networks: Summary
- The joint (probability) distribution is a product of conditional probabilities, one per variable (node); a sum in the log representation.
- Parameter learning: estimation of the local probability distributions; MLE, Bayesian estimation.
- Structure learning:
  - Score metrics: likelihood, Bayesian score, BIC, MDL.
  - Search strategies: optimization in a discrete space to maximize the score; the key concept is the decomposability of the score; tree-structured and general Bayesian networks.

- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks
  - Parameter Learning
  - Structural Learning
- Applications
  - DNA microarray
  - Classification (Naïve Bayes, TAN)
  - Dependency Analysis
- Summary

DNA Microarray Data Analysis
- Classification: gene expression data of 72 leukemia patients. Task: classify samples into AML or ALL based on their expression patterns, using Bayesian networks.
- Combined analysis of gene expression data and drug activity data: gene expression and drug activity data of 60 cancer samples. Task: construct a dependency network of genes and drugs.
Tools:
- WEKA (http://www.cs.waikato.ac.nz/~ml/weka/): a collection of machine learning algorithms for data mining tasks, implemented in Java; open-source software issued under the GNU General Public License. Bayesian network learning algorithms are included.
- The BNJ software (http://bnj.sourceforge.net/) is used for visualization when needed.

DNA Microarrays
Recent developments in the technology for biological experiments have made it possible to produce massive biological data sets. Microarrays monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene-at-a-time experiments, giving a parallel view of the expression patterns of thousands of genes in a cell. Bayesian networks are a useful tool for identifying a variety of meaningful relationships among genes from such data.

Bayesian Networks in Microarray Data Analysis
From expression profiles, Bayesian network construction supports:
- sample classification: disease diagnosis;
- gene-gene relation analysis: activation or inhibition between genes;
- gene regulatory network analysis: a global view of the relations among genes;
- combined analysis with other biological data: DNA sequence data, drug activity data, and so on.

DNA Microarray
(Figure: a microarray chip image is processed by image analysis.)

Data Preparation for Data Mining
(Figure: microarray image samples 1 … M are converted by image analysis into a numerical gene-by-sample matrix for data mining.)

Example 1: Tumor Type Classification
Task: classify leukemia samples into two classes based on gene expression patterns, using Bayesian networks.
(Figure: DNA microarray data from two kinds of leukemia patients → selected genes and the target variable → learned Bayesian network over Gene A, Gene B, Gene C, Gene D, and the Target node.)

Data Sets
72 samples in total: 25 AML (acute myeloid leukemia) samples + 47 ALL (acute lymphoblastic leukemia) samples (Golub et al., 1999). A data sample consists of expression measurements for over 7,000 genes.
Preprocessing:
- 30 informative genes were selected by the P-metric:
  score(g) = (μ_AML − μ_ALL) / (σ_AML + σ_ALL)
- Expression values were discretized into 2 levels, High and Low.
The final data set is 72 samples, each consisting of 'High' or 'Low' expression values of the 30 genes.

Classification by Bayesian Networks
Classification is an inference for one variable (node) in a Bayesian network. The Bayes optimal classifier is
c*(g) = argmax_{c∈C} P(c | g, D) = argmax_{c∈C} Σ_h P(c | g, h) P(h | D)
If a single learned structure ĥ is used, i.e. P(ĥ | D) = 1, then c*(g) = argmax_{c∈C} P(c | g, ĥ).
For a learned network,
P(c | g) = P(c, g) / P(g) = P(c, g) / Σ_l P(c_l, g) ∝ P(c, g)
and P(type, g) factors into the network's conditional probabilities: P(type) multiplied by terms such as P(pic | type), P(sltc4 | type), P(zyxin | type), P(myb_c | type, pic), P(fah | type, sltc4), and P(mb-1 | type, fah).


Naïve Bayes Classifier
A very restricted form of Bayesian network: all variables (nodes) are conditionally independent given the value of the class variable. Though simple, it is still widely used in classification tasks.
P(type, s) = P(type) P(s | type)
           = P(type) P(mybc | type) P(zyxin | type) P(pic | type) P(sltc4 | type) P(fah | type) P(mb-1 | type)
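A naïve Bayes classifier of this form is a few lines of counting. A minimal sketch on a made-up miniature of the leukemia task (the gene names and samples below are hypothetical, not the Golub et al. data):

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical 'H'/'L' expression of two genes per sample, with class labels
train = [({'g1': 'H', 'g2': 'H'}, 'AML'), ({'g1': 'H', 'g2': 'L'}, 'AML'),
         ({'g1': 'L', 'g2': 'L'}, 'ALL'), ({'g1': 'L', 'g2': 'H'}, 'ALL'),
         ({'g1': 'L', 'g2': 'L'}, 'ALL')]

classes = Counter(c for _, c in train)
counts = defaultdict(Counter)                  # (class, gene) -> value counts
for x, c in train:
    for g, v in x.items():
        counts[(c, g)][v] += 1

def log_posterior(x, c):
    # log P(c) + sum_i log P(x_i | c), with add-one (Laplace) smoothing over {H, L}
    lp = log(classes[c] / len(train))
    for g, v in x.items():
        lp += log((counts[(c, g)][v] + 1) / (classes[c] + 2))
    return lp

def predict(x):
    return max(classes, key=lambda c: log_posterior(x, c))

print(predict({'g1': 'H', 'g2': 'H'}))   # AML
```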

Tree-Augmented Network (Friedman et al., 1997)
Naïve Bayes plus a tree-structured network over the variables other than the class node. It expresses the dependencies between variables in a restricted way:
- 'Class' is the root node.
- Each Xi can have at most one parent besides the 'Class' node.
For example, P(zyxin | type) in naïve Bayes may become P(zyxin | type, fah) in a TAN.

Results
Leave-one-out cross-validation: Bayesian network learning with 71 samples and testing on the remaining sample, iterated 72 times.

Accuracy                    6 genes      30 genes
NB                          66/72        68/72
TAN                         69/72        68/72
General BN (max #pa = 2)    69/72 (NB)   69/72 (NB)
General BN (max #pa = 3)    70/72 (NB)   67/72 (NB)

(Figure: the learned NB and TAN structures, labeled 2_NO, 2_NB, 3_NO, 3_NB.)

Markov Blanket in BN
The Markov blanket of a node T renders T independent of the rest of the network:
P(T | X \ {T}) = P(T | MB(T)),   MB(T) = Pa(T) ∪ Ch(T) ∪ (Pa(Ch(T)) \ {T})
(Figure: in the example network over X1-X10, MB(T) consists of T's parents, children, and its children's other parents; in the TAN classifier, the Markov blanket of the node 'type' covers the genes X3, X4, X6, X7, X8.)
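MB(T) = Pa(T) ∪ Ch(T) ∪ (Pa(Ch(T)) \ {T}) can be read off directly from a parent map. A minimal sketch on a hypothetical DAG (the edges below are illustrative, not the slide's exact network):

```python
# DAG as a child -> parents map (hypothetical edges, for illustration only)
parents = {'T': ['X3', 'X4'], 'X6': ['T', 'X5'], 'X7': ['T'],
           'X3': [], 'X4': [], 'X5': [], 'X8': ['X6']}

def markov_blanket(node):
    children = [c for c, ps in parents.items() if node in ps]
    mb = set(parents.get(node, []))        # Pa(T)
    mb.update(children)                    # Ch(T)
    for c in children:                     # Pa(Ch(T)), the "spouses"
        mb.update(parents[c])
    mb.discard(node)                       # \ {T}
    return mb

print(sorted(markov_blanket('T')))   # ['X3', 'X4', 'X5', 'X6', 'X7']
```

Note that X8, a grandchild of T, stays outside the blanket, as the definition requires.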

Example 2: Gene-Drug Dependency Analysis
Task: construct a Bayesian network for the combined analysis of gene expression data and drug activity data.
(Figure: gene expression data and drug activity data are preprocessed by thresholding and discretization; the selected genes, drugs, and a cancer-type node are fed to Bayesian network learning; the learned network supports dependency analysis and probabilistic inference.)

Data Sets
NCI60 data sets (Scherf et al., 2000): 9,703 genes and 1,400 chemical compounds measured on 60 human cancer samples.
Preprocessing:
- 12 genes and 4 drugs were selected based on the correlation analysis in (Scherf et al., 2000), considering the learning time and the visualization of the Bayesian networks.
- Gene expression and drug activity values were discretized into 3 levels: High, Mid, Low.
Nodes in the Bayesian network: 12 genes, 4 drugs, and the cancer type.

Results
(Figure: the learned network, visualized with the BNJ software, http://bnj.sourceforge.net/.)
Identification of a plausible dependency among 2 genes and 1 drug.

Inferring Gene Regulation Models
Reconstruction of regulatory relations among genes from genome data. One of the challenges in reconstructing gene regulatory networks with Bayesian networks is statistical robustness, which arises from the sparse nature of gene expression data: the number of samples is usually much smaller than the number of genes (attributes), and this can produce unstable, spurious results. Remedies include the bootstrap, model averaging, and the incorporation of prior biological knowledge.

Bootstrap-Based Approach
Learn multiple Bayesian networks from bootstrapped samples (Friedman et al., 2000; Pe'er et al., 2001):
1. Resample the expression data with replacement to obtain data sets B1, B2, …, BL.
2. Learn a network Gi from each Bi.
3. Estimate feature statistics (e.g., "gene Y is in the Markov blanket of gene X"):
   conf(f) = (1/L) Σ_{i=1}^{L} f(Gi)
4. Identify significant pairwise relations (Friedman et al., 2000) or subnetworks (Pe'er et al., 2001).
Friedman et al. (2000) used yeast cell-cycle data: 76 samples for 800 genes.
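The bootstrap confidence conf(f) = (1/L) Σ_i f(Gi) can be sketched with a stand-in learner; the `learn` and `feature` functions below are toy placeholders, not a real structure learner:

```python
import random

random.seed(0)   # deterministic resampling for the demo

def bootstrap_confidence(data, learn, feature, L=100):
    """conf(f) = (1/L) * sum_i f(G_i), with each G_i learned from a resample."""
    hits = 0
    for _ in range(L):
        resample = [random.choice(data) for _ in data]   # sample with replacement
        hits += bool(feature(learn(resample)))
    return hits / L

# Toy stand-ins: 'learn' returns an edge set via a trivial agreement rule,
# 'feature' asks whether the edge X -> Y was recovered. Both are placeholders.
def learn(sample):
    agree = sum(x == y for x, y in sample)
    return {('X', 'Y')} if agree > len(sample) / 2 else set()

data = [(0, 0), (1, 1), (0, 0), (1, 0), (1, 1)]
conf = bootstrap_confidence(data, learn, lambda g: ('X', 'Y') in g)
print(0.0 <= conf <= 1.0)   # True
```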

Model Averaging-Based Approach
Estimate the confidence of features by averaging over the posterior distribution of models (BNs) (Hartemink et al., 2002):
conf(f) = Σ_{i=1}^{L} P(Gi | D) P(f | D, Gi)
Structure learning over the expression data is done with simulated annealing; location data serves as a prior.

Incorporating Prior Knowledge or Assumptions
Impose biologically motivated assumptions on the network structure. Example: construction of a gene regulation model from gene expression data by inferring regulator sets (Pe'er et al., 2002).
Assumption: a relatively small set of genes is directly involved in transcriptional regulation.
- Limit the number of candidate regulators.
- Limit the number of parents for each node (gene) and the number of genes having outgoing edges.
This significantly reduces the number of possible models and improves the statistical robustness of the identified regulator sets.

Pipeline of (Pe'er et al., 2002): 358 samples from combined data sets for S. cerevisiae.
1. Gene expression data: remove non-informative genes; discretize → 3,622 genes in total.
2. Candidate regulator set (= 456 genes).
3. Learning with structure constraints → a 2-level network: a regulator set R1, R2, R3, …, RM pointing at each target, scored by Score(g; Pa_g) = I(g; Pa_g).

Summary
Bayesian networks provide an efficient and effective framework for organizing a body of knowledge by encoding the probabilistic relationships among variables of interest.
- Graph theory + probability theory: a DAG + local probability distributions.
- Conditional independence and conditional probability are the keystones.
- A compact way to express complex systems through simpler probabilistic modules, and thus a natural framework for dealing with complexity and uncertainty.
Two problems in learning Bayesian networks from data:
- Parameter estimation: MLE, MAP, Bayesian estimation.
- Structural learning: tree-structured networks; heuristic search for general Bayesian networks.

Summary (Cont'd)
Issues not covered in this tutorial, but important:
- Probabilistic inference in general Bayesian networks: exact and approximate inference.
- Incomplete data (missing data, hidden variables), with known or unknown structure; the Expectation-Maximization algorithm can be employed, as in latent variable models.
- Bayesian model averaging over structures as well as parameters.
- Dynamic Bayesian networks for temporal data.
Many applications: text and web mining, medical diagnosis, intelligent agent systems, bioinformatics.


References

Bayesian networks (papers & books):

Chow, C. and Liu, C., "Approximating discrete probability distributions with dependency trees", IEEE Transactions on Information Theory, 14, pp. 462-467, 1968.
Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988.
Neapolitan, R. E., Probabilistic Reasoning in Expert Systems, 1990.
Charniak, E., "Bayesian networks without tears", AI Magazine, 12(4), pp. 50-63, 1991.
Heckerman, D., "Learning Bayesian networks: the combination of knowledge and statistical data", Machine Learning, 20, pp. 197-243, 1995.
Jensen, F. V., An Introduction to Bayesian Networks, Springer-Verlag, 1996.
Friedman, N., Geiger, D., and Goldszmidt, M., "Bayesian network classifiers", Machine Learning, 29(2), pp. 131-163, 1997.
Frey, B. J., Graphical Models for Machine Learning and Digital Communication, MIT Press, 1998.
Jordan, M. I. (ed.), Learning in Graphical Models, Kluwer Academic Publishers, 1998.
Friedman, N. and Goldszmidt, M., "Learning Bayesian networks with local structure", in Learning in Graphical Models (ed. Jordan, M. I.), pp. 421-460, MIT Press, 1999.
Heckerman, D., "A tutorial on learning with Bayesian networks", in Learning in Graphical Models (ed. Jordan, M. I.), pp. 301-354, MIT Press, 1999.
Pearl, J. and Russell, S., "Bayesian networks", UCLA Cognitive Systems Laboratory, Technical Report R-277, November 2000.
Spirtes, P., Glymour, C., and Scheines, R., Causation, Prediction, and Search, 2nd edition, MIT Press, 2000.
Jensen, F. V., Bayesian Networks and Decision Graphs, Springer, 2001.
Pearl, J., "Bayesian networks, causal inference and knowledge discovery", UCLA Cognitive Systems Laboratory, Technical Report R-281, March 2001.
Korb, K. B. and Nicholson, A. E., Bayesian Artificial Intelligence, CRC Press, 2004.


Graphical models and Bayesian networks (tutorials & lectures):

Friedman, N. and Koller, D., Learning Bayesian Networks from Data (http://www.cs.huji.ac.il/~nir/Nips01-Tutorial/).
Murphy, K., A Brief Introduction to Graphical Models and Bayesian Networks (http://www.cs.ubc.ca/~murphyk/Bayes/bayes.html).
Ghahramani, Z., Probabilistic Models for Unsupervised Learning (http://www.gatsby.ucl.ac.uk/~zoubin/NIPStutorial.html).
Ghahramani, Z., Bayesian Methods for Machine Learning (http://www.gatsby.ucl.ac.uk/~zoubin/ICML04-tutorial.html).
Moser, A., Probabilistic Independence Networks (I-II) (http://www.eecis.udel.edu/~bloch/eleg867/notes/PIN_I_rev2/index.htm, http://www.eecis.udel.edu/~bloch/eleg867/notes/PIN_II_rev2/index.htm).
Poet, M., Special Topics: Belief Networks (http://www.stat.duke.edu/courses/Spring99/sta294/).


Bayesian networks with applications to bioinformatics:

Murphy, K. and Mian, S., "Modeling gene expression data using dynamic Bayesian networks", Technical report, Computer Science Division, University of California, Berkeley, CA, 1999.
Friedman, N., Linial, M., Nachman, I., and Pe'er, D., "Using Bayesian networks to analyze expression data", Journal of Computational Biology, 7(3/4), pp. 601-620, 2000.
Cai, D., Delcher, A., Kao, B., and Kasif, S., "Modeling splice sites with Bayes networks", Bioinformatics, 16(2), pp. 152-158, 2000.
Pe'er, D., Regev, A., Elidan, G., and Friedman, N., "Inferring subnetworks from perturbed expression profiles", Bioinformatics, 17(suppl 1), pp. S215-S224, 2001.
Pe'er, D., Regev, A., and Tanay, A., "Minreg: inferring an active regulator set", Bioinformatics, 18(suppl 1), pp. S258-S267, 2002.
Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., and Young, R. A., "Combining location and expression data for principled discovery of genetic regulatory network models", Pacific Symposium on Biocomputing, 7, pp. 437-449, 2002.
Imoto, S., Kim, S., Goto, T., Aburatani, S., Tashiro, K., Kuhara, S., and Miyano, S., "Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network", Journal of Bioinformatics and Computational Biology, 1(2), pp. 231-252, 2003.
Jansen, R., et al., "A Bayesian networks approach for predicting protein-protein interactions from genomic data", Science, 302, pp. 449-453, 2003.
Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B., and Botstein, D., "A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae)", PNAS, 100(14), pp. 8348-8353, 2003.
Beer, M. A. and Tavazoie, S., "Predicting gene expression from sequence", Cell, 117(2), pp. 185-198, 2004.
Friedman, N., "Inferring cellular networks using probabilistic graphical models", Science, 303(5659), pp. 799-805, 2004.

A comprehensive list is available at http://zlab.bu.edu/kasif/bayes-net.html


Bayesian network software packages

Bayes Net Toolbox (Kevin Murphy): http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
A variety of algorithms for learning and inference in graphical models (written in MATLAB).

WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
Bayesian network learning and classification modules are included in a collection of machine learning algorithms (written in Java).

Detailed lists and comparisons: http://www.cs.ubc.ca/~murphyk/Bayes/bnsoft.html

Bayesian network list in the Google directory:
http://directory.google.com/Top/Computers/Artificial_Intelligence/Belief_Networks/Software/

Korb, K. B. and Nicholson, A. B., Bayesian Artificial Intelligence, CRC Press, 2004.

http://www.csse.monash.edu.au/bai/book/appendix_b.pdf