
Learning Bayesian Networks: The Combination of Knowledge and Statistical Data

David Heckerman
heckerma@microsoft.com

Dan Geiger
dang@cs.technion.ac.il

David M. Chickering
dmax@cs.ucla.edu

March 1994 (Revised February 1995)

Technical Report
MSR-TR-94-09

    Microsoft Research

    Advanced Technology Division

    Microsoft Corporation

    One Microsoft Way

Redmond, WA

    To appear in Machine Learning

Abstract

We describe a Bayesian approach for learning Bayesian networks from a combination of prior knowledge and statistical data. First and foremost, we develop a methodology for assessing informative priors needed for learning. Our approach is derived from a set of assumptions made previously, as well as the assumption of likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence. We show that likelihood equivalence, when combined with previously made assumptions, implies that the user's priors for network parameters can be encoded in a single Bayesian network for the next case to be seen (a prior network) and a single measure of confidence for that network. Second, using these priors, we show how to compute the relative posterior probabilities of network structures given data. Third, we describe search methods for identifying network structures with high posterior probabilities. We describe polynomial algorithms for finding the highest-scoring network structures in the special case where every node has at most k = 1 parent. For the general case (k > 1), which is NP-hard, we review heuristic search algorithms including local search, iterative local search, and simulated annealing. Finally, we describe a methodology for evaluating Bayesian-network learning algorithms, and apply this approach to a comparison of various approaches.

Keywords: Bayesian networks, learning, Dirichlet, likelihood equivalence, maximum branching, heuristic search

    Introduction

A Bayesian network is an annotated directed graph that encodes probabilistic relationships among distinctions of interest in an uncertain-reasoning problem (Howard and Matheson; Pearl). The representation formally encodes the joint probability distribution for its domain, yet includes a human-oriented qualitative structure that facilitates communication between a user and a system incorporating the probabilistic model. We discuss the representation in detail in the following section. For over a decade, AI researchers have used Bayesian networks to encode expert knowledge. More recently, AI researchers and statisticians have begun to investigate methods for learning Bayesian networks, including Bayesian methods (Cooper and Herskovits; Buntine; Spiegelhalter et al.; Dawid and Lauritzen; Heckerman et al.), quasi-Bayesian methods (Lam and Bacchus; Suzuki), and non-Bayesian methods (Pearl and Verma; Spirtes et al.).

In this paper, we concentrate on the Bayesian approach, which takes prior knowledge and combines it with data to produce one or more Bayesian networks. Our approach is illustrated in Figure 1 for the problem of ICU ventilator management. Using our method, a user specifies his prior knowledge about the problem by constructing a Bayesian network,

called a prior network, and by assessing his confidence in this network. A hypothetical prior network is shown in Figure 1b (the probabilities are not shown). In addition, a database of cases is assembled, as shown in Figure 1c. Each case in the database contains observations for every variable in the user's prior network. Our approach then takes these sources of information and learns one or more new Bayesian networks, as shown in Figure 1d. To appreciate the effectiveness of the approach, note that the database was generated from the Bayesian network in Figure 1a, known as the Alarm network (Beinlich et al.). Comparing the three network structures, we see that the structure of the learned network is much closer to that of the Alarm network than is the structure of the prior network. In effect, our learning algorithm has used the database to correct the prior knowledge of the user.

Our Bayesian approach can be understood as follows. Suppose we have a domain of discrete variables $U = \{x_1, \ldots, x_n\}$ and a database of cases $D = \{C_1, \ldots, C_m\}$. Further suppose that we wish to determine the joint distribution $p(C|D, \xi)$: the probability distribution of a new case $C$, given the database and our current state of information $\xi$. Rather than reason about this distribution directly, we imagine that the data is a random sample from an unknown Bayesian network structure $B_s$ with unknown parameters. Using $B_s^h$ to denote the hypothesis that the data is generated by network structure $B_s$, and assuming the hypotheses corresponding to all possible network structures form a mutually exclusive and collectively exhaustive set, we have

$$p(C|D, \xi) = \sum_{\text{all } B_s^h} p(C|D, B_s^h, \xi)\; p(B_s^h|D, \xi)$$

In practice, it is impossible to sum over all possible network structures. Consequently, we attempt to identify a small subset $H$ of network-structure hypotheses that account for a large fraction of the posterior probability of the hypotheses. Rewriting the previous equation, we obtain

$$p(C|D, \xi) \approx c \sum_{B_s^h \in H} p(C|D, B_s^h, \xi)\; p(B_s^h|D, \xi)$$

where $c$ is the normalization constant $1/\sum_{B_s^h \in H} p(B_s^h|D, \xi)$. From this relation, we see that only the relative posterior probabilities of hypotheses matter. Thus, rather than compute a posterior probability, which would entail summing over all structures, we can compute a Bayes factor, $p(B_s^h|D, \xi)/p(B_{s0}^h|D, \xi)$, where $B_{s0}$ is some reference structure such as the one containing no arcs, or simply $p(D, B_s^h|\xi) = p(B_s^h|\xi)\, p(D|B_s^h, \xi)$. In the latter case, we have

$$p(C|D, \xi) \approx c \sum_{B_s^h \in H} p(C|D, B_s^h, \xi)\; p(D, B_s^h|\xi)$$

where $c$ is another normalization constant, $1/\sum_{B_s^h \in H} p(D, B_s^h|\xi)$.
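To make the averaging step concrete, the following is a minimal sketch (not from the paper) of how this final approximation might be computed once relative scores $p(D, B_s^h|\xi)$ and predictive terms $p(C|D, B_s^h, \xi)$ are available for a small hypothesis set $H$; the function name and the numbers are illustrative assumptions.

```python
def average_prediction(scores, predictions):
    """Approximate p(C|D) by averaging predictions over a set H of hypotheses.

    scores:      dict mapping hypothesis name -> relative score p(D, B^h | xi)
    predictions: dict mapping hypothesis name -> p(C | D, B^h, xi) for a fixed case C
    """
    total = sum(scores.values())              # 1/c, the normalization constant
    return sum(predictions[h] * scores[h] / total for h in scores)

# Hypothetical example with two network-structure hypotheses.
scores = {"x->y": 0.7e-10, "empty": 0.3e-10}    # relative scores p(D, B^h | xi)
predictions = {"x->y": 0.82, "empty": 0.75}     # p(C | D, B^h, xi) for the new case C
print(average_prediction(scores, predictions))  # weighted average, approx 0.799
```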

Figure 1: (a) The Alarm network structure. (b) A prior network encoding a user's beliefs about the Alarm domain. (c) A case database generated from the Alarm network. (d) The network learned from the prior network and a case database generated from the Alarm network. Arcs that are added, deleted, or reversed with respect to the Alarm network are indicated with A, D, and R, respectively.

In short, the Bayesian approach to learning Bayesian networks amounts to searching for network-structure hypotheses with high relative posterior probabilities. Many non-Bayesian approaches use the same basic approach, but optimize some other measure of how well the structure fits the data. In general, we refer to such measures as scoring metrics. We refer to any formula for computing the relative posterior probability of a network-structure hypothesis as a Bayesian scoring metric.

The Bayesian approach is not only an approximation for $p(C|D, \xi)$ but a method for learning network structure. When $|H| = 1$, we learn a single network structure: the MAP (maximum a posteriori) structure of $U$. When $|H| > 1$, we learn a collection of network structures. As we discuss later in the paper, learning network structure is useful because we can sometimes use structure to infer causal relationships in a domain, and consequently predict the effects of interventions.

One of the most challenging tasks in designing a Bayesian learning procedure is identifying classes of easy-to-assess informative priors for computing the terms on the right-hand side of the relation above. In the first part of the paper, we explicate a set of assumptions for discrete networks (networks containing only discrete variables) that leads to such a class of informative priors. Our assumptions are based on those made by Cooper and Herskovits (herein referred to as CH), Spiegelhalter et al. and Dawid and Lauritzen (herein referred to as SDLC), and Buntine. These researchers assumed parameter independence, which says that the parameters associated with each node in a Bayesian network are independent; parameter modularity, which says that if a node has the same parents in two distinct networks, then the probability density functions of the parameters associated with this node are identical in both networks; and the Dirichlet assumption, which says that all network parameters have a Dirichlet distribution. We assume parameter independence and parameter modularity, but instead of adopting the Dirichlet assumption, we introduce an assumption called likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence. We argue that this property is necessary when learning acausal Bayesian networks, and is often reasonable when learning causal Bayesian networks. We then show that likelihood equivalence, when combined with parameter independence and several weak conditions, implies the Dirichlet assumption. Furthermore, we show that likelihood equivalence constrains the Dirichlet distributions in such a way that they may be obtained from the user's prior network (a Bayesian network for the next case to be seen) and a single equivalent sample size reflecting the user's confidence in his prior network.

Our result has both a positive and a negative aspect. On the positive side, we show that parameter independence, parameter modularity, and likelihood equivalence lead to a simple approach for assessing priors that requires the user to assess only one equivalent sample size for the entire domain. On the negative side, the approach is sometimes too simple: a user may have more knowledge about one part of a domain than another. We argue that the assumptions of parameter independence and likelihood equivalence are sometimes too strong, and suggest a framework for relaxing these assumptions.

A more straightforward task in learning Bayesian networks is using a given informative prior to compute $p(D, B_s^h|\xi)$ (i.e., a Bayesian scoring metric) and $p(C|D, B_s^h, \xi)$. When databases are complete, that is, when there is no missing data, these terms can be derived in closed form. Otherwise, well-known statistical approximations may be used. In this paper, we consider complete databases only, and derive closed-form expressions for these terms. A result is a likelihood-equivalent Bayesian scoring metric, which we call the BDe metric. This metric is to be contrasted with the metrics of CH and Buntine, which do not make use of a prior network, and with the metrics of CH and SDLC, which do not satisfy the property of likelihood equivalence.

In the second part of the paper, we examine methods for finding networks with high scores. The methods can be used with any scoring metric. We describe polynomial algorithms for finding the highest-scoring networks in the special case where every node has at most one parent. In addition, we describe local-search and annealing algorithms for the general case, which is known to be NP-hard.

Finally, we describe a methodology for evaluating learning algorithms. We use this methodology to compare various scoring metrics and search methods.

We note that several researchers (e.g., Dawid and Lauritzen, and Madigan and Raftery) have developed methods for learning undirected network structures, as described in, e.g., Lauritzen. In this paper, we concentrate on learning directed models, because we can sometimes use them to infer causal relationships, and because most users find them easier to interpret.

    Background

In this section, we introduce notation and background material that we need for our discussion, including a description of Bayesian networks, exchangeability, multinomial sampling, and the Dirichlet distribution. A summary of our notation is given after the Appendix.

Throughout this discussion, we consider a domain $U$ of $n$ discrete variables $x_1, \ldots, x_n$.

We use lower-case letters to refer to variables and upper-case letters to refer to sets of variables. We write $x_i = k$ to denote that variable $x_i$ is in state $k$. When we observe the state for every variable in set $X$, we call this set of observations an instance of $X$, and we write $X = k_X$ as a shorthand for the observations $x_i = k_i$, $x_i \in X$. The joint space of $U$ is the set of all instances of $U$. We use $p(X = k_X|Y = k_Y, \xi)$ to denote the probability that $X = k_X$ given $Y = k_Y$ for a person with current state of information $\xi$. We use $p(X|Y, \xi)$ to denote the set of probabilities for all possible observations of $X$, given all possible observations of $Y$. The joint probability distribution over $U$ is the probability distribution over the joint space of $U$.

A Bayesian network for domain $U$ represents a joint probability distribution over $U$. The representation consists of a set of local conditional distributions combined with a set of conditional-independence assertions that allow us to construct a global joint probability distribution from the local distributions. In particular, by the chain rule of probability, we have

$$p(x_1, \ldots, x_n|\xi) = \prod_{i=1}^{n} p(x_i|x_1, \ldots, x_{i-1}, \xi)$$

For each variable $x_i$, let $\Pi_i \subseteq \{x_1, \ldots, x_{i-1}\}$ be a set of variables that renders $x_i$ and $\{x_1, \ldots, x_{i-1}\}$ conditionally independent. That is,

$$p(x_i|x_1, \ldots, x_{i-1}, \xi) = p(x_i|\Pi_i, \xi)$$

A Bayesian-network structure $B_s$ encodes these assertions of conditional independence. Namely, $B_s$ is a directed acyclic graph such that each variable in $U$ corresponds to a node in $B_s$, and the parents of the node corresponding to $x_i$ are the nodes corresponding to the variables in $\Pi_i$. (In this paper, we use $x_i$ to refer to both the variable and its corresponding node in a graph.) A Bayesian-network probability set $B_p$ is the collection of local distributions $p(x_i|\Pi_i, \xi)$ for each node in the domain. A Bayesian network for $U$ is the pair $(B_s, B_p)$. Combining the two previous equations, we see that any Bayesian network for $U$ uniquely determines a joint probability distribution for $U$. That is,

$$p(x_1, \ldots, x_n|\xi) = \prod_{i=1}^{n} p(x_i|\Pi_i, \xi)$$

When a variable has only two states, we say that it is binary. A Bayesian network for three binary variables $x_1$, $x_2$, and $x_3$ is shown in Figure 2. We see that $\Pi_2 = \{x_1\}$ and $\Pi_3 = \{x_2\}$. Consequently, this network represents the conditional-independence assertion $p(x_3|x_1, x_2, \xi) = p(x_3|x_2, \xi)$.
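As a concrete illustration of how the local distributions determine the joint distribution, here is a small sketch (not part of the original paper) that multiplies out the factorization $p(x_1)\,p(x_2|x_1)\,p(x_3|x_2)$ using the probabilities shown in Figure 2; all numbers are taken from that figure.

```python
from itertools import product

# Local distributions from Figure 2 (states: True = present, False = absent).
p_x1 = {True: 0.6, False: 0.4}
p_x2_given_x1 = {True: {True: 0.8, False: 0.2}, False: {True: 0.3, False: 0.7}}
p_x3_given_x2 = {True: {True: 0.9, False: 0.1}, False: {True: 0.15, False: 0.85}}

# Joint distribution p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2).
joint = {
    (x1, x2, x3): p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]
    for x1, x2, x3 in product([True, False], repeat=3)
}

# Check the asserted independence: p(x3 | x1, x2) equals p(x3 | x2).
for x1, x2 in product([True, False], repeat=2):
    denom = sum(joint[(x1, x2, x3)] for x3 in [True, False])
    assert abs(joint[(x1, x2, True)] / denom - p_x3_given_x2[x2][True]) < 1e-12

print(sum(joint.values()))  # 1.0
```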

Figure 2: A Bayesian network for three binary variables, taken from CH. The network represents the assertion that $x_1$ and $x_3$ are conditionally independent given $x_2$. Each variable has two states, "absent" and "present". The local distributions are $p(x_1 = \text{present}|\xi) = 0.6$; $p(x_2 = \text{present}|x_1 = \text{present}, \xi) = 0.8$; $p(x_2 = \text{present}|x_1 = \text{absent}, \xi) = 0.3$; $p(x_3 = \text{present}|x_2 = \text{present}, \xi) = 0.9$; and $p(x_3 = \text{present}|x_2 = \text{absent}, \xi) = 0.15$.

It can happen that two Bayesian-network structures represent the same constraints of conditional independence; that is, every joint probability distribution encoded by one structure can be encoded by the other, and vice versa. In this case, the two network structures are said to be equivalent (Verma and Pearl). For example, the structures $x_1 \rightarrow x_2 \rightarrow x_3$ and $x_1 \leftarrow x_2 \leftarrow x_3$ both represent the assertion that $x_1$ and $x_3$ are conditionally independent given $x_2$, and are equivalent. In some of the technical discussions in this paper, we shall require the following characterization of equivalent networks, which we prove in the Appendix.

    shall require the following characterization of equivalent networks proved in the Appendix

    Theorem Let Bs and Bs be two Bayesiannetwork structures and RBsBs be the set

    of edges by which Bs and Bs dier in directionality Then Bs and Bs are equivalent

    if and only if there exists a sequence of jRBsBsj distinct arc reversals applied to Bs with

    the following properties

    After each reversal the resulting network structure contains no directed cycles and is

    equivalent to Bs

    After all reversals the resulting network structure is identical to Bs

    If x y is the next arc to be reversed in the current network structure then x and y

    have the same parents in both network structures with the exception that x is also a

    parent of y in Bs
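The key condition in this theorem is that each reversed arc $x \rightarrow y$ is "covered": $x$ and $y$ share the same parents, except that $x$ is itself a parent of $y$. Here is a small illustrative sketch (not from the paper) of that check for a DAG stored as a parent-set dictionary; the representation is an assumption made for concreteness.

```python
def is_covered(parents, x, y):
    """Return True if the arc x -> y is covered in the DAG given by `parents`,
    i.e., y's parents are exactly x's parents plus x itself. Reversing a covered
    arc yields an equivalent structure and preserves acyclicity."""
    if x not in parents[y]:
        raise ValueError(f"{x} -> {y} is not an arc of the structure")
    return parents[y] - {x} == parents[x]

# Example: the chain x1 -> x2 -> x3.
chain = {"x1": set(), "x2": {"x1"}, "x3": {"x2"}}
print(is_covered(chain, "x1", "x2"))  # True: reversal keeps equivalence
print(is_covered(chain, "x2", "x3"))  # False: x2 has parent x1 while x3 does not
```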

A drawback of Bayesian networks, as defined, is that network structure depends on variable order. If the order is chosen carelessly, the resulting network structure may fail to reveal many conditional independencies in the domain. Fortunately, in practice, Bayesian networks are typically constructed using notions of cause and effect. Loosely speaking, to construct a Bayesian network for a given set of variables, we draw arcs from cause variables to their immediate effects. For example, we would obtain the network structure in Figure 2 if we believed that $x_2$ is the immediate causal effect of $x_1$ and $x_3$ is the immediate causal effect of $x_2$. In almost all cases, constructing a Bayesian network in this way yields a Bayesian network that is consistent with the formal definition. We return to this issue later in the paper.

Figure 3: A Bayesian network showing the conditional-independence assertions associated with a multinomial sample.

Now let us consider exchangeability and random sampling. Most of the concepts we discuss can be found in Good and DeGroot. Given a discrete variable $y$ with $r$ states, consider a finite sequence of observations $y_1, \ldots, y_m$ of this variable. We can think of this sequence as a database $D$ for the one-variable domain $U = \{y\}$. This sequence is said to be exchangeable if a sequence obtained by interchanging any two observations in the sequence has the same probability as the original sequence. Roughly speaking, the assumption that a sequence is exchangeable is an assertion that the processes generating the data do not change in time.

Given an exchangeable sequence, de Finetti showed that there exist parameters $\Theta_y = \{\theta_{y=1}, \ldots, \theta_{y=r}\}$ such that

$$\theta_{y=k} > 0, \quad k = 1, \ldots, r; \qquad \sum_{k=1}^{r} \theta_{y=k} = 1$$

$$p(y_l = k|y_1, \ldots, y_{l-1}, \Theta_y, \xi) = \theta_{y=k}$$

That is, the parameters $\Theta_y$ render the individual observations in the sequence conditionally independent, and the probability that any given observation will be in state $k$ is just $\theta_{y=k}$. This conditional-independence assertion may be represented as a Bayesian network, as shown in Figure 3. By the strong law of large numbers (e.g., DeGroot), we may think of $\theta_{y=k}$ as the long-run fraction of observations where $y = k$, although there are other interpretations (Howard). Also note that each parameter $\theta_{y=k}$ is positive (i.e., greater than zero).

A sequence that satisfies these conditions is a particular type of random sample known as an $r$-dimensional multinomial sample with parameters $\Theta_y$ (Good). When $r = 2$, the

sequence is said to be a binomial sample. One example of a binomial sample is the outcome of repeated flips of a thumbtack. If we knew the long-run fraction of "heads" (point up) for a given thumbtack, then the outcome of each flip would be independent of the rest, and would have a probability of heads equal to this fraction. An example of a multinomial sample is the outcome of repeated rolls of a multi-sided die. As we shall see, learning Bayesian networks for discrete domains essentially reduces to the problem of learning the parameters of a die having many sides.

As $\Theta_y$ is a set of continuous variables, it has a probability density, which we denote $\rho(\Theta_y|\xi)$. Throughout this paper, we use $\rho(\cdot|\cdot)$ to denote a probability density for a continuous variable or set of continuous variables. Given $\rho(\Theta_y|\xi)$, we can determine the probability that $y = k$ in the next observation. In particular, by the rules of probability, we have

$$p(y = k|\xi) = \int p(y = k|\Theta_y, \xi)\; \rho(\Theta_y|\xi)\; d\Theta_y$$

Consequently, by the second condition above, we obtain

$$p(y = k|\xi) = \int \theta_{y=k}\; \rho(\Theta_y|\xi)\; d\Theta_y$$

which is the mean or expectation of $\theta_{y=k}$ with respect to $\rho(\Theta_y|\xi)$, denoted $E(\theta_{y=k}|\xi)$.

Suppose we have a prior density for $\Theta_y$ and then observe a database $D$. We may obtain the posterior density for $\Theta_y$ as follows. From Bayes' rule, we have

$$\rho(\Theta_y|D, \xi) = c\; p(D|\Theta_y, \xi)\; \rho(\Theta_y|\xi)$$

where $c$ is a normalization constant. Using the multinomial-sample condition to rewrite the first term on the right-hand side, we obtain

$$\rho(\Theta_y|D, \xi) = c \prod_{k=1}^{r} \theta_{y=k}^{N_k}\; \rho(\Theta_y|\xi)$$

where $N_k$ is the number of times $y = k$ in $D$. Note that only the counts $N_1, \ldots, N_r$ are necessary to determine the posterior from the prior. These counts are said to be a sufficient statistic for the multinomial sample.

In addition, suppose we assess a density for two different states of information $\xi_1$ and $\xi_2$, and find that $\rho(\Theta_y|\xi_1) = \rho(\Theta_y|\xi_2)$. Then, for any multinomial sample $D$,

$$p(D|\xi_1) = \int p(D|\Theta_y, \xi_1)\; \rho(\Theta_y|\xi_1)\; d\Theta_y = p(D|\xi_2)$$

because $p(D|\Theta_y, \xi_1) = p(D|\Theta_y, \xi_2)$ by the multinomial-sample condition. That is, if the densities for $\Theta_y$ are the same, then the probability of any two samples will be the same. The converse is also true: namely, if $p(D|\xi_1) = p(D|\xi_2)$ for all databases $D$, then $\rho(\Theta_y|\xi_1) = \rho(\Theta_y|\xi_2)$. We shall use this equivalence when we discuss likelihood equivalence.

(We assume this result is well known, although we have not found a proof in the literature.)

Given a multinomial sample, a user is free to assess any probability density for $\Theta_y$. In practice, however, one often uses the Dirichlet distribution because it has several convenient properties. The parameters $\Theta_y$ have a Dirichlet distribution with exponents $N_1', \ldots, N_r'$ when the probability density of $\Theta_y$ is given by

$$\rho(\Theta_y|\xi) = \frac{\Gamma\!\left(\sum_{k=1}^{r} N_k'\right)}{\prod_{k=1}^{r} \Gamma(N_k')} \prod_{k=1}^{r} \theta_{y=k}^{N_k' - 1}, \qquad N_k' > 0$$

where $\Gamma(\cdot)$ is the Gamma function, which satisfies $\Gamma(x+1) = x\,\Gamma(x)$ and $\Gamma(1) = 1$. When the parameters $\Theta_y$ have a Dirichlet distribution, we also say that $\rho(\Theta_y|\xi)$ is Dirichlet. The requirement that each $N_k'$ be greater than zero guarantees that the distribution can be normalized. Note that the exponents $N_k'$ are a function of the user's state of information $\xi$. Also note that, because the parameters sum to one, the Dirichlet distribution for $\Theta_y$ is technically a density over $\Theta_y \setminus \{\theta_{y=k}\}$ for some $k$ (the symbol $\setminus$ denotes set difference). Nonetheless, we shall write the density as shown. When $r = 2$, the Dirichlet distribution is also known as a beta distribution.

From the posterior relation above, we see that if the prior distribution of $\Theta_y$ is Dirichlet, then the posterior distribution of $\Theta_y$ given database $D$ is also Dirichlet:

$$\rho(\Theta_y|D, \xi) = c \prod_{k=1}^{r} \theta_{y=k}^{N_k' + N_k - 1}$$

where $c$ is a normalization constant. We say that the Dirichlet distribution is closed under multinomial sampling, or that the Dirichlet distribution is a conjugate family of distributions for multinomial sampling. Also, when $\Theta_y$ has a Dirichlet distribution, the expectation of $\theta_{y=k}$ (equal to the probability that $y = k$ in the next observation) has a simple expression:

$$E(\theta_{y=k}|\xi) = p(y = k|\xi) = \frac{N_k'}{N'}$$

where $N' = \sum_{k=1}^{r} N_k'$. We shall make use of these properties in our derivations.
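To illustrate the conjugacy and the expectation formula, here is a small numerical sketch (not from the paper); the state names and counts are made up for the example.

```python
# Dirichlet prior exponents N'_k for a three-state variable y (hypothetical values).
prior = {"low": 2.0, "medium": 5.0, "high": 3.0}

# Sufficient statistics N_k: how many times each state occurred in the database D.
counts = {"low": 10, "medium": 4, "high": 6}

# Conjugacy: the posterior is Dirichlet with exponents N'_k + N_k.
posterior = {k: prior[k] + counts[k] for k in prior}

# Predictive probability of the next observation: (N'_k + N_k) / (N' + N).
total = sum(posterior.values())
predictive = {k: posterior[k] / total for k in posterior}
print(predictive)  # e.g. p(y = "low" | D) = 12/30 = 0.4
```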

A survey of methods for assessing a beta distribution is given by Winkler. These methods include the direct assessment of the probability density using questions regarding relative densities and relative areas, assessment of the cumulative distribution function using fractiles, assessing the posterior means of the distribution given hypothetical evidence, and assessment in the form of an equivalent sample size. These methods can be generalized, with varying difficulty, to the non-binary case.

In our work, we find one method based on the expectation formula above particularly useful. That formula says that we can assess a Dirichlet distribution by assessing the probability distribution $p(y|\xi)$ for the next observation and the equivalent sample size $N'$. In so doing, we may rewrite the Dirichlet density as

$$\rho(\Theta_y|\xi) = c \prod_{k=1}^{r} \theta_{y=k}^{N'\, p(y=k|\xi) - 1}$$

where $c$ is a normalization constant. Assessing $p(y|\xi)$ is straightforward. Furthermore, the following two observations suggest a simple method for assessing $N'$.

One, the variance of a density for $\Theta_y$ is an indication of how much the mean of $\Theta_y$ is expected to change, given new observations. The higher the variance, the greater the expected change. It is sometimes said that the variance is a measure of a user's confidence in the mean for $\Theta_y$. The variance of the Dirichlet distribution is given by

$$\mathrm{Var}(\theta_{y=k}|\xi) = \frac{p(y = k|\xi)\,(1 - p(y = k|\xi))}{N' + 1}$$

Thus, $N'$ is a reflection of the user's confidence. Two, suppose we were initially completely ignorant about a domain; that is, our distribution $\rho(\Theta_y|\xi)$ was given by the Dirichlet density above with each exponent $N_k' = 0$. (This prior distribution cannot be normalized, and is sometimes called an improper prior. To be more precise, we should say that each exponent is equal to some number close to zero.) Suppose we then saw $N'$ cases with sufficient statistics $N_1', \ldots, N_r'$. Then, our density for $\Theta_y$ would be the Dirichlet distribution given above. Thus, we can assess $N'$ as an equivalent sample size: the number of observations we would have had to have seen, starting from complete ignorance, in order to have the same confidence in $\Theta_y$ that we actually have. This assessment approach generalizes easily to many-variable domains, and thus is useful for our work. We note that some users at first find judgments of equivalent sample size to be difficult. Our experience with such users has been that they may be made more comfortable with the method by first using some other method for assessment (e.g., fractiles) on simple scenarios, and by examining the equivalent sample sizes implied by their assessments.
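The following is a brief sketch (not from the paper) of this assessment style: the user supplies a predictive distribution $p(y|\xi)$ and an equivalent sample size $N'$, and we recover the Dirichlet exponents and the implied confidence (variance); the numbers are illustrative.

```python
def dirichlet_from_assessment(predictive, equivalent_sample_size):
    """Build Dirichlet exponents N'_k = N' * p(y = k) from an assessed
    predictive distribution and an equivalent sample size N'."""
    return {k: equivalent_sample_size * p for k, p in predictive.items()}

def dirichlet_variance(predictive, equivalent_sample_size):
    """Var(theta_k) = p(y = k)(1 - p(y = k)) / (N' + 1): smaller for larger N'."""
    return {k: p * (1.0 - p) / (equivalent_sample_size + 1.0)
            for k, p in predictive.items()}

assessed = {"present": 0.6, "absent": 0.4}      # assessed p(y | xi)
print(dirichlet_from_assessment(assessed, 10))  # {'present': 6.0, 'absent': 4.0}
print(dirichlet_variance(assessed, 10))         # 0.6*0.4/11, roughly 0.022
print(dirichlet_variance(assessed, 100))        # about ten times smaller
```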

Bayesian Metrics: Previous Work

CH, Buntine, and SDLC examine domains where all variables are discrete, and derive essentially the same Bayesian scoring metric and formula for $p(C|D, B_s^h, \xi)$, based on the same set of assumptions about the user's prior knowledge and the database. In this section, we present these assumptions and provide a derivation of $p(D, B_s^h|\xi)$ and $p(C|D, B_s^h, \xi)$.

Roughly speaking, the first assumption is that $B_s^h$ is true if and only if the database $D$ can be partitioned into a set of multinomial samples determined by the network structure $B_s$. In particular, $B_s^h$ is true if and only if, for every variable $x_i$ in $U$ and every instance of $x_i$'s parents $\Pi_i$ in $B_s$, the observations of $x_i$ in $D$ in those cases where $\Pi_i$ takes on the same instance constitute a multinomial sample. For example, consider a domain consisting of two binary variables $x$ and $y$. (We shall use this domain to illustrate many of the concepts in this paper.) There are three network structures for this domain: $x \rightarrow y$, $x \leftarrow y$, and the empty

network structure containing no arc. The hypothesis associated with the empty network structure, denoted $B_{xy}^h$, corresponds to the assertion that the database is made up of two binomial samples: (1) the observations of $x$ are a binomial sample with parameter $\theta_x$, and (2) the observations of $y$ are a binomial sample with parameter $\theta_y$.

In contrast, the hypothesis associated with the network structure $x \rightarrow y$, denoted $B_{x \rightarrow y}^h$, corresponds to the assertion that the database is made up of at most three binomial samples: (1) the observations of $x$ are a binomial sample with parameter $\theta_x$, (2) the observations of $y$ in those cases where $x$ is true (if any) are a binomial sample with parameter $\theta_{y|x}$, and (3) the observations of $y$ in those cases where $x$ is false (if any) are a binomial sample with parameter $\theta_{y|\bar{x}}$. One consequence of the second and third assertions is that $y$ in case $C_l$ is conditionally independent of the other occurrences of $y$ in $D$, given $\theta_{y|x}$, $\theta_{y|\bar{x}}$, and $x$ in case $C_l$. We can graphically represent this conditional-independence assertion using a Bayesian-network structure, as shown in Figure 4a.

Figure 4: A Bayesian-network structure for a two-binary-variable domain $\{x, y\}$, showing conditional independencies associated with (a) the multinomial-sample assumption and (b) the added assumption of parameter independence. In both figures, it is assumed that the network structure $x \rightarrow y$ is generating the database.

Finally, the hypothesis associated with the network structure $x \leftarrow y$, denoted $B_{x \leftarrow y}^h$, corresponds to the assertion that the database is made up of at most three binomial samples: one for $y$, one for $x$ given $y$ is true, and one for $x$ given $y$ is false.

Before we state this assumption for arbitrary domains, we introduce the following

notation. Given a Bayesian network $B_s$ for domain $U$, let $r_i$ be the number of states of variable $x_i$, and let $q_i = \prod_{x_l \in \Pi_i} r_l$ be the number of instances of $\Pi_i$. We use the integer $j$ to index the instances of $\Pi_i$. Thus, we write $p(x_i = k|\Pi_i = j, \xi)$ to denote the probability that $x_i = k$, given the $j$th instance of the parents of $x_i$. Let $\theta_{ijk}$ denote the multinomial parameter corresponding to the probability $p(x_i = k|\Pi_i = j, \xi)$ (with $\theta_{ijr_i} = 1 - \sum_{k=1}^{r_i - 1} \theta_{ijk}$). In addition, we define

$$\Theta_{ij} \equiv \bigcup_{k=1}^{r_i} \{\theta_{ijk}\}, \qquad \Theta_i \equiv \bigcup_{j=1}^{q_i} \{\Theta_{ij}\}, \qquad \Theta_{B_s} \equiv \bigcup_{i=1}^{n} \{\Theta_i\}$$

That is, the parameters $\Theta_{B_s}$ correspond to the probability set $B_p$ for a single-case Bayesian network. (Whenever possible, we use CH's notation.)

Assumption 1 (Multinomial Sample)  Given domain $U$ and database $D$, let $D_l$ denote the first $l - 1$ cases in the database. In addition, let $x_{il}$ and $\Pi_{il}$ denote the variable $x_i$ and the parent set $\Pi_i$ in the $l$th case, respectively. Then, for all network structures $B_s$ in $U$, there exist positive parameters $\Theta_{B_s}$ such that, for $i = 1, \ldots, n$, and for all $k, k_1, \ldots, k_{i-1}$,

$$p(x_{il} = k\,|\,x_{1l} = k_1, \ldots, x_{(i-1)l} = k_{i-1}, D_l, \Theta_{B_s}, B_s^h, \xi) = \theta_{ijk}$$

where $j$ is the instance of $\Pi_{il}$ consistent with $\{x_{1l} = k_1, \ldots, x_{(i-1)l} = k_{i-1}\}$.

There is an important implication of this assumption, which we examine later in the paper. Nonetheless, the equation above is all that we need (and all that CH, Buntine, and SDLC used) to derive a metric. Also note that the positivity requirement excludes logical relationships among variables. We can relax this requirement, although we do not do so in this paper.

The second assumption is an independence assumption.

Assumption 2 (Parameter Independence)  Given network structure $B_s$, if $p(B_s^h|\xi) > 0$, then

(a) $\rho(\Theta_{B_s}|B_s^h, \xi) = \prod_{i=1}^{n} \rho(\Theta_i|B_s^h, \xi)$, and

(b) for $i = 1, \ldots, n$: $\rho(\Theta_i|B_s^h, \xi) = \prod_{j=1}^{q_i} \rho(\Theta_{ij}|B_s^h, \xi)$.

Assumption 2a says that the parameters associated with each variable in a network structure are independent. We call this assumption global parameter independence, after Spiegelhalter and Lauritzen. Assumption 2b says that the parameters associated with each

instance of the parents of a variable are independent. We call this assumption local parameter independence, again after Spiegelhalter and Lauritzen. We refer to the combination of these assumptions simply as parameter independence. The assumption of parameter independence for our two-binary-variable domain is shown in the Bayesian-network structure of Figure 4b.

As we shall see, Assumption 2 greatly simplifies the computation of $p(D, B_s^h|\xi)$. The assumption is reasonable for some domains, but not for others. Later in the paper, we describe a simple characterization of the assumption that provides a test for deciding whether the assumption is reasonable in a given domain.

The third assumption was also made to simplify computations.

Assumption 3 (Parameter Modularity)  Given two network structures $B_{s1}$ and $B_{s2}$ such that $p(B_{s1}^h|\xi) > 0$ and $p(B_{s2}^h|\xi) > 0$, if $x_i$ has the same parents in $B_{s1}$ and $B_{s2}$, then

$$\rho(\Theta_{ij}|B_{s1}^h, \xi) = \rho(\Theta_{ij}|B_{s2}^h, \xi), \qquad j = 1, \ldots, q_i$$

We call this property parameter modularity, because it says that the densities for parameters $\Theta_{ij}$ depend only on the structure of the network that is local to variable $x_i$; namely, $\Theta_{ij}$ only depends on $x_i$ and its parents. For example, consider the network structure $x \rightarrow y$ and the empty structure for our two-variable domain. In both structures, $x$ has the same set of parents (the empty set). Consequently, by parameter modularity, $\rho(\theta_x|B_{x \rightarrow y}^h, \xi) = \rho(\theta_x|B_{xy}^h, \xi)$. We note that CH, Buntine, and SDLC implicitly make the assumption of parameter modularity (Cooper and Herskovits; Buntine; Spiegelhalter et al.).

The fourth assumption restricts each parameter set $\Theta_{ij}$ to have a Dirichlet distribution.

Assumption 4 (Dirichlet)  Given a network structure $B_s$ such that $p(B_s^h|\xi) > 0$, $\rho(\Theta_{ij}|B_s^h, \xi)$ is Dirichlet for all $\Theta_{ij} \subseteq \Theta_{B_s}$. That is, there exist exponents $N_{ijk}'$, which depend on $B_s^h$ and $\xi$, that satisfy

$$\rho(\Theta_{ij}|B_s^h, \xi) = c \prod_{k} \theta_{ijk}^{N_{ijk}' - 1}$$

where $c$ is a normalization constant.

When every parameter set of $B_s$ has a Dirichlet distribution, we simply say that $\rho(\Theta_{B_s}|B_s^h, \xi)$ is Dirichlet. Note that, by the assumption of parameter modularity, we do not require Dirichlet exponents for every network structure $B_s$. Rather, we require exponents only for every node and for every possible parent set of each node.

Assumptions 1 through 4 are assumptions about the domain. Given Assumption 1, we can compute $p(D|\Theta_{B_s}, B_s^h, \xi)$ as a function of $\Theta_{B_s}$ for any given database (see the likelihood below). Also, as we show later in this section, these assumptions determine $\rho(\Theta_{B_s}|B_s^h, \xi)$ for every network structure $B_s$. Thus, from the relation

$$p(D, B_s^h|\xi) = p(B_s^h|\xi) \int p(D|\Theta_{B_s}, B_s^h, \xi)\; \rho(\Theta_{B_s}|B_s^h, \xi)\; d\Theta_{B_s}$$

these assumptions, in conjunction with the prior probabilities of network structures $p(B_s^h|\xi)$, form a complete representation of the user's prior knowledge for purposes of computing $p(D, B_s^h|\xi)$. By a similar argument, we can show that Assumptions 1 through 4 also determine the probability distribution $p(C|D, B_s^h, \xi)$ for any given database and network structure.

In contrast, the fifth assumption is an assumption about the database.

Assumption 5 (Complete Data)  The database is complete. That is, it contains no missing data.

This assumption was made in order to compute $p(D, B_s^h|\xi)$ and $p(C|D, B_s^h, \xi)$ in closed form. In this paper, we concentrate on complete databases for the same reason. Nonetheless, the reader should recognize that, given Assumptions 1 through 4, these probabilities can be computed (in principle) for any complete or incomplete database. In practice, these probabilities can be approximated for incomplete databases by well-known statistical methods. Such methods include filling in missing data based on the data that is present (Titterington; Spiegelhalter and Lauritzen), the EM algorithm (Dempster et al.), and Gibbs sampling, i.e., Markov chain Monte Carlo methods (York; Madigan and Raftery).

Let us now explore the consequences of these assumptions. First, from the multinomial-sample assumption and the assumption of no missing data, we obtain

$$p(C_l|D_l, \Theta_{B_s}, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{1_{ijkl}}$$

where $1_{ijkl} = 1$ if $x_i = k$ and $\Pi_i = j$ in case $C_l$, and $1_{ijkl} = 0$ otherwise. Thus, if we let $N_{ijk}$ be the number of cases in database $D$ in which $x_i = k$ and $\Pi_i = j$, we have

$$p(D|\Theta_{B_s}, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$$
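The counts $N_{ijk}$ are the only contribution the data makes to the computations that follow. As an illustration (not from the paper), here is a small sketch that tabulates them for a structure given as a parent-set dictionary and a database of complete cases; the representation is assumed for the example.

```python
from collections import Counter

def sufficient_statistics(parents, cases):
    """Count N[(i, j, k)]: the number of cases in which variable i is in state k
    and i's parents are in joint instance j (encoded as a tuple of parent states)."""
    counts = Counter()
    for case in cases:                                   # each case: variable -> state
        for i, pa in parents.items():
            j = tuple(case[p] for p in sorted(pa))       # instance of Pi_i in this case
            counts[(i, j, case[i])] += 1
    return counts

# Two-binary-variable example with structure x -> y.
parents = {"x": set(), "y": {"x"}}
cases = [{"x": 0, "y": 1}, {"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 1, "y": 1}]
print(sufficient_statistics(parents, cases))
# e.g. {('x', (), 0): 2, ('x', (), 1): 2, ('y', (0,), 1): 2, ('y', (1,), 0): 1, ('y', (1,), 1): 1}
```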

From this result, it follows that the parameters $\Theta_{B_s}$ remain independent given database $D$, a property we call posterior parameter independence. In particular, from the assumption of parameter independence, we have

$$\rho(\Theta_{B_s}|D, B_s^h, \xi) = c\; p(D|\Theta_{B_s}, B_s^h, \xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \rho(\Theta_{ij}|B_s^h, \xi)$$

where $c$ is some normalization constant. Combining this expression with the likelihood above, we obtain

$$\rho(\Theta_{B_s}|D, B_s^h, \xi) = c \prod_{i=1}^{n} \prod_{j=1}^{q_i} \rho(\Theta_{ij}|B_s^h, \xi) \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$$

and posterior parameter independence follows. We note that, by this equation and the assumption of parameter modularity, parameters remain modular a posteriori as well.

Given these basic relations, we can derive a metric and a formula for $p(C|D, B_s^h, \xi)$. From the rules of probability, we have

$$p(D|B_s^h, \xi) = \prod_{l=1}^{m} p(C_l|D_l, B_s^h, \xi)$$

From this equation, we see that the Bayesian scoring metric can be viewed as a form of cross validation, where, rather than use $D \setminus \{C_l\}$ to predict $C_l$, we use only cases $C_1, \ldots, C_{l-1}$ to predict $C_l$.

Conditioning on the parameters of the network structure $B_s$, we obtain

$$p(C_l|D_l, B_s^h, \xi) = \int p(C_l|D_l, \Theta_{B_s}, B_s^h, \xi)\; \rho(\Theta_{B_s}|D_l, B_s^h, \xi)\; d\Theta_{B_s}$$

Using the case likelihood above and posterior parameter independence to rewrite the first and second terms in the integral, respectively, and interchanging integrals with products, we get

$$p(C_l|D_l, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \int \prod_{k=1}^{r_i} \theta_{ijk}^{1_{ijkl}}\; \rho(\Theta_{ij}|D_l, B_s^h, \xi)\; d\Theta_{ij}$$

When $1_{ijkl} = 1$, the integral is the expectation of $\theta_{ijk}$ with respect to the density $\rho(\Theta_{ij}|D_l, B_s^h, \xi)$. Consequently, we have

$$p(C_l|D_l, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} E(\theta_{ijk}|D_l, B_s^h, \xi)^{1_{ijkl}}$$

To compute $p(C|D, B_s^h, \xi)$, we set $l = m + 1$ and interpret $C_{m+1}$ to be $C$. To compute $p(D|B_s^h, \xi)$, we combine the two previous product formulas and rearrange products, obtaining

$$p(D|B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \prod_{l=1}^{m} E(\theta_{ijk}|C_1, \ldots, C_{l-1}, B_s^h, \xi)^{1_{ijkl}}$$

Thus, all that remains is to determine the expectations in the formulas above. Given the Dirichlet assumption (Assumption 4), this evaluation is straightforward. Combining the Dirichlet assumption with the posterior-independence relation above, we obtain

$$\rho(\Theta_{ij}|D, B_s^h, \xi) = c \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}' + N_{ijk} - 1}$$

where $c$ is another normalization constant. Note that the counts $N_{ijk}$ are a sufficient statistic for the database. Also, as we discussed in the previous section, the Dirichlet distributions are conjugate for the database: the posterior distribution of each parameter set $\Theta_{ij}$ remains in the Dirichlet family. Thus, applying the Dirichlet expectation formula to the predictive expression above, with $l = m + 1$, $C_{m+1} = C$, and $D_{m+1} = D$, we obtain

$$p(C_{m+1}|D, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \left( \frac{N_{ijk}' + N_{ijk}}{N_{ij}' + N_{ij}} \right)^{1_{ijk(m+1)}}$$

where

$$N_{ij}' = \sum_{k=1}^{r_i} N_{ijk}', \qquad N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$$
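As an illustration (not from the paper), this next-case probability can be computed directly from the prior exponents and the counts; the helper below assumes the count representation used in the earlier sketches and is otherwise hypothetical.

```python
def probability_of_next_case(case, parents, prior_exponents, counts):
    """p(C | D, B^h): product over nodes of (N'_ijk + N_ijk) / (N'_ij + N_ij),
    where j is the parent instance and k the state that occur in the new case."""
    prob = 1.0
    for i, pa in parents.items():
        j = tuple(case[p] for p in sorted(pa))
        k = case[i]
        states = {s for (v, inst, s) in prior_exponents if v == i and inst == j}
        num = prior_exponents[(i, j, k)] + counts.get((i, j, k), 0)
        den = sum(prior_exponents[(i, j, s)] + counts.get((i, j, s), 0) for s in states)
        prob *= num / den
    return prob

# Continuing the x -> y example, with uniform prior exponents N'_ijk = 1 (hypothetical).
prior = {("x", (), 0): 1, ("x", (), 1): 1,
         ("y", (0,), 0): 1, ("y", (0,), 1): 1,
         ("y", (1,), 0): 1, ("y", (1,), 1): 1}
counts = {("x", (), 0): 2, ("x", (), 1): 2,
          ("y", (0,), 1): 2, ("y", (1,), 0): 1, ("y", (1,), 1): 1}
print(probability_of_next_case({"x": 0, "y": 1}, {"x": set(), "y": {"x"}}, prior, counts))
# x: (1+2)/(2+4) = 0.5; y given x=0: (1+2)/(2+2) = 0.75; product = 0.375
```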

Similarly, from the expression for $p(D|B_s^h, \xi)$ above, we obtain the scoring metric (the expectations over successive cases combine into the ratios of Gamma functions shown):

$$p(D, B_s^h|\xi) = p(B_s^h|\xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N_{ij}')}{\Gamma(N_{ij}' + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk}' + N_{ijk})}{\Gamma(N_{ijk}')}$$

We call this expression the BD (Bayesian Dirichlet) metric.

As is apparent from this formula, the exponents $N_{ijk}'$, in conjunction with $p(B_s^h|\xi)$, completely specify a user's current knowledge about the domain for purposes of learning network structures. Unfortunately, the specification of $N_{ijk}'$ for all possible variable-parent configurations and for all values of $i$, $j$, and $k$ is formidable, to say the least. CH suggest a simple uninformative assignment $N_{ijk}' = 1$. We shall refer to this special case of the BD metric as the K2 metric. Buntine suggests the uninformative assignment $N_{ijk}' = N'/(r_i q_i)$. We shall examine this special case again later in the paper, where we also address the assessment of the priors on network structure, $p(B_s^h|\xi)$.
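To make the metric concrete, here is a small sketch (not from the paper) of the structure-dependent part of the BD score in log form, computed from prior exponents and counts with math.lgamma; with all exponents set to 1 it reduces to the K2 special case mentioned above. The data layout follows the earlier sketches and is an assumption of the example.

```python
from math import lgamma
from collections import defaultdict

def log_bd_score(prior_exponents, counts):
    """log of prod_ij [Gamma(N'_ij)/Gamma(N'_ij+N_ij)] prod_k [Gamma(N'_ijk+N_ijk)/Gamma(N'_ijk)].

    prior_exponents: dict (i, j, k) -> N'_ijk (must cover every state k of each family)
    counts:          dict (i, j, k) -> N_ijk  (missing entries mean a count of zero)
    """
    family_prior = defaultdict(float)   # N'_ij
    family_count = defaultdict(float)   # N_ij
    score = 0.0
    for (i, j, k), n_prime in prior_exponents.items():
        n = counts.get((i, j, k), 0)
        family_prior[(i, j)] += n_prime
        family_count[(i, j)] += n
        score += lgamma(n_prime + n) - lgamma(n_prime)
    for ij in family_prior:
        score += lgamma(family_prior[ij]) - lgamma(family_prior[ij] + family_count[ij])
    return score

# K2-style exponents (N'_ijk = 1) for the x -> y example used earlier.
prior = {("x", (), 0): 1.0, ("x", (), 1): 1.0,
         ("y", (0,), 0): 1.0, ("y", (0,), 1): 1.0,
         ("y", (1,), 0): 1.0, ("y", (1,), 1): 1.0}
counts = {("x", (), 0): 2, ("x", (), 1): 2,
          ("y", (0,), 1): 2, ("y", (1,), 0): 1, ("y", (1,), 1): 1}
print(log_bd_score(prior, counts))   # relative log score for the structure x -> y
```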

Acausal networks, causal networks, and likelihood equivalence

In this section, we examine another assumption for learning Bayesian networks that has been previously overlooked.

Before we do so, it is important to distinguish between acausal and causal Bayesian networks. Although Bayesian networks have been formally described as a representation of conditional independence, as we noted in the Background section, people often construct them using notions of cause and effect. Recently, several researchers have begun to explore a formal causal semantics for Bayesian networks (e.g., Pearl and Verma; Pearl; Spirtes et al.; Druzdzel and Simon; and Heckerman and Shachter). They argue that the representation of causal knowledge is important not only for assessment, but for prediction as well. In particular, they argue that causal knowledge, unlike statistical knowledge, allows one to derive beliefs about a domain after intervention. For example, most of us believe that smoking causes lung cancer. From this knowledge, we infer that if we stop smoking, then we decrease our chances of getting lung cancer. In contrast, if we knew only that there was a statistical correlation between smoking and lung cancer, then we could not make this inference. The formal semantics of cause and effect proposed by these researchers is not important for this discussion. The interested reader should consult the references given.

First, let us consider acausal networks. Recall our assumption that the hypothesis $B_s^h$ is true if and only if the database $D$ is a collection of multinomial samples determined by the network structure $B_s$. This assumption is equivalent to saying that (1) the database $D$ is a multinomial sample from the joint space of $U$ with parameters $\Theta_U$, and (2) the hypothesis $B_s^h$ is true if and only if the parameters $\Theta_U$ satisfy the conditional-independence assertions of $B_s$. We can think of condition (2) as a definition of the hypothesis $B_s^h$.

For example, in our two-binary-variable domain, regardless of which hypothesis is true, we may assert that the database is a multinomial sample from the joint space $U = \{x, y\}$ with parameters $\Theta_U = \{\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}, \theta_{\bar{x}\bar{y}}\}$. Furthermore, given the hypothesis $B_{x \rightarrow y}^h$, for example, we know that the parameters $\Theta_U$ are unconstrained (except that they must sum to one), because the network structure $x \rightarrow y$ represents no assertions of conditional independence. In contrast, given the hypothesis $B_{xy}^h$, we know that the parameters $\Theta_U$ must satisfy the independence constraints $\theta_{xy} = (\theta_{xy} + \theta_{x\bar{y}})(\theta_{xy} + \theta_{\bar{x}y})$, and so on.

Given this definition of $B_s^h$ for acausal Bayesian networks, it follows that if two network structures $B_{s1}$ and $B_{s2}$ are equivalent, then $B_{s1}^h = B_{s2}^h$. For example, in our two-variable domain, both the hypotheses $B_{x \rightarrow y}^h$ and $B_{x \leftarrow y}^h$ assert that there are no constraints on the parameters $\Theta_U$. Consequently, we have $B_{x \rightarrow y}^h = B_{x \leftarrow y}^h$. In general, we call this property hypothesis equivalence.

(One technical flaw with this definition of $B_s^h$ is that hypotheses are not mutually exclusive. For example, in our two-variable domain, the hypotheses $B_{x \rightarrow y}^h$ and $B_{xy}^h$ both include the possibility $\theta_y = \theta_{y|x}$. This flaw is potentially troublesome because mutual exclusivity is important for our Bayesian interpretation of network learning. Nonetheless, because the densities $\rho(\Theta_{B_s}|B_s^h, \xi)$ must be integrable and hence bounded, the overlap of hypotheses will be of measure zero, and we may use the averaging relation above without modification. For example, in our two-binary-variable domain, given the hypothesis $B_{x \rightarrow y}^h$, the probability that $B_{xy}^h$ is true, i.e., that $\theta_y = \theta_{y|x}$, has measure zero.)

In light of this property, we should associate each hypothesis with an equivalence class of structures rather than a single network structure. Also, given the property of hypothesis equivalence, we have prior equivalence: if network structures $B_{s1}$ and $B_{s2}$ are equivalent, then $p(B_{s1}^h|\xi) = p(B_{s2}^h|\xi)$; likelihood equivalence: if $B_{s1}$ and $B_{s2}$ are equivalent, then for all databases $D$, $p(D|B_{s1}^h, \xi) = p(D|B_{s2}^h, \xi)$; and score equivalence: if $B_{s1}$ and $B_{s2}$ are equivalent, then $p(D, B_{s1}^h|\xi) = p(D, B_{s2}^h|\xi)$.

Now let us consider causal networks. For these networks, the assumption of hypothesis equivalence is unreasonable. In particular, for causal networks, we must modify the definition of $B_s^h$ to include the assertion that each nonroot node in $B_s$ is a direct causal effect of its parents. For example, in our two-variable domain, the causal networks $x \rightarrow y$ and $x \leftarrow y$ represent the same constraints on $\Theta_U$ (i.e., none), but the former also asserts that $x$ causes $y$, whereas the latter asserts that $y$ causes $x$. Thus, the hypotheses $B_{x \rightarrow y}^h$ and $B_{x \leftarrow y}^h$ are not equal. Indeed, it is reasonable to assume that these hypotheses, and the hypotheses associated with any two different causal-network structures, are mutually exclusive.

Nonetheless, for many real-world problems that we have encountered, we have found it reasonable to assume likelihood equivalence. That is, we have found it reasonable to assume that data cannot distinguish between equivalent network structures. Of course, for any given problem, it is up to the decision maker to assume likelihood equivalence or not. In a later section, we describe a characterization of likelihood equivalence that suggests a simple procedure for deciding whether the assumption is reasonable in a given domain.

Because the assumption of likelihood equivalence is appropriate for learning acausal networks in all domains and for learning causal networks in many domains, we adopt this assumption in our remaining treatment of scoring metrics. As we have stated it, likelihood equivalence says that, for any database $D$, the probability of $D$ is the same given hypotheses corresponding to any two equivalent network structures. From our discussion of multinomial sampling, however, we may also state likelihood equivalence in terms of $\Theta_U$.

Assumption 6 (Likelihood Equivalence)  Given two network structures $B_{s1}$ and $B_{s2}$ such that $p(B_{s1}^h|\xi) > 0$ and $p(B_{s2}^h|\xi) > 0$, if $B_{s1}$ and $B_{s2}$ are equivalent, then $\rho(\Theta_U|B_{s1}^h, \xi) = \rho(\Theta_U|B_{s2}^h, \xi)$.

(Using the same convention as for the Dirichlet distribution, we write $\rho(\Theta_U|B_s^h, \xi)$ to denote a density over a set of the nonredundant parameters in $\Theta_U$.)

We close this section with a few additional remarks about inferring causal relationships.

Given the distinction between statistical and causal dependence, it would seem impossible to learn causal networks from data produced by observation alone. For example, consider the simple three-variable domain $U = \{x_1, x_2, x_3\}$. If we find, through the observation of data, that the network structure $x_1 \rightarrow x_3 \leftarrow x_2$ is very likely, then we cannot conclude that $x_1$ and $x_2$ are causes for $x_3$. Rather, it may be the case that there is a hidden common cause of $x_1$ and $x_3$ as well as a hidden common cause of $x_2$ and $x_3$. If, however, we assume that every statistical association derives from causal interaction and that there are no hidden common causes, then we can interpret learned networks as causal networks. In our example, under these assumptions, we can infer that $x_1$ and $x_2$ are causes for $x_3$. (We note that, in some circumstances, we can identify causes and effects from network structure even when there are hidden common causes; see Pearl for a discussion.)

Under the assumption of likelihood equivalence, the ratio of posterior probabilities of two equivalent network structures must be equal to the ratio of their prior probabilities. Consequently, if the priors on network structures are not too different, then typically learning will produce many equivalent network structures, each having a large relative posterior probability. Furthermore, even for domains where the assumption of likelihood equivalence does not hold, there is a good chance that more than one hypothesis will have a large relative posterior probability. In such situations, we find it reasonable to average the causal assertions contained in individual learned networks. For example, in our three-variable domain, let us suppose that the data supports only the network structure $x_1 \rightarrow x_2 \rightarrow x_3$ and its equivalent cousins $x_1 \leftarrow x_2 \rightarrow x_3$ and $x_1 \leftarrow x_2 \leftarrow x_3$. If each of the hypotheses corresponding to these structures has the same prior probability, then the posterior probability of each hypothesis will be 1/3, and we infer that the proposition that $x_2$ causes $x_3$ has probability 2/3. Under these same conditions, the proposition that both $x_1$ and $x_2$ are causes of $x_3$ has probability 1/3.

The BDe Metric

The assumption of likelihood equivalence, when combined with the previous assumptions, introduces constraints on the Dirichlet exponents $N_{ijk}'$. The result is a likelihood-equivalent specialization of the BD metric, which we call the BDe metric. In this section, we derive this metric. In addition, we show that, as a consequence of the exponent constraints, the user may construct an informative prior for the parameters of all network structures merely by building a Bayesian network for the next case to be seen and by assessing an equivalent sample size. Most remarkable, we show that the Dirichlet assumption (Assumption 4) is not needed to obtain the BDe metric.

Informative Priors

In this section, we show how the added assumption of likelihood equivalence simplifies the construction of informative priors.

Before we do so, we need to define the concept of a complete network structure. A complete network structure is one that has no missing edges; that is, it encodes no assertions of conditional independence. In a domain with $n$ variables, there are $n!$ complete network structures. An important property of complete network structures is that all such structures for a given domain are equivalent.

Now, for a given domain $U$, suppose we have assessed the density $\rho(\Theta_U|B_{sc}^h, \xi)$, where $B_{sc}$ is some complete network structure for $U$. Given parameter independence, parameter modularity, likelihood equivalence, and one additional assumption, it turns out that we can compute the prior $\rho(\Theta_{B_s}|B_s^h, \xi)$ for any network structure $B_s$ in $U$ from the given density.

To see how this computation is done, consider again our two-binary-variable domain. Suppose we are given a density for the parameters of the joint space, $\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$. From this density, we construct the parameter densities for each of the three network structures in the domain. First, consider the network structure $x \rightarrow y$. A parameter set for this network structure is $\{\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}\}$. These parameters are related to the parameters of the joint space by the following relations:

$$\theta_{xy} = \theta_x\, \theta_{y|x}, \qquad \theta_{x\bar{y}} = \theta_x (1 - \theta_{y|x}), \qquad \theta_{\bar{x}y} = (1 - \theta_x)\, \theta_{y|\bar{x}}$$

Thus, we may obtain $\rho(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}|B_{x \rightarrow y}^h, \xi)$ from the given density by changing variables:

$$\rho(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}|B_{x \rightarrow y}^h, \xi) = J_{x \rightarrow y}\; \rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$$

where $J_{x \rightarrow y}$ is the Jacobian of the transformation:

$$J_{x \rightarrow y} = \left| \frac{\partial(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y})}{\partial(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}})} \right| = \theta_x (1 - \theta_x)$$

The Jacobian $J_{B_{sc}}$ for the transformation from $\Theta_U$ to $\Theta_{B_{sc}}$, where $B_{sc}$ is an arbitrary complete network structure, is given in the Appendix.

Next, consider the network structure $x \leftarrow y$. Assuming that the hypothesis $B_{x \leftarrow y}^h$ is also possible, we obtain $\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \leftarrow y}^h, \xi) = \rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$ by likelihood equivalence. Therefore, we can compute the density for the network structure $x \leftarrow y$ using the Jacobian $J_{x \leftarrow y} = \theta_y (1 - \theta_y)$.

Finally, consider the empty network structure. Given the assumption of parameter independence, we may obtain the densities $\rho(\theta_x|B_{xy}^h, \xi)$ and $\rho(\theta_y|B_{xy}^h, \xi)$ separately.

To obtain the density for $\theta_x$, we first extract $\rho(\theta_x|B_{x \rightarrow y}^h, \xi)$ from the density for the network structure $x \rightarrow y$. This extraction is straightforward because, by parameter independence, the parameters for $x \rightarrow y$ must be independent. Then, we use parameter modularity, which says that $\rho(\theta_x|B_{xy}^h, \xi) = \rho(\theta_x|B_{x \rightarrow y}^h, \xi)$. To obtain the density for $\theta_y$, we extract $\rho(\theta_y|B_{x \leftarrow y}^h, \xi)$ from the density for the network structure $x \leftarrow y$, and again apply parameter modularity. The approach is summarized in Figure 5.

Figure 5: A computation of the parameter densities for the three network structures of the two-binary-variable domain $\{x, y\}$. The approach computes the densities from $\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$ using likelihood equivalence, parameter independence, and parameter modularity.

In this construction, it is important that both hypotheses $B_{x \rightarrow y}^h$ and $B_{x \leftarrow y}^h$ have nonzero prior probabilities, lest we could not make use of likelihood equivalence to obtain the parameter densities for the empty structure. In order to take advantage of likelihood equivalence in general, we adopt the following assumption.

Assumption 7 (Structure Possibility)  Given a domain $U$, $p(B_{sc}^h|\xi) > 0$ for all complete network structures $B_{sc}$.

Note that, in the context of acausal Bayesian networks, there is only one hypothesis corresponding to the equivalence class of complete network structures. In this case, Assumption 7 says that this single hypothesis is possible. In the context of causal Bayesian networks, the assumption implies that each of the $n!$ complete network structures is possible. Although we make the assumption of structure possibility as a matter of convenience, we have found it to be reasonable in many real-world network-learning problems.

Given this assumption, we can now describe our construction method in general.

Theorem 2  Given domain $U$ and a probability density $\rho(\Theta_U|B_{sc}^h, \xi)$, where $B_{sc}$ is some complete network structure for $U$, the assumptions of parameter independence (Assumption 2), parameter modularity (Assumption 3), likelihood equivalence (Assumption 6), and structure possibility (Assumption 7) uniquely determine $\rho(\Theta_{B_s}|B_s^h, \xi)$ for any network structure $B_s$ in $U$.

Proof: Consider any $B_s$. By Assumption 2, if we determine $\rho(\Theta_{ij}|B_s^h, \xi)$ for every parameter set $\Theta_{ij}$ associated with $B_s$, then we determine $\rho(\Theta_{B_s}|B_s^h, \xi)$. So, consider a particular $\Theta_{ij}$. Let $\Pi_i$ be the parents of $x_i$ in $B_s$, and let $B_{sc1}$ be a complete belief-network structure with variable ordering $\Pi_i$, $x_i$, followed by the remaining variables. First, using Assumption 7, we recognize that the hypothesis $B_{sc1}^h$ is possible. Consequently, we use Assumption 6 to obtain $\rho(\Theta_U|B_{sc1}^h, \xi) = \rho(\Theta_U|B_{sc}^h, \xi)$. Next, we change variables from $\Theta_U$ to $\Theta_{B_{sc1}}$, yielding $\rho(\Theta_{B_{sc1}}|B_{sc1}^h, \xi)$. Using parameter independence, we then extract the density $\rho(\Theta_{ij}|B_{sc1}^h, \xi)$ from $\rho(\Theta_{B_{sc1}}|B_{sc1}^h, \xi)$. Finally, because $x_i$ has the same parents in $B_s$ and $B_{sc1}$, we apply parameter modularity to obtain the desired density $\rho(\Theta_{ij}|B_s^h, \xi) = \rho(\Theta_{ij}|B_{sc1}^h, \xi)$. To show uniqueness, we note that the only freedom we have in choosing $B_{sc1}$ is that the parents of $x_i$ can be shuffled with one another, and nodes following $x_i$ in the ordering can be shuffled with one another. The Jacobian of the change of variables from $\Theta_U$ to $\Theta_{B_{sc1}}$ has the same terms in $\Theta_{ij}$ regardless of our choice.

Consistency and the BDe Metric

In our procedure for generating priors, we cannot use an arbitrary density $\rho(\Theta_U|B_{sc}^h, \xi)$. In our two-variable domain, for example, suppose we use the density

$$\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi) = \frac{c}{(\theta_{xy} + \theta_{x\bar{y}})(\theta_{\bar{x}y} + \theta_{\bar{x}\bar{y}})} = \frac{c}{\theta_x (1 - \theta_x)}$$

where $c$ is a normalization constant. Then, using the change of variables described in the previous section, we obtain

$$\rho(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}|B_{x \rightarrow y}^h, \xi) = c$$

for the network structure $x \rightarrow y$, which satisfies parameter independence and the Dirichlet assumption. For the network structure $x \leftarrow y$, however, we have

$$\rho(\theta_y, \theta_{x|y}, \theta_{x|\bar{y}}|B_{x \leftarrow y}^h, \xi) = \frac{c\, \theta_y (1 - \theta_y)}{\theta_x (1 - \theta_x)} = \frac{c\, \theta_y (1 - \theta_y)}{\left[\theta_y \theta_{x|y} + (1 - \theta_y)\theta_{x|\bar{y}}\right]\left[\theta_y (1 - \theta_{x|y}) + (1 - \theta_y)(1 - \theta_{x|\bar{y}})\right]}$$

This density satisfies neither parameter independence nor the Dirichlet assumption.

In general, if we do not choose $\rho(\Theta_U|B_{sc}^h, \xi)$ carefully, we may not satisfy both parameter independence and the Dirichlet assumption. Indeed, the question arises: Is there any

  • Learning Bayesian Networks MSRTR

    choice for U jBhsc that is consistent with these assumptions& The following theorem

    and corollary answers this question in the a"rmative In the remainder of Section

    we require additional notation We use XkX jYkY

    to denote the multinomial parameter

    corresponding to the probability pX kX jY kY X and Y may be single variables

    and kX and kY are often implicit Also we use XjYkY to denote the set of multinomial

    parameters corresponding to the probability distribution pX jY kY and XjY to

    denote the parameters XjYkY

    for all instances of kY When Y is empty we omit the

    conditioning bar

    Theorem Given a domain U fx xng with multinomial parameters U if the

    density U j is Dirichletthat is if

    U j c Y

    xxn

    x xn N xxn

    then for any complete network structure Bsc in U the density Bscj is Dirichlet and

    satises parameter independence In particular

    Bscj c nYi

    Yxxi

    xijxxi N xijx xi

    where

    N xijxxi X

    xixn

    N xxn

Proof: Let B_sc be any complete network structure for U. Reorder the variables in U so that the ordering matches this structure, and relabel the variables x_1, ..., x_n. Now change variables from Θ_{x_1,...,x_n} to Θ_Bsc, using the Jacobian for this transformation given earlier. The dimension of this transformation is (∏_{i=1}^n r_i) − 1, where r_i is the number of instances of x_i. Substituting the relationship θ_{x_1,...,x_n} = ∏_{i=1}^n θ_{x_i | x_1,...,x_{i-1}} and multiplying by the Jacobian, we obtain

    \rho(\Theta_{B_{sc}} \mid \xi) = c \left[ \prod_{x_1, \ldots, x_n} \left( \prod_{i=1}^{n} \theta_{x_i \mid x_1, \ldots, x_{i-1}} \right)^{N'_{x_1, \ldots, x_n} - 1} \right] \left[ \prod_{i=1}^{n} \prod_{x_1, \ldots, x_i} \theta_{x_i \mid x_1, \ldots, x_{i-1}}^{\left(\prod_{j=i+1}^{n} r_j\right) - 1} \right]

which implies the first equation of the theorem. Collecting the powers of θ_{x_i | x_1,...,x_{i-1}} and using ∏_{j=i+1}^n r_j = Σ_{x_{i+1},...,x_n} 1, we obtain the second. □

Corollary. Let U be a domain with multinomial parameters Θ_U, and let B_sc be a complete network structure for U such that p(B^h_sc | ξ) > 0. If ρ(Θ_U | B^h_sc, ξ) is Dirichlet, then ρ(Θ_Bsc | B^h_sc, ξ) is Dirichlet and satisfies parameter independence.
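The following short simulation is our own illustrative sketch (not part of the original analysis) of the preceding theorem and corollary for a two-binary-variable domain: sampling the joint parameters from an assumed Dirichlet and transforming to the parameters of the complete structure x → y yields mutually independent, Beta-distributed components. The exponent values below are arbitrary choices made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed joint Dirichlet exponents N'_{xy}, N'_{x~y}, N'_{~xy}, N'_{~x~y}.
alpha = np.array([3.0, 1.0, 2.0, 4.0])
joint = rng.dirichlet(alpha, size=100_000)       # columns: xy, x~y, ~xy, ~x~y

# Change of variables to the parameters of the complete structure x -> y.
theta_x       = joint[:, 0] + joint[:, 1]
theta_y_x     = joint[:, 0] / theta_x            # theta_{y|x}
theta_y_not_x = joint[:, 2] / (joint[:, 2] + joint[:, 3])

# Parameter independence: the sample correlation matrix is close to identity.
samples = np.column_stack([theta_x, theta_y_x, theta_y_not_x])
print(np.corrcoef(samples, rowvar=False).round(3))

# Dirichlet marginals: e.g., theta_x ~ Beta(N'_x, N'_~x) with N'_x = 3 + 1.
print(theta_x.mean(), 4 / (4 + 6))               # both approximately 0.4
```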


Given these results, we can compute the Dirichlet exponents N'_ijk using a Dirichlet distribution for ρ(Θ_U | B^h_sc, ξ) in conjunction with our method for constructing priors described at the beginning of this section. Namely, suppose we desire the exponent N'_ijk for a network structure where x_i has parents Π_i. Let B'_sc be a complete network structure where x_i has these parents. By likelihood equivalence, we have ρ(Θ_U | B'^h_sc, ξ) = ρ(Θ_U | B^h_sc, ξ). As we discussed earlier, we may write the exponents for ρ(Θ_U | B^h_sc, ξ) as follows:

    N'_{x_1, \ldots, x_n} = N' \, p(x_1, \ldots, x_n \mid B^h_{sc}, \xi)

where N' is the user's equivalent sample size for ρ(Θ_U | B^h_sc, ξ). Furthermore, by definition, N'_ijk is the Dirichlet exponent for θ_ijk in B'_sc. Consequently, combining these relations, we have

    N'_{ijk} = N' \, p(x_i = k, \Pi_i = j \mid B^h_{sc}, \xi)

We call the BD metric with this restriction on the exponents the BDe metric ("e" for likelihood equivalence). To summarize, we have the following theorem.

Theorem (BDe Metric). Given domain U, suppose that ρ(Θ_U | B^h_sc, ξ) is Dirichlet with equivalent sample size N' for some complete network structure B_sc in U. Then, for any network structure B_s in U, the assumptions described in the previous sections imply

    p(D, B_s^h \mid \xi) = p(B_s^h \mid \xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}

where

    N'_{ijk} = N' \, p(x_i = k, \Pi_i = j \mid B^h_{sc}, \xi)
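As an illustration of how the metric can be computed in practice, the following is a minimal sketch (our own code, not the authors' implementation). It assumes complete discrete data and, for simplicity, takes the prior joint distribution p(U | B^h_sc, ξ) as an explicit table over full instantiations; in practice that distribution would be obtained by inference in the user's prior network. All function and variable names are hypothetical.

```python
from itertools import product
from math import lgamma

def bde_log_likelihood(data, parents, cardinality, prior_joint, n_prime):
    """Return log p(D | B_s^h, xi) for the structure given by `parents`.

    data        : list of complete cases, each a dict {variable: state index}
    parents     : dict {variable: tuple of parent variables}
    cardinality : dict {variable: number of states}; states are 0 .. r_i - 1
    prior_joint : dict {tuple of states ordered by sorted variable name: probability}
                  (assumed strictly positive for every full instantiation)
    n_prime     : equivalent sample size N'
    """
    variables = sorted(cardinality)
    log_score = 0.0
    for x_i, pa in parents.items():
        for j in product(*[range(cardinality[p]) for p in pa]):   # parent configuration
            n_prime_ij = n_ij = 0.0
            inner = 0.0
            for k in range(cardinality[x_i]):
                # N'_ijk = N' * p(x_i = k, Pa_i = j | B^h_sc, xi)
                p_ijk = sum(prob for inst, prob in prior_joint.items()
                            if dict(zip(variables, inst))[x_i] == k
                            and all(dict(zip(variables, inst))[p] == v
                                    for p, v in zip(pa, j)))
                n_prime_ijk = n_prime * p_ijk
                # N_ijk = number of cases with x_i = k and Pa_i = j
                n_ijk = sum(1 for case in data
                            if case[x_i] == k
                            and all(case[p] == v for p, v in zip(pa, j)))
                inner += lgamma(n_prime_ijk + n_ijk) - lgamma(n_prime_ijk)
                n_prime_ij += n_prime_ijk
                n_ij += n_ijk
            log_score += lgamma(n_prime_ij) - lgamma(n_prime_ij + n_ij) + inner
    return log_score
```

Adding log p(B^h_s | ξ) to this quantity gives the relative log posterior used when comparing structures.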

The preceding theorem shows that parameter independence, likelihood equivalence, structure possibility, and the Dirichlet assumption are consistent for complete network structures. Nonetheless, these assumptions and the assumption of parameter modularity may not be consistent for all network structures. To understand the potential for inconsistency, note that we obtained the BDe metric for all network structures using likelihood equivalence applied only to complete network structures, in combination with the other assumptions. Thus, it could be that the BDe metric for incomplete network structures is not likelihood equivalent. Nonetheless, the following theorem shows that the BDe metric is likelihood equivalent for all network structures; that is, given the other assumptions, likelihood equivalence for incomplete structures is implied by likelihood equivalence for complete network structures. Consequently, our assumptions are consistent.

Theorem. For all domains U and all network structures B_s in U, the BDe metric is likelihood equivalent.


Proof: Given a database D, equivalent sample size N', joint probability distribution p(U | B^h_sc, ξ), and a subset X of U, consider the following function of X:

    l(X) = \prod_{k_X} \frac{\Gamma\!\left(N' p(X = k_X \mid B^h_{sc}, \xi) + N_{k_X}\right)}{\Gamma\!\left(N' p(X = k_X \mid B^h_{sc}, \xi)\right)}

where k_X is an instance of X, and N_{k_X} is the number of cases in D in which X = k_X. Then the likelihood term of the BDe metric becomes

    p(D \mid B_s^h, \xi) = \prod_{i=1}^{n} \frac{l(\{x_i\} \cup \Pi_i)}{l(\Pi_i)}

Now, by the theorem on arc reversals given earlier, we know that a network structure can be transformed into an equivalent structure by a series of arc reversals. Thus, we can demonstrate that the BDe metric satisfies likelihood equivalence in general if we can do so for the case where two equivalent structures differ by a single arc reversal. So let B_s1 and B_s2 be two equivalent network structures that differ only in the direction of the arc between x_i and x_j (say, x_i → x_j in B_s1). Let R be the set of parents of x_i in B_s1. By the same theorem, we know that R ∪ {x_i} is the set of parents of x_j in B_s1, R is the set of parents of x_j in B_s2, and R ∪ {x_j} is the set of parents of x_i in B_s2. Because the two structures differ only in the reversal of a single arc, the only terms in the product above that can differ are those involving x_i and x_j. For B_s1, these terms are

    \frac{l(\{x_i\} \cup R)}{l(R)} \cdot \frac{l(\{x_i, x_j\} \cup R)}{l(\{x_i\} \cup R)}

whereas for B_s2 they are

    \frac{l(\{x_j\} \cup R)}{l(R)} \cdot \frac{l(\{x_i, x_j\} \cup R)}{l(\{x_j\} \cup R)}

Both products reduce to l({x_i, x_j} ∪ R) / l(R); hence they are equal, and p(D | B^h_s1, ξ) = p(D | B^h_s2, ξ). □

We note that Buntine's metric is a special case of the BDe metric where every instance of the joint space, conditioned on B^h_sc, is equally likely. We call this special case the BDeu metric ("u" for uniform joint distribution). Buntine noted that this metric satisfies the property of likelihood equivalence.

    The Prior Network

To calculate the terms in the BDe metric, or to construct informative priors for a more general metric that can handle missing data, we need priors on network structures p(B^h_s | ξ) and the Dirichlet distribution ρ(Θ_U | B^h_sc, ξ). In a later section, we provide a simple method for assessing priors on network structures. Here, we concentrate on the assessment of the Dirichlet distribution for Θ_U.


Recall from earlier sections that we can assess this distribution by assessing a single equivalent sample size N' for the domain and the joint distribution of the domain for the next case to be seen, p(U | B^h_sc, ξ), where both assessments are conditioned on the state of information B^h_sc. As we have discussed, the assessment of equivalent sample size is straightforward. Furthermore, a user can assess p(U | B^h_sc, ξ) by building a Bayesian network for U given B^h_sc. We call this network the user's prior network.

The unusual aspect of this assessment is the conditioning hypothesis B^h_sc. Whether we are dealing with acausal or causal Bayesian networks, this hypothesis includes the assertion that there are no independencies in the long run. Thus, at first glance, there seems to be a contradiction in asking the user to construct a prior network (which may contain assertions of independence) under the assertion that B^h_sc is true. Nonetheless, there is no contradiction, because the assertions of independence in the prior network refer to independencies in the next case to be seen, whereas the assertion of full dependence B^h_sc refers to the long run.

To help illustrate this point, let us consider the following acausal example. Suppose a person repeatedly rolls a four-sided die with labels 1, 2, 3, and 4. In addition, suppose that he repeatedly does one of the following: (1) rolls the die once and reports x = true iff the die lands 1 or 2 and y = true iff the die lands 1 or 3, or (2) rolls the die twice and reports x = true iff the die lands 1 or 2 on the first roll and y = true iff the die lands 1 or 2 on the second roll. In either case, the multinomial assumption is reasonable. Furthermore, condition (2) corresponds to the hypothesis that x and y are independent in the long run, whereas condition (1) corresponds to the hypothesis B^h_xy ∨ B^h_yx: x and y are dependent in the long run. Also, given these correspondences, parameter modularity and likelihood equivalence are reasonable. Finally, let us suppose that the parameters of the multinomial sample have a Dirichlet distribution, so that parameter independence holds. Thus, this example fits the assumptions of our learning approach. Now, if we have no reason to prefer one outcome of the die to another on the next roll, then we will have p(y | x, B^h_xy, ξ) = p(y | B^h_xy, ξ). That is, our prior network will contain no arc between x and y, even though, given B^h_xy, x and y are almost certainly dependent in the long run. The Monte-Carlo sketch below makes this contrast concrete.
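The following Monte-Carlo sketch is our own illustration, not from the paper. It adopts the single-roll condition as described above (x = true iff the die lands 1 or 2, y = true iff it lands 1 or 3) together with an assumed symmetric Dirichlet prior over the die's bias; both choices are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Long run: for a *sampled* bias theta over faces 1..4, x and y are
# (almost surely) dependent.
theta = rng.dirichlet([1.0, 1.0, 1.0, 1.0])
p_xy = theta[0]                          # p(x, y) = p(face 1)
p_x  = theta[0] + theta[1]               # p(x)    = p(1 or 2)
p_y  = theta[0] + theta[2]               # p(y)    = p(1 or 3)
print("long run :", p_xy, "vs", p_x * p_y)               # differ almost surely

# Next case: averaging over the symmetric prior on theta, independence holds,
# so the prior network contains no arc between x and y.
thetas = rng.dirichlet([1.0, 1.0, 1.0, 1.0], size=200_000)
print("next case:", thetas[:, 0].mean(), "vs", 0.5 * 0.5) # both ~ 1/4
```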

We expect that most users would prefer to construct a prior network without having to condition on B^h_sc. In the previous example, it is possible to ignore the conditioning hypothesis, because p(U | B^h_xy, ξ) = p(U | B^h_yx, ξ) = p(U | ξ). In general, however, a user cannot ignore this hypothesis. In our four-sided-die example, the joint distributions p(U | B^h_xy, ξ) and p(U | ξ) would have been different had we not been indifferent about the die outcomes. (Footnote: Actually, as we have discussed, B^h_xy includes the possibility that x and y are independent, but only with a probability of measure zero.)


We have had little experience with training people to condition on B^h_sc when constructing a prior network. Nonetheless, stories like the four-sided-die example may help users make the necessary distinction for assessment.

    A Simple Example

Consider again our two-binary-variable domain. Let B_xy and B_yx denote the network structures where x points to y and y points to x, respectively. Suppose that the user assesses an equivalent sample size N' and a prior network that gives the joint distribution p(x, y | B^h_xy, ξ), p(x, ȳ | B^h_xy, ξ), p(x̄, y | B^h_xy, ξ), and p(x̄, ȳ | B^h_xy, ξ). Also suppose we observe two cases: C_1 = {x, y} and C_2 = {x, ȳ}. Let i = 1, 2 refer to variables x and y, respectively, and let k = 1, 2 denote the true and false states of a variable. Thus, for the network structure x → y, we have the Dirichlet exponents

    N'_{111} = N' p(x \mid B^h_{xy}, \xi), \quad N'_{112} = N' p(\bar{x} \mid B^h_{xy}, \xi), \quad N'_{211} = N' p(x, y \mid B^h_{xy}, \xi),
    N'_{212} = N' p(x, \bar{y} \mid B^h_{xy}, \xi), \quad N'_{221} = N' p(\bar{x}, y \mid B^h_{xy}, \xi), \quad N'_{222} = N' p(\bar{x}, \bar{y} \mid B^h_{xy}, \xi)

and the sufficient statistics N_{111} = 2, N_{112} = 0, N_{211} = 1, N_{212} = 1, N_{221} = 0, and N_{222} = 0. Consequently, we obtain

    p(D \mid B^h_{xy}, \xi) = \frac{\Gamma(N'_{11})}{\Gamma(N'_{11} + 2)} \cdot \frac{\Gamma(N'_{111} + 2)}{\Gamma(N'_{111})} \cdot \frac{\Gamma(N'_{112})}{\Gamma(N'_{112})} \cdot \frac{\Gamma(N'_{21})}{\Gamma(N'_{21} + 2)} \cdot \frac{\Gamma(N'_{211} + 1)}{\Gamma(N'_{211})} \cdot \frac{\Gamma(N'_{212} + 1)}{\Gamma(N'_{212})} \cdot \frac{\Gamma(N'_{22})}{\Gamma(N'_{22})}

where N'_{ij} = Σ_k N'_{ijk}.

For the network structure y → x, we have the Dirichlet exponents

    N'_{111} = N' p(x, y \mid B^h_{xy}, \xi), \quad N'_{112} = N' p(\bar{x}, y \mid B^h_{xy}, \xi), \quad N'_{121} = N' p(x, \bar{y} \mid B^h_{xy}, \xi),
    N'_{122} = N' p(\bar{x}, \bar{y} \mid B^h_{xy}, \xi), \quad N'_{211} = N' p(y \mid B^h_{xy}, \xi), \quad N'_{212} = N' p(\bar{y} \mid B^h_{xy}, \xi)

and the sufficient statistics N_{111} = 1, N_{112} = 0, N_{121} = 1, N_{122} = 0, N_{211} = 1, and N_{212} = 1. Consequently, we have

    p(D \mid B^h_{yx}, \xi) = \frac{\Gamma(N'_{21})}{\Gamma(N'_{21} + 2)} \cdot \frac{\Gamma(N'_{211} + 1)}{\Gamma(N'_{211})} \cdot \frac{\Gamma(N'_{212} + 1)}{\Gamma(N'_{212})} \cdot \frac{\Gamma(N'_{11})}{\Gamma(N'_{11} + 1)} \cdot \frac{\Gamma(N'_{111} + 1)}{\Gamma(N'_{111})} \cdot \frac{\Gamma(N'_{112})}{\Gamma(N'_{112})} \cdot \frac{\Gamma(N'_{12})}{\Gamma(N'_{12} + 1)} \cdot \frac{\Gamma(N'_{121} + 1)}{\Gamma(N'_{121})} \cdot \frac{\Gamma(N'_{122})}{\Gamma(N'_{122})}

Because the exponents for both structures are derived from the same joint distribution, the two expressions are equal: as required, the BDe metric exhibits the property of likelihood equivalence.

In contrast, the K2 metric (all N'_{ijk} = 1) does not satisfy this property. In particular, given the same database, we have

    p(D \mid B^h_{xy}, \xi) = \left[ \frac{\Gamma(2)}{\Gamma(4)} \cdot \frac{\Gamma(3)}{\Gamma(1)} \cdot \frac{\Gamma(1)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(4)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(2)} \right] = \frac{1}{3} \cdot \frac{1}{6} = \frac{1}{18}

    p(D \mid B^h_{yx}, \xi) = \left[ \frac{\Gamma(2)}{\Gamma(4)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(3)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(1)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(3)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(1)}{\Gamma(1)} \right] = \frac{1}{6} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{24}
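The comparison can be checked numerically with the short sketch below (our own code). Because the numerical values of the prior network are not essential to the point, the BDe exponents here are computed from an assumed uniform prior p(x, y | B^h_xy, ξ) = 1/4 for every state and an assumed N' = 12; the K2 computation needs no such choice.

```python
from math import gamma

def bd_likelihood(counts, exponents):
    """p(D | B_s^h, xi) for one structure, as a product over (i, j) families:
    Gamma(N'_ij)/Gamma(N'_ij + N_ij) * prod_k Gamma(N'_ijk + N_ijk)/Gamma(N'_ijk)."""
    score = 1.0
    for n_ijk, np_ijk in zip(counts, exponents):
        n_ij, np_ij = sum(n_ijk), sum(np_ijk)
        score *= gamma(np_ij) / gamma(np_ij + n_ij)
        for n, np_ in zip(n_ijk, np_ijk):
            score *= gamma(np_ + n) / gamma(np_)
    return score

# D = {C1 = {x, y}, C2 = {x, ~y}} as N_ijk counts, one list per (node, parent state).
counts_xy = [[2, 0], [1, 1], [0, 0]]     # x; y|x; y|~x    (structure x -> y)
counts_yx = [[1, 1], [1, 0], [1, 0]]     # y; x|y; x|~y    (structure y -> x)

# K2 metric: every N'_ijk = 1.  The two equivalent structures score differently.
k2 = [[1, 1], [1, 1], [1, 1]]
print(bd_likelihood(counts_xy, k2), bd_likelihood(counts_yx, k2))   # 1/18 vs 1/24

# BDe with the assumed uniform prior and N' = 12: the scores coincide.
bde = [[6, 6], [3, 3], [3, 3]]
print(bd_likelihood(counts_xy, bde), bd_likelihood(counts_yx, bde))
```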

    Elimination of the Dirichlet Assumption

In the previous subsection, we saw that when ρ(Θ_U | B^h_sc, ξ) is Dirichlet, then ρ(Θ_Bsc | B^h_sc, ξ) is consistent with parameter independence, the Dirichlet assumption, likelihood equivalence, and structure possibility. Therefore, it is natural to ask whether there are any other choices for ρ(Θ_U | B^h_sc, ξ) that are similarly consistent. Actually, because the Dirichlet assumption is so strong, it is more fitting to ask whether there are any other choices for ρ(Θ_U | B^h_sc, ξ) that are


consistent with all but the Dirichlet assumption. In this section, we show that if each density function is positive (i.e., the range of each function includes only numbers greater than zero), then a Dirichlet distribution for ρ(Θ_U | B^h_sc, ξ) is the only consistent choice. Consequently, we show that, under these conditions, the BDe metric follows without the Dirichlet assumption.

First, let us examine this question for our two-binary-variable domain. Combining the equations above for the network structure x → y, the corresponding equations for the network structure y → x, likelihood equivalence, and structure possibility, we obtain

    \rho(\theta_x, \theta_{y|x}, \theta_{y|\bar x} \mid B^h_{xy}, \xi) = \frac{\theta_x \theta_{\bar x}}{\theta_y \theta_{\bar y}}\, \rho(\theta_y, \theta_{x|y}, \theta_{x|\bar y} \mid B^h_{yx}, \xi)

where

    \theta_y = \theta_x \theta_{y|x} + \theta_{\bar x} \theta_{y|\bar x}, \qquad
    \theta_{x|y} = \frac{\theta_x \theta_{y|x}}{\theta_y}, \qquad
    \theta_{x|\bar y} = \frac{\theta_x \theta_{\bar y|x}}{\theta_{\bar y}}

Applying parameter independence to both sides of this equation, we get

    f_x(\theta_x)\, f_{y|x}(\theta_{y|x})\, f_{y|\bar x}(\theta_{y|\bar x}) = \frac{\theta_x \theta_{\bar x}}{\theta_y \theta_{\bar y}}\, f_y(\theta_y)\, f_{x|y}(\theta_{x|y})\, f_{x|\bar y}(\theta_{x|\bar y})

where f_x, f_{y|x}, f_{y|x̄}, f_y, f_{x|y}, and f_{x|ȳ} are unknown density functions. These equations define a functional equation. Methods for solving such equations have been well studied (see, e.g., Aczel). In our case, Geiger and Heckerman show that, if each function is positive, then the only solution is for ρ(θ_xy, θ_xȳ, θ_x̄y | B^h_xy, ξ) to be a Dirichlet distribution. In fact, they show that, even when x and/or y have more than two states, the only solution consistent with likelihood equivalence is the Dirichlet.

Theorem (Geiger and Heckerman). Let Θ_xy, {Θ_x, Θ_{y|x}}, and {Θ_y, Θ_{x|y}} be positive multinomial parameters related by the rules of probability. If

    f_x(\Theta_x) \prod_{k=1}^{r_x} f_{y|x_k}(\Theta_{y|x_k}) = \frac{\prod_{k=1}^{r_x} \theta_{x_k}^{\,r_y - 1}}{\prod_{l=1}^{r_y} \theta_{y_l}^{\,r_x - 1}}\, f_y(\Theta_y) \prod_{l=1}^{r_y} f_{x|y_l}(\Theta_{x|y_l})

where each function is a positive probability density function, then ρ(Θ_xy | ξ) is Dirichlet.

This result for two variables is easily generalized to the n-variable case, as we now demonstrate.


Theorem. Let B_sc1 and B_sc2 be two complete network structures for U with variable orderings (x_1, ..., x_n) and (x_n, x_1, ..., x_{n-1}), respectively. If both structures have positive multinomial parameters that obey

    \rho(\Theta_{B_{sc}} \mid \xi) = J_{B_{sc}}\, \rho(\Theta_U \mid \xi)

(where J_Bsc is the Jacobian of the change of variables from Θ_U to Θ_Bsc), and positive densities ρ(Θ_Bsc | ξ) that satisfy parameter independence, then ρ(Θ_U | ξ) is Dirichlet.

Proof: The theorem is trivial for domains with one variable (n = 1), and is proved by the preceding theorem for n = 2. When n > 2, first consider the complete network structure B_sc1. Clustering the variables X = {x_1, ..., x_{n-1}} into a single discrete variable with q = ∏_{i=1}^{n-1} r_i states, we obtain the network structure X → x_n with multinomial parameters Θ_X and Θ_{x_n|X} given by

    \theta_X = \prod_{i=1}^{n-1} \theta_{x_i \mid x_1, \ldots, x_{i-1}}, \qquad
    \theta_{x_n \mid X} = \theta_{x_n \mid x_1, \ldots, x_{n-1}}

By assumption, the parameters of B_sc1 satisfy parameter independence. Thus, when we change variables from Θ_Bsc1 to {Θ_X, Θ_{x_n|X}}, using the Jacobian for this transformation, we find that the parameters for X → x_n also satisfy parameter independence. Now consider the complete network structure B_sc2. With the same variable cluster, we obtain the network structure x_n → X, with parameters Θ_{x_n} as in the original network structure and Θ_{X|x_n} given by

    \theta_{X \mid x_n} = \prod_{i=1}^{n-1} \theta_{x_i \mid x_n, x_1, \ldots, x_{i-1}}

By assumption, the parameters of B_sc2 satisfy parameter independence. Thus, when we change variables from Θ_Bsc2 to {Θ_{x_n}, Θ_{X|x_n}}, computing a Jacobian for each state of x_n, we find that the parameters for x_n → X again satisfy parameter independence. Finally, these changes of variable, in conjunction with the Jacobian relationship above, imply the condition of the two-variable theorem. Consequently, by that theorem, ρ(Θ_{X, x_n} | ξ) = ρ(Θ_U | ξ) is Dirichlet. □

Thus, we obtain the BDe metric without the Dirichlet assumption.

Theorem. The assumptions introduced previously, excluding the Dirichlet assumption, together with the assumption that parameter densities are positive, imply the BDe metric.

Proof: Given parameter independence, likelihood equivalence, structure possibility, and positive densities, we have from the preceding theorem that ρ(Θ_U | B^h_sc, ξ) is Dirichlet. Thus, from the BDe-metric theorem, we obtain the BDe metric. □


The assumption that parameters are positive is important. For example, given a domain consisting of only logical relationships, we can have parameter independence, likelihood equivalence, and structure possibility, and yet ρ(Θ_U | B^h_sc, ξ) will not be Dirichlet.

    Limitations of Parameter Independence and Likelihood Equivalence

There is a simple characterization of the assumption of parameter independence. Recall the property of posterior parameter independence, which says that parameters remain independent as long as complete cases are observed. Thus, suppose we have an uninformative Dirichlet prior for the joint-space parameters (all exponents very close to zero), which satisfies parameter independence. Then, if we observe one or more complete cases, our posterior will also satisfy parameter independence. In contrast, suppose we have the same uninformative prior and observe one or more incomplete cases. Then, our posterior will not be a Dirichlet distribution; in fact, it will be a linear combination of Dirichlet distributions, and it will not satisfy parameter independence. In this sense, the assumption of parameter independence corresponds to the assumption that one's knowledge is equivalent to having seen only complete cases.
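The following importance-sampling sketch (our own illustration, with assumed numbers) shows the effect for two binary variables and the structure x → y: starting from a uniform Dirichlet prior on the joint parameters (used here in place of exponents near zero, for numerical stability) and conditioning on several incomplete cases in which only y = true is observed, the parameters θ_x and θ_{y|x} become correlated, so parameter independence no longer holds.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([1.0, 1.0, 1.0, 1.0])            # uniform prior on the joint
joint = rng.dirichlet(alpha, size=200_000)        # columns: xy, x~y, ~xy, ~x~y

theta_x   = joint[:, 0] + joint[:, 1]
theta_y_x = joint[:, 0] / theta_x

# Ten incomplete cases, each observing y = true with x missing:
# p(y = true | theta) = theta_xy + theta_~xy.
weights = (joint[:, 0] + joint[:, 2]) ** 10
weights /= weights.sum()

def weighted_corr(a, b, w):
    ma, mb = np.sum(w * a), np.sum(w * b)
    cov = np.sum(w * (a - ma) * (b - mb))
    return cov / np.sqrt(np.sum(w * (a - ma) ** 2) * np.sum(w * (b - mb) ** 2))

print(np.corrcoef(theta_x, theta_y_x)[0, 1])      # prior: approximately zero
print(weighted_corr(theta_x, theta_y_x, weights)) # posterior: clearly nonzero
```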

When learning causal Bayesian networks, there is a similar characterization of the assumption of likelihood equivalence. (Recall that, when learning acausal networks, the assumption must hold.) Namely, until now we have considered only observational data: data obtained without intervention. Nonetheless, in many real-world studies, we obtain experimental data: data obtained by intervention, for example, by randomizing subjects into control and experimental groups. Although we have not developed the concepts in this paper to demonstrate the assertion, it turns out that, if we start with the uninformative Dirichlet prior (which satisfies likelihood equivalence), then the posterior will satisfy likelihood equivalence if and only if we see no experimental data. In this sense, when learning causal Bayesian networks, the assumption of likelihood equivalence corresponds to the assumption that one's knowledge is equivalent to having seen only nonexperimental data.

In light of these characterizations, we see that the assumptions of parameter independence and likelihood equivalence are unreasonable in many domains. For example, if we learn about a portion of a domain by reading or through word of mouth, or simply apply common sense, then these assumptions should be suspect. In these situations, our methodology for determining an informative prior from a prior network and a single equivalent sample size is too simple. (Footnote: These characterizations of parameter independence and likelihood equivalence in the context of causal networks are simplified for this presentation; Heckerman provides more detailed characterizations.)

To relax one or both of these assumptions when they are unreasonable, we can use


an equivalent database in place of an equivalent sample size. Namely, we ask a user to imagine that he was initially completely ignorant about a domain, having an uninformative Dirichlet prior. Then, we ask the user to specify a database D_e that would produce a posterior density reflecting his current state of knowledge. This database may contain incomplete cases and/or experimental data. Then, to score a real database D, we score the database D_e ∪ D using the uninformative prior and a learning algorithm that handles missing and experimental data, such as Gibbs sampling.

It remains to be determined whether this approach is practical. Needed is a compact representation for specifying equivalent databases that allows a user to accurately reflect his current knowledge. One possibility is to allow a user to specify a prior Bayesian network along with equivalent sample sizes, both experimental and nonexperimental, for each variable. Then, one could repeatedly sample equivalent databases from the prior network that satisfy these sample-size constraints, compute desired quantities (such as a scoring metric) from each equivalent database, and then average the results.

SDLC suggest a different method for accommodating nonuniform equivalent sample sizes. Their method produces Dirichlet priors that satisfy parameter independence but not likelihood equivalence.

    Priors for Network Structures

To complete the information needed to derive a Bayesian metric, the user must assess the prior probabilities of the network structures. Although these assessments are logically independent of the assessment of the prior network, structures that closely resemble the prior network will tend to have higher prior probabilities. Here, we propose the following parametric formula for p(B^h_s | ξ) that makes use of the prior network.

Let δ_i denote the number of nodes in the symmetric difference of Π_i(B_s) and Π_i(P), the parent sets of x_i in B_s and in the prior network P:

    \delta_i = \left| \left( \Pi_i(B_s) \cup \Pi_i(P) \right) \setminus \left( \Pi_i(B_s) \cap \Pi_i(P) \right) \right|

Then B_s and the prior network differ by δ = Σ_{i=1}^n δ_i arcs, and we penalize B_s by a constant factor 0 < κ ≤ 1 for each such arc. That is, we set

    p(B_s^h \mid \xi) = c\, \kappa^{\delta}

where c is a normalization constant, which we can ignore when computing relative posterior probabilities. This formula is simple, as it requires only the assessment of a single constant κ. Nonetheless, we can imagine generalizing the formula by punishing different arc differences with different weights, as suggested by Buntine. Furthermore, it may be more reasonable to use a prior network constructed without conditioning on B^h_sc.

We note that this parametric form satisfies prior equivalence only when the prior network contains no arcs. Consequently, because the priors on network structures for acausal


networks must satisfy prior equivalence, we should not use this parameterization for acausal networks.
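A minimal sketch of the structure prior just described follows (our own code; the helper names and example networks are hypothetical). It returns the log prior up to the additive constant log c.

```python
from math import log

def log_structure_prior(candidate_parents, prior_parents, kappa=0.5):
    """log(kappa ** delta), where delta sums, over all nodes, the size of the
    symmetric difference between the node's parents in the candidate structure
    and in the prior network.  0 < kappa <= 1 is the per-arc penalty."""
    nodes = set(candidate_parents) | set(prior_parents)
    delta = sum(len(candidate_parents.get(x, set()) ^ prior_parents.get(x, set()))
                for x in nodes)
    return delta * log(kappa)

# Example: prior network x -> y -> z; the candidate x -> y, x -> z differs by two arcs.
prior_net = {"x": set(), "y": {"x"}, "z": {"y"}}
candidate = {"x": set(), "y": {"x"}, "z": {"x"}}
print(log_structure_prior(candidate, prior_net))     # 2 * log(kappa)
```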

    Search Methods

In this section, we examine methods for finding network structures with high posterior probabilities. Although our methods are presented in the context of Bayesian scoring metrics, they may be used in conjunction with other non-Bayesian metrics as well. Also, we note that researchers have proposed network-selection criteria other than relative posterior probability (e.g., Madigan and Raftery), which we do not consider here.

Many search methods for learning network structure, including those that we describe, make use of a property of scoring metrics that we call decomposability. Given a network structure for domain U, we say that a measure on that structure is decomposable if it can be written as a product of measures, each of which is a function only of one node and its parents. From the BDe formula, we see that the likelihood p(D | B^h_s, ξ) given by the BD metric is decomposable. Consequently, if the prior probabilities of network structures are decomposable, as is the case for the priors given in the previous section, then the BD metric will be decomposable. Thus, we can write

    p(D, B_s^h \mid \xi) = \prod_{i=1}^{n} s(x_i \mid \Pi_i)

where s(x_i | Π_i) is a function only of x_i and its parents. Given a decomposable metric, we can compare the scores of two network structures that differ by the addition or deletion of arcs pointing to x_i by computing only the term s(x_i | Π_i) for both structures. We note that most known Bayesian and non-Bayesian metrics are decomposable.

Special-Case Polynomial Algorithms

We first consider the special case of finding the l network structures with the highest score among all structures in which every node has at most one parent.

For each arc x_j → x_i (including cases where x_j is null), we associate a weight w(x_i, x_j) = log s(x_i | x_j) − log s(x_i | ∅). From the decomposition above, we have

    \log p(D, B_s^h \mid \xi) = \sum_{i=1}^{n} \log s(x_i \mid \Pi_i) = \sum_{i=1}^{n} w(x_i, \Pi_i) + \sum_{i=1}^{n} \log s(x_i \mid \emptyset)


where Π_i is the (possibly null) parent of x_i. The last term in this equation is the same for all network structures. Thus, among the network structures in which each node has at most one parent, ranking network structures by sum of weights Σ_{i=1}^n w(x_i, Π_i) or by score has the same result.

Finding the network structure with the highest weight (l = 1) is a special case of the well-known problem of finding maximum branchings, described, for example, in Evans and Minieka. The problem is defined as follows. A tree-like network is a connected directed acyclic graph in which no two edges are directed into the same node. The root of a tree-like network is a unique node that has no edges directed into it. A branching is a directed forest that consists of disjoint tree-like networks. A spanning branching is any branching that includes all nodes in the graph. A maximum branching is any spanning branching that maximizes the sum of arc weights (in our case, Σ_{i=1}^n w(x_i, Π_i)). An efficient polynomial algorithm for finding a maximum branching was first described by Edmonds, later explored by Karp, and made more efficient by Tarjan and by Gabow et al. The general case (l > 1) was treated by Camerini et al.

These algorithms can be used to find the l branchings with the highest weights, regardless of the metric we use, as long as one can associate a weight with every edge. Therefore, this approach is appropriate for any decomposable metric. When using metrics that are score equivalent (i.e., both prior and likelihood equivalent), however, we have

    s(x_i \mid x_j)\, s(x_j \mid \emptyset) = s(x_j \mid x_i)\, s(x_i \mid \emptyset)

Thus, for any two edges x_i → x_j and x_j → x_i, the weights w(x_i, x_j) and w(x_j, x_i) are equal. Consequently, the directionality of the arcs plays no role for score-equivalent metrics, and the problem reduces to finding the l undirected forests for which Σ w(x_i, x_j) is a maximum. For the case l = 1, we can apply a maximum-spanning-tree algorithm with arc weights w(x_i, x_j) to identify an undirected forest F having the highest score. The set of network structures that are formed from F by adding any directionality to the arcs of F, such that the resulting network is a branching, yields a collection of equivalent network structures, each having the same maximal score. This algorithm is identical to the tree-learning algorithm described by Chow and Liu, except that we use a score-equivalent Bayesian metric rather than the mutual-information metric. For the general case l > 1, we can use the algorithm of Gabow to identify the l undirected forests having the highest score, and then determine the l equivalence classes of network structures with the highest score.
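For the score-equivalent case just described, the following is a small self-contained sketch (our own code, using Kruskal's algorithm with a union-find structure rather than any particular library routine). The weights in the example are hypothetical; in practice w(x_i, x_j) = log s(x_i | x_j) − log s(x_i | ∅), and only positive weights can improve on leaving a node parentless.

```python
def maximum_spanning_forest(nodes, weights):
    """Return the undirected edges of a maximum-weight spanning forest.

    weights : dict {(u, v): w} over unordered pairs (each listed once).
    Any orientation of the result that forms a branching has the same score."""
    parent = {v: v for v in nodes}

    def find(v):                         # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for (u, v), w in sorted(weights.items(), key=lambda item: -item[1]):
        if w <= 0:
            break                        # remaining edges cannot improve the score
        root_u, root_v = find(u), find(v)
        if root_u != root_v:             # the edge keeps the forest acyclic
            parent[root_u] = root_v
            forest.append((u, v))
    return forest

# Hypothetical weights w(x_i, x_j):
w = {("x", "y"): 2.3, ("y", "z"): 1.1, ("x", "z"): -0.4}
print(maximum_spanning_forest(["x", "y", "z"], w))    # [('x', 'y'), ('y', 'z')]
```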


    Heuristic Search

A generalization of the problem described in the previous section is to find the l best networks from the set of all networks in which each node has no more than k parents. Unfortunately, even when l = 1, the problem for k > 1 is NP-hard. In particular, let us consider the following decision problem, which corresponds to our optimization problem with l = 1:

k-LEARN
INSTANCE: Set of variables U, database D = {C_1, ..., C_m}, where each C_i is an instance of all variables in U, scoring metric M(D, B_s), and real value p.
QUESTION: Does there exist a network structure B_s, defined over the variables in U, where each node in B_s has at most k parents, such that M(D, B_s) ≥ p?

Höffgen shows that a similar problem for PAC learning is NP-complete. His results can be translated easily to show that k-LEARN is NP-complete for k > 1 when the BD metric is used. Chickering et al. show that k-LEARN is NP-complete even when we use the likelihood-equivalent BDe metric and the constraint of prior equivalence. Therefore, it is appropriate to use heuristic search algorithms for the general case k > 1. In this section, we review several such algorithms.

As is the case with essentially all search methods, the methods that we examine have two components: an initialization phase and a search phase. For example, let us consider the K2 search method (not to be confused with the K2 metric) described by CH. The initialization phase consists of choosing an ordering over the variables in U. In the search phase, for each node x_i in the ordering provided, the node from {x_1, ..., x_{i-1}} that most increases the network score is added to the parent set of x_i, until no node increases the score or the size of Π_i exceeds a predetermined constant.
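A sketch of this greedy parent selection is shown below (our own code; `score_node` is a hypothetical stand-in for any decomposable per-node score s(x_i | Π_i), such as the corresponding BDe term).

```python
def k2_search(ordering, score_node, max_parents):
    """Greedy K2-style search: parents of each node are chosen from its
    predecessors in `ordering`, adding one best-improving parent at a time.

    ordering    : list of variables; only earlier variables may become parents
    score_node  : function (node, frozenset_of_parents) -> log s(x_i | Pi_i)
    max_parents : upper bound on the size of each parent set
    """
    parents = {}
    for i, x in enumerate(ordering):
        chosen = frozenset()
        best = score_node(x, chosen)
        improved = True
        while improved and len(chosen) < max_parents:
            improved = False
            candidates = [p for p in ordering[:i] if p not in chosen]
            scored = [(score_node(x, chosen | {p}), p) for p in candidates]
            if scored:
                new_best, best_parent = max(scored)
                if new_best > best:
                    best, chosen, improved = new_best, chosen | {best_parent}, True
        parents[x] = set(chosen)
    return parents
```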

The search algorithms we consider make successive arc changes to the network and employ the property of decomposability to evaluate the merit of each change. The possible changes that can be made are easy to identify. For any pair of variables, if there is an arc connecting them, then this arc can either be reversed or removed. If there is no arc connecting them, then an arc can be added in either direction. All changes are subject to the constraint that the resulting network contain no directed cycles. We use E to denote the set of eligible changes to a graph, and Δ(e) to denote the change in log score of the network resulting from the modification e ∈ E. Given a decomposable metric, if an arc to x_i is added or deleted, only s(x_i | Π_i) need be evaluated to determine Δ(e). If an arc between x_i and x_j is reversed, then only s(x_i | Π_i) and s(x_j | Π_j) need be evaluated.


One simple heuristic search algorithm is local search (Johnson). First, we choose a graph. Then, we evaluate Δ(e) for all e ∈ E and make the change e for which Δ(e) is a maximum, provided it is positive. We terminate search when there is no e with a positive value for Δ(e). As we visit network structures, we retain the l of them with the highest overall score. Using decomposable metrics, we can avoid recomputing all terms Δ(e) after every change. In particular, if neither x_i, x_j, nor their parents are changed, then Δ(e) remains unchanged for all changes e involving these nodes, as long as the resulting network is acyclic. Candidates for the initial graph include the empty graph, a random graph, a graph determined by one of the polynomial algorithms described in the previous section, and the prior network.
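The following compact sketch (our own code, not the authors' implementation) illustrates local search with a decomposable per-node scorer; for brevity it starts from the empty graph, considers only arc additions and deletions (reversals would be handled analogously), and returns the single local maximum rather than the l best structures visited.

```python
from itertools import permutations

def creates_cycle(parents, child, new_parent):
    """True if adding the arc new_parent -> child would create a directed cycle,
    i.e., if `child` is already an ancestor of `new_parent`."""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def local_search(variables, score_node):
    parents = {x: set() for x in variables}              # initial graph: empty
    current = {x: score_node(x, parents[x]) for x in variables}
    while True:
        best_delta, best_move = 0.0, None
        for u, v in permutations(variables, 2):          # candidate arc u -> v
            if u in parents[v]:
                new_set = parents[v] - {u}               # deletion
            elif not creates_cycle(parents, v, u):
                new_set = parents[v] | {u}               # addition
            else:
                continue
            delta = score_node(v, new_set) - current[v]  # only s(x_v | Pi_v) changes
            if delta > best_delta:
                best_delta, best_move = delta, (v, new_set)
        if best_move is None:                            # local maximum reached
            return parents
        v, new_set = best_move
        parents[v], current[v] = new_set, score_node(v, new_set)
```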

A potential problem with local search is getting stuck at a local maximum. Methods for avoiding local maxima include iterated hill-climbing and simulated annealing. In iterated hill-climbing, we apply local search until we hit a local maximum. Then, we randomly perturb the current network structure and repeat the process for some manageable number of iterations. At all stages, we retain the top l network structures.

In one variant of simulated annealing, described by Metropolis et al., we initialize the system to some temperature T_0. Then, we pick some eligible change e at random and evaluate the expression p = exp(Δ(e)/T). If p > 1, then we make the change e; otherwise, we make the change with probability p. We repeat this selection and evaluation process α times or until we make β changes. If we make no changes in α repetitions, then we stop searching. Otherwise, we lower the temperature by multiplying the current temperature T by a decay factor 0 < γ < 1, and we continue the search process. We stop searching if we have lowered the temperature more than δ times. Thus, this algorithm is controlled by five parameters: T_0, α, β, γ, and δ. Throug