tr-94-09_bayesianas
TRANSCRIPT
Learning Bayesian Networks: The Combination of Knowledge and Statistical Data

David Heckerman (heckerma@microsoft.com)
Dan Geiger (dang@cs.technion.ac.il)
David M. Chickering (dmax@cs.ucla.edu)

March 1994 (Revised February 1995)

Technical Report MSR-TR-94-09

Microsoft Research
Advanced Technology Division
Microsoft Corporation
One Microsoft Way
Redmond, WA

To appear in Machine Learning
Abstract

We describe a Bayesian approach for learning Bayesian networks from a combination of prior knowledge and statistical data. First and foremost, we develop a methodology for assessing informative priors needed for learning. Our approach is derived from a set of assumptions made previously, as well as the assumption of likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence. We show that likelihood equivalence, when combined with previously made assumptions, implies that the user's priors for network parameters can be encoded in a single Bayesian network for the next case to be seen (a prior network) and a single measure of confidence for that network. Second, using these priors, we show how to compute the relative posterior probabilities of network structures given data. Third, we describe search methods for identifying network structures with high posterior probabilities. We describe polynomial algorithms for finding the highest-scoring network structures in the special case where every node has at most one parent. For the general case, which is NP-hard, we review heuristic search algorithms including local search, iterative local search, and simulated annealing. Finally, we describe a methodology for evaluating Bayesian-network learning algorithms, and apply this approach to a comparison of various approaches.

Keywords: Bayesian networks, learning, Dirichlet, likelihood equivalence, maximum branching, heuristic search
Introduction

A Bayesian network is an annotated directed graph that encodes probabilistic relationships among distinctions of interest in an uncertain-reasoning problem (Howard and Matheson; Pearl). The representation formally encodes the joint probability distribution for its domain, yet includes a human-oriented qualitative structure that facilitates communication between a user and a system incorporating the probabilistic model. We discuss the representation in detail in the following section. For over a decade, AI researchers have used Bayesian networks to encode expert knowledge. More recently, AI researchers and statisticians have begun to investigate methods for learning Bayesian networks, including Bayesian methods (Cooper and Herskovits; Buntine; Spiegelhalter et al.; Dawid and Lauritzen; Heckerman et al.), quasi-Bayesian methods (Lam and Bacchus; Suzuki), and non-Bayesian methods (Pearl and Verma; Spirtes et al.).

In this paper, we concentrate on the Bayesian approach, which takes prior knowledge and combines it with data to produce one or more Bayesian networks. Our approach is illustrated in Figure 1 for the problem of ICU ventilator management. Using our method, a user specifies his prior knowledge about the problem by constructing a Bayesian network,
called a prior network, and by assessing his confidence in this network. A hypothetical prior network is shown in Figure 1(b) (the probabilities are not shown). In addition, a database of cases is assembled, as shown in Figure 1(c). Each case in the database contains observations for every variable in the user's prior network. Our approach then takes these sources of information and learns one or more new Bayesian networks, as shown in Figure 1(d). To appreciate the effectiveness of the approach, note that the database was generated from the Bayesian network in Figure 1(a), known as the Alarm network (Beinlich et al.). Comparing the three network structures, we see that the structure of the learned network is much closer to that of the Alarm network than is the structure of the prior network. In effect, our learning algorithm has used the database to correct the prior knowledge of the user.
Our Bayesian approach can be understood as follows. Suppose we have a domain of discrete variables U = {x_1, ..., x_n} and a database of cases D = {C_1, ..., C_m}. Further suppose that we wish to determine the joint distribution p(C|D, ξ): the probability distribution of a new case C, given the database and our current state of information ξ. Rather than reason about this distribution directly, we imagine that the data is a random sample from an unknown Bayesian network structure B_s with unknown parameters. Using B_s^h to denote the hypothesis that the data is generated by network structure B_s, and assuming the hypotheses corresponding to all possible network structures form a mutually exclusive and collectively exhaustive set, we have

p(C|D, ξ) = Σ_{all B_s^h} p(C|D, B_s^h, ξ) p(B_s^h|D, ξ)

In practice, it is impossible to sum over all possible network structures. Consequently, we attempt to identify a small subset H of network-structure hypotheses that account for a large fraction of the posterior probability of the hypotheses. Rewriting the previous equation, we obtain

p(C|D, ξ) ≈ c Σ_{B_s^h ∈ H} p(C|D, B_s^h, ξ) p(B_s^h|D, ξ)

where c is the normalization constant 1 / [Σ_{B_s^h ∈ H} p(B_s^h|D, ξ)]. From this relation, we see that only the relative posterior probabilities of hypotheses matter. Thus, rather than compute a posterior probability, which would entail summing over all structures, we can compute a Bayes' factor, p(B_s^h|D, ξ) / p(B_s0^h|D, ξ), where B_s0 is some reference structure such as the one containing no arcs, or simply p(D, B_s^h|ξ) = p(B_s^h|ξ) p(D|B_s^h, ξ). In the latter case, we have

p(C|D, ξ) ≈ c' Σ_{B_s^h ∈ H} p(C|D, B_s^h, ξ) p(D, B_s^h|ξ)
[Figure 1 appears here: four panels showing (a) the Alarm network structure over 37 numbered nodes, (b) a hypothetical prior network over the same nodes, (c) a database of 10,000 cases over the variables x1 through x37, and (d) the learned network, with arcs annotated A, D, or R.]
Figure 1: (a) The Alarm network structure. (b) A prior network encoding a user's beliefs about the Alarm domain. (c) A case database generated from the Alarm network. (d) The network learned from the prior network and a case database generated from the Alarm network. Arcs that are added, deleted, or reversed with respect to the Alarm network are indicated with A, D, and R, respectively.
where c' is another normalization constant, 1 / [Σ_{B_s^h ∈ H} p(D, B_s^h|ξ)].
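To make the bookkeeping in this approximation concrete, the following Python sketch averages predictions over a handful of selected structure hypotheses. The structures, scores, and predictive functions are hypothetical placeholders for whatever search procedure and scoring metric one actually uses; only the weighting scheme follows the equation above.

    # A minimal sketch of model averaging over a set H of selected structure
    # hypotheses. scores[s] plays the role of p(D, B_s^h | xi), and
    # predict[s](case) plays the role of p(C | D, B_s^h, xi); both are
    # stand-ins for a real scoring metric and inference routine.

    def averaged_prediction(case, scores, predict):
        total = sum(scores.values())              # 1/c', the normalization constant
        return sum(scores[s] * predict[s](case) for s in scores) / total

    # Toy example with two hypotheses for a single binary prediction.
    scores = {"empty": 0.2, "x->y": 0.6}
    predict = {"empty": lambda case: 0.5, "x->y": lambda case: 0.7}
    print(averaged_prediction({"x": 1}, scores, predict))   # weighted average of 0.5 and 0.7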
In short, the Bayesian approach to learning Bayesian networks amounts to searching for network-structure hypotheses with high relative posterior probabilities. Many non-Bayesian approaches use the same basic approach, but optimize some other measure of how well the structure fits the data. In general, we refer to such measures as scoring metrics. We refer to any formula for computing the relative posterior probability of a network-structure hypothesis as a Bayesian scoring metric.

The Bayesian approach is not only an approximation for p(C|D, ξ), but a method for learning network structure. When |H| = 1, we learn a single network structure: the MAP (maximum a posteriori) structure of U. When |H| > 1, we learn a collection of network structures. As we discuss later in the paper, learning network structure is useful, because we can sometimes use structure to infer causal relationships in a domain, and consequently predict the effects of interventions.
One of the most challenging tasks in designing a Bayesian learning procedure is identifying classes of easy-to-assess informative priors for computing the terms on the right-hand side of the equation above. In the first part of the paper, we explicate a set of assumptions for discrete networks (networks containing only discrete variables) that leads to such a class of informative priors. Our assumptions are based on those made by Cooper and Herskovits (herein referred to as CH), Spiegelhalter et al. and Dawid and Lauritzen (herein referred to as SDLC), and Buntine. These researchers assumed parameter independence, which says that the parameters associated with each node in a Bayesian network are independent; parameter modularity, which says that if a node has the same parents in two distinct networks, then the probability density functions of the parameters associated with this node are identical in both networks; and the Dirichlet assumption, which says that all network parameters have a Dirichlet distribution. We assume parameter independence and parameter modularity, but instead of adopting the Dirichlet assumption, we introduce an assumption called likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence. We argue that this property is necessary when learning acausal Bayesian networks, and is often reasonable when learning causal Bayesian networks. We then show that likelihood equivalence, when combined with parameter independence and several weak conditions, implies the Dirichlet assumption. Furthermore, we show that likelihood equivalence constrains the Dirichlet distributions in such a way that they may be obtained from the user's prior network (a Bayesian network for the next case to be seen) and a single equivalent sample size reflecting the user's confidence in his prior network.
Our result has both a positive and a negative aspect. On the positive side, we show that parameter independence, parameter modularity, and likelihood equivalence lead to a simple approach for assessing priors that requires the user to assess only one equivalent sample size for the entire domain. On the negative side, the approach is sometimes too simple: a user may have more knowledge about one part of a domain than another. We argue that the assumptions of parameter independence and likelihood equivalence are sometimes too strong, and suggest a framework for relaxing these assumptions.

A more straightforward task in learning Bayesian networks is using a given informative prior to compute p(D, B_s^h|ξ) (i.e., a Bayesian scoring metric) and p(C|D, B_s^h, ξ). When databases are complete, that is, when there is no missing data, these terms can be derived in closed form. Otherwise, well-known statistical approximations may be used. In this paper, we consider complete databases only, and derive closed-form expressions for these terms. A result is a likelihood-equivalent Bayesian scoring metric, which we call the BDe metric. This metric is to be contrasted with the metrics of CH and Buntine, which do not make use of a prior network, and with the metrics of CH and SDLC, which do not satisfy the property of likelihood equivalence.

In the second part of the paper, we examine methods for finding networks with high scores. The methods can be used with any scoring metric. We describe polynomial algorithms for finding the highest-scoring networks in the special case where every node has at most one parent. In addition, we describe local-search and annealing algorithms for the general case, which is known to be NP-hard.

Finally, we describe a methodology for evaluating learning algorithms. We use this methodology to compare various scoring metrics and search methods.

We note that several researchers (e.g., Dawid and Lauritzen, and Madigan and Raftery) have developed methods for learning undirected network structures (as described in, e.g., Lauritzen). In this paper, we concentrate on learning directed models, because we can sometimes use them to infer causal relationships, and because most users find them easier to interpret.
Background
In this section, we introduce notation and background material that we need for our discussion, including a description of Bayesian networks, exchangeability, multinomial sampling, and the Dirichlet distribution. A summary of our notation is given after the Appendix.

Throughout this discussion, we consider a domain U of n discrete variables x_1, ..., x_n.
We use lowercase letters to refer to variables and uppercase letters to refer to sets of variables. We write x_i = k to denote that variable x_i is in state k. When we observe the state of every variable in set X, we call this set of observations an instance of X, and we write X = k_X as a shorthand for the observations x_i = k_i for each x_i ∈ X. The joint space of U is the set of all instances of U. We use p(X = k_X | Y = k_Y, ξ) to denote the probability that X = k_X given Y = k_Y, for a person with current state of information ξ. We use p(X|Y, ξ) to denote the set of probabilities for all possible observations of X, given all possible observations of Y. The joint probability distribution over U is the probability distribution over the joint space of U.

A Bayesian network for domain U represents a joint probability distribution over U. The representation consists of a set of local conditional distributions combined with a set of conditional-independence assertions that allow us to construct a global joint probability distribution from the local distributions. In particular, by the chain rule of probability, we have

p(x_1, ..., x_n|ξ) = ∏_{i=1}^{n} p(x_i | x_1, ..., x_{i-1}, ξ)

For each variable x_i, let Π_i ⊆ {x_1, ..., x_{i-1}} be a set of variables that renders x_i and {x_1, ..., x_{i-1}} conditionally independent. That is,

p(x_i | x_1, ..., x_{i-1}, ξ) = p(x_i | Π_i, ξ)

A Bayesian-network structure B_s encodes the assertions of conditional independence in these equations. Namely, B_s is a directed acyclic graph such that each variable in U corresponds to a node in B_s, and the parents of the node corresponding to x_i are the nodes corresponding to the variables in Π_i. In this paper, we use x_i to refer to both the variable and its corresponding node in a graph. A Bayesian-network probability set B_p is the collection of local distributions p(x_i|Π_i, ξ) for each node in the domain. A Bayesian network for U is the pair (B_s, B_p). Combining the two equations above, we see that any Bayesian network for U uniquely determines a joint probability distribution for U. That is,

p(x_1, ..., x_n|ξ) = ∏_{i=1}^{n} p(x_i | Π_i, ξ)
When a variable has only two states, we say that it is binary. A Bayesian network for three binary variables x1, x2, and x3 is shown in Figure 2. We see that Π_2 = {x1} and Π_3 = {x2}. Consequently, this network represents the conditional-independence assertion p(x3|x1, x2, ξ) = p(x3|x2, ξ).
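As a concrete illustration of this factorization, the short Python sketch below evaluates the joint probability of a full instance of the three-variable network in Figure 2 by multiplying the local distributions. The tables simply restate the probabilities shown in that figure; the variable and function names are our own.

    # Joint probability of a complete instance under the network x1 -> x2 -> x3
    # of Figure 2, computed as p(x1) * p(x2|x1) * p(x3|x2).

    p_x1 = {1: 0.6, 0: 0.4}                                     # p(x1 = present) = 0.6
    p_x2_given_x1 = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.3, 0: 0.7}}
    p_x3_given_x2 = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.15, 0: 0.85}}

    def joint(x1, x2, x3):
        return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

    print(joint(1, 1, 1))    # 0.6 * 0.8 * 0.9 = 0.432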
It can happen that two Bayesian-network structures represent the same constraints of conditional independence; that is, every joint probability distribution encoded by one structure can also be encoded by the other, and vice versa. In this case, the two network structures are said to be equivalent (Verma and Pearl).
[Figure 2 appears here: the network x1 -> x2 -> x3 with p(x1 = present) = 0.6, p(x2 = present | x1 = present) = 0.8, p(x2 = present | x1 = absent) = 0.3, p(x3 = present | x2 = present) = 0.9, and p(x3 = present | x2 = absent) = 0.15.]
Figure 2: A Bayesian network for three binary variables, taken from CH. The network represents the assertion that x1 and x3 are conditionally independent given x2. Each variable has two states, absent and present.
For example, the structures x1 → x2 → x3 and x1 ← x2 ← x3 both represent the assertion that x1 and x3 are conditionally independent given x2, and are therefore equivalent. In some of the technical discussions in this paper, we shall require the following characterization of equivalent networks, proved in the Appendix.

Theorem 1 Let B_s1 and B_s2 be two Bayesian-network structures, and let R_{B_s1,B_s2} be the set of edges by which B_s1 and B_s2 differ in directionality. Then, B_s1 and B_s2 are equivalent if and only if there exists a sequence of |R_{B_s1,B_s2}| distinct arc reversals applied to B_s1 with the following properties:

1. After each reversal, the resulting network structure contains no directed cycles and is equivalent to B_s2.

2. After all reversals, the resulting network structure is identical to B_s2.

3. If x → y is the next arc to be reversed in the current network structure, then x and y have the same parents in both network structures, with the exception that x is also a parent of y in B_s1.
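For readers who want to check equivalence programmatically, the sketch below uses the characterization due to Verma and Pearl (cited above) rather than the arc-reversal characterization of Theorem 1: two DAG structures are equivalent exactly when they share the same skeleton and the same v-structures (converging pairs x → z ← y with x and y not adjacent). Representing a structure as a set of directed edges is our own convention.

    # Equivalence test for two DAG structures, each given as a set of directed
    # edges (parent, child). Based on the Verma-Pearl characterization:
    # same skeleton and same v-structures.

    def skeleton(edges):
        return {frozenset(e) for e in edges}

    def v_structures(edges):
        adjacent = skeleton(edges)
        parents = {}
        for u, v in edges:
            parents.setdefault(v, set()).add(u)
        vs = set()
        for child, pars in parents.items():
            for a in pars:
                for b in pars:
                    if a < b and frozenset((a, b)) not in adjacent:
                        vs.add((a, child, b))
        return vs

    def equivalent(edges1, edges2):
        return skeleton(edges1) == skeleton(edges2) and v_structures(edges1) == v_structures(edges2)

    # x1 -> x2 -> x3 and x1 <- x2 <- x3 are equivalent; x1 -> x2 <- x3 is not.
    print(equivalent({("x1", "x2"), ("x2", "x3")}, {("x2", "x1"), ("x3", "x2")}))  # True
    print(equivalent({("x1", "x2"), ("x2", "x3")}, {("x1", "x2"), ("x3", "x2")}))  # False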
A drawback of Bayesian networks as defined is that network structure depends on variable order. If the order is chosen carelessly, the resulting network structure may fail to reveal many conditional independencies in the domain. Fortunately, in practice, Bayesian networks are typically constructed using notions of cause and effect. Loosely speaking, to construct a Bayesian network for a given set of variables, we draw arcs from cause variables to their immediate effects. For example, we would obtain the network structure in Figure 2 if we believed that x2 is the immediate causal effect of x1 and x3 is the immediate causal effect of x2.
[Figure 3 appears here: a node θ_y with arcs to the observations y1, y2, ..., ym.]
Figure 3: A Bayesian network showing the conditional-independence assertions associated with a multinomial sample.
In almost all cases, constructing a Bayesian network in this way yields a Bayesian network that is consistent with the formal definition. We return to this issue later in the paper.

Now let us consider exchangeability and random sampling. Most of the concepts we discuss can be found in Good and DeGroot. Given a discrete variable y with r states, consider a finite sequence of observations y_1, ..., y_m of this variable. We can think of this sequence as a database D for the one-variable domain U = {y}. This sequence is said to be exchangeable if a sequence obtained by interchanging any two observations in the sequence has the same probability as the original sequence. Roughly speaking, the assumption that a sequence is exchangeable is an assertion that the processes generating the data do not change in time.
Given an exchangeable sequence, De Finetti showed that there exist parameters Θ_y = {θ_y1, ..., θ_yr} such that

(1) θ_yk > 0, k = 1, ..., r;  Σ_{k=1}^{r} θ_yk = 1

(2) p(y_l = k | y_1, ..., y_{l-1}, Θ_y, ξ) = θ_yk

That is, the parameters Θ_y render the individual observations in the sequence conditionally independent, and the probability that any given observation will be in state k is just θ_yk. The conditional-independence assertion in condition (2) may be represented as a Bayesian network, as shown in Figure 3. By the strong law of large numbers (e.g., DeGroot), we may think of θ_yk as the long-run fraction of observations where y = k, although there are other interpretations (Howard). Also note that each parameter θ_yk is positive (i.e., greater than zero).

A sequence that satisfies these conditions is a particular type of random sample known as an r-dimensional multinomial sample with parameters Θ_y (Good). When r = 2, the
sequence is said to be a binomial sample. One example of a binomial sample is the outcome of repeated flips of a thumbtack. If we knew the long-run fraction of heads (point up) for a given thumbtack, then the outcome of each flip would be independent of the rest, and would have a probability of heads equal to this fraction. An example of a multinomial sample is the outcome of repeated rolls of a multi-sided die. As we shall see, learning Bayesian networks for discrete domains essentially reduces to the problem of learning the parameters of a die having many sides.
As Θ_y is a set of continuous variables, it has a probability density, which we denote ρ(Θ_y|ξ). Throughout this paper, we use ρ(·|ξ) to denote a probability density for a continuous variable or set of continuous variables. Given ρ(Θ_y|ξ), we can determine the probability that y = k in the next observation. In particular, by the rules of probability, we have

p(y = k|ξ) = ∫ p(y = k|Θ_y, ξ) ρ(Θ_y|ξ) dΘ_y

Consequently, by condition (2) above, we obtain

p(y = k|ξ) = ∫ θ_yk ρ(Θ_y|ξ) dΘ_y

which is the mean or expectation of θ_yk with respect to ρ(Θ_y|ξ), denoted E(θ_yk|ξ).

Suppose we have a prior density for Θ_y and then observe a database D. We may obtain the posterior density for Θ_y as follows. From Bayes' rule, we have

ρ(Θ_y|D, ξ) = c p(D|Θ_y, ξ) ρ(Θ_y|ξ)

where c is a normalization constant. Using condition (2) to rewrite the first term on the right-hand side, we obtain

ρ(Θ_y|D, ξ) = c ∏_{k=1}^{r} θ_yk^{N_k} ρ(Θ_y|ξ)

where N_k is the number of times y = k in D. Note that only the counts N_1, ..., N_r are necessary to determine the posterior from the prior. These counts are said to be a sufficient statistic for the multinomial sample.

In addition, suppose we assess a density for two different states of information ξ1 and ξ2, and find that ρ(Θ_y|ξ1) = ρ(Θ_y|ξ2). Then, for any multinomial sample D,

p(D|ξ1) = ∫ p(D|Θ_y, ξ1) ρ(Θ_y|ξ1) dΘ_y = p(D|ξ2)

because p(D|Θ_y, ξ1) = p(D|Θ_y, ξ2) by condition (2). That is, if the densities for Θ_y are the same, then the probability of any two samples will be the same. The converse is also true: namely, if p(D|ξ1) = p(D|ξ2) for all databases D, then ρ(Θ_y|ξ1) = ρ(Θ_y|ξ2). We shall use this equivalence when we discuss likelihood equivalence. (We assume this result is well known, although we have not found a proof in the literature.)
Given a multinomial sample, a user is free to assess any probability density for Θ_y. In practice, however, one often uses the Dirichlet distribution, because it has several convenient properties. The parameters Θ_y have a Dirichlet distribution with exponents N'_1, ..., N'_r when the probability density of Θ_y is given by

ρ(Θ_y|ξ) = [Γ(Σ_{k=1}^{r} N'_k) / ∏_{k=1}^{r} Γ(N'_k)] ∏_{k=1}^{r} θ_yk^{N'_k - 1}

where Γ(·) is the Gamma function, which satisfies Γ(x + 1) = x Γ(x) and Γ(1) = 1. When the parameters Θ_y have a Dirichlet distribution, we also say that ρ(Θ_y|ξ) is Dirichlet. The requirement that each N'_k be greater than 0 guarantees that the distribution can be normalized. Note that the exponents N'_k are a function of the user's state of information ξ. Also note that, by the constraint that the parameters sum to one, the Dirichlet distribution for Θ_y is technically a density over Θ_y \ {θ_yk} for some k (the symbol \ denotes set difference); nonetheless, we shall write the density as shown. When r = 2, the Dirichlet distribution is also known as a beta distribution.

From the posterior expression above, we see that if the prior distribution of Θ_y is Dirichlet, then the posterior distribution of Θ_y given database D is also Dirichlet:

ρ(Θ_y|D, ξ) = c ∏_{k=1}^{r} θ_yk^{N'_k + N_k - 1}

where c is a normalization constant. We say that the Dirichlet distribution is closed under multinomial sampling, or that the Dirichlet distribution is a conjugate family of distributions for multinomial sampling. Also, when Θ_y has a Dirichlet distribution, the expectation of θ_yk, equal to the probability that y = k in the next observation, has a simple expression:

E(θ_yk|ξ) = p(y = k|ξ) = N'_k / N'

where N' = Σ_{k=1}^{r} N'_k. We shall make use of these properties in our derivations.
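The following Python sketch works through these properties numerically: it updates Dirichlet exponents with observed counts and computes the predictive probability N'_k/N'. The particular exponents and counts are made up for illustration.

    # Conjugate updating of a Dirichlet prior for an r-state multinomial sample,
    # and the predictive probability E(theta_k) = N'_k / N'.

    def posterior_exponents(prior_exponents, counts):
        return [np + n for np, n in zip(prior_exponents, counts)]

    def predictive(exponents):
        total = sum(exponents)
        return [e / total for e in exponents]

    prior = [1.0, 1.0, 1.0]          # illustrative prior exponents N'_k
    counts = [5, 2, 1]               # sufficient statistics N_1, ..., N_r from a database
    post = posterior_exponents(prior, counts)
    print(post)                      # [6.0, 3.0, 2.0]
    print(predictive(post))          # approximately [0.545, 0.273, 0.182]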
A survey of methods for assessing a beta distribution is given by Winkler. These methods include the direct assessment of the probability density using questions regarding relative densities and relative areas, assessment of the cumulative distribution function using fractiles, assessing the posterior means of the distribution given hypothetical evidence, and assessment in the form of an equivalent sample size. These methods can be generalized, with varying difficulty, to the non-binary case.

In our work, we find one method, based on the expectation formula above, particularly useful. The formula says that we can assess a Dirichlet distribution by assessing the probability distribution p(y|ξ) for the next observation and N'. In so doing, we may rewrite the Dirichlet density as

ρ(Θ_y|ξ) = c ∏_{k=1}^{r} θ_yk^{N' p(y=k|ξ) - 1}
where c is a normalization constant. Assessing p(y|ξ) is straightforward. Furthermore, the following two observations suggest a simple method for assessing N'.

One, the variance of a density for Θ_y is an indication of how much the mean of Θ_y is expected to change given new observations. The higher the variance, the greater the expected change. It is sometimes said that the variance is a measure of a user's confidence in the mean for Θ_y. The variance of the Dirichlet distribution is given by

Var(θ_yk|ξ) = p(y = k|ξ) (1 - p(y = k|ξ)) / (N' + 1)

Thus, N' is a reflection of the user's confidence. Two, suppose we were initially completely ignorant about a domain; that is, our distribution ρ(Θ_y|ξ) was given by the Dirichlet density above with each exponent N'_k = 0. (This prior distribution cannot be normalized, and is sometimes called an improper prior. To be more precise, we should say that each exponent is equal to some number close to zero.) Suppose we then saw N' cases with sufficient statistics N_1 = N'_1, ..., N_r = N'_r. Then, by conjugacy, our prior would be the Dirichlet distribution with exponents N'_1, ..., N'_r. Thus, we can assess N' as an equivalent sample size: the number of observations we would have had to have seen, starting from complete ignorance, in order to have the same confidence in Θ_y that we actually have. This assessment approach generalizes easily to many-variable domains, and thus is useful for our work. We note that some users at first find judgments of equivalent sample size to be difficult. Our experience with such users has been that they may be made more comfortable with the method by first using some other method for assessment (e.g., fractiles) on simple scenarios, and by examining the equivalent sample sizes implied by their assessments.
Bayesian Metrics: Previous Work
CH, Buntine, and SDLC examine domains where all variables are discrete, and derive essentially the same Bayesian scoring metric and formula for p(C|D, B_s^h, ξ), based on the same set of assumptions about the user's prior knowledge and the database. In this section, we present these assumptions and provide a derivation of p(D, B_s^h|ξ) and p(C|D, B_s^h, ξ).

Roughly speaking, the first assumption is that B_s^h is true iff the database D can be partitioned into a set of multinomial samples determined by the network structure B_s. In particular, B_s^h is true iff, for every variable x_i in U and every instance of x_i's parents Π_i in B_s, the observations of x_i in D, in those cases where Π_i takes on the same instance, constitute a multinomial sample. For example, consider a domain consisting of two binary variables x and y. We shall use this domain to illustrate many of the concepts in this paper. There are three network structures for this domain: x → y, x ← y, and the empty network structure, which contains no arc.
[Figure 4 appears here: two Bayesian-network structures over the parameters θ_x, θ_y|x, θ_y|x̄ and the variables x and y in cases 1 and 2, one for panel (a) and one for panel (b).]
Figure 4: A Bayesian-network structure for a two-binary-variable domain {x, y}, showing conditional independencies associated with (a) the multinomial-sample assumption and (b) the added assumption of parameter independence. In both figures, it is assumed that the network structure x → y is generating the database.
The hypothesis associated with the empty network structure, denoted B^h_xy, corresponds to the assertion that the database is made up of two binomial samples: (1) the observations of x are a binomial sample with parameter θ_x, and (2) the observations of y are a binomial sample with parameter θ_y.

In contrast, the hypothesis associated with the network structure x → y, denoted B^h_x→y, corresponds to the assertion that the database is made up of at most three binomial samples: (1) the observations of x are a binomial sample with parameter θ_x, (2) the observations of y, in those cases where x is true (if any), are a binomial sample with parameter θ_y|x, and (3) the observations of y, in those cases where x is false (if any), are a binomial sample with parameter θ_y|x̄. One consequence of the second and third assertions is that y in case C_l is conditionally independent of the other occurrences of y in D, given θ_y|x, θ_y|x̄, and x in case C_l. We can graphically represent this conditional-independence assertion using a Bayesian-network structure, as shown in Figure 4(a).

Finally, the hypothesis associated with the network structure x ← y, denoted B^h_x←y, corresponds to the assertion that the database is made up of at most three binomial samples: one for y, one for x given y is true, and one for x given y is false.

Before we state this assumption for arbitrary domains, we introduce the following
notation. Given a Bayesian network B_s for domain U, let r_i be the number of states of variable x_i, and let q_i = ∏_{x_l ∈ Π_i} r_l be the number of instances of Π_i. We use the integer j to index the instances of Π_i. Thus, we write p(x_i = k|Π_i = j, ξ) to denote the probability that x_i = k, given the jth instance of the parents of x_i. Let θ_ijk denote the multinomial parameter corresponding to the probability p(x_i = k|Π_i = j, ξ), with θ_ij1 = 1 - Σ_{k=2}^{r_i} θ_ijk. In addition, we define

Θ_ij ≡ ∪_{k=2}^{r_i} {θ_ijk}
Θ_i ≡ ∪_{j=1}^{q_i} Θ_ij
Θ_Bs ≡ ∪_{i=1}^{n} Θ_i

That is, the parameters in Θ_Bs correspond to the probability set B_p for a single-case Bayesian network. (Whenever possible, we use CH's notation.)

Assumption 1 (Multinomial Sample) Given domain U and database D, let D_l denote the first l - 1 cases in the database. In addition, let x_il and Π_il denote the variable x_i and the parent set Π_i in the lth case, respectively. Then, for all network structures B_s in U, there exist positive parameters Θ_Bs such that, for i = 1, ..., n, and for all k, k_1, ..., k_{i-1},

p(x_il = k | x_1l = k_1, ..., x_{i-1,l} = k_{i-1}, D_l, Θ_Bs, B_s^h, ξ) = θ_ijk

where j is the instance of Π_il consistent with {x_1l = k_1, ..., x_{i-1,l} = k_{i-1}}.
There is an important implication of this assumption, which we examine later in the paper. Nonetheless, the equation above is all that we need (and all that CH, Buntine, and SDLC used) to derive a metric. Also note that the positivity requirement excludes logical relationships among variables. We can relax this requirement, although we do not do so in this paper.

The second assumption is an independence assumption.

Assumption 2 (Parameter Independence) Given network structure B_s, if p(B_s^h|ξ) > 0, then

(a) ρ(Θ_Bs|B_s^h, ξ) = ∏_{i=1}^{n} ρ(Θ_i|B_s^h, ξ)

(b) For i = 1, ..., n, ρ(Θ_i|B_s^h, ξ) = ∏_{j=1}^{q_i} ρ(Θ_ij|B_s^h, ξ)

Assumption 2a says that the parameters associated with each variable in a network structure are independent. We call this assumption global parameter independence, after Spiegelhalter and Lauritzen. Assumption 2b says that the parameters associated with each
instance of the parents of a variable are independent. We call this assumption local parameter independence, again after Spiegelhalter and Lauritzen. We refer to the combination of these assumptions simply as parameter independence. The assumption of parameter independence for our two-binary-variable domain is shown in the Bayesian-network structure of Figure 4(b).

As we shall see, Assumption 2 greatly simplifies the computation of p(D, B_s^h|ξ). The assumption is reasonable for some domains, but not for others. Later in the paper, we describe a simple characterization of the assumption that provides a test for deciding whether the assumption is reasonable in a given domain.

The third assumption was also made to simplify computations.

Assumption 3 (Parameter Modularity) Given two network structures B_s1 and B_s2 such that p(B_s1^h|ξ) > 0 and p(B_s2^h|ξ) > 0, if x_i has the same parents in B_s1 and B_s2, then

ρ(Θ_ij|B_s1^h, ξ) = ρ(Θ_ij|B_s2^h, ξ),  j = 1, ..., q_i

We call this property parameter modularity, because it says that the densities for the parameters Θ_ij depend only on the structure of the network that is local to variable x_i; namely, Θ_ij only depends on x_i and its parents. For example, consider the network structure x → y and the empty structure for our two-variable domain. In both structures, x has the same set of parents (the empty set). Consequently, by parameter modularity, ρ(θ_x|B^h_x→y, ξ) = ρ(θ_x|B^h_xy, ξ). We note that CH, Buntine, and SDLC implicitly make the assumption of parameter modularity (Cooper and Herskovits; Buntine; Spiegelhalter et al.).

The fourth assumption restricts each parameter set Θ_ij to have a Dirichlet distribution.

Assumption 4 (Dirichlet) Given a network structure B_s such that p(B_s^h|ξ) > 0, ρ(Θ_ij|B_s^h, ξ) is Dirichlet for all Θ_ij ⊆ Θ_Bs. That is, there exist exponents N'_ijk, which depend on B_s^h and ξ, such that

ρ(Θ_ij|B_s^h, ξ) = c ∏_{k} θ_ijk^{N'_ijk - 1}

where c is a normalization constant.

When every parameter set of Θ_Bs has a Dirichlet distribution, we simply say that ρ(Θ_Bs|B_s^h, ξ) is Dirichlet. Note that, by the assumption of parameter modularity, we do not require Dirichlet exponents for every network structure B_s. Rather, we require exponents only for every node and for every possible parent set of each node.
Assumptions 1 through 4 are assumptions about the domain. Given Assumption 1, we can compute p(D|Θ_Bs, B_s^h, ξ) as a function of Θ_Bs for any given database (see the expression below). Also, as we show later in the paper, these assumptions determine ρ(Θ_Bs|B_s^h, ξ) for every network structure B_s. Thus, from the relation

p(D, B_s^h|ξ) = p(B_s^h|ξ) ∫ p(D|Θ_Bs, B_s^h, ξ) ρ(Θ_Bs|B_s^h, ξ) dΘ_Bs

these assumptions, in conjunction with the prior probabilities of network structure p(B_s^h|ξ), form a complete representation of the user's prior knowledge for purposes of computing p(D, B_s^h|ξ). By a similar argument, we can show that these assumptions also determine the probability distribution p(C|D, B_s^h, ξ), for any given database and network structure.

In contrast, the fifth assumption is an assumption about the database.

Assumption 5 (Complete Data) The database is complete. That is, it contains no missing data.

This assumption was made in order to compute p(D, B_s^h|ξ) and p(C|D, B_s^h, ξ) in closed form. In this paper, we concentrate on complete databases for the same reason. Nonetheless, the reader should recognize that, given Assumptions 1 through 4, these probabilities can be computed (in principle) for any complete or incomplete database. In practice, these probabilities can be approximated for incomplete databases by well-known statistical methods. Such methods include filling in missing data based on the data that is present (Titterington; Spiegelhalter and Lauritzen), the EM algorithm (Dempster et al.), and Gibbs sampling, i.e., Markov chain Monte Carlo methods (York; Madigan and Raftery).

Let us now explore the consequences of these assumptions. First, from the multinomial-sample assumption and the assumption of no missing data, we obtain

p(C_l|D_l, Θ_Bs, B_s^h, ξ) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{δ_lijk}

where δ_lijk = 1 if x_i = k and Π_i = j in case C_l, and δ_lijk = 0 otherwise. Thus, if we let N_ijk be the number of cases in database D in which x_i = k and Π_i = j, we have

p(D|Θ_Bs, B_s^h, ξ) = ∏_{i} ∏_{j} ∏_{k} θ_ijk^{N_ijk}

From this result, it follows that the parameters Θ_Bs remain independent given database D, a property we call posterior parameter independence. In particular, from the assumption of parameter independence, we have

ρ(Θ_Bs|D, B_s^h, ξ) = c p(D|Θ_Bs, B_s^h, ξ) ∏_{i=1}^{n} ∏_{j=1}^{q_i} ρ(Θ_ij|B_s^h, ξ)
where c is some normalization constant. Combining this expression with the likelihood above, we obtain

ρ(Θ_Bs|D, B_s^h, ξ) = c ∏_{i} ∏_{j} ρ(Θ_ij|B_s^h, ξ) ∏_{k} θ_ijk^{N_ijk}

and posterior parameter independence follows. We note that, by this relation and the assumption of parameter modularity, the parameters remain modular a posteriori as well.
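The sufficient statistics N_ijk are simple joint counts, and are the only thing the metrics below need from the database. The following Python sketch counts them for a given structure; the encoding of a structure as a parent map and of cases as dictionaries is our own illustrative convention.

    # Sufficient statistics N_ijk for a complete database: the number of cases in
    # which x_i takes state k while its parents take their j-th configuration.
    # A structure is given as {variable: list of parents}.

    from collections import defaultdict

    def sufficient_statistics(parents, database):
        counts = {}
        for xi, pa in parents.items():
            table = defaultdict(int)
            for case in database:
                j = tuple(case[p] for p in pa)      # parent configuration (the instance j)
                table[(j, case[xi])] += 1
            counts[xi] = dict(table)
        return counts

    parents = {"x": [], "y": ["x"]}                 # the structure x -> y
    database = [{"x": 1, "y": 1}, {"x": 1, "y": 0}, {"x": 0, "y": 1}]
    print(sufficient_statistics(parents, database))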
Given these basic relations, we can derive a metric and a formula for p(C|D, B_s^h, ξ). From the rules of probability, we have

p(D|B_s^h, ξ) = ∏_{l=1}^{m} p(C_l|D_l, B_s^h, ξ)

From this equation, we see that the Bayesian scoring metric can be viewed as a form of cross validation, where, rather than use D \ {C_l} to predict C_l, we use only the cases C_1, ..., C_{l-1} to predict C_l.

Conditioning on the parameters of the network structure B_s, we obtain

p(C_l|D_l, B_s^h, ξ) = ∫ p(C_l|D_l, Θ_Bs, B_s^h, ξ) ρ(Θ_Bs|D_l, B_s^h, ξ) dΘ_Bs

Using the single-case likelihood above and posterior parameter independence to rewrite the first and second terms in the integral, respectively, and interchanging integrals with products, we get

p(C_l|D_l, B_s^h, ξ) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∫ ∏_{k=1}^{r_i} θ_ijk^{δ_lijk} ρ(Θ_ij|D_l, B_s^h, ξ) dΘ_ij

When δ_lijk = 1, the integral is the expectation of θ_ijk with respect to the density ρ(Θ_ij|D_l, B_s^h, ξ). Consequently, we have

p(C_l|D_l, B_s^h, ξ) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} E(θ_ijk|D_l, B_s^h, ξ)^{δ_lijk}

To compute p(C|D, B_s^h, ξ), we set l = m + 1 and interpret C_{m+1} to be C. To compute p(D|B_s^h, ξ), we combine the two preceding product formulas and rearrange products, obtaining

p(D|B_s^h, ξ) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} ∏_{l=1}^{m} E(θ_ijk|C_1, ..., C_{l-1}, B_s^h, ξ)^{δ_lijk}

Thus, all that remains is to determine the expectations in these two equations. Given the Dirichlet assumption (Assumption 4), this evaluation is straightforward. Combining the Dirichlet assumption and the posterior expression above, we obtain

ρ(Θ_ij|D, B_s^h, ξ) = c ∏_{k} θ_ijk^{N'_ijk + N_ijk - 1}
where c is another normalization constant. Note that the counts N_ijk are a sufficient statistic for the database. Also, as we discussed in the Background section, the Dirichlet distributions are conjugate for the database: the posterior distribution of each parameter set Θ_ij remains in the Dirichlet family. Thus, applying the Dirichlet expectation to the prediction formula, with l = m + 1, C_{m+1} = C, and D_{m+1} = D, we obtain

p(C_{m+1}|D, B_s^h, ξ) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} [(N'_ijk + N_ijk) / (N'_ij + N_ij)]^{δ_{m+1,ijk}}

where

N'_ij = Σ_{k=1}^{r_i} N'_ijk    N_ij = Σ_{k=1}^{r_i} N_ijk

Similarly, from the product formula for p(D|B_s^h, ξ), the expectations taken case by case telescope into ratios of Gamma functions, and we obtain the scoring metric

p(D, B_s^h|ξ) = p(B_s^h|ξ) ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(N'_ij) / Γ(N'_ij + N_ij)] ∏_{k=1}^{r_i} [Γ(N'_ijk + N_ijk) / Γ(N'_ijk)]

We call this expression the BD (Bayesian Dirichlet) metric.

As is apparent from this formula, the exponents N'_ijk, in conjunction with p(B_s^h|ξ), completely specify a user's current knowledge about the domain for purposes of learning network structures. Unfortunately, the specification of N'_ijk for all possible variable-parent configurations and for all values of i, j, and k is formidable, to say the least. CH suggest the simple uninformative assignment N'_ijk = 1. We shall refer to this special case of the BD metric as the K2 metric. Buntine suggests the uninformative assignment N'_ijk = N'/(r_i q_i). We shall examine this special case again later in the paper, where we also address the assessment of the priors on network structure, p(B_s^h|ξ).
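A direct implementation of the BD metric is short; the sketch below scores a structure in log space from the sufficient statistics computed in the earlier sketch and a table of exponents N'_ijk. Setting every exponent to 1 gives the K2 special case. The data layout (parent maps, count dictionaries) is the same illustrative convention used above, not anything prescribed in this report.

    # Log of the BD metric, log p(D, B_s^h | xi), up to the structure prior:
    # for each i, j: log Gamma(N'_ij) - log Gamma(N'_ij + N_ij)
    #   + sum over k of [log Gamma(N'_ijk + N_ijk) - log Gamma(N'_ijk)].
    # Parent configurations with no observed cases contribute a factor of 1,
    # so iterating only over configurations present in the counts is correct.

    from math import lgamma

    def log_bd_score(parents, states, counts, exponent, log_structure_prior=0.0):
        score = log_structure_prior
        for xi in parents:
            by_config = {}
            for (j, k), n in counts[xi].items():
                by_config.setdefault(j, {})[k] = n
            for j, row in by_config.items():
                n_prime_ij = sum(exponent(xi, j, k) for k in states[xi])
                n_ij = sum(row.values())
                score += lgamma(n_prime_ij) - lgamma(n_prime_ij + n_ij)
                for k in states[xi]:
                    npk = exponent(xi, j, k)
                    score += lgamma(npk + row.get(k, 0)) - lgamma(npk)
        return score

    # K2 special case: N'_ijk = 1 for all i, j, k.
    k2 = lambda xi, j, k: 1.0
    states = {"x": [0, 1], "y": [0, 1]}
    print(log_bd_score(parents, states, sufficient_statistics(parents, database), k2))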
Acausal networks, causal networks, and likelihood equivalence

In this section, we examine another assumption for learning Bayesian networks that has been previously overlooked.
Before we do so, it is important to distinguish between acausal and causal Bayesian networks. Although Bayesian networks have been formally described as a representation of conditional independence, as we noted in the Introduction, people often construct them using notions of cause and effect. Recently, several researchers have begun to explore a formal causal semantics for Bayesian networks (e.g., Pearl and Verma; Pearl; Spirtes et al.; Druzdzel and Simon; and Heckerman and Shachter). They argue that the representation of causal knowledge is important not only for assessment, but for prediction as well. In particular, they argue that causal knowledge, unlike statistical knowledge, allows one to derive beliefs about a domain after intervention. For example, most of us believe that smoking causes lung cancer. From this knowledge, we infer that if we stop smoking, then we decrease our chances of getting lung cancer. In contrast, if we knew only that there was a statistical correlation between smoking and lung cancer, then we could not make this inference. The formal semantics of cause and effect proposed by these researchers is not important for this discussion. The interested reader should consult the references given.

First, let us consider acausal networks. Recall our assumption that the hypothesis B_s^h is true iff the database D is a collection of multinomial samples determined by the network structure B_s. This assumption is equivalent to saying that (1) the database D is a multinomial sample from the joint space of U with parameters Θ_U, and (2) the hypothesis B_s^h is true iff the parameters Θ_U satisfy the conditional-independence assertions of B_s. We can think of condition (2) as a definition of the hypothesis B_s^h.

For example, in our two-binary-variable domain, regardless of which hypothesis is true, we may assert that the database is a multinomial sample from the joint space U = {x, y} with parameters Θ_U = {θ_xy, θ_xȳ, θ_x̄y, θ_x̄ȳ}. Furthermore, given the hypothesis B^h_x→y, for example, we know that the parameters Θ_U are unconstrained (except that they must sum to one), because the network structure x → y represents no assertions of conditional independence. In contrast, given the hypothesis B^h_xy, we know that the parameters Θ_U must satisfy the independence constraints θ_xy = (θ_xy + θ_xȳ)(θ_xy + θ_x̄y), and so on.

Given this definition of B_s^h for acausal Bayesian networks, it follows that if two network structures B_s1 and B_s2 are equivalent, then B_s1^h = B_s2^h. For example, in our two-variable domain, both the hypotheses B^h_x→y and B^h_x←y assert that there are no constraints on the parameters Θ_U. Consequently, we have B^h_x→y = B^h_x←y. In general, we call this property hypothesis equivalence.

(Footnote: One technical flaw with this definition of B_s^h is that hypotheses are not mutually exclusive. For example, in our two-variable domain, the hypotheses B^h_x→y and B^h_xy both include the possibility θ_y = θ_y|x. This flaw is potentially troublesome because mutual exclusivity is important for our Bayesian interpretation of network learning. Nonetheless, because the densities ρ(Θ_Bs|B_s^h, ξ) must be integrable and hence bounded, the overlap of the hypotheses will be of measure zero, and we may use the model-averaging equation without modification. For example, in our two-binary-variable domain, given the hypothesis B^h_x→y, the probability that B^h_xy is true (i.e., that θ_y = θ_y|x) has measure zero.)
In light of this property, we should associate each hypothesis with an equivalence class of structures rather than a single network structure. Also, given the property of hypothesis equivalence, we have prior equivalence: if network structures B_s1 and B_s2 are equivalent, then p(B_s1^h|ξ) = p(B_s2^h|ξ); likelihood equivalence: if B_s1 and B_s2 are equivalent, then, for all databases D, p(D|B_s1^h, ξ) = p(D|B_s2^h, ξ); and score equivalence: if B_s1 and B_s2 are equivalent, then p(D, B_s1^h|ξ) = p(D, B_s2^h|ξ).

Now let us consider causal networks. For these networks, the assumption of hypothesis equivalence is unreasonable. In particular, for causal networks, we must modify the definition of B_s^h to include the assertion that each nonroot node in B_s is a direct causal effect of its parents. For example, in our two-variable domain, the causal networks x → y and x ← y represent the same constraints on Θ_U (i.e., none), but the former also asserts that x causes y, whereas the latter asserts that y causes x. Thus, the hypotheses B^h_x→y and B^h_x←y are not equal. Indeed, it is reasonable to assume that these hypotheses, and the hypotheses associated with any two different causal-network structures, are mutually exclusive.

Nonetheless, for many real-world problems that we have encountered, we have found it reasonable to assume likelihood equivalence. That is, we have found it reasonable to assume that data cannot distinguish between equivalent network structures. Of course, for any given problem, it is up to the decision maker to assume likelihood equivalence or not. Later in the paper, we describe a characterization of likelihood equivalence that suggests a simple procedure for deciding whether the assumption is reasonable in a given domain.

Because the assumption of likelihood equivalence is appropriate for learning acausal networks in all domains, and for learning causal networks in many domains, we adopt this assumption in our remaining treatment of scoring metrics. As we have stated it, likelihood equivalence says that, for any database D, the probability of D is the same given hypotheses corresponding to any two equivalent network structures. From our discussion of multinomial samples from the joint space, however, we may also state likelihood equivalence in terms of Θ_U.

Assumption 6 (Likelihood Equivalence) Given two network structures B_s1 and B_s2 such that p(B_s1^h|ξ) > 0 and p(B_s2^h|ξ) > 0, if B_s1 and B_s2 are equivalent, then ρ(Θ_U|B_s1^h, ξ) = ρ(Θ_U|B_s2^h, ξ).

(Footnote: Using the same convention as for the Dirichlet distribution, we write ρ(Θ_U|B_s^h, ξ) to denote a density over a set of the nonredundant parameters in Θ_U.)

We close this section with a few additional remarks about inferring causal relationships.
Given the distinction between statistical and causal dependence, it would seem impossible to learn causal networks from data produced by observation alone. For example, consider the simple three-variable domain U = {x1, x2, x3}. If we find through the observation of data that the network structure x1 → x3 ← x2 is very likely, then we cannot conclude that x1 and x2 are causes for x3. Rather, it may be the case that there is a hidden common cause of x1 and x3 as well as a hidden common cause of x2 and x3. If, however, we assume that every statistical association derives from causal interaction and that there are no hidden common causes, then we can interpret learned networks as causal networks. In our example, under these assumptions, we can infer that x1 and x2 are causes for x3. (We note that, in some circumstances, we can identify causes and effects from network structure even when there are hidden common causes; see Pearl for a discussion.)

Under the assumption of likelihood equivalence, the ratio of posterior probabilities of two equivalent network structures must be equal to the ratio of their prior probabilities. Consequently, if the priors on network structures are not too different, then typically learning will produce many equivalent network structures, each having a large relative posterior probability. Furthermore, even for domains where the assumption of likelihood equivalence does not hold, there is a good chance that more than one hypothesis will have a large relative posterior probability. In such situations, we find it reasonable to average the causal assertions contained in individual learned networks. For example, in our three-variable domain, let us suppose that the data supports only the network structure x1 → x2 → x3 and its equivalent cousins x1 ← x2 → x3 and x1 ← x2 ← x3. If each of the hypotheses corresponding to these structures has the same prior probability, then the posterior probability of each hypothesis will be 1/3, and we infer that the proposition "x2 causes x3" has probability 2/3. Under these same conditions, the proposition that both x1 and x2 are causes of x3 has probability 0.
The BDe Metric
The assumption of likelihood equivalence, when combined with the previous assumptions, introduces constraints on the Dirichlet exponents N'_ijk. The result is a likelihood-equivalent specialization of the BD metric, which we call the BDe metric. In this section, we derive this metric. In addition, we show that, as a consequence of the exponent constraints, the user may construct an informative prior for the parameters of all network structures merely by building a Bayesian network for the next case to be seen and by assessing an equivalent sample size. Most remarkably, we show that the Dirichlet assumption (Assumption 4) is not needed to obtain the BDe metric.
Informative Priors
In this section, we show how the added assumption of likelihood equivalence simplifies the construction of informative priors.

Before we do so, we need to define the concept of a complete network structure. A complete network structure is one that has no missing edges; that is, it encodes no assertions of conditional independence. In a domain with n variables, there are n! complete network structures. An important property of complete network structures is that all such structures for a given domain are equivalent.

Now, for a given domain U, suppose we have assessed the density ρ(Θ_U|B_sc^h, ξ), where B_sc is some complete network structure for U. Given parameter independence, parameter modularity, likelihood equivalence, and one additional assumption, it turns out that we can compute the prior ρ(Θ_Bs|B_s^h, ξ) for any network structure B_s in U from the given density.

To see how this computation is done, consider again our two-binary-variable domain. Suppose we are given a density for the parameters of the joint space, ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x→y, ξ). From this density, we construct the parameter densities for each of the three network structures in the domain. First, consider the network structure x → y. A parameter set for this network structure is {θ_x, θ_y|x, θ_y|x̄}. These parameters are related to the parameters of the joint space by the following relations:

θ_xy = θ_x θ_y|x    θ_xȳ = θ_x (1 - θ_y|x)    θ_x̄y = (1 - θ_x) θ_y|x̄

Thus, we may obtain ρ(θ_x, θ_y|x, θ_y|x̄|B^h_x→y, ξ) from the given density by changing variables:

ρ(θ_x, θ_y|x, θ_y|x̄|B^h_x→y, ξ) = J_x→y ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x→y, ξ)

where J_x→y is the Jacobian of the transformation,

J_x→y = |∂(θ_xy, θ_xȳ, θ_x̄y) / ∂(θ_x, θ_y|x, θ_y|x̄)| = θ_x (1 - θ_x)

The Jacobian J_Bsc for the transformation from Θ_U to Θ_Bsc, where B_sc is an arbitrary complete network structure, is given in the Appendix.

Next, consider the network structure x ← y. Assuming that the hypothesis B^h_x←y is also possible, we obtain ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x←y, ξ) = ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x→y, ξ) by likelihood equivalence. Therefore, we can compute the density for the network structure x ← y using the Jacobian J_x←y = θ_y (1 - θ_y).

Finally, consider the empty network structure. Given the assumption of parameter independence, we may obtain the densities ρ(θ_x|B^h_xy, ξ) and ρ(θ_y|B^h_xy, ξ) separately.
[Figure 5 appears here: starting from ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x→y, ξ) = ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x←y, ξ), a change of variables yields the parameter densities for the structures x → y and x ← y, and parameter modularity then yields the densities for the empty structure.]
Figure 5: A computation of the parameter densities for the three network structures of the two-binary-variable domain {x, y}. The approach computes the densities from ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x→y, ξ) using likelihood equivalence, parameter independence, and parameter modularity.
To obtain the density for θ_x, we first extract ρ(θ_x|B^h_x→y, ξ) from the density for the network structure x → y. This extraction is straightforward, because, by parameter independence, the parameters for x → y must be independent. Then, we use parameter modularity, which says that ρ(θ_x|B^h_xy, ξ) = ρ(θ_x|B^h_x→y, ξ). To obtain the density for θ_y, we extract ρ(θ_y|B^h_x←y, ξ) from the density for the network structure x ← y, and again apply parameter modularity. The approach is summarized in Figure 5.
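When the joint-space density is Dirichlet (as in the consistency results below), this construction reduces to simple arithmetic on exponents: the exponent for each local parameter is a sum of joint exponents. The Python sketch below carries out the two-variable construction numerically; the joint exponents are invented for illustration.

    # Construction of local Dirichlet exponents from joint-space exponents
    # N'(x, y) for the two-binary-variable domain, assuming the joint density
    # is Dirichlet. The numbers are illustrative only.

    joint_exponents = {(0, 0): 1.5, (0, 1): 2.5, (1, 0): 3.0, (1, 1): 5.0}

    # Structure x -> y: exponents for theta_x and theta_{y|x}.
    exp_x = {x: sum(v for (xx, _), v in joint_exponents.items() if xx == x) for x in (0, 1)}
    exp_y_given_x = {x: {y: joint_exponents[(x, y)] for y in (0, 1)} for x in (0, 1)}

    # Structure x <- y: likelihood equivalence lets us reuse the same joint exponents.
    exp_y = {y: sum(v for (_, yy), v in joint_exponents.items() if yy == y) for y in (0, 1)}

    # Empty structure: parameter modularity gives theta_x from x -> y and theta_y from x <- y.
    print(exp_x)            # {0: 4.0, 1: 8.0}
    print(exp_y)            # {0: 4.5, 1: 7.5}
    print(exp_y_given_x)    # {0: {0: 1.5, 1: 2.5}, 1: {0: 3.0, 1: 5.0}}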
In this construction, it is important that both hypotheses B^h_x→y and B^h_x←y have nonzero prior probabilities, lest we could not make use of likelihood equivalence to obtain the parameter densities for the empty structure. In order to take advantage of likelihood equivalence in general, we adopt the following assumption.

Assumption 7 (Structure Possibility) Given a domain U, p(B_sc^h|ξ) > 0 for all complete network structures B_sc.

Note that, in the context of acausal Bayesian networks, there is only one hypothesis corresponding to the equivalence class of complete network structures. In this case, Assumption 7 says that this single hypothesis is possible. In the context of causal Bayesian networks, the assumption implies that each of the n! complete network structures is possible. Although we make the assumption of structure possibility as a matter of convenience, we have found it to be reasonable in many real-world network-learning problems.

Given this assumption, we can now describe our construction method in general.
Theorem 2 Given domain U and a probability density ρ(Θ_U|B_sc^h, ξ), where B_sc is some complete network structure for U, the assumptions of parameter independence (Assumption 2), parameter modularity (Assumption 3), likelihood equivalence (Assumption 6), and structure possibility (Assumption 7) uniquely determine ρ(Θ_Bs|B_s^h, ξ) for any network structure B_s in U.

Proof: Consider any B_s. By Assumption 2, if we determine ρ(Θ_ij|B_s^h, ξ) for every parameter set Θ_ij associated with B_s, then we determine ρ(Θ_Bs|B_s^h, ξ). So, consider a particular Θ_ij. Let Π_i be the parents of x_i in B_s, and let B_sc' be a complete belief-network structure with variable ordering Π_i, x_i, followed by the remaining variables. First, using Assumption 7, we recognize that the hypothesis B_sc'^h is possible. Consequently, we use Assumption 6 to obtain ρ(Θ_U|B_sc'^h, ξ) = ρ(Θ_U|B_sc^h, ξ). Next, we change variables from Θ_U to Θ_Bsc', yielding ρ(Θ_Bsc'|B_sc'^h, ξ). Using parameter independence, we then extract the density ρ(Θ_ij|B_sc'^h, ξ) from ρ(Θ_Bsc'|B_sc'^h, ξ). Finally, because x_i has the same parents in B_s and B_sc', we apply parameter modularity to obtain the desired density: ρ(Θ_ij|B_s^h, ξ) = ρ(Θ_ij|B_sc'^h, ξ). To show uniqueness, we note that the only freedom we have in choosing B_sc' is that the parents of x_i can be shuffled with one another, and nodes following x_i in the ordering can be shuffled with one another. The Jacobian of the change of variables from Θ_U to Θ_Bsc' has the same terms in Θ_ij regardless of our choice.
Consistency and the BDe Metric
In our procedure for generating priors, we cannot use an arbitrary density ρ(Θ_U|B_sc^h, ξ). In our two-variable domain, for example, suppose we use the density

ρ(θ_xy, θ_xȳ, θ_x̄y|B^h_x→y, ξ) = c (θ_xy + θ_xȳ)(θ_x̄y + θ_x̄ȳ) = c θ_x (1 - θ_x)

where c is a normalization constant. Then, using the change of variables described above, we obtain

ρ(θ_x, θ_y|x, θ_y|x̄|B^h_x→y, ξ) = c θ_x^2 (1 - θ_x)^2

for the network structure x → y, which satisfies parameter independence and the Dirichlet assumption. For the network structure y → x, however, we have

ρ(θ_y, θ_x|y, θ_x|ȳ|B^h_x←y, ξ) = c θ_y (1 - θ_y) θ_x (1 - θ_x)
  = c θ_y (1 - θ_y) [θ_y θ_x|y + (1 - θ_y) θ_x|ȳ][1 - θ_y θ_x|y - (1 - θ_y) θ_x|ȳ]

This density satisfies neither parameter independence nor the Dirichlet assumption.

In general, if we do not choose ρ(Θ_U|B_sc^h, ξ) carefully, we may not satisfy both parameter independence and the Dirichlet assumption. Indeed, the question arises: Is there any
choice for ρ(Θ_U|B_sc^h, ξ) that is consistent with these assumptions? The following theorem and corollary answer this question in the affirmative. In the remainder of this section, we require additional notation. We use θ_{X=kX|Y=kY} to denote the multinomial parameter corresponding to the probability p(X = k_X|Y = k_Y, ξ); X and Y may be single variables, and k_X and k_Y are often implicit. Also, we use Θ_{X|Y=kY} to denote the set of multinomial parameters corresponding to the probability distribution p(X|Y = k_Y, ξ), and Θ_{X|Y} to denote the parameters Θ_{X|Y=kY} for all instances of k_Y. When Y is empty, we omit the conditioning bar.

Theorem 3 Given a domain U = {x_1, ..., x_n} with multinomial parameters Θ_U, if the density ρ(Θ_U|ξ) is Dirichlet, that is, if

ρ(Θ_U|ξ) = c ∏_{x_1,...,x_n} θ_{x_1,...,x_n}^{N'_{x_1,...,x_n} - 1}

then, for any complete network structure B_sc in U, the density ρ(Θ_Bsc|ξ) is Dirichlet and satisfies parameter independence. In particular,

ρ(Θ_Bsc|ξ) = c ∏_{i=1}^{n} ∏_{x_1,...,x_i} θ_{x_i|x_1,...,x_{i-1}}^{N'_{x_i|x_1,...,x_{i-1}} - 1}

where

N'_{x_i|x_1,...,x_{i-1}} = Σ_{x_{i+1},...,x_n} N'_{x_1,...,x_n}
Proof: Let B_sc be any complete network structure for U. Reorder the variables in U so that the ordering matches this structure, and relabel the variables x_1, ..., x_n. Now, change variables from Θ_{x_1,...,x_n} to Θ_Bsc using the Jacobian given in the Appendix. The dimension of this transformation is (∏_{i=1}^{n} r_i) - 1, where r_i is the number of states of x_i. Substituting the relationship θ_{x_1,...,x_n} = ∏_{i=1}^{n} θ_{x_i|x_1,...,x_{i-1}}, and multiplying by the Jacobian, we obtain

ρ(Θ_Bsc|ξ) = c ∏_{x_1,...,x_n} [∏_{i=1}^{n} θ_{x_i|x_1,...,x_{i-1}}]^{N'_{x_1,...,x_n} - 1} ∏_{i=1}^{n} ∏_{x_1,...,x_i} θ_{x_i|x_1,...,x_{i-1}}^{(∏_{j=i+1}^{n} r_j) - 1}

which implies the factored form given in the theorem. Collecting the powers of θ_{x_i|x_1,...,x_{i-1}}, and using ∏_{j=i+1}^{n} r_j = Σ_{x_{i+1},...,x_n} 1, we obtain the expression for the exponents N'_{x_i|x_1,...,x_{i-1}}.
Corollary 1 Let U be a domain with multinomial parameters Θ_U, and let B_sc be a complete network structure for U such that p(B_sc^h|ξ) > 0. If ρ(Θ_U|B_sc^h, ξ) is Dirichlet, then ρ(Θ_Bsc|B_sc^h, ξ) is Dirichlet and satisfies parameter independence.
Given these results, we can compute the Dirichlet exponents N'_ijk using a Dirichlet distribution for ρ(Θ_U|B_sc^h, ξ) in conjunction with our method for constructing priors described in Theorem 2. Namely, suppose we desire the exponent N'_ijk for a network structure where x_i has parents Π_i. Let B_sc' be a complete network structure where x_i has these parents. By likelihood equivalence, we have ρ(Θ_U|B_sc'^h, ξ) = ρ(Θ_U|B_sc^h, ξ). As we discussed in the Background section, we may write the exponents for ρ(Θ_U|B_sc^h, ξ) as follows:

N'_{x_1,...,x_n} = N' p(x_1, ..., x_n|B_sc^h, ξ)

where N' is the user's equivalent sample size for ρ(Θ_U|B_sc^h, ξ). Furthermore, by definition, N'_ijk is the Dirichlet exponent for θ_ijk in B_sc'. Consequently, from Theorem 3 and the equation above, we have

N'_ijk = N' p(x_i = k, Π_i = j|B_sc^h, ξ)

We call the BD metric with this restriction on the exponents the BDe metric ("e" for likelihood equivalence). To summarize, we have the following theorem.
Theorem 4 (BDe Metric) Given domain U, suppose that ρ(Θ_U|B_sc^h, ξ) is Dirichlet with equivalent sample size N', for some complete network structure B_sc in U. Then, for any network structure B_s in U, Assumptions 1 through 3 and 5 through 7 imply

p(D, B_s^h|ξ) = p(B_s^h|ξ) ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(N'_ij) / Γ(N'_ij + N_ij)] ∏_{k=1}^{r_i} [Γ(N'_ijk + N_ijk) / Γ(N'_ijk)]

where

N'_ijk = N' p(x_i = k, Π_i = j|B_sc^h, ξ)
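In practice, the joint probabilities p(x_i = k, Π_i = j|B_sc^h, ξ) in this theorem are read off the user's prior network. The sketch below computes the BDe exponents for the two-variable domain from an assumed prior-network joint distribution and an assumed equivalent sample size; the specific numbers are illustrative only.

    # BDe exponents N'_ijk = N' * p(x_i = k, Pi_i = j | prior network), here for
    # the two-binary-variable structure x -> y, with assumed values N' = 4 and a
    # uniform prior joint.

    prior_joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}   # assumed p(x, y)
    n_prime = 4.0

    def bde_exponent(xi, j, k):
        # j is the parent configuration tuple, k the state of xi.
        if xi == "x":                       # x has no parents, so j == ()
            return n_prime * sum(p for (x, _), p in prior_joint.items() if x == k)
        return n_prime * prior_joint[(j[0], k)]    # xi == "y", whose parent is x

    print(bde_exponent("x", (), 1))         # 4 * p(x = 1) = 2.0
    print(bde_exponent("y", (1,), 0))       # 4 * p(x = 1, y = 0) = 1.0
    # This function can be passed as the `exponent` argument of log_bd_score above.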
The preceding results show that parameter independence, likelihood equivalence, structure possibility, and the Dirichlet assumption are consistent for complete network structures. Nonetheless, these assumptions and the assumption of parameter modularity may not be consistent for all network structures. To understand the potential for inconsistency, note that we obtained the BDe metric for all network structures using likelihood equivalence applied only to complete network structures, in combination with the other assumptions. Thus, it could be that the BDe metric for incomplete network structures is not likelihood equivalent. Nonetheless, the following theorem shows that the BDe metric is likelihood equivalent for all network structures; that is, given the other assumptions, likelihood equivalence for incomplete structures is implied by likelihood equivalence for complete network structures. Consequently, our assumptions are consistent.

Theorem 5 For all domains U and all network structures B_s in U, the BDe metric is likelihood equivalent.
Proof: Given a database D, equivalent sample size N', joint probability distribution p(U|B_sc^h, ξ), and a subset X of U, consider the following function of X:

l(X) = ∏_{k_X} Γ(N' p(X = k_X|B_sc^h, ξ) + N_{k_X}) / Γ(N' p(X = k_X|B_sc^h, ξ))

where k_X is an instance of X, and N_{k_X} is the number of cases in D in which X = k_X. Then, the likelihood term of the BDe metric becomes

p(D|B_s^h, ξ) = ∏_{i=1}^{n} l({x_i} ∪ Π_i) / l(Π_i)

Now, by Theorem 1, we know that a network structure can be transformed into an equivalent structure by a series of arc reversals. Thus, we can demonstrate that the BDe metric satisfies likelihood equivalence in general if we can do so for the case where two equivalent structures differ by a single arc reversal. So, let B_s1 and B_s2 be two equivalent network structures that differ only in the direction of the arc between x_i and x_j (say x_i → x_j in B_s1). Let R be the set of parents of x_i in B_s1. By Theorem 1, we know that R ∪ {x_i} is the set of parents of x_j in B_s1, R is the set of parents of x_j in B_s2, and R ∪ {x_j} is the set of parents of x_i in B_s2. Because the two structures differ only in the reversal of a single arc, the only terms in the product above that can differ are those involving x_i and x_j. For B_s1, these terms are

[l({x_i} ∪ R) / l(R)] · [l({x_i, x_j} ∪ R) / l({x_i} ∪ R)]

whereas for B_s2, they are

[l({x_j} ∪ R) / l(R)] · [l({x_i, x_j} ∪ R) / l({x_j} ∪ R)]

These terms are equal, and hence p(D|B_s1^h, ξ) = p(D|B_s2^h, ξ).
We note that Buntine's metric is a special case of the BDe metric in which every instance of the joint space, conditioned on B_sc^h, is equally likely. We call this special case the BDeu metric ("u" for uniform joint distribution). Buntine noted that this metric satisfies the property of likelihood equivalence.
The Prior Network
To calculate the terms in the BDe metric, or to construct informative priors for a more general metric that can handle missing data, we need priors on network structures, p(B_s^h|ξ), and the Dirichlet distribution ρ(Θ_U|B_sc^h, ξ). Later in the paper, we provide a simple method for assessing priors on network structures. Here, we concentrate on the assessment of the Dirichlet distribution for Θ_U.
Recall from our earlier discussion that we can assess this distribution by assessing a single equivalent sample size N' for the domain and the joint distribution of the domain for the next case to be seen, p(U|B_sc^h, ξ), where both assessments are conditioned on the state of information B_sc^h, ξ. As we have discussed, the assessment of equivalent sample size is straightforward. Furthermore, a user can assess p(U|B_sc^h, ξ) by building a Bayesian network for U given B_sc^h. We call this network the user's prior network.

The unusual aspect of this assessment is the conditioning hypothesis B_sc^h. Whether we are dealing with acausal or causal Bayesian networks, this hypothesis includes the assertion that there are no independencies in the long run. Thus, at first glance, there seems to be a contradiction in asking the user to construct a prior network, which may contain assertions of independence, under the assertion that B_sc^h is true. Nonetheless, there is no contradiction, because the assertions of independence in the prior network refer to independencies in the next case to be seen, whereas the assertion of full dependence B_sc^h refers to the long run.
To help illustrate this point, let us consider the following acausal example. Suppose a person repeatedly rolls a four-sided die with labels 1, 2, 3, and 4. In addition, suppose that he repeatedly does one of the following: (1) rolls the die once, and reports x = true iff the die lands 1 or 2, and y = true iff the die lands 1 or 3; or (2) rolls the die twice, and reports x = true iff the die lands 1 or 2 on the first roll, and reports y = true iff the die lands 1 or 2 on the second roll. In either case, the multinomial assumption is reasonable. Furthermore, condition 2 corresponds to the hypothesis B^h_{x⊥y}: x and y are independent in the long run; whereas condition 1 corresponds to the hypothesis B^h_{x→y} = B^h_{y→x}: x and y are dependent in the long run. Also, given these correspondences, parameter modularity and likelihood equivalence are reasonable. Finally, let us suppose that the parameters of the multinomial sample have a Dirichlet distribution, so that parameter independence holds. Thus, this example fits the assumptions of our learning approach. Now, if we have no reason to prefer one outcome of the die to another on the next roll, then we will have p(y | x, B^h_{x→y}, ξ) = p(y | B^h_{x→y}, ξ). That is, our prior network will contain no arc between x and y, even though, given B^h_{x→y}, x and y are almost certainly dependent in the long run.
We expect that most users would prefer to construct a prior network without having to condition on B^h_{sc}. In the previous example, it is possible to ignore the conditioning hypothesis, because p(U | B^h_{x→y}, ξ) = p(U | B^h_{x⊥y}, ξ) = p(U | ξ). In general, however, a user cannot ignore this hypothesis. In our four-sided die example, the joint distributions p(U | B^h_{x→y}, ξ) and p(U | B^h_{x⊥y}, ξ) would have been different had we not been indifferent about the die outcomes.
(Actually, as we have discussed, B^h_{x→y} includes the possibility that x and y are independent, but only with a probability of measure zero.)
We have had little experience with training people to condition on B_{sc}^h when constructing a prior network. Nonetheless, stories like the four-sided die may help users make the necessary distinction for assessment.
A Simple Example
Consider again our two-binary-variable domain. Let B_{x→y} and B_{y→x} denote the network structures where x points to y and y points to x, respectively. Suppose that the user has assessed an equivalent sample size N' and a prior network giving the joint distribution p(x, y | B^h_{sc}, ξ), p(x, ȳ | B^h_{sc}, ξ), p(x̄, y | B^h_{sc}, ξ), and p(x̄, ȳ | B^h_{sc}, ξ). Also suppose we observe two cases: C_1 = {x = true, y = true} and C_2 = {x = true, y = false}. Let i = 1, 2 refer to variables x and y, respectively, and let k = 1, 2 denote the true and false states of a variable. Thus, for the network structure x → y, we have the Dirichlet exponents

N'_{111} = N' p(x | B^h_{sc}, ξ)    N'_{112} = N' p(x̄ | B^h_{sc}, ξ)
N'_{211} = N' p(y, x | B^h_{sc}, ξ)    N'_{212} = N' p(ȳ, x | B^h_{sc}, ξ)
N'_{221} = N' p(y, x̄ | B^h_{sc}, ξ)    N'_{222} = N' p(ȳ, x̄ | B^h_{sc}, ξ)

and the sufficient statistics N_{111} = 2, N_{112} = 0, N_{211} = 1, N_{212} = 1, N_{221} = 0, and N_{222} = 0. Substituting these quantities into Equation we obtain p(D | B^h_{x→y}, ξ).

For the network structure y → x, we have the Dirichlet exponents

N'_{111} = N' p(x, y | B^h_{sc}, ξ)    N'_{112} = N' p(x̄, y | B^h_{sc}, ξ)
N'_{121} = N' p(x, ȳ | B^h_{sc}, ξ)    N'_{122} = N' p(x̄, ȳ | B^h_{sc}, ξ)
N'_{211} = N' p(y | B^h_{sc}, ξ)    N'_{212} = N' p(ȳ | B^h_{sc}, ξ)

and the sufficient statistics N_{111} = 1, N_{112} = 0, N_{121} = 1, N_{122} = 0, N_{211} = 1, and N_{212} = 1. Substituting these quantities into Equation we obtain p(D | B^h_{y→x}, ξ).
As required, the BDe metric exhibits the property of likelihood equivalence: the two values are identical.
In contrast, the K2 metric (all N'_{ijk} = 1) does not satisfy this property. In particular, given the same database, we have

p(D | B^h_{x→y}, ξ) = [1!/3!] (2! 0!) · [1!/3!] (1! 1!) · [1!/1!] (0! 0!) = 1/18

p(D | B^h_{y→x}, ξ) = [1!/3!] (1! 1!) · [1!/2!] (1! 0!) · [1!/2!] (1! 0!) = 1/24

so the two equivalent structures receive different scores.
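The example can be checked mechanically. The following self-contained sketch (ours, with a hypothetical equivalent sample size and prior joint rather than any values used in the text) scores both two-variable structures with the BDe exponents and again with all exponents set to one (the K2 metric); the BDe values coincide, whereas the K2 values are 1/18 and 1/24.

```python
# Self-contained check of the two-variable example (illustrative numbers only:
# the equivalent sample size and prior joint below are hypothetical).
from math import lgamma, exp

cases = [(True, True), (True, False)]               # C1 = {x, y}, C2 = {x, not-y}
joint = {(True, True): 0.4, (True, False): 0.2,     # p(x, y | Bsc^h, xi)
         (False, True): 0.25, (False, False): 0.15}
n_prime = 4.0

def log_local(counts, exponents):
    """One family's contribution to the BD metric (Gamma-ratio form)."""
    total = 0.0
    for j in counts:
        a_j, n_j = sum(exponents[j].values()), sum(counts[j].values())
        total += lgamma(a_j) - lgamma(a_j + n_j)
        for k in counts[j]:
            total += lgamma(exponents[j][k] + counts[j][k]) - lgamma(exponents[j][k])
    return total

def score(child_first, uniform_one=False):
    """Likelihood for x->y (child_first=False) or y->x (child_first=True).
    uniform_one=True replaces the BDe exponents by 1 (the K2 metric)."""
    idx = (lambda c: (c[1], c[0])) if child_first else (lambda c: c)
    data = [idx(c) for c in cases]                  # (parentless variable, child)
    pj = {v: sum(p for (a, b), p in ((idx(k), joint[k]) for k in joint) if a == v)
          for v in (True, False)}                   # marginal of the parentless variable
    root_counts = {(): {v: sum(1 for a, _ in data if a == v) for v in (True, False)}}
    root_exp = {(): {v: 1.0 if uniform_one else n_prime * pj[v] for v in (True, False)}}
    child_counts = {v: {w: sum(1 for a, b in data if a == v and b == w)
                        for w in (True, False)} for v in (True, False)}
    child_exp = {v: {w: 1.0 if uniform_one else
                     n_prime * (joint[(w, v)] if child_first else joint[(v, w)])
                     for w in (True, False)} for v in (True, False)}
    return log_local(root_counts, root_exp) + log_local(child_counts, child_exp)

print(exp(score(False)), exp(score(True)))              # BDe: identical values
print(exp(score(False, True)), exp(score(True, True)))  # K2: 1/18 versus 1/24
```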
Elimination of the Dirichlet Assumption
In Section we saw that, when ρ(Θ_U | B_{sc}^h, ξ) is Dirichlet, then ρ(Θ_{B_{sc}} | B_{sc}^h, ξ) is consistent with parameter independence, the Dirichlet assumption, likelihood equivalence, and structure possibility. Therefore, it is natural to ask whether there are any other choices for ρ(Θ_U | B_{sc}^h, ξ) that are similarly consistent. Actually, because the Dirichlet assumption is so strong, it is more fitting to ask whether there are any other choices for ρ(Θ_U | B_{sc}^h, ξ) that are
consistent with all but the Dirichlet assumption. In this section we show that, if each density function is positive (i.e., the range of each function includes only numbers greater than zero), then a Dirichlet distribution for ρ(Θ_U | B_{sc}^h, ξ) is the only consistent choice. Consequently, we show that under these conditions the BDe metric follows without the Dirichlet assumption.
First, let us examine this question for our two-binary-variable domain. Combining Equations and for the network structure x → y, the corresponding equations for the network structure y → x, likelihood equivalence, and structure possibility, we obtain

ρ(θ_x, θ_{y|x}, θ_{y|x̄} | B^h_{x→y}, ξ) = [ θ_x (1 − θ_x) / (θ_y (1 − θ_y)) ] ρ(θ_y, θ_{x|y}, θ_{x|ȳ} | B^h_{y→x}, ξ)

where

θ_y = θ_x θ_{y|x} + (1 − θ_x) θ_{y|x̄}
θ_{x|y} = θ_x θ_{y|x} / [ θ_x θ_{y|x} + (1 − θ_x) θ_{y|x̄} ]
θ_{x|ȳ} = θ_x (1 − θ_{y|x}) / [ θ_x (1 − θ_{y|x}) + (1 − θ_x)(1 − θ_{y|x̄}) ]

Applying parameter independence to both sides of Equation we get

f_x(θ_x) f_{y|x}(θ_{y|x}) f_{y|x̄}(θ_{y|x̄}) = [ θ_x (1 − θ_x) / (θ_y (1 − θ_y)) ] f_y(θ_y) f_{x|y}(θ_{x|y}) f_{x|ȳ}(θ_{x|ȳ})

where f_x, f_{y|x}, f_{y|x̄}, f_y, f_{x|y}, and f_{x|ȳ} are unknown density functions. Equations and define a functional equation. Methods for solving such equations have been well studied (see, e.g., Aczel). In our case, Geiger and Heckerman show that, if each function is positive, then the only solution to Equations and is for ρ(θ_{xy}, θ_{xȳ}, θ_{x̄y} | B^h_{x→y}, ξ) to be a Dirichlet distribution. In fact, they show that even when x and/or y have more than two states, the only solution consistent with likelihood equivalence is the Dirichlet.
Theorem (Geiger and Heckerman)  Let θ_{xy}, (θ_x, θ_{y|x}), and (θ_y, θ_{x|y}) be positive multinomial parameters related by the rules of probability. If

f_x(θ_x) ∏_{k=1}^{r_x} f_{y|x=k}(θ_{y|x=k}) = [ ∏_{k=1}^{r_x} θ_{x=k}^{r_y − 1} / ∏_{l=1}^{r_y} θ_{y=l}^{r_x − 1} ] f_y(θ_y) ∏_{l=1}^{r_y} f_{x|y=l}(θ_{x|y=l})

where each function is a positive probability density function, then ρ(θ_{xy} | ξ) is Dirichlet.
This result for two variables is easily generalized to the n-variable case, as we now demonstrate.
Theorem  Let B_{sc1} and B_{sc2} be two complete network structures for U with variable orderings (x_1, ..., x_n) and (x_n, x_1, ..., x_{n−1}), respectively. If both structures have positive multinomial parameters that obey

ρ(Θ_{B_{sc}} | ξ) = J_{B_{sc}} ρ(Θ_U | ξ)

and positive densities ρ(Θ_{B_{sc}} | ξ) that satisfy parameter independence, then ρ(Θ_U | ξ) is Dirichlet.

Proof: The theorem is trivial for domains with one variable (n = 1), and is proved by Theorem for n = 2. When n > 2, first consider the complete network structure B_{sc1}. Clustering the variables X = {x_1, ..., x_{n−1}} into a single discrete variable with q = ∏_{i=1}^{n−1} r_i states, we obtain the network structure X → x_n, with multinomial parameters Θ_X and Θ_{x_n|X} given by

θ_X = ∏_{i=1}^{n−1} θ_{x_i | x_1, ..., x_{i−1}}
θ_{x_n|X} = θ_{x_n | x_1, ..., x_{n−1}}

By assumption, the parameters of B_{sc1} satisfy parameter independence. Thus, when we change variables from Θ_{B_{sc1}} to {Θ_X, Θ_{x_n|X}} using the Jacobian given by Theorem we find that the parameters for X → x_n also satisfy parameter independence. Now consider the complete network structure B_{sc2}. With the same variable cluster, we obtain the network structure x_n → X, with parameters θ_{x_n} as in the original network structure and Θ_{X|x_n} given by

θ_{X|x_n} = ∏_{i=1}^{n−1} θ_{x_i | x_n, x_1, ..., x_{i−1}}

By assumption, the parameters of B_{sc2} satisfy parameter independence. Thus, when we change variables from Θ_{B_{sc2}} to {θ_{x_n}, Θ_{X|x_n}}, computing a Jacobian for each state of x_n, we find that the parameters for x_n → X again satisfy parameter independence. Finally, these changes of variable, in conjunction with Equation, imply Equation. Consequently, by Theorem, ρ(Θ_X, θ_{x_n} | B_{sc}^h, ξ) = ρ(Θ_U | B_{sc}^h, ξ) is Dirichlet.
Thus, we obtain the BDe metric without the Dirichlet assumption.
Theorem  Assumptions through (excluding the Dirichlet assumption), Assumption, and the assumption that parameter densities are positive imply the BDe metric (Equations and ).
Proof: Given parameter independence, likelihood equivalence, structure possibility, and positive densities, we have from Theorem that ρ(Θ_U | B_{sc}^h, ξ) is Dirichlet. Thus, from Theorem, we obtain the BDe metric.
The assumption that parameters are positive is important. For example, given a domain consisting of only logical relationships, we can have parameter independence, likelihood equivalence, and structure possibility, and yet ρ(Θ_U | B_{sc}^h, ξ) will not be Dirichlet.
Limitations of Parameter Independence and Likelihood Equivalence
There is a simple characterization of the assumption of parameter independence. Recall the property of posterior parameter independence, which says that parameters remain independent as long as complete cases are observed. Thus, suppose we have an uninformative Dirichlet prior for the joint-space parameters (all exponents very close to zero), which satisfies parameter independence. Then, if we observe one or more complete cases, our posterior will also satisfy parameter independence. In contrast, suppose we have the same uninformative prior and observe one or more incomplete cases. Then our posterior will not be a Dirichlet distribution (in fact, it will be a linear combination of Dirichlet distributions) and will not satisfy parameter independence. In this sense, the assumption of parameter independence corresponds to the assumption that one's knowledge is equivalent to having seen only complete cases.
When learning causal Bayesian networks, there is a similar characterization of the assumption of likelihood equivalence. (Recall that, when learning acausal networks, the assumption must hold.) Namely, until now, we have considered only observational data: data obtained without intervention. Nonetheless, in many real-world studies, we obtain experimental data: data obtained by intervention, for example, by randomizing subjects into control and experimental groups. Although we have not developed the concepts in this paper to demonstrate the assertion, it turns out that if we start with the uninformative Dirichlet prior, which satisfies likelihood equivalence, then the posterior will satisfy likelihood equivalence if and only if we see no experimental data. In this sense, when learning causal Bayesian networks, the assumption of likelihood equivalence corresponds to the assumption that one's knowledge is equivalent to having seen only nonexperimental data.
In light of these characterizations, we see that the assumptions of parameter independence and likelihood equivalence are unreasonable in many domains. For example, if we learn about a portion of a domain by reading, or through word of mouth, or simply by applying common sense, then these assumptions should be suspect. In these situations, our methodology for determining an informative prior from a prior network and a single equivalent sample size is too simple. (These characterizations of parameter independence and likelihood equivalence in the context of causal networks are simplified for this presentation; Heckerman provides more detailed characterizations.)
To relax one or both of these assumptions when they are unreasonable, we can use
an equivalent database in place of an equivalent sample size. Namely, we ask a user to imagine that he was initially completely ignorant about a domain, having an uninformative Dirichlet prior. Then we ask the user to specify a database D_e that would produce a posterior density that reflects his current state of knowledge. This database may contain incomplete cases and/or experimental data. Then, to score a real database D, we score the database D_e ∪ D using the uninformative prior and a learning algorithm that handles missing and experimental data, such as Gibbs sampling.
It remains to be determined whether this approach is practical. Needed is a compact representation for specifying equivalent databases that allows a user to accurately reflect his current knowledge. One possibility is to allow a user to specify a prior Bayesian network along with equivalent sample sizes, both experimental and nonexperimental, for each variable. Then one could repeatedly sample equivalent databases from the prior network that satisfy these sample-size constraints, compute desired quantities (such as a scoring metric) from each equivalent database, and then average the results.
SDLC suggest a different method for accommodating nonuniform equivalent sample sizes. Their method produces Dirichlet priors that satisfy parameter independence but not likelihood equivalence.
Priors for Network Structures
To complete the information needed to derive a Bayesian metric, the user must assess the prior probabilities of the network structures. Although these assessments are logically independent of the assessment of the prior network, structures that closely resemble the prior network will tend to have higher prior probabilities. Here, we propose the following parametric formula for p(B_s^h | ξ) that makes use of the prior network.
Let δ_i denote the number of nodes in the symmetric difference of Π_i(B_s) and Π_i(P), where P is the prior network structure: (Π_i(B_s) ∪ Π_i(P)) \ (Π_i(B_s) ∩ Π_i(P)). Then B_s and the prior network differ by δ = ∑_{i=1}^{n} δ_i arcs, and we penalize B_s by a constant factor 0 < κ ≤ 1 for each such arc. That is, we set

p(B_s^h | ξ) = c κ^δ

where c is a normalization constant, which we can ignore when computing relative posterior probabilities. This formula is simple, as it requires only the assessment of a single constant κ. Nonetheless, we can imagine generalizing the formula by punishing different arc differences with different weights, as suggested by Buntine. Furthermore, it may be more reasonable to use a prior network constructed without conditioning on B_{sc}^h.
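As a concrete illustration (ours, with a hypothetical value of κ and illustrative names), the prior above reduces to counting parent-set differences between the candidate structure and the prior network:

```python
# A minimal sketch of the structure prior p(Bs^h | xi) = c * kappa**delta.
# kappa = 0.9 and the example networks below are hypothetical.
from math import log

def log_structure_prior(parents_bs, parents_prior, kappa=0.9, log_c=0.0):
    """parents_* map each variable to its parent set; kappa lies in (0, 1]."""
    delta = 0
    for x in parents_bs:
        bs, pr = set(parents_bs[x]), set(parents_prior.get(x, ()))
        delta += len(bs.symmetric_difference(pr))     # delta_i for node x
    return log_c + delta * log(kappa)                 # log of c * kappa**delta

# Example: prior network x -> y -> z; candidate structure x -> y, x -> z.
prior_net = {"x": (), "y": ("x",), "z": ("y",)}
candidate = {"x": (), "y": ("x",), "z": ("x",)}
print(log_structure_prior(candidate, prior_net))      # delta = 2 arc differences
```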
We note that this parametric form satisfies prior equivalence only when the prior network contains no arcs. Consequently, because the priors on network structures for acausal
networks must satisfy prior equivalence, we should not use this parameterization for acausal networks.
Search Methods
In this section, we examine methods for finding network structures with high posterior probabilities. Although our methods are presented in the context of Bayesian scoring metrics, they may be used in conjunction with other non-Bayesian metrics as well. Also, we note that researchers have proposed network-selection criteria other than relative posterior probability (e.g., Madigan and Raftery), which we do not consider here.
Many search methods for learning network structure (including those that we describe) make use of a property of scoring metrics that we call decomposability. Given a network structure for domain U, we say that a measure on that structure is decomposable if it can be written as a product of measures, each of which is a function only of one node and its parents. From Equation we see that the likelihood p(D | B_s^h, ξ) given by the BD metric is decomposable. Consequently, if the prior probabilities of network structures are decomposable, as is the case for the priors given by Equation, then the BD metric will be decomposable. Thus, we can write

p(D, B_s^h | ξ) = ∏_{i=1}^{n} s(x_i | Π_i)

where s(x_i | Π_i) is a function only of x_i and its parents. Given a decomposable metric, we can compare the score for two network structures that differ by the addition or deletion of arcs pointing to x_i by computing only the term s(x_i | Π_i) for both structures. We note that most known Bayesian and non-Bayesian metrics are decomposable.
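The practical payoff of decomposability is that a single arc change requires re-evaluating only one or two terms. A minimal sketch (ours), assuming a callback family_score(x, parents, data) that returns log s(x_i | Π_i) for whatever decomposable metric is in use:

```python
# Sketch of how decomposability is exploited; family_score is an assumed
# callback standing in for log s(x_i | Pi_i) of any decomposable metric.
def total_score(parent_sets, data, family_score):
    """Log of a decomposable metric: the sum over nodes of log s(x_i | Pi_i)."""
    return sum(family_score(x, pa, data) for x, pa in parent_sets.items())

def rescore_after_parent_change(old_total, x, old_pa, new_pa, data, family_score):
    """Only the family whose parent set changed needs to be re-evaluated."""
    return old_total - family_score(x, old_pa, data) + family_score(x, new_pa, data)
```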
Special-Case Polynomial Algorithms
We first consider the special case of finding the l network structures with the highest score among all structures in which every node has at most one parent.
For each arc x_j → x_i (including cases where x_j is null), we associate a weight w(x_i, x_j) = log s(x_i | x_j) − log s(x_i | ∅). From Equation we have

log p(D, B_s^h | ξ) = ∑_{i=1}^{n} log s(x_i | Π_i) = ∑_{i=1}^{n} w(x_i, Π_i) + ∑_{i=1}^{n} log s(x_i | ∅)
where Π_i is the (possibly null) parent of x_i. The last term in Equation is the same for all network structures. Thus, among the network structures in which each node has at most one parent, ranking network structures by the sum of weights ∑_{i=1}^{n} w(x_i, Π_i) or by score has the same result.
Finding the network structure with the highest weight (l = 1) is a special case of the well-known problem of finding maximum branchings, described, for example, in Evans and Minieka. The problem is defined as follows. A tree-like network is a connected directed acyclic graph in which no two edges are directed into the same node. The root of a tree-like network is the unique node that has no edges directed into it. A branching is a directed forest that consists of disjoint tree-like networks. A spanning branching is any branching that includes all nodes in the graph. A maximum branching is any spanning branching that maximizes the sum of arc weights (in our case, ∑_{i=1}^{n} w(x_i, Π_i)). An efficient polynomial algorithm for finding a maximum branching was first described by Edmonds, later explored by Karp, and made more efficient by Tarjan and Gabow et al. The general case (l > 1) was treated by Camerini et al.
These algorithms can be used to find the l branchings with the highest weights, regardless of the metric we use, as long as one can associate a weight with every edge. Therefore, this approach is appropriate for any decomposable metric. When using metrics that are score equivalent (i.e., both prior and likelihood equivalent), however, we have

s(x_i | x_j) s(x_j | ∅) = s(x_j | x_i) s(x_i | ∅)

Thus, for any two edges x_i → x_j and x_j → x_i, the weights w(x_i, x_j) and w(x_j, x_i) are equal. Consequently, the directionality of the arcs plays no role for score-equivalent metrics, and the problem reduces to finding the l undirected forests for which ∑ w(x_i, x_j) is a maximum.
For the case l = 1, we can apply a maximum spanning tree algorithm (with arc weights w(x_i, x_j)) to identify an undirected forest F having the highest score. The set of network structures that are formed from F by adding any directionality to the arcs of F, such that the resulting network is a branching, yields a collection of equivalent network structures, each having the same maximal score. This algorithm is identical to the tree-learning algorithm described by Chow and Liu, except that we use a score-equivalent Bayesian metric rather than the mutual-information metric. For the general case (l > 1), we can use the algorithm of Gabow to identify the l undirected forests having the highest score, and then determine the l equivalence classes of network structures with the highest score.
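For a score-equivalent metric, the l = 1 case therefore amounts to building a maximum-weight undirected forest and then orienting it. The following sketch is ours (a Kruskal-style construction with illustrative names), not the report's implementation:

```python
# Sketch of the l = 1 special case for a score-equivalent metric.
def max_weight_forest(nodes, weight):
    """nodes: list of variables; weight(a, b) = log s(a|b) - log s(a|empty),
    assumed symmetric for a score-equivalent metric."""
    parent = {v: v for v in nodes}                  # union-find representatives
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    edges = sorted(((weight(a, b), a, b) for i, a in enumerate(nodes)
                    for b in nodes[i + 1:]), reverse=True)
    forest = []
    for w, a, b in edges:
        if w <= 0:                                  # worse than having no parent
            continue
        ra, rb = find(a), find(b)
        if ra != rb:                                # keep the forest acyclic
            parent[ra] = rb
            forest.append((a, b))
    return forest

def orient_forest(nodes, forest):
    """Direct each tree away from an arbitrary root, yielding one member of the
    equivalence class of highest-scoring branchings."""
    adj = {v: [] for v in nodes}
    for a, b in forest:
        adj[a].append(b)
        adj[b].append(a)
    parents, seen = {v: None for v in nodes}, set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [root]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    parents[v] = u                  # arc u -> v
                    seen.add(v)
                    stack.append(v)
    return parents
```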
Heuristic Search
A generalization of the problem described in the previous section is to find the l best networks from the set of all networks in which each node has no more than k parents. Unfortunately, even when l = 1, the problem for k > 1 is NP-hard. In particular, let us consider the following decision problem, which corresponds to our optimization problem with l = 1:

k-LEARN
INSTANCE: Set of variables U, database D = {C_1, ..., C_m}, where each C_i is an instance of all variables in U, scoring metric M(D, B_s), and real value p.
QUESTION: Does there exist a network structure B_s, defined over the variables in U, where each node in B_s has at most k parents, such that M(D, B_s) ≥ p?

Höffgen shows that a similar problem for PAC learning is NP-complete. His results can be translated easily to show that k-LEARN is NP-complete for k > 1 when the BD metric is used. Chickering et al. show that k-LEARN is NP-complete even when we use the likelihood-equivalent BDe metric and the constraint of prior equivalence.
Therefore, it is appropriate to use heuristic search algorithms for the general case k > 1. In this section, we review several such algorithms.
As is the case with essentially all search methods, the methods that we examine have two components: an initialization phase and a search phase. For example, let us consider the K2 search method (not to be confused with the K2 metric) described by CH. The initialization phase consists of choosing an ordering over the variables in U. In the search phase, for each node x_i in the ordering provided, the node from {x_1, ..., x_{i−1}} that most increases the network score is added to the parent set Π_i of x_i, until no node increases the score or the size of Π_i exceeds a predetermined constant.
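A sketch of this search phase (ours), again assuming the hypothetical family_score callback for log s(x_i | Π_i):

```python
# Sketch of a K2-style greedy search phase over a fixed variable ordering.
def k2_search(ordering, data, family_score, max_parents):
    parents = {x: [] for x in ordering}
    for i, x in enumerate(ordering):
        best = family_score(x, tuple(parents[x]), data)
        improved = True
        while improved and len(parents[x]) < max_parents:
            improved = False
            candidates = [z for z in ordering[:i] if z not in parents[x]]
            scored = [(family_score(x, tuple(parents[x] + [z]), data), z)
                      for z in candidates]
            if scored:
                top, z = max(scored, key=lambda t: t[0])
                if top > best:                      # add the best single parent
                    best, improved = top, True
                    parents[x].append(z)
    return parents
```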
The search algorithms we consider make successive arc changes to the network, and employ the property of decomposability to evaluate the merit of each change. The possible changes that can be made are easy to identify. For any pair of variables, if there is an arc connecting them, then this arc can either be reversed or removed. If there is no arc connecting them, then an arc can be added in either direction. All changes are subject to the constraint that the resulting network contain no directed cycles. We use E to denote the set of eligible changes to a graph, and Δ(e) to denote the change in log score of the network resulting from the modification e ∈ E. Given a decomposable metric, if an arc to x_i is added or deleted, only s(x_i | Π_i) need be evaluated to determine Δ(e). If an arc between x_i and x_j is reversed, then only s(x_i | Π_i) and s(x_j | Π_j) need be evaluated.
One simple heuristic search algorithm is local search (Johnson). First, we choose a graph. Then, we evaluate Δ(e) for all e ∈ E, and make the change e for which Δ(e) is a maximum, provided it is positive. We terminate search when there is no e with a positive value for Δ(e). As we visit network structures, we retain the l of them with the highest overall score. Using decomposable metrics, we can avoid recomputing all terms Δ(e) after every change. In particular, if neither x_i, x_j, nor their parents are changed, then Δ(e) remains unchanged for all changes e involving these nodes, as long as the resulting network is acyclic. Candidates for the initial graph include the empty graph, a random graph, a graph determined by one of the polynomial algorithms described in the previous section, and the prior network.
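The following sketch (ours) implements this local search in a deliberately simple form: it proposes all single-arc additions, deletions, and reversals, filters out cyclic results, and re-scores candidates with the assumed family_score callback. For clarity it re-scores whole structures; an efficient implementation would cache the family terms and evaluate only Δ(e), as described above.

```python
# Sketch of greedy local search over single-arc changes.
import itertools

def is_acyclic(parents):
    seen, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in seen:
            return False                            # node on the current path: cycle
        seen.add(v)
        ok = all(visit(p) for p in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """All structures reachable by one arc addition, deletion, or reversal."""
    for x, y in itertools.permutations(parents, 2):
        if y in parents[x]:                         # arc y -> x exists
            without = tuple(p for p in parents[x] if p != y)
            yield {**parents, x: without}                                 # delete it
            yield {**parents, x: without, y: tuple(parents[y]) + (x,)}    # reverse it
        elif x not in parents[y]:                   # no arc either way: add y -> x
            yield {**parents, x: tuple(parents[x]) + (y,)}

def local_search(parents, data, family_score):
    def score(ps):
        return sum(family_score(v, ps[v], data) for v in ps)
    current, current_score = parents, score(parents)
    while True:
        best_score, best = max(((score(ps), ps) for ps in neighbors(current)
                                if is_acyclic(ps)),
                               key=lambda t: t[0],
                               default=(current_score, current))
        if best_score <= current_score:
            return current, current_score           # no positive Delta(e): local maximum
        current, current_score = best, best_score
```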
A potential problem with local search is getting stuck at a local maximum. Methods for avoiding local maxima include iterated hill-climbing and simulated annealing. In iterated hill-climbing, we apply local search until we hit a local maximum. Then, we randomly perturb the current network structure, and repeat the process for some manageable number of iterations. At all stages we retain the top l network structures.
In one variant of simulated annealing, described by Metropolis et al., we initialize the system to some temperature T_0. Then, we pick some eligible change e at random and evaluate the expression p = exp(Δ(e)/T). If p > 1, then we make the change e; otherwise, we make the change with probability p. We repeat this selection and evaluation process α times or until we make β changes. If we make no changes in α repetitions, then we stop searching. Otherwise, we lower the temperature by multiplying the current temperature T by a decay factor 0 < γ < 1, and continue the search process. We stop searching if we have lowered the temperature more than δ times. Thus, this algorithm is controlled by five parameters: T_0, α, β, γ, and δ. Throug
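A sketch of this annealing schedule (ours), with hypothetical default values for the five control parameters and reusing the hypothetical helpers from the previous sketches:

```python
# Sketch of the simulated-annealing variant described above. t0, alpha, beta,
# gamma, and delta play the roles of the five control parameters; the default
# values are illustrative only.
import math, random

def anneal(parents, data, family_score, neighbors, is_acyclic,
           t0=10.0, alpha=200, beta=20, gamma=0.9, delta=50):
    def score(ps):
        return sum(family_score(v, ps[v], data) for v in ps)
    current, current_score = parents, score(parents)
    t = t0
    for _ in range(delta):                          # at most delta temperature drops
        changes = 0
        for _ in range(alpha):                      # up to alpha proposals per stage
            if changes >= beta:
                break
            candidates = [ps for ps in neighbors(current) if is_acyclic(ps)]
            if not candidates:
                break
            proposal = random.choice(candidates)
            proposal_score = score(proposal)
            d = proposal_score - current_score      # Delta(e), in log-score units
            if d >= 0 or random.random() < math.exp(d / t):
                current, current_score = proposal, proposal_score
                changes += 1
        if changes == 0:
            return current, current_score           # frozen: no accepted changes
        t *= gamma                                  # lower the temperature
    return current, current_score
```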