
Learning Bayesian Networks: The Combination of Knowledge and Statistical Data

David Heckerman
heckerma@microsoft.com

Dan Geiger
dang@cs.technion.ac.il

David M. Chickering
dmax@cs.ucla.edu

March 1994 (Revised February 1995)

Technical Report
MSR-TR-94-09

    Microsoft Research

    Advanced Technology Division

    Microsoft Corporation

    One Microsoft Way

Redmond, WA

    To appear in Machine Learning

Abstract

We describe a Bayesian approach for learning Bayesian networks from a combination of prior knowledge and statistical data. First and foremost, we develop a methodology for assessing informative priors needed for learning. Our approach is derived from a set of assumptions made previously, as well as the assumption of likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence. We show that likelihood equivalence, when combined with previously made assumptions, implies that the user's priors for network parameters can be encoded in a single Bayesian network for the next case to be seen (a prior network) and a single measure of confidence for that network. Second, using these priors, we show how to compute the relative posterior probabilities of network structures given data. Third, we describe search methods for identifying network structures with high posterior probabilities. We describe polynomial algorithms for finding the highest-scoring network structures in the special case where every node has at most k = 1 parent. For the general case (k > 1), which is NP-hard, we review heuristic search algorithms including local search, iterative local search, and simulated annealing. Finally, we describe a methodology for evaluating Bayesian-network learning algorithms, and apply this approach to a comparison of various approaches.

Keywords: Bayesian networks, learning, Dirichlet, likelihood equivalence, maximum branching, heuristic search

    Introduction

A Bayesian network is an annotated directed graph that encodes probabilistic relationships among distinctions of interest in an uncertain-reasoning problem (Howard and Matheson; Pearl). The representation formally encodes the joint probability distribution for its domain, yet includes a human-oriented qualitative structure that facilitates communication between a user and a system incorporating the probabilistic model. We discuss the representation in detail in the following section. For over a decade, AI researchers have used Bayesian networks to encode expert knowledge. More recently, AI researchers and statisticians have begun to investigate methods for learning Bayesian networks, including Bayesian methods (Cooper and Herskovits; Buntine; Spiegelhalter et al.; Dawid and Lauritzen; Heckerman et al.), quasi-Bayesian methods (Lam and Bacchus; Suzuki), and non-Bayesian methods (Pearl and Verma; Spirtes et al.).

In this paper, we concentrate on the Bayesian approach, which takes prior knowledge and combines it with data to produce one or more Bayesian networks. Our approach is illustrated in Figure 1 for the problem of ICU ventilator management. Using our method, a user specifies his prior knowledge about the problem by constructing a Bayesian network,

called a prior network, and by assessing his confidence in this network. A hypothetical prior network is shown in Figure 1b (the probabilities are not shown). In addition, a database of cases is assembled, as shown in Figure 1c. Each case in the database contains observations for every variable in the user's prior network. Our approach then takes these sources of information and learns one or more new Bayesian networks, as shown in Figure 1d. To appreciate the effectiveness of the approach, note that the database was generated from the Bayesian network in Figure 1a, known as the Alarm network (Beinlich et al.). Comparing the three network structures, we see that the structure of the learned network is much closer to that of the Alarm network than is the structure of the prior network. In effect, our learning algorithm has used the database to correct the prior knowledge of the user.

Our Bayesian approach can be understood as follows. Suppose we have a domain of discrete variables $U = \{x_1, \ldots, x_n\}$ and a database of cases $D = \{C_1, \ldots, C_m\}$. Further suppose that we wish to determine the joint distribution $p(C|D, \xi)$: the probability distribution of a new case $C$, given the database and our current state of information $\xi$. Rather than reason about this distribution directly, we imagine that the data is a random sample from an unknown Bayesian network structure $B_s$ with unknown parameters. Using $B_s^h$ to denote the hypothesis that the data is generated by network structure $B_s$, and assuming the hypotheses corresponding to all possible network structures form a mutually exclusive and collectively exhaustive set, we have

$$p(C|D, \xi) = \sum_{\text{all } B_s^h} p(C|D, B_s^h, \xi)\; p(B_s^h|D, \xi)$$

In practice, it is impossible to sum over all possible network structures. Consequently, we attempt to identify a small subset $H$ of network-structure hypotheses that account for a large fraction of the posterior probability of the hypotheses. Rewriting the previous equation, we obtain

$$p(C|D, \xi) \approx c \sum_{B_s^h \in H} p(C|D, B_s^h, \xi)\; p(B_s^h|D, \xi)$$

where $c$ is the normalization constant $1/\sum_{B_s^h \in H} p(B_s^h|D, \xi)$. From this relation, we see that only the relative posterior probabilities of hypotheses matter. Thus, rather than compute a posterior probability, which would entail summing over all structures, we can compute a Bayes factor, $p(B_s^h|D, \xi)/p(B_{s0}^h|D, \xi)$, where $B_{s0}$ is some reference structure such as the one containing no arcs, or simply $p(D, B_s^h|\xi) = p(B_s^h|\xi)\, p(D|B_s^h, \xi)$. In the latter case, we have

$$p(C|D, \xi) \approx c \sum_{B_s^h \in H} p(C|D, B_s^h, \xi)\; p(D, B_s^h|\xi)$$

where $c$ is another normalization constant, $1/\sum_{B_s^h \in H} p(D, B_s^h|\xi)$.
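To make the averaging step concrete, the following is a minimal sketch (not from the paper) of how this final approximation might be computed once relative scores $p(D, B_s^h|\xi)$ and predictive terms $p(C|D, B_s^h, \xi)$ are available for a small hypothesis set $H$; the function name and the numbers are illustrative assumptions.

```python
def average_prediction(scores, predictions):
    """Approximate p(C|D) by averaging predictions over a set H of hypotheses.

    scores:      dict mapping hypothesis name -> relative score p(D, B^h | xi)
    predictions: dict mapping hypothesis name -> p(C | D, B^h, xi) for a fixed case C
    """
    total = sum(scores.values())              # 1/c, the normalization constant
    return sum(predictions[h] * scores[h] / total for h in scores)

# Hypothetical example with two network-structure hypotheses.
scores = {"x->y": 0.7e-10, "empty": 0.3e-10}    # relative scores p(D, B^h | xi)
predictions = {"x->y": 0.82, "empty": 0.75}     # p(C | D, B^h, xi) for the new case C
print(average_prediction(scores, predictions))  # weighted average, approx 0.799
```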

Figure 1: (a) The Alarm network structure. (b) A prior network encoding a user's beliefs about the Alarm domain. (c) A case database generated from the Alarm network. (d) The network learned from the prior network and a case database generated from the Alarm network. Arcs that are added, deleted, or reversed with respect to the Alarm network are indicated with A, D, and R, respectively.

In short, the Bayesian approach to learning Bayesian networks amounts to searching for network-structure hypotheses with high relative posterior probabilities. Many non-Bayesian approaches use the same basic approach, but optimize some other measure of how well the structure fits the data. In general, we refer to such measures as scoring metrics. We refer to any formula for computing the relative posterior probability of a network-structure hypothesis as a Bayesian scoring metric.

The Bayesian approach is not only an approximation for $p(C|D, \xi)$ but a method for learning network structure. When $|H| = 1$, we learn a single network structure: the MAP (maximum a posteriori) structure of $U$. When $|H| > 1$, we learn a collection of network structures. As we discuss later in the paper, learning network structure is useful because we can sometimes use structure to infer causal relationships in a domain, and consequently predict the effects of interventions.

One of the most challenging tasks in designing a Bayesian learning procedure is identifying classes of easy-to-assess informative priors for computing the terms on the right-hand side of the relation above. In the first part of the paper, we explicate a set of assumptions for discrete networks (networks containing only discrete variables) that leads to such a class of informative priors. Our assumptions are based on those made by Cooper and Herskovits (herein referred to as CH), Spiegelhalter et al. and Dawid and Lauritzen (herein referred to as SDLC), and Buntine. These researchers assumed parameter independence, which says that the parameters associated with each node in a Bayesian network are independent; parameter modularity, which says that if a node has the same parents in two distinct networks, then the probability density functions of the parameters associated with this node are identical in both networks; and the Dirichlet assumption, which says that all network parameters have a Dirichlet distribution. We assume parameter independence and parameter modularity, but instead of adopting the Dirichlet assumption, we introduce an assumption called likelihood equivalence, which says that data should not help to discriminate network structures that represent the same assertions of conditional independence. We argue that this property is necessary when learning acausal Bayesian networks, and is often reasonable when learning causal Bayesian networks. We then show that likelihood equivalence, when combined with parameter independence and several weak conditions, implies the Dirichlet assumption. Furthermore, we show that likelihood equivalence constrains the Dirichlet distributions in such a way that they may be obtained from the user's prior network (a Bayesian network for the next case to be seen) and a single equivalent sample size reflecting the user's confidence in his prior network.

Our result has both a positive and a negative aspect. On the positive side, we show that parameter independence, parameter modularity, and likelihood equivalence lead to a simple approach for assessing priors that requires the user to assess only one equivalent sample size for the entire domain. On the negative side, the approach is sometimes too simple: a user may have more knowledge about one part of a domain than another. We argue that the assumptions of parameter independence and likelihood equivalence are sometimes too strong, and suggest a framework for relaxing these assumptions.

A more straightforward task in learning Bayesian networks is using a given informative prior to compute $p(D, B_s^h|\xi)$ (i.e., a Bayesian scoring metric) and $p(C|D, B_s^h, \xi)$. When databases are complete, that is, when there is no missing data, these terms can be derived in closed form. Otherwise, well-known statistical approximations may be used. In this paper, we consider complete databases only, and derive closed-form expressions for these terms. A result is a likelihood-equivalent Bayesian scoring metric, which we call the BDe metric. This metric is to be contrasted with the metrics of CH and Buntine, which do not make use of a prior network, and with the metrics of CH and SDLC, which do not satisfy the property of likelihood equivalence.

In the second part of the paper, we examine methods for finding networks with high scores. The methods can be used with any scoring metric. We describe polynomial algorithms for finding the highest-scoring networks in the special case where every node has at most one parent. In addition, we describe local-search and annealing algorithms for the general case, which is known to be NP-hard.

Finally, we describe a methodology for evaluating learning algorithms. We use this methodology to compare various scoring metrics and search methods.

We note that several researchers (e.g., Dawid and Lauritzen, and Madigan and Raftery) have developed methods for learning undirected network structures, as described in, e.g., Lauritzen. In this paper, we concentrate on learning directed models, because we can sometimes use them to infer causal relationships, and because most users find them easier to interpret.

    Background

In this section, we introduce notation and background material that we need for our discussion, including a description of Bayesian networks, exchangeability, multinomial sampling, and the Dirichlet distribution. A summary of our notation is given after the Appendix.

Throughout this discussion, we consider a domain $U$ of $n$ discrete variables $x_1, \ldots, x_n$.

We use lower-case letters to refer to variables and upper-case letters to refer to sets of variables. We write $x_i = k$ to denote that variable $x_i$ is in state $k$. When we observe the state for every variable in set $X$, we call this set of observations an instance of $X$, and we write $X = k_X$ as a shorthand for the observations $x_i = k_i$, $x_i \in X$. The joint space of $U$ is the set of all instances of $U$. We use $p(X = k_X|Y = k_Y, \xi)$ to denote the probability that $X = k_X$ given $Y = k_Y$ for a person with current state of information $\xi$. We use $p(X|Y, \xi)$ to denote the set of probabilities for all possible observations of $X$, given all possible observations of $Y$. The joint probability distribution over $U$ is the probability distribution over the joint space of $U$.

A Bayesian network for domain $U$ represents a joint probability distribution over $U$. The representation consists of a set of local conditional distributions combined with a set of conditional-independence assertions that allow us to construct a global joint probability distribution from the local distributions. In particular, by the chain rule of probability, we have

$$p(x_1, \ldots, x_n|\xi) = \prod_{i=1}^{n} p(x_i|x_1, \ldots, x_{i-1}, \xi)$$

For each variable $x_i$, let $\Pi_i \subseteq \{x_1, \ldots, x_{i-1}\}$ be a set of variables that renders $x_i$ and $\{x_1, \ldots, x_{i-1}\}$ conditionally independent. That is,

$$p(x_i|x_1, \ldots, x_{i-1}, \xi) = p(x_i|\Pi_i, \xi)$$

A Bayesian-network structure $B_s$ encodes these assertions of conditional independence. Namely, $B_s$ is a directed acyclic graph such that each variable in $U$ corresponds to a node in $B_s$, and the parents of the node corresponding to $x_i$ are the nodes corresponding to the variables in $\Pi_i$. (In this paper, we use $x_i$ to refer to both the variable and its corresponding node in a graph.) A Bayesian-network probability set $B_p$ is the collection of local distributions $p(x_i|\Pi_i, \xi)$ for each node in the domain. A Bayesian network for $U$ is the pair $(B_s, B_p)$. Combining the two previous equations, we see that any Bayesian network for $U$ uniquely determines a joint probability distribution for $U$. That is,

$$p(x_1, \ldots, x_n|\xi) = \prod_{i=1}^{n} p(x_i|\Pi_i, \xi)$$

When a variable has only two states, we say that it is binary. A Bayesian network for three binary variables $x_1$, $x_2$, and $x_3$ is shown in Figure 2. We see that $\Pi_2 = \{x_1\}$ and $\Pi_3 = \{x_2\}$. Consequently, this network represents the conditional-independence assertion $p(x_3|x_1, x_2, \xi) = p(x_3|x_2, \xi)$.
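As a concrete illustration of how the local distributions determine the joint distribution, here is a small sketch (not part of the original paper) that multiplies out the factorization $p(x_1)\,p(x_2|x_1)\,p(x_3|x_2)$ using the probabilities shown in Figure 2; all numbers are taken from that figure.

```python
from itertools import product

# Local distributions from Figure 2 (states: True = present, False = absent).
p_x1 = {True: 0.6, False: 0.4}
p_x2_given_x1 = {True: {True: 0.8, False: 0.2}, False: {True: 0.3, False: 0.7}}
p_x3_given_x2 = {True: {True: 0.9, False: 0.1}, False: {True: 0.15, False: 0.85}}

# Joint distribution p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2).
joint = {
    (x1, x2, x3): p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]
    for x1, x2, x3 in product([True, False], repeat=3)
}

# Check the asserted independence: p(x3 | x1, x2) equals p(x3 | x2).
for x1, x2 in product([True, False], repeat=2):
    denom = sum(joint[(x1, x2, x3)] for x3 in [True, False])
    assert abs(joint[(x1, x2, True)] / denom - p_x3_given_x2[x2][True]) < 1e-12

print(sum(joint.values()))  # 1.0
```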

Figure 2: A Bayesian network for three binary variables, taken from CH. The network represents the assertion that $x_1$ and $x_3$ are conditionally independent given $x_2$. Each variable has two states, "absent" and "present". The local distributions are $p(x_1 = \text{present}|\xi) = 0.6$; $p(x_2 = \text{present}|x_1 = \text{present}, \xi) = 0.8$; $p(x_2 = \text{present}|x_1 = \text{absent}, \xi) = 0.3$; $p(x_3 = \text{present}|x_2 = \text{present}, \xi) = 0.9$; and $p(x_3 = \text{present}|x_2 = \text{absent}, \xi) = 0.15$.

It can happen that two Bayesian-network structures represent the same constraints of conditional independence; that is, every joint probability distribution encoded by one structure can be encoded by the other, and vice versa. In this case, the two network structures are said to be equivalent (Verma and Pearl). For example, the structures $x_1 \rightarrow x_2 \rightarrow x_3$ and $x_1 \leftarrow x_2 \leftarrow x_3$ both represent the assertion that $x_1$ and $x_3$ are conditionally independent given $x_2$, and are equivalent. In some of the technical discussions in this paper, we shall require the following characterization of equivalent networks, which we prove in the Appendix.

    shall require the following characterization of equivalent networks proved in the Appendix

    Theorem Let Bs and Bs be two Bayesiannetwork structures and RBsBs be the set

    of edges by which Bs and Bs dier in directionality Then Bs and Bs are equivalent

    if and only if there exists a sequence of jRBsBsj distinct arc reversals applied to Bs with

    the following properties

    After each reversal the resulting network structure contains no directed cycles and is

    equivalent to Bs

    After all reversals the resulting network structure is identical to Bs

    If x y is the next arc to be reversed in the current network structure then x and y

    have the same parents in both network structures with the exception that x is also a

    parent of y in Bs
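The key condition in this theorem is that each reversed arc $x \rightarrow y$ is "covered": $x$ and $y$ share the same parents, except that $x$ is itself a parent of $y$. Here is a small illustrative sketch (not from the paper) of that check for a DAG stored as a parent-set dictionary; the representation is an assumption made for concreteness.

```python
def is_covered(parents, x, y):
    """Return True if the arc x -> y is covered in the DAG given by `parents`,
    i.e., y's parents are exactly x's parents plus x itself. Reversing a covered
    arc yields an equivalent structure and preserves acyclicity."""
    if x not in parents[y]:
        raise ValueError(f"{x} -> {y} is not an arc of the structure")
    return parents[y] - {x} == parents[x]

# Example: the chain x1 -> x2 -> x3.
chain = {"x1": set(), "x2": {"x1"}, "x3": {"x2"}}
print(is_covered(chain, "x1", "x2"))  # True: reversal keeps equivalence
print(is_covered(chain, "x2", "x3"))  # False: x2 has parent x1 while x3 does not
```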

A drawback of Bayesian networks, as defined, is that network structure depends on variable order. If the order is chosen carelessly, the resulting network structure may fail to reveal many conditional independencies in the domain. Fortunately, in practice, Bayesian networks are typically constructed using notions of cause and effect. Loosely speaking, to construct a Bayesian network for a given set of variables, we draw arcs from cause variables to their immediate effects. For example, we would obtain the network structure in Figure 2 if we believed that $x_2$ is the immediate causal effect of $x_1$ and $x_3$ is the immediate causal effect of $x_2$. In almost all cases, constructing a Bayesian network in this way yields a Bayesian network that is consistent with the formal definition. We return to this issue later in the paper.

Figure 3: A Bayesian network showing the conditional-independence assertions associated with a multinomial sample.

Now let us consider exchangeability and random sampling. Most of the concepts we discuss can be found in Good and DeGroot. Given a discrete variable $y$ with $r$ states, consider a finite sequence of observations $y_1, \ldots, y_m$ of this variable. We can think of this sequence as a database $D$ for the one-variable domain $U = \{y\}$. This sequence is said to be exchangeable if a sequence obtained by interchanging any two observations in the sequence has the same probability as the original sequence. Roughly speaking, the assumption that a sequence is exchangeable is an assertion that the processes generating the data do not change in time.

Given an exchangeable sequence, de Finetti showed that there exist parameters $\Theta_y = \{\theta_{y=1}, \ldots, \theta_{y=r}\}$ such that

$$\theta_{y=k} > 0, \quad k = 1, \ldots, r; \qquad \sum_{k=1}^{r} \theta_{y=k} = 1$$

$$p(y_l = k|y_1, \ldots, y_{l-1}, \Theta_y, \xi) = \theta_{y=k}$$

That is, the parameters $\Theta_y$ render the individual observations in the sequence conditionally independent, and the probability that any given observation will be in state $k$ is just $\theta_{y=k}$. This conditional-independence assertion may be represented as a Bayesian network, as shown in Figure 3. By the strong law of large numbers (e.g., DeGroot), we may think of $\theta_{y=k}$ as the long-run fraction of observations where $y = k$, although there are other interpretations (Howard). Also note that each parameter $\theta_{y=k}$ is positive (i.e., greater than zero).

A sequence that satisfies these conditions is a particular type of random sample known as an $r$-dimensional multinomial sample with parameters $\Theta_y$ (Good). When $r = 2$, the

sequence is said to be a binomial sample. One example of a binomial sample is the outcome of repeated flips of a thumbtack. If we knew the long-run fraction of "heads" (point up) for a given thumbtack, then the outcome of each flip would be independent of the rest, and would have a probability of heads equal to this fraction. An example of a multinomial sample is the outcome of repeated rolls of a multi-sided die. As we shall see, learning Bayesian networks for discrete domains essentially reduces to the problem of learning the parameters of a die having many sides.

As $\Theta_y$ is a set of continuous variables, it has a probability density, which we denote $\rho(\Theta_y|\xi)$. Throughout this paper, we use $\rho(\cdot|\cdot)$ to denote a probability density for a continuous variable or set of continuous variables. Given $\rho(\Theta_y|\xi)$, we can determine the probability that $y = k$ in the next observation. In particular, by the rules of probability, we have

$$p(y = k|\xi) = \int p(y = k|\Theta_y, \xi)\; \rho(\Theta_y|\xi)\; d\Theta_y$$

Consequently, by the second condition above, we obtain

$$p(y = k|\xi) = \int \theta_{y=k}\; \rho(\Theta_y|\xi)\; d\Theta_y$$

which is the mean or expectation of $\theta_{y=k}$ with respect to $\rho(\Theta_y|\xi)$, denoted $E(\theta_{y=k}|\xi)$.

Suppose we have a prior density for $\Theta_y$ and then observe a database $D$. We may obtain the posterior density for $\Theta_y$ as follows. From Bayes' rule, we have

$$\rho(\Theta_y|D, \xi) = c\; p(D|\Theta_y, \xi)\; \rho(\Theta_y|\xi)$$

where $c$ is a normalization constant. Using the multinomial-sample condition to rewrite the first term on the right-hand side, we obtain

$$\rho(\Theta_y|D, \xi) = c \prod_{k=1}^{r} \theta_{y=k}^{N_k}\; \rho(\Theta_y|\xi)$$

where $N_k$ is the number of times $y = k$ in $D$. Note that only the counts $N_1, \ldots, N_r$ are necessary to determine the posterior from the prior. These counts are said to be a sufficient statistic for the multinomial sample.

In addition, suppose we assess a density for two different states of information $\xi_1$ and $\xi_2$, and find that $\rho(\Theta_y|\xi_1) = \rho(\Theta_y|\xi_2)$. Then, for any multinomial sample $D$,

$$p(D|\xi_1) = \int p(D|\Theta_y, \xi_1)\; \rho(\Theta_y|\xi_1)\; d\Theta_y = p(D|\xi_2)$$

because $p(D|\Theta_y, \xi_1) = p(D|\Theta_y, \xi_2)$ by the multinomial-sample condition. That is, if the densities for $\Theta_y$ are the same, then the probability of any two samples will be the same. The converse is also true: namely, if $p(D|\xi_1) = p(D|\xi_2)$ for all databases $D$, then $\rho(\Theta_y|\xi_1) = \rho(\Theta_y|\xi_2)$. We shall use this equivalence when we discuss likelihood equivalence.

(We assume this result is well known, although we have not found a proof in the literature.)

Given a multinomial sample, a user is free to assess any probability density for $\Theta_y$. In practice, however, one often uses the Dirichlet distribution because it has several convenient properties. The parameters $\Theta_y$ have a Dirichlet distribution with exponents $N_1', \ldots, N_r'$ when the probability density of $\Theta_y$ is given by

$$\rho(\Theta_y|\xi) = \frac{\Gamma\!\left(\sum_{k=1}^{r} N_k'\right)}{\prod_{k=1}^{r} \Gamma(N_k')} \prod_{k=1}^{r} \theta_{y=k}^{N_k' - 1}, \qquad N_k' > 0$$

where $\Gamma(\cdot)$ is the Gamma function, which satisfies $\Gamma(x+1) = x\,\Gamma(x)$ and $\Gamma(1) = 1$. When the parameters $\Theta_y$ have a Dirichlet distribution, we also say that $\rho(\Theta_y|\xi)$ is Dirichlet. The requirement that each $N_k'$ be greater than zero guarantees that the distribution can be normalized. Note that the exponents $N_k'$ are a function of the user's state of information $\xi$. Also note that, because the parameters sum to one, the Dirichlet distribution for $\Theta_y$ is technically a density over $\Theta_y \setminus \{\theta_{y=k}\}$ for some $k$ (the symbol $\setminus$ denotes set difference). Nonetheless, we shall write the density as shown. When $r = 2$, the Dirichlet distribution is also known as a beta distribution.

From the posterior relation above, we see that if the prior distribution of $\Theta_y$ is Dirichlet, then the posterior distribution of $\Theta_y$ given database $D$ is also Dirichlet:

$$\rho(\Theta_y|D, \xi) = c \prod_{k=1}^{r} \theta_{y=k}^{N_k' + N_k - 1}$$

where $c$ is a normalization constant. We say that the Dirichlet distribution is closed under multinomial sampling, or that the Dirichlet distribution is a conjugate family of distributions for multinomial sampling. Also, when $\Theta_y$ has a Dirichlet distribution, the expectation of $\theta_{y=k}$ (equal to the probability that $y = k$ in the next observation) has a simple expression:

$$E(\theta_{y=k}|\xi) = p(y = k|\xi) = \frac{N_k'}{N'}$$

where $N' = \sum_{k=1}^{r} N_k'$. We shall make use of these properties in our derivations.
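To illustrate the conjugacy and the expectation formula, here is a small numerical sketch (not from the paper); the state names and counts are made up for the example.

```python
# Dirichlet prior exponents N'_k for a three-state variable y (hypothetical values).
prior = {"low": 2.0, "medium": 5.0, "high": 3.0}

# Sufficient statistics N_k: how many times each state occurred in the database D.
counts = {"low": 10, "medium": 4, "high": 6}

# Conjugacy: the posterior is Dirichlet with exponents N'_k + N_k.
posterior = {k: prior[k] + counts[k] for k in prior}

# Predictive probability of the next observation: (N'_k + N_k) / (N' + N).
total = sum(posterior.values())
predictive = {k: posterior[k] / total for k in posterior}
print(predictive)  # e.g. p(y = "low" | D) = 12/30 = 0.4
```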

A survey of methods for assessing a beta distribution is given by Winkler. These methods include the direct assessment of the probability density using questions regarding relative densities and relative areas, assessment of the cumulative distribution function using fractiles, assessing the posterior means of the distribution given hypothetical evidence, and assessment in the form of an equivalent sample size. These methods can be generalized, with varying difficulty, to the non-binary case.

In our work, we find one method based on the expectation formula above particularly useful. That formula says that we can assess a Dirichlet distribution by assessing the probability distribution $p(y|\xi)$ for the next observation and the equivalent sample size $N'$. In so doing, we may rewrite the Dirichlet density as

$$\rho(\Theta_y|\xi) = c \prod_{k=1}^{r} \theta_{y=k}^{N'\, p(y=k|\xi) - 1}$$

where $c$ is a normalization constant. Assessing $p(y|\xi)$ is straightforward. Furthermore, the following two observations suggest a simple method for assessing $N'$.

One, the variance of a density for $\Theta_y$ is an indication of how much the mean of $\Theta_y$ is expected to change, given new observations. The higher the variance, the greater the expected change. It is sometimes said that the variance is a measure of a user's confidence in the mean for $\Theta_y$. The variance of the Dirichlet distribution is given by

$$\mathrm{Var}(\theta_{y=k}|\xi) = \frac{p(y = k|\xi)\,(1 - p(y = k|\xi))}{N' + 1}$$

Thus, $N'$ is a reflection of the user's confidence. Two, suppose we were initially completely ignorant about a domain; that is, our distribution $\rho(\Theta_y|\xi)$ was given by the Dirichlet density above with each exponent $N_k' = 0$. (This prior distribution cannot be normalized, and is sometimes called an improper prior. To be more precise, we should say that each exponent is equal to some number close to zero.) Suppose we then saw $N'$ cases with sufficient statistics $N_1', \ldots, N_r'$. Then, our density for $\Theta_y$ would be the Dirichlet distribution given above. Thus, we can assess $N'$ as an equivalent sample size: the number of observations we would have had to have seen, starting from complete ignorance, in order to have the same confidence in $\Theta_y$ that we actually have. This assessment approach generalizes easily to many-variable domains, and thus is useful for our work. We note that some users at first find judgments of equivalent sample size to be difficult. Our experience with such users has been that they may be made more comfortable with the method by first using some other method for assessment (e.g., fractiles) on simple scenarios, and by examining the equivalent sample sizes implied by their assessments.
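The following is a brief sketch (not from the paper) of this assessment style: the user supplies a predictive distribution $p(y|\xi)$ and an equivalent sample size $N'$, and we recover the Dirichlet exponents and the implied confidence (variance); the numbers are illustrative.

```python
def dirichlet_from_assessment(predictive, equivalent_sample_size):
    """Build Dirichlet exponents N'_k = N' * p(y = k) from an assessed
    predictive distribution and an equivalent sample size N'."""
    return {k: equivalent_sample_size * p for k, p in predictive.items()}

def dirichlet_variance(predictive, equivalent_sample_size):
    """Var(theta_k) = p(y = k)(1 - p(y = k)) / (N' + 1): smaller for larger N'."""
    return {k: p * (1.0 - p) / (equivalent_sample_size + 1.0)
            for k, p in predictive.items()}

assessed = {"present": 0.6, "absent": 0.4}      # assessed p(y | xi)
print(dirichlet_from_assessment(assessed, 10))  # {'present': 6.0, 'absent': 4.0}
print(dirichlet_variance(assessed, 10))         # 0.6*0.4/11, roughly 0.022
print(dirichlet_variance(assessed, 100))        # about ten times smaller
```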

Bayesian Metrics: Previous Work

CH, Buntine, and SDLC examine domains where all variables are discrete, and derive essentially the same Bayesian scoring metric and formula for $p(C|D, B_s^h, \xi)$, based on the same set of assumptions about the user's prior knowledge and the database. In this section, we present these assumptions and provide a derivation of $p(D, B_s^h|\xi)$ and $p(C|D, B_s^h, \xi)$.

Roughly speaking, the first assumption is that $B_s^h$ is true if and only if the database $D$ can be partitioned into a set of multinomial samples determined by the network structure $B_s$. In particular, $B_s^h$ is true if and only if, for every variable $x_i$ in $U$ and every instance of $x_i$'s parents $\Pi_i$ in $B_s$, the observations of $x_i$ in $D$ in those cases where $\Pi_i$ takes on the same instance constitute a multinomial sample. For example, consider a domain consisting of two binary variables $x$ and $y$. (We shall use this domain to illustrate many of the concepts in this paper.) There are three network structures for this domain: $x \rightarrow y$, $x \leftarrow y$, and the empty

network structure containing no arc. The hypothesis associated with the empty network structure, denoted $B_{xy}^h$, corresponds to the assertion that the database is made up of two binomial samples: (1) the observations of $x$ are a binomial sample with parameter $\theta_x$, and (2) the observations of $y$ are a binomial sample with parameter $\theta_y$.

In contrast, the hypothesis associated with the network structure $x \rightarrow y$, denoted $B_{x \rightarrow y}^h$, corresponds to the assertion that the database is made up of at most three binomial samples: (1) the observations of $x$ are a binomial sample with parameter $\theta_x$, (2) the observations of $y$ in those cases where $x$ is true (if any) are a binomial sample with parameter $\theta_{y|x}$, and (3) the observations of $y$ in those cases where $x$ is false (if any) are a binomial sample with parameter $\theta_{y|\bar{x}}$. One consequence of the second and third assertions is that $y$ in case $C_l$ is conditionally independent of the other occurrences of $y$ in $D$, given $\theta_{y|x}$, $\theta_{y|\bar{x}}$, and $x$ in case $C_l$. We can graphically represent this conditional-independence assertion using a Bayesian-network structure, as shown in Figure 4a.

Figure 4: A Bayesian-network structure for a two-binary-variable domain $\{x, y\}$, showing conditional independencies associated with (a) the multinomial-sample assumption and (b) the added assumption of parameter independence. In both figures, it is assumed that the network structure $x \rightarrow y$ is generating the database.

Finally, the hypothesis associated with the network structure $x \leftarrow y$, denoted $B_{x \leftarrow y}^h$, corresponds to the assertion that the database is made up of at most three binomial samples: one for $y$, one for $x$ given $y$ is true, and one for $x$ given $y$ is false.

Before we state this assumption for arbitrary domains, we introduce the following

notation. Given a Bayesian network $B_s$ for domain $U$, let $r_i$ be the number of states of variable $x_i$, and let $q_i = \prod_{x_l \in \Pi_i} r_l$ be the number of instances of $\Pi_i$. We use the integer $j$ to index the instances of $\Pi_i$. Thus, we write $p(x_i = k|\Pi_i = j, \xi)$ to denote the probability that $x_i = k$, given the $j$th instance of the parents of $x_i$. Let $\theta_{ijk}$ denote the multinomial parameter corresponding to the probability $p(x_i = k|\Pi_i = j, \xi)$ (with $\theta_{ijr_i} = 1 - \sum_{k=1}^{r_i - 1} \theta_{ijk}$). In addition, we define

$$\Theta_{ij} \equiv \bigcup_{k=1}^{r_i} \{\theta_{ijk}\}, \qquad \Theta_i \equiv \bigcup_{j=1}^{q_i} \{\Theta_{ij}\}, \qquad \Theta_{B_s} \equiv \bigcup_{i=1}^{n} \{\Theta_i\}$$

That is, the parameters $\Theta_{B_s}$ correspond to the probability set $B_p$ for a single-case Bayesian network. (Whenever possible, we use CH's notation.)

Assumption 1 (Multinomial Sample)  Given domain $U$ and database $D$, let $D_l$ denote the first $l - 1$ cases in the database. In addition, let $x_{il}$ and $\Pi_{il}$ denote the variable $x_i$ and the parent set $\Pi_i$ in the $l$th case, respectively. Then, for all network structures $B_s$ in $U$, there exist positive parameters $\Theta_{B_s}$ such that, for $i = 1, \ldots, n$, and for all $k, k_1, \ldots, k_{i-1}$,

$$p(x_{il} = k\,|\,x_{1l} = k_1, \ldots, x_{(i-1)l} = k_{i-1}, D_l, \Theta_{B_s}, B_s^h, \xi) = \theta_{ijk}$$

where $j$ is the instance of $\Pi_{il}$ consistent with $\{x_{1l} = k_1, \ldots, x_{(i-1)l} = k_{i-1}\}$.

There is an important implication of this assumption, which we examine later in the paper. Nonetheless, the equation above is all that we need (and all that CH, Buntine, and SDLC used) to derive a metric. Also note that the positivity requirement excludes logical relationships among variables. We can relax this requirement, although we do not do so in this paper.

The second assumption is an independence assumption.

Assumption 2 (Parameter Independence)  Given network structure $B_s$, if $p(B_s^h|\xi) > 0$, then

(a) $\rho(\Theta_{B_s}|B_s^h, \xi) = \prod_{i=1}^{n} \rho(\Theta_i|B_s^h, \xi)$, and

(b) for $i = 1, \ldots, n$: $\rho(\Theta_i|B_s^h, \xi) = \prod_{j=1}^{q_i} \rho(\Theta_{ij}|B_s^h, \xi)$.

Assumption 2a says that the parameters associated with each variable in a network structure are independent. We call this assumption global parameter independence, after Spiegelhalter and Lauritzen. Assumption 2b says that the parameters associated with each

instance of the parents of a variable are independent. We call this assumption local parameter independence, again after Spiegelhalter and Lauritzen. We refer to the combination of these assumptions simply as parameter independence. The assumption of parameter independence for our two-binary-variable domain is shown in the Bayesian-network structure of Figure 4b.

As we shall see, Assumption 2 greatly simplifies the computation of $p(D, B_s^h|\xi)$. The assumption is reasonable for some domains, but not for others. Later in the paper, we describe a simple characterization of the assumption that provides a test for deciding whether the assumption is reasonable in a given domain.

The third assumption was also made to simplify computations.

Assumption 3 (Parameter Modularity)  Given two network structures $B_{s1}$ and $B_{s2}$ such that $p(B_{s1}^h|\xi) > 0$ and $p(B_{s2}^h|\xi) > 0$, if $x_i$ has the same parents in $B_{s1}$ and $B_{s2}$, then

$$\rho(\Theta_{ij}|B_{s1}^h, \xi) = \rho(\Theta_{ij}|B_{s2}^h, \xi), \qquad j = 1, \ldots, q_i$$

We call this property parameter modularity, because it says that the densities for parameters $\Theta_{ij}$ depend only on the structure of the network that is local to variable $x_i$; namely, $\Theta_{ij}$ only depends on $x_i$ and its parents. For example, consider the network structure $x \rightarrow y$ and the empty structure for our two-variable domain. In both structures, $x$ has the same set of parents (the empty set). Consequently, by parameter modularity, $\rho(\theta_x|B_{x \rightarrow y}^h, \xi) = \rho(\theta_x|B_{xy}^h, \xi)$. We note that CH, Buntine, and SDLC implicitly make the assumption of parameter modularity (Cooper and Herskovits; Buntine; Spiegelhalter et al.).

The fourth assumption restricts each parameter set $\Theta_{ij}$ to have a Dirichlet distribution.

Assumption 4 (Dirichlet)  Given a network structure $B_s$ such that $p(B_s^h|\xi) > 0$, $\rho(\Theta_{ij}|B_s^h, \xi)$ is Dirichlet for all $\Theta_{ij} \subseteq \Theta_{B_s}$. That is, there exist exponents $N_{ijk}'$, which depend on $B_s^h$ and $\xi$, that satisfy

$$\rho(\Theta_{ij}|B_s^h, \xi) = c \prod_{k} \theta_{ijk}^{N_{ijk}' - 1}$$

where $c$ is a normalization constant.

When every parameter set of $B_s$ has a Dirichlet distribution, we simply say that $\rho(\Theta_{B_s}|B_s^h, \xi)$ is Dirichlet. Note that, by the assumption of parameter modularity, we do not require Dirichlet exponents for every network structure $B_s$. Rather, we require exponents only for every node and for every possible parent set of each node.

Assumptions 1 through 4 are assumptions about the domain. Given Assumption 1, we can compute $p(D|\Theta_{B_s}, B_s^h, \xi)$ as a function of $\Theta_{B_s}$ for any given database (see the likelihood below). Also, as we show later in this section, these assumptions determine $\rho(\Theta_{B_s}|B_s^h, \xi)$ for every network structure $B_s$. Thus, from the relation

$$p(D, B_s^h|\xi) = p(B_s^h|\xi) \int p(D|\Theta_{B_s}, B_s^h, \xi)\; \rho(\Theta_{B_s}|B_s^h, \xi)\; d\Theta_{B_s}$$

these assumptions, in conjunction with the prior probabilities of network structures $p(B_s^h|\xi)$, form a complete representation of the user's prior knowledge for purposes of computing $p(D, B_s^h|\xi)$. By a similar argument, we can show that Assumptions 1 through 4 also determine the probability distribution $p(C|D, B_s^h, \xi)$ for any given database and network structure.

In contrast, the fifth assumption is an assumption about the database.

Assumption 5 (Complete Data)  The database is complete. That is, it contains no missing data.

This assumption was made in order to compute $p(D, B_s^h|\xi)$ and $p(C|D, B_s^h, \xi)$ in closed form. In this paper, we concentrate on complete databases for the same reason. Nonetheless, the reader should recognize that, given Assumptions 1 through 4, these probabilities can be computed (in principle) for any complete or incomplete database. In practice, these probabilities can be approximated for incomplete databases by well-known statistical methods. Such methods include filling in missing data based on the data that is present (Titterington; Spiegelhalter and Lauritzen), the EM algorithm (Dempster et al.), and Gibbs sampling, i.e., Markov chain Monte Carlo methods (York; Madigan and Raftery).

Let us now explore the consequences of these assumptions. First, from the multinomial-sample assumption and the assumption of no missing data, we obtain

$$p(C_l|D_l, \Theta_{B_s}, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{1_{ijkl}}$$

where $1_{ijkl} = 1$ if $x_i = k$ and $\Pi_i = j$ in case $C_l$, and $1_{ijkl} = 0$ otherwise. Thus, if we let $N_{ijk}$ be the number of cases in database $D$ in which $x_i = k$ and $\Pi_i = j$, we have

$$p(D|\Theta_{B_s}, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$$
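The counts $N_{ijk}$ are the only contribution the data makes to the computations that follow. As an illustration (not from the paper), here is a small sketch that tabulates them for a structure given as a parent-set dictionary and a database of complete cases; the representation is assumed for the example.

```python
from collections import Counter

def sufficient_statistics(parents, cases):
    """Count N[(i, j, k)]: the number of cases in which variable i is in state k
    and i's parents are in joint instance j (encoded as a tuple of parent states)."""
    counts = Counter()
    for case in cases:                                   # each case: variable -> state
        for i, pa in parents.items():
            j = tuple(case[p] for p in sorted(pa))       # instance of Pi_i in this case
            counts[(i, j, case[i])] += 1
    return counts

# Two-binary-variable example with structure x -> y.
parents = {"x": set(), "y": {"x"}}
cases = [{"x": 0, "y": 1}, {"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 1, "y": 1}]
print(sufficient_statistics(parents, cases))
# e.g. {('x', (), 0): 2, ('x', (), 1): 2, ('y', (0,), 1): 2, ('y', (1,), 0): 1, ('y', (1,), 1): 1}
```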

From this result, it follows that the parameters $\Theta_{B_s}$ remain independent given database $D$, a property we call posterior parameter independence. In particular, from the assumption of parameter independence, we have

$$\rho(\Theta_{B_s}|D, B_s^h, \xi) = c\; p(D|\Theta_{B_s}, B_s^h, \xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \rho(\Theta_{ij}|B_s^h, \xi)$$

where $c$ is some normalization constant. Combining this expression with the likelihood above, we obtain

$$\rho(\Theta_{B_s}|D, B_s^h, \xi) = c \prod_{i=1}^{n} \prod_{j=1}^{q_i} \rho(\Theta_{ij}|B_s^h, \xi) \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$$

and posterior parameter independence follows. We note that, by this equation and the assumption of parameter modularity, parameters remain modular a posteriori as well.

Given these basic relations, we can derive a metric and a formula for $p(C|D, B_s^h, \xi)$. From the rules of probability, we have

$$p(D|B_s^h, \xi) = \prod_{l=1}^{m} p(C_l|D_l, B_s^h, \xi)$$

From this equation, we see that the Bayesian scoring metric can be viewed as a form of cross validation, where, rather than use $D \setminus \{C_l\}$ to predict $C_l$, we use only cases $C_1, \ldots, C_{l-1}$ to predict $C_l$.

Conditioning on the parameters of the network structure $B_s$, we obtain

$$p(C_l|D_l, B_s^h, \xi) = \int p(C_l|D_l, \Theta_{B_s}, B_s^h, \xi)\; \rho(\Theta_{B_s}|D_l, B_s^h, \xi)\; d\Theta_{B_s}$$

Using the case likelihood above and posterior parameter independence to rewrite the first and second terms in the integral, respectively, and interchanging integrals with products, we get

$$p(C_l|D_l, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \int \prod_{k=1}^{r_i} \theta_{ijk}^{1_{ijkl}}\; \rho(\Theta_{ij}|D_l, B_s^h, \xi)\; d\Theta_{ij}$$

When $1_{ijkl} = 1$, the integral is the expectation of $\theta_{ijk}$ with respect to the density $\rho(\Theta_{ij}|D_l, B_s^h, \xi)$. Consequently, we have

$$p(C_l|D_l, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} E(\theta_{ijk}|D_l, B_s^h, \xi)^{1_{ijkl}}$$

To compute $p(C|D, B_s^h, \xi)$, we set $l = m + 1$ and interpret $C_{m+1}$ to be $C$. To compute $p(D|B_s^h, \xi)$, we combine the two previous product formulas and rearrange products, obtaining

$$p(D|B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \prod_{l=1}^{m} E(\theta_{ijk}|C_1, \ldots, C_{l-1}, B_s^h, \xi)^{1_{ijkl}}$$

Thus, all that remains is to determine the expectations in the formulas above. Given the Dirichlet assumption (Assumption 4), this evaluation is straightforward. Combining the Dirichlet assumption with the posterior-independence relation above, we obtain

$$\rho(\Theta_{ij}|D, B_s^h, \xi) = c \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}' + N_{ijk} - 1}$$

where $c$ is another normalization constant. Note that the counts $N_{ijk}$ are a sufficient statistic for the database. Also, as we discussed in the previous section, the Dirichlet distributions are conjugate for the database: the posterior distribution of each parameter set $\Theta_{ij}$ remains in the Dirichlet family. Thus, applying the Dirichlet expectation formula to the predictive expression above, with $l = m + 1$, $C_{m+1} = C$, and $D_{m+1} = D$, we obtain

$$p(C_{m+1}|D, B_s^h, \xi) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \left( \frac{N_{ijk}' + N_{ijk}}{N_{ij}' + N_{ij}} \right)^{1_{ijk(m+1)}}$$

where

$$N_{ij}' = \sum_{k=1}^{r_i} N_{ijk}', \qquad N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$$
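As an illustration (not from the paper), this next-case probability can be computed directly from the prior exponents and the counts; the helper below assumes the count representation used in the earlier sketches and is otherwise hypothetical.

```python
def probability_of_next_case(case, parents, prior_exponents, counts):
    """p(C | D, B^h): product over nodes of (N'_ijk + N_ijk) / (N'_ij + N_ij),
    where j is the parent instance and k the state that occur in the new case."""
    prob = 1.0
    for i, pa in parents.items():
        j = tuple(case[p] for p in sorted(pa))
        k = case[i]
        states = {s for (v, inst, s) in prior_exponents if v == i and inst == j}
        num = prior_exponents[(i, j, k)] + counts.get((i, j, k), 0)
        den = sum(prior_exponents[(i, j, s)] + counts.get((i, j, s), 0) for s in states)
        prob *= num / den
    return prob

# Continuing the x -> y example, with uniform prior exponents N'_ijk = 1 (hypothetical).
prior = {("x", (), 0): 1, ("x", (), 1): 1,
         ("y", (0,), 0): 1, ("y", (0,), 1): 1,
         ("y", (1,), 0): 1, ("y", (1,), 1): 1}
counts = {("x", (), 0): 2, ("x", (), 1): 2,
          ("y", (0,), 1): 2, ("y", (1,), 0): 1, ("y", (1,), 1): 1}
print(probability_of_next_case({"x": 0, "y": 1}, {"x": set(), "y": {"x"}}, prior, counts))
# x: (1+2)/(2+4) = 0.5; y given x=0: (1+2)/(2+2) = 0.75; product = 0.375
```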

Similarly, from the expression for $p(D|B_s^h, \xi)$ above, we obtain the scoring metric (the expectations over successive cases combine into the ratios of Gamma functions shown):

$$p(D, B_s^h|\xi) = p(B_s^h|\xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N_{ij}')}{\Gamma(N_{ij}' + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk}' + N_{ijk})}{\Gamma(N_{ijk}')}$$

We call this expression the BD (Bayesian Dirichlet) metric.

As is apparent from this formula, the exponents $N_{ijk}'$, in conjunction with $p(B_s^h|\xi)$, completely specify a user's current knowledge about the domain for purposes of learning network structures. Unfortunately, the specification of $N_{ijk}'$ for all possible variable-parent configurations and for all values of $i$, $j$, and $k$ is formidable, to say the least. CH suggest a simple uninformative assignment $N_{ijk}' = 1$. We shall refer to this special case of the BD metric as the K2 metric. Buntine suggests the uninformative assignment $N_{ijk}' = N'/(r_i q_i)$. We shall examine this special case again later in the paper, where we also address the assessment of the priors on network structure, $p(B_s^h|\xi)$.
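To make the metric concrete, here is a small sketch (not from the paper) of the structure-dependent part of the BD score in log form, computed from prior exponents and counts with math.lgamma; with all exponents set to 1 it reduces to the K2 special case mentioned above. The data layout follows the earlier sketches and is an assumption of the example.

```python
from math import lgamma
from collections import defaultdict

def log_bd_score(prior_exponents, counts):
    """log of prod_ij [Gamma(N'_ij)/Gamma(N'_ij+N_ij)] prod_k [Gamma(N'_ijk+N_ijk)/Gamma(N'_ijk)].

    prior_exponents: dict (i, j, k) -> N'_ijk (must cover every state k of each family)
    counts:          dict (i, j, k) -> N_ijk  (missing entries mean a count of zero)
    """
    family_prior = defaultdict(float)   # N'_ij
    family_count = defaultdict(float)   # N_ij
    score = 0.0
    for (i, j, k), n_prime in prior_exponents.items():
        n = counts.get((i, j, k), 0)
        family_prior[(i, j)] += n_prime
        family_count[(i, j)] += n
        score += lgamma(n_prime + n) - lgamma(n_prime)
    for ij in family_prior:
        score += lgamma(family_prior[ij]) - lgamma(family_prior[ij] + family_count[ij])
    return score

# K2-style exponents (N'_ijk = 1) for the x -> y example used earlier.
prior = {("x", (), 0): 1.0, ("x", (), 1): 1.0,
         ("y", (0,), 0): 1.0, ("y", (0,), 1): 1.0,
         ("y", (1,), 0): 1.0, ("y", (1,), 1): 1.0}
counts = {("x", (), 0): 2, ("x", (), 1): 2,
          ("y", (0,), 1): 2, ("y", (1,), 0): 1, ("y", (1,), 1): 1}
print(log_bd_score(prior, counts))   # relative log score for the structure x -> y
```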

Acausal networks, causal networks, and likelihood equivalence

In this section, we examine another assumption for learning Bayesian networks that has been previously overlooked.

Before we do so, it is important to distinguish between acausal and causal Bayesian networks. Although Bayesian networks have been formally described as a representation of conditional independence, as we noted in the Background section, people often construct them using notions of cause and effect. Recently, several researchers have begun to explore a formal causal semantics for Bayesian networks (e.g., Pearl and Verma; Pearl; Spirtes et al.; Druzdzel and Simon; and Heckerman and Shachter). They argue that the representation of causal knowledge is important not only for assessment, but for prediction as well. In particular, they argue that causal knowledge, unlike statistical knowledge, allows one to derive beliefs about a domain after intervention. For example, most of us believe that smoking causes lung cancer. From this knowledge, we infer that if we stop smoking, then we decrease our chances of getting lung cancer. In contrast, if we knew only that there was a statistical correlation between smoking and lung cancer, then we could not make this inference. The formal semantics of cause and effect proposed by these researchers is not important for this discussion. The interested reader should consult the references given.

First, let us consider acausal networks. Recall our assumption that the hypothesis $B_s^h$ is true if and only if the database $D$ is a collection of multinomial samples determined by the network structure $B_s$. This assumption is equivalent to saying that (1) the database $D$ is a multinomial sample from the joint space of $U$ with parameters $\Theta_U$, and (2) the hypothesis $B_s^h$ is true if and only if the parameters $\Theta_U$ satisfy the conditional-independence assertions of $B_s$. We can think of condition (2) as a definition of the hypothesis $B_s^h$.

For example, in our two-binary-variable domain, regardless of which hypothesis is true, we may assert that the database is a multinomial sample from the joint space $U = \{x, y\}$ with parameters $\Theta_U = \{\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}, \theta_{\bar{x}\bar{y}}\}$. Furthermore, given the hypothesis $B_{x \rightarrow y}^h$, for example, we know that the parameters $\Theta_U$ are unconstrained (except that they must sum to one), because the network structure $x \rightarrow y$ represents no assertions of conditional independence. In contrast, given the hypothesis $B_{xy}^h$, we know that the parameters $\Theta_U$ must satisfy the independence constraints $\theta_{xy} = (\theta_{xy} + \theta_{x\bar{y}})(\theta_{xy} + \theta_{\bar{x}y})$, and so on.

Given this definition of $B_s^h$ for acausal Bayesian networks, it follows that if two network structures $B_{s1}$ and $B_{s2}$ are equivalent, then $B_{s1}^h = B_{s2}^h$. For example, in our two-variable domain, both the hypotheses $B_{x \rightarrow y}^h$ and $B_{x \leftarrow y}^h$ assert that there are no constraints on the parameters $\Theta_U$. Consequently, we have $B_{x \rightarrow y}^h = B_{x \leftarrow y}^h$. In general, we call this property hypothesis equivalence.

(One technical flaw with this definition of $B_s^h$ is that hypotheses are not mutually exclusive. For example, in our two-variable domain, the hypotheses $B_{x \rightarrow y}^h$ and $B_{xy}^h$ both include the possibility $\theta_y = \theta_{y|x}$. This flaw is potentially troublesome because mutual exclusivity is important for our Bayesian interpretation of network learning. Nonetheless, because the densities $\rho(\Theta_{B_s}|B_s^h, \xi)$ must be integrable and hence bounded, the overlap of hypotheses will be of measure zero, and we may use the averaging relation above without modification. For example, in our two-binary-variable domain, given the hypothesis $B_{x \rightarrow y}^h$, the probability that $B_{xy}^h$ is true, i.e., that $\theta_y = \theta_{y|x}$, has measure zero.)

In light of this property, we should associate each hypothesis with an equivalence class of structures rather than a single network structure. Also, given the property of hypothesis equivalence, we have prior equivalence: if network structures $B_{s1}$ and $B_{s2}$ are equivalent, then $p(B_{s1}^h|\xi) = p(B_{s2}^h|\xi)$; likelihood equivalence: if $B_{s1}$ and $B_{s2}$ are equivalent, then for all databases $D$, $p(D|B_{s1}^h, \xi) = p(D|B_{s2}^h, \xi)$; and score equivalence: if $B_{s1}$ and $B_{s2}$ are equivalent, then $p(D, B_{s1}^h|\xi) = p(D, B_{s2}^h|\xi)$.

Now let us consider causal networks. For these networks, the assumption of hypothesis equivalence is unreasonable. In particular, for causal networks, we must modify the definition of $B_s^h$ to include the assertion that each nonroot node in $B_s$ is a direct causal effect of its parents. For example, in our two-variable domain, the causal networks $x \rightarrow y$ and $x \leftarrow y$ represent the same constraints on $\Theta_U$ (i.e., none), but the former also asserts that $x$ causes $y$, whereas the latter asserts that $y$ causes $x$. Thus, the hypotheses $B_{x \rightarrow y}^h$ and $B_{x \leftarrow y}^h$ are not equal. Indeed, it is reasonable to assume that these hypotheses, and the hypotheses associated with any two different causal-network structures, are mutually exclusive.

Nonetheless, for many real-world problems that we have encountered, we have found it reasonable to assume likelihood equivalence. That is, we have found it reasonable to assume that data cannot distinguish between equivalent network structures. Of course, for any given problem, it is up to the decision maker to assume likelihood equivalence or not. In a later section, we describe a characterization of likelihood equivalence that suggests a simple procedure for deciding whether the assumption is reasonable in a given domain.

Because the assumption of likelihood equivalence is appropriate for learning acausal networks in all domains and for learning causal networks in many domains, we adopt this assumption in our remaining treatment of scoring metrics. As we have stated it, likelihood equivalence says that, for any database $D$, the probability of $D$ is the same given hypotheses corresponding to any two equivalent network structures. From our discussion of multinomial sampling, however, we may also state likelihood equivalence in terms of $\Theta_U$.

Assumption 6 (Likelihood Equivalence)  Given two network structures $B_{s1}$ and $B_{s2}$ such that $p(B_{s1}^h|\xi) > 0$ and $p(B_{s2}^h|\xi) > 0$, if $B_{s1}$ and $B_{s2}$ are equivalent, then $\rho(\Theta_U|B_{s1}^h, \xi) = \rho(\Theta_U|B_{s2}^h, \xi)$.

(Using the same convention as for the Dirichlet distribution, we write $\rho(\Theta_U|B_s^h, \xi)$ to denote a density over a set of the nonredundant parameters in $\Theta_U$.)

We close this section with a few additional remarks about inferring causal relationships.

Given the distinction between statistical and causal dependence, it would seem impossible to learn causal networks from data produced by observation alone. For example, consider the simple three-variable domain $U = \{x_1, x_2, x_3\}$. If we find, through the observation of data, that the network structure $x_1 \rightarrow x_3 \leftarrow x_2$ is very likely, then we cannot conclude that $x_1$ and $x_2$ are causes for $x_3$. Rather, it may be the case that there is a hidden common cause of $x_1$ and $x_3$ as well as a hidden common cause of $x_2$ and $x_3$. If, however, we assume that every statistical association derives from causal interaction and that there are no hidden common causes, then we can interpret learned networks as causal networks. In our example, under these assumptions, we can infer that $x_1$ and $x_2$ are causes for $x_3$. (We note that, in some circumstances, we can identify causes and effects from network structure even when there are hidden common causes; see Pearl for a discussion.)

Under the assumption of likelihood equivalence, the ratio of posterior probabilities of two equivalent network structures must be equal to the ratio of their prior probabilities. Consequently, if the priors on network structures are not too different, then typically learning will produce many equivalent network structures, each having a large relative posterior probability. Furthermore, even for domains where the assumption of likelihood equivalence does not hold, there is a good chance that more than one hypothesis will have a large relative posterior probability. In such situations, we find it reasonable to average the causal assertions contained in individual learned networks. For example, in our three-variable domain, let us suppose that the data supports only the network structure $x_1 \rightarrow x_2 \rightarrow x_3$ and its equivalent cousins $x_1 \leftarrow x_2 \rightarrow x_3$ and $x_1 \leftarrow x_2 \leftarrow x_3$. If each of the hypotheses corresponding to these structures has the same prior probability, then the posterior probability of each hypothesis will be 1/3, and we infer that the proposition that $x_2$ causes $x_3$ has probability 2/3. Under these same conditions, the proposition that both $x_1$ and $x_2$ are causes of $x_3$ has probability 1/3.

The BDe Metric

The assumption of likelihood equivalence, when combined with the previous assumptions, introduces constraints on the Dirichlet exponents $N_{ijk}'$. The result is a likelihood-equivalent specialization of the BD metric, which we call the BDe metric. In this section, we derive this metric. In addition, we show that, as a consequence of the exponent constraints, the user may construct an informative prior for the parameters of all network structures merely by building a Bayesian network for the next case to be seen and by assessing an equivalent sample size. Most remarkable, we show that the Dirichlet assumption (Assumption 4) is not needed to obtain the BDe metric.

Informative Priors

In this section, we show how the added assumption of likelihood equivalence simplifies the construction of informative priors.

Before we do so, we need to define the concept of a complete network structure. A complete network structure is one that has no missing edges; that is, it encodes no assertions of conditional independence. In a domain with $n$ variables, there are $n!$ complete network structures. An important property of complete network structures is that all such structures for a given domain are equivalent.

Now, for a given domain $U$, suppose we have assessed the density $\rho(\Theta_U|B_{sc}^h, \xi)$, where $B_{sc}$ is some complete network structure for $U$. Given parameter independence, parameter modularity, likelihood equivalence, and one additional assumption, it turns out that we can compute the prior $\rho(\Theta_{B_s}|B_s^h, \xi)$ for any network structure $B_s$ in $U$ from the given density.

To see how this computation is done, consider again our two-binary-variable domain. Suppose we are given a density for the parameters of the joint space, $\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$. From this density, we construct the parameter densities for each of the three network structures in the domain. First, consider the network structure $x \rightarrow y$. A parameter set for this network structure is $\{\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}\}$. These parameters are related to the parameters of the joint space by the following relations:

$$\theta_{xy} = \theta_x\, \theta_{y|x}, \qquad \theta_{x\bar{y}} = \theta_x (1 - \theta_{y|x}), \qquad \theta_{\bar{x}y} = (1 - \theta_x)\, \theta_{y|\bar{x}}$$

Thus, we may obtain $\rho(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}|B_{x \rightarrow y}^h, \xi)$ from the given density by changing variables:

$$\rho(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}|B_{x \rightarrow y}^h, \xi) = J_{x \rightarrow y}\; \rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$$

where $J_{x \rightarrow y}$ is the Jacobian of the transformation:

$$J_{x \rightarrow y} = \left| \frac{\partial(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y})}{\partial(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}})} \right| = \theta_x (1 - \theta_x)$$

The Jacobian $J_{B_{sc}}$ for the transformation from $\Theta_U$ to $\Theta_{B_{sc}}$, where $B_{sc}$ is an arbitrary complete network structure, is given in the Appendix.

Next, consider the network structure $x \leftarrow y$. Assuming that the hypothesis $B_{x \leftarrow y}^h$ is also possible, we obtain $\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \leftarrow y}^h, \xi) = \rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$ by likelihood equivalence. Therefore, we can compute the density for the network structure $x \leftarrow y$ using the Jacobian $J_{x \leftarrow y} = \theta_y (1 - \theta_y)$.

Finally, consider the empty network structure. Given the assumption of parameter independence, we may obtain the densities $\rho(\theta_x|B_{xy}^h, \xi)$ and $\rho(\theta_y|B_{xy}^h, \xi)$ separately.

To obtain the density for $\theta_x$, we first extract $\rho(\theta_x|B_{x \rightarrow y}^h, \xi)$ from the density for the network structure $x \rightarrow y$. This extraction is straightforward because, by parameter independence, the parameters for $x \rightarrow y$ must be independent. Then, we use parameter modularity, which says that $\rho(\theta_x|B_{xy}^h, \xi) = \rho(\theta_x|B_{x \rightarrow y}^h, \xi)$. To obtain the density for $\theta_y$, we extract $\rho(\theta_y|B_{x \leftarrow y}^h, \xi)$ from the density for the network structure $x \leftarrow y$, and again apply parameter modularity. The approach is summarized in Figure 5.

Figure 5: A computation of the parameter densities for the three network structures of the two-binary-variable domain $\{x, y\}$. The approach computes the densities from $\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi)$ using likelihood equivalence, parameter independence, and parameter modularity.

In this construction, it is important that both hypotheses $B_{x \rightarrow y}^h$ and $B_{x \leftarrow y}^h$ have nonzero prior probabilities, lest we could not make use of likelihood equivalence to obtain the parameter densities for the empty structure. In order to take advantage of likelihood equivalence in general, we adopt the following assumption.

Assumption 7 (Structure Possibility)  Given a domain $U$, $p(B_{sc}^h|\xi) > 0$ for all complete network structures $B_{sc}$.

Note that, in the context of acausal Bayesian networks, there is only one hypothesis corresponding to the equivalence class of complete network structures. In this case, Assumption 7 says that this single hypothesis is possible. In the context of causal Bayesian networks, the assumption implies that each of the $n!$ complete network structures is possible. Although we make the assumption of structure possibility as a matter of convenience, we have found it to be reasonable in many real-world network-learning problems.

Given this assumption, we can now describe our construction method in general.

Theorem 2  Given domain $U$ and a probability density $\rho(\Theta_U|B_{sc}^h, \xi)$, where $B_{sc}$ is some complete network structure for $U$, the assumptions of parameter independence (Assumption 2), parameter modularity (Assumption 3), likelihood equivalence (Assumption 6), and structure possibility (Assumption 7) uniquely determine $\rho(\Theta_{B_s}|B_s^h, \xi)$ for any network structure $B_s$ in $U$.

Proof: Consider any $B_s$. By Assumption 2, if we determine $\rho(\Theta_{ij}|B_s^h, \xi)$ for every parameter set $\Theta_{ij}$ associated with $B_s$, then we determine $\rho(\Theta_{B_s}|B_s^h, \xi)$. So, consider a particular $\Theta_{ij}$. Let $\Pi_i$ be the parents of $x_i$ in $B_s$, and let $B_{sc1}$ be a complete belief-network structure with variable ordering $\Pi_i$, $x_i$, followed by the remaining variables. First, using Assumption 7, we recognize that the hypothesis $B_{sc1}^h$ is possible. Consequently, we use Assumption 6 to obtain $\rho(\Theta_U|B_{sc1}^h, \xi) = \rho(\Theta_U|B_{sc}^h, \xi)$. Next, we change variables from $\Theta_U$ to $\Theta_{B_{sc1}}$, yielding $\rho(\Theta_{B_{sc1}}|B_{sc1}^h, \xi)$. Using parameter independence, we then extract the density $\rho(\Theta_{ij}|B_{sc1}^h, \xi)$ from $\rho(\Theta_{B_{sc1}}|B_{sc1}^h, \xi)$. Finally, because $x_i$ has the same parents in $B_s$ and $B_{sc1}$, we apply parameter modularity to obtain the desired density $\rho(\Theta_{ij}|B_s^h, \xi) = \rho(\Theta_{ij}|B_{sc1}^h, \xi)$. To show uniqueness, we note that the only freedom we have in choosing $B_{sc1}$ is that the parents of $x_i$ can be shuffled with one another, and nodes following $x_i$ in the ordering can be shuffled with one another. The Jacobian of the change of variables from $\Theta_U$ to $\Theta_{B_{sc1}}$ has the same terms in $\Theta_{ij}$ regardless of our choice.

Consistency and the BDe Metric

In our procedure for generating priors, we cannot use an arbitrary density $\rho(\Theta_U|B_{sc}^h, \xi)$. In our two-variable domain, for example, suppose we use the density

$$\rho(\theta_{xy}, \theta_{x\bar{y}}, \theta_{\bar{x}y}|B_{x \rightarrow y}^h, \xi) = \frac{c}{(\theta_{xy} + \theta_{x\bar{y}})(\theta_{\bar{x}y} + \theta_{\bar{x}\bar{y}})} = \frac{c}{\theta_x (1 - \theta_x)}$$

where $c$ is a normalization constant. Then, using the change of variables described in the previous section, we obtain

$$\rho(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}|B_{x \rightarrow y}^h, \xi) = c$$

for the network structure $x \rightarrow y$, which satisfies parameter independence and the Dirichlet assumption. For the network structure $x \leftarrow y$, however, we have

$$\rho(\theta_y, \theta_{x|y}, \theta_{x|\bar{y}}|B_{x \leftarrow y}^h, \xi) = \frac{c\, \theta_y (1 - \theta_y)}{\theta_x (1 - \theta_x)} = \frac{c\, \theta_y (1 - \theta_y)}{\left[\theta_y \theta_{x|y} + (1 - \theta_y)\theta_{x|\bar{y}}\right]\left[\theta_y (1 - \theta_{x|y}) + (1 - \theta_y)(1 - \theta_{x|\bar{y}})\right]}$$

This density satisfies neither parameter independence nor the Dirichlet assumption.

In general, if we do not choose $\rho(\Theta_U|B_{sc}^h, \xi)$ carefully, we may not satisfy both parameter independence and the Dirichlet assumption. Indeed, the question arises: Is there any

  • Learning Bayesian Networks MSRTR

    choice for U jBhsc that is consistent with these assumptions& The following theorem

    and corollary answers this question in the a"rmative In the remainder of Section

    we require additional notation We use XkX jYkY

    to denote the multinomial parameter

    corresponding to the probability pX kX jY kY X and Y may be single variables

    and kX and kY are often implicit Also we use XjYkY to denote the set of multinomial

    parameters corresponding to the probability distribution pX jY kY and XjY to

    denote the parameters XjYkY

    for all instances of kY When Y is empty we omit the

    conditioning bar

    Theorem Given a domain U fx xng with multinomial parameters U if the

    density U j is Dirichletthat is if

    U j c Y

    xxn

    x xn N xxn

    then for any complete network structure Bsc in U the density Bscj is Dirichlet and

    satises parameter independence In particular

    Bscj c nYi

    Yxxi

    xijxxi N xijx xi

    where

    N xijxxi X

    xixn

    N xxn

Proof: Let B_sc be any complete network structure for U. Reorder the variables in U so that the ordering matches this structure, and relabel the variables x_1, ..., x_n. Now change variables from Θ_{x_1,...,x_n} to Θ_Bsc, using the Jacobian for this transformation given earlier. The dimension of this transformation is (∏_{i=1}^n r_i) − 1, where r_i is the number of instances of x_i. Substituting the relationship θ_{x_1,...,x_n} = ∏_{i=1}^n θ_{x_i | x_1,...,x_{i-1}} and multiplying by the Jacobian, we obtain

    \rho(\Theta_{B_{sc}} \mid \xi) = c \left[ \prod_{x_1, \ldots, x_n} \left( \prod_{i=1}^{n} \theta_{x_i \mid x_1, \ldots, x_{i-1}} \right)^{N'_{x_1, \ldots, x_n} - 1} \right] \left[ \prod_{i=1}^{n} \prod_{x_1, \ldots, x_i} \theta_{x_i \mid x_1, \ldots, x_{i-1}}^{\left(\prod_{j=i+1}^{n} r_j\right) - 1} \right]

which implies the first equation of the theorem. Collecting the powers of θ_{x_i | x_1,...,x_{i-1}} and using ∏_{j=i+1}^n r_j = Σ_{x_{i+1},...,x_n} 1, we obtain the second. □

Corollary. Let U be a domain with multinomial parameters Θ_U, and let B_sc be a complete network structure for U such that p(B^h_sc | ξ) > 0. If ρ(Θ_U | B^h_sc, ξ) is Dirichlet, then ρ(Θ_Bsc | B^h_sc, ξ) is Dirichlet and satisfies parameter independence.
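The following short simulation is our own illustrative sketch (not part of the original analysis) of the preceding theorem and corollary for a two-binary-variable domain: sampling the joint parameters from an assumed Dirichlet and transforming to the parameters of the complete structure x → y yields mutually independent, Beta-distributed components. The exponent values below are arbitrary choices made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed joint Dirichlet exponents N'_{xy}, N'_{x~y}, N'_{~xy}, N'_{~x~y}.
alpha = np.array([3.0, 1.0, 2.0, 4.0])
joint = rng.dirichlet(alpha, size=100_000)       # columns: xy, x~y, ~xy, ~x~y

# Change of variables to the parameters of the complete structure x -> y.
theta_x       = joint[:, 0] + joint[:, 1]
theta_y_x     = joint[:, 0] / theta_x            # theta_{y|x}
theta_y_not_x = joint[:, 2] / (joint[:, 2] + joint[:, 3])

# Parameter independence: the sample correlation matrix is close to identity.
samples = np.column_stack([theta_x, theta_y_x, theta_y_not_x])
print(np.corrcoef(samples, rowvar=False).round(3))

# Dirichlet marginals: e.g., theta_x ~ Beta(N'_x, N'_~x) with N'_x = 3 + 1.
print(theta_x.mean(), 4 / (4 + 6))               # both approximately 0.4
```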


Given these results, we can compute the Dirichlet exponents N'_ijk using a Dirichlet distribution for ρ(Θ_U | B^h_sc, ξ) in conjunction with our method for constructing priors described at the beginning of this section. Namely, suppose we desire the exponent N'_ijk for a network structure where x_i has parents Π_i. Let B'_sc be a complete network structure where x_i has these parents. By likelihood equivalence, we have ρ(Θ_U | B'^h_sc, ξ) = ρ(Θ_U | B^h_sc, ξ). As we discussed earlier, we may write the exponents for ρ(Θ_U | B^h_sc, ξ) as follows:

    N'_{x_1, \ldots, x_n} = N' \, p(x_1, \ldots, x_n \mid B^h_{sc}, \xi)

where N' is the user's equivalent sample size for ρ(Θ_U | B^h_sc, ξ). Furthermore, by definition, N'_ijk is the Dirichlet exponent for θ_ijk in B'_sc. Consequently, combining these relations, we have

    N'_{ijk} = N' \, p(x_i = k, \Pi_i = j \mid B^h_{sc}, \xi)

We call the BD metric with this restriction on the exponents the BDe metric ("e" for likelihood equivalence). To summarize, we have the following theorem.

Theorem (BDe Metric). Given domain U, suppose that ρ(Θ_U | B^h_sc, ξ) is Dirichlet with equivalent sample size N' for some complete network structure B_sc in U. Then, for any network structure B_s in U, the assumptions described in the previous sections imply

    p(D, B_s^h \mid \xi) = p(B_s^h \mid \xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}

where

    N'_{ijk} = N' \, p(x_i = k, \Pi_i = j \mid B^h_{sc}, \xi)
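As an illustration of how the metric can be computed in practice, the following is a minimal sketch (our own code, not the authors' implementation). It assumes complete discrete data and, for simplicity, takes the prior joint distribution p(U | B^h_sc, ξ) as an explicit table over full instantiations; in practice that distribution would be obtained by inference in the user's prior network. All function and variable names are hypothetical.

```python
from itertools import product
from math import lgamma

def bde_log_likelihood(data, parents, cardinality, prior_joint, n_prime):
    """Return log p(D | B_s^h, xi) for the structure given by `parents`.

    data        : list of complete cases, each a dict {variable: state index}
    parents     : dict {variable: tuple of parent variables}
    cardinality : dict {variable: number of states}; states are 0 .. r_i - 1
    prior_joint : dict {tuple of states ordered by sorted variable name: probability}
                  (assumed strictly positive for every full instantiation)
    n_prime     : equivalent sample size N'
    """
    variables = sorted(cardinality)
    log_score = 0.0
    for x_i, pa in parents.items():
        for j in product(*[range(cardinality[p]) for p in pa]):   # parent configuration
            n_prime_ij = n_ij = 0.0
            inner = 0.0
            for k in range(cardinality[x_i]):
                # N'_ijk = N' * p(x_i = k, Pa_i = j | B^h_sc, xi)
                p_ijk = sum(prob for inst, prob in prior_joint.items()
                            if dict(zip(variables, inst))[x_i] == k
                            and all(dict(zip(variables, inst))[p] == v
                                    for p, v in zip(pa, j)))
                n_prime_ijk = n_prime * p_ijk
                # N_ijk = number of cases with x_i = k and Pa_i = j
                n_ijk = sum(1 for case in data
                            if case[x_i] == k
                            and all(case[p] == v for p, v in zip(pa, j)))
                inner += lgamma(n_prime_ijk + n_ijk) - lgamma(n_prime_ijk)
                n_prime_ij += n_prime_ijk
                n_ij += n_ijk
            log_score += lgamma(n_prime_ij) - lgamma(n_prime_ij + n_ij) + inner
    return log_score
```

Adding log p(B^h_s | ξ) to this quantity gives the relative log posterior used when comparing structures.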

The preceding theorem shows that parameter independence, likelihood equivalence, structure possibility, and the Dirichlet assumption are consistent for complete network structures. Nonetheless, these assumptions and the assumption of parameter modularity may not be consistent for all network structures. To understand the potential for inconsistency, note that we obtained the BDe metric for all network structures using likelihood equivalence applied only to complete network structures, in combination with the other assumptions. Thus, it could be that the BDe metric for incomplete network structures is not likelihood equivalent. Nonetheless, the following theorem shows that the BDe metric is likelihood equivalent for all network structures; that is, given the other assumptions, likelihood equivalence for incomplete structures is implied by likelihood equivalence for complete network structures. Consequently, our assumptions are consistent.

Theorem. For all domains U and all network structures B_s in U, the BDe metric is likelihood equivalent.


Proof: Given a database D, equivalent sample size N', joint probability distribution p(U | B^h_sc, ξ), and a subset X of U, consider the following function of X:

    l(X) = \prod_{k_X} \frac{\Gamma\!\left(N' p(X = k_X \mid B^h_{sc}, \xi) + N_{k_X}\right)}{\Gamma\!\left(N' p(X = k_X \mid B^h_{sc}, \xi)\right)}

where k_X is an instance of X, and N_{k_X} is the number of cases in D in which X = k_X. Then the likelihood term of the BDe metric becomes

    p(D \mid B_s^h, \xi) = \prod_{i=1}^{n} \frac{l(\{x_i\} \cup \Pi_i)}{l(\Pi_i)}

Now, by the theorem on arc reversals given earlier, we know that a network structure can be transformed into an equivalent structure by a series of arc reversals. Thus, we can demonstrate that the BDe metric satisfies likelihood equivalence in general if we can do so for the case where two equivalent structures differ by a single arc reversal. So let B_s1 and B_s2 be two equivalent network structures that differ only in the direction of the arc between x_i and x_j (say, x_i → x_j in B_s1). Let R be the set of parents of x_i in B_s1. By the same theorem, we know that R ∪ {x_i} is the set of parents of x_j in B_s1, R is the set of parents of x_j in B_s2, and R ∪ {x_j} is the set of parents of x_i in B_s2. Because the two structures differ only in the reversal of a single arc, the only terms in the product above that can differ are those involving x_i and x_j. For B_s1, these terms are

    \frac{l(\{x_i\} \cup R)}{l(R)} \cdot \frac{l(\{x_i, x_j\} \cup R)}{l(\{x_i\} \cup R)}

whereas for B_s2 they are

    \frac{l(\{x_j\} \cup R)}{l(R)} \cdot \frac{l(\{x_i, x_j\} \cup R)}{l(\{x_j\} \cup R)}

Both products reduce to l({x_i, x_j} ∪ R) / l(R); hence they are equal, and p(D | B^h_s1, ξ) = p(D | B^h_s2, ξ). □

We note that Buntine's metric is a special case of the BDe metric where every instance of the joint space, conditioned on B^h_sc, is equally likely. We call this special case the BDeu metric ("u" for uniform joint distribution). Buntine noted that this metric satisfies the property of likelihood equivalence.

    The Prior Network

To calculate the terms in the BDe metric, or to construct informative priors for a more general metric that can handle missing data, we need priors on network structures p(B^h_s | ξ) and the Dirichlet distribution ρ(Θ_U | B^h_sc, ξ). In a later section, we provide a simple method for assessing priors on network structures. Here, we concentrate on the assessment of the Dirichlet distribution for Θ_U.


Recall from earlier sections that we can assess this distribution by assessing a single equivalent sample size N' for the domain and the joint distribution of the domain for the next case to be seen, p(U | B^h_sc, ξ), where both assessments are conditioned on the state of information B^h_sc. As we have discussed, the assessment of equivalent sample size is straightforward. Furthermore, a user can assess p(U | B^h_sc, ξ) by building a Bayesian network for U given B^h_sc. We call this network the user's prior network.

The unusual aspect of this assessment is the conditioning hypothesis B^h_sc. Whether we are dealing with acausal or causal Bayesian networks, this hypothesis includes the assertion that there are no independencies in the long run. Thus, at first glance, there seems to be a contradiction in asking the user to construct a prior network (which may contain assertions of independence) under the assertion that B^h_sc is true. Nonetheless, there is no contradiction, because the assertions of independence in the prior network refer to independencies in the next case to be seen, whereas the assertion of full dependence B^h_sc refers to the long run.

To help illustrate this point, let us consider the following acausal example. Suppose a person repeatedly rolls a four-sided die with labels 1, 2, 3, and 4. In addition, suppose that he repeatedly does one of the following: (1) rolls the die once and reports x = true iff the die lands 1 or 2 and y = true iff the die lands 1 or 3, or (2) rolls the die twice and reports x = true iff the die lands 1 or 2 on the first roll and y = true iff the die lands 1 or 2 on the second roll. In either case, the multinomial assumption is reasonable. Furthermore, condition (2) corresponds to the hypothesis that x and y are independent in the long run, whereas condition (1) corresponds to the hypothesis B^h_xy ∨ B^h_yx: x and y are dependent in the long run. Also, given these correspondences, parameter modularity and likelihood equivalence are reasonable. Finally, let us suppose that the parameters of the multinomial sample have a Dirichlet distribution, so that parameter independence holds. Thus, this example fits the assumptions of our learning approach. Now, if we have no reason to prefer one outcome of the die to another on the next roll, then we will have p(y | x, B^h_xy, ξ) = p(y | B^h_xy, ξ). That is, our prior network will contain no arc between x and y, even though, given B^h_xy, x and y are almost certainly dependent in the long run. The Monte-Carlo sketch below makes this contrast concrete.
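The following Monte-Carlo sketch is our own illustration, not from the paper. It adopts the single-roll condition as described above (x = true iff the die lands 1 or 2, y = true iff it lands 1 or 3) together with an assumed symmetric Dirichlet prior over the die's bias; both choices are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Long run: for a *sampled* bias theta over faces 1..4, x and y are
# (almost surely) dependent.
theta = rng.dirichlet([1.0, 1.0, 1.0, 1.0])
p_xy = theta[0]                          # p(x, y) = p(face 1)
p_x  = theta[0] + theta[1]               # p(x)    = p(1 or 2)
p_y  = theta[0] + theta[2]               # p(y)    = p(1 or 3)
print("long run :", p_xy, "vs", p_x * p_y)               # differ almost surely

# Next case: averaging over the symmetric prior on theta, independence holds,
# so the prior network contains no arc between x and y.
thetas = rng.dirichlet([1.0, 1.0, 1.0, 1.0], size=200_000)
print("next case:", thetas[:, 0].mean(), "vs", 0.5 * 0.5) # both ~ 1/4
```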

We expect that most users would prefer to construct a prior network without having to condition on B^h_sc. In the previous example, it is possible to ignore the conditioning hypothesis, because p(U | B^h_xy, ξ) = p(U | B^h_yx, ξ) = p(U | ξ). In general, however, a user cannot ignore this hypothesis. In our four-sided-die example, the joint distributions p(U | B^h_xy, ξ) and p(U | ξ) would have been different had we not been indifferent about the die outcomes. (Footnote: Actually, as we have discussed, B^h_xy includes the possibility that x and y are independent, but only with a probability of measure zero.)


We have had little experience with training people to condition on B^h_sc when constructing a prior network. Nonetheless, stories like the four-sided-die example may help users make the necessary distinction for assessment.

    A Simple Example

Consider again our two-binary-variable domain. Let B_xy and B_yx denote the network structures where x points to y and y points to x, respectively. Suppose that the user assesses an equivalent sample size N' and a prior network that gives the joint distribution p(x, y | B^h_xy, ξ), p(x, ȳ | B^h_xy, ξ), p(x̄, y | B^h_xy, ξ), and p(x̄, ȳ | B^h_xy, ξ). Also suppose we observe two cases: C_1 = {x, y} and C_2 = {x, ȳ}. Let i = 1, 2 refer to variables x and y, respectively, and let k = 1, 2 denote the true and false states of a variable. Thus, for the network structure x → y, we have the Dirichlet exponents

    N'_{111} = N' p(x \mid B^h_{xy}, \xi), \quad N'_{112} = N' p(\bar{x} \mid B^h_{xy}, \xi), \quad N'_{211} = N' p(x, y \mid B^h_{xy}, \xi),
    N'_{212} = N' p(x, \bar{y} \mid B^h_{xy}, \xi), \quad N'_{221} = N' p(\bar{x}, y \mid B^h_{xy}, \xi), \quad N'_{222} = N' p(\bar{x}, \bar{y} \mid B^h_{xy}, \xi)

and the sufficient statistics N_{111} = 2, N_{112} = 0, N_{211} = 1, N_{212} = 1, N_{221} = 0, and N_{222} = 0. Consequently, we obtain

    p(D \mid B^h_{xy}, \xi) = \frac{\Gamma(N'_{11})}{\Gamma(N'_{11} + 2)} \cdot \frac{\Gamma(N'_{111} + 2)}{\Gamma(N'_{111})} \cdot \frac{\Gamma(N'_{112})}{\Gamma(N'_{112})} \cdot \frac{\Gamma(N'_{21})}{\Gamma(N'_{21} + 2)} \cdot \frac{\Gamma(N'_{211} + 1)}{\Gamma(N'_{211})} \cdot \frac{\Gamma(N'_{212} + 1)}{\Gamma(N'_{212})} \cdot \frac{\Gamma(N'_{22})}{\Gamma(N'_{22})}

where N'_{ij} = Σ_k N'_{ijk}.

For the network structure y → x, we have the Dirichlet exponents

    N'_{111} = N' p(x, y \mid B^h_{xy}, \xi), \quad N'_{112} = N' p(\bar{x}, y \mid B^h_{xy}, \xi), \quad N'_{121} = N' p(x, \bar{y} \mid B^h_{xy}, \xi),
    N'_{122} = N' p(\bar{x}, \bar{y} \mid B^h_{xy}, \xi), \quad N'_{211} = N' p(y \mid B^h_{xy}, \xi), \quad N'_{212} = N' p(\bar{y} \mid B^h_{xy}, \xi)

and the sufficient statistics N_{111} = 1, N_{112} = 0, N_{121} = 1, N_{122} = 0, N_{211} = 1, and N_{212} = 1. Consequently, we have

    p(D \mid B^h_{yx}, \xi) = \frac{\Gamma(N'_{21})}{\Gamma(N'_{21} + 2)} \cdot \frac{\Gamma(N'_{211} + 1)}{\Gamma(N'_{211})} \cdot \frac{\Gamma(N'_{212} + 1)}{\Gamma(N'_{212})} \cdot \frac{\Gamma(N'_{11})}{\Gamma(N'_{11} + 1)} \cdot \frac{\Gamma(N'_{111} + 1)}{\Gamma(N'_{111})} \cdot \frac{\Gamma(N'_{112})}{\Gamma(N'_{112})} \cdot \frac{\Gamma(N'_{12})}{\Gamma(N'_{12} + 1)} \cdot \frac{\Gamma(N'_{121} + 1)}{\Gamma(N'_{121})} \cdot \frac{\Gamma(N'_{122})}{\Gamma(N'_{122})}

Because the exponents for both structures are derived from the same joint distribution, the two expressions are equal: as required, the BDe metric exhibits the property of likelihood equivalence.

In contrast, the K2 metric (all N'_{ijk} = 1) does not satisfy this property. In particular, given the same database, we have

    p(D \mid B^h_{xy}, \xi) = \left[ \frac{\Gamma(2)}{\Gamma(4)} \cdot \frac{\Gamma(3)}{\Gamma(1)} \cdot \frac{\Gamma(1)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(4)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(2)} \right] = \frac{1}{3} \cdot \frac{1}{6} = \frac{1}{18}

    p(D \mid B^h_{yx}, \xi) = \left[ \frac{\Gamma(2)}{\Gamma(4)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(3)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(1)}{\Gamma(1)} \right] \left[ \frac{\Gamma(2)}{\Gamma(3)} \cdot \frac{\Gamma(2)}{\Gamma(1)} \cdot \frac{\Gamma(1)}{\Gamma(1)} \right] = \frac{1}{6} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{24}
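The comparison can be checked numerically with the short sketch below (our own code). Because the numerical values of the prior network are not essential to the point, the BDe exponents here are computed from an assumed uniform prior p(x, y | B^h_xy, ξ) = 1/4 for every state and an assumed N' = 12; the K2 computation needs no such choice.

```python
from math import gamma

def bd_likelihood(counts, exponents):
    """p(D | B_s^h, xi) for one structure, as a product over (i, j) families:
    Gamma(N'_ij)/Gamma(N'_ij + N_ij) * prod_k Gamma(N'_ijk + N_ijk)/Gamma(N'_ijk)."""
    score = 1.0
    for n_ijk, np_ijk in zip(counts, exponents):
        n_ij, np_ij = sum(n_ijk), sum(np_ijk)
        score *= gamma(np_ij) / gamma(np_ij + n_ij)
        for n, np_ in zip(n_ijk, np_ijk):
            score *= gamma(np_ + n) / gamma(np_)
    return score

# D = {C1 = {x, y}, C2 = {x, ~y}} as N_ijk counts, one list per (node, parent state).
counts_xy = [[2, 0], [1, 1], [0, 0]]     # x; y|x; y|~x    (structure x -> y)
counts_yx = [[1, 1], [1, 0], [1, 0]]     # y; x|y; x|~y    (structure y -> x)

# K2 metric: every N'_ijk = 1.  The two equivalent structures score differently.
k2 = [[1, 1], [1, 1], [1, 1]]
print(bd_likelihood(counts_xy, k2), bd_likelihood(counts_yx, k2))   # 1/18 vs 1/24

# BDe with the assumed uniform prior and N' = 12: the scores coincide.
bde = [[6, 6], [3, 3], [3, 3]]
print(bd_likelihood(counts_xy, bde), bd_likelihood(counts_yx, bde))
```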

    Elimination of the Dirichlet Assumption

In the previous subsection, we saw that when ρ(Θ_U | B^h_sc, ξ) is Dirichlet, then ρ(Θ_Bsc | B^h_sc, ξ) is consistent with parameter independence, the Dirichlet assumption, likelihood equivalence, and structure possibility. Therefore, it is natural to ask whether there are any other choices for ρ(Θ_U | B^h_sc, ξ) that are similarly consistent. Actually, because the Dirichlet assumption is so strong, it is more fitting to ask whether there are any other choices for ρ(Θ_U | B^h_sc, ξ) that are


consistent with all but the Dirichlet assumption. In this section, we show that if each density function is positive (i.e., the range of each function includes only numbers greater than zero), then a Dirichlet distribution for ρ(Θ_U | B^h_sc, ξ) is the only consistent choice. Consequently, we show that, under these conditions, the BDe metric follows without the Dirichlet assumption.

First, let us examine this question for our two-binary-variable domain. Combining the equations above for the network structure x → y, the corresponding equations for the network structure y → x, likelihood equivalence, and structure possibility, we obtain

    \rho(\theta_x, \theta_{y|x}, \theta_{y|\bar x} \mid B^h_{xy}, \xi) = \frac{\theta_x \theta_{\bar x}}{\theta_y \theta_{\bar y}}\, \rho(\theta_y, \theta_{x|y}, \theta_{x|\bar y} \mid B^h_{yx}, \xi)

where

    \theta_y = \theta_x \theta_{y|x} + \theta_{\bar x} \theta_{y|\bar x}, \qquad
    \theta_{x|y} = \frac{\theta_x \theta_{y|x}}{\theta_y}, \qquad
    \theta_{x|\bar y} = \frac{\theta_x \theta_{\bar y|x}}{\theta_{\bar y}}

Applying parameter independence to both sides of this equation, we get

    f_x(\theta_x)\, f_{y|x}(\theta_{y|x})\, f_{y|\bar x}(\theta_{y|\bar x}) = \frac{\theta_x \theta_{\bar x}}{\theta_y \theta_{\bar y}}\, f_y(\theta_y)\, f_{x|y}(\theta_{x|y})\, f_{x|\bar y}(\theta_{x|\bar y})

where f_x, f_{y|x}, f_{y|x̄}, f_y, f_{x|y}, and f_{x|ȳ} are unknown density functions. These equations define a functional equation. Methods for solving such equations have been well studied (see, e.g., Aczel). In our case, Geiger and Heckerman show that, if each function is positive, then the only solution is for ρ(θ_xy, θ_xȳ, θ_x̄y | B^h_xy, ξ) to be a Dirichlet distribution. In fact, they show that, even when x and/or y have more than two states, the only solution consistent with likelihood equivalence is the Dirichlet.

Theorem (Geiger and Heckerman). Let Θ_xy, {Θ_x, Θ_{y|x}}, and {Θ_y, Θ_{x|y}} be positive multinomial parameters related by the rules of probability. If

    f_x(\Theta_x) \prod_{k=1}^{r_x} f_{y|x_k}(\Theta_{y|x_k}) = \frac{\prod_{k=1}^{r_x} \theta_{x_k}^{\,r_y - 1}}{\prod_{l=1}^{r_y} \theta_{y_l}^{\,r_x - 1}}\, f_y(\Theta_y) \prod_{l=1}^{r_y} f_{x|y_l}(\Theta_{x|y_l})

where each function is a positive probability density function, then ρ(Θ_xy | ξ) is Dirichlet.

This result for two variables is easily generalized to the n-variable case, as we now demonstrate.


Theorem. Let B_sc1 and B_sc2 be two complete network structures for U with variable orderings (x_1, ..., x_n) and (x_n, x_1, ..., x_{n-1}), respectively. If both structures have positive multinomial parameters that obey

    \rho(\Theta_{B_{sc}} \mid \xi) = J_{B_{sc}}\, \rho(\Theta_U \mid \xi)

(where J_Bsc is the Jacobian of the change of variables from Θ_U to Θ_Bsc), and positive densities ρ(Θ_Bsc | ξ) that satisfy parameter independence, then ρ(Θ_U | ξ) is Dirichlet.

Proof: The theorem is trivial for domains with one variable (n = 1), and is proved by the preceding theorem for n = 2. When n > 2, first consider the complete network structure B_sc1. Clustering the variables X = {x_1, ..., x_{n-1}} into a single discrete variable with q = ∏_{i=1}^{n-1} r_i states, we obtain the network structure X → x_n with multinomial parameters Θ_X and Θ_{x_n|X} given by

    \theta_X = \prod_{i=1}^{n-1} \theta_{x_i \mid x_1, \ldots, x_{i-1}}, \qquad
    \theta_{x_n \mid X} = \theta_{x_n \mid x_1, \ldots, x_{n-1}}

By assumption, the parameters of B_sc1 satisfy parameter independence. Thus, when we change variables from Θ_Bsc1 to {Θ_X, Θ_{x_n|X}}, using the Jacobian for this transformation, we find that the parameters for X → x_n also satisfy parameter independence. Now consider the complete network structure B_sc2. With the same variable cluster, we obtain the network structure x_n → X, with parameters Θ_{x_n} as in the original network structure and Θ_{X|x_n} given by

    \theta_{X \mid x_n} = \prod_{i=1}^{n-1} \theta_{x_i \mid x_n, x_1, \ldots, x_{i-1}}

By assumption, the parameters of B_sc2 satisfy parameter independence. Thus, when we change variables from Θ_Bsc2 to {Θ_{x_n}, Θ_{X|x_n}}, computing a Jacobian for each state of x_n, we find that the parameters for x_n → X again satisfy parameter independence. Finally, these changes of variable, in conjunction with the Jacobian relationship above, imply the condition of the two-variable theorem. Consequently, by that theorem, ρ(Θ_{X, x_n} | ξ) = ρ(Θ_U | ξ) is Dirichlet. □

Thus, we obtain the BDe metric without the Dirichlet assumption.

Theorem. The assumptions introduced previously, excluding the Dirichlet assumption, together with the assumption that parameter densities are positive, imply the BDe metric.

Proof: Given parameter independence, likelihood equivalence, structure possibility, and positive densities, we have from the preceding theorem that ρ(Θ_U | B^h_sc, ξ) is Dirichlet. Thus, from the BDe-metric theorem, we obtain the BDe metric. □


The assumption that parameters are positive is important. For example, given a domain consisting of only logical relationships, we can have parameter independence, likelihood equivalence, and structure possibility, and yet ρ(Θ_U | B^h_sc, ξ) will not be Dirichlet.

    Limitations of Parameter Independence and Likelihood Equivalence

There is a simple characterization of the assumption of parameter independence. Recall the property of posterior parameter independence, which says that parameters remain independent as long as complete cases are observed. Thus, suppose we have an uninformative Dirichlet prior for the joint-space parameters (all exponents very close to zero), which satisfies parameter independence. Then, if we observe one or more complete cases, our posterior will also satisfy parameter independence. In contrast, suppose we have the same uninformative prior and observe one or more incomplete cases. Then, our posterior will not be a Dirichlet distribution; in fact, it will be a linear combination of Dirichlet distributions, and it will not satisfy parameter independence. In this sense, the assumption of parameter independence corresponds to the assumption that one's knowledge is equivalent to having seen only complete cases.
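The following importance-sampling sketch (our own illustration, with assumed numbers) shows the effect for two binary variables and the structure x → y: starting from a uniform Dirichlet prior on the joint parameters (used here in place of exponents near zero, for numerical stability) and conditioning on several incomplete cases in which only y = true is observed, the parameters θ_x and θ_{y|x} become correlated, so parameter independence no longer holds.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([1.0, 1.0, 1.0, 1.0])            # uniform prior on the joint
joint = rng.dirichlet(alpha, size=200_000)        # columns: xy, x~y, ~xy, ~x~y

theta_x   = joint[:, 0] + joint[:, 1]
theta_y_x = joint[:, 0] / theta_x

# Ten incomplete cases, each observing y = true with x missing:
# p(y = true | theta) = theta_xy + theta_~xy.
weights = (joint[:, 0] + joint[:, 2]) ** 10
weights /= weights.sum()

def weighted_corr(a, b, w):
    ma, mb = np.sum(w * a), np.sum(w * b)
    cov = np.sum(w * (a - ma) * (b - mb))
    return cov / np.sqrt(np.sum(w * (a - ma) ** 2) * np.sum(w * (b - mb) ** 2))

print(np.corrcoef(theta_x, theta_y_x)[0, 1])      # prior: approximately zero
print(weighted_corr(theta_x, theta_y_x, weights)) # posterior: clearly nonzero
```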

When learning causal Bayesian networks, there is a similar characterization of the assumption of likelihood equivalence. (Recall that, when learning acausal networks, the assumption must hold.) Namely, until now we have considered only observational data: data obtained without intervention. Nonetheless, in many real-world studies, we obtain experimental data: data obtained by intervention, for example, by randomizing subjects into control and experimental groups. Although we have not developed the concepts in this paper to demonstrate the assertion, it turns out that, if we start with the uninformative Dirichlet prior (which satisfies likelihood equivalence), then the posterior will satisfy likelihood equivalence if and only if we see no experimental data. In this sense, when learning causal Bayesian networks, the assumption of likelihood equivalence corresponds to the assumption that one's knowledge is equivalent to having seen only nonexperimental data.

In light of these characterizations, we see that the assumptions of parameter independence and likelihood equivalence are unreasonable in many domains. For example, if we learn about a portion of a domain by reading or through word of mouth, or simply apply common sense, then these assumptions should be suspect. In these situations, our methodology for determining an informative prior from a prior network and a single equivalent sample size is too simple. (Footnote: These characterizations of parameter independence and likelihood equivalence in the context of causal networks are simplified for this presentation; Heckerman provides more detailed characterizations.)

To relax one or both of these assumptions when they are unreasonable, we can use


an equivalent database in place of an equivalent sample size. Namely, we ask a user to imagine that he was initially completely ignorant about a domain, having an uninformative Dirichlet prior. Then, we ask the user to specify a database D_e that would produce a posterior density reflecting his current state of knowledge. This database may contain incomplete cases and/or experimental data. Then, to score a real database D, we score the database D_e ∪ D using the uninformative prior and a learning algorithm that handles missing and experimental data, such as Gibbs sampling.

It remains to be determined whether this approach is practical. Needed is a compact representation for specifying equivalent databases that allows a user to accurately reflect his current knowledge. One possibility is to allow a user to specify a prior Bayesian network along with equivalent sample sizes, both experimental and nonexperimental, for each variable. Then, one could repeatedly sample equivalent databases from the prior network that satisfy these sample-size constraints, compute desired quantities (such as a scoring metric) from each equivalent database, and then average the results.

SDLC suggest a different method for accommodating nonuniform equivalent sample sizes. Their method produces Dirichlet priors that satisfy parameter independence but not likelihood equivalence.

    Priors for Network Structures

To complete the information needed to derive a Bayesian metric, the user must assess the prior probabilities of the network structures. Although these assessments are logically independent of the assessment of the prior network, structures that closely resemble the prior network will tend to have higher prior probabilities. Here, we propose the following parametric formula for p(B^h_s | ξ) that makes use of the prior network.

Let δ_i denote the number of nodes in the symmetric difference of Π_i(B_s) and Π_i(P), the parent sets of x_i in B_s and in the prior network P:

    \delta_i = \left| \left( \Pi_i(B_s) \cup \Pi_i(P) \right) \setminus \left( \Pi_i(B_s) \cap \Pi_i(P) \right) \right|

Then B_s and the prior network differ by δ = Σ_{i=1}^n δ_i arcs, and we penalize B_s by a constant factor 0 < κ ≤ 1 for each such arc. That is, we set

    p(B_s^h \mid \xi) = c\, \kappa^{\delta}

where c is a normalization constant, which we can ignore when computing relative posterior probabilities. This formula is simple, as it requires only the assessment of a single constant κ. Nonetheless, we can imagine generalizing the formula by punishing different arc differences with different weights, as suggested by Buntine. Furthermore, it may be more reasonable to use a prior network constructed without conditioning on B^h_sc.

We note that this parametric form satisfies prior equivalence only when the prior network contains no arcs. Consequently, because the priors on network structures for acausal


networks must satisfy prior equivalence, we should not use this parameterization for acausal networks.
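A minimal sketch of the structure prior just described follows (our own code; the helper names and example networks are hypothetical). It returns the log prior up to the additive constant log c.

```python
from math import log

def log_structure_prior(candidate_parents, prior_parents, kappa=0.5):
    """log(kappa ** delta), where delta sums, over all nodes, the size of the
    symmetric difference between the node's parents in the candidate structure
    and in the prior network.  0 < kappa <= 1 is the per-arc penalty."""
    nodes = set(candidate_parents) | set(prior_parents)
    delta = sum(len(candidate_parents.get(x, set()) ^ prior_parents.get(x, set()))
                for x in nodes)
    return delta * log(kappa)

# Example: prior network x -> y -> z; the candidate x -> y, x -> z differs by two arcs.
prior_net = {"x": set(), "y": {"x"}, "z": {"y"}}
candidate = {"x": set(), "y": {"x"}, "z": {"x"}}
print(log_structure_prior(candidate, prior_net))     # 2 * log(kappa)
```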

    Search Methods

In this section, we examine methods for finding network structures with high posterior probabilities. Although our methods are presented in the context of Bayesian scoring metrics, they may be used in conjunction with other non-Bayesian metrics as well. Also, we note that researchers have proposed network-selection criteria other than relative posterior probability (e.g., Madigan and Raftery), which we do not consider here.

Many search methods for learning network structure, including those that we describe, make use of a property of scoring metrics that we call decomposability. Given a network structure for domain U, we say that a measure on that structure is decomposable if it can be written as a product of measures, each of which is a function only of one node and its parents. From the BDe formula, we see that the likelihood p(D | B^h_s, ξ) given by the BD metric is decomposable. Consequently, if the prior probabilities of network structures are decomposable, as is the case for the priors given in the previous section, then the BD metric will be decomposable. Thus, we can write

    p(D, B_s^h \mid \xi) = \prod_{i=1}^{n} s(x_i \mid \Pi_i)

where s(x_i | Π_i) is a function only of x_i and its parents. Given a decomposable metric, we can compare the scores of two network structures that differ by the addition or deletion of arcs pointing to x_i by computing only the term s(x_i | Π_i) for both structures. We note that most known Bayesian and non-Bayesian metrics are decomposable.

Special-Case Polynomial Algorithms

We first consider the special case of finding the l network structures with the highest score among all structures in which every node has at most one parent.

For each arc x_j → x_i (including cases where x_j is null), we associate a weight w(x_i, x_j) = log s(x_i | x_j) − log s(x_i | ∅). From the decomposition above, we have

    \log p(D, B_s^h \mid \xi) = \sum_{i=1}^{n} \log s(x_i \mid \Pi_i) = \sum_{i=1}^{n} w(x_i, \Pi_i) + \sum_{i=1}^{n} \log s(x_i \mid \emptyset)


where Π_i is the (possibly null) parent of x_i. The last term in this equation is the same for all network structures. Thus, among the network structures in which each node has at most one parent, ranking network structures by sum of weights Σ_{i=1}^n w(x_i, Π_i) or by score has the same result.

Finding the network structure with the highest weight (l = 1) is a special case of the well-known problem of finding maximum branchings, described, for example, in Evans and Minieka. The problem is defined as follows. A tree-like network is a connected directed acyclic graph in which no two edges are directed into the same node. The root of a tree-like network is a unique node that has no edges directed into it. A branching is a directed forest that consists of disjoint tree-like networks. A spanning branching is any branching that includes all nodes in the graph. A maximum branching is any spanning branching that maximizes the sum of arc weights (in our case, Σ_{i=1}^n w(x_i, Π_i)). An efficient polynomial algorithm for finding a maximum branching was first described by Edmonds, later explored by Karp, and made more efficient by Tarjan and by Gabow et al. The general case (l > 1) was treated by Camerini et al.

These algorithms can be used to find the l branchings with the highest weights, regardless of the metric we use, as long as one can associate a weight with every edge. Therefore, this approach is appropriate for any decomposable metric. When using metrics that are score equivalent (i.e., both prior and likelihood equivalent), however, we have

    s(x_i \mid x_j)\, s(x_j \mid \emptyset) = s(x_j \mid x_i)\, s(x_i \mid \emptyset)

Thus, for any two edges x_i → x_j and x_j → x_i, the weights w(x_i, x_j) and w(x_j, x_i) are equal. Consequently, the directionality of the arcs plays no role for score-equivalent metrics, and the problem reduces to finding the l undirected forests for which Σ w(x_i, x_j) is a maximum. For the case l = 1, we can apply a maximum-spanning-tree algorithm with arc weights w(x_i, x_j) to identify an undirected forest F having the highest score. The set of network structures that are formed from F by adding any directionality to the arcs of F, such that the resulting network is a branching, yields a collection of equivalent network structures, each having the same maximal score. This algorithm is identical to the tree-learning algorithm described by Chow and Liu, except that we use a score-equivalent Bayesian metric rather than the mutual-information metric. For the general case l > 1, we can use the algorithm of Gabow to identify the l undirected forests having the highest score, and then determine the l equivalence classes of network structures with the highest score.
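For the score-equivalent case just described, the following is a small self-contained sketch (our own code, using Kruskal's algorithm with a union-find structure rather than any particular library routine). The weights in the example are hypothetical; in practice w(x_i, x_j) = log s(x_i | x_j) − log s(x_i | ∅), and only positive weights can improve on leaving a node parentless.

```python
def maximum_spanning_forest(nodes, weights):
    """Return the undirected edges of a maximum-weight spanning forest.

    weights : dict {(u, v): w} over unordered pairs (each listed once).
    Any orientation of the result that forms a branching has the same score."""
    parent = {v: v for v in nodes}

    def find(v):                         # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for (u, v), w in sorted(weights.items(), key=lambda item: -item[1]):
        if w <= 0:
            break                        # remaining edges cannot improve the score
        root_u, root_v = find(u), find(v)
        if root_u != root_v:             # the edge keeps the forest acyclic
            parent[root_u] = root_v
            forest.append((u, v))
    return forest

# Hypothetical weights w(x_i, x_j):
w = {("x", "y"): 2.3, ("y", "z"): 1.1, ("x", "z"): -0.4}
print(maximum_spanning_forest(["x", "y", "z"], w))    # [('x', 'y'), ('y', 'z')]
```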


    Heuristic Search

A generalization of the problem described in the previous section is to find the l best networks from the set of all networks in which each node has no more than k parents. Unfortunately, even when l = 1, the problem for k > 1 is NP-hard. In particular, let us consider the following decision problem, which corresponds to our optimization problem with l = 1:

k-LEARN
INSTANCE: Set of variables U, database D = {C_1, ..., C_m}, where each C_i is an instance of all variables in U, scoring metric M(D, B_s), and real value p.
QUESTION: Does there exist a network structure B_s, defined over the variables in U, where each node in B_s has at most k parents, such that M(D, B_s) ≥ p?

Höffgen shows that a similar problem for PAC learning is NP-complete. His results can be translated easily to show that k-LEARN is NP-complete for k > 1 when the BD metric is used. Chickering et al. show that k-LEARN is NP-complete even when we use the likelihood-equivalent BDe metric and the constraint of prior equivalence. Therefore, it is appropriate to use heuristic search algorithms for the general case k > 1. In this section, we review several such algorithms.

As is the case with essentially all search methods, the methods that we examine have two components: an initialization phase and a search phase. For example, let us consider the K2 search method (not to be confused with the K2 metric) described by CH. The initialization phase consists of choosing an ordering over the variables in U. In the search phase, for each node x_i in the ordering provided, the node from {x_1, ..., x_{i-1}} that most increases the network score is added to the parent set of x_i, until no node increases the score or the size of Π_i exceeds a predetermined constant.
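A sketch of this greedy parent selection is shown below (our own code; `score_node` is a hypothetical stand-in for any decomposable per-node score s(x_i | Π_i), such as the corresponding BDe term).

```python
def k2_search(ordering, score_node, max_parents):
    """Greedy K2-style search: parents of each node are chosen from its
    predecessors in `ordering`, adding one best-improving parent at a time.

    ordering    : list of variables; only earlier variables may become parents
    score_node  : function (node, frozenset_of_parents) -> log s(x_i | Pi_i)
    max_parents : upper bound on the size of each parent set
    """
    parents = {}
    for i, x in enumerate(ordering):
        chosen = frozenset()
        best = score_node(x, chosen)
        improved = True
        while improved and len(chosen) < max_parents:
            improved = False
            candidates = [p for p in ordering[:i] if p not in chosen]
            scored = [(score_node(x, chosen | {p}), p) for p in candidates]
            if scored:
                new_best, best_parent = max(scored)
                if new_best > best:
                    best, chosen, improved = new_best, chosen | {best_parent}, True
        parents[x] = set(chosen)
    return parents
```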

The search algorithms we consider make successive arc changes to the network and employ the property of decomposability to evaluate the merit of each change. The possible changes that can be made are easy to identify. For any pair of variables, if there is an arc connecting them, then this arc can either be reversed or removed. If there is no arc connecting them, then an arc can be added in either direction. All changes are subject to the constraint that the resulting network contain no directed cycles. We use E to denote the set of eligible changes to a graph, and Δ(e) to denote the change in log score of the network resulting from the modification e ∈ E. Given a decomposable metric, if an arc to x_i is added or deleted, only s(x_i | Π_i) need be evaluated to determine Δ(e). If an arc between x_i and x_j is reversed, then only s(x_i | Π_i) and s(x_j | Π_j) need be evaluated.


One simple heuristic search algorithm is local search (Johnson). First, we choose a graph. Then, we evaluate Δ(e) for all e ∈ E and make the change e for which Δ(e) is a maximum, provided it is positive. We terminate search when there is no e with a positive value for Δ(e). As we visit network structures, we retain the l of them with the highest overall score. Using decomposable metrics, we can avoid recomputing all terms Δ(e) after every change. In particular, if neither x_i, x_j, nor their parents are changed, then Δ(e) remains unchanged for all changes e involving these nodes, as long as the resulting network is acyclic. Candidates for the initial graph include the empty graph, a random graph, a graph determined by one of the polynomial algorithms described in the previous section, and the prior network.
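The following compact sketch (our own code, not the authors' implementation) illustrates local search with a decomposable per-node scorer; for brevity it starts from the empty graph, considers only arc additions and deletions (reversals would be handled analogously), and returns the single local maximum rather than the l best structures visited.

```python
from itertools import permutations

def creates_cycle(parents, child, new_parent):
    """True if adding the arc new_parent -> child would create a directed cycle,
    i.e., if `child` is already an ancestor of `new_parent`."""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def local_search(variables, score_node):
    parents = {x: set() for x in variables}              # initial graph: empty
    current = {x: score_node(x, parents[x]) for x in variables}
    while True:
        best_delta, best_move = 0.0, None
        for u, v in permutations(variables, 2):          # candidate arc u -> v
            if u in parents[v]:
                new_set = parents[v] - {u}               # deletion
            elif not creates_cycle(parents, v, u):
                new_set = parents[v] | {u}               # addition
            else:
                continue
            delta = score_node(v, new_set) - current[v]  # only s(x_v | Pi_v) changes
            if delta > best_delta:
                best_delta, best_move = delta, (v, new_set)
        if best_move is None:                            # local maximum reached
            return parents
        v, new_set = best_move
        parents[v], current[v] = new_set, score_node(v, new_set)
```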

A potential problem with local search is getting stuck at a local maximum. Methods for avoiding local maxima include iterated hill-climbing and simulated annealing. In iterated hill-climbing, we apply local search until we hit a local maximum. Then, we randomly perturb the current network structure and repeat the process for some manageable number of iterations. At all stages, we retain the top l network structures.

In one variant of simulated annealing, described by Metropolis et al., we initialize the system to some temperature T_0. Then, we pick some eligible change e at random and evaluate the expression p = exp(Δ(e)/T). If p > 1, then we make the change e; otherwise, we make the change with probability p. We repeat this selection and evaluation process α times or until we make β changes. If we make no changes in α repetitions, then we stop searching. Otherwise, we lower the temperature by multiplying the current temperature T by a decay factor 0 < γ < 1, and we continue the search process. We stop searching if we have lowered the temperature more than δ times. Thus, this algorithm is controlled by five parameters: T_0, α, β, γ, and δ. Throug