
Bayesian Learning of Bayesian Networks with Informative Priors

Nicos Angelopoulos [email protected]

James Cussens [email protected]

Department of Computer Science

University of York

Heslington, York, YO10 5DD, UK

Editor: Some editor

Abstract

This paper presents and evaluates an approach to Bayesian model averaging where the models are Bayesian nets (BNs). Prior distributions are defined using stochastic logic programs and the MCMC Metropolis-Hastings algorithm is used to (approximately) sample from the posterior. Experiments using data generated from known BNs have been conducted to evaluate the method. The experiments used 6 different BNs and varied: the structural prior, the parameter prior, the Metropolis-Hastings proposal and the data size. Each experiment was repeated three times with different random seeds to test the robustness of the MCMC-produced results. Our results show that with effective priors (i) robust results are produced and (ii) informative priors improve results significantly.

Keywords: Prior knowledge, Bayesian inference, Bayesian model averaging, Markov chain Monte Carlo, loss functions, stochastic logic programs.

1. Introduction

The Bayesian approach to machine learning is (conceptually) remarkably simple. It assumes that, before any analysis of data, a space of models (or hypotheses) is defined, and that one of these models is the true model of the process which led to the data being observed. Further, it is assumed that a prior distribution is defined over this model space. The prior probability of any given model represents the modeller's belief that it is the true model based on information available prior to the data. Prior plus data define the posterior distribution over model space, and from a pure Bayesian viewpoint the posterior is the endpoint of statistical inference.
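
In symbols (our notation, not the paper's): for a model space $\Omega$ with prior $P(m)$ and observed data $D$,

$$P(m \mid D) = \frac{P(D \mid m)\,P(m)}{\sum_{m' \in \Omega} P(D \mid m')\,P(m')}.$$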

In practice, however, the posterior is rarely the final result since it is generally used to make rational decisions. For example, a posterior distribution over a space of classifiers can be used to classify new examples: the example is just classified to whichever class has minimal expected cost according to the posterior. If misclassification costs are equal this is just the most likely class for the example according to the posterior.
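
Spelled out (again in our own notation): with $L(c, c')$ the cost of predicting class $c$ when the truth is $c'$, a new example $x$ is assigned to

$$\hat{c}(x) = \arg\min_{c} \sum_{c'} L(c, c')\, P(c' \mid x, D),$$

which under 0-1 loss reduces to the posterior mode $\arg\max_{c} P(c \mid x, D)$.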

If the end-goal is to make rational decisions (for example, classification decisions) based on observed data, then there are strong arguments for a Bayesian approach (Howson and Urbach, 1989) which conclude that the posterior distribution is what is required to do this. Also the posterior (suitably presented and/or summarised) may permit insight into the combined information provided by prior and data. Naturally, though, there are important questions which a Bayesian analysis does not address. Bayesian inference (at least in its 'vanilla' form) is inference conditional on the actually observed data; it is not interested in other data which might have been observed but were not. If, for example, we want bounds on the probability of observing a dataset which leads some learning algorithm to produce a result which is 'approximately correct' then non-Bayesian results on uniform convergence are needed (Vapnik and Chervonenkis, 1971).

Even when a Bayesian analysis is theoretically desirable, there are often large practical obstacles to its application. Anyone who has tried to shoe-horn real prior knowledge, with all its heterogeneity, into a usable prior distribution will have encountered a problem this paper aims to reduce. Formalising prior knowledge so that it is accessible to Bayesian inference requires a representation language which straddles the gap between the prior knowledge in a human's brain and that expressed by a probability distribution. In this paper we propose first-order logic with an additional probabilistic component (specifically, stochastic logic programs) as such a language. It is of course no great discovery that first-order logic forms a good basis for formalising human thoughts: it was specifically designed to do just that (Frege, 1879).

A second problem for the Bayesian approach is getting hold of the posterior. In many cases, particularly in parametric Bayesian inference, a conjugate prior distribution is used so that the computation and representation of the posterior are greatly simplified. In the current paper the focus is on non-parametric Bayesian inference: both prior and posterior are defined over a large space of models. Although in this paper, in contrast to some of our other work (Angelopoulos and Cussens, 2004a), each model space is finite, they are nonetheless generally too big to represent explicitly. This means that a simple tabular representation of prior and posterior is ruled out. Here we adopt a common approach to this problem: a Markov Chain Monte Carlo (MCMC) algorithm is used to construct an approximate sample from the posterior and the sample is then used as a proxy for the actual posterior.

Using MCMC, an approximate sample from the true posterior distribution is generated, and from a pure Bayesian point of view, the only learning result to evaluate is the accuracy of this approximation. Our MCMC runs have been repeated with different random seeds to test that similar estimates of posterior quantities (e.g. expected losses) are produced each time. We also extract single models ('best guesses' of the true model) from the MCMC sample to evaluate our results from the more usual non-Bayesian perspective.

The paper is structured as follows. Section 2 is a survey of existing methods of incorporating prior information into Bayesian net learning. Section 3 shows how to use stochastic logic programs to define prior distributions. Section 4 describes how our SLP priors are 'folded into' the Metropolis-Hastings algorithm to generate approximate samples from the posterior. Sections 5 and 6 describe the experiments we have conducted and the results of those experiments, respectively. The paper ends with conclusions and suggestions for future work in Section 7.


2. Existing approaches to using prior information when learning the structure of Bayesian nets

In this section, we provide an overview of existing approaches to including prior knowledge into BN learning. However, it should be noted that most BN learning approaches make no significant attempt to integrate prior knowledge into the learning process. As Friedman and Koller (2003) note: ". . . relatively little attention has been paid to the choice of structure prior, and a simple prior is often chosen largely for pragmatic reasons. The simplest and therefore most common choice is a uniform prior over structures (Heckerman, 1998)". Madigan et al. (1995) state that "All of the reported applications of BMA use a uniform prior distribution, thereby implying that all models are equally likely a priori." They go on to argue that:

. . . in the context of knowledge-based systems, or indeed in any context where the primary aim of the modeling effort is to predict the future, such prior distributions are often inappropriate; one of the primary advantages of the Bayesian approach is that it provides a practical framework for harnessing all available resources including prior expert knowledge.

We agree with this position, and our own work is an attempt to facilitate harnessing expert knowledge. Moreover, despite the frequent use of non-informative priors, there remains some work which seeks to incorporate prior knowledge into BN learning. We turn now to consider the various ways in which this has been done.

2.1 Hard constraints

The next step up in complexity from a uniform prior is to completely rule out particular structures. This approach has been applied in both Bayesian and non-Bayesian approaches. In the Bayesian case, one can either declare a uniform prior on the surviving structures or go on to specify some non-uniform prior using one of the techniques given in Sections 2.2–2.4.

An early non-Bayesian example of using hard constraints is the algorithm presented by Srinivas et al. (1990) which allows (but does not require) an expert to effect constraints of four different kinds:

• Declaring that a variable must be a root node

• Declaring that a variable must be a leaf node

• Declaring that one variable must be the parent of another

• Making explicit conditional independence declarations

The expert knowledge also defines a partial order between variables (a 'priority ordering') which declares an ancestor relation between pairs of nodes. This knowledge guides the learning of a BN by a non-Bayesian algorithm.

In Bayesian approaches a hard constraint sets the prior probability of the relevant structures to zero. It follows from Bayes' theorem that such structures will also have posterior probability of zero no matter how much the data might support them. An interesting suggested application of zero prior probabilities applies when (unusually) we consider structures containing extra 'hidden' variables not present in the data:

One difficulty in considering the possibility of hidden variables is that there is an unlimited number of them and an unlimited number of belief-networks that can contain them . . . Another approach is to specify explicitly nonzero priors for only a limited number of belief-network structures that contain hidden variables. (Cooper and Herskovits, 1992)

By far the most common hard constraint is to impose a total ordering ≺ on variables such that $y$ is a parent of $x$ ($y \in \Pi_x$) only if $y \prec x$. (This is equivalent to stating that $y$ is an ancestor of $x$ only if $y \prec x$.) This is the approach taken by Buntine (1991) and in the K2 algorithm (Cooper and Herskovits, 1992), for example. Such an ordering reduces the model space considerably. For example, for 10 variables there are about $4.2 \times 10^{18}$ possible BN structures, but with an ordering ≺ this reduces to $2^{C(10,2)} = 2^{(10 \times 9)/2} \approx 3.5 \times 10^{13}$ (Cooper and Herskovits, 1992).

As an alternative to supplying a total ordering on variables, the BIFROST algorithm (Højsgaard and Thiesson, 1995) asks the user to partition the variables into 'blocks' and supply a total ordering only on the blocks.

Hence model selection in BIFROST demands that the user specifies a block recursive model by which the types of potential associations between variables are defined. BIFROST will then on the basis of a complete data set determine on which of the potential associations that actually holds. (Højsgaard and Thiesson, 1995)

In addition the user may also give information on the existence of links.

It is possible to reduce the set of possible selectable models by specifying prior knowledge of some partial associations which definitely exist and some which definitely do not exist. (Højsgaard and Thiesson, 1995)

Using a greedy non-Bayesian approach, BIFROST induces a block-recursive model (aka a chain graph), not a Bayesian network. Since BIFROST only considers decomposable models, the resulting block-recursive model can always be translated into a Bayesian network.

Another common constraint is to limit the number of parents any variable may have (the indegree) (Cooper and Herskovits, 1992; Friedman and Koller, 2003). Such a constraint can be justified as more than just a convenience:

There are few applications in which very large families are called for, and there is rarely enough data to support robust parameter estimation for such families. From a more formal perspective, networks with very large families tend to have low score. (Friedman and Koller, 2003)

The 'low score' referred to here is a measure of fit to data, so BNs with large families are likely to be largely ruled out by the data. Although a tight bound on indegree can reduce the set of possible models considerably, for $n$ variables and an indegree bound of $k$, each variable still has $O(n^k)$ candidate parent sets, so the number of possible structures is still $2^{O(kn \log n)}$ (Friedman and Koller, 2003).


To finish this section on hard constraints, note that hard constraints are the most significant form of prior knowledge, for the following reason supplied by Langseth and Nielsen (2003):

The use of structural priors when learning BNs has received only little attention in the learning community. The most obvious reason is that in most cases the effect of the prior is dominated by the likelihood term, even for relatively small databases. One exception, however, is when some of the network structures are given zero probability a priori, in which case the data cannot change that belief.

2.2 Using priors on arcs

2.2.1 Using individual arc probabilities

The most natural non-uniform prior on Bayesian networks uses (i) a total ordering on variables and (ii) assumptions of independence so that the probability of a graph is simply the product of probabilities of its components. Call a prior which meets these two conditions modular. This approach originates with Cooper and Herskovits (1992), where $P(B_S)$, the probability of structure $B_S$, is the product of the probabilities of the parent sets ($\pi_i^S$) of each variable $x_i$:

Assume that $P(B_S)$ can be calculated as $P(B_S) = \prod_{i=1}^{n} P(\pi_i^S \rightarrow x_i)$. Thus, for all distinct pairs of variables $x_i$ and $x_j$, our belief about $x_i$ having some set of parents is independent of our belief about $x_j$ having some set of parents. (Cooper and Herskovits, 1992, p.18)

A second independence assumption can be used to decompose $P(\pi_i^S \rightarrow x_i)$:

The probability $P(\pi_i^S \rightarrow x_i)$ could be assessed directly or derived using additional methods. For example, one method would be to assume that the presence of an arc in $\pi_i^S \rightarrow x_i$ is marginally independent of the presence of the other arcs there; if the marginal probability of each arc in $\pi_i^S \rightarrow x_i$ is specified, we can compute $P(\pi_i^S \rightarrow x_i)$. (Cooper and Herskovits, 1992, p.18)

Another early use of this prior is given by Buntine (1991), who gives an explicit formula for the probability of a structure $\Pi$ in terms of individual arc probabilities. Assume that the expert has given us an ordering ≺ and also, for any $y$ and $x$ where $y \prec x$, has given us $\Pr(y \rightarrow x \mid \prec, E)$. $E$ represents information from the expert. For any structure $\Pi$ consistent with ≺, the first independence assumption gives us:

$$\Pr(\Pi \mid \prec, E) = \prod_{x \in X} \Pr(\Pi_x \mid \prec, E)$$

where $\Pi_x$ is the family structure for node $x$, and the second independence assumption gives us:

$$\Pr(\Pi_x \mid \prec, E) = \prod_{y \in \Pi_x} \Pr(y \rightarrow x \mid \prec, E) \prod_{y \notin \Pi_x} \left(1 - \Pr(y \rightarrow x \mid \prec, E)\right)$$
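
To make the product form concrete, here is a minimal Prolog sketch (ours, not from the paper) of Buntine's formula for a single family. It assumes elicited arc probabilities are stored as facts arc_prob(Y, X, P) for y ≺ x, and a standard member/2; the predicate names and the numbers are purely illustrative.

% arc_prob(Y, X, P): elicited probability that arc Y -> X is present,
% for Y preceding X in the ordering (illustrative figures only).
arc_prob(a, b, 0.8).
arc_prob(a, c, 0.3).
arc_prob(b, c, 0.6).

% family_prior(X, Candidates, Parents, Prob): Prob is Pr(Pi_x | <, E),
% the product over candidate parents Y of P if Y is in Parents
% and (1 - P) otherwise.
family_prior(_X, [], _Parents, 1.0).
family_prior(X, [Y|Ys], Parents, Prob) :-
    arc_prob(Y, X, P),
    family_prior(X, Ys, Parents, Rest),
    (   member(Y, Parents)
    ->  Prob is P * Rest
    ;   Prob is (1 - P) * Rest
    ).

For instance, family_prior(c, [a,b], [b], P) binds P to (1 - 0.3) * 0.6 = 0.42; multiplying the family_prior values of all nodes gives Pr(Π | ≺, E).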


The possibility of computing exact posterior quantities using modular priors makes them an attractive option. When the true ordering is not known, modular priors can still be exploited by summing/maximising over orders (Koivisto and Sood, 2004) or sampling orders using MCMC (Friedman and Koller, 2003).

Note that the total order ≺ is crucial for the simple product form of this prior. Without it we have to consider combinations of parent-child links which give rise to directed cycles and hence illegal Bayesian networks. To see this, consider the approach of Valdueza Castelo and Siebes (1998). In this approach the user is not required to supply a total ordering ≺. For any two distinct nodes $A$ and $B$, the user is allowed (but not required) to define $P(A \rightarrow B)$, $P(A \leftarrow B)$ and $P(A \ldots B)$. $P(A \ldots B)$ denotes the probability that there is no arc. Clearly we must have

$$P(A \rightarrow B) + P(A \leftarrow B) + P(A \ldots B) = 1$$

Valdueza Castelo and Siebes (1998) assume that the user's beliefs are coherent, so that probabilities given by the user do not contradict this equation. If, for any two distinct nodes $A$ and $B$, the user provides fewer than two probabilities, then the unspecified probabilities are set automatically by uniformly distributing the remaining probability mass. For example, if the user declares only that $P(A \rightarrow B) = 3/4$, then the system sets $P(A \leftarrow B) = P(A \ldots B) = 1/8$.

Given a BN structure $B_S$, let $(v_i \leftrightarrow v_j)_{B_S}$ be either $v_i \rightarrow v_j$ or $v_i \leftarrow v_j$ or $v_i \ldots v_j$ according to which of these three possibilities is specified by $B_S$. If we were to set

$$P(B_S \mid \xi) = \prod_{v_i, v_j \in V,\ i \neq j} p((v_i \leftrightarrow v_j)_{B_S} \mid \xi)$$

then we would not have defined a probability distribution over BN structures because $\sum_{B_S} P(B_S \mid \xi) < 1$. The problem is that this simple product distribution assigns positive probability to directed graphs with cycles, so some of the mass for the BNs is lost.

The solution is simple. The product distribution is altered so that graphs with directed cycles are set to probability zero, and to compensate a normalising constant is introduced. Valdueza Castelo and Siebes (1998) consider two options for normalisation. The distribution over BNs can either be:

$$P(B_S \mid \xi) = c \prod_{v_i, v_j \in V,\ i \neq j} p((v_i \leftrightarrow v_j)_{B_S} \mid \xi) \qquad (1)$$

or

$$P(B_S \mid \xi) = c + \prod_{v_i, v_j \in V,\ i \neq j} p((v_i \leftrightarrow v_j)_{B_S} \mid \xi) \qquad (2)$$

Although Valdueza Castelo and Siebes (1998) do not describe it as such, Equation 1 defines a log-linear (also exponential-family or MAXENT) distribution over BNs, where the links are the 'features' of the BN and $c$ is the dreaded partition function $Z$. In general, $c$ will be hard to calculate, but as Valdueza Castelo and Siebes (1998) note, we generally do not need to, since it is often enough to have probabilities defined only up to a normalising constant.


The approach of Valdueza Castelo and Siebes (1998) turns out to be essentially the same as one of those suggested by Madigan and Raftery (1994). In their experiments, Madigan and Raftery (1994) initially use a uniform prior, but this is criticised as follows:

In the examples considered above, the prior model probabilities pr(M) were assumed equal (Cooper and Herskovits, 1992, also assume that models are equally likely a priori). In general this can be unrealistic and may also be expensive and we will want to penalise the search strategy as it moves further away from the model(s) provided by the expert(s)/data analyst(s). Ideally one would elicit prior probabilities for all possible qualitative structures from the expert but this will be feasible only in trivial cases.

They go on to discuss an alternative to the uniform prior:

For models with fewer than 15 to 20 nodes, prior model probabilities may be approximated by eliciting prior probabilities for the presence of every possible link and assuming that the links are mutually independent, as follows. Let $E = E_P \cup E_A$ denote the set of all possible links for the nodes of model $M$, where $E_P$ denotes the set of links which are present in model $M$ and $E_A$ denotes the absent links. For every link $e \in E$ we elicit $pr(e)$, the prior probability that link $e$ is included in $M$. The prior model probability is then approximated by

$$pr(M) \propto \prod_{e \in E_P} pr(e) \prod_{e \in E_A} (1 - pr(e))$$

Prior link probabilities from multiple experts are treated as independent sources of information and are simply multiplied together to give pooled prior model probabilities. Clearly, the contribution from each expert/data analyst could be weighted. (Madigan and Raftery, 1994)

The simplicity with which this prior incorporates information from multiple sources is a strong point in its favour; however, the independence assumption seems unrealistic. Richardson and Domingos (2003) have recently used a similar method.

2.2.2 A distance-based approach

Eliciting very many arc probabilities is unrealistic, so Madigan and Raftery (1994) go on to consider a coarser approach:

For applications involving a larger number of nodes, or where the elicitation of link probabilities is not possible, we could assume that the "evidence" in favour of each link included by the expert(s)/data analyst(s) in the elicited qualitative structure(s) is "substantial" or "strong" but not "very strong" or "decisive" (Jeffreys, 1961). For example, we could assume that the evidence in favour of an included link lies at the center of Occam's window, corresponding to a prior link probability for all $e \in E_P$ of

$$pr(e) = \frac{1}{1 + \exp\left(\frac{O_L + O_R}{2}\right)}$$


Similarly, the prior link probabilities for $e \in E_A$ are given by

$$pr(e) = \frac{\exp\left(\frac{O_L + O_R}{2}\right)}{1 + \exp\left(\frac{O_L + O_R}{2}\right)}$$

(Madigan and Raftery, 1994)

This approach is most easily explained if we assume the domain expert has provided us with a single model which s/he proposes as his/her 'best guess'. Basically, each arc in this elicited model receives the same prior probability, and this probability will be higher than that for arcs not in the elicited model. The parameters $O_L$ and $O_R$ are set by the user, and affect the search algorithms proposed in Madigan and Raftery (1994). Their effect on the prior is to determine the prior bias in favour of arcs included in the model provided by the user.
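
A minimal sketch (our own illustration, with a hypothetical predicate name) of how the two link probabilities above follow from the user-chosen $O_L$ and $O_R$:

% occam_link_probs(OL, OR, PIn, POut): PIn is the prior probability of a
% link included in the elicited model (e in E_P), POut that of a link
% absent from it (e in E_A), per the two Madigan-Raftery formulas above.
occam_link_probs(OL, OR, PIn, POut) :-
    M is exp((OL + OR) / 2),
    PIn is 1 / (1 + M),
    POut is M / (1 + M).

Note that PIn + POut = 1, so the single scalar (OL + OR)/2 controls how strongly the prior is biased with respect to the elicited structure.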

A similar prior is used by Heckerman et al. (1995a), who present a Bayesian approach where "a user specifies his prior knowledge about the problem by constructing a Bayesian network, called a prior network, and by assessing his confidence in this network." They argue that ". . . structures that closely resemble the prior network will tend to have higher prior probabilities." If a given structure $B_S$ differs from the prior network by $\delta$ arcs, then

$$P(B_S) = c\,\kappa^{\delta}$$

where $\kappa$ ($0 < \kappa < 1$) is a user-selected penalty factor and $c$ is a normalising constant.

An alternative to penalising structures too distant from the expert's is simply to fix on some particular prior network irrespective of expert opinion. For example, Friedman and Koller (2003) consider (but do not use) a prior which penalises dense networks. In this case, implicitly, the arcless BN is the prior network. Suppose $\beta$ is the probability for an edge to be present; then a structure with $m$ edges will have prior probability proportional to $\beta^m (1 - \beta)^{C(n,2) - m}$.
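
Unnormalised versions of both penalty priors are one-liners; the following sketch is ours, not the cited papers' (predicate names hypothetical):

% kappa_prior(Kappa, Delta, P): P proportional to Kappa^Delta, with Delta
% the number of arcs by which a structure differs from the prior network.
kappa_prior(Kappa, Delta, P) :-
    P is Kappa ** Delta.

% density_prior(Beta, N, M, P): P proportional to
% Beta^M * (1 - Beta)^(C(N,2) - M) for a structure with M edges
% over N variables.
density_prior(Beta, N, M, P) :-
    Pairs is N * (N - 1) // 2,    % C(N,2) possible pairs of variables
    P is Beta ** M * (1 - Beta) ** (Pairs - M).

For example, density_prior(0.1, 10, 8, P) gives the unnormalised prior weight of a 10-variable structure with 8 edges.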

2.3 Priors over variable orderings

We have seen that if the user is prepared to supply a total order ≺ then setting an arc-based prior is simplified (as is learning in general). If the user is unwilling to do this, he/she might at least provide a prior over possible orderings. Although Madigan and Raftery (1994) suggest that "In the directed case it may be possible to construct a prior distribution on the space of orderings . . . ", this possibility is only really explored by Friedman and Koller (2003). However, this work concentrates on an MCMC method for sampling from the posterior distribution over orderings, rather than exploring methods for defining priors over that space. Consequently Friedman and Koller (2003) restrict their work to using a uniform distribution over orderings. Such a distribution is not hypothesis equivalent (Heckerman et al., 1995a): different structures in the same Markov equivalence class have different priors. Concerning this problem Friedman and Koller (2003) state that:

In general, while this discrepancy in priors is unfortunate, it is important to see it in proportion. The standard priors over network structures are often used not because they are particularly well-motivated, but rather because they are simple and easy to work with. In fact, the ubiquitous uniform prior over structures is far from uniform over PDAGs (Markov equivalence classes)—PDAGs consistent with more structures have a higher induced prior probability.

This raises the question of the importance of hypothesis equivalence when setting priors. When the goal of BN learning is to induce a standard probabilistic model, then BNs in the same Markov equivalence class (i.e. defining the same set of probability distributions) are interchangeable, and so it might seem inconsistent if they do not all have the same prior. However, if we consider working in a model space of Markov equivalence classes (whether represented by PDAGs or not) there is no inconsistency: the prior probability of a Markov equivalence class of BNs can be defined to be the sum of the priors of the BNs in that class. There seems no reason why these BN priors need be equal. Of course, it may be more convenient to define a prior directly on the Markov equivalence class: this is a knowledge engineering issue. In the case where BNs model more than probability distributions, for example where the arcs also have a causal interpretation, then, of course, there is no argument for hypothesis equivalence.

2.4 Defining priors using global features

An application where the proper incorporation of prior structural knowledge is particularly important is the use of Bayesian networks to represent pedigrees (Sheehan and Sorensen, 2003). Pedigrees are directed graphs representing family relationships between individual people (or other animals). Given DNA data on a group of individuals it is often desired to uncover the pedigree relating them: settling paternity cases is a common motivation. Given DNA data $x$ and a pedigree $g$, a likelihood $P(x \mid g)$ can be determined using the Mendelian laws of genetic inheritance. A Bayesian approach is possible by defining a prior distribution $P(g)$ in addition. If the number of possible pedigrees is not too great it is possible to apply an exact Bayesian approach to this, where the posterior probability of each pedigree is calculated. This is the approach taken in the (freely available) familias system: http://www.math.chalmers.se/~mostad/familias/ (Egeland et al., 2000). There is a variety of BN representations of pedigrees in the literature. Here we will have each random variable in the BN represent the genotype of some individual. (A person's genotype is the entire set of genes under consideration for that person.)

For pedigrees, the terms 'parent' and 'child' are not metaphors; a link A → B in a pedigree states that the person A is a known parent of the person B. Nodes in a pedigree have zero, one or two parents, corresponding to the number of known parents of the individual corresponding to that node. If the pedigree is represented as a Bayesian network this becomes a BN learning task. A Bayesian approach requires that a prior over each possible pedigree is specified. In the familias system (Egeland et al., 2000) each pedigree $g$ has a prior proportional to

$$M_I^{b_I(g)} \, M_P^{b_P(g)} \, M_G^{b_G(g)} \qquad (3)$$

where $M_I$, $M_P$ and $M_G$ are non-negative parameters set by the user, $b_I(g)$ quantifies the amount of inbreeding present in $g$, $b_P(g)$ quantifies the level of promiscuity in $g$ and $b_G(g)$ is the number of generations in $g$. Hard constraints can also be included by specifying any known family relationships. Once pedigrees not satisfying the hard constraints are excluded, a uniform distribution over the survivors can be imposed by setting $M_I = M_P = M_G = 1$. However, doing so tends to give unrealistically high prior probabilities to pedigrees displaying high levels of inbreeding and promiscuity.
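
To make Equation 3 concrete, here is a minimal Prolog sketch (ours; inbreeding/2, promiscuity/2 and generations/2 are hypothetical stand-ins for whatever computes $b_I$, $b_P$ and $b_G$ from a pedigree):

% pedigree_prior(G, MI, MP, MG, P): P is proportional to the familias-style
% prior of Equation 3 for pedigree G, given the three user-set parameters.
pedigree_prior(G, MI, MP, MG, P) :-
    inbreeding(G, BI),      % b_I(G): amount of inbreeding (hypothetical)
    promiscuity(G, BP),     % b_P(G): level of promiscuity (hypothetical)
    generations(G, BG),     % b_G(G): number of generations (hypothetical)
    P is MI ** BI * MP ** BP * MG ** BG.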

2.5 Setting parameter priors

Up to this point, only probabilities on Bayesian net structures—not parameters—have been considered. But it is not possible to consider only structure, since a BN structure on its own does not define a likelihood for the data. It follows that, for each BN structure, a prior density over possible parameters for that structure is required. Methods for so doing will not be explored here, since our focus is on structural priors and the issues have been well covered elsewhere (Heckerman et al., 1995b). In short, there are compelling reasons to use Dirichlet priors, not least because, with complete data, it is possible to integrate the parameters out to get a closed form for the marginal likelihood. Prior Dirichlet parameters should be chosen so that BNs in the same Markov equivalence class have the same marginal likelihood. Using the freely available deal system (which is an R package) http://www.math.aau.dk/novo/deal/ (Bøttcher and Dethlefsen, 2003) is a good way to understand how to set the Dirichlet priors.
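
For reference, the closed form alluded to here is the standard Dirichlet-multinomial marginal likelihood (a known result from Heckerman et al., 1995b, stated in the usual notation rather than the paper's): with $N_{ijk}$ the number of cases in which variable $x_i$ takes its $k$-th value while its parents take their $j$-th configuration, $\alpha_{ijk}$ the matching Dirichlet parameters, $N_{ij} = \sum_k N_{ijk}$ and $\alpha_{ij} = \sum_k \alpha_{ijk}$,

$$P(D \mid B_S) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where $q_i$ is the number of parent configurations and $r_i$ the number of values of $x_i$.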

3. Defining informative priors using stochastic logic programs

We use stochastic logic programs (SLPs) to define informative priors on BN structures. Basically, an SLP is a logic program with added probabilities. These probabilities are used to change the normal Prolog depth-first backtracking search to a probabilistic depth-first search, which in this paper will use backtracking. Stochastic logic programs were introduced by Muggleton (1996). Maximum likelihood parameter estimation for SLPs was done by Cussens (2001) and the first use of SLPs to define prior distributions was by Angelopoulos and Cussens (2001). Earlier work on SLPs, such as (Cussens, 2001), did not permit backtracking as we do here; backtracking for SLPs was first introduced by Cussens (2000).

In this section, the aim is to give a concise account of the basic method. Throughout this section, deliberately simplistic examples of SLPs and fragments of SLPs will be used for explanatory purposes.

3.1 Defining the model space with a logic program

Before considering how to define an informative prior over a space of models using an SLP, we will consider the easier—and related—problem of defining a model space with a logic program. Suppose that we wish to consider a space of possible BNs (denoted by Ω) containing a mere 5 Bayesian nets, as listed in Fig 1. To use a logic program to represent this space the first decision is to choose a representation of each member of Ω as a first-order term. (In all applications that have been studied so far, it has been natural to use ground first-order terms to represent statistical models, although there is no actual requirement for groundedness.) One option is to represent BNs as a list of families. Each family specifies a child together with a list of its parents. Next we choose a monadic predicate symbol to denote the set of models so represented. Suppose we used k/1 for this purpose. This produces the logic program in Fig 1. k/1 is such that k(bn) is true iff bn represents a BN in Ω. Note that the functor <- is an operator, so that b <- [a,c] is syntactic sugar for '<-'(b, [a,c]).

Ω = { A → B → C,
      A → B ← C,
      A → B   C,
      A ← B ← C,
      A   B   C }

:- op( 100, xfx, ’<-’ ).

k([a<-[],b<-[a],c<-[b]]).

k([a<-[],b<-[a,c],c<-[]]).

k([a<-[],b<-[a],c<-[]]).

k([a<-[b],b<-[c],c<-[]]).

k([a<-[],b<-[],c<-[]]).

Figure 1: A very small model space and the logic program defining this space.

Naturally, much bigger model spaces can be defined using this technique. The logic program in Fig 2 uses a binary predicate k/2 to define the set of all BNs consistent with some given ordering of the BN random variables. The first argument contains the ordering and the second the BN. The program in Fig 2 represents BNs as a list of edges, rather than a list of families—this is just more convenient for this program.1

:- op( 100, xfx, ’<-’ ).

k([],[]).

k([RV|RVs], BN) :- %it’s a Bayes net if ..

k(RVs, TailBN), %this is and ..

edges(RVs, RV, TailBN, BN). %we connect RV thus giving BN

edges([], _RV, BN, BN). % Finished.

edges([H|T], RV, TailBN, BN) :- % Connect or otherwise, each

connected(H, RV, TailBN, MidBN), % potential parent.

edges(T, RV, MidBN, BN).

connected(Pa, RV, TailBN, [Pa<-RV|TailBN]). % connect

connected(_NoPa, RV, BN, BN). % don’t connect

Figure 2: A logic program defining a model space containing all BNs consistent with some given variable ordering.

Fig 3 shows what happens when we use a normal Prolog interpreter to enumerate all BNs consistent with the lexicographic ordering for random variables a, b and c. Note that k/2 defines not one model space but an infinite number: one for each possible ordering provided by an instantiation of the first argument. It turns out to be very convenient to have such parameterised definitions of model spaces.

1. Some variables in Fig 2 begin with an underscore. This is just a Prolog convention for variables that appear only once in a clause.

| ?- k([a,b,c],BN).

BN = [c<-a,b<-a,c<-b] ? ;

BN = [b<-a,c<-b] ? ;

BN = [c<-a,c<-b] ? ;

BN = [c<-b] ? ;

BN = [c<-a,b<-a] ? ;

BN = [b<-a] ? ;

BN = [c<-a] ? ;

BN = [] ? ;

no

Figure 3: Enumerating 3-node BNs consistent with the lexicographic ordering, using the predicate k/2 defined in Fig 2. The semi-colons are user input.

Next, consider the logic program in Fig 4. This is similar to the program in Fig 2 except that BNs do not have to respect the ordering of the variables provided in the first argument. Because of this it is necessary to check that any graphs produced are acyclic. The code for doing this in Fig 4 is chosen for simplicity and is not the most efficient way of checking for cycles. The cut (!) in Fig 4 is a 'green cut': removing it to make the logic program pure would only affect the program's efficiency. The symbol \+ is Prolog notation for negation, which in logic programs is negation-as-failure, not standard first-order negation.

Logic programs (being a subset of first-order logic) provide a convenient high-level and declarative language with which to precisely define complex model spaces. For example, if we wish to restrict attention to only those BNs which respect some conditional independence relation, it is enough to describe the condition in first-order logic and add it to the definition of what constitutes a model. A small example of this style of constraint is sketched below.
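
For instance, here is a minimal sketch (ours, not from the paper) of such a constrained model space built on the k/2 of Fig 4, assuming the op declaration of Fig 2 is in scope. It keeps only BNs in which a named variable is a root node, one of the Srinivas et al. hard constraints of Section 2.1; k_root/3 and root/2 are our own names.

% k_root(RVs, Root, BN): BN is a member of the Fig 4 model space over RVs
% in which Root has no parents, i.e. Root never appears as a child.
k_root(RVs, Root, BN) :-
    k(RVs, BN),              % k/2 as defined in Fig 4
    root(Root, BN).

root(_Root, []).
root(Root, [C<-_Pa|T]) :-    % C<-_Pa: C is the child end of the edge
    Root \== C,
    root(Root, T).

A query such as k_root([a,b,c], a, BN) then enumerates only those BNs in which a has no parents.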

3.2 Representing proofs with proof trees

Consider the behaviour exhibited in Fig 3. An initial query k([a,b,c],BN) is posed to the Prolog interpreter: this amounts to the question: does there exist a BN such that k([a,b,c],BN) is true? The interpreter (1) finds a proof (not shown in Fig 3) that k([a,b,c],BN) is true, (2) gives an affirmative answer and (3) provides a witness BN=[c<-a,b<-a,c<-b] corresponding to the found proof. In Fig 3 the user has asked for further proofs by entering ';' and the interpreter finds 8 proofs in total, providing a witness BN in each case. As will be explained in Section 3.3, SLPs define a distribution over first-order terms such as BN=[c<-a,b<-a,c<-b] by defining a distribution over the proofs of queries such as k([a,b,c],BN).

Since proofs play such a central role in SLP-defined distributions, we need an appropriate representation of them. It will prove useful to represent each proof as a proof tree. Proof trees are fully described, for example, by Nilsson and Małuszyński (1995).


k([],[]).

k([RV|RVs],BN) :- % it’s a Bayes net if ..

k(RVs,TailBN), % this is and ..

edges(RVs,RV,TailBN,BN), % we connect RV thus giving ..

no_cycle(BN). % a BN

edges([],_RV,BN,BN). % finished

edges([H|T],RV,TailBN,BN) :- edges(T,RV,[H<-RV|TailBN],BN).

edges([H|T],RV,TailBN,BN) :- edges(T,RV,[RV<-H|TailBN],BN).

edges([_H|T],RV,TailBN,BN) :- edges(T,RV,TailBN,BN).

no_cycle([]).

no_cycle(BN) :-

select(_X<-Y,BN,Rest), % delete a parent ..

\+ select(Y<-_Z,BN,_), % which is an orphan

!, no_cycle(Rest).

select(H,[H|T],T).

select(X,[H|T],[H|T2]) :- select(X,T,T2).

Figure 4: A logic program defining a model space containing all BNs for a given set of variables.

Basically, a proof tree is the logic programming version of a parse (or derivation) tree for grammars. Proof trees are constructed by bolting together elementary trees. Each clause in the logic program has a corresponding elementary tree. Figs 5 and 6 show two clauses from Fig 4 together with their corresponding elementary trees. (We have abbreviated some of the predicate names in Fig 6 for reasons of space.)

k([],[]). k([],[])

Figure 5: A unit clause with its elementary tree representation.

k([RV|RVs],BN) :-

k(RVs,TBN),

e(RVs,RV,TBN,BN),

nc(BN).

k([RV|RVs], BN)
├── k(RVs, TBN)
├── e(RVs, RV, TBN, BN)
└── nc(BN)

Figure 6: A non-unit clause with its elementary tree representation.


Atomic goals (i.e. things we want to prove such as k([a,b,c],BN)) also have an elementary tree representation: each goal corresponds to a tree where the goal is the only node, such as in Fig 7. Elementary trees are joined together to make derivation trees by connecting the root of one elementary tree onto the leaf of another and declaring an equality constraint between the atomic formulae which label these two nodes. Only leaves labelled with atomic formulae, not those labelled □, can have trees joined to them.

k([a, b, c], BN)

Figure 7: A single-node elementary tree corresponding to the goal k([a,b,c],BN).

For example, Fig 8 shows the derivation tree produced by joining the elementary trees of Figs 6 and 7. Note that variables are renamed ('standardised apart') to prevent accidental equalities. Derivation trees with a goal at the root node represent states in the search for a proof of that goal. The tree in Fig 8 represents the state that is reached after choosing the rule in Fig 6 as the first rule in the search for a proof for k([a,b,c],BN).

LHS:
k([a, b, c], BN)  =  k([RV1|RVs1], BN1)
├── k(RVs1, TBN1)
├── e(RVs1, RV1, TBN1, BN1)
└── nc(BN1)

RHS:
k([a,b,c], BN)
├── k([b,c], TBN1)
├── e([b,c], a, TBN1, BN)
└── nc(BN)

Figure 8: LHS: Derivation tree produced by combining the elementary trees of Figs 6 and 7. Note the variable renaming. RHS: Simplified version of this tree.

A derivation tree all of whose leaves are labelled by □ is said to be complete. Complete derivation trees are also called proof trees. Proof trees cannot be further extended, so they represent the termination of a proof search. If the set of equations in a derivation tree has a solution, then the tree is consistent, otherwise it is inconsistent. Only consistent proof trees represent successful proofs; inconsistent ones represent 'failed proofs'.

One strategy for finding proofs is to build a complete proof tree, and only then check whether the equations it contains are soluble. Clearly this can be highly inefficient: it is much better to check for solubility as the tree is built, stopping the construction of the tree if the current set of equations is insoluble. The Prolog approach to doing this is to simplify the set of equations after each extension of the derivation tree, thus facilitating the detection of an insoluble set of equations. The RHS of Fig 8 shows how simplification works. Note that the extension of the initial goal tree in Fig 7 to the derivation tree in Fig 8 amounts to replacing one atomic goal with the three atomic goals at the leaves of Fig 8. Each of these atomic goals has to be proved in the same way as the original goal: by having a consistent proof tree built 'underneath' it. However, because the variables BN and TBN1 are shared between the three goals there is a dependency between possible proof trees for each of the three goals.

Consider next negated goals such as \+ select(Y<-_Z,BN,_) in Fig 4. A consistent proof tree with a negated goal \+ Goal exists if and only if it can be established that there is no consistent proof tree with Goal as its root. In this case the proof tree has only two nodes: the root \+ select(Y<-_Z,BN,_) and a single leaf □. Note that no variables in a negated goal get instantiated.

It may be useful for some readers to compare proof trees to the more standard account of proof construction for logic programs in terms of SLD-resolution. In SLD-resolution, to prove a formula such as k([a,b,c],BN), the negation ¬k([a,b,c],BN) is asserted (the initial goal) and an attempt to prove a contradiction is made. Standard SLD-resolution selects the leftmost atom from the current goal, unifies it with the head of a clause and replaces the selected atom with the body of that clause. If at any point the empty goal is produced then a contradiction has been established. The connection with proof trees is simple: replacing the selected atom with a clause body is like joining an elementary tree; the current goal is just the set of non-□ leaves; the selected atom of the current goal will be the deepest, leftmost leaf; a complete tree (all leaves □) corresponds to producing the empty goal.

Prolog execution (at least conceptually) consists of interleaving steps of tree extension and equation simplification. As previously mentioned, in standard Prolog, the tree is always extended by attaching an elementary tree to the deepest, leftmost non-□ leaf. Call the atomic formula at this leaf the selected atom. Generally, there will be several elementary trees available. This creates what is known as a choice-point, which is simply an ordered set of possible elementary trees. In standard Prolog the ordering of elementary trees is fixed by the lexical ordering of the corresponding clauses in the Prolog source file.

Upon the creation of a choice-point the first tree in the ordering is used and then removed from the choice-point. If at any stage an inconsistent derivation tree is produced, then the process backtracks to the most recent non-empty choice-point and uses the first elementary tree in the choice-point (this tree is then removed from the choice-point). The process of backtracking to non-empty choice-points and removing trees from choice-points means it is possible to backtrack all the way back to the first choice-point created.

The trees in Figs 7 and 8 show the first two stages in the construction of a proof tree for the goal k([a,b,c],BN). For reasons of space we will not show all the subsequent stages, but one intermediate stage is shown in Fig 9 and the final stage is the consistent proof tree shown in Fig 10. This last tree constitutes a proof that k([a,b,c], [c<-a,b<-a,c<-b]) is true. We say that the consistent proof tree has yielded the atom k([a,b,c], [c<-a,b<-a,c<-b]). Yielded atoms are initial goals with the instantiation generated by the tree applied. The SLPs considered in this paper only ever produce ground yielded atoms.

3.3 Defining priors using stochastic logic programs

Having seen how logic programs are used to define model spaces, we now turn to how stochastic logic programs (SLPs) can be used to define probability distributions over these model spaces. The basic idea is very simple: parameters are added to a logic program so that when a choice-point is created an elementary tree is chosen according to a probability distribution determined by the parameters, rather than deterministically as in standard Prolog execution. This determines a distribution over consistent proof trees for any given initial goal and thus a distribution over instantiations of any variables in that initial goal.

There are a number of different ways of doing this, three of which were discussed by Cussens (2000). In this paper, we opt for an approach named backtrackable sampling.


k([a,b,c], BN)
├── k([b,c], [c<-b])
│   ├── k([c], [])
│   │   ├── k([], [])
│   │   ├── e([],c,[],[])
│   │   └── nc([])
│   ├── e([c],b,[],[c<-b])
│   │   └── e([],b,[c<-b],[c<-b])
│   └── nc([c<-b])
├── e([b,c],a,[c<-b],BN)
│   └── e([c],a,[b<-a,c<-b],BN)
└── nc(BN)

Figure 9: The lexically first recursive e/4 clause is used; a is chosen to be a parent of b. The selected atom is now e([c],a,[b<-a,c<-b],BN). A choice point is created since there are alternative elementary trees (= clauses) for this atom.

k([a,b,c], [c<-a,b<-a,c<-b])
├── k([b,c], [c<-b])
│   ├── k([c], [])
│   │   ├── k([], [])
│   │   ├── e([],c,[],[])
│   │   └── nc([])
│   ├── e([c],b,[],[c<-b])
│   │   └── e([],b,[c<-b],[c<-b])
│   └── nc([c<-b])
├── e([b,c],a,[c<-b],[c<-a,b<-a,c<-b])
│   └── e([c],a,[b<-a,c<-b],[c<-a,b<-a,c<-b])
│       └── e([],a,[c<-a,b<-a,c<-b],[c<-a,b<-a,c<-b])
└── nc([c<-a,b<-a,c<-b])

Figure 10: At this point nc([c<-a,b<-a,c<-b]) is proved to be true, but the relevant proof sub-tree has been omitted. This is a consistent proof tree proving that k([a,b,c], [c<-a,b<-a,c<-b]) is true.


Backtrackable sampling is similar in spirit to the approach given in the original paper by Muggleton (1996). It was introduced by Cussens (2000), where it is contrasted with using SLPs to define log-linear distributions. It is important to realise that backtrackable sampling does not define a log-linear model, in contrast with the sampling approach analysed by Cussens (2001), which did not permit backtracking.

Backtrackable sampling permits us to use a very generalised form of SLP. Not all clauses need to be parameterised, so the SLPs are impure, and any parameters need only be non-negative, so the SLPs are unnormalised. We will begin our presentation with unextended SLPs, leading to the simple formal definition in Definition 1.

Definition 1 An unextended, impure, unnormalised stochastic logic program is a logic program where some clauses are annotated with non-negative numbers. A clause C annotated with p is represented by p :: C. For each predicate in such an SLP, either all the clauses making up its definition are parameterised or none are. Predicates of the former sort are called probabilistic predicates.

It remains to state how these clause parameters determine a distribution over consistent proof trees for any given initial goal. The distribution is determined by a sampling mechanism for consistent proof trees which we now define.

Definition 2 Backtrackable sampling for unextended, impure, unnormalised stochastic logic programs produces a distribution over consistent proof trees for an initial goal as follows. Backtrackable sampling is identical to the standard Prolog construction of a proof tree except that elementary trees are chosen probabilistically from non-empty choice points when the predicate symbol of the selected atom is probabilistic. If $T_1, T_2, \ldots, T_n$ is such a choice point and $p_1 :: C_1, p_2 :: C_2, \ldots, p_n :: C_n$ is the corresponding set of annotated clauses, then tree $T_i$ is selected with probability $p_i / \sum_{j=1}^{n} p_j$. (Note that these probabilities are independent of the ordering of the elementary trees.)

Definition 3 A distribution over consistent proof trees $P_{tree}$ (for a given initial goal $G_0$) defines a distribution $P_{atom}$ over yielded atoms $G_0\theta$ as follows. For any atom $G_0\theta$, $P_{atom}(G_0\theta) = \sum_{t:\, t \text{ yields } G_0\theta} P_{tree}(t)$.

It is because selection probabilities are defined by the quotient $p_i / \sum_{j=1}^{n} p_j$ that there is no need for the parameters for a particular predicate to be normalised, i.e. to sum to one. However, in our examples we will use normalised SLPs since there is no reason not to. Note that the set of possible trees at a choice point is reduced if backtracking back to that choice point occurs. This is because the 'guilty' elementary tree (= clause) is no longer a possible choice. This means that the denominator in $p_i / \sum_{j=1}^{n} p_j$ will be reduced. If only one tree is left at a choice point then it is chosen with probability one.
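
By way of illustration, here is a minimal Prolog sketch (ours, not the paper's implementation) of the weighted choice in Definition 2. It assumes SWI-Prolog's random/1 and represents a choice point as a list of Weight-Clause pairs:

% sample_clause(Weighted, Clause): pick Clause from a list of W-C pairs
% with probability W / (sum of all weights), per Definition 2.
sample_clause(Weighted, Clause) :-
    sum_weights(Weighted, Total),
    Total > 0,
    random(R0),                  % uniform float in (0.0, 1.0)
    R is R0 * Total,
    pick(Weighted, R, Clause).

sum_weights([], 0).
sum_weights([W-_|T], S) :-
    sum_weights(T, S0),
    S is S0 + W.

% Walk down the list subtracting weights until the sampled point R
% falls inside the current clause's weight interval.
pick([W-C|T], R, Clause) :-
    (   R < W
    ->  Clause = C
    ;   R1 is R - W,
        pick(T, R1, Clause)
    ).

Calling sample_clause again with the 'guilty' pair deleted from the list implements the renormalisation over the surviving clauses described above.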

For a given SLP and initial goal, it is possible to characterise backtrackable sampling for SLPs as a Markov chain. Each state of the Markov chain is a derivation tree together with the current choice point for every node in the tree. We will call these states augmented derivation trees. Augmented derivation trees are nothing more than representations of the internal state of a Prolog program as it executes. The initial state (with probability one) is the one-node derivation tree whose single node is the initial goal, with a choice point containing all elementary trees whose root node matches the predicate symbol of the initial goal. If the selected atom of the current augmented derivation tree (= state) is non-probabilistic then with probability 1 it is extended with the next available elementary tree, if any, according to the standard Prolog lexical ordering. If the selected atom is probabilistic, then for the unextended SLPs we are currently considering, the selected elementary tree is chosen as specified in Definition 2. If an inconsistent tree is produced or if we run out of elementary trees for a particular choice-point then with probability 1 the chain backtracks to the most recent non-empty choice-point as previously described. We can use the usual trick of making 'final' states, which here are consistent proof trees, into absorbing states. Once we hit a consistent proof tree the chain remains stuck there for ever. For non-trivial SLPs the state space of this Markov chain is very large, but since both it and the relevant transition probabilities are defined in a compact manner it is still a usable representation.

If any sequence of probabilistic choices leads to a consistent proof tree without backtracking, then the SLP is said to be failure-free. Our first example SLP, in Fig 11, is clearly failure-free; it is a simple tabular representation of a discrete probability distribution for the model space defined by its underlying logic program (which is given in Fig 1).

0.40 :: k([a<-[],b<-[a],c<-[b]]).

0.15 :: k([a<-[],b<-[a,c],c<-[]]).

0.15 :: k([a<-[],b<-[a],c<-[]]).

0.15 :: k([a<-[b],b<-[c],c<-[]]).

0.15 :: k([a<-[],b<-[],c<-[]]).

Figure 11: A failure-free SLP defining a prior distribution over a very small space of BNs.

Our next example SLP is produced by replacing the definition of connected/4 in Fig 2 with:

0.6 :: connected(Pa, RV, TailBN, [Pa<-RV|TailBN]). % connect

0.4 :: connected(_NoPa, RV, BN, BN). % don’t connect

The SLP so produced is not formally failure-free, since a naïve underlying Prolog interpreter might, for example, select the first edges/4 clause when the first argument of the selected atom is a non-empty list. This would produce an inconsistent tree (a unification failure) and backtracking would be invoked to try (successfully) the other edges/4 clause. A smarter system would use indexing to avoid this. This is just a question of the efficiency of sampling; the same distribution over consistent proof trees is defined in both cases.

Failure-free SLP priors are efficient to sample from (since each clause choice constitutes progress towards a successful proof) and have a simple mathematical characterisation. They are essentially equivalent to stochastic context-free grammars, with predicates playing the role of grammar non-terminals. For the unextended SLPs we have seen so far, the sequence of probabilistic choices made in building a consistent proof tree for a failure-free SLP is simply a joint instantiation of a sequence of independent random variables—each individual random variable having a multinomial distribution specifying which clause to choose for a probabilistic predicate.


We next consider a more complex SLP which is produced by replacing the definition of edges/4 in Fig 4 with:

edges([],_RV,BN,BN). % finished

edges([H|T],RV,BNIn,BNOut) :-

p_edges([H|T],RV,BNIn,BNOut).

0.3 :: p_edges([H|T],RV,TailBN,BN) :- % RV parent of H

edges(T,RV,[H<-RV|TailBN],BN).

0.2 :: p_edges([H|T],RV,TailBN,BN) :- % RV child of H

edges(T,RV,[RV<-H|TailBN],BN).

0.5 :: p_edges([_H|T],RV,TailBN,BN) :- % no direct connection

edges(T,RV,TailBN,BN).

We have shown how, for the logic program of Fig 4, the standard Prolog approach would build a consistent proof tree for the goal k([a,b,c], BN) in Figs 8–10. We now consider how backtrackable sampling works for its probabilistic version.

Note that our SLP has two predicates, edges/4 and p_edges/4, where its corresponding logic program in Fig 4 had only one: edges/4. This is to cleanly separate the entirely deterministic question of whether there remain nodes to consider connecting to the 'current' node RV from the probabilistic issue of what sort of connection to make. To cut down on the size of the derivation trees we are about to show, we will informally pretend, when presenting these trees, that there is only one 'edges' predicate in our SLP, which, moreover, we will abbreviate to e/4.

Suppose that, as chance would have it, the lexically first recursive p_edges/4 clause had been probabilistically chosen for the first two choice points, just as if normal deterministic Prolog execution were being used. The derivation tree would then be the one shown in Fig 9. The selected atom at this point is edges([c],a,[b<-a,c<-b],BN), abbreviated to e([c],a,[b<-a,c<-b],BN), and a choice point is created since there is a choice of 3 p_edges/4 clauses. Suppose that the elementary tree corresponding to the second p_edges/4 clause is probabilistically chosen. This produces the derivation tree shown in Fig 12. The procedure would then (deterministically) use the base edges/4 clause to produce the tree in Fig 13.

The tree in Fig 13 is not inconsistent itself, but it has no consistent complete extension because nc([a<-c,b<-a,c<-b]) is not true and thus cannot be proved. We will not detail the relevant proof steps since there is no probabilistic element. Suffice it to say that once the unprovability of nc([a<-c,b<-a,c<-b]) has been established, our sampling mechanism will backtrack to the most recent choice point, which is the tree in Fig 9, but with the 'guilty' second recursive clause now removed from the choice point. Suppose that this time the third p_edges/4 clause is probabilistically chosen; then the derivation tree in Fig 14 is the result. This tree will then be deterministically extended to the consistent proof tree in Fig 15 without any further backtracking.

3.4 Extended SLPs

It is not difficult to see that the state transition probabilities of the Markov chain associated with an unextended SLP depend only on the predicate symbol of the selected atom


[Derivation tree diagram not reproduced.]

Figure 12: Probabilistically choose c to be the parent of a

[Derivation tree diagram not reproduced.]

Figure 13: An incomplete derivation tree which cannot be extended to a consistent proof tree, since nc([a<-c,b<-a,c<-b]) is not true. The backtrackable sampling mechanism will thus eventually backtrack to the most recent choice point, which is Fig 9.


[Derivation tree diagram not reproduced.]

Figure 14: Probabilistically choose c not to connect to a

[Derivation tree diagram not reproduced.]

Figure 15: A consistent proof tree proving k([a,b,c], [b<-a,c<-b])


and which clauses, if any, have been removed from the choice-point corresponding to this selected atom. Sometimes this is not fine-grained enough. Extended SLPs (Angelopoulos and Cussens, 2004b), in contrast, permit a much greater range of transition probabilities to be used. We introduce extended SLPs by way of a simple example in Fig 16.

:- pvars( umember(_El,List), [L-(length( List, L ),L>0)] ).

1/L :: [L] :: umember( El, [El|T] ).
1 - (1/L) :: [L] :: umember( El, [_|T] ) :- [L-1] :: umember( El, T ).

Figure 16: An extended SLP for selecting uniformly from a list.

The basic idea behind umember/2 in Fig 16 is that the probability with which each element (first argument) is picked from a list (second argument) should be uniform. To achieve this, when umember(El,List) is the selected atom, the pvars/2 directive is called and length( List, L ),L>0 is used to compute the list length L. The value L is then used to dynamically compute the probabilities for the two umember/2 clauses via the expressions 1/L and 1 - (1/L). It is easy to see (by induction) that this procedure defines a uniform distribution over members of the list.
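A Python rendering of the same idea may help; the function below is an illustrative stand-in for umember/2, not part of the SLP system:

    import random

    # Keep the head with probability 1/L, where L is the length of the
    # remaining list; otherwise recurse on the tail. By induction, the
    # element at position i is selected with probability
    # (1 - 1/L)(1 - 1/(L-1)) ... (1/(L-i)) = 1/L, i.e. uniformly.
    def umember(xs):
        L = len(xs)
        assert L > 0
        if random.random() < 1.0 / L:
            return xs[0]
        return umember(xs[1:])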

In general, the first argument of the pvars/2 directive is an atom corresponding to a probabilistic predicate, while the second is a list of pairs. Each member of the list connects a variable (e.g. L) to its guard (e.g. (length( List, L ),L>0)). The syntax for defining probabilistic clauses is extended to Expr::List::Clause. Expr is an arithmetic expression containing some of the variables from List. List is a list of variables, each of which corresponds to one item in the second argument of the relevant pvars/2 directive. Clause is a standard clause. Note that the recursive call in our example is [L - 1]::umember( El, T ). Since the length of tail T can be computed from that of [H|T], recursive calls can avoid calling length/2. In the example, this is achieved by attaching the evaluable expression L - 1 to the recursive call. In general a goal corresponding to a probabilistic call can be either of the form Head or of the form Exprs::Head, where Exprs is a list of evaluable expressions and variables. When the call is of the form Head, or when a variable is found in Exprs, the relevant guards from pvars/2 are called. Otherwise, the expressions in Exprs are evaluated.

The basic rule for using extended SLPs is that the SLP must be written in such a way that selected atoms contain sufficient information to enable computation of the desired probabilities. In this paper we will not attempt to characterise which probability distributions can be represented by extended SLPs, leaving this for future work. So far, we have only used extended SLPs for simple utility predicates similar to umember/2.

4. MCMC for posterior distributions over large finite spaces

This section does not contain a tutorial on Markov chains or on MCMC; the bare minimum required to understand our particular approach and to fix notation is provided. Good introductions to finite Markov chains are provided by Feller (1950) and Häggström (2002).


For an introduction to MCMC from a machine learning perspective see Andrieu et al. (2003), and for a statistical perspective see Robert and Casella (2004).

4.1 MCMC basics

Let $\mathcal{X}$ be a finite set, and let $\pi$ be some probability distribution over $\mathcal{X}$. Suppose we wish to sample from $\pi$ but this is difficult to do directly. The Markov chain Monte Carlo (MCMC) solution to this problem is to construct a Markov chain to indirectly produce an approximate sample from $\pi$. Restricting to the homogeneous case, as we will do throughout, a Markov chain on $\mathcal{X}$ is defined by an initial distribution $\mu_0$ on $\mathcal{X}$ and transition probabilities $P(x, y)$ for all $x, y \in \mathcal{X}$. Elements of $\mathcal{X}$ can be viewed as states: intuitively $P(x, y)$ is the probability of 'moving' from state $x$ to state $y$. The Markov chain provides a distribution over infinite state sequences. Let $X_t$ be the state at (discrete) time $t$; then the transition probabilities determine all the necessary conditional probabilities via this conditional independence relation:

$$P(X_t = y \mid X_0 = x_0, \ldots, X_{t-1} = x) = P(X_t = y \mid X_{t-1} = x) = P(x, y), \qquad t = 1, 2, \ldots \qquad (4)$$

Markov chains are thus memoryless: the probability of moving to any given state depends only on the current state. Let $\mu_t$ represent the (marginal) probability distribution on $X_t$ defined by the Markov chain. Since $\mathcal{X}$ is finite the transition probabilities can be arranged in a transition matrix $P$ such that entry $P_{ij}$ of $P$ is the probability of moving from state number $i$ to state number $j$, according to some arbitrary enumeration of the elements (= states) of $\mathcal{X}$. We then have $\mu_t = \mu_0 P^t$, where $\mu_0$ and $\mu_t$ are row-vectors of length $|\mathcal{X}|$.

If $\pi$ is the distribution from which we wish to sample, the MCMC approach is to construct a Markov chain such that $\|\mu_t - \pi\| \to 0$ as $t \to \infty$. For sufficiently large $t$, samples drawn from $X_t$ will provide a good approximation to sampling from $\pi$. This convergence implies that $\pi P = \pi$: $\pi$ is thus a stationary distribution for the chain.
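As a toy numerical illustration of this convergence (a three-state chain invented for the example, unrelated to our BN application), one can iterate $\mu_{t+1} = \mu_t P$ and observe that the limit satisfies $\pi P = \pi$:

    import numpy as np

    # A small irreducible, aperiodic chain: mu_t = mu_0 P^t converges to
    # the stationary distribution pi, which here is (2/9, 5/9, 2/9).
    P = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.6, 0.2],
                  [0.0, 0.5, 0.5]])
    mu = np.array([1.0, 0.0, 0.0])  # mu_0: start in state 0
    for _ in range(50):
        mu = mu @ P                 # one step of mu_{t+1} = mu_t P
    print(mu)                       # ~ [0.222, 0.556, 0.222]
    print(mu @ P)                   # unchanged: pi P = pi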

4.2 The Metropolis-Hastings algorithm

Throughout we will restrict attention to posterior distributions $\pi$, where each $x \in \mathcal{X}$ represents a particular statistical model in some model space. Note that in this paper attention is restricted to Bayesian network models, but that much of what follows is of general application. We have that $\pi(x) \propto L(x)\pi_0(x)$, where $L(x)$ is the likelihood associated with $x$ for some fixed set of observed data, and where $\pi_0(x)$ is the prior probability of $x$.

In this paper each BN $x \in \mathcal{X}$ is unparameterised: no conditional probabilities are specified. It follows that the likelihood $L(x)$ must be a marginal likelihood: we integrate over possible parameter values rather than specify particular values. This integration requires the definition of prior density functions, and we adopt the entirely standard approach of requiring these to be Dirichlet distributions. For the details of this approach see Heckerman et al. (1995b). If various independence assumptions are made about the joint distribution over parameters we have, for any BN $x \in \mathcal{X}$:

$$L(x) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(n_{ij} + \alpha_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(n_{ijk} + \alpha_{ijk})}{\Gamma(\alpha_{ijk})} \qquad (5)$$


where $i$ is a node in the BN, $j$ is a joint instantiation of the parents of such a node in $x$, and $k$ is a value of the node. $n_{ijk}$ is the data count of node $i$ having value $k$ when its parents have configuration $j$; $\alpha_{ijk}$ is the corresponding Dirichlet parameter. Finally, $n_{ij} = \sum_{k=1}^{r_i} n_{ijk}$ and $\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$. In our experiments we have set $\alpha_{ijk} = N/(r_i q_i)$ for all $i, j, k$, where $N$ (the prior precision) has either been $N = 1$ or $N = 10$. For either value of $N$, the marginal likelihood function is an instance of the BDeu metric for scoring BN structures. This metric respects likelihood equivalence: BN structures encoding the same conditional independence relations (Markov equivalent structures) will get the same score. The BDeu metric was introduced by Buntine (1991) and was given this name by Heckerman et al. (1995b), who showed it was a special case of a more general likelihood-equivalent metric, the BDe metric.

The central assumption of our method is that, although $\pi$ is hard to sample from, $\pi_0$, or distributions closely related to $\pi_0$, are easy to sample from. We do not assume that $\pi_0(x)$ is easy to evaluate for any given $x \in \mathcal{X}$, so our method is particularly well-suited to prior distributions which are directly defined in terms of some sampling mechanism, as opposed to some closed form.

We use the Metropolis-Hastings algorithm to construct a chain with $\pi$ as stationary distribution. This algorithm defines a Markov chain using an auxiliary proposal distribution $q$ as follows. If $X_t = x$, then propose a new state $y$ according to the proposal distribution $q(y|x)$. This restricts $q$ to be such that values can be readily sampled from the conditional distributions $q(\cdot|x)$, $x \in \mathcal{X}$. Next compute an acceptance probability $\alpha(x, y)$ as defined by (6):

$$\alpha(x, y) = \min\left\{\frac{\pi(y)}{\pi(x)}\,\frac{q(x|y)}{q(y|x)},\; 1\right\} \qquad (6)$$

and accept the proposed $y$ with probability $\alpha(x, y)$. If the proposal is rejected then the chain remains at state $x$. Under weak conditions such a Markov chain converges to $\pi$.

We now consider proposals for which the acceptance probability is particularly easy to compute. Suppose the proposal $q$ is related to the prior as follows:

$$\frac{q(x|y)}{q(y|x)} = R_q(x, y)\,\frac{\pi_0(x)}{\pi_0(y)} \;\Leftrightarrow\; R_q(x, y) = \frac{q(x|y)\,\pi_0(y)}{q(y|x)\,\pi_0(x)}, \qquad x, y \in \mathcal{X} \qquad (7)$$

where $R_q(x, y)$ is some function which is easy to evaluate for any $x, y \in \mathcal{X}$. Whenever (7) is satisfied we have:

$$\alpha(x, y) = \min\left\{\frac{\pi(y)\,q(x|y)}{\pi(x)\,q(y|x)},\; 1\right\} \qquad (8)$$

$$= \min\left\{\frac{\pi_0(y)\,L(y)\,q(x|y)}{\pi_0(x)\,L(x)\,q(y|x)},\; 1\right\} = \min\left\{R_q(x, y)\,\frac{L(y)}{L(x)},\; 1\right\}$$

So if (7) holds then the acceptance probability reduces to a likelihood ratio multiplied by the easily-evaluable $R_q(x, y)$. In this case we do not need to be able to evaluate any other functions; in particular we do not need to be able to evaluate the prior or the proposal distribution.
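Schematically, a single Metropolis-Hastings step in this special case could look as follows; propose, log_lik and R_q are hypothetical placeholders for problem-specific code, and the sketch is ours rather than the paper's Prolog implementation:

    import math
    import random

    # One MH step when (7) holds: the proposal is driven by the prior, so
    # only R_q and a log-likelihood ratio are needed; neither the prior
    # nor the proposal density is ever evaluated.
    def mh_step(x, propose, log_lik, R_q):
        y = propose(x)                 # y ~ q(.|x), regrown via the prior
        log_alpha = math.log(R_q(x, y)) + log_lik(y) - log_lik(x)
        if math.log(random.random()) < min(log_alpha, 0.0):
            return y                   # accept
        return x                       # reject: chain stays at x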


4.2.1 The single-component Metropolis-Hastings algorithm

The ordered single-component Metropolis-Hastings algorithm is a variant which is applicable when models $x \in \mathcal{X}$ can be divided into components $x = (x_1, x_2, \ldots, x_n)$. The details are given by (Gilks et al., 1996, p. 10), from whom we extract a brief description. Let $x_{t,i}$ denote the state of $x_i$ at time $t$. Iteration $t+1$ is done in $n$ steps. Candidate $y_i$ is proposed by $q_i(y_i \mid x_{t,i}, x_{t,-i})$ where

$$x_{t,-i} = (x_{t+1,1}, \ldots, x_{t+1,i-1}, x_{t,i+1}, \ldots, x_{t,n})$$

so the proposal is conditional on the already-found new values of the earlier components and the existing values of the later components. Each proposed $y_i$ is accepted with probability:

$$\alpha_i = \min\left\{\frac{\pi(y_i \mid x_{-i})\, q_i(x_i \mid y_i, x_{-i})}{\pi(x_i \mid x_{-i})\, q_i(y_i \mid x_i, x_{-i})},\; 1\right\}$$

Ordered Gibbs sampling is a special case where $q_i(y_i \mid x_i, x_{-i}) = \pi(y_i \mid x_{-i})$: the proposal distributions are the full conditional posterior distributions, so that $\alpha_i \equiv 1$. If conditional posterior distributions are not available to serve as proposals, but conditional prior distributions are, we can set $q_i(y_i \mid x_i, x_{-i}) = \pi_0(y_i \mid x_{-i})$, in which case:

$$\alpha_i = \min\left\{\frac{L(y_i, x_{-i})}{L(x_i, x_{-i})},\; 1\right\} \qquad (9)$$

Call this algorithm the ordered prior-Gibbs sampler. In our case, due to the decomposability of the marginal likelihood, the likelihood ratio in (9) reduces to $L_i(y_i)/L_i(x_i)$, where $L_i$ is the component of the likelihood associated with variable $i$. This cheaply computable acceptance probability is a big advantage for prior-Gibbs sampling.

Note that the Gibbs (resp. prior-Gibbs) sampler requires proposal distributions only for the $n$ full conditional posterior (resp. prior) distributions. It follows that the posterior (resp. prior) need only be specified implicitly by these conditional distributions. However, in a number of applications it has been found practical to use an MCMC algorithm in all respects like the ordered Gibbs sampler except that the $n$ distributions used for sampling are inconsistent: there is no joint distribution for which they are full conditional distributions. Following Heckerman et al. (2000) we will call such a procedure ordered pseudo-Gibbs sampling.

As Gelman points out:

This "ordered pseudo-Gibbs sampler" (Heckerman, Chickering, Meek, Rounthwaite, and Kadie 2001) is a Markov chain algorithm, and if the values of the parameters are recorded at the end of each loop of iterations, then they will converge to a distribution. This distribution may be inconsistent with various of the conditional distributions used in its implementation, but in general there will be convergence to something (assuming that the usual conditions hold for convergence of a Markov chain to a unique nondegenerate stationary distribution). (Gelman, 2004) [emphasis in the original]

Gelman notes that the inconsistency is a "theoretical flaw" but argues that in compensation "Conditional modeling allows enormous flexibility in dealing with practical problems.


In applied work, we have never been able to fit the joint models to a real dataset without making drastic simplifications." One interesting example of using inconsistent conditional distributions is the PHASE algorithm for haplotype reconstruction, whose authors describe it as follows:

Indeed, the algorithm implemented in PHASE was not actually developed in the conventional way of writing down a prior and likelihood and then developing a computational method for sampling from the corresponding posterior (and neither, incidentally, was the algorithm of Lin et al. [2002]). Rather, the posterior is defined implicitly as the stationary distribution of a particular Markov chain, which in turn is defined via a set of (inconsistent) conditional distributions. (Stephens and Donelly, 2003)

Lauritzen and Richardson (2002) note that even if the pseudo-Gibbs sampler is guaranteed to have a stationary distribution, it may have "a limiting distribution depending on the particular choice of ordering π in the sitewise updating" (Stephens and Donelly use a randomised component-updating scheme for PHASE to avoid this problem). They add that "it would be desirable to have a more precise understanding of the general relation between the limiting distribution µ of a stationary pseudo-Gibbs sampler and the conditional specifications q."

In this paper we use pseudo-prior-Gibbs sampling, which is like prior-Gibbs sampling except that the proposals may be inconsistent: there may be no prior distribution for which they are conditional distributions. Much of what has been said about pseudo-Gibbs sampling applies to pseudo-prior-Gibbs sampling: it is an approximation to the theoretically sound prior-Gibbs sampling which works well in practice. When the proposals closely approximate proper conditional priors we can expect a reasonable approximation, but we lack a precise understanding of the nature of this approximation.

4.3 Metropolis-Hastings with SLPs

In this paper we implement the Metropolis-Hastings algorithm using SLPs. The prior $\pi_0$ is expressed by an SLP. The state space $\mathcal{X}$ is the set of consistent proof trees constructed from the SLP and a given user-supplied goal. Examples of consistent proof trees can be found in Figs 10 and 15. The proof tree in Fig 15 was sampled from the SLP version of Fig 4 using the initial goal k([a,b,c], BN). As described in Section 3, SLPs define a distribution over yielded atoms. Our SLP priors are written in such a way that there is exactly one term in each yielded atom which represents a model; in this paper the model is a BN. Call this term the model term. For example, the tree in Fig 15 generates the model term [b<-a,c<-b], which represents the obvious BN structure. Each $x \in \mathcal{X}$ thus has a likelihood $L(x)$, assuming a fixed set of training data: this is just the (marginal) likelihood of its associated model. In the language of MCMC, the information provided by the proof tree over and above the model term is an 'auxiliary variable': part of the state space for the Markov chain but not part of the stationary distribution we are interested in. Naturally only the model term is recorded as an MCMC run progresses.

It remains to choose a proposal: how to move between proof trees. The basic idea is this: given a current proof tree $x \in \mathcal{X}$, a new one is proposed by deterministically deleting


part of $x$, thus producing a derivation tree $x'$, and then using the prior $\pi_0$ to probabilistically re-grow $x'$ to produce a proposed $y \in \mathcal{X}$. Derivation trees $x'$ produced by deleting part of a consistent proof tree are called backtrack points. We begin by more precisely characterising backtrack points.

4.3.1 Backtrack points

Recall, from Section 3, the way in which an SLP (and initial goal) define a procedure which samples from the prior $\pi_0$. This procedure can be seen as the realisation of a Markov chain whose states are augmented derivation trees.

For any consistent proof tree $x \in \mathcal{X}$ and for any set of probabilistic atoms $A$ in $x$, let $bp(x, A)$ be the augmented derivation tree produced by deleting the trees under each $a \in A$ and reinitialising the choice point for each $a$ to include all possible elementary trees. $bp(x, A)$ is called a backtrack point for $x$. For example, Fig 9 is a backtrack point $bp(x, A)$ where $x$ is the consistent proof tree in Fig 10 and $A = \{$e([c],a,[b<-a,c<-b],BN), nc(BN)$\}$. The derivation tree given in Fig 17 is a backtrack point for the tree in Fig 10 where $A = \{$e([c],b,[],TBN1)$\}$. An important difference between the backtrack points in Fig 9 and Fig 17 is that the former must have been built at some stage in the construction of the consistent proof tree in Fig 10, whereas the tree in Fig 17 could never be built as some intermediary stage in the production of any consistent proof tree. This is because the SLP sampling mechanism, which defines our prior, always builds trees depth-first, left-first. In Fig 17 the tree under e([c],b,[],TBN1) is not constructed even though the tree to its right has been constructed.

It is clear that backtrack points for any consistent proof tree are partitioned between those like Fig 9, which we shall call chronological backtrack points, and those like Fig 17, which we shall call non-chronological backtrack points. Chronological backtrack points are so called since they can be reached from a consistent proof tree by 'rewinding history' to reach a stage which must have occurred at some point in the construction of that consistent proof tree.

[Derivation tree diagram not reproduced.]

Figure 17: Non-chronological backtrack point


4.3.2 Chronological backtracking proposals

Any given consistent proof tree $x$ has a finite number of chronological backtrack points which can be ordered chronologically. Let the number of chronological backtrack points for a consistent proof tree $x$ be called the depth of $x$, denoted $d(x)$. (Note that this does not agree with the usual definition of the 'depth' of a tree.) Let $x_1, x_2, \ldots, x_{d(x)}$ be the chronological ordering of chronological backtrack points for $x$.2 For any $k > d(x)$, define $x_k$ to be $x$. It is now easy to define our first type of proposal.

Definition 4 For any positive integer $k$, the depth-$k$ chronological backtracking proposal $q_k$ is defined as follows. Let the current consistent proof tree be $x \in \mathcal{X}$. To propose a new consistent proof tree, move to the backtrack point $x_k$ and use the SLP sampling mechanism defining the prior to extend $x_k$ to a new consistent proof tree $y$. There will always be at least one such consistent proof tree, namely $x$. (Note that if $k > d(x)$ then $x$ will be proposed.)

Proposition 5 Let $q_k$ denote the depth-$k$ chronological backtracking proposal. Then $\forall k, \pi_0$:

$$R_{q_k}(x, y) = \frac{q_k(x|y)\,\pi_0(y)}{q_k(y|x)\,\pi_0(x)} = 1 \quad \text{and so} \quad \alpha_{q_k}(x, y) = \min\left\{\frac{L(y)}{L(x)},\; 1\right\}$$

Proof Introducing an abuse of notation, for any augmented derivation trees $x', x''$ (not necessarily consistent proof trees) let $\pi_0(x')$ denote the probability that at some point the SLP prior-sampling mechanism produces $x'$, and let $\pi_0(x''|x')$ denote the probability that $x'$ gets extended to $x''$ by the mechanism. In the construction of a consistent proof tree $x$, all of $x$'s chronological backtrack points will be passed through. It follows that $\forall k \in \{1, 2, \ldots, d(x)\}: \pi_0(x) = \pi_0(x_k)\,\pi_0(x|x_k)$.

By definition $q_k(y|x) = \pi_0(y|x_k)$. So if $x_k = y_k$, then $\pi_0(x)\,q_k(y|x) = \pi_0(x_k)\,\pi_0(x|x_k)\,\pi_0(y|x_k) = \pi_0(y_k)\,\pi_0(x|y_k)\,\pi_0(y|y_k) = \pi_0(y_k)\,\pi_0(y|y_k)\,\pi_0(x|y_k) = \pi_0(y)\,q_k(x|y)$, and the result follows. If $x_k \neq y_k$ then $q_k(y|x) = q_k(x|y) = 0$, and a value for $\alpha_{q_k}(x, y)$ is never needed.

Fixed-depth backtracking is too rigid; much better is a proposal which sometimes makes small jumps (big $k$) and sometimes big ones (small $k$). This leads to our next type of proposal: chronological uniform choice $q_{uc}$. Instead of using a fixed depth $k$ to backtrack to, $k$ is chosen uniformly from $\{1, 2, \ldots, d(x)\}$. The acceptance probability is only slightly harder to compute:

Proposition 6 Let $q_{uc}$ denote the chronological uniform choice backtracking proposal. Then $\forall \pi_0$:

$$R_{q_{uc}}(x, y) = \frac{q_{uc}(x|y)\,\pi_0(y)}{q_{uc}(y|x)\,\pi_0(x)} = d(x)/d(y) \quad \text{and so} \quad \alpha_{q_{uc}}(x, y) = \min\left\{\frac{d(x)}{d(y)}\,\frac{L(y)}{L(x)},\; 1\right\}$$

2. In previous work (Angelopoulos and Cussens, 2005a) we numbered backtrack points from 0. Numbering from 1 makes some of the presentation clearer.


Proof Given $x, y \in \mathcal{X}$, let $d(x, y)$ be the largest $k$ such that $x_k = y_k$. Note that there is always such a $k$ since $x_1 = y_1$, $\forall x, y \in \mathcal{X}$. It is clear that if $k' > d(x, y)$ then $\pi_0(y|x_{k'}) = \pi_0(x|y_{k'}) = 0$. From this and the definition of $q_{uc}$ we have the following equation:

$$\pi_0(x)\,q_{uc}(y|x) = \pi_0(x) \sum_{k=1}^{d(x)} \frac{1}{d(x)}\,\pi_0(y|x_k) = \pi_0(x)\,\frac{1}{d(x)} \sum_{k=1}^{d(x)} \pi_0(y|x_k) = \pi_0(x)\,\frac{1}{d(x)} \sum_{k=1}^{d(x,y)} \pi_0(y|x_k)$$

So $\pi_0(x)\,q_{uc}(y|x)\,d(x) = \pi_0(x) \sum_{k=1}^{d(x,y)} \pi_0(y|x_k)$. Similarly, $\pi_0(y)\,q_{uc}(x|y)\,d(y) = \pi_0(y) \sum_{k=1}^{d(x,y)} \pi_0(x|y_k)$. However, $\forall k: 1 \le k \le d(x, y): \pi_0(x)\,\pi_0(y|x_k) = \pi_0(y)\,\pi_0(x|y_k)$, from the proof of Proposition 5. So $\pi_0(x) \sum_{k=1}^{d(x,y)} \pi_0(y|x_k) = \pi_0(y) \sum_{k=1}^{d(x,y)} \pi_0(x|y_k)$ and thus $\pi_0(x)\,q_{uc}(y|x)\,d(x) = \pi_0(y)\,q_{uc}(x|y)\,d(y)$. The result follows.

4.3.3 Non-chronological backtracking proposals

In practice a big problem with chronological backtracking proposals is that to change the tree under some probabilistic atom, it is necessary to destroy everything built later to its right. Non-chronological backtracking avoids this problem. To define non-chronological backtracking, first note that every chronological backtrack point $x_k$ has an associated selected probabilistic atom $a_k$. Each such $x_k$ has an associated non-chronological backtrack point $x^k$ which is just $x$ with the sub-tree under $a_k$ deleted. Both $x_k$ and $x^k$ have the tree under $a_k$ deleted; the difference is that $x_k$ also has everything 'to the right' deleted.

Fixed-depth ($q^k$) and uniform choice ($q^{uc}$) non-chronological backtracking proposals are like their chronological counterparts except that backtracking jumps to $x^k$ rather than $x_k$. (Note that chronological proposals are denoted using subscripts and non-chronological via superscripts.) A tree is then built under $a_k$, the selected atom of $x^k$, using the standard sampling mechanism. Note that there is at least one consistent proof tree sampleable in this fashion: namely $x$. In general, $x^k$ would never be reached when sampling normally from the prior. Nonetheless we can use the prior-sampling mechanism to sample the tree under $a_k$, exactly as if it were any other selected atom.

In the general case, we do not (as yet) have expressions for acceptance probabilities for non-chronological backtracking. We do, however, have easily computable acceptance probabilities in an important special case. To define this special case, consider first the following sampling mechanism defining a different prior $\pi_0^k$ over consistent proof trees. Until the $k$th choice point is created, proceed as for $\pi_0$. If and when the $k$th choice point is created then, instead of building a tree under the atom $a_k$, that atom is frozen (to use the language of logic programming): the next available atom is instead selected, using the normal selection


ordering for atoms. Indeed atom $a_k$ is only selected when no other atom is available for selection: in short, the tree under $a_k$ is built last. We now define $\pi_0^k$ formally.

Definition 7 This equation defines $\pi_0^k$: $\forall x \in \mathcal{X}: \pi_0^k(x) = \pi_0(x_k)\,\pi_0(x^k|x_k)\,\pi_0(x|x^k)$, where $x_k$ is the $k$th chronological backtrack point, $x^k$ is the $k$th non-chronological backtrack point and $\pi_0(x^k|x_k)$ denotes the probability of extending $x_k$ to $x^k$ by freezing the atom $a_k$ and probabilistically extending $x_k$ 'to the right' of $a_k$.

In general $\pi_0 \neq \pi_0^k$. If we build the tree under $a_k$ first (as normal) then this may produce variable bindings on atoms to be selected later, thus constraining which subtrees can be built under them. These variable bindings will not be there if the tree under $a_k$ is delayed until the end of the process. When $\pi_0$ does equal $\pi_0^k$ we have the following useful result.

Proposition 8 If $\pi_0 = \pi_0^k$ then $\forall x, y \in \mathcal{X}: \pi_0(x)\,q^k(y|x) = \pi_0(y)\,q^k(x|y)$.

Proof If $x^k \neq y^k$ then $q^k(y|x) = q^k(x|y) = 0$ and the result follows, so we consider the case where $x^k = y^k$. In this case we clearly also have $x_k = y_k$.

$$\pi_0(x)\,q^k(y|x) = \pi_0(x_k)\,\pi_0(x^k|x_k)\,\pi_0(x|x^k)\,\pi_0(y|x^k) \qquad \text{by definitions}$$
$$= \pi_0(y_k)\,\pi_0(y^k|y_k)\,\pi_0(y|y^k)\,\pi_0(x|y^k) \qquad \text{from above}$$
$$= \pi_0(y)\,q^k(x|y) \qquad \text{by definitions}$$

From Proposition 8 it follows immediately that if $\pi_0 = \pi_0^k$ then $R_{q^k}(x, y) = 1$. If $\pi_0 = \pi_0^k$ for all $k$ then (by a small alteration to the proof of Proposition 6) we also have that $R_{q^{uc}}(x, y) = d(x)/d(y)$. In some of the experiments where we have used non-chronological backtracking we have ensured that $\pi_0 = \pi_0^k$ for all values of $k$ which can ever be selected, and so we can use these simple acceptance probabilities. This is the case when the BN structure prior defined by the SLP is modular (see Section 2.2.1).

4.3.4 Backtracking proposals and single-component MCMC

Restricting to modular priors runs counter to our goal of allowing the user to define priors which encapsulate whatever prior knowledge there is. So we have also done experiments using non-chronological backtracking with non-modular priors (so that $\pi_0 \neq \pi_0^k$). This amounts to pseudo-prior-Gibbs sampling for the following reasons.

It is not difficult to see that all the backtracking proposals implement variants of the prior-Gibbs sampler. Each proof sub-tree that is pruned and resampled using the prior can be viewed as a 'component' of the entire proof tree. These components form a hierarchy rather than a flat sequence as in normal single-component sampling. In the case of chronological backtracking a block of components is resampled. If the backtracking is fixed-depth then the same component (or block of components) is always selected; if uniform-choice, the choice of component is randomised.


When using pseudo-prior-Gibbs sampling, in step $i$ we propose a new set of parents for node $i$ by pruning the proof sub-tree which determines its current parents and resampling using the SLP prior mechanism, as previously described for non-chronological backtracking. With the exception of the last node, proposing like this only approximates proposing via the conditional prior. However, since the underlying logic program is the same, at least we have that zero prior probability models are never proposed (and hence have posterior probability zero). This is a partial explanation for the good results this approach has yielded. We denote our pseudo-prior-Gibbs sampler by $q_{sc}$.
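A schematic sweep of $q_{sc}$ in Python, with sample_parents_from_prior and log_lik_node as hypothetical stand-ins for the SLP pruning-and-regrowing machinery and the per-node term of the decomposable likelihood:

    import math
    import random

    # One pseudo-prior-Gibbs sweep: for each node in turn, propose a new
    # parent set from the prior conditional on the rest of the network,
    # and accept with the per-node likelihood ratio of (9).
    def pseudo_prior_gibbs_sweep(parents, nodes,
                                 sample_parents_from_prior, log_lik_node):
        for i in nodes:
            proposed = sample_parents_from_prior(i, parents)
            log_alpha = (log_lik_node(i, proposed)
                         - log_lik_node(i, parents[i]))
            if math.log(random.random()) < min(log_alpha, 0.0):
                parents[i] = proposed      # accept node i's new family
        return parents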

Note that to effect this approach we need to restrict the available backtrack points to only those that correspond to choosing a whole new parent set for a node. In our current implementation this restriction is achieved via a directive like:

:- s_no_bpoint( choose_pa/3 ).

which states that backtracking should ignore all backtrack points where the selected atom has choose_pa/3 as its predicate symbol.

4.3.5 Implementation

We will give just a brief overview of how MCMC with SLPs is actually implemented; further details are available from (Angelopoulos and Cussens, 2005b). The prior $\pi_0$ is defined in an SLP source file. To sample from the prior $\pi_0$ defined by this SLP, a call can be made to bn/4 with the first three arguments instantiated. When the call returns (i.e. when a consistent proof tree has been constructed), the fourth argument will be instantiated to a first-order term representing the sampled BN.

Source SLPs are transformed to normal Prolog programs and loaded into memory. Each clause is augmented with extra arguments carrying the probabilistic labels and a path of (probabilistic) backtracking points. The latter is used both for choosing a proposal backtrack point and for speeding up the process of finding the proposed model. Prolog's own backtracking, which implements a left-to-right search, is used when sampling. At each iteration, after the current model has been reached, we need to backtrack to the chosen (proposal) backtrack point. Prolog's search strategy cannot be used effectively for backtracking to the proposal point; instead this is achieved by rerunning the original top-level query with its path argument partially instantiated.

A proposal strategy has two components. Firstly, it has a path of backtrack points. Secondly, it has a function for choosing one of the points in the path; the latter is controlled by an experimental parameter. In all proposals described here, except pseudo-prior-Gibbs sampling, a point is picked uniformly from those in the path. Subpaths can also be created, which allow non-chronological backtracking. If a point in a subpath is chosen, all points not in the subpath are assumed independent of those in the subpath and are unaffected by the transition from the current model to the proposed one. In effect, this allows changes to a part of the model which is independent of other subparts.

5. Experimental setup

Since the goal of our experiments is to test our MCMC system rather than to perform Bayesian inference for some particular application, all our experiments are done using synthetic data.


BN           Nodes   Arcs   Ref
ASIA             8      8   Lauritzen and Spiegelhalter (1988)
ALARM           37     46   Beinlich et al. (1989)
HAILFINDER      56     66   Abramson et al. (1996)
INSURANCE       27     52   Binder et al. (1997)
PARITY1         22     31   Koivisto and Sood (2004)
PARITY2        100     53   Koivisto and Sood (2004)

Table 1: True BNs

In each experiment a known fully parameterised Bayesian network (TrueBN) is chosen and a specified number (Size) of complete joint instantiations are generated from TrueBN by forward sampling. This constitutes the data for an experiment. Next an SLP prior (SLPPrior), a prior precision for the Dirichlet parameters (N) and a proposal (Proposal) are selected. Using the selected data, prior and proposal, the Metropolis-Hastings algorithm is run for 250,000 iterations, recording all visited BNs and their marginal log-likelihoods. Each such run is done three times with three different random seeds. Size was either 500 or 1000. N was either 1 or 10. The following three sections describe each of the other experimental variables (TrueBN, SLPPrior and Proposal) in more detail.

5.1 True Bayesian networks

Six different BNs were used to generate synthetic data: ASIA, ALARM, HAILFINDER, INSURANCE, PARITY1 and PARITY2. Their sizes are given in Table 1, which also cites the original references which provide further information on them. The six values for TrueBN were chosen to give a range of structure size and topology, and because each had been used at least once in previous (non-Bayesian) BN structure learning work.

5.2 Prior distributions

In total 6 prior distributions were used, with differing levels of information about the true BN. For all of our 6 priors, any BN with a topological ordering conflicting with that of the true BN has zero prior probability. The priors are listed below, where we have indicated whether each is modular (see Section 2.2.1).

Prior A In this prior a consistent topological ordering is enforced: a node A can only be a parent of a node B if A comes before B in a particular (arbitrary) topological ordering of the nodes in the true BN. If A is a possible parent for B, then $P(A \to B \in BN) = 1/2$, for all A, B. (An illustrative sampling sketch for this prior appears after this list.) MODULAR

Prior B The topological ordering constraint is imposed as for Prior A. For any node B, let $|Pa(B)|$ be the number of parents B has in the true BN. If A is a possible parent for B, then $P(A \to B \in BN) \propto |Pa(B)|$, for all A, B. MODULAR

Prior C This is as Prior B, except that the prior is told which nodes are orphans, i.e. any BN with the wrong set of orphans has zero probability. MODULAR


Combo        1       2       3       4       5       6       7       8      9
SLPPrior     A       B       C       C       C       D       C       E      F
Proposal  quc(1)  quc(1)  quc(1)  quc(2)  quc(3)  quc(4)  quc(5)  quc(2)  qsc

Table 2: Prior + proposal combinations used in our experiments

Prior D This is as Prior C except that the prior knows all pairs of variables which are truly independent, i.e. any BN which does not respect these independence relations will have zero probability. NOT MODULAR

Prior E A variant of Prior B. For each node the number of parents is always the true number. The prior chooses uniformly from each possible parent set. MODULAR

Prior F Like Prior E, except that the prior, like Prior D, also knows true independence relations. NOT MODULAR
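The sampling sketch promised under Prior A: an illustrative Python rendering under an assumed topological ordering (our actual priors are SLPs, not Python):

    import random

    # Prior A: under a fixed topological ordering, each edge consistent
    # with the ordering is present independently with probability 1/2.
    # This edge-wise independence is what makes the prior modular.
    def sample_prior_a(ordering, p_edge=0.5):
        parents = {v: [] for v in ordering}
        for j, child in enumerate(ordering):
            for parent in ordering[:j]:
                if random.random() < p_edge:
                    parents[child].append(parent)
        return parents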

5.3 Proposals

Proposals differ only in terms of backtracking.

Proposal quc(1) A backtrack point is chosen uniformly from all possible chronological backtrack points.

Proposal quc(2) A backtrack point is chosen uniformly from a subset of non-chronological backtrack points. There is one backtrack point for each node in the BN. For each node, backtracking to its associated backtrack point has the effect of removing all arrows pointing to that node, so that it has no parents. New parents will subsequently be chosen (according to whichever prior is being used) for that node.

Proposal qsc Similar to quc(2) except that a single-component MH approach is taken. See Section 4.3.4 for more on this proposal. Note that on each iteration a new parent set is proposed for each node, so that each iteration is more computationally intensive than for other proposals.

Proposal quc(3) Like Proposal quc(2) except that no backtrack points for true orphan nodes are considered.

Proposal quc(4) Like Proposal quc(2), except that only chronological backtrack points are considered.

Proposal quc(5) Chronological version of quc(3).

5.4 Experiments actually run

Given that we have 6 values for TrueBN, 2 values for Size, 6 values for SLPPrior, 2 values for N, 4 values for Proposal, and that 3 different random seeds are used for each MCMC run, there were $6 \times 2 \times 6 \times 2 \times 4 \times 3 = 1728$ runs of 250,000 iterations that we could have run. We judged that our approach could be evaluated with significantly fewer runs. Firstly, not


[Plot panels not reproduced; both show log-likelihood (y-axis) against iterations (x-axis), 0 to 250,000.]

Figure 18: Log-likelihood trajectories for 3 different runs. LHS+RHS: Size = 500, N = 10, SLPPrior = D, Proposal = quc(4) (Combo 6). LHS: TrueBN = ASIA. RHS: TrueBN = INSURANCE

all 24 prior+proposal combinations were considered. Instead only those 9 combinations shown in Table 2 were tried. From now on Combo will be used to refer to a prior+proposal combination.

Secondly, as discussed in Section 6, most choices for Combo led to demonstrably poor MCMC convergence. Since all experiments for N = 10 were done prior to those for N = 1, we chose to restrict N = 1 runs to only those 4 values of Combo (4, 5, 8 and 9) which had shown reasonable convergence in the N = 10 experiments. So for N = 10, runs were done for each combination of TrueBN, Combo, Size and seed, leading to $6 \times 9 \times 2 \times 3 = 324$ runs, whereas for N = 1 only $6 \times 4 \times 2 \times 3 = 144$ runs were done.

6. Results

6.1 Negative convergence results with chronological proposals

As previously mentioned, some Combos showed clear evidence of poor convergence. Poor convergence was observed in all cases where chronological backtracking was used, even when the prior was highly biased towards the true BN. Consider, for example, Combo 6, which has SLPPrior = D and Proposal = quc(4). SLPPrior D is highly biased towards the true BN since it rules out any BN which does not respect all the conditional independence relations in the true BN. Nonetheless the poor movement through the state space which a chronological proposal affords leads to poor results. Fig 18 shows log-likelihood trajectories produced from MCMC runs applying Combo 6 to 500 datapoints generated from ASIA and INSURANCE, respectively. In both cases N, the prior precision, is set to 10. Trajectories from the 3 independent runs are shown, as well as a horizontal line giving the log-likelihood of the true model. This is, of course, the marginal log-likelihood, $\log L(x)$ where $L(x)$ is given by (5). All log-likelihood trajectories will be of this format in this paper.

In the ASIA case the three different runs do at least give similar results, so that in the LHS figure it is difficult to distinguish the different trajectories. However, in each case no model is visited with a log-likelihood even close to that of the true one. The ASIA BN is a very simple learning task; a more challenging task is given, for example, by INSURANCE. The INSURANCE trajectory shows quite clearly that the 3 MCMC runs using Combo 6


[Plot panels not reproduced; both show log-likelihood (y-axis) against iterations (x-axis), 0 to 250,000.]

Figure 19: Log-likelihood trajectories for 3 different runs. Both: TrueBN = INSURANCE, Size = 500, N = 10, SLPPrior = C. LHS: Proposal = quc(1) (Combo 3). RHS: Proposal = quc(2) (Combo 4)

[Plot panels not reproduced; both show log-likelihood (y-axis) against iterations (x-axis), 0 to 250,000.]

Figure 20: Log-likelihood trajectories for 3 different runs. Both: TrueBN = ALARM, Size = 1000, SLPPrior = C, Proposal = quc(3) (Combo 5). LHS: N = 1. RHS: N = 10

each get stuck in different parts of the model space. The trajectories in Fig 18 are quite typical of the 'sticking' problems which occur when using a chronological proposal.

Combos 3 and 4 have the same prior (C), with the former having a chronological proposal and the latter a non-chronological one, so comparing them brings out the advantage of non-chronological proposals particularly clearly. Fig 19 shows the trajectories for these 2 Combos on runs identical in all other respects. The non-chronological Combo 4 comes closer to converging and moreover explores models close (in log-likelihood) to the true BN.

6.2 Convergence for non-chronological proposals

From now on we only consider experiments done using non-chronological proposals. Convergence, as far as can be judged by log-likelihood trajectories, is generally good for these proposals, in contrast to chronological ones. In all cases runs for Size = 500 and Size = 1000 tell a similar story, so we will restrict attention to the latter, since it is the harder learning scenario. It is harder because with a greater amount of data the posterior will be further from the prior, making it harder for our Metropolis-Hastings algorithm (driven by proposals guided by the prior) to converge.


[Four plot panels not reproduced; all show log-likelihood (y-axis) against iterations (x-axis), 0 to 250,000.]

Figure 21: Log-likelihood trajectories for 3 different runs. All: TrueBN = HAILFINDER, Size = 1000, N = 1, SLPPrior = C. Top-left: Proposal = quc(2) (Combo 4). Top-right: Proposal = quc(3) (Combo 5). Bottom-left: Proposal = quc(2) (Combo 8). Bottom-right: Proposal = qsc (Combo 9). [Note that the top-left y-scale differs from the others.]

In most cases, trajectories for N = 1 and N = 10 were similar. The exception to this was the set of results for TrueBN = ALARM. For ALARM, runs for N = 10 consistently found models whose log-likelihoods were above that of the true model, but this was not the case for N = 1. Fig 20 shows a typical example of this contrast using Combo 5.

From now on we will restrict attention to the case where the prior precision N was set to 1. The basic story is that Combos 5, 8 and 9 consistently gave better results than Combo 4, and that PARITY2 was significantly more difficult than the other BNs. Fig 21 shows results for HAILFINDER, which are similar to those obtained for all other BNs apart from PARITY2. For Combos 5, 8 and 9, each of the 3 realisations of the Markov chain finds its way to a region of the model space with high-scoring models and stays there. Note that Combo 9 converges especially rapidly. For HAILFINDER, INSURANCE and PARITY1 this region contains models with log-likelihood scores significantly higher than that of the true model. For ASIA and ALARM the scores are the same.

These phenomena can be explained as follows. Recall that each likelihood we consider is a marginal likelihood, which is a 'weighted average' of fitted likelihoods for the parameterised model over each possible choice of parameters. (The 'weights' are given by the choice of Dirichlet distribution over possible parameters.) Modulo sampling variation, the fitted likelihood of TrueBN will be high if parameterised with the 'true' parameters (the ones used


[Four plot panels not reproduced; all show log-likelihood (y-axis) against iterations (x-axis), 0 to 250,000.]

Figure 22: Log-likelihood trajectories for 3 different runs. All: TrueBN = PARITY2, Size = 1000, N = 1, SLPPrior = C. Top-left: Proposal = quc(2) (Combo 4). Top-right: Proposal = quc(3) (Combo 5). Bottom-left: Proposal = quc(2) (Combo 8). Bottom-right: Proposal = qsc (Combo 9). [Note different y-axis scales.]

to generate the data), but there may be large areas of the parameter space defining fitted models with low fitted likelihood, leading to a low marginal likelihood overall.

We can get a rough idea of the extent of this problem by looking at the number and proportion of true parameters of value zero for various models. We have the following (with some rounding): HAILFINDER = 501/3760 = 13.3%, INSURANCE = 302/1419 = 21.3%, ASIA = 4/36 = 11.1%, ALARM = 5/747 = 1.0%, PARITY1 = 0/194 = 0.0%, PARITY2 = 0/496 = 0.0%. Roughly speaking, a high number of zero parameters puts the true parameters towards the edge of the parameter space, and so most other parameterisations of the true structure, being distant from these true parameters, lead to low fitted likelihoods and thus a low marginal likelihood overall. Clearly, this also obtains if we have many near-zero true parameters, which happens to be the case for PARITY1 and to a lesser extent for PARITY2. We conjecture that the small number of zeroes for the true parameters of ASIA and ALARM explains why these true BNs had high marginal log-likelihood.

PARITY2 is atypical in that our chains show clear evidence of non-convergence. Note, though, that for Combos 8 and 9 the MCMC runs nearly always end up visiting models with higher log-likelihoods than the true PARITY2 model. These phenomena are demonstrated by Fig 22.


[Plot not reproduced; y-axis: runtime (sec), x-axis: Combo; series: 500_alarm.dat and 1000_alarm.dat.]

Figure 23: Runtimes for TrueBN = ALARM and Sizes of 500 and 1000. Combos in order 1-9

6.3 Running times for MCMC

The experiments were run on SICStus Prolog (version 3.12). The computers used run the Linux operating system and have modest workstation capabilities (3-4 GHz single processor, 0.5 or 1 GB memory). We recorded runtimes that include total CPU time and garbage collection. Figures 23 and 24 illustrate some of the execution times. In all cases the y-axis plots runtime in seconds, the x-axis plots Combos, each value is the average of three separate runs, and all times are from experiments where N = 10. Experiments with N = 1 are very similar, as the only difference is the value of a constant with no significant effect on performance.

Runtimes vary greatly depending on four main factors: the average size of (BN) families, the number of known orphans, the size of the data, and the time taken at each iteration. Fig 23 illustrates the effect of Size on execution time. The x-axis shows Combos in the same order as that of Table 2. In Combos 4, 5 and 8 the y-values for Size = 1000 are only marginally larger than the corresponding values for Size = 500. In all three cases the proposal is non-chronological, and at each iteration a single family is reconsidered, usually a small one. The computation of the likelihood, which depends on the size of the data, is in these cases very fast. From the remaining Combos we can deduce that the likelihood computation is a major contributor to the overall runtimes, as the difference in y-values between 500 and 1000 is almost half the variation from the base cases of Combos 4, 5 and 8.

The two plots in Fig 24 separate the TrueBNs into two groups, which accommodates the variability of the displayed runtimes. HAILFINDER and PARITY2 register longer execution times as they have more variables than the rest of the TrueBNs. The two plots are drawn with different y-axis ranges. Note that Combo 9, the leftmost on the x-axis, although non-chronological, by design spends much longer at each iteration than the other non-chronological proposals, as each and every family in the network is reconsidered in turn.


[Plot panels not reproduced; y-axes: runtime (sec), x-axes: Combo.]

Figure 24: LHS: runtimes versus Combo for TrueBNs ALARM, ASIA, INSURANCE and PARITY1. RHS: runtimes versus Combo for HAILFINDER and PARITY2. Combos' order: 9,1,6,7,3,2,4,5,8. Note that the y-axis ranges are not drawn to the same scale

6.4 Concentration of the posterior around the true model

Log-likelihood trajectories constitute a useful tool providing evidence for convergence/non-convergence. However, we are interested not only in how well our MCMC runs allow us to approximate the posterior, but also in the nature of the posterior itself. Specifically, to what extent is the posterior distribution concentrated around the true BN?

To address this question it is necessary to introduce a measure of 'distance' between any given BN and the true BN. We do this in the standard way by considering various loss functions. A loss function measures the loss one would suffer by using a given BN instead of the true one (so that the loss of the true BN is always zero). There is no such thing as a 'correct' universal loss function, since the loss suffered depends on what a user is doing with the learned BNs, and this will vary between applications. We consider three simple loss functions:

Naive 0-1 loss (ID) is a crude loss function where it is necessary to identify the correct BN structure exactly to avoid a maximal loss of 1.

$$L_{ID}(BN, TrueBN) = \begin{cases} 0 & \text{if } BN = TrueBN \\ 1 & \text{otherwise} \end{cases} \qquad (10)$$

Naive edge symmetric difference loss (NE) penalises a BN for each extra or missing edge it has with respect to TrueBN. Let $E_{BN}$ be the set of edges in BN and let $E_{TrueBN}$ be the set of edges in TrueBN. To allow easier comparison we normalise: let $n$ be the number of vertices in the true BN; then we have

$$L_{NE}(BN, TrueBN) = \frac{|E_{BN} \,\triangle\, E_{TrueBN}|}{n^2} \qquad (11)$$

Given the normalisation used, $L_{NE}(BN, TrueBN)$ is the probability that a randomly chosen pair of vertices is incorrectly (dis)connected (a code sketch follows this list). This was essentially the structural loss function used in Heckerman et al. (1995a): "the structural difference we use is $\sum_{i=1}^n \delta_i$ where $\delta_i$ is the symmetric difference of the parents of $x_i$ in the gold-standard network and the parents of $x_i$ in the learned network."


Essential graph edge symmetric difference (CK) is the same as naive edge symmetric difference except that the essential graph of BN is compared to the essential graph of TrueBN. An undirected edge $A - B$ in an essential graph is considered to be a pair of directed edges $A \to B, A \leftarrow B$. This is the metric used by Castelo and Kocka (2003).
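The code sketch promised under the NE loss: a direct transcription of (11), assuming edges are represented as (parent, child) pairs.

    # NE loss of (11): normalised symmetric difference of the edge sets.
    def ne_loss(edges_bn, edges_true, n_vertices):
        sym_diff = set(edges_bn) ^ set(edges_true)
        return len(sym_diff) / float(n_vertices ** 2)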

Define the posterior expected loss $E_{\pi, TrueBN}(L)$ for a given posterior $\pi$, loss function $L$, and true model TrueBN:

$$E_{\pi, TrueBN}(L) = \sum_{BN} L(BN, TrueBN)\,\pi(BN) \qquad (12)$$

$E_{\pi, TrueBN}(L)$ essentially measures the cost of picking a BN randomly chosen from $\pi$ and using it instead of TrueBN. The more the posterior is concentrated around BNs 'near' to TrueBN, the smaller $E_{\pi, TrueBN}(L)$ will be.

Posterior expected loss is a measure of how concentrated the posterior is around the true model. We only have the approximation to the posterior supplied by MCMC, but since, with the exception of PARITY2, convergence appears reasonable, we can use the MCMC sample to approximate (12). Table 3 shows the results of doing this for all combinations of TrueBN, non-chronological Proposal and loss function. Results using the unsuccessful Combo 1 are also included as a baseline. In each case posterior loss estimates from 3 different MCMC runs are given.
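Approximating (12) from a run then amounts to averaging losses along the chain; a minimal sketch, assuming the run is stored as the list of models visited at each iteration (repeats included, so each model is implicitly weighted by its visit count):

    # Monte Carlo estimate of the posterior expected loss (12).
    def estimated_expected_loss(visited_bns, loss, true_bn):
        return sum(loss(bn, true_bn) for bn in visited_bns) / len(visited_bns)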

Table 3 shows the following:

• NE and CK results unsurprisingly tell a similar story. This is essentially because at most one representative from each Markov equivalence class is in our model space. We restrict attention to ID and NE loss from now on.

• The baseline Combo 1 NE results are often close to 1/4, which is the expected NE loss for a uniform distribution over the set of BNs which respect the variable ordering (a short derivation is sketched after this list).

• Results are generally stable across MCMC runs.

• The Loss=ID results show that the true BN is never visited except when TrueBN is ASIA and an informative prior is used. Note that, since we impose the true variable ordering as a constraint, at most one representative from each Markov equivalence class of BNs is in the model space. It follows that, with the exception just noted, no BN in the true Markov equivalence class is visited.

• Recall that $L_{NE}(BN, TrueBN)$ measures the probability that a pair of vertices in BN is incorrectly (dis)connected. So, notwithstanding the previous point, the many very low values for NE loss show that these posteriors are concentrated on BNs which are structurally similar to the true BN, even for PARITY2.
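
To see where the 1/4 baseline mentioned above comes from, assume (a back-of-the-envelope simplification; it ignores any further constraints the prior may impose) that under the uniform distribution each of the $n(n-1)/2$ order-respecting edges is present independently with probability 1/2, and hence lands in the symmetric difference with probability 1/2:

$$\mathbb{E}[L_{NE}] = \frac{\mathbb{E}\,|E_{BN} \,\triangle\, E_{TrueBN}|}{n^2} = \frac{\frac{1}{2}\cdot\frac{n(n-1)}{2}}{n^2} = \frac{n-1}{4n} \approx \frac{1}{4}$$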

6.5 Single model learning

To provide a comparison with standard Bayesian net learning algorithms we extracted from each MCMC run two single models: $BN_{MAP}$, the most visited model, and $BN_{MLE}$, the model with the highest log-likelihood.


ASIA
  Combo   ID (3 runs)       NE (3 runs)               CK (3 runs)
  1       1.00 1.00 1.00    0.1215 0.1252 0.1286      0.1474 0.1500 0.1539
  4       1.00 1.00 1.00    0.0934 0.1221 0.0933      0.1396 0.1552 0.1399
  5       0.99 0.99 0.99    0.0433 0.0423 0.0432      0.0654 0.0642 0.0649
  8       0.16 0.16 0.16    0.0052 0.0051 0.0050      0.0053 0.0051 0.0051
  9       0.16 0.16 0.16    0.0051 0.0052 0.0051      0.0051 0.0052 0.0052

ALARM
  Combo   ID (3 runs)       NE (3 runs)               CK (3 runs)
  1       1.00 1.00 1.00    0.1953 0.1982 0.1981      0.1971 0.2007 0.2012
  4       1.00 1.00 1.00    0.0216 0.0200 0.0226      0.0245 0.0229 0.0255
  5       1.00 1.00 1.00    0.0086 0.0092 0.0087      0.0108 0.0115 0.0110
  8       0.99 1.00 1.00    0.0030 0.0032 0.0032      0.0031 0.0032 0.0032
  9       1.00 1.00 1.00    0.0028 0.0029 0.0029      0.0029 0.0029 0.0029

HAILFINDER
  Combo   ID (3 runs)       NE (3 runs)               CK (3 runs)
  1       1.00 1.00 1.00    0.2401 0.2384 0.2300      0.2433 0.2416 0.2326
  4       1.00 1.00 1.00    0.0271 0.0251 0.0235      0.0290 0.0273 0.0259
  5       1.00 1.00 1.00    0.0110 0.0115 0.0113      0.0143 0.0149 0.0145
  8       1.00 1.00 1.00    0.0123 0.0115 0.0115      0.0151 0.0147 0.0148
  9       1.00 1.00 1.00    0.0118 0.0117 0.0116      0.0151 0.0149 0.0148

INSURANCE
  Combo   ID (3 runs)       NE (3 runs)               CK (3 runs)
  1       1.00 1.00 1.00    0.1907 0.1820 0.1923      0.2004 0.1878 0.1997
  4       1.00 1.00 1.00    0.0555 0.0547 0.0547      0.0664 0.0663 0.0659
  5       1.00 1.00 1.00    0.0555 0.0547 0.0547      0.0664 0.0663 0.0659
  8       1.00 1.00 1.00    0.0307 0.0305 0.0299      0.0397 0.0394 0.0386
  9       1.00 1.00 1.00    0.0300 0.0300 0.0301      0.0389 0.0389 0.0391

PARITY1
  Combo   ID (3 runs)       NE (3 runs)               CK (3 runs)
  1       1.00 1.00 1.00    0.2220 0.2306 0.2295      0.2242 0.2372 0.2345
  4       1.00 1.00 1.00    0.1227 0.1206 0.1195      0.1248 0.1264 0.1257
  5       1.00 1.00 1.00    0.1136 0.1096 0.1123      0.1165 0.1125 0.1152
  8       1.00 1.00 1.00    0.0990 0.0990 0.0999      0.0995 0.0994 0.1004
  9       1.00 1.00 1.00    0.0988 0.0995 0.0990      0.0993 0.1000 0.0995

PARITY2
  Combo   ID (3 runs)       NE (3 runs)               CK (3 runs)
  1       1.00 1.00 1.00    0.2469 0.2475 0.2492      0.2469 0.2475 0.2492
  4       1.00 1.00 1.00    0.0075 0.0069 0.0074      0.0075 0.0069 0.0076
  5       1.00 1.00 1.00    0.0062 0.0061 0.0063      0.0062 0.0062 0.0064
  8       1.00 1.00 1.00    0.0062 0.0062 0.0071      0.0062 0.0062 0.0072
  9       1.00 1.00 1.00    0.0060 0.0059 0.0060      0.0060 0.0060 0.0060

Table 3: Estimated posterior expected loss results for all BNs


$BN_{MAP}$ is an estimate of the BN with the highest posterior probability under the prior encoded by the SLP, and $BN_{MLE}$ is an estimate of the highest posterior probability model under a uniform prior distribution over the models with positive prior probability according to the SLP prior. Table 4 shows the ID and NE loss for these models for all TrueBNs and Combos 1, 4, 5, 8 and 9.
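
A minimal sketch of this extraction step (ours; MCMCMS itself may record runs differently), assuming each MCMC iteration is logged as a hashable structure paired with its data log-likelihood:

    # Illustrative sketch: recover the single-model summaries from a run.
    from collections import Counter

    def map_and_mle(samples):
        # samples: list of (bn, log_likelihood) pairs, bn hashable
        # (e.g. a frozenset of edges).
        visits = Counter(bn for bn, _ in samples)
        bn_map, _ = visits.most_common(1)[0]           # most visited model
        bn_mle, _ = max(samples, key=lambda s: s[1])   # highest log-likelihood
        return bn_map, bn_mle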

Table 4 shows the following:

• Results are generally stable across different realisations of the same Markov chain.

• Excluding the baseline Combo 1: for ASIA, MLE and MAP losses were always identical; for ALARM, INSURANCE and PARITY1, MLE has lower or equal loss; for HAILFINDER and PARITY2, results were mixed. This suggests that the principal influence exerted by our priors is the hard constraint imposed by zero prior probabilities.

• As expected, both MAP and MLE losses are smaller than posterior expected loss: selecting an appropriate model using the (approximate) posterior yields a better model than randomly drawing one according to the posterior.

In two cases we can make a rough comparison with existing work. Castelo and Kocka (2003) (among many other experiments) apply their HCMC algorithm to a dataset of 1000 points sampled from the ALARM BN. They run this (randomised) algorithm with different parameter settings, the best CK score averaging at 0.0073. So, as Tables 3 and 4 show, the basic story is that our ALARM results (whether expected loss, MAP loss or MLE loss) are worse than theirs with Combos 1, 4 and 5 and better with Combos 8 and 9. Acid and de Campos (2003) also learn from data sampled from ALARM. Their smallest dataset has 3000 datapoints, from which their algorithm learned a BN with an NE loss of only 0.0014, better than all our scores, but this is using three times as much data. They also did experiments on datasets generated from INSURANCE and HAILFINDER. For INSURANCE with 10,000 datapoints the learned BN had an NE loss of 0.0247 (a little better than our scores produced using 1,000 datapoints). For HAILFINDER, again with 10,000 datapoints, an NE loss of only 0.00765 was achieved: considerably better than our HAILFINDER losses (using only 1,000 datapoints), which are in the 0.011–0.012 region.

7. Conclusions

The research presented here lies at the intersection of two research areas: Bayesian model averaging of BNs via MCMC (Madigan and York, 1995; Friedman and Koller, 2003) and the use of rich representations to represent the model space (Langseth and Nielsen, 2003; Segal et al., 2005). In our case the representation language (first-order logic) and the MCMC approach are tightly coupled, since the Metropolis-Hastings proposals operate by pruning and re-growing proof trees.

First-order logic has proved to be an effective way of incorporating domain knowledge into the prior. For example, to add pairwise independence constraints, the d-separation criterion was written in clausal logic (concretely, a Prolog program) and simply added as an extra constraint on what counts as a legal BN. It was straightforward to apply (variants of) many of the approaches to defining priors described in Section 2. The system employed, called MCMCMS, is not specific to BNs; other recent work has applied it to Bayesian learning of classification trees (Angelopoulos and Cussens, 2005a,c).


ASIA and ALARM
  Combo  Model   ID (ASIA)   NE (ASIA)                 ID (ALARM)   NE (ALARM)
  1      MAP     1 1 1       0.0625 0.0781 0.0937      1 1 1        0.1869 0.1950 0.1972
  1      MLE     1 1 1       0.0937 0.0937 0.0937      1 1 1        0.2132 0.1957 0.1957
  4      MAP     1 1 1       0.0937 0.1093 0.0937      1 1 1        0.0182 0.0175 0.0204
  4      MLE     1 1 1       0.0937 0.1093 0.0937      1 1 1        0.0189 0.0175 0.0197
  5      MAP     1 1 1       0.0312 0.0312 0.0312      1 1 1        0.0073 0.0080 0.0087
  5      MLE     1 1 1       0.0312 0.0312 0.0312      1 1 1        0.0065 0.0058 0.0065
  8      MAP     0 0 0       0.0 0.0 0.0               1 1 1        0.0029 0.0029 0.0029
  8      MLE     0 0 0       0.0 0.0 0.0               1 1 1        0.0029 0.0029 0.0029
  9      MAP     0 0 0       0.0 0.0 0.0               1 1 1        0.0029 0.0029 0.0029
  9      MLE     0 0 0       0.0 0.0 0.0               1 1 1        0.0029 0.0029 0.0029

HAILFINDER and INSURANCE
  Combo  Model   ID (HAIL)   NE (HAIL)                 ID (INS)     NE (INS)
  1      MAP     1 1 1       0.2397 0.2382 0.2292      1 1 1        0.1893 0.1810 0.1865
  1      MLE     1 1 1       0.2397 0.2382 0.2302      1 1 1        0.1906 0.1810 0.1906
  4      MAP     1 1 1       0.0277 0.0258 0.0239      1 1 1        0.0548 0.0562 0.0576
  4      MLE     1 1 1       0.0277 0.0232 0.0229      1 1 1        0.0562 0.0534 0.0521
  5      MAP     1 1 1       0.0102 0.0092 0.0095      1 1 1        0.0301 0.0329 0.0329
  5      MLE     1 1 1       0.0117 0.0092 0.0098      1 1 1        0.0301 0.0329 0.0329
  8      MAP     1 1 1       0.0140 0.0108 0.0108      1 1 1        0.0301 0.0301 0.0301
  8      MLE     1 1 1       0.0133 0.0121 0.0102      1 1 1        0.0274 0.0301 0.0274
  9      MAP     1 1 1       0.0127 0.0102 0.0108      1 1 1        0.0301 0.0301 0.0301
  9      MLE     1 1 1       0.0114 0.0121 0.0114      1 1 1        0.0301 0.0301 0.0301

PARITY1 and PARITY2
  Combo  Model   ID (PAR1)   NE (PAR1)                 ID (PAR2)    NE (PAR2)
  1      MAP     1 1 1       0.2066 0.2272 0.2272      1 1 1        0.2466 0.2470 0.2486
  1      MLE     1 1 1       0.2086 0.2272 0.2293      1 1 1        0.2465 0.2475 0.2488
  4      MAP     1 1 1       0.1198 0.1239 0.1198      1 1 1        0.0075 0.0071 0.0072
  4      MLE     1 1 1       0.1219 0.1198 0.1198      1 1 1        0.0074 0.0065 0.0076
  5      MAP     1 1 1       0.1115 0.1095 0.1115      1 1 1        0.0056 0.0058 0.0060
  5      MLE     1 1 1       0.1157 0.1095 0.1074      1 1 1        0.0063 0.0058 0.0058
  8      MAP     1 1 1       0.0950 0.0991 0.0991      1 1 1        0.0054 0.0060 0.0064
  8      MLE     1 1 1       0.1033 0.0991 0.0991      1 1 1        0.0054 0.0064 0.0078
  9      MAP     1 1 1       0.0950 0.1033 0.1033      1 1 1        0.0058 0.0058 0.0060
  9      MLE     1 1 1       0.0909 0.0991 0.0950      1 1 1        0.0058 0.0058 0.0060

Table 4: Loss figures for MAP and MLE models for all BNs


The first of these papers shows how to apply the distance-based approach to priors (see Section 2.2.2) to classification trees. In the work on classification trees, non-chronological priors have also been used to good effect.

A central goal of this research has been to investigate the effect of informative structural priors on MCMC for Bayesian nets. To this end even our least informative priors have made use of the true variable ordering. But, given a variable ordering, exact Bayesian approaches are possible, so the MCMC methods in this case are questionable. For example, Koivisto and Sood (2004) present an algorithm which finds the MAP BN given an ordering. Exact computation of posterior quantities is also exploited by Friedman and Koller (2003) as a 'subroutine' in MCMC over variable orderings. However, in both these cases it is required that the structural priors are modular. Here this is not required, although to use non-modular priors we do have to resort to pseudo-prior-Gibbs. Note that the restriction to modular structural priors has to be dropped to allow independence constraints to be used.

Our results indicate that our most complex proposal, pseudo-prior-Gibbs, is the most successful, despite being the theoretically flawed one. Future work will focus on this approach. Although our approach is intended for use when domain knowledge is there to be exploited, and a variable ordering is the most common sort of domain knowledge, it would be interesting to use priors which do not use such an ordering, perhaps compensating with (conditional) independence constraints. Even with this best proposal our MCMC runs failed on PARITY2. A possible solution is to employ tempering, which we used successfully for classification trees (Angelopoulos and Cussens, 2005c) and which has been successfully used by Laskey and Myers (2003) for Bayesian nets.

Acknowledgements

This work was partially supported by UK EPSRC MathFIT project Stochastic Logic Programs for MCMC (GR/S30993/01). We would like to thank Mikko Koivisto and Kismat Sood for supplying their datasets.

References

B. Abramson, J. Brown, A. Murphy, and R. L. Winkler. Hailfinder: A Bayesian system for forecasting severe weather. International Journal of Forecasting, 12:57–71, 1996.

Silvia Acid and Luis M. de Campos. Searching for Bayesian network structures in the space of restricted acyclic partially directed graphs. Journal of Artificial Intelligence Research, 18:445–490, 2003.

Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.

Nicos Angelopoulos and James Cussens. Markov chain Monte Carlo using tree-based priors on model structure. In Jack Breese and Daphne Koller, editors, Proceedings of the Seventeenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–2001), Seattle, August 2001. Morgan Kaufmann. URL ftp://ftp.cs.york.ac.uk/pub/aig/Papers/james.cussens/uai01.ps.gz.


Nicos Angelopoulos and James Cussens. Extended stochastic logic programs for informative priors over C&RTs. In Rui Camacho, Ross King, and Ashwin Srinivasan, editors, Proceedings of the work-in-progress track of the Fourteenth International Conference on Inductive Logic Programming (ILP04), pages 7–11, Porto, September 2004a. URL ftp://ftp.cs.york.ac.uk/pub/aig/Papers/james.cussens/ilp04_wip.pdf.

Nicos Angelopoulos and James Cussens. On the implementation of MCMC proposals over stochastic logic programs. In Colloquium on Implementation of Constraint and LOgic Programming Systems. Satellite workshop to ICLP'04, Saint-Malo, France, 2004b. URL ftp://ftp.cs.york.ac.uk/pub/aig/Papers/james.cussens/ciclops04.ps.gz.

Nicos Angelopoulos and James Cussens. Exploiting informative priors for Bayesian classification and regression trees. In Proc. 19th International Joint Conference on AI (IJCAI-05), Edinburgh, August 2005a.

Nicos Angelopoulos and James Cussens. MCMCMS 0.3.4 User Guide. University of York,2005b. URL http://www.cs.york.ac.uk/aig/slps/mcmcms/MCMCMS_uguide.html.

Nicos Angelopoulos and James Cussens. Tempering for Bayesian C&RT. In Proceedings of the 22nd International Conference on Machine Learning (ICML05), Bonn, 2005c.

I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the European Conference on Artificial Intelligence in Medicine, pages 247–256, 1989.

J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213–244, 1997.

Susanne G. Bøttcher and Claus Dethlefsen. deal: A package for learning Bayesian networks. Journal of Statistical Software, 8(20), 2003.

W. L. Buntine. Theory refinement of Bayesian networks. In Bruce D'Ambrosio, Philippe Smets, and Piero Bonissone, editors, Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI–1991), pages 52–60, 1991. URL http://citeseer.nj.nec.com/buntine91theory.html.

Robert Castelo and Tomas Kocka. On inclusion-driven learning of Bayesian networks. Journal of Machine Learning Research, 4:527–574, 2003.

G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992. URL http://smi-web.stanford.edu/pubs/SMI_Reports/SMI-91-0355.pdf. Appeared as 1991 Technical Report KSL-91-02 for the Knowledge Systems Laboratory, Stanford University (also SMI-91-0355).

James Cussens. Stochastic logic programs: Sampling, inference and applications. In Proc. UAI-00, pages 115–122, San Francisco, CA, 2000. Morgan Kaufmann. URL ftp://ftp.cs.york.ac.uk/pub/ML_GROUP/Papers/uai00.ps.gz.


James Cussens. Parameter estimation in stochastic logic programs. Machine Learning, 44(3):245–271, 2001. URL ftp://ftp.cs.york.ac.uk/pub/ML_GROUP/Papers/jcslpmlj.ps.gz.

Thore Egeland, Petter Mostad, Bente Mevag, and Margurethe Stenersen. Beyond traditional paternity and identification cases. Selecting the most probable pedigree. Forensic Science International, 110(1), 2000.

William Feller. An Introduction to Probability Theory and Its Applications, volume 1. John Wiley, New York, third edition, 1950.

G. Frege. Begriffsschrift, eine der arithmetischen nachgebildete Formelsprache des reinen Denkens, 1879.

Nir Friedman and Daphne Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–126, 2003. URL http://robotics.stanford.edu/~koller/papers/order-mcmc.ps. Expanded version of UAI-2000 paper.

Andrew Gelman. Parameterization and Bayesian modeling. Journal of the American Statistical Association, 99(466):537–545, 2004.

W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996.

Olle Häggström. Finite Markov Chains and Algorithmic Applications, volume 52 of London Mathematical Society Student Texts. Cambridge University Press, Cambridge, 2002.

D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995a. URL ftp://ftp.research.microsoft.com/pub/tr/tr-94-09.ps. Also appears as Technical Report MSR-TR-94-09, Microsoft Research, March, 1994 (revised December, 1994).

David Heckerman, David Maxwell Chickering, Christopher Meek, Robert Rounthwaite, and Carl Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–75, October 2000.

David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995b.

Søren Højsgaard and Bo Thiesson. BIFROST—block recursive models induced from relevant knowledge, observations, and statistical techniques. Computational Statistics & Data Analysis, 19:155–175, 1995.

Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach. Open Court, La Salle, Illinois, 1989.

Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.


Helge Langseth and Thomas D. Nielsen. Fusion of domain knowledge with data for structural learning in object oriented domains. Journal of Machine Learning Research, 4:339–368, July 2003. URL http://www.jmlr.org/papers/volume4/langseth03a/langseth03a.pdf.

Kathryn Blackmond Laskey and James W. Myers. Population Markov chain Monte Carlo. Machine Learning, 50:175–196, 2003.

S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their applications to expert systems. Journal of the Royal Statistical Society A, 50(2):157–224, 1988.

Steffen L. Lauritzen and Thomas S. Richardson. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society (B), 64(3):321–361, 2002.

D. Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232, 1995.

David Madigan, Jonathan Gavrin, and Adrian E. Raftery. Eliciting prior information to enhance the predictive performance of Bayesian graphical models. Communications in Statistics: Theory and Methods, 24:2271–2292, 1995. URL http://www.stat.washington.edu/tech.reports/tr270.ps. Appeared as 1994 Technical Report 270, University of Washington.

David Madigan and Adrian E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89:1535–1546, 1994. URL http://www.stat.washington.edu/www/research/reports/1991/tr213.ps. First version was 1991 Technical Report 213, University of Washington.

Stephen Muggleton. Stochastic logic programs. In Luc De Raedt, editor, Advances in Inductive Logic Programming, volume 32 of Frontiers in Artificial Intelligence and Applications, pages 254–264. IOS Press, Amsterdam, 1996.

Ulf Nilsson and Jan Małuszyński. Logic, Programming and Prolog. John Wiley, Chichester, second edition, 1995. URL http://www.ida.liu.se/~ulfni/lpp/.

Matthew Richardson and Pedro Domingos. Learning with knowledge from multiple experts. In Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, 2003. Morgan Kaufmann. URL http://www.cs.washington.edu/homes/pedrod/papers/mlc03.pdf.

Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer, New York, second edition, 2004.

Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning module networks. Journal of Machine Learning Research, 6:557–588, 2005.


Nuala Sheehan and Daniel Sorensen. Graphical models for mapping continuous traits. In Peter J. Green, Nils Lid Hjort, and Sylvia Richardson, editors, Highly Structured Stochastic Systems, pages 382–386. Oxford University Press, Oxford, 2003.

Sampath Srinivas, Stuart Russell, and Alice M. Agogino. Automated construction of sparse Bayesian networks from unstructured probabilistic models and domain information. In Max Henrion, Ross Schachter, Laveen Kanal, and John F. Lemmer, editors, Uncertainty in Artificial Intelligence: Proceedings of the Fifth Conference (UAI-1989), pages 295–308, New York, NY, 1990. Elsevier Science Publishing Company, Inc.

Matthew Stephens and Peter Donnelly. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 73:1162–1169, 2003.

Robert Valdueza Castelo and Arno Siebes. Priors on network structures. Biasing the search for Bayesian networks. Technical Report INS-R9816, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, The Netherlands, December 1998. URL http://www.cwi.nl/ftp/CWIreports/INS/INS-R9816.pdf.

V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
