
Scale Up Bayesian Networks LearningDissertation Proposal

Xiannian [email protected]

March 2015

Abstract

Bayesian networks are widely used graphical models which represent uncertain rela-tions between the random variables in a domain compactly and intuitively. The first step ofapplying Bayesian networks to real-word problems typically requires building the networkstructure. Among others, optimal structure learning via score-and-search has become anactive research topic in recent years. In this context, a scoring function is used to measurethe goodness of fit of a structure to the data, and the goal is to find the structure whichoptimizes the scoring function. The problem has been shown to be NP-hard. This pro-posal focuses on scaling up structure learning algorithms. Two potential directions arepreliminarily explored. One is to tighten the bounds; we tighten the lower bound by us-ing more informed variable groupings when creating the heuristics. and tighten the upperbound using anytime learning algorithm. The other is to prune the search space by extract-ing constraints directly from data. Our preliminary results show that the two directionscan significantly improve the efficiency and scalability of heuristics search-based structurelearning algorithms, which gives us strong reasons to follow these directions in the future.

Professor: Dr. Changhe Yuan Signature: Date: 03-11-2015

Professor: Dr. Liang Huang Signature: Date: 03-11-2015

Professor: Dr. Andrew Rosenberg Signature: Date: 03-11-2015

Professor: Dr.John Paisley Signature: Date: 03-11-2015

Advisor: Dr. Changhe Yuan


Contents

1 Introduction
  1.1 Notation and Terminology
  1.2 Outline

2 Bayesian Network Structure Learning
  2.1 Scoring Functions
  2.2 Local Search Strategies
    2.2.1 Hill Climbing
    2.2.2 Max-Min Hill Climbing
    2.2.3 Stochastic Search
  2.3 Optimal Search Strategies
    2.3.1 Branch-and-Bound
    2.3.2 Dynamic Programming
    2.3.3 Integer Linear Programming
    2.3.4 Heuristic Search

3 Exact Structure Learning
  3.1 Integer Linear Programming
  3.2 Heuristic Search
    3.2.1 Finding the Shortest Path: Heuristic Search Algorithms
  3.3 Integer Linear Programming vs. Heuristic Search

4 Proposal I: Tightening Bounds
  4.1 K-Cycle Conflict Heuristic
  4.2 Tightening Dynamic Pattern Database
  4.3 Tightening Static Pattern Database
    4.3.1 Creating the Partition Graph
    4.3.2 Performing the Partition
  4.4 Topology Grouping
  4.5 Tightening Upper Bounds
  4.6 Preliminary Results
    4.6.1 Results on Dynamic Pattern Databases
    4.6.2 Results on Upper Bounds
    4.6.3 Results on Static Pattern Databases

5 Proposal II: Using Constraints Learned from Data
  5.1 Potential Optimal Parents Sets
  5.2 Using POPS Constraints
    5.2.1 A Simple Example
    5.2.2 Ancestor Relations
    5.2.3 POPS Constraints Pruning
    5.2.4 Recursive POPS Constraints Pruning
    5.2.5 Bridge Constraints
  5.3 Preliminary Results

6 Conclusion and Future Work


1 Introduction

Bayesian networks, also known as belief networks, are widely used graphical models that represent uncertain relations between the random variables in a domain compactly and intuitively [40]. In particular, each node in the graph represents a random variable, while edges between nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies are often estimated using known statistical and computational methods. Hence, Bayesian networks combine principles from graph theory, probability theory, computer science, and statistics.

A Bayesian network is a directed acyclic graph (DAG) in which nodes represent random variables. The structure of a DAG is defined by two sets: the set of nodes and the set of directed edges. Each node represents one random variable. An edge from node Xi to node Xj represents a statistical dependency between the corresponding variables. Thus, the arrow indicates that a value taken by Xj depends on the value taken by Xi, or roughly speaking, that Xi influences Xj. Xi is then referred to as a parent of Xj and, similarly, Xj is referred to as a child of Xi. The structure of a DAG guarantees that there is no cycle in the Bayesian network; in other words, no node can be its own ancestor or its own descendant. Such a condition is of vital importance to the factorization of the joint probability of a collection of nodes.

In addition to the DAG structure, which is often considered the qualitative part of a Bayesian network, one needs to specify the quantitative parameters of the network. These parameters are quantified by a set of conditional probability distributions or conditional probability tables (CPTs), one for each variable conditioning on its parents. Overall, a Bayesian network represents a joint probability distribution over the variables.

Figure 1 shows an example of a Bayesian network: the qualitative part is indicated by the DAG structure, and the quantitative part is given by four CPTs.

Figure 1: A Bayesian Network

Applying Bayesian networks to real-world problems typically requires building graphical representations of relationships among variables. When these relationships are not known a priori, the structure of Bayesian networks must be learned. The topic of this proposal is Bayesian network structure learning.

The remainder of this section introduces the notation and terminology used throughout the rest of this proposal, and then gives the outline of the proposal.

1.1 Notation and Terminology

A Bayesian network is a directed acyclic graph (DAG) G that represents a joint probability distribution over a set of random variables V = {X1, X2, ..., Xn}. A directed arc from Xi to Xj represents the dependence between the two variables; we say Xi is a parent of Xj. We use PAj to stand for the parent set of Xj. The dependence relation between Xj and PAj is quantified using a conditional probability distribution, P(Xj | PAj). The joint probability distribution represented by G is factorized as the product of all the conditional probability distributions in the network, i.e., P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | PAi).
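The factorization above can be sketched directly in code. The three-variable network and its CPT values below are hypothetical illustrations (not the network of Figure 1); the point is only that the joint probability of a full assignment is the product of one CPT entry per variable.

```python
# Hypothetical CPTs: variable -> (parent list, {(parent values, value): prob}).
cpts = {
    "Rain":      ([], {((), 1): 0.2, ((), 0): 0.8}),
    "Sprinkler": (["Rain"], {((1,), 1): 0.01, ((1,), 0): 0.99,
                             ((0,), 1): 0.40, ((0,), 0): 0.60}),
    "WetGrass":  (["Rain", "Sprinkler"],
                  {((1, 1), 1): 0.99, ((1, 1), 0): 0.01,
                   ((1, 0), 1): 0.80, ((1, 0), 0): 0.20,
                   ((0, 1), 1): 0.90, ((0, 1), 0): 0.10,
                   ((0, 0), 1): 0.00, ((0, 0), 0): 1.00}),
}

def joint_probability(assignment, cpts):
    """P(X1,...,Xn) = prod_i P(Xi | PA_i): one CPT lookup per variable."""
    p = 1.0
    for var, (parents, table) in cpts.items():
        pa_vals = tuple(assignment[pa] for pa in parents)
        p *= table[(pa_vals, assignment[var])]
    return p

print(joint_probability({"Rain": 1, "Sprinkler": 0, "WetGrass": 1}, cpts))
# 0.2 * 0.99 * 0.80 ≈ 0.1584
```

Because each CPT row sums to one, the products over all full assignments sum to one, so this is indeed a joint distribution.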

In addition to the compact representation, Bayesian networks also provide principled approaches to solving various inference tasks, including belief updating, most probable explanation, maximum a posteriori assignment [40], and most relevant explanation [54, 52, 53].

Given a dataset D = {D1, ..., DN}, where each data point Di is a vector of values over variables

V, learning a Bayesian network is the task of finding a network structure that best fits D. In

this work, we assume that each variable is discrete with a finite number of possible values, and

no data point has missing values.

1.2 Outline

The remainder of the proposal is organized as follows: in section 2, we introduce the basic concepts of Bayesian network structure learning and different learning algorithms. In section 3, we elaborate on the two most efficient exact structure learning methods. We propose methods for tightening the bounds in section 4, and propose methods to prune the search space by extracting constraints directly from data in section 5. We briefly conclude and discuss future work in section 6.

Related Publication: Parts of the proposed methods appeared in the following paper:

Xiannian Fan, Changhe Yuan, and Brandon Malone. Tightening Bounds for Bayesian Network Structure Learning. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), 2014.


2 Bayesian Network Structure Learning

The Bayesian network structure learning problem takes data as input and produces a directed acyclic graph as output. Before we dive into algorithmic details in the next section, let us review the basic concepts and briefly introduce the different learning methods in this section.

There are roughly three main approaches to the learning problem: score-based learning, constraint-based learning, and hybrid methods. Score-based learning methods evaluate the quality of Bayesian network structures using a scoring function and select the one that has the best score [13, 27]. These methods basically formulate the learning problem as a combinatorial optimization problem. They work well for datasets with not too many variables, but may fail to find optimal solutions for large datasets. We will discuss this approach in more detail in the next section, as it is the approach we take. Constraint-based learning methods typically use statistical tests to identify conditional independence relations from the data and build a Bayesian network structure that best fits those independence relations [40, 46, 9, 20, 51]. Constraint-based methods mostly rely on results of local statistical tests, so they can often scale to large datasets. However, they are sensitive to the accuracy of the statistical tests and may not work well when the data are insufficient or noisy. In comparison, score-based methods work well even for datasets with relatively few data points. Hybrid methods aim to integrate the advantages of the previous two approaches and use combinations of constraint-based and/or score-based methods for solving the learning problem [17, 2, 50, 41]. One popular strategy is to use constraint-based learning to create a skeleton graph and then use score-based learning to find a high-scoring network structure that is a subgraph of the skeleton [50, 41].

In this paper, we mainly focus on the score-based approach. We do not consider Bayesian model averaging methods, which aim to estimate the posterior probabilities of structural features such as edges rather than to select a model [27, 23, 16].

Score-based learning methods rely on a scoring function Score(·) to evaluate the quality of a Bayesian network structure. A search strategy is used to find a structure G∗ that optimizes the score. Therefore, score-based methods have two major elements: scoring functions and search strategies. We first introduce the scoring functions, and then review two classes of search strategies: local search strategies and optimal search strategies.

2.1 Scoring Functions

Many scoring functions can be used to measure the quality of a network structure. Some

of them are Bayesian scoring functions which define a posterior probability distribution over

the network structures conditioning on the data, and the structure with the highest posterior

probability is presumably the best structure. These scoring functions are best represented by

the Bayesian Dirichlet score (BD) [28] and its variations, e.g., K2 [13], Bayesian Dirichlet

score with score equivalence (BDe) [28], and Bayesian Dirichlet score with score equivalence

and uniform priors (BDeu) [8]. Other scoring functions often have the form of trading off the

goodness of fit of a structure to the data and the complexity of the structure. The goodness of fit

is measured by the likelihood of the structure given the data or the amount of information that

can be compressed into a structure from the data. Scoring functions belonging to this category

include minimum description length (MDL) (or equivalently Bayesian information criterion,

BIC) [42, 47, 33], Akaike information criterion (AIC) [4, 7], (factorized) normalized maximum

likelihood function (NML/fNML) [44], and the mutual information tests score (MIT) [19].

All of these scoring functions share some general properties. The most important property is decomposability; that is, the score of a network can be decomposed into a sum of node scores [27].

It is worth noting that the optimal structure G∗ may not be unique because multiple Bayesian

network structures may share the same optimal score1. Two network structures are said to

belong to the same equivalence class [10] if they represent the same set of probability distribu-

tions with all possible parameterizations. Score-equivalent scoring functions assign the same

score to structures in the same equivalence class. Most of the above scoring functions are score

equivalent.

1That is why we often use “an optimal” instead of “the optimal” throughout this paper.


Due to space limits, we mainly introduce the MDL score in this survey. Let ri be the number of states of Xi, N_{pai} be the number of data points consistent with PAi = pai, and N_{xi,pai} be the number of data points further constrained by Xi = xi. MDL is defined as follows [33]:

MDL(G) = ∑_i MDL(Xi | PAi),   (1)

where

MDL(Xi | PAi) = H(Xi | PAi) + (log N / 2) · K(Xi | PAi),   (2)

H(Xi | PAi) = −∑_{xi,pai} N_{xi,pai} · log(N_{xi,pai} / N_{pai}),   (3)

K(Xi | PAi) = (ri − 1) ∏_{Xl ∈ PAi} rl.   (4)

The goal is then to find a Bayesian network that has the minimum MDL score. Any other decomposable scoring function, such as BIC, BDeu, or fNML, can be used instead without affecting the search strategy. One slight difference between MDL and the other scoring functions is that the latter scores need to be maximized in order to find an optimal solution. But it is rather straightforward to translate between maximization and minimization problems by simply changing the sign of the scores. Also, we sometimes use costs to refer to the scores.
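A minimal sketch of the local MDL score of Eqs. (2)-(4), computed directly from counts over a dataset of assignments; the function name and data layout are illustrative, not from the original text:

```python
import math
from collections import Counter

def mdl_local(data, child, parents, arities):
    """MDL(Xi | PAi) following Eqs. (2)-(4). `data` is a list of dicts
    mapping variable names to values; `arities` maps variable -> r_i."""
    N = len(data)
    # Counts N_{xi,pai} and N_{pai} from the data.
    joint = Counter((tuple(d[p] for p in parents), d[child]) for d in data)
    pa_counts = Counter(tuple(d[p] for p in parents) for d in data)
    # Eq. (3): empirical conditional entropy term (natural log).
    H = -sum(n * math.log(n / pa_counts[pa]) for (pa, _), n in joint.items())
    # Eq. (4): parameter count K = (r_i - 1) * product of parent arities.
    K = (arities[child] - 1) * math.prod(arities[p] for p in parents)
    # Eq. (2): structure penalty weighted by (log N) / 2.
    return H + (math.log(N) / 2) * K

data = [{"A": 0}, {"A": 0}, {"A": 1}, {"A": 1}]
print(mdl_local(data, "A", [], {"A": 2}))   # 4*log 2 + log(4)/2 ≈ 3.4657
```

Summing `mdl_local` over all variables gives MDL(G) of Eq. (1), which is exactly the decomposability property discussed above.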

2.2 Local Search Strategies

Given n variables, there are O(n2^{n(n−1)}) directed acyclic graphs (DAGs). The size of the solution space grows exponentially in the number of variables, so it is not surprising that score-based structure learning has been shown to be NP-hard [11]. Due to this complexity, early research focused mainly on developing local search algorithms [27, 6]. Popular local search strategies include greedy hill climbing, max-min hill climbing, stochastic search, etc.


2.2.1 Hill Climbing

Greedy hill climbing methods start from an initial solution, which typically is an empty network without any edges or a randomly generated structure, and iteratively apply single-edge operations, including addition, deletion, and reversal, choosing at each step the operation that locally maximizes the score improvement. Extensions to this approach include tabu search with random restarts [25], limiting the number of parents or parameters for each variable [24], searching in the space of equivalence classes [12], searching in the space of variable orderings [49], and searching under constraints extracted from data.
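The greedy loop can be sketched as follows. This is a minimal illustration, not a production implementation: `local_score` stands in for any decomposable score to maximize, and the total score is recomputed from scratch at each move for clarity, whereas real implementations evaluate only the score delta of the changed families.

```python
import itertools

def has_cycle(graph):
    """Detect a directed cycle; `graph` maps each child to its parent set."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    def visit(v):
        color[v] = GRAY
        for u in graph[v]:
            if color[u] == GRAY or (color[u] == WHITE and visit(u)):
                return True
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and visit(v) for v in graph)

def hill_climb(variables, local_score):
    """Greedy hill climbing over DAGs with add/delete/reverse edge moves."""
    parents = {v: frozenset() for v in variables}   # start from the empty network
    def total(p):
        return sum(local_score(v, p[v]) for v in variables)
    current, improved = total(parents), True
    while improved:
        improved, best, best_move = False, current, None
        for x, y in itertools.permutations(variables, 2):
            if x not in parents[y]:     # addition of x -> y
                candidates = [{y: parents[y] | {x}}]
            else:                       # deletion and reversal of x -> y
                candidates = [{y: parents[y] - {x}},
                              {y: parents[y] - {x}, x: parents[x] | {y}}]
            for change in candidates:
                new = {**parents, **change}
                if has_cycle(new):      # keep only acyclic neighbors
                    continue
                s = total(new)
                if s > best:
                    best, best_move = s, new
        if best_move is not None:
            parents, current, improved = best_move, best, True
    return parents, current
```

The loop stops at a local optimum: no single edge operation improves the score, which is exactly why restarts and tabu lists are used in practice.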

2.2.2 Max-Min Hill Climbing

Max-min hill climbing (MMHC) [50] is a two-step algorithm. After first executing statistical conditional independence tests to find a reduced set of candidate parent sets, it applies a greedy hill-climbing step to find a locally optimal solution. MMHC is classified as a hybrid method between score-based and constraint-based methods.

The max-min step, based on the max-min parents and children (MMPC) heuristic [50], looks for a set of candidates for each node in the graph. It aims to find an undirected graph representing the skeleton of the original DAG by looking for subsets of variables Z that conditionally separate pairs of variables X, Y. Such a test is denoted T(X, Y |Z), and the G2 test is used in [50].

MMPC works as follows. Let PC^G_X be the set of parents and children of node X in a Bayesian network over distribution P with structure G. For any G′ Markov equivalent to G, it holds that PC^G_X = PC^{G′}_X, so we can simply write PC_X. Thus, the set of parents and children of a node is the same for all Markov-equivalent Bayesian networks over the same probability distribution P. MMPC performs the conditional independence test T(X, Y |Z) in a subroutine called MinAssoc; the logarithm of the obtained p-value is called the association between the variables.

The max-min heuristic, for each variable X, iteratively constructs a set of variables with high association with X (the CPC set), picking at each iteration the variable Y with the largest association, until the association falls below a minimum value. The sets of CPCs found by the max-min heuristic form the skeleton of the Bayesian network.

The hill-climbing algorithm is then applied to perform a local search, using the three possible edge operations over DAGs (edge insertion, edge removal, and edge reversal) and greedily choosing the one that increases the score the most until no improvement is possible. The search space for edges is limited to those allowed by the skeleton found by the max-min heuristic.

2.2.3 Stochastic Search

Stochastic search methods such as Markov chain Monte Carlo and simulated annealing have also been applied to find a high-scoring structure [27, 21, 36]. These methods explore the solution space using non-deterministic transitions between neighboring network structures while favoring better solutions. The stochastic moves are used in the hope of escaping local optima and finding better solutions.

2.3 Optimal Search Strategies

Local search methods are quite robust in the face of large learning problems with many variables. However, they are not guaranteed to find an optimal solution; worse, the quality of their solutions is typically unknown. Recently, multiple exact algorithms have been developed for learning optimal Bayesian networks.

2.3.1 Branch-and-Bound

A branch-and-bound algorithm (BB) was proposed in [18] for learning Bayesian networks. The algorithm first creates a cyclic graph by allowing each variable to select optimal parents from all the other variables. A best-first search strategy is then used to break the cycles by removing one edge at a time. The algorithm uses an approximation algorithm to estimate an initial upper-bound solution for pruning. The algorithm also occasionally expands the worst nodes in the search frontier in the hope of finding better networks to update the upper bound. At completion, the algorithm finds an optimal network structure that is a subgraph of the initial cyclic graph. If the algorithm runs out of memory before finding the solution, it switches to a depth-first search strategy to find a suboptimal solution. The authors claim their algorithm is optimal. Such optimality, however, holds only under the constraint that each variable can select parents only from its optimal parent set.

2.3.2 Dynamic Programming

Several dynamic programming algorithms have been proposed based on the observation that a Bayesian network has at least one leaf [37, 45]. A leaf is a variable with no child variables in a Bayesian network. In order to find an optimal Bayesian network for a set of variables V, it is sufficient to find the best leaf. For any leaf choice X, the best possible Bayesian network is constructed by letting X choose an optimal parent set PAX from V\{X} and letting V\{X} form an optimal subnetwork. The best leaf choice is then the one that minimizes the sum of Score(X, PAX) and Score(V\{X}) for a scoring function Score(·). More formally, we have:

Score(V) = min_{X∈V} {Score(V \ {X}) + BestScore(X, V \ {X})},   (5)

where

BestScore(X, V \ {X}) = min_{PAX ⊆ V\{X}} Score(X, PAX).   (6)

Given the above recurrence relation, a dynamic programming algorithm works as follows. It first finds optimal structures for single variables, which is trivial. Starting with these base cases, the algorithm builds optimal subnetworks for increasingly larger variable sets until an optimal network is found for V. The dynamic programming algorithms can find an optimal Bayesian network in O(n·2^n) time and space [31, 37, 43, 45]. Recent algorithms have improved the memory complexity by either trading longer running times for reduced memory consumption [39] or taking advantage of the layered structure present within the dynamic programming lattice.
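The recurrence of Eqs. (5)-(6) can be sketched with memoized recursion over variable subsets. The `best_score` oracle is a placeholder for the precomputed local scores (names and layout are illustrative):

```python
from functools import lru_cache

def optimal_network(variables, best_score):
    """Dynamic programming over variable subsets, following Eqs. (5)-(6).
    `best_score(X, S)` must return (score, parent_set): the minimal score
    for X with its parents drawn from the candidate set S."""
    @lru_cache(maxsize=None)
    def score(subset):                      # Score(subset) of Eq. (5)
        if not subset:
            return 0.0, None, None          # empty network: base case
        best = (float("inf"), None, None)
        for leaf in subset:                 # try every variable as the leaf
            s_leaf, pa = best_score(leaf, subset - {leaf})
            total = score(subset - {leaf})[0] + s_leaf
            if total < best[0]:
                best = (total, leaf, pa)
        return best

    # Peel off the chosen leaves to reconstruct the parent sets.
    parents, subset = {}, frozenset(variables)
    while subset:
        _, leaf, pa = score(subset)
        parents[leaf] = pa
        subset = subset - {leaf}
    return score(frozenset(variables))[0], parents
```

The memoization table has one entry per subset, which is exactly where the O(n·2^n) time and space bounds come from.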


2.3.3 Integer Linear Programming

Integer linear programming (ILP) has been used to learn optimal Bayesian network structures [15, 30]. The BNSL problem is formulated as an integer linear program over a polytope with an exponential number of facets. An outer-bound approximation to the polytope is then solved. If the solution of the relaxed problem is integral, it is guaranteed to be the optimal structure. Otherwise, cutting planes and branch-and-bound algorithms are subsequently applied to find the optimal structure. More details will be covered in the next section.

2.3.4 Heuristic Search

Several heuristic search algorithms have been developed to solve this problem by formulating it as a shortest path problem [57], e.g., A*, breadth-first branch and bound (BFBnB) [35], and anytime window A* [34]. Most of these algorithms need a lower-bound function to estimate the quality of a search path, so that they can prune paths that are guaranteed to lead to suboptimal solutions and focus on exploring the most promising parts of the search space. A lower bound was first computed by a simple heuristic function that completely relaxes the acyclicity constraint of Bayesian networks; a better lower bound was later obtained using the static k-cycle conflict heuristic [55], which demonstrated excellent performance when used by the search algorithms to solve many benchmark data sets. It takes as input a partition of the random variables of a data set, and computes the heuristic value of a path by enforcing the acyclicity constraint between the variables within each group and relaxing the acyclicity constraint between different groups. More details will be covered in the next section.


3 Exact Structure Learning

In this proposal, we focus on exact Bayesian network structure learning. We briefly reviewed algorithms adopting optimal search strategies in the last section. A previous study [56] showed that integer linear programming and heuristic search methods are more efficient than branch-and-bound and dynamic programming methods. We highlight and elaborate on these two methods in this section.

3.1 Integer Linear Programming

In [15], BN structure learning was encoded as an integer program. For each variable x and candidate parent set PA, a binary variable I(PA → x) is created, such that I(PA → x) = 1 if and only if PA is the parent set of x in an optimal BN. These variables are called family variables. Table 1 shows the encoding of the family variables. For each variable Xi, there are ki potential optimal parent-set candidates, so the total number of family variables is ∑_i ki.

I(PA1,1 → X1)  I(PA1,2 → X1)  ...  I(PA1,k1 → X1)
I(PA2,1 → X2)  I(PA2,2 → X2)  ...  I(PA2,k2 → X2)
I(PA3,1 → X3)  I(PA3,2 → X3)  ...  I(PA3,k3 → X3)
...            ...            ...  ...
I(PAn,1 → Xn)  I(PAn,2 → Xn)  ...  I(PAn,kn → Xn)

Table 1: Encoding of the family variables.

Then BN structure learning can be formulated as the constrained optimization problem of optimizing the objective

∑_{x,PA} s(x, PA) I(PA → x)   (7)

over the family variables. To use an integer linear programming approach, it is necessary to use linear constraints to ensure that only valid DAGs are feasible.

Jaakkola et al. [30] use a family of cluster constraints which impose that, for every subset C ⊂ V of the nodes of the graph G, there must be at least one node whose parent set either lies completely outside C or is the empty set. Formally, the cluster constraint for a set of nodes C ⊂ V is:

∑_{x∈C} ∑_{PA: PA∩C=∅} I(PA → x) ≥ 1.   (8)

This class of constraints stems from the observation that any subset C of nodes in a DAG must contain at least one node that has no parents in that subset. Clearly, there is an exponential number of cluster constraints. The approach taken is therefore to remove them, solve the linear relaxation of the problem, and then look for violated constraints.
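A minimal sketch of the separation step for integral solutions: given an assignment of family variables (here represented directly as each child's chosen parent set), find a cluster C violating Eq. (8) by brute-force enumeration. The function name is illustrative; real solvers separate cluster constraints far more cleverly, and also handle fractional solutions.

```python
from itertools import chain, combinations

def violated_cluster(assignment):
    """Return a cluster C violating Eq. (8): a node set in which every node
    takes at least one parent from inside the set. Return None if no cluster
    is violated. For integral solutions, a violated cluster is exactly the
    node set of a directed cycle."""
    nodes = list(assignment)
    subsets = chain.from_iterable(combinations(nodes, k)
                                  for k in range(2, len(nodes) + 1))
    for C in subsets:
        Cset = set(C)
        # LHS of Eq. (8): nodes in C whose chosen parent set avoids C entirely.
        lhs = sum(1 for x in C if not (assignment[x] & Cset))
        if lhs < 1:
            return Cset
    return None

# A cyclic assignment A->B->C->A violates the constraint for cluster {A, B, C}:
print(violated_cluster({"A": {"C"}, "B": {"A"}, "C": {"B"}}))
print(violated_cluster({"A": set(), "B": {"A"}, "C": {"B"}}))  # None
```

Each violated cluster found this way yields a new cutting plane of the form of Eq. (8) to add to the relaxation.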

Cussens et al. [15, 5] inherited Jaakkola's approach but treated the problem with a branch-and-cut strategy: they consider the relaxed problem obtained by removing a more general version of the cluster constraints from the model and solve it, adding the most effective cluster constraints as cutting planes when needed, obtaining a solution x∗. If x∗ does not violate any cluster constraints and is integer-valued, the problem is solved; otherwise, x∗ is branched on, creating two new subproblems.

3.2 Heuristic Search

In [57], BN structure learning was cast as a shortest path search problem. The state space graph

for learning Bayesian networks is basically a Hasse diagram containing all of the subsets of the

variables in a domain. Figure 2 visualizes the state space graph for a learning problem with

four variables.

Figure 2 shows the implicit search graph for four variables. The top-most node with the empty

set at layer 0 is the start search node, and the bottom-most node with the complete set at layer

n is the goal node, where n is the number of variables in a domain. An arc from U to U∪{Xi}

represents generating a successor node by adding a new variable {Xi} to an existing set of

variables U; U is called a predecessor of U ∪ {Xi}. The cost of the arc is equal to the score of the optimal parent set for Xi out of U, which is computed by considering all subsets of the



Figure 2: An order graph of four variables.

variables in PA ⊆ U, PA ∈ Pi, i.e.,

cost(U→ U ∪ {Xi}) = BestScore(Xi,U) (9)

= minPAi⊆U,PAi∈Pi

si(PAi). (10)

For example, the arc {X1, X2} → {X1, X2, X3} has a cost equal to BestScore(X3, {X1, X2}).
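This arc-cost computation can be read directly off the POPS. The sketch below mirrors the example above on a hypothetical pruned POPS list for X3 (the parent sets and scores are invented for illustration; lower scores are better):

```python
def best_score(pops_i, U):
    """min over parent sets PA in POPS_i with PA a subset of U (Eq. 10)."""
    candidates = [s for pa, s in pops_i if pa <= U]
    return min(candidates) if candidates else float("inf")

# Hypothetical pruned POPS for variable X3: (parent_set, local score) pairs.
pops_3 = [
    (frozenset({1, 2}), 4.1),
    (frozenset({2}), 5.0),
    (frozenset(), 7.3),
]

# Cost of the arc {X1, X2} -> {X1, X2, X3} in the order graph (Eq. 9).
print(best_score(pops_3, frozenset({1, 2})))  # prints 4.1
```

Note that shrinking U can only remove candidates, so the arc cost never decreases as fewer variables are available as parents.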

Each node at layer i has n − i successors, as there are this many ways to add a new variable, and i predecessors, as there are this many ways to remove a variable. We define expanding a node U as generating all successor nodes of U.

With the search graph thus defined, a path from the start node to the goal node is defined as a se-

quence of nodes such that there is an arc from each of the nodes to the next node in the sequence.

Each path also corresponds to an ordering of the variables in the order of their appearance. For

example, the path traversing nodes ∅, {X1}, {X1, X2}, {X1, X2, X3}, {X1, X2, X3, X4} stands

for the variable ordering X1, X2, X3, X4. That is why the graph is called the order graph. The

cost of a path is defined as the sum of the costs of all the arcs on the path. The shortest path is

then the path with the minimum total cost in the order graph.


Given the shortest path, we can reconstruct a Bayesian network structure by noting that each arc

on the path encodes the choice of optimal parents for one of the variables out of the preceding

variables, and the complete path represents an ordering of all the variables. Therefore, putting

together all the optimal parent choices generates a valid Bayesian network. By construction,

the Bayesian network structure is optimal.

3.2.1 Finding the Shortest Path: Heuristic Search Algorithms

This shortest path problem has been solved using several heuristic search algorithms, including

A* [57], anytime window A* (AWA*) [34] and breadth-first branch and bound (BFBnB) [35].

Figure 3: Finding Shortest Path by Heuristic Search.

In A* [26], an admissible heuristic function is used to calculate a lower bound on the cost

from a node U in the order graph to goal. An f-cost is calculated for U by summing the cost

from start to U (called g(U)) and the lower bound from U to goal (called h(U)). Figure 3

illustrates the basics of the heuristic search methods for finding shortest path of BNSL problem.

g(U) corresponds to the score of the subnetwork over the variables U, and h(U) estimates the

score of the remaining variables. So f(U) = g(U) + h(U). The f-cost provides an optimistic

estimation on how good a path through U can be. The search maintains a list of nodes to

be expanded sorted by f-costs called open and a list of already-expanded nodes called closed.


Initially, open contains just start, and closed is empty. Nodes are then expanded from open

in best-first order according to f-costs. Expanded nodes are added to closed. As better paths

to nodes are discovered, they are added to open. Upon expanding goal, the shortest path from

start to goal has been found.
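The A* loop just described can be sketched on a toy instance. The three-variable POPS below are hypothetical (all scores invented for illustration), and the heuristic is the simple admissible bound in which each remaining variable picks optimal parents from among all other variables:

```python
import heapq

POPS = {  # variable -> list of (parent_set, local score); lower is better
    1: [(frozenset(), 2.0), (frozenset({2}), 1.0)],
    2: [(frozenset(), 3.0), (frozenset({1, 3}), 1.5)],
    3: [(frozenset(), 2.5), (frozenset({1}), 1.2)],
}
V = frozenset(POPS)

def best_score(x, allowed):
    return min(s for pa, s in POPS[x] if pa <= allowed)

def h(U):
    # Admissible (and consistent) bound: acyclicity is fully relaxed, so each
    # remaining variable chooses optimal parents from all other variables.
    return sum(best_score(x, V - {x}) for x in V - U)

def astar():
    counter = 0  # tie-breaker so the heap never has to compare frozensets
    open_list = [(h(frozenset()), 0.0, counter, frozenset())]
    g_best = {frozenset(): 0.0}
    while open_list:
        f, g, _, U = heapq.heappop(open_list)
        if U == V:                 # goal: all variables placed
            return g
        if g > g_best[U]:          # stale open-list entry
            continue
        for x in V - U:            # expand: add one more variable
            U2 = U | {x}
            g2 = g + best_score(x, U)
            if g2 < g_best.get(U2, float("inf")):
                g_best[U2] = g2
                counter += 1
                heapq.heappush(open_list, (g2 + h(U2), g2, counter, U2))

print(astar())  # optimal total score
```

On this instance the optimal path follows the ordering X1, X3, X2, for a total score of 2.0 + 1.2 + 1.5 = 4.7.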

In AWA* [3], a sliding window search strategy is used to explore the order graph over a number

of iterations. During each iteration, the algorithm uses a fixed window size, w, and tracks the

layer l of the deepest node expanded. For the order graph, the layer of a node corresponds to

the number of variables in its subnetwork. Nodes are expanded in best-first order as usual by

A*; however, nodes selected for expansion in a layer less than l − w are instead frozen. A path

to goal is found in each iteration, which gives an upper bound solution. After finding the path

to goal, the window size is increased by 1 and the frozen nodes become open. The iterative

process continues until no nodes are frozen during an iteration, which means the upper bound

solution is optimal. Alternatively, the search can be stopped early if a resource bound, such as

running time, is exceeded; the best solution found so far is output.

In BFBnB [58], nodes are expanded one layer at a time. Before beginning the BFBnB search,

a quick search strategy, such as AWA* for a few iterations or greedy hill climbing, is used to

find a “good” network and its score. The score is used as an upper bound. During the BFBnB

search, any node with an f-cost greater than the upper bound can safely be pruned.

3.3 Integer Linear Programming vs. Heuristic Search

The integer linear programming approach formulates the BNSL problem as an integer linear program whose variables correspond to the potential optimal parent sets of all variables (called family variables), so its efficiency depends on the number of family variables. The heuristic search approach formulates the BNSL problem as a graph search problem whose efficiency depends on the complexity of the search space, which is 2^n, where n is the number of variables.

Yuan et al. [56] compared the two approaches. They compared GOBNILP (a publicly available software package using the integer linear programming method) to the A* algorithm (which is guaranteed to expand the minimum number of nodes among heuristic search algorithms). The comparison between GOBNILP

and A* showed that each approach has its own advantages. A* was able to find optimal Bayesian

networks for all the datasets well within the time limit. GOBNILP failed to learn optimal

Bayesian networks for some benchmark datasets. The reason is that GOBNILP formulates

the learning problem as an integer linear program whose variables correspond to the optimal

parent sets of all variables. Even though these datasets do not have many variables, they have

many optimal parent sets, so the integer programs for them have too many variables to be

solvable within the time limit. On the other hand, the results also showed that GOBNILP was

quite efficient on many benchmark datasets. Even though a dataset may have many variables,

GOBNILP can solve it efficiently as long as the number of optimal parent sets is small.

These insights are quite important, as they provide a guideline for choosing a suitable algo-

rithm given the characteristic of a dataset. If there are many optimal parent sets but not many

variables, heuristic search is the better algorithm; if the other way around is true, integer linear

programming is better. In comparison to integer linear programming, heuristic search is less

sensitive to the number of optimal parent sets, number of data points, or scoring functions, but

is more sensitive to the number of variables in the datasets.


4 Proposal I: Tightening Bounds

For the rest of this proposal, we focus on heuristic search methods. Malone et al. [35] proposed

to use breadth-first search to solve the shortest path problem. They observed that the order

graph has a regular layered structure, and the successors of any node only appear in the very

next layer. By searching the graph layer by layer, the algorithm only needs the layer being

generated in RAM. All nodes from earlier layers are stored on disk, and nodes from the layer

being expanded are read in from disk one by one. Furthermore, nodes from the layer being

generated are written to disk as RAM fills; delayed duplicate detection methods are used to

remove unnecessary copies of nodes [32].

The basic breadth-first search can be enhanced with branch and bound, resulting in a breadth-

first branch and bound algorithm (BFBnB). An upper bound ub is found in the beginning of the

search using a hill climbing search. A lower bound called f -cost is calculated for each node U

by summing two costs, g-cost and h-cost, where g-cost stands for the shortest distance from the

start node to U, and h-cost stands for an optimistic estimation on how far away U is from the

goal node. The h-cost is calculated from a heuristic function. Whenever f > ub, all the paths

extending U are guaranteed to lead to suboptimal solutions and are discarded immediately. The

algorithm terminates when reaching the goal node. Clearly the tightness of the lower and upper

bounds has a high impact on the performance of the algorithm.

In this section, we propose methods for tightening the lower bounds of BFBnB. We first provide a brief review of the k-cycle conflict heuristic, and then discuss how it can be tightened.

4.1 K-Cycle Conflict Heuristic

The effectiveness of the heuristic function is a critical factor in the effectiveness of the search.

The following simple heuristic function was introduced in [57] for computing lower bounds for

A* search.


Definition 1. Let U be a node in the order graph; its heuristic value is

h(U) = ∑_{X ∈ V\U} BestScore(X, V\{X}).                        (11)

The above heuristic function allows each remaining variable to choose optimal parents from

all of the other variables. Therefore it completely relaxes the acyclicity constraint of Bayesian

networks in the estimation. The heuristic was proven to be admissible, meaning it never over-

estimates the future distance [57]. Admissible heuristics guarantee the optimality of the solution found. However,

because of the complete relaxation of the acyclicity constraint, the simple heuristic may gener-

ate loose lower bounds.

In [55], an improved heuristic function called k-cycle conflict heuristic was proposed by reduc-

ing the amount of relaxation. The idea is to divide the variables into multiple groups with a

size up to k and enforce acyclicity within each group while still allowing cycles between the

groups. There are two major approaches to dividing the groups. One is to enumerate all of

the subsets with a size up to k (each subset is called a pattern); a set of mutually exclusive

patterns covering V\U can be selected to produce a lower bound for node U as the heuristic is

additive [22]. This approach is called dynamic pattern database. In [55], the dynamic pattern

database was created by performing a reverse breadth-first search for the last k layers in the

order graph. The search began with the goal node V, whose reverse g-cost is 0. A reverse arc

from U′ ∪ {X} to U′ corresponds to selecting the best parents for X from among U′ and has

a cost of BestScore(X,U′).

Figure 4: Two pattern databases for an 8-variable problem. The 8 variables are partitioned into two groups, V1 = {X1, X2, X3, X4} and V2 = {X5, X6, X7, X8}. (Left) The pattern database for group V1; bold arrows show the path corresponding to the score P1 for pattern {X2, X3}, where P1 = BestScore(X3, {X1, X2, X4} ∪ V2) + BestScore(X2, {X1, X4} ∪ V2). (Right) The pattern database for group V2; bold arrows show the path corresponding to the score P2 for pattern {X5, X7}, where P2 = BestScore(X7, {X5, X6, X8} ∪ V1) + BestScore(X5, {X6, X8} ∪ V1). The heuristic value for pattern {X2, X3, X5, X7} is P1 + P2.

The other approach is to partition the variables V into several static groups Vi (typically two), i.e., V = ∪_i Vi, and enforce acyclicity within each group while still allowing cycles between the groups. For the partition, we need to compute a pattern database for each group Vi. For this particular problem, a pattern database for group Vi is basically a full order graph containing all subsets of Vi. We use a backward breadth-first search to create the graph layer by layer starting from the node Vi. The cost of any reverse arc from U ∪ {X} to U in this order graph is BestScore(X, (∪_{j≠i} Vj) ∪ U). We then enumerate all subsets of each group Vi as the patterns,

which can be done by a reverse breadth-first search in an order graph containing only Vi [55].

The patterns from different groups are guaranteed to be mutually exclusive, so we simply pick

out the maximum-size pattern for each group that is a subset of V \U and add them together

as the lower bound. Figure 4 shows two pattern databases for an 8-variable problem, as well as

the procedure of calculating the heuristic value of node {X2, X3, X5, X7}.
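The lookup just described can be sketched in a few lines. The two database entries below are hypothetical stand-ins for the P1 and P2 of Figure 4; in practice they come from the reverse breadth-first searches over each group's order graph:

```python
V1 = frozenset({1, 2, 3, 4})
V2 = frozenset({5, 6, 7, 8})

def static_pd_bound(U, pd1, pd2):
    # The two groups' patterns are mutually exclusive by construction, so the
    # lower bound is the sum of the per-group costs for the patterns Vi \ U.
    return pd1[V1 - U] + pd2[V2 - U]

pd1 = {frozenset({2, 3}): 7.0}       # hypothetical cost P1 for pattern {X2, X3}
pd2 = {frozenset({5, 7}): 6.5}       # hypothetical cost P2 for pattern {X5, X7}

U = (V1 | V2) - {2, 3, 5, 7}         # node whose remaining variables are {X2, X3, X5, X7}
print(static_pd_bound(U, pd1, pd2))  # prints 13.5 (= P1 + P2)
```

The lookup is constant time per group, which is what makes the static pattern database cheap to consult during search.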

The tightness of the static k-cycle conflict heuristic depends highly on the partition being used.

The heuristic can avoid directed cycles for the patterns within the same group, but cannot

avoid cycles between different groups. In the example shown in Figure 4, X3 selects {X1, X5} as parents (a subset of {X1, X4} ∪ V2) and X5 selects {X3, X6} as parents; there is a cycle between X3 and X5. In work so far, only simple partitions have been used, e.g., the first half of the variables fall into one group and the second half into the other. How to form a good partition could be one future research direction.


4.2 Tightening Dynamic Pattern Database

To use the dynamic pattern database to calculate the tightest-possible lower bound for a search

node U, we select the set of mutually exclusive patterns which covers all of the variables in

V \U and has the maximum sum of costs. This can be shown to be equivalent to the maximum

weighted matching problem [22], which is NP-hard [38] for k > 2. Consequently, the previous

work [55] used a greedy approximation to solve the matching problem; its idea is to greedily

choose patterns with the maximum differential scores.

Solving the matching problem exactly improves the tightness of dynamic pattern database

heuristic. To achieve that, we formulate the problem as an integer linear program and solve

it exactly. A set of binary variables Ai,j are created which indicate if Xi is in pattern j. A sec-

ond set of variables pj are created which indicate if pattern j is selected. The linear constraints

that A · p = e, where e is a vector with 1s for each variable in V \U, are also added to the

program. A standard integer linear program solver, such as SCIP [1], is used to maximize the

cost of the selected patterns subject to the constraints. The resulting value is the lower bound

which is guaranteed to be at least as tight as that found by the greedy algorithm. Solving the

integer linear program introduces overhead compared to the simple greedy algorithm, though.
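For small pattern sets the matching can also be solved exactly by exhaustive search, which we use here as a stand-in for the ILP formulation solved with SCIP; the patterns and costs below are hypothetical:

```python
def exact_match(remaining, patterns):
    """Best total cost of a set of disjoint patterns exactly covering `remaining`."""
    pats = list(patterns)
    best = float("-inf")

    def rec(cover, total, i):
        nonlocal best
        if cover == remaining:
            best = max(best, total)
            return
        if i == len(pats):
            return
        p = pats[i]
        if p <= remaining - cover:      # p fits: disjoint from cover, inside remaining
            rec(cover | p, total + patterns[p], i + 1)
        rec(cover, total, i + 1)        # skip p

    rec(frozenset(), 0.0, 0)
    return best

patterns = {                            # hypothetical pattern-database costs
    frozenset({1}): 1.0, frozenset({2}): 1.0, frozenset({3}): 1.0,
    frozenset({1, 2}): 2.5, frozenset({2, 3}): 2.75,
}
print(exact_match(frozenset({1, 2, 3}), patterns))  # prints 3.75
```

Here the exact matching picks {X2, X3} plus {X1} (cost 3.75) over the greedy-looking {X1, X2} plus {X3} (cost 3.5), illustrating why the exact bound is at least as tight as the greedy one. This enumeration is exponential in the number of patterns and is only meant to convey the objective, not to replace an ILP solver.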

4.3 Tightening Static Pattern Database

The tightness of the static pattern database heuristic depends highly on the static grouping used

during its construction. A very simple grouping (SG) method was used in [55]. Let X1, ..., Xn

be the ordering of the variables in a dataset. SG divides the variables into two balanced groups,

{X1, ..., X⌈n/2⌉} and {X⌈n/2⌉+1, ..., Xn}. Even though SG exhibited excellent performance on some datasets in [55], there is potential to develop more informed grouping strategies by taking into account the correlation between the variables.

A good grouping method should reduce the number of directed cycles between the variables

and enforce acyclicity as much as possible. Since no cycles are allowed within each group,


we should maximize the correlation between the variables in each group. Also, because cycles

are allowed between groups, we should minimize the correlation between the groups. Consider

two variables X1 and X2. If they have no correlation, there will be no arc between these two

variables in the optimal Bayesian network, so there is no need to put the two variables in the

same group. On the other hand, if they have strong correlation, and if they are put into different

groups G1 and G2, X1 is likely to select X2 from G2 as a parent, and vice versa. Then a cycle

will be introduced between the two groups, resulting in a loose bound. It is better to put these

variables in one group and disallow cycles between them.

The above problem can be naturally formulated as a graph partition problem. Given a weighted

undirected graph G = (V, E) with vertex set V and edge set E, which we call the partition graph, the graph partition problem cuts the graph into two or more components while minimizing the total weight of the edges between different components. We are interested in uniform or

balanced partitions as it has been shown that such partitions typically work better in static pat-

tern databases [22]. Two issues remain to be addressed in the formulation: creating the partition

graph and performing the partition.

4.3.1 Creating the Partition Graph

We propose to use two methods to create the partition graph. One is to use constraint-based

learning methods such as the Max-Min Parent Children (MMPC) algorithm [50] to create the

graph. The MMPC algorithm uses independence tests to find a set called candidate parent and

children (CPC) for each variable Xi. The CPC set contains all parent and child candidates of

Xi. The CPC sets for all variables together create an undirected graph. Then MMPC assigns a

weight to each edge of the undirected graph using independence tests, which indicate the strength

of correlation with p-values. Small p-values indicate high correlation, so we use the negative

p-values as the weights. We name this approach family grouping (FG).

The second method works as follows. The simple heuristic in Eqn. 11 considers the optimal

parent set out of all of the other variables for each variable X , denoted as PA(X). Those


PA(X) sets together create a directed cyclic graph; by simply ignoring the directions of the

arcs, we again obtain an undirected graph. We then use the same independence tests as in

MMPC to obtain the edge weights. We name this approach parents grouping (PG).

4.3.2 Performing the Partition

Many existing partition algorithms can be used to perform the graph partition. Since we prefer

balanced partitions, we select the METIS algorithm. METIS is a multilevel graph partitioning

method: it first coarsens the graph by collapsing adjacent vertices and edges, then partitions

the reduced graph, and finally refines the partitions back into the original graph. Studies have

shown that the METIS algorithm has very good performance at creating balanced partitions

across many domains. Also, since the partition is done on the reduced graph, the algorithm is

quite efficient. This is important for our purpose of using it only as a preprocessing step.
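As a small-scale stand-in for METIS, the following sketch finds a balanced bisection by exhaustive enumeration of the candidate groups; the weighted edges are hypothetical (e.g., correlation strengths as produced by the FG or PG constructions):

```python
from itertools import combinations

def balanced_bisect(nodes, weights):
    """weights maps frozenset({u, v}) -> edge weight; returns (cut, (g1, g2))."""
    nodes = sorted(nodes)
    best_cut, best_parts = float("inf"), None
    for group in combinations(nodes, len(nodes) // 2):
        g1 = set(group)
        # An edge is cut iff exactly one of its endpoints lies in g1.
        cut = sum(w for e, w in weights.items() if len(e & g1) == 1)
        if cut < best_cut:
            best_cut, best_parts = cut, (g1, set(nodes) - g1)
    return best_cut, best_parts

weights = {frozenset(e): w for e, w in [
    ((1, 2), 5.0), ((3, 4), 4.0), ((1, 3), 0.5), ((2, 4), 0.25)]}
cut, (g1, g2) = balanced_bisect({1, 2, 3, 4}, weights)
print(cut, sorted(g1), sorted(g2))  # the strongly correlated pairs stay together
```

The exhaustive loop is only workable for a handful of variables; the point is the objective, which METIS optimizes heuristically at scale via coarsening, partitioning, and refinement.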

4.4 Topology Grouping

Besides the above grouping methods based on partitioning undirected graphs, we also propose

another method based on topologies of directed acyclic graphs found by approximate Bayesian

network learning algorithms. Even though these algorithms cannot guarantee to find the opti-

mal solutions, some of them can find Bayesian networks that are close to optimal. We assume

that the suboptimal networks capture many of the dependence/independence relations of the

optimal solution. So we simply divide the topological ordering of the suboptimal network

into two groups, and those are the grouping for the static pattern database. Many approximate

learning algorithms can be used to learn the suboptimal Bayesian networks. We select the re-

cent anytime window A* algorithm [3, 34] (introduced for obtaining upper bounds in the next

section) for this purpose. We name this approach topology grouping (TG).


4.5 Tightening Upper Bounds

Breadth-first branch and bound improves search efficiency by pruning nodes which have an

f -cost worse than some known upper bound, ub. In the best case, when ub is equal to the opti-

mal cost, BFBnB provably performs the minimal amount of work required to find the shortest

path [58]. As the quality of ub decreases, though, it may perform exponentially more work.

Consequently, a tight upper bound is pivotal for good search behavior.

A beam search-based hill climbing algorithm was used in [55] to find an upper bound. Recently,

anytime window A* (AWA*) [3, 34] was shown to find high quality, often optimal, solutions

very quickly. Briefly, AWA* uses a sliding window search strategy to explore the order graph

over a number of iterations. During each iteration, the algorithm uses a fixed window size, w,

and tracks the layer l of the deepest node expanded. Nodes are expanded in best-first order as

usual by A*; however, nodes selected for expansion in a layer less than l − w are instead frozen.

A path to the goal is found in each iteration, which gives an upper bound solution. After finding

the path to the goal, the window size is increased by 1 and the frozen nodes become the open

list. The iterative process continues until no nodes are frozen during an iteration, which means

the upper bound solution is optimal, or a resource bound, such as running time, is exceeded.

Due to its ability to often find tight upper bounds quickly, we use AWA* in this work.

4.6 Preliminary Results

We empirically tested our new proposed tightened bounds using the BFBnB algorithm. We use

benchmark datasets from the UCI machine learning repository and Bayesian Network Repos-

itory. The experiments were performed on an IBM System x3850 X5 with 16-core 2.67 GHz Intel Xeon processors and 512 GB RAM; 1.7 TB of disk space was used.


Figure 5: The number of expanded nodes and running time needed by BFBnB to solve Parkinsons based on dynamic pattern database strategies: matching (“Exact” or “Approximate”) and k (“3” or “4”).

4.6.1 Results on Dynamic Pattern Databases

We compared the dynamic pattern database with exact matching against the previous

greedy matching in [55] on the Parkinsons dataset. We used AWA* upper bounds in these

experiments. The exact matching method is guaranteed to produce a tighter bound than the

greedy method. However, an exact matching problem needs to be solved at each search step;

the total amount of time needed may become too prohibitive. This concern is confirmed in

the experiments. Figure 5 compares the performance of BFBnB with four dynamic pattern

databases (k = 3/4, and matching=exact/approximate). Solving the matching problems exactly

did reduce the number of expanded nodes by up to 2.5 times. However, solving the many exact

matching problems necessary to calculate the h-costs of the generated nodes in the order graph

far outweighed the benefit of marginally fewer expanded nodes. For example, BFBnB with the

exact matching required up to 400k seconds to solve Parkinsons, compared to only 4 seconds

using the approximate matching.


4.6.2 Results on Upper Bounds

We first tested the effect of the upper bounds generated by AWA* on BFBnB on two datasets:

Parkinsons and Steel Plates. Since AWA* is an anytime algorithm, it produces multiple upper

bounds during its execution. We recorded each upper bound plus the corresponding running

time of AWA*. Each upper bound is tested on BFBnB. In this experiment, we use static pattern

database with family grouping (FG) as the lower bound. Figure 6 plots the running time of

AWA* versus the running time and number of expanded nodes of BFBnB.

(a) Parkinsons  (b) Steel Plates

Figure 6: The effect of upper bounds generated by running AWA* for different amounts of time (in seconds) on the performance of BFBnB on Parkinsons and Steel Plates.

The experiments show that the upper bounds had a huge impact on the performance of BFBnB.

Parkinsons is a small dataset with only 23 variables. AWA* finds upper bounds within several hundredths of a second and solves the dataset optimally in around 10 seconds. BFBnB was able to solve Parkinsons in 18 seconds using the first upper bound found by AWA* at 0.03s, and

in 2 seconds using the upper bound found at 0.06s. Subsequent upper bounds bring additional

but marginal improvements to BFBnB. This confirms the results in [34] that AWA* often finds

excellent solutions early on and spends the rest of time finding marginally better solutions, or

just proving the optimality of the early solution.

(a) Parkinsons  (b) Steel Plates

Figure 7: The effect of different grouping strategies on the number of expanded nodes and running time of BFBnB on Parkinsons and Steel Plates. The four grouping methods are the simple grouping (SG), family grouping (FG), parents grouping (PG), and topology grouping (TG).

Steel Plates is a slightly larger dataset. AWA* needs more than 4,000 seconds to solve it. We tested all the upper bounds found by AWA*. The first upper bound found at 0.1s enabled BFBnB to solve the dataset within 1,000 seconds, and the third upper bound found at 0.5s solved

the dataset within 400 seconds. Again, all the subsequent upper bounds only brought marginal

improvements, even for the optimal bound found by AWA*. For much larger datasets, we be-

lieve that running AWA* for several hundreds of seconds should already generate sufficiently

tight upper bounds; the time is minimal when compared to the amount of time needed to prove

the optimality of Bayesian network structures. In all these experiments, the results on the num-

ber of expanded nodes show similar patterns as the running time.

4.6.3 Results on Static Pattern Databases

We compared the new grouping strategies, including family grouping (FG), parents grouping

(PG), and topology grouping (TG), against the simple grouping (SG) on Parkinsons and Steel

Plates. We again used AWA* upper bounds. Figure 7 shows the results. On Parkinsons, FG and

PG had the best performance; both FG and PG enabled BFBnB to solve the dataset three times

faster. TG was even worse than SG. On Steel Plates, however, TG was slightly better than

FG. PG was again the best performing method on this dataset; it allowed BFBnB to solve the

dataset 10 times faster than SG. These results suggest that family grouping (FG) and parents


grouping (PG) are both reliably better than the simple grouping (SG). We therefore use FG and

PG as the lower bounds in our subsequent experiments.


5 Proposal II: Using Constraints Learned from Data

In this section, we propose methods that use constraints extracted from data to improve learning efficiency. The methods are based on the observation that there is useful information implicit in the potentially optimal parent sets (POPS). Specifically, the POPS of a variable constrain its parent candidates. Moreover, the parent candidates of all variables together give a directed cyclic graph, which often decomposes into a set of strongly connected components (SCCs). Each SCC corresponds to a smaller subproblem which can be solved independently of the others. Since the POPS are the basis of the proposed methods, we first give the details about them.

5.1 Potentially Optimal Parent Sets

Recall our structure learning problem. The goal is to find a structure which optimizes the score.

We only require that the scoring function is decomposable [27]; that is, the score of a network is s(N) = ∑_i si(PAi). The si(PAi) values are often called local scores. Many commonly used scoring functions, such as MDL [33] and BDe [8, 29], are decomposable.

While the local scores are defined for all 2^(n−1) possible parent sets for each variable, this number

is greatly reduced by pruning parent sets that are provably never optimal [18]. We refer to this

as lossless score pruning because it is guaranteed to not remove the optimal network from

consideration. We refer to the scores remaining after pruning as potentially optimal parent sets

(POPS).

Other pruning strategies, such as restricting the cardinality of parent sets, are also possible,

but these techniques could eliminate parent sets which are in the globally optimal network; we

refer to pruning strategies which might remove the optimal network from consideration as lossy

score pruning. Of course, these, and any other, score pruning strategies can be combined.

Regardless of the score pruning strategies used, we still refer to the set of unpruned local scores

as POPS and denote the set of POPS for Xi as Pi. The POPS are given as input to the learning


problem. We define the Bayesian network structure learning problem (BNSL) as follows.

The BNSL Problem

INPUT: A set V = {X1, . . . , Xn} of variables and a set of POPS Pi for each Xi.

TASK: Find a DAG N∗ such that

N∗ ∈ argmin_N ∑_{i=1}^{n} si(PAi),

where PAi is the parent set of Xi in N and PAi ∈ Pi.
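On a tiny hypothetical POPS instance, BNSL can be solved by brute force over variable orderings (every ordering induces a DAG once each variable greedily takes its best feasible parents). This is only a reference implementation of the problem statement, not a practical algorithm; the POPS and scores are invented for illustration:

```python
from itertools import permutations

POPS = {  # variable -> list of (parent_set, local score); lower is better
    1: [(frozenset(), 2.0), (frozenset({2}), 1.0)],
    2: [(frozenset(), 3.0), (frozenset({1, 3}), 1.5)],
    3: [(frozenset(), 2.5), (frozenset({1}), 1.2)],
}
V = frozenset(POPS)

def best_parents(x, allowed):
    feasible = [(pa, s) for pa, s in POPS[x] if pa <= allowed]
    return min(feasible, key=lambda t: t[1])

def solve_bnsl():
    best_total, best_net = float("inf"), None
    for order in permutations(V):        # each ordering yields an acyclic network
        seen, total, net = frozenset(), 0.0, {}
        for x in order:
            pa, s = best_parents(x, seen)
            net[x] = pa
            total += s
            seen |= {x}
        if total < best_total:
            best_total, best_net = total, net
    return best_total, best_net

score, net = solve_bnsl()
print(score, {x: sorted(net[x]) for x in sorted(net)})
```

On this instance the optimal network has score 4.7 with parent sets X1 → ∅, X3 → {X1}, and X2 → {X1, X3}, which recovers both the objective value and the structure described by the problem statement.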

5.2 Using POPS Constraints

The POPS implicitly encode information that can benefit the search process. We will first

motivate our approach using a simple example and then describe the technical details.

5.2.1 A Simple Example

variable   POPS
X1         {X2} {}
X2         {X1} {}
X3         {X1, X2} {X2, X6} {X1, X6} {X2} {X6} {}
X4         {X1, X3} {X1} {X3} {}
X5         {X4} {X2} {}
X6         {X2, X5} {X2} {}

Table 2: The POPS for six variables. The ith row shows Pi.

Table 2 shows the POPS for six variables. Based on these sets, we can see that not all variables

can select all other variables as parents. For example, X1 can only select X2 as its parent (due

to score pruning). We collect all of the potential parents for Xi by taking the union of all

PA ∈ Pi. Figure 8 shows the resulting parent relation graph for the POPS in Table 2. The

parent relation graph includes an edge from Xj to Xi if Xj is a potential parent of Xi.
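The aggregation step can be sketched in a few lines of Python; the POPS below are those of Table 2 (empty parent sets contribute no edges), and the representation is illustrative only.

```python
def parent_relation_graph(pops):
    """Aggregate POPS into a directed graph with an edge Xj -> Xi
    whenever Xj appears in some potentially optimal parent set of Xi."""
    edges = set()
    for child, parent_sets in pops.items():
        for pa in parent_sets:
            for parent in pa:
                edges.add((parent, child))
    return edges

# The POPS from Table 2 (empty sets omitted; they add no edges).
pops = {
    "X1": [{"X2"}],
    "X2": [{"X1"}],
    "X3": [{"X1", "X2"}, {"X2", "X6"}, {"X1", "X6"}, {"X2"}, {"X6"}],
    "X4": [{"X1", "X3"}, {"X1"}, {"X3"}],
    "X5": [{"X4"}, {"X2"}],
    "X6": [{"X2", "X5"}, {"X2"}],
}
edges = parent_relation_graph(pops)
# e.g. ("X2", "X1") is an edge, but X3 is never a potential parent of X1
```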

Figure 8: The parent relation graph constructed by aggregating the POPS in Table 2. The strongly connected components are surrounded by shaded shapes.

Naively, the complete order graph for six variables contains 2^6 nodes. However, from the

parent relation graph, we see that none of {X3, X4, X5, X6} can be a parent of X1 or X2.

Consequently, we can split the problem into two subproblems as shown in Figure 9: first,

finding the shortest path from start to {X1, X2}, and then, finding the shortest path from

{X1, X2} to goal. Thus, the size of the search space is reduced to 2^2 + 2^4.

5.2.2 Ancestor Relations

This simple example shows that the parent relation graph can be used to prune the order graph

without bounds. In general, we must consider ancestor relations to prune the order graph. In

particular, if Xi can be an ancestor of Xj , and Xj cannot be an ancestor of Xi (due to local score

pruning), then no node in the order graph which contains Xj but not Xi needs to be generated.

As a proof sketch, we can consider a node U which includes neither Xi nor Xj . If we add Xi

and then Xj , then the cost from U to U∪{Xi, Xj} is BestScore(Xi,U)+BestScore(Xj,U∪

{Xi}). On the other hand, if we add Xj first, then the cost from U to U ∪ {Xi, Xj} is

BestScore(Xj,U) + BestScore(Xi,U ∪ {Xj}). However, due to the ancestor relations, we

know that BestScore(Xi,U ∪ {Xj}) = BestScore(Xi,U). So, regardless of the order we

add the two variables, Xi will have the same parent choices. If we add Xj first, though, then

Xj will have fewer choices. Therefore, adding Xj as a leaf first can never be better than adding

Xi first [56].
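The resulting pruning rule is straightforward to state in code. The sketch below assumes the pairwise ancestor constraints have already been extracted from the POPS; the function name and representation are illustrative only.

```python
def violates_ancestor_constraint(U, constraints):
    """A node U of the order graph can be pruned if, for some pair
    (Xi, Xj) where Xi can be an ancestor of Xj but not vice versa,
    U contains Xj without also containing Xi."""
    return any(xj in U and xi not in U for xi, xj in constraints)

# Hypothetical constraint derived from score pruning: X1 can be an
# ancestor of X3, but X3 cannot be an ancestor of X1.
constraints = [("X1", "X3")]
prune_a = violates_ancestor_constraint({"X3", "X4"}, constraints)  # prune
keep_b = violates_ancestor_constraint({"X1", "X3"}, constraints)   # keep
```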

5.2.3 POPS Constraints Pruning

We find the ancestor relations by constructing the parent relation graph and extracting its

strongly connected components (SCCs). The SCCs of the parent relation graph form the com-

ponent graph, which is a DAG [14]; each component graph node ci corresponds to an SCC scci

from the parent relation graph (which in turn corresponds to a set of variables in the Bayesian

network). The component graph includes a directed edge from ci to cj if the parent relation

graph includes an edge from a variable Xi ∈ scci to Xj ∈ sccj .

The component graph gives the ancestor constraints: if cj is a descendant of ci in the component graph, then variables in sccj cannot be ancestors of variables in scci. Consequently, the

component graph gives POPS constraints which allow the order graph to be pruned without

considering bounds. In particular, the POPS constraints allow us to prune nodes in the order

graph which do not respect the ancestor relations.

Tarjan’s algorithm [48] extracts the SCCs from directed graphs, like the parent relation graph.

We chose to use it because, in addition to its linear time complexity, it extracts the SCCs

from the parent relation graph consistent with their topological order in the component graph.

Consequently, all of the parent candidates of X ∈ scci appear in PCi = scc1 ∪ · · · ∪ scci.² After

extracting the m SCCs, the search can be split into m independent subproblems: one for each

SCC where starti is PCi−1 and goali is PCi. That is, during the ith subproblem, we select the

optimal parents for the variables in scci. Of course, start0 = ∅ and goalm = V. The worst-

case complexity of subproblem i is then O(2^|scci|). Figure 9(a) shows the pruned order graph

resulting from the parent relation graph in Figure 8. In particular, it shows the first subproblem,

from PC0 = ∅ to PC1 = {X1, X2}, and the second subproblem, from PC1 to PC2 = V.
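The decomposition can be sketched as follows. This is a compact (recursive) rendering of Tarjan's algorithm; since Tarjan's algorithm emits SCCs in reverse topological order of the component graph, reversing its output yields the order needed to form PC1, PC2, and so on. The edge list is the parent relation graph of Figure 8.

```python
def tarjan_sccs(nodes, edges):
    """Tarjan's SCC algorithm. SCCs are emitted in reverse topological
    order of the component graph, so we reverse the result to obtain a
    topological order (ancestor components first)."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def strongconnect(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:        # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in nodes:
        if v not in index:
            strongconnect(v)
    return list(reversed(sccs))

# Parent relation graph from Figure 8 (aggregated from Table 2).
nodes = ["X1", "X2", "X3", "X4", "X5", "X6"]
edges = [("X2", "X1"), ("X1", "X2"), ("X1", "X3"), ("X2", "X3"),
         ("X6", "X3"), ("X1", "X4"), ("X3", "X4"), ("X4", "X5"),
         ("X2", "X5"), ("X2", "X6"), ("X5", "X6")]
sccs = tarjan_sccs(nodes, edges)
# sccs == [{"X1", "X2"}, {"X3", "X4", "X5", "X6"}], so PC1 = {X1, X2}
```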

²Depending on the structure of the component graph, this may be a superset of the parent candidates for X.

Figure 9: Order graphs after applying the POPS constraints. (a) The order graph after applying the POPS constraints once. (b) The order graph after recursively applying the POPS constraints on the second subproblem.

The (worst-case) size of the original order graph for n variables is as follows.

O(2^(|scc1|+···+|sccm|)) = O(2^n) (12)

The worst-case size of the search space after splitting into subproblems using the SCCs is as

follows.

O(2^|scc1| + · · · + 2^|sccm|) = O(m · max_i 2^|scci|) (13)

That is, the complexity is at worst exponential in the size of the largest SCC. Consequently, our

method can scale to datasets with many variables if the largest SCC is of manageable size.

5.2.4 Recursive POPS Constraints Pruning

As described in Section 5.2.3, the size of the search space for the ith subproblem is O(2^|scci|),

which can still be intractable for large SCCs. However, recursive application of the POPS

Figure 10: The toy example. There is a bridge between variables x2 and x3.

constraints can further reduce the size of the search space. We refer to the constraints added by

this strategy as recursive POPS constraints.

The intuition is the same as that behind the POPS constraints. As an example, consider the

subproblem associated with scc2 in Figure 9, which includes variables X3, X4, X5 and X6.

Naively, the order graph associated with this subproblem has O(2^4) nodes. However, suppose

we add variable X3 as a leaf first. Then, the remaining variables split into three SCCs, and their

order is completely determined. Similarly, selecting any of the other variables as the first to add

as a leaf completely determines the order of the rest. Figure 9(b) shows the order graph after

applying recursive POPS constraints.

In general, selecting the parents for one of the variables has the effect of removing that variable

from the parent relation graph. After removing it, the remaining variables may split into smaller

SCCs, and the resulting smaller subproblems can be solved recursively. These SCC checks can

be implemented efficiently by again appealing to Tarjan’s algorithm. In particular, after adding

variable Xi as a leaf from U, we remove all of those variables from the parent relation graph.

We then find the topologically first SCC and expand just the variables in that component. As

we recursively explore the remaining variables, they will all eventually appear in the first SCC

of the updated parent relation graph.
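A sketch of this check is given below. For clarity it recomputes SCCs by mutual reachability rather than Tarjan's algorithm (quadratic, but adequate for illustration); the example is the subproblem over {X3, X4, X5, X6} from Figure 9, where removing X3 breaks the cycle X3 → X4 → X5 → X6 → X3.

```python
def sccs_of(nodes, edges):
    """SCCs of the subgraph induced by `nodes`, via mutual reachability.
    A sketch: Tarjan's algorithm is what we use in practice."""
    def reach(src, adj):
        seen, todo = set(), [src]
        while todo:
            v = todo.pop()
            if v in seen:
                continue
            seen.add(v)
            todo.extend(adj.get(v, []))
        return seen

    fwd, bwd = {}, {}
    for u, v in edges:
        if u in nodes and v in nodes:
            fwd.setdefault(u, []).append(v)
            bwd.setdefault(v, []).append(u)
    comps, assigned = [], set()
    for v in nodes:
        if v in assigned:
            continue
        comp = reach(v, fwd) & reach(v, bwd)
        comps.append(comp)
        assigned |= comp
    return comps

def split_after_removal(nodes, edges, removed):
    """Recursive POPS constraint check: removing a variable (once its
    parents are chosen) may split the rest into smaller SCCs."""
    remaining = set(nodes) - {removed}
    return sccs_of(remaining, edges)

# Subproblem scc2 = {X3, X4, X5, X6}; its internal edges form a cycle.
nodes = {"X3", "X4", "X5", "X6"}
edges = [("X3", "X4"), ("X4", "X5"), ("X5", "X6"), ("X6", "X3")]
comps = split_after_removal(nodes, edges, "X3")
# three singleton SCCs remain, so the order of X4, X5, X6 is fixed
```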

5.2.5 Bridge Constraints

We have shown that the strongly connected components of the parent relation graph can help to reduce the search space. Now let us consider the following toy example in Figure 10. Applying the POPS constraints described previously, we can prune some of the search space, as

Figure 11: The toy example. The search space pruned by the recursive POPS constraints.

shown in Figure 11. However, using the POPS constraints to obtain strongly connected components is not very effective here: although the number of paths searched is reduced considerably, only one node, {x2, x3}, is saved from expansion.

Let us use the ancestor relation information in another way and consider the relation between x2 and x3. We will show that more useful information can be extracted from the POPS constraints. Observe the toy example in Figure 10: there are three bridges in the graph. In graph theory, a bridge, or cut arc, is an edge of a graph whose deletion increases the number of its connected components. Note that in the directed parent relation graph, a bridge covers both directions. For the bridge between x2 and x3, there are three possibilities in an optimal network:

1. there is neither the edge x2 → x3 nor the edge x3 → x2;

2. there is only the edge x2 → x3;

3. there is only the edge x3 → x2.

In each of the above three cases, there are two strongly connected components. Thus we can search all three cases and then combine the results. The corresponding order graph that needs to be searched is illustrated in Figure 12. More nodes are saved from expanding, and more paths are saved from searching.

Figure 12: The toy example. The search space pruned by cutting the bridge between x2 and x3.
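Bridges can be found with the classic DFS low-link method. The sketch below operates on the undirected skeleton of the parent relation graph and assumes the toy example of Figure 10 is the chain x1–x2–x3–x4, so every edge is a bridge; both the graph and the representation are illustrative assumptions.

```python
def find_bridges(nodes, und_edges):
    """Bridges of an undirected graph via DFS low-link values:
    a tree edge (u, v) is a bridge iff low[v] > disc[u]."""
    adj = {v: [] for v in nodes}
    for u, v in und_edges:
        adj[u].append(v)
        adj[v].append(u)
    disc, low, bridges = {}, {}, []
    timer = [0]

    def dfs(v, parent):
        disc[v] = low[v] = timer[0]
        timer[0] += 1
        for w in adj[v]:
            if w == parent:
                continue
            if w in disc:                     # back edge
                low[v] = min(low[v], disc[w])
            else:                             # tree edge
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] > disc[v]:
                    bridges.append((v, w))

    for v in nodes:
        if v not in disc:
            dfs(v, None)
    return bridges

# Assumed undirected skeleton of the Figure 10 toy example: a chain.
bridges = find_bridges(["x1", "x2", "x3", "x4"],
                       [("x1", "x2"), ("x2", "x3"), ("x3", "x4")])
# all three edges are bridges, matching the text
```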

5.3 Preliminary Results

We empirically tested the proposed pruning methods using the A* algorithm on benchmark datasets from the UCI machine learning repository and the Bayesian Network Repository. The experiments were performed on an IBM System x3850 X5 with 16-core 2.67GHz Intel Xeon processors and 512GB of RAM; 1.7TB of disk space was used.

We considered three variants of A*: a basic version not using POPS constraints; a version

using the POPS constraints but not applying them recursively as described in Section 5.2.4;

and a version which uses the recursive POPS constraints. The results are illustrated in Figure 13.

As the figure shows, the versions of A* augmented with the POPS constraints always outper-

form the basic version. The improvement in running time ranges from two times on several

of the datasets to over an order of magnitude on three of the datasets. Additionally, the basic

version is unable to solve Soybean and Barley within the time limit (30 minutes); however,

Dataset   No Constraint   POPS Constraint   Recursive POPS Constraint
Autos     46.62           20.93             26.76
Soybean   FAIL            435.65            511.55
Alarm     76.51           6.47              4.06
Barley    FAIL            2.51              1.28

Figure 13: Running time (in seconds) of A* on the datasets Autos, Soybean, Alarm, and Barley under three constraint settings: No Constraint, POPS Constraint, and Recursive POPS Constraint. FAIL indicates the dataset was not solved within the 30-minute time limit.

with the POPS constraints, all of the datasets are easily solved within the limit. The number of

nodes expanded, and, hence, memory requirements, are similarly reduced.

The recursive POPS constraints always reduce the number of nodes expanded. However, they sometimes increase the running time. The overhead of Tarjan’s algorithm to recursively look

for SCCs is small; however, in some cases, such as when the parent relation graph is dense,

the additional work yields minimal savings. In these cases, despite the reduction in nodes

expanded, the running time may increase.

On the other hand, when the parent relation graph is sparse, the advantages of the recursive

POPS constraints are sometimes more pronounced. For example, the running times of Alarm and Barley are roughly halved by recursively applying the POPS constraints. Most networks constructed by domain experts, including those evaluated in this study, are sparse. Our analysis

shows that these datasets also yield sparse parent relation graphs. Thus, our results suggest

that the recursive constraints are sometimes effective when the generative process of the data

is sparse. The overhead of looking for the recursive POPS constraints is minimal, and it some-

times offers substantial improvement for sparse generative processes. So we always use it in

the remaining experiments.

6 Conclusion and Future Work

In this proposal, we present several methods for scaling up Bayesian network structure learning. Two directions are preliminarily explored. One is to tighten the bounds: we tighten the lower bound by using more informed variable groupings when creating the heuristics, and tighten the upper bound using anytime learning algorithms. The other is to prune the search space by extracting constraints directly from the data.

As future work, we plan to investigate the integration of domain knowledge into heuristic construction.

Combining prior knowledge with information from suboptimal solutions could further scale

Bayesian networks learning to big data.

References

[1] T. Achterberg. Constrained Integer Programming. PhD thesis, TU Berlin, July 2007.

[2] S. Acid and L. M. de Campos. A hybrid methodology for learning belief networks:

BENEDICT. International Journal of Approximate Reasoning, 27(3):235–262, 2001.

[3] S. Aine, P. P. Chakrabarti, and R. Kumar. AWA*-a window constrained anytime heuristic

search algorithm. In Proceedings of the 20th International Joint Conference on Artificial

Intelligence, pages 2250–2255, 2007.

[4] H. Akaike. Information theory and an extension of the maximum likelihood principle. In

Proceedings of the Second International Symposium on Information Theory, pages 267–

281, 1973.

[5] M. Bartlett and J. Cussens. Advances in Bayesian network learning using integer pro-

gramming. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelli-

gence, 2013.

[6] R. R. Bouckaert. Properties of Bayesian belief network learning algorithms. In Proceed-

ings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 102–109,

Seattle, WA, 1994. Morgan Kaufmann.

[7] H. Bozdogan. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52:345–370, 1987. 10.1007/BF02294361.

[8] W. Buntine. Theory refinement on Bayesian networks. In Proceedings of the seventh

conference (1991) on Uncertainty in artificial intelligence, pages 52–60, San Francisco,

CA, USA, 1991. Morgan Kaufmann Publishers Inc.

[9] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu. Learning Bayesian networks from data:

an information-theory based approach. Artificial Intelligence, 137(1-2):43–90, 2002.

[10] D. Chickering. A transformational characterization of equivalent Bayesian network struc-

tures. In Proceedings of the 11th annual conference on uncertainty in artificial intelli-

gence (UAI-95), pages 87–98, San Francisco, CA, 1995. Morgan Kaufmann Publishers.

[11] D. M. Chickering. Learning Bayesian networks is NP-complete. In Learning from Data:

Artificial Intelligence and Statistics V, pages 121–130. Springer-Verlag, 1996.

[12] D. M. Chickering. Learning equivalence classes of Bayesian-network structures. Journal

of Machine Learning Research, 2:445–498, 2002.

[13] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic

networks from data. Machine Learning, 9:309–347, 1992.

[14] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms.

McGraw-Hill Higher Education, 2001.

[15] J. Cussens. Bayesian network learning with cutting planes. In Proceedings of the 27th

Conference on Uncertainty in Artificial Intelligence, pages 153–160, 2011.

[16] D. Dash and G. Cooper. Model averaging for prediction with discrete Bayesian networks.

Journal of Machine Learning Research, 5:1177–1203, 2004.

[17] D. H. Dash and M. J. Druzdzel. A hybrid anytime algorithm for the construction of

causal models from sparse data. In Proceedings of the Fifteenth Annual Conference on

Uncertainty in Artificial Intelligence (UAI–99), pages 142–149, San Francisco, CA, 1999.

Morgan Kaufmann Publishers, Inc.

[18] C. P. de Campos and Q. Ji. Efficient learning of Bayesian networks using constraints.

Journal of Machine Learning Research, 12:663–689, 2011.

[19] L. M. de Campos. A scoring function for learning Bayesian networks based on mutual

information and conditional independence tests. Journal of Machine Learning Research,

7:2149–2187, 2006.

[20] L. M. de Campos and J. F. Huete. A new approach for learning belief networks using

independence criteria. International Journal of Approximate Reasoning, 24(1):11 – 37,

2000.

[21] L. M. de Campos and J. M. Puerta. Stochastic local algorithms for learning belief net-

works: Searching in the space of the orderings. In S. Benferhat and P. Besnard, edi-

tors, ECSQARU, volume 2143 of Lecture Notes in Computer Science, pages 228–239.

Springer, 2001.

[22] A. Felner, R. Korf, and S. Hanan. Additive pattern database heuristics. Journal of Artifi-

cial Intelligence Research, 22:279–318, 2004.

[23] N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian ap-

proach to structure discovery in Bayesian networks. Machine Learning, 50(1-2):95–125,

2003.

[24] N. Friedman, I. Nachman, and D. Pe’er. Learning Bayesian network structure from mas-

sive datasets: The “sparse candidate” algorithm. In K. B. Laskey and H. Prade, editors,

Proceedings of the Fifteenth Conference Conference on Uncertainty in Artificial Intelli-

gence (UAI-99), pages 206–215. Morgan Kaufmann, 1999.

[25] F. Glover. Tabu search: A tutorial. Interfaces, 20(4):74–94, July/August 1990.

[26] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of

minimum cost paths. IEEE Transactions On Systems Science And Cybernetics, 4(2):100–

107, 1968.

[27] D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor,

Learning in Graphical Models, volume 89 of NATO ASI Series, pages 301–354. Springer

Netherlands, 1998.

[28] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The

combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

[29] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The

combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

[30] T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure

using LP relaxations. In Proceedings of the 13th International Conference on Artificial

Intelligence and Statistics, 2010.

[31] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks.

Journal of Machine Learning Research, 5:549–573, 2004.

[32] R. Korf. Linear-time disk-based implicit graph search. Journal of the ACM, 35(6), 2008.

[33] W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the

MDL principle. Computational Intelligence, 10:269–293, 1994.

[34] B. Malone and C. Yuan. Evaluating anytime algorithms for learning optimal Bayesian

networks. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence

(UAI-13), pages 381–390, Seattle, Washington, 2013.

[35] B. Malone, C. Yuan, E. Hansen, and S. Bridges. Improving the scalability of optimal

Bayesian network learning with external-memory frontier breadth-first branch and bound

search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence,

pages 479–488, 2011.

[36] J. W. Myers, K. B. Laskey, and T. S. Levitt. Learning Bayesian networks from incomplete

data with stochastic search algorithms. In K. B. Laskey and H. Prade, editors, Proceedings

of the Fifteenth Conference Conference on Uncertainty in Artificial Intelligence (UAI-99),

pages 476–485. Morgan Kaufmann, 1999.

[37] S. Ott, S. Imoto, and S. Miyano. Finding optimal models for small gene networks. In

Pacific Symposium on Biocomputing, pages 557–567, 2004.

[38] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and com-

plexity. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1982.

[39] P. Parviainen and M. Koivisto. Exact structure discovery in Bayesian networks with less

space. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelli-

gence, Montreal, Quebec, Canada, 2009. AUAI Press.

[40] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference.

Morgan Kaufmann Publishers Inc., 1988.

[41] E. Perrier, S. Imoto, and S. Miyano. Finding optimal Bayesian network given a super-

structure. Journal of Machine Learning Research, 9:2251–2286, October 2008.

[42] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.

[43] T. Silander and P. Myllymaki. A simple approach for finding the globally optimal

Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in

Artificial Intelligence, 2006.

[44] T. Silander, T. Roos, P. Kontkanen, and P. Myllymaki. Factorized normalized maximum

likelihood criterion for learning Bayesian network structures. In Proceedings of the 4th

European Workshop on Probabilistic Graphical Models (PGM-08), pages 257–272, 2008.

[45] A. Singh and A. W. Moore. Finding optimal Bayesian networks by dynamic program-

ming. Technical Report CMU-CALD-05-106, Carnegie Mellon University, 2005.

[46] P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search. The MIT

Press, second edition, 2000.

[47] J. Suzuki. Learning Bayesian belief networks based on the minimum description length

principle: An efficient algorithm using the B&B technique. In International Conference

on Machine Learning, pages 462–470, 1996.

[48] R. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing,

1(2):146–160, 1972.

[49] M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for

learning Bayesian networks. In Proceedings of the 21st Conference on Uncertainty in

Artificial Intelligence, pages 584–590, Arlington, Virginia, 2005. AUAI Press.

[50] I. Tsamardinos, L. Brown, and C. Aliferis. The max-min hill-climbing Bayesian network

structure learning algorithm. Machine Learning, 65:31–78, 2006. 10.1007/s10994-006-

6889-7.

[51] X. Xie and Z. Geng. A recursive method for structural learning of directed acyclic graphs.

Journal of Machine Learning Research, 9:459–483, 2008.

[52] C. Yuan, H. Lim, and M. L. Littman. Most relevant explanation: Computational com-

plexity and approximation methods. Annals of Mathematics and Artificial Intelligence,

61:159–183, 2011.

[53] C. Yuan, H. Lim, and T.-C. Lu. Most relevant explanation in Bayesian networks. Journal

of Artificial Intelligence Research (JAIR), 42:309–352, 2011.

[54] C. Yuan, X. Liu, T.-C. Lu, and H. Lim. Most Relevant Explanation: Properties, algo-

rithms, and evaluations. In Proceedings of 25th Conference on Uncertainty in Artificial

Intelligence (UAI-09), pages 631–638, Montreal, Canada, 2009.

[55] C. Yuan and B. Malone. An improved admissible heuristic for finding optimal Bayesian

networks. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence,

2012.

[56] C. Yuan and B. Malone. Learning optimal Bayesian networks: A shortest path perspec-

tive. Journal of Artificial Intelligence Research, 48:23–65, 2013.

[57] C. Yuan, B. Malone, and X. Wu. Learning optimal Bayesian networks using A* search. In

Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011.

[58] R. Zhou and E. A. Hansen. Breadth-first heuristic search. Artificial Intelligence, 170:385–

408, 2006.
