
Learning High-Dimensional Bayesian Networks for Mixed Data

Faming Liang

University of Florida

May 6, 2017


Outline

▶ Introduction to Bayesian Networks

▶ Basic Theory of Bayesian Networks

▶ A Three-Stage Algorithm for Learning High-Dimensional Bayesian Networks
    ▶ Learning the moral graph
    ▶ Identifying v-structures
    ▶ Identifying derived directions
    ▶ Consistency

▶ Numerical Studies

▶ Discussion


Graphical Models

Graphical models have proven to be a useful tool for describing conditional independence relationships among a large set of random variables. Two types of graphical models are commonly used:

▶ Markov networks

▶ Bayesian networks


Markov Networks

▶ The Markov network, also known as the Markov random field, is a model over an undirected graph.

▶ The Gaussian graphical model (GGM), a special case of Markov networks, has been used in many scientific fields, from computer vision to natural language processing to genomics.

▶ Learning Gaussian graphical models:
    ▶ graphical Lasso (Yuan and Lin, 2007; Friedman et al., 2008)
    ▶ nodewise regression (Meinshausen and Bühlmann, 2006)
    ▶ ψ-learning (Liang et al., 2015)


Bayesian Networks

▶ The Bayesian network is a model over a directed acyclic graph. Since the directions of the edges are included in the model, it can represent more types of conditional independence than the Markov network.

▶ The relationship in which X and Y are marginally independent but dependent conditioned on a variable Z is not representable by a Markov network. However, it is easily represented by a Bayesian network using a v-structure: X → Z ← Y. In terms of Bayes' formula, this situation can be described by

$$
\pi(X, Y \mid Z) = \frac{\pi(Z \mid X, Y)\,\pi(X)\,\pi(Y)}{\pi(Z)} \neq \pi(X \mid Z)\,\pi(Y \mid Z).
$$
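The v-structure relationship above can be checked numerically. The following is a minimal sketch, assuming three binary variables and a made-up conditional probability table (the numbers in `P_X`, `P_Y` and `P_Z1` are illustrative and not from the slides). It verifies that X and Y are independent marginally but not after conditioning on Z.

```python
from itertools import product

# Hypothetical probabilities for the v-structure X -> Z <- Y (illustrative only).
P_X = {0: 0.5, 1: 0.5}
P_Y = {0: 0.7, 1: 0.3}
# P(Z = 1 | X, Y): a noisy OR of the two parents.
P_Z1 = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}


def joint(x, y, z):
    """Joint probability pi(X, Y, Z) = pi(X) pi(Y) pi(Z | X, Y)."""
    pz1 = P_Z1[(x, y)]
    return P_X[x] * P_Y[y] * (pz1 if z == 1 else 1.0 - pz1)


def prob(**fixed):
    """Probability of the event that fixes some of x, y, z to given values."""
    total = 0.0
    for x, y, z in product((0, 1), repeat=3):
        values = {"x": x, "y": y, "z": z}
        if all(values[name] == val for name, val in fixed.items()):
            total += joint(x, y, z)
    return total


# Largest deviation from pi(X, Y) = pi(X) pi(Y).
marginal_gap = max(
    abs(prob(x=x, y=y) - prob(x=x) * prob(y=y))
    for x, y in product((0, 1), repeat=2)
)

# Largest deviation from pi(X, Y | Z=1) = pi(X | Z=1) pi(Y | Z=1).
pz = prob(z=1)
conditional_gap = max(
    abs(prob(x=x, y=y, z=1) / pz - (prob(x=x, z=1) / pz) * (prob(y=y, z=1) / pz))
    for x, y in product((0, 1), repeat=2)
)
```

Here `marginal_gap` is zero up to rounding, while `conditional_gap` is clearly positive, which is the behavior a Markov network cannot encode.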


Bayesian Networks

Table: Conditional independences represented by Markov networks.

Conditional independence    Marginal independence
                            X ⊥P Y               X ⊥̸P Y
X ⊥P Y | Z                  (X)   (Y)−(Z)        (X)−(Z)−(Y)
X ⊥̸P Y | Z                  non-representable    (X)−(Y)−(Z)


Learning Bayesian Networks

The existing Bayesian network learning methods can be grouped into three categories:

▶ Constraint-based: These methods learn Bayesian networks by conducting a series of conditional independence tests. Examples include grow-shrink (Margaritis, 2003), incremental association (Tsamardinos et al., 2003; Yaramakala and Margaritis, 2005), and PC (Spirtes et al., 2000).

▶ Score-based: These methods find a network that optimizes a selected scoring function, e.g., entropy, minimum description length, or Bayesian score.

▶ Hybrid: These methods combine constraint-based and score-based methods to offset their respective weaknesses. Examples include the sparse candidate algorithm (Friedman et al., 1999) and the max-min hill-climbing (MMHC) algorithm (Tsamardinos et al., 2006).


PC Algorithm

▶ The PC algorithm is the state-of-the-art algorithm for Bayesian network learning: under the sparsity assumption, which bounds the neighborhood size of each node, the PC algorithm has been shown to be consistent and to run in time polynomial in p (Kalisch and Bühlmann, 2007).

▶ The computational complexity of the PC algorithm is O(p^(2+m)), where m is the maximum size of the Markov blanket of each node.


Joint Distribution

▶ A Bayesian network can be represented by a directed acyclic graph (DAG) G = (V, E), where V denotes a set of p nodes corresponding to the p variables X1, . . . , Xp, and E = (eij) denotes the adjacency matrix or arc set. The joint distribution of X1, . . . , Xp is given by

$$
P(\mathbf{X}) = \prod_{i} q(X_i \mid \mathrm{Pa}(X_i)), \tag{1}
$$

where Pa(Xi) denotes the parent nodes/variables of Xi in the network, and q(·|·) specifies the conditional distribution of Xi given its parent nodes.

▶ Local Markov property: Each node Xi is conditionally independent of its non-descendants (i.e., the nodes that cannot be reached by a directed path from Xi) given its parents.


v-Structure

▶ v-structure: A convergent connection Xi → Xk ← Xj is called a v-structure if there is no arc connecting Xi and Xj. The v-structure enables Bayesian networks to represent a type of relationship that Markov networks cannot, that is, Xi and Xj are marginally independent while they are dependent conditional on Xk.

▶ Markov blanket: The Markov blanket of a node Xi is the set consisting of the parents of Xi, the children of Xi, and the spouse nodes that share a child with Xi. The Markov blanket of a node Xi ∈ V is the minimal subset of V such that Xi is independent of all other nodes conditioned on it. The Markov blanket is symmetric, i.e., if node Xi is in the Markov blanket of Xj, then Xj is also in the Markov blanket of Xi.
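The Markov blanket definition above translates directly into set operations. The following is a minimal sketch, assuming the DAG is given as a list of (parent, child) pairs; the function name `markov_blanket` and the example graph are illustrative and not from the slides.

```python
def markov_blanket(edges, node):
    """Parents, children and spouses of `node` in a DAG given as (parent, child) pairs."""
    parents = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    # Spouses: other parents of the node's children.
    spouses = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses


# Example DAG: A -> C <- B, C -> D.
dag = [("A", "C"), ("B", "C"), ("C", "D")]
blanket_of_a = markov_blanket(dag, "A")  # child C and spouse B
blanket_of_d = markov_blanket(dag, "D")  # parent C only
```

In this example B is in the blanket of A and A is in the blanket of B, which illustrates the symmetry property.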


v-Structure

▶ Skeleton: If the directions of all arcs in a Bayesian network are removed, the resulting undirected graph is called the skeleton of the Bayesian network. Note that Bayesian networks with different arc sets can encode the same conditional independence relationships and represent the same joint distribution.

▶ Two DAGs defined over the same set of variables are equivalent if and only if they have the same skeleton and the same v-structures.

▶ In Bayesian networks, only the directions of the arcs that are part of one or more v-structures are important.


Moral Graph

▶ The moral graph is an undirected graph that is constructed by (i) connecting the non-adjacent nodes in each v-structure with an undirected arc, and (ii) ignoring the directions of the other arcs.

▶ Moralisation provides a simple way to transform a Bayesian network into a Markov network.

▶ In the moral graph, the neighboring set of each node forms its Markov blanket.
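The two moralisation steps can be sketched in a few lines. This is an illustrative sketch, assuming the DAG is a list of (parent, child) pairs; `moralize` and `neighbors` are hypothetical helper names, not part of the slides.

```python
from itertools import combinations


def moralize(edges):
    """Return the moral graph of a DAG as a set of undirected edges (frozensets)."""
    # Step (ii): drop the directions of the existing arcs.
    undirected = {frozenset(edge) for edge in edges}
    # Step (i): connect every pair of parents that share a child.
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    for parent_set in parents.values():
        for u, v in combinations(sorted(parent_set), 2):
            undirected.add(frozenset((u, v)))
    return undirected


def neighbors(undirected, node):
    """Neighbors of `node` in an undirected edge set."""
    return {v for edge in undirected if node in edge for v in edge if v != node}


# Example DAG: A -> C <- B, C -> D.
dag = [("A", "C"), ("B", "C"), ("C", "D")]
moral = moralize(dag)
```

For this example the moral graph gains the edge A − B, and `neighbors(moral, "A")` returns `{"B", "C"}`, which is the Markov blanket of A in the DAG.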


Faithfulness

▶ Let M denote the dependence structure of the probability distribution of X, i.e., the set of conditional dependence relationships between any triplet A, B, C of subsets of X. The graph G is said to be faithful or isomorphic to M if for all disjoint subsets A, B, C of X, we have

$$
A \perp_P B \mid C \iff A \perp_G B \mid C, \tag{2}
$$

where the left-hand side denotes conditional independence in probability, and the right-hand side denotes separation in the graph (i.e., C is a separator of A and B).

▶ For a Markov network, C is said to be a separator of A and B if for every a ∈ A and b ∈ B, all paths from a to b have at least one node in C.

▶ For Bayesian networks, C is said to be a separator of A and B if along every path between a node in A and a node in B there is a node v satisfying one of the following two conditions: (i) v has converging arcs and neither v nor any of its descendants are in C, or (ii) v is in C and does not have converging arcs.
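Separation in a Bayesian network can be tested mechanically. The sketch below does not enumerate paths as in the definition above; it swaps in the ancestral moral graph criterion, a standard equivalent formulation that is shorter to code: restrict the DAG to A, B, C and their ancestors, moralize, delete C, and check whether A can still reach B. The function name `d_separated` and the edge-list input are assumptions made for illustration.

```python
from itertools import combinations


def d_separated(edges, a, b, c):
    """True if node sets a and b are separated by c in the DAG `edges` (parent, child)."""
    a, b, c = set(a), set(b), set(c)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # 1. Keep a, b, c and all of their ancestors.
    keep, stack = set(), list(a | b | c)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize the ancestral subgraph (parents of kept nodes are also kept).
    adj = {node: set() for node in keep}
    for child in keep:
        pa = parents.get(child, set())
        for u in pa:
            adj[u].add(child)
            adj[child].add(u)
        for u, w in combinations(pa, 2):
            adj[u].add(w)
            adj[w].add(u)

    # 3. Remove c and test whether any node of b is reachable from a.
    seen, stack = set(), [node for node in a if node not in c]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj[node] - c)
    return not (seen & b)


# Example DAG: A -> C <- B, C -> D.
dag = [("A", "C"), ("B", "C"), ("C", "D")]
```

With this graph, A and B are separated by the empty set, but not by {C} or by {D}, since conditioning on the collider or on one of its descendants opens the path.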


Equivalent Test

▶ Suppose that we are able to identify a super Markov blanket $\tilde{S}_i$ for each node Xi such that $S_i \subseteq \tilde{S}_i$ holds, where $S_i$ denotes the Markov blanket of the node Xi.

▶ Let $\phi_{ij}$ denote the output of the conditional independence test $X_i \perp_P X_j \mid S_i \setminus \{X_i, X_j\}$, i.e., $\phi_{ij} = 1$ if the conditional independence holds and 0 otherwise. Let $\tilde{\phi}_{ij}$ denote the output of the conditional independence test $X_i \perp_P X_j \mid \tilde{S}_i \setminus \{X_i, X_j\}$.

▶ Under faithfulness, $\phi_{ij}$ and $\tilde{\phi}_{ij}$ are equivalent in learning moral graphs.


Equivalent Test

Theorem 1. Assume that faithfulness holds. Let $S_i$ denote the Markov blanket of Xi, and let $\tilde{S}_i$ denote a superset of $S_i$. Then $\phi_{ij}$ and $\tilde{\phi}_{ij}$ are equivalent in learning moral graphs in the sense that

$$
\phi_{ij} = 1 \iff \tilde{\phi}_{ij} = 1.
$$


Learning moral graphs: step (a)

(a) (Screening for parents and children nodes) Find a superset of parents and children for each node Xi:

(i) For each ordered pair of nodes (Xi, Xj), i, j = 1, 2, . . . , p, conduct the marginal independence test Xi ⊥P Xj and obtain the p-value.

(ii) Conduct a multiple hypothesis test to identify the pairs of nodes that are dependent. Denote the superset by Ai for i = 1, . . . , p. If the size of Ai is greater than n/(c_n1 log(n)) for a pre-specified constant c_n1, reduce it to n/(c_n1 log(n)) by removing the variables having larger p-values in the marginal independence tests.
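The size cap in step (a)(ii) can be sketched as follows. This is an illustrative sketch only: the function name `cap_neighborhood` is hypothetical, the p-values are assumed to have been computed already by whatever marginal test is used, and rounding the bound n/(c log n) down to an integer is an assumption, since the slides do not say how the bound is rounded.

```python
import math


def cap_neighborhood(candidates, pvalues, n, c):
    """Keep at most n / (c * log(n)) candidates, preferring the smallest p-values.

    candidates: iterable of node indices declared dependent on the target node.
    pvalues:    dict mapping node index -> marginal test p-value (assumed given).
    """
    limit = int(n / (c * math.log(n)))  # assumption: round the bound down
    ranked = sorted(candidates, key=lambda j: pvalues[j])
    return set(ranked[:limit])


# Example: n = 100 and c = 5 give a cap of int(100 / (5 * log(100))) = 4.
example_pvalues = {1: 0.001, 2: 0.04, 3: 0.0002, 4: 0.03, 5: 0.01, 6: 0.02}
kept = cap_neighborhood(example_pvalues.keys(), example_pvalues, n=100, c=5)
```

In the example the two candidates with the largest p-values (nodes 2 and 4) are dropped.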


Learning moral graphs: steps (b) & (c)

(b) (Spouse nodes amendment) For each node Xi, find the spouse nodes that are not included in Ai, i.e., find the set Bi = {Xj : Xj ∉ Ai, ∃ Xk ∈ Ai ∩ Aj} for i = 1, . . . , p, where Xj is a node not connected to Xi but sharing a common neighbor with it. If the size of Bi is greater than n/(c_n2 log(n)) for a pre-specified constant c_n2, reduce it to n/(c_n2 log(n)) by removing the variables having larger p-values in the spouse test Xi ⊥P Xj | Xk.

(c) (Screening for the moral graph) Construct the moral graph based on conditional independence tests:

(i) For each ordered pair of nodes (Xi, Xj), i, j = 1, 2, . . . , p, conduct the conditional independence test Xi ⊥P Xj | Sij \ {i, j}, where Sij = Ai ∪ Bi if |Ai ∪ Bi \ {i, j}| ≤ |Aj ∪ Bj \ {i, j}| and Sij = Aj ∪ Bj otherwise.

(ii) Conduct a multiple hypothesis test to identify the pairs of nodes that are conditionally dependent, and set the adjacency matrix Emb accordingly, where Emb denotes the adjacency matrix of the moral graph.
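The set bookkeeping in steps (b) and (c)(i) can be sketched as below. This is a simplified illustration, assuming the screened sets Ai from step (a) are given as a dictionary of index sets; it omits the size cap on Bi and all statistical tests. The names `spouse_sets` and `conditioning_set` are hypothetical.

```python
def spouse_sets(A):
    """B_i = {j : j not in A_i, and A_i and A_j share at least one node}."""
    return {
        i: {j for j in A if j != i and j not in A[i] and A[i] & A[j]}
        for i in A
    }


def conditioning_set(A, B, i, j):
    """S_ij minus {i, j}: the smaller of the two candidate conditioning sets."""
    s_i = (A[i] | B[i]) - {i, j}
    s_j = (A[j] | B[j]) - {i, j}
    return s_i if len(s_i) <= len(s_j) else s_j


# Example: node 2 is a screened neighbor of nodes 0, 1 and 3.
A = {0: {2}, 1: {2}, 2: {0, 1, 3}, 3: {2}}
B = spouse_sets(A)
S_01 = conditioning_set(A, B, 0, 1)
```

Here nodes 1 and 3 are added as candidate spouses of node 0 because they share the neighbor 2 with it, and the test of X0 against X1 would condition on `{2, 3}`.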


Identifying v-Structures

▶ Given the moral graph, the v-structures contained in the Bayesian network can be identified by performing further conditional independence tests around each variable.

▶ With the identified v-structures, the Markov blankets can be resolved by deleting the spouse links and orienting the arcs in the v-structures.

▶ This step can be accomplished using existing algorithms, such as the collider set algorithm (Pellet and Elisseeff, 2008) or the local neighborhood algorithm (Margaritis and Thrun, 2000).


Identifying derived directions

Given the skeleton and colliders, a maximally directed Bayesian network can be obtained by following four necessary and sufficient rules, which ensure that no directed cycles and no additional colliders are created in the graph.

(a) Since Xi → Xj − Xk is not a valid v-structure, the edge must be directed as Xj → Xk.

(b) Given the edges Xi → Xj → Xk, directing the remaining edge from Xk to Xi would produce a directed cycle, so it must be directed as Xi → Xk.

(c) Since directing the edge as Xj → Xi would inevitably produce an additional collider Xl → Xi ← Xk or a directed cycle, it must be directed as Xi → Xj.

(d) Since directing the edge as Xj → Xi would inevitably produce an additional collider Xj → Xi ← Xl or a directed cycle, it must be directed as Xi → Xj.
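Rules (a) and (b) are simple enough to sketch in code. The following is a partial illustration, not the full procedure: it implements only rules (a) and (b), assumes that rule (a) applies when Xi and Xk are non-adjacent (otherwise no new collider would be at stake), and uses a hypothetical representation with directed arcs as (tail, head) pairs and undirected edges as two-element sets.

```python
def apply_rules_ab(directed, undirected):
    """Repeatedly apply rules (a) and (b) until no undirected edge can be oriented."""
    directed = set(directed)
    undirected = {frozenset(edge) for edge in undirected}

    def adjacent(u, v):
        return (u, v) in directed or (v, u) in directed or frozenset((u, v)) in undirected

    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            first, second = tuple(edge)
            for tail, head in ((first, second), (second, first)):
                # Rule (a): i -> tail - head with i and head non-adjacent.
                rule_a = any(
                    b == tail and a != head and not adjacent(a, head)
                    for a, b in directed
                )
                # Rule (b): tail -> m -> head already exists, so head -> tail would close a cycle.
                rule_b = any(a == tail and (b, head) in directed for a, b in directed)
                if rule_a or rule_b:
                    undirected.discard(edge)
                    directed.add((tail, head))
                    changed = True
                    break
    return directed, undirected


# Rule (a): A -> B - C with A and C non-adjacent gives B -> C.
arcs_a, left_a = apply_rules_ab({("A", "B")}, [("B", "C")])
# Rule (b): A -> B -> C with A - C gives A -> C.
arcs_b, left_b = apply_rules_ab({("A", "B"), ("B", "C")}, [("A", "C")])
```

Rules (c) and (d) would be added in the same loop as further conditions on the pair (tail, head).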


Identifying derived directions

[Four graph panels, labeled (a), (b), (c) and (d).]

Figure: Four necessary and sufficient rules for derived directions, where the indices I, J, K and L are used to represent the nodes Xi, Xj, Xk and Xl, respectively.


Consistency: Assumptions

(A) (Faithfulness) The Bayesian network is faithful, and its joint distribution can be expressed as a Gaussian-multinomial distribution (Lee and Hastie, 2013).

(B) (High dimensionality) The dimension $p_n = O(\exp(n^{\delta}))$, where $0 \le \delta < (1 - 2\kappa)\alpha/(\alpha + 2)$ for some positive constants $\kappa < 1/2$ and $\alpha > 0$, and the subscript n of $p_n$ indicates the dependence of the dimension p on the sample size n.

(C) (Sparsity) The maximum size of the Markov blanket of each node, denoted by $q_n = \max_{1 \le i \le p_n} |S_i|$, satisfies $q_n = O(n^{b})$ for some constant $0 \le b < (1 - 2\kappa)\alpha/(\alpha + 2)$, where $S_i$ denotes the Markov blanket of node i.

(D) (Identifiability) The regression coefficients satisfy

$$
\inf\{|\beta_{ij|C}| : \beta_{ij|C} \neq 0,\ i, j = 1, 2, \ldots, p_n,\ C \subseteq \{1, 2, \ldots, p_n\} \setminus \{i, j\},\ |C| \le O(n/\log(n))\} \ge c_0 n^{-\kappa},
$$

for some constant $c_0 > 0$, where κ is as defined in condition (B), and $\beta_{ij|C}$ denotes the true regression coefficient of Xj in the corresponding GLMs.


Gaussian-multinomial distribution (Lee and Hastie, 2013)

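$$
p(x, y \mid \Theta) \propto \exp\Big\{ -\frac{1}{2} \sum_{s=1}^{p_c} \sum_{t=1}^{p_c} \theta_{st} x_s x_t + \sum_{s=1}^{p_c} \vartheta_s x_s + \sum_{s=1}^{p_c} \sum_{j=1}^{p_d} \rho_{sj}(y_j) x_s + \sum_{j=1}^{p_d} \sum_{r=1}^{p_d} \psi_{rj}(y_r, y_j) \Big\}, \tag{3}
$$

where $x_s$ denotes the sth of $p_c$ continuous variables, and $y_j$ denotes the jth of $p_d$ discrete variables. The joint model is parameterized by $\Theta = [\{\theta_{st}\}, \{\vartheta_s\}, \{\rho_{sj}\}, \{\psi_{rj}\}]$.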
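To make the four terms of the exponent in (3) concrete, the sketch below evaluates the unnormalized log-density for given parameters. The data layout is an assumption chosen for illustration: `rho[s][j]` is a dictionary from a value of y_j to a coefficient, and `psi[r][j]` is a dictionary from a pair (y_r, y_j) to a potential. The function name `log_density_unnormalized` is hypothetical.

```python
def log_density_unnormalized(x, y, theta, vartheta, rho, psi):
    """Exponent of the Gaussian-multinomial density in (3), up to the normalizing constant."""
    pc, pd = len(x), len(y)
    # -1/2 * sum_s sum_t theta_st x_s x_t
    quadratic = -0.5 * sum(theta[s][t] * x[s] * x[t] for s in range(pc) for t in range(pc))
    # sum_s vartheta_s x_s
    linear = sum(vartheta[s] * x[s] for s in range(pc))
    # sum_s sum_j rho_sj(y_j) x_s
    mixed = sum(rho[s][j][y[j]] * x[s] for s in range(pc) for j in range(pd))
    # sum_j sum_r psi_rj(y_r, y_j)
    discrete = sum(psi[r][j][(y[r], y[j])] for r in range(pd) for j in range(pd))
    return quadratic + linear + mixed + discrete


# Tiny example with one continuous and one binary variable (made-up parameters).
value = log_density_unnormalized(
    x=[2.0],
    y=[1],
    theta=[[1.0]],
    vartheta=[0.5],
    rho=[[{0: 0.0, 1: 0.3}]],
    psi=[[{(0, 0): 0.0, (1, 1): 0.2}]],
)
```

For the example, the four terms are -2.0, 1.0, 0.6 and 0.2, so the exponent is about -0.2.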


Consistency: Theorems

Theorem 2. Under assumptions (A)–(D) and some regularity conditions for high-dimensional GLMs, the three-stage algorithm is consistent, i.e.,

$$
P\big(\hat{E}_{mb}^{(n)} = E_{mb}^{(n)}\big) \to 1 \quad \text{and} \quad P\big(\hat{E}_{v}^{(n)} = E_{v}^{(n)} \,\big|\, \hat{E}_{mb}^{(n)} = E_{mb}^{(n)}\big) \to 1 \quad \text{as } n \to \infty,
$$

where $E_{mb}^{(n)}$ denotes the adjacency matrix of the moral graph, $E_{v}^{(n)}$ denotes the set of v-structures, and $\hat{E}_{mb}^{(n)}$ and $\hat{E}_{v}^{(n)}$ denote the estimators of $E_{mb}^{(n)}$ and $E_{v}^{(n)}$ obtained by the proposed method, respectively.


Mixed Data for an Undirected Graph

This example consists of two types of variables, Bernoulli and Gaussian. Figure 2 shows a smaller version of the structure of the underlying graph. The data were simulated under two settings, (n, pc, pd) = (500, 100, 100) and (100, 100, 100), where n denotes the sample size, pc denotes the number of Gaussian variables, and pd denotes the number of Bernoulli variables.

Figure: A smaller version of the graph structure underlying the simulation study, where the circle nodes represent Gaussian variables, the square nodes represent Bernoulli variables, and the solid, dotted and dashed lines represent three different types of edges.


Mixed Data for an Undirected Graph

[Two plots, each with Recall on the horizontal axis and Precision on the vertical axis (both from 0.0 to 1.0), comparing the three-stage method with the pseudolikelihood estimates of the mixed graphical model.]

Figure: Precision-Recall curves produced by the three-stage and pseudo-likelihood methods for two mixed datasets: the left was generated under the setting (n, pc, pd) = (500, 100, 100) and the right under the setting (n, pc, pd) = (100, 100, 100).


Mixed Data for an Undirected Graph

Table: Average areas under the Precision-Recall curves produced by the three-stage and pseudo-likelihood methods. The number in parentheses represents the standard deviation of the areas averaged over 10 datasets.

(n, pc, pd)        three-stage              pseudo-likelihood
(500, 100, 100)    0.997 (8.22 × 10^-4)     0.943 (9.49 × 10^-7)
(100, 100, 100)    0.749 (0.012)            0.473 (1.58 × 10^-4)


Mixed Data for a Directed Graph: Data Generation

(i) Fix an order of variables;

(ii) Randomly mark half of the variables as continuous and therest as binary;

(iii) Fill the adjacency matrix E with zeros, and replace the lowertriangle (below the diagonal) of E with independentrealizations of Bernoulli random variables generated with asuccess probability s;

(iv) Generate the data according to the adjacency matrix in a sequential manner.


Mixed Data for a Directed Graph: Data Generation

▶ The variable X1, which corresponds to the first node of the Bayesian network, was generated through a Gaussian random variable Y1 ∼ N(0, 1). We set X1 = Y1 if X1 was set to be continuous, and X1 ∼ Binomial(n, 1/(1 + e^(−Y1))) otherwise.

▶ The other variables Xi, i = 2, 3, . . . , p, were then sequentially generated by setting

$$
Y_i = \sum_{k=1}^{i-1} 0.5\, E_{ik} X_k + \epsilon_i, \tag{4}
$$

$$
X_i = \begin{cases} Y_i, & \text{if } X_i \text{ is continuous}, \\ \mathrm{Binomial}\!\left(n, \dfrac{\exp(Y_i)}{1 + \exp(Y_i)}\right), & \text{if } X_i \text{ is binary}, \end{cases} \tag{5}
$$

where ϵ1, ..., ϵp are iid standard Gaussian random variables, and Eik denotes the (i, k) entry of E.
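The generation scheme in steps (i) to (iv) and equations (4) and (5) can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the function name `simulate` and its arguments are hypothetical, and each binary variable is drawn as a single Bernoulli trial with success probability exp(Yi)/(1 + exp(Yi)), which is an assumption about how the binomial draw in (5) is meant for a binary variable.

```python
import math
import random


def simulate(n, p, s, seed=0):
    """Simulate n samples of p mixed variables from a random DAG with edge probability s."""
    rng = random.Random(seed)
    # (ii) Randomly mark half of the variables as continuous, the rest as binary.
    is_continuous = [True] * (p // 2) + [False] * (p - p // 2)
    rng.shuffle(is_continuous)
    # (iii) Strictly lower-triangular adjacency matrix: E[i][k] = 1 means X_k -> X_i.
    E = [[1 if k < i and rng.random() < s else 0 for k in range(p)] for i in range(p)]
    # (iv) Generate each sample sequentially in the fixed variable order.
    data = []
    for _ in range(n):
        x = []
        for i in range(p):
            # Equation (4); for i = 0 the sum is empty, so Y is standard Gaussian.
            y = sum(0.5 * E[i][k] * x[k] for k in range(i)) + rng.gauss(0.0, 1.0)
            if is_continuous[i]:
                x.append(y)
            else:
                # Assumption: one Bernoulli trial with logistic success probability.
                success = 1.0 / (1.0 + math.exp(-y))
                x.append(1 if rng.random() < success else 0)
        data.append(x)
    return E, is_continuous, data


E, is_continuous, data = simulate(n=50, p=6, s=0.5, seed=1)
```

Because E is strictly lower triangular, the fixed variable order is a topological order of the generated DAG, so the graph is acyclic by construction.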


Mixed Data for a Directed Graph: Graph Estimation

[Two network plots, each with nodes numbered 1 to 100.]

Figure: The true directed network (left) and the one estimated by the three-stage method (right) for a dataset with n = 3000 samples.


Mixed Data for a Directed Graph: Estimation Assessment

Table: Average precision (Prec) and recall (Rec) values produced by the three-stage, PC, HC, MMHC and GS methods. The number in parentheses represents the standard deviation of the value averaged over 10 datasets.

n      three-stage        PC                 HC                 MMHC               GS
       Prec     Rec       Prec     Rec       Prec     Rec       Prec     Rec       Prec     Rec
100    0.640    0.168     0.276    0.058     0.184    0.268     0.458    0.185     0.243    0.018
       (0.023)  (0.018)   (0.034)  (0.007)   (0.010)  (0.013)   (0.018)  (0.007)   (0.040)  (0.004)
500    0.675    0.767     0.609    0.406     0.499    0.482     0.652    0.389     0.270    0.016
       (0.013)  (0.023)   (0.011)  (0.012)   (0.019)  (0.019)   (0.024)  (0.016)   (0.035)  (0.005)
1000   0.703    0.849     0.649    0.532     0.561    0.533     0.665    0.450     0.450    0.038
       (0.011)  (0.012)   (0.012)  (0.013)   (0.020)  (0.022)   (0.024)  (0.022)   (0.060)  (0.006)
3000   0.739    0.994     0.717    0.642     0.568    0.561     0.689    0.539     0.385    0.049
       (0.008)  (0.003)   (0.021)  (0.023)   (0.024)  (0.020)   (0.030)  (0.020)   (0.039)  (0.008)
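Precision and recall of an estimated edge set can be computed as in the sketch below. It assumes that an estimated edge counts as correct only if it matches a true edge exactly, including its direction; the slides do not state the matching convention used for the table, so this is an assumption, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(true_edges, estimated_edges):
    """Precision and recall of an estimated set of directed edges (tail, head)."""
    true_edges, estimated_edges = set(true_edges), set(estimated_edges)
    true_positives = len(true_edges & estimated_edges)
    precision = true_positives / len(estimated_edges) if estimated_edges else 0.0
    recall = true_positives / len(true_edges) if true_edges else 0.0
    return precision, recall


# Example: one of two estimated edges is correct; one of three true edges is found.
prec, rec = precision_recall(
    true_edges=[(1, 2), (2, 3), (3, 4)],
    estimated_edges=[(1, 2), (3, 2)],
)
```

In the example the edge (3, 2) has the wrong direction, so it counts as a false positive under the stated assumption.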


Lung cancer genetic network

▶ This study aims to learn an interactive genetic network for Lung Squamous Cell Carcinoma (LUSC), which incorporates both gene expression information (mRNA-array data) and mutation information. The dataset was downloaded from The Cancer Genome Atlas (TCGA) at https://tcga-data.nci.nih.gov/tcga/.

▶ The original mRNA data contains 17814 genes and 154 patients, and the original mutation data contains 14873 genes and 178 patients. We filtered the mRNA data by including only the 1000 genes with the most variable expression values, and filtered the mutation data by including only the 102 genes for which the mutation occurred in at least 15% of the patients.

▶ We then merged the two datasets. The merged dataset consists of 121 patients, who are common to both the mRNA and mutation data.


Lung cancer genetic network

[Network plot with nodes labeled by gene names.]

Figure: The Bayesian network produced by the three-stage method with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.


Lung cancer genetic network

[Network plot with nodes labeled by gene names.]

LY6D

LMO1

S100A12

ARSJ

GKN2

SLC5A8

CYP4B1

ABCA13

AGR2

CLCA3

PTN

PADI3

NUDT10

PTPRZ1

ARL14

FAM137A

SUNC1

PAX9

CST1

OLFM1

FLJ44379

PITX2

FA2H

SPOCK3

SH3GL3

ALDH7A1

G0S2

ADCY8

IGFL2

CHODL

IL6

C5orf23

LYPD1

KRT14

KREMEN2 MS4A1

CHP2

CDO1

KCNH8

IGLV2�14

HOXC12

CXorf57

LCE3D

IL1R2

CYP2C18 IFI44L

TMEM190

NLGN1

RPESP

FAM90A1

CNTNAP2

ABHD12B

SLC47A2

PPP2R2B

NRCAM

GBA3

CRTAC1

AKR1C3

C10orf82

CYorf15A

OCA2

CES1

HSD17B2

ITLN2

WDR69

RPS4Y2

ZNF556

LRRK2

HOXD13

KRT17C6orf159

VIT

HRASLS3

WNT5A

C4orf7

GABRG3

CALB2

MAGEA3

PRDM13

KLK12

ABCA12

CSAG1

EPHB1

CTTNBP2

COL9A1

CALML5

HOTAIR

PLCB4

PLA2G1B

MUCL1

CT45�6

S100A2

GALNT14

PF4V1

REG1A

MGC9913

KRT7

CXorf48

CA3

NDRG4

NRXN3

HOXC8

VNN2

ME1

IVL

AGTR2

ROS1

C6orf54

MAGEA4

TMPRSS11A

CTNND2

LGR6

FABP7

DSCR8

PROM2

HES5

CRYM

UCHL1

AKAP14

TSPAN8

C6orf150

COX7B2

CYP4F8TRIM43

OR7A5

ZIC1

FCER1A

ARG1

NPR3

PPP2R2C

ABHD7

TMEM20SPRR1A

SCEL

TMPRSS2

FLJ21511

HLA�DRB6

ATP6V0A4

DGKG

MB

CYP4Z1

DMBT1

NLRP2

NELL1

C2orf54

WNT10A

GBP6

ATP6V1C2

CSN1S1

GLT25D2

HHIP

WBSCR28

SPRR3

ART3

PLAC8

SALL3

SLC14A1

SULT1B1

KLRD1

DAZ4

MLL3

CSMD3

COL11A1

PKHD1

FAT3

ADAM6

CDKN2A

SYNE2

SPTA1

USH2A

OR2T4

HEATR7B2

FBN2

PKHD1L1

MUC5B

DPP10

C1orf173

CTNNA2DNAH9

BIRC6

SLITRK3MUC4

LRFN5

LOC96610

RELN

PDE4DIP

STAB2

Figure: The Bayesian network produced by MMHC with the mRNA (circlenodes) and mutation (square nodes) data measured on the same set of 121LUSC (Lung Squamous Cell Carcinoma) patients.


Lung cancer genetic network

Figure: The Bayesian network produced by HC with the mRNA (circle nodes) and mutation (square nodes) data measured on the same set of 121 LUSC (Lung Squamous Cell Carcinoma) patients.


CPU time

▶ Three-stage: 1.94 CPU hours on a 3.6 GHz desktop

▶ MMHC: 0.79 CPU hours

▶ HC: 10.89 CPU hours

▶ PC: more than 40 CPU hours, with no result produced

▶ GS: more than 40 CPU hours, with no result produced


Glioblastoma genetic network with methylation adjustment

Let $W_1, \ldots, W_q$ denote the external variables. To adjust for their effects,

▶ we can replace the p-values used in the parents-children screening step by the p-values calculated from the GLM

$$X_i \sim 1 + X_j + W_1 + W_2 + \cdots + W_q, \tag{6}$$

in testing the hypothesis $H_0: \gamma'_{ij} = 0$ versus $H_1: \gamma'_{ij} \neq 0$, where $\gamma'_{ij}$ is the regression coefficient of $X_j$.

▶ Similarly, the p-values used in the moral graph screening step can be replaced by the p-values calculated from the GLM

$$X_i \sim 1 + X_j + \sum_{k \in S_{ij} \setminus \{i,j\}} X_k + W_1 + W_2 + \cdots + W_q, \tag{7}$$

in testing the hypothesis $H_0: \beta'_{ij} = 0$ versus $H_1: \beta'_{ij} \neq 0$, where $S_{ij}$ is as defined in the moral graph learning algorithm and $\beta'_{ij}$ is the regression coefficient of $X_j$.
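The adjustment in models (6) and (7) can be illustrated with a small sketch. It assumes a Gaussian node with identity link (an ordinary linear model) and a normal-approximation Wald test; binary or multinomial nodes would need the corresponding GLM family instead. The function name `adjusted_pvalue` and the simulated data are hypothetical and are not taken from the talk.

```python
import math
import random


def adjusted_pvalue(x_i, x_j, adjust_cols):
    """Two-sided Wald p-value for the coefficient of x_j in the linear model
    x_i ~ 1 + x_j + (columns in adjust_cols), using a normal approximation."""
    n = len(x_i)
    # Design matrix rows: intercept, x_j, then the adjustment columns.
    X = [[1.0, x_j[r]] + [col[r] for col in adjust_cols] for r in range(n)]
    d = len(X[0])
    # Normal equations: A = X'X and b = X'y.
    A = [[sum(row[a] * row[c] for row in X) for c in range(d)] for a in range(d)]
    b = [sum(row[a] * y for row, y in zip(X, x_i)) for a in range(d)]
    # Invert A by Gauss-Jordan elimination with partial pivoting.
    M = [A[a] + [1.0 if a == c else 0.0 for c in range(d)] for a in range(d)]
    for c in range(d):
        pivot = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        scale = M[c][c]
        M[c] = [v / scale for v in M[c]]
        for r in range(d):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    A_inv = [row[d:] for row in M]
    beta = [sum(A_inv[a][c] * b[c] for c in range(d)) for a in range(d)]
    # Residual variance and standard error of the x_j coefficient (index 1).
    rss = sum((y - sum(v * bt for v, bt in zip(row, beta))) ** 2
              for row, y in zip(X, x_i))
    se = math.sqrt(rss / (n - d) * A_inv[1][1])
    z = beta[1] / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


# Simulated example (assumption): X_i and X_j are both driven by W_1 only.
random.seed(0)
n = 200
w = [random.gauss(0.0, 1.0) for _ in range(n)]
x_j = [wi + 0.5 * random.gauss(0.0, 1.0) for wi in w]
x_i = [wi + 0.5 * random.gauss(0.0, 1.0) for wi in w]

p_unadjusted = adjusted_pvalue(x_i, x_j, [])   # X_i ~ 1 + X_j
p_adjusted = adjusted_pvalue(x_i, x_j, [w])    # X_i ~ 1 + X_j + W_1, as in (6)
print(p_unadjusted, p_adjusted)
```

In this simulation the two variables are related only through the external variable, so the unadjusted p-value is extremely small while the adjusted one no longer reflects that spurious association. For model (7), the columns for the variables in the conditioning set would be passed in `adjust_cols` alongside the external variables.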


Glioblastoma genetic network: Dataset

The dataset we considered is for the glioblastoma (GBM) cancer, a highly malignant brain tumor for which no cure is available. The dataset consists of both methylation and gene expression data (mRNA-array data) and was downloaded from TCGA at https://tcga-data.nci.nih.gov/tcga/. We filtered the dataset by including only the 1000 genes with most variable expression values. The number of samples/patients is 281. It is known that the gene expression value can be affected by the methylation sites in the promoter region. For this reason, we annotated the methylation features of the 1000 genes according to their positions on the chromosomes.
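The filtering step can be sketched as follows. The talk does not say which variability measure was used; this sketch assumes the sample variance across patients, and the function name and toy data are hypothetical.

```python
from statistics import variance


def top_variable_genes(expression, k):
    """Return the k gene names with the largest sample variance.

    expression maps a gene name to its expression values across samples.
    """
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:k]


# Toy data (assumption): four genes measured on four samples.
toy_expression = {
    "geneA": [1.0, 1.0, 1.0, 1.0],
    "geneB": [0.0, 2.0, 0.0, 2.0],
    "geneC": [0.0, 10.0, 0.0, 10.0],
    "geneD": [1.0, 2.0, 1.0, 2.0],
}
selected = top_variable_genes(toy_expression, 2)
print(selected)  # ['geneC', 'geneB']
```

For the GBM data the same call would use k = 1000 on the expression values of the 281 patients.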


Glioblastoma genetic network: Graph Estimation

Figure: Directed Glioblastoma genetic network learned by the three-stage method with the methylation effects having been adjusted.


Discussion

▶ We have proposed a three-stage method for learning the structure of high-dimensional Bayesian networks. The method first learns the moral graph of the Bayesian network based on variable screening and multiple hypothesis tests, then resolves the local structure of the Markov blanket for each variable based on conditional independence tests, and finally identifies the derived directions for non-convergent connections based on logical rules.

▶ We justified the consistency of the three-stage method under the small-n-large-p scenario.

▶ The time complexity of the proposed algorithm is bounded by $O(p^2 2^{m-1})$. This is much better than the PC algorithm, which has a computational complexity of $O(p^{2+m})$ under the same sparsity assumption.

▶ The moral graph is a Markov network: the proposed method provides a new way to learn Markov networks for mixed data.
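To give a feel for the gap between the two complexity bounds, the sketch below evaluates both expressions, read as p^2 * 2^(m-1) for the three-stage method and p^(2+m) for the PC algorithm, at illustrative values. The values p = 1000 and m = 5 are assumptions chosen for the example. Big-O bounds hide constant factors, so the numbers indicate orders of magnitude only, not running times.

```python
def three_stage_bound(p, m):
    """Evaluate p^2 * 2^(m-1), the bound quoted for the three-stage method."""
    return p ** 2 * 2 ** (m - 1)


def pc_bound(p, m):
    """Evaluate p^(2+m), the bound quoted for the PC algorithm."""
    return p ** (2 + m)


# Illustrative values (assumption): 1000 variables, sparsity parameter m = 5.
p, m = 1000, 5
three_stage_ops = three_stage_bound(p, m)
pc_ops = pc_bound(p, m)
ratio = pc_ops // three_stage_ops

print(three_stage_ops)  # 16000000
print(pc_ops)           # 1000000000000000000000
print(ratio)            # 62500000000000
```

The first bound grows quadratically in p for fixed m, while the second grows as a polynomial of degree 2 + m, which is why the difference widens quickly as p increases.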


Extension

▶ Non-Gaussian continuous data: nonparanormal transformation

▶ Poisson or negative binomial data: random-effect model-based transformation (Jia et al., 2017)

▶ Other types of discrete data: regroup and treat as multinomial data
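As an illustration of the first extension, the sketch below applies a rank-based normal-score transformation to one continuous variable, mapping each value to the standard normal quantile of rank/(n+1). This is one common way to implement a nonparanormal-type transformation and is an assumption here; the talk does not specify which estimator (for example a Winsorized empirical CDF) is intended. The function name is hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map values to standard normal quantiles of their ranks, rank / (n + 1).

    Tied values receive the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda idx: values[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(r / (n + 1)) for r in ranks]


# Toy example (assumption): three observations of a skewed variable.
scores = normal_scores([3.0, 1.0, 2.0])
print(scores)
```

The transformation is monotone, so it preserves the ordering of the observations while making the marginal distribution approximately Gaussian; each continuous variable would be transformed separately before the network is learned.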


Acknowledgement

▶ Student: Suwa Xu

▶ NSF Grant: DMS-1612924

▶ NIH Grant: R01-GM117597