learning high-dimensional bayesian networks for mixed datafmliang/stat598purdue/bayesnet.pdflearning...
TRANSCRIPT
Learning High-Dimensional Bayesian Networks forMixed Data
Faming Liang
University of Florida
May 6, 2017
Outline
▶ Introduction to Bayesian Networks
▶ Basic Theory of Bayesian Networks▶ A Three-Stage Algorithm for Learning high-dimensional
Bayesian Networks▶ Learning the moral graph▶ Identifying v -structures▶ Identifying derived directions▶ Consistency
▶ Numerical Studies
▶ Discussion
Graphical Models
Graphical models have proven to be a useful tool to describeconditional independence relationships for a large set of randomvariables. Two types of graphical models are commonly used:
▶ Markov networks
▶ Bayesian networks
Markov Networks
▶ The Markov network, also known as the Markov random field,is a model over an undirected graph.
▶ Gaussian graphical model (GGM), as a special case of Markovnetworks, has been used in many scientific fields fromcomputer vision to natural language processing to genomics.
▶ Learning Gaussian graphical models:▶ graphical Lasso (Yuan and Lin, 2007; Friedman et al., 2008)▶ nodewise regression (Meinshausen and Buhlmann, 2006)▶ ψ-learning (Liang et al., 2015)
Bayesian Networks
▶ The Bayesian network is a model over a directed acyclicgraph. Since the directions of the edges are included in themodel, it can represent more types of conditionalindependences than the Markov network.
▶ The relationship that X and Y are marginally independentbut dependent conditioned on variable Z , is not representableby a Markov network. However, it can be easily representedby a Bayesian network using a v -structure: X⃝→ Z⃝← Y⃝. InBayesian formula, this situation can be described by
π(X ,Y |Z ) = π(Z |X ,Y )π(X )π(Y )
π(Z )= π(X |Z )π(Y |Z ).
Bayesian Networks
Table: Conditional independences represented by Markov networks.
Conditional Marginal Independenceindependence X ⊥P Y X ⊥P Y
X ⊥P Y |Z X⃝ Y⃝− Z⃝ X⃝− Z⃝− Y⃝X ⊥P Y |Z non-representable X⃝− Y⃝− Z⃝
Learning Bayesian Networks
The existing Bayesian network learning methods can be traced tothree categories:
▶ Constraint-based: It is to learn Bayesian networks byconducting a series of conditional independence tests.Examples include grow-shrink (Margaritis, 2003), incrementalassociation (Tsamardinos et al., 2003; Yaramakala andMargaritis, M2005), and PC (Spirtes et al., 2000)
▶ Score-based: It is to find a network that optimizes a selectedscoring function, e.g., entropy, minimum description lengthand Bayesian score.
▶ Hybrid Method: It is to combine constraint-based andscore-based methods to offset their respective weakness.Examples include the sparse candidate algorithm (Friedman etal., 1999) and the max-min hill-climbing (MMHC) algorithm(Tsamardinos et al., 2006).
PC Algorithm
▶ The PC algorithm is the state-of-the-art algorithm forBayesian network learning: Under the sparsity assumption,which bounds the neighborhood size of each node, the PCalgorithm has been shown to be consistent and can execute ina polynomial time of p (Kalisch and Buhlmann, 2007).
▶ The computational complexity of the PC algorithm isO(p2+m), where m is the maximum size of the Markovblanket of each node.
Joint Distribution
▶ A Bayesian network can be represented by a directed acyclicgraph (DAG) G = (V,E), where V denotes a set of p nodescorresponding to the p variables X1, . . . ,Xp, and E = (eij)denotes the adjacency matrix or arc sets. The jointdistribution of X1, . . . ,Xp is given by
P(X) =∏i
q(Xi |Pa(Xi )), (1)
where Pa(Xi ) denotes the parent nodes/variables of Xi in thenetwork, and q(·|·) specifies the conditional distribution of Xi
given its parent nodes.
▶ Local Markov property: Each node Xi is conditionallyindependent of its non-descendants (i.e., the nodes for whichthere is no path to reach from Xi ) given its parents.
v -Structure
▶ v-structure: A convergent connection Xi → Xk ← Xj is calleda v-structure if there is no arc connecting Xi and Xj . Thev -structure enables Bayesian networks to represent a type ofrelationship that Markov networks cannot, that is, Xi and Xj
are marginally independent while they are dependentconditional on Xk .
▶ Markov blanket: The Markov blanket of a node Xi is the setconsisting of the parents of Xi , the children of Xi and thespouse nodes that share a child with Xi . The Markov blanketof a node Xi ∈ V is the minimal subset of V such that Xi isindependent of all other nodes conditioned on it. The Markovblanket is symmetric, i.e., if node Xi is in the Markov blanketof Xj , then Xj is also in the Markov blanket of Xi .
v -Structure
▶ Skeleton: If the directions of all arcs in a Bayesian network areremoved, the resulting undirected graph is called the skeletonof the Bayesian network. Note that we can have Bayesiannetworks with different arc sets that encode the sameconditional independence relationships and represent the samejoint distribution.
▶ Two DAGs defined over the same set of variables areequivalent if and only if they have the same skeleton and thesame v -structure.
▶ In Bayesian networks, only the directions of the arcs that arepart of one or more v -structures are important.
Moral Graph
▶ The moral graph is an undirected graph that is constructed by(i) connecting the non-adjacent nodes in each v -structurewith an undirected arc, and (ii) ignoring the directions ofother arcs.
▶ Moralisation provides a simple way to transform a Bayesiannetwork into a Markov network.
▶ In the moral graph, the neighboring set of each node forms itsMarkov blanket.
Faithfulness▶ Let M denote the dependence structure of the probability
distribution of X, i.e., the set of conditional dependencerelationships between any triplet A, B, C of subsets of X.The graph G is said to be faithful or isomorphic to M if for alldisjoint subsets A, B, C of X, we have
A ⊥P B|C⇐⇒ A ⊥G B|C, (2)
where the left denotes the conditional independence inprobability, and the right denotes the separation in graph (i.e.,C is a separator of A and B).
▶ For a Markov network, C is said to be a separator of A and Bif for every a ∈ A and b ∈ B, all paths from a to b have atleast one node in C.
▶ For Bayesian networks, C is said to be a separator of A and Bif along every path between a node in A and a node in B thereis a node v satisfying one of the following two conditions: (i)v has converging arcs and neither v nor any of its descendantsare in C, and (ii) v is in C and does not have converging arcs.
Equivalent Test
▶ Suppose that we are able to identify a super Markov blanketSi for each node Xi such that Si ⊆ Si holds, where Si denotesthe Markov blanket of the node Xi .
▶ Let ϕij denote the output of the conditional independence testXi ⊥P Xj |Si \ {Xi ,Xj}, i.e., ϕij = 1 if the conditionalindependence holds and 0 otherwise. Let ϕij denote the outputof the conditional independence test Xi ⊥P Xj |Si \ {Xi ,Xj}
▶ Under faithfulness, ϕij and ϕij are equivalent in learning moralgraphs.
Equivalent Test
Theorem 1. Assume the faithfulness holds. Let Si denote theMarkov blanket of Xi , and let Si denote a superset of Si . Then ϕijand ϕij are equivalent in learning moral graphs in the sense that
ϕij = 1⇐⇒ ϕij = 1.
Learning moral graphs: step (a)
(a) (Screening for parents and children nodes) Find a superset ofparents and children for each node Xi :
(i) For each ordered pair of nodes (Xi ,Xj), i , j = 1, 2, . . . , p,conduct the marginal independence test Xi ⊥p Xj and obtainthe p-value.
(ii) Conduct a multiple hypothesis test to identify the pairs ofnodes that are dependent. Denote the superset by Ai fori = 1, . . . , p. If the size of Ai is greater than n/(cn1 log(n)) fora pre-specified constant cn1, reduce it to n/(cn1 log(n)) byremoving the variables having larger p-values in the marginalindependence tests.
Learning moral graphs: step (b) & (c)
(b) (Spouse nodes amendment) For each node Xi , find the spousenodes that are not included in Ai , i.e., finding the setBi = {Xj : Xj /∈ Ai , ∃Xk ∈ Ai ∩Aj} for i = 1, . . . , p, where Xj
is a node not connected but sharing a common neighbor withXi . If the size of Bi is greater than n/(cn2 log(n)) for apre-specified constant cn2, reduce it to n/(cn2 log(n)) byremoving the variables having larger p-values in the spousetest Xi ⊥p Xj |Xk .
(c) (Screening for the moral graph) Construct the moral graphbased on conditional independence tests:(i) For each ordered pair of nodes (Xi ,Xj), i , j = 1, 2, . . . , p,
conduct the conditional independence test Xi ⊥P Xj |Sij \ {i , j},where Sij = Ai ∪Bi if |Ai ∪Bi \ {i , j}| ≤ |Aj ∪Bj \ {i , j}| andSij = Aj ∪ Bj otherwise.
(ii) Conduct a multiple hypothesis test to identify the pairs ofnodes for which they are conditionally dependent, and set theadjacency matrix Emb accordingly, where Emb denotes theadjacency matrix of the moral graph.
Identifying v -Structure
▶ Given the moral graph, the v -structures contained in theBayesian network can be identified by performing furtherconditional independence tests around each variable.
▶ With the identified v -structures, the Markov blankets can beresolved by deleting the spouse links and orienting the arcs inthe v -structures.
▶ This step can be accomplished using some existing algorithms,such as the collider set algorithm (Pellet and Elisseeff, 2008)or the local neighborhood algorithm (Margaritis and Thrun,2000).
Identifying derived directions
Given the skeleton and colliders, a maximally directed Bayesiannetwork can be obtained following the four necessary and sufficientrules, which ensure that no directed cycles and additional collidersare created in the graph.
(a) Since Xi → Xj − Xk is not a valid v -structure, Xj → Xk mustbe directed.
(b) Given the edges Xi → Xj → Xk , directing from Xk to Xi willproduce a directed cycle, so Xi → Xk must be directed.
(c) Since directing the edge Xj → Xi will inevitably produce anadditional collider XL → Xi ← Xk or a directed cycle, Xi → Xj
must be directed.
(d) Since directing Xj → Xi will inevitably produce an additionalcollider Xj → Xi ← Xl or a directed cycle, Xi → Xj must bedirected.
Identifying derived directions
(a) (b)
(c) (d)
Figure: Four necessary and sufficient rules for derived directions, wherethe indices I , J, K and L are used to represent the nodes Xi , Xj , Xk andXl , respectively.
Consistency: Assumptions
(A) (Faithfulness) The Bayesian network is faithful, for which thejoint distribution can be expressed in a Gaussian-multinomialdistribution (Lee and Hastie, 2013).
(B) (High dimensionality) The dimension pn = O(exp(nδ)), where0 ≤ δ < (1− 2κ)α/(α+ 2) for some positive constantsκ < 1/2 and α > 0, and the subscript n of pn indicates thedependence of the dimension p on the sample size n.
(C) (Sparsity) The maximum size of the Markov blanket of eachnode, denoted by qn = max1≤j≤pn |Si |, satisfies qn = O(nb)for some constant 0 ≤ b < (1− 2κ)α/(α+ 2), where Si
denotes the Markov blanket of node i .
(D) (Identifiability) The regression coefficients satisfyinf{|βij |C|;βij |C = 0, i , j = 1, 2, . . . , pn, C ⊆{1, 2, . . . , pn} \ {i , j}, |C| ≤ O(n/ log(n))} ≥ c0n
−κ, for someconstant c0 > 0, where κ is as defined in condition (B), andβij |C denotes the true regression coefficient of Xj in thecorresponding GLMs.
Gaussian-multinomial distribution (Lee and Hastie, 2013)
p(x, y|Θ) ∝ exp{−1
2
pc∑s=1
pc∑t=1
θstxsxt +
pc∑s=1
ϑsxs +
pc∑s=1
pd∑j=1
ρsj(yj)xs
+
pd∑j=1
pd∑r=1
ψrj(yr , yj)},
(3)
where xs denotes the sth of pc continuous variables, and yjdenotes the jth of pd discrete variables. The joint model isparameterized by Θ = [{θst}, {ϑs}, {ρsj}, {ψrj}].
Consistency: Theorems
Theorem 2. Under the assumptions (A)–(D) and some regularityconditions for high-dimensional GLMs, the three-stage algorithm is
consistent, i.e., P(E(n)mb = E
(n)mb)→ 1 and
P(E(n)v = E
(n)v |E
(n)mb = E
(n)mb)→ 1 as n→∞, where E
(n)mb denotes
the adjacency matrix of the moral graph, E(n)v denotes the set of
v -structures, and E(n)mb and E
(n)v denote the estimators of E
(n)mb and
E(n)v obtained by the proposed method, respectively.
Mixed Data for an Undirected GraphThis example consists of two types of variables, Bernoulli andGaussian. Figure 2 shows a smaller version of the structure of theunderlying graph. The data were simulated under two settings(n, pc , pd) = (500, 100, 100) and (100, 100, 100), where n denotesthe sample size, pc denotes the number of Gaussian variables, andpd denotes the number of Bernoulli variables.
Figure: A smaller version of the graph structure underlying the simulationstudy, where the circle nodes represent Gaussian variables, the square nodesrepresent Bernoulli variables, and the solid, dotted and dashed lines representthree different types of edges.
Mixed Data for an Undirected Graph
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
Recall
Pre
cisi
on
three−stage methodpseudolikelihood estimates of mixed graphical model
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
Recall
Pre
cisi
on
three−stage methodpseudolikelihood estimates of mixed graphic model
Figure: Precision-Recall curves produced by the three-stage andpseudo-likelihood methods for two mixed datasets: the left was generatedunder the setting (n, p1, p2) = (500, 100, 100) and the right under the setting(n, p1, p2) = (100, 100, 100).
Mixed Data for an Undirected Graph
Table: Average areas under the Precision-Recall curves produced by thethree-stage and pseudo-likelihood methods. The number in parenthesesrepresents the standard deviation of the areas averaged over 10 datasets.
(n, pc , pd) three-stage pseudo-likelihood
(500, 100, 100) 0.997 (8.22× 10−4) 0.943 (9.49× 10−7)(100, 100, 100) 0.749 (0.012) 0.473 (1.58× 10−4)
Mixed Data for a Directed Graph: Data Generation
(i) Fix an order of variables;
(ii) Randomly mark half of the variables as continuous and therest as binary;
(iii) Fill the adjacency matrix E with zeros, and replace the lowertriangle (below the diagonal) of E with independentrealizations of Bernoulli random variables generated with asuccess probability s;
(iv) generate the data according to the adjacency matrix in asequential manner.
Mixed Data for a Directed Graph: Data Generation
▶ The variable X1, which corresponds to the first node of theBayesian network, was generated through a Gaussian randomvariable Y1 ∼ N(0, 1). We set X1 = Y1 if X1 was set to becontinuous, and X1 ∼ Binomial(n, 1/(1 + e−Y1)) otherwise.
▶ The other variables Xi ’s, i = 2, 3, . . . , p, were thensequentially generated by setting
Yi =i−1∑k=1
0.5EikXk + ϵi , (4)
Xi =
{Yi , if Xi is continuous,
Binomial(n, exp(Yi )1+exp(Yi )
), if Xi is binary,(5)
where ϵ1, ..., ϵp are iid standard Gaussian random variables,and Eik denotes the (i , k) entry of E.
Mixed Data for a Directed Graph: Graph Estimation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
7879
80
81
82
83
84
85
86
87
8889
90
91
92
93
94
95
96
97
98
99
100
1
2
34
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
3435 36
3738
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98 99
100
Figure: The true directed network (left) and the estimated one (right) by thethree-stage method for a dataset with n = 3000 samples.
Mixed Data for a Directed Graph: Estimation Assessment
Table: Average precision (prec) and recall (Rec) values produced by thethree-stage, PC, HC, GS and MMHC methods. The number in parenthesesrepresents the standard deviation of the value averaged over 10 datasets.
n three-stage PC HC MMHC GSPrec Rec Prec Rec Prec Rec Prec Rec Prec Rec
1000.640 0.168 0.276 0.058 0.184 0.268 0.458 0.185 0.243 0.018(0.023) (0.018) (0.034) (0.007) (0.010) (0.013) (0.018) (0.007) (0.040) (0.004)
5000.675 0.767 0.609 0.406 0.499 0.482 0.652 0.389 0.270 0.016(0.013) (0.023) (0.011) (0.012) (0.019) (0.019) (0.024) (0.016) (0.035) (0.005)
10000.703 0.849 0.649 0.532 0.561 0.533 0.665 0.450 0.450 0.038(0.011) (0.012) (0.012) (0.013) (0.020) (0.022) (0.024) (0.022) (0.060) (0.006)
30000.739 0.994 0.717 0.642 0.568 0.561 0.689 0.539 0.385 0.049(0.008) ( 0.003) (0.021) ( 0.023) (0.024) (0.020) (0.030) (0.020) (0.039) (0.008)
Lung cancer genetic network
▶ This study aims to learn an interactive genetic network forLung Squamous Cell Carcinoma (LUSC), which incorporatesboth gene expression information (mRNA-array data) andmutation information. The dataset was downloaded from TheCancer Genome Atlas (TCGA) athttps://tcga-data.nci.nih.gov/tcga/.
▶ The original mRNA data contains 17814 genes and 154patients, and the original mutation data contains 14873 genesand 178 patients. We filtered the mRNA data by includingonly 1000 genes with most variable expression values, andfiltered the mutation data by including only 102 genes forwhich the mutation occurred in at least 15% of the patients.
▶ We then merged the two data sets together. The mergeddataset consists of 121 patients, which are common to boththe mRNA and mutation data.
Lung cancer genetic network
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
KLHL13
BANK1
CXCL13
FOXE1
EMX2
ADAM23
KRT15
TMSL8
tcag7.1260
KRT34
NTRK2
MGC39715
ROPN1B
TXNRD1
LOC253970
CA8
EIF1AY
SAA4
SERPINA3
APOC1
KLRB1
DMRT1
AKR1B10
POU2AF1
UGT2B15
PAGE2B
SV2B
SERPINB2
C20orf114
SPON1
PLA2G4A
SLC26A4
EREG
CAMP
GRHL3
PTX3
C7
DYNLRB2
OASL
FLJ38723
ARMC3
EDIL3
LOC90925
DSC3
DMKN
NUDT11
FOS
RBM11
WNT2B
CXCL1
POPDC3
DAPL1
CYP39A1
PON3
RP13�36C9.6
KIAA1622
SRD5A2
CSTA
IL1A
FAM3B
ROPN1
SPP1
GLI1
FAM81B
PTGS2
ADH1C
RPS4Y1
PI3
BNC1
SCGB1A1
MYEOV
CBR3
AQP3
KRT6C
LINGO2
CCDC38
FABP6
CLDN1
GSTA5
ATP8A2
PSCA
LEPREL1
SOX2
KRT75
ADAMDEC1
FBN2
SPRR2D
SEMA3E
LRRC17
MAGEA9
NPTX2
FAM80A
ARMC4
CLCA2
DEFA3
CDH16
JARID1D
CRNN
KLHL29
KRTDAP
SLC34A2
COL10A1
GPR87
KLK11
HS3ST3A1
AQP9
C1orf61
DSCR6
BCHE
TSPAN7
SFRP2
FAM9B
IL8
SLC44A5
GPX2HRASLS
MMP1
ASPN
MMP3
OGN
KRT16
VCX3A
DDX43
HS3ST2
PAMCI
S100A7
SERPINB3
GULP1
COL12A1
CCNA1
ITGB6
ACTBL1
EBF3
DOK6
FAM30A
KRT6A
DPT
GPR65
KIAA1822L
KRTAP3�1
HOXC9
SMC1B
DNAH9
CFTR
LMO3
KLK10
INDOL1OVOS2
LCN2
F3
CDKN2A
INDO
SCGB3A2
PTGIS
KRT23
HOXD11
FPRL1
FIGF
FABP5
FGFR3
FLJ35880
EGF
ABCC2
KIT
LUZP2
LASS3
GAST
FHOD3
IL19
FLJ45803
GSTM1
OSGIN1CYP4F2
NAPSA
CTSE
DMRT2
VWC2
SAA1
ATAD4
SLN
SPRR2G
IL1B
CDK5RAP2
PHACTR3
CLEC2B
CXCL11
MGC45438
SCRG1
C4orf19
SPANXD
FMO1
HMGA2
CYP4X1
SDPR
ZNF229
KLK13
IFIT1
MGC16291LTF
DSG3
DKK1
MAGEA1
MAP1B
KIAA0319
EDN2
PPP1R3C
PKD2L1
CYP24A1
TP63
SGEF
EDN1
LGI2
NAT8L
CXCR7
FETUB
FGF12
ZNF750
RBP7
CDKN2B
MME
SOX15
SFTPA1
DPPA2
PAK7
OLFM4
CPA3
GAS2
HEY1
GPR109B
CABYR
MGC10981
LRG1
LDLRAD1
SAC
COL11A1
CGB1
HOXC13
PLUNC
GJB6
KRT13
FAT2
LRRC4
AQP4
LIPH
THBD
ABCC3
FAM83C
S100A8
KHDRBS2
C8orf47
NQO1
GPR34CCL28
ADH1A
GSTM3
MGC102966COL17A1
XG
FABP4EGLN3
PAGE5
ZMAT4
PRAME
AHNAK2
CYP4F11
TCBA1
SFRP1
SCN9A
C19orf59
UGT2B17
CEACAM6
MS4A2
CYorf15B
PENK
ANXA8L2
PNOC
SPRR2F
LOC130576
BMP2
TMPRSS11D
NTS
MMP12
FGFBP1
SLC1A6
CTA�246H3.1
HOXD10
ZNF415
SUSD4
AKR1C1
SBSN
HTRA4
VCX2
EYA2
NR0B1
CDH2
P2RY1
DMRTA1
SFRP4
C9orf18
C6orf156
STXBP5L
CLCA4
SRXN1
FLJ35773
SERPINB4
VNN1
ALDH3A1
RIBC2
PTH2R
GJB2
MICALCL
GSTA3
ABCA4
PCP4
TMEM46
SAA2
C20orf82
CLDN18
RNF128
COL6A6
C12orf54
ABCA3
GDA
UGT2B7
CD1A
GSTA2
TOX3
MYBPC1
UNQ9433
NRG4
NOS2A
GLYATL2
MUC15
POTE15
TSPAN19APOBEC3A
NPY
SOST
PAGE2
IL1F9
CBLN2GCLC
SCTR
CCL11
HOXD8
ANXA10
IRX4
RNF152
KIAA1257
PLAC1
GABRA5
MAGEA12
SAGE1
LOC253012
CSAG3A
S100A12
ABCA13
AGR2
CLCA3
PADI3
NUDT10
ARL14
FA2H
SH3GL3
G0S2
ADCY8
IGFL2
IL6
C5orf23
LYPD1
KRT14
MS4A1
CHP2
IGLV2�14
CXorf57
LCE3D
CYP2C18
IFI44L
SLC47A2
PPP2R2B
NRCAM
GBA3
AKR1C3
CYorf15A
CES1
RPS4Y2
ZNF556
KRT17
HRASLS3
WNT5A
C4orf7
GABRG3
CALB2
MAGEA3
KLK12
CSAG1
PLA2G1B
MUCL1
CT45�6
MGC9913
CA3
NDRG4
HOXC8
ME1
IVL
AGTR2
MAGEA4
TMPRSS11A
EHF
CTNND2
LGR6
PROM2
CRYM
CYP4F8
FCER1A
NPR3
ABHD7
TMEM20
FLJ21511
MB
CYP4Z1
C2orf54
GBP6
ATP6V1C2
CSN1S1
GLT25D2
HHIPWBSCR28
SPRR3
ART3
PLAC8
SLC14A1
TP53
CDH18
TNR
AHNAK2
NAV3
ANK2
USH2A
OR2T4
CDH12
SPHKAPBIRC6
LOC96610
Figure: The Bayesian network produced by the three-stage method with themRNA (circle nodes) and mutation (square nodes) data measured on the sameset of 121 LUSC ((Lung Squamous Cell Carcinoma) patients.
Lung cancer genetic network
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●
● ●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
KLHL13
SALL1
BANK1
CXCL13
BMP3
CPNE4
FOXE1
PGBD5
FNDC1
EMX2
ADAM23CHST6
TMEM125
KRT15
RDHE2
UPK1B
UGT1A8
MS4A8B
TMSL8tcag7.1260
DDIT4L
KRT34
C1orf125
GAL
FAM83A
KIAA1045
NLGN4X
ALOX12
TCEAL2
NTRK2
hCG_1990170
KLK7
C20orf85
KRT4
PCDH20
MGC39715
ROPN1B
COMPTXNRD1
BRDT
LOC253970
SRPX
COL4A6
CAPSL
CA8
SFTPC
EIF1AY
SAA4
SOSTDC1
C4BPA
KCNMB4
SERPINA3
KCNJ16
APOC1
KLRB1
IGFL1
DMRT1
MMP13
RTP3
CLGN
AKR1B10
GALNT13
POU2AF1
UGT2B15
VSNL1
MAG1
PAGE2B
SV2B
CRYBA2
EYA1
SERPINB2
C20orf114
HPR
C18orf2
SPON1
PLA2G4A
7A5 AMPD1
SLC26A4
EREG
CAMP
GRHL3
PTX3
MUC4
C7
DYNLRB2
SLITRK6
OASL
RNF165
FLJ38723
ARMC3
ZDHHC11
EDIL3
LOC90925
DSC3
HTR2B
DMKN
NUDT11
FOS
RBM11
FBXL21
WNT2B
CXCL1BEX1
SLC26A9
TMPRSS11F
FAM133A
FGFBP2
IL33
SEC14L4
KCNMB2
KRT5
PPP1R14C
POPDC3
TM4SF19
SDCBP2
GATM
GAD1
VAV3
DAPL1
CBR1
CYP39A1
GLULD1
PON3
GHR
RP13�36C9.6
KRTAP19�1
KIAA1622
KLRG2
PRSS21
SLC10A4
PKIB
SRD5A2
CSTA
IL1A
FAM3B
ROPN1
LTB4DH
SPP1
GLI1
SFTPD
FAM81B
PTGS2
BMP7
NEFL
ADH1C
SLCO1B1
PGLYRP4
RPS4Y1
C21orf81
PI3
BNC1
PCDHB6
PAGE1
D4S234EZNF365
SCGB1A1
HSD17B3
MYEOV
CBR3
AQP3
KRT6C
LINGO2
CCDC38
SEMA6D
TTC29
WFDC2
FABP6
GABRB3
C12orf56
CLDN1
GSTA5
ATP8A2
PSCA
LEPREL1
ZFP42
SOX2
KRT75
TMPRSS11B
ODZ2
S100P
SLC6A4
ZYG11A
DEFB1
TCHH
ADAMDEC1
TMEM45A
FBN2
SPRR2D
SEMA3E
LRRC17
C10orf81
MAGEA9
NPTX2
DEFB103A
EDN3
S100A14
AMY2A
FAM80A
ARMC4
CLCA2
RHCG
IFNE1
CPA6
DEFA3
PLA2G3 CDH16
FOXA1
JARID1D
CA2PCDH8
CRNN
ECAT8
KLHL29
KRTDAP
WISP3
SCNN1A
CXCL6
SLC34A2
EYA4
COL10A1
PVRL3
GPR87
KLK11
DNER
FMN2
BEX5
SPRR1B
CRCT1
LOC441376NAP1L2 HS3ST3A1
CPXM2
RP11�35N6.1
AQP9
LRRN1
CLC
C1orf61
DSCR6
BCHE
BCL2
KIAA1324
TSPAN7
LONRF2
LOC440356
SFRP2
GPR158
OMD
FAM9B
IL8
SERPINB13
LYPD3
IGFBP2
P53AIP1
WIF1
SLC44A5
MUC20
GPX2
AREG
NRXN1
COCH
HRASLS
MMP1
CA12
WNT16
MAGEA10
ASPN
DACT2
MMP3OGN
KRT16VCX3AAMDHD1
C6orf142
DDX43
GAS1
FAM112B
SOHLH1
HS3ST2
GPC3
PAMCI
S100A7
NXF5
SERPINB3
PCOLCE2
C3orf55
HORMAD1
GULP1
CH25H
COL12A1
CCNA1
KLK6
ITLN1
CALB1
C4orf31
MUC16
JAKMIP2
C8orf4
PTHLH
FGB
POU4F1
ATP13A4
RASEF
ZFHX4
CDH18
CHL1
PAGE4
ITGB6
ACTBL1
MATN3
EBF3
TFF2
SMPX
C1orf87
DOK6
FAM30A
GOLSYN
GJB7
KRT6A
C6orf117
GLYATL1
DPT
ACTL8
GPR65
KIAA1822L
IL12RB2
KRTAP3�1
HOXC9
SMC1B
LEMD1
OXGR1
ALDH1A1
DNAH9
IRX2
TMEM22
CFTR
RTP4
LMO3
C9orf24
SFTPA1B
KLK10
INDOL1
OVOS2
LCN2
F3
CDKN2A
C4orf26
INDO
SCGB3A2
GSTT1
PTGIS
KRT23
HAPLN1
TNIP3
GOLT1A
HOXD11
PCSK1
KRT1
FLJ22655
FPRL1
MSMB
IL1F5
FIGF
FABP5
PNLDC1
FGFR3
FLJ35880
VTCN1
HIST1H1A
MAGEC1
EGF
ABCC2
KIT
DKK4
ACE2 LUZP2
LASS3
GAST
FZD10
FHOD3
IL19
LECT1
CRABP2
FLJ45803
GSTM1
TSKS
POSTN
OSGIN1
CYP4F2
GPR128
WNT2
NAPSA
MGAT4C
CTSE
FLJ46266
DMRT2
VWC2
C20orf103
SAA1ATAD4
GLI2
SLN
RGS17
SPRR2G
IL1B
ATP8B3
PCDHB5
tcag7.1136
C2orf39
MMP10
AMIGO2
RPL39L
ADRA2C
PKP1
CDK5RAP2
PHACTR3
CLEC2B
MAEL
COL21A1
HOXB13
RAPGEFL1
CXCL11
MGC45438
SCRG1
DLX6
IL13RA2
C4orf19
ODAM
GSTO2
SLC16A9
CFHR4
TMEPAI
CDH26
SPANXDCCK
FMO1
SLC35F3
SCGB2A1
SLC6A14
HMGA2
LOC285141
TMEM45B
SESN3
CYP4X1
SDPR
ZNF229
KLK13
IFIT1
PRB1
MGC16291
CXCL9
TPO
LTF
HOXA4
CHIA
CEACAM7
DSG3
DKK1
MAGEA1
MAT1A
MAP1B
KIAA0319
ZBED2
SLC5A1
EDN2
C3orf41
PPP1R3C
SOX11
PKD2L1 CP
CYP24A1
FXYD3
TP63 WDR66
SGEF
BBOX1 EDN1
PTPRR
CAPS
LGI2
NAT8L
C6orf15
SYT13
PPBP
SLITRK4
TFAP2B
C6orf105
CXCR7
FAM79B
KCNK10
FSTL5
CYP3A5
FETUB
IL23A
LOC399947
WDR16
FGF12
FMO9P
CLDN10
ZNF750
RBP7
DSC1
HS3ST5
CDKN2B
MME
SOX15
SFTPA1
DPPA2
DEFB4
SERPINB11
LGALS2
MAGEA8F2RL2
ATP2C2
PNPLA3
TUBB2B
PAK7
OLFM4
TSPY2
SCG2
FMO3
DDX3Y
CPA3
BARX1
GAS2
HEY1
ODC1
GPR109B
TSLP
CABYR
MGC10981
LPL
MAGEA11LRG1
CDH19LDLRAD1
TP53TG3
IQCA
SAC
C11orf41
COL11A1
C20orf174
ICEBERG
CGB1
ST6GALNAC1
MYB
OR6X1
MPPED2
HOXC13
GPNMB
C16orf73
SASP
PLUNC
GJB6KRT13
ATP4A
NEFM
FAT2
LRRC4
AQP4
LIPH
SELETHBD
FLJ39822
BTBD16
KRT24
ABCC3
NLRP7
FAM83C
CRLF1
HSPA1A
UGT1A6
UBD
S100A8
KHDRBS2
TMEM100SPINK2
SPESP1
PDPN
CAPN13
C8orf47
TCN1
INA
COL4A4
NQO1
IFNG
FOXL2
GPR34
KLK5
CCL28
LDHC
ADH1A
GSTM3
MGC102966
COL17A1
XG
MSLN
ELF5
FABP4
CLDN8
EGLN3
PAGE5
AIM2
ZMAT4
KRT6B
IL20RB
AHNAK2
TFPI2
CYP4F11
LHX2
TCBA1
SFRP1
TNNT1
C19orf59
MUM1L1
CYP26A1
PEBP4
UGT2B17
CEACAM6
MS4A2
HS3ST1
SLC6A15
CYorf15B
DMRT3
PPP1R9A
BARX2
PENK
ANXA8L2
OPRK1
PNOC
RPS6KA6
SPRR2F
LOC130576
CYP2C9
RUNX1T1
BMP2
ROPN1L
GAP43
RASGEF1A
TMPRSS11D SLC7A11
NTS
CHIT1
MIA
PCSK2
MMP12
FGFBP1
CHGB
HAS3
SLC1A6
CTA�246H3.1
PLAT
HOXD10
FAM132A
STAR
TBX18
ZNF415
SUSD4
FLRT3
LOX
AKR1C1
LYZ
SP8
KCND2
SBSN
MLLT11
TMTC1
C2orf40
HTRA4
VCX2
MAGEB2
EYA2
SPINK1
CILP
AGR3
AADACL2
SLCO1B3
WFDC5
PROK2
DACH1
OSTalpha
CREG2
BDNF
HSPB3 SCIN
LOC124220
NR0B1
CDH2
SERPINB7
STEAP4
P2RY1
ADH7
FHL2
PCDHB2
RAB6B
STK32A
SLPI
DMRTA1
SFRP4
GLI3
C9orf18
MAGEC2
C6orf156
TGM1
SULT1E1
HIST1H1B
ARL9
AFF2
STXBP5L
CLCA4
SRXN1
FLJ35773
STXBP6
CHST9
STRA8
SERPINB4
VNN1
ALDH3A1
RIBC2
PTH2R
GJB2MICALCL
GSTA3
NPPC
SSTR1ABCA4
RP13�102H20.1
SPINK6
ATP12A
PIP
PCP4
TMEM46
SAA2
GPR103
C20orf82
OGDHL
CLDN18
RNF128
CD274
COL6A6
RNF183
C9orf47
C12orf54
ABCA3
XK
GDA
UGT2B7
C1QTNF3
CD1A
KLK8
GSTA2
TOX3
MYBPC1
EN1
UNQ9433POF1B
TKTL1
EPGNNRG4
NOS2A
GLYATL2
MUC15
RNF182
SOX9
POTE15
RGS20
CA9
TSPAN19
APOBEC3A
NPY
SOST
ECHDC3
PAGE2
CTCFL
IL1F9
NELL2
POU3F2
GNGT1CBLN2
GCLC
GPR110
AMTN
PSMAL
ZNF695
SCTR
CCL11
HOXD8
ZNF114
NRN1
ANXA10
IRX4
CEACAM5
FUT3
ERP27
RNF152
KIAA1257
PLAC1
C19orf46
PHEX
FGF13
GABRA5
GRP
MAGEA12
SAGE1
LOC253012
CSAG3A
LY6D
LMO1
S100A12
ARSJ
GKN2
SLC5A8
CYP4B1
ABCA13
AGR2
CLCA3
PTN
PADI3
NUDT10
PTPRZ1
ARL14
FAM137A
SUNC1
PAX9
CST1
OLFM1
FLJ44379
PITX2
FA2H
SPOCK3
SH3GL3
ALDH7A1
G0S2
ADCY8
IGFL2
CHODL
IL6
C5orf23
LYPD1
KRT14
KREMEN2 MS4A1
CHP2
CDO1
KCNH8
IGLV2�14
HOXC12
CXorf57
LCE3D
IL1R2
CYP2C18 IFI44L
TMEM190
NLGN1
RPESP
FAM90A1
CNTNAP2
ABHD12B
SLC47A2
PPP2R2B
NRCAM
GBA3
CRTAC1
AKR1C3
C10orf82
CYorf15A
OCA2
CES1
HSD17B2
ITLN2
WDR69
RPS4Y2
ZNF556
LRRK2
HOXD13
KRT17C6orf159
VIT
HRASLS3
WNT5A
C4orf7
GABRG3
CALB2
MAGEA3
PRDM13
KLK12
ABCA12
CSAG1
EPHB1
CTTNBP2
COL9A1
CALML5
HOTAIR
PLCB4
PLA2G1B
MUCL1
CT45�6
S100A2
GALNT14
PF4V1
REG1A
MGC9913
KRT7
CXorf48
CA3
NDRG4
NRXN3
HOXC8
VNN2
ME1
IVL
AGTR2
ROS1
C6orf54
MAGEA4
TMPRSS11A
CTNND2
LGR6
FABP7
DSCR8
PROM2
HES5
CRYM
UCHL1
AKAP14
TSPAN8
C6orf150
COX7B2
CYP4F8TRIM43
OR7A5
ZIC1
FCER1A
ARG1
NPR3
PPP2R2C
ABHD7
TMEM20SPRR1A
SCEL
TMPRSS2
FLJ21511
HLA�DRB6
ATP6V0A4
DGKG
MB
CYP4Z1
DMBT1
NLRP2
NELL1
C2orf54
WNT10A
GBP6
ATP6V1C2
CSN1S1
GLT25D2
HHIP
WBSCR28
SPRR3
ART3
PLAC8
SALL3
SLC14A1
SULT1B1
KLRD1
DAZ4
MLL3
CSMD3
COL11A1
PKHD1
FAT3
ADAM6
CDKN2A
SYNE2
SPTA1
USH2A
OR2T4
HEATR7B2
FBN2
PKHD1L1
MUC5B
DPP10
C1orf173
CTNNA2DNAH9
BIRC6
SLITRK3MUC4
LRFN5
LOC96610
RELN
PDE4DIP
STAB2
Figure: The Bayesian network produced by MMHC with the mRNA (circlenodes) and mutation (square nodes) data measured on the same set of 121LUSC (Lung Squamous Cell Carcinoma) patients.
Lung cancer genetic network
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●● ●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
● ●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
KLHL13
SALL1
BANK1
CXCL13
LOC152573
BMP3
CPNE4
FOXE1
NEFH
PGBD5
TNNT3
FNDC1
EMX2
ADAM23
CHST6
TMEM125
KRT15
RDHE2
FST
UPK1B
UGT1A8
MS4A8B
TMSL8
tcag7.1260
DDIT4L
KRT34
C1orf125
GAL
FAM83A
KIAA1045
NLGN4X
ALOX12
TCEAL2
NTRK2
hCG_1990170
KLK7
C20orf85
KRT4
PCDH20
MGC39715
ROPN1B
GABRP
COMP
TXNRD1
BRDT
LOC253970
SRPX
COL4A6
CAPSLCA8
SFTPC
EIF1AY
SAA4
SOSTDC1
C4BPA
KCNMB4
SERPINA3
KCNJ16
APOC1
KLRB1
IGFL1
DMRT1
MMP13
RTP3
CLGNAKR1B10
GALNT13
POU2AF1
UGT2B15
VSNL1
MAG1
PAGE2B
SV2B
CRYBA2
EYA1
SERPINB2
C20orf114
HPR
C18orf2
SPON1
PLA2G4A
7A5
AMPD1
SLC26A4
EREG
CAMP
GRHL3
PTX3
C1orf110
MUC4
C7
DYNLRB2
SLITRK6
OASL
RNF165
FLJ38723
ARMC3
ZDHHC11
EDIL3
LOC90925
DSC3
HTR2B
DMKN
NUDT11
FOS
RBM11
FBXL21
WNT2BCXCL1
BEX1
SLC26A9
TMPRSS11F
C3orf57
FAM133A
FGFBP2
IL33
SEC14L4
KCNMB2
KRT5
PPP1R14C
POPDC3 TM4SF19
SDCBP2
GATM
GAD1
VAV3
DAPL1
CBR1
CYP39A1
GLULD1
PON3
GHR
RP13�36C9.6
KRTAP19�1
KIAA1622
KLRG2
PRSS21
SLC10A4
PKIB
SRD5A2
CSTA
IL1A
FAM3B
ROPN1
LTB4DH
SPP1
GLI1
SFTPD
FAM81B
PTGS2
BMP7
NEFL
ADH1C
SLCO1B1
PGLYRP4
RPS4Y1
C21orf81
PI3
BNC1
PCDHB6
PAGE1
D4S234E
ZNF365
SCGB1A1
HSD17B3
MYEOV
CBR3
AQP3
KRT6C
LINGO2
CCDC38
SEMA6D
TTC29
WFDC2
FABP6 GABRB3
GCNT3
C12orf56
CLDN1
GSTA5
ATP8A2
PSCA
LEPREL1
ZFP42
SOX2
KRT75
TMPRSS11B
ODZ2
S100P
SLC6A4ZYG11A
DEFB1
TCHHADAMDEC1
TMEM45A
FBN2
SPRR2D
SEMA3E
LRRC17
C10orf81
CTAG2
MAGEA9
NPTX2DEFB103A
EDN3
S100A14
AMY2A
FAM80A
ARMC4
CLCA2
RHCG
IFNE1
CPA6
DEFA3PLA2G3
CDH16
FOXA1
JARID1DCA2
PCDH8
CRNN
DDC
PROM1
ECAT8
KLHL29
KRTDAP
WISP3
SCNN1A
CXCL6
SLC34A2
EYA4
COL10A1
PVRL3
GPR87
KLK11
DNER
FMN2
BEX5
SPRR1B
CRCT1
LOC441376NAP1L2
HS3ST3A1
CPXM2
RP11�35N6.1
AQP9
LRRN1
CLC
C1orf61
DSCR6
BCHE
BCL2
KIAA1324TSPAN7
LONRF2
LOC440356
SPINK5
SFRP2
GPR158
OMD
FAM9BIL8
SERPINB13
LYPD3
IGFBP2
P53AIP1WIF1
SLC44A5
MUC20
GPX2 AREG
NRXN1
COCH
HRASLS
MMP1
CA12
WNT16
MAGEA10ASPN
DACT2
MMP3
OGN
KRT16
VCX3AAMDHD1
C6orf142
DDX43
GAS1
FAM112B
SOHLH1
HS3ST2
GPC3
PAMCI
S100A7
NXF5
SERPINB3
STMN2
PCOLCE2
C3orf55
HORMAD1
GULP1
CH25H
COL12A1
CCNA1 CXCL14
KLK6
ITLN1
CALB1C4orf31
MUC16
JAKMIP2
C8orf4
PTHLH
GALNT5
FGB
POU4F1ATP13A4
RASEF
ZFHX4
CDH18
CHL1
PAGE4ITGB6
ACTBL1
MATN3
EBF3
TFF2
SMPX
C1orf87
DOK6
FAM30A
GOLSYN
GJB7
KRT6A
C6orf117
GLYATL1
DPT
ACTL8
GPR65
KIAA1822L
IL12RB2
KRTAP3�1
HOXC9
SMC1B
LEMD1
OXGR1
ALDH1A1
DNAH9
IRX2
TMEM22
CFTR
RTP4
LMO3
C9orf24
SFTPA1BKLK10
INDOL1
OVOS2 LCN2F3
CDKN2A
C4orf26
INDO
SCGB3A2
GSTT1
PTGIS
KRT23
HAPLN1
TNIP3
GOLT1A
HOXD11
PCSK1
KRT1
FLJ22655
FPRL1 MSMB
IL1F5
FIGF
FABP5
PNLDC1
ZNF334
FGFR3
FLJ35880
VTCN1
HIST1H1A
MAGEC1
EGF
ABCC2
KIT
DKK4
ACE2
LUZP2
LASS3
GAST
FZD10
FHOD3
IL19
LECT1
CRABP2
FLJ45803
GSTM1
TSKS
POSTN
OSGIN1
CYP4F2
GPR128
WNT2
NAPSA
MGAT4C
CTSE
FLJ46266
DMRT2
VWC2
C20orf103
SAA1
ATAD4
GLI2
SLN
NMU
RGS17
SPRR2G
IL1B
ATP8B3
PCDHB5
tcag7.1136
C2orf39
MMP10
AMIGO2 RPL39L
ADRA2C
PKP1
CDK5RAP2
PHACTR3
AADAC
CLEC2B
MAEL
COL21A1
HOXB13
RAPGEFL1
CXCL11
MGC45438
SCRG1
LOC644186
DLX6
IL13RA2
C4orf19
ODAM
GSTO2
SLC16A9
CFHR4
TMEPAI
CCL20
CDH26SPANXD
CCK
FMO1
SLC35F3
SCGB2A1
SLC6A14
HMGA2
LOC285141
TMEM45B
SESN3
KRT27
CYP4X1
SDPR
ZNF229
KLK13
IFIT1
PRB1
MGC16291
CXCL9
TPO
LTF
HOXA4
CHIA
CEACAM7
DSG3
DKK1
MAGEA1MAT1A
MAP1B
KIAA0319
ZBED2
SLC5A1
EDN2
C3orf41
PPP1R3C
SOX11
PKD2L1
CP
CYP24A1
FXYD3
TP63
WDR66
SGEF
BBOX1
EDN1
PTPRR
CAPS
LGI2
NAT8L
C6orf15
SYT13
LRAP
PPBP
SLITRK4
TFAP2B
C6orf105
CXCR7
FAM79BKCNK10
FSTL5
CYP3A5
FETUB
IL23A
LOC399947
WDR16
FGF12
FMO9P
CLDN10
ZNF750
RBP7
DSC1
HS3ST5CDKN2B MME
SOX15
SFTPA1
DPPA2
DEFB4SERPINB11
LGALS2
MAGEA8
F2RL2
ATP2C2
PNPLA3
TUBB2B
PAK7
OLFM4
TSPY2
SCG2
FMO3
DDX3Y
CPA3
BARX1
GAS2
HEY1
ODC1
GPR109B
TSLP
CABYR
MGC10981
LPL
MAGEA11
LRG1
CDH19
LDLRAD1
TP53TG3
IQCA
SAC
C11orf41
COL11A1
C20orf174
ICEBERG
CGB1
ST6GALNAC1
MYB
OR6X1
MPPED2
HOXC13GPNMB
C16orf73
SASP
PLUNC
GJB6
KRT13
ATP4A
NEFM
FAT2
LRRC4
AQP4 LIPH
SELE
THBD
FLJ39822
BTBD16
KRT24
ABCC3
NLRP7
FAM83C
CRLF1
HSPA1A
UGT1A6
UBD
S100A8
KHDRBS2
TMEM100
SPINK2
SPESP1
PDPN
CAPN13
C8orf47
TCN1
INA
COL4A4
NQO1
IFNG
FOXL2
GPR34
KLK5
CCL28
LDHC
ADH1A
GSTM3
MGC102966
COL17A1
XG
MSLN
ELF5
FABP4
CLDN8
EGLN3
PAGE5
AIM2ZMAT4
KRT6B
IL20RB
PRAME
FAM70A
AHNAK2
TFPI2
CYP4F11
LHX2
TCBA1
SFRP1
SCN9A
TNNT1
C19orf59
MUM1L1
CYP26A1
PEBP4
UGT2B17
CEACAM6MS4A2
HS3ST1
SLC6A15
CYorf15BDMRT3
PPP1R9A
BARX2
PENK
ANXA8L2
OPRK1
PNOC
RPS6KA6
SPRR2F
LOC130576
CYP2C9
RUNX1T1
BMP2
ROPN1L
TSPYL5
GAP43
RASGEF1ATMPRSS11D
SLC7A11
NTS
CHIT1
MIA
PCSK2
MMP12
FGFBP1
CHGB
HAS3
SLC1A6
CTA�246H3.1
CALCB
PLAT
HOXD10
FAM132A
STAR TBX18
ZNF415
SUSD4
FLRT3LOX
AKR1C1
LYZ
SP8
KCND2
SBSN
MLLT11
TMTC1
C2orf40
HTRA4
VCX2
THNSL2
MAGEB2
EYA2
SPINK1
CILP
AGR3
AADACL2 SLCO1B3
WFDC5
PROK2
DACH1
OSTalpha
CREG2
BDNF
HSPB3
SCIN
LOC124220
NR0B1
CDH2
SERPINB7
NDUFA4L2
STEAP4
P2RY1
ADH7
FHL2
PCDHB2
RAB6B
STK32A
SLPI
DMRTA1
SFRP4
GLI3C9orf18
MAGEC2
C6orf156TGM1
SULT1E1
HIST1H1B
ARL9
AFF2
STXBP5L
CLCA4
SRXN1
FLJ35773
STXBP6
CHST9
STRA8
SERPINB4
VNN1
ALDH3A1
RIBC2
PTH2R GJB2
MICALCL
GSTA3
NPPCSSTR1
ABCA4
RP13�102H20.1
SPINK6
ATP12A
PIP
CCND2
PCP4
TMEM46
SAA2
GPR103
C20orf82
OGDHL
CLDN18
RNF128
CD274
COL6A6
RNF183
C9orf47
C12orf54
ABCA3
XK
GDA
UGT2B7
C1QTNF3
CD1A
KLK8
GSTA2
TOX3
MYBPC1
EN1
UNQ9433
POF1B
TKTL1
EPGN
NRG4
NOS2A
GLYATL2
MUC15
RNF182SOX9
POTE15
RGS20
CA9
TSPAN19
APOBEC3A
NPY
SOST
ECHDC3
PAGE2
CTCFL
IL1F9
NELL2
POU3F2
GNGT1
CBLN2
GCLC
GPR110
AMTN
PSMAL
ZNF695
SCTR
CCL11
HOXD8
ZNF114
NRN1
ANXA10
IRX4
CEACAM5
FUT3
ERP27
RNF152
KIAA1257
PLAC1
C19orf46
PHEX
FGF13GABRA5
GRP
MAGEA12
SAGE1
LOC253012CSAG3A
LY6D
LMO1
S100A12
ARSJ
GKN2
SLC5A8
CYP4B1
ABCA13
AGR2
CLCA3
PTN
PADI3NUDT10
PTPRZ1
ARL14
FAM137A
SUNC1PAX9
CST1
CCL7
OLFM1
FLJ44379
PITX2
FA2H
SPOCK3
SH3GL3
ALDH7A1
G0S2
ADCY8
IGFL2
CHODL
IL6
C5orf23
LYPD1
KRT14
KREMEN2
MS4A1
CHP2
CDO1
KCNH8
TRIM49
IGLV2�14
HOXC12
CXorf57
LCE3D
IL1R2
CYP2C18
IFI44L
TMEM190
NLGN1
RPESP
FAM90A1
CNTNAP2
ABHD12B
SLC47A2
PPP2R2B
NRCAMGBA3
CRTAC1
AKR1C3
C10orf82 CYorf15A
OCA2
CES1
HSD17B2 ITLN2
WDR69
RPS4Y2
ZNF556
LRRK2
HOXD13
KRT17
C6orf159
VIT
HRASLS3
WNT5A
C4orf7
GABRG3
CALB2
MAGEA3
PRDM13
KLK12
ABCA12
CSAG1
EPHB1
CTTNBP2
COL9A1
CALML5
HOTAIR
PLCB4
PLA2G1B
MUCL1
CT45�6S100A2
GALNT14
PF4V1
REG1A
MGC9913
KRT7
CXorf48
CA3
NDRG4
NRXN3
HOXC8
VNN2
ME1
IVL
AGTR2
ROS1
C6orf54
MAGEA4
TMPRSS11A
EHF
CTNND2
LGR6
FABP7 DSCR8
PROM2
HES5
CRYM
UCHL1
AKAP14
TSPAN8
C6orf150
COX7B2
CYP4F8
TRIM43
OR7A5
ZIC1
FCER1A
ARG1
NPR3
PPP2R2C
ABHD7
TMEM20
SPRR1A
SCEL
TMPRSS2
FLJ21511
HLA�DRB6
ATP6V0A4
DGKG
MB
CYP4Z1
DMBT1
NLRP2
NELL1
C2orf54
WNT10A
GBP6
ATP6V1C2
CSN1S1
GLT25D2
HHIP
WBSCR28
SPRR3
ART3
PLAC8
SALL3
SLC14A1
SULT1B1
TEX101
KLRD1
DAZ4
FLG
LRP2
TTN
DNAH5
MLL3
MLL2TP53
MACF1
APOB
CNTNAP5
CDH18
SYNE1
ABCA13
PXDNL
CSMD3
MYCBP2
PEG3
DMD
CSMD2
LRRC7
COL11A1
TNN
TNR
RYR2
LRP1B
SCN1A
CDH10
HCN1
DNAH8
PKHD1
ZFHX4
FAM135B
FAT3
AHNAK2ADAM6
HERC2
MUC16
RYR1
TPTE
XIST
PCDH11X
NRXN1
ALMS1
DNAH7
MDN1
CNGB3CDKN2A
NAV3
SYNE2
ZNF99
NEB
ANK2
FAT4
PCLO
ZNF804B
SPTA1
PAPPA2
USH2A
OR2T4
XIRP2
SI
CDH12
HEATR7B2GPR98
FBN2
DNAH11
PLXNA4
PKHD1L1
COL22A1
CUBN
PCDH15
MUC5B
DNAH10
TMEM132D
RYR3
MYH2
ZNF208
ZNF536
DPP10RP1
COL6A6
MUC17
CSMD1
ANKRD30A
C1orf173
CTNNA2
CNTNAP2
FMN2
CPS1
SPHKAP
DNAH9
BIRC6
SLITRK3
FAM5C
ODZ1
MUC4
ADAMTS12
LRFN5
LOC96610
RELN
PDE4DIP
STAB2
Figure: The Bayesian network produced by HC with the mRNA (circle nodes)and mutation (square nodes) data measured on the same set of 121 LUSC(Lung Squamous Cell Carcinoma) patients.
CPU time
▶ Three-stage: 1.94 CPU hours on a 3.6GHz desktop
▶ MMHC: 0.79 CPU hours
▶ HC: 10.89 CPU hours
▶ PC: > 40 CPU hours without results produced
▶ GS: > 40 CPU hours without results produced
Glioblastoma genetic network with methylation adjustmentLet W1, . . . ,Wq denote the external variables. To adjust theireffects,
▶ we can replace the p-values used in the parents-childrenscreening step by the p-values calculated from the GLM
Xi ∼ 1 + Xj +W1 +W2 + . . .+Wq, (6)
in testing the hypothesis H0 : γ′ij = 0 versus H1 : γ
′ij = 0,
where βij is the regression coefficient of Xj .
▶ Similarly, the p-values used in the moral graph screening stepcan be replaced by the p-values calculated from the GLM
Xi ∼ 1 + Xj +∑
k∈Sij\{i ,j}
Xk +W1 +W2 + . . .+Wq, (7)
in testing the hypothesis H0 : β′ij = 0 versus H1 : β
′ij = 0,
where Sij is as defined in the moral graph learning algorithmand β′ij is the regression coefficient of Xj .
Glioblastoma genetic network: Dataset
The dataset we considered is for the glioblastoma (GBM) cancer, ahighly malignant brain tumor for which no cure is available. Thedataset consists of both methylation and gene expression data(mRNA-array data) and was downloaded from TCGA athttps://tcga-data.nci.nih.gov/tcga/. We filtered the dataset byincluding only the 1000 genes with most variable expression values.The number of samples/patients is 281. It is known that the geneexpression value can be affected by the methylation sites in thepromoter region. For this reason, we annotated the methylationfeatures of the 1000 genes according to their positions on thechromosomes.
Glioblastoma genetic network: Graph Estimation
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
ACSM3
ADA
AGMAT
ALX1
ANKRD12
ANKRD44
ANXA3
AP3S1
AQP12B
ARF3
ARL11
ASB3
B2M
C11orf64
C16orf30C17orf65
C18orf25
C1QC
C20orf103
C3orf27
C4BPB
C6orf201
C7orf41
C7orf45
CA2
CAMK1G
CAND1
CAPG
CD14
CD163
CLCN4
COL3A1
COX4I1CSF1R
CYP3A5
DCX
DIAPH3
DLX4
DTX1
EIF1AY
EMX2
EPN1
FAM26B
FAM26C
FCGR3A
FCGRT
FLJ25801
FLJ41603
FOXG1
GDF3
GDPD2
GIMAP1
GIMAP4
GLDN
GMEB2
GPC3GPM6B
HBB
HCK
HLA.DMB
HLA.DRA
HLA.F
HMGCLL1
HOPX
HSD17B7
IMMP1L
IMPG1
INDOL1
INPP4B
ISX
ITGA4
ITM2B
KBTBD5
KCNJ6
KCNK4
KCTD12
KIFC3
KLHDC8A
KLK11
KLRF1
LEPR
LLGL2
LOC152586
LOC441601
LOC51057
LOC641367
LOXL3
LPHN3
LRP8 MAGEC1
MAL2
MCM5
MEGF11
MIZF
MTHFD1
MYO1A
NDUFB10
NFYA
NPAL2
NUDCD1 OCA2
OR10A2
OR2F1
OR3A3
OR52M1
OVOS2
PARD6B
PARVBPDZD11
PKP1
POLR2JPOU5F1
POU5F1P3
PSD2
PTPN13
PVRL2
RAD50
RBM7
RGMA
RGN
RGS1
RIT1
RNF180
RP11.114G1.1
SCG3
SERPINA5
SERPINC1
SHC3
SIRPG
SLC28A1
SLC2A13
SLC5A9
SLC7A3
SPANXD
SPRED2
SRF
SRGN
STK19TACSTD1
TAS2R4
TERT
TGM7
THBS4
THOC7
TIA1
TLK2
TM6SF1
TMEM16C
TOM1L1
TRIM50
TSPY2
UBE1L
UBE2C
UBXD4VAPA
VIPR1
WDFY4
WDR12
YES1
ZBTB39
ZNF121
Figure: Directed Glioblastoma genetic network learned by the three-stagemethod with the methylation effects having been adjusted.
Discussion
▶ We have proposed a three-stage method for learning thestructure of high-dimensional Bayesian networks. Thethree-stage method is to first learn the moral graph of theBayesian network based on variable screening and multiplehypothesis tests, and then resolve the local structure of theMarkov blanket for each variable based on conditionalindependence tests, and finally identify the derived directionsfor non-convergent connections based on logical rules.
▶ We justified the consistency of the three-stage method underthe small-n-large-p scenario.
▶ The time complexity of the proposed algorithm is bounded byO(p22m−1). This is much better than the PC algorithm,which has a computational complexity of O(p2+m) under thesame sparsity assumption.
▶ The moral graph is a Markov network: The proposed methodprovides a new way to learn Markov networks for mixed data.
Extension
▶ Non-Gaussian continuous data: nonparanormal transformation
▶ Poisson or negative binomial data: random-effect model-basedtransformation (Jia et al., 2017)
▶ Other types of discrete data: regroup and treat as multinomialdata
Acknowledgement
▶ Student: Suwa Xu
▶ NSF grant: DMS-1612924
▶ NIH Grant: R01-GM117597