

Computers in Biology and Medicine 41 (2011) 763–770

journal homepage: www.elsevier.com/locate/cbm

A sub-space greedy search method for efficient Bayesian Network inference

Qing Zhang a, Yong Cao b, Yong Li c, Yanming Zhu c, Samuel S.M. Sun a, Dianjing Guo a,*

a School of Life Sciences and the State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
b Department of Mechanical Engineering and Automation, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
c Plant Bioengineering Laboratory, Northeast Agricultural University, Harbin, China

* Corresponding author.
E-mail addresses: [email protected] (Q. Zhang), yongc@hitsz.edu.cn (Y. Cao), [email protected] (Y. Li), ymzhu@neau.edu.cn (Y. Zhu), [email protected] (S.S.M. Sun), djguo@cuhk.edu.hk (D. Guo).

doi:10.1016/j.compbiomed.2011.06.012

Article info

Article history:
Received 26 March 2010
Accepted 14 June 2011

Keywords: Microarray; Greedy search; Partial correlation coefficient; Regulation network; Bayesian network (BN)

Abstract

Bayesian networks (BNs) have been successfully used to infer the regulatory relationships of genes from microarray datasets. However, one major limitation of the BN approach is its computational cost, because the calculation time grows more than exponentially with the dimension of the dataset. In this paper, we propose a sub-space greedy search method for efficient Bayesian network inference. In particular, this method limits the greedy search space by selecting only gene pairs with higher partial correlation coefficients. Using both synthetic and real data, we demonstrate that the proposed method achieves results comparable to those of the standard greedy search method while saving ~50% of the computational time. We believe that the sub-space search method can be widely used for efficient BN inference in systems biology.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

One of the most important mechanisms in living cells is the regulation of gene expression, which subsequently affects cellular behavior. The development of high-throughput technologies, such as gene expression microarrays [1] and ChIP-chip [2], provides valuable resources for elucidating the transcription networks in cellular systems. In recent years, network modeling approaches, e.g. co-expression networks [3], Boolean networks [4,5], differential equations [6,7], information-theoretic approaches [8,9], and Bayesian Networks (BN) [10], have been widely adopted to infer genetic networks from these high-throughput datasets.

Among these network models, BNs show the greatest promise for network inference from gene expression data. A BN consists of a graphical structure encoding the domain variables and the probabilistic relationships between them, and a numerical part encoding probabilities over these variables [11]. Two kinds of methods are used to infer a regulatory network with a BN. One is based on conditional independence tests, which use a statistical hypothesis test to build a network that exhibits the observed dependencies and independencies [12–14]. The other is quality measurement, in which the quality of a candidate graph is measured by a Bayesian score [15], an MDL score [16], etc., and the graph with the highest score is taken to best explain the observed data. The conditional independence approach is more sensitive to failures of individual independence tests. Thus, quality measurement is often considered the method of choice for structure learning from high-throughput data [17].

The most commonly used score-based Bayesian network learning algorithm is greedy hill-climbing, which starts from a candidate network and then iteratively moves to the neighbor network that leads to the largest score improvement. During this process, the number of candidate changes is O(n^2), where n is the number of variables. Because the number of possible networks grows more than exponentially with the number of variables, the cost of calculation becomes acute when BN is applied to high-throughput microarray data. However, most of the false candidate gene pairs or networks examined during the search could be eliminated beforehand. For example, in a small network containing 3 genes (namely X, Y, and Z), if X regulates Y and Z is not related to X or Y, the X–Z pair and the Y–Z pair need not be considered during network inference. To reduce the computational cost, a measure of dependence between variables should be computed to restrict the search space before the networks are constructed.

Based on this notion, mutual information has been proposed as a measure of dependence between variables in network reconstruction [18]. Another simple way to infer the dependence between variables is to compute all the pair-wise correlations. However, the correlation coefficient is a weak criterion for measuring dependence because it captures only marginal dependence and cannot distinguish direct from indirect dependence. The partial correlation coefficient (PCC) measures the degree of association between two random variables with the effects of the controlling random variables removed, and therefore provides a strong measure of dependence [19–23]. In this paper, we propose a sub-space greedy search method that uses the partial correlation coefficient to estimate the dependence between variables and to restrict the search space. We demonstrate that our method can greatly reduce computational cost with minimal tradeoffs in network accuracy. We believe this method can be widely used for efficient genetic network inference in systems biology.

2. Results and discussion

2.1. BN tends to select gene pairs with higher partial correlation coefficients

Using the synthetic datasets generated by SynTReN [24], a network was reconstructed using BN inference. Comparing the PCCs of the BN-inferred gene pairs with those of all gene pairs (Fig. 1), we found that the PCCs of the gene pairs resulting from BN inference follow a normal distribution and that the number of BN-inferred gene pairs increases with increasing absolute PCC. This observation suggests that BN inference tends to select highly correlated gene pairs, which is consistent with the finding that real regulatory gene pairs often contain genes with similar expression patterns and a higher PCC than false ones. This result also highlights the rationale of our proposed sub-space search, which is to restrict the search space by selectively choosing gene pairs with higher PCC as an efficient alternative for BN inference.

2.2. BN tends to infer DAGs with higher PCC in each iteration step

Using the same synthetic datasets, a matrix (Mp) based on PCC was established to indicate the possible regulatory relationship between two genes (see Materials and methods).

To examine whether the DAGs with the highest PCCs were selected during each iteration step, the DAGs with the highest score for each Mp column were collected and sorted by score. As shown in Fig. 2a, most of these DAGs contain parent genes that rank at the top of the Mp index (i.e. those with higher PCC). Because only the one candidate DAG with the highest score is selected for the next iteration step, we monitored the distribution of the Mp index in the selected DAGs and found that the majority of them contain the top-ranked parent genes (Fig. 2b). These results again demonstrate that high-PCC DAGs are also the high-score DAGs inferred by the Bayesian network.

Fig. 1. The partial correlation values of gene pairs were plotted against the percentage of BN gene pairs. Dashed lines: all gene pairs. Solid lines: percentage of BN gene pairs.

Fig. 2. (a) DAGs with the highest score in each column were collected at every iteration step and sorted based on their scores. X-axis: Mp index; Y-axis: iteration steps. The color represents the Mp index of the sorted DAGs. (b) The heights of the cuboids represent the scores of the DAGs; a cuboid in red means that the DAG is selected for the next iteration step.

2.3. Comparison to classical greedy search method using synthetic data

A dataset containing 50 genes generated by SynTReN [24] was used to infer the network with the classical search method and with the sub-space search method. The results of BN inference were compared for sub-spaces of increasing size (Mp index 1–10, 1–15, 1–20, 1–25, 1–30, 1–35, 1–40, 1–45, and 1–49) (Fig. 3). As shown, the computational time increased almost linearly with the number of parent genes included. However, the network score reached its highest value (Table 1) when genes with Mp index 1–25 were used and remained almost unchanged thereafter. This demonstrates that, by including only a portion of the highly correlated gene pairs, the sub-space search method achieved performance similar to that of the classical method in terms of network score while saving nearly half of the computational time. The highest BN score was obtained when 50% of the total gene pairs (parent genes indexed 1–25 out of 49) were included.

2.4. Comparison to classical greedy search method using real dataset

A similar comparison was carried out using the real microarray data (see Materials and methods), and the results are summarized in Fig. 4 and Table 2. When parent genes with Mp index 1–25 were used, the inferred network achieved a score comparable to that of the classical greedy search but cost only 66% of the computational time. Although the network score reached its highest value when genes with Mp index 1–35 were used, the computational cost was then around 90%. Considering the tradeoff against computational cost, we suggest including the top 50% of gene pairs in terms of PCC to obtain the maximum network efficiency.

Using the absolute efficiency (F) as the measure, the networks inferred by the standard greedy search and by our sub-space search method were compared against the reference network generated by Djebbari and Quackenbush [25]. Because the reference network was itself inferred from microarray data rather than being a real network validated by biological experiments, we focus the comparison only on network efficiency and computational time. Because BN tends to overfit the data, low F values were observed for both networks. The standard greedy search achieved 15% absolute efficiency using 100% of the computational time. The sub-space search method, on the other hand, achieved a comparable 14% absolute efficiency at a cost of only 66% of the computational time. In real applications, users may choose to define the degree of sub-space, or may choose to include some gene pairs with lower PCC values to avoid the possible arbitrary effects of selecting only high-PCC pairs.

Fig. 3. X-axis: Mp index representing the portion of parent genes included, e.g. the number 20 means that the parent genes residing in the 1st to the 20th columns after the child gene in the Mp are included; Y-axis: BN score (left) and the percentage of computational time consumed.

Table 1. Comparison of standard greedy search and sub-space greedy search using synthetic dataset.

| Mp index | Score | Absolute computational time (seconds) | Relative computational time |
|---|---|---|---|
| 10 | 3970.72 | 8493.05 | 0.23 |
| 15 | 4158.87 | 11705.13 | 0.31 |
| 20 | 4141.32 | 15289 | 0.41 |
| 25 | 4281.34 | 20711.27 | 0.56 |
| 30 | 4246.14 | 23234.37 | 0.62 |
| 35 | 4264.96 | 27524.41 | 0.74 |
| 40 | 4265.18 | 30530.24 | 0.82 |
| 45 | 4265.47 | 35182.1 | 0.94 |
| 49 | 4272.11 | 37239.86 | 1 |


The advantage of restricting the search space can be especially useful when large-scale gene expression data are used. In classical greedy search, O(n^2) initial changes are first evaluated, and each subsequent iteration step requires O(n) new calculations. In sub-space search, however, the number of initial changes is O(kn), where k is the user-defined number of candidate parent genes kept for each child. High-throughput microarray data are often used for BN inference, and the large number of variables (e.g. tens of thousands of genes in the human genome) may cause an enormous increase in computational cost. By limiting the number of gene pairs, the sub-space search can achieve efficient network inference at much lower computational cost with minimal tradeoffs.
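To make the O(n^2) versus O(kn) comparison concrete, the following minimal sketch counts the candidate arrow additions available from an empty starting DAG. The function names are hypothetical, and the count deliberately ignores acyclicity checks and arrows already present, so it is only an upper bound on the additions evaluated in one step.

```python
def classical_additions(n: int) -> int:
    """Ordered (parent, child) pairs when every gene may be a parent of every other gene."""
    return n * (n - 1)


def subspace_additions(n: int, k: int) -> int:
    """Ordered pairs when each child keeps only its k top-ranked candidate parents."""
    return n * min(k, n - 1)


# Sizes taken from the synthetic experiment: 50 genes, Mp index 1-25.
n_genes, k_parents = 50, 25
full = classical_additions(n_genes)
restricted = subspace_additions(n_genes, k_parents)
print(full, restricted, round(restricted / full, 2))
```

For 50 genes and k = 25 this gives 1250 candidate additions instead of 2450, i.e. roughly half, which is in line with the relative computational time reported in Table 1 for Mp index 1–25.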

2.5. Comparison to Pearson Correlation (COR) and mutual information (MI)

Mutual information (MI) has been used to narrow the parameter search space and thereby improve the efficiency of Bayesian network learning. In this section, we compare our method with alternatives based on mutual information (MI) and Pearson Correlation (COR). A synthetic dataset generated by SynTReN [24] was used as the benchmark. The regulation pairs inferred using these 3 methods were compared with the reference network. To ensure that the inferred regulatory network is independent of the initial regulation structure, we randomly assigned the initial gene pairs 100 times and counted the number of times that any given gene pair was inferred (each pair thus has a score between 0 and 100). The precision-recall (PR) curves under different Mp indices were plotted by selecting different score values (Supplementary material). To compare the performance of these 3 methods, the best predicted results (highest absolute efficiency) and the relative average time of each method were plotted for different Mp indices (Fig. 5). Here the relative average time is the average time over the 100 runs divided by the maximum average time. As Fig. 5 shows, PCC and MI performed better and used less time than the classical method for most Mp indices. COR gave the worst result in terms of both the absolute efficiency and the time spent. When the Mp index is 5, PCC achieved the highest absolute efficiency and consumed only ~25% of the time required by the classical method. This is reasonable because PCC can measure the dependence of two genes without the effect of a third gene. From this result, we conclude that PCC is an efficient pre-processing method for limiting the search space in Bayesian structure learning.

3. Conclusions

Greedy search is an iterative process that aims to find a locally optimal state. During the iterations, an added gene pair with a low PCC value may affect other real pairs with higher PCCs.

Fig. 4. X-axis: Mp index representing the portion of parent genes included, e.g. the number 20 means that the parent genes residing in the 1st to the 20th columns after the child gene in the Mp are included; Y-axis: BN score (left) and the percentage of computational time consumed.

Table 2. Comparison of standard greedy search and sub-space greedy search using real dataset.

| Mp index | Score | Absolute computational time (seconds) | Relative computational time | Precision | Sensitivity | Absolute efficiency (F) |
|---|---|---|---|---|---|---|
| 15 | 2913.83 | 6658.1 | 0.4 | 0.08 | 0.15 | 0.11 |
| 20 | 2890.63 | 8780.16 | 0.52 | 0.09 | 0.16 | 0.12 |
| 25 | 2882.66 | 11161.03 | 0.66 | 0.11 | 0.2 | 0.14 |
| 30 | 2883.81 | 13811.01 | 0.82 | 0.12 | 0.21 | 0.15 |
| 35 | 2876.72 | 15367.59 | 0.91 | 0.13 | 0.25 | 0.17 |
| 40 | 2881.24 | 16854.16 | 1 | 0.12 | 0.21 | 0.15 |


In the present work, we propose a sub-space search method that reduces the computational time while retaining as much of the BN inference accuracy as possible. We showed that this method is feasible because BN tends to infer highly correlated gene pairs, so a portion of the high-PCC gene pairs can be used instead of all the gene pairs. By comparing it with the classical greedy search algorithm on both a synthetic dataset and a real dataset, we demonstrated that the sub-space search method can save nearly half of the computational time with minimal tradeoff in BN inference accuracy. This method can be widely applied for efficient BN modeling in systems biology.

4. Materials and methods

4.1. Generation of synthetic dataset

Using the simulation program SynTReN [24], we selected n genes (n = 10, 15, 20, 25 and 50) and obtained independent datasets with 1000 observed samples, each containing n genes. These synthetic datasets were used to reconstruct the transcription network using the classical greedy search and the proposed sub-space search method.

4.2. Selection of real gene expression dataset and reference network

We adopted the microarray dataset comparing gene expression in Acute Lymphoblastic Leukemia (ALL) patients and Acute Myeloid Leukemia (AML) patients (27 ALL and 11 AML), measured with the Affymetrix Hu6800 GeneChip™. The chip contains 7129 gene-specific probe sets representing approximately 6817 genes [26]. Using this dataset, Djebbari and Quackenbush carried out seeded BN inference to obtain a standard network containing 41 genes [25]. In this study, we used their inferred network as the reference to compare the performance of the proposed sub-space greedy search method with that of the classical greedy search algorithm.

Fig. 5. The x-axis corresponds to the different Mp indices. "All" means all the gene pairs, as used by the classical method. The y-axis corresponds to the absolute efficiency (black) and the relative time (red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.3. Learning Bayesian network

In graphical model representation, a Bayesian network (BN) is a directed acyclic graph (DAG) representing a joint probability distribution (JPD) over all variables. The nodes in the DAG represent the variables, and the edges represent the relationships between variables. In a BN, each variable is independent of its non-descendants given its parents, and the relationships between variables are described by conditional probability distributions (CPDs), denoted p(B|A), the probability of B given A. The JPD can be calculated using the following formula:

$$p(x \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(x_i), \theta_i)$$

where x = {x_1, …, x_n} denotes the variables, θ = {θ_1, …, θ_n} denotes the model parameters, θ_i is the set of parameters describing the distribution of the ith variable x_i, and pa(x_i) denotes the parents of x_i.
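As an illustration of this factorization, the sketch below evaluates the joint probability for the three-gene example from the Introduction (X regulates Y, Z is unrelated). It assumes binary variables with made-up probability tables purely for readability; the networks in this study are conditionally Gaussian, so this is a toy instance of the formula rather than the model actually fitted.

```python
# Hypothetical conditional probability tables for a toy network X -> Y, with Z isolated.
p_x = {0: 0.7, 1: 0.3}                      # p(X), no parents
p_y_given_x = {0: {0: 0.9, 1: 0.1},         # p(Y | X = 0)
               1: {0: 0.2, 1: 0.8}}         # p(Y | X = 1)
p_z = {0: 0.5, 1: 0.5}                      # p(Z), no parents


def joint(x: int, y: int, z: int) -> float:
    """Product of each variable's probability given its parents."""
    return p_x[x] * p_y_given_x[x][y] * p_z[z]


# The factorized terms must sum to one over all configurations.
total = sum(joint(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1))
print(joint(1, 1, 0), total)
```

Because Z has no parents and no children, its factor is independent of X and Y, which is why the X–Z and Y–Z pairs contribute nothing to the structure.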

The learning of a BN structure can be stated as follows: given a dataset D = {d_1, …, d_n}, find the network B that best matches D. To assess the degree to which the resulting structure explains D, we use the relative probability p(S, D) as the score function. This score is also used by deal [27], a software package implemented in R [28,29]. This package includes several methods for analyzing gene expression data using Bayesian networks with variables of discrete and/or continuous types, but it is restricted to conditionally Gaussian networks. BNArray [30] is another package, which re-samples microarray data and constructs the gene regulation network based on deal. In our study, the deal package was used to calculate the CPDs of gene pairs and the BN scores, and the BNArray package was used to re-sample the datasets. All networks inferred by the sub-space greedy search and the classical greedy search come from 100 bootstrap iterations, and only edges with bootstrap confidence greater than 0.5 (occurring in more than 50% of iterations) are retained.
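The bootstrap confidence filter described above can be sketched as follows. This is not the BNArray implementation; `confident_edges` is a hypothetical helper that assumes each bootstrap run has already produced a set of directed (parent, child) edges.

```python
from collections import Counter


def confident_edges(bootstrap_networks, threshold=0.5):
    """Keep edges whose fraction of occurrences across runs exceeds the threshold."""
    counts = Counter(edge for net in bootstrap_networks for edge in set(net))
    total = len(bootstrap_networks)
    return {edge for edge, c in counts.items() if c / total > threshold}


# Four toy bootstrap networks (the study uses 100).
runs = [
    {("g1", "g2"), ("g2", "g3")},
    {("g1", "g2")},
    {("g1", "g2"), ("g3", "g1")},
    {("g2", "g3")},
]
consensus = confident_edges(runs)
print(consensus)
```

Here only g1→g2 survives: it appears in 3 of 4 runs, whereas g2→g3 appears in exactly half of them and therefore does not exceed the strict 0.5 threshold.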

4.4. Measure of dependence

To measure the dependence between two genes, the partial correlation coefficients of all possible gene pairs were calculated using the GeneNet package [23], implemented in Bioconductor [31].
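For intuition about what the PCC removes, the sketch below computes a first-order partial correlation for three variables with the textbook formula r_xy.z = (r_xy − r_xz r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). This is not the GeneNet estimator, which handles many genes at once; the data and function names are illustrative assumptions only.

```python
import math


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data: gene_z drives both gene_x and gene_y; the two have no direct link.
gene_z = [1, 2, 3, 4, 5, 6, 7, 8]
noise_x = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
noise_y = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
gene_x = [z + e for z, e in zip(gene_z, noise_x)]
gene_y = [z + e for z, e in zip(gene_z, noise_y)]

print(pearson(gene_x, gene_y), partial_corr(gene_x, gene_y, gene_z))
```

In this toy case the ordinary correlation between gene_x and gene_y is high because both follow gene_z, while the partial correlation given gene_z is close to zero, so the pair would rank low as a candidate regulatory pair.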

4.5. Structure learning using sub-space search algorithm

In classical Bayesian network learning, the greedy search algorithm explores all the candidate networks and selects the one with the highest score at each iteration until the network converges. Because the arrow deletion and turning processes operate on the current DAG, which has a finite number of edges, both processes cost only limited computational time compared with the arrow addition process. We therefore propose a method that restricts the search space in the arrow addition process by selecting gene pairs with higher PCC values.

A detailed description of the algorithm is as follows:

1. Based on the partial correlation coefficients (PCCs) of all possible gene pairs, we construct a matrix (Mp) that indicates the possible regulatory relationships between genes. The rows of Mp correspond to the variables (child genes), and the columns correspond to all the potential parents of these variables. The parent genes are indexed by their PCC with the individual variable. For example, in an Mp containing 50 columns, the parent gene in the first column after the variable (Mp index 1) has the highest PCC with the variable. Similarly, the gene in the last column (Mp index 49) has the lowest PCC with the variable.

2. Select an initial DAG D0 from which to start the search.

3. Calculate the Bayes factor of D0 and select networks through the following process:

a. One arrow is added to D0. Unlike classical greedy search, which considers all genes as candidate parents for each child, the sub-space search method limits the search space by selecting only gene pairs with higher PCCs (parent genes at the top of the Mp ranking, e.g. Mp index 1, 2, 3, 4, 5, etc.). To avoid the possible arbitrary effects of selecting only high-PCC pairs, the user may choose to randomly include some low-PCC pairs.

b. One arrow in D0 is deleted.

c. One arrow in D0 is turned (reversed).

d. Among all the resulting networks, select the one that increases the Bayes factor the most as the candidate DAG (Dc). If the score of Dc is higher than that of D0, D0 is replaced by Dc and the process is repeated from step (a). If the score of Dc is lower than that of D0, the algorithm stops and D0 is the final DAG.

4. If the Bayes factor is not increased, stop the search. Otherwise, let the chosen network be D0 and repeat from step 3.

A graphical description of the sub-space search method is illustrated in Fig. 6.

Fig. 6. (a) Calculation of the PCC of all gene pairs using the GeneNet package. (b) Construction of a matrix Mp to describe the possible parents of each variable. The rows correspond to variables and the columns correspond to all their parents. The parent genes are listed in descending order of their PCC with the child genes (variables). Only higher-ranking parents (e.g. in brown columns) are selected to form the search space with the corresponding child variable. User-defined low-PCC gene pairs (e.g. columns in orange) can be randomly selected at each iteration step to avoid arbitrary effects. (c) After structure learning, if a DAG with an added arrow (g6-g2) is selected, the parent gene g6 is transferred to the last column (in red). (d) If a DAG with a removed arrow (g6-g2) is selected, g6 is re-transferred to the first column (in red) for the next search. (e) If a DAG with a turned arrow (g2-g6) is selected, then two transfer processes are done as described in (c) and (d). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
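The steps above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions, not the deal-based implementation used in the study: `score` is any user-supplied function standing in for the Bayes factor, parents are ranked by absolute PCC, and the random inclusion of low-PCC pairs and the column transfers of Fig. 6(c)–(e) are omitted. All function names are hypothetical.

```python
def build_mp(pcc):
    """pcc[child][parent] -> PCC. Returns each child's parents, highest |PCC| first."""
    return {child: sorted(parents, key=lambda p: abs(parents[p]), reverse=True)
            for child, parents in pcc.items()}


def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is a DAG if every node can be removed in order."""
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    queue = [v for v in nodes if indegree[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    return removed == len(nodes)


def neighbours(edges, mp, k):
    """Additions restricted to the top-k parents, plus all deletions and reversals."""
    for child, ranked in mp.items():
        for parent in ranked[:k]:
            if (parent, child) not in edges:
                yield edges | {(parent, child)}
    for parent, child in edges:
        yield edges - {(parent, child)}
        yield (edges - {(parent, child)}) | {(child, parent)}


def subspace_greedy(nodes, mp, k, score, start=frozenset()):
    """Hill-climb until no neighbouring DAG improves the score."""
    current, best = frozenset(start), score(frozenset(start))
    while True:
        candidates = [g for g in neighbours(current, mp, k) if is_acyclic(nodes, g)]
        if not candidates:
            return current
        top = max(candidates, key=score)
        if score(top) <= best:
            return current
        current, best = top, score(top)


# Toy usage with a made-up score that rewards edges of a known chain g1 -> g2 -> g3.
genes = ["g1", "g2", "g3"]
toy_pcc = {"g1": {"g2": 0.9, "g3": 0.1},
           "g2": {"g1": 0.9, "g3": 0.8},
           "g3": {"g1": 0.1, "g2": 0.8}}
truth = {("g1", "g2"), ("g2", "g3")}
toy_score = lambda g: len(g & truth) - len(g - truth)
result = subspace_greedy(genes, build_mp(toy_pcc), k=1, score=toy_score)
print(sorted(result))
```

With k = 1 each child considers a single candidate parent, yet the toy search still recovers the chain, because the true parents are also the top-ranked ones. Setting k to the number of genes minus one reproduces the unrestricted classical neighbourhood.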

4.6. Estimate of BN inference

Three measures, precision (P), sensitivity (S) and absolute efficiency (F), were computed to compare the BN-inferred network with the reference network. P is the fraction of predicted gene pairs that are correct,

$$P = \frac{TP}{TP + FP}$$

and S is the fraction of all known gene pairs that are inferred by BN,

$$S = \frac{TP}{TP + FN}$$

where TP is the number of true positives, FN the number of false negatives and FP the number of false positives. F denotes the absolute efficiency,

$$F = \frac{2PS}{P + S}$$

which is the harmonic mean of precision and sensitivity.
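These three measures are straightforward to compute from two edge sets. The helper below is a minimal sketch with a hypothetical name; it assumes that edges are compared as directed (parent, child) pairs and returns zero when a denominator would vanish.

```python
def evaluate(predicted, reference):
    """Return (precision, sensitivity, absolute efficiency F) for two edge sets."""
    tp = len(predicted & reference)   # inferred and present in the reference
    fp = len(predicted - reference)   # inferred but absent from the reference
    fn = len(reference - predicted)   # in the reference but not inferred
    p = tp / (tp + fp) if tp + fp else 0.0
    s = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * s / (p + s) if p + s else 0.0
    return p, s, f


inferred = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
known = {("a", "b"), ("b", "c"), ("a", "c"), ("b", "d"), ("e", "a")}
precision, sensitivity, f_score = evaluate(inferred, known)
print(precision, sensitivity, f_score)
```

As a consistency check on the formula, the Table 2 row for Mp index 25 reports P = 0.11 and S = 0.2, and 2 × 0.11 × 0.2 / (0.11 + 0.2) ≈ 0.14, which matches the listed absolute efficiency.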

Conflict of interest statement

None declared.

Acknowledgment

This work is supported by a grant from the Hong Kong UGC/AoE Plant & Agricultural Biotechnology Project AoE-B-07/09. We thank ITSC at CUHK for providing computing server support.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.compbiomed.2011.06.012.

References

[1] A. Schulze, J. Downward, Navigating gene expression using microarrays - a technology review, Nat. Cell Biol. 3 (2001) E190–E195.

[2] B. Ren, F. Robert, J.J. Wyrick, O. Aparicio, E.G. Jennings, I. Simon, et al., Genome-wide location and function of DNA binding proteins, Science 290 (2000) 2306–2309.

[3] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA 95 (1998) 14863–14868.

[4] S.A. Kauffman, Metabolic stability and epigenesis in randomly constructed genetic nets, J. Theor. Biol. 22 (1969) 437–467.

[5] T. Akutsu, S. Miyano, S. Kuhara, Identification of genetic networks from a small number of gene expression patterns under the Boolean network model, Pac. Symp. Biocomput. (1999) 17–28.

[6] D. di Bernardo, M.J. Thompson, T.S. Gardner, S.E. Chobot, E.L. Eastwood, A.P. Wojtovich, et al., Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks, Nat. Biotechnol. 23 (2005) 377–383.

[7] M. Bansal, G. Della Gatta, D. di Bernardo, Inference of gene regulatory networks and compound mode of action from time course gene expression profiles, Bioinformatics 22 (2006) 815–822.

[8] K. Basso, A.A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, A. Califano, Reverse engineering of regulatory networks in human B cells, Nat. Genet. 37 (2005) 382–390.

[9] A.A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla Favera, et al., ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context, BMC Bioinf. 7 (Suppl. 1) (2006) S7.


[10] N. Friedman, M. Linial, I. Nachman, D. Pe'er, Using Bayesian networks to analyze expression data, J. Comput. Biol. 7 (2000) 601–620.

[11] R. Cowell, P. Dawid, S. Lauritzen, D. Spiegelhalter, Probabilistic Networks and Expert Systems (Information Science and Statistics), Springer-Verlag, New York, 1999.

[12] G. Rebane, J. Pearl, The recovery of causal poly-trees from statistical data, Int. J. Approx. Reasoning 2 (1988) 341.

[13] P. Spirtes, C. Glymour, An algorithm for fast recovery of sparse causal graphs, Soc. Sci. Comput. Rev. 9 (1991) 62–72.

[14] L.M. De Campos, J.F. Huete, A new approach for learning belief networks using independence criteria, Int. J. Approx. Reasoning 24 (2000) 11–37.

[15] G.F. Cooper, E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Mach. Learn. 9 (1992) 309–347.

[16] W. Lam, F. Bacchus, Learning Bayesian belief networks - an approach based on the MDL principle, Comput. Intell. 10 (1992) 269–293.

[17] N. Friedman, I. Nachman, D. Pe'er, Learning Bayesian Network structure from massive datasets: the "sparse candidate" algorithm, in: Proceedings of UAI, 1999, pp. 206–215.

[18] C. Chow, C. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory 14 (1968) 462–467.

[19] S. Ma, Q. Gong, H.J. Bohnert, An Arabidopsis gene network based on the graphical Gaussian model, Genome Res. 17 (2007) 1614–1625.

[20] H. Toh, K. Horimoto, System for automatically inferring a genetic network from expression profiles, J. Biol. Phys. 28 (2002) 449–464.

[21] X. Wu, Y. Ye, K.R. Subramanian, Interactive analysis of gene interactions using graphical Gaussian model, in: ACM SIGKDD (Ed.), Workshop on Data Mining in Bioinformatics, 3, 2003, pp. 63–69.

[22] H. Li, J. Gui, Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks, Biostatistics 7 (2006) 302–317.

[23] J. Schafer, R. Opgen-Rhein, K. Strimmer, Reverse engineering genetic networks using the GeneNet package, R News (2006) 50–53.

[24] T. Van den Bulcke, K. Van Leemput, B. Naudts, P. van Remortel, H. Ma, A. Verschoren, et al., SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms, BMC Bioinf. 7 (2006) 43.

[25] A. Djebbari, J. Quackenbush, Seeded Bayesian Networks: constructing genetic networks from microarray data, BMC Syst. Biol. 2 (2008) 57.

[26] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.

[27] S.G. Bøttcher, C. Dethlefsen, deal: a package for learning Bayesian networks, Journal of Statistical Software 8 (2003) 200–203.

[28] R. Team, R: A language and environment for statistical computing, 2004.

[29] R. Ihaka, R. Gentleman, R: a language for data analysis and graphics, Journal of Computational and Graphical Statistics 5 (1996) 299–314.

[30] X. Chen, M. Chen, K. Ning, BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network, Bioinformatics 22 (2006) 2952–2954.

[31] R.C. Gentleman, V.J. Carey, D.M. Bates, B. Bolstad, M. Dettling, S. Dudoit, et al., Bioconductor: open software development for computational biology and bioinformatics, Genome Biol. 5 (2004) R80.