
Mining gene regulatory networks by neural modeling of expression time-series

Mariano Rubiolo, Diego Milone, IEEE Member, Georgina Stegmayer, IEEE Member

Abstract—Discovering gene regulatory networks from data is one of the most studied topics in recent years. Neural networks can be successfully used to infer an underlying gene network by modeling expression profiles as time series. This work proposes a novel method based on a pool of neural networks for obtaining a gene regulatory network from a gene expression dataset. The networks are used for modeling each possible interaction between pairs of genes in the dataset, and a set of mining rules is applied to accurately detect the underlying relations among genes. The results obtained on artificial and real datasets confirm the effectiveness of the method for discovering regulatory networks from a proper modeling of the temporal dynamics of gene expression profiles.

Index Terms—Gene Regulatory Networks, Gene Profiles, Time Series Data, Neural Networks.


1 INTRODUCTION

A gene regulatory network (GRN) is an abstract mapping of gene regulations in living organisms that can help to predict the system behavior. In recent years, many approaches have been proposed to unravel the complexity of gene regulation [1]. Genes interact with one another and these interactions can be measured over a number of time steps, producing temporal gene expression profiles. A hot topic in gene expression data analysis nowadays is the reconstruction of a GRN from such data, revealing the underlying network of gene-to-gene interactions [2]. In other words, the goal is to determine the pattern of activations and inhibitions amongst genes that make up the underlying GRN.

Given the expression levels of a set of interacting genes measured at different time points, formal methods can be developed to model gene interaction [3]. In fact, discovering gene regulatory networks by data-driven methodologies has been under study in recent years [4]–[7]. For instance, Boolean networks [8], [9] only consider the expression of a gene as on/off (not taking into account intermediate expression levels), hence having inadequate resolution. Bayesian networks [10] are represented as graphs considering the joint probability distribution of genes. This model can capture the inherent noise and stochastic behavior in gene expression data effectively, but the high computational cost hinders its application to networks with a large number of genes. Dynamic Bayesian networks [11] have extended Bayesian networks to include temporal components. Genetic algorithms [12] have also been applied to predict the regulatory pathways represented as an influence matrix. This approach can be applied with a small amount of data, optimizing a large number of parameters simultaneously, and solving nonlinear models. A neural network (NN) based model was presented in [13] for identifying gene regulatory networks from temporal gene expression data, in which both feedforward and Elman networks were used. The proposed models have a structural resemblance to genetic networks and also have the ability to capture the strengths of gene interactions in the form of network weights. Two iterative methods were proposed to determine a set of potential regulators of a target gene. An artificial dataset that contained an underlying GRN was used to test the approach, and some of the existing interactions were found, but not all of them.

• M. Rubiolo is with CONICET, CIDISI-UTN-FRSF, Lavaise 610, (3000) Santa Fe, Argentina, Tel.: +54-342-4601579 (Int: 258), [email protected]

• D. Milone and G. Stegmayer are with CONICET, sinc(i), UNL-CONICET, Ciudad Universitaria, RN 168 Km. 472.4, (3000) Santa Fe, Argentina.

Recently, the results of a comparative study [14] among six different reverse engineering methods highlighted the NN approach as the best performing method. This approach [15] involved a Recurrent Neural Network, whose topology was a simplified representation of a GRN with a weight matrix having non-zero entries to indicate interactions (activation or inhibition) of nodes (representing genes). Sensitivity was 57%, that is, only about half of the regulations were found.

Artificial neural networks can be used to infer genetic networks by modeling the activity of pairs of genes over a number of time steps. Thus, for modeling gene regulation, all the possible combinations between genes must be analyzed in order to discover their relations. Using neural networks for modeling interactions in genetic networks requires training them to predict a target gene regulation from candidate regulating profiles. By adjusting their synaptic weights, NNs alter their configuration to model each gene connection, which results in a minimum error in predicting a target profile.

sinc(i) Research Center for Signals, Systems and Computational Intelligence (fich.unl.edu.ar/sinc). M. Rubiolo, G. Stegmayer & D. H. Milone; "Mining gene regulatory networks by neural modeling of expression time-series", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015.

Fig. 1. Temporal delay identification between the expression level data of two genes.

This work proposes a novel approach to discover a GRN from temporal gene expression profiles by using a pool of multilayer perceptrons (MLP) with temporal delays at the input, named GRNNminer¹. In any GRN, the complex interactions occurring amongst genes can be either instantaneous or time-delayed [16]. Gene expression profiles are considered here as time series [17], and are used to train a pool of NN models. Each NN is designed to discover, for a target gene profile at the output, the potential regulator of that gene at the input, by modeling their gene-to-gene interaction during a time period. The ability of each trained NN to model each possible relation is judged by evaluating the generalization error during the training process. For the possible relationships found, a score matrix is computed to detect the most likely regulations. Three rules for mining the unknown GRN from the score matrix are proposed, which make it possible to discover the correct gene-to-gene relations among all the possible ones.

The manuscript is organized as follows: Section 2 explains in detail the novel GRNNminer approach for GRN mining. Section 3 presents the datasets used in this study, whereas Section 4 explains the experimental setup and introduces the performance measures used for validation. In Section 5 the experimental results are presented and discussed. Finally, conclusions and future work are given in Section 6.

1. The source code is freely available for academic purposes at http://sourceforge.net/projects/sourcesinc/files/grnnminer/

[Figure: MLP with an input layer receiving gi(t), gi(t−1), gi(t−2), ..., gi(t−τ), a hidden layer, and an output layer producing gj(t).]

Fig. 2. NN model with temporal delays at the input.

2 NOVEL APPROACH FOR GRN MINING

This section explains how a NN can be trained with temporal data from gene profiles to model gene-to-gene relations. After that, the novel GRNNminer is presented for discovering the underlying GRN.

Temporal gene expression profiles represent gene activity observed over a number of time steps. For instance, Figure 1 shows a segment of the profiles of two possibly related genes. It can be seen that the signal of gene A starts to increase at time 4, while for gene B this behavior only begins to be perceived at the sixth time instant. It could be stated that the behavior of gene A has some influence on gene B, and that the temporal delay between the activation of these two genes is around 2 time steps. These time series can be used to train a NN aiming to model this gene-to-gene relation in a GRN.

A NN can be loosely defined as a large set of simple interconnected units (neurons), which are executed in parallel to perform a common global task [18]. These units usually undergo a learning process that automatically updates network parameters (the connections among neurons, also named synaptic weights) in response to a possibly evolving input environment [19]. The model topology has several layers with no connections among neurons belonging to the same layer. The number of layers and neurons in each layer is chosen a priori, as well as the type of activation functions for the neurons [20].

TABLE 1
Example of error ranking after five repetitions of an experiment aiming to discover the regulator of gene B.

Repetition  Regulator  Regulated  Error
1           A          B          5.21e−10
            C          B          6.10e−10
..........................................
2           A          B          8.92e−10
            D          B          9.57e−10
..........................................
3           A          B          0.35e−09
            E          B          1.75e−09
..........................................
4           A          B          2.19e−09
            D          B          8.32e−09
..........................................
5           A          B          0.98e−08
            C          B          1.64e−08

The relation between two signals can be modeled by using a NN. Depending on how well this relation can be modeled within a bounded number of training epochs, it could be possible to detect a regulation between genes represented by two time series. Figure 2 shows a NN model with time delays at the input layer, which can explicitly consider the temporal structure of the input signal. Specifically, the MLP has three layers (input, hidden and output), a sigmoid activation function at the hidden layer, and a linear activation function at the output. The first layer has only input connections. The basic learning algorithm used is backpropagation [20], which uses gradient descent to minimize a cost function, generally defined as the Mean Squared Error (MSE) between the desired output (targets) and the actual network output. During learning, the error propagates backwards through the network and the model parameters are changed accordingly [21]. The number of inputs is related to the number of temporal delays considered (window τ). That is to say, there will be τ + 1 inputs, considering the input signal from time t to t − τ. The value of τ considered for the input signal can be determined by analyzing each gene-to-gene relation. In this way, all the possible combinations of genes should be analyzed to determine the temporal delay for each dataset.
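The construction of the τ + 1 delayed inputs can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `delayed_samples` and the toy profiles are hypothetical and are not part of GRNNminer, and the sketch covers the data layout only, not the training of the MLP.

```python
def delayed_samples(regulator, target, tau):
    """Build (inputs, output) pairs for a delay window of size tau.

    Each input vector is [g_i(t), g_i(t-1), ..., g_i(t-tau)], that is,
    tau + 1 values of the candidate regulator, and the output is g_j(t).
    """
    if len(regulator) != len(target):
        raise ValueError("profiles must have the same length")
    samples = []
    # The first usable time step is t = tau, so that t - tau is a valid index.
    for t in range(tau, len(regulator)):
        window = [regulator[t - d] for d in range(tau + 1)]
        samples.append((window, target[t]))
    return samples


# Hypothetical toy profiles: gene B follows gene A two time steps later.
gene_a = [0, 0, 0, 0, 1, 2, 3, 3, 3, 3]
gene_b = [0, 0, 0, 0, 0, 0, 1, 2, 3, 3]

pairs = delayed_samples(gene_a, gene_b, tau=2)
print(len(pairs))  # 8 samples, one for each t from 2 to 9
print(pairs[4])    # ([3, 2, 1], 1): gene A at t = 6, 5, 4 and gene B at t = 6
```

With τ = 2 each input vector has three elements, which matches the τ + 1 inputs stated in the text.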

Discovering all the possible relations among a set of genes is the main objective of this network mining approach. To achieve this aim, each possible gene-to-gene relation can be modeled by a NN with temporal delays. This model has to be trained and the results must be analyzed to detect the most probable GRN underlying the dataset. The GRNNminer method consists of three steps: 1) modeling all possible gene-to-gene relations by using a pool of NNs, 2) assigning a score to each relationship found, 3) applying rules to discover the underlying GRN. They are explained in detail in the following subsections.

2.1 Pool of NN for modeling gene-to-gene relations

Modeling gene-to-gene relationships implies building a NN for each possible combination of regulator-regulated genes and training the model with the corresponding time series. In this work, it is important to find a good model that takes into account the dynamics underlying the data; in other words, what matters is how well the NN identifies the dynamics of the model. If the training error is high after a large number of training epochs, it is possible to conclude that there is no relation between those genes; while if the error is low, it can be interpreted as an indication that such a relation exists. However, if the relationship between two signals is not clear enough, the error will be high at the end of the training process regardless of the number of training epochs. Similarly, if the relationship between two signals is clear enough, the error will be low whatever the number of training epochs. In this work, the training error is calculated regarding the prediction of the regulated gene, as the mean squared difference between the target time series (of the regulated gene) and the output time series obtained from the neural model.

[Figure: scoring matrix over genes A to E with regulators as rows and regulated genes as columns; column B holds A = 10, C = 2, D = 2 and E = 1.]

Fig. 3. Example of filling one column of the scoring matrix after considering Table 1.

The training and testing of the models have to be repeated several times in order to achieve more robust statistical results. Therefore, there will be a set of MLP models specialized in each relation. Globally, this can be considered as a pool of NNs whose main objective is to measure how reproducible the relationship between two time series is, that is to say, how well each gene-to-gene relation can be modeled. The ability of a trained NN to model a gene network is assessed using the MSE, which is the mean of the squared discrepancies between the gene expression predictions made by a NN and the true gene expression profiles, over all measurements. Therefore, for each training run within the pool of models, this generalization error is measured by cross-validation in order to find the differences and similarities between input and target genes. This error is considered in the next step to apply the scoring method.
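As a small illustration of the error used to compare candidate regulators, the MSE between a target profile and a model output can be computed as follows. The function name and the numeric values are hypothetical; the predicted series would come from a trained NN, which is not sketched here.

```python
def mean_squared_error(target, predicted):
    """Mean of the squared differences between two equally long series."""
    if len(target) != len(predicted) or not target:
        raise ValueError("series must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


# Hypothetical target profile of a regulated gene and a model output.
target_profile = [1.0, 2.0, 3.0]
model_output = [1.0, 2.0, 5.0]

# Only the last point differs (by 2), so the error is 4 / 3.
error = mean_squared_error(target_profile, model_output)
print(error)
```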

2.2 Scoring method

In order to determine which gene is a potential regulator of another one, it is necessary to give a score to each modeled relation. This score is based on how many times the smallest error has been achieved, considering all the repetitions of each experiment, with random initializations. This way, the neural model that has the lowest error in most of the repetitions determines which regulation is the easiest to model, in a statistical sense. For instance, let us suppose we have five genes {A, B, C, D, E}. All the candidates as regulators of gene B are listed. Then, a score is given to each possible regulator-to-regulated relation {A→B, C→B, D→B, E→B}. This is carried out for all the genes, so that all alternatives are evaluated.

The results must be ordered so that the smallest error is at the top of the list and the largest at the bottom, as can be seen in Table 1. To keep the example clear, this table shows only the first five repetitions (first column of the table) and, for each repetition, the two pairs with the lowest MSE. In this example, the experiment which evaluates whether gene A regulates gene B has the minimum MSE in all repetitions. After that, the scoring method is applied.

The scoring method consists of giving 2 points to the regulator-regulated relation with the lowest error (the first position on the list), and 1 point to the relation with the next lowest error (the second position on the list). This procedure is iterated as many times as there are repetitions in the table. For instance, from the list shown in Table 1 it is possible to see that 5 repetitions have been done to determine which is the regulator of gene B. These repetitions are separated in the list by a dotted line. In all of them, the relation A→B has the first position, so its score is 10. Relations C→B and D→B each have the second position in two repetitions, therefore 2 points are assigned to both of them. Finally, the score for E→B is 1 because it obtained the second position in only one repetition.
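The scoring procedure described above can be reproduced with a few lines of code. This is a minimal sketch, not the authors' implementation: the function name `score_candidates` and the dictionary layout are assumptions, while the error values are the ones listed in Table 1.

```python
from collections import defaultdict


def score_candidates(repetitions):
    """Give 2 points to the lowest-error regulator of each repetition
    and 1 point to the runner-up, then add the points up."""
    scores = defaultdict(int)
    for errors in repetitions:  # errors maps regulator -> MSE
        ranked = sorted(errors, key=errors.get)
        scores[ranked[0]] += 2
        if len(ranked) > 1:
            scores[ranked[1]] += 1
    return dict(scores)


# The two lowest errors of each repetition in Table 1 (target gene B).
table_1 = [
    {"A": 5.21e-10, "C": 6.10e-10},
    {"A": 8.92e-10, "D": 9.57e-10},
    {"A": 0.35e-09, "E": 1.75e-09},
    {"A": 2.19e-09, "D": 8.32e-09},
    {"A": 0.98e-08, "C": 1.64e-08},
]

print(score_candidates(table_1))  # {'A': 10, 'C': 2, 'D': 2, 'E': 1}
```

The result matches the scores derived in the text: 10 for A→B, 2 for C→B and D→B, and 1 for E→B.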

Once the points have been assigned, it is possible to build a scoring matrix to represent each possible gene-to-gene relation, as can be seen in Figure 3. Rows represent the potential regulator genes whereas columns depict regulated genes. The diagonal is not taken into account because autoregulation is not considered in this work. Filling the scoring matrix involves putting the scores in the corresponding regulator-regulated cells of the matrix. For instance, the second column of the matrix shown in Figure 3 is filled with the scores calculated above by applying the scoring method to the list.

2.3 Mining rules

During the previous step, the scoring matrix is filled with the values obtained after analyzing the results of the pool of NNs defined in Subsection 2.1. In Figure 4a), an example of a full scoring matrix is presented. Some doubtful relations may be present in the matrix. The aim of the rules is to resolve these conflicts, as follows.

[Figure: four scoring matrices, a) to d), over genes A to E with regulators as rows and regulated genes as columns.]

Fig. 4. Application of mining rules. a) Original scoring matrix. b) Resultant matrix after the Threshold rule is applied. c) Resultant matrix after the Symmetric rule is applied. d) Resultant matrix after the Unchained rule is applied.

2.3.1 Threshold rule

Observing Figure 4a), it is difficult to identify which gene is regulated by which because of the different scores. It is necessary to eliminate weak connections from the score matrix in order to clarify the way the genes are related. Therefore, the first step consists of excluding those relationships whose errors are above a threshold. This threshold is calculated as the lowest error in all runs plus a percentage θ of that error. Considering the errors associated with the example, the grey cells in Figure 4a) are the relationships that must be removed. Figure 4b) shows the resultant matrix after the application of this rule.
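The threshold rule can be sketched as a simple filter. The sketch assumes that each relation is summarized by a single representative error, which the text does not specify, and the relations, scores and errors below are hypothetical values rather than results from the paper.

```python
def threshold_rule(scores, errors, theta):
    """Keep only the relations whose error is at most e_min * (1 + theta),
    where e_min is the lowest error among all relations."""
    limit = min(errors.values()) * (1.0 + theta)
    return {rel: s for rel, s in scores.items() if errors[rel] <= limit}


# Hypothetical scores and errors, keyed by (regulator, regulated).
scores = {("A", "B"): 10, ("C", "B"): 2, ("D", "E"): 6}
errors = {("A", "B"): 1.00e-9, ("C", "B"): 1.50e-9, ("D", "E"): 1.01e-9}

kept = threshold_rule(scores, errors, theta=0.02)
print(sorted(kept))  # [('A', 'B'), ('D', 'E')]
```

With θ = 0.02 the limit is 2% above the lowest error, so C→B, whose error is 50% above the minimum, is discarded.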

2.3.2 Symmetric rule

Another doubtful case can occur between two particular genes that are apparently mutually regulated. For instance, in Figure 4b) it can be seen that the regulation B→D has 4 points and the regulation D→B has 3 points. After the application of this rule, shown in Figure 4c), only the relation with the best average value between normalized error and score remains.

Given two genes i and j, $\varepsilon_s$ relates the errors of both possible models as

$$\varepsilon_s = \frac{e_{ji} - e_{ij}}{\max\{e_{ij},\, e_{ji}\}}, \qquad (1)$$

where $e_{ij}$ is the error of the model i→j, and $e_{ji}$ is the error of the reverse model j→i. When $\varepsilon_s > 0$, i→j is the model that has the lowest error. On the contrary, when $\varepsilon_s < 0$ the j→i model has the lowest error.

Analogously, $\rho_s$ relates the scores of both models as

$$\rho_s = \frac{s_{ij} - s_{ji}}{\max\{s_{ij},\, s_{ji}\}}, \qquad (2)$$

where $s_{ij}$ is the score of the model i→j, and $s_{ji}$ is the score obtained by the reverse model j→i, so that $\rho_s > 0$ if the first model has the maximum score. In both cases, $\varepsilon_s$ and $\rho_s$ are normalized by dividing each one by its corresponding maximum value, so that both have the same weight in the average.


[Figure: directed graph over genes A, B, C, D and E.]

Fig. 5. A simple GRN discovered at the end of the application of the mining rules.

Therefore, when the average of $\varepsilon_s$ and $\rho_s$ is greater than zero, it can be inferred that i→j is the winning relationship, and when the average is less than zero, the opposite relation j→i is kept.

This approach could be used in more general cases where mutual regulation is present in the biological network simply by not applying this symmetric rule. In this way mutual regulations would be fully contemplated within the model.
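The decision of the symmetric rule follows directly from Eqs. (1) and (2). The sketch below is an illustration only: the function name and the returned labels are assumptions, the scores 4 and 3 come from the B→D and D→B example in the text, and the two error values are hypothetical. How an exact tie is resolved is not stated in the text, so the sketch just reports it.

```python
def symmetric_rule(e_ij, e_ji, s_ij, s_ji):
    """Decide between i->j and j->i from the errors and scores of both models."""
    eps = (e_ji - e_ij) / max(e_ij, e_ji)  # Eq. (1): positive when i->j has lower error
    rho = (s_ij - s_ji) / max(s_ij, s_ji)  # Eq. (2): positive when i->j has higher score
    mean = (eps + rho) / 2.0
    if mean > 0:
        return "i->j"
    if mean < 0:
        return "j->i"
    return "tie"  # assumption: the text does not define this case


# i = B, j = D: scores 4 and 3 as in the text, hypothetical errors.
decision = symmetric_rule(e_ij=1.0e-9, e_ji=1.2e-9, s_ij=4, s_ji=3)
print(decision)  # i->j, so B->D is kept and D->B is removed
```

The sketch assumes strictly positive errors and at least one non-zero score; otherwise the normalization would divide by zero.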

2.3.3 Unchained rule

In Figure 4c) another special case can be observed among genes B, D and E. Analyzing the matrix, the relations B→D, D→E and B→E may be present. It is necessary to remove one of the relations to unchain a possibly redundant regulation. Two possible scenarios can be considered here. The first one is that the correct chain is B→D→E, thus regulation B→E should be removed because E is regulated by B but through D. The second scenario is that the correct regulations are B→D and B→E. In this case, D→E should be discarded because both genes D and E are regulated by the same parent gene B. To determine which situation is the correct one, it is possible to use the averaged errors and scores. If the regulation chain involves genes i, j and k, with links i→j, i→k and j→k, the normalized difference between the summed errors of both alternatives is

$$\varepsilon_u = \frac{(e_{ij} + e_{ik}) - (e_{ij} + e_{jk})}{\max\{(e_{ij} + e_{jk}),\, (e_{ij} + e_{ik})\}}, \qquad (3)$$

and for scores

$$\rho_u = \frac{(s_{ij} + s_{jk}) - (s_{ij} + s_{ik})}{\max\{(s_{ij} + s_{jk}),\, (s_{ij} + s_{ik})\}}. \qquad (4)$$

Therefore, when the average of $\varepsilon_u$ and $\rho_u$ is greater than zero, it can be inferred that the chain i→j→k has the best value, so that i→k must be removed; when the average is less than zero, the best value is for the i→j and i→k regulations, so that j→k must be removed.

Considering again the example in Figure 4c), B→E is removed from the matrix (cell highlighted in grey). The resulting matrix after the application of this rule is shown in Figure 4d).
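The unchained rule can be sketched in the same way from Eqs. (3) and (4). The function name and data layout are assumptions, and both the scores and the errors in the example are illustrative values chosen to resemble the B, D, E case rather than data from the paper.

```python
def unchained_rule(errors, scores, i, j, k):
    """Return the link to remove among i->j, i->k and j->k, or None on a tie."""
    chain_e = errors[(i, j)] + errors[(j, k)]  # i->j->k
    fork_e = errors[(i, j)] + errors[(i, k)]   # i->j and i->k
    chain_s = scores[(i, j)] + scores[(j, k)]
    fork_s = scores[(i, j)] + scores[(i, k)]
    eps = (fork_e - chain_e) / max(chain_e, fork_e)  # Eq. (3)
    rho = (chain_s - fork_s) / max(chain_s, fork_s)  # Eq. (4)
    mean = (eps + rho) / 2.0
    if mean > 0:
        return (i, k)  # the chain wins, so the direct link i->k is removed
    if mean < 0:
        return (j, k)  # the fork wins, so j->k is removed
    return None        # assumption: the text does not define the tie case


# Illustrative values for i = B, j = D, k = E.
scores = {("B", "D"): 4, ("D", "E"): 6, ("B", "E"): 4}
errors = {("B", "D"): 1.0e-9, ("D", "E"): 1.0e-9, ("B", "E"): 1.5e-9}

removed = unchained_rule(errors, scores, "B", "D", "E")
print(removed)  # ('B', 'E')
```

Here the chain has both the lower summed error and the higher summed score, so the direct link B→E is the one removed, as in the example of Figure 4.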

Finally, after the application of the three rules explained in Subsections 2.3.1, 2.3.2 and 2.3.3, a final GRN can be obtained from the resultant scoring matrix. For example, from Figure 4d) it is possible to obtain the GRN shown in Figure 5 by drawing the regulator-regulated relations represented by the matrix.

[Figure: four stacked traces (activity, gene 1, gene 10, gene 15) plotted against time step, from 0 to 200; vertical axes range from 0 to 100.]

Fig. 6. Example of time series for the activation signal (top) and the expression of genes 1, 10 and 15.

3 MATERIALS

In this section an artificial dataset with known connections is described. A real dataset which has been used to validate the proposal is also presented.

3.1 Artificial data set

The artificial data set was developed by Smith et al. [22] and improved by Yu et al. [23]. They modeled a genetic regulatory network of arbitrary structure and measured values for gene expression levels at discrete time steps. The same method was used by Knott et al. [13] to model the gene expression values and validate their gene regulatory model.

Figure 6 shows examples of some of the time series used in this study. It can be seen that an increase in activity (top trace) is followed by upregulation of gene 1 (second trace from top) with a slight time lag. When activity drops, gene 1 returns to hover near its target level, also with a small delay. Gene 10 (third trace from the top) is down-regulated considerably later than the response of gene 1 to activity, showing that the regulatory effects are propagated through a network over time. Gene 15 (bottom trace) is an unregulated gene which changes over a wide range and without relationship to the activity or the other genes shown. The activity level acts as an activation signal starting randomly at a low or high level, and directly affecting the expression levels of the first two genes of the GRN. These two genes, in turn, affect the expression levels (time series) of eight downstream genes. For these time series, the corresponding network is shown in Figure 7. It contains 20 genes, of which only 10 are associated with each other through regulatory interactions, while the remaining 10 genes serve as distractors. Each connection between genes has a number that represents the expression proportion that is added to or subtracted from the regulator gene signal [22]. This proportion indicates whether the predecessor upregulates or downregulates the successor gene signal.

[Figure: network diagram with nodes for activity and genes 1 to 10, edges labeled + 0.20 or − 0.20, and genes 11 to 20 listed separately.]

Fig. 7. Artificial data set GRN topology used as gold standard.

At each time point, the expression levels of genes in the network are governed by the expression levels of their regulators, a degradation factor and a noise factor. In contrast, the expression levels of the remaining genes randomly fluctuate within the upper and lower expression level bounds. Due to the stochastic nature of the noise in the system, the output will slightly differ in each replicate but it will be governed by the same underlying network interactions.

In this work 10 replicates are used, each one comprising 20 gene expression levels at 200 time points, sampled at five-minute intervals. This multiple replicates method was originally suggested in [24] for solving the dimensionality problem, by combining multiple microarray datasets from different conditions for inferring the GRN, in order to derive the most consistent network structure with respect to all the datasets. Similarly, we used several replicates in order to train our models with enough samples from the same GRN. 80% of the whole dataset is used for training, and the remaining 20% for validation.

3.2 Five-gene network in Yeast

Yeast (Saccharomyces cerevisiae) cell cycle-regulated genes were identified by microarray hybridization [25]. The time series of gene expression data from yeast protein synthesis² has 17 sampling points and the observation interval is 10 min. Data between time points have been normalized with respect to each other. In this work, data for five genes have been selected (HAP1, CYB2, CYC7, CYT1, COX5A) because the relations among those genes have been reported and validated by biological experiments [26].

4 EXPERIMENTAL SETUP

The experimental setup is explained here, detailing the configuration of the neural networks. The performance measures for evaluating the proposed GRNNminer are also presented.

4.1 Neural models configuration

A NN with temporal delays has a particular input layer, in which the number of elements depends on the number of temporal delays (τ) considered in the input data. In this work, two values were considered: τ ∈ {2, 3}. As regards the hidden layer, a simple heuristic was adopted in which the number of hidden neurons was 50%, 100%, 150% and 200% of the number of inputs. For example, for an input layer with 4 elements (τ = 3), the hidden layer can have 2, 4, 6 or 8 neurons. The output layer has only one neuron, which represents the value of the target gene expression at time t.

Weights and biases were initialized according to the Nguyen-Widrow initialization algorithm [27], which chooses random values in order to distribute the active region of each neuron approximately evenly across the input space. The models were trained using a function that updates weights and biases according to Levenberg-Marquardt optimization [28]. The numbers of training epochs used as stopping criterion were 50, 100 and 500. The parameter θ was tested between 0 and 1, in increments of 0.01.
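The configurations described in this subsection can be enumerated with a short sketch. Two points are assumptions and not statements from the paper: fractional hidden sizes are rounded up (the text only gives the τ = 3 case, where no rounding is needed), and the settings are combined as a full cross product. The names `hidden_sizes` and `configs` are hypothetical.

```python
import math
from itertools import product


def hidden_sizes(n_inputs, fractions=(0.5, 1.0, 1.5, 2.0)):
    """Hidden layer sizes as fractions of the number of inputs.

    Rounding up is an assumption; it only matters when n_inputs is odd.
    """
    return [math.ceil(n_inputs * f) for f in fractions]


# One configuration per (tau, hidden size, epochs) combination.
configs = [
    {"tau": tau, "inputs": tau + 1, "hidden": hidden, "epochs": epochs}
    for tau in (2, 3)
    for hidden, epochs in product(hidden_sizes(tau + 1), (50, 100, 500))
]

print(hidden_sizes(4))  # [2, 4, 6, 8], the tau = 3 example from the text
print(len(configs))     # 24 configurations under the cross-product assumption
```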

4.2 Performance measures

Two measures were used in this work: accuracy and sensitivity. Accuracy is the proportion of true results (both true positives and true negatives) in the population, whereas sensitivity relates to the ability to identify positive results. The known GRN of the dataset (Figure 7) was used as the golden reference for performance calculation. Accuracy and sensitivity are defined as

$$A = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (5)$$

$$S = \frac{TP}{TP + FN}, \qquad (6)$$

where TP, TN, FP and FN stand for true positives, true negatives, false positives and false negatives, respectively.

2. http://www.wanghaixin.com/biowq2007.html

TABLE 2
Accuracy when the highest sensitivity is achieved in each experiment.

Window size (τ)  Epochs  θ     A [%]  S [%]
2                50      0.02  99.71  100.00
2                100     0.02  99.12   88.89
2                500     0.02  99.42  100.00
3                50      0.02  99.12   88.89
3                100     0.02  98.83   77.78
3                500     0.02  99.12   88.89

It is important to note that if accuracy is 1, all the relations are discovered and there are no false positives in the results. That is to say, when accuracy is 1, sensitivity is 1, but not the opposite. On the other hand, if sensitivity is 1, all the desired relations are discovered, but there can be some false positives present in the results.
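Both measures are straightforward to compute from the four counts. In the sketch below, the sensitivity example uses the eight-of-nine case discussed for the artificial network, while the counts in the accuracy example are hypothetical and only illustrate that sensitivity can be 1 while accuracy is below 1.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (5): proportion of true results among all evaluated relations."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Eq. (6): proportion of existing relations that were discovered."""
    return tp / (tp + fn)


# Eight of nine reference relations found: sensitivity of 88.89%.
print(round(100 * sensitivity(tp=8, fn=1), 2))  # 88.89

# Hypothetical counts: every relation found, plus one false positive.
print(sensitivity(tp=9, fn=0))                  # 1.0
print(accuracy(tp=9, tn=90, fp=1, fn=0))        # 0.99
```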

5 RESULTS AND DISCUSSION

The experimental results obtained on two GRN mining problems are shown in this section. First, the performance measures obtained from mining the artificial dataset are shown, in order to demonstrate the strengths of the proposed method. The discovered GRN is analysed and compared with related work. Second, the results from the application of GRNNminer to a real dataset are discussed.

Table 2 presents global results of the proposed mining approach on the artificial data set. The first column indicates the window sizes used, representing the temporal delay taken into account at the NN input. The different numbers of training epochs considered are shown in the second column. The third column presents the best θ for the threshold rule. The accuracy is presented in column four, while the sensitivity is shown in the last column.

Analyzing this table, two important results can be highlighted. First, by using GRNNminer it is possible to obtain all the relations of the golden reference, because the sensitivity was 100% in two cases (rows 1 and 3). Secondly, accuracies are higher than 99% for these two cases, so that there are very few extra relations discovered. The best accuracy obtained was 99.71% (row 1), which means that there is only one extra relation, and the proposed method is capable of discovering a GRN that is almost identical to the golden reference. That is to say, all of the gene-to-gene relations that have to be discovered were obtained, adding only one relation that is not present in the reference network. Sensitivity and accuracy show that it is possible to find a good model that takes into account the temporal structure of the data independently of the number of epochs. That is, the NN with temporal delays can identify the dynamics of the actual model, achieving very similar results with either a high or a low number of training epochs. Therefore, the neural models proposed here can effectively deal with the identification of two related genes by modeling their temporal profiles.

[Figure: network diagram with nodes for activity and genes 1 to 10; the discovered edges carry the scores 20, 20, 13, 17, 20, 7, 16, 17, 19 and 16.]

Fig. 8. Discovered GRN. Dotted lines indicate the activity influence over genes 1 and 4. Thick lines are the relations discovered by GRNNminer that match the golden reference relations in the dataset. The thin line is the FP that was discovered by the proposed method. Each line has its corresponding score.

From the score matrix it is possible to obtain the GRN shown in Figure 8. It can be clearly seen that GRNNminer was able to discover all the regulatory interactions existing between genes, as shown by thick lines. Only one extra relation was discovered, indicated with a thin line (gene 4 → gene 1).

sinc(i) Research Center for Signals, Systems and Computational Intelligence (fich.unl.edu.ar/sinc)
M. Rubiolo, G. Stegmayer & D. H. Milone; "Mining gene regulatory networks by neural modeling of expression time-series", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015.

In this same problem, our proposal was compared against five methods. Results are shown in Table 3. The first column lists the methods under comparison. The second and third columns report the accuracy and the sensitivity, respectively, where the accuracy is the one obtained when the highest sensitivity is achieved. The first row shows the results of a method based on Pearson correlation [29], where the correlation over all possible pairwise relationships between genes is calculated, and only those values that are higher than a threshold are selected as indicative of a possible regulation relationship between two genes. To determine this threshold, all possible values between 0 and 1 were evaluated (with step 0.01), selecting the one that gave the best sensitivity and accuracy. Similarly, the Rank Correlation [29] and Mutual Information [30] methods are reported in the second and third rows, respectively.

TABLE 3
Comparison of GRNNminer against literature methods.

Method                 Accuracy, A [%]   Sensitivity, S [%]
Pearson correlation    92.69             100.00
Rank correlation       83.92             100.00
Mutual Information     65.79             100.00
Knott et al. [13]      99.71             88.89
Smith et al. [22]      98.54             88.89
GRNNminer              99.71             100.00
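The threshold search used for the correlation baselines can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `pearson_scores` and `sweep_threshold` are hypothetical, and the selection rule (highest sensitivity first, then highest accuracy, keeping the lowest threshold on ties) is an assumption based on the description of Table 3.

```python
import numpy as np

def pearson_scores(profiles):
    """Absolute Pearson correlation between rows (genes) of a genes x time array."""
    return np.abs(np.corrcoef(np.asarray(profiles, dtype=float)))

def sweep_threshold(scores, truth, step=0.01):
    """Try thresholds 0, step, 2*step, ..., 1 and return (threshold, sensitivity, accuracy).

    Assumed rule: maximize sensitivity first, then accuracy.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth).astype(bool)
    off_diag = ~np.eye(scores.shape[0], dtype=bool)   # ignore self-relations
    s = scores[off_diag]
    t = truth[off_diag]
    best = None
    for k in range(int(round(1 / step)) + 1):
        thr = k * step
        pred = s > thr                                # relation predicted if score exceeds threshold
        tp = np.sum(pred & t)
        tn = np.sum(~pred & ~t)
        fn = np.sum(~pred & t)
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        acc = (tp + tn) / t.size
        if best is None or (sens, acc) > best[:2]:
            best = (sens, acc, thr)
    sens, acc, thr = best
    return thr, float(sens), float(acc)
```

Because a correlation matrix is symmetric, a baseline of this kind cannot by itself tell which gene of a pair is the regulator.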

In comparison to the method proposed in Knott et al. [13] for the same GRN, our proposal was able to discover all the interactions, while the Knott et al. method was able to infer only eight of the nine regulatory interactions. Their analysis was unable to identify gene 3 as a regulator of gene 6. Since it was not specified in their work, we assumed the most favorable case for them, namely that they did not have any false positive. Another related work [22] that used the same data set and GRN was also unable to predict the regulation of gene 6 by gene 3, and it additionally found five incorrect extra links between genes. Its corresponding accuracy and sensitivity are reported in the fifth row. It should be highlighted here that the GRNNminer presented in this work, instead, was capable of finding this important regulation between genes 3 and 6, with the best accuracy and sensitivity (as shown in the sixth row).
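As a worked check of the figures discussed above, the sketch below computes sensitivity and accuracy from counts of true positives, false positives and false negatives, assuming the usual confusion-matrix definitions. The total of 342 candidate relations is an assumption made here for illustration (it is not stated in this section); it is chosen because it reproduces the reported percentages.

```python
def sensitivity(tp, fn):
    """Fraction of reference relations that were recovered."""
    return tp / (tp + fn)

def accuracy(tp, fp, fn, n_candidates):
    """Fraction of candidate relations classified correctly."""
    tn = n_candidates - tp - fp - fn
    return (tp + tn) / n_candidates

# ASSUMPTION: total number of candidate relations, not given in this section.
N_CANDIDATES = 342

# Knott et al. as described in the text: 8 of 9 relations found, no false positives assumed.
knott_sens = 100 * sensitivity(tp=8, fn=1)
knott_acc = 100 * accuracy(tp=8, fp=0, fn=1, n_candidates=N_CANDIDATES)

# GRNNminer as described in the text: all 9 relations found plus one extra relation.
grnn_sens = 100 * sensitivity(tp=9, fn=0)
grnn_acc = 100 * accuracy(tp=9, fp=1, fn=0, n_candidates=N_CANDIDATES)
```

Under that assumption both methods make a single error, which is why they share the same accuracy while differing in sensitivity.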

Analyzing Table 3 in detail, it can be stated that, although the proposals of Knott et al. and Smith et al. recover most of the reference network (accuracies of 99.71% and 98.54%, respectively), they cannot fully reconstruct it, since the sensitivity was 88.89% in both cases. As regards the Rank correlation and Mutual Information methods, although they obtain the complete reference network, they also produce so many false positives that the performance of both methods is negatively affected (accuracies of 83.92% and 65.79%, respectively). By using the Pearson correlation, full sensitivity and a high accuracy (92.69%) can be obtained. However, GRNNminer has a better accuracy (99.71%), which means that it not only obtains the reference network, but also produces only one false positive. This better performance of GRNNminer is due to the advantage of modeling the dynamic relationship between gene profiles. Unlike the traditional correlation methods, the temporal dynamics are considered explicitly in GRNNminer by modeling gene regulation through neural networks with temporal delays at the input.

[Figure 9: network diagram over the nodes HAP1, CYT1, CYC7, COX5 and CYB2, with a score on each edge.]

Fig. 9. Discovered yeast GRN. Thick lines are the relations discovered by GRNNminer that matched the golden reference relations in the yeast dataset. The thin line is the discovered relation that is not present in the subjacent network. Dashed lines are the relations that the proposal was not able to discover because it discards mutual regulations. Each line has its corresponding score.
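To illustrate what temporal delays at the input mean in practice, the sketch below builds training pairs in which the expression of a target gene at time t is predicted from the previous values of a candidate regulator. The function name `lagged_pairs`, the number of delays and the exact arrangement of the delayed inputs are assumptions made for illustration; this section does not specify them.

```python
import numpy as np

def lagged_pairs(source, target, n_delays):
    """Build (X, y) so that y[i] = target[t] and X[i] = [source[t-1], ..., source[t-n_delays]].

    The first n_delays time points are dropped because they lack a full history.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    n_points = len(source)
    # Column d-1 holds the source profile delayed by d time steps.
    X = np.stack(
        [source[n_delays - d : n_points - d] for d in range(1, n_delays + 1)],
        axis=1,
    )
    y = target[n_delays:]
    return X, y

# Toy profiles with 5 time points and 2 delays.
X, y = lagged_pairs([0, 1, 2, 3, 4], [10, 11, 12, 13, 14], n_delays=2)
```

A model trained on such pairs sees the order of the samples, whereas a correlation coefficient computed over the same two profiles does not.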

In order to further evaluate the scalability and robustness of GRNNminer, this experimental work was extended to discover the same GRN with 100 genes (9900 putative relationships between genes). The accuracy increased to 99.99% at a sensitivity of 100%. This shows the capability of the proposed method for mining GRNs from larger datasets.

The GRNNminer method was also tested with the real dataset described in Section 3.2. Figure 9 shows the GRN reconstruction after the application of the mining method. In this case, three correct relations have been found: HAP1→COX5, HAP1→CYT1 and CYC7→CYB2 (thick lines). All the relations discovered by the proposed method are real regulations between genes in the yeast protein synthesis pathway. These results are in agreement with the experimental findings in [26]. Since mutual regulation is not considered in GRNNminer, the CYB2→CYC7 relation has not been found (dashed line). Another case to be analyzed is the false positive between HAP1 and CYT1. Although the real direction of the relation between CYT1 and HAP1 is the opposite, it should be highlighted that the relation was effectively found by the proposed approach. The same problem was addressed in [31], where the same relations between genes were discovered, plus one extra regulation CYB2→CYC7 and another one HAP1→CYT1. That work also found an extra relation COX5→HAP1 that is not actually present in the real network. It is worth noting that this false positive has not been indicated by GRNNminer because mutual regulation is not considered in the mining rules, and our approach based on the minimum error of the neural models has correctly identified the regulation HAP1→COX5 as stronger (it has obtained a higher score) than COX5→HAP1.
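The effect of not admitting mutual regulation can be sketched as a simple rule over the score matrix: for every pair of genes, keep only the direction with the higher score. This is an illustrative reading of the behavior described above, not the complete set of mining rules; the function name `drop_mutual`, the `min_score` parameter, the handling of ties and the score values in the example are all hypothetical.

```python
import numpy as np

def drop_mutual(scores, min_score=0.0):
    """Return directed edges (regulator, target, score), one direction at most per gene pair.

    scores[i, j] is the score of the candidate relation i -> j.
    Pairs whose best score does not exceed min_score are discarded.
    Ties are left out of the result in this sketch because the direction is ambiguous.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            forward, backward = float(scores[i, j]), float(scores[j, i])
            if max(forward, backward) <= min_score:
                continue
            if forward > backward:
                edges.append((i, j, forward))
            elif backward > forward:
                edges.append((j, i, backward))
    return edges

# Hypothetical scores for two genes: index 0 = HAP1, index 1 = COX5.
genes = ["HAP1", "COX5"]
example_scores = [[0, 18], [12, 0]]
example_edges = drop_mutual(example_scores)
named_edges = [(genes[a], genes[b]) for a, b, _ in example_edges]
```

With these made-up scores only HAP1→COX5 survives, which mirrors the outcome reported for the yeast dataset. The same rule also explains why a true mutual regulation cannot be recovered in both directions.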

6 CONCLUSIONS AND FUTURE WORK

A novel approach for successfully obtaining a GRN from gene expression data was proposed and explained in detail. GRNNminer involves modeling the interaction of each pair of genes with NNs. Specifically, a pool of multilayer perceptrons was used to model all the possible gene-to-gene relations between the genes present in the dataset. The ability to model each relation was measured according to the generalization error. A scoring method was proposed in order to determine the most probable interactions in the dataset. From the resulting scoring matrix, a set of mining rules was applied to discover the underlying GRN. Several experiments were done over a known GRN present in an artificial dataset, in comparison against traditional and recent methods in the literature. By calculating sensitivity and accuracy measures, experimental results demonstrate that the proposal is capable of discovering all the gene-to-gene relations and reconstructing the subjacent GRN. Moreover, by applying the same method to a real dataset, it was possible to obtain the interactions between real genes. The capability of our mining approach to identify the potential existing relations between genes could be very useful because it allows attention to be focused on a reduced set of genes. GRNNminer could be used as a starting point from which researchers can reconstruct or build a regulation pathway, later to be tested or validated through wet-lab experiments.

Future work involves scaling up the proposed method to more complex problems, including a large number of genes. Also, in order to improve the estimation of the temporal delays to be considered at the input of each neural model, we will study the automatic evaluation of the temporal dynamics of the gene expression data.

REFERENCES

[1] M. Hecker, S. Lambeck, S. Toepfer, E. van Someren, and R. Guthke, “Gene regulatory network inference: Data integration in dynamic models - A review,” Biosystems, vol. 96, no. 1, pp. 86–103, 2009.

[2] A. J. Hartemink, “Reverse engineering gene regulatory networks,” Nat Biotech, vol. 23, no. 5, pp. 554–555, May 2005.

[3] R.-S. Wang, Y. Wang, X.-S. Zhang, and L. Chen, “Inferring transcriptional regulatory networks from high-throughput data,” Bioinformatics, vol. 23, no. 22, pp. 3056–3064, Nov. 2007.

[4] H. D. Jong, “Modeling and simulation of genetic regulatory systems: A literature review,” Journal of Computational Biology, vol. 9, pp. 67–103, 2002.

[5] M. P. Styczynski and G. Stephanopoulos, “Overview of computational methods for the inference of gene regulatory networks,” Computers & Chemical Engineering, vol. 29, no. 3, pp. 519–534, 2005.

[6] S. Mitra, R. Das, and Y. Hayashi, “Genetic networks and soft computing,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 94–107, 2011.

[7] D. Muraro, U. Voss, M. Wilson, M. Bennett, H. Byrne, I. D. Smet, C. Hodgman, and J. King, “Inference of the genetic network regulating lateral root initiation in arabidopsis thaliana,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 1, pp. 50–60, 2013.

[8] S. Bornholdt, “Boolean network models of cellular regulation: prospects and limitations.” J R Soc Interface, vol. 5 Suppl 1, 2008.

[9] C. H. A. Higa, T. P. Andrade, and R. F. Hashimoto, “Growing seed genes from time series data and thresholded boolean networks with perturbation,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 1, pp. 37–49, 2013.

[10] A. Werhli and D. Husmeier, “Reconstructing Gene Regulatory Networks with Bayesian Networks by Combining Expression Data with Multiple Sources of Prior Knowledge,” Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, 2007.

[11] D. Husmeier, “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic bayesian networks,” Bioinformatics, vol. 19, no. 17, pp. 2271–2282, 2003.

[12] S. Ando and H. Iba, “Inference of gene regulatory model by genetic algorithms,” in Evolutionary Computation, 2001. Proceedings of the 2001 Congress on, vol. 1, 2001, pp. 712–719.

[13] S. Knott, S. Mostafavi, and P. Mousavi, “A neural network based modeling and validation approach for identifying gene regulatory networks,” Neurocomputing, vol. 73, no. 13-15, pp. 2419–2429, 2010.

[14] H. Hache, H. Lehrach, and R. Herwig, “Reverse engineering of gene regulatory networks: A comparative study,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2009, no. 1, p. 617281, 2009.

[15] H. L. H. Hache, C. Wierling and R. Herwig, “Reconstruction and validation of gene regulatory networks with neural networks,” in Proceedings of the 2nd Foundations of Systems Biology in Engineering Conference (FOSBE 07), 2007, pp. 319–3241.

[16] A. Chowdhury, M. Chetty, and N. Vinh, “Incorporating time-delays in s-system model for reverse engineering genetic networks,” BMC Bioinformatics, vol. 14, no. 1, p. 196, 2013.

[17] N. A. Barker, C. J. Myers, and H. Kuwahara, “Learning genetic regulatory network connectivity from time series data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 152–165, 2011.

[18] G. Stegmayer and O. Chiotti, “Volterra NN-based behavioral model for new wireless communications devices,” Neural Computing and Applications, vol. 18, pp. 283–291, 2009.

[19] D. Chaturvedi, Soft Computing Techniques - Studies in Computational Intelligence. Springer-Verlag, London, 2008.

[20] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice-Hall, 1999.

[21] D. Rumelhart, G. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 1, pp. 533–536, 1986.

[22] V. A. Smith, E. D. Jarvis, and A. J. Hartemink, “Evaluating functional network inference using simulations of complex biological systems,” Bioinformatics, vol. 18, no. 1, pp. S216–S224, 2002.

[23] J. Yu, V. A. Smith, P. P. Wang, A. J. Hartemink, and E. D. Jarvis, “Advances to bayesian network inference for generating causal networks from observational biological data,” Bioinformatics, vol. 20, no. 18, pp. 3594–3603, 2004.

[24] Y. Wang, T. Joshi, X.-S. Zhang, D. Xu, and L. Chen, “Inferring gene regulatory networks from multiple microarray datasets,” Bioinformatics, vol. 22, no. 19, pp. 2413–2420, 2006.

[25] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher, “Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization,” Mol Biol Cell, vol. 9, no. 12, pp. 3273–3297, Dec. 1998.

[26] J. C. Schneider and L. Guarente, “Regulation of the yeast cyt1 gene encoding cytochrome c1 by hap1 and hap2/3/4.” Molecular and cellular biology, vol. 11, no. 10, pp. 4934–4942, 1991.

[27] D. Nguyen and B. Widrow, “Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights.” in Proceedings of the International Joint Conference on Neural Networks, 1990, pp. 21–26.

[28] D. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” SIAM Journal on Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.

[29] J. Gibbons and S. Chakraborti, Nonparametric Statistical Inference, Fourth Edition: Revised and Expanded. Taylor & Francis, 2014.

[30] T. Cover and J. Thomas, Elements of information theory. Wiley, 1991.

[31] B. Yang, Y. Chen, and M. Jiang, “Reverse engineering of gene regulatory networks using flexible neural tree models,” Neurocomput., vol. 99, pp. 458–466, Jan. 2013.

MARIANO RUBIOLO received the Engineering degree in Information Systems from Universidad Tecnologica Nacional Facultad Regional Santa Fe, Argentina, in 2007, and the Ph.D. degree in Engineering in Information Systems from Universidad Tecnologica Nacional Facultad Regional Santa Fe, Argentina, in 2014. He was Auxiliar Professor at the Department of Information System Engineering in UTN-FRSF from 2008 to 2014. Recently, he was named Full Adjunct Professor in the same department. He is currently Research Assistant at the National Scientific and Technical Research Council (CONICET) Argentina. His current research interests are in the fields of computational intelligence, data mining and bioinformatics.

DIEGO H. MILONE received the Bioengineering degree (Hons.) from National University of Entre Rios (UNER), Argentina, in 1998, and the Ph.D. degree in Microelectronics and Computer Architectures from Granada University, Spain, in 2003. He was with the Department of Bioengineering and the Department of Mathematics and Informatics at UNER from 1995 to 2002. Since 2003 he has been Full Professor in the Department of Informatics at National University of Litoral (UNL). Since 2006 he has been a Research Scientist at the National Scientific and Technical Research Council (CONICET). From 2009 to 2011 he was Director of the Department of Informatics and from 2010 to 2014 he was Assistant Dean for Science and Technology. His research interests include statistical learning, pattern recognition, signal processing, neural and evolutionary computing, with applications to speech recognition, computer vision, biomedical signals and bioinformatics.

GEORGINA STEGMAYER received the Engineering degree in Information Systems from Universidad Tecnologica Nacional Facultad Regional Santa Fe, Argentina, in 2000, and the Ph.D. degree in Electronic Devices from Politecnico di Torino, Italy, in 2006. Since 2007 she has been Professor at the Department of Informatics in UTN-FRSF and UNL. She is currently Adjunct Researcher at the National Scientific and Technical Research Council (CONICET) Argentina. She is author and co-author of numerous papers in journals, book chapters and conference proceedings on artificial neural networks. Her current research interests are in the fields of data mining and pattern recognition with application to bioinformatics.
