North Carolina Agricultural and Technical State University


Inferring stable gene regulatory networks from steady-state data

Presenter: Joy Edward Larvie

Collaborator: Mohammad Gorji

Advisor: Dr. Abdollah Homaifar


2

Outline

Motivation

Background

Introduction

Existing Techniques

Objectives

Methodology

Results and Discussion

Conclusion

Future Work

References


3

Motivation

Genetic networks are useful in drug discovery, where identifying targeted pathways is crucial

They promote biological knowledge and medical diagnosis

Existing genetic network identification techniques have inherent limitations


4

Background

Traditional techniques for gene expression studies are limited in both breadth and efficiency

Researchers could study only one or a few genes at a time

DNA microarray technology lets researchers analyze the expression patterns of tens of thousands of genes at a time

DNA microarrays have become a standard tool for genomic research

The multi-step, data-intensive nature of the technology has created a huge informatics and analytical challenge

Fig. 1: Overview of DNA microarray experiment


5

Introduction

Understanding the nature of cellular functions requires studying gene behavior from a global perspective

Genes are typically regulated through complex interconnections of cellular components, such as proteins

These interactions form the basis of several cellular pathways and molecular processes in living cells

Identifying these gene regulatory networks (GRNs) promotes biological knowledge, medical diagnosis and drug design, and helps to identify the molecular targets of pharmacological compounds


6

Introduction (cont’d)

The large number of genes involved in GRNs makes network recovery a very complex task

The advent of high-throughput technologies such as DNA microarrays has provided a powerful tool that allows large amounts of expression data to be collected in a single experiment

The increasing availability of data has boosted the network recovery task

Two categories of expression data:
» Temporal expression data
» Steady-state expression data

Several machine learning algorithms have been proposed for network identification

Commonly used approaches include clustering, Bayesian networks, Boolean networks and dynamic Bayesian networks


7

Existing Techniques

Computational approach: Boolean network
Strengths:
• Can analyse large regulatory networks
• Easier to interpret due to its simplicity
• Biologically realistic, complex phenomena can be represented by a simple Boolean formalism
Weaknesses:
• Deterministic in nature
• Unable to handle incomplete regulatory network data
• Involves only two representative states for the gene expression level
• High computing time is needed
• Most BNs can only work with a small number of genes

Computational approach: Probabilistic Boolean network
Strengths:
• Copes with uncertainties
• Two or more transition functions are allowed for each variable
• The use of positive feedback and probabilities can make the model work more effectively
• Compared to DBN, PBN can explain more details of the regulatory roles of different sets of genes
Weaknesses:
• Difficult to apply to large-scale networks
• High computational complexity
• Cannot cope with instantaneous interactions between variables

Computational approach: Bayesian network
Strengths:
• Able to handle noisy data
• Handles uncertainty
• Able to work on logically interacting components with a small number of variables
• Integrates prior knowledge to strengthen the causal relationship
• Infers the structure of the network statistically
Weaknesses:
• Hard to distinguish between the origin and the target of an interaction
• Feedback loops not allowed
• Fails to capture the temporal information of time-series microarray data
• Supports only small-sized gene regulatory networks
• Combinatorial learning of the Bayesian network

Computational approach: Dynamic Bayesian network
Strengths:
• Able to model cyclic interactions among genes
• Can handle stochastic components
• Well suited for handling time-series gene expression data
• Models indirect or direct causal relationships
• Handles perturbation or structural modification of networks
Weaknesses:
• Excessive computation time and cost
• Performance restricted by missing values in gene expression data
• Supports only small-sized gene regulatory networks

Computational approach: Ordinary differential equation
Strengths:
• Produces directed signed graphs
• Suited for steady-state and time-series expression profiles
• Can work entirely in classical category
Weaknesses:
• Only applicable to small networks
• Difficult to find appropriate parameter values that fit the data

Computational approach: Neural network
Strengths:
• Able to recognize input patterns
• Able to model any functional relationship and data structure
• Captures nonlinear and dynamic interactions
• Noise resistant
Weaknesses:
• Difficult to train efficiently, since the learning rate must be defined for each data situation
• High computational complexity, so it can only be applied to very small systems

Table 1: Comparison of existing network identification techniques


8

Objectives

Infer a stable, sparse and causal genetic network from steady-state microarray data

Understand the inherent gene-gene interactions within the inferred network for biological studies and research


9

Methodology

The vector autoregressive (VAR) model is known to be one of the most flexible and easy-to-use models for analyzing multivariate time series

It is particularly useful for forecasting and for modeling the dynamic behavior of economic and financial time series

In general, a first-order VAR [VAR(1)] model is defined as:

$$y_t = A\,y_{t-1} + u_t$$

where $y_t$ is a $K \times 1$ random vector, $A$ is a fixed $K \times K$ coefficient matrix, and $u_t$ is a K-dimensional white noise or innovation process, that is, $E(u_t) = 0$, $E(u_t u_t^{\top}) = \Sigma_u$, and $E(u_t u_s^{\top}) = 0$ for $s \neq t$

VAR processes become intractable in high-dimensional space
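
As a minimal sketch of the VAR(1) recursion, the Python snippet below simulates a two-variable process. The coefficient matrix, the noise level and the helper name `simulate_var1` are illustrative assumptions, not values from the slides.

```python
import numpy as np


def simulate_var1(A, n_steps, noise_std=0.1, seed=0):
    """Simulate y_t = A y_{t-1} + u_t with Gaussian white noise u_t.

    Returns an array with one row per time step, starting from y_0 = 0.
    """
    rng = np.random.default_rng(seed)
    K = A.shape[0]
    Y = np.zeros((n_steps, K))
    for t in range(1, n_steps):
        # Each step applies the coefficient matrix and adds fresh noise.
        Y[t] = A @ Y[t - 1] + rng.normal(0.0, noise_std, size=K)
    return Y


# Hypothetical 2x2 coefficient matrix chosen only for illustration.
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])
Y = simulate_var1(A, n_steps=200)
print(Y.shape)
```

With K variables the matrix has K x K free entries, which is why the number of parameters grows quickly as more genes are added.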


10

Methodology (cont’d)

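The Least Absolute Shrinkage and Selection Operator VAR (Lasso-VAR) model addresses this intractability

It applies an L1 penalty to the convex least-squares objective function

Some entries of the coefficient matrix are driven to exactly zero

The resulting objective function is:

$$\min_{A} \; \sum_{t} \| y_t - A\,y_{t-1} \|_F^2 + \lambda \| A \|_1$$

where $y_t$ is the expression vector at step $t$, $\|\cdot\|_F^2$ is the square of the Frobenius norm of its argument, $\|A\|_1$ is the L1 norm of the coefficient matrix $A$ (the sum of the absolute values of its entries), and $\lambda$ is a penalty parameter

Sparsity is induced by the penalty term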
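
The following is a rough sketch of how an L1-penalized VAR(1) fit can be computed, using proximal gradient descent (ISTA) with entrywise soft-thresholding. The solver choice, the function names, the toy data and the value of `lam` are all assumptions made for illustration; the slides do not say which solver was used.

```python
import numpy as np


def soft_threshold(M, tau):
    """Entrywise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)


def objective(A, Y, lam):
    """Squared Frobenius residual of the VAR(1) fit plus the L1 penalty."""
    residual = Y[1:] - Y[:-1] @ A.T
    return np.sum(residual ** 2) + lam * np.sum(np.abs(A))


def lasso_var1(Y, lam, n_iter=500):
    """Estimate a sparse VAR(1) matrix by proximal gradient descent (ISTA).

    Y has one row per time step and one column per variable.
    """
    Y0, Y1 = Y[:-1], Y[1:]
    K = Y.shape[1]
    A = np.zeros((K, K))
    # Lipschitz constant of the gradient of the squared-error term.
    L = 2.0 * np.linalg.norm(Y0, 2) ** 2
    if L == 0.0:
        return A
    for _ in range(n_iter):
        grad = 2.0 * (Y0 @ A.T - Y1).T @ Y0
        A = soft_threshold(A - grad / L, lam / L)
    return A


# Toy data (assumed): simulate from a sparse, stable coefficient matrix.
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.0, 0.0],
                   [0.4, 0.3, 0.0],
                   [0.0, 0.0, -0.2]])
Y = np.zeros((300, 3))
for t in range(1, 300):
    Y[t] = A_true @ Y[t - 1] + rng.normal(0.0, 0.1, size=3)

A_hat = lasso_var1(Y, lam=0.4)
print(np.round(A_hat, 3))
```

Raising `lam` pushes more entries of the estimate to exactly zero; a sufficiently large value returns the all-zero matrix.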


11


A VAR(1) process is said to be stable iff all eigenvalues of the coefficient matrix $A$ have absolute value less than one

This is mathematically equivalent to:

$$\det(I_K - A z) \neq 0 \quad \text{for } |z| \le 1$$

Stability can be incorporated as a constraint, which helps to specify a stable gene regulatory network from steady-state data

The constraint is imposed by applying Geršgorin’s Circle Theorem to the original Lasso-VAR technique
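
The eigenvalue condition can be checked numerically. The sketch below uses two hypothetical triangular matrices, so their eigenvalues can be read off the diagonal; the function names are illustrative.

```python
import numpy as np


def spectral_radius(A):
    """Largest absolute value among the eigenvalues of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))


def is_stable_var1(A):
    """A VAR(1) process is stable when the spectral radius of A is below one."""
    return spectral_radius(A) < 1.0


# Hypothetical upper-triangular matrices: eigenvalues are the diagonal entries.
A_stable = np.array([[0.5, 0.1],
                     [0.0, 0.3]])
A_unstable = np.array([[1.2, 0.0],
                       [0.0, 0.3]])

print(is_stable_var1(A_stable))    # True: eigenvalues are 0.5 and 0.3
print(is_stable_var1(A_unstable))  # False: one eigenvalue is 1.2
```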


12


Geršgorin’s Circle Theorem:

All the eigenvalues of a square matrix $A = (a_{ij})$ lie in the union of the disks whose boundaries are circles with centers at the points $a_{ii}$ and whose radii are:

$$r_i = \sum_{j \neq i} |a_{ij}|$$

In essence, the eigenvalues of a square matrix cannot be too far from its diagonal entries.


13

Consider the following example.

$$M = \begin{pmatrix} 4 & 2 & 3 \\ -2 & -5 & 8 \\ 1 & 0 & 3 \end{pmatrix}$$

The circles that bound the eigenvalues are:

C1: Center point (4,0) with radius r1 = |2|+|3| = 5

C2: Center point (-5,0) with radius r2 = |-2|+|8| = 10

C3: Center point (3,0) with radius r3 = |1|+|0| = 1

[Figure: the three circles (left) and the union of the circles (right), both on axes running from -15 to 10; the red dots in the right panel mark the actual locations of the eigenvalues]


14

In Geršgorin’s Circle Theorem the y-axis is interpreted as the imaginary axis. Since the roots of the characteristic polynomial can be complex numbers, they take the form x+iy, where i is the square root of -1.

$$M = \begin{pmatrix} 1 & 0 & 7 \\ 2 & -5 & 0 \\ 4 & 4 & -3 \end{pmatrix}$$

The circles that bound the eigenvalues are:

C1: Center point (1,0) with radius r1 = |0|+|7| = 7

C2: Center point (-5,0) with radius r2 = |2|+|0| = 2

C3: Center point (-3,0) with radius r3 = |4|+|4| = 8

[Figure: the three circles on axes running from -15 to 10]

All the eigenvalues lie inside the union of all the circles.
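
The two worked examples can be reproduced numerically. In the sketch below the matrices are read off from the circle centers and radii listed on the slides, and the helper names are illustrative assumptions.

```python
import numpy as np


def gershgorin_discs(M):
    """Return (centers, radii): diagonal entries and off-diagonal absolute row sums."""
    M = np.asarray(M, dtype=float)
    centers = np.diag(M).copy()
    radii = np.sum(np.abs(M), axis=1) - np.abs(centers)
    return centers, radii


def eigenvalues_in_union(M, tol=1e-9):
    """Check that every eigenvalue lies in at least one Gershgorin disc."""
    centers, radii = gershgorin_discs(M)
    eigvals = np.linalg.eigvals(np.asarray(M, dtype=float))
    # Distance of each (possibly complex) eigenvalue to each disc center.
    dist = np.abs(eigvals[:, None] - centers[None, :])
    return bool(np.all(np.any(dist <= radii[None, :] + tol, axis=1)))


M1 = [[4, 2, 3], [-2, -5, 8], [1, 0, 3]]
M2 = [[1, 0, 7], [2, -5, 0], [4, 4, -3]]

for name, M in (("first example", M1), ("second example", M2)):
    centers, radii = gershgorin_discs(M)
    print(name, "centers:", centers, "radii:", radii,
          "eigenvalues inside union:", eigenvalues_in_union(M))
```

The complex absolute value in `eigenvalues_in_union` is what makes the y-axis act as the imaginary axis.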


15


Overall objective function:

The added penalty is chosen to penalize Geršgorin discs that lie far in the left half-plane
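
One common way to turn Geršgorin’s theorem into a stability test is to require every disc to lie strictly in the left half-plane, which guarantees that all eigenvalues have negative real part. The sketch below illustrates that row-wise check; it is an assumption about the general shape of such a constraint, not a reproduction of the objective on this slide, and the matrices and names are hypothetical.

```python
import numpy as np


def disc_right_edges(A):
    """Rightmost real point of each row's disc: a_ii + sum over j != i of |a_ij|."""
    A = np.asarray(A, dtype=float)
    diag = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(diag)
    return diag + radii


def discs_in_left_half_plane(A, margin=0.0):
    """True when every disc lies to the left of the line Re(z) = -margin."""
    return bool(np.all(disc_right_edges(A) < -margin))


# Hypothetical matrices for illustration only.
A_ok = np.array([[-1.0, 0.5],
                 [0.2, -0.4]])
A_bad = np.array([[-1.0, 1.5],
                  [0.2, -0.4]])

print(discs_in_left_half_plane(A_ok))   # True: right edges are -0.5 and -0.2
print(discs_in_left_half_plane(A_bad))  # False: the first disc reaches 0.5
# The check is sufficient, not necessary: A_bad still has eigenvalues
# with negative real part even though it fails the disc test.
print(np.linalg.eigvals(A_bad).real)
```

Because the disc test is conservative, a penalty built on it trades some admissible networks for a condition that is simple to impose row by row.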


16

Results and Discussion I

The identification algorithm is applied to a subnetwork of the SOS pathway in E. coli, as shown

The main pathway depicted in the network is that between single-stranded DNA (ssDNA) and the protein LexA, which represses several other genes

The protein RecA, activated by the single-stranded DNA, cleaves LexA and hence up-regulates the genes that LexA represses

The steady-state data consist of 9 genes over 9 time steps

A maximum of 81 interactions is to be identified

Fig. 2: Diagram of interactions in SOS network in E. coli


17

Results and Discussion (cont’d)

Steady-state data for the SOS pathway in E. coli from a perturbation experiment

Table 2: Steady-state data for SOS network in E. coli


18

Results and Discussion (cont’d)

The network recovered from the steady-state data is as shown

Red arrows represent inhibition, while green arrows depict activation

The network has 4 false activations, 8 false inhibitions and 25 false no-interactions, for 37 false identifications in total

Penalty parameter = 0.4

The network satisfies the desired constraints (i.e. stability, sparsity and causality)

Fig. 3: Recovered network for SOS pathway in E. coli


19

Results and Discussion (cont’d)

Sparse network matrix, A:

Genes  recA      lexA      ssb       recF      dinI      umuDC     rpoD      rpoH      rpoS
recA   -0.00108  0         0         0         0         0         0         0         0
lexA   0.677765  -0.43355  -0.30091  0.01902   -0.94018  -0.31369  0.097238  -0.44007  0.696682
ssb    0         0         -0.9311   -1.20523  0         0         -0.01077  -0.20607  0
recF   0         0         0         -0.44527  0         0         0         0         0.444202
dinI   -0.00119  0         0         0.054514  -0.1221   0         -0.05714  -0.00821  0
umuDC  0         0         0         0         0.527116  -0.52819  0         0         0
rpoD   0         0         0         0         0         0.39363   -0.3947   0         0
rpoH   0         0         0         0         0         0         0.558855  -0.55992  0
rpoS   0         0         0         0         0         0         0         0.509771  -0.51084

Table 3: Recovered sparse network as adjacency matrix
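
To show how the adjacency matrix maps onto the activation and inhibition arrows of the recovered network, the sketch below turns the Table 3 values into a signed edge list. The reading of entry (i, j) as the effect of the column gene on the row gene follows the usual VAR convention and is an assumption here, as is the helper name `signed_edges`.

```python
import numpy as np

genes = ["recA", "lexA", "ssb", "recF", "dinI", "umuDC", "rpoD", "rpoH", "rpoS"]

# Values copied from Table 3 (rows and columns in the gene order above).
A = np.array([
    [-0.00108, 0, 0, 0, 0, 0, 0, 0, 0],
    [0.677765, -0.43355, -0.30091, 0.01902, -0.94018, -0.31369, 0.097238, -0.44007, 0.696682],
    [0, 0, -0.9311, -1.20523, 0, 0, -0.01077, -0.20607, 0],
    [0, 0, 0, -0.44527, 0, 0, 0, 0, 0.444202],
    [-0.00119, 0, 0, 0.054514, -0.1221, 0, -0.05714, -0.00821, 0],
    [0, 0, 0, 0, 0.527116, -0.52819, 0, 0, 0],
    [0, 0, 0, 0, 0, 0.39363, -0.3947, 0, 0],
    [0, 0, 0, 0, 0, 0, 0.558855, -0.55992, 0],
    [0, 0, 0, 0, 0, 0, 0, 0.509771, -0.51084],
])


def signed_edges(A, names):
    """List (regulator, target, weight) for every nonzero off-diagonal entry.

    Assumed convention: A[i, j] is the effect of gene j on gene i.
    """
    edges = []
    for i, target in enumerate(names):
        for j, regulator in enumerate(names):
            if i != j and A[i, j] != 0.0:
                edges.append((regulator, target, float(A[i, j])))
    return edges


edges = signed_edges(A, genes)
activations = [e for e in edges if e[2] > 0]  # positive weight: activation
inhibitions = [e for e in edges if e[2] < 0]  # negative weight: inhibition
print(len(activations), "activating and", len(inhibitions), "inhibiting edges")
```

Diagonal entries are skipped because they describe each gene's self-regulation rather than an arrow between two genes.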


20

Results and Discussion II

The identification algorithm is applied to perturbed subset data from a human cancer cell line (HeLa)

The data consist of 20 genes over 20 time steps

GENES

98982 STK15 serine/threonine kinase 15 Hs.48915 R11407
111792 PLK polo (Drosophila)-like kinase Hs.77597 AA629262
119851 UBCH10 ubiquitin carrier protein E2-C Hs.93002 AA430504
102354 MAPK13 mitogen-activated protein kinase 13 Hs.178695 AA157499 p38delta mRNA=stress-activated protein kinase 4
107303 CDC2 cell division cycle 2, G1 to S and G2 to M Hs.184572 AA598974
100279 TOP2A topoisomerase (DNA) II alpha (170kD) Hs.156346 AA504348
120314 CENPE centromere protein E (312kD) Hs.75573 AA402431 CENP-E=putative kinetochore motor that accumulates just befo
112387 TOP2A topoisomerase (DNA) II alpha (170kD) Hs.156346 AA026682
105833 KPNA2 karyopherin alpha 2 (RAG cohort 1, importin alpha 1) Hs.159557 AA676460
99561 FLJ10468 hypothetical protein FLJ10468 Hs.48855 N63744
105611 CCNF cyclin F Hs.1973 AA676797
117169 DKFZp762E1312 hypothetical protein DKFZp762E1312 Hs.104859 T66935
114475 CKS2 CDC2-Associated Protein CKS2 Hs.83758 AA292964
118635 C20ORF1 chromosome 20 open reading frame 1 Hs.9329 H73329
114729 BUB1 budding uninhibited by benzimidazoles 1 (yeast homolog) Hs.98658 AA430092 BUB1=putative mitotic checkpoint protein ser/thr kinase
105421 TOP2A **topoisomerase (DNA) II alpha (170kD) Hs.156346 AI734240
102513 CKS2 CDC2-Associated Protein CKS1 Hs.83758 AA010065 ckshs2=homolog of Cks1=p34Cdc28/Cdc2-associated protein
113413 ARL6IP ADP-ribosylation factor-like 6 interacting protein Hs.75249 H20558
106982 L2DTL L2DTL protein Hs.126774 R06900
163308 STK15 **serine/threonine kinase 15 Hs.48915 H63492 aurora/IPL1-related kinase

Recovered network matrix (one row per gene, labelled with its ID and symbol; the 20 columns follow the gene order above):

98982 STK15: -0.568 0 0 0 0 0 0 0 0.159421 0 0.078646 0 0 0 0 0 0 0 0 0.328769
111792 PLK: 0.312587 -0.41391 0 0 0 0 0.100149 0 0 0 0 0 0 0 0 0 0 0 0 0
119851 UBCH10: 0 0 -0.68234 0.731752 0.862432 1.759613 -1.06106 -2.01938 -1.09479 0.674146 1.306913 0 1.57569 -0.40609 -1.10401 1.062439 0.33662 -0.35899 -0.35088 0
102354 MAPK13: 0.314304 0 0 -0.31548 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
107303 CDC2: 0 0 0 0 -0.00119 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100279 TOP2A: 0 -0.0726 0 -0.32152 0.200928 -0.24112 0.288502 0 0.161419 0.169449 0.107678 0.492876 0 -0.20688 -0.26778 0 0.180307 -0.24065 -0.17238 0.335376
120314 CENPE: 0 0 0 0 0 0 -0.42543 0 0 0 0.42425 0 0 0 0 0 0 0 0 0
112387 TOP2A: 0 -0.5713 0 -0.31393 -0.14781 0.555933 0.413613 -0.77267 0 0.08476 0.478954 0.618459 0.111037 0.055929 -0.76283 0.009346 0.707782 -0.12131 -0.31973 0
105833 KPNA2: 0 0 0 0 0 0 0 0 -0.09586 0 0 0.094684 0 0 0 0 0 0 0 0
99561 FLJ10468: 0.183876 -0.34692 0 0.542299 -0.04658 0 0.01115 0 0 -0.00988 0.099172 0.854785 0 -0.58489 -0.02327 0 0.254583 -0.40988 -0.40022 0
105611 CCNF: 0 0 0 0 0 0 0 0 0 0 -0.14645 0.145273 0 0 0 0 0 0 0 0
117169 DKFZp762E1312: 0 0 0 0 0 0 0 0 0 0 0 -0.00119 0 0 0 0 0 0 0 0
114475 CKS2: 0 -0.85215 0 -0.23282 -0.77727 0.372721 0.5829 0 1.062176 0.162725 0 0.052911 -0.24191 0 0 0 0.288165 -0.5449 -0.5108 0.254078
118635 C20ORF1: 1.017563 0 0 0 0 0 0 0 0 0 0 0 0 -0.01874 0 0 0 0 0 0
114729 BUB1: 0.582765 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.58394 0 0 0 0 0
105421 TOP2A: 0 -0.72216 0 -0.27009 0.071475 0.049774 0.065306 0 -0.22795 0.329402 0.337006 0.718894 0.095955 0.706517 -0.23721 -0.4512 0.294601 -0.0551 -0.15713 0
102513 CKS2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.91699 0 0 0.91581
113413 ARL6IP: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.26106 0 0.259881
106982 L2DTL: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.70773 -0.70655
163308 STK15: 0 0 0 0 0 0 0 0 0.177447 0 0 0 0 0 0 0 0 0 0 -0.17863

Table 4: Recovered sparse network of HeLa as adjacency matrix


21

Conclusion

The Lasso-VAR technique models stable gene regulatory networks from steady-state data

It can model networks that are sparse, causal and contain feedback loops, without a priori knowledge of the network structure

The formulation of the proposed algorithm allows for scalability to large networks


23

References

[1] L. E. Chai, S. K. Loh, S. T. Low, M. S. Mohamad, S. Deris, and Z. Zakaria, “A review on the computational approaches for gene regulatory network construction,” Computers in Bio and Med, vol. 48, pp. 55–65, May 2014.

[2] T. S. Gardner, D. Di Bernardo, D. Lorenz, and J. J. Collins, “Inferring genetic networks and identifying compound mode of action via expression profiling,” Science, vol. 301, no. 5629, pp. 102–105, 2003.

[3] M. M. Zavlanos, A. A. Julius, S. P. Boyd, and G. J. Pappas, “Inferring stable genetic networks from steady-state data,” Automatica, vol. 47, no. 6, pp. 1113–1122, 2011.

[4] M. M. Kordmahalleh, M. G. Sefidmazgi, A. Homaifar, A. Karimoddini, A. Guiseppi-Elie, and J. L. Graves, “Delayed and hidden variables interactions in gene regulatory networks,” in 2014 IEEE BIBE, Nov. 2014, pp. 23–29.

[5] A. Fujita, J. R. Sato, H. M. Garay-Malpartida, R. Yamaguchi, S. Miyano, M. C. Sogayar, and C. E. Ferreira, “Modeling gene expression regulatory networks with the sparse vector autoregressive model,” BMC Sys Bio, vol. 1, no. 1, p. 39, Aug. 2007.

[6] G. Michailidis and F. d’Alché-Buc, “Autoregressive models for gene regulatory network inference: Sparsity, stability and causality issues,” Mathematical Biosciences, vol. 246, no. 2, pp. 326–334, 2013.

[7] C. Sima, J. Hua, and S. Jung, “Inference of gene regulatory networks using time-series data: A survey,” Curr Gen, vol. 10, no. 6, pp. 416–429, Sep. 2009.

[8] W. Nicholson, D. Matteson, and J. Bien, “Structured regularization for large vector autoregression,” Ph.D. dissertation, Cornell University, Sep. 2014.

[9] H. Lütkepohl, New introduction to multiple time series analysis. Springer, 2007.



ANY QUESTIONS??
