North Carolina Agricultural and Technical State University
Inferring stable gene regulatory networks from steady-state data
Presenter: Joy Edward Larvie
Collaborator: Mohammad Gorji
Advisor: Dr. Abdollah Homaifar
Outline
Motivation
Background
Introduction
Existing Techniques
Objectives
Methodology
Results and Discussion
Conclusion
Future Work
References
Motivation
Genetic networks are useful in drug discovery, where they are
crucial for identifying targeted pathways
Network identification promotes biological knowledge and medical diagnosis
Existing genetic network identification techniques have
inherent limitations
Background
Traditional techniques for gene
expression studies are limited in both
breadth and efficiency
Investigators/researchers could only
study one or a few genes at a time
DNA microarray technology provides
researchers the opportunity to analyze
expression patterns of tens of
thousands of genes at a time
Multi-step, data-intensive nature of
technology has created a huge
informatics and analytical challenge
DNA microarrays have become a standard tool for
genomic research
Fig. 1: Overview of DNA microarray experiment
Introduction
Understanding the nature of cellular functions requires the
study of gene behavior from a global perspective
Genes typically regulated through complex interconnections
of cellular components, such as proteins
Interactions form the basis of several cellular pathways and
molecular processes in living cells
Identification of these gene regulatory networks (GRNs)
promotes biological knowledge, medical diagnosis, drug
design and also helps to identify molecular targets of
pharmacological compounds
Introduction (cont’d)
Large number of genes involved in GRNs makes network
recovery a very complex task
Advent of high throughput technologies such as DNA microarrays
has provided a powerful tool allowing large expression data to be
collected in a single experiment
Increasing availability of data has boosted the network recovery
task
Two categories of expression data
» Temporal expression data
» Steady-state expression data
Several novel machine learning algorithms proposed for network
identification
Commonly used approaches include clustering, Bayesian
Networks, Boolean Networks and Dynamic Bayesian Networks
Existing Techniques
Computational approaches, with strengths and weaknesses

Boolean network
Strengths
• Can analyse large regulatory networks
• Easier to interpret due to its simplicity
• Biologically realistic complex phenomena can be represented by a simplistic Boolean formalism
Weaknesses
• Deterministic in nature
• Unable to handle incomplete regulatory network data
• Only involves two representative states for gene expression level
• High computing time is needed
• Most BNs can only work with a small number of genes

Probabilistic Boolean network
Strengths
• Copes with uncertainties
• Two or more transition functions are allowed for each variable
• The use of positive feedback and probabilities can make the model work more effectively
• Compared to DBN, PBN can explain more details in the regulatory roles of different sets of genes
Weaknesses
• Difficult to apply to large-scale networks
• High computational complexity
• Cannot cope with instantaneous interactions between variables

Bayesian network
Strengths
• Able to handle noisy data
• Handles uncertainty
• Able to work on logically interacting components with a small number of variables
• Integrates prior knowledge to strengthen the causal relationship
• Infers the structure of the network statistically
Weaknesses
• Hard to distinguish between the origin and the target of an interaction
• Feedback loops not allowed
• Fails to capture temporal information of time-series microarray data
• Supports only small-sized gene regulatory networks
• Combinatorial learning of Bayesian network

Dynamic Bayesian network
Strengths
• Able to model cyclic interactions among genes
• Can handle stochastic components
• Well-suited for handling time-series gene expression data
• Models indirect or direct causal relationships
• Handles perturbation or structural modification of networks
Weaknesses
• Excessive computation time and cost
• Performance restricted by missing values in gene expression data
• Supports only small-sized gene regulatory networks

Ordinary differential equation
Strengths
• Produces directed signed graphs
• Suited for steady-state and time-series expression profiles
• Can work entirely in classical category
Weaknesses
• Only applicable to small networks
• Difficult to find appropriate parameter values that fit the data

Neural network
Strengths
• Able to recognize input patterns
• Able to model any functional relationship and data structure
• Captures nonlinear and dynamic interactions
• Noise resistant
Weaknesses
• Difficult to obtain efficient training, since the learning rate must be defined for each data situation
• High computational complexity, so it can only be applied to very small systems

Table. 1: Comparison of existing network identification techniques
Objectives
Infer a stable, sparse and causal genetic network from
steady-state microarray data
Understand the inherent gene-gene interactions within
inferred network for biological studies/research
Methodology
Vector autoregressive (VAR) model known to be one of the
most flexible and easy to use models for analyzing
multivariate time series.
Particularly useful for forecasting and modeling the dynamic
behavior of economic and financial time series
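In general, a first order VAR [VAR(1)] model is defined as:
$$y_t = A\,y_{t-1} + u_t$$
where $y_t = (y_{1t}, \dots, y_{Kt})'$ is a $(K \times 1)$ random vector, $A$ is a fixed $(K \times K)$ coefficient matrix, and $u_t$ is a K-dimensional white noise or innovation process, that is, $E(u_t) = 0$, $E(u_t u_t') = \Sigma_u$ and $E(u_t u_s') = 0$ for $s \neq t$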
VAR processes become intractable in high dimensional space
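A VAR(1) process can be simulated and fitted by ordinary least squares in a few lines. The sketch below is illustrative only: the function names, the 2 x 2 matrix `A_true` and the noise level are assumptions, not values from the slides. It also makes the dimensionality issue concrete, since a K-gene model has K x K coefficients to estimate.

```python
import numpy as np

def simulate_var1(A, T, noise_std=0.1, seed=0):
    """Simulate y_t = A y_{t-1} + u_t with Gaussian white noise u_t."""
    rng = np.random.default_rng(seed)
    K = A.shape[0]
    Y = np.zeros((T, K))
    Y[0] = rng.normal(size=K)
    for t in range(1, T):
        Y[t] = A @ Y[t - 1] + rng.normal(scale=noise_std, size=K)
    return Y

def fit_var1_ols(Y):
    """Least-squares estimate of A from consecutive observation pairs."""
    X_prev, X_next = Y[:-1], Y[1:]
    # Solve X_prev @ A.T ~= X_next in the least-squares sense.
    A_T, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
    return A_T.T

# Hypothetical 2-variable example (values chosen for illustration only).
A_true = np.array([[0.5, 0.2],
                   [0.0, -0.3]])
Y = simulate_var1(A_true, T=2000)
A_hat = fit_var1_ols(Y)
n_parameters = A_true.size  # K * K coefficients grow quadratically with K
```

With many more observations than genes the estimate is close to `A_true`; with 9 genes and 9 samples, as in steady-state experiments, plain least squares has too few observations per coefficient.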
Methodology (cont’d)
Least Absolute Shrinkage & Selection Operator – VAR
model addresses intractability
Applies an L1 penalty to the convex least squares objective
function.
Some entries of coefficient matrix made zero
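Resulting objective function is:
$$\min_{A} \; \sum_{t} \| y_t - A\,y_{t-1} \|_F^2 + \lambda \| A \|_1$$
where $\|M\|_F^2$ is the square of the Frobenius norm of $M$, $\|A\|_1 = \sum_{i,j} |a_{ij}|$ is the $L_1$ norm of the VAR(1) coefficient matrix $A$, and $\lambda \ge 0$ is a penalty parameter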
Sparsity induced by penalty term
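The L1-penalised least-squares problem can be solved in several ways. The sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which is an assumption for illustration and not necessarily the solver used in this work. The names `lasso_var1`, `objective`, `A_true` and the value `lam=0.05` are hypothetical.

```python
import numpy as np

def soft_threshold(M, tau):
    """Proximal operator of tau * ||M||_1: shrink entries towards zero."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def objective(A, Y, lam):
    """Sum of squared one-step residuals plus the L1 penalty."""
    resid = Y[1:].T - A @ Y[:-1].T
    return float(np.sum(resid ** 2) + lam * np.sum(np.abs(A)))

def lasso_var1(Y, lam, n_iter=500):
    """Minimise sum_t ||y_t - A y_{t-1}||^2 + lam * ||A||_1 by ISTA."""
    X_prev, X_next = Y[:-1].T, Y[1:].T          # each of shape (K, T-1)
    K = X_prev.shape[0]
    A = np.zeros((K, K))
    # Lipschitz constant of the gradient of the smooth (least-squares) part.
    lipschitz = 2.0 * np.linalg.norm(X_prev @ X_prev.T, 2)
    if lipschitz == 0.0:
        return A
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = 2.0 * (A @ X_prev - X_next) @ X_prev.T
        A = soft_threshold(A - step * grad, step * lam)
    return A

# Hypothetical sparse 3-variable system used only to exercise the solver.
rng = np.random.default_rng(0)
A_true = np.array([[0.6, 0.0, 0.0],
                   [0.3, -0.5, 0.0],
                   [0.0, 0.0, 0.4]])
Y = np.zeros((200, 3))
for t in range(1, 200):
    Y[t] = A_true @ Y[t - 1] + rng.normal(scale=0.1, size=3)

A_hat = lasso_var1(Y, lam=0.05)
```

Larger values of `lam` drive more entries of the estimate to exactly zero; once `lam` exceeds twice the largest absolute entry of the lagged cross-product matrix, the estimate is the all-zero matrix.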
Methodology (cont’d)
A VAR(1) is said to be stable iff all eigenvalues of $A$ have
absolute value less than one.
Mathematically equivalent to:
$$\det(I_K - A z) \neq 0 \quad \text{for } |z| \le 1$$
Allows the incorporation of a stability constraint that helps to
address the specification of a stable gene regulatory
network from steady-state data.
Constraint imposed through application of Geršgorin’s
Circle Theorem on original Lasso-VAR technique
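The eigenvalue definition of stability can be checked directly, and Geršgorin's theorem gives a cheaper sufficient test that only uses row sums. The sketch below is one illustrative way to turn the theorem into a check; the exact form of the constraint used in this work is not given on the slides, so the row-sum condition and all matrices here are assumptions.

```python
import numpy as np

def is_stable(A):
    """Exact test: all eigenvalues of A lie strictly inside the unit circle."""
    return bool(np.max(np.abs(np.linalg.eigvals(A))) < 1.0)

def gershgorin_sufficient(A):
    """Sufficient test: every Gersgorin disc lies inside the unit circle.

    Disc i has centre a_ii and radius sum_{j != i} |a_ij|, so it sits inside
    the unit circle whenever the absolute row sum is below one.
    """
    return bool(np.max(np.sum(np.abs(A), axis=1)) < 1.0)

# Hypothetical matrices chosen to show the three possible outcomes.
A_stable = np.array([[0.5, 0.2], [0.1, -0.3]])    # passes both tests
A_unstable = np.array([[1.1, 0.0], [0.0, 0.2]])   # fails both tests
A_mixed = np.array([[0.5, 0.9], [0.0, 0.5]])      # stable, but row sum is 1.4
```

The third matrix shows that the Geršgorin test is conservative: it can reject a matrix that is in fact stable, but it never accepts an unstable one. Its advantage is that it is a simple condition on the entries of the matrix rather than on its eigenvalues.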
Methodology (cont’d)
Geršgorin’s Circle Theorem:
All the eigenvalues of a matrix $A = (a_{ij})$ are in the union of the disks whose
boundaries are circles with centers at the points $a_{ii}$ and the radii are:
$$r_i = \sum_{j \neq i} |a_{ij}|$$
In essence, the eigenvalues of a square matrix cannot be
too far from its diagonal entries.
Consider the following example.
$$M = \begin{pmatrix} 4 & 2 & 3 \\ -2 & -5 & 8 \\ 1 & 0 & 3 \end{pmatrix}$$
The circles that bound the eigenvalues are:
C1: Center point (4,0) with radius r1 = |2|+|3| = 5
C2: Center point (-5,0) with radius r2 = |-2|+|8| = 10
C3: Center point (3,0) with radius r3 = |1|+|0| = 1
[Figure: the three circles (left) and the union of the circles (right); both axes run from -15 to 10]
The red dots to the right mark the actual location of the eigenvalues
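The circle centers and radii listed for this example can be reproduced numerically. The sketch below assumes the 3 x 3 matrix implied by those circles (diagonal 4, -5, 3 with the off-diagonal entries used in the radii); the helper name `gershgorin_discs` is hypothetical.

```python
import numpy as np

def gershgorin_discs(M):
    """Return disc centres (diagonal entries) and radii (off-diagonal row sums)."""
    M = np.asarray(M, dtype=float)
    centers = np.diag(M)
    radii = np.sum(np.abs(M), axis=1) - np.abs(centers)
    return centers, radii

# Matrix reconstructed from the circles listed in the example.
M = np.array([[4, 2, 3],
              [-2, -5, 8],
              [1, 0, 3]])

centers, radii = gershgorin_discs(M)
eigenvalues = np.linalg.eigvals(M)

# Each eigenvalue must fall inside at least one disc (small tolerance for rounding).
inside = [bool(np.any(np.abs(lam - centers) <= radii + 1e-9)) for lam in eigenvalues]
```

Here `centers` and `radii` match C1, C2 and C3, and every entry of `inside` is true, as the theorem guarantees.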
In the Gerschgorin Circle Theorem the y-axis is interpreted as the imaginary axis. Since the roots of the characteristic polynomial could be complex numbers they take on the form x+iy where i is the square root of -1.
$$M = \begin{pmatrix} 1 & 0 & 7 \\ 2 & -5 & 0 \\ 4 & 4 & -3 \end{pmatrix}$$
The circles that bound the eigenvalues are:
C1: Center point (1,0) with radius r1 = |0|+|7| = 7
C2: Center point (-5,0) with radius r2 = |2|+|0| = 2
C3: Center point (-3,0) with radius r3 = |4|+|4| = 8
[Figure: the three circles in the complex plane; both axes run from -15 to 10]
All the eigenvalues lie inside the union of all the circles.
Methodology (cont’d)
Overall objective function:
is chosen to penalize Geršgorin’s discs far in the left plane
Results and Discussion I
Identification algorithm is applied to a
subnetwork of the SOS pathway in E.
coli as shown
Main pathway depicted in network is
that between the single-stranded
DNA (ssDNA) and the protein LexA
which works to repress several other
genes
The protein RecA, activated by the
single-stranded DNA, cleaves LexA,
hence up- regulates the genes
described
Steady-state data consists of 9
genes over 9 time steps
Maximum of 81 interactions to be
identified
Fig. 2: Diagram of interactions in SOS network in E. coli
Results and Discussion (cont’d)
Steady-state data for SOS pathway in E. coli from perturbation
experiment
Table. 2: Steady-state data for SOS network in E. coli
Results and Discussion (cont’d)
Recovered network from the steady-state data is as shown
Red arrows represent inhibition while green arrows depict activation
Network has 4 false activations, 8 false inhibitions, 25 false no-interactions, and 37 false identifications in total
Penalty parameter = 0.4
Satisfies desired constraints (i.e. stability, sparsity and causality)
Fig. 3: Recovered network for SOS pathway in E. coli
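As a quick check on the error counts, the three error types add up to the stated total. The share of correctly identified entries below assumes the false identifications are counted over the same 9 x 9 = 81 candidate interactions mentioned for this data set.

```latex
4 + 8 + 25 = 37, \qquad 81 - 37 = 44, \qquad \frac{44}{81} \approx 0.54
```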
Results and Discussion (cont’d)
Sparse network matrix A:
Genes   recA      lexA      ssb       recF      dinI      umuDC     rpoD      rpoH      rpoS
recA    -0.00108  0         0         0         0         0         0         0         0
lexA    0.677765  -0.43355  -0.30091  0.01902   -0.94018  -0.31369  0.097238  -0.44007  0.696682
ssb     0         0         -0.9311   -1.20523  0         0         -0.01077  -0.20607  0
recF    0         0         0         -0.44527  0         0         0         0         0.444202
dinI    -0.00119  0         0         0.054514  -0.1221   0         -0.05714  -0.00821  0
umuDC   0         0         0         0         0.527116  -0.52819  0         0         0
rpoD    0         0         0         0         0         0.39363   -0.3947   0         0
rpoH    0         0         0         0         0         0         0.558855  -0.55992  0
rpoS    0         0         0         0         0         0         0         0.509771  -0.51084
Table. 3: Recovered sparse network as adjacency matrix
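The adjacency matrix can be read as a list of signed edges. The sketch below assumes that row i is the regulated (target) gene and column j is the regulator, which is the convention of a VAR(1) model but is not stated explicitly on the slide; the helper name `signed_edges` is hypothetical. Diagonal entries are treated as self-terms and skipped.

```python
import numpy as np

genes = ["recA", "lexA", "ssb", "recF", "dinI", "umuDC", "rpoD", "rpoH", "rpoS"]

# Values copied from Table 3 (rows and columns in the gene order above).
A = np.array([
    [-0.00108, 0, 0, 0, 0, 0, 0, 0, 0],
    [0.677765, -0.43355, -0.30091, 0.01902, -0.94018, -0.31369, 0.097238, -0.44007, 0.696682],
    [0, 0, -0.9311, -1.20523, 0, 0, -0.01077, -0.20607, 0],
    [0, 0, 0, -0.44527, 0, 0, 0, 0, 0.444202],
    [-0.00119, 0, 0, 0.054514, -0.1221, 0, -0.05714, -0.00821, 0],
    [0, 0, 0, 0, 0.527116, -0.52819, 0, 0, 0],
    [0, 0, 0, 0, 0, 0.39363, -0.3947, 0, 0],
    [0, 0, 0, 0, 0, 0, 0.558855, -0.55992, 0],
    [0, 0, 0, 0, 0, 0, 0, 0.509771, -0.51084],
])

def signed_edges(A, genes):
    """List non-zero off-diagonal entries as (regulator, target, kind, weight)."""
    edges = []
    for i, target in enumerate(genes):
        for j, regulator in enumerate(genes):
            if i != j and A[i, j] != 0:
                kind = "activation" if A[i, j] > 0 else "inhibition"
                edges.append((regulator, target, kind, float(A[i, j])))
    return edges

edges = signed_edges(A, genes)
n_activation = sum(1 for e in edges if e[2] == "activation")
n_inhibition = sum(1 for e in edges if e[2] == "inhibition")
n_nonzero = int(np.count_nonzero(A))  # includes the 9 diagonal self-terms
```

Under this reading the matrix has 29 non-zero entries out of 81: the 9 diagonal terms plus 20 gene-to-gene edges, split evenly between activation and inhibition.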
Results and Discussion II
Identification algorithm applied to perturbed subset data of Human
Cancer Cell Line (HeLa)
Data consists of 20 genes over 20 time steps
GENES
98982 STK15 serine/threonine kinase 15 Hs.48915 R11407
111792 PLK polo (Drosophila)-like kinase Hs.77597 AA629262
119851 UBCH10 ubiquitin carrier protein E2-C Hs.93002 AA430504
102354 MAPK13 mitogen-activated protein kinase 13 Hs.178695 AA157499 p38delta mRNA=stress-activated protein kinase 4
107303 CDC2 cell division cycle 2, G1 to S and G2 to M Hs.184572 AA598974
100279 TOP2A topoisomerase (DNA) II alpha (170kD) Hs.156346 AA504348
120314 CENPE centromere protein E (312kD) Hs.75573 AA402431 CENP-E=putative kinetochore motor that accumulates just befo
112387 TOP2A topoisomerase (DNA) II alpha (170kD) Hs.156346 AA026682
105833 KPNA2 karyopherin alpha 2 (RAG cohort 1, importin alpha 1) Hs.159557 AA676460
99561 FLJ10468 hypothetical protein FLJ10468 Hs.48855 N63744
105611 CCNF cyclin F Hs.1973 AA676797
117169 DKFZp762E1312 hypothetical protein DKFZp762E1312 Hs.104859 T66935
114475 CKS2 CDC2-Associated Protein CKS2 Hs.83758 AA292964
118635 C20ORF1 chromosome 20 open reading frame 1 Hs.9329 H73329
114729 BUB1 budding uninhibited by benzimidazoles 1 (yeast homolog) Hs.98658 AA430092 BUB1=putative mitotic checkpoint protein ser/thr kinase
105421 TOP2A **topoisomerase (DNA) II alpha (170kD) Hs.156346 AI734240
102513 CKS2 CDC2-Associated Protein CKS1 Hs.83758 AA010065 ckshs2=homolog of Cks1=p34Cdc28/Cdc2-associated protein
113413 ARL6IP ADP-ribosylation factor-like 6 interacting protein Hs.75249 H20558
106982 L2DTL L2DTL protein Hs.126774 R06900
163308 STK15 **serine/threonine kinase 15 Hs.48915 H63492 aurora/IPL1-related kinase

Recovered matrix (one row per gene; the 20 columns follow the gene order above):
98982 STK15           -0.568 0 0 0 0 0 0 0 0.159421 0 0.078646 0 0 0 0 0 0 0 0 0.328769
111792 PLK            0.312587 -0.41391 0 0 0 0 0.100149 0 0 0 0 0 0 0 0 0 0 0 0 0
119851 UBCH10         0 0 -0.68234 0.731752 0.862432 1.759613 -1.06106 -2.01938 -1.09479 0.674146 1.306913 0 1.57569 -0.40609 -1.10401 1.062439 0.33662 -0.35899 -0.35088 0
102354 MAPK13         0.314304 0 0 -0.31548 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
107303 CDC2           0 0 0 0 -0.00119 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100279 TOP2A          0 -0.0726 0 -0.32152 0.200928 -0.24112 0.288502 0 0.161419 0.169449 0.107678 0.492876 0 -0.20688 -0.26778 0 0.180307 -0.24065 -0.17238 0.335376
120314 CENPE          0 0 0 0 0 0 -0.42543 0 0 0 0.42425 0 0 0 0 0 0 0 0 0
112387 TOP2A          0 -0.5713 0 -0.31393 -0.14781 0.555933 0.413613 -0.77267 0 0.08476 0.478954 0.618459 0.111037 0.055929 -0.76283 0.009346 0.707782 -0.12131 -0.31973 0
105833 KPNA2          0 0 0 0 0 0 0 0 -0.09586 0 0 0.094684 0 0 0 0 0 0 0 0
99561 FLJ10468        0.183876 -0.34692 0 0.542299 -0.04658 0 0.01115 0 0 -0.00988 0.099172 0.854785 0 -0.58489 -0.02327 0 0.254583 -0.40988 -0.40022 0
105611 CCNF           0 0 0 0 0 0 0 0 0 0 -0.14645 0.145273 0 0 0 0 0 0 0 0
117169 DKFZp762E1312  0 0 0 0 0 0 0 0 0 0 0 -0.00119 0 0 0 0 0 0 0 0
114475 CKS2           0 -0.85215 0 -0.23282 -0.77727 0.372721 0.5829 0 1.062176 0.162725 0 0.052911 -0.24191 0 0 0 0.288165 -0.5449 -0.5108 0.254078
118635 C20ORF1        1.017563 0 0 0 0 0 0 0 0 0 0 0 0 -0.01874 0 0 0 0 0 0
114729 BUB1           0.582765 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.58394 0 0 0 0 0
105421 TOP2A          0 -0.72216 0 -0.27009 0.071475 0.049774 0.065306 0 -0.22795 0.329402 0.337006 0.718894 0.095955 0.706517 -0.23721 -0.4512 0.294601 -0.0551 -0.15713 0
102513 CKS2           0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.91699 0 0 0.91581
113413 ARL6IP         0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.26106 0 0.259881
106982 L2DTL          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.70773 -0.70655
163308 STK15          0 0 0 0 0 0 0 0 0.177447 0 0 0 0 0 0 0 0 0 0 -0.17863
Table. 4: Recovered sparse network of HeLa as adjacency matrix
Conclusion
Lasso-VAR technique models stable gene regulatory
networks from steady-state data.
It is naturally possible to model networks that are sparse,
causal and with feedback loops, without a priori knowledge
of the network structure
Formulation of the proposed algorithm allows for scalability
to large networks
References
[1] L. E. Chai, S. K. Loh, S. T. Low, M. S. Mohamad, S. Deris, and Z. Zakaria, “A review on the computational approaches for gene regulatory network construction,” Computers in Bio and Med, vol. 48, pp. 55–65, May 2014.
[2] T. S. Gardner, D. Di Bernardo, D. Lorenz, and J. J. Collins, “Inferring genetic networks and identifying compound mode of action via expression profiling,” Science, vol. 301, no. 5629, pp. 102–105, 2003.
[3] M. M. Zavlanos, A. A. Julius, S. P. Boyd, and G. J. Pappas, “Inferring stable genetic networks from steady-state data,” Automatica, vol. 47, no. 6, pp. 1113–1122, 2011.
[4] M. M. Kordmahalleh, M. G. Sefidmazgi, A. Homaifar, A. Karimoddini, A. Guiseppi-Elie, and J. L. Graves, “Delayed and hidden variables interactions in gene regulatory networks,” in 2014 IEEE BIBE, Nov. 2014, pp. 23–29.
[5] A. Fujita, J. R. Sato, H. M. Garay-Malpartida, R. Yamaguchi, S. Miyano, M. C. Sogayar, and C. E. Ferreira, “Modeling gene expression regulatory networks with the sparse vector autoregressive model,” BMC Sys Bio, vol. 1, no. 1, p. 39, Aug. 2007.
[6] G. Michailidis and F. d’Alché-Buc, “Autoregressive models for gene regulatory network inference: Sparsity, stability and causality issues,” Mathematical Biosciences, vol. 246, no. 2, pp. 326–334, 2013.
[7] C. Sima, J. Hua, and S. Jung, “Inference of gene regulatory networks using time-series data: A survey,” Curr Gen, vol. 10, no. 6, pp. 416–429, Sep. 2009.
[8] W. Nicholson, D. Matteson, and J. Bien, “Structured regularization for large vector autoregression,” Ph.D. dissertation, Cornell University, Sep. 2014.
[9] H. Lütkepohl, New introduction to multiple time series analysis. Springer, 2007.
ANY QUESTIONS??