introduction to systems biology iiveda.cs.uiuc.edu/.../08_intro_systems_biology_ii.pdf · systems...

61
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Introduction to Systems Biology II Amin Emad NIH BD2K KnowEnG Center of Excellence in Big Data Computing Carl R. Woese Institute for Genomic Biology Department of Computer Science University of Illinois at Urbana-Champaign June, 2018

Upload: others

Post on 13-Jun-2020

7 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

Introduction to Systems Biology II

Amin Emad

NIH BD2K KnowEnG Center of Excellence in Big Data Computing Carl R. Woese Institute for Genomic Biology

Department of Computer Science University of Illinois at Urbana-Champaign

June, 2018

Page 2: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Systems Biology

•  Systems biology is the computational and mathematical modeling of complex biological systems (wikipedia).

•  Studies the interactions between the components of biological systems such as genes, proteins, metabolites, etc. (i.e. biological networks), and how these interactions give rise to the function and behavior of that system (phenotype)

NIH Big Data Center of Excellence 2

Page 3: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Biological Networks

A graphical representation of the interactions of the components of a biological systems

NIH Big Data Center of Excellence 3

BMIF310, Fall 2009 3

Cell as a system

Signaling network

Transcriptional regulatory network

Metabolic network

Gene co-expression network

Protein interaction network

Zhang (2009)

•  Cell signaling networks

•  Gene regulatory networks

•  Protein-protein interaction networks

•  Gene co-expression networks

•  Metabolic networks

Page 4: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Biological Networks in Computational Biology

NIH Big Data Center of Excellence 4

Analyzing network

properties

Analyzing ‘omic’ data in

light of networks

Reconstructing biological

networks

Graph Theory

Machine Learning

Statistics

Page 5: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

1) Analyzing network properties

NIH Big Data Center of Excellence 5

Page 6: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

What is a network/graph?

NIH Big Data Center of Excellence 6

•  Graph: A representation of relationship among objects •  A graph G(V, E) is a set of vertices (nodes) V and edges (links) E

Directed vs. Undirected:

Undirected graph

•  Protein-protein interactions

•  Co-expression network

Directed graph

•  Gene regulatory network

•  Signaling pathways

Page 7: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Graph Properties

NIH Big Data Center of Excellence 7

Weighted vs. Unweighted:

•  Weights represent affinity in PPI, correlation coefficient in a co-expression network, confidence in a GRN, etc.

Weighted graph Unweighted graph

Page 8: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Graph Properties

NIH Big Data Center of Excellence 8

Degree and degree distribution: •  Degree: Number of connections of a node to other nodes

•  Indegree (outdegree) of a node in a directed graph is the number of edges entering (leaving) that node

•  Degree distribution of a network is the probability distribution of these degrees over the network:

Page 9: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Graph Properties

NIH Big Data Center of Excellence 9

Adjacency matrix: •  A matrix representation of the graph

https://www.ebi.ac.uk/training/online/course/network-analysis-protein-interaction-data-introduction/introduction-graph-theory/graph-0

Page 10: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Graph Properties

NIH Big Data Center of Excellence 10

Path and connectivity: •  Path: A sequence of distinct edges connecting a sequence

of vertices: GFAB, EAC, etc.

•  Connectivity: A graph that in which a path exists between any two nodes

Page 11: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Graph Properties

NIH Big Data Center of Excellence 11

Important classes of graphs: •  Tree: Any two vertices are connected by exactly one path (e.g.

dendogram in hierarchical clustering)

•  Complete graph: Each pair of vertices are connected by an edge

Page 12: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

2) Analyzing ‘omic’ data in light of biological networks

NIH Big Data Center of Excellence 12

Page 13: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Analyzing ‘omic’ data in light of networks

NIH Big Data Center of Excellence 13

How to analyze large ‘omic’ datasets?

Statistics Machine Learning

Page 14: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Analyzing ‘omic’ data in light of networks

NIH Big Data Center of Excellence 14

How to analyze large ‘omic’ datasets? Machine learning is concerned with utilizing statistical techniques to give computers the ability to “learn”.

Statistics Machine Learning

Page 15: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Analyzing ‘omic’ data in light of networks

NIH Big Data Center of Excellence 15

How to analyze large ‘omic’ datasets? Machine learning is concerned with utilizing statistical techniques to give computers the ability to “learn”. However, it can do much more!

Statistics Machine Learning

Page 16: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Machine Learning in Computational Biology

NIH Big Data Center of Excellence 16

Some examples:

•  Predicting whether a patient is sensitive or resistant to a drug

•  Predicting the survival probability of a cancer patient

•  Identifying the subtypes of a disease

•  Identifying genes associated with a disease

•  etc.

Page 17: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Machine Learning

NIH Big Data Center of Excellence 17

Training examples are provided with desired inputs and outputs to help learning the desired rule

No training example exists and the goal is to learn structure in the data

Machine Learning

Supervised Learning

Unsupervised Learning

Page 18: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Machine Learning

NIH Big Data Center of Excellence 18

Machine Learning

Supervised Learning

Unsupervised Learning

Classification Regression

Supervised Feature Selection

Clustering

Dimensionality Reduction

Page 19: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Unsupervised Machine Learning (Clustering)

NIH Big Data Center of Excellence 19

•  We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)

•  Goal: Group the sample such that those in the same group are more similar to each other than to those in other groups

•  Many methods exist such as K-means, hierarchical clustering, matrix factorization, etc.

•  Example: Identifying subtypes of breast

cancer using transcriptomic data

Page 20: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Unsupervised ML (Dimensionality Reduction)

NIH Big Data Center of Excellence 20

•  We have a set of samples characterized using several features

•  Goal: Reduce the number of features while preserving characteristics of the data

•  Many methods exist such as principal component analysis, linear discriminative analysis, etc.

•  Example: PCA identifies a few principal

components, orthogonal to each other, such that they account for most of the variance in the data

Page 21: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Supervised Machine Learning (Classification)

NIH Big Data Center of Excellence 21

Classification: •  We have a set of samples characterized using several features (e.g.

expression of thousands of genes for tumor samples)

•  The samples belong to set of known categories •  Goal: Given a new sample, to which category does it belong?

•  Many methods exist such as KNN, SVM, logistic regression, decision trees, random forests, etc.

Page 22: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Supervised Machine Learning (Classification)

NIH Big Data Center of Excellence 22

Example: •  We have ‘omic’ profiles and clinical information of breast cancer patients

•  We also know which patients were resistant to a drug and which ones were not

•  Given the ‘omic’ profiles and clinical information of a new patient, will they be resistant to the drug or not?

+ =

+ =

‘omic’ and clinical features

sam

ples

Page 23: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Supervised Machine Learning (Regression)

NIH Big Data Center of Excellence 23

•  We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)

•  For each sample, we know a continuous-valued response (dependent variable) (e.g. number of years between diagnosis and occurrence of metastasis)

•  Goal: Estimate the relationship between the response and features and predict the value of response for a new sample

•  Many methods exist such as linear regression, LASSO, Elastic Net, Support vector regression, etc.

Page 24: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Supervised Machine Learning (Regression)

NIH Big Data Center of Excellence 24

Example: •  We have transcriptomic profiles of breast cancer patients

•  We also know number of months between diagnosis and occurrence of metastasis

•  What is the relationship between gene expression and time of metastasis?

genes

sam

ples

https://www.cancer.gov/types/metastatic-cancer

Page 25: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Supervised Machine Learning (Feature Selection)

NIH Big Data Center of Excellence 25

•  We have a set of samples characterized using several features •  We know a continuous-valued or categorical response for samples •  Goal: What are the features most predictive of the response?

Examples:

•  Differentially expressed genes (case vs. control)

•  Correlation analysis (GWAS)

•  etc.

genes

sam

ples

continuous categorical

Page 26: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Network guided analysis

NIH Big Data Center of Excellence 26

How can biological networks help? •  When features correspond to genes or proteins (e.g. gene

expression, mutation, etc.), these networks can provide information regarding the interactions and relationships of these features.

genes

sam

ples

Page 27: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Network-guided gene prioritization using ProGENI

NIH Big Data Center of Excellence 27

Page 28: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Background

•  Phenotypic properties of a cell are determined (partially) by expression of its genes and proteins

•  Gene expression profiling measures the activity of thousands of genes to create a global picture of cellular function

NIH Big Data Center of Excellence

genes

sam

ples

28

Page 29: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Background

•  Goal: •  Identifying genes whose basal mRNA expression determines the drug

sensitivity in different samples (supervised feature selection)

•  Motivations: •  Overcoming drug resistance

•  Revealing drug mechanism of action

•  Identifying novel drug targets

•  Predicting drug sensitivity of individuals

NIH Big Data Center of Excellence

+ =

+ =

29

Page 30: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene prioritization

NIH Big Data Center of Excellence

genes

sam

ples

Examples of current methods: •  Score each gene based on the correlation of its

expression with drug response

30

Page 31: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene prioritization

NIH Big Data Center of Excellence

Xwixi

Examples of current methods: •  Score each gene based on the correlation of its

expression with drug response

•  Use multivariable regression algorithms such as Elastic Net to relate multiple genes’ expression values to drug response

31

genes

sam

ples

Page 32: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene prioritization

Examples of current methods: •  Score each gene based on the correlation of its

expression with drug response

•  Use multivariable regression algorithms such as Elastic Net to relate multiple genes’ expression values to drug response

Shortcoming: •  These methods do not incorporate prior information

about the interaction of the genes

NIH Big Data Center of Excellence 32

Page 33: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

ProGENI

Hypothesis: •  Since genes and proteins involved in drug MoA are functionally related, prior

knowledge in the form of gene interaction network (e.g. PPI) can improve accuracy of the prioritization task

NIH Big Data Center of Excellence

genes

sam

ples

33

Page 34: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

ProGENI

ProGENI: Network-guided gene prioritization •  An algorithm that incorporates gene network information to improve

prioritization accuracy

NIH Big Data Center of Excellence

RESEARCH Open Access

Knowledge-guided gene prioritizationreveals new insights into the mechanismsof chemoresistanceAmin Emad1 , Junmei Cairns2, Krishna R. Kalari3, Liewei Wang2* and Saurabh Sinha4*

Abstract

Background: Identification of genes whose basal mRNA expression predicts the sensitivity of tumor cells to cytotoxictreatments can play an important role in individualized cancer medicine. It enables detailed characterization of themechanism of action of drugs. Furthermore, screening the expression of these genes in the tumor tissue may suggestthe best course of chemotherapy or a combination of drugs to overcome drug resistance.

Results: We developed a computational method called ProGENI to identify genes most associated with the variationof drug response across different individuals, based on gene expression data. In contrast to existing methods, ProGENIalso utilizes prior knowledge of protein–protein and genetic interactions, using random walk techniques. Analysis oftwo relatively new and large datasets including gene expression data on hundreds of cell lines and their cytotoxicresponses to a large compendium of drugs reveals a significant improvement in prediction of drug sensitivity usinggenes identified by ProGENI compared to other methods. Our siRNA knockdown experiments on ProGENI-identifiedgenes confirmed the role of many new genes in sensitivity to three chemotherapy drugs: cisplatin, docetaxel, anddoxorubicin. Based on such experiments and extensive literature survey, we demonstrate that about 73% of our toppredicted genes modulate drug response in selected cancer cell lines. In addition, global analysis of genes associatedwith groups of drugs uncovered pathways of cytotoxic response shared by each group.

Conclusions: Our results suggest that knowledge-guided prioritization of genes using ProGENI gives new insight intomechanisms of drug resistance and identifies genes that may be targeted to overcome this phenomenon.

Keywords: Chemoresistance, Chemotherapy, Drug sensitivity, Gene interaction network, Gene prioritization,Network-based algorithm

BackgroundThe goal of gene prioritization is to rank genes with re-spect to their relationship to a phenotype (e.g., occurrenceof a disease, response to a drug, etc.), providing an experi-mentalist a way to prioritize genetic perturbation tests andleading to discovery of genes affecting the phenotype [1].In the context of drug design and drug sensitivity, variousgene prioritization techniques have been used to identifydrug targets, reveal mechanisms of action (MoAs) of

drugs, and identify genes associated with drug response,as well as for drug repositioning [2–5].It has been previously shown that gene expression is

the most informative currently available ‘omic’ featurewith respect to drug sensitivity prediction [6], and it hasbeen also successfully used to predict drug response inlarge clinical studies [7]. Basal gene expression of cancercell lines (CCLs) has been used to rank genes by theirrole in cytotoxic drug resistance, utilizing correlationanalysis [2, 8–11] or feature selection and regressiontechniques [12–16] to statistically associate drug re-sponse with gene expression profiles of cell lines. At thesame time, many genes with key roles escape identifica-tion based on expression profiling alone, due to thecomplexity of drug MoA and noisy data [2], and due tothe fact that current methods overlook known functional

* Correspondence: [email protected]; [email protected] of Molecular Pharmacology and Experimental Therapeutics,Gonda 19, Mayo Clinic Rochester, 200, 1st St. SW, Rochester, MN 55905, USA4Department of Computer Science and Institute of Genomic Biology,University of Illinois at Urbana-Champaign, 2122 Siebel Center, 201N.Goodwin Ave, Urbana, IL 61801, USAFull list of author information is available at the end of the article

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Emad et al. Genome Biology (2017) 18:153 DOI 10.1186/s13059-017-1282-3

34

Page 35: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

ProGENI

Step 1: Generate new features representing expression of each gene and the activity level of their neighbors weighted proportional to their relevance

NIH Big Data Center of Excellence

Randomlyselect80%ofcelllines

Rankallgenes

Aggregaterankedlistsofgenes

RepeatNr8mes

Genes

Celllines

Priori%z

a%on)

PerformNetworktransforma8onofgeneexpressions

Obtainequilibriumprobabilitydistribu8on

forthenodes

Celllines

Genes

Network

Geneexpressions

Drugresponse(e.g.IC50)

Iden8fyresponsecorrelatedgenes(RCG)andusethemasthe

restartsetforaRWR

a)

b)

Rankgenesaccordingtonormalized

probabilityscores

Normalizew.r.t.globalnetworkdistribu8on

35

Page 36: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

ProGENI

Step 1: Generate new features representing expression of each gene and the activity level of their neighbors weighted proportional to their relevance

NIH Big Data Center of Excellence

Randomlyselect80%ofcelllines

Rankallgenes

Aggregaterankedlistsofgenes

RepeatNr8mes

Genes

Celllines

Priori%z

a%on)

PerformNetworktransforma8onofgeneexpressions

Obtainequilibriumprobabilitydistribu8on

forthenodes

Celllines

Genes

Network

Geneexpressions

Drugresponse(e.g.IC50)

Iden8fyresponsecorrelatedgenes(RCG)andusethemasthe

restartsetforaRWR

a)

b)

Rankgenesaccordingtonormalized

probabilityscores

Normalizew.r.t.globalnetworkdistribu8on

(Rosvall and Bergstrom, 2007)

36

Page 37: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

ProGENI

Step 1: Generate new features representing expression of each gene and the activity level of their neighbors weighted proportional to their relevance

Step 2: Find genes most correlated with drug response (RCG set)

NIH Big Data Center of Excellence

Randomlyselect80%ofcelllines

Rankallgenes

Aggregaterankedlistsofgenes

RepeatNr8mes

Genes

Celllines

Priori%z

a%on)

PerformNetworktransforma8onofgeneexpressions

Obtainequilibriumprobabilitydistribu8on

forthenodes

Celllines

Genes

Network

Geneexpressions

Drugresponse(e.g.IC50)

Iden8fyresponsecorrelatedgenes(RCG)andusethemasthe

restartsetforaRWR

a)

b)

Rankgenesaccordingtonormalized

probabilityscores

Normalizew.r.t.globalnetworkdistribu8on

37

Page 38: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

ProGENI

Step 1: Generate new features representing expression of each gene and the activity level of their neighbors weighted proportional to their relevance

Step 2: Find genes most correlated with drug response (RCG set)

Step 3: Score genes based on their relevance to the RCG set

NIH Big Data Center of Excellence

Randomlyselect80%ofcelllines

Rankallgenes

Aggregaterankedlistsofgenes

RepeatNr8mes

Genes

Celllines

Priori%z

a%on)

PerformNetworktransforma8onofgeneexpressions

Obtainequilibriumprobabilitydistribu8on

forthenodes

Celllines

Genes

Network

Geneexpressions

Drugresponse(e.g.IC50)

Iden8fyresponsecorrelatedgenes(RCG)andusethemasthe

restartsetforaRWR

a)

b)

Rankgenesaccordingtonormalized

probabilityscores

Normalizew.r.t.globalnetworkdistribu8on

38

Page 39: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

ProGENI

Step 1: Generate new features representing expression of each gene and the activity level of their neighbors weighted proportional to their relevance

Step 2: Find genes most correlated with drug response (RCG set)

Step 3: Score genes based on their relevance to the RCG set

Step 4: Remove network bias by normalizing scores w.r.t. scores corresponding to global network topology

NIH Big Data Center of Excellence

Randomlyselect80%ofcelllines

Rankallgenes

Aggregaterankedlistsofgenes

RepeatNr8mes

Genes

Celllines

Priori%z

a%on)

PerformNetworktransforma8onofgeneexpressions

Obtainequilibriumprobabilitydistribu8on

forthenodes

Celllines

Genes

Network

Geneexpressions

Drugresponse(e.g.IC50)

Iden8fyresponsecorrelatedgenes(RCG)andusethemasthe

restartsetforaRWR

a)

b)

Rankgenesaccordingtonormalized

probabilityscores

Normalizew.r.t.globalnetworkdistribu8on

39

Page 40: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Datasets

•  Human lymphoblastoid cell lines (LCL) •  Gene expression (~17K genes of ~300 cell lines) •  Drug response of 24 cytotoxic treatments

•  Publicly available dataset from GDSC •  Gene expression (~13K genes of ~600 cell lines from 13

tissues) •  Drug response of 139 cytotoxic treatments

•  Publicly available prior knowledge •  Network of gene interactions (PPI and genetic interactions) from

STRING (~1.5M edges, ~15.5K nodes)

NIH Big Data Center of Excellence

Data Sources for Knowledge Network• Philosophy: Rely on existing collections

• Protein-Protein Interactions • (40 M)

• Experimentally determined physical and genetic interactions

• Literature-based co-occurrence• Many other types

• Sources for experimental interactions (1.4 M)

5Interactions among 12 genes

40

Page 41: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Validation using drug response prediction

•  Genes ranked highly using a good prioritization method are good predictors of drug sensitivity

NIH Big Data Center of Excellence

Randomlyselect80%ofcelllines

Rankallgenes

Aggregaterankedlistsofgenes

RepeatNr8mes

GenesCe

lllines

Priori%z

a%on)

PerformNetworktransforma8onofgeneexpressions

Obtainequilibriumprobabilitydistribu8on

forthenodes

Celllines

Genes

Network

Geneexpressions

Drugresponse(e.g.IC50)

Iden8fyresponsecorrelatedgenes(RCG)andusethemasthe

restartsetforaRWR

A

B

Rankgenesaccordingtonormalized

probabilityscores

Normalizew.r.t.globalnetworkdistribu8on

Dividesamplesintotrainingand

testsetsRankallgenes

TrainaSVRontrainingsetusingexpressionofhighlyrankedgenes

RepeatNr8mes

Celllines

Genes

Geneexpressions

Predictdrugsensi8vityofthetestset

Trainingset

Testset

C

41

Page 42: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Validation using drug response prediction

NIH Big Data Center of Excellence

LCL Dataset Pearson Elastic Net Num. Drugs (out of 24)

ProGENI > Baseline 14 20

FDR (Wilcoxon signed-rank test) 6.5 E-3 9.6 E-5

GDSC Dataset Pearson Elastic Net Num. Drugs (out of 139)

ProGENI > Baseline 66 110

FDR (Wilcoxon signed-rank test) 9.1 E-4 4.0 E-21

SPCI(P

roGE

NI-S

VR)

A

B

C

D

SPCI(PCC-SVR)

SPCI(PCC-SVR)

SPCI(EN-SVR)

SPCI(EN-SVR)

SPCI(P

roGE

NI-S

VR)

SPCI(P

roGE

NI-S

VR)

SPCI(P

roGE

NI-S

VR)

42

Page 43: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Functional validation

We validated role of 33 (out of 45) genes (73%) for three drugs.

NIH Big Data Center of Excellence

A

Gene Symbol Rank (ProGENI) Rank (Pearson) Absolute value of

Pearson correlation coefficient

Evidence

TUBB6 2 2 0.2759 Direct (this study)

DYNC2H1 3 4 0.2680 Direct (this study)

CLDN3 4 7 0.2602 Direct (literature)

SPARC 5 8 0.2574 Direct (literature)

GJA1 6 6 0.2623 Direct (literature)

ITGA5 7 11 0.2466 Direct (literature)

TPM2 8 9 0.2567 Direct (literature)

MMP2 9 37 0.2160 Direct (literature)

AXL 12 15 0.2373 Direct (literature)

ENG 13 47 0.2089 Direct (literature)

ELK3 14 13 0.2394 Direct (this study)

TIMP1 15 29 0.2207 Direct (literature)

FSCN1 1 1 0.2879 Not found

FHL3 10 10 0.2477 Not found

MMP14 11 39 0.2143 Not found

B

Gene Symbol Rank (ProGENI) Rank (Pearson) Absolute value of

Pearson correlation coefficient

Evidence

CAV1 1 8 0.3713 Direct (literature)

YAP1 2 1 0.4148 Direct (literature)

WWTR1 3 4 0.4075 Direct (literature)

AXL 6 2 0.4098 Direct (literature)

MMP14 7 22 0.3525 Direct (literature)

CYR61 9 6 0.3791 Direct (literature)

CAV2 10 16 0.3566 Direct (literature)

GNG12 11 5 0.3792 Direct (this study)

CTSB 12 27 0.3462 Direct (literature)

FSTL1 14 17 0.3557 Direct (this study)

ST5 15 7 0.3782 Direct (this study)

PDGFC 4 13 0.3659 Not found

PTRF 5 3 0.4094 Not found

ITGB5 8 21 0.3534 Not found

PLAU 13 110 0.3033 Not found

C

Gene Symbol Rank (ProGENI) Rank (Pearson) Absolute value of

Pearson correlation coefficient

Evidence

ATF1 1 1 0.2000 Direct (this study)

MIS12 2 4 0.1887 Direct (this study)

OSBPL2 5 6 0.1865 Direct (this study)

CSNK2A1 7 1587 0.0752 Direct (literature)

PSIP1 (LEDGF) 8 46 0.1537 Direct (literature)

CAMK2A 9 6991 0.0157 Direct (literature)

CSNK2A2 10 4870 0.0347 Direct (literature)

GOSR1 11 6867 0.0167 Direct (this study)

MAPK8 13 7574 0.0112 Direct (literature)

SPI1 14 6287 0.0217 Direct (literature)

CREB1 15 665 0.1000 Direct (literature)

NOC3L 3 3 0.1893 Not found

IL27RA 4 2 0.1911 Not found

MGEA5 6 7 0.1814 Not found

WAPAL 12 8 0.1805 Not found

B

C

p-value<0.0001 p-value<0.0001 p-value<0.0001

p-value<0.0001 p-value<0.0001 p-value<0.0001p-value<0.0001

BT549

BT549

p-value=0.0005 p-value<0.0001 p-value=0.0002MDA-MB-231

A BT549p-value<0.0001 p-value<0.0001 p-value<0.0001

MDA-MB-231p-value<0.0001 p-value<0.0001 p-value<0.0001

p-value=0.0002 p-value=0.0010 p-value=0.0018MDA-MB-231

JUNMAPK8PDPK1

CDC42IL27RA

PAK1

ATF1SRC

PXN

FNBP1

GOSR1

IRS1

MIS12

SUGT1

NOL6

NOC3L

HSP90AA1

GTPBP4LEO1

MGEA5NIFK

RPL5

RRS1

SUMO2

PSIP1

HEATR1

CDC37

CAMK2ACREM

CREB1RPS6KB1CREBBP

SGK1

ZNF45

SPI1

PRKCD

EDF1YWHAQ

UBC

CSNK2A1SIN3A

CEBPACSNK2A2

BRCA1

RPS6KA1

RPS6KA5

OSBPL2

PRKACA

RPS6KA3

RPS6KA4

EBI3

RPS6KA2

FKBP4

WAPAL

NOL3

OGT

CASP2

FGFR2

HSP90AA1

ITGA5

MXRA7

MMP2

TGFB1

TIMP1

MMP14

ZBTB16

FSCN1

AXL

UBC

MRC2

DYNC2H1

GJA1

FHL3

TUBB6

COL18A1

SPARCENG

ITGB1

ACVRL1

TGFBR1

CAV1TPM2

ELK3

CLDN3

MMP9

CLDN1

THBS1

PARVA

GNG12

FLNA

TUBB6

FLOT2

PTRF

ITGAVSRC

ITGB5EHD2

HRAS

PLAT

FHL2

YAP1

UBC

CAP2

PROCR

RASA1CAV2

PTGS1

FPGT-TNNI3K

LRP1

MALL

CAV1

KCNMA1

CSNK2A1

MMP14PLD2

PDGFCCYR61

CTSB

AXLPLAU

ST14

cispla4n

D

docetaxel

doxorubicin

43

Page 44: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

How about other ML tasks?

NIH Big Data Center of Excellence 44

•  Similar principles can be used for ML tasks other than feature selection/prioritization

•  “Network-smoothing” of the features used in ProGENI can be used as a preprocessing step to regression and classification algorithms

•  Network-smoothing can also be used for clustering and dimensionality reduction (e.g. Network-based stratification)

Page 45: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Network-based Stratification

NIH Big Data Center of Excellence 45

Goal: •  Stratification (clustering) of tumor samples based on somatic mutation

profiles

Main Issue: •  The mutation data is very sparse and most conventional clustering

techniques fail to identify reasonable patterns

•  Although two tumors may not share the same somatic mutations, they may affect the same pathways and interaction networks

Page 46: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Value of network-guided analysis

NIH Big Data Center of Excellence 46

Data sparsity: •  Due to the sparsity of the

data, all samples are at equal distance of each other

Page 47: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Value of network-guided analysis

NIH Big Data Center of Excellence 47

Data sparsity: •  Due to the sparsity of the

data, all samples are at equal distance of each other

•  Pathway information clarifies the similarity among some samples

Page 48: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Value of network-guided analysis

NIH Big Data Center of Excellence 48

Data sparsity: •  Due to the sparsity of the

data, all samples are at equal distance of each other

•  Pathway information clarifies the similarity among some samples

•  Conventional clustering methods can identify clusters based on network-smoothed features

Page 49: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

NBS (Algorithm Overview)

NIH Big Data Center of Excellence 49

•  Employs network smoothing to mitigate sparsity by transforming the binary gene-level somatic mutation vectors of patients into a continuous gene importance vector that captures the proximity of each gene in the network to all of the genes with somatic mutations in the patient sample

•  Bootstrap sampling enables robust clustering

Page 50: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

3) Reconstruction of Biological Networks

NIH Big Data Center of Excellence 50

Page 51: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 51

•  Nodes represent genes •  An edge exists between two genes that are highly co-expressed

across different samples

gene 1 gene 1 gene 1

gene

2

gene

2

gene

2

genes

sam

ples

Page 52: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 52

•  Such networks provide a global view of co-expression patterns

•  But do not provide information on how these networks relate to the variation in a phenotypic outcome

Page 53: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 53

How can we relate these networks to the phenotypic variation?

gene 1 gene 1 gene 1

gene

2

gene

2

gene

2

genes

sam

ples

Calculate pair-wise

correlations Filter o

ut small

corre

lations

genes

sam

ples

continuous categorical

gene-gene correlation

gene-phenotype association

Page 54: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 54

Approach 1: In reconstructing the network, we can limit our samples to one manifestation of the phenotypic outcome

•  For example, build a Basal-like co-expression network by looking at the gene correlations across Basal breast cancer samples

•  Issues: 1.  Only works if we have categorical phenotype 2.  Does not relate the network to the variation in the phenotypic

outcome

Page 55: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 55

Approach 2: If the phenotype is binary, reconstruct two networks (one for each manifestation of the phenotype) and compare the two to build a differential network

•  Shows changes in the co-expression pattern

Case Control Differential Network

Page 56: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 56

•  Issues: 1.  Becomes very cumbersome if phenotype is not binary

2.  Does not work for continuous-valued phenotypes 3.  By dividing the samples into two groups, we will have less

statistical power in identifying co-expression patterns 4.  Fails in a case shown below

Gene 1

Gen

e 2

Page 57: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 57

Approach 3: First, find genes associated with the phenotype and then reconstruct a context-specific network only using those genes

•  Issue: Ignores the strength (p-value) of gene-phenotype association

gene 1 gene 1 gene 1

gene

2

gene

2

gene

2

genes

sam

ples

Calculate pair-wise

correlations Filter o

ut small

corre

lations

Filter o

ut non-D

EGs

Page 58: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Gene Co-expression Networks

NIH Big Data Center of Excellence 58

Approach 4: Calculate p-values of gene-gene correlation and gene-phenotype associations separately and combine together using Fisher’s method or Stouffer’s method à Simplified-InPheRNo •  Specifically useful to identify transcription factor-gene-

phenotype associations

gene expression phenotype

gene

exp

ress

ion

gene

exp

ress

ion

Combine the two p-values

Page 59: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Questions?

NIH Big Data Center of Excellence 59

Page 60: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Regression algorithms

•  Lasso: learns a linear model from the training data using only a few features (sparse linear model)

•  Elastic Net: learns a linear model from the training data by linearly combining ridge and Lasso regression regularization terms (a generalization of both Lasso and ridge regression)

NIH Big Data Center of Excellence

�̂ = argmin�

�||y �X�||2 + �1||�||1

�̂ = argmin�

�||y �X�||2 + �2||�||2 + �1||�||1

60

Page 61: Introduction to Systems Biology IIveda.cs.uiuc.edu/.../08_Intro_Systems_Biology_II.pdf · Systems Biology • Systems biology is the computational and mathematical modeling of complex

Regression algorithms

•  Kernel-SVR:

•  Linear SVR learns a linear model such that it has at most ε-deviation from the response values and is as flat as possible

(Smola and Schölkopf, 1998)

•  Kernel-SVR generalizes the idea to nonlinear models by mapping the features to a high-dimensional kernel space

NIH Big Data Center of Excellence

x

x

x x

x

xxx

x

xx

x

x

x

+ε−ε

x

ζ+ε

−ε0

ζ

61