Department of Biomedical Informatics
Mining Gene Co-expression Network for Cancer Biomarker Prediction
Kun Huang, Department of Biomedical Informatics
OSUCCC Biomedical Informatics Shared Resource
Outline
• Introduction
• Co-expression network for breast cancer
  • Frequent cancer co-expression network
  • Tissue-tissue network between stroma and tumor mass
• Other applications
  • Chronic lymphocytic leukemia
  • Glioblastoma
• Discussion
• Correlation / co-expression
• Time-course data
• Bayesian network
• Boolean network
• …
Boolean Network
Sahoo et al. Genome Biology 2008 9:R157
Gene Co-Expression
HMMR siRNA
Pearson Correlation Coefficient
Ranges from -1 to 1.
[Figure panels: r = 1, r = -1]
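As a rough illustration of the coefficient named on this slide, the sketch below computes the Pearson correlation between two expression profiles. The function name `pearson` and the toy profiles are hypothetical and are not taken from the talk.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy expression profiles over four samples (made-up values).
gene_a = [1.0, 2.0, 3.0, 4.0]
r_pos = pearson(gene_a, [2.0, 4.0, 6.0, 8.0])   # perfectly co-expressed
r_neg = pearson(gene_a, [8.0, 6.0, 4.0, 2.0])   # perfectly anti-correlated
```

The two toy calls correspond to the two extreme cases shown on the slide, r = 1 and r = -1.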
Gene Co-Expression Network
• Expansion
  • Negative correlation
  • Multiple breast cancer datasets
  • More anchor genes
  • …
• Is there a way to find all highly correlated genes in multiple datasets?
• Do these genes form clusters?
Frequent Gene Co-expression Network Mining
• Genes that appear in tight networks across multiple disease datasets may indicate functionally related biological modules, and can therefore provide insight into disease cell physiology and new directions for research.
Frequent network mining
• CODENSE
  • Search for frequent coherent dense subgraphs across large numbers of massive graphs
  • Unsupervised bottom-up clustering on unweighted, undirected networks
Data selection and correlation
• Selected 23 datasets from Gene Expression Omnibus (GEO):
  • Search term “metastatic cancer”
  • Contain both control and tumor, # samples > 8
  • Only primary tumor biopsy
• Correlation: |PCC| > 0.75 (really high similarity)
• For CODENSE:
  • Edge support: the edge appears in at least 4 datasets
  • Connectivity ratio r > 40% (r = L / [n(n-1)/2], where L is the number of edges and n the number of nodes)
  • # of nodes > 20
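The two thresholds above (edge support and connectivity ratio) can be sketched as follows. This is not CODENSE itself, only a minimal illustration of the filtering criteria. The function names, the dictionary layout (one `{(gene_a, gene_b): PCC}` dict per dataset, each pair listed once) and the toy values are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_edges(pcc_by_dataset, pcc_cutoff=0.75, min_support=4):
    """Edges whose |PCC| exceeds the cutoff in at least min_support datasets."""
    support = Counter()
    for pcc in pcc_by_dataset:                 # one dict per dataset
        for pair, value in pcc.items():
            if abs(value) > pcc_cutoff:
                support[frozenset(pair)] += 1  # undirected edge
    return {edge for edge, count in support.items() if count >= min_support}

def connectivity_ratio(nodes, edges):
    """r = L / [n(n-1)/2] for the subgraph induced by the given nodes."""
    n = len(nodes)
    present = sum(
        1 for a, b in combinations(sorted(nodes), 2) if frozenset((a, b)) in edges
    )
    return present / (n * (n - 1) / 2)

# Toy input: five hypothetical datasets with two gene pairs each.
datasets = [
    {("A", "B"): 0.80, ("B", "C"): 0.90},
    {("A", "B"): -0.90, ("B", "C"): 0.20},
    {("A", "B"): 0.78, ("B", "C"): 0.85},
    {("A", "B"): 0.76, ("B", "C"): 0.10},
    {("A", "B"): 0.30, ("B", "C"): 0.40},
]
edges = frequent_edges(datasets)
ratio = connectivity_ratio({"A", "B", "C"}, edges)
```

In the toy input only the pair A-B passes the cutoff in four datasets, so the three-gene subgraph has one of its three possible edges.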
Results from CODENSE
• 44 networks (clusters) are identified
• # of nodes: 21 ~ 74 (average 44)
• Connectivity: 0.41 ~ 0.78
GO Enrichment Analysis on the Networks
• Networks with enriched GO terms associated with at least 1/3 of the genes
  • Immune response/system – 15
  • Protein translation (ribosome) – 5
  • Development – 4
  • Metabolism and energy (oxidative phosphorylation or monocarboxylic acid metabolism) – 3
  • Cell cycle – 2
  • Muscle contraction – 1
• 14 networks do NOT satisfy the above criterion
  • Potential new functions
  • New interactions
Use cluster 2 to predict survival outcome
• NKI-295 dataset
• Clustering: k-means, k = 2, 100 random initializations
• Kaplan-Meier curve and log-rank test for survival analysis and comparison
• Test for different patient groups
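The two analysis steps listed above can be sketched in plain Python. This is an illustrative sketch, not the implementation used for the slides: the function names, the restart and iteration defaults, and all toy data are assumptions, and the Kaplan-Meier curve itself is not drawn here.

```python
import math
import random

def kmeans(points, k=2, n_init=100, n_iter=50, seed=0):
    """k-means with n_init random starts; returns the labels of the lowest-cost run."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def assign(centers):
        return [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]

    best_labels, best_cost = None, float("inf")
    for _start in range(n_init):
        centers = rng.sample(points, k)            # random starting centers
        for _step in range(n_iter):
            labels = assign(centers)
            updated = []
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                # Keep the old center if a cluster lost all its members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                               if members else centers[c])
            if updated == centers:
                break
            centers = updated
        labels = assign(centers)
        cost = sum(dist2(p, centers[lab]) for p, lab in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (event observed) or 0 (censored)."""
    all_times, all_events = times_a + times_b, events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)    # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)    # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2.0))        # chi-square tail, 1 degree of freedom

# Toy expression vectors for six hypothetical patients (two obvious groups).
patients = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(patients)

# Toy survival times with every event observed.
p_same = logrank_p([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
p_diff = logrank_p([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
```

In practice the two groups returned by the clustering would supply the two arms of the log-rank test. Identical toy arms give a p-value of 1, and clearly separated arms give a small one.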
Predict Survival Outcome
Predict Survival Outcome
Relation to BRCA1
Finding New Gene Functions
Test | siRNA | HDR (relative) | Centrosome (HeLa cell) | Centrosome (B.C. cell)
1 | Control siRNA | 1 | 2% | 4%
2 | BRCA1 | 0.16 | ND | 22%
3 | BRCA2 | 0.02 | ND | ND
4 | HMMR | 1.33 | 10% | 19%
5 | KIAA0101 | 0.52 | 10% | 19%
6 | ASPM | 1.0 | 2% | ND
7 | NUSAP1 | 0.5 | 5% | ND
8 | DLG7 | 0.5 | ND | ND
9 | KIF14 | 0.33 | 20% | ND
10 | KIF23 | ND | 27% | 15%
Finding New Gene Functions
KIAA0101
ER-Negative Breast Cancer
ER-Negative Breast Cancer
Tumor Microenvironment (TME)
Cell, Volume 100, Issue 1, 7 January 2000, Pages 57-70
Kalluri et al. Nature Reviews Cancer, published online 30 March 2006 | doi:10.1038/nrc1877
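Bipartite Graph
[Figure: bipartite graph with stroma nodes on one side and tumor nodes on the other]
Network Density (r)
• For a bipartite network with M+N nodes (M nodes on one side and N nodes on the other) and K edges, $r = K/(MN)$.
• For a weighted bipartite network with M+N nodes and K edges with weights $W_1, \dots, W_K$, $r = \frac{1}{MN}\sum_{i=1}^{K} W_i$.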
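A small worked example of the density definition, assuming edges are passed as a plain list of weights (1.0 per edge in the unweighted case). The function name and the numbers are hypothetical.

```python
def bipartite_density(edge_weights, m, n):
    """Density of a bipartite network with m nodes on one side and n on the other.

    edge_weights holds one weight per edge; use 1.0 for every edge of an
    unweighted network, which reduces the formula to K / (M * N).
    """
    return sum(edge_weights) / (m * n)

# Unweighted: K = 6 edges between M = 3 and N = 4 nodes.
unweighted = bipartite_density([1.0] * 6, m=3, n=4)

# Weighted: three edges between M = 2 and N = 2 nodes.
weighted = bipartite_density([0.9, 0.8, 0.7], m=2, n=2)
```

The unweighted example gives 6/12 and the weighted one gives 2.4/4.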
Bipartite Quasi-clique Discovery Algorithm
• A greedy algorithm
• The original algorithm for quasi-clique finding is from Ou and Zhang (2007). A new multimembership clustering method. J. of Ind. and Man. Opt., 3(4): 619-624.
• Modified for the bipartite graph
Four steps:
1. Set the threshold on edge weight: w0 = g · max(wi).
2. Initialize a new search: pick the edge with the maximal weight (larger than w0) that has not been assigned to any network as the first edge of a new network.
3. Grow: alternately add nodes to the network from both sides, each time choosing the node that contributes most to the network density, provided its contribution to the density is higher than an adaptive threshold defined by two parameters l and t.
3.1. Stop when no new node can be added; go to Step 2.
4. Merge: iteratively merge networks with more than 50% overlap (with respect to the smaller one).
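Steps 1 to 3 above can be sketched as follows. The slides do not give the formula for the adaptive threshold, so the form used here, 1 - 1/(2·l·(size + t)) times the current density, is an assumption. The function names and the `weights` layout, a dict keyed by (stroma index, tumor index), are also hypothetical. The merge step (Step 4) is omitted.

```python
def mean_weight(weights, node, others, node_is_stroma):
    """Average edge weight between one node and the nodes on the other side."""
    total = 0.0
    for other in others:
        edge = (node, other) if node_is_stroma else (other, node)
        total += weights.get(edge, 0.0)
    return total / len(others)

def grow(weights, seed_edge, n_stroma, n_tumor, l=2.0, t=2.0):
    """Step 3: grow one network from a seed edge, alternating between the two sides."""
    stroma, tumor = {seed_edge[0]}, {seed_edge[1]}
    add_stroma, failures = True, 0
    while failures < 2:                      # stop once neither side can add a node
        density = sum(weights.get((i, j), 0.0) for i in stroma for j in tumor)
        density /= len(stroma) * len(tumor)
        size = len(stroma) + len(tumor)
        alpha = 1.0 - 1.0 / (2.0 * l * (size + t))   # ASSUMED adaptive threshold
        inside, others = (stroma, tumor) if add_stroma else (tumor, stroma)
        total = n_stroma if add_stroma else n_tumor
        scores = {v: mean_weight(weights, v, others, add_stroma)
                  for v in range(total) if v not in inside}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] > alpha * density:
            inside.add(best)
            failures = 0
        else:
            failures += 1
        add_stroma = not add_stroma
    return stroma, tumor

def find_networks(weights, n_stroma, n_tumor, g=0.7, l=2.0, t=2.0):
    """Steps 1 and 2: threshold the edges, then seed and grow networks greedily."""
    w0 = g * max(weights.values())           # Step 1: edge-weight threshold
    assigned, networks = set(), []
    for edge in sorted(weights, key=weights.get, reverse=True):
        if weights[edge] <= w0:
            break                            # remaining edges are too weak to seed
        if edge in assigned:
            continue
        stroma, tumor = grow(weights, edge, n_stroma, n_tumor, l, t)
        networks.append((stroma, tumor))
        assigned.update((i, j) for i in stroma for j in tumor)
    return networks

# Toy weights: a dense 2 x 2 block plus one weak, unrelated edge.
toy_weights = {(0, 0): 0.9, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.9, (2, 2): 0.2}
toy_networks = find_networks(toy_weights, n_stroma=3, n_tumor=3)
```

Choosing the candidate with the largest mean weight to the other side is equivalent to choosing the node that raises the weighted density the most, which is how "contributes most" is read here.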
Workflow
• Select a breast cancer dataset from GEO: GSE5847 contains 47 samples with separate microarray data for stroma and tumor, separated using laser capture microdissection
• Compute Pearson Correlation Coefficients (PCC) for every pair of genes between the stroma and the tumor
• Use the PCC values as the weights for the edges and set the three parameters g (0.7), l (2), and t (2) to run the bipartite quasi-clique finding algorithm
• Select the top 10 networks for further analysis
Results
• Stroma-tumor network
Network | Stroma genes | Stroma GO BP (p-values) | Tumor genes | Tumor GO BP (p-values) | Common genes
1 | 19 | M phase/cell cycle (1e-6) | 68 | Cell cycle/M phase (0) | 13
2 | 35 | Cell-cell signaling (8e-6); Neuropeptide signaling pathway (0.000686) | 49 | Cell-cell signaling (0.000549); Synaptic transmission (0.000084) | 32
3 | 33 | Immune response (0); Response to virus (0) | 36 | Immune response (0); Response to virus (0) | 31
4 | 29 | Immune response (0) | 29 | Immune response (0); Positive regulation of B cell proliferation (0.000536) | 18
5 | 23 | Cell-cell signaling (0.003158); Synaptic transmission (0.005969) | 28 | Secretion by cell (0.000218); Cell-cell signaling (0.000441) | 15
6 | 17 | M phase (0); Cell cycle (2e-6) | 31 | M phase (0); Mitosis (0) | 7
7 | 12 | Immune response (8e-6) | 27 | Follicle-stimulating hormone secretion (33e-6) | 10
8 | 13 | Extracellular space (1e-6) (CC) | 19 | Extracellular space (9.7e-5) (CC) | 7
9 | 7 | Wound healing (0.000114) | 25 | Amine metabolic process (3.5e-6) | 6
10 | 5 | - | 24 | Extracellular region part (0.000267) (CC) | 4
Extracellular Matrix Network
• Tumor microenvironment
Network 8
Stroma: AMPH, DCN, ELN, FBN1, HTRA1, LRRC17, OMD, PDGFRL, SFRP4, SGCD, SPON1, TGFB3, ZFHX4
Tumor: AMPH, ANGPTL2, BNC2, COL1A1, DPT, ECM2, EHD2, FAT4, GLT8D2, GUCY1B3, HTRA1, KCNJ8, LRRC17, MMP2, OLFML1, PDGFRL, SFRP4, SPON1, ZFHX4
Outline
• Introduction
• Co-expression network for breast cancer
  • Frequent cancer co-expression network
  • Tissue-tissue network between stroma and tumor mass
• Other applications
  • Chronic lymphocytic leukemia (CLL)
  • Glioblastoma
• Discussion
CLL Prognostic Biomarker
• CLL is the most common adult leukemia in the Western world. It is highly heterogeneous and can be indolent or progressive.
• Prognosis at an early stage is crucial, both for the survival of progressive patients and for sparing indolent patients unnecessary adverse treatment.
• Biological prognostic markers:
  • Serum markers (TK, B2M, sCD23)
  • FISH cytogenetics
  • IgVH mutational status - determination is time consuming and expensive
  • CD38 expression - actually independent of IgVH mutational status
  • ZAP-70 expression - not 100% correlated with IgVH mutational status, only accurate when patients are in the progressive stage
Network 17
• 51 genes, including ZAP-70 and CD38
• r = 0.4142
• Including known ZAP-70 interacting genes - CD8A, CD3G, CD3D, CD247
Highly enriched Functions of Immune Response
Workflow of CLL Prognostic Biomarker Selection
• Compute gene expression level difference on IgVH mu+/- groups
• Genes with expression fold change > 1.5, p < 0.05
• Test the prediction accuracy of each gene on IgVH mutation status (cross validation)
• Select a group of feature genes that can differentiate IgVH mu+/- groups (mRMR)
• Identify potential prognostic biomarkers
• Further select prognostic biomarkers by testing on a separate CLL dataset
[Flowchart; gene counts shown in the diagram: 40, 10, 5, 11, 6, 40]
Differential Expression of Genes between IgVH mutation +/-
Gene | p-value (Unmutated vs Mutated) | Mean fold change (Unmutated vs Mutated) | p-value (Patients vs Normal)
SH2D1A | 1.3E-3 | 1.944 | 0.089
IL2RB | 8.1E-5 | 1.821 | 4.8E-16
KLRK1 | 4.9E-3 | 1.813 | 0.0079
CD247 | 1.6E-4 | 1.807 | 7.1E-8
GZMB | 3.1E-3 | 1.719 | 6.2E-11
CD3G | 0.017 | 1.685 | 0.41
CD3D | 1.4E-4 | 1.621 | 4.3E-16
GZMK | 0.022 | 1.586 | 9.2E-11
CD8A | 9.9E-5 | 1.576 | 3.5E-9
NKG7 | 8.3E-4 | 1.560 | 1.3E-9
ZAP70 | 7.9E-4 | -1.403 | 5.5E-12
LAG3 | 0.023 | -1.598 | 0.028
Prediction of IgVH Mutational Status with Individual Genes
Gene | Accuracy | Subcellular location
SH2D1A | 57.32% | cytoplasmic
IL2RB | 68.84% | membrane
KLRK1 | 63.67% | membrane
CD247 | 66.03% | membrane
GZMB | 57.13% | secreted
CD3G | 62.52% | membrane
CD3D | 64.27% | membrane
GZMK | 57.58% | secreted
CD8A | 68.31% | membrane
NKG7 | 64.94% | membrane
ZAP70 | 68.46% | cytoplasmic
LAG3 | 59.53% | membrane
ZAP70+IL2RB | 73.22% | -
ZAP70+IL2RB+CD8A | 74.62% | -
• Two groups of patients (GDS1494): 49 IgVH mu-; 51 IgVH mu+
• Each gene / gene set was tested independently
• A linear classifier with 20% hold-out and 100 repeats
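The evaluation scheme in the last bullet can be sketched for a single gene. The slides do not say which linear classifier was used, so the midpoint-between-class-means rule below is an assumption (it is a linear decision rule in one dimension). The function name and the toy data are also hypothetical.

```python
import random

def holdout_accuracy(values, labels, holdout=0.2, repeats=100, seed=0):
    """Mean test accuracy of a one-gene threshold classifier over repeated random splits."""
    rng = random.Random(seed)
    indices = list(range(len(values)))
    n_test = max(1, round(holdout * len(values)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        group0 = [values[i] for i in train if labels[i] == 0]
        group1 = [values[i] for i in train if labels[i] == 1]
        if not group0 or not group1:
            continue                         # skip splits whose training set lost a class
        mean0 = sum(group0) / len(group0)
        mean1 = sum(group1) / len(group1)
        threshold = (mean0 + mean1) / 2.0    # ASSUMED rule: midpoint of class means
        direction = 1.0 if mean1 >= mean0 else -1.0
        correct = sum(
            1 for i in test
            if (1 if direction * (values[i] - threshold) > 0 else 0) == labels[i]
        )
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Toy data: one gene, 10 samples per class, perfectly separable (made-up values).
expression = [0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)]
status = [0] * 10 + [1] * 10
accuracy = holdout_accuracy(expression, status)
```

A gene set such as ZAP70+IL2RB would need a multivariate linear classifier in place of the one-dimensional threshold; the split-and-repeat loop would stay the same.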
Top Ten Genes Selected by mRMR
Rank | Name | mRMR Score
1 | IL2RB | 0.101
2 | LAG3 | 0.020
3 | RASGRP1 | 0.029
4 | CD8A | 0.021
5 | XCL1 | 0.011
6 | ZAP70 | 0.018
7 | CD79A | 0.001
8 | FMNL1 | 0.000
9 | KLRK1 | 0.000
10 | CST7 | 0.002
Cross Check with Outcome Data
• LAG3: involved in T-cell-dependent B-cell activation; recently reported to be highly correlated with IgVH mutational status
• IL2RB: involved in endocytosis and transduction of mitogenic signal of IL2, expression on B-cells was linked to CLL
• CD8A and CD247: expression of CD8A on B-cells has been linked to CLL
• KLRK1: involved in immune surveillance exerted by T/B-cells
Using GSE10138
Application to GBM
Cluster p-value # Genes
A 0.0010946 79
B 0.00054934 87
C 0.0016763 23
D 0.0063116 466
E 0.0057298 154
F 0.000957 79
G 0.0010599 29
H 0.0086392 303
I 0.00098224 39
J 0.0097023 21
K 0.0061901 97
L 0.000352 42
Rank | ID | Name | P-value | Term in Query | Term in Genome
1 | GO:0002376 | immune system process | 5.380E-37 | 119 | 1258
2 | GO:0006955 | immune response | 4.421E-35 | 95 | 832
3 | GO:0006952 | defense response | 1.172E-22 | 75 | 760
4 | GO:0002684 | positive regulation of immune system process | 2.644E-21 | 47 | 303
5 | GO:0002682 | regulation of immune system process | 1.374E-20 | 58 | 493
6 | GO:0050776 | regulation of immune response | 1.825E-17 | 41 | 277
7 | GO:0009611 | response to wounding | 4.149E-17 | 62 | 658
8 | GO:0050778 | positive regulation of immune response | 6.545E-16 | 33 | 188
9 | GO:0006954 | inflammatory response | 1.640E-15 | 47 | 414
10 | GO:0002252 | immune effector process | 4.530E-14 | 36 | 260
Functional Enrichment analysis using IPA for cluster D.
GO Enrichment results using ToppGene for Cluster E (GO: Biological Processes)
Rank | ID | Name | P-value | Term in Query | Term in Genome
1 | GO:0008219 | cell death | 8.245E-3 | 28 | 1385
2 | GO:0016265 | death | 8.829E-3 | 28 | 1390
3 | GO:0010033 | response to organic substance | 1.299E-2 | 17 | 605
4 | GO:0012501 | programmed cell death | 1.735E-2 | 26 | 1279
5 | GO:0034097 | response to cytokine stimulus | 1.742E-2 | 7 | 93
6 | GO:0009628 | response to abiotic stimulus | 2.393E-2 | 14 | 443
7 | GO:0048545 | response to steroid hormone stimulus | 2.469E-2 | 10 | 225
8 | GO:0051093 | negative regulation of developmental process | 3.889E-2 | 18 | 728
9 | GO:0006915 | apoptosis | 4.244E-2 | 25 | 1265
Functional Enrichment analysis using IPA for cluster E. The x-axis shows the log (base 10) of the p-values of the enriched terms using Fisher's exact test.
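For readers unfamiliar with the Fisher's exact test mentioned for the enrichment analysis, a one-sided version for over-representation can be sketched as a hypergeometric tail probability. The function name and the toy counts are hypothetical. The background (genome) size and the annotated query size are not given on the slides, so none of the reported p-values is reproduced here.

```python
from math import comb

def enrichment_p_value(genome_size, term_in_genome, query_size, term_in_query):
    """One-sided Fisher's exact test for over-representation of a term.

    Returns P(overlap >= term_in_query) when query_size genes are drawn at
    random from genome_size genes, term_in_genome of which carry the term.
    """
    upper = min(term_in_genome, query_size)
    tail = sum(
        comb(term_in_genome, k) * comb(genome_size - term_in_genome, query_size - k)
        for k in range(term_in_query, upper + 1)
    )
    return tail / comb(genome_size, query_size)

# Toy numbers: 10 genes in the background, 5 annotated with the term,
# and a query of 5 genes that are all annotated.
p_small = enrichment_p_value(10, 5, 5, 5)

# Requiring an overlap of at least 0 is always satisfied.
p_all = enrichment_p_value(10, 5, 5, 0)
```

The "Term in Query" and "Term in Genome" columns of the tables correspond to `term_in_query` and `term_in_genome` in this sketch.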
Summary and Future Work
Gene co-expression networks provide rich information in predicting gene functions and disease mechanisms
Need to be integrated with other networks such as PPI
Summary and Future Work
Ongoing work 1: More biological and clinical validation
• Tissue microarray – at protein level
Ongoing work 2: Multiple tissue network for TME
• Microarray data for epithelial cells, fibroblast cells, endothelial cells, macrophages
• Moving to RNA-seq
Ongoing work 3: Biclique mining algorithm using frequent item set and graph summarization
Summary and Future Work
Ongoing work 4: Integrating multiple networks – disease network, phenotype network
Barabasi A-L, Network medicine – from obesity to “Diseasome”, NEJM, 357(4): 404-407, 2007.