idea national resource for proteomics core facilities

Gene Set Analysis for proteomics data: A case study

Presenter: Yasir Rahmatallah

Department of Biomedical Informatics April 5, 2017

IDeA National Resource for Proteomics Core Facilities Directors/Staff Workshop

Outline

We used the membrane-enriched proteome profiles of tumor and adjacent normal tissue from 8 colorectal cancer (CRC) patients, obtained using label-free liquid chromatography tandem mass spectrometry (LC-MS/MS).

Data obtained from Table S1 in: Manveen Sethi et al., Quantitative proteomic analysis of paired colorectal cancer and non-tumorigenic tissues reveals signature proteins and perturbed pathways involved in CRC progression and metastasis, Journal of Proteomics 126 (2015) 54–67.

We use the rotation gene set test (ROAST) for self-contained gene set analysis (GSA) over curated gene sets from the Molecular Signatures Database (MSigDB). ROAST is a multivariate statistical test that was originally proposed for GSA of microarray data. It uses the framework of linear models and tests whether the expression levels of the genes in a set yield a particular non-zero contrast of the model coefficients. It accounts for correlations between genes and can use different alternative hypotheses, testing whether the direction of change for the genes in a set is up, down or mixed. ROAST assesses significance using rotation, a Monte Carlo simulation scheme for multivariate regression models.

We compare the results with conventional GO enrichment analysis presented in the original paper and well-known CRC hallmarks.

Approach

1. Detected proteins dataset: detected proteins (Ensembl protein IDs)

2. Generate data matrix

3. Convert Ensembl protein IDs to gene symbol IDs (STRING database)

4. Add the data of Ensembl IDs that map to the same gene symbol

5. Form gene sets from the C2 curated gene sets (MSigDB), using gene symbol IDs

6. Perform test (ROAST)

7. Differential gene sets

Supplemental Table S1 - Total non-redundant proteins observed in combined tumor tissues (T1-T8)

Identifier Description T1 T2 T3 T4 T5 T6 T7 T8 NSAF_T1 NSAF_T2 NSAF_T3 NSAF_T4 NSAF_T5 NSAF_T6 NSAF_T7 NSAF_T8
ENSP00000000233 ADP-ribosylation factor 5 [Source:HGNC Symbol;Acc:658] 0 15 31 0 0 35 15 37 4.26E-05 0.001629 0.0027909 4.66E-05 6.18E-05 0.003477 0.00162 0.002879
ENSP00000000412 mannose-6-phosphate receptor (cation dependent) [Source:HGNC Symbol;Acc:6752] 6 4 5 12 19 11 6 5 0.0003666 0.000313 0.0003223 0.000770707 0.0015946 0.000745 0.000449 0.000279
ENSP00000003100 cytochrome P450, family 51, subfamily A, polypeptide 1 [Source:HGNC Symbol;Acc:2649] 3 0 2 1 4 2 2 2 0.000107 1.88E-05 7.94E-05 5.01E-05 0.0001994 8.78E-05 9.36E-05 6.88E-05
ENSP00000005257 v-ral simian leukemia viral oncogene homolog A (ras related) [Source:HGNC Symbol;Acc:9839] 6 4 6 10 0 3 6 5 0.0004816 0.000411 0.0005003 0.00085039 5.37E-05 0.000298 0.00059 0.000367
ENSP00000027335 cadherin 17, LI cadherin (liver-intestine) [Source:HGNC Symbol;Acc:1756] 15 0 20 12 0 0 5 1 0.0002939 1.17E-05 0.0004038 0.000259131 1.37E-05 1.09E-05 0.000128 2.56E-05
ENSP00000054666 vesicle-associated membrane protein 3 [Source:HGNC Symbol;Acc:12644] 7 4 5 4 3 2 7 6 0.0011605 0.000858 0.0008841 0.000761158 0.0007852 0.000444 0.001422 0.000905
ENSP00000157812 proteasome (prosome, macropain) 26S subunit, ATPase, 4 [Source:HGNC Symbol;Acc:9551] 11 0 10 10 6 13 2 6 0.0004251 2.28E-05 0.0004032 0.000424296 0.0003484 0.000573 0.000113 0.000216
ENSP00000167588 keratin 20 [Source:HGNC Symbol;Acc:20412] 33 4 6 32 0 13 4 36 0.0012077 0.0002 0.0002434 0.001280803 2.61E-05 0.000559 0.000199 0.001185
ENSP00000176643 aldehyde dehydrogenase 3 family, member A2 [Source:HGNC Symbol;Acc:403] 1 1 1 3 0 0 2 1 4.79E-05 5.90E-05 4.97E-05 0.000122075 2.31E-05 1.83E-05 9.77E-05 4.31E-05
ENSP00000184266 NADH dehydrogenase (ubiquinone) 1 beta subcomplex, 4, 15kDa [Source:HGNC Symbol;Acc:7699] 5 3 7 2 2 3 6 2 0.0006327 0.000496 0.0008962 0.000314367 0.0004169 0.000462 0.000916 0.000259

Supplemental Table S1 - Total non-redundant proteins observed in combined non-tumor tissues (N1-N8)

Identifier Description N1 N2 N3 N4 N5 N6 N7 N8 NSAF_N1 NSAF_N2 NSAF_N3 NSAF_N4 NSAF_N5 NSAF_N6 NSAF_N7 NSAF_N8
ENSP00000000233 ADP-ribosylation factor 5 [Source:HGNC Symbol;Acc:658] 10 0 13 26 11 21 42 25 0.0012792 6.55E-05 0.001973 0.00243 0.001287 0.002081 0.003785 0.002925
ENSP00000000412 mannose-6-phosphate receptor (cation dependent) [Source:HGNC Symbol;Acc:6752] 4 1 3 5 3 4 2 2 0.0003625 0.00013 0.000338 0.000333 0.000259 0.000288 0.000147 0.00019
ENSP00000003100 cytochrome P450, family 51, subfamily A, polypeptide 1 [Source:HGNC Symbol;Acc:2649] 0 0 1 1 0 3 1 1 2.183E-05 2.35E-05 7.86E-05 4.93E-05 2.01E-05 0.000121 4.79E-05 6.17E-05
ENSP00000005257 v-ral simian leukemia viral oncogene homolog A (ras related) [Source:HGNC Symbol;Acc:9839] 3 0 4 4 3 5 0 4 0.0003704 5.69E-05 0.000571 0.000358 0.00034 0.000462 3.87E-05 0.000448
ENSP00000006724 carcinoembryonic antigen-related cell adhesion molecule 7 [Source:HGNC Symbol;Acc:1819] 9 5 6 9 12 0 0 9 0.000807 0.000503 0.000662 0.000607 0.000975 3.37E-05 3.1E-05 0.00076

Table for Tumor samples

Table for Normal samples

Combined Table (generated using MS Excel)

Identifier class Description S1 S2 S3 S4 S5 S6 S7 S8 NSAF_S1 NSAF_S2 NSAF_S3 NSAF_S4 NSAF_S5 NSAF_S6 NSAF_S7 NSAF_S8
ENSP00000000233 T ADP-ribosylation factor 5 [Source:HGNC Symbol;Acc:658] 0 15 31 0 0 35 15 37 4.26E-05 0.001629 0.002791 4.66E-05 6.18E-05 0.003477 0.00162 0.002879
ENSP00000000412 T mannose-6-phosphate receptor (cation dependent) [Source:HGNC Symbol;Acc:6752] 6 4 5 12 19 11 6 5 0.000367 0.000313 0.000322 0.000771 0.001595 0.000745 0.000449 0.000279
ENSP00000003100 T cytochrome P450, family 51, subfamily A, polypeptide 1 [Source:HGNC Symbol;Acc:2649] 3 0 2 1 4 2 2 2 0.000107 1.88E-05 7.94E-05 5.01E-05 0.000199 8.78E-05 9.36E-05 6.88E-05
ENSP00000005257 T v-ral simian leukemia viral oncogene homolog A (ras related) [Source:HGNC Symbol;Acc:9839] 6 4 6 10 0 3 6 5 0.000482 0.000411 0.0005 0.00085 5.37E-05 0.000298 0.00059 0.000367
...
ENSP00000299179 N mitochondrial ribosomal protein L43 [Source:HGNC Symbol;Acc:14517] 2 0 1 0 0 1 1 1 0.000276 5.94E-05 0.000199 4.16E-05 5.08E-05 0.000132 0.000121 0.000156
ENSP00000299198 N creatine kinase, brain [Source:HGNC Symbol;Acc:1991] 32 35 9 16 6 22 6 30 0.001905 0.002239 0.000668 0.000728 0.00035 0.001048 0.000279 0.001684
ENSP00000299300 N chaperonin containing TCP1, subunit 2 (beta) [Source:HGNC Symbol;Acc:1615] 8 2 6 4 3 10 11 8 0.000369 0.000117 0.000339 0.000147 0.00014 0.000362 0.000365 0.000348
ENSP00000299427 N tripeptidyl peptidase I [Source:HGNC Symbol;Acc:2073] 11 1 3 24 3 15 4 12 0.000469 6.58E-05 0.000171 0.000752 0.000131 0.000503 0.000134 0.00048

R code (generate data matrix)

# read the provided raw counts or NSAF data
# put your own working directory instead
setwd("C:/Users/rahmatallahyasir/Desktop/Yasir/inbre/proteomics_workshop/R_code")
tab <- read.csv("Table_S1.csv", header=TRUE, skip=1)
# identifiers of the tumor (T) and normal (N) rows
id.T <- as.character(tab$Identifier[as.character(tab$class)=="T"])
id.N <- as.character(tab$Identifier[as.character(tab$class)=="N"])
# keep Ensembl protein identifiers only
id.T <- id.T[grep("ENSP", id.T)]
id.N <- id.N[grep("ENSP", id.N)]
# drop decoy ("reverse") entries; grepl() is safe even when nothing matches
id.T <- id.T[!grepl("reverse", id.T)]
id.N <- id.N[!grepl("reverse", id.N)]
# columns 12 to 19 hold the NSAF values of the 8 samples
mat.T <- as.matrix(tab[as.character(tab$class)=="T",c(12:19)])
rownames(mat.T) <- as.character(tab[as.character(tab$class)=="T",]$Identifier)
mat.T <- mat.T[id.T,]
mat.N <- as.matrix(tab[as.character(tab$class)=="N",c(12:19)])
rownames(mat.N) <- as.character(tab[as.character(tab$class)=="N",]$Identifier)
mat.N <- mat.N[id.N,]
# combined matrix: normal samples in columns 1 to 8, tumor samples in columns 9 to 16
mat <- matrix(0, length(unique(c(id.N,id.T))), 16)
rownames(mat) <- unique(c(id.N,id.T))
mat[id.N,1:8] <- mat.N
mat[id.T,9:16] <- mat.T
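The decoy-filtering step above relies on negative indexing with grep(). A minimal sketch on toy identifiers (not the workshop data) shows why that idiom can be risky: when no identifier matches the pattern, grep() returns an empty index, and negating it selects nothing rather than everything. The variable names here are illustrative only.

```r
# Toy identifiers; none of them contains the decoy tag "reverse".
ids <- c("ENSP00000000233", "ENSP00000000412", "ENSP00000003100")

# Negative indexing with an empty grep() result selects nothing at all.
dropped_all <- ids[-grep("reverse", ids)]

# Logical negation with grepl() keeps every non-matching identifier.
kept <- ids[!grepl("reverse", ids)]

print(length(dropped_all))
print(length(kept))
```

Whether this matters for a given table depends on whether decoy entries are actually present in it; the grepl() form behaves the same in both cases.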

STRING

STRING protein annotations file

R code (extract gene symbols)

# extract gene names using the protein annotations obtained from the STRING database
tt <- read.delim("string_protein_annotations_CRC.tsv", skip=1, header=FALSE, sep="\n")
ens <- rownames(mat)
ens_map <- array("", c(1,length(ens)))
gg <- array("", c(1,nrow(tt)))
a <- as.character(tt[1:915,])
# the gene symbol is the text before the tab and the "9606." taxon prefix
b <- strsplit(a, "\t9606.")
for(k in 1:length(b)) gg[1,k] <- b[[k]][1]
# for every Ensembl protein ID, find the annotation row that mentions it
for(m in 1:length(ens))
{
  for(k in 1:nrow(tt))
    if(length(grep(ens[m], tt[k,]))) ens_map[1,m] <- gg[1,k]
}
rownames(mat) <- as.vector(ens_map)
# how many ensembl protein identifiers without mapping to gene names
sum(rownames(mat)=="")
# [1] 21 (protein ensembl identifiers with no mapping to gene symbols)
# drop unmapped rows and rows whose name is an Ensembl gene ID
mat <- mat[rownames(mat)!="",]
mat <- mat[!grepl("ENSG", rownames(mat)),]
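The identifier conversion above is a lookup from Ensembl protein IDs to gene symbols. As a hedged alternative to the nested loops, the sketch below does the same kind of lookup with match() on a small, hypothetical two-column annotation table; the real STRING annotation file has a different layout, so the data frame and its column names are assumptions made for illustration.

```r
# Hypothetical annotation table: one gene symbol per Ensembl protein ID.
annot <- data.frame(
  symbol  = c("ARF5", "M6PR", "CYP51A1"),
  protein = c("ENSP00000000233", "ENSP00000000412", "ENSP00000003100"),
  stringsAsFactors = FALSE
)

# Protein IDs to convert; the second one is deliberately absent from annot.
ens <- c("ENSP00000000412", "ENSP00000999999", "ENSP00000000233")

# match() returns the row of annot for each ID, or NA when there is none.
idx <- match(ens, annot$protein)

# Unmapped IDs get an empty string, mirroring the convention used above.
symbols <- ifelse(is.na(idx), "", annot$symbol[idx])
print(symbols)
```

A vectorized lookup of this kind avoids scanning every annotation row for every protein, which can matter once the tables grow beyond a few thousand rows.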

R code (add the data with ensembl identifiers that map to the same gene symbol)

dg <- rownames(mat)[which(duplicated(rownames(mat)))]

uni <- unique(dg)

m2 <- matrix(0, length(uni), ncol(mat))

rownames(m2) <- uni

m <- matrix(0, length(uni), 1)

rownames(m) <- uni

for(k in 1:length(uni))

m[k,1]<- sum(rownames(mat)==uni[k])

for(k in 1:length(uni))

m2[k,] <- colSums(mat[rownames(mat)==uni[k],])

mat <- mat[-which(rownames(mat) %in% uni),]

mat <- rbind(mat, m2)
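Summing the rows that share a gene symbol can also be sketched with the base R function rowsum(). The toy matrix below is made up for illustration. Unlike the loop above, which appends the summed rows at the end of the matrix, rowsum() returns one row per symbol in sorted order.

```r
# Toy matrix: the symbol ARF5 appears twice and should be collapsed.
mat_toy <- matrix(c(1, 2, 3, 4,
                    10, 20, 30, 40),
                  nrow = 4,
                  dimnames = list(c("ARF5", "M6PR", "ARF5", "RALA"),
                                  c("S1", "S2")))

# Sum all rows that share the same row name.
collapsed <- rowsum(mat_toy, group = rownames(mat_toy))
print(collapsed)
```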

MSigDB

We use the C2 category which currently has 4729 curated gene sets.

We include only the genes detected in our dataset and discard everything else.

After this restriction, we consider only gene sets with 10 to 500 remaining genes.

1171 gene sets survived the filtering criteria.

R code (construct gene sets)

# build the list of C2 gene sets from the molecular signature database (MSigDB)
# consider only gene sets with 10 to 500 genes
library(GSEABase)
rn <- rownames(mat)
C2data <- getGmt("c2.all.v5.2.symbols.gmt")
# all C2 pathways
C2 <- as.list(geneIds(C2data))
lenC2 <- length(C2)
pathway.names <- names(C2)
len <- length(C2)
path.len <- path.len2 <- array(0,c(1,len))
c2.pathways <- c2.pathways2 <- list()
c2.pathway.names <- list()
upper.limit <- 500; lower.limit <- 10
for (k in seq(1, len, by=1))
{
  # original gene set size
  path.len[k] <- length(C2[[k]])
  if ((path.len[k] >= lower.limit) & (path.len[k] <= upper.limit))
    c2.pathways[[length(c2.pathways)+1]] <- C2[[k]]
  # gene set size after restricting to the detected genes
  path.len2[k] <- sum(C2[[k]] %in% rn)
  if ((path.len2[k] >= lower.limit) & (path.len2[k] <= upper.limit))
    c2.pathways2[[length(c2.pathways2)+1]] <- C2[[k]][which(C2[[k]] %in% rn)]
}
c2.pathway.names <- rbind(pathway.names[(path.len>=lower.limit)&(path.len<=upper.limit)])
c2.pathway.names2 <- rbind(pathway.names[(path.len2>=lower.limit)&(path.len2<=upper.limit)])
c2.len <- length(c2.pathway.names2)
c2.path.len.original <- path.len[which((path.len2 >= lower.limit) & (path.len2 <= upper.limit))]
c2.path.len <- path.len2[which((path.len2 >= lower.limit) & (path.len2 <= upper.limit))]
c2.genes <- unique(unlist(c2.pathways2))
names(c2.path.len.original) <- c2.pathway.names2
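The filtering logic above (restrict each gene set to the detected genes, then keep sets whose restricted size falls within the limits) can be sketched compactly in base R. The gene sets, gene names and size limits below are toy values chosen for illustration; the slides use limits of 10 and 500.

```r
# Toy gene sets and a toy list of detected genes (all names hypothetical).
gene_sets <- list(
  SET_A = c("G1", "G2", "G3", "G4"),
  SET_B = c("G1", "G9"),
  SET_C = c("G2", "G3", "G5", "G6", "G7")
)
detected <- c("G1", "G2", "G3", "G5")

# Toy size limits; the analysis in the slides uses 10 and 500.
lower <- 2
upper <- 3

# Restrict every set to the detected genes, then filter by restricted size.
restricted <- lapply(gene_sets, intersect, detected)
sizes <- lengths(restricted)
kept <- restricted[sizes >= lower & sizes <= upper]
print(names(kept))
```

Because the size filter is applied after the restriction, a set that is large in MSigDB can still pass when only a few of its genes were detected.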

R code (perform ROAST)

# perform the ROAST method using package limma
library(limma)
# log-transform NSAF values; zero values give -Inf and are replaced by -14
mat2 <- log(mat)
mat2[which(mat2 == -Inf)] <- -14
# row indices of the genes in each gene set
GS.ind <- list()
for(k in 1:c2.len) GS.ind[[length(GS.ind)+1]] <- which(rownames(mat) %in% c2.pathways2[[k]])
names(GS.ind) <- c2.pathway.names2
# design matrix: 8 normal samples followed by 8 tumor samples
label <- c(rep("N", 8), rep("T", 8))
design <- model.matrix(~label)
roast_test <- roast(y=mat2, index=GS.ind, design=design, contrast=2, nrot=1000, set.statistic="msq")
roast_test <- cbind("NGenes.original"=matrix(c2.path.len.original[rownames(roast_test)]), roast_test)
# keep significant gene sets with a clear majority direction of change
short_list <- roast_test[which((roast_test$FDR < 0.001) & ((abs(roast_test$PropUp-roast_test$PropDown) >= 0.5)|((roast_test$PropUp >= 0.6)|(roast_test$PropDown >= 0.6)))),]
write.csv(short_list, file="ROAST_results_UP0.6_or_DOWN0.6_or_UP-DOWN0.5.csv")
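The code above replaces log(0) = -Inf with the fixed value -14. A minimal sketch of one way to derive such a floor from the data is shown below, on toy NSAF values. The rule used here (one unit below the rounded-down smallest finite log value) is an assumption made for illustration, not the rule used to choose -14 in the slides.

```r
# Toy NSAF values; zeros stand for proteins not observed in a sample.
nsaf <- matrix(c(0, 1e-4, 2e-3, 0, 5e-5, 1e-3), nrow = 2)

# log(0) is -Inf, which downstream linear models cannot handle.
log_nsaf <- log(nsaf)

# Hypothetical rule: place the floor one unit below the smallest finite value.
min_finite <- min(log_nsaf[is.finite(log_nsaf)])
floor_value <- floor(min_finite) - 1

# Replace the -Inf entries with the floor value.
log_nsaf[!is.finite(log_nsaf)] <- floor_value
print(log_nsaf)
```

Any such floor is a modelling choice: it places zero observations below every observed value, and the size of the gap influences the estimated differences between groups.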

Figure: histogram of log(NSAF) values (x-axis: log(NSAF), from -14 to -4; y-axis: Frequency, from 0 to 2000).

Top DE C2 gene sets detected by ROAST

pathway NGenes.original NGenes PropDown PropUp Direction PValue FDR PValue.Mixed FDR.Mixed

SABATES_COLORECTAL_ADENOMA_DN 291 22 0.73 0.14 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

SMID_BREAST_CANCER_LUMINAL_A_UP 84 14 0.71 0.14 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

AZARE_NEOPLASTIC_TRANSFORMATION_BY_STAT3_UP 121 14 0.21 0.71 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

SANSOM_APC_MYC_TARGETS 217 10 0.00 0.70 Up 9.99E-04 9.99E-04 9.99E-04 9.99E-04

SHEDDEN_LUNG_CANCER_GOOD_SURVIVAL_A4 196 10 0.70 0.10 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

PLASARI_TGFB1_SIGNALING_VIA_NFIC_1HR_DN 106 13 0.69 0.15 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

SHETH_LIVER_CANCER_VS_TXNIP_LOSS_PAM4 261 16 0.69 0.13 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

FLECHNER_BIOPSY_KIDNEY_TRANSPLANT_REJECTED_VS_OK_UP 87 15 0.27 0.67 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

KOBAYASHI_EGFR_SIGNALING_24HR_UP 101 15 0.67 0.00 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

ZHANG_BREAST_CANCER_PROGENITORS_DN 145 12 0.67 0.00 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

VECCHI_GASTRIC_CANCER_ADVANCED_VS_EARLY_DN 138 14 0.64 0.07 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

PLASARI_TGFB1_TARGETS_10HR_DN 244 11 0.64 0.27 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

JAEGER_METASTASIS_DN 258 24 0.63 0.17 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

DELYS_THYROID_CANCER_DN 232 16 0.63 0.25 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

BROWNE_HCMV_INFECTION_8HR_UP 105 13 0.15 0.62 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

PEREZ_TP53_TARGETS 1174 20 0.60 0.25 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

BOYLAN_MULTIPLE_MYELOMA_PCA1_UP 101 15 0.07 0.60 Up 9.99E-04 9.99E-04 9.99E-04 9.99E-04

MANTOVANI_VIRAL_GPCR_SIGNALING_UP 86 10 0.60 0.10 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

SCHUHMACHER_MYC_TARGETS_UP 80 10 0.20 0.60 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

CHEOK_RESPONSE_TO_HD_MTX_UP 23 10 0.00 0.60 Up 9.99E-04 9.99E-04 9.99E-04 9.99E-04

APPEL_IMATINIB_RESPONSE 33 10 0.60 0.10 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

SHAFFER_IRF4_TARGETS_IN_ACTIVATED_B_LYMPHOCYTE 81 10 0.10 0.60 Up 9.99E-04 9.99E-04 9.99E-04 9.99E-04

CAIRO_HEPATOBLASTOMA_DN 267 10 0.60 0.10 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

GABRIELY_MIR21_TARGETS 289 10 0.60 0.20 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

WIERENGA_STAT5A_TARGETS_UP 217 10 0.60 0.20 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

KOHOUTEK_CCNT1_TARGETS 50 10 0.60 0.00 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

DANG_REGULATED_BY_MYC_UP 72 11 0.00 0.55 Up 9.99E-04 9.99E-04 9.99E-04 9.99E-04

RODRIGUES_DCC_TARGETS_DN 121 13 0.54 0.00 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

THUM_SYSTOLIC_HEART_FAILURE_DN 244 12 0.50 0.00 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

CHIBA_RESPONSE_TO_TSA_UP 52 12 0.50 0.00 Down 9.99E-04 9.99E-04 9.99E-04 9.99E-04

Note: DCC in RODRIGUES_DCC_TARGETS_DN stands for Deleted in Colorectal Cancer.
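Every p-value in the table equals 9.99E-04. A short calculation suggests why, assuming the usual Monte Carlo convention p = (b + 1) / (nrot + 1), where b is the number of rotations with a statistic at least as extreme as the observed one. With nrot = 1000, the smallest attainable p-value is 1/1001, so these gene sets sit at the resolution limit of the test.

```r
# Number of rotations used in the ROAST call.
nrot <- 1000

# Smallest attainable p-value under the (b + 1) / (nrot + 1) convention.
min_p <- 1 / (nrot + 1)
print(signif(min_p, 3))

# Second-smallest attainable value, for comparison with the 0.001 cutoff.
next_p <- 2 / (nrot + 1)
print(c(min_p < 0.001, next_p < 0.001))
```

Under this assumption, a cutoff of 0.001 with 1000 rotations only admits gene sets for which no rotation matched or exceeded the observed statistic. More rotations would be needed to rank such sets against each other.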

GO enrichment analysis with respect to subcellular localization (no significance)

Functional annotation of the identified and quantified proteins derived from tumor (T1-T8, red bars) and non-tumor (N1-N8, blue bars) CRC tissues with respect to their subcellular localization, using the PloGO R package. The relative distribution is based on the protein IDs belonging to each of the non-unique, redundant categories.

GO enrichment analysis with respect to biological processes (no significance)

Functional annotation of the identified and quantified proteins derived from tumor (T1-T8, red bars) and non-tumor (N1-N8, blue bars) CRC tissues with respect to their biological processes, using the PloGO R package. The relative distribution is based on the protein IDs belonging to each of the non-unique, redundant categories.

Conclusions

The comparison of results for this proteomics dataset demonstrated that Gene Set Analysis (GSA) methods can outperform the conventional functional enrichment analysis (GO enrichment) and provide meaningful biological insights.

Curated gene sets incorporate biological knowledge into the analysis and provide better biological insights.

Some GSA methods originally designed for RNA-seq or microarray data can be adapted to proteomics data.

Other methods (example)

Bioconductor package GSAR provides non-parametric multivariate methods that test specific alternative hypotheses against the null (differential sample distribution, mean, and variance as well as differential gene co-expression).

Non-parametric methods need a larger sample size than parametric methods (e.g. 15 or 20 samples under each condition) to estimate statistical significance properly. Hence, applying non-parametric methods to our 8-by-8 dataset is not strictly appropriate.

We demonstrate how to selectively apply some of the available methods to the 20 genes from gene set ‘SABATES_COLORECTAL_ADENOMA_DN’.

Package GSAR is available at https://bioconductor.org/packages/release/bioc/html/GSAR.html

Differential mean vs differential variance (graph example)

R code (for a selected gene set, methods from package GSAR corroborate ROAST)

library(GSAR)
# genes of the selected gene set and group labels (1 = normal, 2 = tumor)
gs.genes <- c2.pathways2[[which(c2.pathway.names2=="SABATES_COLORECTAL_ADENOMA_DN")]]
group <- c(rep(1,8),rep(2,8))
KStest(mat2[gs.genes,], group=group, nperm=1000)
# [1] 0.000999001 (significant differential sample mean)
MDtest(mat2[gs.genes,], group=group, nperm=1000)
# [1] 0.001998002 (significant differential sample mean)
RKStest(mat2[gs.genes,], group=group, nperm=1000)
# [1] 0.2797203 (non-significant differential sample variance)
RMDtest(mat2[gs.genes,], group=group, nperm=1000)
# [1] 0.1688312 (non-significant differential sample variance)
# add small random values instead of -14 (which corresponds to 0 counts)
# this will bypass the problem of genes having 0 standard deviation under
# one condition which prevents calculating the correlation coefficient
mat4 <- mat2
ind <- which(mat2==-14)
mat4[ind] <- -14 + rnorm(length(ind), 0, min(apply(mat2, 1, "sd")))
# perform the GSNCA method
GSNCAtest(mat4[gs.genes,], group=group, nperm=1000)
# [1] 0.001998002 (significant differential co-expression, see MST2 plot in the next slide)
# plot the MST2 for the selected gene set
plotMST2.pathway(mat4[intersect(gs.genes, rownames(mat4)),], group=group, group1.name="Normal", group2.name="Tumor")
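The comment in the code above mentions that a gene with zero standard deviation under one condition prevents calculating the correlation coefficient. A minimal base R sketch with made-up values shows the effect and how a small random perturbation avoids it; the jitter scale of 0.01 is an arbitrary choice for illustration.

```r
set.seed(1)

# A gene that is constant (all zero counts, i.e. -14 on the log scale)
# across the 8 samples of one condition, and a second, varying gene.
x <- rep(-14, 8)
y <- rnorm(8)

# The correlation is undefined when one variable has zero variance.
r_const <- suppressWarnings(cor(x, y))
print(r_const)

# Adding small random noise makes the variance non-zero.
x_jit <- x + rnorm(8, mean = 0, sd = 0.01)
r_jit <- cor(x_jit, y)
print(is.finite(r_jit))
```

The resulting correlation for the jittered gene carries no biological meaning; the perturbation only keeps co-expression methods from failing on constant rows.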

MST2 of the 20 genes mapped to the SABATES_COLORECTAL_ADENOMA_DN gene set.

Thank You

Questions?
