new statistics and machine learning methods for proteogenomic … · 2016. 8. 3. · • data...
TRANSCRIPT
Proteomics & Biomarker Discovery
Statistics and Machine Learning Methods for Proteogenomic Data Analysis
D. R. Mani Broad Institute
Northeastern University Short Course Computation and Statistics for Targeted Proteomics May 5, 2016
Proteomics & Biomarker Discovery
Finding protein targets for Targeted Proteomics
§ Proteins for targeted proteomics
§ Literature § Pathways • Disease/phenotype
related
§ Characteristic Subset • E.g: P100
§ Unbiased Discovery
§ Observed/predicted proteins & peptides
ESP
Proteomics & Biomarker Discovery
Unbiased biomarker discovery yields targets for Targeted Proteomics
Adapted from Rifai, et. al., Nature Biotech., 2006.
Proteomics & Biomarker Discovery
Unbiased discovery is increasingly Proteogenomic (or Multi-omic)
§ Discovery efforts include multi-omic profiling • Omic profiling is getting
cheaper
§ Proteomic profiles are increasingly common • Smaller sample numbers
due to higher cost • More input material
needed Image Source: Goodacre, J. Exp. Bot 2005.
‘OMICS’ RESEARCH
Image Source: http://fluorous.com/images/omics.JPG
Proteomics & Biomarker Discovery
Proteogenomic analysis of Breast Cancer provides generalizable methods
§ NIH CPTAC initiative to perform large-scale proteogenomic analysis of cancer samples • Breast Cancer—Broad Institute • Colon Cancer—Vanderbilt • Ovarian Cancer—Johns Hopkins/PNNL
§ Presentation Goals:
• Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis, but generalizable
• Generalized applicability to wide range of data sets – Potential use for targeted data analysis
» Some methods applicable, others need to be modified/applied with care
Proteomics & Biomarker Discovery
Profiling of 105 TCGA samples produced largest proteomic dataset yet generated at Broad
6
• 105 BC Tumor Samples ◘ PAM50 Subtypes: 18 Her2, 25 Basal, 29
Luminal A, 33 Luminal B
• Analyzed by iTRAQ-‐MS ◘ Proteome + Phosphoproteome
Data Tables
TCGA genomic data mRNA, RNA-‐seq, Exome, SNP, CNV, MethylaSon, miRNA QUILTS
Tumor specific custom DB Spectrum Mill
Global proteome & phosphoproteome
raw MS data
MS search DB
TCGA Pipeline
mRNA CNV MutaSon
MethylaSon miRNA
Proteome
Phospho-‐proteome
Proteomics & Biomarker Discovery
1 mg total protein per tumor Internal reference: equal representation of basal, Her2 and Luminal A/B subtypes Tumor-specific databases based on whole exome seq and RNA seq
Sample processing
Proteomics & Biomarker Discovery
1 mg total protein per tumor Internal reference: equal representation of basal, Her2 and Luminal A/B subtypes Tumor-specific databases based on whole exome seq and RNA seq
Sample processing: The basics
3 samples are included in each iTRAQ run. Each
run also includes a Common Reference
sample.
37 iTRAQ Runs 105 samples
Samples are fractionated for increased depth of coverage
[[enrichment]]
Phospho-proteome
Proteome
Spectrum Mill DATA output: Protein/peptide
log2(ratio to common reference)
Data Analysis Pipeline
Data Analysis Pipeline Overview
Setup
Data ExtracSon
NormalizaSon & QC
mRNA CorrelaSon
Clustering
AssociaSon Analysis
MutaSon Analysis
LINCS & Alternate Analyses
CNA Analysis
Profile Plots
MS Data iTRAQ/TMT
Normalized, Filtered & QC-‐pass’d
data
mRNA
CNV
MutaSon
Proteomics & Biomarker Discovery
Setup initiates automatic pipeline execution
§ Unix shell script § Create directories § Copy input data files § Assemble required code and
additional data files • Code & data are versioned
§ Execute all core analysis components
§ Use Grid Engine for parallelization at multiple levels • Account for data dependencies
10
Setup
Extract
Normalize
mRNA
Cluster Assoc Mut
Alternate Analyses
CNA
Plots
Data
Norm Data
RNA
CNA
Mut
§ Tools: • bash • symbolic links • subversion (svn) • UGER (Univa grid engine for research)
Proteomics & Biomarker Discovery
Quality Control: Profile plots identify bimodal samples
§ Bimodal samples are identified using (mixture) model-based clustering
0.0
0.1
0.2
0.3
0.4
−20 −10 0 10ratio
dens
ity
Sample type: unimodal
0.0
0.1
0.2
−30 −20 −10 0 10ratio
dens
ity
Sample type: bimodal
§ Tools: • Mclust (R)
77 samples 28 samples
§ Bimodality is most likely due to poor sample quality
Proteomics & Biomarker Discovery
Defining bimodal samples: Challenges
§ Identify a metric that can separate bimodal/tailing samples from unimodal samples • Bimodality coefficient (too conservative—too many bimodals) • Dip statistic (too stringent—very few bimodal samples) • Measures of dispersion
– IQR – Standard deviation (balanced metric)
§ Classify new samples as unimodal/bimodal • Train classifier using single-shot (label-free) MS data
– Use unimodal/bimodal designation as class vector
• Apply to new samples as a QC check
Proteomics & Biomarker Discovery
Normalization
§ Each sample contains regulated and unregulated proteins. • Unregulated: log2 (ratio) ~ 0 • Regulated: Extreme (+/-)
ratios
§ Normalize samples using only unregulated proteins.
§ Unified method for both unimodal and bimodal samples
Setup
Extract
Normalize
mRNA
Cluster Assoc Mut
Alternate Analyses
CNA
Plots
Data
Norm Data
RNA
CNA
Mut
Proteomics & Biomarker Discovery
Normalization Algorithm Using 2-component Gaussian mixture model
§ Unimodal samples: • Find the mode M using kernel
density estimation (Gaussian kernel with Shafer-Jones bandwidth)
• Fit mixture model with mean for both components constrained to be equal to M
• Normalize (standardize) samples using mean M and smaller std. dev. from mixture model fit
−4 −2 0 2 4
0.00
0.05
0.10
0.15
0.20
0.25
0.30
density.default(x = x)
N = 9162 Bandwidth = 0.1759
Den
sity
Proteomics & Biomarker Discovery
Normalization Algorithm Using 2-component Gaussian mixture model
§ Bimodal samples: • Find the major mode M1 by kernel density estimation (Gaussian
kernel with Shafer-Jones bandwidth) • Fit mixture model with one component mean constrained to M1 • Normalize (standardize) samples using mean (M1) and resulting
std. dev.
§ Tools: • mixtools (R)
– normalmixEM for EM estimation of mixture parameters (µ, σ2) with constrained mean
• Mclust (R)
Proteomics & Biomarker Discovery
Normalization: Challenges
§ Mixtools estimation is not robust, and can produce one-off results • Unrealistic mean and variance estimates • Large variation is estimates when re-fit
§ Use mclust to assess parameter estimates from mixtools • Obtain approximate (unconstrained) estimate using mclust • Re-fit mixtools model multiple times to ensure repeatable
parameter estimates – Must be close to mclust estimates
Proteomics & Biomarker Discovery
Clustering for proteogenomic analysis
§ Does the proteome capture intrinsic RNA-based classes?
§ Does tumor heterogeneity invalidate genome-proteome comparisons?
§ Define intrinsic proteome and phosphoproteome clusters
§ How does phosphoproteome data cluster in pathway space? • Based on single-sample Gene
Set Enrichment Analysis (ssGSEA)
Setup
Extract
Normalize
mRNA
Cluster Assoc Mut
Alternate Analyses
CNA
Plots
Data
Norm Data
RNA
CNA
Mut
Proteomics & Biomarker Discovery
C8−A131.01TC
GA
BH−A18Q
.02TCGA
BH−A0AV.05TC
GA
A2−A0CM.07TC
GA
AR−A0U
4.09TCGA
AO−A0J6.11TC
GA
A7−A0CE.13TC
GA
D8−A142.18TC
GA
AN−A0FL.19TC
GA
AO−A12F.22TC
GA
AN−A0AL.28TC
GA
E2−A158.29TCGA
C8−A12V.30TC
GA
A2−A0D2.31TC
GA
C8−A131.32TC
GA
C8−A134.32TC
GA
AO−A0JL.35TC
GA
A2−A0YM.36TC
GA
A2−A0SX.36TCGA
AO−A12D
.01TCGA
C8−A130.02TC
GA
C8−A138.03TC
GA
C8−A12L.04TC
GA
AO−A12D
.05TCGA
C8−A12T.06TC
GA
A2−A0EQ.08TC
GA
AR−A0TX.14TC
GA
A8−A076.15TCGA
C8−A135.17TC
GA
C8−A12Z.20TC
GA
AO−A0JE.21TC
GA
A8−A09G.32TC
GA
E2−A154.03TCGA
A2−A0EX.04TCGA
AN−A04A.05TC
GA
AO−A0J9.10TC
GA
AR−A1AP.11TC
GA
BH−A0E1.12TC
GA
A2−A0YC.13TC
GA
A8−A08Z.14TCGA
AO−A126.15TC
GA
BH−A0C
1.16TCGA
AR−A1AW
.17TCGA
A2−A0EV.18TCGA
BH−A0D
G.19TC
GA
AO−A0JJ.20TC
GA
A2−A0YD.24TC
GA
AR−A0TR
.25TCGA
AO−A12E.26TC
GA
BH−A18N
.27TCGA
A2−A0T6.29TCGA
AR−A1AS.31TC
GA
A2−A0YF.33TCGA
BH−A0E9.33TC
GA
BH−A0BV.35TC
GA
AO−A12B.01TC
GA
A8−A06Z.07TCGA
BH−A18U
.08TCGA
AN−A0FK.11TC
GA
A7−A13F.12TCGA
AO−A0JC
.14TCGA
A2−A0EY.16TCGA
AR−A1AV.17TC
GA
AN−A0AM
.18TCGA
AR−A0TV.20TC
GA
AN−A0AJ.21TC
GA
A7−A0CJ.22TC
GA
A8−A079.23TCGA
A2−A0T3.24TCGA
AO−A03O
.25TCGA
A8−A06N.26TC
GA
A2−A0YG.27TC
GA
E2−A15A.29TCGA
AO−A0JM
.30TCGA
C8−A12U
.31TCGA
BH−A0D
D.33TC
GA
AR−A0TT.34TC
GA
AO−A12B.34TC
GA
A2−A0SW.35TC
GA
BH−A0C
7.36TCGA
CEP55CDC20TYMSMKI67ANLNNUF2RRM2UBE2CBIRC5CENPFKIF2CNDC80UBE2TCCNB1MMP11CDH3MELKACTR3BEGFRCXXC5BCL2SLC39A6FOXA1MLPHFGFR4ERBB2KRT5KRT14SFRP1PHGDHFOXC1PGRESR1MAPTNAT1 PAM−50 (TCGA)
BasalHer2LumALumB
RNA (50 genes)BasalHer2LumALumB
RNA (35 genes observed in proteome)BasalHer2LumALumB
Proteome (35 observed PAM−50 proteins)BasalHer2LumALumB
−3
−2
−1
0
1
2
3
RNA-based PAM-50 clusters are captured in the proteome
§ Tools: FANNY clustering (Kaufman & Rousseeuw, 1990) • cluster (R)
Proteomics & Biomarker Discovery
FANNY
Proteomics & Biomarker Discovery
Proteome and RNA samples co-cluster in the space of correlated genes
§ Dataset: Combined RNA + proteome for 77 samples. • 4,291 proteins/genes with
moderate to high correlation (R > 0.4)
§ Spearman correlation to measure sample similarity
§ AGNES hierarchical clustering § “Fanplot” to show co-clustering
• 62/77 samples co-cluster
AO−A12D.P
C8−A131.P
AO−A12B.P
BH−A18
Q.P
C8−A130.P
C8−A13
8.P
E2−A154.P
C8−A12L.P
A2−A0EX.P
AO−A12D.1.P
AN−A04A.P
BH−A0AV.P
C8−A12T.P
A8−A06Z.P
A2−A0C
M.P
BH−A18U
.P
A2−A0EQ.P
AR−A0U4.P
AO−A0J9.P
AR−A1A
P.P
AN−A0FK.P
AO−A0J6.P
A7−A13F.P
BH−A0E1.P
A7−A0CE.P
A2−A0YC.P
AO−A0JC.P
A8−A08Z.P
AR−A0TX.P
A8−A07
6.P
AO−A126.P
BH−A0C1.P
A2−A0EY.P
AR−A1AW.P
AR−A1AV.P
C8−A13
5.P
A2−A0EV.P
AN−A0AM.P
D8−A142.P
AN−A0FL.P
BH−A0DG.P
AR−A0TV.P
C8−A12Z.P
AO−A0JJ.P
AO−A0JE.P
AN−A0AJ.P
A7−A0CJ.P
AO−A12F.P
A8−A079.P
A2−A0T3.P
A2−A0YD.P
AR−A0TR.P
AO−A03O.P
AO−A12E.P
A8−A06N.P
A2−A0YG.P
BH−A18N.P
AN−A0AL.P
A2−A0T6.P
E2−A158.P
E2−A15
A.P
AO−A0JM
.P
C8−A12V.P
A2−A0D2.P
C8−A12U.P
AR−A1AS.P
A8−A09G.P
C8−A131.1.P
C8−A134.P
A2−A0YF.P
BH−A0DD.P
BH−A0E9.P
AR−A0TT.P
AO−A12B.1.P
A2−A0SW.P
AO−A0JL.P
BH−A0BV.P
A2−A0YM.P
BH−A0C
7.P
A2−A0SX.P
AO−A12D.R
C8−A131.R
AO−A12B.R
BH−A1
8Q.R
C8−A130.R
C8−A138.R
E2−A154.R
C8−A12L.R
A2−A0EX.R
AO−A12D.1.R
AN−A04A.R
BH−A0AV
.R
C8−A12T.R
A8−A06Z.R
A2−A0CM.R
BH−A18U.R
A2−A0EQ.R
AR−A0U4.R
AO−A0J9.R
AR−A1A
P.R
AN−A0FK.R
AO−A0J6.R
A7−A13F.R
BH−A0E1.R
A7−A0CE.R
A2−A0YC.R
AO−A0JC.R
A8−A08Z.R
AR−A0TX.R
A8−A076
.RAO−A126.R
BH−A0C1.R
A2−A0EY.R
AR−A1AW.R
AR−A1AV.R
C8−A13
5.R
A2−A0EV.R
AN−A0AM.R
D8−A142.R
AN−A0FL.R
BH−A0DG.R
AR−A0TV.R
C8−A12Z.R
AO−A0JJ.R
AO−A0JE.R
AN−A0AJ.R
A7−A0CJ.R
AO−A12F.R
A8−A079.R
A2−A0T3.R
A2−A0YD.R
AR−A0TR.R
AO−A03O.R
AO−A12E.R
A8−A06N.R
A2−A0YG.R
BH−A18N.R
AN−A0AL.R
A2−A0T6.R
E2−A158.R
E2−A15
A.R
AO−A0JM
.R
C8−A12V.R
A2−A0D2.R
C8−A12U.R
AR−A1AS.R
A8−A09G.R
C8−A131.1.R
C8−A134.R
A2−A0YF.R
BH−A0DD.R
BH−A0E9.R
AR−A0TT.R
AO−A12B.1.R
A2−A0SW.R
AO−A0JL.R
BH−A0BV.R
A2−A0YM.R
BH−A0C
7.R
A2−A0SX.R
AO−A
12D
.PAO
−A12
D.1
.PAO
−A12
D.R
AO−A
12D
.1.R
C8−
A131
.PC
8−A1
31.1
.PC
8−A1
31.R
C8−
A131
.1.R
D8−
A142
.PD
8−A1
42.R
BH−A
0AV.
RC
8−A1
35.P
C8−
A135
.RBH
−A18
Q.P
BH−A
18Q
.RA2−A
0CM
.PA2−A
0CM
.RA2−A
0YM
.RA2−A
0SX.
RA2−A
0D2.
PA2−A
0D2.
RC
8−A1
34.P
C8−
A134
.RAR
−A0U
4.P
AR−A
0U4.
RC
8−A1
2V.P
C8−
A12V
.RBH
−A0C
1.P
AO−A
0J6.
RC
8−A1
30.P
C8−
A130
.RA7−A
0CE.
PA7−A
0CE.
RAO
−A12
F.P
AO−A
12F.
RE2−A
158.
PE2−A
158.
RAO
−A0J
L.P
AO−A
0JL.
RC
8−A1
2L.P
C8−
A12L
.RAN
−A0A
L.P
AN−A
0AL.
RAN
−A0F
L.P
AN−A
0FL.
RA2−A
0EQ
.PA2−A
0EQ
.RAR
−A1A
W.P
AR−A
1AW
.RAO
−A0J
C.R
AR−A
0TT.
PAR
−A0T
T.R
AN−A
0AM
.PAN
−A0A
M.R
AO−A
0JE.
PAO
−A0J
E.R
AR−A
0TX.
RA2−A
0SW
.PA2−A
0SW
.RC
8−A1
38.R
AO−A
0J9.
RAO
−A12
B.P
AO−A
12B.
1.P
AO−A
12B.
RAO
−A12
B.1.
RBH
−A0E
1.R
AN−A
0FK.
PAN
−A0F
K.R
E2−A
154.
PE2−A
154.
RAR
−A1A
S.P
AR−A
1AS.
RA8−A
06Z.
PA8−A
06Z.
RAR
−A0T
R.P
AR−A
0TR
.RA8−A
06N
.PA8−A
06N
.R AO−A
0JC
.PBH
−A18
N.P
BH−A
18N
.RA2−A
0YC
.PA2−A
0YC
.RAR
−A1A
V.R
BH−A
0BV.
RA2−A
0EV.
RAO
−A12
E.P
AO−A
12E.
RA2−A
0YF.
PA2−A
0YF.
RAR
−A1A
P.P
AR−A
1AP.
RE2−A
15A.
PE2−A
15A.
RC
8−A1
38.P
A8−A
076.
PA8−A
076.
RAO
−A0J
M.P
AO−A
0JM
.RBH
−A0D
G.P
BH−A
0DG
.RA2−A
0EY.
PA2−A
0EY.
RC
8−A1
2Z.P
C8−
A12Z
.RA2−A
0YG
.PA2−A
0YG
.RA8−A
09G
.PA8−A
09G
.RAN
−A04
A.P
AN−A
04A.
RA2−A
0T3.
PA2−A
0T3.
RA7−A
13F.
PA7−A
13F.
RA8−A
079.
PA8−A
079.
RAR
−A1A
V.P
BH−A
0C7.
PBH
−A0C
7.R
C8−
A12T
.PC
8−A1
2T.R
BH−A
18U
.PBH
−A18
U.R
AR−A
0TV.
PAR
−A0T
V.R
AN−A
0AJ.
PAN
−A0A
J.R
C8−
A12U
.PC
8−A1
2U.R
A7−A
0CJ.
PA7−A
0CJ.
RAO
−A03
O.P
AO−A
03O
.RBH
−A0D
D.P
BH−A
0DD
.RA2−A
0EX.
PA2−A
0T6.
PA8−A
08Z.
PAO
−A0J
9.P
BH−A
0BV.
PAO
−A12
6.P
AO−A
126.
RA2−A
0EX.
RBH
−A0A
V.P
A2−A
0YM
.PA2−A
0SX.
PAO
−A0J
6.P
BH−A
0E9.
PBH
−A0E
1.P
AR−A
0TX.
PA2−A
0EV.
PA2−A
0YD
.PAO
−A0J
J.P
AO−A
0JJ.
RA8−A
08Z.
RA2−A
0T6.
RA2−A
0YD
.RBH
−A0C
1.R
BH−A
0E9.
R
0.0
0.5
1.0
1.5
Co−clutering RNA and Protein
RNA−protein correlation > 0.4Samples
Hei
ght
Proteomics & Biomarker Discovery
AGNES
Proteomics & Biomarker Discovery
(Intrinsic) Proteome Clusters
§ 1,521 proteins with • No missing values • Standard deviation > 1.5
§ Consensus k-means clustering • 1000 bootstrap samples • k=3,4,5,6
§ Assess cluster coherence • Visualization of consensus
matrix • Consensus CDF/Delta-area plot • Silhouette distance
consensus matrix k=3
123
consensus matrix k=4
1234
consensus matrix k=5
12345
consensus matrix k=6
123456
Silhouette width si
0.0 0.2 0.4 0.6 0.8 1.0
Silhoutte plot for k=3
Average silhouette width : 0.09
n = 83 3 clusters Cjj : nj | avei∈Cj si
1 : 21 | 0.08
2 : 32 | 0.08
3 : 30 | 0.12
Silhouette width si
0.0 0.2 0.4 0.6 0.8 1.0
Silhoutte plot for k=4
Average silhouette width : 0.07
n = 83 4 clusters Cjj : nj | avei∈Cj si
1 : 21 | 0.08
2 : 30 | 0.07
3 : 17 | 0.008
4 : 15 | 0.16
●
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
consensus CDF
consensus index
CD
F
23456
●
●
●
●
●
2 3 4 5 6
0.1
0.2
0.3
0.4
Delta area
k
rela
tive
chan
ge in
are
a un
der C
DF
curv
e
Extended'Data'Figure'6'
a"
b"
c"
Proteomics & Biomarker Discovery
Assessing cluster coherence
§ Silhouette distance § Consensus CDF § Delta-area plot
§ Tools: • cluster (R) • consensusClusterPlus (R)
Proteomics & Biomarker Discovery
Proteome and Phosphoproteome Clusters
AO−A12D
.01TCGA
C8−A131.01TC
GA
BH−A18Q
.02TCGA
C8−A12L.04TC
GA
AO−A12D
.05TCGA
A2−A0CM.07TC
GA
A2−A0EQ.08TC
GA
AR−A0U
4.09TCGA
A7−A0CE.13TC
GA
BH−A0C
1.16TCGA
AN−A0FL.19TC
GA
AO−A12F.22TC
GA
AN−A0AL.28TC
GA
E2−A158.29TCGA
C8−A12V.30TC
GA
A2−A0D2.31TC
GA
C8−A12U
.31TCGA
A8−A09G.32TC
GA
C8−A131.32TC
GA
C8−A134.32TC
GA
AR−A0TT.34TC
GA
AO−A12B.01TC
GA
C8−A130.02TC
GA
C8−A138.03TC
GA
AN−A04A.05TC
GA
C8−A12T.06TC
GA
A8−A06Z.07TCGA
AR−A1AP.11TC
GA
AN−A0FK.11TC
GA
A7−A13F.12TCGA
A2−A0YC.13TC
GA
AO−A0JC
.14TCGA
A8−A08Z.14TCGA
A8−A076.15TCGA
AO−A126.15TC
GA
AR−A1AW
.17TCGA
AR−A1AV.17TC
GA
AN−A0AM
.18TCGA
BH−A0D
G.19TC
GA
AR−A0TV.20TC
GA
AN−A0AJ.21TC
GA
A7−A0CJ.22TC
GA
A8−A079.23TCGA
AR−A0TR
.25TCGA
AO−A03O
.25TCGA
AO−A12E.26TC
GA
A8−A06N.26TC
GA
BH−A18N
.27TCGA
E2−A15A.29TCGA
AO−A0JM
.30TCGA
AR−A1AS.31TC
GA
BH−A0D
D.33TC
GA
AO−A12B.34TC
GA
E2−A154.03TCGA
A2−A0EX.04TCGA
BH−A0AV.05TC
GA
BH−A18U
.08TCGA
AO−A0J9.10TC
GA
AO−A0J6.11TC
GA
BH−A0E1.12TC
GA
AR−A0TX.14TC
GA
A2−A0EY.16TCGA
C8−A135.17TC
GA
A2−A0EV.18TCGA
D8−A142.18TC
GA
C8−A12Z.20TC
GA
AO−A0JJ.20TC
GA
AO−A0JE.21TC
GA
A2−A0T3.24TCGA
A2−A0YD.24TC
GA
A2−A0YG.27TC
GA
A2−A0T6.29TCGA
A2−A0YF.33TCGA
BH−A0E9.33TC
GA
A2−A0SW.35TC
GA
AO−A0JL.35TC
GA
BH−A0BV.35TC
GA
A2−A0YM.36TC
GA
BH−A0C
7.36TCGA
A2−A0SX.36TCGA
263d3f−I.CPTAC
blcdb9−I.CPTAC
c4155b−C.CPTAC
Proteome123
Phospohoproteome123
RPPABasalHer2LumALumA/BReacIReacIIX
PAM50BasalHer2LumALumBNormal
−2
−1
0
1
2
Basal-enriched Luminal-enriched Stroma-enriched
Proteomics & Biomarker Discovery
Cell cycle, DNA-damage and immuno-regulatory gene sets are enriched in Basal-like tumors
Pathway enrichment analysis
Proteomics & Biomarker Discovery
Phospho-pathway clustering
§ Dataset: 5,914 phosphoproteins • Filtered Phosphoproteome data
– Phosphosites with <81 missing values – Standard deviation > 0.5 across all samples
• Phosphosite rolled-up to proteins using median ratio • Map phosphoproteins to genes
§ Map samples to MSigDB pathways using ssGSEA • 908 curated pathways
§ Consensus k-means clustering in pathway space § Assess cluster coherence
Proteomics & Biomarker Discovery
AO.A12D
.01TCGA
AO.A12D
.05TCGA
AO.A0J9.10TC
GA
BH.A0E1.12TC
GA
A2.A0YC.13TC
GA
AO.A0JC
.14TCGA
A8.A076.15TCGA
A2.A0EY.16TCGA
AR.A1AW
.17TCGA
AN.A0AM
.18TCGA
BH.A0D
G.19TC
GA
AO.A0JJ.20TC
GA
A8.A06N.26TC
GA
C8.A12V.30TC
GA
A8.A09G.32TC
GA
C8.A131.32TC
GA
A2.A0YF.33TCGA
AR.A0TT.34TC
GA
A2.A0SW.35TC
GA
AO.A0JL.35TC
GA
BH.A0BV.35TC
GA
A2.A0YM.36TC
GA
A2.A0SX.36TCGA
c4155b.C.CPTAC
C8.A131.01TC
GA
BH.A18Q
.02TCGA
E2.A154.03TCGA
A2.A0EX.04TCGA
BH.A0AV.05TC
GA
AR.A1AP.11TC
GA
AO.A0J6.11TC
GA
AR.A0TX.14TC
GA
C8.A135.17TC
GA
A2.A0EV.18TCGA
D8.A142.18TC
GA
C8.A12Z.20TC
GA
AO.A0JE.21TC
GA
A2.A0YD.24TC
GA
BH.A18N
.27TCGA
A2.A0T6.29TCGA
AR.A1AS.31TC
GA
BH.A0E9.33TC
GA
AO.A12B.01TC
GA
C8.A130.02TC
GA
C8.A138.03TC
GA
AN.A04A.05TC
GA
A8.A06Z.07TCGA
AN.A0FK.11TC
GA
A7.A13F.12TCGA
AO.A126.15TC
GA
AR.A1AV.17TC
GA
AR.A0TV.20TC
GA
AR.A0TR
.25TCGA
AO.A03O
.25TCGA
AO.A12E.26TC
GA
AO.A12B.34TC
GA
BH.A0C
7.36TCGA
X263d3f.I.CPTAC
blcdb9.I.CPTAC
C8.A12L.04TC
GA
C8.A12T.06TC
GA
A2.A0CM.07TC
GA
BH.A18U
.08TCGA
A2.A0EQ.08TC
GA
AR.A0U
4.09TCGA
A7.A0CE.13TC
GA
A8.A08Z.14TCGA
BH.A0C
1.16TCGA
AN.A0FL.19TC
GA
AN.A0AJ.21TC
GA
A7.A0CJ.22TC
GA
AO.A12F.22TC
GA
A8.A079.23TCGA
A2.A0T3.24TCGA
A2.A0YG.27TC
GA
AN.A0AL.28TC
GA
E2.A158.29TCGA
E2.A15A.29TCGA
AO.A0JM
.30TCGA
A2.A0D2.31TC
GA
C8.A12U
.31TCGA
C8.A134.32TC
GA
BH.A0D
D.33TC
GA
KEGG_DNA_REPLICATION
REACTOME_MITOTIC_PROMETAPHASE
PID_FOXM1_PATHWAY
REACTOME_REGULATION_OF_MITOTIC_CELL_CYCLE
REACTOME_METABOLISM_OF_RNA
REACTOME_DNA_STRAND_ELONGATION
REACTOME_APC_C_CDC20_MEDIATED_DEGRADATION_OF_MITOTIC_PROTEINS
REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX
PSP:KinaseSubstrate:CHEK1
REACTOME_SYNTHESIS_OF_DNA
PSP:KinaseSubstrate:PLK1
PID_PLK1_PATHWAY
REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS
REACTOME_G0_AND_EARLY_G1
REACTOME_CHROMOSOME_MAINTENANCE
REACTOME_S_PHASE
PSP:KinaseSubstrate:CHEK2
REACTOME_G1_S_TRANSITION
PSP:KinaseSubstrate:CDK1
KEGG_CELL_CYCLE
REACTOME_CYCLIN_A_B1_ASSOCIATED_EVENTS_DURING_G2_M_TRANSITION
REACTOME_CELL_CYCLE_CHECKPOINTS
REACTOME_DNA_REPLICATION
REACTOME_G2_M_CHECKPOINTS
REACTOME_MITOTIC_M_M_G1_PHASES
PSP:KinaseSubstrate:CDK2
REACTOME_MITOTIC_G1_G1_S_PHASES
REACTOME_DEPOSITION_OF_NEW_CENPA_CONTAINING_NUCLEOSOMES_AT_THE_CENTROMERE
REACTOME_CELL_CYCLE_MITOTIC
REACTOME_CELL_CYCLE
REACTOME_SIGNALING_BY_TGF_BETA_RECEPTOR_COMPLEX
PID_CMYB_PATHWAY
BIOCARTA_MTA3_PATHWAY
KEGG_RNA_DEGRADATION
REACTOME_RORA_ACTIVATES_CIRCADIAN_EXPRESSION
KEGG_DRUG_METABOLISM_OTHER_ENZYMES
REACTOME_CIRCADIAN_REPRESSION_OF_EXPRESSION_BY_REV_ERBA
KEGG_SPLICEOSOME
REACTOME_ASPARAGINE_N_LINKED_GLYCOSYLATION
REACTOME_DOUBLE_STRAND_BREAK_REPAIR
REACTOME_GENERIC_TRANSCRIPTION_PATHWAY
REACTOME_NOTCH1_INTRACELLULAR_DOMAIN_REGULATES_TRANSCRIPTION
REACTOME_MITOCHONDRIAL_PROTEIN_IMPORT
REACTOME_TRAF6_MEDIATED_IRF7_ACTIVATION
REACTOME_TRANSCRIPTION
REACTOME_RNA_POL_II_TRANSCRIPTION
PID_AR_PATHWAY
PID_CIRCADIAN_PATHWAY
REACTOME_TRAF3_DEPENDENT_IRF_ACTIVATION_PATHWAY
REACTOME_TRANSPORT_TO_THE_GOLGI_AND_SUBSEQUENT_MODIFICATION
PID_HES_HEY_PATHWAY
BIOCARTA_WNT_PATHWAY
KEGG_CYSTEINE_AND_METHIONINE_METABOLISM
REACTOME_MRNA_SPLICING
BIOCARTA_CARM_ER_PATHWAY
REACTOME_NUCLEAR_RECEPTOR_TRANSCRIPTION_PATHWAY
REACTOME_CIRCADIAN_CLOCK
BIOCARTA_RARRXR_PATHWAY
KEGG_LYSINE_DEGRADATION
REACTOME_ELONGATION_ARREST_AND_RECOVERY
REACTOME_NUCLEAR_EVENTS_KINASE_AND_TRANSCRIPTION_FACTOR_ACTIVATION
REACTOME_MAPK_TARGETS_NUCLEAR_EVENTS_MEDIATED_BY_MAP_KINASES
PID_FAK_PATHWAY
PID_IL2_1PATHWAY
REACTOME_SIGNAL_TRANSDUCTION_BY_L1
PID_NETRIN_PATHWAY
REACTOME_MAP_KINASE_ACTIVATION_IN_TLR_CASCADE
PID_ENDOTHELIN_PATHWAY
PID_VEGFR1_PATHWAY
BIOCARTA_AGR_PATHWAY
BIOCARTA_MAL_PATHWAY
REACTOME_NGF_SIGNALLING_VIA_TRKA_FROM_THE_PLASMA_MEMBRANE
KEGG_ERBB_SIGNALING_PATHWAY
BIOCARTA_INTEGRIN_PATHWAY
PID_LYMPH_ANGIOGENESIS_PATHWAY
REACTOME_AXON_GUIDANCE
KEGG_VEGF_SIGNALING_PATHWAY
PID_ERBB2_ERBB3_PATHWAY
BIOCARTA_PYK2_PATHWAY
BIOCARTA_AT1R_PATHWAY
KEGG_FOCAL_ADHESION
BIOCARTA_BIOPEPTIDES_PATHWAY
PID_VEGFR1_2_PATHWAY
SIG_INSULIN_RECEPTOR_PATHWAY_IN_CARDIAC_MYOCYTES
REACTOME_SIGNALLING_BY_NGF
KEGG_MAPK_SIGNALING_PATHWAY
ST_INTEGRIN_SIGNALING_PATHWAY
PID_PDGFRB_PATHWAY
PID_ERBB1_DOWNSTREAM_PATHWAY
KEGG_NEUROTROPHIN_SIGNALING_PATHWAY
KEGG_CALCIUM_SIGNALING_PATHWAY
REACTOME_PLATELET_HOMEOSTASIS
SIG_BCR_SIGNALING_PATHWAY
REACTOME_GPCR_LIGAND_BINDING
REACTOME_GASTRIN_CREB_SIGNALLING_PATHWAY_VIA_PKC_AND_MAPK
REACTOME_REGULATION_OF_INSULIN_SECRETION_BY_GLUCAGON_LIKE_PEPTIDE1
REACTOME_HEMOSTASIS
REACTOME_G_ALPHA_I_SIGNALLING_EVENTS
REACTOME_SIGNALING_BY_GPCR
REACTOME_GPVI_MEDIATED_ACTIVATION_CASCADE
REACTOME_TRANSMISSION_ACROSS_CHEMICAL_SYNAPSES
REACTOME_OPIOID_SIGNALLING
PID_IL8_CXCR2_PATHWAY
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM
REACTOME_PLATELET_ACTIVATION_SIGNALING_AND_AGGREGATION
ST_MYOCYTE_AD_PATHWAY
REACTOME_PLC_BETA_MEDIATED_EVENTS
KEGG_INOSITOL_PHOSPHATE_METABOLISM
PID_AMB2_NEUTROPHILS_PATHWAY
PID_IL8_CXCR1_PATHWAY
REACTOME_G_ALPHA_Z_SIGNALLING_EVENTS
REACTOME_NEUROTRANSMITTER_RECEPTOR_BINDING_AND_DOWNSTREAM_TRANSMISSION_IN_THE_POSTSYNAPTIC_CELL
REACTOME_GPCR_DOWNSTREAM_SIGNALING
REACTOME_NEURONAL_SYSTEM
KEGG_PHOSPHATIDYLINOSITOL_SIGNALING_SYSTEM
SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES
REACTOME_REGULATION_OF_INSULIN_SECRETION
REACTOME_G_PROTEIN_BETA_GAMMA_SIGNALLING
REACTOME_ACTIVATION_OF_KAINATE_RECEPTORS_UPON_GLUTAMATE_BINDING
REACTOME_G_ALPHA_Q_SIGNALLING_EVENTS PhosphoPathway1234
Proteome123
RPPABasalHer2LumALumA/BReacIReacIIX
PAM50BasalHer2LumALumBNormal
−2
−1
0
1
2
Phospho-pathway clustering identifies unique clusters
Supplementary Figure 4
a b
Phospho: ssGSEA phospohoproteomecluster
1234
Proteome: Global proteome cluster
123
PAM50: PAM 50 statusBasalHer2LumALumBNormal
−3
−2
−1
0
1
2
3
PhosphoProteomePAM50
Phospho
Cell cycle
Checkpoint
G protein
G protein-coupled receptor
Inositol phosphatemetabolism
ErbB / HERsignaling
VEGF signaling
Integrin signaling
Estrogen receptorsignaling
TranscriptionRegulation
Proteomics & Biomarker Discovery
Association analysis via marker selection and GSEA
Setup
Extract
Normalize
mRNA
Cluster Assoc Mut
Alternate Analyses
CNA
Plots
Data
Norm Data
RNA
CNA
Mut
Proteomics & Biomarker Discovery
Association Analysis and Marker Selection
§ Collection of algorithms for • Identification of statistically significant differential markers
– Multiclass – One-vs-all
• Training of multiple classifiers – Partial Least Squares, Shrunken Centroids, Random Forests, Elastic Nets – Other algorithms can be easily added
• Variable importance from classifiers for further prioritization of differential markers
– Marker rank aggregation for final marker ranking
• Class prediction for unknown/new samples • Visualization (heatmaps) • GSEA for pathway enrichment • EnrichmentMap for visualizing enriched pathways
Proteomics & Biomarker Discovery
Rank Aggregation for Marker Ranking
§ Perform Marker selection: • Identify statistically significant differential markers (SAM)
– Multiclass – One-vs-all
• Train multiple classifiers – Partial Least Squares, Shrunken Centroids, Random Forests, Elastic
Nets – Other algorithms can be easily added
• Rank markers using variable importance from classifiers
§ Combine multiple rankings to a final rank • Robust rank aggregation (R. Kolde et. al., Bioinformatics, 2012)
– Calculate final rank based on order statistics – Accommodates significant proportion of “noise” markers and
occasional “low” ranks
Proteomics & Biomarker Discovery
Linking copy number alteration and protein expression using LINCS (Library of Integrated Cellular Signatures)
Setup
Extract
Normalize
mRNA
Cluster Assoc Mut
Alternate Analyses
CNA
Plots
Data
Norm Data
RNA
CNA
Mut
Proteomics & Biomarker Discovery
Approach
§ Compare proteome profiles filtered by CNA TRANS correlations with LINCS functional knock-down data. • Genes with LINCS-enriched CIS effects are considered
candidate driver genes • FDR for candidate driver genes is estimated using a
permutation test
Proteomics & Biomarker Discovery
CNA-protein correlations show CIS and many TRANS effects
§ Correlate copy number (CN) data with proteome for all 60 million gene-protein pairs
§ Plot statistically significant correlations (FDR < 0.05)
Pro
teom
e Copy Number
Chromosome
§ Histogram shows percent of significant correlations at a CN locus
§ Highlights “hot-spots” of TRANS-activity
§ positive correlation § negative correlation
Chr
omos
ome
Fr
eque
ncy,
%
TRANS effect “hot spots” at chromosomes 5q, 10p, 12, 16q, 17q, and 22q
Proteomics & Biomarker Discovery
Can proteome profiles identify candidate genes driving response in copy number altered regions?
§ A small number of key genes drive observed TRANS-effects
§ To identify candidate genes: Correlate proteome profiles of CN altered samples with gene knock-down mRNA profiles
§ CN amplification negatively correlated to knock down profile and/or CN deletion positively correlated to knock down profile
è candidate causal gene
Proteomics & Biomarker Discovery
Leveraging large scale perturbation datasets to identify candidate causal genes in CNA regions
§ LINCS Functional knock-down profiles on ~3,800 genes: • Multiple hairpins per gene knock-down • 1000 landmark genes measured on Luminex assay • Complete profile (~22,000 genes) calculated by inference • Includes ~20,000 drug perturbagens. Total ~476,000 mRNA profiles
Library of Integrated Cellular Signatures (LINCS) aka The Connectivity Map (CMAP)
http://www.lincscloud.org
Proteomics & Biomarker Discovery
Use LINCS to identify key genes driving response to copy number alterations: STEP 1
§ Identify samples with deletion [log(CN) < −0.3], neutral and amplification [log(CN)>0.3] CNA for a given gene
§ Extract protein expression for genes with significant TRANS-effects (FDR < 0.05).
36
CNA_neutralCNA_delCNA_amp
CNA_status
relative
row min row max
AO.A
12D
.01T
CG
AC
8.A1
38.0
3TC
GA
C8.
A12L
.04T
CG
AAO
.A12
D.0
5TC
GA
A2.A
0EQ
.08T
CG
AA8
.A07
6.15
TCG
AA2
.A0E
Y.16
TCG
AAO
.A0J
E.21
TCG
AA2
.A0Y
G.2
7TC
GA
BH.A
0DD
.33T
CG
ABH
.A0C
7.36
TCG
AC
8.A1
30.0
2TC
GA
AN.A
04A.
05TC
GA
A2.A
0CM
.07T
CG
AAR
.A1A
P.11
TCG
ABH
.A0E
1.12
TCG
AA8
.A08
Z.14
TCG
AAN
.A0A
M.1
8TC
GA
AO.A
12F.
22TC
GA
AO.A
03O
.25T
CG
AE2
.A15
8.29
TCG
AC
8.A1
2U.3
1TC
GA
A2.A
0YF.
33TC
GA
A2.A
0SW
.35T
CG
AC
8.A1
31.0
1TC
GA
AO.A
12B.
01TC
GA
A2.A
0EX.
04TC
GA
BH.A
0AV.
05TC
GA
A8.A
06Z.
07TC
GA
AR.A
0U4.
09TC
GA
AO.A
0J9.
10TC
GA
AN.A
0FK.
11TC
GA
AO.A
0J6.
11TC
GA
A7.A
13F.
12TC
GA
A7.A
0CE.
13TC
GA
A2.A
0YC
.13T
CG
AAO
.A0J
C.1
4TC
GA
AO.A
126.
15TC
GA
AR.A
1AW
.17T
CG
AAR
.A1A
V.17
TCG
AAN
.A0F
L.19
TCG
ABH
.A0D
G.1
9TC
GA
AR.A
0TV.
20TC
GA
AO.A
0JJ.
20TC
GA
A7.A
0CJ.
22TC
GA
A8.A
079.
23TC
GA
A2.A
0T3.
24TC
GA
AR.A
0TR
.25T
CG
AAO
.A12
E.26
TCG
AA8
.A06
N.2
6TC
GA
BH.A
18N
.27T
CG
AAN
.A0A
L.28
TCG
AE2
.A15
A.29
TCG
AC
8.A1
2V.3
0TC
GA
A2.A
0D2.
31TC
GA
AR.A
1AS.
31TC
GA
C8.
A131
.32T
CG
AC
8.A1
34.3
2TC
GA
BH.A
0E9.
33TC
GA
AR.A
0TT.
34TC
GA
AO.A
12B.
34TC
GA
AO.A
0JL.
35TC
GA
BH.A
0BV.
35TC
GA
A2.A
0YM
.36T
CG
AA2
.A0S
X.36
TCG
A
CNA_status
Proteomics & Biomarker Discovery
Use LINCS to identify key genes driving response to copy number alterations: STEP 2
§ Summarize expression in CNA_DEL, CNA_AMP and CNA_NEUTRAL groups • Median expression for each trans-gene
37
CNA_neutralCNA_delCNA_amp
CNA_status
relative
row min row max
AO.A
12D
.01T
CG
AC
8.A1
38.0
3TC
GA
C8.
A12L
.04T
CG
AAO
.A12
D.0
5TC
GA
A2.A
0EQ
.08T
CG
AA8
.A07
6.15
TCG
AA2
.A0E
Y.16
TCG
AAO
.A0J
E.21
TCG
AA2
.A0Y
G.2
7TC
GA
BH.A
0DD
.33T
CG
ABH
.A0C
7.36
TCG
AC
8.A1
30.0
2TC
GA
AN.A
04A.
05TC
GA
A2.A
0CM
.07T
CG
AAR
.A1A
P.11
TCG
ABH
.A0E
1.12
TCG
AA8
.A08
Z.14
TCG
AAN
.A0A
M.1
8TC
GA
AO.A
12F.
22TC
GA
AO.A
03O
.25T
CG
AE2
.A15
8.29
TCG
AC
8.A1
2U.3
1TC
GA
A2.A
0YF.
33TC
GA
A2.A
0SW
.35T
CG
AC
8.A1
31.0
1TC
GA
AO.A
12B.
01TC
GA
A2.A
0EX.
04TC
GA
BH.A
0AV.
05TC
GA
A8.A
06Z.
07TC
GA
AR.A
0U4.
09TC
GA
AO.A
0J9.
10TC
GA
AN.A
0FK.
11TC
GA
AO.A
0J6.
11TC
GA
A7.A
13F.
12TC
GA
A7.A
0CE.
13TC
GA
A2.A
0YC
.13T
CG
AAO
.A0J
C.1
4TC
GA
AO.A
126.
15TC
GA
AR.A
1AW
.17T
CG
AAR
.A1A
V.17
TCG
AAN
.A0F
L.19
TCG
ABH
.A0D
G.1
9TC
GA
AR.A
0TV.
20TC
GA
AO.A
0JJ.
20TC
GA
A7.A
0CJ.
22TC
GA
A8.A
079.
23TC
GA
A2.A
0T3.
24TC
GA
AR.A
0TR
.25T
CG
AAO
.A12
E.26
TCG
AA8
.A06
N.2
6TC
GA
BH.A
18N
.27T
CG
AAN
.A0A
L.28
TCG
AE2
.A15
A.29
TCG
AC
8.A1
2V.3
0TC
GA
A2.A
0D2.
31TC
GA
AR.A
1AS.
31TC
GA
C8.
A131
.32T
CG
AC
8.A1
34.3
2TC
GA
BH.A
0E9.
33TC
GA
AR.A
0TT.
34TC
GA
AO.A
12B.
34TC
GA
AO.A
0JL.
35TC
GA
BH.A
0BV.
35TC
GA
A2.A
0YM
.36T
CG
AA2
.A0S
X.36
TCG
A
CNA_status
Proteomics & Biomarker Discovery
Use LINCS to identify key genes driving response to copy number alterations: STEP 3
• Determine up and down regulated genes in CNA_DEL and CNA_AMP (in comparison to CNA_NEUTRAL expression)
38
CNA_neutralCNA_delCNA_amp
CNA_status
relative
row min row max
AO.A
12D
.01T
CG
AC
8.A1
38.0
3TC
GA
C8.
A12L
.04T
CG
AAO
.A12
D.0
5TC
GA
A2.A
0EQ
.08T
CG
AA8
.A07
6.15
TCG
AA2
.A0E
Y.16
TCG
AAO
.A0J
E.21
TCG
AA2
.A0Y
G.2
7TC
GA
BH.A
0DD
.33T
CG
ABH
.A0C
7.36
TCG
AC
8.A1
30.0
2TC
GA
AN.A
04A.
05TC
GA
A2.A
0CM
.07T
CG
AAR
.A1A
P.11
TCG
ABH
.A0E
1.12
TCG
AA8
.A08
Z.14
TCG
AAN
.A0A
M.1
8TC
GA
AO.A
12F.
22TC
GA
AO.A
03O
.25T
CG
AE2
.A15
8.29
TCG
AC
8.A1
2U.3
1TC
GA
A2.A
0YF.
33TC
GA
A2.A
0SW
.35T
CG
AC
8.A1
31.0
1TC
GA
AO.A
12B.
01TC
GA
A2.A
0EX.
04TC
GA
BH.A
0AV.
05TC
GA
A8.A
06Z.
07TC
GA
AR.A
0U4.
09TC
GA
AO.A
0J9.
10TC
GA
AN.A
0FK.
11TC
GA
AO.A
0J6.
11TC
GA
A7.A
13F.
12TC
GA
A7.A
0CE.
13TC
GA
A2.A
0YC
.13T
CG
AAO
.A0J
C.1
4TC
GA
AO.A
126.
15TC
GA
AR.A
1AW
.17T
CG
AAR
.A1A
V.17
TCG
AAN
.A0F
L.19
TCG
ABH
.A0D
G.1
9TC
GA
AR.A
0TV.
20TC
GA
AO.A
0JJ.
20TC
GA
A7.A
0CJ.
22TC
GA
A8.A
079.
23TC
GA
A2.A
0T3.
24TC
GA
AR.A
0TR
.25T
CG
AAO
.A12
E.26
TCG
AA8
.A06
N.2
6TC
GA
BH.A
18N
.27T
CG
AAN
.A0A
L.28
TCG
AE2
.A15
A.29
TCG
AC
8.A1
2V.3
0TC
GA
A2.A
0D2.
31TC
GA
AR.A
1AS.
31TC
GA
C8.
A131
.32T
CG
AC
8.A1
34.3
2TC
GA
BH.A
0E9.
33TC
GA
AR.A
0TT.
34TC
GA
AO.A
12B.
34TC
GA
AO.A
0JL.
35TC
GA
BH.A
0BV.
35TC
GA
A2.A
0YM
.36T
CG
AA2
.A0S
X.36
TCG
A
CNA_status
UP UP DN DN CNA_AMP gene set
CNA_DEL gene set
Proteomics & Biomarker Discovery
Use LINCS to identify key genes driving response to copy number alterations: STEP 4
§ Query LINCS using CNA_AMP/DEL gene sets • Convert CNA_AMP/DEL gene sets to Affymetix IDs • Run LINCS enrichment test on ~240,000 “gold” consensus
signatures (CGS). • Extract “CIS-enriched” gene knock downs:
– Enriched gene knock downs include CIS gene • Correct direction of correlation (+ve for CNA_DEL, −ve for
CNA_AMP) – |mean_rankpt4| > 90
• Mean percentile in 4 cell lines > 90 • Extract and analyze z-scores for CIS-enriched genes
39
UP UP DN DN CNA_AMP gene set
CNA_DEL gene set
Proteomics & Biomarker Discovery
Calculate Permutation-based FDR
1. For each of the genes input to the LINCS enrichment test, generate a random permutation as follows: • Let gene G have Ng TRANS genes • From the list of all genes, randomly select Ng genes (without
replacement) 2. Run LINCS enrichment for this permuted dataset 3. Determine FPi, the number of “candidate driver genes” from the
random dataset. 4. Repeat Steps 1-3 R times. 5. Calculate FDR as mid point of 95% Score CI assuming Poisson
distribution with small rate (λ ≈ 0) and small R (R=6).
40
FDR = E #FP#P
!
"#
$
%&=
E(#FP)#P
=FP +1.962 / (2R)
#P where FP = 1
RFPi
i=1
R
∑
95% Score CI for E(#FP) = FP +1.962 / (2R)±1.96 4FP +1.962 / R4R
Proteomics & Biomarker Discovery
Results
41
§ Input gene sets: • ≥15 tumors with |CNA| > 0.3 • Gene must be on CMAP KD list • # TRANS genes ≥ 20 • Total genes tested: 502
§ 20 CIS-enriched candidate genes • Level I: 10 genes
– Enriched in both CNA_AMP and CNA_DEL
• Level II: 10 genes – Enriched in either CNA_AMP or
CNA_DEL
² Level I CETN3 SKP1 SNX2 FBXO7 TERF2IP WIPF2 YAP1 RB1 USP7 DHPS
² Level II FBXL20 ERBB2 USP6NL ARHGEF12 MRPL12 RAB21 EP300 CPNE3 PLCB3 UBE3C
FDR=0.049 [0.003, 0.094]
FDR=0.305 [0.225, 0.385]
Proteomics & Biomarker Discovery
SKP1 and CETN3 are new candidate causal genes for 5q CNA TRANS effects
• 20 candidate causal genes were identified • ERBB2 serves as a positive control • CNA/Protein CIS effects are more indicative for a
CMAP connection to TRANS regulated genes than CNA/RNA CIS effects
Proteomics & Biomarker Discovery
LINCS data: Knock-down of SKP1 and CETN3 increases EGFR, YES1 and DAPK3 expression
Proteomics & Biomarker Discovery
CMAP programming interface for large-scale queries
§ LINCS Cloud Compute server with command line interface • Run large-scale batch queries
using a Grid Engine
§ Programming interface
• R, Python, Matlab • For accessing and manipulating
LINCS data and results
§ Web-based API • For accessing and querying
metadata – Perturbations and
perturbagens – Signatures – Measured and inferred
genes
44
• HDF5 (hierarchical data format) for storing data • Mongo DB for metadata
Proteomics & Biomarker Discovery
Challenges and Implementation
Proteomics & Biomarker Discovery
Summary
§ Automated pipeline enables high throughput analysis of CPTAC data • Reproducible and documented process • Version controlled • Generalizable to other projects • Easy comparison of alternatives • Effective use of parallelism
§ Marker selection and classification can be used for any
analysis • Automated • Multiple ML methods
Proteomics & Biomarker Discovery
Acknowledgments
WASHINGTON U./NYU/UNC/VANDERBILT
- Sherri Davies - Matthew Ellis - Reid Townsend - Li Ding - Song Cao - Michael McLellan - Kuan-lin Huang - Venkata Yellapa - David Fenyo - Kelly Ruggles - Chuck Perou - Michael Gatza - Bing Zhang - Jing Wang
BROAD INSTITUTE - Steve Carr - Karl Clauser - Michael Gillette - Jana Qiao - Lauren Tang - Philipp Mertins - D R Mani - Karsten Krug - Eric Kuhn - Filip Mundt - Corey Flynn - Jacob Asiedu - Aravind Subramaniyan
NCI STAFF - Emily Boja - Mehdi Mesri - Rob Rivers - Chris Kinsinger - Henry Rodriguez
FUNDING: National Cancer Institute
FHCRC - Amanda Paulovich - Jeffrey Whiteaker - Pei Wang - Sean Wang - Chenwei Lin - Ping Yan - Yuzheng Zhang