new statistics and machine learning methods for proteogenomic … · 2016. 8. 3. · • data...

47
Proteomics & Biomarker Discovery Statistics and Machine Learning Methods for Proteogenomic Data Analysis D. R. Mani Broad Institute Northeastern University Short Course Computation and Statistics for Targeted Proteomics May 5, 2016

Upload: others

Post on 16-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Statistics and Machine Learning Methods for Proteogenomic Data Analysis

D. R. Mani Broad Institute

Northeastern University Short Course Computation and Statistics for Targeted Proteomics May 5, 2016

Page 2: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Finding protein targets for Targeted Proteomics

§  Proteins for targeted proteomics

§  Literature §  Pathways •  Disease/phenotype

related

§  Characteristic Subset •  E.g: P100

§  Unbiased Discovery

§  Observed/predicted proteins & peptides

ESP

Page 3: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Unbiased biomarker discovery yields targets for Targeted Proteomics

Adapted from Rifai, et. al., Nature Biotech., 2006.

Page 4: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Unbiased discovery is increasingly Proteogenomic (or Multi-omic)

§  Discovery efforts include multi-omic profiling •  Omic profiling is getting

cheaper

§  Proteomic profiles are increasingly common •  Smaller sample numbers

due to higher cost •  More input material

needed Image Source: Goodacre, J. Exp. Bot 2005.

‘OMICS’ RESEARCH

Image Source: http://fluorous.com/images/omics.JPG

Page 5: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Proteogenomic analysis of Breast Cancer provides generalizable methods

§  NIH CPTAC initiative to perform large-scale proteogenomic analysis of cancer samples •  Breast Cancer—Broad Institute •  Colon Cancer—Vanderbilt •  Ovarian Cancer—Johns Hopkins/PNNL

§  Presentation Goals:

•  Data analysis algorithms and toolkit for proteogenomics –  Applied to breast cancer analysis, but generalizable

•  Generalized applicability to wide range of data sets –  Potential use for targeted data analysis

»  Some methods applicable, others need to be modified/applied with care

Page 6: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Profiling of 105 TCGA samples produced largest proteomic dataset yet generated at Broad

6

•  105  BC  Tumor  Samples  ◘  PAM50  Subtypes:  18  Her2,  25  Basal,  29  

Luminal  A,  33  Luminal  B  

•  Analyzed  by  iTRAQ-­‐MS  ◘  Proteome  +  Phosphoproteome  

Data  Tables  

TCGA  genomic  data  mRNA,  RNA-­‐seq,  Exome,  SNP,  CNV,  MethylaSon,  miRNA  QUILTS  

Tumor  specific  custom  DB  Spectrum  Mill  

Global  proteome  &  phosphoproteome  

raw  MS  data  

MS  search  DB  

TCGA  Pipeline  

mRNA   CNV   MutaSon  

MethylaSon   miRNA  

Proteome  

Phospho-­‐proteome  

Page 7: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

1 mg total protein per tumor Internal reference: equal representation of basal, Her2 and Luminal A/B subtypes Tumor-specific databases based on whole exome seq and RNA seq

Sample processing

Page 8: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

1 mg total protein per tumor Internal reference: equal representation of basal, Her2 and Luminal A/B subtypes Tumor-specific databases based on whole exome seq and RNA seq

Sample processing: The basics

3 samples are included in each iTRAQ run. Each

run also includes a Common Reference

sample.

37 iTRAQ Runs 105 samples

Samples are fractionated for increased depth of coverage

[[enrichment]]

Phospho-proteome

Proteome

Spectrum Mill DATA output: Protein/peptide

log2(ratio to common reference)

Data Analysis Pipeline

Page 9: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Data  Analysis  Pipeline  Overview  

Setup  

Data  ExtracSon  

NormalizaSon  &  QC  

mRNA  CorrelaSon  

Clustering  

AssociaSon  Analysis  

MutaSon  Analysis  

LINCS  &  Alternate  Analyses  

CNA  Analysis  

Profile  Plots  

MS  Data  iTRAQ/TMT  

Normalized,  Filtered  &  QC-­‐pass’d    

data  

mRNA  

CNV  

MutaSon  

Page 10: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Setup initiates automatic pipeline execution

§  Unix shell script §  Create directories §  Copy input data files §  Assemble required code and

additional data files •  Code & data are versioned

§  Execute all core analysis components

§  Use Grid Engine for parallelization at multiple levels •  Account for data dependencies

10

Setup  

Extract  

Normalize  

mRNA  

Cluster  Assoc   Mut  

Alternate  Analyses  

CNA  

Plots  

Data  

Norm    Data  

RNA  

CNA  

Mut  

§  Tools: •  bash •  symbolic links •  subversion (svn) •  UGER (Univa grid engine for research)

Page 11: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Quality Control: Profile plots identify bimodal samples

§  Bimodal samples are identified using (mixture) model-based clustering

0.0

0.1

0.2

0.3

0.4

−20 −10 0 10ratio

dens

ity

Sample type: unimodal

0.0

0.1

0.2

−30 −20 −10 0 10ratio

dens

ity

Sample type: bimodal

§  Tools: •  Mclust (R)

77 samples 28 samples

§  Bimodality is most likely due to poor sample quality

Page 12: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Defining bimodal samples: Challenges

§  Identify a metric that can separate bimodal/tailing samples from unimodal samples •  Bimodality coefficient (too conservative—too many bimodals) •  Dip statistic (too stringent—very few bimodal samples) •  Measures of dispersion

–  IQR –  Standard deviation (balanced metric)

§  Classify new samples as unimodal/bimodal •  Train classifier using single-shot (label-free) MS data

–  Use unimodal/bimodal designation as class vector

•  Apply to new samples as a QC check

Page 13: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Normalization

§  Each sample contains regulated and unregulated proteins. •  Unregulated: log2 (ratio) ~ 0 •  Regulated: Extreme (+/-)

ratios

§  Normalize samples using only unregulated proteins.

§  Unified method for both unimodal and bimodal samples

Setup  

Extract  

Normalize  

mRNA  

Cluster  Assoc   Mut  

Alternate  Analyses  

CNA  

Plots  

Data  

Norm    Data  

RNA  

CNA  

Mut  

Page 14: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Normalization Algorithm Using 2-component Gaussian mixture model

§  Unimodal samples: •  Find the mode M using kernel

density estimation (Gaussian kernel with Shafer-Jones bandwidth)

•  Fit mixture model with mean for both components constrained to be equal to M

•  Normalize (standardize) samples using mean M and smaller std. dev. from mixture model fit

−4 −2 0 2 4

0.00

0.05

0.10

0.15

0.20

0.25

0.30

density.default(x = x)

N = 9162 Bandwidth = 0.1759

Den

sity

Page 15: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Normalization Algorithm Using 2-component Gaussian mixture model

§  Bimodal samples: •  Find the major mode M1 by kernel density estimation (Gaussian

kernel with Shafer-Jones bandwidth) •  Fit mixture model with one component mean constrained to M1 •  Normalize (standardize) samples using mean (M1) and resulting

std. dev.

§  Tools: •  mixtools (R)

–  normalmixEM for EM estimation of mixture parameters (µ, σ2) with constrained mean

•  Mclust (R)

Page 16: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Normalization: Challenges

§  Mixtools estimation is not robust, and can produce one-off results •  Unrealistic mean and variance estimates •  Large variation is estimates when re-fit

§  Use mclust to assess parameter estimates from mixtools •  Obtain approximate (unconstrained) estimate using mclust •  Re-fit mixtools model multiple times to ensure repeatable

parameter estimates –  Must be close to mclust estimates

Page 17: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Clustering for proteogenomic analysis

§  Does the proteome capture intrinsic RNA-based classes?

§  Does tumor heterogeneity invalidate genome-proteome comparisons?

§  Define intrinsic proteome and phosphoproteome clusters

§  How does phosphoproteome data cluster in pathway space? •  Based on single-sample Gene

Set Enrichment Analysis (ssGSEA)

Setup  

Extract  

Normalize  

mRNA  

Cluster  Assoc   Mut  

Alternate  Analyses  

CNA  

Plots  

Data  

Norm    Data  

RNA  

CNA  

Mut  

Page 18: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

C8−A131.01TC

GA

BH−A18Q

.02TCGA

BH−A0AV.05TC

GA

A2−A0CM.07TC

GA

AR−A0U

4.09TCGA

AO−A0J6.11TC

GA

A7−A0CE.13TC

GA

D8−A142.18TC

GA

AN−A0FL.19TC

GA

AO−A12F.22TC

GA

AN−A0AL.28TC

GA

E2−A158.29TCGA

C8−A12V.30TC

GA

A2−A0D2.31TC

GA

C8−A131.32TC

GA

C8−A134.32TC

GA

AO−A0JL.35TC

GA

A2−A0YM.36TC

GA

A2−A0SX.36TCGA

AO−A12D

.01TCGA

C8−A130.02TC

GA

C8−A138.03TC

GA

C8−A12L.04TC

GA

AO−A12D

.05TCGA

C8−A12T.06TC

GA

A2−A0EQ.08TC

GA

AR−A0TX.14TC

GA

A8−A076.15TCGA

C8−A135.17TC

GA

C8−A12Z.20TC

GA

AO−A0JE.21TC

GA

A8−A09G.32TC

GA

E2−A154.03TCGA

A2−A0EX.04TCGA

AN−A04A.05TC

GA

AO−A0J9.10TC

GA

AR−A1AP.11TC

GA

BH−A0E1.12TC

GA

A2−A0YC.13TC

GA

A8−A08Z.14TCGA

AO−A126.15TC

GA

BH−A0C

1.16TCGA

AR−A1AW

.17TCGA

A2−A0EV.18TCGA

BH−A0D

G.19TC

GA

AO−A0JJ.20TC

GA

A2−A0YD.24TC

GA

AR−A0TR

.25TCGA

AO−A12E.26TC

GA

BH−A18N

.27TCGA

A2−A0T6.29TCGA

AR−A1AS.31TC

GA

A2−A0YF.33TCGA

BH−A0E9.33TC

GA

BH−A0BV.35TC

GA

AO−A12B.01TC

GA

A8−A06Z.07TCGA

BH−A18U

.08TCGA

AN−A0FK.11TC

GA

A7−A13F.12TCGA

AO−A0JC

.14TCGA

A2−A0EY.16TCGA

AR−A1AV.17TC

GA

AN−A0AM

.18TCGA

AR−A0TV.20TC

GA

AN−A0AJ.21TC

GA

A7−A0CJ.22TC

GA

A8−A079.23TCGA

A2−A0T3.24TCGA

AO−A03O

.25TCGA

A8−A06N.26TC

GA

A2−A0YG.27TC

GA

E2−A15A.29TCGA

AO−A0JM

.30TCGA

C8−A12U

.31TCGA

BH−A0D

D.33TC

GA

AR−A0TT.34TC

GA

AO−A12B.34TC

GA

A2−A0SW.35TC

GA

BH−A0C

7.36TCGA

CEP55CDC20TYMSMKI67ANLNNUF2RRM2UBE2CBIRC5CENPFKIF2CNDC80UBE2TCCNB1MMP11CDH3MELKACTR3BEGFRCXXC5BCL2SLC39A6FOXA1MLPHFGFR4ERBB2KRT5KRT14SFRP1PHGDHFOXC1PGRESR1MAPTNAT1 PAM−50 (TCGA)

BasalHer2LumALumB

RNA (50 genes)BasalHer2LumALumB

RNA (35 genes observed in proteome)BasalHer2LumALumB

Proteome (35 observed PAM−50 proteins)BasalHer2LumALumB

−3

−2

−1

0

1

2

3

RNA-based PAM-50 clusters are captured in the proteome

§  Tools: FANNY clustering (Kaufman & Rousseeuw, 1990) •  cluster (R)

Page 19: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

FANNY

Page 20: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Proteome and RNA samples co-cluster in the space of correlated genes

§  Dataset: Combined RNA + proteome for 77 samples. •  4,291 proteins/genes with

moderate to high correlation (R > 0.4)

§  Spearman correlation to measure sample similarity

§  AGNES hierarchical clustering §  “Fanplot” to show co-clustering

•  62/77 samples co-cluster

AO−A12D.P

C8−A131.P

AO−A12B.P

BH−A18

Q.P

C8−A130.P

C8−A13

8.P

E2−A154.P

C8−A12L.P

A2−A0EX.P

AO−A12D.1.P

AN−A04A.P

BH−A0AV.P

C8−A12T.P

A8−A06Z.P

A2−A0C

M.P

BH−A18U

.P

A2−A0EQ.P

AR−A0U4.P

AO−A0J9.P

AR−A1A

P.P

AN−A0FK.P

AO−A0J6.P

A7−A13F.P

BH−A0E1.P

A7−A0CE.P

A2−A0YC.P

AO−A0JC.P

A8−A08Z.P

AR−A0TX.P

A8−A07

6.P

AO−A126.P

BH−A0C1.P

A2−A0EY.P

AR−A1AW.P

AR−A1AV.P

C8−A13

5.P

A2−A0EV.P

AN−A0AM.P

D8−A142.P

AN−A0FL.P

BH−A0DG.P

AR−A0TV.P

C8−A12Z.P

AO−A0JJ.P

AO−A0JE.P

AN−A0AJ.P

A7−A0CJ.P

AO−A12F.P

A8−A079.P

A2−A0T3.P

A2−A0YD.P

AR−A0TR.P

AO−A03O.P

AO−A12E.P

A8−A06N.P

A2−A0YG.P

BH−A18N.P

AN−A0AL.P

A2−A0T6.P

E2−A158.P

E2−A15

A.P

AO−A0JM

.P

C8−A12V.P

A2−A0D2.P

C8−A12U.P

AR−A1AS.P

A8−A09G.P

C8−A131.1.P

C8−A134.P

A2−A0YF.P

BH−A0DD.P

BH−A0E9.P

AR−A0TT.P

AO−A12B.1.P

A2−A0SW.P

AO−A0JL.P

BH−A0BV.P

A2−A0YM.P

BH−A0C

7.P

A2−A0SX.P

AO−A12D.R

C8−A131.R

AO−A12B.R

BH−A1

8Q.R

C8−A130.R

C8−A138.R

E2−A154.R

C8−A12L.R

A2−A0EX.R

AO−A12D.1.R

AN−A04A.R

BH−A0AV

.R

C8−A12T.R

A8−A06Z.R

A2−A0CM.R

BH−A18U.R

A2−A0EQ.R

AR−A0U4.R

AO−A0J9.R

AR−A1A

P.R

AN−A0FK.R

AO−A0J6.R

A7−A13F.R

BH−A0E1.R

A7−A0CE.R

A2−A0YC.R

AO−A0JC.R

A8−A08Z.R

AR−A0TX.R

A8−A076

.RAO−A126.R

BH−A0C1.R

A2−A0EY.R

AR−A1AW.R

AR−A1AV.R

C8−A13

5.R

A2−A0EV.R

AN−A0AM.R

D8−A142.R

AN−A0FL.R

BH−A0DG.R

AR−A0TV.R

C8−A12Z.R

AO−A0JJ.R

AO−A0JE.R

AN−A0AJ.R

A7−A0CJ.R

AO−A12F.R

A8−A079.R

A2−A0T3.R

A2−A0YD.R

AR−A0TR.R

AO−A03O.R

AO−A12E.R

A8−A06N.R

A2−A0YG.R

BH−A18N.R

AN−A0AL.R

A2−A0T6.R

E2−A158.R

E2−A15

A.R

AO−A0JM

.R

C8−A12V.R

A2−A0D2.R

C8−A12U.R

AR−A1AS.R

A8−A09G.R

C8−A131.1.R

C8−A134.R

A2−A0YF.R

BH−A0DD.R

BH−A0E9.R

AR−A0TT.R

AO−A12B.1.R

A2−A0SW.R

AO−A0JL.R

BH−A0BV.R

A2−A0YM.R

BH−A0C

7.R

A2−A0SX.R

AO−A

12D

.PAO

−A12

D.1

.PAO

−A12

D.R

AO−A

12D

.1.R

C8−

A131

.PC

8−A1

31.1

.PC

8−A1

31.R

C8−

A131

.1.R

D8−

A142

.PD

8−A1

42.R

BH−A

0AV.

RC

8−A1

35.P

C8−

A135

.RBH

−A18

Q.P

BH−A

18Q

.RA2−A

0CM

.PA2−A

0CM

.RA2−A

0YM

.RA2−A

0SX.

RA2−A

0D2.

PA2−A

0D2.

RC

8−A1

34.P

C8−

A134

.RAR

−A0U

4.P

AR−A

0U4.

RC

8−A1

2V.P

C8−

A12V

.RBH

−A0C

1.P

AO−A

0J6.

RC

8−A1

30.P

C8−

A130

.RA7−A

0CE.

PA7−A

0CE.

RAO

−A12

F.P

AO−A

12F.

RE2−A

158.

PE2−A

158.

RAO

−A0J

L.P

AO−A

0JL.

RC

8−A1

2L.P

C8−

A12L

.RAN

−A0A

L.P

AN−A

0AL.

RAN

−A0F

L.P

AN−A

0FL.

RA2−A

0EQ

.PA2−A

0EQ

.RAR

−A1A

W.P

AR−A

1AW

.RAO

−A0J

C.R

AR−A

0TT.

PAR

−A0T

T.R

AN−A

0AM

.PAN

−A0A

M.R

AO−A

0JE.

PAO

−A0J

E.R

AR−A

0TX.

RA2−A

0SW

.PA2−A

0SW

.RC

8−A1

38.R

AO−A

0J9.

RAO

−A12

B.P

AO−A

12B.

1.P

AO−A

12B.

RAO

−A12

B.1.

RBH

−A0E

1.R

AN−A

0FK.

PAN

−A0F

K.R

E2−A

154.

PE2−A

154.

RAR

−A1A

S.P

AR−A

1AS.

RA8−A

06Z.

PA8−A

06Z.

RAR

−A0T

R.P

AR−A

0TR

.RA8−A

06N

.PA8−A

06N

.R AO−A

0JC

.PBH

−A18

N.P

BH−A

18N

.RA2−A

0YC

.PA2−A

0YC

.RAR

−A1A

V.R

BH−A

0BV.

RA2−A

0EV.

RAO

−A12

E.P

AO−A

12E.

RA2−A

0YF.

PA2−A

0YF.

RAR

−A1A

P.P

AR−A

1AP.

RE2−A

15A.

PE2−A

15A.

RC

8−A1

38.P

A8−A

076.

PA8−A

076.

RAO

−A0J

M.P

AO−A

0JM

.RBH

−A0D

G.P

BH−A

0DG

.RA2−A

0EY.

PA2−A

0EY.

RC

8−A1

2Z.P

C8−

A12Z

.RA2−A

0YG

.PA2−A

0YG

.RA8−A

09G

.PA8−A

09G

.RAN

−A04

A.P

AN−A

04A.

RA2−A

0T3.

PA2−A

0T3.

RA7−A

13F.

PA7−A

13F.

RA8−A

079.

PA8−A

079.

RAR

−A1A

V.P

BH−A

0C7.

PBH

−A0C

7.R

C8−

A12T

.PC

8−A1

2T.R

BH−A

18U

.PBH

−A18

U.R

AR−A

0TV.

PAR

−A0T

V.R

AN−A

0AJ.

PAN

−A0A

J.R

C8−

A12U

.PC

8−A1

2U.R

A7−A

0CJ.

PA7−A

0CJ.

RAO

−A03

O.P

AO−A

03O

.RBH

−A0D

D.P

BH−A

0DD

.RA2−A

0EX.

PA2−A

0T6.

PA8−A

08Z.

PAO

−A0J

9.P

BH−A

0BV.

PAO

−A12

6.P

AO−A

126.

RA2−A

0EX.

RBH

−A0A

V.P

A2−A

0YM

.PA2−A

0SX.

PAO

−A0J

6.P

BH−A

0E9.

PBH

−A0E

1.P

AR−A

0TX.

PA2−A

0EV.

PA2−A

0YD

.PAO

−A0J

J.P

AO−A

0JJ.

RA8−A

08Z.

RA2−A

0T6.

RA2−A

0YD

.RBH

−A0C

1.R

BH−A

0E9.

R

0.0

0.5

1.0

1.5

Co−clutering RNA and Protein

RNA−protein correlation > 0.4Samples

Hei

ght

Page 21: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

AGNES

Page 22: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

(Intrinsic) Proteome Clusters

§  1,521 proteins with •  No missing values •  Standard deviation > 1.5

§  Consensus k-means clustering •  1000 bootstrap samples •  k=3,4,5,6

§  Assess cluster coherence •  Visualization of consensus

matrix •  Consensus CDF/Delta-area plot •  Silhouette distance

consensus matrix k=3

123

consensus matrix k=4

1234

consensus matrix k=5

12345

consensus matrix k=6

123456

Silhouette width si

0.0 0.2 0.4 0.6 0.8 1.0

Silhoutte plot for k=3

Average silhouette width : 0.09

n = 83 3 clusters Cjj : nj | avei∈Cj si

1 : 21 | 0.08

2 : 32 | 0.08

3 : 30 | 0.12

Silhouette width si

0.0 0.2 0.4 0.6 0.8 1.0

Silhoutte plot for k=4

Average silhouette width : 0.07

n = 83 4 clusters Cjj : nj | avei∈Cj si

1 : 21 | 0.08

2 : 30 | 0.07

3 : 17 | 0.008

4 : 15 | 0.16

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

0.6

0.8

1.0

consensus CDF

consensus index

CD

F

23456

2 3 4 5 6

0.1

0.2

0.3

0.4

Delta area

k

rela

tive

chan

ge in

are

a un

der C

DF

curv

e

Extended'Data'Figure'6'

a"

b"

c"

Page 23: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Assessing cluster coherence

§  Silhouette distance §  Consensus CDF §  Delta-area plot

§  Tools: •  cluster (R) •  consensusClusterPlus (R)

Page 24: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Proteome and Phosphoproteome Clusters

AO−A12D

.01TCGA

C8−A131.01TC

GA

BH−A18Q

.02TCGA

C8−A12L.04TC

GA

AO−A12D

.05TCGA

A2−A0CM.07TC

GA

A2−A0EQ.08TC

GA

AR−A0U

4.09TCGA

A7−A0CE.13TC

GA

BH−A0C

1.16TCGA

AN−A0FL.19TC

GA

AO−A12F.22TC

GA

AN−A0AL.28TC

GA

E2−A158.29TCGA

C8−A12V.30TC

GA

A2−A0D2.31TC

GA

C8−A12U

.31TCGA

A8−A09G.32TC

GA

C8−A131.32TC

GA

C8−A134.32TC

GA

AR−A0TT.34TC

GA

AO−A12B.01TC

GA

C8−A130.02TC

GA

C8−A138.03TC

GA

AN−A04A.05TC

GA

C8−A12T.06TC

GA

A8−A06Z.07TCGA

AR−A1AP.11TC

GA

AN−A0FK.11TC

GA

A7−A13F.12TCGA

A2−A0YC.13TC

GA

AO−A0JC

.14TCGA

A8−A08Z.14TCGA

A8−A076.15TCGA

AO−A126.15TC

GA

AR−A1AW

.17TCGA

AR−A1AV.17TC

GA

AN−A0AM

.18TCGA

BH−A0D

G.19TC

GA

AR−A0TV.20TC

GA

AN−A0AJ.21TC

GA

A7−A0CJ.22TC

GA

A8−A079.23TCGA

AR−A0TR

.25TCGA

AO−A03O

.25TCGA

AO−A12E.26TC

GA

A8−A06N.26TC

GA

BH−A18N

.27TCGA

E2−A15A.29TCGA

AO−A0JM

.30TCGA

AR−A1AS.31TC

GA

BH−A0D

D.33TC

GA

AO−A12B.34TC

GA

E2−A154.03TCGA

A2−A0EX.04TCGA

BH−A0AV.05TC

GA

BH−A18U

.08TCGA

AO−A0J9.10TC

GA

AO−A0J6.11TC

GA

BH−A0E1.12TC

GA

AR−A0TX.14TC

GA

A2−A0EY.16TCGA

C8−A135.17TC

GA

A2−A0EV.18TCGA

D8−A142.18TC

GA

C8−A12Z.20TC

GA

AO−A0JJ.20TC

GA

AO−A0JE.21TC

GA

A2−A0T3.24TCGA

A2−A0YD.24TC

GA

A2−A0YG.27TC

GA

A2−A0T6.29TCGA

A2−A0YF.33TCGA

BH−A0E9.33TC

GA

A2−A0SW.35TC

GA

AO−A0JL.35TC

GA

BH−A0BV.35TC

GA

A2−A0YM.36TC

GA

BH−A0C

7.36TCGA

A2−A0SX.36TCGA

263d3f−I.CPTAC

blcdb9−I.CPTAC

c4155b−C.CPTAC

Proteome123

Phospohoproteome123

RPPABasalHer2LumALumA/BReacIReacIIX

PAM50BasalHer2LumALumBNormal

−2

−1

0

1

2

Basal-enriched Luminal-enriched Stroma-enriched

Page 25: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Cell cycle, DNA-damage and immuno-regulatory gene sets are enriched in Basal-like tumors

Pathway enrichment analysis

Page 26: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Phospho-pathway clustering

§  Dataset: 5,914 phosphoproteins •  Filtered Phosphoproteome data

–  Phosphosites with <81 missing values –  Standard deviation > 0.5 across all samples

•  Phosphosite rolled-up to proteins using median ratio •  Map phosphoproteins to genes

§  Map samples to MSigDB pathways using ssGSEA •  908 curated pathways

§  Consensus k-means clustering in pathway space §  Assess cluster coherence

Page 27: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

AO.A12D

.01TCGA

AO.A12D

.05TCGA

AO.A0J9.10TC

GA

BH.A0E1.12TC

GA

A2.A0YC.13TC

GA

AO.A0JC

.14TCGA

A8.A076.15TCGA

A2.A0EY.16TCGA

AR.A1AW

.17TCGA

AN.A0AM

.18TCGA

BH.A0D

G.19TC

GA

AO.A0JJ.20TC

GA

A8.A06N.26TC

GA

C8.A12V.30TC

GA

A8.A09G.32TC

GA

C8.A131.32TC

GA

A2.A0YF.33TCGA

AR.A0TT.34TC

GA

A2.A0SW.35TC

GA

AO.A0JL.35TC

GA

BH.A0BV.35TC

GA

A2.A0YM.36TC

GA

A2.A0SX.36TCGA

c4155b.C.CPTAC

C8.A131.01TC

GA

BH.A18Q

.02TCGA

E2.A154.03TCGA

A2.A0EX.04TCGA

BH.A0AV.05TC

GA

AR.A1AP.11TC

GA

AO.A0J6.11TC

GA

AR.A0TX.14TC

GA

C8.A135.17TC

GA

A2.A0EV.18TCGA

D8.A142.18TC

GA

C8.A12Z.20TC

GA

AO.A0JE.21TC

GA

A2.A0YD.24TC

GA

BH.A18N

.27TCGA

A2.A0T6.29TCGA

AR.A1AS.31TC

GA

BH.A0E9.33TC

GA

AO.A12B.01TC

GA

C8.A130.02TC

GA

C8.A138.03TC

GA

AN.A04A.05TC

GA

A8.A06Z.07TCGA

AN.A0FK.11TC

GA

A7.A13F.12TCGA

AO.A126.15TC

GA

AR.A1AV.17TC

GA

AR.A0TV.20TC

GA

AR.A0TR

.25TCGA

AO.A03O

.25TCGA

AO.A12E.26TC

GA

AO.A12B.34TC

GA

BH.A0C

7.36TCGA

X263d3f.I.CPTAC

blcdb9.I.CPTAC

C8.A12L.04TC

GA

C8.A12T.06TC

GA

A2.A0CM.07TC

GA

BH.A18U

.08TCGA

A2.A0EQ.08TC

GA

AR.A0U

4.09TCGA

A7.A0CE.13TC

GA

A8.A08Z.14TCGA

BH.A0C

1.16TCGA

AN.A0FL.19TC

GA

AN.A0AJ.21TC

GA

A7.A0CJ.22TC

GA

AO.A12F.22TC

GA

A8.A079.23TCGA

A2.A0T3.24TCGA

A2.A0YG.27TC

GA

AN.A0AL.28TC

GA

E2.A158.29TCGA

E2.A15A.29TCGA

AO.A0JM

.30TCGA

A2.A0D2.31TC

GA

C8.A12U

.31TCGA

C8.A134.32TC

GA

BH.A0D

D.33TC

GA

KEGG_DNA_REPLICATION

REACTOME_MITOTIC_PROMETAPHASE

PID_FOXM1_PATHWAY

REACTOME_REGULATION_OF_MITOTIC_CELL_CYCLE

REACTOME_METABOLISM_OF_RNA

REACTOME_DNA_STRAND_ELONGATION

REACTOME_APC_C_CDC20_MEDIATED_DEGRADATION_OF_MITOTIC_PROTEINS

REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX

PSP:KinaseSubstrate:CHEK1

REACTOME_SYNTHESIS_OF_DNA

PSP:KinaseSubstrate:PLK1

PID_PLK1_PATHWAY

REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS

REACTOME_G0_AND_EARLY_G1

REACTOME_CHROMOSOME_MAINTENANCE

REACTOME_S_PHASE

PSP:KinaseSubstrate:CHEK2

REACTOME_G1_S_TRANSITION

PSP:KinaseSubstrate:CDK1

KEGG_CELL_CYCLE

REACTOME_CYCLIN_A_B1_ASSOCIATED_EVENTS_DURING_G2_M_TRANSITION

REACTOME_CELL_CYCLE_CHECKPOINTS

REACTOME_DNA_REPLICATION

REACTOME_G2_M_CHECKPOINTS

REACTOME_MITOTIC_M_M_G1_PHASES

PSP:KinaseSubstrate:CDK2

REACTOME_MITOTIC_G1_G1_S_PHASES

REACTOME_DEPOSITION_OF_NEW_CENPA_CONTAINING_NUCLEOSOMES_AT_THE_CENTROMERE

REACTOME_CELL_CYCLE_MITOTIC

REACTOME_CELL_CYCLE

REACTOME_SIGNALING_BY_TGF_BETA_RECEPTOR_COMPLEX

PID_CMYB_PATHWAY

BIOCARTA_MTA3_PATHWAY

KEGG_RNA_DEGRADATION

REACTOME_RORA_ACTIVATES_CIRCADIAN_EXPRESSION

KEGG_DRUG_METABOLISM_OTHER_ENZYMES

REACTOME_CIRCADIAN_REPRESSION_OF_EXPRESSION_BY_REV_ERBA

KEGG_SPLICEOSOME

REACTOME_ASPARAGINE_N_LINKED_GLYCOSYLATION

REACTOME_DOUBLE_STRAND_BREAK_REPAIR

REACTOME_GENERIC_TRANSCRIPTION_PATHWAY

REACTOME_NOTCH1_INTRACELLULAR_DOMAIN_REGULATES_TRANSCRIPTION

REACTOME_MITOCHONDRIAL_PROTEIN_IMPORT

REACTOME_TRAF6_MEDIATED_IRF7_ACTIVATION

REACTOME_TRANSCRIPTION

REACTOME_RNA_POL_II_TRANSCRIPTION

PID_AR_PATHWAY

PID_CIRCADIAN_PATHWAY

REACTOME_TRAF3_DEPENDENT_IRF_ACTIVATION_PATHWAY

REACTOME_TRANSPORT_TO_THE_GOLGI_AND_SUBSEQUENT_MODIFICATION

PID_HES_HEY_PATHWAY

BIOCARTA_WNT_PATHWAY

KEGG_CYSTEINE_AND_METHIONINE_METABOLISM

REACTOME_MRNA_SPLICING

BIOCARTA_CARM_ER_PATHWAY

REACTOME_NUCLEAR_RECEPTOR_TRANSCRIPTION_PATHWAY

REACTOME_CIRCADIAN_CLOCK

BIOCARTA_RARRXR_PATHWAY

KEGG_LYSINE_DEGRADATION

REACTOME_ELONGATION_ARREST_AND_RECOVERY

REACTOME_NUCLEAR_EVENTS_KINASE_AND_TRANSCRIPTION_FACTOR_ACTIVATION

REACTOME_MAPK_TARGETS_NUCLEAR_EVENTS_MEDIATED_BY_MAP_KINASES

PID_FAK_PATHWAY

PID_IL2_1PATHWAY

REACTOME_SIGNAL_TRANSDUCTION_BY_L1

PID_NETRIN_PATHWAY

REACTOME_MAP_KINASE_ACTIVATION_IN_TLR_CASCADE

PID_ENDOTHELIN_PATHWAY

PID_VEGFR1_PATHWAY

BIOCARTA_AGR_PATHWAY

BIOCARTA_MAL_PATHWAY

REACTOME_NGF_SIGNALLING_VIA_TRKA_FROM_THE_PLASMA_MEMBRANE

KEGG_ERBB_SIGNALING_PATHWAY

BIOCARTA_INTEGRIN_PATHWAY

PID_LYMPH_ANGIOGENESIS_PATHWAY

REACTOME_AXON_GUIDANCE

KEGG_VEGF_SIGNALING_PATHWAY

PID_ERBB2_ERBB3_PATHWAY

BIOCARTA_PYK2_PATHWAY

BIOCARTA_AT1R_PATHWAY

KEGG_FOCAL_ADHESION

BIOCARTA_BIOPEPTIDES_PATHWAY

PID_VEGFR1_2_PATHWAY

SIG_INSULIN_RECEPTOR_PATHWAY_IN_CARDIAC_MYOCYTES

REACTOME_SIGNALLING_BY_NGF

KEGG_MAPK_SIGNALING_PATHWAY

ST_INTEGRIN_SIGNALING_PATHWAY

PID_PDGFRB_PATHWAY

PID_ERBB1_DOWNSTREAM_PATHWAY

KEGG_NEUROTROPHIN_SIGNALING_PATHWAY

KEGG_CALCIUM_SIGNALING_PATHWAY

REACTOME_PLATELET_HOMEOSTASIS

SIG_BCR_SIGNALING_PATHWAY

REACTOME_GPCR_LIGAND_BINDING

REACTOME_GASTRIN_CREB_SIGNALLING_PATHWAY_VIA_PKC_AND_MAPK

REACTOME_REGULATION_OF_INSULIN_SECRETION_BY_GLUCAGON_LIKE_PEPTIDE1

REACTOME_HEMOSTASIS

REACTOME_G_ALPHA_I_SIGNALLING_EVENTS

REACTOME_SIGNALING_BY_GPCR

REACTOME_GPVI_MEDIATED_ACTIVATION_CASCADE

REACTOME_TRANSMISSION_ACROSS_CHEMICAL_SYNAPSES

REACTOME_OPIOID_SIGNALLING

PID_IL8_CXCR2_PATHWAY

REACTOME_INTEGRATION_OF_ENERGY_METABOLISM

REACTOME_PLATELET_ACTIVATION_SIGNALING_AND_AGGREGATION

ST_MYOCYTE_AD_PATHWAY

REACTOME_PLC_BETA_MEDIATED_EVENTS

KEGG_INOSITOL_PHOSPHATE_METABOLISM

PID_AMB2_NEUTROPHILS_PATHWAY

PID_IL8_CXCR1_PATHWAY

REACTOME_G_ALPHA_Z_SIGNALLING_EVENTS

REACTOME_NEUROTRANSMITTER_RECEPTOR_BINDING_AND_DOWNSTREAM_TRANSMISSION_IN_THE_POSTSYNAPTIC_CELL

REACTOME_GPCR_DOWNSTREAM_SIGNALING

REACTOME_NEURONAL_SYSTEM

KEGG_PHOSPHATIDYLINOSITOL_SIGNALING_SYSTEM

SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES

REACTOME_REGULATION_OF_INSULIN_SECRETION

REACTOME_G_PROTEIN_BETA_GAMMA_SIGNALLING

REACTOME_ACTIVATION_OF_KAINATE_RECEPTORS_UPON_GLUTAMATE_BINDING

REACTOME_G_ALPHA_Q_SIGNALLING_EVENTS PhosphoPathway1234

Proteome123

RPPABasalHer2LumALumA/BReacIReacIIX

PAM50BasalHer2LumALumBNormal

−2

−1

0

1

2

Phospho-pathway clustering identifies unique clusters

Supplementary Figure 4

a b

Phospho: ssGSEA phospohoproteomecluster

1234

Proteome: Global proteome cluster

123

PAM50: PAM 50 statusBasalHer2LumALumBNormal

−3

−2

−1

0

1

2

3

PhosphoProteomePAM50

Phospho

Cell cycle

Checkpoint

G protein

G protein-coupled receptor

Inositol phosphatemetabolism

ErbB / HERsignaling

VEGF signaling

Integrin signaling

Estrogen receptorsignaling

TranscriptionRegulation

Page 28: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Association analysis via marker selection and GSEA

Setup  

Extract  

Normalize  

mRNA  

Cluster  Assoc   Mut  

Alternate  Analyses  

CNA  

Plots  

Data  

Norm    Data  

RNA  

CNA  

Mut  

Page 29: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Association Analysis and Marker Selection

§  Collection of algorithms for •  Identification of statistically significant differential markers

–  Multiclass –  One-vs-all

•  Training of multiple classifiers –  Partial Least Squares, Shrunken Centroids, Random Forests, Elastic Nets –  Other algorithms can be easily added

•  Variable importance from classifiers for further prioritization of differential markers

–  Marker rank aggregation for final marker ranking

•  Class prediction for unknown/new samples •  Visualization (heatmaps) •  GSEA for pathway enrichment •  EnrichmentMap for visualizing enriched pathways

Page 30: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Rank Aggregation for Marker Ranking

§  Perform Marker selection: •  Identify statistically significant differential markers (SAM)

–  Multiclass –  One-vs-all

•  Train multiple classifiers –  Partial Least Squares, Shrunken Centroids, Random Forests, Elastic

Nets –  Other algorithms can be easily added

•  Rank markers using variable importance from classifiers

§  Combine multiple rankings to a final rank •  Robust rank aggregation (R. Kolde et. al., Bioinformatics, 2012)

–  Calculate final rank based on order statistics –  Accommodates significant proportion of “noise” markers and

occasional “low” ranks

Page 31: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Linking copy number alteration and protein expression using LINCS (Library of Integrated Cellular Signatures)

Setup  

Extract  

Normalize  

mRNA  

Cluster  Assoc   Mut  

Alternate  Analyses  

CNA  

Plots  

Data  

Norm    Data  

RNA  

CNA  

Mut  

Page 32: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Approach

§  Compare proteome profiles filtered by CNA TRANS correlations with LINCS functional knock-down data. •  Genes with LINCS-enriched CIS effects are considered

candidate driver genes •  FDR for candidate driver genes is estimated using a

permutation test

Page 33: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

CNA-protein correlations show CIS and many TRANS effects

§  Correlate copy number (CN) data with proteome for all 60 million gene-protein pairs

§  Plot statistically significant correlations (FDR < 0.05)

Pro

teom

e Copy Number

Chromosome

§  Histogram shows percent of significant correlations at a CN locus

§  Highlights “hot-spots” of TRANS-activity

§  positive correlation §  negative correlation

Chr

omos

ome

Fr

eque

ncy,

%

TRANS effect “hot spots” at chromosomes 5q, 10p, 12, 16q, 17q, and 22q

Page 34: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Can proteome profiles identify candidate genes driving response in copy number altered regions?

§  A small number of key genes drive observed TRANS-effects

§  To identify candidate genes: Correlate proteome profiles of CN altered samples with gene knock-down mRNA profiles

§  CN amplification negatively correlated to knock down profile and/or CN deletion positively correlated to knock down profile

è candidate causal gene

Page 35: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Leveraging large scale perturbation datasets to identify candidate causal genes in CNA regions

§  LINCS Functional knock-down profiles on ~3,800 genes: •  Multiple hairpins per gene knock-down •  1000 landmark genes measured on Luminex assay •  Complete profile (~22,000 genes) calculated by inference •  Includes ~20,000 drug perturbagens. Total ~476,000 mRNA profiles

Library of Integrated Cellular Signatures (LINCS) aka The Connectivity Map (CMAP)

http://www.lincscloud.org

Page 36: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Use LINCS to identify key genes driving response to copy number alterations: STEP 1

§  Identify samples with deletion [log(CN) < −0.3], neutral and amplification [log(CN)>0.3] CNA for a given gene

§  Extract protein expression for genes with significant TRANS-effects (FDR < 0.05).

36

CNA_neutralCNA_delCNA_amp

CNA_status

relative

row min row max

AO.A

12D

.01T

CG

AC

8.A1

38.0

3TC

GA

C8.

A12L

.04T

CG

AAO

.A12

D.0

5TC

GA

A2.A

0EQ

.08T

CG

AA8

.A07

6.15

TCG

AA2

.A0E

Y.16

TCG

AAO

.A0J

E.21

TCG

AA2

.A0Y

G.2

7TC

GA

BH.A

0DD

.33T

CG

ABH

.A0C

7.36

TCG

AC

8.A1

30.0

2TC

GA

AN.A

04A.

05TC

GA

A2.A

0CM

.07T

CG

AAR

.A1A

P.11

TCG

ABH

.A0E

1.12

TCG

AA8

.A08

Z.14

TCG

AAN

.A0A

M.1

8TC

GA

AO.A

12F.

22TC

GA

AO.A

03O

.25T

CG

AE2

.A15

8.29

TCG

AC

8.A1

2U.3

1TC

GA

A2.A

0YF.

33TC

GA

A2.A

0SW

.35T

CG

AC

8.A1

31.0

1TC

GA

AO.A

12B.

01TC

GA

A2.A

0EX.

04TC

GA

BH.A

0AV.

05TC

GA

A8.A

06Z.

07TC

GA

AR.A

0U4.

09TC

GA

AO.A

0J9.

10TC

GA

AN.A

0FK.

11TC

GA

AO.A

0J6.

11TC

GA

A7.A

13F.

12TC

GA

A7.A

0CE.

13TC

GA

A2.A

0YC

.13T

CG

AAO

.A0J

C.1

4TC

GA

AO.A

126.

15TC

GA

AR.A

1AW

.17T

CG

AAR

.A1A

V.17

TCG

AAN

.A0F

L.19

TCG

ABH

.A0D

G.1

9TC

GA

AR.A

0TV.

20TC

GA

AO.A

0JJ.

20TC

GA

A7.A

0CJ.

22TC

GA

A8.A

079.

23TC

GA

A2.A

0T3.

24TC

GA

AR.A

0TR

.25T

CG

AAO

.A12

E.26

TCG

AA8

.A06

N.2

6TC

GA

BH.A

18N

.27T

CG

AAN

.A0A

L.28

TCG

AE2

.A15

A.29

TCG

AC

8.A1

2V.3

0TC

GA

A2.A

0D2.

31TC

GA

AR.A

1AS.

31TC

GA

C8.

A131

.32T

CG

AC

8.A1

34.3

2TC

GA

BH.A

0E9.

33TC

GA

AR.A

0TT.

34TC

GA

AO.A

12B.

34TC

GA

AO.A

0JL.

35TC

GA

BH.A

0BV.

35TC

GA

A2.A

0YM

.36T

CG

AA2

.A0S

X.36

TCG

A

CNA_status

Page 37: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Use LINCS to identify key genes driving response to copy number alterations: STEP 2

§  Summarize expression in CNA_DEL, CNA_AMP and CNA_NEUTRAL groups •  Median expression for each trans-gene

37

CNA_neutralCNA_delCNA_amp

CNA_status

relative

row min row max

AO.A

12D

.01T

CG

AC

8.A1

38.0

3TC

GA

C8.

A12L

.04T

CG

AAO

.A12

D.0

5TC

GA

A2.A

0EQ

.08T

CG

AA8

.A07

6.15

TCG

AA2

.A0E

Y.16

TCG

AAO

.A0J

E.21

TCG

AA2

.A0Y

G.2

7TC

GA

BH.A

0DD

.33T

CG

ABH

.A0C

7.36

TCG

AC

8.A1

30.0

2TC

GA

AN.A

04A.

05TC

GA

A2.A

0CM

.07T

CG

AAR

.A1A

P.11

TCG

ABH

.A0E

1.12

TCG

AA8

.A08

Z.14

TCG

AAN

.A0A

M.1

8TC

GA

AO.A

12F.

22TC

GA

AO.A

03O

.25T

CG

AE2

.A15

8.29

TCG

AC

8.A1

2U.3

1TC

GA

A2.A

0YF.

33TC

GA

A2.A

0SW

.35T

CG

AC

8.A1

31.0

1TC

GA

AO.A

12B.

01TC

GA

A2.A

0EX.

04TC

GA

BH.A

0AV.

05TC

GA

A8.A

06Z.

07TC

GA

AR.A

0U4.

09TC

GA

AO.A

0J9.

10TC

GA

AN.A

0FK.

11TC

GA

AO.A

0J6.

11TC

GA

A7.A

13F.

12TC

GA

A7.A

0CE.

13TC

GA

A2.A

0YC

.13T

CG

AAO

.A0J

C.1

4TC

GA

AO.A

126.

15TC

GA

AR.A

1AW

.17T

CG

AAR

.A1A

V.17

TCG

AAN

.A0F

L.19

TCG

ABH

.A0D

G.1

9TC

GA

AR.A

0TV.

20TC

GA

AO.A

0JJ.

20TC

GA

A7.A

0CJ.

22TC

GA

A8.A

079.

23TC

GA

A2.A

0T3.

24TC

GA

AR.A

0TR

.25T

CG

AAO

.A12

E.26

TCG

AA8

.A06

N.2

6TC

GA

BH.A

18N

.27T

CG

AAN

.A0A

L.28

TCG

AE2

.A15

A.29

TCG

AC

8.A1

2V.3

0TC

GA

A2.A

0D2.

31TC

GA

AR.A

1AS.

31TC

GA

C8.

A131

.32T

CG

AC

8.A1

34.3

2TC

GA

BH.A

0E9.

33TC

GA

AR.A

0TT.

34TC

GA

AO.A

12B.

34TC

GA

AO.A

0JL.

35TC

GA

BH.A

0BV.

35TC

GA

A2.A

0YM

.36T

CG

AA2

.A0S

X.36

TCG

A

CNA_status

Page 38: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Use LINCS to identify key genes driving response to copy number alterations: STEP 3

•  Determine up and down regulated genes in CNA_DEL and CNA_AMP (in comparison to CNA_NEUTRAL expression)

38

CNA_neutralCNA_delCNA_amp

CNA_status

relative

row min row max

AO.A

12D

.01T

CG

AC

8.A1

38.0

3TC

GA

C8.

A12L

.04T

CG

AAO

.A12

D.0

5TC

GA

A2.A

0EQ

.08T

CG

AA8

.A07

6.15

TCG

AA2

.A0E

Y.16

TCG

AAO

.A0J

E.21

TCG

AA2

.A0Y

G.2

7TC

GA

BH.A

0DD

.33T

CG

ABH

.A0C

7.36

TCG

AC

8.A1

30.0

2TC

GA

AN.A

04A.

05TC

GA

A2.A

0CM

.07T

CG

AAR

.A1A

P.11

TCG

ABH

.A0E

1.12

TCG

AA8

.A08

Z.14

TCG

AAN

.A0A

M.1

8TC

GA

AO.A

12F.

22TC

GA

AO.A

03O

.25T

CG

AE2

.A15

8.29

TCG

AC

8.A1

2U.3

1TC

GA

A2.A

0YF.

33TC

GA

A2.A

0SW

.35T

CG

AC

8.A1

31.0

1TC

GA

AO.A

12B.

01TC

GA

A2.A

0EX.

04TC

GA

BH.A

0AV.

05TC

GA

A8.A

06Z.

07TC

GA

AR.A

0U4.

09TC

GA

AO.A

0J9.

10TC

GA

AN.A

0FK.

11TC

GA

AO.A

0J6.

11TC

GA

A7.A

13F.

12TC

GA

A7.A

0CE.

13TC

GA

A2.A

0YC

.13T

CG

AAO

.A0J

C.1

4TC

GA

AO.A

126.

15TC

GA

AR.A

1AW

.17T

CG

AAR

.A1A

V.17

TCG

AAN

.A0F

L.19

TCG

ABH

.A0D

G.1

9TC

GA

AR.A

0TV.

20TC

GA

AO.A

0JJ.

20TC

GA

A7.A

0CJ.

22TC

GA

A8.A

079.

23TC

GA

A2.A

0T3.

24TC

GA

AR.A

0TR

.25T

CG

AAO

.A12

E.26

TCG

AA8

.A06

N.2

6TC

GA

BH.A

18N

.27T

CG

AAN

.A0A

L.28

TCG

AE2

.A15

A.29

TCG

AC

8.A1

2V.3

0TC

GA

A2.A

0D2.

31TC

GA

AR.A

1AS.

31TC

GA

C8.

A131

.32T

CG

AC

8.A1

34.3

2TC

GA

BH.A

0E9.

33TC

GA

AR.A

0TT.

34TC

GA

AO.A

12B.

34TC

GA

AO.A

0JL.

35TC

GA

BH.A

0BV.

35TC

GA

A2.A

0YM

.36T

CG

AA2

.A0S

X.36

TCG

A

CNA_status

UP UP DN DN CNA_AMP gene set

CNA_DEL gene set

Page 39: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Use LINCS to identify key genes driving response to copy number alterations: STEP 4

§  Query LINCS using CNA_AMP/DEL gene sets •  Convert CNA_AMP/DEL gene sets to Affymetix IDs •  Run LINCS enrichment test on ~240,000 “gold” consensus

signatures (CGS). •  Extract “CIS-enriched” gene knock downs:

–  Enriched gene knock downs include CIS gene •  Correct direction of correlation (+ve for CNA_DEL, −ve for

CNA_AMP) –  |mean_rankpt4| > 90

•  Mean percentile in 4 cell lines > 90 •  Extract and analyze z-scores for CIS-enriched genes

39

UP UP DN DN CNA_AMP gene set

CNA_DEL gene set

Page 40: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Calculate Permutation-based FDR

1.  For each of the genes input to the LINCS enrichment test, generate a random permutation as follows: •  Let gene G have Ng TRANS genes •  From the list of all genes, randomly select Ng genes (without

replacement) 2.  Run LINCS enrichment for this permuted dataset 3.  Determine FPi, the number of “candidate driver genes” from the

random dataset. 4.  Repeat Steps 1-3 R times. 5.  Calculate FDR as mid point of 95% Score CI assuming Poisson

distribution with small rate (λ ≈ 0) and small R (R=6).

40

FDR = E #FP#P

!

"#

$

%&=

E(#FP)#P

=FP +1.962 / (2R)

#P where FP = 1

RFPi

i=1

R

95% Score CI for E(#FP) = FP +1.962 / (2R)±1.96 4FP +1.962 / R4R

Page 41: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Results

41

§  Input gene sets: •  ≥15 tumors with |CNA| > 0.3 •  Gene must be on CMAP KD list •  # TRANS genes ≥ 20 •  Total genes tested: 502

§  20 CIS-enriched candidate genes •  Level I: 10 genes

–  Enriched in both CNA_AMP and CNA_DEL

•  Level II: 10 genes –  Enriched in either CNA_AMP or

CNA_DEL

²  Level I CETN3 SKP1 SNX2 FBXO7 TERF2IP WIPF2 YAP1 RB1 USP7 DHPS

²  Level II FBXL20 ERBB2 USP6NL ARHGEF12 MRPL12 RAB21 EP300 CPNE3 PLCB3 UBE3C

FDR=0.049 [0.003, 0.094]

FDR=0.305 [0.225, 0.385]

Page 42: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

SKP1 and CETN3 are new candidate causal genes for 5q CNA TRANS effects

•  20 candidate causal genes were identified •  ERBB2 serves as a positive control •  CNA/Protein CIS effects are more indicative for a

CMAP connection to TRANS regulated genes than CNA/RNA CIS effects

Page 43: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

LINCS data: Knock-down of SKP1 and CETN3 increases EGFR, YES1 and DAPK3 expression

Page 44: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

CMAP programming interface for large-scale queries

§  LINCS Cloud Compute server with command line interface •  Run large-scale batch queries

using a Grid Engine

§  Programming interface

•  R, Python, Matlab •  For accessing and manipulating

LINCS data and results

§  Web-based API •  For accessing and querying

metadata –  Perturbations and

perturbagens –  Signatures –  Measured and inferred

genes

44

•  HDF5 (hierarchical data format) for storing data •  Mongo DB for metadata

Page 45: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Challenges and Implementation

Page 46: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Summary

§  Automated pipeline enables high throughput analysis of CPTAC data •  Reproducible and documented process •  Version controlled •  Generalizable to other projects •  Easy comparison of alternatives •  Effective use of parallelism

§  Marker selection and classification can be used for any

analysis •  Automated •  Multiple ML methods

Page 47: New Statistics and Machine Learning Methods for Proteogenomic … · 2016. 8. 3. · • Data analysis algorithms and toolkit for proteogenomics – Applied to breast cancer analysis,

Proteomics & Biomarker Discovery

Acknowledgments

WASHINGTON U./NYU/UNC/VANDERBILT

-  Sherri Davies -  Matthew Ellis -  Reid Townsend -  Li Ding -  Song Cao -  Michael McLellan -  Kuan-lin Huang -  Venkata Yellapa -  David Fenyo -  Kelly Ruggles -  Chuck Perou -  Michael Gatza -  Bing Zhang -  Jing Wang

BROAD INSTITUTE -  Steve Carr -  Karl Clauser -  Michael Gillette -  Jana Qiao -  Lauren Tang -  Philipp Mertins -  D R Mani -  Karsten Krug -  Eric Kuhn -  Filip Mundt -  Corey Flynn -  Jacob Asiedu -  Aravind Subramaniyan

NCI STAFF -  Emily Boja -  Mehdi Mesri -  Rob Rivers -  Chris Kinsinger -  Henry Rodriguez

FUNDING: National Cancer Institute

FHCRC -  Amanda Paulovich -  Jeffrey Whiteaker -  Pei Wang -  Sean Wang -  Chenwei Lin -  Ping Yan -  Yuzheng Zhang