
Machine Learning and Pathway Analysis as Basic Tools in Systems Biology

Adi L. Tarca1,2

1. Department of Computer Science, Wayne State University, Detroit, MI, USA
2. Perinatology Research Branch, NICHD/NIH, Bethesda, MD, and Detroit, MI, USA
3. Center for Molecular Medicine and Genetics, Wayne State University,

Outline

• Lessons learned from the IMPROVER Diagnostic Signature Challenge
• Gene set/pathway analysis
• Approach of the PRB team in Species Translation Challenge

Machine Learning / Supervised Learning

• Automated process performed by a machine (computer)
• to approximate (learn) the relation between a set of predictors (Xj) and an outcome y
• using a set of examples (X,y)i
• The model is expected to perform well when applied to new data (generalize)

IMPROVER: Diagnostic Signature Challenge

• Assess and verify computational approaches that classify clinical samples based on transcriptomics data

• Participants built models from public data to predict 5 endpoints (Psoriasis, COPD, Lung cancer, MS stages, MS diagnosis)

• It was more stringent than a previous initiative, MAQC-II (Shi et al. 2010, Nat Biotechnol)

Meyer P et al, Bioinformatics 28, 2012

Model Performance Results in the Diagnostic Signature Challenge

- 54 teams participated
- The endpoint explained 69% of the variance (p < 0.05)
- Team/approach explained 8% (NS)
- No team performed best in more than one sub-challenge

[Figure: prediction quality (Z-scores of BCM, CCEM and AUPR) by endpoint / sub-challenge: Psoriasis, MSS, MSD, Lung cancer, COPD]

Approach of the best overall team

[Diagram: modeling options at each step of the classification pipeline]
- Training sets: Set 1; Set 2; Set 1 & Set 2; using unrelated training dataset
- Preprocessing: all batches together; within batch + batch effect correction
- Preprocessing methods: PLIER, RMA, MAS5, GCRMA, dCHIP
- Feature selection: filter genes by moderated t-test & fold change; optimize the number of genes by cross-validated AUC
- Classification model: LDA, QDA, DLDA, neural networks, SVM, decision trees
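The feature selection step named on this slide (filter genes, then optimize the number of genes by cross-validated AUC) can be sketched as below. This is a minimal illustration under stated assumptions, not the best team's code: an ordinary Welch t-statistic stands in for the moderated t-test, the fold-change filter is omitted, a difference-of-class-means score stands in for LDA, and leave-one-out cross-validation is used. All function names and the toy data are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic comparing two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(X, y):
    """Return gene indices sorted by decreasing |t| (class 1 vs class 0)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = [abs(welch_t([r[g] for r in pos], [r[g] for r in neg]))
              for g in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda g: -scores[g])


def centroid_score(X, y, genes, sample):
    """Score a sample by projecting it on the difference of class means
    (assumption: a simple stand-in for an LDA discriminant score)."""
    score = 0.0
    for g in genes:
        m1 = mean(r[g] for r, label in zip(X, y) if label == 1)
        m0 = mean(r[g] for r, label in zip(X, y) if label == 0)
        score += (m1 - m0) * (sample[g] - (m1 + m0) / 2)
    return score


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loo_auc(X, y, n_genes):
    """Leave-one-out AUC; genes are re-selected inside every fold."""
    scores = []
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = rank_genes(X_tr, y_tr)[:n_genes]
        scores.append(centroid_score(X_tr, y_tr, genes, X[i]))
    return auc(scores, y)


# Toy data (hypothetical): 20 samples x 30 genes, signal in the first 3 genes.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [[rng.gauss(2.0 * label if g < 3 else 0.0, 1.0) for g in range(30)]
     for label in y]

cv_auc = {k: loo_auc(X, y, k) for k in (1, 2, 3, 5, 10, 20)}
best_k = max(cv_auc, key=cv_auc.get)
print(cv_auc, best_k)
```

Re-ranking the genes inside each fold keeps the held-out sample from influencing feature selection, which is one way to avoid the optimistic bias that the deck later refers to as over-fitting.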

Strategies of 2nd and 3rd Best Overall Teams in DSC

• 2nd best overall team used:
- unsupervised clustering of test samples
- clustering based on features selected by Wilcoxon test
- cluster labels assigned using prior information about the direction of change of a few known genes
• 3rd best overall team used:
- LASSO regularized logistic regression
- Regularization parameter optimized via LOO cross-validation
- Features filtered by Wilcoxon test

What explains the variability in performance data and what works best in general?

Issues in the Analysis of Model Performance Data in the IMPROVER DSC

• Method descriptions were not detailed enough, resulting in missing data
• There were too many different methods for each modeling factor (e.g. over 15 types of classifiers)
• The training data differed between teams for the same endpoint

A Post Challenge Survey

• Had the team used cross-validation to tune any of the parameters in their classification pipeline?
• Teams that had used cross-validation had better performance (p<0.05): by 1.2 Z-score units for BCM and 1.9 for AUPR

A Post Challenge Computational Experiment

- Fix the training datasets and everything else
- Vary the preprocessing (RMA, GCRMA, MAS5), feature selection (t-test, moderated t-test, Wilcoxon test) and classifier (LDA, kNN, SVM)

[Figure: (BCM+CCEM+AUPR)/3 for each combination of methods; panel label: combination of the best overall team]

Most Important Modeling Factor is Problem and Metric Dependent

Ideally, the exact prediction assessment procedure should be known in advance!

Data Preprocessing: Together is Better than Separate

- 24 data points (2 preprocessing methods x 3 feature selections x 4 endpoints)
- BCM and AUPR improved on average by 6% and 4%, respectively (Wilcoxon p-value < 0.05).

maPredictDSC R Package

• Implements the approach of the PRB team, available from Bioconductor
• Starts with raw (Affymetrix) gene expression data files and one annotation data frame assigning files to groups (disease, control, test)
• Tries 27 combinations of data preprocessing, feature selection and classifiers to guide model selection
• Uses N-fold cross-validation to determine the optimal number of features for each combination of methods
• Provides predictions for the test samples and a fitted model
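The search over 27 combinations can be pictured as a small grid: three options for each of the three pipeline steps. The sketch below is in Python rather than R and does not use the maPredictDSC API. It assumes the options are the ones listed for the post-challenge experiment (RMA, GCRMA, MAS5; t-test, moderated t-test, Wilcoxon test; LDA, kNN, SVM), and `select_pipeline` and `dummy_cv_score` are hypothetical names.

```python
from itertools import product

# Assumption: three options per step, taken from the post-challenge experiment.
PREPROCESSING = ("RMA", "GCRMA", "MAS5")
FEATURE_SELECTION = ("t-test", "moderated t-test", "Wilcoxon test")
CLASSIFIERS = ("LDA", "kNN", "SVM")


def select_pipeline(cv_score):
    """Score every combination and return (best_combo, all_scores).

    cv_score(preprocessing, feature_selection, classifier) must return a
    cross-validated performance estimate where higher is better.
    """
    scores = {combo: cv_score(*combo)
              for combo in product(PREPROCESSING, FEATURE_SELECTION, CLASSIFIERS)}
    best = max(scores, key=scores.get)
    return best, scores


def dummy_cv_score(prep, fs, clf):
    """Placeholder scorer so the sketch runs; a real scorer would fit and
    cross-validate the pipeline on the training data."""
    return len(prep) + len(fs) + len(clf)


best, scores = select_pipeline(dummy_cv_score)
print(len(scores), best)
```

Passing the scorer in as a callable keeps the enumeration separate from how each pipeline is fitted and cross-validated.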

Conclusions IMPROVER DSC

• When gene expression differences are weak, no classification pipeline will work
• The No Free Lunch (NFL) theorem was proven right, again: there is no universally best approach to class prediction
• Using one’s favorite methods can work well on average, yet the methods need to be used properly to avoid under- and over-fitting
• Finding the best model for a given problem requires trying many combinations of methods; the maPredictDSC package can help
• The importance of each step in the process (preprocessing, feature selection, classifier choice) is problem and metric dependent, so no shortcut can be confidently suggested

Gene Set and Pathway Analysis

Gene Set Analysis: Motivation

• A successful class-comparison experiment may result in hundreds or thousands of differentially expressed (DE) genes
• A widely used approach to interpret such results includes the following 2 steps:
1) Staring at long lists of genes
and
2) Focusing on genes that we already know

Gene Sets

• Examples of gene sets:
– Gene Ontology Terms
– Signaling and metabolic pathways (e.g. KEGG, BioCarta, Reactome)
– Motif gene sets, etc. (GSEAbase)

Gene Set Analysis

• Methods to test association between a predefined set of variables (e.g. genes, proteins, etc.) and an outcome of interest
• One of the few options to extract meaning from hundreds or more DE genes in a given condition
• Can establish a link between a gene set and the outcome even when there are no DE genes by usual thresholds (e.g. *)

* Mootha et al., PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately down-regulated in human diabetes. Nat Genet, 2003.

Over-Representation Analysis (ORA)

• Select a subset of genes as differentially expressed (DE)
• Test if a given gene set has more DE genes than expected by chance:
– Hypergeometric (Tavazoie, Nat. Genet. 1999, Draghici et al., Genomics, 2003)
– Binomial (Cho et al. Nat. Genet. 2001)
– Chi-square (χ²)
• Results change as a function of the stringency of gene selection
• Ignores the correlation between genes
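The hypergeometric version of the ORA test can be sketched as follows: given the number of genes measured, the number called DE, and the size of a gene set, it returns the probability of seeing at least the observed number of DE genes in the set by chance. The function name and the toy numbers are hypothetical.

```python
from math import comb


def ora_pvalue(n_universe, n_de, set_size, de_in_set):
    """One-sided hypergeometric p-value: P(X >= de_in_set).

    n_universe: number of genes measured (the reference list)
    n_de:       number of genes selected as differentially expressed
    set_size:   number of measured genes that belong to the gene set
    de_in_set:  number of DE genes that belong to the gene set
    """
    upper = min(n_de, set_size)
    total = comb(n_universe, set_size)
    tail = sum(comb(n_de, k) * comb(n_universe - n_de, set_size - k)
               for k in range(de_in_set, upper + 1))
    return tail / total


# Toy example: 20 genes measured, 5 DE, a gene set of 4 genes with 3 DE.
p = ora_pvalue(20, 5, 4, 3)
print(round(p, 4))  # 155/4845, about 0.032
```

Because `n_de` depends on the DE cutoff, the p-value changes with the stringency of gene selection, as the slide points out.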

Combining pathway information with over-representation evidence

The impact analysis method (Pathway Express): Draghici S., Khatri P., Tarca A.L. et al., Genome Res, 2007
A Novel Signaling Pathway Impact Analysis (SPIA): Tarca A.L., Draghici S., Khatri P. et al., Bioinformatics, 2009

Functional class scoring methods

• Use evidence from all genes in the gene set (not only DE)
• Provide a “unique” solution to a given problem
• Account for gene-gene correlations
• Are typically slower because they involve sample permutation
• Popular methods:
– Gene Set Enrichment Analysis (GSEA) (Mootha et al., Nat. Genet., 2003; Subramanian et al., PNAS, 2005)
– Gene Set Analysis (GSA): Efron B & Tibshirani R, Annals of Applied Statistics, 2006
• Give equal weight to all genes in a gene set

Pathway Analysis with Down-weighting of Overlapping Genes (PADOG)

Extends the Gene Set Analysis (GSA):
1. Weights gene scores as a function of how often the gene appears across the gene sets
2. Uses moderated t-scores instead of ordinary t-scores for each gene
3. Computes the mean of absolute gene scores instead of the maxmean statistic
4. Computes significance by comparing observed pathway scores to the empirical null distribution obtained from phenotype permutations

Tarca AL., Draghici S., Bhatti G., Romero R., BMC Bioinformatics, 2012, 13:136.
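Steps 1, 3 and 4 above can be illustrated with a short sketch. The weighting function used here (weights between 1 and 2 that decrease with the square root of a gene's frequency across gene sets) is an assumption made for illustration; the exact form should be taken from the cited paper. The gene-level scores are supplied as plain numbers rather than computed moderated t-scores, and all names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean


def gene_weights(gene_sets):
    """Down-weight genes that occur in many gene sets.

    Assumption: weight = 1 + sqrt((f_max - f) / (f_max - f_min)), so the most
    frequent genes get weight 1 and the most specific genes get weight 2.
    """
    freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    f_min, f_max = min(freq.values()), max(freq.values())
    if f_max == f_min:
        return {g: 1.0 for g in freq}
    return {g: 1.0 + math.sqrt((f_max - f) / (f_max - f_min))
            for g, f in freq.items()}


def set_score(genes, t_scores, weights):
    """Mean of weighted absolute gene scores for one gene set."""
    return mean(weights[g] * abs(t_scores[g]) for g in genes)


def permutation_pvalue(observed, null_scores):
    """Fraction of phenotype-permutation scores >= the observed score."""
    return sum(s >= observed for s in null_scores) / len(null_scores)


# Toy example (hypothetical gene sets and gene-level scores).
gene_sets = {"A": ["g1", "g2", "g3"], "B": ["g1", "g4"], "C": ["g1", "g2", "g5"]}
t_scores = {"g1": 4.0, "g2": -2.0, "g3": 1.0, "g4": 0.5, "g5": -3.0}

weights = gene_weights(gene_sets)
score_a = set_score(gene_sets["A"], t_scores, weights)
print(weights, score_a)
```

In a full analysis the null scores passed to `permutation_pvalue` would come from recomputing the gene-level scores and set scores after each permutation of the phenotype labels; that loop is omitted here.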

[Figure: gene frequency across 143 KEGG non-metabolic pathways (PADOG)]

Assessing Gene Set Analysis

• Most gene set analysis methods are compared against a few competitor methods using a few real datasets or simulated data
• Performance is typically measured relying on literature citations for the relevance of a given gene set to a condition
• It is relatively easy to find 2-3 datasets on which one’s method works better than a competitor method
• It is hard to figure out which method works best in general, if any

A.L. Tarca, G. Bhatti, R. Romero, PLoS ONE, 2013.

Results

Analysis of a colorectal cancer dataset. The colorectal cancer pathway is expected to be relevant to this dataset (target pathway).

Pathway name                            Method 1   Method 2   Method 1   Method 2
                                        p-value    p-value    rank       rank
Bile secretion                          0.023      0.012      1          2
Fatty acid elongation in mitochondria   0.029      0.018      2          5
Colorectal cancer                       0.14       0.015      3          4
Ribosome biogenesis in eukaryotes       0.2        0.03       4          7
Cell cycle                              0.21       0.014      5          3
Cyanoamino acid metabolism              0.32       0.16       6          8
Purine metabolism                       0.58       0.022      7          6
Small cell lung cancer                  0.68       0.01       8          1
Fatty acid metabolism                   0.8        0.3        9          9
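The two quantities read off this table for the target pathway, its p-value and its rank, can be reproduced with a few lines. The p-values are copied from the table; expressing the rank as a percentage of the number of gene sets analyzed is an assumption about how the "median rank (%)" reported later is defined, and the function name is hypothetical.

```python
# p-values from the table: pathway -> (Method 1, Method 2)
pvalues = {
    "Bile secretion": (0.023, 0.012),
    "Fatty acid elongation in mitochondria": (0.029, 0.018),
    "Colorectal cancer": (0.14, 0.015),
    "Ribosome biogenesis in eukaryotes": (0.2, 0.03),
    "Cell cycle": (0.21, 0.014),
    "Cyanoamino acid metabolism": (0.32, 0.16),
    "Purine metabolism": (0.58, 0.022),
    "Small cell lung cancer": (0.68, 0.01),
    "Fatty acid metabolism": (0.8, 0.3),
}


def target_rank(target, method_index):
    """Rank of the target pathway (1 = smallest p-value) for one method."""
    ordered = sorted(pvalues, key=lambda name: pvalues[name][method_index])
    return ordered.index(target) + 1


target = "Colorectal cancer"
for m in (0, 1):
    rank = target_rank(target, m)
    rank_pct = 100 * rank / len(pvalues)  # assumption: rank as % of gene sets
    print(f"Method {m + 1}: p={pvalues[target][m]}, rank={rank}, rank%={rank_pct:.1f}")
```

Method 2 gives the target pathway the smaller p-value (0.015 vs 0.14) yet ranks it lower (4th vs 3rd), which is why sensitivity and prioritization are assessed separately.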

Methods

[Diagram: benchmark design, 42 microarray datasets analyzed with 16 gene set analysis methods]

- Methods: ORA, GSEA, GSEAP, GLOBALTEST, SAFE, SIGPATHWAY Q1, SIGPATHWAY Q2, PLAGE, GSA, ZSCORE, MRGSE, GAGE, SSGSEA, PADOG, CAMERA, GSVA
- KEGG Disease Pathways (number of datasets): Acute myeloid leukemia (3), Chronic myeloid leukemia (2), Colorectal cancer (5), Dilated cardiomyopathy (2), Endometrial cancer (1), Glioma (2), Huntington's disease (1), Prostate cancer (2), Renal cell carcinoma (2), Alzheimer's Disease (5), Non Small Cell Lung Cancer (2), Pancreatic cancer (3), Parkinson's disease (3), Thyroid cancer (2)
- Metacore Disease Biomarker Networks (number of datasets): Diabetes Mellitus Type2 (1), Lupus Erythematosus Systemic (1), Pulmonary Disease Chronic Obstructive (2), Pancreatic Neoplasms (1), Ovarian Neoplasms 1 (2)

Performance

• The methods are compared in their ability to:
– produce low p-values for the target pathway (sensitivity)
– rank the target pathway close to the top (prioritization)
– produce no more than the expected number of false positives when phenotypes are permuted (specificity)

A.L. Tarca, G. Bhatti, R. Romero, PLoS ONE, 2013, in press.


Results

- Ranking methods by median p-value of pathways expected to be relevant, i.e. a surrogate for sensitivity
- using sensitivity TP/(TP+FN) is similar but leads to ties in the ranking
- Ranking methods by median ranks of pathways expected to be relevant, i.e. a surrogate for prioritization
• Permute the phenotype of each of the 42 datasets
• Repeat 50 times
• Count how many pathways have p < α

A Ranking of Gene Set Analysis Methods

Method         Category   Sensitivity:     Prioritization:   Specificity:   Overall rank
                          median p-value   median rank (%)   FP at α=1%     in category
PLAGE          I          0.0022           25.0              1.1%           1
GLOBALTEST     I          0.0001           27.9              2.0%           2
PADOG          I          0.0960           9.7               2.5%           3
ORA            I          0.0732           18.3              2.5%           4
SAFE           I          0.1065           18.8              1.3%           5
SIGPATHWAYQ2   I          0.0565           38.0              0.9%           6
GSA            I          0.1420           21.0              1.3%           7
SSGSEA         I          0.0808           40.3              1.0%           8
ZSCORE         I          0.0950           39.8              1.0%           9
GSEA           I          0.1801           33.1              2.3%           10
GSVA           I          0.1986           51.5              1.1%           11
CAMERA         I          0.3126           43.0              0.5%           12
MRGSE          II         0.0100           18.8              4.9%           1
GSEAP          II         0.0644           36.2              15.8%          2
GAGE           II         0.0024           35.9              37.9%          3
SIGPATHWAYQ1   II         0.1165           49.7              17.2%          4

Ranking Stability

                          Sample size       Gene set size     Effect
Method       Rank in      Small    Large    Small    Large    Small      Large
             category     n<22     n≥22     N<66     N≥66     g<24.6%    g≥24.6%
PLAGE        1            1        4        2        3        3          3
GLOBALTEST   2            2        1        3        5        2          4
PADOG        3            3        2        1        2        1          1
ORA          4            4        3        5        1        5          2
SAFE         5            7        5        4        8        4          6
SIGPATH.Q2   6            5        8        8        4        8          8
GSA          7            9        6        7        6        6          11
SSGSEA       8            8        7        6        12       9          5
ZSCORE       9            6        10       10       7        7          9
GSEA         10           10       9        9        11       10         7
GSVA         11           11       11       11       9        11         10
CAMERA       12           12       12       12       10       12         12
MRGSE        1            1        1        1        2        1          1
GSEAP        2            2        2        2        1        2          2
GAGE         3            3        3        3        4        3          3
SIGPATH.Q1   4            4        4        4        3        4          4

• The ranking based on all 42 datasets was significantly correlated with all rankings based on half of the datasets (all p<0.0001)
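The agreement between the overall ranking and a ranking based on a subset of datasets can be checked with a rank correlation. The slide does not say which coefficient was used; the sketch below assumes Spearman's rho and applies it to the category I methods, using the "small sample size" and "large sample size" columns of the ranking stability table. The function name is hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two untied rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Category I methods in table order (PLAGE ... CAMERA).
overall = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
small_n = [1, 2, 3, 4, 7, 5, 9, 8, 6, 10, 11, 12]   # datasets with n < 22
large_n = [4, 1, 2, 3, 5, 8, 6, 7, 10, 9, 11, 12]   # datasets with n >= 22

print(round(spearman_rho(overall, small_n), 3))  # 0.937
print(round(spearman_rho(overall, large_n), 3))  # 0.93
```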

Conclusions Gene Set Analysis

• Gene set analysis methods are useful to reduce complexity of high-throughput experiments
• Best methods for gene set prioritization are different from best ones in terms of sensitivity
• PLAGE, GLOBALTEST, PADOG are best overall in category I, and MRGSE best in category II.
• Disease pathways (KEGG, Metacore) can be used as positive controls in gene set analysis in conjunction with a large number of datasets studying those diseases
• Our approach to gene set analysis assessment is less subjective than relying on literature citations

Approach of team PRB (49) in the Species Translation Challenge (STC)

• Sub-challenges SC1-3 were approachable via a machine learning framework: discriminating positive responses (proteins phosphorylated or pathways activated) against negatives
• Compared to IMPROVER DSC, the STC had these particularities:
– More datasets (SC1: 16, SC2: 16, SC3: 246), so more challenging but potentially better opportunity to assess performance
– Did not involve data pre-processing (batch effect removal?)
– Teams used the same data
– Datasets were highly imbalanced (many responses with 0, 1, 2, etc. positives, and the remaining up to 26 negatives)

SC1: Intra-Species Protein Phosphorylation Prediction

• Used the average gene expression in DME samples within batch as normalizer (to remove batch effects)
• Predictions set to 0 (negative) for phosphoproteins that were positive in fewer than 2 stimuli in the training set
• For all other phosphoproteins, fit an LDA model on the training data using a procedure* similar to the one we used in the IMPROVER DSC
• Genes are ranked by moderated t-test p-value & fold change (cutoff optimized between 1.25 and 4)
• The number of genes to include in the LDA model was determined by maximizing AUC+CCEM+BCM estimated by repeated cross-validation

(*) Tarca AL, Than NG, Romero R., Systems Biomedicine; 1(4), 2013.
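The first two steps of the SC1 procedure (normalizing by the DME samples of the same batch, and skipping model fitting for phosphoproteins with too few positive stimuli) can be sketched as below. The sketch assumes log-scale expression values, so that "using the DME average as normalizer" means subtracting it; that reading, the function names and the toy data are assumptions.

```python
from statistics import mean


def dme_normalize(expr, batch, is_dme):
    """Subtract, per gene, the mean of the DME samples from the same batch.

    expr:   list of samples, each a list of (log-scale) gene values
    batch:  batch label of each sample
    is_dme: True for DME (vehicle control) samples
    Assumption: every batch contains at least one DME sample.
    """
    normalized = []
    for sample, b in zip(expr, batch):
        controls = [s for s, bb, d in zip(expr, batch, is_dme) if d and bb == b]
        if not controls:
            raise ValueError(f"batch {b!r} has no DME samples")
        reference = [mean(column) for column in zip(*controls)]
        normalized.append([v - r for v, r in zip(sample, reference)])
    return normalized


def trainable(status, min_positives=2):
    """True if a phosphoprotein is positive in enough training stimuli to
    fit a model; otherwise all test predictions are set to 0 (negative)."""
    return sum(status) >= min_positives


# Toy example: 5 samples x 2 genes in two batches.
expr = [[5.0, 2.0], [7.0, 4.0], [6.0, 3.0], [10.0, 1.0], [12.0, 2.0]]
batch = ["a", "a", "a", "b", "b"]
is_dme = [True, True, False, True, False]

norm = dme_normalize(expr, batch, is_dme)
print(norm)
print(trainable([0, 1, 0, 0]), trainable([1, 1, 0, 0]))
```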

[Schematic: rat gene expression matrix (DME normalized; stimuli 1..26 x genes 1..N) used to predict rat protein phosphorylation status (0/1) of each protein, e.g. Protein 1, across the 26 stimuli]

[Figures: MK03 (Mitogen-activated protein kinase 3) phosphorylation in rat. Heatmaps of genes FRMD5, WISP1, SMTN, DIO3 and GFRA1 and PCA plots (PC1, PC2, PC3) of the training data (19 negative and 7 positive stimuli; labeled stimuli: SEROTONIN, PROMETHAZINE, PMA, TGFA, MEPYRAMINE, PDGFB, EGF), followed by the same views with test data added (labeled test stimuli: stim01, stim09)]

SC2: Inter-Species Protein Phosphorylation Prediction

• Same approach as in SC1 with these differences:
– Predictors are not genes but phosphoproteins (1..16) at 5 and 25 min (32 features)
– We used the actual phosphorylation level (continuous) for model fitting
[Schematic: rat protein phosphorylation (Prot. 1..16 at 5 min and at 25 min; stimuli 1..26) used to predict human protein phosphorylation status (0/1) of each protein, e.g. Protein 1]

SC2: Inter-Species Protein Phosphorylation Prediction

• Should we have used gene expression + protein phosphorylation data?
SC3: Inter-Species Pathway Perturbation Prediction – First submission

• Predictors were NES scores (continuous)
• If a pathway was perturbed in fewer than 4 of the 26 training stimuli in human (91% of the 246 pathways), it was predicted as non-perturbed in the test set

[Schematic: rat pathway NES matrix (stimuli 1..26 x pathways 1..246) used to predict human pathway perturbation status (0/1) of each pathway, e.g. Pathway 1]

SC3: Inter-Species Pathway Perturbation Prediction

• Result of the first submission (before the first deadline)

[Results table; surviving row: Non-official 0.20 0.57 0.53 3]

SC3: Inter-Species Pathway Perturbation Prediction

• Result of the final submission

SC3: Why did we need a new and riskier approach in SC3?

• A new approach would make editors more interested in the resulting paper
• Machine learning cannot work if a pathway is activated in 2 or fewer stimuli in human (79% of the 246 pathways)
• We assumed that the quality of predictions for each pathway would have equal weight in the scoring

True status and predictions of two methods (M1 = Method 1, M2 = Method 2) for 4 pathways and 10 stimuli:

Stimulus   Pathway 1        Pathway 2        Pathway 3        Pathway 4
           True  M1  M2     True  M1  M2     True  M1  M2     True  M1  M2
1          1     0   1      1     0   1      1     1   1      1     1   1
2          0     0   0      1     0   0      1     1   1      1     1   1
3          0     0   0      0     0   0      1     1   1      1     1   1
4          0     0   0      0     0   0      1     1   0      1     1   0
5          0     0   0      0     0   0      1     0   0      1     0   0
6          0     0   0      0     0   0      0     0   0      0     0   0
7          0     0   0      0     0   0      0     0   0      0     0   0
8          0     0   0      0     0   0      0     0   0      0     0   0
9          0     0   0      0     0   0      0     0   0      0     0   0
10         0     0   0      0     0   0      0     0   0      0     0   0

Pooled performance
            Pearson   BAC
Method 1    0.72      0.807
Method 2    0.72      0.807

Separate then averaged
            Pearson   BAC
Method 1    NA        0.7
Method 2    0.74      0.84
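The numbers in the two small performance tables can be recomputed from the predictions listed above. The sketch below computes the Pearson correlation and the balanced accuracy (BAC, the mean of sensitivity and specificity) once on all 40 pathway-stimulus pairs pooled together, and once per pathway followed by averaging. The helper names are hypothetical; returning `None` when a correlation is undefined mirrors the NA on the slide.

```python
import math

# True status and predictions per pathway (10 stimuli each), from the table.
truth = {1: [1] + [0] * 9, 2: [1, 1] + [0] * 8,
         3: [1] * 5 + [0] * 5, 4: [1] * 5 + [0] * 5}
method1 = {1: [0] * 10, 2: [0] * 10,
           3: [1] * 4 + [0] * 6, 4: [1] * 4 + [0] * 6}
method2 = {1: [1] + [0] * 9, 2: [1] + [0] * 9,
           3: [1] * 3 + [0] * 7, 4: [1] * 3 + [0] * 7}


def bac(y, p):
    """Balanced accuracy: mean of sensitivity and specificity."""
    tp = sum(a == 1 and b == 1 for a, b in zip(y, p))
    tn = sum(a == 0 and b == 0 for a, b in zip(y, p))
    return (tp / sum(y) + tn / (len(y) - sum(y))) / 2


def pearson(y, p):
    """Pearson correlation; None if either vector is constant."""
    my, mp = sum(y) / len(y), sum(p) / len(p)
    sxy = sum((a - my) * (b - mp) for a, b in zip(y, p))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mp) ** 2 for b in p)
    return None if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def pooled(metric, pred):
    """Metric computed once over all pathway-stimulus pairs."""
    y = [v for k in truth for v in truth[k]]
    p = [v for k in truth for v in pred[k]]
    return metric(y, p)


def averaged(metric, pred):
    """Metric computed per pathway, then averaged; None if any is undefined."""
    values = [metric(truth[k], pred[k]) for k in truth]
    return None if any(v is None for v in values) else sum(values) / len(values)


for name, pred in (("Method 1", method1), ("Method 2", method2)):
    print(name, pooled(pearson, pred), pooled(bac, pred),
          averaged(pearson, pred), averaged(bac, pred))
```

Pooled, the two methods are indistinguishable. Scored per pathway, Method 1 is penalized for predicting no positives at all for pathways 1 and 2, which is the situation the equal-weight assumption was meant to address.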

SC3: A non-Machine Learning Approach to SC3

[Schematic: human gene expression (test set; genes 1..N; control replicates and Stimulus01 replicates) is analyzed with GSEAP using the human gene sets collection, giving a Q-value per pathway (e.g. Pathway 1: 0.01, Pathway 2: 0.03, Pathway 3: 0.26, …, Pathway 246: 0.99). A mapping labeled f relates the rat gene expression (test set; genes 1..N; control replicates) to the human gene expression.]

SC3: Find a data driven ortholog for each human gene among rat genes

[Schematic: for each of the 26 stimuli, genes are sorted by t-score, separately in the human training set and in the rat training set. Example, Human Stimulus01: Gene 10 (-5.2), Gene 63 (-4.0), Gene 20 (-3.0), …, Gene 98 (+4.5), Gene 67 (+6.0). Example, Rat Stimulus01: Gene 50 (-6.2), Gene 63 (-4.0), Gene 20 (-3.0), …, Gene 98 (+4.5), Gene 67 (+6.0).]

SC3: Find a Data Driven Ortholog for each Human Gene Among Rat Genes

• For each human gene h (in gene sets 1..246)
– Compute the rank distance to each rat gene (0.0 ≤ Rank ≤ 1.0)
– Choose the rat gene r that minimizes D(h,r)

Rank    Gene      Stimulus 01 t-score
0.000   Gene 50   -6.2
0.001   Gene 63   -4.0
0.002   Gene 20   -3.0
0.500   …         …
0.999   Gene 98   +4.5
1.000   Gene 67   +6.0
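The matching rule above can be sketched as follows. Within each stimulus, t-scores are converted to ranks scaled to the interval from 0 to 1 (0 for the most negative t-score), and each human gene is paired with the rat gene whose rank profile across stimuli is closest. The slide does not define D(h,r); the sum of absolute rank differences across stimuli used here is an assumption, as are the function names and the toy data.

```python
def normalized_ranks(t_scores):
    """Map each gene to its rank in [0, 1]; 0 = most negative t-score.

    Assumes at least two genes and no special handling of ties.
    """
    ordered = sorted(t_scores, key=t_scores.get)
    n = len(ordered)
    return {gene: i / (n - 1) for i, gene in enumerate(ordered)}


def data_driven_ortholog(human_gene, human_t, rat_t):
    """Return the rat gene r minimizing D(h, r) for human gene h.

    human_t, rat_t: one {gene: t-score} dict per stimulus, in the same
    stimulus order; every rat gene must be present for every stimulus.
    Assumption: D(h, r) = sum over stimuli of |rank(h) - rank(r)|.
    """
    human_ranks = [normalized_ranks(t)[human_gene] for t in human_t]
    rat_ranks = [normalized_ranks(t) for t in rat_t]

    def distance(rat_gene):
        return sum(abs(h - r[rat_gene]) for h, r in zip(human_ranks, rat_ranks))

    return min(rat_ranks[0], key=distance)


# Toy example with 2 stimuli and 3 genes per species (hypothetical values).
human_t = [{"H1": -5.2, "H2": 0.1, "H3": 4.5},
           {"H1": 4.0, "H2": -2.2, "H3": 0.3}]
rat_t = [{"R1": 3.0, "R2": -6.2, "R3": 0.0},
         {"R1": -1.0, "R2": 4.0, "R3": 0.5}]

print(data_driven_ortholog("H1", human_t, rat_t))  # R2
```

In this toy case R2 is chosen for H1 because both genes are the most down-regulated in the first stimulus and the most up-regulated in the second, so their rank profiles coincide.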

Acknowledgements

• Gustavo Stolovitzky, Raquel Norel, Erhan Bilal, Pablo Meyer, Jeremy J Rice: IBM Thomas J. Watson Research Center
• Stephanie Boue, Julia Hoeng, Florian Martin, Marja Talikka, Yang Xiang: Philip Morris International, Research & Development
• Mario Lauria: The Microsoft Research - University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
• Michael Unger, Kushal Kumar Dey, Preetam Nandy, Christoph Zechner, Heinz Koeppl: ETH Zurich, Switzerland
• IMPROVER DSC Collaborators

Acknowledgements

• The Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, NIH/DHHS
• The IMPROVER Diagnostic Signature Challenge Grant from Philip Morris

Thank you! / Questions ?
