Machine Learning and
Pathway Analysis as basic
tools in Systems Biology
Adi L. Tarca1,2
1. Department of Computer Science, Wayne State University, Detroit, MI, USA 2. Perinatology Research Branch, NICHD/NIH, Bethesda, MD, and Detroit, MI, USA 3. Center for Molecular Medicine and Genetics, Wayne State University,
• Lessons learned from the
IMPROVER Diagnostic Signature
Challenge
• Gene set/pathway analysis
• Approach of the PRB team in
Species Translation Challenge
Outline
• Automated process performed by a
machine (computer)
• to approximate (learn) the relation
between a set of predictors (Xj) and an
outcome y
• using a set of examples (X,y)i
• The model is expected to perform well
when applied to new data (generalize)
Machine Learning / Supervised
Learning
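The definition above can be illustrated with a minimal sketch: a model is learned from examples (X, y) and then judged on new data it has not seen. The nearest-centroid classifier and the simulated two-predictor data are illustrative assumptions, not the methods used in the challenges.

```python
import random

def fit_centroids(X, y):
    """Learn one centroid (mean predictor vector) per class from examples (X, y)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def simulate(n, rng):
    """Toy data (assumption): 2 predictors; class 1 is shifted by +2 in both."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = simulate(40, rng)   # examples used for learning
X_new, y_new = simulate(200, rng)      # new data, never seen during learning

model = fit_centroids(X_train, y_train)
accuracy = sum(predict(model, x) == lab for x, lab in zip(X_new, y_new)) / len(y_new)
print(f"accuracy on new data: {accuracy:.2f}")
```

The accuracy on the new samples, not on the training examples, is the estimate of how well the model generalizes.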
IMPROVER: Diagnostic Signature
Challenge
• Assess and verify computational approaches that classify clinical samples based on transcriptomics data
• Participants built models from public data to predict 5 endpoints (Psoriasis, COPD, Lung cancer, MS stages, MS diagnosis)
• Compared to a previous initiative, MAQC-II (Shi et al. 2010, Nat Biotechnol), it was more stringent
Meyer P et al, Bioinformatics 28, 2012
Model Performance Results in the Diagnostic
Signature Challenge
- 54 teams participated
- The endpoint explained 69% of the variance (p < 0.05)
- Team/approach explained 8% (NS)
[Figure: prediction quality (Z-score of BCM, CCEM and AUPR, axis ticks from -5 to 10) by endpoint / sub-challenge: Psoriasis, MSS, MSD, Lung cancer, COPD]
No team performed best in more
than one sub-challenge
Approach of the best overall team
[Diagram: modeling options at each step of the classification pipeline]
• Training sets: using unrelated training dataset; Set 1; Set 2; Set 1 & Set 2
• Preprocessing: all batches together; within batch + batch effect correction; PLIER, RMA, MAS5, GCRMA, dCHIP
• Feature selection: filter genes by moderated t-test & fold change; optimize the number of genes by cross-validated AUC
• Classification model: LDA, QDA, DLDA, neural networks, SVM, decision trees
Strategies of 2nd and 3rd Best
Overall Teams in DSC
• 2nd best overall team used:
- unsupervised clustering of test samples
- clustering based on features selected by Wilcoxon test
- cluster labels assigned using prior information about the
direction of change of a few known genes
• 3rd best overall team used:
- LASSO regularized logistic regression
- Regularization parameter optimized via LOO cross-validation
- Features filtered by Wilcoxon test
What explains the variability in
performance data and what works
best in general?
Issues in the Analysis of Model
Performance Data in the
IMPROVER DSC
• Method descriptions were not detailed enough,
resulting in missing data
• There were too many different methods for each
modeling factor (e.g. over 15 types of classifiers)
• The training data was different between teams for
the same endpoint
A Post Challenge Survey
• Had the team used cross-validation to tune any of
the parameters in their classification pipeline?
• Teams that had used cross-validation had better
performance (p<0.05): 1.2 Z-score units for BCM
and 1.9 for AUPR
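As a minimal sketch of what tuning a pipeline parameter by cross-validation can look like, the code below picks the number of genes for a simple classifier. The nearest-centroid classifier, the mean-difference gene ranking, the simulated data and the candidate gene counts are all illustrative assumptions, not the teams' pipelines. Gene ranking is repeated inside every fold so that the held-out samples never influence feature selection.

```python
import random

def mean(v):
    return sum(v) / len(v)

def rank_genes(X, y):
    """Rank genes by absolute difference in class means (a simple stand-in filter)."""
    scores = []
    for j in range(len(X[0])):
        a = [x[j] for x, lab in zip(X, y) if lab == 1]
        b = [x[j] for x, lab in zip(X, y) if lab == 0]
        scores.append((abs(mean(a) - mean(b)), j))
    return [j for _, j in sorted(scores, reverse=True)]

def nearest_centroid_predict(X_tr, y_tr, X_te, genes):
    """Classify test samples by the closer class centroid, using the selected genes only."""
    cents = {}
    for lab in (0, 1):
        rows = [x for x, l in zip(X_tr, y_tr) if l == lab]
        cents[lab] = [mean([r[j] for r in rows]) for j in genes]
    preds = []
    for x in X_te:
        d = {lab: sum((x[j] - c) ** 2 for j, c in zip(genes, cents[lab])) for lab in (0, 1)}
        preds.append(min(d, key=d.get))
    return preds

def cv_accuracy(X, y, n_genes, k=5):
    """k-fold cross-validated accuracy for a given number of top-ranked genes."""
    idx = list(range(len(y)))
    correct = 0
    for fold in range(k):
        test = idx[fold::k]
        train = [i for i in idx if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        genes = rank_genes(X_tr, y_tr)[:n_genes]   # selection uses training folds only
        preds = nearest_centroid_predict(X_tr, y_tr, [X[i] for i in test], genes)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(y)

# Toy data (assumption): 40 samples, 30 genes, only the first 3 differ between classes.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(1.5 * lab if j < 3 else 0.0, 1.0) for j in range(30)] for lab in y]

candidates = [1, 3, 10, 30]
cv_scores = {m: cv_accuracy(X, y, m) for m in candidates}
best_m = max(cv_scores, key=cv_scores.get)
print(cv_scores, "-> chosen number of genes:", best_m)
```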
A Post Challenge Computational
Experiment
- Fix the training datasets and everything else
- Vary the preprocessing (RMA, GCRMA, MAS5),
feature selection (t-test, moderated t-test,
Wilcoxon test) and classifier (LDA, kNN, SVM)
Combination of the best overall team
[Figure: performance measured as (BCM+CCEM+AUPR)/3]
Most Important Modeling Factor is
Problem and Metric Dependent
Ideally, the exact prediction assessment procedure should be known in advance!
Data Preprocessing: Together is Better
than Separate
- 24 data points (2 preprocessing methods x 3 feature selections x 4 endpoints)
- BCM and AUPR improved on average by 6% and 4%, respectively (Wilcoxon p-value < 0.05).
• Implements the approach of the PRB team, available from
Bioconductor
• Starts with raw (Affymetrix) gene expression data files and
one annotation data frame assigning files to groups
(disease, control, test)
• Tries 27 combinations of data preprocessing, feature
selection and classifiers to guide model selection
• Uses N-fold cross-validation to determine the optimal
number of features for each combination of methods
• Provides predictions for the test samples and a fitted
model
maPredictDSC R Package
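The search over combinations can be sketched as a plain grid. This assumes the 27 combinations are the 3 x 3 x 3 grid of preprocessing, feature selection and classifier options named for the post-challenge computational experiment; the scoring function is a hypothetical placeholder supplied by the caller, not part of the maPredictDSC API.

```python
from itertools import product

# Assumed option lists (taken from the post-challenge experiment description).
preprocessing = ["RMA", "GCRMA", "MAS5"]
feature_selection = ["t-test", "moderated t-test", "Wilcoxon test"]
classifiers = ["LDA", "kNN", "SVM"]

combinations = list(product(preprocessing, feature_selection, classifiers))
print(len(combinations))  # 3 * 3 * 3 combinations

def select_best(combos, cv_score):
    """Return the combination with the highest cross-validated score.

    cv_score is a hypothetical user-supplied function mapping a
    (preprocessing, feature selection, classifier) tuple to a number.
    """
    return max(combos, key=cv_score)
```

Each combination would be scored on the training data only (for example by cross-validation), and the winner refitted before predicting the test samples.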
Conclusions IMPROVER DSC
• When gene expression differences are weak, no classification
pipeline will work
• The No Free Lunch (NFL) theorem was proven right, again: there is
no universally best approach to class prediction
• Using one’s favorite methods can work well on average, yet the
methods need to be used properly to avoid under- and over-fitting
• Finding the best model for a given problem requires trying many
combinations of methods; the maPredictDSC package can help
• The importance of each step in the process (preprocessing, feature
selection, classifier choice) is problem and metric dependent, so
no shortcut can be confidently suggested
Gene Set and Pathway Analysis
• A successful class-comparison experiment may
result in hundreds or thousands of differentially
expressed (DE) genes
• A widely used approach to interpret such results
includes the following 2 steps:
1) Staring at long lists of genes
and
2) Focusing on genes that we already know
Gene Set Analysis: Motivation
Gene Sets
• Examples of gene sets:
– Gene Ontology Terms
– Signaling and metabolic pathways (e.g. KEGG, BioCarta,
Reactome)
– Motif gene sets, etc. (GSEAbase)
• Methods to test association between a predefined
set of variables (e.g. genes, proteins, etc.) and an
outcome of interest
• One of the few options to extract meaning from
hundreds or more DE genes in a given condition
• Can establish a link between a gene set and the
outcome even when there are no DE genes by usual
thresholds (e.g. *)
Gene Set Analysis
* Mootha et al., PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately down-regulated in human diabetes. Nat Genet, 2003.
• Select a subset of genes as differentially expressed
(DE)
• Test if a given gene set has more DE genes than
expected by chance:
– Hypergeometric (Tavazoie, Nat. Genet. 1999,
Draghici et al., Genomics, 2003)
– Binomial (Cho et al. Nat. Genet. 2001)
– Chi-square (χ²)
• Results change as a function of the stringency of gene
selection
• Ignore the correlation between genes
Over-Representation Analysis
(ORA)
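A minimal sketch of the hypergeometric version of ORA: given a universe of genes, a list of DE genes and a gene set, it computes the probability of seeing at least the observed number of DE genes in the set by chance. The function name and the example counts are illustrative assumptions.

```python
from math import comb

def ora_pvalue(n_universe, n_de, set_size, de_in_set):
    """One-sided hypergeometric p-value: P(X >= de_in_set).

    n_universe : number of genes measured
    n_de       : number of genes called differentially expressed
    set_size   : number of genes in the gene set
    de_in_set  : number of DE genes that fall in the gene set
    """
    upper = min(set_size, n_de)
    total = comb(n_universe, set_size)
    tail = sum(comb(n_de, k) * comb(n_universe - n_de, set_size - k)
               for k in range(de_in_set, upper + 1))
    return tail / total

# Hypothetical counts: 10,000 genes, 500 DE, a 50-gene set with 8 DE genes
# (2.5 would be expected by chance).
p = ora_pvalue(10_000, 500, 50, 8)
print(f"ORA p-value: {p:.4g}")
```

Changing the DE cutoff changes n_de and de_in_set, which is why ORA results depend on the stringency of gene selection.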
Combining pathway information with
over-representation evidence
The impact analysis method, (Pathway Express)
Draghici S. , Khatri P., Tarca A.L. et al. Genome Res, 2007
A Novel Signaling Pathway Impact Analysis (SPIA),
Tarca A.L., Draghici S., Khatri P. et al, Bioinformatics, 2009
• Use evidence from all genes in the gene set (not only DE)
• Provide a “unique” solution to a given problem
• Account for gene-gene correlations
• Typically are slower because they involve sample
permutation
• Popular methods:
• Gene Set Enrichment Analysis (GSEA) (Mootha et al., Nat.
Genet., 2003; Subramanian et al., PNAS, 2005)
• Gene Set Analysis (GSA): Efron B & Tibshirani R, Annals of
Applied Statistics, 2006.
• Give equal weight to all genes in a gene set
Functional class scoring methods
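The general recipe of these methods can be sketched as follows: score every gene in the set, summarize the scores, and compare the summary with its distribution under sample (phenotype) permutation. The mean absolute Welch t-score used here is a deliberately simple stand-in and is not the GSEA or GSA statistic; the data are simulated.

```python
import random
from statistics import mean, variance

def t_score(a, b):
    """Welch t-score for one gene between two groups of samples."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

def set_score(expr, labels, gene_set):
    """Mean absolute t-score over all genes in the set (not only DE genes)."""
    scores = []
    for g in gene_set:
        a = [v for v, lab in zip(expr[g], labels) if lab == 1]
        b = [v for v, lab in zip(expr[g], labels) if lab == 0]
        scores.append(abs(t_score(a, b)))
    return mean(scores)

def permutation_pvalue(expr, labels, gene_set, n_perm=500, seed=0):
    """Permute sample labels; samples stay intact, so gene-gene correlation is preserved."""
    rng = random.Random(seed)
    observed = set_score(expr, labels, gene_set)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if set_score(expr, perm, gene_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data (assumption): 12 samples, 20 genes; genes g0..g4 are shifted in group 1.
rng = random.Random(42)
labels = [1] * 6 + [0] * 6
expr = {f"g{i}": [rng.gauss(2.0 if (lab == 1 and i < 5) else 0.0, 1.0) for lab in labels]
        for i in range(20)}
shifted_set = [f"g{i}" for i in range(5)]
p = permutation_pvalue(expr, labels, shifted_set)
print(f"permutation p-value: {p:.3f}")
```

The loop over permutations is what makes this family of methods slower than ORA.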
Pathway Analysis with Down-weighting
of Overlapping Genes (PADOG)
Extends the Gene Set Analysis (GSA):
1. Weights gene score as a function of how often the
gene appears across the gene sets
2. Uses moderated t-scores instead of ordinary t-
scores for each gene
3. Computes the mean of absolute gene scores
instead of the maxmean statistic
4. Computes significance by comparing observed
pathway scores to the empirical null distribution
obtained from phenotype permutations
Tarca AL., Draghici S., Bhatti G., Romero R., BMC Bioinformatics, 2012, 13:136.
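Steps 1 and 3 can be sketched as below. The weight formula, 1 + sqrt((max f - f(g)) / (max f - min f)), is an assumption about the functional form, and the gene sets and t-scores are made-up placeholders; the sketch omits the moderated t-test, any standardization of scores, and the permutation step.

```python
from collections import Counter

def gene_weights(gene_sets):
    """Weight in [1, 2]: genes appearing in few sets get weights near 2,
    genes appearing in many sets get weights near 1 (assumed functional form)."""
    freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    f_max, f_min = max(freq.values()), min(freq.values())
    if f_max == f_min:
        return {g: 1.0 for g in freq}  # all genes equally frequent: no down-weighting
    return {g: 1.0 + ((f_max - f) / (f_max - f_min)) ** 0.5 for g, f in freq.items()}

def weighted_set_score(gene_set, t_scores, weights):
    """Mean of absolute gene scores, each multiplied by its gene weight."""
    return sum(abs(t_scores[g]) * weights[g] for g in gene_set) / len(gene_set)

# Hypothetical gene sets and t-scores for illustration only.
gene_sets = {"A": ["g1", "g2", "g3"], "B": ["g1", "g4"], "C": ["g1", "g2", "g5"]}
t_scores = {"g1": 3.0, "g2": -2.0, "g3": 1.0, "g4": 0.5, "g5": -0.2}

weights = gene_weights(gene_sets)
score_a = weighted_set_score(gene_sets["A"], t_scores, weights)
print(weights, score_a)
```

Here g1 belongs to all three sets and keeps weight 1, while g3 belongs only to set A and gets weight 2, so set-specific genes drive the set score.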
[Figure: gene frequency across 143 KEGG non-metabolic pathways]
PADOG
• Most gene set analysis methods are compared against
a few competitor methods using a few real datasets or
simulated data
• Performance is typically measured relying on literature
citations for the relevance of a given gene set to a
condition
• It is relatively easy to find 2-3 datasets on which one’s
method works better than a competitor method
• It is hard to figure out which method works best in
general, if any
A.L. Tarca, G. Bhatti, R. Romero, PLoS ONE, 2013.
Assessing Gene Set Analysis
Results
Assessing Gene Set Analysis
Results
Pathway Name                            Method 1 P-value   Method 2 P-value   Method 1 Rank   Method 2 Rank
Bile secretion                          0.023              0.012              1               2
Fatty acid elongation in mitochondria   0.029              0.018              2               5
Colorectal cancer                       0.14               0.015              3               4
Ribosome biogenesis in eukaryotes       0.2                0.03               4               7
Cell cycle                              0.21               0.014              5               3
Cyanoamino acid metabolism              0.32               0.16               6               8
Purine metabolism                       0.58               0.022              7               6
Small cell lung cancer                  0.68               0.01               8               1
Fatty acid metabolism                   0.8                0.3                9               9
Analysis of a colorectal cancer dataset. The Colorectal cancer pathway is expected to be relevant to this dataset (target pathway).
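Using the p-values from the table above, the two quantities extracted for the target pathway (its p-value and its rank) can be computed as in this sketch. Expressing the rank as a percentage of the number of pathways tested is an assumption about how ranks are made comparable across analyses.

```python
method1 = {
    "Bile secretion": 0.023, "Fatty acid elongation in mitochondria": 0.029,
    "Colorectal cancer": 0.14, "Ribosome biogenesis in eukaryotes": 0.2,
    "Cell cycle": 0.21, "Cyanoamino acid metabolism": 0.32,
    "Purine metabolism": 0.58, "Small cell lung cancer": 0.68,
    "Fatty acid metabolism": 0.8,
}
method2 = {
    "Bile secretion": 0.012, "Fatty acid elongation in mitochondria": 0.018,
    "Colorectal cancer": 0.015, "Ribosome biogenesis in eukaryotes": 0.03,
    "Cell cycle": 0.014, "Cyanoamino acid metabolism": 0.16,
    "Purine metabolism": 0.022, "Small cell lung cancer": 0.01,
    "Fatty acid metabolism": 0.3,
}

def target_rank(pvalues, target):
    """Rank of the target pathway when pathways are sorted by p-value (1 = smallest)."""
    ordered = sorted(pvalues, key=pvalues.get)
    return ordered.index(target) + 1

def target_rank_percent(pvalues, target):
    """Rank of the target pathway as a percentage of all pathways tested."""
    return 100.0 * target_rank(pvalues, target) / len(pvalues)

target = "Colorectal cancer"
for name, pvals in (("Method 1", method1), ("Method 2", method2)):
    print(name, pvals[target], target_rank(pvals, target),
          round(target_rank_percent(pvals, target), 1))
```

In this example Method 1 ranks the target pathway higher (3 versus 4) even though Method 2 gives it a much smaller p-value, so the two criteria can disagree.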
42 Microarray datasets
Methods: ORA, GSEA, GSEAP, GLOBALTEST, SAFE, SIGPATHWAY Q1, SIGPATHWAY Q2, PLAGE, GSA, ZSCORE, MRGSE, GAGE, SSGSEA, PADOG, CAMERA, GSVA
KEGG Disease Pathways (number of datasets in parentheses):
• Acute myeloid leukemia (3)
• Chronic myeloid leukemia (2)
• Colorectal cancer (5)
• Dilated cardiomyopathy (2)
• Endometrial cancer (1)
• Glioma (2)
• Huntington's disease (1)
• Prostate cancer (2)
• Renal cell carcinoma (2)
• Alzheimer's Disease (5)
• Non Small Cell Lung Cancer (2)
• Pancreatic cancer (3)
• Parkinson's disease (3)
• Thyroid cancer (2)
Metacore Disease Biomarker Networks (number of datasets in parentheses):
• Diabetes Mellitus Type2 (1)
• Lupus Erythematosus Systemic (1)
• Pulmonary Disease Chronic Obstructive (2)
• Pancreatic Neoplasms (1)
• Ovarian Neoplasms 1 (2)
Performance
• The methods are compared in their ability to:
– produce low p-values for the target pathway
(sensitivity)
– rank the target pathway close to the top
(prioritization)
– produce no more than expected false
positives when phenotypes are permuted
(specificity)
A.L. Tarca, G. Bhatti, R. Romero, PLoS ONE, 2013, in press.
Assessing Gene Set Analysis
Results
- Ranking methods by the median p-value of pathways expected to be relevant, i.e. a surrogate for sensitivity
- Using sensitivity TP/(TP+FN) is similar but leads to ties in the ranking
- Ranking methods by the median rank of pathways expected to be relevant, i.e. a surrogate for prioritization
• Permute the phenotype of each of the 42 datasets
• Repeat 50 times
• Count how many pathways have p < α
Method         Category   Sensitivity        Prioritization     Specificity    Overall rank
                          (median p-value)   (median rank, %)   (FP at α=1%)   in category
PLAGE          I          0.0022             25.0               1.1%           1
GLOBALTEST     I          0.0001             27.9               2.0%           2
PADOG          I          0.0960             9.7                2.5%           3
ORA            I          0.0732             18.3               2.5%           4
SAFE           I          0.1065             18.8               1.3%           5
SIGPATHWAYQ2   I          0.0565             38.0               0.9%           6
GSA            I          0.1420             21.0               1.3%           7
SSGSEA         I          0.0808             40.3               1.0%           8
ZSCORE         I          0.0950             39.8               1.0%           9
GSEA           I          0.1801             33.1               2.3%           10
GSVA           I          0.1986             51.5               1.1%           11
CAMERA         I          0.3126             43.0               0.5%           12
MRGSE          II         0.0100             18.8               4.9%           1
GSEAP          II         0.0644             36.2               15.8%          2
GAGE           II         0.0024             35.9               37.9%          3
SIGPATHWAYQ1   II         0.1165             49.7               17.2%          4
A Ranking of Gene Set
Analysis Methods
Ranking Stability
Method        Rank in    Sample size       Gene set size     Effect
              category   Small    Large    Small    Large    Small      Large
                         n<22     n≥22     N<66     N≥66     g<24.6%    g≥24.6%
PLAGE         1          1        4        2        3        3          3
GLOBALTEST    2          2        1        3        5        2          4
PADOG         3          3        2        1        2        1          1
ORA           4          4        3        5        1        5          2
SAFE          5          7        5        4        8        4          6
SIGPATH.Q2    6          5        8        8        4        8          8
GSA           7          9        6        7        6        6          11
SSGSEA        8          8        7        6        12       9          5
ZSCORE        9          6        10       10       7        7          9
GSEA          10         10       9        9        11       10         7
GSVA          11         11       11       11       9        11         10
CAMERA        12         12       12       12       10       12         12
MRGSE         1          1        1        1        2        1          1
GSEAP         2          2        2        2        1        2          2
GAGE          3          3        3        3        4        3          3
SIGPATH.Q1    4          4        4        4        3        4          4
• Ranking based on all 42 datasets significantly correlated with all rankings based on half of the datasets (all p<0.0001)
Conclusions Gene Set Analysis
• Gene set analysis methods are useful to reduce
complexity of high-throughput experiments
• Best methods for gene set prioritization are different
from best ones in terms of sensitivity
• PLAGE, GLOBALTEST, PADOG are best overall in
category I, and MRGSE best in category II.
• Disease pathways (KEGG, Metacore) can be used as
positive controls in gene set analysis in conjunction
with a large number of datasets studying those diseases
• Our approach to gene set analysis assessment is less
subjective than relying on literature citations
Approach of team PRB (49) in the
Species Translation Challenge
Species Translation Challenge
• Sub-challenges SC1-3 were approachable via a machine
learning framework:
• discriminating positive responses (proteins
phosphorylated or pathways activated) against
negatives
• Compared to IMPROVER DSC, the STC had these
particularities:
– More datasets (SC1:16, SC2:16, SC3: 246), so more challenging
but potentially better opportunity to assess performance
– Did not involve data pre-processing (batch effect removal?)
– Teams used the same data
– Datasets were highly imbalanced (many responses with 0,1,2,
etc. positives, and the remaining up to 26 negatives)
• Used the average gene expression in DME samples
within batch as normalizer (to remove batch effects)
• Predictions set to 0 (negative) for phosphoproteins that
were positive in less than 2 stimuli in the training set
• For all other phosphoproteins, fit an LDA model on the
training data, using a procedure* similar to the one we used in
the IMPROVER DSC
SC1: Intra-Species Protein
Phosphorylation Prediction
(*) Tarca AL, Than NG, Romero R., Systems Biomedicine; 1(4) , 2013.
• Genes are ranked by moderated t-test p-value & fold change (cutoff
optimized between 1.25 – 4);
• Number of genes to include in the LDA model was determined by
maximizing AUC+CCEM+BCM estimated by repeated cross-validation
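Two of the SC1 steps described above, the within-batch normalization against DME samples and the rule for phosphoproteins with too few positives, can be sketched as follows. Subtracting the DME mean (which presumes log-scale expression), the record layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def normalize_within_batch(samples, control="DME"):
    """Subtract, gene by gene, the mean of the control samples of the same batch.

    samples: list of dicts with keys 'batch', 'stimulus' and 'expr'
    (a list of log-expression values, one per gene). Every batch is
    assumed to contain at least one control sample.
    """
    by_batch = defaultdict(list)
    for s in samples:
        if s["stimulus"] == control:
            by_batch[s["batch"]].append(s["expr"])
    baseline = {b: [sum(col) / len(col) for col in zip(*rows)]
                for b, rows in by_batch.items()}
    return [{**s, "expr": [v - m for v, m in zip(s["expr"], baseline[s["batch"]])]}
            for s in samples]

def too_few_positives(status, min_positives=2):
    """True when a phosphoprotein is positive in fewer than min_positives training
    stimuli; such proteins get an all-negative prediction instead of a fitted model."""
    return sum(status) < min_positives

# Hypothetical values for illustration only.
samples = [
    {"batch": 1, "stimulus": "DME", "expr": [5.0, 7.0]},
    {"batch": 1, "stimulus": "DME", "expr": [7.0, 9.0]},
    {"batch": 1, "stimulus": "EGF", "expr": [8.0, 8.0]},
    {"batch": 2, "stimulus": "DME", "expr": [1.0, 1.0]},
    {"batch": 2, "stimulus": "PMA", "expr": [2.0, 4.0]},
]
normalized = normalize_within_batch(samples)
print([s["expr"] for s in normalized if s["stimulus"] != "DME"])
```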
[Diagram: rat gene expression (DME normalized; stimuli 1..26 x genes 1..N) is used to predict rat protein phosphorylation (0/1 per stimulus) for each protein]
[Figure, shown in three build steps: MK03 (Mitogen-activated protein kinase 3) phosphorylation in rat. Heatmap of genes FRMD5, WISP1, SMTN, DIO3 and GFRA1, and a PCA plot (PC1, PC2, PC3) of the training data, Negative (19) vs Positive (7), with the stimuli SEROTONIN, PROMETHAZINE, PMA, TGFA, MEPYRAMINE, PDGFB and EGF labelled; test data (e.g. stim01, stim09) are then added and classified as Negative or Positive]
SC1: Intra-Species Protein
Phosphorylation Prediction
• Same approach as in SC1 with these differences:
– Predictors are not genes but phosphoproteins (1..16) at 5 and 25
mins (32 features)
– We used the actual phosphorylation level (continuous) for model
fitting
SC2: Inter-Species Protein
Phosphorylation Prediction
[Diagram: rat protein phosphorylation (Prot. 1..16 at 5 min and at 25 min; stimuli 1..26) is used to predict human protein phosphorylation (0/1 per stimulus) for each protein]
• Should have used gene expression + protein
phosphorylation data?
SC2: Inter-Species Protein
Phosphorylation Prediction
• Predictors were NES scores (continuous)
• If a pathway was perturbed in fewer than 4 of the 26
training stimuli in human (91% of the 246
pathways), it was predicted as non-perturbed
in the test set
SC3: Inter-Species Pathway Perturbation
Prediction – First submission
[Diagram: rat pathway NES (Pathway 1..246; stimuli 1..26) is used to predict human pathway perturbation (0/1 per stimulus) for each pathway]
• Result of the first submission (before the first deadline)
SC3: Inter-Species Pathway
Perturbation Prediction
Non-official 0.20 0.57 0.53
3
• Result of the final submission
SC3: Inter-Species Pathway
Perturbation Prediction
SC3: Why did we need a new and riskier
approach in SC3?
• A new approach would make editors more
interested in the resulting paper
• Machine learning cannot work if a pathway is
activated in 2 or fewer stimuli in human (79% of
the 246 pathways)
• We assumed that the quality of predictions for
each pathway would have equal weight in the
scoring
Stimuli  Pathway  True Status  Method 1  Method 2    Stimuli  Pathway  True Status  Method 1  Method 2
1        1        1            0         1           1        3        1            1         1
2        1        0            0         0           2        3        1            1         1
3        1        0            0         0           3        3        1            1         1
4        1        0            0         0           4        3        1            1         0
5        1        0            0         0           5        3        1            0         0
6        1        0            0         0           6        3        0            0         0
7        1        0            0         0           7        3        0            0         0
8        1        0            0         0           8        3        0            0         0
9        1        0            0         0           9        3        0            0         0
10       1        0            0         0           10       3        0            0         0
1        2        1            0         1           1        4        1            1         1
2        2        1            0         0           2        4        1            1         1
3        2        0            0         0           3        4        1            1         1
4        2        0            0         0           4        4        1            1         0
5        2        0            0         0           5        4        1            0         0
6        2        0            0         0           6        4        0            0         0
7        2        0            0         0           7        4        0            0         0
8        2        0            0         0           8        4        0            0         0
9        2        0            0         0           9        4        0            0         0
10       2        0            0         0           10       4        0            0         0

Pooled performance
           Pearson  BAC
Method 1   0.72     0.807
Method 2   0.72     0.807

Separate then averaged
           Pearson  BAC
Method 1   NA       0.7
Method 2   0.74     0.84
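The difference between the two scoring schemes can be checked directly on the example table. This sketch recomputes only the balanced accuracy (BAC), taken here as the mean of sensitivity and specificity, first on all predictions pooled and then per pathway followed by averaging.

```python
def balanced_accuracy(truth, pred):
    """Mean of sensitivity and specificity for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    pos = sum(truth)
    neg = len(truth) - pos
    return 0.5 * (tp / pos + tn / neg)

# True status and predictions per pathway (10 stimuli each), copied from the table.
truth = {1: [1] + [0] * 9, 2: [1, 1] + [0] * 8, 3: [1] * 5 + [0] * 5, 4: [1] * 5 + [0] * 5}
method1 = {1: [0] * 10, 2: [0] * 10, 3: [1] * 4 + [0] * 6, 4: [1] * 4 + [0] * 6}
method2 = {1: [1] + [0] * 9, 2: [1] + [0] * 9, 3: [1] * 3 + [0] * 7, 4: [1] * 3 + [0] * 7}

def pooled_bac(truth, pred):
    """BAC computed once over the predictions of all pathways together."""
    t = [v for k in sorted(truth) for v in truth[k]]
    p = [v for k in sorted(truth) for v in pred[k]]
    return balanced_accuracy(t, p)

def averaged_bac(truth, pred):
    """BAC computed separately for each pathway, then averaged."""
    return sum(balanced_accuracy(truth[k], pred[k]) for k in truth) / len(truth)

for name, pred in (("Method 1", method1), ("Method 2", method2)):
    print(name, round(pooled_bac(truth, pred), 4), round(averaged_bac(truth, pred), 4))
```

Pooling gives both methods the same BAC, matching the table to within rounding, while per-pathway averaging rewards Method 2 for getting the rarely perturbed pathways 1 and 2 partly right.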
SC3: A non-Machine Learning
Approach to SC3
[Diagram, shown in two build steps: rat gene expression (test set; control and stimulus replicates x genes 1..N) is mapped by a function f to human gene expression (test set); GSEAP analysis with the human gene sets collection then gives a Q-value for each pathway (e.g. Pathway 1: 0.01, Pathway 2: 0.03, Pathway 3: 0.26, …, Pathway 246: 0.99)]
SC3: Find a data driven ortholog for
each human gene among rat genes
[Diagram: in the human training set and in the rat training set, genes are sorted by t-score separately for each of the 26 stimuli (e.g. Human Stimulus01: Gene 10 -5.2, Gene 63 -4.0, Gene 20 -3.0, …, Gene 98 +4.5, Gene 67 +6.0; Rat Stimulus01: Gene 50 -6.2, Gene 63 -4.0, Gene 20 -3.0, …, Gene 98 +4.5, Gene 67 +6.0)]
SC3: Find a Data Driven Ortholog for
each Human Gene Among Rat Genes
• For each human gene h (in gene sets 1..246)
– Compute the rank distance to each rat gene
(0.0 ≤ Rank ≤ 1.0)
– Choose the rat gene r that minimizes D(h,r)
Rank    Gene      Stimulus 01 t-score
0.000   Gene 50   -6.2
0.001   Gene 63   -4.0
0.002   Gene 20   -3.0
0.500   …         …
0.999   Gene 98   +4.5
1.000   Gene 67   +6.0
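The matching procedure above can be sketched as follows. The slides do not spell out D(h,r), so this sketch assumes it is the mean absolute difference between the normalized ranks of the human gene and the rat gene across stimuli; the gene names and t-scores are made up for illustration.

```python
def normalized_ranks(t_scores):
    """Map each gene to its rank scaled to [0, 1] (0 = most negative t-score)."""
    ordered = sorted(t_scores, key=t_scores.get)
    n = len(ordered)
    return {g: (i / (n - 1) if n > 1 else 0.0) for i, g in enumerate(ordered)}

def data_driven_ortholog(human_gene, human_ranks, rat_ranks):
    """Pick the rat gene whose rank profile across stimuli is closest to the human gene's.

    human_ranks / rat_ranks: one {gene: rank} dict per stimulus, in the same
    stimulus order. D(h, r) is assumed to be the mean absolute rank difference.
    """
    def distance(r):
        return sum(abs(h[human_gene] - q[r])
                   for h, q in zip(human_ranks, rat_ranks)) / len(human_ranks)
    return min(rat_ranks[0], key=distance)

# Hypothetical t-scores for 2 stimuli, 3 human genes and 3 rat genes.
human_t = [{"H1": -5.0, "H2": 0.0, "H3": 4.0}, {"H1": -2.0, "H2": 3.0, "H3": 1.0}]
rat_t = [{"R1": 3.0, "R2": -4.0, "R3": 0.5}, {"R1": 0.2, "R2": -1.0, "R3": 2.0}]

human_ranks = [normalized_ranks(t) for t in human_t]
rat_ranks = [normalized_ranks(t) for t in rat_t]
orthologs = {h: data_driven_ortholog(h, human_ranks, rat_ranks) for h in human_t[0]}
print(orthologs)
```

With all 26 training stimuli, each gene has a 26-value rank profile, so the match reflects similar response patterns rather than sequence homology.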
Acknowledgements
• Gustavo Stolovitzky, Raquel Norel, Erhan Bilal, Pablo Meyer, Jeremy
J Rice, IBM Thomas J. Watson Research Center
• Stephanie Boue, Julia Hoeng, Florian Martin, Marja Talikka, Yang
Xiang: Philip Morris International, Research & Development
• Mario Lauria: The Microsoft Research - University of Trento Centre
for Computational and Systems Biology, Rovereto, Italy
• Michael Unger, Kushal Kumar Dey, Preetam Nandy, Christoph
Zechner, Heinz Koeppl: ETH Zurich, Switzerland
• IMPROVER DSC Collaborators
Acknowledgements
• The Intramural Research Program of the Eunice Kennedy Shriver
National Institute of Child Health and Human Development,
NIH/DHHS
• The IMPROVER Diagnostic Signature Challenge Grant from Philip
Morris
Thank you! / Questions ?