TRANSCRIPT
Applications of Bayesian Model Averaging in Personalized
Medicine and Systems Biology
Ka Yee Yeung [email protected]
Institute of Technology, University of Washington Tacoma 8/12/2015
1
Road map
• Introduction to big biological data
• Framework of supervised machine learning methods and applications
• Bayesian Model Averaging (BMA): framework and intuition
• Application 1: gene signature discovery
• Application 2: gene networks
2
3
“High-Throughput BioTech”
Sensors: DNA sequencing, microarrays/gene expression, mass spectrometry/proteomics, protein/protein & DNA/protein interaction
Controls: cloning, gene knock out/knock in, RNAi
Floods of data
“Grand Challenge” problems (Courtesy: Larry Ruzzo)
Big Data in Biology (reference: Marx 2013)
4 http://www.nature.com/nature/journal/v498/n7453/full/498255a.html
Biology as a data-rich science
• High-throughput technologies can measure the activity levels of many biological entities at once.
• For example, sequencing and microarray technologies can measure the expression (RNA) levels of all genes at the same time.
• My research focuses on the development of machine learning methods for these high-dimensional data.
5
6
Computational biology: an iterative approach
Experiments Data handling
Mathematical modeling
High-throughput assays
Integration of multiple forms of experiments and knowledge
• An initiative launched by NIH in 2012.
• Addresses challenges in using biomedical big data:
– Locating data and software tools.
– Getting access to the data and software tools.
– Standardizing data and metadata.
– Extending policies and practices for data and software sharing.
– Organizing, managing, and processing biomedical Big Data.
– Developing new methods for analyzing & integrating biomedical data.
– Training researchers who can use biomedical Big Data effectively.
7 http://bd2k.nih.gov/about_bd2k.html#sthash.xs2j0lpi.dpbs
8 Figure from: http://www.pfizer.ie/personalized_med.cfm
Application 1: Personalized (or precision) Medicine
9
Initiative on Precision Medicine
President Obama, State of the Union Address, Jan 20, 2015
• https://www.whitehouse.gov/blog/2015/01/30/precision-medicine-initiative-data-driven-treatments-unique-your-own-body
10
11
12
Personalized Medicine
Goal: tailor treatment based on genetic information from individual patients, using machine learning methods. Application: predicting clinical outcomes in cancer patients.
13
Classification (supervised learning)
Training data: variables (genes) × patient samples (E1–E4), with class labels per sample.
Gene 1: -2 +2 +2 -1
Gene 2: +8 +3 0 +4
Gene 3: -4 +5 +4 -2
Gene 4: -1 +4 +3 -1
Labels: 0 1 1 0
New sample E’: -1 +5 -3 -1
Goal: predict the label of the new sample, using feature selection on the training data.
Objectives of classification
• Class prediction:
– Predict patients with a given clinical outcome (y, class label, response)
• Feature selection:
– Identification of a minimal set of relevant genes for future prediction
– Identification of “biologically” interesting genes
14
Steps in classification
15
Training set (labeled samples + data) → classification algorithm, feature selection algorithm → classifier + set of relevant variables
Test set (unlabeled data) → predicted labels (classes) for samples in the test set
Classification methods http://cran.r-project.org/web/views/MachineLearning.html
16
Method | Example R package
Logistic regression | glm
K nearest neighbor | class
Support vector machine | e1071
Decision trees | rpart
LASSO | glmnet
Ensemble methods | randomForest, BMA
Cross Validation (CV)
• An easy and useful method to estimate the prediction error.
• Can also be used to optimize the classifiers and predictive models.
• Method (m-fold cross-validation):
– Split the data into m approximately equally sized subsets.
– Train the classifier on (m−1) subsets.
– Test the classifier on the remaining subset.
– Estimate the prediction error by comparing the predicted class labels with the true class labels.
– Repeat.
• Examples:
– 10-fold CV [Ambroise et al. PNAS 2002] http://www.pnas.org/content/99/10/6562.abstract
– Leave-one-out cross-validation (LOOCV)
17
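As a rough illustration, the m-fold procedure above can be sketched in Python. This is a minimal sketch: the toy data and the fixed-threshold “classifier” are invented placeholders for any of the real methods on the next slide.

```python
import random

def cross_validate(X, y, train_fn, predict_fn, m=10, seed=0):
    """Estimate the prediction error by m-fold cross-validation."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::m] for i in range(m)]      # m roughly equal subsets
    errors = 0
    for k in range(m):
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        model = train_fn([X[i] for i in train], [y[i] for i in train])
        preds = predict_fn(model, [X[i] for i in folds[k]])
        errors += sum(p != y[i] for p, i in zip(preds, folds[k]))
    return errors / len(y)                     # estimated error rate

# toy, linearly separable data; a constant-threshold "classifier" stands in
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
train_fn = lambda Xs, ys: 0.5                  # "training" returns a threshold
predict_fn = lambda thr, Xs: [int(x[0] > thr) for x in Xs]
err = cross_validate(X, y, train_fn, predict_fn, m=4)
```

Because the toy data are perfectly separable at the threshold, the estimated error here is zero; with real classifiers the same loop yields the CV error estimate.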
18
Top 50 genes for ALL vs. AML
Science. 1999 Oct 15;286(5439):531-7. Cited 10,460 times!
19
Choosing relevant genes
• Use training set only
• Ideal:
– different typical expression patterns in the two classes
– little variation within each class
Feature selection (or variable selection)
• T-test • Correlation • Many others
Can be formulated as a model selection problem. In this context, a model is a set of relevant features (variables, genes).
20
Model Selection
• Exhaustive search
• Stepwise selection
– Forward selection
– Backward elimination
• Multivariate model selection
– Bayesian Model Averaging
– Regularized methods, e.g. LASSO
21
High dimensionality challenge: # variables >>> # observations
22
Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et al. 1999], [Yeung et al. 2005]
• A multivariate variable selection technique:
– Takes advantage of the dependencies between genes to reduce the total number of predictive genes.
• Most gene selection methods consider genes individually and select a single set of predictive genes at a time.
• Advantages of BMA:
– Fewer selected genes
– Probabilities for predictions, selected genes and selected models
• BMA averages over predictions from several models:

prediction = Σ_k (prediction using model k) × Pr(model k)
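In code, the averaging step is just a probability-weighted sum over model predictions. A minimal sketch; the three toy models and their posterior weights are invented for illustration:

```python
def bma_predict(models, x):
    """BMA prediction: average the model predictions, weighted by each
    model's posterior probability (the probabilities should sum to 1)."""
    return sum(prob * predict(x) for predict, prob in models)

# hypothetical example: three regression models with posterior weights
models = [
    (lambda x: 2.0 * x, 0.5),
    (lambda x: 1.5 * x, 0.3),
    (lambda x: 3.0 * x, 0.2),
]
pred = bma_predict(models, 1.0)   # 0.5*2.0 + 0.3*1.5 + 0.2*3.0
```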
How to choose a set of “good” models?
• All possible models --> way too many!
– E.g. 2^30 ≈ 1 billion, 2^50 ≈ 10^15, etc.
• The BMA solution:
1. “Leaps and bounds” [Furnival and Wilson 1974]: when the number of variables (genes) ≤ 30, we can efficiently produce a reduced set of good models (branch and bound).
2. Cut down the number of models:
• Occam’s window: discard models that are much less likely than the best model.
23
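Occam’s window can be sketched as a simple filter over (unnormalized) posterior model probabilities. The cutoff ratio of 20 (a common default in the BMA literature) and the toy probabilities below are assumptions:

```python
def occams_window(models, ratio=20.0):
    """Keep models whose posterior probability is within `ratio` of the
    best model's, then renormalize over the kept models.

    models: dict mapping model name -> (unnormalized) posterior probability.
    """
    best = max(models.values())
    kept = {m: p for m, p in models.items() if best / p <= ratio}
    total = sum(kept.values())
    return {m: p / total for m, p in kept.items()}

# toy posterior probabilities for three candidate models
probs = {"M1": 0.50, "M2": 0.30, "M3": 0.01}
kept = occams_window(probs, ratio=20.0)   # M3 is 50x less likely: dropped
```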
24
Iterative BMA (iBMA)
Model selection for high-dimensional data:
1. Univariate ranking step
2. Iteratively apply BMA to a fixed window of variables
(Figure: genes × experiments)
Yeung et al. Bioinformatics (2005)
25
Iterative BMA (iBMA)
Model selection for high-dimensional data:
1. Univariate ranking step
2. Iteratively apply BMA to a fixed window of variables
(Figure: genes × experiments)
Discard variables with low posterior probabilities
Chronic myeloid leukemia (CML)
26
CML is characterized by a reciprocal translocation between chromosomes 9 and 22 yielding the Bcr-Abl fusion protein.
http://www.cancer.gov/types/leukemia/patient/cml-treatment-pdq
Treatment options for CML patients
Current treatment “recipe”:
1. Imatinib mesylate (IM): a tyrosine kinase inhibitor (TKI) that inhibits BCR-ABL and its downstream targets.
2. Monitor the patients’ response.
3. Options for IM resistance:
– Second-line TKI
– Stem cell transplantation
Early prediction is the key: if a patient is going to progress quickly → higher priority for transplantation.
27 Figure: Jerry Radich 2007, 2nd Annual Congress of the National Comprehensive Cancer Network.
(Figure: patients resistant to imatinib (%) by disease phase (Early CP, Late CP, AP, BC); values shown: 16%, 26%, 73%, 95%.)
Chronic myeloid leukemia (CML)
• CML = cancer of white blood cells
• Drugs are highly effective in the early stage of CML
• Drugs are NOT effective in the late stage
• Given: gene expression data studying patients in early vs. late stage of CML
• Question: can we find genes predictive of the stage (and hence, treatment) of CML patients?
28
Biomarker discovery in CML
29
Phase determines response to therapy → tailor therapy to individual patients.
Signature genes to predict progression of disease
Oehler and Yeung et al. Blood 2009, 114:3292-8
Lab validation
CML progression data: early stage vs. late stage
→ Multivariate feature selection method
Super 6: signature genes derived from CML microarray data
30
Cross validation: average prediction accuracy = 99.2%
Accession number gene symbol gene name
NM_016355 | DDX47 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 47
NM_004258 | IGSF2 | immunoglobulin superfamily, member 2
NM_000752 | LTB4R | leukotriene B4 receptor
NM_014062 | ART4 | ADP-ribosyltransferase 4 (Dombrock blood group)
NM_005505 | SCARB1 | scavenger receptor class B, member 1
NM_005888 | SLC25A3 | solute carrier family 25 (mitochondrial carrier; phosphate carrier), member 3
Oehler and Yeung et al. Blood 2009, 114:3292-8
A common problem in biomarker selection
The expression levels of many genes are correlated when measured across a limited number of conditions.
• There are usually multiple sets of genes that are equally (or nearly equally) predictive.
• There may be little direct connection between the predictive genes and the biology of interest.
31
Data integration
32
Predicted functional relationships
Specific expert knowledge: reference genes known to be associated with CML progression
Signature genes:
• Predictive of early vs. late CML
• Biologically relevant
Yeung et al. Bioinformatics 2012, 28(6):823-830
CML progression data Early stage vs. late stage
Our network-driven algorithm
33
Start with the FLN. Threshold the edges.
Locate the reference genes on the FLN. Get all genes connected to these reference genes. Run BMA → signature genes.
Our network-driven algorithm constrains our search to predictive genes that are functionally related to genes known to be associated with CML.
34
Legend: reference genes (pink) BMA selected genes (orange)
accession # gene symbol probability (%)
AB037729 | RALGDS | 22.3
NM_006148 | LASP1 | 16.6
NM_000402 | G6PD | 15.1
AK000242 | RALGDS | 15.1
NM_001619 | ADRBK1 | 15.1
M92439 | LRPPRC | 13.7
NM_002786 | PSMA1 | 13.7
Average prediction accuracy = 99.1% in cross validation.
Can genes predictive of CML progression be used to predict outcomes after transplantation?
35
Expression of our signature genes prior to transplant is associated with relapse in 169 chronic phase CML patients.
After adjustment for variables known to affect transplant outcomes, our 2012 gene signature (RALGDS, LASP1, G6PD, ADRBK1, LRPPRC, PSMA1) correlates with relapse after allogeneic transplantation in CP CML patients. In CP patients, we found that an increase of 0.2 in the predicted probability corresponded to a 46% increase in relapse (HR = 1.46, 1.06–2.02, p = 0.02).
(Figure: fraction of relapse.)
Our BMA models predicted relapse better than any single gene.
Application 2: gene networks
36
Constraining candidate regulators
• Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators.
• For each gene g, we estimated a priori how likely it is that each regulator R regulates g, using the supervised framework and the external data sources.
37
g
R1 R2 R3
Graphical representation of the network as a set of nodes and edges. Goal: infer the parent nodes (regulators) of each gene g using the time series expression data.
38
Yeast time series data
BY (lab) × RM (wild) cross → 95 segregants
Phenotype: RNA levels in response to drug perturbation (6 time points)
DNA genotype
Experimental design: Time dependencies: ordering of regulatory events. Genotype data: correlate DNA variations in the segregants to measured expression levels
Array Express E-MTAB-412
Genetics of global gene expression. Rockman & Kruglyak. 2006.
39
Time series data: pictorial view
Expression data (observed phenotype): genes (~6000) × segregants (95 + BY + RM) × time (6)
Genotype data (0 = RM; 1 = BY; 2 = missing): markers (~3000) × segregants (95 + BY + RM)
Regression-based approach
Let X(g,t,s) = expression level of gene g at time t in segregant s
40
X(g,t,s) = Σ_R β_{g,R} · X(R, t−1, s) + ε, where the sum is over potential regulators R
Variable selection
Use the expression level at time (t-1) to predict the expression levels at time t in the same segregant
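A minimal sketch of this lag-1 regression for a single candidate regulator in one segregant: ordinary least squares without an intercept, with an invented six-time-point toy series (the full method fits many regulators jointly and applies variable selection):

```python
def fit_lag1(target, regulator):
    """Least-squares fit of X(g,t) = beta * X(R,t-1) + eps (no intercept),
    pairing the regulator at time t-1 with the target at time t."""
    xs = regulator[:-1]                 # regulator at time t-1
    ys = target[1:]                     # target gene at time t
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# toy series over 6 time points: the target tracks 0.5 * regulator, lagged by 1
reg = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tgt = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
beta = fit_lag1(tgt, reg)
```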
41
Expression data, genome-wide binding data, literature
Other data, e.g. protein-protein interaction, genetic interaction, genotype, etc.
(Figure: matrix of regulators × genes giving, for each gene g, the probability that R regulates g, e.g. 0.95, 0.23, 0.78, …)
Regulators constrained by the external data sources
Gene regulatory network
Supervised learning: integration of external data
Variable selection
Time series expression data Yeung et al. PNAS 2011, 108(48): 19436 - 41
Lo et al. BMC Systems Biology 2012, 6:101 Young et al. BMC Systems Biology 2014, 8:47
Integration of external data
42
Expression data Genome-wide binding data Literature
Other data
Compute variables (Xi) that capture evidence of regulation for (TF, gene) pairs.
Training data: positive (Y=1) vs. negative (Y=0) training examples.
Apply logistic regression to determine the weights (αi’s) of the Xi’s.
(Figure: matrix of regulators × genes giving, for each gene g, the probability that R regulates g, e.g. 0.95, 0.23, 0.78, …)
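As a sketch of this supervised step, the fitted weights turn the evidence variables into a prior probability through the logistic function. The feature values below are hypothetical for one TF-gene pair; the weights echo those reported on the next slide:

```python
import math

def prior_prob(features, weights, offset=0.0):
    """Logistic model: P(R regulates g) = sigmoid(sum_i a_i * x_i + offset)."""
    z = sum(a * x for a, x in zip(weights, features)) + offset
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical pair: correlation, log(p-value), cis flag, #shared GO slim
# terms, documented regulatory role
x = [0.8, -3.0, 1.0, 2.0, 1.0]
a = [2.35, -0.96, 0.0, 0.20, 2.79]
p = prior_prob(x, a)   # strong evidence from several sources -> p near 1
```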
43
Columns: Category | Dataset | Biological relevance | Independent variable (xi) | Weight (ai) | Posterior probability (%)

Co-expression | Rosetta compendium data; environmental stress data; Stanford microarray data | Do R and g show co-expression across diverse experimental conditions? | xi = correlation between R and g | 2.35; 1.74; -2.26 | 100; 100; 100
Genome-wide binding data | ChIP-chip data | Does the potential regulator bind upstream of gene g in vivo? | xi = log(p-value) | -0.96 | 100
Genotype data | Cis-regulation | Does sequence variation of R correlate with the expression level of a nearby gene? | xi = 1 if R is cis-regulated, 0 otherwise | 0 | 0
Curated knowledge from the literature | GO terms | Do gene g and regulator R share the same annotations? | xi = number of common GO slim terms between R and g | 0.20 | 100
Curated knowledge from the literature | Known regulatory role | Is regulator R known to exhibit a regulatory role? | xi = 1 if R has a documented regulatory role in SGD, 0 otherwise | 2.79 | 100
Correcting the sampling rates between positive and negative training samples
• In practice, we expect positive regulatory relationships to be much rarer than negative regulatory relationships.
• Case-control studies (Breslow et al., 1980; Lachin, 2000):
– add an offset of −log(p1/p0) to the logistic regression model, where p1 = positive sampling rate (rare) and p0 = negative sampling rate.
• In-degree distribution (Guelzim et al., 2002): exponential decay
– each target gene is regulated by approximately t = 2.76 transcription factors on average.
• Supervised learning step:
– 583 positive examples and 444 negative examples.
– p1 = 583/(6000 × 2.76) = 3.52%
– p0 = 444/[6000 × (6000 − 2.76)] = 0.0012%
– Therefore, we scale all the predicted odds by a factor of p1/p0 = 2853.
44 Lo et al. BMC Systems Biology 2012, 6:101
45
Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et al. 1999], [Yeung et al. 2005]
• A multivariate variable selection technique:
– Takes advantage of the dependencies between genes to reduce the total number of predictive genes.
• Most gene selection methods consider genes individually and select a single set of predictive genes at a time.
• Advantages of BMA:
– Fewer selected genes
– Probabilities for predictions, selected genes and selected models
• BMA averages over predictions from several models:

prediction = Σ_k (prediction using model k) × Pr(model k)
Prior model probabilities in BMA
• Intuition: favor models consisting of candidate regulators that are strongly supported by external data
• Let π_rg = prior probability that regulator r regulates g
• δ_kr = 1 if regulator r is an inferred regulator in model Mk; δ_kr = 0 otherwise.
• Use these prior model probabilities to compare models in the Occam’s window step.
46
Pr(Mk) = ∏_r π_rg^δ_kr · (1 − π_rg)^(1 − δ_kr)
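The product above is straightforward to evaluate; a minimal sketch in which the three regulators and their prior probabilities are invented:

```python
def prior_model_prob(pi, included):
    """Pr(M_k) = prod_r pi_rg^delta_kr * (1 - pi_rg)^(1 - delta_kr).

    pi: dict mapping regulator -> prior probability that it regulates g.
    included: set of regulators in model M_k (i.e. delta_kr = 1).
    """
    p = 1.0
    for r, prob in pi.items():
        p *= prob if r in included else (1.0 - prob)
    return p

pi = {"R1": 0.9, "R2": 0.2, "R3": 0.5}
p = prior_model_prob(pi, {"R1"})   # model containing only R1: 0.9*0.8*0.5
```

Models built from regulators with strong external support thus start with higher prior probability, which is exactly how the informed prior steers the Occam’s window comparison.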
47
ScanBMA
• Data transformation
• Implementation of the ScanBMA algorithm for efficient model space search
• Integration of an informed prior
• Incorporation of Zellner’s g-prior
Bioconductor package “networkBMA”
Young et al. BMC Systems Biology 2014, 8:47
48
Simulated data (DREAM4 time series data)
The R package minet implementations of ARACNE and MRNET were used. ebdbnet is an R package for empirical Bayes dynamic Bayesian networks.
Young et al. BMC Systems Biology 2014, 8:47
49
ScanBMA: running time
Method | Average running time per gene on yeast data | Projected running time for 20,000 genes
ScanBMA | 0.04 seconds | 13.3 minutes
ARACNE | 70.4 seconds | 16.3 days
CLR | 7.9 seconds | 43.9 hours
MRNET | 500 seconds | 115.7 days
ebdbnet | failed | expected to fail
The R package minet implementations of ARACNE, CLR and MRNET were used. ebdbnet is an R package for empirical Bayes dynamic Bayesian networks.
Assessment • Recovery of known regulatory relationships:
– An independent assessment criterion based on the literature.
– YEASTRACT (Yeast Search for Transcriptional Regulators And Consensus Tracking) is a curated repository of regulatory associations between transcription factors (TF) and target genes in yeast, based on bibliographic references.
– Regulatory relationships used in the supervised step were subtracted.
• Lab validation of selected sub-networks
50
Direct evidence Compare edges inferred in our network to the independent assessment criteria.
If T→G is a documented regulatory relationship in the assessment criteria → direct evidence!
51
(Contingency table: rows = edge in our constructed network (yes/no); columns = TF-gene pair in Yeastract (yes/no))
In network & in Yeastract: TP | In network, not in Yeastract: FP
Not in network, in Yeastract: FN | Neither: TN
True positive rate (TPR) or Recall = TP/(TP + FN)
False positive rate (FPR) = FP/(FP + TN)
Precision = TP/(TP + FP)
Accuracy = (TP + TN)/(TP + FN + FP + TN)
Direct evidence: contingency table Is there an association between our inferred network and
regulatory relationships from Yeastract?
52
(Contingency table: edges in the constructed network vs. TF-gene pairs in Yeastract)
In network & in Yeastract (TP): 18 | In network, not in Yeastract (FP): 251
Not in network, in Yeastract (FN): 21 | Neither (TN): 782
True positive rate (TPR) = 18/(18+21) = 46%
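The rates defined two slides back follow directly from these counts; a small helper, evaluated on the table above:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard rates from a 2x2 contingency table."""
    return {
        "TPR":       tp / (tp + fn),                  # recall
        "FPR":       fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy":  (tp + tn) / (tp + fn + fp + tn),
    }

# counts from the Yeastract comparison above
m = confusion_metrics(tp=18, fp=251, fn=21, tn=782)   # m["TPR"] = 18/39, i.e. ~46%
```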
Direct evidence: contingency table Is there an association between our inferred network and
regulatory relationships from Yeastract?
Applied ScanBMA to a 100-gene subset of the time series data → thresholded edges at posterior probability 50% → the resulting network contains 100 nodes and 439 edges.
53
54
Method | TPR | AUROC | AUPRC | TP | FP
LASSO | 0.046 | 0.51 | 0.042 | 996 | 20,469
ARACNE | 0.205 | 0.50 | 0.040 | 69 | 268
CLR | 0.039 | 0.51 | 0.044 | 8,879 | 220,942
MRNET | 0.039 | 0.51 | 0.044 | 8,737 | 214,757
ScanBMA[20] (95%) | 0.391 | 0.60 | 0.075 | 227 | 353
ScanBMA[3556] (95%) | 0.274 | 0.63 | 0.074 | 127 | 336
ScanBMA: results
55
Road map
• Introduction to big biological data
• Framework of supervised machine learning methods and applications
• Bayesian Model Averaging (BMA): framework and intuition
• Application 1: gene signature discovery
• Application 2: gene networks
56
57
Thank you’s
University of Washington - Seattle: Roger Bumgarner, Adrian Raftery
Fred Hutchinson Cancer Research Center: Jerry Radich, Vivian Oehler
Ken Dombek, Chris Fraley, Kenneth Lo, Chad Young
Funding sources: R01GM084163, R01GM084163-02S2, R01GM084163-05S1, U54-HL127624
University of Washington - Tacoma: Ling Hong Hung, Maciej Fronczuk
58
BMA: Mathematical Details
Posterior probability for model Mk:

Pr(Mk | D) = Pr(D | Mk) Pr(Mk) / Σ_l Pr(D | Ml) Pr(Ml)

where Pr(D | Mk) = ∫ Pr(D | θk, Mk) Pr(θk | Mk) dθk

BMA prediction of a quantity of interest Δ:

Pr(Δ | D) = Σ_{k=1..K} Pr(Δ | D, Mk) · Pr(Mk | D)

BIC of Mk = n · log(1 − Rk²) + pk · log(n), where pk = number of variables in model Mk (not including the intercept), Rk² = R² for model Mk, and n = number of cases.

Comparing two models: 2 log Bjk = BICk − BICj, where the Bayes factor Bjk = Pr(D | Mj) / Pr(D | Mk).

Posterior odds = Bayes factor × prior odds:

Pr(Mj | D) / Pr(Mk | D) = [Pr(D | Mj) / Pr(D | Mk)] × [Pr(Mj) / Pr(Mk)]
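A quick numeric sketch of the BIC comparison above; the R² values and model sizes are hypothetical:

```python
import math

def bic(r2, p, n):
    """BIC approximation used above: n * log(1 - R^2) + p * log(n)."""
    return n * math.log(1.0 - r2) + p * math.log(n)

# hypothetical: model j (3 variables, R^2 = 0.80) vs model k (5 variables,
# R^2 = 0.82), fitted on n = 100 cases
n = 100
bic_j = bic(0.80, 3, n)
bic_k = bic(0.82, 5, n)

# 2 * log(B_jk) = BIC_k - BIC_j; log_bayes_factor < 0 favors model k
log_bayes_factor = (bic_k - bic_j) / 2.0
```

Here the extra fit of model k outweighs its two additional variables, so it gets the lower BIC; with a smaller R² gain the complexity penalty p·log(n) would flip the comparison.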
Assessment • Recovery of known regulatory relationships:
– We showed significant enrichment between our inferred network and the assessment criteria.
• Lab validation of selected sub-networks
• Comparison to other methods in the literature.
59
…
Child nodes of selected TFs
WT ΔTF
Genes that respond to the deletion under rapamycin perturbation
60
Columns: Systematic name | Common name | # references in SGD | # child nodes in network | Expression pattern over time | Known binding site from JASPAR? | Description from SGD

YDR421W | ARO80 | 19 | 51 | increasing over time | yes | Zinc finger transcriptional activator of the Zn2Cys6 family; activates transcription of aromatic amino acid catabolic genes in the presence of aromatic amino acids
YML113W | DAT1 | 17 | 57 | decreasing over time | no | DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; the Arg side chain in its N-terminal pentad Gly-Arg-Lys-Pro-Gly repeat is required for DNA binding; not essential for viability
YBL103C | RTG3 | 83 | 47 | increasing over time | yes | Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways
Comparing our networks to the deletion data
61
Deleted TF | # child nodes | Genes that respond to the deletion | # overlap | Fisher's test p-value
ARO80 | 51 | 10 | 4 | 9.3 x 10^-6
DAT1 | 57 | 784 | 20 | 0.04
RTG3 | 47 | 2288 | 39 | 0.03
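The one-sided Fisher's exact test used here is the upper tail of a hypergeometric distribution. A stdlib-only sketch evaluated on the ARO80 row; the ~6000-gene background is an assumption, so the computed p-value need not match the table exactly:

```python
from math import comb

def overlap_pvalue(n_background, n_children, n_responders, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail) for the
    overlap between a TF's child nodes and deletion-responsive genes."""
    denom = comb(n_background, n_responders)
    total = 0.0
    for k in range(n_overlap, min(n_children, n_responders) + 1):
        total += comb(n_children, k) * comb(n_background - n_children,
                                            n_responders - k) / denom
    return total

# ARO80 row from the table above, against an assumed ~6000-gene background
p = overlap_pvalue(6000, 51, 10, 4)   # a 4-gene overlap is highly unlikely by chance
```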
Our inferred network Validation experiment
62
Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes.
Aro80p is a known regulator of ARO9 and ARO10 (Iraqui et al. Molecular and Cellular Biology 1999, 19:3360-3371).
63
Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes. Magenta: Target genes with known ARO80 binding site.
Amazingly, all 4 genes that respond to deletion (ARO9, ARO10, NAF1, ESBP6) contain the known ARO80 binding site upstream!
Gene targets of transcription factors: transcription factor binding sites
64
Discovery of new binding site: DAT1
65
E-value = 4.5e-30
No known binding sites for DAT1 in JASPAR or SGD.
66
Choosing a set of relevant genes (S)
• Want: genes highly expressed in each class
• Choose genes with the highest and lowest S2N=P(g,c) scores.
67
Neighborhood Analysis
• Goal: could a high S2N score arise by chance?
• Idea: compare the number of gene vectors within a neighborhood of fixed size around c with the number of gene vectors around a random permutation of c.
68
(Univariate) gene selection
• Use the expression pattern of each gene individually
• Class vector c, e.g. (0,0,0,0,0,1,1,1,1,1)
• g: expression vector of a gene over all the samples
• µ1: average expression level of gene g in class 1
• σ1: standard deviation of gene g in class 1

S2N = P(g, c) = (µ1 − µ2) / (σ1 + σ2)
69
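The signal-to-noise score is a one-liner once the per-class means and standard deviations are in hand; a minimal implementation with an invented six-sample expression vector:

```python
import math

def s2n(expr, labels):
    """Signal-to-noise score P(g, c) = (mu1 - mu2) / (sigma1 + sigma2)."""
    g1 = [x for x, c in zip(expr, labels) if c == 1]
    g2 = [x for x, c in zip(expr, labels) if c == 0]
    mu1, mu2 = sum(g1) / len(g1), sum(g2) / len(g2)
    sd = lambda xs, mu: math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return (mu1 - mu2) / (sd(g1, mu1) + sd(g2, mu2))

# toy gene: clearly higher in class 1 than in class 0
expr   = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
labels = [1,   1,   1,   0,   0,   0]
score = s2n(expr, labels)
```

Genes with the largest positive scores are up in class 1, the most negative ones are up in class 2; selection takes the extremes of both tails.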
Prediction by Weighted Voting
• Each gene in S casts one vote for one class.
• The weight of a gene g depends on S2N = P(g, c).
• x: expression level of the new sample
• b: decision boundary, b = (µ1 + µ2)/2

vote(g) = weight(g) × distance(x, b) = P(g, c) × (x − b)

Positive votes: votes for class 1. Negative votes: votes for class 2.
70
Weighted Voting (cont’d)
• V1 = sum of all positive votes (class 1)
• V2 = sum of the absolute values of all negative votes (class 2)
• Winner = max(V1, V2)
• Prediction strength:

PS = (Vwinner − Vloser) / (Vwinner + Vloser)

Choose a PS threshold using cross validation.
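Putting the two slides together, a sketch of the weighted-voting predictor; the per-gene summaries below are invented, and each weight would in practice be the gene's P(g, c) score:

```python
def weighted_vote(genes):
    """Weighted-voting prediction.

    genes: list of (x, weight, mu1, mu2) per selected gene, where x is the
    new sample's expression of that gene and weight = P(g, c).
    Returns (winning class, prediction strength).
    """
    v1 = v2 = 0.0
    for x, w, mu1, mu2 in genes:
        b = (mu1 + mu2) / 2.0          # per-gene decision boundary
        vote = w * (x - b)             # signed vote
        if vote > 0:
            v1 += vote                 # votes for class 1
        else:
            v2 += -vote                # votes for class 2 (absolute value)
    winner = 1 if v1 > v2 else 2
    ps = abs(v1 - v2) / (v1 + v2)      # prediction strength
    return winner, ps

# hypothetical two-gene signature: (x, weight, mu1, mu2)
genes = [(2.5, 1.5, 3.0, 1.0), (0.2, 0.8, 2.0, 0.0)]
winner, ps = weighted_vote(genes)
```

A low PS, as in this split vote, is exactly the case where a cross-validated threshold would flag the prediction as unreliable.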
Ensemble Learning Methods
• Combine results from multiple learning methods
• “Wisdom of crowds”: http://www.nature.com/nmeth/journal/v9/n8/full/nmeth.2016.html
• E.g. BMA, majority voting
• http://en.wikipedia.org/wiki/Ensemble_learning
71