TRANSCRIPT
Applications of Bayesian Model Averaging in Personalized
Medicine and Systems Biology
Ka Yee Yeung [email protected]
Institute of Technology, University of Washington Tacoma 8/12/2015
1
Road map
• Introduction to big biological data
• Framework of supervised machine learning methods and applications
• Bayesian Model Averaging (BMA): framework and intuition
• Application 1: gene signature discovery
• Application 2: gene networks
2
3
“High-Throughput BioTech”
Sensors: DNA sequencing, microarrays/gene expression, mass spectrometry/proteomics, protein/protein & DNA/protein interaction
Controls: cloning, gene knock out/knock in, RNAi
Floods of data
“Grand Challenge” problems (Courtesy: Larry Ruzzo)
Big Data in Biology (reference: Marx 2013)
4 http://www.nature.com/nature/journal/v498/n7453/full/498255a.html
Biology as a data-rich science
• High-throughput technologies can measure the activity levels of many biological entities at once.
• For example, sequencing and microarray technologies can measure the expression (RNA) levels of all genes at the same time.
• My research focuses on the development of machine learning methods for these high-dimensional data.
5
6
Computational biology: an iterative approach
Experiments Data handling
Mathematical modeling
High-throughput assays
Integration of multiple forms of experiments and knowledge
• An initiative launched by NIH in 2012.
• Addresses challenges in using biomedical big data:
– Locating data and software tools.
– Getting access to the data and software tools.
– Standardizing data and metadata.
– Extending policies and practices for data and software sharing.
– Organizing, managing, and processing biomedical Big Data.
– Developing new methods for analyzing & integrating biomedical data.
– Training researchers who can use biomedical Big Data effectively.
7 http://bd2k.nih.gov/about_bd2k.html#sthash.xs2j0lpi.dpbs
8 Figure from: http://www.pfizer.ie/personalized_med.cfm
Application 1: Personalized (or precision) Medicine
9
Initiative on Precision Medicine
President Obama, State of the Union Address, Jan 20, 2015
• https://www.whitehouse.gov/blog/2015/01/30/precision-medicine-initiative-data-driven-treatments-unique-your-own-body
10
11
12
Personalized Medicine
Goal: tailor treatment based on genetic information from individual patients, using machine learning methods. Application: predicting clinical outcomes in cancer patients.
13
Classification (supervised learning)
Training data: variables (genes) × patient samples (E1–E4), with class labels per sample.
Gene 1: -2 +2 +2 -1
Gene 2: +8 +3 0 +4
Gene 3: -4 +5 +4 -2
Gene 4: -1 +4 +3 -1
Labels: 0 1 1 0
New sample E’: -1 +5 -3 -1
Goal: predict the label of the new sample, using feature selection on the training data.
Objectives of classification
• Class prediction:
– Predict patients with a given clinical outcome (y, class label, response)
• Feature selection:
– Identification of a minimal set of relevant genes for future prediction
– Identification of “biologically” interesting genes
14
Steps in classification
15
Training set (labeled samples + data) → classification algorithm, feature selection algorithm → classifier + set of relevant variables
Test set (unlabeled data) → predicted labels (classes) for samples in the test set
Classification methods http://cran.r-project.org/web/views/MachineLearning.html
16
Method | Example R package
Logistic regression | glm
K nearest neighbor | class
Support vector machine | e1071
Decision trees | rpart
LASSO | glmnet
Ensemble methods | randomForest, BMA
Cross Validation (CV)
• An easy and useful method to estimate the prediction error.
• Can also be used to optimize the classifiers and predictive models.
• Method (m-fold cross-validation):
– Split the data into m approximately equally sized subsets.
– Train the classifier on (m−1) subsets.
– Test the classifier on the remaining subset.
– Estimate the prediction error by comparing the predicted class labels with the true class labels.
– Repeat.
• Examples:
– 10-fold CV [Ambroise et al. PNAS 2002] http://www.pnas.org/content/99/10/6562.abstract
– Leave-one-out cross-validation (LOOCV)
17
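As a rough illustration, the m-fold procedure above can be sketched in Python. This is a minimal sketch: the toy data and the fixed-threshold “classifier” are invented placeholders for any of the real methods on the next slide.

```python
import random

def cross_validate(X, y, train_fn, predict_fn, m=10, seed=0):
    """Estimate the prediction error by m-fold cross-validation."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::m] for i in range(m)]      # m roughly equal subsets
    errors = 0
    for k in range(m):
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        model = train_fn([X[i] for i in train], [y[i] for i in train])
        preds = predict_fn(model, [X[i] for i in folds[k]])
        errors += sum(p != y[i] for p, i in zip(preds, folds[k]))
    return errors / len(y)                     # estimated error rate

# toy, linearly separable data; a constant-threshold "classifier" stands in
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
train_fn = lambda Xs, ys: 0.5                  # "training" returns a threshold
predict_fn = lambda thr, Xs: [int(x[0] > thr) for x in Xs]
err = cross_validate(X, y, train_fn, predict_fn, m=4)
```

Because the toy data are perfectly separable at the threshold, the estimated error here is zero; with real classifiers the same loop yields the CV error estimate.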
18
Top 50 genes for ALL vs. AML
Science. 1999 Oct 15;286(5439):531-7. Cited 10,460 times!
19
Choosing relevant genes
• Use training set only
• Ideal:
– different typical expression patterns in the two classes
– little variation within each class
Feature selection (or variable selection)
• T-test • Correlation • Many others
Can be formulated as a model selection problem. In this context, a model is a set of relevant features (variables, genes).
20
Model Selection
• Exhaustive search
• Stepwise selection
– Forward selection
– Backward elimination
• Multivariate model selection
– Bayesian Model Averaging
– Regularized methods, e.g. LASSO
21
High dimensionality challenge: # variables >>> # observations
22
Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et al. 1999], [Yeung et al. 2005]
• A multivariate variable selection technique:
– Takes advantage of the dependencies between genes to reduce the total number of predictive genes.
• Most gene selection methods consider genes individually and select a single set of predictive genes at a time.
• Advantages of BMA:
– Fewer selected genes
– Probabilities for predictions, selected genes and selected models
• BMA averages over predictions from several models:

prediction = Σ_k (prediction using model k) × Pr(model k)
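In code, the averaging step is just a probability-weighted sum over model predictions. A minimal sketch; the three toy models and their posterior weights are invented for illustration:

```python
def bma_predict(models, x):
    """BMA prediction: average the model predictions, weighted by each
    model's posterior probability (the probabilities should sum to 1)."""
    return sum(prob * predict(x) for predict, prob in models)

# hypothetical example: three regression models with posterior weights
models = [
    (lambda x: 2.0 * x, 0.5),
    (lambda x: 1.5 * x, 0.3),
    (lambda x: 3.0 * x, 0.2),
]
pred = bma_predict(models, 1.0)   # 0.5*2.0 + 0.3*1.5 + 0.2*3.0
```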
How to choose a set of “good” models?
• All possible models --> way too many!
– E.g. 2^30 ≈ 1 billion, 2^50 ≈ 10^15, etc.
• The BMA solution:
1. “Leaps and bounds” [Furnival and Wilson 1974]: when the number of variables (genes) ≤ 30, we can efficiently produce a reduced set of good models (branch and bound).
2. Cut down the number of models:
• Occam’s window: discard models that are much less likely than the best model.
23
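Occam’s window can be sketched as a simple filter over (unnormalized) posterior model probabilities. The cutoff ratio of 20 (a common default in the BMA literature) and the toy probabilities below are assumptions:

```python
def occams_window(models, ratio=20.0):
    """Keep models whose posterior probability is within `ratio` of the
    best model's, then renormalize over the kept models.

    models: dict mapping model name -> (unnormalized) posterior probability.
    """
    best = max(models.values())
    kept = {m: p for m, p in models.items() if best / p <= ratio}
    total = sum(kept.values())
    return {m: p / total for m, p in kept.items()}

# toy posterior probabilities for three candidate models
probs = {"M1": 0.50, "M2": 0.30, "M3": 0.01}
kept = occams_window(probs, ratio=20.0)   # M3 is 50x less likely: dropped
```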
24
Iterative BMA (iBMA)
Model selection for high-dimensional data:
1. Univariate ranking step
2. Iteratively apply BMA to a fixed window of variables
(Figure: genes × experiments)
Yeung et al. Bioinformatics (2005)
25
Iterative BMA (iBMA)
Model selection for high-dimensional data:
1. Univariate ranking step
2. Iteratively apply BMA to a fixed window of variables
(Figure: genes × experiments)
Discard variables with low posterior probabilities
Chronic myeloid leukemia (CML)
26
CML is characterized by a reciprocal translocation between chromosomes 9 and 22 yielding the Bcr-Abl fusion protein.
http://www.cancer.gov/types/leukemia/patient/cml-treatment-pdq
Treatment options for CML patients
Current treatment “recipe”:
1. Imatinib mesylate (IM): a tyrosine kinase inhibitor (TKI) that inhibits BCR-ABL and its downstream targets.
2. Monitor the patients’ response.
3. Options for IM resistance:
– Second-line TKI
– Stem cell transplantation
Early prediction is the key: if a patient is going to progress quickly → higher priority for transplantation.
27 Figure: Jerry Radich 2007, 2nd Annual Congress of the National Comprehensive Cancer Network.
(Figure: patients resistant to imatinib (%) by disease phase (Early CP, Late CP, AP, BC); values shown: 16%, 26%, 73%, 95%.)
Chronic myeloid leukemia (CML)
• CML = cancer of white blood cells
• Drugs are highly effective in the early stage of CML
• Drugs are NOT effective in the late stage
• Given: gene expression data studying patients in early vs. late stage of CML
• Question: can we find genes predictive of the stage (and hence, treatment) of CML patients?
28
Biomarker discovery in CML
29
Phase determines response to therapy → tailor therapy to individual patients.
Signature genes to predict progression of disease
Oehler and Yeung et al. Blood 2009, 114:3292-8
Lab validation
CML progression data: early stage vs. late stage
→ Multivariate feature selection method
Super 6: signature genes derived from CML microarray data
30
Cross validation: average prediction accuracy = 99.2%
Accession number gene symbol gene name
NM_016355 | DDX47 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 47
NM_004258 | IGSF2 | immunoglobulin superfamily, member 2
NM_000752 | LTB4R | leukotriene B4 receptor
NM_014062 | ART4 | ADP-ribosyltransferase 4 (Dombrock blood group)
NM_005505 | SCARB1 | scavenger receptor class B, member 1
NM_005888 | SLC25A3 | solute carrier family 25 (mitochondrial carrier; phosphate carrier), member 3
Oehler and Yeung et al. Blood 2009, 114:3292-8
A common problem in biomarker selection
The expression levels of many genes are correlated when measured across a limited number of conditions.
• There are usually multiple sets of genes that are equally (or nearly equally) predictive.
• There may be little direct connection between the predictive genes and the biology of interest.
31
Data integration
32
Predicted functional relationships
Specific expert knowledge: reference genes known to be associated with CML progression
Signature genes:
• Predictive of early vs. late CML
• Biologically relevant
Yeung et al. Bioinformatics 2012, 28(6):823-830
CML progression data Early stage vs. late stage
Our network-driven algorithm
33
Start with the FLN. Threshold the edges.
Locate the reference genes on the FLN. Get all genes connected to these reference genes. Run BMA → signature genes.
Our network-driven algorithm constrains our search to predictive genes that are functionally related to genes known to be associated with CML.
34
Legend: reference genes (pink) BMA selected genes (orange)
accession # gene symbol probability (%)
AB037729 | RALGDS | 22.3
NM_006148 | LASP1 | 16.6
NM_000402 | G6PD | 15.1
AK000242 | RALGDS | 15.1
NM_001619 | ADRBK1 | 15.1
M92439 | LRPPRC | 13.7
NM_002786 | PSMA1 | 13.7
Average prediction accuracy = 99.1% in cross validation.
Can genes predictive of CML progression be used to predict outcomes after transplantation?
35
Expression of our signature genes prior to transplant is associated with relapse in 169 chronic phase CML patients.
After adjustment for variables known to affect transplant outcomes, our 2012 gene signature (RALGDS, LASP1, G6PD, ADRBK1, LRPPRC, PSMA1) correlates with relapse after allogeneic transplantation in CP CML patients. In CP patients, we found that an increase of 0.2 in the predicted probability corresponded to a 46% increase in relapse (HR = 1.46, 1.06–2.02, p = 0.02).
(Figure: fraction of relapse.)
Our BMA models predicted relapse better than any single gene.
Application 2: gene networks
36
Constraining candidate regulators
• Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators.
• For each gene g, we estimated a priori how likely it is that each regulator R regulates g, using the supervised framework and the external data sources.
37
g
R1 R2 R3
Graphical representation of the network as a set of nodes and edges. Goal: infer the parent nodes (regulators) of each gene g using the time series expression data.
38
Yeast time series data
BY (lab) × RM (wild) cross → 95 segregants
Phenotype: RNA levels in response to drug perturbation (6 time points)
DNA genotype
Experimental design: Time dependencies: ordering of regulatory events. Genotype data: correlate DNA variations in the segregants to measured expression levels
Array Express E-MTAB-412
Genetics of global gene expression. Rockman & Kruglyak. 2006.
39
Time series data: pictorial view
Expression data (observed phenotype): genes (~6000) × segregants (95 + BY + RM) × time (6)
Genotype data (0 = RM; 1 = BY; 2 = missing): markers (~3000) × segregants (95 + BY + RM)
Regression-based approach
Let X(g,t,s) = expression level of gene g at time t in segregant s
40
X(g,t,s) = Σ_R β_{g,R} · X(R, t−1, s) + ε, where the sum is over potential regulators R
Variable selection
Use the expression level at time (t-1) to predict the expression levels at time t in the same segregant
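A minimal sketch of this lag-1 regression for a single candidate regulator in one segregant: ordinary least squares without an intercept, with an invented six-time-point toy series (the full method fits many regulators jointly and applies variable selection):

```python
def fit_lag1(target, regulator):
    """Least-squares fit of X(g,t) = beta * X(R,t-1) + eps (no intercept),
    pairing the regulator at time t-1 with the target at time t."""
    xs = regulator[:-1]                 # regulator at time t-1
    ys = target[1:]                     # target gene at time t
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# toy series over 6 time points: the target tracks 0.5 * regulator, lagged by 1
reg = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tgt = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
beta = fit_lag1(tgt, reg)
```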
41
Expression data, genome-wide binding data, literature
Other data, e.g. protein-protein interaction, genetic interaction, genotype, etc.
(Figure: matrix of regulators × genes giving, for each gene g, the probability that R regulates g, e.g. 0.95, 0.23, 0.78, …)
Regulators constrained by the external data sources
Gene regulatory network
Supervised learning: integration of external data
Variable selection
Time series expression data Yeung et al. PNAS 2011, 108(48): 19436 - 41
Lo et al. BMC Systems Biology 2012, 6:101 Young et al. BMC Systems Biology 2014, 8:47
Integration of external data
42
Expression data Genome-wide binding data Literature
Other data
Compute variables (Xi) that capture evidence of regulation for (TF, gene) pairs.
Training data: positive (Y=1) vs. negative (Y=0) training examples.
Apply logistic regression to determine the weights (αi’s) of the Xi’s.
(Figure: matrix of regulators × genes giving, for each gene g, the probability that R regulates g, e.g. 0.95, 0.23, 0.78, …)
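As a sketch of this supervised step, the fitted weights turn the evidence variables into a prior probability through the logistic function. The feature values below are hypothetical for one TF-gene pair; the weights echo those reported on the next slide:

```python
import math

def prior_prob(features, weights, offset=0.0):
    """Logistic model: P(R regulates g) = sigmoid(sum_i a_i * x_i + offset)."""
    z = sum(a * x for a, x in zip(weights, features)) + offset
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical pair: correlation, log(p-value), cis flag, #shared GO slim
# terms, documented regulatory role
x = [0.8, -3.0, 1.0, 2.0, 1.0]
a = [2.35, -0.96, 0.0, 0.20, 2.79]
p = prior_prob(x, a)   # strong evidence from several sources -> p near 1
```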
43
Columns: Category | Dataset | Biological relevance | Independent variable (xi) | Weight (ai) | Posterior probability (%)

Co-expression | Rosetta compendium data; environmental stress data; Stanford microarray data | Do R and g show co-expression across diverse experimental conditions? | xi = correlation between R and g | 2.35; 1.74; -2.26 | 100; 100; 100
Genome-wide binding data | ChIP-chip data | Does the potential regulator bind upstream of gene g in vivo? | xi = log(p-value) | -0.96 | 100
Genotype data | Cis-regulation | Does sequence variation of R correlate with the expression level of a nearby gene? | xi = 1 if R is cis-regulated, 0 otherwise | 0 | 0
Curated knowledge from the literature | GO terms | Do gene g and regulator R share the same annotations? | xi = number of common GO slim terms between R and g | 0.20 | 100
Curated knowledge from the literature | Known regulatory role | Is regulator R known to exhibit a regulatory role? | xi = 1 if R has a documented regulatory role in SGD, 0 otherwise | 2.79 | 100
Correcting the sampling rates between positive and negative training samples
• In practice, we expect positive regulatory relationships to be much rarer than negative regulatory relationships.
• Case-control studies (Breslow et al., 1980; Lachin, 2000):
– add an offset of −log(p1/p0) to the logistic regression model, where p1 = positive sampling rate (rare) and p0 = negative sampling rate.
• In-degree distribution (Guelzim et al., 2002): exponential decay
– each target gene is regulated by approximately t = 2.76 transcription factors on average.
• Supervised learning step:
– 583 positive examples and 444 negative examples.
– p1 = 583/(6000 × 2.76) = 3.52%
– p0 = 444/[6000 × (6000 − 2.76)] = 0.0012%
– Therefore, we scale all the predicted odds by a factor of p1/p0 = 2853.
44 Lo et al. BMC Systems Biology 2012, 6:101
45
Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et al. 1999], [Yeung et al. 2005]
• A multivariate variable selection technique:
– Takes advantage of the dependencies between genes to reduce the total number of predictive genes.
• Most gene selection methods consider genes individually and select a single set of predictive genes at a time.
• Advantages of BMA:
– Fewer selected genes
– Probabilities for predictions, selected genes and selected models
• BMA averages over predictions from several models:

prediction = Σ_k (prediction using model k) × Pr(model k)
Prior model probabilities in BMA
• Intuition: favor models consisting of candidate regulators that are strongly supported by external data
• Let π_rg = prior probability that regulator r regulates g
• δ_kr = 1 if regulator r is an inferred regulator in model Mk; δ_kr = 0 otherwise.
• Use these prior model probabilities to compare models in the Occam’s window step.
46
Pr(Mk) = ∏_r π_rg^δ_kr · (1 − π_rg)^(1 − δ_kr)
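The product above is straightforward to evaluate; a minimal sketch in which the three regulators and their prior probabilities are invented:

```python
def prior_model_prob(pi, included):
    """Pr(M_k) = prod_r pi_rg^delta_kr * (1 - pi_rg)^(1 - delta_kr).

    pi: dict mapping regulator -> prior probability that it regulates g.
    included: set of regulators in model M_k (i.e. delta_kr = 1).
    """
    p = 1.0
    for r, prob in pi.items():
        p *= prob if r in included else (1.0 - prob)
    return p

pi = {"R1": 0.9, "R2": 0.2, "R3": 0.5}
p = prior_model_prob(pi, {"R1"})   # model containing only R1: 0.9*0.8*0.5
```

Models built from regulators with strong external support thus start with higher prior probability, which is exactly how the informed prior steers the Occam’s window comparison.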
47
ScanBMA
• Data transformation
• Implementation of the ScanBMA algorithm for efficient model space search
• Integration of an informed prior
• Incorporation of Zellner’s g-prior
Bioconductor package “networkBMA”
Young et al. BMC Systems Biology 2014, 8:47
48
Simulated data (DREAM4 time series data)
The R package minet implementations of ARACNE and MRNET were used. ebdbnet is an R package for empirical Bayes dynamic Bayesian networks.
Young et al. BMC Systems Biology 2014, 8:47
49
ScanBMA: running time
Method | Average running time per gene on yeast data | Projected running time for 20,000 genes
ScanBMA | 0.04 seconds | 13.3 minutes
ARACNE | 70.4 seconds | 16.3 days
CLR | 7.9 seconds | 43.9 hours
MRNET | 500 seconds | 115.7 days
ebdbnet | failed | expected to fail
The R package minet implementations of ARACNE, CLR and MRNET were used. ebdbnet is an R package for empirical Bayes dynamic Bayesian networks.
Assessment • Recovery of known regulatory relationships:
– An independent assessment criterion based on the literature.
– YEASTRACT (Yeast Search for Transcriptional Regulators And Consensus Tracking) is a curated repository of regulatory associations between transcription factors (TF) and target genes in yeast, based on bibliographic references.
– Regulatory relationships used in the supervised step were subtracted.
• Lab validation of selected sub-networks
50
Direct evidence Compare edges inferred in our network to the independent assessment criteria.
If T→G is a documented regulatory relationship in the assessment criteria → direct evidence!
51
(Contingency table: rows = edge in our constructed network (yes/no); columns = TF-gene pair in Yeastract (yes/no))
In network & in Yeastract: TP | In network, not in Yeastract: FP
Not in network, in Yeastract: FN | Neither: TN
True positive rate (TPR) or Recall = TP/(TP + FN)
False positive rate (FPR) = FP/(FP + TN)
Precision = TP/(TP + FP)
Accuracy = (TP + TN)/(TP + FN + FP + TN)
Direct evidence: contingency table Is there an association between our inferred network and
regulatory relationships from Yeastract?
52
(Contingency table: edges in the constructed network vs. TF-gene pairs in Yeastract)
In network & in Yeastract (TP): 18 | In network, not in Yeastract (FP): 251
Not in network, in Yeastract (FN): 21 | Neither (TN): 782
True positive rate (TPR) = 18/(18+21) = 46%
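The rates defined two slides back follow directly from these counts; a small helper, evaluated on the table above:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard rates from a 2x2 contingency table."""
    return {
        "TPR":       tp / (tp + fn),                  # recall
        "FPR":       fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy":  (tp + tn) / (tp + fn + fp + tn),
    }

# counts from the Yeastract comparison above
m = confusion_metrics(tp=18, fp=251, fn=21, tn=782)   # m["TPR"] = 18/39, i.e. ~46%
```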
Direct evidence: contingency table Is there an association between our inferred network and
regulatory relationships from Yeastract?
Applied ScanBMA to a 100-gene subset of the time series data → thresholded edges at posterior probability 50% → the resulting network contains 100 nodes and 439 edges.
53
54
Method | TPR | AUROC | AUPRC | TP | FP
LASSO | 0.046 | 0.51 | 0.042 | 996 | 20,469
ARACNE | 0.205 | 0.50 | 0.040 | 69 | 268
CLR | 0.039 | 0.51 | 0.044 | 8,879 | 220,942
MRNET | 0.039 | 0.51 | 0.044 | 8,737 | 214,757
ScanBMA[20] (95%) | 0.391 | 0.60 | 0.075 | 227 | 353
ScanBMA[3556] (95%) | 0.274 | 0.63 | 0.074 | 127 | 336
ScanBMA: results
55
Road map
• Introduction to big biological data
• Framework of supervised machine learning methods and applications
• Bayesian Model Averaging (BMA): framework and intuition
• Application 1: gene signature discovery
• Application 2: gene networks
56
57
Thank you’s
University of Washington - Seattle: Roger Bumgarner, Adrian Raftery
Fred Hutchinson Cancer Research Center: Jerry Radich, Vivian Oehler
Ken Dombek, Chris Fraley, Kenneth Lo, Chad Young
Funding sources: R01GM084163, R01GM084163-02S2, R01GM084163-05S1, U54-HL127624
University of Washington - Tacoma: Ling Hong Hung, Maciej Fronczuk
58
BMA: Mathematical Details
Posterior probability for model Mk:

Pr(Mk | D) = Pr(D | Mk) Pr(Mk) / Σ_l Pr(D | Ml) Pr(Ml)

where Pr(D | Mk) = ∫ Pr(D | θk, Mk) Pr(θk | Mk) dθk

BMA prediction of a quantity of interest Δ:

Pr(Δ | D) = Σ_{k=1..K} Pr(Δ | D, Mk) · Pr(Mk | D)

BIC of Mk = n · log(1 − Rk²) + pk · log(n), where pk = number of variables in model Mk (not including the intercept), Rk² = R² for model Mk, and n = number of cases.

Comparing two models: 2 log Bjk = BICk − BICj, where the Bayes factor Bjk = Pr(D | Mj) / Pr(D | Mk).

Posterior odds = Bayes factor × prior odds:

Pr(Mj | D) / Pr(Mk | D) = [Pr(D | Mj) / Pr(D | Mk)] × [Pr(Mj) / Pr(Mk)]
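A quick numeric sketch of the BIC comparison above; the R² values and model sizes are hypothetical:

```python
import math

def bic(r2, p, n):
    """BIC approximation used above: n * log(1 - R^2) + p * log(n)."""
    return n * math.log(1.0 - r2) + p * math.log(n)

# hypothetical: model j (3 variables, R^2 = 0.80) vs model k (5 variables,
# R^2 = 0.82), fitted on n = 100 cases
n = 100
bic_j = bic(0.80, 3, n)
bic_k = bic(0.82, 5, n)

# 2 * log(B_jk) = BIC_k - BIC_j; log_bayes_factor < 0 favors model k
log_bayes_factor = (bic_k - bic_j) / 2.0
```

Here the extra fit of model k outweighs its two additional variables, so it gets the lower BIC; with a smaller R² gain the complexity penalty p·log(n) would flip the comparison.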
Assessment • Recovery of known regulatory relationships:
– We showed significant enrichment between our inferred network and the assessment criteria.
• Lab validation of selected sub-networks
• Comparison to other methods in the literature.
59
…
Child nodes of selected TFs
WT ΔTF
Genes that respond to the deletion under rapamycin perturbation
60
Columns: Systematic name | Common name | # references in SGD | # child nodes in network | Expression pattern over time | Known binding site from JASPAR? | Description from SGD

YDR421W | ARO80 | 19 | 51 | increasing over time | yes | Zinc finger transcriptional activator of the Zn2Cys6 family; activates transcription of aromatic amino acid catabolic genes in the presence of aromatic amino acids
YML113W | DAT1 | 17 | 57 | decreasing over time | no | DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; the Arg side chain in its N-terminal pentad Gly-Arg-Lys-Pro-Gly repeat is required for DNA binding; not essential for viability
YBL103C | RTG3 | 83 | 47 | increasing over time | yes | Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways
Comparing our networks to the deletion data
61
Deleted TF | # child nodes | Genes that respond to the deletion | # overlap | Fisher's test p-value
ARO80 | 51 | 10 | 4 | 9.3 x 10^-6
DAT1 | 57 | 784 | 20 | 0.04
RTG3 | 47 | 2288 | 39 | 0.03
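The one-sided Fisher's exact test used here is the upper tail of a hypergeometric distribution. A stdlib-only sketch evaluated on the ARO80 row; the ~6000-gene background is an assumption, so the computed p-value need not match the table exactly:

```python
from math import comb

def overlap_pvalue(n_background, n_children, n_responders, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail) for the
    overlap between a TF's child nodes and deletion-responsive genes."""
    denom = comb(n_background, n_responders)
    total = 0.0
    for k in range(n_overlap, min(n_children, n_responders) + 1):
        total += comb(n_children, k) * comb(n_background - n_children,
                                            n_responders - k) / denom
    return total

# ARO80 row from the table above, against an assumed ~6000-gene background
p = overlap_pvalue(6000, 51, 10, 4)   # a 4-gene overlap is highly unlikely by chance
```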
Our inferred network Validation experiment
62
Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes.
Aro80p is a known regulator of ARO9 and ARO10 (Iraqui et al. Molecular and Cellular Biology 1999, 19:3360-3371).
63
Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes. Magenta: Target genes with known ARO80 binding site.
Amazingly, all 4 genes that respond to deletion (ARO9, ARO10, NAF1, ESBP6) contain the known ARO80 binding site upstream!
Gene targets of transcription factors: transcription factor binding sites
64
Discovery of new binding site: DAT1
65
E-value = 4.5e-30
No known binding sites for DAT1 in JASPAR or SGD.
66
Choosing a set of relevant genes (S)
• Want: genes highly expressed in each class
• Choose genes with the highest and lowest S2N=P(g,c) scores.
67
Neighborhood Analysis
• Goal: could a high S2N score arise by chance?
• Idea: compare the number of gene vectors within a neighborhood of fixed size around c with the number of gene vectors around a random permutation of c.
68
(Univariate) gene selection
• Use the expression pattern of each gene individually
• Class vector c, e.g. (0,0,0,0,0,1,1,1,1,1)
• g: expression vector of a gene over all the samples
• µ1: average expression level of gene g in class 1
• σ1: standard deviation of gene g in class 1

S2N = P(g, c) = (µ1 − µ2) / (σ1 + σ2)
69
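The signal-to-noise score is a one-liner once the per-class means and standard deviations are in hand; a minimal implementation with an invented six-sample expression vector:

```python
import math

def s2n(expr, labels):
    """Signal-to-noise score P(g, c) = (mu1 - mu2) / (sigma1 + sigma2)."""
    g1 = [x for x, c in zip(expr, labels) if c == 1]
    g2 = [x for x, c in zip(expr, labels) if c == 0]
    mu1, mu2 = sum(g1) / len(g1), sum(g2) / len(g2)
    sd = lambda xs, mu: math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return (mu1 - mu2) / (sd(g1, mu1) + sd(g2, mu2))

# toy gene: clearly higher in class 1 than in class 0
expr   = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
labels = [1,   1,   1,   0,   0,   0]
score = s2n(expr, labels)
```

Genes with the largest positive scores are up in class 1, the most negative ones are up in class 2; selection takes the extremes of both tails.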
Prediction by Weighted Voting
• Each gene in S casts one vote for one class.
• The weight of a gene g depends on S2N = P(g, c).
• x: expression level of the new sample
• b: decision boundary, b = (µ1 + µ2)/2

vote(g) = weight(g) × distance(x, b) = P(g, c) × (x − b)

Positive votes: votes for class 1. Negative votes: votes for class 2.
70
Weighted Voting (cont’d)
• V1 = sum of all positive votes (class 1)
• V2 = sum of the absolute values of all negative votes (class 2)
• Winner = max(V1, V2)
• Prediction strength:

PS = (Vwinner − Vloser) / (Vwinner + Vloser)

Choose a PS threshold using cross validation.
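Putting the two slides together, a sketch of the weighted-voting predictor; the per-gene summaries below are invented, and each weight would in practice be the gene's P(g, c) score:

```python
def weighted_vote(genes):
    """Weighted-voting prediction.

    genes: list of (x, weight, mu1, mu2) per selected gene, where x is the
    new sample's expression of that gene and weight = P(g, c).
    Returns (winning class, prediction strength).
    """
    v1 = v2 = 0.0
    for x, w, mu1, mu2 in genes:
        b = (mu1 + mu2) / 2.0          # per-gene decision boundary
        vote = w * (x - b)             # signed vote
        if vote > 0:
            v1 += vote                 # votes for class 1
        else:
            v2 += -vote                # votes for class 2 (absolute value)
    winner = 1 if v1 > v2 else 2
    ps = abs(v1 - v2) / (v1 + v2)      # prediction strength
    return winner, ps

# hypothetical two-gene signature: (x, weight, mu1, mu2)
genes = [(2.5, 1.5, 3.0, 1.0), (0.2, 0.8, 2.0, 0.0)]
winner, ps = weighted_vote(genes)
```

A low PS, as in this split vote, is exactly the case where a cross-validated threshold would flag the prediction as unreliable.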
Ensemble Learning Methods
• Combine results from multiple learning methods
• “Wisdom of crowds”: http://www.nature.com/nmeth/journal/v9/n8/full/nmeth.2016.html
• E.g. BMA, majority voting
• http://en.wikipedia.org/wiki/Ensemble_learning
71