Applications of Bayesian Model Averaging in Personalized Medicine and Systems Biology Ka Yee Yeung [email protected] Institute of Technology, University of Washington Tacoma 8/12/2015 1


Page 1: Applications of Bayesian Model Averaging in Personalized

Applications of Bayesian Model Averaging in Personalized

Medicine and Systems Biology

Ka Yee Yeung [email protected]

Institute of Technology, University of Washington Tacoma 8/12/2015

1  

Page 2: Applications of Bayesian Model Averaging in Personalized

Road map

•  Introduction to big biological data
•  Framework of supervised machine learning methods and applications
•  Bayesian Model Averaging (BMA): framework and intuition
•  Application 1: gene signature discovery
•  Application 2: gene networks

2  

Page 3: Applications of Bayesian Model Averaging in Personalized

3  

“High-Throughput BioTech”

Sensors: DNA sequencing, microarrays/gene expression, mass spectrometry/proteomics, protein-protein & DNA-protein interactions

Controls: cloning, gene knock-out/knock-in, RNAi

Floods of data

“Grand Challenge” problems Courtesy: Larry Ruzzo

Page 4: Applications of Bayesian Model Averaging in Personalized

Big Data in Biology (reference: Marx 2013)

4  http://www.nature.com/nature/journal/v498/n7453/full/498255a.html

Page 5: Applications of Bayesian Model Averaging in Personalized

Biology as a data-rich science

•  High-throughput technologies can measure the activity levels of many biological entities at once.

•  For example, sequencing and microarray technologies can measure the expression (RNA) levels of all genes at the same time.

•  My research focuses on the development of machine learning methods for these high-dimensional data.

5  

Page 6: Applications of Bayesian Model Averaging in Personalized

6  

Computational biology: an iterative approach

[Cycle diagram: experiments / high-throughput assays → data handling → mathematical modeling → back to experiments]

Integration of multiple forms of experiments and knowledge.

Page 7: Applications of Bayesian Model Averaging in Personalized

•  An initiative launched by NIH in 2012.
•  Addresses challenges in using biomedical big data:
–  Locating data and software tools.
–  Getting access to the data and software tools.
–  Standardizing data and metadata.
–  Extending policies and practices for data and software sharing.
–  Organizing, managing, and processing biomedical Big Data.
–  Developing new methods for analyzing & integrating biomedical data.
–  Training researchers who can use biomedical Big Data effectively.

7  http://bd2k.nih.gov/about_bd2k.html#sthash.xs2j0lpi.dpbs

Page 8: Applications of Bayesian Model Averaging in Personalized

8  Figure from: http://www.pfizer.ie/personalized_med.cfm

Application 1: Personalized (or precision) Medicine

Page 9: Applications of Bayesian Model Averaging in Personalized

9  

Page 10: Applications of Bayesian Model Averaging in Personalized

Initiative on Precision Medicine
President Obama, State of the Union Address, Jan 20, 2015

•  https://www.whitehouse.gov/blog/2015/01/30/precision-medicine-initiative-data-driven-treatments-unique-your-own-body

10  

Page 11: Applications of Bayesian Model Averaging in Personalized

11  

Page 12: Applications of Bayesian Model Averaging in Personalized

12  

Personalized Medicine

Tailor treatment based on genetic information from individual patients, using machine learning methods. Application: predicting clinical outcomes in cancer patients.

Page 13: Applications of Bayesian Model Averaging in Personalized

13  

Classification (supervised learning)

Training data (columns are patient samples; rows are variables/genes):

         E1  E2  E3  E4   E’
Gene 1   -2  +2  +2  -1   -1
Gene 2   +8  +3   0  +4   +5
Gene 3   -4  +5  +4  -2   -3
Gene 4   -1  +4  +3  -1   -1
Label     0   1   1   0    ?

Goal: predict the label of the new sample E’. Feature selection identifies which genes to use.

Page 14: Applications of Bayesian Model Averaging in Personalized

Objectives of classification

•  Class prediction: – Predict patients with a given clinical outcome

(y, class label, response) •  Feature selection:

–  Identification of a minimal set of relevant genes for future prediction

•  Identification of “biologically” interesting genes

14  

Page 15: Applications of Bayesian Model Averaging in Personalized

Steps in classification

15  

Training set (labeled samples + data) → classification algorithm + feature selection algorithm → classifier and set of relevant variables → applied to the test set (unlabeled data) → predicted labels (classes) for samples in the test set.

Page 16: Applications of Bayesian Model Averaging in Personalized

Classification methods http://cran.r-project.org/web/views/MachineLearning.html

16  

Method                   Example R package
Logistic regression      glm
K nearest neighbor       class
Support vector machine   e1071
Decision trees           rpart
LASSO                    glmnet
Ensemble methods         randomForest, BMA

Page 17: Applications of Bayesian Model Averaging in Personalized

Cross Validation (CV)

•  An easy and useful method to estimate the prediction error.
•  Can also be used to optimize classifiers and predictive models.
•  Method (m-fold cross-validation):
–  Split the data into m approximately equally sized subsets.
–  Train the classifier on (m−1) subsets.
–  Test the classifier on the remaining subset; estimate the prediction error by comparing the predicted class labels with the true class labels.
–  Repeat.
•  Examples:
–  10-fold CV [Ambroise et al. PNAS 2002] http://www.pnas.org/content/99/10/6562.abstract
–  Leave-one-out cross-validation (LOOCV)
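The m-fold procedure above can be sketched in a few lines. This is a generic illustration, not the talk's actual pipeline: the `fit`/`predict` callbacks and the toy nearest-centroid classifier are hypothetical stand-ins.

```python
import random
from statistics import mean

def cv_error(X, y, fit, predict, m=10, seed=0):
    # m-fold CV: shuffle, split into m roughly equal folds, train on
    # (m-1) folds, test on the held-out fold, and average the error.
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::m] for k in range(m)]
    errors = []
    for k, test in enumerate(folds):
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = [predict(model, X[i]) != y[i] for i in test]
        errors.append(mean(wrong))
    return mean(errors)

# Toy stand-in classifier: nearest class centroid on a single feature.
def fit_centroid(X, y):
    return {c: mean(x for x, lab in zip(X, y) if lab == c) for c in set(y)}

def predict_centroid(model, x):
    return min(model, key=lambda c: abs(x - model[c]))
```

LOOCV is the special case where m equals the number of samples.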

17  

Page 18: Applications of Bayesian Model Averaging in Personalized

18  

Top 50 genes for ALL vs. AML

Science. 1999 Oct 15;286(5439):531-7. Cited 10,460 times!

Page 19: Applications of Bayesian Model Averaging in Personalized

19  

Choosing relevant genes

•  Use the training set only
•  Ideal:
–  Different typical expression patterns in the two classes
–  Little variation within each class

Page 20: Applications of Bayesian Model Averaging in Personalized

Feature selection (or variable selection)

•  T-test
•  Correlation
•  Many others

Can be formulated as a model selection problem. In this context, a model is a set of relevant features (variables, genes).

20  

Page 21: Applications of Bayesian Model Averaging in Personalized

Model Selection

•  Exhaustive search
•  Stepwise selection
–  Forward selection
–  Backward elimination
•  Multivariate model selection
–  Bayesian Model Averaging
–  Regularized methods, e.g. LASSO

21  

High dimensionality challenge: # variables >>> # observations

Page 22: Applications of Bayesian Model Averaging in Personalized

22  

Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et. al. 1999], [Yeung et al. 2005]

•  A multivariate variable selection technique:
–  Takes advantage of the dependencies between genes to reduce the total number of predictive genes.
•  Most gene selection methods consider genes individually and select a single set of predictive genes at a time.
•  Advantages of BMA:
–  Fewer selected genes
–  Probabilities for predictions, selected genes and selected models
•  BMA averages over predictions from several models:

prediction = Σ_k (prediction using model k) × Pr(model k)
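The averaging formula above, as a minimal sketch; the per-model predictors and their posterior probabilities are hypothetical placeholders supplied by the caller.

```python
def bma_predict(x, models, posteriors):
    # BMA: weight each model's prediction by that model's posterior
    # probability, then sum over the model set.
    total = sum(posteriors)
    assert abs(total - 1.0) < 1e-9, "posteriors should be normalized"
    return sum(p * predict(x) for predict, p in zip(models, posteriors))
```

For example, averaging two linear predictors 2x and 4x with posteriors 0.75 and 0.25 gives 2.5 at x = 1.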

Page 23: Applications of Bayesian Model Averaging in Personalized

How to choose a set of “good” models?

•  All possible models → way too many!
–  E.g. 2^30 ≈ 1 billion, 2^50 ≈ 10^15, etc.
•  The BMA solution:
1.  “Leaps and bounds” [Furnival and Wilson 1974]: when the number of variables (genes) is ≤ 30, a reduced set of good models can be produced efficiently (branch and bound).
2.  Cut down the number of models further:
–  Occam’s window: discard models that are much less likely than the best model.
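Occam's window can be sketched as a simple filter on posterior model probabilities; the cutoff factor `c` here is a tunable choice, not a value taken from the talk.

```python
def occams_window(model_probs, c=20.0):
    # Keep only models whose posterior probability is within a factor
    # `c` of the best model; discard the rest.
    best = max(model_probs.values())
    return {m: p for m, p in model_probs.items() if p >= best / c}
```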

23  

Page 24: Applications of Bayesian Model Averaging in Personalized

24  

Iterative BMA (iBMA)  

Model selection for high-dimensional data:
1.  Univariate ranking step
2.  Iteratively apply BMA to a fixed window of variables

[Figure: genes × experiments data matrix]

Yeung et al. Bioinformatics (2005)  

Page 25: Applications of Bayesian Model Averaging in Personalized

25  

Iterative BMA (iBMA)  

Model selection for high-dimensional data:
1.  Univariate ranking step
2.  Iteratively apply BMA to a fixed window of variables

Discard variables with low posterior probabilities.
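The window-and-discard loop can be sketched as follows. In the real iBMA method the scores are BMA posterior inclusion probabilities; here a caller-supplied `inclusion_probs` callback stands in for that step, and the window size and cutoff are illustrative.

```python
def iterative_bma(ranked_vars, inclusion_probs, window=30, keep=0.01):
    # iBMA sketch: feed univariately-ranked variables into a fixed-size
    # window, score the window (in the real method, by BMA posterior
    # inclusion probabilities), discard low-posterior variables, and
    # refill the freed slots from the ranked queue.
    queue = list(ranked_vars)
    active = []
    while queue:
        while queue and len(active) < window:
            active.append(queue.pop(0))
        probs = inclusion_probs(active)          # var -> posterior prob
        active = [v for v in active if probs[v] >= keep]
        if len(active) == window and queue:
            # nothing was discarded: drop the weakest to make progress
            active.remove(min(active, key=lambda v: probs[v]))
    return active
```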

Page 26: Applications of Bayesian Model Averaging in Personalized

Chronic myeloid leukemia (CML)

26  

CML is characterized by a reciprocal translocation between chromosomes 9 and 22 yielding the Bcr-Abl fusion protein.

http://www.cancer.gov/types/leukemia/patient/cml-treatment-pdq  

Page 27: Applications of Bayesian Model Averaging in Personalized

Treatment options for CML patients

Current treatment “recipe”:
1.  Imatinib mesylate (IM): a tyrosine kinase inhibitor (TKI) which inhibits BCR-ABL and its downstream targets.
2.  Monitor the patients’ response.
3.  Options for IM resistance:
–  Second-line TKI
–  Stem cell transplantation

Early prediction is the key: if a patient is going to progress quickly → higher priority for transplantation.

27  Figure: Jerry Radich 2007, 2nd Annual Congress of the National Comprehensive Cancer Network.

[Bar chart: patients resistant to imatinib (%) by phase — early CP 16%, late CP 26%, AP 73%, BC 95%]

Page 28: Applications of Bayesian Model Averaging in Personalized

Chronic myeloid leukemia (CML)

•  CML = cancer of white blood cells
•  Drugs are highly effective in the early stage of CML
•  Drugs are NOT effective in the late stage
•  Given: gene expression data studying patients in early vs. late stage of CML
•  Question: can we find genes predictive of the stage (and hence, treatment) of CML patients?

28

Page 29: Applications of Bayesian Model Averaging in Personalized

Biomarker discovery in CML

29

Phase determines response to therapy → tailor therapy to individual patients.

CML progression data (early stage vs. late stage) → multivariate feature selection method → signature genes to predict progression of disease → lab validation.

Oehler and Yeung et al. Blood 2009, 114:3292-8

Page 30: Applications of Bayesian Model Averaging in Personalized

Super 6: signature genes derived from CML microarray data

30  

Cross validation: average prediction accuracy = 99.2%

Accession number  Gene symbol  Gene name
NM_016355         DDX47        DEAD (Asp-Glu-Ala-Asp) box polypeptide 47
NM_004258         IGSF2        immunoglobulin superfamily, member 2
NM_000752         LTB4R        leukotriene B4 receptor
NM_014062         ART4         ADP-ribosyltransferase 4 (Dombrock blood group)
NM_005505         SCARB1       scavenger receptor class B, member 1
NM_005888         SLC25A3      solute carrier family 25 (mitochondrial carrier; phosphate carrier), member 3

Oehler and Yeung et al. Blood 2009, 114:3292-8

Page 31: Applications of Bayesian Model Averaging in Personalized

A common problem in biomarker selection

The expression levels of many genes are correlated when measured across a limited number of conditions.
•  There are usually multiple sets of genes that are equally (or nearly equally) predictive.
•  There may be little direct connection between the predictive genes and the biology of interest.

31  

Page 32: Applications of Bayesian Model Averaging in Personalized

Data integration

32

CML progression data (early vs. late stage) + specific expert knowledge (reference genes known to be associated with CML progression) + predicted functional relationships → signature genes:
•  Predictive of early vs. late CML
•  Biologically relevant

Yeung et al. Bioinformatics 2012, 28(6):823-830

Page 33: Applications of Bayesian Model Averaging in Personalized

Our network-driven algorithm

33  

1.  Start with the FLN and threshold the edges.
2.  Locate the reference genes on the FLN.
3.  Get all genes connected to these reference genes.
4.  Run BMA → signature genes.

Our network-driven algorithm constrains the search to predictive genes that are functionally related to genes known to be associated with CML.

Page 34: Applications of Bayesian Model Averaging in Personalized

34  

Legend: reference genes (pink) BMA selected genes (orange)

Accession #  Gene symbol  Probability (%)
AB037729     RALGDS       22.3
NM_006148    LASP1        16.6
NM_000402    G6PD         15.1
AK000242     RALGDS       15.1
NM_001619    ADRBK1       15.1
M92439       LRPPRC       13.7
NM_002786    PSMA1        13.7

Average prediction accuracy = 99.1% in cross validation.

Page 35: Applications of Bayesian Model Averaging in Personalized

Can genes predictive of CML progression be used to predict outcomes after transplantation?

35  

Expression of our signature genes prior to transplant is associated with relapse in 169 chronic phase CML patients.

After adjustment for variables known to affect transplant outcomes, our 2012 gene signature (RALGDS, LASP1, G6PD, ADRBK1, LRPPRC, PSMA1) correlates with relapse after allogeneic transplantation in CP CML patients. In CP patients we found that an increase of 0.2 in the predicted probability correlated with a 46% increase in relapse (HR = 1.46 (1.06–2.02), p = .02).

[Figure: fraction of relapse over time]

Our BMA models predicted relapse better than any single gene.

Page 36: Applications of Bayesian Model Averaging in Personalized

Application 2: gene networks

36

Page 37: Applications of Bayesian Model Averaging in Personalized

Constraining candidate regulators

•  Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators.

•  For each gene g, we estimated how likely it is (a priori) that each regulator R regulates g, using the supervised framework and the external data sources.

37

[Diagram: regulators R1, R2, R3 with edges into gene g]

Graphical representation of the network as a set of nodes and edges. Goal: infer the parent nodes (regulators) for each gene g using the time series expression data.

Page 38: Applications of Bayesian Model Averaging in Personalized

38

Yeast time series data

BY (lab) × RM (wild) cross → 95 segregants.
Phenotype: RNA levels in response to drug perturbation, measured at 6 time points.
DNA genotype data for each segregant.

Experimental design:
•  Time dependencies: ordering of regulatory events.
•  Genotype data: correlate DNA variations in the segregants with measured expression levels.

Array Express E-MTAB-412
Genetics of global gene expression. Rockman & Kruglyak. 2006.

Page 39: Applications of Bayesian Model Averaging in Personalized

39  

Time series data: pictorial view

Expression data (observed phenotype): genes (~6000) × segregants (95, plus BY and RM) × time (6).
Genotype data (0 = RM; 1 = BY; 2 = missing): markers (~3000) × segregants (95, plus BY and RM).

Page 40: Applications of Bayesian Model Averaging in Personalized

Regression-based approach

Let X(g,t,s) = expression level of gene g at time t in segregant s

40

X(g, t, s) = Σ_{R: potential regulator} β_{g,R} · X(R, t−1, s) + ε_g

[Diagram: potential regulators R at time t−1 with edges into gene g at time t]

Variable selection: use the expression levels at time (t−1) to predict the expression levels at time t in the same segregant.
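The regression on this slide can be illustrated one regulator at a time with ordinary least squares on lagged pairs. This is a simplified univariate sketch for intuition; the actual method selects among many candidate regulators jointly.

```python
def lag1_ols(target, regulator):
    # Regress X(g, t) on X(R, t-1): pair each time point t >= 1 of the
    # target with the regulator's level at t - 1, then fit
    # y = beta * x + intercept by least squares.
    y = target[1:]
    x = regulator[:-1]
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return beta, my - beta * mx
```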

Page 41: Applications of Bayesian Model Averaging in Personalized

41  

Supervised learning: integration of external data

External data sources (expression data, genome-wide binding data, literature, and other data such as protein-protein interaction, genetic interaction, and genotype) → supervised learning → a regulators × genes matrix giving the probability that R regulates g (e.g. 0.95, 0.23, 0.78, …).

Time series expression data + regulators constrained by the external data sources → variable selection → gene regulatory network.

Yeung et al. PNAS 2011, 108(48):19436-41
Lo et al. BMC Systems Biology 2012, 6:101
Young et al. BMC Systems Biology 2014, 8:47

Page 42: Applications of Bayesian Model Averaging in Personalized

Integration of external data

42

From the external data (expression data, genome-wide binding data, literature, other data), compute variables (Xi) that capture evidence of regulation for each (TF, gene) pair.

Training data: positive (Y = 1) vs. negative (Y = 0) training examples. Apply logistic regression to determine the weights (αi) of the Xi.

Output: a regulators × genes matrix giving the probability that R regulates g (e.g. 0.95, 0.23, 0.78, …).

Page 43: Applications of Bayesian Model Averaging in Personalized

43  

Category | Dataset | Biological relevance | Independent variable (xi) | Weight ai | Posterior probability (%)

Co-expression | Rosetta compendium data; environmental stress data; Stanford microarray data | Do R and g show co-expression across diverse experimental conditions? | xi = correlation between R and g | 2.35; 1.74; −2.26 | 100; 100; 100

Genome-wide binding data | ChIP-chip data | Does the potential regulator bind upstream of gene g in vivo? | xi = log(p-value) | −0.96 | 100

Genotype data | Cis-regulation | Does sequence variation of R correlate with the expression level of a nearby gene? | xi = 1 if R is cis-regulated, 0 otherwise | 0 | 0

Curated knowledge from the literature | GO terms | Do gene g and regulator R share the same annotations? | xi = number of common GO slim terms between R and g | 0.20 | 100

Curated knowledge from the literature | Known regulatory role | Is regulator R known to exhibit a regulatory role? | xi = 1 if R has a documented regulatory role in SGD, 0 otherwise | 2.79 | 100

Page 44: Applications of Bayesian Model Averaging in Personalized

Correcting the sampling rates between positive and negative training samples

•  In practice, we expect positive regulatory relationships to be much rarer than negative regulatory relationships.
•  Case-control studies (Breslow et al., 1980; Lachin, 2000):
–  Add an offset of −log(p1/p0) to the logistic regression model, where p1 = positive sampling rate (rare).
•  In-degree distribution (Guelzim et al., 2002): exponential decay
–  Each target gene is regulated by approximately t = 2.76 transcription factors on average.
•  Supervised learning step:
–  583 positive examples and 444 negative examples.
–  p1 = 583/(6000 × 2.76) = 3.52%
–  p0 = 444/[6000 × (6000 − 2.76)] = 0.0012%
–  Therefore, we scale all the predicted odds by a factor of p1/p0 = 2853.

44 Lo et al. BMC Systems Biology 2012, 6:101

Page 45: Applications of Bayesian Model Averaging in Personalized

45  

Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et. al. 1999], [Yeung et al. 2005]

•  A multivariate variable selection technique:
–  Takes advantage of the dependencies between genes to reduce the total number of predictive genes.
•  Most gene selection methods consider genes individually and select a single set of predictive genes at a time.
•  Advantages of BMA:
–  Fewer selected genes
–  Probabilities for predictions, selected genes and selected models
•  BMA averages over predictions from several models:

prediction = Σ_k (prediction using model k) × Pr(model k)

Page 46: Applications of Bayesian Model Averaging in Personalized

Prior model probabilities in BMA

•  Intuition: favor models consisting of candidate regulators that are strongly supported by external data

•  Let π_rg = prior probability that regulator r regulates gene g.
•  δ_kr = 1 if regulator r is an inferred regulator in model Mk; δ_kr = 0 otherwise.
•  Use these prior model probabilities to compare models in the Occam’s window step:

Pr(Mk) = Π_r π_rg^δ_kr · (1 − π_rg)^(1 − δ_kr)

46
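The prior above transcribes directly into code; the regulator names and probabilities in the usage note are illustrative, not values from the talk.

```python
def prior_model_prob(pi, model):
    # Pr(M_k) = prod over regulators r of
    #   pi_rg^delta_kr * (1 - pi_rg)^(1 - delta_kr),
    # where delta_kr = 1 iff regulator r is included in model M_k.
    # `pi` maps regulator -> pi_rg; `model` is the set of included
    # regulators.
    prob = 1.0
    for r, p in pi.items():
        prob *= p if r in model else (1.0 - p)
    return prob
```

For example, with priors {R1: 0.9, R2: 0.5, R3: 0.1}, the model containing only R1 has prior 0.9 × 0.5 × 0.9 = 0.405; the priors over all 2^3 models sum to 1.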

Page 47: Applications of Bayesian Model Averaging in Personalized

47  

ScanBMA

•  Data transformation
•  Implementation of the ScanBMA algorithm for efficient model space search
•  Integration of an informed prior
•  Incorporation of Zellner’s g-prior

Bioconductor package “networkBMA”

Young et al. BMC Systems Biology 2014, 8:47

Page 48: Applications of Bayesian Model Averaging in Personalized

48  

Simulated data (DREAM4 time series data)

The R package minet implementations of ARACNE and MRNET were used. ebdbnet is an R package for empirical Bayes dynamic Bayesian networks.

Young et al. BMC Systems Biology 2014, 8:47

Page 49: Applications of Bayesian Model Averaging in Personalized

49  

ScanBMA: running time

Method    Average running time per gene (yeast data)    Projected running time for 20,000 genes
ScanBMA   0.04 seconds    13.3 minutes
ARACNE    70.4 seconds    16.3 days
CLR       7.9 seconds     43.9 hours
MRNET     500 seconds     115.7 days
ebdbnet   failed          expected to fail

The R package minet implementations of ARACNE, CLR and MRNET were used. ebdbnet is an R package for empirical Bayes dynamic Bayesian networks.

Page 50: Applications of Bayesian Model Averaging in Personalized

Assessment

•  Recovery of known regulatory relationships:
–  An independent assessment criterion based on the literature.
–  YEASTRACT (Yeast Search for Transcriptional Regulators And Consensus Tracking) is a curated repository of regulatory associations between transcription factors (TFs) and target genes in yeast, based on bibliographic references.
–  Regulatory relationships used in the supervised step were subtracted.
•  Lab validation of selected sub-networks

50

Page 51: Applications of Bayesian Model Averaging in Personalized

Direct evidence: compare edges inferred in our network to the independent assessment criteria.

If T→G (an edge in our network) is a documented regulatory relationship in the assessment criteria → direct evidence!

51

Page 52: Applications of Bayesian Model Averaging in Personalized

Direct evidence: contingency table. Is there an association between our inferred network and regulatory relationships from Yeastract?

                                     TF-gene pair in Yeastract?
                                     yes    no
Edge in the constructed    yes       TP     FP
network?                   no        FN     TN

True positive rate (TPR) or recall = TP/(TP + FN)
False positive rate (FPR) = FP/(FP + TN)
Precision = TP/(TP + FP)
Accuracy = (TP + TN)/(TP + FN + FP + TN)
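The four rates, computed from a 2×2 table; plugging in the counts from the ScanBMA network evaluation (18, 251, 21, 782) reproduces the reported 46% TPR.

```python
def confusion_metrics(tp, fp, fn, tn):
    # Rates from a 2x2 table of inferred edges vs. documented
    # TF-gene pairs.
    return {
        "TPR": tp / (tp + fn),                       # recall
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```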

52

Page 53: Applications of Bayesian Model Averaging in Personalized

Direct evidence: contingency table. Is there an association between our inferred network and regulatory relationships from Yeastract?

Applied ScanBMA to a 100-gene subset of the time series data → threshold edges at posterior probability 50% → the resulting network contains 100 nodes and 439 edges.

                                     TF-gene pair in Yeastract?
                                     yes    no
Edge in the constructed    yes       18     251
network?                   no        21     782

True positive rate (TPR) = 18/(18 + 21) = 46%

53

Page 54: Applications of Bayesian Model Averaging in Personalized

54  

ScanBMA: results

Method               TPR    AUROC  AUPRC  TP     FP
LASSO                0.046  0.51   0.042  996    20,469
ARACNE               0.205  0.50   0.040  69     268
CLR                  0.039  0.51   0.044  8,879  220,942
MRNET                0.039  0.51   0.044  8,737  214,757
ScanBMA[20] (95%)    0.391  0.60   0.075  227    353
ScanBMA[3556] (95%)  0.274  0.63   0.074  127    336

Page 55: Applications of Bayesian Model Averaging in Personalized

55  

Page 56: Applications of Bayesian Model Averaging in Personalized

Road map

•  Introduction to big biological data
•  Framework of supervised machine learning methods and applications
•  Bayesian Model Averaging (BMA): framework and intuition
•  Application 1: gene signature discovery
•  Application 2: gene networks

56  

Page 57: Applications of Bayesian Model Averaging in Personalized

57  

Thank you’s

University of Washington - Seattle: Roger Bumgarner, Adrian Raftery, Ken Dombek, Chris Fraley, Kenneth Lo, Chad Young
Fred Hutchinson Cancer Research Center: Jerry Radich, Vivian Oehler
University of Washington - Tacoma: Ling Hong Hung, Maciej Fronczuk

Funding sources: R01GM084163, R01GM084163-02S2, R01GM084163-05S1, U54-HL127624

Page 58: Applications of Bayesian Model Averaging in Personalized

58  

BMA: Mathematical Details

Posterior probability of model Mk:

Pr(Mk | D) = Pr(D | Mk) Pr(Mk) / Σ_{l ∈ B} Pr(D | Ml) Pr(Ml)

where Pr(D | Mk) = ∫ Pr(D | θk, Mk) Pr(θk | Mk) dθk

BMA averaged prediction:

Pr(Δ | D) = Σ_{k=1}^{K} Pr(Δ | D, Mk) · Pr(Mk | D)

BIC of Mk = n · log(1 − Rk²) + pk · log(n)

where pk = number of variables in model Mk (not including the intercept), Rk² = R² for model Mk, and n = number of cases.

Comparing two models: 2 log Bjk = BICk − BICj, where Bjk = Pr(D | Mj)/Pr(D | Mk) is the Bayes factor.

Posterior odds = Bayes factor × prior odds:

Pr(Mj | D)/Pr(Mk | D) = [Pr(D | Mj)/Pr(D | Mk)] × [Pr(Mj)/Pr(Mk)]

Page 59: Applications of Bayesian Model Averaging in Personalized

Assessment

•  Recovery of known regulatory relationships:
–  We showed significant enrichment between our inferred network and the assessment criteria.
•  Lab validation of selected sub-networks
•  Comparison to other methods in the literature.

59

[Diagram: child nodes of selected TFs, comparing WT vs. ΔTF strains; genes that respond to deletion under rapamycin perturbation]

Page 60: Applications of Bayesian Model Averaging in Personalized

60

Systematic name | Common name | # references in SGD | # child nodes in network A | Expression pattern over time | Known binding site in JASPAR? | Description from SGD

YDR421W | ARO80 | 19 | 51 | increasing over time | yes | Zinc finger transcriptional activator of the Zn2Cys6 family; activates transcription of aromatic amino acid catabolic genes in the presence of aromatic amino acids.

YML113W | DAT1 | 17 | 57 | decreasing over time | no | DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; the Arg side chain in its N-terminal pentad Gly-Arg-Lys-Pro-Gly repeat is required for DNA binding; not essential for viability.

YBL103C | RTG3 | 83 | 47 | increasing over time | yes | Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways.

Page 61: Applications of Bayesian Model Averaging in Personalized

Comparing our networks to the deletion data

61

Our inferred network vs. validation experiment:

Deleted TF | # child nodes (inferred network) | Genes that respond to the deletion | # overlap | Fisher's test p-value
ARO80 | 51 | 10 | 4 | 9.3 × 10^-6
DAT1 | 57 | 784 | 20 | 0.04
RTG3 | 47 | 2288 | 39 | 0.03

Page 62: Applications of Bayesian Model Averaging in Personalized

62

Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes.

Aro80p is a known regulator of ARO9 and ARO10 (Iraqui et al. Molecular and Cellular Biology 1999, 19:3360-3371).

Page 63: Applications of Bayesian Model Averaging in Personalized

63

Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes. Magenta: Target genes with known ARO80 binding site.

Amazingly, all 4 genes that respond to deletion (ARO9, ARO10, NAF1, ESBP6) contain the known ARO80 binding site upstream!

Page 64: Applications of Bayesian Model Averaging in Personalized

Gene targets of transcription factors: transcription factor binding sites

64  

Page 65: Applications of Bayesian Model Averaging in Personalized

Discovery of new binding site: DAT1

65

E-value = 4.5e-30

No known binding sites for DAT1 in JASPAR or SGD.

Page 66: Applications of Bayesian Model Averaging in Personalized

66  

Choosing a set of relevant genes (S)

•  Want: genes highly expressed in each class

•  Choose genes with the highest and lowest S2N=P(g,c) scores.

Page 67: Applications of Bayesian Model Averaging in Personalized

67  

Neighborhood Analysis

•  Goal: could a high S2N score arise by chance?
•  Idea: compare the number of gene vectors within a neighborhood of fixed size around c with the number of gene vectors around a random permutation of c.

Page 68: Applications of Bayesian Model Averaging in Personalized

68  

(Univariate) gene selection

•  Use the expression pattern of each gene individually.
•  Class vector c, e.g. (0,0,0,0,0,1,1,1,1,1).
•  g: expression vector of a gene over all the samples.
•  µ1: average expression level of gene g in class 1.
•  σ1: standard deviation of gene g in class 1.

S2N = P(g, c) = (µ1 − µ2) / (σ1 + σ2)
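The signal-to-noise score as code. This sketch uses population standard deviations and treats label 1 as class 1, both of which are conventions chosen here rather than stated on the slide.

```python
from statistics import mean, pstdev

def s2n(g, c):
    # S2N = P(g, c) = (mu1 - mu2) / (sigma1 + sigma2), where the class
    # vector c splits the samples of gene vector g into two classes.
    x1 = [x for x, lab in zip(g, c) if lab == 1]
    x2 = [x for x, lab in zip(g, c) if lab == 0]
    return (mean(x1) - mean(x2)) / (pstdev(x1) + pstdev(x2))
```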

Page 69: Applications of Bayesian Model Averaging in Personalized

69  

Prediction by Weighted Voting

•  Each gene in S casts one vote for one class.
•  The weight of a gene g depends on S2N = P(g, c).
•  x: expression level of gene g in the new sample.
•  b: decision boundary, b = (µ1 + µ2)/2.

V(g) = weight(g) × distance(x, b) = P(g, c) × (x − (µ1 + µ2)/2)

Positive votes are votes for class 1; negative votes are votes for class 2.

Page 70: Applications of Bayesian Model Averaging in Personalized

70  

Weighted Voting (cont’d)

•  V1 = sum of all positive votes (class 1)
•  V2 = sum of the absolute values of all negative votes (class 2)
•  Winner = max(V1, V2)
•  Prediction strength:

PS = (Vwinner − Vloser) / (Vwinner + Vloser)

Choose a PS threshold using cross-validation.
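The voting and prediction-strength computations combined into one sketch; the per-gene parameters in the test are made-up numbers, not values from the talk.

```python
def weighted_vote(sample, genes):
    # genes: per selected gene, a (weight, mu1, mu2) triple, where the
    # weight is the gene's S2N score; sample: matching expression levels.
    # Each gene votes v = weight * (x - b) with boundary b = (mu1+mu2)/2;
    # positive votes favor class 1, negative votes favor class 2.
    v1 = v2 = 0.0
    for x, (w, mu1, mu2) in zip(sample, genes):
        v = w * (x - (mu1 + mu2) / 2)
        if v > 0:
            v1 += v
        else:
            v2 += -v
    winner = 1 if v1 > v2 else 2
    ps = abs(v1 - v2) / (v1 + v2)  # prediction strength
    return winner, ps
```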

Page 71: Applications of Bayesian Model Averaging in Personalized

Ensemble Learning Methods

•  Combine results from multiple learning methods
•  “Wisdom of crowds” http://www.nature.com/nmeth/journal/v9/n8/full/nmeth.2016.html
•  E.g. BMA, majority voting
•  http://en.wikipedia.org/wiki/Ensemble_learning

71