Illuminating Genetic Networks with Random Forest

ANDREAS BEYER

University of Cologne

Outline

• Random Forest

• Applications

• QTL mapping

• Epistasis (analyzing model structure)

Random Forest: HOW DOES IT WORK?

Random Forest

[Figure: data layout with response Y and predictor matrix X; rows are samples, columns are predictors]

Random Forest

[Figure: decision tree splitting on predictor 1, predictor 2 and predictor 3; leaves are labeled low / class 1 or high / class 2]

Leo Breiman, 2001

RF uses CART

• Classification And Regression Trees

Breiman et al. (1984)

Splitting Rules

Classification: Gini Impurity

minimize:

$$I_G = 1 - \sum_{k=1}^{m} f_k^2$$

$f_k$ = fraction of items labeled $k$

$m$ = number of possible values
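
As a small illustration of the classification rule, the sketch below computes the Gini impurity (one minus the sum of squared class fractions) for a list of labels and for a candidate split. The function names and the size-weighted average over the two child nodes are assumptions made for this example; the slide only states the impurity itself.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted impurity of the two child nodes (assumed CART-style weighting)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


pure = gini_impurity(["a", "a", "a", "a"])     # a single class
mixed = gini_impurity(["a", "a", "b", "b"])    # two classes in equal parts
split = weighted_gini(["a", "a"], ["b", "b"])  # a split that separates the classes
print(pure, mixed, split)
```

A split is preferred when it lowers the weighted impurity of the children relative to the parent node.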

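Splitting Rules

Regression: RSS

minimize:

$$\mathrm{RSS} = \underbrace{\sum_{y_i \in Y_l} (y_i - \bar{y}_l)^2}_{\text{left node}} + \underbrace{\sum_{y_i \in Y_r} (y_i - \bar{y}_r)^2}_{\text{right node}}$$

$y_i$ = i'th item

$Y_l$, $Y_r$ = items in left (right) node

$n_l$, $n_r$ = number of items in left (right) node

$\bar{y}_l = \frac{1}{n_l} \sum_{y_i \in Y_l} y_i$, $\bar{y}_r = \frac{1}{n_r} \sum_{y_i \in Y_r} y_i$ = average of left (right) items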
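
For regression, a split can be searched by trying every threshold on a predictor and keeping the one with the smallest summed RSS of the left and right node. The following is a minimal sketch under that assumption; `rss`, `best_split` and the toy data are hypothetical names and values chosen for illustration.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, rss) for the split x <= threshold that minimizes RSS_left + RSS_right."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right node empty.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (threshold, total)
    return best


x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
threshold, total = best_split(x, y)
print(threshold, total)
```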

Decision trees …

• are nice to interpret,

• but generalize very poorly (large variance)

http://research.microsoft.com

Random Forest

[Figure: response Y and predictor matrix X (rows: samples, columns: predictors); samples are drawn by bootstrap, predictors by random sampling]

Random Forest

Grow many trees!

Average predictions across all trees
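
The two ideas above (bootstrap the samples, then average over many trees) can be sketched in plain Python. This is a deliberately simplified illustration, not the implementation behind the talk: each "tree" is a depth-one stump on one randomly chosen predictor, whereas a real Random Forest grows deep trees and draws candidate predictors at every split. The names `fit_stump`, `fit_forest`, `predict` and the toy data are hypothetical.

```python
import random


def fit_stump(X, y, feature):
    """Best single split on one feature: returns (feature, threshold, left_mean, right_mean)."""
    best = None
    for t in sorted({row[feature] for row in X})[:-1]:
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # feature is constant in this bootstrap sample
        m = sum(y) / len(y)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])


def fit_forest(X, y, n_trees=200, seed=0):
    """Grow many stumps, each on a bootstrap sample and one random predictor."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample rows with replacement
        feature = rng.randrange(p)                  # random choice of predictor
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the predictions of all trees."""
    preds = [lm if row[f] <= t else rm for f, t, lm, rm in forest]
    return sum(preds) / len(preds)


# Toy data: the response depends on predictor 0 only.
X = [[i % 2, (i // 2) % 2] for i in range(20)]
y = [10.0 * row[0] for row in X]
forest = fit_forest(X, y)
high = predict(forest, [1, 0])
low = predict(forest, [0, 0])
print(high, low)
```

Stumps built on the uninformative predictor pull both predictions toward the overall mean, so `high` and `low` differ by less than 10 even though the ordering is preserved.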

Benefits

• Works very well in practice

• Very broadly applicable

• Intuitive algorithm

• Robust, no overfitting

• No assumptions about data

• Virtually no tuning needed

• Very easily parallelizable

• Accounts for complex interactions between features

Drawbacks

• Difficult interpretation

• What is the underlying ‘model’?

• Almost impossible to capture analytically

Predicting Protein-Protein Interactions

[Figure panels: Human, Yeast]

Elefsinioti et al. 2011 Molec. Cell. Prot.

Sarac et al. 2012 Bioinformatics

Proximity measure

Score similarities (or differences) of samples

• very similar (together in 3/3 trees)

• quite similar (together in 2/3 trees)

• different (together in 0/3 trees)
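
The proximity of two samples can be computed as the fraction of trees in which both end up in the same leaf. The sketch below assumes the leaf ids per tree are already known; the ids and sample names are made up for illustration. With scikit-learn, such leaf ids could be obtained from a fitted forest's `apply` method.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two samples fall into the same leaf."""
    together = sum(a == b for a, b in zip(leaves_a, leaves_b))
    return together / len(leaves_a)


# Hypothetical leaf ids of four samples in a forest of three trees.
leaves = {
    "s1": [4, 7, 2],
    "s2": [4, 7, 2],
    "s3": [4, 7, 5],
    "s4": [1, 3, 6],
}

very_similar = proximity(leaves["s1"], leaves["s2"])   # together in 3 of 3 trees
quite_similar = proximity(leaves["s1"], leaves["s3"])  # together in 2 of 3 trees
different = proximity(leaves["s1"], leaves["s4"])      # together in 0 of 3 trees
print(very_similar, quite_similar, different)
```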

Weighted Clustering

Michaelson, Trump, et al. 2011 BMC Genomics

1. Learn model to predict outcome

Features → RF → Outcome

2. Use feature importance as weights

Proximity measure → PAM clustering → Clusters

QTL Analysis

Genetic Association

ACCGTCCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAATTTTGG

ACCGACCGACACGTTTGGACAAGTACGTTGCAACACACCCGTACCAATTTTGG

ACCGACCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAAAATTGG

ACCGTCCCACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG

ACCGACCCACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG

ACCGTCCGACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG

ACCGTCCGACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG

Published Genome-Wide Associations through 2015

Published GWA at p ≤ 5 × 10⁻⁸ for 17 trait categories

NHGRI GWA Catalog, www.genome.gov/GWAStudies

www.ebi.ac.uk/fgpt/gwas/

Quantitative Trait Loci (QTL)

• Sub-type of GWAS

• Must have several causal loci (complex trait)

• Why?

[Figure: trait values compared between allele A and allele a]

Standard approach: t-test
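
The standard approach can be sketched as a marker-by-marker scan: samples are grouped by allele and the trait means of the two groups are compared with a t statistic. The example below uses Welch's form of the statistic and stops short of computing p-values; the function names, marker names and trait values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in mean trait between two allele groups."""
    va, vb = variance(group_a), variance(group_b)
    return (mean(group_a) - mean(group_b)) / sqrt(va / len(group_a) + vb / len(group_b))


def scan_markers(genotypes, trait):
    """genotypes: {marker: list of 'A'/'a' per sample}; returns {marker: t statistic}."""
    scores = {}
    for marker, alleles in genotypes.items():
        upper = [t for g, t in zip(alleles, trait) if g == "A"]
        lower = [t for g, t in zip(alleles, trait) if g == "a"]
        scores[marker] = welch_t(upper, lower)
    return scores


trait = [10.1, 9.8, 10.3, 9.9, 5.2, 4.9, 5.1, 4.8]
genotypes = {
    "m1": ["A", "A", "A", "A", "a", "a", "a", "a"],  # allele tracks the trait
    "m2": ["A", "a", "A", "a", "A", "a", "A", "a"],  # allele unrelated to the trait
}
scores = scan_markers(genotypes, trait)
top = max(scores, key=lambda m: abs(scores[m]))
print(top, scores)
```

Because every marker is tested on its own, this scan cannot see effects that only appear in combination with another locus.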

RF for QTL mapping

Feature Matrix: Genetic Markers → RF → Trait (e.g. body size)

Feature importance = importance of marker
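
To make the link between feature importance and marker importance concrete, the sketch below scores markers with a toy forest of single-split trees. It is an illustration under simplifying assumptions, not the mapping method from the cited papers: importance here is the summed RSS reduction of the chosen splits, a stand-in for the importance measures of a full Random Forest, and all names, parameters and the simulated data are hypothetical.

```python
import random


def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def split_gain(genotype, trait):
    """RSS reduction from splitting the samples by a binary marker (alleles coded 0/1)."""
    left = [t for g, t in zip(genotype, trait) if g == 0]
    right = [t for g, t in zip(genotype, trait) if g == 1]
    if not left or not right:
        return 0.0
    return rss(trait) - rss(left) - rss(right)


def marker_importance(markers, trait, n_trees=300, mtry=2, seed=1):
    """markers: {name: list of 0/1 per sample}. Returns the summed split gain per marker."""
    rng = random.Random(seed)
    names = sorted(markers)
    n = len(trait)
    importance = {name: 0.0 for name in names}
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap the samples
        y = [trait[i] for i in idx]
        candidates = rng.sample(names, mtry)        # random subset of markers
        gains = {m: split_gain([markers[m][i] for i in idx], y) for m in candidates}
        best = max(gains, key=gains.get)
        importance[best] += gains[best]             # credit the marker chosen for the split
    return importance


# Simulated cross: four markers, only marker2 affects the trait.
rng = random.Random(0)
n = 60
markers = {f"marker{j}": [rng.randint(0, 1) for _ in range(n)] for j in range(4)}
trait = [3.0 * markers["marker2"][i] + rng.gauss(0, 0.5) for i in range(n)]

importance = marker_importance(markers, trait)
top_marker = max(importance, key=importance.get)
print(top_marker)
```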

Pathway Consistency

Random Forest

Enrichment of gene pairs in same pathway

Michaelson et al. 2010 BMC Genomics

RF-based QTL Mapping

• Michaelson et al. 2010 BMC Syst. Biol.
  • Comparison using real data

• Ackermann et al. 2012 PLoS ONE
  • Comparison using simulated data (DREAM)

• Picotti, Clément-Ziza et al. 2013 Nature
  • Extracting epistatic interactions

• Ackermann et al. 2013 PLoS Genetics
  • Multiple cell types/conditions

• Clément-Ziza et al. 2014 Molec. Systems Biol.
  • Non-coding genes, antisense transcription

• Stephan et al. 2015 Nat. Commun.
  • Population substructure

• Valenzano et al. 2015 Cell
  • Mapping traits in fish

Epistasis with RF: GETTING A GRIP ON RF STRUCTURE

JAKE MICHAELSON, MATHIEU CLÉMENT-ZIZA, JAN GROßBACH, CORINNA SCHMALOHR

What is Epistasis?

“Non-additive interaction between markers (predictors).”

[Figure: three panels of Trait by genotype (AB, Ab, aB, ab): one additive, two epistatic]
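
A small numeric example may help with the definition. For two loci, additivity means that the effect of one locus is the same on either background of the other; the difference between those two effects measures the departure from additivity. The function name and the trait means below are made up for illustration.

```python
def interaction_contrast(means):
    """means: mean trait per two-locus genotype, keys 'AB', 'Ab', 'aB', 'ab'.

    Returns (AB - Ab) - (aB - ab): the effect of locus B on an A background
    minus its effect on an a background. Zero means the loci act additively.
    """
    return (means["AB"] - means["Ab"]) - (means["aB"] - means["ab"])


additive = {"AB": 4.0, "Ab": 3.0, "aB": 2.0, "ab": 1.0}   # each locus adds a fixed amount
epistatic = {"AB": 4.0, "Ab": 1.0, "aB": 1.0, "ab": 1.0}  # effect only when A and B co-occur

additive_contrast = interaction_contrast(additive)    # (4 - 3) - (2 - 1)
epistatic_contrast = interaction_contrast(epistatic)  # (4 - 1) - (1 - 1)
print(additive_contrast, epistatic_contrast)
```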

Problem with Random Forest

Interaction between variables

need to know model structure!

Finding Epistasis with Decision Trees

[Figure: two decision trees, each splitting first on A / a and then on B / b; one is labeled epistatic, the other additive]

→ compare slopes

Algorithm

1. Learn decision trees

2. Compute slopes (differences) of trait values at splits

3. Collect slopes for ‘left’ and ‘right’ sides (there will be many trees)

4. Compare distributions of slopes. Are they different?

[Figure: distributions of slopes for left and right sides, shown for epistasis and for no epistasis]
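
Steps 1 to 4 can be sketched for a single marker pair as follows. This is a strongly simplified illustration and not the authors' implementation: every "tree" is a bootstrap sample that is split on marker A first, the slope of marker B is then recorded in the left (A = 0) and right (A = 1) child, and the two slope distributions are compared with a crude separation score rather than a formal test. All names and the simulated data are hypothetical.

```python
import random
from statistics import mean, stdev


def slope(genotype_b, trait):
    """Difference in mean trait between the two alleles of marker B (None if one is missing)."""
    hi = [t for g, t in zip(genotype_b, trait) if g == 1]
    lo = [t for g, t in zip(genotype_b, trait) if g == 0]
    if not hi or not lo:
        return None
    return mean(hi) - mean(lo)


def collect_slopes(a, b, trait, n_trees=200, seed=0):
    """Split on marker A, then record B's slope in the left (A=0) and right (A=1) child."""
    rng = random.Random(seed)
    n = len(trait)
    left, right = [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap sample per tree
        for side, store in ((0, left), (1, right)):
            rows = [i for i in idx if a[i] == side]
            s = slope([b[i] for i in rows], [trait[i] for i in rows])
            if s is not None:
                store.append(s)
    return left, right


def slope_separation(left, right):
    """Distance between mean left and right slopes, in units of their average spread."""
    pooled = (stdev(left) + stdev(right)) / 2
    return abs(mean(left) - mean(right)) / pooled


# Simulated markers with the same noise for an additive and an epistatic trait.
rng = random.Random(42)
n = 80
a = [rng.randint(0, 1) for _ in range(n)]
b = [rng.randint(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 0.3) for _ in range(n)]
additive_trait = [2.0 * a[i] + 2.0 * b[i] + noise[i] for i in range(n)]
epistatic_trait = [2.0 * a[i] * b[i] + noise[i] for i in range(n)]

sep_additive = slope_separation(*collect_slopes(a, b, additive_trait))
sep_epistatic = slope_separation(*collect_slopes(a, b, epistatic_trait))
print(sep_additive, sep_epistatic)
```

In the additive case the slope of B is about the same on both sides of the A split, so the two distributions overlap; in the epistatic case B only has an effect on one side, which pushes the distributions apart.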

Validation on real data (Saccharomyces cerevisiae)

Using Costanzo et al. 2010 for validation

[Figure: RF vs. ANOVA; ROC curves (True Positive Rate against False Positive Rate) and precision-recall curves (Precision against Recall)]

Why is RF better?

[Figure: panels A, B, C]

Random Forest

• Extremely versatile

• Robust

• Can analyse structure

http://research.microsoft.com

Acknowledgements

People

Jan Großbach
Johannes Stephan
Mathieu Clément-Ziza
Corinna Schmalohr

Oliver Stegle (EBI)

Ruedi Aebersold (ETH)
Paola Picotti

Jürg Bähler (UCL)
Sam Marguerat
Xavi Masellach

Chris Workman (DTU)
Manos Papadakis

Money

SystemsX.ch
The Swiss Initiative in Systems Biology