Deep learning frameworks for regulatory genomics and epigenomics. ANSHUL KUNDAJE, Genetics, Computer science, Stanford University. Chuan Sheng Foo, Johnny Israeli, Nicholas Sinnott-Armstrong, Avanti Shrikumar

Post on 01-Aug-2020

Page 1:

Deep learning frameworks for regulatory genomics and epigenomics

ANSHUL KUNDAJE

Genetics, Computer science

Stanford University

Chuan Sheng Foo

Johnny Israeli

Nicholas Sinnott-Armstrong

Avanti Shrikumar

Page 2:

Local chromatin architecture of regulatory elements

[Figure: local chromatin architecture of a regulatory element, labeled with TFs, nucleosomes, sequence motifs, and histone marks. Adapted from Shlyueva et al. (2014) Nature Reviews Genetics.]

Page 3:

Combinatorial chromatin states define broad classes of elements

[Figure: chromatin states: Promoter, Transcribed, Enhancer, Repressed, Quiescent, CTCF, Poised, Heterochromatin.]

Thurman et al. (2012) Nature

Page 4:

ATAC-seq: genome-wide chromatin accessibility from low input material

Paired-end sequencing

ATAC-seq peaks identify chromatin accessible regulatory elements

Page 5:

Buenrostro et al. (2013) Nature Methods.

ATAC-seq reveals chromatin architecture in genome-wide fragment length distributions
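As a rough illustration of what a fragment length distribution is, the sketch below counts fragment lengths from paired-end alignments. The `(start, end)` coordinate tuples and the function name are hypothetical; a real pipeline would take them from an alignment file.

```python
from collections import Counter

def fragment_length_histogram(fragments):
    """Count fragments per length, where length = end - start (half-open coordinates)."""
    return Counter(end - start for start, end in fragments)

# Hypothetical paired-end fragments as (start, end) genomic coordinates.
fragments = [(100, 160), (200, 260), (300, 480), (500, 680), (700, 1080)]
hist = fragment_length_histogram(fragments)

for length in sorted(hist):
    print(length, hist[length])
```

Short fragments tend to come from nucleosome-free DNA and longer ones from DNA wrapped around one or more nucleosomes, which is why the shape of this histogram carries architectural information.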

Page 6:

Chromatin architecture reflects chromatin state

[Figure: fragment length distributions (x-axis: fragment lengths, 50 to 1000) for each chromatin state: Promoter, CTCF, Enhancer, Bivalent, Transcribed, Heterochromatin, Repressed, Quiescent.]

Page 7:

Position-aware 2D fragment length distributions (V-plots)

[Figure: V-plots spanning -1000bp to +1000bp around the peak summit. x-axis: midpoint of fragment; y-axis: fragment length (50 bp to 400 bp). Panels: aggregate plot for CTCF state; plot at single CTCF site (sparse and noisy). Annotations: accessible binding site; positioned nucleosome.]

V-plots were first introduced by Henikoff et al. 2011, PNAS
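A V-plot can be thought of as a 2D count matrix. The sketch below is one minimal way to build it, assuming fragments are given as hypothetical `(start, end)` tuples; the function name, the default window of 1000 bp on each side, and the 50 to 400 bp length range are illustrative choices, not the talk's actual code.

```python
def build_vplot(fragments, summit, flank=1000, min_len=50, max_len=400):
    """Return a count matrix: rows = fragment length, columns = midpoint offset from summit."""
    n_rows = max_len - min_len + 1
    n_cols = 2 * flank + 1
    vplot = [[0] * n_cols for _ in range(n_rows)]
    for start, end in fragments:
        length = end - start
        offset = (start + end) // 2 - summit  # fragment midpoint relative to summit
        if min_len <= length <= max_len and -flank <= offset <= flank:
            vplot[length - min_len][offset + flank] += 1
    return vplot

# Hypothetical fragments around a peak summit at position 5000.
fragments = [(4970, 5030), (4800, 4980), (5020, 5200), (3000, 3100)]
vplot = build_vplot(fragments, summit=5000)
total = sum(sum(row) for row in vplot)
print(len(vplot), len(vplot[0]), total)
```

A single site contributes only a handful of counts to a matrix of this size, which is why the single-site plot is sparse and noisy while the aggregate is smooth.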

Page 8:

Can we predict chromatin states/histone marks at ATAC-peaks?

2kb around ATAC-peak

Which of 8 chromatin states?

Which histone mark is present?

Image classification task!

Page 9:

Deep neural networks (DNNs) for image classification

Lee et al. (2009), ICML

Output label: Trump? Or Sanders?

Input: image pixel values

Page 10:

Deep neural networks (DNNs) for V-plot classification

Input: V-plots

Nucleosomes, open chromatin

+1 / -1 nucleosome, segments of open chromatin

Nucleosome arrays, large regions of open chromatin

Output label: Chromatin state/histone mark

Page 11:

An artificial neuron

Sigmoid

ReLU (Rectified Linear Unit)
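To make the two activation functions concrete, here is a minimal sketch of a single artificial neuron: a weighted sum of inputs plus a bias, passed through either a sigmoid or a ReLU. The weights, bias, and inputs are made-up numbers for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive values through unchanged, clamp negatives to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, followed by a nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

x = [1.0, 2.0, -1.0]   # hypothetical inputs
w = [0.5, -0.25, 1.0]  # hypothetical weights
b = 0.5                # hypothetical bias

print(neuron(x, w, b, sigmoid))
print(neuron(x, w, b, relu))
```

Here the weighted sum is negative, so the ReLU neuron is silent while the sigmoid neuron still emits a value below 0.5.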

Page 12:

Convolutional filters

Feature map

Input image

Conv. Filter

Page 13:

Convolutional filters

Feature map

Input image
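The relationship between input image, filter, and feature map can be sketched as follows. This is the sliding dot product that deep learning libraries call convolution (strictly, cross-correlation), with no padding and stride 1; the tiny image and the edge-detecting filter are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[i + a][j + b] * kernel[a][b]
             for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]

# Hypothetical image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds where intensity rises from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the filter straddles the dark-to-bright edge, which is the sense in which a filter detects a local pattern wherever it occurs.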

Page 14:

Convolutional layer: multiple filters learn distinct features

Page 15:

Pooling layers: locally smooth signal
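A minimal sketch of pooling follows. It uses non-overlapping max pooling, which is an assumption: the slide does not say whether max or average pooling was used, nor the window size.

```python
def max_pool2d(feature_map, size):
    """Non-overlapping max pooling with a square window of side `size`."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [max(feature_map[r * size + a][c * size + b]
             for a in range(size) for b in range(size))
         for c in range(cols)]
        for r in range(rows)
    ]

# Hypothetical sparse feature map.
feature_map = [
    [1, 0, 0, 3],
    [0, 2, 0, 0],
    [0, 0, 5, 0],
    [4, 0, 0, 0],
]
pooled = max_pool2d(feature_map, 2)
print(pooled)
```

Each output cell summarizes a 2 x 2 neighborhood, so small shifts in where a signal falls inside that neighborhood no longer change the result.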

Page 16:

How does a deep conv. neural network transform the raw V-plot input at each layer?

Example inputs: Pure CTCF, Promoter, Enhancer

V-Plot Input (300 x 2001)

Initial Smoothing

1st set of Convolutional Maps

2nd Smoothing

2nd set of Convolutional Maps

3rd Smoothing

1st Fully Connected Layer

2nd Fully Connected Layer

Class Probabilities

Chromatin State

[Figure: three V-plot panels, each spanning -1Kb to +1Kb around position 0, with a second axis running from 0 to 500.]
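The final "Class Probabilities" step can be illustrated with a softmax, which turns the last layer's raw scores into probabilities that sum to 1. That the network ends in a softmax is an assumption here, and the scores below are invented.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1 (numerically stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Pure CTCF", "Promoter", "Enhancer"]
scores = [2.0, 0.5, 0.1]  # hypothetical outputs of the last fully connected layer
probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(predicted, probs)
```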

Page 17:

After initial pooling (smoothing)

[Figure: Pure CTCF, Promoter, and Enhancer V-plots after the Initial Smoothing layer. Network diagram: V-Plot Input (300 x 2001), Initial Smoothing, 1st set of Convolutional Maps, 2nd Smoothing, 2nd set of Convolutional Maps, 3rd Smoothing, 1st Fully Connected Layer, 2nd Fully Connected Layer, Class Probabilities, Chromatin State.]

Page 18:

Second set of convolutional maps

[Figure: Pure CTCF, Promoter, and Enhancer V-plots at the 2nd set of Convolutional Maps. Network diagram: V-Plot Input (300 x 2001), Initial Smoothing, 1st set of Convolutional Maps, 2nd Smoothing, 2nd set of Convolutional Maps, 3rd Smoothing, 1st Fully Connected Layer, 2nd Fully Connected Layer, Class Probabilities, Chromatin State.]

Page 19:

Learning from multiple 1D functional data (e.g. DNase, MNase)

[Figure: two parallel branches, one per input. 1D MNase signal (1 x 2001): 1st Convolution Layer, 2nd Convolution Layer, 3rd Convolution Layer. 1D DNase signal (1 x 2001): 1st Convolution Layer, 2nd Convolution Layer, 3rd Convolution Layer. Followed by: 1st FC Layer, 2nd FC Layer, Class Probabilities, Chromatin State.]

Scan DNase profile using filter
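Scanning a 1D profile with a filter can be sketched as below. Each signal gets its own scan, and the per-signal summaries are then concatenated as a crude stand-in for the layers where the two branches meet. The profiles, the filter, and the max-over-positions summary are all hypothetical simplifications.

```python
def scan_1d(signal, kernel):
    """Slide a 1D filter along a signal (stride 1, no padding), then apply ReLU."""
    k = len(kernel)
    return [max(0.0, sum(signal[i + j] * kernel[j] for j in range(k)))
            for i in range(len(signal) - k + 1)]

dnase = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]  # hypothetical DNase profile
mnase = [3.0, 1.0, 0.0, 1.0, 3.0, 3.0]  # hypothetical MNase profile
peak_filter = [-1.0, 2.0, -1.0]          # hypothetical filter that responds to local peaks

dnase_features = scan_1d(dnase, peak_filter)
mnase_features = scan_1d(mnase, peak_filter)

# One summary value per branch, concatenated into a joint feature vector.
combined = [max(dnase_features), max(mnase_features)]
print(dnase_features, mnase_features, combined)
```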

Page 20:

Learning from raw DNA sequence

Score sequence using filters

Convolutional layers learn motif (PWM) like filters

Higher layers learn motif combinations

Class Probabilities
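Scoring a sequence with a motif-like filter can be sketched as follows: the sequence is one-hot encoded and a filter of the same four-channel layout is slid along it. The example sequence and the exact-match GATAA filter are illustrative; a learned filter would hold real-valued weights, much like a PWM.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def scan_sequence(encoded, motif_filter):
    """Score every window of the encoded sequence against the filter."""
    width = len(motif_filter)
    return [
        sum(encoded[i + j][c] * motif_filter[j][c]
            for j in range(width) for c in range(len(ALPHABET)))
        for i in range(len(encoded) - width + 1)
    ]

# Hypothetical filter that adds 1 for every position matching GATAA.
gata_filter = one_hot("GATAA")
sequence = "GCATTACCGATAA"
scores = scan_sequence(one_hot(sequence), gata_filter)
best = scores.index(max(scores))
print(best, sequence[best:best + len(gata_filter)], scores)
```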

Page 21:

The Chromputer

Integrating multiple inputs (1D, 2D signals, sequence) to simultaneously predict multiple outputs

Outputs (class probabilities): Chromatin State, TF Binding, H3K4me3, H3K4me1, H3K9me3, H3K27me3, H2A.Z, H3K36me3

[Figure: three parallel input branches, each with Initial Smoothing, 1st set of Convolutional Maps, 2nd Smoothing, 2nd set of Convolutional Maps, 3rd Smoothing, 1st Combined FC Layer, and 2nd Combined FC Layer. Other labels: V-Plot Input (300 x 2001), 1st FC Layer, 2nd FC Layer.]

Multi-task learning
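One common way to train a network on several outputs at once is to sum a per-task loss. The sketch below does this with binary cross-entropy over a few histone-mark tasks; the loss choice, the equal task weighting, and the numbers are assumptions, not details given in the talk.

```python
import math

def binary_cross_entropy(p, y):
    """Loss for one binary task, with the probability clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def multitask_loss(predictions, labels):
    """Sum per-task losses; both arguments map task name -> value."""
    return sum(binary_cross_entropy(predictions[task], labels[task]) for task in labels)

# Hypothetical labels and predicted probabilities for one genomic site.
labels = {"H3K4me3": 1, "H3K27me3": 0, "H3K36me3": 0}
predictions = {"H3K4me3": 0.9, "H3K27me3": 0.2, "H3K36me3": 0.1}
loss = multitask_loss(predictions, labels)
print(loss)
```

Because every task's loss flows back through the same shared layers, features that help one output can be reused by the others.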

Page 22:

Chromatin architecture can predict chromatin state in held out chromosome (same cell type)

Model + Input data types                          8-class chromatin state accuracy (%)
Majority class (baseline)                         42%
Gene proximity                                    59%
Random Forest: ATAC-seq (150M reads)              61%
Chromputer: DNase (60M reads)                     68.1%
Chromputer: MNase (1.5B reads)                    69.3%
Chromputer: ATAC-seq (150M reads)                 75.9%
Chromputer: DNase + MNase                         81.6%
Chromputer: ATAC-seq + sequence                   83.5%
Chromputer: DNase + MNase + sequence              86.2%
Label accuracy across replicates (upper bound)    88%
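For readers unfamiliar with the majority-class baseline, the sketch below shows how it and plain accuracy are computed. The labels are invented and far smaller than any real evaluation set.

```python
from collections import Counter

def accuracy(predicted, truth):
    """Fraction of sites whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def majority_baseline(truth):
    """Accuracy obtained by always predicting the most common true label."""
    return Counter(truth).most_common(1)[0][1] / len(truth)

# Hypothetical labels for five sites.
truth = ["Quiescent", "Quiescent", "Enhancer", "Promoter", "Quiescent"]
predicted = ["Quiescent", "Enhancer", "Enhancer", "Promoter", "Quiescent"]

print(accuracy(predicted, truth), majority_baseline(truth))
```

A model is only informative to the extent that it beats the majority baseline and approaches the replicate-based upper bound.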

Page 23:

High cross cell-type chromatin state prediction

• Learn model on DNase and MNase only

• Learn on GM12878, predict on K562 (and vice versa)

• Requires local normalization to make signal comparable

8-class chromatin state accuracy

Train ↓ / Test →   GM12878   K562
GM12878            0.816     0.818
K562               0.769     0.844
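The slide does not say which local normalization was used. As one plausible illustration only, the sketch below z-scores each window by its own mean and standard deviation, which removes differences in overall signal scale such as sequencing depth. The function name and signals are hypothetical.

```python
import statistics

def local_zscore(signal):
    """Center and scale one window's signal by its own mean and standard deviation."""
    mean = statistics.mean(signal)
    std = statistics.pstdev(signal)
    if std == 0:
        return [0.0 for _ in signal]
    return [(x - mean) / std for x in signal]

# The same hypothetical profile measured at two different depths.
shallow = [1.0, 2.0, 6.0, 2.0, 1.0]
deep = [10.0 * x for x in shallow]

print(local_zscore(shallow))
print(local_zscore(deep))
```

After normalization the two profiles coincide, so a model trained on one cell type sees inputs on the same scale in the other.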

Page 24:

Predicting individual histone marks from ATAC/DNase/MNase/Sequence

[Figure: bar chart of area under the precision-recall curve (y-axis, 0 to 0.75) for CTCF, H3K27ac, H3K4me3, H3K4me1, H3K9ac, H2A.Z, H3K36me3, H3K27me3, H3K9me3.]
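The area under the precision-recall curve can be estimated with the step-wise average-precision formula sketched below. Whether the talk used this estimator or an interpolated one is not stated; the scores and labels are invented, and the sketch assumes at least one positive and no tied scores.

```python
def average_precision(scores, labels):
    """Step-wise estimate of the area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos

scores = [0.9, 0.8, 0.7, 0.4, 0.2]  # hypothetical model outputs
labels = [1, 0, 1, 0, 0]            # hypothetical ground truth (1 = mark present)
ap = average_precision(scores, labels)
print(ap)
```

Unlike accuracy, this metric stays informative when positives are rare, which is the usual situation for marks and binding sites across the genome.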

Page 25:

Chromputer trained on TF ChIP-seq predicts cross cell-type in-vivo TF binding with high accuracy

[Figure: Chromputer area under the precision-recall curve for c-MYC, YY1, and CTCF.]

Inputs: Seq + DNA shape + DNase profile

Positives: Reproducible ChIP-seq peaks

Negatives: All other DNase peaks + flanks + matched random sites

Test sets: Held out chromosomes in held out cell types

Page 26:

DeepLIFT: Scoring predictive power of features in Deep Neural Networks

• LIFT: Linear Importance Feature Tracker, or LIFTing the top off the black box.

• Provides a predictive ‘importance score’ for
– any raw input feature (e.g. pixels in V-Plot images, each nucleotide in sequence)
– intermediate learned features (e.g. convolutional filters)

• Linear breakdown of contribution of each input to immediate outputs
– Recursively apply to get contribution of any input to any output
– Can be computed efficiently with a single backpropagation (unlike in-silico mutagenesis)
– Less susceptible to buffering effects than in-silico mutagenesis

• Technical details:
– ReLU networks: equivalent to Taylor approximation of change in softmax/sigmoid logit if input eliminated.
– i.e. gradient (of the logit w.r.t. the input) * input
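The gradient-times-input rule named in the technical details can be sketched on a toy network. The example below uses one hidden ReLU layer with no biases and hand-written backpropagation; the weights and input are invented, and it covers only this special case rather than any released DeepLIFT code.

```python
def relu_network_logit(x, W1, w2):
    """Forward pass: hidden = ReLU(W1 x), logit = w2 . hidden (no biases)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(v * h for v, h in zip(w2, hidden))

def gradient_times_input(x, W1, w2):
    """Per-input contribution d(logit)/d(x_i) * x_i, from a single backward pass."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    # The ReLU passes gradient only where its input was positive.
    grad_hidden = [v if z > 0 else 0.0 for v, z in zip(w2, pre)]
    grad_x = [sum(g * row[i] for g, row in zip(grad_hidden, W1))
              for i in range(len(x))]
    return [g * xi for g, xi in zip(grad_x, x)]

W1 = [[1.0, -1.0, 0.0],
      [0.0, 2.0, 1.0]]   # hypothetical first-layer weights
w2 = [1.0, 0.5]          # hypothetical output weights
x = [2.0, 1.0, 1.0]      # hypothetical input features

logit = relu_network_logit(x, W1, w2)
contributions = gradient_times_input(x, W1, w2)
print(logit, contributions)
```

In this bias-free toy network the contributions add up exactly to the logit, which makes the linear breakdown easy to see; note that the second input gets zero credit because its positive and negative paths cancel.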

Page 27:

What architectural properties of the ATAC-seq V-plots predict different chromatin states?

What is the change in classification probability relative to an unbiased classifier if we *only* consider the contributions from each pixel?

CTCF state: centered binding, symmetric phased nucleosomes

Promoter state: broad regions of accessible chromatin

Enhancer state: localized signal, heterogeneity

Page 28:

Architectural heterogeneity of accessible elements in different chromatin states

Low dimensional t-SNE embedding (like PCA) of accessible sites (colored by chromatin state)

Distinct classes of enhancers

Page 29:

Top scoring MNase filters and activating input patterns for CTCF state

[Figure: top scoring MNase filters alongside maximally activating input MNase profiles.]

Well phased arrays of nucleosomes

Page 30:

Top scoring MNase filters and activating input patterns for promoter state

[Figure: top scoring MNase filters alongside maximally activating input MNase profiles.]

Asymmetric positioning with well positioned +1/-1 nucleosomes

Page 31:

What useful patterns can we extract from raw DNA sequence models?

Input sequence: G C A T T A C C G A T A A

Gata1: G A T A A

Motif discovery! What PWM-like ‘filters’ are predictive?

Which nucleotides in the input sequence are contributing to binding?

Page 32:

Top sequence filters for CTCF state

Canonical motif

Page 33:

High resolution point binding events and sequence grammars at CTCF peaks

MNase-seq

DNase-seq

CTCF ChIP-seq

Nuc. level importance (height of letter) shows coordination of multiple point binding events

Page 34:

Context-specific reuse of regulatory sequence in chromatin accessibility changes during hematopoiesis

ATAC-seq

Ryan Corces-Zimmerman, Jason Buenrostro, Will Greenleaf, Howard Chang, Ravi Majeti

Page 35:

Deep learning sequence determinants of chromatin accessibility

Peyton Greenside

Input: Raw DNA sequence (G C A T T A C C G A T A A)

Output: Accessible (+1) vs. not accessible (0)

Cell types: HSC, B-cells, Erythroid

Page 36:

Context-specific re-use of regulatory sequence in HSC, B-cells and Erythroblasts

Peyton Greenside

[Figure: x-axis: position along sequence. Motif labels: Gata (Rc), Gata (Rc), Gata, SPI1.]

Page 37:

Context-specific re-use of regulatory sequence in HSC, B-cells and Erythroblasts

Peyton Greenside

[Figure: importance in HSCs (y-axis) by position along sequence (x-axis). Motif label: SPI1. Tracks: SPI1 ChIP-seq, GATA1 ChIP-seq, ATAC-seq. Annotations: ?, ?.]

Page 38:

Context-specific re-use of regulatory sequence in HSC, B-cells and Erythroblasts

Peyton Greenside

[Figure: importance in B-cells (y-axis) by position along sequence (x-axis). Tracks: SPI1 ChIP-seq, GATA1 ChIP-seq, ATAC-seq. Annotations: No peak, No peak, Not expressed.]

Page 39:

Context-specific re-use of regulatory sequence in HSC, B-cells and Erythroblasts

Peyton Greenside

[Figure: importance in Erythroid (y-axis) by position along sequence (x-axis). Motif labels: Gata (Rc), Gata (Rc), Gata, SPI1. Tracks: SPI1 ChIP-seq, GATA1 ChIP-seq, ATAC-seq.]

Page 40:

YY1 & GATA

GATA, SPI1, RUNX2

STAT1 & GATA

TEAD4 & GATA

AP1 in B-cells only

ETS & GATA

…and much, much more

Page 41:

Summary and ongoing work

• New predictive deep learning framework (Chromputer) for integrative genomics

• New interpretation engine for deep learning models. We can extract predictive features (motifs, grammars, footprints, architecture features) from the deep neural networks

• Local chromatin architecture is predictive of chromatin state and histone marks within and across cell types

• We can predict in-vivo binding profiles of TFs in new cell types from sequence + shape + DNase/ATAC-seq with high accuracy

• Context-specific reuse of sequence grammars in accessible sites

• Extensions: From binary to continuous signal prediction

• Extensions: Functional variant (QTL, GWAS, rare variant) prediction from raw sequence models

bioRxiv paper and code coming soon!

Page 42:

Acknowledgements

Will Greenleaf

Chuan Sheng Foo

Kundaje Lab members

Nathan Boley

Rahul Mohan

Johnny Israeli

Nicholas Sinnott-Armstrong

Avanti Shrikumar

Peyton Greenside

Funding

R01ES02500902

U41-HG007000-04S1

U01HG007919-02 (GGR)

Conflict of Interest: Deep Genomics (SAB), Epinomics (SAB)

Page 43:

Guess the element from the V-plot: AI vs. human

What is this regulatory element? Pure CTCF, Promoter, or Enhancer?

Page 44:

It's an enhancer!

1st Fully Connected Layer

2nd Fully Connected Layer

Enhancer (correct)