-
6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences
Lecture 6: Regulatory genomics Gene regulation, chromatin accessibility,
DNA regulatory code
Prof. Manolis Kellis
http://mit6874.github.io Slides credit: 6.047, Anshul Kundaje, David Gifford
-
Deep Learning for Regulatory Genomics 1. Biological foundations: Building blocks of Gene Regulation
– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq
2. Classical methods for Regulatory Genomics and Motif Discovery – Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.
3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations – Key idea: pixels → DNA letters. Patches/filters → Motifs. Higher layers → combinations – Learning convolutional filters → Motif discovery. Applying them → Motif matches
4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures – DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis
-
1a. Basics of gene regulation
-
One Genome – Many Cell Types
ACCAGTTACGACGGTCAGGGTACTGATACCCCAAACCGTTGACCGCATTTACAGACGGGGTTTGGGTTTTGCCCCACACAGGTACGTTAGCTACTGGTTTAGCAATTTACCGTTACAACGTTTACAGGGTTACGGTTGGGATTTGAAAAAAAGTTTGAGTTGGTTTTTTCACGGTAGAACGTACCGT
TACCAGTA
Image source: Wikipedia
-
DNA packaging
• Why packaging: DNA is very long, but the cell is very small
• Compression: a chromosome is 50,000 times shorter than its extended DNA
• Using the DNA: before a piece of DNA is used for anything, this compact structure must open locally
• Now emerging: role of accessibility, state in chromatin itself, role of 3D interactions
-
Combinations of marks encode epigenomic state
• 100s of known modifications, many new still emerging • Systematic mapping using ChIP-, Bisulfite-, DNase-Seq
• Promoters: H3K4me3, H3K9ac, DNase
• Transcribed: H3K36me3, H3K79me2, H4K20me1
• Enhancers: H3K4me1, H3K27ac, DNase
• Repressed: H3K9me3, H3K27me3, DNA methylation
-
Summarize multiple marks into chromatin states
ChromHMM: multi-variate hidden Markov model WashU Epigenome Browser
30+ epigenomics marks
Chromatin state track summary
-
Promoter region Enhancer region Protein-coding sequence
Transcription factors control activation of cell-type-specific promoters and enhancers
-
TFs use DNA-binding domains to recognize specific DNA sequences in the genome
DNA-binding domain of Engrailed
“Logo” or “motif”
TAATTA CACGTG AGATAAGA
TCATTA
-
Regulator structure → recognized motifs
• Proteins ‘feel’ DNA
- Read chemical properties of the bases
- Do NOT open the DNA (no base complementarity)
• 3D topology dictates specificity
- Fully constrained positions: every atom matters
- “Ambiguous”/degenerate positions: loosely contacted
• Other types of recognition
- MicroRNAs: complementarity. Nucleosomes: GC content. RNAs: structure/sequence combination
-
Motifs summarize TF sequence specificity
•Summarize information
•Integrate many positions
•Measure of information
•Distinguish motif vs. motif instance
•Assumptions: - Independence - Fixed spacing
-
Regulatory motifs at all levels of pre-/post-transcriptional regulation
• The parts list: ~20-30k genes
- Protein-coding genes, RNA genes (tRNA, microRNA, snRNA)
• The circuitry: constructs controlling gene usage
- Enhancers, promoters, splicing, post-transcriptional motifs
• The regulatory code, complications:
- Combinatorial coding of ‘unique tags’
- Data-centric encoding of addresses
- Overlaid with ‘memory’ marks
- Large-scale on/off states
- Modulation of the large-scale coding
- Post-transcriptional and post-translational information
• Today: discovering motifs in co-regulated promoters, de novo motif discovery & target identification
Enhancer regions, promoter motifs: where in the body? When in time? Which variants?
Splicing signals: which subsets?
Motifs at RNA level
-
Disrupted motif at the heart of FTO obesity locus
Obese
Lean
Strongest association with obesity
C-to-T disruption of AT-rich regulatory motif
Restoring motif restores thermogenesis
-
1b. Technologies for probing gene regulation
-
Bar-coded multiplexed sequencing
Mapping regulator binding: ChIP-seq (Chromatin immunoprecipitation followed by sequencing) TF=transcription factor
antibody
-
ChIP-chip and ChIP-Seq technology overview
Image adapted from Wikipedia
Antibody to a TF or modification (modification-specific antibodies)
Chromatin Immuno-Precipitation followed by:
ChIP-chip: array hybridization
ChIP-Seq: massively parallel next-gen sequencing
-
ChIP-Seq Histone Modifications: What the raw data looks like
• Each sequence tag is 30 base pairs long
• Tags are mapped to unique positions in the ~3 billion base reference genome
• The number of reads depends on sequencing depth, typically on the order of 10 million mapped reads
-
Chromatin accessibility can reveal TF binding
Sherwood, RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape” Nat. Biotech 2014.
-
DNase-seq reveals genome protection profiles
-
ATAC-seq
-
GM12878, Chr. 14, Each point is accessibility in a 2 kb window
ATAC-seq and DNase-seq are not identical
Hashimoto TB, et al. “A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility” Genome Research 2016
-
DNase-seq is less direct evidence than ChIP-seq
ChIP-seq reports TF-binding locations (specifically)
DNase-seq reports accessible, TF-proximal locations, not binding itself (noisily)
-
Bound factors leave distinct DNase-seq profiles
CTCF Brg Oct4 Zfx Esrrb
motif
Aggregate CTCF:
Individual CTCF:
Individual binding site prediction is difficult
-
~650,000 TF Motifs
~50,000 binding sites for a typical TF
Motifs can predict TF binding
Binding sites change across time
-
Chromatin accessibility influences transcription factor binding
•Modeling accessibility profiles yields binding predictions and pioneer factor discovery
•Asymmetric accessibility is induced by directional pioneers
•The binding of settler factors can be enabled by proximal pioneer factor binding
Sherwood, RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape” Nat. Biotech 2014.
-
Deep Learning for Regulatory Genomics 1. Biological foundations: Building blocks of Gene Regulation
– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq
2. Classical methods for Regulatory Genomics and Motif Discovery – Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.
3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations – Key idea: pixels → DNA letters. Patches/filters → Motifs. Higher layers → combinations – Learning convolutional filters → Motif discovery. Applying them → Motif matches
4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures – DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis
-
2. Classical regulatory genomics (before Deep Learning)
-
Enrichment-based discovery methods Given a set of co-regulated/functionally related genes,
find common motifs in their promoter regions
• Align the promoters to each other using local alignment • Use expert knowledge for what motifs should look like • Find ‘median’ string by enumeration (motif/sample driven) • Start with conserved blocks in the upstream regions
-
Starting positions ↔ Motif matrix
• Given aligned sequences, it is easy to compute the profile matrix:

Pos   1    2    3    4    5    6    7    8
A    0.1  0.1  0.3  0.2  0.1  0.3  0.1  0.3
C    0.1  0.5  0.2  0.1  0.1  0.2  0.1  0.2
G    0.6  0.2  0.2  0.5  0.6  0.1  0.7  0.2
T    0.2  0.2  0.3  0.2  0.2  0.4  0.1  0.3

(shared motif)
• Given the profile matrix, it is easy to find starting-position probabilities
Key idea: iterative procedure for estimating both, given uncertainty (a learning problem with hidden variables, the starting positions): expectation maximization
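The iterative procedure sketched here can be written as a toy EM loop (an illustrative simplification, not MEME itself; one motif occurrence per sequence, a fixed width, and a uniform background are assumptions):

```python
import numpy as np

BASES = "ACGT"

def em_motif(seqs, width, iters=50, seed=0):
    """Toy EM for motif discovery: one occurrence of a fixed-width motif
    per sequence, uniform background (a simplified MEME-style model)."""
    rng = np.random.default_rng(seed)
    # Hidden variables: Z[i][p] = P(motif starts at position p in sequence i)
    Z = [rng.dirichlet(np.ones(len(s) - width + 1)) for s in seqs]
    W = None
    for _ in range(iters):
        # M-step: re-estimate the profile matrix from the soft start positions
        W = np.full((4, width), 0.25)          # small pseudocounts
        for s, z in zip(seqs, Z):
            for p, wt in enumerate(z):
                for j in range(width):
                    W[BASES.index(s[p + j]), j] += wt
        W /= W.sum(axis=0)
        # E-step: re-estimate start-position probabilities from the profile
        for i, s in enumerate(seqs):
            lik = np.array([np.prod([W[BASES.index(s[p + j]), j]
                                     for j in range(width)])
                            for p in range(len(s) - width + 1)])
            Z[i] = lik / lik.sum()
    return W, Z

# Three toy sequences with the motif TGCA planted at position 2
seqs = ["AATGCAGG", "CCTGCAAT", "GGTGCACC"]
W, Z = em_motif(seqs, width=4)
print("".join(BASES[k] for k in W.argmax(axis=0)))  # consensus of the learned profile
```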
-
Experimental factor-centric discovery of motifs
SELEX (Systematic Evolution of Ligands by Exponential Enrichment; Klug & Famulok, 1994).
DIP-Chip (DNA-immunoprecipitation with microarray detection; Liu et al., 2005)
PBMs (Protein binding microarrays; Mukherjee, 2004) Double stranded DNA arrays
-
Approaches to regulatory motif discovery
• Expectation Maximization (e.g. MEME) – Iteratively refine positions / motif profile
• Gibbs Sampling (e.g. AlignACE) – Iteratively sample positions / motif profile
• Enumeration with wildcards (e.g. Weeder) – Allows global enrichment/background score
• Peak-height correlation (e.g. MatrixREDUCE) – Alternative to cutoff-based approach
• Conservation-based discovery (e.g. MCS) – Genome-wide score, up-/down-stream bias
• Protein Domains (e.g. PBMs, SELEX) – In vitro motif identification, seq-/array-based
Region-based motif discovery
Genome-wide
In vitro / trans
-
Deep Learning for Regulatory Genomics 1. Biological foundations: Building blocks of Gene Regulation
– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq
2. Classical methods for Regulatory Genomics and Motif Discovery – Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.
3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations – Key idea: pixels → DNA letters. Patches/filters → Motifs. Higher layers → combinations – Learning convolutional filters → Motif discovery. Applying them → Motif matches
4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures – DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis
-
Convolutional layer (same color = shared weights)
Later conv layers operate on outputs of previous conv layers
Conv Layer 1 Kernel width = 4 stride = 2* num filters / num channels = 3 Total neurons = 15
Max=2
2 6 1
Maxpooling layers take the max over sets of conv layer outputs
Maxpooling layer pool width = 2 stride = 1 Conv Layer 2 Kernel width = 3 stride = 1 num filters / num channels = 2 total neurons = 6
Typically followed by one or more fully connected layers
*for genomics, a stride of 1 for conv layers is recommended
P (TF = bound | X) Sigmoid activations
Deep convolutional neural network
G C A T T A C C G A T A A
Max=2
-
3a. CNNs for Regulatory Genomics Foundations (Low-level features)
-
An example of using a CNN to model DNA sequence
NNNATGCAGCANNN
A T G C
Matrix representation of DNA sequence (darker = stronger)
Representing DNA sequence as 2D matrix:
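This 2D-matrix representation is straightforward to produce; a minimal sketch (the convention that ambiguous `N` bases become all-zero columns is an assumption):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x L matrix (rows = A, C, G, T).
    Ambiguous bases such as N become all-zero columns."""
    rows = {"A": 0, "C": 1, "G": 2, "T": 3}
    X = np.zeros((4, len(seq)))
    for j, base in enumerate(seq.upper()):
        if base in rows:
            X[rows[base], j] = 1.0
    return X

X = one_hot("NNNATGCAGCANNN")
print(X.shape)         # (4, 14)
print(X[:, :3].sum())  # 0.0 -- the leading N's contribute nothing
```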
-
Convolution – extracting invariant feature
Applying 4 bp sequence filter along the DNA matrix:
ATGCAGCA on 1st position 3rd position
Yellow = high activity; blue = low activity
-
Convolution – extracting invariant feature
Convolution module
NNNATGCAGCANNN
convolution filters
Matrix representation of DNA sequence (darker = stronger)
filtered signal
ATGCAGCA
rectification (denoising) max pooling
max A T G C
Rectification = ignore signals below some threshold. Pooling = summarize each channel by its max or average.
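Both operations are one-liners on a filtered signal; a toy numpy sketch:

```python
import numpy as np

# A toy filtered signal (one channel of convolution output)
signal = np.array([-0.8, 0.3, 2.1, -0.2, 1.4, 0.1, -1.0, 0.9])

# Rectification: ignore signals below a threshold (here 0, i.e. ReLU)
rectified = np.maximum(signal, 0.0)

# Pooling: summarize non-overlapping windows by max (or mean)
pool = 4
max_pooled  = rectified.reshape(-1, pool).max(axis=1)
mean_pooled = rectified.reshape(-1, pool).mean(axis=1)
print(max_pooled.tolist())   # [2.1, 1.4]
```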
-
Prediction using extracted features map
Convolution module Prediction module
ChIP-seq, PBMs, SELEX Experiments DNA sequence A T G C A G C A N N N
Individual motifs (match filter → max): GCRC, TGRT, ATRc
Higher-level combinations: e.g. GCRC|ATRc → Affinity
[Park and Kellis, 2015]
-
Key properties of regulatory sequence
TRANSCRIPTION FACTOR BINDING Regulatory proteins called transcription factors (TFs) bind to high affinity sequence
patterns (motifs) in regulatory DNA
Transcription factor Regulatory
DNA sequences
Motif
-
Sequence motifs: PWM
Position weight matrix (PWM):

Pos   1    2    3    4    5    6
A     0    0    1    0    1    0.5
C     0.5  0    0    0    0    0
G     0.5  1    0    0    0    0
T     0    0    0    1    0    0.5
Bits
PWM logo https://en.wikipedia.org/wiki/Sequence_logo
GGATAA CGATAA CGATAT GGATAT
Set of aligned sequences Bound by TF
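A PWM is just the column-wise base frequencies of the aligned bound sequences; a minimal sketch reproducing the matrix above:

```python
import numpy as np

BASES = "ACGT"
# Aligned sequences bound by the TF (from the slide)
bound = ["GGATAA", "CGATAA", "CGATAT", "GGATAT"]

counts = np.zeros((4, len(bound[0])))
for s in bound:
    for j, base in enumerate(s):
        counts[BASES.index(base), j] += 1
pwm = counts / counts.sum(axis=0)   # column-wise base frequencies

print(pwm[2].tolist())  # G row: [0.5, 1.0, 0.0, 0.0, 0.0, 0.0]
```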
-
Sequence motifs: PSSM
Position-specific scoring matrix (PSSM)
PSSM logo
Pos    1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5
Accounting for genomic background nucleotide distribution
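A PSSM is the log-odds of the PWM against the background base distribution; a sketch (a uniform background and a 0.01 pseudocount are assumptions here, so the numbers differ from the slide, which uses the genomic background):

```python
import numpy as np

# PWM from the previous slide (rows = A, C, G, T)
pwm = np.array([[0.0, 0.0, 1.0, 0.0, 1.0, 0.5],
                [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
                [0.5, 1.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 1.0, 0.0, 0.5]])

background = np.full(4, 0.25)   # assumption: uniform; use genomic frequencies in practice
pseudo = 0.01                   # avoids log2(0) for bases never seen in the alignment

smoothed = (pwm + pseudo) / (1.0 + 4 * pseudo)   # re-normalized frequencies
pssm = np.log2(smoothed / background[:, None])   # log-odds vs background

# Entries are positive where a base is enriched over background, negative where depleted
```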
-
Scoring a sequence with a motif PSSM
G C A T T A C C G A T A A Input sequence
One-hot encoding (X)
Scoring weights
W
Pos    1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5

PSSM parameters
-
G C A T T A C C G A T A A Input sequence
One-hot encoding (X)
Scoring weights
W
Pos    1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5

Motif match scores sum(W * x): -5.4
Convolution: Scoring a sequence with a PSSM
-
G C A T T A C C G A T A A Input sequence
One-hot encoding (X)
Scoring weights
W
Pos    1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5

Motif match scores sum(W * x): -5.4, 2.0
Convolution
-
G C A T T A C C G A T A A Input sequence
One-hot encoding (X)
Pos    1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5
Scoring weights
W
-2.2 -5.4 2.0 -4.3 -24 -17 -18 -11 -12 16 -5.5 -8.5 -5.2 Motif match Scores
sum(W * x)
Convolution
-
G C A T T A C C G A T A A Input sequence
One-hot encoding (X)
Pos    1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5
Scoring weights
W
-2.2 -5.4 2.0 -4.3 -24 -17 -18 -11 -12 16 -5.5 -8.5 -5.2
Thresholded Motif Scores max(0, W*x)
Motif match Scores W*x
Thresholding scores
0 0 2.0 0 0 0 0 0 0 16 0 0 0
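The scan-then-threshold pipeline of the last few slides, end to end (a sketch; it reports one score per valid offset, while the slide pads its score track to the full sequence length, so the slide's first two entries have no counterpart here):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    X = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        X[BASES.index(b), j] = 1.0
    return X

def scan(X, W):
    """Slide the PSSM W (4 x w) along the one-hot sequence X (4 x L):
    the score at offset p is sum(W * X[:, p:p+w]) -- a 1D convolution."""
    w = W.shape[1]
    return np.array([(W * X[:, p:p + w]).sum()
                     for p in range(X.shape[1] - w + 1)])

X = one_hot("GCATTACCGATAA")
W = np.array([[-5.7, -3.2,  3.7, -3.2,  3.7,  0.6],   # PSSM from the slides
              [ 0.5, -3.2, -3.2, -3.2, -3.2, -5.7],
              [ 0.5,  3.7, -3.2, -3.2, -3.2, -5.7],
              [-5.7, -3.2, -3.2,  3.7, -3.2,  0.5]])

scores = scan(X, W)                     # raw motif-match scores, one per offset
thresholded = np.maximum(scores, 0.0)   # max(0, W*x): keep only strong matches
print(thresholded.round(1).tolist())    # [2.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.9]
```

The strongest match (score ~16) sits at the CGATAA window, as on the slide.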
-
3b. CNNs for Regulatory Genomics Foundations (Higher-level learning)
-
● Positive class of genomic sequences bound by a transcription factor of interest
● Negative class of genomic sequences not bound by a transcription factor of interest
Can we learn patterns in the DNA sequence that distinguish these 2 classes of genomic sequences?
Learning patterns in regulatory DNA sequence
-
HOMOTYPIC MOTIF DENSITY
Regulatory sequences often contain more than one binding instance of a TF resulting in homotypic
clusters of motifs of the same TF
Key properties of regulatory sequence
-
Key properties of regulatory sequence
HETEROTYPIC MOTIF COMBINATIONS
Regulatory sequences often bound by combinations of TFs resulting in heterotypic clusters of motifs of different TFs
-
Key properties of regulatory sequence
SPATIAL GRAMMARS OF HETEROTYPIC MOTIF COMBINATIONS
Regulatory sequences are often bound by combinations of TFs with specific spatial and positional constraints resulting in distinct motif grammars
-
A simple classifier (An artificial neuron)
parameters
Linear function
Z
Training the neuron means learning the optimal w’s and b
-
A simple classifier (An artificial neuron)
Y
Non-linear function
Logistic / sigmoid: useful for predicting probabilities
parameters
Training the neuron means learning the optimal w’s and b
-
A simple classifier (An artificial neuron)
Y
Training the neuron means learning the optimal w’s and b
ReLU (Rectified Linear Unit): useful for thresholding
parameters
Non-linear function
-
Artificial neuron can represent a motif
Y
parameters
-
Convolutional filters learn motifs (PSSM)
Biological motivation of Deep CNN
Max pool thresholded scores over windows
Threshold scores using ReLU
Scan sequence using filters
Predict probabilities using logistic neuron
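The four steps above chain into one forward pass; a minimal single-filter numpy sketch (the filter and output weights are arbitrary stand-ins, not a trained model):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    X = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        X[BASES.index(b), j] = 1.0
    return X

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, conv_filter, w_out, b_out, pool=4):
    """P(TF = bound | X): scan -> ReLU -> max-pool -> logistic neuron."""
    width = conv_filter.shape[1]
    scores = np.array([(conv_filter * X[:, p:p + width]).sum()
                       for p in range(X.shape[1] - width + 1)])   # scan (stride 1)
    scores = np.maximum(scores, 0.0)                              # threshold (ReLU)
    n = len(scores) // pool * pool
    pooled = scores[:n].reshape(-1, pool).max(axis=1)             # max-pool windows
    return sigmoid(w_out @ pooled + b_out)                        # logistic output

X = one_hot("GCATTACCGATAA")                      # 4 x 13
conv_filter = np.array([[ 1.0, -1.0,  1.0,  1.0],  # arbitrary stand-in weights,
                        [-1.0, -1.0, -1.0, -1.0],  # not a trained motif filter
                        [-1.0,  1.0, -1.0, -1.0],
                        [-1.0, -1.0, -1.0,  1.0]])
w_out = np.array([0.8, 0.8])                       # 10 conv outputs -> 2 pooled values
b_out = -1.0
p_bound = forward(X, conv_filter, w_out, b_out)
print(round(float(p_bound), 3))   # 0.646
```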
-
Convolutional layer (same color = shared weights)
Later conv layers operate on outputs of previous conv layers
Conv Layer 1 Kernel width = 4 stride = 2* num filters / num channels = 3 Total neurons = 15
Max=2
2 6 1
Maxpooling layers take the max over sets of conv layer outputs
Maxpooling layer pool width = 2 stride = 1 Conv Layer 2 Kernel width = 3 stride = 1 num filters / num channels = 2 total neurons = 6
Typically followed by one or more fully connected layers
*for genomics, a stride of 1 for conv layers is recommended
P (TF = bound | X) Sigmoid activations
Deep convolutional neural network
G C A T T A C C G A T A A
Max=2
-
Convolutional layer (same color = shared weights)
Later conv layers operate on outputs of previous conv layers
Conv Layer 1 Kernel width = 4 stride = 2 num filters / num channels = 3 Total neurons = 15
Max=2
2 6 1
Maxpooling layers take the max over sets of conv layer outputs
Maxpooling layer pool width = 2 stride = 1 Conv Layer 2 Kernel width = 3 stride = 1 num filters / num channels = 2 total neurons = 6
Typically followed by one or more fully connected layers
Multi-task CNN
G C A T T A C C G A T A A
Max=2
P (TF1 = bound | X) P (TF2 = bound | X) Multi-task output (sigmoid activations here)
-
Convolutional layer (same color = shared weights)
Later conv layers operate on outputs of previous conv layers
Conv Layer 1 Kernel width = 4 stride = 2* num filters / num channels = 3 Total neurons = 15
Max=2 Max=6
2 6 1
Maxpooling layers take the max over sets of conv layer outputs
Maxpooling layer pool width = 2 stride = 1 Conv Layer 2 Kernel width = 3 stride = 1 num filters / num channels = 2 total neurons = 6
Typically followed by one or more fully connected layers
Multi-task CNN
G C A T T A C C G A T A A
-
Deep Learning for Regulatory Genomics 1. Biological foundations: Building blocks of Gene Regulation
– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq
2. Classical methods for Regulatory Genomics and Motif Discovery – Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.
3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations – Key idea: pixels → DNA letters. Patches/filters → Motifs. Higher layers → combinations – Learning convolutional filters → Motif discovery. Applying them → Motif matches
4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures – DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis
-
4. Regulatory Genomics CNNs in Practice: (a) DeepBind
-
DeepBind
[Alipanahi et al., 2015]
-
http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html
http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html
-
Constructing mutation map
NNNATGTAGCANNN
Ref
NNNATGCAGCANNN A T G C
A T G C
Alt
DeepBind Model
p(s_ref | w)
p(s_alt | w)
Δs_j = (p(s_alt | w) − p(s_ref | w)) × max(0, p(s_alt | w), p(s_ref | w))
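The mutation-map construction can be sketched with a stand-in scorer (a toy PSSM scanner squashed through a sigmoid plays the role of the DeepBind model p(s|w) here; the /4.0 temperature is an arbitrary choice to keep the sigmoid out of saturation):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    X = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        X[BASES.index(b), j] = 1.0
    return X

# Toy PSSM used as the stand-in model's single filter
W = np.array([[-5.7, -3.2,  3.7, -3.2,  3.7,  0.6],
              [ 0.5, -3.2, -3.2, -3.2, -3.2, -5.7],
              [ 0.5,  3.7, -3.2, -3.2, -3.2, -5.7],
              [-5.7, -3.2, -3.2,  3.7, -3.2,  0.5]])

def model_score(seq):
    """Stand-in for p(s|w): best PSSM match score, squashed to (0, 1)."""
    X, w = one_hot(seq), W.shape[1]
    best = max((W * X[:, p:p + w]).sum() for p in range(X.shape[1] - w + 1))
    return 1.0 / (1.0 + np.exp(-best / 4.0))

def mutation_map(seq):
    """Delta-s for every position j and alternative base b:
    (p(s_alt|w) - p(s_ref|w)) * max(0, p(s_alt|w), p(s_ref|w))."""
    p_ref = model_score(seq)
    M = np.zeros((4, len(seq)))
    for j in range(len(seq)):
        for k, b in enumerate(BASES):
            if b == seq[j]:
                continue                      # reference base: delta is 0
            p_alt = model_score(seq[:j] + b + seq[j + 1:])
            M[k, j] = (p_alt - p_ref) * max(0.0, p_alt, p_ref)
    return M

M = mutation_map("GCATTACCGATAA")
print(bool(M.min() < 0))   # True: disrupting the strong CGATAA match lowers the score
```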
-
filtered signal
rectification (denoising)
A T G C
Constructing a sequence logo
Test sequence: NNNATGCAGCANNN (ATGCAGCA)
For each filter (Motif 1, Motif 2, …), collect the subsequences that activate it across the test sequences (e.g. Motif 1: GCAG, GCTG, GATG, …, GTAG, GCTG; Motif 2: CAGC, GGTC, AGTC, …, AGGC, GGTG), align them, and compute a position frequency matrix (PFM), visualized as a logo.
-
Predicting disease mutations
[Alipanahi et al., 2015]
-
DeepBind summary
The key deep learning techniques:
• Convolutional learning
• Representational learning
• Back-propagation and stochastic gradient descent
• Regularization and dropout
• Parallel GPU computing, especially useful for hyperparameter search
Limitations of DeepBind:
• Requires defining negative training examples, which is often arbitrary
• Uses observed mutation data only as post-hoc evaluation
• Models each regulatory dataset separately
-
Regulatory Genomics CNNs in Practice: (b) DeepSEA
-
DeepSEA
• Similar to DeepBind, but trained as a single multi-task CNN on 919 ENCODE/Roadmap Epigenomics chromatin features (125 DNase features, 690 TF features, 104 histone features)
• Uses the Δs mutation score as input to train a logistic regression that predicts GWAS and eQTL SNPs, defined from the GRASP database with a P-value cutoff of 1E-10 and GWAS SNPs from the NHGRI GWAS Catalog
[Zhou and Troyanskaya, 2015]
-
Regulatory Genomics CNNs in Practice: (c) Basset
-
Basset: Learning the regulatory code of the accessible genome with deep
convolutional neural networks. David R. Kelley Jasper Snoek John L. Rinn
Genome Research, March 2016
-
Basset
Simultaneously predicting DNase sites in 164 cell types
300 convolution filters
CNN-based Basset outperforms gkm-SVM
Convolutional filters connected to the input sequence recapitulate some known TF motifs
[Kelley et al., 2016]
-
Basset architecture for accessibility prediction
300 filters, 3 conv layers, 3 fully connected layers
Input: 600 bp
Output: 164 bits (1 per cell type)
1.9 million training examples
-
Basset AUC performance vs. gkm-SVM
-
45% of filter derived motifs are found in the CIS-BP database
Motifs created by clustering matching input sequences and computing PWM
-
Motifs derived from filters with more information content tend to be annotated
-
Computational saturation mutagenesis of an AP-1 site reveals
loss of accessibility
-
Regulatory Genomics CNNs in Practice: (d) Chromputer
-
ChromPuter
E2F6 Other TFs
Class Probabilities
CTCF MYC GATA1 SOX2
OCT4 NANOG
2nd FC Layer 1st FC Layer
Multi-task learning
2nd set of Convolutional Maps
1D DNase-seq/ATAC-seq profile DNA sequence
(Anshul Kundaje’s group from Stanford)
-
How does a deep convolutional neural network transform the raw V-plot input at each layer?
Promoter
Enhancer Initial Smoothing
1st set of Convolutional Maps
2nd Smoothing
2nd set of Convolutional Maps
3rd Smoothing
1st Fully Connected Layer
2nd Fully Connected Layer
Class Probabilities
V-Plot Input (300 x 2001)
Chromatin State
Pure CTCF (V-plot x-axis: -1Kb to +1Kb; y-axis: 0-500)
-
After initial pooling (smoothing)
Pure CTCF
Promoter
Enhancer
Initial Smoothing
1st set of Convolutional Maps
2nd Smoothing
2nd set of Convolutional Maps
3rd Smoothing
1st Fully Connected Layer
2nd Fully Connected Layer
Class Probabilities
V-Plot Input (300 x 2001)
Chromatin State
-
Second set of convolutional maps
Pure CTCF
Promoter
Enhancer
Initial Smoothing
1st set of Convolutional Maps
2nd Smoothing
2nd set of Convolutional Maps
3rd Smoothing
1st Fully Connected Layer
2nd Fully Connected Layer
Class Probabilities
V-Plot Input (300 x 2001)
Chromatin State
-
Learning from multiple 1D functional data (e.g. DNase, MNase)
1st Convolution Layer
2nd Convolution Layer
2nd FC Layer
1D MNase signal
(1 x 2001)
Class Probabilities
3rd Convolution Layer
1st FC Layer
1st Convolution Layer
2nd Convolution Layer
3rd Convolution Layer
1D DNase signal
(1 x 2001)
Chromatin State
Scan DNase profile using filter
-
Learning from raw DNA sequence
Higher layers learn motif combinations
Class Probabilities
Score sequence using filters Convolutional layers learn motif (PWM) like filters
-
The Chromputer
Integrating multiple inputs (1D and 2D signals, sequence) to simultaneously predict multiple outputs
Chromatin State
TF Binding
Class Probabilities
H3K4me3
H3K9me3
H3K27me3 H3K4me1
H2A.Z
H3K36me3
2nd FC Layer 1st FC Layer
Initial Smoothing
1st set of Convolutional Maps
2nd Smoothing
2nd set of Convolutional Maps
3rd Smoothing
1st Combined FC Layer
2nd Combined FC Layer
V-Plot Input (300 x 2001)
Initial Smoothing
1st set of Convolutional Maps
2nd Smoothing
2nd set of Convolutional Maps
3rd Smoothing
1st Combined FC Layer
2nd Combined FC Layer
Initial Smoothing
1st set of Convolutional Maps
2nd Smoothing
2nd set of Convolutional Maps
3rd Smoothing
1st Combined FC Layer
2nd Combined FC Layer
Multi-task learning
-
Chromatin architecture can predict chromatin state in a held-out chromosome (same cell type)
Model + input data types → 8-class chromatin state accuracy:
• Majority class (baseline): 42%
• Gene proximity: 59%
• Random Forest: ATAC-seq (150M reads): 61%
• Chromputer: DNase (60M reads): 68.1%
• Chromputer: MNase (1.5B reads): 69.3%
• Chromputer: ATAC-seq (150M reads): 75.9%
• Chromputer: DNase + MNase: 81.6%
• Chromputer: ATAC-seq + sequence: 83.5%
• Chromputer: DNase + MNase + sequence: 86.2%
• Label accuracy across replicates (upper bound): 88%
-
High cross cell-type chromatin state prediction
• Learn model on DNase and MNase only • Learn on GM12878, predict on K562 (and vice versa) • Requires local normalization to make signal comparable
8 class chromatin state accuracy
Train ↓ / Test → GM12878 K562
GM12878 0.816 0.818
K562 0.769 0.844
-
Predicting individual histone marks from ATAC/DNase/MNase/Sequence
Area under precision-recall curve (bar chart, y-axis 0-0.75) for: CTCF, H3K27ac, H3K4me3, H3K4me1, H3K9ac, H2A.Z, H3K36me3, H3K27me3, H3K9me3
-
Chromputer trained on TF ChIP-seq predicts cross cell-type in-vivo TF binding with high accuracy
Chromputer Area under Precision Recall (PR) curve
c-MYC YY1 CTCF
Inputs: Seq + DNA shape + DNase profile Positives: Reproducible ChIP-seq peaks Negatives: All other DNase peaks + flanks + matched random sites Test sets: Held out chromosomes in held out cell types
-
DeepLift reveals feature importance at the input layer
G C A T T A C C G A T A A
Nanog
Gata1
Which neurons/filters are predictive?
Which nucleotides in the input sequence contribute to binding?
Key idea:
• ReLU is piecewise linear
• Backpropagate the difference between outputs under the observed and reference inputs (e.g., an all-zeros input) to obtain gradients w.r.t. the input
• The importance of any input to any output is the gradient weighted by the input itself
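A gradient-times-input reading of this idea, on a toy ReLU motif scorer (a simplification: DeepLIFT proper propagates output differences from a reference input, which this sketch collapses to a zero reference; the filter is a stand-in PSSM, not the actual Chromputer network):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    X = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        X[BASES.index(b), j] = 1.0
    return X

W = np.array([[-5.7, -3.2,  3.7, -3.2,  3.7,  0.6],   # toy motif filter (PSSM),
              [ 0.5, -3.2, -3.2, -3.2, -3.2, -5.7],   # not a trained network
              [ 0.5,  3.7, -3.2, -3.2, -3.2, -5.7],
              [-5.7, -3.2, -3.2,  3.7, -3.2,  0.5]])

def input_importance(seq):
    """Gradient-times-input for f(X) = ReLU(best window score).
    ReLU is piecewise linear, so on the winning window the gradient
    w.r.t. the input is simply W; importance = gradient * input."""
    X, w = one_hot(seq), W.shape[1]
    scores = np.array([(W * X[:, p:p + w]).sum()
                       for p in range(X.shape[1] - w + 1)])
    best = int(scores.argmax())
    imp = np.zeros_like(X)
    if scores[best] > 0:                       # ReLU active on the best window
        imp[:, best:best + w] = W * X[:, best:best + w]
    return imp

imp = input_importance("GCATTACCGATAA")
# Collapsing the one-hot axis gives a per-nucleotide importance track:
print(imp.sum(axis=0).round(1).tolist())
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 3.7, 3.7, 3.7, 3.7, 0.6]
```

Only the bases inside the matching CGATAA window receive credit, mirroring the per-nucleotide importance tracks on the slide.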
(Anshul Kundaje’s group from Stanford)
-
Deep Learning for Regulatory Genomics 1. Biological foundations: Building blocks of Gene Regulation
– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq
2. Classical methods for Regulatory Genomics and Motif Discovery – Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.
3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations – Key idea: pixels → DNA letters. Patches/filters → Motifs. Higher layers → combinations – Learning convolutional filters → Motif discovery. Applying them → Motif matches
4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures – DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis