DETECTION OF REGULATORY MOTIFS BASED ON COEXPRESSION AND PHYLOGENETIC FOOTPRINTING
PhD presentation Valerie Storms, March 29th, 2011
Promoters: Prof. Dr. Ir. Kathleen Marchal
Prof. Dr. Ir. Bart De Moor
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
All living organisms consist of one or more cells
• E.g. humans:
– Built of multiple cells like nerve cells, muscle cells, skin cells
– Every cell: contains identical genetic information
Genetic information
• Stored as DNA (deoxyribonucleic acid)
• Double helix with sugar-phosphate backbone
• 4 building blocks = “base”
– A: adenine
– C: cytosine
– G: guanine
– T: thymine / U: uracil
• Complementary base pairing -> hydrogen bonds
• Representation: ACCTGCTAG….ATTGACGGAC
Sugar-Phosphate Backbone
Base pair A-T
Base pair G-C
GCGATCGTAGGTAT
CGCTAGCATCCATA
Genetic dogma
DNA contains genes = specific sequences of bases that encode instructions on how to make proteins = work units of a cell
DNA Gene
protein
TRANSLATION
mRNA
TRANSCRIPTION TRANSCRIPTIONAL REGULATION
….AAATTTGGTTGTTGTCTCCCAGCTGTTTATTTCTGTAACAGATCTTGGAGGCTGCGGTCTGGATCCCTCGCCAAGAACCAGATCCAGGAGAAAACGTGCTCAACGTGCAGCTCTGCTCCTACTGATTATAGCCCCACAGATGACATCGCTCCATAGTCACACCAAGTCTCCTGTGGGAGTCTTGCTCCTCGTTCTCAGTGTCTGTTACAGCTCGGTATTTTAGTGTCAGGACGTCGGCTCCCAGCCCGCATCTCCGCTCAGCAATGCCATTATCTTCTCAGCCAAGTCCTAGAAATGGGTTGGCTTCCCATTTGCAAAAACATCGCTCCATAGTCACACCAAGTCTCCTGTGGGAGTCTTGCTCCTCGTTCTCAGTGTCTGTTACAGCTCGGTATTTTAGTGTCAGGACGTCGGCTCCCAGCCCGCATCTCCGCTCAGCAATGCCATTATCTTCTCAGCCAAGTCCTAGAAATGGGTTGGCTTCCCATTTGCAAAAACATCGCTCCATAGTCACACCAAGTCTCCTGTGGG….
GENE EXPRESSION
DIFFERENT LEVELS OF REGULATION
Main players in Transcriptional regulation
1. Recruitment of the RNA POLYMERASE COMPLEX to the promoter region of the target gene
[Figure labels: DNA, promoter region, TF, RNA polymerase complex, TSS, target gene]
This process can be activated or repressed by:
• Transcription Factors (TFs) – activators and repressors
Bind DNA directly by recognizing specific regions
• Co-activators and co-repressors
Recruited by protein-protein interactions
Co-activator
Main players in Transcriptional regulation
Eukaryotic cells
• Nucleus
• Linear DNA molecules organized into chromosomes
• Chromatin = complex of DNA and proteins
2. Chromatin structure influences transcriptional regulation
[Figure labels: heterochromatin, euchromatin, histones, linear DNA molecule, TF]
Main players in Transcriptional regulation
[Figure labels: chromatin remodeling complex, DNA, TF, co-activator, RNA polymerase complex, TSS, target gene]
• TFs bind specific non-coding sequences in the DNA (TF binding sites) to control the expression of their target genes
• All genes regulated by the same TF contain a similar TF binding site in their promoter region
• REGULATORY MOTIF models the TF-DNA binding specificity and captures the variability of TF binding sites
[Figure labels: TF, TF-DNA INTERACTION, REGULATORY MOTIF; example binding site ATTGCCAT]
- Modify chromatin structure:
  - DNA methylation
  - Histone modifications like methylation, acetylation
Regulatory motif
[Figure labels: TF, REGULATORY MOTIF]
GTGACG
GTGACC
GAGACG
GTGTCG
GTCAGG
Alignment of TF binding sites
Construction of frequency matrix
Motif logo
[Figure: frequency matrix with rows A, C, G, T and columns p1 p2 p3 … pn. Most entries are 0.01; five columns have a single dominant base with frequency 0.97, and one column is split 0.69 / 0.29.]
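The three steps on this slide (alignment of TF binding sites, construction of a frequency matrix, motif logo) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the thesis: the pseudocount of 0.01, the uniform background of 0.25 and the function names are assumptions. The five sites are the ones aligned on the slide.

```python
import math

BASES = "ACGT"
# The five aligned TF binding sites shown on the slide.
SITES = ["GTGACG", "GTGACC", "GAGACG", "GTGTCG", "GTCAGG"]


def frequency_matrix(sites, pseudocount=0.01):
    """Return one {base: frequency} dict per aligned position.

    The pseudocount (an assumed value) keeps unseen bases above zero.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in BASES})
    return matrix


def information_content(column, background=0.25):
    """Information content (bits) of one column against a uniform background.

    This is the quantity a motif logo displays as the column height.
    """
    return sum(p * math.log2(p / background) for p in column.values() if p > 0)


matrix = frequency_matrix(SITES)
consensus = "".join(max(col, key=col.get) for col in matrix)
for pos, col in enumerate(matrix, start=1):
    print(f"p{pos}", {b: round(f, 2) for b, f in col.items()},
          f"IC={information_content(col):.2f} bits")
print("consensus:", consensus)
```

Fully conserved positions (such as p1) approach 2 bits, while positions with a substitution in one of the five sites score lower. This is the sense in which a motif can have a high or low IC.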
Computational motif discovery
1. Motif scanning: known motif model
2. De novo motif discovery: search for novel, uncharacterized motifs
Two different computational approaches!
Different algorithms to predict TF binding sites, classified based on the information sources they use:
- Coregulation information
- Orthology information
- Co-localization of different TF binding sites
- Chromatin structure
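Motif scanning with a known motif model can be illustrated with a simple log-odds scan. The sketch below is an assumption-laden example, not a specific tool from the talk: the four-position motif, the uniform background and the score threshold of 6.0 are all hypothetical. It scans only the forward strand and expects sequences made of A, C, G and T.

```python
import math

BASES = "ACGT"


def log_odds_matrix(freq_matrix, background=0.25):
    """Convert a frequency matrix into log2 odds against a uniform background."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freq_matrix]


def scan(sequence, pwm, threshold):
    """Score every window of the sequence and keep those at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical motif model that strongly prefers "GTGA".
freq = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
]
pwm = log_odds_matrix(freq)
hits = scan("ACGTGATTGTGACC", pwm, threshold=6.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

De novo discovery differs in that the matrix itself is unknown and has to be inferred together with the site positions.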
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
Different information spaces
1. Coregulation space
2. Orthologous space
3. Combined coregulation-orthology space
Next generation of motif discovery tools integrates orthology with coregulation information
Study
Research goal:
– Extent of information in coregulation or orthologous space
– Conditions under which complementing both spaces improves motif detection
Method:
– Synthetic and real benchmark datasets
– Select motif detection tools flexible enough to perform in each of the three spaces
  - Phylogibbs (Siddharthan et al., 2005)
  - Phylogenetic sampler (Newberg et al., 2007)
  - MEME (Bailey and Elkan, 1994)
Theoretical comparison Overview
Phylogibbs:
- Simulated annealing + tracking => global optimum (= MAP solution)
- Running time: short
Phylogenetic sampler:
- A Gibbs sampler => local optimum => ensemble centroid solution
- Running time: long (multiple re-initializations)
MEME:
- Expectation Maximization => local optimum
- Running time: short
Phylogenetic relatedness between the orthologous sequences:
- Phylogibbs and Phylogenetic sampler: tree-based evolutionary model; alignment of the orthologous sequences needed
- MEME: no evolutionary model; unaligned sequences
Theoretical comparison Assignment and scoring of motif sites
Unaligned sequences: single independent motif sites
Prealigned sequences: multiple orthologous motif sites, scored with a tree-based evolutionary model (F81)
Phylogibbs: window principle -> more flexible in case of a bad prealignment
Phylogenetic sampler: block principle -> very sensitive to bad prealignments -> leave out phylogenetically distant orthologs
[Figure: Seq 1, Seq 2 … Seq 10 for REF SPECIES and SPECIES 1 to 4; panels: Coregulation, Orthologous, Combined]
Performance assessment Construction of Synthetic datasets
Use a phylogenetic tree and an evolutionary model to create the orthologs for different species
[Figure labels: ancestor species (Seq 1, Seq 2 … Seq 10), motif WMs with a different IC, background sequences]
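The idea of deriving orthologs from an ancestor sequence can be sketched as follows. This is a simplified illustration and not the benchmark generator used in the study. It assumes a star tree (every species descends directly from the ancestor), uniform equilibrium frequencies, and an F81-style rule in which a site stays unchanged with probability exp(-ut) and is otherwise redrawn from the equilibrium distribution. The species names and ut values are hypothetical.

```python
import math
import random

BASES = "ACGT"


def evolve(sequence, rate_times_time, equilibrium, rng):
    """Evolve a sequence along one branch with an F81-style substitution rule."""
    stay = math.exp(-rate_times_time)  # probability of no substitution at a site
    weights = [equilibrium[b] for b in BASES]
    evolved = []
    for base in sequence:
        if rng.random() < stay:
            evolved.append(base)
        else:
            # A substitution event redraws the base from the equilibrium
            # frequencies (for motif sites this would be the motif WM column).
            evolved.append(rng.choices(BASES, weights=weights)[0])
    return "".join(evolved)


def make_orthologs(ancestor, branch_ut, rng):
    """Create one ortholog per species on a star tree (assumed topology)."""
    uniform = {b: 0.25 for b in BASES}
    return {species: evolve(ancestor, ut, uniform, rng)
            for species, ut in branch_ut.items()}


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
ancestor = "".join(rng.choice(BASES) for _ in range(200))
orthologs = make_orthologs(
    ancestor, {"species_1": 0.1, "species_2": 0.5, "species_3": 2.0}, rng)
for species, seq in orthologs.items():
    print(species, round(identity(ancestor, seq), 2))
```

Larger ut values yield more diverged orthologs, which is how the evolutionary distance between the added orthologs can be varied in a synthetic benchmark.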
Performance assessment Construction of Real datasets
Biological datasets:
1. Prokaryotic data -> Gamma-proteobacteria (motifs: TyrR, LexA)
2. Eukaryotic data -> yeast species (motifs: Urs1H, Rap1)
Performance assessment Results (1)
COREGULATION SPACE
Does adding orthologs improve the performance for the LOW IC motif?
Depends on the degeneracy of the embedded motif
Performance assessment Results (2)
COMBINED SPACE
1. Evolutionary distance between the added orthologs
Performance assessment Results (3)
2. Phylogenetic tree
=> Tree based on neutral evolution rate
3. The number of added orthologs and the topology of the tree
=> low impact
4. Noise
=> Orthologous direction: performance drop depends on the species distance and the algorithm characteristics
Performance assessment Results (4)
ORTHOLOGOUS SPACE
Room for improvement!
- Number of added orthologs: larger effect than in combined space
- PS: almost no output when orthologs are prealigned (no centroid solution)
Conclusions
Tools compared: Phylogibbs, Phylogenetic sampler, MEME
• Quality of predicted motifs depends on correctness of prealignments
  Challenge: accounting for phylogenetic relatedness, independent of a prealignment
• Ensemble centroid strategy: useful with low signal/noise, but computationally limiting
• Phylogenetic tools may perform better than the more basic MEME tool, BUT:
  – More parameters to tune
  – Performance strongly depends on the prealignment quality, the phylogenetic tree, the relationship between the orthologs etc.
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
PhyloMotifWeb
• Motif finders with different algorithmic background -> performance diversity
  => Ensemble strategy: combine results of multiple algorithms
• Progress of experimental technologies -> epigenetic information, chromatin structure information
  => Easy reduction of search space
• Growing number of sequenced genomes -> orthology information
  => Create ortholog alignments, phylogenetic tree
• Automatic parameter sweep
• Ensemble phylogenetic motif finders
PhyloMotifWeb – Ensemble strategy
• Three motif finders: Phylogibbs, Phylogenetic sampler and MEME
• Run each motif finder across multiple parameter settings (e.g. different motif numbers, motif widths etc.)
  -> Large collection of output matrices
• FuzzyClustering algorithm
  – Summarizes all these output matrices into a set of non-redundant ensemble motifs
  – Works on the TF binding site level <-> matrix level
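To make "working on the TF binding site level" concrete, the sketch below merges overlapping sites predicted by different runs and records which motif finder contributed each one. It is not the FuzzyClustering algorithm, whose details are not given on the slide. The overlap rule, the `min_overlap` value, and the gene names and coordinates are assumptions chosen for illustration.

```python
from collections import defaultdict


def merge_sites(predictions, min_overlap=4):
    """Group overlapping predicted sites per sequence.

    predictions: iterable of (finder, seq_id, start, end), end exclusive.
    Returns a list of clusters with the merged span and contributing finders.
    """
    by_seq = defaultdict(list)
    for finder, seq_id, start, end in predictions:
        by_seq[seq_id].append((start, end, finder))

    clusters = []
    for seq_id, sites in by_seq.items():
        sites.sort()
        current = None
        for start, end, finder in sites:
            # Sites are sorted by start, so the overlap with the open cluster
            # is measured from this site's start position.
            if current and min(end, current["end"]) - start >= min_overlap:
                current["end"] = max(current["end"], end)
                current["finders"].add(finder)
            else:
                current = {"seq_id": seq_id, "start": start,
                           "end": end, "finders": {finder}}
                clusters.append(current)
    return clusters


# Hypothetical output of three motif finders on two promoter sequences.
predictions = [
    ("MEME", "gene1", 10, 18),
    ("Phylogibbs", "gene1", 12, 20),
    ("Phylogenetic sampler", "gene1", 50, 58),
    ("MEME", "gene2", 5, 13),
]
clusters = merge_sites(predictions)
supported = [c for c in clusters if len(c["finders"]) >= 2]
print(supported)
```

Requiring support from more finders trades sensitivity for specificity, and the `finders` set allows tracing back which tools contributed to a final site.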
PhyloMotifWeb
Important for motif discovery in eukaryotes!
PhyloMotifWeb - Eukaryotes
Restrict search space to regions with higher regulatory potential based on epigenetic information like chromatin structure
BUT: Tissue and condition dependent!
Annotation of regulatory regions -> Regulatory build pipeline of Ensembl
• Multi-cell type:
– DNase hypersensitivity -> open chromatin
– CTCF binding sites -> enhancer/insulator marker
– Binding sites of other TFs
• Cell-type specific:
– Histone modifications
PhyloMotifWeb – Webserver
Results page
- Motif logo
- Individual binding sites of the ensemble solution
- p-value for the overrepresentation of the ensemble motif in the sequence set versus random sequence sets
- Comparison with database motifs
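A p-value for overrepresentation against random sequence sets can be estimated empirically. The sketch below is an assumption-based illustration, not the webserver's implementation: random sets are produced by shuffling the nucleotides of each input sequence (a real analysis would more likely draw random promoter sets from the genome), and `has_hit` stands for any hypothetical function that reports whether a sequence contains the motif.

```python
import random


def shuffled(sequence, rng):
    """Return the sequence with its nucleotides in random order."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def empirical_p_value(sequences, has_hit, n_random=999, seed=0):
    """Fraction of random sets with at least as many hit sequences as observed.

    The +1 terms count the observed set itself, so the estimate is never zero.
    """
    rng = random.Random(seed)
    observed = sum(has_hit(s) for s in sequences)
    as_extreme = 0
    for _ in range(n_random):
        random_set = [shuffled(s, rng) for s in sequences]
        if sum(has_hit(s) for s in random_set) >= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_random + 1)


# Hypothetical cluster: ten sequences that each carry the site GTGACG.
sequences = ["AACCTT" * 3 + "GTGACG" + "TTCCAA" * 3] * 10
p_motif = empirical_p_value(sequences, lambda s: "GTGACG" in s)
print(p_motif)
```

With 999 random sets the smallest reportable p-value is 0.001, so the number of random sets bounds the resolution of the estimate.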
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
Vitamin D3 - metabolism
• Source: Diet and produced in skin when exposed to sunlight
• Role in regulating many physiological and cellular processes:
- Bone health
- Prevention of autoimmune diseases
- Anti-proliferative effect on different cell types like cancer cells
Vitamin D3 - mode of action
1. Vitamin D3 enters the cell and binds to the vitamin D receptor (VDR), which dimerizes with RXR
2. Ligand-activated VDR/RXR binds the DNA at vitamin D response elements (VDRE)
3. Recruitment of co-activators and chromatin remodelers: open chromatin structure
4. Transcription of the VDR target gene
[Figure labels: VitD3, VDR, RXR, VDRE, chromatin remodeling complex, co-activator complex, DRIP, transcription machinery, target gene]
Vitamin D3 - dataset
GOAL: get insight in molecular mechanism underlying anti-proliferative effect of vitD3
- Human and mouse cell lines treated with vitD3 versus no vitD3 (Control)
- Measured the expression of all genes in the human and mouse cells using microarrays for both conditions over different time points
- Select differentially expressed genes (vitD3 versus Control) -> phenotype
- Group per species all genes with similar behavior in coexpression clusters
-> Focus on genes with a conserved co-expression behavior across human and mouse: interesting for the common anti-proliferative phenotype
[Figure: human breast cancer cells and mouse bone cells, VitD3 VERSUS Ctr; RXR/VDR/VitD3 bound at the VDRE of a target gene -> ANTI-PROLIFERATIVE PHENOTYPE]
Vitamin D3 - Dataset
Conserved coexpression cluster:
- 10 genes
- Upregulated after vitD3
Assume: conserved transcriptional regulation
Conserved regulatory motifs responsible for expression behavior
De novo strategy
Screening: Co-localization of TF binding sites
Vitamin D3 - de novo motifs
METHOD: PhyloMotifWeb
RESULTS:
1. Very common motifs
• Low specificity for coexpressed cluster
• Match with TFs involved in cell cycle regulation
  – Well conserved TF binding sites, present in many genes!
  – e.g. SP1, ZF5, NRF1
• TF involved in B-cell differentiation
  – EBF
Vitamin D3 - de novo motifs
2. Motifs specific for the conserved coexpression cluster
-> higher overrepresentation in the cluster compared to the genome
-> match with following TFs:
ZEB1 - Transcriptional activator of VDR protein
- Role in cancer metastasis
VDR - Putative direct regulation by VDR
- VDRE hard to discover de novo: only one conserved half-site!
• Two conserved half-sites with variable spacer
• Diverse configurations [DR, IR, ER]
• Located far up-/downstream of the TSS
NHR-scan: specific for nuclear hormone receptor binding sites
[Figure labels: C1, C2]
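The half-site configurations (direct, inverted and everted repeats with a variable spacer) can be searched for with a small pattern matcher. This sketch is not NHR-scan. The half-site consensus `RGKTCA`, the spacer range of 0 to 6 and the test sequence are assumptions for illustration, and only half-site arrangements read on the forward strand are reported.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYKMN", "TGCAYRMKN")


def reverse_complement(pattern):
    """Reverse complement of an IUPAC pattern (R<->Y, K<->M)."""
    return pattern.translate(COMPLEMENT)[::-1]


def to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression fragment."""
    return "".join(IUPAC[ch] for ch in pattern)


def find_repeats(sequence, half_site="RGKTCA", spacers=range(0, 7)):
    """Report DRn, IRn and ERn elements as (label, start, matched text)."""
    fwd = to_regex(half_site)
    rev = to_regex(reverse_complement(half_site))
    layouts = {"DR": (fwd, fwd), "IR": (fwd, rev), "ER": (rev, fwd)}
    hits = []
    for name, (left, right) in layouts.items():
        for n in spacers:
            # A lookahead is used so that overlapping elements are all found.
            regex = re.compile(f"(?=({left}[ACGT]{{{n}}}{right}))")
            for match in regex.finditer(sequence):
                hits.append((f"{name}{n}", match.start(), match.group(1)))
    return hits


hits = find_repeats("TTTAGGTCAACGAGTTCATTT")
print(hits)
```

Because each half-site is short and degenerate, such a scan produces many hits in long non-coding regions, which is consistent with the slide's point that VDREs are hard to discover de novo.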
Vitamin D3 – Cis-regulatory modules
[Figure: co-localized binding sites of TF1 and TF2]
Higher eukaryotes:
-> TFs act in cooperation to modulate gene expression
-> Find co-localized binding sites for de novo predicted motifs => CRMs
Vitamin D3 – Cis-regulatory modules
METHOD: CPModule
INPUT:
• De novo predicted motif models
• Constraint: module size ranging between 150bp and 400bp
RESULTS:
• 3 CRMs highly specific for the coexpressed genes (p-value < 0.001):
  – SP1-EBF: 7 genes
  – NRF1-EBF: 7 genes
  – VDR-ZEB1-EBF: 10 genes
• Each CRM contains the EBF motif -> degenerate -> many hits -> using a motif-specific score threshold
• Most interesting is the ZEB1-VDR module
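The notion of a module (binding sites of several motifs co-localized within a size limit) can be shown with a naive window check. This is not CPModule, which is not described in detail on the slide. The function names, gene names and hit positions are hypothetical, and only the upper size limit of 400 bp is applied.

```python
def has_module(hits, motifs, max_size=400):
    """True if one window of at most max_size bp holds a hit of every motif.

    hits: list of (position, motif_name) for a single sequence.
    """
    wanted = set(motifs)
    relevant = sorted((pos, name) for pos, name in hits if name in wanted)
    for i, (start, _) in enumerate(relevant):
        seen = set()
        for pos, name in relevant[i:]:
            if pos - start > max_size:
                break
            seen.add(name)
            if seen == wanted:
                return True
    return False


def genes_with_module(hits_per_gene, motifs, max_size=400):
    """Names of the genes whose sequence contains the module."""
    return sorted(gene for gene, hits in hits_per_gene.items()
                  if has_module(hits, motifs, max_size))


# Hypothetical motif hits (position relative to the start of each sequence).
hits_per_gene = {
    "geneA": [(100, "VDR"), (250, "ZEB1"), (300, "EBF")],
    "geneB": [(100, "VDR"), (900, "ZEB1"), (950, "EBF")],
    "geneC": [(10, "EBF"), (20, "ZEB1")],
}
module_genes = genes_with_module(hits_per_gene, ["VDR", "ZEB1", "EBF"])
print(module_genes)
```

Counting such genes inside the cluster and in a background set is what allows a module to be called specific for the coexpressed genes.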
Vitamin D3 - perspectives
• Motifs predicted for the conserved coexpression cluster -> investigate their presence for larger species-specific clusters or maybe for the full genome
• The availability of cell-type specific epigenetic information can help to retrieve the functional binding sites
• Besides a transcriptome analysis -> integrate extra omics data like ChIP-seq and protein profiling to reconstruct the regulatory network of vitD3
Acknowledgements
ESAT-Bioi
• Prof. Dr. Bart De Moor
• Prof. Dr. Yves Moreau
• Wouter Van Delm
LEGENDO
• Dr. Lieve Verlinden
• Prof. Dr. Mieke Verstuyf
• Dr. Guy Eelen
• Els Vanoirbeek
CMPG-Bioi
• Prof. Dr. Kathleen Marchal
• Dr. Pieter Monsieurs
• Marleen Claeys
• Carolina Fierro
• Aminael Sanchez
• Hong Sun
CMPG
• Prof. Dr. Jan Michiels
Extra slides
Theoretical comparison Phylogibbs Algorithm (1)
Procedure:
1. start with a random configuration C, based on prior information on the number of motif sites/TFs
2. construct the set of all possible configurations C’ that differ in one single move from C (designed moveset)
3. calculate for each C’ the posterior probability score
4. sample a new configuration from this score distribution
This procedure is repeated for two phases :
1. Simulated annealing: iterating to configuration C* with the highest posterior probability (=MAP) (temperature parameter β)
2. Tracking: posterior probabilities are assigned to the windows in C*
-> One initialization is sufficient
-> Very short running time (minutes/hours)
Theoretical comparison Phylogibbs Algorithm (2)
3. Calculate the posterior probability score: P(C|S)
Bayes’ Theorem:
P(C|S) ~ P(S|C) = probability that the motif sites of C are drawn from the motif WM and that the background sequence is drawn from the background model (EVOLUTIONARY MODEL)
The motif WM = unknown!! -> integral over all possible WMs, with prior P(WM) modeled by Dirichlet prior distribution Dir(γ)
The approximation to solve this integral requires that the tree topologies are reduced to collections of star topologies
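The formula itself is not preserved in the transcript. Under the definitions given on this slide, the marginalization over the unknown weight matrix would take the following generic form. This is a reconstruction from the surrounding text, not a copy of the slide's equation.

```latex
P(C \mid S) \;\propto\; P(S \mid C)
  \;=\; \int P(S \mid C, \mathrm{WM}) \, P(\mathrm{WM}) \, d\mathrm{WM},
\qquad
P(\mathrm{WM}) = \mathrm{Dir}(\gamma)
```

Here S denotes the input sequences, C a configuration of motif sites, and WM the motif weight matrix.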
Theoretical comparison Phylogenetic sampler Algorithm (1)
Procedure:
1. start with a random positioning of blocks (based on prior information on the expected number of motif sites/TFs and max number of motif sites per sequence)
2. update the motif model based on the current blocks (<-> PG)
3. scoring: leave out the blocks for one sequence (<-> PG) and calculate for each possible block the conditional probability score
4. first sample the number of motif sites for the sequence, then sample this number of blocks from the score distribution (3)
This iteration procedure is repeated for:
1. Burn-in phase: to converge to local optimum
2. Sampling phase: keep track of all sampled blocks to construct the centroid afterwards
-> multiple initializations (seeds) recommended to avoid getting trapped in local maximum
-> long running time (hours/days)
Theoretical comparison Phylogenetic sampler Algorithm (2)
2. Update the motif model
-> Sample a new motif model from a Dirichlet distribution Dir(β+c) adjusted with phylogenetically weighted counts (based on phylogenetic tree)
-> Accept the new motif with a probability proportional to the Metropolis Hastings ratio
3. Calculate the conditional probability score
The conditional probability
=> proportional to the probability that the block is drawn from the motif model (inferred) divided by the probability that the block is drawn from the background model (EVOLUTIONARY MODEL)
The Felsenstein tree-likelihood algorithm is used to handle all tree topologies (<->PG)
Theoretical comparison Solution
Figure from Newberg et al., 2007
Phylogibbs: Maximum a posteriori (MAP) solution
-> set of motif sites (configuration) with the highest posterior probability
Phylogenetic sampler: Centroid solution
-> report all those motif sites that appear in at least half the sampling iterations
-> keeps track of all motif sites sampled during sampling iterations to calculate posterior probabilities
-> does not take into account joint occurrences of the motif sites
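The centroid rule stated above (report the sites that appear in at least half of the sampling iterations) is simple enough to sketch directly. The representation of a site as a `(sequence id, position)` tuple and the sample data are assumptions for illustration.

```python
from collections import Counter


def centroid_sites(samples):
    """Return {site: posterior} for sites seen in at least half the samples.

    samples: list of sets, one set of sampled sites per sampling iteration.
    Each site is scored on its own; joint occurrences are not considered.
    """
    counts = Counter(site for sample in samples for site in sample)
    n_samples = len(samples)
    posterior = {site: count / n_samples for site, count in counts.items()}
    return {site: p for site, p in posterior.items() if p >= 0.5}


# Hypothetical sites recorded during four sampling iterations.
samples = [
    {("seq1", 12), ("seq2", 40)},
    {("seq1", 12)},
    {("seq1", 12), ("seq3", 7)},
    {("seq2", 40), ("seq1", 12)},
]
centroid = centroid_sites(samples)
print(centroid)
```

If no site reaches the one-half threshold the centroid is empty, which matches the earlier observation that the Phylogenetic sampler can return almost no output.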
Theoretical comparison Evolutionary model
Adapted Felsenstein (F81) model
-> Describes the substitution process at the nucleotide level
-> Assumes that all positions evolve independently and at equal rates (u)
-> Probability that a is mutated to b is dependent on the time (t)
-> Fixation of b is dependent on its frequency in the motif WM
Phylogibbs: proximity = q = exp(-ut) = probability that no substitution took place per site
Phylogenetic sampler: branch length = b = ut AND a different normalization for their branch lengths (k)
Convert proximities to branch lengths: b = -(3/4) ln(q)
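The conversion between the two tree parameterizations can be written as two small helper functions. The function names are hypothetical. The factor 3/4 is taken from the slide's formula and reflects the different branch length normalization mentioned above.

```python
import math


def proximity(rate, time):
    """Phylogibbs proximity q = exp(-u t): probability of no substitution per site."""
    return math.exp(-rate * time)


def proximity_to_branch_length(q):
    """Convert a proximity q into a branch length using b = -(3/4) ln(q)."""
    if not 0.0 < q <= 1.0:
        raise ValueError("proximity must lie in (0, 1]")
    return -0.75 * math.log(q)


for q in (1.0, 0.9, 0.5, 0.1):
    print(q, round(proximity_to_branch_length(q), 3))
```

A proximity of 1 (identical sequences) maps to a branch length of 0, and the branch length grows without bound as the proximity approaches 0.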
Introduction
Main players in Transcriptional regulation
Prokaryotic cells (bacteria):
• No nucleus, circular ‘naked’ DNA molecule
[Figure labels: nucleus, chromosome, chromatin, histone proteins, nucleosome, DNA]
Chromatin function:
– Storage of long DNA molecules into nucleus
– Role in Transcriptional regulation: euchromatin and heterochromatin
Eukaryotic cells:
• Linear DNA molecules organized into chromosomes
• Chromatin = complex of DNA and proteins (histones)
Main players in Transcriptional regulation
2. Chromatin structure (eukaryotes)
[Figure labels: chromatin remodeling complex, DNA, promoter region, TF, co-activator, RNA polymerase complex, TSS, target gene]
Theoretical comparison Input format
COREGULATION: Non-coding regions for a set of coregulated genes from one species
- Phylogibbs and Phylogenetic sampler: unaligned
ORTHOLOGOUS: Non-coding regions for a set of orthologous genes from multiple species
- Phylogibbs and Phylogenetic sampler: prealigned orthologs (PG => Dialign, PS => ClustalW) and a phylogenetic tree
COMBINED: Combination of both
MEME: unaligned
Performance assessment Results (3)
2. Phylogenetic tree
=> Tree based on neutral evolution rate
3. The number of added orthologs and the topology of the tree
4. Noise
=> Orthologous direction: performance drop depends on the species distance and the algorithm characteristics
Phylogibbs ↓
Phylogenetic sampler ↓
- Weighting scheme
- Block principle
[Figure labels: Spec 3, Spec M]
PhyloMotifWeb - webserver
PHYLO-MOTIF-WEB
STEP 1: Select the non-coding regions
STEP 2: Additional information sources
- Multi-species alignments
- DNA features like chromatin structure
- Mask repeats
STEP 3: Motif discovery by using an ensemble strategy
- MEME, Phylogibbs, Phylogenetic sampler
STEP 4: Post-processing of the predicted ensemble motif matrices
External databases and software shown in the diagram: ENSEMBL CORE, ENSEMBL COMPARA AND REGULATORY BUILD, UCSC GENOME BROWSER, TRANSFAC and JASPAR, MotifComparison, Clover
Vitamin D3 - de novo motifs
RESULTS:
1. Very common motifs
-> low overrepresentation in the cluster compared to the genome
-> match with following TFs:
SP1 (MEME)
- Involved in vitD3 response -> regulation of genes without VDRE binding site
- Regulator of TFs involved in cell cycle regulation
ZF5 (MEME, PG)
- TF particularly abundant in differentiated tissues with low proliferation
- Growth suppressive activity
NRF1 (MEME, PG)
- Involved in cell proliferation
EBF (PS)
- B-cell differentiation
SP1, ZF5 and NRF1 are cell cycle regulators -> well conserved binding sites, present in many genes!
PhyloMotifWeb – Ensemble strategy
• Three motif finders: Phylogibbs, Phylogenetic sampler and MEME
• Run each motif finder across multiple parameter settings (e.g. different motif numbers, motif widths etc.)
  -> Large collection of output matrices
• FuzzyClustering algorithm -> summarizes all these output matrices into a set of non-redundant ensemble motifs
- Works on TF binding site level -> fine tuning sensitivity/specificity
- Integration of TF binding site scores assigned by the original motif finder
- Trace back the different motif finders that contributed to the final solution
Vitamin D3 - de novo motifs
METHOD: PhyloMotifWeb
- 4000 bp centered around TSS
Restrict to regions with regulatory potential
- Use evolutionary conservation information
human-mouse pairwise alignment
six species alignment
- Use Phylogibbs, Phylogenetic sampler and MEME => Ensemble solution
- Predicted ensemble motifs were compared to database motifs from TRANSFAC and JASPAR to retrieve TFs potentially involved in the coexpression behavior
General perspectives
Integration of multiple information sources to improve de novo motif discovery
• Orthology information
– Ortholog alignments, evolutionary models
– Evolution in how algorithms exploit this information source
• New information sources like epigenetic information become available
– How to exploit this new information?
– More knowledge on which chromatin modifications co-locate with transcriptionally active regions like promoters, enhancers or TF binding sites will improve usability