genes and regulatory elements zhiping weng u mass medical school
Post on 19-Dec-2015
![Page 1: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/1.jpg)
Genes and Regulatory Elements
Zhiping Weng, U Mass Medical School
![Page 2: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/2.jpg)
ENCODE: ENCyclopedia Of DNA Elements
(The ENCODE Project Consortium, Science 2004, Nature 2007)
[Figure: the ENCODE pilot regions (m001 to m014 and r111 to r334) marked on chromosomes 1 to 22, X, and Y]
Goal: Identify all functional elements in the human genome.
Pilot phase: 1% of the genome is being annotated very extensively (30 Mb of sequence).
Now genome-wide
![Page 3: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/3.jpg)
The ENCODE Project Consortium (2004). The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, Vol 306, 636-640.
![Page 4: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/4.jpg)
Gene
RNA-seq
![Page 5: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/5.jpg)
Epigenomics
![Page 6: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/6.jpg)
Regulatory Elements
![Page 7: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/7.jpg)
The human genome
2% genes (25,000)
53% unique and segmentally duplicated DNA
45% repetitive DNA
Where are the gene regulatory elements?
G. Crawford
![Page 8: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/8.jpg)
DNase hypersensitive (HS) sites identify active gene regulatory elements
DNase I HS sites
Regions hypersensitive to DNase:
• Promoters
• Enhancers
• Silencers
• Insulators
• Locus control regions
• Meiotic recombination hotspots
HS sites identify “open” regions of chromatin
Crawford et al., Nature Methods 2006
![Page 9: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/9.jpg)
DNase-chip to identify DNase HS sites
Crawford et al., Nature Methods 2006
or sequence directly.
![Page 10: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/10.jpg)
Arrays used for DNase-chip
NimbleGen arrays
• 385,000 50-mer oligos
• Oligos spaced every 38 bases (12 base overlap)
• Non-repetitive unique regions
• 1% of the genome (44 ENCODE regions)
Crawford et al., Nature Methods 2006
![Page 11: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/11.jpg)
DNase-chip Quality Assessment
Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J., Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007) Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics, 8, 8-20. (+Co-corresponding authors)
![Page 12: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/12.jpg)
Cell types: GM, CD4, HeLa, H9, K562, IMR90
Ubiquitous HS sites: 20%
Cell-type specific and common HS sites: 80%
Unique, common, and ubiquitous DNase HS sites
Collectively, the DHS cover 8.3% of the ENCODE regions.
![Page 13: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/13.jpg)
Have we reached saturation in identifying most DNase HS sites?
![Page 14: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/14.jpg)
CpG content of DNase HS sites
![Page 15: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/15.jpg)
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)
Ubiquitous DNase HS sites are enriched for promoters (TSS)
What about ubiquitous distal DNase HS sites?
![Page 16: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/16.jpg)
Most Distal (non-TSS) ubiquitous DNase sites are insulators bound by CTCF
ChIP
![Page 17: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/17.jpg)
Chromatin-immunoprecipitation (ChIP)-chip
Antibody against CTCF
Readout: tiling array (ChIP-chip) or direct sequencing (ChIP-seq)
Kim T.H. et al. Direct Isolation and Identification of Promoters in the Human Genome. Genome Research (2005)
![Page 18: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/18.jpg)
The H19/IGF2 Locus is well insulated
![Page 19: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/19.jpg)
DNase HS sites identify insulator in the Hox locus
![Page 20: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/20.jpg)
Cell culture insulator assays demonstrate that DNaseI HS sites (that overlap CTCF) display enhancer blocking activity.
![Page 21: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/21.jpg)
CTCF motif sites are conserved
![Page 22: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/22.jpg)
CTCF sites make up a greater % of ubiquitous distal DNase HS sites than enhancers
![Page 23: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/23.jpg)
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)
Ubiquitous DNase HS sites are enriched for promoters (TSS)
![Page 24: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/24.jpg)
Ubiquitous proximal DNase HS sites
![Page 25: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/25.jpg)
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)
![Page 26: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/26.jpg)
Antibody against histone modification
• Tiling array
• Sequencing
![Page 27: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/27.jpg)
Enrichment between tissue-specific H3K4me2 and DNase HS sites
![Page 28: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/28.jpg)
Cell type-specific DNase HS sites correlate with cell type-specific histone modifications
Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.
![Page 29: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/29.jpg)
Cell type-specific DNase HS sites correlate with cell type-specific enhancers
![Page 30: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/30.jpg)
Cell type-specific DNase HS sites correlate with cell type-specific gene expression
![Page 31: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/31.jpg)
Transcriptional Motifs
Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements
![Page 32: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/32.jpg)
Finding enriched motifs in tissue-specific DNase HS sites
Screen against a motif library, e.g., JASPAR or TRANSFAC
Motif library entries: STAT, Myc/Max, YY1, (etc.)
Each motif is tested with the Clover algorithm against the DHS sequences (DHS #1 through DHS #5)
![Page 33: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/33.jpg)
JASPAR: a database of transcription factor motifs
![Page 34: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/34.jpg)
Clover: Cis-eLement OVERrepresentation
[Figure: the Myc/Max motif scored against the DHS sequences; raw score 17.3]
![Page 35: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/35.jpg)
The Clover Algorithm
Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:1372-1381.
Lk: nucleotide at position k
W: motif width
S: a promoter sequence
Ms: number of motif locations in a sequence
A: all possibilities of choosing a subset of sequences
N: the total number of promoter sequences
Clover Raw score
![Page 36: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/36.jpg)
Clover: Cis-eLement OVERrepresentation
Motif: Myc/Max
DHS sequences: raw score 17.3
Control DNA sequences: raw scores 4.2, 6.6, 18, 9.1
P-value = 1/4
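The P-value on this slide can be read as an empirical one: one of the four control raw scores (18) is at least as large as the DHS raw score of 17.3, which gives 1/4. A minimal sketch of that counting step, assuming this reading of the figure; the function name `empirical_p_value` is illustrative and is not part of Clover:

```python
from fractions import Fraction

def empirical_p_value(observed, control_scores):
    """Fraction of control scores that are at least as large as the observed score."""
    hits = sum(1 for score in control_scores if score >= observed)
    return Fraction(hits, len(control_scores))

# Numbers taken from the slide.
dhs_raw_score = 17.3
control_raw_scores = [4.2, 6.6, 18, 9.1]

p_value = empirical_p_value(dhs_raw_score, control_raw_scores)
print(p_value)  # 1/4
```

With only four control sets the smallest non-zero P-value is 1/4, so a real analysis would presumably use many more control sets than the figure shows.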
![Page 37: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/37.jpg)
Motifs enriched in cell-type specific DNase HS sites
Cell type
Motif Family Proximal Distal Far distal
H9 ES Oct x
Sp-1 x x
STAT x
SOX x
K562 GATA x x x
PR x x
GEN_INI x x
Tel-2 x
IMR90 AP-4 x x
![Page 38: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/38.jpg)
Motifs enriched in cell-type specific DNase HS sites
Cell type Motif Family Proximal Distal Far distal
TAL1, E2A, E12, Lmo2 x x
ETS x x
GM06990 Lmo2, E12, E47 x x
T3R x
IRF x
PAX6 x
HeLa AP-1 x x x
IPF1 x x
NF-1 x
CD4
![Page 39: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/39.jpg)
Genome-wide DNase-chip and DNase-sequencing data
• CD4 cells
• 23 k proximal DNaseI HS sites
• 72 k distal DNaseI HS sites
![Page 40: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/40.jpg)
Enriched transcription factor binding motifs in distal DNaseI HS sites
• Hematopoietic system:
– TAL1
– AML
– PU.1
– C/EBPα
• Immune system:
– STAT1, STAT3, STAT5
– IRF1, IRF3 and IRF5
![Page 41: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/41.jpg)
acgtcggctgacaccaggtctgcttgattcgatgagattgaattcgtaggagctggattagag
ggcttggggcttgaggcttgacaccatatcgtagcgctgagttgctgagtttcgtatggcgct
cgatgcttattagcggctattataggctagctaggcaatacacatcgctgatatagcggctta
tgagatagcgtgctagctatatggattggaatattcggcgctgaaaggtcttagctagtcgta
aatatatgcgcgtatgcgtatggcgggtatatgggggcttggtcttttttttcgcttaggtcg
Enriched motifs
Distal DHS sequences
Find motif clusters in the human genome
Identify motif clusters (modules)
![Page 42: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/42.jpg)
Finding motif clusters with a hidden Markov model
[Figure: motif score plotted against location in DNA. Red = motif type 1 (e.g. TAL1); blue = motif type 2 (e.g. ETS). Values shown in the model diagram: 0.8, 0.1, 0.1]
Cluster-Buster
MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666-8.
http://zlab.bu.edu/cluster-buster/
![Page 43: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/43.jpg)
Overlap between predicted motif clusters and distal DNase HS sites
[Figure labels: cutoff, DNase HS sites, predicted motif clusters]
$$\text{Enrichment of the overlap} = \frac{\text{Overlap} \times \text{Sequence space}}{\text{DHS} \times \text{Motif clusters}}$$
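The enrichment ratio compares the observed overlap with the overlap expected if DHS and motif clusters were placed independently in the sequence space. A minimal sketch, assuming all four quantities are measured in base pairs; the function name and the numbers are hypothetical and do not come from the slides:

```python
def overlap_enrichment(overlap_bp, dhs_bp, cluster_bp, sequence_space_bp):
    """Observed overlap divided by the overlap expected by chance.

    Expected overlap under independence = dhs_bp * cluster_bp / sequence_space_bp,
    so the ratio equals overlap * sequence_space / (dhs * clusters).
    """
    expected_bp = dhs_bp * cluster_bp / sequence_space_bp
    return overlap_bp / expected_bp

# Hypothetical numbers, for illustration only.
enrichment = overlap_enrichment(
    overlap_bp=2_000,
    dhs_bp=10_000,
    cluster_bp=20_000,
    sequence_space_bp=1_000_000,
)
print(enrichment)  # 10.0
```

A value of 1 means the overlap is what chance alone would give; values above 1 indicate that predicted clusters fall in DHS more often than expected.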
![Page 44: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/44.jpg)
Motif clusters can predict distal DNase HS sites genome-wide
[Figure: fold enrichment (y-axis, 0 to 12) versus cutoff of cluster score (x-axis, 7 to 17)]
![Page 45: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/45.jpg)
Summary
• DNase HS sites identified from 6 cell types
– Cell-type specific
– Common
– Ubiquitous (found in all cell types studied)
• Ubiquitous DNase HS sites are likely to function as…
– Promoters (TSS)
– Insulators (CTCF)
– (no enhancers?)
• Ubiquitous sites indicative of housekeeping chromatin structure
• Cell-type specific DNase HS sites
– Correlate with histone modifications in a cell type-specific manner
– Correlate with gene expression in a cell type-specific manner
– Correlate with enhancer elements in a cell type-specific manner
– Contain cell type-specific motifs
• Motif clusters can predict DNase HS sites genome-wide
![Page 46: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/46.jpg)
Motif Finding
Many Slides by Bill Noble @ UW
![Page 47: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/47.jpg)
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
![Page 48: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/48.jpg)
What is a “Motif”?
• Generally, a recurring pattern, e.g.
– Sequence motif
– Structure motif
– Network motif
• More specifically, a set of similar substrings, within a family of diverged sequences.
– Protein sequence motifs
– DNA sequence motifs
![Page 49: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/49.jpg)
Example motif
![Page 50: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/50.jpg)
Motif in Logos Format
![Page 51: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/51.jpg)
Gene 3’-Processing Signals
RNA
A simplified representation of the arrangement of control elements (with example sequences) that identify the 3'-processing site in yeast mRNA.
JH Graber et al. (2002) Nucleic Acids Research 30(8):1851-8.
![Page 52: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/52.jpg)
Splice site motif in logo format
weblogo.berkeley.edu
![Page 53: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/53.jpg)
Exonic Splicing Enhancers
These motifs occur within exons and enhance splicing of introns from mRNA.
Letter height indicates its frequency at that position.
Fairbrother WG et al. (2002) Science 297(5583):1007-13
![Page 54: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/54.jpg)
Transcription Factor Binding Sites
[Figure: DNA with an ERE (estrogen response element) bound by the estrogen receptor, near the transcription start]
Gene ERE Sequence
Efp … a g g g t c a t g g t g a c c c t …
TERT … t t g g t c a g g c t g a t c t c …
Oxytocin … g c g g t g a c c t t g a c c c c …
Lactoferrin … c a g g t c a a g g c g a t c t t …
Angiotensin … t a g g g c a t c g t g a c c c g …
VEGF … a t a a t c a g a c t g a c t g g …
![Page 55: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/55.jpg)
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
![Page 56: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/56.jpg)
Weight matrix
• Probabilistic model: How likely is each letter at each motif position?
Position 1 2 3 4 5 6 7 8 9
A .89 .02 .38 .34 .22 .27 .02 .03 .02
C .04 .91 .20 .17 .28 .31 .30 .04 .02
G .04 .05 .41 .18 .29 .16 .07 .92 .18
T .03 .02 .01 .31 .21 .26 .61 .01 .78
![Page 57: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/57.jpg)
A. K. A.
Weight matrices are also known as
• Position-specific scoring matrices
• Position-specific probability matrices
• Position-specific weight matrices
![Page 58: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/58.jpg)
Scoring a motif model
• A motif is interesting if it is very different from the background distribution
[Figure labels on the matrix columns: more interesting; less interesting]
Position 1 2 3 4 5 6 7 8 9
A .89 .02 .38 .34 .22 .27 .02 .03 .02
C .04 .91 .20 .17 .28 .31 .30 .04 .02
G .04 .05 .41 .18 .29 .16 .07 .92 .18
T .03 .02 .01 .31 .21 .26 .61 .01 .78
![Page 59: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/59.jpg)
Relative entropy
• A motif is interesting if it is very different from the background distribution
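• Use relative entropy*:
$$\sum_{\text{position } i} \; \sum_{\text{letter } \beta} p_{i,\beta} \log \frac{p_{i,\beta}}{b_{\beta}}$$
$p_{i,\beta}$ = probability of letter $\beta$ in matrix position $i$
$b_{\beta}$ = background frequency of letter $\beta$ (in non-motif sequence)
* Relative entropy is sometimes called information content.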
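The relative entropy of the example weight matrix can be computed position by position. A minimal sketch, assuming a uniform background of 0.25 per letter and base-2 logarithms (the slides specify neither); the function name is illustrative:

```python
import math

# Weight matrix from the slides: rows are letters, columns are motif positions 1-9.
WEIGHT_MATRIX = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
# Assumption: uniform background; the slides do not give background frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def relative_entropy_per_position(matrix, background):
    """Return sum over letters of p * log2(p / b), one value per motif position."""
    width = len(next(iter(matrix.values())))
    result = []
    for i in range(width):
        total = 0.0
        for letter, probs in matrix.items():
            p = probs[i]
            if p > 0:  # a zero probability contributes nothing to the sum
                total += p * math.log2(p / background[letter])
        result.append(total)
    return result

per_position = relative_entropy_per_position(WEIGHT_MATRIX, BACKGROUND)
total_bits = sum(per_position)
for i, bits in enumerate(per_position, start=1):
    print(f"position {i}: {bits:.2f} bits")
print(f"total: {total_bits:.2f} bits")
```

Positions dominated by one letter (such as position 1) contribute far more than near-uniform positions (such as position 5), which matches the "more interesting" versus "less interesting" distinction.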
![Page 60: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/60.jpg)
Scoring motif instances
• A motif instance matches if it looks like it was generated by the weight matrix
Matches weight matrix
Hard to tell
“ A C G G C G C C T”
Not likely!
Position 1 2 3 4 5 6 7 8 9
A .89 .02 .38 .34 .22 .27 .02 .03 .02
C .04 .91 .20 .17 .28 .31 .30 .04 .02
G .04 .05 .41 .18 .29 .16 .07 .92 .18
T .03 .02 .01 .31 .21 .26 .61 .01 .78
![Page 61: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/61.jpg)
Log likelihood ratio
• A motif instance matches if it looks like it was generated by the weight matrix
• Use log likelihood ratio:
$$\sum_{\text{position } i} \log \frac{p_{i,\beta_i}}{b_{\beta_i}}$$
$\beta_i$: the character at position $i$ of the instance
• Measures how much more the instance resembles the weight matrix than the background.
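The log likelihood ratio can be evaluated for the instance "ACGGCGCCT" from the previous slide. A minimal sketch, again assuming a uniform background and base-2 logarithms (neither is stated in the slides); `log_likelihood_ratio` is an illustrative name:

```python
import math

# Weight matrix from the slides: rows are letters, columns are motif positions 1-9.
WEIGHT_MATRIX = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
# Assumption: uniform background frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
WIDTH = 9

def log_likelihood_ratio(instance, matrix, background):
    """Sum over positions of log2(p[i][letter] / b[letter])."""
    if len(instance) != WIDTH:
        raise ValueError("instance must have the same width as the matrix")
    return sum(
        math.log2(matrix[letter][i] / background[letter])
        for i, letter in enumerate(instance)
    )

# The most probable letter at each position gives the best-scoring instance.
consensus = "".join(
    max(WEIGHT_MATRIX, key=lambda letter: WEIGHT_MATRIX[letter][i])
    for i in range(WIDTH)
)

example_score = log_likelihood_ratio("ACGGCGCCT", WEIGHT_MATRIX, BACKGROUND)
consensus_score = log_likelihood_ratio(consensus, WEIGHT_MATRIX, BACKGROUND)
print(f"ACGGCGCCT: {example_score:.2f} bits")
print(f"{consensus}: {consensus_score:.2f} bits")
```

Under these assumptions the example instance scores a few bits above zero, well below the consensus, which is consistent with the slide's "hard to tell" label.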
![Page 62: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/62.jpg)
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
![Page 63: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/63.jpg)
Position-specific scoring matrix
• This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.
A -1 -2 -1 0 -1 -2 0 -2
R 5 0 5 -2 1 -3 -2 0
N 0 6 0 0 0 -3 0 1
D -2 1 -2 -1 0 -3 -1 -1
C -3 -3 -3 -3 -3 -2 -3 -3
Q 1 0 1 -2 5 -3 -2 0
E 0 0 0 -2 2 -3 -2 0
G -2 0 -2 6 -2 -3 6 -2
H 0 1 0 -2 0 -1 -2 8
I -3 -3 -3 -4 -3 0 -4 -3
L -2 -3 -2 -4 -2 0 -4 -3
K 2 0 2 -2 1 -3 -2 -1
M -1 -2 -1 -3 0 0 -3 -2
F -3 -3 -3 -3 -3 6 -3 -1
P -2 -2 -2 -2 -1 -4 -2 -2
S -1 1 -1 0 0 -2 0 -1
T -1 0 -1 -2 -1 -2 -2 -2
W -3 -4 -3 -2 -2 1 -2 -2
Y -2 -2 -2 -3 -1 3 -3 2
V -3 -3 -3 -3 -2 -1 -3 -3
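The score quoted for NMFWAFGH can be checked by looking up one matrix entry per residue and summing. A minimal sketch using the PSSM exactly as printed above; the helper names and the longer test sequence are hypothetical:

```python
# PSSM from the slide: one row per amino acid, one column per motif position (1-8).
PSSM = {
    "A": [-1, -2, -1, 0, -1, -2, 0, -2],
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "D": [-2, 1, -2, -1, 0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "E": [0, 0, 0, -2, 2, -3, -2, 0],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "H": [0, 1, 0, -2, 0, -1, -2, 8],
    "I": [-3, -3, -3, -4, -3, 0, -4, -3],
    "L": [-2, -3, -2, -4, -2, 0, -4, -3],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
    "M": [-1, -2, -1, -3, 0, 0, -3, -2],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1, 1, -1, 0, 0, -2, 0, -1],
    "T": [-1, 0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2, 1, -2, -2],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}
WIDTH = 8

def pssm_score(window, pssm):
    """Sum the matrix entry for each residue at its position in the window."""
    return sum(pssm[residue][i] for i, residue in enumerate(window))

def scan(sequence, pssm):
    """Score every window of the motif width; the list index is the window offset."""
    return [
        pssm_score(sequence[start:start + WIDTH], pssm)
        for start in range(len(sequence) - WIDTH + 1)
    ]

score = pssm_score("NMFWAFGH", PSSM)
print(score)  # 12

# Hypothetical longer sequence with the same 8-mer embedded at offset 2.
window_scores = scan("AANMFWAFGHAA", PSSM)
print(window_scores.index(max(window_scores)))  # 2
```

Scanning a longer sequence this way yields one score per offset, which is the input to the significance question on the next slide.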
![Page 64: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/64.jpg)
Significance of scores
[Figure: a motif scanning algorithm takes the sequence LENENQGKCTIAEYKYDGKKASVYNSFVS and outputs a score of 45]
Low score = not a motif
High score = motif occurrence
How high is high enough?
![Page 65: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/65.jpg)
Computing a p-value
• Compute the scores for all possible sequences whose length matches the motif.
• Use these scores to compute a p-value.
• The probability of observing a score >4 is the area under the curve to the right of 4.
• This probability is called a p-value.
• p-value = Pr(data|null)
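For a short DNA motif, the null distribution of scores can be built by brute force: enumerate every sequence of the motif's length, score it, and weight it by its background probability. A minimal sketch using the 9-column weight matrix from the earlier slides, assuming a uniform background, base-2 log likelihood ratio scores, and a "greater than or equal to" tail; all names are illustrative:

```python
import itertools
import math

WEIGHT_MATRIX = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
# Assumption: the null model is a uniform background.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def null_score_distribution(matrix, background):
    """Return (score, probability) for every sequence of the motif's length."""
    letters = sorted(matrix)
    width = len(matrix[letters[0]])
    log_odds = {
        a: [math.log2(matrix[a][i] / background[a]) for i in range(width)]
        for a in letters
    }
    distribution = []
    for seq in itertools.product(letters, repeat=width):
        score = sum(log_odds[a][i] for i, a in enumerate(seq))
        prob = math.prod(background[a] for a in seq)
        distribution.append((score, prob))
    return distribution

def p_value(threshold, distribution):
    """Probability under the null of a score at least as large as the threshold."""
    return sum(prob for score, prob in distribution if score >= threshold)

NULL = null_score_distribution(WEIGHT_MATRIX, BACKGROUND)
print(f"Pr(score >= 4 | background) = {p_value(4.0, NULL):.4g}")
```

Enumeration is only practical for short motifs (here 4^9 = 262,144 sequences); for longer motifs or protein alphabets the distribution would need a different technique.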
![Page 66: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/66.jpg)
![Page 67: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/67.jpg)
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
![Page 68: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/68.jpg)
Motif discovery problem
• Given sequences
• Find motif
seq. 1: IGRGGFGEVY at position 515
seq. 2: LGEGCFGQVV at position 430
seq. 3: VGSGGFGQVY at position 682
![Page 69: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/69.jpg)
Motif discovery problem
• Given: a sequence or family of sequences.
• Find:
– the number of motifs
– the width of each motif
– the locations of motif occurrences
![Page 70: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/70.jpg)
Why is this hard?
• Input sequences are long (thousands or millions of residues)
• Motif may be subtle
– Instances are short.
– Instances are only slightly similar.
![Page 71: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/71.jpg)
Globin motifs
xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E
![Page 72: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/72.jpg)
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Examples:
• Gibbs Sampler (Lawrence et al.)
• MEME (expectation maximization / Bailey, Elkan)
• ANN-Spec (neural network / Workman, Stormo)
![Page 73: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/73.jpg)
Three Ingredients of Almost any Bioinformatics Method
1. Search space
2. Scoring scheme
3. Search algorithm (= optimization technique)
Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.
Mathematically precise formulation of the problem
![Page 74: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/74.jpg)
Expectation-Maximization
• Guarantees finding a local optimum.
• Widely used in bioinformatics:
– The Baum-Welch algorithm for training HMMs is an example
– So is K-means clustering (e.g. used to analyze microarray data).
![Page 75: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/75.jpg)
Expectation-maximization (EM)
foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score
EM
![Page 76: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/76.jpg)
Sample DNA sequences
>ce1cg TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG
>ara GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG
>bglr1 ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT
>crp CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC
![Page 77: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/77.jpg)
Motif occurrences
>ce1cg taatgtttgtgctggtttttgtggcatcgggcgagaatagcgcgtggtgtgaaagactgttttTTTGATCGTTTTCACaaaaatggaagtccacagtcttgacag
>ara gacaaaaacgcgtaacaaaagtgtctataatcacggcagaaaagtccacattgattaTTTGCACGGCGTCACactttgctatgccatagcatttttatccataag
>bglr1 acaaatcccaataacttaattattgggatttgttatatataactttataaattcctaaaattacacaaagttaataacTGTGAGCATGGTCATatttttatcaat
>crp cacaaagcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatgtaTGCAAAGGACGTCACattaccgtgcagtacagttgatagc
![Page 78: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/78.jpg)
Starting point
…gactgttttTTTGATCGTTTTCACaaaaatgg…
T T T G A T C G T T
A 0.17 0.17 0.17 0.17 0.50 ...
C 0.17 0.17 0.17 0.17 0.17
G 0.17 0.17 0.17 0.50 0.17
T 0.50 0.50 0.50 0.17 0.17
![Page 79: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/79.jpg)
Re-estimating motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
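| Base | T | T | T | G | A | T C G T T ... |
|---|---|---|---|---|---|---|
| A | 0.17 | 0.17 | 0.17 | 0.17 | 0.50 | ... |
| C | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | ... |
| G | 0.17 | 0.17 | 0.17 | 0.50 | 0.17 | ... |
| T | 0.50 | 0.50 | 0.50 | 0.17 | 0.17 | ... |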
Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...
![Page 80: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/80.jpg)
Scoring each subsequence
| Subsequence | Score |
|---|---|
| TGTGCTGGTTTTTGT | 2.95 |
| GTGCTGGTTTTTGTG | 4.62 |
| TGCTGGTTTTTGTGG | 2.31 |
| GCTGGTTTTTGTGGC | ... |
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Select from each sequence the subsequence with maximal score.
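The scoring and selection step can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the helper names (`seed_matrix`, `score`, `best_window`) are hypothetical, and the score is the additive one written on the slide (a sum of matrix entries), not a log likelihood ratio. The scores listed in the table above are not reproduced by this sketch.

```python
def seed_matrix(seed, hit=0.5):
    """One column per seed position: `hit` for the seed base, the rest split evenly."""
    miss = (1.0 - hit) / 3.0  # 0.1667, shown as 0.17 on the slide
    return [{b: (hit if b == s else miss) for b in "ACGT"} for s in seed]

def score(window, matrix):
    """Additive score as on the slide: sum the matrix entry of each base."""
    return sum(col[b] for b, col in zip(window, matrix))

def best_window(seq, matrix):
    """Return (start, subsequence, score) of the maximal-scoring window."""
    w = len(matrix)
    best_start = max(range(len(seq) - w + 1),
                     key=lambda i: score(seq[i:i + w], matrix))
    sub = seq[best_start:best_start + w]
    return best_start, sub, score(sub, matrix)

# Starting matrix from the subsequence TTTGATCGTTTTCAC, applied to the
# sequence fragment shown on the "Re-estimating motif occurrences" slide.
matrix = seed_matrix("TTTGATCGTTTTCAC")
start, sub, s = best_window("TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA", matrix)
print(start, sub, round(s, 2))
```

Ties are broken in favor of the leftmost window, which is an arbitrary choice made for this sketch.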
![Page 81: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/81.jpg)
Re-estimating motif matrix
Occurrences
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC

| Base | Counts at positions 1-15 |
|---|---|
| A | 000132011000040 |
| C | 001010300200403 |
| G | 020301131130000 |
| T | 423001002114001 |
![Page 82: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/82.jpg)
Adding pseudocounts
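| Base | Counts at positions 1-15 |
|---|---|
| A | 000132011000040 |
| C | 001010300200403 |
| G | 020301131130000 |
| T | 423001002114001 |

| Base | Counts + pseudocounts at positions 1-15 |
|---|---|
| A | 111243122111151 |
| C | 112121411311514 |
| G | 131412242241111 |
| T | 534112113225112 |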
![Page 83: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/83.jpg)
Converting to frequencies
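| Base | Counts + pseudocounts at positions 1-15 |
|---|---|
| A | 111243122111151 |
| C | 112121411311514 |
| G | 131412242241111 |
| T | 534112113225112 |

| Base | T | T | T | G | A | T C G T T ... |
|---|---|---|---|---|---|---|
| A | 0.13 | 0.13 | 0.13 | 0.25 | 0.50 | ... |
| C | 0.13 | 0.13 | 0.25 | 0.13 | 0.25 | ... |
| G | 0.13 | 0.38 | 0.13 | 0.50 | 0.13 | ... |
| T | 0.63 | 0.38 | 0.50 | 0.13 | 0.13 | ... |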
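The three steps on these slides (count, add pseudocounts, normalize) can be reproduced with a short Python sketch. The function names are hypothetical; the four occurrences and the pseudocount of 1 are taken from the slides.

```python
OCCURRENCES = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]

def count_matrix(occurrences, pseudocount=0):
    """Per-position base counts; every cell starts at `pseudocount`."""
    width = len(occurrences[0])
    counts = {b: [pseudocount] * width for b in "ACGT"}
    for occ in occurrences:
        for i, base in enumerate(occ):
            counts[base][i] += 1
    return counts

def to_frequencies(counts):
    """Divide each cell by its column total so that every column sums to 1."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] for b in "ACGT") for i in range(width)]
    return {b: [counts[b][i] / totals[i] for i in range(width)] for b in "ACGT"}

raw = count_matrix(OCCURRENCES)
smoothed = count_matrix(OCCURRENCES, pseudocount=1)
freqs = to_frequencies(smoothed)
for base in "ACGT":
    print(base, "".join(str(c) for c in smoothed[base]))
```

With four occurrences and a pseudocount of 1 per base, every column total is 8, so position 1 gives T = 5/8 = 0.625, which the slide shows rounded as 0.63.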
![Page 84: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/84.jpg)
Expectation-maximization
foreach subsequence of width W
  convert subsequence to a matrix
  do {
    re-estimate motif occurrences from matrix
    re-estimate matrix model from motif occurrences
  } until (matrix model stops changing)
end
select matrix with highest score
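The loop above can be turned into a small runnable sketch. This is an illustration under stated assumptions, not MEME itself: it uses the "take the max" variant described so far, the slide's additive score, a pseudocount of 0.5 for the seed matrix (which gives the 0.50 / 0.17 starting values shown earlier) and 1 afterwards, and it stops when the chosen occurrences stop changing, which implies the matrix has stopped changing. All function names are hypothetical.

```python
def build_matrix(occurrences, pseudocount=1.0):
    """Frequency matrix (list of per-position dicts) from aligned occurrences."""
    cols = []
    for i in range(len(occurrences[0])):
        col = {b: pseudocount for b in "ACGT"}
        for occ in occurrences:
            col[occ[i]] += 1
        total = sum(col.values())
        cols.append({b: c / total for b, c in col.items()})
    return cols

def score(window, matrix):
    """Additive score used on the slides."""
    return sum(col[b] for b, col in zip(window, matrix))

def best_occurrence(seq, matrix):
    """Maximal-scoring window of one sequence (leftmost on ties)."""
    w = len(matrix)
    return max((seq[i:i + w] for i in range(len(seq) - w + 1)),
               key=lambda win: score(win, matrix))

def hard_em(seqs, seed, max_iter=50):
    """Alternate occurrence and matrix re-estimation from one starting point."""
    matrix = build_matrix([seed], pseudocount=0.5)
    occurrences = None
    for _ in range(max_iter):
        new = [best_occurrence(s, matrix) for s in seqs]
        if new == occurrences:      # nothing moved, so the matrix is unchanged
            break
        occurrences = new
        matrix = build_matrix(occurrences)
    total = sum(score(o, matrix) for o in occurrences)
    return total, occurrences, matrix

def find_motif(seqs, width):
    """Try every subsequence of the given width as a starting point."""
    seeds = sorted({s[i:i + width] for s in seqs
                    for i in range(len(s) - width + 1)})
    return max((hard_em(seqs, seed) for seed in seeds), key=lambda r: r[0])

total, occurrences, matrix = find_motif(
    ["GGCATTGACAGT", "TTGACACCGGAA", "CATCGTTGACAC"], 6)
print(occurrences, round(total, 3))
```

The three toy sequences are made up for this example; TTGACA is the only 6-mer they all share.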
![Page 85: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/85.jpg)
Problem: This procedure doesn't allow the motifs to move around very much. Taking the max is too brittle.
Solution: Associate with each start site a probability of motif occurrence.
![Page 86: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/86.jpg)
Converting to probabilities
| Occurrence | Score | Prob |
|---|---|---|
| TGTGCTGGTTTTTGT | 2.95 | 0.023 |
| GTGCTGGTTTTTGTG | 4.62 | 0.037 |
| TGCTGGTTTTTGTGG | 2.31 | 0.018 |
| GCTGGTTTTTGTGGC | ... | ... |
| Total | 128.2 | 1.000 |
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
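The conversion is a normalization: each window's probability is its score divided by the total over all windows. A minimal sketch follows, with a hypothetical function name. Because the slide lists only three of the windows, the slide's total of 128.2 is passed in explicitly.

```python
def window_probabilities(scores, total=None):
    """Divide each score by the total (default: the sum of the given scores)."""
    if total is None:
        total = sum(scores)
    return [s / total for s in scores]

# The three scores listed on the slide, normalized by the slide's total.
probs = window_probabilities([2.95, 4.62, 2.31], total=128.2)
print([round(p, 3) for p in probs])
```

Dividing the listed scores by 128.2 gives about 0.023, 0.036 and 0.018. The slide shows 0.037 for the second window, which may reflect rounding of the displayed score.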
![Page 87: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/87.jpg)
Computing weighted counts
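| Occurrence | Prob |
|---|---|
| TGTGCTGGTTTTTGT | 0.023 |
| GTGCTGGTTTTTGTG | 0.037 |
| TGCTGGTTTTTGTGG | 0.018 |
| GCTGGTTTTTGTGGC | ... |

[Weighted count matrix: rows A, C, G, T; columns are motif positions 1, 2, 3, 4, 5, …; values not shown]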
Include counts from all subsequences, weighted by the degree to which they match the motif model.
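A weighted count matrix can be accumulated as in the sketch below: instead of adding 1 per observed base, each window adds its probability. The function name is hypothetical, and only the three windows whose probabilities are listed on the slide are used, so the resulting numbers are partial.

```python
def weighted_counts(windows, probs, pseudocount=0.0):
    """Each window contributes its probability to the cell of every base it contains."""
    width = len(windows[0])
    counts = {b: [pseudocount] * width for b in "ACGT"}
    for window, p in zip(windows, probs):
        for i, base in enumerate(window):
            counts[base][i] += p
    return counts

windows = ["TGTGCTGGTTTTTGT", "GTGCTGGTTTTTGTG", "TGCTGGTTTTTGTGG"]
probs = [0.023, 0.037, 0.018]
counts = weighted_counts(windows, probs)
for base in "ACGT":
    print(base, [round(c, 3) for c in counts[base][:5]])
```

At position 1 the first and third windows both start with T, so T receives 0.023 + 0.018 = 0.041, while G receives 0.037 from the second window.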
![Page 88: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/88.jpg)
![Page 89: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/89.jpg)
Problem: How do we estimate counts accurately when we have only a few examples?
Solution: Use Dirichlet mixture priors.

Problem: Too many possible starting points.
Solution: Save time by running only 1 iteration of EM at first.

Problem: Too many possible widths.
Solution: Consider widths that vary by 2 and adjust motifs afterwards.

Problem: Algorithm assumes exactly one motif occurrence per sequence.
Solution: Normalize motif occurrence probabilities across all sequences, using a user-specified parameter.

Problem: The EM algorithm finds only one motif.
Solution: Probabilistically erase the motif from the data set, and repeat.

Problem: The motif model is too simplistic.
Solution: Use a two-component mixture model that captures the background distribution. Allow the background model to be more complex, e.g. a Markov model.

Problem: The EM algorithm does not tell you how many motifs there are.
Solution: Compute statistical significance of motifs and stop when they are no longer significant.
![Page 90: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/90.jpg)
MEME algorithm
do
  for (width = min; width < max; width *= 2)
    foreach possible starting point
      run 1 iteration of EM
    select candidate starting points
    foreach candidate
      run EM to convergence
  select best motif
  erase motif occurrences
until (E-value of found motif > threshold)
![Page 91: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/91.jpg)
Gibbs Sampling
a type of Markov chain Monte Carlo (MCMC) method
![Page 92: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/92.jpg)
Maximization Versus Sampling
• We are given some huge search space. Every point Z in the search space has some score S_Z defined as before.
• Sampling: wander around the search space in such a way that how often we visit each point is proportional to π_Z = exp(S_Z).
• Maximization: find the point with the highest π_Z, a likelihood ratio value between 0 and +∞.
• EM does maximization and MCMC does sampling.
• MCMC attempts to escape local optima.
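The difference between the two goals can be seen on a toy search space small enough to enumerate. The sketch below is only an illustration: the three points and their scores are made up, and it draws independent samples directly from π, which is feasible only because the space is tiny. MCMC is the tool for spaces too large to enumerate.

```python
import math
import random
from collections import Counter

def pi(scores):
    """pi_Z = exp(S_Z) for every point Z."""
    return {z: math.exp(s) for z, s in scores.items()}

def maximize(scores):
    """Maximization: the single point with the highest pi_Z."""
    weights = pi(scores)
    return max(weights, key=weights.get)

def sample(scores, n, rng):
    """Sampling: visit points with frequency proportional to pi_Z."""
    weights = pi(scores)
    points = list(weights)
    return rng.choices(points, weights=[weights[z] for z in points], k=n)

scores = {"a": 0.0, "b": 2.0, "c": 1.0}   # hypothetical scores S_Z
print(maximize(scores))
visits = Counter(sample(scores, 10000, random.Random(0)))
print({z: visits[z] / 10000 for z in sorted(visits)})
```

Maximization reports one point, while sampling also reports how much weight the other points carry.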
![Page 93: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/93.jpg)
Gibbs Sampling
Use a Markov chain to wander around the search space. If we are at point X, move to point Y with probability M_XY.
Suppose the search space is a 2D rectangle. (Typically, many dimensions!)
• Start at a random point X.
• Randomly pick a dimension.
• Look at all points along this dimension.
• Move to one of them randomly, proportional to its score π.
• Repeat.
[Figure: 2D search space with labels 1 and 2 and the current point X]
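The 2D procedure can be sketched directly. This is a minimal illustration with a made-up 2 x 2 table of weights π; the function name and the choice to pick the dimension with a fair coin are assumptions of the sketch.

```python
import random
from collections import Counter

def gibbs_2d(pi, steps, rng):
    """Random-scan Gibbs sampler on a grid of positive weights pi[row][col]."""
    n_rows, n_cols = len(pi), len(pi[0])
    r, c = rng.randrange(n_rows), rng.randrange(n_cols)   # random start X
    visits = Counter()
    for _ in range(steps):
        if rng.random() < 0.5:
            # Dimension 1: hold the column fixed, resample the row.
            column = [pi[i][c] for i in range(n_rows)]
            r = rng.choices(range(n_rows), weights=column)[0]
        else:
            # Dimension 2: hold the row fixed, resample the column.
            c = rng.choices(range(n_cols), weights=pi[r])[0]
        visits[(r, c)] += 1
    return visits

pi = [[1, 2],
      [3, 4]]
visits = gibbs_2d(pi, 20000, random.Random(0))
print({point: round(n / 20000, 3) for point, n in sorted(visits.items())})
```

Over a long run the visit frequencies should approach π divided by its total, here roughly 0.1, 0.2, 0.3 and 0.4.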
![Page 94: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/94.jpg)
Initialization
Randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.
[Figure: input sequences 1 to 5; one sequence is shown as ACAGTGTTTAGACCGTGACCAACCCAGGCAGGTTT]
![Page 95: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/95.jpg)
Gibbs sampler
• Initially: randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.
• Steps 2 & 3 (search):
  – Throw away an instance s_i: remaining (t - 1) instances define weight matrix.
  – Weight matrix defines instance probability at each position of input string S_i
  – Pick new s_i according to probability distribution
• Return highest-scoring motif seen
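The steps above can be sketched as a small site sampler. This is a simplified illustration, not a published implementation: the function names are hypothetical, the weight matrix uses a pseudocount of 1, window probabilities are products of matrix entries, and "highest-scoring" is taken to mean the largest summed log likelihood ratio against a uniform background of 0.25 per base, which is an assumption of this sketch.

```python
import math
import random

def profile(instances, pseudocount=1.0):
    """Weight matrix (per-position base frequencies) from aligned instances."""
    cols = []
    for i in range(len(instances[0])):
        col = {b: pseudocount for b in "ACGT"}
        for inst in instances:
            col[inst[i]] += 1
        total = sum(col.values())
        cols.append({b: v / total for b, v in col.items()})
    return cols

def window_weight(window, matrix):
    """Probability of the window under the weight matrix."""
    p = 1.0
    for base, col in zip(window, matrix):
        p *= col[base]
    return p

def gibbs_motif(seqs, width, iterations, rng):
    """Return (best score, start positions) seen during sampling."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best = (float("-inf"), None)
    for _ in range(iterations):
        i = rng.randrange(len(seqs))                   # throw away instance s_i
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
        matrix = profile(others)                       # from the t - 1 others
        n_windows = len(seqs[i]) - width + 1
        weights = [window_weight(seqs[i][p:p + width], matrix)
                   for p in range(n_windows)]
        starts[i] = rng.choices(range(n_windows), weights=weights)[0]
        instances = [s[p:p + width] for s, p in zip(seqs, starts)]
        full = profile(instances)
        total = sum(math.log(window_weight(x, full) / 0.25 ** width)
                    for x in instances)
        if total > best[0]:
            best = (total, list(starts))
    return best

seqs = ["GGCATTGACAGT", "TTGACACCGGAA", "CATCGTTGACAC"]   # toy data
score, starts = gibbs_motif(seqs, 6, 500, random.Random(0))
print(round(score, 3), [s[p:p + 6] for s, p in zip(seqs, starts)])
```

Because each new position is drawn at random and not taken as the maximum, repeated runs with different seeds can return different motifs.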
![Page 96: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/96.jpg)
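Sampler step illustration:
ACAGTGTTAGGCGTACACCGT???????CAGGTTT

| Base | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| A | .45 | .45 | .45 | .05 | .05 | .05 | .05 |
| C | .25 | .45 | .05 | .25 | .45 | .05 | .05 |
| G | .05 | .05 | .45 | .65 | .05 | .65 | .05 |
| T | .25 | .05 | .05 | .05 | .45 | .25 | .85 |

Candidate instances: ACGCCGT: 20%, ACGGCGT: 52%
ACAGTGTTAGGCGTACACCGTACGCCGTCAGGTTT
[Other figure labels: sequence 4; 11%]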
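The two percentages can be checked against the weight matrix, assuming its four rows are A, C, G, T in that order (the order of the "ACGT" label in the figure). The sketch below computes each candidate's probability as a product of matrix entries; the function name is hypothetical. The figure's percentages are normalized over candidates that are not all listed, so only the ratio of the two is compared.

```python
MATRIX = {
    "A": [.45, .45, .45, .05, .05, .05, .05],
    "C": [.25, .45, .05, .25, .45, .05, .05],
    "G": [.05, .05, .45, .65, .05, .65, .05],
    "T": [.25, .05, .05, .05, .45, .25, .85],
}

def instance_probability(word, matrix=MATRIX):
    """Product of the matrix entries selected by each base of the word."""
    p = 1.0
    for i, base in enumerate(word):
        p *= matrix[base][i]
    return p

p_c = instance_probability("ACGCCGT")
p_g = instance_probability("ACGGCGT")
print(p_g / p_c, 52 / 20)
```

The two candidates differ only at position 4 (C with .25 versus G with .65), so the ratio is .65 / .25 = 2.6, matching 52% / 20%.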
![Page 97: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/97.jpg)
Comparison
• Both EM and Gibbs sampling involve iterating over two steps
• Convergence:
  – EM converges when the PSSM stops changing.
  – Gibbs sampling runs until you ask it to stop.
• Solution:
  – EM may not find the motif with the highest score.
  – Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.
![Page 98: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/98.jpg)
Comparison of motif finders
![Page 99: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/99.jpg)
Summary
• Motifs are represented by weight matrices.
• Motif quality is measured by relative entropy.
• Motif occurrences are scored using log likelihood ratios.
• EM and the Gibbs sampler attempt to find a motif with maximal relative entropy.
• Both algorithms alternate between predicting instances and predicting the weight matrix.
![Page 100: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/100.jpg)
![Page 101: Genes and Regulatory Elements Zhiping Weng U Mass Medical School](https://reader030.vdocuments.mx/reader030/viewer/2022032703/56649d2d5503460f94a03c3b/html5/thumbnails/101.jpg)
Homework
• Go to the UCSC Genome Browser and get the top 100 regions bound by CTCF.
• Use MEME to find the binding motif of CTCF.