![Page 1: 1 Finding Regulatory Motifs. 2 Copyright notice Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649f275503460f94c3e462/html5/thumbnails/1.jpg)
1
Finding Regulatory Motifs
2
Copyright notice
• Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides of this PowerPoint presentation are from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!
3
Regulation of Transcription
TFs bound to their BSs
Transcription machinery Gene start
4
Transcription Factors
• Proteins that regulate gene expression by binding to the promoter region upstream of transcription initiation sites
• Composed of two essential functional regions: a DNA-binding domain and an activation domain.
5
Transcription factors
[Figure: transcription factors arranged in three layers (Layer I, Layer II, Layer III) above the DNA. Sequence-specific DNA-binding factors: TF1, TF2, TF3, TF4. Non-DNA-binding factors: adapter, co-activator, HAT.]
6
BSs Models
(a) Exact string(s)
Example:
BS = TACACC , TACGGC
CAATGCAGGATACACCGATCGGTA
GGAGTACGGCAAGTCCCCATGTGA
AGGCTGGACCAGACTCTACACCTA
7
BSs Models (II)
(b) String with mismatches
Example:
BS = TACACC + 1 mismatch
CAATGCAGGATTCACCGATCGGTA
GGAGTACAGCAAGTCCCCATGTGA
AGGCTGGACCAGACTCTACACCTA
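The mismatch model above can be sketched in a few lines of Python. This is only an illustration: the function names `hamming` and `find_with_mismatches` are hypothetical, and the scan simply slides a window over each sequence and keeps every word within the allowed number of mismatches.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(seq, site, max_mismatches=1):
    """Return (start, word) for every window of seq within max_mismatches of site."""
    k = len(site)
    return [(i, seq[i:i + k])
            for i in range(len(seq) - k + 1)
            if hamming(seq[i:i + k], site) <= max_mismatches]


# The three example sequences from the slide, searched for TACACC + 1 mismatch.
seqs = ["CAATGCAGGATTCACCGATCGGTA",
        "GGAGTACAGCAAGTCCCCATGTGA",
        "AGGCTGGACCAGACTCTACACCTA"]
hits = [find_with_mismatches(s, "TACACC") for s in seqs]
for s, h in zip(seqs, hits):
    print(s, h)
```

Setting `max_mismatches=0` reduces this to the exact-string model (a).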
8
BSs Models (III)
(c) Degenerate string
Example:
BS = TASDAC (S={C,G} D={A,G,T})
CAATGCAGGATACAACGATCGGTA
GGAGTAGTACAAGTCCCCATGTGA
AGGCTGGACCAGACTCTACGACTA
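A degenerate string can be matched by expanding each symbol into the set of bases it allows. The sketch below assumes a small lookup table (`CODES`, a hypothetical name) that contains only the plain bases plus the two degenerate symbols S and D defined on the slide.

```python
# Allowed bases per symbol; S and D follow the definitions on the slide.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "CG", "D": "AGT"}


def matches(word, pattern):
    """True if every base of word is allowed by the pattern symbol at that position."""
    return len(word) == len(pattern) and all(
        base in CODES[sym] for base, sym in zip(word, pattern))


def find_degenerate(seq, pattern):
    """Return (start, word) for every window of seq that matches the pattern."""
    k = len(pattern)
    return [(i, seq[i:i + k])
            for i in range(len(seq) - k + 1)
            if matches(seq[i:i + k], pattern)]


for s in ["CAATGCAGGATACAACGATCGGTA",
          "GGAGTAGTACAAGTCCCCATGTGA",
          "AGGCTGGACCAGACTCTACGACTA"]:
    print(s, find_degenerate(s, "TASDAC"))
```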
9
BSs Models (IV)
(d) Position Weight Matrix (PWM)
Example: BS =
A 0.1 0.8 0 0.7 0.2 0
C 0 0.1 0.5 0.1 0.4 0.6
G 0 0 0.5 0.1 0.4 0.1
T 0.9 0.1 0 0.1 0 0.3
ATGCAGGATACACCGATCGGTA 0.0605
GGAGTAGAGCAAGTCCCGTGA 0.0605
AAGACTCTACAATTATGGCGT 0.0151
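The numbers next to the three sequences appear to be the highest probability that the PWM assigns to any 6-long window of the sequence. A minimal sketch of that calculation, with hypothetical helper names `word_probability` and `best_site`:

```python
# PWM from the slide: PWM[base][position] = probability of base at that position.
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}


def word_probability(word):
    """Product of the per-position probabilities of the bases in word."""
    p = 1.0
    for i, base in enumerate(word):
        p *= PWM[base][i]
    return p


def best_site(seq, width=6):
    """Return (probability, start) of the best-scoring window in seq."""
    return max((word_probability(seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


for s in ["ATGCAGGATACACCGATCGGTA",
          "GGAGTAGAGCAAGTCCCGTGA",
          "AAGACTCTACAATTATGGCGT"]:
    prob, start = best_site(s)
    print(s, s[start:start + 6], round(prob, 4))
```

For the first sequence the best window is TACACC, with probability 0.9 × 0.8 × 0.5 × 0.7 × 0.4 × 0.6 ≈ 0.0605.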
10
Position Weight Matrix (PWM)
Frequency Matrix 1 2 3 4 5
a 12 1 0 1 0
c 1 1 10 9 0
g 2 5 5 2 14
t 0 7 0 2 1
1
15
Weight Matrix 1 2 3 4 5
a 5.1 -5.7 -8.7 -5.7 -8.7
c -5.7 -5.7 4.2 3.8 -8.7
g -2.7 1.2 1.2 -2.7 5.7
t -8.7 2.7 -8.7 -2.7 -5.7
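Weight of base $b$ at position $i$: $10 \log_{10} \dfrac{p_i(b)}{f(b)}$
where $p_i(b)$ is the frequency of base $b$ at position $i$ and $f(b)$ is the background frequency of $b$.
NOTE: Use pseudo-counts for zero frequencies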
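The conversion from the frequency matrix to the weight matrix can be sketched as follows. Three things here are assumptions rather than facts stated on the slide: the number of aligned sites is taken to be 15, the background frequency is taken to be 0.25 for every base, and a zero count is replaced by a pseudo-count of 0.5. With those choices the sketch gives values close to the weights shown above.

```python
import math

# Frequency (count) matrix from the slide: COUNTS[base][position].
COUNTS = {
    "a": [12, 1, 0, 1, 0],
    "c": [1, 1, 10, 9, 0],
    "g": [2, 5, 5, 2, 14],
    "t": [0, 7, 0, 2, 1],
}


def weight_matrix(counts, n_sites=15, background=0.25, pseudo=0.5):
    """10 * log10(p_i(b) / f(b)); zero counts are replaced by `pseudo` (assumption)."""
    weights = {}
    for base, row in counts.items():
        weights[base] = [
            10 * math.log10((c if c > 0 else pseudo) / n_sites / background)
            for c in row
        ]
    return weights


W = weight_matrix(COUNTS)
for base, row in W.items():
    print(base, [round(w, 1) for w in row])
```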
11
Predicting Motif Occurrences:Sequence Scoring
1 2 3 4 5
a 5.1 -5.7 -8.7 -5.7 -8.7
c -5.7 -5.7 4.2 3.8 -8.7
g -2.7 1.2 1.2 -2.7 5.7
t -8.7 2.7 -8.7 -2.7 -5.7
a g c g g t a
Sum = 13.5
1 2 3 4 5
a 5.1 -5.7 -8.7 -5.7 -8.7
c -5.7 -5.7 4.2 3.8 -8.7
g -2.7 1.2 1.2 -2.7 5.7
t -8.7 2.7 -8.7 -2.7 -5.7
Sum = -15.6
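The two sums above come from sliding the weight matrix along the sequence one position at a time. A short sketch of that scan, using the weight matrix exactly as printed on the slide (the helper names are hypothetical):

```python
# Weight matrix from the slide: WEIGHTS[base][position].
WEIGHTS = {
    "a": [5.1, -5.7, -8.7, -5.7, -8.7],
    "c": [-5.7, -5.7, 4.2, 3.8, -8.7],
    "g": [-2.7, 1.2, 1.2, -2.7, 5.7],
    "t": [-8.7, 2.7, -8.7, -2.7, -5.7],
}


def score_window(word):
    """Sum of the weights of the bases in word, one per matrix column."""
    return sum(WEIGHTS[base][i] for i, base in enumerate(word))


def scan(seq, width=5):
    """Score every window of the sequence; returns (start, score) pairs."""
    return [(i, round(score_window(seq[i:i + width]), 1))
            for i in range(len(seq) - width + 1)]


scores = scan("agcggta")
print(scores)
```

The first window (agcgg) gives 5.1 + 1.2 + 4.2 - 2.7 + 5.7 = 13.5 and the second (gcggt) gives -15.6, matching the slide. In practice a threshold on the score decides which windows are reported as motif occurrences.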
12
BSs Models (V)
(e) More complex models
– PWM with spacers (e.g., for p53)
– Markov model (dependency between adjacent columns of PWM)
– Hybrid models, e.g., mixture of two PWMs
– …
… And we also need to model the non-BSs sequences in the promoters…
13
Motif Representations
CGGCGCACTCTCGCCCG
CGGGGCAGACTATTCCG
CGGCGGCTTCTAATCCG
...
CGGGGCAGACTATTCCG
1. Consensus: CGGNGCACANTCNTCCG
2. Frequency Matrix
3. Logo
14
Logos
• Graphical representation of nucleotide base (or amino acid) conservation in a motif (or alignment)
• Information theory
• Height of letters represents relative frequency of nucleotide bases
http://weblogo.berkeley.edu/
Information content of a position (in bits): $2 + \sum_{b \in \{A,C,G,T\}} p(b) \log_2 p(b)$
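The information content of one alignment column can be computed directly from the base frequencies. The sketch below also scales each letter by its frequency times the column's information content, which is the usual sequence-logo convention; that scaling is an assumption here, since the slide only says that letter height reflects relative frequency.

```python
import math


def information_content(freqs):
    """Bits of information in one column: 2 + sum_b p(b) * log2 p(b)."""
    return 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)


def letter_heights(freqs):
    """Height of each letter: its frequency times the column's information content."""
    ic = information_content(freqs)
    return {base: p * ic for base, p in freqs.items()}


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
two_way = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
for name, col in [("uniform", uniform), ("conserved", conserved), ("two_way", two_way)]:
    print(name, information_content(col), letter_heights(col))
```

A uniform column carries 0 bits, a fully conserved column carries 2 bits, and a column split evenly between two bases carries 1 bit.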
15
Regulatory Motif Discovery
[Figure labels: DNA; group of co-regulated genes; common subsequence]
Find motifs within groups of co-regulated genes
16
How to find novel motifs
Degenerate string:
• YMF – Sinha & Tompa ’02
String with mismatches:
• WINNOWER – Pevzner & Sze ’00
• Random Projections – Buhler & Tompa ’02
• MULTIPROFILER – Keich & Pevzner ’02
PWM:
• MEME – Bailey & Elkan ’95
• AlignACE – Hughes et al. ’98
• CONSENSUS – Hertz & Stormo ’99
17
How to find TF modules
• BioProspector – Liu et al. ‘01
• Co-Bind – GuhaThakurta & Stormo ‘01
• MITRA – Eskin & Pevzner ‘02
• CREME – Sharan et al. ‘03
• MCAST – Bailey & Noble ‘03
18
Characteristics of Regulatory Motifs
• Tiny
• Highly Variable
• ~Constant Size
– Because a constant-size transcription factor binds
• Often repeated
• Low-complexity
19
Problem Definition
Given a collection of promoter sequences s1,…, sN of genes with common expression
Probabilistic
Motif: Mij; 1 ≤ i ≤ W, 1 ≤ j ≤ 4
Mij = Prob[ letter j, pos i ]
Find best M, and positions p1,…, pN in sequences
Combinatorial
Motif M: m1…mW
Some of the mi’s blank
Find M that occurs in all si with ≤ k differences
20
Discrete Approaches to Motif Finding
21
Discrete Formulations
Given sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = minimum Hamming distance between W and any K-long word in xi
d(W, S) = Σ_i d(W, xi)
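The two distances defined above translate directly into code. A small sketch (the names `d_seq` and `d_total` are hypothetical), applied to the sequences from the earlier "string with mismatches" example:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def d_seq(w, x):
    """d(W, x_i): minimum Hamming distance between w and any len(w)-long word in x."""
    k = len(w)
    return min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))


def d_total(w, seqs):
    """d(W, S): sum of d(W, x_i) over all sequences."""
    return sum(d_seq(w, x) for x in seqs)


S = ["CAATGCAGGATTCACCGATCGGTA",
     "GGAGTACAGCAAGTCCCCATGTGA",
     "AGGCTGGACCAGACTCTACACCTA"]
print([d_seq("TACACC", x) for x in S], d_total("TACACC", S))
```

For the consensus TACACC the per-sequence distances are 1, 1 and 0, so d(W, S) = 2.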
22
Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities)
    Find d( W, S )
Report W* = argmin( d(W, S) )
Running time: O( K N 4^K ) (where N = Σ_i |xi|)
Advantage: Finds provably “best” motif W
Disadvantage: Time
23
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some xi
    Find d( W, S )
Report W* = argmin( d( W, S ) )
or, Report a local improvement of W*
Running time: O( K N^2 )
Advantage: Time
Disadvantage: If the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
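Both exhaustive searches can be sketched with the same distance function; they differ only in the set of candidate words. The function names below are hypothetical, and ties between equally good words are broken arbitrarily (by enumeration order), which the slides do not specify.

```python
from itertools import product


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def d_total(w, seqs):
    """d(W, S): sum over sequences of the minimum Hamming distance to any K-long word."""
    k = len(w)
    return sum(min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))
               for x in seqs)


def pattern_driven(seqs, k):
    """Try all 4^K words over {A, C, G, T}; O(K N 4^K)."""
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return min(candidates, key=lambda w: d_total(w, seqs))


def sample_driven(seqs, k):
    """Try only the K-long words that actually occur in the data; O(K N^2)."""
    candidates = sorted({x[i:i + k] for x in seqs for i in range(len(x) - k + 1)})
    return min(candidates, key=lambda w: d_total(w, seqs))


toy = ["AAACGTAA", "CCACGTCC", "GGACGTGG"]
print(pattern_driven(toy, 4), sample_driven(toy, 4))
```

On this toy input both searches return ACGT, the only 4-mer present in all three sequences. The two can disagree when the best consensus never occurs verbatim in the data, which is the weakness of the sample-driven search noted above.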
24
MULTIPROFILER
• Extended sample-driven approach
Given a K-long word W, define:
Nα(W) = words W’ in S s.t. d(W,W’) ≤ α
Idea:
Assume W is occurrence of true motif W*
Will use Nα(W) to correct “errors” in W
25
MULTIPROFILER
Assume W differs from true motif W* in at most L positions
Define:
A wordlet G of W is an L-long pattern with blanks, differing from W
– L is smaller than the word length K
Example:
K = 7; L = 3
W = ACGTTGA
G = --A--CG
26
MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to Lmax
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G,
       1. Modify W by the wordlet G → W’
       2. Compute d(W’, S)
Report W* = argmin d(W’, S)
Step 2 above: Smaller motif-finding problem; Use exhaustive search
27
CONSENSUS
Algorithm:
Cycle 1:
For each word W in S (of fixed length!)
  For each word W’ in S
    Create alignment (gap free) of W, W’
Keep the C1 best alignments, A1, …, AC1
ACGGTTG , CGAACTT , GGGCTCT …
ACGCCTG , AGAACTA , GGGGTGT …
28
CONSENSUS
Algorithm:
Cycle t:
For each word W in S
  For each alignment Aj from cycle t-1
    Create alignment (gap free) of W, Aj
Keep the Ct best alignments A1, …, ACt
ACGGTTG , CGAACTT , GGGCTCT …
ACGCCTG , AGAACTA , GGGGTGT …
… … …
ACGGCTC , AGATCTT , GGCGTCT …
29
CONSENSUS
• C1, …, Cn are user-defined heuristic constants
– N is sum of sequence lengths
– n is the number of sequences
Running time:
O(N^2) + O(N C1) + O(N C2) + … + O(N Cn)
= O( N^2 + N Ctotal )
where Ctotal = Σ_i Ci, typically O(nC), where C is a big constant
30
Expectation Maximization in Motif Finding
31
Expectation Maximization
[Figure: all K-long words, split into motif and background]
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is motif or background
3. Find likeliest
– Motif Model
– Background Model
– classification of words into either Motif or Background
32
Expectation Maximization
Given sequences x1, …, xN,
• Find all K-long words X1,…, Xn
• Define motif model: M = (M1,…, MK)
Mi = (Mi1,…, Mi4) (assume {A, C, G, T})
where Mij = Prob[ letter j occurs in motif position i ]
• Define background model: B = B1, …, B4
Bj = Prob[ letter j in background sequence ]
[Figure: motif model M1 … MK and background model B, each a distribution over A, C, G, T]
33
Expectation Maximization
• Define
Zi1 = { 1, if Xi is motif; 0, otherwise }
Zi2 = { 0, if Xi is motif; 1, otherwise }
• Given a word Xi = x[s]…x[s+K-1],
P[ Xi, Zi1=1 ] = λ M1x[s]…MKx[s+K-1]
P[ Xi, Zi2=1 ] = (1 – λ) Bx[s]…Bx[s+K-1]
Let λ1 = λ; λ2 = (1 – λ)
[Figure: a word is drawn from the motif model M1 … MK with probability λ and from the background model B with probability 1 – λ]
34
Expectation Maximization
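Define:
Parameter space θ = (M, B)
θ1: Motif; θ2: Background
Objective:
Maximize log likelihood of model:
$$\log P(X_1,\dots,X_n, Z \mid \theta, \lambda) = \sum_{i=1}^{n}\sum_{j=1}^{2} Z_{ij}\,\log\big(\lambda_j\,P(X_i \mid \theta_j)\big)$$
$$= \sum_{i=1}^{n}\sum_{j=1}^{2} Z_{ij}\,\log P(X_i \mid \theta_j) + \sum_{i=1}^{n}\sum_{j=1}^{2} Z_{ij}\,\log \lambda_j$$
[Figure: motif model M1 … MK chosen with probability λ, background model B with probability 1 – λ]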
35
Expectation Maximization
• Maximize expected likelihood, in iteration of two steps:
Expectation: Find expected value of log likelihood: $E[\log P(X_1,\dots,X_n, Z \mid \theta, \lambda)]$
Maximization: Maximize expected value over θ, λ
36
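Expectation Maximization: E-step
Expectation:
Find expected value of log likelihood:
$$E[\log P(X_1,\dots,X_n, Z \mid \theta, \lambda)] = \sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}]\,\log P(X_i \mid \theta_j) + \sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}]\,\log \lambda_j$$
where expected values of Z can be computed as follows:
$$E[Z_{ij}] = \mathrm{Prob}[Z_{ij}=1] = \frac{\lambda_j\,P(X_i \mid \theta_j)}{\lambda\,P(X_i \mid \theta_1) + (1-\lambda)\,P(X_i \mid \theta_2)} = Z^*_{ij}$$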
37
Expectation Maximization: M-step
Maximization: Maximize expected value over θ and λ independently
For λ, this has the following solution: (we won’t prove it)
$$\lambda^{NEW} = \arg\max_{\lambda} \sum_{i=1}^{n}\big( Z^*_{i1}\,\log\lambda + Z^*_{i2}\,\log(1-\lambda) \big) = \frac{1}{n}\sum_{i=1}^{n} Z^*_{i1}$$
Effectively, λ^NEW is the expected # of motifs per position, given our current parameters
38
Expectation Maximization: M-step
• For θ = (M, B), define
cjk = E[ # times letter k appears in motif position j ]
c0k = E[ # times letter k appears in background ]
• cjk values are calculated easily from Z* values
It then follows:
$$M^{NEW}_{jk} = \frac{c_{jk}}{\sum_{k'=1}^{4} c_{jk'}} \qquad B^{NEW}_{k} = \frac{c_{0k}}{\sum_{k'=1}^{4} c_{0k'}}$$
To not allow any 0’s, add pseudocounts
39
Initial Parameters Matter!
Consider the following artificial example:
6-mers X1, …, Xn: (n = 2000)
– 990 words “AAAAAA”
– 990 words “CCCCCC”
– 20 words “ACACAC”
Some local maxima:
λ = 49.5%; B = 100/101 C, 1/101 A; M = 100% AAAAAA
λ = 1%; B = 50% C, 50% A; M = 100% ACACAC
40
Overview of EM Algorithm
1. Initialize parameters θ = (M, B), λ:
– Try different values of λ from N^(-1/2) up to 1/(2K)
2. Repeat:
a. Expectation
b. Maximization
3. Until change in θ = (M, B), λ falls below ε
4. Report results for several “good” λ
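The E-step and M-step described on the preceding slides can be put together in a compact sketch. Several simplifications are assumptions made only for illustration: the K-long words are supplied directly, the loop runs for a fixed number of iterations instead of testing the change against ε, and a small uniform pseudocount (`pseudo`) keeps all probabilities non-zero. The function name `em_motif` is hypothetical.

```python
import math

ALPHABET = "ACGT"


def em_motif(words, M, B, lam, iters=50, pseudo=1e-3):
    """EM for a two-component (motif / background) mixture over K-long words.

    M: list of K dicts (letter -> prob), B: dict (letter -> prob), lam: motif prior.
    """
    K, n = len(words[0]), len(words)
    for _ in range(iters):
        # E-step: Z*_i1 = lam P(X_i|M) / (lam P(X_i|M) + (1 - lam) P(X_i|B))
        z = []
        for w in words:
            pm = lam * math.prod(M[i][c] for i, c in enumerate(w))
            pb = (1.0 - lam) * math.prod(B[c] for c in w)
            z.append(pm / (pm + pb))
        # M-step for lambda: average of Z*_i1
        lam = sum(z) / n
        # M-step for M and B: expected letter counts, then normalise
        cM = [{a: pseudo for a in ALPHABET} for _ in range(K)]
        cB = {a: pseudo for a in ALPHABET}
        for w, zi in zip(words, z):
            for i, c in enumerate(w):
                cM[i][c] += zi
                cB[c] += 1.0 - zi
        M = [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in cM]
        B = {a: cB[a] / sum(cB.values()) for a in ALPHABET}
    return M, B, lam


# Toy usage: 30 copies of a motif word hidden among 70 background words.
words = ["ACGT"] * 30 + ["TTTT"] * 70
M0 = [{a: (0.7 if a == m else 0.1) for a in ALPHABET} for m in "ACGT"]
B0 = {a: 0.25 for a in ALPHABET}
M, B, lam = em_motif(words, M0, B0, lam=0.5)
print(round(lam, 3), "".join(max(col, key=col.get) for col in M))
```

As the "Initial Parameters Matter!" slide warns, the result depends on the starting point, so in practice the procedure is restarted from several initial values of λ and M.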
41
Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy
Lawrence et al. 1993
42
Notations
• Set of symbols: Σ
• Sequences: S = {S1, S2, …, SN}
• Starting positions of motifs: A = {a1, a2, …, aN}
• Motif model (θ): qij = P(symbol at the i-th position = j)
• Background model (θ0): pj = P(symbol = j)
• Count of symbols in each column: cij = count of symbol j in the i-th column in the aligned region
43
Probability of data given model
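$$P(S \mid A, \theta) = \prod_{i=1}^{W}\prod_{j=1}^{|\Sigma|} q_{ij}^{\,c_{ij}}$$
$$P(S \mid A, \theta_0) = \prod_{i=1}^{W}\prod_{j=1}^{|\Sigma|} p_{j}^{\,c_{ij}}$$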
44
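Scoring Function
$$P(S \mid A, \theta) = \prod_{i=1}^{W}\prod_{j=1}^{|\Sigma|} q_{ij}^{\,c_{ij}} \qquad P(S \mid A, \theta_0) = \prod_{i=1}^{W}\prod_{j=1}^{|\Sigma|} p_{j}^{\,c_{ij}}$$
• Maximize the log-odds ratio:
$$F = \log\frac{P(S \mid A, \theta)}{P(S \mid A, \theta_0)} = \sum_{i=1}^{W}\sum_{j=1}^{|\Sigma|} c_{ij}\,\log\frac{q_{ij}}{p_j}$$
• Is greater than zero if the data is a better match to the motif model than to the background model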
45
Scoring function
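$$F = \log\frac{P(S \mid A, \theta)}{P(S \mid A, \theta_0)} = \sum_{i=1}^{W}\sum_{j=1}^{|\Sigma|} c_{ij}\,\log\frac{q_{ij}}{p_j}$$
• A particular alignment “A” gives us the counts cij.
• In the scoring function “F”, use:
$$q_{ij} = \frac{c_{ij} + b_j}{N - 1 + B}$$
where the b_j are pseudocounts and B is their sum.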
46
Scoring function
• Thus, given an alignment A, we can calculate the scoring function F
• We need to find A that maximizes this scoring function, which is a log-odds score
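Computing F for a given alignment is a matter of counting letters column by column. The sketch below is illustrative only: the function names are hypothetical, a uniform pseudocount is assumed for every symbol, and because all N sequences are counted the denominator is N + B rather than the N − 1 + B that applies when one sequence is held out during sampling.

```python
import math


def alignment_counts(seqs, starts, width, alphabet="ACGT"):
    """c_ij: count of symbol j in column i of the aligned region."""
    counts = [{a: 0 for a in alphabet} for _ in range(width)]
    for s, a in zip(seqs, starts):
        for i in range(width):
            counts[i][s[a + i]] += 1
    return counts


def log_odds_score(seqs, starts, width, background, pseudo=0.25):
    """F = sum_i sum_j c_ij * log(q_ij / p_j), natural log."""
    n = len(seqs)
    total_pseudo = pseudo * len(background)
    F = 0.0
    for col in alignment_counts(seqs, starts, width, alphabet="".join(background)):
        for j, c in col.items():
            if c:
                q = (c + pseudo) / (n + total_pseudo)
                F += c * math.log(q / background[j])
    return F


seqs = ["TTACGT", "ACGTTT", "GACGTG"]
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
F_aligned = log_odds_score(seqs, [2, 0, 1], 4, bg)     # every site is ACGT
F_misaligned = log_odds_score(seqs, [0, 0, 0], 4, bg)  # sites TTAC, ACGT, GACG
print(F_aligned, F_misaligned)
```

The alignment that lines up the shared word ACGT scores much higher than the misaligned one, which is what the search for the best A exploits.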
47
Optimization and Sampling
• To maximize a function, f(x):
– Brute force method: try all possible x
– Sample method: sample x from probability distribution: p(x) ~ f(x)
– Idea: suppose xmax is argmax of f(x), then it is also argmax of p(x), thus we have a high probability of selecting xmax
48
Markov Chain Sampling
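• To sample from a probability distribution p(x), we set up a Markov chain s.t. each state represents a value of x and for any two states, x and y, the transitional probabilities satisfy:
$$p(x)\,\pi(x \to y) = p(y)\,\pi(y \to x)$$
• This would then imply:
$$\lim_{N \to \infty} \frac{1}{N}\,C(x) = p(x)$$
where C(x) is the number of times the chain visits state x in N steps.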
49
Gibbs sampling to maximize F
• Gibbs sampling is a special type of Markov chain sampling algorithm
• Our goal is to find the optimal A = (a1,…aN)
• The Markov chain we construct will only have transitions from A to alignments A’ that differ from A in only one of the ai
• In round-robin order, pick one of the ai to replace
• Consider all A’ formed by replacing ai with some other starting position ai’ in sequence Si
• Move to one of these A’ probabilistically
• Iterate the last three steps
50
Algorithm
Randomly initialize A0;
Repeat:
(1) randomly choose a sequence z from S; A* = At \ az; compute θt from A*;
(2) sample az according to P(az = x), which is proportional to Qx/Px; update At+1 = A* ∪ x;
Select At that maximizes F;
Qx: the probability of generating x according to θt;
Px: the probability of generating x according to the background model
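The loop above can be sketched as follows. This is a simplified illustration, not the implementation from Lawrence et al.: the function name `gibbs_motif` is hypothetical, the background model is assumed to be the overall base composition of the input, a uniform pseudocount is used, the number of iterations is fixed, and phase shifts and width selection are left out.

```python
import math
import random


def gibbs_motif(seqs, width, iters=500, pseudo=0.25, seed=0, alphabet="ACGT"):
    """Return (starts, F) for the best alignment seen during sampling."""
    rng = random.Random(seed)
    n, B = len(seqs), pseudo * len(alphabet)
    total = sum(len(s) for s in seqs)
    # Background p_j: overall composition of the sequences (assumption).
    bg = {b: (sum(s.count(b) for s in seqs) + pseudo) / (total + B) for b in alphabet}
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]  # random A_0

    def score(st):
        """Log-odds score F of an alignment given by its start positions."""
        F = 0.0
        for i in range(width):
            col = [s[a + i] for s, a in zip(seqs, st)]
            for b in alphabet:
                c = col.count(b)
                if c:
                    F += c * math.log((c + pseudo) / (n + B) / bg[b])
        return F

    best, best_F = list(starts), score(starts)
    for _ in range(iters):
        z = rng.randrange(n)  # (1) choose a sequence z and hold it out
        q = []                # theta_t estimated from A* = A_t \ a_z
        for i in range(width):
            col = [s[a + i] for k, (s, a) in enumerate(zip(seqs, starts)) if k != z]
            q.append({b: (col.count(b) + pseudo) / (n - 1 + B) for b in alphabet})
        # (2) weight Q_x / P_x for every candidate start x in sequence z
        weights = []
        for x in range(len(seqs[z]) - width + 1):
            w = 1.0
            for i in range(width):
                b = seqs[z][x + i]
                w *= q[i][b] / bg[b]
            weights.append(w)
        starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
        F = score(starts)
        if F > best_F:  # keep the A_t that maximizes F
            best, best_F = list(starts), F
    return best, best_F


seqs = ["TTGACGTCAA", "ACGTCATTTT", "GGGGACGTCA", "CCACGTCACC"]
starts, F = gibbs_motif(seqs, width=6, iters=200, seed=1)
print(starts, [s[a:a + 6] for s, a in zip(seqs, starts)], round(F, 2))
```

Because the sampler is stochastic, different seeds can end in different alignments; running several seeds and keeping the highest F is a common precaution.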
51
Algorithm
Current solution At
52
Algorithm
Choose one az to replace
53
Algorithm
For each candidate site x in sequence z, calculate Qx and Px: probabilities of sampling x from the motif model and the background model, respectively.
54
Algorithm
Among all possible candidates, choose one (say x) with probability proportional to Qx/Px
55
Algorithm
Set At+1 = A* ∪ x
56
Algorithm
Repeat
57
Local optima
• The algorithm may not find the “global” or true maximum of the scoring function
• Once “At” contains many similar substrings, others matching these will be chosen with higher probability
• Algorithm will “get locked” into a “local optimum”
– all neighbors have poorer scores, hence low chance of moving out of this solution
58
Phase shifts
• After every M iterations, compare the current At with alignments obtained by shifting every aligned substring ai by some amount, either to left or right
59
Phase shift
60
Phase shift
61
Pattern Width
• The algorithm described so far requires the pattern width (W) to be input.
• We can modify the algorithm so that it executes for a range of plausible widths.
• The function F is not immediately useful for this purpose as its optimal value always increases with increasing W.
62
Pattern Width
• Another function based on the incomplete-data log-probability ratio G can be used.
• Dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) produces a statistic useful for choosing pattern width. This quantity can be called information per parameter.
63
Time complexity analysis
• For a typical protein sequence, it was found that, for a single pattern width, each input sequence needs to be sampled fewer than T = 100 times before convergence.
• L*W multiplications are performed in Step 2 of the algorithm.
• Total multiplications to execute the algorithm = T · N · Lavg · W
• Linear Time complexity has been observed in applications
64
Motif finding
• The Gibbs sampling algorithm was originally applied to find motifs in amino acid sequences
– Protein motifs represent common sequence patterns in proteins that are related to certain structure and function of the protein
• Gibbs sampling is extensively used to find motifs in DNA sequence, i.e., transcription factor binding sites
65
Advantages / Disadvantages
• Very similar to EM
Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics
Disadvantages:
• More dependent on all sequences to exhibit the motif
• Less systematic search of initial parameter space
66
Repeats, and a Better Background Model
• Repeat DNA can be confused as motif
– Especially low-complexity CACACA… AAAAA, etc.
Solution:
more elaborate background model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }
Has been applied to EM and Gibbs (up to 3rd order)
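A K-th order background model can be estimated by counting which base follows each K-long context. The sketch below is illustrative: the function name `markov_background` and the optional pseudocount are assumptions, and contexts that never occur in the training sequences are simply absent from the result.

```python
from collections import defaultdict


def markov_background(seqs, order=1, alphabet="ACGT", pseudo=0.0):
    """Estimate P(X | b1...bK) from sequences; returns {context: {base: prob}}."""
    counts = defaultdict(lambda: {a: pseudo for a in alphabet})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {a: c / sum(row.values()) for a, c in row.items()}
            for ctx, row in counts.items()}


# A low-complexity repeat is fully predictable under a 1st-order model ...
first = markov_background(["ACACACACAC"], order=1)
print(first["A"]["C"], first["C"]["A"])
# ... while the 0th-order model only sees 50% A and 50% C.
zeroth = markov_background(["ACACACACAC"], order=0)
print(zeroth[""])
```

Under the 1st-order model a repeat such as ACACAC gets a high background probability, so it no longer stands out as a motif.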
67
Limits of Motif Finders
• Given upstream regions of coregulated genes:
– Increasing length makes motif finding harder – random motifs clutter the true ones
– Decreasing length makes motif finding harder – true motif missing in some sequences
[Figure labels: 0; gene; ???]
68
Example Application: Motifs in Yeast
Group:
Tavazoie et al. 1999, G. Church’s lab, Harvard
Data:
• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)
• 15 time points across two cell cycles
1. Clustering genes according to common expression
• K-means clustering -> 30 clusters, 50-190 genes/cluster
• Clusters correlate well with known function
2. AlignACE motif finding
• 600-long upstream regions
69
Motifs in Periodic Clusters
70
Motifs in Non-periodic Clusters