Expectation Maximization Meets Sampling in Motif Finding, Zhizhuo Zhang
Post on 14-Jan-2016
EXPECTATION MAXIMIZATION MEETS SAMPLING IN MOTIF FINDING
Zhizhuo Zhang
OUTLINE
  Review of Mixture Model and EM Algorithm
  Importance Sampling
  Re-sampling EM
  Extending EM
  Integrate Other Features
  Result
REVIEW MOTIF FINDING: MIXTURE MODELING
Given a dataset X, a motif model Θ, and a background model θ0, the likelihood of the observed X is defined as a mixture of the two models.
Optimizing this likelihood directly is NP-hard; the EM algorithm approaches the problem through the concept of missing data. Assume the missing data Zi is a Boolean flag marking whether site i is a binding site.
Each site then belongs to either the motif component or the background component.
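A sketch of the two likelihoods in LaTeX, assuming the standard two-component mixture form used by EM-based motif finders. The site count n and the mixing weight λ are notation introduced here for illustration; they do not appear on the slide.

```latex
% Observed-data likelihood: each site X_i is drawn from the motif or the background
P(X \mid \Theta, \theta_0)
  = \prod_{i=1}^{n} \Big[ \lambda\, P(X_i \mid \Theta) + (1-\lambda)\, P(X_i \mid \theta_0) \Big]

% Complete-data likelihood once the Boolean flag Z_i is known
P(X, Z \mid \Theta, \theta_0)
  = \prod_{i=1}^{n} \big[ \lambda\, P(X_i \mid \Theta) \big]^{Z_i}
                    \big[ (1-\lambda)\, P(X_i \mid \theta_0) \big]^{1-Z_i}
```

In the first expression the λ term is the motif component and the (1-λ) term is the background component.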
REVIEW MOTIF FINDING: EM
E-step:
M-step:
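A minimal Python sketch of one EM iteration, assuming the standard two-component mixture updates: the E-step computes each site's posterior probability of being a binding site, and the M-step re-estimates the PWM from the expected counts. The function names, the fixed background, the mixing weight `lam`, and the pseudocount are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def site_prob_motif(site, pwm):
    """Probability of a site under a PWM given as a list of per-column dicts."""
    return math.prod(col[base] for col, base in zip(pwm, site))

def site_prob_background(site, bg):
    """Probability of a site under a zero-order background model."""
    return math.prod(bg[base] for base in site)

def em_iteration(sites, pwm, bg, lam, pseudo=0.01):
    """Run one E-step and one M-step; the background is kept fixed (assumption)."""
    # E-step: posterior probability that each site is a binding site.
    z = []
    for site in sites:
        p_motif = lam * site_prob_motif(site, pwm)
        p_bg = (1.0 - lam) * site_prob_background(site, bg)
        z.append(p_motif / (p_motif + p_bg))

    # M-step: expected base counts per column, normalised to probabilities.
    new_pwm = []
    for j in range(len(pwm)):
        counts = {base: pseudo for base in ALPHABET}
        for site, z_i in zip(sites, z):
            counts[site[j]] += z_i
        total = sum(counts.values())
        new_pwm.append({base: counts[base] / total for base in ALPHABET})

    new_lam = sum(z) / len(z)
    return new_pwm, new_lam, z
```

Both loops visit every site once, which is why each iteration is linear in the number of sites.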
PROS AND CONS
Pros:
  Pure probabilistic modeling
  EM is a well-known method
  The complexity of each iteration is linear
Cons:
  Each iteration examines all the sites (most of which are background sites)
  EM is sensitive to its starting condition
  The length of the motif is assumed to be given
SAMPLING IDEA (1)
Simple example: 20 As and 10 Bs
  AAAAAAAAAAAAAAAAAAAABBBBBBBBBB
Let's define a sampling function Q(x), with Q(x)=1 when x is sampled, e.g.:
  P(Q(A)=1)=0.1
  P(Q(B)=1)=0.2
The sampled data may be: AABB
We can recover the statistics of the original data from "AABB":
  2 As in sample / 0.1 = 20 As in original
  2 Bs in sample / 0.2 = 10 Bs in original
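The arithmetic above is an inverse-probability estimate: each sampled item counts for 1/P(Q(x)=1) originals. A small Python sketch, with hypothetical helper names `sample` and `estimate_counts`:

```python
import random
from collections import Counter

def sample(data, keep_prob, rng):
    """Keep each item x independently with probability keep_prob[x]."""
    return [x for x in data if rng.random() < keep_prob[x]]

def estimate_counts(sampled, keep_prob):
    """Estimate original counts by dividing sampled counts by the keep probability."""
    return {x: n / keep_prob[x] for x, n in Counter(sampled).items()}

keep_prob = {"A": 0.1, "B": 0.2}

# The sample from the slide: 2 As and 2 Bs.
print(estimate_counts("AABB", keep_prob))

# A fresh random sample from the original data (20 As, 10 Bs).
original = "A" * 20 + "B" * 10
print(estimate_counts(sample(original, keep_prob, random.Random(0)), keep_prob))
```

A single random sample can be far from 20 and 10, since only a handful of items are kept; the estimate is correct on average over repeated samples.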
SAMPLING IDEA (2)
Almost any sampling function allows the statistics of the original data to be recovered, provided every item has a known, nonzero probability of being sampled; this is known as "importance sampling".
We can define a good sampling function on the sequence data, one that prefers to sample binding sites over background sites.
Because the motif model has more parameters than the background model, it needs more samples to reach the same level of accuracy.
RE-SAMPLING EM
Sampling function Q(.), and sampled data XQ
E-step: the same as in the original EM
M-step:
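One way to read the re-sampling M-step, sketched in Python under the assumption that each sampled site is weighted by the inverse of its sampling probability, so that the weighted expected counts approximate the full-data counts (as in the A/B example). The name `weighted_m_step`, its arguments, and the pseudocount are hypothetical, and the form used by the author may differ.

```python
ALPHABET = "ACGT"

def weighted_m_step(sampled_sites, z, q, pseudo=0.01):
    """Re-estimate PWM columns from sampled sites only.

    sampled_sites: equal-length strings that survived sampling
    z: E-step posterior of each sampled site being a binding site
    q: probability with which each of those sites was sampled
    """
    width = len(sampled_sites[0])
    pwm = []
    for j in range(width):
        counts = {base: pseudo for base in ALPHABET}
        for site, z_i, q_i in zip(sampled_sites, z, q):
            # A site kept with probability q_i stands in for 1/q_i original sites.
            counts[site[j]] += z_i / q_i
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm
```

With all sampling probabilities equal to 1 this reduces to the ordinary M-step on the full data.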
RE-SAMPLING EM
How do we find a good sampling function?
Intuitively, the motif PWM is the natural choice for a sampling function, but we cannot know the motif PWM beforehand.
Fortunately, an approximate PWM model already does a good job in practice.
HOW TO FIND A GOOD APPROXIMATING PWM?
  Unknown length
  Unknown distribution
EXTENDING EM
Start from all over-represented 5-mers.
Similarly, for each 5-mer we find the motif model (PWM) containing it that maximizes the likelihood of the observed data.
We define an extending EM process that optimizes which flanking columns are included in the final PWM.
EXTENDING EM
Imagine we have a length-25 PWM Θ with the 5-mer q "ACTTG" in the middle (positions 11 to 15), which is wide enough for us to target any motif up to 15 bp (Wmax).
Pos   1     2     ...   10    11   12   13   14   15   16    ...   24    25
A     0.25  0.25  ...   0.25  1    0    0    0    0    0.25  ...   0.25  0.25
C     0.25  0.25  ...   0.25  0    1    0    0    0    0.25  ...   0.25  0.25
G     0.25  0.25  ...   0.25  0    0    0    0    1    0.25  ...   0.25  0.25
T     0.25  0.25  ...   0.25  0    0    1    1    0    0.25  ...   0.25  0.25
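The starting PWM above can be built programmatically. A minimal Python sketch, assuming the width is 2*Wmax minus the seed length (which gives 25 for Wmax = 15 and a 5-mer); `seed_pwm` is a hypothetical helper name.

```python
ALPHABET = "ACGT"

def seed_pwm(kmer, w_max):
    """Build a PWM with the k-mer fixed in the middle and uniform flanks.

    Each flank has w_max - len(kmer) columns, so a motif of width up to
    w_max that contains the k-mer fits wherever the k-mer sits inside it.
    """
    flank = w_max - len(kmer)
    uniform = {base: 0.25 for base in ALPHABET}
    left = [dict(uniform) for _ in range(flank)]
    core = [{base: 1.0 if base == c else 0.0 for base in ALPHABET} for c in kmer]
    right = [dict(uniform) for _ in range(flank)]
    return left + core + right

pwm = seed_pwm("ACTTG", 15)
print(len(pwm))  # number of columns
```

Columns are 0-indexed in the code, so the seed occupies `pwm[10]` through `pwm[14]`, matching positions 11 to 15 in the table.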
EXTENDING EM
We use two indices to track the start and end of the real motif PWM.
EXTENDING EM
The M-step is the same as in the original EM, but we also need to determine which columns should be included, based on the increase in log-likelihood obtained by including column j.
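One plausible form of the column score, sketched in Python: the gain is taken to be the expected log-likelihood ratio of modelling column j with its own base distribution instead of the background. This form, the `threshold` parameter, and the greedy outward extension are assumptions for illustration and may differ from the expression on the slide.

```python
import math

def column_gain(counts, bg):
    """Log-likelihood gain of a column's own distribution over the background.

    counts: expected base counts for the column (from the E-step posteriors)
    bg: background base probabilities
    """
    total = sum(counts.values())
    gain = 0.0
    for base, c in counts.items():
        if c > 0:
            gain += c * math.log((c / total) / bg[base])
    return gain

def extend(col_counts, bg, start, end, threshold):
    """Grow the [start, end] column range outward while the next column's gain exceeds threshold."""
    while start > 0 and column_gain(col_counts[start - 1], bg) > threshold:
        start -= 1
    while end < len(col_counts) - 1 and column_gain(col_counts[end + 1], bg) > threshold:
        end += 1
    return start, end
```

Because this gain is never negative, some positive threshold (or a penalty per added parameter) is needed to stop uninformative flanking columns from being included.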
CONSIDER OTHER FEATURES IN EM
Other features:
  Positional bias
  Strand bias
  Sequence rank bias
We integrate them into the mixture model:
  A new likelihood ratio
  A Boolean variable that determines whether each feature is included
CONSIDER OTHER FEATURES IN EM
If the feature data is modeled as a multinomial, a chi-square test is used to decide whether the feature should be included.
The multinomial parameters φ can also be learned in the M-step.
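A minimal Python sketch of both ideas, assuming the feature (for example, site position) is binned, the null hypothesis is a uniform distribution over bins, and the critical value is supplied by the caller. All function names, the uniform null, and the pseudocount are assumptions for illustration.

```python
def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic of bin counts against expected bin probabilities."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))

def include_feature(observed, critical_value):
    """Include the feature only if its bin counts deviate from a uniform distribution."""
    k = len(observed)
    return chi_square_statistic(observed, [1.0 / k] * k) > critical_value

def estimate_phi(weighted_counts, pseudo=1.0):
    """M-step style estimate of multinomial parameters from (possibly fractional) counts."""
    total = sum(weighted_counts) + pseudo * len(weighted_counts)
    return [(c + pseudo) / total for c in weighted_counts]

# Four position bins, 3 degrees of freedom; 7.815 is the 5% critical value.
centered = [10, 40, 40, 10]
print(include_feature(centered, 7.815))
print(estimate_phi(centered))
```

Here the counts would typically be sums of E-step posteriors per bin rather than integers, which is why `estimate_phi` accepts fractional counts.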
ALL TOGETHER
  PWM Model
  Position Prior Model
  Peak Rank Prior Model
SIMULATION RESULT
[Figure: AUC difference (AUC_SEME - AUC_MEME) plotted against the rank of the AUC difference. Inset panels: MEME Pax4 motif, SEME Pax4 motif, JASPAR Pax4 motif.]
SIMULATION RESULT
[Figure: Running Time Comparison of MEME, CUDA-MEME, and SEME. x-axis: Total Length of Input Sequences (bp); y-axis: Running Time (min).]
REAL DATA RESULT
  163 ChIP-seq datasets
  Compare 6 popular motif finders
  Half for training, half for testing
REAL DATA RESULT
[Figure: de novo AP1 model, de novo FOXA1 model, and de novo ER model.]
CONCLUSION
SEME can:
  perform EM on biased sampled data while still estimating parameters without bias
  vary the PWM size during the EM procedure, starting from a short 5-mer
  automatically learn and select other feature information during the EM iterations