Slide 1
Hidden Markov Models (HMMs)
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Oct. 27, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Slide 2
Motivation: the CpG island problem
• Methylation in the human genome
– “CG” -> “TG” happens in most places except the “start regions” of genes
– CpG islands = 100-1,000 bases before a gene starts
• Questions
– Q1: Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not?
– Q2: Given a long sequence, how would we find the CpG islands in it?
Slide 3
Answer to Q1: Bayes Classifier

Hypothesis space: H = {H_CpG, H_Other}; Evidence: X = “ATCGTTC”

P(H_CpG|X) = P(X|H_CpG) P(H_CpG) / P(X)
P(H_Other|X) = P(X|H_Other) P(H_Other) / P(X)

Compare the two posteriors (P(X) cancels):
P(H_CpG|X) / P(H_Other|X) = [P(X|H_CpG) P(H_CpG)] / [P(X|H_Other) P(H_Other)]

P(X|H): likelihood of evidence (generative model); P(H): prior probability

We need two generative models for sequences: p(X|H_CpG), p(X|H_Other)
Slide 4
A Simple Model for Sequences: p(X)

Probability rule (chain rule):
p(X) = p(X_1 X_2 ... X_n) = ∏_{i=1}^n p(X_i | X_1 ... X_{i-1})

Unigram (assume independence): p(X) = ∏_{i=1}^n p(X_i)
Bigram (capture some dependence): p(X) = ∏_{i=1}^n p(X_i | X_{i-1})

P(x|H_CpG): P(A|H_CpG)=0.25, P(T|H_CpG)=0.25, P(C|H_CpG)=0.25, P(G|H_CpG)=0.25
P(x|H_Other): P(A|H_Other)=0.25, P(T|H_Other)=0.40, P(C|H_Other)=0.10, P(G|H_Other)=0.25

X=ATTG vs. X=ATCG
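The two slides above can be sketched as a few lines of code. The probability tables are the ones on this slide; the equal priors and the function names are my own illustrative choices, not from the lecture.

```python
# Unigram models for the two hypotheses plus the Bayes decision rule.
import math

P_CPG = {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25}    # p(x | H_CpG)
P_OTHER = {"A": 0.25, "T": 0.40, "C": 0.10, "G": 0.25}  # p(x | H_Other)

def log_likelihood(x, table):
    """Unigram model: log p(X|H) = sum_i log p(X_i|H)."""
    return sum(math.log(table[c]) for c in x)

def classify(x, prior_cpg=0.5):
    """Pick the hypothesis with the larger log posterior (Bayes rule)."""
    score_cpg = log_likelihood(x, P_CPG) + math.log(prior_cpg)
    score_other = log_likelihood(x, P_OTHER) + math.log(1.0 - prior_cpg)
    return "CpG" if score_cpg > score_other else "Other"
```

On this slide's comparison, "ATCG" contains a C, which is rare under H_Other, so the CpG model wins; "ATTG" has two T's, which the Other model prefers.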
Slide 5
Answer to Q2: Hidden Markov Model

How can we identify a CpG island in a long sequence?
X=ATTGATGCAAAAGGGGGATCGGGCGATATAAAATTTG
(a CpG island in the middle, “Other” regions on both sides)

Idea 1: Test each window of a fixed number of nucleotides.
Idea 2: Classify the whole sequence by scoring candidate label sequences:
Class label S1: OOOO………….……O
Class label S2: OOOO…………. OCC…
Class label Si: OOOO…OCC..CO…O…
Class label SN: CCCC……………….CC

S* = argmax_S P(S|X) = argmax_S P(S,X)
e.g., S* = OOOO…OCC..CO…O (C = CpG island)
Slide 6
HMM is just one way of modeling p(X,S)…
Slide 7
A simple HMM

Parameters
Initial state prob: p(B)=0.5, p(I)=0.5
State transition prob: p(B→B)=0.8, p(B→I)=0.2, p(I→B)=0.5, p(I→I)=0.5
Output prob:
P(x|H_Other) = p(x|B): P(a|B)=0.25, P(t|B)=0.40, P(c|B)=0.10, P(g|B)=0.25
P(x|H_CpG) = p(x|I): P(a|I)=0.25, P(t|I)=0.25, P(c|I)=0.25, P(g|I)=0.25
Slide 8
A General Definition of HMM

An HMM is a tuple λ = (S, V, B, A, π):

N states: S = {s_1, ..., s_N}
M symbols: V = {v_1, ..., v_M}
Initial state probability: π = {π_i}, where π_i = prob of starting at state s_i and Σ_{i=1}^N π_i = 1
State transition probability: A = {a_ij}, 1 ≤ i, j ≤ N, where a_ij = prob of going s_i → s_j and Σ_{j=1}^N a_ij = 1
Output probability: B = {b_i(v_k)}, 1 ≤ i ≤ N, 1 ≤ k ≤ M, where b_i(v_k) = prob of generating v_k at s_i and Σ_{k=1}^M b_i(v_k) = 1
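The definition above transcribes directly into code. A minimal sketch: the class and field names are my own, not the lecture's, and the instance at the bottom is the simple B/I model from the previous slide.

```python
# Plain-Python container for lambda = (S, V, B, A, pi).
class HMM:
    def __init__(self, states, symbols, pi, A, B):
        self.states = states    # S = {s_1, ..., s_N}, N states
        self.symbols = symbols  # V = {v_1, ..., v_M}, M symbols
        self.pi = pi            # pi[i]: prob of starting at state s_i
        self.A = A              # A[i][j]: prob of going from s_i to s_j
        self.B = B              # B[i][k]: prob of generating v_k at s_i
        # The three stochasticity constraints from the definition:
        assert abs(sum(pi) - 1.0) < 1e-9
        assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
        assert all(abs(sum(row) - 1.0) < 1e-9 for row in B)

# The simple B/I model from the earlier slide:
hmm = HMM(states=["B", "I"], symbols=["a", "c", "g", "t"],
          pi=[0.5, 0.5],
          A=[[0.8, 0.2],
             [0.5, 0.5]],
          B=[[0.25, 0.10, 0.25, 0.40],   # B: p(a), p(c), p(g), p(t)
             [0.25, 0.25, 0.25, 0.25]])  # I: uniform
```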
Slide 9
How to “Generate” a Sequence?

model: the simple B/I HMM (P(B)=0.5, P(I)=0.5; transitions 0.8, 0.2, 0.5, 0.5; outputs P(a|B)=0.25, P(t|B)=0.40, P(c|B)=0.10, P(g|B)=0.25; P(a|I)=P(t|I)=P(c|I)=P(g|I)=0.25)

Given a model, follow a path to generate the observations:
states: B I B B I ... (or I I B B I ..., etc.)
Sequence: a c g t t ...
Slide 10
How to “Generate” a Sequence? (cont.)

Same model; one concrete path for the sequence “acgtt”:
states: B I I I B
Sequence: a c g t t

P(“BIIIB”, “acgtt”) = p(B)p(a|B) · p(I|B)p(c|I) · p(I|I)p(g|I) · p(I|I)p(t|I) · p(B|I)p(t|B)
= (0.5·0.25) · (0.2·0.25) · (0.5·0.25) · (0.5·0.25) · (0.5·0.40)
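Multiplying out the product on this slide is a one-liner per step. The numbers are the slide's; the dictionary layout and function name are my own.

```python
# p(path, sequence) for the simple B/I HMM, factored as on the slide.
pi = {"B": 0.5, "I": 0.5}
A = {("B", "B"): 0.8, ("B", "I"): 0.2, ("I", "B"): 0.5, ("I", "I"): 0.5}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}

def joint_prob(path, obs):
    """p(S, O) = p(S_1) p(o_1|S_1) * prod_t p(S_t|S_{t-1}) p(o_t|S_t)."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= A[(prev, cur)] * B[cur][o]
    return p

# 0.5*0.25 * 0.2*0.25 * 0.5*0.25 * 0.5*0.25 * 0.5*0.40 = 1.953125e-05
print(joint_prob("BIIIB", "acgtt"))
```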
Slide 11
HMM as a Probabilistic Model

Sequential data:
Time/Index: t1 t2 t3 t4 ...
Data: o1 o2 o3 o4 ...
Random variables/process:
Observation variables: O1 O2 O3 O4 ...
Hidden state variables: S1 S2 S3 S4 ...

Joint probability (complete likelihood):
p(O_1, O_2, ..., O_T, S_1, S_2, ..., S_T) = p(S_1)p(O_1|S_1) p(S_2|S_1)p(O_2|S_2) ... p(S_T|S_{T-1})p(O_T|S_T)
(init state distr. × output prob. × state trans. prob.)

State transition prob: p(S_1, S_2, ..., S_T) = p(S_1) p(S_2|S_1) ... p(S_T|S_{T-1})

Probability of observations with known state transitions:
p(O_1, O_2, ..., O_T | S_1, S_2, ..., S_T) = p(O_1|S_1) p(O_2|S_2) ... p(O_T|S_T)

Probability of observations (incomplete likelihood):
p(O_1, O_2, ..., O_T) = Σ_{S_1,...,S_T} p(O_1, O_2, ..., O_T, S_1, ..., S_T)
Slide 12
Three Problems

1. Decoding – finding the most likely path
Given: model, parameters, observations (data)
Compute: the most likely state sequence
S*_1 S*_2 ... S*_T = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T | O) = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T, O)

2. Evaluation – computing observation likelihood
Given: model, parameters, observations (data)
Compute: the likelihood of generating the observed data
p(O|λ) = Σ_{S_1 S_2 ... S_T} p(O | S_1 S_2 ... S_T) p(S_1 S_2 ... S_T)
Slide 13
Three Problems (cont.)

3. Training – estimating parameters
- Supervised
  Given: model architecture, labeled data (data + state sequence)
- Unsupervised
  Given: model architecture, unlabeled data

Maximum Likelihood: λ* = argmax_λ p(O|λ)
Slide 14
Problem I: Decoding/Parsing – Finding the most likely path
You can think of this as classification with all the paths as class labels…
Slide 15
What’s the most likely path?

Given an observed sequence (e.g., a c t t t a g g), the state that generated each symbol is unknown. We want:

S*_1 S*_2 ... S*_T = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T, O)
= argmax_{S_1 S_2 ... S_T} π_{S_1} b_{S_1}(o_1) ∏_{i=2}^T a_{S_{i-1} S_i} b_{S_i}(o_i)

(model: the simple B/I HMM with P(B)=0.5, P(I)=0.5; transitions 0.8, 0.2, 0.5, 0.5; outputs P(a|B)=0.25, P(t|B)=0.40, P(c|B)=0.10, P(g|B)=0.25; P(a|I)=P(t|I)=P(c|I)=P(g|I)=0.25)
Slide 16
Viterbi Algorithm: An Example

Model: states B, I with P(B)=0.5, P(I)=0.5; transitions a_BB=0.8, a_BI=0.2, a_IB=0.5, a_II=0.5
Output prob (this example): P(a|B)=0.251, P(t|B)=0.40, P(c|B)=0.098, P(g|B)=0.251; P(a|I)=P(t|I)=P(c|I)=P(g|I)=0.25
Observed sequence: a c g t ...

t = 1, 2, 3, 4, ...
VP(B): 0.5*0.251 (B); 0.5*0.251*0.8*0.098 (BB); ...
VP(I): 0.5*0.25 (I); 0.5*0.25*0.5*0.25 (II); ...

Remember the best paths so far.
Slide 17
Viterbi Algorithm

Observation:
max_{S_1 ... S_T} p(o_1 ... o_T, S_1 ... S_T) = max_{s_i} [ max_{S_1 ... S_{T-1}} p(o_1 ... o_T, S_1 ... S_{T-1}, S_T = s_i) ]

Define:
VP_t(i) = max_{S_1 ... S_{t-1}} p(o_1 ... o_t, S_1 ... S_{t-1}, S_t = s_i)
q_t(i) = [argmax_{S_1 ... S_{t-1}} p(o_1 ... o_t, S_1 ... S_{t-1}, S_t = s_i)] followed by (i), i.e., the best partial path ending at s_i

Algorithm (dynamic programming):
1. VP_1(i) = π_i b_i(o_1), q_1(i) = (i), for i = 1, ..., N
2. For 1 < t ≤ T: VP_t(i) = max_{1≤j≤N} VP_{t-1}(j) a_ji b_i(o_t),
   q_t(i) = q_{t-1}(k) followed by (i), where k = argmax_{1≤j≤N} VP_{t-1}(j) a_ji b_i(o_t), for i = 1, ..., N
3. The best path is q_T(i*), where i* = argmax_i VP_T(i)

Complexity: O(TN2)
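The three steps above can be sketched in plain Python. This uses the simple B/I model from the earlier slides (outputs 0.25/0.10/0.25/0.40 for B, uniform for I, not the 0.251/0.098 variant of the example slide); the function and variable names are mine.

```python
# Viterbi decoding on the two-state B/I model.
pi = {"B": 0.5, "I": 0.5}
A = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}
STATES = ["B", "I"]

def viterbi(obs):
    # Step 1: VP_1(i) = pi_i * b_i(o_1); q_1(i) = (i)
    VP = {i: pi[i] * B[i][obs[0]] for i in STATES}
    q = {i: [i] for i in STATES}
    # Step 2: VP_t(i) = max_j VP_{t-1}(j) * a_ji * b_i(o_t)
    for o in obs[1:]:
        new_VP, new_q = {}, {}
        for i in STATES:
            k = max(STATES, key=lambda j: VP[j] * A[j][i])
            new_VP[i] = VP[k] * A[k][i] * B[i][o]
            new_q[i] = q[k] + [i]        # remember the best path so far
        VP, q = new_VP, new_q
    # Step 3: the best path is q_T(i*) for i* = argmax_i VP_T(i)
    best = max(STATES, key=lambda i: VP[i])
    return q[best], VP[best]

path, prob = viterbi("acgt")   # -> (['B', 'B', 'B', 'B'], ~0.00064)
```

Only the N best partial paths are kept per time step, which is what brings the cost down from O(N^T) paths to O(TN^2).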
Slide 18
Problem II: Evaluation – Computing the data likelihood
• Another use of an HMM, e.g., as a generative model for discrimination
• Also related to Problem III – parameter estimation
Slide 19
Data Likelihood: p(O|λ)

p(“acgt...”|λ) = p(“acgt...”|BBB...) p(BBB...)
              + p(“acgt...”|BIB...) p(BIB...)
              + ... + p(“acgt...”|III...) p(III...)

In general,
p(O|λ) = Σ_{S_1 S_2 ... S_T} p(O | S_1 S_2 ... S_T) p(S_1 S_2 ... S_T)
(λ = all HMM parameters)

Complexity of a naïve approach? (The sum ranges over all N^T possible state sequences.)
Slide 20
The Forward Algorithm

Define α_t(i) = p(o_1 ... o_t, S_t = s_i | λ): the probability of generating o_1 ... o_t with ending state s_i.

Observation:
p(o_1 ... o_T | λ) = Σ_{i=1}^N p(o_1 ... o_T, S_T = s_i) = Σ_{i=1}^N α_T(i)

α_t(i) = p(o_1 ... o_t, S_t = s_i)
       = Σ_{S_1 ... S_{t-1}} p(o_1 ... o_{t-1}, S_1 ... S_{t-1}) p(S_t = s_i | S_{t-1}) p(o_t | S_t = s_i)
       = [Σ_{j=1}^N α_{t-1}(j) a_ji] b_i(o_t)

Algorithm:
α_1(i) = π_i b_i(o_1)
α_t(i) = [Σ_{j=1}^N α_{t-1}(j) a_ji] b_i(o_t), for t = 2, ..., T

The data likelihood is p(o_1 ... o_T | λ) = Σ_{i=1}^N α_T(i)

Complexity: O(TN2)
Slide 21
Forward Algorithm: Example

Model: the simple B/I HMM; observed sequence: a c g t

t = 1, 2, 3, 4
α_1(B) = 0.5*p(“a”|B); α_1(I) = 0.5*p(“a”|I)
α_2(B) = [α_1(B)*0.8 + α_1(I)*0.5]*p(“c”|B)
α_2(I) = [α_1(B)*0.2 + α_1(I)*0.5]*p(“c”|I)
...

α_t(i) = [Σ_j α_{t-1}(j) a_ji] b_i(o_t);  p(o_1 ... o_T | λ) = Σ_i α_T(i)

P(“a c g t”) = α_4(B) + α_4(I)
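The recursion on this slide can be sketched directly; same B/I model as before, with my own names.

```python
# Forward algorithm on the two-state B/I model.
pi = {"B": 0.5, "I": 0.5}
A = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}
STATES = ["B", "I"]

def forward(obs):
    """Return [alpha_1, ..., alpha_T]; each alpha_t maps state i -> alpha_t(i)."""
    alphas = [{i: pi[i] * B[i][obs[0]] for i in STATES}]  # alpha_1(i) = pi_i b_i(o_1)
    for o in obs[1:]:
        prev = alphas[-1]
        # alpha_t(i) = [sum_j alpha_{t-1}(j) * a_ji] * b_i(o_t)
        alphas.append({i: sum(prev[j] * A[j][i] for j in STATES) * B[i][o]
                       for i in STATES})
    return alphas

alphas = forward("acgt")
likelihood = sum(alphas[-1].values())   # P("acgt") = alpha_4(B) + alpha_4(I)
```

Hand-checking against the slide: alpha_2(B) = (0.125*0.8 + 0.125*0.5)*0.10 = 0.01625.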
Slide 22
The Backward Algorithm

Define β_t(i) = p(o_{t+1} ... o_T | S_t = s_i, λ): the probability of generating o_{t+1} ... o_T starting from state s_i at time t (o_1 ... o_t already generated).

Observation:
p(o_1 ... o_T | λ) = Σ_{i=1}^N π_i b_i(o_1) p(o_2 ... o_T | S_1 = s_i) = Σ_{i=1}^N π_i b_i(o_1) β_1(i)

β_t(i) = p(o_{t+1} ... o_T | S_t = s_i)
       = Σ_{j} p(S_{t+1} = s_j | S_t = s_i) p(o_{t+1} | S_{t+1} = s_j) p(o_{t+2} ... o_T | S_{t+1} = s_j)
       = Σ_{j=1}^N a_ij b_j(o_{t+1}) β_{t+1}(j)

Algorithm:
β_T(i) = 1
β_t(i) = Σ_{j=1}^N a_ij b_j(o_{t+1}) β_{t+1}(j), for t = T-1, ..., 1

The data likelihood is
p(o_1 ... o_T | λ) = Σ_{i=1}^N π_i b_i(o_1) β_1(i) = Σ_{i=1}^N α_t(i) β_t(i), for any t

Complexity: O(TN2)
Slide 23
Backward Algorithm: Example

Model: the simple B/I HMM; observed sequence: a c g t

t = 1, 2, 3, 4
β_4(B) = 1; β_4(I) = 1
β_3(B) = 0.8*p(“t”|B)*β_4(B) + 0.2*p(“t”|I)*β_4(I)
β_3(I) = 0.5*p(“t”|B)*β_4(B) + 0.5*p(“t”|I)*β_4(I)
...

P(“a c g t”) = α_1(B)*β_1(B) + α_1(I)*β_1(I) = α_2(B)*β_2(B) + α_2(I)*β_2(I)
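A matching sketch of the backward pass, on the same B/I model with my own names; the final line checks the identity p(O) = Σ_i π_i b_i(o_1) β_1(i), which must reproduce the forward algorithm's value.

```python
# Backward algorithm on the two-state B/I model.
pi = {"B": 0.5, "I": 0.5}
A = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}
STATES = ["B", "I"]

def backward(obs):
    """Return [beta_1, ..., beta_T]; beta_T(i) = 1 for all i."""
    betas = [{i: 1.0 for i in STATES}]
    for o in reversed(obs[1:]):          # o plays the role of o_{t+1}
        nxt = betas[0]
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        betas.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in STATES)
                         for i in STATES})
    return betas

obs = "acgt"
betas = backward(obs)
likelihood = sum(pi[i] * B[i][obs[0]] * betas[0][i] for i in STATES)
```

Hand-checking against the slide: beta_3(B) = 0.8*0.40*1 + 0.2*0.25*1 = 0.37, and the likelihood agrees with the forward example.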
Slide 24
Problem III: Training – Estimating Parameters
Where do we get the probability values for all parameters?
Supervised vs. Unsupervised
Slide 25
Supervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V={a,b}
3. O – observations, e.g., O=aaaaabbbbb
4. State transitions, e.g., S=1121122222

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

Estimates by counting:
π1 = 1/1 = 1; π2 = 0/1 = 0
a11 = 2/4 = 0.5; a12 = 2/4 = 0.5; a21 = 1/5 = 0.2; a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0; b1(b) = 0/4 = 0; b2(a) = 1/6 = 0.167; b2(b) = 5/6 = 0.833

i.e., P(s1)=1, P(s2)=0; P(a|s1)=1, P(b|s1)=0; P(a|s2)=0.167, P(b|s2)=0.833
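The counts on this slide can be computed mechanically; the helper name and dictionary layout are my own.

```python
# ML estimation by counting when both O and the state sequence are given.
from collections import Counter

def supervised_estimate(obs, states):
    trans = Counter(zip(states, states[1:]))   # counts of i -> j transitions
    emit = Counter(zip(states, obs))           # counts of state i emitting v
    n_from = Counter(states[:-1])              # times each state is left
    n_at = Counter(states)                     # times each state emits a symbol
    state_set = sorted(set(states))
    # One training sequence, so pi puts all mass on the first state:
    pi = {s: (1.0 if s == states[0] else 0.0) for s in state_set}
    a = {(i, j): trans[(i, j)] / n_from[i] for i in state_set for j in state_set}
    b = {(i, v): emit[(i, v)] / n_at[i] for i in state_set for v in sorted(set(obs))}
    return pi, a, b

pi, a, b = supervised_estimate("aaaaabbbbb", "1121122222")
# a[("1","1")] = 2/4, a[("2","2")] = 4/5, b[("2","a")] = 1/6, ...
```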
Slide 26
Unsupervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V={a,b}
3. O – observations, e.g., O=aaaaabbbbb
(The state sequence is NOT given.)

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

How could this be possible?
Maximum Likelihood: λ* = argmax_λ p(O|λ)
Slide 27
Intuition

With O=aaaaabbbbb, enumerate all possible state sequences:
q1=1111111111, q2=1111111221, ..., qK=2222222222
and compute P(O, q1|λ), P(O, q2|λ), ..., P(O, qK|λ).

Re-estimate (new λ'), weighting each sequence by its probability
(here [condition] is 1 if the condition holds and 0 otherwise):

π_i = Σ_{k=1}^K p(O, q_k|λ) [q_k(1) = i] / Σ_{k=1}^K p(O, q_k|λ)

a_ij = Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T-1} [q_k(t) = i, q_k(t+1) = j] / Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T-1} [q_k(t) = i]

b_i(v_j) = Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T} [q_k(t) = i, o_t = v_j] / Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T} [q_k(t) = i]

Computation of P(O, q_k|λ) for every q_k is expensive …
Slide 28
Baum-Welch Algorithm

Basic “counters”:
γ_t(i) = p(q_t = s_i | O, λ) — being at state s_i at time t
ξ_t(i, j) = p(q_t = s_i, q_{t+1} = s_j | O, λ) — being at state s_i at time t and at state s_j at time t+1

Computation of counters:
γ_t(i) = α_t(i) β_t(i) / Σ_{j=1}^N α_t(j) β_t(j)
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{j=1}^N α_t(j) β_t(j)

Complexity: O(N2)
Slide 29
Baum-Welch Algorithm (cont.)

Updating formulas:

π_i' = γ_1(i)

a_ij' = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{j'=1}^N ξ_t(i, j')

b_i(v_k)' = Σ_{t=1}^{T} γ_t(i) [o_t = v_k] / Σ_{t=1}^{T} γ_t(i)

Overall complexity for each iteration: O(TN2)
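One full iteration of the counters and updating formulas above can be sketched in plain Python, on the two-state a/b example from the training slides. The starting parameter values below are arbitrary guesses of mine; a real run would iterate until the likelihood converges.

```python
# One Baum-Welch (EM) iteration: forward, backward, counters, updates.
def forward(obs, pi, A, B, S):
    al = [{i: pi[i] * B[i][obs[0]] for i in S}]
    for o in obs[1:]:
        al.append({i: sum(al[-1][j] * A[j][i] for j in S) * B[i][o] for i in S})
    return al

def backward(obs, pi, A, B, S):
    be = [{i: 1.0 for i in S}]
    for o in reversed(obs[1:]):
        be.insert(0, {i: sum(A[i][j] * B[j][o] * be[0][j] for j in S) for i in S})
    return be

def baum_welch_step(obs, pi, A, B, S):
    T = len(obs)
    al, be = forward(obs, pi, A, B, S), backward(obs, pi, A, B, S)
    # gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)
    gamma = []
    for t in range(T):
        z = sum(al[t][j] * be[t][j] for j in S)
        gamma.append({i: al[t][i] * be[t][i] / z for i in S})
    # xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j), normalized
    xi = []
    for t in range(T - 1):
        x = {(i, j): al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j]
             for i in S for j in S}
        z = sum(x.values())
        xi.append({k: v / z for k, v in x.items()})
    # Updating formulas from the slide:
    new_pi = {i: gamma[0][i] for i in S}
    new_A = {i: {j: sum(x[(i, j)] for x in xi) /
                    sum(x[(i, j2)] for x in xi for j2 in S)
                 for j in S} for i in S}
    new_B = {i: {v: sum(g[i] for g, o in zip(gamma, obs) if o == v) /
                    sum(g[i] for g in gamma)
                 for v in set(obs)} for i in S}
    return new_pi, new_A, new_B

S = ["1", "2"]
obs = "aaaaabbbbb"
pi0 = {"1": 0.6, "2": 0.4}
A0 = {"1": {"1": 0.7, "2": 0.3}, "2": {"1": 0.4, "2": 0.6}}
B0 = {"1": {"a": 0.8, "b": 0.2}, "2": {"a": 0.3, "b": 0.7}}
L0 = sum(forward(obs, pi0, A0, B0, S)[-1].values())
pi1, A1, B1 = baum_welch_step(obs, pi0, A0, B0, S)
L1 = sum(forward(obs, pi1, A1, B1, S)[-1].values())   # EM: L1 >= L0
```

Each iteration is one E-step (the γ and ξ counters) and one M-step (the updates), and EM guarantees the data likelihood never decreases.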
Slide 30
What You Should Know
• Definition of an HMM and parameters of an HMM
• Viterbi Algorithm
• Forward/Backward algorithms
• Estimate parameters of an HMM in a supervised way