Slide 1
Hidden Markov Models (HMMs)
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Oct. 27, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Slide 2
Motivation: the CpG island problem
• Methylation in the human genome
– “CG” -> “TG” happens in most places except the “start regions” of genes
– CpG islands = 100-1,000 bases before a gene starts
• Questions
– Q1: Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not?
– Q2: Given a long sequence, how would we find the CpG islands in it?
Slide 3
Answer to Q1: Bayes Classifier

Hypothesis space: H = {H_CpG, H_Other}; Evidence: X = “ATCGTTC”

P(H_CpG|X) = P(X|H_CpG) P(H_CpG) / P(X)
P(H_Other|X) = P(X|H_Other) P(H_Other) / P(X)

Compare the two posteriors (P(X) cancels):
P(H_CpG|X) / P(H_Other|X) = [P(X|H_CpG) P(H_CpG)] / [P(X|H_Other) P(H_Other)]

P(X|H): likelihood of evidence (generative model); P(H): prior probability

We need two generative models for sequences: p(X|H_CpG), p(X|H_Other)
Slide 4
A Simple Model for Sequences: p(X)

Probability rule (chain rule):
p(X) = p(X_1 X_2 ... X_n) = ∏_{i=1}^n p(X_i | X_1 ... X_{i-1})

Unigram (assume independence): p(X) = ∏_{i=1}^n p(X_i)
Bigram (capture some dependence): p(X) = ∏_{i=1}^n p(X_i | X_{i-1})

P(x|H_CpG): P(A|H_CpG)=0.25, P(T|H_CpG)=0.25, P(C|H_CpG)=0.25, P(G|H_CpG)=0.25
P(x|H_Other): P(A|H_Other)=0.25, P(T|H_Other)=0.40, P(C|H_Other)=0.10, P(G|H_Other)=0.25

X=ATTG vs. X=ATCG
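The two slides above can be sketched as a few lines of code. The probability tables are the ones on this slide; the equal priors and the function names are my own illustrative choices, not from the lecture.

```python
# Unigram models for the two hypotheses plus the Bayes decision rule.
import math

P_CPG = {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25}    # p(x | H_CpG)
P_OTHER = {"A": 0.25, "T": 0.40, "C": 0.10, "G": 0.25}  # p(x | H_Other)

def log_likelihood(x, table):
    """Unigram model: log p(X|H) = sum_i log p(X_i|H)."""
    return sum(math.log(table[c]) for c in x)

def classify(x, prior_cpg=0.5):
    """Pick the hypothesis with the larger log posterior (Bayes rule)."""
    score_cpg = log_likelihood(x, P_CPG) + math.log(prior_cpg)
    score_other = log_likelihood(x, P_OTHER) + math.log(1.0 - prior_cpg)
    return "CpG" if score_cpg > score_other else "Other"
```

On this slide's comparison, "ATCG" contains a C, which is rare under H_Other, so the CpG model wins; "ATTG" has two T's, which the Other model prefers.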
Slide 5
Answer to Q2: Hidden Markov Model

How can we identify a CpG island in a long sequence?
X=ATTGATGCAAAAGGGGGATCGGGCGATATAAAATTTG
(a CpG island in the middle, “Other” regions on both sides)

Idea 1: Test each window of a fixed number of nucleotides.
Idea 2: Classify the whole sequence by scoring candidate label sequences:
Class label S1: OOOO………….……O
Class label S2: OOOO…………. OCC…
Class label Si: OOOO…OCC..CO…O…
Class label SN: CCCC……………….CC

S* = argmax_S P(S|X) = argmax_S P(S,X)
e.g., S* = OOOO…OCC..CO…O (C = CpG island)
Slide 6
HMM is just one way of modeling p(X,S)…
Slide 7
A simple HMM

Parameters
Initial state prob: p(B)=0.5, p(I)=0.5
State transition prob: p(B→B)=0.8, p(B→I)=0.2, p(I→B)=0.5, p(I→I)=0.5
Output prob:
P(x|H_Other) = p(x|B): P(a|B)=0.25, P(t|B)=0.40, P(c|B)=0.10, P(g|B)=0.25
P(x|H_CpG) = p(x|I): P(a|I)=0.25, P(t|I)=0.25, P(c|I)=0.25, P(g|I)=0.25
Slide 8
A General Definition of HMM

An HMM is a tuple λ = (S, V, B, A, π):

N states: S = {s_1, ..., s_N}
M symbols: V = {v_1, ..., v_M}
Initial state probability: π = {π_i}, where π_i = prob of starting at state s_i and Σ_{i=1}^N π_i = 1
State transition probability: A = {a_ij}, 1 ≤ i, j ≤ N, where a_ij = prob of going s_i → s_j and Σ_{j=1}^N a_ij = 1
Output probability: B = {b_i(v_k)}, 1 ≤ i ≤ N, 1 ≤ k ≤ M, where b_i(v_k) = prob of generating v_k at s_i and Σ_{k=1}^M b_i(v_k) = 1
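The definition above transcribes directly into code. A minimal sketch: the class and field names are my own, not the lecture's, and the instance at the bottom is the simple B/I model from the previous slide.

```python
# Plain-Python container for lambda = (S, V, B, A, pi).
class HMM:
    def __init__(self, states, symbols, pi, A, B):
        self.states = states    # S = {s_1, ..., s_N}, N states
        self.symbols = symbols  # V = {v_1, ..., v_M}, M symbols
        self.pi = pi            # pi[i]: prob of starting at state s_i
        self.A = A              # A[i][j]: prob of going from s_i to s_j
        self.B = B              # B[i][k]: prob of generating v_k at s_i
        # The three stochasticity constraints from the definition:
        assert abs(sum(pi) - 1.0) < 1e-9
        assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
        assert all(abs(sum(row) - 1.0) < 1e-9 for row in B)

# The simple B/I model from the earlier slide:
hmm = HMM(states=["B", "I"], symbols=["a", "c", "g", "t"],
          pi=[0.5, 0.5],
          A=[[0.8, 0.2],
             [0.5, 0.5]],
          B=[[0.25, 0.10, 0.25, 0.40],   # B: p(a), p(c), p(g), p(t)
             [0.25, 0.25, 0.25, 0.25]])  # I: uniform
```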
Slide 9
How to “Generate” a Sequence?

model: the simple B/I HMM (P(B)=0.5, P(I)=0.5; transitions 0.8, 0.2, 0.5, 0.5; outputs P(a|B)=0.25, P(t|B)=0.40, P(c|B)=0.10, P(g|B)=0.25; P(a|I)=P(t|I)=P(c|I)=P(g|I)=0.25)

Given a model, follow a path to generate the observations:
states: B I B B I ... (or I I B B I ..., etc.)
Sequence: a c g t t ...
Slide 10
How to “Generate” a Sequence? (cont.)

Same model; one concrete path for the sequence “acgtt”:
states: B I I I B
Sequence: a c g t t

P(“BIIIB”, “acgtt”) = p(B)p(a|B) · p(I|B)p(c|I) · p(I|I)p(g|I) · p(I|I)p(t|I) · p(B|I)p(t|B)
= (0.5·0.25) · (0.2·0.25) · (0.5·0.25) · (0.5·0.25) · (0.5·0.40)
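Multiplying out the product on this slide is a one-liner per step. The numbers are the slide's; the dictionary layout and function name are my own.

```python
# p(path, sequence) for the simple B/I HMM, factored as on the slide.
pi = {"B": 0.5, "I": 0.5}
A = {("B", "B"): 0.8, ("B", "I"): 0.2, ("I", "B"): 0.5, ("I", "I"): 0.5}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}

def joint_prob(path, obs):
    """p(S, O) = p(S_1) p(o_1|S_1) * prod_t p(S_t|S_{t-1}) p(o_t|S_t)."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= A[(prev, cur)] * B[cur][o]
    return p

# 0.5*0.25 * 0.2*0.25 * 0.5*0.25 * 0.5*0.25 * 0.5*0.40 = 1.953125e-05
print(joint_prob("BIIIB", "acgtt"))
```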
Slide 11
HMM as a Probabilistic Model

Sequential data:
Time/Index: t1 t2 t3 t4 ...
Data: o1 o2 o3 o4 ...
Random variables/process:
Observation variables: O1 O2 O3 O4 ...
Hidden state variables: S1 S2 S3 S4 ...

Joint probability (complete likelihood):
p(O_1, O_2, ..., O_T, S_1, S_2, ..., S_T) = p(S_1)p(O_1|S_1) p(S_2|S_1)p(O_2|S_2) ... p(S_T|S_{T-1})p(O_T|S_T)
(init state distr. × output prob. × state trans. prob.)

State transition prob: p(S_1, S_2, ..., S_T) = p(S_1) p(S_2|S_1) ... p(S_T|S_{T-1})

Probability of observations with known state transitions:
p(O_1, O_2, ..., O_T | S_1, S_2, ..., S_T) = p(O_1|S_1) p(O_2|S_2) ... p(O_T|S_T)

Probability of observations (incomplete likelihood):
p(O_1, O_2, ..., O_T) = Σ_{S_1,...,S_T} p(O_1, O_2, ..., O_T, S_1, ..., S_T)
Slide 12
Three Problems

1. Decoding – finding the most likely path
Given: model, parameters, observations (data)
Compute: the most likely state sequence
S*_1 S*_2 ... S*_T = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T | O) = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T, O)

2. Evaluation – computing observation likelihood
Given: model, parameters, observations (data)
Compute: the likelihood of generating the observed data
p(O|λ) = Σ_{S_1 S_2 ... S_T} p(O | S_1 S_2 ... S_T) p(S_1 S_2 ... S_T)
Slide 13
Three Problems (cont.)

3. Training – estimating parameters
- Supervised
  Given: model architecture, labeled data (data + state sequence)
- Unsupervised
  Given: model architecture, unlabeled data

Maximum Likelihood: λ* = argmax_λ p(O|λ)
Slide 14
Problem I: Decoding/Parsing – Finding the most likely path
You can think of this as classification with all the paths as class labels…
Slide 15
What’s the most likely path?

Given an observed sequence (e.g., a c t t t a g g), the state that generated each symbol is unknown. We want:

S*_1 S*_2 ... S*_T = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T, O)
= argmax_{S_1 S_2 ... S_T} π_{S_1} b_{S_1}(o_1) ∏_{i=2}^T a_{S_{i-1} S_i} b_{S_i}(o_i)

(model: the simple B/I HMM with P(B)=0.5, P(I)=0.5; transitions 0.8, 0.2, 0.5, 0.5; outputs P(a|B)=0.25, P(t|B)=0.40, P(c|B)=0.10, P(g|B)=0.25; P(a|I)=P(t|I)=P(c|I)=P(g|I)=0.25)
Slide 16
Viterbi Algorithm: An Example

Model: states B, I with P(B)=0.5, P(I)=0.5; transitions a_BB=0.8, a_BI=0.2, a_IB=0.5, a_II=0.5
Output prob (this example): P(a|B)=0.251, P(t|B)=0.40, P(c|B)=0.098, P(g|B)=0.251; P(a|I)=P(t|I)=P(c|I)=P(g|I)=0.25
Observed sequence: a c g t ...

t = 1, 2, 3, 4, ...
VP(B): 0.5*0.251 (B); 0.5*0.251*0.8*0.098 (BB); ...
VP(I): 0.5*0.25 (I); 0.5*0.25*0.5*0.25 (II); ...

Remember the best paths so far.
Slide 17
Viterbi Algorithm

Observation:
max_{S_1 ... S_T} p(o_1 ... o_T, S_1 ... S_T) = max_{s_i} [ max_{S_1 ... S_{T-1}} p(o_1 ... o_T, S_1 ... S_{T-1}, S_T = s_i) ]

Define:
VP_t(i) = max_{S_1 ... S_{t-1}} p(o_1 ... o_t, S_1 ... S_{t-1}, S_t = s_i)
q_t(i) = [argmax_{S_1 ... S_{t-1}} p(o_1 ... o_t, S_1 ... S_{t-1}, S_t = s_i)] followed by (i), i.e., the best partial path ending at s_i

Algorithm (dynamic programming):
1. VP_1(i) = π_i b_i(o_1), q_1(i) = (i), for i = 1, ..., N
2. For 1 < t ≤ T: VP_t(i) = max_{1≤j≤N} VP_{t-1}(j) a_ji b_i(o_t),
   q_t(i) = q_{t-1}(k) followed by (i), where k = argmax_{1≤j≤N} VP_{t-1}(j) a_ji b_i(o_t), for i = 1, ..., N
3. The best path is q_T(i*), where i* = argmax_i VP_T(i)

Complexity: O(TN2)
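The three steps above can be sketched in plain Python. This uses the simple B/I model from the earlier slides (outputs 0.25/0.10/0.25/0.40 for B, uniform for I, not the 0.251/0.098 variant of the example slide); the function and variable names are mine.

```python
# Viterbi decoding on the two-state B/I model.
pi = {"B": 0.5, "I": 0.5}
A = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}
STATES = ["B", "I"]

def viterbi(obs):
    # Step 1: VP_1(i) = pi_i * b_i(o_1); q_1(i) = (i)
    VP = {i: pi[i] * B[i][obs[0]] for i in STATES}
    q = {i: [i] for i in STATES}
    # Step 2: VP_t(i) = max_j VP_{t-1}(j) * a_ji * b_i(o_t)
    for o in obs[1:]:
        new_VP, new_q = {}, {}
        for i in STATES:
            k = max(STATES, key=lambda j: VP[j] * A[j][i])
            new_VP[i] = VP[k] * A[k][i] * B[i][o]
            new_q[i] = q[k] + [i]        # remember the best path so far
        VP, q = new_VP, new_q
    # Step 3: the best path is q_T(i*) for i* = argmax_i VP_T(i)
    best = max(STATES, key=lambda i: VP[i])
    return q[best], VP[best]

path, prob = viterbi("acgt")   # -> (['B', 'B', 'B', 'B'], ~0.00064)
```

Only the N best partial paths are kept per time step, which is what brings the cost down from O(N^T) paths to O(TN^2).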
Slide 18
Problem II: Evaluation – Computing the data likelihood
• Another use of an HMM, e.g., as a generative model for discrimination
• Also related to Problem III – parameter estimation
Slide 19
Data Likelihood: p(O|λ)

p(“acgt...”|λ) = p(“acgt...”|BBB...) p(BBB...)
              + p(“acgt...”|BIB...) p(BIB...)
              + ... + p(“acgt...”|III...) p(III...)

In general,
p(O|λ) = Σ_{S_1 S_2 ... S_T} p(O | S_1 S_2 ... S_T) p(S_1 S_2 ... S_T)
(λ = all HMM parameters)

Complexity of a naïve approach? (The sum ranges over all N^T possible state sequences.)
Slide 20
The Forward Algorithm

Define α_t(i) = p(o_1 ... o_t, S_t = s_i | λ): the probability of generating o_1 ... o_t with ending state s_i.

Observation:
p(o_1 ... o_T | λ) = Σ_{i=1}^N p(o_1 ... o_T, S_T = s_i) = Σ_{i=1}^N α_T(i)

α_t(i) = p(o_1 ... o_t, S_t = s_i)
       = Σ_{S_1 ... S_{t-1}} p(o_1 ... o_{t-1}, S_1 ... S_{t-1}) p(S_t = s_i | S_{t-1}) p(o_t | S_t = s_i)
       = [Σ_{j=1}^N α_{t-1}(j) a_ji] b_i(o_t)

Algorithm:
α_1(i) = π_i b_i(o_1)
α_t(i) = [Σ_{j=1}^N α_{t-1}(j) a_ji] b_i(o_t), for t = 2, ..., T

The data likelihood is p(o_1 ... o_T | λ) = Σ_{i=1}^N α_T(i)

Complexity: O(TN2)
Slide 21
Forward Algorithm: Example

Model: the simple B/I HMM; observed sequence: a c g t

t = 1, 2, 3, 4
α_1(B) = 0.5*p(“a”|B); α_1(I) = 0.5*p(“a”|I)
α_2(B) = [α_1(B)*0.8 + α_1(I)*0.5]*p(“c”|B)
α_2(I) = [α_1(B)*0.2 + α_1(I)*0.5]*p(“c”|I)
...

α_t(i) = [Σ_j α_{t-1}(j) a_ji] b_i(o_t);  p(o_1 ... o_T | λ) = Σ_i α_T(i)

P(“a c g t”) = α_4(B) + α_4(I)
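The recursion on this slide can be sketched directly; same B/I model as before, with my own names.

```python
# Forward algorithm on the two-state B/I model.
pi = {"B": 0.5, "I": 0.5}
A = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}
STATES = ["B", "I"]

def forward(obs):
    """Return [alpha_1, ..., alpha_T]; each alpha_t maps state i -> alpha_t(i)."""
    alphas = [{i: pi[i] * B[i][obs[0]] for i in STATES}]  # alpha_1(i) = pi_i b_i(o_1)
    for o in obs[1:]:
        prev = alphas[-1]
        # alpha_t(i) = [sum_j alpha_{t-1}(j) * a_ji] * b_i(o_t)
        alphas.append({i: sum(prev[j] * A[j][i] for j in STATES) * B[i][o]
                       for i in STATES})
    return alphas

alphas = forward("acgt")
likelihood = sum(alphas[-1].values())   # P("acgt") = alpha_4(B) + alpha_4(I)
```

Hand-checking against the slide: alpha_2(B) = (0.125*0.8 + 0.125*0.5)*0.10 = 0.01625.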
Slide 22
The Backward Algorithm

Define β_t(i) = p(o_{t+1} ... o_T | S_t = s_i, λ): the probability of generating o_{t+1} ... o_T starting from state s_i at time t (o_1 ... o_t already generated).

Observation:
p(o_1 ... o_T | λ) = Σ_{i=1}^N π_i b_i(o_1) p(o_2 ... o_T | S_1 = s_i) = Σ_{i=1}^N π_i b_i(o_1) β_1(i)

β_t(i) = p(o_{t+1} ... o_T | S_t = s_i)
       = Σ_{j} p(S_{t+1} = s_j | S_t = s_i) p(o_{t+1} | S_{t+1} = s_j) p(o_{t+2} ... o_T | S_{t+1} = s_j)
       = Σ_{j=1}^N a_ij b_j(o_{t+1}) β_{t+1}(j)

Algorithm:
β_T(i) = 1
β_t(i) = Σ_{j=1}^N a_ij b_j(o_{t+1}) β_{t+1}(j), for t = T-1, ..., 1

The data likelihood is
p(o_1 ... o_T | λ) = Σ_{i=1}^N π_i b_i(o_1) β_1(i) = Σ_{i=1}^N α_t(i) β_t(i), for any t

Complexity: O(TN2)
Slide 23
Backward Algorithm: Example

Model: the simple B/I HMM; observed sequence: a c g t

t = 1, 2, 3, 4
β_4(B) = 1; β_4(I) = 1
β_3(B) = 0.8*p(“t”|B)*β_4(B) + 0.2*p(“t”|I)*β_4(I)
β_3(I) = 0.5*p(“t”|B)*β_4(B) + 0.5*p(“t”|I)*β_4(I)
...

P(“a c g t”) = α_1(B)*β_1(B) + α_1(I)*β_1(I) = α_2(B)*β_2(B) + α_2(I)*β_2(I)
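A matching sketch of the backward pass, on the same B/I model with my own names; the final line checks the identity p(O) = Σ_i π_i b_i(o_1) β_1(i), which must reproduce the forward algorithm's value.

```python
# Backward algorithm on the two-state B/I model.
pi = {"B": 0.5, "I": 0.5}
A = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
B = {"B": {"a": 0.25, "c": 0.10, "g": 0.25, "t": 0.40},
     "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}}
STATES = ["B", "I"]

def backward(obs):
    """Return [beta_1, ..., beta_T]; beta_T(i) = 1 for all i."""
    betas = [{i: 1.0 for i in STATES}]
    for o in reversed(obs[1:]):          # o plays the role of o_{t+1}
        nxt = betas[0]
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        betas.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in STATES)
                         for i in STATES})
    return betas

obs = "acgt"
betas = backward(obs)
likelihood = sum(pi[i] * B[i][obs[0]] * betas[0][i] for i in STATES)
```

Hand-checking against the slide: beta_3(B) = 0.8*0.40*1 + 0.2*0.25*1 = 0.37, and the likelihood agrees with the forward example.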
Slide 24
Problem III: Training – Estimating Parameters
Where do we get the probability values for all parameters?
Supervised vs. Unsupervised
Slide 25
Supervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V={a,b}
3. O – observations, e.g., O=aaaaabbbbb
4. State transitions, e.g., S=1121122222

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

Estimates by counting:
π1 = 1/1 = 1; π2 = 0/1 = 0
a11 = 2/4 = 0.5; a12 = 2/4 = 0.5; a21 = 1/5 = 0.2; a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0; b1(b) = 0/4 = 0; b2(a) = 1/6 = 0.167; b2(b) = 5/6 = 0.833

i.e., P(s1)=1, P(s2)=0; P(a|s1)=1, P(b|s1)=0; P(a|s2)=0.167, P(b|s2)=0.833
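The counts on this slide can be computed mechanically; the helper name and dictionary layout are my own.

```python
# ML estimation by counting when both O and the state sequence are given.
from collections import Counter

def supervised_estimate(obs, states):
    trans = Counter(zip(states, states[1:]))   # counts of i -> j transitions
    emit = Counter(zip(states, obs))           # counts of state i emitting v
    n_from = Counter(states[:-1])              # times each state is left
    n_at = Counter(states)                     # times each state emits a symbol
    state_set = sorted(set(states))
    # One training sequence, so pi puts all mass on the first state:
    pi = {s: (1.0 if s == states[0] else 0.0) for s in state_set}
    a = {(i, j): trans[(i, j)] / n_from[i] for i in state_set for j in state_set}
    b = {(i, v): emit[(i, v)] / n_at[i] for i in state_set for v in sorted(set(obs))}
    return pi, a, b

pi, a, b = supervised_estimate("aaaaabbbbb", "1121122222")
# a[("1","1")] = 2/4, a[("2","2")] = 4/5, b[("2","a")] = 1/6, ...
```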
Slide 26
Unsupervised Training

Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V={a,b}
3. O – observations, e.g., O=aaaaabbbbb
(The state sequence is NOT given.)

Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

How could this be possible?
Maximum Likelihood: λ* = argmax_λ p(O|λ)
Slide 27
Intuition

With O=aaaaabbbbb, enumerate all possible state sequences:
q1=1111111111, q2=1111111221, ..., qK=2222222222
and compute P(O, q1|λ), P(O, q2|λ), ..., P(O, qK|λ).

Re-estimate (new λ'), weighting each sequence by its probability
(here [condition] is 1 if the condition holds and 0 otherwise):

π_i = Σ_{k=1}^K p(O, q_k|λ) [q_k(1) = i] / Σ_{k=1}^K p(O, q_k|λ)

a_ij = Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T-1} [q_k(t) = i, q_k(t+1) = j] / Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T-1} [q_k(t) = i]

b_i(v_j) = Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T} [q_k(t) = i, o_t = v_j] / Σ_{k=1}^K p(O, q_k|λ) Σ_{t=1}^{T} [q_k(t) = i]

Computation of P(O, q_k|λ) for every q_k is expensive …
Slide 28
Baum-Welch Algorithm

Basic “counters”:
γ_t(i) = p(q_t = s_i | O, λ) — being at state s_i at time t
ξ_t(i, j) = p(q_t = s_i, q_{t+1} = s_j | O, λ) — being at state s_i at time t and at state s_j at time t+1

Computation of counters:
γ_t(i) = α_t(i) β_t(i) / Σ_{j=1}^N α_t(j) β_t(j)
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{j=1}^N α_t(j) β_t(j)

Complexity: O(N2)
Slide 29
Baum-Welch Algorithm (cont.)

Updating formulas:

π_i' = γ_1(i)

a_ij' = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{j'=1}^N ξ_t(i, j')

b_i(v_k)' = Σ_{t=1}^{T} γ_t(i) [o_t = v_k] / Σ_{t=1}^{T} γ_t(i)

Overall complexity for each iteration: O(TN2)
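One full iteration of the counters and updating formulas above can be sketched in plain Python, on the two-state a/b example from the training slides. The starting parameter values below are arbitrary guesses of mine; a real run would iterate until the likelihood converges.

```python
# One Baum-Welch (EM) iteration: forward, backward, counters, updates.
def forward(obs, pi, A, B, S):
    al = [{i: pi[i] * B[i][obs[0]] for i in S}]
    for o in obs[1:]:
        al.append({i: sum(al[-1][j] * A[j][i] for j in S) * B[i][o] for i in S})
    return al

def backward(obs, pi, A, B, S):
    be = [{i: 1.0 for i in S}]
    for o in reversed(obs[1:]):
        be.insert(0, {i: sum(A[i][j] * B[j][o] * be[0][j] for j in S) for i in S})
    return be

def baum_welch_step(obs, pi, A, B, S):
    T = len(obs)
    al, be = forward(obs, pi, A, B, S), backward(obs, pi, A, B, S)
    # gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)
    gamma = []
    for t in range(T):
        z = sum(al[t][j] * be[t][j] for j in S)
        gamma.append({i: al[t][i] * be[t][i] / z for i in S})
    # xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j), normalized
    xi = []
    for t in range(T - 1):
        x = {(i, j): al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j]
             for i in S for j in S}
        z = sum(x.values())
        xi.append({k: v / z for k, v in x.items()})
    # Updating formulas from the slide:
    new_pi = {i: gamma[0][i] for i in S}
    new_A = {i: {j: sum(x[(i, j)] for x in xi) /
                    sum(x[(i, j2)] for x in xi for j2 in S)
                 for j in S} for i in S}
    new_B = {i: {v: sum(g[i] for g, o in zip(gamma, obs) if o == v) /
                    sum(g[i] for g in gamma)
                 for v in set(obs)} for i in S}
    return new_pi, new_A, new_B

S = ["1", "2"]
obs = "aaaaabbbbb"
pi0 = {"1": 0.6, "2": 0.4}
A0 = {"1": {"1": 0.7, "2": 0.3}, "2": {"1": 0.4, "2": 0.6}}
B0 = {"1": {"a": 0.8, "b": 0.2}, "2": {"a": 0.3, "b": 0.7}}
L0 = sum(forward(obs, pi0, A0, B0, S)[-1].values())
pi1, A1, B1 = baum_welch_step(obs, pi0, A0, B0, S)
L1 = sum(forward(obs, pi1, A1, B1, S)[-1].values())   # EM: L1 >= L0
```

Each iteration is one E-step (the γ and ξ counters) and one M-step (the updates), and EM guarantees the data likelihood never decreases.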
Slide 30
What You Should Know
• Definition of an HMM and parameters of an HMM
• Viterbi Algorithm
• Forward/Backward algorithms
• Estimate parameters of an HMM in a supervised way