hidden markov models chapter 11. cg “islands” the dinucleotide “cg” is rare –c in a...

30
Hidden Markov Models Chapter 11

Upload: jocelin-ryan

Post on 16-Dec-2015

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Hidden Markov ModelsChapter 11

Page 2: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

CG “islands”

• The dinucleotide “CG” is rare– C in a “CG” often gets “methylated” and the

resulting C then mutates to T

• Methylation is suppressed in some areas of genome, called “CG islands”

• Such CG islands often found around genes• Problem: find CG islands in whole genome

Page 3: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

CG island normal

high CG rare CG

• Two states. • Each state emits sequence.• Sequence emitted by “CG island” state is high in CG frequency• Concatenation of sequence emissions = genome

Page 4: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Biasedcoin

Faircoin

more “H” equal “H” & “T”

• Two states. • Each state emits a “coin toss” result.• Sequence emitted by “Biased coin” state is high in “Heads” frequency

Page 5: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

CG Islands and the “Fair Bet Casino”

• The The CGCG islands problem can be modeled after a islands problem can be modeled after a problem named problem named “The Fair Bet Casino”“The Fair Bet Casino”

• The game is to flip coins, which results in only two The game is to flip coins, which results in only two possible outcomes: possible outcomes: HHead or ead or TTail.ail.

• The The FFair coin will give air coin will give HHeads and eads and TTails with same ails with same probability ½.probability ½.

• The The BBiased coin will give iased coin will give HHeads with prob. ¾.eads with prob. ¾.

Page 6: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

The “Fair Bet Casino” (cont’d)

• Thus, we define the probabilities:Thus, we define the probabilities:– P(H|F) = P(T|F) = ½P(H|F) = P(T|F) = ½– P(H|B) = ¾, P(T|B) = ¼P(H|B) = ¾, P(T|B) = ¼– The crooked dealer chages between The crooked dealer chages between

Fair and Biased coins with probability Fair and Biased coins with probability 10%10%

Page 7: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

The Fair Bet Casino Problem• Input:Input: A sequence A sequence x = xx = x11xx22xx33…x…xnn of coin of coin

tosses made by two possible coins (tosses made by two possible coins (FF or or BB).).

• Output:Output: A sequence A sequence π = ππ = π11 π π22 π π33… π… πnn, ,

with each with each ππii being either being either F F or or BB indicating indicating

that that xxii is the result of tossing the Fair or is the result of tossing the Fair or

Biased coin respectively.Biased coin respectively.

Page 8: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Problem…

Fair Bet Casino Fair Bet Casino ProblemProblem

Any observed Any observed outcome of coin outcome of coin tosses could tosses could have been have been generated by any generated by any sequence of sequence of states!states!

Need to incorporate a way to grade different sequences differently.

Decoding ProblemDecoding Problem

Page 9: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Hidden Markov Model (HMM)• Can be viewed as an abstract machine with Can be viewed as an abstract machine with k hidden k hidden

states that emits symbols from an alphabet states that emits symbols from an alphabet ΣΣ..• Each state has its own probability distribution, and Each state has its own probability distribution, and

the machine switches between states according to the machine switches between states according to this probability distribution.this probability distribution.

• While in a certain state, the machine makes 2 While in a certain state, the machine makes 2 decisions:decisions:– What state should I move to next?What state should I move to next?– What symbol - from the alphabet What symbol - from the alphabet ΣΣ - should I - should I

emit?emit?

Page 10: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Why “Hidden”?

• Observers can see the emitted symbols Observers can see the emitted symbols of an HMM but have of an HMM but have no ability to know no ability to know which state the HMM is currently inwhich state the HMM is currently in..

• Thus, the goal is to infer the most likely Thus, the goal is to infer the most likely hidden states of an HMM based on the hidden states of an HMM based on the given sequence of emitted symbols.given sequence of emitted symbols.

Page 11: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

HMM Parameters

ΣΣ: set of emission characters.: set of emission characters.

Ex.: Σ = {H, T} for coin tossingEx.: Σ = {H, T} for coin tossing

QQ: set of hidden states, each emitting : set of hidden states, each emitting symbols from Σ.symbols from Σ.

Q={F,B} for coin tossingQ={F,B} for coin tossing

Page 12: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

HMM Parameters (cont’d)

A = (aA = (aklkl):): a |Q| x |Q| matrix of probability of a |Q| x |Q| matrix of probability of changing from state changing from state k k to state to state l.l.

aaFFFF = 0.9 a = 0.9 aFBFB = 0.1 = 0.1

aaBFBF = 0.1 a = 0.1 aBBBB = 0.9 = 0.9

E = (eE = (ekk((bb)):)): a |Q| x | a |Q| x |Σ| matrix of probability of Σ| matrix of probability of emitting symbol emitting symbol bb while being in state while being in state k.k.

eeFF(0) = ½ e(0) = ½ eFF(1) = ½ (1) = ½

eeBB(0) = ¼ e(0) = ¼ eBB(1) = ¾ (1) = ¾

Page 13: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

HMM for Fair Bet Casino (cont’d)

HMM model for the HMM model for the Fair Bet Casino Fair Bet Casino ProblemProblem

Page 14: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Hidden Paths• A A pathpath π = ππ = π11… π… πnn in the HMMin the HMM is defined as a sequence is defined as a sequence

of states.of states.• Consider path Consider path π π = = FFFBBBBBFFF and sequence FFFBBBBBFFF and sequence x x = =

0101110100101011101001

x 0 1 0 1 1 1 0 1 0 0 1

π π = F F F B B = F F F B B B B B F F FB B B F F FP(P(xxii|π|πii)) ½ ½ ½ ¾ ¾ ¾ ½ ½ ½ ¾ ¾ ¾ ¼¼ ¾ ½ ½ ½ ¾ ½ ½ ½

P(πP(πi-1 i-1 ππii)) ½ ½ 99//1010 99//10 10 11//10 10

99//10 10 99//10 10

99//10 10

99//10 10 11//10 10

99//10 10 99//1010

Transition probability from state ππi-1 i-1 to state πto state πii

Probability that xi was emitted from state ππii

Page 15: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

P(x|π) Calculation

• P(P(xx,,ππ):): Probability that sequence Probability that sequence xx was was generated by the path generated by the path π:π:

P(x,π ) = P(π i−1 → π i)P(x i | π i)i=1

n

= aπ i−1 ,π ieπ i

(x i)i=1

Page 16: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Decoding Problem

• Goal:Goal: Find an optimal hidden path of states Find an optimal hidden path of states given observations.given observations.

• Input:Input: Sequence of observations Sequence of observations x = xx = x11…x…xnn

generated by an HMM generated by an HMM MM((ΣΣ, Q, A, E, Q, A, E))

• Output:Output: A path that maximizes A path that maximizes P(x,P(x,ππ)) over all over all possible paths possible paths π.π.

Page 17: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Dynamic programming for the Dynamic programming for the Decoding Problem

• Andrew Viterbi used Dynamic programming Andrew Viterbi used Dynamic programming to solve the to solve the Decoding ProblemDecoding Problem..

• Build a graph with one node for every (i,j), Build a graph with one node for every (i,j), representing the possibility that the irepresenting the possibility that the i thth position position of of ππ was emitted from state j. was emitted from state j.

• Every choice of Every choice of π = ππ = π11… π… πn n corresponds to a corresponds to a path in the graph.path in the graph.

• Edges are weighted. Sum of edge weights on Edges are weighted. Sum of edge weights on a path a path π π corresponds to P(x,π)corresponds to P(x,π)

Page 18: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Graph for Decoding Problem

Page 19: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Decoding Problem

• Every path in the graph has the probability Every path in the graph has the probability P(x,P(x,ππ))..

• The Viterbi algorithm finds the path that The Viterbi algorithm finds the path that maximizes maximizes P(x,P(x,ππ) among all possible paths.) among all possible paths.

• The Viterbi algorithm runs in The Viterbi algorithm runs in O(n|Q|O(n|Q|22)) time.time.

Page 20: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Decoding Problem: weights of edges

w

The weight w is given by:

???

(k, i) (l, i+1)

Page 21: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Decoding Problem: weights of edges

w

The weight w is given by:

??

(k, i) (l, i+1)

nn

P(P(xx,,ππ) =) = Π Π ee ππii (x (xii)) . a. a ππi-1i-1, , ππii

i=1i=1

Page 22: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Decoding Problem: weights of edges

w

The weight w is given by:

?

(k, i) (l, i+1)

i+1i+1-th term -th term == ee ππi+1i+1 (x (xi+1i+1)) . a. a ππii, , ππi+1i+1

Page 23: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Decoding Problem: weights of edges

w

The weight w=el(xi+1). akl

(k, i) (l, i+1)

i+1i+1-th term -th term == ee ππi+1i+1 (x (xi+1i+1)) . a. a ππii, , ππi+1i+1 = = el(xi+1). akl

Page 24: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Dynamic programming recurrence

• sk,i is the probability of the most likely path for the prefix x1…xi that ends in k

• sl,i+1 = maxk {sk,i x weight of edge from (k,i) to (l,i+1)}

= maxk {sk,i akl el(xi+1)}

• Let Let ππ** be the optimal path. Then,be the optimal path. Then,

P(P(x,x,ππ**) = max) = maxkk { {ssk,nk,n . . aak,endk,end}}

• Time complexity: O(n |Q|Time complexity: O(n |Q|22))

Page 25: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Viterbi Algorithm

• The value of the product can become The value of the product can become extremely small, which leads to extremely small, which leads to overflowing.overflowing.

• To avoid overflowing, use log value To avoid overflowing, use log value instead: instead: SSk,i k,i = = log log ssk,ik,i

• New recurrence:New recurrence:

SSl,i+1l,i+1= log= logel(xi+1) + max kk {{SSk,i k,i ++ log(akl)}}

Page 26: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Forward-Backward Problem

Given:Given: a sequence a sequence x = xx = x11 x x22 …x …xnn

generated by an HMMgenerated by an HMM..

Goal:Goal: find the probability that x find the probability that xii was was

generated by state k.generated by state k.

That is, given x, what is P(πThat is, given x, what is P(π ii=k | x) ?=k | x) ?

Page 27: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Forward Algorithm

• Define Define ffk,ik,i ((forward probabilityforward probability) as the ) as the probability of emitting the prefix probability of emitting the prefix xx11…x…xii and and reaching the state reaching the state ππ = k = k..

• The recurrence for the forward algorithm:The recurrence for the forward algorithm:

ffk,ik,i = = eekk(x(xii)) . . ΣΣ ffl,i-l,i-11 . a . alklk

l Є Ql Є Q

Page 28: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Backward Algorithm

• However, However, forward probability forward probability is not the only is not the only factor affecting factor affecting P(P(ππii = k|x = k|x))..

• The sequence of transitions and emissions The sequence of transitions and emissions that the HMM undergoes between that the HMM undergoes between ππi+1i+1 and and ππnn

also affect also affect P(P(ππii = k|x = k|x))..

forward xi backward

Page 29: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Backward Algorithm (cont’d)

• DefineDefine backward probability backward probability bbk,ik,i as the as the probability of being in state probability of being in state ππii = k = k and and emitting the emitting the suffixsuffix xxi+1i+1…x…xnn..

• The recurrence for the The recurrence for the backward algorithmbackward algorithm::

bbk,ik,i = = ΣΣ eell(x(xi+1i+1)) . . bbl,i+1l,i+1 . a . akl kl

l Є Ql Є Q

Page 30: Hidden Markov Models Chapter 11. CG “islands” The dinucleotide “CG” is rare –C in a “CG” often gets “methylated” and the resulting C then mutates to T

Backward-Forward Algorithm

• The probability that the dealer used a The probability that the dealer used a biased coin at any moment biased coin at any moment ii::

P(x, P(x, ππii = k = k) f) fkk(i) . b(i) . bkk(i)(i) P(P(ππii = k|x = k|x) = ) = _______________ _______________ = = ____________________________

P(x) P(x)P(x) P(x)

P(x) is the sum of P(x, πP(x, πii = k) = k) over all k