
Hidden Markov Models

Bioinfo I (Institut Pasteur Montevideo) HMM -class 14- August 29th, 2011 1 / 65


Outline

- CG-islands
- The "Fair Bet Casino"
- Hidden Markov Model
- Decoding Algorithm
- Forward-Backward Algorithm


HMM Overview

- Machine learning approach used in bioinformatics
- Presented with training data → derive important insights about the (hidden) parameters
- Once the algorithm has been "trained" → insights can be applied to test data
- The more training data, the higher the accuracy of the algorithm


HMM Overview

- Parameters learned during training → KNOWLEDGE
- Application of those parameters to new data → the algorithm's use of that knowledge
- HMM: learns unknown probabilistic parameters from training samples and uses these parameters in the framework of dynamic programming (or other methods) to find the best explanation for the experimental data


CpG-islands

Double stranded DNA:

...ApCpCpApTpGpApTpGpCpApGpGpApCpTpTpCpCpApTpCpGpTpTpCpGpCp...
   | | | | | | | | | | | | | | | | | | | | | | | | | | | |
...TpGpGpTpApCpTpApCpGpTpCpCpTpGpApApGpGpTpApGpCpApApGpCpGp...

Given 4 nucleotides: assume equal probability of occurrence ⇒ p(N) ≈ 1/4 for each N ∈ {A, G, C, T} ⇒ the probability of occurrence of a dinucleotide is ≈ 1/16.

However, the frequencies of dinucleotides in DNA sequences are not equal. In particular, p(CG) is typically < 1/16.


CpG-islands

CG is the least frequent dinucleotide, because the C in a CpG pair is often modified by methylation (that is, an H-atom is replaced by a CH3-group) and the methyl-C has the tendency to mutate into T afterwards.

Upstream of a gene, however, the methylation process is suppressed in short regions of the genome of length 100-5000. These areas are called CpG-islands and they are characterized by the fact that we see more CpG-pairs in them than elsewhere.

So, finding the CpG islands in a genome is an important problem.


[Figure: CpG-island]

CpG-islands

Definition (Gardiner-Garden & Frommer, 1987): A CpG island is a DNA sequence of length at least 200 bp with a C + G content above 50% and a ratio of observed-to-expected number of CpGs that is above 0.6.
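The three criteria in the definition can be turned into a small check. The following is a minimal Python sketch, not the authors' method: the function names are hypothetical, and the expected CpG count is taken to be (#C × #G) / N, a common convention that is assumed here because the slide does not spell it out. The thresholds are parameters whose defaults mirror the figures quoted in the definition.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")      # "CG" cannot overlap itself
    expected = c * g / n            # assumed convention: (#C * #G) / N
    ratio = observed / expected if expected > 0 else 0.0
    return (c + g) / n, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three criteria; all thresholds are adjustable assumptions."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc >= min_gc and ratio > min_ratio
```

Such a check only addresses a single candidate segment; it says nothing about where an island starts or ends inside a longer sequence.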

According to a recent study [1], human chromosomes 21 and 22 contain about 1100 CpG-islands and about 750 genes.

[1] D. Takai & P. A. Jones, "Comprehensive analysis of CpG islands in human chromosomes 21 and 22", PNAS, March 19, 2002


Questions

1. Discrimination problem: Given a short segment of genomic sequence, how can we decide whether this segment comes from a CpG-island or not?

2. Localisation problem: Given a long segment of genomic sequence, how can we find all the CpG-islands it contains?


CpG islands and the casino dealer

The CpG island problem is similar to the "dishonest casino problem": An occasionally dishonest casino uses two kinds of coins, a fair one and a loaded one. The game is to flip coins with only two possible outcomes: Head or Tail.

Fair coin [2]: p(H|F) = p(T|F) = 1/2
Biased coin: p(H|B) = 3/4, p(T|B) = 1/4

The occasionally dishonest casino dealer changes between the two coins with 10% probability.

[2] Recall that P(A|B) denotes the probability of an event A given B.

The casino problem

A sequence of coin tosses is made, each with one of the two coins:

π = π1π2π3 . . . πn, πi ∈ {F, B}

However, one observes only the sequence of outcomes of the tosses:

x = x1x2x3 . . . xn, xi ∈ {H, T}


The fair bet casino problem

Goal: Given a sequence of coin tosses, determine when the dealer used a fair coin and when a biased coin.

Input: A sequence x = x1x2x3...xn of coin tosses made by two possible coins (F or B).

Output: A sequence π = π1π2π3...πn, with each πi being either F or B, indicating that xi is the result of tossing the Fair or Biased coin respectively.


Problem...

Fair Bet Casino Problem: Any observed outcome of coin tosses could have been generated by any sequence of states!

π = FFFF...FF is a valid answer for every observed sequence of tosses, as is π = BBBB...BB.


We need to incorporate a way to grade different sequences differently.


P(x |fair coin) vs. P(x |biased coin)

Suppose first that the dealer never changes coins. Some definitions:
- P(x|fair coin): probability of generating the outcome x when the dealer uses the F coin
- P(x|biased coin): probability of generating the outcome x when the dealer uses the B coin


P(x |fair coin) vs. P(x |biased coin)

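Assume x is a sequence of n tosses:

$$P(x \mid \text{fair coin}) = P(x_1 \ldots x_n \mid \text{fair coin}) = \prod_{i=1}^{n} p(x_i \mid \text{fair coin}) = \left(\frac{1}{2}\right)^{n}$$

$$P(x \mid \text{biased coin}) = P(x_1 \ldots x_n \mid \text{biased coin}) = \prod_{i=1}^{n} p(x_i \mid \text{biased coin}) = \left(\frac{3}{4}\right)^{k} \left(\frac{1}{4}\right)^{n-k} = \frac{3^{k}}{4^{n}}$$

where k is the number of Heads in x.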
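As a quick check of the two formulas, here is a minimal Python sketch that evaluates both probabilities exactly with rational arithmetic. The function names and the encoding of tosses as a string of "H" and "T" are assumptions made for illustration.

```python
from fractions import Fraction


def prob_fair(x):
    """P(x | fair coin) = (1/2)^n for a string x of 'H'/'T'."""
    return Fraction(1, 2) ** len(x)


def prob_biased(x):
    """P(x | biased coin) = (3/4)^k * (1/4)^(n-k), with k the number of heads."""
    n, k = len(x), x.count("H")
    return Fraction(3, 4) ** k * Fraction(1, 4) ** (n - k)


x = "HHTH"
print(prob_fair(x), prob_biased(x))
```

Using `Fraction` keeps the comparison exact, which makes it easy to confirm that the biased-coin probability equals 3^k / 4^n.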


P(x |fair coin) vs. P(x |biased coin)

- If P(x|fair coin) > P(x|biased coin), then the dealer most likely used a fair coin
- If P(x|biased coin) > P(x|fair coin), then the dealer most likely used a biased coin

When are these probabilities equal?


P(x |fair coin) vs. P(x |biased coin)

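$$P(x \mid \text{fair coin}) = P(x \mid \text{biased coin})$$

$$\left(\frac{1}{2}\right)^{n} = \frac{3^{k}}{4^{n}}$$

$$2^{n} = 3^{k}$$

$$n = k \log_2 3$$

that is, when $k = n / \log_2 3$ ($k \approx 0.63\,n$).

Hence, when $k < n / \log_2 3$ the dealer most likely used a fair coin.

When $k > n / \log_2 3$ he most likely used a biased coin.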
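Setting 2^n = 3^k and solving for k gives k = n / log2(3), roughly 0.63 n. The short Python sketch below computes that break-even number of heads and classifies a toss count against it; the function names are hypothetical.

```python
import math


def heads_threshold(n):
    """Number of heads k at which both coins explain n tosses equally well."""
    return n / math.log2(3)


def more_likely_coin(n, k):
    """Compare k heads out of n tosses against the break-even point."""
    t = heads_threshold(n)
    if k > t:
        return "biased"
    if k < t:
        return "fair"
    return "tie"


print(round(heads_threshold(100), 2))
print(more_likely_coin(100, 60), more_likely_coin(100, 70))
```

Because 2^n = 3^k has no solution in positive integers, an exact tie cannot occur for real toss counts; the "tie" branch is there only for completeness.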


Log-odds Ratio

We define log-odds ratio as follows:

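$$\log_2 \frac{P(x \mid \text{fair coin})}{P(x \mid \text{biased coin})} = \sum_{i=1}^{n} \log_2 \frac{p^{+}(x_i)}{p^{-}(x_i)} = n - k \log_2 3$$

where $p^{+}$ and $p^{-}$ are the per-toss probabilities under the fair and the biased coin respectively, and k is the number of Heads in x.

- log-odds < 0: biased coin
- log-odds > 0: fair coin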
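The closed form n − k·log2(3) and the per-toss sum should agree. The following Python sketch, with hypothetical function names, computes the log-odds both ways so that the two can be compared.

```python
import math

# Per-toss probabilities: p_plus is the fair coin, p_minus the biased coin.
P_PLUS = {"H": 0.5, "T": 0.5}
P_MINUS = {"H": 0.75, "T": 0.25}


def log_odds_sum(x):
    """Sum of per-toss log2 ratios p_plus(x_i) / p_minus(x_i)."""
    return sum(math.log2(P_PLUS[s] / P_MINUS[s]) for s in x)


def log_odds_closed_form(x):
    """Closed form n - k * log2(3), with k the number of heads."""
    return len(x) - x.count("H") * math.log2(3)


for tosses in ("TTTT", "HHHH", "HTHH"):
    print(tosses, log_odds_sum(tosses), log_odds_closed_form(tosses))
```

A positive value points to the fair coin and a negative value to the biased coin, matching the rule stated on the slide.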


Computing Log-odds Ratio in Sliding Windows

We know the dealer switches coins, albeit rarely...

Consider a sliding window of the outcome sequence. Find the log-odds for this short window.
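A sliding-window scan can be sketched in a few lines of Python. This is an illustration under stated assumptions: tosses are a string of "H" and "T", the window width `w` is chosen by the caller, and the function name is hypothetical. The head count is updated incrementally, so each window costs constant time.

```python
import math

LOG2_3 = math.log2(3)


def window_log_odds(x, w):
    """Log-odds w - k*log2(3) for every window x[i:i+w] of width w."""
    if w <= 0 or w > len(x):
        return []
    k = x[:w].count("H")                 # heads in the first window
    scores = [w - k * LOG2_3]
    for i in range(w, len(x)):
        # Slide by one: add the entering toss, drop the leaving toss.
        k += (x[i] == "H") - (x[i - w] == "H")
        scores.append(w - k * LOG2_3)
    return scores


print(window_log_odds("TTTTHHHH", 4))
```

In this toy input the scores start positive (fair) and end negative (biased), but the position where the sign flips depends on the chosen `w`.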


Computing Log-odds Ratio in Sliding Windows

If the log-odds ratio of a window is negative, then the dealer most likely used a biased coin; if it is positive, a fair coin.

The same approach can be used to find CG-islands in long DNA sequences. Calculate log-odds ratios for a sliding window of some particular length. Declare a window with a positive score as a potential CG-island.

Disadvantages:
- the length of a CG-island is not known in advance, nor is the right window length for the casino
- different windows may classify the same position differently

HMMs represent a different probabilistic approach.


Hidden Markov Models (HMM)

Motivation: Localisation problem: how to detect CpG-islands inside a long sequence x = x[1, L), L ≫ 0?

One idea is to use the Markov chain models derived above and apply them by calculating the log-odds ratio for a window of fixed width w ≪ L that is moved along the sequence, plotting the score S(x[k, k + w)) for each starting position k (1 ≤ k ≤ L − w).

Problems:
- how to determine the boundaries of CpG-islands,
- which window size w should one choose?

Approach: Merge the two models Model+ (Fair, CpG) and Model− (Biased, not CpG) to obtain a so-called Hidden Markov Model.


Hidden Markov Models

- Can be viewed as an abstract machine with k hidden states that emits symbols from an alphabet Σ.
- Each state has its own probability distribution, and the machine switches between states according to this probability distribution.
- While in a certain state, the machine makes 2 decisions:
  - What state should I move to next?
  - What symbol - from the alphabet Σ - should I emit?


Why "Hidden"?

- Observers can see the emitted symbols of an HMM but have no ability to know which state the HMM is currently in.
- Thus, the goal is to infer the most likely hidden states of an HMM based on the given sequence of emitted symbols.


Hidden Markov Models

Definition: A hidden Markov model (HMM) is a system M = (Σ, Q, P, e) consisting of
- an alphabet Σ, often also called the emission alphabet,
- a set of states Q,
- a matrix P = {p_kl} of transition probabilities p_kl for k, l ∈ Q, and
- an emission probability e_k(b) for every k ∈ Q and b ∈ Σ.


HMM Parameters: examples

Σ: set of emission characters
- Σ = {H, T} for coin tossing
- Σ = {1, 2, 3, 4, 5, 6} for dice tossing

Q: set of hidden states, each emitting symbols from Σ
- Q = {F, B} for coin tossing


HMM Parameters: examples

P = (p_kl): a |Q| × |Q| matrix of the probabilities of changing from state k to state l

p_FF = 0.9   p_FB = 0.1
p_BF = 0.1   p_BB = 0.9

E = (e_k(b)): a |Q| × |Σ| matrix of the probabilities of emitting symbol b while being in state k

e_F(H) = 1/2   e_F(T) = 1/2
e_B(H) = 3/4   e_B(T) = 1/4


HMM for Fair Bet Casino

The Fair Bet Casino in HMM terms:

Σ = {0, 1} (0 for Tails and 1 for Heads)

Q = {F, B}: F for Fair and B for Biased coin

[Tables: Transition Probabilities P and Emission Probabilities e]

[Figure: HMM for Fair Bet Casino]

HMM for Fair Bet Casino

Formally, the corresponding HMM for the casino is defined as M(Σ, Q, P, e):
- Σ = {0, 1}, 0: tails and 1: heads
- Q = {F, B}, F: fair and B: biased
- P, the transition probability matrix: p_FF = p_BB = 0.9, p_FB = p_BF = 0.1
- e, the emission matrix: e_F(0) = 1/2, e_F(1) = 1/2, e_B(0) = 1/4, e_B(1) = 3/4
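The casino HMM defined above is small enough to write down directly. The sketch below stores it as plain Python dictionaries and checks that every row of P and e sums to 1; the variable names and the dictionary layout are illustrative choices, not part of the original slides.

```python
STATES = ("F", "B")            # Q: fair and biased coin
ALPHABET = ("0", "1")          # Sigma: 0 = tails, 1 = heads

# TRANSITION[k][l] is p_kl, the probability of moving from state k to l.
TRANSITION = {
    "F": {"F": 0.9, "B": 0.1},
    "B": {"F": 0.1, "B": 0.9},
}

# EMISSION[k][b] is e_k(b), the probability of emitting b in state k.
EMISSION = {
    "F": {"0": 0.5, "1": 0.5},
    "B": {"0": 0.25, "1": 0.75},
}


def is_stochastic(table, tol=1e-9):
    """True if every row of the table is a probability distribution."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in table.values())


print(is_stochastic(TRANSITION), is_stochastic(EMISSION))
```

Nested dictionaries keep the lookup `TRANSITION[k][l]` close to the notation p_kl; for larger state sets a matrix representation would be the more usual choice.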


HMM for CpG-islands

- In our case of the localisation of CpG-islands, we want to have both models (CpG+ and CpG−) integrated in one, with a small probability of switching from one chain to the other at each transition point.
- It might seem like we should model the DNA sequence with two states (CpG+, CpG−). However, for a DNA sequence, the emission probability of a nucleotide depends on the previous emission, not on the previous state (CpG island / not a CpG island), and hence we need more than two states.
- We resolve this by relabeling the states such that we have the states A+, C+, G+ and T+, which emit A, C, G and T respectively within CpG-regions, and likewise the states A−, C−, G− and T−, which emit A, C, G and T respectively within non-CpG-regions.


HMM for CpG-islands

The graph of transitions between states looks as follows:

[Figure: states A+, C+, G+, T+ in one row and A−, C−, G−, T− in another, with transitions between the two sets]

(Additionally, we have all transitions between states within either of the two sets that carry over from the two Markov chains Model+ and Model−.)


HMM for CpG-islands

Thus, we have
- Σ = {A, G, C, T} for the alphabet,
- Q = {A+, C+, G+, T+, A−, C−, G−, T−}, and possibly start and end states, for the set of states.

The transition probability between the + and − states is small. Generally, the chance of switching from + to − is larger than vice versa.

In the case of the CpG-islands the emission probabilities are all 0 or 1, as can be seen below.

[Figure: HMM for CpG-islands]

Paths

Definition: A path π = (π1, π2, . . . , πL) is a sequence of states in the model M. Each state emits a symbol with a certain probability, thereby generating a sequence of symbols.

Generally a sequence can be generated from an HMM as follows: first a begin state π1 is chosen according to the probability p_{0,π1}. In that state a symbol is emitted according to the emission probability e_{π1} for that state. We then move to the next state π2 with probability p_{π1,π2}, and so on. We report the sequence of emitted symbols.
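The generation procedure just described can be sketched as a small sampler. The Python example below uses the casino parameters from the earlier slides; the uniform start distribution `START`, the function name, and the `seed` parameter are assumptions added for illustration, since the slides do not give the begin probabilities.

```python
import random

TRANSITION = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMISSION = {"F": {"0": 0.5, "1": 0.5}, "B": {"0": 0.25, "1": 0.75}}
START = {"F": 0.5, "B": 0.5}   # assumed: the slides do not specify it


def sample_hmm(n, start, transition, emission, seed=None):
    """Generate a hidden path and the emitted symbols, both of length n."""
    rng = random.Random(seed)

    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]

    path, symbols = [], []
    state = draw(start)                       # choose the begin state
    for _ in range(n):
        path.append(state)
        symbols.append(draw(emission[state])) # emit a symbol in this state
        state = draw(transition[state])       # move to the next state
    return "".join(path), "".join(symbols)


hidden, observed = sample_hmm(30, START, TRANSITION, EMISSION, seed=1)
print(hidden)
print(observed)
```

Only the second printed line would be visible to an observer at the casino; the first line is the hidden path that decoding tries to recover.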


Hidden Paths

A path π = π1...πn in the HMM is defined as a sequence of states. Consider the path π = FFFBBBBBFFF and the sequence x = 01011101001.


P(x |π) Calculation

P(x |π): Probability that sequence x was generated by the path π:

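$$P(x \mid \pi) = P(\pi_0 \to \pi_1) \times \prod_{i=1}^{n} P(x_i \mid \pi_i) \times P(\pi_i \to \pi_{i+1})$$

$$= p_{\pi_0,\pi_1} \times \prod_{i=1}^{n} e_{\pi_i}(x_i) \times p_{\pi_i,\pi_{i+1}}$$

$$= \left[ \prod_{i=0}^{n-1} p_{\pi_i,\pi_{i+1}} \times e_{\pi_{i+1}}(x_{i+1}) \right] \times p_{\pi_n,\pi_{n+1}}$$

if we count from i = 0 instead of i = 1. Here $\pi_0$ and $\pi_{n+1}$ denote the begin and end states.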
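For the example path π = FFFBBBBBFFF and sequence x = 01011101001, the product can be evaluated mechanically. The Python sketch below does so with exact fractions. Two assumptions are made because the slides leave them open: the begin transition is taken as 1/2 for either coin (`START`), and the transition into the end state is taken as 1 and therefore omitted. The function name is hypothetical.

```python
from fractions import Fraction as Fr

TRANSITION = {"F": {"F": Fr(9, 10), "B": Fr(1, 10)},
              "B": {"F": Fr(1, 10), "B": Fr(9, 10)}}
EMISSION = {"F": {"0": Fr(1, 2), "1": Fr(1, 2)},
            "B": {"0": Fr(1, 4), "1": Fr(3, 4)}}
START = {"F": Fr(1, 2), "B": Fr(1, 2)}    # assumed begin transition


def path_probability(x, path):
    """Product of begin transition, emissions and state-to-state transitions."""
    if not x or len(x) != len(path):
        raise ValueError("x and path must be non-empty and of equal length")
    prob = START[path[0]] * EMISSION[path[0]][x[0]]
    for i in range(1, len(x)):
        prob *= TRANSITION[path[i - 1]][path[i]] * EMISSION[path[i]][x[i]]
    return prob


p = path_probability("01011101001", "FFFBBBBBFFF")
print(p, float(p))
```

The path has six fair positions, five biased positions (four heads, one tail), eight "stay" transitions and two switches, so the result factors as 1/2 · (1/2)^6 · (3/4)^4 · (1/4) · (9/10)^8 · (1/10)^2.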


P(x |π) Calculation

[Figure: calculation of P(x|π) in the example]


P(x |π) Calculation

- We assumed that we knew π and observed x
- However, we do not know π (only the dealer knows) → π is hidden
- If you only observe x = 01011101001, you might ask whether or not π = FFFBBBBBFFF is the "best" explanation for x
- If it is not the best, can we reconstruct the best one?


Decoding Problem

Goal: Find an optimal hidden path of states given observations

Input: Sequence of observations x = x1...xn generated by an HMM M(Σ, Q, A, E)

Output: A path that maximizes P(x |π) over all possible paths π


Building Manhattan for Decoding Problem

- Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
- Every choice of π = π1...πn corresponds to a path in the graph.
- The only valid direction in the graph is eastward.
- This graph has |Q|^2 (n − 1) edges.

[Figure: Edit Graph for Decoding Problem]

[Figure: Decoding Problem vs. Alignment Problem]

Decoding Problem as Finding a Longest Path in a DAG

The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.

Note: the length of the path is defined as the product of its edges' weights, not the sum.


Decoding Problem

- Every path in the graph has the probability P(x|π)
- The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths
- The Viterbi algorithm runs in O(n|Q|^2) time

[Figure: Decoding Problem: weights of edges]

Decoding Problem and Dynamic Programming

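$$v_{l,i+1} = \max_{k \in Q} \{ v_{k,i} \cdot \text{weight of edge between } (k, i) \text{ and } (l, i+1) \}$$

$$= \max_{k \in Q} \{ v_{k,i} \cdot a_{kl} \cdot e_l(x_{i+1}) \}$$

$$= e_l(x_{i+1}) \cdot \max_{k \in Q} \{ v_{k,i} \cdot a_{kl} \}$$

$v_{k,i}$ is called the Viterbi variable. Here $a_{kl}$ is the transition probability from state k to state l (written p_kl in the HMM definition).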


Decoding Problem

Initialization:

$$v_{\text{begin},0} = 1, \qquad v_{k,0} = 0 \ \text{ for } k \neq \text{begin}$$

Let $\pi^{*}$ be the optimal path. Then

$$P(x \mid \pi^{*}) = \max_{k \in Q} \{ v_{k,n} \cdot a_{k,\text{end}} \}$$
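The recurrence, initialization and termination can be combined into a compact implementation. The following Python sketch is one possible rendering for the casino HMM, under two assumptions that the slides leave open: the begin state is replaced by a uniform start distribution `START`, and a_{k,end} is taken as 1 for every state, so the final step is a plain maximum. The function name and data layout are illustrative.

```python
STATES = ("F", "B")
TRANSITION = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMISSION = {"F": {"0": 0.5, "1": 0.5}, "B": {"0": 0.25, "1": 0.75}}
START = {"F": 0.5, "B": 0.5}   # assumed a_{begin,k}


def viterbi(x, states, start, transition, emission):
    """Return (best path, its probability) for a non-empty sequence x."""
    # v[k]: probability of the best path that ends in state k so far.
    v = {k: start[k] * emission[k][x[0]] for k in states}
    backpointers = []
    for symbol in x[1:]:
        new_v, pointer = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: v[k] * transition[k][l])
            pointer[l] = best_k
            new_v[l] = emission[l][symbol] * v[best_k] * transition[best_k][l]
        v = new_v
        backpointers.append(pointer)
    # Termination with a_{k,end} = 1 (assumption), then trace back.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path)), v[last]


print(viterbi("01011101001", STATES, START, TRANSITION, EMISSION))
```

Each position considers every pair of states once, which is where the O(n|Q|^2) running time comes from. For long sequences the products underflow, so a log-space variant is preferable in practice.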


Viterbi Algorithm

The value of the product can become extremely small, which leads to underflow. To avoid underflow, use log values instead:

$$v_{l,i+1} = \log e_l(x_{i+1}) + \max_{k \in Q} \{ v_{k,i} + \log a_{kl} \}$$

where $v$ now holds log-probabilities.
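The numerical problem and the log-space remedy can both be shown briefly. The Python sketch below first multiplies 1000 factors of 0.25, which falls below the smallest representable double, and then gives one log-space update step. The function name `viterbi_log_step` and the dictionary layout are assumptions for illustration.

```python
import math

# Multiplying many probabilities underflows to 0.0 in floating point ...
product = 1.0
for _ in range(1000):
    product *= 0.25
# ... while the sum of logarithms stays finite.
log_sum = 1000 * math.log(0.25)
print(product, log_sum)


def viterbi_log_step(prev, symbol, states, log_transition, log_emission):
    """One update v_{l,i+1} = log e_l(x_{i+1}) + max_k (v_{k,i} + log a_kl)."""
    return {
        l: log_emission[l][symbol]
        + max(prev[k] + log_transition[k][l] for k in states)
        for l in states
    }
```

Since the logarithm is monotonic, the maximizing state is the same in both formulations; only the stored values change.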


Viterbi example -weather-

- High (H) and Low (L) pressure fronts are related to the weather
- High fronts tend to bring sunny days
- Cloudy days are related to Low fronts
- Having seen the weather over the last three days, we want to know which pressure fronts generated it
- Weather in the last 3 days: Sunny-Cloudy-Sunny

[Figure: Viterbi example]

[Figure: Viterbi example -DP-]

Forward-Backward Problem

Viterbi → Given a sequence of coin tosses, what is the most probable path of states?

Given: a sequence of coin tosses generated by an HMM.

Goal: find the probability that the dealer was using a biased coin at a particular time.


Forward Algorithm

A blind hermit can only sense the seaweed state (dry, damp, soggy), but needs to know the weather (sunny, cloudy, rainy), i.e. the hidden states. The HMM M(Σ, Q, A, E) is known.

Viterbi would calculate the most probable sequence of states that generates x = dry-damp-soggy.
The forward algorithm calculates the probability of being at state k at time i.


Forward Algorithm

P(x_1 ... x_i, π_i = k) = P(observation at time i | hidden state is k) × P(all paths to state k at time i)

For example, for the state cloudy at time i = 2:
P(x_1 x_2, π_2 = cloudy) = P(damp | cloudy) × P(all paths to state cloudy at time 2)


Forward Algorithm

Define f_{k,i} (forward probability) as the probability of emitting the prefix x_1 ... x_i and reaching the state π_i = k, i.e. P(x_1 ... x_i, π_i = k).
The recurrence for the forward algorithm:

$$f_{k,i} = e_k(x_i) \cdot \sum_{l \in Q} f_{l,i-1} \cdot p_{lk}$$

It is calculated like Viterbi, with dynamic programming. Instead of taking the maximum over the previous states, sum over them.
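The forward recurrence can be sketched as follows. The two-state weather model and all of its numeric parameters are assumptions made up for illustration (the slides' own values are not available in the text), and the function name `forward` is hypothetical. No end state is modeled, so the probability of the whole sequence is taken as the sum of the last column.

```python
import math

def forward(obs, states, start, trans, emit):
    """Return the forward table f, where f[i][k] = P(x_1..x_{i+1}, state k).

    Positions are 0-based in the list, so f[0] corresponds to i = 1.
    """
    # Initialization: start probability times emission of the first symbol
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = f[-1]
        # Same shape as Viterbi, but summing over predecessors l
        f.append({k: emit[k][x] * sum(prev[l] * trans[l][k] for l in states)
                  for k in states})
    return f

# Hypothetical model parameters (not taken from the slides)
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.6, "L": 0.4}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"Sunny": 0.9, "Cloudy": 0.1}, "L": {"Sunny": 0.2, "Cloudy": 0.8}}

f = forward(["Sunny", "Cloudy", "Sunny"], states, start, trans, emit)
p_x = sum(f[-1].values())  # P(x), assuming no explicit end state
print(f[-1], p_x)
```

For long sequences the same underflow issue as in Viterbi appears; a common remedy is to rescale each column or to work with log values.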


Backward-algorithm

Using the Viterbi algorithm we can search for the most probable path, i.e. the state sequence that has generated an observed sequence. But this does not help us with prediction of future states.

Definition
The backward variable is defined as the probability of generating (emitting) the suffix sequence (x_{i+1}, ..., x_L), given that the system is in state π_i = k:

$$b_k(i) = P(x_{i+1} \dots x_L \mid \pi_i = k)$$

The backward algorithm uses a very similar recursion formula:

$$b_k(i) = \sum_{l \in Q} e_l(x_{i+1}) \, b_l(i+1) \, p_{kl}$$


Backward-algorithm

Input: HMM M = (Σ, Q, P, e) and sequence of symbols x
Output: probability P(x | M)

Initialization (i = L): $b_k(L) = p_{k0}$ for all k.

For all i = L-1, ..., 1 and k ∈ Q:

$$b_k(i) = \sum_{l \in Q} e_l(x_{i+1}) \, b_l(i+1) \, p_{kl}$$

Termination:

$$P(x \mid M) = \sum_{l \in Q} p_{0l} \, e_l(x_1) \, b_l(1)$$
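The steps above can be sketched in code. This is an illustration under stated assumptions, not the slides' own implementation: the two-state weather model and its numbers are made up, and no end state is modeled, so the initialization uses b_k(L) = 1 in place of p_{k0}. The sketch also combines forward and backward values into the posterior P(π_i = k | x) = f_k(i) · b_k(i) / P(x), a standard identity that answers the forward-backward question of which state the system was in at a particular time.

```python
import math

# Hypothetical model parameters (not taken from the slides)
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}                                    # p_{0l}
trans = {"H": {"H": 0.6, "L": 0.4}, "L": {"H": 0.4, "L": 0.6}}  # p_{kl}
emit = {"H": {"Sunny": 0.9, "Cloudy": 0.1}, "L": {"Sunny": 0.2, "Cloudy": 0.8}}
obs = ["Sunny", "Cloudy", "Sunny"]

def forward(obs):
    """f[i][k]: probability of the prefix up to position i, ending in state k."""
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({k: emit[k][x] * sum(prev[l] * trans[l][k] for l in states)
                  for k in states})
    return f

def backward(obs):
    """b[i][k]: probability of the suffix after position i, given state k at i."""
    b = [{k: 1.0 for k in states}]  # assumption: b_k(L) = 1, no end state
    for x_next in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(emit[l][x_next] * nxt[l] * trans[k][l] for l in states)
                     for k in states})
    return b

f = forward(obs)
b = backward(obs)

# Termination of the backward algorithm: P(x | M)
p_x = sum(start[l] * emit[l][obs[0]] * b[0][l] for l in states)

# Posterior probability of each state at each position
posterior = [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(obs))]
for i, row in enumerate(posterior, start=1):
    print(i, row)
```

A useful sanity check is that the backward termination gives the same P(x | M) as summing the last forward column, and that the posteriors at each position sum to 1.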


Comparison of the three variables

Viterbi   v_k(i):  probability with which the most probable state path generates the sequence of symbols (x_1, x_2, ..., x_i) and the system is in state k at time i.

Forward   f_k(i):  probability that the prefix sequence of symbols x_1, ..., x_i is generated and the system is in state k at time i.

Backward  b_k(i):  probability that the system, starting in state k at time i, then generates the sequence of symbols x_{i+1}, ..., x_L.
