Uncovering Sequences Mysteries With Hidden Markov Model

Cédric Notredame (12/05/22)

Post on 12-Jan-2016

TRANSCRIPT

Page 1: Uncovering Sequences Mysteries  With Hidden Markov Model

Cédric Notredame (21/04/23)

Uncovering Sequences Mysteries With Hidden Markov Model

Cédric Notredame

Page 3

Our Scope

-Understand the principle of HMMs
-Understand HOW HMMs are used in Biology
-Look under the hood once

Page 4

Outline

-Reminder of Bayesian Probabilities
-HMMs and Markov Chains
-Application to gene prediction
-Application to TM (transmembrane) prediction
-Application to Domain/Protein Family Prediction
-Future Applications

Page 5

Conditional Probabilities and Bayes Theorem

Page 6

I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times.

Bayes

Page 7

“The Durbin…” (Durbin, Eddy, Krogh & Mitchison, Biological Sequence Analysis)

Page 8

What is a Probabilistic Model?

Dice = Probabilistic Model
-Each possible outcome has a probability (1/6)

Biological questions:
-What kind of dice would generate coding DNA?
-And non-coding DNA?

Page 9

Which Parameters?

Dice = Probabilistic Model. Parameters: the probability of each outcome.

-A priori estimation: 1/6 for each number
OR
-Through observation: measure frequencies over a large number of events

Page 10

Which Parameters?

Model: Intra/Extra-Cell Proteins. Parameters: the probability of each outcome.

1- Make a set of Inside proteins using annotation
2- Make a set of Outside proteins using annotation
3- COUNT frequencies on the two sets

Model accuracy depends on the training set.

Page 11

Maximum Likelihood Models

Model: Intra/Extra-Cell Proteins
1- Make a training set
2- Count frequencies

Model accuracy depends on the training set.

Maximum Likelihood Model: the model parameters MAXIMISE the probability of the data.

Page 12

Maximum Likelihood Models

Model: Intra/Extra-Cell Proteins

The model probability MAXIMISES the data probability AND the data probability MAXIMISES the model probability.

P(Model | Data) is maximised.   (| means GIVEN!)

Page 13

Maximum Likelihood Models

Model: Intra/Extra-Cell Proteins

The model probability MAXIMISES the data probability AND the data probability MAXIMISES the model probability:

P(Model | Data) is maximised
P(Data | Model) is maximised

Page 14

Maximum Likelihood Models

Data: 11121112221212122121112221112121112211111

P(Dice | Data) < P(Coin | Data): for data made only of 1s and 2s, the coin is the maximum likelihood model.

Page 15

Conditional Probabilities

Page 16

Conditional Probabilities

The probability that something happens IF something else ALSO happens:

P(Win Lottery | Participation)

Page 17

Conditional Probability

The probability that something happens IF something else ALSO happens:

Dice 1: P(6 | Dice 1) = 1/6
Dice 2: P(6 | Dice 2) = 1/2   (Loaded!)

Page 18

Joint Probability

The probability that something happens AND something else ALSO happens. The comma means AND:

P(6 | D1) = 1/6    P(6 | D2) = 1/2

P(6, D2) = P(6 | D2) * P(D2) = 1/2 * 1/100

Page 19

Joint Probability

Question: what is the probability of making a 6, given that the loaded dice is used 1% of the time?

P(6) = P(6, DF) + P(6, DL) = P(6 | DF) * P(DF) + P(6 | DL) * P(DL) = 1/6 * 0.99 + 1/2 * 0.01 = 0.17

(about 0.167 for a purely fair dice)
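The mixture arithmetic on this slide can be checked in a few lines of Python (the variable names are just illustrative):

```python
# Marginal probability of a six in the occasionally dishonest casino:
# the loaded dice (P(6) = 1/2) is used 1% of the time, the fair dice
# (P(6) = 1/6) the remaining 99%.
p_fair, p_loaded = 0.99, 0.01        # P(DF), P(DL)
p6_fair, p6_loaded = 1 / 6, 1 / 2    # P(6 | DF), P(6 | DL)

# P(6) = P(6, DF) + P(6, DL) = P(6|DF)*P(DF) + P(6|DL)*P(DL)
p6 = p6_fair * p_fair + p6_loaded * p_loaded
print(round(p6, 2))  # 0.17
```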

Page 20

Joint Probability

P(6) = P(6, DF) + P(6, DL) = P(6 | DF) * P(DF) + P(6 | DL) * P(DL) = 1/6 * 0.99 + 1/2 * 0.01 = 0.17

Unsuspected heterogeneity in the training set leads to inaccurate parameter estimation.

Page 21

Bayes Theorem

X: Model or Data or any event. Y: Model or Data or any event.

P(Xi | Y) = P(Y | Xi) * P(Xi) / Σi ( P(Y | Xi) * P(Xi) )

Page 22

Bayes Theorem

X: Model or Data or any event. Y: Model or Data or any event. With X̄ the complement of X (the total XT = X + X̄), P(Y) = P(Y, X) + P(Y, X̄), so:

P(X | Y) = P(Y | X) * P(X) / ( P(Y | X) * P(X) + P(Y | X̄) * P(X̄) )

Page 23

Bayes Theorem

X: Model or Data or any event. Y: Model or Data or any event.

P(X | Y) = P(Y | X) * P(X) / P(Y)

P(Y | X) * P(X) is the probability of observing Y AND X simultaneously; 'remove' P(Y) to get P(X | Y), the probability of observing X IF Y is fulfilled.

Page 24

Bayes Theorem

X: Model or Data or any event. Y: Model or Data or any event.

P(X | Y) = P(X, Y) / P(Y)

P(X, Y) is the probability of observing Y and X simultaneously; 'remove' P(Y) to get P(X | Y), the probability of observing X IF Y is fulfilled.

Page 25

Using Bayes Theorem

Question: the dice gave three 6s in a row. IS IT LOADED!!!

We will use Bayes Theorem to test our belief: if the dice was loaded (the model), what would be the probability of this model given the data (three 6s in a row)?

Page 26

Using Bayes Theorem

Question: the dice gave three 6s in a row. IS IT LOADED!!!

P(D1) = 0.99, P(D2) = 0.01, P(6 | D1) = 1/6, P(6 | D2) = 1/2

The Occasionally Dishonest Casino…

Page 27

Using Bayes Theorem

Question: the dice gave three 6s in a row. IS IT LOADED!!!

P(D1) = 0.99, P(D2) = 0.01, P(6 | D1) = 1/6, P(6 | D2) = 1/2

Apply P(X | Y) = P(Y | X) * P(X) / P(Y) with X: D2 and Y: 6³. The denominator sums the two ways of producing 6³ (with D1 and with D2):

P(D2 | 6³) = P(6³ | D2) * P(D2) / ( P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) )

Page 28

Using Bayes Theorem

Question: the dice gave three 6s in a row. IS IT LOADED!!!

P(D1) = 0.99, P(D2) = 0.01, P(6 | D1) = 1/6, P(6 | D2) = 1/2

P(D2 | 6³) = P(6³ | D2) * P(D2) / ( P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ) = 0.21

Probably NOT
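The 0.21 posterior can be reproduced directly from the slide's numbers:

```python
# Posterior probability that the dice is loaded (D2) after three 6s,
# via Bayes theorem, using the priors and likelihoods from the slide.
p_d1, p_d2 = 0.99, 0.01              # P(D1), P(D2)
p6_d1, p6_d2 = 1 / 6, 1 / 2          # P(6 | D1), P(6 | D2)

lik_d1 = p6_d1 ** 3                  # P(6^3 | D1): three independent rolls
lik_d2 = p6_d2 ** 3                  # P(6^3 | D2)

posterior = lik_d2 * p_d2 / (lik_d1 * p_d1 + lik_d2 * p_d2)
print(round(posterior, 2))  # 0.21
```

Even though the loaded dice makes three 6s 27 times more likely, the 0.99 prior on the fair dice keeps the posterior low.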

Page 29

Posterior Probability

Question: the dice gave three 6s in a row. IS IT LOADED!!!

P(D2 | 6³) = P(6³ | D2) * P(D2) / ( P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ) = 0.21

0.21 is a posterior probability: it was estimated AFTER the data was obtained. P(6³ | D2) is the likelihood of the hypothesis.

Page 30

Debunking Headlines

'50% of the crimes are committed by Migrants.'
Question: are 50% of the Migrants criminals?

P(Migrant) = 0.1, P(Criminal) = 0.0001, P(M | C) = 0.5

P(C | M) = P(M | C) * P(C) / P(M) = 0.5 * 0.0001 / 0.1 = 0.0005

NO: only 0.05% of Migrants are criminals (NOT 50%!)

Page 31

Debunking Headlines

'50% of Gene Promoters contain TATA.'
Question: is TATA a good gene predictor?

P(T) = 0.1, P(P) = 0.0001, P(T | P) = 0.5

P(P | T) = P(T | P) * P(P) / P(T) = 0.5 * 0.0001 / 0.1 = 0.0005

NO: only 0.05% of TATA-containing sequences are promoters.

Page 32

Bayes Theorem

Bayes Theorem reveals the trade-off between Sensitivity (finding ALL the genes) and Specificity (finding ONLY genes).

TATA = High Sensitivity / Low Specificity

Page 33

Markov Chains

Page 34

What is a Markov Chain?

Simple chain: one dice
-Each roll is the same
-A roll does not depend on the previous one

Markov chain: two dice
-You only use ONE dice at a time: the fair OR the loaded
-The dice you roll only depends on the previous roll

Page 35

What is a Markov Chain?

Biological sequences tend to behave like Markov chains.

Question/Example: is it possible to tell whether my sequence is a CpG island?

Page 37

What is a Markov Chain?

Question: identify CpG island sequences.

Old-fashioned solution:
-Slide a window of arbitrary size ('the Captain's height')
-Measure the % of CpG
-Plot it along the sequence
-Decide

Page 38

Sliding-Window Methods

[Figure: a window slid along the sequence; the measure is averaged within each window]

Page 39

What is a Markov Chain?

Question: identify CpG island sequences.

Bayesian solution:
-Build a CpG Markov chain
-Run the sequence through the chain
-What is the likelihood that the chain produces the sequence?

Page 40

[Figure: a four-state Markov chain over A, C, G, T, with a transition between every pair of states]

Transition probabilities, e.g. the probability of a transition from G to C:

A_GC = P(x_i = C | x_{i-1} = G)

Page 41

P(sequence) = P(x_L, x_{L-1}, x_{L-2}, …, x_1)

Remember: P(X, Y) = P(X | Y) * P(Y)

P(sequence) = P(x_L | x_{L-1}) * P(x_{L-1} | x_{L-2}) * … * P(x_1)

In the Markov chain, x_L only depends on x_{L-1}.

Page 42

P(sequence) = P(x_L | x_{L-1}) * P(x_{L-1} | x_{L-2}) * … * P(x_1)

With A_GC = P(x_i = C | x_{i-1} = G), this becomes:

P(sequence) = P(x_1) * Π_{i=2}^{L} A_{x_{i-1} x_i}
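The product formula can be coded directly. The table below reuses the CpG+ transition values shown later in the deck; the uniform P(x_1) = 0.25 is an assumption for illustration, not a value from the slides:

```python
# P(sequence) = P(x1) * prod_{i=2}^{L} A[x_{i-1}][x_i]
# A_PLUS[prev][next]: CpG+ transition probabilities from the deck's table.
A_PLUS = {
    "A": {"A": 0.18, "C": 0.27, "G": 0.42, "T": 0.12},
    "C": {"A": 0.17, "C": 0.36, "G": 0.27, "T": 0.18},
    "G": {"A": 0.16, "C": 0.33, "G": 0.37, "T": 0.12},
    "T": {"A": 0.08, "C": 0.35, "G": 0.38, "T": 0.18},
}

def chain_probability(seq, A, p_start=0.25):
    """Probability of seq under a first-order Markov chain."""
    p = p_start
    for prev, nxt in zip(seq, seq[1:]):
        p *= A[prev][nxt]
    return p

print(chain_probability("CGCG", A_PLUS))  # 0.25 * 0.27 * 0.33 * 0.27
```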

Page 43

[Figure: the A, C, G, T chain with an added Begin state B]

Arbitrary Begin and End states can be added to the chain. By convention, only the Begin state is added.

Page 44

[Figure: the A, C, G, T chain with Begin (B) and End (E) states]

Adding an End state, reached with a transition probability T, defines length probabilities:

P(a sequence of length L) = T * (1 - T)^(L-1)
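A quick numerical check of this geometric length distribution (T = 0.1 is an arbitrary illustrative value, not taken from the slides):

```python
# With an End state reached with probability T after each symbol,
# P(length = L) = T * (1 - T)**(L - 1): a geometric distribution
# whose mean length is 1/T.
T = 0.1
lengths = range(1, 500)
p = [T * (1 - T) ** (L - 1) for L in lengths]

total = sum(p)                                    # close to 1.0
mean = sum(L * pl for L, pl in zip(lengths, p))   # close to 1/T = 10
print(round(total, 6), round(mean, 3))
```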

Page 45

[Figure: the chain with B and E states]

The transitions are probabilities: the sum of the probabilities of all the possible sequences of all possible lengths is 1.

Page 46

Using Markov Chains To Predict

Page 47

What is a Prediction?

Given a sequence, we want to know the probability that this sequence is a CpG island.

1- We need a training set: CpG+ sequences and CpG- sequences
2- We measure the transition frequencies and treat them like probabilities

Page 48

What is a Prediction?

Is my sequence a CpG island?

2- We measure the transition frequencies and treat them like probabilities:

A⁺_GC = N⁺_GC / Σ_X N⁺_GX

= the ratio between the number of GC transitions and all the transitions G→X.

Transition GC: a G followed by a C, as in GCCGCTGCGCGA
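Counting transition frequencies on a training set is a short exercise; the sketch below uses the slide's example string:

```python
from collections import defaultdict

def transition_frequencies(sequences):
    """A[prev][next] = N(prev->next) / sum_X N(prev->X), counted on a
    training set and treated as maximum-likelihood probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
            for prev, row in counts.items()}

A = transition_frequencies(["GCCGCTGCGCGA"])
print(A["G"]["C"])  # 4 of the 5 G->X transitions are G->C: 0.8
```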

Page 49

What is a Prediction?

Is my sequence a CpG island?

2- We measure the transition frequencies and treat them like probabilities. In each table the column gives the previous nucleotide and the row the next one (columns sum to 1):

CpG+    A     C     G     T
  A   0.18  0.17  0.16  0.08
  C   0.27  0.36  0.33  0.35
  G   0.42  0.27  0.37  0.38
  T   0.12  0.18  0.12  0.18

CpG-    A     C     G     T
  A   0.30  0.32  0.25  0.17
  C   0.21  0.30  0.25  0.24
  G   0.28  0.08  0.30  0.29
  T   0.21  0.30  0.20  0.29

Page 50

What is a Prediction?

Is my sequence a CpG island?

3- Evaluate the probability for each model (CpG+ and CpG-, parameterised with the tables above) to generate our sequence:

P(seq | M+) = Π_{i=1}^{L} A⁺_{x_{i-1} x_i}

P(seq | M-) = Π_{i=1}^{L} A⁻_{x_{i-1} x_i}

Page 51

Using the Log Odds

Is my sequence a CpG island?

4- Measure the log odds:

S(seq) = (1/LEN) * log2 [ P(seq | M+) / P(seq | M-) ] = (1/LEN) * Σ_i log2 ( A⁺_{x_{i-1} x_i} / A⁻_{x_{i-1} x_i} )

The log odds confronts the two models. Log2 gives a value in bits (standard); dividing by LEN gives a less spread-out score distribution.
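The length-normalised log-odds score in code, using the two transition tables from the deck:

```python
import math

# CpG+ and CpG- transition tables, indexed as A[prev][next].
A_PLUS = {
    "A": {"A": 0.18, "C": 0.27, "G": 0.42, "T": 0.12},
    "C": {"A": 0.17, "C": 0.36, "G": 0.27, "T": 0.18},
    "G": {"A": 0.16, "C": 0.33, "G": 0.37, "T": 0.12},
    "T": {"A": 0.08, "C": 0.35, "G": 0.38, "T": 0.18},
}
A_MINUS = {
    "A": {"A": 0.30, "C": 0.21, "G": 0.28, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.17, "C": 0.24, "G": 0.29, "T": 0.29},
}

def log_odds(seq, a_plus, a_minus):
    """S(seq) = (1/LEN) * sum_i log2(A+[x_{i-1}][x_i] / A-[x_{i-1}][x_i])."""
    s = sum(math.log2(a_plus[p][n] / a_minus[p][n])
            for p, n in zip(seq, seq[1:]))
    return s / len(seq)

print(log_odds("CGCGCG", A_PLUS, A_MINUS) > 0)  # CpG-rich: True
print(log_odds("TATATA", A_PLUS, A_MINUS) < 0)  # AT-rich: True
```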

Page 52

Using the Log Odds

Is my sequence a CpG island?

4- Measure the log odds:

S(seq) = (1/LEN) * log2 [ P(seq | M+) / P(seq | M-) ] = (1/LEN) * Σ_i log2 ( A⁺_{x_{i-1} x_i} / A⁻_{x_{i-1} x_i} )

Positive: more likely than not to be CpG. Negative: more likely NOT to be CpG.

Page 53

Using the Log Odds

Is my sequence a CpG island?

5- Plot the score distribution.

[Histogram: number of sequences vs. score in bits, with 0 separating the two classes]

Page 54

Using the Log Odds

Is my sequence a CpG island?

5- Plot the score distribution.

[Histogram: number of sequences vs. score in bits, with the two classes overlapping]

Things can go wrong: a bad training set, or bad parameter estimation.

Page 55

Using the Log Odds

Is my sequence a CpG island?

-The Markov chain is a good discriminator.
-Problem: what to do with long sequences that are partly CpG and partly NON-CpG? How can we make a prediction nucleotide per nucleotide?
-We want to uncover the HIDDEN boundaries.

Page 56

Hidden Markov Models

Page 57

Hidden Markov Model: Switching Dice

-If you are cheating, you want to switch dice without telling!
-The MODEL switch is HIDDEN.

Simple chain: one dice
-Each roll is the same
-A roll does not depend on the previous one

Markov chain: two dice
-You only use ONE dice at a time: the fair OR the loaded
-The dice you roll only depends on the previous roll

Page 58

Using HMMs

Question: I want to find the CpG boundaries.

The chain had four symbols: A, G, C, T.
The model has eight states: A+, A-, G+, G-, C+, C-, T+, T-.

There is no one-to-one correspondence between symbols and states: the state of each symbol is hidden. An A can be in either A+ or A-.

Page 59

Using HMMs

Question: I want to find the CpG boundaries.

1- Define the model topology:

A+ G+ C+ T+
A- G- C- T-

EVERY transition is possible, but transitions such as C+ → G- cost more.

Page 60

Using HMMs

Question: I want to find the CpG boundaries.

2- Parameterise the model: count frequencies… (columns: previous nucleotide; rows: next nucleotide)

CpG+    A     C     G     T
  A   0.18  0.17  0.16  0.08
  C   0.27  0.36  0.33  0.35
  G   0.42  0.27  0.37  0.38
  T   0.12  0.18  0.12  0.18

CpG-    A     C     G     T
  A   0.30  0.32  0.25  0.17
  C   0.21  0.30  0.25  0.24
  G   0.28  0.08  0.30  0.29
  T   0.21  0.30  0.20  0.29

We also need the + to - (and - to +) switching probabilities.

Page 61

Using HMMs

Question: I want to find the CpG boundaries.

3- FORCE the model to emit your sequence: Viterbi.

One can use the model to emit any sequence. The sequence of states used is named a PATH (π) because it is a walk through the model:

G+ C+ G+ C+ T+ C+ C+ C- C- G- T- …

Page 62

Using HMMs: the path in the occasionally dishonest casino

Question: I want to find the CpG boundaries.

3- FORCE the model to emit your sequence: Viterbi.

Switch dice: Transition. A_{F,L} = P(π_i = L | π_{i-1} = F)

Roll the dice: Emission. The state L emits a symbol with a probability: P(emit 6 in L) = E_L(6) = P(x_i = 6 | π_i = L) = 0.5

Page 63

Two states: Fair and Loaded, each with six emissions.

Fair:   1: 0.16  2: 0.16  3: 0.16  4: 0.16  5: 0.16  6: 0.16
Loaded: 1: 0.10  2: 0.10  3: 0.10  4: 0.10  5: 0.10  6: 0.50

Page 64

Fair:   1: 0.16  2: 0.16  3: 0.16  4: 0.16  5: 0.16  6: 0.16
Loaded: 1: 0.10  2: 0.10  3: 0.10  4: 0.10  5: 0.10  6: 0.50

The emissions of each state come with their probabilities: P(emit 6 in L) = E_L(6) = P(x_i = 6 | π_i = L) = 0.5

Switch dice: Transition, A_{F,L} = P(π_i = L | π_{i-1} = F). Roll the dice: Emission.

Page 65

A+  A-  G+  G-  C+  C-  T+  T-

8 STATES, 1 EMISSION per state

Page 66

Using HMMs

Question: I want to find the CpG boundaries.

3- FORCE the model to emit your sequence: Viterbi.

The path:
-goes from state to state with a probability: A_{G+,C+} = P(π_i = C+ | π_{i-1} = G+)
-in each state it EMITS a symbol, here with probability 1: P(emit G) = E_{G+}(G) = P(x_i = G | π_i = G+) = 1

Page 67

Using HMMs

Question: I want to find the CpG boundaries.

3- FORCE the model to emit your sequence: Viterbi.

We are interested in the joint probability of the PATH π (the chain of states G+, C-, …) and our sequence X:

P(X, π) = A_{0,π1} * Π_{i=1}^{L} E_{πi}(x_i) * A_{πi,πi+1}

Page 68

Using HMMs

Question: I want to find the CpG boundaries.

3- FORCE the model to emit your sequence: Viterbi.

P(X, π) = A_{0,π1} * Π_{i=1}^{L} E_{πi}(x_i) * A_{πi,πi+1}

Example, with π = C+ G- C- G+ and X = C G C G (all emissions are 1):

P(X, π) = A_{0,C+} * 1 * A_{C+,G-} * 1 * A_{G-,C-} * 1 * A_{C-,G+} * 1

Page 69

Using HMMs

Question: I want to find the CpG boundaries.

3- FORCE the model to emit your sequence: Viterbi.

The probability of one arbitrary path, e.g. A_{0,C+} * 1 * A_{C+,G-} * 1 * A_{G-,C-} * 1 * A_{C-,G+} * 1, is NOT a prediction. To make a prediction we must identify the best-scoring path:

π* = argmax_π P(X, π)

Page 70

Using HMMs

Question: I want to find the CpG boundaries.

3- FORCE the model to emit your sequence: Viterbi.

To make a prediction we must identify the best-scoring path:

π* = argmax_π P(X, π)

We do this recursively with the VITERBI algorithm.

Page 71

[Figure: the Viterbi trellis. For each symbol of the sequence G C G A G C there is a column with the eight states A+…T-; at every column, each state keeps its best-scoring incoming path (e.g. G+ C+ G- …)]

Page 72

[Figure: the filled trellis for the sequence G C G A G C]

Trace back: the best path is G+ C+ G- A- G- C-

Page 73

The Viterbi algorithm (k and l are states; V_k(i) is the score of the best path for x_1…x_i that finishes in state k at position i):

Initiation: V_0(0) = 1, V_k(0) = 0 for every k > 0

Recursion (i = 1…L):
V_l(i) = E_l(x_i) * max_k ( V_k(i-1) * A_kl )
ptr_i(l) = argmax_k ( V_k(i-1) * A_kl )

Termination: P(X, π*) = max_k ( V_k(L) * A_k0 )

Page 74

Initiation: V_0(0) = 1, V_k(0) = 0 for every k > 0
Recursion (i = 1…L): V_l(i) = E_l(x_i) * max_k ( V_k(i-1) * A_kl )

Multiplying probabilities can cause an underflow problem. Usually, probability multiplications are replaced with log additions:

log(a * b) = log(a) + log(b)
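A sketch of the recursion above: a log-space Viterbi for the occasionally dishonest casino. The emission probabilities and the 0.99/0.01 start follow the slides; the dice-switching probabilities (0.95/0.05 and 0.90/0.10) are illustrative assumptions:

```python
import math

STATES = ("F", "L")                                  # Fair, Loaded
A = {"F": {"F": 0.95, "L": 0.05},                    # transition probas
     "L": {"F": 0.10, "L": 0.90}}                    # (assumed values)
E = {"F": {r: 1 / 6 for r in "123456"},              # fair emissions
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}   # loaded emissions
START = {"F": 0.99, "L": 0.01}                       # P(D1), P(D2)

def viterbi(rolls):
    """Most probable state path, with log additions to avoid underflow."""
    V = [{k: math.log(START[k] * E[k][rolls[0]]) for k in STATES}]
    ptr = []
    for x in rolls[1:]:
        col, back = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: V[-1][k] + math.log(A[k][l]))
            col[l] = V[-1][best] + math.log(A[best][l]) + math.log(E[l][x])
            back[l] = best
        V.append(col)
        ptr.append(back)
    state = max(STATES, key=lambda k: V[-1][k])      # best final state
    path = [state]
    for back in reversed(ptr):                       # trace back
        state = back[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("666666"))  # FLLLLL: the run of sixes is decoded as Loaded
```

Replacing the log additions with plain products reproduces the probability-space version, which underflows on long roll sequences.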

Page 75

Using HMMs

Question: I want to know the probability of my sequence given the model.

In theory, you must sum over ALL the possible paths. In practice, π* is a good approximation.

Page 76

Using HMMs

Question: I want to know the probability of my sequence given the model.

π* is a good approximation, but… the Forward algorithm gives the exact value of P(X).

Page 77

Viterbi (k and l are states):
Initiation: V_0(0) = 1, V_k(0) = 0 for every k > 0
Recursion (i = 1…L): V_l(i) = E_l(x_i) * max_k ( V_k(i-1) * A_kl )
Termination: P(X, π*) = max_k ( V_k(L) * A_k0 )

Forward:
Initiation: F_0(0) = 1, F_k(0) = 0 for every k > 0
Recursion (i = 1…L): F_l(i) = E_l(x_i) * Σ_k ( F_k(i-1) * A_kl )
Termination: P(X) = Σ_k ( F_k(L) * A_k0 )
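The Forward variant differs from Viterbi only in replacing max by Σ. A self-contained sketch with the casino parameters (the switching probabilities are again assumed values):

```python
STATES = ("F", "L")                                  # Fair, Loaded
A = {"F": {"F": 0.95, "L": 0.05},                    # assumed transitions
     "L": {"F": 0.10, "L": 0.90}}
E = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.99, "L": 0.01}

def forward(rolls):
    """Exact P(x): Viterbi's max over k is replaced by a sum over k."""
    f = {k: START[k] * E[k][rolls[0]] for k in STATES}
    for x in rolls[1:]:
        f = {l: E[l][x] * sum(f[k] * A[k][l] for k in STATES)
             for l in STATES}
    return sum(f.values())          # no explicit End state here

# sanity check: the probabilities of all two-roll sequences sum to 1
total = sum(forward(a + b) for a in "123456" for b in "123456")
print(round(total, 6))  # 1.0
```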

Page 78

Viterbi:
Initiation: V_0(0) = 1, V_k(0) = 0 for every k > 0
Recursion (i = 1…L): V_l(i) = E_l(x_i) * max_k ( V_k(i-1) * A_kl )
Termination: P(X, π*) = max_k ( V_k(L) * A_k0 )

Forward:
Initiation: F_0(0) = 1, F_k(0) = 0 for every k > 0
Recursion (i = 1…L): F_l(i) = E_l(x_i) * Σ_k ( F_k(i-1) * A_kl )
Termination: P(X) = Σ_k ( F_k(L) * A_k0 )

[Figure: two trellis columns, e.g. reaching states G+ and G-: Viterbi keeps the Max over the incoming transitions, Forward sums them]

Page 79

Posterior Decoding of Hidden Markov Models

Page 80

Why Posterior Decoding?

-Viterbi is BRUTAL!!!! It does not associate individual predictions with a probability.

Question: what is the probability that nucleotide 1300 really is a CpG boundary?

ANSWER: the Backward algorithm.

Page 81

Posterior Decoding

Question: what is the probability that nucleotide 1300 really is a CpG boundary?

P(X, π_i = l): the probability of sequence X WITH position i in state l.

Page 82

Posterior Decoding

P(x, π_i = l) = P(x_1…x_i, π_i = l) * P(x_{i+1}…x_L | π_i = l)

The first factor is computed by the Forward algorithm, the second by the Backward algorithm.

Page 83

Forward:
Initiation: F_0(0) = 1, F_k(0) = 0 for every k > 0
Recursion (i = 1…L): F_l(i) = E_l(x_i) * Σ_k ( F_k(i-1) * A_kl )
Termination: P(X) = Σ_k ( F_k(L) * A_k0 )

Backward:
Initiation: B_k(L) = A_k0 for every k
Recursion (i = L-1…1): B_k(i) = Σ_l ( A_kl * E_l(x_{i+1}) * B_l(i+1) )
Termination: P(X) = Σ_l ( A_0l * E_l(x_1) * B_l(1) )

Page 84

Forward recursion (i = 1…L): F_l(i) = E_l(x_i) * Σ_k ( F_k(i-1) * A_kl )
Backward recursion (i = L-1…1): B_k(i) = Σ_l ( A_kl * E_l(x_{i+1}) * B_l(i+1) )

P(π_i = l, X) = F_l(i) * B_l(i), and P(π_i = l, X) = P(π_i = l | X) * P(X), so:

P(π_i = l | X) = F_l(i) * B_l(i) / P(X)

where P(X) is given equivalently by the Forward termination or the Backward termination.
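Putting Forward and Backward together for the casino model. There is no explicit End state here, so B_k(L) = 1 is used instead of A_k0; the switching probabilities are the same assumed values as before:

```python
STATES = ("F", "L")                                  # Fair, Loaded
A = {"F": {"F": 0.95, "L": 0.05},                    # assumed transitions
     "L": {"F": 0.10, "L": 0.90}}
E = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
START = {"F": 0.99, "L": 0.01}

def posterior(rolls):
    """P(pi_i = l | x) = F_l(i) * B_l(i) / P(x), for every position i."""
    n = len(rolls)
    F = [{k: START[k] * E[k][rolls[0]] for k in STATES}]
    for x in rolls[1:]:
        F.append({l: E[l][x] * sum(F[-1][k] * A[k][l] for k in STATES)
                  for l in STATES})
    B = [None] * n
    B[n - 1] = {k: 1.0 for k in STATES}      # no explicit End state
    for i in range(n - 2, -1, -1):
        B[i] = {k: sum(A[k][l] * E[l][rolls[i + 1]] * B[i + 1][l]
                       for l in STATES) for k in STATES}
    px = sum(F[n - 1][k] for k in STATES)    # Forward termination
    return [{k: F[i][k] * B[i][k] / px for k in STATES} for i in range(n)]

post = posterior("1666666")
print(post[-1]["L"] > 0.5)  # True: a long run of sixes looks Loaded
```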

Page 85

P(π_i = l | X) is computed position by position: free from the sliding window of arbitrary size!!!!

Page 86

P(π_i = l | X)

Posterior decoding is less sensitive to the parameterisation of the model.

Page 87

Training HMMs

Page 88

Training HMMs?

Case 1 - A set of annotated data: parameters can be estimated on this data, where the PATH is known.

Case 2 - NO annotated data, only a model:
-Parameterise the model so that P(Model | Data) is maximal
-Start with random parameters
-Iterate using Baum-Welch, Viterbi training or EM

Page 89

Training HMMs?

Difficult!!!!

Page 90

What Matters About Hidden Markov Models

Page 91

HMMs and Markov Chains; Bayes Theorem

-Markov Chain: when there is no hidden state
-Hidden Markov Model: when a nucleotide can be in different HIDDEN states

Page 92

Three Algorithms for HMMs

Viterbi: makes the state assignments (the prediction)
Forward: evaluates the sequence probability under the considered model
Backward and Posterior Decoding: evaluate the probability of the prediction, window-free

Page 93

Applications of HMMs

Page 94

What To Do with an HMM?

Transmembrane domain predictions

www.cbs.dtu.dk/services/TMHMM/

Page 95

What To Do with an HMM?

RNA Structure Prediction / Fold Recognition

SCFG: Stochastic Context-Free Grammars (Sean Eddy)

Page 96

What To Do with an HMM?

Gene Prediction

State-of-the-art methods use HMMs:

GeneMark: Prokaryotes
GenScan: Eukaryotes

Page 97

GeneMark

Page 98

A Typical HMM for Coding DNA

[Figure: Start state S → amino-acid states (64 codons in total) → End state E. Each state emits its codons with their usage frequencies, e.g. G (Gly): GGG 0.02, GGA 0.00, GGT 0.60, GGC 0.38; W (Trp): TGG 1.00]

Page 99

A Typical HMM for Coding DNA

Emission: codon frequency. Transition: dipeptide frequency.

Page 100

GeneMark HMM

An HMM of order 5: the 6th nucleotide depends on the 5 previous ones.

P(GGG-TGG | Model) = P(GGG) * P(GGG → TGG) * P(TGG)

This takes into account codon bias AND dipeptide composition.

Page 101

What To Do with an HMM?

Family and Domain Identification

Pfam, SMART, Prosite Profiles

Page 102

What To Do with an HMM?

Bayesian Phylogenetic Inference

[Figure: a phylogenetic tree relating the chite, wheat, trybr and mouse sequences]

morphbank.ebc.uu.se/mrbayes/manual.php

Page 103

What To Do with an HMM?

Metabolic Networks: Bayesian Networks

www.cs.huji.ac.il/~nirf/

Page 104

Collections of Domain HMMs

Page 105

What is a Domain HMM ?

SAM, HMMER, PFtools

Page 106

Emission Probabilities

Page 107

Using Domain HMMs

Question: I want to compare my HMM with all the sequences in SwissProt.

Very similar to dynamic programming; requires an adapted Viterbi: the Pair-HMM.

Page 108

Using Domain HMMs

Question: what are the available collections of pre-computed HMMs?

InterPro unites many collections.

Page 109

InterPro: The Idea of Domains

Page 110

InterPro: A Federation of Databases

Page 111

Using InterPro: Asking a question

Which Domains does the oncogene FosB contain?

Page 112

Using InterPro: Asking a question

Page 113

Using InterPro: Asking a question

Page 114

Finding Domains

-How can I be sure that the domain prediction for my protein is real?

Use the EMBnet pfscan.

Page 115

Using EMBNet PFscan

Page 116

Posterior Decoding With EMBNet PFscan

An important position that is well conserved in our sequence.

Page 117

Posterior

Prior

Page 118

The Inside of Pfam

Page 119

A Typical pfam Domain

Page 120

A Typical pfam Domain

HMMER Package:

Page 122

Going Further: Building and Using HMMs

Page 123

HMMer2: hmmer.wustl.edu/ (used to create and distribute Pfam)

PFtools: www.isrec.isb-sib.ch/ftp-server/pftools/ (used to create and distribute Prosite)

SAM T02: www.cse.ucsc.edu/research/compbio/sam.html

Page 124

EMBOSS Online

www.hgmp.mrc.ac.uk/SOFTWARE/EMBOSS

Jemboss: a Java applet interacting with an EMBOSS server

Page 126

HMMer

Page 128

EMBASSY (HMMer)

Page 130

In The End: Markov Uncovered

Page 131

HMM and Markov Chains

Domain Collections

Gene Prediction

Bayesian Phylogenetic Inference

[Figure: the phylogenetic tree of chite, wheat, trybr and mouse]

Page 132

HMM and Markov Chains

Domain Collections

Profile HMMs, Generalized Profiles

Interactive Tools