1
Hidden Markov Models
based on chapters from the book
Durbin, Eddy, Krogh and Mitchison
Biological Sequence Analysis
Shamir’s lecture notes
and Rabiner’s tutorial on HMM
2
music recognition
deal with variations in
- pitch
- timing
- timbre
- …
3
Stock Market Prediction
• Actual value versus forecasted value for Tata Steel, in Rupees, over the period 5-9-2009 – 23-9-2011.
• Variations of value over time.
• From: A. Gupta, B. Dhingra, Stock Market Prediction Using Hidden Markov Models, 2011.
4
Activity Tracking
Activities:
• Walking
• Running
• Cycling
• stair climbing
• sleeping, etc.
5
application: gene finding
deal with variations in
- actual sound → actual base (match/substitutions)
- timing → insertions/deletions
6
Basic Questions
Given:
• A sequence of “observations”
• A probabilistic model of our “domain”
Questions:
• Does the given sequence belong to a certain family?
  – Markov chains
  – Hidden Markov Models (HMMs)
• Can we say something about the internal structure of the sequence? (indirect observations)
  – Hidden Markov Models (HMMs)
7
Introduction Markov Chain Model
Characteristics
• Discrete time
• Discrete space
• No state history – present state only
• States and transitions
Notations:
P(X) probability for event X
P(X,Y) event X and event Y
P(X|Y) event X given event Y
[Figure: example chain with states A, B, C and transition probabilities 0.4, 0.3, 0.3, 0.2, 0.8, 1]
Discrete vs Continuous
8
Definition of Markov Chain Model
• A Markov chain[1] model is defined by
– a set of states
• some states emit a symbol (unique per state)
• other states (e.g., the begin state) are silent
– a set of transitions with associated probabilities
• the transitions going out of a given state define a distribution over the
possible next states (i.e., all positive, and sum equals 1)
[1] A. A. Markov, "Extension of the law of large numbers to quantities depending on each other" (in Russian). Izvestiya of the Physico-Mathematical Society at Kazan University, 2nd series, Vol. 15 (1906), pp. 135-156.
9
Markov Model
Markov Model M = (Q,P,T), with
• Q the set of states
• P the set of initial probabilities px for each state x in Q
• T = (txy) the transition probabilities matrix/graph, with txy the probability of the transition from state x to state y.
This is a first order Markov Model:
no history is modeled
An observation X is a sequence of states:
X = x1x2 … xn
The probability of an observation X given the model M is equal to:

  P(X | M) = p_{x1} · ∏_{i=2..n} t_{x_{i-1} x_i}

[Figure: Markov model M with states A, B, C and transition probabilities t_AA, t_AC, …]
10
A Markov Chain Model Example
• Transition
probabilities
– Pr(x_i = a | x_{i-1} = g) = 0.16
– Pr(x_i = c | x_{i-1} = g) = 0.34
– Pr(x_i = g | x_{i-1} = g) = 0.38
– Pr(x_i = t | x_{i-1} = g) = 0.12

  Σ_x Pr(x_i = x | x_{i-1} = g) = 1   (sum over all possible next states x)
11
The Probability of a Sequence for a Markov Chain Model
Pr(CGGT)=Pr(C)Pr(G|C)Pr(G|G)Pr(T|G)
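This factorization maps directly onto a few lines of code. A minimal sketch (not from the lecture): the transition values are the '+' (CpG island) frequencies shown later on slide 21, and the uniform initial distribution is an assumption made here for illustration.

```python
# Probability of a DNA sequence under a first-order Markov chain:
# Pr(x) = Pr(x1) * prod_i Pr(x_i | x_{i-1})
trans_plus = {
    'A': {'A': 0.180, 'C': 0.274, 'G': 0.426, 'T': 0.120},
    'C': {'A': 0.171, 'C': 0.368, 'G': 0.274, 'T': 0.188},
    'G': {'A': 0.161, 'C': 0.339, 'G': 0.375, 'T': 0.125},
    'T': {'A': 0.079, 'C': 0.355, 'G': 0.384, 'T': 0.182},
}
init = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}     # assumed uniform start

def chain_prob(seq, init, trans):
    """Pr(seq) under a first-order Markov chain."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(chain_prob("CGGT", init, trans_plus))   # 0.25 * 0.274 * 0.375 * 0.125
```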
12
Markov Chains: Another Example
[Figure: two Markov chain models over states A, B, C.
 M1: A→A 0.7, A→B 0.3, B→B 0.2, B→C 0.8, C→C 0.6, C→A 0.4.
 M2: A→A 0.6, A→B 0.4, B→B 0.3, B→C 0.6, C→C 0.5, and remaining transitions 0.5 and 0.1.]

Unique starting state A; for M1:

  Q = { A, B, C }
  P = ( 1, 0, 0 )
          A    B    C
  T =  A  0.7  0.3  0
       B  0    0.2  0.8
       C  0.4  0    0.6

Observation: AABBCCC

  P( AABBCCC | M1 ) = 1 · 0.7 · 0.3 · 0.2 · 0.8 · 0.6 · 0.6 = 1.2 × 10^-2
  P( AABBCCC | M2 ) = 1 · 0.6 · 0.4 · 0.3 · 0.6 · 0.5 · 0.5 = 1.1 × 10^-2
13
Markov Models: Properties
Given some sequence x of length L, we can ask:
How probable is the sequence x given our model M?
• For any probabilistic model of sequences, we can
write this probability as
• key property of a (1st order) Markov chain: the
probability of each xi depends only on the value of
xi-1
  Pr(x) = Pr(x_L, x_{L-1}, …, x_1)
        = Pr(x_L | x_{L-1}, …, x_1) · Pr(x_{L-1} | x_{L-2}, …, x_1) · … · Pr(x_1)

For a (1st order) Markov chain this simplifies to:

  Pr(x) = Pr(x_1) · Pr(x_2 | x_1) · Pr(x_3 | x_2) · … · Pr(x_L | x_{L-1})
        = Pr(x_1) · ∏_{i=2..L} Pr(x_i | x_{i-1})
14
Markov Model: Underflow Problem
[Figure: Markov model M with states A, B, C, an added initial state 0 (transitions t_0A, t_0B, t_0C ~ initial probabilities) and a final state (not depicted)]

small values: underflow
• initial state x_0 fixed ~ initial probabilities
• final state [not depicted]
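A common remedy, hinted at by this slide, is to work with log probabilities so that a product of many small factors becomes a sum. A small sketch with toy transition values (purely illustrative, not lecture data):

```python
import math

trans = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.2, 'B': 0.8}}   # toy values
init = {'A': 0.5, 'B': 0.5}

def chain_log_prob(seq, init, trans):
    """Sum of log probabilities instead of a product of probabilities."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

seq = "A" * 1000 + "B"
print(chain_log_prob(seq, init, trans))   # about -108; the plain product is ~1e-47
# For genome-length sequences the plain product underflows double precision,
# while the log sum stays well-behaved.
```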
15
Markov Model: Comparing Models
M1
M2
Given: two models M1, M2 and an observation X.
Question: which model best explains X?

We can calculate:   P(X | M1)  vs.  P(X | M2)
We want to know:    P(M1 | X)  vs.  P(M2 | X)  !!

Bayes Rule: P(A|B) = P(B|A) P(A) / P(B), so:

  P(M1|X) / P(M2|X) = [ P(X|M1) P(M1) ] / [ P(X|M2) P(M2) ]
16
motto
bases are not random
17
Motivation for Markov Models in Computational Biology
• There are many cases in which we would like to
represent the statistical regularities of some
class of sequences
– genes
– various regulatory sites in DNA (e.g., where RNA
polymerase and transcription factors bind)
– proteins in a given family
• Markov models are well suited to this type of task
18
Markov Chain: An Example Application
• CpG islands
– CG di-nucleotides are rarer in eukaryotic genomes than expected
given the marginal probabilities of C and G
– but the regions upstream of genes (reading is from 5’ to 3’) are
richer in CG di-nucleotides than elsewhere – so called CpG islands
– useful evidence for finding genes
• Application: Predict CpG islands with Markov chains
– a Markov chain to represent CpG islands
– a Markov chain to represent the rest of the genome
19
Markov Chains for Discrimination
• Suppose we want to distinguish CpG islands
from other sequence regions
• Given sequences from CpG islands, and
sequences from other regions, we can construct
– a model to represent CpG islands
– a null model to represent the other regions
• We can then score a test sequence X by:
  score(X) = log [ Pr(X | CpG model) / Pr(X | null model) ]
20
Markov Chains for Discrimination
As before we can use the scoring function:

  score(X) = log [ Pr(X | CpG model) / Pr(X | null model) ]

• Because according to Bayes' rule we have:

  Pr(CpG | X) = Pr(X | CpG) Pr(CpG) / Pr(X)
  Pr(null | X) = Pr(X | null) Pr(null) / Pr(X)

• If we do not take the prior probabilities Pr(CpG) and Pr(null) of the two classes into account, then from Bayes' rule it is clear that we just need to compare Pr(X | CpG) and Pr(X | null), as is done in our scoring function score().
21
Markov Chain Application: CpG islands
Observed transition frequencies:

island ('+' model):
  +     A      C      G      T
  A   0.180  0.274  0.426  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.375  0.125
  T   0.079  0.355  0.384  0.182

non-island ('−' model):
  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292

In general, consecutive C→G pairs are rare, although 'islands' of them occur in signal regions, e.g., promoter regions.
22
basic questions
observation: DNA sequence
model 1: CpG islands
model 2: non-islands
• does this sequence belong to a certain family?
Markov chains
is this a CpG island (or not)?
• can we say something about the internal structure?
Markov Chains: windowing
where are the CpG islands?
23
application: CpG islands
island ('+'):                             non-island ('−'):
  +     A      C      G      T              -     A      C      G      T
  A   0.180  0.274  0.426  0.120            A   0.300  0.205  0.285  0.210
  C   0.171  0.368  0.274  0.188            C   0.322  0.298  0.078  0.302
  G   0.161  0.339  0.375  0.125            G   0.248  0.246  0.298  0.208
  T   0.079  0.355  0.384  0.182            T   0.177  0.239  0.292  0.292

X = ACGT   (transitions A→C, C→G, G→T)

  score = (0.274 · 0.274 · 0.125) / (0.205 · 0.078 · 0.208) = 2.82

Note: a score > 1 is an indication of a CpG island.
24
application: CpG islands
log-score (log2)

X = ACGT

  log2 [ (0.274 · 0.274 · 0.125) / (0.205 · 0.078 · 0.208) ] = 0.42 + 1.81 − 0.73 = 1.50

  LLR     A      C      G      T
  A    -0.74   0.42   0.58  -0.80
  C    -0.91   0.30   1.81  -0.69
  G    -0.62   0.46   0.33  -0.73
  T    -1.17   0.57   0.39  -0.68

LLR = Log-Likelihood Ratio, e.g. LLR(C→G) = log2(0.274 / 0.078) = 1.81
25
CpG Log-Likelihood Ratio
  LLR     A      C      G      T
  A    -0.74   0.42   0.58  -0.80
  C    -0.91   0.30   1.81  -0.69
  G    -0.62   0.46   0.33  -0.73
  T    -1.17   0.57   0.39  -0.68
LLR(ACGT) = 0.42+1.81–0.73 = 1.50
• is a (short) sequence a CpG island ?
compare with observed data (normalized for length)
• where (in long sequence) are CpG islands ?
first approach: sliding window
• ! What would be the length of window?
( 1.50 / 4 = 0.375 ≈ 0.37 'bits' per base )
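A sketch of both ideas on this slide: the length-normalized log2 score and the sliding-window first approach. The LLR values are the table above; the window length and any decision threshold are illustrative choices, not values fixed by the lecture.

```python
# log2 likelihood-ratio table, LLR[prev][cur] = log2( p_plus / p_minus )
llr = {
    'A': {'A': -0.74, 'C': 0.42, 'G': 0.58, 'T': -0.80},
    'C': {'A': -0.91, 'C': 0.30, 'G': 1.81, 'T': -0.69},
    'G': {'A': -0.62, 'C': 0.46, 'G': 0.33, 'T': -0.73},
    'T': {'A': -1.17, 'C': 0.57, 'G': 0.39, 'T': -0.68},
}

def llr_score(seq):
    """Length-normalized log2 likelihood ratio ('bits' per base)."""
    s = sum(llr[p][c] for p, c in zip(seq, seq[1:]))
    return s / len(seq)

print(llr_score("ACGT"))            # (0.42 + 1.81 - 0.73) / 4 = 0.375

def window_scores(seq, w=100):
    """First approach: score every sliding window of length w."""
    return [(i, llr_score(seq[i:i + w])) for i in range(len(seq) - w + 1)]
```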
26
empirical data
• is a (short) sequence a CpG island ?
compare with observed data (normalized for length)
[Figure: histograms of length-normalized scores for CpG islands vs. non-CpG sequences]
27
CpGplot
ACCGATACGATGAGAATGAGCAATGTAGTGAATCGTTTCAGCTACTCTCTATCGTAGCATTACTATGCAGTCAGTGATGCGCGCTAGCCGCGTAGCTCGCGGTCGCATCGCTGGCCGTAGCTGCGTACGATCTGCTGTACGCTGATCGGAGCGCTGCATCTCAACTGACTCATACTCATATGTCTACATCATCATCATTCATGTCAGTCTAGCATACTATTATCGACGACTGATCGATCTGACTGCTAGTAGACGTACCGAGCCAGGCATACGACATCAGTCGACT
• where (in long sequence) are CpG islands ?
first approach: sliding window
28
CpGplot
[CpGplot output: plots of the observed vs. expected CG ratio, the C+G percentage, and the putative islands along the sequence]

Islands of unusual CG composition
EMBOSS_001 from 1 to 286
  Observed/Expected ratio > 0.60
  Percent C + Percent G > 50.00
  Length > 50
  Length 114 (51..164)

Window size 100
C and G contents => expected CG occurrences
%C + %G
A set of 10 windows fulfilling the thresholds is required before an island is called
29
Some Notes on: Higher Order Markov Chains
• The Markov property specifies that the probability of a state depends only
on the previous state
• But we can build more “memory” into our states by using a higher order
Markov model
• In an n-th order Markov model
The probability of the current state depends on the previous n states.
  Pr(x_i | x_{i-1}, x_{i-2}, …, x_1) = Pr(x_i | x_{i-1}, …, x_{i-n})
30
Selecting the Order of a Markov Chain Model
• But the number of parameters we need to estimate for an n-th order Markov model grows exponentially with the order
  – for modeling DNA we need O(4^(n+1)) parameters (# of state transitions) for an n-th order model
• The higher the order, the less reliable we can expect our parameter estimates to be
  – estimating the parameters of a 2nd order Markov chain from the complete genome of E. coli (5.44 × 10^6 bases), we would see each (length 3) word ~85,000 times on average (divide by 4^3)
  – estimating the parameters of a 9th order chain, we would see each (length 10) word ~5 times on average (divide by 4^10 ≈ 10^6)
31
Higher Order Markov Chains
• An n-th order Markov chain over some alphabet A is
equivalent to a first order Markov chain over the alphabet
of n-tuples: An
• Example: a 2nd order Markov model for DNA can be
treated as a 1st order Markov model over alphabet
AA, AC, AG, AT
CA, CC, CG, CT
GA, GC, GG, GT
TA, TC, TG, TT
Transition probabilities:
P(A|AA) , P(A| AC), etc.
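A small sketch of this equivalence (an illustration, not lecture code): a 2nd-order chain over {A, C, G, T} rewritten as a 1st-order chain whose states are dinucleotides, so that P(A | AC) becomes the probability of moving from state "AC" to state "CA".

```python
from itertools import product

ALPHABET = "ACGT"
states = ["".join(p) for p in product(ALPHABET, repeat=2)]
print(len(states))                      # 16 dinucleotide states

def lift_to_first_order(second_order):
    """second_order[xy][z] = P(z | x,y)  ->  first_order[xy][yz]."""
    first_order = {}
    for xy, dist in second_order.items():
        # the new state keeps the last symbol of the old state plus the emitted one
        first_order[xy] = {xy[1] + z: p for z, p in dist.items()}
    return first_order

toy = {"AC": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1}}   # toy numbers
print(lift_to_first_order(toy))
# {'AC': {'CA': 0.4, 'CC': 0.2, 'CG': 0.3, 'CT': 0.1}}
```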
32
A Fifth Order Markov Chain Equivalent
Pr(GCTACA)=Pr(GCTAC)Pr(A|GCTAC)
33
hidden Markov model
Where (in long sequence) are CpG islands?
• first approach: Markov Chains + windowing
• second approach: hidden Markov model
34
Hidden Markov Model: A Simple HMM
Given observed sequence AGGCT, which state emits which item?
Model 1 Model 2
35
Another example: Eddy (2004)
A (toy) HMM for 5’ splice site recognition.
Figure from: What is a hidden Markov model?
Sean R Eddy. Nature Biotechnology 22, 1315 - 1316 (2004)
probability of a path
P( π_i = '5' | X ): posterior decoding P( π_i = q | X ),
i.e., given sequence X, what is the probability that
the i-th state is equal to q (here, the 5’ splice-site state).
36
Example: weather
[Figure: weather HMM with hidden pressure states H, M, L and observed weather symbols R (rain), C (cloudy), S (sunny)]

initial probabilities:
  pH = 0.4, pM = 0.2, pL = 0.4

transition probabilities:
        H    M    L
  H    0.6  0.3  0.1
  M    0.4  0.2  0.4
  L    0.1  0.5  0.4

emission probabilities ( R, C, S ):
  H: (0.1, 0.2, 0.7)
  M: (0.3, 0.4, 0.3)
  L: (0.6, 0.3, 0.1)

observed weather vs. (hidden) pressure
37
Example: weather
[Figure: the same weather HMM; emission probabilities for ( R, C, S ): H (0.1, 0.2, 0.7), M (0.3, 0.4, 0.3), L (0.6, 0.3, 0.1); initial probabilities pH = 0.4, pM = 0.2, pL = 0.4]

Given a path, using emission probabilities only (each factor written ×10):

  P( RCCSS | HHHHH ) = 1·2·2·7·7 = 196 (×10^-5)
  P( RCCSS | MMMMM ) = 3·4·4·3·3 = 432 (×10^-5)

Joint probability, using initial, transition and emission probabilities (each factor written ×10):

  P( RCCSS, HHHHH ) = 4·1·6·2·6·2·6·7·6·7 ≈ 1016 (×10^-7)
  P( RCCSS, MMMMM ) = 2·3·2·4·2·4·2·3·2·3 ≈ 14 (×10^-7)
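A sketch reproducing the four numbers on this slide. The emission and initial probabilities are read off the figure; the transition matrix is the one that also reproduces the forward computation on slide 45.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}

def emission_prob(obs, path):
    """P(obs | path): emission probabilities only."""
    p = 1.0
    for o, q in zip(obs, path):
        p *= emit[q][o]
    return p

def joint_prob(obs, path):
    """P(obs, path): initial, transition and emission probabilities."""
    p = init[path[0]] * emit[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
    return p

print(emission_prob("RCCSS", "HHHHH"))   # 1.96e-3  (196 x 10^-5)
print(emission_prob("RCCSS", "MMMMM"))   # 4.32e-3  (432 x 10^-5)
print(joint_prob("RCCSS", "HHHHH"))      # ~1.02e-4 (1016 x 10^-7)
print(joint_prob("RCCSS", "MMMMM"))      # ~1.4e-6  (  14 x 10^-7)
```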
38
CpG islands ctd.
[Transition tables for the '+' (island) and '−' (non-island) models as on slide 21]

8 states: A+, C+, G+, T+ vs. A−, C−, G−, T−
'+' denotes CpG island, '−' denotes non-CpG island
unique observation (emitted nucleotide) per state
8×8 = 64 transitions!

[Figure: the four '+' states and the four '−' states; transitions within the island block are scaled by p (e.g. 0.180·p), within the non-island block by q; leaving the island carries probability (1−p)/4 per target state, entering it 1−q]
39
hidden Markov model
model M = (Σ, Q, T)
• alphabet Σ of observed symbols
• states Q
• transition probabilities t_pq
• emission probabilities e_q(x)

observation: we observe the emitted symbols only; the states are observed indirectly, they are 'hidden'

probability of an observation given the model?
? there may be many state sequences producing the same observation

[Figure: states A, B, C (the underlying hidden process) with transitions t_AA, t_AC, …; state A emits the symbols x, y (what we see) with probabilities e_A(x), e_A(y)]
40
HMM main questions
Given HMM M (with transition probabilities t_pq) and an observation X:
• probability of the observation?
• most probable state sequence?
• how to find the parameters of the model M? training
41
Three Important Questions (see also L.R. Rabiner (1989))
• How likely is a given sequence?
– The Forward algorithm (probability over all paths)
• What is the most probable “path” for
generating a given sequence?
– The Viterbi algorithm
• How can we learn the HMM parameters given
a set of sequences?
– The Forward-Backward (Baum-Welch) algorithm
42
probability … !
Given sequence X: most probable state vs. optimal path
* most probable state (over all state sequences)
posterior decoding
using forward & backward probabilities
* most probable path (= single state sequence)
Viterbi
probability of a state
[Figure: small example HMM with states start, s1, s2, end and transition probabilities 1, 0.4, 0.6, 0.7, 0.3, 0.4, 0.6, 0.5, 0.5, 1; it illustrates that the most probable state at a position need not lie on the single most probable path]
43
The Forward Algorithm:
probability of observation X
dynamic programming: f_q(i), the probability of ending in state q while emitting symbol x_i
[Figure: trellis with states A, B, C (rows) and sequence positions x_1 … x_{i-2}, x_{i-1}, x_i (columns); f_q(i) is computed column by column from the previous column]
44
The Forward Algorithm:
probability of observation X
The probability of observing x_1, …, x_i and ending in state q is the 'forward' probability:

  f_q(i) = Pr(x_1 … x_i, π_i = q)
  f_q(i) = e_q(x_i) · Σ_p f_p(i-1) · t_pq
  P(X) = Σ_q f_q(L) · t_q*      (* = end-state)
45
Probability of observation:
weather example

[Weather HMM as before: initial probabilities pH = 0.4, pM = 0.2, pL = 0.4; emission probabilities for ( R, C, S ): H (0.1, 0.2, 0.7), M (0.3, 0.4, 0.3), L (0.6, 0.3, 0.1)]

Forward table for the observation R C S … (each factor written ×10):

        start   1:R          2:C
  H       0     4·1 = 4      (4·6 + 6·4 + 24·1)·2 = 144   (×10^-4)
  M       0     2·3 = 6      (4·3 + 6·2 + 24·5)·4 = 576   (×10^-4)
  L       0     4·6 = 24     (4·1 + 6·4 + 24·4)·3 = 372   (×10^-4)

Start (column 1:R): initial probability × emission, P(R…).
Column 2:C, e.g. for state H: (remain in H + coming from M + coming from L) × emission.
P( RCCSS ) = P( RC… ) continued over the remaining columns.
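A sketch of the forward algorithm for this weather HMM; it reproduces the table values above (4, 6, 24 and 144, 576, 372). For simplicity it assumes no explicit end state, so P(X) is the plain sum over the last column.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}
states = "HML"

def forward(obs):
    """f[i][q] = P(x_1..x_i, state_i = q); returns the table and P(obs)."""
    f = [{q: init[q] * emit[q][obs[0]] for q in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({q: emit[q][x] * sum(prev[p] * trans[p][q] for p in states)
                  for q in states})
    return f, sum(f[-1].values())          # no explicit end state assumed

f, p_obs = forward("RCCSS")
print(f[0])    # {'H': 0.04, 'M': 0.06, 'L': 0.24}       -> 4, 6, 24   (x10^-2)
print(f[1])    # {'H': 0.0144, 'M': 0.0576, 'L': 0.0372} -> 144, 576, 372 (x10^-4)
print(p_obs)   # P(RCCSS), summed over all state paths
```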
46
HMM: posterior decoding
[Figure: trellis split at position i; the forward part covers x_1 … x_i ending in state q, the backward part covers x_{i+1} … x_L starting from state q]

forward · backward, with the i-th state equal to q:
  P(π_i = q, X) = f_q(i) · b_q(i)
Summing this over q at any position i gives P(X), so P(π_i = q | X) = f_q(i) · b_q(i) / P(X).
47
HMM main questions
Given HMM M (transition probabilities t_pq) and observation X:
• probability of this observation?
• most probable state sequence?
• how to find the model? training

Again: we cannot try all possibilities (there are exponentially many state sequences).
Viterbi: the most probable state sequence for X.
48
Viterbi algorithm
most probable state sequence for observation X
(1) dynamic programming: v_q(i), the probability of the most probable path ending in state q while emitting x_i
[Figure: trellis with states A, B, C over positions x_1 … x_i; v_q(i) is computed column by column]
49
Decoding Problem: The Viterbi algorithm
(1) dynamic programming: maximum probability of a path ending in state q at position x_i
(2) traceback: most probable state sequence
[Figure: trellis with states A, B, C over the given sequence x_1 … x_i … x_L; the traceback starts from the best end state (here q = B)]
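A sketch of Viterbi decoding (dynamic programming plus traceback) for the same weather HMM, again assuming no explicit end state. Tie-breaking between equally good predecessors is arbitrary, so other equally probable paths may exist.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}
states = "HML"

def viterbi(obs):
    # (1) dynamic programming: v[i][q] = best probability of a path ending in q
    v = [{q: init[q] * emit[q][obs[0]] for q in states}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for q in states:
            best_p, best_prob = max(((p, v[-1][p] * trans[p][q]) for p in states),
                                    key=lambda t: t[1])
            col[q] = best_prob * emit[q][x]
            ptr[q] = best_p
        v.append(col)
        back.append(ptr)
    # (2) traceback: follow the pointers from the best final state
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[-1][last]

print(viterbi("RCCSS"))   # ('LMHHH', 0.000677...) with this tie-breaking
```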
50
Posterior Decoding Problem
Another decoding method, Posterior Decoding:
Input:
Given a Hidden Markov Model M = (Σ, Q, Θ) and a
sequence X for which the generating path P is
unknown.
Question:
For each 1 ≤ i ≤ L (the length of the path P) and state
k in Q compute the probability: P(πi = k | X).
51
Posterior Decoding Problem
P(πi = k | X) gives two additional decoding possibilities:
1. An alternative 'path' P* that follows the maximum-probability states: π*_i = argmax_{k in Q} P(πi = k | X).
2. Define a function g(k) on the states k in Q, then
     G(i | X) = Σ_k P(πi = k | X) · g(k)
We can use 2) to calculate the posterior probability of each nucleotide being in a CpG island, using a function g(k) defined on all states k in Q:
  g(k) = 1 for all k that are CpG-island states, 0 otherwise.
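A sketch of posterior decoding via forward and backward probabilities for the weather HMM (no explicit end state assumed); the per-position distributions are exactly the P(πi = k | X) defined above, and a g(k)-weighted sum as in point 2 would be one extra line.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}
states = "HML"

def forward(obs):
    f = [{q: init[q] * emit[q][obs[0]] for q in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({q: emit[q][x] * sum(prev[p] * trans[p][q] for p in states)
                  for q in states})
    return f

def backward(obs):
    b = [{q: 1.0 for q in states}]                  # b_q(L) = 1
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {p: sum(trans[p][q] * emit[q][x] * nxt[q] for q in states)
                     for p in states})
    return b

def posterior(obs):
    f, b = forward(obs), backward(obs)
    px = sum(f[-1].values())                        # P(X)
    return [{q: f[i][q] * b[i][q] / px for q in states} for i in range(len(obs))]

for i, dist in enumerate(posterior("RCCSS")):
    print(i + 1, max(dist, key=dist.get), dist)     # best state per position
```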
52
HMM Decoding: two explanations
posterior (Σ): best state at every position
  But: the resulting path may not be allowed by the model
Viterbi (max): optimal global path
  But: there may be many paths with similar probability
53
HMM for Hidden Coin Tossing
[Figure: two hidden states, each emitting H or T with its own probabilities; only the sequence of coin flips is observed]
……… H H T T H T H H T T H
54
dishonest casino dealer
55
dishonest casino dealer
56
dishonest casino dealer
Observation:        366163666466232534413661661163252562462255265252266435353336
Viterbi:            LLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Compare to:
Forward:            FFLLLLLLLLLLLLFFFFFFFFLFLLLFLLFFFFFFFFFFFFFFFFFFFFLFFFFFFFFF
Posterior (total):  LLLLLLLLLLLLFFFFFFFFFLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
(F = fair die, L = loaded die)
57
Learning if correct path is known
• Learning is simple if we know the correct path for each
sequence in our training set
• estimate parameters by counting the number of times each
parameter is used across the training set
58
Sketch: Parameter estimation
training sequences X(i)
optimize score for model Θ.
If state sequences are known:
  count transitions p→q, and divide by the total number of transitions out of state p
  count emissions of symbol b in state q, and divide by the total number of emissions in state q
Laplace correction for dealing with 'zero' probabilities: add 1 to each count.
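A minimal sketch of this counting procedure with the Laplace correction; the fair/loaded labels and the single training pair are toy data, not lecture material.

```python
from collections import Counter

states, alphabet = "FL", "123456"          # e.g. fair/loaded dice
training = [("315662666", "FFFLLLLLL")]    # (observation, known state path) pairs

trans_counts = Counter()
emit_counts = Counter()
for obs, path in training:
    for p, q in zip(path, path[1:]):
        trans_counts[p, q] += 1            # count transitions p -> q
    for q, b in zip(path, obs):
        emit_counts[q, b] += 1             # count emissions of b in state q

def estimate(counts, rows, cols):
    """counts[(r, c)] -> probabilities with a +1 pseudocount per cell."""
    est = {}
    for r in rows:
        total = sum(counts[r, c] + 1 for c in cols)
        est[r] = {c: (counts[r, c] + 1) / total for c in cols}
    return est

trans = estimate(trans_counts, states, states)
emit = estimate(emit_counts, states, alphabet)
print(trans["F"])    # {'F': 0.6, 'L': 0.4}  from counts 2 and 1 plus pseudocounts
```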
59
Learning With Hidden State
• If we don’t know the correct path for each sequence
in our training set, consider all possible paths for the
sequence
• Estimate parameters through a procedure that counts
the expected number of times each parameter is used
across the training set.
60
Learning Parameters: The Baum-Welch Algorithm
• Here we use the Forward-Backward algorithm
• An Expectation Maximization (EM) algorithm
– EM is a family of algorithms for learning probabilistic
models in problems that involve hidden states
• In this context, the hidden state is the path that best
explains each training sequence.
• Note, finding the parameters of the HMM that optimally
explains the given sequences is NP-Complete!
61
HMM: state sequences unknown: Baum-Welch
Baum-Welch training
• Based on given HMM Θ
• Given a training set of sequences X
• Determine:
  – expected number of transitions and
  – expected number of emissions
• Apply ML and build a new (better) model:
– ML tries to find a model that gives the
training data the highest likelihood
• Iterate until convergence.
Note:
• can get stuck in local maxima
• does not understand the semantics of the states
Baum-Welch Re-estimation
62
Probability of being in state q when emitting x_i:
  P(π_i = q | X) = f_q(i) · b_q(i) / P(X)
Probability of the transition (p, q) after emitting x_i:
  P(π_i = p, π_{i+1} = q | X) = f_p(i) · t_pq · e_q(x_{i+1}) · b_q(i+1) / P(X)
For the re-estimation we need the expected counts for the transitions and the emissions in the HMM:
• apply the forward-backward algorithm.
Baum-Welch
63
Estimation of the transition probability: expected transition counts
  A_pq = Σ_X Σ_i f_p(i) · t_pq · e_q(x_{i+1}) · b_q(i+1) / P(X)
  (sum over all training sequences X and all positions i)

Estimation of the emission probability: expected emission counts
  E_q(b) = Σ_X Σ_{i : x_i = b} f_q(i) · b_q(i) / P(X)
  (sum over all training sequences X and all positions i with x_i = b)

Estimate parameters by the ratio of expected counts:
  t_pq = A_pq / Σ_r A_pr        e_q(b) = E_q(b) / Σ_c E_q(c)
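A sketch of one Baum-Welch re-estimation step built from the expected-count formulas above. The two-state model, toy sequences, absence of an end state, and lack of scaling are all simplifications; practical implementations renormalize or work in log space and usually add pseudocounts.

```python
states, alphabet = "FL", "16"
init  = {'F': 0.5, 'L': 0.5}
trans = {'F': {'F': 0.9, 'L': 0.1}, 'L': {'F': 0.2, 'L': 0.8}}
emit  = {'F': {'1': 0.5, '6': 0.5}, 'L': {'1': 0.2, '6': 0.8}}

def forward(obs):
    f = [{q: init[q] * emit[q][obs[0]] for q in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({q: emit[q][x] * sum(prev[p] * trans[p][q] for p in states)
                  for q in states})
    return f

def backward(obs):
    b = [{q: 1.0 for q in states}]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {p: sum(trans[p][q] * emit[q][x] * nxt[q] for q in states)
                     for p in states})
    return b

def baum_welch_step(sequences):
    A = {p: {q: 0.0 for q in states} for p in states}      # expected transition counts
    E = {q: {s: 0.0 for s in alphabet} for q in states}    # expected emission counts
    for obs in sequences:
        f, b = forward(obs), backward(obs)
        px = sum(f[-1].values())                            # P(X)
        for i in range(len(obs)):
            for q in states:
                E[q][obs[i]] += f[i][q] * b[i][q] / px
                if i + 1 < len(obs):
                    for r in states:
                        A[q][r] += (f[i][q] * trans[q][r]
                                    * emit[r][obs[i + 1]] * b[i + 1][r] / px)
    new_trans = {p: {q: A[p][q] / sum(A[p].values()) for q in states} for p in states}
    new_emit = {q: {s: E[q][s] / sum(E[q].values()) for s in alphabet} for q in states}
    return new_trans, new_emit

print(baum_welch_step(["16611", "66666111"]))   # re-estimated (transition, emission)
```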
64
Baum-Welch training
concerns:
• guaranteed to converge (in the target score, not in Θ)
• unstable solutions!
• local maxima
practical:
• small values -> renormalize
tips:
• repeat for several initial HMMs Θ
• start with a meaningful HMM Θ
65
Viterbi training
Viterbi training (sketch):
• determine optimal paths
• re-compute as if paths are known
• score may decrease!
66
Computational Complexity of HMM Algorithms
• Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is O(S²L)
  – This assumes that the states are densely interconnected
• Given M training sequences of length L, the complexity of Baum-Welch on each iteration is O(MS²L)
67
Important Papers on HMM
L.R. Rabiner, A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition,
Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
A. Krogh, I. Saira Mian, D. Haussler, A Hidden Markov Model
that finds genes in E. coli DNA, Nucleic Acids Research,
Vol. 22 (1994), pp 4768-4778
Furthermore:
R. Hassan, A combination of hidden Markov model and fuzzy
model for stock market forecasting, Neurocomputing
archive, Vol. 72 , Issue 16-18, pp 3439-3446, October
2009.
68
Applications
Hidden Markov Models
69
model topology
[Figure: fully connected model over the states A, C, G, T]
many states & fully connected: training seldom works => local maxima
use knowledge about the problem.
For example: use a linear model for profile alignment:
[Figure: linear chain begin → M1 → M2 → … → end]
70
silent states
quadratic vs. linear number of transitions, but fewer modeling possibilities
[Figure: chain of square emitting states 1 2 3 4 5 with round silent states (no emission) above them; high/low transition probabilities allow skipping nodes]
71
silent states: algorithm
Previously (forward algorithm): from state p to state q, a transition term and an emission term.
For emitting states q => calculated as before.
But for silent states q:
  - there is no emission term
  - no silent loops (!)
  - update in 'topological order'
72
profile alignment (no gaps)
profile HMM P: 'dedicated topology'
Let e_i(b) be the probability of observing symbol b at position i; then the probability of a sequence under the (ungapped) profile is ∏_i e_i(x_i).

Assume a given profile set:

  pos: 12345678
       VGAHAGEY
       VTGNVDEV
       VEADVAGH
       VKSNDVAD
       VYSTVETS
       FNANIPKH
       IAGNGAGV

No gaps => transition probabilities are 1: trivial alignment of the HMM to a sequence.
[Figure: linear profile HMM begin → M1 → … → M4 → … → end; the emission probability distribution e_4(b) sits at state M4]
73
affine model: open gap / extension

profile alignment with gaps

[Figure: match states M_j, M_{j+1} with an insert state I_j between them]

Given profile sequences:

  pos: 123__45678
       VGA--HAGEY
       VNA--NVDEV
       VEA--DVAGH
       VKG--NYDED
       VYS--TYETS
       FNA--NIPKH
       IAGADNGAGV

Emission probability distribution based on:
  - background probabilities: e_i(b) = p(b)
  - or based on the alignment (match)
74
profile alignment with gaps and deletes
[Figure: match states M_{j-1}, M_j, M_{j+1}; insert state I_j; delete states D_{j-1}, D_j (silent)]

Given profile sequences:

  pos: 123__45678
       VGA--HAGEY
       V----NVDEV
       VEA--DVAGH
       VKG------D
       VYS--TYETS
       FNA--NIPKH
       IAGADNGAGV

adapt Viterbi => (next slide)
75
HMM for profiles / multiple alignment
[Figure: profile HMM column j with match state M_j, insert state I_j and (silent) delete state D_j; begin → … → M_j → … → end]

Deletion (D): same level
Insertion (I): same position
Match (M)

Viterbi recursions (Y ranges over {M, I, D}):

  v_j^M(i) = e_{M_j}(x_i) · max_Y { v_{j-1}^Y(i-1) · t(Y_{j-1}, M_j) }
  v_j^I(i) = p(x_i)       · max_Y { v_j^Y(i-1)     · t(Y_j, I_j) }
  v_j^D(i) =                max_Y { v_{j-1}^Y(i)   · t(Y_{j-1}, D_j) }
76
profile alignment
given multiple alignment, with Insertion / Deletion states:

  pos: 123__45678
       VGA--HAGEY
       V----NVDEV
       VEA--DVAGH
       VKG------D
       VYS--TYETS
       FNA--NIPKH
       IAGADNGAGV

Example counting for state 1 (Laplace correction, i.e., adding 1 to each count to avoid dividing by 0):

transitions:
  M1→M2      6+1    7/10
  M1→I1      0+1    1/10
  M1→D1      1+1    2/10

emissions:
  F          1+1    2/27
  I          1+1    2/27
  V          5+1    6/27
  other 17×  0+1    1/27
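A sketch that reproduces the state-1 counts on this slide from the alignment, with the +1 Laplace correction. The choice of which alignment columns are match columns (the first three and the last five) follows the position row '123__45678'.

```python
from collections import Counter

alignment = ["VGA--HAGEY",
             "V----NVDEV",
             "VEA--DVAGH",
             "VKG------D",
             "VYS--TYETS",
             "FNA--NIPKH",
             "IAGADNGAGV"]
match_cols = [0, 1, 2, 5, 6, 7, 8, 9]        # alignment columns for match states 1..8
AMINO = "ACDEFGHIKLMNPQRSTVWY"                # 20 amino acids

# emissions at match state 1 (alignment column 0), +1 pseudocount per amino acid
emit1 = Counter(row[0] for row in alignment if row[0] != '-')
total = sum(emit1[a] + 1 for a in AMINO)                      # 7 observed + 20 = 27
print({a: (emit1[a] + 1) / total for a in "VFI"})             # V 6/27, F 2/27, I 2/27

# transitions out of match state 1: M if the next match column has a residue,
# D if it has a gap; an insertion would need an insert column between match
# positions 1 and 2, and this alignment has none, so its raw count stays 0.
m1_to = Counter()
for row in alignment:
    if row[0] != '-':                          # the sequence visits match state 1
        m1_to['M' if row[match_cols[1]] != '-' else 'D'] += 1
tot = sum(m1_to[s] + 1 for s in "MID")                        # (6 + 0 + 1) + 3 = 10
print({s: (m1_to[s] + 1) / tot for s in "MID"})               # M 7/10, I 1/10, D 2/10
```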
77
Multiple Sequence Alignment using a Profile HMM
Multiple Sequence Alignment Problem:
Given a set of sequences S1,…, Sn.
How can the set of sequences be optimally aligned?
Assume a profile HMM P is known, then:
- Align each sequence Si to the profile separately
- Accumulate the obtained alignments to a multiple
alignment
78
Multiple Sequence Alignment: using a Profile HMM
Multiple Sequence Alignment Problem:
Given sequence S1,…, Sn, how can they be optimally aligned?
Assume a profile HMM P is not known, then obtain an HMM profile P from S1,…, Sn as follows:
- Choose a length L for the profile HMM and initialize the transition and emission probabilities.
- Train the HMM using Baum-Welch on all sequences S1,…, Sn.
Now obtain the multiple alignment using this HMM P as in the previous case:
- Align each sequence Si to the profile separately
- Accumulate the obtained alignments to a multiple alignment
79
multiple alignment with profile
align each sequence separately
accumulate alignments on M and D positions
align inserts (I) leftmost

Separate alignments (profile states per residue):

  VGAHAGEY      1 2 3 4 5 6 7 8
  FNAPNI-KH     1 2 3 I 4 5 6 7 8   (the '-' at state 6 is a deletion, D)
  IAGADNGAGV    1 2 3 I I 4 5 6 7 8

Accumulated multiple alignment:

  pos: 123  45678
       VGA--HAGEY
       FNAP-NI-KH
       IAGADNGAGV
80
Gene Finding
81
gene finding
central dogma:
DNA –(transcription)→ RNA –(translation)→ protein
only 2%-3% coding … find these regions

  Prokaryotes                        Eukaryotes
  no nucleus                         nucleus
  most of the genome is coding       part of the genome is coding
  (H. influenzae: 70% coding)        (human: 2%-3% coding)
  continuous genes                   introns & exons
82
Biological Background: Transcription
Gene Expression from DNA to Protein
• Transcription
• Translation
Transcription
• DNA -> mRNA (messenger RNA).
• From the 5’ end to the 3’ end, i.e., downstream (the opposite direction is upstream)
• RNA polymerase enzyme starts a few bases upstream of the coding region, and terminates a few bases after the end of the coding region.
• The upstream and downstream regions not used for the code of the protein are called untranslated regions (UTRs)
• RNA polymerase molecules start transcription by recognizing and binding to promoter regions upstream of the transcription start sites.
• Each promoter region has a signal which can encourage or suspend transcription.
83
Biological Background: Translation
Translation mRNA -> Protein is executed by ribosomes:
• Each triplet of bases in the mRNA is a command for the ribosomes, called codon.
• 64 different possible codons and only 20 amino acids => multiple codons represent the same amino acid.
Ribosome starts scanning the mRNA molecule from 5’ end to 3’ end;
If Ribosome detects start codon (also code for Methionine)
Then
While (Ribosome did not detect any of the three possible stop codons)
do
Generating an amino acid sequence coded by the mRNA;
Scan next codon;
od
Detach the chain of amino acids from the Ribosome;
Endif (Simplified)
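The simplified ribosome scan above in code form; the five-entry codon table is a toy stand-in for the full genetic code.

```python
CODON_TABLE = {"AUG": "M", "UUU": "F", "GCU": "A", "GAA": "E", "UGG": "W"}
STOP = {"UAA", "UAG", "UGA"}

def translate(mrna):
    start = mrna.find("AUG")                 # scan 5' -> 3' for the start codon
    if start == -1:
        return ""                            # no start codon detected
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP:                    # detach at the first stop codon
            break
        protein.append(CODON_TABLE.get(codon, "X"))   # "X": codon not in toy table
    return "".join(protein)

print(translate("GGAAUGUUUGCUGAAUGGUAAGG"))   # "MFAEW"
```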
84
Codons
start
AUG
stop
UAA,
UAG,
UGA
85
Codons
start
AUG
stop
UAA,
UAG,
UGA
86
Open Reading Frames (ORFs)
3 (or 6) ORFs
start AUG
stop UAA, UAG, UGA
Coding regions stop with a stop codon.
3 out of 64 codons are stop codons, i.e., the random probability of a stop codon is 3/64
=> at random, every region of ~21 codons is expected to contain one stop
But: an average protein-coding gene is about 1000 bp [much longer]
Thus: search for long ORFs in all 3 reading frames => coding regions!
- misses short genes
- misses overlapping long ORFs on opposite strands
- too many found (6500 ORFs in the E. coli genome, but only 1100 genes)
[Figure: the three reading frames of the repeated sequence ACTGACTGACTG…, read from the 5'-end (upstream) to the 3'-end (downstream)]
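A sketch of the 'search for long ORFs' idea on one strand; the minimum length is an illustrative parameter (the 21-codon expectation above suggests that anything much longer is interesting), and the reverse complement would be scanned the same way for the other three frames.

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def orfs(dna, min_codons=100):
    """Return (frame, start, end) for ORFs of at least min_codons codons."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == START and start is None:
                start = i                              # open an ORF
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None                           # close the ORF
    return found        # repeat on the reverse complement for all 6 frames

print(orfs("ATG" + "GCT" * 120 + "TAA", min_codons=100))   # [(0, 0, 366)]
```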
87
Reading Frames
DNA strand, read 5'-end → 3'-end (downstream; the opposite direction is upstream):
  AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTA

Complementary DNA strand (antiparallel: 3'-end on the left, 5'-end on the right):
  TCCGTACGCTAGGTTCAAGGTGGTACTACTGTACTACTGAT

(base pairing: A–T, C–G)
88
Reading Frames
DNA strand (5'-end → 3'-end, downstream), three reading frames:
  AGG CAT GCG ATC CAA GTT CCA CCA TGA TGA CAT GAT GAC TA
  AGGC ATG CGA TCC AAG TTC CAC CAT GAT GAC ATG ATG ACT AA.
  GCA TGC GAT CCA AGT TCC ACC ATG ATG ACA TGA TGA CTA A..

Complementary DNA strand, three reading frames:
  ..T CCG TAC GCT AGG TTC AAG GTG GTA CTA CTG TAC TAC TGA
  .TC CGT ACG CTA GGT TCA AGG TGG TAC TAC TGT ACT ACT GAT
  TCC GTA CGC TAG GTT CAA GGT GGT ACT ACT GTA CTA CTG ATT
89
Prokaryotes DNA Structure
90
Eukaryotes DNA Structure
91
genes are not random
motto
codon frequencies (per 20): 'random' vs. coding

  Leu  Leucine      6 codons    6.9
  Ala  Alanine      4           6.5
  Trp  Tryptophan   1           1

A or T in 2nd position: sometimes 90%

Markov models:
• codon triplets as states [64 states], ~ 3rd order (but no overlap)
• triplet frequencies only
keep 3 frames in a sliding window, cf. CpG
92
promoter regions
'consensus' sequence, i.e., not exact:
  … n TTGAC n18 TATAAT n6 N n …

position weight matrix of the TATA box:

  pos   1    2    3    4    5    6
  A     2   95   26   59   51    1
  C     9    2   14   13   20    3
  G    10    1   16   15   13    0
  T    79    3   44   13   17   96

wmm: weight matrix model
wam: weight array matrix (+ dependencies between adjacent positions)
cf. 'profile'
[Figure: promoter with the TATA box upstream of the coding start]
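A sketch of scoring a 6-mer against this TATA-box weight matrix as a log-odds sum; the 0.25 background and the +1 pseudocount (needed because of the 0 for G at position 6) are assumptions, not part of the slide.

```python
import math

PWM = {'A': [ 2, 95, 26, 59, 51,  1],
       'C': [ 9,  2, 14, 13, 20,  3],
       'G': [10,  1, 16, 15, 13,  0],
       'T': [79,  3, 44, 13, 17, 96]}

def pwm_score(site, background=0.25, pseudo=1):
    """Sum of log2((count + pseudo) / column total / background) over 6 positions."""
    score = 0.0
    for i, base in enumerate(site):
        total = sum(PWM[b][i] + pseudo for b in "ACGT")
        score += math.log2((PWM[base][i] + pseudo) / total / background)
    return score

print(pwm_score("TATAAT"))   # consensus scores high (positive)
print(pwm_score("GCGCGC"))   # non-consensus scores low (negative)
```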
93
HMM Gene Finding
A. Krogh, I. Saira Mian, D. Haussler, A Hidden
Markov Model that finds genes in E. coli
DNA, Nucleic Acids Research, Vol. 22, pp
4768-4778, 1994
94
HMM Model
61 Codon Models
Start Codon Model
• ATG
• GTG
• TTG (rare)
Stop Codon Models
• TAA and TGA
• TAG
95
More Advanced Intergenic Model
96
97
Statistics on Data Set EcoSeq6
98
HMM Results
Data Set:
• EcoSeq6 contained about one third of the complete E. coli genome (total 5.44×10^6 nucleotides, 5416 genes), and was not fully annotated at that time
HMM Training:
• on ~10^6 nucleotides from the EcoSeq6 database of labeled genes (K. Rudd, 1991)
HMM Testing:
• on the remainder of ~325,000 nucleotides
Method:
• For each contig in the test set the Viterbi algorithm was used to find the most likely path through the hidden states of the HMM
• This path was then used to define a parse of the contig into genes separated by intergenic regions
99
HMM Results
Post-processing consists of 3 rules to handle:
• overlapping genes, which will look like frame-shifts
• short genes overlapping with long genes in the opposite direction, as a result of self-complementary type codons
100
HMM Results
• 80% of the labeled protein coding genes were
exactly found (i.e. with precisely the same
start and end codon)
• 5% found within 10 codons from start codon
• 5% overlap by at least 60 bases or 50%
• 5% missed completely
• Several new genes indicated
• Several insertion and deletion errors were
labeled in the contig parse
101
Important Papers on HMM
L.R. Rabiner, A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition,
Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
A. Krogh, I. Saira Mian, D. Haussler, A Hidden Markov Model
that finds genes in E. coli DNA, Nucleic Acids Research,
Vol. 22 (1994), pp 4768-4778
Furthermore:
R. Hassan, A combination of hidden Markov model and fuzzy
model for stock market forecasting, Neurocomputing
archive, Vol. 72 , Issue 16-18, pp 3439-3446, October
2009.
103
References
• Lecture notes at M. Craven’s website: www.biostat.wisc.edu/~craven
• A. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis of
Genes and Proteins (3rd ed.). John Wiley & Sons, 2004
• R.Durbin, S.Eddy, A.Krogh and G.Mitchison. Biological Sequence Analysis: Probability
Models of Proteins and Nucleic Acids. Cambridge University Press, 1998
• N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT
Press, 2004
• I. Korf, M. Yandell, and J. Bedell. BLAST. O'Reilly, 2003
• L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech
recognition. Proc. IEEE, 77:257--286, 1989
• J. C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS
Pub Co., 1997.
• M. S. Waterman. Introduction to Computational Biology: Maps, Sequences, and
Genomes. CRC Press, 1995
• A. Krogh, I. Saira Mian, D. Haussler, A Hidden Markov Model that finds genes in E. coli
DNA, Nucleic Acids Research, Vol. 22, pp 4768-4778, 1994
104
Transcription and Translation
• http://www.ebi.ac.uk/Tools/emboss/transeq/help.html
• http://www.ebi.ac.uk/2can/tutorials/transcription.html