1
Hidden Markov Models
based on chapters from the book
Durbin, Eddy, Krogh and Mitchison
Biological Sequence Analysis
Shamir’s lecture notes
and Rabiner’s tutorial on HMM
2
music recognition
deal with variations in
- pitch
- timing
- timbre
- …
3
Stock Market Prediction
• Actual value versus forecasted value for Tata Steel, in Rupees, over the period 5-9-2009 – 23-9-2011.
• Variations of value over time.
• From: A. Gupta, B. Dhingra, Stock Market Prediction Using Hidden Markov Models, 2011.
4
Activity Tracking
Activities:
• Walking
• Running
• Cycling
• stair climbing
• sleeping, etc.
5
application: gene finding
deal with variations in
- actual sound → actual base (match/substitutions)
- timing → insertions/deletions
6
Basic Questions
Given:
• A sequence of “observations”
• A probabilistic model of our “domain”
Questions:
• Does the given sequence belong to a certain family?
  – Markov chains
  – Hidden Markov Models (HMMs)
• Can we say something about the internal structure of the sequence? (indirect observations)
  – Hidden Markov Models (HMMs)
7
Introduction Markov Chain Model
Characteristics
• Discrete time
• Discrete space
• No state history – present state only
• States and transitions
Notations:
P(X) probability for event X
P(X,Y) event X and event Y
P(X|Y) event X given event Y
[Figure: example chain with states A, B, C and transition probabilities 0.4, 0.3, 0.3, 0.2, 0.8, 1]
Discrete vs Continuous
8
Definition of Markov Chain Model
• A Markov chain[1] model is defined by
– a set of states
• some states emit a symbol (unique per state)
• other states (e.g., the begin state) are silent
– a set of transitions with associated probabilities
• the transitions going out of a given state define a distribution over the
possible next states (i.e., all positive, and sum equals 1)
[1] A. A. Markov, "Extension of the law of large numbers to quantities depending on each other" (in Russian). Izvestiya of the Physico-Mathematical Society at Kazan University, 2nd series, Vol. 15 (1906), pp. 135-156.
9
Markov Model
Markov Model M = (Q,P,T), with
• Q the set of states
• P the set of initial probabilities px for each state x in Q
• T = (txy) the transition probabilities matrix/graph, with txy the probability of the transition from state x to state y.
This is a first order Markov Model:
no history is modeled
An observation X is a sequence of states:
X = x1x2 … xn
The probability of an observation X given the model M is equal to:

  P(X | M) = p_{x1} · ∏_{i=2..n} t_{x_{i-1} x_i}

[Figure: Markov model M with states A, B, C and transition probabilities t_AA, t_AC, …]
10
A Markov Chain Model Example
• Transition
probabilities
– Pr(x_i = a | x_{i-1} = g) = 0.16
– Pr(x_i = c | x_{i-1} = g) = 0.34
– Pr(x_i = g | x_{i-1} = g) = 0.38
– Pr(x_i = t | x_{i-1} = g) = 0.12

  Σ_x Pr(x_i = x | x_{i-1} = g) = 1   (sum over all possible next states x)
11
The Probability of a Sequence for a Markov Chain Model
Pr(CGGT)=Pr(C)Pr(G|C)Pr(G|G)Pr(T|G)
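This factorization maps directly onto a few lines of code. A minimal sketch (not from the lecture): the transition values are the '+' (CpG island) frequencies shown later on slide 21, and the uniform initial distribution is an assumption made here for illustration.

```python
# Probability of a DNA sequence under a first-order Markov chain:
# Pr(x) = Pr(x1) * prod_i Pr(x_i | x_{i-1})
trans_plus = {
    'A': {'A': 0.180, 'C': 0.274, 'G': 0.426, 'T': 0.120},
    'C': {'A': 0.171, 'C': 0.368, 'G': 0.274, 'T': 0.188},
    'G': {'A': 0.161, 'C': 0.339, 'G': 0.375, 'T': 0.125},
    'T': {'A': 0.079, 'C': 0.355, 'G': 0.384, 'T': 0.182},
}
init = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}     # assumed uniform start

def chain_prob(seq, init, trans):
    """Pr(seq) under a first-order Markov chain."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(chain_prob("CGGT", init, trans_plus))   # 0.25 * 0.274 * 0.375 * 0.125
```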
12
Markov Chains: Another Example
[Figure: two Markov chain models over states A, B, C.
 M1: A→A 0.7, A→B 0.3, B→B 0.2, B→C 0.8, C→C 0.6, C→A 0.4.
 M2: A→A 0.6, A→B 0.4, B→B 0.3, B→C 0.6, C→C 0.5, and remaining transitions 0.5 and 0.1.]

Unique starting state A; for M1:

  Q = { A, B, C }
  P = ( 1, 0, 0 )
          A    B    C
  T =  A  0.7  0.3  0
       B  0    0.2  0.8
       C  0.4  0    0.6

Observation: AABBCCC

  P( AABBCCC | M1 ) = 1 · 0.7 · 0.3 · 0.2 · 0.8 · 0.6 · 0.6 = 1.2 × 10^-2
  P( AABBCCC | M2 ) = 1 · 0.6 · 0.4 · 0.3 · 0.6 · 0.5 · 0.5 = 1.1 × 10^-2
13
Markov Models: Properties
Given some sequence x of length L, we can ask:
How probable is the sequence x given our model M?
• For any probabilistic model of sequences, we can
write this probability as
• key property of a (1st order) Markov chain: the
probability of each xi depends only on the value of
xi-1
  Pr(x) = Pr(x_L, x_{L-1}, …, x_1)
        = Pr(x_L | x_{L-1}, …, x_1) · Pr(x_{L-1} | x_{L-2}, …, x_1) · … · Pr(x_1)

For a (1st order) Markov chain this simplifies to:

  Pr(x) = Pr(x_1) · Pr(x_2 | x_1) · Pr(x_3 | x_2) · … · Pr(x_L | x_{L-1})
        = Pr(x_1) · ∏_{i=2..L} Pr(x_i | x_{i-1})
14
Markov Model: Underflow Problem
[Figure: Markov model M with states A, B, C, an added initial state 0 (transitions t_0A, t_0B, t_0C ~ initial probabilities) and a final state (not depicted)]

small values: underflow
• initial state x_0 fixed ~ initial probabilities
• final state [not depicted]
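A common remedy, hinted at by this slide, is to work with log probabilities so that a product of many small factors becomes a sum. A small sketch with toy transition values (purely illustrative, not lecture data):

```python
import math

trans = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.2, 'B': 0.8}}   # toy values
init = {'A': 0.5, 'B': 0.5}

def chain_log_prob(seq, init, trans):
    """Sum of log probabilities instead of a product of probabilities."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

seq = "A" * 1000 + "B"
print(chain_log_prob(seq, init, trans))   # about -108; the plain product is ~1e-47
# For genome-length sequences the plain product underflows double precision,
# while the log sum stays well-behaved.
```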
15
Markov Model: Comparing Models
M1
M2
Given: two models M1, M2 and an observation X.
Question: which model best explains X?

We can calculate:   P(X | M1)  vs.  P(X | M2)
We want to know:    P(M1 | X)  vs.  P(M2 | X)  !!

Bayes Rule: P(A|B) = P(B|A) P(A) / P(B), so:

  P(M1|X) / P(M2|X) = [ P(X|M1) P(M1) ] / [ P(X|M2) P(M2) ]
16
motto
bases are not random
17
Motivation for Markov Models in Computational Biology
• There are many cases in which we would like to
represent the statistical regularities of some
class of sequences
– genes
– various regulatory sites in DNA (e.g., where RNA
polymerase and transcription factors bind)
– proteins in a given family
• Markov models are well suited to this type of task
18
Markov Chain: An Example Application
• CpG islands
– CG di-nucleotides are rarer in eukaryotic genomes than expected
given the marginal probabilities of C and G
– but the regions upstream of genes (reading is from 5’ to 3’) are
richer in CG di-nucleotides than elsewhere – so called CpG islands
– useful evidence for finding genes
• Application: Predict CpG islands with Markov chains
– a Markov chain to represent CpG islands
– a Markov chain to represent the rest of the genome
19
Markov Chains for Discrimination
• Suppose we want to distinguish CpG islands
from other sequence regions
• Given sequences from CpG islands, and
sequences from other regions, we can construct
– a model to represent CpG islands
– a null model to represent the other regions
• We can then score a test sequence X by:
  score(X) = log [ Pr(X | CpG model) / Pr(X | null model) ]
20
Markov Chains for Discrimination
As before we can use the scoring function:

  score(X) = log [ Pr(X | CpG model) / Pr(X | null model) ]

• Because according to Bayes' rule we have:

  Pr(CpG | X) = Pr(X | CpG) Pr(CpG) / Pr(X)
  Pr(null | X) = Pr(X | null) Pr(null) / Pr(X)

• If we do not take the prior probabilities Pr(CpG) and Pr(null) of the two classes into account, then from Bayes' rule it is clear that we just need to compare Pr(X | CpG) and Pr(X | null), as is done in our scoring function score().
21
Markov Chain Application: CpG islands
Observed transition frequencies:

island ('+' model):
  +     A      C      G      T
  A   0.180  0.274  0.426  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.375  0.125
  T   0.079  0.355  0.384  0.182

non-island ('−' model):
  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292

In general, consecutive C→G pairs are rare, although 'islands' of them occur in signal regions, e.g., promoter regions.
22
basic questions
observation: DNA sequence
model 1: CpG islands
model 2: non-islands
• does this sequence belong to a certain family?
Markov chains
is this a CpG island (or not)?
• can we say something about the internal structure?
Markov Chains: windowing
where are the CpG islands?
23
application: CpG islands
island ('+'):                             non-island ('−'):
  +     A      C      G      T              -     A      C      G      T
  A   0.180  0.274  0.426  0.120            A   0.300  0.205  0.285  0.210
  C   0.171  0.368  0.274  0.188            C   0.322  0.298  0.078  0.302
  G   0.161  0.339  0.375  0.125            G   0.248  0.246  0.298  0.208
  T   0.079  0.355  0.384  0.182            T   0.177  0.239  0.292  0.292

X = ACGT   (transitions A→C, C→G, G→T)

  score = (0.274 · 0.274 · 0.125) / (0.205 · 0.078 · 0.208) = 2.82

Note: a score > 1 is an indication of a CpG island.
24
application: CpG islands
log-score (log2)

X = ACGT

  log2 [ (0.274 · 0.274 · 0.125) / (0.205 · 0.078 · 0.208) ] = 0.42 + 1.81 − 0.73 = 1.50

  LLR     A      C      G      T
  A    -0.74   0.42   0.58  -0.80
  C    -0.91   0.30   1.81  -0.69
  G    -0.62   0.46   0.33  -0.73
  T    -1.17   0.57   0.39  -0.68

LLR = Log-Likelihood Ratio, e.g. LLR(C→G) = log2(0.274 / 0.078) = 1.81
25
CpG Log-Likelihood Ratio
  LLR     A      C      G      T
  A    -0.74   0.42   0.58  -0.80
  C    -0.91   0.30   1.81  -0.69
  G    -0.62   0.46   0.33  -0.73
  T    -1.17   0.57   0.39  -0.68
LLR(ACGT) = 0.42+1.81–0.73 = 1.50
• is a (short) sequence a CpG island ?
compare with observed data (normalized for length)
• where (in long sequence) are CpG islands ?
first approach: sliding window
• ! What would be the length of window?
( 1.50 / 4 = 0.375 ≈ 0.37 'bits' per base )
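A sketch of both ideas on this slide: the length-normalized log2 score and the sliding-window first approach. The LLR values are the table above; the window length and any decision threshold are illustrative choices, not values fixed by the lecture.

```python
# log2 likelihood-ratio table, LLR[prev][cur] = log2( p_plus / p_minus )
llr = {
    'A': {'A': -0.74, 'C': 0.42, 'G': 0.58, 'T': -0.80},
    'C': {'A': -0.91, 'C': 0.30, 'G': 1.81, 'T': -0.69},
    'G': {'A': -0.62, 'C': 0.46, 'G': 0.33, 'T': -0.73},
    'T': {'A': -1.17, 'C': 0.57, 'G': 0.39, 'T': -0.68},
}

def llr_score(seq):
    """Length-normalized log2 likelihood ratio ('bits' per base)."""
    s = sum(llr[p][c] for p, c in zip(seq, seq[1:]))
    return s / len(seq)

print(llr_score("ACGT"))            # (0.42 + 1.81 - 0.73) / 4 = 0.375

def window_scores(seq, w=100):
    """First approach: score every sliding window of length w."""
    return [(i, llr_score(seq[i:i + w])) for i in range(len(seq) - w + 1)]
```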
26
empirical data
• is a (short) sequence a CpG island ?
compare with observed data (normalized for length)
[Figure: histograms of length-normalized scores for CpG islands vs. non-CpG sequences]
27
CpGplot
ACCGATACGATGAGAATGAGCAATGTAGTGAATCGTTTCAGCTACTCTCTATCGTAGCATTACTATGCAGTCAGTGATGCGCGCTAGCCGCGTAGCTCGCGGTCGCATCGCTGGCCGTAGCTGCGTACGATCTGCTGTACGCTGATCGGAGCGCTGCATCTCAACTGACTCATACTCATATGTCTACATCATCATCATTCATGTCAGTCTAGCATACTATTATCGACGACTGATCGATCTGACTGCTAGTAGACGTACCGAGCCAGGCATACGACATCAGTCGACT
• where (in long sequence) are CpG islands ?
first approach: sliding window
28
CpGplot
[CpGplot output: plots of the observed vs. expected CG ratio, the C+G percentage, and the putative islands along the sequence]

Islands of unusual CG composition
EMBOSS_001 from 1 to 286
  Observed/Expected ratio > 0.60
  Percent C + Percent G > 50.00
  Length > 50
  Length 114 (51..164)

Window size 100
C and G contents => expected CG occurrences
%C + %G
A set of 10 windows fulfilling the thresholds is required before an island is called
29
Some Notes on: Higher Order Markov Chains
• The Markov property specifies that the probability of a state depends only
on the previous state
• But we can build more “memory” into our states by using a higher order
Markov model
• In an n-th order Markov model
The probability of the current state depends on the previous n states.
  Pr(x_i | x_{i-1}, x_{i-2}, …, x_1) = Pr(x_i | x_{i-1}, …, x_{i-n})
30
Selecting the Order of a Markov Chain Model
• But the number of parameters we need to estimate for an n-th order Markov model grows exponentially with the order
  – for modeling DNA we need O(4^(n+1)) parameters (# of state transitions) for an n-th order model
• The higher the order, the less reliable we can expect our parameter estimates to be
  – estimating the parameters of a 2nd order Markov chain from the complete genome of E. coli (5.44 × 10^6 bases), we would see each (length 3) word ~85,000 times on average (divide by 4^3)
  – estimating the parameters of a 9th order chain, we would see each (length 10) word ~5 times on average (divide by 4^10 ≈ 10^6)
31
Higher Order Markov Chains
• An n-th order Markov chain over some alphabet A is
equivalent to a first order Markov chain over the alphabet
of n-tuples: An
• Example: a 2nd order Markov model for DNA can be
treated as a 1st order Markov model over alphabet
AA, AC, AG, AT
CA, CC, CG, CT
GA, GC, GG, GT
TA, TC, TG, TT
Transition probabilities:
P(A|AA) , P(A| AC), etc.
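A small sketch of this equivalence (an illustration, not lecture code): a 2nd-order chain over {A, C, G, T} rewritten as a 1st-order chain whose states are dinucleotides, so that P(A | AC) becomes the probability of moving from state "AC" to state "CA".

```python
from itertools import product

ALPHABET = "ACGT"
states = ["".join(p) for p in product(ALPHABET, repeat=2)]
print(len(states))                      # 16 dinucleotide states

def lift_to_first_order(second_order):
    """second_order[xy][z] = P(z | x,y)  ->  first_order[xy][yz]."""
    first_order = {}
    for xy, dist in second_order.items():
        # the new state keeps the last symbol of the old state plus the emitted one
        first_order[xy] = {xy[1] + z: p for z, p in dist.items()}
    return first_order

toy = {"AC": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1}}   # toy numbers
print(lift_to_first_order(toy))
# {'AC': {'CA': 0.4, 'CC': 0.2, 'CG': 0.3, 'CT': 0.1}}
```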
32
A Fifth Order Markov Chain Equivalent
Pr(GCTACA)=Pr(GCTAC)Pr(A|GCTAC)
33
hidden Markov model
Where (in long sequence) are CpG islands?
• first approach: Markov Chains + windowing
• second approach: hidden Markov model
34
Hidden Markov Model: A Simple HMM
Given observed sequence AGGCT, which state emits which item?
Model 1 Model 2
35
Another example: Eddy (2004)
A (toy) HMM for 5’ splice site recognition.
Figure from: What is a hidden Markov model?
Sean R Eddy. Nature Biotechnology 22, 1315 - 1316 (2004)
probability of a path
P( π_i = '5' | X ): posterior decoding P( π_i = q | X ),
i.e., given sequence X, what is the probability that
the i-th state is equal to q (here, the 5’ splice-site state).
36
Example: weather
[Figure: weather HMM with hidden pressure states H, M, L and observed weather symbols R (rain), C (cloudy), S (sunny)]

initial probabilities:
  pH = 0.4, pM = 0.2, pL = 0.4

transition probabilities:
        H    M    L
  H    0.6  0.3  0.1
  M    0.4  0.2  0.4
  L    0.1  0.5  0.4

emission probabilities ( R, C, S ):
  H: (0.1, 0.2, 0.7)
  M: (0.3, 0.4, 0.3)
  L: (0.6, 0.3, 0.1)

observed weather vs. (hidden) pressure
37
Example: weather
[Figure: the same weather HMM; emission probabilities for ( R, C, S ): H (0.1, 0.2, 0.7), M (0.3, 0.4, 0.3), L (0.6, 0.3, 0.1); initial probabilities pH = 0.4, pM = 0.2, pL = 0.4]

Given a path, using emission probabilities only (each factor written ×10):

  P( RCCSS | HHHHH ) = 1·2·2·7·7 = 196 (×10^-5)
  P( RCCSS | MMMMM ) = 3·4·4·3·3 = 432 (×10^-5)

Joint probability, using initial, transition and emission probabilities (each factor written ×10):

  P( RCCSS, HHHHH ) = 4·1·6·2·6·2·6·7·6·7 ≈ 1016 (×10^-7)
  P( RCCSS, MMMMM ) = 2·3·2·4·2·4·2·3·2·3 ≈ 14 (×10^-7)
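A sketch reproducing the four numbers on this slide. The emission and initial probabilities are read off the figure; the transition matrix is the one that also reproduces the forward computation on slide 45.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}

def emission_prob(obs, path):
    """P(obs | path): emission probabilities only."""
    p = 1.0
    for o, q in zip(obs, path):
        p *= emit[q][o]
    return p

def joint_prob(obs, path):
    """P(obs, path): initial, transition and emission probabilities."""
    p = init[path[0]] * emit[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
    return p

print(emission_prob("RCCSS", "HHHHH"))   # 1.96e-3  (196 x 10^-5)
print(emission_prob("RCCSS", "MMMMM"))   # 4.32e-3  (432 x 10^-5)
print(joint_prob("RCCSS", "HHHHH"))      # ~1.02e-4 (1016 x 10^-7)
print(joint_prob("RCCSS", "MMMMM"))      # ~1.4e-6  (  14 x 10^-7)
```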
38
CpG islands ctd.
[Transition tables for the '+' (island) and '−' (non-island) models as on slide 21]

8 states: A+, C+, G+, T+ vs. A−, C−, G−, T−
'+' denotes CpG island, '−' denotes non-CpG island
unique observation (emitted nucleotide) per state
8×8 = 64 transitions!

[Figure: the four '+' states and the four '−' states; transitions within the island block are scaled by p (e.g. 0.180·p), within the non-island block by q; leaving the island carries probability (1−p)/4 per target state, entering it 1−q]
39
hidden Markov model
model M = (Σ, Q, T)
• alphabet Σ of observed symbols
• states Q
• transition probabilities t_pq
• emission probabilities e_q(x)

observation: we observe the emitted symbols only; the states are observed indirectly, they are 'hidden'

probability of an observation given the model?
? there may be many state sequences producing the same observation

[Figure: states A, B, C (the underlying hidden process) with transitions t_AA, t_AC, …; state A emits the symbols x, y (what we see) with probabilities e_A(x), e_A(y)]
40
HMM main questions
Given HMM M (with transition probabilities t_pq) and an observation X:
• probability of the observation?
• most probable state sequence?
• how to find the parameters of the model M? training
41
Three Important Questions (see also L.R. Rabiner (1989))
• How likely is a given sequence?
– The Forward algorithm (probability over all paths)
• What is the most probable “path” for
generating a given sequence?
– The Viterbi algorithm
• How can we learn the HMM parameters given
a set of sequences?
– The Forward-Backward (Baum-Welch) algorithm
42
probability … !
Given sequence X: most probable state vs. optimal path
* most probable state (over all state sequences)
posterior decoding
using forward & backward probabilities
* most probable path (= single state sequence)
Viterbi
probability of a state
[Figure: small example HMM with states start, s1, s2, end and transition probabilities 1, 0.4, 0.6, 0.7, 0.3, 0.4, 0.6, 0.5, 0.5, 1; it illustrates that the most probable state at a position need not lie on the single most probable path]
43
The Forward Algorithm:
probability of observation X
dynamic programming: f_q(i), the probability of ending in state q while emitting symbol x_i
[Figure: trellis with states A, B, C (rows) and sequence positions x_1 … x_{i-2}, x_{i-1}, x_i (columns); f_q(i) is computed column by column from the previous column]
44
The Forward Algorithm:
probability of observation X
The probability of observing x_1, …, x_i and ending in state q is the 'forward' probability:

  f_q(i) = Pr(x_1 … x_i, π_i = q)
  f_q(i) = e_q(x_i) · Σ_p f_p(i-1) · t_pq
  P(X) = Σ_q f_q(L) · t_q*      (* = end-state)
45
Probability of observation:
weather example

[Weather HMM as before: initial probabilities pH = 0.4, pM = 0.2, pL = 0.4; emission probabilities for ( R, C, S ): H (0.1, 0.2, 0.7), M (0.3, 0.4, 0.3), L (0.6, 0.3, 0.1)]

Forward table for the observation R C S … (each factor written ×10):

        start   1:R          2:C
  H       0     4·1 = 4      (4·6 + 6·4 + 24·1)·2 = 144   (×10^-4)
  M       0     2·3 = 6      (4·3 + 6·2 + 24·5)·4 = 576   (×10^-4)
  L       0     4·6 = 24     (4·1 + 6·4 + 24·4)·3 = 372   (×10^-4)

Start (column 1:R): initial probability × emission, P(R…).
Column 2:C, e.g. for state H: (remain in H + coming from M + coming from L) × emission.
P( RCCSS ) = P( RC… ) continued over the remaining columns.
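A sketch of the forward algorithm for this weather HMM; it reproduces the table values above (4, 6, 24 and 144, 576, 372). For simplicity it assumes no explicit end state, so P(X) is the plain sum over the last column.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}
states = "HML"

def forward(obs):
    """f[i][q] = P(x_1..x_i, state_i = q); returns the table and P(obs)."""
    f = [{q: init[q] * emit[q][obs[0]] for q in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({q: emit[q][x] * sum(prev[p] * trans[p][q] for p in states)
                  for q in states})
    return f, sum(f[-1].values())          # no explicit end state assumed

f, p_obs = forward("RCCSS")
print(f[0])    # {'H': 0.04, 'M': 0.06, 'L': 0.24}       -> 4, 6, 24   (x10^-2)
print(f[1])    # {'H': 0.0144, 'M': 0.0576, 'L': 0.0372} -> 144, 576, 372 (x10^-4)
print(p_obs)   # P(RCCSS), summed over all state paths
```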
46
HMM: posterior decoding
[Figure: trellis split at position i; the forward part covers x_1 … x_i ending in state q, the backward part covers x_{i+1} … x_L starting from state q]

forward · backward, with the i-th state equal to q:
  P(π_i = q, X) = f_q(i) · b_q(i)
Summing this over q at any position i gives P(X), so P(π_i = q | X) = f_q(i) · b_q(i) / P(X).
47
HMM main questions
Given HMM M (transition probabilities t_pq) and observation X:
• probability of this observation?
• most probable state sequence?
• how to find the model? training

Again: we cannot try all possibilities (there are exponentially many state sequences).
Viterbi: the most probable state sequence for X.
48
Viterbi algorithm
most probable state sequence for observation X
(1) dynamic programming: v_q(i), the probability of the most probable path ending in state q while emitting x_i
[Figure: trellis with states A, B, C over positions x_1 … x_i; v_q(i) is computed column by column]
49
Decoding Problem: The Viterbi algorithm
(1) dynamic programming: maximum probability of a path ending in state q at position x_i
(2) traceback: most probable state sequence
[Figure: trellis with states A, B, C over the given sequence x_1 … x_i … x_L; the traceback starts from the best end state (here q = B)]
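A sketch of Viterbi decoding (dynamic programming plus traceback) for the same weather HMM, again assuming no explicit end state. Tie-breaking between equally good predecessors is arbitrary, so other equally probable paths may exist.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}
states = "HML"

def viterbi(obs):
    # (1) dynamic programming: v[i][q] = best probability of a path ending in q
    v = [{q: init[q] * emit[q][obs[0]] for q in states}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for q in states:
            best_p, best_prob = max(((p, v[-1][p] * trans[p][q]) for p in states),
                                    key=lambda t: t[1])
            col[q] = best_prob * emit[q][x]
            ptr[q] = best_p
        v.append(col)
        back.append(ptr)
    # (2) traceback: follow the pointers from the best final state
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[-1][last]

print(viterbi("RCCSS"))   # ('LMHHH', 0.000677...) with this tie-breaking
```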
50
Posterior Decoding Problem
Another decoding method, Posterior Decoding:
Input:
Given a Hidden Markov Model M = (Σ, Q, Θ) and a
sequence X for which the generating path P is
unknown.
Question:
For each 1 ≤ i ≤ L (the length of the path P) and state
k in Q compute the probability: P(πi = k | X).
51
Posterior Decoding Problem
P(πi = k | X) gives two additional decoding possibilities:
1. An alternative 'path' P* that follows the maximum-probability states: π*_i = argmax_{k in Q} P(πi = k | X).
2. Define a function g(k) on the states k in Q, then
     G(i | X) = Σ_k P(πi = k | X) · g(k)
We can use 2) to calculate the posterior probability of each nucleotide being in a CpG island, using a function g(k) defined on all states k in Q:
  g(k) = 1 for all k that are CpG-island states, 0 otherwise.
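A sketch of posterior decoding via forward and backward probabilities for the weather HMM (no explicit end state assumed); the per-position distributions are exactly the P(πi = k | X) defined above, and a g(k)-weighted sum as in point 2 would be one extra line.

```python
init  = {'H': 0.4, 'M': 0.2, 'L': 0.4}
trans = {'H': {'H': 0.6, 'M': 0.3, 'L': 0.1},
         'M': {'H': 0.4, 'M': 0.2, 'L': 0.4},
         'L': {'H': 0.1, 'M': 0.5, 'L': 0.4}}
emit  = {'H': {'R': 0.1, 'C': 0.2, 'S': 0.7},
         'M': {'R': 0.3, 'C': 0.4, 'S': 0.3},
         'L': {'R': 0.6, 'C': 0.3, 'S': 0.1}}
states = "HML"

def forward(obs):
    f = [{q: init[q] * emit[q][obs[0]] for q in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({q: emit[q][x] * sum(prev[p] * trans[p][q] for p in states)
                  for q in states})
    return f

def backward(obs):
    b = [{q: 1.0 for q in states}]                  # b_q(L) = 1
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {p: sum(trans[p][q] * emit[q][x] * nxt[q] for q in states)
                     for p in states})
    return b

def posterior(obs):
    f, b = forward(obs), backward(obs)
    px = sum(f[-1].values())                        # P(X)
    return [{q: f[i][q] * b[i][q] / px for q in states} for i in range(len(obs))]

for i, dist in enumerate(posterior("RCCSS")):
    print(i + 1, max(dist, key=dist.get), dist)     # best state per position
```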
52
HMM Decoding: two explanations
posterior (Σ): best state at every position
  But: the resulting path may not be allowed by the model
Viterbi (max): optimal global path
  But: there may be many paths with similar probability
53
HMM for Hidden Coin Tossing
[Figure: two hidden states, each emitting H or T with its own probabilities; only the sequence of coin flips is observed]
……… H H T T H T H H T T H
54
dishonest casino dealer
55
dishonest casino dealer
56
dishonest casino dealer
Observation:        366163666466232534413661661163252562462255265252266435353336
Viterbi:            LLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Compare to:
Forward:            FFLLLLLLLLLLLLFFFFFFFFLFLLLFLLFFFFFFFFFFFFFFFFFFFFLFFFFFFFFF
Posterior (total):  LLLLLLLLLLLLFFFFFFFFFLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
(F = fair die, L = loaded die)
57
Learning if correct path is known
• Learning is simple if we know the correct path for each
sequence in our training set
• estimate parameters by counting the number of times each
parameter is used across the training set
58
Sketch: Parameter estimation
training sequences X(i)
optimize score for model Θ.
If state sequences are known:
  count transitions p→q, and divide by the total number of transitions out of state p
  count emissions of symbol b in state q, and divide by the total number of emissions in state q
Laplace correction for dealing with 'zero' probabilities: add 1 to each count.
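A minimal sketch of this counting procedure with the Laplace correction; the fair/loaded labels and the single training pair are toy data, not lecture material.

```python
from collections import Counter

states, alphabet = "FL", "123456"          # e.g. fair/loaded dice
training = [("315662666", "FFFLLLLLL")]    # (observation, known state path) pairs

trans_counts = Counter()
emit_counts = Counter()
for obs, path in training:
    for p, q in zip(path, path[1:]):
        trans_counts[p, q] += 1            # count transitions p -> q
    for q, b in zip(path, obs):
        emit_counts[q, b] += 1             # count emissions of b in state q

def estimate(counts, rows, cols):
    """counts[(r, c)] -> probabilities with a +1 pseudocount per cell."""
    est = {}
    for r in rows:
        total = sum(counts[r, c] + 1 for c in cols)
        est[r] = {c: (counts[r, c] + 1) / total for c in cols}
    return est

trans = estimate(trans_counts, states, states)
emit = estimate(emit_counts, states, alphabet)
print(trans["F"])    # {'F': 0.6, 'L': 0.4}  from counts 2 and 1 plus pseudocounts
```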
59
Learning With Hidden State
• If we don’t know the correct path for each sequence
in our training set, consider all possible paths for the
sequence
• Estimate parameters through a procedure that counts
the expected number of times each parameter is used
across the training set.
60
Learning Parameters: The Baum-Welch Algorithm
• Here we use the Forward-Backward algorithm
• An Expectation Maximization (EM) algorithm
– EM is a family of algorithms for learning probabilistic
models in problems that involve hidden states
• In this context, the hidden state is the path that best
explains each training sequence.
• Note, finding the parameters of the HMM that optimally
explains the given sequences is NP-Complete!
61
HMM: state sequences unknown: Baum-Welch
Baum-Welch training
• Based on given HMM Θ
• Given a training set of sequences X
• Determine:
  – expected number of transitions and
  – expected number of emissions
• Apply ML and build a new (better) model:
– ML tries to find a model that gives the
training data the highest likelihood
• Iterate until convergence.
Note:
• can get stuck in local maxima
• does not understand the semantics of the states
Baum-Welch Re-estimation
62
Probability of being in state q when emitting x_i:
  P(π_i = q | X) = f_q(i) · b_q(i) / P(X)
Probability of the transition (p, q) after emitting x_i:
  P(π_i = p, π_{i+1} = q | X) = f_p(i) · t_pq · e_q(x_{i+1}) · b_q(i+1) / P(X)
For the re-estimation we need the expected counts for the transitions and the emissions in the HMM:
• apply the forward-backward algorithm.
Baum-Welch
63
Estimation of the transition probability: expected transition counts
  A_pq = Σ_X Σ_i f_p(i) · t_pq · e_q(x_{i+1}) · b_q(i+1) / P(X)
  (sum over all training sequences X and all positions i)

Estimation of the emission probability: expected emission counts
  E_q(b) = Σ_X Σ_{i : x_i = b} f_q(i) · b_q(i) / P(X)
  (sum over all training sequences X and all positions i with x_i = b)

Estimate parameters by the ratio of expected counts:
  t_pq = A_pq / Σ_r A_pr        e_q(b) = E_q(b) / Σ_c E_q(c)
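A sketch of one Baum-Welch re-estimation step built from the expected-count formulas above. The two-state model, toy sequences, absence of an end state, and lack of scaling are all simplifications; practical implementations renormalize or work in log space and usually add pseudocounts.

```python
states, alphabet = "FL", "16"
init  = {'F': 0.5, 'L': 0.5}
trans = {'F': {'F': 0.9, 'L': 0.1}, 'L': {'F': 0.2, 'L': 0.8}}
emit  = {'F': {'1': 0.5, '6': 0.5}, 'L': {'1': 0.2, '6': 0.8}}

def forward(obs):
    f = [{q: init[q] * emit[q][obs[0]] for q in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({q: emit[q][x] * sum(prev[p] * trans[p][q] for p in states)
                  for q in states})
    return f

def backward(obs):
    b = [{q: 1.0 for q in states}]
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {p: sum(trans[p][q] * emit[q][x] * nxt[q] for q in states)
                     for p in states})
    return b

def baum_welch_step(sequences):
    A = {p: {q: 0.0 for q in states} for p in states}      # expected transition counts
    E = {q: {s: 0.0 for s in alphabet} for q in states}    # expected emission counts
    for obs in sequences:
        f, b = forward(obs), backward(obs)
        px = sum(f[-1].values())                            # P(X)
        for i in range(len(obs)):
            for q in states:
                E[q][obs[i]] += f[i][q] * b[i][q] / px
                if i + 1 < len(obs):
                    for r in states:
                        A[q][r] += (f[i][q] * trans[q][r]
                                    * emit[r][obs[i + 1]] * b[i + 1][r] / px)
    new_trans = {p: {q: A[p][q] / sum(A[p].values()) for q in states} for p in states}
    new_emit = {q: {s: E[q][s] / sum(E[q].values()) for s in alphabet} for q in states}
    return new_trans, new_emit

print(baum_welch_step(["16611", "66666111"]))   # re-estimated (transition, emission)
```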
64
Baum-Welch training
concerns:
• guaranteed to converge (in the target score, not in Θ)
• unstable solutions!
• local maxima
practical:
• small values -> renormalize
tips:
• repeat for several initial HMMs Θ
• start with a meaningful HMM Θ
65
Viterbi training
Viterbi training (sketch):
• determine optimal paths
• re-compute as if paths are known
• score may decrease!
66
Computational Complexity of HMM Algorithms
• Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is O(S²L)
  – This assumes that the states are densely interconnected
• Given M training sequences of length L, the complexity of Baum-Welch on each iteration is O(MS²L)
67
Important Papers on HMM
L.R. Rabiner, A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition,
Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
A. Krogh, I. Saira Mian, D. Haussler, A Hidden Markov Model
that finds genes in E. coli DNA, Nucleic Acids Research,
Vol. 22 (1994), pp 4768-4778
Furthermore:
R. Hassan, A combination of hidden Markov model and fuzzy
model for stock market forecasting, Neurocomputing
archive, Vol. 72 , Issue 16-18, pp 3439-3446, October
2009.
68
Applications
Hidden Markov Models
69
model topology
[Figure: fully connected model over the states A, C, G, T]
many states & fully connected: training seldom works => local maxima
use knowledge about the problem.
For example: use a linear model for profile alignment:
[Figure: linear chain begin → M1 → M2 → … → end]
70
silent states
quadratic vs. linear number of transitions, but fewer modeling possibilities
[Figure: chain of square emitting states 1 2 3 4 5 with round silent states (no emission) above them; high/low transition probabilities allow skipping nodes]
71
silent states: algorithm
Previously (forward algorithm): from state p to state q, a transition term and an emission term.
For emitting states q => calculated as before.
But for silent states q:
  - there is no emission term
  - no silent loops (!)
  - update in 'topological order'
72
profile alignment (no gaps)
profile HMM P: 'dedicated topology'
Let e_i(b) be the probability of observing symbol b at position i; then the probability of a sequence under the (ungapped) profile is ∏_i e_i(x_i).

Assume a given profile set:

  pos: 12345678
       VGAHAGEY
       VTGNVDEV
       VEADVAGH
       VKSNDVAD
       VYSTVETS
       FNANIPKH
       IAGNGAGV

No gaps => transition probabilities are 1: trivial alignment of the HMM to a sequence.
[Figure: linear profile HMM begin → M1 → … → M4 → … → end; the emission probability distribution e_4(b) sits at state M4]
73
affine model: open gap / extension

profile alignment with gaps

[Figure: match states M_j, M_{j+1} with an insert state I_j between them]

Given profile sequences:

  pos: 123__45678
       VGA--HAGEY
       VNA--NVDEV
       VEA--DVAGH
       VKG--NYDED
       VYS--TYETS
       FNA--NIPKH
       IAGADNGAGV

Emission probability distribution based on:
  - background probabilities: e_i(b) = p(b)
  - or based on the alignment (match)
74
profile alignment with gaps and deletes
[Figure: match states M_{j-1}, M_j, M_{j+1}; insert state I_j; delete states D_{j-1}, D_j (silent)]

Given profile sequences:

  pos: 123__45678
       VGA--HAGEY
       V----NVDEV
       VEA--DVAGH
       VKG------D
       VYS--TYETS
       FNA--NIPKH
       IAGADNGAGV

adapt Viterbi => (next slide)
75
HMM for profiles / multiple alignment
[Figure: profile HMM column j with match state M_j, insert state I_j and (silent) delete state D_j; begin → … → M_j → … → end]

Deletion (D): same level
Insertion (I): same position
Match (M)

Viterbi recursions (Y ranges over {M, I, D}):

  v_j^M(i) = e_{M_j}(x_i) · max_Y { v_{j-1}^Y(i-1) · t(Y_{j-1}, M_j) }
  v_j^I(i) = p(x_i)       · max_Y { v_j^Y(i-1)     · t(Y_j, I_j) }
  v_j^D(i) =                max_Y { v_{j-1}^Y(i)   · t(Y_{j-1}, D_j) }
76
profile alignment
given multiple alignment, with Insertion / Deletion states:

  pos: 123__45678
       VGA--HAGEY
       V----NVDEV
       VEA--DVAGH
       VKG------D
       VYS--TYETS
       FNA--NIPKH
       IAGADNGAGV

Example counting for state 1 (Laplace correction, i.e., adding 1 to each count to avoid dividing by 0):

transitions:
  M1→M2      6+1    7/10
  M1→I1      0+1    1/10
  M1→D1      1+1    2/10

emissions:
  F          1+1    2/27
  I          1+1    2/27
  V          5+1    6/27
  other 17×  0+1    1/27
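A sketch that reproduces the state-1 counts on this slide from the alignment, with the +1 Laplace correction. The choice of which alignment columns are match columns (the first three and the last five) follows the position row '123__45678'.

```python
from collections import Counter

alignment = ["VGA--HAGEY",
             "V----NVDEV",
             "VEA--DVAGH",
             "VKG------D",
             "VYS--TYETS",
             "FNA--NIPKH",
             "IAGADNGAGV"]
match_cols = [0, 1, 2, 5, 6, 7, 8, 9]        # alignment columns for match states 1..8
AMINO = "ACDEFGHIKLMNPQRSTVWY"                # 20 amino acids

# emissions at match state 1 (alignment column 0), +1 pseudocount per amino acid
emit1 = Counter(row[0] for row in alignment if row[0] != '-')
total = sum(emit1[a] + 1 for a in AMINO)                      # 7 observed + 20 = 27
print({a: (emit1[a] + 1) / total for a in "VFI"})             # V 6/27, F 2/27, I 2/27

# transitions out of match state 1: M if the next match column has a residue,
# D if it has a gap; an insertion would need an insert column between match
# positions 1 and 2, and this alignment has none, so its raw count stays 0.
m1_to = Counter()
for row in alignment:
    if row[0] != '-':                          # the sequence visits match state 1
        m1_to['M' if row[match_cols[1]] != '-' else 'D'] += 1
tot = sum(m1_to[s] + 1 for s in "MID")                        # (6 + 0 + 1) + 3 = 10
print({s: (m1_to[s] + 1) / tot for s in "MID"})               # M 7/10, I 1/10, D 2/10
```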
77
Multiple Sequence Alignment using a Profile HMM
Multiple Sequence Alignment Problem:
Given a set of sequences S1,…, Sn.
How can the set of sequences be optimally aligned?
Assume a profile HMM P is known, then:
- Align each sequence Si to the profile separately
- Accumulate the obtained alignments to a multiple
alignment
78
Multiple Sequence Alignment: using a Profile HMM
Multiple Sequence Alignment Problem:
Given sequence S1,…, Sn, how can they be optimally aligned?
Assume a profile HMM P is not known, then obtain an HMM profile P from S1,…, Sn as follows:
- Choose a length L for the profile HMM and initialize the transition and emission probabilities.
- Train the HMM using Baum-Welch on all sequences S1,…, Sn.
Now obtain the multiple alignment using this HMM P as in the previous case:
- Align each sequence Si to the profile separately
- Accumulate the obtained alignments to a multiple alignment
79
multiple alignment with profile
align each sequence separately
accumulate alignments on M and D positions
align inserts (I) leftmost

Separate alignments (profile states per residue):

  VGAHAGEY      1 2 3 4 5 6 7 8
  FNAPNI-KH     1 2 3 I 4 5 6 7 8   (the '-' at state 6 is a deletion, D)
  IAGADNGAGV    1 2 3 I I 4 5 6 7 8

Accumulated multiple alignment:

  pos: 123  45678
       VGA--HAGEY
       FNAP-NI-KH
       IAGADNGAGV
80
Gene Finding
81
gene finding
central dogma:
DNA –(transcription)→ RNA –(translation)→ protein
only 2%-3% coding … find these regions

  Prokaryotes                        Eukaryotes
  no nucleus                         nucleus
  most of the genome is coding       part of the genome is coding
  (H. influenzae: 70% coding)        (human: 2%-3% coding)
  continuous genes                   introns & exons
82
Biological Background: Transcription
Gene Expression from DNA to Protein
• Transcription
• Translation
Transcription
• DNA -> mRNA (messenger RNA).
• From the 5’ end to the 3’ end, i.e., downstream (the opposite direction is upstream)
• RNA polymerase enzyme starts a few bases upstream of the coding region, and terminates a few bases after the end of the coding region.
• The upstream and downstream regions not used for the code of the protein are called untranslated regions (UTRs)
• RNA polymerase molecules start transcription by recognizing and binding to promoter regions upstream of the transcription start sites.
• Each promoter region has a signal which can encourage or suspend transcription.
83
Biological Background: Translation
Translation mRNA -> Protein is executed by ribosomes:
• Each triplet of bases in the mRNA is a command for the ribosomes, called codon.
• 64 different possible codons and only 20 amino acids => multiple codons represent the same amino acid.
Ribosome starts scanning the mRNA molecule from 5’ end to 3’ end;
If Ribosome detects start codon (also code for Methionine)
Then
While (Ribosome did not detect any of the three possible stop codons)
do
Generating an amino acid sequence coded by the mRNA;
Scan next codon;
od
Detach the chain of amino acids from the Ribosome;
Endif (Simplified)
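The simplified ribosome scan above in code form; the five-entry codon table is a toy stand-in for the full genetic code.

```python
CODON_TABLE = {"AUG": "M", "UUU": "F", "GCU": "A", "GAA": "E", "UGG": "W"}
STOP = {"UAA", "UAG", "UGA"}

def translate(mrna):
    start = mrna.find("AUG")                 # scan 5' -> 3' for the start codon
    if start == -1:
        return ""                            # no start codon detected
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP:                    # detach at the first stop codon
            break
        protein.append(CODON_TABLE.get(codon, "X"))   # "X": codon not in toy table
    return "".join(protein)

print(translate("GGAAUGUUUGCUGAAUGGUAAGG"))   # "MFAEW"
```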
84
Codons
start
AUG
stop
UAA,
UAG,
UGA
85
Codons
start
AUG
stop
UAA,
UAG,
UGA
86
Open Reading Frames (ORFs)
3 (or 6) ORFs
start AUG
stop UAA, UAG, UGA
Coding regions stop with a stop codon.
3 out of 64 codons are stop codons, i.e., the random probability of a stop codon is 3/64
=> at random, every region of ~21 codons is expected to contain one stop
But: an average protein-coding gene is about 1000 bp [much longer]
Thus: search for long ORFs in all 3 reading frames => coding regions!
- misses short genes
- misses overlapping long ORFs on opposite strands
- too many found (6500 ORFs in the E. coli genome, but only 1100 genes)
[Figure: the three reading frames of the repeated sequence ACTGACTGACTG…, read from the 5'-end (upstream) to the 3'-end (downstream)]
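A sketch of the 'search for long ORFs' idea on one strand; the minimum length is an illustrative parameter (the 21-codon expectation above suggests that anything much longer is interesting), and the reverse complement would be scanned the same way for the other three frames.

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def orfs(dna, min_codons=100):
    """Return (frame, start, end) for ORFs of at least min_codons codons."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == START and start is None:
                start = i                              # open an ORF
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None                           # close the ORF
    return found        # repeat on the reverse complement for all 6 frames

print(orfs("ATG" + "GCT" * 120 + "TAA", min_codons=100))   # [(0, 0, 366)]
```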
87
Reading Frames
DNA strand, read 5'-end → 3'-end (downstream; the opposite direction is upstream):
  AGGCATGCGATCCAAGTTCCACCATGATGACATGATGACTA

Complementary DNA strand (antiparallel: 3'-end on the left, 5'-end on the right):
  TCCGTACGCTAGGTTCAAGGTGGTACTACTGTACTACTGAT

(base pairing: A–T, C–G)
88
Reading Frames
DNA strand (5'-end → 3'-end, downstream), three reading frames:
  AGG CAT GCG ATC CAA GTT CCA CCA TGA TGA CAT GAT GAC TA
  AGGC ATG CGA TCC AAG TTC CAC CAT GAT GAC ATG ATG ACT AA.
  GCA TGC GAT CCA AGT TCC ACC ATG ATG ACA TGA TGA CTA A..

Complementary DNA strand, three reading frames:
  ..T CCG TAC GCT AGG TTC AAG GTG GTA CTA CTG TAC TAC TGA
  .TC CGT ACG CTA GGT TCA AGG TGG TAC TAC TGT ACT ACT GAT
  TCC GTA CGC TAG GTT CAA GGT GGT ACT ACT GTA CTA CTG ATT
89
Prokaryotes DNA Structure
90
Eukaryotes DNA Structure
91
genes are not random
motto
codon frequencies (per 20): 'random' vs. coding

  Leu  Leucine      6 codons    6.9
  Ala  Alanine      4           6.5
  Trp  Tryptophan   1           1

A or T in 2nd position: sometimes 90%

Markov models:
• codon triplets as states [64 states], ~ 3rd order (but no overlap)
• triplet frequencies only
keep 3 frames in a sliding window, cf. CpG
92
promoter regions
'consensus' sequence, i.e., not exact:
  … n TTGAC n18 TATAAT n6 N n …

position weight matrix of the TATA box:

  pos   1    2    3    4    5    6
  A     2   95   26   59   51    1
  C     9    2   14   13   20    3
  G    10    1   16   15   13    0
  T    79    3   44   13   17   96

wmm: weight matrix model
wam: weight array matrix (+ dependencies between adjacent positions)
cf. 'profile'
[Figure: promoter with the TATA box upstream of the coding start]
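A sketch of scoring a 6-mer against this TATA-box weight matrix as a log-odds sum; the 0.25 background and the +1 pseudocount (needed because of the 0 for G at position 6) are assumptions, not part of the slide.

```python
import math

PWM = {'A': [ 2, 95, 26, 59, 51,  1],
       'C': [ 9,  2, 14, 13, 20,  3],
       'G': [10,  1, 16, 15, 13,  0],
       'T': [79,  3, 44, 13, 17, 96]}

def pwm_score(site, background=0.25, pseudo=1):
    """Sum of log2((count + pseudo) / column total / background) over 6 positions."""
    score = 0.0
    for i, base in enumerate(site):
        total = sum(PWM[b][i] + pseudo for b in "ACGT")
        score += math.log2((PWM[base][i] + pseudo) / total / background)
    return score

print(pwm_score("TATAAT"))   # consensus scores high (positive)
print(pwm_score("GCGCGC"))   # non-consensus scores low (negative)
```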
93
HMM Gene Finding
A. Krogh, I. Saira Mian, D. Haussler, A Hidden
Markov Model that finds genes in E. coli
DNA, Nucleic Acids Research, Vol. 22, pp
4768-4778, 1994
94
HMM Model
61 Codon Models
Start Codon Model
• ATG
• GTG
• TTG (rare)
Stop Codon Models
• TAA and TGA
• TAG
95
More Advanced Intergenic Model
96
97
Statistics on Data Set EcoSeq6
98
HMM Results
Data Set:
• EcoSeq6 contained about one third of the complete E. coli genome (total 5.44×10^6 nucleotides, 5416 genes), and was not fully annotated at that time
HMM Training:
• on ~10^6 nucleotides from the EcoSeq6 database of labeled genes (K. Rudd, 1991)
HMM Testing:
• on the remainder of ~325,000 nucleotides
Method:
• For each contig in the test set the Viterbi algorithm was used to find the most likely path through the hidden states of the HMM
• This path was then used to define a parse of the contig into genes separated by intergenic regions
99
HMM Results
Post-processing consists of 3 rules to handle:
• overlapping genes, which will look like frame-shifts
• short genes overlapping with long genes in the opposite direction, as a result of self-complementary type codons
100
HMM Results
• 80% of the labeled protein coding genes were
exactly found (i.e. with precisely the same
start and end codon)
• 5% found within 10 codons from start codon
• 5% overlap by at least 60 bases or 50%
• 5% missed completely
• Several new genes indicated
• Several insertion and deletion errors were
labeled in the contig parse
101
Important Papers on HMM
L.R. Rabiner, A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition,
Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
A. Krogh, I. Saira Mian, D. Haussler, A Hidden Markov Model
that finds genes in E. coli DNA, Nucleic Acids Research,
Vol. 22 (1994), pp 4768-4778
Furthermore:
R. Hassan, A combination of hidden Markov model and fuzzy
model for stock market forecasting, Neurocomputing
archive, Vol. 72 , Issue 16-18, pp 3439-3446, October
2009.
103
References
• Lecture notes at M. Craven’s website: www.biostat.wisc.edu/~craven
• A. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis of
Genes and Proteins (3rd ed.). John Wiley & Sons, 2004
• R.Durbin, S.Eddy, A.Krogh and G.Mitchison. Biological Sequence Analysis: Probability
Models of Proteins and Nucleic Acids. Cambridge University Press, 1998
• N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT
Press, 2004
• I. Korf, M. Yandell, and J. Bedell. BLAST. O'Reilly, 2003
• L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech
recognition. Proc. IEEE, 77:257--286, 1989
• J. C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS
Pub Co., 1997.
• M. S. Waterman. Introduction to Computational Biology: Maps, Sequences, and
Genomes. CRC Press, 1995
• A. Krogh, I. Saira Mian, D. Haussler, A Hidden Markov Model that finds genes in E. coli
DNA, Nucleic Acids Research, Vol. 22, pp 4768-4778, 1994
104
Transcription and Translation
• http://www.ebi.ac.uk/Tools/emboss/transeq/help.html
• http://www.ebi.ac.uk/2can/tutorials/transcription.html