Generalized Hidden Markov Models
for Eukaryotic Gene Prediction
CBB 231 / COMPSCI 261
W.H. Majoros
Recall: Gene Syntax

[Figure: structure of a complete mRNA. The coding segment runs from the ATG start codon to the TGA stop codon; exons alternate with introns, each intron beginning at a GT donor site and ending at an AG acceptor site.]
Recall: “Pure” HMMs

An HMM is a stochastic machine M=(Q, Σ, Pt, Pe) consisting of the following:
• a finite set of states, Q={q0, q1, ..., qm}
• a finite alphabet Σ={s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ → [0,1], i.e., Pe(sj | qi)

[Figure: a three-state HMM. The silent start state q0 transitions to q1 with probability 100%; q1 self-loops with probability 80%, transitions to q2 with 15%, and returns to q0 with 5%; q2 self-loops with 70% and transitions to q1 with 30%. State q1 emits Y 100% of the time (R: 0%); q2 emits R 100% of the time (Y: 0%).]
An Example

M1 = ({q0,q1,q2}, {Y,R}, Pt, Pe)
Pt = {(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}
Pe = {(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}
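As a quick check on the definitions, here is a minimal sketch (not from the slides) of M1 in Python, computing the joint probability of a state path and an emitted sequence:

```python
# A sketch of M1 as plain Python dicts (state 0 is the silent start
# state q0; 'Y' and 'R' are the two alphabet symbols).
Pt = {(0, 1): 1.0, (1, 1): 0.8, (1, 2): 0.15, (1, 0): 0.05,
      (2, 2): 0.7, (2, 1): 0.3}
Pe = {(1, 'Y'): 1.0, (1, 'R'): 0.0, (2, 'Y'): 0.0, (2, 'R'): 1.0}

def path_probability(path, seq):
    """Joint probability of a state path and an emitted sequence:
    the product of all transition and emission probabilities,
    including the transitions out of and back into q0."""
    p = Pt.get((0, path[0]), 0.0)
    for i, (state, sym) in enumerate(zip(path, seq)):
        p *= Pe.get((state, sym), 0.0)
        if i + 1 < len(path):
            p *= Pt.get((path[i], path[i + 1]), 0.0)
    return p * Pt.get((path[-1], 0), 0.0)

print(path_probability([1, 2, 2, 1], "YRRY"))  # 0.15*0.7*0.3*0.05 = 0.001575
```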
Generalized HMMs

A GHMM is a stochastic machine M=(Q, Σ, Pt, Pe, Pd) consisting of the following:
• a finite set of states, Q={q0, q1, ..., qm}
• a finite alphabet Σ={s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ*×ℕ → [0,1], i.e., Pe(sj | qi, dj)
• a duration distribution Pd : Q×ℕ → [0,1], i.e., Pd(dj | qi)
Key Differences

• Each state now emits an entire subsequence rather than just one symbol.
• Feature lengths are now explicitly modeled, rather than implicitly geometric.
• Emission probabilities can now be modeled by any arbitrary probabilistic model.
• There tend to be far fewer states => simplicity & ease of modification.

Ref: Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. ISMB '96.
HMMs & Geometric Feature Lengths

When a feature is emitted by a single self-looping state with self-transition probability p, a run of length d has probability

$$P(x_0 ... x_{d-1} \mid \theta) = \left( \prod_{i=0}^{d-1} P_e(x_i \mid \theta) \right) p^{d-1} (1 - p)$$

[Figure: the implied geometric length distribution plotted against exon length.]
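A quick numerical sketch of this fact (the value of p is hypothetical): a state with self-transition probability p stays for d symbols with probability p^(d−1)(1−p), giving a mean duration of 1/(1−p):

```python
# Duration distribution induced by a self-looping HMM state:
# remain d-1 times, then leave, so P(d) = p**(d-1) * (1-p).
p = 0.8  # hypothetical self-transition probability

def duration_prob(d, p):
    return p ** (d - 1) * (1 - p)

mean = sum(d * duration_prob(d, p) for d in range(1, 10_000))
print(round(mean, 3))  # ~5.0, i.e., 1/(1-p); real exon lengths are not geometric
```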
Model Abstraction in GHMMs

Advantages:
• submodel abstraction
• architectural simplicity
• state duration modeling

Disadvantages:
• decoding complexity
Typical GHMM Topology for Gene Finding

Fixed-length states are called signal states (diamonds). Variable-length states are called content states (ovals). Each state has a separate submodel or sensor.

Sparse: the in-degree of all states is bounded by some small constant.
Some GHMM Submodel Types

1. WMM (Weight Matrix)
2. Nth-order Markov Chain (MC)
3. Three-Periodic Markov Chain (3PMC)
5. Codon Bias
6. MDD
7. Interpolated Markov Model

WMM:
$$\prod_{i=0}^{L-1} P_i(x_i)$$

Nth-order Markov chain:
$$\left( \prod_{i=0}^{n-1} P(x_i \mid x_0 ... x_{i-1}) \right) \prod_{i=n}^{L-1} P(x_i \mid x_{i-n} ... x_{i-1})$$

3PMC:
$$\prod_{i=0}^{L-1} P_{(f+i) \bmod 3}(x_i)$$

Codon bias:
$$\prod_{i=0}^{n-1} P(x_{3i} \, x_{3i+1} \, x_{3i+2})$$
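A short sketch of how the first and third of these sensors score a putative feature, per the formulas above (the probability tables here are hypothetical placeholders):

```python
import math

def score_wmm(seq, P):
    """WMM: each position i has its own distribution P[i]; the score
    is the product over positions (computed here in log space)."""
    return sum(math.log(P[i][x]) for i, x in enumerate(seq))

def score_3pmc(seq, P, f=0):
    """3PMC: position i is scored by chain (f+i) mod 3, where f is
    the codon phase (0th-order chains for simplicity)."""
    return sum(math.log(P[(f + i) % 3][x]) for i, x in enumerate(seq))

# Hypothetical tables: a uniform WMM over a 3 bp feature, and three
# phase-specific base distributions for the 3PMC.
wmm = [{b: 0.25 for b in "ACGT"} for _ in range(3)]
pmc = [{b: 0.25 for b in "ACGT"} for _ in range(3)]
print(score_wmm("ATG", wmm), score_3pmc("ATGGAT", pmc))
```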
Interpolated Markov Model (IMM):

$$P_e^{IMM}(s \mid g_0...g_{k-1}) = \begin{cases} \lambda_k^G \, P_e(s \mid g_0...g_{k-1}) + (1 - \lambda_k^G) \, P_e^{IMM}(s \mid g_1...g_{k-1}) & \text{if } k > 0 \\ P_e(s) & \text{if } k = 0 \end{cases}$$

Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis, Stanford University.
Ref: Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26:544-548.
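The recursion translates directly into code; a minimal sketch, with hypothetical Pe and lambda tables indexed by context length:

```python
def pe_imm(s, context, Pe, lam):
    """Interpolated Markov model emission, following the recursion
    above. Pe[k] maps (context, s) -> probability for order-k
    contexts and lam[k] is the interpolation weight lambda_k^G;
    both tables are hypothetical placeholders."""
    k = len(context)
    if k == 0:
        return Pe[0][("", s)]
    return (lam[k] * Pe[k][(context, s)]
            + (1 - lam[k]) * pe_imm(s, context[1:], Pe, lam))
```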
Recall: Decoding with an HMM

$$\phi_{max} = \operatorname{argmax}_{\phi} P(\phi \mid S) = \operatorname{argmax}_{\phi} \frac{P(\phi \wedge S)}{P(S)} = \operatorname{argmax}_{\phi} P(\phi \wedge S) = \operatorname{argmax}_{\phi} P(S \mid \phi) \, P(\phi)$$

With transition probabilities Pt and emission probabilities Pe,

$$P(\phi) = \prod_{i=0}^{L} P_t(y_{i+1} \mid y_i), \qquad P(S \mid \phi) = \prod_{i=0}^{L-1} P_e(x_i \mid y_{i+1})$$

so that

$$\phi_{max} = \operatorname{argmax}_{\phi} \; P_t(q_0 \mid y_L) \prod_{i=0}^{L-1} P_e(x_i \mid y_{i+1}) \, P_t(y_{i+1} \mid y_i)$$
Decoding with a GHMM

The same derivation holds, but the parse φ is now a sequence of states with explicit durations, so duration probabilities join the transition and emission terms:

$$P(\phi) = \prod_{i=0}^{|\phi|-2} P_t(y_{i+1} \mid y_i) \, P_d(d_i \mid y_i), \qquad P(S \mid \phi) = \prod_{i=1}^{|\phi|-2} P_e(S_i \mid y_i, d_i)$$

$$\phi_{max} = \operatorname{argmax}_{\phi} \prod_{i=0}^{|\phi|-2} P_e(S_i \mid y_i, d_i) \, P_t(y_{i+1} \mid y_i) \, P_d(d_i \mid y_i)$$
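To make the factorization concrete, a sketch (all interfaces are assumptions, not the slides' notation) that scores a single candidate parse as a sum of log transition, duration, and emission terms:

```python
import math

def parse_score(parse, Pt, Pe, Pd, q0="q0"):
    """Log score of one candidate parse under the factorization
    above. `parse` is a list of (state, subsequence) pairs; Pt, Pe,
    Pd are probability lookups assumed to return nonzero values."""
    score = 0.0
    prev = q0
    for state, subseq in parse:
        d = len(subseq)
        score += math.log(Pt(state, prev))       # transition prob.
        score += math.log(Pd(d, state))          # duration prob.
        score += math.log(Pe(subseq, state, d))  # emission prob.
        prev = state
    return score
```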
Recall: Viterbi Decoding for HMMs

$$V(i,k) = \begin{cases} \max_j V(j, k-1) \, P_t(q_i \mid q_j) \, P_e(x_k \mid q_i) & \text{if } k > 0 \\ P_t(q_i \mid q_0) \, P_e(x_0 \mid q_i) & \text{if } k = 0 \end{cases}$$

[Figure: the Viterbi trellis of states × sequence positions; cell (i,k) is computed from the cells in column k−1.]

run time: O(L×|Q|²)
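A direct log-space implementation of this recurrence (a sketch; the Pt/Pe lookup interfaces are assumptions):

```python
import math

def logp(x):
    return math.log(x) if x > 0 else float("-inf")

def viterbi(seq, states, Pt, Pe):
    """Log-space Viterbi, directly following the recurrence above.
    Pt(qi, qj) = P(qi | qj) and Pe(x, qi) are probability lookups;
    "q0" is the silent start state."""
    L, n = len(seq), len(states)
    V = [[float("-inf")] * L for _ in range(n)]
    back = [[None] * L for _ in range(n)]
    for i in range(n):  # k = 0 column, seeded from q0
        V[i][0] = logp(Pt(states[i], "q0")) + logp(Pe(seq[0], states[i]))
    for k in range(1, L):  # O(L x |Q|^2) fill, left to right
        for i in range(n):
            j = max(range(n), key=lambda j: V[j][k-1] + logp(Pt(states[i], states[j])))
            V[i][k] = (V[j][k-1] + logp(Pt(states[i], states[j]))
                       + logp(Pe(seq[k], states[i])))
            back[i][k] = j
    i = max(range(n), key=lambda q: V[q][L-1])  # best final state
    path = [i]
    for k in range(L - 1, 0, -1):  # traceback
        i = back[i][k]
        path.append(i)
    return [states[j] for j in reversed(path)]
```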
Naive GHMM Decoding

run time: O(L³×|Q|²)
Assumptions in GHMM Decoding for Gene Finding

(1) The emission functions Pe for variable-length features are factorable, in the sense that they can be expressed as a product of terms evaluated at each position in a putative feature, i.e.,

factorable(Pe) ⟺ ∃f : Pe(S) = ∏i f(S,i)

for 0 ≤ i < |S| over any putative feature S.

(2) The lengths of noncoding features in genomes are geometrically distributed.

(3) The model contains no transitions between variable-length states.
Assumption #3

Variable-length states cannot transition directly to each other.
Efficient Decoding via Signal Sensors

Each signal state has a signal sensor:

[Figure: putative signals along the sequence, each scored by its sensor (.0045, .0032, .0031, .0021, .002, .0072, .0023, .0034, .0082) and linked into the “trellis” or “ORF graph”.]
sequence:
GCTATCGATTCTCTAATCGTCTATCGATCGTGGTATCGTACGTTCATTACTGACT...

Putative signals are detected during a single left-to-right pass over the sequence: sensor 1 detects ATG’s, sensor 2 detects GT’s, ..., sensor n detects AG’s, and each newly detected signal is inserted into its type-specific signal queue.
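A sketch of this pass, using literal motif matching in place of real scoring signal sensors (windowing and score thresholds omitted):

```python
from collections import deque

# Each "sensor" here is just a literal-motif matcher; real signal
# sensors would score a window around each site and apply a threshold.
SIGNALS = {"ATG": "start-codon", "GT": "donor", "AG": "acceptor"}

def build_signal_queues(seq):
    """Single left-to-right pass: every putative signal is appended
    to the queue for its signal type."""
    queues = {kind: deque() for kind in SIGNALS.values()}
    for pos in range(len(seq)):
        for motif, kind in SIGNALS.items():
            if seq.startswith(motif, pos):
                queues[kind].append(pos)  # insert into type-specific queue
    return queues

print(build_signal_queues("GCTATGCGGTAAAGATGAG"))
```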
Decoding with an ORF Graph

[Figure: elements of the “ATG” queue along the sequence (...ATG.........ATG......ATG..................GT); a newly detected GT signal receives trellis links back to each queued ATG.]
The Notion of “Eclipsing”

ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT
0120120120120120120120120120120120120   (codon phase)

in-frame stop codon! (the TGA at positions 12-14 lies in phase 0 of the initial ATG)
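A minimal sketch of the bookkeeping: given a putative ATG, scan its reading frame for a stop codon that would eclipse any parse extending the open reading frame past it:

```python
STOPS = {"TAA", "TAG", "TGA"}

def eclipsed(seq, atg_pos, end_pos):
    """Return True if an in-frame stop codon occurs between a
    putative ATG at atg_pos and end_pos, eclipsing that reading
    frame (a sketch of the phase track shown above)."""
    for i in range(atg_pos + 3, end_pos - 2, 3):  # stay in the ATG's frame
        if seq[i:i+3] in STOPS:
            return True
    return False

# The ATG at position 0 below is eclipsed by the in-frame TGA:
print(eclipsed("ATGGATGCTACTTGACGTACT", 0, 21))  # True
```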
Bounding the Number of Coding Predecessors
Geometric Noncoding Lengths (Assumption #2)
Prefix Sum Arrays for Emission Probabilities (Assumption #1: factorable content sensors)
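The idea behind the prefix sum arrays: under Assumption #1 the emission score of any candidate feature is a product of per-position terms, so cumulative log scores make any interval's score a constant-time difference. A minimal sketch, with a hypothetical per-position sensor term f:

```python
import math

def build_psa(seq, f):
    """Prefix sum array of log emission scores: psa[i] holds the sum
    of log f(seq, j) for j < i, where f is the factorable content
    sensor's per-position term (hypothetical interface)."""
    psa = [0.0]
    for i in range(len(seq)):
        psa.append(psa[-1] + math.log(f(seq, i)))
    return psa

def interval_score(psa, a, b):
    """Log emission score of the feature seq[a:b], in O(1)."""
    return psa[b] - psa[a]
```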
Bounding the Number of Noncoding Predecessors
Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis, Stanford University.
PSA Decoding

run time: O(L×|Q|) (assuming a sparse GHMM topology)
Traceback for GHMMs
DSP Decoding

DSP = Dynamic Score Propagation: we incrementally propagate the scores of all possible partial parses up to the current point in the sequence, during a single left-to-right pass over the sequence.

Ref: Majoros WH, Pertea M, Delcher AL, Salzberg SL (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 6:16.
run time: O(L×|Q|) (assuming a sparse GHMM topology)
PSA vs. DSP Decoding

Space complexity:
PSA: O(L×|Q|)
DSP: O(L+|Q|)
Modeling Isochores
“inhomogeneous GHMM”
Ref: Allen JE, Majoros WH, Pertea M, Salzberg SL (2006) JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl 1):S9.
Explicit Modeling of Noncoding Lengths

still O(L×|Q|), but slower by a constant factor

Ref: Stanke M, Waack S (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19:II215-II225.
MLE Training for GHMMs

$$\theta_{MLE} = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} P(S, \phi) = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \prod_{y_i \in \phi} P_e(S_i \mid y_i, d_i) \, P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i)$$

$$= \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \prod_{y_i \in \phi} P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i) \prod_{j=0}^{|S_i|-1} P_e(x_j \mid y_i)$$

Pt and Pe are estimated from labeled training data, as in an HMM:

$$a_{i,j} = \frac{A_{i,j}}{\sum_{h=0}^{|Q|-1} A_{i,h}}, \qquad e_{i,k} = \frac{E_{i,k}}{\sum_{h=0}^{|\Sigma|-1} E_{i,h}}$$

Pd is estimated by constructing a histogram of observed feature lengths.
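A counting sketch of these estimators (the labeled-parse input format is an assumption): transitions are normalized counts, and durations come straight from a length histogram:

```python
from collections import Counter

def estimate_from_labels(parses):
    """Sketch of MLE by counting, per the formulas above. Each parse
    is a list of (state, feature_length) pairs. Returns normalized
    transition probabilities and per-state duration distributions."""
    A = Counter()   # transition counts A[i,j]
    D = {}          # duration counts per state
    for path in parses:
        for k, (state, length) in enumerate(path):
            D.setdefault(state, Counter())[length] += 1
            if k + 1 < len(path):
                A[(state, path[k + 1][0])] += 1
    row = Counter()
    for (i, _), c in A.items():
        row[i] += c
    a = {(i, j): c / row[i] for (i, j), c in A.items()}
    Pd = {s: {d: c / sum(h.values()) for d, c in h.items()}
          for s, h in D.items()}
    return a, Pd
```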
Maximum Likelihood vs. Conditional Maximum Likelihood

$$\theta_{MLE} = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} P(S, \phi) = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \prod_{y_i \in \phi} P_e(S_i \mid y_i, d_i) \, P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i)$$

Maximizing Pe, Pt, and Pd separately maximizes the entire expression.

$$\theta_{CML} = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} P(\phi \mid S) = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \frac{\prod_{y_i \in \phi} P_e(S_i \mid y_i, d_i) \, P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i)}{P(S)}$$

Maximizing Pe, Pt, and Pd separately DOES NOT maximize the entire expression.

MLE ≠ CML
Discriminative Training of GHMMs

$$\theta_{discrim} = \operatorname{argmax}_{\theta} \left( \text{accuracy on training set} \right)$$
Ref: Majoros WH, Salzberg SL (2004) An empirical analysis of training protocols for probabilistic gene finders. BMC Bioinformatics 5:206.
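The argmax above has no closed form; one simple realization (entirely hypothetical, not the protocol of the referenced paper) is a coordinate-wise hill climb that scores each candidate parameter vector by its training-set accuracy:

```python
def discriminative_tune(theta, train, decode, accuracy, step=0.05, rounds=10):
    """A hedged sketch of gradient-free discriminative tuning:
    perturb one parameter at a time and keep any change that improves
    training-set accuracy. `decode` (a GHMM decoder), `accuracy`
    (e.g., exon-level accuracy), and the parameterization of theta
    are all hypothetical stand-ins."""
    def score(params):
        return accuracy([(decode(S, params), phi) for S, phi in train])
    best = score(theta)
    for _ in range(rounds):
        for name in list(theta):
            for delta in (step, -step):
                trial = dict(theta)
                trial[name] += delta
                acc = score(trial)
                if acc > best:  # keep only improving moves
                    theta, best = trial, acc
    return theta
```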
Summary

• GHMMs generalize HMMs by allowing each state to emit a subsequence rather than just a single symbol.
• Whereas HMMs model all feature lengths using a geometric distribution, coding features can be modeled using an arbitrary length distribution in a GHMM.
• Emission models within a GHMM can be any arbitrary probabilistic model (“submodel abstraction”), such as a neural network or decision tree.
• GHMMs tend to have many fewer states => simplicity & modularity.
• When used for parsing sequences, HMMs and GHMMs should be trained discriminatively, rather than via MLE, to achieve optimal predictive accuracy.