Generalized Hidden Markov Models
for Eukaryotic Gene Prediction
CBB 231 / COMPSCI 261
W.H. Majoros
Recall: Gene Syntax

[Figure: structure of a complete mRNA. The coding segment runs from the ATG start codon to the TGA stop codon; exons alternate with introns, each intron beginning at a GT donor site and ending at an AG acceptor site.]
Recall: “Pure” HMMs

An HMM is a stochastic machine M=(Q, Σ, Pt, Pe) consisting of the following:
• a finite set of states, Q={q0, q1, ..., qm}
• a finite alphabet Σ={s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ → [0,1], i.e., Pe(sj | qi)

[Figure: a three-state HMM. The silent start state q0 transitions to q1 with probability 100%; q1 self-loops with probability 80%, transitions to q2 with 15%, and returns to q0 with 5%; q2 self-loops with 70% and transitions to q1 with 30%. State q1 emits Y 100% of the time (R: 0%); q2 emits R 100% of the time (Y: 0%).]
An Example

M1 = ({q0,q1,q2}, {Y,R}, Pt, Pe)
Pt = {(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}
Pe = {(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}
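As a quick check on the definitions, here is a minimal sketch (not from the slides) of M1 in Python, computing the joint probability of a state path and an emitted sequence:

```python
# A sketch of M1 as plain Python dicts (state 0 is the silent start
# state q0; 'Y' and 'R' are the two alphabet symbols).
Pt = {(0, 1): 1.0, (1, 1): 0.8, (1, 2): 0.15, (1, 0): 0.05,
      (2, 2): 0.7, (2, 1): 0.3}
Pe = {(1, 'Y'): 1.0, (1, 'R'): 0.0, (2, 'Y'): 0.0, (2, 'R'): 1.0}

def path_probability(path, seq):
    """Joint probability of a state path and an emitted sequence:
    the product of all transition and emission probabilities,
    including the transitions out of and back into q0."""
    p = Pt.get((0, path[0]), 0.0)
    for i, (state, sym) in enumerate(zip(path, seq)):
        p *= Pe.get((state, sym), 0.0)
        if i + 1 < len(path):
            p *= Pt.get((path[i], path[i + 1]), 0.0)
    return p * Pt.get((path[-1], 0), 0.0)

print(path_probability([1, 2, 2, 1], "YRRY"))  # 0.15*0.7*0.3*0.05 = 0.001575
```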
Generalized HMMs

A GHMM is a stochastic machine M=(Q, Σ, Pt, Pe, Pd) consisting of the following:
• a finite set of states, Q={q0, q1, ..., qm}
• a finite alphabet Σ={s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ*×ℕ → [0,1], i.e., Pe(sj | qi, dj)
• a duration distribution Pd : Q×ℕ → [0,1], i.e., Pd(dj | qi)
Key Differences

• Each state now emits an entire subsequence rather than just one symbol.
• Feature lengths are now explicitly modeled, rather than implicitly geometric.
• Emission probabilities can now be modeled by any arbitrary probabilistic model.
• There tend to be far fewer states => simplicity & ease of modification.

Ref: Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. ISMB '96.
HMMs & Geometric Feature Lengths

When a feature is emitted by a single self-looping state with self-transition probability p, a run of length d has probability

$$P(x_0 ... x_{d-1} \mid \theta) = \left( \prod_{i=0}^{d-1} P_e(x_i \mid \theta) \right) p^{d-1} (1 - p)$$

[Figure: the implied geometric length distribution plotted against exon length.]
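A quick numerical sketch of this fact (the value of p is hypothetical): a state with self-transition probability p stays for d symbols with probability p^(d−1)(1−p), giving a mean duration of 1/(1−p):

```python
# Duration distribution induced by a self-looping HMM state:
# remain d-1 times, then leave, so P(d) = p**(d-1) * (1-p).
p = 0.8  # hypothetical self-transition probability

def duration_prob(d, p):
    return p ** (d - 1) * (1 - p)

mean = sum(d * duration_prob(d, p) for d in range(1, 10_000))
print(round(mean, 3))  # ~5.0, i.e., 1/(1-p); real exon lengths are not geometric
```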
Model Abstraction in GHMMs

Advantages:
• submodel abstraction
• architectural simplicity
• state duration modeling

Disadvantages:
• decoding complexity
Typical GHMM Topology for Gene Finding

Fixed-length states are called signal states (diamonds). Variable-length states are called content states (ovals). Each state has a separate submodel or sensor.

Sparse: the in-degree of all states is bounded by some small constant.
Some GHMM Submodel Types

1. WMM (Weight Matrix)
2. Nth-order Markov Chain (MC)
3. Three-Periodic Markov Chain (3PMC)
5. Codon Bias
6. MDD
7. Interpolated Markov Model

WMM:
$$\prod_{i=0}^{L-1} P_i(x_i)$$

Nth-order Markov chain:
$$\left( \prod_{i=0}^{n-1} P(x_i \mid x_0 ... x_{i-1}) \right) \prod_{i=n}^{L-1} P(x_i \mid x_{i-n} ... x_{i-1})$$

3PMC:
$$\prod_{i=0}^{L-1} P_{(f+i) \bmod 3}(x_i)$$

Codon bias:
$$\prod_{i=0}^{n-1} P(x_{3i} \, x_{3i+1} \, x_{3i+2})$$
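A short sketch of how the first and third of these sensors score a putative feature, per the formulas above (the probability tables here are hypothetical placeholders):

```python
import math

def score_wmm(seq, P):
    """WMM: each position i has its own distribution P[i]; the score
    is the product over positions (computed here in log space)."""
    return sum(math.log(P[i][x]) for i, x in enumerate(seq))

def score_3pmc(seq, P, f=0):
    """3PMC: position i is scored by chain (f+i) mod 3, where f is
    the codon phase (0th-order chains for simplicity)."""
    return sum(math.log(P[(f + i) % 3][x]) for i, x in enumerate(seq))

# Hypothetical tables: a uniform WMM over a 3 bp feature, and three
# phase-specific base distributions for the 3PMC.
wmm = [{b: 0.25 for b in "ACGT"} for _ in range(3)]
pmc = [{b: 0.25 for b in "ACGT"} for _ in range(3)]
print(score_wmm("ATG", wmm), score_3pmc("ATGGAT", pmc))
```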
Interpolated Markov Model (IMM):

$$P_e^{IMM}(s \mid g_0...g_{k-1}) = \begin{cases} \lambda_k^G \, P_e(s \mid g_0...g_{k-1}) + (1 - \lambda_k^G) \, P_e^{IMM}(s \mid g_1...g_{k-1}) & \text{if } k > 0 \\ P_e(s) & \text{if } k = 0 \end{cases}$$

Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis, Stanford University.
Ref: Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26:544-548.
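The recursion translates directly into code; a minimal sketch, with hypothetical Pe and lambda tables indexed by context length:

```python
def pe_imm(s, context, Pe, lam):
    """Interpolated Markov model emission, following the recursion
    above. Pe[k] maps (context, s) -> probability for order-k
    contexts and lam[k] is the interpolation weight lambda_k^G;
    both tables are hypothetical placeholders."""
    k = len(context)
    if k == 0:
        return Pe[0][("", s)]
    return (lam[k] * Pe[k][(context, s)]
            + (1 - lam[k]) * pe_imm(s, context[1:], Pe, lam))
```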
Recall: Decoding with an HMM

$$\phi_{max} = \operatorname{argmax}_{\phi} P(\phi \mid S) = \operatorname{argmax}_{\phi} \frac{P(\phi \wedge S)}{P(S)} = \operatorname{argmax}_{\phi} P(\phi \wedge S) = \operatorname{argmax}_{\phi} P(S \mid \phi) \, P(\phi)$$

With transition probabilities Pt and emission probabilities Pe,

$$P(\phi) = \prod_{i=0}^{L} P_t(y_{i+1} \mid y_i), \qquad P(S \mid \phi) = \prod_{i=0}^{L-1} P_e(x_i \mid y_{i+1})$$

so that

$$\phi_{max} = \operatorname{argmax}_{\phi} \; P_t(q_0 \mid y_L) \prod_{i=0}^{L-1} P_e(x_i \mid y_{i+1}) \, P_t(y_{i+1} \mid y_i)$$
Decoding with a GHMM

The same derivation holds, but the parse φ is now a sequence of states with explicit durations, so duration probabilities join the transition and emission terms:

$$P(\phi) = \prod_{i=0}^{|\phi|-2} P_t(y_{i+1} \mid y_i) \, P_d(d_i \mid y_i), \qquad P(S \mid \phi) = \prod_{i=1}^{|\phi|-2} P_e(S_i \mid y_i, d_i)$$

$$\phi_{max} = \operatorname{argmax}_{\phi} \prod_{i=0}^{|\phi|-2} P_e(S_i \mid y_i, d_i) \, P_t(y_{i+1} \mid y_i) \, P_d(d_i \mid y_i)$$
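To make the factorization concrete, a sketch (all interfaces are assumptions, not the slides' notation) that scores a single candidate parse as a sum of log transition, duration, and emission terms:

```python
import math

def parse_score(parse, Pt, Pe, Pd, q0="q0"):
    """Log score of one candidate parse under the factorization
    above. `parse` is a list of (state, subsequence) pairs; Pt, Pe,
    Pd are probability lookups assumed to return nonzero values."""
    score = 0.0
    prev = q0
    for state, subseq in parse:
        d = len(subseq)
        score += math.log(Pt(state, prev))       # transition prob.
        score += math.log(Pd(d, state))          # duration prob.
        score += math.log(Pe(subseq, state, d))  # emission prob.
        prev = state
    return score
```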
Recall: Viterbi Decoding for HMMs

$$V(i,k) = \begin{cases} \max_j V(j, k-1) \, P_t(q_i \mid q_j) \, P_e(x_k \mid q_i) & \text{if } k > 0 \\ P_t(q_i \mid q_0) \, P_e(x_0 \mid q_i) & \text{if } k = 0 \end{cases}$$

[Figure: the Viterbi trellis of states × sequence positions; cell (i,k) is computed from the cells in column k−1.]

run time: O(L×|Q|²)
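A direct log-space implementation of this recurrence (a sketch; the Pt/Pe lookup interfaces are assumptions):

```python
import math

def logp(x):
    return math.log(x) if x > 0 else float("-inf")

def viterbi(seq, states, Pt, Pe):
    """Log-space Viterbi, directly following the recurrence above.
    Pt(qi, qj) = P(qi | qj) and Pe(x, qi) are probability lookups;
    "q0" is the silent start state."""
    L, n = len(seq), len(states)
    V = [[float("-inf")] * L for _ in range(n)]
    back = [[None] * L for _ in range(n)]
    for i in range(n):  # k = 0 column, seeded from q0
        V[i][0] = logp(Pt(states[i], "q0")) + logp(Pe(seq[0], states[i]))
    for k in range(1, L):  # O(L x |Q|^2) fill, left to right
        for i in range(n):
            j = max(range(n), key=lambda j: V[j][k-1] + logp(Pt(states[i], states[j])))
            V[i][k] = (V[j][k-1] + logp(Pt(states[i], states[j]))
                       + logp(Pe(seq[k], states[i])))
            back[i][k] = j
    i = max(range(n), key=lambda q: V[q][L-1])  # best final state
    path = [i]
    for k in range(L - 1, 0, -1):  # traceback
        i = back[i][k]
        path.append(i)
    return [states[j] for j in reversed(path)]
```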
Naive GHMM Decoding

run time: O(L³×|Q|²)
Assumptions in GHMM Decoding for Gene Finding

(1) The emission functions Pe for variable-length features are factorable, in the sense that they can be expressed as a product of terms evaluated at each position in a putative feature, i.e.,

factorable(Pe) ⟺ ∃f : Pe(S) = ∏i f(S,i)

for 0 ≤ i < |S| over any putative feature S.

(2) The lengths of noncoding features in genomes are geometrically distributed.

(3) The model contains no transitions between variable-length states.
Assumption #3

Variable-length states cannot transition directly to each other.
Efficient Decoding via Signal Sensors

Each signal state has a signal sensor:

[Figure: putative signals along the sequence, each scored by its sensor (.0045, .0032, .0031, .0021, .002, .0072, .0023, .0034, .0082) and linked into the “trellis” or “ORF graph”.]
sequence:
GCTATCGATTCTCTAATCGTCTATCGATCGTGGTATCGTACGTTCATTACTGACT...

Putative signals are detected during a single left-to-right pass over the sequence: sensor 1 detects ATG’s, sensor 2 detects GT’s, ..., sensor n detects AG’s, and each newly detected signal is inserted into its type-specific signal queue.
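A sketch of this pass, using literal motif matching in place of real scoring signal sensors (windowing and score thresholds omitted):

```python
from collections import deque

# Each "sensor" here is just a literal-motif matcher; real signal
# sensors would score a window around each site and apply a threshold.
SIGNALS = {"ATG": "start-codon", "GT": "donor", "AG": "acceptor"}

def build_signal_queues(seq):
    """Single left-to-right pass: every putative signal is appended
    to the queue for its signal type."""
    queues = {kind: deque() for kind in SIGNALS.values()}
    for pos in range(len(seq)):
        for motif, kind in SIGNALS.items():
            if seq.startswith(motif, pos):
                queues[kind].append(pos)  # insert into type-specific queue
    return queues

print(build_signal_queues("GCTATGCGGTAAAGATGAG"))
```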
Decoding with an ORF Graph

[Figure: elements of the “ATG” queue along the sequence (...ATG.........ATG......ATG..................GT); a newly detected GT signal receives trellis links back to each queued ATG.]
The Notion of “Eclipsing”

ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT
0120120120120120120120120120120120120   (codon phase)

in-frame stop codon! (the TGA at positions 12-14 lies in phase 0 of the initial ATG)
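A minimal sketch of the bookkeeping: given a putative ATG, scan its reading frame for a stop codon that would eclipse any parse extending the open reading frame past it:

```python
STOPS = {"TAA", "TAG", "TGA"}

def eclipsed(seq, atg_pos, end_pos):
    """Return True if an in-frame stop codon occurs between a
    putative ATG at atg_pos and end_pos, eclipsing that reading
    frame (a sketch of the phase track shown above)."""
    for i in range(atg_pos + 3, end_pos - 2, 3):  # stay in the ATG's frame
        if seq[i:i+3] in STOPS:
            return True
    return False

# The ATG at position 0 below is eclipsed by the in-frame TGA:
print(eclipsed("ATGGATGCTACTTGACGTACT", 0, 21))  # True
```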
Bounding the Number of Coding Predecessors
Geometric Noncoding Lengths (Assumption #2)
Prefix Sum Arrays for Emission Probabilities (Assumption #1: factorable content sensors)
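The idea behind the prefix sum arrays: under Assumption #1 the emission score of any candidate feature is a product of per-position terms, so cumulative log scores make any interval's score a constant-time difference. A minimal sketch, with a hypothetical per-position sensor term f:

```python
import math

def build_psa(seq, f):
    """Prefix sum array of log emission scores: psa[i] holds the sum
    of log f(seq, j) for j < i, where f is the factorable content
    sensor's per-position term (hypothetical interface)."""
    psa = [0.0]
    for i in range(len(seq)):
        psa.append(psa[-1] + math.log(f(seq, i)))
    return psa

def interval_score(psa, a, b):
    """Log emission score of the feature seq[a:b], in O(1)."""
    return psa[b] - psa[a]
```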
Bounding the Number of Noncoding Predecessors
Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis, Stanford University.
PSA Decoding

run time: O(L×|Q|) (assuming a sparse GHMM topology)
Traceback for GHMMs
DSP Decoding

DSP = Dynamic Score Propagation: we incrementally propagate the scores of all possible partial parses up to the current point in the sequence, during a single left-to-right pass over the sequence.

Ref: Majoros WH, Pertea M, Delcher AL, Salzberg SL (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 6:16.
run time: O(L×|Q|) (assuming a sparse GHMM topology)
PSA vs. DSP Decoding

Space complexity:
PSA: O(L×|Q|)
DSP: O(L+|Q|)
Modeling Isochores
“inhomogeneous GHMM”
Ref: Allen JE, Majoros WH, Pertea M, Salzberg SL (2006) JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl 1):S9.
Explicit Modeling of Noncoding Lengths

still O(L×|Q|), but slower by a constant factor

Ref: Stanke M, Waack S (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19:II215-II225.
MLE Training for GHMMs

$$\theta_{MLE} = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} P(S, \phi) = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \prod_{y_i \in \phi} P_e(S_i \mid y_i, d_i) \, P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i)$$

$$= \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \prod_{y_i \in \phi} P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i) \prod_{j=0}^{|S_i|-1} P_e(x_j \mid y_i)$$

Pt and Pe are estimated from labeled training data, as in an HMM:

$$a_{i,j} = \frac{A_{i,j}}{\sum_{h=0}^{|Q|-1} A_{i,h}}, \qquad e_{i,k} = \frac{E_{i,k}}{\sum_{h=0}^{|\Sigma|-1} E_{i,h}}$$

Pd is estimated by constructing a histogram of observed feature lengths.
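A counting sketch of these estimators (the labeled-parse input format is an assumption): transitions are normalized counts, and durations come straight from a length histogram:

```python
from collections import Counter

def estimate_from_labels(parses):
    """Sketch of MLE by counting, per the formulas above. Each parse
    is a list of (state, feature_length) pairs. Returns normalized
    transition probabilities and per-state duration distributions."""
    A = Counter()   # transition counts A[i,j]
    D = {}          # duration counts per state
    for path in parses:
        for k, (state, length) in enumerate(path):
            D.setdefault(state, Counter())[length] += 1
            if k + 1 < len(path):
                A[(state, path[k + 1][0])] += 1
    row = Counter()
    for (i, _), c in A.items():
        row[i] += c
    a = {(i, j): c / row[i] for (i, j), c in A.items()}
    Pd = {s: {d: c / sum(h.values()) for d, c in h.items()}
          for s, h in D.items()}
    return a, Pd
```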
Maximum Likelihood vs. Conditional Maximum Likelihood

$$\theta_{MLE} = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} P(S, \phi) = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \prod_{y_i \in \phi} P_e(S_i \mid y_i, d_i) \, P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i)$$

Maximizing Pe, Pt, and Pd separately maximizes the entire expression.

$$\theta_{CML} = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} P(\phi \mid S) = \operatorname{argmax}_{\theta} \prod_{(S,\phi) \in T} \frac{\prod_{y_i \in \phi} P_e(S_i \mid y_i, d_i) \, P_t(y_i \mid y_{i-1}) \, P_d(d_i \mid y_i)}{P(S)}$$

Maximizing Pe, Pt, and Pd separately DOES NOT maximize the entire expression.

MLE ≠ CML
Discriminative Training of GHMMs

$$\theta_{discrim} = \operatorname{argmax}_{\theta} \left( \text{accuracy on training set} \right)$$
Ref: Majoros WH, Salzberg SL (2004) An empirical analysis of training protocols for probabilistic gene finders. BMC Bioinformatics 5:206.
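The argmax above has no closed form; one simple realization (entirely hypothetical, not the protocol of the referenced paper) is a coordinate-wise hill climb that scores each candidate parameter vector by its training-set accuracy:

```python
def discriminative_tune(theta, train, decode, accuracy, step=0.05, rounds=10):
    """A hedged sketch of gradient-free discriminative tuning:
    perturb one parameter at a time and keep any change that improves
    training-set accuracy. `decode` (a GHMM decoder), `accuracy`
    (e.g., exon-level accuracy), and the parameterization of theta
    are all hypothetical stand-ins."""
    def score(params):
        return accuracy([(decode(S, params), phi) for S, phi in train])
    best = score(theta)
    for _ in range(rounds):
        for name in list(theta):
            for delta in (step, -step):
                trial = dict(theta)
                trial[name] += delta
                acc = score(trial)
                if acc > best:  # keep only improving moves
                    theta, best = trial, acc
    return theta
```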
Summary

• GHMMs generalize HMMs by allowing each state to emit a subsequence rather than just a single symbol.
• Whereas HMMs model all feature lengths using a geometric distribution, coding features can be modeled using an arbitrary length distribution in a GHMM.
• Emission models within a GHMM can be any arbitrary probabilistic model (“submodel abstraction”), such as a neural network or decision tree.
• GHMMs tend to have many fewer states => simplicity & modularity.
• When used for parsing sequences, HMMs and GHMMs should be trained discriminatively, rather than via MLE, to achieve optimal predictive accuracy.