
Page 1: Generalized Hidden Markov Models

Generalized Hidden Markov Models
for Eukaryotic gene prediction
CBB 231 / COMPSCI 261
W.H. Majoros

Page 2: Generalized Hidden Markov Models

Recall: Gene Syntax

[Diagram: a complete mRNA and its coding segment, which runs from the start codon (ATG) to the stop codon (TGA). Exons alternate with introns; each intron begins at a donor site (GT) and ends at an acceptor site (AG).]

Page 3: Generalized Hidden Markov Models

Recall: “Pure” HMMs

An HMM is a stochastic machine M = (Q, Σ, Pt, Pe) consisting of the following:

• a finite set of states, Q = {q0, q1, ..., qm}
• a finite alphabet Σ = {s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ → [0,1], i.e., Pe(sj | qi)

An Example

[State diagram: q0 → q1 with probability 100%; q1 → q1 with probability 80%, q1 → q2 with probability 15%, q1 → q0 with probability 5%; q2 → q2 with probability 70%, q2 → q1 with probability 30%. State q1 emits Y with probability 100% (R: 0%); state q2 emits R with probability 100% (Y: 0%).]

M1 = ({q0, q1, q2}, {Y, R}, Pt, Pe)

Pt = {(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}

Pe = {(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}
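To make the example concrete, here is a minimal Python sketch (not from the slides) that encodes M1 and samples an emission sequence from it; the dictionary layout and function names are illustrative choices.

```python
import random

# Example HMM M1 from the slide: states q0 (silent start/end), q1, q2; alphabet {Y, R}.
TRANSITIONS = {
    "q0": {"q1": 1.0},
    "q1": {"q1": 0.80, "q2": 0.15, "q0": 0.05},   # returning to q0 ends the sequence
    "q2": {"q2": 0.70, "q1": 0.30},
}
EMISSIONS = {
    "q1": {"Y": 1.0, "R": 0.0},
    "q2": {"Y": 0.0, "R": 1.0},
}

def draw(dist):
    """Sample a key from a {key: probability} dictionary."""
    r, total = random.random(), 0.0
    for key, p in dist.items():
        total += p
        if r < total:
            return key
    return key  # guard against floating-point round-off

def sample_sequence():
    """Generate one emission sequence by walking the HMM until it returns to q0."""
    state, seq = draw(TRANSITIONS["q0"]), []
    while state != "q0":
        seq.append(draw(EMISSIONS[state]))
        state = draw(TRANSITIONS[state])
    return "".join(seq)

if __name__ == "__main__":
    random.seed(0)
    print(sample_sequence())   # runs of Y's (state q1) interleaved with runs of R's (state q2)
```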

Page 4: Generalized Hidden Markov Models

Generalized HMMs

A GHMM is a stochastic machine M = (Q, Σ, Pt, Pe, Pd) consisting of the following:

• a finite set of states, Q = {q0, q1, ..., qm}
• a finite alphabet Σ = {s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ*×ℕ → [0,1], i.e., Pe(Sj | qi, dj)
• a duration distribution Pd : Q×ℕ → [0,1], i.e., Pd(dj | qi)

Key Differences

• each state now emits an entire subsequence rather than just one symbol
• feature lengths are now explicitly modeled, rather than implicitly geometric
• emission probabilities can now be modeled by any arbitrary probabilistic model
• there tend to be far fewer states => simplicity & ease of modification

Ref: Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. ISMB '96.
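The generative consequence of these differences can be sketched in a few lines of Python. The following is an illustrative sketch, not the code of any particular gene finder: each visit to a state first draws a duration d from Pd, then emits a whole length-d subsequence from Pe, and only then transitions.

```python
import random

def pick(dist, rng):
    """Sample a key from a {key: probability} dictionary."""
    r, acc = rng.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key

def sample_ghmm_path(transitions, durations, emitters, start="q0", end="q0", rng=random):
    """Walk a GHMM generatively: at each state draw a duration, then a subsequence.

    transitions: dict state -> {next_state: probability}
    durations:   dict state -> callable returning an integer length d ~ Pd(d | state)
    emitters:    dict state -> callable taking d and returning a length-d string ~ Pe(. | state, d)
    Returns the concatenated sequence and the parse as a list of (state, duration) pairs.
    """
    seq, parse = [], []
    state = pick(transitions[start], rng)
    while state != end:
        d = durations[state]()              # explicit duration, not implicitly geometric
        seq.append(emitters[state](d))      # the whole subsequence is emitted at once
        parse.append((state, d))
        state = pick(transitions[state], rng)
    return "".join(seq), parse
```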

Page 5: Generalized Hidden Markov Models

HMMs & Geometric Feature Lengths

In an ordinary HMM, a feature of length d is generated by a state re-entering itself d−1 times before leaving, so feature lengths are implicitly geometric. If p is the state's self-transition probability:

P(x_0 ... x_{d−1} | θ) = ( ∏_{i=0}^{d−1} Pe(x_i | θ) ) · p^{d−1} · (1 − p)

[Plot: observed exon length distribution compared with a geometric distribution.]
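As a small illustration of what "implicitly geometric" means, the snippet below (the helper name is my own) computes the length distribution induced by a state's self-transition probability p:

```python
def hmm_length_pmf(p, d):
    """Probability that an HMM state with self-transition probability p emits a
    feature of exactly length d (d >= 1): stay d-1 times, then leave once."""
    return (p ** (d - 1)) * (1.0 - p)

# Example: with p = 0.9 the expected feature length is 1 / (1 - p) = 10, and the
# distribution decays monotonically from d = 1 onward.
print([round(hmm_length_pmf(0.9, d), 3) for d in range(1, 6)])
# [0.1, 0.09, 0.081, 0.073, 0.066]
```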

Page 6: Generalized Hidden Markov Models

Model Abstraction in GHMMs

Advantages:
• Submodel abstraction
• Architectural simplicity
• State duration modeling

Disadvantages:
• Decoding complexity

Page 7: Generalized Hidden Markov Models

Typical GHMM Topology for Gene Finding

Fixed-length states are called signal states (diamonds). Variable-length states are called content states (ovals). Each state has a separate submodel or sensor.

Sparse: the in-degree of all states is bounded by some small constant.

Page 8: Generalized Hidden Markov Models

Some GHMM Submodel Types

1. WMM (Weight Matrix):  P(S) = ∏_{i=0}^{L−1} P_i(x_i)

2. Nth-order Markov Chain (MC):  P(S) = ∏_{i=n}^{L−1} P(x_i | x_{i−n} ... x_{i−1}) · ∏_{i=0}^{n−1} P(x_i | x_0 ... x_{i−1})

3. Three-Periodic Markov Chain (3PMC):  P(S) = ∏_{i=0}^{L−1} P_{(f+i) mod 3}(x_i)

5. Codon Bias:  P(S) = ∏_{i=0}^{n−1} P(x_{3i} x_{3i+1} x_{3i+2})

6. MDD (Maximal Dependence Decomposition)

7. Interpolated Markov Model (IMM):

P_e^{IMM}(s | g_0 ... g_{k−1}) = λ_k^G · Pe(s | g_0 ... g_{k−1}) + (1 − λ_k^G) · P_e^{IMM}(s | g_1 ... g_{k−1}),  if k > 0
P_e^{IMM}(s | g_0 ... g_{k−1}) = Pe(s),  if k = 0

Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis. Stanford University.

Ref: Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26:544-548.
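As one concrete example of the simplest submodel above, here is a hedged Python sketch of a WMM scorer in log space; the matrix values and function name are illustrative, not taken from any of the cited programs.

```python
import math

def score_wmm(wmm, seq):
    """Score a fixed-length window with a weight matrix model (WMM).

    wmm: list of position-specific probability dicts, e.g. [{'A': 0.7, 'C': 0.1, ...}, ...]
    seq: a string whose length equals len(wmm)
    Returns the log probability sum_i log P_i(x_i), or -inf if any symbol has probability 0.
    """
    assert len(seq) == len(wmm)
    total = 0.0
    for dist, base in zip(wmm, seq):
        p = dist.get(base, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Toy 3-position matrix resembling an ATG start-codon consensus (illustrative numbers):
start_wmm = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]
print(score_wmm(start_wmm, "ATG"))   # the highest-scoring window for this matrix
```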

Page 9: Generalized Hidden Markov Models

Recall: Decoding with an HMM

φ_max = argmax_φ P(φ | S) = argmax_φ P(S ∧ φ) / P(S) = argmax_φ P(S ∧ φ) = argmax_φ P(S | φ) P(φ)

P(φ) = ∏_{i=0}^{L} Pt(y_{i+1} | y_i)            (transition probabilities)
P(S | φ) = ∏_{i=0}^{L−1} Pe(x_i | y_{i+1})       (emission probabilities)

φ_max = argmax_φ  Pt(q0 | y_L) ∏_{i=0}^{L−1} Pe(x_i | y_{i+1}) Pt(y_{i+1} | y_i)

Page 10: Generalized Hidden Markov Models

Decoding with a GHMM

φ_max = argmax_φ P(φ | S) = argmax_φ P(S ∧ φ) / P(S) = argmax_φ P(S ∧ φ) = argmax_φ P(S | φ) P(φ)

P(φ) = ∏_{i=0}^{|φ|−2} Pt(y_{i+1} | y_i) Pd(d_i | y_i)      (transition and duration probabilities)
P(S | φ) = ∏_{i=1}^{|φ|−2} Pe(S_i | y_i, d_i)               (emission probabilities)

φ_max = argmax_φ  ∏_{i=0}^{|φ|−2} Pe(S_i | y_i, d_i) Pt(y_{i+1} | y_i) Pd(d_i | y_i)

Page 11: Generalized Hidden Markov Models

Recall: Viterbi Decoding for HMMs

V(i,k) = max_j V(j, k−1) Pt(q_i | q_j) Pe(x_k | q_i),   if k > 0
V(i,k) = Pt(q_i | q_0) Pe(x_0 | q_i),                   if k = 0

[Diagram: the dynamic-programming trellis, with states on one axis and sequence positions ..., k−2, k−1, k, k+1, ... on the other; cell (i,k) holds the best score for state q_i at position k.]

run time: O(L × |Q|²)
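For reference, a minimal log-space Python implementation of this recursion; the data layout and function names are my own choices, not from the slides.

```python
def viterbi(obs, states, log_pt, log_pe):
    """Viterbi decoding for a classical HMM, following the V(i,k) recursion above (log space).

    obs:    list of observed symbols x_0 .. x_{L-1}
    states: list of state names (the silent start state "q0" is handled separately)
    log_pt: dict (from_state, to_state) -> log transition probability
    log_pe: dict (state, symbol)        -> log emission probability
    Returns the best state path as a list.
    """
    NEG_INF = float("-inf")
    # Initialization: k = 0, entered from the start state q0.
    V = [{s: log_pt.get(("q0", s), NEG_INF) + log_pe.get((s, obs[0]), NEG_INF)
          for s in states}]
    back = [{}]
    # Recursion: k > 0.
    for k in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            best_prev, best = None, NEG_INF
            for r in states:
                score = V[k - 1][r] + log_pt.get((r, s), NEG_INF)
                if score > best:
                    best_prev, best = r, score
            V[k][s] = best + log_pe.get((s, obs[k]), NEG_INF)
            back[k][s] = best_prev
    # Traceback from the best final state.
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for k in range(len(obs) - 1, 0, -1):
        state = back[k][state]
        path.append(state)
    return list(reversed(path))
```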

Page 12: Generalized Hidden Markov Models

Naive GHMM Decoding

run time: O(L³ × |Q|²)
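The slide gives only the run time; the following hedged sketch shows where the O(L³ × |Q|²) cost comes from in the naive dynamic program: every end position, every duration, and every predecessor state are enumerated, and each candidate feature is rescored from scratch. The argument names and data layout are assumptions of mine.

```python
def naive_ghmm_decode(seq, states, start, log_pt, log_pd, log_pe_interval):
    """Naive GHMM Viterbi (illustrative sketch): for every end position, every state,
    every possible duration, and every predecessor, rescore the whole candidate feature.
    L end positions x L durations x |Q|^2 state pairs, with an O(L) emission rescoring
    inside log_pe_interval, gives the O(L^3 * |Q|^2) run time.

    log_pe_interval(state, b, e) must return log Pe(seq[b:e] | state, e - b)."""
    NEG_INF = float("-inf")
    L = len(seq)
    # best[k][s]: best log score of a partial parse whose last feature ends at k in state s
    best = [{s: NEG_INF for s in states} for _ in range(L + 1)]
    back = [{s: None for s in states} for _ in range(L + 1)]
    best[0][start] = 0.0                      # silent start state, before any feature
    for k in range(1, L + 1):                 # end position of the last feature
        for s in states:
            for d in range(1, k + 1):         # duration of the last feature
                emit = log_pe_interval(s, k - d, k) + log_pd.get((s, d), NEG_INF)
                for r in states:              # predecessor state
                    score = best[k - d][r] + log_pt.get((r, s), NEG_INF) + emit
                    if score > best[k][s]:
                        best[k][s], back[k][s] = score, (r, d)
    return best, back
```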

Page 13: Generalized Hidden Markov Models

Assumptions in GHMM Decoding for Gene Finding

(1) The emission functions Pe for variable-length features are factorable in the sense that they can be expressed as a product of terms evaluated at each position in a putative feature, i.e.,

factorable(Pe) ⟺ ∃f  Pe(S) = ∏_i f(S, i)

for 0 ≤ i < |S| over any putative feature S.

(2) The lengths of noncoding features in genomes are geometrically distributed.

(3) The model contains no transitions between variable-length states.

Page 14: Generalized Hidden Markov Models

Assumption #3

Variable-length states cannot transition directly to each other.

Page 15: Generalized Hidden Markov Models

Efficient Decoding via Signal Sensors

Each signal state has a signal sensor:

[Diagram: a signal sensor slides along the sequence, assigning a score to each candidate position (.0045, .0032, .0031, .0021, .002, .0072, .0023, .0034, .0082).]

The “trellis” or “ORF graph”:

[Diagram: the trellis/ORF graph over the sequence.]

Page 16: Generalized Hidden Markov Models

Efficient Decoding via Signal Sensors

sequence: GCTATCGATTCTCTAATCGTCTATCGATCGTGGTATCGTACGTTCATTACTGACT...

Putative signals are detected during a single left-to-right pass over the sequence: each signal sensor (one per signal type: ATG's, GT's, AG's, ...) scores candidate positions, and detected signals are inserted into type-specific signal queues.

[Diagram: a newly detected signal (here a GT) is joined by trellis links to the earlier elements of the “ATG” queue:
...ATG.........ATG......ATG..................GT]

Page 17: Generalized Hidden Markov Models

Decoding with an ORF Graph

Page 18: Generalized Hidden Markov Models

The Notion of “Eclipsing”

ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0

in-frame stop codon!
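A hedged sketch of how eclipsing can be checked in practice (function and variable names are illustrative): once an in-frame stop codon is found downstream of a putative ATG, no coding feature in that frame can extend past it, so links crossing the stop can be pruned from the trellis.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}

def first_inframe_stop(seq, atg_pos):
    """Return the position of the first in-frame stop codon downstream of an ATG,
    or None.  Once such a stop is seen, the ATG is 'eclipsed' for that frame: a
    coding feature starting at this ATG cannot extend past the stop."""
    for i in range(atg_pos + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

seq = "ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT"
print(first_inframe_stop(seq, 0))   # -> 12, the TGA highlighted on the slide
```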

Page 19: Generalized Hidden Markov Models

Bounding the Number of Coding Predecessors

Page 20: Generalized Hidden Markov Models

Geometric Noncoding Lengths (Assumption #2)

Page 21: Generalized Hidden Markov Models

Prefix Sum Arrays for Emission Probabilities (Assumption #1: factorable content sensors)
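A minimal sketch, assuming a factorable (here 0th-order) content sensor, of how a prefix-sum array turns interval emission scoring into an O(1) lookup after O(L) preprocessing; the data layout is my own. A real gene finder would typically keep one such array per content state (and per frame for periodic coding sensors), but the mechanism is the same.

```python
import math

def build_prefix_sums(seq, log_f):
    """psa[i] = sum of log f(seq, j) for j < i, so the log emission score of any
    interval [b, e) under a factorable sensor is psa[e] - psa[b]."""
    psa = [0.0]
    for i in range(len(seq)):
        psa.append(psa[-1] + log_f(seq, i))
    return psa

def interval_score(psa, begin, end):
    """Log Pe(seq[begin:end]) for a factorable sensor, in constant time."""
    return psa[end] - psa[begin]

# Illustrative 0th-order sensor: position-independent base probabilities.
LOG_P = {"A": math.log(0.3), "C": math.log(0.2), "G": math.log(0.2), "T": math.log(0.3)}
psa = build_prefix_sums("ATGGATGCTACT", lambda s, i: LOG_P[s[i]])
print(interval_score(psa, 3, 9))   # log probability of "GATGCT" under this sensor
```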

Page 22: Generalized Hidden Markov Models

Bounding the Number of Noncoding Predecessors

Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis. Stanford University.

Page 23: Generalized Hidden Markov Models

PSA Decoding

run time: O(L × |Q|) (assuming a sparse GHMM topology)

Page 24: Generalized Hidden Markov Models

Traceback for GHMMs

Page 25: Generalized Hidden Markov Models

DSP Decoding

Ref: Majoros WH, Pertea M, Delcher AL, Salzberg SL (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 6:16.

DSP = Dynamic Score Propagation: the scores of all possible partial parses up to the current point are propagated incrementally during a single left-to-right pass over the sequence.

Page 26: Generalized Hidden Markov Models

DSP Decoding

run time: O(L × |Q|) (assuming a sparse GHMM topology)

Page 27: Generalized Hidden Markov Models

PSA vs. DSP Decoding

Space complexity:

PSA: O(L × |Q|)
DSP: O(L + |Q|)

Page 28: Generalized Hidden Markov Models

Modeling Isochores

“inhomogeneous GHMM”

Ref: Allen JE, Majoros WH, Pertea M, Salzberg SL (2006) JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl 1):S9.

Page 29: Generalized Hidden Markov Models

Explicit Modeling of Noncoding Lengths

still O(L × |Q|), but slower by a constant factor

Ref: Stanke M, Waack S (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19:II215-II225.

Page 30: Generalized Hidden Markov Models

MLE Training for GHMMs

θ_MLE = argmax_θ ∏_{(S,φ)∈T} P(S, φ)
      = argmax_θ ∏_{(S,φ)∈T} ∏_{y_i∈φ} Pe(S_i | y_i, d_i) Pt(y_i | y_{i−1}) Pd(d_i | y_i)
      = argmax_θ ∏_{(S,φ)∈T} ∏_{y_i∈φ} Pt(y_i | y_{i−1}) Pd(d_i | y_i) ∏_{j=0}^{|S_i|−1} Pe(x_j | y_i)

Pt and Pe are estimated via labeled training data, as in an HMM; Pd is estimated by constructing a histogram of observed feature lengths. The count-based estimators are:

a_{i,j} = A_{i,j} / ∑_{h=0}^{|Q|−1} A_{i,h}          e_{i,k} = E_{i,k} / ∑_{h=0}^{|Σ|−1} E_{i,h}

where A_{i,j} counts observed transitions from q_i to q_j and E_{i,k} counts observed emissions of symbol s_k from state q_i.
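A hedged Python sketch of this count-and-normalize estimation for Pt and Pd from labeled parses; the data layout, with each parse given as an ordered list of (state, duration) pairs, is an assumption of mine.

```python
from collections import Counter, defaultdict

def mle_train(parses):
    """Estimate Pt and Pd from labeled parses by counting and normalizing.

    parses: list of parses, each an ordered list of (state, duration) pairs.
    Returns (Pt, Pd): Pt[(r, s)] = P(s | r); Pd[state][d] = P(duration d | state).
    """
    trans, durs = Counter(), defaultdict(Counter)
    for parse in parses:
        for (r, _), (s, _) in zip(parse, parse[1:]):
            trans[(r, s)] += 1                     # A[i][j]: observed transitions
        for state, d in parse:
            durs[state][d] += 1                    # histogram of observed feature lengths
    out_total = Counter()
    for (r, _s), n in trans.items():
        out_total[r] += n
    Pt = {(r, s): n / out_total[r] for (r, s), n in trans.items()}
    Pd = {state: {d: n / sum(c.values()) for d, n in c.items()} for state, c in durs.items()}
    return Pt, Pd

# Tiny illustrative training set of two labeled parses:
Pt, Pd = mle_train([[("exon", 120), ("intron", 800), ("exon", 90)],
                    [("exon", 150), ("intron", 800)]])
print(Pt[("exon", "intron")], Pd["exon"])
```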

Page 31: Generalized Hidden Markov Models

Maximum Likelihood vs. Conditional Maximum Likelihood

θ_MLE = argmax_θ ∏_{(S,φ)∈T} P(S, φ)
      = argmax_θ ∏_{(S,φ)∈T} ∏_{y_i∈φ} Pe(S_i | y_i, d_i) Pt(y_i | y_{i−1}) Pd(d_i | y_i)

Maximizing Pe, Pt, and Pd separately maximizes the entire expression.

θ_CML = argmax_θ ∏_{(S,φ)∈T} P(φ | S)
      = argmax_θ ∏_{(S,φ)∈T} [ ∏_{y_i∈φ} Pe(S_i | y_i, d_i) Pt(y_i | y_{i−1}) Pd(d_i | y_i) ] / P(S)

Maximizing Pe, Pt, and Pd separately DOES NOT maximize the entire expression, because the denominator P(S) depends on the same parameters.

MLE ≠ CML

Page 32: Generalized Hidden Markov Models

Discriminative Training of GHMMs

θ_discrim = argmax_θ (accuracy on training set)
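Since the slide states only the objective, here is a deliberately generic sketch of one way such an objective can be pursued: greedy coordinate-wise tuning against training-set accuracy. It is an assumption for illustration, not the specific protocol of the cited paper.

```python
def discriminative_tune(theta, groups, evaluate_accuracy, rounds=5, step=0.1):
    """Greedy coordinate-style tuning: perturb one parameter at a time and keep the
    change only if training-set accuracy improves.  A generic sketch, not the
    specific protocol of Majoros & Salzberg (2004)."""
    best_acc = evaluate_accuracy(theta)
    for _ in range(rounds):
        for name in groups:
            for delta in (+step, -step):
                candidate = dict(theta)
                candidate[name] = theta[name] + delta
                acc = evaluate_accuracy(candidate)
                if acc > best_acc:            # keep only accuracy-improving moves
                    theta, best_acc = candidate, acc
    return theta, best_acc
```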

Page 33: Generalized Hidden Markov Models

Discriminative Training of GHMMsDiscriminative Training of GHMMsR

ef:

Maj

oros

WM

, Sal

zber

g S

L (

2004

) A

n em

piri

cal a

naly

sis

of tr

aini

ng p

roto

cols

for

pr

obab

ilis

tic

gene

fin

ders

. BM

C B

ioin

form

atic

s 5:

206.

Page 34: Generalized Hidden Markov Models

Summary

• GHMMs generalize HMMs by allowing each state to emit a subsequence rather than just a single symbol
• Whereas HMMs model all feature lengths using a geometric distribution, coding features can be modeled using an arbitrary length distribution in a GHMM
• Emission models within a GHMM can be any arbitrary probabilistic model (“submodel abstraction”), such as a neural network or decision tree
• GHMMs tend to have many fewer states => simplicity & modularity
• When used for parsing sequences, HMMs and GHMMs should be trained discriminatively, rather than via MLE, to achieve optimal predictive accuracy