Introduction to Bioinformatics: Lecture XIII Profile and Other Hidden Markov Models

Post on 18-Mar-2016

42 views

Category:

Documents


0 download

DESCRIPTION

Introduction to Bioinformatics: Lecture XIII Profile and Other Hidden Markov Models. Jarek Meller Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC. Outline of the lecture. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Introduction to Bioinformatics: Lecture XIII Profile and Other Hidden Markov Models

JM - http://folding.chmcc.org 1

Introduction to Bioinformatics: Lecture XIII
Profile and Other Hidden Markov Models

Jarek Meller

Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC

Page 2

Outline of the lecture

Multiple alignments, family profiles and probabilistic models of biological sequences

From simple Markov models to Hidden Markov Models (HMMs)

Profile HMMs: topology and parameter optimization

Finding optimal alignments: the Viterbi algorithm

Other applications of HMMs

Page 3

Web watch: personalized predictive medicine

Targeting a crucial signal transduction pathway in lung cancer: an inhibitor of the Epidermal Growth Factor Receptor (EGFR) catalytic activity that binds EGFRs with specific mutations.

Genotyping the EGFR gene appears to be sufficient to predict the outcome of the therapy. Paez JG et al., Science 304

Page 4

Hidden Markov Models for biological sequences

Problems with grammatical structure, such as gene finding, family profiles, protein function prediction and transmembrane domain prediction.

In general, one may think of different biases in different fragments of the sequence (due to their functional role, for example), or of different states emitting these fragments using different probability distributions.

Durbin et al., Chapters 3 to 6

Page 5

Example: Markov chain model for CpG islands

Motivation: CpG dinucleotides (and not the C-G base pairs across the two strands) are frequently methylated at C, with methyl-C mutating at a higher rate into a T; however, the methylation process is suppressed around regulatory sequences (e.g. promoters), where CpG islands occur more often.

[Figure: Markov chain with four states A, C, G and T and transitions between them.]

Transition probabilities:

$t_{T,G} = P(a_i = G \mid a_{i-1} = T)$ etc.

The overall probability of a sequence is defined as the product of the transition probabilities.
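The product of transition probabilities can be sketched in a few lines of Python. This is a minimal illustration only: the transition table `TRANSITIONS` and the uniform first-letter distribution `INITIAL` are made-up numbers, not values from the lecture, and the function name is hypothetical. Logarithms are used so that long sequences do not underflow.

```python
import math

# Hypothetical transition probabilities t[prev][next]; each row sums to 1.
# These numbers are invented for illustration, not estimated from real data.
TRANSITIONS = {
    "A": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
}
# Assumed uniform distribution for the first letter, P(a_1).
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def markov_chain_log_prob(seq, initial=INITIAL, transitions=TRANSITIONS):
    """Return log P(seq) = log P(a_1) + sum over i of log t[a_{i-1}][a_i]."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    log_p = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(transitions[prev][cur])
    return log_p


# P(CGT) = P(C) * t[C][G] * t[G][T] under the toy numbers above.
print(math.exp(markov_chain_log_prob("CGT")))
```

Training two such tables, one on island regions and one on the rest of the genome, and comparing the two log-probabilities of a fragment is one way such a chain could be used to discriminate CpG islands.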

Page 6

Example: Hidden Markov model for CpG islands

[Figure: two copies of the four-state chain, an “island” model (A*, C*, G*, T*) and a non-island model (A, C, G, T).]

Adding four more states (A*, C*, G*, T*) to represent the “island” model, as opposed to the non-island model, with unlikely transitions between the two models, one obtains a “hidden” Markov model for CpG islands.

There is no longer a one-to-one correspondence between the states and the symbols: knowing the sequence, we cannot tell which state the model was in when generating subsequent letters in the sequence.

Page 7

Probabilistic models of biological sequences

For any probabilistic model the total probability of observing a sequence $a_1 a_2 \ldots a_n$ may be written as:

$P(a_1 a_2 \ldots a_n) = P(a_n \mid a_{n-1} \ldots a_1)\, P(a_{n-1} \mid a_{n-2} \ldots a_1) \cdots P(a_1)$

In Markov chain models we simply have:

$P(a_1 a_2 \ldots a_n) = P(a_n \mid a_{n-1})\, P(a_{n-1} \mid a_{n-2}) \cdots P(a_1)$

HMMs are a generalization of Markov chain models, with some “hidden” states that “emit” sequence symbols according to certain probability distributions, and (Markov) transitions between pairs of hidden states.
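To make the role of the hidden states concrete, here is a small Python sketch of a two-state HMM. Everything in it is an assumption for illustration: the state names, the probabilities in `START`, `TRANS` and `EMIT`, and the function names. It computes the joint probability of a sequence together with one state path, and the total probability of the sequence summed over all paths. The efficient summation used here is the forward recursion, a standard technique that the lecture text does not name; the brute-force enumeration is included only as a cross-check.

```python
from itertools import product

# Toy two-state HMM; all numbers are hypothetical, chosen for illustration.
STATES = ("island", "background")
START = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_prob(seq, path):
    """P(seq, path): product of start, emission and transition probabilities."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][seq[i]]
    return p


def forward_prob(seq):
    """P(seq) summed over all hidden paths, via the forward recursion."""
    f = {k: START[k] * EMIT[k][seq[0]] for k in STATES}
    for symbol in seq[1:]:
        f = {
            l: EMIT[l][symbol] * sum(f[k] * TRANS[k][l] for k in STATES)
            for l in STATES
        }
    return sum(f.values())


def brute_force_prob(seq):
    """Same quantity by enumerating every state path (exponential cost)."""
    return sum(joint_prob(seq, path) for path in product(STATES, repeat=len(seq)))


print(forward_prob("CGCG"), brute_force_prob("CGCG"))
```

Because many different state paths can emit the same letters, the sequence alone does not reveal the path, which is exactly what makes the states "hidden".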

Page 8

HMMs as probabilistic linguistic models

HMMs may in fact be regarded as probabilistic finite automata that generate certain “languages”: sets of words (sentences etc.) with a specific “grammatical” structure.

For example, promoter, start, exon, splice junction, intron, stop “states” will appear in a linguistic model of a gene, whereas column (sequence position), insert and deletion states will be employed in a linguistic model of a (protein) family profile.

Page 9

HMMs for gene prediction: an exon model

Page 10

HMMs and the supervised learning approach

Given a training set of aligned sequences, find the transition and emission probabilities that maximize the probability of observing the training sequences, using the Baum-Welch (Expectation Maximization) or the Viterbi training algorithm.

In the recognition phase, having the optimized probabilities, we ask how likely it is that a new sequence belongs to the family, i.e. whether it is generated by the HMM with sufficiently high probability. The Viterbi algorithm, which is in fact dynamic programming in a suitable formulation, is used to find an optimal path through the states, which defines the optimal alignment.
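The dynamic programming idea behind the Viterbi algorithm can be sketched for a generic discrete HMM as follows. This is a minimal sketch, not the profile-HMM version used for alignment: the function signature, the two-state toy model and all of its probabilities are assumptions made for illustration.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Return (best_log_prob, best_path) for seq under a discrete HMM.

    start[k], trans[k][l] and emit[k][symbol] are probabilities; a zero
    probability is mapped to -inf in log space.
    """
    def log(p):
        return math.log(p) if p > 0 else -math.inf

    # v[k] = log-probability of the best path that ends in state k so far.
    v = {k: log(start[k]) + log(emit[k][seq[0]]) for k in states}
    backpointers = []
    for symbol in seq[1:]:
        new_v, ptr = {}, {}
        for l in states:
            best_prev = max(states, key=lambda k: v[k] + log(trans[k][l]))
            new_v[l] = v[best_prev] + log(trans[best_prev][l]) + log(emit[l][symbol])
            ptr[l] = best_prev
        backpointers.append(ptr)
        v = new_v

    # Trace back from the best final state to recover the path.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return v[last], path


# Hypothetical two-state model, used only to exercise the function.
STATES = ("island", "background")
START = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

score, path = viterbi("GCGCATAT", STATES, START, TRANS, EMIT)
print(score, path)
```

The table `v` holds one entry per state and is updated once per symbol, so the cost grows linearly with sequence length and quadratically with the number of states, instead of exponentially as in a search over all paths.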

Page 11

Ungapped profiles and the corresponding HMMs

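[Figure: ungapped profile HMM, a linear chain of match states Mj between Beg and End.]

Example

AGAAACT
AGGAATT
TGAATCT

Position  1    2    3    4    5    6    7
A         2/3  0    2/3  1    2/3  0    0
T         1/3  0    0    0    1/3  1/3  1
C         0    0    0    0    0    2/3  0
G         0    1    1/3  0    0    0    0

P(AGAAACT) = 16/81
P(TGGATTT) = 1/81

Each blue square represents a match state that “emits” each letter a with a certain probability $e_j(a)$, which is defined by the frequency of a at position j: $e_j(a) = c_j(a) / \sum_{a'} c_j(a')$, where $c_j(a)$ is the number of times a occurs in column j of the training alignment.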

Typically, pseudo-counts are added in HMMs to avoid zero probabilities.
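The numbers in the example can be reproduced with a short Python sketch. The function names and the `pseudocount` parameter are illustrative choices, not part of the lecture; exact fractions are used so that the results can be compared directly with the table. With a positive pseudocount every letter gets a nonzero probability in every column, which is the effect described above.

```python
from collections import Counter
from fractions import Fraction

ALPHABET = "ACGT"


def emission_profile(seqs, pseudocount=0):
    """Column-wise emission probabilities e_j(a) for equal-length sequences.

    With pseudocount=0 these are plain frequencies; a positive pseudocount
    adds that many virtual observations of every letter to each column.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("ungapped profile needs sequences of equal length")
    pc = Fraction(pseudocount)
    total = len(seqs) + pc * len(ALPHABET)
    profile = []
    for j in range(length):
        counts = Counter(s[j] for s in seqs)
        profile.append({a: (counts[a] + pc) / total for a in ALPHABET})
    return profile


def sequence_prob(seq, profile):
    """P(seq | profile): product of per-position emission probabilities."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match the profile length")
    p = Fraction(1)
    for letter, column in zip(seq, profile):
        p *= column[letter]
    return p


training = ["AGAAACT", "AGGAATT", "TGAATCT"]
profile = emission_profile(training)
print(sequence_prob("AGAAACT", profile))  # 16/81
print(sequence_prob("TGGATTT", profile))  # 1/81
```

Calling `emission_profile(training, pseudocount=1)` gives, for instance, 3/7 for A and 1/7 for C at position 1, so a sequence with an unseen letter no longer scores exactly zero.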

Page 12

HMMs and likelihood optimization

Page 13

Likelihood optimization …

Page 14

Insertions and deletions in profile HMMs

[Figure: chain of match states Mj between Beg and End, with an insert state Ij attached.]

Insert states emit symbols just like the match states; however, the emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores.

Transitions Ij -> Ij are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).

Page 15

Insertions and deletions in profile HMMs

[Figure: chain of match states Mj between Beg and End, with a delete state Dj attached.]

Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D -> D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string).

The total cost of a deletion is the sum of the costs of the individual transitions (M->D, D->D, D->M) that define this deletion. As in the case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.
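Written out, the costs of an insertion and of a deletion might look as follows. This is a sketch under assumed notation: $t_{X,Y}$ denotes the transition probability from state X to state Y (following the $t$ used for transitions elsewhere in the lecture), $e_{I_j}(a)$ the insert emission probability, $q_a$ the background frequency, and $S_{\text{ins}}$, $S_{\text{del}}$ are names introduced here for the log-odds contributions.

```latex
% Insertion of k residues x_1 ... x_k between match states M_j and M_{j+1}
S_{\text{ins}}(k) = \log t_{M_j,I_j} + (k-1)\,\log t_{I_j,I_j} + \log t_{I_j,M_{j+1}}
                  + \sum_{i=1}^{k} \log\frac{e_{I_j}(x_i)}{q_{x_i}}

% With e_{I_j}(a) = q_a every emission term is \log 1 = 0, so only transitions remain.

% Deletion that skips the match states M_{j+1} ... M_{j+k}
S_{\text{del}}(k) = \log t_{M_j,D_{j+1}}
                  + \sum_{i=1}^{k-1} \log t_{D_{j+i},D_{j+i+1}}
                  + \log t_{D_{j+k},M_{j+k+1}}
```

For insertions this has the affine form with a gap-opening cost of $-(\log t_{M_j,I_j} + \log t_{I_j,M_{j+1}})$ and a gap-extension cost of $-\log t_{I_j,I_j}$. For deletions the D -> D probabilities may differ from position to position, so the penalty is affine only if they are tied to a common value.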

Page 16

Gap penalties: evolutionary and computational considerations

Linear gap penalties:

$\gamma(k) = -k d$

for a gap of length k and constant d.

Affine gap penalties:

$\gamma(k) = -[d + (k-1) e]$

where d is the gap-opening penalty and e is the gap-extension penalty.
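The two penalty schemes translate directly into code. The function names and the values d = 8 and e = 2 in the usage lines are arbitrary choices for illustration.

```python
def linear_gap(k, d):
    """Linear penalty: gamma(k) = -k * d for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -k * d


def affine_gap(k, d, e):
    """Affine penalty: gamma(k) = -(d + (k - 1) * e); d opens, e extends."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(d + (k - 1) * e)


# Hypothetical parameters: one gap of length 5.
print(linear_gap(5, d=8))       # -40
print(affine_gap(5, d=8, e=2))  # -16
```

With e smaller than d, one long gap is penalized less than several short gaps of the same total length, which matches the view that a single evolutionary event can insert or delete several residues at once. Setting e equal to d recovers the linear scheme.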

Page 17

Profile HMMs as a model for multiple alignments

[Figure: profile HMM with match (Mj), insert (Ij) and delete (Dj) states between Beg and End.]

Example

AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *

Page 18

Observed emission and transition counts

AG...C
A-AG.C
AGAA.-
--AAAC
AG...C

Match emissions

    C0  C1  C2  C3
A   -   4   0   0
C   -   0   0   4
G   -   0   3   0
T   -   0   0   0

Insert emissions

    C0  C1  C2  C3
A   0   0   6   0
C   0   0   0   0
G   0   0   1   0
T   0   0   0   0

[Figure: profile HMM (Beg, Mj, Ij, Dj, End) over columns C0 to C3, annotated with the observed transition counts.]
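The emission tables above, and the transition counts that annotate the figure, can be recomputed from the alignment with a short script. This is a sketch under stated assumptions: columns 1, 2 and 6 of the alignment are taken as match columns (consistent with the match-emission table), both "-" and "." are treated as gap characters, and the state labels "B", "E", "M1", "I2", "D3" and the function name are naming conventions introduced here.

```python
from collections import Counter, defaultdict

GAP_CHARS = "-."


def profile_counts(rows, match_cols):
    """Count emissions and transitions implied by a multiple alignment.

    rows: aligned sequences of equal length; match_cols: 0-based indices of
    the columns treated as match states. Other columns are insert columns.
    """
    match_emit = defaultdict(Counter)   # match state index j -> letter counts
    insert_emit = defaultdict(Counter)  # insert state index j -> letter counts
    transitions = Counter()             # (from_state, to_state) -> count
    match_index = {col: j for j, col in enumerate(sorted(match_cols), start=1)}

    for row in rows:
        state, j = "B", 0  # j = index of the last match column passed
        for col, ch in enumerate(row):
            if col in match_index:
                j = match_index[col]
                if ch in GAP_CHARS:
                    nxt = f"D{j}"  # gap in a match column: delete state
                else:
                    nxt = f"M{j}"
                    match_emit[j][ch] += 1
            elif ch in GAP_CHARS:
                continue  # padding in an insert column: no state visited
            else:
                nxt = f"I{j}"
                insert_emit[j][ch] += 1
            transitions[(state, nxt)] += 1
            state = nxt
        transitions[(state, "E")] += 1
    return match_emit, insert_emit, transitions


rows = ["AG...C", "A-AG.C", "AGAA.-", "--AAAC", "AG...C"]
match_emit, insert_emit, transitions = profile_counts(rows, match_cols=[0, 1, 5])
print(dict(match_emit[1]), dict(insert_emit[2]))
for (src, dst), n in sorted(transitions.items()):
    print(src, "->", dst, n)
```

Each row of the alignment corresponds to one path through the model, for example Beg, M1, D2, I2, I2, M3, End for the second row, and the counts are simply tallies over these paths.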

Page 19

Computing emission and transition probabilities
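One common way to turn observed counts into probabilities is to normalize them per state, optionally after adding pseudo-counts. The sketch below assumes this standard estimate; the symbols $T_{k,l}$ (observed transitions from state k to state l), $E_k(a)$ (observed emissions of letter a from state k) and the pseudo-counts $r$ are notation introduced here, and the choice $r = 1$ (Laplace's rule) in the worked lines is an assumption.

```latex
t_{k,l} = \frac{T_{k,l} + r_{k,l}}{\sum_{l'} \left( T_{k,l'} + r_{k,l'} \right)},
\qquad
e_k(a) = \frac{E_k(a) + r_k(a)}{\sum_{a'} \left( E_k(a') + r_k(a') \right)}

% Worked example with r = 1, using counts from the alignment example:
% M_1 emits A four times and nothing else, over a four-letter alphabet
e_{M_1}(A) = \frac{4 + 1}{4 + 4} = \frac{5}{8}

% M_1 is followed by M_2 three times, by D_2 once and by I_1 never
t_{M_1,M_2} = \frac{3 + 1}{4 + 3} = \frac{4}{7}, \quad
t_{M_1,D_2} = \frac{1 + 1}{4 + 3} = \frac{2}{7}, \quad
t_{M_1,I_1} = \frac{0 + 1}{4 + 3} = \frac{1}{7}
```

With $r = 0$ the same formulas reduce to plain relative frequencies, which leave unobserved transitions and emissions at probability zero.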

Page 20

Optimal alignment corresponds to a path with the highest probability (or log-odds score)

[Figure: profile HMM with match (Mj), insert (Ij) and delete (Dj) states between Beg and End.]

Problem: Given the above model, with the emission and transition probabilities obtained previously, find the optimal path (alignment) for the query sequence AGAC.

Problem: Find the emission and transition counts assuming that the 4th column in the example multiple alignment on slide 17 corresponds to another match state (and not an insert state).

Page 21

Outline of the Viterbi algorithm

[Figure: profile HMM with match (Mj), insert (Ij) and delete (Dj) states between Beg and End.]
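For reference, the Viterbi recurrences for a profile HMM are usually written in log-odds form as below. This follows the standard textbook formulation (Durbin et al., Chapter 5) and is not taken from the slide itself. The notation is assumed: $V^M_j(i)$, $V^I_j(i)$ and $V^D_j(i)$ are the scores of the best path that accounts for $x_1 \ldots x_i$ and ends in $M_j$, $I_j$ or $D_j$; $t$ denotes transition probabilities, $e$ emission probabilities and $q$ background frequencies.

```latex
V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
V^M_{j-1}(i-1) + \log t_{M_{j-1},M_j} \\
V^I_{j-1}(i-1) + \log t_{I_{j-1},M_j} \\
V^D_{j-1}(i-1) + \log t_{D_{j-1},M_j}
\end{cases}

V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
V^M_j(i-1) + \log t_{M_j,I_j} \\
V^I_j(i-1) + \log t_{I_j,I_j} \\
V^D_j(i-1) + \log t_{D_j,I_j}
\end{cases}

V^D_j(i) = \max
\begin{cases}
V^M_{j-1}(i) + \log t_{M_{j-1},D_j} \\
V^I_{j-1}(i) + \log t_{I_{j-1},D_j} \\
V^D_{j-1}(i) + \log t_{D_{j-1},D_j}
\end{cases}
```

The recursion starts from $V^M_0(0) = 0$, treating Beg as $M_0$, and the optimal alignment is recovered by tracing back the choices that achieved each maximum. Delete states consume no symbol, which is why $i$ is not decremented in the third recurrence.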

Page 22

Profile HMMs for local alignments

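[Figure: profile HMM (Mj, Ij, Dj) between Beg and End, with an additional state Q on each flank.]

The trick consists of adding additional insert states Q that model flanking unaligned sequences using background frequencies $q_a$ and a large self-transition probability $t_{Q,Q}$.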

Page 23

Summary

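In general, when the states generating the training sequences (alignments) are not known, an iterative procedure (Baum-Welch or Viterbi training) is needed to estimate the model parameters.

Problems: local optima of the likelihood, choice of topology (length of the profile).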

Excellent results in family assignment (SAM, PFAM), gene prediction, trans-membrane domain recognition etc.
