
Page 1

Applying Hidden Markov Models to Bioinformatics

Conor Buckley

Page 2

Outline

What are Hidden Markov Models?
Why are they a good tool for Bioinformatics?
Applications in Bioinformatics

Page 3

History of Hidden Markov Models

HMMs were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s; they are still commonly used in speech recognition systems to determine the words represented by captured sound waveforms.

In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA.

Since then, they have become ubiquitous in bioinformatics.

Source: http://en.wikipedia.org/wiki/Hidden_Markov_model#History

Page 4

What are Hidden Markov Models?

HMM: A formal foundation for making probabilistic models of linear sequence 'labeling' problems.

They provide a conceptual toolkit for building complex models just by drawing an intuitive picture.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 5

What are Hidden Markov Models?

Machine learning approach in bioinformatics

Machine learning algorithms are presented with training data, which are used to derive important insights about the (often hidden) parameters.

Once an algorithm has been trained, it can apply these insights to the analysis of a test sample.

As the amount of training data increases, the accuracy of the machine learning algorithm typically increases as well.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 6

Hidden Markov Models

An HMM has N states, called S1, S2, ..., SN.
There are discrete timesteps: t = 0, t = 1, ...

[Diagram: three states S1, S2, S3; N = 3, t = 0]

Source: http://www.autonlab.org/tutorials/hmm.html

Page 7

Hidden Markov Models

An HMM has N states, called S1, S2, ..., SN.
There are discrete timesteps: t = 0, t = 1, ...
At each timestep, the system is in exactly one of the available states (a minimal code sketch follows below).

[Diagram: three states S1, S2, S3; N = 3, t = 0]
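A minimal sketch of this structure in Python. The state names and transition probabilities below are invented purely for illustration; they are not values from the slides:

```python
import random

# Illustrative 3-state HMM skeleton: N states, discrete timesteps,
# and exactly one occupied state per timestep.
states = ["S1", "S2", "S3"]          # N = 3

# Transition probabilities (made-up numbers): outer key = current state.
transitions = {
    "S1": {"S1": 0.5, "S2": 0.3, "S3": 0.2},
    "S2": {"S1": 0.2, "S2": 0.6, "S3": 0.2},
    "S3": {"S1": 0.1, "S2": 0.3, "S3": 0.6},
}

def step(current):
    """Advance from timestep t to t+1 by sampling the next state."""
    nxt = list(transitions[current])
    weights = list(transitions[current].values())
    return random.choices(nxt, weights=weights, k=1)[0]

state = "S1"                         # the single state occupied at t = 0
for t in range(1, 5):
    state = step(state)
    print(f"t={t}: {state}")
```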

Page 8

Hidden Markov Models

[Diagram: Bayesian network with time slices over states S1, S2, S3]

Bayesian Network Image: http://en.wikipedia.org/wiki/File:Hmm_temporal_bayesian_net.svg

Page 9

A Markov Chain

Bayes' theorem: (statistics) a theorem describing how the conditional probability of a set of possible causes for a given observed event can be computed from knowledge of the probability of each cause and the conditional probability of the outcome of each cause.

Source: http://wordnetweb.princeton.edu/perl/webwn?s=bayes%27%20theorem
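As a toy illustration of Bayes' theorem, the snippet below computes P(cause | observation) from a prior and a likelihood. The weather/activity numbers are assumptions borrowed from the Alice/Bob example on the following slides:

```python
# Bayes' theorem: P(cause | observation) =
#   P(observation | cause) * P(cause) / P(observation)

p_rainy = 0.6                 # prior probability of the cause "Rainy"
p_sunny = 0.4                 # prior probability of the cause "Sunny"
p_walk_given_rainy = 0.1      # likelihood of observing "walk" under each cause
p_walk_given_sunny = 0.6

# Total probability of the observation "walk" (law of total probability)
p_walk = p_walk_given_rainy * p_rainy + p_walk_given_sunny * p_sunny

# Posterior probability that it was rainy, given that Bob walked
p_rainy_given_walk = p_walk_given_rainy * p_rainy / p_walk
print(round(p_rainy_given_walk, 3))   # 0.2 with these numbers
```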

Page 10

Building a Markov Chain: A Concrete Example

Two friends, Alice and Bob, live far apart from each other and talk together daily over the telephone about what they did that day.

Bob is only interested in three activities: walking in the park, shopping, and cleaning his apartment.

The choice of what to do is determined exclusively by the weather on a given day.

Alice has no definite information about the weather where Bob lives, but she knows general trends.

Based on what Bob tells her he did each day, Alice tries to guess what the weather must have been like.

Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but she cannot observe them directly, that is, they are hidden from her.

On each day, there is a certain chance that Bob will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are the observations.

Source: Wikipedia.org
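A sketch of this model as plain Python dictionaries. The probability values are the ones used in the Wikipedia article's version of the example; the slides themselves do not quote numbers, so treat them as assumed parameters:

```python
# Hidden states and observations of the Alice/Bob example
states = ("Rainy", "Sunny")
observations = ("walk", "shop", "clean")

# Alice's general knowledge of the weather (assumed values)
start_probability = {"Rainy": 0.6, "Sunny": 0.4}

# How the weather tends to change from one day to the next
transition_probability = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

# How likely Bob is to walk, shop, or clean given the weather
emission_probability = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
```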

Page 11

Hidden Markov Models

Page 12

Building a Markov Chain

Page 13

What now?

* Find the most probable output sequence

Viterbi's algorithm: a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a sequence of observed events.
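A minimal Viterbi sketch in Python, assuming the model is given as dictionaries like the Alice/Bob example above (dynamic programming over a trellis of best path probabilities, followed by a traceback):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               state at time t-1 on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]

    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s] = (prob, prev)

    # Trace back from the most probable final state.
    prob, last = max((best[-1][s][0], s) for s in states)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, best[t][path[0]][1])
    return prob, path
```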

Page 14

Viterbi Results

http://pcarvalho.com/forward_viterbi/
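Running the Viterbi sketch on the Alice/Bob dictionaries defined earlier. With the assumed Wikipedia parameters, observing (walk, shop, clean) should yield the path Sunny, Rainy, Rainy with probability 0.01344:

```python
# Observed activities reported by Bob over three days
prob, path = viterbi(
    ("walk", "shop", "clean"),
    states,
    start_probability,
    transition_probability,
    emission_probability,
)
print(prob, path)   # expected: 0.01344 ['Sunny', 'Rainy', 'Rainy']
```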

Page 15

Bioinformatics Example

Assume we are given a DNA sequence that begins in an exon, contains one 5' splice site, and ends in an intron.

Identify where the switch from exon to intron occurs. Where is the splice site?

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 16

Bioinformatics Example

In order for us to guess, the sequences of exons, splice sites and introns must have different statistical properties.

Let's say...

Exons have a uniform base composition on average: A/C/G/T at 25% each.

Introns are A/T rich: A/T at 40% each, C/G at 10% each.

The 5' splice site consensus nucleotide is almost always a G: G 95%, A 5%.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 17

Bioinformatics Example

We can build a Hidden Markov Model with three states:

"E" for exon
"5" for 5' SS
"I" for intron

Each state has its own emission probabilities, which model the base composition of exons, introns and the consensus G at the 5' SS.

Each state also has transition probabilities (the arrows between states); a sketch of the model follows below.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
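A sketch of this toy model in Python. The emission probabilities follow the previous slide; the transition values and start distribution are illustrative assumptions in the spirit of the Eddy (2004) figure, not numbers quoted on these slides:

```python
# Three hidden states: exon, 5' splice site, intron
states = ("E", "5", "I")

# Emission probabilities, matching the statistics on the previous slide
emission_probability = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # uniform exon composition
    "5": {"A": 0.05, "C": 0.0,  "G": 0.95, "T": 0.0},    # 5' SS is almost always G
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},   # A/T-rich intron
}

# Assumed transitions: stay in the exon, pass through the single splice-site
# state exactly once, then stay in the intron until the sequence ends.
transition_probability = {
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 0.9},   # remaining 0.1 to an implicit end state
}

# The sequence is assumed to begin in an exon
start_probability = {"E": 1.0, "5": 0.0, "I": 0.0}
```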

Page 18

HMM: A Bioinformatics Visual

We can use HMMs to generate a sequence. When we visit a state, we emit a nucleotide based on the emission probability distribution. We also choose a state to visit next according to the state's transition probability distribution.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

We generate two strings of information: the observed sequence and the underlying state path (see the sketch below).
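A sketch of such a generator, reusing the model dictionaries from the previous slide (an assumption; the variable and function names are illustrative):

```python
import random

def generate(start_p, trans_p, emit_p, length=20):
    """Emit a base from the current state, then move to the next state."""
    state = random.choices(list(start_p), weights=start_p.values(), k=1)[0]
    sequence, path = [], []
    for _ in range(length):
        base = random.choices(list(emit_p[state]),
                              weights=emit_p[state].values(), k=1)[0]
        sequence.append(base)
        path.append(state)
        state = random.choices(list(trans_p[state]),
                               weights=trans_p[state].values(), k=1)[0]
    return "".join(sequence), "".join(path)

seq, path = generate(start_probability, transition_probability, emission_probability)
print(seq)    # observed sequence, e.g. CATGCTAG...
print(path)   # underlying state path, e.g. EEEEE5III...
```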

Page 19

HMM: A Bioinformatics Visual

The state path is a Markov chain. Since we're only given the observed sequence, this underlying state path is a hidden Markov chain.

Therefore... we can apply Bayesian probability.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 20

HMM: A Bioinformatics Visual

S – Observed sequence

π – State Path

Θ – Parameters

The probability P(S, π | HMM, Θ) is the product of all emission probabilities and transition probabilities.

Let's look at an example...

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
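Before the article's worked example on the next slide, here is a small scoring sketch: log P(S, π | HMM, Θ) computed as a sum of log emission and log transition terms. It reuses the model dictionaries sketched earlier, and the short sequence and state path are invented for illustration:

```python
import math

def log_joint(seq, path, start_p, trans_p, emit_p):
    """log P(S, path | HMM, parameters): one emission per base, one transition per step."""
    logp = math.log(start_p[path[0]]) + math.log(emit_p[path[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(trans_p[path[i - 1]][path[i]])   # transition term
        logp += math.log(emit_p[path[i]][seq[i]])         # emission term
    return logp

# A made-up 6-base sequence and state path, scored with the model above
print(log_joint("CATGTA", "EEE5II", start_probability,
                transition_probability, emission_probability))
```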

Page 21

HMM: A Bioinformatics Visual

There are 27 transitions and 26 emissions. Multiply all 53 probabilities together (and take the log, since these are small numbers) and you'll calculate log P(S, π | HMM, Θ) = -41.22.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 22

HMM: A Bioinformatics Visual

The model parameters and overall sequence scores are all probabilities. Therefore we can use Bayesian probability theory to manipulate these numbers in standard, powerful ways, including optimizing parameters and interpreting the significance of scores.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 23

HMM: A Bioinformatics Visual

Posterior Decoding: an alternative state path has the 5' SS fall on the sixth G instead of the fifth (log probabilities of -41.71 versus -41.22). How confident are we that the fifth G is the right choice?

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 24

HMM: A Bioinformatics Visual

We can calculate our confidence directly. The probability that nucleotide i was emitted by state k is the sum of the probabilities of all the state paths that use state k to generate i, normalized by the sum over all possible state paths.

Result: We get a probability of 46% that the best-scoring fifth G is correct and 28% that the sixth G position is correct.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
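A sketch of how such posteriors are computed with the forward-backward algorithm (the standard algorithm behind posterior decoding; the implementation below is a generic illustration, not code from the article):

```python
def posterior(obs, states, start_p, trans_p, emit_p):
    n = len(obs)
    # Forward pass: fwd[i][k] = P(obs[0..i], state k at position i)
    fwd = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    for i in range(1, n):
        fwd.append({k: emit_p[k][obs[i]] *
                       sum(fwd[i - 1][p] * trans_p[p][k] for p in states)
                    for k in states})

    # Backward pass: bwd[i][k] = P(obs[i+1..n-1] | state k at position i)
    bwd = [dict() for _ in range(n)]
    bwd[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        bwd[i] = {k: sum(trans_p[k][p] * emit_p[p][obs[i + 1]] * bwd[i + 1][p]
                         for p in states)
                  for k in states}

    total = sum(fwd[n - 1][k] for k in states)   # P(S), summed over all state paths
    # Posterior P(state k at position i | S) for every position and state
    return [{k: fwd[i][k] * bwd[i][k] / total for k in states} for i in range(n)]
```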

Page 25

Further Possibilities

The toy model provided by the article is a simple example, but we can go further: we could add a more realistic consensus, GTRAGT, at the 5' splice site.

We could put a row of six HMM states in place of the '5' state to model a six-base ungapped consensus motif (see the sketch below).

The possibilities do not end there.
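A sketch of what replacing the single '5' state with six positional states might look like. The per-position emission values and the handling of R (A or G) are illustrative assumptions, not values from the slides or the article:

```python
# Six positional states replacing the single '5' state (consensus GTRAGT, R = A or G).
consensus_emissions = {
    "5_1": {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},   # G
    "5_2": {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},   # T
    "5_3": {"A": 0.49, "C": 0.01, "G": 0.49, "T": 0.01},   # R (A or G)
    "5_4": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},   # A
    "5_5": {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},   # G
    "5_6": {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},   # T
}

# Each positional state moves to the next with probability 1, so the motif is
# ungapped; the last position hands off to the intron state.
consensus_transitions = {f"5_{i}": {f"5_{i + 1}": 1.0} for i in range(1, 6)}
consensus_transitions["5_6"] = {"I": 1.0}
```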

Page 26

The Catch

HMMs don't deal well with correlations between nucleotides, because they assume that each emitted nucleotide depends only on one underlying state.

Example of a bad use for an HMM: conserved RNA base pairs, which induce long-range pairwise correlations; one position might be any nucleotide, but the base-paired partner must be complementary.

An HMM state path has no way of 'remembering' what a distant state generated.

Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Page 28

Questions?