Usage of profile HMMs in Bioinformatics

Using PFAM database’s profile HMMs in MATLAB Bioinformatics Toolkit Presentation by: Athina Ropodi University of Athens- Information Technology in Medicine and Biology


Post on 13-Jan-2016



Page 1: Usage of profile HMMs in Bioinformatics

Using PFAM database’s profile HMMs in MATLAB Bioinformatics Toolkit

Presentation by: Athina Ropodi

University of Athens- Information Technology in Medicine and Biology

Page 2: Usage of profile HMMs in Bioinformatics

Outline

Introduction
    HMMs
    Profile HMMs

Pfam Database
    General info
    Useful links
    Available Data

Bioinformatics Toolkit
    Function presentation

Other available software

Bibliography

Page 3: Usage of profile HMMs in Bioinformatics

To model sequential data while exploiting the correlations between neighbouring observations, we need a probabilistic model of the joint distribution of the whole sequence of observations.

A simple way to do this is to assume a Markov chain model, in which each state depends only on the previous one. The probability of going from one state to another is called the transition probability.
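A minimal sketch of the Markov chain idea, written in plain Python so that it runs without MATLAB. The two states "A" and "B" and all probability values are made-up assumptions for illustration. The joint probability of a state sequence is the initial probability multiplied by the transition probabilities along the sequence:

```python
# Toy first-order Markov chain; the numbers below are illustrative assumptions.
initial = {"A": 0.5, "B": 0.5}
transition = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.2, "B": 0.8},
}

def chain_probability(states, initial, transition):
    """Joint probability P(s1, ..., sn) = P(s1) * prod P(s_i | s_(i-1))."""
    prob = initial[states[0]]
    # Multiply in one transition probability per consecutive pair of states.
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob

p = chain_probability("AAB", initial, transition)
print(p)
```

For the sequence "AAB" this multiplies P(A) by the transitions A to A and A to B.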

In a Hidden Markov Model (HMM), the state sequence is not observed directly; each state emits a symbol of the sequence X, e.g. a nucleotide in a DNA sequence or an amino acid in a protein sequence. The emission probability e_k(b) is defined as the probability of observing symbol b when in state k.
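To show how transition and emission probabilities combine, here is a small Python sketch of the forward algorithm, which sums the probability of a sequence over all hidden state paths. The two states ("AT_rich", "GC_rich") and every number are hypothetical, and the toy model has no explicit end state:

```python
# Toy two-state HMM over the DNA alphabet; all numbers are illustrative assumptions.
states = ["AT_rich", "GC_rich"]
start = {"AT_rich": 0.5, "GC_rich": 0.5}
trans = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
# emit[k][b] is the emission probability e_k(b): P(symbol b | state k).
emit = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def forward_probability(seq):
    """P(seq), summed over all hidden state paths (forward algorithm)."""
    # f[k] = probability of the symbols seen so far, ending in state k.
    f = {k: start[k] * emit[k][seq[0]] for k in states}
    for symbol in seq[1:]:
        f = {
            k: emit[k][symbol] * sum(f[j] * trans[j][k] for j in states)
            for k in states
        }
    return sum(f.values())

# Sanity check: the probabilities of all length-2 sequences add up to 1.
total = sum(forward_probability(a + b) for a in "ACGT" for b in "ACGT")
print(forward_probability("AT"), total)
```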

Page 4: Usage of profile HMMs in Bioinformatics
Page 5: Usage of profile HMMs in Bioinformatics

Each match state (m_i) emits one of the 20 amino-acid letters, according to P(x|m_i).

For each match state there is a delete state (d_i), in which no amino acid is emitted.

For a model with M match states there is a total of M+1 insert states (i_i), located on either side of the match states, which emit amino acids according to P(x|i_i).
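The state layout described above can be sketched as follows. The state names (M1, D1, I0, ...) are a common textbook naming convention assumed here for illustration; they are not taken from the slides:

```python
def profile_states(model_length):
    """List the state names of a profile HMM with `model_length` match states."""
    match = [f"M{i}" for i in range(1, model_length + 1)]
    delete = [f"D{i}" for i in range(1, model_length + 1)]
    # Insert states sit before the first match state, between consecutive
    # match states, and after the last one: model_length + 1 in total.
    insert = [f"I{i}" for i in range(0, model_length + 1)]
    return match, delete, insert

match, delete, insert = profile_states(3)
print(match, delete, insert)
```

A model with 3 match states therefore has 3 delete states and 4 insert states.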

Page 6: Usage of profile HMMs in Bioinformatics

Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain.

For each Pfam entry there is a family page which can be accessed in several ways.

Pfam contains two types of families, Pfam-A and Pfam-B. Pfam-A families are manually curated, HMM-based families built from an alignment of a small number of representative sequences.

Page 7: Usage of profile HMMs in Bioinformatics

For each family, Pfam builds two HMMs, one to represent fragment matches and one to represent full-length matches. The HMMER2 software is used to build and search the profile HMMs.

Available links:

http://pfam.sanger.ac.uk/

http://hmmer.janelia.org/

Page 8: Usage of profile HMMs in Bioinformatics

Each family has the following data:

A seed alignment, which is a hand-edited multiple alignment representing the family.

Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is in ls mode (global); the other is in fs mode (local).

A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.

Annotation that contains a brief description of the domain, links to other databases and some Pfam-specific data recording how the family was constructed.

Page 9: Usage of profile HMMs in Bioinformatics
Page 10: Usage of profile HMMs in Bioinformatics

v. 3.1 for MATLAB (2008a).

Uses the profile HMMs found in Pfam. The search is usually done by accession number or name of the family.

Multiple sequence profiles: MATLAB implementations of multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).

Page 11: Usage of profile HMMs in Bioinformatics

>> HMMStruct = gethmmprof(2)

                   Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.14'
       ModelDescription: [1x42 char]
            ModelLength: 296
               Alphabet: 'AA'
          MatchEmission: [296x20 double]
         InsertEmission: [296x20 double]
           NullEmission: [1x20 double]
                 BeginX: [297x1 double]
                 MatchX: [295x4 double]
                InsertX: [295x2 double]
                DeleteX: [295x2 double]
        FlankingInsertX: [2x2 double]
                  LoopX: [2x2 double]
                  NullX: [2x1 double]

ModelLength: number of match states.

MatchEmission: emission probabilities in the MATCH states.

NullEmission: symbol emission probabilities in the MATCH and INSERT states for the NULL model.

Page 12: Usage of profile HMMs in Bioinformatics

>> site = 'http://pfam.sanger.ac.uk/';
>> hmm = pfamhmmread([site 'family/gethmm?mode=ls&id=7tm_2']);

Or

>> pfamhmmread('pf00002.ls');

Page 13: Usage of profile HMMs in Bioinformatics

>> model = pfamhmmread('pf00002.ls');
>> showhmmprof(model, 'Scale', 'logodds');
>> hydrophobic = 'IVLFCMAGTSWYPHNDQEKR';
>> showhmmprof(model, 'Order', hydrophobic);

Choices for the 'Scale' value are:

'logprob': Log probabilities

'prob': Probabilities

'logodds': Log-odds ratios
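The three scales relate to each other as in the Python sketch below. It assumes that the log-odds ratio compares a state's emission probability with the null-model emission probability and that natural logarithms are used; the logarithm base used by showhmmprof is not stated in the slides. A hypothetical 4-letter alphabet replaces the 20 amino acids to keep the example short:

```python
import math

def emission_scales(match_emission, null_emission):
    """Return one emission row on three scales: prob, logprob and logodds."""
    prob = list(match_emission)
    logprob = [math.log(p) for p in prob]
    # Log-odds compares the model's emission with the background (null) model.
    logodds = [math.log(p / q) for p, q in zip(prob, null_emission)]
    return {"prob": prob, "logprob": logprob, "logodds": logodds}

# Illustrative numbers: one strongly preferred symbol against a uniform null model.
scales = emission_scales([0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25])
print(scales["logodds"])
```

Symbols that are more likely under the model than under the null model get a positive log-odds score; the others get a negative one.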

Page 14: Usage of profile HMMs in Bioinformatics
Page 15: Usage of profile HMMs in Bioinformatics
Page 16: Usage of profile HMMs in Bioinformatics

Choices for TypeValue are:

'seed': Returns a tree with only the alignments used to generate the HMM model.

'full' (default): Returns a tree with all of the alignments that match the model.

>> tree = gethmmtree(2, 'type', 'seed');

And

>> tr = phytreeread('pf00002.tree');

Page 17: Usage of profile HMMs in Bioinformatics
Page 18: Usage of profile HMMs in Bioinformatics

gethmmalignment: retrieves the multiple sequence alignment associated with an HMM profile from the Pfam database.

hmmprofalign: aligns a query sequence to a profile using hidden Markov model alignment.

>> load('hmm_model_examples','model_7tm_2');   % load model example
>> load('hmm_model_examples','sequences');     % load sequence example
>> SCCR_RABIT = sequences(2).Sequence;
>> [a,s] = hmmprofalign(model_7tm_2, SCCR_RABIT, 'showscore', true);

Page 19: Usage of profile HMMs in Bioinformatics

a = 514.7448

s =

LLKLKVMYTVGYSSS-LVMLLVALGILCAFRRLHCTRNYIHMHLFLSFILRALSNFIKDAVLFSSDdaihcdahrvgCKLVMVFFQYCIMANYAWLLVEGLYLHSLLVVS---FFSERKCLQGFVVLGWGSPAMFVTSWAVTR------------HFLEDSGC-WDIN-ANAAIWWVIRGPVILSILINFILFINILRILTRKLR----TQETRGQDMNHYKRLARSTLLLIPLFGVHYIVFVFSPEG-----AMEIQLFFELALGSFQGLVVAVLYCFLNGEV
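One way to read such an output string is sketched below in Python. It assumes the usual convention for profile alignments: uppercase letters were emitted by match states, lowercase letters by insert states, and dashes mark delete states. The two test strings are short excerpts of the alignment above:

```python
def summarize_alignment(aligned):
    """Count match, insert and delete positions in a profile alignment string."""
    matches = sum(1 for c in aligned if c.isupper())
    inserts = sum(1 for c in aligned if c.islower())
    deletes = aligned.count("-")
    return {
        "match": matches,
        "insert": inserts,
        "delete": deletes,
        # Match and delete positions together cover the model columns visited.
        "model_columns": matches + deletes,
    }

with_insert = summarize_alignment("FSSDdaihcdahrvgCKLV")
with_delete = summarize_alignment("LLVVS---FFSERK")
print(with_insert)
print(with_delete)
```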

Page 20: Usage of profile HMMs in Bioinformatics

hmmprofestimate - Estimate profile hidden Markov model (HMM) parameters using pseudocounts

hmmprofgenerate - Generate random sequence drawn from profile hidden Markov model (HMM)

hmmprofmerge - Concatenate prealigned strings of several sequences to profile hidden Markov model (HMM)
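To illustrate what estimating parameters with pseudocounts means, the Python sketch below estimates the emission probabilities of a single match column. It uses simple add-one (Laplace) pseudocounts as an assumption; the pseudocount scheme actually used by hmmprofestimate is not described in the slides, and the alignment column is hypothetical:

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino-acid letters

def estimate_match_emission(column, pseudocount=1.0):
    """Estimate emission probabilities for one match column with pseudocounts."""
    counts = Counter(c for c in column if c in ALPHABET)  # gaps are ignored
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    # Every letter gets `pseudocount` extra observations, so no probability is zero.
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

# Hypothetical alignment column: five sequences, one of them with a gap.
probs = estimate_match_emission("LLI-V")
print(probs["L"], probs["A"])
```

Without pseudocounts, any amino acid absent from the seed column would get probability zero and could never be matched at that position.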

Page 21: Usage of profile HMMs in Bioinformatics

>> load('hmm_model_examples','model_7tm_2');   % load model
>> load('hmm_model_examples','sequences');     % load sequences
>> for ind = 1:length(sequences)
       [scores(ind), sequences(ind).Aligned] = ...
           hmmprofalign(model_7tm_2, sequences(ind).Sequence);
   end
>> hmmprofmerge(sequences, scores)
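The idea of merging sequences that were each aligned to the same profile can be sketched in Python as below. This is not the toolbox implementation: it assumes uppercase letters and dashes occupy match columns, lowercase letters are insertions, and it pads insert regions with "." (a padding character chosen here for illustration) so that all rows share the same columns:

```python
def split_by_match_column(aligned):
    """Split into (inserts_before, match_char) blocks plus trailing inserts."""
    blocks, pending = [], ""
    for c in aligned:
        if c.islower():
            pending += c
        else:  # an uppercase residue or '-' occupies one match column
            blocks.append((pending, c))
            pending = ""
    return blocks, pending

def merge_alignments(aligned_seqs):
    """Pad insert regions with '.' so that all sequences share the same columns."""
    parsed = [split_by_match_column(s) for s in aligned_seqs]
    n_cols = len(parsed[0][0])
    if any(len(blocks) != n_cols for blocks, _ in parsed):
        raise ValueError("sequences were not aligned to the same model")
    # Widest insert before each match column, and after the last one.
    widths = [max(len(blocks[i][0]) for blocks, _ in parsed) for i in range(n_cols)]
    tail_width = max(len(tail) for _, tail in parsed)
    rows = []
    for blocks, tail in parsed:
        row = "".join(ins.ljust(w, ".") + m for (ins, m), w in zip(blocks, widths))
        rows.append(row + tail.ljust(tail_width, "."))
    return rows

# Three toy sequences aligned to a hypothetical 3-column model.
merged = merge_alignments(["ACdeF", "A-F", "ACgF"])
print("\n".join(merged))
```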

Page 22: Usage of profile HMMs in Bioinformatics

HMMER: http://hmmer.wustl.edu/

SAM: http://www.cse.ucsc.edu/research/compbio/sam.html

PFTOOLS: http://www.isrec.isb-sib.ch/ftp-server/pftools/

GENEWISE: http://www.ebi.ac.uk/Wise2/

PROBE: ftp://ftp.ncbi.nih.gov/pub/neuwald/probe1.0/

META-MEME: http://metameme.sdsc.edu/

PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/newblast.html

Page 23: Usage of profile HMMs in Bioinformatics

[1] Durbin et al., “Biological Sequence Analysis”, Cambridge University Press, 1998

[2] A. Krogh et al., “Hidden Markov Models in Computational Biology: Applications to Protein Modeling”, 1994

[3] S.R. Eddy, “Profile Hidden Markov Models”, 1998

[4] S.R. Eddy, “Hidden Markov Models”, 1996

[5] http://hmmer.janelia.org/#thanks

[6] E.L.L. Sonnhammer, S.R. Eddy and R. Durbin, “Pfam: a comprehensive database of protein families based on seed alignments”, 1997

[7] R.D. Finn et al., “Pfam: clans, web tools and services”, 2006

[8] http://www.mathworks.com/