


Computational Biology, Part C: Family Pairwise Search and Cobbling

Robert F. Murphy

Copyright 2000, 2001. All rights reserved.

Overall Goals

Find previously unrecognized members of a family

Develop a model of a family

Possible Approaches

Model-based: motif-based (MEME/MAST) and hidden Markov model-based (HMMER)

Non-model-based: Family Pairwise Search (FPS)

PSSMs

Motifs can be summarized and searched for using Position-Specific Scoring Matrices (PSSMs)

A PSSM is calculated from a multiple alignment of a conserved region for members of a family
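To make the PSSM idea concrete, here is a minimal Python sketch that builds log-odds scores from an ungapped alignment and slides the matrix along a query sequence. The uniform background frequencies, the pseudocount of 1, and the function names `build_pssm` and `best_match` are illustrative assumptions, not anything specified in the slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0, background=None):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes an ungapped alignment of equal-length sequences.
    """
    if background is None:
        # Assumption: uniform background; real tools use observed frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in AMINO_ACIDS
        })
    return pssm

def best_match(pssm, sequence):
    """Slide the PSSM along the sequence; return (best score, start offset)."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[i][sequence[start + i]] for i in range(width))
        best = max(best, (score, start))
    return best

# Toy conserved region from three hypothetical family members.
pssm = build_pssm(["ACDE", "ACDE", "ACDF"])
score, offset = best_match(pssm, "GGACDEGG")
```

In this toy case the best-scoring window starts at the offset where the conserved region occurs in the query.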

Learning PSSMs

Unsupervised learning methods can be used to find motifs in unaligned sequences

Best characterized algorithm is MEME

T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83

Problems with PSSMs

Some families are characterized by two or more "sub"-motifs with variable spacing between them

Deciding upon motif boundaries is difficult

Possible information in intervening sequences is lost if only motifs are used

Cobbling

Pick the "most representative" protein sequence from a family

Convert it to a profile by replacing each amino acid by the corresponding column from a similarity matrix

Cobbling

For each recognized "motif" in the family, replace the corresponding section of the profile with the profile of the motif
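The two cobbling steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_similarity` is a stand-in for a real substitution matrix such as BLOSUM62, the motif profile and its position are hypothetical inputs, and the names `sequence_to_profile` and `cobble` are made up for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_similarity(a, b):
    # Assumption: a crude stand-in for a real similarity matrix.
    return 2 if a == b else -1

def sequence_to_profile(sequence, similarity=toy_similarity):
    """Replace each residue by its column of similarity scores."""
    return [{aa: similarity(res, aa) for aa in AMINO_ACIDS} for res in sequence]

def cobble(sequence, motifs, similarity=toy_similarity):
    """Embed motif profiles into the profile of a representative sequence.

    motifs is a list of (start, motif_profile) pairs, where start is the
    offset of the motif within `sequence` and motif_profile is a list of
    {residue: score} columns.
    """
    profile = sequence_to_profile(sequence, similarity)
    for start, motif_profile in motifs:
        end = start + len(motif_profile)
        if start < 0 or end > len(profile):
            raise ValueError("motif does not fit inside the sequence")
        # Overwrite the single-sequence columns with the motif's columns.
        profile[start:end] = [dict(column) for column in motif_profile]
    return profile

# Hypothetical two-column motif that strongly prefers C then D.
motif = [
    {aa: (5 if aa == "C" else 0) for aa in AMINO_ACIDS},
    {aa: (5 if aa == "D" else 0) for aa in AMINO_ACIDS},
]
cobbled = cobble("ACDEFG", [(1, motif)])
```

Columns outside the motif keep the scores derived from the representative sequence, which is how the intervening sequence information survives.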

Cobbling

Advantage: At least some sequence information between motifs is retained.

S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705

Cobbler Illustration

[Figure: the sequence of the "most representative" family member, the similarity scores derived from that sequence, and the scores from profiles of conserved motifs.]

Family Pairwise Search

For every known member of the family, calculate its (pairwise) similarity score to each sequence in the database (using BLAST), then sum those scores for each database sequence

Family Pairwise Search

Does not generate a model of the motif

Analogous to k nearest neighbor classification
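The FPS scoring rule is simple enough to sketch directly. In the Python below, the pairwise scorer is passed in as a function; `shared_kmer_score` is a toy stand-in for a BLAST score (an assumption made so the example runs without external tools), and all sequences are made up.

```python
def family_pairwise_search(family, database, score):
    """Rank database entries by the sum of pairwise scores against the family.

    family   -- list of known member sequences
    database -- dict mapping a name to a sequence
    score    -- function (query, target) -> number; BLAST in the slides
    """
    totals = {
        name: sum(score(member, target) for member in family)
        for name, target in database.items()
    }
    # Highest summed score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def shared_kmer_score(a, b, k=3):
    # Assumption: toy similarity measure, the number of distinct shared k-mers.
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))

family = ["ACDEFGHIK", "ACDEFGHLK"]
database = {"hit": "MMACDEFGHIKMM", "miss": "WWWWYYYYWWWW"}
ranking = family_pairwise_search(family, database, shared_kmer_score)
```

No model is built at any point: the known family members themselves act as the "training set", which is what makes the nearest-neighbor analogy apt.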

Which method is best?

Compare BLAST using a randomly chosen family member, BLAST FPS, MEME, HMMER

W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492

Comparison Protocol

For each method, and for each known protein family:

Train with family members

Search database for matches

Rank by score from search

Determine how many known family members are ranked highly

Comparison Protocol

Evaluation metric: average ROC50

ROC50 summarizes the fraction of true positives detected before the first 50 false positives (formally, the normalized area under the ROC curve up to the 50th false positive)

Average over all families

Bigger is better!
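As a rough illustration of the metric, the following Python computes ROC-n from a ranked list of hits and averages it over families. The function names are hypothetical, and the handling of lists that contain fewer than n false positives (the missing ones are assumed to rank below everything seen) is a simplifying assumption rather than part of the published protocol.

```python
def roc_n(ranked_labels, n=50, total_positives=None):
    """Normalized area under the ROC curve up to the n-th false positive.

    ranked_labels -- booleans, best-scoring hit first; True marks a known
                     family member, False marks a non-member
    """
    if total_positives is None:
        total_positives = sum(ranked_labels)
    if total_positives == 0:
        raise ValueError("need at least one true positive")
    true_seen = 0
    false_seen = 0
    area = 0
    for is_member in ranked_labels:
        if is_member:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Assumption: if the list ends early, the remaining false positives
    # would rank below every hit seen so far.
    area += (n - false_seen) * true_seen
    return area / (n * total_positives)

def mean_roc_n(rankings, n=50):
    """Average ROC-n over families; rankings is a list of ranked label lists."""
    return sum(roc_n(r, n) for r in rankings) / len(rankings)

perfect = roc_n([True] * 3 + [False] * 50)
worst = roc_n([False] * 50 + [True] * 3)
mixed = roc_n([True, False, True, False], n=2)
```

A score of 1 means every known member outranks the first n non-members; a score of 0 means none do.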

Comparison Protocol

Caution!

A true positive is defined as being listed as a member of the family in the PROSITE compilation

Some false positives could be actual family members that were missed during PROSITE compilation!

(Should be a minor effect)

Results

[Figure: results for BLAST FPS, BLAST, HMMER, and MAST.]

Conclusion

FPS better than single sequence BLAST

FPS better than model-based methods

Which is best (part 2)?

Compare BLAST, BLAST FPS, cobbled BLAST, cobbled BLAST FPS

W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470

Comparison Protocol

Evaluation metric: rank sum

Calculate the difference in ROC50 between the two methods for a given family

Sort the families by the absolute value of the difference

Sum the ranks of the families for which one method is better than the other

Bigger is better!
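The rank-sum steps above can be sketched as follows. This Python version drops families where the two methods tie exactly and assigns plain ordinal ranks without averaging tied absolute differences; both choices are simplifying assumptions, and the ROC50 values are invented for the example.

```python
def signed_rank_sums(roc_a, roc_b):
    """Compare two methods by rank sum over per-family ROC50 differences.

    roc_a, roc_b -- dicts mapping family name to ROC50 for methods A and B
    Returns (rank sum of families where A is better,
             rank sum of families where B is better).
    """
    # Families with identical scores carry no preference and are skipped.
    diffs = [roc_a[f] - roc_b[f] for f in roc_a if roc_a[f] != roc_b[f]]
    diffs.sort(key=abs)  # smallest absolute difference receives rank 1
    sum_a = sum(rank for rank, d in enumerate(diffs, start=1) if d > 0)
    sum_b = sum(rank for rank, d in enumerate(diffs, start=1) if d < 0)
    return sum_a, sum_b

# Hypothetical ROC50 values for four families.
roc_a = {"f1": 0.9, "f2": 0.5, "f3": 0.7, "f4": 0.6}
roc_b = {"f1": 0.6, "f2": 0.6, "f3": 0.5, "f4": 0.6}
sums = signed_rank_sums(roc_a, roc_b)
```

Here method A wins on the two families with the largest differences, so its rank sum is larger even though method B wins on one family.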

Results

Conclusion

For the task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is the best currently available method!
