computational biology, part 4 protein coding regions

Computational Biology, Part 7: Supervised Machine Learning and Searching for Sequence Families. Robert F. Murphy. Copyright 2008-2009. All rights reserved.

Upload: butest

Post on 03-Jul-2015


TRANSCRIPT

Page 1: Computational Biology, Part 4 Protein Coding Regions

Computational Biology, Part 7: Supervised Machine Learning and Searching for Sequence Families

Robert F. Murphy

Copyright 2008-2009.

All rights reserved.

Page 2: Computational Biology, Part 4 Protein Coding Regions

www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

Page 3: Computational Biology, Part 4 Protein Coding Regions

What is Machine Learning?

■ Fundamental Question of Computer Science: How can we build machines that solve problems, and which problems are inherently tractable/intractable?

■ Fundamental Question of Statistics: What can be inferred from data plus a set of modeling assumptions, with what reliability?

Tom Mitchell white paper

Page 4: Computational Biology, Part 4 Protein Coding Regions

Fundamental Question of Machine Learning

■ How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?
  ◆ Tom Mitchell

Tom Mitchell white paper

Page 5: Computational Biology, Part 4 Protein Coding Regions

Why Machine Learning?

■ Learn relationships from large sets of complex data: Data mining
  ◆ Predict clinical outcome from tests
  ◆ Decide whether someone is a good credit risk

■ Do tasks too complex to program by hand
  ◆ Autonomous driving

■ Customize programs to user needs
  ◆ Recommend book/movie based on previous likes

Tom Mitchell white paper

Page 6: Computational Biology, Part 4 Protein Coding Regions

Why Machine Learning?

■ Economically efficient
■ Can consider larger data spaces and hypothesis spaces than people can
■ Can formalize learning problem to explicitly identify/describe goals and criteria

Page 7: Computational Biology, Part 4 Protein Coding Regions

Successful Machine Learning Applications

■ Speech recognition
  ◆ Telephone menu navigation

■ Computer vision
  ◆ Mail sorting

■ Bio-surveillance
  ◆ Identifying disease outbreaks

■ Robot control
  ◆ Autonomous driving

■ Empirical science

Tom Mitchell white paper

Page 8: Computational Biology, Part 4 Protein Coding Regions

Machine Learning Paradigms

■ Supervised Learning
  ◆ Classification
  ◆ Regression

■ Unsupervised Learning
  ◆ Clustering

■ Semi-supervised Learning
  ◆ Cotraining
  ◆ Active learning

Page 9: Computational Biology, Part 4 Protein Coding Regions

Supervised Learning

■ Approaches
  ◆ Classification (discrete predictions)
  ◆ Regression (continuous predictions)

■ Common considerations
  ◆ Representation (Features)
  ◆ Feature Selection
  ◆ Functional form
  ◆ Evaluation of predictive power

Page 10: Computational Biology, Part 4 Protein Coding Regions

Classification vs. Regression

■ If I want to predict whether a patient will die from a disease within six months, that is classification

■ If I want to predict how long the patient will live, that is regression

Page 11: Computational Biology, Part 4 Protein Coding Regions

Representation

■ Definition of thing or things to be predicted
  ◆ Classification: classes
  ◆ Regression: regression variable

■ Definition of things (instances) to make predictions for
  ◆ Individuals
  ◆ Families
  ◆ Neighborhoods, etc.

■ Choice of descriptors (features) to describe different aspects of instances

Page 12: Computational Biology, Part 4 Protein Coding Regions

Formal description

■ Defining X as a set of instances x described by features
■ Given training examples D from X
■ Given a target function c that maps X -> {0,1}
■ Given a hypothesis space H
■ Determine a hypothesis h in H such that h(x) = c(x) for all x in D

Courtesy Tom Mitchell

Page 13: Computational Biology, Part 4 Protein Coding Regions

Inductive learning hypothesis

■ Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function over other unobserved examples

Courtesy Tom Mitchell

Page 14: Computational Biology, Part 4 Protein Coding Regions

Hypothesis space

■ The hypothesis space determines the functional form

■ It defines what are allowable rules/functions for classification

■ Each classification method uses a different hypothesis space

Page 15: Computational Biology, Part 4 Protein Coding Regions

Simple two class problem

[Figure: example images labeled + and -, and unlabeled images marked ???]

Describe each image by features

Train classifier

Page 16: Computational Biology, Part 4 Protein Coding Regions

k-Nearest Neighbor (kNN)

■ In feature space, training examples are

Page 17: Computational Biology, Part 4 Protein Coding Regions

k-Nearest Neighbor (kNN)

■ We want to label ‘?’

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: Feature #1 (e.g., ‘area’) and Feature #2 (e.g., roundness)]

Page 18: Computational Biology, Part 4 Protein Coding Regions

k-Nearest Neighbor (kNN)

■ Find k nearest neighbors and vote

for k=3,

nearest neighbors are

So we label it +
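The voting rule on this slide can be sketched in a few lines of Python. This is only an illustration: the feature values below are hypothetical (they are not read off the figure), Euclidean distance is an assumed choice of distance measure, and `knn_label` is a name introduced here.

```python
import math
from collections import Counter

def knn_label(train, query, k=3):
    """Label `query` by majority vote among its k nearest training examples."""
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    # Count the labels of the k closest examples and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (area, roundness) values for the two classes.
train = [
    ((60, 70), "+"), ((65, 80), "+"), ((70, 75), "+"), ((75, 85), "+"),
    ((20, 30), "-"), ((25, 20), "-"), ((30, 35), "-"), ((35, 25), "-"),
]

print(knn_label(train, (55, 60), k=3))  # prints +
```

Using an odd k with two classes avoids tied votes.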

Page 19: Computational Biology, Part 4 Protein Coding Regions

Linear Discriminants

■ Fit multivariate Gaussian to each class
■ Measure distance from ? to each Gaussian

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: area and bright.]
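One way to read "measure distance from ? to each Gaussian" is as a Mahalanobis distance from the query to each class mean. The sketch below makes two assumptions that the slide does not state: the classes share one pooled covariance matrix (the assumption that makes the decision boundary linear), and there are exactly two features so the 2x2 covariance can be inverted by hand. The data values and function names are hypothetical.

```python
def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, mu):
    """Sums of squared and cross deviations from the class mean (2x2, symmetric)."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return sxx, sxy, syy

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def classify(classes, query):
    """Return the label of the class whose Gaussian is closest to `query`."""
    means = {label: mean(points) for label, points in classes.items()}
    # Pool the within-class scatter into one shared covariance estimate.
    pooled = [0.0, 0.0, 0.0]
    count = 0
    for label, points in classes.items():
        for i, value in enumerate(scatter(points, means[label])):
            pooled[i] += value
        count += len(points)
    cov = tuple(value / (count - len(classes)) for value in pooled)
    return min(means, key=lambda label: mahalanobis_sq(query, means[label], cov))

# Hypothetical (area, brightness) values.
classes = {
    "+": [(60, 70), (65, 80), (70, 72), (75, 85)],
    "-": [(20, 30), (25, 22), (30, 35), (35, 25)],
}
print(classify(classes, (55, 60)))  # prints +
```

Fitting a separate covariance to each class instead would give a quadratic rather than a linear boundary.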

Page 20: Computational Biology, Part 4 Protein Coding Regions

Decision trees

■ Again we want to label ‘?’

Slide courtesy of Christos Faloutsos

Page 21: Computational Biology, Part 4 Protein Coding Regions

Decision trees

■ so we build a decision tree:

[Figure: feature space with splits marked at the values 50 and 40]

Slide courtesy of Christos Faloutsos

Page 22: Computational Biology, Part 4 Protein Coding Regions

Decision trees

■ so we build a decision tree:

Slide courtesy of Christos Faloutsos

Page 23: Computational Biology, Part 4 Protein Coding Regions

Decision trees

■ Goal: split address space in (almost) homogeneous regions

[Figure: decision tree with Y/N tests "area<50" and "round. <40" and leaves labeled + and -, shown next to the scatter plot (axes: ‘area’ and round.) divided at area = 50 and round. = 40]

Slide courtesy of Christos Faloutsos

Page 24: Computational Biology, Part 4 Protein Coding Regions

Support vector machines

■ Again we want to label ‘?’

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: Feature #1 (e.g., ‘area’) and Feature #2 (e.g., roundness)]

Slide courtesy of Christos Faloutsos

Page 25: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ Use single linear separator??

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: area and round.]

Slide courtesy of Christos Faloutsos

Page 26: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ Use single linear separator??

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: area and round.]

Slide courtesy of Christos Faloutsos

Page 27: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ Use single linear separator??

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: area and round.]

Slide courtesy of Christos Faloutsos

Page 28: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ Use single linear separator??

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: area and round.]

Slide courtesy of Christos Faloutsos

Page 29: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ Use single linear separator??

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: area and round.]

Slide courtesy of Christos Faloutsos

Page 30: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ we want to label ‘?’ - linear separator??
■ A: the one with the widest corridor!

[Figure: scatter plot of + and - training examples and the unlabeled point ‘?’; axes: area and round.]

Slide courtesy of Christos Faloutsos
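The "widest corridor" idea can be made concrete by measuring, for a candidate line, the distance to the closest training point on either side. The sketch below only compares a few hand-picked candidate lines; it is not how an SVM is trained (a real SVM solves an optimization problem over all possible separators). The points, the candidate lines, and the function name `margin` are all hypothetical.

```python
import math

def margin(w, b, points):
    """Smallest signed distance from any labeled point to the line w.x + b = 0.

    Labels are +1 or -1. A negative result means at least one point
    lies on the wrong side of the line.
    """
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in points)

# Hypothetical training points: ((feature 1, feature 2), label).
points = [((3, 3), +1), ((4, 4), +1), ((1, 1), -1), ((0, 0), -1)]

# Hypothetical candidate separators, each given as (w, b).
candidates = {
    "diagonal": ((1, 1), -4),      # x1 + x2 = 4
    "vertical": ((1, 0), -2),      # x1 = 2
    "horizontal": ((0, 1), -2.5),  # x2 = 2.5
}

# The preferred separator is the one whose closest point is farthest away.
best = max(candidates, key=lambda name: margin(*candidates[name], points))
print(best)  # prints diagonal
```

All three candidates separate the two classes, but the diagonal line leaves the widest corridor (half-width of about 1.41, against 1.0 and 0.5 for the other two).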

Page 31: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ What if the points for each class are not readily separated by a straight line?
■ Use the “kernel trick” – project the points into a higher dimensional space in which we hope that straight lines will separate the classes
■ “kernel” refers to the function used for this projection
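A small sketch of the projection idea, under assumptions chosen for illustration: the data are one-dimensional, the projection is x -> (x, x^2), and the names `phi`, `kernel`, and `predict` are introduced here. In the usual formulation the kernel function returns the inner product of two projected points, so the projection never has to be computed explicitly; the code checks that identity for this particular projection.

```python
def phi(x):
    """Project a 1-D point into 2-D: x -> (x, x^2)."""
    return (x, x * x)

def kernel(x, z):
    """Inner product of phi(x) and phi(z), computed without building phi."""
    return x * z + (x * z) ** 2

# Hypothetical 1-D data: "-" near zero, "+" on both sides.
# No single threshold on x separates the two classes.
points = [(-3, "+"), (-2, "+"), (2, "+"), (3, "+"), (-1, "-"), (0, "-"), (1, "-")]

def predict(x):
    # After projection, the straight line "second coordinate = 2.5" separates them.
    return "+" if phi(x)[1] > 2.5 else "-"

print(all(predict(x) == label for x, label in points))  # prints True

# The kernel agrees with the explicit inner product in the projected space.
a, b = phi(2), phi(3)
print(kernel(2, 3) == a[0] * b[0] + a[1] * b[1])  # prints True
```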

Page 32: Computational Biology, Part 4 Protein Coding Regions

Support Vector Machines (SVMs)

■ Definition of SVMs explicitly considers only two classes
■ What if we have more than two classes?
■ Train multiple SVMs
■ Two basic approaches
  ◆ One against all (one SVM for each class)
  ◆ Pairwise SVMs (one for each pair of classes)
    • Various ways of implementing this
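To illustrate how the two approaches differ in the number of binary classifiers, the sketch below enumerates the training tasks for each and combines pairwise decisions by simple voting. Voting is only one of the possible implementations, and the `decide` function is a hypothetical stand-in for a trained two-class SVM.

```python
from collections import Counter
from itertools import combinations

def one_against_all_tasks(classes):
    """One binary problem per class: that class versus all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def pairwise_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

def pairwise_vote(classes, decide, x):
    """Each pairwise classifier votes for one class; the most-voted class wins."""
    votes = Counter(decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

classes = ["A", "B", "C", "D"]
print(len(one_against_all_tasks(classes)))  # prints 4
print(len(pairwise_tasks(classes)))         # prints 6

# Hypothetical stand-in for a trained binary classifier:
# pick whichever class has the closer 1-D center.
centers = {"A": 0, "B": 5, "C": 10, "D": 15}

def decide(a, b, x):
    return a if abs(x - centers[a]) <= abs(x - centers[b]) else b

print(pairwise_vote(classes, decide, 4))  # prints B
```

For n classes, one against all needs n classifiers and the pairwise approach needs n(n-1)/2.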

Page 33: Computational Biology, Part 4 Protein Coding Regions

Cross-Validation

■ If we train a classifier to minimize error on a set of data, we have no way to estimate the (generalization) error that will be seen on a new dataset

■ To calculate generalizable accuracy, we use n-fold cross-validation

■ Divide images into n sets, train using n-1 of them and test on the remaining set

■ Repeat until each set is used as test set and average results across all trials

■ Variation on this is called leave-one-out (each set contains a single example)
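The splitting and averaging steps can be sketched as follows. The function names are introduced here for illustration, the folds are formed without shuffling (in practice the examples are usually shuffled first), and `train_and_test` is a hypothetical callback that trains on one index list and returns the accuracy measured on the other.

```python
def n_fold_splits(num_items, n):
    """Yield (train_indices, test_indices) once for each of the n folds."""
    indices = list(range(num_items))
    # Fold i receives items i, i+n, i+2n, ... so fold sizes differ by at most one.
    folds = [indices[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [idx for j in range(n) if j != i for idx in folds[j]]
        yield train, test

def cross_validate(num_items, n, train_and_test):
    """Average the score returned by train_and_test(train, test) over all folds."""
    scores = [train_and_test(train, test)
              for train, test in n_fold_splits(num_items, n)]
    return sum(scores) / len(scores)

for train, test in n_fold_splits(10, 5):
    print(test, train)

# Leave-one-out is the special case n == num_items.
leave_one_out = list(n_fold_splits(4, 4))
```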

Page 34: Computational Biology, Part 4 Protein Coding Regions

Describing classifier errors

■ For binary classifiers (positive or negative), define
  ◆ TP = true positives, FP = false positives
  ◆ TN = true negatives, FN = false negatives
  ◆ Recall = TP / (TP + FN)
  ◆ Precision = TP / (TP + FP)
  ◆ F-measure = 2*Recall*Precision/(Recall + Precision)
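A direct transcription of these formulas into Python, with made-up counts for illustration. The function name is introduced here; note that TN does not appear in any of the three measures.

```python
def binary_metrics(tp, fp, fn):
    """Return (recall, precision, f_measure) from counts of TP, FP and FN."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
recall, precision, f_measure = binary_metrics(tp=8, fp=2, fn=8)
print(recall, precision, round(f_measure, 3))  # prints 0.5 0.8 0.615
```

The sketch assumes at least one true positive; with TP = 0 the denominators can be zero and the measures are undefined.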

Page 35: Computational Biology, Part 4 Protein Coding Regions

Confusion matrix - binary

True \ Predicted    Positive          Negative
Positive            True Positive     False Negative
Negative            False Positive    True Negative

Page 36: Computational Biology, Part 4 Protein Coding Regions

Precision-recall analysis

Vary classifier parameter to “loosen” some performance estimate: i.e., confidence

Ideal performance

Page 37: Computational Biology, Part 4 Protein Coding Regions

Describing classifier errors

■ For multi-class classifiers, typically report
  ◆ Accuracy = (# test images correctly classified) / (# test images)
  ◆ Confusion matrix = table showing all possible combinations of true class and predicted class

Page 38: Computational Biology, Part 4 Protein Coding Regions

Confusion matrix – multi-class

Overall accuracy = 98%

Rows: True Class. Columns: Output of the Classifier.

       DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA     98    2    0    0    0    0    0    0    0    0
ER       0  100    0    0    0    0    0    0    0    0
Gia      0    0  100    0    0    0    0    0    0    0
Gpp      0    0    0   96    4    0    0    0    0    0
Lam      0    0    0    4   95    0    0    0    0    2
Mit      0    0    2    0    0   96    0    2    0    0
Nuc      0    0    0    0    0    0  100    0    0    0
Act      0    0    0    0    0    0    0  100    0    0
TfR      0    0    0    0    2    0    0    0   96    2
Tub      0    2    0    0    0    0    0    0    0   98
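The overall accuracy can be checked from the matrix itself. The sketch below assumes that each row is a percentage of that class's test images and that every class has the same number of test images, so the overall accuracy is the mean of the diagonal; the slide does not state either assumption.

```python
classes = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]

# Rows are the true class, columns the classifier output (values from the table).
confusion = [
    [98, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 100, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 100, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 96, 4, 0, 0, 0, 0, 0],
    [0, 0, 0, 4, 95, 0, 0, 0, 0, 2],
    [0, 0, 2, 0, 0, 96, 0, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 100, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 100, 0, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 96, 2],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 98],
]

# Per-class accuracy is the diagonal entry of each row.
per_class = {name: confusion[i][i] for i, name in enumerate(classes)}

# With equal class sizes (an assumption), overall accuracy is the diagonal mean.
overall = sum(per_class.values()) / len(classes)
print(overall)  # prints 97.9

# The largest off-diagonal entries show which classes are confused most often.
errors = [(confusion[i][j], classes[i], classes[j])
          for i in range(len(classes)) for j in range(len(classes))
          if i != j and confusion[i][j] > 0]
print(sorted(errors, reverse=True)[:2])
```

The mean of 97.9 rounds to the 98% quoted on the slide.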

Page 39: Computational Biology, Part 4 Protein Coding Regions

Ground truth

■ What is the source and confidence of a class label?

■ Most common: Human assignment, unknown confidence

■ Preferred: Assignment by experimental design, confidence ~100%

Page 40: Computational Biology, Part 4 Protein Coding Regions

Stating Goals vs. Approaches

■ Temptation when first considering using a machine learning approach to a biological problem is to describe the problem as automating the approach that you would use to solve the problem

■ “I need a program to predict how much a gene is expressed by measuring how well its promoter matches a template”

Page 41: Computational Biology, Part 4 Protein Coding Regions

Stating Goals vs. Approaches

■ “I need a program that given a gene sequence predicts how much that gene is expressed by measuring how well its promoter matches a template”

■ “I need a program that given a gene sequence predicts how much that gene is expressed by learning from sequences of genes whose expression is known”

Page 42: Computational Biology, Part 4 Protein Coding Regions

Resources

■ Association for the Advancement of Artificial Intelligence
  ◆ http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearning

■ Machine Learning – Mitchell, Carnegie Mellon
  ◆ http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html

■ Practical Machine Learning – Jordan, UC Berkeley
  ◆ http://www.cs.berkeley.edu/~asimma/294-fall06/

■ Learning and Empirical Inference – Rish, Tesauro, Jebara, Vapnik – Columbia
  ◆ http://www1.cs.columbia.edu/~jebara/6998/

Page 43: Computational Biology, Part 4 Protein Coding Regions
Page 44: Computational Biology, Part 4 Protein Coding Regions

Goals for sequence families

■ Find previously unrecognized members of a family

■ Develop a model of a family

Page 45: Computational Biology, Part 4 Protein Coding Regions

Possible Approaches

■ Model-based
  ◆ Motif-based (MEME/MAST)
  ◆ Hidden Markov model-based (HMMER)
  ◆ Cobbling
  ◆ Cobbled FPS

■ Non-model-based
  ◆ Family Pairwise Search (FPS)

Page 46: Computational Biology, Part 4 Protein Coding Regions

PSSMs

■ Motifs can be summarized and searched for using Position-Specific Scoring Matrices (PSSMs)

■ Calculated from a multiple alignment of a conserved region for members of a family
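As a rough sketch of how a PSSM might be calculated from an aligned conserved region and then used for searching, the code below builds log-odds scores column by column and slides the matrix along a sequence. The DNA alphabet, the uniform background frequencies, the pseudocount of 1, and the toy alignment are all assumptions made for illustration; real tools use more careful background models and sequence weighting.

```python
import math

def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Log-odds score for each letter at each column of a gap-free alignment."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            letter: math.log2((column.count(letter) + pseudocount) / total / background)
            for letter in alphabet
        })
    return pssm

def score(pssm, window):
    """Sum the column scores for the letters in one window of the sequence."""
    return sum(column[letter] for column, letter in zip(pssm, window))

def best_match(pssm, sequence):
    """Start position of the highest-scoring window in `sequence`."""
    width = len(pssm)
    return max(range(len(sequence) - width + 1),
               key=lambda start: score(pssm, sequence[start:start + width]))

# Hypothetical aligned conserved region from four family members.
alignment = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pssm = build_pssm(alignment)
print(best_match(pssm, "GGCGTATAATGCC"))  # prints 4
```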

Page 47: Computational Biology, Part 4 Protein Coding Regions

Learning PSSMs

■ However, unsupervised learning methods can be used to find motifs in unaligned sequences

■ Best characterized algorithm is MEME
  ◆ T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83

■ Program for searching with MEME-generated PSSM is called MAST

Page 48: Computational Biology, Part 4 Protein Coding Regions

Position Specific Iterated BLAST (PSI-BLAST)

■ Use PSSMs, instead of a similarity matrix, to score matches between query and database
■ Iterate: use a multiple alignment of high scoring sequences found in each search round to generate a new PSSM for use in the next round of searching
■ Search until no new sequences are found, or the user specified maximum number of iterations is reached, whichever comes first
■ Very similar to MEME/MAST

Page 49: Computational Biology, Part 4 Protein Coding Regions

Problems with PSSMs

■ Some families are characterized by two or more “sub”-motifs with variable spacing between them
■ Deciding upon motif boundaries difficult
■ Possible information in intervening sequences lost if only motifs are used

Page 50: Computational Biology, Part 4 Protein Coding Regions

Cobbling

■ Pick “most representative” protein sequence from a family

■ Convert it to a profile by replacing each amino acid by the corresponding column from a similarity matrix

Page 51: Computational Biology, Part 4 Protein Coding Regions

Cobbling

■ For each recognized “motif” in the family, replace the corresponding section of the profile with the profile of the motif
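The two cobbling steps (sequence to profile, then embedding the motif profiles) can be sketched with plain lists. Everything concrete here is hypothetical: a two-letter alphabet stands in for the 20 amino acids, the similarity matrix and motif scores are invented, and the function names are introduced for illustration only.

```python
def sequence_to_profile(sequence, similarity):
    """One profile column per residue: that residue's column of the similarity matrix."""
    return [dict(similarity[residue]) for residue in sequence]

def cobble(sequence, similarity, motifs):
    """Embed motif profiles into the profile of the representative sequence.

    `motifs` is a list of (start, motif_profile) pairs, where each
    motif_profile is a list of score columns covering positions
    start .. start + len(motif_profile) - 1 of the sequence.
    """
    profile = sequence_to_profile(sequence, similarity)
    for start, motif_profile in motifs:
        profile[start:start + len(motif_profile)] = motif_profile
    return profile

# Toy two-letter alphabet and similarity matrix.
similarity = {"A": {"A": 2, "B": -1}, "B": {"A": -1, "B": 2}}

# One hypothetical motif covering positions 1 and 2 of the representative sequence.
motif = [{"A": -3, "B": 4}, {"A": -2, "B": 3}]

profile = cobble("ABBA", similarity, [(1, motif)])
for column in profile:
    print(column)
```

Positions outside the motif keep the similarity-matrix scores of the representative sequence.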

Page 52: Computational Biology, Part 4 Protein Coding Regions

Cobbling

■ Advantage: At least some sequence information between motifs is retained.

■ S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705

Page 53: Computational Biology, Part 4 Protein Coding Regions

Cobbler Illustration

scores from profiles of conserved motifs

similarity scores for sequence from “most representative” family member

sequence of “most representative” family member

Page 54: Computational Biology, Part 4 Protein Coding Regions

Family Pairwise Search

■ For all known members of family, calculate (pairwise) homology to each sequence in database (using BLAST) and sum those scores
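The summing step can be sketched as below. The scoring function is a deliberately crude stand-in for BLAST (it counts identical positions in two equal-length strings), and the sequences and names are hypothetical; only the structure, one score per family member per database sequence and then a sum, follows the slide.

```python
def toy_score(a, b):
    """Stand-in for a BLAST score: number of identical aligned positions."""
    return sum(x == y for x, y in zip(a, b))

def family_pairwise_search(family, database, score=toy_score):
    """Rank database sequences by their summed score against all family members."""
    totals = {name: sum(score(member, sequence) for member in family)
              for name, sequence in database.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical family members and database sequences.
family = ["ACDEFG", "ACDEYG", "ACNEFG"]
database = {"seq1": "ACDEFG", "seq2": "WWWWWW", "seq3": "ACDKFG"}

print(family_pairwise_search(family, database))
# prints [('seq1', 16), ('seq3', 13), ('seq2', 0)]
```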

Page 55: Computational Biology, Part 4 Protein Coding Regions

Family Pairwise Search

■ Does not generate a model of the motif
■ Analogous to k nearest neighbor classification

Page 56: Computational Biology, Part 4 Protein Coding Regions

Which method is best?

■ Compare
  ◆ BLAST using randomly chosen family member
  ◆ BLAST FPS
  ◆ MAST (uses MEME to build PSSM)
  ◆ HMMER

■ W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492

Page 57: Computational Biology, Part 4 Protein Coding Regions

Comparison Protocol

■ For each method
  ◆ For each known protein family
    ✦ Train with family members
    ✦ Search database for matches
    ✦ Rank by score from search
    ✦ Determine how many known family members are ranked highly

Page 58: Computational Biology, Part 4 Protein Coding Regions

Evaluation

[Figure: flow diagram with elements labeled Known Family Members, Training Sequences, Testing Sequences, All other Sequences, Training, Model, Searching, and Ranked List of Matches]

Page 59: Computational Biology, Part 4 Protein Coding Regions

Evaluation metric - ROC

■ Define ROCn
  ✦ ROCn is the fraction of true positives detected at a threshold giving n false positives

Page 60: Computational Biology, Part 4 Protein Coding Regions

Example of Evaluation for ROC2

■ Assume 8 Known Family Members
■ If Ranked List of Matches is:
  ◆ Known Family Member 1
  ◆ Known Family Member 7
  ◆ Known Family Member 4
  ◆ Known Family Member 2
  ◆ Other Sequence 1
  ◆ Known Family Member 3
  ◆ Known Family Member 8
  ◆ Other Sequence 2
  ◆ Known Family Member 5
  ◆ Known Family Member 6

■ ROC2 score is 0.75 (six of the eight scored higher than the second highest scoring non-member)

Sequences up to 2 non-family members
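The ROC2 example can be reproduced with a short function that follows the definition given in these slides: count the known family members ranked above the n-th non-member, then divide by the number of known members. The function name and the True/False encoding of the ranked list are choices made here for illustration.

```python
def roc_n(ranked_is_member, num_members, n):
    """Fraction of known family members ranked above the n-th non-member."""
    found = 0
    false_positives = 0
    for is_member in ranked_is_member:
        if is_member:
            found += 1
        else:
            false_positives += 1
            if false_positives == n:
                break  # threshold reached: stop at the n-th false positive
    return found / num_members

# The ranked list from the example: True = known family member, False = other sequence.
ranked = [True, True, True, True, False, True, True, False, True, True]

print(roc_n(ranked, num_members=8, n=2))  # prints 0.75
```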

Page 61: Computational Biology, Part 4 Protein Coding Regions

Protocol for Comparison of Methods

■ Calculate ROC50 for each family
■ Average over all families
■ Bigger is better!

Page 62: Computational Biology, Part 4 Protein Coding Regions

Results

[Figure: results plots comparing BLAST FPS, HMMER, MAST, and BLAST]

Page 63: Computational Biology, Part 4 Protein Coding Regions

Conclusion

■ FPS better than single sequence BLAST
■ FPS better than model-based methods

Page 64: Computational Biology, Part 4 Protein Coding Regions

Comparison Protocol

■ Caution!
  ◆ True positive defined as being listed as a member of the family in the PROSITE compilation
  ◆ Some false positives could be actual family members that were missed during PROSITE compilation!
  ◆ (Should be minor effect)

Page 65: Computational Biology, Part 4 Protein Coding Regions

Which is best (part 2)?

■ Compare
  ◆ BLAST
  ◆ BLAST FPS
  ◆ cobbled BLAST
  ◆ cobbled BLAST FPS

■ W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470

Page 66: Computational Biology, Part 4 Protein Coding Regions

Comparison Protocol

■ Evaluation metric
  ◆ rank sum
    ✦ calculate difference in ROC50 for two methods for a given family
    ✦ sort by absolute value of difference
    ✦ sum ranks of families for which one method is better than the other
  ◆ Bigger is better!
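A sketch of the rank-sum steps listed above. The slide does not say how ranks are assigned or how ties are handled, so the code assumes rank 1 goes to the smallest absolute difference and that families with no difference are dropped; the ROC50 values and names are hypothetical.

```python
def rank_sums(roc_a, roc_b):
    """Return (rank sum for method A, rank sum for method B).

    roc_a and roc_b map each family name to that method's ROC50 score.
    """
    diffs = {family: roc_a[family] - roc_b[family] for family in roc_a}
    # Assumption: drop families with no difference, then rank by |difference|
    # so that rank 1 is the smallest difference.
    ordered = sorted((f for f in diffs if diffs[f] != 0), key=lambda f: abs(diffs[f]))
    sum_a = sum(rank for rank, f in enumerate(ordered, start=1) if diffs[f] > 0)
    sum_b = sum(rank for rank, f in enumerate(ordered, start=1) if diffs[f] < 0)
    return sum_a, sum_b

# Hypothetical ROC50 scores for two methods on four families.
roc_a = {"fam1": 0.9, "fam2": 0.5, "fam3": 0.8, "fam4": 0.7}
roc_b = {"fam1": 0.6, "fam2": 0.6, "fam3": 0.75, "fam4": 0.7}

print(rank_sums(roc_a, roc_b))  # prints (4, 2)
```

Here method A wins on the families ranked 1 and 3 and method B on the family ranked 2, so A has the larger rank sum.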

Page 67: Computational Biology, Part 4 Protein Coding Regions

Results

Page 68: Computational Biology, Part 4 Protein Coding Regions

Conclusions

■ For task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is best of the methods tested!

■ As number of known sequences in a family grows, HMM methods may do better