computational biology, part 4 protein coding regions

Computational Biology, Part 7Supervised Machine Learning and Searching for Sequence Families

Robert F. MurphyRobert F. Murphy

Copyright Copyright 2008-2009. 2008-2009.

All rights reserved.All rights reserved.

www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

What is Machine Learning?

■ Fundamental Question of Computer Fundamental Question of Computer Science: How can we build machines that Science: How can we build machines that solve problems, and which problems are solve problems, and which problems are inherently tractable/intractable?inherently tractable/intractable?

■ Fundamental Question of Statistics: What Fundamental Question of Statistics: What can be inferred from data plus a set of can be inferred from data plus a set of modeling assumptions, with what modeling assumptions, with what reliability?reliability?

Tom Mitchell white paper

Fundamental Question of Machine Learning

■ How can we build computer systems that How can we build computer systems that automatically improve with experience, and automatically improve with experience, and what are the fundamental laws that govern what are the fundamental laws that govern all learning processes?all learning processes?◆ Tom MitchellTom Mitchell


Why Machine Learning?

■ Learn relationships from large sets of complex Learn relationships from large sets of complex data: Data miningdata: Data mining◆ Predict clinical outcome from testsPredict clinical outcome from tests◆ Decide whether someone is a good credit riskDecide whether someone is a good credit risk

■ Do tasks too complex to program by handDo tasks too complex to program by hand◆ Autonomous drivingAutonomous driving

■ Customize programs to user needsCustomize programs to user needs◆ Recommend book/movie based on previous likesRecommend book/movie based on previous likes


Why Machine Learning?

■ Economically efficientEconomically efficient■ Can consider larger data spaces and Can consider larger data spaces and

hypothesis spaces than people canhypothesis spaces than people can■ Can formalize learning problem to Can formalize learning problem to

explicitly identify/describe goals and explicitly identify/describe goals and criteriacriteria

Successful Machine Learning Applications

■ Speech recognitionSpeech recognition◆ Telephone menu navigationTelephone menu navigation

■ Computer visionComputer vision◆ Mail sortingMail sorting

■ Bio-surveillanceBio-surveillance◆ Identifying disease outbreaksIdentifying disease outbreaks

■ Robot controlRobot control◆ Autonomous drivingAutonomous driving

■ Empirical scienceEmpirical science


Machine Learning Paradigms

■ Supervised LearningSupervised Learning◆ ClassificationClassification◆ RegressionRegression

■ Unsupervised LearningUnsupervised Learning◆ ClusteringClustering

■ Semi-supervised LearningSemi-supervised Learning◆ CotrainingCotraining◆ Active learningActive learning

Supervised Learning

■ ApproachesApproaches◆ Classification (discrete predictions)Classification (discrete predictions)◆ Regression (continuous predictions)Regression (continuous predictions)

■ Common considerationsCommon considerations◆ Representation (Features)Representation (Features)◆ Feature SelectionFeature Selection◆ Functional formFunctional form◆ Evaluation of predictive powerEvaluation of predictive power

Classification vs. Regression

■ If I want to predict whether a patient will If I want to predict whether a patient will die from a disease within six months, that is die from a disease within six months, that is classificationclassification

■ If I want to predict how long the patient will If I want to predict how long the patient will live, that is regressionlive, that is regression

Representation

■ Definition of thing or things to be predictedDefinition of thing or things to be predicted◆ Classification: Classification: classesclasses◆ Regression: Regression: regression variableregression variable

■ Definition of things (Definition of things (instancesinstances) to make ) to make predictions forpredictions for◆ IndividualsIndividuals◆ FamiliesFamilies◆ Neighborhoods, etc.Neighborhoods, etc.

■ Choice of descriptors (Choice of descriptors (featuresfeatures) to describe ) to describe different aspects of instancesdifferent aspects of instances

Formal description

■ Defining Defining XX as a set of as a set of instancesinstances x x described described by by featuresfeatures

■ Given training examples Given training examples D D from from XX■ Given a Given a target function ctarget function c that maps that maps X-X->{0,1}>{0,1}■ Given a Given a hypothesis space Hhypothesis space H■ Determine an hypothesis Determine an hypothesis hh in in HH such that such that

h(x)h(x)==c(x) c(x) for all for all xx in in DD

Courtesy Tom Mitchell

Inductive learning hypothesis

■ Any hypothesis found to approximate the Any hypothesis found to approximate the target function well over a sufficiently large target function well over a sufficiently large set of training examples will also set of training examples will also approximate the target function over other approximate the target function over other unobserved examplesunobserved examples

Courtesy Tom Mitchell

Hypothesis space

■ The hypothesis space determines the The hypothesis space determines the functional formfunctional form

■ It defines what are allowable rules/functions It defines what are allowable rules/functions for classificationfor classification

■ Each classification method uses a different Each classification method uses a different hypothesis spacehypothesis space

-+

???

Simple two class problem

Describe each image by featuresTrain classifier

k-Nearest Neighbor (kNN)

■ In feature space, training examples areIn feature space, training examples are


■ We want to label ‘We want to label ‘??’’

Feature #1 (e.g.., ‘area’)

Feature #2(e.g.., roundness)

+

-++ +

+

+

+

-

--

--

?


■ Find k nearest neighbors and voteFind k nearest neighbors and vote

for k=3,

nearest neighbors are

So we label it +

Linear Discriminants

■ Fit multivariate Gaussian to each classFit multivariate Gaussian to each class■ Measure distance from ? to each GaussianMeasure distance from ? to each Gaussian

area

bright.

+

-

+

+

+

+

-

--

--

?

Decision trees

■ Again we want to label ‘Again we want to label ‘??’’

Slide courtesy of Christos Faloutsos

Decision trees

■ so we build a decision tree:so we build a decision tree:

50

40


Decision trees

■ so we build a decision tree:so we build a decision tree:


Decision trees

■ Goal: split address space in (almost) Goal: split address space in (almost) homogeneous regionshomogeneous regions

area<50

Y

+round. <40

N

-...

Y N

‘area’

round.

+

-++ +

++

+

-

--

--

?

50

40


Support vector machines

■ Again we want to label ‘Again we want to label ‘??’’

Feature #1 (e.g.., ‘area’)

Feature #2(e.g.., roundness)

+

-++ +

+

+

+

-

--

--

?


Support Vector Machines (SVMs)

■ Use single linear separator??Use single linear separator??

area

round.

+

-

+

+

+

+

-

--

--

?


Support Vector Machines (SVMs)■ Use single linear separator??Use single linear separator??

area

round.

+

-

+

+

+

+

-

--

--

?


Support Vector Machines (SVMs)■ Use single linear separator??Use single linear separator??

+

-

+

+

+

+

-

--

--

?

area

round.


Support Vector Machines (SVMs)■ we want to label ‘we want to label ‘??’ - linear separator??’ - linear separator??■ A: the one with the widest corridor!A: the one with the widest corridor!

area

round.

+

-+

++

+

-

--

--

?


Support Vector Machines (SVMs)■ What if the points for each class are not What if the points for each class are not

readily separated by a straight line?readily separated by a straight line?■ Use the “kernel trick” – project the points Use the “kernel trick” – project the points

into a higher dimensional space in which we into a higher dimensional space in which we hope that straight lines will separate the hope that straight lines will separate the classesclasses

■ ““kernel” refers to the function used for this kernel” refers to the function used for this projection projection

Support Vector Machines (SVMs)■ Definition of SVMs explicitly considers Definition of SVMs explicitly considers

only two classesonly two classes■ What if we have more than two classes?What if we have more than two classes?■ Train multiple SVMsTrain multiple SVMs■ Two basic approachesTwo basic approaches

◆ One against all (one SVM for each class)One against all (one SVM for each class)◆ Pairwise SVMs (one for each pair of classes)Pairwise SVMs (one for each pair of classes)

• Various ways of implementing thisVarious ways of implementing this

Cross-Validation

■ If we train a classifier to minimize error on a set of If we train a classifier to minimize error on a set of data, have no ability to estimate (generalize) error data, have no ability to estimate (generalize) error that will be seen on new datasetthat will be seen on new dataset

■ To calculate To calculate generalizablegeneralizable accuracy, we use accuracy, we use n-n-foldfold cross-validationcross-validation

■ Divide images into Divide images into nn sets, train using sets, train using nn-1 of them -1 of them and test on the remaining setand test on the remaining set

■ Repeat until each set is used as test set and average Repeat until each set is used as test set and average results across all trialsresults across all trials

■ Variation on this is called Variation on this is called leave-one-outleave-one-out

Describing classifier errors

■ For binary classifiers (positive or negative), defineFor binary classifiers (positive or negative), define◆ TP = true positives, FP = false positivesTP = true positives, FP = false positives◆ TN = true negatives, FN = false negativesTN = true negatives, FN = false negatives◆ Recall = TP / (TP + FN)Recall = TP / (TP + FN)◆ Precision = TP / (TP + FP)Precision = TP / (TP + FP)◆ F-measure= 2*Recall*Precision/(Recall + Precision)F-measure= 2*Recall*Precision/(Recall + Precision)

Confusion matrix - binaryTrue \ Predicted Positive Negative

Positive True Positive False Negative

Negative False Positive True Negative

Precision-recall analysis

Vary classifier parameter to “loosen” some performance estimate: i.e., confidence

Ideal performance

Describing classifier errors

■ For multi-class classifiers, typically reportFor multi-class classifiers, typically report◆ Accuracy = Accuracy = # test images correctly classified# test images correctly classified

# test images# test images◆ Confusion matrix = table showing all possible Confusion matrix = table showing all possible

combinations of true class and predicted classcombinations of true class and predicted class

Confusion matrix – multi-class

Overall accuracy = 98%

True Class

Output of the Classifier

DNA ER Gia Gpp Lam Mit Nuc Act TfR Tub

DNA 98 2 0 0 0 0 0 0 0 0

ER 0 100 0 0 0 0 0 0 0 0

Gia 0 0 100 0 0 0 0 0 0 0

Gpp 0 0 0 96 4 0 0 0 0 0

Lam 0 0 0 4 95 0 0 0 0 2

Mit 0 0 2 0 0 96 0 2 0 0

Nuc 0 0 0 0 0 0 100 0 0 0

Act 0 0 0 0 0 0 0 100 0 0

TfR 0 0 0 0 2 0 0 0 96 2

Tub 0 2 0 0 0 0 0 0 0 98

Ground truth

■ What is the source and confidence of a class What is the source and confidence of a class label?label?

■ Most common: Human assignment, Most common: Human assignment, unknown confidenceunknown confidence

■ Preferred: Assignment by experimental Preferred: Assignment by experimental design, confidence ~100%design, confidence ~100%

Stating Goals vs. Approaches

■ Temptation when first considering using a Temptation when first considering using a machine learning approach to a biological machine learning approach to a biological problem is to describe the problem as problem is to describe the problem as automating the approach that you would automating the approach that you would solve the problemsolve the problem

■ ““I need a program to predict how much a I need a program to predict how much a gene is expressed by measuring how well its gene is expressed by measuring how well its promoter matches a template”promoter matches a template”

Stating Goals vs. Approaches

■ ““I need a program that given a gene I need a program that given a gene sequence predicts how much that gene is sequence predicts how much that gene is expressed by measuring how well its expressed by measuring how well its promoter matches a template”promoter matches a template”

■ ““I need a program that given a gene I need a program that given a gene sequence predicts how much that gene is sequence predicts how much that gene is expressed by learning from sequences of expressed by learning from sequences of genes whose expression is known”genes whose expression is known”

Resources

■ Association for the Advancement of Artificial Association for the Advancement of Artificial IntelligenceIntelligence◆ http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearninghttp://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearning

■ Machine Learning – Mitchell, Carnegie MellonMachine Learning – Mitchell, Carnegie Mellon

◆ http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.htmlhttp://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html■ Practical Machine Learning – Jordan, UC BerkeleyPractical Machine Learning – Jordan, UC Berkeley

◆ http://www.cs.berkeley.edu/~asimma/294-fall06/http://www.cs.berkeley.edu/~asimma/294-fall06/■ Learning and Empirical Inference – Rish, Tesauro, Learning and Empirical Inference – Rish, Tesauro,

Jebara, Vadpnik – Columbia Jebara, Vadpnik – Columbia ◆ http://www1.cs.columbia.edu/~jebara/6998/http://www1.cs.columbia.edu/~jebara/6998/

http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearning

http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html

http://www.cs.berkeley.edu/~asimma/294-fall06/

http://www1.cs.columbia.edu/~jebara/6998/

Goals for sequence families

■ Find previously unrecognized members of a Find previously unrecognized members of a familyfamily

■ Develop a model of a familyDevelop a model of a family

Possible Approaches

■ Model-basedModel-based◆ Motif-based (MEME/MAST)Motif-based (MEME/MAST)◆ Hidden Markov model-based (HMMER)Hidden Markov model-based (HMMER)◆ CobblingCobbling◆ Cobbled FPSCobbled FPS

■ Non-model-basedNon-model-based◆ Family Pairwise Search (FPS)Family Pairwise Search (FPS)

PSSMs

■ Motifs can be summarized and searched for Motifs can be summarized and searched for using using PPosition-osition-SSpecific pecific SScoring coring MMatricesatrices

■ Calculated from a multiple alignment of a Calculated from a multiple alignment of a conserved region for members of a familyconserved region for members of a family

Learning PSSMs

■ However, unsupervised learning methods can However, unsupervised learning methods can be used to find motifs in unaligned sequencesbe used to find motifs in unaligned sequences

■ Best characterized algorithm is MEMEBest characterized algorithm is MEME◆ T.L. Bailey & C. Elkan (1995) Unsupervised Learning of T.L. Bailey & C. Elkan (1995) Unsupervised Learning of

Multiple Motifs in Biopolymers Using Expectation Multiple Motifs in Biopolymers Using Expectation Maximization. Maximization. Machine Learning J. 21Machine Learning J. 21:51-83:51-83

■ Program for searching with MEME-Program for searching with MEME-generated PSSM is called MASTgenerated PSSM is called MAST

Position Specific Iterated BLAST (PSI-BLAST)■ Use PSSMs, instead of a similarity matrix, to score Use PSSMs, instead of a similarity matrix, to score

matches between query and databasematches between query and database■ Iterate: using a multiple alignment of high scoring Iterate: using a multiple alignment of high scoring

sequences found in each search round to generate sequences found in each search round to generate a new PSSM for use in the next round of searchinga new PSSM for use in the next round of searching

■ Search until no new sequences are found, or the Search until no new sequences are found, or the user specified maximum number of iterations is user specified maximum number of iterations is reached, whichever comes firstreached, whichever comes first

■ Very similar to MEME/MASTVery similar to MEME/MAST

Problems with PSSMs

■ Some families are characterized by two or Some families are characterized by two or more “sub”-motifs with variable spacing more “sub”-motifs with variable spacing between thembetween them

■ Deciding upon motif boundaries difficultDeciding upon motif boundaries difficult■ Possible information in intervening Possible information in intervening

sequences lost if only motifs are usedsequences lost if only motifs are used

Cobbling

■ Pick “most representative” protein sequence Pick “most representative” protein sequence from a familyfrom a family

■ Convert it to a profile by replacing each Convert it to a profile by replacing each amino acid by the corresponding column amino acid by the corresponding column from a similarity matrix from a similarity matrix

Cobbling

■ For each recognized “motif” in the family, For each recognized “motif” in the family, replace the corresponding section of the replace the corresponding section of the profile with the profile of the motifprofile with the profile of the motif

Cobbling

■ Advantage: At least some sequence Advantage: At least some sequence information between motifs is retained.information between motifs is retained.

■ S. Henikoff & J.G. Henikoff (1997) S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of Embedding strategies for effective use of information from multiple sequence information from multiple sequence alignments. alignments. Protein Science 6Protein Science 6:698-705:698-705

Cobbler Illustration

scores from profiles of conserved motifs

similarity scores for sequence from “most representative” family member

sequence of “most representative” family member

Family Pairwise Search

■ For all known members of family, calculate For all known members of family, calculate (pairwise) homology to each sequence in (pairwise) homology to each sequence in database (using BLAST) and sum those database (using BLAST) and sum those scoresscores

Family Pairwise Search

■ Does not generate a model of the motifDoes not generate a model of the motif■ Analogous to k nearest neighbor Analogous to k nearest neighbor

classificationclassification

Which method is best?

■ CompareCompare◆ BLAST using randomly chosen family memberBLAST using randomly chosen family member◆ BLAST FPSBLAST FPS◆ MAST (uses MEME to build PSSM)MAST (uses MEME to build PSSM)◆ HMMERHMMER

■ W.N. Gundy (1998) Homology Detection W.N. Gundy (1998) Homology Detection via Family Pairwise Search. via Family Pairwise Search. J. Comput. J. Comput. Biol. 5Biol. 5:479-492:479-492

Comparison Protocol

■ For each methodFor each method◆ For each known protein familyFor each known protein family

✦ Train with family membersTrain with family members✦ Search database for matchesSearch database for matches✦ Rank by score from searchRank by score from search✦ Determine how many known family Determine how many known family

members are ranked highlymembers are ranked highly

Evaluation

TrainingSequences

Training Model

TestingSequences

All otherSequences

Known FamilyMembers

Searching

Ranked Listof Matches

Evaluation metric - ROC

■ Define Define ROCROCnn

✦ ROCROCnn is the fraction of true positives detected at a is the fraction of true positives detected at a threshold giving threshold giving nn false positives false positives

Example of Evaluation for ROC2

■ Assume 8 Known Family MembersAssume 8 Known Family Members■ If Ranked List of Matches is:If Ranked List of Matches is:

◆ Known Family Member 1Known Family Member 1◆ Known Family Member 7Known Family Member 7◆ Known Family Member 4Known Family Member 4◆ Known Family Member 2Known Family Member 2◆ Other Sequence 1Other Sequence 1◆ Known Family Member 3Known Family Member 3◆ Known Family Member 8Known Family Member 8◆ Other Sequence 2Other Sequence 2◆ Known Family Member 5Known Family Member 5◆ Known Family Member 6Known Family Member 6

■ ROCROC22 score is 0.75 (six of the eight scored higher score is 0.75 (six of the eight scored higher than the second highest scoring non-member)than the second highest scoring non-member)

Sequences up to 2 non-family members

Protocol for Comparison of Methods ■ Calculate Calculate ROCROC5050 for each familyfor each family

■ Average over all familiesAverage over all families■ Bigger is better!Bigger is better!

Results

BLAST FPS

HMMER

MAST

BLAST FPS

MAST

HMMER

BLAST

Conclusion

■ FPS better than single sequence BLASTFPS better than single sequence BLAST■ FPS better than model-based methodsFPS better than model-based methods

Comparison Protocol

■ Caution!Caution!◆ True positive True positive defined as being listed as a defined as being listed as a

member of the family in the PROSITE member of the family in the PROSITE compilationcompilation

◆ Some Some false positivesfalse positives could be actual family could be actual family members that were missed during PROSITE members that were missed during PROSITE compilation!compilation!

◆ (Should be minor effect)(Should be minor effect)

Which is best (part 2)?

■ CompareCompare◆ BLASTBLAST◆ BLAST FPSBLAST FPS◆ cobbled BLASTcobbled BLAST◆ cobbled BLAST FPScobbled BLAST FPS

■ W.N. Grundy and T.L. Bailey (1999) W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded Family pairwise search with embedded motif models. motif models. Bioinformatics 15:Bioinformatics 15:463-470463-470

Comparison Protocol

■ Evaluation metricEvaluation metric◆ rank sumrank sum

✦ calculate difference in ROCcalculate difference in ROC5050 for two methods for a for two methods for a given familygiven family

✦ sort by absolute value of differencesort by absolute value of difference✦ sum ranks of families for which one method is better sum ranks of families for which one method is better

than the otherthan the other

◆ Bigger is better!Bigger is better!

Results

Conclusions

■ For task of finding members of a family For task of finding members of a family given a reasonable number of known given a reasonable number of known members of that family, cobbled FPS is best members of that family, cobbled FPS is best of the methods tested!of the methods tested!

■ As number of known sequences in a family As number of known sequences in a family grows, HMM methods may do bettergrows, HMM methods may do better

computational biology, part 4 protein coding regions

Documents