Support Vector Machine and String Kernels for Protein Classification


TRANSCRIPT

Page 1: Support Vector Machine  and String Kernels for  Protein Classification

Support Vector Machine and String Kernels for Protein Classification

Christina Leslie

Department of Computer Science, Columbia University

Page 2: Support Vector Machine  and String Kernels for  Protein Classification

Learning Sequence-based Protein Classification

Problem: classification of protein sequence data into families and superfamilies

Motivation: many proteins have been sequenced, but structure/function often remains unknown

Motivation: infer structure/function from sequence-based classification

Page 3: Support Vector Machine  and String Kernels for  Protein Classification

Sequence Data versus Structure and Function

>1A3N:A HEMOGLOBIN

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

VHASLDKFLASVSTVLTSKYR

>1A3N:B HEMOGLOBIN

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

EFTPPVQAAYQKVVAGVANALAHKYH

>1A3N:C HEMOGLOBIN

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

VHASLDKFLASVSTVLTSKYR

>1A3N:D HEMOGLOBIN

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

EFTPPVQAAYQKVVAGVANALAHKYH

Sequences for four chains of human hemoglobin

Tertiary Structure

Function: oxygen transport

Page 4: Support Vector Machine  and String Kernels for  Protein Classification

Structural Hierarchy

SCOP: Structural Classification of Proteins

Interested in superfamily-level homology – remote evolutionary relationship

Page 5: Support Vector Machine  and String Kernels for  Protein Classification

Learning Problem

Reduce to a binary classification problem: positive (+) if an example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise

Focus on remote homology detection

Use a supervised learning approach to train a classifier

Labeled Training Sequences → Learning Algorithm → Classification Rule

Page 6: Support Vector Machine  and String Kernels for  Protein Classification

Two Supervised Learning Approaches to Classification

Generative model approach:

Build a generative model for a single protein family; classify each candidate sequence based on its fit to the model

Uses only positive training sequences

Discriminative approach:

Learning algorithm tries to learn a decision boundary between positive and negative examples

Uses both positive and negative training sequences

Page 7: Support Vector Machine  and String Kernels for  Protein Classification

Hidden Markov Models for Protein Families

Standard generative model: profile HMM

Training data: multiple alignment of examples from the family

Columns of the alignment determine model topology

7LES_DROME LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKSLRK.......

ABL1_CAEEL IIMHNKLGGGQYGDVYEGYWK........RHDCTIAVKALK........

BFR2_HUMAN LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD.....A

TRKA_HUMAN IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVAVKALK........

Page 8: Support Vector Machine  and String Kernels for  Protein Classification

Profile HMMs for Protein Families

Match, insert and delete states

Observed variables: symbol sequence, x1 .. xL

Hidden variables: state sequence, π1 .. πL

Parameters: transition and emission probabilities, θ

Joint probability: P(x, π | θ)
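The joint probability P(x, π | θ) factors along the path as a product of transition and emission probabilities. A minimal sketch in log space, with a hypothetical two-state model whose states and probabilities are invented for illustration (not taken from the slides):

```python
import math

def hmm_log_joint(x, path, log_init, log_trans, log_emit):
    """Log joint probability log P(x, pi | theta) of an observed symbol
    sequence x and a hidden state path pi under HMM parameters theta."""
    lp = log_init[path[0]] + log_emit[path[0]][x[0]]
    for t in range(1, len(x)):
        lp += log_trans[path[t - 1]][path[t]] + log_emit[path[t]][x[t]]
    return lp

# Toy two-state model (match "M" / insert "I"); all numbers are made up.
log = math.log
init = {"M": log(0.6), "I": log(0.4)}
trans = {"M": {"M": log(0.7), "I": log(0.3)},
         "I": {"M": log(0.5), "I": log(0.5)}}
emit = {"M": {"A": log(0.8), "K": log(0.2)},
        "I": {"A": log(0.3), "K": log(0.7)}}

lp = hmm_log_joint("AK", ["M", "I"], init, trans, emit)
# P(x, pi | theta) = 0.6 * 0.8 * 0.3 * 0.7
```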

Page 9: Support Vector Machine  and String Kernels for  Protein Classification

HMMs: Pros and Cons

Ladies and gentlemen, boys and girls:

Let us leave something for next week…

Page 10: Support Vector Machine  and String Kernels for  Protein Classification

Discriminative Learning

Discriminative approach: train on both positive and negative examples to learn a classifier

Modern computational learning theory

Goal: learn a classifier that generalizes well to new examples

Do not use training data to estimate parameters of a probability distribution – the “curse of dimensionality”

Page 11: Support Vector Machine  and String Kernels for  Protein Classification

Learning Theoretic Formalism for the Classification Problem

Training and test data drawn i.i.d. from a fixed but unknown probability distribution D on X × {-1, 1}

Labeled training set S = {(x1, y1), … , (xm, ym)}

Page 12: Support Vector Machine  and String Kernels for  Protein Classification

Support Vector Machines (SVMs)

We use SVM as discriminative learning algorithm

[Figure: positive (+) and negative (−) training examples in feature space, separated by a linear decision boundary with margin]

Training examples are mapped to a (usually high-dimensional) feature space by a feature map F(x) = (F1(x), … , Fd(x))

Learn a linear decision boundary: trade off maximizing the geometric margin of the training data against minimizing margin violations

Page 13: Support Vector Machine  and String Kernels for  Protein Classification

SVM Classifiers

Linear classifier defined in feature space by f(x) = ⟨w, x⟩ + b

SVM solution gives w = Σi αi yi xi as a linear combination of support vectors, a subset of the training vectors

[Figure: separating hyperplane with normal vector w and bias b; support vectors of both classes lie on the margin]
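Via the kernel trick introduced later in the deck, this classifier can also be evaluated as f(x) = Σi αi yi K(xi, x) + b without forming w explicitly. A minimal sketch with toy one-dimensional data and a linear kernel (all numbers illustrative, not from the slides):

```python
def svm_decision(x, support, alphas, labels, b, kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over support vectors."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support, alphas, labels)) + b

# Toy 1-D example with a linear kernel K(u, v) = u * v.
linear = lambda u, v: u * v
support, alphas, labels = [1.0, -1.0], [0.5, 0.5], [+1, -1]
f = svm_decision(2.0, support, alphas, labels, b=0.0, kernel=linear)
# 0.5*(+1)*(1*2) + 0.5*(-1)*((-1)*2) = 2.0, so the predicted label is sign(f) = +1
```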

Page 14: Support Vector Machine  and String Kernels for  Protein Classification

Advantages of SVMs

Large margin classifier: leads to good generalization (performance on test sets)

Sparse classifier: depends only on support vectors; leads to fast classification and good generalization

Kernel method: as we’ll see, we can introduce sequence-based kernel functions for use with SVMs

Page 15: Support Vector Machine  and String Kernels for  Protein Classification

Hard Margin SVM

Assume training data is linearly separable in feature space

Space of linear classifiers fw,b(x) = ⟨w, x⟩ + b, giving decision rule hw,b(x) = sign(fw,b(x))

If |w| = 1, the geometric margin of the training data for hw,b is γS = minS yi (⟨w, xi⟩ + b)

[Figure: positive and negative examples with the maximal-margin hyperplane defined by w and b]

Page 16: Support Vector Machine  and String Kernels for  Protein Classification

Hard Margin Optimization

Hard margin SVM optimization: given training data S, find the linear classifier hw,b with maximal geometric margin γS

Convex quadratic dual optimization problem

Sparse classifier in terms of support vectors


Page 17: Support Vector Machine  and String Kernels for  Protein Classification

Hard Margin Generalization Error Bounds

Theorem [Cristianini, Shawe-Taylor]: Fix a real value M > 0. For any probability distribution D on X × {-1, 1} with support in a ball of radius R around the origin, with probability 1 − δ over m random samples S, any linear hypothesis h with geometric margin γS ≥ M on S has error no more than ErrD(h) ≤ ε(m, δ, M, R), provided that m is big enough.

Page 18: Support Vector Machine  and String Kernels for  Protein Classification

SVMs for Protein Classification

Want to define a feature map from the space of protein sequences to a vector space

Goals:

Computational efficiency

Competitive performance with known methods

No reliance on a generative model – a general method for sequence-based classification problems

Page 19: Support Vector Machine  and String Kernels for  Protein Classification

Spectrum Feature Map for SVM Protein Classification

New feature map based on spectrum of a sequence

1. C. Leslie, E. Eskin, and W. Noble, The Spectrum Kernel: A String Kernel for SVM Protein Classification. Pacific Symposium on Biocomputing, 2002.

2. C. Leslie, E. Eskin, J. Weston and W. Noble, Mismatch String Kernels for SVM Protein Classification. NIPS 2002.

Page 20: Support Vector Machine  and String Kernels for  Protein Classification

The k-Spectrum of a Sequence

Feature map for SVM based on the spectrum of a sequence

The k-spectrum of a sequence is the set of all k-length contiguous subsequences that it contains

Feature map is indexed by all possible k-length subsequences (“k-mers”) from the alphabet of amino acids

Dimension of feature space = 20^k

Generalizes to any sequence data

Example (k = 3): AKQDYYYYEI → AKQ KQD QDY DYY YYY YYY YYE YEI

Page 21: Support Vector Machine  and String Kernels for  Protein Classification

k-Spectrum Feature Map

Feature map for k-spectrum with no mismatches: for sequence x, F(k)(x) = (Ft(x)){k-mers t}, where Ft(x) = # occurrences of t in x

Example (k = 3): AKQDYYYYEI → ( 0 , 0 , … , 1 , … , 1 , … , 2 ), indexed by AAA AAC … AKQ … DYY … YYY
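The feature map Ft(x) is just a sparse vector of k-mer counts; a minimal sketch (standard Python, not the authors' implementation):

```python
from collections import Counter

def k_spectrum(seq, k):
    """Sparse k-spectrum feature map: k-mer -> number of occurrences in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

spec = k_spectrum("AKQDYYYYEI", 3)
# F_AKQ = 1, F_YYY = 2; any k-mer absent from the sequence has count 0
```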

Page 22: Support Vector Machine  and String Kernels for  Protein Classification

(k,m)-Mismatch Feature Map

Feature map for k-spectrum, allowing m mismatches: if s is a k-mer, F(k,m)(s) = (Ft(s)){k-mers t}, where Ft(s) = 1 if s is within m mismatches of t, 0 otherwise

Extend additively to longer sequences x by summing over all k-mers s in x

Example: the k-mer AKQ contributes to features such as DKQ, EKQ, AAQ, AKY, …
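Under this map a k-mer s turns on every feature t in its m-mismatch neighborhood. A brute-force sketch that materializes that neighborhood (fine for small k and m; the trie traversal on the following slides avoids enumerating it explicitly):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mismatch_neighborhood(kmer, m):
    """All k-mers within Hamming distance m of kmer, i.e. the t with F_t(s) = 1."""
    neighbors = {kmer}
    for _ in range(m):
        expanded = set(neighbors)
        for s in neighbors:
            for i in range(len(s)):            # substitute each position
                for a in AMINO_ACIDS:          # with each amino acid
                    expanded.add(s[:i] + a + s[i + 1:])
        neighbors = expanded
    return neighbors

nbrs = mismatch_neighborhood("AKQ", 1)
# |neighborhood| = 1 + 3 * 19 = 58; includes DKQ, EKQ, AAQ, AKY, ...
```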

Page 23: Support Vector Machine  and String Kernels for  Protein Classification

The Kernel Trick

To train an SVM, can use a kernel rather than an explicit feature map

For sequences x, y and feature map F, the kernel value is the inner product in feature space: K(x, y) = ⟨F(x), F(y)⟩

Gives a sequence similarity score

Example of a string kernel

Can be efficiently computed via traversal of a trie data structure
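For m = 0 the kernel can also be computed directly as a sparse dot product of the two spectra; a minimal sketch, using the example sequences that appear on the later trie-traversal slides:

```python
from collections import Counter

def spectrum_kernel(x, y, k):
    """K(x, y) = <F(x), F(y)> for the exact-match k-spectrum feature map."""
    fx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    fy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Sum over k-mers present in x; absent k-mers contribute 0.
    return sum(count * fy[t] for t, count in fx.items())

kval = spectrum_kernel("EADLALGKAVF", "ADLALGADQVFNG", 3)
# shared 3-mers ADL, DLA, LAL, ALG each contribute 1*1, so K = 4
```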

Page 24: Support Vector Machine  and String Kernels for  Protein Classification

Computing the (k,m)-Spectrum Kernel

Use a trie (retrieval tree) to organize lexical traversal of all instances of k-length patterns (with mismatches) in the training data

Each path down to a leaf in the trie corresponds to a coordinate in the feature map

Kernel values for all training sequences are updated at each leaf node

If m = 0, traversal time for the trie is linear in the size of the training data

Traversal time grows exponentially with m, but usually small values of m are useful

Depth-first traversal makes efficient use of memory

Page 25: Support Vector Machine  and String Kernels for  Protein Classification

Example: Traversing the Mismatch Tree

Traversal for input sequence: AVLALKAVLL, k=8, m=1



Page 30: Support Vector Machine  and String Kernels for  Protein Classification

Example: Computing the Kernel for Pair of Sequences

Traversal of trie for k=3 (m=0), following the path A → D → L

s1: EADLALGKAVF

s2: ADLALGADQVFNG

Update kernel value for K(s1, s2) by adding the contribution for feature ADL

Page 31: Support Vector Machine  and String Kernels for  Protein Classification

Fast Prediction

SVM training determines the subset of training sequences corresponding to support vectors and their weights: (xi, αi), i = 1 .. r

Prediction with no mismatches: represent the SVM classifier by a hash table mapping support k-mers to weights; test sequences can be classified in linear time via look-up of k-mers

Prediction with mismatches: represent the classifier as a sparse trie; traverse k-mer paths occurring with mismatches in the test sequence
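The no-mismatch prediction scheme can be sketched as follows; the coefficients stand for the products αi yi, and the support sequence is invented for illustration:

```python
from collections import Counter, defaultdict

def build_score_table(support_seqs, coeffs, k):
    """Hash table mapping each support k-mer t to sum_i coeff_i * F_t(x_i)."""
    table = defaultdict(float)
    for seq, c in zip(support_seqs, coeffs):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        for t, n in counts.items():
            table[t] += c * n
    return dict(table)

def predict(seq, table, b, k):
    """Linear-time classification: look up each k-mer of the test sequence."""
    return sum(table.get(seq[i:i + k], 0.0)
               for i in range(len(seq) - k + 1)) + b

table = build_score_table(["AKQDYY"], [1.0], 3)
# the test sequence shares k-mers AKQ and KQD with the support sequence
score = predict("AKQDEI", table, b=0.0, k=3)
```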

Page 32: Support Vector Machine  and String Kernels for  Protein Classification

Experimental Design

Tested with a set of experiments on the SCOP dataset

Experiments designed to ask: could the method discover a new family of a known superfamily?

Diagram from Jaakkola et al.

Page 33: Support Vector Machine  and String Kernels for  Protein Classification

Experiments

160 experiments for 33 target families from 16 superfamilies

Compared results against:

SVM-Fisher

SAM-T98 (HMM-based method)

PSI-BLAST (heuristic alignment-based method)

Page 34: Support Vector Machine  and String Kernels for  Protein Classification

Conclusions for SCOP Experiments

Spectrum kernel with SVM performs as well as the best-known methods for the remote homology detection problem

Efficient computation of the string kernel

Fast prediction: can precompute per-k-mer scores and represent the classifier as a lookup table, giving linear-time prediction for both the spectrum kernel and the (unnormalized) mismatch kernel

General approach to classification problems for sequence data

Page 35: Support Vector Machine  and String Kernels for  Protein Classification

Feature Selection Strategies

Explicit feature filtering: compute a score for each k-mer, based on training-data statistics, during trie traversal and filter as we compute the kernel

Feature elimination as a wrapper for SVM training: eliminate features corresponding to small components wi in the vector w defining the SVM classifier

Kernel principal component analysis: project to principal components prior to training

Page 36: Support Vector Machine  and String Kernels for  Protein Classification

Ongoing and Future Work

New families of string kernels and mismatching schemes

Applications to other sequence-based classification problems, e.g. splice site prediction

Feature selection: explicit and implicit dimension reduction

Other machine learning approaches to using sparse string-based models for classification, e.g. boosting with string-based classifiers