Protein Classification

Post on 22-Dec-2015

TRANSCRIPT

Page 1: Protein Classification. PDB Growth New PDB structures

Protein Classification

Page 2: Protein Classification. PDB Growth New PDB structures

PDB Growth

[Figure: growth plot; axis label "New PDB structures"]

Page 3: Protein Classification. PDB Growth New PDB structures

Only a few folds are found in nature

Page 4: Protein Classification. PDB Growth New PDB structures

Protein classification

• Number of protein sequences grows exponentially

• Number of solved structures grows exponentially

• Number of new folds identified very small (and close to constant)

• Protein classification can
  - Generate overview of structure types
  - Detect similarities (evolutionary relationships) between protein sequences
  - Help predict 3D structure of new protein sequences

Morten Nielsen,CBS, BioCentrum, DTU

SCOP release 1.69

Class                            # folds   # superfamilies   # families
All alpha proteins                   218               376          608
All beta proteins                    144               290          560
Alpha and beta proteins (a/b)        136               222          629
Alpha and beta proteins (a+b)        279               409          717
Multi-domain proteins                 46                46           61
Membrane & cell surface               47                88           99
Small proteins                        75               108          171
Total                                945              1539         2845

Classification of 25,973 protein structures in PDB

Page 5: Protein Classification. PDB Growth New PDB structures

Protein structure classification

[Diagram: Protein world > Protein fold > Protein superfamily > Protein family]

Page 6: Protein Classification. PDB Growth New PDB structures

Structure Classification Databases

• SCOP: manual classification (A. Murzin), scop.berkeley.edu

• CATH: semi-manual classification (C. Orengo), www.biochem.ucl.ac.uk/bsm/cath

• FSSP: automatic classification (L. Holm), www.ebi.ac.uk/dali/fssp/fssp.html

Page 7: Protein Classification. PDB Growth New PDB structures

Major classes in SCOP

• Classes
  - All α proteins
  - All β proteins
  - α and β proteins (α/β)
  - α and β proteins (α+β)
  - Multi-domain proteins
  - Membrane and cell surface proteins
  - Small proteins
  - Coiled coil proteins

Page 8: Protein Classification. PDB Growth New PDB structures

All α: Hemoglobin (1bab)

Page 9: Protein Classification. PDB Growth New PDB structures

All β: Immunoglobulin (8fab)

Page 10: Protein Classification. PDB Growth New PDB structures

α/β: Triosephosphate isomerase (1hti)

Page 11: Protein Classification. PDB Growth New PDB structures

α+β: Lysozyme (1jsf)

Page 12: Protein Classification. PDB Growth New PDB structures

Families

• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)

• Families are further subdivided into Proteins

• Proteins are divided into Species
  - The same protein may be found in several species

[Diagram: Fold > Superfamily > Family > Proteins]

Page 13: Protein Classification. PDB Growth New PDB structures

Superfamilies

• Proteins which are (remotely) evolutionarily related
  - Sequence similarity low
  - Share function
  - Share special structural features

• Relationships between members of a superfamily may not be readily recognizable from the sequence alone

[Diagram: Fold > Superfamily > Family > Proteins]

Page 14: Protein Classification. PDB Growth New PDB structures

Folds

• >~50% secondary structure elements arranged in the same order in sequence and in 3D

• No evolutionary relation

[Diagram: Fold > Superfamily > Family > Proteins]

Page 15: Protein Classification. PDB Growth New PDB structures

Protein Classification

• Given a new protein sequence, can we place it in its “correct” position within an existing protein hierarchy?

Methods

• BLAST / PSI-BLAST

• Profile HMMs

• Supervised Machine Learning methods

[Diagram: Fold > Superfamily > Family > Proteins, with the new protein marked "?"]

Page 16: Protein Classification. PDB Growth New PDB structures

BLAST

(Basic Local Alignment Search Tool)

Main idea:

1. Construct a dictionary of all the words in the query

2. Initiate a local alignment for each word match between query and DB

Running Time: O(MN)

However, orders of magnitude faster than Smith-Waterman

[Figure: query vs. DB]
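The two steps of the main idea can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it assumes exact word matches only and a plain match/mismatch score for an ungapped rightward extension, whereas the real tool also considers similar words scored with a substitution matrix. The names `build_word_index`, `find_seeds` and `extend_seed` are hypothetical.

```python
from collections import defaultdict

def build_word_index(query, w=3):
    """Step 1: dictionary mapping each length-w word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index

def find_seeds(query, subject, w=3):
    """Step 2a: (query_pos, subject_pos) pairs where a word of the query occurs in the subject."""
    index = build_word_index(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds

def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1):
    """Step 2b (simplified): ungapped extension to the right; returns the best score seen."""
    score = best = w * match
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        best = max(best, score)
        qi += 1
        sj += 1
    return best

seeds = find_seeds("ABRACADABRA", "XXCADABXX")
```

Only the positions listed in `seeds` are extended, which is why the search is much faster in practice than filling the full dynamic-programming table.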

Page 17: Protein Classification. PDB Growth New PDB structures

PSI-BLAST

Given a sequence query x, and database D

1. Find all pairwise alignments of x to sequences in D

2. Collect all matches of x to y with some minimum significance

3. Construct position specific matrix M, a profile

4. Using the matrix M, search D for more matches

5. Iterate 1–4 until convergence

Profile M
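Step 3 can be illustrated with a minimal sketch of a position-specific scoring matrix. It assumes gapless, equal-length aligned matches, a uniform background distribution and a flat pseudocount; PSI-BLAST itself uses sequence weighting and more careful pseudocounts, so `build_profile` and `score_segment` are illustrative names only.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """Return one dict per column: log-odds of each residue vs. a uniform background."""
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        column = {a: math.log((counts[a] + pseudocount) / total / background)
                  for a in ALPHABET}
        profile.append(column)
    return profile

def score_segment(profile, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[a] for column, a in zip(profile, segment))

profile = build_profile(["ACD", "ACE", "ACD"])
```

In the iteration of step 5, newly found significant matches would be added to `aligned` and the profile rebuilt before searching again.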

Page 18: Protein Classification. PDB Growth New PDB structures

A profile

Page 19: Protein Classification. PDB Growth New PDB structures

Profile HMMs

• Each M state has a position-specific pre-computed substitution table

• Each I and D state has position-specific gap penalties

• Profile HMM is a generative model:
  - The sequence X that is aligned to H is thought of as “generated by” H
  - Therefore, H parametrizes a conditional distribution P(X | H)

[Figure: protein profile HMM with match states M1…Mm, insert states I0…Im and delete states D1…Dm between BEGIN and END]

Page 20: Protein Classification. PDB Growth New PDB structures

Classification with Profile HMMs

[Figure: profile HMMs (states BEGIN, M1…Mm, I0…Im, D1…Dm, END) attached to families in the Fold > Superfamily > Family hierarchy; the new protein is marked "?"]

Page 21: Protein Classification. PDB Growth New PDB structures

Classification with Profile HMMs

• How generative models work

Training examples (sequences known to be members of the family)

Model assigns a probability to any given protein sequence.

Sequences from the family yield a higher probability than sequences from outside the family.

• Log-likelihood ratio as score

$L(X) = \log \frac{P(X \mid H_1) P(H_1)}{P(X \mid H_0) P(H_0)} = \log \frac{P(H_1 \mid X) P(X)}{P(H_0 \mid X) P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}$
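A small sketch of the log-likelihood ratio score follows. To stay short it assumes both H1 and H0 are position-independent emission models (a dictionary of residue probabilities) rather than full profile HMMs, and a two-letter toy alphabet; the function names and the prior are illustrative assumptions.

```python
import math

def log_likelihood(seq, emission_probs):
    """log P(X | H) for a model that emits each residue independently."""
    return sum(math.log(emission_probs[a]) for a in seq)

def log_odds_score(seq, family_model, null_model, prior_family=0.5):
    """L(X) = log [P(X|H1) P(H1)] / [P(X|H0) P(H0)]."""
    log_prior_ratio = math.log(prior_family) - math.log(1.0 - prior_family)
    return (log_likelihood(seq, family_model)
            - log_likelihood(seq, null_model)
            + log_prior_ratio)

family = {"A": 0.9, "B": 0.1}   # toy H1
null = {"A": 0.5, "B": 0.5}     # toy H0
```

A positive score means the family model explains the sequence better than the null model; with a profile HMM, `log_likelihood` would be replaced by the forward algorithm.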

Page 22: Protein Classification. PDB Growth New PDB structures

Generative Models

Page 27: Protein Classification. PDB Growth New PDB structures

Discriminative Methods

Instead of modeling the process that generates data, directly discriminate between classes

• More direct way to the goal

• Better if model is not accurate

Page 28: Protein Classification. PDB Growth New PDB structures

Discriminative Models -- SVM

Decision rule: red if $v^T x > 0$

If $x_1, \dots, x_n$ are the training examples, $\mathrm{sign}\left(\sum_i \lambda_i x_i^T x\right)$ “decides” where $x$ falls

• Train $\lambda_i$ to achieve best margin

[Figure panels: large margin for |v| < 1; margin of 1 for small |v|]
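The decision rule can be written out as a short sketch. It assumes the class label of each training example is folded into the sign of its coefficient, since the slide does not show labels separately; `weight_vector` and `decide` are hypothetical helper names.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def weight_vector(lams, examples):
    """v = sum_i lambda_i * x_i, the normal vector of the separating hyperplane."""
    dim = len(examples[0])
    return [sum(l * x[d] for l, x in zip(lams, examples)) for d in range(dim)]

def decide(lams, examples, x):
    """sign(sum_i lambda_i * x_i^T x): +1 on one side of the hyperplane, -1 on the other."""
    total = sum(l * dot(xi, x) for l, xi in zip(lams, examples))
    return 1 if total > 0 else -1

examples = [(1.0, 2.0), (-1.0, -1.0)]
lams = [0.5, -1.0]  # signed coefficients (assumption: sign encodes the class)
```

Both forms give the same answer because `dot(weight_vector(lams, examples), x)` equals the sum inside `decide`; the second form only needs inner products, which is what allows a kernel to be substituted later.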

Page 29: Protein Classification. PDB Growth New PDB structures

Discriminative protein classification

Jaakkola, Diekhans, Haussler, ISMB 1999

• Define the discriminant function to be

$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$

We decide $X \in$ family $H_1$ whenever $L(X) > 0$

• For now, let’s just assume K(.,.) is a similarity function

• Then, we want to train $\lambda_i$ so that this classifier makes as few mistakes as possible in the new data

• Similarly to SVMs, train $\lambda_i$ so that margin is largest for $0 \le \lambda_i \le 1$

Page 30: Protein Classification. PDB Growth New PDB structures

Discriminative protein classification

• Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$, and $L(X_i) \le -1$ otherwise

• This is not always possible; softer constraints are obtained with the following objective function

$J(\lambda) = \sum_{X_i \in H_1} \lambda_i (2 - L(X_i)) + \sum_{X_j \in H_0} \lambda_j (2 + L(X_j))$

• Training: for $X_i \in H_1$, try to “make” $L(X_i) = 1$

$\lambda_i \leftarrow \frac{1 - L(X_i) + \lambda_i K(X_i, X_i)}{K(X_i, X_i)}$; with minimum 0, and maximum 1

• Similarly, for $X_i \in H_0$ try to “make” $L(X_i) = -1$
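The coordinate-wise training rule can be sketched as below. The sketch assumes labels are encoded as +1 for H1 and -1 for H0, so that one formula covers both cases, and it uses a toy Gaussian similarity on 1-D points in place of a sequence kernel; the number of sweeps is an arbitrary choice.

```python
import math

def discriminant(k_row, lam, y):
    """L(X) = sum over H1 of lam_i K(X, X_i) minus sum over H0 of lam_j K(X, X_j)."""
    return sum(l * t * k for l, t, k in zip(lam, y, k_row))

def train(K, y, sweeps=20):
    """Repeatedly set each lam_i so that y_i * L(X_i) = 1, clipped to [0, 1]."""
    n = len(y)
    lam = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            L_i = discriminant(K[i], lam, y)
            target = (1.0 - y[i] * L_i + lam[i] * K[i][i]) / K[i][i]
            lam[i] = min(1.0, max(0.0, target))
    return lam

# Toy data: two well-separated groups on a line (stand-ins for H1 and H0).
points = [0.0, 0.1, 0.2, 3.0, 3.1, 3.2]
labels = [1, 1, 1, -1, -1, -1]
K = [[math.exp(-(a - b) ** 2) for b in points] for a in points]
lam = train(K, labels)
```

For an H1 example the update is exactly the formula above; for an H0 example the sign flip from `y[i]` turns it into the update that drives L(X_i) toward -1.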

Page 31: Protein Classification. PDB Growth New PDB structures

The Fisher Kernel

• The function K(X, Y) compares two sequences
  - Acts effectively as an inner product in a (non-Euclidean) space
  - Called “Kernel”

• Has to be positive definite

• For any $X_1, \dots, X_n$, the matrix $K$ with $K_{ij} = K(X_i, X_j)$ is such that, for any $x \in \mathbb{R}^n$, $x \ne 0$: $x^T K x > 0$

• Choice of this function is important

• Consider $P(X \mid H_1, \theta)$ – sufficient statistics
  - How many expected times X takes each transition/emission

[Figure: profile HMM with states BEGIN, M1…Mm, I0…Im, D1…Dm, END]
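The positive-definiteness condition on the kernel matrix can be checked numerically. One standard way, sketched here in plain Python, is to attempt a Cholesky factorization, which succeeds exactly when a symmetric matrix is positive definite; the tolerance and the function name are assumptions made for the illustration.

```python
import math

def is_positive_definite(K, tol=1e-12):
    """Try to factor symmetric K as L L^T; fail as soon as a pivot is not positive."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = K[i][i] - s
                if pivot <= tol:
                    return False
                L[i][j] = math.sqrt(pivot)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return True

# Gaussian similarity on three distinct points: a valid kernel matrix.
pts = [0.0, 1.0, 2.5]
gauss = [[math.exp(-(a - b) ** 2) for b in pts] for a in pts]
# A symmetric "similarity" matrix that is not a valid kernel.
bad = [[1.0, 2.0], [2.0, 1.0]]
```

Here `is_positive_definite(gauss)` holds while `is_positive_definite(bad)` does not, which illustrates why an arbitrary similarity score cannot be used as K without checking.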

Page 32: Protein Classification. PDB Growth New PDB structures

The Fisher Kernel

Let $\theta$ be the vector of parameters of the HMM (probabilities on each arrow and emission)

• Fisher score $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
  - Quantifies how each parameter contributes to generating X
  - For two different sequences X and Y, can compare $U_X$, $U_Y$

• $D_F^2(X, Y) = \frac{1}{2\sigma^2} |U_X - U_Y|^2$; $\sigma$ is just a scaling parameter

• Given this distance function, K(X, Y) is defined as a similarity measure: $K(X, Y) = \exp(-D_F^2(X, Y))$
  - Set $\sigma$ so that the average distance of training sequences $X_i \in H_1$ to sequences $X_j \in H_0$ is 1

[Figure: profile HMM with states BEGIN, M1…Mm, I0…Im, D1…Dm, END]
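The construction above can be sketched on a toy model. The sketch assumes a single-state model that emits residues independently with probabilities theta, so the gradient of the log-likelihood with respect to theta_a is simply (count of a) / theta_a; a real profile HMM would need expected transition and emission counts from the forward-backward algorithm. It also reads "average distance is 1" as "the mean of D_F^2 over H1/H0 pairs is 1", which is one possible interpretation.

```python
import math

def fisher_score(seq, theta):
    """Toy U_X: d/d theta_a of sum_a n_a log theta_a = n_a / theta_a, in sorted key order."""
    return [seq.count(a) / p for a, p in sorted(theta.items())]

def sq_dist(u, v):
    """Squared Euclidean distance |u - v|^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def choose_sigma(pos_scores, neg_scores):
    """Pick sigma so that the mean of |U_X - U_Y|^2 / (2 sigma^2) over pos/neg pairs is 1."""
    dists = [sq_dist(u, v) for u in pos_scores for v in neg_scores]
    return math.sqrt(sum(dists) / len(dists) / 2.0)

def fisher_kernel(u, v, sigma):
    """K(X, Y) = exp(-D_F^2(X, Y)) with D_F^2 = |U_X - U_Y|^2 / (2 sigma^2)."""
    return math.exp(-sq_dist(u, v) / (2.0 * sigma ** 2))

theta = {"A": 0.7, "B": 0.3}
pos = [fisher_score("AAAB", theta), fisher_score("AABA", theta)]
neg = [fisher_score("BBBA", theta)]
sigma = choose_sigma(pos, neg)
```

With this choice of sigma a typical positive/negative pair has kernel value near exp(-1), while identical score vectors have kernel value 1.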

Page 33: Protein Classification. PDB Growth New PDB structures

The Fisher Kernel

• To train a classifier for a given family $H_1$,

1. Build profile HMM, $H_1$

2. $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ (Fisher score)

3. $D_F^2(X, Y) = \frac{1}{2\sigma^2} |U_X - U_Y|^2$ (distance)

4. $K(X, Y) = \exp(-D_F^2(X, Y))$ (akin to dot product)

5. $L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$

6. Iteratively adjust $\lambda$ to optimize

$J(\lambda) = \sum_{X_i \in H_1} \lambda_i (2 - L(X_i)) + \sum_{X_j \in H_0} \lambda_j (2 + L(X_j))$

• To classify query X,
  - Compute $U_X$
  - Compute $K(X, X_i)$ for all training examples $X_i$ with $\lambda_i \ne 0$ (few)
  - Decide based on $L(X) >? 0$

Page 34: Protein Classification. PDB Growth New PDB structures

The Fisher Kernel

• If a given superfamily has more than one profile model,

$L_{max}(X) = \max_i L_i(X) = \max_i \left( \sum_{X_j \in H_i} \lambda_j K(X, X_j) - \sum_{X_j \in H_0} \lambda_j K(X, X_j) \right)$

[Figure: a superfamily containing several families, each with its own profile HMM]

Page 35: Protein Classification. PDB Growth New PDB structures

O. Jangmin

Benchmarks

Page 36: Protein Classification. PDB Growth New PDB structures

Other methods

• WU-BLAST version 2.0a16 (Altschul & Gish 1996)
  - PDB90 database was queried with each positive training example, and E-values were recorded
  - BLAST:SCOP-only
  - BLAST:SCOP+SAM-T98-homologs
  - Scores were combined by the maximum method

• SAM-T98 method
  - Null model: reverse sequence model
  - Same data and same set of models as in the SVM-Fisher
  - Scores combined by the maximum method

Page 37: Protein Classification. PDB Growth New PDB structures

Results

• Metric: the rate of false positives (RFP)

• RFP for a positive test sequence: the fraction of negative test sequences that score as well as or better than the positive sequence

• Results for a family of the nucleotide triphosphate hydrolases SCOP superfamily
  - Test the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds
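The RFP metric is simple enough to state as code. This sketch assumes that a higher score means a more confident family assignment (for E-values the comparison would be reversed); the function name and the example numbers are made up for illustration and are not results from the benchmark.

```python
def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negative test sequences scoring as well as or better than the positive one."""
    as_good_or_better = sum(1 for s in negative_scores if s >= positive_score)
    return as_good_or_better / len(negative_scores)

# Illustrative scores only.
rfp = rate_of_false_positives(0.8, [0.9, 0.8, 0.1, 0.2])
```

An RFP of 0 means the positive sequence outranks every negative one; a method is better the more positive test sequences reach a low RFP.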

Page 38: Protein Classification. PDB Growth New PDB structures
Page 39: Protein Classification. PDB Growth New PDB structures

QUESTION

Running time of Fisher kernel SVM on query X?

Page 40: Protein Classification. PDB Growth New PDB structures

k-mer based SVMs

Leslie, Eskin, Weston, Noble; NIPS 2002

Highlights

• $K(X, Y) = \exp\left(-\frac{1}{2\sigma^2} |U_X - U_Y|^2\right)$ requires expensive profile alignment:

$U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ – $O(|X| \, |H_1|)$

• Instead, new kernel K(X, Y) just “counts up” k-mers with mismatches in common between X and Y – O(|X|) in practice

• Off-the-shelf SVM software used

Page 41: Protein Classification. PDB Growth New PDB structures

k-mer based SVMs

• For given word size k, and mismatch tolerance l, define

K(X, Y) = # distinct k-long word pairs with ≤ l mismatches

• Define normalized kernel K’(X, Y) = K(X, Y)/ sqrt(K(X,X)K(Y,Y))

• SVM can be learned by supplying this kernel function

Example: let k = 3; l = 1

X = A B A C A R D I

Y = A B R A D A B I

K(X, Y) = 4

K’(X, Y) = 4/sqrt(7*7) = 4/7
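The worked example can be reproduced with a short sketch. It follows the slide's counting definition literally, reading "distinct word pairs" as unordered pairs of k-mers (one from each sequence) that differ in at most l positions, which is the reading that yields 4 and 7 here. It is a direct quadratic-time count, not the efficient mismatch kernel computation used in the cited paper, and the function names are illustrative.

```python
import math

def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

def pair_count_kernel(x, y, k=3, l=1):
    """Count distinct unordered word pairs {u, v}, u from x and v from y, with <= l mismatches."""
    pairs = {frozenset((u, v))
             for u in kmers(x, k) for v in kmers(y, k)
             if hamming(u, v) <= l}
    return len(pairs)

def normalized_kernel(x, y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))."""
    return pair_count_kernel(x, y, k, l) / math.sqrt(
        pair_count_kernel(x, x, k, l) * pair_count_kernel(y, y, k, l))

X = "ABACARDI"
Y = "ABRADABI"
```

The four matching pairs are ABA/ABR, ABA/ADA, ABA/ABI and ACA/ADA; each self-comparison gives 7 because six words match themselves and one extra pair (ABA/ACA in X, ABR/ABI in Y) differs by a single letter.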

Page 42: Protein Classification. PDB Growth New PDB structures

SVMs will find a few support vectors

After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X

Page 43: Protein Classification. PDB Growth New PDB structures

Benchmarks

Page 44: Protein Classification. PDB Growth New PDB structures

Semi-Supervised Methods

GENERATIVE SUPERVISED METHODS

Page 45: Protein Classification. PDB Growth New PDB structures

Semi-Supervised Methods

DISCRIMINATIVE SUPERVISED METHODS

Page 46: Protein Classification. PDB Growth New PDB structures

Semi-Supervised Methods

UNSUPERVISED METHODS

Mixture of Centers

Data generated by a fixed set of centers (how many?)

Page 56: Protein Classification. PDB Growth New PDB structures

Semi-Supervised Methods

• Some examples are labeled

• Assume labels vary smoothly among all examples

Page 57: Protein Classification. PDB Growth New PDB structures

Semi-Supervised Methods

• Some examples are labeled

• Assume labels vary smoothly among all examples

• SVMs and other discriminative methods may make significant mistakes due to lack of data

Page 61: Protein Classification. PDB Growth New PDB structures

Semi-Supervised Methods

• Some examples are labeled

• Assume labels vary smoothly among all examples

Attempt to “contract” the distances within each cluster while keeping intercluster distances larger

Page 64: Protein Classification. PDB Growth New PDB structures

Cluster Kernels

1. Neighborhood

   1. For each X, run PSI-BLAST to get similar seqs Nbd(X)

   2. Define $\Phi_{nbd}(X) = \frac{1}{|Nbd(X)|} \sum_{X' \in Nbd(X)} \Phi_{original}(X')$

      “Counts of all k-mers matching with at most 1 difference, over all sequences that are similar to X”

   3. $K_{nbd}(X, Y) = \frac{1}{|Nbd(X)| \cdot |Nbd(Y)|} \sum_{X' \in Nbd(X)} \sum_{Y' \in Nbd(Y)} K(X', Y')$

2. Bagged mismatch

   1. Run k-means clustering n times, giving p = 1,…,n assignments $c_p(X)$

   2. For every X and Y, count up the fraction of times they are bagged together

      $K_{bag}(X, Y) = \frac{1}{n} \sum_p \mathbf{1}(c_p(X) = c_p(Y))$

   3. Combine the “bag fraction” with the original comparison K(.,.)

      $K_{new}(X, Y) = K_{bag}(X, Y) \, K(X, Y)$
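Both cluster kernels reduce to simple averages once the base kernel values are available. The sketch below assumes the base kernel is given as a precomputed matrix `K`, that neighborhoods are lists of indices into it (standing in for PSI-BLAST hits), and that the k-means runs have already produced label lists; all names and the toy numbers are illustrative.

```python
def neighborhood_kernel(K, nbd, a, b):
    """K_nbd: average of K(X', Y') over X' in Nbd(X_a) and Y' in Nbd(X_b)."""
    total = sum(K[i][j] for i in nbd[a] for j in nbd[b])
    return total / (len(nbd[a]) * len(nbd[b]))

def bagged_kernel(K, assignments, a, b):
    """K_new: (fraction of clustering runs placing a and b together) * K(X_a, X_b)."""
    together = sum(1 for run in assignments if run[a] == run[b])
    return together / len(assignments) * K[a][b]

# Toy base kernel on three sequences.
K = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
nbd = [[0, 1], [0, 1], [2]]          # hypothetical neighborhoods
assignments = [[0, 0, 1], [0, 1, 1]]  # hypothetical labels from two clustering runs
```

In both cases the unlabeled sequences enter only through `nbd` or `assignments`, so the SVM training itself is unchanged and simply receives the new kernel values.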

Page 65: Protein Classification. PDB Growth New PDB structures

Benchmarks