Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Rajkumar Bondugula, Ognen Duzlevski and Dong Xu
Digital Biology Laboratory, Dept. of Computer Science, University of Missouri – Columbia, MO 65211, USA


Page 1: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Rajkumar Bondugula,

Ognen Duzlevski and Dong Xu

Digital Biology Laboratory, Dept. of Computer Science

University of Missouri – Columbia, MO 65211, USA

Page 2: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Outline

Introduction
  Protein secondary structure prediction
  Popular methods
  K-Nearest Neighbor method
  Fuzzy K-Nearest Neighbor method
Methods
Filtering the prediction
Results and discussion
Summary and Future work

Page 3: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Introduction

Goal: Given a sequence of amino acids, predict which of the eight possible secondary structure states {H, G, I, B, E, C, S, T} each residue will fold into.

CASP convention: {H, G, I} → H, {B, E} → E, {C, S, T} → C

Example:
Amino Acid          VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH
Secondary Structure CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC
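The CASP reduction from eight states to three is a plain lookup. The following is a minimal sketch of that mapping; the names EIGHT_TO_THREE and reduce_states are illustrative and do not come from the presentation.

```python
# CASP convention: {H, G, I} -> H, {B, E} -> E, {C, S, T} -> C
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "B": "E", "E": "E",
    "C": "C", "S": "C", "T": "C",
}


def reduce_states(eight_state: str) -> str:
    """Map a string of eight-state labels to the three-state alphabet."""
    return "".join(EIGHT_TO_THREE[state] for state in eight_state)
```

An unknown label raises KeyError, which is a deliberate choice here so that malformed annotations are not silently mapped to coil.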

Page 4: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Protein 3-Dimensional structure

Page 5: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Importance of Secondary Structure

An intermediate step in 3D structure prediction (structure → function)

Classification
  Ex: α, β, α/β, α+β

Helps in protein folding pathway determination

Page 6: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Existing Methods

Popular methods
  Neural Network methods
    Ex: PSIPRED, PHD
  Nearest Neighbor methods
    Ex: NNSSP

Hidden Markov Model methods

Page 7: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Why K-Nearest Neighbors method?

Methods based on Neural Networks and Hidden Markov Models
  perform well if the query protein has many homologs in the sequence database
  are not easily expandable

The error rate of the 1-Nearest Neighbor rule is asymptotically bounded above by twice the optimal Bayes error rate [Keller et al., 1985]

K-NN will work better as more structures are solved

Page 8: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

K-Nearest Neighbor Algorithm

[Figure legend: instances to be classified; classified instances]

Page 10: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

K-Nearest Neighbor Algorithm

[Figure legend: instances to be classified; class B; class F]

Page 11: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

K-Nearest Neighbor Algorithm

Advantages of Nearest Neighbor methods
  Simple and transparent model
  New structures can be added without re-training
  Linear complexity

Disadvantage
  Slower than other models, as processing is delayed until a prediction is needed
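The crisp K-NN rule illustrated on these slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature vectors, the Manhattan distance and the function name knn_classify are assumptions, and the class labels B and F are borrowed from the figure legend.

```python
import heapq
from collections import Counter


def knn_classify(query, examples, k=3):
    """Return the majority label among the k examples closest to `query`.

    `examples` is a list of (feature_vector, label) pairs.
    """
    def distance(a, b):
        # Manhattan distance, chosen only for illustration.
        return sum(abs(x - y) for x, y in zip(a, b))

    # All work happens at prediction time: one pass over every stored example.
    nearest = heapq.nsmallest(k, examples, key=lambda ex: distance(query, ex[0]))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Adding a newly solved structure amounts to appending to `examples`; nothing is retrained, which is the expandability advantage listed above.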

Page 12: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Why Fuzzy K-NN?

Disadvantages of Crisp K-NN
  Atypical examples are given as much weight as those that truly represent a particular class
  Once an instance is assigned to a class, there is no indication of the “strength” of its membership in that class

Page 13: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Sliding window of 7 residues over the sequence ('-' pads the terminus); pages 13 to 18 add one window at a time:

- - - N L G A G N S G L N L G H V A L T F

- - - N L G A
- - N L G A G
- N L G A G N
N L G A G N S
L G A G N S G
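The window construction shown in this animation can be sketched in a few lines. The width of 7 and the '-' padding are read off the figure; the function name sliding_windows and the assumption that the width is odd (so that each window has a central residue) are illustrative.

```python
def sliding_windows(sequence, width=7, pad="-"):
    """Return one window per residue, centered on that residue.

    The sequence is padded on both ends so that terminal residues
    also get a full-width window. `width` is assumed to be odd.
    """
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]
```

For the sequence in the figure, the first windows are `---NLGA`, `--NLGAG` and `-NLGAGN`, matching the frames above.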

Page 19: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Position Specific Scoring Matrix

[Figure: PSI-BLAST converts the sequence . . . N L G A G N S G L N L G H V A L T F . . . into a position-specific scoring matrix of size 20 × length of protein (l), with one score per amino acid (A R N D C Q E G H I L K M F P S T W Y V) at each sequence position. The scores shown are integers between -4 and 8.]

Page 20: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Why Profile-FKNN?

Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods

An attempt to combine the advantages of incorporating the evolutionary information, fuzzy set theory and nearest neighbor methods

Page 21: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Methods

Calculate profiles using PSI-BLAST
  The popular Rost and Sander database of 126 representative proteins (<25% sequence identity)
Find K-Nearest Neighbors
Calculate the membership values of the neighbors
Calculate the membership values of the current residue
Assign classes
Filter the output

Page 22: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Profile Calculation

The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST

Parameters for PSI-BLAST
  Expectation Value (e) = 0.1
  Maximum number of passes (j) = 3
  E-value threshold for inclusion in multi-pass model (h) = 5
  Default values for the rest of the parameters
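One way to pass these parameters is through the legacy NCBI blastpgp program, whose -e, -j and -h options correspond to the three values listed. The sketch below only assembles the argument list and does not run anything. The file names, the database name `nr`, and the use of -i (query), -d (database) and -Q (ASCII PSSM output) are assumptions; the slides do not say which PSI-BLAST binary or sequence database was used.

```python
def blastpgp_command(query_fasta, database, pssm_out):
    """Build an argument list for a legacy blastpgp run (hypothetical paths)."""
    return [
        "blastpgp",
        "-i", query_fasta,   # query sequence in FASTA format (assumed option)
        "-d", database,      # sequence database to search (assumed option)
        "-e", "0.1",         # expectation value, from the slide
        "-j", "3",           # maximum number of passes, from the slide
        "-h", "5",           # E-value threshold for inclusion, from the slide
        "-Q", pssm_out,      # write the profile as an ASCII PSSM (assumed option)
    ]
```

The list could be handed to `subprocess.run` on a machine where blastpgp and a formatted database are installed.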

Page 23: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

K-Nearest Neighbors

For each profile-window in the query protein, the position-weighted absolute distance ‘d’ is calculated from all profile-windows of all proteins in the database.

The profile-windows corresponding to K smallest distances are retained as the K-Nearest Neighbors

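$$
d = \sum_{i=1}^{20} \sum_{j=1}^{W} w(j, W)\,\left| p^{\mathrm{Query}}_{ij} - p^{\mathrm{Database}}_{ij} \right|
$$

where p_ij is the profile score of amino acid i at window position j, W is the window width, and w(j, W) is the position weight. The slide gives w(j, W) as an expression in max, min, j and W whose exact form is not recoverable from this transcript.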
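A position-weighted absolute distance between two profile-windows, followed by selection of the K smallest distances, might look like the sketch below. The position weights are passed in as an argument because their exact form is not legible in this transcript; the data layout (a window as a list of per-position score columns) and the names window_distance and k_nearest are assumptions made for illustration.

```python
import heapq


def window_distance(query_win, db_win, weights):
    """Position-weighted sum of absolute score differences.

    Each window is a list of W columns; each column holds the profile
    scores (20 in a real PSSM) for one window position.
    """
    total = 0.0
    for weight, q_col, d_col in zip(weights, query_win, db_win):
        total += weight * sum(abs(q - d) for q, d in zip(q_col, d_col))
    return total


def k_nearest(query_win, database, weights, k):
    """Return the k (distance, label) pairs with the smallest distances.

    `database` is a list of (window, label) pairs.
    """
    scored = ((window_distance(query_win, win, weights), label)
              for win, label in database)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])
```

Every query window is compared against every database window, so the cost of one prediction grows linearly with the size of the database.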

Page 24: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Distance Calculation

[Figure: two profiles are shown, one for the sequence . . . N L G A G N S G L T F . . . and one for the sequence . . . N L G A G N S G L N L G H V A L T F . . .; the distance is calculated between profile-windows of the two.]

Page 30: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Membership Values of the Neighbors

The memberships of the nearest neighbors are assigned based on their corresponding secondary structures in various positions in the window

Residues near the center of the window are weighted more heavily than residues farther away

Page 31: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Membership values of the Neighbors

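Position weight       0.067  0.133  0.200  0.200  0.200  0.133  0.067
Neighbor window       N      L      G      A      G      N      S
Secondary structure   C      C      E      E      E      C      C
H                     0      0      0      0      0      0      0
E                     0      0      1      1      1      0      0
C                     1      1      0      0      0      1      1

H = 0

E = 0.200×1 + 0.200×1 + 0.200×1 = 0.6

C = 0.067×1 + 0.133×1 + 0.133×1 + 0.067×1 = 0.4

Central residue: A, with secondary structure E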
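The worked example on this slide can be reproduced with a short function. The weights are the values printed on the slide; the function name neighbor_membership and the dictionary return type are illustrative choices.

```python
# Position weights for a window of 7, as printed on the slide.
SLIDE_WEIGHTS = [0.067, 0.133, 0.20, 0.20, 0.20, 0.133, 0.067]


def neighbor_membership(structure_window, weights=SLIDE_WEIGHTS, classes="HEC"):
    """Sum the position weights of each class over a neighbor's window."""
    return {
        cls: sum(w for state, w in zip(structure_window, weights) if state == cls)
        for cls in classes
    }
```

For the window C C E E E C C this yields memberships of 0 for Helix, 0.6 for Sheet and 0.4 for Coil, up to floating-point rounding.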

Page 32: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Membership Value

The membership values of each residue in the classes Helix, Sheet and Coil are calculated from the corresponding neighbors using the Fuzzy K-NN algorithm

Each residue is assigned to the class in which it has the highest membership value

Helix = . . . 15 22 61 91 95 96 26 21 23 18 29 30 24 17  5  8 . . .
Sheet = . . . 22 28 13  1  1  2  8  8 12 11 42 44 46 29 14 10 . . .
Coil  = . . . 63 50 26  8  4  2 65 71 65 71 29 26 31 53 81 82 . . .
Final = . . .  C  C  H  H  H  H  C  C  C  C  E  E  E  C  C  C . . .
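Assigning each residue to its highest-membership class is a per-position argmax. The sketch below applies it to the numbers on this slide; the function name assign_classes and the tie-breaking order (Helix, then Sheet, then Coil) are assumptions.

```python
def assign_classes(helix, sheet, coil):
    """Pick, for each residue, the class with the highest membership value."""
    labels = "HEC"
    result = []
    for column in zip(helix, sheet, coil):
        # max() returns the first maximal index, so ties favor H, then E.
        best = max(range(3), key=lambda i: column[i])
        result.append(labels[best])
    return "".join(result)
```

Run on the three rows above, it reproduces the Final row.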

Page 33: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Fuzzy K-Nearest neighbor Algorithm

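BEGIN
  Initialize i = 1.
  DO UNTIL (r assigned membership in all classes)
    Compute u_i(r) using

$$
u_i(r) = \frac{\sum_{j=1}^{K} u_{ij}\,\left( 1 / d(r, r_j)^{2/(m-1)} \right)}{\sum_{j=1}^{K} \left( 1 / d(r, r_j)^{2/(m-1)} \right)}
$$

    Increment i.
  END DO UNTIL
END

Where,

u_i(r) = membership value of residue ‘r’ in class ‘i’
u_ij = membership value of the jth neighbor in class ‘i’
i = Helix, Sheet or Coil
d(r, r_j) = distance between the query window centered on residue ‘r’ and its jth neighbor
K = number of nearest neighbors
m = 2 (Fuzzifier)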
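The membership formula on this slide, with inverse-distance weights raised to the power 2/(m-1), can be sketched as below. The function name, the list-based layout and the small `eps` guard against a zero distance are assumptions; the slide does not say how an exact match is handled.

```python
def fuzzy_knn_membership(distances, neighbor_memberships, m=2.0, eps=1e-9):
    """Compute u_i(r) for every class i from the K nearest neighbors.

    distances[j] is d(r, r_j); neighbor_memberships[j][i] is u_ij.
    """
    exponent = 2.0 / (m - 1.0)
    # Inverse-distance weight of each neighbor; eps avoids division by zero.
    inv = [1.0 / max(d, eps) ** exponent for d in distances]
    total = sum(inv)
    n_classes = len(neighbor_memberships[0])
    return [
        sum(w * u[i] for w, u in zip(inv, neighbor_memberships)) / total
        for i in range(n_classes)
    ]
```

With m = 2 a neighbor at distance 1 counts four times as much as a neighbor at distance 2, and the returned memberships sum to 1 whenever each neighbor's memberships do.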

Page 34: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Structure Filtration

In the basic setting, the secondary structure state is the class with the highest membership value

Unrealistic structures may be present

Popular methods of structure filtration
  Neural Network
  Heuristic based

Page 35: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Heuristic Filter

1. Smooth the membership values: $m = 0.25\,m_{n-1} + 0.5\,m_n + 0.25\,m_{n+1}$, where m is the smoothed membership value at position n

2. Filter unrealistic structures (helix > 3 amino acids, β-sheet > 2 amino acids)

3. Calculate the thresholds to filter noise

4. Mark the possible Helix and Sheet regions; resolve conflicts based on the average membership value in the overlap region

5. Fill the rest of the structure with Coil
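The first two steps of the heuristic filter can be sketched as follows. This covers only the 0.25/0.5/0.25 smoothing and the minimum segment lengths; the threshold calculation and conflict resolution of steps 3 and 4 are not specified in enough detail on the slide to reproduce. Repeating the edge value at the sequence ends, replacing short segments with Coil, and the function names are assumptions.

```python
import re


def smooth(values):
    """Apply the 0.25 / 0.5 / 0.25 smoothing to one class's memberships."""
    n = len(values)
    smoothed = []
    for i in range(n):
        prev = values[max(i - 1, 0)]        # edge value repeated (assumption)
        nxt = values[min(i + 1, n - 1)]
        smoothed.append(0.25 * prev + 0.5 * values[i] + 0.25 * nxt)
    return smoothed


def drop_short_segments(structure, min_helix=4, min_sheet=3):
    """Turn helices of <= 3 and sheets of <= 2 residues into Coil."""
    def fix(match):
        run = match.group(0)
        limit = min_helix if run[0] == "H" else min_sheet
        return run if len(run) >= limit else "C" * len(run)

    return re.sub(r"H+|E+", fix, structure)
```

A single isolated H or a pair of E residues is rewritten as coil, while runs that meet the minimum length are left untouched.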

Page 36: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Filter: Final Structure

Unfiltered CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC
Target     CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC
Filtered   CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC

Page 37: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Metrics

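Seven commonly used metrics

$$
Q_3 = \frac{\text{Number of correctly predicted residues}}{\text{Total number of residues}} \times 100
$$

$$
Q_{\langle H,E,C \rangle} = \frac{\text{Number of } \langle\text{helix, sheet, coil}\rangle \text{ residues correctly predicted}}{\text{Total number of residues in } \langle\text{helix, sheet, coil}\rangle} \times 100
$$

Matthews Correlation Coefficient

$$
MCC_{\langle H,E,C \rangle} = \frac{pn - uo}{\sqrt{(p+u)(p+o)(n+u)(n+o)}}
$$

where p = true positives, n = true negatives, u = false negatives, o = false positives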
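The seven metrics can be computed from a predicted and a target string of three-state labels. The sketch below follows the definitions on this slide; the function names, the NaN returned when a state is absent from the target, and the 0.0 returned when the MCC denominator vanishes are assumptions.

```python
import math


def q3(pred, true):
    """Percentage of residues whose state is predicted correctly."""
    return 100.0 * sum(p == t for p, t in zip(pred, true)) / len(true)


def q_state(pred, true, state):
    """Percentage of residues in `state` that are predicted correctly."""
    total = true.count(state)
    correct = sum(p == t == state for p, t in zip(pred, true))
    return 100.0 * correct / total if total else float("nan")


def mcc(pred, true, state):
    """Matthews Correlation Coefficient for one state."""
    pairs = list(zip(pred, true))
    p = sum(a == state and b == state for a, b in pairs)  # true positives
    n = sum(a != state and b != state for a, b in pairs)  # true negatives
    u = sum(a != state and b == state for a, b in pairs)  # false negatives
    o = sum(a == state and b != state for a, b in pairs)  # false positives
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0
```

Calling `q3`, then `q_state` and `mcc` once for each of H, E and C gives the seven numbers reported per method.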

Page 38: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Results

             Q3(%)  QH(%)  QE(%)  QC(%)  MH    ME    MC
Unfiltered   74.0   69.6   55.8   79.9   0.58  0.61  0.54
Filtered     76.2   68.1   66.1   80.4   0.64  0.64  0.56

Performance on a database of 1973 proteins (<25% sequence identity) generated by the PISCES [1] server

1. G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.

Page 39: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Relative Performance

Method      Accuracy (%)
MBR [1]     66.40
NN [2]      68.00
NNSSP [3]   72.20
PFKNN       76.20

1. X. Zhang, J. P. Mesirov and D.L Waltz. Hybrid system for Protein Secondary Structure Prediction. J. Mol. Biol., 225:1049-1063, 1992

2. Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232:1117-1129, 1993

3. A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995

Page 40: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Summary

A novel approach for protein secondary structure prediction (PSSP)
  Evolutionary information
  K-Nearest Neighbor algorithm
  Fuzzy set theory
Most accurate KNN approach to date
Easily expandable
Accuracy increases with new structures
Average computing time < 1 min on a single-CPU machine

Page 41: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Future Work

System with faster search capabilities
  Efficient search for neighbors

Accurate prediction system

Page 42: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Acknowledgements

Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm

Oak Ridge National Laboratory for providing the supercomputing facilities

Members of Digital Biology Laboratory for their support

Page 43: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Software

The enhanced version of the software is coded in C and is available upon request. Please e-mail your requests to

[email protected]

[email protected]

Page 44: Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Thank you for your participation!