profiles and fuzzy k-nearest neighbor algorithm for protein secondary structure prediction rajkumar...

44
Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology Laboratory, Dept. of Computer Science University of Missouri – Columbia, MO 65211, USA

Post on 21-Dec-2015

225 views

Category:

Documents


2 download

TRANSCRIPT

Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Rajkumar Bondugula,

Ognen Duzlevski and Dong Xu

Digital Biology Laboratory, Dept. of Computer Science

University of Missouri – Columbia, MO 65211, USA

Outline

Introduction Protein secondary structure prediction Popular methods K-Nearest Neighbor method Fuzzy K-Nearest Neighbor method

Methods Filtering the prediction Results and discussion Summary and Future work

Introduction

Goal: Given a sequence of amino acids, predict in which one of the eight possible secondary structures states {H, G, I, B, E, C, S,T} will each residue fold in to.

CASP convention {H,G,I} → H {B,E} → E {C,S,T} → C

Example:Amino Acid VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCHSecondary StructureCEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC

Protein 3-Dimensional structure

Importance of Secondary Structure

An intermediate step in 3D structure prediction structure → function

ClassificationEx: α, β, α/β, α+β

Helps in protein folding pathway determination

Existing Methods

Popular MethodsNeural Network methods

Ex: PSIPRED, PHD

Nearest Neighbor methods Ex: NNSSP

Hidden Markov Model methods

Why K-Nearest Neighbors method?

Methods based on Neural Networks and Hidden Markov models perform well if the query protein have many homologs

in the sequence databasenot easily expandable

The 1-Nearest Neighbor rule is bound above by no more than twice the optimal Baye’s error rate [Keller et. al, 1985]

K-NN will work better and better as more and more structures are being solved

K-Nearest Neighbor Algorithm

Instances to be classified Classified instances

Instances to be classified Classified instances

K-Nearest Neighbor Algorithm

K-Nearest Neighbor Algorithm

Instances to be classified class B class F

K-Nearest Neighbor Algorithm

Advantages of Nearest Neighbor methodsSimple and transparent model

New structures can be added without re-training

Linear complexity

DisadvantageSlower compared to other models as processing is

delayed until prediction is needed

Why Fuzzy K-NN?

Disadvantages of Crisp K-NN Atypical examples are given as much as weight as those that

truly represent a particular class

Once instance is assigned to a class, there is no indication of its “strength” of its membership in that class

- - - N L G A G N S G L N L G H V A L T F

- - - N L G A

- - - N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G

- - - N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G- N L G A G N

- - - N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G- - N L G A NN L G A G N S

- - - N L G A G N S G L N L G H V A L T F

- - - N L G A- - N L G A G- N L G A G N SL G A G N S G

- - - N L G A G N S G L N L G H V A L T F

Position Specific Scoring Matrix

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

Length of protein(l)

20

PSI-BLAST

. . . N L G A G N S G L N L G H V A L T F . . .

A

R

N

D

C

Q

E

G

H

I

L

K

M

F

P

S

T

W

Y

V

Why Profile-FKNN?

Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods

An attempt to combine the advantages of incorporating the evolutionary information, fuzzy set theory and nearest neighbor methods

Methods

Calculate profiles using PSI-BLAST The popular Rost and Sander database of 126

representative proteins (<25% sequence Identity)

Find K-Nearest Neighbors Calculate the membership values of the neighbors Calculate the membership values of the current

residue Assign classes Filter the output

Profile Calculation

The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST

Parameters for PSI-BLAST Expectation Value (e) = 0.1

Maximum number of passes (j) = 3

E-value threshold for inclusion in multi-pass model (h) = 5

Default values for the rest of the parameters

K-Nearest Neighbors

For each profile-window in the query protein, the position-weighted absolute distance ‘d’ is calculated from all profile-windows of all proteins in the database.

The profile-windows corresponding to K smallest distances are retained as the K-Nearest Neighbors

20

1 1

1,min,1maxi

W

j

Databaseij

Queryij jWjppd

Distance Calculation

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . N L G A G N S G L T F . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . .

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . N L G A G N S G L N L G H V A L T F . . .

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . N L G A G N S G L T F . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . .

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . N L G A G N S G L N L G H V A L T F . . .

Distance Calculation

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . N L G A G N S G L T F . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . .

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . N L G A G N S G L N L G H V A L T F . . .

Distance Calculation

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . N L G A G N S G L T F . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . .

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . N L G A G N S G L N L G H V A L T F . . .

Distance Calculation

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . N L G A G N S G L T F . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . .

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . N L G A G N S G L N L G H V A L T F . . .

Distance Calculation

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . .

. . . N L G A G N S G L T F . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . .

. . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . .

. . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . .

. . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . .

. . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . .

. . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . .

. . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . .

. . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . .

. . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . .

. . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . .

. . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . .

. . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .

. . . N L G A G N S G L N L G H V A L T F . . .

Distance Calculation

Membership Values of the Neighbors

The memberships of the nearest neighbors are assigned based on their corresponding secondary structures in various positions in the window

The residues near to the center are weighed more than the residues that are farther away

Membership values of the Neighbors

0.067 0.133 0.20 0.20 0.20 0.133 0.067

H

E 1 1 1

C 1 1 1 1

C C E E E C C

H = 0

E = 0.200x1 + 0.200x1 + 0.20x1 = 0.6

C = 0.067x1 + 0.133x1 +0.133x1 + 0.067x1 = 0.4C C E E E C C

E

N L G A G N S

A

Membership Value

The membership values of each residue in classes Helix, Sheet and Coil is calculated from the corresponding neighbors using the Fuzzy K-NN algorithm

Each residue is assigned to class in which it has the highest membership value

Helix = . . . 15 22 61 91 95 96 26 21 23 18 29 30 24 17 5 8 . . .

Sheet = . . . 22 28 13 1 1 2 8 8 12 11 42 44 46 29 14 10 . . .

Coil = . . . 63 50 26 8 4 2 65 71 65 71 29 26 31 53 81 82 . . .

Final = . . . C C H H H H C C C C E E E C C C . . .

Fuzzy K-Nearest neighbor Algorithm

BEGIN Initialize i=1. DO UNTIL(r assigned membership in all classes) Compute ui(r) using

Increment i. END DO UNTILEND

K

j

mj

K

j

mjij

i

rrd

rrdu

ru

1

12

1

12

),(/1

),(/1

)(

Where,

ui = membership value of

residue ‘r’ in class ‘i’,

i = Helix, Sheet or Coil

d(r,rj)= distance between query

window centered in

residue ‘r’ its jth

neighbor

m = 2 (Fuzzifier)

Structure Filtration

In the basic setting, the secondary structure state is class with highest membership value

Unrealistic structures may be present Popular methods of structure filtration

Neural Network

Heuristic based

Heuristic Filter

1. Smoothen the memberships values

2. Filter unrealistic structures Helix > 3 amino acids, -sheet > 2 amino acids

3. Calculate the thresholds to filter noise

4. Mark the possible Helix and Sheet regions Resolve conflicts based on average membership value in

overlap region

5. Fill the rest of the structure with Coil

11 25.05.025.0 nnn mmmm

Filter: Final Structure

Unfiltered CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCCTarget CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEECFiltered CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC

Metrics

Seven commonly used metricsQ3 = Number of correctly predicted residues x 100

Total number of residues

Q<H,E,C>= Number of <helix,sheet,coil> residues correctly predicted X100

Total number of residues in <helix,sheet,coil>

Matthew’s Correlation Coefficient

MCC<H,E,C>= opuponun

uopn

where, p – true positives n – true negatives u – false negatives o – false positives

Results

Q3(%) QH(%) QE(%) QC(%) MH ME MC

Unfiltered 74.0 69.6 55.8 79.9 0.58 0.61 0.54

Filtered 76.2 68.1 66.1 80.4 0.64 0.64 0.56

Performance on database of 1973 proteins (<25% sequence identity) generated by the PISCES1 server

1. G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.

Relative Performance

Method Accuracy

MBR1 66.40

NN2 68.00

NNSSP3 72.20

PFKNN 76.20

1. X. Zhang, J. P. Mesirov and D.L Waltz. Hybrid system for Protein Secondary Structure Prediction. J. Mol. Biol., 225:1049-1063, 1992

2. Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232:1117-1129, 1993

3. A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995

Summary

A novel approach for PSSP Evolutionary information

K-Nearest Neighbor algorithm

Fuzzy set theory

Most accurate KNN approach to date Easily expandable Accuracy increases with new structures Average computing time < 1 min on a single

CPU machine

Future Work

System with faster search capabilitiesEfficient search for neighbors

Accurate prediction system

Acknowledgements

Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm

Oak Ridge National Laboratory for providing the supercomputing facilities

Members of Digital Biology Laboratory for their support

Software

The enhanced version of the software is coded in C and is available upon request. Please e-mail your requests to

[email protected]

[email protected]

Thank you for

Participation!