Transcript
Page 1: Support vector machine approach for protein subcellular localization prediction (SubLoc)

Support vector machine approach for protein subcellular localization prediction (SubLoc)

Kim Hye Jin
Intelligent Multimedia Lab.

2001.09.07.

Page 2

Contents

• Introduction
• Materials and Methods
– Support vector machine
– Design and implementation of the prediction system
– Prediction system assessment
• Results
• Discussion and Conclusion

Page 3

Introduction (1)

• Motivation
– Subcellular localization is a key functional characteristic of potential gene products such as proteins
• Traditional methods
– Protein N-terminal sorting signals
  • Nielsen et al. (1999), von Heijne et al. (1997)
– Amino acid composition
  • Nakashima and Nishikawa (1994), Nakai (2000), Andrade et al. (1998), Cedano et al. (1997), Reinhardt and Hubbard (1998)

Page 4

Materials and Methods(1)

• Dataset
- SWISSPROT release 33.0
- Sequences that are complete and have reliable localization annotations
- No transmembrane proteins
  by Rost et al., 1996; Hirokawa et al., 1998; Lio and Vannucci, 2000
- Redundancy reduction
- Effectiveness test
  - by Reinhardt and Hubbard (1998)

Page 6

Support vector machine(1)

• A quadratic optimization problem with bound constraints and one linear equality constraint
• Basically for two-class classification problems
– Input vector x = (x1, ..., x20) (xi: amino acid composition)
– Output y ∈ {-1, 1}
• Idea
– Map input vectors into a high-dimensional feature space
– Construct the optimal separating hyperplane (OSH)
– Maximize the margin: the distance between the hyperplane and the nearest data points of each class in the space H
– Mapping by a kernel function K(xi, xj)
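The 20-dimensional input vector can be illustrated with a short Python sketch. It assumes that each component is the fraction of one of the 20 standard amino acids in the sequence; the residue ordering, the function name `aa_composition`, and the handling of non-standard letters are illustrative choices, not taken from the slides.

```python
from collections import Counter

# Assumption: the 20 standard residues in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Letters outside the 20 standard residues are ignored (an assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, not a real protein.
x = aa_composition("MKKLLA")
```

The components of `x` sum to 1, so sequences of different lengths map to vectors on the same scale.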

Page 8

Support vector machine(2)

• Decision function
• The coefficients of the decision function are obtained by solving a convex quadratic programming problem
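The equations on this slide did not survive in the transcript. For reference, the standard soft-margin SVM formulation is given here in LaTeX; it matches the description on the earlier slide (bound constraints plus one linear equality constraint). The numbering (1) and (2) is an assumption chosen to agree with the later reference to eq(2), and the symbols (N training samples, coefficients α_i, bias b) are the usual textbook ones rather than confirmed slide notation.

```latex
% (1) Decision function for a new input x
f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b \right)

% (2) Convex QP solved for the coefficients \alpha_i
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
```

Training samples with α_i > 0 are the support vectors; all other samples drop out of the sum in (1).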

Page 9

Support vector machine(3)

• Constraints
– In eq(2), C is the regularization parameter => controls the trade-off between margin and misclassification error
• Typical kernel functions
– Eq(3): polynomial kernel with parameter d
– Eq(4): radial basis function (RBF) kernel with parameter γ
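The two kernel equations are missing from the transcript. A minimal Python sketch of the commonly used forms follows, assuming the polynomial kernel is (x·y + 1)^d and the RBF kernel is exp(-γ·|x - y|²); the exact expressions on the slide may differ, and the function names are illustrative.

```python
import math


def polynomial_kernel(x, y, d=1):
    """Polynomial kernel, assumed form (x . y + 1) ** d."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** d


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel, assumed form exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Toy 2-dimensional vectors; real inputs would be 20-dimensional compositions.
k_poly = polynomial_kernel([1.0, 0.0], [1.0, 0.0], d=2)
k_rbf = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0)
```

With d = 1 the polynomial kernel is a shifted dot product, so the resulting classifier is linear in the input space.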

Page 10

Support vector machine(4)

• Benefits of SVM
– Global optimization
– Handles large feature spaces
– Effectively avoids over-fitting by controlling the margin
– Automatically identifies a small subset made up of informative points (the support vectors)

Page 11

Design and implementation of the prediction system

• Problem: multi-class classification problem
– Prokaryotic sequences: 3 classes
– Eukaryotic sequences: 4 classes
• Solution
– Reduce the multi-class classification to binary classifications
– 1-v-r SVM (one versus rest)
• QP problem
– LOQO algorithm (Vanderbei, 1994)
• SVMlight
• Speed
– Less than 10 min on a PC running at 500 MHz
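The one-versus-rest reduction can be sketched in Python as follows. The sketch assumes one trained binary classifier per class, each returning a real-valued output, and picks the class whose classifier gives the highest output. The three lambda functions and the class names are hypothetical stand-ins for trained SVMs, not the SubLoc models.

```python
def predict_one_vs_rest(x, classifiers):
    """Pick the class whose binary (class vs. rest) classifier scores highest.

    classifiers: dict mapping a class label to a decision function f(x) -> float.
    Returns the winning label and all raw outputs.
    """
    outputs = {label: f(x) for label, f in classifiers.items()}
    best = max(outputs, key=outputs.get)
    return best, outputs


# Hypothetical stand-ins for three trained binary SVM decision functions.
toy_classifiers = {
    "cytoplasmic": lambda x: 0.8 * x[0] - 0.2,
    "periplasmic": lambda x: 0.5 * x[1] - 0.1,
    "extracellular": lambda x: -0.3 * x[0] + 0.1,
}

label, outputs = predict_one_vs_rest([1.0, 0.4], toy_classifiers)
```

A k-class problem needs k binary classifiers under this scheme, so the 3-class and 4-class cases need 3 and 4 SVMs respectively.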

Page 12

Prediction system assessment

• Prediction quality test by jackknife test
– Each protein is singled out in turn as a test protein, with the remaining proteins used to train the SVM
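The jackknife (leave-one-out) procedure described above can be sketched in Python. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM; `train_centroids` and `predict_centroid` are hypothetical helpers, and the data are toy values.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Hold out each sample in turn, train on the rest, and test on the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_centroids(xs, ys):
    """Stand-in classifier (not an SVM): compute the mean vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the label of the closest class centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))


toy_x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_y = ["a", "a", "a", "b", "b", "b"]
acc = jackknife_accuracy(toy_x, toy_y, train_centroids, predict_centroid)
```

The test needs one full training run per protein, which is why it becomes expensive for large datasets.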

Page 13

Results (1)

• SubLoc prediction accuracy by jackknife test
– Prokaryotic sequence case
  • d = 1 and d = 9 for polynomial kernel
  • γ = 5.0 for RBF
  • C = 1000 for SVM constraints
– Eukaryotic sequence case
  • d = 9 for polynomial kernel
  • γ = 16.0 for RBF
  • C = 500 for each SVM
• Test: 5-fold cross validation (because of limited computational power)
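A 5-fold split can be sketched in Python as below. The slides do not say how the folds were formed, so this sketch simply assigns sample i to fold i mod k, with no shuffling or class stratification; `k_fold_indices` is an illustrative name.

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, test_indices) pairs covering all samples.

    Assumption: sample i goes to fold i % k; no shuffling or stratification.
    """
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    splits = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(12, k=5)
```

Each sample appears in exactly one test fold, so only k training runs are needed instead of one per sample as in the jackknife test.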

Page 16

Comparison

• Based on amino acid composition
– Neural network
  • Reinhardt and Hubbard, 1998
– Covariant discriminant algorithm
  • Chou and Elrod, 1999
• Based on the full sequence information in genome sequences
– Markov model (Yuan, 1999)

Page 19

Assigning a reliability index

• RI (reliability index)
– Difference between the highest and the second-highest output value of the 1-v-r SVMs
• 78% of all sequences have RI ≥ 3, and 95.9% of these are predicted correctly
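The quantity behind the reliability index can be sketched in Python. The slide does not give the rule that converts the difference into an integer RI, so this sketch stops at the raw difference; `output_gap` and the example outputs are hypothetical.

```python
def output_gap(outputs):
    """Difference between the highest and second-highest classifier outputs.

    outputs: dict mapping a class label to the output of its 1-v-r classifier.
    The mapping from this gap to an integer RI is not given in the slides.
    """
    ranked = sorted(outputs.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need outputs from at least two classifiers")
    return ranked[0] - ranked[1]


# Hypothetical outputs of three 1-v-r classifiers for one protein.
gap = output_gap({"class_a": 1.2, "class_b": -0.3, "class_c": 0.4})
```

A large gap means one classifier clearly dominates the others, which is why a higher RI goes with a more trustworthy prediction.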

Page 20

Robustness to errors in the N-terminal sequence

Page 22

Discussion and Conclusion

• SVM information condensation
– The number of SVs is quite small
– The ratio of SVs to all training samples is 13-30%

Page 23

SVM parameter selection

• Little influence on the classification performance
– Table 8 shows little difference between kernel functions
– Robust characteristic of the dataset, by Vapnik (1995)

Page 24

Improvement of the performance

• Combining with other methods
– Sorting-signal-based method and amino acid composition
  • Signal: sensitive to errors in the N-terminal
  • Composition: weak for proteins with similar amino acid (aa) composition
• Incorporate other informative features
– Bayesian system integrating whole-genome expression data
– Fluorescence microscope images

