Support vector machine approach for protein subcellular localization prediction (SubLoc)

Kim Hye Jin, Intelligent Multimedia Lab., 2001.09.07




Contents

• Introduction
• Materials and Methods
  – Support vector machine
  – Design and implementation of the prediction system
  – Prediction system assessment
• Results
• Discussion and Conclusion


Introduction (1)

• Motivation
  – Subcellular localization is a key functional characteristic of potential gene products such as proteins
• Traditional methods
  – Protein N-terminal sorting signals
    • Nielsen et al. (1999), von Heijne et al. (1997)
  – Amino acid composition
    • Nakashima and Nishikawa (1994), Nakai (2000), Andrade et al. (1998), Cedano et al. (1997), Reinhardt and Hubbard (1998)


Materials and Methods(1)

• Dataset
  - SWISSPROT release 33.0
  - Only sequences with complete and reliable localization annotations
  - No transmembrane proteins (Rost et al., 1996; Hirokawa et al., 1998; Lio and Vannucci, 2000)
  - Redundancy reduction
  - Effectiveness test
    - by Reinhardt and Hubbard (1998)


Support vector machine(1)

• A quadratic optimization problem with bound constraints and one linear equality constraint
• Basically for two-class classification problems
  – Input vector x = (x1, ..., x20) (xi: amino acid composition)
  – Output y ∈ {-1, 1}
• Idea
  – Map the input vectors into a high-dimensional feature space H
  – Construct the optimal separating hyperplane (OSH)
  – Maximize the margin: the distance between the hyperplane and the nearest data points of each class in the space H
  – Mapping by a kernel function K(xi, xj)
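The 20-dimensional input vector can be illustrated with a short sketch. It assumes that each component is the fraction of one of the 20 standard amino acids in the sequence; the ordering of the alphabet, the function name `composition` and the example sequence are illustrative choices, not taken from the slides.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence."""
    # Count only standard residues; anything else (e.g. 'X') is ignored.
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Made-up toy sequence, for illustration only.
x = composition("MKKLLA")
```

Each protein is then represented by a fixed-length vector regardless of its sequence length, which is what allows a kernel K(xi, xj) to compare two proteins directly.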


Support vector machine(2)

• Decision function
• where the coefficients are obtained by solving a convex quadratic programming problem
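For reference, the standard soft-margin SVM formulation is sketched below. It is an assumption, not confirmed by the slide text, that the slide's decision function and quadratic program follow this form; N (the number of training proteins), the coefficients α_i and the bias b are the usual textbook symbols.

```latex
f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right)

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
```

The bounds on each α_i and the single equality constraint match the "bound constraints and one linear equality constraint" description of the optimization problem.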


Support vector machine(3)

• Constraints
  – In Eq. (2), C is the regularization parameter, which controls the trade-off between margin and misclassification error
• Typical kernel functions
  – Eq. (3): polynomial kernel with parameter d
  – Eq. (4): radial basis function (RBF) kernel with parameter γ
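The two kernels can be sketched in a few lines. The formulas below are the commonly used textbook forms and are assumed, not confirmed, to match the slide's Eq. (3) and Eq. (4); the name `gamma` for the RBF parameter is likewise an assumption.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, d):
    """Polynomial kernel of degree d, assumed form: (u . v + 1) ** d."""
    return (dot(u, v) + 1.0) ** d

def rbf_kernel(u, v, gamma):
    """RBF kernel, assumed form: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

With d = 1 the polynomial kernel reduces to a linear kernel plus a constant, while larger d or larger gamma gives a more flexible decision boundary.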


Support vector machine(4)

• Benefits of SVM
  – Global optimization
  – Handles large feature spaces
  – Effectively avoids over-fitting by controlling the margin
  – Automatically identifies a small subset made up of informative points


Design and implementation of the prediction system

• Problem: multi-class classification
  – Prokaryotic sequences: 3 classes
  – Eukaryotic sequences: 4 classes
• Solution
  – Reduce the multi-class problem to binary classification problems
  – 1-v-r SVM (one versus rest)
• QP problem
  – LOQO algorithm (Vanderbei, 1994)
• SVMlight
• Speed
  – Less than 10 min on a PC running at 500 MHz
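The one-versus-rest reduction can be sketched as follows. This is a minimal illustration, not the SubLoc implementation: it assumes one binary SVM per class and that the class with the largest output value wins. The class names and output values are hypothetical.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class problem as binary: +1 for positive_class, -1 for the rest."""
    return [1 if label == positive_class else -1 for label in labels]

def predict_one_vs_rest(decision_values):
    """Pick the class whose binary SVM gives the largest output value.

    decision_values maps class name -> real-valued output of that class's SVM.
    """
    return max(decision_values, key=decision_values.get)

# Hypothetical class names and SVM outputs, for illustration only.
labels = ["cytoplasmic", "periplasmic", "extracellular", "cytoplasmic"]
binary = one_vs_rest_labels(labels, "cytoplasmic")
outputs = {"cytoplasmic": 0.4, "periplasmic": -1.2, "extracellular": 1.1}
predicted = predict_one_vs_rest(outputs)
```

A k-class problem therefore needs k binary SVMs, each trained on the full dataset with a different relabeling.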


Prediction system assessment

• Prediction quality tested by the jackknife test
  – Each protein is singled out in turn as a test protein, with the remaining proteins used to train the SVM
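The jackknife (leave-one-out) procedure can be sketched as a loop over the dataset. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM; this stand-in, the function names and the toy data are assumptions made for illustration only.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave-one-out accuracy: hold out each sample once, train on all the others."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Stand-in classifier (nearest class centroid), NOT an SVM; used only so the sketch runs.
def train_centroids(xs, ys):
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_centroid(model, x):
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: sq_dist(model[y]))

# Toy 2-D data, made up for illustration.
samples = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
labels = ["a", "a", "a", "b", "b", "b"]
accuracy = jackknife_accuracy(samples, labels, train_centroids, predict_centroid)
```

With N proteins the model is retrained N times, which is why the jackknife test is expensive on large datasets.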


Results (1)

• SubLoc prediction accuracy by jackknife test
  – Prokaryotic sequence case
    • d = 1 and d = 9 for the polynomial kernel
    • γ = 5.0 for the RBF kernel
    • C = 1000 for the SVM constraints
  – Eukaryotic sequence case
    • d = 9 for the polynomial kernel
    • γ = 16.0 for the RBF kernel
    • C = 500 for each SVM
• Test: 5-fold cross validation (because of limited computational power)
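A 5-fold split can be sketched as below. The slides do not say how the folds were formed, so contiguous folds of near-equal size are assumed here; in practice the indices would usually be shuffled or stratified by class first. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(12, k=5)

# Each fold serves once as the test set; the other four form the training set.
splits = []
for test_idx in folds:
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    splits.append((train_idx, test_idx))
```

Compared with the jackknife test, this needs only five training runs per parameter setting instead of one per protein.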


Comparison

• Based on amino acid composition
  – Neural network
    • Reinhardt and Hubbard, 1998
  – Covariant discriminant algorithm
    • Chou and Elrod, 1999
• Based on the full sequence information in genome sequences
  – Markov model (Yuan, 1999)


Assigning a reliability index

• RI (reliability index)
  – Based on the difference between the highest and the second-highest output value of the 1-v-r SVMs
• 78% of all sequences have RI ≥ 3, and of these 95.9% are correctly predicted
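The quantity behind the reliability index can be sketched as follows. The slide does not give the mapping from this difference to an integer RI, so the sketch computes only the difference itself; the class names and output values are hypothetical.

```python
def top_two_difference(decision_values):
    """Difference between the highest and second-highest 1-v-r SVM output values."""
    if len(decision_values) < 2:
        raise ValueError("need outputs from at least two classifiers")
    ranked = sorted(decision_values.values(), reverse=True)
    return ranked[0] - ranked[1]

# Hypothetical outputs of three one-versus-rest SVMs for a single protein.
outputs = {"class_a": 1.5, "class_b": -0.25, "class_c": 0.5}
diff = top_two_difference(outputs)
```

A large difference means the winning classifier is well separated from the runner-up, so the prediction is treated as more reliable.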


Robustness to errors in the N-terminal sequence


Discussion and Conclusion

• SVM information condensation
  – The number of SVs is quite small
  – The ratio of SVs to all training samples is 13-30%


SVM parameter selection

• Little influence on the classification performance
  – Table 8 shows little difference between kernel functions
  – Robust characteristic of the dataset (Vapnik, 1995)


Improvement of the performance

• Combining with other methods
  – Sorting-signal-based method and amino acid composition
    • Signal: sensitive to errors in the N-terminal sequence
    • Composition: weak for proteins with similar amino acid composition
• Incorporate other informative features
  – Bayesian system integrating whole-genome expression data
  – Fluorescence microscope images