Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

Rocky 2006

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs

Glycosylation Site Prediction using Machine Learning Approaches
Cornelia Caragea, Jivko Sinapov, Adrian Silvescu, Drena Dobbs and Vasant Honavar

Biological Motivation
Glycosylation is one of the most complex post-translational modifications (PTMs). It is the site-specific enzymatic addition of saccharides to proteins and lipids. Most proteins in eukaryotic cells undergo glycosylation.

Types of Glycosylation

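[Figure: a polypeptide chain drawn from the H3N+ terminus to the COO- terminus (residues M K L I T I L C F L S R L L P S L T Q E S S Q E I D), with candidate residues marked "Non-Glycosylated? Glycosylated?" and "N-linked? O-linked? C-linked?"]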

Problem: Predict glycosylation sites from amino acid sequence

Previous Approaches
• Trained Neural Networks used in netOglyc prediction server (Hansen et al., 1995)
• Dataset: mucin type O-linked glycosylation sites in mammalian proteins

• Trained SVMs based on physical properties, 0/1 system and a combination of these two (Li et al., 2006)
• Dataset: mucin type O-linked glycosylation sites in mammalian proteins
• Negative examples extracted from sequences with no known glycosylated sites
• Trained/tested using different ratios of positive and negative sites

Our Approach
• We investigate 3 types of glycosylation and use an ensemble classifier approach
• Dataset: N-, C- and O-linked glycosylation sites in proteins from several different species: human, rat, mouse, insect, worm, horse, etc.
• Negative examples extracted from sequences with at least one experimentally verified glycosylated site

Dataset
O-GlycBase v6.00: O-, N- & C-glycosylated proteins with 242 glycosylated entries, available at http://www.cbs.dtu.dk/databases/OGLYCBASE/Oglyc.base.html

Glycosylation Type    Positive Sites    Negative Sites
O-Linked (S/T)        2098              11623
N-Linked (N)          251               1430
C-Linked (W)          47                73
Total                 2366              13126
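To make the notion of positive and negative sites concrete, here is a minimal sketch of how candidate residues (S/T, N, W) could be turned into labeled sequence windows. The window half-width, the padding symbol, the function names, and the toy sequence are all illustrative assumptions; the poster does not state how windows were built.

```python
# Residues that can carry each linkage type, as listed in the dataset table.
CANDIDATES = {"O-linked": "ST", "N-linked": "N", "C-linked": "W"}


def extract_windows(sequence, residues, half_width=3, pad="-"):
    """Return (position, window) pairs for every candidate residue.

    half_width and pad are assumptions made for this sketch.
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            # In padded coordinates the residue sits at i + half_width,
            # so this slice is centered on it.
            windows.append((i, padded[i:i + 2 * half_width + 1]))
    return windows


def label_windows(windows, glycosylated_positions):
    """Label a window 1 if its center is a verified site, else 0."""
    return [(w, int(pos in glycosylated_positions)) for pos, w in windows]


seq = "MKLITILCFLSR"  # toy sequence, not an O-GlycBase entry
o_windows = extract_windows(seq, CANDIDATES["O-linked"])
o_labeled = label_windows(o_windows, glycosylated_positions={4})
```

Under this scheme, every S or T in a protein with at least one verified site that is not itself annotated as glycosylated becomes a negative example, which matches the negative-example policy described under Our Approach.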

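Training an ensemble classifier
[Figure: the training database (Train DB) is sampled into subsets S1, S2, S3, ..., Sk; a classifier Ci is trained on each subset Si, giving a bag of trained classifiers C1, C2, C3, ..., Ck. Predictions on the test database (Test DB) are combined by weighted majority vote.]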
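The sample, train, and vote pipeline can be sketched as follows. This is not the authors' implementation: the sampling scheme (all positives plus an equal number of randomly drawn negatives), the vote weights (accuracy on the full training set), and the small position-wise Naive Bayes base learner are assumptions chosen to keep the example self-contained.

```python
import math
import random
from collections import defaultdict


class PositionalNaiveBayes:
    """Naive Bayes over identity windows: one categorical feature per position."""

    def fit(self, windows, labels):
        self.classes = sorted(set(labels))
        self.total = {c: labels.count(c) for c in self.classes}
        self.prior = {c: self.total[c] / len(labels) for c in self.classes}
        self.counts = defaultdict(int)  # (class, position, residue) -> count
        for w, y in zip(windows, labels):
            for i, aa in enumerate(w):
                self.counts[(y, i, aa)] += 1
        return self

    def predict(self, window):
        def log_posterior(c):
            lp = math.log(self.prior[c])
            for i, aa in enumerate(window):
                # Laplace smoothing over 21 symbols (20 amino acids + padding).
                lp += math.log((self.counts[(c, i, aa)] + 1) / (self.total[c] + 21))
            return lp
        return max(self.classes, key=log_posterior)


def train_ensemble(windows, labels, k=5, seed=0):
    """Train k classifiers, each on its own sampled subset S_1..S_k."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    ensemble = []
    for _ in range(k):
        # Assumed sampling: all positives plus an equal-sized draw of negatives.
        subset = pos + rng.sample(neg, min(len(pos), len(neg)))
        clf = PositionalNaiveBayes().fit(
            [windows[i] for i in subset], [labels[i] for i in subset]
        )
        # Assumed weight: accuracy on the full training set.
        acc = sum(clf.predict(w) == y for w, y in zip(windows, labels)) / len(labels)
        ensemble.append((clf, acc))
    return ensemble


def weighted_vote(ensemble, window):
    """Weighted majority vote over the bag of trained classifiers."""
    score = defaultdict(float)
    for clf, weight in ensemble:
        score[clf.predict(window)] += weight
    return max(score, key=score.get)


# Toy windows, invented for illustration only.
toy_windows = ["AASAA", "AATAA", "GGSGG", "GGTGG", "GGSGA", "AGSGG"]
toy_labels = [1, 1, 0, 0, 0, 0]
ensemble = train_ensemble(toy_windows, toy_labels, k=5)
```

Balanced subsets are one plausible way to cope with the imbalance between positive and negative sites in the dataset; other sampling schemes would fit the same diagram.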

Classifiers
• SVM
  • 0/1 String Kernel
  • Substitution Matrix Kernel
  • Blast - Polynomial Kernel
• J48
• Naïve Bayes
• Identity windows
• Identity plus additional information

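Kernel between two windows $x$ and $y$ of length $|w|$:

$$K(x, y) = \left( \sum_{i=1}^{|w|} S(x_i, y_i) \right)^{e}$$

• 0/1 String Kernel: $S(x_i, y_i) = 1$ if $x_i = y_i$ and $0$ otherwise
• Substitution Matrix Kernel: $S(x_i, y_i) = \mathrm{entry}(x_i, y_i)$ in the Blosum62 matrix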
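A minimal sketch of the two window kernels, assuming the kernel is the sum of per-position scores S(x_i, y_i) raised to an exponent e. The function names are hypothetical, and the substitution table is only a six-entry excerpt for illustration; a real run would load the full Blosum62 matrix from a vetted source.

```python
def window_kernel(x, y, score, exponent=1):
    """K(x, y) = (sum_i S(x_i, y_i)) ** exponent for equal-length windows."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length |w|")
    return sum(score(a, b) for a, b in zip(x, y)) ** exponent


def identity_score(a, b):
    """0/1 string kernel: S = 1 if the residues match, 0 otherwise."""
    return 1 if a == b else 0


# Illustrative excerpt only; not the full 20 x 20 substitution matrix.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("S", "S"): 4, ("T", "T"): 5,
    ("A", "S"): 1, ("A", "T"): 0, ("S", "T"): 1,
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Substitution matrix kernel: S = matrix entry for the residue pair."""
    # The matrix is symmetric, so look the pair up in either order.
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


k_identity = window_kernel("AST", "ATT", identity_score)
k_identity_sq = window_kernel("AST", "ATT", identity_score, exponent=2)
k_subst = window_kernel("AST", "ATT", substitution_score)
```

Passing the scoring function as an argument keeps the two kernels as one code path, mirroring how the formula shares its outer form and differs only in the definition of S.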

[Figure: Types of glycosylation]
• N-linked glycosylation: N-acetylglucosamine (N-GlcNAc)
• O-linked glycosylation: O-N-acetylgalactosamine (O-GalNAc), O-N-acetylglucosamine (O-GlcNAc), O-fucose, O-glucose, O-mannose, O-hexose, O-xylose
• C-mannosylation: C-mannose
• GPI anchor

Results
[Figures: ROC Curves for N-Linked; ROC Curves for O-Linked; ROC Curves for C-Linked; Comparison of ROC Curves for single and ensemble classifier]
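For readers who want to reproduce curves of this kind, here is a short sketch of how ROC points can be computed from classifier scores by sweeping a decision threshold. The scores and labels are invented for illustration, and the area under the curve is included only as a common summary statistic; the poster itself reports no numeric values.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last example sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented scores: higher means "more likely glycosylated".
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_truth = [1, 0, 1, 0]
curve = roc_points(toy_scores, toy_truth)
area = auc(curve)
```

Comparing a single classifier with an ensemble then amounts to computing one such curve per model on the same test sites.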

Conclusion
In this work we addressed the problem of predicting glycosylation sites. Three types of machine learning algorithms were used: SVM, NB, and DT. We built predictive ensemble classifiers based on data corresponding to three forms of glycosylation: O-, N-, and C-Linked glycosylation. Our experiments show encouraging results.
