Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science
Rocky 2006
Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs
Glycosylation Site Prediction using Machine Learning Approaches
Cornelia Caragea, Jivko Sinapov, Adrian Silvescu, Drena Dobbs and Vasant Honavar
Biological Motivation
Glycosylation is one of the most complex post-translational modifications (PTMs). It is the site-specific enzymatic addition of saccharides to proteins and lipids. Most proteins in eukaryotic cells undergo glycosylation.
Types of Glycosylation
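[Figure: a protein chain drawn residue by residue from the N-terminus (H3N+) to the C-terminus (COO-), with candidate sites marked by the questions "Glycosylated? Non-Glycosylated?" and "N-linked? O-linked? C-linked?"]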
Problem: Predict glycosylation sites from amino acid sequence
Previous Approaches
• Trained Neural Networks used in the netOglyc prediction server (Hansen et al., 1995)
  • Dataset: mucin type O-linked glycosylation sites in mammalian proteins
• Trained SVMs based on physical properties, the 0/1 system, and a combination of the two (Li et al., 2006)
  • Dataset: mucin type O-linked glycosylation sites in mammalian proteins
  • Negative examples extracted from sequences with no known glycosylated sites
  • Trained/tested using different ratios of positive and negative sites
Our Approach
• We investigate three types of glycosylation and use an ensemble classifier approach
• Dataset: N-, C- and O-linked glycosylation sites in proteins from several different species: human, rat, mouse, insect, worm, horse, etc.
• Negative examples extracted from sequences with at least one experimentally verified glycosylated site
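The way positive and negative examples are drawn from a sequence can be pictured as sliding a fixed-width window over every candidate residue (S/T for O-linked, N for N-linked, W for C-linked) and labelling each window by whether the site is annotated as glycosylated. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the name `extract_windows`, the half-width of 3, the `-` padding character, and the 0-based annotation positions are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Candidate residues per glycosylation type, as listed in the dataset table.
CANDIDATES: Dict[str, str] = {"O": "ST", "N": "N", "C": "W"}


def extract_windows(
    seq: str,
    glyco_positions: Set[int],
    kind: str,
    half_width: int = 3,   # assumed window half-width; not given in the source
    pad: str = "-",        # assumed padding symbol for sequence ends
) -> List[Tuple[str, bool]]:
    """Return (window, is_glycosylated) pairs for one protein sequence.

    glyco_positions holds the 0-based indices of experimentally verified
    sites of the given kind. Sequences without any verified site are
    skipped, so negatives only come from proteins with at least one
    verified site.
    """
    if not glyco_positions:
        return []
    padded = pad * half_width + seq + pad * half_width
    examples: List[Tuple[str, bool]] = []
    for i, residue in enumerate(seq):
        if residue in CANDIDATES[kind]:
            # Residue i of seq sits at index i + half_width of padded,
            # so this slice is centred on the candidate residue.
            window = padded[i:i + 2 * half_width + 1]
            examples.append((window, i in glyco_positions))
    return examples
```

For instance, calling `extract_windows("ASTNW", {1}, "O", half_width=2)` yields one positive window centred on the serine and one negative window centred on the threonine.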
Dataset
O-GlycBase v6.00: O-, N- & C-glycosylated proteins with 242 glycosylated entries, available at http://www.cbs.dtu.dk/databases/OGLYCBASE/Oglyc.base.html
Glycosylation Type   Positive Sites   Negative Sites
O-Linked (S/T)                 2098            11623
N-Linked (N)                    251             1430
C-Linked (W)                     47               73
Total                          2366            13126
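[Figure: Training an ensemble classifier. The Train DB is sampled into subsets S1, S2, S3, ..., Sk; one classifier is trained on each subset, giving a Bag of Trained Classifiers C1, C2, C3, ..., Ck. Instances from the Test DB are passed to the bag, and the individual outputs are combined by a Weighted Majority Vote into the final Predictions.]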
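The flow in the ensemble diagram (sample, train one classifier per sample, combine by weighted majority vote) can be sketched in a few lines. This is an illustration only, under assumptions the slides do not confirm: sampling is done here as a bootstrap with replacement, each classifier's vote weight is its accuracy on the full training set, and the base learner is a toy one-dimensional threshold rule (`train_stump`) standing in for the SVM, J48, and Naïve Bayes learners. All function names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[float, int]            # (toy 1-D feature, label in {0, 1})
Classifier = Callable[[float], int]


def train_stump(sample: List[Example]) -> Classifier:
    """Toy base learner: threshold halfway between the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:             # only one class seen: always predict it
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda x: 1 if x >= threshold else 0
    return lambda x: 1 if x < threshold else 0


def accuracy(clf: Classifier, data: List[Example]) -> float:
    return sum(clf(x) == y for x, y in data) / len(data)


def train_ensemble(
    train_db: List[Example], k: int, rng: random.Random
) -> List[Tuple[Classifier, float]]:
    """Draw k bootstrap samples and train one weighted classifier on each."""
    bag = []
    for _ in range(k):
        sample = [rng.choice(train_db) for _ in train_db]   # assumed scheme
        clf = train_stump(sample)
        bag.append((clf, accuracy(clf, train_db)))          # assumed weight
    return bag


def weighted_vote(bag: List[Tuple[Classifier, float]], x: float) -> int:
    """Sum the weights behind each label; ties go to the negative class."""
    votes = {0: 0.0, 1: 0.0}
    for clf, weight in bag:
        votes[clf(x)] += weight
    return 1 if votes[1] > votes[0] else 0
```

A typical call would be `bag = train_ensemble(train_db, k=5, rng=random.Random(0))` followed by `weighted_vote(bag, x)` for each test instance. Swapping `train_stump` for a stronger learner leaves the sampling and voting code unchanged, which is the main appeal of this design.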
Classifiers
• SVM
  • 0/1 String Kernel
  • Substitution Matrix Kernel
  • Blast - Polynomial Kernel
• J48
• Naïve Bayes
• Identity windows
• Identity plus additional information
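$$K(x, y) = \left( \sum_{i=1}^{|w|} S(x_i, y_i) \right)^{e}$$
where $S(x_i, y_i) = 1$ if $x_i = y_i$ and $0$ otherwise (0/1 string kernel), or
$S(x_i, y_i) = \mathrm{entry}(x_i, y_i)$ in the Blosum62 matrix (substitution matrix kernel).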
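The kernel formula can be made concrete with a short sketch: sum a per-position score over two equal-length windows and raise the sum to the power e. The code below is an illustration under assumptions, not the authors' implementation. The names `window_kernel`, `match_score`, and `matrix_score` are hypothetical, and `TOY_MATRIX` is a tiny placeholder table with made-up values, not the real Blosum62 matrix, which would be loaded from a proper source in practice.

```python
from typing import Callable, Dict, Tuple

Score = Callable[[str, str], float]


def match_score(a: str, b: str) -> float:
    """0/1 score: 1 if the residues are identical, 0 otherwise."""
    return 1.0 if a == b else 0.0


def matrix_score(table: Dict[Tuple[str, str], float]) -> Score:
    """Build a score function from a symmetric substitution table."""
    def score(a: str, b: str) -> float:
        # Look up (a, b), then (b, a); unknown pairs score 0 in this sketch.
        return table.get((a, b), table.get((b, a), 0.0))
    return score


def window_kernel(x: str, y: str, score: Score = match_score, e: int = 1) -> float:
    """Compute (sum_i S(x_i, y_i)) ** e over two windows of equal length |w|."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return sum(score(a, b) for a, b in zip(x, y)) ** e


# Placeholder values for illustration only; NOT the Blosum62 entries.
TOY_MATRIX: Dict[Tuple[str, str], float] = {
    ("S", "S"): 4.0,
    ("S", "T"): 1.0,
    ("T", "T"): 5.0,
}
```

With the default score, `window_kernel("ASTNW", "ASTQW", e=2)` counts four matching positions and squares the count. Passing `score=matrix_score(TOY_MATRIX)` switches to substitution-table scoring without touching the kernel itself.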
[Figure: types of glycosylation, drawn as a tree]
Glycosylation
• N-linked glycosylation
  • N-acetylglucosamine (N-GlcNAc)
• O-linked glycosylation
  • O-N-acetylgalactosamine (O-GalNAc)
  • O-N-acetylglucosamine (O-GlcNAc)
  • O-fucose
  • O-glucose
  • O-mannose
  • O-hexose
  • O-xylose
• C-mannosylation
  • C-mannose
• GPI anchor
Results
[Figures: ROC Curves for N-Linked; ROC Curves for O-Linked; ROC Curves for C-Linked; Comparison of ROC Curves for single and ensemble classifier]
Conclusion
In this work we addressed the problem of predicting glycosylation sites. Three types of machine learning algorithms were used: support vector machines (SVM), Naïve Bayes (NB), and decision trees (DT). We built predictive ensemble classifiers based on data corresponding to three forms of glycosylation: O-, N-, and C-linked glycosylation. Our experiments show encouraging results.