![Page 1: Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization](https://reader036.vdocuments.mx/reader036/viewer/2022081520/56816587550346895dd84001/html5/thumbnails/1.jpg)
Semantic Similarity over Gene Ontology for Multi-label Protein
Subcellular Localization
Shibiao WAN and Man-Wai MAK, The Hong Kong Polytechnic University
Sun-Yuan KUNG, Princeton University
Outline
1. Introduction and Motivation
2. Retrieval of GO Terms
3. Semantic Similarity Measures
4. Multi-label Multi-Class Classification
5. Results
6. Conclusions
Proteins and Their Subcellular Locations
Subcellular Localization Prediction
• The subcellular locations of proteins help biologists to elucidate the functions of proteins.
• Identifying the subcellular locations by entirely experimental means is time-consuming and costly.
• Computational methods are necessary for subcellular localization prediction.
• Previous research has found that gene ontology (GO) based methods outperform methods based on other protein features (e.g., amino acid composition).
Multi-label Problem
• Some proteins can simultaneously reside at, or move between, two or more subcellular locations.
• Multi-label (multi-location) proteins play important roles in some metabolic processes taking place in multiple subcellular locations.
• State-of-the-art multi-label predictors, such as Plant-mPLoc, iLoc-Plant, and mGOASVM, use frequency counts of GO terms as features.
• In this work, we propose using semantic similarity of GO terms as features for multi-label subcellular localization prediction.
Method’s Flowchart
[Flowchart: GO extraction by searching the GOA database, using either the protein's accession number (AC) or the AC of a homolog found by BLAST against the Swiss-Prot database; a semantic similarity (SS) measure against the GO terms of the training proteins produces a semantic similarity vector; a multi-label SVM (a set of SVMs) outputs the subcellular location(s).]
Gene Ontology
• Gene ontology is a set of standardized vocabularies annotating the functions of genes and gene products.
• GO terms, e.g., GO:0000187.
• A protein sequence may correspond to 0, 1 or many GO terms.
Gene Ontology: Example
Search for GO:0000187 at http://www.geneontology.org/
GOA Database
• Gene Ontology Annotation database.
– Provides structured annotations to proteins in UniProt Knowledgebase (UniProtKB) and other protein databases using standardized GO vocabularies.
– Includes a series of cross-references to other databases.
• Given an Accession Number, the GOA database allows us to find a set of GO terms associated with that accession number.
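The lookup described above can be sketched in a few lines, assuming the annotations are available as lines in the tab-separated GAF format (DB object ID in column 2, GO ID in column 5, header lines starting with `!`). The helper name `load_goa_annotations` and the sample records, including the accession numbers, are made up for illustration and are not real GOA entries.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_goa_annotations(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each accession number (AC) to the set of GO terms annotated to it."""
    ac_to_go: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete annotation record
        ac, go_id = fields[1], fields[4]  # GAF columns 2 and 5
        ac_to_go[ac].add(go_id)
    return dict(ac_to_go)


# Toy annotation lines (hypothetical accession numbers, not real GOA records).
SAMPLE_GAF = [
    "!gaf-version: 2.2",
    "UniProtKB\tX00001\tGENE1\t\tGO:0005634\tPMID:0\tIDA\t\tC",
    "UniProtKB\tX00001\tGENE1\t\tGO:0005737\tPMID:0\tIDA\t\tC",
    "UniProtKB\tX00002\tGENE2\t\tGO:0005634\tPMID:0\tIDA\t\tC",
]

annotations = load_goa_annotations(SAMPLE_GAF)
print(sorted(annotations["X00001"]))
```

Storing the GO terms as a set per accession number reflects the point that one AC may map to many GO terms.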
GOA Database
Accession Number (AC) → GO term(s)
Search for A0M8T9 at http://www.ebi.ac.uk/GOA/
One AC maps to many GO terms!
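Finding GO Terms without an Accession Number
[Flowchart: GO extraction by searching the GOA database. The query protein Qi is searched with BLAST against the Swiss-Prot database, and the AC of the retrieved homolog is used to look up the GO terms of Qi in the GOA database.]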
Semantic Similarity Measure
[Flowchart: for GO term x and GO term y, ancestors are retrieved from the GO database by SQL query; the common ancestors A(x,y) are found and then used to compute the semantic similarity sim(x,y).]
Finding Common Ancestors, A(x,y)
[Figure: ancestors of GO:0000187 in the GO graph, linked by is_a and part_of relationships.]
Semantic Similarity Measure
We use Lin’s measure to estimate the semantic similarity between two GO terms (x and y):
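For orientation, Lin's measure is commonly written as sim(x,y) = max over c in A(x,y) of 2·log p(c) / (log p(x) + log p(y)), where p(c) is the probability (relative annotation frequency) of term c. The sketch below follows that common form. The toy probabilities, the term names, and the choice to return 0 when both terms have probability 1 are assumptions for illustration, not values or conventions taken from the slides.

```python
import math
from typing import Dict, Set


def lin_similarity(x: str, y: str,
                   common_ancestors: Set[str],
                   prob: Dict[str, float]) -> float:
    """Lin's similarity between GO terms x and y, given A(x, y) and term probabilities."""
    if not common_ancestors:
        return 0.0  # no shared ancestor: treated as dissimilar
    denom = math.log(prob[x]) + math.log(prob[y])
    if denom == 0.0:
        # Both terms have probability 1 (e.g. the root); returning 0 is an assumption.
        return 0.0
    # The most informative common ancestor gives the largest ratio.
    return max(2.0 * math.log(prob[c]) / denom for c in common_ancestors)


# Toy hierarchy: "b" and "c" are children of "a", which is a child of "root".
TOY_PROB = {"root": 1.0, "a": 0.5, "b": 0.25, "c": 0.25}

sim_bc = lin_similarity("b", "c", {"root", "a"}, TOY_PROB)
sim_bb = lin_similarity("b", "b", {"root", "a", "b"}, TOY_PROB)
print(sim_bc, sim_bb)
```

Here a term is counted among its own ancestors, so that the similarity of a term with itself is 1; that convention is also an assumption of this sketch.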
Semantic Similarity between 2 Proteins
Semantic similarity between 2 proteins (Gi, Gj):
Semantic Similarity Vector (dimension: no. of training proteins):
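One way to turn term-level similarities into a protein-level similarity is a best-match aggregation: each GO term of one protein is matched with its most similar term in the other protein, and the two directed scores are normalized by the self-scores. The sketch below assumes that form; it is an illustration and not necessarily the exact formula on the slide. The function names and the toy term-level similarity are hypothetical. The similarity vector has one entry per training protein, as the slide indicates.

```python
from typing import Callable, List, Sequence, Set

TermSim = Callable[[str, str], float]


def directed_score(g_i: Set[str], g_j: Set[str], sim: TermSim) -> float:
    """Sum, over terms x in g_i, of the best similarity to any term in g_j."""
    if not g_i or not g_j:
        return 0.0
    return sum(max(sim(x, y) for y in g_j) for x in g_i)


def protein_similarity(g_i: Set[str], g_j: Set[str], sim: TermSim) -> float:
    """Symmetric, normalized similarity between two proteins' GO-term sets."""
    denom = directed_score(g_i, g_i, sim) + directed_score(g_j, g_j, sim)
    if denom == 0.0:
        return 0.0
    return (directed_score(g_i, g_j, sim) + directed_score(g_j, g_i, sim)) / denom


def similarity_vector(query: Set[str], training: Sequence[Set[str]],
                      sim: TermSim) -> List[float]:
    """One similarity per training protein; length = no. of training proteins."""
    return [protein_similarity(query, g, sim) for g in training]


def toy_sim(x: str, y: str) -> float:
    """Hypothetical term-level similarity used only for this example."""
    if x == y:
        return 1.0
    return 0.5 if {x, y} == {"b", "c"} else 0.0


training_sets = [{"b"}, {"c"}, {"b", "c"}]
vector = similarity_vector({"b"}, training_sets, toy_sim)
print(vector)
```

Any term-level measure, such as Lin's, can be passed in as `sim`.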
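Multi-label SVM Scoring
[Figure: the GO terms of the query protein Qt are compared with the GO terms of the training proteins to form the input for SVM scoring.]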
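A decision rule often paired with per-class SVM scores in multi-label prediction is to output every class with a positive score and, if no score is positive, to fall back to the single top-scoring class. The sketch below assumes that rule; the slide's own scoring formula is not reproduced here, and the function name `predict_labels` is hypothetical.

```python
from typing import Sequence, Set


def predict_labels(scores: Sequence[float]) -> Set[int]:
    """Turn per-class SVM scores into a set of predicted class indices."""
    if not scores:
        raise ValueError("at least one class score is required")
    positive = {m for m, s in enumerate(scores) if s > 0}
    if positive:
        return positive  # every class with a positive score is predicted
    # No positive score: fall back to the single best class (assumed rule).
    return {max(range(len(scores)), key=lambda m: scores[m])}


print(predict_labels([0.3, -1.2, 0.8]))
print(predict_labels([-0.5, -0.1, -2.0]))
```

The fallback guarantees that every query protein receives at least one location.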
Benchmark Datasets
The Plant dataset
Performance Metrics
Overall locative accuracy:
Overall actual accuracy:
Actual accuracy is more objective and stricter!
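The two metrics can be illustrated with a small sketch, assuming the definitions commonly used for multi-label localization benchmarks: locative accuracy counts correctly predicted locations over all true locations, while actual accuracy counts proteins whose predicted set of locations matches the true set exactly. The function names and the toy predictions are assumptions for illustration.

```python
from typing import Sequence, Set


def overall_locative_accuracy(predicted: Sequence[Set[str]],
                              actual: Sequence[Set[str]]) -> float:
    """Correctly predicted locations divided by the total number of true locations."""
    total_true = sum(len(a) for a in actual)
    correct = sum(len(p & a) for p, a in zip(predicted, actual))
    return correct / total_true


def overall_actual_accuracy(predicted: Sequence[Set[str]],
                            actual: Sequence[Set[str]]) -> float:
    """Fraction of proteins whose predicted location set equals the true set."""
    exact = sum(1 for p, a in zip(predicted, actual) if p == a)
    return exact / len(actual)


# Toy example: one exact match, one under-prediction, one over-prediction.
actual_sets = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"chloroplast"}]
predicted_sets = [{"nucleus"}, {"nucleus"}, {"chloroplast", "mitochondrion"}]

ola = overall_locative_accuracy(predicted_sets, actual_sets)
oaa = overall_actual_accuracy(predicted_sets, actual_sets)
print(ola, oaa)
```

In the toy example the over-predicted third protein still counts as correct for locative accuracy but not for actual accuracy, which is why the latter is the stricter of the two.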
Performance Comparison
The Plant dataset
Conclusions
• Our proposed predictor performs significantly better than Plant-mPLoc and iLoc-Plant, and also better than mGOASVM, in terms of locative and actual accuracies.
• The individual locative accuracies of our proposed predictor are significantly higher than those of the three predictors for all 12 locations.
• In terms of GO information extraction, Plant-mPLoc, iLoc-Plant and mGOASVM use the occurrences of GO terms as features, whereas the proposed predictor discovers the semantic relationship between GO terms, from which the semantic similarity between proteins can be obtained.
Web Servers
Thank you!
Multi-label SVM Classifier
Transformed labels for M-class problem:
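A common way to transform labels for an M-class multi-label problem is to train one binary SVM per class, with label +1 for proteins that reside in that location and -1 otherwise. The sketch below assumes that transformation; the function name `transform_labels` and the integer class indices are hypothetical.

```python
from typing import List, Sequence, Set


def transform_labels(label_sets: Sequence[Set[int]],
                     num_classes: int) -> List[List[int]]:
    """Row i, column m is +1 if protein i has class m, otherwise -1."""
    return [
        [1 if m in labels else -1 for m in range(num_classes)]
        for labels in label_sets
    ]


# Two toy proteins: the first in class 0 only, the second in classes 1 and 2.
targets = transform_labels([{0}, {1, 2}], num_classes=3)
print(targets)
```

Column m of the result is the target vector for the m-th SVM, so a multi-location protein is a positive example for several SVMs at once.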
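Retrieving GO Terms with/without AC
[Flowchart: if the AC is known, a set of GO terms is retrieved for it (k = 0). Otherwise, homologs are retrieved by BLAST and the k-th homolog is used, starting from k = 1. If the retrieved set of GO terms is empty, k is incremented (k = k + 1) and the next homolog is tried, up to k_max. A non-empty set of GO terms is passed to multi-label SVM classification; if k exceeds k_max, back-up methods are used.]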
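The retrieval procedure can be sketched as a short loop. The two callables stand in for a GOA lookup and a BLAST search and are hypothetical, as are the toy accession numbers; falling through to homologs when a known AC has no GO terms is an assumption about the flowchart.

```python
from typing import Callable, List, Optional, Set, Tuple


def retrieve_go_terms(ac: Optional[str],
                      sequence: str,
                      lookup_go: Callable[[str], Set[str]],
                      find_homologs: Callable[[str], List[str]],
                      k_max: int) -> Tuple[Set[str], Optional[int]]:
    """Return (GO terms, k); k = 0 means the protein's own AC was used."""
    if ac is not None:
        terms = lookup_go(ac)
        if terms:
            return terms, 0
    # AC unknown, or no GO terms found for it: try homologs in ranked order.
    homologs = find_homologs(sequence)
    for k, homolog_ac in enumerate(homologs[:k_max], start=1):
        terms = lookup_go(homolog_ac)
        if terms:
            return terms, k
    return set(), None  # nothing found: the caller should use a back-up method


# Toy stand-ins for the GOA lookup and the BLAST search (hypothetical data).
TOY_GOA = {"X00001": {"GO:0005634"}, "H2": {"GO:0005737"}}


def toy_lookup(ac: str) -> Set[str]:
    return TOY_GOA.get(ac, set())


def toy_homologs(sequence: str) -> List[str]:
    return ["H1", "H2", "H3"]


print(retrieve_go_terms(None, "MKT", toy_lookup, toy_homologs, k_max=3))
```

Returning the index k alongside the GO terms makes it easy to see whether the terms came from the protein itself or from one of its homologs.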
Finding Common Ancestors
• The relationships between GO terms in the GO hierarchy can be obtained from the SQL database through the link: http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz.
• We only considered the ‘is-a’ relationship.
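Once the 'is-a' relationships have been loaded, finding the common ancestors A(x,y) amounts to intersecting the ancestor sets of the two terms. The sketch below uses a small in-memory child-to-parents map in place of the SQL tables; the toy term names and the convention that a term counts as its own ancestor are assumptions for illustration.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All terms reachable from `term` via is-a edges, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_ancestors(x: str, y: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """A(x, y): the terms that are ancestors of both x and y."""
    return ancestors(x, parents) & ancestors(y, parents)


# Toy is-a graph (child -> parents); "d" has two parents, so this is a DAG, not a tree.
TOY_IS_A = {"a": {"root"}, "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

print(sorted(common_ancestors("b", "c", TOY_IS_A)))
```

The explicit `seen` set keeps the traversal from revisiting terms that are reachable through more than one parent.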