Prediction of proteins that participate in the learning process by machine learning
Post on 12-Jan-2016
Dan Evron, Miri Michaeli
Project Advisors: Dr. Gal Chechik, Ossnat Bar Shira
Biological Background
• A synapse is a junction between two neurons.
• How does synaptic transmission work?
Hebbian theory
Donald Hebb:
“When an axon of cell A is near enough to excite B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
Synaptic Plasticity
• Synaptic plasticity is the ability of a synapse to change in strength through molecular alterations.
• What kind of alterations happen during synaptic plasticity?
Synaptic Plasticity – Changes
• Presynaptic release probability.
• The number of postsynaptic receptors.
• Properties of postsynaptic receptors.
Examples
• A change in the probability of glutamate release.
• Insertion or removal of postsynaptic AMPA receptors.
• Phosphorylation and dephosphorylation inducing a change in AMPA receptor conductance.
What is the connection to learning and memory?
Synaptic plasticity is one of the important neurochemical foundations of learning and memory.
Learning in Aplysia
• Habituation
• Sensitization
• Classical conditioning
• All found in the gill withdrawal reflex!
• Kandel’s work connects organism-level learning to cellular-level learning!
And what about us?
• In mammals:
• Many of the pathways are far from understood.
• A much bigger and more complex nervous system.
• Research shows that many principles are the same (LTP/LTD in the hippocampus).
Project Idea & Goal
Biological research has identified many proteins connected to the biological pathways involved in learning in the neuron and the synapse. Yet these pathways are far from understood, and many components are missing.
Our goal is to find candidate proteins that may take part in these pathways but have not been discovered yet.
How will we do that?
1. Collect numerical data on the organism’s proteins.
2. Collect ontologies about synaptic plasticity.
3. Label each gene as related / not related to the synaptic ontologies (according to the data).
4. Use an SVM as a classifier.
5. Search for false-positive genes in the results.
6. Publish a great article and win a Nobel prize! (or just dream about it…)
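The classification step (4) can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the project's actual code: the feature matrix below is random stand-in data for the per-gene numerical features, and the labels are synthetic.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy stand-in for per-gene features (e.g. PPI / expression values)
# and binary labels (related / not related to a synaptic ontology).
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Linear SVM with decision scores, cross-validated so that each
# gene is scored by a model that never saw it during training.
clf = SVC(kernel="linear")
scores = cross_val_predict(clf, X, y, cv=5, method="decision_function")
print(f"cross-validated AUC: {roc_auc_score(y, scores):.2f}")
```

Cross-validated decision scores are what make the later false-positive search meaningful: a gene ranked highly by a model it was not trained on is a genuine candidate, not a memorized label.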
Our research organism is…Mus musculus
AKA...
The house mouse!
Tools & Databases
GEO (Gene Expression Omnibus)
MGI (Mouse Genome Informatics)
GO (Gene Ontology)
MPPDB (Mouse Protein-Protein Interaction Database)
SynDB (Synapse Database)
Tools & Databases
• Classifier: SVM (Support Vector Machine)
The project had two main phases:
• Phase 1:
– Work only on PPI data
– Create a baseline for further work
• Phase 2:
– Increase our PPI data
– Add another data type: gene expression
– Combine the PPI and GE data
– Try to improve the prediction!
Phase 1:
• Extract PPI data from BioGRID
• Label the matrix for each ontology
• Run the SVM algorithm on the sets
• Calculate baseline
Phase 1 – results
• Most ontologies had only a few related genes – problematic.
• Baseline:
[Bar chart: baseline SVM prediction (scale 0–90) per ontology: endoplasmic reticulum, ion channel activity, G protein coupled receptor protein signaling pathway]
Phase 2
Will another type of data improve the results?
Gene expression
Step 1 – extracting data
– A representative set of mouse proteins from MGI.
– Gene expression data from experiments related to synaptic and neuronal learning.
– Mouse protein–protein interaction (PPI) data from several databases.
– Gene ontologies from GO.
– Synaptic ontologies from SynDB.
Step 2 – processing data
• Each gene expression dataset comes in separate files – they need to be combined.
• Normalize the gene expression data.
• Create the PPI matrix.
• Convert the PPI proteins to genes.
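The normalization step can be sketched as per-gene z-scoring, assuming the separate expression files have already been combined into one genes × samples matrix; the slides do not state which normalization method was actually used, so this is one common choice, not the project's.

```python
import numpy as np

def zscore_normalize(expr: np.ndarray) -> np.ndarray:
    """Z-score each gene (row) across samples so that experiments
    measured on different scales become comparable."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for flat genes
    return (expr - mean) / std

# Toy genes x samples matrix combined from separate files;
# the two genes differ only by a 10x scale factor.
expr = np.array([[1.0, 2.0, 3.0],
                 [10.0, 20.0, 30.0]])
norm = zscore_normalize(expr)
print(norm.round(2))
```

After z-scoring, the two rows become identical, which is the point: scale differences between experiments no longer dominate the features the SVM sees.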
Step 3 - combine the data
According to the list of genes:
– A matrix that combines PPI & GE where each gene has at least one data type (“union”).
– A matrix that combines PPI & GE where each gene has both data types (“intersect”).
– PPI matrices derived from the two matrices above.
– GE matrices derived from the two matrices above.
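Assuming both data types are reduced to per-gene feature tables, the “union” and “intersect” matrices can be sketched with pandas joins; the gene names and feature columns here are illustrative, not the project's actual features.

```python
import pandas as pd

# Toy per-gene feature tables; index = gene symbols (illustrative).
ppi = pd.DataFrame({"ppi_degree": [5, 2, 7]},
                   index=["Nrgn", "Cadps", "Gria1"])
ge = pd.DataFrame({"expr_mean": [1.2, 0.4]},
                  index=["Nrgn", "Dlg4"])

# "union": a gene needs at least one data type (missing values -> 0 here).
union = ppi.join(ge, how="outer").fillna(0.0)

# "intersect": a gene needs both data types.
intersect = ppi.join(ge, how="inner")

print(sorted(union.index))      # genes with either data type
print(sorted(intersect.index))  # genes with both data types
```

The choice of fill value for the union matrix (zero here) is itself a modeling decision; the slides do not say how missing data were represented.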
Step 4 - labeling the data
• For each set and each ontology, we labeled the genes (related / not related).
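The labeling can be sketched as a membership lookup against an ontology's annotated gene set; the ontology term and gene names below are illustrative.

```python
# Genes annotated to a synaptic ontology term (illustrative set).
synaptic_vesicle_genes = {"Syn1", "Syp", "Cadps"}

def label_genes(genes, ontology_genes):
    """1 = related to the ontology, 0 = not related."""
    return [1 if g in ontology_genes else 0 for g in genes]

genes = ["Syn1", "Actb", "Cadps", "Gapdh"]
labels = label_genes(genes, synaptic_vesicle_genes)
print(labels)  # [1, 0, 1, 0]
```

One label vector is produced per ontology, so each ontology defines its own binary classification task for the SVM.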
Step 5 – run the SVM algorithm on the sets
Step 6 - process the results
• Evaluate prediction success (AUC).
• Find potential false positive candidates.
So how did we do?
First we have to build a ROC curve…
What is ROC?
ROC = Receiver Operating Characteristic.
• Our SVM builds a ROC curve – a graphical plot of the sensitivity (true positive rate) vs. 1 − specificity (false positive rate).
• During its run, the SVM calculates the AUC of the ROC curve it produces after classification.
What is AUC?
• AUC = Area Under the Curve.
• The AUC is a way to evaluate the accuracy of the learning model by averaging the prediction precision.
• A useful classifier’s AUC lies between 0.5 and 1: 0.5 means the test performs at chance level (equal to tossing a coin!), and 1 indicates perfect prediction.
• The AUC enables us to examine and compare SVM results.
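As an example, the AUC of a small set of decision scores can be computed with scikit-learn's `roc_auc_score` (toy numbers, not the project's data):

```python
from sklearn.metrics import roc_auc_score

# Toy labels (1 = related to the ontology) and SVM decision scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]

# AUC = probability that a random related gene is scored
# above a random unrelated gene (here 8 of the 9 pairs).
auc = roc_auc_score(y_true, scores)
print(round(auc, 3))  # 0.889
```

This pairwise-ranking reading of the AUC is what makes it threshold-free, and therefore suitable for comparing SVM runs across ontologies with very different numbers of related genes.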
Results
• Intersect of the data:
– The size of all 3 matrices is similar – enables comparison.
– Average AUC: GE alone: 75%; PPI alone: 63%; GE + PPI: 75%
Results – intersect
[Bar chart: comparison of AUC (0.00–0.90) for PPI, GE, and GE + PPI across GO terms: endosome, mitochondrial inner membrane, G-protein coupled receptor activity, G-protein coupled receptor protein, ion channel activity, voltage-gated ion channel activity]
Results
• Union of the data:
– Close to reality in the number of genes (14K in the matrices, 15K in the representative list).
– Average AUC for GE alone = GE + PPI = 74%.
– The matrix-size issue.
– PPI alone corresponded to different GO categories, so it cannot be compared.
Results – union
[Bar chart: comparison of AUC (0.00–1.00) for GE and GE + PPI across GO terms: endoplasmic reticulum membrane, mitochondrial respiratory chain, calcium channel activity, extracellular ligand-gated ion channel activity, cation channel activity, synaptic vesicle, neurotransmitter receptor activity]
Conclusions
• We can compare between different types of data only from the “intersect” matrices.
• In the intersect, the PPI data sets the size, therefore we have the same GO categories.
• In the union, the GE size took over the PPI data, which is why the GO categories differ (the GO categories in both PPI matrices are the same).
• PPI did not contribute to the prediction! (bad news…)
The good news…
• Still, 75% is a nice accuracy!
• We found several false-positive genes that may be related to synaptic plasticity but have not yet been identified as such.
Examples:
– Neurogranin (NRGN)
– CADPS
Neurogranin (NRGN)
Acts as a "third messenger" substrate of protein kinase C-mediated molecular cascades during synaptic development and remodeling. Binds to calmodulin in the absence of calcium.
Ca²⁺-dependent secretion activator (CADPS)
Calcium-binding protein involved in the exocytosis of vesicles filled with neurotransmitters and neuropeptides. Probably acts upstream of fusion in the biogenesis or maintenance of mature secretory vesicles.
Next steps…
• Computationally:
– Improve the classification by adding new types of data and/or by a different representation of the data.
• Biologically:
– Explore through biological experiments the proteins we have found (the FP list).