Prediction of proteins that participate in the learning process by machine learning

Post on 12-Jan-2016



Dan Evron, Miri Michaeli

Project Advisors: Dr. Gal Chechik, Ossnat Bar Shira

Biological Background

• A synapse is a junction between 2 neurons.

• How does synaptic transmission work?

Hebbian theory

Donald Hebb:

"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased."

Synaptic Plasticity

• Synaptic plasticity is the ability of the synapse to change in strength through molecular alterations.

• What kind of alterations happen during synaptic plasticity?

Synaptic Plasticity changes

• Presynaptic release probability.

• The number of postsynaptic receptors.

• Properties of postsynaptic receptors.

Example

• Change in the probability of glutamate release.

• Insertion or removal of postsynaptic AMPA receptors.

• Phosphorylation and dephosphorylation that change AMPA receptor conductance.

What is the connection to learning and memory?

Synaptic plasticity is one of the important neurochemical foundations of learning and memory.

Learning in Aplysia

• Habituation

• Sensitization

• Classical conditioning

• All found in the gill withdrawal reflex!

• Kandel’s work connects organism-level learning to cellular-level learning!

And what about us?

• In mammals:

– Many of the pathways are far from understood.

– The nervous system is much bigger and more complex.

– Research shows that many principles are the same (LTP/LTD in the hippocampus).

Project Idea & Goal

Biological research has found many proteins which are connected to biological pathways involved in learning in the neuron and synapse. Yet, pathways are far from understood and many components are missing.

Our goal is to find candidate proteins that may take part in these pathways and have not been discovered yet.

How will we do that?

1. Collect numerical data on organism proteins.

2. Collect ontologies about synaptic plasticity.

3. Label each gene as related or not related to the synaptic ontologies (according to the data).

4. Use an SVM as the classifier.

5. Search for false positive genes in the results.

6. Publish a great article and win a Nobel prize! (or just dream about it…)

Our research organism is…Mus musculus

AKA...

The house mouse!

Tools & Databases

GEO (Gene Expression Omnibus)

MGI (Mouse Genome Informatics)

GO (Gene Ontology)

MPPDB (Mouse Protein-Protein Interaction Database)

SynDB (Synapse Database)

Tools & Databases

• Classifier: SVM (Support Vector Machine)

The project had 2 main phases:

• Phase 1:

– Work only on PPI data

– Create a baseline for further work

• Phase 2:

– Increase our PPI data

– Add another data type: gene expression (GE)

– Combine the PPI and GE data

– Try to improve prediction!

Phase 1:

• Extract PPI data from BioGRID

• Label the matrix for each ontology

• Perform SVM algorithm on the sets

• Calculate baseline

Phase 1 - results

• Most ontologies had only a few related genes, which is problematic.

• Baseline:

[Figure: baseline SVM prediction (vertical axis from 0 to 90) for the ontologies endoplasmic reticulum, ion channel activity, and G protein coupled receptor protein signaling pathway.]

Phase 2

Will another type of data improve the results?

Gene expression

Step 1 - extracting data

– Representative set of mouse proteins from MGI.

– Gene expression data from experiments related to synaptic and neuronal learning.

– Mouse protein-protein interaction (PPI) data from several databases.

– Gene ontologies from GO.

– Synaptic ontologies from SynDB.

Step 2 – processing data

• Each gene expression dataset comes in separate files, which need to be combined.

• Normalize the gene expression data.

• Create the PPI matrix.

• Map the proteins in the PPI data to genes.
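The processing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's code: the slides do not say how the expression data was normalized, so per-gene z-scoring is assumed here, and the function names `zscore_rows` and `build_ppi_matrix`, the protein identifiers, and the toy values are all hypothetical.

```python
from statistics import mean, pstdev

def zscore_rows(expr):
    """Normalize each gene's expression profile to mean 0 and standard deviation 1."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        # A flat profile has zero variance; map it to zeros instead of dividing by 0.
        out[gene] = [0.0 if sd == 0 else (v - mu) / sd for v in values]
    return out

def build_ppi_matrix(interactions, protein_to_gene):
    """Map protein pairs to gene pairs and return (genes, symmetric 0/1 matrix)."""
    pairs = set()
    for a, b in interactions:
        # Skip interactions whose proteins cannot be mapped to a gene.
        if a in protein_to_gene and b in protein_to_gene:
            pairs.add((protein_to_gene[a], protein_to_gene[b]))
    genes = sorted({g for pair in pairs for g in pair})
    index = {g: i for i, g in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in genes]
    for ga, gb in pairs:
        matrix[index[ga]][index[gb]] = 1
        matrix[index[gb]][index[ga]] = 1
    return genes, matrix

# Toy inputs (hypothetical identifiers and values).
expr = {"geneA": [1.0, 2.0, 3.0], "geneB": [5.0, 5.0, 5.0]}
ppi = [("P1", "P2"), ("P2", "P3"), ("P9", "P1")]  # P9 has no gene mapping
mapping = {"P1": "geneA", "P2": "geneB", "P3": "geneC"}

normalized = zscore_rows(expr)
genes, matrix = build_ppi_matrix(ppi, mapping)
```

Each row of `matrix` can then serve as the PPI feature vector of one gene, with one column per possible interaction partner.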

Step 3 - combine the data

According to the list of genes:

– A matrix that combines PPI and GE, where each gene has at least one data type (“union”)

– A matrix that combines PPI and GE, where each gene has both data types (“intersect”)

– PPI-only matrices taken from the two matrices above

– GE-only matrices taken from the two matrices above
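A small sketch of the "union" and "intersect" combinations might look like this. It assumes each data type is stored as a dictionary from gene to feature vector, and that a gene missing one data type in the union is filled with zeros; the slides do not state how missing values were handled, so that choice and the name `combine` are assumptions.

```python
def combine(ppi, ge, mode):
    """Concatenate PPI and GE feature vectors per gene.

    mode="union": keep genes with at least one data type (missing part zero-filled).
    mode="intersect": keep only genes that have both data types.
    """
    n_ppi = len(next(iter(ppi.values())))
    n_ge = len(next(iter(ge.values())))
    if mode == "union":
        genes = sorted(set(ppi) | set(ge))
    elif mode == "intersect":
        genes = sorted(set(ppi) & set(ge))
    else:
        raise ValueError("mode must be 'union' or 'intersect'")
    return {g: ppi.get(g, [0] * n_ppi) + ge.get(g, [0.0] * n_ge) for g in genes}

# Toy inputs (hypothetical genes and values).
ppi = {"geneA": [1, 0], "geneB": [0, 1]}
ge = {"geneB": [0.5, -0.5, 1.0], "geneC": [1.5, 0.0, -1.0]}

union = combine(ppi, ge, "union")
intersect = combine(ppi, ge, "intersect")
```

In this toy case the union keeps three genes and the intersect keeps only `geneB`, which mirrors why the two kinds of matrices end up with different sizes.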

Step 4 - labeling the data

• For each set, and each ontology we labeled the genes (related/non related).
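The labeling step can be illustrated as below. The annotation dictionary, the gene names, and the helper `label_genes` are hypothetical; the sketch assumes that a gene is "related" (+1) when it is annotated with the ontology term and "non related" (-1) otherwise.

```python
def label_genes(genes, annotations, term):
    """Return +1 for genes annotated with the term and -1 for all others."""
    return [1 if term in annotations.get(g, set()) else -1 for g in genes]

# Toy inputs (hypothetical genes; the term names are used only as examples).
genes = ["geneA", "geneB", "geneC", "geneD"]
annotations = {
    "geneA": {"synaptic vesicle", "ion channel activity"},
    "geneB": {"ion channel activity"},
    "geneC": {"endosome"},
}
terms = ["synaptic vesicle", "ion channel activity"]

# One label vector per ontology term, in the same order as `genes`.
labels_by_term = {t: label_genes(genes, annotations, t) for t in terms}

# Counting the related genes shows which terms have too few positives to learn from.
positives = {t: labels_by_term[t].count(1) for t in terms}
```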

Step 5 - perform SVM algorithm on the sets
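The slides do not say which SVM implementation or kernel was used. As a self-contained illustration only, the sketch below trains a linear SVM by stochastic sub-gradient descent on the regularized hinge loss; the function names, the hyperparameters (`lam`, `eta`, `epochs`), and the toy data are assumptions, and a real run would more likely rely on an established SVM library.

```python
import random

def decision(w, b, x):
    """Signed score of x: positive means 'related', negative means 'non related'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Minimize lam/2 * |w|^2 + hinge loss with stochastic sub-gradient steps.

    X: list of feature vectors, y: list of labels in {+1, -1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * decision(w, b, X[i])
            # Gradient of the L2 regularizer: shrink the weights slightly.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1:
                # The hinge loss is active: move the boundary away from this example.
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

# Toy, clearly separable data (hypothetical feature values).
X = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [1 if decision(w, b, x) >= 0 else -1 for x in X]
```

The continuous output of `decision` is what a ROC curve is built from, since it ranks genes from most to least likely to be related.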

Step 6 - process the results

• Evaluate prediction success (AUC).

• Find potential false positive candidates.
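Finding false positive candidates can be sketched as a simple filter: genes labeled "non related" that the classifier nevertheless scores as related. The helper name, the threshold of 0, and the toy scores are assumptions made for illustration.

```python
def false_positive_candidates(genes, labels, scores, threshold=0.0):
    """Return (gene, score) pairs labeled -1 but scored above the threshold,
    sorted from the highest score to the lowest."""
    hits = [(g, s) for g, y, s in zip(genes, labels, scores)
            if y == -1 and s > threshold]
    return sorted(hits, key=lambda pair: -pair[1])

# Toy inputs (hypothetical genes, labels and classifier scores).
genes = ["geneA", "geneB", "geneC", "geneD"]
labels = [1, -1, -1, -1]
scores = [1.2, 0.8, -0.5, 0.3]

candidates = false_positive_candidates(genes, labels, scores)
```

Here `geneB` and `geneD` would be reported, with `geneB` first because the classifier scores it higher.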

So how did we do?

First we have to build a ROC curve.

What is ROC?

ROC = Receiver Operating Characteristic.

• Our SVM builds a ROC curve: a graphical plot of the sensitivity (true positive rate) against 1 − specificity (false positive rate) as the decision threshold varies.

• After classification, the SVM run calculates the AUC of the resulting ROC curve.

What is AUC?

• AUC = Area Under the Curve.

• The AUC evaluates the learning model with a single number: the probability that a randomly chosen related gene is scored higher than a randomly chosen non-related gene.

• The AUC ranges from 0 to 1. A value of 0.5 means the classifier does no better than tossing a coin, 1 indicates perfect separation, and values below 0.5 mean the ranking is worse than chance.

• The AUC enables us to examine and compare SVM results.
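To make the ROC and AUC definitions concrete, here is a minimal sketch that builds the ROC points from classifier scores and integrates them with the trapezoidal rule. The function names and the toy labels and scores are hypothetical, and it assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate),
    sweeping the threshold from the highest score to the lowest."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Examples with identical scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data: 3 related genes (+1) and 3 non-related genes (-1).
labels = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc(roc_points(labels, scores))
```

For this toy data, 8 of the 9 (related, non-related) pairs are ranked correctly, so `area` equals 8/9, about 0.89.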

Results

• Intersect of the data:

– The size of all 3 matrices is similar, which enables comparison.

– Average AUC: GE alone 75%, PPI alone 63%, GE + PPI 75%

Results - intersect

[Figure: comparison of AUC in GE, PPI and GE+PPI. Horizontal axis: GO terms (endosome, mitochondrial inner membrane, G-protein coupled receptor activity, G-protein coupled receptor protein …, ion channel activity, voltage-gated ion channel activity). Vertical axis: AUC, from 0.00 to 0.90. Series: PPI, GE, GE + PPI.]

Results

• Union of the data:

– Close to reality in the number of genes (14K in the matrices, 15K in the representative list)

– Average AUC: GE alone = GE + PPI = 74%

– The matrix size issue

– PPI alone corresponded to different GO categories, so it cannot be compared.

Results - union

[Figure: comparison of AUC in GE and GE+PPI. Horizontal axis: GO terms (endoplasmic reticulum membrane, mitochondrial respiratory chain, calcium channel activity, extracellular ligand-gated ion channel activity, cation channel activity, synaptic vesicle, neurotransmitter receptor activity). Vertical axis: AUC, from 0.00 to 1.00. Series: GE, GE + PPI.]

Conclusions

• We can compare different types of data only in the “intersect” matrices.

• In intersect, the PPI data sets the size, so all matrices cover the same GO categories.

• In union, the GE data outweighs the PPI data in size, which is the reason for the different GO categories (the GO categories in both PPI matrices are the same).

• PPI did not contribute to prediction!

(bad news…)

The good news…

• Still, an average AUC of 75% is a good result!

• We found several false positive genes that may be related to synaptic plasticity but have not yet been identified as such.

Examples:

– Neurogranin (NRGN)

– CADPS

Neurogranin (NRGN)

Acts as a "third messenger" substrate of protein kinase C-mediated molecular cascades during synaptic development and remodeling. Binds to calmodulin in the absence of calcium.

Ca++-dependent secretion activator (CADPS)

Calcium-binding protein involved in exocytosis of vesicles filled with neurotransmitters and neuropeptides. Probably acts upstream of fusion in the biogenesis or maintenance of mature secretory vesicles.

Next steps..

• Computationally:

– Improve the classification by adding new types of data and/or by representing the data differently.

• Biologically:

– Explore the proteins we have found (the FP list) through biological experiments.
