Function and Phenotype Prediction through Data and Knowledge Fusion
Karin M. Verspoor, The University of Melbourne
27 January 2016 – King Abdullah University of Science and Technology, Computational Bioscience Research Center


Page 1: Function and Phenotype Prediction through Data and Knowledge Fusion

Function and Phenotype Prediction through Data and Knowledge Fusion

Karin M. Verspoor, The University of Melbourne

27 January 2016 – King Abdullah University of Science and Technology, Computational Bioscience Research Center

Page 2: Function and Phenotype Prediction through Data and Knowledge Fusion

We have the blueprints to life, but we don’t know how to read them.

• At least a quarter of protein families in PFAM have no known function (Domains of Unknown Function)

• Millions of proteins uncharacterised

Page 3: Function and Phenotype Prediction through Data and Knowledge Fusion

From sequence to function

?

Page 4: Function and Phenotype Prediction through Data and Knowledge Fusion

What is protein function?

The Gene Ontology (GO) provides a vocabulary

• Captures biological process, molecular function, cellular component

• Common representation for model organism databases to facilitate sharing

Page 5: Function and Phenotype Prediction through Data and Knowledge Fusion

What about phenotype?

Human Phenotype Ontology

Page 6: Function and Phenotype Prediction through Data and Knowledge Fusion

Knowledge-based features

Knowledge source:

Page 7: Function and Phenotype Prediction through Data and Knowledge Fusion

Exponential knowledge growth

• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection

• Over 25 million PubMed entries (> 2,000/day)

• Breakdown of disciplinary boundaries makes more of it relevant to each of us

• “Like drinking from a firehose” – Jim Ostell (NCBI IEB Chief)

Page 8: Function and Phenotype Prediction through Data and Knowledge Fusion

Text as a primary source of knowledge

Despite ever increasing structured resources, the literature remains the primary repository of knowledge in biomedicine

[Figure: number of Swiss-Prot proteins (y-axis, 0 to 120,000) over time (x-axis, 1/02 to 1/07). Series: proteins missing a FUNCTION comment; proteins gaining a FUNCTION comment]

“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al., Bioinformatics (ISMB 2007)

Page 9: Function and Phenotype Prediction through Data and Knowledge Fusion

Why biomedical text mining?

[Figure: publications per year (y-axis, 0 to 1,200,000) by year (x-axis, 1914 to 2012)]

Exponential growth in size of PubMed

Page 10: Function and Phenotype Prediction through Data and Knowledge Fusion

Data sources, Data Integration

• Structured Resources
– Largely manually ‘curated’, high quality
– Often unannotated
– Organizes targeted information
– Computable

• Unstructured Resources
– Literature: peer reviewed, well-formed
– Natural Language: ambiguity, complexity
– Broad, current coverage of biological knowledge
– Intended for human communication

Page 11: Function and Phenotype Prediction through Data and Knowledge Fusion

Bio Text Analysis in a nutshell

Page 12: Function and Phenotype Prediction through Data and Knowledge Fusion

GO Function Prediction

Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76.
Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.

Page 13: Function and Phenotype Prediction through Data and Knowledge Fusion

GOstruct: Structured output SVM

Page 14: Function and Phenotype Prediction through Data and Knowledge Fusion

Structured output

• Represent a set of annotations as a single vector
• Encodes the hierarchical structure from annotation to root
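The idea on this slide can be sketched in a few lines. This is a minimal illustration, not the GOstruct implementation: the hierarchy, the term names (`A`, `A1`, `AB`, ...) and the helper functions are hypothetical placeholders, and the only assumption taken from the slide is that an annotation implies every term on its path to the root.

```python
# Toy is-a hierarchy (child -> parents). The term names are hypothetical
# placeholders, not real GO identifiers.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],  # a term may have more than one parent (DAG)
}
TERMS = sorted(PARENTS)  # a fixed term order defines the vector positions


def ancestors_closure(terms):
    """Return the given terms plus every ancestor up to the root."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def to_vector(annotations):
    """Encode a set of annotations as one binary vector over TERMS."""
    closed = ancestors_closure(annotations)
    return [1 if term in closed else 0 for term in TERMS]


vector = to_vector({"A1", "AB"})
print(dict(zip(TERMS, vector)))
```

Annotating with the two leaf terms switches on their ancestors as well, so the vector carries the hierarchy rather than a flat label set.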

Page 15: Function and Phenotype Prediction through Data and Knowledge Fusion

GOstruct approach

“What functions does this protein perform?”

Page 16: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature integration via kernels

• Cross-species (sequence-based) features
– e-values from significant BLAST hits
– features from WoLF PSORT protein localization software
– transmembrane protein prediction using TMHMM
– k-mer composition of the N and C termini
– low complexity regions

• Species-specific features
– Protein interactions
– Gene Expression
– Phylogenetic profiles
– Text-derived features
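One common way to integrate heterogeneous feature sets through kernels is to compute one kernel matrix per feature view and add them. The sketch below assumes that scheme (linear kernels, cosine normalisation, unweighted sum); the slides do not state the exact kernels or weights used, and the feature names and values are made up for illustration.

```python
import math


def linear_kernel(features):
    """Gram matrix of dot products between sparse feature dicts."""
    def dot(u, v):
        return sum(value * v.get(key, 0.0) for key, value in u.items())
    return [[dot(u, v) for v in features] for u in features]


def normalize(kernel):
    """Scale the kernel so every item has unit self-similarity."""
    n = len(kernel)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(kernel[i][i] * kernel[j][j])
            out[i][j] = kernel[i][j] / denom if denom > 0 else 0.0
    return out


def add_kernels(kernels):
    """Element-wise sum of equally sized kernel matrices."""
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(n)] for i in range(n)]


# Two hypothetical views of the same three proteins.
sequence_view = [{"kmer_AA": 2.0, "tm_helices": 1.0}, {"kmer_AA": 1.0}, {"tm_helices": 3.0}]
text_view = [{"membrane": 1.0}, {"membrane": 2.0, "enzyme": 1.0}, {"enzyme": 1.0}]

combined = add_kernels([normalize(linear_kernel(v)) for v in (sequence_view, text_view)])
```

Normalising each view first keeps a view with large raw values (for example, word counts) from dominating the sum.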

Page 17: Function and Phenotype Prediction through Data and Knowledge Fusion

Extraction & Analysis pipeline

Christopher Funk (2015) PhD dissertation, U. Colorado Denver

Page 18: Function and Phenotype Prediction through Data and Knowledge Fusion

Integrating Text

• Protein – Gene Ontology term co-occurrence
• Protein – Protein co-occurrence

Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10

Page 19: Function and Phenotype Prediction through Data and Knowledge Fusion

Text-based features

• Words
– (tokens)

• Entities or Concepts
– (gene/protein mentions)
– (gene ontology concepts)

• Relations
– (simple co-occurrences)

Page 20: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature Extraction from text

Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Page 21: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature Extraction

Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Page 22: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature Extraction

Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known, extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was, …)

Page 23: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature Extraction

Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Protein GO term co-mentions:
sent_comen(P50281, GO:0008237)
sent_comen(P50281, GO:0006508)
sent_comen(P50281, GO:0009056)
sent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)

Page 24: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature Extraction

Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Protein GO term co-mentions:
nonSent_comen(P50281, GO:0008237)
nonSent_comen(P50281, GO:0006508)
nonSent_comen(P50281, GO:0009056)
nonSent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
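The two kinds of co-mention can be sketched as follows. This assumes that a sentence co-mention pairs a protein and a GO term found in the same sentence, and that a non-sentence co-mention pairs any protein and GO term found in the same passage, so the second set contains the first. The slides do not spell out this definition, and which identifier falls in which sentence below is invented for illustration.

```python
from itertools import product


def co_mentions(document):
    """document: list of sentences; each sentence is a dict holding the
    protein IDs and GO IDs already recognised in it."""
    sentence_pairs = set()
    for sentence in document:
        sentence_pairs.update(product(sentence["proteins"], sentence["go_terms"]))
    # Passage-level pairs: every protein with every GO term in the passage.
    all_proteins = set().union(*(s["proteins"] for s in document))
    all_terms = set().union(*(s["go_terms"] for s in document))
    document_pairs = set(product(all_proteins, all_terms))
    return sentence_pairs, document_pairs


# Hypothetical passage: the sentence boundaries are made up.
document = [
    {"proteins": {"P50281"}, "go_terms": {"GO:0008237", "GO:0006508"}},
    {"proteins": set(), "go_terms": {"GO:0010467"}},
]
sentence_pairs, document_pairs = co_mentions(document)
```

Here `GO:0010467` is paired with the protein only at passage level, because the two never share a sentence.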

Page 25: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature Representation

Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1
Protein GO term co-mentions (non-sentence):
P50281, GO:0010467=2, GO:0005623=2

Page 26: Function and Phenotype Prediction through Data and Knowledge Fusion

Feature Representation

Bag of Words:
UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
…
UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
Protein GO term co-mentions (sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
Protein GO term co-mentions (non-sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
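A row in this representation is a protein identifier followed by `feature=count` pairs. A minimal sketch of producing such a row from sentences is shown below. The whitespace tokenisation, the alphabetical field order and the two example sentences are assumptions made for illustration; the slides do not describe the actual pipeline at this level.

```python
from collections import Counter


def bag_of_words(sentences):
    """Count lower-cased whitespace tokens across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def format_row(protein_id, counts):
    """Serialise counts as 'ID, feature=count, ...' with sorted features."""
    fields = [f"{key}={counts[key]}" for key in sorted(counts)]
    return ", ".join([protein_id] + fields)


# Hypothetical sentences mentioning the target protein.
sentences = ["known proteolytic enzyme", "membrane protein known"]
row = format_row("P50281", bag_of_words(sentences))
print(row)
```

The same `format_row` helper would serve for co-mention counts, with GO identifiers as the feature keys.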

Page 27: Function and Phenotype Prediction through Data and Knowledge Fusion

An aside on GO concept recognition

• Given:
– Gene Ontology (~46,000 concepts)
– Text: In mice lacking ephrin-A5 function, cell proliferation and survival of newborn neurons… (PMID 20474079)

• Return:
– GO:0008283 cell proliferation
– GO:0005125 cytokine activity
– GO:0048666 neuron development

(can be based on a judgment about the depth of experimental evidence)

Page 28: Function and Phenotype Prediction through Data and Knowledge Fusion

(CRAFT example)

Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.

GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
SO:0001059 – “sequence_alteration”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”

Page 29: Function and Phenotype Prediction through Data and Knowledge Fusion

GO:0006900 – membrane budding

[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane, resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…

Variation in PMID: 12925238
• Lipid rafts play a key role in membrane budding…
• …involvement of annexin A7 in budding of vesicles…
• …Ca2+-mediated vesiculation process was not impaired.
• Red blood cells which lack the ability to vesiculate cause…
• Having excluded a direct role in vesicle formation…

GO vs NL
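The gap between ontology labels and natural language can be made concrete with a naive exact-match lookup over the term name and synonyms listed on this slide. This is only a sketch of the simplest possible baseline; it is not how NCBO Annotator, MetaMap or ConceptMapper work, and the `exact_match` helper is hypothetical.

```python
import re

TERM = {
    "id": "GO:0006900",
    "name": "membrane budding",
    "synonyms": [
        "membrane evagination",
        "nonselective vesicle assembly",
        "vesicle biosynthesis",
        "vesicle formation",
    ],
}


def exact_match(text, term):
    """Return the labels of `term` that occur verbatim in `text`."""
    labels = [term["name"]] + term["synonyms"]
    lowered = text.lower()
    return [
        label for label in labels
        if re.search(r"\b" + re.escape(label) + r"\b", lowered)
    ]


# The five text variants quoted on the slide (PMID 12925238).
snippets = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process was not impaired",
    "Red blood cells which lack the ability to vesiculate cause",
    "Having excluded a direct role in vesicle formation",
]
hits = [bool(exact_match(snippet, TERM)) for snippet in snippets]
print(hits)
```

Only the first and last variants are found. "Budding of vesicles", "vesiculation" and "vesiculate" need reordering or derivational handling, which is what the tool parameters on the next slide control.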

Page 30: Function and Phenotype Prediction through Data and Knowledge Fusion

Comparing tool performance on CR

• NCBO Annotator (96 combinations)

wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms

• MetaMap (864 combinations)

model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize

• Concept Mapper (576 combinations)

searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms

Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.

Page 31: Function and Phenotype Prediction through Data and Knowledge Fusion

Literature alone is useful

[Figure: bar chart of macro-averaged F-measure (y-axis, 0 to 0.7) by Gene Ontology branch (MF, BP, CC). Series: Baseline (co-mentions as predictions); Co-mentions; BoW; Co-mentions + BoW]
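For readers unfamiliar with the metric on the y-axis, here is a minimal sketch of a macro-averaged F-measure. It assumes the term-centric reading, where an F1 score is computed per GO term across proteins and the per-term scores are then averaged with equal weight. The slides do not define the averaging, and the proteins and terms below are toy values.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, terms):
    """gold / predicted: dict mapping protein -> set of terms."""
    scores = []
    for term in terms:
        tp = sum(1 for p in gold
                 if term in gold[p] and term in predicted.get(p, set()))
        fp = sum(1 for p in predicted
                 if term in predicted[p] and term not in gold.get(p, set()))
        fn = sum(1 for p in gold
                 if term in gold[p] and term not in predicted.get(p, set()))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy data: t1 is predicted perfectly, t2 is always wrong.
gold = {"p1": {"t1", "t2"}, "p2": {"t1"}}
predicted = {"p1": {"t1"}, "p2": {"t1", "t2"}}
score = macro_f1(gold, predicted, ["t1", "t2"])
```

Because each term counts equally, rare terms influence the macro average as much as frequent ones.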

Page 32: Function and Phenotype Prediction through Data and Knowledge Fusion

Literature features approach performance of commonly used biological features

[Figure: bar chart of macro-averaged F-measure (y-axis, 0 to 0.8) by Gene Ontology branch (MF, BP, CC). Series: Trans/Localization; Homology; Network; Literature; All Combined]

(and combining them with other features is even better!)

Page 33: Function and Phenotype Prediction through Data and Knowledge Fusion

Manual inspection of misclassifications

Some false positives appear to have literature support:

• GCNT1 – carbohydrate metabolic process (Q02742 - GO:0005975)
Genes related to carbohydrate metabolism include PPP1R3C, B3GNT1, and GCNT1… [PMID:23646466]

• CERS2 – ceramide biosynthetic process (Q96G23 - GO:0046513)
…CersS2, which uses C22-CoA for ceramide synthesis… [PMID:22144673]

Page 34: Function and Phenotype Prediction through Data and Knowledge Fusion

Results: Multi-view learning

Page 35: Function and Phenotype Prediction through Data and Knowledge Fusion

Results: different sources

Mouse annotations from geneontology.org

Page 36: Function and Phenotype Prediction through Data and Knowledge Fusion

Phenotype Prediction

Kahanda I, Funk C, Verspoor K and Ben-Hur A. F1000Research 2015, 4:259 (doi: 10.12688/f1000research.6670.1)

Page 37: Function and Phenotype Prediction through Data and Knowledge Fusion

PHENOstruct

Human Phenotype Ontology

Organ, Inheritance, Onset subontologies have separate models

Page 38: Function and Phenotype Prediction through Data and Knowledge Fusion

Gold annotations via transfer

Page 39: Function and Phenotype Prediction through Data and Knowledge Fusion

PHENOstruct Features

• Network (functional association data)
– protein-protein interactions
– co-expression
– co-localization
– From BioGRID, STRING, GeneMANIA

• Gene Ontology (experimental) annotations
• Literature mined data: bag of words in gene sentences
• Genetic variants (protein -> disease -> variants)

Page 40: Function and Phenotype Prediction through Data and Knowledge Fusion

Performance

Subont.      Terms   Method         AUC    P-value
Organ        1,796   Binary SVMs    0.66   1.70E-262
                     Clus-HMC-Ens   0.65   0.00E+00
                     PHENOstruct    0.73   -
Inheritance  12      Binary SVMs    0.72   2.20E-01
                     Clus-HMC-Ens   0.73   7.30E-01
                     PHENOstruct    0.74   -
Onset        23      Binary SVMs    0.62   4.40E-03
                     Clus-HMC-Ens   0.58   3.30E-05
                     PHENOstruct    0.64   -

Page 41: Function and Phenotype Prediction through Data and Knowledge Fusion

PHENOstruct in Organ subontology

Page 42: Function and Phenotype Prediction through Data and Knowledge Fusion

Gold vs Predicted, P43681

Gold Predicted

Hierarchical, protein-centric: P = 1.0; R = 0.62
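A hierarchical, protein-centric precision and recall can be sketched as set overlap after both the gold and the predicted terms for one protein are expanded with their ancestors. The hierarchy and term names below are hypothetical, and the exact definition used for the slide's figures is an assumption here.

```python
# Toy phenotype hierarchy (child -> parents); names are placeholders.
PARENTS = {
    "root": [],
    "organ": ["root"],
    "heart": ["organ"],
    "valve": ["heart"],
    "eye": ["organ"],
}


def with_ancestors(terms):
    """Return the terms plus all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def hierarchical_pr(gold, predicted):
    """Precision and recall over ancestor-expanded term sets."""
    gold_closed = with_ancestors(gold)
    pred_closed = with_ancestors(predicted)
    overlap = len(gold_closed & pred_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(gold_closed) if gold_closed else 0.0
    return precision, recall


precision, recall = hierarchical_pr(gold={"valve", "eye"}, predicted={"heart"})
print(precision, recall)
```

In this toy case every predicted term is correct but the prediction stops above the gold leaves, which is one way a pattern of perfect precision with lower recall can arise.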

Page 43: Function and Phenotype Prediction through Data and Knowledge Fusion

Impact of data sources

Page 44: Function and Phenotype Prediction through Data and Knowledge Fusion

Leave-one-source-out
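A leave-one-source-out experiment trains one model per held-out data source and compares each with the model that uses all sources. The sketch below only enumerates the configurations; the source names are shorthand for the four PHENOstruct feature groups listed earlier, and training and evaluation are deliberately omitted.

```python
# Shorthand names for the feature groups; the identifiers are hypothetical.
SOURCES = ["network", "go_annotations", "literature", "variants"]


def leave_one_source_out(sources):
    """Yield (held_out, remaining_sources) for every source."""
    for held_out in sources:
        yield held_out, [s for s in sources if s != held_out]


configs = dict(leave_one_source_out(SOURCES))
for held_out, remaining in configs.items():
    print(f"without {held_out}: train on {remaining}")
```

The drop in performance when a source is removed gives a rough measure of what that source contributes beyond the others.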

Page 45: Function and Phenotype Prediction through Data and Knowledge Fusion

Top Literature Features

Category: Tokens

proteins & complexes: cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, tnf-alpha-mediated, tnf-alpha-stimulated, tnf–mediated, ink4a/arf, ns4b, hmsh6, fukutin, cdtb, ns5b, apoai, tnf–stimulated, ns4a, tnf-alpha-, rhbmp-2, tnf-alpha-treated, frataxin, ki-ras, connexin32, tcdb, recql4, =-galcer, tyrosinase-related, hpms2, her4, cd40-cd40l, lmp2a, ryrs, mg2+-atpase, ews-fli1, abeta42, fancc, p40phox, her1, bdnf-induced, trap+, gfap-ir, daf-16/foxo, hdl3, -238, [tnf-alpha], cd40/cd40l, tnf–treated, anti-ngf, tep1, recq, nt-4, pfemp1, zo-2, nphp1, tnf-alpha-dependent, pomt1, igm-positive, apoa-ii, p110alpha, fancf, tbx4, anti-cd40l, igg

genes: hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2

pathways: ras/raf/mek/erk, pi3k-akt-mtor

diseases/phenotypes: cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin

misc.: sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, crysalis, wingx, amf

Page 46: Function and Phenotype Prediction through Data and Knowledge Fusion

Conclusions

• The literature provides a significant resource for biological function prediction

• The literature provides one ‘view’ of biological knowledge and is best combined with other resources

• Even some simple strategies for extracting associations from the literature can provide valuable information, taken at large scale
– “bag of words” and co-occurrence models reasonable starting point: capture implied relationships
– scope for integration of more targeted extracted relationships (e.g. protein-protein interactions), with the usual Precision/Recall tradeoff

Page 47: Function and Phenotype Prediction through Data and Knowledge Fusion

Acknowledgements

• Los Alamos National Laboratory
– Michael Wall

• Colorado
– Larry Hunter (U. Colorado Denver)
– Christopher Funk (U. Colorado Denver)
– Asa Ben-Hur (Colorado State University)
– Indika Kahanda (Colorado State University)

• NICTA Victoria Research Laboratory
– Geoffrey Macintyre (U. Cambridge)
– Antonio Jimeno Yepes (IBM Research Australia)
– Cheng-Soon Ong (NICTA Canberra)

• Funding: US NIH, US NSF, NICTA, Australian Research Council

Page 48: Function and Phenotype Prediction through Data and Knowledge Fusion

Thank you!

Page 49: Function and Phenotype Prediction through Data and Knowledge Fusion

Machine learning for text analysis

[Diagram: training a text classifier]

• Training set: notes + labels for classes of interest
• Language processing, drawing on biomedical knowledge sources (UMLS, OBOs)
• Features: words, phrases, linguistic categories; names of entities; domain concepts; document features
• Machine learning algorithm
• Model relating features of the text to classes of interest

Page 50: Function and Phenotype Prediction through Data and Knowledge Fusion

Machine learning for text analysis

[Diagram: applying the trained model]

• New text to be classified
• Language processing, drawing on biomedical knowledge sources (UMLS, OBOs)
• Features: words, phrases, linguistic categories; names of entities; domain concepts; document features
• Model
• Predicted classification (label)
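The train-then-classify flow in these two diagrams can be illustrated with a very small bag-of-words classifier. The sketch uses multinomial Naive Bayes with add-one smoothing purely as a stand-in for "machine learning algorithm"; the slides do not name a specific algorithm here, and the training texts and labels are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """examples: list of (text, label) pairs. Returns a model dict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)
        label_total = sum(model["words"][label].values())
        for token in tokenize(text):
            if token in model["vocab"]:  # words never seen in training are ignored
                count = model["words"][label][token]
                score += math.log((count + 1) / (label_total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: text plus a label for the class of interest.
training_set = [
    ("protease cleaves extracellular matrix", "enzyme"),
    ("proteolytic enzyme activity", "enzyme"),
    ("water channel in membrane", "transporter"),
    ("membrane transport of water", "transporter"),
]
model = train(training_set)
label = predict(model, "proteolytic activity in matrix")
```

In a fuller pipeline the whitespace tokens would be replaced by the richer features named in the diagram, such as recognised entities and ontology concepts.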