Function and Phenotype Prediction through Data and Knowledge Fusion
Karin M. Verspoor, The University of Melbourne
27 January 2016 – King Abdullah University of Science and Technology, Computational Bioscience Research Center
We have the blueprints to life, but we don’t know how to read them.
• At least a quarter of protein families in PFAM have no known function (Domains of Unknown Function)
• Millions of proteins uncharacterised
From sequence to function
?
What is protein function?
The Gene Ontology (GO) provides a vocabulary
• Captures biological process, molecular function, cellular component
• Common representation for model organism databases to facilitate sharing
What about phenotype?
Human Phenotype Ontology
Knowledge-based features
Knowledge source:
Exponential knowledge growth
• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection
• Over 25 million PubMed entries (> 2,000/day)
• Breakdown of disciplinary boundaries makes more of it relevant to each of us
• “Like drinking from a firehose” – Jim Ostell (NCBI IEB Chief)
Text as a primary source of knowledge
Despite ever increasing structured resources, the literature remains the primary repository of knowledge in biomedicine
[Figure: number of Swiss-Prot proteins (axis 0 to 120,000) from 1/02 to 1/07; series: proteins missing a FUNCTION comment, proteins gaining a FUNCTION comment]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al., Bioinformatics (ISMB 2007)
Why biomedical text mining?
[Figure: publications per year (axis 0 to 1,200,000) by year, 1914 to 2012]
Exponential growth in size of PubMed
Data sources, Data Integration
• Structured Resources
– Largely manually ‘curated’, high quality
– Often unannotated
– Organizes targeted information
– Computable
• Unstructured Resources
– Literature: peer reviewed, well-formed
– Natural Language: ambiguity, complexity
– Broad, current coverage of biological knowledge
– Intended for human communication
Bio Text Analysis in a nutshell
GO Function Prediction
Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76.
Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.
GOstruct: Structured output SVM
Structured output
• Represent a set of annotations as a single vector
• Encodes the hierarchical structure from annotation to root
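The structured-output idea above can be sketched as follows. This is a minimal illustration, not the GOstruct implementation: the tiny is-a hierarchy and its term names (`A`, `A1`, `AB`, ...) are hypothetical, and the sketch assumes that annotating a protein with a term implies every ancestor of that term up to the root.

```python
# Toy is-a hierarchy (child -> parents). Term names are hypothetical placeholders,
# not real Gene Ontology identifiers.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],  # a term may have more than one parent
}
TERMS = sorted(PARENTS)  # a fixed ordering gives each term one vector position


def ancestors_closure(term):
    """Return the term plus every ancestor on any path to the root."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def encode(annotations):
    """Encode a set of annotations as one binary vector over all terms."""
    active = set()
    for term in annotations:
        active |= ancestors_closure(term)
    return [1 if term in active else 0 for term in TERMS]


vector = encode({"A1", "AB"})
print(dict(zip(TERMS, vector)))
```

Because the whole annotation set is one vector, a predictor that outputs such vectors can be restricted to hierarchy-consistent outputs, which is the point of treating the problem as structured rather than as independent per-term classifiers.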
GOstruct approach
“What functions does this protein perform?”
Feature integration via kernels
• Cross-species (sequence-based) features
– e-values from significant BLAST hits
– features from WoLF PSORT protein localization software
– transmembrane protein prediction using TMHMM
– k-mer composition of the N and C termini
– low complexity regions
• Species-specific features
– Protein interactions
– Gene Expression
– Phylogenetic profiles
– Text-derived features
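One common way to integrate heterogeneous feature sets through kernels is to compute one kernel matrix per feature set and add them. The sketch below assumes an unweighted sum of cosine-normalized linear kernels; the slides do not state which kernels or weights were used, and the two toy "views" and their values are made up for illustration.

```python
import math


def linear_kernel(features):
    """Gram matrix of dot products between all pairs of feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in features] for x in features]


def normalize(kernel):
    """Cosine-normalize so every protein has self-similarity 1 in each view."""
    n = len(kernel)
    return [
        [kernel[i][j] / math.sqrt(kernel[i][i] * kernel[j][j]) for j in range(n)]
        for i in range(n)
    ]


def combine(kernels):
    """Sum the Gram matrices of all views element-wise."""
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(n)] for i in range(n)]


# Hypothetical toy features for three proteins in two views (all vectors non-zero).
sequence_view = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
text_view = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]

combined = combine([normalize(linear_kernel(v)) for v in (sequence_view, text_view)])
for row in combined:
    print([round(value, 3) for value in row])
```

Normalizing each view first keeps a feature set with large raw values from dominating the sum; a sum of valid kernels is itself a valid kernel, so the combined matrix can be passed to any kernel method.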
Extraction & Analysis pipeline
Christopher Funk (2015) PhD dissertation, U. Colorado Denver
Integrating Text
• Protein – Gene Ontology term co-occurrence
• Protein – Protein co-occurrence
Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10
Text-based features
• Words
– (tokens)
• Entities or Concepts
– (gene/protein mentions)
– (gene ontology concepts)
• Relations
– (simple co-occurrences)
Feature Extraction from text
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known, extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was, …)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
sent_comen(P50281, GO:0008237)
sent_comen(P50281, GO:0006508)
sent_comen(P50281, GO:0009056)
sent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
nonSent_comen(P50281, GO:0008237)
nonSent_comen(P50281, GO:0006508)
nonSent_comen(P50281, GO:0009056)
nonSent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
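A minimal sketch of how such co-mentions might be collected. It assumes that entity recognition has already tagged each sentence with protein and GO identifiers, and it assumes one particular reading of the two categories: a sentence co-mention is a protein and a GO term in the same sentence, and a non-sentence co-mention is a pair that shares the passage but never a sentence. The slides do not define the categories precisely, and the sentence assignments in the toy passage are invented.

```python
from collections import Counter


def co_mentions(passage):
    """Count protein/GO pairs sharing a sentence, and pairs sharing only the passage."""
    sentence_pairs = Counter()
    for sentence in passage:
        for protein in sentence["proteins"]:
            for go_term in sentence["go_terms"]:
                sentence_pairs[(protein, go_term)] += 1

    all_proteins = {p for s in passage for p in s["proteins"]}
    all_go_terms = {g for s in passage for g in s["go_terms"]}
    non_sentence_pairs = Counter()
    for protein in all_proteins:
        for go_term in all_go_terms:
            if (protein, go_term) not in sentence_pairs:
                non_sentence_pairs[(protein, go_term)] += 1
    return sentence_pairs, non_sentence_pairs


# Hypothetical tagging of a two-sentence passage; identifiers are taken from the slide,
# but which sentence mentions which identifier is made up.
passage = [
    {"proteins": {"P50281"}, "go_terms": {"GO:0008237", "GO:0006508"}},
    {"proteins": set(), "go_terms": {"GO:0010467"}},
]
sent, non_sent = co_mentions(passage)
print(dict(sent))
print(dict(non_sent))
```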
Feature Representation
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1
Protein GO term co-mentions (non-sentence):
P50281, GO:0010467=2, GO:0005623=2
Feature Representation
Bag of Words:
UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
…
UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
Protein GO term co-mentions (sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
Protein GO term co-mentions (non-sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
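The representation above is one line per protein, with each feature written as `name=count`. A short sketch of producing such a line from tokenized sentences follows; the token lists are abbreviated from the MMP14 example (the slide elides most tokens), and the alphabetical feature order is an arbitrary choice made here.

```python
from collections import Counter


def bag_of_words(sentences):
    """Pool token counts over all sentences that mention the protein."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts


def feature_line(uniprot_id, counts):
    """Serialize counts as 'ID, feature=count, ...' with features in sorted order."""
    parts = [f"{key}={counts[key]}" for key in sorted(counts)]
    return ", ".join([uniprot_id] + parts)


# Abbreviated, partly hypothetical token lists for two sentences.
sentences = [
    ["membrane", "known", "proteolytic", "enzyme", "known"],
    ["protein", "levels"],
]
line = feature_line("P50281", bag_of_words(sentences))
print(line)
```

The same `feature_line` helper works for the co-mention features if the counter keys are GO identifiers instead of words.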
An aside on GO concept recognition
• Given:
– Gene Ontology (~46,000 concepts)
– In mice lacking ephrin-A5 function, cell proliferation and survival of newborn neurons… (PMID 20474079)
• Return:
– GO:0008283 cell proliferation
– GO:0005125 cytokine activity
– GO:0048666 neuron development
(can be based on a judgment about the depth of experimental evidence)
(CRAFT example)
Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.
GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
SO:0001059 – “sequence_alteration”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
GO:0006900 – membrane budding
[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane, resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…
Variation in PMID: 12925238
• Lipid rafts play a key role in membrane budding…
• …involvement of annexin A7 in budding of vesicles…
• …Ca2+-mediated vesiculation process was not impaired.
• Red blood cells which lack the ability to vesiculate cause…
• Having excluded a direct role in vesicle formation…
GO vs NL
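To make the gap between ontology labels and natural language concrete, the sketch below runs a naive exact-phrase matcher over the five snippets from PMID 12925238, using the name and synonyms of GO:0006900. It is a deliberately simple baseline written for this illustration; it is not how NCBO Annotator, MetaMap, or ConceptMapper work.

```python
import re

# Name and synonyms of GO:0006900 as listed in the term stanza.
LABELS = [
    "membrane budding",
    "membrane evagination",
    "nonselective vesicle assembly",
    "vesicle biosynthesis",
    "vesicle formation",
]

# Snippets from the slide, lightly truncated.
SNIPPETS = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process",
    "Red blood cells which lack the ability to vesiculate cause",
    "Having excluded a direct role in vesicle formation",
]


def matched_labels(text, labels):
    """Return the labels that occur in the text as whole, case-insensitive phrases."""
    lowered = text.lower()
    return [
        label
        for label in labels
        if re.search(r"\b" + re.escape(label) + r"\b", lowered)
    ]


hits = [bool(matched_labels(snippet, LABELS)) for snippet in SNIPPETS]
for snippet, hit in zip(SNIPPETS, hits):
    print("match" if hit else "miss ", "|", snippet)
```

Only the first and last snippets are found. "budding of vesicles", "vesiculation", and "vesiculate" express the same concept but need word-order flexibility or morphological normalization, which is what parameters such as stemming and order-independent lookup in the concept recognition tools address.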
Comparing tool performance on CR
• NCBO Annotator (96 combinations)
wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms
• MetaMap (864 combinations)
model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize
• Concept Mapper (576 combinations)
searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms
Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.
Literature alone is useful
[Figure: macro-averaged F-measure (axis 0 to 0.7) by Gene Ontology branch (MF, BP, CC); series: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW]
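For reference, a macro-averaged F-measure can be computed as in the sketch below. It assumes that "macro-averaged" means computing an F-measure per GO term over all proteins and then taking the unweighted mean over terms; the exact protocol behind the chart is not given on the slide, and the proteins and terms here are toy placeholders.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f(gold, predicted, terms):
    """Average the per-term F-measure over all terms, weighting each term equally."""
    scores = []
    for term in terms:
        gold_proteins = {p for p, ts in gold.items() if term in ts}
        pred_proteins = {p for p, ts in predicted.items() if term in ts}
        tp = len(gold_proteins & pred_proteins)
        fp = len(pred_proteins - gold_proteins)
        fn = len(gold_proteins - pred_proteins)
        scores.append(f_measure(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy annotations: protein -> set of terms (all names hypothetical).
gold = {"p1": {"T1", "T2"}, "p2": {"T1"}, "p3": {"T2"}}
predicted = {"p1": {"T1"}, "p2": {"T1", "T2"}, "p3": set()}

score = macro_f(gold, predicted, ["T1", "T2"])
print(score)  # T1 is predicted perfectly (F = 1), T2 has no true positives (F = 0)
```

Macro-averaging gives rare terms the same weight as frequent ones, so it rewards methods that do well across the ontology rather than only on heavily annotated terms.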
Literature features approach performance of commonly used biological features
[Figure: macro-averaged F-measure (axis 0 to 0.8) by Gene Ontology branch (MF, BP, CC); series: Trans/Localization, Homology, Network, Literature, All Combined]
(and combining them with other features is even better!)
Manual inspection of misclassifications
Some false positives appear to have literature support:
• GCNT1 – carbohydrate metabolic process (Q02742 - GO:0005975)
Genes related to carbohydrate metabolism include PPP1R3C, B3GNT1, and GCNT1… [PMID:23646466]
• CERS2 – ceramide biosynthetic process (Q96G23 - GO:0046513)
…CersS2, which uses C22-CoA for ceramide synthesis… [PMID:22144673]
Results: Multi-view learning
Results: different sources
Mouse annotations from geneontology.org
Phenotype Prediction
Kahanda I, Funk C, Verspoor K and Ben-Hur A. F1000Research 2015, 4:259 (doi: 10.12688/f1000research.6670.1)
PHENOstruct
Human Phenotype Ontology
Organ, Inheritance, Onset subontologies have separate models
Gold annotations via transfer
PHENOstruct Features
• Network (functional association data)
– protein-protein interactions
– co-expression
– co-localization
– From BioGRID, STRING, GeneMANIA
• Gene Ontology (experimental) annotations
• Literature mined data: bag of words in gene sentences
• Genetic variants (protein -> disease -> variants)
Performance
Subont.      Terms   Method         AUC    P-value
Organ        1,796   Binary SVMs    0.66   1.70E-262
                     Clus-HMC-Ens   0.65   0.00E+00
                     PHENOstruct    0.73   -
Inheritance  12      Binary SVMs    0.72   2.20E-01
                     Clus-HMC-Ens   0.73   7.30E-01
                     PHENOstruct    0.74   -
Onset        23      Binary SVMs    0.62   4.40E-03
                     Clus-HMC-Ens   0.58   3.30E-05
                     PHENOstruct    0.64   -
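As a reminder of what the AUC column measures, the sketch below computes the area under the ROC curve for a single term as the fraction of (positive, negative) protein pairs that the scores rank correctly, with ties counted as half. How the per-term values were averaged into the table entries is not stated on the slide, and the scores and labels here are toy values.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy prediction scores for four proteins on one phenotype term.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
value = auc(scores, labels)
print(value)  # 3 of the 4 positive/negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking, which puts the table's values between 0.58 and 0.74 in context.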
PHENOstruct in Organ subontology
Gold vs Predicted, P43681
Gold Predicted
Hierarchical, protein-centric
P = 1.0; R = 0.62
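Hierarchical, protein-centric precision and recall can be sketched as below, assuming the usual definition: expand both the gold and the predicted term sets of one protein with all their ancestors, then compare the expanded sets. The hierarchy and annotations are toy values and do not reproduce the P43681 figures.

```python
# Toy is-a hierarchy (child -> parents); term names are hypothetical.
PARENTS = {
    "root": [],
    "A": ["root"],
    "A1": ["A"],
    "A1a": ["A1"],
    "B": ["root"],
    "B1": ["B"],
}


def expand(terms):
    """Add every ancestor of every term, up to and including the root."""
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in expanded:
            expanded.add(term)
            stack.extend(PARENTS[term])
    return expanded


def hierarchical_pr(gold_terms, predicted_terms):
    """Precision and recall for one protein over ancestor-expanded term sets."""
    gold = expand(gold_terms)
    predicted = expand(predicted_terms)
    overlap = len(gold & predicted)
    return overlap / len(predicted), overlap / len(gold)


precision, recall = hierarchical_pr({"A1a", "B1"}, {"A1"})
print(precision, recall)
```

In this toy case the prediction is correct but too general and misses a branch, so precision is perfect while recall is not. That is the same pattern as the P = 1.0, R = 0.62 example: everything predicted is in the gold set, but part of the gold set is not reached.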
Impact of data sources
Leave-one-source-out
Top Literature Features
Category: Tokens
proteins & complexes: cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, tnf-alpha-mediated, tnf-alpha-stimulated, tnf–mediated, ink4a/arf, ns4b, hmsh6, fukutin, cdtb, ns5b, apoai, tnf–stimulated, ns4a, tnf-alpha-, rhbmp-2, tnf-alpha-treated, frataxin, ki-ras, connexin32, tcdb, recql4, =-galcer, tyrosinase-related, hpms2, her4, cd40-cd40l, lmp2a, ryrs, mg2+-atpase, ews-fli1, abeta42, fancc, p40phox, her1, bdnf-induced, trap+, gfap-ir, daf-16/foxo, hdl3, -238, [tnf-alpha], cd40/cd40l, tnf–treated, anti-ngf, tep1, recq, nt-4, pfemp1, zo-2, nphp1, tnf-alpha-dependent, pomt1, igm-positive, apoa-ii, p110alpha, fancf, tbx4, anti-cd40l, igg
genes: hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2
pathways: ras/raf/mek/erk, pi3k-akt-mtor
diseases/phenotypes: cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin
misc.: sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, crysalis, wingx, amf
Conclusions
• The literature provides a significant resource for biological function prediction
• The literature provides one ‘view’ of biological knowledge and is best combined with other resources
• Even some simple strategies for extracting associations from the literature can provide valuable information, taken at large scale
– “bag of words” and co-occurrence models are a reasonable starting point: they capture implied relationships
– scope for integration of more targeted extracted relationships (e.g. protein-protein interactions), with the usual Precision/Recall tradeoff
Acknowledgements
• Los Alamos National Laboratory
– Michael Wall
• Colorado
– Larry Hunter (U. Colorado Denver)
– Christopher Funk (U. Colorado Denver)
– Asa Ben-Hur (Colorado State University)
– Indika Kahanda (Colorado State University)
• NICTA Victoria Research Laboratory
– Geoffrey Macintyre (U. Cambridge)
– Antonio Jimeno Yepes (IBM Research Australia)
– Cheng-Soon Ong (NICTA Canberra)
• Funding: US NIH, US NSF, NICTA, Australian Research Council
Thank you!
Machine learning for text analysis
[Diagram: training]
• Training set: notes + labels for classes of interest
• Language processing, drawing on biomedical knowledge sources (UMLS, OBOs)
• Features: words, phrases, linguistic categories; names of entities; domain concepts; document features
• Machine learning algorithm
• Model: relating features of the text to classes of interest
Machine learning for text analysis
[Diagram: classification]
• New text to be classified
• Language processing, drawing on biomedical knowledge sources (UMLS, OBOs)
• Features: words, phrases, linguistic categories; names of entities; domain concepts; document features
• Model
• Predicted classification (label)
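The two diagrams (train a model from labeled text, then apply it to new text) can be illustrated with a very small text classifier. The sketch uses a multinomial Naive Bayes model with add-one smoothing and whitespace tokens as the only features. This is a stand-in chosen for brevity, not the structured-output SVM discussed in the talk, and the training sentences and class labels are invented.

```python
import math
from collections import Counter, defaultdict


def featurize(text):
    """Language processing step, reduced here to lowercased whitespace tokens."""
    return text.lower().split()


def train(examples):
    """Learn per-class word counts and class frequencies from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        tokens = featurize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return {"words": word_counts, "labels": label_counts, "vocab": vocabulary}


def predict(model, text):
    """Return the class with the highest smoothed log-probability for the text."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, count in model["labels"].items():
        counts = model["words"][label]
        denominator = sum(counts.values()) + vocab_size
        score = math.log(count / total)
        for token in featurize(text):
            score += math.log((counts[token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: text plus a label for the class of interest.
training_set = [
    ("protease cleaves extracellular matrix", "proteolysis"),
    ("enzyme degrades matrix proteins", "proteolysis"),
    ("channel transports water across membrane", "transport"),
    ("pore mediates water transport", "transport"),
]
model = train(training_set)
print(predict(model, "water channel in membrane"))
print(predict(model, "protease degrades matrix"))
```

Richer features from the diagram (entity names, domain concepts from UMLS or OBO ontologies) would replace or extend `featurize`; the train and predict steps keep the same shape.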