Function and Phenotype Prediction through Data and Knowledge Fusion

Karin M. Verspoor, The University of Melbourne

27 January 2016 – King Abdullah University of Science and Technology, Computational Bioscience Research Center

We have the blueprints to life, but we don’t know how to read them.

• At least a quarter of protein families in Pfam have no known function (Domains of Unknown Function)

• Millions of proteins uncharacterised

From sequence to function

?

What is protein function?

• Captures biological process, molecular function, cellular component

• Common representation for model organism databases to facilitate sharing

The Gene Ontology (GO) provides a vocabulary

What about phenotype?

Human Phenotype Ontology

Knowledge-based features

Knowledge source:

Exponential knowledge growth

• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection

• Over 25 million PubMed entries (> 2,000/day)

• Breakdown of disciplinary boundaries makes more of it relevant to each of us

• “Like drinking from a firehose” – Jim Ostell (NCBI IEB Chief)

Text as a primary source of knowledge

Despite ever increasing structured resources, the literature remains the primary repository of knowledge in biomedicine

[Figure: number of Swiss-Prot proteins (y-axis, 0 to 120,000) by date (x-axis, 1/02 to 1/07), with two series: proteins missing a FUNCTION comment and proteins gaining a FUNCTION comment]

“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al., Bioinformatics (ISMB 2007)

Why biomedical text mining?

[Figure: exponential growth in the size of PubMed; publications per year (y-axis, 0 to 1,200,000) by year (x-axis, 1914 to 2012)]

Data sources, Data Integration

• Structured Resources
– Largely manually ‘curated’, high quality
– Often unannotated
– Organizes targeted information
– Computable

• Unstructured Resources
– Literature: peer reviewed, well-formed
– Natural Language: ambiguity, complexity
– Broad, current coverage of biological knowledge
– Intended for human communication

Bio Text Analysis in a nutshell

GO Function Prediction

Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76.
Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.

GOstruct: Structured output SVM

Structured output

• Represent a set of annotations as a single vector
• Encodes the hierarchical structure from annotation to root
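The two bullets above can be illustrated with a minimal sketch. It assumes a toy hierarchy with made-up term names (`TOY_PARENTS` is hypothetical, not real GO content) and shows one way to turn a set of annotations into a single binary vector that also switches on every ancestor up to the root. It is not the GOstruct implementation.

```python
# Toy hierarchy (hypothetical term names, not real GO terms): child -> parents.
TOY_PARENTS = {
    "T:root": [],
    "T:binding": ["T:root"],
    "T:catalysis": ["T:root"],
    "T:peptidase": ["T:catalysis"],
    "T:metallopeptidase": ["T:peptidase"],
}

# A fixed term order defines the positions in the output vector.
TERM_ORDER = sorted(TOY_PARENTS)


def with_ancestors(terms):
    """Return the given terms plus every ancestor up to the root."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(TOY_PARENTS[term])
    return closed


def encode(terms):
    """Encode a set of annotations as one binary vector over all terms."""
    closed = with_ancestors(terms)
    return [1 if term in closed else 0 for term in TERM_ORDER]


vector = encode({"T:metallopeptidase"})
print(dict(zip(TERM_ORDER, vector)))
```

Annotating only the most specific term is enough: the path to the root is filled in by the encoding, so every valid label vector is consistent with the hierarchy.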

GOstruct approach

“What functions does this protein perform?”

Feature integration via kernels

• Cross-species (sequence-based) features
– e-values from significant BLAST hits
– features from WoLF PSORT protein localization software
– transmembrane protein prediction using TMHMM
– k-mer composition of the N and C termini
– low complexity regions

• Species-specific features
– Protein interactions
– Gene Expression
– Phylogenetic profiles
– Text-derived features
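One common way to integrate heterogeneous feature sets through kernels is to build one kernel per feature view and add them. The sketch below assumes linear kernels, cosine normalization, and an unweighted sum; those choices, the function names, and the toy vectors are all illustrative assumptions rather than details taken from the slides.

```python
def linear_kernel(features):
    """Gram matrix of dot products between all pairs of feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in features] for x in features]


def normalize(kernel):
    """Cosine-normalize a kernel so every view has unit self-similarity."""
    n = len(kernel)
    return [
        [
            kernel[i][j] / ((kernel[i][i] * kernel[j][j]) ** 0.5)
            if kernel[i][i] > 0 and kernel[j][j] > 0
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def combine(kernels):
    """Sum kernels element-wise; a sum of valid kernels is a valid kernel."""
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(n)] for i in range(n)]


# Hypothetical feature vectors for three proteins in two views.
sequence_view = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
text_view = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]

combined = combine([normalize(linear_kernel(v)) for v in (sequence_view, text_view)])
for row in combined:
    print([round(value, 3) for value in row])
```

Normalizing before summing keeps a view with large raw values (for example word counts) from dominating a view with small ones.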

Extraction & Analysis pipeline

Christopher Funk (2015) PhD dissertation, U. Colorado Denver

Integrating Text

• Protein – Gene Ontology term co-occurrence
• Protein – Protein co-occurrence

Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10

Text-based features

• Words
– (tokens)

• Entities or Concepts
– (gene/protein mentions)
– (gene ontology concepts)

• Relations
– (simple co-occurrences)

Feature Extraction from text

Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Feature Extraction


Feature Extraction


Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known, extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was, …)

Feature Extraction


Protein GO term co-mentions:
sent_comen(P50281, GO:0008237)
sent_comen(P50281, GO:0006508)
sent_comen(P50281, GO:0009056)
sent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)

Feature Extraction


Protein GO term co-mentions:
nonSent_comen(P50281, GO:0008237)
nonSent_comen(P50281, GO:0006508)
nonSent_comen(P50281, GO:0009056)
nonSent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)

Feature Representation


Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1
Protein GO term co-mentions (non-sentence):
P50281, GO:0010467=2, GO:0005623=2

Feature Representation

Bag of Words:
UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
…
UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi

Protein GO term co-mentions (sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi

Protein GO term co-mentions (non-sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
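The three feature types above (bag of words, sentence co-mentions, non-sentence co-mentions) can be sketched as simple counting over pre-annotated sentences. The input structure, the `extract_features` function, and the toy sentences below are hypothetical. The sketch also assumes that a "non-sentence" co-mention means the protein and the GO term appear in the same document but in different sentences, which is one possible reading of the slides.

```python
from collections import Counter
from itertools import product

# Hypothetical pre-annotated document: each sentence carries its tokens and
# the protein and GO identifiers that upstream recognizers found in it.
document = [
    {"tokens": ["membrane", "proteolytic", "enzyme", "known", "known"],
     "proteins": ["P50281"], "go_terms": ["GO:0008237", "GO:0006508"]},
    {"tokens": ["protein", "levels"],
     "proteins": [], "go_terms": ["GO:0010467"]},
]


def extract_features(sentences):
    """Return (bag of words, sentence co-mentions, non-sentence co-mentions)."""
    bag_of_words = {}           # protein -> Counter of tokens
    sent_comen = Counter()      # (protein, GO term) in the same sentence
    non_sent_comen = Counter()  # (protein, GO term) in different sentences
    for i, sent_i in enumerate(sentences):
        # Words are credited to every protein mentioned in the sentence.
        for protein in sent_i["proteins"]:
            bag_of_words.setdefault(protein, Counter()).update(sent_i["tokens"])
        # Pair proteins in sentence i with GO terms in every sentence j.
        for j, sent_j in enumerate(sentences):
            pairs = product(sent_i["proteins"], sent_j["go_terms"])
            target = sent_comen if i == j else non_sent_comen
            target.update(pairs)
    return bag_of_words, sent_comen, non_sent_comen


bow, sent, non_sent = extract_features(document)
print(dict(bow["P50281"]))
print(dict(sent))
print(dict(non_sent))
```

Each protein then becomes one row per feature type, with the counts as sparse feature values.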

An aside on GO concept recognition

• Given:
– Gene Ontology (~46,000 concepts)

In mice lacking ephrin-A5 function, cell proliferation and survival of newborn neurons… (PMID 20474079)

• Return:
– GO:0008283 cell proliferation
– GO:0005125 cytokine activity
– GO:0048666 neuron development

(can be based on a judgment about the depth of experimental evidence)

(CRAFT example)

Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.

GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”

GO:0006900 – membrane budding

[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane, resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…

Variation in PMID: 12925238
• Lipid rafts play a key role in membrane budding…
• …involvement of annexin A7 in budding of vesicles…
• …Ca2+-mediated vesiculation process was not impaired.
• Red blood cells which lack the ability to vesiculate cause…
• Having excluded a direct role in vesicle formation…
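The gap between ontology labels and natural language can be made concrete with a naive dictionary matcher. The sketch below uses only the name and synonyms of GO:0006900 listed above and does whole-word, case-insensitive string matching; `find_concepts` is a hypothetical helper, not one of the tools compared in this talk.

```python
import re

# Name and synonyms copied from the GO:0006900 entry shown above.
LEXICON = {
    "GO:0006900": [
        "membrane budding",
        "membrane evagination",
        "nonselective vesicle assembly",
        "vesicle biosynthesis",
        "vesicle formation",
    ],
}


def find_concepts(text, lexicon=LEXICON):
    """Return sorted IDs whose name or synonym occurs as whole words in text."""
    lowered = text.lower()
    hits = set()
    for concept_id, labels in lexicon.items():
        for label in labels:
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                hits.add(concept_id)
    return sorted(hits)


phrases = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process was not impaired",
    "Having excluded a direct role in vesicle formation",
]
for phrase in phrases:
    print(find_concepts(phrase), phrase)
```

Exact matching finds the first and last phrases but misses "budding of vesicles" and "vesiculation", which is why parameters such as stemming, word order, and synonym handling matter when comparing concept recognition tools.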

GO vs NL

Comparing tool performance on CR

• NCBO Annotator (96 combinations)

wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms

• MetaMap (864 combinations)

model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize

• ConceptMapper (576 combinations)

searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms

Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.

Literature alone is useful

[Figure: macro-averaged F-measure (y-axis, 0 to 0.7) by Gene Ontology branch (MF, BP, CC) for four settings: baseline (co-mentions as predictions), co-mentions, BoW, and co-mentions + BoW]

Literature features approach performance of commonly used biological features

[Figure: macro-averaged F-measure (y-axis, 0 to 0.8) by Gene Ontology branch (MF, BP, CC) for five feature sets: Trans/Localization, Homology, Network, Literature, and All Combined]

(and combining them with other features is even better!)
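For readers unfamiliar with the metric on the y-axis of these plots, here is a minimal sketch of a macro-averaged F-measure. It assumes the average is taken over terms, with each term contributing equally; the slides do not say whether the averaging is over terms or over proteins, so treat that choice, the function names, and the toy data as assumptions.

```python
def f_measure(gold, predicted):
    """F1 for one term, given the sets of proteins annotated with it."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f(gold_by_term, predicted_by_term):
    """Unweighted mean of per-term F1 over all terms in the gold standard."""
    scores = [
        f_measure(proteins, predicted_by_term.get(term, set()))
        for term, proteins in gold_by_term.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: term -> set of protein IDs.
gold = {"GO:A": {"p1", "p2"}, "GO:B": {"p3"}}
predicted = {"GO:A": {"p1"}, "GO:B": {"p3", "p4"}}
print(round(macro_f(gold, predicted), 3))
```

Because every term counts equally, rare terms influence the score as much as frequent ones.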

Manual inspection of misclassifications

Some false positives appear to have literature support:

• GCNT1 – carbohydrate metabolic process (Q02742 - GO:0005975)
Genes related to carbohydrate metabolism include PPP1R3C, B3GNT1, and GCNT1… [PMID:23646466]

• CERS2 – ceramide biosynthetic process (Q96G23 - GO:0046513)
…CersS2, which uses C22-CoA for ceramide synthesis… [PMID:22144673]

Results: Multi-view learning

Results: different sources

Mouse annotations from geneontology.org

Phenotype Prediction

Kahanda I, Funk C, Verspoor K and Ben-Hur A. F1000Research 2015, 4:259 (doi: 10.12688/f1000research.6670.1)

PHENOstruct

Human Phenotype Ontology
Organ, Inheritance, Onset subontologies have separate models

Gold annotations via transfer

PHENOstruct Features

• Network (functional association data)
– protein-protein interactions
– co-expression
– co-localization
– From BioGRID, STRING, GeneMANIA

• Gene Ontology (experimental) annotations
• Literature mined data: bag of words in gene sentences
• Genetic variants (protein -> disease -> variants)

Performance

Subont.      Terms  Method        AUC   P-value
Organ        1,796  Binary SVMs   0.66  1.70E-262
                    Clus-HMC-Ens  0.65  0.00E+00
                    PHENOstruct   0.73  -
Inheritance  12     Binary SVMs   0.72  2.20E-01
                    Clus-HMC-Ens  0.73  7.30E-01
                    PHENOstruct   0.74  -
Onset        23     Binary SVMs   0.62  4.40E-03
                    Clus-HMC-Ens  0.58  3.30E-05
                    PHENOstruct   0.64  -

PHENOstruct in Organ subontology

Gold vs Predicted, P43681

Gold Predicted

Hierarchical, protein-centric
P = 1.0; R = 0.62
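A hierarchical, protein-centric precision and recall can be sketched as set overlap after both the gold and the predicted annotations of one protein are expanded with all their ancestors. The hierarchy and term names below are hypothetical (not real HPO terms, and not the annotations of P43681); the toy numbers are chosen only to show how perfect precision can coincide with partial recall.

```python
# Hypothetical toy hierarchy (child -> parents); not real HPO identifiers.
HPO_TOY_PARENTS = {
    "root": [], "a": ["root"], "b": ["root"],
    "a1": ["a"], "a2": ["a"], "b1": ["b"], "b2": ["b"], "a1x": ["a1"],
}


def closure(terms):
    """Return the terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(HPO_TOY_PARENTS[term])
    return closed


def hierarchical_pr(gold, predicted):
    """Precision and recall for one protein over ancestor-expanded term sets."""
    gold_closed, pred_closed = closure(gold), closure(predicted)
    overlap = len(gold_closed & pred_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(gold_closed) if gold_closed else 0.0
    return precision, recall


gold_terms = {"a1x", "a2", "b1", "b2"}
predicted_terms = {"a1x", "a2"}
precision, recall = hierarchical_pr(gold_terms, predicted_terms)
print(precision, recall)
```

Here every predicted term lies in the gold subgraph, so precision is perfect, while the whole "b" branch is missed and recall drops.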

Impact of data sources

Leave-one-source-out

Top Literature Features

Category: Tokens

proteins & complexes: cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, tnf-alpha-mediated, tnf-alpha-stimulated, tnf–mediated, ink4a/arf, ns4b, hmsh6, fukutin, cdtb, ns5b, apoai, tnf–stimulated, ns4a, tnf-alpha-, rhbmp-2, tnf-alpha-treated, frataxin, ki-ras, connexin32, tcdb, recql4, =-galcer, tyrosinase-related, hpms2, her4, cd40-cd40l, lmp2a, ryrs, mg2+-atpase, ews-fli1, abeta42, fancc, p40phox, her1, bdnf-induced, trap+, gfap-ir, daf-16/foxo, hdl3, -238, [tnf-alpha], cd40/cd40l, tnf–treated, anti-ngf, tep1, recq, nt-4, pfemp1, zo-2, nphp1, tnf-alpha-dependent, pomt1, igm-positive, apoa-ii, p110alpha, fancf, tbx4, anti-cd40l, igg

genes: hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2

pathways: ras/raf/mek/erk, pi3k-akt-mtor

diseases/phenotypes: cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin

misc.: sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, crysalis, wingx, amf

Conclusions

• The literature provides a significant resource for biological function prediction

• The literature provides one ‘view’ of biological knowledge and is best combined with other resources

• Even some simple strategies for extracting associations from the literature can provide valuable information, taken at large scale
– “bag of words” and co-occurrence models reasonable starting point: capture implied relationships
– scope for integration of more targeted extracted relationships (e.g. protein-protein interactions), with the usual Precision/Recall tradeoff

Acknowledgements

• Los Alamos National Laboratory
– Michael Wall

• Colorado
– Larry Hunter (U. Colorado Denver)
– Christopher Funk (U. Colorado Denver)
– Asa Ben-Hur (Colorado State University)
– Indika Kahanda (Colorado State University)

• NICTA Victoria Research Laboratory
– Geoffrey Macintyre (U. Cambridge)
– Antonio Jimeno Yepes (IBM Research Australia)
– Cheng-Soon Ong (NICTA Canberra)

• Funding: US NIH, US NSF, NICTA, Australian Research Council

Thank you!

Machine learning for text analysis

[Diagram: training. A training set (notes + labels for classes of interest) goes through language processing, supported by biomedical knowledge sources (UMLS, OBOs), to produce features (words, phrases, linguistic categories; names of entities; domain concepts; document features). A machine learning algorithm uses these features to build a model relating features of the text to classes of interest.]

Machine learning for text analysis

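[Diagram: classification. New text to be classified goes through the same language processing, supported by biomedical knowledge sources (UMLS, OBOs), to produce features (words, phrases, linguistic categories; names of entities; domain concepts; document features). The model then outputs a predicted classification (label).]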
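The train-then-classify workflow in these two diagrams can be sketched with a small bag-of-words classifier. Multinomial naive Bayes with add-one smoothing is used here purely as a stand-in for the unspecified "machine learning algorithm"; the training examples, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Fit a multinomial naive Bayes model from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        # Add-one smoothing: every vocabulary word gets one extra count.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: tokenized text plus a label per example.
training_set = [
    (["protease", "cleaves", "substrate"], "enzyme"),
    (["kinase", "phosphorylates", "substrate"], "enzyme"),
    (["channel", "transports", "water"], "transporter"),
    (["pore", "transports", "ions"], "transporter"),
]
model = train(training_set)
print(predict(model, ["aquaporin", "transports", "water"]))
```

In a fuller pipeline the token lists would be replaced by the richer features named in the diagrams, such as recognized entities and ontology concepts.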
