biomedical relation extraction for knowledge graph completion


Upload: claudiu-mihaila

Post on 15-Apr-2017


Page 1: Biomedical Relation Extraction for Knowledge Graph Completion

Bio-RE for KG completion

Gotta catch’em all!™

Claudiu Mihăilă

Page 2: Biomedical Relation Extraction for Knowledge Graph Completion

Because it matters

Page 3: Biomedical Relation Extraction for Knowledge Graph Completion

Drug development process

Page 4: Biomedical Relation Extraction for Knowledge Graph Completion

Signalling pathways

Page 5: Biomedical Relation Extraction for Knowledge Graph Completion

Scientia potentia est

• ∼26M biomedical articles indexed

• ∼3500 articles per day in 2015

• More information than any one person can comprehend

Page 6: Biomedical Relation Extraction for Knowledge Graph Completion

Structured databases

Page 7: Biomedical Relation Extraction for Knowledge Graph Completion

Structured databases

• High number of DBs

• Manually curated by experts

but

• Long backlog

• Limited coverage, reflecting bias of the curators

• Limited linkage to literature evidence

Page 8: Biomedical Relation Extraction for Knowledge Graph Completion

ML Objective

Resistance to IL-10 inhibition of interferon gamma production and expression of suppressor of cytokine signalling 1 in CD4+ T cells from patients with rheumatoid arthritis.

Page 13: Biomedical Relation Extraction for Knowledge Graph Completion

Supervised ML approaches

BioNLP-ST’16 – Task GE4 – NFκB KB construction

• 2-stage classification with SVM/LR + post-processing

• RNNs with LSTM units on words, PoSs, dependencies

        P     R     F1
Evex    0.47  0.32  0.38
TEES    0.45  0.33  0.38
VERSE   0.60  0.23  0.33
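The F1 column is the harmonic mean of precision (P) and recall (R). A minimal sketch that recomputes it from the reported P and R values; the function and variable names are illustrative, not taken from the shared task tooling:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported on the slide.
systems = {
    "Evex": (0.47, 0.32),
    "TEES": (0.45, 0.33),
    "VERSE": (0.60, 0.23),
}

# Rounded to two decimals, as in the table.
f1_by_system = {name: round(f1_score(p, r), 2) for name, (p, r) in systems.items()}
```

VERSE has the highest precision of the three, but its lower recall pulls its F1 below the other two systems.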

Page 14: Biomedical Relation Extraction for Knowledge Graph Completion

Supervised ML

• High confidence annotations

• Literature evidence marked up

but

• Limited coverage/high bias

• Limited number of corpora

• Expensive and slow to produce training data

Page 15: Biomedical Relation Extraction for Knowledge Graph Completion

Distant Supervision

Combining DBs and Supervised ML

Align relation candidates extracted from text with known relationships, and use the structured database as a distant supervision training signal

• No bias towards a specific genre

• Cheap and fast to produce fairly large training sets

Page 16: Biomedical Relation Extraction for Knowledge Graph Completion

Distant Supervision Example

PubMed

Mutations in the gene encoding the TAR DNA-binding protein 43 have been identified in some familial amyotrophic lateral sclerosis (ALS).

A novel missense mutation in a highly conserved region of TDP-43 was identified in a patient with sporadic ALS.

We screened the TARDBP mutation in 721 Japanese ALS by direct sequencing.

DB:IS_ASSOCIATED_WITH

TDP43, ALS
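The alignment in this example can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the pipeline used in the talk: the alias table (`ALIASES`) is a hypothetical stand-in for entity recognition and resolution, and `KNOWN` holds the single database fact from the slide.

```python
import re
from itertools import combinations

# Hypothetical alias table standing in for entity recognition + resolution.
ALIASES = {
    "tar dna-binding protein 43": "TDP43",
    "tdp-43": "TDP43",
    "tardbp": "TDP43",
    "amyotrophic lateral sclerosis": "ALS",
    "als": "ALS",
}

# Known fact from the structured database (the distant supervision signal).
KNOWN = {("TDP43", "ALS"): "IS_ASSOCIATED_WITH"}


def find_entities(sentence):
    """Return the sorted canonical IDs of all aliases found as whole words."""
    lowered = sentence.lower()
    found = set()
    for alias, canonical in ALIASES.items():
        if re.search(r"\b" + re.escape(alias) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


def label_sentences(sentences):
    """Label a sentence with a DB relation when it mentions both arguments."""
    examples = []
    for sentence in sentences:
        for a, b in combinations(find_entities(sentence), 2):
            for head, tail in ((a, b), (b, a)):
                if (head, tail) in KNOWN:
                    examples.append((sentence, head, tail, KNOWN[(head, tail)]))
    return examples


sentences = [
    "Mutations in the gene encoding the TAR DNA-binding protein 43 have been "
    "identified in some familial amyotrophic lateral sclerosis (ALS).",
    "A novel missense mutation in a highly conserved region of TDP-43 was "
    "identified in a patient with sporadic ALS.",
    "We screened the TARDBP mutation in 721 Japanese ALS by direct sequencing.",
]
training_examples = label_sentences(sentences)
```

All three sentences mention both arguments under different surface forms, so each becomes a (noisy) positive training example for IS_ASSOCIATED_WITH without any manual annotation.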

Page 17: Biomedical Relation Extraction for Knowledge Graph Completion

Famous work

DeepDive (Stanford/Lattice)

• Gene-gene interaction from PLOS biomedical journals

• Uses BIOGRID for distant supervision

Literome

• Protein regulation event extraction from PubMed abstracts

• Uses the Pathway Interaction Database for distant supervision

Page 18: Biomedical Relation Extraction for Knowledge Graph Completion

Information Extraction Pipeline

PM / PMC

Entity Recognition

Entity Resolution

Syntactic Parsing

OpenIE Relations

Page 19: Biomedical Relation Extraction for Knowledge Graph Completion

Knowledge graph

Page 20: Biomedical Relation Extraction for Knowledge Graph Completion

Distant supervision Pipeline

BioDBs

PM / PMC

Knowledge Graph

Distant Supervision

Extracted Relations

Curation

Information Extraction

Curation
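One way to read the completion step in this pipeline: extracted relations that are not yet in the knowledge graph become candidates for curation. A minimal sketch under that assumption; the `Extraction` type, the function name, and the graph contents are hypothetical, while the triples and PubMed IDs are the ones quoted in this deck:

```python
from typing import NamedTuple


class Extraction(NamedTuple):
    relation: str
    arg1: str
    arg2: str
    pmid: str  # PubMed ID of the supporting sentence


def candidates_for_curation(extractions, knowledge_graph):
    """Return extractions whose (relation, arg1, arg2) triple is not in the graph."""
    return [
        e for e in extractions
        if (e.relation, e.arg1, e.arg2) not in knowledge_graph
    ]


# Hypothetical graph contents, chosen only to illustrate the filtering.
knowledge_graph = {("IS_ASSOCIATED_WITH", "Brachydactyly", "CHSY1")}

extractions = [
    Extraction("IS_ASSOCIATED_WITH", "CAH", "WNK1", "22932914"),
    Extraction("IS_ASSOCIATED_WITH", "Brachydactyly", "CHSY1", "22280990"),
    Extraction("IS_ASSOCIATED_WITH", "Cushing Syndrome", "KCNJ5", "26743443"),
]

new_edges = candidates_for_curation(extractions, knowledge_graph)
```

Keeping the PubMed ID on every extraction gives the curator the literature evidence alongside each proposed edge.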

Page 21: Biomedical Relation Extraction for Knowledge Graph Completion

Examples of learned relations

(IS_ASSOCIATED_WITH, CAH, WNK1)

. . . mineralocorticoid excess can be caused by congenital adrenal hyperplasia (CAH) . . . due to mutations in the WNK1, WNK4, KLHL3, CUL3 genes.

PM:22932914

(IS_ASSOCIATED_WITH, Brachydactyly, CHSY1)

Our results place Chsy1 as an essential regulator of joint patterning and provide a mouse model of human brachydactylies caused by mutations in CHSY1.

PM:22280990

(IS_ASSOCIATED_WITH, Cushing Syndrome, KCNJ5)

. . . these mutations, in addition to mutations in the KCNJ5 gene . . ., may be responsible for the tumorigenesis of APAs and CPAs with subclinical Cushing’s syndrome.

PM:26743443

Page 22: Biomedical Relation Extraction for Knowledge Graph Completion

Challenges

• Error propagation from upstream tasks

• Cross-sentence relations

• Long tails - overfitting to the more common entity/entity pairs

• Speculation, negation, changes over time, conflicting information

Page 23: Biomedical Relation Extraction for Knowledge Graph Completion

Thank you for your attention!

Any questions?