LA Semantic Web Meetup, Nov 5th 2012
TRANSCRIPT
The Semantic Web (There and Back Again)
Pablo N. Mendes, Research Associate, Open Knowledge Foundation
1 11/5/12
Cartic Ramakrishnan, Research Scientist
Datapop
Evolution of the Semantic Web
[Timeline figure: 1945, 1991, 2001, “+ Internet”]
“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers.” – Tim Berners-Lee
2 11/2/12
Emergent Knowledge in Public Text
[Figure: graph connecting Nicolas Poussin, Nicolas Flammel, The Hunchback of Notre Dame, Victor Hugo, Priory of Sion, Louvre and Leonardo Da Vinci through edges labeled painted_by, mentioned_in, written_by, cryptic_motto_of, member_of and displayed_at]
3 11/2/12
Emergent Knowledge in Biomedical Research Papers
Confirmed by clinical trials
Swanson, D. R. (1986). "Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge." Perspectives in Biology and Medicine 30(1): 7-18.
12 subsequent studies support hypothesis
Magnesium can inhibit spreading cortical depression
Spreading cortical depression may be implicated in migraine attacks
Swanson, D. R. (1988). "Migraine and Magnesium: Eleven Neglected Connections." Perspectives in Biology and Medicine 31(4): 526-557.
4 11/2/12
Dietary fish oils contain eicosapentaenoic acid.
Eicosapentaenoic acid reduces blood viscosity.
Raynaud’s disease patients have elevated blood viscosity.
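The Swanson examples on these slides follow one pattern: facts stated in separate papers link up into a chain that no single paper states. A minimal Python sketch of that chaining follows. The triples are transcribed from the slide text (the relation "have elevated" is a rephrasing of the slide's wording), and the names `FACTS` and `find_chain` are illustrative assumptions, not part of the speakers' system.

```python
from collections import deque

# Facts from the slides as (subject, relation, object) triples.
# Wording is normalized by hand; this is an assumption of the sketch.
FACTS = [
    ("dietary fish oils", "contain", "eicosapentaenoic acid"),
    ("eicosapentaenoic acid", "reduces", "blood viscosity"),
    ("Raynaud's disease patients", "have elevated", "blood viscosity"),
    ("magnesium", "can inhibit", "spreading cortical depression"),
    ("spreading cortical depression", "may be implicated in", "migraine attacks"),
]


def find_chain(facts, start, goal):
    """Return the shortest chain of facts linking start to goal, or None.

    Edge direction is ignored, because a hypothesis may join facts that
    point toward a shared intermediate concept from both sides.
    """
    neighbors = {}
    for fact in facts:
        subject, _, obj = fact
        neighbors.setdefault(subject, []).append((obj, fact))
        neighbors.setdefault(obj, []).append((subject, fact))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, chain = queue.popleft()
        if node == goal:
            return chain
        for nxt, fact in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, chain + [fact]))
    return None


if __name__ == "__main__":
    for fact in find_chain(FACTS, "dietary fish oils", "Raynaud's disease patients"):
        print(fact)
```

The chain it returns is only a candidate hypothesis; as the slides note, it still has to be confirmed by later studies.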
Application of Emergent Knowledge in Biology – Drug Repurposing
Girnun, G. D., E. Naseri, et al. (2007). Cancer Cell 11(5): 395-406
[Figure: graph over Rosiglitazone, Peroxisome proliferator-activated receptor gamma (PPARγ), Metallothionein, Carboplatin, DNA fragmentation and Cancer cell death, with edges labeled activates, downregulates and induces]
5 11/2/12
Research Areas
• Extracting Factual Knowledge from Biomedical Research Articles
  – Entities – “Carboplatin induces Cell Death”
  – Relations – induces(Carboplatin, Cell Death)
  – Supervised Machine Learning
    • Expensive training data
• Discovering Patterns in Factual Knowledge
  – Paths – Carboplatin ??? Rosiglitazone
  – Subgraphs
6 11/5/12
LA-PDFText – Extracting Text From Research Papers
7 11/6/12
Ramakrishnan, C., A. Patnia, E. Hovy and G. Burns (2012). "Layout-Aware Text Extraction from Full-text PDF of Scientific Articles." Source Code for Biology and Medicine 7(1): 7. http://code.google.com/p/lapdftext/
8 11/6/12
Unsupervised Fact Extraction
Dallenbach-Hellweg, G. (1976) Fortschr Med 94(5): 256-263. Abstract: An excessive endogenous or exogenous stimulation by estrogen induces adenomatous hyperplasia of the endometrium.
[Figure: dependency parse of the abstract sentence. Relationship: induces. Subject head: stimulation. Object head: hyperplasia. Dependency labels shown: nsubj, dobj, det, amod, conj_or, prep_by, prep_of]
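The slide reads a relationship and its subject and object heads off a dependency parse. The sketch below shows one way that reading could work in Python. The edge list is written by hand for the example sentence, using the labels that appear on the slide, and is an assumption rather than the output of any parser. The helper names `extract_relations` and `modifiers` are hypothetical.

```python
# Hand-written (label, governor, dependent) edges for the sentence
# "An excessive endogenous or exogenous stimulation by estrogen induces
# adenomatous hyperplasia of the endometrium." Assumed, not parser output.
DEPENDENCIES = [
    ("nsubj", "induces", "stimulation"),
    ("dobj", "induces", "hyperplasia"),
    ("det", "stimulation", "An"),
    ("amod", "stimulation", "excessive"),
    ("amod", "stimulation", "endogenous"),
    ("conj_or", "endogenous", "exogenous"),
    ("prep_by", "stimulation", "estrogen"),
    ("amod", "hyperplasia", "adenomatous"),
    ("prep_of", "hyperplasia", "endometrium"),
    ("det", "endometrium", "the"),
]


def extract_relations(dependencies):
    """Pair each verb's nsubj with its dobj: (subject head, relation, object head)."""
    subjects, objects = {}, {}
    for label, governor, dependent in dependencies:
        if label == "nsubj":
            subjects[governor] = dependent
        elif label == "dobj":
            objects[governor] = dependent
    return [(subjects[verb], verb, objects[verb])
            for verb in subjects if verb in objects]


def modifiers(dependencies, head):
    """Adjectival modifiers attached directly to a head word."""
    return [dep for label, gov, dep in dependencies
            if gov == head and label == "amod"]


if __name__ == "__main__":
    print(extract_relations(DEPENDENCIES))
    print(modifiers(DEPENDENCIES, "stimulation"))
```

Because the rule needs only the parse and no labeled training data, it fits the "unsupervised" framing of the slide. Growing the heads into full compound entities, using modifiers and prepositional attachments, is the harder step that the RDF slide addresses.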
9 11/2/12
Resulting Structure (RDF)
[Figure: RDF graph for the Dallenbach-Hellweg (1976) abstract sentence. Nodes: “An excessive endogenous or exogenous stimulation”, estrogen, modified_entity_1, composite_entity_1, endometrium, modified_entity_2, “adenomatous hyperplasia”. Edges: induces, hasModifier (×2), hasPart (×4)]
10 11/6/12
Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang, Amit P. Sheth: Unsupervised Discovery of Compound Entities for Relationship Extraction. EKAW 2008: 146-155
Detecting Nested Entities
11/5/12 11
Chevy Chase Bank on 5th and 3rd
[Figure: syntactic dependencies over the phrase, with edges labeled nn (×2) and prep_on (×2)]
[[[Chevy Chase]Person Bank]Org on 5th and 3rd]Location
Result of Unsupervised Extraction
Abstracts of ~18 million research articles
~200 million parse trees
Entity-relationship network
• 137,414,820 triples with named relations
  – Triple “hair-ball”
12 11/5/12
Discovering Patterns in Factual Knowledge
11/6/12 13
Discovering Patterns in Factual Knowledge
• Finding Paths
  – Exponential number of paths: information overload
  – Relevance: not all paths are equally relevant
• Our solution
  – Subgraph detection with fixed node budget
  – Heuristic edge weighting to control relevance
11/6/12 14
Cartic Ramakrishnan, William H. Milnor, Matthew Perry, Amit P. Sheth: Discovering informative connection subgraphs in multi-relational graphs. SIGKDD Explorations 7(2): 56-63 (2005)
Candidate Subgraph Identification
• Bidirectional lock-step growth from S and T
  – Next hop based on edge weights
  – Terminate when cut edge limit reached
  – Results in candidate graph
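The growth procedure above can be sketched roughly as follows. The slide does not define the "cut edge limit", so this Python sketch substitutes a plain node budget as its stopping rule. It also assumes that a higher edge weight means a more relevant edge and that the graph is undirected. The names `best_next_hop` and `candidate_nodes` are illustrative, and the code is not the published algorithm.

```python
def best_next_hop(graph, grown):
    """Pick the unvisited neighbor reached by the heaviest edge leaving `grown`."""
    best = None
    for u in grown:
        for v, weight in graph[u].items():
            if v not in grown and (best is None or weight > best[0]):
                best = (weight, v)
    return None if best is None else best[1]


def candidate_nodes(graph, source, target, node_budget):
    """Grow one set from source and one from target, one hop each per round.

    Stops when the combined node count reaches node_budget (an assumed
    stand-in for the slide's cut edge limit) or neither side can grow.
    """
    side_s, side_t = {source}, {target}
    while True:
        grew = False
        for side in (side_s, side_t):
            if len(side_s | side_t) >= node_budget:
                return side_s | side_t
            hop = best_next_hop(graph, side)
            if hop is not None:
                side.add(hop)
                grew = True
        if not grew:
            return side_s | side_t


# Toy undirected graph: node -> {neighbor: weight}. Weights are made up.
GRAPH = {
    "S": {"a": 0.9, "b": 0.2},
    "a": {"S": 0.9, "c": 0.8, "d": 0.4},
    "b": {"S": 0.2, "T": 0.1},
    "c": {"a": 0.8, "T": 0.7},
    "d": {"a": 0.4, "T": 0.3},
    "T": {"c": 0.7, "b": 0.1, "d": 0.3},
}

if __name__ == "__main__":
    print(sorted(candidate_nodes(GRAPH, "S", "T", node_budget=4)))
```

With a budget of four nodes the two sides meet along the heaviest edges and the weakly connected nodes are left out, which is the point of weighting edges before growing.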
11/6/12 15
Finding Best Subgraphs
• Candidate Graph
  – Too large to be useful
  – Listing paths = information overload
• Electrical Circuit
  – Edge weights = resistance
  – +1 volt at source node & ground at target
• Using Ohm’s and Kirchhoff’s laws
  – Find maximum current flow paths through the candidate graph from S to T
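The circuit analogy can be illustrated with a small Python sketch. It holds the source at 1 volt and the target at 0, then finds the other node voltages by repeatedly applying Kirchhoff's current law, so that each interior node settles at the conductance-weighted average of its neighbors. Ohm's law then gives each edge current. Following the largest current greedily from S is a simplification chosen for this sketch and not necessarily how the cited paper selects paths. The function names and the toy resistances are assumptions.

```python
def node_voltages(resistance, source, target, sweeps=1000, tol=1e-12):
    """Gauss-Seidel relaxation of Kirchhoff's current law on a resistor graph.

    Assumes every node has at least one neighbor.
    """
    volts = {node: 0.0 for node in resistance}
    volts[source] = 1.0
    for _ in range(sweeps):
        change = 0.0
        for node, nbrs in resistance.items():
            if node in (source, target):
                continue  # boundary voltages stay fixed
            conductance = sum(1.0 / r for r in nbrs.values())
            new = sum(volts[v] / r for v, r in nbrs.items()) / conductance
            change = max(change, abs(new - volts[node]))
            volts[node] = new
        if change < tol:
            break
    return volts


def max_current_path(resistance, source, target):
    """Follow the edge carrying the most current (Ohm's law) from source to target."""
    volts = node_voltages(resistance, source, target)
    path = [source]
    while path[-1] != target:
        here = path[-1]
        nxt = max(resistance[here],
                  key=lambda v: (volts[here] - volts[v]) / resistance[here][v])
        if volts[here] - volts[nxt] <= 0:
            return None  # no current flows onward: target unreachable
        path.append(nxt)
    return path


# Toy candidate graph: node -> {neighbor: resistance}. Values are made up.
RESISTANCE = {
    "S": {"a": 1.0, "b": 2.0},
    "a": {"S": 1.0, "T": 1.0},
    "b": {"S": 2.0, "T": 2.0},
    "T": {"a": 1.0, "b": 2.0},
}

if __name__ == "__main__":
    print(max_current_path(RESISTANCE, "S", "T"))
```

Current always flows from higher to lower voltage, so the greedy walk cannot cycle. Low-resistance routes carry more current, which is how the edge weighting controls which connections surface.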
11/6/12 16
Semi-automated Knowledge Discovery in Biomedicine – How far are we?
• Trust in extracted facts
  – Extraction errors
  – Poor quality sources
  – No provenance
  – Misleading citations
  – Intentionally misleading research reports
  – Unintentional mistakes in research reports
• Information overload
11/5/12 17
Building A Web of Linked Entities with DBpedia Spotlight
11/5/12 18
Pablo N. Mendes, Research Associate, Open Knowledge Foundation