improving recall for conjunctive queries on nlp graphs

Answering Conjunctive SPARQL Queries over NLP Graphs Lora Aroyo


“30 are better than one”

Improving recall for conjunctive queries on NLP graphs


Chris Welty, Ken Barker, Lora Aroyo, Shilpa Arora

(c) Andy Warhol

Wednesday, October 17, 2012


Goal: hypothesis generation & validation framework for NLP Graphs

Hypothesis: within this framework, there is value in the secondary extraction graph for conjunctive query answering

the probability of a secondary graph statement being correct increases significantly when that statement generates a new result to a conjunctive query over the primary graph


Machine Reading Program



The MRP Vision

Legacy SW

DB SME

to decrease the cost of maintaining critical system DBs, can we replace the human without changing the legacy software (LSW)?

can we build a machine reader for this?




The MRP Vision

Legacy SW

Machine Reader!

query





SRI Answer to the Vision

Legacy SW

DB SME query

replacing the human, but still with a DB

NLP components must make their best guess, without any knowledge of the specific task at hand, e.g. the query





Legacy SW

DB

query



NLP Stack

Machine Reader!




The MRP Vision

Legacy SW

DB SME query

Machine Reader!

the NLP process is not a one-shot deal

the query provides context for what the user is seeking, and thus an opportunity to re-interpret the text

NLP Stack

NLP Graphs

re-interpret



NLP Stack

• Contains NER, CoRef, RelEx, entity disambiguation

• RelEx: SVM learner whose output score is a probability/confidence, for each known relation and each pair of mentions in a sentence, that the sentence expresses that relation between the pair

• Run over target corpus producing NLP graph

• nodes are entities (clusters of mentions produced by coref)

• edges are type statements between entities and classes in the ontology, or relations detected between mentions of these entities in the corpus



RDF for NLP

• use SemTech to influence the NLP stack, rather than having NLP components only feed the knowledge integration layer

• to store the results of IE in RDF Graphs (NLP Graphs), where:

• each triple has a confidence of the NLP components and provenance indicating where the triple was stated in natural language text

• triple - not an expression of truth, but a representation of what an NLP component, or a human annotator, read in a document

• confidence - not that the triple is true, but reflects the confidence that the text states the triple (component level confidence)
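The triple-plus-confidence-plus-provenance idea above can be sketched as a small data structure. This is a minimal illustration, not the system's actual storage format: the class names, the `ex:` prefix, the document ID, the offsets and the confidence values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    doc_id: str  # hypothetical document identifier
    start: int   # character offset where the supporting span begins
    end: int     # character offset where the supporting span ends


@dataclass(frozen=True)
class NlpTriple:
    subject: str
    predicate: str
    obj: str
    confidence: float       # confidence that the text STATES the triple, not that it is true
    provenance: Provenance  # where in the text the triple was read


def above(triples, threshold):
    """Keep triples whose component-level confidence exceeds the threshold."""
    return [t for t in triples if t.confidence > threshold]


# Illustrative values only (not taken from the experiments).
triples = [
    NlpTriple("ex:MrX", "mric:citizenOf", "geo:India", 0.62, Provenance("doc-001", 4, 19)),
    NlpTriple("ex:MrX", "mric:subPlaceOf", "geo:India", 0.07, Provenance("doc-001", 4, 19)),
]
kept = above(triples, 0.1)
```

Keeping both readings of the same span, each with its own confidence, is what later makes a secondary graph possible.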



“... Mr. X of India ...”

“... in countries like, India, Iran, Iraq ...”



[Diagram: evidence produced by the NLP Stack for the two sentences. Mentions: Mr. X, India; India, Iran, Iraq. Types: Person, GPE, Country. Relations: citizenOf, subPlaceOf, sameAs.]



NLP Graph

[Diagram: graph with entity nodes Mr. X, India, Iran, Iraq; class nodes Person, GPE, Country; edges citizenOf, subPlaceOf, rdf:type, rdfs:subClassOf.]

RDF Graph

• The nodes & arcs refer to the results of NLP, not “truth”

• There is error (precision, recall)

• There is confidence associated with each triple



NLP Stack produces

• two NLP graphs

• primary graph = the single best type, relation & coreference results for each mention or mention pair

• secondary graph = all possibilities considered by the NLP stack


SPARQL Queries on NLP Graphs

19-Sept-2012 Hypothesis Generation for Answering Queries over NLP Graphs Lora Aroyo



Conjunctive Query

Find Jordanian citizens who are members of Hezbollah

SELECT ?p
WHERE {
  ?p mric:citizenOf geo:Jordan .
  mric:Hezbollah mric:hasMember ?p .
}

• find all bindings for the variable ?p that satisfy the query

• report where in the target corpus the answers were found (spans of text expressing the relations in the query)



Conjunctive Queries Recall

• $\left[\prod_{k=1}^{n} \mathrm{Recall}(R_k)\right] \times \mathrm{Recall}_{\mathrm{coref}}$

• for a conjunctive query of n terms, recall could be $O(\mathrm{Recall}^n)$

• for complex queries, Recall becomes the dominating factor: the overall Recall gets severely degraded by term Recall

• in our experiments: query recall < .1 for n > 3

• all NLP components had to work correctly to get an answer
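A short worked example of how the product of per-term recalls collapses as a query grows. The per-term recall of 0.5 and coreference recall of 0.9 are illustrative assumptions, not measurements from the experiments.

```python
import math


def query_recall(term_recalls, coref_recall):
    """Product of the per-term recalls, multiplied by coreference recall."""
    return math.prod(term_recalls) * coref_recall


# Illustrative numbers only: four query terms, each extracted with recall 0.5.
example = query_recall([0.5, 0.5, 0.5, 0.5], 0.9)  # 0.5**4 * 0.9 = 0.05625
```

Even with every relation extractor finding half of its instances, a four-term query under these assumptions retrieves fewer than 6% of its answers.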



• find solutions to subsets of a conjunctive SPARQL query as candidate solutions to the full query

• attempt to confirm the candidate solutions using various kinds of inference, external resources & secondary extraction results

... solution?


hypothesis generation that focuses on parts of an NLP graph that almost match a query, identifying statements that if

proven would generate new query solutions

we are looking for missing links in a graph that, if added, would result in a new query solution

... in other words




Q: R1(x,y) R2(x,z) R3(z,w)

[Diagram: query graph with edges R1, R2 and R3; candidate R3 edges are marked "R3?"]

so, each hypothesis set, if added to the primary NLP graph, would provide a new answer to the original query

only validated hypotheses are added to the query result



Hypothesis Generation

• Relaxes queries of size N by removing query terms Q

• Finds solutions to the remaining set of terms

• for each solution bind the variables in Q forming a hypothesis

• If no solutions to subqueries of size N-1 are found, then N-2

• appropriate for queries that are almost answerable, e.g. when most of the terms in query are not missing

• biased towards generating more answers to queries, i.e. performs poorly on queries for which the corpus does not contain the answer
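The relaxation procedure above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: triples are plain tuples, variables are strings starting with "?", hypothesis sets that still contain unbound variables are skipped, and the node `ex:person7` is a made-up entity ID. The query is the Jordan/Hezbollah example from earlier.

```python
from itertools import combinations


def is_var(x):
    return x.startswith("?")


def match(terms, graph, binding=None):
    """Yield every variable binding under which all terms occur in the graph."""
    binding = binding or {}
    if not terms:
        yield dict(binding)
        return
    first, rest = terms[0], terms[1:]
    for triple in graph:
        new = dict(binding)
        for q, g in zip(first, triple):
            if is_var(q):
                if new.setdefault(q, g) != g:  # variable already bound to something else
                    break
            elif q != g:                       # constant does not match
                break
        else:
            yield from match(rest, graph, new)


def generate_hypotheses(query, graph):
    """Drop 1 term, then 2, ... and stop at the first level that yields hypotheses."""
    graph = set(graph)
    for k in range(1, len(query)):
        found = set()
        for dropped in combinations(query, k):
            kept = [t for t in query if t not in dropped]
            for binding in match(kept, graph):
                hyp = frozenset(tuple(binding.get(x, x) for x in t) for t in dropped)
                unbound = any(is_var(x) for t in hyp for x in t)
                # Simplification: ignore unbound hypotheses and ones already in the graph.
                if not unbound and not hyp <= graph:
                    found.add(hyp)
        if found:
            return found
    return set()


query = [("?p", "mric:citizenOf", "geo:Jordan"),
         ("mric:Hezbollah", "mric:hasMember", "?p")]
primary = {("mric:Hezbollah", "mric:hasMember", "ex:person7")}  # hypothetical entity
hyps = generate_hypotheses(query, primary)
```

Here the primary graph knows the membership but not the citizenship, so the only hypothesis produced is the missing `citizenOf` statement for `ex:person7`.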



find all terrorist organizations that were agents of bombings in Lebanon on October 23, 1983:

SELECT ?t
WHERE {
  ?t rdf:type mric:TerroristOrganization .
  ?b rdf:type mric:Bombing .
  ?b mric:mediatingAgent ?t .
  ?b mric:eventLocation mric:Lebanon .
  ?b mric:eventDate "1983-10-23" .
}



[Diagram: the query as a graph. ?t rdf:type mric:TerroristOrganization; ?b rdf:type mric:Bombing; ?b mric:mediatingAgent ?t; ?b mric:eventLocation mric:Lebanon; ?b mric:eventDate 1983-10-23]

[Diagrams: relaxed versions of the query graph, one per relaxation below]

1. find all bombings in Lebanon on 1983-10-23 with agents (hypothesize that the agents are terrorist organizations)

2. find all events in Lebanon on 1983-10-23 by terrorist orgs (hypothesize that the events are bombings)

3. find all bombings in Lebanon on 1983-10-23 (all known terrorist organizations are hypothetical agents)

4. find all bombings by terrorist orgs on 1983-10-23 (hypothesize that the bombings were in Lebanon)

5. find all bombings by terrorist orgs in Lebanon (hypothesize that the bombings were on 1983-10-23)

[Diagram: for the last relaxation, the relaxed query matches racr:bombing1 and racr:orgs65; the hypothesis is that racr:bombing1 has mric:eventDate 1983-10-23]



Hypothesis Validation

• a stack of hypothesis checkers, each of which (1) reports its confidence that a hypothesis holds and (2) provides provenance: a pointer to a span of text that supports the hypothesis

• to limit complex computational tasks, e.g. formal reasoning or choosing between multiple low-confidence extractions

• such tasks are made more tractable by using hypotheses as goals, e.g. a reasoner may be used effectively by constraining to only a part of the graph connected to a hypothesis
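One way to picture a stack of checkers is as a list of functions that each either return a (confidence, provenance) verdict or decline. The sketch below is an assumption about the shape of such an interface, not the actual system: returning the first verdict offered, the dictionary-backed checkers, and the provenance string format are all made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Verdict = Tuple[float, str]  # (confidence, pointer to a supporting text span)
Checker = Callable[[Triple], Optional[Verdict]]


def make_lookup_checker(evidence: Dict[Triple, Verdict]) -> Checker:
    """Build a checker backed by a dict mapping triples to verdicts."""
    def check(hypothesis: Triple) -> Optional[Verdict]:
        return evidence.get(hypothesis)
    return check


def validate(hypothesis: Triple, checkers: List[Checker]) -> Optional[Verdict]:
    """Ask each checker in order; return the first verdict offered (an assumption)."""
    for check in checkers:
        verdict = check(hypothesis)
        if verdict is not None:
            return verdict
    return None


h = ("ex:person7", "mric:citizenOf", "geo:Jordan")  # hypothetical hypothesis
stack = [
    make_lookup_checker({}),                              # stands in for a knowledge-base checker
    make_lookup_checker({h: (0.3, "doc-042:118-151")}),   # stands in for a secondary-graph checker
]
result = validate(h, stack)
```

Because each checker only sees one hypothesis at a time, an expensive checker such as a reasoner can restrict itself to the part of the graph connected to that hypothesis.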



Hypothesis Checkers

• knowledge base (previous work)

  • taxonomic inference & complex rules

  • rules derived directly from the ontology

  • general, domain-independent rules, e.g. family relationships, and geo knowledge

• TyCor (previous work)

• secondary extraction graph (new work)



Rules Derived from Ontology

• simple superclass-subclass rules (Bombing (?x) → Attack (?x))

• simple relation-subrelation rules (hasSon (?x, ?y) → hasChild (?x, ?y))

• simple relation inverse rules (hasChild (?x,?y) ↔ hasParent (?y,?x))
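The three rule kinds above can be applied by simple forward chaining. This is a minimal sketch using only the example rules from the slide; the rule tables and entity names are illustrative, and a real ontology would supply many more entries.

```python
# Rule tables holding only the slide's examples.
SUBCLASS = {"Bombing": "Attack"}
SUBRELATION = {"hasSon": "hasChild"}
INVERSE = {"hasChild": "hasParent", "hasParent": "hasChild"}


def closure(facts):
    """Apply the three rule kinds repeatedly until no new fact is derived."""
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "rdf:type" and o in SUBCLASS:
                new.add((s, "rdf:type", SUBCLASS[o]))  # superclass-subclass rule
            if p in SUBRELATION:
                new.add((s, SUBRELATION[p], o))        # relation-subrelation rule
            if p in INVERSE:
                new.add((o, INVERSE[p], s))            # relation inverse rule
        if new <= facts:
            return facts
        facts |= new


derived = closure({("ex:e1", "rdf:type", "Bombing"), ("ex:a", "hasSon", "ex:b")})
```

Note that `hasParent(ex:b, ex:a)` only appears on the second pass, after `hasChild` has been derived from `hasSon`.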



Complex Rules from Ontology

• 40 complex rules based on specialization of the domain or range of sub-relations

(hasSubGroup (?x, ?y) & HumanOrganization (?x) → hasSubOrganization (?x, ?y))
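The example rule above combines a relation with a type test on its first argument. A minimal sketch of that single rule follows; the entity names are made up, and the other complex rules are not shown in the slides so they are not reproduced here.

```python
def specialize_subgroup(facts):
    """hasSubGroup(x, y) & HumanOrganization(x) -> hasSubOrganization(x, y)."""
    orgs = {s for s, p, o in facts if p == "rdf:type" and o == "HumanOrganization"}
    return {(s, "hasSubOrganization", o)
            for s, p, o in facts
            if p == "hasSubGroup" and s in orgs}


facts = {
    ("ex:org1", "rdf:type", "HumanOrganization"),  # hypothetical entities
    ("ex:org1", "hasSubGroup", "ex:cell2"),
    ("ex:club9", "hasSubGroup", "ex:team3"),       # ex:club9 has no type, so no inference
}
inferred = specialize_subgroup(facts)
```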


Core Claim: Secondary Graph

is a productive source for hypothesis validation in conjunction with the primary graph to answer a query




Secondary Graph

• an NLP Graph generated from *all* the interpretations considered by the NLP stack, so obviously quite large

• multiple mentions, mention types, multiple entities, multiple entity types & multiple relations between them

• pruned at a particular confidence threshold
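Validating a hypothesis set against a pruned secondary graph can be sketched as a thresholded lookup. Scoring a set by its weakest statement is an assumption made for this illustration, as are the entity name and the confidence values; the slides do not say how the confidences of several statements are combined.

```python
def prune(secondary, threshold):
    """Keep only secondary-graph statements above the confidence threshold."""
    return {t: c for t, c in secondary.items() if c > threshold}


def validate_set(hypothesis_set, secondary, threshold):
    """Return the weakest supporting confidence, or None if any statement lacks support."""
    if not hypothesis_set:
        return None
    pruned = prune(secondary, threshold)
    if not all(h in pruned for h in hypothesis_set):
        return None
    return min(pruned[h] for h in hypothesis_set)  # assumption: weakest link scores the set


# Illustrative secondary graph: triple -> confidence.
secondary = {
    ("ex:person7", "mric:citizenOf", "geo:Jordan"): 0.04,
    ("ex:person7", "mric:citizenOf", "geo:Iraq"): 0.005,
}
hyp = {("ex:person7", "mric:citizenOf", "geo:Jordan")}
supported = validate_set(hyp, secondary, 0.01)
rejected = validate_set(hyp, secondary, 0.05)
```

A statement that would be far too weak to enter the primary graph on its own can still be accepted here, because it is only consulted when it would complete a query result.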


Experimental Setup

testing the ideas




Initial MRP Setup

• OWL target ontology: types & binary relations

• 10-50K documents - Gigaword (sub)corpus

• 79 docs manually annotated (mentions of the target relations & their argument types)

• 50 SPARQL queries (expected to be answered in NLP Graph)

• query results evaluated manually

• each query has at least one correct answer in the corpus

• some queries have over 1000 correct answers

find mentions of the ontology types & relations in the corpus & extract them into an RDF Graph



Initial MRP Evaluation

• required extensive manual effort:

• no match between system node IDs and GS node IDs

• provenance for evaluators to find mentions from a graph

• evaluators semi-automatically map the system result entity IDs to GS entity IDs

• expensive, error-prone & difficult to reproduce ...

• difficult to test systems adequately before the evaluation

• only 50 queries were used - not enough for significant system validation, e.g. not able to tune system thresholds



How did we change this?

• we decided to sacrifice corpus size in favor of having entity IDs (eliminating the manual step in the evaluation)

• we created a gold standard corpus

• 169 docs manually annotated with types, relations, coreference and entity names

• generated Gold-Standard NLP graph from manually annotated data

• automatically generated SPARQL queries from GS graph

• we ran only the RelEx component using GS mentions, types & coref giving us the GS entity IDs in the system graph

measure performance of system results against these GS results



Evaluation & Test Data

• 60 train, 60 devtest & 49 final (blind) test

• manually annotated with NER, coref, relations

• extracted from Gigaword

• split to balance distribution of 48 domain relations

• generated Gold-Standard NLP graph from manually annotated data

• RelEx component trained & applied using GS mentions, types & coref

• using GS mentions, types & coref increases the F-measure of the RelEx output (F=.28), but it is used in both the baseline and the test experiments, so it doesn’t affect the comparison



SPARQL Evaluation Queries

• 475 test queries for the devtest set and 492 for test

• generated from the GS NLP graph for each document set by:

• extracting random connected subsets of the graph containing 2-8 domain relations (not including rdf:type)

• adding type statements for each node

• replacing nodes that had no proper names in text with select variables

• running each query over the same GS graph, so that the results became our gold standard results for query evaluation (since the queries had variables, the results could differ from the subgraph we started with)
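The query generation steps above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' generator: the function name, the data layout (a list of relation triples, a node-to-type map, a set of named nodes) and the `ex:` entities are hypothetical, and the final step of running the query over the GS graph is left out.

```python
import random


def generate_query(relations, types, named, size, rng):
    """Grow a random connected set of `size` relations, add types, introduce variables."""
    chosen = [rng.choice(relations)]
    nodes = {chosen[0][0], chosen[0][2]}
    while len(chosen) < size:
        # Only relations touching an already selected node keep the subset connected.
        frontier = [t for t in relations
                    if t not in chosen and (t[0] in nodes or t[2] in nodes)]
        if not frontier:
            break
        t = rng.choice(frontier)
        chosen.append(t)
        nodes.update((t[0], t[2]))
    # Nodes without a proper name in the text become select variables.
    var = {n: f"?v{i}" for i, n in enumerate(sorted(nodes - named))}
    terms = [(var.get(s, s), p, var.get(o, o)) for s, p, o in chosen]
    terms += [(var.get(n, n), "rdf:type", types[n]) for n in sorted(nodes)]
    return sorted(var.values()), terms


relations = [("ex:event3", "mric:mediatingAgent", "ex:person1"),
             ("geo:Venezuela", "mric:isLedBy", "ex:person1")]
types = {"ex:event3": "mric:Event", "ex:person1": "mric:Person",
         "geo:Venezuela": "mric:GeopoliticalEntity"}
select_vars, terms = generate_query(relations, types, {"geo:Venezuela"}, 2, random.Random(0))
```

Passing an explicit `random.Random` instance keeps a generated query set reproducible across runs.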



NLP Graphs from RelEx Output

• RelEx: a set of SVM binary classifiers, one per relation

• for each sentence in the corpus, for each pair of mentions in that sentence, for each relation it produces a probability that that pair is related by the relation

• NLP graphs are generated by selecting relations from RelEx output in two ways:

• Primary: takes only the top scoring relation between any mention pair above a confidence threshold (0, .1 and .2)

• Secondary: takes all relations between all mention pairs above 0 confidence

• All type triples come from the Gold Standard (GS)

• Precision & Recall are determined by automatically comparing system query results to the GS query results (for every query we know all the answers)
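The two selection modes can be sketched directly from the description above. The input layout (a dict from mention pair to per-relation scores), the mention IDs and the scores are assumptions for illustration; the sketch also treats mentions as if they were already mapped to entities.

```python
def primary_graph(scores, threshold):
    """Keep only the top-scoring relation per mention pair, if above the threshold."""
    graph = {}
    for (a, b), by_rel in scores.items():
        rel, conf = max(by_rel.items(), key=lambda kv: kv[1])
        if conf > threshold:
            graph[(a, rel, b)] = conf
    return graph


def secondary_graph(scores):
    """Keep every relation between every mention pair above 0 confidence."""
    return {(a, rel, b): conf
            for (a, b), by_rel in scores.items()
            for rel, conf in by_rel.items()
            if conf > 0}


# Illustrative RelEx-style output for one mention pair (made-up scores).
scores = {("ex:m1", "ex:m2"): {"mric:isLedBy": 0.15,
                               "mric:hasMember": 0.18,
                               "mric:citizenOf": 0.0}}
top_at_2 = primary_graph(scores, 0.2)
top_at_1 = primary_graph(scores, 0.1)
all_at_0 = secondary_graph(scores)
```

In this made-up case the `isLedBy` reading never reaches a primary graph at any threshold, yet it survives in the secondary graph where a hypothesis can find it.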



Threshold Choices

• Threshold .2 --> max F1=.28 on devset for RelEx

• Threshold .1 --> guessed threshold before having any data to back it up

• we could have tried more thresholds but it was a lot of work

• in our experiments, we explored threshold space over hundreds of queries - satisfactory to tune the threshold parameters



Graph Notation

• We refer to the graphs by document set (dev or test) and top/all@threshold, e.g.

• devTop@.2 = NLP Graph on dev set using top relations above .2 confidence

• testAll@0 = NLP graph on test set using all relations above 0 confidence

• 3 primary graphs, in all cases using top, and selecting relations at thresholds 0, .1, and .2

• 1 secondary graph using the all@0 setting (R=.97)


This Evaluation Setup

allows us to run experiments repeatedly over hundreds of queries with no manual intervention
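Because every query has known gold-standard answers, scoring reduces to set comparison. A minimal sketch for a single query follows; the answer IDs are made up, scoring an empty result set as 0 is a convention chosen here, and how per-query scores were aggregated in the experiments is not stated in the slides.

```python
def precision_recall_f1(system, gold):
    """Compare system answers to gold-standard answers for one query."""
    system, gold = set(system), set(gold)
    tp = len(system & gold)
    p = tp / len(system) if system else 0.0  # convention: empty result scores 0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Made-up answer sets: 2 of 3 system answers are correct, 2 of 4 gold answers are found.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"a", "b", "d", "e"})
```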




6 Experiments

• 3 for dev, 3 for test

• each experiment compares query results from the primary graph (PG) alone to query results using the PG plus the secondary graph (SG) for hypothesis validation

• the three experiments compare performance at different primary graph thresholds



[Chart: 0-threshold primary graph with & without secondary graph (secondary graph: all@0). y-axis: F1; x-axis: SG threshold for validated hypotheses]

for a given PG threshold we vary the SG threshold for validated hypotheses (x-axis)

[Chart: .1-threshold primary graph with & without secondary graph. y-axis: F1]

the red line indicates the PG threshold; the PG-only curve flattens below this threshold, as expected

[Chart: F1 with & without secondary graph, with the best performance point marked at the .01 SG threshold]

the best performing configuration for dev is the .2-threshold PG with SG hypotheses validated at the .01 threshold




Performance

the test set was truly blind, we ran it only once

the gain in recall (R) was expected, the gain in F-measure (F) was hoped for, the gain in precision (P) was a surprise

the probability of a relation holding between two mentions increases significantly if that relation would complete a conjunctive query result



Example: Generated Query

Q161: "Find events in which the leader of Venezuela is the mediating agent"

?e1 mric:MediatingAgent ?p1

geo:Venezuela mric:isLedBy ?p1

geo:Venezuela rdf:type mric:GeopoliticalEntity

?p1 rdf:type mric:Person

?e1 rdf:type mric:Event










• no solutions in PG

• find bindings for ?p1 (346)

• generates 346 hypotheses

• finds support in SG for isLedBy("Venezuela", "Hugo Chavez")



Questions?

@laroyo

http://lora-aroyo.org

