question answering with subgraph embeddings
TRANSCRIPT
Introduction Task definition Embedding Questions and Answers Experiments Conclusion
Open-domain question answering
Definition (Open QA)
the task of automatically answering questions asked in natural language, on any topic or in any domain
Nowadays: lookup in large-scale structured knowledge bases (KBs)
Two main approaches, based on:
information retrieval (Yao and Van Durme, ’14)
semantic parsing (Berant et al., ’13; Berant and Liang, ’14)
A lot of human supervision: lexicons, grammars, KB schema . . .
2 / 17
Embedding model
. . . learns representations as low-dimensional vectors of words and KB elements
In particular, subgraph embeddings:
richer representation of the answers (QA path, surrounding subgraph of the KB)
less human supervision
ability to answer more complicated questions (indirect answers)
more sophisticated inferences
3 / 17
Freebase
the knowledge base of general facts
Database of 14 million triplets:
subject
object
connected by type1.type2.predicate
Turned automatically into question-answer pairs for training:
“What is the predicate of the type2 subject? (object)”
“What is the nationality of the person barack obama? (united states)”
4 / 17
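The automatic conversion above amounts to filling a question template from the triplet fields. A minimal sketch, assuming the Freebase-style predicate string can be read as type1.type2.predicate (the function name and template wording are illustrative):

```python
def triplet_to_qa(subject, predicate_path, obj):
    """Turn a KB triplet into a synthetic question-answer pair.

    predicate_path is a Freebase-style string such as
    "people.person.nationality"; its pieces are read as
    type1.type2.predicate, following the slide.
    """
    type1, type2, predicate = predicate_path.split(".")
    question = f"what is the {predicate} of the {type2} {subject}?"
    return question, obj

q, a = triplet_to_qa("barack obama", "people.person.nationality", "united states")
# q == "what is the nationality of the person barack obama?"
# a == "united states"
```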
WebQuestions
the dataset of 5810 QA pairs built from Freebase
Questions from: the Google Suggest API
Answers from: Freebase (labeled by Amazon Mechanical Turk workers)
One Freebase entity is identified in each (natural) question using string matching:
“What degrees did Barack Obama get? (bachelor of arts, juris doctor)”
5 / 17
ClueWeb Extractions
provided by (Lin et al. ’12)
2 million extractions
(subject, “text string”, object)
“Where barack obama was allegedly bear in? (hawaii)”
6 / 17
Paraphrases
There are issues with automatically generated questions:
semi-automatic wording
rigid syntax
often unnatural
. . .
⇒ pairs of question paraphrases:
from WikiAnswers (users have the option to tag rephrasings of questions)
2 million distinct questions
350,000 paraphrase clusters
7 / 17
Scoring function
Let us have a question q and a candidate answer a.
Goal: to learn the scoring function

S(q, a) = f(q)^T g(a)

So that: high score if a is the correct answer; low score otherwise.
Where:
k is the dimension of the embedding space
N_W is the number of words, N_S the number of entities and relations, N = N_W + N_S
W ∈ R^(k×N), where the column W_{*,i} is the embedding of the i-th word, entity or relation
ϕ(q) ∈ N_0^N indicates how many times each word occurs in the question
ψ(a) ∈ N_0^N is the vector representation of the answer (next slide)
f(q) = W ϕ(q) and g(a) = W ψ(a) are embeddings into R^k
8 / 17
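The scoring function above can be sketched in a few lines of NumPy; the tiny dictionary size N and dimension k are illustrative assumptions, not values from the paper:

```python
import numpy as np

k, N = 4, 6                    # embedding dimension; dictionary size (words + KB symbols)
rng = np.random.default_rng(0)
W = rng.normal(size=(k, N))    # column W[:, i] embeds the i-th word/entity/relation

def f(phi_q):                  # f(q) = W phi(q)
    return W @ phi_q

def g(psi_a):                  # g(a) = W psi(a)
    return W @ psi_a

def S(phi_q, psi_a):           # S(q, a) = f(q)^T g(a)
    return f(phi_q) @ g(psi_a)

phi_q = np.array([1., 1, 0, 0, 0, 0])  # bag-of-words counts of the question
psi_a = np.array([0., 0, 0, 1, 1, 0])  # sparse answer representation
score = S(phi_q, psi_a)                # a single real-valued score
```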
Representing Candidate Answers
Single Entity
1-of-N_S coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere

Path Representation
3-of-N_S or 4-of-N_S coded vector: 1-hop or 2-hop path from the question entity to the answer entity, using the relation types (but not entities) in-between
(barack obama, people.person.place of birth, honolulu)
(barack obama, people.person.place of birth, location.location.containedby, hawaii)

Subgraph Representation
path representation + the entire subgraph of entities connected to the candidate answer entity
double the size for entities and relations: N = N_W + 2N_S
C entities connected to the answer entity via D relations: either (3 + C + D)-of-N_S or (4 + C + D)-of-N_S coded vector
9 / 17
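These encodings are index sets over the N_S KB symbols. A minimal sketch, assuming a toy symbol-to-index dictionary (the identifiers are the slide's examples):

```python
import numpy as np

# Hypothetical dictionary over KB entities and relations (N_S symbols).
sym = {"barack obama": 0, "people.person.place_of_birth": 1,
       "honolulu": 2, "location.location.containedby": 3, "hawaii": 4}
N_S = len(sym)

def encode(symbols, n=N_S):
    """k-of-N_S coded vector: 1 at each listed symbol, 0 elsewhere."""
    v = np.zeros(n)
    for s in symbols:
        v[sym[s]] = 1.0
    return v

single = encode(["honolulu"])                                   # 1-of-N_S
path1 = encode(["barack obama", "people.person.place_of_birth",
                "honolulu"])                                    # 3-of-N_S (1-hop)
path2 = encode(["barack obama", "people.person.place_of_birth",
                "location.location.containedby", "hawaii"])     # 4-of-N_S (2-hop)
```

The subgraph representation would additionally set the indices of the C neighboring entities and D relations, using a second copy of the symbol dictionary (hence N = N_W + 2N_S).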
Overview
10 / 17
Training and Ranking Loss Function
Let D = {(q_i, a_i) : i = 1, . . . , |D|} be the training set of QA pairs.
Goal: to minimize

∑_{i=1}^{|D|} ∑_{ā ∈ Ā(a_i)} max{0, m + S(q_i, ā) − S(q_i, a_i)}

So that: the score of a question paired with a correct answer is greater than with any incorrect answer ā by at least m.
Where:
m is the margin parameter; Ā(a_i) is the set of incorrect candidates, sampled as:
50% of the time: another entity connected to the question entity (i.e., through other candidate paths)
50% of the time: a random answer entity
Optimization:
stochastic gradient descent, multi-threaded
columns of W are kept inside the unit ball: ∀i : ||W_{*,i}|| ≤ 1
11 / 17
Multitask Training of Embeddings
Synthetic questions inadequately cover the syntax of natural language!
⇒ Multi-task the training with the task of paraphrase prediction!
scoring function Sprp(q1, q2) = f(q1)>f(q2)
same embedding matrix W
negative samples taken from different paraphrase clusters
12 / 17
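The paraphrase task reuses the question embedding f and the same matrix W; a minimal sketch under the same illustrative setup as before:

```python
import numpy as np

k, N = 4, 6
rng = np.random.default_rng(0)
W = rng.normal(size=(k, N))    # the same embedding matrix as for QA

def f(phi):
    return W @ phi

def S_prp(phi_q1, phi_q2):
    """Paraphrase score: S_prp(q1, q2) = f(q1)^T f(q2)."""
    return f(phi_q1) @ f(phi_q2)

# Hypothetical bag-of-words encodings of two rephrased questions;
# a negative pair would be drawn from a different paraphrase cluster.
q1 = np.array([1., 1, 0, 0, 0, 0])
q2 = np.array([0., 1, 1, 0, 0, 0])
score = S_prp(q1, q2)
```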
Inference
The trained model predicts the answer by

â = argmax_{a′ ∈ A(q)} S(q, a′)

where A(q) is the candidate answer set for each question, chosen as:
the whole KB (slow and imprecise!)
C1: only Freebase triplets involving the question entity (simple factual questions, 1-hop paths)
C2: additional triplets at 2-hop distance from the question entity (i.e., neighbors of neighbors)
not all the quadruplets (too large a set), but predictions made in turns, using a best-first search heuristic:
rank the KB's relations using S(q, a)
keep only the top 10 relations ⇒ only 2-hop paths containing them
1-hop triplets weighted by a value of 1.5
13 / 17
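The argmax over the candidate set is then a straightforward scan; a sketch with a toy candidate set standing in for A(q) (the scoring function and encodings are illustrative):

```python
import numpy as np

def predict(phi_q, candidates, score):
    """Return the candidate answer maximizing S(q, a') over A(q)."""
    return max(candidates, key=lambda psi_a: score(phi_q, psi_a))

# Toy scoring function and candidate encodings standing in for S(q, a).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
score = lambda phi_q, psi_a: (W @ phi_q) @ (W @ psi_a)

phi_q = np.array([1., 0, 1, 0, 0, 0])
A_q = [np.array([0., 0, 0, 1, 0, 0]),   # a 1-hop candidate answer
       np.array([0., 0, 0, 0, 1, 0]),   # a 2-hop candidate answer
       np.array([0., 0, 0, 0, 0, 1])]
best = predict(phi_q, A_q, score)
```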
Multiple answers
“Who are David Beckham’s children?” ⇒ a whole set of answers, all on the path:
(david beckham, people.person.children, *)
The vector representing multiple answers is the average of the sub-answers:

ψ_all(a′) = (1 / |a′|) ∑_{a′_j ∈ a′} ψ(a′_j)
14 / 17
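The averaging is a one-liner in practice; the sub-answer encodings below are illustrative placeholders:

```python
import numpy as np

def psi_all(sub_answers):
    """psi_all(a') = (1/|a'|) * sum_j psi(a'_j): mean of sub-answer vectors."""
    return np.mean(sub_answers, axis=0)

# Hypothetical encodings psi(a'_j), one per child on the path.
answers = [np.array([0., 1, 0, 0]),
           np.array([0., 0, 1, 0]),
           np.array([0., 0, 0, 1])]
avg = psi_all(answers)   # array([0., 1/3, 1/3, 1/3])
```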
Results on the WebQuestions test set
15 / 17
Conclusion
Pros of subgraph embeddings:
training uses only QA pairs and the knowledge base
low-dimensional embeddings
exploiting the richer structure of the answers provided by their local neighborhood in the KB graph
training on the question-paraphrasing task at the same time
promising performance on the WebQuestions benchmark
16 / 17
Thank you!
Time for human Question Answering!
17 / 17