

Proceedings of NAACL-HLT 2019, pages 2913–2923, Minneapolis, Minnesota, June 2 - June 7, 2019. © 2019 Association for Computational Linguistics


Bidirectional Attentive Memory Networks for Question Answering over Knowledge Bases

Yu Chen, Rensselaer Polytechnic Institute ([email protected])

Lingfei Wu, IBM Research ([email protected])

Mohammed J. Zaki, Rensselaer Polytechnic Institute ([email protected])

Abstract

When answering natural language questions over knowledge bases (KBs), different question components and KB aspects play different roles. However, most existing embedding-based methods for knowledge base question answering (KBQA) ignore the subtle inter-relationships between the question and the KB (e.g., entity types, relation paths and context). In this work, we propose to directly model the two-way flow of interactions between the questions and the KB via a novel Bidirectional Attentive Memory Network, called BAMnet. Requiring no external resources and only very few hand-crafted features, our method significantly outperforms existing information-retrieval based methods on the WebQuestions benchmark, and remains competitive with (hand-crafted) semantic parsing based methods. Also, since we use attention mechanisms, our method offers better interpretability compared to other baselines.

1 Introduction

With the rapid growth of large-scale knowledge bases (KBs) such as DBPedia (Auer et al., 2007) and FreeBase (Google, 2018), knowledge base question answering (KBQA) has drawn increasing attention over the past few years. Given questions in natural language (NL), the goal of KBQA is to automatically find answers from the underlying KB, which provides a more natural and intuitive way to access the vast underlying knowledge resources.

One of the most prominent challenges of KBQA is the lexical gap. For instance, the same question can be expressed in various ways in NL while a KB usually has a canonical lexicon. It is therefore non-trivial to map an NL question to a structured KB. The approaches proposed to tackle the KBQA task can be roughly categorized into two groups: semantic parsing (SP) and information retrieval (IR) approaches. SP-based approaches address the problem by constructing a semantic parser that converts NL questions into intermediate logic forms, which can be executed against a KB. Traditional semantic parsers (Wong and Mooney, 2007) require annotated logical forms as supervision, and are limited to narrow domains with a small number of logical predicates. Recent efforts overcome these limitations via the construction of hand-crafted rules or features (Abujabal et al., 2017; Hu et al., 2018), schema matching (Cai and Yates, 2013), and using weak supervision from external resources (Krishnamurthy and Mitchell, 2012).

Unlike SP-based approaches, which usually assume a pre-defined set of lexical triggers or rules that limit their domains and scalability, IR-based approaches directly retrieve answers from the KB in light of the information conveyed in the questions. These IR-based approaches usually do not require hand-made rules and can therefore scale better to large and complex KBs. Recently, deep neural networks have been shown to produce strong results on many NLP tasks. In the field of KBQA, under the umbrella of IR-based approaches, many embedding-based methods (Bordes et al., 2014b; Hao et al., 2017) have been proposed and have shown promising results. These methods adopt various ways to encode questions and KB subgraphs into a common embedding space and directly match them in that space, and can typically be trained in an end-to-end manner.

Compared to existing embedding-based methods that encode questions and KB subgraphs independently, we introduce a novel Bidirectional Attentive Memory network, called BAMnet, that captures the mutual interactions between questions and the underlying KB, which is stored in a content-addressable memory. We assume that the world knowledge (i.e., the KB) is helpful for better understanding the questions. Similarly, the questions themselves can help us focus on important KB aspects. To this end, we design a two-layered bidirectional attention network. The primary attention network is intended to focus on important parts of a question in light of the KB, and on important KB aspects in light of the question. Built on top of that, the secondary attention network is intended to enhance the question and KB representations by further exploiting the two-way attention. Through this idea of hierarchical two-way attention, we are able to distill the information that is most relevant to answering the questions on both the question and KB sides.

We highlight the contributions of this paper as follows: 1) we propose a novel bidirectional attentive memory network for the task of KBQA, which is intended to directly model the two-way interactions between questions and the KB; 2) by design, our method offers good interpretability thanks to the attention mechanisms; 3) on the WebQuestions benchmark, our method significantly outperforms previous information-retrieval based methods while remaining competitive with (hand-crafted) semantic parsing based methods.

2 Related work

Two broad classes of SP-based and IR-based approaches have been proposed for KBQA. The former attempts to convert NL questions to logic forms. Recent work focused on approaches based on weak supervision from either external resources (Krishnamurthy and Mitchell, 2012; Berant et al., 2013; Yao and Van Durme, 2014; Hu et al., 2018; Yih et al., 2015; Yavuz et al., 2016), schema matching (Cai and Yates, 2013), or using hand-crafted rules and features (Unger et al., 2012; Berant et al., 2013; Berant and Liang, 2015; Reddy et al., 2016; Bao et al., 2016; Abujabal et al., 2017; Hu et al., 2018; Bast and Haussmann, 2015; Yih et al., 2015). A thread of research has explored generating semantic query graphs from NL questions, such as by using coarse alignment between phrases and predicates (Berant et al., 2013), searching partial logical forms via an agenda-based strategy (Berant and Liang, 2015), pushing down the disambiguation step into the query evaluation stage (Hu et al., 2018), or exploiting rich syntactic information in NL questions (Xu et al., 2018a,b). Notably, another thread of SP-based approaches tries to exploit IR-based techniques (Yao and Van Durme, 2014; Bast and Haussmann, 2015; Yang et al., 2014; Yih et al., 2015; Bao et al., 2016; Yavuz et al., 2016; Liang et al., 2016) by computing the similarity of two sequences as features, leveraging a neural network-based answer type prediction model, or training an end-to-end neural symbolic machine via REINFORCE (Williams, 1992). However, most SP-based approaches rely more or less on hand-crafted rules or features, which limits their scalability and transferability.

The other line of work (IR-based) has focused on mapping answers and questions into the same embedding space, where one could query any KB independent of its schema without requiring any grammar or lexicon. Bordes et al. (2014b) were the first to apply an embedding-based approach to KBQA. Later, Bordes et al. (2014a) proposed the idea of subgraph embedding, which encodes more information (e.g., answer path and context) about the candidate answer. In follow-up work (Bordes et al., 2015; Jain, 2016), memory networks (Weston et al., 2014) were used to store candidates, which could be accessed iteratively to mimic multi-hop reasoning. Unlike the above methods that mainly use a bag-of-words (BOW) representation to encode questions and KB resources, (Dong et al., 2015; Hao et al., 2017) apply more advanced network modules (e.g., CNNs and LSTMs) to encode questions. Hybrid methods have also been proposed (Feng et al., 2016; Xu et al., 2016; Das et al., 2017), which achieve improved results by leveraging additional knowledge sources such as free text. While most embedding-based approaches encode questions and answers independently, (Hao et al., 2017) proposed a cross-attention mechanism to encode questions according to various candidate answer aspects. In contrast, our method goes one step further by modeling the bidirectional interactions between questions and a KB.

The idea of bidirectional attention proposed in this work is similar to those applied in machine reading comprehension (Wang and Jiang, 2016; Seo et al., 2016; Xiong et al., 2016). However, whereas these previous works focus on capturing the interactions between two bodies of text, we focus on modeling the interactions between one body of text and a KB.

3 A modular bidirectional attentive memory network for KBQA

Figure 1: Overall architecture of the BAMnet model.

Given an NL question, the goal is to fetch answers from the underlying KB. Our proposed BAMnet model consists of four components: the input module, memory module, reasoning module and answer module, as shown in Fig. 1.

3.1 Input module

An input NL question $Q = \{q_i\}_{i=1}^{|Q|}$ is represented as a sequence of word embeddings $q_i$ by applying a word embedding layer. We then use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to encode the question as $H^Q \in \mathbb{R}^{d \times |Q|}$, which is the sequence of hidden states (i.e., the concatenation of forward and backward hidden states) generated by the BiLSTM.

3.2 Memory module

Candidate generation Even though all the entities in the KB could in principle be candidate answers, this is computationally expensive and unnecessary in practice. We only consider those entities which are "close" to the main topic entity of a question. An answer is the text description (e.g., a name) of an entity node. For example, Ohio is the topic entity of the question "Who was the secretary of state of Ohio in 2011?" (see Fig. 2). After finding the topic entity, we collect all the entities connected to it within $h$ hops as candidate answers, which we denote as $\{A_i\}_{i=1}^{|A|}$.

KB representation For each candidate answer from the KB, we encode three types of information: answer type, path and context.
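The candidate generation step described above amounts to a breadth-first traversal around the topic entity. The sketch below is illustrative rather than the authors' code: it assumes the KB subgraph is given as a plain adjacency list, and the names `graph`, `topic_entity` and `h` are hypothetical.

```python
from collections import deque

def collect_candidates(graph, topic_entity, h=2):
    """Collect all entities within h hops of the topic entity via BFS.

    `graph` is an adjacency list: {entity: [neighbor, ...]}.
    The topic entity itself is excluded from the candidate set.
    """
    visited = {topic_entity}
    frontier = deque([(topic_entity, 0)])
    candidates = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == h:
            continue  # do not expand beyond h hops
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                candidates.append(nbr)
                frontier.append((nbr, depth + 1))
    return candidates
```

For the Ohio example, a 2-hop traversal would pick up both directly connected nodes and their neighbors (e.g., the office holder and its attributes).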

Answer type Entity type information is an important clue in ranking answers. For example, if a question uses the interrogative word where, then candidate answers with types relevant to the concept of location are more likely to be correct. We use a BiLSTM to encode its text description to get a d-dimensional vector $H^{t1}_i$ (i.e., the concatenation of the last forward and backward hidden states).

Figure 2: A working example from Freebase. Relations in Freebase have hierarchies where high-level ones provide too broad or even noisy information about the relation. Thus, we choose to use the lowest-level one.

Answer path We define an answer path as a sequence of relations from a candidate answer to a topic entity. For example, for the Ohio question (see Fig. 2), the answer path of Jon A. Husted can either be represented as a sequence of relation ids [office holder, governing officials] or as the text description [office, holder, governing, officials]. We thus encode an answer path as $H^{p1}_i$ via a BiLSTM, and as $H^{p2}_i$ by computing the average of its relation embeddings via a relation embedding layer.

Answer context The answer context is defined as the surrounding entities (e.g., sibling nodes) of a candidate, which can help answer questions with constraints. For example, in Fig. 2, the answer context of Jon A. Husted includes the government position title secretary of state and the starting date 2011-01-09. However, for simple questions without constraints, the answer context is unnecessary and can potentially introduce noise. We tackle this issue with two strategies: 1) we use a novel importance module (explained later) to focus on important answer aspects, and 2) we only consider those context nodes that overlap with the question. Specifically, for each context node (i.e., a sequence of words) of a candidate, we first compute the longest common subsequence between it and the question; we then encode it via a BiLSTM only if we get a non-stopword substring. Finally, the answer context of a candidate answer is encoded as the average of all context node representations, which we denote as $H^c_i$.

Key-value memory module In our model, we use a key-value memory network (Miller et al., 2016) to store candidate answers. Unlike a basic memory network (Weston et al., 2014), its addressing stage is based on the key memory while the reading stage uses the value memory, which gives greater flexibility to encode prior knowledge via functionality separation. Thus, after encoding the answer type, path and context, we apply linear projections on them as follows:

$M^{kt}_i = f^k_t(H^{t1}_i) \qquad M^{vt}_i = f^v_t(H^{t1}_i)$
$M^{kp}_i = f^k_p([H^{p1}_i; H^{p2}_i]) \qquad M^{vp}_i = f^v_p([H^{p1}_i; H^{p2}_i])$
$M^{kc}_i = f^k_c(H^c_i) \qquad M^{vc}_i = f^v_c(H^c_i)$    (1)

where $M^{kt}_i$ and $M^{vt}_i$ are d-dimensional key and value representations of answer type $A^t_i$, respectively. Similarly, we have key and value representations for answer path and answer context. We denote $M$ as a key-value memory whose row $M_i = \{M^k_i, M^v_i\}$ (both in $\mathbb{R}^{d \times 3}$), where $M^k_i = [M^{kt}_i; M^{kp}_i; M^{kc}_i]$ comprises the keys, and $M^v_i = [M^{vt}_i; M^{vp}_i; M^{vc}_i]$ comprises the values. Here $[,]$ and $[;]$ denote row-wise and column-wise concatenations, respectively.
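The context-node filtering heuristic described earlier (keep a node only if its longest common subsequence with the question contains a non-stopword token) can be sketched as follows. This is an illustrative sketch: the token-level LCS and the toy stopword list are assumptions, not the paper's exact implementation.

```python
def lcs_tokens(a, b):
    """Longest common subsequence of two token lists (standard DP)."""
    m, n = len(a), len(b)
    dp = [[[] for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + [a[i]]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

STOPWORDS = {"the", "of", "in", "a", "an", "was", "who"}  # toy list for illustration

def keep_context_node(node_tokens, question_tokens):
    """Keep a context node only if its LCS with the question
    contains at least one non-stopword token."""
    common = lcs_tokens(node_tokens, question_tokens)
    return any(tok not in STOPWORDS for tok in common)
```

For the Ohio question, the context node "secretary of state" would be kept, while an unrelated node like "governor" would be filtered out.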

3.3 Reasoning module

The reasoning module consists of a generalization module and our novel two-layered bidirectional attention network, which aims at capturing the two-way interactions between questions and the KB. The primary attention network contains the KB-aware attention module, which focuses on the important parts of a question in light of the KB, and the importance module, which focuses on the important KB aspects in light of the question. The secondary attention network (the enhancing module in Fig. 1) is intended to enhance the question and KB vectors by further exploiting the two-way attention.

Figure 3: KB-aware attention module. CAT: concatenation, SelfAtt: self-attention, AddAtt: additive attention.

KB-aware attention module Not all words in a question are created equal. We use a KB-aware attention mechanism to focus on important components of a question, as shown in Fig. 3. Specifically, we first apply self-attention (SelfAtt) over all question word vectors $H^Q$ to get a d-dimensional question vector $q$ as follows:

$q = \text{BiLSTM}([H^Q (A^{QQ})^T, H^Q])$
$A^{QQ} = \text{softmax}((H^Q)^T H^Q)$    (2)

where softmax is applied over the last dimension of an input tensor by default. Using the question summary $q$, we apply another attention (AddAtt) over the memory to obtain the answer type summary $m^t$, path summary $m^p$ and context summary $m^c$:

$m^x = \sum_{i=1}^{|A|} a^x_i \cdot M^{vx}_i$
$a^x = \text{Att}_{add}(q, M^{kx})$    (3)

where $x \in \{t, p, c\}$, and $\text{Att}_{add}(x, y) = \text{softmax}(\tanh([x^T, y] W_1) W_2)$, with $W_1 \in \mathbb{R}^{2d \times d}$ and $W_2 \in \mathbb{R}^{d \times 1}$ being trainable weights.
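The additive attention $\text{Att}_{add}$ can be sketched in pure Python as below. This is a minimal sketch, not the trained model: the weights $W_1$ and $W_2$ are passed in (in practice they are learned), and vectors are plain lists rather than tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def additive_attention(q, keys, W1, W2):
    """Att_add(q, K): for each key k, score = tanh([q; k] W1) W2,
    then softmax over all keys.
    q: list[d]; keys: list of list[d]; W1: (2d x d) matrix; W2: list[d]."""
    def matvec(M, v):
        # v^T M, where M has len(v) rows
        return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]
    scores = []
    for k in keys:
        h = [math.tanh(x) for x in matvec(W1, q + k)]
        scores.append(sum(w * x for w, x in zip(W2, h)))
    return softmax(scores)
```

The returned weights are nonnegative and sum to one, so they can directly weight the value memory as in Eq. (3).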

So far, we have obtained the KB summary $m = [m^t; m^p; m^c]$ in light of the question. We proceed to compute the question-to-KB attention between each question word $q_i$ and the KB aspects, formulated as $A^{Qm} = (H^Q)^T m$. By applying max pooling over the last dimension (i.e., the KB aspect dimension) of $A^{Qm}$, that is, $a^Q_i = \max_j A^{Qm}_{ij}$, we select the strongest connection between $q_i$ and the KB. The idea behind this is that each word in a question serves a specific purpose (i.e., indicating answer type, path or context), and max pooling can help find that purpose. We then apply a softmax over the resulting vector to obtain $a^Q$, which is a KB-aware question attention vector since it indicates the importance of $q_i$ in light of the KB.
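The max-pool-then-softmax step above can be sketched as follows; this is an illustrative sketch with toy list-based vectors, where `HQ` holds one vector per question word and `m_aspects` holds the three KB aspect summaries.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kb_aware_question_attention(HQ, m_aspects):
    """HQ: list of |Q| word vectors (each list[d]);
    m_aspects: the type/path/context summaries, each list[d].
    Returns a softmax-normalized attention weight per question word."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # A^{Qm}_{ij} = <q_i, m_j>; max-pool over KB aspects, then softmax over words
    pooled = [max(dot(q_i, m_j) for m_j in m_aspects) for q_i in HQ]
    return softmax(pooled)
```

A word that aligns strongly with any one KB aspect (type, path or context) receives a high weight, which is exactly the intuition behind the max pooling.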


Importance module The importance module focuses on important KB aspects as measured by their relevance to the question. We start by computing a $|Q| \times |A| \times 3$ attention tensor $A^{QM}$, which indicates the strength of connection between each pair of $\{q_i, A^x_j\}$ with $x \in \{t, p, c\}$. Then, we take the max over the question-word dimension of $A^{QM}$ and normalize it to get an attention matrix $A^M$, which indicates the importance of each answer aspect for each candidate answer. After that, we proceed to compute question-aware memory representations $\tilde{M}^k$. Thus, we have:

$\bar{M}^v = \{\bar{M}^v_i\}_{i=1}^{|A|} \in \mathbb{R}^{|A| \times d} \qquad \bar{M}^v_i = \sum_{j=1}^{3} M^v_{ij}$
$\tilde{M}^k = \{\tilde{M}^k_i\}_{i=1}^{|A|} \in \mathbb{R}^{|A| \times d} \qquad \tilde{M}^k_i = A^M_i M^k_i$
$A^M = \text{softmax}(\bar{A}^{M^T})^T \qquad \bar{A}^M = \max_i \{A^{QM}_i\}_{i=1}^{|Q|}$
$A^{QM} = (M^k H^Q)^T$    (4)

Enhancing module We further enhance the question and KB representations by exploiting two-way attention. We compute the KB-enhanced question representation $\tilde{q}$, which incorporates the relevant KB information, by applying max pooling over the last dimension (i.e., the answer aspect dimension) of $A^{QM}$, that is, $\bar{A}^Q_M = \max_k \{A^{QM}_{:,:,k}\}_{k=1}^3$, and then normalizing it to get a question-to-KB attention matrix $A^Q_M$, from which we compute the question-aware KB summary and incorporate it into the question representation $\tilde{H}^Q = H^Q + a^Q \odot (A^Q_M \bar{M}^v)^T$. Finally, we obtain a d-dimensional KB-enhanced question representation $\tilde{q} = \tilde{H}^Q a^Q$.

Similarly, we compute a question-enhanced KB representation $\bar{M}^k$ which incorporates the relevant question information:

$\bar{M}^k = \tilde{M}^k + a^M \odot (A^M_Q (H^Q)^T)$
$a^M = (A^Q_M)^T a^Q \in \mathbb{R}^{|A| \times 1}$
$A^M_Q = \text{softmax}((A^Q_M)^T) \in \mathbb{R}^{|A| \times |Q|}$    (5)

Generalization module We add a one-hop attention process before answering. We use the question representation $\tilde{q}$ to query over the key memory $\bar{M}^k$ via an attention mechanism, and fetch the most relevant information from the value memory, which is then used to update the question vector using a GRU (Cho et al., 2014). Finally, we apply a residual layer (He et al., 2016) (i.e., $y = f(x) + x$) and batch normalization (BN) (Ioffe and Szegedy, 2015), which help model performance in practice. Thus, we have:

$\bar{q} = \text{BN}(\tilde{q} + q') \qquad q' = \text{GRU}(\tilde{q}, m)$
$m = \sum_{i=1}^{|A|} a_i \cdot \bar{M}^v_i \qquad a = \text{Att}^{GRU}_{add}(\tilde{q}, \bar{M}^k)$    (6)

3.4 Answer module

Given the question representation $\bar{q}$ of $Q$ and the representations $\{\bar{M}^k_i\}_{i=1}^{|A|}$ of the candidate answers $\{A_i\}_{i=1}^{|A|}$, we compute the matching score $S(\bar{q}, \bar{M}^k_i)$ between every pair $(Q, A_i)$ as $S(q, a) = q^T \cdot a$. The candidate answers are then ranked by their scores.
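The dot-product scoring and ranking step can be sketched as follows; this is an illustrative sketch operating on plain list vectors, with hypothetical argument names.

```python
def rank_candidates(q_vec, candidate_vecs):
    """Score each candidate by the dot product S(q, a) = q^T a and
    return (scores, indices sorted by descending score)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = [dot(q_vec, a) for a in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return scores, order
```

The top-ranked index corresponds to the best-matched candidate answer; the margin-based selection in Section 3.5 then decides how many of the top candidates to return.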

3.5 Training and testing

Training Intermediate modules such as the enhancing module generate "premature" representations of questions (e.g., $\tilde{q}$) and candidate answers (e.g., $\tilde{M}^k$). Even though these intermediate representations are not optimal for answer prediction, we can still use them along with the final representations to jointly train the model, which we find helps training, probably by providing more supervision, since we directly force the intermediate representations to be helpful for prediction. Moreover, we directly match interrogative words to KB answer types. A question $Q$ is represented by a 16-dimensional interrogative-word embedding $q_w$ (we use "which", "what", "who", "whose", "whom", "where", "when", "how", "why" and "whether") and a candidate answer $A_i$ is represented by an entity type embedding $H^{t2}_i$ of the same size. We then compute the matching score $S(q_w, H^{t2}_i)$ between them. Although we only have weak labels for the type matching task (e.g., incorrect answers do not necessarily imply incorrect types), and there are no shared representations between the two tasks, we find in practice that this strategy helps the training process, as shown in Section 4.4.

Loss function: In the training phase, we force positive candidates to have higher scores than negative candidates by using a triplet-based loss function:

$o = g(H^Q a^Q, \sum_{j=1}^{3} M^k_{:,j}) + g(\tilde{q}, \tilde{M}^k) + g(\bar{q}, \bar{M}^k) + g(q_w, H^{t2})$    (7)

where $g(q, M) = \sum_{a^+ \in A^+, a^- \in A^-} \ell(S(q, M_{a^+}), S(q, M_{a^-}))$, $\ell(y, \bar{y}) = \max(0, 1 + \bar{y} - y)$ is a hinge loss function, and $A^+$ and $A^-$ denote the positive (i.e., correct) and negative (i.e., incorrect) answer sets, respectively. Note that at training time, the candidate answers are extracted from the KB subgraph of the gold-standard topic entity, with the memory size set to $N_{max}$. We adopt the following sampling strategy, which works well in practice: if $N_{max}$ is larger than the number of positive answers $|A^+|$, we keep all the positive answers and randomly select negative answers to fill up the memory; otherwise, we randomly select $\min(N_{max}/2, |A^-|)$ negative answers and fill up the remaining memory with random positive answers.

Testing At testing time, we first need to find the topic entity. We do this by using the top result returned by a separately trained topic entity predictor (we also compare with the result returned by the Freebase Search API). Then, the answer module returns the candidate answers with the highest scores as predicted answers. Since there can be multiple answers to a given question, the candidates whose scores are within a certain margin $\theta$ of the highest score are regarded as good answers as well. Therefore, we formulate the inference process as follows:

$\mathcal{A} = \{a \mid a \in A \ \text{and} \ \max_{a' \in A}\{S(\bar{q}, \bar{M}^k_{a'})\} - S(\bar{q}, \bar{M}^k_a) < \theta\}$    (8)

where $\max_{a' \in A}\{S(\bar{q}, \bar{M}^k_{a'})\}$ is the score of the best matched answer and $\mathcal{A}$ is the predicted answer set. Note that $\theta$ is a hyper-parameter which controls the degree of tolerance: decreasing $\theta$ makes the model stricter when predicting answers.
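The hinge loss and the margin-based answer selection of Eq. (8) can be sketched as follows; this is an illustrative sketch, with `scores` assumed to be a mapping from candidate answer to its matching score.

```python
def hinge(pos_score, neg_score, margin=1.0):
    """l(y, y_bar) = max(0, margin + y_bar - y): penalize a negative
    answer that scores within `margin` of a positive one."""
    return max(0.0, margin + neg_score - pos_score)

def select_answers(scores, theta):
    """Eq. (8): keep every candidate whose score is within theta of
    the best score. `scores` maps candidate -> score."""
    best = max(scores.values())
    return {a for a, s in scores.items() if best - s < theta}
```

With $\theta = 0.7$ (the value used in the experiments), any candidate scoring within 0.7 of the top candidate is returned as an answer, which is how multi-answer questions are handled.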

3.6 Topic entity prediction

Given a question $Q$, the goal of the topic entity predictor is to find the best topic entity $c$ from the candidate set $\{C_i\}_{i=1}^{|C|}$ returned by external topic entity linking tools (we use the Freebase Search API and S-MART (Yang and Chang, 2016) in our experiments). We use a convolutional network (CNN) to encode $Q$ into a d-dimensional vector $e$. For a candidate topic entity $C_i$, we encode three types of KB aspects, namely the entity name, entity type and surrounding relations, where both the entity name and type are represented as a sequence of words while the surrounding relations are represented as a bag of sequences of words. Specifically, we use three CNNs to encode them into three d-dimensional vectors, namely $C^n_i$, $C^t_i$ and $C^{r1}_i$. Note that for surrounding relations, we first encode each of the relations and then compute their average. Additionally, we compute an average of the relation embeddings via a relation embedding layer, which we denote as $C^{r2}_i$. We then apply linear projections on the above vectors as follows:

$P^k_i = f^k([C^n_i; C^t_i; C^{r1}_i; C^{r2}_i])$
$P^v_i = f^v([C^n_i; C^t_i; C^{r1}_i; C^{r2}_i])$    (9)

where $P^k_i$ and $P^v_i$ are d-dimensional key and value representations of candidate $C_i$, respectively. Furthermore, we compute the updated question vector $\bar{e}$ using the generalization module mentioned earlier. Next, we use a dot product to compute the similarity score between $Q$ and $C_i$. A triplet-based loss function is used, as formulated by $o = g(e, P^k) + g(\bar{e}, P^k)$, where $g(\cdot)$ is the aforementioned hinge loss function. When training the predictor, along with the candidates returned from external entity linking tools, we do negative sampling (using string matching) to get more supervision. In the testing phase, the candidate with the highest score is returned as the best topic entity and no negative sampling is applied.

4 Experiments

This section provides an extensive evaluation of our proposed BAMnet model against state-of-the-art KBQA methods. The implementation of BAMnet is available at https://github.com/hugochan/BAMnet.

4.1 Data and metrics

We use the Freebase KB and the WebQuestions dataset, described below:

Freebase This is a large-scale KB (Google, 2018) that consists of general facts organized as subject-property-object triples. It has 41M non-numeric entities, 19K properties, and 596M assertions.

WebQuestions This dataset (Berant et al., 2013) (nlp.stanford.edu/software/sempre) contains 3,778 training examples and 2,032 test examples. We further split the training instances into a training set and a development set via an 80%/20% split. Approximately 85% of questions can be directly answered via a single FreeBase predicate. Also, each question can have multiple answers. In our experiments, we use a development version of the dataset (Baudis and Pichl, 2016), which additionally provides (potentially noisy) entity mentions for each question.


Following (Berant et al., 2013), macro F1 scores (i.e., the average of F1 scores over all questions) are reported on the WebQuestions test set.
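The macro F1 metric above can be sketched as follows; this is a minimal illustrative implementation, assuming predictions and gold answers are given as sets per question.

```python
def f1(pred, gold):
    """Per-question F1 between predicted and gold answer sets."""
    if not pred or not gold:
        return 0.0
    tp = len(set(pred) & set(gold))
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def macro_f1(predictions, golds):
    """Macro F1: the unweighted average of per-question F1 scores."""
    return sum(f1(p, g) for p, g in zip(predictions, golds)) / len(golds)
```

Because the average is taken over questions, a question with many gold answers counts the same as one with a single answer.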

4.2 Model settings

When constructing the vocabularies of words, entity types or relation types, we only consider those questions and their corresponding KB subgraphs appearing in the training and validation sets. The vocabulary size of words is V = 100,797. There are 1,712 entity types and 4,996 relation types in the KB subgraphs. Notably, in FreeBase, one entity might have multiple entity types. We only use the first one available, which is typically the most concrete one. For those non-entity nodes which are boolean values or numbers, we use "bool" or "num" as their types, respectively.

We also adopt a query delexicalization strategy where, for each question, the topic entity mention as well as constraint entity mentions (i.e., those belonging to "date", "ordinal" or "number") are replaced with their types. When encoding KB context, if the overlap belongs to the above types, we also apply this delexicalization, which guarantees that it matches up well with the delexicalized question in the embedding space.
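The delexicalization step can be sketched as below; this is an illustrative sketch, assuming the entity linker has already produced non-overlapping (start, end, type) token spans for each mention.

```python
def delexicalize(question_tokens, mentions):
    """Replace entity-mention token spans with their type placeholders.

    `mentions` is a list of (start, end, type) spans over the token list,
    assumed non-overlapping; spans are applied right-to-left so earlier
    indices stay valid as the list shrinks."""
    out = list(question_tokens)
    for start, end, etype in sorted(mentions, key=lambda m: -m[0]):
        out[start:end] = [etype]
    return out
```

For the Ohio question, replacing the topic entity and the date constraint yields a type-level question that matches delexicalized KB context in the embedding space.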

Given a topic entity, we extract its 2-hop subgraph (i.e., h = 2) to collect candidate answers, which is sufficient for WebQuestions. At training time, the memory size is limited to $N_{max} = 96$ candidate answers (for the sake of efficiency). If there are more potential candidates, we do random sampling as mentioned earlier. We initialize word embeddings with pre-trained GloVe vectors (Pennington et al., 2014) with word embedding size $d_v = 300$. The relation embedding size $d_p$, entity type embedding size $d_t$ and hidden size $d$ are set to 128, 16 and 128, respectively. The dropout rates on the word embedding layer, the question encoder side and the answer encoder side are 0.3, 0.3 and 0.2, respectively. The batch size is set to 32, and the answer module threshold is $\theta = 0.7$. For topic entity prediction, we use the same hyperparameters. For each question, there are 15 candidates after negative sampling at training time. When encoding a question, we use a CNN with filter sizes 2 and 3. A linear projection is applied to merge features extracted with different filters. When encoding a candidate aspect, we use a CNN with filter size 3. Linear activation and max-pooling are used together with the CNNs. In the training process, we use the Adam optimizer (Kingma and Ba, 2014) to train the model. The initial learning rate is set to 0.001 and is reduced by a factor of 10 if no improvement is observed on the validation set in 3 consecutive epochs. The training procedure stops if no improvement is observed on the validation set in 10 consecutive epochs. The hyper-parameters are tuned on the development set.
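The learning-rate schedule and early stopping described above (reduce LR by 10x after 3 stagnant epochs, stop after 10) can be sketched as a small state machine; this is an illustrative sketch, not the authors' training loop.

```python
class PlateauSchedule:
    """Reduce LR by a factor of 10 after `reduce_patience` epochs without
    validation improvement; signal a stop after `stop_patience` such epochs."""
    def __init__(self, lr=1e-3, reduce_patience=3, stop_patience=10):
        self.lr = lr
        self.reduce_patience = reduce_patience
        self.stop_patience = stop_patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score):
        """Call once per epoch; returns False when training should stop."""
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs % self.reduce_patience == 0:
                self.lr /= 10.0  # reduce by a factor of 10
        return self.bad_epochs < self.stop_patience
```

Frameworks provide equivalents (e.g., PyTorch's ReduceLROnPlateau combined with an early-stopping callback); the sketch just makes the two patience counters explicit.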

4.3 Performance comparison

As shown in Table 1, our method achieves an F1 score of 0.557 when the gold topic entity is known, which gives an upper bound on our model's performance. When the gold topic entity is unknown, we report results using: 1) the Freebase Search API, which achieves a recall@1 score of 0.857 on the test set for topic entity linking, and 2) the topic entity predictor, which achieves a recall@1 score of 0.898 for entity retrieval.

On WebQuestions, BAMnet achieves an F1 score of 0.518 using the topic entity predictor, which is significantly better than the F1 score of 0.497 obtained with the Freebase Search API. We observe that BAMnet significantly outperforms previous state-of-the-art IR-based methods, which conclusively demonstrates the effectiveness of modeling bidirectional interactions between questions and the KB.

It is important to note that, unlike the state-of-the-art SP-based methods, BAMnet relies on no external resources and very few hand-crafted features, yet remains competitive with those approaches. Based on carefully hand-crafted rules, some SP-based methods (Bao et al., 2016; Yih et al., 2015) can better model questions with constraints and aggregations. For example, Yih et al. (2015) apply many manually designed rules and features to improve performance on questions with constraints and aggregations, and Bao et al. (2016) directly model temporal (e.g., "after 2000"), ordinal (e.g., "first") and aggregation constraints (e.g., "how many") by adding detected constraint nodes to query graphs. In contrast, our method is end-to-end, with very few hand-crafted rules.

Additionally, Yavuz et al. (2016) and Bao et al. (2016) train their models on external Q&A datasets to obtain extra supervision. For a fairer comparison, we only show their results without training on external Q&A datasets. Similarly, for hybrid systems (Feng et al., 2016; Xu et al., 2016), we only report results without using Wikipedia free text.

Methods (ref)                    Macro F1

SP-based
(Berant et al., 2013)               0.357
(Yao and Van Durme, 2014)           0.443
(Wang et al., 2014)                 0.453
(Bast and Haussmann, 2015)          0.494
(Berant and Liang, 2015)            0.497
(Yih et al., 2015)                  0.525
(Reddy et al., 2016)                0.503
(Yavuz et al., 2016)                0.516
(Bao et al., 2016)                  0.524
(Feng et al., 2016)                 0.471
(Reddy et al., 2017)                0.495
(Abujabal et al., 2017)             0.510
(Hu et al., 2018)                   0.496

IR-based
(Bordes et al., 2014a)              0.392
(Yang et al., 2014)                 0.413
(Dong et al., 2015)                 0.408
(Bordes et al., 2015)               0.422
(Xu et al., 2016)                   0.471
(Hao et al., 2017)                  0.429

Our Method: BAMnet
w/ gold topic entity                0.557
w/ Freebase Search API              0.497
w/ topic entity predictor           0.518

Table 1: Results on the WebQuestions test set. Bold: best in-category performance.

It is interesting to note that both Yih et al. (2015) and Bao et al. (2016) also use the ClueWeb dataset to learn more accurate semantics. The F1 score of Yih et al. (2015) drops from 0.525 to 0.509 when the ClueWeb information is removed. To summarize, BAMnet achieves state-of-the-art performance of 0.518 without recourse to any external resources, relying only on very few hand-crafted features. If gold topic entities are assumed to be given, BAMnet achieves an F1 of 0.557.
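The Macro F1 reported here is the standard WebQuestions metric: a set-based F1 computed per question between the predicted and gold answer sets, then averaged over all questions. A minimal sketch (our own illustration, not the official evaluation script):

```python
def f1(pred, gold):
    """Set-based F1 between predicted and gold answer sets
    for a single question."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)  # correctly predicted answers
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(predictions, golds):
    """Average of per-question F1 scores over the test set."""
    return sum(f1(p, g) for p, g in zip(predictions, golds)) / len(golds)

# e.g., one fully correct question and one fully wrong one
print(macro_f1([["Joe Biden"], ["David Petraeus"]],
               [["Joe Biden"], ["Queen Victoria"]]))
# -> 0.5
```

Because the score is averaged per question rather than pooled over all answer pairs, a question with many gold answers counts the same as a question with one.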

4.4 Ablation study

We now discuss the performance impact of the different modules and strategies in BAMnet. Note that the gold topic entity is assumed to be known in this ablation study, because the error introduced by topic entity prediction might mask the real performance impact of a module or strategy.

Methods                               Macro F1

all                                      0.557
w/o two-layered bidirectional attn       0.534
w/o kb-aware attn (+self-attn)           0.544
w/o importance module                    0.540
w/o enhancing module                     0.550
w/o generalization module                0.542
w/o joint type matching                  0.545
w/o topic entity delexicalization        0.529
w/o constraint delexicalization          0.554

Table 2: Ablation results on the WebQuestions test set. Gold topic entity is assumed to be known.

As shown in Table 2, significant performance drops were observed after turning off some key attention modules, which confirms that the real power of our method comes from the idea of hierarchical two-way attention. When the two-layered bidirectional attention network is turned off, model performance drops from 0.557 to 0.534. Among all submodules in the attention network, the importance module is the most significant, since the F1 score drops to 0.540 without it, thereby confirming the effectiveness of modeling the query-to-KB attention flow. On the flip side, the importance of modeling the KB-to-query attention flow is confirmed by the fact that replacing the KB-aware attention module with self-attention significantly degrades performance. Besides, the secondary attention layer, the enhancing module, also contributes to the overall model performance. Finally, we find that the topic entity delexicalization strategy has a large influence on model performance, while the constraint delexicalization strategy only marginally boosts performance.

Figure 4: Attention heatmap generated by the reasoning module. Best viewed in color.

4.5 Interpretability analysis

Here, we show that our method does capture the mutual interactions between question words and KB aspects by visualizing the attention matrix AQM produced by the reasoning module. Fig. 4 shows the attention heatmap generated for a test question "who did location surrender to in number" (where "location" and "number" are the entity types that replace the topic entity mention "France" and the constraint entity mention "ww2", respectively, in the original question). As we can see, the attention network successfully detects the interactions between "who" and the answer type, and between "surrender to" and the answer path, and focuses more on those words when encoding the question.

KB Aspect      | Question                                                  | BAMnet w/o BiAttn.                                          | BAMnet                                            | Gold Answers
Answer Type    | What degrees did Obama get in college?                    | Harvard Law School, Columbia University, Occidental College | Bachelor of Arts, Juris Doctor, Political Science | Juris Doctor, Bachelor of Arts
Answer Type    | What music period did Beethoven live in?                  | Austrian Empire, Germany, Bonn                              | Classical music, Opera                            | Opera, Classical music
Answer Path    | Where did Queensland get its name from?                   | Australia                                                   | Queen Victoria                                    | Queen Victoria
Answer Path    | Where does Delaware river start?                          | Delaware Bay                                                | West Branch Delaware River, Mount Jefferson       | West Branch Delaware River, Mount Jefferson
Answer Context | What are the major cities in Ukraine?                     | Kiev, Olyka, Vynohradiv, Husiatyn, ...                      | Kiev                                              | Kiev
Answer Context | Who is running for vice president with Barack Obama 2012? | David Petraeus                                              | Joe Biden                                         | Joe Biden

Table 3: Predicted answers of BAMnet w/ and w/o bidirectional attention on the WebQuestions test set.

To further examine the importance of the two-way flow of interactions, Table 3 shows the answers predicted by BAMnet with and without the two-layered bidirectional attention network on sample questions from the WebQuestions test set. We divide the questions into three categories based on which kind of KB aspect is the most crucial for answering them. As we can see, compared to the simplified version without bidirectional attention, our model is more capable of answering all three types of questions.

4.6 Error analysis

To better examine the limitations of our approach, we randomly sampled 100 questions on which our method performed poorly (i.e., with a per-question F1 score below 0.6) and categorized the errors. We found that around 33% of the errors are due to labeling issues with the gold answers and are not real mistakes; this includes incomplete and erroneous labels, as well as alternative correct answers. Constraints are another source of errors (11%), with temporal constraints accounting for most of them. Some questions have implicit temporal (e.g., tense) constraints which our method does not model. A third source of error is what we term type errors (13%), where our method generates more answers than needed because it poorly utilizes answer type information. The lexical gap is another source of errors (5%). Finally, the remaining errors (38%) include topic entity prediction errors, question ambiguity, incomplete answers and other miscellaneous errors.

5 Conclusions and future work

We introduced a novel and effective bidirectional attentive memory network for KBQA. To the best of our knowledge, we are the first to model the mutual interactions between questions and a KB, which allows us to distill the information that is most relevant to answering the questions on both the question and KB sides. Experimental results show that our method significantly outperforms previous IR-based methods while remaining competitive with hand-crafted SP-based methods. Both the ablation study and the interpretability analysis verify the effectiveness of the idea of modeling mutual interactions. In addition, our error analysis shows that our method actually performs better than what the evaluation metrics indicate.

In the future, we would like to explore effective ways of modeling more complex types of constraints (e.g., ordinal, comparison and aggregation).

Acknowledgments

This work is supported by IBM Research AI through the IBM AI Horizons Network. We thank the anonymous reviewers for their constructive suggestions.


References

Abdalghani Abujabal, Mohamed Yahya, Mirek Riedewald, and Gerhard Weikum. 2017. Automated template generation for question answering over knowledge graphs. In WWW, pages 1191–1200.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Junwei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In COLING, pages 2503–2514.

Hannah Bast and Elmar Haussmann. 2015. More accurate question answering on Freebase. In CIKM, pages 1431–1440. ACM.

Petr Baudiš and Jan Pichl. 2016. Dataset factoid webquestions. https://github.com/brmson/dataset-factoid-webquestions.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544.

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. TACL, 3:545–558.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014a. Question answering with subgraph embeddings. In EMNLP, pages 615–620.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Antoine Bordes, Jason Weston, and Nicolas Usunier. 2014b. Open question answering with weakly supervised embedding models. In ECML-PKDD, pages 165–180. Springer.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In ACL, pages 423–433.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question answering on knowledge bases and text using universal schema and memory networks. arXiv preprint arXiv:1704.08384.

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In ACL-IJCNLP, volume 1, pages 260–269.

Yansong Feng, Songfang Huang, Dongyan Zhao, et al. 2016. Hybrid question answering over knowledge base and free text. In COLING, pages 2397–2407.

Google. 2018. Freebase data dumps. https://developers.google.com/freebase.

Yanchao Hao, Yuanzhe Zhang, Kang Liu, Shizhu He, Zhanyi Liu, Hua Wu, and Jun Zhao. 2017. An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. In ACL, volume 1, pages 221–231.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR, pages 770–778.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Sen Hu, Lei Zou, Jeffrey Xu Yu, Haixun Wang, and Dongyan Zhao. 2018. Answering natural language questions by subgraph matching over knowledge graphs. TKDE, 30(5):824–837.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Sarthak Jain. 2016. Question answering over knowledge base using factual memory networks. In NAACL-HLT, pages 109–115.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Jayant Krishnamurthy and Tom M. Mitchell. 2012. Weakly supervised training of semantic parsers. In EMNLP-CoNLL, pages 754–765.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. TACL, 4:127–140.

Siva Reddy, Oscar Täckström, Slav Petrov, Mark Steedman, and Mirella Lapata. 2017. Universal semantic parsing. In EMNLP, pages 89–101.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano. 2012. Template-based question answering over RDF data. In WWW, pages 639–648. ACM.

Shuohang Wang and Jing Jiang. 2016. Machine comprehension using Match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.

Zhenghao Wang, Shengquan Yan, Huaming Wang, and Xuedong Huang. 2014. An overview of Microsoft Deep QA system on Stanford WebQuestions benchmark. Technical Report MSR-TR-2014-121, Microsoft Research.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Yuk Wah Wong and Raymond Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In ACL, pages 960–967.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. arXiv preprint arXiv:1603.00957.

Kun Xu, Lingfei Wu, Zhiguo Wang, and Vadim Sheinin. 2018a. Graph2Seq: Graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823.

Kun Xu, Lingfei Wu, Zhiguo Wang, Mo Yu, Liwei Chen, and Vadim Sheinin. 2018b. Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. arXiv preprint arXiv:1808.07624.

Min-Chul Yang, Nan Duan, Ming Zhou, and Hae-Chang Rim. 2014. Joint relational embeddings for knowledge-based question answering. In EMNLP, pages 645–650.

Yi Yang and Ming-Wei Chang. 2016. S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. arXiv preprint arXiv:1609.08075.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In ACL, volume 1, pages 956–966.

Semih Yavuz, Izzeddin Gur, Yu Su, Mudhakar Srivatsa, and Xifeng Yan. 2016. Improving semantic parsing via answer type inference. In EMNLP, pages 149–159.

Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL, pages 1321–1331.