
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1321–1331, Beijing, China, July 26-31, 2015. © 2015 Association for Computational Linguistics

Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base

Wen-tau Yih   Ming-Wei Chang   Xiaodong He   Jianfeng Gao
Microsoft Research
Redmond, WA 98052, USA
{scottyih,minchang,xiaohe,jfgao}@microsoft.com

Abstract

We propose a novel semantic parsing framework for question answering using a knowledge base. We define a query graph that resembles subgraphs of the knowledge base and can be directly mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated as a staged search problem. Unlike traditional approaches, our method leverages the knowledge base in an early stage to prune the search space and thus simplifies the semantic matching problem. By applying an advanced entity linking system and a deep convolutional neural network model that matches questions and predicate sequences, our system outperforms previous methods substantially, and achieves an F1 measure of 52.5% on the WEBQUESTIONS dataset.

1 Introduction

Organizing the world’s facts and storing them in a structured database, large-scale knowledge bases (KB) like DBPedia (Auer et al., 2007) and Freebase (Bollacker et al., 2008) have become important resources for supporting open-domain question answering (QA). Most state-of-the-art approaches to KB-QA are based on semantic parsing, where a question (utterance) is mapped to its formal meaning representation (e.g., logical form) and then translated to a KB query. The answers to the question can then be retrieved simply by executing the query. The semantic parse also provides a deeper understanding of the question, which can be used to justify the answer to users, as well as to provide easily interpretable information to developers for error analysis.

However, most traditional approaches for semantic parsing are largely decoupled from the knowledge base, and thus are faced with several challenges when adapted to applications like QA. For instance, a generic meaning representation may have the ontology matching problem when the logical form uses predicates that differ from those defined in the KB (Kwiatkowski et al., 2013). Even when the representation language is closely related to the knowledge base schema, finding the correct predicates from the large vocabulary in the KB to relations described in the utterance remains a difficult problem (Berant and Liang, 2014).

Inspired by (Yao and Van Durme, 2014; Bao et al., 2014), we propose a semantic parsing framework that leverages the knowledge base more tightly when forming the parse for an input question. We first define a query graph that can be straightforwardly mapped to a logical form in λ-calculus and is semantically closely related to λ-DCS (Liang, 2013). Semantic parsing is then reduced to query graph generation, formulated as a search problem with staged states and actions. Each state is a candidate parse in the query graph representation and each action defines a way to grow the graph. The representation power of the semantic parse is thus controlled by the set of legitimate actions applicable to each state. In particular, we stage the actions into three main steps: locating the topic entity in the question, finding the main relationship between the answer and the topic entity, and expanding the query graph with additional constraints that describe properties the answer needs to have, or relationships between the answer and other entities in the question.

One key advantage of this staged design is that through partially grounding the utterance to some entities and predicates in the KB, we make the search far more efficient by focusing on the promising areas in the space that most likely lead to the correct query graph, before the full parse is determined. For example, after linking “Family Guy” in the question “Who first voiced Meg on Family Guy?” to FamilyGuy (the TV show) in the knowledge base, the procedure needs only to examine the predicates that can be applied to FamilyGuy instead of all the predicates in the KB. Resolving other entities also becomes easy, as given the context, it is clear that Meg refers to MegGriffin (the character in Family Guy). Our design divides this particular semantic parsing problem into several sub-problems, such as entity linking and relation matching. With this integrated framework, best solutions to each sub-problem can be easily combined and help produce the correct semantic parse. For instance, an advanced entity linking system that we employ outputs candidate entities for each question with both high precision and recall. In addition, by leveraging a recently developed semantic matching framework based on convolutional networks, we present better relation matching models using continuous-space representations instead of pure lexical matching. Our semantic parsing approach improves the state-of-the-art result on the WEBQUESTIONS dataset (Berant et al., 2013) to 52.5% in F1, a 7.2% absolute gain compared to the best existing method.

The rest of this paper is structured as follows. Sec. 2 introduces the basic notion of the graph knowledge base and the design of our query graph. Sec. 3 presents our search-based approach for generating the query graph. The experimental results are shown in Sec. 4, and the discussion of our approach and the comparisons to related work are in Sec. 5. Finally, Sec. 6 concludes the paper.

2 Background

In this work, we aim to learn a semantic parser that maps a natural language question to a logical form query q, which can be executed against a knowledge base K to retrieve the answers. Our approach takes a graphical view of both K and q, and reduces semantic parsing to mapping questions to query graphs. We describe the basic design below.

2.1 Knowledge base

The knowledge base K considered in this work is a collection of subject-predicate-object triples (e1, p, e2), where e1, e2 ∈ E are the entities (e.g., FamilyGuy or MegGriffin) and p ∈ P is a binary predicate like character. A knowledge base in this form is often called a knowledge graph because of its straightforward graphical representation – each entity is a node and two related entities are linked by a directed edge labeled by the predicate, from the subject to the object entity.

Figure 1: Freebase subgraph of Family Guy (CVT nodes connect the show to the character MegGriffin, the actors LaceyChabert and MilaKunis, and start dates such as 1/31/1999 and 12/26/1999).
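To make the triple view concrete, the following sketch (not from the paper) stores a tiny, illustrative slice of the Family Guy subgraph as a predicate-labeled adjacency structure; the entity and predicate names are shortened stand-ins for Freebase identifiers, and the dates are written in ISO form purely for convenience.

```python
from collections import defaultdict

# Toy subset of the Freebase subgraph in Fig. 1 (illustrative identifiers only).
TRIPLES = [
    ("FamilyGuy", "cast", "cvt1"),
    ("cvt1", "actor", "LaceyChabert"),
    ("cvt1", "character", "MegGriffin"),
    ("cvt1", "from", "1999-01-31"),
    ("FamilyGuy", "cast", "cvt2"),
    ("cvt2", "actor", "MilaKunis"),
    ("cvt2", "character", "MegGriffin"),
    ("cvt2", "from", "1999-12-26"),
]

# Index the triples as a directed, predicate-labeled graph:
# subject node -> predicate -> set of object nodes.
graph = defaultdict(lambda: defaultdict(set))
for subj, pred, obj in TRIPLES:
    graph[subj][pred].add(obj)

print(graph["FamilyGuy"]["cast"])   # -> {'cvt1', 'cvt2'}
print(graph["cvt1"]["actor"])       # -> {'LaceyChabert'}
```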

To compare our approach to existing methods, we use Freebase, which is a large database with more than 46 million topics and 2.6 billion facts. In Freebase’s design, there is a special entity category called compound value type (CVT), which is not a real-world entity, but is used to collect multiple fields of an event or a special relationship.

Fig. 1 shows a small subgraph of Freebase related to the TV show Family Guy. Nodes are the entities, including some dates and special CVT entities¹. A directed edge describes the relation between two entities, labeled by the predicate.

2.2 Query graph

Given the knowledge graph, executing a logical-form query is equivalent to finding a subgraph that can be mapped to the query and then resolving the binding of the variables. To capture this intuition, we describe a restricted subset of λ-calculus in a graph representation as our query graph.

Our query graph consists of four types of nodes: grounded entity (rounded rectangle), existential variable (circle), lambda variable (shaded circle), and aggregation function (diamond). Grounded entities are existing entities in the knowledge base K. Existential variables and lambda variables are ungrounded entities. In particular, we would like to retrieve all the entities that can map to the lambda variables in the end as the answers. The aggregation function is designed to operate on a specific entity, which typically captures some numerical properties. Just like in the knowledge graph, related nodes in the query graph are connected by directed edges, labeled with predicates in K.

¹ In the rest of the paper, we use the term entity for both real-world and CVT entities, as well as properties like date or height. The distinction is not essential to our approach.

Figure 2: Query graph that represents the question “Who first voiced Meg on Family Guy?” (FamilyGuy connects to an existential variable y, y connects to the answer node x, with the constraint y → MegGriffin and an arg min node attached to y).

To demonstrate this design, Fig. 2 shows one possible query graph for the question “Who first voiced Meg on Family Guy?” using Freebase. The two entities, MegGriffin and FamilyGuy, are represented by two rounded rectangle nodes. The circle node y means that there should exist an entity describing some casting relations like the character, actor and the time she started the role². The shaded circle node x is also called the answer node, and is used to map entities retrieved by the query. The diamond node arg min constrains that the answer needs to be the earliest actor for this role. Equivalently, the logical form query in λ-calculus without the aggregation function is: λx.∃y. cast(FamilyGuy, y) ∧ actor(y, x) ∧ character(y, MegGriffin). Running this query graph against K as in Fig. 1 will match both LaceyChabert and MilaKunis before applying the aggregation function, but only LaceyChabert is the correct answer, as she started this role earlier (by checking the from property of the grounded CVT node).
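As a rough, self-contained illustration (not the paper's implementation), the query graph of Fig. 2 can be evaluated over a toy triple store by a conjunctive match followed by the arg min aggregation; the graph contents below are the same illustrative stand-ins used earlier.

```python
# Toy store: node -> predicate -> set of objects (same shape as the earlier sketch).
graph = {
    "FamilyGuy": {"cast": {"cvt1", "cvt2"}},
    "cvt1": {"actor": {"LaceyChabert"}, "character": {"MegGriffin"}, "from": {"1999-01-31"}},
    "cvt2": {"actor": {"MilaKunis"}, "character": {"MegGriffin"}, "from": {"1999-12-26"}},
}

def answer_first_voiced(graph, topic="FamilyGuy", character="MegGriffin"):
    """Evaluate cast(topic, y) ∧ actor(y, x) ∧ character(y, character), then arg min on `from`."""
    candidates = []
    for y in graph.get(topic, {}).get("cast", ()):          # existential variable y
        node = graph.get(y, {})
        if character in node.get("character", ()):           # constraint on y
            for x in node.get("actor", ()):                   # lambda (answer) variable x
                start = min(node.get("from", {"9999"}))       # property used by the arg min node
                candidates.append((start, x))
    return min(candidates)[1] if candidates else None         # earliest start date wins

print(answer_first_voiced(graph))   # -> LaceyChabert
```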

Our query graph design is inspired by (Reddy et al., 2014), but with some key differences. The nodes and edges in our query graph closely resemble the exact entities and predicates from the knowledge base. As a result, the graph can be straightforwardly translated to a logical form query that is directly executable. In contrast, the query graph in (Reddy et al., 2014) is mapped from the CCG parse of the question, and needs further transformations before mapping to subgraphs of the target knowledge base to retrieve answers.

Semantically, our query graph is more related to simple λ-DCS (Berant et al., 2013; Liang, 2013), which is a syntactic simplification of λ-calculus when applied to graph databases. A query graph can be viewed as the tree-like graph pattern of a logical form in λ-DCS. For instance, the path from the answer node to an entity node can be described using a series of join operations in λ-DCS. Different paths of the tree graph are combined via the intersection operators.

² y should be grounded to a CVT entity in this case.

Figure 3: The legitimate actions to grow a query graph, shown as a finite state diagram over the states φ, Se, Sp, Sc with actions Ae, Ap, and Aa/Ac. See text for detail.

3 Staged Query Graph Generation

We focus on generating query graphs with the following properties. First, the tree graph consists of one entity node as the root, referred to as the topic entity. Second, there exists only one lambda variable x as the answer node, with a directed path from the root to it, and zero or more existential variables in-between. We call this path the core inferential chain of the graph, as it describes the main relationship between the answer and the topic entity. Variables can only occur in this chain, and the chain only has variable nodes except the root. Finally, zero or more entity or aggregation nodes can be attached to each variable node, including the answer node. These branches are the additional constraints that the answers need to satisfy. For example, in Fig. 2, FamilyGuy is the root and FamilyGuy → y → x is the core inferential chain. The branch y → MegGriffin specifies the character and y → arg min constrains that the answer needs to be the earliest actor for this role.
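One way to picture a query graph with exactly these properties is as a small record; the field names below are our own illustrative choices, not the paper's data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class QueryGraph:
    """Illustrative container for the restricted query graphs described above."""
    topic_entity: Optional[str] = None                       # root node, e.g. "FamilyGuy"
    core_chain: List[str] = field(default_factory=list)      # predicates from root to the answer x
    # Constraints attach an entity to a variable on the chain:
    # (index of the variable on the chain, predicate, entity).
    constraints: List[Tuple[int, str, str]] = field(default_factory=list)
    # Aggregations attach a function to a variable: (index, function, property).
    aggregations: List[Tuple[int, str, str]] = field(default_factory=list)

# The graph of Fig. 2 expressed in this representation:
fig2 = QueryGraph(
    topic_entity="FamilyGuy",
    core_chain=["cast", "actor"],
    constraints=[(1, "character", "MegGriffin")],
    aggregations=[(1, "argmin", "from")],
)
print(fig2)
```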

Given a question, we formalize the query graph generation process as a search problem, with staged states and actions. Let S = ⋃{φ, Se, Sp, Sc} be the set of states, where each state could be an empty graph (φ), a single-node graph with the topic entity (Se), a core inferential chain (Sp), or a more complex query graph with additional constraints (Sc). Let A = ⋃{Ae, Ap, Ac, Aa} be the set of actions. An action grows a given graph by adding some edges and nodes. In particular, Ae picks an entity node; Ap determines the core inferential chain; Ac and Aa add constraints and aggregation nodes, respectively. Given a state, the valid action set can be defined by the finite state diagram in Fig. 3. Notice that the order of possible actions is chosen for the convenience of implementation. In principle, we could choose a different order, such as matching the core inferential chain first and then resolving the topic entity linking. However, since we will consider multiple hypotheses during search, the order of the staged actions can simply be viewed as a different way to prune the search space or to bias the exploration order.

Figure 4: Two possible topic entity linking actions applied to an empty graph (s0), for the question “Who first voiced [Meg] on [Family Guy]?”: s1 links Family Guy and s2 links Meg Griffin.

We define the reward function on the state space using a log-linear model. The reward basically estimates the likelihood that a query graph correctly parses the question. Search is done using the best-first strategy with a priority queue, which is formally defined in Appendix A. In the following subsections, we use a running example of finding the semantic parse of the question qex = “Who first voiced Meg on Family Guy?” to describe the sequence of actions.
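Appendix A gives the formal procedure; the loop below is only a schematic best-first search under our own simplifying assumptions: `expand` stands in for the staged actions and `reward` for the log-linear scorer, neither of which is specified here.

```python
import heapq

def best_first_search(question, expand, reward, max_states=1000):
    """Schematic best-first search over query-graph states with a priority queue.

    `expand(state, question)` yields successor states (the staged actions Ae, Ap, Ac, Aa)
    and `reward(state, question)` returns the log-linear score; both are placeholders
    for the components described in the paper.
    """
    empty_state = ()                                            # the empty graph, state s0
    frontier = [(-reward(empty_state, question), 0, empty_state)]
    best, seen, counter = empty_state, {empty_state}, 1
    while frontier and counter < max_states:
        neg_score, _, state = heapq.heappop(frontier)           # highest-reward state first
        if -neg_score > reward(best, question):
            best = state
        for nxt in expand(state, question):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-reward(nxt, question), counter, nxt))
                counter += 1
    return best
```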

3.1 Linking Topic Entity

Starting from the initial state s0, the valid actions are to create a single-node graph that corresponds to the topic entity found in the given question. For instance, possible topic entities in qex can either be FamilyGuy or MegGriffin, shown in Fig. 4.

Figure 5: Candidate core inferential chains start from the entity FamilyGuy (s3: cast–actor, s4: writer–start, s5: genre).

We use an entity linking system that is designed for short and noisy text (Yang and Chang, 2015). For each entity e in the knowledge base, the system first prepares a surface-form lexicon that lists all possible ways that e can be mentioned in text. This lexicon is created using various data sources, such as names and aliases of the entities, the anchor text in Web documents and the Wikipedia redirect table. Given a question, it considers all the consecutive word sequences that have occurred in the lexicon as possible mentions, paired with their possible entities. Each pair is then scored by a statistical model based on its frequency counts in the surface-form lexicon. To tolerate potential mistakes of the entity linking system, as well as exploring more possible query graphs, up to 10 top-ranked entities are considered as the topic entity. The linking score will also be used as a feature for the reward function.
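The lexicon lookup just described can be pictured as follows; the lexicon contents and the simple frequency-ratio score are placeholders of our own, not the statistical model of Yang and Chang (2015).

```python
# Illustrative surface-form lexicon: mention string -> {entity: frequency count}.
# Real lexicons are built from aliases, anchor text, Wikipedia redirects, etc.
LEXICON = {
    "family guy": {"FamilyGuy": 950, "FamilyGuyVideoGame": 50},
    "meg": {"MegGriffin": 120, "MegRyan": 480},
}

def link_entities(question, lexicon, top_k=10):
    """Return (mention, entity, score) triples for word spans found in the lexicon."""
    tokens = question.lower().replace("?", "").split()
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            mention = " ".join(tokens[i:j])
            for entity, count in lexicon.get(mention, {}).items():
                total = sum(lexicon[mention].values())
                candidates.append((mention, entity, count / total))
    # Keep up to top_k topic-entity candidates, best score first.
    return sorted(candidates, key=lambda c: -c[2])[:top_k]

print(link_entities("Who first voiced Meg on Family Guy?", LEXICON))
```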

3.2 Identifying Core Inferential Chain

Given a state s that corresponds to a single-node graph with the topic entity e, the valid actions to extend this graph are to identify the core inferential chain; namely, the relationship between the topic entity and the answer. For example, Fig. 5 shows three possible chains that expand the single-node graph in s1. Because the topic entity e is given, we only need to explore legitimate predicate sequences that can start from e. Specifically, to restrict the search space, we explore all paths of length 2 when the middle existential variable can be grounded to a CVT node, and paths of length 1 if not. We also consider longer predicate sequences if the combinations are observed in training data³.
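Under this restriction (length-2 paths through a CVT node, length-1 paths otherwise), candidate chains can be enumerated directly from the graph; the sketch below assumes the toy graph shape used earlier and an `is_cvt` helper of our own for recognizing CVT nodes.

```python
def candidate_chains(graph, topic, is_cvt):
    """Enumerate length-1 and CVT-mediated length-2 predicate sequences from `topic`.

    `graph` maps node -> predicate -> set of object nodes; `is_cvt(node)` is an
    assumed helper telling whether a node is a compound value type.
    """
    chains = set()
    for p1, objects in graph.get(topic, {}).items():
        for obj in objects:
            if is_cvt(obj):
                # Length-2 chain through an existential CVT variable.
                for p2 in graph.get(obj, {}):
                    chains.add((p1, p2))
            else:
                chains.add((p1,))            # length-1 chain
    return chains

# e.g. on the toy graph: candidate_chains(graph, "FamilyGuy", lambda n: n.startswith("cvt"))
# -> {("cast", "actor"), ("cast", "character"), ("cast", "from")}
```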

Analogous to the entity linking problem, where the goal is to find the mapping of mentions to entities in K, identifying the core inferential chain is to map the natural utterance of the question to the correct predicate sequence. For question “Who first voiced Meg on [Family Guy]?” we need to measure the likelihood that each of the sequences in {cast-actor, writer-start, genre} correctly captures the relationship between FamilyGuy and Who. We reduce this problem to measuring semantic similarity using neural networks.

³ Decomposing relations in the utterance can be done using decoding methods (e.g., (Bao et al., 2014)). However, similar to ontology mismatch, the relation in text may not have a corresponding single predicate; for example, grandparent needs to be mapped to parent-parent in Freebase.

Figure 6: The architecture of the convolutional neural networks (CNN) used in this work. The CNN model maps a variable-length word sequence (e.g., a pattern or predicate sequence) to a low-dimensional vector in a latent semantic space. The layers, from bottom to top, are: the word sequence xt (<s> w1 w2 … wT </s>), the word hashing layer ft (15K letter-trigrams, via the word hashing matrix Wf), the convolutional layer ht (1000 units, via the convolution matrix Wc), the max pooling layer v, and the semantic layer y (300 units, via the semantic projection matrix Ws). See text for the description of each layer.

3.2.1 Deep Convolutional Neural Networks

To handle the huge variety of semantically equivalent ways of stating the same question, as well as the mismatch between the natural language utterances and predicates in the knowledge base, we propose using Siamese neural networks (Bromley et al., 1993) for identifying the core inferential chain. For instance, one of our constructions maps the question to a pattern by replacing the entity mention with a generic symbol <e> and then compares it with a candidate chain, such as “who first voiced meg on <e>” vs. cast-actor. The model consists of two neural networks, one for the pattern and the other for the inferential chain. Both are mapped to k-dimensional vectors as the output of the networks. Their semantic similarity is then computed using some distance function, such as cosine. This continuous-space representation approach has been proposed recently for semantic parsing and question answering (Bordes et al., 2014a; Yih et al., 2014) and has shown better results compared to lexical matching approaches (e.g., word-alignment models). In this work, we adapt a convolutional neural network (CNN) framework (Shen et al., 2014b; Shen et al., 2014a; Gao et al., 2014) to this matching problem. The network architecture is illustrated in Fig. 6.

The CNN model first applies a word hashing technique (Huang et al., 2013) that breaks a word into a vector of letter-trigrams (xt → ft in Fig. 6). For example, the bag of letter-trigrams of the word “who” are #-w-h, w-h-o, h-o-# after adding the word boundary symbol #. Then, it uses a convolutional layer to project the letter-trigram vectors of words within a context window of 3 words to a local contextual feature vector (ft → ht), followed by a max pooling layer that extracts the most salient local features to form a fixed-length global feature vector (v). The global feature vector is then fed to feed-forward neural network layers to output the final non-linear semantic features (y), as the vector representation of either the pattern or the inferential chain.

Figure 7: Extending an inferential chain with constraints and aggregation functions: s3 is the chain FamilyGuy → y → x (cast, actor), s6 additionally attaches MegGriffin to y, and s7 further attaches an arg min node to y.
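As a rough numerical sketch (dimensions shrunk drastically, random untrained weights), the hashing-convolution-pooling pipeline and the cosine match could look like the following; it mirrors the layer structure of Fig. 6 but is not the authors' implementation, and the layer sizes here are toy values.

```python
import numpy as np

def letter_trigrams(word):
    """Break a word into boundary-marked letter-trigrams, e.g. 'who' -> #wh, who, ho#."""
    w = "#" + word + "#"
    return [w[i:i + 3] for i in range(len(w) - 2)]

def hash_word(word, dim=500):
    """Word hashing layer: bag of letter-trigrams folded into a fixed-size vector."""
    v = np.zeros(dim)
    for tri in letter_trigrams(word):
        v[hash(tri) % dim] += 1.0
    return v

def cnn_encode(text, W_c, W_s, dim=500):
    """Convolution over a 3-word window, max pooling, then a tanh semantic layer."""
    words = ["<s>"] + text.lower().split() + ["</s>"]
    X = np.stack([hash_word(w, dim) for w in words])              # T x dim word hashing vectors
    windows = [np.concatenate(X[t:t + 3]) for t in range(len(X) - 2)]
    H = np.tanh(np.stack(windows) @ W_c)                          # local contextual features h_t
    v = H.max(axis=0)                                             # max pooling -> global vector
    return np.tanh(v @ W_s)                                       # semantic vector y

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W_c = rng.normal(scale=0.1, size=(3 * 500, 100))   # convolution matrix (toy size)
W_s = rng.normal(scale=0.1, size=(100, 30))        # semantic projection matrix (toy size)
pattern = cnn_encode("who first voiced meg on <e>", W_c, W_s)
chain = cnn_encode("cast actor", W_c, W_s)
print(cosine(pattern, chain))   # untrained score; training would fit W_c, W_s on positive pairs
```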

Training the model needs positive pairs, such as a pattern like “who first voiced meg on <e>” and an inferential chain like cast-actor. These pairs can be extracted from the full semantic parses when provided in the training data. If the correct semantic parses are latent and only the pairs of questions and answers are available, such as the case in the WEBQUESTIONS dataset, we can still hypothesize possible inferential chains by traversing the paths in the knowledge base that connect the topic entity and the answer. Sec. 4.1 will illustrate this data generation process in detail.

Our model has two advantages over the embedding approach (Bordes et al., 2014a). First, the word hashing layer helps control the dimensionality of the input space and can easily scale to large vocabulary. The letter-trigrams also capture some sub-word semantics (e.g., words with minor typos have almost identical letter-trigram vectors), which makes it especially suitable for questions from real-world users, such as those issued to a search engine. Second, it uses a deeper architecture with convolution and max-pooling layers, which has more representation power.


3.3 Augmenting Constraints & Aggregations

A graph with just the inferential chain forms the simplest legitimate query graph and can be executed against the knowledge base K to retrieve the answers; namely, all the entities that x can be grounded to. For instance, the graph in s3 in Fig. 7 will retrieve all the actors who have been on FamilyGuy. Although this set of entities obviously contains the correct answer to the question (assuming the topic entity FamilyGuy is correct), it also includes incorrect entities that do not satisfy additional constraints implicitly or explicitly mentioned in the question.

To further restrict the set of answer entities, the graph with only the core inferential chain can be expanded by two types of actions: Ac and Aa. Ac is the set of possible ways to attach an entity to a variable node, where the edge denotes one of the valid predicates that can link the variable to the entity. For instance, in Fig. 7, s6 is created by attaching MegGriffin to y with the predicate character. This is equivalent to the last conjunctive term in the corresponding λ-expression: λx.∃y. cast(FamilyGuy, y) ∧ actor(y, x) ∧ character(y, MegGriffin). Sometimes, the constraints are described over the entire answer set through the aggregation function, such as the word “first” in our example question qex. This is handled similarly by actions Aa, which attach an aggregation node on a variable node. For example, the arg min node of s7 in Fig. 7 chooses the grounding with the smallest from attribute of y.

The full possible constraint set can be derived by first issuing the core inferential chain as a query to the knowledge base to find the bindings of the variables y’s and x, and then enumerating all neighboring nodes of these entities. This, however, often results in an unnecessarily large constraint pool. In this work, we employ simple rules to retain only the nodes that have some possibility to be legitimate constraints. For instance, a constraint node can be an entity that also appears in the question (detected by our entity linking component), or an aggregation constraint can only be added if certain keywords like “first” or “latest” occur in the question. The complete set of these rules can be found in Appendix B.
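The full rule set lives in Appendix B; the two example rules mentioned above might be sketched as follows, with the keyword list and the `linked_entities` input being our own illustrative placeholders.

```python
AGGREGATION_KEYWORDS = {"first", "latest", "last", "current"}   # illustrative trigger list

def prune_constraints(candidate_entities, linked_entities):
    """Keep only constraint entities that the entity linker also found in the question."""
    return [e for e in candidate_entities if e in linked_entities]

def allow_aggregation(question):
    """Only consider adding an arg min / arg max node if a trigger keyword occurs."""
    words = set(question.lower().replace("?", "").split())
    return bool(words & AGGREGATION_KEYWORDS)

print(allow_aggregation("Who first voiced Meg on Family Guy?"))   # -> True
```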

3.4 Learning the reward function

Given a state s, the reward function γ(s) basically judges whether the query graph represented by s is the correct semantic parse of the input question q. We use a log-linear model to learn the reward function. Below we describe the features and the learning process.

3.4.1 Features

The features we designed essentially match specific portions of the graph to the question, and generally correspond to the staged actions described previously, including:

Topic Entity The score returned by the entity linking system is directly used as a feature.

Core Inferential Chain We use similarity scores of different CNN models described in Sec. 3.2.1 to measure the quality of the core inferential chain. PatChain compares the pattern (replacing the topic entity with an entity symbol) and the predicate sequence. QuesEP concatenates the canonical name of the topic entity and the predicate sequence, and compares it with the question. This feature conceptually tries to verify the entity linking suggestion. These two CNN models are learned using pairs of the question and the inferential chain of the parse in the training data. In addition to the in-domain similarity features, we also train a ClueWeb model using the Freebase annotation of ClueWeb corpora (Gabrilovich et al., 2013). For two entities in a sentence that can be linked by one or two predicates, we pair the sentences and predicates to form a parallel corpus to train the CNN model.

Constraints & Aggregations When a con-straint node is present in the graph, we use somesimple features to check whether there are wordsin the question that can be associated with the con-straint entity or property. Examples of such fea-tures include whether a mention in the questioncan be linked to this entity, and the percentage ofthe words in the name of the constraint entity ap-pear in the question. Similarly, we check the ex-istence of some keywords in a pre-compiled list,such as “first”, “current” or “latest” as features foraggregation nodes such as arg min. The completelist of these simple word matching features canalso be found in Appendix B.

Overall The number of the answer entities retrieved when issuing the query to the knowledge base and the number of nodes in the query graph are both included as features.


Figure 8: Active features of a query graph s for q = “Who first voiced Meg on Family Guy?”, where s is the graph FamilyGuy → y → x (cast, actor) with branches y → MegGriffin and y → arg min:
(1) EntityLinkingScore(FamilyGuy, “Family Guy”) = 0.9
(2) PatChain(“who first voiced meg on <e>”, cast-actor) = 0.7
(3) QuesEP(q, “family guy cast-actor”) = 0.6
(4) ClueWeb(“who first voiced meg on <e>”, cast-actor) = 0.2
(5) ConstraintEntityWord(“Meg Griffin”, q) = 0.5
(6) ConstraintEntityInQ(“Meg Griffin”, q) = 1
(7) AggregationKeyword(argmin, q) = 1
(8) NumNodes(s) = 5
(9) NumAns(s) = 1
(1) is the entity linking score of the topic entity. (2)–(4) are different model scores of the core chain. (5) indicates that 50% of the words in “Meg Griffin” appear in the question q. (6) is 1 when the mention “Meg” in q is correctly linked to MegGriffin by the entity linking component. (8) is the number of nodes in s. The knowledge base returns only 1 entity when issuing this query, so (9) is 1.

To illustrate our feature design, Fig. 8 presents the active features of an example query graph.

3.4.2 Learning

In principle, once the features are extracted, the model can be trained using any standard off-the-shelf learning algorithm. Instead of treating it as a binary classification problem, where only the correct query graphs are labeled as positive, we view it as a ranking problem. Suppose we have several candidate query graphs for each question⁴. Let ga and gb be the query graphs described in states sa and sb for the same question q, and the entity sets Aa and Ab be those retrieved by executing ga and gb, respectively. Suppose that A is the labeled answers to q. We first compute the precision, recall and F1 score of Aa and Ab, compared with the gold answer set A. We then rank sa and sb by their F1 scores⁵. The intuition behind this is that even if a query is not completely correct, it is still preferred over other totally incorrect queries. In this work, we use a one-layer neural network model based on lambda-rank (Burges, 2010) for training the ranker.
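The per-candidate F1 used as the ranking target can be computed as below; this only mirrors the description above, not the exact training code, and the example candidates are made up for illustration.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 of a predicted answer set against the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)
    p = tp / len(predicted)
    r = tp / len(gold)
    f1 = 2 * p * r / (p + r) if tp else 0.0
    return p, r, f1

def rank_candidates(candidates, gold):
    """Order candidate query graphs by the F1 of their retrieved answers (the ranking label)."""
    return sorted(candidates, key=lambda c: prf1(c["answers"], gold)[2], reverse=True)

gold = {"LaceyChabert"}
candidates = [
    {"graph": "chain-only", "answers": {"LaceyChabert", "MilaKunis"}},
    {"graph": "chain+character+argmin", "answers": {"LaceyChabert"}},
]
print([c["graph"] for c in rank_candidates(candidates, gold)])
# -> ['chain+character+argmin', 'chain-only']
```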

⁴ We will discuss how to create these candidate query graphs from question/answer pairs in Sec. 4.1.

⁵ We use F1 partially because it is the evaluation metric used in the experiments.

4 Experiments

We first introduce the dataset and evaluation metric, followed by the main experimental results and some analysis.

4.1 Data & evaluation metricWe use the WEBQUESTIONS dataset (Berantet al., 2013), which consists of 5,810 ques-tion/answer pairs. These questions were collectedusing Google Suggest API and the answers wereobtained from Freebase with the help of AmazonMTurk. The questions are split into training andtesting sets, which contain 3,778 questions (65%)and 2,032 questions (35%), respectively. Thisdataset has several unique properties that make itappealing and was used in several recent paperson semantic parsing and question answering. Forinstance, although the questions are not directlysampled from search query logs, the selection pro-cess was still biased to commonly asked questionson a search engine. The distribution of this ques-tion set is thus closer to the “real” informationneed of search users than that of a small numberof human editors. The system performance is ba-sically measured by the ratio of questions that areanswered correctly. Because there can be morethan one answer to a question, precision, recalland F1 are computed based on the system outputfor each individual question. The average F1 scoreis reported as the main evaluation metric6.

Because this dataset contains only question and answer pairs, we use essentially the same search procedure to simulate the semantic parses for training the CNN models and the overall reward function. Candidate topic entities are first generated using the same entity linking system for each question in the training data. Paths on the Freebase knowledge graph that connect a candidate entity to at least one answer entity are identified as the core inferential chains⁷. If an inferential-chain query returns more entities than the correct answers, we explore adding constraint and aggregation nodes, until the entities retrieved by the query graph are identical to the labeled answers, or the F1 score cannot be increased further. Negative examples are sampled from the incorrect candidate graphs generated during the search process.

⁶ We used the official evaluation script from http://www-nlp.stanford.edu/software/sempre/.

⁷ We restrict the path length to 2. In principle, parses of shorter chains can be used to train the initial reward function, for exploring longer paths using the same search procedure.


Method                      Prec.  Rec.   F1
(Berant et al., 2013)       48.0   41.3   35.7
(Bordes et al., 2014b)      -      -      29.7
(Yao and Van Durme, 2014)   -      -      33.0
(Berant and Liang, 2014)    40.5   46.6   39.9
(Bao et al., 2014)          -      -      37.5
(Bordes et al., 2014a)      -      -      39.2
(Yang et al., 2014)         -      -      41.3
(Wang et al., 2014)         -      -      45.3
Our approach – STAGG        52.8   60.7   52.5

Table 1: The results of our approach compared to existing work. The numbers of other systems are either from the original papers or derived from the evaluation script, when the output is available.

In the end, we produce 17,277 query graphs with non-zero F1 scores from the training set questions and about 1.7M completely incorrect ones.
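A rough sketch of the parse simulation described above: starting from a chain-only graph, constraint and aggregation nodes are added greedily as long as the F1 of the retrieved answers improves. Here `execute`, `possible_constraints`, and `f1` are placeholders for the knowledge-base query, the pruned constraint pool of Sec. 3.3, and the per-question F1; the greedy strategy is our own simplification.

```python
def simulate_parse(chain_graph, gold, execute, possible_constraints, f1):
    """Greedily extend a chain-only query graph with constraints while F1 improves.

    `execute(graph)` returns the answer set of a graph, `possible_constraints(graph)`
    returns candidate constraint/aggregation attachments, and `f1(pred, gold)` is the
    per-question F1; graphs are represented here simply as tuples of edges.
    """
    best_graph = chain_graph
    best_f1 = f1(execute(best_graph), gold)
    improved = True
    while improved and best_f1 < 1.0:
        improved = False
        for constraint in possible_constraints(best_graph):
            extended = best_graph + (constraint,)
            score = f1(execute(extended), gold)
            if score > best_f1:
                best_graph, best_f1, improved = extended, score, True
    return best_graph, best_f1
```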

For training the CNN models to identify the core inferential chain (Sec. 3.2.1), we only use 4,058 chain-only query graphs that achieve F1 = 0.5 to form the parallel question and predicate sequence pairs. The hyper-parameters in CNN, such as the learning rate and the numbers of hidden nodes at the convolutional and semantic layers, were chosen via cross-validation. We reserved 684 pairs of patterns and inference-chains from the whole training examples as the held-out set, and the rest as the initial training set. The optimal hyper-parameters were determined by the performance of models trained on the initial training set when applied to the held-out data. We then fixed the hyper-parameters and retrained the CNN models using the whole training set. The performance of CNN is insensitive to the hyper-parameters as long as they are in a reasonable range (e.g., 1000 ± 200 nodes in the convolutional layer, 300 ± 100 nodes in the semantic layer, and learning rate 0.05 ∼ 0.005) and the training process often converges after ∼800 epochs.

When training the reward function, we created up to 4,000 examples for each question that contain all the positive query graphs and randomly selected negative examples. The model is trained as a ranker, where example query graphs are ranked by their F1 scores.

4.2 Results

Tab. 1 shows the results of our system, STAGG (Staged query graph generation), compared to existing work⁸. As can be seen from the table, our system outperforms the previous state-of-the-art method by a large margin – 7.2% absolute gain.

⁸ We do not include results of (Reddy et al., 2014) because they used only a subset of 570 test questions, which are not directly comparable to results from other work. On these 570 questions, our system achieves 67.0% in F1.

Method        #Entities   #Covered Ques.   #Labeled Ent.
Freebase API  19,485      3,734 (98.8%)    3,069 (81.2%)
Ours          9,147       3,770 (99.8%)    3,318 (87.8%)

Table 2: Statistics of entity linking results on training set questions. Both methods cover roughly the same number of questions, but Freebase API suggests twice the number of entities output by our entity linking system and covers fewer topic entities labeled in the original data.

Given the staged design of our approach, it is thus interesting to examine the contributions of each component. Because topic entity linking is the very first stage, the quality of the entities found in the questions, both in precision and recall, affects the final results significantly. To get some insight about how our topic entity linking component performs, we also experimented with applying Freebase Search API to suggest entities for possible mentions in a question. As can be observed in Tab. 2, to cover most of the training questions, we only need half of the number of suggestions when using our entity linking component, compared to Freebase API. Moreover, they also cover more entities that were selected as the topic entities in the original dataset. Starting from those 9,147 entities output by our component, answers of 3,453 questions (91.4%) can be found in their neighboring nodes. When replacing our entity linking component with the results from Freebase API, we also observed a significant performance degradation. The overall system performance drops from 52.5% to 48.4% in F1 (Prec = 49.8%, Rec = 55.7%), which is 4.1 points lower.

Next we test the system performance when the query graph has just the core inferential chain. Tab. 3 summarizes the results. When only the PatChain CNN model is used, the performance is already very strong, outperforming all existing work. Adding the other CNN models boosts the performance further, reaching 51.8%, which is only slightly lower than the full system performance. This may be due to two reasons. First, the questions from search engine users are often short and a large portion of them simply ask about properties of an entity. Examining the query graphs generated for training set questions, we found that 1,888 (50.0%) can be answered exactly (i.e., F1 = 1) using a chain-only query graph. Second, even if the correct parse requires more constraints, the less constrained graph still gets a partial score, as its results cover the correct answers.

Method     Prec.  Rec.   F1
PatChain   48.8   59.3   49.6
+QuesEP    50.7   60.6   50.9
+ClueWeb   51.3   62.6   51.8

Table 3: The system results when only the inferential-chain query graphs are generated. We started with the PatChain CNN model and then added QuesEP and ClueWeb sequentially. See Sec. 3.4 for the description of these models.

4.3 Error Analysis

Although our approach substantially outperforms existing methods, there is still much room for improvement. After all, the accuracy for the intended application, question answering, is still low and only slightly above 50%. We randomly sampled 100 questions for which our system did not generate the completely correct query graphs, and categorized the errors. About one third of the errors are in fact due to label issues and are not real mistakes. This includes label errors (2%), incomplete labels (17%, e.g., only one song is labeled as the answer to “What songs did Bob Dylan write?”) and acceptable answers (15%, e.g., “Time in China” vs. “UTC+8”). 8% of the errors are due to incorrect entity linking; however, sometimes the mention is inherently ambiguous (e.g., AFL in “Who founded the AFL?” could mean either “American Football League” or “American Federation of Labor”). 35% of the errors are because of incorrect inferential chains; 23% are due to incorrect or missing constraints.

5 Related Work and Discussion

Several semantic parsing methods use a domain-independent meaning representation derived from the combinatory categorial grammar (CCG) parses (e.g., (Cai and Yates, 2013; Kwiatkowski et al., 2013; Reddy et al., 2014)). In contrast, our query graph design matches closely the graph knowledge base. Although not fully demonstrated in this paper, the query graph can in fact be fairly expressive. For instance, negations can be handled by adding tags to the constraint nodes indicating that certain conditions cannot be satisfied. Our graph generation method is inspired by (Yao and Van Durme, 2014; Bao et al., 2014). Unlike traditional semantic parsing approaches, it uses the knowledge base to help prune the search space when forming the parse. Similar ideas have also been explored in (Poon, 2013).

Empirically, our results suggest that it is crucial to identify the core inferential chain, which matches the relationship between the topic entity in the question and the answer. Our CNN models are analogous to the embedding approaches (Bordes et al., 2014a; Yang et al., 2014), but are more sophisticated. By allowing parameter sharing among different question-pattern and KB predicate pairs, the matching score of a rare or even unseen pair in the training data can still be predicted precisely. This is due to the fact that the prediction is based on the shared model parameters (i.e., projection matrices) that are estimated using all training pairs.

6 Conclusion

In this paper, we present a semantic parsing framework for question answering using a knowledge base. We define a query graph as the meaning representation that can be directly mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated as a staged search problem. With the help of an advanced entity linking system and a deep convolutional neural network model that matches questions and predicate sequences, our system outperforms previous methods substantially on the WEBQUESTIONS dataset.

In the future, we would like to extend our query graph to represent more complicated questions, and explore more features and models for matching constraints and aggregation functions. Applying other structured-output prediction methods to graph generation will also be investigated.

Acknowledgments

We thank the anonymous reviewers for their thoughtful comments, Ming Zhou, Nan Duan and Xuchen Yao for sharing their experience on the question answering problem studied in this work, and Chris Meek for his valuable suggestions. We are also grateful to Siva Reddy and Jonathan Berant for providing us additional data.

Appendix

See supplementary notes.


References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. 2014. Knowledge-based question answering as machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 967–976, Baltimore, Maryland, June. Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland, June. Association for Computational Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October. Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1247–1250, New York, NY, USA. ACM.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014a. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615–620, Doha, Qatar, October. Association for Computational Linguistics.

Antoine Bordes, Jason Weston, and Nicolas Usunier. 2014b. Open question answering with weakly supervised embedding models. In Proceedings of ECML-PKDD.

Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a “Siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):669–688.

Christopher J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 423–433, Sofia, Bulgaria, August. Association for Computational Linguistics.

Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora, version 1. Technical report, June.

Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, Li Deng, and Yelong Shen. 2014. Modeling interestingness with deep neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for Web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1545–1556, Seattle, Washington, USA, October. Association for Computational Linguistics.

Percy Liang. 2013. Lambda dependency-based compositional semantics. Technical report, arXiv.

Hoifung Poon. 2013. Grounded unsupervised semantic parsing. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 933–943.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014a. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 101–110. ACM.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014b. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 373–374.

Zhenghao Wang, Shengquan Yan, Huaming Wang, and Xuedong Huang. 2014. An overview of Microsoft Deep QA system on Stanford WebQuestions benchmark. Technical Report MSR-TR-2014-121, Microsoft, September.

Yi Yang and Ming-Wei Chang. 2015. S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. In Annual Meeting of the Association for Computational Linguistics (ACL).

Min-Chul Yang, Nan Duan, Ming Zhou, and Hae-Chang Rim. 2014. Joint relational embeddings for knowledge-based question answering. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 645–650, Doha, Qatar, October. Association for Computational Linguistics.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 956–966, Baltimore, Maryland, June. Association for Computational Linguistics.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 643–648, Baltimore, Maryland, June. Association for Computational Linguistics.