
Learning-to-Ask: Knowledge Acquisition via 20 Questions

Yihong Chen∗
Dep. of Electr. Engin., Tsinghua University, Beijing, China
[email protected]

Bei Chen
Microsoft Research, Beijing, China
[email protected]

Xuguang Duan∗
Dep. of Electr. Engin., Tsinghua University, Beijing, China
[email protected]

Jian-Guang Lou
Microsoft Research, Beijing, China
[email protected]

Yue Wang
Dep. of Electr. Engin., Tsinghua University, Beijing, China
[email protected]

Wenwu Zhu
Dep. of Comp. Sci. & Technol., Tsinghua University, Beijing, China
[email protected]

Yong Cao∗
Alibaba AI Labs, Beijing, China
[email protected]

ABSTRACT

Almost all knowledge-empowered applications rely upon accurate knowledge, which has to be either collected manually at high cost, or extracted automatically with non-negligible errors. In this paper, we study 20 Questions, an online interactive game where each question-response pair corresponds to a fact of the target entity, to acquire highly accurate knowledge effectively with nearly zero labor cost. Knowledge acquisition via 20 Questions presents two main challenges to the intelligent agent playing games with human players. The first is to seek enough information and identify the target entity with as few questions as possible, while the second is to leverage the remaining questioning opportunities to acquire valuable knowledge effectively; both depend on good questioning strategies. To address these challenges, we propose the Learning-to-Ask (LA) framework, within which the agent learns smart questioning strategies for information seeking and knowledge acquisition by means of deep reinforcement learning and generalized matrix factorization respectively. In addition, a Bayesian approach to representing knowledge is adopted to ensure robustness to noisy user responses. Simulation experiments on real data show that LA is able to equip the agent with effective questioning strategies, which result in high winning rates and rapid knowledge acquisition. Moreover, the questioning strategies for information seeking and knowledge acquisition boost each other's performance, allowing the agent to start with a relatively small knowledge set and quickly improve its knowledge base in the absence of constant human supervision.

∗This work was mainly done when the authors were at Microsoft Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD '18, August 19–23, 2018, London, United Kingdom
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5552-0/18/08...$15.00
https://doi.org/10.1145/3219819.3220047

KEYWORDS

Knowledge Acquisition; Information Seeking; 20 Questions; Reinforcement Learning; Generalized Matrix Factorization

ACM Reference Format:
Yihong Chen, Bei Chen, Xuguang Duan, Jian-Guang Lou, Yue Wang, Wenwu Zhu, and Yong Cao. 2018. Learning-to-Ask: Knowledge Acquisition via 20 Questions. In KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3219819.3220047

1 INTRODUCTION

There is no denying that knowledge plays an important role in many intelligent applications including, but not limited to, question answering systems, task-oriented bots and recommender systems. However, it remains a persistent challenge to build up accurate knowledge bases effectively, or rather, to acquire knowledge at affordable cost. Automatic knowledge extraction from existing resources like semi-structured or unstructured texts [18] usually suffers from low precision despite high recall. Manual knowledge acquisition, while providing very accurate results, is notoriously costly. Early work explored crowdsourcing to alleviate the manual cost. To motivate user engagement, the knowledge acquisition process was transformed into enjoyable interactive games between game agents and users, termed games with a purpose (GWAP) [21]. Drawing inspiration from GWAP, we find that the spoken parlor game 20 Questions1 is an excellent choice to be equipped with the purpose of knowledge acquisition, via which accurate knowledge can be acquired with nearly zero labor cost.

In the online interactive 20 Questions, the game agent plays the role of the guesser and tries to figure out what is in the mind of the human player. Figure 1 gives an example episode. At the beginning, the player mentally thinks of a target entity. Then, the agent asks yes-no questions, collects responses from the player, and finally tries to guess the target entity based on the collected responses. The agent wins the episode if it hits the target entity. Otherwise, it fails.

1 https://en.wikipedia.org/wiki/Twenty_Questions



Figure 1: An example episode of 20 Questions. (The player mentally thinks of a target entity; the agent asks questions such as "Is the character a female?", "Can the character play ping-pong?", "Does the character work for Microsoft?", and "Does the character dedicate to charity?"; the player answers Yes, No, or Unknown; the agent finally guesses "Bill Gates" and wins.)

In essence, 20 Questions presents the agent with an information-seeking task [10, 24], which aims at interactively seeking information about the target entity. During a 20 Questions episode, each question-response pair is equivalent to a fact about the target entity. For example, <"Does the character work for Microsoft?", "Yes"> with target "Bill Gates" corresponds to the fact "Bill Gates works for Microsoft.", and <"Is the food spicy?", "No"> with target "Apple" corresponds to the fact "Apple is not spicy.". Hence, in addition to seeking information using existing knowledge for game playing, the agent can deliberately collect new knowledge by asking questions without existing answers. Since the player does not know whether the current question is for information seeking or for knowledge acquisition, he/she feels nothing different about the game while contributing his/her knowledge unknowingly. What's more, the wide audience of 20 Questions makes it an excellent carrier for incorporating human contributions into the knowledge acquisition process. After a reasonable number of game sessions, the agent's knowledge becomes stable and accurate, as the acquired answers are voted on across many different human players.

However, integrating knowledge acquisition into the original information-seeking 20 Questions poses two serious challenges for the agent. (1) The agent needs efficient and robust information seeking (IS) strategies to work with noisy user responses and hit the target entity with as few questions as possible. As the questioning opportunities are limited, the agent earns more knowledge acquisition opportunities if it can identify the target entity with fewer questions. Hitting the target entity accurately is not only the precondition for knowledge acquisition, since the facts have to be linked to the correct entity, but also the key to attracting human players, because nobody wants to play with a stupid agent. (2) To acquire valuable facts, the agent needs effective knowledge acquisition (KA) strategies which can identify important questions for a given entity. Due to the great diversity in entities, most questions are not tailored or even relevant to a given entity, and they should not be asked during the corresponding KA. For example, KA on "Bill Gates" should start with meaningful questions rather than "Did the character play a power forward position?", which is more suitable for KA on basketball players.

In this paper, to address these challenges, we study effective IS and KA strategies under a novel Learning-to-Ask (LA) framework. Our goal is to learn IS and KA strategies that boost each other, allowing the agent to start with a relatively small knowledge base and quickly improve in the absence of constant human supervision. Our contributions are summarized as follows:

• We are the first to formally integrate IS strategies and KA strategies into a unified framework, named Learning-to-Ask, which turns the vanilla 20 Questions game into an effective knowledge harvester with nearly zero labor cost.

• Novel methods combining reinforcement learning (RL) with a Bayesian treatment are proposed to learn IS strategies in LA, which allow the agent to overcome noisy responses and hit the target entity with limited questioning opportunities.

• KA strategy learning in LA is achieved with a generalized matrix factorization (GMF) based method, so that the agent can infer the most important and relevant questions given the current target entity and knowledge base status.

• We conduct experiments against simulators built on real-world data. The results demonstrate the effectiveness of our methods, with high-accuracy IS and rapid KA.

Related Work. Knowledge acquisition has been widely studied. Early projects like Cyc [7] and OMCS [14] leveraged crowdsourcing to build up knowledge bases. GWAP were proposed to make knowledge acquisition more attractive to users. Games were designed for the dual purpose of entertainment and completing knowledge acquisition tasks: the ESP game [20] and Peekaboom [23] for labeling images; Verbosity [22], Common Consensus [9], and a similar game built on a smartphone spoken dialogue system [12] for acquiring commonsense knowledge. As for 20 Questions, it is originally a spoken parlor game that has been explored in a variety of disciplines. 20 Questions was used as a research tool in cognitive science and developmental psychology [1], and also powered a commercial product2. Previous work [25] used reinforcement learning for 20 Questions, where the authors treated the game as a task-oriented dialog system and applied hard-threshold guessing under the assumption that the responses from players were completely correct except for language understanding errors. In contrast, we explicitly consider the uncertainty in the players' feedback and equip the game with the purpose of knowledge acquisition. Although 20 Questions was proposed for commonsense knowledge acquisition in [15], that work predominantly focused on designing an interface specially for collecting facts in OMCS, using statistical classification methods combined with fine-tuned heuristics. We instead study the problem of learning questioning strategies for both IS and KA, which is more meaningful in terms of powering a competitive agent that attracts players. Moreover, as our methods do not rely on many handcrafted rules or specific assumptions about the knowledge bases, they can be generalized to other knowledge bases with little pain, generating questioning strategies which gradually improve as game sessions accumulate. To the best of our knowledge, we are the first to incorporate RL and GMF into a unified framework to learn questioning strategies, which can be used by agents for the dual purpose of playing entertaining 20 Questions, i.e., a game of information-seeking nature, and acquiring new knowledge from human players.

2 http://www.20q.net


Figure 2: The knowledge base: an entity-question matrix over the entity set {e1, ..., eM} and the question set {q1, ..., qN}, with entries dmn (e.g., em: Bill Gates, qn: "Does the character live in America?", dmn: a distribution over Yes/No/Unknown).

The rest of the paper is organized as follows. Section 2 introduces LA, the proposed unified framework for questioning strategy generation. Multiple methods to implement LA are presented in Section 3 and empirically evaluated in Section 4, followed by conclusion and discussion in Section 5.

2 LEARNING-TO-ASK FRAMEWORK

In this section, we present Learning-to-Ask (LA), a unified framework under which the agent can learn both information seeking and knowledge acquisition strategies in 20 Questions. To make it clear, we first introduce the knowledge base which backs up our agent, and then we present LA in detail.

As shown in Figure 2, the knowledge base is represented as an M × N entity-question matrix D, with each row and each column corresponding to an entity in E = {e1, e2, ..., eM} and a question in Q = {q1, q2, ..., qN} respectively. An entry dmn in D stores the agent's knowledge about the pair (em, qn). Although it may be tempting to regard a deterministic 'Yes' or 'No' as the knowledge for the pair (em, qn), the deterministic approach suffers from human players' noisy responses (they may make mistakes unintentionally or give 'Unknown' responses). We thus resort to a Bayesian approach, modelling dmn as a random variable which follows a multinoulli distribution over the set of possible responses R = {yes, no, unknown}. The exact parameters of dmn can be estimated directly from game logs, or replaced by a uniform distribution if the entry is missing from the logs. Notably, for an initial automatically constructed knowledge base, dmn can also be roughly approximated even though there are no game logs yet. Hence, our methods actually overcome the cold-start problem encountered in most knowledge base completion scenarios.
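To make this Bayesian representation concrete, the sketch below (our own illustration, not the authors' code; the class and method names are ours) keeps per-entry response counts and turns them into multinoulli distributions, falling back to a uniform prior when an entry has no game-log records.

```python
import numpy as np

RESPONSES = ("yes", "no", "unknown")

class EntityQuestionKB:
    """Entity-question matrix D where each entry d_mn is a multinoulli
    distribution over {yes, no, unknown}, estimated from response counts."""

    def __init__(self, num_entities, num_questions):
        # counts[m, n, r]: times response r was observed for (e_m, q_n);
        # initialised to 1 per response, i.e. a uniform prior for missing entries.
        self.counts = np.ones((num_entities, num_questions, len(RESPONSES)))

    def observe(self, m, n, response):
        """Record one player response for the pair (e_m, q_n)."""
        self.counts[m, n, RESPONSES.index(response)] += 1

    def distribution(self, m, n):
        """Point estimate of the multinoulli parameters of d_mn."""
        c = self.counts[m, n]
        return c / c.sum()

    def is_missing(self, m, n):
        """An entry is 'missing' if it still carries only the prior counts (N_mn = 3)."""
        return self.counts[m, n].sum() == len(RESPONSES)
```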

Backed by the knowledge base described above, the agent learns to ask questions within the LA framework, which Figure 3 summarizes. At each step in an episode, the agent manages the game state, which contains the history of questions and responses over the past game steps in addition to the guess. Two modules, termed the IS Module and the KA Module, are devised to generate questions for the dual purposes of IS and KA. Both modules output scores for the candidate questions, from which the agent chooses the most appropriate one to ask the human player. The response given by the human player is used to update the game state. At the end of each episode, the agent sends its guess to the human player for the final judgment. Through such episodes, the agent gradually learns questioning strategies that identify the target entity effectively and improve its own knowledge base.

Figure 3: Learning-to-Ask framework. (The agent maintains the game state and a KB; the IS and KA modules score the candidate questions q1, ..., qN; the agent asks, the player responds Yes/No/Unknown, and the agent finally guesses an entity.)

The detailed functionalities of the two modules are listed below.

IS Module. This module generates questioning strategies to collect the information necessary to hit the target entity at the final guess. At each game step t, the IS module examines the ambiguity remaining in the collected information (a1, x1, a2, x2, ..., a_{t-1}, x_{t-1}), where a denotes a question and x denotes a response, and it scores the candidate questions according to how much uncertainty they can resolve about the target entity. The agent asks the next question based on the scores. Once adequate information is collected, the agent makes a guess g.

KA Module. This module generates questioning strategies to improve the agent's knowledge base, i.e., to acquire knowledge about the missing entries in D. We assume that the ideal knowledge base for the agent should be consistent with the one human players possess. Hence, the most valuable missing entries that the KA module aims to acquire are those known to the human players but unknown to the agent. The KA module scores the candidate questions based on the values of the entries pertaining to the guess g. The agent asks questions according to the scores and accumulates the responses into its knowledge buffer.

Importantly, the questioning opportunities are limited. The more opportunities allocated to the KA module, the more knowledge the agent acquires. However, the IS module needs enough questioning opportunities as well; otherwise, the agent may fail to collect adequate information and make a correct guess. Therefore, it is a big challenge to improve both modules in LA. Note that the whole LA framework is invisible to the human players. They simply regard it as an enjoyable game and contribute their knowledge unknowingly while the agent gradually improves itself.

3 METHODS

So far we have introduced LA, a unified framework under which the agent learns questioning strategies to interact with players effectively. Here we present our methods to implement LA. As described above, the key to implementing LA is to instantiate the IS module and the KA module. We leverage deep reinforcement learning to instantiate the IS module, while generalized matrix factorization is incorporated into the KA module. We start by explaining the basic methodology underlying our methods, then we describe the implementation details.


3.1 Learning-to-Ask for Information Seeking

The IS module aims to collect the information necessary to identify the target entity while the agent interacts with the human player, which is essentially a task of interactive strategy learning. In particular, no direct instructions are given to the agent, and the consequence of asking some question plays out only at the end of the episode. This setting of interactive strategy learning naturally lends itself to reinforcement learning (RL) [17]. One of the most popular application domains of reinforcement learning is task-oriented dialog systems [2, 8]. Zhao et al. [25] proposed a variant of deep recurrent Q-network (DRQN) for end-to-end dialog state tracking and management, from which we drew inspiration to design our agent. Unlike approaches with subtle control rules and complicated heuristics, the RL-based methods allow the agent to learn questioning strategies solely through experience. At the beginning, the agent has no idea what effective questioning strategies look like. It gradually learns to ask "good" questions by trial and error. Effective questioning strategies are reinforced by the reward signal that indicates whether the agent wins the game, namely whether it collects enough information to hit the target entity.

Formally, let T be the number of total steps in an episode (T = 20 in 20 Questions), and T1 < T be the number of questioning opportunities allocated to the IS module. At step t, the agent observes the current game state st, which ideally summarizes the game history over the past steps, and takes an action at based on a policy π. In our case, an action corresponds to a question in the question set Q, and a policy corresponds to a questioning strategy that maps game states to questions. After asking a question, the agent receives a response xt from the human player and a reward signal rt reflecting its performance. The primary goal of the agent is to find an optimal policy that maximizes the expected cumulative discounted reward over an episode:

$$\mathbb{E}[R_t] = \mathbb{E}\Big[\sum_{i=t}^{T_1} \gamma^{i-t} r_i\Big], \qquad (1)$$

where γ ∈ [0, 1] is a discount factor that determines the importance of future rewards. This can be achieved by Q-learning, which learns a state-action value function (or Q-function) Q(s, a) and constructs the optimal policy by greedily selecting the action with the highest value in each state.

Given that the states in IS are complex and high-dimensional, we resort to deep neural networks to approximate the Q-function. This technique, termed DQN, was introduced by Mnih et al. [11]. DQN utilizes experience replay and a target network to mimic the supervised learning case and improve stability. The Q-network is updated by performing gradient descent on the loss

$$l(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}, r_t)\sim U}\Big[\big(r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \bar{\theta}) - Q(s_t, a_t; \theta)\big)^2\Big], \qquad (2)$$

where U is the experience replay memory that stores past transitions, θ are the parameters of the behavior Q-network, and θ̄ are the parameters of the target network. Note that θ̄ are copied from θ only after every C updates.
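As a minimal sketch of the update in Equation (2) (our own PyTorch rendering, not the authors' implementation; the batch layout and function names are assumptions), the target uses the frozen parameters θ̄ while gradients flow only through the behavior network θ:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Loss of Equation (2) on a sampled replay batch.

    batch: dict of tensors with keys
      'state' [B, state_dim], 'action' [B] (question indices, long),
      'reward' [B], 'next_state' [B, state_dim], 'done' [B] (0/1 floats).
    """
    q_all = q_net(batch["state"])                              # Q(s_t, ·; theta), shape [B, N]
    q_taken = q_all.gather(1, batch["action"].unsqueeze(1)).squeeze(1)

    with torch.no_grad():                                      # target uses frozen theta_bar
        next_q = target_net(batch["next_state"]).max(dim=1).values
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q

    return F.mse_loss(q_taken, target)

# theta_bar is refreshed only every C updates:
# target_net.load_state_dict(q_net.state_dict())
```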

The above DQN method, built upon a Markov Decision Process (MDP), assumes the states to be fully observable.

Figure 4: Q-network architecture for LA-DQN. (The state st is fed to a value network that outputs Q(st, at = qn) for every question qn.)

Although it is possible to manually design such states for the IS module by incorporating the whole history, this may sacrifice the learning efficiency of the Q-network. To address this problem, we combine an embedding technique with the deep recurrent Q-network (DRQN) [3], which utilizes recurrent neural networks to maintain belief states over the sequence of previous observations. Here we first present LA-DQN, a method to instantiate the IS module using DQN, and explain its incapability in dealing with large question and response spaces. Then we describe LA-DRQN, which leverages DRQN with questions and responses embedded.

3.1.1 LA-DQN. Let ht = (a1, x1, a2, x2, ..., a_{t-1}, x_{t-1}) be the history at step t, which provides all the information the agent needs to determine the next question. Owing to the Markov property required by DQN, a straightforward design for the state st is to encapsulate the history ht. Since there are 3 possible response values in our setting (yes, no, unknown), the state st can be represented as a 4N-dimensional vector:

$$s_t = [s_t^1, s_t^2, \ldots, s_t^N]^\top, \qquad (3)$$

where s_t^n is a 4-dimensional binary vector associated with the status of question qn ∈ Q up to step t. The first dimension in s_t^n indicates whether qn has been asked previously, i.e., whether it appears in the history of the current episode. If qn has been asked, the remaining 3 dimensions are a one-hot representation of the response from the player; otherwise, the remaining dimensions are all zeros. The state st is fed to the Q-network, implemented as a Multilayer Perceptron (MLP), to estimate the state-action values for all questions: [Q(st, at = q1), Q(st, at = q2), ..., Q(st, at = qN)]. Figure 4 shows the Q-network architecture for LA-DQN. Although there are alternatives for the Q-network architecture, this one allows the agent to obtain the state-action values for all questions in just one forward pass, which greatly reduces computation complexity and improves the scalability of the whole LA framework.
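For illustration, the following sketch (ours; the response ordering within each block is an assumption) builds the 4N-dimensional state of Equation (3), with one "asked" bit plus a one-hot response slot per question:

```python
import numpy as np

RESPONSES = ("yes", "no", "unknown")

def encode_state(history, num_questions):
    """Build the 4N-dimensional binary state of Equation (3).

    history: list of (question_index, response) pairs asked so far.
    Each question contributes a block [asked, yes, no, unknown].
    """
    state = np.zeros(4 * num_questions, dtype=np.float32)
    for q_idx, response in history:
        block = 4 * q_idx
        state[block] = 1.0                                   # question has been asked
        state[block + 1 + RESPONSES.index(response)] = 1.0   # one-hot response
    return state

# Example: questions 0 and 5 answered "yes" and "unknown" with N = 500 questions.
s_t = encode_state([(0, "yes"), (5, "unknown")], num_questions=500)
```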

3.1.2 LA-DRQN. In the LA-DQN method, states are designed to be fully observable, containing all the information needed for the next questioning decision. However, the state space is highly underutilized, and it expands quickly as the numbers of valid questions and responses increase, which results in slow convergence of Q-network learning and poor performance in target entity identification. Compared to LA-DQN, LA-DRQN is more powerful in terms of generalization to large question and response spaces. As shown in Figure 5, dense representations of questions and responses are learned in LA-DRQN.


Figure 5: Q-network architecture for LA-DRQN. (The embeddings W1(a_{t-1}) and W2(x_{t-1}) form the observation ot, a recurrent unit produces the state st, and a value network outputs Q(st, at = qn) for every question.)

Similar to word embeddings, the question embedding W1 : q → R^{N1} and the response embedding W2 : x → R^{N2} are parametrized functions mapping questions and responses to N1-dimensional and N2-dimensional real-valued vectors respectively. The observation at step t is the concatenation of the embeddings of the question and response at the previous step:

$$o_t = [W_1(a_{t-1}), W_2(x_{t-1})]^\top. \qquad (4)$$

To capture information over steps, a recurrent neural network (e.g., an LSTM) is employed to summarize the past steps and generate a state representation for the current step:

$$s_t = \mathrm{LSTM}(s_{t-1}, o_t), \qquad (5)$$

where the state st ∈ R^{N3}. As in LA-DQN, the generated state is used as the input to a value network that estimates the state-action values for all questions. It is important to realize that overly complex states bring obstacles to Q-network training, while overly simple states may fail to capture valuable information. The state design in LA-DRQN combines simple observations of concatenated embeddings with a recurrent neural network that summarizes the history, which is more appropriate for the IS module. The embedding approach also enables effective learning of model parameters and makes it possible to scale up.
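The sketch below mirrors the LA-DRQN components described above (question/response embeddings, an LSTM state as in Equations (4)-(5), and a value head). It is our own PyTorch rendering under stated assumptions, not the authors' implementation; the default sizes follow the PersonSub values in Table 1, but the module and argument names are ours.

```python
import torch
import torch.nn as nn

class DRQNet(nn.Module):
    """Embeds (a_{t-1}, x_{t-1}), summarises history with an LSTM (Eq. 4-5),
    and outputs Q(s_t, a_t = q_n) for every question in one forward pass."""

    def __init__(self, num_questions, num_responses=3, n1=30, n2=2, n3=32):
        super().__init__()
        self.q_embed = nn.Embedding(num_questions + 1, n1)   # +1: "no question yet" index
        self.r_embed = nn.Embedding(num_responses + 1, n2)   # +1: "no response yet" index
        self.lstm = nn.LSTMCell(n1 + n2, n3)
        self.value_head = nn.Linear(n3, num_questions)

    def forward(self, prev_question, prev_response, hidden=None):
        # o_t = [W1(a_{t-1}), W2(x_{t-1})]   (Equation 4)
        o_t = torch.cat([self.q_embed(prev_question),
                         self.r_embed(prev_response)], dim=-1)
        # s_t = LSTM(s_{t-1}, o_t)           (Equation 5); hidden=None means zero state
        h_t, c_t = self.lstm(o_t, hidden)
        q_values = self.value_head(h_t)      # [batch, N] state-action values
        return q_values, (h_t, c_t)
```

At the first step of an episode, the reserved "no question/response yet" indices and a zero hidden state can be used.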

3.1.3 Guessing via Naive Bayes. At step T1, the agent completes the task of the IS module and makes a guess at the target entity using the collected responses to the T1 questions. Let X = (x1, x2, ..., x_{T1}) denote the response sequence and g denote the agent's guess, which can be any entity in E = {e1, e2, ..., eM}. The posterior of the guess g is

$$P(g \mid X) \propto P_0(g) P(X \mid g) \propto P_0(g) \prod_{t=1}^{T_1} P(x_t \mid g), \qquad (6)$$

assuming independence among the responses given a specific entity. An ideal choice for the prior P0(g) is the entity popularity distribution. After computing the posterior probability for each em ∈ E, the agent chooses the one with the maximum posterior probability as the final guess. This Bayesian approach reduces LA's sensitivity to the noisy responses given by the players and allows the agent to make a more robust guess.
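A minimal sketch of the guess in Equation (6), computed in log space for numerical stability; it reuses the hypothetical EntityQuestionKB interface from the earlier sketch, and the popularity prior is assumed to be given as an array.

```python
import numpy as np

def naive_bayes_guess(kb, history, prior):
    """Pick g = argmax_m P0(e_m) * prod_t P(x_t | e_m)   (Equation 6).

    kb      : object exposing distribution(m, n) -> [p_yes, p_no, p_unknown]
    history : list of (question_index, response) pairs collected by the IS module
    prior   : array of entity popularity scores P0 (unnormalised is fine)
    """
    responses = ("yes", "no", "unknown")
    log_post = np.log(np.asarray(prior, dtype=np.float64))
    for m in range(len(log_post)):
        for q_idx, resp in history:
            p = kb.distribution(m, q_idx)[responses.index(resp)]
            log_post[m] += np.log(max(p, 1e-12))   # guard against zero probabilities
    return int(np.argmax(log_post))                # index of the guessed entity
```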

3.2 Learning-to-Ask for Knowledge Acquisition

The LA framework can make 20 Questions an enjoyable game with the well-designed IS module described above. Moreover, the incorporation of a sophisticated KA module makes it matter as more than just a game. Successful KA not only boosts the winning rates of the game, but also improves the agent's knowledge base. Intuitively, the agent achieves the best performance when its knowledge is consistent with the human players' knowledge. However, this is usually not the case in practice, where the agent always has less knowledge due to limited resources for knowledge base construction and the dynamic nature of knowledge. Hence, motivating human contributions plays an important part in KA. As described in Section 2, the goal of the KA module is to fill in the valuable missing entries. Specifically, given an entity em, the agent chooses questions from Q to collect knowledge about the entries {dmn}, n = 1, ..., N. Instead of asking questions blindly, we propose LA-GMF, a generalized matrix factorization (GMF) based method that takes the following two aspects into account to make KA more efficient.

Uncertainty. Consider the knowledge base shown in Figure 2. To fill in an entry dmn means to estimate a multinoulli distribution over the possible response values {yes, no, unknown} from collected responses. Let Nmn denote the number of collected responses for dmn. A small Nmn leads to high uncertainty in the estimated dmn. For this reason, the entries with higher uncertainty should be chosen, so that the agent can refine the related estimates faster. With this in mind, we select Nc questions from Q to maintain an adaptive candidate question set Qc, where a question qn is chosen with probability proportional to $1/\sqrt{N_{mn}}$. Note that, for the missing entries, Nmn is initialized as 3, the number of possible response values, since we give them uniform priors. In the cases where the agent receives too many "unknown" responses for an entry, the entry should not be asked about any further, because it is of little value and nobody bothers to know about it.

Value. Among the entries with high uncertainty, only some are worth acquiring. A valuable question is one whose answer the human players are likely to know (they respond with yes/no rather than unknown). Interestingly, given the incomplete D, predicting the missing entries is similar to collaborative filtering in recommender systems, which utilizes matrix factorization to infer users' preferences based on their explicit or implicit adoption history. Recent studies show that matrix factorization (MF) can be interpreted as a special case of neural collaborative filtering [4]. Thus, we employ generalized matrix factorization (GMF), which extends MF under the neural collaborative filtering framework, to score entries and rank the candidates in Qc. To be clear, let Y ∈ R^{M×N} be the indicator matrix constructed from D: if the entry dmn is known (Nmn ≠ 3), then ymn = 1; otherwise, ymn = 0. Y can be factorized as Y = U · V, where U is an M × K real-valued matrix and V is a K × N real-valued matrix. Let Um be the m-th row of U, denoting the latent factor for entity em, and Vn be the n-th column of V, denoting the latent factor for question qn. GMF scores each entry dmn and estimates the value as:

$$\hat{y}_{mn} = a_{\mathrm{out}}\big(h^\top (U_m \odot V_n)\big), \qquad (7)$$

where ⊙ denotes the element-wise product of vectors, h denotes the weights of the output layer, and a_out denotes the non-linear activation function.


Algorithm 1 One cycle of LA-GMF for the KA module.
Input: knowledge base D, buffer size Nk
construct Y from D; factorize Y = U · V; Count = 0
while Count < Nk do
    em = g, the guess of the current episode
    Qh = {a1, a2, ..., a_{T1}}, the set of questions already asked
    θm = (1/√Nm1, 1/√Nm2, ..., 1/√NmN), the uncertainty parameter
    for step = 1, 2, ..., T2 do
        Qc = ∅
        for i = 1, 2, ..., Nc do
            qi ∼ Multinomial(θm) with qi ∉ Qc and qi ∉ Qh
            Qc = Qc + qi
        end for
        rank the questions in Qc by their scores using Equation (7)
        ask the top question q in Qc and collect the response into the buffer
        Qh = Qh + q; Count = Count + 1
    end for
end while
update D using all the knowledge in the buffer

U and V are learned by minimizing the mean squared error $(y_{mn} - \hat{y}_{mn})^2$.
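The following PyTorch sketch shows one way to realise the GMF scorer of Equation (7) with a Sigmoid output layer; it is an illustration under our assumptions (class name, default K = 48 from Table 1), not the authors' code.

```python
import torch
import torch.nn as nn

class GMFScorer(nn.Module):
    """Generalized matrix factorization: y_hat_mn = sigmoid(h^T (U_m ⊙ V_n))."""

    def __init__(self, num_entities, num_questions, k=48):
        super().__init__()
        self.entity_factors = nn.Embedding(num_entities, k)     # rows U_m
        self.question_factors = nn.Embedding(num_questions, k)  # columns V_n
        self.h = nn.Linear(k, 1, bias=False)                    # output-layer weights h

    def forward(self, entity_idx, question_idx):
        u = self.entity_factors(entity_idx)
        v = self.question_factors(question_idx)
        return torch.sigmoid(self.h(u * v)).squeeze(-1)         # score in (0, 1)

# Training sketch: minimise the squared error between y_mn (1 if d_mn is known,
# 0 otherwise) and the prediction, sampling a few negative (unknown) entries
# per positive one, e.g.  loss = ((model(m_idx, n_idx) - y_mn) ** 2).mean()
```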

The collected responses are accumulated into a buffer and periodically written back to the knowledge base. An updating cycle of LA-GMF is shown in Algorithm 1; it generally spans multiple episodes. The agent starts with the knowledge base D. During an episode, the agent makes a guess g after the IS module completes its task within T1 questions, and the remaining T2 = T − T1 questioning opportunities are left for the KA module. LA-GMF first samples a set of candidate questions Qc and ranks the questions according to Equation (7). The question with the highest score is chosen to ask the human player. Once the buffer is full, a cycle finishes and the knowledge base D is updated. Ideally, the improved knowledge base D converges to the knowledge base that the human players possess as the KA process continues. Notably, if the guess provided by the IS module is wrong, the agent fails to acquire correct knowledge in that episode; we omit this case in Algorithm 1. Therefore, a well-designed IS module is quite important to the KA module, and in return, the knowledge collected by the KA module increases the agent's competitiveness in playing 20 Questions, making the game more attractive to the human players. Furthermore, in contrast to most works that adopt relation inference to complete knowledge bases, LA provides highly accurate knowledge base completion due to its interactive nature.

3.3 Implementation

In this subsection, we describe details of the implementations of LA-DQN, LA-DRQN and LA-GMF. To avoid duplicated questions, which reduce human players' pleasure, we constrain the available questions to those that have not been asked over the past steps in the current episode. Rewards r for LA-DQN and LA-DRQN are +1 for winning, −1 for losing, and 0 for all nonterminal steps. We adopt an ϵ-greedy exploration strategy. Some modern reinforcement learning techniques are used to improve the convergence of LA-DQN and LA-DRQN: Double Q-learning [19] alleviates the overestimation of Q-values.

Table 1: Parameters for experiments.

LA-DQN & LA-DRQN:
    learning rate: 2.5 × 10^−4
    replay prioritization: 0.5
    initial explore: 1
    final explore: 0.1
    final explore step: 1 × 10^6
    discount factor (γ): 0.99
    behavior network update frequency in episodes: 1
    target network update frequency in episodes (C): 1 × 10^4
    LA-DQN hidden layers' sizes on PersonSub: [2048, 1024]
    LA-DRQN network sizes (N1, N2, N3) on PersonSub: [30, 2, 32]
    LA-DRQN network sizes (N1, N2, N3) on Person: [62, 2, 64]

LA-GMF:
    learning rate: 1 × 10^−3
    buffer size (Nk): 9000
    number of negative samples: 4
    dimension of latent factor (K): 48

Prioritized experience replay [13] allows faster backward propagation of reward information by favoring transitions with large gradients. Since all the possible rewards over an episode fall into [−1, 1], Tanh is chosen as the activation function for the Q-network output layer. Batch Normalization [5] and Dropout [16] are used to accelerate the training of the Q-networks as well. As for LA-GMF, we use a Sigmoid output layer. The Adam optimizer [6] is applied to train the networks.

4 EXPERIMENTS

In this section, we experimentally evaluate our methods with the aim of answering the following questions:

Q.1 Equipped with the learned IS strategies in LA, is the agent smart enough to hit the target entity?

Q.2 Equipped with the learned KA strategies in LA, can the agent acquire knowledge rapidly to improve its knowledge base?

Q.3 Is the acquired knowledge helpful to the agent in terms of increasing the efficiency of identifying the target entity?

Q.4 Starting with no prior, can the logical connections between questions be learned by the agent over episodes?

Q.5 Can the agent engage players in contributing their knowledge unknowingly by asking appropriate questions?

4.1 Experimental Settings

We evaluate our methods against player simulators constructed from real-world data. Some important parameters are listed in Table 1.

Dataset. We built an experimental Chinese 20 Questions system, where the target entities are persons (celebrities and characters in fiction), and collected the data for constructing the entity-question matrices. As described in Section 2, R = {yes, no, unknown} is the response set. Every entry follows a multinoulli distribution estimated from the corresponding collected response records. The entries with fewer than 30 records were filtered out, since they are statistically insignificant. All missing entries were initialized with 3 records, one for each possible response, corresponding to a uniform distribution over {yes, no, unknown}. After data processing, the whole dataset Person contains more than 10 thousand entities and 1.8 thousand questions. Each entity has a popularity score indicating how frequently the entity appears in the game logs.

Learning-to-Ask: Knowledge Acquisition via 20Questions KDD ’18, August 19–23, 2018, London, United Kingdom

Figure 6: Long-tail distribution (on log-log axes) of Person.

Table 2: Statistics of datasets.

Dataset      # Entities   # Questions   Missing Ratio
Person       10440        1819          0.9597
PersonSub    500          500           0.5836

The popularity score follows a long-tail distribution, as shown in Figure 6. For simplicity, we constructed a subset, PersonSub, for comparing the performance of different methods; it contains the 500 most popular entities and their 500 most asked questions. Both datasets, summarized in Table 2, are in Chinese; the missing ratio is the proportion of missing entries. The datasets are highly sparse with diverse questions, which makes it challenging to learn good questioning strategies. It is worth emphasizing that the final datasets include only the aggregated entries rather than the session-level game logs, which means the agent has to figure out the questioning strategies interactively from scratch.

Simulator. In order to evaluate our methods, a faithful simulator is necessary. The simulation approach not only obviates the expensive cost of online testing (which hurts players' game experience), but also allows us to compare the acquired knowledge base with the ground-truth knowledge base, which is not available when testing online. To simulate players' behavior, we consider two aspects: target generation and response generation. At the beginning of each episode, the player simulator samples an entity as the target according to the popularity score distribution, which simulates how likely the agent would be to encounter that entity in a real episode. The response to a certain entity-question pair is obtained by sampling from the multinoulli distribution of the corresponding entry, which models the uncertainty in a real player's feedback.
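A compact sketch of such a player simulator (our own illustration under the assumptions above; it reuses the hypothetical EntityQuestionKB interface): the target is drawn from the popularity distribution, and each response is sampled from the corresponding multinoulli entry.

```python
import numpy as np

class PlayerSimulator:
    """Samples a target entity by popularity and answers each question
    from the ground-truth entry's multinoulli distribution."""

    RESPONSES = ("yes", "no", "unknown")

    def __init__(self, kb, popularity, rng=None):
        self.kb = kb                                     # ground-truth knowledge base
        self.popularity = popularity / popularity.sum()  # normalised popularity scores
        self.rng = rng or np.random.default_rng()
        self.target = None

    def start_episode(self):
        self.target = self.rng.choice(len(self.popularity), p=self.popularity)
        return self.target

    def respond(self, question_idx):
        probs = self.kb.distribution(self.target, question_idx)
        return self.rng.choice(self.RESPONSES, p=probs)
```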

Metrics. We use two metrics, one for IS evaluation and the other for KA evaluation. (1) Winning Rate. To measure the agent's ability to hit the target entity, we obtain the winning rate by averaging over a sufficient number of testing episodes in which the agent uses the learned questioning strategy to play 20 Questions against the simulator. (2) Knowledge Base Size vs. Distance. To evaluate the KA strategies, we consider not only the number of knowledge base entries that have been filled, i.e., the knowledge base size, but also the effectiveness of the filled entries, given that a fast increase in knowledge base size may imply that the agent wastes questioning opportunities on irrelevant questions and "unknown" answers. Also, an effective KA strategy is supposed to result in a knowledge base which reflects the knowledge across all the human players.

In other words, the distance between the collected knowledge base and the human knowledge base should be small. Since each entry of the entity-question matrix is represented as a random variable with a multinoulli distribution in our methods, we define the asymmetric distance from knowledge base $D^1$ to $D^2$ as the average exponential Kullback-Leibler divergence over all entries:

$$\mathrm{Distance}(D^1, D^2) = \frac{\sum_{i,j} \exp\big(\mathrm{KL}(d^1_{ij} \,\|\, d^2_{ij})\big)}{N}, \qquad (8)$$

where N is the total number of entries, and $D^1$ and $D^2$ are the knowledge bases of the agent and the player simulator respectively. For computational convenience, the missing entries are smoothed to uniform distributions when calculating the distances. Clearly, a good KA strategy is expected to achieve a fast decrease in knowledge base distance while the knowledge base size increases slowly, which means the agent does not waste questioning opportunities on worthless questions.
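The distance of Equation (8) can be computed as below (a sketch assuming both knowledge bases are given as arrays of per-entry multinoulli parameters, with missing entries already smoothed to uniform).

```python
import numpy as np

def kb_distance(agent_probs, player_probs, eps=1e-12):
    """Equation (8): average over all entries of exp(KL(d^1_ij || d^2_ij)).

    agent_probs, player_probs: arrays of shape [M, N, 3] holding multinoulli
    parameters for the agent's and the simulator's knowledge bases.
    """
    p = np.clip(agent_probs, eps, 1.0)
    q = np.clip(player_probs, eps, 1.0)
    kl = np.sum(p * np.log(p / q), axis=-1)     # KL divergence per entry
    return float(np.mean(np.exp(kl)))           # average exponential KL
```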

Baselines. Since we study a novel problem in this paper, baseline methods are not readily available. Hence, in order to demonstrate the effects of our methods, we constructed baselines for each module separately by hand. Details of the baseline construction are described later.

4.2 Performance of IS Module (Q.1)

In this subsection, we compare different methods for generating questioning strategies for the IS module. Our experiments on PersonSub demonstrate the effectiveness of the proposed RL-style methods, and further experiments on Person show the scalability of the proposed LA-DRQN.

Baselines. Methods based on entropy often play crucial roles in querying knowledge bases, i.e., performing information-seeking tasks. Therefore, it is not surprising that the most popular approaches to building a 20 Questions IS agent appear to be fusions of entropy-based methods and heuristics (e.g., asking the question with maximum entropy). Although supervised methods like decision trees also work well as long as game logs are available, these methods suffer from the cold-start problem, where the agent has nothing but an automatically constructed knowledge base. Given that our primary goal is to complete a knowledge base which may be automatically constructed with no initial game logs available, we choose entropy-based methods as our baselines. We design several entropy-based methods and take the best one as our baseline, named Entropy. Specifically, we maintain a candidate entity set E′ initialized by E. Every entity em in E′ is equipped with a tolerance value tm initialized to 0. Let Emn = E[dmn] denote the expectation obtained by setting yes = 1, no = −1 and unknown = 0. We define a variable En for question qn, which follows a distribution obtained by considering all {Emn | em ∈ E′}. At each step, following the idea of maximum entropy, we choose the question qc = argmax_{qn} h(En) and receive the response xc, where h(·) is the entropy function. Then for each em ∈ E′, tm is increased by |Emc − xc|, which indicates the bias. When tm reaches a threshold (15 in our experiments), the entity em is removed from E′. Finally, the entity with the smallest tolerance value in E′ is chosen as the guess. Besides, we also design a linear Q-learning agent as a simple RL baseline, named LA-LIN.
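To make the Entropy baseline more concrete, here is a rough sketch of one selection step and the tolerance update, following the description above. It is our own interpretation: in particular, the histogram discretization used to estimate the entropy of the E_n values is an assumption, since the text does not specify how the distribution of E_n is formed, and all helper names are ours.

```python
import numpy as np

def select_question(expectations, candidates, asked, bins=10):
    """Choose q_c = argmax_{q_n} h(E_n) over questions not yet asked.

    expectations: [M, N] array of E_mn = E[d_mn] with yes=1, no=-1, unknown=0.
    candidates:   set of entity indices still in E'.
    asked:        set of question indices already asked.
    """
    def entropy(values):
        hist, _ = np.histogram(values, bins=bins, range=(-1, 1))
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    cand = sorted(candidates)
    best_q, best_h = None, -np.inf
    for n in range(expectations.shape[1]):
        if n in asked:
            continue
        h = entropy(expectations[cand, n])
        if h > best_h:
            best_q, best_h = n, h
    return best_q

def update_tolerances(expectations, candidates, tolerances, question, x_c, threshold=15):
    """Accumulate |E_mc - x_c| and prune entities whose tolerance reaches the threshold."""
    for m in list(candidates):
        tolerances[m] += abs(expectations[m, question] - x_c)
        if tolerances[m] >= threshold:
            candidates.discard(m)
```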


Figure 7: (a) Convergence processes on PersonSub; (b) LA-DRQN performance with varied IS questioning opportunities on PersonSub; (c) LA-DRQN convergence process on Person; (d) LA-DRQN performance of IS with KA on PersonSub.

Table 3: Winning rate of the IS module. The reported results are obtained by averaging over 10000 testing episodes.

Dataset      Entropy   LA-LIN   LA-DQN   LA-DRQN
PersonSub    0.5161    0.4020   0.7229   0.8535
Person       0.2873    −        −        0.6237

Winning Rates. Table 3 shows the results for T1 = 20. On PersonSub, the simple RL agent LA-LIN performs worse than the Entropy agent, suggesting that the Entropy agent is a relatively competitive baseline. Both the LA-DQN and the LA-DRQN agents learn better questioning strategies than the Entropy agent, achieving higher winning rates while playing 20 Questions against the player simulator. Moreover, the winning rate of the LA-DRQN agent is significantly higher than that of the LA-DQN agent.

Convergence Processes. As shown in Figure 7(a), with T1 = 20, the LA-DRQN agent learns better questioning strategies faster than the LA-DQN agent due to the architecture of its value network. The redundancy in the states of the LA-DQN agent results in slow convergence, and it takes much longer to train the large value network. In contrast, the embeddings and recurrent units in the LA-DRQN agent significantly shrink the size of the value network by extracting a dense representation of states from the complete asking and responding history. As for LA-LIN, its model capacity is too small to capture a good strategy, although it converges fast.

Number of Questioning Opportunities. The number of questioning opportunities has considerable influence on the performance of IS strategies. Hence, we take a further look at the performance of the LA-DRQN agent trained and tested under varied numbers of questioning opportunities (T1). As shown in Figure 7(b), where each point corresponds to the performance of the LA-DRQN agent trained for up to 2.5 × 10^7 steps, more questioning opportunities lead to higher winning rates. Given this fact, it is important to allocate enough questioning opportunities to the IS module if the agent wants to succeed in seeking the target, i.e., winning the game. We observe that 17 questioning opportunities are adequate for the LA-DRQN agent to perform well on PersonSub. Thus, the remaining 3 questioning opportunities can be saved for the KA module, as will be discussed in detail later.

Scale up to a Larger Knowledge Base. Finally, we are interested in the scalability of our methods, considering that our primary goal is to complete various knowledge bases. We perform experiments with T1 = 20 on Person, a larger knowledge base covering almost all data from the game system. It is worth mentioning that 20 Questions becomes much more difficult for the agent due to the large size and high missing ratio of the entity-question matrix. Since LA-LIN and LA-DQN give poor results due to the huge state space (the states have about 40k dimensions on Person), and their computational complexity grows linearly with the number of questions, they are difficult to scale up to larger spaces of questions and responses. Hence, we only evaluate the Entropy baseline and the LA-DRQN method. The winning rates are shown in Table 3, while the convergence process of LA-DRQN is presented in Figure 7(c). The LA-DRQN agent outperforms the Entropy agent significantly, which demonstrates its effectiveness in seeking the target and its ability to scale up to larger knowledge bases.

4.3 Performance of KA Module (Q.2)

A promising IS module is the bedrock on which the KA module rests. Therefore, we choose LA-DRQN, which outperforms all other methods, to instantiate the IS module when running the KA experiments. We build a simulation scenario where the players own more knowledge than the agent and the agent tries to acquire knowledge by playing 20 Questions with the players. We ran experiments on PersonSub. Note that in our KA experiments the agent has T1 = 17 questioning opportunities for IS and T2 = 3 questioning opportunities for KA in each episode, as suggested by the study on the number of questioning opportunities.

A Simulation Scenario for KA. To build a simulation scenario for KA, we first randomly discard 20% of the existing entries of the original entity-question matrix and use the remaining data to construct the knowledge base for our agent. Throughout the whole interactive process, the knowledge base of the player simulator remains unchanged, while the agent's knowledge base improves gradually and, in theory, eventually converges to the one the player simulator possesses.

Baseline. Since KA methods depend on the exact approach by which the knowledge is represented, it is difficult to find readily available baselines. However, we believe that all KA methods should take "uncertainty" and "value" into account as Algorithm 1 does, even though "uncertainty" and "value" can be defined in different ways. Given this, reasonable baselines for KA should include the uncertainty-only method and the value-only method.


Figure 8: Knowledge base (a) size and (b) distance of the agent on PersonSub. LA-GMF was trained with Nc = 32.

Figure 9: Trading off "uncertainty" and "value" on PersonSub.

For this reason, we simply modify Algorithm 1 to obtain these two baselines.

Knowledge Base Size vs. Distance. Figure 8 shows the evolution of the knowledge base size and the knowledge base distance. The value-only method over-exploits the value information estimated from the known entries, so the agent tends to ask more about the known entries instead of the unknown ones, leading to almost unchanged knowledge base size and distance. While the uncertainty-only method does collect new knowledge, it adopts a relatively inefficient KA strategy, wasting many questioning opportunities on worthless knowledge (unknown to the players). That is why its knowledge base distance decreases quite slowly while the knowledge base size soars. Compared to the uncertainty-only method, our proposed LA-GMF enjoys a slower increase in knowledge base size and a faster reduction in knowledge base distance, indicating that the agent learns to use the questioning opportunities more efficiently. To sum up, the results demonstrate that LA-GMF learns better strategies to acquire valuable knowledge faster, and the knowledge base quickly becomes close to the ground-truth one (that of the player simulator).

Trade-off between Uncertainty and Value. The LA-GMF method considers both "uncertainty" and "value". In fact, the parameter Nc in Algorithm 1 determines the trade-off between the two aspects, with larger Nc putting more weight on "value" and smaller Nc putting more weight on "uncertainty". Results of experiments on PersonSub with different Nc are shown in Figure 9. Each point in the figure was obtained by choosing the latest update in the previous 5.5 million game steps. In our experiments, Nc = 32 achieves the most promising reduction in knowledge base distance with a modest increment in knowledge base size.

4.4 Further Analysis

4.4.1 IS Boosted by the Acquired Knowledge (Q.3). With the right guess from the IS module, knowledge can be acquired via the KA module. In return, the valuable acquired knowledge boosts the winning rates of the IS module. We take a close look at the effect of the acquired knowledge on IS. As shown in Figure 7(d), our LA-DRQN agent that updates the IS strategy with the acquired knowledge performs better than the one that updates the IS strategy without it, improving the winning rate faster, i.e., learning better IS strategies.

4.4.2 Learned Question Representations (Q.4). To analyze whether our LA-DRQN agent can learn reasonable question representations, we visualize the learned question embeddings W1(qn) (n = 1, 2, ..., N) on the PersonSub dataset using t-SNE. We annotate several typical questions in Figure 10. It can be observed that questions with logical connections to each other tend to have similar embeddings, i.e., they are close to each other in the embedding space. For example, the logically connected questions "Is the character a football player?" and "Did the character wear the No.10 shirt?" are close to each other in the learned embedding space. This suggests that we can obtain meaningful question embeddings starting from random initialization after the agent plays a sufficient number of episodes of interactive 20 Questions, although initializing the question embeddings with pre-trained word embeddings would bring faster convergence in practice.

4.4.3 Questioning Strategy for KA (Q.5). To intuitively analyze the knowledge acquisition ability of our agent, we show several contrasting examples of KA questions generated by LA-GMF and by the uncertainty-only method in Table 4. As observed, the questions generated by LA-GMF are more reasonable and relevant to the target entity, while the ones generated by the uncertainty-only method are incongruous and may reduce players' gaming experience. Moreover, the LA-GMF agent tends to ask valuable questions to which players know the answers, acquiring valid facts instead of getting "unknown" responses and wasting opportunities. For example, with the target entity Jimmy Kudo, the LA-GMF agent chooses to ask questions like "Is the character brave?" to obtain the fact "Jimmy Kudo is brave", while the uncertainty-only agent chooses strange questions like "Is the character a monster?" and wastes the valuable questioning opportunities.

5 CONCLUSION AND FUTURE WORK

In this paper, we study knowledge acquisition via 20 Questions. We propose a novel framework, Learning-to-Ask (LA), within which automated agents learn to ask questions to solve an information-seeking task while acquiring new knowledge within limited questioning opportunities. We also present well-designed methods to implement the LA framework. Reinforcement learning based methods, LA-DQN and LA-DRQN, are proposed for information seeking. Borrowing ideas from recommender systems, LA-GMF is proposed for knowledge acquisition. Experiments on real data show that our agent can play enjoyable 20 Questions with high winning rates, while acquiring knowledge rapidly for knowledge base completion.


Figure 10: Visualization of question embeddings learned with LA-DRQN on PersonSub (annotated questions include, e.g., "Does the character major in music?", "Is the character a football player?", and "Did the character wear the No.10 shirt?").

Table 4: Contrasting examples of KA questions generated by LA-GMF and Uncertainty-Only.

Target Entity: Jay Chou3
    KA questions with LA-GMF:
        Does the character mainly sing folk songs? No
        Is the character handsome? Yes
        Is the character born in Japan? No
    KA questions with Uncertainty-Only:
        Did the character die unnaturally? Unknown
        Does the character like human beings? Unknown
        Is the character from a novel? No

Target Entity: Jimmy Kudo4
    KA questions with LA-GMF:
        Is the character brave? Yes
        Is the character from Japanese animation? Yes
        Was the character dead? No
    KA questions with Uncertainty-Only:
        Does the character only act in movies (no TV shows)? Unknown
        Has the character ever been traitorous? Unknown
        Is the character a monster? Unknown

3 Jay Chou is a famous Chinese singer. (https://en.wikipedia.org/wiki/Jay_Chou)
4 Jimmy Kudo is a character in a Japanese animation. (https://en.wikipedia.org/wiki/Jimmy_Kudo)

In fact, the LA framework can be implemented in many other ways. Our work marks a first step towards deep learning and reinforcement learning based knowledge acquisition in game-style Human-AI collaboration. We believe it is worthwhile to undertake in-depth studies on such smart knowledge acquisition mechanisms in an era where various Human-AI collaborations are becoming more and more common.

For future work, we are interested in integrating IS and KA seamlessly using hierarchical reinforcement learning. A bandit-style KA module is also worth studying, since KA essentially trades off "uncertainty" and "value", similarly to the exploitation-exploration problem. Furthermore, explicitly modeling the complex dependencies between questions should be helpful to IS.

ACKNOWLEDGMENTS

This work is partly supported by the National Natural Science Foundation of China (Grant No. 61673237 and No. U1611461). We thank the reviewers for their constructive comments.

REFERENCES

[1] Mary L Courage. 1989. Children's inquiry strategies in referential communication and in the game of twenty questions. Child Development 60, 4 (1989), 877–886.
[2] Milica Gašić, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. 2013. On-line policy optimisation of bayesian spoken dialogue systems via human interaction. In ICASSP. 8367–8371.
[3] Matthew Hausknecht and Peter Stone. 2015. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium Series.
[4] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. 173–182.
[5] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
[6] Diederik P Kingma and Jimmy Ba. 2015. Adam: a method for stochastic optimization. In ICLR.
[7] Douglas B Lenat. 1995. CYC: A large-scale investment in knowledge infrastructure. Commun. ACM 38, 11 (1995), 33–38.
[8] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv:1606.01541 (2016).
[9] Henry Lieberman, Dustin Smith, and Alea Teeters. 2007. Common Consensus: a web-based game for collecting commonsense goals. In ACM Workshop on Common Sense for Intelligent Interfaces.
[10] Gary Marchionini. 1997. Information seeking in electronic environments. Cambridge University Press.
[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[12] Naoki Otani, Daisuke Kawahara, Sadao Kurohashi, Nobuhiro Kaji, and Manabu Sassano. 2016. Large-scale acquisition of commonsense knowledge via a quiz game on a dialogue system. In Open Knowledge Base and Question Answering Workshop. 11–20.
[13] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay. In ICLR.
[14] Push Singh, Thomas Lin, Erik T Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In OTM. 1223–1237.
[15] Robert Speer, Jayant Krishnamurthy, Catherine Havasi, Dustin Smith, Henry Lieberman, and Kenneth Arnold. 2009. An interface for targeted collection of common sense knowledge using a mixture model. In IUI. 137–146.
[16] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR 15, 1 (2014), 1929–1958.
[17] Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: an introduction. Vol. 1. MIT Press, Cambridge.
[18] Niket Tandon, Gerard de Melo, Fabian Suchanek, and Gerhard Weikum. 2014. WebChild: Harvesting and organizing commonsense knowledge from the web. In WSDM. 523–532.
[19] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-learning. In AAAI.
[20] Luis Von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In SIGCHI. 319–326.
[21] Luis Von Ahn and Laura Dabbish. 2008. Designing games with a purpose. Commun. ACM 51, 8 (2008), 58–67.
[22] Luis Von Ahn, Mihir Kedia, and Manuel Blum. 2006. Verbosity: a game for collecting common-sense facts. In SIGCHI. 75–78.
[23] Luis Von Ahn, Ruoran Liu, and Manuel Blum. 2006. Peekaboom: a game for locating objects in images. In SIGCHI. 55–64.
[24] Thomas D Wilson. 2000. Human information behavior. Informing Science 3, 2 (2000), 49–56.
[25] Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGdial.