
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7422–7435, July 5–10, 2020. ©2020 Association for Computational Linguistics


What Question Answering can Learn from Trivia Nerds

Jordan Boyd-Graber†
iSchool, CS, UMIACS, LSC
University of Maryland
[email protected]

Benjamin Börschinger†
†Google Research Zürich
{jbg, bboerschinger}@google.com

Abstract

Question answering (QA) is not just building systems; this NLP subfield also creates and curates challenging question datasets that reveal the best systems. We argue that QA datasets—and QA leaderboards—closely resemble trivia tournaments: the questions agents—humans or machines—answer reveal a “winner”. However, the research community has ignored the lessons from decades of the trivia community creating vibrant, fair, and effective QA competitions. After detailing problems with existing QA datasets, we outline several lessons that transfer to QA research: removing ambiguity, identifying better QA agents, and adjudicating disputes.

1 Introduction

This paper takes an unconventional analysis to answer “where we’ve been and where we’re going” in question answering (QA). Instead of approaching the question only as computer scientists, we apply the best practices of trivia tournaments to QA datasets.

The QA community is obsessed with evaluation. Schools, companies, and newspapers hail new SOTAs and topping leaderboards, giving rise to troubling claims (Lipton and Steinhardt, 2019) that an “AI model tops humans” (Najberg, 2018) because it ‘won’ some leaderboard, putting “millions of jobs at risk” (Cuthbertson, 2018). But what is a leaderboard? A leaderboard is a statistic about QA accuracy that induces a ranking over participants.

Newsflash: this is the same as a trivia tournament. The trivia community has been doing this for decades (Jennings, 2006); Section 2 details this overlap between the qualities of a first-class QA dataset (and its requisite leaderboard). The experts running these tournaments are imperfect, but they’ve learned from their past mistakes (see Appendix A for a brief historical perspective) and created a community that reliably identifies those best at question answering. Beyond the format of the competition, trivia norms ensure individual questions are clear, unambiguous, and reward knowledge (Section 3).

We are not saying that academic QA should surrender to trivia questions or the community—far from it! The trivia community does not understand the real-world information-seeking needs of users or what questions challenge computers. However, they have well-tested protocols to declare that someone is better at answering questions than another. This collection of tradecraft and principles can nonetheless help the QA community.

Beyond these general concepts that QA can learn from, Section 4 reviews how the “gold standard” of trivia formats, Quizbowl, can improve traditional QA. We then briefly discuss how research that uses fun, fair, and good trivia questions can benefit from the expertise, pedantry, and passion of the trivia community (Section 5).

2 Surprise, this is a Trivia Tournament!

“My research isn’t a silly trivia tournament,” you say. That may be, but let us first tell you a little about what running a tournament is like, and perhaps you might see similarities.

First, the questions. Either you write them yourself or you pay someone to write questions by a particular date (sometimes people on the Internet).

Then, you advertise. You talk about your questions: who is writing them, what subjects are covered, and why people should try to answer them.

Next, you have the tournament. You keep your questions secure until test time, collect answers from all participants, and declare a winner. Afterward, people use the questions to train for future tournaments.

These have natural analogs to crowdsourcing questions, writing the paper, advertising, and running a leaderboard. Trivia nerds cannot help you form hypotheses or write your paper, but they can tell you how to run a fun, well-calibrated, and discriminative tournament.

Such tournaments are designed to effectively find a winner, which matches the scientific goal of knowing which model best answers questions. Our goal is not to encourage the QA community to adopt the quirks and gimmicks of trivia games. Instead, it’s to encourage experiments and datasets that consistently and efficiently find the systems that best answer questions.

2.1 Are we having fun?

Many authors use crowdworkers to establish human accuracy (Rajpurkar et al., 2016; Choi et al., 2018). However, they are not the only humans who should answer a dataset’s questions. So should the dataset’s creators.

In the trivia world, this is called a play test: get in the shoes of someone answering the questions. If you find them boring, repetitive, or uninteresting, so will crowdworkers. If you can find shortcuts to answer questions (Rondeau and Hazen, 2018; Kaushik and Lipton, 2018), so will a computer.

Concretely, Weissenborn et al. (2017) catalog artifacts in SQuAD (Rajpurkar et al., 2018), the most popular QA leaderboard. If you see a list like “Along with Canada and the United Kingdom, what country...”, you can ignore the rest of the question and just type Ctrl+F (Yuan et al., 2019; Russell, 2020) to find the third country—Australia in this case—that appears with “Canada and the UK”. Other times, a SQuAD playtest would reveal frustrating questions that are i) answerable given the information but not with a direct span,1 ii) answerable only given facts beyond the given paragraph,2 iii) unintentionally embedded in a discourse, resulting in arbitrary correct answers,3 or iv) non-questions.

1 A source paragraph says “In [Commonwealth countries]... the term is generally restricted to... Private education in North America covers the whole gamut...”; thus, “What is the term private school restricted to in the US?” has the information needed but not as a span.

2 A source paragraph says “Sculptors [in the collection include] Nicholas Stone, Caius Gabriel Cibber, [...], Thomas Brock, Alfred Gilbert, [...] and Eric Gill”, i.e., a list of names; thus, the question “Which British sculptor whose work includes the Queen Victoria memorial in front of Buckingham Palace is included in the V&A collection?” should be unanswerable in SQuAD.

3 A question “Who else did Luther use violent rhetoric towards?” has the gold answer “writings condemning the Jews and in diatribes against Turks”.
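This Ctrl+F shortcut is trivial to mechanize. The sketch below is our own toy illustration (the regular expressions and the example passage are invented, not drawn from SQuAD) of the kind of artifact a playtest exposes:

```python
import re
from typing import Optional

def ctrl_f_shortcut(question: str, passage: str) -> Optional[str]:
    """Toy heuristic for "Along with X and Y, what country ..." questions:
    find a passage sentence mentioning both X and Y, then return a
    capitalized word in that sentence that the question does not contain."""
    match = re.search(r"Along with (.+?) and (.+?),", question)
    if not match:
        return None
    known = {match.group(1), match.group(2)}
    for sentence in re.split(r"(?<=[.!?])\s+", passage):
        if all(entity in sentence for entity in known):
            for token in re.findall(r"\b[A-Z][a-z]+\b", sentence):
                if token.lower() not in question.lower():
                    return token
    return None

print(ctrl_f_shortcut(
    "Along with Canada and the United Kingdom, what country signed the treaty?",
    "The treaty was signed by Australia, Canada, and the United Kingdom."))
# -> Australia: the list pattern gives the answer away without any reading
```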

SearchQA (Dunn et al., 2017), derived from Jeopardy!, asks “An article that he wrote about his riverboat days was eventually expanded into Life on the Mississippi.” The apprentice and newspaper writer who wrote the article is named Samuel Langhorne Clemens; however, the reference answer is his later pen name, Mark Twain. Most QA evaluation metrics would count Samuel Clemens as incorrect. In a real game of Jeopardy!, this would not be an issue (Section 3.1).

Of course, fun is relative, and any dataset is bound to contain errors. However, playtesting is an easy way to find systematic problems: unfair, unfun playtests make for ineffective leaderboards. Eating your own dog food can help diagnose artifacts, scoring issues, or other shortcomings early in the process.

The deeper issues when creating a QA task are: i) have you designed a task that is internally consistent, ii) supported by a scoring metric that matches your goals, iii) using gold annotations that reward those who do the task well? Imagine someone who loves answering the questions your task poses: would they have fun on your task? This is the foundation of Gamification (von Ahn, 2006), which can create quality data from users motivated by fun rather than pay. Even if you pay crowdworkers, unfun questions may undermine your dataset goals.

2.2 Am I measuring what I care about?

Answering questions requires multiple skills: identifying answer mentions (Hermann et al., 2015), naming the answer (Yih et al., 2015), abstaining when necessary (Rajpurkar et al., 2018), and justifying an answer (Thorne et al., 2018). In QA, the emphasis on SOTA and leaderboards has focused attention on single automatically computable metrics—systems tend to be compared by their ‘SQuAD score’ or their ‘NQ score’, as if this were all there is to say about their relative capabilities. Like QA leaderboards, trivia tournaments need to decide on a single winner, but they explicitly recognize that there are more interesting comparisons.

A tournament may recognize different background/resources—high school, small school, undergraduates (Hentzel, 2018). Similarly, more practical leaderboards would reflect training time or resource requirements (see Dodge et al., 2019), including ‘constrained’ or ‘unconstrained’ training (Bojar et al., 2014). Tournaments also give specific awards (e.g., highest score without incorrect answers). Again, there are obvious leaderboard analogs that would go beyond a single number. In SQuAD 2.0 (Rajpurkar et al., 2018), abstaining contributes the same to the overall F1 as a fully correct answer, obscuring whether a system is more precise or an effective abstainer. If the task recognizes both abilities as important, reporting a single score risks implicitly prioritizing one balance of the two.
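To make the conflation concrete, here is a minimal sketch (our own toy scoring, not the official SQuAD 2.0 evaluation script) that reports the blended score next to the two abilities it mixes; two very different systems can tie on the single number:

```python
from collections import defaultdict

def split_scores(examples):
    """Given per-example records {"is_answerable": bool, "f1": float},
    report the single blended F1 alongside the two abilities it conflates:
    span F1 on answerable questions and abstention accuracy on the rest.
    (Correctly abstaining is conventionally scored as F1 = 1.0.)"""
    buckets = defaultdict(list)
    for ex in examples:
        key = "answerable" if ex["is_answerable"] else "unanswerable"
        buckets[key].append(ex["f1"])
    overall = sum(buckets["answerable"] + buckets["unanswerable"]) / len(examples)
    has_ans_f1 = sum(buckets["answerable"]) / max(len(buckets["answerable"]), 1)
    no_ans_acc = sum(buckets["unanswerable"]) / max(len(buckets["unanswerable"]), 1)
    return {"overall_f1": overall, "has_ans_f1": has_ans_f1, "no_ans_acc": no_ans_acc}

# Two very different systems share the same blended score of 0.80:
precise = [{"is_answerable": True, "f1": 1.0}] * 80 + [{"is_answerable": False, "f1": 0.0}] * 20
abstainer = [{"is_answerable": True, "f1": 0.75}] * 80 + [{"is_answerable": False, "f1": 1.0}] * 20
print(split_scores(precise))    # strong on spans, never abstains correctly
print(split_scores(abstainer))  # weaker spans but a perfect abstainer
```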

2.3 Do my questions separate the best?

Assume that you have picked a metric (or a set of metrics) that captures what you care about. A leaderboard based on this metric can rack up citations as people chase the top spot. But your leaderboard is only useful if it is discriminative: the best system reliably wins.

There are many ways questions might not be discriminative. If every system gets a question right (e.g., abstain on non-questions like “asdf” or correctly answer “What is the capital of Poland?”), the dataset does not separate participants. Similarly, if every system flubs “what is the oldest north-facing kosher restaurant”, it is not discriminative. Sugawara et al. (2018) call these questions “easy” and “hard”; we instead argue for a three-way distinction.

In between easy questions (system answers correctly with probability 1.0) and hard (probability 0.0), questions with probabilities nearer to 0.5 are more interesting. Taking a cue from Vygotsky’s proximal development theory of human learning (Chaiklin, 2003), these discriminative questions—rather than the easy or the hard ones—should most improve QA systems. These Goldilocks4 questions (not random noise) decide who tops the leaderboard. Unfortunately, existing datasets have many easy questions. Sugawara et al. (2020) find that ablations like shuffling word order (Feng et al., 2018), shuffling sentences, or only offering the most similar sentence do not impair systems. Newer datasets such as DROP (Dua et al., 2019) and HellaSwag (Zellers et al., 2019) are harder for today’s systems; because Goldilocks is a moving target, we propose annual evaluations in Section 5.
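The three-way distinction is easy to operationalize once several reference systems are available. A small sketch (our own illustration; the 0.9 and 0.1 thresholds are arbitrary) labels each question by how many systems answer it correctly:

```python
def classify_questions(correct, easy=0.9, hard=0.1):
    """correct[q] is a list of 0/1 outcomes, one per reference system.
    Questions nearly everyone gets right (or wrong) are labeled easy (or
    hard); the rest are the discriminative "Goldilocks" questions."""
    labels = {}
    for question, outcomes in correct.items():
        accuracy = sum(outcomes) / len(outcomes)
        if accuracy >= easy:
            labels[question] = "easy"
        elif accuracy <= hard:
            labels[question] = "hard"
        else:
            labels[question] = "goldilocks"
    return labels

outcomes = {
    "capital of Poland": [1, 1, 1, 1],                       # every system right
    "oldest north-facing kosher restaurant": [0, 0, 0, 0],   # every system wrong
    "author who opens Crime and Punishment": [1, 0, 1, 0],   # discriminative
}
print(classify_questions(outcomes))
```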

2.4 Why so few Goldilocks questions?

This is a common problem in trivia tournaments, particularly pub quizzes (Diamond, 2009), where challenging questions can scare off patrons. Many quiz masters prefer popularity with players and thus write easier questions.

4 In a British folktale first recorded by Robert Southey, the character Goldilocks finds three beds: one too hard, one not hard enough, and one “just right”.

[Figure 1 omitted: two bar charts over “Questions”, with segments for Annotation Error, Too Easy, Too Hard, and Discriminative.]

Figure 1: Two datasets with 0.16 annotation error: the top, however, better discriminates QA ability. In the good dataset (top), most questions are challenging but not impossible. In the bad dataset (bottom), there are more trivial or impossible questions and annotation error is concentrated on the challenging, discriminative questions. Thus, a smaller fraction of questions decides who sits atop the leaderboard, requiring a larger test set.

Sometimes there are fewer Goldilocks questions not by choice, but by chance: a dataset becomes less discriminative through annotation error. All datasets have some annotation error; if this annotation error is concentrated on the Goldilocks questions, the dataset will be less useful. As we write this in 2020, humans and computers sometimes struggle on the same questions.

Figure 1 shows two datasets of the same size with the same annotation error. However, they have different difficulty distributions and correlation of annotation error and difficulty. The dataset that has more discriminative questions and consistent annotator error has fewer questions that do not discriminate the winner of the leaderboard. We call this the effective dataset proportion ρ (higher is better). Figure 2 shows the test set size required to reliably discriminate systems for different ρ, based on a simulation (Appendix B).
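A rough Monte Carlo version of the idea fits in a few lines. The sketch below is a simplified stand-in for the simulation in Appendix B (our own approximation, with invented accuracies): it estimates how often the truly better system tops the leaderboard when only a fraction rho of the questions can discriminate at all.

```python
import random

def better_system_wins(n_questions, rho, acc_strong=0.80, acc_weak=0.75, trials=2000):
    """Estimate how often the stronger system wins when only a fraction rho
    of questions are discriminative; on the rest (too easy, too hard, or
    mislabeled) both systems receive the same score."""
    wins = 0
    for _ in range(trials):
        strong = weak = 0
        for _ in range(n_questions):
            if random.random() < rho:        # a discriminative question
                strong += random.random() < acc_strong
                weak += random.random() < acc_weak
            else:                            # both systems score identically
                shared = random.random() < 0.5
                strong += shared
                weak += shared
        wins += strong > weak
    return wins / trials

for rho in (0.25, 1.0):
    print(rho, better_system_wins(n_questions=2500, rho=rho))
# With the same 2500 questions, the better system wins far less reliably
# at rho = 0.25 than at rho = 1.0.
```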

At this point, you may despair about how big a dataset you need.5 The same terror besets trivia tournament organizers. Instead of writing more questions, they use pyramidality (Section 4) to make every question count.

5 Using a more sophisticated simulation approach, the TREC 2002 QA test set (Voorhees, 2003) could not discriminate systems with less than a seven absolute score point difference.

[Figure 2 omitted: three panels (ρ = 0.25, 0.85, 1.00) plotting the test set size needed (y axis, 100 to 10000) against the difference in accuracy between two systems (x axis, 1 to 15), with curves for average accuracy 50.0 and 80.0.]

Figure 2: How much test data do you need to discriminate two systems with 95% confidence? This depends on both the difference in accuracy between the systems (x axis) and the average accuracy of the systems (closer to 50% is harder). Test set creators do not have much control over those. They do have control, however, over how many questions are discriminative. If all questions are discriminative (right), you only need 2500 questions, but if three quarters of your questions are too easy, too hard, or have annotation errors (left), you’ll need 15000.

3 The Craft of Question Writing

Trivia enthusiasts agree that questions need to be well written (despite other disagreements). Asking “good questions” requires sophisticated pragmatic reasoning (Hawkins et al., 2015), and pedagogy explicitly acknowledges the complexity of writing effective questions for assessing student performance (Haladyna, 2004, focusing on multiple choice questions).

QA datasets, however, are often collected from the wild or written by untrained crowdworkers. Crowdworkers lack experience in crafting questions and may introduce idiosyncrasies that shortcut machine learning (Geva et al., 2019). Similarly, data collected from the wild such as Natural Questions (Kwiatkowski et al., 2019) or AmazonQA (Gupta et al., 2019) by design have vast variations in quality. In the previous section, we focused on how datasets as a whole should be structured. Now, we focus on how specific questions should be structured to make the dataset as valuable as possible.

3.1 Avoiding ambiguity and assumptions

Ambiguity in questions not only frustrates answerers who resolve the ambiguity ‘incorrectly’. Ambiguity also frustrates the goal of using questions to assess knowledge. Thus, the US Department of Transportation explicitly bans ambiguous questions from exams for flight instructors (Flight Standards Service, 2008); and the trivia community has likewise developed rules and norms that prevent ambiguity. While this is true in many contexts, examples are rife in a format called Quizbowl (Boyd-Graber et al., 2012), whose very long questions6 showcase trivia writers’ tactics. For example, Quizbowl author Zhu Ying (writing for the 2005 PARFAIT tournament) asks participants to identify a fictional character while warning against possible confusion [emphasis added]:

He’s not Sherlock Holmes, but his address is 221B. He’s not the Janitor on Scrubs, but his father is played by R. Lee Ermey. [...] For ten points, name this misanthropic, crippled, Vicodin-dependent central character of a FOX medical drama.
ANSWER: Gregory House, MD

6 Like Jeopardy!, they are not syntactically questions but still are designed to elicit knowledge-based responses; for consistency, we still call them questions.

In contrast, QA datasets often contain ambiguous and under-specified questions. While this sometimes reflects real world complexities such as actual under-specified or ill-formed search queries (Faruqui and Das, 2018; Kwiatkowski et al., 2019), ignoring this ambiguity is problematic. As a concrete example, Natural Questions (Kwiatkowski et al., 2019) answers “what year did the us hockey team win the Olympics” with 1960 and 1980, ignoring the US women’s team, which won in 1998 and 2018, and further assuming the query is about ice rather than field hockey (also an Olympic event). Natural Questions associates a page about the United States men’s national ice hockey team, arbitrarily removing the ambiguity post hoc. However, this does not resolve the ambiguity, which persists in the original question: information retrieval arbitrarily provides one of many interpretations. True to their name, Natural Questions are often under-specified when users ask a question online.

The problem is neither that such questions exist nor that machine reading QA considers questions given an associated context. The problem is that tasks do not explicitly acknowledge the original ambiguity and gloss over the implicit assumptions in the data. This introduces potential noise and bias (i.e., giving a bonus to systems that make the same assumptions as the dataset) in leaderboard rankings.


At best, these will become part of the measurement error of datasets (no dataset is perfect). At worst, they will recapitulate the biases that went into the creation of the datasets. Then, the community will implicitly equate the biases with correctness: you get high scores if you adopt this set of assumptions. These enter into real-world systems, further perpetuating the bias. Playtesting can reveal these issues (Section 2.1), as implicit assumptions can rob a player of correctly answered questions. If you wanted to answer 2014 to “when did Michigan last win the championship”—when the Michigan State Spartans won the Women’s Cross Country championship—and you cannot because you chose the wrong school, the wrong sport, and the wrong gender, you would complain as a player; researchers instead discover latent assumptions that creep into the data.7

It is worth emphasizing that this is not a purely hypothetical problem. For example, Open Domain Retrieval Question Answering (Lee et al., 2019) deliberately avoids providing a reference context for the question in its framing but, in re-purposing data such as Natural Questions, opaquely relies on it for the gold answers.

3.2 Avoiding superficial evaluations

A related issue is that, in the words of Voorhees and Tice (2000), “there is no such thing as a question with an obvious answer”. As a consequence, trivia question authors delineate acceptable and unacceptable answers.

For example, in writing for the trivia tournament Harvard Fall XI, Robert Chu uses a mental model of an answerer to explicitly delineate the range of acceptable correct answers:

In Newtonian gravity, this quantity satisfies Poisson’s equation. [...] For a dipole, this quantity is given by negative the dipole moment dotted with the electric field. [...] For 10 points, name this form of energy contrasted with kinetic.
ANSWER: potential energy (prompt on energy; accept specific types like electrical potential energy or gravitational potential energy; do not accept or prompt on just “potential”)

7 Where to draw the line is a matter of judgment; computers—which lack common sense—might find questions ambiguous where humans would not.

Likewise, the style guides for writing questions stipulate that you must give the answer type clearly and early on. These mentions specify whether you want a book, a collection, a movement, etc. It also signals the level of specificity requested. For example, a question about a date must state “day and month required” (September 11), “month and year required” (April 1968), or “day, month, and year required” (September 1, 1939). This is true for other answers as well: city and team, party and country, or more generally “two answers required”. Despite these conventions, no pre-defined set of answers is perfect, and every worthwhile trivia competition has a process for adjudicating answers.

In high school and college national competitions and game shows, if low-level staff cannot resolve the issue by throwing out a single question or accepting minor variations (America instead of USA), the low-level staff contacts the tournament director. The tournament director, who has a deeper knowledge of rules and questions, often decides the issue. If not, the protest goes through an adjudication process designed to minimize bias:8 write the summary of the dispute, get all parties to agree to the summary, and then hand the decision off to mutually agreed experts from the tournament’s phone tree. The substance of the disagreement is communicated (without identities), and the experts apply the rules and decide.

Consider what happened when a particularly inept Jeopardy! contestant9 did not answer laparoscope to “Your surgeon could choose to take a look inside you with this type of fiber-optic instrument”. Since the van Doren scandal (Freedman, 1997), every television trivia contestant has an advocate assigned from an auditing company. In this case, the advocate initiated a process that went to a panel of judges who then ruled that endoscope (a more general term) was also correct.

The need for a similar process seems to have been well-recognized in the earliest days of QA system bake-offs such as TREC-QA, and Voorhees (2008) notes that

[d]ifferent QA runs very seldom return exactly the same [answer], and it is quite difficult to determine automatically whether the difference [...] is significant.

In stark contrast to this, QA datasets typically only provide a single string or, if one is lucky, several strings. A correct answer means exactly matching these strings or at least having a high token overlap F1, and failure to agree with the pre-recorded admissible answers will put you at an uncontestable disadvantage on the leaderboard (Section 2.2).

8 https://www.naqt.com/rules/#protest

9 http://www.j-archive.com/showgame.php?game_id=6112
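For reference, the automatic scoring described above amounts to something like the following sketch (in the style of SQuAD-flavored token-overlap F1; a simplified illustration, not the official evaluation script):

```python
import re
from collections import Counter
from typing import List

def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and articles, and split on whitespace,
    roughly as SQuAD-style evaluations do before comparing strings."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def score(prediction: str, golds: List[str]) -> float:
    """Best F1 against any admissible gold string: answers outside this
    list get no credit, no matter how defensible they are."""
    return max(token_f1(prediction, g) for g in golds)

print(score("Samuel Langhorne Clemens", ["Mark Twain"]))  # 0.0: counted wrong
print(score("Mark Twain", ["Mark Twain"]))                # 1.0
```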


To illustrate how current evaluations fall short of meaningful discrimination, we qualitatively analyze two near-SOTA systems on SQuAD V1.1: the original XLNet (Yang et al., 2019) and a subsequent iteration called XLNet-123.10

10 We could not find a paper describing XLNet-123; the submission is by http://tia.today.

Despite XLNet-123’s margin of almost four absolute F1 (94 vs 98) on development data, a manual inspection of a sample of 100 of XLNet-123’s wins indicates that around two-thirds are ‘spurious’: 56% are likely to be considered not only equally good but essentially identical; 7% are cases where the answer set omits a correct alternative; and 5% of cases are ‘bad’ questions.11

11 Examples in Appendix C.

Our goal is not to dwell on the exact proportions, to minimize the achievements of these strong systems, or to minimize the usefulness of quantitative evaluations. We merely want to raise the limitation of blind automation for distinguishing between systems on a leaderboard.

Taking our cue from the trivia community, we present an alternative for MRQA. Blind test sets are created for a specific time; all systems are submitted simultaneously. Then, all questions and answers are revealed. System authors can protest correctness rulings on questions, directly addressing the issues above. After agreement is reached, quantitative metrics are computed for comparison purposes—despite their inherent limitations they at least can be trusted. Adopting this for MRQA would require creating a new, smaller test set every year. However, this would gradually refine the annotations and process.

This suggestion is not novel: Voorhees and Tice (2000) accept automatic evaluations “for experiments internal to an organization where the benefits of a reusable test collection are most significant (and the limitations are likely to be understood)” (our emphasis) but caution that “satisfactory techniques for [automatically] evaluating new runs” have not been found yet. We are not aware of any change on this front—if anything, we seem to have become more insensitive as a community to just how limited our current evaluations are.

3.3 Focus on the bubble

While every question should be perfect, time and resources are limited. Thus, authors and editors of tournaments “focus on the bubble”, where the “bubble” is the set of questions most likely to discriminate between top teams at the tournament. These questions are thoroughly playtested, vetted, and edited. Only after these questions have been perfected will the other questions undergo the same level of polish.

For computers, the same logic applies. Authors should ensure that these discriminative questions are correct, free of ambiguity, and unimpeachable. However, as far as we can tell, the authors of QA datasets do not give any special attention to these questions.

Unlike a human trivia tournament, however—with finite patience of the participants—this does not mean that you should necessarily remove all of the easy or hard questions from your dataset. This could inadvertently lead to systems unable to answer simple questions like “who is buried in Grant’s tomb?” (Dwan, 2000, Chapter 7). Instead, focus more resources on the bubble.

4 Why Quizbowl is the Gold Standard

We now focus our thus far wide-ranging QA discussion to a specific format: Quizbowl, which has many of the desirable properties outlined above. We have no delusion that mainstream QA will universally adopt this format (indeed, a monoculture would be bad). However, given the community’s emphasis on fair evaluation, computer QA can borrow aspects from the gold standard of human QA.

We have shown examples of Quizbowl questions, but we have not explained how the format works; see Rodriguez et al. (2019) for more. You might be scared off by how long the questions are. However, in real Quizbowl trivia tournaments, they are not finished because the questions are interruptible.

Interruptible A moderator reads a question. Once someone knows the answer, they use a signaling device to “buzz in”. If the player who buzzed is right, they get points. Otherwise, they lose points and the question continues for the other team.

Not all trivia games with buzzers have this property, however. For example, take Jeopardy!, the subject of Watson’s tour de force (Ferrucci et al., 2010). While Jeopardy! also uses signaling devices, these only work once the question has been read in its entirety; Ken Jennings, one of the top Jeopardy! players (and also a Quizbowler) explains it on a Planet Money interview (Malone, 2019):


Jennings: The buzzer is not live until Alex finishes reading the question. And if you buzz in before your buzzer goes live, you actually lock yourself out for a fraction of a second. So the big mistake on the show is people who are all adrenalized and are buzzing too quickly, too eagerly.
Malone: OK. To some degree, Jeopardy! is kind of a video game, and a crappy video game where it’s, like, light goes on, press button—that’s it.
Jennings: (Laughter) Yeah.

Jeopardy!’s buzzers are a gimmick to ensure good television; however, Quizbowl buzzers discriminate knowledge (Section 2.3). Similarly, while TriviaQA (Joshi et al., 2017) is written by knowledgeable writers, the questions are not pyramidal.

Pyramidal Recall that effective datasets discriminate the best from the rest—the higher the proportion of effective questions ρ, the better. Quizbowl’s ρ is nearly 1.0 because discrimination happens within a question: after every word, an answerer must decide if they know enough to answer. Quizbowl questions are arranged so that questions are maximally pyramidal: questions begin with hard clues—ones that require deep understanding—and move to more accessible clues that are well known.
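To see how discrimination happens within a question, consider this minimal sketch (our own illustration, not the scoring rules of any actual tournament): two agents are compared by the earliest word position at which each commits to the correct answer, and a question is discriminative exactly when those positions differ.

```python
def first_correct_buzz(guesses, gold):
    """guesses[i] is the agent's best answer after reading i+1 words; return
    the earliest position where it is (and stays) the gold answer, or None
    if the agent never gets it right."""
    for i, guess in enumerate(guesses):
        if guess == gold and all(g == gold for g in guesses[i:]):
            return i
    return None

def compare(agent_a, agent_b, gold):
    """Earlier correct buzz wins; equal buzz positions are a tie."""
    a = first_correct_buzz(agent_a, gold)
    b = first_correct_buzz(agent_b, gold)
    if a == b:
        return "tie (not discriminative)"
    if b is None or (a is not None and a < b):
        return "A wins"
    return "B wins"

gold = "Gregory House"
agent_a = ["Sherlock Holmes"] * 3 + ["Gregory House"] * 5  # buzzes on an early clue
agent_b = ["Sherlock Holmes"] * 6 + ["Gregory House"] * 2  # needs the giveaway
print(compare(agent_a, agent_b, gold))  # -> "A wins"
```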

Well-Edited Quizbowl questions are created in phases. First, the author selects the answer and assembles (pyramidal) clues. A subject editor then removes ambiguity, adjusts acceptable answers, and tweaks clues to optimize discrimination. Finally, a packetizer ensures the overall set is diverse, has uniform difficulty, and is without repeats.

Unnatural Trivia questions are fake: the asker already knows the answer. But they’re no more fake than a course’s final exam, which—like leaderboards—is designed to test knowledge.

Experts know when questions are ambiguous (Section 3.1); while “what play has a character whose father is dead” could be Hamlet, Antigone, or Proof, a good writer’s knowledge avoids the ambiguity. When authors omit these cues, the question is derided as a “hose” (Eltinge, 2013), which robs the tournament of fun (Section 2.1).

One of the benefits of contrived formats is a focus on specific phenomena. Dua et al. (2019) exclude questions an existing MRQA system could answer to focus on challenging quantitative reasoning. One of the trivia experts consulted in Wallace et al. (2019) crafted a question that tripped up neural QA by embedding the phrase “this author opens Crime and Punishment” into a question; the top system confidently answers Fyodor Dostoyevski. However, that phrase was in a longer question “The narrator in Cogwheels by this author opens Crime and Punishment to find it has become The Brothers Karamazov”. Again, this shows the inventiveness and linguistic dexterity of the trivia community.

A counterargument is that real-life questions—e.g., on Yahoo! Questions (Szpektor and Dror, 2013), Quora (Iyer et al., 2017) or web search (Kwiatkowski et al., 2019)—ignore the craft of question writing. Real humans react to unclear questions with confusion or divergent answers, explicitly answering with how they interpreted the original question (“I assume you meant...”).

Given that real-world applications will have to deal with the inherent noise and ambiguity of unclear questions, our systems must cope with it. However, addressing the real world cannot happen by glossing over its complexity.

Complicated Quizbowl is more complex than other datasets. Unlike other datasets where you just need to decide what to answer, in Quizbowl you also need to choose when to answer the question.12 While this improves the dataset’s discrimination, it can hurt popularity because you cannot copy/paste code from other QA tasks. The cumbersome pyramidal structure complicates some questions (e.g., what is log base four of sixty-four).

12 This complex methodology can be an advantage. The underlying mechanisms of systems that can play Quizbowl (e.g., reinforcement learning) share properties with other tasks, such as simultaneous translation (Grissom II et al., 2014; Ma et al., 2019), human incremental processing (Levy et al., 2008; Levy, 2011), and opponent modeling (He et al., 2016).

5 A Call to Action

You may disagree with the superiority of Quizbowl as a QA framework (de gustibus non est disputandum). In this final section, we hope to distill our advice into a call to action regardless of your question format or source. Here are our recommendations if you want to have an effective leaderboard.

Talk to Trivia Nerds You should talk to trivia nerds because they have useful information (not just about the election of 1876). Trivia is not just the accumulation of information but also connecting disparate facts (Jennings, 2006). These skills are exactly those we want computers to develop.

Trivia nerds are writing questions anyway; we can save money and time if we pool resources.13 Computer scientists benefit if the trivia community writes questions that aren’t trivial for computers to solve (e.g., avoiding quotes and named entities). The trivia community benefits from tools that make their job easier: show related questions, link to Wikipedia, or predict where humans will answer.

13 Many question answering datasets benefit from the efforts of the trivia community. Ethically using the data, however, requires acknowledging their contributions and using their input to create datasets (Jo and Gebru, 2020, Consent and Inclusivity criterion).

Likewise, the broader public has unique knowledge and skills. In contrast to low-paid crowdworkers, public platforms for question answering and citizen science (Bowser et al., 2013) are brimming with free expertise if you can engage the relevant communities. For example, the Quora query “Is there a nuclear control room on nuclear aircraft carriers?” is purportedly answered by someone who worked in such a room (Humphries, 2017). As machine learning algorithms improve, the “good enough” crowdsourcing that got us this far may not be enough for continued progress.

Eat Your Own Dog Food As you develop new question answering tasks, you should feel comfortable playing the task as a human. Importantly, this is not just to replicate what crowdworkers are doing (also important) but to remove hidden assumptions, institute fair metrics, and define the task well. For this to feel real, you will need to keep score; have all of your coauthors participate and compare scores.

Again, we emphasize that human and computer skills are not identical, but this is a benefit: humans’ natural aversion to unfairness will help you create a better task, while computers will blindly optimize an objective function (Bostrom, 2003). As you go through the process of playing on your question–answer dataset, you can see where you might have fallen short on the goals we outline in Section 3.

Won’t Somebody Look at the Data? After QA datasets are released, there should also be deeper, more frequent discussion of actual questions within the NLP community. Part of every post-mortem of trivia tournaments is a detailed discussion of the questions, where good questions are praised and bad questions are excoriated. This is not meant to shame the writers but rather to help build and reinforce cultural norms: questions should be well-written, precise, and fulfill the creator’s goals. Just like trivia tournaments, QA datasets resemble a product for sale. Creators want people to invest time and sometimes money (e.g., GPU hours) in using their data and submitting to their leaderboards. It is “good business” to build a reputation for quality questions and discussing individual questions.

Similarly, discussing and comparing the actual predictions made by the competing systems should be part of any competition culture—without it, it is hard to tell what a couple of points on some leaderboard mean. To make this possible, we recommend that leaderboards include an easy way for anyone to download a system’s development predictions for qualitative analyses.
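Downloadable predictions make this kind of qualitative comparison nearly free. A hypothetical sketch, assuming each system publishes a JSON file mapping question ids to answer strings (the file names below are invented):

```python
import json

def disagreements(pred_file_a, pred_file_b, gold_file):
    """Load two systems' development predictions plus gold answers and
    return the questions where the systems disagree: exactly the cases
    worth reading when a leaderboard gap needs a qualitative explanation."""
    preds_a = json.load(open(pred_file_a))
    preds_b = json.load(open(pred_file_b))
    gold = json.load(open(gold_file))
    rows = []
    for qid, answer in gold.items():
        a, b = preds_a.get(qid), preds_b.get(qid)
        if a != b:
            rows.append({"id": qid, "gold": answer, "system_a": a, "system_b": b})
    return rows

# Example (hypothetical file names):
# for row in disagreements("system_a_dev.json", "system_b_dev.json", "dev_gold.json"):
#     print(row)
```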

Make Questions Discriminative We argue that questions should be discriminative (Section 2.3), and while Quizbowl is one solution (Section 4), not everyone is crazy enough to adopt this (beautiful) format. For more traditional QA tasks, you can maximize the usefulness of your dataset by ensuring as many questions as possible are challenging (but not impossible) for today’s QA systems.

But you can use some Quizbowl intuitions to improve discrimination. In visual QA, you can offer increasing resolutions of the image. For other settings, create pyramidality by adding metadata: coreference, disambiguation, or alignment to a knowledge base. In short, consider multiple versions/views of your data that progress from difficult to easy. This not only makes more of your dataset discriminative but also reveals what makes a question answerable.

Embrace Multiple Answers or Specify Specificity As QA moves to more complicated formats and answer candidates, what constitutes a correct answer becomes more complicated. Fully automatic evaluations are valuable for both training and quick-turnaround evaluation. In cases where annotators disagree, the question should explicitly state what level of specificity is required (e.g., September 1, 1939 vs. 1939 or Leninism vs. socialism). Or, if not all questions have a single answer, link answers to a knowledge base with multiple surface forms or explicitly enumerate which answers are acceptable.
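Such an enumeration can be machine-readable. Below is a hypothetical sketch of an answer specification; the accept/prompt/reject fields mirror the Quizbowl answer line quoted in Section 3.2 rather than any existing dataset schema:

```python
from dataclasses import dataclass, field

@dataclass
class AnswerSpec:
    """Explicit adjudication rules for one question: full credit for any
    'accept' surface form, a request for more specificity on 'prompt'
    strings, and no credit for 'reject' strings."""
    accept: set = field(default_factory=set)
    prompt: set = field(default_factory=set)
    reject: set = field(default_factory=set)

    def judge(self, answer: str) -> str:
        answer = answer.strip().lower()
        if answer in self.reject:
            return "wrong"
        if answer in self.accept:
            return "correct"
        if answer in self.prompt:
            return "prompt for more specificity"
        return "wrong"

spec = AnswerSpec(
    accept={"potential energy", "electrical potential energy",
            "gravitational potential energy"},
    prompt={"energy"},
    reject={"potential"},
)
print(spec.judge("gravitational potential energy"))  # correct
print(spec.judge("energy"))                          # prompt for more specificity
print(spec.judge("potential"))                       # wrong
```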

Appreciate Ambiguity If your intended QA application has to handle ambiguous questions, do justice to the ambiguity by making it part of your task—for example, recognize the original ambiguity and resolve it (“did you mean...”) instead of giving credit for happening to ‘fit the data’.

To ensure that our datasets properly “isolate the property that motivated [the dataset] in the first place” (Zaenen, 2006), we need to explicitly appreciate the unavoidable ambiguity instead of silently glossing over it.14

14 Not surprisingly, ‘inherent’ ambiguity is not limited to QA; Pavlick and Kwiatkowski (2019) show natural language inference has ‘inherent disagreements’ between humans and advocate for recovering the full range of accepted inferences.

This is already an active area of research, with conversational QA being a new setting actively explored by several datasets (Reddy et al., 2018; Choi et al., 2018); and other work explicitly focusing on identifying useful clarification questions (Rao and Daumé III), thematically linked questions (Elgohary et al., 2018) or resolving ambiguities that arise from coreference or pragmatic constraints by rewriting underspecified question strings (Elgohary et al., 2019; Min et al., 2020).

Revel in Spectacle However, with more complicated systems and evaluations, a return to the yearly evaluations of TREC QA may be the best option. This improves not only the quality of evaluation (we can have real-time human judging) but also lets the test set reflect the build it/break it cycle (Ruef et al., 2016), as attempted by the 2019 iteration of FEVER (Thorne et al., 2019). Moreover, another lesson the QA community could learn from trivia games is to turn it into a spectacle: exciting games with a telegenic host. This has a benefit to the public, who see how QA systems fail on difficult questions, and to QA researchers, who have a spoonful of fun sugar to inspect their systems’ output and their competitors’.

In between full automation and expensive humans in the loop are automatic metrics that mimic the flexibility of human raters, inspired by machine translation evaluations (Papineni et al., 2002; Specia and Farzindar, 2010) or summarization (Lin, 2004). However, we should not forget that these metrics were introduced as ‘understudies’—good enough when quick evaluations are needed for system building but no substitute for a proper evaluation. In machine translation, Laubli et al. (2020) reveal that crowdworkers cannot spot the errors that neural MT systems make—fortunately, trivia nerds are cheaper than professional translators.

Be Honest in Crowning QA Champions Leaderboards are a ranking over entrants based on a ranking over numbers. This can be problematic for several reasons. The first is that single numbers have some variance; it’s better to communicate estimates with error bars.
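Error bars are cheap to add. A minimal percentile-bootstrap sketch (our own example, with fabricated per-question scores):

```python
import random

def bootstrap_ci(per_question_scores, resamples=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for a leaderboard metric
    (here, the mean per-question score, e.g. exact match or F1)."""
    n = len(per_question_scores)
    means = sorted(
        sum(random.choices(per_question_scores, k=n)) / n
        for _ in range(resamples)
    )
    lo = means[int(alpha / 2 * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return sum(per_question_scores) / n, (lo, hi)

scores = [1.0] * 780 + [0.0] * 220   # a fictitious system: 78% exact match
print(bootstrap_ci(scores))          # mean plus a 95% interval; if two systems'
                                     # intervals overlap, the ranking is not settled
```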

While—particularly for leaderboards—it is tempting to turn everything into a single number, there are often different sub-tasks and systems that deserve recognition. A simple model that requires less training data or runs in under ten milliseconds may be objectively more useful than a bloated, brittle monster of a system that has a slightly higher F1 (Dodge et al., 2019). While you may only rank by a single metric (this is what trivia tournaments do too), you may want to recognize the highest-scoring model that was built by undergrads, took no more than one second per example, was trained only on Wikipedia, etc.

Finally, if you want to make human–computer comparisons, pick the right humans. Paraphrasing a participant of the 2019 MRQA workshop (Fisch et al., 2019), a system better than the average human at brain surgery does not imply superhuman performance in brain surgery. Likewise, beating a distracted crowdworker on QA is not QA’s endgame. If your task is realistic, fun, and challenging, you will find experts to play against your computer. Not only will this give you human baselines worth reporting—they can also tell you how to fix your QA dataset... after all, they’ve been at it longer than you have.

Acknowledgements This work was supported by Google’s Visiting Researcher program. Boyd-Graber is also supported by NSF Grant IIS-1822494. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. Thanks to Christian Buck for creating the NQ playtesting environment that spurred the initial idea for this paper. Thanks to Jon Clark and Michael Collins for the exciting e-mail thread that forced the authors to articulate their positions for the first time. Thanks to Kevin Kwok for permission to use Protobowl screenshot and information. Hearty thanks to all those who read and provided feedback on drafts: Matt Gardner, Roger Craig, Massimiliano Ciaramita, Jon May, Zachary Lipton, and Divyansh Kaushik. And finally, thanks to the trivia community for providing a convivial home for pedants and know-it-alls; may more people listen to you.


References

Luis von Ahn. 2006. Games with a purpose. Computer, 39:92–94.

David Baber. 2015. Television Game Show Hosts: Biographies of 32 Stars. McFarland.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. Institute of Advanced Studies in Systems Research and Cybernetics, 2:12–17.

Anne Bowser, Derek Hansen, Yurong He, Carol Boston, Matthew Reid, Logan Gunnell, and Jennifer Preece. 2013. Using gamification to inspire new citizen science volunteers. In Proceedings of the First International Conference on Gameful Design, Research, and Applications, Gamification '13, pages 18–25, New York, NY, USA. ACM.

Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daume III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing.

Seth Chaiklin. 2003. The Zone of Proximal Development in Vygotsky's Analysis of Learning and Instruction, Learning in Doing: Social, Cognitive and Computational Perspectives, pages 39–64. Cambridge University Press.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of Empirical Methods in Natural Language Processing.

Anthony Cuthbertson. 2018. Robots can now read better than humans, putting millions of jobs at risk. Newsweek.

Paul Diamond. 2009. How To Make 100 Pounds A Night (Or More) As A Pub Quizmaster. DP Quiz.

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Conference of the North American Chapter of the Association for Computational Linguistics.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.

R. Dwan. 2000. As Long as They're Laughing: Groucho Marx and You Bet Your Life. Midnight Marquee Press.

Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? Learning to rewrite questions-in-context. In Empirical Methods in Natural Language Processing.

Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. Dataset and baselines for sequential open-domain question answering. In Empirical Methods in Natural Language Processing.

Stephen Eltinge. 2013. Quizbowl lexicon.

Manaal Faruqui and Dipanjan Das. 2018. Identifying well-formed natural language questions. In Proceedings of Empirical Methods in Natural Language Processing.

Shi Feng, Eric Wallace, Alvin Grissom II, Pedro Rodriguez, Mohit Iyyer, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretation difficult. In Proceedings of Empirical Methods in Natural Language Processing.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3).

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen, editors. 2019. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics, Hong Kong, China.

Flight Standards Service. 2008. Aviation Instructor's Handbook, volume FAA-H-8083-9A. Federal Aviation Administration, Department of Transportation.

Morris Freedman. 1997. The fall of Charlie Van Doren. The Virginia Quarterly Review, 73(1):157–165.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of Empirical Methods in Natural Language Processing.

Alvin Grissom II, He He, Jordan Boyd-Graber, and John Morgan. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of Empirical Methods in Natural Language Processing.


Mansi Gupta, Nitish Kulkarni, Raghuveer Chanda, Anirudha Rayasam, and Zachary C. Lipton. 2019. AmazonQA: A review-based question answering task. In International Joint Conference on Artificial Intelligence.

Thomas M. Haladyna. 2004. Developing and Validating Multiple-choice Test Items. Lawrence Erlbaum Associates.

Robert X. D. Hawkins, Andreas Stuhlmüller, Judith Degen, and Noah D. Goodman. 2015. Why do you ask? Good questions provoke informative answers. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society.

He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference of Machine Learning.

R. Robert Hentzel. 2018. NAQT eligibility overview.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Advances in Neural Information Processing Systems.

Bryan Humphries. 2017. Is there a nuclear control room on nuclear aircraft carriers?

Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. Quora question pairs.

Ken Jennings. 2006. Brainiac: adventures in the curious, competitive, compulsive world of trivia buffs. Villard.

Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the Association for Computational Linguistics.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of Empirical Methods in Natural Language Processing.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Samuel Laubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, and Antonio Toral. 2020. A set of recommendations for assessing human-machine parity in language translation. Journal of Artificial Intelligence Research, 67.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics.

Roger Levy. 2011. Integrating surprisal and uncertain-input models in online sentence comprehension: Formal techniques and empirical results. In Proceedings of the Association for Computational Linguistics.

Roger P. Levy, Florencia Reali, and Thomas L. Griffiths. 2008. Modeling the effects of memory on human online sentence processing with particle filters. In Proceedings of Advances in Neural Information Processing Systems.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out.

Zachary C. Lipton and Jacob Steinhardt. 2019. Troubling trends in machine learning scholarship. Queue, 17(1).

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the Association for Computational Linguistics.

Kenny Malone. 2019. How uncle Jamie broke Jeopardy. Planet Money, (912).

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions.

Adam Najberg. 2018. Alibaba AI model tops humans in reading comprehension.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Association for Computational Linguistics.


Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.

Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. CoRR, abs/1904.04792.

Marc-Antoine Rondeau and T. J. Hazen. 2018. Systematic error analysis of the Stanford question answering dataset. In Proceedings of the Workshop on Machine Reading for Question Answering.

Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Michelle L. Mazurek, and Piotr Mardziel. 2016. Build it, break it, fix it: Contesting secure development. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security.

Dan Russell. 2020. Why control-F is the single most important thing you can teach someone about search. SearchReSearch.

Lucia Specia and Atefeh Farzindar. 2010. Estimating machine translation post-editing effort with HTER. In AMTA 2010 Workshop: Bringing MT to the User: MT Research and the Translation Industry. The 9th Conference of the Association for Machine Translation in the Americas.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of Empirical Methods in Natural Language Processing.

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Association for the Advancement of Artificial Intelligence.

Idan Szpektor and Gideon Dror. 2013. From query to question in one click: Suggesting synthetic questions to searchers. In Proceedings of the World Wide Web Conference.

David Taylor, Colin McNulty, and Jo Meek. 2012. Your starter for ten: 50 years of University Challenge.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal, editors. 2018. Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2019. The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER).

Ellen M. Voorhees. 2003. Evaluating the evaluation: A case study using the TREC 2002 question answering track. In Conference of the North American Chapter of the Association for Computational Linguistics.

Ellen M. Voorhees. 2008. Evaluating Question Answering System Performance, pages 409–430. Springer Netherlands, Dordrecht.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.

Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. Transactions of the Association for Computational Linguistics, 10.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Conference on Computational Natural Language Learning.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the Association for Computational Linguistics.

Xingdi Yuan, Jie Fu, Marc-Alexandre Cote, Yi Tay, Christopher Pal, and Adam Trischler. 2019. Interactive machine comprehension with information seeking agents.

Annie Zaenen. 2006. Mark-up barking up the wrong tree. Computational Linguistics, 32(4):577–580.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Association for Computational Linguistics.


Appendix

Footnote numbers continue from main article.

A An Abridged History of Modern Trivia

In the United States, modern trivia exploded immediately after World War II via countless game shows including College Bowl (Baber, 2015), the precursor to Quizbowl. The craze spread to the United Kingdom in a bootlegged version of Quizbowl called University Challenge (now licensed by ITV) and pub quizzes (Taylor et al., 2012).

The initial explosion, however, was not without controversy. A string of cheating scandals, most notably the Van Doren scandal (Freedman, 1997), the subject of the film Quiz Show, and the 1977 entry of Quizbowl into intercollegiate competition forced trivia to “grow up”. Professional organizations and more “grownup” game shows like Jeopardy! (the “all responses in the form of a question” gimmick grew out of how some game shows gave contestants the answers) helped create formalized structures for trivia.

As the generation that grew up with formalized trivia reached adulthood, they sought to make the outcomes of trivia competitions more rigorous, eschewing the randomness that makes for good television. Organizations like National Academic Quiz Tournaments and the Academic Competition Federation created routes for the best players to help direct how trivia competitions would be run. By 2019, these organizations had institutionalized the best practices of “good trivia” described here.

B Simulating the Test Set Needed

We simulate a head-to-head trivia competition where System A and System B each have an accuracy (the probability of getting a question right) separated by some difference: a_A − a_B ≡ ∆. We then simulate this on a test set of size N, scaled by the effective dataset proportion ρ, via draws from two Binomial distributions with success probabilities a_A and a_B:

R_A \sim \text{Binomial}(\rho N, a_A), \qquad R_B \sim \text{Binomial}(\rho N, a_B) \tag{1}

and compute the minimum number of test set questions (using an experiment size of 5,000) needed to detect the better system 95% of the time (i.e., the minimum N such that R_A > R_B from Equation 1 in 0.95 of the experiments). Our emphasis, however, is ρ: the smaller the percentage of discriminative questions (either because of difficulty or because of annotation error), the larger your test set must be.15
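For concreteness, the following is a minimal sketch of this simulation; the function names, search step, and example accuracies are illustrative assumptions on our part, not details from the paper.

```python
# Minimal sketch of the Appendix B simulation (illustrative values; the
# function names, step size, and example accuracies are our assumptions).
import numpy as np

def detection_rate(n, rho, acc_a, acc_b, n_experiments=5000, seed=0):
    """Fraction of simulated comparisons in which System A (accuracy acc_a)
    scores strictly higher than System B (accuracy acc_b) on a test set of
    size n, scaled by the discriminative proportion rho (Equation 1)."""
    rng = np.random.default_rng(seed)
    effective_n = int(rho * n)
    r_a = rng.binomial(effective_n, acc_a, size=n_experiments)
    r_b = rng.binomial(effective_n, acc_b, size=n_experiments)
    return np.mean(r_a > r_b)

def min_test_set_size(rho, acc_a, acc_b, target=0.95, step=100, max_n=200_000):
    """Smallest N (searched in increments of `step`) at which the better
    system wins at least `target` of the simulated comparisons."""
    for n in range(step, max_n + 1, step):
        if detection_rate(n, rho, acc_a, acc_b) >= target:
            return n
    return None

# Example: a two-point accuracy gap when only 60% of questions discriminate.
print(min_test_set_size(rho=0.6, acc_a=0.72, acc_b=0.70))
```

In this sketch, shrinking ρ while holding the accuracy gap fixed pushes the required N up roughly in proportion to 1/ρ, which is the point the appendix makes.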

C Qualitative Analysis Examples

We provide some concrete examples for the classes into which we classified the XLNet-123 wins over XLNet. We indicate gold answer spans (provided by the human annotators) by underlining (there may be several), the XLNet answer span by bold face, and the XLNet-123 answer span by italics, combining formats for tokens shared between spans as appropriate.

C.1 Insignificant and significant span differences

QUESTION: What type of vote must the Parliament have to either block or suggest changes to the Commission’s proposals?
CONTEXT: The essence is there are three readings, starting with a Commission proposal, where the Parliament must vote by a majority of all MEPs (not just those present) to block or suggest changes

a majority of all MEPs is as good an answer as majority, yet its Exact Match score is 0. The problem is not merely one of picking a soft metric; even its Token-F1 score is merely 0.4, effectively penalizing a system for giving a more complete answer. The limitations of Token-F1 become even clearer in light of the following significant span difference:

QUESTION: What measure of a computational problem broadly defines the inherent difficulty of the solution?
CONTEXT: A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used.

We agree with the automatic evaluation that a system answering significant resources to this question should not receive full credit (and arguably should receive none), as it fails to mention relevant context. Nevertheless, the Token-F1 of this answer is 0.57, i.e., larger than for the insignificant span difference just discussed.
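Both scores can be reproduced with the standard SQuAD-style token F1, which lowercases, strips punctuation and articles, and compares bags of tokens. The sketch below assumes that normalization; the helper names are ours, not an official API.

```python
# Minimal sketch of SQuAD-style token F1 (assumes the usual normalization:
# lowercase, drop punctuation, drop the articles a/an/the, collapse spaces).
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("a majority of all MEPs", "majority"))  # 0.4
print(token_f1("significant resources",
               "significant resources, whatever the algorithm used"))  # ~0.57
```

The asymmetry is visible in the arithmetic: an over-complete answer is punished on precision, while an under-complete one keeps perfect precision and loses only recall.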

C.2 Missing Gold Answers

We also observed 7 (out of 100) cases of missing gold answers. As an example, consider

15 Disclaimer: This should be only one of many considerations in deciding on the size of your test set. Other factors may include balancing for demographic properties, covering linguistic variation, or capturing task-specific phenomena.


QUESTION: What would someone who is civilly disobedient do in court?
CONTEXT: Steven Barkan writes that if defendants plead not guilty, “they must decide whether their primary goal will be to win an acquittal and avoid imprisonment or a fine, or to use the proceedings as a forum to inform the jury and the public of the political circumstances surrounding the case and their reasons for breaking the law via civil disobedience.” [...] In countries such as the United States whose laws guarantee the right to a jury trial but do not excuse lawbreaking for political purposes, some civil disobedients seek jury nullification.

While annotators did mark two distinct spans as gold answers, they ignored jury nullification, which is a fine answer to the question and should be rewarded. Reasonable people can disagree whether this is a missing answer or whether it is excluded by a subtlety in the question’s phrasing. This is precisely the point: relying on pre-collected answer strings without a process for adjudicating disagreements in official comparisons does not do justice to the complexity of question answering.

C.3 Bad Questions

We also observed 5 cases of genuinely bad questions. Consider

QUESTION: What library contains the Selmur Productions catalogue?
CONTEXT: Also part of the library is the aforementioned Selznick library, the Cinerama Productions/Palomar theatrical library and the Selmur Productions catalog that the network acquired some years back

This is an annotation error: the correct answer to the question is not available from the paragraph and would have to be (the American Broadcasting Company’s) Programming Library. While we have to live with annotation errors as part of reality, it is not clear that we ought to accept them for official evaluations; any human taking a closer look at the paragraph, as part of an adjudication process, would concede that the question is problematic.

Other cases of ‘annotation’ error are more subtle, involving meaning-changing typos, for example:

QUESTION: Which French kind [sic] issued this declaration?
CONTEXT: They retained the religious provisions of the Edict of Nantes until the rule of Louis XIV, who progressively increased persecution of them until he issued the Edict of Fontainebleau (1685), which abolished all legal recognition of Protestantism in France

While one could debate whether or not systems ought to be able to do ‘charitable’ reinterpretations of the question text, this is part of the point: cases like these warrant discussion and should not be silently glossed over when ‘computing the score’.