
Ordinal Common-sense Inference

Sheng Zhang, Johns Hopkins University, [email protected]
Rachel Rudinger, Johns Hopkins University, [email protected]
Kevin Duh, Johns Hopkins University, [email protected]
Benjamin Van Durme, Johns Hopkins University, [email protected]

Transactions of the Association for Computational Linguistics, vol. 5, pp. 379–395, 2017. Action Editor: Mark Steedman. Submission batch: 12/2016; Revision batch: 3/2017; Published 11/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses on the subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task. We train a neural sequence-to-sequence model on this dataset, which we use to score and generate possible inferences. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.

1 Introduction

We use words to talk about the world. Therefore, to understand what words mean, we must have a prior explication of how we view the world. – Hobbs (1987)

Researchers in Artificial Intelligence and (Computational) Linguistics have long-cited the requirement of common-sense knowledge in language understanding.1

1 Schank (1975): It has been apparent ... within ... natural language understanding ... that the eventual limit to our solution ... would be our ability to characterize world knowledge.

Sam bought a new clock ⇝ The clock runs
Dave found an axe in his garage ⇝ A car is parked in the garage
Tom was accidentally shot by his teammate in the army ⇝ The teammate dies
Two friends were in a heated game of checkers ⇝ A person shoots the checkers
My friends and I decided to go swimming in the ocean ⇝ The ocean is carbonated

Figure 1: Examples of common-sense inference ranging from very likely, likely, plausible, technically possible, to impossible.

This knowledge is viewed as a key component in filling in the gaps between the telegraphic style of natural language statements. We are able to convey considerable information in a relatively sparse channel, presumably owing to a partially shared model at the start of any discourse.2

Common-sense inference – inferences based on common-sense knowledge – is possibilistic: things everyone more or less would expect to hold in a given context, but without the necessary strength of logical entailment.3 Because natural language corpora exhibit human reporting bias (Gordon and Van Durme, 2013), systems that derive knowledge exclusively from such corpora may be more accurately considered models of language, rather than of the world (Rudinger et al., 2015).

2 McCarthy (1959): a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.

3 Many of the bridging inferences of Clark (1975) make use of common-sense knowledge, such as the following example of "Probable part": I walked into the room. The windows looked out to the bay. To resolve the definite reference the windows, one needs to know that rooms have windows is probable.


Facts such as "A person walking into a room is very likely to be blinking and breathing" are usually unstated in text, so their real-world likelihoods do not align to language model probabilities.4 We would like to have systems capable of reading a sentence that describes a real-world situation and inferring how likely other statements about that situation are to hold true in the real world (e.g., the examples in Fig 1). This capability is subtly but crucially distinct from the ability to predict other sentences reported in the same text, as a language model may be trained to do.

We therefore propose a model of knowledge acquisition based on first deriving possibilistic statements from text. As the relative frequency of these statements suffers the mentioned reporting bias, we then follow up with human annotation of derived examples. Since we initially are uncertain about the real-world likelihood of the derived common-sense knowledge holding in any particular context, we pair it with various grounded contexts and present these to humans for their own assessment. As these examples vary in assessed plausibility, we propose the task of ordinal common-sense inference, which embraces a wider set of natural conclusions arising from language comprehension (see Fig 1).

In what follows, we describe prior efforts in common-sense and textual inference (§2). We then state our position on how ordinal common-sense inference should be defined (§3), and detail our own framework for large-scale extraction and abstraction, along with a crowdsourcing protocol for assessment (§4). This includes a novel neural model for forward generation of textual inference statements. Together these methods are applied to contexts derived from various prior textual inference resources, resulting in the JHU Ordinal Common-sense Inference (JOCI) corpus, a large collection of diverse common-sense inference examples, judged to hold with varying levels of subjective likelihood (§5). We provide baseline results (§6) for prediction on the JOCI corpus.5

4 For further background see discussions by Van Durme (2010), Gordon and Van Durme (2013), Rudinger et al. (2015) and Misra et al. (2016).

5 The JOCI corpus is released freely at: http://decomp.net/.

2 Background

Mining Common Sense Building large collections of common-sense knowledge can be done manually via professionals (Hobbs and Navarretta, 1993), but at considerable cost in terms of time and expense (Miller, 1995; Lenat, 1995; Baker et al., 1998; Friedland et al., 2004). Efforts have pursued volunteers (Singh, 2002; Havasi et al., 2007) and games with a purpose (Chklovski, 2003), but are still left fully reliant on human labor. Many have pursued automating the process, such as in expanding lexical hierarchies (Hearst, 1992; Snow et al., 2006), constructing inference patterns (Lin and Pantel, 2001; Berant et al., 2011), reading reference materials (Richardson et al., 1998; Suchanek et al., 2007), mining search engine query logs (Pasca and Van Durme, 2007), and most relevant here: abstracting from instance-level predications discovered in descriptive texts (Schubert, 2002; Liakata and Pulman, 2002; Clark et al., 2003; Banko and Etzioni, 2007). In this article we are concerned with knowledge mining for purposes of seeding a text generation process (constructing common-sense inference examples).

Common-sense Tasks Many textual inference tasks have been designed to require some degree of common-sense knowledge, e.g., the Winograd Schema Challenge discussed by Levesque et al. (2011). The data for these tasks are either smaller, carefully constructed evaluation sets by professionals, following efforts like the FRACAS test suite (Cooper et al., 1996), or they rely on crowdsourced elicitation (Bowman et al., 2015). Crowdsourcing is scalable, but elicitation protocols can lead to biased responses unlikely to contain a wide range of possible common-sense inferences. Humans can generally agree on the plausibility of a wide range of possible inference pairs, but they are not likely to generate them from an initial prompt.6

The construction of SICK (Sentences Involving Compositional Knowledge) made use of existing paraphrastic sentence pairs (descriptions by different people of the same image), which were modified through a series of rule-based transformations then judged by humans (Marelli et al., 2014). As with SICK, we rely on humans only for judging provided examples, rather than elicitation of text. Unlike SICK, our generation is based on a process targeted specifically at common sense (see §4.1.1).

6 McRae et al. (2005): Features such as <is larger than a tulip> or <moves faster than an infant>, for example; although logically possible, do not occur in [human responses] [...] Although people are capable of verifying that a <dog is larger than a pencil>.

Plausibility Researchers in psycholinguistics have explored a notion of plausibility in human sentence processing, where, for instance, arguments to predicates are intuitively more or less "plausible" as fillers to different thematic roles, as reflected in human reading times. For example, McRae et al. (1998) looked at manipulations such as:

(a) The boss hired by the corporation was perfect for the job.
(b) The applicant hired by the corporation was perfect for the job.

where the plausibility of a boss being the agent – as compared to patient – of the predicate hired might be measured by looking at delays in reading time in the words following the predicate. This measurement is then contrasted with the timing observed in the same positions in (b).7

Rather than measuring according to predictions such as human reading times, here we ask annotators explicitly to judge plausibility on a 5-point ordinal scale (see §3). Further, our effort might be described in this setting as conditional plausibility,8 where plausibility judgments for a given sentence are expected to be dependent on preceding context. Further exploration of conditional plausibility is an interesting avenue of potential future work, perhaps through the measurement of human reading times when using prompts derived from our ordinal common-sense inference examples. Computational modeling of (unconditional) semantic plausibility has been explored by those such as Pado et al. (2009), Erk et al. (2010) and Sayeed et al. (2015).

Textual Entailment A multi-year source of textual inference examples was generated under the Recognizing Textual Entailment (RTE) Challenges, introduced by Dagan et al. (2006):

We say that T entails H if, typically, a human reading T would infer that H is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge.

7 This notion of thematic plausibility is then related to the notion of verb-argument selectional preference (Zernik, 1992; Resnik, 1993; Clark and Weir, 1999), and sortal (in)correctness (Thomason, 1972).

8 Thanks to the anonymous reviewer for this connection.

This definition strayed from the more strict notion of entailment as used by linguistic semanticists, such as those involved with FRACAS. While Giampiccolo et al. (2008) extended binary RTE with an "unknown" category, the entailment community has primarily focused on issues such as "paraphrase" and "monotonicity". An example of this is the Natural Logic implementation of MacCartney and Manning (2007).

Language understanding in context is not only understanding the entailments of a sentence, but also the plausible inferences of the sentence, i.e. the new posterior on the world after reading the sentence. A new sentence in a discourse is almost never entailed by another sentence in the discourse, because such a sentence would add no new information. In order to successfully process a discourse, there needs to be some understanding of what new information can be, possibly or plausibly, added to the discourse. Collecting sentence pairs with ordinal entailment connections is potentially useful for improving and testing these language understanding capabilities that would be needed by algorithms for applications like storytelling.

Garrette et al. (2011) and Beltagy et al. (2017) treated textual entailment as probabilistic logical inference in Markov Logic Networks (Richardson and Domingos, 2006). However, the notion of probability in their entailment task has a subtle distinction from our problem of common-sense inference. The probability of being an entailment given by a probabilistic model trained for a binary classification (being an entailment or not) is not necessarily the same as the likelihood of an inference being true. For example:

T: A person flips a coin.
H: That flip comes up heads.

No human reading T should infer that H is true. A model trained to make ordinal predictions should say: "plausible, with probability 1.0", whereas a model trained to make binary entailed/not-entailed predictions should say: "not entailed, with probability 1.0". The following example exhibits the same property:

T: An animal eats food.
H: A person eats food.

Again, with high confidence, H is plausible; and, with high confidence, it is also not entailed.

Non-entailing Inference Of the various non-"entailment" textual inference tasks, a few are most salient here. Agirre et al. (2012) piloted a Textual Similarity evaluation which has been refined in subsequent years. Systems produce scalar values corresponding to predictions of how similar the meaning is between two provided sentences, e.g., the following pair from SICK was judged very similar (4.2 out of 5), while also being a contradiction: There is no biker jumping in the air and A lone biker is jumping in the air. The ordinal approach we advocate for relies on a graded notion, like textual similarity.

The Choice of Plausible Alternatives (COPA) task (Roemmele et al., 2011) was a reaction to RTE, similarly motivated to probe a system's ability to understand inferences that are not strictly entailed. A single context was provided, with two alternative inferences, and a system had to judge which was more plausible. The COPA dataset was manually elicited, and is not large; we discuss this data further in §5.

The Narrative Cloze task (Chambers and Jurafsky, 2008) requires a system to score candidate inferences as to how likely they are to appear in a document that also included the provided context. Many such inferences are then not strictly entailed by the context. Further, the Cloze task gives the benefit of being able to generate very large numbers of examples automatically by simply occluding parts of existing documents and asking a system to predict what is missing. The LAMBADA dataset (Paperno et al., 2016) is akin to our strategy for automatic generation followed by human filtering, but for Cloze examples. As our concern is with inferences that are often true but never stated in a document, this approach is not viable here. The ROCStories corpus (Mostafazadeh et al., 2016) elicited a more "plausible" collection of documents in order to retain the narrative Cloze in the context of common-sense inference. The ROCStories corpus can be viewed as an extension of the idea behind the COPA corpus, done at a larger scale with crowdsourcing, and with multi-sentence contexts; we consider this dataset in §5.

Alongside the narrative Cloze, Pichotta and Mooney (2016) made use of a 5-point Likert scale (very likely to very unlikely) as a secondary evaluation of various script induction techniques. While they were concerned with measuring their ability to generate very likely inferences, here we are interested in generating a wide swath of inference candidates, including those that are impossible.

3 Ordinal Common-sense Inference

Our goal is a system that can perform speculative, common-sense inference as part of understanding language. Based on the observed shortfalls of prior work, we propose the notion of Ordinal Common-sense Inference (OCI). OCI embraces the notion of Dagan et al. (2006), in that we are concerned with human judgments of epistemic modality.9

As agreed by many linguists, modality in natural language is a continuous category, but speakers are able to map areas of this axis into discrete values (Lyons, 1977; Horn, 1989; de Haan, 1997) – Saurí and Pustejovsky (2009)

According to Horn (1989), there are two scales of epistemic modality which differ in polarity (positive vs. negative polarity): 〈certain, likely, possible〉 and 〈impossible, unlikely, uncertain〉. The Square of Opposition (SO) (Fig 2) illustrates the logical relations holding between values in the two scales. Based on their logical relations, we can make a set of exhaustive epistemic modals: 〈very likely, likely, possible, impossible〉, where 〈very likely, likely, possible〉 lie on a single, positive Horn scale, and impossible, a complementary concept from the corresponding negative Horn scale, completes the set. In this paper, we further replace the value possible by the more fine-grained values (technically possible and plausible). This results in a 5-point scale of likelihood: 〈very likely, likely, plausible, technically possible, impossible〉. The OCI task definition directly embraces subjective likelihood on such an ordinal scale. Humans are presented with a context C and asked whether a provided hypothesis H is very likely, likely, plausible, technically possible, or impossible. Furthermore, an important part of this process is the generation of H by automatic methods, which seeks to avoid the elicitation bias of many prior works.

9 Epistemic modality: the likelihood that (some aspect of) a certain state of affairs is/has been/will be true (or false) in the context of the possible world under consideration.

Figure 2: The Square of Opposition (SO) for epistemic modals (Saurí and Pustejovsky, 2009), relating the positive scale 〈certain, likely, possible〉 and the negative scale 〈impossible, unlikely, uncertain〉 through the A, E, I, O corners and the relations of contraries, subcontraries, and contradictories.10

4 Framework for collecting OCI corpus

We now describe our framework for collecting ordinal common-sense inference examples. It is natural to collect this data in two stages. In the first stage (§4.1), we automatically generate inference candidates given some context. We propose two broad approaches using either general world knowledge or neural methods. In the second stage (§4.2), we annotate these candidates with ordinal labels.

4.1 Generation of Common-sense Inference Candidates

4.1.1 Generation based on World Knowledge

Our motivation for this approach was first introduced by Schubert (2002):

There is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content. This knowledge consists of relationships implied to be possible in the world, or, under certain conditions, implied to be normal or commonplace in the world.

Following Schubert (2002) and Van Durme and Schubert (2008), we define an approach for abstracting over explicit assertions derived from corpora, leading to a large-scale collection of general possibilistic statements. As shown in Fig 3, this approach generates common-sense inference candidates in four steps: (a) extracting propositions with predicate-argument structures from texts, (b) abstracting over propositions to generate templates for concepts, (c) deriving properties of concepts via different strategies, and (d) generating possibilistic hypotheses from contexts.

10 "Contradictories": exhaustive and mutually exclusive conditions. "Contraries": non-exhaustive and mutually exclusive. "Subcontraries": exhaustive and non-mutually exclusive.

Figure 3: Generating common-sense inferences based on general world knowledge. (a) Extraction: the plain text "John borrowed the books from the library." yields the predicate-argument structured proposition "[John] borrowed [the books] from [the library]". (b) Abstraction: the abstracted proposition "[person] borrow [book] from [library]" (person.n.01, book.n.01, library.n.01) yields the propositional templates "___ borrow book from library", "person borrow ___ from library", and "person borrow book from ___". (c) Property derivation using the decision tree: a tree over the co-hyponyms of publication.n.01 (book.n.01, magazine.n.01, collection.n.02) uses templates such as "person buy ___", "person subscribe to ___", and "person borrow ___ from library" as features. (d) Inference generation: from the context "The professor recommended [books] for this course.", the selected property is verbalized into the hypothesis "A person borrows the books from a library."

(a) Extracting propositions: First we extract a large set of propositions with predicate-argument structures from noun phrases and clauses, under which general world presumptions often lie. To achieve this goal, we use PredPatt11 (White et al., 2016; Zhang et al., 2017), which defines a framework of interpretable, language-neutral predicate-argument extraction patterns from Universal Dependencies (de Marneffe et al., 2014). Fig 3 (a) shows an example extraction.

We use the Gigaword corpus (Parker et al., 2011) for extracting propositions as it is a comprehensive text archive. There exists a version containing automatically generated syntactic annotation (Ferraro et al., 2014), which bootstraps large-scale knowledge extraction. We use PyStanfordDependencies12 to convert constituency parses to dependency parses, from which we extract structured propositions.

(b) Abstracting propositions: In this step, we abstract the propositions into a more general form. This involves lemmatization, stripping inessential modifiers and conjuncts, and replacing specific arguments with generic types.13 This method of abstraction often yields general presumptions about the world. To reduce noise from predicate-argument extraction, we only keep 1-place and 2-place predicates after abstraction.

11 https://github.com/hltcoe/PredPatt

We further generalize individual arguments to concepts by attaching semantic-class labels to them. Here we choose WordNet (Miller, 1995) noun synsets14 as the semantic-class set. When selecting the correct sense for an argument, we adopt a fast and relatively accurate method: always taking the first sense, which is usually the most commonly used sense (Suchanek et al., 2007; Pasca, 2008). By doing so, we attach 84 million abstracted propositions with senses, covering 43.7% (35,811/81,861) of WordNet noun senses.
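For illustration, the first-sense heuristic can be reproduced with NLTK's WordNet interface; the sketch below is not the original pipeline code, and the example lemmas are only illustrative.

```python
# Sketch of the first-sense heuristic: attach to each argument the first
# (most common) WordNet noun synset. Requires NLTK with the WordNet data
# installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def first_noun_sense(lemma):
    synsets = wn.synsets(lemma, pos=wn.NOUN)
    return synsets[0] if synsets else None  # senses are ordered by frequency

for arg in ["person", "book", "library"]:
    print(arg, "->", first_noun_sense(arg))
# e.g. person -> Synset('person.n.01'), book -> Synset('book.n.01'), ...
```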

Each of these WordNet senses, then, is associated with a set of abstracted propositions. The abstracted propositions are turned into templates by replacing the sense's corresponding argument with a placeholder, similar to Van Durme et al. (2009) (see Fig 3 (b)). We remove any template associated with a sense if it occurs less than two times for that sense, leaving 38 million unique templates.

(c) Deriving properties via WordNet: At this step, we want to associate with each WordNet sense a set of possible properties. We employ three strategies.

12 https://pypi.python.org/pypi/PyStanfordDependencies

13 Using English glosses of the logical representations, abstraction of "a long, dark corridor" would yield "corridor" for example; "a small office at the end of a long dark corridor" would yield "office"; and "Mrs. MacReady" would yield "person". See Schubert (2002) for detail.

14 In order to avoid too general senses, we set cut points at the depth of 4 (Pantel et al., 2007) to truncate the hierarchy and consider all 81,861 senses below these points.

The first strategy is to use a decision tree to pick out highly discriminative properties for each WordNet sense. Specifically, for each set of co-hyponyms,15 we train a decision tree using the associated templates as features. For example, in Fig 3 (c), we train a decision tree over the co-hyponyms of publication.n.01. Then the template "person subscribe to ___" would be selected as a property of magazine.n.01, and the template "person borrow ___ from library" for book.n.01. The second strategy selects the most frequent templates associated with each sense as properties of that sense. The third strategy uses WordNet ISA relations to derive new properties of senses. For the sense book.n.01 and its hypernym publication.n.01, we generate a property "___ be publication".

(d) Generating hypotheses: As shown in Fig 3 (d), given a discourse context (Tanenhaus and Seidenberg, 1980), we first extract an argument of the context, then select the derived properties for the argument. Since we don't assume any specific sense for the argument, these properties could come from any of its candidate senses. We generate hypotheses by replacing the placeholder in the selected properties with the argument, and verbalizing the properties.16
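The decision-tree strategy in step (c) can be illustrated with a toy scikit-learn sketch; the senses, templates, and counts below are invented for illustration, and the real system additionally follows the tree paths to assign each selected template to a particular sense.

```python
# Toy sketch of decision-tree property derivation: templates that the tree
# uses as split points are treated as discriminative properties.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

cohyponyms = {
    "book.n.01":       {"person borrow ___ from library": 1, "person buy ___": 1},
    "magazine.n.01":   {"person subscribe to ___": 1, "person buy ___": 1},
    "collection.n.02": {"person buy ___": 1},
}

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(list(cohyponyms.values()))
y = list(cohyponyms.keys())
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# internal nodes store the index of the splitting feature; leaves are marked -2
split_features = {i for i in tree.tree_.feature if i >= 0}
properties = [vec.feature_names_[i] for i in split_features]
print(properties)  # e.g. ['person borrow ___ from library', 'person subscribe to ___']
```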

4.1.2 Generation via Neural Methods

In addition to the knowledge-based methods described above, we also adapt a neural sequence-to-sequence model (Vinyals et al., 2015; Bahdanau et al., 2014) to generate inference candidates given contexts. The model is trained on sentence pairs labeled "entailment" from the SNLI corpus (Bowman et al., 2015) (train). Here, the SNLI "premise" is the input (context C), and the SNLI "hypothesis" is the output (hypothesis H).

We employ two different strategies for forward generation of inference candidates given any context. The sentence-prompt strategy uses the entire sentence in the context as an input, and generates output using greedy decoding. The word-prompt strategy differs by using only a single word from the context as input. This word is chosen in the same fashion as step (d) in the generation based on world knowledge, i.e. an argument of the context. The second approach is motivated by our hypothesis that providing only a single-word context will force the model to generate a hypothesis that generalizes over the many contexts in which that word was seen, resulting in more common-sense-like hypotheses, as in Fig 4. We later present the full context and decoded hypotheses to crowdsourced annotation.

15 Senses sharing a hypernym with each other are called co-hyponyms (e.g., book.n.01, magazine.n.01 and collection.n.02 are co-hyponyms of publication.n.01).

16 We use the pattern.en module (http://www.clips.ua.ac.be/pages/pattern-en) for verbalization, which includes determining plurality of the argument, adding proper articles, and conjugating verbs.

dustpan ⇝ a person is cleaning.
a boy in blue and white shorts is sweeping with a broom and dustpan. ⇝ a young man is holding a broom.

Figure 4: Examples of sequence-to-sequence hypothesis generation from single-word and full-sentence inputs.

Neural Sequence-to-Sequence Model Neural sequence-to-sequence models learn to map variable-length input sequences to variable-length output sequences, as a conditional probability of output given input. For our purposes, we want to learn the conditional probability of an hypothesis sentence, H, given a context sentence, C, i.e., P(H|C).

The sequence-to-sequence architecture consists of two components: an encoder and a decoder. The encoder is a recurrent neural network (RNN) iterating over input tokens (i.e., words in C), and the decoder is another RNN iterating over output tokens (words in H). The final state of the encoder, h_C, is passed to the decoder as its initial state. We use a three-layer stacked LSTM (state size 512) for both the encoder and decoder RNN cells, with independent parameters for each. We use the LSTM formulation of Hochreiter and Schmidhuber (1997) as summarized in Vinyals et al. (2015).

The network computes P(H|C):

\[
P(H \mid C) = \prod_{t=1}^{\mathrm{len}(H)} p(w_t \mid w_{<t}, C) \tag{1}
\]

where w_t are the words in H. At each time step t, the successive conditional probability is computed from the LSTM's current hidden state:

\[
p(w_t \mid w_{<t}, C) \propto \exp(v_{w_t} \cdot h_t) \tag{2}
\]

where v_{w_t} is the embedding of word w_t from its corresponding row in the output vocabulary matrix, V (a learnable parameter of the network), and h_t is the hidden state of the decoder RNN at time t. In our implementation, we set the vocabulary to be all words that appear in the training data at least twice, resulting in a vocabulary of size 24,322.
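A small sketch of this vocabulary construction, assuming simple whitespace tokenization:

```python
# Keep every word that appears at least twice in the training sentences.
from collections import Counter

def build_vocab(sentences, min_count=2):
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w for w, c in counts.items() if c >= min_count}

vocab = build_vocab(["a person flips a coin", "a person eats food", "the coin lands"])
# {'a', 'person', 'coin'}
```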

This model also makes use of an attention mechanism.17 An attention vector, attn_t, is concatenated with the LSTM hidden state at time t to form the hidden state h_t from which output probabilities are computed (Eqn. 2). This attention vector is a weighted average of the hidden states of the encoder, h_i for 1 ≤ i ≤ len(C):

\[
u_i^t = v^{\top} \tanh(W_1 h_i + W_2 h_t), \qquad
a_i^t = \operatorname{softmax}(u_i^t), \qquad
\mathrm{attn}_t = \sum_{i=1}^{\mathrm{len}(C)} a_i^t h_i \tag{3}
\]

where vector v and matrices W_1, W_2 are parameters. The network is trained via backpropagation on the cross-entropy loss of the observed sequences in training. A sampled softmax is used to compute the loss during training, while a full softmax is used after training to score unseen (C, H) pairs, or to generate an H given a C. Generation is performed via beam search with a beam size of 1; the highest-probability word is decoded at each time step and fed as input to the decoder at the next time step until an end-of-sequence token is decoded.

4.2 Ordinal Label Annotation

In this stage, we turn to human efforts to annotate common-sense inference candidates with ordinal labels. The annotator is given a context, and then is asked to assess the likelihood of the hypotheses being true. These context-hypothesis pairs are annotated with one of the five labels: very likely, likely, plausible, technically possible, and impossible, corresponding to the ordinal values of {5, 4, 3, 2, 1} respectively.

17 See Vinyals et al. (2015) for full details.


In the case that the hypotheses in the inference candidates do not make sense, or have grammatical errors, judges can provide an additional label, NA, so that we can filter these candidates in post-processing. The combination of generation of common-sense inference candidates with human filtering seeks to avoid the problem of elicitation bias.

5 JOCI Corpus

We now describe in depth how we created the JHU Ordinal Common-sense Inference (JOCI) corpus. The main part of the corpus consists of contexts chosen from SNLI (Bowman et al., 2015) and ROCStories (Mostafazadeh et al., 2016), paired with hypotheses generated via methods described in §4.1. These pairs are then annotated with ordinal labels using crowdsourcing (§4.2). We also include context-hypothesis pairs directly taken from SNLI and other corpora (e.g., as premise-hypothesis pairs), and re-annotate them with ordinal labels.

5.1 Data sources for Context-Hypothesis Pairs

In order to compare with existing inference corpora, we choose contexts from two resources: (1) the first sentence in the sentence pairs of the SNLI corpus, which are captions from the Flickr30k corpus (Young et al., 2014), and (2) the first sentence in the stories of the ROCStories corpus.

We then collect candidates of automatically generated common-sense inferences (AGCI) against these contexts. Specifically, in the SNLI train set, there are over 150K different first sentences, involving 7,414 different arguments according to predicate-argument extraction. We randomly choose 4,600 arguments. For each argument, we sample one first sentence that has the argument and collect candidates of AGCI against this as context. We also do the same generation for the SNLI development set and test set. We also collect candidates of AGCI against randomly sampled first sentences in the ROCStories corpus. Collectively, these pairs and their ordinal labels (to be described in §5.2) make up the main part of the JOCI corpus. The statistics of this subset are shown in Table 1 (first five rows).

For comprehensiveness, we also produced ordinal labels on (C,H) pairs directly drawn from existing corpora. For SNLI, we randomly select 1000 contexts (premises) from the SNLI train set. Then, the corresponding hypothesis is one of the entailment, neutral, or contradiction hypotheses taken from SNLI. For ROCStories, we defined C as the first sentence of the story, and H as the second or third sentence. For COPA, (C,H) corresponds to premise-effect. The statistics are shown in the bottom rows of Table 1.

Subset Name                   | # pairs | Context Source | Hypothesis Source
AGCI against SNLI/ROCStories  | 22,086  | SNLI-train     | AGCI-WK
                              | 2,456   | SNLI-dev       | AGCI-WK
                              | 2,362   | SNLI-test      | AGCI-WK
                              | 5,002   | ROCStories     | AGCI-WK
                              | 1,211   | SNLI-train     | AGCI-NN
SNLI                          | 993     | SNLI-train     | SNLI-entailment
                              | 988     | SNLI-train     | SNLI-neutral
                              | 995     | SNLI-train     | SNLI-contradiction
ROCStories                    | 1,000   | ROCStories-1st | ROCStories-2nd
                              | 1,000   | ROCStories-1st | ROCStories-3rd
COPA                          | 1,000   | COPA-premise   | COPA-effect
Total                         | 39,093  | -              | -

Table 1: JOCI corpus statistics, where each subset consists of different sources for context-and-hypothesis pairs, each annotated with common-sense ordinal labels. AGCI-WK represents candidates generated based on world knowledge. AGCI-NN represents candidates generated via neural methods.

5.2 Crowdsourced Ordinal Label Annotation

We use Amazon Mechanical Turk to annotate the hypotheses with ordinal labels. In each HIT (Human Intelligence Task), a worker is presented with one context and one or two hypotheses, as shown in Fig 5. First, the annotator sees an "Initial Sentence" (context), e.g. "John's goal was to learn how to draw well.", and is then asked about the plausibility of the hypothesis, e.g. "A person accomplishes the goal". In particular, we ask the annotator how plausible it is that the hypothesis is true during or shortly after the context, because without this constraint, most sentences are technically plausible in some imaginary world.

If the hypothesis does not make sense18, the workers can check the box under the question and skip the ordinal annotation. In the annotation, about 25% of hypotheses are marked as not making sense, and are removed from our data.

18 "Not making sense" means that the inference is an incomplete sentence or grammatically wrong.

M. | Labels  | Context | Hypothesis
5  | [5,5,5] | John was excited to go to the fair. | The fair opens.
4  | [4,4,3] | Today my water heater broke. | A person looks for a heater.
3  | [3,3,4] | John's goal was to learn how to draw well. | A person accomplishes the goal.
2  | [2,2,2] | Kelly was playing a soccer match for her University. | The University is dismantled.
1  | [1,1,1] | A brown-haired lady dressed all in blue denim sits in a group of pigeons. | People are made of the denim.
5  | [5,5,4] | Two females are playing rugby on a field, one with a blue uniform and one with a white uniform. | Two females are play sports outside.
4  | [4,4,3] | A group of people have an outside cookout. | People are having conversations.
3  | [3,3,3] | Two dogs fighting, one is black, the other beige. | The dogs are playing.
2  | [2,2,3] | A bare headed man wearing a dark blue cassock, sandals, and dark blue socks mounts the stone steps leading into a weathered old building. | A man is in the middle of home building.
1  | [1,1,1] | A skydiver hangs from the undercarriage of an airplane or some sort of air gliding device. | A camera is using an object.

Table 2: Examples of context-and-hypothesis pairs with ordinal judgments and Median value. (The upper 5 rows are samples from AGCI-WK. The lower 5 rows are samples from AGCI-NN.)

Figure 5: The annotation interface, with a drop-down list providing ordinal labels to select. For the Initial Sentence "John's goal was to learn how to draw well", each statement (e.g. "A person accomplishes the goal.", "The goal is a content.") is judged as to how plausible it is to be true during or shortly after the context of the initial sentence, with a checkbox "This statement does not make sense." for ill-formed statements.

With the sampled contexts and the auto-generated hypotheses, we prepare 50K common-sense inference examples for crowdsourced annotation in bulk. In order to guarantee the quality of annotation, we have each example annotated by three workers. We take the median of the three as the gold label. Table 3 shows the statistics of the crowdsourced efforts.

# examples                    | 50,832
# participated workers        | 150
average cost per example      | 1.99¢
average work time per example | 20.71s

Table 3: Statistics of the crowdsourced efforts.

To make sure non-expert workers have a correct understanding of our task, before launching the later tasks in bulk, we run two pilots to create a pool of qualified workers. In the first pilot, we publish 100 examples. Each example is annotated by five workers. From this pilot, we collect a set of "good" examples which have a 100% annotation agreement among workers. The ordinal labels chosen by the workers are regarded as the gold labels. In the second pilot, we randomly select two "good" (high-agreement) examples for each ordinal label and publish an HIT with these examples. To measure the workers' agreement, we calculate the average of quadratic weighted Cohen's κ scores between the workers' annotation. By setting a threshold of the average of κ scores to 0.7, we are able to create a pool that has over 150 qualified workers.
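The agreement measure used for qualification, the average of pairwise quadratic-weighted Cohen's κ scores, can be computed with scikit-learn as in the following sketch; the worker labels shown are invented examples.

```python
# Average pairwise quadratic-weighted Cohen's kappa over three workers'
# ordinal annotations of the same examples (toy labels).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

worker_labels = [
    [5, 4, 3, 1, 2, 5],   # worker 1
    [5, 3, 3, 1, 2, 4],   # worker 2
    [4, 4, 3, 2, 1, 5],   # worker 3
]

kappas = [cohen_kappa_score(a, b, weights="quadratic")
          for a, b in combinations(worker_labels, 2)]
avg_kappa = sum(kappas) / len(kappas)
print(round(avg_kappa, 2))
```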

5.3 Corpus Characteristics

We want a corpus with a reliable inter-annotator agreement. Additionally, in order to evaluate or train a common-sense inference system, we ideally need a corpus that provides as many inference examples as possible for every ordinal likelihood value. In this section, we investigate the characteristics of the JOCI corpus. We also compare JOCI with related resources under our annotation protocol.

Quality: We measure the quality of each pair by calculating Cohen's κ of workers' annotations. The average κ of the JOCI corpus is 0.54. Fig 7 shows the growth of the size of JOCI as we decrease the threshold of the averaged κ to filter pairs. Even if we place a relatively strict threshold (>0.6), we still get a large subset of JOCI with over 20K pairs. Table 2 contains pairs randomly sampled from this subset, qualitatively confirming we can generate and collect annotations of pairs at each ordinal category.


Figure 6: Comparison of normalized ordinal-label distributions between JOCI and other corpora, over the five categories impossible, technically possible, plausible, likely, and very likely. (a) JOCI vs. SNLI (entailment, neutral, contradiction); (b) JOCI vs. ROCStories (2nd and 3rd sentences); (c) JOCI vs. COPA (COPA-0 and COPA-1).


Figure 7: Data growth along averaged κ scores: the number of retained pairs, broken down by ordinal label, as the threshold on the averaged κ is lowered from 1.0 to 0.0.

Label Distribution: We believe datasets with wide support of label distribution are important in training and evaluating systems to recognize ordinal scale inferences. Fig 6a shows the normalized label distribution of JOCI vs. SNLI. As desired, JOCI covers a wide range of ordinal likelihoods, with many samples in each ordinal scale. Note also how traditional RTE labels are related to ordinal labels, although many inferences in SNLI require no common-sense knowledge (e.g. paraphrases). As expected, entailments are mostly considered very likely; neutral inferences mostly plausible; and contradictions likely to be either impossible or technically possible.

Fig 6b shows the normalized distributions of JOCI and ROCStories. Compared with ROCStories, JOCI still covers a wider range of ordinal likelihood. In ROCStories we observe that, while 2nd sentences are in general more likely to be true than 3rd, a large proportion of both 2nd and 3rd sentences are plausible, as compared to likely or very likely. This matches intuition: pragmatics dictates that subsequent sentences in a standard narrative carry new information.19 That our protocol picks this up is an encouraging sign for our ordinal protocol, as well as suggestive that the makeup of the elicited ROCStories collection is indeed "story like."

For the COPA dataset, we only make use of the pairs in which the alternatives are plausible effects (rather than causes) of the premise, as our protocol more easily accommodates these pairs.20 Annotating this section of COPA with ordinal labels provides an enlightening and validating view of the dataset. Fig 6c shows the normalized distribution of COPA next to that of JOCI (COPA-1 alternatives are marked as most plausible; COPA-0 are not). True to its name, the majority of COPA alternatives are labeled as either plausible or likely; almost none are impossible. This is consistent with the idea that the COPA task is to determine which of two possible options is the more plausible. Fig 8 shows the joint distribution of ordinal labels on (COPA-0, COPA-1) pairs. As expected, the densest areas of the heatmap lie above the diagonal, indicating that in almost every pair, COPA-1 received a higher likelihood judgement than COPA-0.

Automatic Generation Comparisons: We compare the label distributions of different methods for automatic generation of common-sense inference (AGCI) in Fig 9. Among AGCI-WK (generation based on world knowledge) methods, the ISA strategy yields a bimodal distribution, with the majority of inferences labeled impossible or very likely.

19 If subsequent sentences in a story were always very likely, then those would be boring tales; the reader could infer the conclusion based on the introduction. While at the same time if most subsequent sentences were only technically possible, the reader would give up in confusion.

20 Specifically, we treat premises as contexts and effect alternatives as possible hypotheses.


Figure 8: COPA heatmap: the joint distribution of ordinal labels on (COPA-0, COPA-1) pairs over the five categories impossible, technically possible, plausible, likely, and very likely.

This is likely because most copular statements generated with the ISA strategy will either be categorically true or false. In contrast, the decision tree and frequency based strategies generate many more hypotheses with intermediate ordinal labels. This suggests the propositional templates (learned from text) capture many "possibilistic" hypotheses, which is our aim.

The two AGCI-NN (generation via neural methods) strategies show interesting differences in label distribution as well. Sequence-to-sequence decodings with full-sentence prompts lead to more very likely labels than single-word prompts. The reason may be that the model behaves more similarly to SNLI entailments when it has access to all the information in the context. When combined, the five AGCI strategies (three AGCI-WK and two AGCI-NN) provide reasonable coverage over all five categories, as can be seen in Fig 6.

6 Predicting Ordinal Judgments

We want to be able to predict ordinal judgments of the kind presented in this corpus. Our goal in this section is to establish baseline results and explore what kinds of features are useful for predicting ordinal common-sense inference. To do so, we train and test a logistic ordinal regression model gθ(φ(C,H)), which outputs ordinal labels using features φ defined on context-inference pairs. Here, gθ(·) is a regression model with θ as trained parameters; we train using the margin-based method of (Rennie and Srebro, 2005), implemented in (Pedregosa-Izquierdo, 2015),21 with the following features:

21 LogisticSE: http://github.com/fabianp/mord

Figure 9: Label distributions of AGCI over the five ordinal categories. (a) Distribution of AGCI-WK, broken down by the decision tree, frequency based, and ISA based strategies; (b) Distribution of AGCI-NN, broken down by word prompts and sentence prompts.

Bag of words features (BOW): We compute (1) "BOW overlap" (size of word overlap in C and H), and (2) BOW overlap divided by the length of H.

Similarity features (SIM): Using Google's word2vec vectors trained on 100 billion tokens of GoogleNews,22 we (1) sum the vectors in both the context and hypothesis and compute the cosine-similarity of the resulting two vectors ("similarity of average"), and (2) compute the cosine-similarity of all word pairs across the context and inference, then average those similarities ("average of similarity").

Seq2seq score features (S2S): We compute the log probability log P(H|C) under the sequence-to-sequence model described in §4.1.2. There are five variants: (1) Seq2seq trained on SNLI "entailment" pairs only, (2) "neutral" pairs only, (3) "contradiction" pairs only, (4) "neutral" and "contradiction" pairs, and (5) SNLI pairs (any label) with the context (premise) replaced by an empty string.

Seq2seq binary features (S2S-BIN): Binary indicator features for each of the five seq2seq model variants, indicating whether that model achieved the lowest score on the context-hypothesis pair.

22 The GoogleNews embeddings are available at: https://code.google.com/archive/p/word2vec/


Length features (LEN): This set comprises three features: the length of the context (in tokens), the difference in length between the context and hypothesis, and a binary feature indicating if the hypothesis is longer than the context.
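To make the baseline concrete, the following is a hedged sketch of a reduced feature extractor (only the BOW and SIM features) and a LogisticSE fit with the mord package cited in footnote 21; the word2vec path, whitespace tokenization, and default hyperparameters are assumptions rather than the exact setup.

```python
# Sketch of a reduced feature set (BOW + SIM) and an ordinal regression fit.
import numpy as np
from gensim.models import KeyedVectors
import mord

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)   # placeholder path

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def features(context, hypothesis):
    # assumes at least one in-vocabulary token on each side
    c_toks, h_toks = context.lower().split(), hypothesis.lower().split()
    overlap = len(set(c_toks) & set(h_toks))
    c_vecs = [w2v[w] for w in c_toks if w in w2v]
    h_vecs = [w2v[w] for w in h_toks if w in w2v]
    sim_of_avg = cos(np.mean(c_vecs, axis=0), np.mean(h_vecs, axis=0))
    avg_of_sim = float(np.mean([cos(a, b) for a in c_vecs for b in h_vecs]))
    return [overlap, overlap / max(len(h_toks), 1), sim_of_avg, avg_of_sim]

def fit_ordinal_model(pairs, labels):
    # pairs: list of (context, hypothesis) strings; labels: ordinal values in {1..5}
    X = np.array([features(c, h) for c, h in pairs])
    model = mord.LogisticSE()        # assumed constructor with default regularization
    return model.fit(X, np.array(labels))
```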

6.1 Analysis

We train and test our regression model on two subsets of the JOCI corpus, which, for brevity, we call "A" and "B." "A" consists of 2,976 sentence pairs (i.e., context-hypothesis pairs) from SNLI-train annotated with ordinal labels. This corresponds to the three rows labeled SNLI in Table 1 (993 + 988 + 995 = 2,976 pairs), and can be viewed as a textual entailment dataset re-labeled with ordinal judgments. "B" consists of 6,375 context-inference pairs, in which the contexts are the same 2,976 SNLI-train premises as "A", and the hypotheses are generated based on world knowledge (§4.1.1); these pairs are also annotated with ordinal labels. This corresponds to a subset of the row labeled AGCI in Table 1. A key difference between "A" and "B" is that the hypotheses in "A" are human-elicited, while those in "B" are auto-generated; we are interested in seeing whether this affects the task's difficulty.23

Model             | A-train | A-test | B-train | B-test
Regression: gθ(·) | 2.05    | 1.96   | 2.48    | 2.74
Most Frequent     | 5.70    | 5.56   | 6.55    | 7.00
Freq. Sampling    | 4.62    | 4.29   | 5.61    | 5.54
Rounded Average   | 2.46    | 2.39   | 2.79    | 2.89
One-vs-All        | 3.74    | 3.80   | 5.14    | 5.71

Table 4: Mean squared error.

Model             | A-train | A-test | B-train | B-test
Regression: gθ(·) | .39*    | .40*   | .32*    | .27*
Most Frequent     | .00*    | .00*   | .00*    | .00*
Freq. Sampling    | .03     | .10    | .01     | .01
Rounded Average   | .00*    | .00*   | .00*    | .00*
One-vs-All        | .31*    | .30*   | .28*    | .24*

Table 5: Spearman's ρ. (*p-value < .01)

Tables 4 and 5 show each model's performance (mean squared error and Spearman's ρ, respectively) in predicting ordinal labels.24 We compare our ordinal regression model gθ(·) with these baselines:

23 Details of the data split are reported in the dataset release.

24 MSE and Spearman's ρ are both commonly used evaluations in ordinal prediction tasks (Baccianella et al., 2009; Bennett and Lanning, 2007; Gaudette and Japkowicz, 2009; Agresti and Kateri, 2011; Popescu and Dinu, 2009; Liu et al., 2015; Gella et al., 2013).

Most Frequent: Select the ordinal class appearing most often in train.

Frequency Sampling: Select an ordinal label according to their distribution in train.

Rounded Average: Average over all labels from train rounded to nearest ordinal.

One-vs-All: Train one SVM classifier per ordinal class and select the class label with the largest corresponding margin. We train this model with the same set of features as the ordinal regression model.
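The two evaluation measures reported in Tables 4 and 5 can be computed as in the following sketch, using toy gold and predicted ordinal labels:

```python
# Mean squared error and Spearman's rho between gold and predicted ordinals.
import numpy as np
from scipy.stats import spearmanr

gold = np.array([5, 4, 3, 2, 1, 4, 3])
pred = np.array([4, 4, 3, 3, 1, 5, 2])

mse = float(np.mean((gold - pred) ** 2))
rho, p_value = spearmanr(gold, pred)
print(f"MSE={mse:.2f}  Spearman's rho={rho:.2f} (p={p_value:.3f})")
```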

Overall, the regression model achieves the lowest MSE and highest ρ, implying that this dataset is learnable and tractable. Naturally, we would desire a model that achieves MSE under 1.0, and we hope that the release of our dataset will encourage more concerted effort in this common-sense inference task. Importantly, note that performance on A-test is better than on B-test. We believe "B" is a more challenging dataset because auto-generation of hypotheses leads to wider variety than elicitation.

Feature Set         | MSE (A) | MSE (B) | Spear. ρ (A) | Spear. ρ (B)
ALL                 | 1.96    | 2.74    | .40*         | .27*
ALL – {SIM}         | 2.10    | 2.75    | .34*         | .25*
ALL – {BOW}         | 2.02    | 2.77    | .37*         | .25*
ALL – {SIM,BOW}     | 2.31    | 2.79    | .16*         | .20*
ALL – {S2S}         | 2.00    | 2.85    | .38*         | .22*
ALL – {S2S-BIN}     | 1.97    | 2.76    | .40*         | .26*
ALL – {S2S,S2S-BIN} | 2.06    | 2.87    | .35*         | .21*
ALL – {LEN}         | 2.01    | 2.77    | .39*         | .25*
∅ + {SIM}           | 2.06    | 3.04    | .35*         | .10
∅ + {BOW}           | 2.10    | 2.89    | .34*         | .12*
∅ + {S2S}           | 2.33    | 2.80    | .14          | .20*
∅ + {S2S-BIN}       | 2.39    | 2.89    | .00          | .00
∅ + {LEN}           | 2.39    | 2.89    | .00          | .05

Table 6: Ablation results for ordinal regression model on A-test and B-test. (*p-value < .01 for ρ)

We also run a feature ablation test. Table 6 shows that the most useful features differ for A-test and B-test. On A-test, where the inferences are elicited from humans, removal of similarity- and bow-based features together results in the largest performance drop. On B-test, by contrast, removing similarity and bow features results in a comparable performance drop to removing seq2seq features. These observations point to statistical differences between human-elicited and auto-generated hypotheses, a motivating point of the JOCI corpus.

7 Conclusions and Future Work

In motivating the need for automatically building collections of common-sense knowledge, Clark et al. (2003) wrote:

"China launched a meteorological satellite into orbit Wednesday." suggests to a human reader that (among other things) there was a rocket launch; China probably owns the satellite; the satellite is for monitoring weather; the orbit is around Earth; etc.

The use of "etc" summarizes an infinite number of other statements that a human reader would find to be very likely, likely, plausible, technically possible, or impossible, given the provided context.

Preferably we could build systems that would automatically learn common-sense exclusively from available corpora; extracting not just statements about what is possible, but also the associated probabilities of how likely certain things are to obtain in any given context. We are unaware of existing work that has demonstrated this to be feasible.

We have thus described a multi-stage approach to common-sense textual inference: we first extract large numbers of possible statements from a corpus, and use those statements to generate contextually grounded context-hypothesis pairs. These are presented to humans for direct assessment of subjective likelihood, rather than relying on corpus data alone. As the data is automatically generated, we seek to bypass issues in human elicitation bias. Further, since subjective likelihood judgments are not difficult for humans, our crowdsourcing technique is both inexpensive and scalable.

Future work will extend our techniques for forward inference generation, further scale up the annotation of additional examples, and explore the use of larger, more complex contexts. The resulting JOCI corpus will be used to improve algorithms for natural language inference tasks such as storytelling and story understanding.

Acknowledgments

Thank you to action editor Mark Steedman and the anonymous reviewers for their feedback, as well as colleagues including Lenhart Schubert, Kyle Rawlins, Aaron White, and Keisuke Sakaguchi. This work was supported in part by DARPA LORELEI, the National Science Foundation Graduate Research Fellowship and the JHU Human Language Technology Center of Excellence (HLTCOE).

References

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.

Alan Agresti and Maria Kateri. 2011. Categorical data analysis. In International encyclopedia of statistical science, pages 206–208. Springer.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2009. Evaluation measures for ordinal regression. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pages 283–287. Institute of Electrical and Electronics Engineers.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473v7.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90. Association for Computational Linguistics.

Michele Banko and Oren Etzioni. 2007. Strategies for lifelong knowledge extraction from the web. In Proceedings of the 4th International Conference on Knowledge Capture, pages 95–102. Association for Computing Machinery.

Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J. Mooney. 2017. Representing meaning with a combination of logical and distributional models. Computational Linguistics.

James Bennett and Stan Lanning. 2007. The Netflix prize. In Proceedings of KDD Cup and Workshop, page 35.


Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 610–619. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797. Association for Computational Linguistics.

Timothy Chklovski. 2003. Learner: A System for Acquiring Commonsense Knowledge by Analogy. In Proceedings of the Second International Conference on Knowledge Capture (K-CAP 2003).

Stephen Clark and David Weir. 1999. An iterative approach to estimating frequencies over a semantic hierarchy. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Citeseer.

Peter Clark, Phil Harrison, and John Thompson. 2003. A knowledge-driven approach to text meaning processing. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Workshop on Text Meaning, Volume 9, pages 1–6. Association for Computational Linguistics.

Herbert H. Clark. 1975. Bridging. In R. C. Schank and B. L. Nash-Webber, editors, Theoretical issues in natural language processing. Association for Computing Machinery, New York.

Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, and Steve Pulman. 1996. Using the framework. Technical Report LRE 62-051 D-16, The FraCaS Consortium.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The Pascal recognising textual entailment challenge. In Machine learning challenges: evaluating predictive uncertainty, visual object classification, and recognising textual entailment.

Ferdinand de Haan. 1997. The Interaction of Modality and Negation: A Typological Study. A Garland Series. Garland Pub.

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4585–4592. European Language Resources Association (ELRA).

Katrin Erk, Sebastian Pado, and Ulrike Pado. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36(4):723–763.

Francis Ferraro, Max Thomas, Matthew R. Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. 2014. Concretely Annotated Corpora. In 4th Workshop on Automated Knowledge Base Construction (AKBC).

Noah S. Friedland, Paul G. Allen, Gavin Matthews, Michael Witbrock, David Baxter, Jon Curtis, Blake Shepard, Pierluigi Miraglia, Jurgen Angele, Steffen Staab, et al. 2004. Project Halo: Towards a digital Aristotle. AI Magazine, 25(4):29.

Dan Garrette, Katrin Erk, and Raymond Mooney. 2011. Integrating logical representations with probabilistic information using Markov logic. In Proceedings of the Ninth International Conference on Computational Semantics, pages 105–114. Association for Computational Linguistics.

Lisa Gaudette and Nathalie Japkowicz. 2009. Evaluation methods for ordinal classification. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence, Canadian AI '09, pages 207–210. Springer-Verlag.

Spandana Gella, Paul Cook, and Bo Han. 2013. Unsupervised word usage similarity in social media texts. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 248–253. Association for Computational Linguistics.

Danilo Giampiccolo, Hoa Trang Dang, Bernardo Magnini, Ido Dagan, Elena Cabrio, and Bill Dolan. 2008. The fourth Pascal recognizing textual entailment challenge. In Proceedings of the Text Analysis Conference (TAC) 2008.

Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge extraction. In Automated Knowledge Base Construction (AKBC): The 3rd Workshop on Knowledge Extraction at the ACM Conference on Information and Knowledge Management.

Catherine Havasi, Robert Speer, and Jason Alonso. 2007. ConceptNet 3: a Flexible, Multilingual Semantic Network for Common Sense Knowledge. In Proceedings of Recent Advances in Natural Language Processing.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, Volume 2, pages 539–545. Association for Computational Linguistics.

Jerry R. Hobbs and Costanza Navarretta. 1993. Methodology for knowledge acquisition (unpublished manuscript). http://www.isi.edu/~hobbs/damage.text.

Jerry R. Hobbs. 1987. World knowledge and word meaning. In Proceedings of the 1987 Workshop on Theoretical Issues in Natural Language Processing, TINLAP '87, pages 20–27. Association for Computational Linguistics.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Laurence R. Horn. 1989. A Natural History of Negation. David Hume Series. CSLI Publications.

Douglas B. Lenat. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

Maria Liakata and Stephen Pulman. 2002. From trees to predicate-argument structures. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pages 1–7. Association for Computational Linguistics.

Dekang Lin and Patrick Pantel. 2001. DIRT - Discovery of Inference Rules from Text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 323–328. Association for Computing Machinery.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 1501–1511.

John Lyons. 1977. Semantics. Cambridge University Press.

Bill MacCartney and Christopher D. Manning. 2007. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 193–200. Association for Computational Linguistics.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 216–223.

John McCarthy. 1959. Programs with common sense. In Proceedings of the Teddington Conference on the Mechanization of Thought Processes, London: Her Majesty's Stationery Office.

Ken McRae, Michael J. Spivey-Knowlton, and Michael K. Tanenhaus. 1998. Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38:283–312.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, Instruments, & Computers, 37(4):547–559.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. 2016. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2939.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and Cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849. Association for Computational Linguistics.

Marius Pasca and Benjamin Van Durme. 2007. What you seek is what you get: Extraction of class attributes from query logs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence.

Ulrike Pado, Matthew W. Crocker, and Frank Keller. 2009. A probabilistic model of semantic plausibility in sentence processing. Cognitive Science, 33(5):795–838.

Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard H. Hovy. 2007. ISP: Learning inferential selectional preferences. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 564–571.

Denis Paperno, German Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534. Association for Computational Linguistics.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword Fifth Edition. Linguistic Data Consortium.

Marius Pasca. 2008. Turning web text and search queries into factual knowledge: Hierarchical class attribute extraction. In Proceedings of the 23rd National Conference on Artificial Intelligence.

Fabian Pedregosa-Izquierdo. 2015. Feature extraction and supervised learning on fMRI: from practice to theory. Ph.D. thesis, Universite Pierre et Marie Curie - Paris VI.

Karl Pichotta and Raymond J. Mooney. 2016. Learning statistical scripts with LSTM recurrent neural networks. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), pages 2800–2806.

Marius Popescu and Liviu P. Dinu. 2009. Comparing statistical similarity measures for stylistic multivariate analysis. In Proceedings of Recent Advances in Natural Language Processing, pages 349–354.

Jason D. M. Rennie and Nathan Srebro. 2005. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186.

Philip Resnik. 1993. Semantic classes and syntactic ambiguity. In Proceedings of the ARPA Workshop on Human Language Technology.

Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Machine Learning, 62(1-2):107–136.

Stephen D. Richardson, William B. Dolan, and Lucy Vanderwende. 1998. MindNet: Acquiring and structuring semantic information from text. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, pages 1098–1102. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95.

Rachel Rudinger, Pushpendre Rastogi, Francis Ferraro, and Benjamin Van Durme. 2015. Script induction as language modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1681–1686. Association for Computational Linguistics.

Roser Saurí and James Pustejovsky. 2009. FactBank: A corpus annotated with event factuality. Language Resources and Evaluation, 43(3):227–268.

Asad Sayeed, Vera Demberg, and Pavel Shkadzko. 2015. An exploration of semantic features in an unsupervised thematic fit evaluation framework. Italian Journal of Computational Linguistics, 1(1).

Roger C. Schank. 1975. Using knowledge to understand. In TINLAP '75: Proceedings of the 1975 Workshop on Theoretical Issues in Natural Language Processing, pages 117–121.

Lenhart Schubert. 2002. Can we derive general world knowledge from texts? In Proceedings of the Second International Conference on Human Language Technology Research, pages 94–97. Morgan Kaufmann Publishers Inc.

Push Singh. 2002. The public acquisition of commonsense knowledge. In Proceedings of the AAAI Spring Symposium: Acquiring (and Using) Linguistic (and World) Knowledge for Information Access. AAAI.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 801–808. Association for Computational Linguistics.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In Proceedings of the 16th International Conference on World Wide Web, page 697.

Michael K. Tanenhaus and Mark S. Seidenberg. 1980. Discourse context and sentence perception. Technical Report 176, Center for the Study of Reading, Illinois University, Urbana.

Richmond H. Thomason. 1972. A Semantic Theory of Sortal Incorrectness. Journal of Philosophical Logic, 1(2):209–258, May.

Benjamin Van Durme and Lenhart Schubert. 2008. Open knowledge extraction through compositional language processing. In Proceedings of the 2008 Conference on Semantics in Text Processing, pages 239–254. Association for Computational Linguistics.

Benjamin Van Durme, Phillip Michalak, and Lenhart Schubert. 2009. Deriving generalized knowledge from corpora using WordNet abstraction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 808–816. Association for Computational Linguistics.

Benjamin Van Durme. 2010. Extracting implicit knowledge from text. Ph.D. thesis, University of Rochester, Department of Computer Science, Rochester, NY 14627-0226.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2773–2781. Curran Associates, Inc.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723. Association for Computational Linguistics.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Uri Zernik. 1992. Closed yesterday and closed minds: Asking the right questions of the corpus to distinguish thematic from sentential relations. In Proceedings of the 14th Conference on Computational Linguistics, Volume 4, pages 1305–1311. Association for Computational Linguistics.

Sheng Zhang, Kevin Duh, and Benjamin Van Durme. 2017. MT/IE: Cross-lingual open information extraction with neural sequence-to-sequence models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 64–70. Association for Computational Linguistics.
