

How the Availability of Training Material Affects Performance in the Winograd Schema Challenge

Nicos Isaak∗ and Loizos Michael

Abstract

The Winograd Schema Challenge — the task of resolving pronouns in sentences where shallow parsing techniques are not directly applicable — has been proposed as a conceptually and practically appealing alternative to the Turing Test. Among a number of attempts to tackle this challenge, one recent study has demonstrated the plausibility of using commonsense knowledge acquired from raw text in Wikipedia. Here, we present the results of a large-scale experiment that shows how the performance of this particular approach varies with the availability of training material. We undertake quantitative and qualitative analysis at corpus and sentence level, to examine the effects that the availability of training material has on the performance.

1 Introduction

One of the most important challenges in computer science is understanding how systems that acquire and manipulate commonsense knowledge can be created [Valiant, 2006]. With systems based on machine learning, we aim for systems that replace or substitute basic human abilities, so that we can relate to and interact with them. We believe that logical inferences are necessary to build natural language representations, as well as to reason about the information encoded in those representations.

Through the acquisition of knowledge and the extraction of general inference rules, we can handle different problems, like the Winograd Schema Challenge (WSC) [Levesque, 2011]; it contains groups of nearly identical sentences with clear but very different meanings, and the task is to resolve a definite pronoun to one of its two co-referents.

Here, we present the results of a large experiment, involving the acquisition of large amounts of training material, to check how it affects performance in the WSC. The aim is twofold. Firstly, to investigate how the availability of training material affects commonsense systems, and to examine whether we can enhance them with valuable knowledge to resolve pronouns that they were previously unable to handle.

∗Corresponding Author

Secondly, to investigate the effect of training material at the WSC sentence level, depending on different qualitative properties. To date, no study has looked specifically at how the amount of training material might benefit pronoun resolution in the WSC, and any evidence for this has been mainly anecdotal. To the best of our knowledge, no study has focused on detecting key variables affecting pronoun resolution in WSC sentences, either.

The sections below explain each of these tasks, along with the techniques that we developed. After the Introduction, we present the WSC with some highlights of previous work, while the follow-on section analyzes the system that we use in our tests. The fourth section outlines the methodology of our corpus-level analysis, and Section 5 presents our sentence-level analysis. Finally, we discuss the implications of our results in Section 6, along with potential directions for future research.

2 The Winograd Schema Challenge (WSC)

The WSC can be seen as a new type of Turing Test [Levesque, 2011]. It consists of sentence pairs (twin sentences) that have small differences, and the objective is to resolve a definite pronoun to one of its two co-referents in each sentence. The co-referents belong to the same gender, and both are either singular or plural. Additionally, the sentence contains a special word which, when replaced by another word, changes the answer.

The following WSC sentence pair (the catch example) illustrates how difficult the problem can be. 1) The cat caught the mouse because it was clever. Question: Who is clever? Answers: cat, mouse. 2) The cat caught the mouse because it was careless. Question: Who is careless? Answers: cat, mouse. The motive behind the WSC is to simulate human-like reasoning in machines, in order to test a machine's ability to answer commonsense questions regarding sentence comprehension.

Below, we present the related work, with the currently existing tools and techniques that attempt to solve the WSC, focusing on how they acquire knowledge. (1) Rahman and Ng's system [Rahman and Ng, 2012] tries to find the most probable pronoun candidate through a number of lexicalized statistical techniques; through a ranking-based approach (SVM), it combines features derived from different knowledge resources like Web Queries, FrameNet, OpinionFinder, English Gigaword, BLLIP, and Reuters. (2) The Budukh system [Budukh, 2013] consists of four answering modules that use world knowledge, with an aggregation mechanism drawing on ConceptNet, Web Queries, narrative chains, and sentiment analysis. (3) Another work [Peng et al., 2015], which uses an Integer Linear Programming approach, acquires statistics in an unsupervised way from multiple knowledge resources, like the Gigaword corpus, the Wikipedia Wikifier, Web Queries, and polarity information. (4) Sharma's technique is based on Answer Set Programming; it tries to retrieve the background knowledge directly from the Google search engine through fixed queries [Sharma, 2014; Sharma et al., 2015]. (5) Wikisense [Isaak and Michael, 2016] is a commonsense system that tries to solve the WSC problem through logical inference rules; to acquire commonsense knowledge it uses only one knowledge source, the English Wikipedia.

The significance of this challenge became particularly apparent through the various competitions that have been announced, like the Nuance Communications competition that took place in 2016 (at IJCAI), where one of the best approaches was Wikisense [Ackerman, 2016].

3 The Wikisense Approach

Wikisense uses a very interesting approach to address the WSC, via the extraction of logical inference rules from commonsense knowledge, and we use this system in our tests. We believe that the availability of training material as a source of knowledge, as used in the Wikisense approach, can benefit the WSC. Below, we briefly discuss the main elements of the Wikisense approach, by presenting how the engine works and how it acquires knowledge.

The engine is based on the Websense engine [Michael, 2013], which is able to output logical inferences, so that we can relate to and interact with it. Wikisense accepts any WSC sentence, with the question and the two possible pronoun targets, and responds with the correct pronoun target implied by the question. It acquires knowledge from the English Wikipedia via a supervised learning approach called autodidactic learning [Michael, 2010].

Algorithm 1: Wikisense's procedure for each WSC sentence

function ResolvePronoun(sent, negF, revF, question, answers)
    conf = 30%
    pairs = [(Vx, Vy), (Sx, Sy), (Ax, Ay), (Nx, Ny), (NBx, NBy), (VBx, VBy)]
    for pair in pairs do
        correctIndex = CalcValues(pair, negF, revF, conf, sent, question)
        if correctIndex != -1 then return answers[correctIndex]
    end for
    return -1
end function

function CalcValues(pair, negF, revF, conf, sent, question)
    x, y = RunAndEstimate(pair, sent, question)
    if negF == True then x, y = y, x
    if revF == True then x, y = y, x
    if x > y and (x - y)/x >= conf then
        return 0
    else if y > x and (y - x)/y >= conf then
        return 1
    else
        return -1
    end if
end function

Wikisense creates multiple search keywords from the WSC sentence and the question. For instance, for the catch (1) sentence, it creates the keywords catch/clever, cat*mouse/catch, cat/clever, mouse/clever. For every keyword, it requests one thousand training sentences, and if the specified number cannot be retrieved, it continues with the currently retrieved number (> 0). Initially, the engine runs the first keyword, which connects the WSC sentence with the question (e.g., catch/clever), and stores two values (Vx, Vy). Vx is the confidence that the answer to the question is the first candidate noun (e.g., cat); that is, it shows that the subject of the verb catch is also clever. Vy is the confidence that the answer is the second candidate noun (mouse); that is, it shows that the object of the verb catch is also clever. If Vx is greater than Vy, the engine returns the first answer as correct; if the contrary holds, it returns the second answer. Otherwise, it proceeds with the other keywords, which acquire the Wikipedia sentences differently (e.g., via synonyms, antonyms, etc.).
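To make the keyword-creation step concrete, the following minimal Python sketch reproduces the four patterns of the catch example. The function name and signature are our own illustration, not part of the Wikisense codebase.

def make_keywords(verb, question_word, candidate_a, candidate_b):
    # Hypothetical reconstruction of the keyword patterns described
    # above; Wikisense's actual generator may differ in its details.
    return [
        f"{verb}/{question_word}",              # e.g., catch/clever
        f"{candidate_a}*{candidate_b}/{verb}",  # e.g., cat*mouse/catch
        f"{candidate_a}/{question_word}",       # e.g., cat/clever
        f"{candidate_b}/{question_word}",       # e.g., mouse/clever
    ]

print(make_keywords("catch", "clever", "cat", "mouse"))
# ['catch/clever', 'cat*mouse/catch', 'cat/clever', 'mouse/clever']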

After each keyword, Wikisense can create the necessary commonsense knowledge and output inference rules to resolve the pronoun. For the knowledge building it uses two dependency parsers (spaCy and the Stanford Parser). Through different semantic relations, like the subject and the object of the verb, the parsers give the engine the tools needed to create a knowledge file. The knowledge file can fall into three categories. Firstly, it might contain useful rules that Wikisense can use to resolve the pronoun. Secondly, it might contain rules that cannot be used (for the specific keyword). Thirdly, it might be an empty file, without any rules. Also, a training sentence may or may not strengthen (increase) the weight of a rule. At the end, all important rules (weight > 1.0) are checked for the pronoun resolution.
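As an illustration of this extraction step, the sketch below reads subject/object relations off a spaCy dependency parse and accumulates a weight per rule. The relation set, the rule representation, and the weighting scheme are simplifications of what Wikisense actually does; only the spaCy calls are real API.

from collections import Counter
import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")

def extract_relation(sentence):
    # Return a (verb, subject, object) triple from one training
    # sentence, or None if the parse exposes no such relation.
    for token in nlp(sentence):
        if token.pos_ == "VERB":
            subjects = [c.text for c in token.children if c.dep_ == "nsubj"]
            objects = [c.text for c in token.children if c.dep_ == "dobj"]
            if subjects and objects:
                return (token.lemma_, subjects[0], objects[0])
    return None

training = ["The clever man caught the ball.", "The cat caught the mouse."]
weights = Counter(r for r in map(extract_relation, training) if r)
# Repeated relations raise a rule's weight; only rules above the
# threshold (weight > 1.0 in the text) would be kept for resolution.
print(weights)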

Through the next simplified example, we explain the knowledge file building procedure. For instance, for the catch (1) sentence, the Wikisense system might acquire the following training sentences: 1) The clever man caught the ball. 2) The cat caught the mouse. Through the parsers and the semantic relations (e.g., subjects and direct objects), the knowledge file will initially contain the following: caught(clever, ball), caught(cat, mouse). Afterwards, if it acquires more examples like these, Wikisense will generalize the rules and modify the knowledge file; e.g., knowledgeFile = [if catch(x, y) then x == cat or x == clever, and y == mouse]. Through this knowledge file, one can output logical inference rules, depending on the input (see the next code example).

Simplified Example of Wikisense Knowledge Checking

def run_and_estimate(answers, wiki_keyword, knowledge_file):
    # answers: the two candidate nouns, e.g., ["cat", "mouse"]
    # wiki_keyword: the keyword parts, e.g., ["catch", "clever"]
    # knowledge_file: the learned rules; here each rule is simplified
    # to a (verb, x_values, y_values) triple, e.g.,
    # [("catch", ["cat", "clever"], ["mouse"])]
    vx_counter, vy_counter = 0, 0
    for verb, x_values, y_values in knowledge_file:
        if wiki_keyword[0] != verb:  # the rule must match the keyword's verb
            continue
        if answers[0] in x_values or answers[1] in y_values:
            vx_counter += 1          # evidence for the first candidate
        elif answers[1] in x_values or answers[0] in y_values:
            vy_counter += 1          # evidence for the second candidate
    return vx_counter, vy_counter

(Wikisense Simplified Example)

For further information about the engine, we direct the reader to the Wikisense paper [Isaak and Michael, 2016].


Figure 1: A snapshot of the 1·10^1 Set Results

4 Corpus-Level Analysis

Below, we describe the knowledge enhancements that we performed to discover how training material affects performance in the WSC, and discuss certain choices made.

4.1 Empirical Methodology

For testing purposes, we selected the first 100 WSC sentences from the WSC Library¹ and used the Wikisense system with 12 training set sizes (1·10^1, 2·10^1, 5·10^1, 1·10^2, 2·10^2, 5·10^2, 1·10^3, 2·10^3, 5·10^3, 1·10^4, 2·10^4, 5·10^4). For each set size, we ran Wikisense 100 times (rounds), for statistically significant results. Each set determines the number of training sentences, and each time we use randomly selected sentences; a training sentence can be of any length and can also be used multiple times. For every round, in every set, we record whether that round resolved each WSC sentence correctly or incorrectly, or left it unanswered (unresolved); see Figure 1.
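A minimal sketch of this protocol is given below; run_wikisense is a hypothetical stand-in for a call into the actual system, returning "correct", "wrong", or "unanswered" for one WSC sentence.

import random

SET_SIZES = [10, 20, 50, 100, 200, 500, 1000,
             2000, 5000, 10000, 20000, 50000]
ROUNDS = 100

def run_experiment(wsc_sentences, corpus, run_wikisense):
    # results[(set_size, round, sentence_id)] stores the outcome of
    # one sentence in one round of one set size
    results = {}
    for size in SET_SIZES:
        for rnd in range(ROUNDS):
            # training sentences are sampled randomly, with
            # replacement, so one sentence can be used multiple times
            training = random.choices(corpus, k=size)
            for sid, sentence in enumerate(wsc_sentences):
                results[(size, rnd, sid)] = run_wikisense(sentence, training)
    return results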

4.2 Materials

For the experiments, we used a variety of hardware and software, and the whole running procedure took several months. Among other software, we used the Wikisense system, a Python plotting library (matplotlib), the Stanford parser, the spaCy parser, and spreadsheet software to design and administer the experiments. We ran our experiments on 5 different systems: 1) Apple MacBook Pro with Intel Core i7 2.4 GHz, 16 GB 1333 MHz DDR3, SSD; 2) Apple Mac Pro with 3.46 GHz 12-Core Xeon Processor 5.1, 32 GB DDR3 RAM, SSD, HDD; 3) Apple iMac with Intel i5 2.5 GHz, 20 GB DDR3, SSD; 4) Asus Lamborghini with Intel Core i7 2.20 GHz, 16 GB 1333 MHz DDR3, SSD, HDD; 5) Lenovo ThinkStation with 2 quad-core Xeon processors 2.27 GHz, 20 GB 1333 MHz DDR3, SSD, HDD.

4.3 Results and Analysis

Here, we present the results obtained for the unanswered, correct, and wrong pronoun resolutions in our training sets.

Figure 2 compares the correct, wrong, and unanswered pronoun resolutions; the horizontal axis shows the training set sizes, while the vertical axis depicts the unanswered, correct, and wrong resolution means. The correct resolution is depicted in green and the wrong resolution in red; blue shows the unanswered resolution, which is scaled by 50% (for better visualization).

¹ http://www.cs.nyu.edu/faculty/davise/papers/OldSchemas.xml

Figure 2: Performance Evaluation on the Entire Corpus across different Set Sizes (with SE). [Axes: training set size (1x10^1 through 5x10^4) vs. average %; series: Corrects, Wrongs, Unanswered (scaled by 50%), Corrects + 50% of Unanswered.]

Figure 3: Performance Evaluation on the Entire Corpus, for the Smallest and the Largest Set Size. [Axes: round (1 through 100) vs. average %; series: corrects and wrongs for 1x10^1 and 5x10^4.]

A cursory glance at Figure 2 reveals that the unanswered resolution decreases with each bigger set, while the correct resolution increases. Initially, as the set grows, the wrong pronoun resolution increases, up to the 2·10^3 set. Then it decreases until the 1·10^4 set, and increases in the last two sets.

Figure 3 compares the correct and wrong pronoun resolutions between the smallest (1·10^1) and the largest (5·10^4) training set size, by round. The horizontal axis shows the 100 rounds, while the vertical axis depicts the correct and wrong resolution means in each set; each training set is depicted with different color densities. There is a bigger gap between the correct and the wrong line on the biggest set than on the smallest one, showing that with the largest set size the correct pronoun resolution is better. Also, the biggest set's lines lie at higher values on the y-axis, indicating fewer unanswered resolutions.

The results provide convincing evidence of a link between the training set size and the Wikisense enrichment, regarding the unanswered WSC sentences and the correct pronoun resolution. With smaller set sizes, Wikisense answers fewer WSC sentences than with bigger sets. A positive correlation was obtained between the larger sets and the correct pronoun resolution, and a negative one with the unanswered resolution. Using an ANOVA, we rejected the null hypothesis: since F = 20.860 > F_crit = 3.2849, the means of the three populations (correct, wrong, unanswered) of Figure 2 are not all equal; they differ significantly from each other.
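A minimal sketch of such a one-way ANOVA follows; the values below are illustrative stand-ins, since the real per-set means come from the experiment's logs.

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# stand-in samples for the three populations of Figure 2
correct = rng.normal(26, 2, 12)
wrong = rng.normal(19, 2, 12)
unanswered = rng.normal(55, 3, 12)

f_stat, p_value = f_oneway(correct, wrong, unanswered)
# The paper's test gives F = 20.860 > F_crit = 3.2849, rejecting the
# null hypothesis that the three means are equal.
print(f"F = {f_stat:.3f}, p = {p_value:.3g}")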

Observing Figure 2, we can see that for every bigger set the number of unanswered WSC sentences decreases, to the benefit of the correct pronoun resolution. The only exception is the 1·10^4 set, where the number of unanswered WSC sentences increases by up to 0.36%; even so, this set shows the biggest difference between the correct and the wrong pronoun resolution (up to 8%). Also, there is a significant difference of 22% between the largest set and the smallest set (see Figure 3), in favor of the correct pronoun resolution; we can clearly see that the correct pronoun resolution line lies at higher values in the largest set. Furthermore, if we compare the default Wikisense training set (1·10^3) with the biggest set (5·10^4), we see a positive difference of 5% for the correct resolution in the biggest set, which is very important in the WSC (see Figure 2) [Isaak and Michael, 2016].

As the number of unresolved WSC sentences was reduced (from smaller to bigger sets), Wikisense wrongly resolved more WSC sentences; this continued until the 2·10^3 set (see Figure 2). Then, in the 5·10^3 and 1·10^4 sets, the wrong resolution started to fall, until the 2·10^4 and 5·10^4 sets, where it increased again. In the former group of sets we have the biggest difference between the correct and the wrong pronoun resolution. On the other hand, the latter sets of 2·10^4 and 5·10^4 prompt an interesting question: why does the number of wrong pronoun resolutions increase? A possible reason for this discrepancy might be the price we pay as the number of unresolved WSC sentences is reduced (see Figures 2 and 3). As shown in Figure 3, the lines of the smallest set are mixed, contrary to the lines of the largest set, which show a bigger gap, with higher values for the correct pronoun resolution. In the end, we prefer a system able to correctly resolve as many WSC sentences as possible.

As shown in Figures 2 and 3, the results are better with larger training sets; but which one is best? We might expect the 1·10^4 set to be one of the possible optimum sets for Wikisense. In this set, we have the biggest difference among all sets between the correct and the wrong pronoun resolution, which is 7.67% (26.29% − 18.62%). In the largest set, the difference between the correct and the wrong pronoun resolution is 6.25% (28.69% − 22.44%). If we observe Figure 2's black line, which shows the correct pronoun resolution plus 50% of the unanswered resolution, we see that the difference is up to 0.71% in favor of the 1·10^4 set; here we give Wikisense the ability to randomly answer all the unanswered WSC sentences, with a 50% chance of a correct/wrong pronoun resolution. The 1·10^4 set might thus be considered an optimum training set size. Moreover, running Wikisense with the 1·10^4 set requires less processing time, which is also vital for the upcoming challenges.
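The 0.71% figure can be reproduced directly from the percentages above. A short sketch, assuming unanswered = 100 − correct − wrong:

def adjusted_score(correct, wrong):
    # correct% plus half of the unanswered%: a coin flip on every
    # unanswered sentence
    unanswered = 100.0 - correct - wrong
    return correct + 0.5 * unanswered

set_1e4 = adjusted_score(26.29, 18.62)  # 53.835 for the 1x10^4 set
set_5e4 = adjusted_score(28.69, 22.44)  # 53.125 for the 5x10^4 set
print(round(set_1e4 - set_5e4, 2))      # 0.71, favoring the 1x10^4 set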

The general picture emerging from the analysis regarding the amount of training data is the following: with larger set sizes, the system seems to acquire richer knowledge that benefits co-reference resolution, at the cost of increased processing time. Richer knowledge seems to restrict sentence ambiguity, which is ubiquitous in natural language; it might help the system resolve pronouns in sentences that have negligible differences between them, like the twin WSC sentences.

5 Sentence-Level Analysis

To better understand how the availability of training material affects the performance, and why larger training set sizes offer better pronoun resolution, we proceed to a WSC sentence-level, quantitative and qualitative, analysis.

5.1 Quantitative Analysis

Generally, our approach works well with respect to the WSC, regardless of the rate of unanswered WSC sentences. Wikisense failed to answer 92 WSC sentences with the smallest Wikipedia set (i.e., WSC sentences that remained unanswered in > 50% of the rounds), and 46 with the biggest. It might seem counterintuitive that the unresolved number is almost 50%. However, with the biggest set we are able to resolve 14 more sentences than with the default one (1·10^3), which is a huge step given the challenge's difficulty [Ackerman, 2016].

Means for each WSC sentence, by each set, are presented in Figure 4a, which shows the correct, wrong, and unanswered pronoun resolutions. The horizontal axis shows the training set sizes, while the vertical axis lists each tested WSC sentence. Each correct WSC sentence resolution is depicted in green, for each set, while each wrong resolution is depicted in red; blue shows the unanswered resolution. Our sentence-level analysis is consistent with previous results, showing that the unanswered color (blue) gradually recedes with each bigger set, while the correct color (green) grows. This is also illustrated in Figure 4b, which shows the positive correlation obtained between the larger set sizes and the correct pronoun resolution; e.g., the green color of the 5·10^4 set has the greatest values on the horizontal axis.

Figures 4c, 4d, and 4e show the unanswered, correct, and wrong pronoun resolutions through RGB colors; the WSC sentences are reordered by performance in the largest training set. For instance, a cursory glance at Figure 4c shows that the bigger training sets are the ones with the darkest color, meaning that the number of unanswered WSC sentences decreases as the set size grows. Also, there is a significant difference in the green color density across the horizontal axis of Figure 4d, showing that the correct pronoun resolution is better for bigger sets than for smaller ones. Figure 4e likewise shows how the wrong pronoun resolution changes at the WSC sentence level.

5.2 Qualitative Analysis

Unanswered Sentences. Twenty-seven WSC sentences remained unanswered in all rounds of all training sets; see Figure 4a's blue color above sentence s040 on the vertical axis. Wikisense was not able to create a keyword in four of the 27 sentences. In some cases, it was not able to find enough training sentences to resolve pronouns; e.g., some keywords returned fewer than ten sentences in all sets (e.g., lie cautious). Furthermore, we observed a possible keyword-mention-based problem, which cannot lead to the correct pronoun target. For instance, in the sentence The cat was lying by the mouse hole waiting for the mouse, but it was too impatient. Question: What was too impatient?, the keyword lie impatient cannot easily lead to the correct pronoun target; Wikisense would need more information about the examined sentence structure.


Figure 4: Performance Evaluation on Individual Sentences across different Training Set-Sizes. [Panel A: correct, wrong, and unanswered resolution for WSC sentences (s001-s100) in all sets; Panel B: correct resolution for WSC sentences in all sets; axes: training set size (1x10^1 through 5x10^4) vs. WSC sentence.]


Furthermore, there were sentences that remained unanswered because of wrong knowledge returned from the training sentences. For instance, for a keyword like punish bully, we want training sentences that will help us figure out the relation between bullying and punishment; namely, that we probably have to punish people who bully others. In contrast, Wikisense returned sentences that did not help with the pronoun resolution; e.g., training sentences about the "Wooly Bully" song that had no relation to punishment and bullying. It appears that we have to consider the meaning of the training sentences before using them for pronoun resolution.

Unanswered Sentences in the Beginning. Initially, with smaller training sets, a lot of WSC sentences remained unanswered but were correctly resolved with larger training sets (see Figure 4c). Sentences like Sam's drawing was hung just above Tina's and it did look much better with another one below it. Question: Which looked better? were answered with bigger sets but not with smaller ones; Wikisense found more useful training sentences in larger sets for the keyword hang good, which helped with the pronoun resolution. Larger training sets seem more likely to return richer knowledge than smaller sets.

Correct and Wrong Resolving in all Sets. If the keyword is powerful enough (one that can help us resolve the pronoun in a WSC sentence regardless of the sentence structure), then the pronoun will be correctly resolved. WSC sentences that are connected with the question via powerful keywords can be easily resolved, even with small training set sizes. For example, the WSC sentence The city councilmen refused the demonstrators a permit because they feared violence. Question: Who feared violence? can be easily resolved through the keyword refuse fear; the subject of the verb refuse is the one who fears that something is going to happen, so the keyword leads directly to the correct pronoun target without misleading the engine. On the other hand, there are keywords that can lead to wrong conclusions. For example, the WSC sentence Anne gave birth to a daughter last month. She is a very charming woman. Question: Who is a charming woman? was wrongly resolved with all training set sizes. The question word charming, from the keyword give charming, led the engine to return the daughter as the answer; even humans might match the word charming with children more readily than with adults.

Confusing WSC Sentences. There are WSC sentences for which there is no evident relationship between the training set size and the pronoun resolution. For instance, for the WSC sentence Frank felt vindicated when his longtime rival Bill revealed that he was the winner of the competition. Question: Who was the winner of the competition?, the pronoun was resolved across the twelve sets as: UN, UN, WR, CR, CR, WR, WR, CR, WR, WR, WR, WR. Evidently, such WSC sentences can be very confusing even for humans. The majority of the WSC sentences are hard WSC sentences, and resolving them is not an easy task [Bender, 2015].

Twin Sentence Issue. The results yielded some interesting findings based on the keyword generator. Not a single pair of twin sentences was correctly resolved as a pair. This can be seen in Figure 4a, where the green color does not appear on consecutive sentence IDs. For the majority of pairs, the other sentence was incorrectly resolved with larger sets. This happens because the twin sentences have negligible differences, forcing the creation of the same keyword for both WSC sentences; it indicates the need for a better keyword generator, one that produces different keywords for the twin sentences.

WSC Sentences with Negation & Sentence Length. We did not find any differences for WSC sentences with negation, or for WSC sentences of different lengths, regarding the correct, wrong, or unanswered pronoun resolution. Furthermore, we did not find any pronoun-resolution issues related to the length of the training sentences.

Keyword POS Analysis. Here, we analyze the relationship between the keyword parts, based on the words' part of speech, in two different cases: WSC sentences that remained unanswered, and sentences that were correctly resolved, in all training sets (> 50%). The results yielded some interesting findings. The sentences that were correctly resolved had a verb in the left part of the keyword in 95% of the cases; the right part of the keyword was a noun or an adjective in 66% of the cases and a verb in 34%. On the other hand, the unanswered sentences had a verb in the left part of the keyword 67% of the time and a preposition 33% of the time; the right part of the keyword was a noun or an adjective in 63% of the cases and a verb in 37%. Another interesting side finding was that, if we exclude the sentences that remained 100% unanswered in all sets, the right part of the keyword is a verb in 41% of the cases. These findings suggest that if the left keyword part (which indirectly connects the two possible pronoun targets) is a verb, and the right keyword part (which connects the question with the sentence) is an adjective or a noun, then we have a better chance of correctly resolving the pronoun target; this is evidence pointing to further improvements that would help the pronoun resolution. The results indicate that commonsense conclusion systems like Wikisense might benefit from subject-verb-action relations; e.g., if they use a keyword-searching generator, they could use it to produce keywords based on verb-noun/adjective relations. A sketch of such a tally is shown after this paragraph.
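The sketch below tallies the part of speech of the left and right keyword parts, assuming keywords are stored as left/right strings. Tagging each part in isolation is a simplification, since out-of-context words can be tagged ambiguously; the helper and the data are our own illustration.

from collections import Counter
import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")

def pos_profile(keywords):
    # count the POS tag of each keyword half separately
    left, right = Counter(), Counter()
    for keyword in keywords:
        l, r = keyword.split("/")
        left[nlp(l)[0].pos_] += 1
        right[nlp(r)[0].pos_] += 1
    return left, right

print(pos_profile(["catch/clever", "refuse/fear", "hang/good"]))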

Keyword Set Analysis. The keyword set analysis yielded a significant correlation between the correct resolution and the training set size. Seventy-eight percent of the WSC sentences were correctly resolved with sets bigger than the sixth (5·10^2), where an equivalent number of sentences was found in each set. However, for the unanswered WSC sentences the percentage was limited to 64%; for the 24 WSC sentences that remained 100% unanswered, the percentage was 60%. Our findings are consistent with previous results showing that with bigger training sets we are able to resolve more WSC sentences.

6 Conclusion and Future Work

Using as an exemplar a system that was one of the best approaches in the first WSC, we have improved the performance in the WSC, at both the corpus-level and the sentence-level analysis. One could interpret the results in this paper as demonstrating that an appropriate choice of training corpus size can yield significant performance improvements in the WSC, even when applied to state-of-the-art systems. Initial results show further improvement that can enhance commonsense systems, but several findings warrant further discussion.

Future research will have to examine the effects of training material on different WSC systems, to investigate to what extent the relationship between pronoun resolution and the availability of training data holds for non-commonsense-conclusion systems. Also, future studies will have to evaluate algorithms that use less data, to determine where we need to focus in order to achieve better co-reference resolution. Finally, one could use our qualitative analysis to create higher-quality sets of training data for the upcoming Winograd Schema Challenges.

References

[Ackerman, 2016] Evan Ackerman. Winograd Schema Challenge Results: AI Common Sense Still a Problem, for Now. IEEE Spectrum, 2016.

[Bender, 2015] David Bender. Establishing a Human Baseline for the Winograd Schema Challenge. In MAICS, pages 39–45, 2015.

[Budukh, 2013] Tejas Ulhas Budukh. An Intelligent Co-reference Resolver for Winograd Schema Sentences Containing Resolved Semantic Entities, 2013.

[Isaak and Michael, 2016] Nicos Isaak and Loizos Michael. Tackling the Winograd Schema Challenge Through Machine Logical Inferences. In David Pearce and Helena Sofia Pinto, editors, STAIRS, volume 284 of Frontiers in Artificial Intelligence and Applications, pages 75–86. IOS Press, 2016.

[Levesque, 2011] Hector J. Levesque. The Winograd Schema Challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, number SS-11-06. American Association for Artificial Intelligence, 2011.

[Michael, 2010] Loizos Michael. Partial Observability and Learnability. Artificial Intelligence, 174(11):639–669, 2010.

[Michael, 2013] Loizos Michael. Machines with Websense. In Proceedings of the 11th International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense 13), 2013.

[Peng et al., 2015] Haoruo Peng, Daniel Khashabi, and Dan Roth. Solving Hard Coreference Problems. Urbana, 51:61801, 2015.

[Rahman and Ng, 2012] Altaf Rahman and Vincent Ng. Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 777–789, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[Sharma et al., 2015] Arpit Sharma, Nguyen H. Vo, Somak Aditya, and Chitta Baral. Towards Addressing the Winograd Schema Challenge: Building and Using a Semantic Parser and a Knowledge Hunting Module. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI, pages 25–31, 2015.

[Sharma, 2014] Arpit Sharma. Solving Winograd Schema Challenge: Using Semantic Parsing, Automatic Knowledge Acquisition and Logical Reasoning. Master's thesis, Arizona State University, 2014.

[Valiant, 2006] Leslie G. Valiant. Knowledge Infusion. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI'06, pages 1546–1551. AAAI Press, 2006.