Formal Arguments, Preferences, and Natural Language Interfaces to Humans: an Empirical Evaluation

Federico Cerutti Nava Tintarev Nir Oren

ECAI 2014 — Friday 22nd August, 2014

Motivation

– Distributed autonomous systems increasingly used
– Reasoning can be formalized as argumentation
– However, if we need to explain this to people the information presentation needs to be more natural
– Can we create a bridge between natural language and formal argumentation?
– What kind of factors need to be considered
  - Preferences between arguments?
  - Domain specific knowledge?


Background
The Experiment
Methodology
Results
Conclusions


Background on P&S

– Rule-based argumentation framework
– Allows arguments in favour of preferences among rules to be expressed
– Includes negation as failure and strong negation
– Although it is pre-Dung 1995, it is easy to draw a correspondence with abstract argumentation frameworks (there are some points where we should be cautious, but they do not arise in this work)


Crash course on P&S

– Each rule has a set of antecedents and a consequent
– Strict rules (which cannot contain negation as failure atoms) and defeasible rules
– Arguments are sequences of rules (instead of a recursive structure as in ASPIC)
– The conclusions of an argument are the set containing the consequent of each rule of the argument
– Attacks:
  - on some antecedent of some rule
  - on some conclusion
– Skeptical semantics: grounded
– Credulous semantics: stable
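A minimal Python sketch of the structures listed on this slide, assuming a simplified reading of the framework: a rule carries antecedents, negation-as-failure assumptions and a consequent; an argument is a sequence of rules; an attack targets either a negation-as-failure antecedent or a conclusion. The names `Rule`, `conclusions`, `attacks`, `fact` and the '-' prefix for strong negation are illustrative choices rather than part of the original formalism, and preferences between rules are deliberately left out. The rule contents are borrowed from the political-debate example in this deck.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


def neg(literal: str) -> str:
    """Strong negation, encoded here with a leading '-' (an illustrative choice)."""
    return literal[1:] if literal.startswith("-") else "-" + literal


@dataclass(frozen=True)
class Rule:
    name: str
    antecedents: FrozenSet[str]   # literals that must be derived
    assumptions: FrozenSet[str]   # negation-as-failure atoms: "~L" is stored as "L"
    consequent: str
    strict: bool = False

    def __post_init__(self) -> None:
        # Strict rules cannot contain negation-as-failure atoms.
        if self.strict and self.assumptions:
            raise ValueError(f"strict rule {self.name} has negation-as-failure atoms")


Argument = Tuple[Rule, ...]  # an argument is a sequence of rules


def conclusions(arg: Argument) -> FrozenSet[str]:
    """The set containing the consequent of each rule of the argument."""
    return frozenset(rule.consequent for rule in arg)


def attacks(attacker: Argument, target: Argument) -> bool:
    """Simplified attack check; rule preferences are ignored."""
    concl = conclusions(attacker)
    assumed = {a for rule in target for a in rule.assumptions}
    on_antecedent = bool(concl & assumed)  # attacker derives L, target assumes ~L
    on_conclusion = any(neg(c) in conclusions(target) for c in concl)
    return on_antecedent or on_conclusion


def fact(name: str, literal: str) -> Rule:
    """A strict rule with no antecedents (hypothetical helper)."""
    return Rule(name, frozenset(), frozenset(), literal, strict=True)


r1 = Rule("r1", frozenset({"s_AAA"}), frozenset({"ex_AAA"}), "poorer")
r2 = Rule("r2", frozenset({"s_BBB", "s_doc"}), frozenset({"ex_BBB", "ex_doc"}), "-poorer")
a1 = (fact("s1", "s_AAA"), r1)
a2 = (fact("s2", "s_BBB"), fact("s3", "s_doc"), r2)

print(sorted(conclusions(a1)))           # ['poorer', 's_AAA']
print(attacks(a1, a2), attacks(a2, a1))  # True True
```

In this sketch the two arguments attack each other on their conclusions; which one defeats the other depends on the preference between r1 and r2, which is not modelled here.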


Example

S:
s1: ⇒ s_AAA
s2: ⇒ s_BBB
s3: ⇒ s_doc

D:
r1: s_AAA ∧ ∼ex_AAA ⇒ poorer
r2: s_BBB ∧ s_doc ∧ ∼ex_BBB ∧ ∼ex_doc ⇒ ¬poorer
r3: ∼ex_expert ⇒ r1 ≺ r2

A politician and an economist discuss the potential financial outcome of the independence of a region X. The politician puts forward an argument in favour of the conclusion “If Region X becomes independent, X’s citizens will be poorer than they are now”. Another argument holding a contradicting conclusion (i.e. Region X will not be poorer) is advanced by the economist. The economist’s opinion is likely to be preferred to that of the politician, and is supported by a scientific document.

Args = {a1 = ⟨s1, r1⟩, a2 = ⟨s2, s3, r2⟩, a3 = ⟨r3⟩}
a2 Args-defeats a1
a2 is justified
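The claim that a2 is justified can be checked on the abstract argumentation framework induced by this example. The sketch below is a minimal illustration, assuming that the only defeat is the one stated on the slide (a2 defeats a1) and that arguments can be treated as opaque labels. The function name `grounded_extension` is an illustrative choice.

```python
def grounded_extension(args, defeats):
    """Least fixed point of the characteristic function, iterated from the empty set."""
    attackers = {a: {x for (x, y) in defeats if y == a} for a in args}
    extension = set()
    while True:
        # An argument is defended if each of its attackers is defeated by the extension.
        defended = {
            a for a in args
            if all(any((d, b) in defeats for d in extension) for b in attackers[a])
        }
        if defended == extension:
            return extension
        extension = defended


args = {"a1", "a2", "a3"}
defeats = {("a2", "a1")}  # taken from the slide: a2 Args-defeats a1

print(sorted(grounded_extension(args, defeats)))  # ['a2', 'a3']
```

Under these assumptions the grounded extension contains a2 and a3 but not a1, which matches the sceptical reading that the economist's position is justified.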


The Experiment

– Each participant is presented with a text, written in natural language, followed by a questionnaire
– Between-subjects design across eight texts: each participant is shown a single (randomly selected) text
– Four domains:
  1. weather forecast
  2. political debate
  3. used car sale
  4. romantic relationship
– Two KBs: base case and extended case
– The base case always considers two arguments a1 and a2 with two contradicting conclusions, and a preference in favour of a2


The Extended Case for the Example

More recent research disputes the claim of the economist

S:
s1: ⇒ s_AAA
s2: ⇒ s_BBB
s3: ⇒ s_doc
s4: ⇒ s_research
s5: s_research ⇒ ¬s_doc

D:
r1: s_AAA ∧ ∼ex_AAA ⇒ poorer
r2: s_BBB ∧ s_doc ∧ ∼ex_BBB ∧ ∼ex_doc ⇒ ¬poorer
r3: ∼ex_expert ⇒ r1 ≺ r2

Args = {a1 = ⟨s1, r1⟩, a2 = ⟨s2, s3, r2⟩, a3 = ⟨r3⟩, a4 = ⟨s4, s5⟩}
a2 Args-defeats a1
a2 Args-defeats a4
a4 Args-defeats a2

Two stable extensions: {a1, a3, a4} and {a2, a3}
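The two stable extensions can be reproduced by brute force on the abstract framework. This is a minimal sketch, assuming that the three defeats listed on the slide are the only ones and treating arguments as opaque labels. The function name `stable_extensions` is an illustrative choice, and enumerating all subsets is only practical for very small frameworks.

```python
from itertools import combinations


def stable_extensions(args, defeats):
    """All conflict-free sets that defeat every argument outside themselves."""
    ordered = sorted(args)
    found = []
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            s = set(subset)
            conflict_free = not any((a, b) in defeats for a in s for b in s)
            covers_rest = all(
                any((a, b) in defeats for a in s) for b in set(ordered) - s
            )
            if conflict_free and covers_rest:
                found.append(s)
    return found


args = {"a1", "a2", "a3", "a4"}
defeats = {("a2", "a1"), ("a2", "a4"), ("a4", "a2")}  # taken from the slide

extensions = stable_extensions(args, defeats)
print(sorted(sorted(e) for e in extensions))       # [['a1', 'a3', 'a4'], ['a2', 'a3']]
print(sorted(set.intersection(*extensions)))       # ['a3']
```

Under these assumptions only a3 belongs to every stable extension, so neither a1 nor a2 is accepted in all of them.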


Domain 1: weather forecast

The weather forecasting service of the broadcasting company AAA says that it will rain tomorrow (a1).
Meanwhile, the forecast service of the broadcasting company BBB says that it will be cloudy tomorrow but that it will not rain (a2).
It is also well known that the forecasting service of BBB is more accurate than the one of AAA (a3).
However, yesterday the trustworthy newspaper CCC published an article which said that BBB has cut the resources for its weather forecasting service in the past months, thus making it less reliable than in the past (a4).


Domain 2: political debate

In a TV debate, the politician AAA argues that if Region X becomes independent then X’s citizens will be poorer than now (a1).
Subsequently, financial expert (a3) Dr. BBB presents a document; which scientifically shows that Region X will not be worse off financially if it becomes independent (a2).
After that, the moderator of the debate reminds BBB of more recent research by several important economists that disputes the claims in that document (a4).


Domain 3: buying a car

You are planning to buy a second-hand car, and you go to a dealership with BBB, a mechanic whom has been recommended you by a friend (a3).
The salesperson AAA shows you a car and says that it needs very little work done to it (a1).
BBB says it will require quite a lot of work, because in the past he had to fix several issues in a car of the same model (a2).
While you are at the dealership, your friend calls you to tell you that he knows (beyond a shadow of a doubt) that BBB made unnecessary repairs to his car last month (a4).


Domain 4: romance

After several dates, you would like to start a serious relationship with J. but you turn to ask two friends of yours, AAA and BBB, for advice. You have known BBB for longer than you have known AAA (a3).
AAA tells you that J is lovely and you should go ahead (a1), while BBB suggests that you should be very cautious because J might have a hidden agenda (a2).
After some weeks, CCC, who is also a close friend of BBB, tells you that BBB has been into you for years; BBB is too shy to tell you about their feelings about you, but are still possessive of you (a4).


Formalisation summary

Domain          Base Case   Extended Case   Type of reinstatement
1, weather      1.B         1.E             preference attack
2, politics     2.B         2.E             a2 rebuttal
3, buying car   3.B         3.E             preference attack
4, romance      4.B         4.E             preference rebuttal


Methodology

– Participants are asked to determine which of the following positions they think is accurate:
  - PA: I think that AAA’s position is correct (e.g. “X’s citizens will be poorer than now”)
  - PB: I think that BBB’s position is correct (e.g. “X’s citizens will not be worse off financially”)
  - PU: I cannot determine if either AAA’s or BBB’s position is correct (e.g. “I cannot conclude anything about Region X’s finances”)
– Participants rate statements in terms of relevance (for the conclusion) and agreement, on a 7-point scale from Disagree to Agree for each statement


Hypotheses

H1: In the base cases (Scenarios 1.B, 2.B, 3.B and 4.B), the majority of participants will agree with BBB’s statement (position PB).
H2: In the extended cases (Scenarios 1.E, 2.E, 3.E and 4.E), the majority of participants will agree that they cannot conclude anything from the text (position PU).
H3: The majority of participants who view a base case scenario will agree with the preference argument, and find it relevant.


Hypotheses H1 and H2

[Figure: Distribution of acceptability of actors’ positions. Bar chart of the percentage (%) of participants choosing PA, PB, or PU, for base cases and extended cases.]

Distribution of the final conclusion PA/PB/PU. Base cases: χ²(2, N=77) = 37.74, p < 0.001; extended cases: χ²(2, N=84) = 8.0, p < 0.02.
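The two χ² values can be checked by hand. The sketch below is a worked example under two assumptions that the slide does not state: that the test is a goodness-of-fit test against a uniform distribution over PA, PB and PU, and that the per-scenario counts are those inferred from the percentages in the post hoc table later in this deck (for instance 1, 10 and 9 out of 20 for scenario 1.B). The function name `chi_square_uniform` is an illustrative choice.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected frequencies."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Counts per scenario as (PA, PB, PU). These are inferred from the reported
# percentages, not stated in the deck, so treat them as an assumption.
BASE = {"1.B": (1, 10, 9), "2.B": (1, 12, 6), "3.B": (0, 15, 7), "4.B": (2, 11, 3)}
EXTENDED = {"1.E": (3, 4, 12), "2.E": (4, 2, 13), "3.E": (5, 5, 11), "4.E": (12, 9, 4)}


def totals(scenarios):
    """Column-wise sum of the per-scenario counts."""
    return tuple(sum(column) for column in zip(*scenarios.values()))


base_totals = totals(BASE)          # (4, 48, 25), N = 77
extended_totals = totals(EXTENDED)  # (24, 20, 40), N = 84

print(round(chi_square_uniform(base_totals), 2))      # 37.74
print(round(chi_square_uniform(extended_totals), 2))  # 8.0
```

With these inferred counts the sketch reproduces both reported statistics, with 2 degrees of freedom (three categories minus one).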

Hypothesis H3

Participants rate how much (on a scale of 1 to 7) they agree with the following statement (agreement), and whether it is relevant in drawing their conclusion (relevance): “BBB is more trustworthy than AAA.”

Significant difference between the base and the extended cases for agreement (Mann-Whitney U(1778), Z = −5.0, p < 0.001) and relevance (Mann-Whitney U(1852), Z = −4.7, p < 0.001).

In addition, the median values both for agreement and relevance are greater for the base cases than for the extended cases.


Post Hoc: Motivations

                 Base Cases             Extended Cases
Domain           PA     PB     PU       PA     PB     PU
1, weather       5.0    50.0   45.0     15.8   21.1   63.2
2, politics      5.3    63.2   31.6     21.1   10.5   68.4
3, buying car    0.0    68.2   31.8     23.8   23.8   52.4
4, romance       12.5   68.8   18.8     48.0   36.0   16.0

Distribution of the final conclusion PA/PB/PU (%). Fisher (N = 161) = 48.756, p < 0.001, 10000 sampled tables, Monte Carlo approach with 99% confidence interval (MC99).


Post Hoc: Distributions of Base Cases

[Figure: Distributions of motivations for PU (scenarios 1.B and 3.B). Bar chart of the percentage (%) of responses per motivation U1, U2, U3, for scenarios 1.B and 3.B.]

Agreement with the PU position in scenarios 1.B and 3.B. U1: lack of information; U2: domain specific reasons; U3: other.


Post Hoc: Distributions between Base/Extended Cases


Is the distribution of choices (among PA, PB, and PU) in the base case significantly different from the distribution of choices in the corresponding extended case?

– YES for the third domain (3.B and 3.E, buying a car): Fisher (N = 43) = 10.693, p < 0.001, 10000 sampled tables, MC99.
– NO for the first domain (1.B and 1.E, weather forecasts): Fisher (N = 39) = 3.832, p = 0.187, 10000 sampled tables, MC99.


Post Hoc: Distributions Extended Cases


Domain has a significant effect on the distribution of positions in the extended cases: Fisher (N = 84) = 16.308, p < 0.05, 10000 sampled tables, MC99.


Post Hoc: Relevance and Agreement

                              Base cases          Extended cases
             Domain           R_B†      Md∗_B     R_E†     Md∗_E     C.D.‡
Relevance    1, weather       110.38    6.00      82.92    4.00      46.60
             2, politics      107.45    6.00      69.45    4.00      47.19
             3, buying car    118.05    6.50      67.45    4.00      44.38
             4, romance       48.34     2.00      44.40    2.00      46.57
Agreement    1, weather       116.38    6.00      87.18    4.00      46.60
             2, politics      103.34    6.00      65.05    4.00      47.19
             3, buying car    121.93    6.50      64.33    4.00      44.38
             4, romance       44.94     2.00      44.20    2.00      46.57

Statistically significant cases when |Rx − Ry| > C.D.
† Mean rank as computed with the Kruskal-Wallis test
‡ Critical Difference, as computed in [Siegel and Castellan Jr., 1988] cited by [Field, 2009] with α = 0.05.
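The C.D. column appears to follow the usual multiple-comparison formula for mean ranks, C.D. = z · sqrt(N(N+1)/12 · (1/n_u + 1/n_v)). The sketch below is a worked check under assumptions that the slide does not state: group sizes inferred from the reported percentages (161 participants over 8 scenarios), and z = 3.12 taken as the rounded normal quantile for α / (k(k − 1)) with α = 0.05 and k = 8. The names `SIZES`, `Z` and `critical_difference` are illustrative.

```python
from math import sqrt

# Assumed group sizes, inferred from the reported percentages (not stated in the deck).
SIZES = {"1.B": 20, "1.E": 19, "2.B": 19, "2.E": 19,
         "3.B": 22, "3.E": 21, "4.B": 16, "4.E": 25}
N_TOTAL = sum(SIZES.values())  # 161
Z = 3.12  # assumed rounded quantile for alpha / (k * (k - 1)), alpha = 0.05, k = 8


def critical_difference(n_u: int, n_v: int) -> float:
    """Smallest difference in mean ranks that counts as significant."""
    return Z * sqrt(N_TOTAL * (N_TOTAL + 1) / 12 * (1 / n_u + 1 / n_v))


# Mean ranks for relevance as (base case, extended case), copied from the table.
RELEVANCE = {1: (110.38, 82.92), 2: (107.45, 69.45),
             3: (118.05, 67.45), 4: (48.34, 44.40)}

significant = []
for domain, (r_base, r_ext) in RELEVANCE.items():
    cd = critical_difference(SIZES[f"{domain}.B"], SIZES[f"{domain}.E"])
    if abs(r_base - r_ext) > cd:
        significant.append(domain)
    print(domain, round(cd, 2), round(abs(r_base - r_ext), 2))

print(significant)  # [3]
```

Under these assumptions the sketch reproduces the C.D. column to two decimals and flags only the buying-car domain for relevance.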


Post Hoc: Relevance and Agreement

             Scenario 3.B          Scenario 4.B
             R_3.B†    Md∗_3.B     R_4.B†    Md∗_4.B    C.D.‡
Relevance    118.05    6.50        48.34     2.00       47.79
Agreement    121.93    6.50        44.94     2.00       47.79

Statistically significant cases when |Rx − Ry| > C.D.
† Mean rank as computed with the Kruskal-Wallis test
‡ Critical Difference, as computed in [Siegel and Castellan Jr., 1988] cited by [Field, 2009] with α = 0.05.


Conclusions

– Investigation into the relationship between formal systems of defeasible argumentation and arguments in natural language
– Results suggest a correspondence between the formal theory and its representation in natural language
– Preference generally applied “following” Prakken and Sartor: importance of being able to represent them
– Humans evaluate preference depending on the context
  - Collateral knowledge
  - Reverse of preference


Acknowledgement

Research was sponsored by US Army Research Laboratory and the UK Ministry of Defence and was accomplished under Agreement Number W911NF-06-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the US Army Research Laboratory, the U.S. Government, the UK Ministry of Defence, or the UK Government. The US and UK Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

This research has been carried out within the project “Scrutable Autonomous Systems” (SAsSY), funded by the Engineering and Physical Sciences Research Council (EPSRC, UK), grant ref. EP/J012084/1.



References

[Field, 2009] Field, A. (2009). Discovering Statistics Using SPSS (Introducing Statistical Methods series). SAGE Publications Ltd.

[Siegel and Castellan Jr., 1988] Siegel, S. and Castellan Jr., N. J. (1988). Nonparametric Statistics for The Behavioral Sciences. McGraw-Hill Humanities/Social Sciences/Languages.
