Empirical Methods for Evaluating Dialog Systems

Tim Paek

Microsoft Research
One Microsoft Way
Redmond, WA
[email protected]

Paper ID: SIGDIAL_TP

Keywords: dialog evaluation, dialog metric, descriptive statistics, gold standard, wizard-of-oz

Contact Author: Tim Paek

Abstract

We examine what purpose a dialog metric serves and then propose empirical methods for evaluating systems that meet that purpose. The methods include a protocol for conducting a wizard-of-oz experiment and a basic set of descriptive statistics for substantiating performance claims using the data collected from the experiment as an ideal benchmark or “gold standard” for comparative judgments. The methods also provide a practical means of optimizing the system through component analysis and cost valuation.

1 Introduction

In evaluating the performance of dialog systems, designers face a number of complicated issues. On the one hand, dialog systems are ultimately created for the user, so usability factors such as satisfaction or likelihood of future use should be the final criteria. On the other hand, because usability factors are subjective, they can be erratic and highly dependent on features of the user interface (Kamm et al., 1999). So, designers have turned to “objective” metrics such as dialog success rate or completion time. Unfortunately, due to the interactive nature of dialog, these metrics do not always correspond to the most effective user experience (Lamel et al., 2000). Furthermore, several different metrics may contradict one another (Kamm et al., 1999), leaving designers with the tricky task of untangling the interactions or correlations between metrics.

Instead of focusing on developing a new metric that circumvents the problems above, we maintain that designers need to make better use of the ones that already exist. Toward that end, we first examine what purpose a dialog metric serves and then propose empirical methods for evaluating systems that meet that purpose. The methods include a protocol for conducting a wizard-of-oz experiment and a basic set of descriptive statistics for substantiating performance claims using the data collected from the experiment as an ideal benchmark or “gold standard” for comparative judgments. The methods also provide a practical means of optimizing the system through component analysis and cost valuation.

2 Purpose

Performance can be measured in myriad ways. Indeed, for evaluating dialog systems, the one problem designers do not encounter is lack of choice. Dialog metrics come in a diverse assortment of styles. They can be subjective or objective, deriving from questionnaires or log files. They can vary in scale, from the utterance level to the overall dialog (Glass et al., 2000). They can treat the system as a “black box,” describing only its external behavior (Eckert et al., 1998), or as a “glass box,” detailing its internal processing. If one metric fails to suffice, dialog metrics can be combined. For example, the PARADISE framework allows designers to predict user satisfaction from a linear combination of objective metrics such as mean recognition score and task completion (Kamm et al., 1999; Litman & Pan, 1999; Walker et al., 1997).

Why so many metrics? The answer has to do with more than just the absence of agreed-upon standards in the research community, notwithstanding significant efforts in that direction (Gibbon et al., 1997). Part of the reason deals with the purpose a dialog metric serves. Designers want a dialog metric to address multiple, sometimes inconsistent, needs. Here are four typical needs:

(1) Provide an accurate estimation of how well a system meets the goals of the domain task.
(2) Allow for comparative judgments of one system against another, and if possible, across different domain tasks.
(3) Identify factors or components in the system that can be improved.
(4) Discover tradeoffs or correlations between factors.

While the above list is not intended to be exhaustive, it is instructive. Creating such a list can help designers to anticipate the kinds of obstacles they are likely to face in trying to satisfy all of the needs. Consider the first need on the list.

Providing an accurate estimation of how well a system meets the goals of the domain task depends on how well the designers have delineated all the possible goals of interaction. Unfortunately, users often have finer goals than those anticipated by designers, even for domain tasks that seem well defined, such as airline ticket reservation. For example, a user may be leisurely hunting for a vacation and not care about destination or time of travel, or the user may be frantically looking for an emergency ticket and not care about price. The “appropriate” dialog metric should reflect even these kinds of goals. While “time to completion” is more appropriate for the emergency ticket, “concept efficiency rate” is more appropriate for the savvy vacationer. As psychologists have long recognized, when people engage in conversation, they make sure that they mutually understand the goals, roles, and behaviors that can be expected (Clark, 1996; Clark & Brennan, 1991; Clark & Schaefer, 1987, 1989). They evaluate the “performance” of the dialog based on their mutual understanding and expectations.

Not only do different users have different goals, they sometimes have multiple goals, or more often, their goals change dynamically in response to system behavior such as communication failures (Danieli & Gerbino, 1995; Paek & Horvitz, 1999). Because goals engender expectations that then influence evaluation at different points in time, usability ratings are notoriously hard to interpret, especially if the system is not equipped to infer and keep track of user goals (Horvitz & Paek, 1999; Paek & Horvitz, 2000).

The second typical need for a dialog metric, allowing for comparative judgments, introduces further obstacles. In addition to unanticipated, dynamically changing user goals, different systems employ different dialog strategies operating under different architectural constraints, making the search for a dialog metric that generalizes across systems nearly impossible. While the PARADISE framework facilitates some comparison of dialog systems in different domain tasks, generalization is limited because different components can render factors irrelevant in the statistical model (Kamm et al., 1997). For example, a common measure of task completion would be possible if every system represented the domain task as an Attribute-Value Matrix (AVM). Unfortunately, that requirement excludes systems that use Bayesian networks or other non-symbolic representations. This has prompted some researchers to argue that a “common inventory of concepts” is necessary to have standard metrics for evaluation across systems and domain tasks (Kamm et al., 1997; Glass et al., 2000). As we discuss in the next section, the argument is actually backwards; we can use the metrics we already have to define a common inventory of concepts. Furthermore, with the proper set of descriptive statistics, we can exploit these metrics to address the third and fourth typical needs of designers, those of identifying contributing factors, along with their tradeoffs, and optimizing them.

This is not to say that comparative judgments are impossible; rather, it takes some amount of careful work to make them meaningful. When research papers describe evaluation studies of the performance of dialog systems, it is imperative that they provide a baseline comparison from which to benchmark their systems. Even when readers understand the scale of the metrics being reported, without a baseline, the numbers convey very little about the quality of experience users of the system can expect. For example, suppose a paper reports that a dialog system received an average usability score of 9.5/10, a high concept efficiency rate of 90%, and a low word error rate of 5%. These numbers sound terrific, but they could have resulted from low user expectations and a simplistic or highly constrained interface. Practically speaking, readers must either experience interacting with the system themselves, or have a baseline comparison for the domain task from which to make sense of the numbers. This is true even if the paper reports a statistical model for predicting one or more of the metrics from the others, which may reveal tradeoffs but not how well the system performs relative to the baseline.

To sum up, in considering the purpose a dialog metric serves, we examined four typical needs and discussed the kinds of obstacles designers are likely to face in finding a dialog metric that satisfies those needs. The obstacles themselves present distinct challenges: first, keeping track of user goals and expectations for performance based on the goals, and second, establishing a baseline from which to benchmark systems and make comparative judgments. Assuming that designers equip their system to handle the first challenge, we now propose empirical methods that allow them to handle the second, while at the same time providing a practical means of optimizing the system. These methods do not require new metrics, but instead take advantage of existing ones through experimental design and a basic set of descriptive statistics.

3 Empirical methods

Before designers can make comparative judgments about the performance of a dialog system relative to another system, so that readers unacquainted with either system can understand the reported metrics, they need a baseline. Fortunately, in evaluating dialog between humans and computers, the “gold standard” is oftentimes known; namely, human conversation. The most intuitive and effective way to substantiate performance claims is to compare a dialog system on a particular domain task with how human beings perform on the same task. Because human performance constitutes an ideal benchmark, readers can make sense of the reported metrics by assessing how closely the system approaches the gold standard. Furthermore, with a benchmark, designers can optimize their system through component analysis and cost valuation.

In this section, we outline an experimental protocol for obtaining human performance data that can serve as a gold standard. We then highlight a basic set of descriptive statistics for substantiating performance claims, as well as for optimization.

3.1 Experimental protocol

Collecting human performance data for establishing a gold standard requires conducting a carefully controlled wizard-of-oz (WOZ) study. The general idea is that users communicate with a human “wizard” under the illusion that they are interacting with a computational system. For spoken dialog systems, maintaining the illusion usually involves utilizing a synthetic voice to output wizard responses, often through voice distortion or a text-to-speech (TTS) generator.

[Figure 1. Wizard-of-Oz study for the purpose of establishing a baseline comparison: users exchange controlled input and output, through an experimentally controlled curtain, with either the wizard or the dialog system.]

The typical use of a WOZ study is to record and analyze user input and wizard output. This allows designers to know what to expect and what they should try to support. User input is especially critical for speech recognition systems that rely on the collected data for acoustic training and language modeling. In iterative WOZ studies, previously collected data is used to adjust the system so that as the performance of the system improves, the studies employ less of the wizard and more of the system (Glass et al., 2000). In the process, design constraints in the interface may be revealed, in which case further studies are conducted until acceptable tradeoffs are found (Bernsen et al., 1998).

In contrast to the typical use, a WOZ study for establishing a gold standard prohibits modifications to the interface or experimental “curtain.” As shown in Figure 1, all input and output through the interface must be carefully controlled. If designers want to use previously collected performance data as a gold standard, they need to verify that all input and output have remained constant. The protocol for establishing a gold standard is straightforward:

(1) Select a dialog metric to serve as an objective function for evaluation.
(2) Vary the component or feature that best matches the desired performance claim for the dialog metric.
(3) Hold all other input and output through the interface constant so that the only unknown variable is who does the internal processing.
(4) Repeat using different wizards.
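To make the protocol concrete, the minimal sketch below shows one way the conditions of such a study could be recorded; the `ExperimentCondition` structure and its field names are illustrative assumptions, not part of the protocol itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExperimentCondition:
    """One cell of a WOZ study run under the gold-standard protocol (hypothetical bookkeeping)."""
    objective_metric: str      # step (1): e.g., "task_completion_rate"
    varied_factor: str         # step (2): e.g., "repair_strategy"
    factor_level: str          # the particular setting under test
    agent: str                 # step (3): "wizard" or "system" does the internal processing
    wizard_id: str = ""        # step (4): repeat with different wizards
    held_constant: List[str] = field(default_factory=lambda: [
        "speech_recognizer", "TTS_output", "prompts", "domain_task"])

# Example: the spoken-dialog scenario discussed below, where repair strategies are
# varied while the same recognizer output reaches both the wizard and the system.
conditions = [
    ExperimentCondition("task_completion_rate", "repair_strategy", "strategy_1", "system"),
    ExperimentCondition("task_completion_rate", "repair_strategy", "strategy_1", "wizard", "W1"),
    ExperimentCondition("task_completion_rate", "repair_strategy", "strategy_1", "wizard", "W2"),
]
```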

To motivate the above protocol, consider how a WOZ study might be used to evaluate spoken dialog systems. The Achilles’ heel of spoken interaction is the fragility of the speech recognizer. System performance depends highly on the quality of the recognition. Suppose a designer is interested in bolstering the robustness of a dialog system by exploiting various repair strategies. Using task completion rate as an objective function, the designer varies the repair strategies utilized by the system. To make claims about the robustness of these repair strategies, the designer must keep all other input and output constant. In particular, the wizard in the experiment must receive utterances through the same speech recognizer as the dialog system. The performance of the wizard on the same quality of input as the dialog system constitutes the gold standard. The designer may also wish to keep the set of repair strategies constant while varying the use or disuse of the speech recognizer to estimate how much the recognizer degrades task completion.

A deep intuition underlies the experimental control of the speech recognizer. As researchers have observed, people with impaired hearing or non-native language skills still manage to communicate effectively despite noisy or uncertain input. Unfortunately, the same cannot be said of computers with analogous deficiencies. People overcome their deficiencies by collaboratively working out the mutual belief that their utterances have been understood sufficiently for current purposes, a process referred to as “grounding” (Clark, 1996). Repair strategies based on grounding indeed show promise for improving the robustness of spoken dialog systems (Paek & Horvitz, 1999; Paek & Horvitz, 2000).

3.1.1 Precautions

A few precautions are in order. First, WOZ studies for establishing a gold standard work best with dialog systems that are highly modular. Modularity makes it possible to test components by replacing a module with the wizard. Without modularity, the approach is harder to use because the boundaries between components are blurred. Second, what allows the performance of the wizard to be used as a gold standard is not the wizard, but rather the fact that the performance constitutes an upper bound. For example, the upper bound may be better established by graphical user interfaces (GUIs) or touch-tone systems, in which case those systems should be the gold standard.

3.2 Descriptive statistics

After designers collect data from the WOZ study, they can make comparative judgments about the performance of their system relative to other systems using a basic set of descriptive statistics. The descriptive statistics rest on first fitting models to the data for both the wizard and the dialog system. Plotting the fitted curves on the same graph sheds light on how best to substantiate any performance claims. In fact, we advocate that designers present this “benchmark graph” to assist readers in interpreting dialog metrics.

Using spoken dialog again as an example, suppose a designer is evaluating the robustness of two dialog systems utilizing two different repair strategies. The designer varies the repair strategies while holding constant the use of the speech recognizer. Numerous researchers have shown that, not surprisingly, task completion rate, or dialog success rate, decreases as speech recognition errors increase. Plotting task completion rate as a function of word error rate discloses an approximately linear relationship (Lamel et al., 2000; Rudnicky, 2000).
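As a rough illustration of how such fitted curves and a benchmark graph might be produced, the sketch below fits simple polynomials to hypothetical (word error rate, task completion rate) measurements for two systems and the wizard, and plots them together. The numbers, the choice of polynomial degrees, and the `fit_curve` helper are assumptions made for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

def fit_curve(wer, completion, degree=1):
    """Least-squares polynomial fit of task completion rate against word error rate."""
    return np.poly1d(np.polyfit(wer, completion, degree))

# Hypothetical measurements: word error rate (%) -> task completion rate (%).
wer      = np.array([5, 15, 25, 35, 45, 55, 65, 75, 85, 95])
system_a = np.array([92, 85, 78, 70, 62, 55, 47, 38, 30, 20])  # roughly linear
system_b = np.array([88, 84, 80, 74, 66, 56, 44, 31, 18,  8])  # roughly polynomial
wizard   = np.array([98, 97, 96, 94, 92, 90, 87, 82, 65, 15])  # gold standard

curves = {
    "System A": fit_curve(wer, system_a, degree=1),
    "System B": fit_curve(wer, system_b, degree=2),
    "Gold Standard": fit_curve(wer, wizard, degree=3),
}

xs = np.linspace(0, 100, 200)
for label, curve in curves.items():
    plt.plot(xs, np.clip(curve(xs), 0, 100), label=label)
plt.xlabel("Word Error Rate (%)")
plt.ylabel("Task Completion Rate (%)")
plt.title("Benchmark Graph")
plt.legend()
plt.show()
```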

[Figure 2. Comparison of two dialog systems with respect to the gold standard: a benchmark graph plotting Task Completion Rate (%) against Word Error Rate (%) for System A, System B, and the Gold Standard.]

[Figure 3. Distance in performance of the two systems from the gold standard: a gold impurity graph plotting the Task Completion Rate Difference (%), |A - G| and |B - G|, against Word Error Rate (%), with the area under each curve marked as System A mass and System B mass.]

Figure 2 displays a benchmark graph for two dialog systems, A and B, utilizing different repair strategies. Suppose that the fitted curve for System A is characteristically linear, while the curve for System B is polynomial. Because wizards are presumably more capable of recovering from recognition errors, their performance data make up the gold standard. Figure 2 shows a fitted curve for the gold standard staying close to the upper right-hand corner of the graph in a monotonically decreasing fashion; that is, task completion rate remains relatively high as word error rate increases and then gracefully degrades before the error rate reaches 100%.

Looking at the benchmark graph, readers immediately get a sense of how to substantiate performance claims about robustness. For example, by noticing that task completion rate for the gold standard rapidly drops from around 65% at the 80% mark to about 15% by 100%, readers know that at an 80% word error rate, even wizards, with human-level intelligence, cannot recover from failures with better than a 65% task completion rate. In other words, the task is difficult. So, even if Systems A and B report low task completion rates after the 80% word error rate, they may be performing relatively well compared to the gold standard.

In making comparative judgments, it helps to plot the absolute difference in performance from the gold standard as a function of the same independent variable as the benchmark graph. Figure 3 displays such a “gold impurity graph” for Systems A and B as a function of word error rate. The closer a system is to the gold standard, the smaller the “mass” of the gold impurity on the graph. Anomalies are easier to see, as they typically show up as bumps or peaks. The advantage of the graph is that if a dialog system reports terrible numbers on various performance metrics but displays a splendidly small gold impurity, the reader can be assured that the system is as good as it can possibly be.

Looking at the gold impurity graph for Systems A and B, without having experienced either of the two systems, readers can make comparative judgments. For example, although B performs worse at lower word error rates than A, after about the 35% mark, B stays closer to the gold standard. With such crosses in performance, designers cannot categorically prefer one system to the other. In fact, assuming that the only difference between A and B is the choice of repair strategies, designers should prefer A to B if the average word error rate for the speech recognizer is below 35%, and B to A if the average error rate is about 40%.
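The gold impurity and its “mass” follow directly from the fitted curves. The sketch below, continuing the hypothetical `curves` from the earlier fitting example, approximates the area under |system − gold| with a simple sum and locates the crossover point where the preferred system changes; the step size and helper name are assumptions.

```python
import numpy as np

def gold_impurity(system_curve, gold_curve, lo=0.0, hi=100.0, step=1.0):
    """Return the word error rates, |system(x) - gold(x)|, and the total 'mass'
    (area under the absolute difference, approximated by a simple sum)."""
    xs = np.arange(lo, hi + step, step)
    diff = np.abs(system_curve(xs) - gold_curve(xs))
    return xs, diff, float(np.sum(diff) * step)

xs, diff_a, mass_a = gold_impurity(curves["System A"], curves["Gold Standard"])
_,  diff_b, mass_b = gold_impurity(curves["System B"], curves["Gold Standard"])
print(f"System A mass: {mass_a:.1f}   System B mass: {mass_b:.1f}")

# Approximate crossover: where the system closer to the gold standard changes.
crossings = xs[np.where(np.diff(np.sign(diff_a - diff_b)) != 0)]
print("Preference crosses over near word error rate(s):", crossings)
```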

With a gold standard, readers are even able to substantiate performance claims about different dialog systems across domain tasks. They need only look at how close each system is to its respective gold standard in a benchmark graph, and how much mass each system shows in a gold impurity graph.

3.2.1 Complexity

One reason why comparative judgments, without a gold standard, are so hard to make across different domain tasks is task complexity. For example, tutoring physics is generally more complex than retrieving email. Another reason is dialog complexity. A physics tutoring system will be less complex if the system forces users to follow a predefined script. An email system that engages in “mixed initiative” will always be more complex because the user can take more possible actions at any point in time.

The way to express complexity in a benchmark graph is to measure the distance of the gold standard to the absolute upper bound of performance. If wizards with human-level intelligence cannot perform close to the absolute upper bound, then the task is complex, or the dialog interface is too restrictive for the wizard, or both. Because complexity is measured only in connection with the gold standard, ceteris paribus, “intellectual complexity” can be defined as:

$$IC = \sum_{x=0}^{n} \left( U - g(x) \right)$$

where U is the upper bound value of a performance metric, n is the upper bound value for an independent variable x, and g(x) is the gold standard along that variable.

Designers can use intellectual complexity to compare systems across different domain tasks if they are not too concerned about discriminating task complexity from dialog complexity. Otherwise, they can use intellectual complexity as an objective function and vary the complexity of the dialog interface to scrutinize how much task complexity affects wizard performance.
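Continuing the same hypothetical example, intellectual complexity can be computed directly from the gold-standard curve as defined above; the discretization step and the clipping of the fitted curve to the metric’s range are added assumptions.

```python
import numpy as np

def intellectual_complexity(gold_curve, upper_bound=100.0, n=100.0, step=1.0):
    """IC = sum over x from 0 to n of (U - g(x)): how far the gold standard
    falls below the absolute upper bound of the performance metric."""
    xs = np.arange(0.0, n + step, step)
    gx = np.clip(gold_curve(xs), 0.0, upper_bound)
    return float(np.sum(upper_bound - gx))

# Using the fitted gold-standard curve from the earlier benchmark-graph sketch.
print(f"Intellectual complexity of the task: {intellectual_complexity(curves['Gold Standard']):.1f}")
```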

3.2.2 Precautions

Before substantiating performance claims with a benchmark graph, designers should exercise a few precautionary measures. First, in model fitting a gold standard or the performance of a dialog system, beware of insufficient data. Without sufficient data, differences from the gold standard may be due to variance in the models. To guarantee that designers have collected enough data, we recommend that they go through an iterative process. First, run subjects, collect data, and fit a model. Then plot the least squares distance, $\sum_i \left( y_i - f(x_i) \right)^2$, where f(x) is the fitted model, against the iteration. Keep running more subjects until the plot seems to approach convergence. To inform readers of the reliability of the fitted models, we suggest that designers either show the convergence plot or report the R² values for their curves (which relate how much of the variance can be accounted for by the fitted models). Second, to guarantee the reliability of the gold standard, use different wizards. The experimental protocol listed this as the last point because it is important to know whether a consistent gold standard is even possible with the given interface. Differences between wizards may reveal serious design flaws. Furthermore, just as adding more subjects improves the fit of the dialog performance models, the law of large numbers applies equally to the gold standard.
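A minimal sketch of this convergence check might look as follows, under the assumption that subjects arrive in batches and that a per-point (mean) squared residual is plotted so the quantity can stabilize as data accumulate; the batch structure, the linear fit, and the stopping tolerance are all illustrative assumptions rather than part of the method.

```python
import numpy as np

def mean_squared_residual(x, y, degree=1):
    """Fit f by least squares and return (1/N) * sum_i (y_i - f(x_i))^2.
    Normalizing by N is an added assumption so the plotted value can converge."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))

def run_until_convergence(batches, tolerance=1.0):
    """Add one batch of subjects per iteration and track the fit's residual error.

    `batches` is a list of (word_error_rates, completion_rates) pairs, one per
    batch of subjects; stop when the change between iterations falls below
    `tolerance`. Plot the returned history against iteration number."""
    x, y, history = np.array([]), np.array([]), []
    for batch_x, batch_y in batches:
        x = np.concatenate([x, batch_x])
        y = np.concatenate([y, batch_y])
        history.append(mean_squared_residual(x, y))
        if len(history) > 1 and abs(history[-1] - history[-2]) < tolerance:
            break
    return history
```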

Finally, designers may encounter problems with residual errors in model fitting that are typically well covered in most statistics textbooks. For example, because the performance metric shown in Figures 2 and 3, task completion rate, has an upper bound of 100%, it is unlikely that residual errors will be equally spread out at all word error rates. Another common problem is the non-normality of the residual errors, which violates the assumptions of the model.

3.2.3 Component analysis

Designers can identify which components are contributing the most to a performance metric by examining the gold impurity graph of the system with and without the component, rendering this kind of test similar to a “lesion” experiment. Carrying out stepwise comparisons of the components, designers can check for tradeoffs, and even use all or part of the mass under the curve as an optimization metric. For example, a designer may wish to improve a dialog system from its current average task completion rate of 70% to 80%. Suppose that System B in Figure 2 incorporates a particular component that System A does not. Looking at the corresponding word error rates in the gold impurity graph for both systems, the mass under the curve for B is slightly greater than that for A. The designer can optimize the performance of the system by selecting components that minimize that mass, in which case the component in System B should be excluded. Because components may interact with each other, designers may want to carry out a multi-dimensional component analysis for optimization.
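A lesion-style comparison of this kind could be scripted as below, reusing the hypothetical fitted curves from the earlier sketches: compute the gold impurity mass with and without the candidate component over the word error rates of interest and keep the configuration with the smaller mass. The particular error-rate range and variant names are assumptions for illustration.

```python
import numpy as np

def restricted_mass(system_curve, gold_curve, lo, hi, step=1.0):
    """Gold impurity mass restricted to the word error rates of interest."""
    xs = np.arange(lo, hi + step, step)
    return float(np.sum(np.abs(system_curve(xs) - gold_curve(xs))) * step)

# Hypothetical lesion test: the same system with and without one candidate component.
variants = {"without_component": curves["System A"],
            "with_component":    curves["System B"]}
masses = {name: restricted_mass(curve, curves["Gold Standard"], lo=20, hi=50)
          for name, curve in variants.items()}
best = min(masses, key=masses.get)
print(masses, "-> keep configuration:", best)
```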

[Figure 4. Dollar amount the designer is willing to pay for improvements to task completion rate: Dollars in Thousands ($) plotted against Task Completion Rate (%).]

3.2.4 Cost valuation

Suppose the main concern of the designer is to optimize the monetary cost of the dialog system. The designer can determine how much improving the system is worth by calculating the average marginal cost. To do this, a cost function must be elicited that conveys what the designer is willing to pay to achieve various levels of performance. This is actually very easy. Figure 4 displays what dollar amount a designer might be willing to pay for various rates of task completion. The average marginal cost can be computed by using the cost function as a weighting factor for the mass under the gold impurity graph for the system. So, following the previous example, if the designer wishes to improve the system from its current average task completion rate of 70% to 80%, then the average marginal cost for that gain is simply:

$$AMC = \sum_{t=70}^{80} c(t) \cdot \left( f(t) - g(t) \right)$$

where f(t) is the task completion rate of the system, g(t) is the task completion rate of the gold standard, and c(t) is the cost function.

Average marginal cost is useful for minimizing expenditure. For example, if the goal is to improve task completion rate from 70% to 80%, and the designer must choose between two systems, one with a particular component and one without, the designer should calculate the average marginal cost of both systems as stated in the above equation and select the cheaper system.
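Transcribing the equation above directly, average marginal cost can be computed as in the sketch below; the cost function is a made-up stand-in for the kind of elicited curve shown in Figure 4, and the constant system and gold-standard values are placeholders.

```python
import numpy as np

def average_marginal_cost(c, f, g, lo=70, hi=80):
    """AMC = sum over t in [lo, hi] of c(t) * (f(t) - g(t)), as defined above.
    c, f, g are callables over the same index t (here, task completion rate)."""
    ts = np.arange(lo, hi + 1, dtype=float)
    return float(np.sum(c(ts) * (f(ts) - g(ts))))

# Purely illustrative stand-ins; none of these numbers come from the paper.
c = lambda t: 0.5 * t                # dollars (thousands) the designer would pay at rate t
f = lambda t: np.full_like(t, 70.0)  # system's task completion rate
g = lambda t: np.full_like(t, 85.0)  # gold standard's task completion rate

print(f"Average marginal cost of the 70% -> 80% improvement: {average_marginal_cost(c, f, g):.1f}")
```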

4 Discussion

Instead of focusing on developing new dialog metrics that allow for comparative judgments across different systems and domain tasks, we proposed empirical methods that accomplish the same purpose while taking advantage of dialog metrics that already exist. In particular, we outlined a protocol for conducting a WOZ experiment to collect human performance data that can be used as a gold standard. We then described how to substantiate performance claims using both a benchmark graph and a gold impurity graph. Finally, we explained how to optimize a dialog system using component analysis and cost valuation.

Without a doubt, the greatest drawback to the empirical methods we propose is the tremendous cost of running WOZ studies, both in terms of time and money. In special cases, such as the DARPA Communicator Project, where participants work within the same domain task, a funding agency may wish to conduct the WOZ studies on behalf of the participants. To defray the cost of running the studies, the agency may wish to determine its own cost function with respect to a given performance metric and utilize average marginal cost to decide which dialog systems to continue sponsoring.

Because the focus of this paper has been on how to apply the empirical methods, hypothetical examples were considered. Work is currently underway to collect data for evaluating implemented dialog systems. We maintain that without these empirical methods, readers of reported dialog metrics cannot really make sense of the numbers.

References

Bernsen, N. O., Dybkjaer, H. & Dybkjaer, L. (1998). Designing interactive speech systems: From first ideas to user testing. Springer-Verlag.

Clark, H.H. (1996). Using language. Cambridge University Press.

Clark, H.H. & Brennan, S.E. (1991). Grounding in communication. In Perspectives on Socially Shared Cognition, APA Books, pp. 127-149.

Clark, H.H. & Schaefer, E.F. (1987). Collaborating on contributions to conversations. Language and Cognitive Processes, 2/1, pp. 19-41.

Clark, H.H. & Schaefer, E.F. (1989). Contributing to discourse. Cognitive Science, 13, pp. 259-294.

Danieli, M. & Gerbino, E. (1995). Metrics for evaluating dialogue strategies in a spoken language system. In Proc. of AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pp. 34-39.

Eckert, W., Levin, E. & Pieraccini, R. (1998). Automatic evaluation of spoken dialogue systems. In TWLT13: Formal semantics and pragmatics of dialogue, pp. 99-110.

Gibbon, D., Moore, R. & Winski, R. (Eds.) (1998). Handbook of standards and resources for spoken language systems. Spoken Language System Assessment, 3, Walter de Gruyter, Berlin.

Glass, J., Polifroni, J., Seneff, S. & Zue, V. (2000). Data collection and performance evaluation of spoken dialogue systems: The MIT experience. In Proc. of ICSLP.

Horvitz, E. & Paek, T. (1999). A computational architecture for conversation. In Proc. of 7th International Conference on User Modeling, Springer Wien, pp. 201-210.

Kamm, C., Walker, M.A. & Litman, D. (1999). Evaluating spoken language systems. In Proc. of AVIOS.

Lamel, L., Rosset, S. & Gauvain, J.L. (2000). Considerations in the design and evaluation of spoken language dialog systems. In Proc. of ICSLP.

Litman, D. & Pan, S. (1999). Empirically evaluating an adaptable spoken dialogue system. In Proc. of 7th International Conference on User Modeling, Springer Wien, pp. 55-64.

Paek, T. & Horvitz, E. (1999). Uncertainty, utility, and misunderstanding. In Proc. of AAAI Fall Symposium on Psychological Models of Communication, pp. 85-92.

Paek, T. & Horvitz, E. (2000). Conversation as action under uncertainty. In Proc. of 16th UAI, Morgan Kaufmann, pp. 455-464.

Rudnicky, A. (2000). Understanding system performance in dialog systems. MSR invited talk.

Walker, M.A., Litman, D., Kamm, C. & Abella, A. (1997). PARADISE: A framework for evaluating spoken dialogue agents. In Proc. of 35th ACL.