

Proceedings of the SIGDIAL 2017 Conference, pages 299–309, Saarbrücken, Germany, 15-17 August 2017. © 2017 Association for Computational Linguistics

Predicting Causes of Reformulation in Intelligent Assistants

Shumpei Sano, Nobuhiro Kaji, and Manabu Sassano
Yahoo Japan Corporation

1-3 Kioicho, Chiyoda-ku, Tokyo 102-8282, Japan
{shsano, nkaji, msassano}@yahoo-corp.jp

Abstract

Intelligent assistants (IAs) such as Siri and Cortana conversationally interact with users and execute a wide range of actions (e.g., searching the Web, setting alarms, and chatting). IAs can support these actions through the combination of various components such as automatic speech recognition, natural language understanding, and language generation. However, the complexity of these components hinders developers from determining which component causes an error. To remove this hindrance, we focus on reformulation, which is a useful signal of user dissatisfaction, and propose a method to predict the reformulation causes. We evaluate the method using the user logs of a commercial IA. The experimental results have demonstrated that features designed to detect the error of a specific component improve the performance of reformulation cause detection.

1 Introduction

Intelligent assistants (IAs) such as Apple's Siri and Cortana have gained considerable attention as mobile devices have become prevalent in our daily lives. They are hybrids of search and dialogue systems that conversationally interact with users and execute a wide range of actions (e.g., searching the Web, setting alarms, making phone calls and chatting). IAs can support these actions through the combination of various components such as automatic speech recognition (ASR), natural language understanding (NLU), and language generation (LG). One major concern in the development of commercial IAs is how to speed up the cycle of system performance enhancement.

The enhancement process is often performed by manually investigating user logs, finding erroneous data, detecting the component responsible for that error, and updating the component. As IAs are composed of various components, error cause detection becomes an obstacle in the manual process. In this paper, we attempt to automate the error cause detection to overcome this obstacle.

One approach to do this is to utilize user feedback. In this work, we focus on reformulation, i.e., when a user modifies the previous input. In web search and dialogue systems, reformulation is known as an implicit feedback signal that the user could not receive a desired response to the previous input due to one or more system components failing. In IAs, ASR error is a major cause of reformulation and has been extensively studied (Hassan et al., 2015; Schmitt and Ultes, 2015).

Besides correcting ASR errors, users of IAs reformulate their previous utterances when they encounter NLU errors, LG errors, and so on. For example, when a user utters "alarm", the NLU component may mistakenly conclude that s/he wants to perform a Web search, and consequently the system shows the search results for "alarm." Sarikaya (2017) reported that only 12% of the errors in an IA system are related to ASR components, which is the smallest percentage across six components. They also reported that the NLU component is the biggest source of errors (24%). Therefore, the errors related to the other components should not be ignored to improve system performance. However, previous work mainly focused on reformulation caused by ASR error, and reformulation caused by the other components has received little attention.

In this work, we propose a method to predict reformulation causes, i.e., to detect the component responsible for causing the reformulation in IAs. Features are divided into several categories mainly on the basis of their relations with the components in an IA. The experiments demonstrate that these features can improve the error detection performance of the corresponding components. The proposed method, which combines all feature sets, outperforms the baseline, which uses component-independent features such as session information and reformulation-related information.

Our work makes the following contributions. First, we investigate the reformulation causes among the components in IAs from real data of a commercial IA. Second, we create a dataset of human-annotated data obtained from a commercial IA. Finally, we develop a method to predict reformulation causes in IAs.

2 Related Work

Three research areas are related to our work, and the most closely related is reformulation (also called correction). As reformulation is frequently caused by system errors, the second related area is error analysis and error detection in search or dialogue systems. The third area is system evaluation in search or dialogue systems. Reformulation is a useful feature for system evaluation.

2.1 Reformulation and Correction

Users of search or dialogue systems often reformulate their previous inputs when trying to obtain better results (Hassan et al., 2013) or correct errors (e.g., ASR errors) (Jiang et al., 2013; Hassan et al., 2015). Our research focuses on the latter category of reformulation, which relates to correction. Studies on correction have mainly focused on automatic detection of correction (Levow, 1998; Hirschberg et al., 2001; Litman et al., 2006). Some studies have also tried to improve system performance beyond correction detection. Shokouhi et al. (2016) constructed a large-scale dataset of correction pairs from search logs and showed that the database enables ASR performance to be improved.

The research most related to ours is that of Hassan et al. (2015). In addition to reformulation detection, they proposed a method to detect whether a reformulation is caused by ASR error or not. We extend their study to determine which component (e.g., ASR, NLU, and LG) is responsible for the error.

2.2 Error Analysis and Error Detection

Besides reformulation, researchers have studied system errors in search or dialogue systems. For example, Meena et al. (2015) and Hirst et al. (1994) focused on miscommunication in spoken dialogue systems, and Feild et al. (2010) focused on user frustrations in search systems. In this paper, we focus on predicting the cause of errors among the different components in IAs. In spoken dialogue systems, Georgiladakis et al. (2016) reported that ASR error is the most frequent cause of errors among seven components (65.9% in the Let's Go dataset (Raux et al., 2005)). On the other hand, Sarikaya (2017) reported that ASR error is the least frequent cause of errors among six components (12% in an IA). These results indicate that errors in IAs have various causes and that the causes other than ASR error should also not be ignored. In this paper, we focus on reformulation and propose a method to automatically detect the reformulation causes in IAs.

2.3 System Evaluation

Users of search or conversational systems often reformulate their inputs when they are dissatisfied with the system responses to their previous inputs. Therefore, reformulation is a useful signal of user dissatisfaction, and information related to reformulation has been widely used as a feature to automatically evaluate system performance in web search (Hassan et al., 2013), dialogue (Schmitt and Ultes, 2015), and IA systems (Jiang et al., 2015; Kiseleva et al., 2016; Sano et al., 2016). However, these studies paid little attention to detecting causes of user dissatisfaction. Information about these causes is beneficial to developers both for improving system design and for engineering features of automatic evaluation methods. Thus, we propose a method to automatically detect the reformulation causes in IAs.

3 Reformulation Cause Prediction

In this section, we describe the task of reformulation cause prediction in IAs.

3.1 Definition

First, we define the notations used in this paper.

U1 : The user utterance

R : The corresponding system response to U1


Label                  U1                         R                                          U2
No error               What's the weather?        It will be sunny today.                    What's the weather tomorrow?
ASR error              What's the.                Sorry?                                     What's the weather?
NLU error              Alarm.                     Here are the search results for "Alarm".   Open alarm.
LG error               What's your name?          I'm twenty years old.                      Tell me your name.
Unsupported action     Play videos of cats.       Sorry, I can't support that action.        Search for videos of cats.
Endpoint error         Search Obama's age.        No results found for "Obama's age".        Search Obama.
Uninterpretable input  Aaaa.                      Sorry?                                     Aaa.

Table 1: List of annotation labels. U1, R, and U2 are example conversations.

Label                  Rate    Error Rate
No error               38.7%   N.A.
ASR error              31.7%   57.2%
NLU error              17.3%   31.2%
LG error               5.1%    9.2%
Unsupported action     0.8%    1.4%
Endpoint error         0.5%    0.9%
Uninterpretable input  5.9%    N.A.

Table 2: Percentage of annotation labels in our dataset. Error rate is calculated using labels related to reformulation causes.

U2 : The utterance following U1

Reformulation : A pair (U1, U2) is a reformulation if U2 is uttered to modify U1 in order to satisfy the same intent as in U1

To define reformulation, we referred to the definition of Hassan et al. (2015), which is used for voice search systems.

3.2 Corpus

We constructed a dataset of user logs of a commercial IA¹ for analyzing and predicting reformulation causes. We randomly sampled 1,000 utterance pairs of (U1, U2) and corresponding information with the following conditions.

• U1 and U2 are text or voice inputs

• Interval time between U1 and U2 is equal to or less than 30 minutes (the same as in previous research (Jiang et al., 2015; Sano et al., 2016))

• Samples where the normalized Levenshtein edit distance (Li and Liu, 2007) between U1 and U2 is equal to 0 (i.e., U1 and U2 are identical utterances) or more than 0.5 are excluded

¹Because the IA supports only Japanese, all utterances are made in Japanese. In this paper, we present English translations rather than the original Japanese to facilitate non-Japanese readers' understanding.

• U1 and U2 were both uttered in June, 2016

With the first three conditions, we can exclude utterances that are not reformulations and can focus on reformulation. We calculated character-based, rather than word-based, edit distance (white spaces are ignored), because we found word-based edit distances sometimes fail to identify reformulation pairs such as "what's up" and "whatsapp."
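The filtering step above can be sketched as follows. This is a minimal character-based implementation that assumes the normalized distance is the edit distance divided by the length of the longer string; the helper names are ours, not the paper's.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def normalized_distance(u1: str, u2: str) -> float:
    # Character-based comparison; white spaces are ignored, as in the paper.
    s1, s2 = u1.replace(" ", ""), u2.replace(" ", "")
    if not s1 and not s2:
        return 0.0
    return levenshtein(s1, s2) / max(len(s1), len(s2))

def keep_pair(u1: str, u2: str) -> bool:
    # Keep pairs with 0 < distance <= 0.5: exclude identical utterances
    # and pairs that are too dissimilar to be reformulations.
    d = normalized_distance(u1, u2)
    return 0.0 < d <= 0.5
```

On the paper's example, "whats up" vs. "whatsapp" differ by only two character edits after removing spaces, so the pair survives the filter even though no word is shared.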

All samples in the dataset are manually annotated with the label of the reformulation cause among the different components in IAs. Table 1 lists the annotation labels, which we explain here. In this paper, we assume an IA system that has the components shown in Figure 1. ASR error means that the ASR component misrecognizes U1. NLU error means that ASR is correct but the NLU component misunderstands the intent of U1 or fails to fill one or more slots correctly. LG error means that ASR and NLU are correct but the LG component fails to generate an appropriate response. This error is mainly caused by response generation failures for chat intent. Note that, in our system, the NLU component only determines whether or not an utterance is chat intent, and the LG component generates the appropriate response to the utterance. Therefore, we consider that the case shown in Table 1 is an LG error rather than an NLU error. Endpoint error means that an endpoint API (application program interface) fails to respond with correct information. Unsupported action means that the system cannot support the action that the user expects and so cannot generate a correct response. No error means that a sample contains no error. Submitting similar utterances to obtain better results in search intents (e.g., U1 is "Search for image of strawberry wallpaper" and U2 is "Search for image of strawberry") and using similar functions (e.g., U1 is "Turn on Wi-Fi" and U2 is "Turn off Wi-Fi") are typical utterances with this label. Finally, Uninterpretable input means that U1 is uninterpretable.

As expert knowledge about the components of IAs is required for the annotation, it is performed by an expert developer of the commercial IA. Figure 2 shows the annotation flowchart. First, we listen to the voice of U1 and read the texts of U1, R, and U2. The text information is used to support guessing the user intent of U1. Uninterpretable input, such as one-word utterances and misrecognized background noise, is distinguished in this phase. Next, ASR error samples are distinguished using the transcription of U1. Afterwards, No error samples are distinguished if R correctly satisfies the intent of U1. Finally, one of the other four error causes is annotated to each remaining sample. The annotation results are shown in Table 2. We ignored the cases where a latter component could recover from an error generated by a former component because such cases were rarely observed. For example, the percentage of cases where the NLU component recovered from an ASR error is 1.2% in our dataset.

[Figure 1: block diagram of a typical IA. Voice input passes through automatic speech recognition; text input joins it at natural language understanding, followed by language generation, which produces the response. The components use context information (e.g., previous input, current location) and access backend DBs (e.g., a knowledge base DB) and endpoint APIs (e.g., a weather information API).]

Figure 1: Components and processes of typical IA system.

3.3 Discussion on Annotated Labels

Here, we discuss the annotation results and which labels should be included or excluded in reformulation cause prediction. In short, we do not use Uninterpretable input, Unsupported action, or Endpoint error for reformulation cause prediction. We explain below why these labels are excluded.

As shown in Table 2, the most frequent cause is ASR error. This result differs from that of Sarikaya (2017), in which ASR error is the least frequent cause. The reason for the difference is that Sarikaya (2017) used whole samples, whereas we use only reformulation samples. These results indicate that the tendency to reformulate when a user encounters an error differs depending on the cause of the error. For example, the percentage of unsupported actions is 1.4%, which is much smaller than that reported by Sarikaya (2017), 14%. This finding indicates that when users encounter a response notifying them that the action they expected is unsupported, they would rather give up than reformulate their previous utterances. The same is true for endpoint error.

Next, we discuss which labels should be included or excluded in the task. First, Uninterpretable input should be excluded because these utterances have no appropriate responses and become noise for the task. We also exclude Unsupported action and Endpoint error. An IA does not require user feedback for these errors because no component in the IA is responsible for them, and the IA is already aware of their causes. In addition, the benefits of detecting these errors are limited because they are rarely observed in reformulation, as shown in Table 2. In conclusion, we use No error, ASR error, NLU error, and LG error for the task.

[Figure 2: annotation flowchart. Listen to the U1 voice and read the U1, R, and U2 text. If U1 is not interpretable, annotate Uninterpretable input. Otherwise, transcribe the U1 voice; if the transcription and the ASR text differ, annotate ASR error. Otherwise, if R is an appropriate response to U1, annotate No error; if not, annotate one of the other cause labels (NLU error, LG error, Endpoint error, or Unsupported action).]

Figure 2: Annotation flowchart.

3.4 Task Definition

Here, we describe the task of reformulation cause prediction. Our goal is to predict the reformulation cause of U1 among the different components using the information from U1 and U2. Specifically, we predict one of the four labels in Table 1: No error, ASR error, NLU error, and LG error. For example, we want to predict NLU error from the conversation log (U1: "Alarm.", R: "Here are the search results for Alarm.", U2: "Open alarm."). These labels are useful for IA developers to improve system performance.

4 Analysis

We analyze the statistical differences between reformulation causes prior to the experiments and exploit the findings for feature engineering.

4.1 Analysis of Correction Types

Here, we analyze the correction types when an IA user is trying to correct a previous utterance. We simplify the definition of Swerts et al. (2000) and define four correction types: ADD, OMIT, PAR, and OTHER. Their definitions are as follows.

ADD A sequence of words is added to U1.

OMIT A sequence of words in U1 is omitted in U2.

PAR A sequence of words in U1 changes into another sequence of words in U2.

OTHER Matches none of the preceding correction types.

Note that U1 and U2 have at least one word in common in ADD, OMIT, and PAR. Table 3 shows examples of each correction type.

Type    Example
ADD     What's the weather in Nara today?
OMIT    What's the weather?
PAR     How about the weather today?
OTHER   How about the humidity today?

Table 3: Example corrections of "What's the weather today?" in each correction type.

            ADD   OMIT  PAR   OTHER
No error    27.1   7.2  58.1   7.4
ASR error    7.5   6.0  74.2  12.3
NLU error   27.9  19.8  41.9  10.5
LG error    23.5  19.6  47.1   9.8

Table 4: Distribution of correction types (%).
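These correction types can be roughly approximated with a word-set heuristic. Note that this is a simplification of the sequence-based definition of Swerts et al. (2000) and may disagree with it on boundary cases; the function names are hypothetical.

```python
import string

def _words(utterance: str):
    # Lowercase and strip leading/trailing punctuation from each token.
    return [w.strip(string.punctuation) for w in utterance.lower().split()]

def correction_type(u1: str, u2: str) -> str:
    s1, s2 = set(_words(u1)), set(_words(u2))
    if not (s1 & s2):
        return "OTHER"          # no word in common
    added, omitted = s2 - s1, s1 - s2
    if added and not omitted:
        return "ADD"            # words were only added to U1
    if omitted and not added:
        return "OMIT"           # words of U1 were only omitted
    if added and omitted:
        return "PAR"            # some words changed into others
    return "OTHER"              # identical word sets (e.g., reordering)
```

On the examples in Table 3 this heuristic recovers ADD, OMIT, and PAR, though a pure set-based view cannot always separate PAR from OTHER the way a sequence alignment can.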

Table 4 presents the distribution of correction types. As shown in Table 4, the percentage of PAR in ASR error is higher than those of the other correction types. In No error, the percentages of ADD and PAR are large. We can also see that NLU error and LG error have similar distributions and that OMIT is more frequent than in the other correction types. This is because when users encounter an NLU error or LG error, they use different types of corrections depending on the situation: adding words to clarify their intent, omitting words that seem to be noise for the system, and so on. We expect correction types to be useful for detecting reformulation causes, as the distribution of correction types differs for different error types.

4.2 Analysis of Input Types

Here, we analyze the correlation between error causes and input types. Input types sometimes differ between U1 and U2. For example, Jiang et al. (2013) have reported that users switch from voice input to text input when they encounter ASR errors. Table 5 presents the distribution of input type switches.

            V2V   T2T   V2T  T2V
No error   75.2  23.5   1.3  0.0
ASR error  94.6   0.0   5.4  0.0
NLU error  70.5  23.8   1.7  4.6
LG error   76.5  23.5   0.0  0.0

Table 5: Distribution of input type switches (%). For example, V2T means the input type of U1 is voice input and that of U2 is text input.

As shown in Table 5, the distribution of input type switches differs slightly among the error causes. As Jiang et al. (2013) have shown, voice-to-text switches are more frequent for ASR error than for the other causes. We can also see that both text-to-voice and voice-to-text switches are observed for NLU error. Compared with these causes, input type switches are rarely observed for No error. Interestingly, text-to-voice switches are not observed for No error either. On the other hand, a small number of voice-to-text switches are observed for No error. We guess that some users switch from voice to text when they submit a similar query by copying and pasting part of the ASR result of their previous utterance. These findings suggest that users tend to keep using the same input type while their intents are correctly recognized. Unlike the other causes, no input type switches are observed for LG error. We expect this is due to insufficient data.

5 Features

We divide the features into five categories by their functions. Session features and reformulation features, which are useful for reformulation prediction (Hassan et al., 2015) and system evaluation (Jiang et al., 2015), are designed to distinguish between errors and non-errors. Though these features are also useful for reformulation cause detection, they are not specialized for detecting reformulation causes. ASR, NLU, and LG features are designed to distinguish reformulation causes related to their corresponding components from causes related to the other components. Table 6 lists the features.

5.1 Session Features

Features related to session information belong to this category. InputType is 1 if an utterance is text input and 0 if it is voice input (Hassan et al., 2015). In a web search system, the interval time between inputs is a useful indicator of search success (Huang and Efthimiadis, 2009). Therefore, Interval is useful for distinguishing No error from the other labels. If the CharLen or WordLen of an utterance is long or short, the utterance possibly contains noise or lacks information for the system.

5.2 Reformulation Features

Features related to reformulation belong to this category. Features in this category are widely used in previous methods such as query reformulation detection (Hassan et al., 2015), error detection (Litman et al., 2006), and system performance evaluation (Jiang et al., 2015). The correction type t of Correction(t) is one of ADD, OMIT, PAR, or OTHER, described in Section 4.1. As shown in Section 4.1, the distribution of correction types differs among annotation labels. Therefore, Correction(t) provides useful information for predicting reformulation causes. Voice2Text and Text2Voice are designed to distinguish ASR error and NLU error from the other errors on the basis of the analysis in Section 4.2.

5.3 ASR Features

Features related to the ASR component belong to this category. Low ASRConf indicates speech recognition errors (Hassan et al., 2015). The probability of misrecognition increases as the recognized voice length increases (Hassan et al., 2015). Therefore, a long VoiceLen may be one signal related to ASR error. Note that these features are calculated only when the input type of the utterance is voice input.

5.4 NLU Features

Features related to the NLU component belong to this category. When a user's intent in U1 is misunderstood and that in U2 is correctly understood, the recognized intents or filled slots of the two utterances are different. On the other hand, when a user's intent in U1 is correctly understood by the system, the user sometimes uses similar functions subsequently (e.g., requesting weather information for Osaka after requesting it for Tokyo). Therefore, DifferentIntent and DifferentSlot are useful to distinguish NLU error from the other errors, and SameIntent is useful to distinguish No error from the other errors. Note that the intents used in these features are the intents recognized by the IA system, such as weather information, web search, application launch, and chat.

Category       Name             Definition
Session        CharLen*         Number of characters in utterance
               WordLen*         Number of words in utterance
               InputType*       1 if utterance is text input, else 0
               Interval         Time between U1 and U2
Reformulation  EditDistance     Normalized Levenshtein edit distance between U1 and U2
               Correction(t)    1 if U2 is correction type t of U1
               CommonWords      Number of words appearing in both U1 and U2
               Voice2Text       1 if U1 is voice input and U2 is text input
               Text2Voice       1 if U1 is text input and U2 is voice input
ASR            ASRConf*         Speech recognition confidence
               VoiceLen*        Speech recognition time
NLU            SameIntent       1 if recognized intents of U1 and U2 are the same
               DifferentIntent  1 if recognized intents of U1 and U2 are different
               DifferentSlot    1 if some slots in U2 are different from those in U1
               IntentType(t)*   1 if recognized intent of utterance is t
LG             DialogAct(t)*    1 if utterance contains phrases in dialogue act t

Table 6: List of features. Features marked with "*" were computed for both U1 and U2.

Type         Examples
Praise       Wow!; Great.
Thanking     Thanks.; Thank you.
Backchannel  I see.; Yeah.
Accept       Yes.; Exactly.
Abuse        Shit.; Shut up.
Reject       No.; Not like that.
IDU          What do you mean?

Table 7: List of dialog acts.

5.5 LG Features

Features related to the LG component belong to this category. If users expect the system to chat, their utterances may contain phrases commonly used in chatting. DialogAct(t) is designed to detect these phrases in the utterance. User utterances of chat intent in IAs have unique characteristics (e.g., some users curse at the intelligent assistants (Akasaki and Kaji, 2017)). We defined seven types of dialogue acts that are common in chats between users and IAs, as listed in Table 7. Frequently occurring phrases in the user log of the commercial IA are used as the phrases of each dialogue act.
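A rough sketch of how the feature sets in Table 6 might be assembled into one feature dictionary per (U1, U2) pair follows. The `Utterance` record and its field names are assumptions for illustration; the paper does not describe its internal data format, and the EditDistance and Correction(t) features are omitted here for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    text: str
    is_text_input: bool                  # False for voice input
    asr_conf: float = 0.0                # ASRConf (voice input only)
    voice_len: float = 0.0               # VoiceLen (voice input only)
    intent: str = ""                     # intent recognized by the NLU component
    slots: dict = field(default_factory=dict)
    dialog_acts: set = field(default_factory=set)   # matched dialogue-act types

def extract_features(u1: Utterance, u2: Utterance, interval_sec: float) -> dict:
    f = {}
    # Features marked "*" in Table 6 are computed for both U1 and U2.
    for tag, u in (("u1", u1), ("u2", u2)):
        f[f"char_len_{tag}"] = len(u.text.replace(" ", ""))
        f[f"word_len_{tag}"] = len(u.text.split())
        f[f"input_type_{tag}"] = 1 if u.is_text_input else 0
        f[f"asr_conf_{tag}"] = u.asr_conf
        f[f"voice_len_{tag}"] = u.voice_len
        f[f"intent_type_{u.intent}_{tag}"] = 1
        for act in u.dialog_acts:
            f[f"dialog_act_{act}_{tag}"] = 1
    # Session and reformulation features
    f["interval"] = interval_sec
    w1, w2 = set(u1.text.split()), set(u2.text.split())
    f["common_words"] = len(w1 & w2)
    f["voice2text"] = int(not u1.is_text_input and u2.is_text_input)
    f["text2voice"] = int(u1.is_text_input and not u2.is_text_input)
    # NLU features
    f["same_intent"] = int(u1.intent == u2.intent)
    f["different_intent"] = int(u1.intent != u2.intent)
    f["different_slot"] = int(u1.slots != u2.slots)
    return f
```

The resulting dictionaries can then be vectorized (e.g., with a feature hasher or a dict vectorizer) before training the classifier described in Section 6.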

6 Experiments

6.1 Experimental Settings

The experimental settings of reformulation cause prediction are as follows.

• The dataset described in Section 3.2 is used for evaluation. It contains 928 samples.

• We evaluate the performance of the model with 10-fold cross validation.

• We train the model using a linear SVM classifier with the features described in Section 5.

• We optimize the hyperparameters of the classifier with an additional 5-fold cross validation using only the training sets (the 9 folds used for training are combined and split into 5 new folds in each validation).

• The baseline model is trained under the same conditions except that it uses only Session and Reformulation features. Comparison between the proposed and baseline methods enables us to evaluate the effect of the features related to the components of an IA.

We chose a linear SVM because of its scalability and because it outperformed an RBF-kernel SVM.
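The evaluation protocol above (10-fold outer cross validation for scoring, with an inner 5-fold grid search on each training split for hyperparameter tuning) can be sketched with scikit-learn. The candidate C values are an assumption; the paper does not list its hyperparameter grid.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.metrics import f1_score

def evaluate(X, y):
    # Inner 5-fold grid search, run on each outer training split only.
    inner = GridSearchCV(
        LinearSVC(max_iter=10000),
        param_grid={"C": [0.01, 0.1, 1, 10]},
        cv=5,
        scoring="f1_weighted",
    )
    # Outer 10-fold cross validation produces one prediction per sample.
    y_pred = cross_val_predict(inner, X, y, cv=10)
    return f1_score(y, y_pred, average="weighted")
```

Wrapping the grid search inside `cross_val_predict` keeps the tuning data strictly inside each training split, so the outer scores remain unbiased.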


               Precision  Recall  F1
Baseline (B.)  0.51       0.53    0.51
Proposed       0.67       0.66    0.67+
B. + ASR       0.56       0.59    0.57+
B. + NLU       0.63       0.60    0.61⋆
B. + LG        0.50       0.50    0.49

Table 8: The results of the reformulation cause prediction.

(a) Baseline
Gold \ Predicted  No   ASR  NLU  LG
No error          216  132  29   10
ASR error         67   219  21   10
NLU error         61   54   52   6
LG error          19   17   14   1

(b) Proposed
Gold \ Predicted  No   ASR  NLU  LG
No error          284  55   27   21
ASR error         38   230  37   12
NLU error         44   29   81   19
LG error          8    12   11   20

Table 9: Confusion matrix of reformulation cause prediction of (a) baseline and (b) proposed methods.

6.2 Results

Table 8 presents the results of the reformulation cause prediction. The first row compares the proposed method with the baseline. The proposed method obtains a 0.67-point F1-measure and outperforms the baseline. This result shows the effectiveness of the features related to the components of the IA. The second row illustrates the performance when one feature set is added to the baseline. We can see that the ASR and NLU features improve the performance of the baseline. In F1-measure, statistically significant differences from the baseline detected by the paired t-test are denoted by + (p < 0.01) and ⋆ (p < 0.05).
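The significance test amounts to a paired t statistic over the per-fold scores of two systems. A self-contained sketch, assuming the per-fold F1 scores are available (the fold scores in the test are toy values):

```python
import math
import statistics

def paired_t_test(scores_a, scores_b):
    """Paired t statistic over per-fold scores of two systems.

    Returns (t, degrees_of_freedom); |t| is then compared against the
    critical value of Student's t distribution with n - 1 degrees of
    freedom at the chosen significance level.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample standard deviation
    t = mean / (sd / math.sqrt(n))
    return t, n - 1
```

In practice, `scipy.stats.ttest_rel` computes the same statistic along with a p-value.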

Table 9 presents the confusion matrices of the proposed and baseline methods. The diagonal entries show agreement between the gold labels and the predicted labels. As shown in Table 9, the proposed method outperforms the baseline regardless of the gold label. Again, these results indicate the effectiveness of the features related to the components of the IA.

Table 10 presents the F1-measures for each gold label. Again, the proposed method outperforms the other methods. Focusing on individual labels, the F1-measures of the proposed method are better for No error and ASR error than for NLU error and LG error. In other words, the F1-measures for NLU error and LG error are not high. As the performance on each label follows the order of the number of samples belonging to that label, we expect that these low performances are mainly due to insufficient data and will improve given sufficient data. Focusing on individual feature sets, ASR and NLU features are useful for distinguishing No error from the other labels. Note that statistically significant differences from the baseline detected by the paired t-test are denoted by + (p < 0.01) and ⋆ (p < 0.05).

               No     ASR    NLU    LG
Baseline (B.)  0.58   0.59   0.36   0.03
Proposed       0.75+  0.72+  0.49⋆  0.33+
B. + ASR       0.66+  0.67+  0.35   0.16
B. + NLU       0.71+  0.65   0.43   0.25⋆
B. + LG        0.55   0.57   0.32   0.08

Table 10: F1-measure in reformulation cause prediction for each label.

6.3 Results by Input Types

Here we analyze the results of Table 8 by input type. Table 11 presents the results of reformulation cause prediction by input type. As shown in Table 11, the F1-measures of the proposed method are 0.66 for both voice inputs and text inputs. These results suggest that the proposed method is robust for both input types. On the other hand, the F1-measure of the baseline method for voice inputs is lower than that for text inputs. In particular, the F1-measure for No error in voice inputs is lower than that in text inputs. Table 12 presents the distribution of predicted labels under the following two conditions: U1 and U2 are voice inputs, and the gold labels of all samples are No error. As shown in Table 12, the proposed method misclassifies No error as ASR error less often than the baseline method does. In other words, the proposed method distinguishes between No error and ASR error more accurately than the baseline method. These results suggest that ASR features contribute to the performance improvement of the proposed method.


            No    ASR   NLU   LG    total
Baseline    0.49  0.58  0.33  0.03  0.47
Proposed    0.73  0.71  0.49  0.30  0.66
# samples   291   300   122   39    752

(a) U1 and U2 are voice inputs.

            No    ASR   NLU   LG    total
Baseline    0.80  N.A.  0.32  0.00  0.60
Proposed    0.81  N.A.  0.42  0.39  0.66
# samples   91    N.A.  40    12    143

(b) U1 and U2 are text inputs.

Table 11: F1-measure in reformulation cause prediction of each label between input types.

Predicted label   No    ASR   NLU   LG
Baseline          129   129   25    8
Proposed          205   53    16    17

Table 12: The number of predicted samples in reformulation cause prediction under the following two conditions. First, U1 and U2 are voice inputs. Second, gold labels of all samples are No error.
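The counts in Table 12 amount to one row of a confusion matrix restricted to a subset of the data: among voice-input samples whose gold label is No error, tally how each system distributed its predictions. A minimal sketch (the sample records and field names are invented for illustration):

```python
# Sketch of the Table 12 analysis: among voice-input samples whose gold
# label is "No" (No error), count each system's predicted-label distribution.
# The sample records and field names below are illustrative, not real logs.
from collections import Counter

LABELS = ["No", "ASR", "NLU", "LG"]

def predicted_distribution(samples, system):
    """Counts of predicted labels over samples with gold 'No' and voice input."""
    subset = [s[system] for s in samples
              if s["gold"] == "No" and s["input_type"] == "voice"]
    counts = Counter(subset)
    return [counts.get(label, 0) for label in LABELS]

samples = [
    {"gold": "No",  "input_type": "voice", "baseline": "ASR", "proposed": "No"},
    {"gold": "No",  "input_type": "voice", "baseline": "No",  "proposed": "No"},
    {"gold": "No",  "input_type": "text",  "baseline": "No",  "proposed": "No"},
    {"gold": "ASR", "input_type": "voice", "baseline": "ASR", "proposed": "ASR"},
]

print(predicted_distribution(samples, "baseline"))  # [1, 1, 0, 0]
print(predicted_distribution(samples, "proposed"))  # [2, 0, 0, 0]
```

Only the first two records satisfy both conditions, so the text-input and non-No-error records are excluded from the tally, as in Table 12.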

6.4 Investigation of Feature Weights

We investigate the weights of the features learned by the linear-kernel SVM to clarify which features contribute to the reformulation cause prediction.

Table 13 presents the top and bottom feature weights for each label. The median values over the 10 models obtained with cross validation are used as the weights in Table 13. Features calculated from U1 appear in Table 13, but those calculated from U2 do not. This result is not surprising because information related to U1 is more closely related to reformulation causes than information related to U2. Next, we focus on the features for individual labels. Features related to their corresponding components appear in Table 13, such as ASRConf in ASR error, SameIntent in NLU error, and DialogAct in LG error. These results indicate that features designed to detect reformulation causes related to their corresponding components work as designed. Finally, we focus on individual features. We observe that features of input type switches are useful for predicting reformulation causes. In particular, Voice2Text is useful for detecting ASR error, and Text2Voice is useful for detecting NLU error. These results are consistent with the findings in section 4.2.

Label      Feature                       Weight
No error   ASRConf*                       1.07
           IntentType(SingSong)*          1.05
           Voice2Text                    -1.01
           IntentType(DeviceControl)*    -1.10
ASR error  Voice2Text                     1.27
           IntentType(Search)*            0.90
           ASRConf*                      -1.51
           InputType*                    -1.77
NLU error  Text2Voice                     1.43
           InputType*                     0.94
           SameIntent                    -0.61
           IntentType(Dictionary)*       -0.77
LG error   IntentType(DeviceControl)*     1.08
           DialogAct(IDU)*                0.81
           DialogAct(Praise)*            -0.95
           Voice2Text                    -1.04

Table 13: Top and bottom two feature weights of the proposed method. Features marked with "*" were computed for U1.
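The inspection behind Table 13 can be sketched as follows: train a linear SVM per cross-validation fold, then take the median of each (label, feature) weight across the 10 folds. The data, feature names, and the use of scikit-learn's LinearSVC (one-vs-rest, so `coef_` has one weight row per label) are our assumptions for illustration; the paper's actual features and logs are not reproduced here.

```python
# Sketch of the feature-weight inspection in section 6.4: fit one linear
# SVM per CV fold, then take the median weight per (label, feature) pair.
# The data is synthetic and the feature/label names are illustrative.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # 4 toy features
y = rng.integers(0, 4, size=200)          # 4 labels: No/ASR/NLU/LG
feature_names = ["ASRConf", "Voice2Text", "Text2Voice", "SameIntent"]
label_names = ["No error", "ASR error", "NLU error", "LG error"]

fold_weights = []
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X[train_idx], y[train_idx])
    fold_weights.append(clf.coef_)        # shape: (4 labels, 4 features)

# Median over the 10 fold models, as done for Table 13.
median_weights = np.median(np.stack(fold_weights), axis=0)
for label, row in zip(label_names, median_weights):
    ranked = sorted(zip(feature_names, row), key=lambda t: -t[1])
    print(label, [(name, round(w, 2)) for name, w in ranked])
```

Sorting each label's row by weight then exposes the top (most positive) and bottom (most negative) features per label, which is what Table 13 reports.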

7 Future Work

While the proposed method has outperformed the baseline, there is room for improvement in its performance. As mentioned in section 6.2, the performances of the individual labels follow the order of the number of samples belonging to each label. Therefore, we expect that the performance will improve as the dataset size increases. The performance would also improve if some features used in previous studies were added to ours. For example, linguistic features such as word n-grams and language model scores were used in previous studies (Hassan et al., 2015; Meena et al., 2015) but not in ours.

8 Conclusion

This paper attempted to predict reformulation causes in intelligent assistants (IAs). Prior to the prediction, we first analyzed the causes of reformulation in IAs using user logs obtained from a commercial IA. Based on the analysis, we defined reformulation cause prediction as a four-class classification problem of classifying user utterances into No error, ASR error, NLU error, or LG error. Features were divided into five categories mainly on the basis of their relations with the components in the IA. The experiments demonstrated that the proposed method, which combines all feature sets, outperforms the baseline, which uses component-independent features such as session information and reformulation-related information.


References

Satoshi Akasaki and Nobuhiro Kaji. 2017. Chat detection in an intelligent assistant: Combining task-oriented and non-task-oriented spoken dialogue systems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics (to appear).

Henry A. Feild, James Allan, and Rosie Jones. 2010. Predicting searcher frustration. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, SIGIR '10, pages 34–41. https://doi.org/10.1145/1835449.1835458.

Spiros Georgiladakis, Georgia Athanasopoulou, Raveesh Meena, Jos Lopes, Arodami Chorianopoulou, Elisavet Palogiannidi, Elias Iosif, Gabriel Skantze, and Alexandros Potamianos. 2016. Root cause analysis of miscommunication hotspots in spoken dialogue systems. In Proceedings of Interspeech 2016. ISCA, pages 1156–1160. https://doi.org/10.21437/Interspeech.2016-1273.

Ahmed Hassan, Ranjitha Gurunath Kulkarni, Umut Ozertem, and Rosie Jones. 2015. Characterizing and predicting voice query reformulation. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, pages 543–552. https://doi.org/10.1145/2806416.2806491.

Ahmed Hassan, Xiaolin Shi, Nick Craswell, and Bill Ramsey. 2013. Beyond clicks: Query reformulation as a predictor of search satisfaction. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. ACM, pages 2019–2028. https://doi.org/10.1145/2505515.2505682.

Julia Hirschberg, Diane Litman, and Marc Swerts. 2001. Identifying user corrections automatically in spoken dialogue systems. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, NAACL '01, pages 1–8. https://doi.org/10.3115/1073336.1073363.

Graeme Hirst, Susan McRoy, Peter Heeman, Philip Edmonds, and Diane Horton. 1994. Repairing conversational misunderstandings and non-understandings. Speech Communication 15(3-4):213–229. https://doi.org/10.1016/0167-6393(94)90073-6.

Jeff Huang and Efthimis N. Efthimiadis. 2009. Analyzing and evaluating query reformulation strategies in web search logs. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, pages 77–86. https://doi.org/10.1145/1645953.1645966.

Jiepu Jiang, Ahmed Hassan Awadallah, Rosie Jones, Umut Ozertem, Imed Zitouni, Ranjitha Gurunath Kulkarni, and Omar Zia Khan. 2015. Automatic online evaluation of intelligent assistants. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, pages 506–516. https://doi.org/10.1145/2736277.2741669.

Jiepu Jiang, Wei Jeng, and Daqing He. 2013. How do users respond to voice input errors?: Lexical and phonetic query reformulation in voice search. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pages 143–152. https://doi.org/10.1145/2484028.2484092.

Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval. ACM, pages 121–130. https://doi.org/10.1145/2854946.2854961.

Gina-Anne Levow. 1998. Characterizing and recognizing spoken corrections in human-computer dialogue. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, ACL '98, pages 736–742. https://doi.org/10.3115/980845.980969.

Yujian Li and Bo Liu. 2007. A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6):1091–1095. https://doi.org/10.1109/TPAMI.2007.1078.

Diane Litman, Julia Hirschberg, and Marc Swerts. 2006. Characterizing and predicting corrections in spoken dialogue systems. Computational Linguistics 32(3):417–438. https://doi.org/10.1162/coli.2006.32.3.417.

Raveesh Meena, Jose Lopes, Gabriel Skantze, and Joakim Gustafson. 2015. Automatic detection of miscommunication in spoken dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, pages 354–363. https://doi.org/10.18653/v1/W15-4647.

Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Proceedings of Interspeech 2005. ISCA, pages 885–888.

Shumpei Sano, Nobuhiro Kaji, and Manabu Sassano. 2016. Prediction of prospective user engagement with intelligent assistants. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1203–1212. https://doi.org/10.18653/v1/P16-1114.

Ruhi Sarikaya. 2017. The technology behind personal digital assistants: An overview of the system architecture and key components. IEEE Signal Processing Magazine 34(1):67–81. https://doi.org/10.1109/MSP.2016.2617341.

Alexander Schmitt and Stefan Ultes. 2015. Interaction quality: Assessing the quality of ongoing spoken dialog interaction by experts and how it relates to user satisfaction. Speech Communication 74(C):12–36. https://doi.org/10.1016/j.specom.2015.06.003.

Milad Shokouhi, Umut Ozertem, and Nick Craswell. 2016. Did you say U2 or YouTube?: Inferring implicit transcripts from voice search logs. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, WWW '16, pages 1215–1224. https://doi.org/10.1145/2872427.2882994.

Marc Swerts, Diane J. Litman, and Julia Hirschberg. 2000. Corrections in spoken dialogue systems. In Proceedings of ICSLP-2000. ISCA, pages 615–618.