Spoken Dialogue Systems
Felix Putze ([email protected]), partially based on slides from Matthias Denecke
Structure of this lecture
Introduction
Discourse Theory
Dialog Strategies
More: Speech Recognition, Evaluation, ...
What is dialog?
❙ Verbal interaction of two or more agents
❙ over more than one utterance
❙ to establish a joint goal (really?)
What is a spoken dialog system?
❙ Autonomous, artificial system
❙ Engages in dialog with a human user
❙ Employs speech recognition and speech synthesis
❙ Interactive user interface to backend services (databases, shops, ...)
Definition of Dialog
Examples of Spoken Dialog Systems
Automated Telephone Services
Mobile Devices, In-Car Devices
Humanoid Robots (Armar)
Toys, Entertainment
Domestic Appliances
Why is dialog interesting for researchers?
It's more than just speech recognition!
❙ Deep understanding necessary
❙ Interesting phenomena (repair, self-repair, reference)
❙ Indirect communication
❙ Boundary between semantics/pragmatics
❙ Non-verbal communication (gestures, emotions, ...)
❙ Allows learning of unknown words
Challenges in Dialog Systems
Structure of Dialogs
Session: one complete interaction
Utterance: one person's speech between two pauses
Turn: all utterances between two utterances of the partner. Most systems assume that participants take alternating, non-overlapping turns
Barge-in: interrupting a turn
Initiative:
❙ System initiative: “Please tell me your departure airport” (fewer ASR errors, rigid style, form filling)
❙ Mixed initiative: “Which flight are you interested in?” (more freedom, give more information at once, harder to recognize and understand)
Components of a Spoken Dialog System
Automatic Speech Recognition (ASR)
Natural Language Understanding (NLU)
Context Interpretation / Discourse Fusion
Dialog State / Dialog Strategy: selects action (move)
Text-to-speech (TTS)
Discourse*: Entirety of information and structure of the ongoing dialog up to a certain point
Each new utterance must be interpreted in the light of the current discourse
Each new utterance must be included in the discourse
* in the context of this course
Discourse Theories
Discourse Representation Theory (DRT)
Emerging in the 80’s
Until then: Montague grammars (converting text to first-order logic statements), but:
Problem: “A man walks in the park. He whistles.”
(∃x: man(x) ∧ walks_in_park(x)) ∧ (∃x: whistles(x))
does not give the correct truth conditions: the second quantifier does not bind the pronoun “he”
Required: anaphora resolution
Solution: First step: construct an intermediate representation
introduce two-dimensional boxes as scope
introduce discourse referents
introduce accessibility restrictions

Discourse representation structure with discourse referent x:
x
----
man(x)
walks_in_park(x)
whistles(x)

Second step: construct the formula: (∃x: man(x) ∧ walks_in_park(x) ∧ whistles(x))
Discourse Representation Theory (DRT)
DRT provides a construction algorithm based on the syntax tree
Fred owns a Porsche. It is red.
*Fred does not own a Porsche. It is red.

DRS for the first discourse:
x y z
----
fred(x)
porsche(y)
owns(x,y)
red(z)
y = z

DRS for the second discourse (the negation ¬ opens an embedded box, so the referent y is inaccessible):
x
----
fred(x)
¬ [ y | porsche(y), owns(x,y) ]
red(z)
z = ???
Discourse Representation Theory (DRT)
Typed Feature Structures (TFS)
A TFS is a set of named attributes (“features”), which can be of atomic type (int, string, ...) or another TFS
Possible types are organized in an ontology
Two compatible TFS can be unified, resulting in a more specific TFS containing information from both original TFS
Use a set of TFS as discourse representation, use unification for integrating new information
Typed Feature Structures: Example
Ontology of a robotic domain (ontology = formal representation of entity classes with their relations):
speech_act
  act_inform
    act_inform_name
  act_cmd
object
  graspable_obj
  person
Typed Feature Structures: Example
[ act_inform_name
  obj_person : person
    first_name : string
    last_name : string ]
unified with
[ act_inform_name
  confirmed : boolean ]
=
[ act_inform_name
  obj_person : person
    first_name : string
    last_name : string
  confirmed : boolean ]
Unification
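The unification operation sketched above can be illustrated in Python. This is only a sketch under simplifying assumptions: feature structures are plain nested dicts, and the type ontology (subsumption between types) is ignored; a full TFS unifier would also compute the most specific common subtype. The structure contents follow the act_inform_name example.

```python
# Minimal sketch of unification for feature structures represented as
# nested dicts. Atomic values unify iff they are equal; dicts unify
# feature-by-feature. Returns None on failure (incompatible values).
def unify(a, b):
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)                      # copy features of a
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:            # clash on a shared feature
                    return None
                result[key] = merged
            else:
                result[key] = value           # feature only in b
        return result
    return a if a == b else None              # atomic: must match exactly

# Two compatible structures from the act_inform_name example
# (the name "Felix" is an invented value):
f1 = {"type": "act_inform_name",
      "obj_person": {"first_name": "Felix"}}
f2 = {"type": "act_inform_name",
      "confirmed": True}
print(unify(f1, f2))
# A clash on a shared feature makes unification fail:
print(unify(f1, {"obj_person": {"first_name": "Anna"}}))  # None
```

The result of the first call contains the information from both inputs, which is exactly how new utterance information is integrated into the discourse representation.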
Semantic Grammars
Convenient way for rule-based natural language understanding
based on a context-free grammar (often also used as speech recognition language model)
mark certain terminal or non-terminal nodes with semantic tags
Instead of representing syntactic structure (verbs, nouns), concentrate on semantic structure
a tag indicates which element of a TFS ontology is created and filled with the tagged value
TFS & Semantic Grammars: Example
class obj_person inherits object
string : FIRST_NAME;
string : FAMILY_NAME;
class inform_name inherits act_inform
obj_person : PERSON;
boolean : CONFIRM_NAME;
boolean : CONFIRM_FAMILY_NAME;
public <informName,VP> = <inform_name,V> <obj_person,NP> {PERSON obj_person}
<inform_name,V> = my [first] name is | i am | you can call me
<obj_person,NP> = <dbimport_first_name> {FIRST_NAME NODE_VALUE}
TFS
Grammar
Interpretation in Context
ASR output: „yes“ → NLU interpretation: confirmation_yes
What does that mean? Depends on the context!

Context (last system action)    Interpretation
act_ask_confirm_name            confirmation_name
act_ask_to_proceed              cooperative_user
act_inform_victory              indicate_happiness
other                           backchannel
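The context table above amounts to a simple lookup keyed by the last system action. A minimal sketch (the act names are taken from the table; act_greet is an invented example of an "other" context):

```python
# Map (last system action) -> interpretation of a bare "yes".
CONTEXT_INTERPRETATION = {
    "act_ask_confirm_name": "confirmation_name",
    "act_ask_to_proceed":   "cooperative_user",
    "act_inform_victory":   "indicate_happiness",
}

def interpret_yes(last_system_action):
    # Fall back to the backchannel reading for any other context.
    return CONTEXT_INTERPRETATION.get(last_system_action, "backchannel")

print(interpret_yes("act_ask_confirm_name"))  # confirmation_name
print(interpret_yes("act_greet"))             # backchannel
```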
Grosz & Sidner 1986: seminal work integrating a tripartite organization of discourse structure:
❙ Focus of attention, or attentional state❙ Intention of the participants❙ Structure of sequences of utterances
In each dialog segment, a “focus space” describes the relevant subset of all information available
Intentions are structured in a hierarchy (=> subgoals)
Each dialog segment is linked to one intention
Attention, Intention, Discourse Structure
Simplistic analogy:
attentional state: local variables
intention: processing instructions
discourse structure: stack
Grosz and Sidner show in their paper that their theory allows for correct treatment of interruptions
Attention, Intention, Discourse Structure
System: Welcome to the automatic restaurant service.
User: Tell me a nice French restaurant, please.
System: Where are you looking for a restaurant?
User: I am now at the market place.
System: In which city?
User: In Karlsruhe.
System: I recommend the restaurant “Chez Pierre”. It is 200 meters from your current location.
[Segmentation: the embedded segment S2 U2 S3 U3 (determining the location) is nested inside the main segment S1 U1 ... S4 (finding a restaurant)]
Attention, Intention, Discourse Structure
System: Welcome to the automatic restaurant service.
User: Tell me a nice French restaurant, please.
System: Where are you looking for a restaurant?
User: I am now at the market place.
System: In which city?
User: In Karlsruhe.
System: I recommend the restaurant “Chez Pierre”. It is 200 meters from your current location.
            main segment              embedded segment
Attention:  restaurant type & name    location, city
Intention:  find a suited restaurant  determine location
Attention, Intention, Discourse Structure
Dialog Strategies
Plan-Based Systems
Grammar-Based Systems
Finite-State Systems
Rational Agency
Frame-Based Systems
Statistical Systems
Decisions to make
Which item(s) should be brought up next?
How much information is required and appropriate?
Use system initiative or mixed initiative?
When to ask for confirmation of understood information?
Repair strategies (reprompt, rephrase, abort, ignore, ...)
Does the user need help? (list possible commands, forward to operator)
Style of prompts (length, tone, ...)
John L. Austin (~1960): Words do not only describe acts, they are acts (“I bet that I am faster than you”)
An utterance can be interpreted on different levels:
locutionary act = the form of the utterance
propositional act = the meaning of the utterance
illocutionary act = the speech act of the utterance
perlocutionary act = intended effect of the utterance
Speech act types: inform, request, promise, confirm, ...
Many taxonomies are available
Plan-Based Dialog Systems
Plan-Based Dialog Systems
Example: “I don't need this book anymore”
locutionary act: “I” = subject, “this book” = object, ...
propositional act: The speaker has a book which he does not need at this point.
illocutionary act: offering the book to the addressee
perlocutionary act: make the addressee happy
overall goal: become a friend of the addressee
Observation:
• humans don’t utter communicative actions randomly
• actions are planned to achieve various goals
• dialog acts are part of a plan
• it is the listener’s job to uncover the plan and react accordingly
Example:
customer to butcher: “Where are the steaks you advertised?”
bad reaction (to the question): “Back in the fridge.”
good reaction (to the plan): “How many do you want?”
Plan-Based Dialog Systems
Key Idea
utterances: observable actions used to achieve goals
the system needs to uncover the goal
Generalization
dialogue is a special case of other rational (non-communicative) behavior → we can apply standard AI techniques for rational behavior
Plan-Based Dialog Systems
planning:
❙ STRIPS
❙ ADL
❙ ...
plan recognition:
❙ inference rules
❙ action definitions
❙ models of the mental states of the participants
❙ expectations of likely goals
Plan-Based Dialog Systems
Plan-Based Dialog Systems: Problems
redundancy: an illocutionary act must be recognized for each single utterance, and it can influence the interpretation of earlier utterances
complexity of inference: the process of plan recognition and planning is combinatorially intractable
discourse vs. domain plans: speech acts can be task-related or used to control the dialog (“meta communication”) => need to use multi-level plan structures
Speech act (or dialog act) theory and practice are nevertheless important for other approaches
Observation:
there are adjacency pairs, e.g. question-answer, proposal-acceptance, statement-acknowledgment, ...
users expect them more in human-machine dialogs than in human-human dialogs
Propose:
• phrase-structure grammar
• state machine
define acceptable dialogs, just like a syntax grammar defines acceptable sentences
Grammar Based Systems
extend phrase structure above sentence level allows exploitation of task oriented knowledge simple grounding
[Figure: parse tree extending above the sentence level - a Question node (“Which size would you like?”) followed by an Answer node (“A large one”, analyzed as NP → Det N’)]
Grammar Based Systems
Phrase-Structure Grammar (e.g. CFG):
❙ Terminals = question, request, reply, offer, answer, proposition, acceptance, rejection, ...
❙ Non-terminals = initiative, reaction, evaluation, ...
The dialog grammar is used to “parse” the dialog structure, and to “predict” the possible set of next dialog acts by finding valid continuations
Grammar Based Systems
Advantages
easy to design
easy to factor out common dialogue features
Drawbacks
too structured
grammars difficult to obtain automatically
tradeoff between robustness and rigidity
Grammar Based Systems
Idea: Simplify the grammar (Chomsky type 2 => type 3, i.e. context-free => regular)
Finite State Machines:
[Figure: finite state machine - states connected by transitions labeled with speech acts; each state is associated with a system response and a set of expected dialog acts]
Finite State Machines
System designer has control over dialogue flow
no specific linguistic knowledge necessary
tools can be visually appealing
easy to “read”
easy to repair
=> Attractive for companies and endusers
Finite State Machines: Advantages
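A minimal finite-state dialog manager can be sketched in a few lines. The states, prompts, and transition labels below are invented for illustration (a tiny flight-booking flow); they are not taken from any particular system:

```python
# Minimal finite-state dialog manager: each state has a prompt and a
# transition table from recognized dialog acts to successor states.
STATES = {
    "ask_departure":   {"prompt": "Please tell me your departure airport.",
                        "next": {"city": "ask_destination"}},
    "ask_destination": {"prompt": "Please tell me your destination.",
                        "next": {"city": "confirm"}},
    "confirm":         {"prompt": "Shall I book this flight?",
                        "next": {"yes": "done", "no": "ask_departure"}},
    "done":            {"prompt": "Your flight is booked. Goodbye!",
                        "next": {}},
}

def step(state, dialog_act):
    # Follow the transition for the recognized dialog act;
    # stay in the same state (re-prompt) if the act is unexpected.
    return STATES[state]["next"].get(dialog_act, state)

state = "ask_departure"
for act in ["city", "city", "yes"]:
    state = step(state, act)
print(state)  # done
```

Note how the confirmation logic is wired into the states themselves: to confirm a second slot, the designer must add more states. This is the "low degree of abstraction" problem discussed below.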
States in a Dialog
Transitions Trigger Actions
Palette of Possible Actions
Drag-and-Drop Programming
Finite State Machines: Dialog Editor
Dialog acts must be identified (source of errors)
Only one state results from a transition (complex state meanings)
Utterances can be multifunctional: one utterance can contain several dialog acts
Multiple utterances may be needed to establish a dialog act
Low degree of abstraction (the confirmation strategy must be implemented for every piece of information separately)
Problems with Finite State Machines
Inflexible design
gets complex with complex dialogues
limited reusability
Intertwines
dialogue-level knowledge
domain-specific knowledge
language-specific knowledge
=> maintenance nightmare
Problems with Finite State Machines
Theories developed in the 80‘s and 90’s by Perrault, Cohen & Levesque, Sadek
Formally describe rational behavior
Use formal theorem proving to deduce the next rational action of the system
Rational Agency
Example:
I know you know x → I don’t tell you x
I know you don’t know x, I know you want goal g, I know x is relevant for you to achieve g → I tell you x
Rational Agency: Examples
Apply formal theorem proving to deduce the next action (Prolog, Lisp)
RA assumes that collaborators desire to achieve a common goal
more general than dialogue processing: can also be applied to joint problem solving
Rational Agency
Inference rules do not depend on the domain
Working system: ARTIMIS (France Telecom) implements rational agency in an agent-based architecture
an exceptional system: working and with a theoretical foundation
Rational Agency: Observations
Frames introduced by M. Minsky for knowledge representation (1970’s)
Represent incomplete knowledge
A frame consists of a name, attribute slots and a list of possible values for each slot
Dialog as slot filling: compare to filling out a form on a web site:
Frame-Based Systems
Simplest approach:
one frame with multiple slots
goal: fill all slots
Example: “I want to fly from Pittsburgh to Boston”
misrecognition of the departure city causes it to be skipped during parsing, resulting in the following frame:

[BookFlight]
  [Dst] Boston
  [Dep] <empty>
Frame-Based Systems
Frame-Based Systems: Strategy
Define a set of moves
Each move has a set of conditions (concerning the configuration of the slots) and bindings which can be executed (e.g. to produce speech output)
Find all moves with matching conditions and execute one or more of them

Example:
move ask_destination:
  conditions: [Dst] = <empty>
              [Dep] != <empty>
  bindings:   ask_for_destination
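The move-selection loop can be sketched in Python. This is only an illustration: the slot names follow the flight example, but the move inventory, prompts, and the policy of executing the first matching move are invented assumptions:

```python
# Sketch of frame-based move selection: each move has conditions on
# the slot configuration; here we simply execute the first move whose
# conditions hold (a real strategy may rank or combine moves).
frame = {"Dst": "Boston", "Dep": None}   # Dep was misrecognized

MOVES = [
    {"name": "ask_departure",
     "conditions": lambda f: f["Dep"] is None,
     "binding": "Where do you want to depart from?"},
    {"name": "ask_destination",
     "conditions": lambda f: f["Dst"] is None,
     "binding": "Where do you want to fly to?"},
    {"name": "confirm_booking",
     "conditions": lambda f: f["Dep"] is not None and f["Dst"] is not None,
     "binding": "Booking a flight from {Dep} to {Dst}, correct?"},
]

def select_move(frame):
    for move in MOVES:
        if move["conditions"](frame):
            return move
    return None

print(select_move(frame)["name"])  # ask_departure
```

Note that the dialog state is implicit in the slot configuration: no matter in which order the user fills the slots, the matching move is found.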
Frame-Based Systems: Advantages
Implicit dialog state (= configuration of all slots)
Allows mixed-initiative dialogs
Any slot can be filled at any point of the dialog
Partially understood information can be used
Information state independent of other discourse elements
Information state independent of strategy (in contrast to finite state machines)
Multiple frames which can be active in parallel or sequentially
Example: [BookFlight], [QueryFlightInfo]
Define goals: selection of frames and slots which need to be filled to satisfy a certain user request or to trigger a backend application
Extend the abstract dialog state:
intention: which goal is selected? Is the choice ambiguous?
quality: overall and for single slots
number of queries for a certain slot
Frame-Based Systems: Extensions
Frames are actually a predecessor of Typed Feature Structures
Instead of atomic content, allow slots to contain typed feature structures
Introduces concepts from object orientation: inheritance, aggregation, ... which allows for more general goals and move conditions
Frame-Based Systems: Feature Structures
VoiceXML
What (X)HTML is for displaying websites, VoiceXML is for speech interfaces
standardized by W3C: http://www.w3.org/TR/voicexml20
goal: portable applications, streamlined voice portals
contains many predefined “form elements” and built-in grammars
coupled with standards for grammar definition, text-to-speech enhancement
Join the Praktikum next semester for some hands-on experience!
VoiceXML: Example
<?xml version="1.0"?>
<vxml version="2.0">
  <menu>
    <prompt> Say one of: <enumerate/> </prompt>
    <choice next="http://www.sports.com/"> Sports </choice>
    <choice next="http://www.weather.com"> Weather </choice>
    <choice next="http://www.news.com"> News </choice>
    <noinput> Please say one of <enumerate/> </noinput>
  </menu>
</vxml>
Learn optimal next system action given the dialog history from a corpus of existing dialogs
optimal = maximizes some payoff function (reward) over the whole dialog
Employ Reinforcement Learning
Example applications:
Marilyn Walker: NJFUN
Michael Kearns (AT&T): How may I help you?
Diane Litman, Steve Young, ...
Statistical Systems
Markov Decision Process (MDP)
[Figure: MDP loop - the agent 1. observes the environment state, 2. selects an action, 3. receives a reward, 4. the action modifies the environment]
Markov Decision Process (MDP)
Example for dialog management:
Environment: state of all slots (empty / filled / confirmed), dialog parameters (duration, number of confirmations, ...)
Actions: dialog moves of the system
Reward: at the end of the session: positive if the task was successful, negative otherwise; small negative reward for each other action (to produce short dialogs)
What is optimal?
Optimize overall reward → no greedy action selection
Immediate rewards are often worth more than equal rewards at a later time → bias for earlier gains

Finite horizon model (useful when the duration of the session is known): ∑_{t=1}^{k} r_t
Average reward model: lim_{h→∞} (1/h) ∑_{t=0}^{h} r_t
Discounted reward model (discount factor γ): ∑_{t=1}^{∞} γ^t r_t
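The three criteria can be compared on a small numeric example. The reward sequence and discount factor below are invented (a per-turn penalty plus a success bonus, as suggested by the reward definition above); the average model uses the finite sum as a stand-in for the limit:

```python
# The three optimization criteria on a toy reward sequence.
rewards = [-1, -1, -1, 10]   # per-turn penalty, then success bonus
gamma = 0.9

finite_horizon = sum(rewards)                 # k = len(rewards)
average = sum(rewards) / len(rewards)         # finite stand-in for the limit
# discounted: sum of gamma^t * r_t, with t starting at 1
discounted = sum(gamma**t * r for t, r in enumerate(rewards, start=1))

print(finite_horizon, average, discounted)    # 7 1.75 ~4.12
```

The discounted sum rewards the same success less the later it happens, which is exactly the "bias for earlier gains" (i.e. short dialogs).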
Optimal Strategy by Value Iteration
Arbitrarily initialize value(s) for all s ∊ States
While strategy not optimal:
  For s ∊ States:
    For a ∊ Actions:
      Q(s,a) = reward(s,a) + γ * ∑_{s'} trans(s,a,s') * value(s')
    value(s) = max_a Q(s,a)

(trans(s,a,s') = probability of the transition from s to s' via action a)

Resulting strategy: in every state s, pick the action which leads to the successor s' with the highest value(s')
Bad: slow convergence
Worse: requires detailed knowledge of state transitions (not available for large systems)
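The value-iteration pseudocode above can be run on a tiny, hand-made dialog MDP. The states, actions, transition probabilities, and rewards below are all invented for illustration; a fixed number of sweeps stands in for the "while strategy not optimal" test:

```python
# Value iteration on a toy dialog MDP with three states.
# trans[s][a] is a list of (probability, next_state) pairs.
STATES = ["slot_empty", "slot_filled", "done"]
ACTIONS = ["ask", "confirm"]
GAMMA = 0.9

trans = {
    "slot_empty":  {"ask":     [(0.8, "slot_filled"), (0.2, "slot_empty")],
                    "confirm": [(1.0, "slot_empty")]},
    "slot_filled": {"ask":     [(1.0, "slot_filled")],
                    "confirm": [(0.9, "done"), (0.1, "slot_empty")]},
    "done":        {"ask":     [(1.0, "done")],
                    "confirm": [(1.0, "done")]},
}

def reward(s, a):
    # Success bonus for confirming a filled slot, small penalty otherwise.
    return 10.0 if (s, a) == ("slot_filled", "confirm") else -1.0

def q(s, a, value):
    return reward(s, a) + GAMMA * sum(p * value[s2] for p, s2 in trans[s][a])

value = {s: 0.0 for s in STATES}
for _ in range(100):                          # sweeps until (near) convergence
    for s in STATES:
        value[s] = max(q(s, a, value) for a in ACTIONS)

policy = {s: max(ACTIONS, key=lambda a: q(s, a, value)) for s in STATES}
print(policy)   # slot_empty -> ask, slot_filled -> confirm
```

With known transition probabilities this converges reliably; the point of Q-learning below is that real dialog systems do not have this transition model.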
Reinforcement Learning (Q-Learning)
[Figure: the Q-learning update - taking action a in state s yields an immediate reward from the environment and leads to state s'; the maximum Q-score over the actions available in s' (leading to s'', s''', ...) enters the update]
Formal description of Q-Learning
Q(s_t, a_t) ⇐ Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

α = learning rate (can change over time), γ = discount factor
Arbitrarily initialize Q(s,a) for all s ∊ States, a ∊ Actions
While strategy not optimal:
  For s ∊ States:
    Pick action a
    Do update on Q(s,a)

Off-policy, i.e. no knowledge of state transitions required!
Exploration vs. Exploitation
How is “action a“ selected in Q-Learning?
The algorithm should elaborate the most promising areas of the state-action space
For guaranteed optimality, we also need to explore less promising states
This is especially the case when learning online, i.e. during a dialog session with a real user!
ε-greedy selection: usually select the action with the highest Q-value for the current state; with a probability of ε, select a random action instead.
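Q-learning with ε-greedy selection can be demonstrated on a toy task. Everything about the environment below is invented (a deterministic two-state slot-filling task); only the update rule and the ε-greedy selection follow the formulas above:

```python
import random

# Q-learning with epsilon-greedy exploration on a toy 2-state task:
# asking in state 0 fills the slot, confirming in state 1 ends the
# episode with a success reward.
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
ACTIONS = ["ask", "confirm"]
Q = {(s, a): 0.0 for s in (0, 1, "end") for a in ACTIONS}

def env_step(s, a):
    # Deterministic toy environment: returns (reward, next_state).
    if s == 0 and a == "ask":
        return -1.0, 1
    if s == 1 and a == "confirm":
        return 10.0, "end"
    return -1.0, s

def choose(s):
    if random.random() < EPS:                         # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])      # exploit

random.seed(0)
for _ in range(500):                                  # training episodes
    s = 0
    while s != "end":
        a = choose(s)
        r, s2 = env_step(s, a)
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        # Off-policy update: no transition model needed.
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda a: Q[(0, a)]),
      max(ACTIONS, key=lambda a: Q[(1, a)]))          # ask confirm
```

Unlike value iteration, the agent never sees the transition model; it learns purely from sampled (state, action, reward, next state) tuples, which is what makes the approach applicable to recorded or simulated dialog sessions.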
User Simulation to Generate Training Data
Reinforcement Learning requires many training episodes, many more than are available for common systems
Use recorded dialog sessions to train a user model and use it to generate an unlimited number of training episodes
Simple user model: bigrams, i.e. P(Action_user | Action_system)
Include an error model, e.g. P(observed Action_user | Action_user)
Test performance with real users or at least on a different user simulation!
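The bigram user model can be sketched directly: estimate P(user act | system act) from recorded pairs and sample from it. The corpus and act names below are invented for illustration, and the error model is omitted for brevity:

```python
import random
from collections import Counter, defaultdict

# Bigram user model P(user act | system act), estimated from recorded
# (system_act, user_act) pairs (this "corpus" is invented).
corpus = [("ask_city", "inform_city"), ("ask_city", "inform_city"),
          ("ask_city", "silence"),
          ("ask_confirm", "yes"), ("ask_confirm", "no")]

counts = defaultdict(Counter)
for sys_act, user_act in corpus:
    counts[sys_act][user_act] += 1

def simulate_user(sys_act):
    # Sample a user act from the estimated conditional distribution.
    acts = counts[sys_act]
    total = sum(acts.values())
    return random.choices(list(acts), [c / total for c in acts.values()])[0]

random.seed(1)
print(simulate_user("ask_city"))
```

Chaining this sampler with the dialog strategy under training yields arbitrarily many simulated episodes; the caveat above still applies, i.e. the learned strategy must be evaluated against real users.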
[Figure: user-simulation loop - 1. record real dialog sessions with a dialog system prototype; 2. estimate a user model and an error model from them; 3. generate simulated dialog sessions; 4. train the dialog agent on them; 5. evaluate against real dialog sessions]
User Simulation
Partially Observable MDP (POMDP)
We make mistakes when we evaluate the user's response and his intention
Consequence: the state is ambiguous!
Solution: extend the MDP by a probability distribution over all states (a belief state)
Advantage: implicitly maintains multiple hypotheses with confidence scores
New problem: the state space explodes (it even becomes continuous)
Task success: number of filled slots, word error rate, ...
Efficiency: number of turns
User satisfaction (sometimes not correlated with task success or efficiency)
Naturalness (how?)
PARADISE framework (Walker, Litman et al. 1997): calculates a single performance value based on task success and dialog costs. Factors are weighted by their ability to predict user satisfaction.
Evaluation
Wizard of Oz Experiments
Errors in a running system are expensive to correct
Errors in a written specification are hard to find
Solution: Design a prototype!
Replace the dialog strategy by a human operator (the wizard)
The wizard has the same perceptual and operational capabilities as the final dialog system
The wizard can decide on his own or based on a predefined script
The user does not know about the wizard
Speech Recognition in Dialog Systems
Traditionally in ASR, we search the W with maximal P(W | Audio). But in dialog systems, we are interested in semantic or dialog acts!
The occurrence of semantic concepts depends on the discourse and dialog state
Consequence: returning only the best textual hypothesis is mathematically incorrect!
ASR and dialog system must communicate
Possible solution: weigh grammar rules according to the dialog act expectations (Fügen, Holzapfel 2004)
See some of the mathematics explained in (Young 2002)
Speech Recognition in Dialog Systems
Large dialog systems often deal with large vocabularies, which induce a high word error rate, especially for proper names (person names, locations, ...)
Often, not the whole vocabulary is equally likely in each situation
Example: A navigation device in Karlsruhe will hear the destination “Durlach” more often than the destination “Paris” (even if the latter has a higher a priori probability)
Consequence: Limit the vocabulary for each turn separately
Multimodality in Dialog Systems
Use more than just speech information
Natural way of interaction
Examples: gestures, facial or vocal identity, emotion
Convert each modality to a TFS representation
Unify feature structures for different modalities
Dialog strategies are modality independent!
Natural Interaction?
Speech = natural way of communicationSpoken Dialog Systems = natural interaction???
Adaptive Dialog Systems
Users of dialog systems vary in many dimensions: age, experience, personality, emotional state, ...
This has a huge impact on performance and user satisfaction, dialog designers should not ignore it!
Simplified example: use explicit confirmations for inexperienced users and implicit or no confirmations for experienced users
Extend dialog state by control variables that describe the user state
Currently unsolved: We are working on it
Our Approach for Adaptive Dialog Systems
Detect the user state by using multimodal fusion of biosignals (voice, video, EEG, EMG, ...)
Decide on adaptations of the system behavior:
• Voice properties (speed, pitch, ...)
• Language style (empathic vs. formal)
• Helping behavior
• Error recovery
Problems:
• Errors in state recognition
• Frequent changes are perceived as inconsistent
Dialog modelling is more than speech recognition
Requires knowledge from speech processing, linguistics, AI, social science, ...
Discourse representation
Dialog strategies: plan-based, finite state machines, statistical approaches, ...
Summary