Spoken Dialogue Systems
Felix Putze ([email protected]), partially based on slides from Matthias Denecke
Structure of this lecture
Introduction
Discourse Theory
Dialog Strategies
More: Speech Recognition, Evaluation, ...
What is dialog?
❙ Verbal interaction of two or more agents
❙ over more than one utterance
❙ to establish a joint goal (really?)
What is a spoken dialog system?
❙ Autonomous, artificial system
❙ Engages in dialog with a human user
❙ Employs speech recognition and speech synthesis
❙ Interactive user interface to backend services (databases, shops, ...)
Definition of Dialog
Examples of Spoken Dialog Systems
Automated Telephone Services
Mobile Devices, In-Car Devices
Humanoid Robots (Armar)
Toys, Entertainment
Domestic Appliances
Why is dialog interesting for researchers?
It's more than just speech recognition!
❙ Deep understanding necessary
❙ Interesting phenomena (repair, self-repair, reference)
❙ Indirect communication
❙ Boundary between semantics/pragmatics
❙ Non-verbal communication (gestures, emotions, ...)
❙ Allows learning of unknown words
Challenges in Dialog Systems
Structure of Dialogs
Session: one complete interaction
Utterance: one person's speech between two pauses
Turn: all utterances between two utterances of the partner. Most systems assume that participants take alternating, non-overlapping turns
Barge-in: interrupting a turn
Initiative:
❙ System initiative: “Please tell me your departure airport” (fewer ASR errors, rigid style, form filling)
❙ Mixed initiative: “Which flight are you interested in?” (more freedom, give more information at once, harder to recognize and understand)
Components of a Spoken Dialog System
Automatic Speech Recognition (ASR)
Natural Language Understanding (NLU)
Context Interpretation / Discourse Fusion
Dialog State / Dialog Strategy: selects action (move)
Text-to-speech (TTS)
Discourse*: Entirety of information and structure of the ongoing dialog up to a certain point
Each new utterance must be interpreted in the light of the current discourse
Each new utterance must be included in the discourse
* in the context of this course
Discourse Theories
Discourse Representation Theory (DRT)
Emerging in the 80’s
Until then: Montague grammars (converting text to first-order logic statements), but:
Problem: “A man walks in the park. He whistles.”
(∃x: man(x) ∧ walks_in_park(x)) ∧ (∃x: whistles(x))
does not give the correct truth conditions: the second quantifier does not bind the pronoun “he”
Required: anaphora resolution
Solution: First step: construct an intermediate representation
introduce two-dimensional boxes as scope
introduce discourse referents
introduce accessibility restrictions

Discourse representation structure with discourse referent x:
x
----
man(x)
walks_in_park(x)
whistles(x)

Second step: construct the formula: (∃x: man(x) ∧ walks_in_park(x) ∧ whistles(x))
Discourse Representation Theory (DRT)
DRT provides a construction algorithm based on the syntax tree
Fred owns a Porsche. It is red.
*Fred does not own a Porsche. It is red.

DRS for the first discourse:
x y z
----
fred(x)
porsche(y)
owns(x,y)
red(z)
y = z

DRS for the second discourse (the negation ¬ opens an embedded box, so the referent y is inaccessible):
x
----
fred(x)
¬ [ y | porsche(y), owns(x,y) ]
red(z)
z = ???
Discourse Representation Theory (DRT)
Typed Feature Structures (TFS)
A TFS is a set of named attributes (“features”), which can be of atomic type (int, string, ...) or another TFS
Possible types are organized in an ontology
Two compatible TFS can be unified, resulting in a more specific TFS containing information from both original TFS
Use a set of TFS as discourse representation, use unification for integrating new information
Typed Feature Structures: Example
Ontology of a robotic domain (ontology = formal representation of entity classes with their relations):
speech_act
  act_inform
    act_inform_name
  act_cmd
object
  graspable_obj
  person
Typed Feature Structures: Example
[ act_inform_name
  obj_person : person
    first_name : string
    last_name : string ]
unified with
[ act_inform_name
  confirmed : boolean ]
=
[ act_inform_name
  obj_person : person
    first_name : string
    last_name : string
  confirmed : boolean ]
Unification
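The unification operation sketched above can be illustrated in Python. This is only a sketch under simplifying assumptions: feature structures are plain nested dicts, and the type ontology (subsumption between types) is ignored; a full TFS unifier would also compute the most specific common subtype. The structure contents follow the act_inform_name example.

```python
# Minimal sketch of unification for feature structures represented as
# nested dicts. Atomic values unify iff they are equal; dicts unify
# feature-by-feature. Returns None on failure (incompatible values).
def unify(a, b):
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)                      # copy features of a
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:            # clash on a shared feature
                    return None
                result[key] = merged
            else:
                result[key] = value           # feature only in b
        return result
    return a if a == b else None              # atomic: must match exactly

# Two compatible structures from the act_inform_name example
# (the name "Felix" is an invented value):
f1 = {"type": "act_inform_name",
      "obj_person": {"first_name": "Felix"}}
f2 = {"type": "act_inform_name",
      "confirmed": True}
print(unify(f1, f2))
# A clash on a shared feature makes unification fail:
print(unify(f1, {"obj_person": {"first_name": "Anna"}}))  # None
```

The result of the first call contains the information from both inputs, which is exactly how new utterance information is integrated into the discourse representation.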
Semantic Grammars
Convenient way for rule-based natural language understanding
based on a context-free grammar (often also used as speech recognition language model)
mark certain terminal or non-terminal nodes with semantic tags
Instead of representing syntactic structure (verbs, nouns), concentrate on semantic structure
a tag indicates which element of a TFS ontology is created and filled with the tagged value
TFS & Semantic Grammars: Example
class obj_person inherits object
string : FIRST_NAME;
string : FAMILY_NAME;
class inform_name inherits act_inform
obj_person : PERSON;
boolean : CONFIRM_NAME;
boolean : CONFIRM_FAMILY_NAME;
public <informName,VP> = <inform_name,V> <obj_person,NP> {PERSON obj_person}
<inform_name,V> = my [first] name is | i am | you can call me
<obj_person,NP> = <dbimport_first_name> {FIRST_NAME NODE_VALUE}
TFS
Grammar
Interpretation in Context
ASR output: „yes“ → NLU interpretation: confirmation_yes
What does that mean? Depends on the context!

Context (last system action)    Interpretation
act_ask_confirm_name            confirmation_name
act_ask_to_proceed              cooperative_user
act_inform_victory              indicate_happiness
other                           backchannel
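The context table above amounts to a simple lookup keyed by the last system action. A minimal sketch (the act names are taken from the table; act_greet is an invented example of an "other" context):

```python
# Map (last system action) -> interpretation of a bare "yes".
CONTEXT_INTERPRETATION = {
    "act_ask_confirm_name": "confirmation_name",
    "act_ask_to_proceed":   "cooperative_user",
    "act_inform_victory":   "indicate_happiness",
}

def interpret_yes(last_system_action):
    # Fall back to the backchannel reading for any other context.
    return CONTEXT_INTERPRETATION.get(last_system_action, "backchannel")

print(interpret_yes("act_ask_confirm_name"))  # confirmation_name
print(interpret_yes("act_greet"))             # backchannel
```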
Grosz & Sidner 1986: seminal work integrating a tripartite organization of discourse structure:
❙ Focus of attention, or attentional state❙ Intention of the participants❙ Structure of sequences of utterances
In each dialog segment, a “focus space” describes the relevant subset of all information available
Intentions are structured in a hierarchy (=> subgoals)
Each dialog segment is linked to one intention
Attention, Intention, Discourse Structure
Simplistic analogy:
attentional state: local variables
intention: processing instructions
discourse structure: stack
Grosz and Sidner show in their paper that their theory allows for correct treatment of interruptions
Attention, Intention, Discourse Structure
System: Welcome to the automatic restaurant service.
User: Tell me a nice French restaurant, please.
System: Where are you looking for a restaurant?
User: I am now at the market place.
System: In which city?
User: In Karlsruhe.
System: I recommend the restaurant “Chez Pierre”. It is 200 meters from your current location.
[Segmentation: the embedded segment S2 U2 S3 U3 (determining the location) is nested inside the main segment S1 U1 ... S4 (finding a restaurant)]
Attention, Intention, Discourse Structure
System: Welcome to the automatic restaurant service.
User: Tell me a nice French restaurant, please.
System: Where are you looking for a restaurant?
User: I am now at the market place.
System: In which city?
User: In Karlsruhe.
System: I recommend the restaurant “Chez Pierre”. It is 200 meters from your current location.
            main segment              embedded segment
Attention:  restaurant type & name    location, city
Intention:  find a suited restaurant  determine location
Attention, Intention, Discourse Structure
Dialog Strategies
Plan-Based Systems
Grammar-Based Systems
Finite-State Systems
Rational Agency
Frame-Based Systems
Statistical Systems
Decisions to make
Which item(s) should be brought up next?
How much information is required and appropriate?
Use system initiative or mixed initiative?
When to ask for confirmation of understood information?
Repair strategies (reprompt, rephrase, abort, ignore, ...)
Does the user need help? (list possible commands, forward to operator)
Style of prompts (length, tone, ...)
John L. Austin (~1960): Words do not only describe acts, they are acts (“I bet that I am faster than you”)
An utterance can be interpreted on different levels:
locutionary act = the form of the utterance
propositional act = the meaning of the utterance
illocutionary act = the speech act of the utterance
perlocutionary act = intended effect of the utterance
Speech act types: inform, request, promise, confirm, ...
Many taxonomies are available
Plan-Based Dialog Systems
Plan-Based Dialog Systems
Example: “I don't need this book anymore”
locutionary act: “I” = subject, “this book” = object, ...
propositional act: The speaker has a book which he does not need at this point.
illocutionary act: offering the book to the addressee
perlocutionary act: make the addressee happy
overall goal: become a friend of the addressee
Observation:
• humans don’t utter communicative actions randomly
• actions are planned to achieve various goals
• dialog acts are part of a plan
• it is the listener’s job to uncover the plan and react accordingly
Example:
customer to butcher: “Where are the steaks you advertised?”
bad reaction (to the question): “Back in the fridge.”
good reaction (to the plan): “How many do you want?”
Plan-Based Dialog Systems
Key Idea
utterances: observable actions used to achieve goals
the system needs to uncover the goal
Generalization
dialogue is a special case of other rational (non-communicative) behavior → we can apply standard AI techniques for rational behavior
Plan-Based Dialog Systems
planning:
❙ STRIPS
❙ ADL
❙ ...
plan recognition:
❙ inference rules
❙ action definitions
❙ models of the mental states of the participants
❙ expectations of likely goals
Plan-Based Dialog Systems
Plan-Based Dialog Systems: Problems
redundancy: an illocutionary act must be recognized for each single utterance, and it can influence the interpretation of earlier utterances
complexity of inference: the process of plan recognition and planning is combinatorially intractable
discourse vs. domain plans: speech acts can be task-related or used to control the dialog (“meta communication”) => need to use multi-level plan structures
Speech act (or dialog act) theory and practice are nevertheless important for other approaches
Observation:
there are adjacency pairs, e.g. question-answer, proposal-acceptance, statement-acknowledgment, ...
users expect them more in human-machine dialogs than in human-human dialogs
Propose:
• phrase-structure grammar
• state machine
define acceptable dialogs, just like a syntax grammar defines acceptable sentences
Grammar Based Systems
extend phrase structure above sentence level allows exploitation of task oriented knowledge simple grounding
[Figure: parse tree extending above the sentence level - a Question node (“Which size would you like?”) followed by an Answer node (“A large one”, analyzed as NP → Det N’)]
Grammar Based Systems
Phrase-Structure Grammar (e.g. CFG):
❙ Terminals = question, request, reply, offer, answer, proposition, acceptance, rejection, ...
❙ Non-terminals = initiative, reaction, evaluation, ...
The dialog grammar is used to “parse” the dialog structure, and to “predict” the possible set of next dialog acts by finding valid continuations
Grammar Based Systems
Advantages
easy to design
easy to factor out common dialogue features
Drawbacks
too structured
grammars difficult to obtain automatically
tradeoff between robustness and rigidity
Grammar Based Systems
Idea: Simplify the grammar (Chomsky type 2 => type 3, i.e. context-free => regular)
Finite State Machines:
[Figure: finite state machine - states connected by transitions labeled with speech acts; each state is associated with a system response and a set of expected dialog acts]
Finite State Machines
System designer has control over dialogue flow
no specific linguistic knowledge necessary
tools can be visually appealing
easy to “read”
easy to repair
=> Attractive for companies and endusers
Finite State Machines: Advantages
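A minimal finite-state dialog manager can be sketched in a few lines. The states, prompts, and transition labels below are invented for illustration (a tiny flight-booking flow); they are not taken from any particular system:

```python
# Minimal finite-state dialog manager: each state has a prompt and a
# transition table from recognized dialog acts to successor states.
STATES = {
    "ask_departure":   {"prompt": "Please tell me your departure airport.",
                        "next": {"city": "ask_destination"}},
    "ask_destination": {"prompt": "Please tell me your destination.",
                        "next": {"city": "confirm"}},
    "confirm":         {"prompt": "Shall I book this flight?",
                        "next": {"yes": "done", "no": "ask_departure"}},
    "done":            {"prompt": "Your flight is booked. Goodbye!",
                        "next": {}},
}

def step(state, dialog_act):
    # Follow the transition for the recognized dialog act;
    # stay in the same state (re-prompt) if the act is unexpected.
    return STATES[state]["next"].get(dialog_act, state)

state = "ask_departure"
for act in ["city", "city", "yes"]:
    state = step(state, act)
print(state)  # done
```

Note how the confirmation logic is wired into the states themselves: to confirm a second slot, the designer must add more states. This is the "low degree of abstraction" problem discussed below.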
States in a Dialog
Transitions Trigger Actions
Palette of Possible Actions
Drag-and-Drop Programming
Finite State Machines: Dialog Editor
Dialog acts must be identified (source of errors)
Only one state results from a transition (complex state meanings)
Utterances can be multifunctional: one utterance can contain several dialog acts
Multiple utterances may be needed to establish a dialog act
Low degree of abstraction (the confirmation strategy must be implemented for every piece of information separately)
Problems with Finite State Machines
Inflexible design
gets complex with complex dialogues
limited reusability
Intertwines
dialogue-level knowledge
domain-specific knowledge
language-specific knowledge
=> maintenance nightmare
Problems with Finite State Machines
Theories developed in the 80‘s and 90’s by Perrault, Cohen & Levesque, Sadek
Formally describe rational behavior
Use formal theorem proving to deduce the next rational action of the system
Rational Agency
Example:
I know you know x → I don’t tell you x
I know you don’t know x, I know you want goal g, I know x is relevant for you to achieve g → I tell you x
Rational Agency: Examples
Apply formal theorem proving to deduce the next action (Prolog, Lisp)
RA assumes that collaborators desire to achieve a common goal
more general than dialogue processing: can also be applied to joint problem solving
Rational Agency
Inference rules do not depend on the domain
Working system: ARTIMIS (France Telecom) implements rational agency in an agent-based architecture
an exceptional system: working and with a theoretical foundation
Rational Agency: Observations
Frames introduced by M. Minsky for knowledge representation (1970’s)
Represent incomplete knowledge
A frame consists of a name, attribute slots and a list of possible values for each slot
Dialog as slot filling: compare to filling out a form on a web site:
Frame-Based Systems
Simplest approach:
one frame with multiple slots
goal: fill all slots
Example: “I want to fly from Pittsburgh to Boston”
misrecognition of the departure city causes it to be skipped during parsing, resulting in the following frame:

[BookFlight]
  [Dst] Boston
  [Dep] <empty>
Frame-Based Systems
Frame-Based Systems: Strategy
Define a set of moves
Each move has a set of conditions (concerning the configuration of the slots) and bindings which can be executed (e.g. to produce speech output)
Find all moves with matching conditions and execute one or more of them

Example:
move ask_destination:
  conditions: [Dst] = <empty>
              [Dep] != <empty>
  bindings:   ask_for_destination
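The move-selection loop can be sketched in Python. This is only an illustration: the slot names follow the flight example, but the move inventory, prompts, and the policy of executing the first matching move are invented assumptions:

```python
# Sketch of frame-based move selection: each move has conditions on
# the slot configuration; here we simply execute the first move whose
# conditions hold (a real strategy may rank or combine moves).
frame = {"Dst": "Boston", "Dep": None}   # Dep was misrecognized

MOVES = [
    {"name": "ask_departure",
     "conditions": lambda f: f["Dep"] is None,
     "binding": "Where do you want to depart from?"},
    {"name": "ask_destination",
     "conditions": lambda f: f["Dst"] is None,
     "binding": "Where do you want to fly to?"},
    {"name": "confirm_booking",
     "conditions": lambda f: f["Dep"] is not None and f["Dst"] is not None,
     "binding": "Booking a flight from {Dep} to {Dst}, correct?"},
]

def select_move(frame):
    for move in MOVES:
        if move["conditions"](frame):
            return move
    return None

print(select_move(frame)["name"])  # ask_departure
```

Note that the dialog state is implicit in the slot configuration: no matter in which order the user fills the slots, the matching move is found.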
Frame-Based Systems: Advantages
Implicit dialog state (= configuration of all slots)
Allows mixed-initiative dialogs
Any slot can be filled at any point of the dialog
Partially understood information can be used
Information state independent of other discourse elements
Information state independent of strategy (in contrast to finite state machines)
Multiple frames which can be active in parallel or sequentially
Example: [BookFlight], [QueryFlightInfo]
Define goals: selection of frames and slots which need to be filled to satisfy a certain user request or to trigger a backend application
Extend the abstract dialog state:
intention: which goal is selected? Is the choice ambiguous?
quality: overall and for single slots
number of queries for a certain slot
Frame-Based Systems: Extensions
Frames are actually a predecessor of Typed Feature Structures
Instead of atomic content, allow slots to contain typed feature structures
Introduces concepts from object orientation: inheritance, aggregation, ... which allows for more general goals and move conditions
Frame-Based Systems: Feature Structures
VoiceXML
What (X)HTML is for displaying websites, VoiceXML is for speech interfaces
standardized by W3C: http://www.w3.org/TR/voicexml20
goal: portable applications, streamlined voice portals
contains many predefined “form elements” and built-in grammars
coupled with standards for grammar definition, text-to-speech enhancement
Join the Praktikum next semester for some hands-on experience!
VoiceXML: Example
<?xml version="1.0"?>
<vxml version="2.0">
  <menu>
    <prompt> Say one of: <enumerate/> </prompt>
    <choice next="http://www.sports.com/"> Sports </choice>
    <choice next="http://www.weather.com"> Weather </choice>
    <choice next="http://www.news.com"> News </choice>
    <noinput> Please say one of <enumerate/> </noinput>
  </menu>
</vxml>
Learn optimal next system action given the dialog history from a corpus of existing dialogs
optimal = maximizes some payoff function (reward) over the whole dialog
Employ Reinforcement Learning
Example applications:
Marilyn Walker: NJFUN
Michael Kearns (AT&T): How may I help you?
Diane Litman, Steve Young, ...
Statistical Systems
Markov Decision Process (MDP)
[Figure: MDP loop - the agent 1. observes the environment state, 2. selects an action, 3. receives a reward, 4. the action modifies the environment]
Markov Decision Process (MDP)
Example for dialog management:
Environment: state of all slots (empty / filled / confirmed), dialog parameters (duration, number of confirmations, ...)
Actions: dialog moves of the system
Reward: at the end of the session: positive if the task was successful, negative otherwise; small negative reward for each other action (to produce short dialogs)
What is optimal?
Optimize overall reward → no greedy action selection
Immediate rewards are often worth more than equal rewards at a later time → bias for earlier gains

Finite horizon model (useful when the duration of the session is known): ∑_{t=1}^{k} r_t
Average reward model: lim_{h→∞} (1/h) ∑_{t=0}^{h} r_t
Discounted reward model (discount factor γ): ∑_{t=1}^{∞} γ^t r_t
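The three criteria can be compared on a small numeric example. The reward sequence and discount factor below are invented (a per-turn penalty plus a success bonus, as suggested by the reward definition above); the average model uses the finite sum as a stand-in for the limit:

```python
# The three optimization criteria on a toy reward sequence.
rewards = [-1, -1, -1, 10]   # per-turn penalty, then success bonus
gamma = 0.9

finite_horizon = sum(rewards)                 # k = len(rewards)
average = sum(rewards) / len(rewards)         # finite stand-in for the limit
# discounted: sum of gamma^t * r_t, with t starting at 1
discounted = sum(gamma**t * r for t, r in enumerate(rewards, start=1))

print(finite_horizon, average, discounted)    # 7 1.75 ~4.12
```

The discounted sum rewards the same success less the later it happens, which is exactly the "bias for earlier gains" (i.e. short dialogs).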
Optimal Strategy by Value Iteration
Arbitrarily initialize value(s) for all s ∊ States
While strategy not optimal:
  For s ∊ States:
    For a ∊ Actions:
      Q(s,a) = reward(s,a) + γ * ∑_{s'} trans(s,a,s') * value(s')
    value(s) = max_a Q(s,a)

(trans(s,a,s') = probability of the transition from s to s' via action a)

Resulting strategy: in every state s, pick the action which leads to the successor s' with the highest value(s')
Bad: slow convergence
Worse: requires detailed knowledge of state transitions (not available for large systems)
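The value-iteration pseudocode above can be run on a tiny, hand-made dialog MDP. The states, actions, transition probabilities, and rewards below are all invented for illustration; a fixed number of sweeps stands in for the "while strategy not optimal" test:

```python
# Value iteration on a toy dialog MDP with three states.
# trans[s][a] is a list of (probability, next_state) pairs.
STATES = ["slot_empty", "slot_filled", "done"]
ACTIONS = ["ask", "confirm"]
GAMMA = 0.9

trans = {
    "slot_empty":  {"ask":     [(0.8, "slot_filled"), (0.2, "slot_empty")],
                    "confirm": [(1.0, "slot_empty")]},
    "slot_filled": {"ask":     [(1.0, "slot_filled")],
                    "confirm": [(0.9, "done"), (0.1, "slot_empty")]},
    "done":        {"ask":     [(1.0, "done")],
                    "confirm": [(1.0, "done")]},
}

def reward(s, a):
    # Success bonus for confirming a filled slot, small penalty otherwise.
    return 10.0 if (s, a) == ("slot_filled", "confirm") else -1.0

def q(s, a, value):
    return reward(s, a) + GAMMA * sum(p * value[s2] for p, s2 in trans[s][a])

value = {s: 0.0 for s in STATES}
for _ in range(100):                          # sweeps until (near) convergence
    for s in STATES:
        value[s] = max(q(s, a, value) for a in ACTIONS)

policy = {s: max(ACTIONS, key=lambda a: q(s, a, value)) for s in STATES}
print(policy)   # slot_empty -> ask, slot_filled -> confirm
```

With known transition probabilities this converges reliably; the point of Q-learning below is that real dialog systems do not have this transition model.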
Reinforcement Learning (Q-Learning)
[Figure: the Q-learning update - taking action a in state s yields an immediate reward from the environment and leads to state s'; the maximum Q-score over the actions available in s' (leading to s'', s''', ...) enters the update]
Formal description of Q-Learning
Q(s_t, a_t) ⇐ Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

α = learning rate (can change over time), γ = discount factor
Arbitrarily initialize Q(s,a) for all s ∊ States, a ∊ Actions
While strategy not optimal:
  For s ∊ States:
    Pick action a
    Do update on Q(s,a)

Off-policy, i.e. no knowledge of state transitions required!
Exploration vs. Exploitation
How is “action a“ selected in Q-Learning?
The algorithm should elaborate the most promising areas of the state-action space
For guaranteed optimality, we also need to explore less promising states
This is especially the case when learning online, i.e. during a dialog session with a real user!
ε-greedy selection: usually select the action with the highest Q-value for the current state; with a probability of ε, select a random action instead.
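Q-learning with ε-greedy selection can be demonstrated on a toy task. Everything about the environment below is invented (a deterministic two-state slot-filling task); only the update rule and the ε-greedy selection follow the formulas above:

```python
import random

# Q-learning with epsilon-greedy exploration on a toy 2-state task:
# asking in state 0 fills the slot, confirming in state 1 ends the
# episode with a success reward.
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
ACTIONS = ["ask", "confirm"]
Q = {(s, a): 0.0 for s in (0, 1, "end") for a in ACTIONS}

def env_step(s, a):
    # Deterministic toy environment: returns (reward, next_state).
    if s == 0 and a == "ask":
        return -1.0, 1
    if s == 1 and a == "confirm":
        return 10.0, "end"
    return -1.0, s

def choose(s):
    if random.random() < EPS:                         # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])      # exploit

random.seed(0)
for _ in range(500):                                  # training episodes
    s = 0
    while s != "end":
        a = choose(s)
        r, s2 = env_step(s, a)
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        # Off-policy update: no transition model needed.
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda a: Q[(0, a)]),
      max(ACTIONS, key=lambda a: Q[(1, a)]))          # ask confirm
```

Unlike value iteration, the agent never sees the transition model; it learns purely from sampled (state, action, reward, next state) tuples, which is what makes the approach applicable to recorded or simulated dialog sessions.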
User Simulation to Generate Training Data
Reinforcement Learning requires many training episodes, many more than are available for common systems
Use recorded dialog sessions to train a user model and use it to generate an unlimited number of training episodes
Simple user model: bigrams, i.e. P(Action_user | Action_system)
Include an error model, e.g. P(observed Action_user | Action_user)
Test performance with real users or at least on a different user simulation!
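The bigram user model can be sketched directly: estimate P(user act | system act) from recorded pairs and sample from it. The corpus and act names below are invented for illustration, and the error model is omitted for brevity:

```python
import random
from collections import Counter, defaultdict

# Bigram user model P(user act | system act), estimated from recorded
# (system_act, user_act) pairs (this "corpus" is invented).
corpus = [("ask_city", "inform_city"), ("ask_city", "inform_city"),
          ("ask_city", "silence"),
          ("ask_confirm", "yes"), ("ask_confirm", "no")]

counts = defaultdict(Counter)
for sys_act, user_act in corpus:
    counts[sys_act][user_act] += 1

def simulate_user(sys_act):
    # Sample a user act from the estimated conditional distribution.
    acts = counts[sys_act]
    total = sum(acts.values())
    return random.choices(list(acts), [c / total for c in acts.values()])[0]

random.seed(1)
print(simulate_user("ask_city"))
```

Chaining this sampler with the dialog strategy under training yields arbitrarily many simulated episodes; the caveat above still applies, i.e. the learned strategy must be evaluated against real users.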
[Figure: user-simulation loop - 1. record real dialog sessions with a dialog system prototype; 2. estimate a user model and an error model from them; 3. generate simulated dialog sessions; 4. train the dialog agent on them; 5. evaluate against real dialog sessions]
User Simulation
Partially Observable MDP (POMDP)
We make mistakes when we evaluate the user's response and his intention
Consequence: the state is ambiguous!
Solution: extend the MDP by a probability distribution over all states (a belief state)
Advantage: implicitly maintains multiple hypotheses with confidence scores
New problem: the state space explodes (it even becomes continuous)
Task success: number of filled slots, word error rate, ...
Efficiency: number of turns
User satisfaction (sometimes not correlated with task success or efficiency)
Naturalness (how?)
PARADISE framework (Walker, Litman et al. 1997): calculates a single performance value based on task success and dialog costs. Factors are weighted by their ability to predict user satisfaction.
Evaluation
Wizard of Oz Experiments
Errors in a running system are expensive to correct
Errors in a written specification are hard to find
Solution: Design a prototype!
Replace the dialog strategy by a human operator (the wizard)
The wizard has the same perceptual and operational capabilities as the final dialog system
The wizard can decide on his own or based on a predefined script
The user does not know about the wizard
Speech Recognition in Dialog Systems
Traditionally in ASR, we search the W with maximal P(W | Audio). But in dialog systems, we are interested in semantic or dialog acts!
The occurrence of semantic concepts depends on the discourse and dialog state
Consequence: returning only the best textual hypothesis is mathematically incorrect!
ASR and dialog system must communicate
Possible solution: weigh grammar rules according to the dialog act expectations (Fügen, Holzapfel 2004)
See some of the mathematics explained in (Young 2002)
Speech Recognition in Dialog Systems
Large dialog systems often deal with large vocabularies, which induce a high word error rate, especially for proper names (person names, locations, ...)
Often, not the whole vocabulary is equally likely in each situation
Example: A navigation device in Karlsruhe will hear the destination “Durlach” more often than the destination “Paris” (even if the latter has a higher a priori probability)
Consequence: Limit the vocabulary for each turn separately
Multimodality in Dialog Systems
Use more than just speech information
Natural way of interaction
Examples: gestures, facial or vocal identity, emotion
Convert each modality to a TFS representation
Unify feature structures for different modalities
Dialog strategies are modality independent!
Natural Interaction?
Speech = natural way of communicationSpoken Dialog Systems = natural interaction???
Adaptive Dialog Systems
Users of dialog systems vary in many dimensions: age, experience, personality, emotional state, ...
This has a huge impact on performance and user satisfaction, dialog designers should not ignore it!
Simplified example: use explicit confirmations for inexperienced users and implicit or no confirmations for experienced users
Extend dialog state by control variables that describe the user state
Currently unsolved: We are working on it
Our Approach for Adaptive Dialog Systems
Detect the user state by using multimodal fusion of biosignals (voice, video, EEG, EMG, ...)
Decide on adaptations of the system behavior:
• Voice properties (speed, pitch, ...)
• Language style (empathic vs. formal)
• Helping behavior
• Error recovery
Problems:
• Errors in state recognition
• Frequent changes are perceived as inconsistent
Dialog modelling is more than speech recognition
Requires knowledge from speech processing, linguistics, AI, social science, ...
Discourse representation
Dialog strategies: plan-based, finite state machines, statistical approaches, ...
Summary