On-line dialogue policy optimisation
Milica Gašić, Dialogue Systems Group

TRANSCRIPT

  • Slide 1

On-line dialogue policy optimisation
Milica Gašić, Dialogue Systems Group

  • Slide 2

Spoken Dialogue System Optimisation
Problem: What is the optimal behaviour?
Solution: Find it automatically through interaction.

  • Slide 3

Reinforcement learning

  • Slide 4

Training in interaction with humans
Problem 1: Optimisation requires too many dialogues.
Problem 2: Training makes random moves.
Problem 3: Humans give inconsistent ratings.

  • Slide 5

Outline
- Background
- Dialogue model
- Dialogue optimisation
- Sample-efficient optimisation
- Models for learning
- Robust reward function
- Human experiments
- Conclusion

  • Slide 6

Model: Partially Observable Markov Decision Process
[Graphical model: states s_t, s_t+1, action a_t, reward r_t, observations o_t, o_t+1]
- The state is Markov: it depends on the previous state and action through the transition probability P(s_t+1 | s_t, a_t).
- The state is unobservable and generates a noisy observation through the observation probability P(o_t | s_t).
- In every state an action is taken and a reward is obtained.
- A dialogue is a sequence of states.
- Action selection (the policy) is based on the distribution over all states at every time step t, the belief state b(s_t).

  • Slide 7

Dialogue state factorisation
Decompose the state s_t into conditionally independent elements: the user goal g_t, the user action u_t and the dialogue history d_t.
[Graphical model: g_t, u_t, d_t with action a_t, reward r_t and observation o_t, unrolled to g_t+1, u_t+1, d_t+1, o_t+1]

  • Slide 8

Further dialogue state factorisation
Each element is factorised further per slot, e.g. g_t^food, u_t^food, d_t^food and g_t^area, u_t^area, d_t^area.
[Graphical model: slot-level factorisation unrolled over time steps t and t+1]

  • Slide 9

Policy optimisation in summary space
Compress the belief state into a summary space [1].
[Diagram: a summary function maps the original belief space to a summary space, the summary policy chooses summary actions there, and a master function maps them back to actions in the original space]
[1] J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."

  • Slide 10

Q-function
The Q-function measures the expected discounted reward that can be obtained from a summary point when an action is taken. It takes the reward of future actions into account, so optimising the Q-function is equivalent to optimising the policy:

  Q^pi(b, a) = E_pi[ sum_k gamma^(k-1) r_k | b_0 = b, a_0 = a ]

where gamma in (0, 1] is the discount factor, r_k is the reward, b is the starting summary point, a is the starting action, and the expectation is taken with respect to the policy pi.

  • Slide 11

Online learning
- Reinforcement learning in direct interaction with the environment.
- Actions are taken ε-greedily:
  Exploitation: choose the action according to the best estimate of the Q-function.
  Exploration: choose an action randomly (with probability ε).
- In practice, tens of thousands of dialogues are needed!

  • Slide 12

Problem 1: Standard models require too many dialogues

  • Slide 13

Solution: Take into account similarities between different belief states
Essential ingredients: a Gaussian process and a kernel function.
Outcome: sample-efficient policy optimisation (see the sketch after Slide 17 below).

  • Slide 14

Gaussian Process Policy Optimisation
The Q-function is the expected long-term reward. It can be modelled as a Gaussian process.
Prior: a zero-mean Gaussian process with kernel k((b, a), (b', a')).
Posterior: given the visited summary states, the actions taken and the obtained rewards, the Q-function estimate is again Gaussian.

  • Slide 15

Voice mail example
The user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown and the system maintains a belief state b(s).

  • Slide 16

The role of the kernel function in a Gaussian process
The kernel function models the correlation between different Q-function values.
[Plot: Q-function value for the Confirm action as a function of the belief state]

  • Slide 17

Problem 2: Standard models make random moves
Exploitation? Exploration?
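To make Slides 13-16 concrete, here is a minimal sketch (not material from the talk) of Gaussian-process regression of a Q-function, written in Python with numpy over a one-dimensional belief such as the voice-mail probability of "save" from Slide 15. The visited belief points, observed returns, kernel width and noise level are all invented for illustration; the point is only that the kernel correlates nearby belief points, so a handful of visited points yields a Q estimate, with uncertainty, everywhere else.

import numpy as np

def rbf_kernel(x1, x2, length_scale=0.2):
    # Squared-exponential kernel: nearby belief points get correlated Q-values.
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(b_train, q_train, b_test, noise=0.1):
    # Standard GP regression: posterior mean and variance of Q at the test points.
    K = rbf_kernel(b_train, b_train) + noise ** 2 * np.eye(len(b_train))
    K_s = rbf_kernel(b_train, b_test)
    K_ss = rbf_kernel(b_test, b_test)
    K_inv_q = np.linalg.solve(K, q_train)
    mean = K_s.T @ K_inv_q
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

# Q-value estimates for one action (e.g. Confirm) at three visited belief points,
# where the belief is the probability that the user wants "save" (invented values).
b_visited = np.array([0.1, 0.5, 0.9])
q_observed = np.array([-1.0, 0.5, 2.0])

# The posterior provides a Q estimate and an uncertainty at unvisited points too,
# which is what makes the optimisation sample-efficient.
b_grid = np.linspace(0.0, 1.0, 5)
mean, var = gp_posterior(b_visited, q_observed, b_grid)
for b, m, v in zip(b_grid, mean, var):
    print(f"b={b:.2f}  Q_mean={m:+.2f}  Q_std={np.sqrt(v):.2f}")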
  • Slide 18

Solution: Define a stochastic policy
- The Gaussian process defines a Gaussian distribution over the Q-value of each action.
- Sample from these distributions and act on the samples (see the sketch at the end of this transcript).
- This automatically deals with exploration/exploitation.
Outcome: less unexpected behaviour.

  • Slide 19

Results during testing (with simulated user)
[Plot]

  • Slide 20

Results during training (with simulated user)
[Plot]

  • Slide 21

Problem 3: Humans give inconsistent ratings
The reward is a measure of how good the dialogue is.

  • Slide 22

On-line learning from user rating

  • Slide 23

User rating inconsistency

                                         Random policy   Online learned policy   Simulator trained policy
  User rating (%)                             36.3               76.9                     85.7
  Objective score (%)                         17.7               53.8                     63.7
  P(user rating=1 | objective score=1)          -                0.80                     0.94
  P(user rating=1 | objective score=0)        0.26               0.57                     0.68

  • Slide 24

Solution: Incorporate both objective and subjective evaluation

  • Slide 25

Evaluation results

                         Simulator trained   On-line trained
  Evaluation dialogues         400                 410
  Reward                   11.6 +/- 0.4        13.4 +/- 0.3
  Success (%)              93.5 +/- 1.2        96.8 +/- 0.9

  • Slide 26

Conclusions
GP in policy optimisation:
- Automates dialogue manager optimisation
- Enables sample-efficient optimisation
- Outperforms simulator-trained policies
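As a companion to Slide 18, here is a minimal sketch (again with invented numbers, not the talk's implementation) of the stochastic policy: at the current belief point, sample one Q-value per action from its Gaussian posterior and take the action with the best sample, so uncertain actions are still explored without an explicit ε.

import numpy as np

rng = np.random.default_rng(0)

def sample_action(q_means, q_vars):
    # Sample one Q-value per action from its Gaussian posterior and act greedily
    # with respect to the samples.
    samples = rng.normal(q_means, np.sqrt(q_vars))
    return int(np.argmax(samples))

# Hypothetical posterior at one belief point in the voice mail example.
actions = ["confirm", "save", "delete"]
q_means = np.array([1.2, 0.8, -0.5])   # posterior means of Q(b, a)
q_vars = np.array([0.05, 0.60, 0.40])  # posterior variances of Q(b, a)

counts = np.zeros(len(actions), dtype=int)
for _ in range(1000):
    counts[sample_action(q_means, q_vars)] += 1
# The uncertain action ("save") is still tried even though "confirm" has the
# highest posterior mean.
print(dict(zip(actions, counts)))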