
Task Refinement for Autonomous Robots using Complementary Corrective Human Feedback

Çetin Meriçli1, Manuela Veloso2, and H. Levent Akın3 *

1 Department of Computer Engineering, Boğaziçi University, Bebek, Turkey, and the Computer Science Department, Carnegie Mellon University, USA
2 Computer Science Department, Carnegie Mellon University, USA
3 Department of Computer Engineering, Boğaziçi University, Bebek, Turkey

Abstract

A robot can perform a given task through a policy that maps its sensed state to appropriate actions. We assume that a hand-coded controller can achieve such a mapping only for the basic cases of the task. Refining the controller becomes harder and gets more tedious and error prone as the complexity of the task increases. In this paper, we present a new learning from demonstration approach to improve the robot's performance through the use of corrective human feedback as a complement to an existing hand-coded algorithm. The human teacher observes the robot as it performs the task using the hand-coded algorithm and takes over the control to correct the behavior when the robot selects a wrong action to be executed. Corrections are captured as new state-action pairs, and the default controller output is replaced by the demonstrated corrections during autonomous execution when the current state of the robot is decided to be similar to a previously corrected state in the correction database. The proposed approach is applied to a complex ball dribbling task performed against stationary defender robots in a robot soccer scenario, where physical Aldebaran Nao humanoid robots are used. The results of our experiments show an improvement in the robot's performance when the default hand-coded controller is augmented with corrective human demonstration.

1. Introduction

Transferring the knowledge of how to perform a certain task to a complex robotic platform remains a challenging problem in robotics research, with an increasing importance as robots start emerging from research laboratories into everyday life and interacting with ordinary people who are not robotics experts. A widely adopted method for transferring task knowledge to a robot is to develop a controller using a model for performing the task or skill, if such a model is available. Although it is usually relatively easy to develop a controller that can handle trivial cases, handling more complex situations often requires substantial modifications to the controller. Due to interference between the newly added cases and the existing ones, it becomes a tedious and time consuming process to improve the controller and the underlying model as the number of such complex cases increases. This brings out the need for a new approach to robot programming.

The Learning from Demonstration (LfD) paradigm is one such approach; it utilizes supervised learning to transfer task or skill knowledge to an autonomous robot without explicitly programming it. Instead of hand-coding a controller for performing a task or skill, LfD methods make use of a teacher who demonstrates to the robot how to perform the task or skill while the robot observes the demonstrations and synchronously records

* The first author is supported by the Turkish State Planning Organization (DPT) under the TAM Project, number 2007K120610.


the demonstrated actions along with the perceived state of the system. The robot then uses the stored state-action pairs to derive an execution policy for reproducing the demonstrated task or skill. Compared to more traditional exploration based methods, LfD approaches aim to reduce the learning time and eliminate the necessity of defining a proper reward function, which is considered to be a difficult problem (7). Moreover, since LfD approaches do not require the robot to be programmed explicitly, they are very suitable for cases where the task knowledge is available through a user who is an expert in the task domain but not in robotics.

Providing a way for humans to transfer task and skill knowledge to robots via natural interactions, LfD approaches are also suitable for problems where an overall analytical model for the task or skill is not available but a human teacher can tell which action to take in a particular situation. However, providing a sufficient number of examples is a very time consuming process when working with robots with highly complex body configurations, such as humanoids, and for sophisticated tasks with very high dimensional state and action spaces.

LfD based methods have been applied to many learning scenarios involving high level task and low level skill learning on different robotic platforms, varying from wheeled and legged robots to autonomous helicopters. Here we present a few representative studies and strongly encourage the reader to refer to (7) for a comprehensive survey on LfD.

While learning to perform high level tasks, it is a common practice to assume that the low level skills required to perform the task are available to the robot. Task learning from demonstration has been studied in many different contexts, such as learning how to bake a cake (from a reinforcement learning point of view) (40), learning concepts (from an active learning point of view) (14), and learning general behavior policies from demonstration (from the sliding autonomy point of view) for a single robot (19; 21; 22; 24) and for multi-robot systems (20; 23) via the "Confidence Based Autonomy (CBA)" approach.

Several approaches to low level skill learning in the literature utilize LfD methods with different foci. Tactile interaction has been utilized for skill acquisition through kinesthetic teaching (31) and skill refinement through tactile correction (4–6; 8; 15). Motion primitives have been used for learning biped walking from human demonstrated joint trajectories (36) and learning to play air hockey (11).

Several regression based approaches have been proposed for learning quadruped walking on a Sony AIBO robot and learning low level skills for playing soccer (28; 29), for learning several skills with different characteristics (cyclic skills, skills with multiple constraints, etc.) using a probabilistic approach that utilizes Hidden Markov Models (HMM) along with regression (16), and for learning non-linear multivariate motion dynamics (27).

Interacting with the learner using high level abstract methods has been introduced in the forms of natural language (12; 39) and advice operators as functional transformations for low level robot motion, demonstrated on a Segway RMP robot (2; 3). Reinforcement learning methods have been investigated in conjunction with the LfD paradigm for teaching a flying robot how to perform a complex skill (1), learning to swing up a pole and keep it balanced (9; 10), learning constrained reaching tasks (30), and hierarchical learning of quadrupedal locomotion on rough terrain (33).

In our previous work on complementary skill refinement, we used real-time corrective human demonstration to improve the biped walk stability of a Nao humanoid robot (17; 18). An existing walk algorithm was used to capture a complete walk cycle, and the captured walk cycle was played back to obtain a computationally cheap open-loop walking behavior. A human demonstrator monitored the robot as it walked using the open-loop controller and modified the joint commands in real-time via a wireless game controller to keep the robot stable. The recorded demonstration values together with the corresponding sensor readings were used to derive a policy for computing proper joint command correction values for a given sensory reading to recover the balance of the robot.

In this paper, we present a corrective demonstration approach for task execution refinement where a hand-coded algorithm for performing the task exists but is inadequate in handling complex cases. The human demonstrator observes the robot carrying out the task by executing the hand-coded algorithm and provides corrective feedback when the hand-coded controller computes a wrong action. The received demonstration actions are stored along with the state of the robot at the time of correction as complements (or "patches") to the base hand-coded algorithm. During autonomous execution, the robot substitutes the action computed by the hand-coded algorithm with the demonstrated action if the corrective demonstration history database contains a demonstration provided in a similar state. The key idea is to keep the base controller algorithm as the primary source of the action policy, and to use the demonstration data as exceptions only when needed, instead of deriving the entire policy out of the demonstrations and the output of the controller algorithm. We applied this approach to a complex ball dribbling task in the humanoid robot soccer domain. Experimental results show considerable performance improvement when the hand-coded algorithm is complemented by corrective human feedback. Since the human teacher provides correction only when the robot performs an erroneous action, the number of corrective demonstration examples needed to improve the task performance is smaller than in other LfD approaches in the literature.

The rest of the paper is organized as follows. In Section 2, we describe the robot soccer domain and the hardware platform used in this study, followed by a brief overview of the software infrastructure used in our humanoid robot soccer system. In Section 3, we first give the problem definition for the ball dribbling task, which is our application domain, and then present a thorough explanation of the special image processing system for free space detection, the ball dribbling behavior developed using the available low level skills and the software infrastructure described in Section 2, and a hand-coded algorithm for the action selection parts of the ball dribbling behavior. Section 4 contains the explanation of the corrective demonstration setup for delivering the demonstration to the robot and a domain-specific correction reuse system used for deciding when to apply a correction based on the similarity of the current state of the system to the states in the correction database. We present our experimental study in Section 5, with results showing a considerable improvement in the task completion time when corrective demonstration is used as a complement to the original hand-coded algorithm over using the original hand-coded algorithm alone. Pointing out some future directions to be further explored, we conclude the paper in Section 6.

2. Background

2.1 Robot Soccer Domain

RoboCup is an international research initiative that aims to foster research in the fields of artificial intelligence and robotics by providing standard problems to be tackled from different points of view, such as software development, hardware design, and systems integration (37). Soccer was selected by the RoboCup Federation as the primary standard problem due to its inherently complex and dynamic nature, allowing scientists to conduct research on many different sub-problems ranging from multi-robot task allocation to image processing, and from biped walking to self-localization. With its various categories focusing on different challenges in the soccer domain, such as playing soccer in simulated environments (the 2D and 3D Simulation Leagues) and physical environments using wheeled platforms (the Small Size League and the Middle Size League), humanoid robots of different sizes and capabilities (the Humanoid League), and a standard hardware platform (the Standard Platform League), the ultimate goal of RoboCup is to develop, by 2050, a team of 11 fully autonomous humanoid robots that can beat the human world champion soccer team in a game that will be played on a regular soccer field complying with the official FIFA rules.

In the Standard Platform League (SPL) of RoboCup (38), teams of 3 autonomous humanoid robots play soccer on a 6 meters by 4 meters green carpeted field (Figure 1(a)). The league started in 1998 as an embodied software competition with a common and standard hardware platform, hence the name. Sony AIBO robot dogs were used as the standard robot platform of the league until 2008, and the Aldebaran Nao humanoid robot was selected as the new standard platform thereafter. A snapshot showing the Nao robots playing soccer is given in Figure 1(b).


Figure 1. a) The field setup for the RoboCup Standard Platform League (SPL), and b) a snapshot from an SPL game showing the Nao robots playing soccer.

2.2 Hardware Platform

The Aldebaran Nao robot (Figure 2), which is the standard hardware platform for the RoboCup SPL competitions, is a 4.5 kg, 58 cm tall humanoid robot with 21 degrees of freedom1. The Nao has an on-board 500 MHz processor, which is shared between the low level control system and the autonomous perception, cognition, and motion algorithms. It is equipped with a variety of sensors including two color cameras, two ultrasound distance sensors, a 3-axis accelerometer, a 2-axis (X-Y) gyroscope, an inertial measurement unit for computing the absolute orientation of the torso, 4 pressure sensors on the sole of each foot, and a bump sensor at the tiptoe of each foot.

The Nao runs a Linux-based operating system and provides a software framework named NaoQi, which allows users to develop their own controller software and access the sensors and actuators of the robot. The internal controller software of the robot runs at 100 Hz, making it possible to read new sensor values and send actuator commands every 10 ms.

1 More information can be found on http://www.aldebaran-robotics.com.


Figure 2. The Aldebaran Nao humanoid robot.

2.3 Software Overview

Being able to play soccer requires several complex software modules (i.e., image processing, self localization, motion generation, planning, communication, etc.) to be designed, implemented, and seamlessly integrated with each other. In this section of the paper, we present a brief overview of the software infrastructure developed for the RoboCup SPL competitions and also used in this study.

2.3.1 Image Processing

The Nao humanoid robots perceive their environment via their sensors, namely the two color cameras, the ultrasound distance sensors, the gyroscope, and the accelerometer. All the important objects in the game environment (i.e., the field, the goals, the ball, and the robots) are color coded to facilitate object recognition. However, perception of the environment remains the most challenging problem, primarily due to the extremely limited on-board processing power that prevents the use of intensive and sophisticated computer vision algorithms. The very narrow fields of view (FoV) of the robot's cameras (≈ 58° diagonal) and their sensitivity to changes in light characteristics like the temperature and luminance levels are among the other contributing factors to the perception problem.

The job of the image processing module is to extract the relative distances and bearings of the objects detected in the camera image. In addition to the position information, the image processing module also reports confidence scores indicating the likelihood of those objects being actually present in the camera image.
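As a rough illustration of what such a report could look like in code, the following sketch uses made-up field names and is not the actual interface of our vision module:

```python
from dataclasses import dataclass

@dataclass
class PerceivedObject:
    label: str          # e.g., "ball", "goal_post", "robot"
    distance_mm: float  # relative distance to the object
    bearing_rad: float  # relative bearing, positive towards the robot's left
    confidence: float   # likelihood that the object is actually in the image, in [0, 1]

# Example report: a ball seen 850 mm away, slightly to the right, with high confidence.
ball = PerceivedObject(label="ball", distance_mm=850.0, bearing_rad=-0.12, confidence=0.93)
```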

2.3.2 Self Localization and World Modeling

These modules are responsible for determining the location of the robot as well as the locations of the other important objects (e.g., the ball) on the field. Our system uses a variation of Monte Carlo Localization (MCL) called Sensor Resetting Localization (34) for estimating the position of the robot on the field. For calculating and tracking the global positions of the other objects, we employ a modeling approach which treats objects based on their individual motion models defined in terms of their dynamics (25; 26).

2.3.3 Planning and Behavior Control

Our planning and behavior generation module is built using a hierarchical Finite State Machine (FSM) based multi-robot control formalism called Skills, Tactics, and Plays (STP) (13). Plays are multi-robot formations where each robot executes a tactic consisting of several skills. Skills can be stand-alone or formed via a hierarchical combination of other skills.

2.3.4 Motion Generation

The motion generation module is responsible for all types of movement on the field including biped walking, ball manipulation (e.g., kicking), and some other motions such as getting back upright after a fall. For biped walking, we use the omni-directional walk algorithm provided by Aldebaran. For kicking the ball and the other motions, we use predefined actions in the form of sequences of keyframes, each of which defines a vector of joint angles and a duration value for the interpolation between the previous pose and the current one. Two variations (strong and weak) of three types of kick (side kick to the left, side kick to the right, and forward kick) are implemented to be used in the games.
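To make the keyframe representation concrete, a kick motion can be thought of as a list of (joint angles, duration) pairs; the joint values below are illustrative placeholders, not the actual kick definitions used on the robot:

```python
# Each keyframe is a target pose (a vector of joint angles, here keyed by joint name)
# plus the time allotted to interpolate from the previous pose to this one.
weak_forward_kick = [
    ({"LHipPitch": -0.35, "LKneePitch": 0.90, "LAnklePitch": -0.45}, 0.40),  # wind up
    ({"LHipPitch": -0.05, "LKneePitch": 0.30, "LAnklePitch": -0.25}, 0.15),  # swing through
    ({"LHipPitch": -0.20, "LKneePitch": 0.60, "LAnklePitch": -0.40}, 0.50),  # recover
]

def play_keyframes(robot, keyframes):
    """Send each pose to the robot, interpolating over the given duration (placeholder API)."""
    for joint_angles, duration_s in keyframes:
        robot.interpolate_to(joint_angles, duration_s)
```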

3. Proposed Approach

3.1 Problem Definition

Technical challenges are held as a complementary part of the RoboCup SPL competitions with the aim of creating a research incentive on complex soccer playing skills that will help leverage the quality of the games and enable the league to gradually approach the level of real soccer games both in terms of the field setup and the game rules. The technical challenges are determined each year by the Technical Committee of RoboCup SPL.

Figure 3. An example scenario for the dribbling challenge.

Our application and evaluation domain, the "Dribbling Challenge", was one of the three technical challenges of the 2010 SPL competitions. In that challenge, an attacker robot is expected to score a goal in three minutes without itself or the ball touching any of the three stationary defender robots that are placed on the field in such a way as to block the direct shot paths. The positions of the obstacle robots are not known beforehand; therefore, the robot has to detect the opponent robots, model the free space on the field, and plan its actions accordingly. An example scenario is illustrated in Figure 3.

3.2 Free Space Detection using Vision

Instead of trying to detect the defender robots and avoid them, our attacker robot detects the free space in front of it and builds a free space model of its surroundings to decide which direction is the best to dribble the ball towards. The soccer field is a green carpet with white field lines on it. The robots are also white and gray, and they wear pink or blue waist bands as uniforms (Figure 1(b), Figure 3). Therefore, anything that is non-green and lying on the field can be considered an obstacle, except for the detected field lines. We utilize a simplified version of the Visual Sonar algorithm by Lenser and Veloso (35) and the algorithm by Hoffmann et al. (32). We scan the pixels on the image along evenly spaced vertical lines called scanlines, starting from the bottom end and continuing until we see a certain number of non-green pixels. Although the exact distance function is determined by the position of the camera, in general the projected relative distance of a pixel increases as we ascend from the bottom of the image to the top, assuming all the pixels lie on the ground plane. If we do not encounter any green pixels along a scanline, we consider that scanline as fully occupied. Otherwise, the point where the non-green block starts is marked as the end of the free space towards that direction. To further save computation time, we do not process every vertical line on the image. Instead, we process the lines along every fifth pixel and every other pixel along those lines. As a result, we effectively process only 1/10th of the image (Figure 4(b)). The pixels denoting the end of the free space are then projected onto the ground to obtain a rough estimate of the distance of the corresponding obstacle in the direction of the scanned line. In order to cover the entire 180° space in front of it, the robot pans its head from side to side. As the head moves, the computed free space end points are combined and divided into 15 slots, each covering 12° in front of the robot. In the meantime, each free space slot is tagged with a flag indicating whether that slot points towards the opponent goal or not, based on the location of the opponent goal in the world model, or the estimated location and orientation of the robot on the field (Figure 4(c)).
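A minimal sketch of this scanline procedure is given below; the is_green() color classifier and the project_to_ground() camera projection are assumed helpers standing in for the actual color segmentation and camera model, so this is an illustration of the idea rather than the code running on the robot:

```python
import math

NUM_SLOTS = 15                       # 15 slots of 12 degrees cover the 180 degrees ahead
SLOT_WIDTH = math.radians(12)

def scan_free_space(image, is_green, project_to_ground, block_size=3):
    """Scan every fifth column, every other row from the bottom, and return the
    projected ground points (distance, bearing) where the free space ends."""
    height, width = len(image), len(image[0])
    endpoints = []
    for x in range(0, width, 5):                     # every fifth vertical line
        saw_green, run, block_start = False, 0, None
        for y in range(height - 1, -1, -2):          # every other pixel, bottom to top
            if is_green(image[y][x]):
                saw_green, run, block_start = True, 0, None
            else:
                run += 1
                if block_start is None:
                    block_start = y
                if run >= block_size:                # a solid non-green block: obstacle
                    break
        if not saw_green:
            block_start = height - 1                 # no green at all: fully occupied line
        if block_start is not None:
            endpoints.append(project_to_ground(x, block_start))
    return endpoints

def fold_into_slots(slot_dists, endpoints):
    """Keep, for each 12-degree slot, the closest free space end point seen so far."""
    for distance, bearing in endpoints:
        i = min(NUM_SLOTS - 1, max(0, int((bearing + math.pi / 2) / SLOT_WIDTH)))
        slot_dists[i] = min(slot_dists[i], distance)
    return slot_dists
```

Repeated calls to fold_into_slots() as the head pans would accumulate the endpoints from successive images into the 15-slot model described above.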

3.3 Ball Dribbling Behavior

We used the Finite State Machine (FSM) based behavior system explained in Section 2.3.3 for developing the ball dribbling behavior. The FSM structure of the ball dribbling behavior is depicted in Figure 5. The robot starts with "searching for the ball" by panning its head from side to side several times using both cameras. If it cannot find the ball at the end of this initial scan, it starts turning in place while tilting its head up and down, and this cycle continues until the ball is detected. Once the ball is located on the field, the "approach the ball" behavior gets activated and the robot starts walking towards the ball. Utilizing the omni-directional walk, it is guaranteed that the robot faces the ball when the "approach the ball" behavior is executed and completed successfully. After reaching the ball, the robot pans its head one more time to gather information about the free space around it, calculates its current state, selects an action that matches its state, and finally kicks the ball towards a target point computed according to the selected action. If the robot loses the ball at any instant of this process, it goes back to the "search for the ball" state. Except for the lightly colored select action and select dribble direction states shown on the state diagram, each state in the given FSM corresponds to a low level skill. We use the existing low level skills in our robot soccer system without any modifications; namely, looking for the ball, approaching the ball, lining up for a kick, and kicking the ball to a specified point relative to the robot by selecting an appropriate kick from the portfolio of available kicks.

Figure 4. The environment as perceived by the robot: a) the color segmented image showing the potential obstacle points marked as red dots, b) the computed perceived free space segments illustrated with cyan lines overlaid on the original image, and c) the free space model built out of the perceived free space segments extracted from a set of images taken while the robot pans its head. Here, the dark triangles indicate the free space slots pointing towards the opponent goal.

Figure 5. The state diagram of the base system.
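The FSM in Figure 5 can be read as a simple dispatch loop, sketched below. The robot interface (ball_visible(), approach_ball(), kick_towards(), etc.) is a placeholder for the existing low level skills, and select_action() / select_dribble_direction() stand for Algorithms 1 and 2 given in the next subsection; this is an illustrative reading of the state diagram, not the actual behavior code:

```python
def dribble_step(state, robot):
    """Execute one decision step of the ball dribbling FSM and return the next state."""
    if state != "search_ball" and not robot.ball_visible():
        return "search_ball"                       # ball lost at any point: search again
    if state == "search_ball":
        robot.scan_for_ball()                      # pan head, then turn in place
        return "approach_ball" if robot.ball_visible() else "search_ball"
    if state == "approach_ball":
        robot.approach_ball()                      # omni-directional walk towards the ball
        return "scan_free_space" if robot.at_ball() else "approach_ball"
    if state == "scan_free_space":
        robot.pan_head_and_build_free_space_model()
        return "select_action"
    if state == "select_action":
        if robot.select_action() == "shoot":       # Algorithm 1, using the free space model
            robot.kick_towards(robot.opponent_goal_center())
            return "search_ball"
        return "select_dribble_direction"
    if state == "select_dribble_direction":
        angle = robot.select_dribble_direction()   # Algorithm 2
        robot.kick_towards(robot.target_at_angle(angle))  # weaker, shorter range kick
        return "search_ball"
    return "search_ball"
```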

3.4 Action and Dribble Direction Selection

The select action and the select dribble direction states constitute the main decision points of the system we aim to improve using corrective demonstrations. The hand-coded algorithms for both the action and the dribble direction selection parts utilize the free space model in front of the robot. After lining up with the ball properly, the robot selects one of the following two actions:

• Shoot

• Dribble

The shoot action corresponds to kicking the ball directly towards the opponent goal using a powerful and long range kick. The dribble action, on the other hand, corresponds to dribbling the ball towards a more convenient location on the field using a weaker and shorter range kick. When the robot reaches the decision point, that is, after it aligns itself with the ball and scans the environment for free space modeling, the action selection algorithm checks whether any of the free space slots pointing towards the opponent goal has a distance less than a certain fraction of the distance to the goal. If so, the path to the opponent goal is considered "occupied" and the dribble action is selected in that situation. Otherwise, the path is considered "clear" and the shoot action targeting the center of the opponent goal is selected. The pseudo-code of the action selection algorithm is given in Algorithm 1.

If the action selection algorithm deduces that the path to the opponent goal is blocked and subsequently selects the dribble action, a second algorithm steps in to determine the best way to dribble the ball. All slots in the free space model are examined and assigned a score computed as the weighted sum of the distance values of the slot itself and its left and right neighbors. The free space slot with the maximum score is selected as the dribble direction. The algorithm for dribble direction selection is given in Algorithm 2.

Algorithm 1 Action selection algorithm. Γ ∈ [0, 1] is a coefficient specifying the maximum distance to be considered as free space in terms of the distance to the goal. In our implementation, we use Γ = 0.5.

goalDist ← getGoalDist()
goalAngle ← getGoalAngle()
if goalAngle < −π/2 or goalAngle > π/2 then
    return dribble
else
    for all i ∈ getGoalSlots() do
        distDiff ← |goalDist − dist_i|
        if distDiff > Γ · goalDist then
            return dribble
        end if
    end for
end if
return shoot

Algorithm 2 Dribble direction selection algorithm. N is the number of free space slots.

goalAngle ← getGoalAngle()
if goalAngle < −π/2 or goalAngle > π/2 then
    if |angle_0 − goalAngle| < |angle_(N−1) − goalAngle| then
        dribbleAngle ← angle_0
    else
        dribbleAngle ← angle_(N−1)
    end if
else
    maxDist ← 0
    for i ← 1; i < N − 1; i ← i + 1 do
        distance ← 0.25 · dist_(i−1) + 0.5 · dist_i + 0.25 · dist_(i+1)
        if distance > maxDist then
            maxDist ← distance
            maxSlot ← i
        end if
    end for
    dribbleAngle ← angle_maxSlot
end if
return dribbleAngle
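For concreteness, the two listings above can be transcribed into Python roughly as follows. The free space model is assumed to be available as parallel lists of slot distances and slot angles, together with the goal distance, goal angle, and the indices of the slots flagged as pointing at the goal; this is a sketch, not the code running on the robot:

```python
import math

GAMMA = 0.5   # Γ: fraction of the goal distance below which a goal slot counts as blocked
N = 15        # number of free space slots

def select_action(goal_dist, goal_angle, goal_slots, slot_dist):
    """Algorithm 1: return 'shoot' or 'dribble' based on the free space model."""
    if goal_angle < -math.pi / 2 or goal_angle > math.pi / 2:
        return "dribble"                              # goal is behind the robot
    for i in goal_slots:                              # slots flagged as pointing at the goal
        if abs(goal_dist - slot_dist[i]) > GAMMA * goal_dist:
            return "dribble"                          # free space ends well before the goal
    return "shoot"

def select_dribble_direction(goal_angle, slot_dist, slot_angle):
    """Algorithm 2: return the angle of the best slot to dribble the ball towards."""
    if goal_angle < -math.pi / 2 or goal_angle > math.pi / 2:
        # Goal behind the robot: pick the outermost slot closer to the goal direction.
        if abs(slot_angle[0] - goal_angle) < abs(slot_angle[N - 1] - goal_angle):
            return slot_angle[0]
        return slot_angle[N - 1]
    max_dist, max_slot = 0.0, 1
    for i in range(1, N - 1):
        # Score each slot by a weighted sum over the slot and its two neighbours.
        d = 0.25 * slot_dist[i - 1] + 0.5 * slot_dist[i] + 0.25 * slot_dist[i + 1]
        if d > max_dist:
            max_dist, max_slot = d, i
    return slot_angle[max_slot]
```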

Using the two algorithms explained above for the two action selection states in the behavior FSM, the robot is able to perform the ball dribbling task and score a goal with limited success. We define the success metric for this task to be the time it takes for the robot to score a goal. The performance evaluation results for the hand-coded action selection algorithms are provided in Section 5. In the following section, we present the corrective demonstration system developed as a complement to the hand-coded action selection algorithms for refining the task performance.

4. Corrective Demonstration

Argall et al. (7) formally define the learning from demonstration problem as follows. The world consists of states S, and A is the set of actions the robot can take. Transitions between states are defined by a probabilistic transition function T(s′|s, a) : S × A × S → [0, 1]. The state is not fully observable; instead, the robot has access to an observed state Z with the mapping M : S → Z. A policy π : Z → A is employed for selecting the next action based on the current observed state.

Corrective demonstration is a form of teacher demonstration that focuses on correcting an action selected by the robot for execution, by proposing

• an alternative action to be executed in that state, or

• a modification to the selected action.

The usual form of employing corrective demonstration is either adding the corrective demonstration example to the demonstration dataset or replacing an example in the dataset with the corrective example, and then re-deriving the action policy using the updated demonstration dataset.

However, re-deriving the execution policy each time a correction is received can be cumbersome if the total number of state-action pairs in the demonstration database is large. On the other hand, accumulating a number of corrective demonstration points and then re-deriving the execution policy may be misleading or inefficient since the demonstrator will not be able to see the effect of the provided corrective feedback immediately.

In our approach, we store the collected corrective demonstration points separately from the hand-coded controller, and utilize a reuse algorithm to decide when to use a correction. In the following subsections, we first describe how the corrective demonstration is delivered to the robot, and then we explain how the stored corrections are used during autonomous execution.

4.1 Delivering Corrective Demonstration

During the demonstration session, the robot is set to a semi-autonomous mode in which it executes the dribbling behavior with the hand-coded action selection algorithms and performs the dribbling task. Each time an action selection state is reached, the robot computes an action, announces the selected action, and asks the teacher for feedback. The teacher can give one of the following commands as the feedback:

1. Perform the selected action

2. Revise the action

If (1) is selected, the robot continues with the action computed by the hand-coded action selection system. If (2) is selected, the robot provides the available actions and asks the teacher to select one. If the robot is in the dribble direction selection state, it first selects a free space slot according to the hand-coded dribble direction selection algorithm. If the teacher wants to change the selected dribble slot, the robot provides the directions that can be chosen and waits for the teacher to select one. In both revision cases, the action provided by the teacher is paired with the current state of the system and stored in a database. No records are stored in the database when the teacher approves the robot's selection as the next action to be executed.
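One way to picture this interaction is the following sketch of a single decision point during the demonstration session; the teacher and robot interfaces are placeholders for the announcement and command mechanism described above, not the actual implementation:

```python
def demonstration_decision(robot, teacher, correction_db):
    """Handle one action (or dribble direction) selection point in semi-autonomous mode."""
    state = robot.observed_state()                    # free space slot distances and goal flags
    proposed = robot.hand_coded_selection(state)      # output of Algorithm 1 or Algorithm 2
    robot.announce(proposed)
    if teacher.approves():                            # feedback (1): perform the selected action
        robot.execute(proposed)                       # nothing is stored in this case
        return
    options = robot.available_selections(state)       # feedback (2): revise the action
    corrected = teacher.choose(options)
    correction_db.append((state, corrected))          # store the correction as a state-action pair
    robot.execute(corrected)
```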

4.2 Reusing the Corrections

By the end of the demonstration session, the robot has built a demonstration database of state-action pairs denoting which action was provided by the teacher as a replacement of the action computed by the hand-coded algorithms, and what the robot's state was when that correction was received. During autonomous execution, the decision of when to execute the action selected by the hand-coded algorithms and when to use corrective demonstration samples is made by a correction reuse system, based on the similarity of the current state of the robot to the states in which the demonstration samples were collected.

We define the observed state of the robot as

Z = ⟨slotDist_0, ..., slotDist_(N−1), goal_0, ..., goal_(N−1)⟩

where slotDist_i is the distance to the nearest obstacle inside slot i, and goal_i ∈ {true, false} is a Boolean flag which is set to true if slot i intersects with the goal, and set to false otherwise.

Since the robot is expected to kick or dribble the ball into the opponent goal, the distribution of the free space with respect to the direction towards the goal, rather than the mere position of the robot on the field, needs to be taken into account. Therefore, we calculate the sum of the absolute differences of the free space slots using the slot pointing towards the center of the goal as the origin if the goal is in sight. If the goal is not somewhere within the 180° in front of the robot, we calculate the sum of absolute differences of the free space slots using the rightmost slot as the origin. The similarity value in the range [0, 1] is then calculated as

similarity = e^(−K · diff²)

where K is a coefficient for shaping the similarity function, and diff is the calculated sum of absolute differences of the slot distances. In our implementation, we selected K = 5. The algorithm for similarity calculation is given in Algorithm 3.

During execution, when the robot reaches the action selection or dribble direction selection states, it first checks its demonstration database and fetches the demonstration sample with the highest similarity to the current state. If the similarity value is higher than a threshold value τ, the robot executes the demonstrated action instead of the action computed by the hand-coded algorithm. In our implementation, we use τ = 0.9.

Algorithm 3 The algorithm for computing the similarity of two given states.

dist_curr ← getSlotDist(Z_curr)
dist_demo ← getSlotDist(Z_demo)
diff ← 0
if goalAngle < −π/2 or goalAngle > π/2 then
    for i ← 0; i < N; i ← i + 1 do
        diff ← diff + |dist_curr(i) − dist_demo(i)|
    end for
    diff ← diff / N
else
    goalSlot_curr ← getGoalSlot(Z_curr)
    goalSlot_demo ← getGoalSlot(Z_demo)
    num ← 0
    s1 ← goalSlot_curr, s2 ← goalSlot_demo
    while s1 < N and s2 < N do
        diff ← diff + |dist_curr(s1) − dist_demo(s2)|
        num ← num + 1, s1 ← s1 + 1, s2 ← s2 + 1
    end while
    s1 ← goalSlot_curr, s2 ← goalSlot_demo
    while s1 ≥ 0 and s2 ≥ 0 do
        diff ← diff + |dist_curr(s1) − dist_demo(s2)|
        num ← num + 1, s1 ← s1 − 1, s2 ← s2 − 1
    end while
    diff ← diff / num
end if
similarity ← e^(−K · diff²)
return similarity
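A Python transcription of the similarity computation and the correction reuse rule might look as follows. It follows Algorithm 3 and the reuse criterion above with K = 5 and τ = 0.9; the SlotState container, the convention of using goal_slot = None when the goal is out of view, and the assumption that slot distances are expressed on a normalized scale are our own illustrative choices:

```python
import math
from collections import namedtuple

K, TAU, N = 5.0, 0.9, 15   # similarity shaping factor, reuse threshold, number of slots

# Slot distances plus the index of the slot facing the goal (None if the goal is not in view).
SlotState = namedtuple("SlotState", ["dist", "goal_slot"])

def state_similarity(curr, demo):
    """Algorithm 3: similarity in [0, 1] between the current and a demonstrated state."""
    if curr.goal_slot is None:              # goal not in sight: align slots from the rightmost one
        diff = sum(abs(c - d) for c, d in zip(curr.dist, demo.dist)) / N
    else:                                   # goal in sight: align both states on their goal slots
        diff, num = 0.0, 0
        for step in (+1, -1):               # sweep away from the goal slot in both directions
            s1, s2 = curr.goal_slot, demo.goal_slot
            while 0 <= s1 < N and 0 <= s2 < N:
                diff += abs(curr.dist[s1] - demo.dist[s2])
                num, s1, s2 = num + 1, s1 + step, s2 + step
        diff /= num
    return math.exp(-K * diff ** 2)

def choose_action(curr, correction_db, hand_coded_action):
    """Reuse the most similar stored correction if it is similar enough, else fall back."""
    best = max(correction_db, key=lambda sa: state_similarity(curr, sa[0]), default=None)
    if best is not None and state_similarity(curr, best[0]) >= TAU:
        return best[1]                      # demonstrated action
    return hand_coded_action                # output of the hand-coded algorithm
```

With τ = 0.9, a stored correction is replayed only when the free space profile is nearly identical to the one observed at demonstration time, which matches the intent of using the demonstrations as exceptions rather than as the primary policy.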

5. Results

We evaluated the efficiency of the complementary corrective demonstration using three instances of the ball dribbling task with different opponent robot placements in each of them. The test cases were designed in such a way that the robot using the hand-coded action selection algorithm would be able to complete the task, but not through following an optimal sequence of actions (Figure 6). The following criteria were kept in mind while designing the test cases:

• Case 1: In this scenario, we place two robots on the periphery of the center circle, leaving a narrow but passable corridor. The third robot is placed on the virtual intersection of the opponent penalty mark and the left corner of the opponent penalty box. The hand-coded behavior computes the corridor between the two center robots to be too narrow to pass. Therefore, the robot tries to avoid the two robots at the center and mostly chooses a right dribbling direction to avoid the third robot as well. During the demonstration, we advised the robot to take a direct shot between the two robots at the center (Figure 7(b)). This scenario was a good showcase for illustrating how to refine the otherwise imprecise output of a very simple algorithm; no additional complexity was introduced to the algorithm, and a limited number of demonstrations was provided only when the robot tried dribbling the ball whereas it could have taken a direct shot.

• Case 2: In this case, a direct shot is not possible from the initial position, and the robots are placed asymmetrically on the field in such a way that dribbling the ball towards the robot placed further away is advantageous. During the demonstration, the given advice was to first dribble the ball to the left, and then take a direct shot towards the goal (Figure 7(e)). The hand-coded algorithms tend to choose the right action by dribbling the ball to the left, but then the robot decides to advance the ball through a series of dribbles before kicking it into the goal instead of taking a direct shot.

• Case 3: This case was also designed to emphasize the ability of the proposed algorithm to reshape the behavioral response in addition to correcting mistakes. Similar to Case 2, a direct shot is not possible from the initial position, and the robots are placed symmetrically so that no clear advantage of choosing one initial dribbling direction over another exists. During the demonstration, we gave very similar advice to the one we gave in Case 2 to investigate whether we can create a bias towards a specific action in certain cases (Figure 7(h)).


Figure 6. Three different configurations used in the experiments: a) Case 1, b) Case 2, and c) Case 3.


Figure 7. Illustrations of the performance evaluation runs: (a) Case 1 HA, (b) Case 1 CD, (c) Case 1 HA+CD, (d) Case 2 HA, (e) Case 2 CD, (f) Case 2 HA+CD, (g) Case 3 HA, (h) Case 3 CD, (i) Case 3 HA+CD. Different colors denote different runs. For each run, a dashed line represents a dribble and a solid line represents a kick. HA stands for the hand-coded algorithm and CD stands for corrective demonstration. HA+CD shows the cases where the robot is in autonomous mode using both the hand-coded algorithm and the corrective demonstration database.

We gathered corrective demonstration data from all three cases and formed a common database. A total of 42 action selection and 21 dribble direction selection demonstration points were collected in a roughly 30 minute long demonstration session. With the time required to score a goal as the success metric, we then evaluated the performance of the system with and without the use of the corrective demonstration database.

We ran 10 trials for each case: 5 with the hand-coded action and dribble direction selection algorithms (HA), and another 5 trials with the corrective demonstration data (CD) in addition to the HA (HA+CD). The sequence of actions taken by the robot in each trial is depicted in Figure 7, and the timing information is presented in Table 1. In the figures, a dashed line indicates a dribble action, a solid line indicates a shoot action, and a thin line indicates the replacement of the ball to the initial position after committing a foul. In the table, "out" means that the robot kicked the ball out of bounds from the sides, "missed" means that the robot chose the right actions but the ball did not roll into the goal due to imperfect actuation, and "own goal" means that the robot accidentally kicked the ball into its own goal. The failed attempts are excluded from the given mean and standard deviation values. The failures were mostly due to the imperfection of the lower level skills like aligning with the ball, and the high variance in both the kick distance and the kick direction.

             Case 1            Case 2                    Case 3
Trial    HA      HA+CD     HA          HA+CD         HA               HA+CD
1        2:38    1:35      6:34 - out  1:42          2:10             1:57 - out
2        2:27    1:49      3:31        1:47          2:31             1:57
3        1:48    1:32      2:02        2:08          2:24             1:03
4        2:36    1:27      3:52        3:47 - out    3:45 - own goal  1:53
5        3:57    1:31      2:56        1:29 - missed 1:54             2:52
mean     2:41    1:34      3:05        1:52          2:14             1:56
std      0.0326  0.006     0.033       0.009         0.011            0.031

Table 1. Elapsed times during trials.

The decrease in the timings in all three test cases when using HA+CD compared to the system using HA alone shows an improvement in the overall performance since, according to the problem definition, shorter completion times are considered more successful. In Case 1, where the average completion time is reduced by around one minute, the improvement in the task performance was mostly due to the bias created by the corrective demonstration, which favors taking direct shots as opposed to the dribbling action computed by the hand-coded algorithm, as shown in Figure 7(c). In Case 2, the complementary corrective demonstration was able to correct the wrong decision made by the hand-coded algorithm of taking a second dribble action instead of a direct shot after dribbling the ball to the left. As a result, the average task completion time was reduced to almost half of the time it took on average when only HA was used. The effectiveness of corrective demonstration in Case 2 is presented in Figure 7(f). In Case 3, the corrective demonstration was again proven to be effective in creating a bias in situations where it is not analytically possible to prefer one action over another. Presenting a preference for dribbling to the left (Figure 7(h)), the corrective demonstration was able to change the initial response of the hand-coded algorithm from dribbling the ball to the right (Figure 7(g)) to dribbling the ball to the left (Figure 7(i)).

6. Conclusions and Future Work

In this paper, we contributed a task and skill improvement method which utilizes corrective human demonstration as a complement to an existing hand-coded algorithm for performing the task. We applied the method to one of the technical challenge tasks of the RoboCup 2010 Standard Platform League competitions, called the "Dribbling Challenge". Corrective demonstrations were supplied at two levels: action selection and dribble direction selection. The hand-coded action selection algorithms for kick type and dribble direction selection can handle most of the basic cases; however, they are unable to perform well on all the defined test cases, as the test cases are designed in such a way as to measure different aspects of the developed ball dribbling behavior. A human teacher monitors the execution of the task by the robot and intervenes as needed to correct the output of the hand-coded algorithm. During the advice session, the robot saves the received demonstration examples in a database, and during autonomous execution, it fetches the demonstration example received for the most similar state in the database using a domain specific similarity measure. If the similarity of those two states is above a certain threshold, the robot executes the fetched demonstration action instead of the output of its own algorithm. We presented empirical results in which the proposed method is evaluated in three different field setups. The results show that it is possible to improve the task performance by applying "patches" to the hand-coded algorithm through only a small number of demonstrations, as well as to "shape" a hand-coded behavior by complementing it with corrective demonstrations instead of modifying the underlying algorithm itself.

Investigating the possibility of developing a domain-free state similarity measure, applying corrective demonstration to the other sub-spaces of the task space, considering other demonstration retrieval mechanisms allowing better generalization over the state-action landscape, and applying the proposed method to more sophisticated tasks are among the future work that we aim to address.

7. Acknowledgments

The authors would like to thank the other members of the CMurfs RoboCup SPL team for their invaluable help, and Tekin Meriçli for proofreading the manuscript.

8. References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, 2004.


[2] B. Argall, B. Browning, and M. Veloso. Learning from demonstration with the critique of a human teacher. In Second Annual Conference on Human-Robot Interactions (HRI'07), 2007.

[3] B. Argall, B. Browning, and M. Veloso. Learning robot motion control with demonstration and advice-operators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'08), 2008.

[4] B. Argall, E. Sauser, and A. Billard. Policy adaptation through tactile correction. In Proceedings of the 36th Annual Convention of the Society for the Study of Artificial Intelligence and Simulation of Behaviour (AISB '10), Leicester, UK, March 2010.

[5] B. Argall, E. Sauser, and A. Billard. Tactile correction and multiple training data sources for robot motion control. In NIPS 2009 Workshop on Learning from Multiple Sources with Application to Robotics, Whistler, British Columbia, December 2009.

[6] B. Argall, E. Sauser, and A. Billard. Tactile feedback for policy refinement and reuse. In Proceedings of the 9th IEEE International Conference on Development and Learning (ICDL '10), Ann Arbor, Michigan, August 2010.

[7] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[8] Brenna Dee Argall, Eric Sauser, and Aude Billard. Tactile guidance for policy adaptation. Under submission, 2010.

[9] C. G. Atkeson and S. Schaal. Learning tasks from a single demonstration. In IEEE International Conference on Robotics and Automation (ICRA '97), pages 1706–1712. Piscataway, NJ: IEEE, 1997.

[10] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), pages 12–20. Morgan Kaufmann, 1997.

[11] Darrin C. Bentivegna, Christopher G. Atkeson, and Gordon Cheng. Learning similar tasks from observation and practice. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2677–2683, 2006.

[12] Cynthia Breazeal, Guy Hoffman, and Andrea Lockerd. Teaching and working with robots as a collaboration. In AAMAS '04: Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1030–1037, Washington, DC, USA, 2004. IEEE Computer Society.

[13] Brett Browning, James Bruce, Michael Bowling, and Manuela Veloso. STP: Skills, tactics and plays for multi-robot control in adversarial environments. IEEE Journal of Control and Systems Engineering, 219:33–52, 2005.

[14] M. Cakmak, C. Chao, and A. L. Thomaz. Designing interactions for robot active learners. IEEE Transactions on Autonomous Mental Development, 2(2):108–118, June 2010.

[15] S. Calinon and A. Billard. What is the teacher's role in robot programming by demonstration? Toward benchmarks for improved learning. Interaction Studies, Special Issue on Psychological Benchmarks in Human-Robot Interaction, 8(3), 2007.

[16] Sylvain Calinon, Florent D'halluin, Eric Sauser, Darwin Caldwell, and Aude Billard. Learning and reproduction of gestures by imitation: An approach based on Hidden Markov Model and Gaussian Mixture Regression. IEEE Robotics and Automation Magazine, 17(2):44–54, 2010.

[17] Çetin Meriçli and Manuela Veloso. Biped walk learning through playback and corrective demonstration. In AAAI 2010: Twenty-Fourth Conference on Artificial Intelligence, 2010.

[18] Çetin Meriçli and Manuela Veloso. Improving biped walk stability using real-time corrective human feedback. In RoboCup 2010: Robot Soccer World Cup XIV, 2010.

[19] Sonia Chernova and Manuela Veloso. Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of AAMAS'07, the Sixth International Joint Conference on Autonomous Agents and Multi-Agent Systems, Honolulu, Hawaii, May 2007.

[20] Sonia Chernova and Manuela Veloso. Multiagent collaborative task learning through imitation. In Proceedings of the 4th International Symposium on Imitation in Animals and Artifacts, AISB'07, Artificial and Ambient Intelligence, Newcastle, UK, April 2007.

[21] Sonia Chernova and Manuela Veloso. Confidence-based demonstration selection for interactive robot learning. In Proceedings of the 3rd ACM/IEEE International Conference on Human-Robot Interaction (HRI'08), Amsterdam, The Netherlands, March 2008.

[22] Sonia Chernova and Manuela Veloso. Learning equivalent action choices from demonstration. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'08), Nice, France, September 2008.

[23] Sonia Chernova and Manuela Veloso. Teaching collaborative multirobot tasks through demonstration. In Proceedings of AAMAS'08, the Seventh International Joint Conference on Autonomous Agents and Multi-Agent Systems, Estoril, Portugal, May 2008.

[24] Sonia Chernova and Manuela Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 2009.

[25] Brian Coltin, Somchaya Liemhetcharat, Çetin Meriçli, Junyun Tay, and Manuela Veloso. Multi-humanoid world modeling in Standard Platform robot soccer. In Proceedings of the 2010 IEEE-RAS International Conference on Humanoid Robots, Nashville, TN, USA, December 2010.

[26] Brian Coltin, Somchaya Liemhetcharat, Çetin Meriçli, and Manuela Veloso. Challenges of multi-robot world modelling in dynamic and adversarial domains. In Workshop on Practical Cognitive Agents and Robots, 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), 2010.


[27] Elena Gribovskaya, Seyed Mohammad Khansari-Zadeh, and Aude Billard. Learning nonlinear multivariate dynamics of motion in robotic manipulators. International Journal of Robotics Research, 2010.

[28] D. H. Grollman and O. C. Jenkins. Dogged learning for robots. In International Conference on Robotics and Automation (ICRA 2007), pages 2483–2488, Rome, Italy, April 2007.

[29] D. H. Grollman and O. C. Jenkins. Learning elements of robot soccer from demonstration. In International Conference on Development and Learning (ICDL 2007), pages 276–281, London, England, July 2007.

[30] F. Guenter, M. Hersch, S. Calinon, and A. Billard. Reinforcement learning for imitating constrained reaching movements. RSJ Advanced Robotics, Special Issue on Imitative Robots, 21(13):1521–1544, 2007.

[31] M. Hersch, F. Guenter, S. Calinon, and A. Billard. Dynamical system modulation for robot learning via kinesthetic demonstrations. IEEE Transactions on Robotics, 24(6):1463–1467, 2008.

[32] Jan Hoffmann, Matthias Jüngel, and Martin Lötzsch. A vision based system for goal-directed obstacle avoidance used in the RC'03 obstacle avoidance challenge. In 8th International Workshop on RoboCup 2004 (Robot World Cup Soccer Games and Conferences), Lecture Notes in Artificial Intelligence, pages 418–425. Springer, 2004.

[33] J. Zico Kolter, Pieter Abbeel, and Andrew Y. Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007. MIT Press, 2007.

[34] Scott Lenser and Manuela Veloso. Sensor resetting localization for poorly modelled mobile robots. In Proceedings of ICRA-2000, San Francisco, April 2000.

[35] Scott Lenser and Manuela Veloso. Visual sonar: Fast obstacle avoidance using monocular vision. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2003, pages 886–891, 2003.

[36] Jun Nakanishi, Jun Morimoto, Gen Endo, Gordon Cheng, Stefan Schaal, and Mitsuo Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2-3):79–91, 2004.

[37] RoboCup. RoboCup International Robot Soccer Competition, 2009. http://www.robocup.org.

[38] RoboCup SPL. The RoboCup Standard Platform League, 2009. http://www.tzi.de/spl.

[39] Paul E. Rybski, Kevin Yoon, Jeremy Stolarz, and Manuela M. Veloso. Interactive robot task training through dialog and demonstration. In Proceedings of the 2007 ACM/IEEE International Conference on Human-Robot Interaction, Washington D.C., pages 255–262, 2007.

[40] Andrea L. Thomaz and Cynthia Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence, pages 1000–1005. AAAI Press, 2006.
