MACHINE LEARNING – 2012
MACHINE LEARNING TECHNIQUES AND APPLICATIONS
Reinforcement Learning for Continuous State and Action Spaces: Gradient Methods

TRANSCRIPT

Page 1:

MACHINE LEARNING TECHNIQUES AND APPLICATIONS
Reinforcement Learning for Continuous State and Action Spaces
Gradient Methods

Page 2:

Reinforcement Learning (RL)

Supervised learning: learn from examples with explicit target outputs.

Unsupervised learning: learn structure from data without any feedback.

Reinforcement learning is in between: the agent receives only an evaluative reward signal, not the correct action.

Page 3:

Drawbacks of classical RL

Curse of dimensionality:
Computational costs increase dramatically with the number of states.

Markov world:
Tabular methods cannot handle continuous action and state spaces.

Model-based vs. model-free:
A model of the world is needed (it can be estimated through exploration).

Gradient methods are introduced to handle continuous state and action spaces.

Page 4:

RL in continuous state and action spaces

States and actions $s_t \in \mathbb{R}^N$, $a_t \in \mathbb{R}^P$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function $V(s)$,
2) use function approximation to estimate the state-action value function $Q(s, a)$, or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

Page 5:

Policy Gradients

Parametrize the value function $V(s; \theta)$, with open parameters $\theta$.

Assume an initial estimate of $\theta$.
Run an episode using a greedy policy $\pi(a \mid s; \theta)$.
Compute the error on the estimate: $e_t = V(s_t; \theta) - E[r_t]$, the difference between the predicted value and the return observed during the episode.
Find the optimal parameters through gradient descent on the state value function:
$\Delta \theta \sim e_t \, \frac{\partial V(s_t; \theta)}{\partial \theta}.$
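To make this concrete, here is a minimal numerical sketch (not part of the original slides) of fitting a parametrized value function $V(s;\theta) = \theta^T \phi(s)$ by gradient descent on the squared error between the predicted value and the return observed during an episode. The Gaussian RBF features, the toy random-walk episode and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian RBF features phi(s) over a 1-D state space (illustrative choice).
centers = np.linspace(-1.0, 1.0, 10)
width = 0.2

def phi(s):
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

def V(s, theta):
    return theta @ phi(s)

theta = np.zeros(len(centers))
alpha = 0.05    # learning rate (assumed)
gamma = 0.95    # discount factor (assumed)

for episode in range(500):
    # Toy episode: a bounded random walk; reward = -|s| at every step.
    s, states, rewards = 0.0, [], []
    for t in range(30):
        states.append(s)
        rewards.append(-abs(s))
        s = float(np.clip(s + rng.normal(0.0, 0.1), -1.0, 1.0))

    # Discounted return observed from each visited state.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # Gradient descent on the squared error e_t = V(s_t; theta) - G_t,
    # whose gradient w.r.t. theta is e_t * phi(s_t).
    for s_t, G_t in zip(states, returns):
        e = V(s_t, theta) - G_t
        theta -= alpha * e * phi(s_t)

print("estimated V(0):", V(0.0, theta))
```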

Page 6:

TD learning for continuous state action space
Doya, NIPS 1996

Parametrize the value function such that
$V(s; \theta) = \sum_{j=1}^{K} \theta_j \, \phi_j(s) = \theta^T \phi(s).$

$\phi_j(s),\ j = 1 \dots K$, is a set of basis functions (e.g. RBF functions). These are set by the user and are also called the features. The $\theta_j$ are the weights associated to each feature; these are the unknown parameters.

Page 7:

TD learning for continuous state action space
Doya, NIPS 1996

Pick a set of parameters $\theta$ and compute $V(s; \theta) = \sum_{j=1}^{K} \theta_j \phi_j(s)$.
Do some roll-outs of fixed episode length and gather $r(s_t),\ t = 1 \dots T$.
Estimate a new $V(s; \theta)$ through TD learning. Gradient descent on the squared TD error gives
$\Delta \theta_j \propto \Big( r(s_t) + \gamma\, \hat{V}(s_{t+1}; \theta) - V(s_t; \theta) \Big)\, \phi_j(s_t), \qquad j = 1 \dots K.$

Other techniques for the estimation of non-linear regression functions have been proposed elsewhere to get a better estimate of the parameters than simple gradient descent.
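A minimal discrete-time sketch of the TD update above; Doya's formulation is continuous-time and differs in detail. The toy dynamics, the RBF features and the step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

centers = np.linspace(-1.0, 1.0, 12)   # RBF centers (assumed)
width = 0.2

def phi(s):
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

theta = np.zeros(len(centers))
alpha, gamma = 0.05, 0.95              # step size and discount (assumed)

def step(s):
    """Toy dynamics: drift towards 0 plus noise; reward is -s^2."""
    s_next = float(np.clip(0.9 * s + rng.normal(0.0, 0.05), -1.0, 1.0))
    return s_next, -s ** 2

for episode in range(300):
    s = rng.uniform(-1.0, 1.0)
    for t in range(50):
        s_next, r = step(s)
        # TD error: delta = r + gamma * V(s') - V(s)
        delta = r + gamma * (theta @ phi(s_next)) - theta @ phi(s)
        # Semi-gradient step on the squared TD error, one weight per feature.
        theta += alpha * delta * phi(s)
        s = s_next

print("estimated V(0):", theta @ phi(0.0))
```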

Page 8:

Robotics Applications of continuous RL

Teaching a two-joint, three-link robot leg to stand up.

[Figure: Robot configuration. θ0: pitch angle, θ1: hip joint angle, θ2: knee joint angle, θm: angle of the line from the center of mass to the center of the foot.]

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Page 9:

Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems, 2001

• The final goal is the upright stand-up posture.
• Reaching the final goal is a necessary but not sufficient condition for a successful stand-up, because the robot may fall down after passing through the final goal; hence the need to define subgoals.
• The stand-up task is accomplished when the robot stands up and stays upright for more than 2(T + 1) seconds.

Page 10:

Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems, 2001

Hierarchical reinforcement learning:

Upper layer: discretize the state-action space and discover which subgoal should be reached using Q-learning.
  State: pitch and joint angles.
  Actions: joint displacement (not the torque).

Lower layer: continuous state-action space; the robot learns to apply the appropriate torque to achieve each subgoal.
  State: pitch and joint angles, and their velocities.
  Actions: torque at the two joints.

Page 11:

Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems, 2001

Reward: decompose the task into upper-level and lower-level sets of goals.

Y: height of the head of the robot at a sub-goal posture; L: total length of the robot.

When the robot achieves a sub-goal, the learner gets a reward < 0.5.
The full reward is obtained only when all subgoals and the main goal are achieved.

Page 12:

Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems, 2001

State and action are continuous in time: $s(t)$, $u(t)$.
Model the value function: $V(s; \theta) = \theta^T \phi(s)$.
Model the policy: $u(s) = f\big(w^T \phi(s) + \sigma\big)$, where $f$ is a sigmoid function and $\sigma$ is noise.

Learn the parameters through gradient descent on the squared TD error; the TD error $e(t)$ is backpropagated to estimate the parameters:
$\Delta \theta_j(t) \propto e(t)\, \phi_j\big(s(t)\big), \qquad j = 1 \dots K.$
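A minimal discrete-time actor-critic sketch in the spirit of this slide (the Morimoto & Doya controller is continuous-time and more elaborate): the critic $V(s;\theta)=\theta^T\phi(s)$ is updated with the TD error, and the same TD error, correlated with the exploration noise, drives the update of the sigmoid (here tanh) policy weights. The toy dynamics, noise scale and learning rates are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

centers = np.linspace(-1.0, 1.0, 10)
width = 0.25

def phi(s):
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

theta = np.zeros(len(centers))   # critic weights
w = np.zeros(len(centers))       # actor weights
alpha_c, alpha_a = 0.1, 0.05     # learning rates (assumed)
gamma, u_max, sigma = 0.95, 1.0, 0.3

for episode in range(400):
    s = rng.uniform(-1.0, 1.0)
    for t in range(40):
        n = rng.normal(0.0, sigma)                 # exploration noise
        u = u_max * np.tanh(w @ phi(s) + n)        # sigmoid-shaped policy
        # Toy dynamics: the action pushes the state; reward penalizes distance from 0.
        s_next = float(np.clip(s + 0.1 * u + rng.normal(0.0, 0.01), -1.0, 1.0))
        r = -s ** 2
        # TD error
        delta = r + gamma * (theta @ phi(s_next)) - theta @ phi(s)
        # Critic: semi-gradient step on the squared TD error.
        theta += alpha_c * delta * phi(s)
        # Actor: reinforce noise directions that were followed by a positive TD error.
        w += alpha_a * delta * n * phi(s)
        s = s_next

print("policy output at s = 0.5:", u_max * np.tanh(w @ phi(0.5)))
```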

Page 13:

Robotics Applications of continuous RL

• 750 trials in simulation + 170 on the real robot.
• Goal: to stand up.

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Page 14:

Robotics Applications of continuous RL
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Separate the problem into two parts:
a) learning to swing the pole up,
b) learning to balance the pole upright.

Part a is learned through TD-learning.
Part b is learned using optimal control.

Additionally, a model of the dynamics of the inverted pendulum is estimated through locally weighted regression (estimating the parameters of a known model).

Page 15:

Robotics Applications of continuous RL
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Collect a sequence $\{s_t^*, u_t^*\}_{t=1}^{T}$ from a single human demonstration.

State of the system: $s_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$ (human hand trajectory and velocity, angular trajectory and velocity of the pendulum).

Actions: $u_t = \ddot{x}_t$ (can be converted into a torque command for the robot).

Reward: minimizes angular and hand displacements as well as velocities.

Page 16:

Robotics Applications of continuous RL
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Learn first a model of the task through RL, using continuous TD learning to estimate $V(s; \theta)$:
Take a model $V(s; \theta) = \sum_{j=1}^{K} \theta_j \phi_j(s)$.
The parameters are estimated incrementally through non-linear function approximation (locally weighted regression).

Learn then the reward:
$r(x_t, u_t, t) = (x_t - x_t^*)^T Q\, (x_t - x_t^*) + u_t^T R\, u_t$
$Q$ and $R$ are estimated so as to minimize discrepancies between the human and robot trajectories.

Use the reward to generate optimal trajectories from optimal control.
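The quadratic reward above is a tracking cost that is cheap to evaluate; a small sketch follows, in which the demonstrated trajectory $x^*$, the weighting matrices $Q$ and $R$, and the trajectories themselves are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_cost(x, u, x_star, Q, R):
    """Per-step cost r(x_t, u_t, t) = (x_t - x_t*)^T Q (x_t - x_t*) + u_t^T R u_t."""
    dx = x - x_star
    return dx @ Q @ dx + u @ R @ u

# Made-up example: 4-D state (hand pos/vel, pendulum angle/vel), 1-D action.
T = 50
x_star = np.zeros((T, 4))                       # demonstrated trajectory (placeholder)
x = x_star + 0.1 * rng.standard_normal((T, 4))  # robot trajectory (placeholder)
u = 0.05 * rng.standard_normal((T, 1))

Q = np.diag([1.0, 0.1, 2.0, 0.2])   # assumed weights on state tracking errors
R = np.diag([0.01])                 # assumed weight on control effort

total_cost = sum(quadratic_cost(x[t], u[t], x_star[t], Q, R) for t in range(T))
print("total tracking cost:", total_cost)
```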

Page 17:

Learning how to swing up a pendulum

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Robotics Applications of continuous RL

Page 18:

RL in continuous state and action spaces

States and actions $s_t \in \mathbb{R}^N$, $a_t \in \mathbb{R}^P$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function $V(s)$,
2) use function approximation to estimate the state-action value function $Q(s, a)$, or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

Page 19:

Least-Squares Policy Iteration

Page 20:

Least-Squares Policy Iteration

Approximate the state-action value function:
$Q(s, a; \theta) = \sum_{j=1}^{K} \theta_j \, \phi_j(s, a) = \theta^T \phi(s, a).$

$\phi_j(s, a): \mathbb{R}^N \times \mathbb{R}^P \to \mathbb{R},\ j = 1 \dots K$, is a set of basis functions, set by the user and also called the features. The $\theta_j$ are the weights associated to each feature; these are the unknown parameters.

Page 21:

Least-Squares Policy Iteration

Perform a roll-out for $T$ time steps using policy $\pi(a \mid s; \theta)$ (greedy search on $\arg\max_a Q(s, a; \theta)$ using the current estimate of $Q(s, a; \theta)$).
Collect samples $s_{1:T+1},\ a_{1:T},\ r_{1:T}$.

Determine how good an initial choice of parameters $\theta$ is by comparing the predicted values with the actual rewards, $R = \sum_{t=1}^{T} \gamma^{t} r_t$, with $\gamma \in [0, 1)$ the discount factor, through the Bellman residual error:
$J(\theta) = \sum_{t=1}^{T} \Big( r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \pi(s_{t+1}); \theta\big) - Q(s_t, a_t; \theta) \Big)^2.$

See Lagoudakis & Parr, Journal of Machine Learning Research, 2003, for various numerical solutions to determine the optimal parameters.
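For reference, a minimal sketch of one policy-evaluation plus policy-improvement cycle in the style of LSPI. Lagoudakis & Parr solve the least-squares fixed-point system (LSTD-Q) rather than running gradient descent on the Bellman residual; both work from the same samples. The feature map, the toy system and the two-action set below are illustrative assumptions, not the bicycle setup of the paper.

```python
import numpy as np

def lstdq(samples, phi, greedy_action, gamma, k):
    """One LSTD-Q evaluation: solve A theta = b from samples (s, a, r, s')."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        a_next = greedy_action(s_next)          # pi(s') under the current policy
        f_next = phi(s_next, a_next)
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.lstsq(A, b, rcond=None)[0]

# --- Illustrative setup: 1-D state, two actions, simple polynomial features ---
ACTIONS = [-1.0, 1.0]

def phi(s, a):
    # One block of features per action (a common LSPI construction).
    base = np.array([1.0, s, s * s])
    out = np.zeros(2 * len(base))
    idx = ACTIONS.index(a)
    out[idx * len(base):(idx + 1) * len(base)] = base
    return out

K, gamma = 6, 0.95
theta = np.zeros(K)
rng = np.random.default_rng(3)

def greedy_action(s):
    return max(ACTIONS, key=lambda a: theta @ phi(s, a))

# Collect samples with a random policy on a toy system: the action drives s toward 0.
samples = []
s = rng.uniform(-1.0, 1.0)
for _ in range(2000):
    a = ACTIONS[rng.integers(2)]
    s_next = float(np.clip(s + 0.1 * a + rng.normal(0.0, 0.05), -1.0, 1.0))
    r = -abs(s_next)
    samples.append((s, a, r, s_next))
    s = s_next

# Policy iteration: re-solving for theta implicitly improves the greedy policy.
for _ in range(10):
    theta = lstdq(samples, phi, greedy_action, gamma, K)

print("greedy action at s = 0.5:", greedy_action(0.5))
```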

Page 22:

LSPI: Example learning to ride a bike
Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Goal: learn to balance and ride a bicycle to a target position located 1 km away from the starting location.

Continuous state: six-dimensional real-valued vector, including
- the angle of the handlebar,
- the vertical angle of the bicycle,
- the angle of the bicycle to the goal.

5 discrete actions: torque applied to the handlebar (discretized to {−2, 0, +2}) and displacement of the rider (discretized to {−0.02, 0, +0.02}).

Model of the world: uniform noise on each action; model of the bike's dynamics.

The state-action value function Q(s, a) for a fixed action a is approximated by a linear combination of 20 basis functions (linear and quadratic combinations of the state and its 1st and 2nd order derivatives).

Page 23:

LSPI: Example learning to ride a bike
Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Collect training samples using a random policy; each episode lasts 20 steps.
The learned policy is evaluated 100 times to estimate the probability of success (reaching the goal).
A shaping reward was used: 1% of the change (in meters) in the distance to the goal and of the change in the square of the vertical angle.

The policy after the first iteration balances the bicycle, but fails to ride to the goal.
The policy after the second iteration heads towards the goal, but fails to balance.
All policies thereafter balance and ride the bicycle to the goal.
Crashes still happen because of noise in the model.

Page 24:

Successful policies are found after a few thousand training episodes.
With 5000 training episodes (60,000 samples), the probability of success is about 95% and the expected number of balancing steps is about 70,000.

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

LSPI: Example learning to ride a bike

Page 25:

RL in continuous state and action spaces

States and actions $s_t \in \mathbb{R}^N$, $a_t \in \mathbb{R}^P$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function $V(s)$,
2) use function approximation to estimate the state-action value function $Q(s, a)$, or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

Page 26:

Policy Gradients

An alternative is to parametrize the stochastic policy:
$\pi(a \mid s)$ is approximated by drawing from the distribution $p(a \mid s; \theta)$, where $\theta$ are the parameters of the policy.

The actor, in state $s_t$, chooses an action $a_t$ following the stochastic policy $\pi(a_t \mid s_t)$, i.e. by drawing from the distribution $p(a_t \mid s_t; \theta)$.

Page 27:

Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011

Start from a deterministic estimate of the policy:
$a = \pi(s; \theta) = \sum_{j=1}^{K} \theta_j \, \phi_j(s) = \theta^T \phi(s),$
where the $\phi_j(s)$ are again a set of known basis functions and the $\theta_j$ the weights for each basis function.

Adding some noise $\epsilon_t$ and making the policy explicitly time-dependent,
$a = \pi(s, t; \theta) = (\theta + \epsilon_t)^T \phi(s, t), \qquad \epsilon_t \sim \mathcal{N}(0, \hat{\Sigma}),$
leads to a stochastic estimate (structured, state-dependent exploration):
$\pi(a \mid s, t; \theta) \sim \mathcal{N}\big(a \mid \theta^T \phi(s, t),\ \phi(s, t)^T \hat{\Sigma}\, \phi(s, t)\big).$
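A small sketch of the structured, state-dependent exploration described above: the parameters, rather than the actions, are perturbed, and the resulting action distribution is Gaussian with a state-dependent variance. The basis functions and dimensions are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)

K = 8
centers = np.linspace(0.0, 1.0, K)

def phi(s, t):
    # Time/state-dependent Gaussian basis functions (placeholder choice).
    return np.exp(-0.5 * ((t - centers) / 0.1) ** 2) * s

theta = rng.normal(0.0, 0.1, K)      # current policy parameters
Sigma = 0.05 * np.eye(K)             # exploration covariance (assumed)

def act(s, t):
    """Structured exploration: perturb the parameters, not the action."""
    eps = rng.multivariate_normal(np.zeros(K), Sigma)
    return (theta + eps) @ phi(s, t)

# Equivalent action distribution at (s, t): N(theta^T phi, phi^T Sigma phi)
s, t = 1.0, 0.3
f = phi(s, t)
mean, var = theta @ f, f @ Sigma @ f
print("analytic action distribution: mean=%.3f, var=%.4f" % (mean, var))
print("sampled action:", act(s, t))
```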

Page 28:

Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011

Finite horizon $T$; episodic restarts.

At each roll-out (episode), the agent gathers a tuple of rewards and associated states and actions: $\tau = (s_{1:T+1}, a_{1:T}, r_{1:T})$.

Each roll-out is generated by using a policy with parameters $\theta$: $\tau \sim p_\theta(\tau)$, with
$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t, t; \theta).$

The cumulative reward for roll-out $\tau$ is:
$R(\tau) = \frac{1}{T} \sum_{t=1}^{T} r(s_t, s_{t+1}, a_t, t).$

Page 29:

Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011

At each new trial, the parameters are computed as the weighted average of all previous trials plus some Gaussian noise:
$\theta^{i+1} \sim \mathcal{N}\Big(\sum_{i} w^{i}\, \theta^{i},\ \Sigma\Big), \qquad \sum_{i} w^{i} = 1, \qquad i = 1, \dots, \text{number of trials}.$
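A minimal sketch of the update rule as stated on this slide: each new parameter vector is drawn around a return-weighted average of the previously tried parameter vectors. (The full PoWER update of Kober & Peters weights by state-action values along the roll-out; here the per-trial return is used as the weight, and the toy return function is a placeholder.)

```python
import numpy as np

rng = np.random.default_rng(5)
K = 5

def rollout_return(theta):
    # Placeholder return: peaks when theta matches a hidden target vector.
    target = np.linspace(0.2, 1.0, K)
    return np.exp(-np.sum((theta - target) ** 2))

Sigma = 0.05 * np.eye(K)             # exploration covariance (assumed)
thetas = [rng.normal(0.0, 0.5, K)]   # trial 1
returns = [rollout_return(thetas[0])]

for trial in range(100):
    # Weights proportional to the returns of all previous trials, summing to 1.
    w = np.array(returns) / np.sum(returns)
    mean = w @ np.array(thetas)
    # New trial: weighted average of previous parameters plus Gaussian noise.
    theta_new = rng.multivariate_normal(mean, Sigma)
    thetas.append(theta_new)
    returns.append(rollout_return(theta_new))

print("best return found:", max(returns))
```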

Page 30:

Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011

Teaching a robot the ball-in-a-cup task.

State: joint angles and velocities of the robot + Cartesian coordinates of the ball.
Action: joint-space accelerations.
$\pi(a \mid x, t; \theta) \sim \mathcal{N}\big(a \mid \theta^T \phi(x, t),\ \phi(x, t)^T \hat{\Sigma}\, \phi(x, t)\big)$, with 31 basis functions per DOF.

Page 31:

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Page 32:

Policy learning by Weighted Exploration with the Returns (PoWER)

Page 33:

Inverse Reinforcement Learning

A major difficulty in RL is to determine the reward. The reward encapsulates key information about the task, and a poor choice may lead to suboptimal solutions.

IRL was proposed as a framework to estimate what the reward could be. The reward is chosen such that it best explains the observed behavior of an “expert agent”, i.e. an agent that knows how to perform the task (in an optimal manner).

Estimation of the policy and of the reward is then done simultaneously.

Page 34:

Inverse Reinforcement Learning
Abbeel & Ng, Int. Conf. on Machine Learning, 2004

Consider first a discrete state-action space. Assume a parametrization of the reward function:
$r(s) = \theta^T \phi(s) = \sum_{j=1}^{K} \theta_j \, \phi_j(s),$
where $\phi_j : S \to [0, 1]$ are bounded basis functions and the weight vector $\theta$ is normalized ($\|\theta\|_1 \le 1$), so that the reward is also bounded: $r(s) \in [0, 1]$.

Page 35:

Inverse Reinforcement Learning

Given $r(s) = \theta^T \phi(s)$, the value of a given policy $\pi$ over one $T$-time-step roll-out is given by:
$E\big[V^{\pi}(s_1)\big] = E\Big[\sum_{t=1}^{T} \gamma^{t} r(s_t)\Big] = E\Big[\sum_{t=1}^{T} \gamma^{t} \theta^T \phi(s_t)\Big] = \theta^T E\Big[\sum_{t=1}^{T} \gamma^{t} \phi(s_t)\Big] = \theta^T \mu(\pi).$

$\mu(\pi) = E\big[\sum_{t=1}^{T} \gamma^{t} \phi(s_t)\big]$: expected features given policy $\pi$.

A particular policy is thus evaluated as a linear combination of features.
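A small sketch of how the feature expectations $\mu(\pi)$ can be estimated by Monte Carlo from roll-outs, which is all that is needed to evaluate a policy under any reward of the form $r(s)=\theta^T\phi(s)$. The chain-world dynamics, indicator features and the two policies are placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
N_STATES, K, gamma, T = 5, 5, 0.9, 20

def phi(s):
    # Indicator features: phi_j(s) = 1 if s == j (placeholder feature map).
    f = np.zeros(K)
    f[s] = 1.0
    return f

def step(s, a):
    # Toy chain: action -1/+1 moves left/right, with some noise.
    return int(np.clip(s + a + rng.integers(-1, 2), 0, N_STATES - 1))

def feature_expectations(policy, n_rollouts=500):
    """mu(pi) = E[ sum_t gamma^t phi(s_t) ], estimated over n_rollouts episodes."""
    mu = np.zeros(K)
    for _ in range(n_rollouts):
        s = 0
        for t in range(T):
            mu += (gamma ** t) * phi(s) / n_rollouts
            s = step(s, policy(s))
    return mu

go_right = lambda s: +1
go_left = lambda s: -1
mu_right, mu_left = feature_expectations(go_right), feature_expectations(go_left)

theta = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # reward = 1 in the last state
print("value of go_right:", theta @ mu_right)
print("value of go_left :", theta @ mu_left)
```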

Page 36:

Inverse Reinforcement Learning

Step 1:
i) Record a set of $M$ demonstrated paths (a series of roll-outs), yielding a set of $T + 1$ state transitions $\{s^{i}_{1:T+1}\}_{i=1}^{M}$.
ii) Build an estimate of the optimal policy $\pi^{*}(s, a)$ of the demonstrator by computing the expected features for policy $\pi^{*}$:
$\mu(\pi^{*}) = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \gamma^{t} \phi\big(s^{i}_{t}\big).$

Page 37:

Inverse Reinforcement Learning

Step 2:
Initialize the agent's policy $\pi^{0}(s, a)$, e.g. by picking a random policy, and then do, for $i = 0 \dots n$ iterations:

i) Approximate (e.g. via Monte Carlo) the expected features $\mu(\pi^{i})$ for the policy $\pi^{i}$.
ii) Find the optimal weights $\theta^{i}$, solution of $\max_{\theta} \min_{j = 0 \dots i-1} \theta^{T}\big(\mu(\pi^{*}) - \mu(\pi^{j})\big)$; call the attained margin $c^{i}$.
iii) Set the current estimate of the reward to $r^{i}(s) = (\theta^{i})^{T} \phi(s)$ and use RL to find the optimal associated policy $\pi^{i+1}$.

Terminate if $c^{i}$ is below a threshold.

Page 38:

Inverse Reinforcement Learning

The algorithm is trying to find a reward function $r^{i}(s) = (\theta^{i})^{T} \phi(s)$ such that
$E\big[V^{\pi^{*}}\big] \ge E\big[V^{\pi^{j}}\big] + c,$
i.e. a reward on which the expert does better, by a margin of $c$, than any of the $i - 1$ policies found previously.

The optimization problem is equivalent to
$\max_{c, \theta} \ c$
subject to $\theta^{T} \mu(\pi^{*}) \ge \theta^{T} \mu(\pi^{j}) + c, \quad \forall j = 1 \dots i - 1,$ and $\|\theta\|_{2} \le 1$.

This is akin to the maximum-margin optimization in SVMs.
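The max-margin step can be solved numerically; below is a small sketch using SciPy's SLSQP solver to maximize the margin $c$ subject to $\theta^T\mu(\pi^*) \ge \theta^T\mu(\pi^j) + c$ and $\|\theta\|_2 \le 1$. The feature-expectation vectors are made-up numbers; Abbeel & Ng discuss dedicated solvers (e.g. a projection method) for this step.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up feature expectations: expert and two previously found policies.
mu_star = np.array([0.8, 0.1, 0.6])
mu_prev = [np.array([0.5, 0.4, 0.2]),
           np.array([0.3, 0.2, 0.7])]
K = len(mu_star)

def objective(z):
    return -z[-1]                       # z = [theta, c]; maximize the margin c

constraints = [{"type": "ineq",         # ||theta||_2^2 <= 1
                "fun": lambda z: 1.0 - np.dot(z[:K], z[:K])}]
for mu_j in mu_prev:
    constraints.append({"type": "ineq",  # theta^T mu* >= theta^T mu_j + c
                        "fun": lambda z, mu_j=mu_j: z[:K] @ (mu_star - mu_j) - z[-1]})

z0 = np.zeros(K + 1)
res = minimize(objective, z0, method="SLSQP", constraints=constraints)
theta, c = res.x[:K], res.x[-1]
print("margin c =", round(c, 4), " theta =", np.round(theta, 3))
```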

Page 39:

IRL Example: Driving a car
Abbeel & Ng, Int. Conf. on Machine Learning, 2004

Learning an optimal policy to drive a car, or learning driving styles (nice, nasty, right-lane nice, right-lane nasty, middle lane).

The agent's car is faster than any other car and needs to learn when and how to pass other cars.

5 actions (steer onto any of the 3 lanes or outside the road).
5 features indicate where the car is.
10 features for discretized distances to the closest car.

Page 40:

IRL Example: Driving a car

30 iteration steps are used for learning each policy through IRL (from top to bottom, left to right: nice, nasty, right-lane nice, right-lane nasty, middle lane).
Learning is based on 2 minutes of expert demonstration of each driving style.

Page 41:

IRL Example: Driving a car

30 iteration steps are used for learning each policy through IRL (from top to bottom, left to right: nice, nasty, right-lane nice, right-lane nasty, middle lane).
Learning is based on 2 minutes of expert demonstration of each driving style.

Page 42:

Learning without reward: Donut

Two tasks: flipping a block to stand on its end; launching a ball into a basket.

The robot is provided solely with failed examples. It has no information about the task: no reward, no indication of what was incorrect.

Grollman and Billard, ICRA 2012, Best Paper Award Cognitive Robotics

Page 43:

Learning without reward: Donut

Build a distribution (DONUT) that moves away from the bad demonstrations.
Explore around the demonstrations and use the covariance to guide the exploration.
Move away from regions that have been visited a lot during unsuccessful demonstrations, but remain within the vicinity of the demonstrations.

[Figure: original distribution of points from failed demonstrations, and the 2D DONUT distribution: high likelihood of drawing datapoints in the vicinity of, but away from, the demonstrated datapoints.]

Page 44:

Learning without reward: Donut

Base distribution: learned from the training examples (failed attempts) using a GMM.
Donut distribution: the exploratory policy. An exploration parameter determines how far away the peaks are from the base distribution.

Continuous state: joint angles $s$; continuous action: joint velocities $a$.
$a$ is drawn from the DONUT distribution, a difference of two Gaussians sharing the same mean:
$D(a \mid s) \propto \mathcal{N}\big(a \mid \mu(s), \Sigma(s)\big) - (1 - \epsilon)\, \mathcal{N}\big(a \mid \mu(s), \Sigma(s)/f\big),$
where $f$ is the exploration parameter.
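A minimal 1-D sketch of sampling from a donut-like distribution built as the difference of two Gaussians with the same mean, using rejection sampling with the base Gaussian as the envelope. The mixing weight and exploration factor are placeholders; the actual DONUT construction in Grollman & Billard is built on a GMM fitted to the failed demonstrations.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

mu, sigma = 0.0, 1.0      # base Gaussian fitted to the failed attempts (placeholder)
f, eps = 4.0, 0.3         # exploration factor and mixing weight (placeholders)
narrow_sigma = sigma / np.sqrt(f)

def donut_density(x):
    """Unnormalized donut density: base Gaussian minus a narrower one at the same mean."""
    d = norm.pdf(x, mu, sigma) - (1.0 - eps) * norm.pdf(x, mu, narrow_sigma)
    return np.maximum(d, 0.0)

def sample_donut(n):
    """Rejection sampling with the base Gaussian as the envelope (density ratio <= 1)."""
    out = []
    while len(out) < n:
        x = rng.normal(mu, sigma)
        if rng.uniform() < donut_density(x) / norm.pdf(x, mu, sigma):
            out.append(x)
    return np.array(out)

samples = sample_donut(2000)
print("mean |x| of donut samples:", np.abs(samples).mean())   # pushed away from mu
```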

Page 45:

Learning without reward: Donut

[Figure: velocity vs. angular position (radians) of the robot's wrist across the failed demonstrations. High coherence across trials indicates high confidence; little coherence across trials indicates low confidence.]

• Search around the demonstrations.
• Reproduce only the parts where all demonstrations agreed.
• Avoid regions with high uncertainty.


Page 47:

Learning without reward: Donut

The approach finds a solution in a few trials, is comparable in efficiency to classical reinforcement learning approaches, but does not need a reward!

Page 48:

Learning without reward: Donut

Multidimensional DONUT applied to learning how to play golf.

State: position of the ball with respect to the hole.
Action: speed and orientation of the end-effector when hitting the ball.

[Figure: region in parameter space that leads to a successful hit.]

Page 49:

Learning without reward: Donut

Multidimensional DONUT: holes are created in the distribution, located on each unsuccessful attempt.

[Figure: lightness corresponds to the likelihood of generating arbitrary 2D parameters; red dots are human attempts, blue dots are system-generated trials.]

Page 50:

Learning without reward: Donut

Multidimensional DONUT applied to learning how to play golf.

Grollman and Billard, IJSR, 2012

Page 51:

Summary

RL is a good alternative to supervised learning when you do not know what the optimal solution would be.

Drawbacks of classical discrete state-action RL can be overcome by considering continuous value functions.

The reward function can be estimated through IRL.

Learning without rewards, from purely unsuccessful examples, is still a little-explored problem.