EM-based Reinforcement Learning
Gerhard Neumann
TU Darmstadt, Intelligent Autonomous Systems
December 21, 2011
Outline
Expectation Maximization (EM)-based Reinforcement Learning
• Recap: Modelling data with Maximum Likelihood
• Expectation Maximization
• EM for RL
• Applications
Why should we use probabilities for RL?
Reinforcement Learning in continuous state and action spaces is a hard problem
• Value functions are hard to estimate in continuous spaces
• Many RL methods rely on discretizations of the state space, the action space, or both
Why should we use probabilities for RL?
However: many probabilistic inference algorithms can be used in continuous spaces
• Gaussians, Mixtures of Gaussians, Linear Gaussian Models, Gaussian Processes
• We know how to estimate these distributions from data
• Can we use probabilistic inference for inferring a policy?
Quick Recap: Fun from high school...
Definitions:
• Marginal distribution: $P(X) = \sum_Y P(X, Y)$
• Conditional distribution: $P(X|Y) = \frac{P(X, Y)}{P(Y)}$
Quick Recap: Fun from high school...
Implications:
• Product rule: $P(X, Y) = P(X|Y)\,P(Y) = P(Y|X)\,P(X)$
• Chain rule: $P(X_1, \dots, X_n) = \prod_i P(X_i | X_1, \dots, X_{i-1})$
• Bayes rule: $P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}$
Quick Recap
Gaussian Distribution:
$P(x|\theta) = \mathcal{N}(x\,|\,\mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Parameters θ:
• µ ... mean
• Σ ... covariance matrix
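As a quick illustration, here is a minimal numpy sketch that evaluates this density; the function name and arguments are ours, not part of the lecture.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    k = mu.shape[0]
    diff = x - mu
    # Normalization constant: (2*pi)^(k/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    # Quadratic form: (x - mu)^T Sigma^{-1} (x - mu)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / norm
```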
Recap: Modelling our data
We are given a set of data points $y_i$
• ... and we want to estimate a generative model $P(y_i; \theta)$ for these data points
Recap: Modelling our data
Maximum Likelihood Solutions
• We want to find the parameters θ that maximize the likelihood $P(y_{1:N}; \theta)$ of the data $y_i$
$\arg\max_\theta P(y_{1:N}; \theta) = \arg\max_\theta \prod_{i=1}^{N} P(y_i; \theta)$
• This is often easier in log-space
$\arg\max_\theta \log P(y_{1:N}; \theta) = \arg\max_\theta \sum_{i=1}^{N} \log P(y_i; \theta)$
• A piece of cake for all distributions from the exponential family (e.g., Gaussians)
Recap: Modelling our data
E.g., the Gaussian distribution
• Given: a set of data points $\{x_i\}_{i=1 \dots N}$
• Estimate the parameters:
$\mu = \frac{\sum_i x_i}{N}, \qquad \Sigma = \frac{\sum_i (x_i - \mu)(x_i - \mu)^T}{N}$
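A minimal numpy sketch of these estimates (the function name is ours):

```python
import numpy as np

def fit_gaussian(X):
    """Maximum likelihood estimate of a Gaussian from data X of shape (N, k)."""
    mu = X.mean(axis=0)                 # mu = sum_i x_i / N
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]  # Sigma = sum_i (x_i - mu)(x_i - mu)^T / N
    return mu, Sigma
```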
Recap: Modelling our data with hidden variables
Often we are not given all information...
• E.g., missing data
• Mixture modelling / clustering: which mixture component created the data?
• Reinforcement learning: which trajectories create high reward?
Recap: Modelling our data with hidden variables
Maximum Likelihood solutions with hidden variables z
• Given a model $P(y, z; \theta)$, we want the θ which maximizes the likelihood of the data $y_i$:
$\arg\max_\theta L(\theta), \quad L(\theta) = \log P(y_{1:N}; \theta) = \sum_i \log P(y_i; \theta) = \sum_i \log \sum_z P(y_i, z; \theta)$
• Since the data for the hidden variables z is missing, we need to marginalize them out!
Recap: Modelling our data with hidden variables
Maximum Likelihood solutions with hidden variables z
$\arg\max_\theta L(\theta), \quad L(\theta) = \log P(y_{1:N}; \theta) = \sum_i \log P(y_i; \theta) = \sum_i \log \sum_z P(y_i, z; \theta)$
• Uh-oh... the log of a sum... are we doomed?!
• At least, no closed-form solution exists any more...
Outline
EM-based Reinforcement Learning
• Recap: Modelling data with Maximum Likelihood
• Expectation Maximization (EM)
• EM for RL
• Applications
Iterative Solution: Expectation-Maximization
Expectation-Maximization based Algorithms:
• (E)xpectation Step
• (M)aximization Step
Expectation Step:
• Use a proposal distribution $P_i(z)$ over the hidden variables
• What is my belief over the hidden variables given the current model $\theta^{(t-1)}$ and the observation $y_i$?
• Calculate $P_i(z) = P(z | y_i; \theta^{(t-1)})$
Maximization Step:
• Weight the log-likelihood of the joint by the proposal distribution:
$Q(\theta) = \sum_i \sum_z P_i(z) \log P(y_i, z; \theta)$
• Set $\theta^{(t)} = \arg\max_\theta Q(\theta)$
Iterative Solution: EM
Comparison: standard ML solution:
$\arg\max_\theta L(\theta) = \arg\max_\theta \sum_i \log \sum_z P(y_i, z; \theta)$
M-step:
$Q(\theta) = \sum_i \sum_z P_i(z) \log P(y_i, z; \theta)$
'Magic' of EM: it transforms the log of a sum into a sum of logs
• The E-step and the M-step can be solved in closed form!
• Both steps provably increase the log-likelihood $L(\theta)$ or leave it unchanged
• Thus the algorithm always converges to a (local) maximum
Example: Gaussian Mixture Models
The distribution is composed of K Gaussian components
$P(y) = \sum_{k=1}^{K} P(k)\,P(y|k) = \sum_{k=1}^{K} c_k\, \mathcal{N}(y\,|\,\mu_k, \Sigma_k)$
• θ: $c_k$ ... mixture coefficients, $\mu_k$ ... means, $\Sigma_k$ ... covariances
Hidden variable k
• We do not know which component k created our data
• Joint distribution: $P(y, k) = c_k\, \mathcal{N}(y\,|\,\mu_k, \Sigma_k)$
• If we knew k, the task would be easy...
EM for Gaussian Mixture Models
Expectation Step:
• Calculate the probability that component k created data point $y_i$
$P_i(k) = P(k | y_i) = \frac{P(y_i, k; \theta)}{\sum_{k'} P(y_i, k'; \theta)}$
• Called responsibilities...
Maximization Step:
$\arg\max_{\{c_{1:K}, \mu_{1:K}, \Sigma_{1:K}\}} \sum_i \sum_k P_i(k) \log P(y_i, k)$
• Each mixture component can be optimized independently (swap the order of summation):
$\arg\max_{\{c_{1:K}, \mu_{1:K}, \Sigma_{1:K}\}} \sum_k \sum_i P_i(k) \log P(y_i, k)$
EM for Gaussian Mixture Models
Each mixture component can be optimized independently!
$\arg\max_{\{c_k, \mu_k, \Sigma_k\}} \sum_i P_i(k) \left( \log \mathcal{N}(y_i\,|\,\mu_k, \Sigma_k) + \log c_k \right)$
• Comparison: Maximum Likelihood (ML) problem of a single Gaussian
$\arg\max_{\{\mu, \Sigma\}} \sum_i \log \mathcal{N}(y_i\,|\,\mu, \Sigma)$
• Weighted ML solution: $P_i(k)$ defines a weighting of each data point
EM for Gaussian Mixture Models
Comparison: ML solution for a single Gaussian
$\mu = \frac{\sum_j y_j}{N}, \qquad \Sigma = \frac{\sum_j (y_j - \mu)(y_j - \mu)^T}{N}$
M-step: weighted ML solution
$\mu_k = \frac{\sum_j P_j(k)\, y_j}{\sum_j P_j(k)}, \qquad \Sigma_k = \frac{\sum_j P_j(k)\,(y_j - \mu_k)(y_j - \mu_k)^T}{\sum_j P_j(k)}$
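Putting the E-step and the weighted M-step together, here is a minimal sketch of EM for a GMM, assuming numpy and scipy; the initialization and the update $c_k = \sum_j P_j(k)/N$ for the mixture coefficients follow the standard derivation and are not spelled out on the slides, and the small diagonal jitter is an added numerical safeguard.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(Y, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture model on data Y of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    # Initialize: uniform coefficients, random data points as means, identity covariances
    c = np.full(K, 1.0 / K)
    mu = Y[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities P_i(k) proportional to c_k * N(y_i | mu_k, Sigma_k)
        R = np.stack([c[k] * multivariate_normal.pdf(Y, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates, one per component
        Nk = R.sum(axis=0)
        c = Nk / N
        mu = (R.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mu[k]
            Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return c, mu, Sigma
```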
EM for Gaussian Mixture Models
Example: see the EM illustration in Bishop's book (Pattern Recognition and Machine Learning)
EM in a nutshell
• EM can be used whenever we need to deal with hidden/unobserved variables
• Iteratively apply the E-step and the M-step
• Both can be solved in closed form
• No learning rates or other tuning parameters are needed!
• Uses a proposal distribution over the hidden variables
• The belief over the hidden variables under the current model...
• ... is used as a weighting in the joint log-likelihood
Outline
Expectation Maximization (EM)-based Reinforcement Learning
• Recap: Modelling data with Maximum Likelihood
• Expectation Maximization
• EM for RL
• Applications
EM for Reinforcement Learning
OK, nice, but how can I use that for robot learning?
• Model RL as a Maximum Likelihood problem!
• 'Observed' variable: we want to observe a reward event
$P(R = 1 | \tau) \propto \exp(\beta R_\tau)$
• A binary event of observing a reward; β is the 'temperature' of the distribution
• $\tau$ ... trajectory, $R_\tau$ ... reward of the trajectory
• A common approach to transform a reward into a distribution
EM for Reinforcement Learning
Example for the reward distribution (Matlab demo shown in the lecture); a rough sketch follows below.
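Since the Matlab demo is not reproduced in the transcript, here is a minimal numpy sketch of how $\exp(\beta R_\tau)$ turns rewards into normalized weights; the rewards below are made up, and the max-shift is an added numerical-stability detail that leaves the normalized weights unchanged.

```python
import numpy as np

def reward_weights(R, beta=1.0):
    """Turn rewards R into normalized weights w_i proportional to exp(beta * R_i).

    Subtracting max(R) does not change the normalized weights but
    avoids numerical overflow for large beta * R.
    """
    w = np.exp(beta * (R - np.max(R)))
    return w / w.sum()

rewards = np.array([-4.0, -1.0, 0.0, -0.5])
print(reward_weights(rewards, beta=2.0))  # higher reward -> higher weight
```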
EM for Reinforcement Learning
• 'Observed' variable: reward event
$P(R = 1 | \tau) \propto \exp(\beta R_\tau)$
• 'Hidden' variable: we do not know which trajectories generated the reward event
• Model for trajectories: $p(\tau; \theta)$
• It contains our policy:
$p(\tau; \theta) = P(s_0) \prod_t P(s_t | a_{t-1}, s_{t-1})\, \pi(a_{t-1} | s_{t-1}; \theta)$
• We want to find a θ which gives high reward!
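One useful observation, sketched below under an assumed Gaussian policy $\pi(a|s;\theta) = \mathcal{N}(a\,|\,\theta^T\phi(s), \sigma^2 I)$ (the model used later for RWR): in $\log p(\tau;\theta)$, the initial-state and dynamics terms do not depend on θ, so only the policy terms matter when optimizing θ. The function names and arguments are ours.

```python
import numpy as np

def policy_log_likelihood(states, actions, theta, phi, sigma=0.1):
    """Theta-dependent part of log p(tau; theta) for one trajectory tau.

    log p(tau; theta) = log P(s_0) + sum_t [ log P(s_t | a_{t-1}, s_{t-1})
                                             + log pi(a_{t-1} | s_{t-1}; theta) ]
    The P(s_0) and dynamics terms are constant in theta and drop out of the
    M-step, leaving only the Gaussian policy terms (up to an additive constant).
    """
    ll = 0.0
    for s, a in zip(states, actions):
        mean = theta.T @ phi(s)
        ll += -0.5 * np.sum((a - mean) ** 2) / sigma**2
    return ll
```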
EM for Reinforcement Learning
We want to find a θ which gives high reward!
• Joint distribution: $p(R, \tau; \theta) = p(R | \tau)\, p(\tau; \theta)$
• We want to maximize the log-likelihood of our 'observation' (getting a reward)
$\arg\max_\theta \log p(R = 1; \theta) = \arg\max_\theta \log \int_\tau P(R = 1 | \tau)\, P(\tau; \theta)\, d\tau$
EM for Reinforcement Learning
We want to maximize the log-likelihood of our 'observation' (getting a reward)
$\log p(R) = \log \int_\tau P(R | \tau)\, P(\tau; \theta)\, d\tau$
• High-dimensional trajectory space: the integral over all trajectories is intractable
• Are we doomed again? No... EM can help us out!
EM for Reinforcement Learning
EM can help us out
• Use a proposal distribution $P(\tau)$ over trajectories
E-step:
• Estimate the probability that trajectory τ has created the reward event:
$P(\tau) = P(\tau | R; \theta_{t-1}) = \frac{P(R | \tau)\, P(\tau; \theta_{t-1})}{P(R; \theta_{t-1})} \propto P(R | \tau)\, P(\tau; \theta_{t-1})$
• $P(\tau)$ is also called the reward-weighted model distribution.
EM for Reinforcement Learning
M-step:
$\theta_t = \arg\max_\theta Q(\theta)$, with
$Q(\theta) = \int_\tau P(\tau) \log P(R | \tau)\, P(\tau; \theta)\, d\tau = \int_\tau P(R | \tau)\, P(\tau; \theta_{t-1}) \log P(\tau; \theta)\, d\tau + \text{const}$
If we use samples $\tau_j \sim P(\tau; \theta_{t-1})$, this integral can be efficiently approximated!
$Q(\theta) \approx \sum_{\tau_j} P(R | \tau_j) \log P(\tau_j; \theta)$
• This is again just the weighted maximum likelihood solution
• Each trajectory is weighted by its reward probability $\exp(\beta R_{\tau_j})$
Summary: EM for Reinforcement Learning
• Start with an initial distribution $P(\tau; \theta_0)$
• For $t = 1 \dots L$:
• Sample N trajectories from $P(\tau; \theta_{t-1})$
• Weight each trajectory by the probability $w_i \propto \exp(\beta R_{\tau_i})$ that it created the reward event
• Estimate the new model parameters $\theta_t$ by a weighted maximum likelihood estimate
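A minimal sketch of this loop for the 1-step problem on the next slide (a Gaussian model over actions, quadratic reward); $a^*$, D, β, N, and L are made-up demo values, and the small diagonal jitter on Σ is an added numerical safeguard.

```python
import numpy as np

# EM policy search for a 1-step problem: Gaussian "policy" over actions,
# reward r(a) = -(a - a_star)^T D (a - a_star) as on the following slide.
rng = np.random.default_rng(0)
a_star = np.array([1.0, -0.5])        # unknown optimum (for the demo)
D = np.diag([1.0, 4.0])
beta, N, L = 3.0, 100, 30

mu, Sigma = np.zeros(2), np.eye(2)    # initial model P(a; theta_0)
for t in range(L):
    # Sample N actions from the current model
    A = rng.multivariate_normal(mu, Sigma, size=N)
    R = -np.einsum('ni,ij,nj->n', A - a_star, D, A - a_star)
    # E-step: weights w_i proportional to exp(beta * R_i), max-shifted for stability
    w = np.exp(beta * (R - R.max()))
    w /= w.sum()
    # M-step: weighted maximum likelihood fit of the Gaussian
    mu = w @ A
    diff = A - mu
    Sigma = diff.T @ (w[:, None] * diff) + 1e-6 * np.eye(2)

print(mu)  # converges towards a_star
```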
Illustration: 1-step RL Problem
• 2-dimensional action space, no states
• Reward function: $r(a) = -(a - a^*)^T D\, (a - a^*)$
• Matlab demo shown in the lecture (cf. the sketch above)
Problems: 1-step RL Problem
• 2-dimensional action space, multi-modal solution space
• Reward function:
$r(a) = \max\left(-(a - a^{*,1})^T D\, (a - a^{*,1}),\; -(a - a^{*,2})^T D\, (a - a^{*,2})\right)$
• Matlab demo shown in the lecture
• Topic of a current Master's thesis (Chris)
Linear Feature Representations
Two different models have been used:
• Reward-Weighted Regression (RWR):
$a = \theta^T \phi(s) + \epsilon$
• Adds noise to the action vector...
• Policy learning by Weighting Exploration with Returns (PoWER):
$a = (\theta + \epsilon)^T \phi(s)$
• Adds noise to the parameter vector...
• Both will be covered in more detail by Jan...
with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
Reward Weighted Regression
Model for the policy: $a = \theta^T \phi(s) + \epsilon$
$\pi(a | s; \theta) = \mathcal{N}(a\,|\,\theta^T \phi(s), \sigma^2 I)$
• In the M-step we have to minimize the reward-weighted squared error
$\arg\min_\theta \sum_j \exp(\beta R_j)\, (a_j - \theta^T \phi(s_j))^2$
• Looks familiar...?
Reward Weighted Regression
• In the M-step we have to minimize the reward-weighted squared error
$\arg\min_\theta \sum_j \exp(\beta R_j)\, (a_j - \theta^T \phi(s_j))^2$
• This is just a weighted linear regression problem!
$\theta = (\Phi^T R \Phi)^{-1} \Phi^T R A$
• with...
• $\Phi = [\phi(s_1), \phi(s_2), \dots, \phi(s_N)]^T$
• $R = \text{diag}([\exp(\beta R_j)])$ ... the diagonal matrix of reward weights
• $A = [a_1, a_2, \dots, a_N]$
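A minimal numpy sketch of this update (the function name is ours); shifting the rewards by their maximum before exponentiating rescales all weights by a constant, which cancels in the solution, but avoids overflow.

```python
import numpy as np

def rwr_update(Phi, A, R, beta=1.0):
    """Reward-Weighted Regression M-step as a weighted least-squares fit.

    Phi : (N, d) feature matrix [phi(s_1), ..., phi(s_N)]^T
    A   : (N, m) matrix of executed actions
    R   : (N,)  rewards; the weight of sample j is exp(beta * R_j)
    """
    w = np.exp(beta * (R - R.max()))  # rescaled weights, same minimizer
    W = np.diag(w)
    # theta = (Phi^T W Phi)^{-1} Phi^T W A
    return np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ A)
```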
Things you can do...
• Ball in the Cup
• Dart: playing around the clock
• Robot balancing for different forces...
Extensions / Not covered...
• A similar EM-based approach can estimate the V-function (Neumann & Peters, 2009)
• A variational inference approach with better properties in the case of a multi-modal solution space (Neumann, 2011)
• How to choose β?
• Similar, but better:
• Relative Entropy Policy Search (REPS) (Peters et al., 2010)
• Bounds the 'distance' between two subsequent policies
Possible Projects / Bachelor Thesis...
Let's play table tennis...!
• Final setup: 2 robots playing against each other...
• We will also get the real robots...
Let's play table tennis...!
Use EM-based algorithms for ...
• Learning when to intercept the ball
• Learning to smash
• Learning to stop the ball
• Learning to play the ball with spin
The end
Thanks for your attention!
References
Neumann, G., & Peters, J. (2009). Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). MA: MIT Press.
Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations. In: Getoor, L., & Scheffer, T. (eds), Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 817–824. New York, NY, USA: ACM.
Peters, J., Mülling, K., & Altun, Y. (2010). Relative Entropy Policy Search. In: AAAI.