EM-based Reinforcement Learning
Gerhard Neumann
TU Darmstadt, Intelligent Autonomous Systems
December 21, 2011
Outline
Expectation Maximization (EM)-based Reinforcement Learning
• Recap: Modelling data with Maximum Likelihood
• Expectation Maximization
• EM for RL
• Applications
Why should we use probabilities for RL?
Reinforcement Learning in continuous state and action spaces is a hard problem
• Value functions are hard to estimate in continuous spaces
• Many RL methods rely on discretizations of the state space, the action space, or both
Why should we use probabilities for RL?
However: many probabilistic inference algorithms can be used in continuous spaces
• Gaussians, Mixtures of Gaussians, Linear Gaussian Models, Gaussian Processes
• We know how to estimate these distributions from data
• Can we use probabilistic inference for inferring a policy?
Quick Recap: Fun from high school...
Definitions:
• Marginal distribution: $P(X) = \sum_Y P(X, Y)$
• Conditional distribution: $P(X|Y) = \frac{P(X, Y)}{P(Y)}$
Quick Recap: Fun from high school...
Implications:
• Product rule: $P(X, Y) = P(X|Y)\,P(Y) = P(Y|X)\,P(X)$
• Chain rule: $P(X_1, \dots, X_n) = \prod_i P(X_i | X_1, \dots, X_{i-1})$
• Bayes rule: $P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}$
Quick Recap
Gaussian Distribution:
$P(x|\theta) = \mathcal{N}(x\,|\,\mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Parameters θ:
• µ ... mean
• Σ ... covariance matrix
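As a quick illustration, here is a minimal numpy sketch that evaluates this density; the function name and arguments are ours, not part of the lecture.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    k = mu.shape[0]
    diff = x - mu
    # Normalization constant: (2*pi)^(k/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    # Quadratic form: (x - mu)^T Sigma^{-1} (x - mu)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / norm
```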
Recap: Modelling our data
We are given a set of data points $y_i$
• ... and we want to estimate a generative model $P(y_i; \theta)$ for these data points
Recap: Modelling our data
Maximum Likelihood Solutions
• We want to find the parameters θ that maximize the likelihood $P(y_{1:N}; \theta)$ of the data $y_i$
$\arg\max_\theta P(y_{1:N}; \theta) = \arg\max_\theta \prod_{i=1}^{N} P(y_i; \theta)$
• This is often easier in log-space
$\arg\max_\theta \log P(y_{1:N}; \theta) = \arg\max_\theta \sum_{i=1}^{N} \log P(y_i; \theta)$
• A piece of cake for all distributions from the exponential family (e.g., Gaussians)
Recap: Modelling our data
E.g., the Gaussian distribution
• Given: a set of data points $\{x_i\}_{i=1 \dots N}$
• Estimate the parameters:
$\mu = \frac{\sum_i x_i}{N}, \qquad \Sigma = \frac{\sum_i (x_i - \mu)(x_i - \mu)^T}{N}$
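A minimal numpy sketch of these estimates (the function name is ours):

```python
import numpy as np

def fit_gaussian(X):
    """Maximum likelihood estimate of a Gaussian from data X of shape (N, k)."""
    mu = X.mean(axis=0)                 # mu = sum_i x_i / N
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]  # Sigma = sum_i (x_i - mu)(x_i - mu)^T / N
    return mu, Sigma
```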
Recap: Modelling our data with hidden variables
Often we are not given all information...
• E.g., missing data
• Mixture modelling / clustering: which mixture component created the data?
• Reinforcement learning: which trajectories create high reward?
Recap: Modelling our data with hidden variables
Maximum Likelihood solutions with hidden variables z
• Given a model $P(y, z; \theta)$, we want the θ which maximizes the likelihood of the data $y_i$:
$\arg\max_\theta L(\theta), \quad L(\theta) = \log P(y_{1:N}; \theta) = \sum_i \log P(y_i; \theta) = \sum_i \log \sum_z P(y_i, z; \theta)$
• Since the data for the hidden variables z is missing, we need to marginalize them out!
Recap: Modelling our data with hidden variables
Maximum Likelihood solutions with hidden variables z
$\arg\max_\theta L(\theta), \quad L(\theta) = \log P(y_{1:N}; \theta) = \sum_i \log P(y_i; \theta) = \sum_i \log \sum_z P(y_i, z; \theta)$
• Uh-oh... the log of a sum... are we doomed?!
• At least, no closed-form solution exists any more...
Outline
EM-based Reinforcement Learning
• Recap: Modelling data with Maximum Likelihood
• Expectation Maximization (EM)
• EM for RL
• Applications
Iterative Solution: Expectation-Maximization
Expectation-Maximization based Algorithms:
• (E)xpectation Step
• (M)aximization Step
Expectation Step:
• Use a proposal distribution $P_i(z)$ over the hidden variables
• What is my belief over the hidden variables given the current model $\theta^{(t-1)}$ and the observation $y_i$?
• Calculate $P_i(z) = P(z | y_i; \theta^{(t-1)})$
Maximization Step:
• Weight the log-likelihood of the joint by the proposal distribution:
$Q(\theta) = \sum_i \sum_z P_i(z) \log P(y_i, z; \theta)$
• Set $\theta^{(t)} = \arg\max_\theta Q(\theta)$
Iterative Solution: EM
Comparison: standard ML solution:
$\arg\max_\theta L(\theta) = \arg\max_\theta \sum_i \log \sum_z P(y_i, z; \theta)$
M-step:
$Q(\theta) = \sum_i \sum_z P_i(z) \log P(y_i, z; \theta)$
'Magic' of EM: it transforms the log of a sum into a sum of logs
• The E-step and the M-step can be solved in closed form!
• Both steps provably increase the log-likelihood $L(\theta)$ or leave it unchanged
• Thus the algorithm always converges to a (local) maximum
Example: Gaussian Mixture Models
The distribution is composed of K Gaussian components
$P(y) = \sum_{k=1}^{K} P(k)\,P(y|k) = \sum_{k=1}^{K} c_k\, \mathcal{N}(y\,|\,\mu_k, \Sigma_k)$
• θ: $c_k$ ... mixture coefficients, $\mu_k$ ... means, $\Sigma_k$ ... covariances
Hidden variable k
• We do not know which component k created our data
• Joint distribution: $P(y, k) = c_k\, \mathcal{N}(y\,|\,\mu_k, \Sigma_k)$
• If we knew k, the task would be easy...
EM for Gaussian Mixture Models
Expectation Step:
• Calculate the probability that component k created data point $y_i$
$P_i(k) = P(k | y_i) = \frac{P(y_i, k; \theta)}{\sum_{k'} P(y_i, k'; \theta)}$
• Called responsibilities...
Maximization Step:
$\arg\max_{\{c_{1:K}, \mu_{1:K}, \Sigma_{1:K}\}} \sum_i \sum_k P_i(k) \log P(y_i, k)$
• Each mixture component can be optimized independently (swap the order of summation):
$\arg\max_{\{c_{1:K}, \mu_{1:K}, \Sigma_{1:K}\}} \sum_k \sum_i P_i(k) \log P(y_i, k)$
EM for Gaussian Mixture Models
Each mixture component can be optimized independently!
$\arg\max_{\{c_k, \mu_k, \Sigma_k\}} \sum_i P_i(k) \left( \log \mathcal{N}(y_i\,|\,\mu_k, \Sigma_k) + \log c_k \right)$
• Comparison: Maximum Likelihood (ML) problem of a single Gaussian
$\arg\max_{\{\mu, \Sigma\}} \sum_i \log \mathcal{N}(y_i\,|\,\mu, \Sigma)$
• Weighted ML solution: $P_i(k)$ defines a weighting of each data point
EM for Gaussian Mixture Models
Comparison: ML solution for a single Gaussian
$\mu = \frac{\sum_j y_j}{N}, \qquad \Sigma = \frac{\sum_j (y_j - \mu)(y_j - \mu)^T}{N}$
M-step: weighted ML solution
$\mu_k = \frac{\sum_j P_j(k)\, y_j}{\sum_j P_j(k)}, \qquad \Sigma_k = \frac{\sum_j P_j(k)\,(y_j - \mu_k)(y_j - \mu_k)^T}{\sum_j P_j(k)}$
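Putting the E-step and the weighted M-step together, here is a minimal sketch of EM for a GMM, assuming numpy and scipy; the initialization and the update $c_k = \sum_j P_j(k)/N$ for the mixture coefficients follow the standard derivation and are not spelled out on the slides, and the small diagonal jitter is an added numerical safeguard.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(Y, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture model on data Y of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    # Initialize: uniform coefficients, random data points as means, identity covariances
    c = np.full(K, 1.0 / K)
    mu = Y[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities P_i(k) proportional to c_k * N(y_i | mu_k, Sigma_k)
        R = np.stack([c[k] * multivariate_normal.pdf(Y, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates, one per component
        Nk = R.sum(axis=0)
        c = Nk / N
        mu = (R.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mu[k]
            Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return c, mu, Sigma
```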
EM for Gaussian Mixture Models
Example: see the EM illustration in Bishop's book (Pattern Recognition and Machine Learning)
EM in a nutshell
• EM can be used whenever we need to deal with hidden/unobserved variables
• Iteratively apply the E-step and the M-step
• Both can be solved in closed form
• No learning rates or other tuning parameters are needed!
• Uses a proposal distribution over the hidden variables
• The belief over the hidden variables under the current model...
• ... is used as a weighting in the joint log-likelihood
Outline
Expectation Maximization (EM)-based Reinforcement Learning
• Recap: Modelling data with Maximum Likelihood
• Expectation Maximization
• EM for RL
• Applications
EM for Reinforcement Learning
OK, nice, but how can I use that for robot learning?
• Model RL as a Maximum Likelihood problem!
• 'Observed' variable: we want to observe a reward event
$P(R = 1 | \tau) \propto \exp(\beta R_\tau)$
• A binary event of observing a reward; β is the 'temperature' of the distribution
• $\tau$ ... trajectory, $R_\tau$ ... reward of the trajectory
• A common approach to transform a reward into a distribution
EM for Reinforcement Learning
Example for the reward distribution (Matlab demo shown in the lecture); a rough sketch follows below.
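Since the Matlab demo is not reproduced in the transcript, here is a minimal numpy sketch of how $\exp(\beta R_\tau)$ turns rewards into normalized weights; the rewards below are made up, and the max-shift is an added numerical-stability detail that leaves the normalized weights unchanged.

```python
import numpy as np

def reward_weights(R, beta=1.0):
    """Turn rewards R into normalized weights w_i proportional to exp(beta * R_i).

    Subtracting max(R) does not change the normalized weights but
    avoids numerical overflow for large beta * R.
    """
    w = np.exp(beta * (R - np.max(R)))
    return w / w.sum()

rewards = np.array([-4.0, -1.0, 0.0, -0.5])
print(reward_weights(rewards, beta=2.0))  # higher reward -> higher weight
```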
EM for Reinforcement Learning
• 'Observed' variable: reward event
$P(R = 1 | \tau) \propto \exp(\beta R_\tau)$
• 'Hidden' variable: we do not know which trajectories generated the reward event
• Model for trajectories: $p(\tau; \theta)$
• It contains our policy:
$p(\tau; \theta) = P(s_0) \prod_t P(s_t | a_{t-1}, s_{t-1})\, \pi(a_{t-1} | s_{t-1}; \theta)$
• We want to find a θ which gives high reward!
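One useful observation, sketched below under an assumed Gaussian policy $\pi(a|s;\theta) = \mathcal{N}(a\,|\,\theta^T\phi(s), \sigma^2 I)$ (the model used later for RWR): in $\log p(\tau;\theta)$, the initial-state and dynamics terms do not depend on θ, so only the policy terms matter when optimizing θ. The function names and arguments are ours.

```python
import numpy as np

def policy_log_likelihood(states, actions, theta, phi, sigma=0.1):
    """Theta-dependent part of log p(tau; theta) for one trajectory tau.

    log p(tau; theta) = log P(s_0) + sum_t [ log P(s_t | a_{t-1}, s_{t-1})
                                             + log pi(a_{t-1} | s_{t-1}; theta) ]
    The P(s_0) and dynamics terms are constant in theta and drop out of the
    M-step, leaving only the Gaussian policy terms (up to an additive constant).
    """
    ll = 0.0
    for s, a in zip(states, actions):
        mean = theta.T @ phi(s)
        ll += -0.5 * np.sum((a - mean) ** 2) / sigma**2
    return ll
```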
EM for Reinforcement Learning
We want to find a θ which gives high reward!
• Joint distribution: $p(R, \tau; \theta) = p(R | \tau)\, p(\tau; \theta)$
• We want to maximize the log-likelihood of our 'observation' (getting a reward)
$\arg\max_\theta \log p(R = 1; \theta) = \arg\max_\theta \log \int_\tau P(R = 1 | \tau)\, P(\tau; \theta)\, d\tau$
EM for Reinforcement Learning
We want to maximize the log-likelihood of our 'observation' (getting a reward)
$\log p(R) = \log \int_\tau P(R | \tau)\, P(\tau; \theta)\, d\tau$
• High-dimensional trajectory space: the integral over all trajectories is intractable
• Are we doomed again? No... EM can help us out!
EM for Reinforcement Learning
EM can help us out
• Use a proposal distribution $P(\tau)$ over trajectories
E-step:
• Estimate the probability that trajectory τ has created the reward event:
$P(\tau) = P(\tau | R; \theta_{t-1}) = \frac{P(R | \tau)\, P(\tau; \theta_{t-1})}{P(R; \theta_{t-1})} \propto P(R | \tau)\, P(\tau; \theta_{t-1})$
• $P(\tau)$ is also called the reward-weighted model distribution.
EM for Reinforcement Learning
M-step:
$\theta_t = \arg\max_\theta Q(\theta)$, with
$Q(\theta) = \int_\tau P(\tau) \log P(R | \tau)\, P(\tau; \theta)\, d\tau = \int_\tau P(R | \tau)\, P(\tau; \theta_{t-1}) \log P(\tau; \theta)\, d\tau + \text{const}$
If we use samples $\tau_j \sim P(\tau; \theta_{t-1})$, this integral can be efficiently approximated!
$Q(\theta) \approx \sum_{\tau_j} P(R | \tau_j) \log P(\tau_j; \theta)$
• This is again just the weighted maximum likelihood solution
• Each trajectory is weighted by its reward probability $\exp(\beta R_{\tau_j})$
Summary: EM for Reinforcement Learning
• Start with an initial distribution $P(\tau; \theta_0)$
• For $t = 1 \dots L$:
• Sample N trajectories from $P(\tau; \theta_{t-1})$
• Weight each trajectory by the probability $w_i \propto \exp(\beta R_{\tau_i})$ that it created the reward event
• Estimate the new model parameters $\theta_t$ by a weighted maximum likelihood estimate
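A minimal sketch of this loop for the 1-step problem on the next slide (a Gaussian model over actions, quadratic reward); $a^*$, D, β, N, and L are made-up demo values, and the small diagonal jitter on Σ is an added numerical safeguard.

```python
import numpy as np

# EM policy search for a 1-step problem: Gaussian "policy" over actions,
# reward r(a) = -(a - a_star)^T D (a - a_star) as on the following slide.
rng = np.random.default_rng(0)
a_star = np.array([1.0, -0.5])        # unknown optimum (for the demo)
D = np.diag([1.0, 4.0])
beta, N, L = 3.0, 100, 30

mu, Sigma = np.zeros(2), np.eye(2)    # initial model P(a; theta_0)
for t in range(L):
    # Sample N actions from the current model
    A = rng.multivariate_normal(mu, Sigma, size=N)
    R = -np.einsum('ni,ij,nj->n', A - a_star, D, A - a_star)
    # E-step: weights w_i proportional to exp(beta * R_i), max-shifted for stability
    w = np.exp(beta * (R - R.max()))
    w /= w.sum()
    # M-step: weighted maximum likelihood fit of the Gaussian
    mu = w @ A
    diff = A - mu
    Sigma = diff.T @ (w[:, None] * diff) + 1e-6 * np.eye(2)

print(mu)  # converges towards a_star
```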
Illustration: 1-step RL Problem
• 2-dimensional action space, no states
• Reward function: $r(a) = -(a - a^*)^T D\, (a - a^*)$
• Matlab demo shown in the lecture (cf. the sketch above)
Problems: 1-step RL Problem
• 2-dimensional action space, multi-modal solution space
• Reward function:
$r(a) = \max\left(-(a - a^{*,1})^T D\, (a - a^{*,1}),\; -(a - a^{*,2})^T D\, (a - a^{*,2})\right)$
• Matlab demo shown in the lecture
• Topic of a current Master's thesis (Chris)
Linear Feature Representations
Two different models have been used:
• Reward-Weighted Regression (RWR):
$a = \theta^T \phi(s) + \epsilon$
• Adds noise to the action vector...
• Policy learning by Weighting Exploration with Returns (PoWER):
$a = (\theta + \epsilon)^T \phi(s)$
• Adds noise to the parameter vector...
• Both will be covered in more detail by Jan...
with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
Reward Weighted Regression
Model for the policy: $a = \theta^T \phi(s) + \epsilon$
$\pi(a | s; \theta) = \mathcal{N}(a\,|\,\theta^T \phi(s), \sigma^2 I)$
• In the M-step we have to minimize the reward-weighted squared error
$\arg\min_\theta \sum_j \exp(\beta R_j)\, (a_j - \theta^T \phi(s_j))^2$
• Looks familiar...?
Reward Weighted Regression
• In the M-step we have to minimize the reward-weighted squared error
$\arg\min_\theta \sum_j \exp(\beta R_j)\, (a_j - \theta^T \phi(s_j))^2$
• This is just a weighted linear regression problem!
$\theta = (\Phi^T R \Phi)^{-1} \Phi^T R A$
• with...
• $\Phi = [\phi(s_1), \phi(s_2), \dots, \phi(s_N)]^T$
• $R = \text{diag}([\exp(\beta R_j)])$ ... the diagonal matrix of reward weights
• $A = [a_1, a_2, \dots, a_N]$
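A minimal numpy sketch of this update (the function name is ours); shifting the rewards by their maximum before exponentiating rescales all weights by a constant, which cancels in the solution, but avoids overflow.

```python
import numpy as np

def rwr_update(Phi, A, R, beta=1.0):
    """Reward-Weighted Regression M-step as a weighted least-squares fit.

    Phi : (N, d) feature matrix [phi(s_1), ..., phi(s_N)]^T
    A   : (N, m) matrix of executed actions
    R   : (N,)  rewards; the weight of sample j is exp(beta * R_j)
    """
    w = np.exp(beta * (R - R.max()))  # rescaled weights, same minimizer
    W = np.diag(w)
    # theta = (Phi^T W Phi)^{-1} Phi^T W A
    return np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ A)
```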
Things you can do...
• Ball in the Cup
• Dart: playing around the clock
• Robot balancing for different forces...
Extensions / Not covered...
• A similar EM-based approach can estimate the V-function (Neumann & Peters, 2009)
• A variational inference approach with better properties in the case of a multi-modal solution space (Neumann, 2011)
• How to choose β?
• Similar, but better:
• Relative Entropy Policy Search (REPS) (Peters et al., 2010)
• Bounds the 'distance' between two subsequent policies
Possible Projects / Bachelor Thesis...
Let's play table tennis...!
• Final setup: 2 robots playing against each other...
• We will also get the real robots...
Let's play table tennis...!
Use EM-based algorithms for ...
• Learning when to intercept the ball
• Learning to smash
• Learning to stop the ball
• Learning to play the ball with spin
The end
Thanks for your attention!
References
Neumann, G., & Peters, J. (2009). Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). MA: MIT Press.
Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations. In: Getoor, L., & Scheffer, T. (eds), Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 817–824. New York, NY, USA: ACM.
Peters, J., Mülling, K., & Altun, Y. (2010). Relative Entropy Policy Search. In: AAAI.