TRANSCRIPT
Shipra Agrawal
IEOR 8100 (Reinforcement Learning)
Computational complexity
the amount of per-time-step computation the algorithm uses during learning;
Space complexity
the amount of memory used by the algorithm;
Learning complexity
a measure of how much experience the algorithm needs to learn in a given task.
Using a single thread of experience
No resets or generative model
Vs.
Generative models (simulator)
http://hunch.net/~jl/projects/RL/EE.html
Optimal learning complexity?
"Optimally explore" means obtaining the maximum expected discounted reward E[∑_{t=1}^∞ γ^{t−1} r_t] over a known prior over the unknown MDPs
Tractable in special cases (γ < 1, MAB; Gittins, 1989)
Bound the number of steps on which a suboptimal policy is played
Act near-optimally on all but a polynomial number of steps (Kakade, 2003; Strehl et al., 2006, 2009; Strehl and Littman, 2008)
Example: [Strehl et al. 2006] (may have been improved in recent work)
Some examples:
An O(SA) analysis of Q-learning.
Michael Kearns and Satinder Singh, Finite-Sample Convergence Rates for Q-learning and Indirect Algorithms NIPS 1999.
An analysis of TD-lambda.
Michael Kearns and Satinder Singh, Bias-Variance Error Bounds for Temporal Difference Updates, COLT 2000.
Bound the difference in total reward of the algorithm compared to a benchmark policy:
Regret(T, M) = T ρ*(s₀) − E[∑_{t=1}^T r_t]
where ρ*(s₀) is the infinite-horizon average reward achieved by the best stationary policy π*
Episodic/single thread
Episodic: starting state is reset every H steps
Single thread: assume the MDP is communicating, with diameter D
Worst-case regret bounds: example
Theorem [Agrawal, Jia, 2017]: For any communicating MDP M with (unknown) diameter D, and for T ≥ S⁵A, our algorithm achieves: Regret(T, M) ≤ Õ(D√(SAT))
Bayesian regret bounds (expectation over a known prior on MDP M): example
Theorem [Osband, Russo, Van Roy, 2013]: For any known prior distribution φ over MDP M, the algorithm achieves, in the episodic setting: E_{M∼φ}[Regret(T, M)] ≤ Õ(HS√(AT))
Exploration-exploitation and regret minimization for multi-armed bandit
UCB
Thompson Sampling
Regret minimization for RL (tabular)
UCRL
Posterior sampling based algorithms
Two-armed bandit
[Figure: a single state s with two self-loop actions, each taken with probability 1.0 and yielding rewards r(s,1) and r(s,2); edges labeled P(s, a, s), r(s, a)]
Online decisions
At every time step t = 1, …, T, pull one arm out of N arms
Stochastic feedback
For each arm i, the reward is generated i.i.d. from a fixed but unknown distribution ν_i
with support [0,1] and mean μ_i
Bandit feedback
Only the reward of the pulled arm can be observed
Multiple rigged slot machines in a casino.
Which one to put money on?
• Try each one out
Arms == actions
WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?
Minimize expected regret in time T:
Regret(T) = T μ* − E[∑_{t=1}^T r_t], where μ* = max_i μ_i
Equivalent formulation: the expected regret for playing arm I_t = i at time t is Δ_i = μ* − μ_i
Let n_{i,T} be the number of times arm i is played till time T
Regret(T) = E[∑_t (μ* − μ_{I_t})] = ∑_{i: μ_i ≠ μ*} Δ_i E[n_{i,T}]
If we can bound E[n_{i,T}] by C log T / Δ_i², then we get the regret bound ∑_{i: μ_i ≠ μ*} C log T / Δ_i
This problem-dependent bound is close to the lower bound [Lai and Robbins 1985]
For a problem-independent bound of C√(NT log T): separately bound the total regret from arms with Δ_i ≤ √(N log T / T) and from arms with Δ_i > √(N log T / T)
Lower bound: Ω(√(NT))
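To see the two regimes concretely, here is a small numeric sketch in Python (the constant C = 16, the horizon, and the gap values are illustrative assumptions, not from the lecture) comparing the problem-dependent bound ∑_i C log T / Δ_i with the problem-independent √(NT log T) scaling.

```python
import math

# Illustrative values (assumptions, not from the lecture): compare the
# problem-dependent bound sum_i C*log(T)/Delta_i with the
# problem-independent sqrt(N*T*log(T)) scaling.
C = 16                    # assumed constant from a UCB-style analysis
T = 100_000               # horizon
gaps = [0.5, 0.1, 0.01]   # gaps Delta_i of the suboptimal arms
N = len(gaps) + 1         # total number of arms (one optimal arm)

problem_dependent = sum(C * math.log(T) / d for d in gaps)
problem_independent = C * math.sqrt(N * T * math.log(T))

print(f"problem-dependent bound:   {problem_dependent:10.1f}")
print(f"problem-independent bound: {problem_independent:10.1f}")
# As the smallest gap shrinks, the problem-dependent bound blows up,
# while the sqrt(NT log T) bound stays finite -- hence both are reported.
```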
Two arms: black and red
Random rewards with unknown means μ_black = 1.1, μ_red = 1
Optimal expected reward in T time steps is 1.1 × T
Exploit-only strategy: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms
Initial few trials can mislead into playing the red action forever
Observed rewards: 1.1, 1, 0.2 (black arm); 1, 1, 1, 1, 1, 1, 1, …… (red arm)
Expected regret in T steps is then close to 0.1 × T
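A minimal simulation sketch of this exploit-only failure mode (the Gaussian noise model, its scale, and the seeds are assumptions of the sketch, not part of the lecture): with an unlucky first draw from the better arm, the empirical-mean maximizer can lock onto the worse arm, and the regret then grows like 0.1 × T.

```python
import random

def exploit_only(T=10_000, seed=0):
    """Greedy (exploit-only) strategy on the two-arm example:
    arm 0 has mean 1.1, arm 1 has mean 1.0 (Gaussian noise assumed)."""
    rng = random.Random(seed)
    means = [1.1, 1.0]
    counts, totals = [0, 0], [0.0, 0.0]
    reward_sum = 0.0
    for _ in range(T):
        # Play each arm once, then always play the current empirical-mean leader.
        if counts[0] == 0:
            i = 0
        elif counts[1] == 0:
            i = 1
        else:
            i = 0 if totals[0] / counts[0] >= totals[1] / counts[1] else 1
        r = means[i] + rng.gauss(0, 0.5)
        counts[i] += 1
        totals[i] += r
        reward_sum += r
    regret = 1.1 * T - reward_sum
    return counts, round(regret, 1)

# If the first draw of arm 0 happens to be low, arm 0 is never tried again
# and the regret is roughly 0.1 * T; otherwise it stays small.
for s in range(5):
    print(s, exploit_only(seed=s))
```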
Exploitation: play the empirical mean reward maximizer
Exploration: play less explored actions to ensure empirical estimates converge
Empirical mean at time t for arm i: μ̂_{i,t}
Upper confidence bound: UCB_{i,t} = μ̂_{i,t} plus a confidence width that shrinks with n_{i,t}
Using Azuma-Hoeffding (since E[r_t | I_t = i] = μ_i): with probability 1 − 2/t², the empirical mean μ̂_{i,t} is within the confidence width of μ_i
Two implications, for every arm i (recall Δ_i = μ* − μ_i):
UCB_{i,t} > μ_i with probability 1 − 2/t²
If n_{i,t} > 16 ln t / Δ_i², then UCB_{i,t} < μ_i + Δ_i/2 < μ*
Each suboptimal arm is played at most 16 ln T / Δ_i² times, since after that many plays its UCB falls below μ*, and hence (with high probability) below the UCB of the optimal arm
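A minimal UCB sketch along the lines of this analysis (the confidence-width constant 2 and the Bernoulli reward model are illustrative choices; the analysis above uses a larger constant):

```python
import math
import random

def ucb(means, T=10_000, seed=0):
    """Minimal UCB sketch: play each arm once, then play the arm maximizing
    empirical mean + sqrt(2 ln t / n_i) on Bernoulli rewards."""
    rng = random.Random(seed)
    N = len(means)
    counts, totals = [0] * N, [0.0] * N
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= N:
            i = t - 1                                  # initial round-robin
        else:
            i = max(range(N), key=lambda j: totals[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        r = 1.0 if rng.random() < means[i] else 0.0    # Bernoulli reward
        counts[i] += 1
        totals[i] += r
        total_reward += r
    regret = max(means) * T - total_reward
    return counts, round(regret, 1)

print(ucb([0.5, 0.45, 0.6]))   # suboptimal arms get only O(log T) pulls
```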
A natural and efficient heuristic (Thompson Sampling):
Maintain a belief about the parameters (e.g., mean reward) of each arm
Observe feedback, update the belief of the pulled arm i in a Bayesian manner
Pull each arm with its posterior probability of being the best arm
NOT the same as choosing the arm that is most likely to be best
Maintain Bayesian posteriors for the unknown parameters
With more trials, the posteriors concentrate on the true parameters; the mode captures the MLE: enables exploitation
With fewer trials, there is more uncertainty in the estimates; the spread/variance captures uncertainty: enables exploration
A sample from the posterior is used as an estimate of the unknown parameters to make decisions
Uniform distribution = Beta(1,1)
Beta(α, β) prior → posterior:
Beta(α + 1, β) if you observe 1
Beta(α, β + 1) if you observe 0
Start with a Beta(α₀, β₀) prior belief for every arm
In round t:
• For every arm i, sample θ_{i,t} independently from the posterior Beta(α₀ + S_{i,t}, β₀ + F_{i,t}), where S_{i,t} and F_{i,t} are the successes and failures observed so far for arm i
• Play arm I_t = argmax_i θ_{i,t}
• Observe the reward and update the Beta posterior for arm I_t
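A minimal sketch of this Beta-Bernoulli Thompson Sampling loop (the test arm means and the default Beta(1,1) prior are illustrative assumptions):

```python
import random

def thompson_sampling(means, T=10_000, alpha0=1, beta0=1, seed=0):
    """Thompson Sampling for Bernoulli arms with Beta(alpha0, beta0) priors:
    sample theta_i ~ Beta(alpha0 + S_i, beta0 + F_i), play the argmax,
    then update the pulled arm's posterior."""
    rng = random.Random(seed)
    N = len(means)
    successes, failures = [0] * N, [0] * N    # S_i and F_i
    total_reward = 0
    for _ in range(T):
        samples = [rng.betavariate(alpha0 + successes[i], beta0 + failures[i])
                   for i in range(N)]
        i = max(range(N), key=lambda j: samples[j])
        r = 1 if rng.random() < means[i] else 0   # Bernoulli reward
        successes[i] += r
        failures[i] += 1 - r
        total_reward += r
    regret = max(means) * T - total_reward
    return successes, failures, round(regret, 1)

print(thompson_sampling([0.5, 0.45, 0.6]))
```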
Theorem [Russo and Van Roy 2014]: Given any prior φ over MAB instances M described by reward distributions with [0,1]-bounded support (e.g., a prior distribution over μ₁, μ₂, …, μ_N in the case of Bernoulli rewards):
BayesianRegret(T) = E_{M∼φ}[Regret(T, M)] ≤ O(√(NT ln T))
Note:
The Regret(T, M) notation is used to explicitly indicate the dependence of the regret on the MAB instance M.
In comparison, the worst-case regret is
Regret(T) = max_M Regret(T, M)
[A. and Goyal COLT 2012, AISTATS 2013, JACM 2017, Kaufmann et al. 2013]
Instance-dependent bounds for {0,1} rewards
Regret(T) ≤ ln(T) (1 + ε) ∑_{i: Δ_i > 0} Δ_i / KL(μ_i, μ*) + O(N/ε²)
Matches the asymptotic instance-wise lower bound [Lai and Robbins 1985]
The UCB algorithm achieves this only after careful tuning [Kaufmann et al. 2012]
Arbitrary bounded [0,1] rewards (using Beta and Gaussian priors)
Regret(T) ≤ O(ln(T) ∑_{i: Δ_i > 0} 1/Δ_i)
Matches the best available bounds for UCB for general reward distributions
Instance-independent bounds (Beta and Gaussian priors)
Regret(T) ≤ O(√(NT ln T))
Prior and likelihood mismatch allowed!
Two arms, μ₁ ≥ μ₂, Δ = μ₁ − μ₂
Every time arm 2 is pulled, Δ regret is incurred
Bound the number of pulls of arm 2 by (log T)/Δ² to get a (log T)/Δ regret bound
How many pulls of arm 2 are actually needed?
After n = O((log T)/Δ²) pulls of arm 2 and of arm 1:
Empirical means are well separated:
error |μ̂_i − μ_i| ≤ √(log(T)/n) ≤ Δ/4 w.h.p.
(using the Azuma-Hoeffding inequality)
Beta posteriors are well separated:
mean = α_i/(α_i + β_i) = μ̂_i
standard deviation ≈ 1/√(α_i + β_i) = 1/√n ≤ Δ/4
The two arms can be distinguished!
No more arm 2 pulls.
[Figure: the posteriors of arm 2 and arm 1, centered at μ̂₂ and μ̂₁ and separated by Δ, barely overlap]
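A small numeric check of this separation argument (the values Δ = 0.1 and T = 10⁵ are illustrative assumptions): after n ≈ 16 log T / Δ² pulls, the bound 1/√n on the posterior standard deviation is already below Δ/4.

```python
import math

# Illustrative numbers (assumptions, not from the slides).
Delta, T = 0.1, 100_000
n = math.ceil(16 * math.log(T) / Delta ** 2)  # pulls needed for separation

# A Beta(alpha, beta) posterior with alpha + beta ~ n has standard deviation
# at most 1 / sqrt(alpha + beta) <= 1 / sqrt(n).
posterior_std_bound = 1 / math.sqrt(n)
print(n, round(posterior_std_bound, 4), Delta / 4)
# posterior_std_bound <= Delta/4, so posteriors centered about Delta apart
# overlap very little, and arm 2 essentially stops being sampled as the best.
```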
What if arm 2 is pulled fewer than n = O((log T)/Δ²) times?
Then the regret is at most nΔ = (log T)/Δ
What if there are at least (log T)/Δ² pulls of arm 2, but few pulls of arm 1?
[Figure: arm 2's posterior is concentrated at μ̂₂ while arm 1's posterior is still wide, with the means about Δ apart]
Arm 1 will be played roughly once every constant number of steps in this situation
It will take at most a constant × (log T)/Δ² steps (extra pulls of arm 2) to get out of this situation
So the total number of pulls of arm 2 is at most O((log T)/Δ²)
Summary: the variance of the posterior enables exploration
Optimal bounds (and the case of multiple arms) require more careful use of the posterior structure
Using UCB and TS ideas for exploration-exploitation in RL
Single-state MDP. Solution concept: optimal action
Multi-armed bandit problem
Uncertainty in rewards: random rewards with unknown means μ₁, μ₂
Exploit only: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms
Initial few trials can mislead into playing the red action forever
Observed rewards: 1.1, 1, 0.2 (black arm); 1, 1, 1, 1, 1, 1, 1, …… (red arm)
[Figure: a single state s with two self-loop actions, taken with probability 1.0 and yielding rewards +1 and +1.1; edges labeled P(s, a, s), r(s, a)]
Uncertainty in rewards, state transitions
Unknown reward distribution, transition probabilities
Exploration-exploitation:
Explore actions/states/policies, learn reward distributions and transition model
Exploit the (seemingly) best policy
[Figure: a two-state MDP with states S1 and S2; edges labeled with transition probability and reward: 1.0, $1; 0.6, $0; 0.9, $100; 0.4, $0; 0.1, $0; 1.0, $1]
Upper confidence bound based algorithms [Jaksch, Ortner, Auer, 2010] [Bartlett, Tewari, 2012]
Worst-case regret bound Õ(DS√(AT)) for communicating MDPs
Lower bound Ω(√(DSAT))
Optimistic posterior sampling [A., Jia 2017]
Worst-case regret bound Õ(D√(SAT)) for communicating MDPs of diameter D
Improvement by a factor of √S
Optimistic value iteration [Azar, Osband, Munos, 2017]
Worst-case regret bound Õ(√(HSAT)) in the episodic setting
Posterior sampling in the known-prior setting [Osband, Russo, and Van Roy 2013; Osband and Van Roy, 2016, 2017]
Bayesian regret bound of Õ(H√(SAT)) in the episodic setting, with length-H episodes
UCRL: Upper confidence bound based algorithm for RL
Posterior sampling based algorithm for RL
Main result
Proof techniques
Similar principles to UCB
This is a model-based approach:
Maintain an estimate of the model P̂, r̂
Occasionally solve the MDP (S, A, P̂, r̂, s₁) to find a policy
Run this policy for some time to get samples, and update the model estimate
Compare to the "model-free" or direct learning approach, like Q-learning:
Directly update Q-values, the value function, or the policy using samples.
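For contrast, a minimal sketch of one tabular Q-learning update (the step size and discount factor are illustrative assumptions; the lecture's regret setting is undiscounted, so this only illustrates the "directly update Q-values from samples" idea):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the bootstrapped target
    r + gamma * max_a' Q[s', a'], using only the observed transition (model-free)."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Usage sketch: Q = np.zeros((num_states, num_actions)), then call this after
# every observed transition (s, a, r, s') instead of estimating P and r.
```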
Proceed in epochs; an epoch ends when the number of visits of some state-action pair doubles.
At the beginning of every epoch k:
Use samples to compute an optimistic MDP (S, A, P̃, r̃, s₁)
an MDP with value greater than the true MDP (an upper bound!)
Solve the optimistic MDP to find its optimal policy π̃
Execute π̃ in epoch k
observe samples (s_t, a_t, s_{t+1})
Go to the next epoch if the number of visits of some state-action pair doubles:
if N_k(s, a) ≥ 2 N_{k−1}(s, a) for some (s, a)
At the beginning of every epoch k:
For every (s, a), compute the empirical model estimate:
let n_k(s, a) be the number of times (s, a) was visited before this epoch,
let n_k(s, a, s′) be the number of transitions to s′
Set r̂(s, a) as the average reward over these n_k(s, a) steps
Set P̂(s, a, s′) = n_k(s, a, s′) / n_k(s, a)
Compute the optimistic model estimate: use Chernoff bounds to define a confidence region around P̂, r̂:
|P̃(s, a, s′) − P̂(s, a, s′)| ≤ √(log t / n_k(s, a)) with probability 1 − 1/t²
The true P, r lie in this region
Find the best combination P̃, r̃ in this region: the MDP (S, A, P̃, r̃, s₁) with maximum value
It will have value more than the true MDP
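A sketch of these per-epoch empirical estimates and confidence radii (array shapes and the exact constant in the radius are assumptions of the sketch; the extended value iteration that maximizes value over the confidence region is omitted):

```python
import numpy as np

def empirical_model(counts, reward_sums, trans_counts, t):
    """Per-epoch empirical estimates and confidence radii for a UCRL-style algorithm.
    counts[s, a]           = n_k(s, a), visits to (s, a) before this epoch
    reward_sums[s, a]      = total reward collected at (s, a)
    trans_counts[s, a, s2] = n_k(s, a, s2), observed transitions to s2
    """
    n = np.maximum(counts, 1)              # avoid division by zero
    r_hat = reward_sums / n                # empirical mean rewards
    p_hat = trans_counts / n[:, :, None]   # empirical transition probabilities
    # Chernoff/Hoeffding-style radius; the optimistic MDP is chosen anywhere in
    # |P_tilde - p_hat| <= radius (found by extended value iteration, not shown).
    radius = np.sqrt(np.log(t) / n)
    return r_hat, p_hat, radius
```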
Recall regret:
Regret(T, M) = T ρ* − ∑_{t=1}^T r(s_t, a_t)
Theorem: For any communicating MDP M with (unknown) diameter D, with high probability:
Regret(T, M) ≤ Õ(DS√(AT))
The Õ(·) notation hides logarithmic factors in S, A, T beyond constants.
Our setting, regret definition
Posterior sampling algorithm for MDPs
Main result
Proof techniques
Maintain Bayesian posteriors for the unknown parameters
With more trials, the posteriors concentrate on the true parameters; the mode captures the MLE: enables exploitation
With fewer trials, there is more uncertainty in the estimates; the spread/variance captures uncertainty: enables exploration
A sample from the posterior is used as an estimate of the unknown parameters to make decisions
Assume for simplicity: known reward distribution
The algorithm needs to learn the unknown transition probability vector P_{s,a} = (P_{s,a}(1), …, P_{s,a}(S)) for every (s, a)
In any state s_t = s with action a_t = a, it observes the new state s_{t+1}:
the outcome of a multivariate Bernoulli (multinoulli) trial with probability vector P_{s,a}
Given a prior Dirichlet(α₁, α₂, …, α_S) on P_{s,a}:
After a multinoulli trial with outcome (new state) i, the Bayesian posterior on P_{s,a} is
Dirichlet(α₁, …, α_i + 1, …, α_S)
After n_{s,a} = α₁ + ⋯ + α_S observations for a state-action pair (s, a):
the posterior mean vector is the empirical mean: P̂_{s,a}(i) = α_i / (α₁ + ⋯ + α_S) = α_i / n_{s,a}
the variance is bounded by 1/n_{s,a}
With more trials of (s, a), the posterior mean concentrates around the true mean
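A small sketch of this Dirichlet update and its concentration (the Dirichlet(1, …, 1) prior, the true transition vector, and the number of observations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4
alpha = np.ones(S)                         # Dirichlet(1, ..., 1) prior on P_{s,a}
true_p = np.array([0.7, 0.2, 0.05, 0.05])  # unknown transition vector (assumed)

for _ in range(500):                       # 500 observed transitions from (s, a)
    s_next = rng.choice(S, p=true_p)
    alpha[s_next] += 1                     # conjugate update: +1 for the outcome seen

n_sa = alpha.sum()
posterior_mean = alpha / n_sa              # approaches the empirical / true vector
print(posterior_mean)
print(rng.dirichlet(alpha))                # a posterior sample, as used for decisions
```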
Learning
Maintain a Dirichlet posterior for P_{s,a} for every (s, a); after round t, on observing the outcome s_{t+1}, update the posterior for state s_t and action a_t
To decide the action
Sample a P̃_{s,a} for every (s, a)
Compute the optimal policy π̃ for the sample MDP (S, A, P̃, r, s₀)
Choose a_t = π̃(s_t)
Exploration-exploitation
Exploitation: with more observations the Dirichlet posterior concentrates, and P̃_{s,a} approaches the empirical mean P̂_{s,a}
Exploration: anti-concentration of the Dirichlet ensures exploration of states/actions/policies that are less explored
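A minimal sketch of this posterior sampling loop (the discounted value-iteration planner is a stand-in assumption for computing the optimal policy of the sampled MDP; the algorithm on the next slide instead resamples once per epoch and uses multiple samples):

```python
import numpy as np

def solve_mdp(P, r, gamma=0.99, iters=500):
    """Stand-in planner (assumption of this sketch): discounted value iteration
    on the sampled model; returns a greedy deterministic policy."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V        # P has shape (S, A, S), r has shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def psrl_step(alpha, r, state, rng):
    """One decision of posterior-sampling RL with Dirichlet posteriors:
    sample a transition model, plan on it, act greedily in the current state."""
    S, A, _ = alpha.shape
    P_sample = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                         for s in range(S)])
    policy = solve_mdp(P_sample, r)
    return policy[state]

def update_posterior(alpha, s, a, s_next):
    alpha[s, a, s_next] += 1         # conjugate Dirichlet update
    return alpha

# Usage sketch: alpha = np.ones((S, A, S)); in each round call psrl_step,
# take the action, observe s_next, then call update_posterior.
```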
Proceed in epochs; an epoch ends when the number of visits n_{s,a} of some state-action pair doubles.
In every epoch:
For every (s, a), generate multiple (ψ = Õ(S)) independent samples from the Dirichlet posterior for P_{s,a}
Form the extended sample MDP (S, ψA, P̃, r, s₀)
Find its optimal policy π̃ and use it throughout the epoch
Further, initial exploration:
For (s, a) with very small n_{s,a} (below a threshold of order √(TS/A)), use a simple optimistic sampling that provides extra exploration
An algorithm based on posterior sampling, with a high-probability, near-optimal worst-case regret upper bound
Recall regret:
Regret(T, M) = T ρ* − ∑_{t=1}^T r(s_t, a_t)
Theorem: For any communicating MDP M with (unknown) diameter D, and for T ≥ S⁵A, with high probability:
Regret(T, M) ≤ Õ(D√(SAT))
An improvement by a factor of √S over the UCB-based algorithm