Shipra Agrawal
IEOR 8100 (Reinforcement Learning)
Computational complexity
the amount of per-time-step computation the algorithm uses during learning;
Space complexity
the amount of memory used by the algorithm;
Learning complexity
a measure of how much experience the algorithm needs to learn in a given task.
Using a single thread of experience
No resets or generative model
Vs.
Generative models (simulator)
http://hunch.net/~jl/projects/RL/EE.html
Optimal learning complexity?
"Optimally explore" meaning to obtain maximum expected discounted reward $E[\sum_{t=1}^{\infty} \gamma^{t-1} r_t]$ over a known prior over the unknown MDPs
Tractable in special cases ($\gamma < 1$, MAB; Gittins, 1989).
Bound number of steps on which suboptimal policy is played
near-optimally on all but a polynomial number of steps (Kakade, 2003; Strehl and Littman, 2008b, Strehl et al 2006, 2009)
Example: [Strehl et al 2006] (may have been improved in recent work)
Some examples:
An O(SA) analysis of Q-learning.
Michael Kearns and Satinder Singh, Finite-Sample Convergence Rates for Q-learning and Indirect Algorithms NIPS 1999.
An analysis of TD-lambda.
Michael Kearns and Satinder Singh, Bias-Variance Error Bounds for Temporal Difference Updates, COLT 2000.
Bound difference in total reward of algorithm compared to a benchmark policy
$\mathrm{Regret}(T, M) = T\,\rho^*(s_0) - E\left[\sum_{t=1}^{T} r_t\right]$
$\rho^*(s_0)$ is the infinite-horizon average reward achieved by the best stationary policy $\pi^*$
Episodic/single thread
Episodic: starting state is reset every H steps
Single thread: assume communicating MDP with diameter D
Worst-case regret bounds; example
Theorem [Agrawal, Jia, 2017]: For any communicating MDP M with (unknown) diameter $D$, and for $T \geq S^5 A$, our algorithm achieves: $\mathrm{Regret}(T, M) \leq \tilde{O}(D\sqrt{SAT})$
Bayesian regret bounds (expectation over known prior on MDP M); example
Theorem [Osband, Russo, Van Roy, 2013]: For any known prior distribution $\phi$ on MDP M, our algorithm achieves in the episodic setting: $E_{M \sim \phi}[\mathrm{Regret}(T, M)] \leq \tilde{O}(HS\sqrt{AT})$
Exploration-exploitation and regret minimization for multi-armed bandit
UCB
Thompson Sampling
Regret minimization for RL (tabular)
UCRL
Posterior sampling based algorithms
Two armed bandit
[Diagram: single state s with two self-loop actions, each with transition probability 1.0 and rewards r(s,1), r(s,2); model parameters $P(s, a, s'), r(s, a)$]
Online decisions
At every time step $t = 1, \ldots, T$, pull one arm out of $N$ arms
Stochastic feedback
For each arm $i$, reward is generated i.i.d. from a fixed but unknown distribution $\nu_i$
with support $[0,1]$, mean $\mu_i$
Bandit feedback
Only the reward of the pulled arm can be observed
Multiple rigged slot machines in a casino.
Which one to put money on?
β’ Try each one out
Arm == actions
WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?
Minimize expected regret in time $T$: $\mathrm{Regret}(T) = T\mu^* - E\left[\sum_{t=1}^{T} \mu_{I_t}\right]$, where $\mu^* = \max_i \mu_i$
Equivalent formulation: expected regret for playing arm $I_t = i$ at time $t$ is $\Delta_i = \mu^* - \mu_i$
Let $n_{i,T}$ be the number of times arm $i$ is played till time $T$
$\mathrm{Regret}(T) = E\left[\sum_t (\mu^* - \mu_{I_t})\right] = \sum_{i:\, \mu_i \neq \mu^*} \Delta_i\, E[n_{i,T}]$
If we can bound $E[n_{i,T}]$ by $\frac{C \log T}{\Delta_i^2}$, then the regret bound is $\sum_{i:\, \mu_i \neq \mu^*} \frac{C \log T}{\Delta_i}$: a problem-dependent bound, close to the lower bound [Lai and Robbins 1985]
For a problem-independent bound of $C\sqrt{NT \log T}$: separately bound the total regret from arms with $\Delta_i \leq \sqrt{\frac{N \log T}{T}}$ and arms with $\Delta_i > \sqrt{\frac{N \log T}{T}}$
Lower bound $\Omega(\sqrt{NT})$
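Since the decomposition above drives both bounds on this slide, here is the short derivation written out, a worked restatement using only the notation already defined here:

```latex
% Regret decomposition for a stochastic N-armed bandit, with
% \mu^* = \max_i \mu_i, \Delta_i = \mu^* - \mu_i, and n_{i,T} = pulls of arm i up to T.
\begin{align*}
\mathrm{Regret}(T)
  &= T\mu^* - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{I_t}\Big]
   = \mathbb{E}\Big[\sum_{t=1}^{T} (\mu^* - \mu_{I_t})\Big] \\
  &= \mathbb{E}\Big[\sum_{i=1}^{N} \Delta_i \, n_{i,T}\Big]
   = \sum_{i:\,\Delta_i > 0} \Delta_i \, \mathbb{E}[n_{i,T}].
\end{align*}
% If \mathbb{E}[n_{i,T}] \le C \log T / \Delta_i^2 for every suboptimal arm,
% the last sum is at most \sum_{i:\,\Delta_i>0} C \log T / \Delta_i.
```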
Two arms black and red
Random rewards with unknown means $\mu_{\mathrm{black}} = 1.1$, $\mu_{\mathrm{red}} = 1$
Optimal expected reward in $T$ time steps is $1.1 \times T$
Exploit-only strategy: use the current best estimate (MLE/empirical mean) of the unknown mean to pick arms
Initial few trials can mislead into playing the red action forever:
1.1, 1, 0.2,
1, 1, 1, 1, 1, 1, 1, ……
Expected regret in $T$ steps is close to $0.1 \times T$
Exploitation: play the empirical mean reward maximizer
Exploration: play less explored actions to ensure empirical estimates converge
Empirical mean at time $t$ for arm $i$
Upper confidence bound
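The UCB formula on this slide did not survive extraction; as a hedged illustration, here is a minimal UCB1-style sketch in Python, assuming rewards in [0,1] and the common confidence radius $\sqrt{2\ln t / n_i}$ (the exact constant used in the lecture may differ):

```python
import math
import random

def ucb1(arms, T):
    """Minimal UCB1 sketch: `arms` is a list of functions, each returning a
    random reward in [0,1] when called; plays T rounds and returns total reward."""
    N = len(arms)
    counts = [0] * N          # n_i: number of pulls of arm i
    means = [0.0] * N         # empirical mean reward of arm i
    total = 0.0
    for t in range(1, T + 1):
        if t <= N:
            i = t - 1         # pull each arm once to initialize
        else:
            # upper confidence bound: empirical mean + exploration bonus
            i = max(range(N),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
        total += r
    return total

# usage: two Bernoulli arms with means 0.5 and 0.6
print(ucb1([lambda: float(random.random() < 0.5),
            lambda: float(random.random() < 0.6)], T=10000))
```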
Using Azuma-Hoeffding (since $E[r_t \mid I_t = i] = \mu_i$), with probability $1 - \frac{2}{t^2}$ the empirical mean $\hat{\mu}_{i,t}$ is within the confidence width of $\mu_i$.
Two implications: for every arm $i$
$UCB_{i,t} > \mu_i$ with probability $1 - \frac{2}{t^2}$
If $n_{i,t} > \frac{16 \ln t}{\Delta_i^2}$, then $UCB_{i,t} < \mu_i + \frac{\Delta_i}{2}$
Each suboptimal arm is played at most $\frac{16 \ln T}{\Delta_i^2}$ times, since after that many plays its upper confidence bound falls below $\mu^*$, and hence below the optimal arm's UCB, with high probability.
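A hedged, worked version of the step above, assuming a confidence radius of $\sqrt{\ln t / n_{i,t}}$, which is consistent with the $16\ln t/\Delta_i^2$ threshold on this slide (the lecture's exact constants may differ):

```latex
% Sketch with radius c_{i,t} = sqrt(ln t / n_{i,t}) and UCB_{i,t} = \hat{\mu}_{i,t} + c_{i,t};
% on the high-probability event |\hat{\mu}_{i,t} - \mu_i| \le c_{i,t}:
\begin{align*}
n_{i,t} > \frac{16 \ln t}{\Delta_i^2}
  \;\Rightarrow\; c_{i,t} = \sqrt{\tfrac{\ln t}{n_{i,t}}} < \frac{\Delta_i}{4}
  \;\Rightarrow\; UCB_{i,t} \le \mu_i + 2 c_{i,t} < \mu_i + \frac{\Delta_i}{2}
  = \mu^* - \frac{\Delta_i}{2} < \mu^* < UCB_{i^*,t},
\end{align*}
% so on this event a suboptimal arm stops being the UCB maximizer once it has
% been played more than 16 ln t / \Delta_i^2 times.
```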
Natural and Efficient heuristic
Maintain belief about parameters (e.g., mean reward) of each arm
Observe feedback, update belief of pulled arm i in Bayesian manner
Pull arm with posterior probability of being best arm
NOT same as choosing the arm that is most likely to be best
Maintain Bayesian posteriors for unknown parameters
With more trials, posteriors concentrate on the true parameters. Mode captures MLE: enables exploitation
Fewer trials means more uncertainty in estimates. Spread/variance captures uncertainty: enables exploration
A sample from the posterior is used as an estimate for unknown parameters to make decisions
Uniform distribution $Beta(1,1)$
$Beta(\alpha, \beta)$ prior $\rightarrow$ posterior
$Beta(\alpha + 1, \beta)$ if you observe 1
$Beta(\alpha, \beta + 1)$ if you observe 0
Start with $Beta(\alpha_0, \beta_0)$ prior belief for every arm
In round t,
• For every arm $i$, sample $\theta_{i,t}$ independently from posterior $Beta(\alpha_0 + S_{i,t}, \beta_0 + F_{i,t})$, where $S_{i,t}$ and $F_{i,t}$ are the numbers of 1s and 0s observed for arm $i$ so far
• Play arm $I_t = \arg\max_i \theta_{i,t}$
• Observe reward and update the Beta posterior for arm $I_t$
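A minimal sketch of this Beta-Bernoulli Thompson Sampling loop in Python, assuming {0,1} rewards and a Beta(1,1) prior; variable names are illustrative, not from the lecture:

```python
import random

def thompson_sampling(arms, T, alpha0=1.0, beta0=1.0):
    """`arms` is a list of functions returning a 0/1 reward when called.
    Maintains Beta(alpha0 + successes, beta0 + failures) posteriors per arm."""
    N = len(arms)
    successes = [0] * N   # S_{i,t}: number of observed 1s for arm i
    failures = [0] * N    # F_{i,t}: number of observed 0s for arm i
    total = 0
    for t in range(T):
        # sample theta_{i,t} from each arm's posterior
        theta = [random.betavariate(alpha0 + successes[i], beta0 + failures[i])
                 for i in range(N)]
        i = max(range(N), key=lambda a: theta[a])   # play the arm with the largest sample
        r = arms[i]()                                # observe {0,1} reward
        successes[i] += r
        failures[i] += 1 - r                         # Bayesian posterior update
        total += r
    return total

# usage: two Bernoulli arms with means 0.5 and 0.6
print(thompson_sampling([lambda: int(random.random() < 0.5),
                         lambda: int(random.random() < 0.6)], T=10000))
```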
Theorem [Russo and Van Roy 2014]: Given any prior $\phi$ over MAB instance M described by reward distributions with $[0,1]$ bounded support (e.g., a prior distribution over $\mu_1, \mu_2, \ldots, \mu_N$ in the case of Bernoulli rewards),
$\mathrm{BayesianRegret}(T) = E_{M \sim \phi}[\mathrm{Regret}(T, M)] \leq O(\sqrt{NT \ln T})$
Note:
The $\mathrm{Regret}(T, M)$ notation is used to explicitly indicate the dependence of regret on the MAB instance M.
In comparison, the worst-case regret is
$\mathrm{Regret}(T) = \max_M \mathrm{Regret}(T, M)$
[A. and Goyal COLT 2012, AISTATS 2013, JACM 2017, Kaufmann et al. 2013]
Instance-dependent bounds for {0,1} rewards
$\mathrm{Regret}(T) \leq (1+\epsilon) \sum_{i:\, \Delta_i > 0} \frac{(\ln T)\,\Delta_i}{KL(\mu_i \,\|\, \mu^*)} + O\!\left(\frac{N}{\epsilon^2}\right)$
Matches asymptotic instance-wise lower bound [Lai, Robbins 1985]
UCB algorithm achieves this only after careful tuning [Kaufmann et al. 2012]
Arbitrary bounded [0,1] rewards (using Beta and Gaussian priors)
$\mathrm{Regret}(T) \leq O\!\left(\sum_i \frac{\ln T}{\Delta_i}\right)$
Matches the best available for UCB for general reward distributions
Instance-independent bounds (Beta and Gaussian priors)
$\mathrm{Regret}(T) \leq O(\sqrt{NT \ln T})$
Prior and likelihood mismatch allowed!
Two arms, $\mu_1 \geq \mu_2$, $\Delta = \mu_1 - \mu_2$
Every time arm 2 is pulled, $\Delta$ regret
Bound the number of pulls of arm 2 by $\frac{\log T}{\Delta^2}$ to get a $\frac{\log T}{\Delta}$ regret bound
How many pulls of arm 2 are actually needed?
After $n = O\!\left(\frac{\log T}{\Delta^2}\right)$ pulls of arm 2 and arm 1
Empirical means are well separated
Error $|\hat{\mu}_i - \mu_i| \leq \sqrt{\frac{\log T}{n}} \leq \frac{\Delta}{4}$ w.h.p.
(Using Azuma-Hoeffding inequality)
Beta posteriors are well separated
Mean $= \frac{\alpha_i}{\alpha_i + \beta_i} = \hat{\mu}_i$
Standard deviation $\approx \frac{1}{\sqrt{\alpha + \beta}} = \frac{1}{\sqrt{n}} \leq \frac{\Delta}{4}$
The two arms can be distinguished!
No more arm 2 pulls.
[Figure: the two posteriors, centered near $\mu_2$ and $\mu_1$ and separated by $\Delta$]
If arm 2 is pulled less than $n = O\!\left(\frac{\log T}{\Delta^2}\right)$ times?
Regret is at most $n\Delta = O\!\left(\frac{\log T}{\Delta}\right)$
At least $\frac{\log T}{\Delta^2}$ pulls of arm 2, but few pulls of arm 1
[Figure: posteriors of arm 2 and arm 1, with means separated by $\Delta$; arm 1's posterior is still wide]
Arm 1 will be played roughly every constant number of steps in this situation
It will take at most constant $\times \frac{\log T}{\Delta^2}$ steps (extra pulls of arm 2) to get out of this situation
Total number of pulls of arm 2 is at most $O\!\left(\frac{\log T}{\Delta^2}\right)$
Summary: variance of posterior enables exploration
Optimal bounds (and for multiple arms) require more careful use of posterior structure
Using UCB and TS ideas for exploration-exploitation in RL
Single state MDP. Solution concept: optimal action
Multi-armed bandit problem
Uncertainty in rewards: random rewards with unknown means $\mu_1, \mu_2$
Exploit only: use the current best estimate (MLE/empirical mean) of unknown mean to pick arms
Initial few trials can mislead into playing red action forever
1.1, 1, 0.2,
1, 1, 1, 1, 1, 1, 1, β¦β¦
[Diagram: single state s with two self-loop actions, each with transition probability 1.0 and rewards +1 and +1.1; model parameters $P(s, a, s'), r(s, a)$]
Uncertainty in rewards, state transitions
Unknown reward distribution, transition probabilities
Exploration-exploitation:
Explore actions/states/policies, learn reward distributions and transition model
Exploit the (seemingly) best policy
[Diagram: two-state MDP with states s1, s2; edges labeled with (transition probability, reward): (1.0, $1), (0.6, $0), (0.4, $0), (0.9, $100), (0.1, $0), (1.0, $1)]
Upper confidence bound based algorithms [Jaksch, Ortner, Auer, 2010] [Bartlett, Tewari, 2012]
Worst-case regret bound $\tilde{O}(DS\sqrt{AT})$ for communicating MDP
Lower bound $\Omega(\sqrt{DSAT})$
Optimistic Posterior Sampling [A., Jia 2017]
Worst-case regret bound $\tilde{O}(D\sqrt{SAT})$ for communicating MDP of diameter $D$
Improvement by a factor of $\sqrt{S}$
Optimistic Value Iteration [Azar, Osband, Munos, 2017]
Worst-case regret bound $\tilde{O}(\sqrt{HSAT})$ in episodic setting
Posterior sampling, known prior setting [Osband, Russo, and Van Roy 2013; Osband and Van Roy, 2016, 2017]
Bayesian regret bound of $\tilde{O}(H\sqrt{SAT})$ in episodic setting, length-$H$ episodes
UCRL: Upper confidence bound based algorithm for RL
Posterior sampling based algorithm for RL
Main result
Proof techniques
Similar principles as UCB
This is a Model-based approach
Maintain an estimate of the model $P, r$
Occasionally solve the MDP $(S, A, \hat{P}, \hat{r}, s_1)$ to find a policy
Run this policy for some time to get samples, and update model estimate
Compare to the "model-free" approach, or direct learning approach like Q-learning
Directly update Q-values or value function or policy using samples.
Proceed in epochs, an epoch ends when the number of visits of some state-action pair doubles.
In the beginning of every epoch k
Use samples to compute an optimistic MDP $(S, A, \tilde{P}, \tilde{r}, s_1)$
MDP with value greater than true MDP (Upper bound!!)
Solve the optimistic MDP to find optimal policy $\tilde{\pi}_k$
Execute $\tilde{\pi}_k$ in epoch k
observe samples $(s_t, a_t, s_{t+1})$
Go to next epoch if the number of visits of some state-action pair doubles:
if $N_k(s, a) \geq 2\, N_{k-1}(s, a)$ for some $s, a$
In the beginning of every epoch k
For every $s, a$, compute the empirical model estimate: let $n_k(s, a)$ be the number of times $(s, a)$ was visited before this epoch,
and let $n_k(s, a, s')$ be the number of transitions to $s'$
Set $\hat{r}(s, a)$ as the average reward over these $n_k(s, a)$ steps
Set $\hat{P}(s, a, s') = \frac{n_k(s, a, s')}{n_k(s, a)}$
Compute an optimistic model estimate: use Chernoff bounds to define a confidence region around $\hat{P}, \hat{r}$
$|\tilde{P}(s, a, s') - \hat{P}(s, a, s')| \leq \sqrt{\frac{\log t}{n_k(s, a)}}$ with probability $1 - \frac{1}{t^2}$
True $P, r$ lies in this region
Find the best combination $\tilde{P}, \tilde{r}$ in this region: MDP $(S, A, \tilde{P}, \tilde{r}, s_1)$ with maximum value
Will have value more than the true MDP
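A hedged, minimal sketch in Python of the epoch-start model estimation and confidence widths described above; the optimistic planning step (extended value iteration) is only indicated in comments, and the names and constants are illustrative rather than the exact algorithm of [Jaksch, Ortner, Auer, 2010]:

```python
import numpy as np

def empirical_model_and_widths(history, S, A, t):
    """Epoch-start computation in a UCRL-style algorithm (simplified sketch):
    from `history`, a list of (s, a, r, s_next) tuples seen so far, compute the
    empirical model (P_hat, r_hat) and per-(s, a) confidence widths."""
    n = np.zeros((S, A))                  # n_k(s, a): visit counts
    n_trans = np.zeros((S, A, S))         # n_k(s, a, s'): transition counts
    r_sum = np.zeros((S, A))
    for (s, a, r, s_next) in history:
        n[s, a] += 1
        n_trans[s, a, s_next] += 1
        r_sum[s, a] += r
    n_safe = np.maximum(n, 1)             # avoid division by zero for unvisited pairs
    P_hat = n_trans / n_safe[:, :, None]  # P_hat(s, a, s') = n_k(s, a, s') / n_k(s, a)
    r_hat = r_sum / n_safe                # average observed reward per (s, a)
    width = np.sqrt(np.log(max(t, 2)) / n_safe)   # confidence radius ~ sqrt(log t / n_k)
    # Extended value iteration would then search for the value-maximizing
    # (P_tilde, r_tilde) within +/- width of (P_hat, r_hat); the resulting
    # optimistic policy is executed until some visit count doubles.
    return P_hat, r_hat, width

# usage: three transitions observed in a 2-state, 2-action MDP
hist = [(0, 1, 0.0, 1), (1, 0, 1.0, 1), (1, 0, 1.0, 0)]
P_hat, r_hat, width = empirical_model_and_widths(hist, S=2, A=2, t=3)
print(P_hat[1, 0], r_hat[1, 0], width[1, 0])
```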
Recall regret:
$\mathrm{Regret}(T, M) = T\rho^* - \sum_{t=1}^{T} r(s_t, a_t)$
Theorem: For any communicating MDP M with (unknown) diameter $D$, with high probability:
$\mathrm{Regret}(T, M) \leq \tilde{O}(DS\sqrt{AT})$
$\tilde{O}(\cdot)$ notation hides logarithmic factors in S, A, T beyond constants.
Our setting, regret definition
Posterior sampling algorithm for MDPs
Main result
Proof techniques
Maintain Bayesian posteriors for unknown parameters
With more trials, posteriors concentrate on the true parameters. Mode captures MLE: enables exploitation
Fewer trials means more uncertainty in estimates. Spread/variance captures uncertainty: enables exploration
A sample from the posterior is used as an estimate for unknown parameters to make decisions
Assume for simplicity: Known reward distribution
Need to learn the unknown transition probability vector $P_{s,a} = (P_{s,a}(1), \ldots, P_{s,a}(S))$ for all $s, a$
In any state $s_t = s$ with $a_t = a$, observe the new state $s_{t+1}$:
the outcome of a multivariate Bernoulli (multinoulli) trial with probability vector $P_{s,a}$
Given prior $Dirichlet(\alpha_1, \alpha_2, \ldots, \alpha_S)$ on $P_{s,a}$
After a multinoulli trial with outcome (new state) $i$, the Bayesian posterior on $P_{s,a}$ is
$Dirichlet(\alpha_1, \alpha_2, \ldots, \alpha_i + 1, \ldots, \alpha_S)$
After $n_{s,a} = \alpha_1 + \cdots + \alpha_S$ observations for a state-action pair $s, a$
Posterior mean vector is the empirical mean:
$\hat{P}_{s,a}(i) = \frac{\alpha_i}{\alpha_1 + \cdots + \alpha_S} = \frac{\alpha_i}{n_{s,a}}$
Variance bounded by $\frac{1}{n_{s,a}}$
With more trials of $s, a$, the posterior mean concentrates around the true mean
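As a hedged illustration, a tiny Python sketch of this Dirichlet update and posterior-mean computation for a single state-action pair (numpy's Dirichlet sampler is used; the names are illustrative):

```python
import numpy as np

S = 4
alpha = np.ones(S)            # Dirichlet(1, ..., 1) prior on P_{s,a} for one (s, a) pair

def update(alpha, next_state):
    """Posterior update after observing a transition to `next_state`."""
    alpha = alpha.copy()
    alpha[next_state] += 1    # Dirichlet(alpha_1, ..., alpha_i + 1, ..., alpha_S)
    return alpha

for s_next in [2, 2, 0, 3, 2]:         # some observed next states for this (s, a)
    alpha = update(alpha, s_next)

n_sa = alpha.sum()                     # total (pseudo-)observation count
posterior_mean = alpha / n_sa          # matches the empirical frequencies
sample = np.random.dirichlet(alpha)    # one posterior sample of P_{s,a}
print(posterior_mean, sample)
```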
Learning
Maintain a Dirichlet posterior for $P_{s,a}$ for every $s, a$. After round $t$, on observing outcome $s_{t+1}$, update the posterior for state $s_t$ and action $a_t$
To decide action
Sample a $\tilde{P}_{s,a}$ for every $s, a$
Compute the optimal policy $\tilde{\pi}$ for the sample MDP $(S, A, \tilde{P}, r, s_0)$
Choose $a_t = \tilde{\pi}(s_t)$
Exploration-exploitation
Exploitation: with more observations the Dirichlet posterior concentrates, $\tilde{P}_{s,a}$ approaches the empirical mean $\hat{P}_{s,a}$
Exploration: anti-concentration of the Dirichlet ensures exploration of states/actions/policies that are less explored
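A minimal posterior-sampling (PSRL-style) sketch in Python, under simplifying assumptions not stated on this slide: known rewards, episodic interaction with one posterior sample and one plan per episode, and a finite-horizon planner; it is meant only to show the sample-then-plan structure.

```python
import numpy as np

def plan_finite_horizon(P, R, H):
    """Backward value iteration for a sampled MDP; returns a greedy policy per step."""
    S, A = R.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                       # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def psrl(true_P, R, H, episodes, rng):
    """Posterior sampling for RL: Dirichlet(1, ..., 1) priors on each P_{s,a},
    resampled and re-planned once per episode (known rewards R assumed)."""
    S, A = R.shape
    alpha = np.ones((S, A, S))              # Dirichlet parameters for every (s, a)
    total_reward = 0.0
    for _ in range(episodes):
        # sample a transition model from the posterior and plan with it
        P_sample = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                             for s in range(S)])
        policy = plan_finite_horizon(P_sample, R, H)
        s = 0                                # episodes start in state 0
        for h in range(H):
            a = policy[h, s]
            s_next = rng.choice(S, p=true_P[s, a])
            total_reward += R[s, a]
            alpha[s, a, s_next] += 1         # Dirichlet posterior update
            s = s_next
    return total_reward

# usage on a tiny 2-state, 2-action MDP
rng = np.random.default_rng(0)
true_P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                   [[1.0, 0.0], [0.1, 0.9]]])
R = np.array([[0.0, 0.0], [1.0, 0.1]])       # state 1 is the rewarding state
print(psrl(true_P, R, H=10, episodes=200, rng=rng))
```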
Proceed in epochs; an epoch ends when the number of visits $N_{s,a}$ of any state-action pair doubles.
In every epoch
For every $s, a$, generate multiple ($\psi = \tilde{O}(S)$) independent samples from the Dirichlet posterior for $P_{s,a}$
Form an extended sample MDP $(S, \psi A, \tilde{P}, r, s_0)$
Find the optimal policy $\tilde{\pi}$ and use it through the epoch
Further, initial exploration:
For $s, a$ with very small $N_{s,a} < \sqrt{\frac{TS}{A}}$, use a simple optimistic sampling that provides
extra exploration
An algorithm based on posterior sampling with a high-probability, near-optimal worst-case regret upper bound
Recall regret:
$\mathrm{Regret}(T, M) = T\rho^* - \sum_{t=1}^{T} r(s_t, a_t)$
Theorem: For any communicating MDP M with (unknown) diameter $D$, and for $T \geq S^5 A$, with high probability:
$\mathrm{Regret}(T, M) \leq \tilde{O}(D\sqrt{SAT})$
Improvement of a $\sqrt{S}$ factor over the UCB based algorithm