
Online Learning for Stochastic Shortest Path Model via Posterior Sampling

Mehdi Jafarnia-Jahromi
University of Southern California
[email protected]

Liyu Chen
University of Southern California
[email protected]

Rahul Jain
University of Southern California
[email protected]

Haipeng Luo
University of Southern California
[email protected]

Abstract

We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch completes if either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any of the state-action pairs is doubled. We establish a Bayesian regret bound of $\tilde{O}(B_\star S\sqrt{AK})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm only requires knowledge of the prior distribution and has no hyper-parameters to tune. It is the first such posterior sampling algorithm and numerically outperforms previously proposed optimism-based algorithms.

1 Introduction

The Stochastic Shortest Path (SSP) model considers the problem of an agent interacting with an environment to reach a predefined goal state while minimizing the cumulative expected cost. Unlike the finite-horizon and discounted Markov Decision Process (MDP) settings, in the SSP model the horizon of interaction between the agent and the environment depends on the agent's actions and can possibly be unbounded (if the goal is not reached). A wide variety of goal-oriented control and reinforcement learning (RL) problems, such as navigation and game playing, can be formulated as SSP problems. In the RL setting, where the SSP model is unknown, the agent interacts with the environment in K episodes. Each episode begins at a predefined initial state and ends when the agent reaches the goal (note that it might never reach the goal). We consider the setting where the state and action spaces are finite, the cost function is known, but the transition kernel is unknown. The performance of the agent is measured through the notion of regret, i.e., the difference between the cumulative cost of the learning algorithm and that of the optimal policy during the K episodes.

The agent has to balance the well-known trade-off between exploration and exploitation: should the agent explore the environment to gain information for future decisions, or should it exploit the current information to minimize the cost? A general way to balance the exploration-exploitation trade-off is to use the Optimism in the Face of Uncertainty (OFU) principle [Lai and Robbins, 1985]. The idea is to construct a set of plausible models based on the available information, select the model associated with the minimum cost, and follow the optimal policy with respect to the selected model. This idea is widely used in the RL literature for MDPs (e.g., [Jaksch et al., 2010, Azar et al., 2017, Fruit et al., 2018, Jin et al., 2018, Wei et al., 2020, 2021]) and also for SSP models [Tarbouriech et al., 2020, Rosenberg et al., 2020, Rosenberg and Mansour, 2020, Chen and Luo, 2021, Tarbouriech et al., 2021b].

An alternative fundamental idea to encourage exploration is to use Posterior Sampling (PS) (also known as Thompson Sampling) [Thompson, 1933]. The idea is to maintain the posterior distribution on the unknown model parameters based on the available information and the prior distribution. PS algorithms usually proceed in epochs. In the beginning of an epoch, a model is sampled from the posterior. The actions during the epoch are then selected according to the optimal policy associated with the sampled model. PS algorithms have two main advantages over OFU-type algorithms. First, prior knowledge of the environment can be incorporated through the prior distribution. Second, PS algorithms have shown superior numerical performance on multi-armed bandit problems [Scott, 2010, Chapelle and Li, 2011] and MDPs [Osband et al., 2013, Osband and Van Roy, 2017, Ouyang et al., 2017b].

The main difficulty in designing PS algorithms is the design of the epochs. In the basic setting of bandit problems, one can simply sample at every time step [Chapelle and Li, 2011]. In finite-horizon MDPs, where the length of an episode is predetermined and fixed, the epochs and episodes coincide, i.e., the agent can sample from the posterior distribution at the beginning of each episode [Osband et al., 2013]. However, in the general SSP model, where the length of each episode is not predetermined and can possibly be unbounded, these natural choices for the epoch do not work. Indeed, the agent needs to switch policies during an episode if the current policy cannot reach the goal.

In this paper, we propose PSRL-SSP, the first PS-based RL algorithm for the SSP model. PSRL-SSP starts a new epoch based on two criteria. According to the first criterion, a new epoch starts if the number of episodes within the current epoch exceeds that of the previous epoch. The second criterion is triggered when the number of visits to any state-action pair is doubled during an epoch, similar to the one used by Bartlett and Tewari [2009], Jaksch et al. [2010], Filippi et al. [2010], Dann and Brunskill [2015], Ouyang et al. [2017b], Rosenberg et al. [2020]. Intuitively speaking, in the early stages of the interaction between the agent and the environment, the second criterion triggers more often. This criterion is responsible for switching policies during an episode if the current policy cannot reach the goal. In the later stages of the interaction, the first criterion triggers more often and encourages exploration. We prove a Bayesian regret bound of $O(B_\star S\sqrt{AK})$, where $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $B_\star$ is an upper bound on the expected cost of the optimal policy. This is similar to the regret bound of Rosenberg et al. [2020] and has a gap of $\sqrt{S}$ with the minimax lower bound. We note that the concurrent works of Tarbouriech et al. [2021b] and Cohen et al. [2021] have closed the gap via OFU algorithms and a blackbox reduction to the finite-horizon setting, respectively. However, the goal of this paper is not to match the minimax regret bound, but rather to introduce the first PS algorithm that has a near-optimal regret bound with superior numerical performance compared to OFU algorithms. This is verified with the experiments in Section 5. The $\sqrt{S}$ gap with the lower bound exists for the PS algorithms in the finite-horizon [Osband et al., 2013] and the infinite-horizon average-cost MDPs [Ouyang et al., 2017b] as well. Thus, it remains an open question whether it is possible to achieve the lower bound via PS algorithms in these settings.

Related Work. Posterior Sampling. The idea of PS algorithms dates back to the pioneering work of Thompson [1933]. The algorithm was ignored for several decades until recently. In the past two decades, PS algorithms have successfully been developed for various settings including multi-armed bandits (e.g., Scott [2010], Chapelle and Li [2011], Kaufmann et al. [2012], Agrawal and Goyal [2012, 2013]), MDPs (e.g., [Strens, 2000, Osband et al., 2013, Fonteneau et al., 2013, Gopalan and Mannor, 2015, Osband and Van Roy, 2017, Kim, 2017, Ouyang et al., 2017b, Banjevic and Kim, 2019]), partially observable MDPs [Jafarnia-Jahromi et al., 2021], and Linear Quadratic Control (e.g., [Abeille and Lazaric, 2017, Ouyang et al., 2017a]). The interested reader is referred to Russo et al. [2017] and the references therein for a more comprehensive literature review.

Online Learning in SSP. Another related line of work is online learning in the SSP model, which was introduced by Tarbouriech et al. [2020]. They proposed an algorithm with an $O(K^{2/3})$ regret bound. The subsequent work of Rosenberg et al. [2020] improved the regret bound to $O(B_\star S\sqrt{AK})$. The concurrent works of Cohen et al. [2021], Tarbouriech et al. [2021b] proved a minimax regret bound of $O(B_\star\sqrt{SAK})$. However, none of these works propose a PS-type algorithm. We refer the interested reader to Rosenberg and Mansour [2020], Chen et al. [2020], Chen and Luo [2021] for the SSP model with adversarial costs and Tarbouriech et al. [2021a] for sample complexity of the SSP model with a generative model.

Comparison with Ouyang et al. [2017b]. Our work is related to Ouyang et al. [2017b], which proposes TSDE, a PS algorithm for infinite-horizon average-cost MDPs. However, clear distinctions exist both in the algorithm and in the analysis. From the algorithmic perspective, our first criterion for determining the epoch length is different from TSDE. Note that using the same epochs as TSDE leads to a sub-optimal regret bound of $O(K^{2/3})$ in the SSP setting. Moreover, following a Hoeffding-type concentration argument as in TSDE also yields a regret bound of $O(K^{2/3})$ in the SSP setting. Instead, we propose a different analysis using Bernstein-type concentration, inspired by the work of Rosenberg et al. [2020], to achieve the $O(\sqrt{K})$ regret bound (see Lemma 5).

2 Preliminaries

A Stochastic Shortest Path (SSP) model is denoted by $M = (\mathcal{S}, \mathcal{A}, c, \theta, s_{\text{init}}, g)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $c : \mathcal{S}\times\mathcal{A}\to[0,1]$ is the cost function, $s_{\text{init}}\in\mathcal{S}$ is the initial state, $g\notin\mathcal{S}$ is the goal state, and $\theta : \mathcal{S}^+\times\mathcal{S}\times\mathcal{A}\to[0,1]$ represents the transition kernel such that $\theta(s'|s,a) = \mathbb{P}(s'_t = s' \mid s_t = s, a_t = a)$, where $\mathcal{S}^+ = \mathcal{S}\cup\{g\}$ includes the goal state as well. Here $s_t\in\mathcal{S}$ and $a_t\in\mathcal{A}$ are the state and action at time $t = 1, 2, 3, \cdots$, and $s'_t\in\mathcal{S}^+$ is the subsequent state. We assume that the initial state $s_{\text{init}}$ is a fixed and known state and that $\mathcal{S}$ and $\mathcal{A}$ are finite sets with sizes $S$ and $A$, respectively. A stationary policy is a deterministic map $\pi : \mathcal{S}\to\mathcal{A}$ that maps a state to an action. The value function (also called the cost-to-go function) associated with policy $\pi$ is a function $V^\pi(\cdot;\theta) : \mathcal{S}^+\to[0,\infty]$ given by $V^\pi(g;\theta) = 0$ and $V^\pi(s;\theta) := \mathbb{E}\big[\sum_{t=1}^{\tau^\pi(s)} c(s_t, \pi(s_t)) \mid s_1 = s\big]$ for $s\in\mathcal{S}$, where $\tau^\pi(s)$ is the number of steps before reaching the goal state (a random variable) if the initial state is $s$ and policy $\pi$ is followed throughout the episode. Here, we use the notation $V^\pi(\cdot;\theta)$ to explicitly show the dependence of the value function on $\theta$. Furthermore, the optimal value function is defined as $V(s;\theta) = \min_\pi V^\pi(s;\theta)$. Policy $\pi$ is called proper if the goal state is reached with probability 1, starting from any initial state and following $\pi$ (i.e., $\max_s\tau^\pi(s) < \infty$ almost surely); otherwise it is called improper.

We consider the reinforcement learning problem of an agent interacting with an SSP model $M = (\mathcal{S}, \mathcal{A}, c, \theta_*, s_{\text{init}}, g)$ whose transition kernel $\theta_*$ is randomly generated according to the prior distribution $\mu_1$ at the beginning and is then fixed. We focus on SSP models with transition kernels in the set $\Theta_{B_\star}$ with the following standard properties:

Assumption 1. For all $\theta\in\Theta_{B_\star}$, the following hold: (1) there exists a proper policy, (2) for all improper policies $\pi_\theta$, there exists a state $s\in\mathcal{S}$ such that $V^{\pi_\theta}(s;\theta) = \infty$, and (3) the optimal value function $V(\cdot;\theta)$ satisfies $\max_s V(s;\theta)\le B_\star$.

Bertsekas and Tsitsiklis [1991] prove that the first two conditions in Assumption 1 imply that for each $\theta\in\Theta_{B_\star}$, the optimal policy is stationary, deterministic, proper, and can be obtained as the minimizer of the Bellman optimality equation given by
$$V(s;\theta) = \min_a\Big\{c(s,a) + \sum_{s'\in\mathcal{S}^+}\theta(s'|s,a)V(s';\theta)\Big\}, \quad \forall s\in\mathcal{S}. \qquad (1)$$

Standard techniques such as Value Iteration and Policy Iteration can be used to compute the optimal policy if the SSP model is known [Bertsekas, 2017]. Here, we assume that $\mathcal{S}$, $\mathcal{A}$, and the cost function $c$ are known to the agent; however, the transition kernel $\theta_*$ is unknown. Moreover, we assume that the support of the prior distribution $\mu_1$ is a subset of $\Theta_{B_\star}$.
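As a concrete illustration of this planning step, here is a minimal value-iteration sketch for a known SSP model. It is our own illustrative code (not from the paper), assuming the transition kernel is stored as a NumPy array of shape (S, A, S+1) whose last index is the goal state and whose rows sum to one; the function name and interface are hypothetical.

```python
import numpy as np

def ssp_value_iteration(theta, c, tol=1e-8, max_iter=100_000):
    """Value iteration for a known SSP.
    theta: array (S, A, S+1), theta[s, a, s'] = P(s' | s, a); index S is the goal.
    c:     array (S, A) of costs in [0, 1]."""
    S, A, _ = theta.shape
    V = np.zeros(S + 1)                 # V[goal] = 0 and stays 0
    for _ in range(max_iter):
        # Bellman optimality backup (equation (1)): Q(s,a) = c(s,a) + sum_{s'} theta(s'|s,a) V(s')
        Q = c + theta @ V               # shape (S, A)
        V_new = np.concatenate([Q.min(axis=1), [0.0]])
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = np.argmin(c + theta @ V, axis=1)   # greedy (stationary, deterministic) policy
    return V, policy
```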

The agent interacts with the environment in $K$ episodes. Each episode starts from the initial state $s_{\text{init}}$ and ends at the goal state $g$ (note that the agent may never reach the goal). At each time $t$, the agent observes state $s_t$ and takes action $a_t$. The environment then yields the next state $s'_t\sim\theta_*(\cdot|s_t, a_t)$. If the goal is reached (i.e., $s'_t = g$), then the current episode completes, a new episode starts, and $s_{t+1} = s_{\text{init}}$. If the goal is not reached (i.e., $s'_t\neq g$), then $s_{t+1} = s'_t$. The goal of the agent is to minimize the expected cumulative cost after $K$ episodes, or equivalently, minimize the Bayesian regret defined as
$$R_K := \mathbb{E}\Big[\sum_{t=1}^{T_K} c(s_t, a_t) - K V(s_{\text{init}}; \theta_*)\Big],$$
where $T_K$ is the total number of time steps before reaching the goal state for the $K$th time, and $V(s_{\text{init}}; \theta_*)$ is the optimal value function from (1). Here, the expectation is with respect to the prior distribution $\mu_1$ for $\theta_*$, the horizon $T_K$, the randomness in the state transitions, and the randomness of the algorithm. If the agent does not reach the goal state in some episode (i.e., $T_K = \infty$), we define $R_K = \infty$.

3 A Posterior Sampling RL Algorithm for SSP Models

In this section, we propose the Posterior Sampling Reinforcement Learning (PSRL-SSP) algorithm (Algorithm 1) for the SSP model. The input of the algorithm is the prior distribution $\mu_1$. At time $t$, the agent maintains the posterior distribution $\mu_t$ on the unknown parameter $\theta_*$ given by $\mu_t(\Theta) = \mathbb{P}(\theta_*\in\Theta \mid \mathcal{F}_t)$ for any set $\Theta\subseteq\Theta_{B_\star}$. Here $\mathcal{F}_t$ is the information available at time $t$ (i.e., the sigma algebra generated by $s_1, a_1, \cdots, s_{t-1}, a_{t-1}, s_t$). Upon observing state $s'_t$ by taking action $a_t$ at state $s_t$, the posterior is updated according to
$$\mu_{t+1}(d\theta) = \frac{\theta(s'_t|s_t, a_t)\,\mu_t(d\theta)}{\int\theta'(s'_t|s_t, a_t)\,\mu_t(d\theta')}. \qquad (2)$$

The PSRL-SSP algorithm proceeds in epochs $\ell = 1, 2, 3, \cdots$. Let $t_\ell$ denote the start time of epoch $\ell$. In the beginning of epoch $\ell$, parameter $\theta_\ell$ is sampled from the posterior distribution $\mu_{t_\ell}$ and the actions within that epoch are chosen according to the optimal policy with respect to $\theta_\ell$. Each epoch ends if either of two stopping criteria is satisfied. The first criterion is triggered if the number of visits to the goal state during the current epoch (denoted by $K_\ell$) exceeds that of the previous epoch. This ensures that $K_\ell\le K_{\ell-1} + 1$ for all $\ell$. The second criterion is triggered if the number of visits to any of the state-action pairs is doubled compared to the beginning of the epoch. This guarantees that $n_t(s, a)\le 2 n_{t_\ell}(s, a)$ for all $(s, a)$, where $n_t(s, a) = \sum_{\tau=1}^{t-1}\mathbb{1}_{\{s_\tau = s, a_\tau = a\}}$ denotes the number of visits to state-action pair $(s, a)$ before time $t$.
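When each row $\theta(\cdot|s, a)$ is given an independent Dirichlet prior (as in the experiments of Section 5), the posterior update (2) reduces to incrementing a single count, and the visit counters $n_t(s, a)$ used by the stopping criteria can be maintained alongside. The following sketch is our own illustrative code under that Dirichlet assumption; the class and method names are hypothetical.

```python
import numpy as np

class DirichletPosterior:
    """Independent Dirichlet posterior over theta(.|s, a) for every (s, a).
    Index S_plus - 1 of the last axis is reserved for the goal state."""
    def __init__(self, S, A, S_plus, prior_alpha=0.1):
        self.alpha = np.full((S, A, S_plus), prior_alpha)   # Dirichlet parameters
        self.counts = np.zeros((S, A), dtype=int)           # visit counts n_t(s, a)

    def update(self, s, a, s_next):
        # Conjugate update for (2): observing s' after (s, a) adds 1 to alpha[s, a, s'].
        self.alpha[s, a, s_next] += 1.0
        self.counts[s, a] += 1

    def sample(self, rng):
        # Draw one transition kernel theta ~ posterior, row by row.
        theta = np.empty_like(self.alpha)
        for s in range(self.alpha.shape[0]):
            for a in range(self.alpha.shape[1]):
                theta[s, a] = rng.dirichlet(self.alpha[s, a])
        return theta
```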

The second stopping criterion is similar to that used by Jaksch et al. [2010] and Rosenberg et al. [2020], and is one of the two stopping criteria used in the posterior sampling algorithm (TSDE) for infinite-horizon average-cost MDPs [Ouyang et al., 2017b]. This stopping criterion is crucial since it allows the algorithm to switch policies if the generated policy is improper and cannot reach the goal. We note that updating the policy only at the beginning of an episode (as done in posterior sampling for finite-horizon MDPs [Osband et al., 2013]) does not work for SSP models, because if the policy generated at the beginning of the episode is improper, the goal is never reached and the regret is infinite.

The first stopping criterion is novel. A similar stopping criterion used in posterior sampling for infinite-horizon MDPs [Ouyang et al., 2017b] is based on the length of the epochs, i.e., a new epoch starts if the length of the current epoch exceeds the length of the previous epoch. This leads to a bound of $O(\sqrt{T_K})$ on the number of epochs, which translates to a final regret bound of $O(K^{2/3})$ in SSP models. However, our first stopping criterion allows us to bound the number of epochs by $O(\sqrt{K})$ rather than $O(\sqrt{T_K})$ (see Lemma 2). This is one of the key steps in avoiding dependency on $c_{\min}^{-1}$ (where $c_{\min}$ is a lower bound on the cost function) in the main term of the regret and achieving a final regret bound of $O(\sqrt{K})$.

Remark 1. The PSRL-SSP algorithm only requires knowledge of the prior distribution $\mu_1$. It does not require knowledge of $B_\star$ and $T_\star$ (an upper bound on the expected time the optimal policy takes to reach the goal), which is required by Cohen et al. [2021].

Main Results. We now provide our main results for the PSRL-SSP algorithm for unknown SSP models. Our first result considers the case where the cost function is strictly positive for all state-action pairs. Subsequently, we extend the result to the general case by adding a small positive perturbation to the cost function and running the algorithm with the perturbed costs. We first assume the following.


Algorithm 1: PSRL-SSP
Input: $\mu_1$
Initialization: $t\leftarrow 1$, $\ell\leftarrow 0$, $K_{-1}\leftarrow 0$, $t_0\leftarrow 0$, $k_{t_0}\leftarrow 0$
for episodes $k = 1, 2, \cdots, K$ do
    $s_t\leftarrow s_{\text{init}}$
    while $s_t\neq g$ do
        if $k - k_{t_\ell} > K_{\ell-1}$ or $n_t(s, a) > 2 n_{t_\ell}(s, a)$ for some $(s, a)\in\mathcal{S}\times\mathcal{A}$ then
            $K_\ell\leftarrow k - k_{t_\ell}$
            $\ell\leftarrow\ell + 1$
            $t_\ell\leftarrow t$
            $k_{t_\ell}\leftarrow k$
            Generate $\theta_\ell\sim\mu_{t_\ell}(\cdot)$ and compute $\pi_\ell(\cdot) = \pi^*(\cdot;\theta_\ell)$ according to (1)
        end
        Choose action $a_t = \pi_\ell(s_t)$ and observe $s'_t\sim\theta_*(\cdot|s_t, a_t)$
        Update $\mu_{t+1}$ according to (2)
        $s_{t+1}\leftarrow s'_t$
        $t\leftarrow t + 1$
    end
end
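The sketch below mirrors Algorithm 1 using the hypothetical `DirichletPosterior` and `ssp_value_iteration` helpers from the previous sketches. The environment interface (`env.reset`, `env.step`, `env.cost`) is an assumption made for illustration; this is not the authors' implementation.

```python
import numpy as np

def psrl_ssp(env, S, A, K, rng, prior_alpha=0.1):
    """A sketch of PSRL-SSP: resample from the posterior whenever either
    stopping criterion of Algorithm 1 fires."""
    post = DirichletPosterior(S, A, S + 1, prior_alpha)   # goal state has index S
    n_epoch_start = post.counts.copy()                    # n_{t_ell}(s, a)
    K_prev = 0                                            # K_{ell-1}: episodes in previous epoch
    k_epoch_start = 0                                     # episode index when current epoch started
    _, policy = ssp_value_iteration(post.sample(rng), env.cost)

    for k in range(1, K + 1):
        s = env.reset()                                   # episode starts at s_init
        done = False
        while not done:
            # Criterion 1: episodes in the current epoch exceed those of the previous epoch.
            # Criterion 2: some (s, a) visit count has doubled since the epoch started.
            if (k - k_epoch_start > K_prev
                    or np.any(post.counts > 2 * n_epoch_start)):
                K_prev = k - k_epoch_start
                k_epoch_start = k
                n_epoch_start = post.counts.copy()
                _, policy = ssp_value_iteration(post.sample(rng), env.cost)
            a = policy[s]
            s_next, done = env.step(a)                    # done is True iff the goal is reached
            post.update(s, a, S if done else s_next)      # posterior update (2)
            if not done:
                s = s_next
```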

Assumption 2. There exists $c_{\min} > 0$ such that $c(s, a)\ge c_{\min}$ for all state-action pairs $(s, a)$.

This assumption allows us to bound the total time spent in $K$ episodes by the total cost, i.e., $c_{\min} T_K\le C_K$, where $C_K := \sum_{t=1}^{T_K} c(s_t, a_t)$ is the total cost during the $K$ episodes. To facilitate the presentation of the results, we assume that $S\ge 2$, $A\ge 2$, and $K\ge S^2 A$. The first main result is as follows.

Theorem 1. Suppose Assumptions 1 and 2 hold. Then, the regret of the PSRL-SSP algorithm is upper bounded as
$$R_K = O\Big(B_\star S\sqrt{KA}\,L^2 + S^2 A\sqrt{\frac{B_\star^3}{c_{\min}}}\,L^2\Big),$$
where $L = \log(B_\star SAK c_{\min}^{-1})$.

Note that when $K \ge B_\star S^2 A c_{\min}^{-1}$, the regret bound scales as $O(B_\star S\sqrt{KA})$. A crucial point about the above result is that the dependency on $c_{\min}^{-1}$ appears only in the lower-order term. This allows us to extend the $O(\sqrt{K})$ bound to the general case where Assumption 2 does not hold by using the perturbation technique of Rosenberg et al. [2020] (see Theorem 2). Avoiding dependency on $c_{\min}^{-1}$ in the main term is achieved by using a Bernstein-type confidence set in the analysis, inspired by Rosenberg et al. [2020]. We note that using a Hoeffding-type confidence set in the analysis as in Ouyang et al. [2017b] gives a regret bound of $O(\sqrt{K/c_{\min}})$, which results in an $O(K^{2/3})$ regret bound if Assumption 2 is violated.

Theorem 2. Suppose Assumption 1 holds. Running the PSRL-SSP algorithm with costs $c_\varepsilon(s, a) := \max\{c(s, a), \varepsilon\}$ for $\varepsilon = (S^2A/K)^{2/3}$ yields

$$R_K = O\Big(B_\star S\sqrt{KA}\,L^2 + (S^2A)^{\frac23}K^{\frac13}\big(B_\star^{\frac32}L^2 + T_\star\big) + S^2 A T_\star^{\frac32}L^2\Big),$$
where $L := \log(KB_\star T_\star SA)$.

Note that when $K \ge S^2 A\big(B_\star^3 + T_\star(T_\star/B_\star)^6\big)$, the regret bound scales as $O(B_\star S\sqrt{KA})$. These results have similar regret bounds as the Bernstein-SSP algorithm [Rosenberg et al., 2020], and have a gap of $\sqrt{S}$ with the lower bound of $\Omega(B_\star\sqrt{SAK})$.

4 Theoretical Analysis

In this section, we prove Theorem 1. Proof of Theorem 2 can be found in the Appendix.


A key property of posterior sampling is that conditioned on the information at time $t$, $\theta_*$ and $\theta_t$ have the same distribution if $\theta_t$ is sampled from the posterior distribution at time $t$ [Osband et al., 2013, Russo and Van Roy, 2014]. Since the PSRL-SSP algorithm samples $\theta_\ell$ at the stopping time $t_\ell$, we use the stopping-time version of the posterior sampling property, stated as follows.

Lemma 1 (Adapted from Lemma 2 of Ouyang et al. [2017b]). Let $t_\ell$ be a stopping time with respect to the filtration $(\mathcal{F}_t)_{t=1}^\infty$, and let $\theta_\ell$ be the sample drawn from the posterior distribution at time $t_\ell$. Then, for any measurable function $f$ and any $\mathcal{F}_{t_\ell}$-measurable random variable $X$, we have
$$\mathbb{E}[f(\theta_\ell, X)\mid\mathcal{F}_{t_\ell}] = \mathbb{E}[f(\theta_*, X)\mid\mathcal{F}_{t_\ell}].$$

We now sketch the proof of Theorem 1. Let $0 < \delta < 1$ be a parameter to be chosen later. We distinguish between known and unknown state-action pairs. A state-action pair $(s, a)$ is known if the number of visits to $(s, a)$ is at least $\alpha\cdot\frac{B_\star S}{c_{\min}}\log\frac{B_\star SA}{\delta c_{\min}}$ for some large enough constant $\alpha$ (to be determined in Lemma A.6), and unknown otherwise. We divide each epoch into intervals. The first interval starts at time $t = 1$. Each interval ends if any of the following conditions holds: (i) the total cost during the interval is at least $B_\star$; (ii) an unknown state-action pair is met; (iii) the goal state is reached; or (iv) the current epoch completes. The idea of introducing intervals is that after all state-action pairs are known, the cost accumulated during an interval is at least $B_\star$ (ignoring conditions (iii) and (iv)), which allows us to bound the number of intervals by the total cost divided by $B_\star$. Note that introducing intervals and distinguishing between known and unknown state-action pairs is done only in the analysis, and thus knowledge of $B_\star$ is not required.

Instead of bounding $R_K$, we bound $R_M$ defined as
$$R_M := \mathbb{E}\Big[\sum_{t=1}^{T_M} c(s_t, a_t) - K V(s_{\text{init}}; \theta_*)\Big],$$
for any number of intervals $M$ as long as the $K$ episodes are not completed. Here, $T_M$ is the total time of the first $M$ intervals. Let $C_M$ denote the total cost of the algorithm after $M$ intervals and define $L_M$ as the number of epochs in the first $M$ intervals. Observe that the number of times conditions (i), (ii), (iii), and (iv) trigger to start a new interval are bounded by $C_M/B_\star$, $O\big(\frac{B_\star S^2 A}{c_{\min}}\log\frac{B_\star SA}{\delta c_{\min}}\big)$, $K$, and $L_M$, respectively. Therefore, the number of intervals can be bounded as
$$M \le \frac{C_M}{B_\star} + K + L_M + O\Big(\frac{B_\star S^2 A}{c_{\min}}\log\frac{B_\star SA}{\delta c_{\min}}\Big). \qquad (3)$$

Moreover, since the cost function is lower bounded by $c_{\min}$, we have $c_{\min} T_M\le C_M$. Our argument proceeds as follows.¹ We bound $R_M \lesssim B_\star S\sqrt{MA}$, which implies $\mathbb{E}[C_M]\lesssim K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + B_\star S\sqrt{MA}$. From the definition of intervals, once all the state-action pairs are known, the cost accumulated within each interval is at least $B_\star$ (ignoring intervals that end when the epoch or episode ends). This allows us to bound the number of intervals $M$ by $C_M/B_\star$ (or $\mathbb{E}[C_M]/B_\star$). Solving for $\mathbb{E}[C_M]$ in the quadratic inequality $\mathbb{E}[C_M]\lesssim K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + B_\star S\sqrt{MA}\lesssim K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + S\sqrt{\mathbb{E}[C_M]B_\star A}$ implies that $\mathbb{E}[C_M]\lesssim K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + B_\star S\sqrt{AK}$. Since this bound holds for any number $M$ of intervals as long as the $K$ episodes have not elapsed, it holds for $\mathbb{E}[C_K]$ as well. Moreover, since $c_{\min} > 0$, this implies that the $K$ episodes eventually terminate, which proves the final regret bound.
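For completeness, the elementary step of solving this quadratic inequality (stated in Appendix A.5 as: $x\le a\sqrt{x} + b$ implies $x\le(a+\sqrt{b})^2$ for $a, b > 0$) can be spelled out as follows; the intermediate steps are standard algebra added here for clarity.
$$x \le a\sqrt{x} + b \;\Longrightarrow\; \sqrt{x} \le \frac{a + \sqrt{a^2 + 4b}}{2} \le a + \sqrt{b} \;\Longrightarrow\; x \le (a + \sqrt{b})^2 \le 2a^2 + 2b.$$
Instantiated with $a\asymp S\sqrt{B_\star A}$ and $b\asymp K\mathbb{E}[V(s_{\text{init}};\theta_*)]$ (ignoring logarithmic and lower-order terms), this gives $\mathbb{E}[C_M]\lesssim K\mathbb{E}[V(s_{\text{init}};\theta_*)] + B_\star S^2 A + S\sqrt{B_\star A\,K\mathbb{E}[V(s_{\text{init}};\theta_*)]}\lesssim K\mathbb{E}[V(s_{\text{init}};\theta_*)] + B_\star S\sqrt{AK}$, using $V(s_{\text{init}};\theta_*)\le B_\star$.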

Bounding the Number of Epochs. Before proceeding with bounding $R_M$, we first prove that the number of epochs is bounded as $O(\sqrt{SAK\log T_M})$. Recall that the length of the epochs is determined by two stopping criteria. If we ignore the second criterion for a moment, the first stopping criterion ensures that the number of episodes within each epoch grows at a linear rate, which implies that the number of epochs is bounded by $O(\sqrt{K})$. If we ignore the first stopping criterion for a moment, the second stopping criterion triggers at most $O(SA\log T_M)$ times. The following lemma shows that the number of epochs remains of the same order even if these two criteria are considered simultaneously.

Lemma 2. The number of epochs is bounded as $L_M\le\sqrt{2SAK\log T_M} + SA\log T_M$.

¹Lower-order terms are neglected.


We now provide the proof sketch for bounding $R_M$. With abuse of notation, define $t_{L_M+1} := T_M + 1$. We can write
$$R_M := \mathbb{E}\Big[\sum_{t=1}^{T_M} c(s_t, a_t) - K V(s_{\text{init}}; \theta_*)\Big] = \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1} c(s_t, a_t)\Big] - K\mathbb{E}[V(s_{\text{init}}; \theta_*)]. \qquad (4)$$

Note that within epoch $\ell$, action $a_t$ is taken according to the optimal policy with respect to $\theta_\ell$. Thus, with the Bellman equation we can write
$$c(s_t, a_t) = V(s_t; \theta_\ell) - \sum_{s'}\theta_\ell(s'|s_t, a_t)V(s'; \theta_\ell).$$

Substituting this and adding and subtracting $V(s_{t+1}; \theta_\ell)$ and $V(s'_t; \theta_\ell)$ decomposes $R_M$ as
$$R_M = R^1_M + R^2_M + R^3_M,$$
where
$$R^1_M := \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1}\big[V(s_t; \theta_\ell) - V(s_{t+1}; \theta_\ell)\big]\Big],$$
$$R^2_M := \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1}\big[V(s_{t+1}; \theta_\ell) - V(s'_t; \theta_\ell)\big]\Big] - K\mathbb{E}[V(s_{\text{init}}; \theta_*)],$$
$$R^3_M := \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1}\Big[V(s'_t; \theta_\ell) - \sum_{s'}\theta_\ell(s'|s_t, a_t)V(s'; \theta_\ell)\Big]\Big].$$

We proceed by bounding these terms separately. Proofs of these lemmas can be found in the supplementary material. $R^1_M$ is a telescopic sum and can be bounded by the following lemma.

Lemma 3. The first term $R^1_M$ is bounded as $R^1_M\le B_\star\mathbb{E}[L_M]$.

To bound $R^2_M$, recall that $s'_t\in\mathcal{S}^+$ is the next state of the environment after applying action $a_t$ at state $s_t$, and that $s'_t = s_{t+1}$ for all time steps except the last time step of an episode (right before reaching the goal). In the last time step of an episode, $s'_t = g$ while $s_{t+1} = s_{\text{init}}$. This proves that the inner sum of $R^2_M$ can be written as $V(s_{\text{init}}; \theta_\ell)K_\ell$, where $K_\ell$ is the number of visits to the goal state during epoch $\ell$. Using $K_\ell\le K_{\ell-1} + 1$ and the property of posterior sampling completes the proof. This is formally stated in the following lemma.

Lemma 4. The second term $R^2_M$ is bounded as $R^2_M\le B_\star\mathbb{E}[L_M]$.

The rest of the proof proceeds to bound the third term $R^3_M$, which contributes the dominant term of the final regret bound. The detailed proof can be found in Lemma 5; here we provide the proof sketch. $R^3_M$ captures the difference between $V(\cdot; \theta_\ell)$ at the next state $s'_t\sim\theta_*(\cdot|s_t, a_t)$ and its expectation with respect to the sampled $\theta_\ell$. Applying the Hoeffding-type concentration bounds [Weissman et al., 2003], as used by Ouyang et al. [2017b], yields a regret bound of $O(K^{2/3})$, which is sub-optimal. To achieve the optimal dependency on $K$, we use a technique based on the Bernstein concentration bound, inspired by the work of Rosenberg et al. [2020]. This requires a more careful analysis. Let $n_{t_\ell}(s, a, s')$ be the number of visits to state-action pair $(s, a)$ followed by state $s'$ before time $t_\ell$. For a fixed state-action pair $(s, a)$, define the Bernstein confidence set using the empirical transition probability $\hat{\theta}_\ell(s'|s, a) := \frac{n_{t_\ell}(s, a, s')}{n_{t_\ell}(s, a)}$ as

$$\mathcal{B}_\ell(s, a) := \Big\{\theta(\cdot|s, a) : |\theta(s'|s, a) - \hat{\theta}_\ell(s'|s, a)| \le 4\sqrt{\hat{\theta}_\ell(s'|s, a)A_\ell(s, a)} + 28 A_\ell(s, a),\ \forall s'\in\mathcal{S}^+\Big\}. \qquad (5)$$
Here $A_\ell(s, a) := \frac{\log(SA\,n^+_\ell(s, a)/\delta)}{n^+_\ell(s, a)}$ and $n^+_\ell(s, a) := \max\{n_{t_\ell}(s, a), 1\}$. This confidence set is similar to the one used by Rosenberg et al. [2020] and contains the true transition probability $\theta_*(\cdot|s, a)$ with high probability (see Lemma A.2). Note that $\mathcal{B}_\ell(s, a)$ is $\mathcal{F}_{t_\ell}$-measurable, which allows us to use the property of posterior sampling (Lemma 1) to conclude that $\mathcal{B}_\ell(s, a)$ contains the sampled transition probability $\theta_\ell(\cdot|s, a)$ as well with high probability. With some algebraic manipulation, $R^3_M$ can be written as (with abuse of notation, $\ell := \ell(t)$ is the epoch at time $t$)
$$R^3_M = \mathbb{E}\Big[\sum_{t=1}^{T_M}\sum_{s'\in\mathcal{S}^+}\big[\theta_*(s'|s_t, a_t) - \theta_\ell(s'|s_t, a_t)\big]\Big(V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big)\Big].$$

Under the event that both $\theta_*(\cdot|s_t, a_t)$ and $\theta_\ell(\cdot|s_t, a_t)$ belong to the confidence set $\mathcal{B}_\ell(s_t, a_t)$, the Bernstein bound can be applied to obtain
$$R^3_M \approx O\Big(\mathbb{E}\Big[\sum_{t=1}^{T_M}\sqrt{SA_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\Big]\Big) = O\Big(\sum_{m=1}^{M}\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}\sqrt{SA_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\Big]\Big),$$

where $t_m$ denotes the start time of interval $m$ and $\mathbb{V}_\ell$ is the empirical variance defined as
$$\mathbb{V}_\ell(s_t, a_t) := \sum_{s'\in\mathcal{S}^+}\theta_*(s'|s_t, a_t)\Big(V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big)^2. \qquad (6)$$

Applying Cauchy-Schwarz on the inner sum twice implies that
$$R^3_M \approx O\Bigg(\sum_{m=1}^{M}\sqrt{S\,\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}A_\ell(s_t, a_t)\Big]}\cdot\sqrt{\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\Big]}\Bigg).$$
Using the fact that all the state-action pairs $(s_t, a_t)$ within an interval except possibly the first one are known, and that the cumulative cost within an interval is at most $2B_\star$, one can bound $\mathbb{E}\big[\sum_{t=t_m}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\big] = O(B_\star^2)$ (see Lemma A.5 for details). Applying Cauchy-Schwarz again implies

$$R^3_M \approx O\Bigg(B_\star\sqrt{MS\,\mathbb{E}\Big[\sum_{t=1}^{T_M}A_\ell(s_t, a_t)\Big]}\Bigg) \approx O\big(B_\star S\sqrt{MA}\big).$$

This argument is formally presented in the following lemma.

Lemma 5. The third term $R^3_M$ can be bounded as
$$R^3_M \le 288 B_\star S\sqrt{MA\log^2\frac{SA\,\mathbb{E}[T_M]}{\delta}} + 1632 B_\star S^2 A\log^2\frac{SA\,\mathbb{E}[T_M]}{\delta} + 4SB_\star\delta\,\mathbb{E}[L_M].$$

Detailed proofs of all lemmas and the theorem can be found in the appendix in the supplementary material.

5 Experiments

In this section, the performance of our PSRL-SSP algorithm is compared with existing OFU-type algorithms in the literature. Two environments are considered: RandomMDP and GridWorld. RandomMDP [Ouyang et al., 2017b, Wei et al., 2020] is an SSP with 8 states and 2 actions whose transition kernel and cost function are generated uniformly at random. GridWorld [Tarbouriech et al., 2020] is a $3\times 4$ grid (a total of 12 states including the goal state) with 4 actions (LEFT, RIGHT, UP, DOWN) and $c(s, a) = 1$ for every state-action pair $(s, a)\in\mathcal{S}\times\mathcal{A}$. The agent starts from the initial state located at the top left corner of the grid and ends in the goal state at the bottom right corner. At each time step, the agent attempts to move in one of the four directions. However, the attempt is successful only with probability 0.85. With probability 0.15, the agent moves in one of the undesired directions chosen uniformly at random. If the agent tries to move out of the boundary, the attempt is not successful and it remains in the same position.
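As an illustration of the GridWorld dynamics described above, the following sketch (our own illustrative code, not the authors') implements the 0.85/0.15 slip model on the 3 x 4 grid; the names and interface are hypothetical but match the helpers sketched in Section 3.

```python
import numpy as np

class GridWorldSSP:
    """3 x 4 grid; start at the top-left corner, goal at the bottom-right; unit costs.
    The intended move succeeds w.p. 0.85; otherwise another direction is taken."""
    MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # LEFT, RIGHT, UP, DOWN

    def __init__(self, rng, n_rows=3, n_cols=4):
        self.rng, self.n_rows, self.n_cols = rng, n_rows, n_cols
        self.goal = (n_rows - 1, n_cols - 1)
        self.cost = np.ones((n_rows * n_cols - 1, 4))       # c(s, a) = 1 for non-goal states

    def reset(self):
        self.pos = (0, 0)                                   # s_init: top-left corner
        return self._index(self.pos)

    def _index(self, pos):
        return pos[0] * self.n_cols + pos[1]                # goal gets the largest index

    def step(self, a):
        if self.rng.random() >= 0.85:                       # slip to an undesired direction
            a = int(self.rng.choice([b for b in range(4) if b != a]))
        dr, dc = self.MOVES[a]
        r = min(max(self.pos[0] + dr, 0), self.n_rows - 1)  # moving off the grid keeps the position
        c = min(max(self.pos[1] + dc, 0), self.n_cols - 1)
        self.pos = (r, c)
        return self._index(self.pos), self.pos == self.goal
```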

In the experiments, we evaluate the frequentist regret of PSRL-SSP for a fixed environment (i.e., the environment is not sampled from a prior distribution). A Dirichlet prior with parameters $[0.1, \cdots, 0.1]$ is used for the transition kernel. The Dirichlet distribution is a common prior in Bayesian statistics since it is a conjugate prior for categorical and multinomial distributions.
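A minimal sketch of the regret accounting behind Figure 1, under the assumption that the learner's per-episode costs are recorded and that $V(s_{\text{init}}; \theta_*)$ is computed on the true (fixed) model, e.g., with the `ssp_value_iteration` sketch above; the helper name is hypothetical.

```python
import numpy as np

def cumulative_regret(episode_costs, V_star_init):
    """episode_costs[k] is the total cost incurred in episode k (0-indexed);
    V_star_init is V(s_init; theta_*) for the true transition kernel.
    Returns the cumulative regret after each episode."""
    episode_costs = np.asarray(episode_costs, dtype=float)
    K = np.arange(1, len(episode_costs) + 1)
    return np.cumsum(episode_costs) - K * V_star_init
```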


[Figure: two panels, cumulative regret vs. episode; panel titles RandomMDP and GridWorld; curves for ULCVI, Bernstein-SSP, UC-SSP, EB-SSP, and PSRL-SSP.]

Figure 1: Cumulative regret of existing SSP algorithms on RandomMDP (left) and GridWorld (right) for 10,000 episodes. The results are averaged over 10 runs, and the 95% confidence interval is shown with the shaded area. Our proposed PSRL-SSP algorithm outperforms all the existing algorithms considerably. The performance gap is even more significant in the more challenging GridWorld environment (right).

We compare the performance of our proposed PSRL-SSP against existing online learning algorithms for the SSP problem (UC-SSP [Tarbouriech et al., 2020], Bernstein-SSP [Rosenberg et al., 2020], ULCVI [Cohen et al., 2021], and EB-SSP [Tarbouriech et al., 2021b]). The algorithms are evaluated over $K = 10{,}000$ episodes and the results are averaged over 10 runs. The 95% confidence interval is used to compare the performance of the algorithms. All the experiments are performed on a 2015 MacBook Pro with a 2.7 GHz Dual-Core Intel Core i5 processor and 16GB of RAM.

Figure 1 shows that PSRL-SSP significantly outperforms all the previously proposed algorithms for the SSP problem. In particular, it outperforms the recently proposed ULCVI [Cohen et al., 2021] and EB-SSP [Tarbouriech et al., 2021b], which match the theoretical lower bound. Our numerical evaluation reveals that the ULCVI algorithm does not show any evidence of learning even after 80,000 episodes (not shown here). The poor empirical performance of these algorithms underscores the necessity of considering PS algorithms in practice.

The gap between the performance of PSRL-SSP and the OFU algorithms is even more apparent in the GridWorld environment, which is more challenging than RandomMDP. Note that in RandomMDP, it is possible to reach the goal state from any state in a single step, since the transition kernel is generated uniformly at random. In the GridWorld environment, however, the agent has to take a sequence of actions to the right and down to reach the goal at the bottom right corner. Figure 1 (right) verifies that PSRL-SSP is able to learn this pattern significantly faster than the OFU algorithms.

Since these plots are generated for a fixed environment (not one generated from a prior), we conjecture that PSRL-SSP enjoys the same regret bound in the non-Bayesian (frequentist) setting.

6 Conclusions

In this paper, we have proposed the first posterior sampling-based reinforcement learning algorithm for SSP models with unknown transition probabilities. The algorithm is very simple compared to the optimism-based algorithms proposed for SSP models recently [Tarbouriech et al., 2020, Rosenberg et al., 2020, Cohen et al., 2021, Tarbouriech et al., 2021b]. It achieves a Bayesian regret bound of $O(B_\star S\sqrt{AK})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. This has a $\sqrt{S}$ gap from the best known bound for an optimism-based algorithm, but numerical experiments suggest better performance in practice. A next step would be to extend the algorithm to continuous state and action spaces, and to propose model-free algorithms for such settings. Designing posterior sampling-based model-free algorithms even for average-cost MDPs remains an open problem.


References

Marc Abeille and Alessandro Lazaric. Thompson sampling for linear-quadratic control problems. In Artificial Intelligence and Statistics, pages 1246–1254. PMLR, 2017.

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135. PMLR, 2013.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 263–272. JMLR.org, 2017.

Dragan Banjevic and Michael Jong Kim. Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control, 64(10):4137–4152, 2019.

Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.

Dimitri P Bertsekas. Dynamic Programming and Optimal Control, Vol. I and II, 4th edition. Belmont, MA: Athena Scientific, 2017.

Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.

Liyu Chen and Haipeng Luo. Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case. arXiv preprint arXiv:2102.05284, 2021.

Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. arXiv preprint arXiv:2012.04053, 2020.

Alon Cohen, Yonathan Efroni, Yishay Mansour, and Aviv Rosenberg. Minimax regret for stochastic shortest path. arXiv preprint arXiv:2103.13056, 2021.

Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122. IEEE, 2010.

Raphaël Fonteneau, Nathan Korda, and Rémi Munos. An optimistic posterior sampling strategy for Bayesian reinforcement learning. In NIPS 2013 Workshop on Bayesian Optimization (BayesOpt2013), 2013.

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1573–1581, 2018.

Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861–898. PMLR, 2015.

Mehdi Jafarnia-Jahromi, Rahul Jain, and Ashutosh Nayyar. Online learning for unknown partially observable MDPs. arXiv preprint arXiv:2102.12661, 2021.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.

Michael Jong Kim. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 62(12):6415–6422, 2017.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pages 2701–2710. PMLR, 2017.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047, 2017a.

Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017b.

Aviv Rosenberg and Yishay Mansour. Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561, 2020.

Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pages 8210–8219. PMLR, 2020.

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.

Steven L Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.

Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.

Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pages 9428–9437. PMLR, 2020.

Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Sample complexity bounds for stochastic shortest path with a generative model. In Algorithmic Learning Theory, pages 1157–1178. PMLR, 2021a.

Jean Tarbouriech, Runlong Zhou, Simon S Du, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret. arXiv preprint arXiv:2104.11186, 2021b.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020.

Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, and Rahul Jain. Learning infinite-horizon average-reward MDPs with linear function approximation. International Conference on Artificial Intelligence and Statistics, 2021.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.


A Proofs

A.1 Proof of Lemma 2

Lemma (restatement of Lemma 2). The number of epochs is bounded as $L_M\le\sqrt{2SAK\log T_M} + SA\log T_M$.

Proof. Define macro epoch $i$ with start time $t_{u_i}$ given by $t_{u_1} = t_1$ and
$$t_{u_{i+1}} = \min\big\{t_\ell > t_{u_i} : n_{t_\ell}(s, a) > 2 n_{t_{\ell-1}}(s, a) \text{ for some } (s, a)\big\}, \quad i = 1, 2, 3, \cdots.$$
A macro epoch starts when the second criterion for determining the epoch length triggers. Let $N_M$ be a random variable denoting the total number of macro epochs by the end of interval $M$ and define $u_{N_M+1} := L_M + 1$.

Recall that $K_\ell$ is the number of visits to the goal state in epoch $\ell$. Let $K_i := \sum_{\ell=u_i}^{u_{i+1}-1} K_\ell$ be the number of visits to the goal state in macro epoch $i$. By the definition of macro epochs, all the epochs within a macro epoch except the last one are triggered by the first criterion, i.e., $K_\ell = K_{\ell-1} + 1$ for $\ell = u_i, \cdots, u_{i+1} - 2$. Thus,
$$K_i = \sum_{\ell=u_i}^{u_{i+1}-1} K_\ell = K_{u_{i+1}-1} + \sum_{j=1}^{u_{i+1}-u_i-1}(K_{u_i-1} + j) \ge \sum_{j=1}^{u_{i+1}-u_i-1} j = \frac{(u_{i+1}-u_i-1)(u_{i+1}-u_i)}{2}.$$

Solving for $u_{i+1} - u_i$ implies that $u_{i+1} - u_i\le 1 + \sqrt{2K_i}$. We can write
$$L_M = u_{N_M+1} - 1 = \sum_{i=1}^{N_M}(u_{i+1} - u_i) \le \sum_{i=1}^{N_M}\Big(1 + \sqrt{2K_i}\Big) = N_M + \sum_{i=1}^{N_M}\sqrt{2K_i} \le N_M + \sqrt{2N_M\sum_{i=1}^{N_M}K_i} = N_M + \sqrt{2N_M K},$$

where the second inequality follows from Cauchy-Schwarz. It suffices to show that the number of macro epochs is bounded as $N_M\le 1 + SA\log T_M$. Let $\mathcal{T}_{s,a}$ be the set of all time steps at which the second criterion is triggered for state-action pair $(s, a)$, i.e.,
$$\mathcal{T}_{s,a} := \big\{t_\ell\le T_M : n_{t_\ell}(s, a) > 2 n_{t_{\ell-1}}(s, a)\big\}.$$

We claim that $|\mathcal{T}_{s,a}|\le\log n_{T_M+1}(s, a)$. To see this, assume by contradiction that $|\mathcal{T}_{s,a}|\ge 1 + \log n_{T_M+1}(s, a)$; then
$$n_{t_{L_M}}(s, a) = \prod_{t_\ell\le T_M,\,n_{t_{\ell-1}}(s,a)\ge 1}\frac{n_{t_\ell}(s, a)}{n_{t_{\ell-1}}(s, a)} \ge \prod_{t_\ell\in\mathcal{T}_{s,a},\,n_{t_{\ell-1}}(s,a)\ge 1}\frac{n_{t_\ell}(s, a)}{n_{t_{\ell-1}}(s, a)} > 2^{|\mathcal{T}_{s,a}|-1} \ge n_{T_M+1}(s, a),$$

which is a contradiction. Thus, $|\mathcal{T}_{s,a}|\le\log n_{T_M+1}(s, a)$ for all $(s, a)$. In the above argument, the first inequality is by the fact that $n_t(s, a)$ is non-decreasing in $t$, and the second inequality is by the definition of $\mathcal{T}_{s,a}$. Now, we can write
$$N_M = 1 + \sum_{s,a}|\mathcal{T}_{s,a}| \le 1 + \sum_{s,a}\log n_{T_M+1}(s, a) \le 1 + SA\log\frac{\sum_{s,a}n_{T_M+1}(s, a)}{SA} = 1 + SA\log\frac{T_M}{SA} \le SA\log T_M,$$
where the second inequality follows from Jensen's inequality.

A.2 Proof of Lemma 3

Lemma (restatement of Lemma 3). The first term $R^1_M$ is bounded as $R^1_M\le B_\star\mathbb{E}[L_M]$.


Proof. Recall
$$R^1_M = \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1}\big[V(s_t; \theta_\ell) - V(s_{t+1}; \theta_\ell)\big]\Big].$$
Observe that the inner sum is a telescopic sum; thus
$$R^1_M = \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\big[V(s_{t_\ell}; \theta_\ell) - V(s_{t_{\ell+1}}; \theta_\ell)\big]\Big] \le B_\star\mathbb{E}[L_M],$$
where the inequality is by Assumption 1.

A.3 Proof of Lemma 4

Lemma (restatement of Lemma 4). The second term $R^2_M$ is bounded as $R^2_M\le B_\star\mathbb{E}[L_M]$.

Proof. Recall that $K_\ell$ is the number of times the goal state is reached during epoch $\ell$. By definition, the only time steps at which $s'_t\neq s_{t+1}$ are those right before reaching the goal. Thus, with $V(g; \theta_\ell) = 0$, we can write
$$\begin{aligned}
R^2_M &= \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1}\big[V(s_{t+1}; \theta_\ell) - V(s'_t; \theta_\ell)\big]\Big] - K\mathbb{E}[V(s_{\text{init}}; \theta_*)]\\
&= \mathbb{E}\Big[\sum_{\ell=1}^{L_M}V(s_{\text{init}}; \theta_\ell)K_\ell\Big] - K\mathbb{E}[V(s_{\text{init}}; \theta_*)]\\
&= \sum_{\ell=1}^{\infty}\mathbb{E}\big[\mathbb{1}_{\{m(t_\ell)\le M\}}V(s_{\text{init}}; \theta_\ell)K_\ell\big] - K\mathbb{E}[V(s_{\text{init}}; \theta_*)],
\end{aligned}$$

where the last step is by the Monotone Convergence Theorem. Here $m(t_\ell)$ is the interval at time $t_\ell$. Note that from the first stopping criterion of the algorithm we have $K_\ell\le K_{\ell-1} + 1$ for all $\ell$. Thus, each term in the summation can be bounded as
$$\mathbb{E}\big[\mathbb{1}_{\{m(t_\ell)\le M\}}V(s_{\text{init}}; \theta_\ell)K_\ell\big] \le \mathbb{E}\big[\mathbb{1}_{\{m(t_\ell)\le M\}}V(s_{\text{init}}; \theta_\ell)(K_{\ell-1} + 1)\big].$$

$\mathbb{1}_{\{m(t_\ell)\le M\}}(K_{\ell-1} + 1)$ is $\mathcal{F}_{t_\ell}$-measurable. Therefore, applying the property of posterior sampling (Lemma 1) implies
$$\mathbb{E}\big[\mathbb{1}_{\{m(t_\ell)\le M\}}V(s_{\text{init}}; \theta_\ell)(K_{\ell-1} + 1)\big] = \mathbb{E}\big[\mathbb{1}_{\{m(t_\ell)\le M\}}V(s_{\text{init}}; \theta_*)(K_{\ell-1} + 1)\big].$$
Substituting this into $R^2_M$, we obtain

$$\begin{aligned}
R^2_M &\le \sum_{\ell=1}^{\infty}\mathbb{E}\big[\mathbb{1}_{\{m(t_\ell)\le M\}}V(s_{\text{init}}; \theta_*)(K_{\ell-1} + 1)\big] - K\mathbb{E}[V(s_{\text{init}}; \theta_*)]\\
&= \mathbb{E}\Big[\sum_{\ell=1}^{L_M}V(s_{\text{init}}; \theta_*)(K_{\ell-1} + 1)\Big] - K\mathbb{E}[V(s_{\text{init}}; \theta_*)]\\
&= \mathbb{E}\Big[V(s_{\text{init}}; \theta_*)\Big(\sum_{\ell=1}^{L_M}K_{\ell-1} - K\Big)\Big] + \mathbb{E}[V(s_{\text{init}}; \theta_*)L_M] \le B_\star\mathbb{E}[L_M].
\end{aligned}$$
In the last inequality we have used the facts that $0\le V(s_{\text{init}}; \theta_*)\le B_\star$ and $\sum_{\ell=1}^{L_M}K_{\ell-1}\le K$.

A.4 Proof of Lemma 5

Lemma (restatement of Lemma 5). The third term $R^3_M$ can be bounded as
$$R^3_M \le 288 B_\star S\sqrt{MA\log^2\frac{SA\,\mathbb{E}[T_M]}{\delta}} + 1632 B_\star S^2 A\log^2\frac{SA\,\mathbb{E}[T_M]}{\delta} + 4SB_\star\delta\,\mathbb{E}[L_M].$$


Proof. With abuse of notation, let $\ell := \ell(t)$ denote the epoch at time $t$ and $m(t)$ the interval at time $t$. We can write
$$\begin{aligned}
R^3_M &= \mathbb{E}\Big[\sum_{t=1}^{T_M}\Big[V(s'_t; \theta_\ell) - \sum_{s'}\theta_\ell(s'|s_t, a_t)V(s'; \theta_\ell)\Big]\Big]\\
&= \mathbb{E}\Big[\sum_{t=1}^{\infty}\mathbb{1}_{\{m(t)\le M\}}\Big[V(s'_t; \theta_\ell) - \sum_{s'}\theta_\ell(s'|s_t, a_t)V(s'; \theta_\ell)\Big]\Big]\\
&= \sum_{t=1}^{\infty}\mathbb{E}\Big[\mathbb{1}_{\{m(t)\le M\}}\mathbb{E}\Big[V(s'_t; \theta_\ell) - \sum_{s'}\theta_\ell(s'|s_t, a_t)V(s'; \theta_\ell)\,\Big|\,\mathcal{F}_t, \theta_*, \theta_\ell\Big]\Big].
\end{aligned}$$
The last equality follows from the Dominated Convergence Theorem, the tower property of conditional expectation, and the fact that $\mathbb{1}_{\{m(t)\le M\}}$ is measurable with respect to $\mathcal{F}_t$. Note that conditioned on $\mathcal{F}_t$, $\theta_*$ and $\theta_\ell$, the only random variable in the inner expectation is $s'_t$. Thus, $\mathbb{E}[V(s'_t; \theta_\ell)|\mathcal{F}_t, \theta_*, \theta_\ell] = \sum_{s'}\theta_*(s'|s_t, a_t)V(s'; \theta_\ell)$. Using the Dominated Convergence Theorem again implies that

[TM∑t=1

∑s′∈S+

[θ∗(s′|st, at)− θ`(s′|st, at)]V (s′; θ`)

]

= E

[TM∑t=1

∑s′∈S+

[θ∗(s′|st, at)− θ`(s′|st, at)]

(V (s′; θ`)−

∑s′′∈S+

θ∗(s′′|st, at)V (s′′; θ`)

)], (7)

where the last equality is due to the fact that θ∗(·|st, at) and θ`(·|st, at) are probability distributionsand that

∑s′′∈S+ θ∗(s

′′|st, at)V (s′′; θ`) is independent of s′.

Recall the Bernstein confidence set $\mathcal{B}_\ell(s, a)$ defined in (5) and let $\Omega^\ell_{s,a}$ be the event that both $\theta_*(\cdot|s, a)$ and $\theta_\ell(\cdot|s, a)$ are in $\mathcal{B}_\ell(s, a)$. If $\Omega^\ell_{s,a}$ holds, then the difference between $\theta_*(\cdot|s, a)$ and $\theta_\ell(\cdot|s, a)$ can be bounded by the following lemma.

Lemma A.1. Denote $A_\ell(s, a) = \frac{\log(SA\,n^+_\ell(s, a)/\delta)}{n^+_\ell(s, a)}$. If $\Omega^\ell_{s,a}$ holds, then
$$|\theta_*(s'|s, a) - \theta_\ell(s'|s, a)| \le 8\sqrt{\theta_*(s'|s, a)A_\ell(s, a)} + 136 A_\ell(s, a).$$

Proof. Since $\Omega^\ell_{s,a}$ holds, by (5) we have that
$$\hat{\theta}_\ell(s'|s, a) - \theta_*(s'|s, a) \le 4\sqrt{\hat{\theta}_\ell(s'|s, a)A_\ell(s, a)} + 28 A_\ell(s, a).$$
Using the primary inequality that $x^2\le ax + b$ implies $x\le a + \sqrt{b}$ with $x = \sqrt{\hat{\theta}_\ell(s'|s, a)}$, $a = 4\sqrt{A_\ell(s, a)}$, and $b = \theta_*(s'|s, a) + 28A_\ell(s, a)$, we obtain
$$\sqrt{\hat{\theta}_\ell(s'|s, a)} \le 4\sqrt{A_\ell(s, a)} + \sqrt{\theta_*(s'|s, a) + 28A_\ell(s, a)} \le \sqrt{\theta_*(s'|s, a)} + 10\sqrt{A_\ell(s, a)},$$
where the last inequality is by sub-linearity of the square root. Substituting this bound into (5) yields
$$|\theta_*(s'|s, a) - \hat{\theta}_\ell(s'|s, a)| \le 4\sqrt{\theta_*(s'|s, a)A_\ell(s, a)} + 68 A_\ell(s, a).$$
Similarly,
$$|\theta_\ell(s'|s, a) - \hat{\theta}_\ell(s'|s, a)| \le 4\sqrt{\theta_*(s'|s, a)A_\ell(s, a)} + 68 A_\ell(s, a).$$
Using the triangle inequality completes the proof.

Note that if either $\theta_*(\cdot|s_t, a_t)$ or $\theta_\ell(\cdot|s_t, a_t)$ is not in $\mathcal{B}_\ell(s_t, a_t)$, then the inner term of (7) can be bounded by $2SB_\star$ (note that $|\mathcal{S}^+|\le 2S$ and $V(\cdot; \theta_\ell)\le B_\star$). Thus, applying Lemma A.1 implies

that
$$\begin{aligned}
&\sum_{s'\in\mathcal{S}^+}\big[\theta_*(s'|s_t, a_t) - \theta_\ell(s'|s_t, a_t)\big]\Big(V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big)\\
&\le 8\sum_{s'\in\mathcal{S}^+}\sqrt{A_\ell(s_t, a_t)\,\theta_*(s'|s_t, a_t)\Big(V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big)^2}\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\\
&\quad + 136\sum_{s'\in\mathcal{S}^+}A_\ell(s_t, a_t)\Big|V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big|\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\\
&\quad + 2SB_\star\big(\mathbb{1}_{\{\theta_*(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}}\big)\\
&\le 16\sqrt{SA_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}} + 272 SB_\star A_\ell(s_t, a_t)\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}} + 2SB_\star\big(\mathbb{1}_{\{\theta_*(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}}\big),
\end{aligned}$$
where $A_\ell(s, a) = \frac{\log(SA\,n^+_\ell(s, a)/\delta)}{n^+_\ell(s, a)}$ and $\mathbb{V}_\ell(s, a)$ is defined in (6). Here the last inequality follows from Cauchy-Schwarz, $|\mathcal{S}^+|\le 2S$, $V(\cdot; \theta_\ell)\le B_\star$, and the definition of $\mathbb{V}_\ell$. Substituting this into (7) yields

$$R^3_M \le 16\sqrt{S}\,\mathbb{E}\Big[\sum_{t=1}^{T_M}\sqrt{A_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big] \qquad (8)$$
$$\quad + 272 SB_\star\,\mathbb{E}\Big[\sum_{t=1}^{T_M}A_\ell(s_t, a_t)\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big] \qquad (9)$$
$$\quad + 2SB_\star\,\mathbb{E}\Big[\sum_{t=1}^{T_M}\big(\mathbb{1}_{\{\theta_*(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}}\big)\Big]. \qquad (10)$$

The inner sum in (9) is bounded by $6SA\log^2(SAT_M/\delta)$ (see Lemma A.4). To bound (10), we first show that $\mathcal{B}_\ell(s, a)$ contains the true transition probability $\theta_*(\cdot|s, a)$ with high probability:

Lemma A.2. For any epoch $\ell$ and any state-action pair $(s, a)\in\mathcal{S}\times\mathcal{A}$, $\theta_*(\cdot|s, a)\in\mathcal{B}_\ell(s, a)$ with probability at least $1 - \frac{\delta}{2SA\,n^+_\ell(s, a)}$.

Proof. Fix $(s, a, s')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}^+$ and $0 < \delta' < 1$ (to be chosen later). Let $(Z_i)_{i=1}^\infty$ be a sequence of random variables drawn from the probability distribution $\theta_*(\cdot|s, a)$. Apply Lemma A.3 below with $X_i = \mathbb{1}_{\{Z_i = s'\}}$ and $\delta_t = \frac{\delta'}{4St^2}$ to a prefix of length $t$ of the sequence $(X_i)_{i=1}^\infty$, and apply a union bound over all $t$ and $s'$ to obtain
$$\big|\hat{\theta}_\ell(s'|s, a) - \theta_*(s'|s, a)\big| \le 2\sqrt{\frac{\hat{\theta}_\ell(s'|s, a)\log\frac{8S(n^+_\ell(s,a))^2}{\delta'}}{n^+_\ell(s, a)}} + \frac{7\log\frac{8S(n^+_\ell(s, a))^2}{\delta'}}{n^+_\ell(s, a)}$$
with probability at least $1 - \delta'/2$ for all $s'\in\mathcal{S}^+$ and $\ell\ge 1$, simultaneously. Choosing $\delta' = \delta/(SA\,n^+_\ell(s, a))$ and using $S\ge 2$, $A\ge 2$ completes the proof.

Lemma A.3 (Theorem D.3 (Anytime Bernstein) of Rosenberg et al. [2020]). Let $(X_n)_{n=1}^\infty$ be a sequence of independent and identically distributed random variables with expectation $\mu$. Suppose that $0\le X_n\le B$ almost surely. Then with probability at least $1 - \delta$, the following holds for all $n\ge 1$ simultaneously:
$$\Big|\sum_{i=1}^{n}(X_i - \mu)\Big| \le 2\sqrt{B\sum_{i=1}^{n}X_i\log\frac{2n}{\delta}} + 7B\log\frac{2n}{\delta}.$$


Now, by rewriting the sum in (10) over epochs, we have
$$\begin{aligned}
&\mathbb{E}\Big[\sum_{t=1}^{T_M}\big(\mathbb{1}_{\{\theta_*(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}}\big)\Big]\\
&= \mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1}\big(\mathbb{1}_{\{\theta_*(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s_t,a_t)\notin\mathcal{B}_\ell(s_t,a_t)\}}\big)\Big]\\
&= \sum_{s,a}\mathbb{E}\Big[\sum_{\ell=1}^{L_M}\sum_{t=t_\ell}^{t_{\ell+1}-1}\mathbb{1}_{\{s_t=s, a_t=a\}}\big(\mathbb{1}_{\{\theta_*(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}}\big)\Big]\\
&= \sum_{s,a}\mathbb{E}\Big[\sum_{\ell=1}^{L_M}\big(n_{t_{\ell+1}}(s, a) - n_{t_\ell}(s, a)\big)\big(\mathbb{1}_{\{\theta_*(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}}\big)\Big].
\end{aligned}$$

Note that $n_{t_{\ell+1}}(s, a) - n_{t_\ell}(s, a)\le n_{t_\ell}(s, a) + 1$ by the second stopping criterion. Moreover, observe that $\mathcal{B}_\ell(s, a)$ is $\mathcal{F}_{t_\ell}$-measurable. Thus, it follows from the property of posterior sampling (Lemma 1) that $\mathbb{E}[\mathbb{1}_{\{\theta_\ell(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}}|\mathcal{F}_{t_\ell}] = \mathbb{E}[\mathbb{1}_{\{\theta_*(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}}|\mathcal{F}_{t_\ell}] = \mathbb{P}(\theta_*(\cdot|s, a)\notin\mathcal{B}_\ell(s, a)|\mathcal{F}_{t_\ell}) \le \delta/(2SA\,n^+_\ell(s, a))$, where the inequality is by Lemma A.2. Using the Monotone Convergence Theorem and the fact that $\mathbb{1}_{\{m(t_\ell)\le M\}}$ is $\mathcal{F}_{t_\ell}$-measurable, we can write
$$\begin{aligned}
&\sum_{s,a}\mathbb{E}\Big[\sum_{\ell=1}^{L_M}\big(n_{t_{\ell+1}}(s, a) - n_{t_\ell}(s, a)\big)\big(\mathbb{1}_{\{\theta_*(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}}\big)\Big]\\
&\le \sum_{s,a}\sum_{\ell=1}^{\infty}\mathbb{E}\Big[\mathbb{1}_{\{m(t_\ell)\le M\}}\big(n_{t_\ell}(s, a) + 1\big)\mathbb{E}\big[\mathbb{1}_{\{\theta_*(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}} + \mathbb{1}_{\{\theta_\ell(\cdot|s,a)\notin\mathcal{B}_\ell(s,a)\}}\,\big|\,\mathcal{F}_{t_\ell}\big]\Big]\\
&\le \sum_{s,a}\sum_{\ell=1}^{\infty}\mathbb{E}\Big[\mathbb{1}_{\{m(t_\ell)\le M\}}\big(n_{t_\ell}(s, a) + 1\big)\frac{\delta}{SA\,n^+_\ell(s, a)}\Big] \le 2\delta\,\mathbb{E}[L_M],
\end{aligned}$$
where the last inequality is by $n_{t_\ell}(s, a) + 1\le 2n^+_\ell(s, a)$ and the Monotone Convergence Theorem.

We proceed by bounding (8). Denote by $t_m$ the start time of interval $m$, define $t_{M+1} := T_M + 1$, and rewrite the sum in (8) over intervals to get
$$\mathbb{E}\Big[\sum_{t=1}^{T_M}\sqrt{A_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big] = \sum_{m=1}^{M}\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}\sqrt{A_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big].$$

Applying Cauchy-Schwarz twice on the inner expectation implies
$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}\sqrt{A_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big]
&\le \mathbb{E}\Bigg[\sqrt{\sum_{t=t_m}^{t_{m+1}-1}A_\ell(s_t, a_t)}\cdot\sqrt{\sum_{t=t_m}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}}\Bigg]\\
&\le \sqrt{\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}A_\ell(s_t, a_t)\Big]}\cdot\sqrt{\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big]}\\
&\le 7B_\star\sqrt{\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}A_\ell(s_t, a_t)\Big]},
\end{aligned}$$
where the last inequality is by Lemma A.5. Summing over the $M$ intervals and applying Cauchy-Schwarz, we get

$$\begin{aligned}
\sum_{m=1}^{M}\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}\sqrt{A_\ell(s_t, a_t)\mathbb{V}_\ell(s_t, a_t)}\,\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big]
&\le 7B_\star\sum_{m=1}^{M}\sqrt{\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}A_\ell(s_t, a_t)\Big]}\\
&\le 7B_\star\sqrt{M\sum_{m=1}^{M}\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}A_\ell(s_t, a_t)\Big]}
= 7B_\star\sqrt{M\,\mathbb{E}\Big[\sum_{t=1}^{T_M}A_\ell(s_t, a_t)\Big]}\\
&\le 18B_\star\sqrt{MSA\,\mathbb{E}\Big[\log^2\frac{SAT_M}{\delta}\Big]},
\end{aligned}$$
where the last inequality follows from Lemma A.4. Substituting these bounds into (8), (9), and (10), using concavity of $\log^2 x$ for $x\ge 3$, and applying Jensen's inequality completes the proof.

Lemma A.4. $\sum_{t=1}^{T_M}A_\ell(s_t, a_t)\le 6SA\log^2(SAT_M/\delta)$.

Proof. Recall $A_\ell(s, a) = \frac{\log(SA\,n^+_\ell(s, a)/\delta)}{n^+_\ell(s, a)}$. Denote by $L := \log(SAT_M/\delta)$ an upper bound on the numerator of $A_\ell(s_t, a_t)$. We have
$$\begin{aligned}
\sum_{t=1}^{T_M}A_\ell(s_t, a_t) &\le \sum_{t=1}^{T_M}\frac{L}{n^+_\ell(s_t, a_t)} = L\sum_{s,a}\sum_{t=1}^{T_M}\frac{\mathbb{1}_{\{s_t=s, a_t=a\}}}{n^+_\ell(s, a)} \le 2L\sum_{s,a}\sum_{t=1}^{T_M}\frac{\mathbb{1}_{\{s_t=s, a_t=a\}}}{n^+_t(s, a)}\\
&= 2L\sum_{s,a}\mathbb{1}_{\{n_{T_M+1}(s,a)>0\}} + 2L\sum_{s,a}\sum_{j=1}^{n_{T_M+1}(s,a)-1}\frac{1}{j}\\
&\le 2LSA + 2L\sum_{s,a}\big(1 + \log n_{T_M+1}(s, a)\big) \le 4LSA + 2LSA\log T_M \le 6LSA\log T_M.
\end{aligned}$$
Here the second inequality is by $n^+_\ell(s, a)\ge 0.5\,n^+_t(s, a)$ (the second criterion in determining the epoch length), the third inequality is by $\sum_{x=1}^{n}1/x\le 1 + \log n$, and the fourth inequality is by $n_{T_M+1}(s, a)\le T_M$. The proof is complete by noting that $\log T_M\le L$.

Lemma A.5. For any interval $m$, $\mathbb{E}\big[\sum_{t=t_m}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\big]\le 44B_\star^2$.

Proof. To proceed with the proof, we need the following two technical lemmas.

Lemma A.6. Let $(s, a)$ be a known state-action pair and $m$ be an interval. If $\Omega^\ell_{s,a}$ holds, then for any state $s'\in\mathcal{S}^+$,
$$|\theta_*(s'|s, a) - \theta_\ell(s'|s, a)| \le \frac{1}{8}\sqrt{\frac{c_{\min}\,\theta_*(s'|s, a)}{SB_\star}} + \frac{c_{\min}}{4SB_\star}.$$

Proof. From Lemma A.1, we know that if $\Omega^\ell_{s,a}$ holds, then
$$|\theta_*(s'|s, a) - \theta_\ell(s'|s, a)| \le 8\sqrt{\theta_*(s'|s, a)A_\ell(s, a)} + 136 A_\ell(s, a),$$
with $A_\ell(s, a) = \frac{\log(SA\,n^+_\ell(s, a)/\delta)}{n^+_\ell(s, a)}$. The proof is complete by noting that $\log(x)/x$ is decreasing and that $n^+_\ell(s, a)\ge\alpha\cdot\frac{B_\star S}{c_{\min}}\log\frac{B_\star SA}{\delta c_{\min}}$ for some large enough constant $\alpha$, since $(s, a)$ is known.


Lemma A.7 (Lemma B.15 of Rosenberg et al. [2020]). Let $(X_t)_{t=1}^\infty$ be a martingale difference sequence adapted to the filtration $(\mathcal{F}_t)_{t=0}^\infty$. Let $Y_n = (\sum_{t=1}^{n}X_t)^2 - \sum_{t=1}^{n}\mathbb{E}[X_t^2|\mathcal{F}_{t-1}]$. Then $(Y_n)_{n=0}^\infty$ is a martingale, and in particular, if $\tau$ is a stopping time such that $\tau\le c$ almost surely, then $\mathbb{E}[Y_\tau] = 0$.

By the definition of the intervals, all the state-action pairs within an interval except possibly the first one are known. Therefore, we bound
$$\mathbb{E}\Big[\sum_{t=t_m}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\,\Big|\,\mathcal{F}_{t_m}\Big] = \mathbb{E}\big[\mathbb{V}_\ell(s_{t_m}, a_{t_m})\mathbb{1}_{\Omega^\ell_{s_{t_m},a_{t_m}}}\,\big|\,\mathcal{F}_{t_m}\big] + \mathbb{E}\Big[\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\,\Big|\,\mathcal{F}_{t_m}\Big].$$

The first summand is upper bounded by $B_\star^2$. To bound the second term, define $Z^\ell_t := \big[V(s'_t; \theta_\ell) - \sum_{s'\in\mathcal{S}}\theta_*(s'|s_t, a_t)V(s'; \theta_\ell)\big]\mathbb{1}_{\Omega^\ell_{s_t,a_t}}$. Conditioned on $\mathcal{F}_{t_m}$, $\theta_*$ and $\theta_\ell$, $(Z^\ell_t)_{t\ge t_m}$ constitutes a martingale difference sequence with respect to the filtration $(\mathcal{F}^m_{t+1})_{t\ge t_m}$, where $\mathcal{F}^m_t$ is the sigma algebra generated by $(s_{t_m}, a_{t_m}), \cdots, (s_t, a_t)$. Moreover, $t_{m+1} - 1$ is a stopping time with respect to $(\mathcal{F}^m_{t+1})_{t\ge t_m}$ and is bounded by $t_m + 2B_\star/c_{\min}$. Therefore, Lemma A.7 implies that
$$\mathbb{E}\Big[\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\,\Big|\,\mathcal{F}_{t_m}, \theta_*, \theta_\ell\Big] = \mathbb{E}\Big[\Big(\sum_{t=t_m+1}^{t_{m+1}-1}Z^\ell_t\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big)^2\,\Big|\,\mathcal{F}_{t_m}, \theta_*, \theta_\ell\Big]. \qquad (11)$$

We proceed by bounding $\big|\sum_{t=t_m+1}^{t_{m+1}-1}Z^\ell_t\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\big|$ in terms of $\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}$ and combining with the left-hand side to complete the proof. We have
$$\begin{aligned}
\Big|\sum_{t=t_m+1}^{t_{m+1}-1}Z^\ell_t\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big|
&= \Big|\sum_{t=t_m+1}^{t_{m+1}-1}\Big[V(s'_t; \theta_\ell) - \sum_{s'\in\mathcal{S}}\theta_*(s'|s_t, a_t)V(s'; \theta_\ell)\Big]\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big|\\
&\le \Big|\sum_{t=t_m+1}^{t_{m+1}-1}\big[V(s'_t; \theta_\ell) - V(s_t; \theta_\ell)\big]\Big| \qquad (12)\\
&\quad + \Big|\sum_{t=t_m+1}^{t_{m+1}-1}\Big[V(s_t; \theta_\ell) - \sum_{s'\in\mathcal{S}}\theta_\ell(s'|s_t, a_t)V(s'; \theta_\ell)\Big]\Big| \qquad (13)\\
&\quad + \Big|\sum_{t=t_m+1}^{t_{m+1}-1}\sum_{s'\in\mathcal{S}^+}\big[\theta_\ell(s'|s_t, a_t) - \theta_*(s'|s_t, a_t)\big]\Big(V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\Big|, \qquad (14)
\end{aligned}$$

where (14) is by the fact that $\theta_\ell(\cdot|s_t, a_t)$, $\theta_*(\cdot|s_t, a_t)$ are probability distributions, $\sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)$ is independent of $s'$, and $V(g; \theta_\ell) = 0$. (12) is a telescopic sum (recall that $s_{t+1} = s'_t$ if $s'_t\neq g$) and is bounded by $B_\star$. It follows from the Bellman equation that (13) is equal to $\sum_{t=t_m+1}^{t_{m+1}-1}c(s_t, a_t)$. By definition, the interval ends as soon as the cost accumulates to $B_\star$ during the interval. Moreover, since $V(\cdot; \theta_\ell)\le B_\star$, the algorithm does not choose an action with instantaneous cost more than $B_\star$. This implies that $\sum_{t=t_m+1}^{t_{m+1}-1}c(s_t, a_t)\le 2B_\star$. To bound (14), we use the Bernstein confidence set, but taking into account that all the state-action pairs in the summation are known, we can use Lemma A.6 to obtain
$$\begin{aligned}
&\sum_{s'\in\mathcal{S}^+}\big(\theta_\ell(s'|s_t, a_t) - \theta_*(s'|s_t, a_t)\big)\Big(V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\\
&\le \sum_{s'\in\mathcal{S}^+}\frac{1}{8}\sqrt{\frac{c_{\min}\,\theta_*(s'|s_t, a_t)\Big(V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big)^2\mathbb{1}_{\Omega^\ell_{s_t,a_t}}}{SB_\star}}\\
&\quad + \sum_{s'\in\mathcal{S}^+}\frac{c_{\min}}{4SB_\star}\Big|V(s'; \theta_\ell) - \sum_{s''\in\mathcal{S}^+}\theta_*(s''|s_t, a_t)V(s''; \theta_\ell)\Big|\\
&\le \frac{1}{4}\sqrt{\frac{c_{\min}\,\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}}{B_\star}} + \frac{c(s_t, a_t)}{2}.
\end{aligned}$$


The last inequality follows from the Cauchy-Schwarz inequality, $|\mathcal{S}^+|\le 2S$, $|V(\cdot; \theta_\ell)|\le B_\star$, and $c_{\min}\le c(s_t, a_t)$. Summing over the time steps in interval $m$ and applying Cauchy-Schwarz, we get
$$\begin{aligned}
\sum_{t=t_m+1}^{t_{m+1}-1}\Bigg[\frac{1}{4}\sqrt{\frac{c_{\min}\,\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}}{B_\star}} + \frac{c(s_t, a_t)}{2}\Bigg]
&\le \frac{1}{4}\sqrt{(t_{m+1} - t_m)\,\frac{c_{\min}\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}}{B_\star}} + \frac{\sum_{t=t_m+1}^{t_{m+1}-1}c(s_t, a_t)}{2}\\
&\le \frac{1}{4}\sqrt{2\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}} + B_\star.
\end{aligned}$$

The last inequality follows from the fact that the duration of interval $m$ is at most $2B_\star/c_{\min}$ and its cumulative cost is at most $2B_\star$. Substituting these bounds into (11) implies that
$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\,\Big|\,\mathcal{F}_{t_m}, \theta_*, \theta_\ell\Big]
&\le \mathbb{E}\Bigg[\Bigg(4B_\star + \frac{1}{4}\sqrt{2\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}}\Bigg)^2\,\Bigg|\,\mathcal{F}_{t_m}, \theta_*, \theta_\ell\Bigg]\\
&\le 32B_\star^2 + \frac{1}{4}\mathbb{E}\Big[\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\,\Big|\,\mathcal{F}_{t_m}, \theta_*, \theta_\ell\Big],
\end{aligned}$$
where the last inequality is by $(a + b)^2\le 2(a^2 + b^2)$ with $b = \frac{1}{4}\sqrt{2\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}}$ and $a = 4B_\star$. Rearranging implies that $\mathbb{E}\big[\sum_{t=t_m+1}^{t_{m+1}-1}\mathbb{V}_\ell(s_t, a_t)\mathbb{1}_{\Omega^\ell_{s_t,a_t}}\,\big|\,\mathcal{F}_{t_m}, \theta_*, \theta_\ell\big]\le 43B_\star^2$, and the proof is complete.

A.5 Proof of Theorem 1

Theorem (restatement of Theorem 1). Suppose Assumptions 1 and 2 hold. Then, the regret of the PSRL-SSP algorithm is upper bounded as
$$R_K = O\Big(B_\star S\sqrt{KA}\,L^2 + S^2 A\sqrt{\frac{B_\star^3}{c_{\min}}}\,L^2\Big),$$
where $L = \log(B_\star SAK c_{\min}^{-1})$.

Proof. Denote by $C_M$ the total cost after $M$ intervals. Recall that
$$\mathbb{E}[C_M] = K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + R_M = K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + R^1_M + R^2_M + R^3_M.$$
Using Lemmas 3, 4, and 5 with $\delta = 1/K$ obtains
$$\mathbb{E}[C_M] \le K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + O\Big(B_\star\mathbb{E}[L_M] + B_\star S\sqrt{MA\log^2(SAK\,\mathbb{E}[T_M])} + B_\star S^2 A\log^2(SAK\,\mathbb{E}[T_M])\Big). \qquad (15)$$

Recall that $L_M\le\sqrt{2SAK\log T_M} + SA\log T_M$. Taking expectation of both sides and using Jensen's inequality gives $\mathbb{E}[L_M]\le\sqrt{2SAK\log\mathbb{E}[T_M]} + SA\log\mathbb{E}[T_M]$. Moreover, taking expectation of both sides of (3), plugging in the bound on $\mathbb{E}[L_M]$, and using concavity of $\log(x)$ implies
$$M \le \frac{\mathbb{E}[C_M]}{B_\star} + K + \sqrt{2SAK\log\mathbb{E}[T_M]} + SA\log\mathbb{E}[T_M] + O\Big(\frac{B_\star S^2 A}{c_{\min}}\log\frac{B_\star KSA}{c_{\min}}\Big).$$


Substituting this bound into (15), using subadditivity of the square root, and simplifying yields
$$\begin{aligned}
\mathbb{E}[C_M] \le K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + O\Big(&B_\star S\sqrt{KA\log^2(SAK\,\mathbb{E}[T_M])} + S\sqrt{B_\star\mathbb{E}[C_M]A\log^2(SAK\,\mathbb{E}[T_M])}\\
&+ B_\star S^{\frac54}A^{\frac34}K^{\frac14}\log^{\frac54}(SAK\,\mathbb{E}[T_M]) + S^2 A\sqrt{\frac{B_\star^3}{c_{\min}}\log^3\frac{B_\star SAK\,\mathbb{E}[T_M]}{c_{\min}}}\Big).
\end{aligned}$$

Solving for $\mathbb{E}[C_M]$ (by using the primary inequality that $x\le a\sqrt{x} + b$ implies $x\le(a+\sqrt{b})^2$ for $a, b > 0$), using $K\ge S^2A$ and $V(s_{\text{init}}; \theta_*)\le B_\star$, and simplifying the result gives
$$\begin{aligned}
\mathbb{E}[C_M] &\le \Bigg(O\Big(S\sqrt{B_\star A\log^2(SAK\,\mathbb{E}[T_M])}\Big)\\
&\qquad + \sqrt{K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + O\Big(B_\star S\sqrt{KA\log^{2.5}(SAK\,\mathbb{E}[T_M])} + S^2 A\sqrt{\tfrac{B_\star^3}{c_{\min}}\log^3\tfrac{B_\star SAK\,\mathbb{E}[T_M]}{c_{\min}}}\Big)}\Bigg)^2\\
&\le O\Big(B_\star S^2 A\log^2\frac{SA\,\mathbb{E}[T_M]}{\delta}\Big) + K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + O\Big(B_\star S\sqrt{KA\log^{2.5}(SAK\,\mathbb{E}[T_M])} + S^2 A\sqrt{\tfrac{B_\star^3}{c_{\min}}\log^3\tfrac{B_\star SAK\,\mathbb{E}[T_M]}{c_{\min}}}\\
&\qquad + B_\star S\sqrt{KA\log^4(SAK\,\mathbb{E}[T_M])} + S^2 A\Big(\tfrac{B_\star^5}{c_{\min}}\log^7\tfrac{B_\star SAK\,\mathbb{E}[T_M]}{c_{\min}}\Big)^{\frac14}\Big)\\
&\le K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + O\Big(B_\star S\sqrt{KA\log^4(SAK\,\mathbb{E}[T_M])} + S^2 A\sqrt{\tfrac{B_\star^3}{c_{\min}}\log^4\tfrac{B_\star SAK\,\mathbb{E}[T_M]}{c_{\min}}}\Big). \qquad (16)
\end{aligned}$$

Note that by simplifying this bound, we can write $\mathbb{E}[C_M]\le O\big(\sqrt{B_\star^3 S^4 A^2 K^2\,\mathbb{E}[T_M]/c_{\min}}\big)$. On the other hand, we have that $c_{\min}T_M\le C_M$, which implies $\mathbb{E}[T_M]\le\mathbb{E}[C_M]/c_{\min}$. Isolating $\mathbb{E}[T_M]$ implies $\mathbb{E}[T_M]\le O\big(B_\star^3 S^4 A^2 K^2/c_{\min}^3\big)$. Substituting this bound into (16) yields

$$\mathbb{E}[C_M] \le K\mathbb{E}[V(s_{\text{init}}; \theta_*)] + O\Big(B_\star S\sqrt{KA\log^4\frac{B_\star SAK}{c_{\min}}} + S^2 A\sqrt{\frac{B_\star^3}{c_{\min}}\log^4\frac{B_\star SAK}{c_{\min}}}\Big).$$
We note that this bound holds for any number $M$ of intervals as long as the $K$ episodes have not elapsed. Since $c_{\min} > 0$, this implies that the $K$ episodes eventually terminate, and the claimed bound of the theorem for $R_K$ holds.

A.6 Proof of Theorem 2

Theorem (restatement of Theorem 2). Suppose Assumption 1 holds. Running the PSRL-SSP algorithm with costs $c_\varepsilon(s, a) := \max\{c(s, a), \varepsilon\}$ for $\varepsilon = (S^2A/K)^{2/3}$ yields
$$R_K = O\Big(B_\star S\sqrt{KA}\,L^2 + (S^2A)^{\frac23}K^{\frac13}\big(B_\star^{\frac32}L^2 + T_\star\big) + S^2 A T_\star^{\frac32}L^2\Big),$$
where $L := \log(KB_\star T_\star SA)$ and $T_\star$ is an upper bound on the expected time the optimal policy takes to reach the goal from any initial state.

Proof. Denote by $T^\varepsilon_K$ the time to complete $K$ episodes if the algorithm is run with the perturbed costs $c_\varepsilon(s, a)$, and let $V_\varepsilon(s_{\text{init}}; \theta_*)$, $V^\pi_\varepsilon(s_{\text{init}}; \theta_*)$ be the optimal value function and the value function of policy $\pi$ in the SSP with cost function $c_\varepsilon(s, a)$ and transition kernel $\theta_*$. We can write
$$\begin{aligned}
R_K &= \mathbb{E}\Big[\sum_{t=1}^{T^\varepsilon_K}c(s_t, a_t) - KV(s_{\text{init}}; \theta_*)\Big] \le \mathbb{E}\Big[\sum_{t=1}^{T^\varepsilon_K}c_\varepsilon(s_t, a_t) - KV(s_{\text{init}}; \theta_*)\Big]\\
&= \mathbb{E}\Big[\sum_{t=1}^{T^\varepsilon_K}c_\varepsilon(s_t, a_t) - KV_\varepsilon(s_{\text{init}}; \theta_*)\Big] + K\mathbb{E}\big[V_\varepsilon(s_{\text{init}}; \theta_*) - V(s_{\text{init}}; \theta_*)\big]. \qquad (17)
\end{aligned}$$

Theorem 1 implies that the first term is bounded by
$$\mathbb{E}\Big[\sum_{t=1}^{T^\varepsilon_K}c_\varepsilon(s_t, a_t) - KV_\varepsilon(s_{\text{init}}; \theta_*)\Big] = O\Big(B^\varepsilon_\star S\sqrt{KA}\,L_\varepsilon^2 + S^2 A\sqrt{\frac{(B^\varepsilon_\star)^3}{\varepsilon}}\,L_\varepsilon^2\Big),$$
with $L_\varepsilon = \log(B^\varepsilon_\star SAK/\varepsilon)$ and $B^\varepsilon_\star\le B_\star + \varepsilon T_\star$ (to see this, note that $V_\varepsilon(s; \theta_*)\le V^{\pi^*}_\varepsilon(s; \theta_*)\le B_\star + \varepsilon T_\star$). To bound the second term of (17), we have
$$V_\varepsilon(s_{\text{init}}; \theta_*) \le V^{\pi^*}_\varepsilon(s_{\text{init}}; \theta_*) \le V(s_{\text{init}}; \theta_*) + \varepsilon T_\star.$$

Combining these bounds, we can write
$$R_K = O\Big(B_\star S\sqrt{KA}\,L_\varepsilon^2 + \varepsilon T_\star S\sqrt{KA}\,L_\varepsilon^2 + S^2 A\sqrt{\frac{(B_\star + \varepsilon T_\star)^3}{\varepsilon}}\,L_\varepsilon^2 + KT_\star\varepsilon\Big).$$

Substituting $\varepsilon = (S^2A/K)^{2/3}$ and simplifying the result with $K\ge S^2A$ and $B_\star\le T_\star$ (since $c(s, a)\le 1$) implies
$$R_K = O\Big(B_\star S\sqrt{KA}\,L^2 + (S^2A)^{\frac23}K^{\frac13}\big(B_\star^{\frac32}L^2 + T_\star\big) + S^2 A T_\star^{\frac32}L^2\Big),$$
where $L = \log(KB_\star T_\star SA)$. This completes the proof.
