Regret Bounds for the Adaptive Control of Linear Quadratic Systems


COLT 2011

Yasin Abbasi-Yadkori and Csaba Szepesvári
University of Alberta
E-mail: [email protected]

Budapest, July 10, 2011

[Figure: probability density functions over the action a ∈ [0, 1], showing the behavior policy, the target policy with the recognizer, and the target policy w/o the recognizer; the region of interest a ∈ [0.7, 0.9] is marked.]

Off-policy learning with options and recognizers
Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna J. Koop, Satinder Singh
McGill University, University of Alberta, University of Michigan

Poster sections: Ideas and Motivation · Background · Recognizers · Off-policy algorithm for options · Learning w/o the Behavior Policy

Options
• A way of behaving for a period of time
• Specified by a policy and a stopping condition

Options for soccer players could be: Dribble, Keepaway, Pass

Models of options
• A predictive model of the outcome of following the option
  – What state will you be in?
  – Will you still control the ball?
  – What will be the value of some feature?
  – Will your teammate receive the pass?
  – What will be the expected total reward along the way?
  – How long can you keep control of the ball?

Options in a 2D world

[Figure: a 2D world with a wall and a distinguished region; an experienced trajectory is shown, together with red and blue options.]

The red and blue options are mostly executed. Surely we should be able to learn about them from this experience!

Off-policy learning
• Learning about one policy while behaving according to another
• Needed for RL w/ exploration (as in Q-learning)
• Needed for learning abstract models of dynamical systems (representing world knowledge)
• Enables efficient exploration
• Enables learning about many ways of behaving at the same time (learning models of options)

Non-sequential example

Problem formulation, with and without recognizers
• One state
• Continuous action a ∈ [0, 1]
• Outcomes z_i = a_i
• Given samples from policy b : [0, 1] → ℝ+
• Would like to estimate the mean outcome for a sub-region of the action space, here a ∈ [0.7, 0.9]

Target policy π : [0, 1] → ℝ+ is uniform within the region of interest (see the dashed line in the density figure above). The estimator is:

    m̂ = (1/n) ∑_{i=1}^n [π(a_i) / b(a_i)] z_i .
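To make the estimator concrete, here is a minimal NumPy sketch of the non-sequential example (one state, a ∈ [0, 1], z_i = a_i). The uniform behavior policy and the sample size are assumptions chosen for the demo, not taken from the poster.

import numpy as np

rng = np.random.default_rng(0)

# Behavior policy b: uniform on [0, 1], so b(a) = 1.
n = 10_000
a = rng.uniform(0.0, 1.0, size=n)
z = a                      # outcomes z_i = a_i

# Target policy pi: uniform on the region of interest [0.7, 0.9].
lo, hi = 0.7, 0.9
pi = np.where((a >= lo) & (a <= hi), 1.0 / (hi - lo), 0.0)
b = np.ones_like(a)

# Importance sampling estimate of the mean outcome under pi.
m_hat = np.mean(pi / b * z)
print(m_hat)               # close to (0.7 + 0.9) / 2 = 0.8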

Theorem 1. Let A = {a_1, . . . , a_k} ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let Π_A be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer c_A is the policy with minimum-variance one-step importance sampling corrections among those in Π_A:

    π as given by (1) = argmin_{p ∈ Π_A} E_b[ (p(a_i) / b(a_i))² ]    (2)

Proof: Using Lagrange multipliers

Theorem 2. Consider two binary recognizers c_1 and c_2 such that µ_1 > µ_2. Then the importance sampling corrections for c_1 have lower variance than the importance sampling corrections for c_2.

Off-policy learning

Let the importance sampling ratio at time step t be:

    ρ_t = π(s_t, a_t) / b(s_t, a_t)

The truncated n-step return, R_t^(n), satisfies:

    R_t^(n) = ρ_t [ r_{t+1} + (1 − β_{t+1}) R_{t+1}^(n−1) ].

The update to the parameter vector is proportional to:

    Δθ_t = [ R_t^λ − y_t ] ∇_θ y_t ρ_0 (1 − β_1) · · · ρ_{t−1} (1 − β_t).

Theorem 3. For every time step t ≥ 0 and any initial state s,

    E_b[Δθ_t | s] = E_π[Δθ_t | s].

Proof: By induction on n we show that E_b{R_t^(n) | s} = E_π{R_t^(n) | s}, which implies that E_b{R_t^λ | s} = E_π{R_t^λ | s}. The rest of the proof is algebraic manipulation (see paper).

Implementation of off-policy learning for options

In order to avoid the cumulative corrections going to 0, we use a restart function g : S → [0, 1] (as in the PSD algorithm). The forward algorithm becomes:

    Δθ_t = (R_t^λ − y_t) ∇_θ y_t ∑_{i=0}^t g_i ρ_i · · · ρ_{t−1} (1 − β_{i+1}) · · · (1 − β_t),

where g_t is the extent of restarting in state s_t.

The incremental learning algorithm is the following:
• Initialize α_0 = g_0, e_0 = α_0 ∇_θ y_0
• At every time step t:

    δ_t = ρ_t ( r_{t+1} + (1 − β_{t+1}) y_{t+1} ) − y_t
    θ_{t+1} = θ_t + η δ_t e_t
    α_{t+1} = ρ_t α_t (1 − β_{t+1}) + g_{t+1}
    e_{t+1} = λ ρ_t (1 − β_{t+1}) e_t + α_{t+1} ∇_θ y_{t+1}

(η denotes the step size; its original symbol was lost in extraction.)
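A minimal Python sketch of the incremental update above for linear function approximation, where y_t = θ·φ(s_t) and hence ∇_θ y_t = φ(s_t). The function and argument names are mine and η stands in for the unnamed step size; this is an illustration of the update equations, not the authors' code.

import numpy as np

def off_policy_option_step(theta, e, alpha, phi_t, phi_tp1, r_tp1,
                           rho_t, beta_tp1, g_tp1, lam, eta):
    # One incremental update of the option's reward model.
    # Initialization (outside this function): alpha = g_0, e = alpha * phi_0.
    y_t = theta @ phi_t                  # y_t
    y_tp1 = theta @ phi_tp1              # y_{t+1}
    # delta_t = rho_t (r_{t+1} + (1 - beta_{t+1}) y_{t+1}) - y_t
    delta = rho_t * (r_tp1 + (1.0 - beta_tp1) * y_tp1) - y_t
    theta = theta + eta * delta * e      # theta_{t+1}
    # alpha_{t+1} = rho_t alpha_t (1 - beta_{t+1}) + g_{t+1}
    alpha = rho_t * alpha * (1.0 - beta_tp1) + g_tp1
    # e_{t+1} = lam rho_t (1 - beta_{t+1}) e_t + alpha_{t+1} grad y_{t+1}
    e = lam * rho_t * (1.0 - beta_tp1) * e + alpha * phi_tp1
    return theta, e, alpha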

Off-policy learning is tricky
• The Bermuda triangle
  – Temporal-difference learning
  – Function approximation (e.g., linear)
  – Off-policy
• Leads to divergence of iterative algorithms
  – Q-learning diverges with linear FA
  – Dynamic programming diverges with linear FA

Baird's Counterexample

[Figure: Baird's counterexample. The states have value estimates V_k(s) = θ(7) + 2θ(1), …, θ(7) + 2θ(5) and 2θ(7) + θ(6); transition probabilities of 99%, 1% and 100% lead toward a terminal state. The plot shows the parameter values θ_k(i) (log scale, broken at ±1) over 5000 iterations: θ_k(7), θ_k(1)–θ_k(5) and θ_k(6) diverge.]

Precup, Sutton & Dasgupta (PSD) algorithm
• Uses importance sampling to convert the off-policy case to the on-policy case
• Convergence assured by a theorem of Tsitsiklis & Van Roy (1997)
• Survives the Bermuda triangle!

BUT!
• Variance can be high, even infinite (slow learning)
• Difficult to use with continuous or large action spaces
• Requires explicit representation of the behavior policy (probability distribution)

Option formalism

An option is defined as a triple o = ⟨I, π, β⟩
• I ⊆ S is the set of states in which the option can be initiated
• π is the internal policy of the option
• β : S → [0, 1] is a stochastic termination condition

We want to compute the reward model of option o:

    E_o{R(s)} = E{ r_1 + r_2 + · · · + r_T | s_0 = s, π, β }

We assume that linear function approximation is used to represent the model:

    E_o{R(s)} ≈ θ^⊤ φ_s = y
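The triple can be written down directly as a small data structure. The sketch below is only illustrative; the concrete State/Action types and the method name are placeholders, not part of the formalism.

from dataclasses import dataclass
from typing import Callable, Set

State = int        # placeholder types for the sketch
Action = int

@dataclass
class Option:
    # o = <I, pi, beta>
    initiation_set: Set[State]             # I: states where the option can start
    policy: Callable[[State], Action]      # pi: internal policy
    termination: Callable[[State], float]  # beta(s) in [0, 1]

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set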

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.
Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
Sutton, R. S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.
Sutton, R. S., Rafols, E., and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42.
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42.

Theorem 4. If the following assumptions hold:
• The function approximator used to represent the model is a state aggregator
• The recognizer behaves consistently with the function approximator, i.e., c(s, a) = c(p, a), ∀s ∈ p
• The recognition probability for each partition, µ̂(p), is estimated using maximum likelihood:

    µ̂(p) = N(p, c = 1) / N(p)

Then there exists a policy π such that the off-policy learning algorithm converges to the same model as the on-policy algorithm using π.

Proof: In the limit, w.p. 1, µ̂ converges to ∑_s d^b(s|p) ∑_a c(p, a) b(s, a), where d^b(s|p) is the probability of visiting state s from partition p under the stationary distribution of b. Let π be defined to be the same for all states in a partition p:

    π(p, a) = ρ(p, a) ∑_s d^b(s|p) b(s, a)

π is well-defined, in the sense that ∑_a π(s, a) = 1. Using Theorem 3, off-policy updates using the importance sampling corrections ρ will have the same expected value as on-policy updates using π.

Acknowledgements

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li, and other members of the rlai.net group. We thank Csaba Szepesvári and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.

The target policy π is induced by a recognizer function c : [0, 1] → ℝ+:

    π(a) = c(a) b(a) / ∑_x c(x) b(x) = c(a) b(a) / µ    (1)

(see the blue line in the density figure above). The estimator is:

    m̂ = (1/n) ∑_{i=1}^n z_i π(a_i) / b(a_i)
       = (1/n) ∑_{i=1}^n z_i [c(a_i) b(a_i) / µ] · [1 / b(a_i)]
       = (1/n) ∑_{i=1}^n z_i c(a_i) / µ
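A small Monte Carlo sketch of the variance reduction suggested by Theorems 1 and 2: it compares the importance sampling corrections for a uniform target on [0.7, 0.9] against the corrections c(a)/µ obtained with the binary recognizer. The Beta(2, 2) behavior policy and the sample size are arbitrary choices for the demo.

import numpy as np

rng = np.random.default_rng(1)
lo, hi = 0.7, 0.9

def b_pdf(a):
    # Beta(2, 2) density on [0, 1] (demo behavior policy).
    return 6.0 * a * (1.0 - a)

a = rng.beta(2.0, 2.0, size=100_000)
inside = (a >= lo) & (a <= hi)

# Without a recognizer: target uniform on [lo, hi], corrections pi(a)/b(a).
w_plain = np.where(inside, (1.0 / (hi - lo)) / b_pdf(a), 0.0)

# With the binary recognizer c(a) = 1{lo <= a <= hi}: corrections c(a)/mu,
# with mu estimated by maximum likelihood as the fraction of recognized actions.
mu = inside.mean()
w_recog = inside / mu

print("variance of corrections without recognizer:", w_plain.var())
print("variance of corrections with recognizer:   ", w_recog.var())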

[Figure: empirical variance of the estimator, averaged over 200 samples, as a function of the number of sample actions, with and without the recognizer.]

The importance sampling corrections are:

    ρ(s, a) = π(s, a) / b(s, a) = c(s, a) / µ(s)

where µ(s) depends on the behavior policy b. If b is unknown, instead of µ we use a maximum likelihood estimate µ̂ : S → [0, 1], and the importance sampling corrections are defined as:

    ρ(s, a) = c(s, a) / µ̂(s)

On-policy learning

If π is used to generate behavior, then the reward model of an option can be learned using TD learning.

The n-step truncated return is:

    R_t^(n) = r_{t+1} + (1 − β_{t+1}) R_{t+1}^(n−1).

The λ-return is defined as usual:

    R_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R_t^(n).

The parameters of the function approximator are updated on every step proportionally to:

    Δθ_t = [ R_t^λ − y_t ] ∇_θ y_t (1 − β_1) · · · (1 − β_t).

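For completeness, here is a forward-view sketch of these on-policy updates for one recorded trajectory. It uses the recursion R^λ_t = r_{t+1} + (1 − β_{t+1})[(1 − λ) y_{t+1} + λ R^λ_{t+1}], which follows from the definitions above; the array layout and names are my own.

import numpy as np

def on_policy_updates(rewards, betas, ys, grads, lam):
    # rewards[t] = r_{t+1}, betas[t] = beta_{t+1}   for t = 0..T-1
    # ys[t] = y_t, grads[t] = grad_theta y_t        for t = 0..T
    T = len(rewards)
    # Backward pass: lambda-returns, bootstrapping from y_T at the truncation point.
    R = np.empty(T + 1)
    R[T] = ys[T]
    for t in reversed(range(T)):
        R[t] = rewards[t] + (1 - betas[t]) * ((1 - lam) * ys[t + 1] + lam * R[t + 1])
    # Forward pass: Delta theta_t = (R^lam_t - y_t) grad y_t (1 - beta_1)...(1 - beta_t).
    updates, prod = [], 1.0
    for t in range(T):
        updates.append((R[t] - ys[t]) * prod * grads[t])
        prod *= 1 - betas[t]
    return updates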
Contributions
• Recognizers reduce variance
• First off-policy learning algorithm for option models
• Off-policy learning without knowledge of the behavior distribution
• Observations
  – Options are a natural way to reduce the variance of importance sampling algorithms (because of the termination condition)
  – Recognizers are a natural way to define options, especially for large or continuous action spaces.


Outline
1. Problem formulation
2. LQR: Main ideas; Some details (and the algorithm)
3. Proof sketch (and the result)
4. Conclusions and Open Problems
5. Bibliography


Control problems

[Diagram: agent–environment loop; the agent observes the state x_t and applies the control u_t.]

    x_{t+1} = f(x_t, u_t, w_t)
    ∑_t c(x_t, u_t) → min


Learning to control

[Diagram: the same agent–environment loop, x_{t+1} = f(x_t, u_t, w_t), ∑_t c(x_t, u_t) → min.]

f is unknown – yet the goal is to control the environment almost as well as if it were known.


Measure of performance of the learner

Does the average cost converge to the optimal average cost?

    (1/T) ∑_{t=1}^T c(x_t, u_t) → J* ?

How fast is the convergence? Compare the total losses ⇒ Regret:

    R_T = ∑_{t=1}^T c(x_t, u_t) − T J* .

Hannan consistency: R_T / T → 0 as T → ∞.

Typical result: for some γ ∈ (0, 1), R_T = O(T^γ).


This talk: Linear Quadratic Regulation

Linear dynamics: x_t ∈ ℝ^n, u_t ∈ ℝ^d,

    f(x_t, u_t, w_{t+1}) = A_* x_t + B_* u_t + w_{t+1} .

Quadratic cost: Q, R ≻ 0,

    c(x_t, u_t) = x_t^⊤ Q x_t + u_t^⊤ R u_t .

Noise (w_t)_t: subgaussian martingale noise, E[ w_{t+1} w_{t+1}^⊤ | F_t ] = I_n.

LQR problem: given A_*, B_*, Q, R, find an optimal controller.
LQR learning problem: given Q, R but not knowing A_*, B_*, learn to control the system.
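A minimal simulation sketch of this setup. The particular A_*, B_*, Q, R and the zero controller are placeholders chosen only to show the dynamics and the cost being accumulated; they are not from the talk.

import numpy as np

rng = np.random.default_rng(0)

n, d = 2, 1
A_star = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
B_star = np.array([[0.0],
                   [0.5]])
Q = np.eye(n)
R = np.eye(d)

x = np.zeros(n)
total_cost = 0.0
T = 1000
for t in range(T):
    u = np.zeros(d)                          # placeholder controller
    total_cost += x @ Q @ x + u @ R @ u      # c(x_t, u_t)
    w = rng.standard_normal(n)               # E[w w^T] = I_n
    x = A_star @ x + B_star @ u + w          # x_{t+1}
print("average cost:", total_cost / T)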


The goal and why should we care?

Goal: Design a controller which achieves low regret for a reasonably large class of LQR problems.

• Simple ≡ beautiful, nice structures!
• Continuous states and controls!
• LQR control is actually useful! (even when no learning is involved)
• Unsolved problem!?


Previous works

• Bartlett and Tewari (2009); Auer et al. (2010) – regret analysis of finite MDPs
• Fiechter (1997) – discounted LQ, “PAC-learning”
• Control people!
  – Lai and Wei (1982b, 1987); Chen and Guo (1987); Chen and Zhang (1990); Lai and Ying (2006) – consistency, forced exploration (like ε-greedy)
  – Campi and Kumar (1998); Bittanti and Campi (2006) – consistency, basis of the present work
• Lai and Robbins (1985) – optimism in the face of uncertainty for bandits
• Lai and Wei (1982a); Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010) – linear estimation, tail inequalities



The main ideas of the algorithm
• Estimate the system dynamics
• Be optimistic in selecting the controls
• Avoid frequent changes to the policy


Estimation

    x_{t+1} = A_* x_t + B_* u_t + w_{t+1}
            = Θ_* (x_t; u_t) + w_{t+1}
            = Θ_* z_t + w_{t+1}

Data: (z_0, x_1), (z_1, x_2), …, (z_{t−1}, x_t), i.e., x_{i+1} = Θ_* z_i + w_{i+1}.

Linear regression with correlated covariates and martingale noise ⇒ use ridge regression (least squares with ℓ2 penalties).
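A short sketch of this estimation step in NumPy. Here Θ is stored as an (n+d) × n matrix so that x_{i+1} ≈ Θ^⊤ z_i (the slides write Θ z_t, which is the same object transposed); the function name and the default λ are assumptions.

import numpy as np

def ridge_estimate(Z, X_next, lam=1.0):
    # Z: (t, n+d) array whose rows are z_i^T; X_next: (t, n) array whose rows are x_{i+1}^T.
    # Returns Theta_hat (shape (n+d, n)) minimizing
    #   sum_i ||x_{i+1} - Theta^T z_i||^2 + lam * ||Theta||_F^2,
    # together with the regularized covariance V_t = lam*I + sum_i z_i z_i^T.
    V = lam * np.eye(Z.shape[1]) + Z.T @ Z
    Theta_hat = np.linalg.solve(V, Z.T @ X_next)
    return Theta_hat, V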


Optimism principle

Optimism Principle: Let C_t(δ) be a confidence set for the unknown parameters. Choose the control which gives rise to the best performance.

For a given Θ, let J(Θ) be the optimal average cost of the linear system with parameter Θ, and let π_Θ be the corresponding optimal policy. Choose

    Θ_t = argmin_{Θ ∈ C_t(δ)} J(Θ)   and   u_t = π_{Θ_t}(x_t) .

Caveats:
• J(Θ), π_Θ can be ill-defined
• Need a restriction on the allowed set of parameters
• Finding Θ_t is a potentially difficult optimization problem
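A rough sketch of the optimistic step. J(Θ) is computed as trace(P(Θ)) via the discrete-time Riccati equation (valid here because E[w w^⊤] = I_n), and the argmin over C_t(δ) is replaced by a scan over a finite list of candidate (A, B) pairs assumed to lie in the confidence set, which is only a crude stand-in for the optimization problem flagged in the caveats.

import numpy as np
from scipy.linalg import solve_discrete_are

def average_cost_and_gain(A, B, Q, R):
    # Optimal average cost J(Theta) = trace(P) and gain K (u = -K x) for parameters (A, B).
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return np.trace(P), K

def optimistic_choice(candidates, Q, R):
    # candidates: finite list of (A, B) pairs assumed to lie in C_t(delta).
    return min(candidates, key=lambda AB: average_cost_and_gain(AB[0], AB[1], Q, R)[0])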


Avoiding frequent changes
• Frequent changes are unnecessary
• Saving computation ⇒ going green!?
• Frequent changes might be a problem (avoiding frequent changes helps with the proof)


How to choose the confidence set?

Adventures in self-normalized processes, “method of mixtures” ⇒ de la Peña et al. (2009)

Theorem. Let z_t^⊤ = (x_t^⊤, u_t^⊤) ∈ ℝ^{n+d}. Let Θ̂_t be the ridge-regression parameter estimate with regularization coefficient λ > 0, and let V_t = λI + ∑_{i=0}^{t−1} z_i z_i^⊤ be the covariance matrix. Then, for any 0 < δ < 1, with probability at least 1 − δ,

    trace( (Θ̂_t − Θ_*)^⊤ V_t (Θ̂_t − Θ_*) ) ≤ ( d √( 2 log( det(V_t)^{1/2} det(λI)^{−1/2} / δ ) ) + λ^{1/2} S )² .


Construction of confidence sets

An ellipsoid centred at Θ̂_t:

    trace( (Θ − Θ̂_t)^⊤ V_t (Θ − Θ̂_t) ) ≤ β_t .

[Figure: the confidence ellipsoid around the estimate Θ̂_t.]
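The ellipsoid is easy to test for membership once the radius β_t is given (for instance, from the theorem above). A small sketch, with shapes following the (n+d) × n convention used in the estimation sketch earlier:

import numpy as np

def in_confidence_set(Theta, Theta_hat, V, beta_t):
    # Theta, Theta_hat: (n+d, n) matrices; V: (n+d, n+d); beta_t: given radius.
    D = Theta - Theta_hat
    return float(np.trace(D.T @ V @ D)) <= beta_t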


The algorithm

Inputs: T, S > 0, δ > 0, Q, L.
Set V_0 = I and Θ̂_0 = 0, (A_0, B_0) = Θ_0 = argmin_{Θ ∈ C_0(δ)} J(Θ).
for t := 0, 1, 2, . . . do
    Calculate Θ_t = argmin_{Θ ∈ C_t(δ)} J(Θ).
    Calculate u_t based on the current parameters, u_t = K(Θ_t) x_t.
    Execute the control, observe the new state x_{t+1}.
    V_{t+1} := V_t + z_t z_t^⊤, where z_t^⊤ = (x_t^⊤, u_t^⊤).
end for


Proof sketch
• Fix T > 0.
• With high probability, the state stays O(log T). ⇐ most of the work is here
• Decompose the regret
• Analyze each term


Regret decomposition

Dynamic programming equations, E[w_{t+1} | F_t] = 0, algebra…

    R_1 = ∑_{t=0}^T { x_t^⊤ P(Θ_t) x_t − E[ x_{t+1}^⊤ P(Θ_{t+1}) x_{t+1} | F_t ] }

    R_2 = ∑_{t=0}^T E[ x_{t+1}^⊤ { P(Θ_{t+1}) − P(Θ_t) } x_{t+1} | F_t ]

    R_3 = ∑_{t=0}^T z_t^⊤ ( Θ_*^⊤ P(Θ_t) Θ_* − Θ_t^⊤ P(Θ_t) Θ_t ) z_t .

    ∑_{t=0}^T ( x_t^⊤ Q x_t + u_t^⊤ R u_t ) = ∑_{t=0}^T J(Θ_t) + R_1 + R_2 + R_3
                                            ≤ T J(Θ_*) + R_1 + R_2 + R_3 .


Term R_1

    R_1 = ∑_{t=0}^T { x_t^⊤ P(Θ_t) x_t − E[ x_{t+1}^⊤ P(Θ_{t+1}) x_{t+1} | F_t ] }

• Regrouping
• Martingale difference sequence
• State does not explode


Term R_3

    R_3 = ∑_{t=0}^T z_t^⊤ ( Θ_*^⊤ P(Θ_t) Θ_* − Θ_t^⊤ P(Θ_t) Θ_t ) z_t .

Algebra… reduce to

    O(√T) + ( ∑_t ‖ P(Θ_t) (Θ_t − Θ_*)^⊤ z_t ‖² )^{1/2}

• More algebra…
• Choice of confidence set
• State does not explode

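A sketch of how the confidence set controls the remaining sum follows; here \hat\Theta_t (the least-squares estimate) and \beta_t(\delta) (the confidence-set radius) are my notation for the set C_t(\delta), not symbols from the slide.

\bigl\|(\Theta_t - \Theta_*)^\top z_t\bigr\| \le \bigl\|V_t^{1/2}(\Theta_t - \Theta_*)\bigr\| \, \|z_t\|_{V_t^{-1}}, \qquad \|z\|_{V^{-1}}^2 := z^\top V^{-1} z .

On the event \Theta_* \in C_t(\delta), both \Theta_t and \Theta_* lie within \sqrt{\beta_t(\delta)} of \hat\Theta_t in the V_t-weighted norm, so the first factor is at most 2\sqrt{\beta_t(\delta)}. The elliptical potential lemma,

\sum_{t=0}^{T} \min\bigl\{1, \|z_t\|_{V_t^{-1}}^2\bigr\} \le 2 \log\frac{\det V_{T+1}}{\det V_0} = O\bigl((n+d)\log T\bigr) \quad \text{when } \|z_t\| \text{ stays bounded},

then shows that \sum_t \|P(\Theta_t)(\Theta_t - \Theta_*)^\top z_t\|^2 grows only polylogarithmically in T (up to the factor \beta_T(\delta) and the usual handling of the \min\{1,\cdot\} truncation), so the square-root term above is of lower order than \sqrt{T}.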


Term R2

R_2 = \sum_{t=0}^{T} E\bigl[ x_{t+1}^\top \bigl\{ P(\Theta_{t+1}) - P(\Theta_t) \bigr\}\, x_{t+1} \,\big|\, \mathcal{F}_t \bigr]

We cannot analyze this algorithm as it stands! What if we change the policy only rarely?



Change the policy only when the determinant of the confidence ellipsoid doubles.

τ_s: the time of the s-th policy change.

There are O(\log T) policy changes up to time T, and

\sum_{t=0}^{T} E\bigl[ x_{t+1}^\top \bigl( P(\Theta_{t+1}) - P(\Theta_t) \bigr)\, x_{t+1} \,\big|\, \mathcal{F}_t \bigr] \le O(\log T).

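A brief sketch of why the doubling rule caps the number of policy changes; writing n and d for the state and control dimensions, the bound Z on \|z_t\| is an assumption tied to the "state does not explode" event.

Since \det V_0 = \det I = 1 and each policy change at least doubles the determinant relative to the previous change, the number of changes up to time T is at most \log_2 \det V_{T+1}. By the AM–GM inequality,

\det V_{T+1} \le \Bigl(\frac{\operatorname{trace} V_{T+1}}{n+d}\Bigr)^{n+d} \le \Bigl(1 + \frac{(T+1)Z^2}{n+d}\Bigr)^{n+d} \quad \text{if } \|z_t\| \le Z,

so the number of policy changes is O\bigl((n+d)\log(T Z^2)\bigr) = O(\log T).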


The algorithm

Inputs: T, S > 0, δ > 0, Q, L.
Set V_0 = I and \hat\Theta_0 = 0; let (A_0, B_0) = \Theta_0 = \arg\min_{\Theta \in C_0(\delta)} J_*(\Theta).
for t := 0, 1, 2, ... do
    if det(V_t) > 2 det(V_0) then
        Recalculate the parameters: \Theta_t = \arg\min_{\Theta \in C_t(\delta)} J_*(\Theta).
        Let V_0 = V_t.
    else
        \Theta_t = \Theta_{t-1}.
    end if
    Calculate u_t based on the current parameters: u_t = K(\Theta_t)\, x_t.
    Execute the control and observe the new state x_{t+1}.
    V_{t+1} := V_t + z_t z_t^\top, where z_t^\top = (x_t^\top, u_t^\top).
end for

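Below is a minimal Python sketch of this loop, assuming numpy and scipy are available. It keeps the lazy "determinant doubles" rule but, to stay short, replaces the expensive optimistic step (the argmin over C_t(δ)) with a certainty-equivalent least-squares estimate plus a small dither on the control for excitation. That is a deliberate simplification, not the algorithm analyzed in the talk, and the names (lazy_ce_lq, lqr_gain, explore_std, the toy system) are illustrative.

# Minimal sketch of the lazy update loop (certainty equivalence + dither,
# NOT the optimistic argmin over C_t(delta) used on the slide).
import numpy as np
from scipy.linalg import solve_discrete_are


def lqr_gain(A, B, Q, R):
    """Solve the discrete-time Riccati equation; u = -K x is optimal for (A, B, Q, R)."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K


def lazy_ce_lq(A_star, B_star, Q, R, T=5000, noise_std=0.1, explore_std=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A_star.shape[0], B_star.shape[1]
    V = np.eye(n + d)             # V_0 = I, as in the slide
    S = np.zeros((n + d, n))      # running sum of z_t x_{t+1}^T for least squares
    logdet_ref = 0.0              # log det(V) at the last policy change
    x = np.zeros(n)
    # Arbitrary stabilizable initial guess, standing in for the argmin over C_0(delta).
    A_hat, B_hat = 0.5 * np.eye(n), np.eye(n, d)
    _, K = lqr_gain(A_hat, B_hat, Q, R)
    total_cost = 0.0
    for t in range(T):
        logdet = np.linalg.slogdet(V)[1]
        if logdet > logdet_ref + np.log(2.0):        # lazy rule: det(V) has doubled
            Theta = np.linalg.solve(V, S)            # ridge least-squares estimate
            A_hat, B_hat = Theta[:n].T, Theta[n:].T  # Theta^T = [A_hat  B_hat]
            try:
                _, K = lqr_gain(A_hat, B_hat, Q, R)
            except (np.linalg.LinAlgError, ValueError):
                pass                                 # keep the previous policy
            logdet_ref = logdet
        # u = -K x (the slide's K(Theta) absorbs the sign); the dither adds excitation.
        u = -K @ x + explore_std * rng.standard_normal(d)
        total_cost += x @ Q @ x + u @ R @ u
        z = np.concatenate([x, u])
        x = A_star @ x + B_star @ u + noise_std * rng.standard_normal(n)
        V += np.outer(z, z)
        S += np.outer(z, x)
    return total_cost / T


if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])   # marginally stable toy system
    B = np.array([[0.0], [0.1]])
    print("average cost:", lazy_ce_lq(A, B, Q=np.eye(2), R=np.eye(1)))

The dither stands in for the exploration that optimism provides; without either, the closed-loop data need not identify B, which is exactly why the confidence-set machinery above is needed for a regret guarantee.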


Theorem
With probability at least 1 − δ, the regret of the algorithm is bounded as follows:

R(T) = O\bigl(\sqrt{T \log(1/\delta)}\bigr).

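For orientation (a recap, not a line from the slide), the pieces combine with the earlier decomposition as follows, measuring the regret against T times the optimal average cost:

R(T) = \sum_{t=0}^{T}\bigl(x_t^\top Q\, x_t + u_t^\top R\, u_t\bigr) - T\, J(\Theta_*) \le R_1 + R_2 + R_3,

and on a single event of probability at least 1 − δ (on which Θ_* lies in every confidence set and the state does not explode), R_1 = \tilde O(\sqrt{T}) by the martingale argument, R_2 = O(\log T) by the lazy policy updates, and R_3 = \tilde O(\sqrt{T}) by the confidence-set argument, which gives the stated bound.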


Conclusions

• First regret result for the problem of linear optimal control.
• The algorithm is too expensive! Does there exist a cheaper alternative with similar guarantees?
• Relaxing the martingale noise assumption? (k-th order Markov noise? ARMA?)
• Extension to linearly parameterized systems, x_{t+1} = θ^\top φ(x_t, u_t) + w_{t+1}? Planning? Learning?
• Unrealizable case?

