Regret Bounds for the Adaptive Control of Linear Quadratic Systems


COLT 2011

Yasin Abbasi-Yadkori and Csaba Szepesvári
University of Alberta
E-mail: [email protected]

Budapest, July 10, 2011

[Figure: probability density functions over the action a ∈ [0, 1], showing the behavior policy, the target policy with the recognizer, and the target policy w/o the recognizer; the region of interest a ∈ [0.7, 0.9] is marked.]

Off-policy learning with options and recognizers
Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna J. Koop, Satinder Singh
McGill University, University of Alberta, University of Michigan

Poster sections: Ideas and Motivation · Background · Recognizers · Off-policy algorithm for options · Learning w/o the Behavior Policy

Options
• A way of behaving for a period of time
• Specified by a policy and a stopping condition

Options for soccer players could be: Dribble, Keepaway, Pass

Models of options
• A predictive model of the outcome of following the option
  – What state will you be in?
  – Will you still control the ball?
  – What will be the value of some feature?
  – Will your teammate receive the pass?
  – What will be the expected total reward along the way?
  – How long can you keep control of the ball?

Options in a 2D world

[Figure: a 2D world with a wall and a distinguished region; an experienced trajectory is shown, together with red and blue options.]

The red and blue options are mostly executed. Surely we should be able to learn about them from this experience!

Off-policy learning
• Learning about one policy while behaving according to another
• Needed for RL w/ exploration (as in Q-learning)
• Needed for learning abstract models of dynamical systems (representing world knowledge)
• Enables efficient exploration
• Enables learning about many ways of behaving at the same time (learning models of options)

Non-sequential example

Problem formulation, with and without recognizers
• One state
• Continuous action a ∈ [0, 1]
• Outcomes z_i = a_i
• Given samples from policy b : [0, 1] → ℝ+
• Would like to estimate the mean outcome for a sub-region of the action space, here a ∈ [0.7, 0.9]

Target policy π : [0, 1] → ℝ+ is uniform within the region of interest (see the dashed line in the density figure above). The estimator is:

    m̂ = (1/n) ∑_{i=1}^n [π(a_i) / b(a_i)] z_i .
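To make the estimator concrete, here is a minimal NumPy sketch of the non-sequential example (one state, a ∈ [0, 1], z_i = a_i). The uniform behavior policy and the sample size are assumptions chosen for the demo, not taken from the poster.

import numpy as np

rng = np.random.default_rng(0)

# Behavior policy b: uniform on [0, 1], so b(a) = 1.
n = 10_000
a = rng.uniform(0.0, 1.0, size=n)
z = a                      # outcomes z_i = a_i

# Target policy pi: uniform on the region of interest [0.7, 0.9].
lo, hi = 0.7, 0.9
pi = np.where((a >= lo) & (a <= hi), 1.0 / (hi - lo), 0.0)
b = np.ones_like(a)

# Importance sampling estimate of the mean outcome under pi.
m_hat = np.mean(pi / b * z)
print(m_hat)               # close to (0.7 + 0.9) / 2 = 0.8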

Theorem 1. Let A = {a_1, . . . , a_k} ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let Π_A be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer c_A is the policy with minimum-variance one-step importance sampling corrections among those in Π_A:

    π as given by (1) = argmin_{p ∈ Π_A} E_b[ (p(a_i) / b(a_i))² ]    (2)

Proof: Using Lagrange multipliers

Theorem 2. Consider two binary recognizers c_1 and c_2 such that µ_1 > µ_2. Then the importance sampling corrections for c_1 have lower variance than the importance sampling corrections for c_2.

Off-policy learning

Let the importance sampling ratio at time step t be:

    ρ_t = π(s_t, a_t) / b(s_t, a_t)

The truncated n-step return, R_t^(n), satisfies:

    R_t^(n) = ρ_t [ r_{t+1} + (1 − β_{t+1}) R_{t+1}^(n−1) ].

The update to the parameter vector is proportional to:

    Δθ_t = [ R_t^λ − y_t ] ∇_θ y_t ρ_0 (1 − β_1) · · · ρ_{t−1} (1 − β_t).

Theorem 3. For every time step t ≥ 0 and any initial state s,

    E_b[Δθ_t | s] = E_π[Δθ_t | s].

Proof: By induction on n we show that E_b{R_t^(n) | s} = E_π{R_t^(n) | s}, which implies that E_b{R_t^λ | s} = E_π{R_t^λ | s}. The rest of the proof is algebraic manipulation (see paper).

Implementation of off-policy learning for options

In order to avoid the cumulative corrections going to 0, we use a restart function g : S → [0, 1] (as in the PSD algorithm). The forward algorithm becomes:

    Δθ_t = (R_t^λ − y_t) ∇_θ y_t ∑_{i=0}^t g_i ρ_i · · · ρ_{t−1} (1 − β_{i+1}) · · · (1 − β_t),

where g_t is the extent of restarting in state s_t.

The incremental learning algorithm is the following:
• Initialize α_0 = g_0, e_0 = α_0 ∇_θ y_0
• At every time step t:

    δ_t = ρ_t ( r_{t+1} + (1 − β_{t+1}) y_{t+1} ) − y_t
    θ_{t+1} = θ_t + η δ_t e_t
    α_{t+1} = ρ_t α_t (1 − β_{t+1}) + g_{t+1}
    e_{t+1} = λ ρ_t (1 − β_{t+1}) e_t + α_{t+1} ∇_θ y_{t+1}

(η denotes the step size; its original symbol was lost in extraction.)
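A minimal Python sketch of the incremental update above for linear function approximation, where y_t = θ·φ(s_t) and hence ∇_θ y_t = φ(s_t). The function and argument names are mine and η stands in for the unnamed step size; this is an illustration of the update equations, not the authors' code.

import numpy as np

def off_policy_option_step(theta, e, alpha, phi_t, phi_tp1, r_tp1,
                           rho_t, beta_tp1, g_tp1, lam, eta):
    # One incremental update of the option's reward model.
    # Initialization (outside this function): alpha = g_0, e = alpha * phi_0.
    y_t = theta @ phi_t                  # y_t
    y_tp1 = theta @ phi_tp1              # y_{t+1}
    # delta_t = rho_t (r_{t+1} + (1 - beta_{t+1}) y_{t+1}) - y_t
    delta = rho_t * (r_tp1 + (1.0 - beta_tp1) * y_tp1) - y_t
    theta = theta + eta * delta * e      # theta_{t+1}
    # alpha_{t+1} = rho_t alpha_t (1 - beta_{t+1}) + g_{t+1}
    alpha = rho_t * alpha * (1.0 - beta_tp1) + g_tp1
    # e_{t+1} = lam rho_t (1 - beta_{t+1}) e_t + alpha_{t+1} grad y_{t+1}
    e = lam * rho_t * (1.0 - beta_tp1) * e + alpha * phi_tp1
    return theta, e, alpha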

Off-policy learning is tricky
• The Bermuda triangle
  – Temporal-difference learning
  – Function approximation (e.g., linear)
  – Off-policy
• Leads to divergence of iterative algorithms
  – Q-learning diverges with linear FA
  – Dynamic programming diverges with linear FA

Baird's Counterexample

[Figure: Baird's counterexample. The states have value estimates V_k(s) = θ(7) + 2θ(1), …, θ(7) + 2θ(5) and 2θ(7) + θ(6); transition probabilities of 99%, 1% and 100% lead toward a terminal state. The plot shows the parameter values θ_k(i) (log scale, broken at ±1) over 5000 iterations: θ_k(7), θ_k(1)–θ_k(5) and θ_k(6) diverge.]

Precup, Sutton & Dasgupta (PSD) algorithm
• Uses importance sampling to convert the off-policy case to the on-policy case
• Convergence assured by a theorem of Tsitsiklis & Van Roy (1997)
• Survives the Bermuda triangle!

BUT!
• Variance can be high, even infinite (slow learning)
• Difficult to use with continuous or large action spaces
• Requires explicit representation of the behavior policy (probability distribution)

Option formalism

An option is defined as a triple o = ⟨I, π, β⟩
• I ⊆ S is the set of states in which the option can be initiated
• π is the internal policy of the option
• β : S → [0, 1] is a stochastic termination condition

We want to compute the reward model of option o:

    E_o{R(s)} = E{ r_1 + r_2 + · · · + r_T | s_0 = s, π, β }

We assume that linear function approximation is used to represent the model:

    E_o{R(s)} ≈ θ^⊤ φ_s = y
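The triple can be written down directly as a small data structure. The sketch below is only illustrative; the concrete State/Action types and the method name are placeholders, not part of the formalism.

from dataclasses import dataclass
from typing import Callable, Set

State = int        # placeholder types for the sketch
Action = int

@dataclass
class Option:
    # o = <I, pi, beta>
    initiation_set: Set[State]             # I: states where the option can start
    policy: Callable[[State], Action]      # pi: internal policy
    termination: Callable[[State], float]  # beta(s) in [0, 1]

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set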

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.
Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
Sutton, R. S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.
Sutton, R. S., Rafols, E., and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42.
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42.

Theorem 4. If the following assumptions hold:
• The function approximator used to represent the model is a state aggregator
• The recognizer behaves consistently with the function approximator, i.e., c(s, a) = c(p, a), ∀s ∈ p
• The recognition probability for each partition, µ̂(p), is estimated using maximum likelihood:

    µ̂(p) = N(p, c = 1) / N(p)

Then there exists a policy π such that the off-policy learning algorithm converges to the same model as the on-policy algorithm using π.

Proof: In the limit, w.p. 1, µ̂ converges to ∑_s d^b(s|p) ∑_a c(p, a) b(s, a), where d^b(s|p) is the probability of visiting state s from partition p under the stationary distribution of b. Let π be defined to be the same for all states in a partition p:

    π(p, a) = ρ(p, a) ∑_s d^b(s|p) b(s, a)

π is well-defined, in the sense that ∑_a π(s, a) = 1. Using Theorem 3, off-policy updates using the importance sampling corrections ρ will have the same expected value as on-policy updates using π.

Acknowledgements

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li, and other members of the rlai.net group. We thank Csaba Szepesvári and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.

The target policy π is induced by a recognizer function c : [0, 1] → ℝ+:

    π(a) = c(a) b(a) / ∑_x c(x) b(x) = c(a) b(a) / µ    (1)

(see the blue line in the density figure above). The estimator is:

    m̂ = (1/n) ∑_{i=1}^n z_i π(a_i) / b(a_i)
       = (1/n) ∑_{i=1}^n z_i [c(a_i) b(a_i) / µ] · [1 / b(a_i)]
       = (1/n) ∑_{i=1}^n z_i c(a_i) / µ
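A small Monte Carlo sketch of the variance reduction suggested by Theorems 1 and 2: it compares the importance sampling corrections for a uniform target on [0.7, 0.9] against the corrections c(a)/µ obtained with the binary recognizer. The Beta(2, 2) behavior policy and the sample size are arbitrary choices for the demo.

import numpy as np

rng = np.random.default_rng(1)
lo, hi = 0.7, 0.9

def b_pdf(a):
    # Beta(2, 2) density on [0, 1] (demo behavior policy).
    return 6.0 * a * (1.0 - a)

a = rng.beta(2.0, 2.0, size=100_000)
inside = (a >= lo) & (a <= hi)

# Without a recognizer: target uniform on [lo, hi], corrections pi(a)/b(a).
w_plain = np.where(inside, (1.0 / (hi - lo)) / b_pdf(a), 0.0)

# With the binary recognizer c(a) = 1{lo <= a <= hi}: corrections c(a)/mu,
# with mu estimated by maximum likelihood as the fraction of recognized actions.
mu = inside.mean()
w_recog = inside / mu

print("variance of corrections without recognizer:", w_plain.var())
print("variance of corrections with recognizer:   ", w_recog.var())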

[Figure: empirical variance of the estimator, averaged over 200 samples, as a function of the number of sample actions, with and without the recognizer.]

The importance sampling corrections are:

    ρ(s, a) = π(s, a) / b(s, a) = c(s, a) / µ(s)

where µ(s) depends on the behavior policy b. If b is unknown, instead of µ we use a maximum likelihood estimate µ̂ : S → [0, 1], and the importance sampling corrections are defined as:

    ρ(s, a) = c(s, a) / µ̂(s)

On-policy learning

If π is used to generate behavior, then the reward model of an option can be learned using TD learning.

The n-step truncated return is:

    R_t^(n) = r_{t+1} + (1 − β_{t+1}) R_{t+1}^(n−1).

The λ-return is defined as usual:

    R_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R_t^(n).

The parameters of the function approximator are updated on every step proportionally to:

    Δθ_t = [ R_t^λ − y_t ] ∇_θ y_t (1 − β_1) · · · (1 − β_t).

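For completeness, here is a forward-view sketch of these on-policy updates for one recorded trajectory. It uses the recursion R^λ_t = r_{t+1} + (1 − β_{t+1})[(1 − λ) y_{t+1} + λ R^λ_{t+1}], which follows from the definitions above; the array layout and names are my own.

import numpy as np

def on_policy_updates(rewards, betas, ys, grads, lam):
    # rewards[t] = r_{t+1}, betas[t] = beta_{t+1}   for t = 0..T-1
    # ys[t] = y_t, grads[t] = grad_theta y_t        for t = 0..T
    T = len(rewards)
    # Backward pass: lambda-returns, bootstrapping from y_T at the truncation point.
    R = np.empty(T + 1)
    R[T] = ys[T]
    for t in reversed(range(T)):
        R[t] = rewards[t] + (1 - betas[t]) * ((1 - lam) * ys[t + 1] + lam * R[t + 1])
    # Forward pass: Delta theta_t = (R^lam_t - y_t) grad y_t (1 - beta_1)...(1 - beta_t).
    updates, prod = [], 1.0
    for t in range(T):
        updates.append((R[t] - ys[t]) * prod * grads[t])
        prod *= 1 - betas[t]
    return updates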
Contributions
• Recognizers reduce variance
• First off-policy learning algorithm for option models
• Off-policy learning without knowledge of the behavior distribution
• Observations
  – Options are a natural way to reduce the variance of importance sampling algorithms (because of the termination condition)
  – Recognizers are a natural way to define options, especially for large or continuous action spaces.


Outline
1. Problem formulation
2. LQR: Main ideas; Some details (and the algorithm)
3. Proof sketch (and the result)
4. Conclusions and Open Problems
5. Bibliography


Control problems

[Diagram: agent–environment loop; the agent observes the state x_t and applies the control u_t.]

    x_{t+1} = f(x_t, u_t, w_t)
    ∑_t c(x_t, u_t) → min


Learning to control

[Diagram: the same agent–environment loop, x_{t+1} = f(x_t, u_t, w_t), ∑_t c(x_t, u_t) → min.]

f is unknown – yet the goal is to control the environment almost as well as if it were known.


Measure of performance of the learner

Does the average cost converge to the optimal average cost?

    (1/T) ∑_{t=1}^T c(x_t, u_t) → J* ?

How fast is the convergence? Compare the total losses ⇒ Regret:

    R_T = ∑_{t=1}^T c(x_t, u_t) − T J* .

Hannan consistency: R_T / T → 0 as T → ∞.

Typical result: for some γ ∈ (0, 1), R_T = O(T^γ).


This talk: Linear Quadratic Regulation

Linear dynamics: x_t ∈ ℝ^n, u_t ∈ ℝ^d,

    f(x_t, u_t, w_{t+1}) = A_* x_t + B_* u_t + w_{t+1} .

Quadratic cost: Q, R ≻ 0,

    c(x_t, u_t) = x_t^⊤ Q x_t + u_t^⊤ R u_t .

Noise (w_t)_t: subgaussian martingale noise, E[ w_{t+1} w_{t+1}^⊤ | F_t ] = I_n.

LQR problem: given A_*, B_*, Q, R, find an optimal controller.
LQR learning problem: given Q, R but not knowing A_*, B_*, learn to control the system.
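A minimal simulation sketch of this setup. The particular A_*, B_*, Q, R and the zero controller are placeholders chosen only to show the dynamics and the cost being accumulated; they are not from the talk.

import numpy as np

rng = np.random.default_rng(0)

n, d = 2, 1
A_star = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
B_star = np.array([[0.0],
                   [0.5]])
Q = np.eye(n)
R = np.eye(d)

x = np.zeros(n)
total_cost = 0.0
T = 1000
for t in range(T):
    u = np.zeros(d)                          # placeholder controller
    total_cost += x @ Q @ x + u @ R @ u      # c(x_t, u_t)
    w = rng.standard_normal(n)               # E[w w^T] = I_n
    x = A_star @ x + B_star @ u + w          # x_{t+1}
print("average cost:", total_cost / T)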


The goal and why should we care?

Goal: Design a controller which achieves low regret for a reasonably large class of LQR problems.

• Simple ≡ beautiful, nice structures!
• Continuous states and controls!
• LQR control is actually useful! (even when no learning is involved)
• Unsolved problem!?


Previous works

• Bartlett and Tewari (2009); Auer et al. (2010) – regret analysis of finite MDPs
• Fiechter (1997) – discounted LQ, “PAC-learning”
• Control people!
  – Lai and Wei (1982b, 1987); Chen and Guo (1987); Chen and Zhang (1990); Lai and Ying (2006) – consistency, forced exploration (like ε-greedy)
  – Campi and Kumar (1998); Bittanti and Campi (2006) – consistency, basis of the present work
• Lai and Robbins (1985) – optimism in the face of uncertainty for bandits
• Lai and Wei (1982a); Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010) – linear estimation, tail inequalities



The main ideas of the algorithm
• Estimate the system dynamics
• Be optimistic in selecting the controls
• Avoid frequent changes to the policy


Estimation

    x_{t+1} = A_* x_t + B_* u_t + w_{t+1}
            = Θ_* (x_t; u_t) + w_{t+1}
            = Θ_* z_t + w_{t+1}

Data: (z_0, x_1), (z_1, x_2), …, (z_{t−1}, x_t), i.e., x_{i+1} = Θ_* z_i + w_{i+1}.

Linear regression with correlated covariates and martingale noise ⇒ use ridge regression (least squares with ℓ2 penalties).
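A short sketch of this estimation step in NumPy. Here Θ is stored as an (n+d) × n matrix so that x_{i+1} ≈ Θ^⊤ z_i (the slides write Θ z_t, which is the same object transposed); the function name and the default λ are assumptions.

import numpy as np

def ridge_estimate(Z, X_next, lam=1.0):
    # Z: (t, n+d) array whose rows are z_i^T; X_next: (t, n) array whose rows are x_{i+1}^T.
    # Returns Theta_hat (shape (n+d, n)) minimizing
    #   sum_i ||x_{i+1} - Theta^T z_i||^2 + lam * ||Theta||_F^2,
    # together with the regularized covariance V_t = lam*I + sum_i z_i z_i^T.
    V = lam * np.eye(Z.shape[1]) + Z.T @ Z
    Theta_hat = np.linalg.solve(V, Z.T @ X_next)
    return Theta_hat, V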


Optimism principle

Optimism Principle: Let C_t(δ) be a confidence set for the unknown parameters. Choose the control which gives rise to the best performance.

For a given Θ, let J(Θ) be the optimal average cost of the linear system with parameter Θ, and let π_Θ be the corresponding optimal policy. Choose

    Θ_t = argmin_{Θ ∈ C_t(δ)} J(Θ)   and   u_t = π_{Θ_t}(x_t) .

Caveats:
• J(Θ), π_Θ can be ill-defined
• Need a restriction on the allowed set of parameters
• Finding Θ_t is a potentially difficult optimization problem
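A rough sketch of the optimistic step. J(Θ) is computed as trace(P(Θ)) via the discrete-time Riccati equation (valid here because E[w w^⊤] = I_n), and the argmin over C_t(δ) is replaced by a scan over a finite list of candidate (A, B) pairs assumed to lie in the confidence set, which is only a crude stand-in for the optimization problem flagged in the caveats.

import numpy as np
from scipy.linalg import solve_discrete_are

def average_cost_and_gain(A, B, Q, R):
    # Optimal average cost J(Theta) = trace(P) and gain K (u = -K x) for parameters (A, B).
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return np.trace(P), K

def optimistic_choice(candidates, Q, R):
    # candidates: finite list of (A, B) pairs assumed to lie in C_t(delta).
    return min(candidates, key=lambda AB: average_cost_and_gain(AB[0], AB[1], Q, R)[0])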


Avoiding frequent changes
• Frequent changes are unnecessary
• Saving computation ⇒ going green!?
• Frequent changes might be a problem (avoiding frequent changes helps with the proof)


How to choose the confidence set?

Adventures in self-normalized processes, “method of mixtures” ⇒ de la Peña et al. (2009)

Theorem. Let z_t^⊤ = (x_t^⊤, u_t^⊤) ∈ ℝ^{n+d}. Let Θ̂_t be the ridge-regression parameter estimate with regularization coefficient λ > 0, and let V_t = λI + ∑_{i=0}^{t−1} z_i z_i^⊤ be the covariance matrix. Then, for any 0 < δ < 1, with probability at least 1 − δ,

    trace( (Θ̂_t − Θ_*)^⊤ V_t (Θ̂_t − Θ_*) ) ≤ ( d √( 2 log( det(V_t)^{1/2} det(λI)^{−1/2} / δ ) ) + λ^{1/2} S )² .


Construction of confidence sets

An ellipsoid centred at Θ̂_t:

    trace( (Θ − Θ̂_t)^⊤ V_t (Θ − Θ̂_t) ) ≤ β_t .

[Figure: the confidence ellipsoid around the estimate Θ̂_t.]
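The ellipsoid is easy to test for membership once the radius β_t is given (for instance, from the theorem above). A small sketch, with shapes following the (n+d) × n convention used in the estimation sketch earlier:

import numpy as np

def in_confidence_set(Theta, Theta_hat, V, beta_t):
    # Theta, Theta_hat: (n+d, n) matrices; V: (n+d, n+d); beta_t: given radius.
    D = Theta - Theta_hat
    return float(np.trace(D.T @ V @ D)) <= beta_t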


The algorithm

Inputs: T, S > 0, δ > 0, Q, L.
Set V_0 = I and Θ̂_0 = 0, (A_0, B_0) = Θ_0 = argmin_{Θ ∈ C_0(δ)} J(Θ).
for t := 0, 1, 2, . . . do
    Calculate Θ_t = argmin_{Θ ∈ C_t(δ)} J(Θ).
    Calculate u_t based on the current parameters, u_t = K(Θ_t) x_t.
    Execute the control, observe the new state x_{t+1}.
    V_{t+1} := V_t + z_t z_t^⊤, where z_t^⊤ = (x_t^⊤, u_t^⊤).
end for


Proof sketch
• Fix T > 0.
• With high probability, the state stays O(log T). ⇐ most of the work is here
• Decompose the regret
• Analyze each term


Regret decomposition

Dynamic programming equations, E[w_{t+1} | F_t] = 0, algebra…

    R_1 = ∑_{t=0}^T { x_t^⊤ P(Θ_t) x_t − E[ x_{t+1}^⊤ P(Θ_{t+1}) x_{t+1} | F_t ] }

    R_2 = ∑_{t=0}^T E[ x_{t+1}^⊤ { P(Θ_{t+1}) − P(Θ_t) } x_{t+1} | F_t ]

    R_3 = ∑_{t=0}^T z_t^⊤ ( Θ_*^⊤ P(Θ_t) Θ_* − Θ_t^⊤ P(Θ_t) Θ_t ) z_t .

    ∑_{t=0}^T ( x_t^⊤ Q x_t + u_t^⊤ R u_t ) = ∑_{t=0}^T J(Θ_t) + R_1 + R_2 + R_3
                                            ≤ T J(Θ_*) + R_1 + R_2 + R_3 .


Term R_1

    R_1 = ∑_{t=0}^T { x_t^⊤ P(Θ_t) x_t − E[ x_{t+1}^⊤ P(Θ_{t+1}) x_{t+1} | F_t ] }

• Regrouping
• Martingale difference sequence
• State does not explode


Term R_3

    R_3 = ∑_{t=0}^T z_t^⊤ ( Θ_*^⊤ P(Θ_t) Θ_* − Θ_t^⊤ P(Θ_t) Θ_t ) z_t .

Algebra… reduce to

    O(√T) + ( ∑_t ‖ P(Θ_t) (Θ_t − Θ_*)^⊤ z_t ‖² )^{1/2}

• More algebra…
• Choice of confidence set
• State does not explode

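A sketch of how the confidence set controls the remaining sum follows; here \hat\Theta_t (the least-squares estimate) and \beta_t(\delta) (the confidence-set radius) are my notation for the set C_t(\delta), not symbols from the slide.

\bigl\|(\Theta_t - \Theta_*)^\top z_t\bigr\| \le \bigl\|V_t^{1/2}(\Theta_t - \Theta_*)\bigr\| \, \|z_t\|_{V_t^{-1}}, \qquad \|z\|_{V^{-1}}^2 := z^\top V^{-1} z .

On the event \Theta_* \in C_t(\delta), both \Theta_t and \Theta_* lie within \sqrt{\beta_t(\delta)} of \hat\Theta_t in the V_t-weighted norm, so the first factor is at most 2\sqrt{\beta_t(\delta)}. The elliptical potential lemma,

\sum_{t=0}^{T} \min\bigl\{1, \|z_t\|_{V_t^{-1}}^2\bigr\} \le 2 \log\frac{\det V_{T+1}}{\det V_0} = O\bigl((n+d)\log T\bigr) \quad \text{when } \|z_t\| \text{ stays bounded},

then shows that \sum_t \|P(\Theta_t)(\Theta_t - \Theta_*)^\top z_t\|^2 grows only polylogarithmically in T (up to the factor \beta_T(\delta) and the usual handling of the \min\{1,\cdot\} truncation), so the square-root term above is of lower order than \sqrt{T}.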


Term R2

R_2 = \sum_{t=0}^{T} E\bigl[ x_{t+1}^\top \bigl\{ P(\Theta_{t+1}) - P(\Theta_t) \bigr\}\, x_{t+1} \,\big|\, \mathcal{F}_t \bigr]

We cannot analyze this algorithm as it stands! What if we change the policy only rarely?



Change the policy only when the determinant of the confidence ellipsoid doubles.

τ_s: the time of the s-th policy change.

There are O(\log T) policy changes up to time T, and

\sum_{t=0}^{T} E\bigl[ x_{t+1}^\top \bigl( P(\Theta_{t+1}) - P(\Theta_t) \bigr)\, x_{t+1} \,\big|\, \mathcal{F}_t \bigr] \le O(\log T).

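A brief sketch of why the doubling rule caps the number of policy changes; writing n and d for the state and control dimensions, the bound Z on \|z_t\| is an assumption tied to the "state does not explode" event.

Since \det V_0 = \det I = 1 and each policy change at least doubles the determinant relative to the previous change, the number of changes up to time T is at most \log_2 \det V_{T+1}. By the AM–GM inequality,

\det V_{T+1} \le \Bigl(\frac{\operatorname{trace} V_{T+1}}{n+d}\Bigr)^{n+d} \le \Bigl(1 + \frac{(T+1)Z^2}{n+d}\Bigr)^{n+d} \quad \text{if } \|z_t\| \le Z,

so the number of policy changes is O\bigl((n+d)\log(T Z^2)\bigr) = O(\log T).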


The algorithm

Inputs: T, S > 0, δ > 0, Q, L.
Set V_0 = I and \hat\Theta_0 = 0; let (A_0, B_0) = \Theta_0 = \arg\min_{\Theta \in C_0(\delta)} J_*(\Theta).
for t := 0, 1, 2, ... do
    if det(V_t) > 2 det(V_0) then
        Recalculate the parameters: \Theta_t = \arg\min_{\Theta \in C_t(\delta)} J_*(\Theta).
        Let V_0 = V_t.
    else
        \Theta_t = \Theta_{t-1}.
    end if
    Calculate u_t based on the current parameters: u_t = K(\Theta_t)\, x_t.
    Execute the control and observe the new state x_{t+1}.
    V_{t+1} := V_t + z_t z_t^\top, where z_t^\top = (x_t^\top, u_t^\top).
end for

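Below is a minimal Python sketch of this loop, assuming numpy and scipy are available. It keeps the lazy "determinant doubles" rule but, to stay short, replaces the expensive optimistic step (the argmin over C_t(δ)) with a certainty-equivalent least-squares estimate plus a small dither on the control for excitation. That is a deliberate simplification, not the algorithm analyzed in the talk, and the names (lazy_ce_lq, lqr_gain, explore_std, the toy system) are illustrative.

# Minimal sketch of the lazy update loop (certainty equivalence + dither,
# NOT the optimistic argmin over C_t(delta) used on the slide).
import numpy as np
from scipy.linalg import solve_discrete_are


def lqr_gain(A, B, Q, R):
    """Solve the discrete-time Riccati equation; u = -K x is optimal for (A, B, Q, R)."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K


def lazy_ce_lq(A_star, B_star, Q, R, T=5000, noise_std=0.1, explore_std=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A_star.shape[0], B_star.shape[1]
    V = np.eye(n + d)             # V_0 = I, as in the slide
    S = np.zeros((n + d, n))      # running sum of z_t x_{t+1}^T for least squares
    logdet_ref = 0.0              # log det(V) at the last policy change
    x = np.zeros(n)
    # Arbitrary stabilizable initial guess, standing in for the argmin over C_0(delta).
    A_hat, B_hat = 0.5 * np.eye(n), np.eye(n, d)
    _, K = lqr_gain(A_hat, B_hat, Q, R)
    total_cost = 0.0
    for t in range(T):
        logdet = np.linalg.slogdet(V)[1]
        if logdet > logdet_ref + np.log(2.0):        # lazy rule: det(V) has doubled
            Theta = np.linalg.solve(V, S)            # ridge least-squares estimate
            A_hat, B_hat = Theta[:n].T, Theta[n:].T  # Theta^T = [A_hat  B_hat]
            try:
                _, K = lqr_gain(A_hat, B_hat, Q, R)
            except (np.linalg.LinAlgError, ValueError):
                pass                                 # keep the previous policy
            logdet_ref = logdet
        # u = -K x (the slide's K(Theta) absorbs the sign); the dither adds excitation.
        u = -K @ x + explore_std * rng.standard_normal(d)
        total_cost += x @ Q @ x + u @ R @ u
        z = np.concatenate([x, u])
        x = A_star @ x + B_star @ u + noise_std * rng.standard_normal(n)
        V += np.outer(z, z)
        S += np.outer(z, x)
    return total_cost / T


if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])   # marginally stable toy system
    B = np.array([[0.0], [0.1]])
    print("average cost:", lazy_ce_lq(A, B, Q=np.eye(2), R=np.eye(1)))

The dither stands in for the exploration that optimism provides; without either, the closed-loop data need not identify B, which is exactly why the confidence-set machinery above is needed for a regret guarantee.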


Theorem
With probability at least 1 − δ, the regret of the algorithm is bounded as follows:

R(T) = O\bigl(\sqrt{T \log(1/\delta)}\bigr).

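For orientation (a recap, not a line from the slide), the pieces combine with the earlier decomposition as follows, measuring the regret against T times the optimal average cost:

R(T) = \sum_{t=0}^{T}\bigl(x_t^\top Q\, x_t + u_t^\top R\, u_t\bigr) - T\, J(\Theta_*) \le R_1 + R_2 + R_3,

and on a single event of probability at least 1 − δ (on which Θ_* lies in every confidence set and the state does not explode), R_1 = \tilde O(\sqrt{T}) by the martingale argument, R_2 = O(\log T) by the lazy policy updates, and R_3 = \tilde O(\sqrt{T}) by the confidence-set argument, which gives the stated bound.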


Conclusions

• First regret result for the problem of linear optimal control.
• The algorithm is too expensive! Does there exist a cheaper alternative with similar guarantees?
• Relaxing the martingale noise assumption? (k-th order Markov noise? ARMA?)
• Extension to linearly parameterized systems, x_{t+1} = θ^\top φ(x_t, u_t) + w_{t+1}? Planning? Learning?
• Unrealizable case?

