4SC000 Q2 2017-2018 Optimal Control and Dynamic Programming Duarte Antunes


Page 1

4SC000 Q2 2017-2018

Optimal Control and Dynamic Programming

Duarte Antunes

Page 2

Part II
Stage decision problems

Page 3

Recap

Optimal control formulation
• Dynamic model & cost function (transition diagram for discrete optimization problems).
• Computing an optimal policy vs computing an optimal path.

Dynamic programming algorithm
• Allows computing policies (to deal with uncertainty).
• Equivalent way to write it: the DP equation.
• Stochastic dynamic programming: computes a policy that minimizes an expected cost.

Alternative algorithms
• To compute optimal paths, alternative algorithms (e.g. Dijkstra's) may be more efficient.

Partial information
• When there is only partial information about the state, rely on the Bayes filter.

Page 4

Goals of part II

Introduce optimal control concepts for stage decision problems.

|                              | Discrete optimization problems           | Stage decision problems                  |
| Formulation                  | Transition diagram                       | Dynamic system & additive cost function  |
| DP algorithm & Stochastic DP | Graphical DP algorithm & DP equation     | DP equation                              |
| Alternative algorithms       | Dijkstra's algorithm                     | Static optimization                      |
| Partial information          | Bayes filter                             | Kalman filter                            |
| Application focus            | Operational research & Computer science  | Digital control                          |

Page 5

Outline

• Dynamic programming for stage decision problems

• Linear quadratic regulator

Page 6

Stage decision problems

Dynamic model: $x_{k+1} = f_k(x_k, u_k)$, $k \in \{0, \ldots, h-1\}$

Cost function: $\sum_{k=0}^{h-1} g_k(x_k, u_k) + g_h(x_h)$

• State and input live in arbitrary spaces, $x_k \in X_k$, $u_k \in U_k(x_k)$.
• If these spaces are discrete this is a discrete optimization problem.
• Typically $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ for every $k \in \{0, \ldots, h-1\}$.
• Goals: find an optimal path and find an optimal policy.

Page 7

Optimal path

• Given an initial condition $x_0$, a path is a set of decisions $\{u_0, u_1, \ldots, u_{h-1}\}$ (equivalently, pairs $\{(x_0, u_0), \ldots, (x_{h-1}, u_{h-1})\}$) such that $u_k \in U_k(x_k)$ and the states satisfy the equations of the dynamic model,
$x_1 = f_0(x_0, u_0)$, $x_2 = f_1(x_1, u_1)$, ..., $x_h = f_{h-1}(x_{h-1}, u_{h-1})$,
with stage costs $g_0(x_0, u_0), g_1(x_1, u_1), \ldots, g_{h-1}(x_{h-1}, u_{h-1})$ and terminal cost $g_h(x_h)$.

[Figure: transition diagram with state sets $X_0, X_1, \ldots, X_{h-1}, X_h$ at stages $0, 1, \ldots, h-1, h$]

• A path is said to be optimal if there does not exist another path with a strictly smaller cost.

Page 8

Optimal policy

Policy: A policy is a set of functions $\pi = \{\mu_0, \ldots, \mu_{h-1}\}$, $\mu_k : X_k \to U_k$.

Optimal policy: A policy is said to be optimal if, for every state $x_\ell$ at every stage $\ell \in \{0, \ldots, h-1\}$, $\mu_\ell(x_\ell)$ is the first action of an optimal path for the tail subproblem which considers only stages $\{\ell, \ell+1, \ldots, h\}$ with initial condition $x_\ell$ and cost
$\sum_{k=\ell}^{h-1} g_k(x_k, u_k) + g_h(x_h)$.

Page 9

Dynamic programming algorithm

Start with $J_h(x_h) = g_h(x_h)$ for every $x_h \in X_h$ and, for each decision stage, starting from the last and moving backwards, $k \in \{h-1, h-2, \ldots, 0\}$, compute $J_k$ and $\mu_k$ from

$J_k(x_k) = \min_{u_k \in U_k(x_k)} g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k))$   (DP equation)

and

$J_k(x_k) = g_k(x_k, \mu_k(x_k)) + J_{k+1}(f_k(x_k, \mu_k(x_k)))$,

where $\mu_k(x_k) = u_k$ is the minimizer in the DP equation. Then $\{\mu_0, \ldots, \mu_{h-1}\}$ is an optimal policy.

Theorem: The policy obtained with the DP algorithm is an optimal policy (proof in the appendix).
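
When the state and input spaces are (or are made) finite, the recursion above can be carried out numerically. The sketch below is a minimal MATLAB illustration, not part of the slides: the grids, the linear interpolation of $J_{k+1}$ and the use of the scalar integrator of the next example are choices made here.

% Minimal sketch: backward DP on discretized state/input grids (assumptions:
% scalar integrator dynamics and quadratic costs, as in the next example).
X  = -3:0.1:3;            % state grid
U  = -2:0.1:2;            % input grid
h  = 2;                   % horizon
f  = @(x,u) x + u;        % dynamic model
g  = @(x,u) x.^2 + u.^2;  % stage cost
gh = @(x) x.^2;           % terminal cost
J  = gh(X);               % J_h evaluated on the grid
mu = zeros(h, numel(X));  % mu(k+1,i): decision at stage k for state X(i)
for k = h-1:-1:0
    Jnew = zeros(size(X));
    for i = 1:numel(X)
        xnext = f(X(i), U);                           % successor states
        Jnext = interp1(X, J, xnext, 'linear', inf);  % cost-to-go (inf off-grid)
        [Jnew(i), idx] = min(g(X(i), U) + Jnext);     % DP equation
        mu(k+1, i) = U(idx);
    end
    J = Jnew;
end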

Page 10

Simple integrator example

Consider the following simple example of a stage decision problem.

Dynamic model: $x_{k+1} = x_k + u_k$, $k \in \{0, 1\}$

Cost: $\sum_{k=0}^{1} \left( x_k^2 + u_k^2 \right) + g_2(x_2)$

Terminal cost:
• Quadratic: $g_2(x_2) = x_2^2$
• Non-quadratic: $g_2(x_2) = e^{x_2}$

Page 11

Quadratic terminal cost
Step 1

$J_1(x_1) = \min_{u_1} g_1(x_1, u_1) + g_2(x_2) = \min_{u_1} x_1^2 + u_1^2 + x_2^2$, with $x_2 = x_1 + u_1$
$\phantom{J_1(x_1)} = \min_{u_1} \underbrace{2(x_1^2 + u_1^2 + x_1 u_1)}_{Q_{x_1}(u_1),\ \text{a quadratic function of } u_1}$

[Figure: $Q_{x_1}(u_1)$ as a function of $u_1$]

How to compute the minimum? Differentiate and equate to zero to find the minimizer:

$\frac{d}{du_1} Q_{x_1}(u_1) = 0 \Leftrightarrow 2(2u_1 + x_1) = 0 \Leftrightarrow u_1 = -\tfrac{1}{2} x_1$

Replacing in $Q_{x_1}(u_1)$ we obtain the cost-to-go $J_1(x_1) = \tfrac{3}{2} x_1^2$.

Page 12

Quadratic terminal cost
Step 2

$J_0(x_0) = \min_{u_0} g_0(x_0, u_0) + J_1(x_1)$, with $x_1 = x_0 + u_0$
$\phantom{J_0(x_0)} = \min_{u_0} x_0^2 + u_0^2 + \tfrac{3}{2} x_1^2 = \min_{u_0} \tfrac{5}{2} x_0^2 + 3 u_0 x_0 + \tfrac{5}{2} u_0^2$

Differentiating and equating to zero we obtain a function belonging to the optimal policy,

$u_0 = -\tfrac{3}{5} x_0$,

which leads to the cost-to-go $J_0(x_0) = \tfrac{8}{5} x_0^2$.

Page 13

Optimal policy and optimal path

Optimal policy: $u_0 = -\tfrac{3}{5} x_0$, $u_1 = -\tfrac{1}{2} x_1$

Optimal path for $x_0 = 1$, computed by using $x_{k+1} = x_k + u_k$, $k \in \{0, 1\}$:

$u_0 = -\tfrac{3}{5}$, $u_1 = -\tfrac{1}{5}$, $x_0 = 1$, $x_1 = \tfrac{2}{5}$, $x_2 = \tfrac{1}{5}$

Optimal cost: $J_0(1) = \tfrac{8}{5}$
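
These numbers are easy to check numerically. The brute-force sketch below is an illustration added here (the grid of candidate decisions is an arbitrary choice); it should return $u_0 \approx -0.6$ and $J_0(1) \approx 1.6$.

% Sketch: brute-force check of the two-stage integrator solution.
u  = -1:1e-4:1;                                   % candidate decisions
J1 = @(x1) min(x1.^2 + u.^2 + (x1 + u).^2);       % cost-to-go at stage 1
[J0, i0] = min(1 + u.^2 + arrayfun(J1, 1 + u));   % stage 0 with x0 = 1
fprintf('u0 = %.3f, J0(1) = %.3f\n', u(i0), J0);  % expect -0.600 and 1.600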

Page 14

Non-quadratic terminal cost

Let us try to apply the dynamic programming algorithm considering a non-quadratic terminal cost $g_2(x_2) = e^{x_2}$.

Step 1:

$J_1(x_1) = \min_{u_1} g_1(x_1, u_1) + g_2(x_2) = \min_{u_1} x_1^2 + u_1^2 + e^{x_2}$, with $x_2 = x_1 + u_1$

Differentiating and equating to zero, we obtain

$2 u_1 + e^{x_1 + u_1} = 0$.

We get stuck:
• this equation implicitly determines $u_1$ from $x_1$, but there is no explicit form;
• this implies that it is not easy to determine $u_1 = \mu_1(x_1)$ and move to step 2.
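
The implicit equation can still be solved numerically for each value of $x_1$, which already hints at the approximation techniques discussed next. A minimal sketch added here (the use of fzero, the grid of $x_1$ values and the starting point are choices made for illustration); the left-hand side is strictly increasing in $u_1$, so the root is unique.

% Sketch: solve 2*u1 + exp(x1 + u1) = 0 numerically on a grid of x1 values.
x1grid = linspace(-2, 2, 41);
mu1 = zeros(size(x1grid));
for i = 1:numel(x1grid)
    x1 = x1grid(i);
    mu1(i) = fzero(@(u1) 2*u1 + exp(x1 + u1), 0);  % first-order condition
end
plot(x1grid, mu1)   % mu1 is no longer a linear function of x1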

Page 15

Discussion

Linear dynamic models, quadratic cost
• For these problems, we can explicitly obtain the optimal policy, as shown next.

Non-linear dynamic models and/or non-quadratic cost
• It is in general very hard to apply DP and hence to obtain optimal policies.
• This leads to approximation techniques such as discretization.
• Another class of approximation techniques will be addressed in the next lectures.

Page 16

Outline

• Dynamic programming for stage decision problems

• Linear quadratic regulator

Page 17

Linear quadratic regulator

Given

Dynamic model: $x_{k+1} = A_k x_k + B_k u_k$, $k \in \{0, \ldots, h-1\}$

Cost function: $\sum_{k=0}^{h-1} \begin{bmatrix} x_k^\top & u_k^\top \end{bmatrix} \begin{bmatrix} Q_k & S_k \\ S_k^\top & R_k \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + x_h^\top Q_h x_h$

Find

Optimal policy: $u_k = \mu_k(x_k)$, $k \in \{0, \ldots, h-1\}$

• This is the finite-horizon linear quadratic optimal control problem in discrete time.
• The solution when $h$ approaches infinity and the matrices in the dynamic model and cost function are time-invariant is the linear quadratic regulator.

Page 18

Remarks

• The linear quadratic regulator is one of the celebrated results in control theory and one of the main achievements of optimal control.
• Assumptions: $Q_h \succeq 0$, $\begin{bmatrix} Q_k & S_k \\ S_k^\top & R_k \end{bmatrix} \succeq 0$ and $R_k > 0$ are symmetric.
• Model and cost are often time-invariant, i.e., $x_{k+1} = A x_k + B u_k$ and $\sum_{k=0}^{h-1} \begin{bmatrix} x_k^\top & u_k^\top \end{bmatrix} \begin{bmatrix} Q & S \\ S^\top & R \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + x_h^\top Q_h x_h$.
• The cost function can result from a continuous-time problem.
• However, in general the cost is specified in discrete time and $Q_k$, $R_k$ are used as tuning knobs to obtain desired specifications (e.g. overshoot, etc.).
• We focus on the stabilization problem, i.e., driving the state to zero.

Page 19

Dynamic programming algorithm

Step 1: $x_h = A_{h-1} x_{h-1} + B_{h-1} u_{h-1}$

$J_{h-1}(x_{h-1}) = \min_{u_{h-1}} \begin{bmatrix} x_{h-1}^\top & u_{h-1}^\top \end{bmatrix} \begin{bmatrix} Q_{h-1} & S_{h-1} \\ S_{h-1}^\top & R_{h-1} \end{bmatrix} \begin{bmatrix} x_{h-1} \\ u_{h-1} \end{bmatrix} + \underbrace{J_h(x_h)}_{x_h^\top Q_h x_h}$

where the terminal cost

$J_h(A_{h-1} x_{h-1} + B_{h-1} u_{h-1}) = (A_{h-1} x_{h-1} + B_{h-1} u_{h-1})^\top Q_h (A_{h-1} x_{h-1} + B_{h-1} u_{h-1})$

is a quadratic function of $u_{h-1}$. Then

$J_{h-1}(x_{h-1}) = \min_{u_{h-1}} \; x_{h-1}^\top \left( A_{h-1}^\top Q_h A_{h-1} + Q_{h-1} \right) x_{h-1} + 2 u_{h-1}^\top \left( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \right) x_{h-1} + u_{h-1}^\top \left( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \right) u_{h-1}$

Page 20

Minimizing a quadratic function in $\mathbb{R}^n$

$\min_{u \in \mathbb{R}^n} J(u)$, where $J(u) = u^\top X u + 2 u^\top y + z$ and $X > 0$.

Unique minimizer: $\nabla J(u) = 0 \Leftrightarrow 2 X u + 2 y = 0 \Leftrightarrow u = -X^{-1} y$

Minimum: $J(-X^{-1} y) = y^\top X^{-1} X X^{-1} y - 2 y^\top X^{-1} y + z = z - y^\top X^{-1} y$

[Figure: surface plot of a convex quadratic function of two variables]
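
Equivalently, the same minimizer and minimum follow by completing the square, an identity written out here for reference (not on the slide):

$J(u) = (u + X^{-1} y)^\top X (u + X^{-1} y) + z - y^\top X^{-1} y$

Since $X > 0$, the first term is nonnegative and vanishes exactly at $u = -X^{-1} y$, so this is the unique minimizer and the minimum is $z - y^\top X^{-1} y$.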

Page 21

Dynamic Programming

Step 1 (continued). In the expression

$J_{h-1}(x_{h-1}) = \min_{u_{h-1}} \; \underbrace{x_{h-1}^\top \left( A_{h-1}^\top Q_h A_{h-1} + Q_{h-1} \right) x_{h-1}}_{z} + 2 u_{h-1}^\top \underbrace{\left( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \right) x_{h-1}}_{y} + u_{h-1}^\top \underbrace{\left( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \right)}_{X} u_{h-1}$

we can identify $X$, $y$ and $z$ and apply the previous result.

Policy:

$u_{h-1} = -X^{-1} y = -\left( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \right)^{-1} \left( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \right) x_{h-1}$

Cost-to-go:

$J_{h-1}(x_{h-1}) = z - y^\top X^{-1} y = x_{h-1}^\top \left( A_{h-1}^\top Q_h A_{h-1} + Q_{h-1} \right) x_{h-1} - x_{h-1}^\top \left( S_{h-1} + A_{h-1}^\top Q_h B_{h-1} \right) \left( B_{h-1}^\top Q_h B_{h-1} + R_{h-1} \right)^{-1} \left( S_{h-1}^\top + B_{h-1}^\top Q_h A_{h-1} \right) x_{h-1}$

Page 22

Dynamic Programming

Step 2: $x_{h-1} = A_{h-2} x_{h-2} + B_{h-2} u_{h-2}$

$J_{h-2}(x_{h-2}) = \min_{u_{h-2}} \begin{bmatrix} x_{h-2}^\top & u_{h-2}^\top \end{bmatrix} \begin{bmatrix} Q_{h-2} & S_{h-2} \\ S_{h-2}^\top & R_{h-2} \end{bmatrix} \begin{bmatrix} x_{h-2} \\ u_{h-2} \end{bmatrix} + \underbrace{J_{h-1}(x_{h-1})}_{x_{h-1}^\top P_{h-1} x_{h-1}}$

Since the cost-to-go is quadratic (as the terminal cost), we can apply the same reasoning and obtain

$J_{h-2}(x_{h-2}) = x_{h-2}^\top P_{h-2} x_{h-2}$, $\quad u_{h-2} = K_{h-2} x_{h-2}$,

$P_{h-2} = A_{h-2}^\top P_{h-1} A_{h-2} + Q_{h-2} - \left( S_{h-2} + A_{h-2}^\top P_{h-1} B_{h-2} \right) \left( B_{h-2}^\top P_{h-1} B_{h-2} + R_{h-2} \right)^{-1} \left( S_{h-2}^\top + B_{h-2}^\top P_{h-1} A_{h-2} \right)$

$K_{h-2} = -\left( B_{h-2}^\top P_{h-1} B_{h-2} + R_{h-2} \right)^{-1} \left( S_{h-2}^\top + B_{h-2}^\top P_{h-1} A_{h-2} \right)$

Page 23

Dynamic Programming

Step $h-k$: $x_{k+1} = A_k x_k + B_k u_k$

$J_k(x_k) = \min_{u_k} \begin{bmatrix} x_k^\top & u_k^\top \end{bmatrix} \begin{bmatrix} Q_k & S_k \\ S_k^\top & R_k \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + \underbrace{J_{k+1}(x_{k+1})}_{x_{k+1}^\top P_{k+1} x_{k+1}}$

which yields $J_k(x_k) = x_k^\top P_k x_k$ and $u_k = K_k x_k$, where

Riccati equation:
$P_k = A_k^\top P_{k+1} A_k + Q_k - \left( S_k + A_k^\top P_{k+1} B_k \right) \left( B_k^\top P_{k+1} B_k + R_k \right)^{-1} \left( S_k^\top + B_k^\top P_{k+1} A_k \right)$

$K_k = -\left( B_k^\top P_{k+1} B_k + R_k \right)^{-1} \left( S_k^\top + B_k^\top P_{k+1} A_k \right)$

Thus, simply iterate these equations for $k \in \{h-1, \ldots, 1, 0\}$, starting with $P_h = Q_h$, to obtain the optimal policy $u_k = K_k x_k$.

The optimal cost for a given initial condition $x_0$ is $J_0(x_0) = x_0^\top P_0 x_0$.
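
This backward recursion is a few lines of MATLAB. The helper below is a minimal sketch added here (the function name, the time-invariant matrices and the use of a cell array for the gains are choices made for illustration, not part of the slides):

% Sketch: backward Riccati recursion for a finite horizon h (time-invariant data).
function [K, P0] = finite_horizon_lqr(A, B, Q, R, S, Qh, h)
    P = Qh;                                        % P_h = Q_h
    K = cell(h, 1);                                % K{k+1} stores K_k
    for k = h-1:-1:0
        X = B'*P*B + R;
        K{k+1} = -(X \ (S' + B'*P*A));             % gain K_k
        P = A'*P*A + Q - (S + A'*P*B) * (X \ (S' + B'*P*A));  % P_k
    end
    P0 = P;                                        % J_0(x_0) = x_0' * P0 * x_0
end

With [K, P0] = finite_horizon_lqr(A, B, Q, R, S, Qh, h), the decision at stage k for state x would be u = K{k+1} * x.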

Page 24

Example: double integrator

Consider a double integrator: a mass at position $y$ pushed by a force $F$, with acceleration $\ddot y(t) = \frac{F(t)}{m} =: u(t)$.

Continuous-time model, with state $x(t) = \begin{bmatrix} y(t) & v(t) \end{bmatrix}^\top$:

$\frac{d}{dt}\begin{bmatrix} y(t) \\ v(t) \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} y(t) \\ v(t) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u(t)$

Discretization with sampling period $\tau$:

$x_{k+1} = e^{\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \tau} x_k + \int_0^\tau e^{\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} r} dr \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k = \begin{bmatrix} 1 & \tau \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} \frac{\tau^2}{2} \\ \tau \end{bmatrix} u_k$
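
For reference, the same zero-order-hold discretization can be obtained numerically. A sketch added here using c2d from the Control System Toolbox (tau = 0.2 as on the next slide):

% Sketch: ZOH discretization of the double integrator.
Ac = [0 1; 0 0]; Bc = [0; 1];
tau = 0.2;
sysd = c2d(ss(Ac, Bc, eye(2), 0), tau);
A = sysd.a    % expected [1 tau; 0 1]
B = sysd.b    % expected [tau^2/2; tau]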

Page 25

Example: double integrator

Qualitative goal: drive the mass to position zero in a fast way but with reasonable actuation values.

Dynamic model: $x_{k+1} = \begin{bmatrix} 1 & \tau \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} \frac{\tau^2}{2} \\ \tau \end{bmatrix} u_k$, with $\tau = 0.2$

To achieve this goal let us start with this cost function,

$\sum_{k=0}^{h-1} \left( x_k^\top Q x_k + u_k^\top R u_k \right) + x_h^\top Q_h x_h$, with $Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $R = 1$, $Q_h = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix}$, $h = 5$,

and then tune these parameters to improve the results.

Page 26

Dynamic programming

Iterate the following equations for $k \in \{4, 3, 2, 1, 0\}$ to obtain the optimal policy, starting with $P_5 = Q_5 = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix}$:

$K_k = -\left( B_k^\top P_{k+1} B_k + R_k \right)^{-1} \left( S_k^\top + B_k^\top P_{k+1} A_k \right)$

$P_k = A_k^\top P_{k+1} A_k + Q_k - \left( S_k + A_k^\top P_{k+1} B_k \right) \left( B_k^\top P_{k+1} B_k + R_k \right)^{-1} \left( S_k^\top + B_k^\top P_{k+1} A_k \right)$

First iteration:

$K_4 = -\left( \begin{bmatrix} 0.02 & 0.2 \end{bmatrix} \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 0.02 \\ 0.2 \end{bmatrix} + 1 \right)^{-1} \begin{bmatrix} 0.02 & 0.2 \end{bmatrix} \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -0.1425 & -1.4530 \end{bmatrix}$

$P_4 = \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix}^\top \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 0.2 \\ 0 & 1 \end{bmatrix}^\top \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 0.02 \\ 0.2 \end{bmatrix} (-K_4) = \begin{bmatrix} 10.9715 & 1.7094 \\ 1.7094 & 8.4359 \end{bmatrix}$

Page 27

Dynamic programming

Next iterations:

$P_3 = \begin{bmatrix} 11.739 & 3.144 \\ 3.144 & 8.0782 \end{bmatrix}$, $P_2 = \begin{bmatrix} 12.188 & 4.311 \\ 4.311 & 8.2725 \end{bmatrix}$, $P_1 = \begin{bmatrix} 12.295 & 5.165 \\ 5.165 & 8.675 \end{bmatrix}$, $P_0 = \begin{bmatrix} 12.121 & 5.702 \\ 5.702 & 9.085 \end{bmatrix}$

$K_3 = \begin{bmatrix} -0.414 & -1.353 \end{bmatrix}$, $K_2 = \begin{bmatrix} -0.638 & -1.368 \end{bmatrix}$, $K_1 = \begin{bmatrix} -0.807 & -1.432 \end{bmatrix}$, $K_0 = \begin{bmatrix} -0.918 & -1.503 \end{bmatrix}$

Optimal policy: $u_k = K_k x_k$, $k \in \{0, 1, \ldots, 4\}$

Optimal path for initial condition $x_0 = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top$ (iterate $u_k = K_k x_k$, $x_{k+1} = A x_k + B u_k$):

$(x_0, u_0) = ([1\ \ 0]^\top, -0.918)$
$(x_1, u_1) = ([0.982\ \ {-0.184}]^\top, -0.529)$
$(x_2, u_2) = ([0.934\ \ {-0.289}]^\top, -0.200)$
$(x_3, u_3) = ([0.8724\ \ {-0.330}]^\top, 0.085)$
$(x_4, u_4) = ([0.8082\ \ {-0.313}]^\top, 0.339)$
$x_5 = [0.7525\ \ {-0.2448}]^\top$
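
A short script reproduces these gains and the simulated path (a sketch added here; variable names are arbitrary):

% Sketch: Riccati recursion and closed-loop simulation for the example above.
tau = 0.2; h = 5;
A = [1 tau; 0 1]; B = [tau^2/2; tau];
Q = eye(2); R = 1;
P = 10*eye(2);                            % P_5 = Q_h
K = zeros(h, 2);
for k = h-1:-1:0
    K(k+1,:) = -(B'*P*B + R) \ (B'*P*A);  % K_k (S = 0 here)
    P = A'*P*A + Q - (A'*P*B) * ((B'*P*B + R) \ (B'*P*A));
end
x = [1; 0];                               % optimal path from x_0 = [1 0]'
for k = 0:h-1
    u = K(k+1,:) * x;
    x = A*x + B*u;                        % ends near [0.7525 -0.2448]'
end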

Page 28

Plots and tuning

[Figure: closed-loop responses $y(t)$, $v(t)$ and $u(t)$ over $t \in [0, 1]$]

Transient responses are still far from the qualitative specifications.

Guidelines to tune the cost
• By increasing the terminal cost one expects that the response gets closer to the desired final position.
• The same is expected by penalizing the position error more relative to the velocity error.
• Decreasing the penalty on the control action will allow more control authority to reach the origin.

Page 29

Increasing terminal cost

[Figure: closed-loop responses $y(t)$, $v(t)$ and $u(t)$ for $Q_h = 100 I$ and for $Q_h = 1000 I$]

Final position error improved by increasing the terminal cost.

Page 30

Changing state cost

[Figure: closed-loop responses $y(t)$, $v(t)$ and $u(t)$ with $Q_h = 100 I$, for $Q = \begin{bmatrix} 10 & 0 \\ 0 & 1 \end{bmatrix}$ and for $Q = \begin{bmatrix} 100 & 0 \\ 0 & 1 \end{bmatrix}$]

Increasing the position cost leads to a smaller position error and a larger velocity.

Page 31

Changing control cost

[Figure: closed-loop responses $y(t)$, $v(t)$ and $u(t)$ with $Q_h = 100 I$, $Q = \begin{bmatrix} 10 & 0 \\ 0 & 1 \end{bmatrix}$, for $R = 0.1$ and for $R = 0.01$]

Decreasing the control penalty leads to fast responses, but large actuation!

Page 32

Cheap control

As $R \to 0$ we obtain deadbeat control: the zero state is achieved in 2 steps.

[Figure: closed-loop responses $y(t)$, $v(t)$ and $u(t)$ in the cheap-control case]

Can we then always drive a mass to zero in two sampling periods?
• No, because this typically requires very large, unfeasible actuations.
• Actuators have limitations which were not incorporated in our linear model.
• In this LQR framework the solution is to increasingly penalize the control input until actuation constraints are met. More on this point later.
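
The deadbeat limit can be illustrated numerically. A sketch added here (the value R = 1e-6, the weight Q = I and the use of dlqr are choices made for illustration): with a very small control penalty the closed-loop state should be close to zero after two steps, at the price of large inputs.

% Sketch: near-deadbeat behavior of the double integrator as R -> 0.
tau = 0.2; A = [1 tau; 0 1]; B = [tau^2/2; tau];
Kd = dlqr(A, B, eye(2), 1e-6);   % very small control penalty; u = -Kd*x
x = [1; 0];
for k = 1:2
    x = (A - B*Kd) * x;          % two closed-loop steps
end
x                                 % should be close to [0; 0]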

Page 33

Increasing the horizon

Let us increase the horizon considering $Q = Q_h = I$, $R = 1$, $x_0 = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top$.

[Figure: closed-loop responses $y(t)$, $v(t)$ and $u(t)$ for $h = 10$ and for $h = 30$]

$h = 10$: cost $J_0(x_0) = x_0^\top P_0 x_0 = 8.347$
$h = 30$: cost $J_0(x_0) = x_0^\top P_0 x_0 = 9.1881$

Page 34

Increasing the horizon

[Figure: closed-loop responses $y(t)$, $v(t)$ and $u(t)$ for $h = 50$ and for $h = 100$]

$h = 50$: $J_0(x_0) = x_0^\top P_0 x_0 = 9.1890$
$h = 100$: $J_0(x_0) = x_0^\top P_0 x_0 = 9.1890$

The cost converges to a constant as the time horizon increases.

Page 35

Discussion

• Since the cost is positive definite, if the horizon is large the optimal input should drive the state to zero to stop paying cost:

$\sum_{k=0}^{h_0} g_k(x_k, K_k x_k) + \underbrace{\sum_{k=h_0+1}^{h-1} g_k(x_k, K_k x_k) + g_h(x_h)}_{\approx\, 0 \text{ since } x_k \approx 0 \text{ and } u_k = K_k x_k}$

• This explains why the cost converges as the horizon increases.
• This reasoning is valid for every initial condition. Thus if $x_0^\top P_0 x_0$ converges as $h \to \infty$ then $P_0$ converges, where $P_0$ results from the recursion, for $k \in \{h-1, h-2, \ldots, 0\}$,

$P_k = A^\top P_{k+1} A + Q - \left( S + A^\top P_{k+1} B \right) \left( B^\top P_{k+1} B + R \right)^{-1} \left( S^\top + B^\top P_{k+1} A \right)$

• Note that we are now considering time-invariant $A_k, B_k, Q_k, R_k, S_k$.

Page 36

Discussion

• For the double integrator example with $Q = Q_h = I$, $R = 1$:

[Figure: the entries $p_{1,k}, p_{2,k}, p_{3,k}$ of $P_k = \begin{bmatrix} p_{1,k} & p_{2,k} \\ p_{2,k} & p_{3,k} \end{bmatrix}$ and the entries $K_{1,k}, K_{2,k}$ of $K_k = \begin{bmatrix} K_{1,k} & K_{2,k} \end{bmatrix}$ plotted against $k$, both converging]

• Let $P$ denote the limit of the recursion

$P_k = A^\top P_{k+1} A + Q - \left( S + A^\top P_{k+1} B \right) \left( B^\top P_{k+1} B + R \right)^{-1} \left( S^\top + B^\top P_{k+1} A \right)$;

then

$P = A^\top P A + Q - \left( S + A^\top P B \right) \left( B^\top P B + R \right)^{-1} \left( S^\top + B^\top P A \right)$.

• Moreover, $K_k \to K = -\left( B^\top P B + R \right)^{-1} \left( S^\top + B^\top P A \right)$.
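
This limit is easy to observe numerically. A sketch added here for the double integrator (the tolerance and the iteration cap are arbitrary choices); the converged cost $x_0^\top P x_0$ for $x_0 = [1\ 0]^\top$ should match the value of about 9.189 reported above.

% Sketch: iterate the Riccati recursion until it (numerically) converges.
tau = 0.2; A = [1 tau; 0 1]; B = [tau^2/2; tau];
Q = eye(2); R = 1; S = zeros(2, 1);
P = Q;                                    % start from the terminal weight Q_h = I
for iter = 1:1000
    Pnew = A'*P*A + Q - (S + A'*P*B) * ((B'*P*B + R) \ (S' + B'*P*A));
    if norm(Pnew - P) < 1e-10, break, end
    P = Pnew;
end
K = -(B'*P*B + R) \ (S' + B'*P*A)         % limiting gain
cost0 = [1 0] * P * [1; 0]                % about 9.189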

Page 37

Infinite horizon LQR

Proposition (special case of [Bertsekas, Sec. 4, Proposition 4.4.1])

Suppose that $(A, B)$ is controllable and $\begin{bmatrix} Q & S \\ S^\top & R \end{bmatrix} > 0$. The optimal policy for the stage decision problem with an infinite number of stages, with dynamic model

$x_{k+1} = A x_k + B u_k$

and cost function

$\sum_{k=0}^{\infty} x_k^\top Q x_k + 2 x_k^\top S u_k + u_k^\top R u_k$,

is given by $u_k = K x_k$, where

$K = -\left( B^\top P B + R \right)^{-1} \left( S^\top + B^\top P A \right)$

and $P$ is the unique positive definite solution to the algebraic Riccati equation

$P = A^\top P A + Q - \left( S + A^\top P B \right) \left( B^\top P B + R \right)^{-1} \left( S^\top + B^\top P A \right)$.

Furthermore, the closed loop $x_{k+1} = (A + BK) x_k$ is exponentially stable.
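
In MATLAB the gain $K$ and the solution $P$ of the algebraic Riccati equation are returned by dlqr from the Control System Toolbox (also used in the inverted pendulum script later on). A standalone sketch for the double integrator, with the sign convention of these slides made explicit:

% Sketch: infinite-horizon LQR gain via dlqr (double integrator, tau = 0.2).
A = [1 0.2; 0 1]; B = [0.02; 0.2];
Q = eye(2); R = 1; S = zeros(2, 1);
[Kdlqr, P] = dlqr(A, B, Q, R, S);   % dlqr returns the gain of u = -Kdlqr*x
K = -Kdlqr;                          % gain in the convention u_k = K x_k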

Page 38

Discussion

• As mentioned in [Bertsekas, Sec. 4.1], we can relax the assumptions that $(A, B)$ is controllable and $Q$ is positive definite.
• For simplicity, throughout the discussion, we assume $S = 0$.
• In fact, if $(A, B)$ is controllable, $R$ is positive definite, $Q = N N^\top$ for a full rank $N \in \mathbb{R}^{n \times r}$, $r \le n$ (not necessarily positive definite if $r < n$), and $(A, N)$ is observable, then the previous theorem still holds.
• Moreover, if we further relax the assumptions to: $R$ is positive definite, $(A, B)$ is stabilizable, $Q = N N^\top$ with $N$ full rank, and $(A, N)$ is detectable, then the theorem still holds except that $P$ is not necessarily positive definite.
• Actually, according to 'Linear Optimal Control', B. D. O. Anderson, J. B. Moore, Sec. 14.1, we just need to ensure that $B^\top Q B + R$ is positive definite and $(A, N)$ is observable to guarantee stability of the closed loop.
• Therefore, we can for instance pick $R = 0$ and $Q$ positive definite and the closed loop is stable (this will not be the case for continuous-time optimal control problems).

Page 39

Inverted pendulum

Linearized model (see [1, p. 32]): a cart of mass $M$ at position $x$, subject to a force $u$ and friction coefficient $b$, carrying an inverted pendulum of mass $m$, inertia $I$ and length $\ell$, with $\theta$ the pendulum angle from the upright position.

Equations of motion (linearized about $\theta = 0$):

$(I + m \ell^2) \ddot{\theta} - m g \ell \theta = m \ell \ddot{x}$
$(M + m) \ddot{x} + b \dot{x} - m \ell \ddot{\theta} = u$

State space, with $q = (I + m \ell^2)(M + m) - m^2 \ell^2$:

$\frac{d}{dt} \begin{bmatrix} x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -\frac{(I + m\ell^2) b}{q} & \frac{m^2 g \ell^2}{q} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & -\frac{m \ell b}{q} & \frac{m g \ell (M + m)}{q} & 0 \end{bmatrix} \begin{bmatrix} x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{I + m\ell^2}{q} \\ 0 \\ \frac{m \ell}{q} \end{bmatrix} u(t)$

[1] Feedback control of dynamic systems, Franklin, Powell, Emami-Naeini

Page 40

Matlab implementation

Model definition:

clear all, close all, clc
% definition of the continuous-time model
m = 0.2; M = 1; b = 0.05; I = 0.01; g = 9.8; l = 0.5;
p = (I+m*l^2)*(M+m) - m^2*l^2;
Ac = [0 1 0 0;
      0 -(I+m*l^2)*b/p (m^2*g*l^2)/p 0;
      0 0 0 1;
      0 -(m*l*b)/p m*g*l*(M+m)/p 0];
Bc = [0; (I+m*l^2)/p; 0; m*l/p];

% discretization
n = 4; tau = 0.1;
sysd = c2d(ss(Ac,Bc,zeros(1,n),0), tau);
A = sysd.a; B = sysd.b;

Controller synthesis:

% LQR control
Q = diag([1 1 1 1]); S = zeros(4,1); R = 1;
K = dlqr(A,B,Q,R,S); K = -K;

% simulation
kend = 10/tau;
x0 = [1 0 0 0]';
x(:,1) = x0;
for k = 1:kend
    u(:,k) = K*x(:,k);
    x(:,k+1) = A*x(:,k) + B*u(:,k);
end
plot((1:kend)*tau, u), figure,
plot((1:kend)*tau, x(3,1:end-1)), figure,
plot((1:kend)*tau, x(1,1:end-1))

Page 41

Time responses

$Q = I$, $S = 0$, $R = 1$, $\tau = 0.1$

[Figure: closed-loop responses of the input $u$, the pendulum angle $\theta$ and the cart position $x$ over $t \in [0, 10]$]

Page 42

Tuning the parameters

Want faster convergence? Reduce the penalty on the control input to increase control authority: $R = 0.01$.

Want to reduce the angle amplitude? Increase the penalty on the angle state: $Q = \mathrm{diag}([1\ 1\ 100\ 1])$.

[Figure: closed-loop responses of $x$, $\theta$ and $u$ over $t \in [0, 10]$ for $R = 0.01$ and for $Q = \mathrm{diag}([1\ 1\ 100\ 1])$]

Page 43

Concluding remarks

To summarise:
• Stage decision problems are extensions of discrete optimization problems in which the state and input spaces can be arbitrary.
• In practice it may be hard to obtain explicit expressions for the costs-to-go.
• When the cost is quadratic and the system is linear, we obtain a framework for state feedback control design for any linear plant.

After this lecture, you should be able to:
• Apply DP to stage-decision problems.
• Solve finite-horizon optimal control problems in discrete-time with a quadratic cost and a linear model by iteratively solving Riccati equations.
• Obtain the linear quadratic regulator using the algebraic Riccati equation for infinite horizon problems.

Page 44

Appendix A Proof of optimality of dynamic programming

Page 45

Proof of optimality

Theorem

The policy obtained with the DP algorithm is an optimal policy.

Proof

We shall prove using induction that $\pi_k := \{\mu_k, \ldots, \mu_{h-1}\}$ obtained by the DP algorithm is an optimal policy for the subproblem from stage $k$ to stage $h-1$ and that $J_k(x_k)$ is the cost of the optimal path starting at $x_k$.

• Step I: Prove this for $k = h-1$.
• Step II: Assume that the induction hypothesis holds for a given $k$ and prove it for $k-1$.

Page 46

Step I

• By construction $\pi_{h-1} = \{\mu_{h-1}\}$ is an optimal policy, as $\mu_{h-1}(x_{h-1})$ is the first decision of the optimal path from stage $h-1$ to stage $h$, since

$\min_{u_{h-1} \in U_{h-1}(x_{h-1})} g_{h-1}(x_{h-1}, u_{h-1}) + J_h(f_{h-1}(x_{h-1}, u_{h-1})) = g_{h-1}(x_{h-1}, \mu_{h-1}(x_{h-1})) + J_h(f_{h-1}(x_{h-1}, \mu_{h-1}(x_{h-1})))$.

• It is also clear that

$J_{h-1}(x_{h-1}) = \min_{u_{h-1} \in U_{h-1}(x_{h-1})} g_{h-1}(x_{h-1}, u_{h-1}) + J_h(f_{h-1}(x_{h-1}, u_{h-1}))$

is the optimal cost for the subproblem with initial condition $x_{h-1}$ at stage $h-1$.

Page 47

Step II

• Assume now that $\pi_{k+1} := \{\mu_{k+1}, \ldots, \mu_{h-1}\}$ is an optimal policy and $J_{k+1}(x_{k+1})$ is the cost of the optimal path which starts at initial state $x_{k+1}$. We shall prove using contradiction that $\pi_k := \{\mu_k, \ldots, \mu_{h-1}\}$ is an optimal policy and $J_k(x_k)$ is the cost of an optimal path which starts at initial state $x_k$.

• Argument using contradiction: if $\pi_k$ is not optimal then there must exist a state $x_k$ such that $\mu_k(x_k)$ is not the first action of the optimal path from stage $k$ to stage $h$, denoted by

$\gamma = \{(x_k, u_k), (x_{k+1}, u_{k+1}), \ldots, x_h\}$, with $u_k \neq \mu_k(x_k)$.

• Since we are assuming that $\pi_{k+1} := \{\mu_{k+1}, \ldots, \mu_{h-1}\}$ is an optimal policy, we must have

$u_{\ell+1} = \mu_{\ell+1}(x_{\ell+1})$ for every $\ell \in \{k, \ldots, h-2\}$.

Page 48

Step II

• The cost of such a path is

$J_\gamma = \sum_{\ell=k}^{h-1} g_\ell(x_\ell, u_\ell) + g_h(x_h) = g_k(x_k, u_k) + \sum_{\ell=k+1}^{h-1} g_\ell(x_\ell, \mu_\ell(x_\ell)) + g_h(x_h) = g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k))$.

• However, the cost of the path which has $\mu_k(x_k)$ as the first decision is less than or equal, a contradiction:

$J_k(x_k) = \min_{u_k} g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) = g_k(x_k, \mu_k(x_k)) + J_{k+1}(f_k(x_k, \mu_k(x_k))) \le g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) = J_\gamma$.