Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 1: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Approximation Dynamic Programming

Presented by Yu-Shun, Wang

Page 2: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 2

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 3: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 3

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 4: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 4

Background

A principal aim of these methods is to address problems with a very large number of states n.

Another aim of the methods of this chapter is to address model-free situations.

i.e., problems where a mathematical model is unavailable or hard to construct.

The system and cost structure may be simulated.

For example, a queueing network with complicated but well-defined service disciplines.

Page 5: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 5

Background

The assumption here is that:

There is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities pij(u).

It also generates a corresponding transition cost g(i, u, j).

It may be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging.
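To make this assumption concrete, the following is a minimal Python sketch of such a simulator interface, together with the averaging idea mentioned above. The arrays P and G, the function names, and the sample count are illustrative placeholders, not part of the presentation; in a truly model-free setting these arrays would be hidden inside the simulator program.

```python
import numpy as np

# Hypothetical model data: P[u, i, j] = p_ij(u) and G[u, i, j] = g(i, u, j).
rng = np.random.default_rng(0)

def simulate_transition(i, u, P, G):
    """Sample a successor state j according to p_ij(u) and return (j, g(i, u, j))."""
    j = rng.choice(P.shape[2], p=P[u, i])
    return j, G[u, i, j]

def estimate_model(i, u, P, G, num_samples=10_000):
    """Estimate the transition probabilities p_ij(u) and the expected stage cost
    at (i, u) by averaging repeated simulated transitions."""
    counts = np.zeros(P.shape[2])
    total_cost = 0.0
    for _ in range(num_samples):
        j, cost = simulate_transition(i, u, P, G)
        counts[j] += 1
        total_cost += cost
    return counts / num_samples, total_cost / num_samples
```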

Page 6: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 6

Background

We will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs.

In another type of method, which we will discuss only briefly, we use a gradient method and simulation data to approximate directly an optimal policy.

Page 7: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 7

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 8: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 8

Policy Evaluation Algorithms

With this class of methods, we aim to approximate the cost function Jμ(i) of a policy μ with a parametric architecture of the form $\tilde J(i, r)$, where r is a parameter vector.

Alternatively, the approximation $\tilde J(i, r)$ may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, for use in one-step or multistep lookahead.

Page 9: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 9

Policy Evaluation Algorithms

We focus primarily on two types of methods. In the first class, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture to the samples.

The second and currently more popular class of methods is called indirect. Here, we obtain the parameter vector r by solving an approximate version of Bellman's equation, restricted to the approximation subspace; in the linear case this is the projected equation

$$\Phi r = \Pi T_\mu(\Phi r),$$

whose projected form is discussed further below.

Page 10: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 10

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 11: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 14

Approximate Policy Iteration

Suppose that the current policy is μ, and for a given r, $\tilde J(i, r)$ is an approximation of Jμ(i). We generate an "improved" policy $\bar\mu$ using the formula

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big(g(i,u,j) + \alpha \tilde J(j, r)\big), \qquad \text{for all } i.$$

When the sequence of policies obtained actually converges to some $\bar\mu$, then it can be proved that $\bar\mu$ is optimal to within

$$\frac{2\alpha\delta}{1-\alpha},$$

where δ bounds the approximation error, $\max_i \big|\tilde J(i, r) - J_\mu(i)\big| \le \delta$ for the policies μ generated by the method.
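The policy improvement step above is easy to state in code. The sketch below assumes a tabular model with arrays P[u, i, j] = pij(u) and G[u, i, j] = g(i, u, j), a discount factor alpha, and a fitted approximator J_tilde(j, r); all of these names are illustrative.

```python
import numpy as np

def policy_improvement(J_tilde, r, P, G, alpha):
    """mu_bar(i) = argmin_u  sum_j p_ij(u) * ( g(i, u, j) + alpha * J_tilde(j, r) )."""
    num_controls, num_states, _ = P.shape
    # Approximate cost-to-go of every successor state under the current parameter r
    J_vals = np.array([J_tilde(j, r) for j in range(num_states)])
    mu_bar = np.zeros(num_states, dtype=int)
    for i in range(num_states):
        q_values = [P[u, i] @ (G[u, i] + alpha * J_vals) for u in range(num_controls)]
        mu_bar[i] = int(np.argmin(q_values))
    return mu_bar
```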

Page 12: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 15

Approximate Policy Iteration

Block diagram of approximate policy iteration

Page 13: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 16

Approximate Policy Iteration

A simulation-based implementation of the algorithm is illustrated in the following figure. It consists of four modules:

The simulator, which, given a state-control pair (i, u), generates the next state j according to the system's transition probabilities.

The decision generator, which generates the control $\bar\mu(i)$ of the improved policy $\bar\mu$ at the current state i for use in the simulator.

The cost-to-go approximator, which is the function $\tilde J(j, r)$ that is used by the decision generator.

The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation $\tilde J(\cdot, \bar r)$ of the cost of $\bar\mu$ (a simulation loop wiring these modules together is sketched below).
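The loop below sketches how the four modules interact during one simulation-based evaluation pass. The module functions passed in (decision generator, simulator, and fitting routine) are hypothetical stand-ins for the components named above, not the presentation's own code.

```python
def approximate_policy_iteration_step(i0, num_steps, r, decision_generator,
                                      simulate_transition, fit_cost_approximation):
    """One evaluation pass: the decision generator picks controls of the improved
    policy using J_tilde(., r); the simulator produces transitions and stage costs;
    the cost approximation algorithm then fits the new parameter vector r_bar."""
    states, costs = [i0], []
    i = i0
    for _ in range(num_steps):
        u = decision_generator(i, r)        # control mu_bar(i) via one-step lookahead on J_tilde(., r)
        j, g = simulate_transition(i, u)    # simulator: j ~ p_ij(u), stage cost g(i, u, j)
        states.append(j)
        costs.append(g)
        i = j
    r_bar = fit_cost_approximation(states, costs)   # cost approximation algorithm
    return r_bar
```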

Page 14: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 17

Approximate Policy Iteration

Page 15: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 18

Approximate Policy Iteration

There are two policies μ and $\bar\mu$, and two parameter vectors r and $\bar r$, involved in this algorithm.

In particular, r corresponds to the current policy μ, and the approximation $\tilde J(\cdot, r)$ is used in the policy improvement to generate the new policy $\bar\mu$.

At the same time, $\bar\mu$ drives the simulation that generates the samples used to determine the parameter $\bar r$, which will be used in the next policy iteration.

Page 16: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 19

Approximate Policy Iteration

[Block diagram: system simulator, decision generator, cost-to-go approximator, and cost approximation algorithm.]

Page 17: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 20

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 18: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 21

Direct and Indirect Approximation

An important generic difficulty with simulation-based policy iteration is that the cost samples are generated under the current policy, which biases the simulation toward the states that policy visits and leaves other states underrepresented.

As a result, the cost-to-go estimates of these states may be highly inaccurate, causing serious errors in the calculation of the improved control policy.

This difficulty is known as inadequate exploration of the system's dynamics, arising from the use of a fixed policy.

Page 19: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 22

Direct and Indirect Approximation

One possibility for adequate exploration is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset.

A related approach, called iterative resampling, is to derive an initial cost evaluation of μ, simulate the next policy obtained on the basis of this initial evaluation to obtain a set of representative states S,

and repeat the evaluation of μ using additional trajectories initiated from S.

Page 20: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 23

Direct and Indirect Approximation

The most straightforward algorithmic approach for approximating the cost function is the direct one.

It finds an approximation $\tilde J \in S$ that matches Jμ best in some normed error sense, i.e., it solves

$$\min_{\tilde J \in S} \|\tilde J - J_\mu\|,$$

where S is the approximation subspace of vectors of the form Φr,

$$S = \{\Phi r \mid r \in \Re^{s}\},$$

spanned by the columns of the n × s matrix Φ. Here, || · || is usually some (weighted) Euclidean norm.

Page 21: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 24

Direct and Indirect Approximation

If the matrix Φ has linearly independent columns, the solution is unique and can also be represented as

$$\Phi r^{*} = \Pi J_\mu,$$

where Π denotes projection with respect to || . || on the subspace S.

A major difficulty is that specific cost function values Jμ(i) can only be estimated through their simulation-generated cost samples.
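For a linear architecture, this minimization and the projection Π reduce to an ordinary weighted least squares problem. The sketch below shows the closed form under a hypothetical weight vector xi; the hard-aggregation features in the usage example are also illustrative.

```python
import numpy as np

def weighted_projection_coefficients(Phi, J_mu, xi):
    """Solve min_r || Phi r - J_mu ||_xi^2, i.e. compute r* with Phi r* = Pi J_mu.
    xi holds the positive weights of the (weighted) Euclidean norm."""
    Xi = np.diag(xi)
    # Normal equations (require Phi to have linearly independent columns):
    # (Phi' Xi Phi) r* = Phi' Xi J_mu
    return np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ J_mu)

# Illustrative use with hard-aggregation features (two groups of states)
Phi = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
J_mu = np.array([1.0, 1.2, 5.0, 5.4])
xi = np.full(4, 0.25)
r_star = weighted_projection_coefficients(Phi, J_mu, xi)   # roughly [1.1, 5.2]
```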

Page 22: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 25

Direct and Indirect Approximation

Page 23: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 26

Direct and Indirect Approximation

An alternative approach, referred to as indirect, is to approximate the solution of Bellman's equation J = TμJ on the subspace S, i.e., to solve the projected equation

$$\Phi r = \Pi T_\mu(\Phi r).$$

We can view this equation as a projected form of Bellman's equation.
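With a linear architecture and a weighted Euclidean projection, substituting TμJ = g + αPJ into the projected equation reduces it to a small linear system in r. The sketch below solves it directly when P and g are available; the names are illustrative, and simulation-based methods would avoid forming these matrices explicitly.

```python
import numpy as np

def solve_projected_bellman(Phi, P, g, alpha, xi):
    """Solve Phi r = Pi T_mu(Phi r) for a linear architecture.
    Substituting T_mu J = g + alpha P J and the weighted projection Pi gives
    the s x s system  Phi' Xi (I - alpha P) Phi r = Phi' Xi g."""
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(g)) - alpha * P) @ Phi
    d = Phi.T @ Xi @ g
    return np.linalg.solve(C, d)
```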

Page 24: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 27

Direct and Indirect Approximation

Solving projected equations as approximations to more complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods.

The use of Monte Carlo simulation ideas, which are central in approximate DP, is an important characteristic that differentiates these methods from classical Galerkin methods.

Page 25: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 28

Direct and Indirect Approximation

Page 26: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 29

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 27: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 30

The Role of Monte Carlo Simulation

The methods of this chapter rely to a large extent on simulation in conjunction with cost function approximation in order to deal with large state spaces.

The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms.

Page 28: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 31

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation

Consider the approximate solution of the Bellman equation that corresponds to a given policy of an n-state discounted problem:

$$J = g + \alpha P J,$$

where P is the transition probability matrix, g is the vector of expected stage costs, and α is the discount factor.

Let us adopt a hard aggregation approach whereby we divide the n states into two disjoint subsets I1 and I2 with I1 ∪ I2 = {1, . . . , n}, and we use the piecewise constant approximation

$$J(i) = \begin{cases} r_1 & \text{if } i \in I_1,\\ r_2 & \text{if } i \in I_2.\end{cases}$$

Page 29: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 32

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation (cont.)

This corresponds to the linear feature-based architecture J ≈ Φr, where Φ is the n × 2 matrix with column components equal to 1 or 0, depending on whether the component corresponds to I1 or I2.

We obtain the approximate equations

$$J(i) = g(i) + \alpha \Big(\sum_{j \in I_1} p_{ij}\Big) r_1 + \alpha \Big(\sum_{j \in I_2} p_{ij}\Big) r_2, \qquad i = 1, \ldots, n.$$

Page 30: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 33

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation (cont.)

We can reduce these to just two equations by forming two weighted sums (with equal weights) of the equations corresponding to the states in I1 and I2, respectively:

$$r_1 = \frac{1}{n_1}\sum_{i \in I_1} J(i), \qquad r_2 = \frac{1}{n_2}\sum_{i \in I_2} J(i),$$

where n1 and n2 are the numbers of states in I1 and I2. We thus obtain the aggregate system of the following two equations in r1 and r2:

$$r_1 = \frac{1}{n_1}\sum_{i \in I_1} g(i) + \alpha\,\frac{1}{n_1}\sum_{i \in I_1}\Big(\sum_{j \in I_1} p_{ij}\Big) r_1 + \alpha\,\frac{1}{n_1}\sum_{i \in I_1}\Big(\sum_{j \in I_2} p_{ij}\Big) r_2,$$

$$r_2 = \frac{1}{n_2}\sum_{i \in I_2} g(i) + \alpha\,\frac{1}{n_2}\sum_{i \in I_2}\Big(\sum_{j \in I_1} p_{ij}\Big) r_1 + \alpha\,\frac{1}{n_2}\sum_{i \in I_2}\Big(\sum_{j \in I_2} p_{ij}\Big) r_2.$$

Page 31: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 34

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation (cont.)

Here the challenge, when the number of states n is very large, is the calculation of the large sums on the right-hand side, which can be of order O(n²).

Simulation allows the approximate calculation of these sums with complexity that is independent of n.

This is similar to the advantage that Monte-Carlo integration holds over numerical integration.
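As a small illustration of this point, the sketch below estimates one of the aggregate coefficients, e.g. (1/n1) Σ_{i∈I1} Σ_{j∈I2} p_ij, by sampling instead of summing O(n²) terms. The simulator callable and the uniform choice of i over I1 reflect the equal weights used above; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate_aggregate_term(simulate_from, I1, I2, num_samples=5_000):
    """Estimate (1/n1) * sum_{i in I1} sum_{j in I2} p_ij by simulation:
    draw i uniformly from I1 (equal weights), simulate one transition j from i,
    and record how often j falls in I2. The number of samples needed for a given
    accuracy does not grow with the number of states n."""
    I2_set = set(I2)
    hits = 0
    for _ in range(num_samples):
        i = rng.choice(I1)          # uniform over I1
        j = simulate_from(i)        # simulator draws j ~ p_i.
        hits += j in I2_set
    return hits / num_samples
```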

Page 32: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 35

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 33: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 36

Batch Gradient Methods for Policy Evaluation

Suppose that the current policy is μ, and for a given r, $\tilde J(i, r)$ is an approximation of Jμ(i). As before, we generate an "improved" policy $\bar\mu$ using the one-step lookahead formula

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big(g(i,u,j) + \alpha \tilde J(j, r)\big).$$

To evaluate approximately $J_{\bar\mu}$, we select a subset of "representative" states S (obtained by simulation), and for each i ∈ S, we obtain M(i) samples of the cost $J_{\bar\mu}(i)$.

The mth such sample is denoted by c(i, m), and it can be viewed as $J_{\bar\mu}(i)$ plus some simulation error.

Page 34: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 37

Batch Gradient Methods for Policy Evaluation

We obtain the corresponding parameter vector $\bar r$ by solving the following least squares problem:

$$\min_{r} \sum_{i \in S} \sum_{m=1}^{M(i)} \big(\tilde J(i, r) - c(i, m)\big)^2.$$

The above problem can be solved exactly if a linear approximation architecture is used.

However, when a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem.
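For a linear architecture $\tilde J(i, r) = \phi(i)' r$, the least squares problem above has a closed-form solution. The sketch below fits $\bar r$ to simulated cost samples c(i, m) with numpy's least squares routine; the feature map and the sample data are illustrative.

```python
import numpy as np

def fit_direct(feature_map, cost_samples):
    """Solve min_r sum_{i in S} sum_{m=1}^{M(i)} ( phi(i)' r - c(i, m) )^2.
    cost_samples maps each representative state i to its list of samples c(i, m)."""
    rows, targets = [], []
    for i, samples in cost_samples.items():
        for c in samples:
            rows.append(feature_map(i))
            targets.append(c)
    r_bar, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return r_bar

# Illustrative use: two features per state, a few noisy cost samples per state
phi = lambda i: np.array([1.0, float(i)])
samples = {0: [1.1, 0.9], 1: [2.2, 1.8, 2.0], 2: [3.1]}
r_bar = fit_direct(phi, samples)
```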

Page 35: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 38

Batch Gradient Methods for Policy Evaluation

Let us focus on an N-transition portion (i0, . . . , iN) of a simulated trajectory, also called a batch. We view the numbers

$$\sum_{t=k}^{N-1} \alpha^{t-k}\, g\big(i_t, \mu(i_t), i_{t+1}\big), \qquad k = 0, \ldots, N-1,$$

as cost samples, one per initial state i0, . . . , iN−1, which can be used for least squares approximation of the parametric architecture $\tilde J(i, r)$:

$$\min_{r} \sum_{k=0}^{N-1} \Big(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k}\, g\big(i_t, \mu(i_t), i_{t+1}\big)\Big)^2.$$

Page 36: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 39

Batch Gradient Methods for Policy Evaluation

One way to solve this least squares problem is to use a gradient method, whereby the parameter r associated with μ is updated at time N by

$$r := r - \gamma \sum_{k=0}^{N-1} \nabla \tilde J(i_k, r)\Big(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k}\, g\big(i_t, \mu(i_t), i_{t+1}\big)\Big).$$

Here, ∇ denotes gradient with respect to r, and γ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment).
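A sketch of one batch gradient iteration under the formula above, assuming a generic differentiable architecture supplied as J_tilde and grad_J_tilde, a simulated batch of states with their stage costs, and a fixed stepsize gamma; all names are illustrative.

```python
import numpy as np

def batch_gradient_update(r, states, stage_costs, J_tilde, grad_J_tilde, alpha, gamma):
    """One iteration of
    r := r - gamma * sum_k grad J_tilde(i_k, r) * ( J_tilde(i_k, r) - sum_{t>=k} alpha^{t-k} g_t ),
    with all gradients evaluated at the value of r existing before the update."""
    N = len(stage_costs)                     # batch (i_0, ..., i_N) has N transitions
    # Discounted cost-to-go samples, computed backwards along the batch
    cost_to_go = np.zeros(N)
    acc = 0.0
    for k in reversed(range(N)):
        acc = stage_costs[k] + alpha * acc
        cost_to_go[k] = acc
    correction = np.zeros_like(r)
    for k in range(N):
        correction += grad_J_tilde(states[k], r) * (J_tilde(states[k], r) - cost_to_go[k])
    return r - gamma * correction
```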

Page 37: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 40

Batch Gradient Methods for Policy Evaluation

The update of r is done after processing the entire batch, and the gradients $\nabla \tilde J(i_k, r)$ are evaluated at the preexisting value of r, i.e., the one before the update.

In a traditional gradient method, the gradient iteration is repeated until convergence to the solution.

Page 38: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 41

Batch Gradient Methods for Policy Evaluation

However, there is an important tradeoff relating to the size N of the batch:

In order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large N,

Yet to keep the work per gradient iteration small, it is necessary to use a small N.

Page 39: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 42

Batch Gradient Methods for Policy Evaluation

To address the issue of size of N, batches may be changed after one or more iterations.

Thus, the N-transition batch comes from a potentially longer simulated trajectory, or from one of many simulated trajectories.

We leave the method for generating simulated trajectories and forming batches open for the moment.

But we note that it influences strongly the result of the corresponding least squares optimization.

Page 40: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 43

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 41: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 44

Incremental Gradient Methods for Policy Evaluation

We now consider a variant of the gradient method called incremental. This method can also be described through the use of N-transition batches.

But we will see that the method is suitable for use with a single very long simulated trajectory, viewed as a single batch.

Page 42: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 45

Incremental Gradient Methods for Policy Evaluation

For a given N-transition batch (i0, . . . , iN), the batch gradient method processes the N transitions all at once, and updates r using the gradient iteration given earlier.

The incremental method instead updates r a total of N times, once after each transition.

Each time, it adds to r the corresponding portion of the gradient that can be calculated using the newly available simulation data.

Page 43: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 46

Incremental Gradient Methods for Policy Evaluation

Thus, after each transition (ik, ik+1):

We evaluate the gradient $\nabla \tilde J(i_k, r)$ at the current value of r.

We sum all the terms that involve the transition (ik, ik+1), and we update r by making a correction along their sum (sketched in code below):

$$r := r - \gamma \Big(\nabla \tilde J(i_k, r)\,\tilde J(i_k, r) - \Big(\sum_{t=0}^{k} \alpha^{k-t}\, \nabla \tilde J(i_t, r)\Big) g\big(i_k, \mu(i_k), i_{k+1}\big)\Big).$$
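A sketch of the incremental variant of the same iteration: the parameter is corrected after every transition, and the discounted sum of past gradients is maintained incrementally. The names are illustrative, with the same assumed architecture interface as the batch sketch.

```python
import numpy as np

def incremental_gradient_pass(r, trajectory, stage_costs, J_tilde, grad_J_tilde, alpha, gamma):
    """Process one batch transition by transition:
    r := r - gamma * ( grad J_tilde(i_k, r) * J_tilde(i_k, r) - z_k * g(i_k, i_{k+1}) ),
    where z_k = sum_{t<=k} alpha^{k-t} grad J_tilde(i_t, r) is accumulated incrementally."""
    z = np.zeros_like(r)                          # running discounted gradient sum
    for k, g_k in enumerate(stage_costs):
        grad_k = grad_J_tilde(trajectory[k], r)   # evaluated at the most recent r
        z = alpha * z + grad_k
        r = r - gamma * (grad_k * J_tilde(trajectory[k], r) - z * g_k)
    return r
```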

Page 44: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 47

Incremental Gradient Methods for Policy Evaluation

By adding the “incremental” correction in the above iteration, we see that after N transitions, all the terms of the batch iteration will have been accumulated.

But there is a difference:

In the incremental version, r is changed during the processing of the batch, and the gradient $\nabla \tilde J(i_t, r)$ is evaluated at the most recent value of r [after the transition (it, it+1)].

Page 45: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 48

Incremental Gradient Methods for Policy Evaluation

By contrast, in the batch version these gradients are evaluated at the value of r prevailing at the end of the batch.

Note that the gradient sum can be conveniently updated following each transition, thereby resulting in an efficient implementation.

Page 46: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 49

Incremental Gradient Methods for Policy Evaluation

It can be seen that because r is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant.

In this case, for each state i, we will have one cost sample for every time when state i is encountered in the simulation.

Accordingly, state i will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.

Page 47: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 50

Incremental Gradient Methods for Policy Evaluation

Generally, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts.

However, the rate of convergence can be very slow.

Page 48: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 51

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 49: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 52

Comparison with our approach

Criterion | Approximation DP | Our Approach

The method of gradient and stepsize determination | Open problems | Collect meaningful and easy-to-obtain information during simulation

The extent of the use of simulation | Calculation only | Coupled with the whole solution approach

Problem structure | Single | Double

Considered target | Estimate the cost from n states | Minimize the aggregate effect of n categories of attackers

Progress | Architecture only | Architecture and implementation

Page 50: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 53

The End