Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 1: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Approximation Dynamic Programming

Presented by Yu-Shun, Wang

Page 2: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 2

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 3: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 3

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 4: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 4

Background

A principal aim of these methods is to address problems with a very large number of states n.

Another aim of the methods of this chapter is to address model-free situations.

i.e., problems where a mathematical model is unavailable or hard to construct.

The system and cost structure may be simulated.

For example, a queueing network with complicated but well-defined service disciplines.

Page 5: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 5

Background

The assumption here is that:

There is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities pij(u).

It also generates a corresponding transition cost g(i, u, j).

It may be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging.
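To make this assumption concrete, the following is a minimal Python sketch of such a simulator interface, together with the averaging idea mentioned above. The arrays P and G, the function names, and the sample count are illustrative placeholders, not part of the presentation; in a truly model-free setting these arrays would be hidden inside the simulator program.

```python
import numpy as np

# Hypothetical model data: P[u, i, j] = p_ij(u) and G[u, i, j] = g(i, u, j).
rng = np.random.default_rng(0)

def simulate_transition(i, u, P, G):
    """Sample a successor state j according to p_ij(u) and return (j, g(i, u, j))."""
    j = rng.choice(P.shape[2], p=P[u, i])
    return j, G[u, i, j]

def estimate_model(i, u, P, G, num_samples=10_000):
    """Estimate the transition probabilities p_ij(u) and the expected stage cost
    at (i, u) by averaging repeated simulated transitions."""
    counts = np.zeros(P.shape[2])
    total_cost = 0.0
    for _ in range(num_samples):
        j, cost = simulate_transition(i, u, P, G)
        counts[j] += 1
        total_cost += cost
    return counts / num_samples, total_cost / num_samples
```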

Page 6: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 6

Background

We will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs.

In another type of method, which we will discuss only briefly, we use a gradient method and simulation data to approximate directly an optimal policy.

Page 7: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 7

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 8: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 8

Policy Evaluation Algorithms

With this class of methods, we aim to approximate the cost function Jμ(i) of a policy μ with a parametric architecture of the form $\tilde J(i, r)$, where r is a parameter vector.

Alternatively, the approximation $\tilde J(i, r)$ may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, for use in one-step or multistep lookahead.

Page 9: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 9

Policy Evaluation Algorithms

We focus primarily on two types of methods. In the first class, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture to the samples.

The second and currently more popular class of methods is called indirect. Here, we obtain the parameter vector r by solving an approximate version of Bellman's equation, restricted to the approximation subspace; in the linear case this is the projected equation

$$\Phi r = \Pi T_\mu(\Phi r),$$

whose projected form is discussed further below.

Page 10: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 10

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 11: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 14

Approximate Policy Iteration

Suppose that the current policy is μ, and for a given r, $\tilde J(i, r)$ is an approximation of Jμ(i). We generate an "improved" policy $\bar\mu$ using the formula

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big(g(i,u,j) + \alpha \tilde J(j, r)\big), \qquad \text{for all } i.$$

When the sequence of policies obtained actually converges to some $\bar\mu$, then it can be proved that $\bar\mu$ is optimal to within

$$\frac{2\alpha\delta}{1-\alpha},$$

where δ bounds the approximation error, $\max_i \big|\tilde J(i, r) - J_\mu(i)\big| \le \delta$ for the policies μ generated by the method.
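The policy improvement step above is easy to state in code. The sketch below assumes a tabular model with arrays P[u, i, j] = pij(u) and G[u, i, j] = g(i, u, j), a discount factor alpha, and a fitted approximator J_tilde(j, r); all of these names are illustrative.

```python
import numpy as np

def policy_improvement(J_tilde, r, P, G, alpha):
    """mu_bar(i) = argmin_u  sum_j p_ij(u) * ( g(i, u, j) + alpha * J_tilde(j, r) )."""
    num_controls, num_states, _ = P.shape
    # Approximate cost-to-go of every successor state under the current parameter r
    J_vals = np.array([J_tilde(j, r) for j in range(num_states)])
    mu_bar = np.zeros(num_states, dtype=int)
    for i in range(num_states):
        q_values = [P[u, i] @ (G[u, i] + alpha * J_vals) for u in range(num_controls)]
        mu_bar[i] = int(np.argmin(q_values))
    return mu_bar
```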

Page 12: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 15

Approximate Policy Iteration

Block diagram of approximate policy iteration

Page 13: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 16

Approximate Policy Iteration

A simulation-based implementation of the algorithm is illustrated in the following figure. It consists of four modules:

The simulator, which, given a state-control pair (i, u), generates the next state j according to the system's transition probabilities.

The decision generator, which generates the control $\bar\mu(i)$ of the improved policy $\bar\mu$ at the current state i for use in the simulator.

The cost-to-go approximator, which is the function $\tilde J(j, r)$ that is used by the decision generator.

The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation $\tilde J(\cdot, \bar r)$ of the cost of $\bar\mu$ (a simulation loop wiring these modules together is sketched below).
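The loop below sketches how the four modules interact during one simulation-based evaluation pass. The module functions passed in (decision generator, simulator, and fitting routine) are hypothetical stand-ins for the components named above, not the presentation's own code.

```python
def approximate_policy_iteration_step(i0, num_steps, r, decision_generator,
                                      simulate_transition, fit_cost_approximation):
    """One evaluation pass: the decision generator picks controls of the improved
    policy using J_tilde(., r); the simulator produces transitions and stage costs;
    the cost approximation algorithm then fits the new parameter vector r_bar."""
    states, costs = [i0], []
    i = i0
    for _ in range(num_steps):
        u = decision_generator(i, r)        # control mu_bar(i) via one-step lookahead on J_tilde(., r)
        j, g = simulate_transition(i, u)    # simulator: j ~ p_ij(u), stage cost g(i, u, j)
        states.append(j)
        costs.append(g)
        i = j
    r_bar = fit_cost_approximation(states, costs)   # cost approximation algorithm
    return r_bar
```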

Page 14: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 17

Approximate Policy Iteration

Page 15: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 18

Approximate Policy Iteration

There are two policies μ and $\bar\mu$, and two parameter vectors r and $\bar r$, involved in this algorithm.

In particular, r corresponds to the current policy μ, and the approximation $\tilde J(\cdot, r)$ is used in the policy improvement to generate the new policy $\bar\mu$.

At the same time, $\bar\mu$ drives the simulation that generates the samples used to determine the parameter $\bar r$, which will be used in the next policy iteration.

Page 16: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 19

Approximate Policy Iteration

[Block diagram: system simulator, decision generator, cost-to-go approximator, and cost approximation algorithm.]

Page 17: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 20

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 18: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 21

Direct and Indirect Approximation

An important generic difficulty with simulation-based policy iteration is that the cost samples are generated under the current policy, which biases the simulation toward the states that policy visits and leaves other states underrepresented.

As a result, the cost-to-go estimates of these states may be highly inaccurate, causing serious errors in the calculation of the improved control policy.

This difficulty is known as inadequate exploration of the system's dynamics, arising from the use of a fixed policy.

Page 19: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 22

Direct and Indirect Approximation

One possibility for adequate exploration is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset.

A related approach, called iterative resampling, is to derive an initial cost evaluation of μ, simulate the next policy obtained on the basis of this initial evaluation to obtain a set of representative states S,

and repeat the evaluation of μ using additional trajectories initiated from S.

Page 20: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 23

Direct and Indirect Approximation

The most straightforward algorithmic approach for approximating the cost function is the direct one.

It finds an approximation $\tilde J \in S$ that matches Jμ best in some normed error sense, i.e., it solves

$$\min_{\tilde J \in S} \|\tilde J - J_\mu\|,$$

where S is the approximation subspace of vectors of the form Φr,

$$S = \{\Phi r \mid r \in \Re^{s}\},$$

spanned by the columns of the n × s matrix Φ. Here, || · || is usually some (weighted) Euclidean norm.

Page 21: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 24

Direct and Indirect Approximation

If the matrix Φ has linearly independent columns, the solution is unique and can also be represented as

$$\Phi r^{*} = \Pi J_\mu,$$

where Π denotes projection with respect to || . || on the subspace S.

A major difficulty is that specific cost function values Jμ(i) can only be estimated through their simulation-generated cost samples.
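For a linear architecture, this minimization and the projection Π reduce to an ordinary weighted least squares problem. The sketch below shows the closed form under a hypothetical weight vector xi; the hard-aggregation features in the usage example are also illustrative.

```python
import numpy as np

def weighted_projection_coefficients(Phi, J_mu, xi):
    """Solve min_r || Phi r - J_mu ||_xi^2, i.e. compute r* with Phi r* = Pi J_mu.
    xi holds the positive weights of the (weighted) Euclidean norm."""
    Xi = np.diag(xi)
    # Normal equations (require Phi to have linearly independent columns):
    # (Phi' Xi Phi) r* = Phi' Xi J_mu
    return np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ J_mu)

# Illustrative use with hard-aggregation features (two groups of states)
Phi = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
J_mu = np.array([1.0, 1.2, 5.0, 5.4])
xi = np.full(4, 0.25)
r_star = weighted_projection_coefficients(Phi, J_mu, xi)   # roughly [1.1, 5.2]
```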

Page 22: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 25

Direct and Indirect Approximation

Page 23: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 26

Direct and Indirect Approximation

An alternative approach, referred to as indirect, is to approximate the solution of Bellman's equation J = TμJ on the subspace S, i.e., to solve the projected equation

$$\Phi r = \Pi T_\mu(\Phi r).$$

We can view this equation as a projected form of Bellman's equation.
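With a linear architecture and a weighted Euclidean projection, substituting TμJ = g + αPJ into the projected equation reduces it to a small linear system in r. The sketch below solves it directly when P and g are available; the names are illustrative, and simulation-based methods would avoid forming these matrices explicitly.

```python
import numpy as np

def solve_projected_bellman(Phi, P, g, alpha, xi):
    """Solve Phi r = Pi T_mu(Phi r) for a linear architecture.
    Substituting T_mu J = g + alpha P J and the weighted projection Pi gives
    the s x s system  Phi' Xi (I - alpha P) Phi r = Phi' Xi g."""
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(g)) - alpha * P) @ Phi
    d = Phi.T @ Xi @ g
    return np.linalg.solve(C, d)
```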

Page 24: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 27

Direct and Indirect Approximation

Solving projected equations as approximations to more complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods.

The use of Monte Carlo simulation ideas, which are central in approximate DP, is an important characteristic that differentiates these methods from classical Galerkin methods.

Page 25: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 28

Direct and Indirect Approximation

Page 26: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 29

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 27: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 30

The Role of Monte Carlo Simulation

The methods of this chapter rely to a large extent on simulation in conjunction with cost function approximation in order to deal with large state spaces.

The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms.

Page 28: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 31

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation

Consider the approximate solution of the Bellman equation that corresponds to a given policy of an n-state discounted problem:

$$J = g + \alpha P J,$$

where P is the transition probability matrix, g is the vector of expected stage costs, and α is the discount factor.

Let us adopt a hard aggregation approach whereby we divide the n states into two disjoint subsets I1 and I2 with I1 ∪ I2 = {1, . . . , n}, and we use the piecewise constant approximation

$$J(i) = \begin{cases} r_1 & \text{if } i \in I_1,\\ r_2 & \text{if } i \in I_2.\end{cases}$$

Page 29: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 32

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation (cont.)

This corresponds to the linear feature-based architecture J ≈ Φr, where Φ is the n × 2 matrix with column components equal to 1 or 0, depending on whether the component corresponds to I1 or I2.

We obtain the approximate equations

$$J(i) = g(i) + \alpha \Big(\sum_{j \in I_1} p_{ij}\Big) r_1 + \alpha \Big(\sum_{j \in I_2} p_{ij}\Big) r_2, \qquad i = 1, \ldots, n.$$

Page 30: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 33

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation (cont.)

We can reduce these to just two equations by forming two weighted sums (with equal weights) of the equations corresponding to the states in I1 and I2, respectively:

$$r_1 = \frac{1}{n_1}\sum_{i \in I_1} J(i), \qquad r_2 = \frac{1}{n_2}\sum_{i \in I_2} J(i),$$

where n1 and n2 are the numbers of states in I1 and I2. We thus obtain the aggregate system of the following two equations in r1 and r2:

$$r_1 = \frac{1}{n_1}\sum_{i \in I_1} g(i) + \alpha\,\frac{1}{n_1}\sum_{i \in I_1}\Big(\sum_{j \in I_1} p_{ij}\Big) r_1 + \alpha\,\frac{1}{n_1}\sum_{i \in I_1}\Big(\sum_{j \in I_2} p_{ij}\Big) r_2,$$

$$r_2 = \frac{1}{n_2}\sum_{i \in I_2} g(i) + \alpha\,\frac{1}{n_2}\sum_{i \in I_2}\Big(\sum_{j \in I_1} p_{ij}\Big) r_1 + \alpha\,\frac{1}{n_2}\sum_{i \in I_2}\Big(\sum_{j \in I_2} p_{ij}\Big) r_2.$$

Page 31: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 34

The Role of Monte Carlo Simulation

Example: Approximate Policy Evaluation (cont.)

Here the challenge, when the number of states n is very large, is the calculation of the large sums on the right-hand side, which can be of order O(n²).

Simulation allows the approximate calculation of these sums with complexity that is independent of n.

This is similar to the advantage that Monte-Carlo integration holds over numerical integration.
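As a small illustration of this point, the sketch below estimates one of the aggregate coefficients, e.g. (1/n1) Σ_{i∈I1} Σ_{j∈I2} p_ij, by sampling instead of summing O(n²) terms. The simulator callable and the uniform choice of i over I1 reflect the equal weights used above; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate_aggregate_term(simulate_from, I1, I2, num_samples=5_000):
    """Estimate (1/n1) * sum_{i in I1} sum_{j in I2} p_ij by simulation:
    draw i uniformly from I1 (equal weights), simulate one transition j from i,
    and record how often j falls in I2. The number of samples needed for a given
    accuracy does not grow with the number of states n."""
    I2_set = set(I2)
    hits = 0
    for _ in range(num_samples):
        i = rng.choice(I1)          # uniform over I1
        j = simulate_from(i)        # simulator draws j ~ p_i.
        hits += j in I2_set
    return hits / num_samples
```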

Page 32: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 35

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 33: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 36

Batch Gradient Methods for Policy Evaluation

Suppose that the current policy is μ, and for a given r, $\tilde J(i, r)$ is an approximation of Jμ(i). As before, we generate an "improved" policy $\bar\mu$ using the one-step lookahead formula

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big(g(i,u,j) + \alpha \tilde J(j, r)\big).$$

To evaluate approximately $J_{\bar\mu}$, we select a subset of "representative" states S (obtained by simulation), and for each i ∈ S, we obtain M(i) samples of the cost $J_{\bar\mu}(i)$.

The mth such sample is denoted by c(i, m), and it can be viewed as $J_{\bar\mu}(i)$ plus some simulation error.

Page 34: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 37

Batch Gradient Methods for Policy Evaluation

We obtain the corresponding parameter vector $\bar r$ by solving the following least squares problem:

$$\min_{r} \sum_{i \in S} \sum_{m=1}^{M(i)} \big(\tilde J(i, r) - c(i, m)\big)^2.$$

The above problem can be solved exactly if a linear approximation architecture is used.

However, when a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem.
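For a linear architecture $\tilde J(i, r) = \phi(i)' r$, the least squares problem above has a closed-form solution. The sketch below fits $\bar r$ to simulated cost samples c(i, m) with numpy's least squares routine; the feature map and the sample data are illustrative.

```python
import numpy as np

def fit_direct(feature_map, cost_samples):
    """Solve min_r sum_{i in S} sum_{m=1}^{M(i)} ( phi(i)' r - c(i, m) )^2.
    cost_samples maps each representative state i to its list of samples c(i, m)."""
    rows, targets = [], []
    for i, samples in cost_samples.items():
        for c in samples:
            rows.append(feature_map(i))
            targets.append(c)
    r_bar, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return r_bar

# Illustrative use: two features per state, a few noisy cost samples per state
phi = lambda i: np.array([1.0, float(i)])
samples = {0: [1.1, 0.9], 1: [2.2, 1.8, 2.0], 2: [3.1]}
r_bar = fit_direct(phi, samples)
```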

Page 35: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 38

Batch Gradient Methods for Policy Evaluation

Let us focus on an N-transition portion (i0, . . . , iN) of a simulated trajectory, also called a batch. We view the numbers

$$\sum_{t=k}^{N-1} \alpha^{t-k}\, g\big(i_t, \mu(i_t), i_{t+1}\big), \qquad k = 0, \ldots, N-1,$$

as cost samples, one per initial state i0, . . . , iN−1, which can be used for least squares approximation of the parametric architecture $\tilde J(i, r)$:

$$\min_{r} \sum_{k=0}^{N-1} \Big(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k}\, g\big(i_t, \mu(i_t), i_{t+1}\big)\Big)^2.$$

Page 36: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 39

Batch Gradient Methods for Policy Evaluation

One way to solve this least squares problem is to use a gradient method, whereby the parameter r associated with μ is updated at time N by

$$r := r - \gamma \sum_{k=0}^{N-1} \nabla \tilde J(i_k, r)\Big(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k}\, g\big(i_t, \mu(i_t), i_{t+1}\big)\Big).$$

Here, ∇ denotes gradient with respect to r, and γ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment).
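A sketch of one batch gradient iteration under the formula above, assuming a generic differentiable architecture supplied as J_tilde and grad_J_tilde, a simulated batch of states with their stage costs, and a fixed stepsize gamma; all names are illustrative.

```python
import numpy as np

def batch_gradient_update(r, states, stage_costs, J_tilde, grad_J_tilde, alpha, gamma):
    """One iteration of
    r := r - gamma * sum_k grad J_tilde(i_k, r) * ( J_tilde(i_k, r) - sum_{t>=k} alpha^{t-k} g_t ),
    with all gradients evaluated at the value of r existing before the update."""
    N = len(stage_costs)                     # batch (i_0, ..., i_N) has N transitions
    # Discounted cost-to-go samples, computed backwards along the batch
    cost_to_go = np.zeros(N)
    acc = 0.0
    for k in reversed(range(N)):
        acc = stage_costs[k] + alpha * acc
        cost_to_go[k] = acc
    correction = np.zeros_like(r)
    for k in range(N):
        correction += grad_J_tilde(states[k], r) * (J_tilde(states[k], r) - cost_to_go[k])
    return r - gamma * correction
```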

Page 37: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 40

Batch Gradient Methods for Policy Evaluation

The update of r is done after processing the entire batch, and the gradients $\nabla \tilde J(i_k, r)$ are evaluated at the preexisting value of r, i.e., the one before the update.

In a traditional gradient method, the gradient iteration is repeated until convergence to the solution.

Page 38: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 41

Batch Gradient Methods for Policy Evaluation

However, there is an important tradeoff relating to the size N of the batch:

In order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large N,

Yet to keep the work per gradient iteration small, it is necessary to use a small N.

Page 39: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 42

Batch Gradient Methods for Policy Evaluation

To address the issue of size of N, batches may be changed after one or more iterations.

Thus, the N-transition batch comes from a potentially longer simulated trajectory, or from one of many simulated trajectories.

We leave the method for generating simulated trajectories and forming batches open for the moment.

But we note that it influences strongly the result of the corresponding least squares optimization.

Page 40: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 43

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 41: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 44

Incremental Gradient Methods for Policy Evaluation

We now consider a variant of the gradient method called incremental. This method can also be described through the use of N-transition batches.

But we will see that the method is suitable for use with a single very long simulated trajectory, viewed as a single batch.

Page 42: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 45

Incremental Gradient Methods for Policy Evaluation

For a given N-transition batch (i0, . . . , iN), the batch gradient method processes the N transitions all at once, and updates r using the gradient iteration given earlier.

The incremental method instead updates r a total of N times, once after each transition.

Each time, it adds to r the corresponding portion of the gradient that can be calculated using the newly available simulation data.

Page 43: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 46

Incremental Gradient Methods for Policy Evaluation

Thus, after each transition (ik, ik+1):

We evaluate the gradient $\nabla \tilde J(i_k, r)$ at the current value of r.

We sum all the terms that involve the transition (ik, ik+1), and we update r by making a correction along their sum (sketched in code below):

$$r := r - \gamma \Big(\nabla \tilde J(i_k, r)\,\tilde J(i_k, r) - \Big(\sum_{t=0}^{k} \alpha^{k-t}\, \nabla \tilde J(i_t, r)\Big) g\big(i_k, \mu(i_k), i_{k+1}\big)\Big).$$
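A sketch of the incremental variant of the same iteration: the parameter is corrected after every transition, and the discounted sum of past gradients is maintained incrementally. The names are illustrative, with the same assumed architecture interface as the batch sketch.

```python
import numpy as np

def incremental_gradient_pass(r, trajectory, stage_costs, J_tilde, grad_J_tilde, alpha, gamma):
    """Process one batch transition by transition:
    r := r - gamma * ( grad J_tilde(i_k, r) * J_tilde(i_k, r) - z_k * g(i_k, i_{k+1}) ),
    where z_k = sum_{t<=k} alpha^{k-t} grad J_tilde(i_t, r) is accumulated incrementally."""
    z = np.zeros_like(r)                          # running discounted gradient sum
    for k, g_k in enumerate(stage_costs):
        grad_k = grad_J_tilde(trajectory[k], r)   # evaluated at the most recent r
        z = alpha * z + grad_k
        r = r - gamma * (grad_k * J_tilde(trajectory[k], r) - z * g_k)
    return r
```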

Page 44: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 47

Incremental Gradient Methods for Policy Evaluation

By adding the “incremental” correction in the above iteration, we see that after N transitions, all the terms of the batch iteration will have been accumulated.

But there is a difference:

In the incremental version, r is changed during the processing of the batch, and the gradient $\nabla \tilde J(i_t, r)$ is evaluated at the most recent value of r [after the transition (it, it+1)].

Page 45: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 48

Incremental Gradient Methods for Policy Evaluation

By contrast, in the batch version these gradients are evaluated at the value of r prevailing at the end of the batch.

Note that the gradient sum can be conveniently updated following each transition, thereby resulting in an efficient implementation.

Page 46: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 49

Incremental Gradient Methods for Policy Evaluation

It can be seen that because r is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant.

In this case, for each state i, we will have one cost sample for every time when state i is encountered in the simulation.

Accordingly, state i will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.

Page 47: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 50

Incremental Gradient Methods for Policy Evaluation

Generally, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts.

However, the rate of convergence can be very slow.

Page 48: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 51

Agenda

Introduction

Background

Policy Evaluation Algorithms

General Issues of Cost Approximation

Approximate Policy Iteration

Direct and Indirect Approximation

The Role of Monte Carlo Simulation

Direct Policy Evaluation - Gradient Methods

Batch Gradient Methods for Policy Evaluation

Incremental Gradient Methods for Policy Evaluation

Comparison with our approach

Page 49: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 52

Comparison with our approach

Criterion | Approximation DP | Our Approach

The method of gradient and stepsize determination | Open problems | Collect meaningful and easy-to-obtain information during simulation

The extent of the use of simulation | Calculation only | Coupled with the whole solution approach

Problem structure | Single | Double

Considered target | Estimate the cost from n states | Minimize the aggregate effect of n categories of attackers

Progress | Architecture only | Architecture and implementation

Page 50: Approximation Dynamic Programming Presented by Yu-Shun, Wang

Page 53

The End