Approximate POMDP Algorithms


Page 1: Approximate POMDP Algorithms


Approximate POMDP Algorithms

Ron Parr, CompSci 590.2, Duke University

Overview

• Point based methods

• Policy search

• Augmented state methods

Page 2: Approximate POMDP Algorithms


Review

• Finite horizon POMDP value function is piecewise linear and convex (for arbitrary horizon lengths)
• Max over a set of “alpha” vectors
• Each vector corresponds to a conditional plan

• Number of pieces can grow exponentially

• Hard to solve problems with more than high 1’s or low 10’s of states

Point based methods

• Idea: Instead of generating ALL α vectors at each iteration, what if we generated just a subset?
• Every alpha vector would still be a valid conditional plan
• Value function would lower-bound the true value function

• Point based algorithms generate α vectors that are optimal for only a finite set of points, rather than for the entire belief space
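To make the α-vector picture concrete, here is a minimal numpy sketch (the vectors and beliefs are made up for illustration) of evaluating V(b) = max_α α·b, and of why keeping only a subset of vectors, as point based methods do, can only lower the estimate:

```python
import numpy as np

def value(belief, alpha_vectors):
    """V(b) = max over alpha vectors of alpha . b (piecewise linear and convex)."""
    return max(float(np.dot(alpha, belief)) for alpha in alpha_vectors)

# Toy 2-state example: each alpha vector is the value of one conditional plan.
alphas = [np.array([1.0, 0.0]),   # plan that is best when we believe state 0
          np.array([0.0, 1.0]),   # plan that is best when we believe state 1
          np.array([0.6, 0.6])]   # plan that is best for uncertain beliefs

b = np.array([0.5, 0.5])
print(value(b, alphas))       # 0.6, achieved by the third vector

# Dropping vectors (as a point-based method effectively does) can only lower
# the estimate, so the approximation lower-bounds the true value function.
print(value(b, alphas[:2]))   # 0.5 <= 0.6
```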

Page 3: Approximate POMDP Algorithms


Visualizing PBVI (figure from Pineau et al.)

…S is a set of discrete states, A is a set of discrete actions, and Z is a set of discrete observations providing incomplete and/or noisy state information. The POMDP model is parameterized by: b0, the initial belief distribution; p(s′|s,a), the distribution describing the probability of transitioning from state s to state s′ when taking action a; p(z|s′,a), the distribution describing the probability of observing z from state s′ after taking action a; R(s,a), the reward signal received when executing action a in state s; and γ, the discount factor.

A key assumption of POMDPs is that the state is only partially observable. Therefore we rely on the concept of a belief state, denoted b, to represent a probability distribution over states. The belief is a sufficient statistic for a given history:

b_t(s) = Pr(s_t = s | z_t, a_{t−1}, z_{t−1}, …, a_0, b_0)    (1)

and is updated at each time-step to incorporate the latest action, observation pair:

b_t^{a,z}(s′) = p(z|s′,a) Σ_{s∈S} p(s′|s,a) b_{t−1}(s) / Pr(z|b_{t−1},a)    (2)

where Pr(z|b,a) is the normalizing constant.

The goal of POMDP planning is to find a sequence of actions maximizing the expected sum of rewards. Given that the state is not necessarily fully observable, the goal is to maximize expected reward for each belief. The value function can be formulated as:

V_t(b) = max_{a∈A} [ Σ_{s∈S} R(s,a) b(s) + γ Σ_{z∈Z} Pr(z|b,a) V_{t−1}(b^{a,z}) ]    (3)

When optimized exactly, this value function is always piecewise linear and convex in the belief [Sondik, 1971] (see Fig. 1, left side). After t consecutive iterations, the solution consists of a set of α-vectors: Γ_t = {α_0, α_1, …, α_m}. Each α-vector represents an |S|-dimensional hyper-plane, and defines the value function over a bounded region of the belief: V_t(b) = max_{α∈Γ_t} Σ_{s∈S} α(s) b(s). In addition, each α-vector is associated with an action, defining the best immediate policy assuming optimal behavior for the following t−1 steps (as defined respectively by the sets Γ_{t−1}, …, Γ_0).

The t-th horizon value function can be built from the previous solution using the Backup operator, H. We use the notation V_t = H V_{t−1} to denote an exact value backup:

V_t(b) = max_{a∈A} [ Σ_{s∈S} R(s,a) b(s) + γ Σ_{z∈Z} max_{α∈Γ_{t−1}} Σ_{s∈S} Σ_{s′∈S} p(s′|s,a) p(z|s′,a) α(s′) b(s) ]    (4)

A number of algorithms have been proposed to implement this backup by directly manipulating α-vectors, using a combination of set projection and pruning operations [Sondik, 1971; Cassandra et al., 1997; Zhang and Zhang, 2001]. We now describe the most straightforward version of exact POMDP value iteration.

To implement the exact update V_t = H V_{t−1}, we first generate the intermediate sets Γ_t^{a,*} and Γ_t^{a,z}, ∀ a∈A, z∈Z (Step 1):

Γ_t^{a,*} ← α^{a,*}(s) = R(s,a)
Γ_t^{a,z} ← α_i^{a,z}(s) = γ Σ_{s′∈S} p(s′|s,a) p(z|s′,a) α_i(s′), ∀ α_i ∈ Γ_{t−1}    (5)

Next we create Γ_t^a (∀ a∈A), the cross-sum over observations, which includes one α^{a,z} from each Γ_t^{a,z} (Step 2):

Γ_t^a = Γ_t^{a,*} ⊕ Γ_t^{a,z_1} ⊕ Γ_t^{a,z_2} ⊕ …    (6)

Finally we take the union of the Γ_t^a sets (Step 3):

Γ_t = ∪_{a∈A} Γ_t^a    (7)

In practice, many of the vectors in the final set may be completely dominated by another vector (α_i · b ≤ α_j · b, ∀ b), or by a combination of other vectors. Those vectors can be pruned away without affecting the solution. Finding dominated vectors can be expensive (checking whether a single vector is dominated requires solving a linear program), but is usually worthwhile to avoid an explosion of the solution size.

To better understand the complexity of the exact update, let |Γ_{t−1}| be the number of α-vectors in the previous solution set. Step 1 creates |A| |Z| |Γ_{t−1}| projections and Step 2 generates |A| |Γ_{t−1}|^{|Z|} cross-sums. So, in the worst case, the new solution has size |A| |Γ_{t−1}|^{|Z|} (time |S|^2 |A| |Γ_{t−1}|^{|Z|}). Given that this exponential growth occurs for every iteration, the importance of pruning away unnecessary vectors is clear. It also highlights the impetus for approximate solutions.
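The three-step exact backup above maps directly to code. Below is a rough numpy sketch under my own data-layout assumptions (T[a] holds p(s′|s,a), O[a] holds p(z|s′,a), R holds R(s,a)); it is meant only to show where the |A||Γ|^|Z| blow-up comes from, and it omits pruning:

```python
import numpy as np
from itertools import product

def exact_backup(Gamma, T, O, R, gamma):
    """One exact backup Gamma_t = H Gamma_{t-1} (projection, cross-sum, union).

    Gamma     : list of alpha vectors (numpy arrays of length |S|)
    T[a][s,s']: p(s'|s,a),  O[a][s',z]: p(z|s',a),  R[s,a]: reward
    """
    num_S, num_A = R.shape
    num_Z = O[0].shape[1]
    Gamma_new = []
    for a in range(num_A):
        # Step 1: projections Gamma^{a,z}, one per old alpha vector and observation
        proj = [[gamma * T[a] @ (O[a][:, z] * alpha) for alpha in Gamma]
                for z in range(num_Z)]
        # Step 2: cross-sum over observations -- |Gamma|^|Z| combinations per action
        for combo in product(*proj):
            Gamma_new.append(R[:, a] + sum(combo))
    # Step 3: union over actions (pruning of dominated vectors omitted here)
    return Gamma_new

# Tiny 2-state, 2-action, 2-observation example with made-up numbers.
T = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.5, 0.5]])]
O = [np.array([[0.8, 0.2], [0.3, 0.7]]), np.array([[0.6, 0.4], [0.4, 0.6]])]
R = np.array([[1.0, 0.0], [0.0, 2.0]])
print(len(exact_backup([np.zeros(2)], T, O, R, 0.95)))  # |A|*|Gamma|^|Z| = 2*1^2 = 2
```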

3 Point-based value iteration

It is a well understood fact that most POMDP problems, even given arbitrary action and observation sequences of infinite length, are unlikely to reach most of the points in the belief simplex. Thus it seems unnecessary to plan equally for all beliefs, as exact algorithms do, and preferable to concentrate planning on the most probable beliefs.

The point-based value iteration (PBVI) algorithm solves a POMDP for a finite set of belief points B = {b_0, b_1, …, b_q}. It initializes a separate α-vector for each selected point, and repeatedly updates (via value backups) the value of that α-vector. As shown in Figure 1, by maintaining a full α-vector for each belief point, PBVI preserves the piecewise linearity and convexity of the value function, and defines a value function over the entire belief simplex. This is in contrast to grid-based approaches [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 2000; Zhou and Hansen, 2001; Bonet, 2002], which update only the value at each belief grid point.

Figure 1: POMDP value function representation using PBVI (on the left) and a grid (on the right). [Both panels show belief points b_0 … b_3 on a one-dimensional belief simplex; the left panel shows the α-vector set V = {α_0, α_1, α_2}.]

The complete PBVI algorithm is designed as an anytime algorithm, interleaving steps of value iteration and steps of belief set expansion. It starts with an initial set of belief points for which it applies a first series of backup operations. It then grows the set of belief points, and finds a new solution for the expanded set. By interleaving value backup iterations with expansions of the belief set, PBVI offers a range of solutions, gradually trading off computation time and solution quality.

Questions

• How do we pick the points?

• How do we find the optimal α vector for each point?

Page 4: Approximate POMDP Algorithms


Picking Points

• Typically done heuristically

• Exploration from initial dist. finds a set of reachable belief states

• Reasonable if start dist. is known and entire belief space is not reachable (exact POMDP algorithms may be working too hard)

Compute new α vectors (figure from Ji et al.)

…incorporating the latest action and observation via Bayes rule:

b_z^a(s′) = p(z|s′, a) Σ_{s∈S} p(s′|s, a) b(s) / p(z|b, a)    (1)

where b_z^a denotes the belief state updated from b by taking action a and observing z.
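A minimal numpy sketch of this belief update, together with the kind of forward simulation used to collect a reachable belief set B (array layout, random action choice, and all numbers are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def belief_update(b, a, z, T, O):
    """b_z^a(s') proportional to p(z|s',a) * sum_s p(s'|s,a) b(s)  (Eq. 1)."""
    unnorm = O[a][:, z] * (b @ T[a])      # numerator of Eq. (1)
    norm = unnorm.sum()                   # p(z|b,a), the normalizing constant
    return unnorm / norm if norm > 0 else b

def collect_beliefs(b0, T, O, num_steps=20):
    """Forward-simulate random actions and sampled observations to gather
    a set of beliefs reachable from b0 (one simple point-selection heuristic)."""
    B, b = [b0], b0
    num_A, num_Z = len(T), O[0].shape[1]
    for _ in range(num_steps):
        a = rng.integers(num_A)
        z = rng.choice(num_Z, p=(b @ T[a]) @ O[a])   # sample z with prob p(z|b,a)
        b = belief_update(b, a, z, T, O)
        B.append(b)
    return B

# Toy 2-state problem with made-up dynamics.
T = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.5, 0.5]])]
O = [np.array([[0.8, 0.2], [0.3, 0.7]]), np.array([[0.6, 0.4], [0.4, 0.6]])]
print(collect_beliefs(np.array([1.0, 0.0]), T, O, num_steps=3))
```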

The dynamic behavior of the belief state is itself a discrete-time continuous-state Markov process (Smallwood & Sondik 1973), and a POMDP can be recast as a completely observable MDP with a (|S|−1)-dimensional continuous state space △. Based on these facts, several exact algorithms (Kaelbling, Littman, & Cassandra 1998; Cassandra, Littman, & Zhang 1997) have been developed. However, because of exponential worst case complexity, these algorithms typically are limited to solving problems with low tens of states. The poor scalability of exact algorithms has led to the development of a wide variety of approximate techniques (Pineau, Gordon, & Thrun 2003; Poupart & Boutilier 2003; Smith & Simmons 2005; Spaan & Vlassis 2005), of which PBVI (Pineau, Gordon, & Thrun 2003) proves to be a particularly simple and practical algorithm.

Point-Based Value Iteration

Instead of planning on the entire belief simplex △, as exact value iteration does, point-based algorithms (Pineau, Gordon, & Thrun 2003; Spaan & Vlassis 2005) alleviate the computational load by planning only on a finite set of belief points B. They utilize the fact that most practical POMDP problems assume an initial belief b0, and concentrate planning resources on regions of the simplex that are reachable (in simulation) from b0. Based on this idea, Pineau et al. (2003) proposed a PBVI algorithm that first collects a finite set of belief points B by forward simulating the POMDP model and then maintains a single gradient (α-vector, in POMDP terminology) for each b ∈ B. This is summarized in Algorithm 1 (without the last two lines).

Algorithm 1. Point-based backup
function Γ′ = backup(Γ, B)
  % Γ is a set of α-vectors representing the value function
  Γ′ = ∅
  for each b ∈ B
    α_a^z = argmax_{α∈Γ} α · b_z^a, for every a∈A, z∈Z
    α_a(s) = R(s,a) + γ Σ_{z,s′} p(s′|s,a) p(z|s′,a) α_a^z(s′)
    α′ = argmax_{α_a: a∈A} α_a · b
    if α′ ∉ Γ′, then Γ′ ← Γ′ + α′, end
  end
  % The following two lines are added for the modified backup
  Λ = all α∈Γ that are dominant at least at one b_z^{π(b)}
  Γ′ = Γ′ + Λ
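Here is one plausible numpy rendering of Algorithm 1 without the two modified-backup lines (the T/O/R array layout is my own assumption). It exploits the fact that argmax_α α · b_z^a can be computed from the back-projected vectors directly, since the two objectives differ only by the positive factor γ p(z|b,a):

```python
import numpy as np

def point_based_backup(Gamma, B, T, O, R, gamma):
    """One PBVI backup: keep (at most) one new alpha vector per belief point in B."""
    num_S, num_A = R.shape
    num_Z = O[0].shape[1]
    Gamma_new = []
    for b in B:
        candidates = []
        for a in range(num_A):
            alpha_a = R[:, a].astype(float)
            for z in range(num_Z):
                # Back-projection of each old vector:
                #   g_i(s) = gamma * sum_{s'} p(s'|s,a) p(z|s',a) alpha_i(s')
                g = [gamma * T[a] @ (O[a][:, z] * alpha) for alpha in Gamma]
                # argmax_i g_i . b picks the same vector as argmax_alpha alpha . b_z^a
                alpha_a += max(g, key=lambda gi: float(gi @ b))
            candidates.append(alpha_a)
        alpha_star = max(candidates, key=lambda v: float(v @ b))
        if not any(np.allclose(alpha_star, v) for v in Gamma_new):
            Gamma_new.append(alpha_star)
    return Gamma_new
```

Repeating this backup on a fixed B, interleaved with belief-set expansion, gives the anytime behavior described in the PBVI excerpt above.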

Hansen’s Policy Iteration

Policy iteration (Sondik 1978; Hansen 1998) iterates over policies directly, in contrast to the indirect policy representation of value iteration. It consists of two interacting processes, one computing the value function of the current policy (policy evaluation), and the other computing an improved policy with respect to the current value function (policy improvement). These two processes alternate, until convergence is achieved (close) to an optimal policy.

Policy Evaluation  Hansen (1998) represents the policy in the form of a finite-state controller (FSC), denoted by π = ⟨N, E⟩, where N denotes a finite set of nodes (or machine states) and E denotes a finite set of edges. Each machine state n ∈ N is labeled by an action a ∈ A, each edge e ∈ E by an observation z ∈ Z, and each machine state has one outward edge per observation. Consequently, a policy represented in this way can be executed by taking the action associated with the “current machine state”, and changing the current machine state by following the edge labeled by the observation made.

As noted by Sondik (1978), the cross-product of the environment states S and machine states N constitutes a finite Markov chain, and the value function of policy π (represented by a set of α-vectors, with one vector corresponding to one machine state) can be calculated by solving the following system of linear equations:

α_n^π(s) = R(s, a(n)) + γ Σ_{s′,z} p(s′|s, a(n)) p(z|s′, a(n)) α_{l(n,z)}^π(s′)    (2)

where n ∈ N is the index of a machine state, a(n) is the action associated with machine state n, and l(n, z) is the index of its successor machine state if z is observed. As a result, the value function of π can be represented by a set of |N| vectors Γ^π = {α_1^π, α_2^π, …, α_{|N|}^π}, such that

V^π(b) = max_{α∈Γ^π} α · b
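Equation (2) is a linear system in the stacked entries α_n^π(s) over the cross-product of machine states and environment states, so it can be solved directly. A small numpy sketch, with my own array conventions (not code from the paper):

```python
import numpy as np

def evaluate_fsc(action_of, succ, T, O, R, gamma):
    """Solve Eq. (2) for the alpha vectors of a finite-state controller.

    action_of[n] : action a(n) of machine state n
    succ[n][z]   : successor machine state l(n, z)
    T[a][s,s']   : p(s'|s,a),  O[a][s',z] : p(z|s',a),  R[s,a] : reward
    Returns a |N| x |S| array whose row n is alpha^pi_n.
    """
    num_N, num_S = len(action_of), R.shape[0]
    num_Z = O[0].shape[1]
    dim = num_N * num_S
    A_mat = np.eye(dim)          # builds (I - gamma * P) for the joint chain
    r_vec = np.zeros(dim)
    for n in range(num_N):
        a = action_of[n]
        for s in range(num_S):
            row = n * num_S + s
            r_vec[row] = R[s, a]
            for z in range(num_Z):
                n2 = succ[n][z]
                # coefficient of alpha_{n2}(s'): gamma * p(s'|s,a) * p(z|s',a)
                A_mat[row, n2 * num_S:(n2 + 1) * num_S] -= gamma * T[a][s, :] * O[a][:, z]
    return np.linalg.solve(A_mat, r_vec).reshape(num_N, num_S)
```

V^π(b) is then the maximum over the returned rows of row · b, matching the expression above.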

Policy Improvement  The policy improvement step of Hansen’s policy iteration involves dynamic programming to transform the value function V^π represented by Γ^π into an improved value function represented by another set of α-vectors, Γ^{π′}. As noted by Hansen (1998), each α-vector in Γ^{π′} has a corresponding choice of action, and for each possible observation a choice of an α-vector in Γ^π. This information can be used to transform an old FSC π into an improved FSC π′ by a simple comparison of Γ^π and Γ^{π′}. A detailed procedure for FSC transformation is presented in Algorithm 2 when we introduce the PBPI algorithm, since PBPI shares the structure of Hansen’s algorithm and has the same procedure for FSC transformation.

Point-Based Policy Iteration

Our point-based policy iteration (PBPI) algorithm aims to combine some of the most desirable properties of Hansen’s policy iteration with point-based value iteration. Specifically, PBPI replaces the exact policy improvement step of Hansen’s algorithm with PBVI (Pineau, Gordon, & Thrun 2003), such that policy improvement concentrates on a finite sample of beliefs B. This algorithm is summarized below in Algorithm 2, with the principal structure shared with Hansen’s algorithm. There are two important differences between PBPI and Hansen’s algorithm: (1) In PBPI, the backup operation only applies to the points in B, not the entire belief simplex…

Page 5: Approximate POMDP Algorithms


PBVI performance (figure from Pineau et al.)

100 102 104 106−20

−18

−16

−14

−12

−10

−8

TIME (secs)

REW

ARD

PBVIQMDP

10−2 100 102 104−0.5

0

0.5

1

1.5

2

2.5

TIME (secs)

REW

ARD

PBVIQMDPIncPrune

10−2 100 102 1040

0.1

0.2

0.3

0.4

0.5

0.6

0.7

TIME (secs)

REW

ARD

PBVIQMDPIncPrune

10−2 100 102 1040

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

TIME (secs)

REW

ARD

PBVIQMDPIncPrune

Figure 3: PBVI performance for four problems: Tag(left), Maze33(center-left), Hallway(center-right) and Hallway2(right)

Method                         Goal%   Reward     Time(s)   |B|
Maze33 / Tiger-Grid
  QMDP[*]                      n.a.    0.198      0.19      n.a.
  Grid [Brafman, 1997]         n.a.    0.94       n.v.      174
  PBUA [Poon, 2001]            n.a.    2.30       12116     660
  PBVI[*]                      n.a.    2.25       3448      470
Hallway
  QMDP[*]                      47      0.261      0.51      n.a.
  QMDP [Littman et al., 1995]  47.4    n.v.       n.v.      n.a.
  PBUA [Poon, 2001]            100     0.53       450       300
  PBVI[*]                      96      0.53       288       86
Hallway2
  QMDP[*]                      22      0.109      1.44      n.a.
  QMDP [Littman et al., 1995]  25.9    n.v.       n.v.      n.a.
  Grid [Brafman, 1997]         98      n.v.       n.v.      337
  PBUA [Poon, 2001]            100     0.35       27898     1840
  PBVI[*]                      98      0.34       360       95
Tag
  QMDP[*]                      17      -16.769    13.55     n.a.
  PBVI[*]                      59      -9.180     180880    1334
n.a. = not applicable   n.v. = not available

Table 1: Results for POMDP domains. Those marked [*] were computed by us; other results were likely computed on different platforms, and therefore time comparisons may be approximate at best. All results assume a standard (not lookahead) controller.

four heuristics. For each problem we apply PBVI using each of the belief-point selection heuristics, and include the QMDP approximation as a baseline comparison. Figure 4 shows the computation time versus the reward performance for each domain.

The key result from Figure 4 is the rightmost panel, which shows performance on the largest, most complicated domain. In this domain our SSEA rule clearly performs best. In smaller domains (left two panels) the choice of heuristic matters less: all heuristics except random exploration (RA) perform equivalently well.

6 Related work

Significant work has been done in recent years to improve the tractability of POMDP solutions. A number of increasingly efficient exact value iteration algorithms have been proposed [Cassandra et al., 1997; Kaelbling et al., 1998; Zhang and Zhang, 2001]. They are successful in finding optimal solutions, however are generally limited to very small problems (a dozen states) since they plan optimally for all beliefs. PBVI avoids the exponential growth in plan size by restricting value updates to a finite set of (reachable) beliefs.

There are several approximate value iteration algorithms which are related to PBVI. For example, there are many grid-based methods which iteratively update the values of discrete belief points. These methods differ in how they partition the belief space into a grid [Brafman, 1997; Zhou and Hansen, 2001].

More similar to PBVI are those approaches which update both the value and gradient at each grid point [Lovejoy, 1991; Hauskrecht, 2000; Poon, 2001]. While the actual point-based update is essentially the same between all of these, the overall algorithms differ in a few important aspects. Whereas Poon only accepts updates that increase the value at a grid point (requiring special initialization of the value function), and Hauskrecht always keeps earlier α-vectors (causing the set to grow too quickly), PBVI requires no such assumptions. A more important benefit of PBVI is the theoretical guarantees it provides: our guarantees are more widely applicable and provide stronger error bounds than those for other point-based updates.

In addition, PBVI is significantly smarter than previous algorithms about how it selects belief points. PBVI selects only reachable beliefs; other algorithms use random beliefs, or (like Poon’s and Lovejoy’s) require the inclusion of a large number of fixed beliefs such as the corners of the probability simplex. Moreover, PBVI selects belief points which improve its error bounds as quickly as possible. In practice, our experiments on the large domain of lasertag demonstrate that PBVI’s belief-selection rule handily outperforms several alternate methods. (Both Hauskrecht and Poon did consider using stochastic simulation to generate new points, but neither found simulation to be superior to random point placements. We attribute this result to the smaller size of their test domains. We believe that as more POMDP research moves to larger planning domains, newer and smarter belief selection rules will become more and more important.)

Gradient-based policy search methods have also been used to optimize POMDP solutions [Baxter and Bartlett, 2000; Kearns et al., 1999; Ng and Jordan, 2000], successfully solving multi-dimensional, continuous-state problems. In our view, one of the strengths of these methods lies in the fact that they restrict optimization to reachable beliefs (as does PBVI). Unfortunately, policy search techniques can be hampered by low-gradient plateaus and poor local minima, and typically require the selection of a restricted policy class.

QMDP is an approximation method that uses 1 α vector per action at all iterations. Incremental Pruning was one of the best exact methods at the time.

PBVI limitations

• Fixed representation size was not adaptive to complexity of value fn.

• Only as fast as value iteration, which converges asymptotically

• Not monotonic: can’t guarantee that values of all belief states increase at every iteration

• Is there a way to get the benefits of policy iteration?

Page 6: Approximate POMDP Algorithms


Point Based Policy Iteration

• Combines policy iteration with point based methods
• Main idea:
  • Maintain a finite state controller (FSC) as the policy
  • Evaluate the FSC
  • Do one step of PBVI
  • Use output of PBVI to improve the FSC
  • Repeat

PBPI FSC modifications

• For each new α vector from PBVI:
  • If the new α vector duplicates the one from our FSC, do nothing
  • If the new α vector pointwise dominates one from our FSC:
    • Create a new FSC state
    • Modify transitions to go from dominated state to new state
  • Else add a new state corresponding to the new α vector

• When all new α vectors from PBVI are processed, remove unreachable states from the FSC

Page 7: Approximate POMDP Algorithms


PBVI vs. PBPI

• See figures from Ji et al.

• In general:
  • Achieves better performance with a fixed budget of belief points (not surprising because representation size can grow)
  • Can often achieve higher performance with less CPU time

Limitations of Point Based Methods

• Still can be slow

• Assumes a known model:
  • At planning time
  • At execution time

Page 8: Approximate POMDP Algorithms


Policy Search

• Recall policy gradient (figure from Sutton & Barto):

• This still works even if Markov property is violated, but…

13.3 REINFORCE: Monte Carlo Policy Gradient

which is exactly what we want, a quantity that we can sample on each time step whose expectation is equal to the gradient. Using this sample to instantiate our generic stochastic gradient ascent algorithm (13.1), we obtain the update

θ_{t+1} ≐ θ_t + α γ^t G_t ∇_θ π(A_t|S_t, θ) / π(A_t|S_t, θ).    (13.6)

We call this algorithm REINFORCE (after Williams, 1992). Its update has an intuitive appeal. Each increment is proportional to the product of a return G_t and a vector, the gradient of the probability of taking the action actually taken, divided by the probability of taking that action. The vector is the direction in parameter space that most increases the probability of repeating the action A_t on future visits to state S_t. The update increases the parameter vector in this direction proportional to the return, and inversely proportional to the action probability. The former makes sense because it causes the parameter to move most in the directions that favor actions that yield the highest return. The latter makes sense because otherwise actions that are selected frequently are at an advantage (the updates will be more often in their direction) and might win out even if they do not yield the highest return.

Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed pseudocode below.

REINFORCE, A Monte-Carlo Policy-Gradient Method (episodic)

Input: a differentiable policy parameterization π(a|s, θ), ∀a ∈ A, s ∈ S, θ ∈ R^d
Initialize policy parameter θ
Repeat forever:
  Generate an episode S_0, A_0, R_1, …, S_{T−1}, A_{T−1}, R_T, following π(·|·, θ)
  For each step of the episode t = 0, …, T−1:
    G ← return from step t
    θ ← θ + α γ^t G ∇_θ log π(A_t|S_t, θ)

The vector ∇_θ π(A_t|S_t, θ) / π(A_t|S_t, θ) in the REINFORCE update is the only place the policy parameterization appears in the algorithm. This vector has been given several names and notations in the literature; we will refer to it simply as the eligibility vector. The eligibility vector is often written in the compact form ∇_θ log π(A_t|S_t, θ), using the identity ∇ log x = ∇x / x. This form is used in all the boxed pseudocode in this chapter.

In earlier examples in this chapter we considered exponential softmax policies (13.2) with linear action preferences (13.3). For this parameterization, the eligibility vector is

∇_θ log π(a|s, θ) = x(s, a) − Σ_b π(b|s, θ) x(s, b).    (13.7)
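For reference, a minimal numpy sketch of episodic REINFORCE with a tabular softmax policy; the env.reset()/env.step() interface, hyperparameters, and tabular parameterization are my own assumptions, not from the text. Conditioning only on the current observation makes this exactly the memoryless policy search discussed on the next slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run_reinforce_episode(theta, env, alpha=0.01, gamma=0.99):
    """One episodic REINFORCE pass for a tabular softmax policy pi(a|o, theta).

    theta[o, a] holds action preferences for observation o; `env` is assumed to
    expose reset() -> o and step(a) -> (o, reward, done).
    """
    observations, actions, rewards = [], [], []
    o, done = env.reset(), False
    while not done:
        probs = softmax(theta[o])
        a = rng.choice(len(probs), p=probs)
        observations.append(o)
        actions.append(a)
        o, r, done = env.step(a)
        rewards.append(r)

    # Work backwards so G accumulates the discounted return from each step t.
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        o_t, a_t = observations[t], actions[t]
        grad_log = -softmax(theta[o_t])    # eligibility vector (Eq. 13.7)
        grad_log[a_t] += 1.0               # with one-hot tabular features
        theta[o_t] += alpha * (gamma ** t) * G * grad_log
    return theta
```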

Policy Search for POMDPs

Page 9: Approximate POMDP Algorithms


Naïve policy search in POMDPs

• Policy search “works”, but policy is limited to a mapping from observations to actions

• Doesn’t solve the partial observability problem

• At best can randomize actions to avoid losses from state confusion

Policy search with FSCs

• Create a random FSC
• Create a policy function (or set of policy functions) that covers actions in the environment and transitions in the FSC
• Use policy gradient methods to tune both of these (see the sketch below)

• Cool idea
• Hard to make work in practice:
  • Lots of local optima
  • Hard to find good controllers with more than a few hidden states
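One way this parameterization might look (purely illustrative; the construction below is a generic stochastic finite-state controller with softmax action and node-transition distributions, not the specific method referenced on the slide): sample both choices, record their score-function gradients, and apply a REINFORCE-style update weighted by the return.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class StochasticFSC:
    """Hypothetical finite-state controller with learnable action and
    node-transition distributions, both given by softmax preferences."""

    def __init__(self, num_nodes, num_actions, num_obs):
        self.theta_a = 0.01 * rng.standard_normal((num_nodes, num_actions))
        self.theta_n = 0.01 * rng.standard_normal((num_nodes, num_obs, num_nodes))

    def act(self, node):
        """Sample an action at the current node; also return d log pi / d theta_a[node]."""
        probs = softmax(self.theta_a[node])
        a = rng.choice(len(probs), p=probs)
        grad = -probs
        grad[a] += 1.0
        return a, grad

    def transition(self, node, obs):
        """Sample the next node given the observation; also return its log-gradient."""
        probs = softmax(self.theta_n[node, obs])
        nxt = rng.choice(len(probs), p=probs)
        grad = -probs
        grad[nxt] += 1.0
        return nxt, grad

    def reinforce_update(self, steps, alpha=0.01):
        """steps: list of (node, obs, grad_a, grad_n, G_t) tuples from one episode."""
        for node, obs, grad_a, grad_n, G in steps:
            self.theta_a[node] += alpha * G * grad_a
            self.theta_n[node, obs] += alpha * G * grad_n
```

Because both the action choices and the memory (node) transitions are stochastic functions of the parameters, a single return-weighted gradient tunes the controller's behavior and its internal memory at once, which is the appeal of the idea and also the source of the local-optima problems noted above.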

Page 10: Approximate POMDP Algorithms


Policy Search in POMDPs summary

• Advantages:
  • Does not require knowledge of the model
  • Does not need to maintain a belief state
  • Representation is fixed in size

• Disadvantages:
  • Belief states tell you a lot – hard to do well without them
  • Many of the challenges of policy gradient methods:
    • Local optima
    • Variance in the gradient estimate
    • Slow

Augmented State Methods

Page 11: Approximate POMDP Algorithms


Augmented state

• POMDPs are tricky b/c process is not Markovian in the observation
• Rather than change the algorithm, why not change the state?

• Advantage: Get to run regular MDP algorithms on the new state

• Challenge: How to do this

Finite History Window

• Problem might not be Markovian in current observation, but
• Perhaps it is Markovian if we augment the state to include a k-step window of previous states
• Advantages:
  • Obviously the right thing to do if you can afford to do it
  • Simple

• Disadvantages:
  • For n states, d step history, state space grows with n^d
  • Not always obvious how large to make d
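A minimal sketch of the bookkeeping, using a window of recent (action, observation) pairs since the true state is hidden (class and method names are my own): the resulting tuple can be used as the state key for any tabular MDP algorithm.

```python
from collections import deque

class HistoryWindowState:
    """Augment the agent's state with a k-step window of recent (action, observation) pairs."""

    def __init__(self, k):
        self.window = deque(maxlen=k)     # oldest entries fall off automatically

    def reset(self, first_obs):
        self.window.clear()
        self.window.append((None, first_obs))
        return self.state()

    def update(self, action, obs):
        self.window.append((action, obs))
        return self.state()

    def state(self):
        # A hashable key usable as the "state" of a regular MDP/RL algorithm,
        # e.g. as a key into a tabular Q-function: Q[augmented_state][action].
        return tuple(self.window)

# With k = 2, two different ways of arriving at the same observation give
# different augmented states, which is exactly the point of the augmentation.
h = HistoryWindowState(k=2)
print(h.reset("intersection"))
print(h.update("left", "intersection"))
```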

Page 12: Approximate POMDP Algorithms


History Trees

• History window probably wasted a lot of effort tracking irrelevant distinctions:
  • Many states may have unique/unambiguous observations
  • No need to remember history when we see these

• History trees define state as a variable length vector of previous states and actions sufficient to ensure Markov property
• In practice:
  • Collect statistics on histories
  • When violations of Markov property are detected, extend history

History tree example

• Robot going through maze
• Suppose two intersections look alike
• History tree can be used to remember how the robot got to the intersection, to help distinguish between similar states

• How to discover this:
  • Need to collect statistics on all possible extensions of current histories
  • When next states or next utilities diverge based upon different extensions of the history, grow the history

Page 13: Approximate POMDP Algorithms


History tree Pros and Cons

• Works very well in some problems where short(ish) histories are sufficient to recover the Markov property
• More efficient than finite window methods

• Limitations:
  • Needs to collect a lot of data
  • Can be hard to determine when to augment histories if there is a lot of noise
  • Myopic/greedy (will miss if you need to remember something from 20 steps in the past, and remembering something 1…19 steps in the past doesn’t help)

Augmented state with Function Approximation

• Idea: Use function approximation to learn how to augment the state “automatically”
• Old idea (at least as far back as Lin in the 90’s)

[Diagram: a function approximator (e.g. neural network) maps state, action → Q-value, with self-learned state-augmenting features (initially random NN output) fed back in as additional inputs.]
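In the recurrent-network version of this picture, the self-learned state-augmenting features are just the hidden state the network carries forward. A minimal PyTorch sketch (layer sizes, the use of nn.LSTM, and the observation-only input are my own choices, not from the slide):

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q-network whose recurrent hidden state plays the role of the
    self-learned state-augmenting features in the diagram above."""

    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); `hidden` carries the learned memory
        features, hidden = self.lstm(obs_seq, hidden)
        return self.q_head(features), hidden   # Q-values per step, per action

# Usage: feed observations one step at a time, carrying `hidden` forward,
# and act (epsilon-)greedily on the returned Q-values.
net = RecurrentQNetwork(obs_dim=4, num_actions=3)
q_values, h = net(torch.zeros(1, 1, 4))
print(q_values.shape)   # torch.Size([1, 1, 3])
```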

Page 14: Approximate POMDP Algorithms


Learned augmented state

• Cool idea
• Creates self-reference in the function approximation (recurrent neural network)
• Difficult to train and hard to establish convergence in practice

• Recent efforts to address difficulties in recurrent neural networks, e.g., LSTM, have shown promise in this application!

POMDP approximation summary

• Known model of moderate size: Use point based methods

• Modest history dependence: Augment state w/ history window/tree

• Unknown model, unknown history dependence:
  • This is hard!
  • Learn model?
  • Deep learning with LSTM or similar methods to learn representation