# incremental pruning

Post on 08-Jan-2016


Incremental Pruning
CSE 574, May 9, 2003
Stanley Kok

Value-Iteration (Recap)
DP update: a step in value-iteration
MDP:
- S: finite set of states in the world
- A: finite set of actions
- T: S x A -> Π(S) (e.g. T(s, a, s') = 0.2)
- R: S x A -> R (e.g. R(s, a) = 10)
Algorithm
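The DP update above can be sketched in a few lines of Python; the concrete MDP here (transition probabilities, rewards, discount factor) is a made-up illustration, not from the talk.

```python
# Minimal value-iteration sketch for a finite MDP.
# T[s][a][s2] = transition probability, R[s][a] = immediate reward.
# The example MDP below is hypothetical.

def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        V_new = {}
        for s in S:
            # DP update: best expected immediate reward plus discounted future value
            V_new[s] = max(R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in S)
                           for a in A)
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

S = ["s1", "s2"]
A = ["a1", "a2"]
T = {"s1": {"a1": {"s1": 0.8, "s2": 0.2}, "a2": {"s1": 0.2, "s2": 0.8}},
     "s2": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s1": 0.9, "s2": 0.1}}}
R = {"s1": {"a1": 10, "a2": 0}, "s2": {"a1": 0, "a2": 1}}
V = value_iteration(S, A, T, R)
```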

POMDP
- the tuple <S, A, T, R> of an MDP
- Ω: finite set of observations
- O: S x A -> Π(Ω)

Belief state
- information state b: a probability distribution over S
- e.g. b(s1)

POMDP - SE
SE (State Estimator) updates the belief state based on the previous belief state, the last action, and the current observation: SE(b, a, o) = b'
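The SE(b, a, o) update is a Bayes filter step. A minimal sketch, with T and O following the signatures above; the concrete probability tables are invented for illustration:

```python
# Belief update b' = SE(b, a, o): Bayes rule over the hidden state.
# T[s][a][s2] = P(s2 | s, a), O[s2][a][o] = P(o | s2, a). Numbers are made up.

def state_estimator(b, a, o, S, T, O):
    b_new = {}
    for s2 in S:
        # unnormalized: P(o | s', a) * sum_s P(s' | s, a) * b(s)
        b_new[s2] = O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in S)
    norm = sum(b_new.values())  # P(o | b, a)
    return {s2: p / norm for s2, p in b_new.items()}

S = ["s1", "s2"]
T = {"s1": {"a1": {"s1": 0.7, "s2": 0.3}},
     "s2": {"a1": {"s1": 0.4, "s2": 0.6}}}
O = {"s1": {"a1": {"z1": 0.9}},
     "s2": {"a1": {"z1": 0.2}}}
b2 = state_estimator({"s1": 0.5, "s2": 0.5}, "a1", "z1", S, T, O)
```

Because z1 is much likelier in s1 here, the updated belief shifts toward s1.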

POMDP - SE

POMDP - Focus on π component

POMDP -> Belief MDP
MDP parameters:
- S => B, set of belief states
- A => same
- T => τ(b, a, b')
- R => ρ(b, a)

Solve with the value-iteration algorithm

POMDP - τ(b, a, b')

ρ(b, a)

Two Problems
How to represent the value function over a continuous belief space?
How to update value function Vt from Vt-1?

POMDP -> MDP
- S => B, set of belief states
- A => same
- T => τ(b, a, b')
- R => ρ(b, a)

Running Example
POMDP with:
- two states (s1 and s2)
- two actions (a1 and a2)
- three observations (z1, z2, z3)

1D belief space for a 2-state POMDP
x-axis: probability that the state is s1

First Problem Solved
Key insight: the value function is piecewise linear & convex (PWLC)
Convexity makes intuitive sense:
- middle of belief space: high entropy, can't select actions appropriately, less long-term reward
- near corners of the simplex: low entropy, take actions more likely to be appropriate for the current world state, gain more reward
Each line (hyperplane) is represented by the vector of its coefficients
e.g. V(b) = c1 x b(s1) + c2 x (1 - b(s1))

To find the value at b, find the vector with the largest dot product with b

Second Problem
Can't iterate over all belief states (infinite) for value-iteration, but:
given the vectors representing Vt-1, generate the vectors representing Vt

Horizon 1
No future: the value function consists only of immediate reward
e.g. R(s1, a1) = 1, R(s2, a1) = 0, R(s1, a2) = 0, R(s2, a2) = 1.5
b = (0.25, 0.75)

Value of doing a1 = 1 x b(s1) + 0 x b(s2) = 1 x 0.25 + 0 x 0.75 = 0.25

Value of doing a2 = 0 x b(s1) + 1.5 x b(s2) = 0 x 0.25 + 1.5 x 0.75 = 1.125
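These horizon-1 values are just dot products with per-action reward vectors, which is exactly the PWLC "largest dot product" rule. A small check of the numbers:

```python
# Horizon-1 value function as alpha vectors: one immediate-reward vector per
# action. V(b) is the max dot product over the vectors (PWLC evaluation).

alpha = {"a1": [1.0, 0.0],   # [R(s1, a1), R(s2, a1)]
         "a2": [0.0, 1.5]}   # [R(s1, a2), R(s2, a2)]

def value(b, vectors):
    # best dot product over all alpha vectors
    return max(sum(x * y for x, y in zip(v, b)) for v in vectors.values())

b = [0.25, 0.75]
print(value(b, alpha))  # prints 1.125 (a2 wins: 0 x 0.25 + 1.5 x 0.75)
```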

Second Problem
Break the problem down into 3 steps:
- compute value of belief state given action and observation
- compute value of belief state given action
- compute value of belief state

Horizon 2 - Given action & obs
If in belief state b, what is the best value of doing action a1 and seeing z1?
Best value = value of immediate action + best value of next action
Value of immediate action comes from the horizon 1 value function

Horizon 2 - Given action & obs
Assume the immediate action is a1 and the observation is z1.
What's the best action for the b' that results from the initial b when we perform a1 and observe z1?
Not feasible to do this for all belief states (infinite)

Horizon 2 - Given action & obs
Construct a function over the entire (initial) belief space from the horizon 1 value function, with the belief transformation built in

Horizon 2 - Given action & obs
S(a1, z1) corresponds to the paper's S^a_z sets

S() has built in:
- horizon 1 value function
- belief transformation
- weight of seeing z after performing a
- discount factor
- immediate reward

S() is PWLC
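As a sketch of the S(a, z) construction: each horizon-1 vector alpha is transformed into a vector over the initial belief space, folding in the immediate reward (split evenly across the |Z| observations), the discount factor, and the belief transformation. The concrete POMDP numbers below are invented for illustration.

```python
# S(a, z) vector for one alpha:
#   component s = R(s,a)/|Z| + gamma * sum_s' O(s',a,z) * T(s,a,s') * alpha(s')
# Identity transitions and uniform observations keep the arithmetic checkable.

GAMMA = 0.9

def s_az_vector(alpha, a, z, S, Z, T, O, R):
    return [R[s][a] / len(Z)
            + GAMMA * sum(O[s2][a][z] * T[s][a][s2] * alpha[j]
                          for j, s2 in enumerate(S))
            for s in S]

S = ["s1", "s2"]
Z = ["z1", "z2"]
T = {"s1": {"a1": {"s1": 1.0, "s2": 0.0}},
     "s2": {"a1": {"s1": 0.0, "s2": 1.0}}}   # identity transitions
O = {"s1": {"a1": {"z1": 0.5}},
     "s2": {"a1": {"z1": 0.5}}}
R = {"s1": {"a1": 1.0}, "s2": {"a1": 0.0}}
v = s_az_vector([1.0, 0.0], "a1", "z1", S, Z, T, O, R)
```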

Second Problem
Break the problem down into 3 steps:
- compute value of belief state given action and observation
- compute value of belief state given action
- compute value of belief state

Horizon 2 - Given action
What is the horizon 2 value of a belief state, given that the immediate action is a1?
Horizon 2: do action a1
Horizon 1: do which action?

Horizon 2 - Given action
What's the best strategy at b?
How to compute the line (vector) representing the best strategy at b? (easy)
How many strategies are there in the figure?
What's the max number of strategies (after taking immediate action a1)?

Horizon 2 - Given action
How can we represent the 4 regions (strategies) as a value function?
Note: each region is a strategy

Horizon 2 - Given action
Sum up the vectors representing each region
Sum of vectors = vectors (add lines, get lines)
Corresponds to the paper's cross-sum transformation
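Summing one vector per observation is a cross-sum of the S(a, z) sets, which is also why the number of candidate strategies can blow up to |Vt-1| to the power |Z|. A minimal sketch with made-up 2-D vectors:

```python
# Cross-sum of vector sets: pick one vector from each observation's set and
# add them componentwise. Each combination is one "strategy" vector.

from itertools import product

def cross_sum(vector_sets):
    # one choice per set -> componentwise sum of the chosen vectors
    return [[sum(c) for c in zip(*choice)] for choice in product(*vector_sets)]

S_a1_z1 = [[1.0, 0.0], [0.5, 0.5]]   # hypothetical S(a1, z1)
S_a1_z2 = [[0.0, 1.0]]               # hypothetical S(a1, z2)
combined = cross_sum([S_a1_z1, S_a1_z2])
# 2 x 1 = 2 combined vectors: [1.0, 1.0] and [0.5, 1.5]
```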

Horizon 2 - Given action
What does each region represent?
Why is this step hard (as alluded to in the paper)?

Second Problem
Break the problem down into 3 steps:
- compute value of belief state given action and observation
- compute value of belief state given action
- compute value of belief state

Horizon 2

Value functions for a1 and a2, combined by union (U)

Horizon 2
This tells you how to act!

Purge

Use the horizon 2 value function to update horizon 3's ...

The Hard Step
Easy to visually inspect to obtain the different regions
But in higher-dimensional space, with many actions and observations: a hard problem

Naive way: enumerate
How does Incremental Pruning do it?

Incremental Pruning
How does IP improve on the naive method?
Will IP ever do worse than the naive method?

Combinations
Purge/Filter
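The purge/filter step keeps only the vectors that are optimal somewhere in belief space; the paper does this exactly with linear programs. As a hedged simplification, this sketch keeps each vector that wins at some point of a finite grid of beliefs (an approximation of the LP test, not the paper's method):

```python
# Approximate purge for a 2-state POMDP: keep every vector that is the max
# at some belief on a grid over [0, 1]. The true filter uses an LP to find
# a witness belief exactly; this grid version can miss thin regions.

def prune_on_grid(vectors, n_points=101):
    kept = []
    for i in range(n_points):
        p = i / (n_points - 1)
        b = [p, 1 - p]                    # belief over (s1, s2)
        best = max(vectors, key=lambda v: v[0] * b[0] + v[1] * b[1])
        if best not in kept:
            kept.append(best)
    return kept

vs = [[1.0, 0.0], [0.0, 1.5], [0.4, 0.4]]  # third is dominated everywhere
pruned = prune_on_grid(vs)
```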

Incremental Pruning
What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate()

RR has more linear programs but fewer constraints in the worst case.
Empirically, the saved constraints save more time than the extra linear programs cost.

Incremental Pruning
What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate()

Identifying Witness
Witness Thm:
- Let Ua be a set of vectors representing the value function
- Let u be in Ua (e.g. u = α(z1, a2) + α(z2, a1) + α(z3, a1))
- If there is a vector v which differs from u in one observation (e.g. v = α(z1, a1) + α(z2, a1) + α(z3, a1)), and there is a b such that b.v > b.u,
- then Ua is not equal to the true value function

Witness Algm
- Randomly choose a belief state b
- Compute the vector representing the best value at b (easy)
- Add the vector to the agenda
- While the agenda is not empty:
  - get vector Vtop from the top of the agenda
  - b = Dominate(Vtop, Ua)
  - if b is not null (there is a witness):
    - compute the vector u for the best value at b and add it to Ua
    - compute all vectors v that differ from u at one observation and add them to the agenda
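Dominate(v, U) asks: is there a belief where v beats every vector in U? The algorithm solves this with a linear program; as a hedged, approximate stand-in, this sketch searches a grid of beliefs instead, just to show the interface:

```python
# Approximate Dominate for a 2-state POMDP: return a witness belief where
# vector v strictly beats everything in U, or None. The exact version is a
# linear program; a grid search can miss narrow witness regions.

def dominate(v, U, n_points=101):
    for i in range(n_points):
        p = i / (n_points - 1)
        b = [p, 1 - p]
        val = lambda w: w[0] * b[0] + w[1] * b[1]
        if all(val(v) > val(u) + 1e-9 for u in U):
            return b  # witness found
    return None

w = dominate([0.0, 1.5], [[1.0, 0.0]])  # a witness exists near b(s2) = 1
```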

Linear Support
If the value function is incorrect, the biggest difference is at the edges (convexity)

Linear Support

Experiments
Comments?

Important Ideas
Purge()

Flaws
Insufficient background/motivation

Future Research
Better best-case/worst-case analyses
Precision parameter

Variants
Reactive Policy
- s_t = z_t; π(z) = a
- branch & bound search
- gradient ascent search
- perceptual aliasing problem
Finite History Window
- π(z1...zk) = a
- suffix tree to represent observations, with an action at each leaf
Recurrent Neural Nets
- use neural nets to maintain some state (so information about the past is not forgotten)

Variants - Belief State MDP
Exact V, exact b
Approximate V, exact b
- discretize b into a grid and interpolate
Exact V, approximate b
- use particle filters to sample b
- track an approximate belief state using a DBN
Approximate V, approximate b
- combine the previous two

Variants - Pegasus
Policy Evaluation of Goodness And Search Using Scenarios
- convert the POMDP to another POMDP with deterministic state transitions
- search for the policy of the transformed POMDP with the highest estimated value

That's it!
