Partially-Observable Markov Decision Processes
Tom Dietterich, MCAI 2013



TRANSCRIPT

Page 1: Partially-Observable Markov Decision Processes


Partially-Observable Markov Decision Processes

Tom Dietterich

Page 2: Partially-Observable Markov Decision Processes


Markov Decision Process as a Decision Diagram

[Decision diagram: state $s_t$ transitions to state $s_{t+1}$; the action $a_t$ and reward $r_t$ occur in between.]

Note: we observe before we choose. All states, actions, and rewards are observed.

Page 3: Partially-Observable Markov Decision Processes


What If We Can’t Directly Observe the State?

[Decision diagram: hidden states $s_t \to s_{t+1}$ with action $a_t$ and reward $r_t$; each state emits an observation ($o_t$, $o_{t+1}$).]

Note: we observe before we choose. Only the observations are observed, not the underlying states.

Page 4: Partially-Observable Markov Decision Processes


POMDPs are Hard to Solve

• Tradeoff between taking actions to gain information and taking actions to change the world
  – Some actions can do both

Page 5: Partially-Observable Markov Decision Processes

Optimal Management of Difficult-to-Observe Invasive Species [Regan et al., 2011]


Branched Broomrape (Orobanche ramosa)
• Annual parasitic plant
• Attaches to the root system of a host plant
• Results in a 75-90% reduction in host biomass
• Each plant makes ~50,000 seeds
• Seeds are viable for 12 years

Page 6: Partially-Observable Markov Decision Processes

Quarantine Area in S. Australia


375 farms; 70km x 70km area

(Map: Google Maps)

Page 7: Partially-Observable Markov Decision Processes

Formulation as a POMDP: Single Farm


• States: {Empty, Seeds, Plants & Seeds}
• Actions: {Nothing, Host Denial, Fumigation}
• Observations: {Absent, Present}, governed by a detection probability
• Rewards: Cost(Nothing), Cost(Host Denial), Cost(Fumigation)
• Objective: 20-year discounted reward (discount = 0.96)

[State diagram]

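To make the formulation concrete, here is a minimal sketch of this single-farm model as plain Python data. The state, action, and observation sets and the discount factor come from the slide; every numeric probability and cost below is an illustrative placeholder, not a value from Regan et al. (2011).

```python
# Single-farm broomrape POMDP, sketched as plain Python data structures.
# Sets and discount factor follow the slide; all numbers are placeholders.

STATES = ["Empty", "Seeds", "PlantsAndSeeds"]
ACTIONS = ["Nothing", "HostDenial", "Fumigation"]
OBSERVATIONS = ["Absent", "Present"]
GAMMA = 0.96  # 20-year discounted-reward objective

# Transition model T[a][s][s'] = P(s' | s, a); only "Nothing" is filled in here,
# and its numbers are made up for illustration.
T = {
    "Nothing": {
        "Empty":          {"Empty": 1.0, "Seeds": 0.0, "PlantsAndSeeds": 0.0},
        "Seeds":          {"Empty": 0.1, "Seeds": 0.3, "PlantsAndSeeds": 0.6},
        "PlantsAndSeeds": {"Empty": 0.0, "Seeds": 0.0, "PlantsAndSeeds": 1.0},
    },
    # "HostDenial" and "Fumigation" would be specified the same way.
}

# Observation model O[s][o] = P(o | s): plants are detected imperfectly,
# and seeds alone are never detected.  The detection probability is a placeholder.
DETECT = 0.5
O = {
    "Empty":          {"Absent": 1.0, "Present": 0.0},
    "Seeds":          {"Absent": 1.0, "Present": 0.0},
    "PlantsAndSeeds": {"Absent": 1.0 - DETECT, "Present": DETECT},
}

# Per-step reward = minus the cost of the chosen action (placeholder costs).
COST = {"Nothing": 0.0, "HostDenial": 20.0, "Fumigation": 100.0}
def reward(state, action):
    return -COST[action]
```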

Page 8: Partially-Observable Markov Decision Processes

Optimal MDP Policy


If a plant is detected, Fumigate; else Do Nothing. (Assumes perfect detection.)

www.grdc.com.au


Page 9: Partially-Observable Markov Decision Processes

Optimal POMDP Policy for


Same as the Optimal MDP Policy

[FSM policy diagram: two decision states, 0 (Fumigate) and 1 (Nothing), with edges labeled by the observation (ABSENT / PRESENT) received after each action.]

Page 10: Partially-Observable Markov Decision Processes

Optimal Policy for


In one case, Deny Host for 15 years before switching to Nothing; in the other case shown, Deny Host for 17 years before switching to Nothing.


[FSM policy diagram: Fumigate node 0, a chain of Deny Host nodes 1 through 16, then Nothing; edges are labeled by the observation (ABS / PRESENT), with ABSENT observations advancing along the chain.]

Page 11: Partially-Observable Markov Decision Processes

Probability of Eradication


Page 12: Partially-Observable Markov Decision Processes

Discussion


• The POMDP is exactly solvable here because the state space is very small.
• The real problem is more complex:
  – Each farm can have many fields, each with its own hidden state.
  – There are 375 farms in the quarantine area, so $3^{375}$ states even if we treat each farm as a single unit.
  – Exact solution of large POMDPs is beyond the state of the art.
• Notice that there is no tradeoff here between acting to gather information and acting to change the world: none of the actions gain information.


Page 13: Partially-Observable Markov Decision Processes

Ways to Avoid a POMDP (1)


State Estimation and State Tracking

• In many problems, we have (or can acquire) enough sensors that we can estimate the state quite well: the belief over hidden states has low uncertainty. Let $\hat{s}_t$ be the most likely hidden state.
• In such problems, we can pretend that we have an MDP and that we directly observe $\hat{s}_t$.
• We do not need to take actions to gain information, so we do not face this difficult tradeoff.
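A minimal sketch of this shortcut, assuming the belief is maintained as a probability vector over hidden states; the belief vector and MDP policy in the example are hypothetical.

```python
import numpy as np

def most_likely_state(belief):
    """Return argmax_s b(s): the index of the most probable hidden state."""
    return int(np.argmax(belief))

def act_as_if_mdp(belief, mdp_policy):
    """Pretend the most likely state is the true state and apply an MDP policy.

    `mdp_policy` maps a state index to an action; it stands in for a policy
    computed by ordinary MDP methods such as value iteration.
    """
    return mdp_policy[most_likely_state(belief)]

# When the belief is concentrated on one state, this shortcut is usually safe.
belief = np.array([0.9, 0.08, 0.02])   # P(Empty), P(Seeds), P(PlantsAndSeeds)
mdp_policy = {0: "Nothing", 1: "Fumigation", 2: "Fumigation"}  # hypothetical policy
print(act_as_if_mdp(belief, mdp_policy))   # -> Nothing
```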

Page 14: Partially-Observable Markov Decision Processes

Ways to Avoid a POMDP (2)


Pure Information-Gathering POMDPs

• Consider medical diagnosis for a specific disease, where there is a set of tests that can be performed. Our goal is to decide whether the patient has the disease by choosing which tests to perform.
• Each test has two possible outcomes (positive and negative) and a cost.
• Given any subset of the test outcomes, we can compute the probability that the patient has the disease.
• There is a "false positive" cost for incorrectly saying that the patient has the disease and a "false negative" cost for incorrectly saying that the patient does not.

Page 15: Partially-Observable Markov Decision Processes

Formulation as an MDP


• States: each state records, for every test, either its observed outcome or "not yet performed"; the starting state is the one in which no test has been performed.
• Actions:
  – One action per medical test.
  – A "declare negative" action says "the patient does not have the disease" and terminates, with cost 0 if correct and the false-negative cost if incorrect.
  – A "declare positive" action says "the patient has the disease" and terminates, with cost 0 if correct and the false-positive cost if incorrect.
• State transitions:
  – When we perform a test, the resulting state fills in that test's entry with the observed outcome, according to the outcome probabilities.
  – When we perform a "declare" action, the problem transitions to a terminal state with probability 1.
• If there aren't too many tests and we know the relevant probabilities, we can enumerate the states and solve this via standard MDP methods.
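A small sketch of the state space and the terminal "declare" costs of this MDP. The number of tests (3) and the misclassification costs are assumptions made for illustration, and computing the probability of disease given the observed outcomes is left abstract, as on the slide.

```python
from itertools import product

N_TESTS = 3               # assumed number of tests (the slide leaves this generic)
C_FALSE_POS = 50.0        # placeholder cost of declaring "disease" when absent
C_FALSE_NEG = 200.0       # placeholder cost of declaring "no disease" when present

# A state records, for every test, '?' (not yet performed) or its outcome (0/1).
STATES = list(product(["?", 0, 1], repeat=N_TESTS))   # 3**N_TESTS states
START = ("?",) * N_TESTS

def declare_cost(p_disease):
    """Expected cost of the better of the two terminal 'declare' actions,
    given P(disease | observed outcomes)."""
    cost_say_positive = (1.0 - p_disease) * C_FALSE_POS   # wrong iff no disease
    cost_say_negative = p_disease * C_FALSE_NEG           # wrong iff disease
    return min(cost_say_positive, cost_say_negative)

print(len(STATES))          # 27 states for 3 tests
print(declare_cost(0.3))    # expected cost of stopping now with P(disease) = 0.3
```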

Page 16: Partially-Observable Markov Decision Processes

Belief States


In general, we can think of a POMDP as being an MDP over a Belief State

In the medical diagnosis cases, the belief states have the form (0,1,?,?,0,?)

In the Broomrape case, the belief state is a probability distribution over the 3 states: empty, seeds, and weeds + seeds.

[Belief simplex (triangle) with corners labeled empty, seeds, and weeds + seeds.]

Page 17: Partially-Observable Markov Decision Processes

Belief State Reasoning


• Each observation updates the belief state.
• Example: observing the presence of weeds means weeds are present and seeds might also be present.

[Diagram: belief simplex over {empty, seeds, weeds + seeds} before and after the "observe present" update; the belief moves to the weeds + seeds corner.]
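A minimal sketch of this observation update as Bayes' rule over the three broomrape belief components; the 0.5 detection probability in the example is a placeholder.

```python
import numpy as np

STATE_NAMES = ["empty", "seeds", "weeds+seeds"]

def observation_update(belief, likelihood):
    """Bayes update: b'(s) is proportional to P(o | s) * b(s), with likelihood[s] = P(o | s)."""
    unnormalized = np.asarray(likelihood) * belief
    return unnormalized / unnormalized.sum()

# P(observe "present" | state): plants are seen only if they are actually there,
# with a placeholder detection probability of 0.5; seeds alone are never seen.
likelihood_present = [0.0, 0.0, 0.5]

b = np.array([0.5, 0.3, 0.2])
print(observation_update(b, likelihood_present))   # -> [0. 0. 1.]: weeds (and seeds) present
```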

Page 18: Partially-Observable Markov Decision Processes

Taking Actions


• Each action updates the belief state.
• Example: fumigate.

[Diagram: belief simplex over {empty, seeds, weeds + seeds} before and after the fumigate action.]
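A matching sketch of the action (prediction) update; the fumigation transition matrix below is an illustrative placeholder, not the model fitted in the paper.

```python
import numpy as np

def action_update(belief, T_a):
    """Predict the belief after an action, before the next observation:
    b'(s') = sum_s P(s' | s, a) * b(s), i.e. b' = T_a^T @ b."""
    return T_a.T @ belief

# Placeholder fumigation dynamics over [empty, seeds, weeds+seeds]:
# fumigation kills plants and destroys most (but not all) seeds.
T_fumigate = np.array([
    [1.0, 0.0, 0.0],   # empty stays empty
    [0.8, 0.2, 0.0],   # seeds are mostly destroyed
    [0.7, 0.3, 0.0],   # plants are killed; some seeds may survive
])

b = np.array([0.0, 0.0, 1.0])          # we just observed "present"
print(action_update(b, T_fumigate))    # -> [0.7 0.3 0. ]
```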

Page 19: Partially-Observable Markov Decision Processes

Belief MDP


• State space: all reachable belief states
• Action space: the same actions as the POMDP
• Reward function: expected rewards derived from the underlying states
• Transition function: moves in belief space

Problem: the belief space is continuous, and there can be an immense number of reachable belief states.
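A short sketch of these two ingredients, assuming the models are given as matrices: `T_a[s, s']` is the transition model and `O_a[s', o]` the observation model for a fixed action.

```python
import numpy as np

def belief_reward(belief, reward_per_state):
    """Expected immediate reward in the belief MDP: R(b, a) = sum_s b(s) R(s, a)."""
    return float(belief @ reward_per_state)

def belief_successors(belief, T_a, O_a):
    """Possible next beliefs after an action, one per observation.

    Returns (P(o | b, a), next_belief) pairs, where the next belief is the
    action-predicted belief re-weighted by the observation likelihood.
    """
    predicted = T_a.T @ belief                 # b'(s') = sum_s T_a[s, s'] b(s)
    successors = []
    for o in range(O_a.shape[1]):
        p_o = float(O_a[:, o] @ predicted)     # P(o | b, a)
        if p_o > 0.0:
            successors.append((p_o, O_a[:, o] * predicted / p_o))
    return successors
```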

Page 20: Partially-Observable Markov Decision Processes

Monte Carlo Policy Evaluation


Key insight: it is just as easy to evaluate a policy via Monte Carlo trials in a POMDP as it is in an MDP!

Approach:
• Define a space of policies
• Evaluate them by Monte Carlo trials
• Pick the best one
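A minimal sketch of this approach; `simulate_episode` is a stand-in for a POMDP simulator that keeps the hidden state internally and shows the policy only the observations.

```python
def evaluate_policy(policy, simulate_episode, n_trials=1000):
    """Estimate expected discounted return by averaging Monte Carlo rollouts.

    This is no harder for a POMDP than for an MDP: the simulator tracks the
    hidden state; the policy sees only observations.
    """
    return sum(simulate_episode(policy) for _ in range(n_trials)) / n_trials

def pick_best(candidate_policies, simulate_episode, n_trials=1000):
    """Evaluate a finite set of candidate policies and return the highest-scoring one."""
    return max(candidate_policies,
               key=lambda pi: evaluate_policy(pi, simulate_episode, n_trials))
```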

Page 21: Partially-Observable Markov Decision Processes

Finite State Machine Policies


In many POMDPs (and MDPs), a policy can be represented as a finite state machine

We can design a set of FSM policies and then evaluate them

There are algorithms for incrementally improving FSM policies

[FSM policy diagram (same as on slide 10): Fumigate node 0, a chain of Deny Host nodes 1 through 16, then Nothing; edges are labeled by the observation (ABS / PRESENT), with ABSENT observations advancing along the chain.]
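A minimal sketch of an FSM policy as a lookup table, loosely following the broomrape policy drawn above (Fumigate, a chain of Deny Host nodes, then Nothing). The exact node count and transition structure are a plausible reading of the diagram, not a verified transcription.

```python
# node -> (action to execute in this node, {observation -> next node})
# The structure below is a plausible reading of the diagram, not a verified one.
FSM_POLICY = {
    0: ("Fumigation", {"Absent": 1, "Present": 0}),
    **{k: ("HostDenial", {"Absent": k + 1, "Present": 0}) for k in range(1, 17)},
    17: ("Nothing", {"Absent": 17, "Present": 0}),
}

def fsm_action(node):
    """Action the FSM policy takes in the given node."""
    return FSM_POLICY[node][0]

def fsm_next(node, observation):
    """Next FSM node after executing the node's action and observing `observation`."""
    return FSM_POLICY[node][1][observation]

# One step of the Deny-Host chain: no plants seen, so keep denying the host.
node = 1
print(fsm_action(node), "->", fsm_next(node, "Absent"))   # HostDenial -> 2
```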

Page 22: Partially-Observable Markov Decision Processes

Summary


• Many problems in AI can be formulated as POMDPs.
• Formulating a problem as a POMDP doesn't help much by itself, because POMDPs are so hard to solve (PSPACE-hard for the finite horizon; undecidable for the infinite horizon).
  – Can we do state estimation and pretend that the estimate $\hat{s}_t$ is the true state?
  – Are we performing pure observation actions?
  – Can the policy be divided into a pure observation phase and a pure action phase?
  – If so, we can use MDP methods instead.
• Unfortunately, many problems in ecosystem management are "essential" POMDPs that mix information gathering and world-changing actions.
• Monte Carlo methods (based on policy-space search) are one of the most practical ways of finding good POMDP solutions.