Percentile Optimization
for Markov Decision Processes with
Parameter Uncertainty
Erick Delage, Stanford University, [email protected]
Shie Mannor, McGill University, [email protected]
For more information, visit www.stanford.edu/~edelage
(Research supported by the Fonds Québécois de la recherche sur la nature et les technologies.)
Delage E., Mannor S., Percentile Optimization for MDPs with Parameter Uncertainty – p. 1/34
Vanilla Data-Driven Decision Making
Gather data from your system
Choose a model for the system
Point estimate the parameters of your model
Find a policy that maximizes return
Hopefully, true system ≈ estimated model
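The point-estimation step of this pipeline can be sketched in a few lines. This is a generic illustration with made-up trajectory data and state names, not the authors' code:

```python
from collections import Counter, defaultdict

def estimate_transitions(trajectories):
    """Maximum-likelihood point estimate of P(s'|s,a) from (s, a, s') triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in trajectories:
        counts[(s, a)][s_next] += 1
    return {(s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
            for (s, a), c in counts.items()}

# toy data: under action "use", the machine stays "ok" 3 times out of 4
data = [("ok", "use", "ok"), ("ok", "use", "ok"),
        ("ok", "use", "broken"), ("ok", "use", "ok")]
P_hat = estimate_transitions(data)
print(P_hat[("ok", "use")])   # {'ok': 0.75, 'broken': 0.25}
```

A downstream solver then treats P_hat as the true model — exactly the step the following slides put in question.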
A Machine Replacement Problem
Machine ages and can require “light” or “heavy” repairs
Two different repair services are available
Choose the repair policy that minimizes long-term costs
Example of Model Uncertainty
Given that for the “heavy” repair state historical data says:
Repair option 1 was successful 90% of 100 trials
Repair option 2 was successful 100% of 5 trials
What is the true model for repair option #2?
What is the TRUE long-term cost of using repair option #2?
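To make the comparison concrete: under an assumed uniform Beta prior (an assumption for this sketch, not something stated in the talk), a pessimistic 10th-percentile posterior estimate of each option's success rate already separates the two:

```python
import random

random.seed(0)

def posterior_percentile(successes, trials, pct, n_samples=100_000):
    """Empirical percentile of the Beta(successes+1, failures+1) posterior
    (uniform prior) over the unknown success probability."""
    draws = sorted(random.betavariate(successes + 1, trials - successes + 1)
                   for _ in range(n_samples))
    return draws[int(pct / 100 * n_samples)]

p1 = posterior_percentile(90, 100, 10)   # option 1: 90 successes in 100 trials
p2 = posterior_percentile(5, 5, 10)      # option 2: 5 successes in 5 trials
print(round(p1, 2), round(p2, 2))
```

Despite a perfect record, option 2's pessimistic estimate (≈ 0.68) sits well below option 1's (≈ 0.85): five trials simply do not pin the parameter down.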
The Curse of Model Uncertainty
[Plot: distribution of expected discounted returns, marking the expected-return prediction based on the estimated model and the actual expected return when accounting for parameter uncertainty]
The Curse of Model Uncertainty (II)
Model uncertainty is always present
We cannot always afford to make it negligible
Robust methods are difficult to apply and are deceptively conservative
Data-Driven MDP Approach
We address the problem of model uncertainty in the context of Markov decision processes
Assumption: the system behaves as an MDP with known states and actions
Data available: transition trajectories & noisy reward measurements
Goal: find a stationary policy that performs “well” on the true MDP (risk-sensitivity)
Data-Driven MDP Approach
We propose a risk-sensitive method for addressing data-driven Markov decision processes:
Estimating model uncertainty from data
Finding risk-sensitive controls
Markov Decision Processes
A simple and popular model (MDP)
Ingredients:
1. State space S
2. Action space A
3. Reward R : S × A → ℝ
4. Transition probability P (s′|s, a)
Dynamics: St → At → Rt → St+1
[Diagram: the decision maker observes state s_t and reward r_t, selects action a_t; the system transitions S_t → S_{t+1}]
MDPs: The Objective
Objective: maximize, over all policies π,

    E^π[ ∑_{t=0}^∞ α^t R_t | S_0 = s ] ,   with discount factor α < 1.

There exists an optimal stationary and deterministic policy π : S → A.

Algorithmically “easy”: linear programming, policy iteration, value iteration, dynamic programming
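For reference, the “easy” solution of a fully known MDP via value iteration; the two-state machine below is a hypothetical example, not from the talk:

```python
def value_iteration(S, A, R, P, alpha=0.9, tol=1e-8):
    """Standard value iteration for a known MDP.
    R[s][a]: immediate reward; P[s][a]: dict mapping s' -> probability."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(R[s][a] + alpha * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in A)
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

# tiny machine-replacement-style MDP (hypothetical numbers)
S, A = ["good", "bad"], ["wait", "repair"]
R = {"good": {"wait": 1.0, "repair": 0.0}, "bad": {"wait": -1.0, "repair": -0.5}}
P = {"good": {"wait": {"good": 0.8, "bad": 0.2}, "repair": {"good": 1.0}},
     "bad":  {"wait": {"bad": 1.0},             "repair": {"good": 0.9, "bad": 0.1}}}
V = value_iteration(S, A, R, P)
print(V["good"] > V["bad"])   # True
```

Value iteration converges because the Bellman backup is an α-contraction in the sup norm.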
Parameter Uncertainty in MDP
We always have uncertainty in the parameters
R and P (s′|s, a)
I don’t have a model - estimate from data
I know I don’t know (part of the model)
MDP is a model reduction
Bayesian Parameter Uncertainty
We always have uncertainty in the parameters
R and P (s′|s, a)
Bayesian Approach:
Start with a prior: P(R, P)
Gather model observations O : O ∼ P(O | R, P)
Compute the posterior distribution over the model: P(R, P | O) ∝ P(O | R, P) P(R, P)
Posterior can be evaluated using Gibbs sampling
A Distribution over MDP Models
A Gaussian prior on rewards:
Prior belief is R(i, a) ∼ N(μ(i,a), σ²(i,a))
Given a new noisy measurement R̂(i, a) = R(i, a) + ν (Gaussian noise ν), the belief remains Gaussian

A Dirichlet prior on the transition parameters P(·|i, a) = p⃗:
Prior belief on p⃗ is f(p⃗) ∝ ∏_{j=1}^{|S|} p_j^{β_j − 1}
Given a new observed transition from (i, a), the belief remains a Dirichlet distribution
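The conjugate Dirichlet update and posterior sampling can be sketched with the standard trick of normalizing independent Gamma draws; the observation counts below are hypothetical:

```python
import random

random.seed(1)

def dirichlet_sample(beta):
    """Draw p ~ Dirichlet(beta) by normalizing independent Gamma(beta_j, 1) draws."""
    g = [random.gammavariate(b, 1.0) for b in beta]
    total = sum(g)
    return [x / total for x in g]

beta = [1.0, 1.0, 1.0]        # uniform Dirichlet prior over |S| = 3 successors
observed = [0, 0, 2, 0, 1]    # five observed transitions from some (i, a)
for s_next in observed:
    beta[s_next] += 1         # conjugacy: posterior is Dirichlet(prior + counts)

p = dirichlet_sample(beta)    # one plausible transition vector P(.|i, a)
print(beta)                   # [4.0, 2.0, 2.0]
```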
The Bayesian Approach
We have a probability distribution over models:

Consider V^π = E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] as a random variable.

For a given π and our current belief, we can ask what is

    E_models[ V^π ] = E_models[ E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] ]

What if the true MDP does not behave like E_models[V^π]?
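E_models[V^π], and the whole distribution of V^π, can be approximated by sampling models from the posterior and evaluating the fixed policy on each sample. A rough Monte Carlo sketch with hypothetical Dirichlet posterior counts (two states, α = 0.9):

```python
import random

random.seed(2)

def policy_eval(P_pi, R_pi, alpha=0.9, iters=200):
    """Iterate V <- R + alpha * P_pi V to evaluate a fixed policy."""
    n = len(R_pi)
    V = [0.0] * n
    for _ in range(iters):
        V = [R_pi[s] + alpha * sum(P_pi[s][t] * V[t] for t in range(n))
             for s in range(n)]
    return V

def sample_model(beta_rows):
    """Sample each transition row from its Dirichlet posterior (Gamma trick)."""
    rows = []
    for beta in beta_rows:
        g = [random.gammavariate(b, 1.0) for b in beta]
        tot = sum(g)
        rows.append([x / tot for x in g])
    return rows

beta_rows = [[8.0, 2.0], [3.0, 7.0]]   # hypothetical posterior counts per state
R_pi = [1.0, -1.0]                     # known rewards under the fixed policy

samples = [policy_eval(sample_model(beta_rows), R_pi)[0] for _ in range(2000)]
mean_V = sum(samples) / len(samples)   # Monte Carlo estimate of E_models[V^pi]
print(round(mean_V, 2))
```

Comparing mean_V with min(samples) and max(samples) shows how wide the spread of V^π can be — which is exactly the worry this slide raises.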
The Curse of Parameter Uncertainty
[Plot: distribution of the random V^π (expected discounted reward), marking V^π of the expected model, E_models(V^π | model), the worst and best V^π, and the 90% percentile]
Data-Driven MDP Approach
Estimating model uncertainty from data
Finding risk-sensitive controls
The Robust Approach
Δ(R) = {set of all possible rewards}
Δ(P) = {set of all possible transition probabilities}

Objective:

    max_π  min_{R ∈ Δ(R), P ∈ Δ(P)}  E^π[ ∑_{t=0}^∞ α^t R(x_t) ]

Tractable solution via dynamic programming
Non-probabilistic uncertainty
Uncertainty may be difficult to calibrate
Leads to conservative policies

(Iyengar, 2002; Nilim and El Ghaoui, 2005)
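As a hedged illustration of the max-min backup, here is worst-case value iteration simplified to a finite uncertainty set of candidate transition rows (the cited papers use richer structured sets, e.g. likelihood-based, solved by an inner optimization; the toy numbers are hypothetical):

```python
def robust_value_iteration(S, A, R, P_sets, alpha=0.9, iters=300):
    """Worst-case value iteration: at each backup the adversary picks the
    worst candidate transition vector in P_sets[s][a]."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: max(min(R[s][a] + alpha * sum(p * V[s2] for s2, p in cand.items())
                        for cand in P_sets[s][a])
                    for a in A)
             for s in S}
    return V

# toy machine model; "nominal" uses singleton uncertainty sets
S, A = ["g", "b"], ["wait", "fix"]
R = {"g": {"wait": 1.0, "fix": 0.0}, "b": {"wait": -1.0, "fix": -0.5}}
nominal = {"g": {"wait": [{"g": 0.8, "b": 0.2}], "fix": [{"g": 1.0}]},
           "b": {"wait": [{"b": 1.0}], "fix": [{"g": 0.9, "b": 0.1}]}}
# the uncertainty sets add a pessimistic candidate per (s, a)
uncertain = {"g": {"wait": [{"g": 0.8, "b": 0.2}, {"g": 0.5, "b": 0.5}],
                   "fix": [{"g": 1.0}, {"g": 0.9, "b": 0.1}]},
             "b": {"wait": [{"b": 1.0}],
                   "fix": [{"g": 0.9, "b": 0.1}, {"g": 0.6, "b": 0.4}]}}
V_nom = robust_value_iteration(S, A, R, nominal)
V_rob = robust_value_iteration(S, A, R, uncertain)
print(V_rob["g"] <= V_nom["g"])   # True: the robust value is more pessimistic
```

Enlarging the uncertainty set can only lower each backup, which is why robust values are pessimistic — the conservatism the slide warns about.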
Percentile Optimization
Find the optimal policy according to:

    max_{policy π, y ∈ ℝ}  y
    sub. to  P_models( E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] ≥ y ) ≥ η

Value-at-risk: η is the risk parameter
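Given posterior samples of V^π for each candidate policy, the criterion reduces to comparing empirical quantiles (with the slides' convention, η = 0.9 makes y the 10th percentile of V^π). The sample values below are made up for illustration:

```python
def percentile_value(v_samples, eta):
    """Largest y with empirical P(V^pi >= y) >= eta."""
    v = sorted(v_samples)
    k = round((1.0 - eta) * len(v))
    return v[k]

v_a = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]   # steady policy
v_b = [-5, 2, 14, 18, 20, 21, 22, 23, 24, 25]    # higher mean, riskier
best = max(["a", "b"],
           key=lambda name: percentile_value({"a": v_a, "b": v_b}[name], 0.9))
print(best)   # "a": the steady policy wins at the 90% confidence level
```

A risk-neutral (mean) comparison would pick policy b instead; the percentile criterion trades expected return for a guarantee that holds with probability η.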
It turns out that solving the percentile optimization problem is:
NP-hard in general
NP-hard even if transitions are known
Polytime for Gaussian reward parameters
Useful approximation for Dirichlet transitions
Percentile Optimization : Rewards
Suppose R ∼ N(μ_R, Θ_R) and q is the initial distribution over states. An η-percentile optimal policy can be found by solving

    max_{x ∈ ℝ^{|S|×|A|}}  ∑_a x_a^⊤ μ_R − Φ^{−1}(η) ‖ ∑_a x_a^⊤ Θ_R^{1/2} ‖_2
    subject to  ∑_a x_a^⊤ = q^⊤ + α ∑_a x_a^⊤ P_a ,
                x_a ≥ 0 , ∀ a ∈ A.

Not much harder than the original problem
Percentile Optimization : Transitions (I)
In the case where the rewards are known but the transitions are uncertain, it is already hard to solve

    max_{policy π}  E_models[ E^π_x[ ∑_{t=0}^∞ α^t R_t ] ] ,

which is equivalent to

    max_π  E_models[ (I − α P_π^{model})^{−1} R ]

The objective depends non-linearly on all moments of P
Percentile Optimization : Transitions (II)
Let F(π) be the second-order approximation

    F(π) = q^⊤ X_π R + α² q^⊤ X_π Π Q_π X_π R

F(π) only depends on the first and second moments of P
Optimizing F(π) is tractable for problems with ≈ 1000 states
Given more than M observed transitions from any state-action pair, the policy π = argmax_π F(π) is O(1/√((1 − η) M)) optimal for the percentile problem
Experiment on MDP with Dirichlet Prior
State-related returns R are fully known
Dirichlet prior for the transitions P(s′|s, a)
Observed 5 transitions for each state-action pair
Choose the repair policy that maximizes the 90% percentile bound on returns
Experiment on MDP with Dirichlet Prior
[Plot: comparison of the random V^π distributions (discounted reward) for the robust, nominal, and 2nd-order-approximation policies, marking each policy’s mean and 90% percentile]
Closing Discussion
We proposed a risk-sensitive approach for data-driven MDP optimization:
Estimating model uncertainty from data (Bayesian framework)
Finding risk-sensitive controls (percentile criteria)
Closing Discussion
We proposed a risk-sensitive approach for data-driven MDPoptimization
Future Work:
Revisit standard data-driven MDPs with this percentile-based method
Use this framework to provide strategies for parameter exploration
Address model uncertainty in other forms of decision problems