
Percentile Optimization

for Markov Decision Processes with

Parameter Uncertainty

Erick Delage, Stanford University, [email protected]

Shie Mannor, McGill University, [email protected]

For more information, visit www.stanford.edu/~edelage

(Research supported by the Fonds Québécois de la recherche sur la nature et les technologies.)

Delage E., Mannor S., Percentile Optimization for MDPs with Parameter Uncertainty – p. 1/34

Vanilla Data-Driven Decision Making

Gather data from your system

Choose a model for the system

Point estimate the parameters of your model

Find a policy that maximizes return

Hopefully, true system ≈ estimated model


A Machine Replacement Problem

Machine ages and can require “light” or “heavy” repairs

Two different repair services are available

Choose repair policy that minimizes long term costs


Example of Model Uncertainty

Given that for the “heavy” repair state historical data says:

Repair option 1 was successful 90% of 100 trials

Repair option 2 was successful 100% of 5 trials

What is the true model for repair option #2?

What is the TRUE long term cost of using repair option #2?

The Curse of Model Uncertainty

[Figure: distribution of expected returns along the expected discounted return axis, marking the expected-return prediction based on the estimated model and the actual expected return when accounting for parameter uncertainty.]

The Curse of Model Uncertainty (II)

Model uncertainty is always present

We cannot always afford to make it negligible

Robust methods are difficult to apply and are deceptively conservative


Data-Driven MDP Approach

We address the problem of model uncertainty in the context of Markov decision processes

Assumption: the system behaves as an MDP with known states and actions

Data is available for: transition trajectories & noisy reward measures

Goal: find a stationary policy that performs “well” on the true MDP (risk-sensitivity)


Data-Driven MDP Approach

We propose a risk-sensitive method for addressing data-driven Markov decision processes

Estimating model uncertainty from data

Finding risk-sensitive controls



Markov Decision Processes

A simple and popular model (MDP)

Ingredients:

1. State space S

2. Action space A

3. Reward R : S × A → ℝ

4. Transition probability P(s′|s, a)

Dynamics: St → At → Rt → St+1

[Diagram: the decision maker observes state St and reward rt from the system and sends action at; the system transitions St → St+1.]


MDPs: The Objective

Objective: maximize over all policies π

E^π[ ∑_{t=0}^∞ α^t R_t | S_0 = s ], with α < 1.

There exists an optimal stationary and deterministic policy.

π : S → A

Algorithmically “easy”: linear programming, policy iteration, value iteration, dynamic programming
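The “easy” claim is concrete: for α < 1 the Bellman backup is a contraction and value iteration converges geometrically. A minimal sketch; the tiny 2-state, 2-action machine-style instance below is invented for illustration and is not the MDP from the talk.

```python
# A sketch of value iteration for a discounted MDP; the 2-state,
# 2-action "machine" instance below is invented for illustration.
import numpy as np

def value_iteration(P, R, alpha=0.9, tol=1e-8):
    """P: (A, S, S) transition tensor, R: (S, A) rewards.
    Returns the optimal value function and a greedy policy."""
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + alpha * sum_t P[a, s, t] * V[t]
        Q = R + alpha * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# State 0 = working, state 1 = broken; action 0 = run, action 1 = repair.
P = np.array([
    [[0.9, 0.1], [0.0, 1.0]],   # run: a broken machine stays broken
    [[0.9, 0.1], [0.8, 0.2]],   # repair: restores a broken machine w.p. 0.8
])
R = np.array([[1.0, 0.5], [-1.0, -2.0]])
V, pi = value_iteration(P, R)   # pi[s] is the greedy action in state s
```

Here the greedy policy runs the working machine and repairs the broken one, since paying the repair cost beats staying broken forever.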


Parameter Uncertainty in MDP

We always have uncertainty in the parameters

R and P (s′|s, a)

I don’t have a model - estimate from data

I know I don’t know (part of the model)

MDP is a model reduction


Bayesian Parameter Uncertainty

We always have uncertainty in the parameters

R and P (s′|s, a)

Bayesian Approach:

Start with a prior: P(R, P)

Gather model observations O: O ∼ P(O|R, P)

Compute the posterior distribution over the model: P(R, P|O) ∝ P(O|R, P) P(R, P)

Posterior can be evaluated using Gibbs sampling


A Distribution over MDP Models

A Gaussian prior on rewards:

Prior belief is R(i, a) ∼ N(μ(i,a), σ²(i,a))

Given a new measurement R̂(i, a) = R(i, a) + ν (Gaussian noise), the belief remains Gaussian

A Dirichlet prior on transition parameters P(·|i, a) = p:

Prior belief on p is f(p) ∝ ∏_{j=1}^{|S|} p_j^{β_j − 1}

Given a new transition from (i, a), the belief remains a Dirichlet distribution
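Both priors above are conjugate, so the posterior update is simple bookkeeping. A sketch of one update of each kind; all observation values and hyperparameters below are invented for illustration.

```python
# Conjugate posterior updates for the two priors above (invented numbers).
import numpy as np

# Dirichlet prior on P(.|i, a): the posterior just adds transition counts.
beta = np.ones(3)                      # uniform prior over 3 successor states
counts = np.array([4.0, 1.0, 0.0])     # observed transitions out of (i, a)
beta_post = beta + counts
p_mean = beta_post / beta_post.sum()   # posterior mean of P(.|i, a)

# Gaussian prior on R(i, a) with known noise variance: posterior stays Gaussian.
mu0, s0_sq = 0.0, 4.0                  # prior mean and variance of R(i, a)
nu_sq = 1.0                            # variance of the measurement noise
r_obs = 2.5                            # one noisy reward measurement
s_post_sq = 1.0 / (1.0 / s0_sq + 1.0 / nu_sq)
mu_post = s_post_sq * (mu0 / s0_sq + r_obs / nu_sq)
```

The posterior mean of the reward shrinks the noisy measurement toward the prior mean, with weights set by the two precisions.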


The Bayesian Approach

We have a probability over models:

Consider V^π = E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] as a random variable.

For a given π and a current belief we can ask what is:

E_models[V^π] = E_models[ E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] ]

What if the true MDP does not behave like E_models[V^π]?
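Since V^π is a random variable through (R, P), its distribution for a fixed policy can be approximated by sampling models from the posterior and evaluating the policy on each sample. A sketch; the Dirichlet counts and reward posterior below are invented for illustration.

```python
# Monte Carlo approximation of the distribution of the random V^pi
# (all posterior parameters invented for illustration).
import numpy as np

rng = np.random.default_rng(0)
alpha, S = 0.9, 2

# Posterior over the policy-induced model: one Dirichlet row per state for
# P(.|s, pi(s)), and a Gaussian posterior on each state's reward.
beta = np.array([[9.0, 2.0], [3.0, 8.0]])
mu_R, sd_R = np.array([1.0, -2.0]), np.array([0.2, 0.5])

def eval_policy(P, R):
    # V = (I - alpha P)^{-1} R for the fixed policy's transition matrix P.
    return np.linalg.solve(np.eye(S) - alpha * P, R)

samples = np.array([
    eval_policy(np.vstack([rng.dirichlet(b) for b in beta]),
                rng.normal(mu_R, sd_R))[0]     # V^pi at state 0
    for _ in range(2000)
])
mean_V = samples.mean()    # Monte Carlo estimate of E_models[V^pi]
```

The spread of `samples` is exactly the model-uncertainty risk that the expected value alone hides.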


The Curse of Parameter Uncertainty

[Figure: distribution of the random V^π along the expected discounted reward (V^π) axis, marking V^π of the expected model, E_models[V^π], the worst and best V^π, and the 90% percentile.]


The Robust Approach

Δ(R) = {set of all possible rewards}
Δ(P) = {set of all possible transition probabilities}

Objective:

max_π min_{R∈Δ(R), P∈Δ(P)} E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ]

Tractable solution via dynamic programming

Non-probabilistic uncertainty

Uncertainty may be difficult to calibrate

Leads to conservative policies

(Iyengar, 2002; Nilim and El Ghaoui, 2005)
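For a finite uncertainty set, the robust objective admits a dynamic-programming backup that takes a worst case over models before maximizing over actions. A minimal sketch of that backup; the two candidate transition models below are invented for illustration, and a richer set would behave the same way.

```python
# Robust value iteration over a finite set of candidate models
# (invented 2-state instance; worst case taken per state-action pair).
import numpy as np

def robust_value_iteration(P_set, R, alpha=0.9, tol=1e-8):
    """P_set: (K, A, S, S) candidate transition models, R: (S, A) rewards.
    Worst-case (min over models) backup, then max over actions."""
    K, A, S, _ = P_set.shape
    V = np.zeros(S)
    while True:
        Q = (R[None] + alpha * np.einsum('kast,t->ksa', P_set, V)).min(axis=0)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

P_nom = np.array([
    [[0.9, 0.1], [0.0, 1.0]],   # action 0: run
    [[0.9, 0.1], [0.8, 0.2]],   # action 1: repair
])
P_bad = P_nom.copy()
P_bad[1, 1] = [0.4, 0.6]        # in the bad model, repairs succeed less often
R = np.array([[1.0, 0.5], [-1.0, -2.0]])

V_rob, pi_rob = robust_value_iteration(np.stack([P_nom, P_bad]), R)
V_nom, _ = robust_value_iteration(P_nom[None], R)   # nominal = 1-model "set"
```

The robust values are never above the nominal ones, which is the conservatism the slide warns about: the guarantee holds for the worst model, however unlikely that model is.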


Percentile Optimization

Find the optimal policy according to:

max_{π, y∈ℝ} y
subject to P_models( E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] ≥ y ) ≥ η

Value-at-risk: η is the risk parameter

It turns out that solving the percentile optimization problem is:

NP-hard in general

NP-hard even if transitions are known

Polytime for Gaussian reward parameters

Useful approximation for Dirichlet transitions
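For a fixed policy, the chance constraint can be read off empirically: sample models, compute V^π for each, and take the (1 − η)-quantile of the samples as y. A sketch; the two Gaussian sample sets standing in for two policies' V^π distributions are invented for illustration.

```python
# Empirical eta-percentile (value-at-risk) of sampled V^pi values
# (the two policies' sample distributions are invented).
import numpy as np

def percentile_value(v_samples, eta=0.9):
    """Largest y with empirical P(V^pi >= y) >= eta: the (1 - eta)-quantile."""
    return float(np.quantile(v_samples, 1.0 - eta))

rng = np.random.default_rng(1)
v_a = rng.normal(-8.0, 4.0, size=10_000)    # policy A: better mean, fat tails
v_b = rng.normal(-10.0, 1.0, size=10_000)   # policy B: worse mean, tight tails
y_a, y_b = percentile_value(v_a), percentile_value(v_b)
# Under the 90%-percentile criterion, policy B wins despite its worse mean.
```

This is the sense in which the criterion is risk-sensitive: it rewards policies whose bad models are not too bad, not policies that are best on average.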

Percentile Optimization : Rewards

Suppose R ∼ N(μ_R, Θ_R) and q is the initial distribution on states; an η-percentile optimal policy can be found using

max_{x ∈ ℝ^{|S|×|A|}}  ∑_a x_a^⊤ μ_R − Φ^{−1}(η) ‖ ∑_a x_a^⊤ Θ_R^{1/2} ‖₂

subject to  ∑_a x_a^⊤ = q^⊤ + ∑_a α x_a^⊤ P_a ,  x_a ≥ 0 ∀ a ∈ A.

Not much harder than the original problem
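For a fixed occupancy measure x the objective above is a closed-form number: the η-percentile of the Gaussian return x^⊤R. A sketch of evaluating it; the occupancy measure, reward mean, and covariance below are all invented for illustration.

```python
# Evaluating the Gaussian eta-percentile objective for a fixed occupancy
# measure (all problem data invented for illustration).
import numpy as np
from statistics import NormalDist

def gaussian_percentile_objective(x, mu_R, Theta_R, eta=0.9):
    """x^T mu_R - Phi^{-1}(eta) * ||Theta_R^{1/2} x||_2, i.e. the value y with
    P(x^T R >= y) = eta when R ~ N(mu_R, Theta_R)."""
    L = np.linalg.cholesky(Theta_R)            # Theta_R = L @ L.T
    return float(x @ mu_R - NormalDist().inv_cdf(eta) * np.linalg.norm(L.T @ x))

x = np.array([0.6, 0.4])                       # stacked occupancy measure
mu_R = np.array([1.0, 2.0])                    # posterior mean of rewards
Theta_R = np.array([[0.5, 0.1], [0.1, 0.3]])   # posterior covariance of rewards
val = gaussian_percentile_objective(x, mu_R, Theta_R)
```

Since ‖L^⊤x‖₂ = √(x^⊤Θ_R x), the objective is concave in x for η > 0.5, which is why the full problem is a tractable second-order cone program.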


Percentile Optimization : Transitions (I)

In the case where rewards are known but transitions are uncertain, it is already hard to solve:

max_π E_models[ E^π_x[ ∑_{t=0}^∞ α^t R_t ] ]

equivalent to:

max_π E_models[ (I − α P^model_π)^{−1} R ]

The objective depends non-linearly on all moments of P
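The non-linearity is easy to exhibit numerically: the value of the mean model differs from the mean value over models. A sketch on an invented toy chain: one rewarding state self-loops with uncertain probability p and otherwise falls into an absorbing zero-reward state, so V = 1/(1 − αp).

```python
# Jensen gap between "value of the mean model" and "mean value over models"
# (invented one-parameter toy chain: V = 1 / (1 - alpha * p)).
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.9
p = rng.beta(2.0, 2.0, size=100_000)            # posterior samples of p
mean_value = np.mean(1.0 / (1.0 - alpha * p))   # E_models[(1 - alpha p)^{-1}]
value_of_mean = 1.0 / (1.0 - alpha * p.mean())  # (1 - alpha E_models[p])^{-1}
# p -> 1/(1 - alpha p) is convex, so Jensen's inequality makes the mean value
# strictly exceed the value of the mean model.
```

Plugging point estimates into the Bellman equation therefore evaluates a different quantity than E_models[V^π], which is why the moments of P enter the objective beyond the first.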


Percentile Optimization : Transitions (II)

Let F(π) be the second-order approximation

F(π) = q^⊤ X_π R + α² q^⊤ X_π Π Q_π X_π R

F(π) only depends on the first and second moments of P

Optimizing F(π) is tractable for problems with ≈ 1000 states

Given more than M observed transitions from any state-action pair, the policy π = argmax_π F(π) is O(1/((1 − η)M))-optimal for the percentile problem

Experiment on MDP with Dirichlet Prior

State related returns R are fully known

Dirichlet prior for transitions P (s′|s, a)

Observed 5 transitions for each state-action pair

Choose the repair policy that maximizes the 90% percentile bound on returns


Experiment on MDP with Dirichlet Prior

[Figure: comparison of the random V^π distributions for the robust, nominal, and 2nd-order-approximation policies along the discounted reward axis, with each policy’s mean and 90% percentile marked.]

Closing Discussion

We proposed a risk-sensitive approach for data-driven MDP optimization:

Estimating model uncertainty from data (Bayesian framework)

Finding risk-sensitive controls (percentile criteria)

Closing Discussion

We proposed a risk-sensitive approach for data-driven MDPoptimization

Future Work:

Revisit standard data-driven MDPs with this percentile-based method

Use this framework to provide strategies for parameter exploration

Address model uncertainty in other forms of decision problems


Thank You!

For more information, visit www.stanford.edu/~edelage