Percentile Optimization
for Markov Decision Processes with
Parameter Uncertainty
Erick Delage, Stanford University, [email protected]
Shie Mannor, McGill University, [email protected]
For more information, visit www.stanford.edu/~edelage
(Research supported by the Fonds Québécois de la recherche sur la nature et les technologies.)
Delage E., Mannor S., Percentile Optimization for MDPs with Parameter Uncertainty – p. 1/34
Vanilla Data-Driven Decision Making
Gather data from your system
Choose a model for the system
Point estimate the parameters of your model
Find a policy that maximizes return
Hopefully, true system ≈ estimated model
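The point-estimation step of this pipeline can be sketched in a few lines. This is a generic illustration with made-up trajectory data and state names, not the authors' code:

```python
from collections import Counter, defaultdict

def estimate_transitions(trajectories):
    """Maximum-likelihood point estimate of P(s'|s,a) from (s, a, s') triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in trajectories:
        counts[(s, a)][s_next] += 1
    return {(s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
            for (s, a), c in counts.items()}

# toy data: under action "use", the machine stays "ok" 3 times out of 4
data = [("ok", "use", "ok"), ("ok", "use", "ok"),
        ("ok", "use", "broken"), ("ok", "use", "ok")]
P_hat = estimate_transitions(data)
print(P_hat[("ok", "use")])   # {'ok': 0.75, 'broken': 0.25}
```

A downstream solver then treats P_hat as the true model — exactly the step the following slides put in question.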
A Machine Replacement Problem
Machine ages and can require “light” or “heavy” repairs
Two different repair services are available
Choose the repair policy that minimizes long-term costs
Example of Model Uncertainty
Given that for the “heavy” repair state historical data says:
Repair option 1 was successful 90% of 100 trials
Repair option 2 was successful 100% of 5 trials
What is the true model for repair option #2?
What is the TRUE long-term cost of using repair option #2?
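To make the comparison concrete: under an assumed uniform Beta prior (an assumption for this sketch, not something stated in the talk), a pessimistic 10th-percentile posterior estimate of each option's success rate already separates the two:

```python
import random

random.seed(0)

def posterior_percentile(successes, trials, pct, n_samples=100_000):
    """Empirical percentile of the Beta(successes+1, failures+1) posterior
    (uniform prior) over the unknown success probability."""
    draws = sorted(random.betavariate(successes + 1, trials - successes + 1)
                   for _ in range(n_samples))
    return draws[int(pct / 100 * n_samples)]

p1 = posterior_percentile(90, 100, 10)   # option 1: 90 successes in 100 trials
p2 = posterior_percentile(5, 5, 10)      # option 2: 5 successes in 5 trials
print(round(p1, 2), round(p2, 2))
```

Despite a perfect record, option 2's pessimistic estimate (≈ 0.68) sits well below option 1's (≈ 0.85): five trials simply do not pin the parameter down.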
The Curse of Model Uncertainty
[Plot: distribution of expected discounted returns, marking the expected-return prediction based on the estimated model and the actual expected return when accounting for parameter uncertainty]
The Curse of Model Uncertainty (II)
Model uncertainty is always present
We cannot always afford to make it negligible
Robust methods are difficult to apply and are deceptively conservative
Data-Driven MDP Approach
We address the problem of model uncertainty in the context of Markov decision processes
Assumption: the system behaves as an MDP with known states and actions
Data available: transition trajectories & noisy reward measurements
Goal: find a stationary policy that performs “well” on the true MDP (risk-sensitivity)
Data-Driven MDP Approach
We propose a risk-sensitive method for addressing data-driven Markov decision processes:
Estimating model uncertainty from data
Finding risk-sensitive controls
Markov Decision Processes
A simple and popular model (MDP)
Ingredients:
1. State space S
2. Action space A
3. Reward R : S × A → ℝ
4. Transition probability P (s′|s, a)
Dynamics: St → At → Rt → St+1
[Diagram: the decision maker observes state s_t and reward r_t, selects action a_t; the system transitions S_t → S_{t+1}]
MDPs: The Objective
Objective: maximize, over all policies π,

    E^π[ ∑_{t=0}^∞ α^t R_t | S_0 = s ] ,   with discount factor α < 1.

There exists an optimal stationary and deterministic policy π : S → A.

Algorithmically “easy”: linear programming, policy iteration, value iteration, dynamic programming
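For reference, the “easy” solution of a fully known MDP via value iteration; the two-state machine below is a hypothetical example, not from the talk:

```python
def value_iteration(S, A, R, P, alpha=0.9, tol=1e-8):
    """Standard value iteration for a known MDP.
    R[s][a]: immediate reward; P[s][a]: dict mapping s' -> probability."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(R[s][a] + alpha * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in A)
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

# tiny machine-replacement-style MDP (hypothetical numbers)
S, A = ["good", "bad"], ["wait", "repair"]
R = {"good": {"wait": 1.0, "repair": 0.0}, "bad": {"wait": -1.0, "repair": -0.5}}
P = {"good": {"wait": {"good": 0.8, "bad": 0.2}, "repair": {"good": 1.0}},
     "bad":  {"wait": {"bad": 1.0},             "repair": {"good": 0.9, "bad": 0.1}}}
V = value_iteration(S, A, R, P)
print(V["good"] > V["bad"])   # True
```

Value iteration converges because the Bellman backup is an α-contraction in the sup norm.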
Parameter Uncertainty in MDP
We always have uncertainty in the parameters
R and P (s′|s, a)
I don’t have a model - estimate from data
I know I don’t know (part of the model)
MDP is a model reduction
Bayesian Parameter Uncertainty
We always have uncertainty in the parameters
R and P (s′|s, a)
Bayesian Approach:
Start with a prior: P(R, P)
Gather model observations O : O ∼ P(O | R, P)
Compute the posterior distribution over the model: P(R, P | O) ∝ P(O | R, P) P(R, P)
Posterior can be evaluated using Gibbs sampling
A Distribution over MDP Models
A Gaussian prior on rewards:
Prior belief is R(i, a) ∼ N(μ(i,a), σ²(i,a))
Given a new noisy measurement R̂(i, a) = R(i, a) + ν (Gaussian noise ν), the belief remains Gaussian

A Dirichlet prior on the transition parameters P(·|i, a) = p⃗:
Prior belief on p⃗ is f(p⃗) ∝ ∏_{j=1}^{|S|} p_j^{β_j − 1}
Given a new observed transition from (i, a), the belief remains a Dirichlet distribution
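The conjugate Dirichlet update and posterior sampling can be sketched with the standard trick of normalizing independent Gamma draws; the observation counts below are hypothetical:

```python
import random

random.seed(1)

def dirichlet_sample(beta):
    """Draw p ~ Dirichlet(beta) by normalizing independent Gamma(beta_j, 1) draws."""
    g = [random.gammavariate(b, 1.0) for b in beta]
    total = sum(g)
    return [x / total for x in g]

beta = [1.0, 1.0, 1.0]        # uniform Dirichlet prior over |S| = 3 successors
observed = [0, 0, 2, 0, 1]    # five observed transitions from some (i, a)
for s_next in observed:
    beta[s_next] += 1         # conjugacy: posterior is Dirichlet(prior + counts)

p = dirichlet_sample(beta)    # one plausible transition vector P(.|i, a)
print(beta)                   # [4.0, 2.0, 2.0]
```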
The Bayesian Approach
We have a probability distribution over models:

Consider V^π = E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] as a random variable.

For a given π and our current belief, we can ask what is

    E_models[ V^π ] = E_models[ E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] ]

What if the true MDP does not behave like E_models[V^π]?
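E_models[V^π], and the whole distribution of V^π, can be approximated by sampling models from the posterior and evaluating the fixed policy on each sample. A rough Monte Carlo sketch with hypothetical Dirichlet posterior counts (two states, α = 0.9):

```python
import random

random.seed(2)

def policy_eval(P_pi, R_pi, alpha=0.9, iters=200):
    """Iterate V <- R + alpha * P_pi V to evaluate a fixed policy."""
    n = len(R_pi)
    V = [0.0] * n
    for _ in range(iters):
        V = [R_pi[s] + alpha * sum(P_pi[s][t] * V[t] for t in range(n))
             for s in range(n)]
    return V

def sample_model(beta_rows):
    """Sample each transition row from its Dirichlet posterior (Gamma trick)."""
    rows = []
    for beta in beta_rows:
        g = [random.gammavariate(b, 1.0) for b in beta]
        tot = sum(g)
        rows.append([x / tot for x in g])
    return rows

beta_rows = [[8.0, 2.0], [3.0, 7.0]]   # hypothetical posterior counts per state
R_pi = [1.0, -1.0]                     # known rewards under the fixed policy

samples = [policy_eval(sample_model(beta_rows), R_pi)[0] for _ in range(2000)]
mean_V = sum(samples) / len(samples)   # Monte Carlo estimate of E_models[V^pi]
print(round(mean_V, 2))
```

Comparing mean_V with min(samples) and max(samples) shows how wide the spread of V^π can be — which is exactly the worry this slide raises.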
The Curse of Parameter Uncertainty
[Plot: distribution of the random V^π (expected discounted reward), marking V^π of the expected model, E_models(V^π | model), the worst and best V^π, and the 90% percentile]
Data-Driven MDP Approach
Estimating model uncertainty from data
Finding risk-sensitive controls
The Robust Approach
Δ(R) = {set of all possible rewards}
Δ(P) = {set of all possible transition probabilities}

Objective:

    max_π  min_{R ∈ Δ(R), P ∈ Δ(P)}  E^π[ ∑_{t=0}^∞ α^t R(x_t) ]

Tractable solution via dynamic programming
Non-probabilistic uncertainty
Uncertainty may be difficult to calibrate
Leads to conservative policies

(Iyengar, 2002; Nilim and El Ghaoui, 2005)
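As a hedged illustration of the max-min backup, here is worst-case value iteration simplified to a finite uncertainty set of candidate transition rows (the cited papers use richer structured sets, e.g. likelihood-based, solved by an inner optimization; the toy numbers are hypothetical):

```python
def robust_value_iteration(S, A, R, P_sets, alpha=0.9, iters=300):
    """Worst-case value iteration: at each backup the adversary picks the
    worst candidate transition vector in P_sets[s][a]."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: max(min(R[s][a] + alpha * sum(p * V[s2] for s2, p in cand.items())
                        for cand in P_sets[s][a])
                    for a in A)
             for s in S}
    return V

# toy machine model; "nominal" uses singleton uncertainty sets
S, A = ["g", "b"], ["wait", "fix"]
R = {"g": {"wait": 1.0, "fix": 0.0}, "b": {"wait": -1.0, "fix": -0.5}}
nominal = {"g": {"wait": [{"g": 0.8, "b": 0.2}], "fix": [{"g": 1.0}]},
           "b": {"wait": [{"b": 1.0}], "fix": [{"g": 0.9, "b": 0.1}]}}
# the uncertainty sets add a pessimistic candidate per (s, a)
uncertain = {"g": {"wait": [{"g": 0.8, "b": 0.2}, {"g": 0.5, "b": 0.5}],
                   "fix": [{"g": 1.0}, {"g": 0.9, "b": 0.1}]},
             "b": {"wait": [{"b": 1.0}],
                   "fix": [{"g": 0.9, "b": 0.1}, {"g": 0.6, "b": 0.4}]}}
V_nom = robust_value_iteration(S, A, R, nominal)
V_rob = robust_value_iteration(S, A, R, uncertain)
print(V_rob["g"] <= V_nom["g"])   # True: the robust value is more pessimistic
```

Enlarging the uncertainty set can only lower each backup, which is why robust values are pessimistic — the conservatism the slide warns about.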
Percentile Optimization
Find the optimal policy according to:

    max_{policy π, y ∈ ℝ}  y
    sub. to  P_models( E^π_x[ ∑_{t=0}^∞ α^t R(x_t) ] ≥ y ) ≥ η

Value-at-risk: η is the risk parameter
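Given posterior samples of V^π for each candidate policy, the criterion reduces to comparing empirical quantiles (with the slides' convention, η = 0.9 makes y the 10th percentile of V^π). The sample values below are made up for illustration:

```python
def percentile_value(v_samples, eta):
    """Largest y with empirical P(V^pi >= y) >= eta."""
    v = sorted(v_samples)
    k = round((1.0 - eta) * len(v))
    return v[k]

v_a = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]   # steady policy
v_b = [-5, 2, 14, 18, 20, 21, 22, 23, 24, 25]    # higher mean, riskier
best = max(["a", "b"],
           key=lambda name: percentile_value({"a": v_a, "b": v_b}[name], 0.9))
print(best)   # "a": the steady policy wins at the 90% confidence level
```

A risk-neutral (mean) comparison would pick policy b instead; the percentile criterion trades expected return for a guarantee that holds with probability η.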
It turns out that solving the percentile optimization problem is:
NP-hard in general
NP-hard even if transitions are known
Polytime for Gaussian reward parameters
Useful approximation for Dirichlet transitions
Percentile Optimization : Rewards
Suppose R ∼ N(μ_R, Θ_R) and q is the initial distribution over states. An η-percentile optimal policy can be found by solving

    max_{x ∈ ℝ^{|S|×|A|}}  ∑_a x_a^⊤ μ_R − Φ^{−1}(η) ‖ ∑_a x_a^⊤ Θ_R^{1/2} ‖_2
    subject to  ∑_a x_a^⊤ = q^⊤ + α ∑_a x_a^⊤ P_a ,
                x_a ≥ 0 , ∀ a ∈ A.

Not much harder than the original problem
Percentile Optimization : Transitions (I)
In the case where the rewards are known but the transitions are uncertain, it is already hard to solve

    max_{policy π}  E_models[ E^π_x[ ∑_{t=0}^∞ α^t R_t ] ] ,

which is equivalent to

    max_π  E_models[ (I − α P_π^{model})^{−1} R ]

The objective depends non-linearly on all moments of P
Percentile Optimization : Transitions (II)
Let F(π) be the second-order approximation

    F(π) = q^⊤ X_π R + α² q^⊤ X_π Π Q_π X_π R

F(π) only depends on the first and second moments of P
Optimizing F(π) is tractable for problems with ≈ 1000 states
Given more than M observed transitions from any state-action pair, the policy π = argmax_π F(π) is O(1/√((1 − η) M)) optimal for the percentile problem
Experiment on MDP with Dirichlet Prior
State-related returns R are fully known
Dirichlet prior for the transitions P(s′|s, a)
Observed 5 transitions for each state-action pair
Choose the repair policy that maximizes the 90% percentile bound on returns
Experiment on MDP with Dirichlet Prior
[Plot: comparison of the random V^π distributions (discounted reward) for the robust, nominal, and 2nd-order-approximation policies, marking each policy’s mean and 90% percentile]
Closing Discussion
We proposed a risk-sensitive approach for data-driven MDP optimization:
Estimating model uncertainty from data (Bayesian framework)
Finding risk-sensitive controls (percentile criteria)
Closing Discussion
We proposed a risk-sensitive approach for data-driven MDPoptimization
Future Work:
Revisit standard data-driven MDPs with this percentile-based method
Use this framework to provide strategies for parameter exploration
Address model uncertainty in other forms of decision problems