Markov Decision Processes - courses.cs.washington.edu · model the sequential decision making of a rational agent.
TRANSCRIPT
Markov Decision Processes
Mausam
CSE 515
Markov Decision Process
Fields that use MDPs to model the sequential decision making of a rational agent:
• Operations Research
• Artificial Intelligence
• Machine Learning
• Graph Theory
• Robotics
• Neuroscience/Psychology
• Control Theory
• Economics
A Statistician’s View of MDPs
Markov Chain:
• sequential process
• models state transitions
• autonomous process
One-step Decision Theory:
• one-step process
• models choice
• maximizes utility
Markov Decision Process:
• Markov chain + choice
• Decision theory + sequentiality
• sequential process
• models state transitions
• models choice
• maximizes utility
A Planning View
The agent repeatedly asks "What action next?", receiving percepts from and sending actions to the environment. Environments vary along several dimensions:
• Static vs. Dynamic
• Fully vs. Partially Observable
• Perfect vs. Noisy percepts
• Deterministic vs. Stochastic actions
• Instantaneous vs. Durative actions
• Predictable vs. Unpredictable
Classical Planning
The agent asks "What action next?" in an environment that is:
• Static
• Fully Observable
• Perfect (percepts)
• Deterministic
• Instantaneous
• Predictable
Deterministic, fully observable
Stochastic Planning: MDPs
The same loop, but the environment is now:
• Static
• Fully Observable
• Perfect (percepts)
• Stochastic
• Instantaneous
• Unpredictable
Stochastic, Fully Observable
Markov Decision Process (MDP)
• S: a set of states (flat or factored; a factored MDP describes states via variables)
• A: a set of actions
• Pr(s’|s,a): transition model
• C(s,a,s’): cost model
• G: a set of goals (absorbing or non-absorbing)
• s0: start state
• γ: discount factor
• R(s,a,s’): reward model
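The tuple above can be held in a small container. A minimal sketch, using dictionary-based transition and reward tables (this representation is my assumption, not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    """Container for <S, A, Pr, R, G, s0, gamma> (rewards may encode negative costs)."""
    states: list          # S
    actions: list         # A
    transitions: dict     # Pr: (s, a) -> {s': probability}
    rewards: dict         # R: (s, a, s') -> immediate reward
    goals: set = field(default_factory=set)   # G (absorbing states)
    start: object = None                      # s0
    gamma: float = 0.95                       # discount factor
```

A two-state instance with a single action leading from the start state to an absorbing goal is enough to exercise every field.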
Objective of an MDP
• Find a policy π: S → A
• which optimizes
  • minimizes expected cost to reach a goal
  • maximizes expected reward
  • maximizes expected (reward - cost)
• given a finite, infinite, or indefinite horizon (discounted or undiscounted)
• assuming full observability
Role of Discount Factor (γ)
• Keeps the total reward/total cost finite
• useful for infinite horizon problems
• Intuition (economics):
  • Money today is worth more than money tomorrow.
• Total reward: r1 + γr2 + γ²r3 + …
• Total cost: c1 + γc2 + γ²c3 + …
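The discounted sum above is a one-liner; a quick sketch (the function name is mine):

```python
def discounted_return(rewards, gamma):
    """Total discounted reward r1 + gamma*r2 + gamma^2*r3 + ... ."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

With γ < 1 even an endless constant stream stays finite: a reward of 1 per step at γ = 0.9 sums toward 1/(1-0.9) = 10, which is exactly why the discount keeps infinite-horizon values bounded.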
Examples of MDPs
• Goal-directed, Indefinite Horizon, Cost Minimization MDP
  • <S, A, Pr, C, G, s0>
  • most often studied in the planning and graph theory communities
• Infinite Horizon, Discounted Reward Maximization MDP
  • <S, A, Pr, R, γ>
  • most often studied in the machine learning, economics, and operations research communities; the most popular model
• Goal-directed, Finite Horizon, Probability Maximization MDP
  • <S, A, Pr, G, s0, T>
  • also studied in the planning community
• Oversubscription Planning: Non-absorbing goals, Reward Maximization MDP
  • <S, A, Pr, G, R, s0>
  • a relatively recent model
Bellman Equations for MDP1
• <S, A, Pr, C, G, s0>
• Define J*(s) {optimal cost} as the minimum
expected cost to reach a goal from this state.
• J* should satisfy the following equation:
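The equation itself did not survive extraction; from the definitions above (cost model C, goal set G), the standard form is:

```latex
J^*(s) =
\begin{cases}
0 & \text{if } s \in G,\\[4pt]
\min_{a \in A} \sum_{s' \in S} \Pr(s' \mid s, a)\,\bigl[\, C(s,a,s') + J^*(s') \,\bigr] & \text{otherwise.}
\end{cases}
```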
Bellman Equations for MDP2
• <S, A, Pr, R, s0, γ>
• Define V*(s) {optimal value} as the maximum
expected discounted reward from this state.
• V* should satisfy the following equation:
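The equation was lost in extraction; consistent with the reward model R and discount γ defined above, the standard form is:

```latex
V^*(s) = \max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s, a)\,\bigl[\, R(s,a,s') + \gamma\, V^*(s') \,\bigr]
```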
Bellman Equations for MDP3
• <S, A, Pr, G, s0, T>
• Define P*(s,t) {optimal prob} as the maximum
probability of reaching a goal from this state
when starting at the tth timestep.
• P* should satisfy the following equation:
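The equation was lost in extraction; for a horizon T and goal set G as defined above, the standard form is:

```latex
P^*(s,t) =
\begin{cases}
1 & \text{if } s \in G,\\
0 & \text{if } t = T \text{ and } s \notin G,\\
\max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s, a)\, P^*(s', t+1) & \text{otherwise.}
\end{cases}
```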
Bellman Backup (MDP2)
• Given an estimate of the V* function (say Vn)
• Backup the Vn function at state s
• calculate a new estimate (Vn+1):
  Qn+1(s,a) = Σs’ Pr(s’|s,a) [R(s,a,s’) + γ Vn(s’)]
  Vn+1(s) = max a∈Ap(s) Qn+1(s,a)
• Qn+1(s,a): value/cost of the strategy:
  • execute action a in s, execute πn subsequently
  • πn = argmax a∈Ap(s) Qn(s,a)
Bellman Backup (example)
A state s with three actions and successor values V0(s1) = 0, V0(s2) = 1, V0(s3) = 2 (taking γ ≈ 1):
• Q1(s,a1) = 2 + 0 = 2
• Q1(s,a2) = 5 + 0.9 × 1 + 0.1 × 2 = 6.1
• Q1(s,a3) = 4.5 + 2 = 6.5
Taking the max: V1(s) = 6.5, with greedy action a_greedy = a3.
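A single backup is straightforward to code. A minimal sketch, assuming the dictionary-based P/R tables introduced earlier (not the slides' own code):

```python
def bellman_backup(s, V, actions, P, R, gamma):
    """One Bellman backup at state s: V_{n+1}(s) = max_a Q_{n+1}(s,a).

    P[(s, a)] is a dict {s': probability}; R[(s, a, s')] is the reward.
    Returns the new value estimate and the greedy action.
    """
    q = {
        a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for s2, p in P[(s, a)].items())
        for a in actions
    }
    best = max(q, key=q.get)
    return q[best], best
```

Running it on the numbers from the example above reproduces V1(s) = 6.5 with greedy action a3.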
Value iteration [Bellman’57]
• assign an arbitrary assignment of V0 to each state.
• repeat
  • for all states s
    • compute Vn+1(s) by a Bellman backup at s.
• until maxs |Vn+1(s) – Vn(s)| < ε
The quantity |Vn+1(s) – Vn(s)| is the residual at s; the stopping rule is called ε-convergence.
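The loop above can be sketched directly. A minimal version, again assuming dict-based P/R tables (an illustration, not the course's reference code):

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """Repeat synchronous Bellman backups until the max residual drops below eps."""
    V = {s: 0.0 for s in states}          # arbitrary V0
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions
            )
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual < eps:
            return V
```

On a one-state self-loop paying reward 1 per step with γ = 0.5, the fixed point is 1/(1-0.5) = 2, which the iteration approaches geometrically.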
Comments
• Decision-theoretic algorithm
• Dynamic programming
• Fixed point computation
• Probabilistic version of the Bellman-Ford algorithm
  • for shortest path computation
  • MDP1: Stochastic Shortest Path Problem
Time Complexity
• one iteration: O(|S|²|A|)
• number of iterations: poly(|S|, |A|, 1/(1-γ))
Space Complexity: O(|S|)
Factored MDPs
• exponential space, exponential time
Convergence Properties
• Vn → V* in the limit as n → ∞
• ε-convergence: the Vn function is within ε of V*
• Optimality: the current greedy policy is within 2εγ/(1-γ) of optimal
• Monotonicity (≤p denotes componentwise comparison)
  • V0 ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)
  • V0 ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above)
  • otherwise Vn non-monotonic
Policy Computation
The optimal policy is stationary and time-independent
• for infinite/indefinite horizon problems
π*(s) = argmax a∈Ap(s) Σs’ Pr(s’|s,a) [R(s,a,s’) + γ V*(s’)]
Policy Evaluation
Vπ(s) = Σs’ Pr(s’|s,π(s)) [R(s,π(s),s’) + γ Vπ(s’)]
A system of linear equations in |S| variables.
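Because the policy fixes one action per state, evaluation is a plain linear solve. A sketch using NumPy (the table format and function name are my assumptions):

```python
import numpy as np

def policy_evaluation(states, policy, P, R, gamma):
    """Solve the |S| linear equations V(s) = sum_s' Pr(s'|s,pi(s)) [R + gamma V(s')],
    i.e. (I - gamma * P_pi) V = r_pi."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    for s in states:
        a = policy[s]
        for s2, p in P[(s, a)].items():
            P_pi[idx[s], idx[s2]] = p
            r_pi[idx[s]] += p * R[(s, a, s2)]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return {s: V[idx[s]] for s in states}
```

The same self-loop example as before (reward 1, γ = 0.5) gives V = 2 exactly, since the solve is direct rather than iterative.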
Changing the Search Space
• Value Iteration
• Search in value space
• Compute the resulting policy
• Policy Iteration
• Search in policy space
• Compute the resulting value
Policy iteration [Howard’60]
• assign an arbitrary assignment of π0 to each state.
• repeat
  • Policy Evaluation: compute Vn+1, the evaluation of πn
    • costly: O(|S|³); approximating the evaluation by value iteration with the policy held fixed yields Modified Policy Iteration
  • Policy Improvement: for all states s
    • compute πn+1(s) = argmax a∈Ap(s) Qn+1(s,a)
• until πn+1 = πn
Advantage
• searching in a finite (policy) space as opposed to an
uncountably infinite (value) space ⇒ faster convergence.
• all other properties follow!
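Evaluation and improvement alternate until the policy stops changing. A compact sketch (exact evaluation via a linear solve, as on the slide; the names are mine):

```python
import numpy as np

def policy_iteration(states, actions, P, R, gamma):
    """Alternate exact policy evaluation and greedy improvement until pi is stable."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    pi = {s: actions[0] for s in states}          # arbitrary pi_0
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        P_pi, r_pi = np.zeros((n, n)), np.zeros(n)
        for s in states:
            for s2, p in P[(s, pi[s])].items():
                P_pi[idx[s], idx[s2]] = p
                r_pi[idx[s]] += p * R[(s, pi[s], s2)]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        pi_new = {
            s: max(actions,
                   key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[idx[s2]])
                                     for s2, p in P[(s, a)].items()))
            for s in states
        }
        if pi_new == pi:
            return pi, {s: V[idx[s]] for s in states}
        pi = pi_new
```

On a one-state MDP where action "earn" pays 1 and "stay" pays 0 (γ = 0.5), the loop switches to "earn" after one improvement step and terminates with V = 2.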
Modified Policy iteration
• assign an arbitrary assignment of π0 to each state.
• repeat
  • Policy Evaluation: compute Vn+1, an approximate evaluation of πn
  • Policy Improvement: for all states s
    • compute πn+1(s) = argmax a∈Ap(s) Qn+1(s,a)
• until πn+1 = πn
Advantage
• probably the most competitive synchronous dynamic
programming algorithm.
Asynchronous Value Iteration
States may be backed up in any order
• instead of sweeping iteration by iteration
As long as all states are backed up infinitely often
• asynchronous value iteration converges to the optimal value function
Asynch VI: Prioritized Sweeping
Why back up a state if the values of its successors haven’t changed?
Prefer backing up a state
• whose successors had the most change
Maintain a priority queue of (state, expected change in value)
Back up states in order of priority
After backing up a state, update the priority queue
• for all its predecessors
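The queue discipline above can be sketched with a heap. A rough illustration under the same dict-based tables (parameter names and bounds are my assumptions, not from the slides):

```python
import heapq

def prioritized_sweeping(states, actions, P, R, gamma, n_backups=1000, theta=1e-6):
    """Back up states in order of expected value change."""
    V = {s: 0.0 for s in states}
    # predecessors[s'] = states that can transition into s'
    preds = {s: set() for s in states}
    for (s, a), dist in P.items():
        for s2 in dist:
            preds[s2].add(s)

    def backup(s):
        return max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                       for s2, p in P[(s, a)].items())
                   for a in actions)

    # Max-heap via negated priorities: biggest expected change first.
    heap = [(-abs(backup(s) - V[s]), s) for s in states]
    heapq.heapify(heap)
    for _ in range(n_backups):
        if not heap:
            break
        neg_pri, s = heapq.heappop(heap)
        if -neg_pri < theta:
            break                               # everything left is converged
        V[s] = backup(s)
        for sp in preds[s]:                     # predecessors may now be stale
            change = abs(backup(sp) - V[sp])
            if change > theta:
                heapq.heappush(heap, (-change, sp))
    return V
```

Duplicate heap entries for a state are harmless: a re-backup of an already-updated state is idempotent.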
Asynch VI: Real Time Dynamic Programming
[Barto, Bradtke, Singh’95]
• Trial: simulate greedy policy starting from start state;
perform Bellman backup on visited states
• RTDP: repeat Trials until value function converges
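A single trial can be sketched as follows, in the goal-directed cost-minimization setting (so the backup takes a min; R here plays the role of the cost model C, and the function name is mine):

```python
import random

def rtdp_trial(s0, goals, actions, P, R, gamma, V, max_steps=100):
    """One RTDP trial: follow the greedy policy from s0,
    performing a Bellman backup at each visited state."""
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        # Greedy backup at s (costs, so min over actions).
        q = {a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
             for a in actions}
        a = min(q, key=q.get)
        V[s] = q[a]
        # Sample the next state from Pr(.|s,a) to continue the trial.
        succs, probs = zip(*P[(s, a)].items())
        s = random.choices(succs, weights=probs)[0]
    return V
```

RTDP itself just repeats such trials until the value function stops changing on the states the greedy policy visits.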
RTDP Trial
(Figure: one trial from s0 toward the goal. At the current state the agent computes Qn+1(s0,a) for each action a1, a2, a3 from the current Vn values of the successors, sets Vn+1(s0) = min over a of Qn+1(s0,a), takes the greedy action (here a_greedy = a2), samples a successor, and continues.)
Comments
• Properties
• if all states are visited infinitely often then Vn → V*
• Advantages
• Anytime: more probable states explored quickly
• Disadvantages
• complete convergence can be slow!
Reinforcement Learning
Reinforcement Learning
Still have an MDP
• still looking for a policy π
New twist: don’t know Pr and/or R
• i.e. don’t know which states are good
• or what the actions do
Must actually try out actions to learn
Model based methods
Visit different states, perform different actions
Estimate Pr and R
Once the model is built, plan using value iteration or
other methods
Con: requires _huge_ amounts of data
Model free methods
Directly learn Q*(s,a) values
sample = R(s,a,s’) + γ max a’ Qn(s’,a’)
Nudge the old estimate towards the new sample:
Qn+1(s,a) ← (1-α) Qn(s,a) + α [sample]
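This update (Q-learning) is one line of code. A minimal sketch, storing Q as a dictionary keyed by (state, action) pairs (a representation I am assuming for illustration):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s2, actions, alpha, gamma):
    """One model-free Q-learning update:
    Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s',a')]."""
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```

With a `defaultdict(float)` as Q, unseen entries start at 0, so the first update on a self-loop with r = 1, α = 0.5, γ = 0.9 moves Q(s,a) from 0 to 0.5.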
Properties
Converges to optimal if
• you explore enough
• you make the learning rate (α) small enough
• but do not decrease it too quickly
  • ∑i α(s,a,i) = ∞
  • ∑i α²(s,a,i) < ∞
  where i is the number of visits to (s,a)
Model based vs. Model Free RL
Model based
• estimates O(|S|²|A|) parameters
• requires relatively more data for learning
• can make use of background knowledge easily
Model free
• estimates O(|S||A|) parameters
• requires relatively less data for learning
Exploration vs. Exploitation
Exploration: choose actions that visit new states in
order to obtain more data for better learning.
Exploitation: choose actions that maximize the
reward under the currently learnt model.
ε-greedy
• at each time step, flip a coin
• with probability ε, take a random action
• with probability 1-ε, take the current greedy action
Lower ε over time
• increase exploitation as more learning has happened
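The coin flip above is a two-branch function. A minimal sketch (Q keyed by (state, action), as assumed earlier; the `rng` parameter is mine, to make the randomness injectable):

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng=random):
    """With probability eps explore (random action);
    with probability 1-eps exploit (greedy action)."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Setting eps = 0 recovers the pure greedy policy; eps = 1 is pure random exploration, and annealing eps between the two shifts the balance toward exploitation over time.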
Q-learning
Problems
• Too many states to visit during learning
• Q(s,a) is still a BIG table
We want to generalize from a small set of training examples
Techniques
• Value function approximators
• Policy approximators
• Hierarchical Reinforcement Learning
Task Hierarchy: MAXQ Decomposition [Dietterich’00]
• Root: Fetch, Deliver
• Fetch: Navigate(loc), Take
• Deliver: Navigate(loc), Give
• Take: Extend-arm, Grab
• Give: Extend-arm, Release
• Navigate(loc): Move-e, Move-w, Move-s, Move-n
Children of a task are unordered.
Partially Observable Markov Decision Processes
Partially Observable MDPs
The same loop, but the environment is now:
• Static
• Partially Observable
• Noisy (percepts)
• Stochastic
• Instantaneous
• Unpredictable
Stochastic, Fully Observable
Stochastic, Partially Observable
POMDPs
In POMDPs we apply the very same idea as in MDPs.
Since the state is not observable,
the agent has to make its decisions based on the belief state,
which is a posterior distribution over states.
Let b be the belief of the agent about the current state.
POMDPs compute a value function over belief space:
V(b) = max a [ r(b,a) + γ Σb’ Pr(b’|b,a) V(b’) ]
POMDPs
Each belief is a probability distribution,
• so the value function is a function of an entire probability distribution.
This is problematic, since probability distributions are continuous.
We also have to deal with the huge complexity of belief spaces.
For worlds with finite state, action, and observation
spaces and finite horizons,
• we can represent the value function by piecewise linear
functions.
Applications
Robotic control
• helicopter maneuvering, autonomous vehicles
• Mars rover - path planning, oversubscription planning
• elevator planning
Game playing - backgammon, tetris, checkers
Neuroscience
Computational finance, sequential auctions
Assisting the elderly in simple tasks
Spoken dialog management
Communication networks - switching, routing, flow control
War planning, evacuation planning