
Page 1:

15-381: AI

Markov Systems and

Markov Decision Processes

November 11, 2009

Manuela Veloso

Carnegie Mellon

Page 2:

States, Actions, Observations

                     Passive                      Controlled
Fully Observable     Markov Model                 Markov Decision Process (MDP)
Hidden State         Hidden Markov Model (HMM)    Partially Observable MDP (POMDP)

Page 3:

Markov Model – Observation


• States

• Rewards

• Transition probabilities

Page 4:

Markov Systems with Rewards

• Finite set of n states, si

• Probabilistic state matrix, P, pij

– Probability of transition from si to sj

• “Goal achievement” - Reward for each state, ri


• Discount factor - γ
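
These four ingredients translate directly into code. A minimal sketch in Python/NumPy, using a hypothetical 3-state chain (the numbers are illustrative, not from the slides):

    import numpy as np

    # A hypothetical 3-state Markov system with rewards; the numbers are
    # illustrative, not taken from the slides.
    P = np.array([[0.5, 0.5, 0.0],   # pij = Prob(next = sj | current = si)
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
    r = np.array([0.0, 1.0, 10.0])   # ri = reward received in state si
    gamma = 0.9                      # discount factor

    # Each row of P must be a probability distribution over next states.
    assert np.allclose(P.sum(axis=1), 1.0)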

Page 5:

“Solving” Markov Systems with Rewards

• Find the “value of each state.”

• Process:

– Start at state si
– Receive immediate reward ri
– Move randomly to a new state according to the probability transition matrix
– Future rewards (of the next state) are discounted by γ
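
The process reads directly as a simulation. A minimal sketch, reusing the hypothetical chain above; averaging many sampled discounted returns approximates the value of the start state:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 3-state chain (same illustrative numbers as above).
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
    r = np.array([0.0, 1.0, 10.0])
    gamma = 0.9

    def sample_return(start, steps=200):
        """Start at si, collect ri, move by row P[i], discount by gamma."""
        state, total, discount = start, 0.0, 1.0
        for _ in range(steps):
            total += discount * r[state]             # receive immediate reward
            state = rng.choice(len(r), p=P[state])   # move randomly by row of P
            discount *= gamma                        # discount future rewards
        return total

    # The average of many sampled returns approximates the value of state s1.
    print(np.mean([sample_return(0) for _ in range(1000)]))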

Page 6:

Solving a Markov System with Rewards

• V*(si) - expected discounted sum of future rewards starting in state si

• V*(si) = ri + γ[pi1V*(s1) + pi2V*(s2) + ... + pinV*(sn)]
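
Writing this equation for all n states gives a linear system, V* = r + γPV*, i.e. (I - γP)V* = r, which can be solved directly rather than iterated. A sketch with the same hypothetical numbers:

    import numpy as np

    # Same hypothetical 3-state numbers as above.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
    r = np.array([0.0, 1.0, 10.0])
    gamma = 0.9

    # V* = r + gamma * P V*  rearranges to  (I - gamma * P) V* = r.
    V_star = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
    print(V_star)   # expected discounted sum of future rewards per state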


Page 7:

Value Iteration to Solve a Markov System with Rewards

• V1(si) - expected discounted sum of future rewards starting in state si for one step.

• V2(si) - expected discounted sum of future rewards starting in state si for two steps.

• ...

• Vk(si) - expected discounted sum of future rewards starting in state si for k steps.

• As k → ∞, Vk(si) → V*(si).

• Stop when the difference between the k+1 and k values is smaller than some ε.
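
A minimal sketch of this iteration on the hypothetical chain above; each pass computes Vk+1(si) = ri + γ Σj pij Vk(sj) and stops once successive value vectors agree to within ε:

    import numpy as np

    P = np.array([[0.5, 0.5, 0.0],   # hypothetical chain, as above
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
    r = np.array([0.0, 1.0, 10.0])
    gamma, epsilon = 0.9, 1e-6

    V = np.zeros(len(r))              # V0: zero steps, zero reward
    while True:
        V_next = r + gamma * P @ V    # Vk+1(si) = ri + gamma * sum_j pij Vk(sj)
        if np.max(np.abs(V_next - V)) < epsilon:   # values stopped changing
            break
        V = V_next
    print(V_next)                     # approaches V* as k grows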

Page 8:

3-State Example


Page 9:

3-State Example: Values γ = 0.5


Page 10:

3-State Example: Values γ = 0.9


Page 11:

3-State Example: Values γ = 0.2

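
The 3-state figure itself is not reproduced in this transcript, so the chain below is a hypothetical stand-in; the point of the sketch is how the choice of γ changes the state values:

    import numpy as np

    # Stand-in 3-state chain; the slides' actual figure and numbers are not
    # in this transcript, so these values are hypothetical.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
    r = np.array([0.0, 1.0, 10.0])

    for gamma in (0.5, 0.9, 0.2):
        V = np.linalg.solve(np.eye(3) - gamma * P, r)
        print(f"gamma = {gamma}: V* = {np.round(V, 2)}")
    # Larger gamma weighs future rewards more heavily, so all values grow
    # and states that lead toward the high-reward state gain the most.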

Page 12:

Markov Decision Processes

• Finite set of states, s1,..., sn

• Finite set of actions, a1,..., am

• Probabilistic state-action transitions:

  pij^k = Prob(next = sj | current = si and take action ak)

• Reward for each state, r1,..., rn
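
A convenient representation keeps one transition matrix per action. A sketch of a hypothetical 2-action, 3-state MDP (all numbers illustrative):

    import numpy as np

    # Hypothetical MDP: 3 states, 2 actions. One transition matrix per action:
    # P[k][i][j] = Prob(next = sj | current = si and take action ak).
    P = np.array([
        [[1.0, 0.0, 0.0],    # action a1
         [0.8, 0.2, 0.0],
         [0.0, 0.9, 0.1]],
        [[0.0, 1.0, 0.0],    # action a2
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 1.0]],
    ])
    r = np.array([0.0, 1.0, 10.0])   # reward ri for each state
    gamma = 0.9

    # Every (action, state) row must sum to 1.
    assert np.allclose(P.sum(axis=2), 1.0)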

Page 13:

Solving an MDP

• A policy is a mapping from states to actions.

• Optimal policy - for every state, there is no other action that yields a higher expected sum of discounted future rewards.

• For every MDP there exists an optimal policy.

• Solving an MDP is finding an optimal policy.

• A specific policy converts an MDP into a plain Markov system with rewards.
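
That last point can be made concrete: fixing a policy picks one transition row per state, collapsing the per-action matrices into a single one, and the linear solve from before applies. A sketch on the hypothetical MDP above, with a hypothetical policy pi:

    import numpy as np

    # Hypothetical 2-action, 3-state MDP from the previous sketch.
    P = np.array([
        [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],   # action a1
        [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # action a2
    ])
    r = np.array([0.0, 1.0, 10.0])
    gamma = 0.9

    pi = np.array([1, 1, 0])        # hypothetical policy: one action per state
    P_pi = P[pi, np.arange(3), :]   # row i taken from P[pi[i]]: now a plain
                                    # Markov system with rewards
    V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r)   # value of this policy
    print(V_pi)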

Page 14:

Deterministic MDP Example


(Reward on unlabelled transitions is zero.)

Page 15:

Nondeterministic Example

(Figure not reproduced in the transcript; it shows the values 0, 500, 10, and 50.)

Page 16:

Solving MDPs

• Find the value of each state and the action

to take from each state.

• Process:

– Start in state si
– Receive immediate reward ri
– Choose action ak ∈ A
– Change to state sj with probability pij^k
– Discount future rewards

Page 17:

Policy Iteration

• Start with some policy π0(si).

• Such a policy transforms the MDP into a plain Markov system with rewards.

• Compute the values of the states according to the

current policy.


• Update the policy:

  πk+1(si) = argmaxa { ri + γ Σj pij^a Vπk(sj) }

• Keep computing values and updating the policy.

• Stop when πk+1 = πk.
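
A sketch of the whole loop on the hypothetical MDP from above; each round evaluates the current policy exactly (a linear solve, since a fixed policy yields a plain Markov system) and then improves it greedily:

    import numpy as np

    # Hypothetical MDP as above.
    P = np.array([
        [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],   # action a1
        [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # action a2
    ])
    r = np.array([0.0, 1.0, 10.0])
    gamma = 0.9
    n = len(r)

    pi = np.zeros(n, dtype=int)            # pi0: arbitrary initial policy
    while True:
        # The fixed policy turns the MDP into a plain Markov system: evaluate it.
        P_pi = P[pi, np.arange(n), :]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
        # Greedy update: pick the action maximizing ri + gamma * sum_j pij^a V(sj).
        Q = r[:, None] + gamma * (P @ V).T     # Q[i, a] per state and action
        pi_next = np.argmax(Q, axis=1)
        if np.array_equal(pi_next, pi):        # stop when the policy is stable
            break
        pi = pi_next
    print(pi, V)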

Page 18:

Value Iteration

• V*(si) - expected discounted future rewards, if we start

from state si, and we follow the optimal policy.

• Compute V* with value iteration:

– Vk(si) = maximum possible future sum of rewards

starting from state si for k steps.


• Bellman’s Equation:

  Vn+1(si) = maxk { ri + γ Σj=1..N pij^k Vn(sj) }

• Dynamic programming
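
A sketch of value iteration on the same hypothetical MDP; each pass applies the Bellman backup to every state, and a greedy policy is read off the converged values:

    import numpy as np

    # Hypothetical MDP as above.
    P = np.array([
        [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],   # action a1
        [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # action a2
    ])
    r = np.array([0.0, 1.0, 10.0])
    gamma, epsilon = 0.9, 1e-6

    V = np.zeros(len(r))
    while True:
        # Bellman backup: Vn+1(si) = max_k { ri + gamma * sum_j pij^k Vn(sj) }.
        V_next = r + gamma * np.max(P @ V, axis=0)
        if np.max(np.abs(V_next - V)) < epsilon:
            break
        V = V_next
    pi_star = np.argmax(P @ V_next, axis=0)   # greedy policy from the values
    print(V_next, pi_star)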

Page 19:

Summary

• Markov model for state/action transitions.

• Markov systems with rewards - goal achievement.

• Markov decision processes - added actions.

• Value and policy iteration.
