
Page 1

10/28/2014 S. Joo ([email protected]) 1

CS 4649/7649 Robot Intelligence: Planning

Sungmoon Joo

School of Interactive Computing

College of Computing

Georgia Institute of Technology

MDP solutions

*Slides based in part on Dr. Mike Stilman and Dr. Pieter Abbeel’s slides

10/28/2014 S. Joo ([email protected]) 2

• CS7649

- project proposal: Due Oct. 30 (proposal outline: proposal_outline.pdf)

- project final report: Due Dec. 4, 23:59pm, conference-style paper

- project presentation: Dec. 11, 11:30am - 2:20pm

• CS4649

- project reviewer assignment: Oct. 28 (2~3 reviewers/project)

- proposal review report: Due Nov. 6

- project review report (for the assigned project): Due Dec. 11, 11:30am

- project presentation review* (for all presentations): Due Dec. 11, 2:20pm

*presentation review sheets will be provided

Administrative - Final Project

Page 2

10/28/2014 S. Joo ([email protected]) 3

Probability

Axioms

(1) 0 ≤ P(A) ≤ 1

(2) P(True) = 1

(3) P(False) = 0

(4) P(A or B) = P(A) + P(B) − P(A and B)

(5) P(A = vi ∧ A = vj) = 0 if i ≠ j

(6) P(A = v1 ∨ A = v2 ∨ … ∨ A = vk) = 1

Theorems

P(not A) = P(¬A) = 1 − P(A)

P(A) = P(A ∧ B) + P(A ∧ ¬B)

10/28/2014 S. Joo ([email protected]) 4

Conditional Probability

• Conditional Probability (Definition): P(A | B) = P(A ∧ B) / P(B), assuming P(B) > 0

• Corollary: P(A ∧ B) = P(A | B) P(B)

• Bayes Rule: P(A | B) = P(B | A) P(A) / P(B)
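As a quick worked example with hypothetical numbers: suppose P(A) = 0.01, P(B | A) = 0.9, and P(B | ¬A) = 0.1. Then P(B) = P(B | A) P(A) + P(B | ¬A) P(¬A) = 0.9·0.01 + 0.1·0.99 = 0.108, and Bayes rule gives P(A | B) = 0.009 / 0.108 ≈ 0.083.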

Page 3

10/28/2014 S. Joo ([email protected]) 5

Markov Property

● “Markov” generally means that given the present state, the future and the past are independent

● For a Markov process (MP), “Markov” means: P(s_{t+1} | s_t, s_{t-1}, …, s_0) = P(s_{t+1} | s_t)

● For an MDP, “Markov” means: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)

10/28/2014 S. Joo ([email protected]) 6

Solving MDP

● In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal

● For MDP, we want an optimal policy

- A policy gives an action for each state
- An optimal policy maximizes expected utility (e.g. the sum of rewards), if followed

● Two ways to define stationary utilities

- Additive utility: U([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + …

- Discounted utility: U([s0, s1, s2, …]) = R(s0) + γ R(s1) + γ² R(s2) + … (e.g. with γ = 0.5 and R = 1 at every step, U = 1/(1 − 0.5) = 2)

Page 4

10/28/2014 S. Joo ([email protected]) 7

Solving MDP

● Problem: infinite state sequences have infinite rewards

● Solutions

- Finite horizon: terminate episodes after a fixed horizon of T steps, yielding non-stationary policies (policies depend on time)

- Discounting: for 0 < γ < 1, U([s0, s1, …]) = Σt γ^t R(st) ≤ Rmax / (1 − γ), so even infinite sequences have bounded utility

- …

● We’ve discussed the case with an infinite horizon & a discount factor γ < 1

- the optimal policy is stationary (the same at all times)

10/28/2014 S. Joo ([email protected]) 8

Robots with Uncertain Actions

Page 5

10/28/2014 S. Joo ([email protected]) 9

Value Iteration
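The value-iteration slides here carried their equations and figures as images, which did not survive extraction. For reference, the Bellman optimality backup they iterate is V(s) ← max_a [ R(s, a) + γ Σs' P(s' | s, a) V(s') ]. Below is a minimal sketch in Python; the array encoding of the MDP (P, R) and the tolerance are illustrative assumptions, not the slides' actual example.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    # P[s, a, s2] = P(s2 | s, a): transition model, shape (S, A, S)
    # R[s, a]: expected immediate reward, shape (S, A)
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup over all states at once
        Q = R + gamma * (P @ V)              # Q(s, a), shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # stop when values stabilize
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new

# Hypothetical 2-state, 2-action MDP for a quick check
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
V, pi = value_iteration(P, R)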

10/28/2014 S. Joo ([email protected]) 10

Value Iteration

Page 6

10/28/2014 S. Joo ([email protected]) 11

Value Iteration

10/28/2014 S. Joo ([email protected]) 12

Value Iteration

Page 7

10/28/2014 S. Joo ([email protected]) 13

Value Iteration

10/28/2014 S. Joo ([email protected]) 14

Effect of Discount & Uncertainty in MDP

Parameters:

- Discount
- Uncertainty/Noise
- Negative rewards (e.g. cliff)
- Positive rewards

Example from: http://www.cs.berkeley.edu/~pabbeel/cs287-fa13/
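The qualitative effect can be reproduced with the value_iteration sketch from above: sweep the discount (and, by editing the transition model P, the noise) and watch the greedy policy change. The parameter values below are arbitrary.

# Reuses value_iteration, P, R from the sketch above
for gamma in (0.1, 0.5, 0.99):
    V, pi = value_iteration(P, R, gamma=gamma)
    print(f"gamma={gamma}: V={np.round(V, 2)}, greedy policy={pi}")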

Page 8

10/28/2014 S. Joo ([email protected]) 15

(figures: results for varying discount and uncertainty settings)

Effect of Discount & Uncertainty in MDP

10/28/2014 S. Joo ([email protected]) 16

(figures: results for varying discount and uncertainty settings)

Effect of Discount & Uncertainty in MDP

Page 9

10/28/2014 S. Joo ([email protected]) 17

MDP Applications: Passive Dynamic Walking

10/28/2014 S. Joo ([email protected]) 18

MDP Applications: Inverted Pendulum

Page 10

10/28/2014 S. Joo ([email protected]) 19

MDP Applications: Inverted Pendulum

10/28/2014 S. Joo ([email protected]) 20

MDP Applications: Inverted Pendulum

Page 11

10/28/2014 S. Joo ([email protected]) 21

Improving Efficiency

10/28/2014 S. Joo ([email protected]) 22

Policy Evaluation

● Our goal is to find a policy, not just to compute values

● In value iteration, we compute values first, then extract a policy

● How do we calculate the V’s for a fixed policy?

- use Bellman’s equation with the action fixed by the policy (see the sketch below): V^π(s) = R(s, π(s)) + γ Σs' P(s' | s, π(s)) V^π(s')

- …
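Because the fixed-policy Bellman equation has no max, it is linear in V^π and can be solved directly as a linear system, (I − γ P^π) V^π = R^π, instead of by iteration. A minimal sketch, reusing the assumed (P, R) array encoding from the value-iteration example:

import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9):
    # Restrict the model to the policy's actions:
    # P_pi[s, s2] = P(s2 | s, policy(s)),  R_pi[s] = R(s, policy(s))
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]
    R_pi = R[idx, policy]
    # Solve (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)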

Page 12

● Policy Iteration

- Step 1. Policy evaluation

: calculate utilities for some fixed policy (not optimal utilities!) until convergence

- Step 2. Policy improvement

: update policy using one step look-ahead with resulting converged (but not optimal!) utilities as future values

- Repeat steps until policy converges

● Properties

- It’s still optimal!

- Can converge faster under some conditions

10/28/2014 S. Joo ([email protected]) 23

Policy Iteration

● Policy Evaluation

- with a fixed current policy π, find values with Bellman updates: V^π(s) ← R(s, π(s)) + γ Σs' P(s' | s, π(s)) V^π(s')

- iterate until the values converge

● Policy Improvement

- with the fixed (converged) values, find the best action by one-step look-ahead: π(s) ← argmax_a [ R(s, a) + γ Σs' P(s' | s, a) V^π(s') ]

● Theorem

Policy iteration is guaranteed to converge, and at convergence the current policy and its value function are the optimal policy and the optimal value function.
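A compact sketch of the full loop, reusing the assumed (P, R) arrays and the policy_evaluation helper from above: evaluate the current policy exactly, improve it greedily, and stop when the policy no longer changes.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)                   # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)    # Step 1: evaluation
        Q = R + gamma * (P @ V)                       # one-step look-ahead
        new_policy = Q.argmax(axis=1)                 # Step 2: improvement
        if np.array_equal(new_policy, policy):        # converged: optimal
            return policy, V
        policy = new_policy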

10/28/2014 S. Joo ([email protected]) 24

Policy Iteration

Page 13

10/28/2014 S. Joo ([email protected]) 25

Policy Iteration

10/28/2014 S. Joo ([email protected]) 26

Policy Iteration

Page 14

10/28/2014 S. Joo ([email protected]) 27

Policy Iteration

10/28/2014 S. Joo ([email protected]) 28

Policy Iteration

Page 15

10/28/2014 S. Joo ([email protected]) 29

Policy Iteration

10/28/2014 S. Joo ([email protected]) 30

Policy Iteration

Page 16

10/28/2014 S. Joo ([email protected]) 31

Comparison

● In value iteration:

- Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)

10/28/2014 S. Joo ([email protected]) 32

● In policy iteration:

- Several passes to update utilities with frozen policy

- Occasional passes to update policies

Comparison

Page 17

10/28/2014 S. Joo ([email protected]) 33

Asynchronous Iterations

● Asynchronous policy iteration (we saw this)

- Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

● Asynchronous value iteration

- In value iteration, we update every state in each iteration

- Actually, any sequence of Bellman updates will converge if every state is visited infinitely often

- In fact, we can update the policy as seldom or often as we like, and we will still converge

- Idea: update states whose value we expect to change; if |V_{k+1}(s) − V_k(s)| is large, then update the predecessors of s (sketched below)
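A simplified sketch of this idea in the style of prioritized sweeping; the threshold theta, the priority queue, and the update budget are assumptions for illustration, and the (P, R) encoding matches the earlier examples.

import heapq
import numpy as np

def prioritized_value_iteration(P, R, gamma=0.9, theta=1e-3, max_updates=10000):
    S, A, _ = P.shape
    V = np.zeros(S)
    # preds[s2] = states with some action that can lead to s2
    preds = [set() for _ in range(S)]
    for s in range(S):
        for a in range(A):
            for s2 in np.nonzero(P[s, a])[0]:
                preds[s2].add(s)
    heap = [(0.0, s) for s in range(S)]             # seed the queue with every state
    heapq.heapify(heap)
    for _ in range(max_updates):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        old = V[s]
        V[s] = np.max(R[s] + gamma * (P[s] @ V))    # Bellman backup at s only
        change = abs(V[s] - old)
        if change > theta:                          # value moved a lot, so the
            for p in preds[s]:                      # predecessors may change too
                heapq.heappush(heap, (-change, p))  # larger change = higher priority
    return V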

10/28/2014 S. Joo ([email protected]) 34

Model availability: P(j|i,a)

Page 18

10/28/2014 S. Joo ([email protected]) 35

Q-Learning
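The Q-learning slides were mostly equations and worked examples that did not survive extraction. For reference, the standard tabular update is Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]; unlike value and policy iteration, it needs no transition model P(j|i,a), only sampled transitions. A minimal sketch follows; the environment interface (env.reset(), env.step()) and all parameter values are assumptions for illustration.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                      # assumed: returns an initial state id
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)        # assumed: (next state, reward, done)
            # TD target bootstraps on the best next action (off-policy)
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q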

10/28/2014 S. Joo ([email protected]) 36

Q-Learning

Page 19

10/28/2014 S. Joo ([email protected]) 37

Q-Learning

10/28/2014 S. Joo ([email protected]) 38

Q-Learning

Page 20

10/28/2014 S. Joo ([email protected]) 39

Q-Learning

10/28/2014 S. Joo ([email protected]) 40

Q-Learning

Page 21

10/28/2014 S. Joo ([email protected]) 41

Q-Learning

10/28/2014 S. Joo ([email protected]) 42

Q-Learning

Page 22

10/28/2014 S. Joo ([email protected]) 43

Q-Learning

10/28/2014 S. Joo ([email protected]) 44

Q-Learning

Page 23

10/28/2014 S. Joo ([email protected]) 45

Summary

(summary diagram: MDP solved via VI, PI, and RL; related topics: control, estimation, POMDP)