Machine Learning for Robotics
Intelligent Systems Series

Lecture 7

Georg Martius

MPI for Intelligent Systems, Tübingen, Germany

June 12, 2017


Markov Decision Processes


Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.

Reminder: Markov property
A state S_t is Markov if and only if

P(S_{t+1} | S_t) = P(S_{t+1} | S_1, ..., S_t)

Definition (Markov Process / Markov Chain)
A Markov Process (or Markov Chain) is a tuple (S, P)
• S is a (finite) set of states
• P is a state transition probability matrix,

P_{ss'} = P(S_{t+1} = s' | S_t = s)


Example: Student Markov Chain


Example: Student Markov Chain Episodes

Sample episodes starting from S_1 = C1:

S_1, S_2, ..., S_T

• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep


Example: Student Markov Chain Transition Matrix
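The transition diagram and matrix are shown as a figure in the original slides and are not reproduced here. As a minimal sketch, such a chain can be stored as a row-stochastic matrix and sampled; the state names match the episodes above, while the transition probabilities below are illustrative values (an assumption, not read from the missing figure).

import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (terminal, modelled as a self-loop)
])

def sample_episode(start="C1", max_steps=50, rng=np.random.default_rng(0)):
    """Sample S_1, S_2, ..., S_T until the terminal state Sleep is reached."""
    s = states.index(start)
    episode = [states[s]]
    for _ in range(max_steps):
        if states[s] == "Sleep":
            break
        s = rng.choice(len(states), p=P[s])   # draw the successor state from row s
        episode.append(states[s])
    return episode

print(sample_episode())   # one sampled episode, e.g. C1 C2 C3 Pass Sleep (depends on the seed)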


Markov Reward Process

A Markov reward process is a Markov chain with values.

Definition (MRP)
A Markov Reward Process is a tuple (S, P, R, γ)
• S is a finite set of states
• P is a state transition probability matrix,

P_{ss'} = P(S_{t+1} = s' | S_t = s)

• R is a reward function, R_s = E[R_{t+1} | S_t = s]
• γ is a discount factor, γ ∈ [0, 1]

Note that the reward can be stochastic (R_s is an expectation).


Example: Student MRP


Return

Definition
The return G_t is the total discounted reward from time-step t:

G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

• The discount γ ∈ [0, 1] devalues future rewards: a reward R received after k + 1 time-steps counts as γ^k R.
• Extreme cases:
  γ close to 0 leads to immediate reward maximization only
  γ close to 1 leads to far-sighted evaluation
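A minimal sketch of the definition above: summing a finite reward sequence R_{t+1}, R_{t+2}, ... with discount γ (the reward numbers are only illustrative, they are not taken from the slides).

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [-2, -2, -2, 10, 0]            # e.g. rewards collected along C1 C2 C3 Pass Sleep
print(discounted_return(rewards, 0.9))   # far-sighted: future rewards still count
print(discounted_return(rewards, 0.0))   # myopic: only the immediate reward matters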


Discussion on discounting

Why is discounting often used?
• Mathematically convenient to discount rewards (keeps returns finite)
• A way to model the uncertainty about the future (since the model may not be exact)
• Animal/human behaviour shows a preference for immediate rewards

In some cases undiscounted Markov reward processes (i.e. γ = 1) are considered, e.g. if all sequences terminate.


Value Function

The value function describes the value of a state (in the stationary state).

Definition
The state value function v(s) of an MRP is the expected return starting from state s:

v(s) = E[G_t | S_t = s]
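Since v(s) = E[G_t | S_t = s] is an expectation over episodes, one way to illustrate the definition (not part of the slides, a Monte Carlo sketch) is to average the returns of many sampled episodes that start in s. This reuses sample_episode and discounted_return from the earlier sketches and assumes illustrative per-state rewards.

R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

def mc_value_estimate(start, gamma, n_episodes=10_000):
    """Average the sampled returns G_1 over many episodes starting in `start`."""
    total = 0.0
    for _ in range(n_episodes):
        episode = sample_episode(start)
        total += discounted_return([R[s] for s in episode], gamma)
    return total / n_episodes

print(mc_value_estimate("C1", gamma=0.9))   # noisy estimate of v(C1) for the illustrative chain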


Example: Value Function for Student MRP

from David Silver


Bellman Equation (MRP) I

Idea: make the value computation recursive by separating the contributions from:
• the immediate reward
• the discounted future rewards

v(s) = E[G_t | S_t = s]
     = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
     = E[R_{t+1} + γ G_{t+1} | S_t = s]
     = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]

Hmm... we still need an expectation over S_{t+1}. Use the transition matrix to get the probabilities of the successor state:

v(s) = R_s + γ Σ_{s'∈S} P_{ss'} v(s')


Example: Bellman Equation for Student MRP

from David Silver


Bellman Equation (MRP) II

Bellman equations in matrix form:

v = R + γ P v

where v ∈ ℝ^{|S|} and R are vectors.

The Bellman equation can be solved directly:

v = (I − γP)^{-1} R

• computational complexity is O(|S|^3)
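A minimal sketch of the direct solution for the illustrative student chain defined earlier (P and states come from that sketch; the per-state rewards R_vec below are again illustrative, not read from the missing figure).

import numpy as np

R_vec = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)   # same state order as P
gamma = 0.9

# v = (I − γP)^{-1} R, here via a linear solve instead of an explicit inverse; O(|S|^3)
v = np.linalg.solve(np.eye(len(R_vec)) - gamma * P, R_vec)
print(dict(zip(states, np.round(v, 2))))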


Markov Decision Process

A Markov reward process has no agent; there is no influence on the system. An MRP with an active agent forms a Markov Decision Process.
• The agent takes decisions by executing actions
• The state is Markovian

Definition (MDP)
A Markov Decision Process is a tuple (S, A, P, R, γ)
• S is a finite set of states
• A is a finite set of actions
• P is a state transition probability matrix,

P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a)

• R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
• γ is a discount factor, γ ∈ [0, 1]


Example: Student MDP

from David Silver


How to model decision taking?

The agent has an action-selection function called a policy.

Definition
A policy π is a distribution over actions given states,

π(a|s) = P(A_t = a | S_t = s)

• Since it is a Markov process, the policy only depends on the current state
• Implication: policies are stationary (independent of time)

An MDP with a given policy turns into an MRP:

P^π_{ss'} = Σ_{a∈A} π(a|s) P^a_{ss'}

R^π_s = Σ_{a∈A} π(a|s) R^a_s
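A minimal sketch of this reduction on a small made-up MDP (2 states, 2 actions; all numbers are invented for illustration, this is not the student example): P^π and R^π are just π-weighted averages of P^a and R^a, and the induced MRP can then be solved as before.

import numpy as np

P_a = np.array([                       # P_a[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
    [[0.9, 0.1], [0.0, 1.0]],          # action 0
    [[0.2, 0.8], [0.5, 0.5]],          # action 1
])
R_a = np.array([                       # R_a[a, s] = E[R_{t+1} | S_t = s, A_t = a]
    [1.0, 0.0],
    [0.0, 2.0],
])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])   # pi[s, a] = π(a | s), uniform here

# P^π_{ss'} = Σ_a π(a|s) P^a_{ss'}   and   R^π_s = Σ_a π(a|s) R^a_s
P_pi = np.einsum("sa,asp->sp", pi, P_a)
R_pi = np.einsum("sa,as->s", pi, R_a)

# the induced MRP can be solved exactly as in the MRP case: v = (I − γ P^π)^{-1} R^π
gamma = 0.9
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(P_pi, R_pi, v_pi)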


Modelling expected returns in MDP

How good is each state when we follow the policy π?

Definition
The state-value function v_π(s) of an MDP is the expected return when starting from state s and following policy π:

v_π(s) = E_π[G_t | S_t = s]

Should we change the policy? How much does choosing a different action change the value?

Definition
The action-value function q_π(s, a) of an MDP is the expected return when starting from state s, taking action a, and then following policy π:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]


Example: State-Value function for Student MDP

from David Silver


Bellman Expectation Equation

Recall the Bellman equation: decompose the expected return into the immediate reward plus the discounted value of the successor state,

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

The action-value function can be similarly decomposed,

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]


Bellman Equation: joint update of v_π and q_π

The value function can be derived from q_π:

v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)

... and q can be computed from the transition model:

q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s')

Substituting q in v:

v_π(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

Substituting v in q:

q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} π(a'|s') q_π(s', a')
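A minimal sketch of iterative policy evaluation, i.e. repeatedly applying the substituted equation v_π(s) = Σ_a π(a|s) ( R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s') ) until it (approximately) converges; it reuses the made-up MDP (P_a, R_a, pi) from the earlier sketch.

import numpy as np

def policy_evaluation(P_a, R_a, pi, gamma=0.9, n_iter=1000):
    """Fixed-point iteration of the Bellman expectation equation."""
    n_states = P_a.shape[1]
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = R_a + gamma * P_a @ v            # q[a, s] = R^a_s + γ Σ_{s'} P^a_{ss'} v(s')
        v = np.einsum("sa,as->s", pi, q)     # v(s)    = Σ_a π(a|s) q(s, a)
    return v

print(policy_evaluation(P_a, R_a, pi))       # should match the direct solve of the induced MRP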


Example: Bellman update for v in Student MDP

from David Silver; π(a|s) = 0.5


Explicit solution for vπ

Since a policy induces an MRP, v_π can be directly computed (as before):

v = (I − γ P^π)^{-1} R^π

But do we want v_π?

We want to find the optimal policy and its value function!


Optimal Value Function

Definition
The optimal state-value function v*(s) is the maximum value function over all policies:

v*(s) = max_π v_π(s)

Definition
The optimal action-value function q*(s, a) is the maximum action-value function over all policies:

q*(s, a) = max_π q_π(s, a)

What does it mean?
• v* specifies the best possible performance in an MDP
• Knowing v* solves the MDP (how? we will see...)


Example: Optimal Value Function v∗ in Student MDP

from David Silver


Example: Optimal Action-Value Function q∗ in Student MDP

from David Silver


Optimal Policy

Actually solving the MDP means we also have the optimal policy.

Define a partial ordering over policies:

π ≥ π' if v_π(s) ≥ v_{π'}(s), ∀s

Theorem
For any Markov Decision Process
• There exists an optimal policy π* that is better than or equal to all other policies, π* ≥ π, ∀π
• All optimal policies achieve the optimal state-value function, v_{π*}(s) = v*(s)
• All optimal policies achieve the optimal action-value function, q_{π*}(s, a) = q*(s, a)


Finding an Optimal Policy

Given the optimal action-value function q*, the optimal policy is obtained by maximizing it:

π*(a|s) = ⟦a = argmax_{a'∈A} q*(s, a')⟧

⟦·⟧ is the Iverson bracket: 1 if the condition is true, otherwise 0.
• There is always a deterministic optimal policy for any MDP
• If we know q*(s, a), we immediately have the optimal policy (greedy)
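A minimal sketch of this greedy extraction: given any array q_star of shape (n_states, n_actions), e.g. produced by one of the iterative methods sketched later, build the deterministic policy π*(a|s) = ⟦a = argmax_{a'} q*(s, a')⟧.

import numpy as np

def greedy_policy(q_star):
    """One-hot greedy policy: pi[s, a] = 1 for the action maximizing q_star[s, :]."""
    n_states, n_actions = q_star.shape
    pi_star = np.zeros((n_states, n_actions))
    pi_star[np.arange(n_states), np.argmax(q_star, axis=1)] = 1.0
    return pi_star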


Bellman Equation for optimal value functions

Also for the optimal value functions we can use Bellman's optimality equations:

v*(s) = max_{a∈A} q*(s, a)

q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v*(s')

Substituting q in v:

v*(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v*(s') )

Substituting v in q:

q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} max_{a'∈A} q*(s', a')
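A minimal sketch of value iteration, i.e. repeatedly applying the substituted optimality equation v*(s) = max_a ( R^a_s + γ Σ_{s'} P^a_{ss'} v*(s') ); it reuses the made-up MDP (P_a, R_a) and greedy_policy from the earlier sketches (value iteration is one of the methods listed on the next slide).

import numpy as np

def value_iteration(P_a, R_a, gamma=0.9, n_iter=1000):
    """Fixed-point iteration of the Bellman optimality equation; returns v* and q*."""
    n_states = P_a.shape[1]
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = R_a + gamma * P_a @ v    # q[a, s] = R^a_s + γ Σ_{s'} P^a_{ss'} v(s')
        v = q.max(axis=0)            # v*(s)   = max_a q(s, a)
    q = R_a + gamma * P_a @ v        # q* consistent with the final v*
    return v, q.T                    # q* returned with shape (n_states, n_actions)

v_star, q_star = value_iteration(P_a, R_a)
print(v_star)
print(greedy_policy(q_star))         # the greedy (hence optimal) policy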


Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear
• No closed-form solution (in general)
• Many iterative solution methods (a policy-iteration sketch follows below):
  Value Iteration
  Policy Iteration
  Q-learning
  SARSA
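A minimal sketch of one of the listed methods, policy iteration, assembled from the earlier sketches: alternate policy evaluation with greedy policy improvement until the policy stops changing (uses P_a, R_a, policy_evaluation and greedy_policy defined above).

import numpy as np

def policy_iteration(P_a, R_a, gamma=0.9, max_iter=100):
    """Alternate Bellman-expectation evaluation and greedy improvement."""
    n_actions, n_states, _ = P_a.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform policy
    for _ in range(max_iter):
        v = policy_evaluation(P_a, R_a, pi, gamma)         # evaluate the current policy
        q = (R_a + gamma * P_a @ v).T                      # q(s, a) under that policy
        new_pi = greedy_policy(q)                          # greedy improvement
        if np.array_equal(new_pi, pi):                     # policy stable -> optimal
            break
        pi = new_pi
    return pi, v

pi_star, v_star = policy_iteration(P_a, R_a)
print(pi_star)
print(v_star)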
