Machine Learning for Robotics
Intelligent Systems Series

Lecture 7

Georg Martius

MPI for Intelligent Systems, Tübingen, Germany

June 12, 2017


Markov Decision Processes


Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.

Reminder: Markov property
A state S_t is Markov if and only if

P(S_{t+1} | S_t) = P(S_{t+1} | S_1, ..., S_t)

Definition (Markov Process / Markov Chain)
A Markov Process (or Markov Chain) is a tuple (S, P)
• S is a (finite) set of states
• P is a state transition probability matrix,

P_{ss'} = P(S_{t+1} = s' | S_t = s)


Example: Student Markov Chain


Example: Student Markov Chain Episodes

Sample episodes starting from S_1 = C1:

S_1, S_2, ..., S_T

• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep


Example: Student Markov Chain Transition Matrix
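The transition diagram and matrix are shown as a figure in the original slides and are not reproduced here. As a minimal sketch, such a chain can be stored as a row-stochastic matrix and sampled; the state names match the episodes above, while the transition probabilities below are illustrative values (an assumption, not read from the missing figure).

import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (terminal, modelled as a self-loop)
])

def sample_episode(start="C1", max_steps=50, rng=np.random.default_rng(0)):
    """Sample S_1, S_2, ..., S_T until the terminal state Sleep is reached."""
    s = states.index(start)
    episode = [states[s]]
    for _ in range(max_steps):
        if states[s] == "Sleep":
            break
        s = rng.choice(len(states), p=P[s])   # draw the successor state from row s
        episode.append(states[s])
    return episode

print(sample_episode())   # one sampled episode, e.g. C1 C2 C3 Pass Sleep (depends on the seed)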


Markov Reward Process

A Markov reward process is a Markov chain with values.

Definition (MRP)
A Markov Reward Process is a tuple (S, P, R, γ)
• S is a finite set of states
• P is a state transition probability matrix,

P_{ss'} = P(S_{t+1} = s' | S_t = s)

• R is a reward function, R_s = E[R_{t+1} | S_t = s]
• γ is a discount factor, γ ∈ [0, 1]

Note that the reward can be stochastic (R_s is an expectation).


Example: Student MRP


Return

Definition
The return G_t is the total discounted reward from time-step t:

G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

• The discount γ ∈ [0, 1] devalues future rewards: a reward R received after k + 1 time-steps counts as γ^k R.
• Extreme cases:
  γ close to 0 leads to immediate reward maximization only
  γ close to 1 leads to far-sighted evaluation
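A minimal sketch of the definition above: summing a finite reward sequence R_{t+1}, R_{t+2}, ... with discount γ (the reward numbers are only illustrative, they are not taken from the slides).

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [-2, -2, -2, 10, 0]            # e.g. rewards collected along C1 C2 C3 Pass Sleep
print(discounted_return(rewards, 0.9))   # far-sighted: future rewards still count
print(discounted_return(rewards, 0.0))   # myopic: only the immediate reward matters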


Discussion on discounting

Why is discounting often used?
• Mathematically convenient to discount rewards (keeps returns finite)
• A way to model the uncertainty about the future (since the model may not be exact)
• Animal/human behaviour shows a preference for immediate rewards

In some cases undiscounted Markov reward processes (i.e. γ = 1) are considered, e.g. if all sequences terminate.


Value Function

The value function describes the value of a state (in the stationary state).

Definition
The state value function v(s) of an MRP is the expected return starting from state s:

v(s) = E[G_t | S_t = s]
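Since v(s) = E[G_t | S_t = s] is an expectation over episodes, one way to illustrate the definition (not part of the slides, a Monte Carlo sketch) is to average the returns of many sampled episodes that start in s. This reuses sample_episode and discounted_return from the earlier sketches and assumes illustrative per-state rewards.

R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

def mc_value_estimate(start, gamma, n_episodes=10_000):
    """Average the sampled returns G_1 over many episodes starting in `start`."""
    total = 0.0
    for _ in range(n_episodes):
        episode = sample_episode(start)
        total += discounted_return([R[s] for s in episode], gamma)
    return total / n_episodes

print(mc_value_estimate("C1", gamma=0.9))   # noisy estimate of v(C1) for the illustrative chain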


Example: Value Function for Student MRP

from David Silver


Bellman Equation (MRP) I

Idea: make the value computation recursive by separating the contributions from:
• the immediate reward
• the discounted future rewards

v(s) = E[G_t | S_t = s]
     = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
     = E[R_{t+1} + γ G_{t+1} | S_t = s]
     = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]

Hmm... we still need an expectation over S_{t+1}. Use the transition matrix to get the probabilities of the successor state:

v(s) = R_s + γ Σ_{s'∈S} P_{ss'} v(s')


Example: Bellman Equation for Student MRP

from David Silver


Bellman Equation (MRP) II

Bellman equations in matrix form:

v = R + γ P v

where v ∈ ℝ^{|S|} and R are vectors.

The Bellman equation can be solved directly:

v = (I − γP)^{-1} R

• computational complexity is O(|S|^3)
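A minimal sketch of the direct solution for the illustrative student chain defined earlier (P and states come from that sketch; the per-state rewards R_vec below are again illustrative, not read from the missing figure).

import numpy as np

R_vec = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)   # same state order as P
gamma = 0.9

# v = (I − γP)^{-1} R, here via a linear solve instead of an explicit inverse; O(|S|^3)
v = np.linalg.solve(np.eye(len(R_vec)) - gamma * P, R_vec)
print(dict(zip(states, np.round(v, 2))))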


Markov Decision Process

A Markov reward process has no agent; there is no influence on the system. An MRP with an active agent forms a Markov Decision Process.
• The agent takes decisions by executing actions
• The state is Markovian

Definition (MDP)
A Markov Decision Process is a tuple (S, A, P, R, γ)
• S is a finite set of states
• A is a finite set of actions
• P is a state transition probability matrix,

P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a)

• R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
• γ is a discount factor, γ ∈ [0, 1]


Example: Student MDP

from David Silver


How to model decision taking?

The agent has an action-selection function called a policy.

Definition
A policy π is a distribution over actions given states,

π(a|s) = P(A_t = a | S_t = s)

• Since it is a Markov process, the policy only depends on the current state
• Implication: policies are stationary (independent of time)

An MDP with a given policy turns into an MRP:

P^π_{ss'} = Σ_{a∈A} π(a|s) P^a_{ss'}

R^π_s = Σ_{a∈A} π(a|s) R^a_s
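A minimal sketch of this reduction on a small made-up MDP (2 states, 2 actions; all numbers are invented for illustration, this is not the student example): P^π and R^π are just π-weighted averages of P^a and R^a, and the induced MRP can then be solved as before.

import numpy as np

P_a = np.array([                       # P_a[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
    [[0.9, 0.1], [0.0, 1.0]],          # action 0
    [[0.2, 0.8], [0.5, 0.5]],          # action 1
])
R_a = np.array([                       # R_a[a, s] = E[R_{t+1} | S_t = s, A_t = a]
    [1.0, 0.0],
    [0.0, 2.0],
])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])   # pi[s, a] = π(a | s), uniform here

# P^π_{ss'} = Σ_a π(a|s) P^a_{ss'}   and   R^π_s = Σ_a π(a|s) R^a_s
P_pi = np.einsum("sa,asp->sp", pi, P_a)
R_pi = np.einsum("sa,as->s", pi, R_a)

# the induced MRP can be solved exactly as in the MRP case: v = (I − γ P^π)^{-1} R^π
gamma = 0.9
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(P_pi, R_pi, v_pi)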


Modelling expected returns in MDP

How good is each state when we follow the policy π?

Definition
The state-value function v_π(s) of an MDP is the expected return when starting from state s and following policy π:

v_π(s) = E_π[G_t | S_t = s]

Should we change the policy? How much does choosing a different action change the value?

Definition
The action-value function q_π(s, a) of an MDP is the expected return when starting from state s, taking action a, and then following policy π:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]


Example: State-Value function for Student MDP

from David Silver


Bellman Expectation Equation

Recall the Bellman equation: decompose the expected return into the immediate reward plus the discounted value of the successor state,

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

The action-value function can be similarly decomposed,

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]


Bellman Equation: joint update of v_π and q_π

The value function can be derived from q_π:

v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)

... and q can be computed from the transition model:

q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s')

Substituting q in v:

v_π(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

Substituting v in q:

q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} π(a'|s') q_π(s', a')
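A minimal sketch of iterative policy evaluation, i.e. repeatedly applying the substituted equation v_π(s) = Σ_a π(a|s) ( R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s') ) until it (approximately) converges; it reuses the made-up MDP (P_a, R_a, pi) from the earlier sketch.

import numpy as np

def policy_evaluation(P_a, R_a, pi, gamma=0.9, n_iter=1000):
    """Fixed-point iteration of the Bellman expectation equation."""
    n_states = P_a.shape[1]
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = R_a + gamma * P_a @ v            # q[a, s] = R^a_s + γ Σ_{s'} P^a_{ss'} v(s')
        v = np.einsum("sa,as->s", pi, q)     # v(s)    = Σ_a π(a|s) q(s, a)
    return v

print(policy_evaluation(P_a, R_a, pi))       # should match the direct solve of the induced MRP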


Example: Bellman update for v in Student MDP

from David Silver; π(a|s) = 0.5


Explicit solution for vπ

Since a policy induces an MRP, v_π can be directly computed (as before):

v = (I − γ P^π)^{-1} R^π

But do we want v_π?

We want to find the optimal policy and its value function!


Optimal Value Function

Definition
The optimal state-value function v*(s) is the maximum value function over all policies:

v*(s) = max_π v_π(s)

Definition
The optimal action-value function q*(s, a) is the maximum action-value function over all policies:

q*(s, a) = max_π q_π(s, a)

What does it mean?
• v* specifies the best possible performance in an MDP
• Knowing v* solves the MDP (how? we will see...)


Example: Optimal Value Function v∗ in Student MDP

from David Silver


Example: Optimal Action-Value Function q∗ in Student MDP

from David Silver


Optimal Policy

Actually solving the MDP means we also have the optimal policy.

Define a partial ordering over policies:

π ≥ π' if v_π(s) ≥ v_{π'}(s), ∀s

Theorem
For any Markov Decision Process
• There exists an optimal policy π* that is better than or equal to all other policies, π* ≥ π, ∀π
• All optimal policies achieve the optimal state-value function, v_{π*}(s) = v*(s)
• All optimal policies achieve the optimal action-value function, q_{π*}(s, a) = q*(s, a)


Finding an Optimal Policy

Given the optimal action-value function q*, the optimal policy is obtained by maximizing it:

π*(a|s) = ⟦a = argmax_{a'∈A} q*(s, a')⟧

⟦·⟧ is the Iverson bracket: 1 if the condition is true, otherwise 0.
• There is always a deterministic optimal policy for any MDP
• If we know q*(s, a), we immediately have the optimal policy (greedy)
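A minimal sketch of this greedy extraction: given any array q_star of shape (n_states, n_actions), e.g. produced by one of the iterative methods sketched later, build the deterministic policy π*(a|s) = ⟦a = argmax_{a'} q*(s, a')⟧.

import numpy as np

def greedy_policy(q_star):
    """One-hot greedy policy: pi[s, a] = 1 for the action maximizing q_star[s, :]."""
    n_states, n_actions = q_star.shape
    pi_star = np.zeros((n_states, n_actions))
    pi_star[np.arange(n_states), np.argmax(q_star, axis=1)] = 1.0
    return pi_star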


Bellman Equation for optimal value functions

Also for the optimal value functions we can use Bellman's optimality equations:

v*(s) = max_{a∈A} q*(s, a)

q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v*(s')

Substituting q in v:

v*(s) = max_{a∈A} ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v*(s') )

Substituting v in q:

q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} max_{a'∈A} q*(s', a')
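A minimal sketch of value iteration, i.e. repeatedly applying the substituted optimality equation v*(s) = max_a ( R^a_s + γ Σ_{s'} P^a_{ss'} v*(s') ); it reuses the made-up MDP (P_a, R_a) and greedy_policy from the earlier sketches (value iteration is one of the methods listed on the next slide).

import numpy as np

def value_iteration(P_a, R_a, gamma=0.9, n_iter=1000):
    """Fixed-point iteration of the Bellman optimality equation; returns v* and q*."""
    n_states = P_a.shape[1]
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = R_a + gamma * P_a @ v    # q[a, s] = R^a_s + γ Σ_{s'} P^a_{ss'} v(s')
        v = q.max(axis=0)            # v*(s)   = max_a q(s, a)
    q = R_a + gamma * P_a @ v        # q* consistent with the final v*
    return v, q.T                    # q* returned with shape (n_states, n_actions)

v_star, q_star = value_iteration(P_a, R_a)
print(v_star)
print(greedy_policy(q_star))         # the greedy (hence optimal) policy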


Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear
• No closed-form solution (in general)
• Many iterative solution methods (a policy-iteration sketch follows below):
  Value Iteration
  Policy Iteration
  Q-learning
  SARSA
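A minimal sketch of one of the listed methods, policy iteration, assembled from the earlier sketches: alternate policy evaluation with greedy policy improvement until the policy stops changing (uses P_a, R_a, policy_evaluation and greedy_policy defined above).

import numpy as np

def policy_iteration(P_a, R_a, gamma=0.9, max_iter=100):
    """Alternate Bellman-expectation evaluation and greedy improvement."""
    n_actions, n_states, _ = P_a.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform policy
    for _ in range(max_iter):
        v = policy_evaluation(P_a, R_a, pi, gamma)         # evaluate the current policy
        q = (R_a + gamma * P_a @ v).T                      # q(s, a) under that policy
        new_pi = greedy_policy(q)                          # greedy improvement
        if np.array_equal(new_pi, pi):                     # policy stable -> optimal
            break
        pi = new_pi
    return pi, v

pi_star, v_star = policy_iteration(P_a, R_a)
print(pi_star)
print(v_star)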
