reinforcement learning: learning algorithms

Reinforcement Learning: Learning Algorithms

Yishay Mansour

Tel-Aviv University

Outline

• Last week
– Goal of Reinforcement Learning
– Mathematical Model (MDP)
– Planning
    • Value iteration
    • Policy iteration

• This week: Learning Algorithms
– Model based
– Model free

Planning - Basic Problems.

Policy evaluation - Given a policy $\pi$, estimate its return.

Optimal control - Find an optimal policy $\pi^*$ (one that maximizes the return from any start state).

Given a complete MDP model.

$\text{return} = \sum_{i=0}^{\infty} \gamma^i R(s_i, a_i)$

Planning - Value Functions

$V^\pi(s)$: the expected return starting at state $s$ and following $\pi$.

$Q^\pi(s,a)$: the expected return starting at state $s$ with action $a$ and then following $\pi$.

$V^*(s)$ and $Q^*(s,a)$ are defined using an optimal policy $\pi^*$.

$V^*(s) = \max_\pi V^\pi(s)$

Algorithms - optimal control

CLAIM: A policy $\pi$ is optimal if and only if at each state $s$:

$V^\pi(s) = \max_a \{ Q^\pi(s,a) \}$ (Bellman Eq.)

The greedy policy with respect to $Q^\pi(s,a)$ is

$\pi(s) = \arg\max_a \{ Q^\pi(s,a) \}$
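As a small illustration of the greedy policy, the sketch below reads off argmax over actions from a tabular Q function. The dictionary layout, the function name `greedy_policy`, and the two-state example are assumptions made for this illustration, not part of the lecture.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Action = Hashable


def greedy_policy(Q: Dict[Tuple[State, Action], float]) -> Dict[State, Action]:
    """Return pi(s) = argmax_a Q(s, a) for every state that appears in Q."""
    best: Dict[State, Tuple[Action, float]] = {}
    for (s, a), q in Q.items():
        # Keep the highest-valued action seen so far; ties keep the earlier one.
        if s not in best or q > best[s][1]:
            best[s] = (a, q)
    return {s: a for s, (a, _) in best.items()}


# Hypothetical Q table for two states and two actions.
Q = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 1.0, ("s1", "right"): 0.4,
}
pi = greedy_policy(Q)
```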

MDP - computing optimal policy

1. Linear Programming

2. Value Iteration method.

$V^{i+1}(s) = \max_a \{ R(s,a) + \gamma \sum_{s'} \delta(s,a,s') \, V^i(s') \}$

3. Policy Iteration method.

$\pi_i(s) = \arg\max_a \{ Q^{\pi_{i-1}}(s,a) \}$
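A minimal sketch of value iteration on a tabular MDP may help here. It repeatedly applies the backup "maximum over actions of immediate reward plus discounted expected next value" until the values stop changing. The data layout (dictionaries keyed by state-action pairs), the stopping tolerance `tol`, and the toy two-state MDP are assumptions of this sketch.

```python
def value_iteration(states, actions, R, delta, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Tabular value iteration.

    R[(s, a)]      expected immediate reward
    delta[(s, a)]  dict mapping next state s2 -> transition probability
    Returns (V, policy), where policy is greedy with respect to V.
    """
    def q_value(s, a, V):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in delta[(s, a)].items())

    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        V_new = {s: max(q_value(s, a, V) for a in actions) for s in states}
        converged = max(abs(V_new[s] - V[s]) for s in states) < tol
        V = V_new
        if converged:
            break
    policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
    return V, policy


# Hypothetical MDP: s1 pays reward 1 forever; from s0, "go" moves to s1.
states = ["s0", "s1"]
actions = ["stay", "go"]
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0, ("s1", "stay"): 1.0, ("s1", "go"): 1.0}
delta = {
    ("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0},
}
V, policy = value_iteration(states, actions, R, delta, gamma=0.5)
```

With a discount of 0.5 the value of s1 is 1 / (1 - 0.5) = 2, and the value of s0 is 0.5 * 2 = 1, reached by choosing "go".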

Planning versus Learning

Tightly coupled in Reinforcement Learning

Goal: maximize return while learning.

Example - Elevator Control

Planning (alone): Given an arrival model, build a schedule.

Learning (alone): Learn the arrival model well.

Real objective: Construct a schedule while updating the model.

Learning Algorithms

Given access only to actions, perform:
1. Policy evaluation.
2. Control - find an optimal policy.

Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning).

Learning - Model Based

Estimate the model from the observations (both transition probabilities and rewards).

Use the estimated model as the true model, and find an optimal policy.

If we have a “good” estimated model, we should have a “good” estimation.

Learning - Model Based: off policy

• Let the policy run for a “long” time.
– What is “long”?!

• Build an “observed model”:
– Transition probabilities
– Rewards

• Use the “observed model” to estimate the value of the policy.

Learning - Model Based: off-policy algorithm

• Observe a trajectory generated by a policy π
– off-policy: no need to control the actions

• For every s, s’ ∈ S and a ∈ A:
– obsδ(s,a,s’) = #(s,a,s’) / #(s,a,*)
– obsR(s,a) = average(r | s,a)

• Find an optimal policy for (S, A, obsδ, obsR)
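The counting step can be sketched as follows: transition frequencies and average rewards are accumulated from an observed trajectory. The representation of the trajectory as a list of `(s, a, r, s_next)` tuples and the name `estimate_model` are assumptions of this sketch.

```python
from collections import defaultdict


def estimate_model(trajectory):
    """Build the observed model from (s, a, r, s_next) transitions.

    Returns (obs_delta, obs_R) with
      obs_delta[(s, a)][s_next] = #(s, a, s_next) / #(s, a, *)
      obs_R[(s, a)]             = average reward observed after (s, a)
    Pairs (s, a) that never occur in the trajectory are absent from the result.
    """
    counts = defaultdict(lambda: defaultdict(int))   # #(s, a, s_next)
    reward_sum = defaultdict(float)                  # sum of r for (s, a)
    visits = defaultdict(int)                        # #(s, a, *)
    for s, a, r, s_next in trajectory:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    obs_delta = {
        sa: {s2: c / visits[sa] for s2, c in nxt.items()}
        for sa, nxt in counts.items()
    }
    obs_R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return obs_delta, obs_R


# Hypothetical trajectory: four observations of the pair (s0, go).
trajectory = [
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 1.0, "s0"),
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 1.0, "s1"),
]
obs_delta, obs_R = estimate_model(trajectory)
```

The returned dictionaries can then be handed to any planning routine as if they were the true model; unvisited state-action pairs have no estimate at all, which is why the sampling assumption matters.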

Learning - Model Based

• Claim: if the model is “accurate” we will find a near-optimal policy.

• Hidden assumption:
– Each (s,a,*) is sampled many times.
– This is the “responsibility” of the policy π

• Simple question: How many samples do we need for each (s,a,*)?

Learning - Model Based: sample size

Sample size (optimal policy):

Naive: O(|S| log (|S| |A|) ) samples per (s,a) (approximates each transition (s,a,s’) well.)

Better: O(log (|S| |A|) ) samples per (s,a) (Sufficient to approximate optimal policy.) [KS, NIPS’98]

Learning - Model Based: on policy

• The learner has control over the action.
– The immediate goal is to learn a model

• As before:
– Build an “observed model”:
    • Transition probabilities and Rewards
– Use the “observed model” to estimate the value of the policy.

• Accelerating the learning:
– How to reach “new” places?!

Learning - Model Based: on policy

[Figure: the state space divided into well sampled nodes and relatively unknown nodes]

Learning: Policy improvement

• Assume that we can perform:
– Given a policy $\pi$,
– Compute the $V^\pi$ and $Q^\pi$ functions of $\pi$

• Can run policy improvement:
– $\pi' = \text{Greedy}(Q^\pi)$

• The process converges if the estimates are accurate.

Learning: Monte Carlo Methods

• Assume we can run in episodes
– Terminating MDP
– Discounted return

• Simplest: sample the return of state s:
– Wait to reach state s,
– Compute the return from s,
– Average all the returns.

Learning: Monte Carlo Methods

• First visit:
– For each state in the episode,
– Compute the return from its first occurrence
– Average the returns

• Every visit:
– Might be biased!

• Computing optimal policy:
– Run policy iteration.
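First-visit Monte Carlo evaluation can be sketched as below. Each episode is assumed to be a list of `(state, reward)` pairs, where the reward is the one received on leaving that state; this representation, the function name, and the toy episodes are assumptions of the sketch.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the return from the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # Discounted return from every time step, computed backwards.
        G = 0.0
        returns_from = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns_from[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:  # only the first occurrence in this episode counts
                seen.add(s)
                returns[s].append(returns_from[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}


# Hypothetical episodes; state "A" appears twice in the first one.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 1.0)],
]
V_hat = first_visit_mc(episodes, gamma=1.0)
```

In the first episode the return from the first visit to A is 1 + 0 + 2 = 3 and the return from B is 0 + 2 = 2; the second visit to A is ignored. Dropping the `seen` check would turn this into the every-visit variant.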

Learning - Model Free
Policy evaluation: TD(0)

An online view: At state $s_t$ we performed action $a_t$, received reward $r_t$ and moved to state $s_{t+1}$.

Our “estimation error” is $\mathrm{Err}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The update, with learning rate $\alpha$:

$V_{t+1}(s_t) = V_t(s_t) + \alpha \, \mathrm{Err}_t$

Note that for the correct value function we have:

$E[r + \gamma V(s') - V(s)] = E[\mathrm{Err}_t] = 0$
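A single TD(0) update can be sketched in a few lines. Here `alpha` is the learning rate and `gamma` the discount factor; the dictionary-based value table, the function name, and the numbers in the example are assumptions of this sketch.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V in place and return the estimation error."""
    err = r + gamma * V[s_next] - V[s]   # estimation error for this transition
    V[s] += alpha * err                  # move V(s) a fraction alpha toward the target
    return err


# Hypothetical transition s0 -> s1 with reward 0.5.
V = {"s0": 0.0, "s1": 1.0}
err = td0_update(V, "s0", r=0.5, s_next="s1", alpha=0.5, gamma=0.5)
```

In this example the error is 0.5 + 0.5 * 1.0 - 0.0 = 1.0, so V(s0) moves from 0.0 to 0.5; only the visited state changes.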

Learning - Model Free
Optimal Control: off-policy

Learn the Q function online.

$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha \, [\, r_t + \gamma V_t(s_{t+1}) - Q_t(s_t,a_t) \,]$, where $V_t(s) = \max_a Q_t(s,a)$.

OFF POLICY: Q-Learning

• Any underlying policy selects the actions.
• Assumes every state-action pair is performed infinitely often.
• Depends on the learning rate $\alpha$.

Convergence in the limit: GUARANTEED [DW,JJS,S,TS]
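One Q-learning update can be sketched as follows. The value of the next state is taken as the maximum of the current Q estimates over actions, regardless of which action the behaviour policy takes next. The table layout, the function name, and the example numbers are assumptions of this sketch; `alpha` is the learning rate and `gamma` the discount factor.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to Q in place and return the new Q(s, a)."""
    v_next = max(Q[(s_next, b)] for b in actions)          # V(s_next) = max_b Q(s_next, b)
    Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical table: everything is 0 except Q(s1, right) = 2.
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in ("s0", "s1") for a in actions}
Q[("s1", "right")] = 2.0
new_q = q_learning_update(Q, "s0", "left", r=1.0, s_next="s1",
                          actions=actions, alpha=0.5, gamma=0.5)
```

The target here is 1.0 + 0.5 * 2.0 = 2.0, so Q(s0, left) moves halfway from 0.0 to 2.0.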

Learning - Model Free
Optimal Control: on-policy

Learn the Q function online.

$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha \, [\, r_t + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t) \,]$

ON-POLICY: SARSA. The action $a_{t+1}$ is chosen by the $\epsilon$-greedy policy for $Q_t$.

The policy selects the action! Need to balance exploration and exploitation.

Convergence in the limit: GUARANTEED [DW,JJS,S,TS]
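A sketch of SARSA with epsilon-greedy action selection follows. Unlike Q-learning, the update uses the Q value of the action actually chosen in the next state. The function names, the table layout, and the example numbers are assumptions of this sketch; `alpha` is the learning rate and `gamma` the discount factor.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update to Q in place and return the new Q(s, a)."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical table: everything is 0 except Q(s1, right) = 2.
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in ("s0", "s1") for a in actions}
Q[("s1", "right")] = 2.0

rng = random.Random(0)
greedy_choice = epsilon_greedy(Q, "s1", actions, epsilon=0.0, rng=rng)

# Suppose exploration picked "left" in s1: the target uses Q(s1, left), not the maximum.
new_q = sarsa_update(Q, "s0", "left", r=1.0, s_next="s1", a_next="left",
                     alpha=0.5, gamma=0.5)
```

With the exploratory next action the target is 1.0 + 0.5 * 0.0 = 1.0, so Q(s0, left) becomes 0.5, whereas a maximum-based target would have given 1.0.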

Learning - Model Free
Policy evaluation: TD($\lambda$)

Again: At state $s_t$ we performed action $a_t$, received reward $r_t$ and moved to state $s_{t+1}$. Our “estimation error” is $\mathrm{Err}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.

Update every state $s$:

$V_{t+1}(s) = V_t(s) + \alpha \, \mathrm{Err}_t \, e(s)$

Update of $e(s)$:
– When visiting $s$: incremented by 1: $e(s) = e(s) + 1$
– For all other $s'$: decremented by a factor $\gamma\lambda$: $e(s') = \gamma\lambda \, e(s')$
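One step of TD(lambda) with eligibility traces can be sketched as below. This sketch uses the common accumulating-trace ordering, in which every trace (including that of the visited state) is first decayed by gamma * lambda and the visited state's trace is then incremented by 1; that ordering is an assumption here, and for a state visited for the first time it gives the same trace as the rule stated above. The names and example numbers are also assumptions.

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one TD(lambda) step to V and the traces e in place; return the error."""
    err = r + gamma * V[s_next] - V[s]
    for x in e:
        e[x] *= gamma * lam          # decay all eligibility traces
    e[s] += 1.0                      # the visited state becomes more eligible
    for x in V:
        V[x] += alpha * err * e[x]   # every state is updated in proportion to its trace
    return err


# Hypothetical chain s0 -> s1 -> s2 with reward 1 on the second transition.
V = {"s0": 0.0, "s1": 0.0, "s2": 1.0}
e = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
err1 = td_lambda_step(V, e, "s0", r=0.0, s_next="s1", alpha=0.5, gamma=0.5, lam=0.5)
err2 = td_lambda_step(V, e, "s1", r=1.0, s_next="s2", alpha=0.5, gamma=0.5, lam=0.5)
```

The second error, 1.0 + 0.5 * 1.0 = 1.5, is credited fully to s1 (trace 1.0) and partially to the earlier state s0 (trace 0.25), which is how the error propagates backwards along the trajectory.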

Summary

Markov Decision Process: Mathematical Model. Planning Algorithms.

Learning Algorithms:
– Model Based
– Monte Carlo
– TD(0)
– Q-Learning
– SARSA
– TD($\lambda$)
