
Page 1: Reinforcement Learning: Learning Algorithms, Function Approximation

Yishay Mansour
Tel-Aviv University

Page 2: Outline

• Week I: Basics
  – Mathematical Model (MDP)
  – Planning
    • Value iteration
    • Policy iteration
• Week II: Learning Algorithms
  – Model based
  – Model free
• Week III: Large state space

Page 3: Learning Algorithms

Given only the ability to perform actions (no model of the environment), we want to:
1. Policy evaluation.
2. Control: find an optimal policy.

Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning, SARSA).

Page 4: Learning: Policy Improvement

• Assume that, given a policy π, we can compute:
  – the V and Q functions of π.
• Then we can perform policy improvement:
  – π' = Greedy(Q)
• The process converges if the estimates are accurate.
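A minimal sketch of the improvement step in Python, assuming a tabular Q stored as a dictionary keyed by (state, action); the names Q and legal_actions are illustrative, not from the slides.

    # Greedy policy improvement from an estimated Q function (tabular sketch).
    # Q             : dict mapping (state, action) -> estimated value
    # legal_actions : function mapping a state to its list of legal actions
    def greedy_policy(Q, legal_actions):
        def policy(state):
            # Pick the action with the largest estimated Q-value in this state.
            return max(legal_actions(state), key=lambda a: Q[(state, a)])
        return policy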

Page 5: Learning - Model Free, Optimal Control: Off-Policy

Learn the Q function online.

Qt+1(st, at) = Qt(st, at) + αt Δt

OFF-POLICY: Q-Learning

Δt = rt + γ maxa {Qt(st+1, a)} − Qt(st, at)

Note the maximization operator!
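A minimal sketch of this off-policy step, assuming Q behaves like a dict from (state, action) to float (e.g., collections.defaultdict(float)); alpha, gamma, and legal_actions are hypothetical names for the step size, discount factor, and legal-action lookup.

    # One Q-Learning step: off-policy, uses the max over next-state actions.
    def q_learning_update(Q, s, a, r, s_next, legal_actions, alpha=0.1, gamma=0.99):
        # The target uses the best action in the next state (the maximization
        # operator), not the action the behaviour policy will actually take.
        best_next = max(Q[(s_next, a2)] for a2 in legal_actions(s_next))
        delta = r + gamma * best_next - Q[(s, a)]   # the error term Δt
        Q[(s, a)] += alpha * delta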

Page 6: Learning - Model Free, Policy Evaluation: TD(0)

An online view: at state st we performed action at, received reward rt, and moved to state st+1.

Our "estimation error" is Δt = rt + γ Vt(st+1) − Vt(st). The update:

Vt+1(st) = Vt(st) + αt Δt

No maximization over actions!
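A matching sketch of the TD(0) step for policy evaluation, again with hypothetical names: V is a dict from states to value estimates, alpha and gamma the step size and discount factor.

    # One TD(0) step: evaluates the policy being followed, no max over actions.
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        delta = r + gamma * V[s_next] - V[s]   # the "estimation error" Δt
        V[s] += alpha * delta
        return delta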

Page 7: Learning - Model Free, Optimal Control: On-Policy

Learn the optimal Q* function online.

Qt+1(st, at) = Qt(st, at) + αt [rt + γ Qt(st+1, at+1) − Qt(st, at)]

ON-POLICY: SARSA, where at+1 is chosen by the ε-greedy policy for Qt.

The policy selects the action!
Need to balance exploration and exploitation.
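A sketch of the on-policy step together with the ε-greedy selection it relies on; Q, legal_actions, and the numeric parameters are assumed names, not from the slides.

    import random

    def epsilon_greedy(Q, s, legal_actions, epsilon=0.1):
        # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
        actions = legal_actions(s)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy: the target uses the action the policy actually chose next.
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        Q[(s, a)] += alpha * delta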

Page 8: Modified Notation

• Rather than Q(s,a), have Qa(s)
• Greedy(Q) = maxa Qa(s)
• Each action has its own function Qa(s)
• Learn each Qa(s) independently!
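A small sketch of this notation, assuming Qa is a dict from actions to per-action value functions (hypothetical names).

    # One estimator per action: Qa[a] is a callable s -> estimated value,
    # and each is trained independently; greedy selection is an argmax.
    def greedy_over_actions(Qa, s, actions):
        return max(actions, key=lambda a: Qa[a](s))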

Page 9: Large State Space

• Reduce the number of states
  – Symmetries (X-O)
  – Cluster states
• Define attributes
• Limited number of attributes
• Some states will become identical

Page 10: Example: X-O

• For each action (square):
  – Consider the row/diagonal/column through it
  – The state encodes the status of these "rows":
    • Two X's
    • Two O's
    • Mixed (both X and O)
    • One X
    • One O
    • Empty
  – Only three types of squares/actions
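A hypothetical sketch of this encoding for one line (row, column, or diagonal) through an empty candidate square; the category names mirror the bullets above.

    def line_status(line):
        # line is a 3-element sequence of 'X', 'O', or None (empty square).
        xs, os = line.count('X'), line.count('O')
        if xs and os: return 'mixed'
        if xs == 2:   return 'two_x'
        if os == 2:   return 'two_o'
        if xs == 1:   return 'one_x'
        if os == 1:   return 'one_o'
        return 'empty'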

Page 11: Clustering States

• Need to create attributes
• Attributes should be "game dependent"
• Different "real" states - same representation
• How do we run?
  – We estimate the action values.
  – Consider only legal actions.
  – Play the "best" action.

Page 12: Function Approximation

• Use a limited model for Qa(s)
• Have an attribute vector:
  – Each state s has a vector vec(s) = (x1, ..., xk)
  – Normally k << |S|
• Examples:
  – Neural network
  – Decision tree
  – Linear function
    • Weights θ = (θ1, ..., θk)
    • Value: Σi θi xi
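A small sketch of the linear case, assuming numpy, a per-action weight dictionary theta, and a feature vector vec(s) of length k (all names are illustrative).

    import numpy as np

    # Linear per-action approximation: Qa(s) ≈ <theta[a], vec(s)>.
    def q_values(theta, features):
        # theta: dict action -> length-k weight vector; features: vec(s).
        return {a: float(np.dot(w, features)) for a, w in theta.items()}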

Page 13: Gradient Descent

• Minimize the squared error
  – Squared error = ½ Σs P(s) [Vπ(s) − Vθ(s)]²
  – P(s) is a weighting over the states
• Algorithm: θ(t+1) = θ(t) + α [Vπ(st) − Vθ(t)(st)] ∇θ(t) Vθ(t)(st)
  – ∇θ(t) = the vector of partial derivatives (the gradient)
  – Replace Vπ(st) by a sample:
    • Monte Carlo: use the return Rt for Vπ(st)
    • TD(0): use Δt for [Vπ(st) − Vθ(t)(st)]
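A sketch of one update step under the Monte Carlo substitution (the observed return Rt stands in for the true value); value_fn, grad_fn, and alpha are assumed callables and parameters, not from the slides.

    # One gradient-descent step on the squared error with a sampled target.
    def gradient_mc_update(theta, s, sample_return, value_fn, grad_fn, alpha=0.01):
        # value_fn(theta, s) is the approximate value; grad_fn(theta, s) is its
        # gradient w.r.t. theta (theta and the gradient are numpy arrays here).
        error = sample_return - value_fn(theta, s)
        return theta + alpha * error * grad_fn(theta, s)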

Page 14: Linear Functions

• Linear function: Σi θi xi = <θ, x>
• Gradient: ∇θ(t) Vt(st) = vec(st)
• Update rule: θt+1 = θt + α [Vπ(st) − Vt(st)] vec(st)
  – MC: θt+1 = θt + α [Rt − <θt, vec(st)>] vec(st)
  – TD: θt+1 = θt + α Δt vec(st)
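A minimal sketch of the TD variant for the linear case, assuming numpy arrays for theta and the feature vectors (names are illustrative).

    import numpy as np

    # Linear TD(0): V(s) = <theta, vec(s)>, so the gradient is just vec(s).
    def linear_td0_update(theta, x_s, r, x_s_next, alpha=0.01, gamma=0.99):
        # x_s and x_s_next are the feature vectors vec(st) and vec(st+1).
        delta = r + gamma * np.dot(theta, x_s_next) - np.dot(theta, x_s)
        return theta + alpha * delta * x_s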

Page 15: Example: 4 in a Row

• Select attributes for each action (column):
  – 3 in a row (type X or type O)
  – 2 in a row (type X or O) and [blocked / not blocked]
  – Next location makes 3 in a row
    • the next move might lose
  – Other "features"
• RL will learn the weights.
• Look-ahead helps significantly
  – use a max-min tree

Page 16: Bootstrapping

• Playing against a "good" player
  – Using ...
• Self play
  – Start with a random player
  – play against oneself.
• Choose a starting point.
  – Max-min tree with a simple scoring function.
• Add some simple guidance
  – add "compulsory" moves.

Page 17: Scoring Function

• Checkers:
  – Number of pieces
  – Number of queens
• Chess:
  – Weighted sum of pieces
• Othello/Reversi:
  – Difference in the number of pieces
• Can be used with a max-min tree
  – (α, β) pruning
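A sketch of a depth-limited max-min search with (α, β) pruning, written against a hypothetical game interface (legal_moves, apply, is_terminal, and score are assumed, not from the slides).

    def alphabeta(state, depth, alpha, beta, maximizing,
                  legal_moves, apply, is_terminal, score):
        # Leaf: fall back to the scoring function described above.
        if depth == 0 or is_terminal(state):
            return score(state)
        if maximizing:
            value = float('-inf')
            for move in legal_moves(state):
                value = max(value, alphabeta(apply(state, move), depth - 1,
                                             alpha, beta, False, legal_moves,
                                             apply, is_terminal, score))
                alpha = max(alpha, value)
                if alpha >= beta:
                    break              # beta cut-off: opponent avoids this branch
            return value
        value = float('inf')
        for move in legal_moves(state):
            value = min(value, alphabeta(apply(state, move), depth - 1,
                                         alpha, beta, True, legal_moves,
                                         apply, is_terminal, score))
            beta = min(beta, value)
            if alpha >= beta:
                break                  # alpha cut-off
        return value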

Page 18: Example: Reversi (Othello)

• Use simple score functions:
  – difference in pieces
  – edge pieces
  – corner pieces
• Use a max-min tree
• RL: optimize the weights.
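A hypothetical sketch of such a score function with a learnable weight vector; the features follow the bullets above, and the numpy board encoding is an assumption.

    import numpy as np

    def reversi_features(board):
        # board: 8x8 numpy array with 1 (our pieces), -1 (opponent), 0 (empty).
        piece_diff = np.sum(board == 1) - np.sum(board == -1)
        edges = np.concatenate([board[0, 1:-1], board[-1, 1:-1],
                                board[1:-1, 0], board[1:-1, -1]])
        edge_diff = np.sum(edges == 1) - np.sum(edges == -1)
        corners = board[[0, 0, -1, -1], [0, -1, 0, -1]]
        corner_diff = np.sum(corners == 1) - np.sum(corners == -1)
        return np.array([piece_diff, edge_diff, corner_diff], dtype=float)

    def reversi_score(board, weights):
        # Weighted sum of features; RL would tune the weight vector.
        return float(np.dot(weights, reversi_features(board)))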

Page 19: Advanced Issues

• Time constraints
  – fast and slow modes
• Opening
  – can help
• End game
  – many cases with few pieces
  – can be solved efficiently
• Training on a specific state
  – might be helpful; not sure it is worth the effort.

Page 20: What is Next?

• Create teams:
  – Choose a game!
• GUI for the game
  – Deadline: April 12, 2010
• System specification
  – Project outline
  – High-level component planning
  – Deadline: May 10, 2010

Page 21: Schedule (more)

• Build the system
• Project completion
  – Aug. 30, 2010
• All supporting documents in HTML!
• From next week:
  – Each group works by itself.
  – Feel free to contact us.