Soar-RL: Reinforcement Learning and Soar

Shelley Nason

Reinforcement Learning

Reinforcement learning: learning how to act so as to maximize the expected cumulative value of a (numeric) reward signal

In Soar terminology, RL learns operator comparison knowledge

A learning method for low-knowledge situations

Non-explanation-based, trial-and-error learning: RL does not require any model of operator effects to improve action choice.

Additional requirement: rewards.

Therefore the RL component should be automatic and general-purpose. Ultimately avoid:

  Task-specific hand-coding of features
  Hand-decomposed task or reward structure
  Programmer tweaking of learning parameters
  And so on

Q-values

Q(s,a): the expected discounted sum of future rewards, given that the agent takes action a from state s and follows a particular policy thereafter

Given optimal Q-function, selecting action with highest Q-value at each state yields optimal policy
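A minimal Python sketch of the two ideas on this slide: a discounted sum of rewards, and greedy action selection from a Q-table. The function names, the toy Q-table, and the discount value 0.9 are illustrative assumptions, not taken from Soar-RL.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


def greedy_action(q, state, actions):
    """Pick the action with the highest Q-value in the given state."""
    return max(actions, key=lambda action: q[(state, action)])


# Hypothetical Q-table for a single state with two actions.
q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
best = greedy_action(q, "s0", ["left", "right"])

# A reward of 1 received two steps in the future, discounted twice.
ret = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```

If the Q-table held the optimal Q-function, applying `greedy_action` at every state would give the optimal policy.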

[Figure: a trajectory alternating state, action, state, action, and so on; each action yields a reward, with successive rewards weighted as reward, λ*reward, etc.]

Representing the Q-function

In Soar-RL, Q-function stored as productions, testing state and operator, and asserting numeric preferences.

  sp {RL-rule
     (state <s> ^operator <o> +)
     …
     (<s> ^operator <o> = 0.33231)}

During decision phase, the Q-value of an operator O is taken to be the sum of all numeric preferences asserted for O.
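The summation can be sketched in Python as follows. Representing each asserted numeric preference as an (operator, value) pair is an assumption made for illustration; it is not how Soar stores preferences internally.

```python
def q_value(operator, numeric_preferences):
    """Q-value of an operator: the sum of all numeric preferences asserted for it."""
    return sum(value for op, value in numeric_preferences if op == operator)


# Hypothetical preferences asserted by three RL rules in one decision phase.
prefs = [("fill", 0.25), ("fill", 0.5), ("empty", -0.125)]

q_fill = q_value("fill", prefs)    # two rules fired for "fill"
q_empty = q_value("empty", prefs)  # one rule fired for "empty"
q_pour = q_value("pour", prefs)    # no rule fired, so the sum is empty
```

In this sketch an operator with no numeric preferences gets the value 0, simply because the sum is empty.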

Learning

Q-value represented as concatenation: State × Action → Set of Features → Value

Q-learning: Move the value of Q(s_t, a_t) toward r_t + γ * max_a Q(s_{t+1}, a).

Bootstrapping: Update the prediction at one step using the prediction at the next step.
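A hedged sketch of one Q-learning step when the Q-value is the sum of the weights of the rules that fired. The learning rate `alpha`, the even split of the correction among the fired rules, and all names are assumptions for illustration; the slides do not say how the update is divided.

```python
def q_of(fired_rules, weights):
    """Q-value as the sum of the weights of the rules that fired."""
    return sum(weights[rule] for rule in fired_rules)


def q_learning_update(weights, fired, reward, next_fired_by_action,
                      alpha=0.5, gamma=0.9):
    """Move Q(s_t, a_t) toward r_t + gamma * max_a Q(s_{t+1}, a)."""
    current = q_of(fired, weights)
    next_best = max(
        (q_of(rules, weights) for rules in next_fired_by_action.values()),
        default=0.0,  # no operators in the next state (e.g. terminal)
    )
    target = reward + gamma * next_best
    delta = alpha * (target - current)
    # Assumption: share the correction evenly among the rules that fired.
    for rule in fired:
        weights[rule] += delta / len(fired)
    return target


weights = {"RL-1": 0.0, "RL-2": 0.0, "RL-3": 1.0}
target = q_learning_update(
    weights,
    fired=["RL-1", "RL-2"],
    reward=0.0,
    next_fired_by_action={"pour": ["RL-3"], "empty": []},
)
```

The target uses the current prediction at the next step, which is the bootstrapping described above.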

Automatic feature generation

Q-learning (updating parameters of a linear function)

[Figure: one step of experience: state, action, reward, next state.]

Current work: automatic feature generation

Constructing rule conditions with which to associate values.

Since values are stored with RL rules, some RL rule must fire for every state-action pair for bootstrapping to work.

Enough distinctions are required that the agent will not confuse state-action pairs with significantly different Q-values.

Want rules that take advantage of opportunities for generalization.

Waterjug task: a reasonable set of rules

One for each state-action pair, for instance:

  sp {RL-0003
     :rl
     (state <s> ^jug <j1> ^jug <j2> ^operator <o> +)
     (<j1> ^volume 5 ^contents 5)
     (<j2> ^volume 3 ^contents 0)
     (<o> ^name fill ^jug <j2>)
  -->
     (<s> ^operator <o> = 0)}

46 RL rules

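[Figure: Waterjug state space, with states written as (contents of the 5-unit jug, contents of the 3-unit jug): (0,0), (0,3), (3,0), (3,3), (5,1), (0,1), (1,0), (1,3), (4,0), (4,3), (5,2), (0,2), (2,0), (2,3), (5,0), (5,3).]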
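To make "one rule per state-action pair" concrete, the following Python sketch enumerates the reachable Waterjug states and the operators applicable in each. The proposal conditions used here (fill a jug that is not full, empty a jug that is not empty, pour from a non-empty jug into one with free space) are assumptions; the slides do not give the actual proposal rules. No state is treated as terminal, so the number of pairs is larger than the 46 rules mentioned on the slide, whose count presumably reflects details not shown here.

```python
from collections import deque

CAPACITY = (5, 3)  # jug volumes used in the slides


def applicable_operators(state):
    """Operators assumed to be proposed in a state (contents of jug 0, jug 1)."""
    ops = []
    for j in (0, 1):
        other = 1 - j
        if state[j] < CAPACITY[j]:
            ops.append(("fill", j))
        if state[j] > 0:
            ops.append(("empty", j))
        if state[j] > 0 and state[other] < CAPACITY[other]:
            ops.append(("pour", j))
    return ops


def apply_operator(state, op):
    """Return the state that results from applying an operator."""
    name, j = op
    other = 1 - j
    contents = list(state)
    if name == "fill":
        contents[j] = CAPACITY[j]
    elif name == "empty":
        contents[j] = 0
    else:  # pour jug j into the other jug until one limit is hit
        moved = min(contents[j], CAPACITY[other] - contents[other])
        contents[j] -= moved
        contents[other] += moved
    return tuple(contents)


def reachable_pairs(start=(0, 0)):
    """Breadth-first enumeration of reachable states and state-action pairs."""
    seen = {start}
    queue = deque([start])
    pairs = []
    while queue:
        state = queue.popleft()
        for op in applicable_operators(state):
            pairs.append((state, op))
            nxt = apply_operator(state, op)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, pairs


states, pairs = reachable_pairs()
```

Under these assumptions the search finds the same 16 states that appear in the slide's state diagram.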

How to generate rules automatically?

A rule could be built from WM, for instance:

  sp {RL-1
     :rl
     (state <s> ^name waterjug ^jug <j1> ^jug <j2> ^operator <o4> + ^superstate nil ^type state)
     (<j1> ^contents 0 ^free 5 ^volume 5)
     (<j2> ^contents 0 ^free 3 ^volume 3)
     (<o4> ^name fill ^jug <j2>)
  -->
     (<s> ^operator <o4> = 0)}

But we want generalization…

[Figure: fragment of the state diagram showing (0,0) and the states (1,0), (1,3), (4,0).]

Adaptive representations

The system constructs the feature set so that there are more distinctions in the parts of state-action space that require more distinctions.

Specific-to-general: collect instances and cluster them according to similar values.

General-to-specific: add distinctions when an area with a single representation appears to contain multiple values.

General-to-specific

Our most general rules: rules made from operator proposals

  sp {RL-1
     :rl
     (state <s1> ^name waterjug ^jug <j1> ^operator <o1> +)
     (<j1> ^free 3)
     (<o1> ^jug <j1> ^name fill)
  -->
     (<s1> ^operator <o1> = 0)}

Only generated when no rule fires for the selected operator
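A rough Python sketch of this trigger: a general rule is created from the operator proposal's conditions only when no existing RL rule fires for the selected operator. The flat feature dictionaries and the helper names are hypothetical simplifications; real Soar rules match structured working memory, not flat dictionaries.

```python
def rule_matches(rule, situation):
    """A rule fires when every one of its condition features is present."""
    return all(situation.get(k) == v for k, v in rule["conditions"].items())


def ensure_rule_for(rules, situation, proposal_conditions):
    """Create a general RL rule only if no existing rule fires."""
    if any(rule_matches(rule, situation) for rule in rules):
        return None
    new_rule = {
        "name": f"RL-{len(rules) + 1}",
        "conditions": dict(proposal_conditions),  # copied from the proposal
        "value": 0.0,                             # initial numeric preference
    }
    rules.append(new_rule)
    return new_rule


rules = []
proposal = {"operator": "fill", "jug_free": 3}

# First time this operator is selected: no rule fires, so one is generated.
first = ensure_rule_for(
    rules, {"operator": "fill", "jug_free": 3, "other_contents": 0}, proposal)

# A different state with the same proposal: the general rule already fires.
second = ensure_rule_for(
    rules, {"operator": "fill", "jug_free": 3, "other_contents": 5}, proposal)
```

Because the generated rule tests only the proposal's conditions, it covers every state in which that proposal applies, which is what makes it maximally general.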


If jug has contents 3, then pour from jug into other jug.

Specialization: example of an overgeneral representation

[Figure: predicted Q-value at the state (3,0) versus update number, shown for updates 1 to 10 and again for updates 1 to 2041, with "Goal achieved" annotations.]

How to fix: add the following rule.

If there is a jug with volume 3 and contents 3, pour this jug into the other jug.

[Figure, same slide: Waterjug state-space diagram and a plot of Q-values at (3,0) versus update number (updates 1 to 175).]
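The effect of an overgeneral rule, and of adding a more specific one alongside it, can be illustrated with a toy Python simulation. It is not a reproduction of the Waterjug experiment: the two state-action pairs, their fixed targets (1.0 and 0.0), the learning rate, and the even split of each correction among fired rules are all assumptions.

```python
def run(rounds, use_specific_rule, alpha=0.5):
    """Alternate updates for pair A (target 1.0) and pair B (target 0.0).

    The general rule fires for both pairs. The specific rule, if present,
    fires only for pair A. Returns the prediction for A after each round.
    """
    general = 0.0
    specific = 0.0
    predictions_for_a = []
    for _ in range(rounds):
        # Pair A: the general rule fires, plus the specific rule if present.
        fired = 2 if use_specific_rule else 1
        error = 1.0 - (general + specific)
        general += alpha * error / fired
        if use_specific_rule:
            specific += alpha * error / fired
        # Pair B: only the general rule fires.
        error = 0.0 - general
        general += alpha * error
        predictions_for_a.append(general + specific)
    return predictions_for_a


without_rule = run(200, use_specific_rule=False)
with_rule = run(200, use_specific_rule=True)
```

With only the general rule, the single weight is pulled back and forth between the two targets and the prediction for A never reaches 1.0. With the specific rule added (and the general rule kept), the summed prediction for A settles at its target.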

Designing a specialization procedure

1. How to decide whether to specialize a given rule.
2. Given that we have chosen to specialize a rule, what conditions should we add to the rule?
3. (optional) In what, if any, cases should a rule be eliminated?

Question 2 (What conditions to add to a rule): proposed answer

Trying an activation-based scheme.

When an (instantiated) rule R decides to specialize, it finds the most activated WME, w = (ID ATTR VALUE).

It traces upward through WM to find a shortest path from ID to some identifier in the rule's instantiation.

w and the WMEs in the trace are added to the conditions of R to form a new rule R'.

If R' is not a duplicate of some existing rule, R' is added to the Rete (without removing R).
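The steps above can be sketched in Python under strong simplifying assumptions: working memory is a list of (ID, ATTR, VALUE) triples, activation is a plain dictionary, rule conditions are concrete triples rather than variablized patterns, and a set of frozensets stands in for the Rete. All names are hypothetical.

```python
from collections import deque


def specialize(rule_conditions, bound_ids, wm, activation):
    """Build the conditions of R' from R, or return None if no path exists."""
    # Step 1: the most activated WME w = (ID, ATTR, VALUE).
    w = max(wm, key=lambda wme: activation[wme])
    # Index WMEs by the identifier they point at, to walk upward.
    parents = {}
    for wme in wm:
        parents.setdefault(wme[2], []).append(wme)
    # Step 2: breadth-first search upward gives a shortest path from ID
    # to some identifier bound in the rule's instantiation.
    queue = deque([(w[0], [])])
    seen = {w[0]}
    while queue:
        ident, path = queue.popleft()
        if ident in bound_ids:
            # Step 3: add the trace and w to R's conditions.
            new_conditions = list(rule_conditions)
            for cond in path + [w]:
                if cond not in new_conditions:
                    new_conditions.append(cond)
            return new_conditions
        for parent in parents.get(ident, []):
            if parent[0] not in seen:
                seen.add(parent[0])
                queue.append((parent[0], [parent] + path))
    return None


def add_if_new(rule_set, conditions):
    """Step 4: keep R' only if it does not duplicate an existing rule."""
    key = frozenset(conditions)
    if key in rule_set:
        return False
    rule_set.add(key)
    return True


wm = [
    ("S1", "jug", "J1"), ("S1", "jug", "J2"),
    ("J1", "contents", 3), ("J1", "volume", 5), ("J2", "contents", 0),
    ("S1", "operator", "O1"), ("O1", "name", "pour"),
]
activation = {wme: 0.0 for wme in wm}
activation[("J1", "contents", 3)] = 5.0  # hypothetical activation level

r_conditions = [("S1", "operator", "O1"), ("O1", "name", "pour")]
r_prime = specialize(r_conditions, {"S1", "O1"}, wm, activation)

rule_set = {frozenset(r_conditions)}  # R stays in place
added_first = add_if_new(rule_set, r_prime)
added_again = add_if_new(rule_set, r_prime)
```

Breadth-first search is used here because it returns a shortest upward path; the slides do not say which search the implementation uses.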

Question 1 (Should a given rule be specialized): proposed answer

Track the weights (numeric preferences) of rules.

Weights should converge when the rules are sufficient, so stop specializing when the weights stop moving.
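One possible reading of "weights stop moving" is that a rule's weight stays within a small band over its last few updates. The sketch below implements that test; the window size, the tolerance, and the class name are assumptions, and the slides do not specify the actual criterion.

```python
from collections import deque


class WeightTracker:
    """Track recent values of one rule's weight and test for convergence."""

    def __init__(self, window=5, tolerance=0.01):
        self.recent = deque(maxlen=window)  # keeps only the last `window` values
        self.tolerance = tolerance

    def record(self, weight):
        self.recent.append(weight)

    def has_converged(self):
        """True once a full window of weights spans at most `tolerance`."""
        if len(self.recent) < self.recent.maxlen:
            return False
        return max(self.recent) - min(self.recent) <= self.tolerance


tracker = WeightTracker(window=3, tolerance=0.01)
for weight in [0.0, 0.5, 0.75]:
    tracker.record(weight)
still_moving = not tracker.has_converged()  # keep specializing

for weight in [0.8, 0.801, 0.802]:
    tracker.record(weight)
settled = tracker.has_converged()           # stop specializing
```

A rule whose tracker reports convergence would no longer be a candidate for specialization under this scheme.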

[Figures: tracking weights of rule RL-3; # steps in run; total # RL rules by run.]

Conclusions

Nuggets: Did work (at least on Waterjug). That is, came to follow the optimal policy.

Coal: Makes too many rules.

1. Specializes rules that don't need specialization.
2. Specializations are not always useful, since they are chosen heuristically.
3. Will try to explain non-determinism by making rules.
