The ideals vs. reality of science
• The pursuit of verifiable answers vs. highly cited papers for your c.v.
• The validation of our results by reproduction vs. convincing referees who did not see your code or data
• An altruistic, collective enterprise vs. a race to outrun your colleagues in front of the giant bear of grant funding
Credit: Fernando Pérez @ PyCon 2014
• Voting on topics
• Exercise 4
• Final Project
• Advanced RL class
• Feedback on class presentations
Inter-Task Transfer
• Learning tabula rasa can be unnecessarily slow
• Humans can use past information
– Soccer with different numbers of players
• Agents can leverage learned knowledge in novel tasks
– Bias learning: a speedup method
Primary Questions
[Figure: a source task (S_SOURCE, A_SOURCE) mapped to a target task (S_TARGET, A_TARGET)]
• Is it possible to transfer learned knowledge?
• Is it possible to transfer without providing a task mapping?
• We only consider reinforcement learning tasks
– There is lots of work in supervised settings
Value Function Transfer
ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T)
• The action-value function is transferred
• ρ is task-dependent: it relies on inter-task mappings
• Q_S is not defined on S_T and A_T
[Figure: source-task agent-environment loop (State_S, Action_S, Reward_S) with Q_S : S_S × A_S → ℜ, connected by ρ to the target-task agent-environment loop (State_T, Action_T, Reward_T) with Q_T : S_T × A_T → ℜ]
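A minimal sketch of what ρ computes in the tabular case, assuming the inter-task mappings χ_x and χ_A introduced on the next slide (the actual agents use CMAC function approximation rather than a table, and all names here are illustrative):

    # Sketch: tabular value function transfer via inter-task mappings.
    # chi_x / chi_A map target states/actions back to source states/actions.
    def transfer_q(q_source, target_states, target_actions, chi_x, chi_A):
        """Initialize a target Q-table from a learned source Q-table."""
        q_target = {}
        for s_t in target_states:
            for a_t in target_actions:
                # rho: evaluate the source Q-function at the mapped (state, action)
                q_target[(s_t, a_t)] = q_source.get((chi_x(s_t), chi_A(a_t)), 0.0)
        return q_target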
Inter-Task Mappings
• χ_x : s_target → s_source
– Given a state variable in the target task (some x from s = x_1, x_2, …, x_n)
– Returns the corresponding state variable in the source task
• χ_A : a_target → a_source
– Similar, but for actions
• Intuitive mappings exist in some domains (oracle)
• Used to construct the transfer functional ρ
[Figure: χ_x and χ_A map the target task's state variables ⟨x_1 … x_n⟩ and actions {a_1 … a_m} (S_TARGET, A_TARGET) onto the source task's state variables ⟨x_1 … x_k⟩ and actions {a_1 … a_j} (S_SOURCE, A_SOURCE)]
Example
• Cancer
• Castle attack
Credit: Lazaric
• Transfer can
– Reduce the need for instances in the target task
– Reduce the need for domain expert knowledge
• What changes between tasks
– State space, state features
– A, T, R, goal state
– Learning method
[Figure: performance vs. time learning curves against a threshold of 8.5. Legend: target with no transfer ("Task 2 from scratch"), target with transfer, and target + source with transfer]
Transfer Evaluation Metrics
Set a threshold performance that the majority of agents can achieve with learning. Two distinct scenarios:
1. Target Time Metric: successful if target-task learning time is reduced
– The "sunk cost" of source-task training is ignored
– Source task(s) are independently useful
– AI goal: effectively utilize past knowledge
2. Total Time Metric: successful if total (source + target) time is reduced
– We only care about the target; source task(s) are not independently useful
– Engineering goal: minimize total training
• The previous examples measured learning speed improvement; transfer can also give
– Asymptotic improvement
– Jumpstart improvement
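A small sketch of how the two time-to-threshold metrics could be computed from logged learning curves (the helper names and curve format are assumptions; the experiments below measure time in simulator hours):

    # Sketch: time-to-threshold metrics for transfer evaluation.
    # curve: list of (elapsed_hours, performance) pairs, ordered by time.
    def time_to_threshold(curve, threshold=8.5):
        """Return the first time at which performance reaches the threshold."""
        for t, perf in curve:
            if perf >= threshold:
                return t
        return float("inf")  # threshold never reached

    def target_time_ok(target_curve, no_transfer_curve, threshold=8.5):
        # Target Time Metric: source-task training is treated as a sunk cost.
        return (time_to_threshold(target_curve, threshold)
                < time_to_threshold(no_transfer_curve, threshold))

    def total_time_ok(source_hours, target_curve, no_transfer_curve, threshold=8.5):
        # Total Time Metric: source-task training time counts against transfer.
        return (source_hours + time_to_threshold(target_curve, threshold)
                < time_to_threshold(no_transfer_curve, threshold))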
[Figure: Value Function Transfer — time to threshold in 4 vs. 3. X-axis: number of 3 vs. 2 episodes (0 to 6000); y-axis: simulator hours (0 to 35). Bars show avg. 3 vs. 2 time plus avg. 4 vs. 3 time; this total time is compared against the no-transfer target-task time]
Results: Scaling up to 5 vs. 4

# 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
0 | 0 | 18.01 | 18.01
1000 | 1.86 | 13.12 | 14.98

• All results statistically significant when compared to no transfer

# 3 vs. 2 episodes | 3 vs. 2 time (hrs) | # 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
500 | 1.05 | 500 | 0.99 | 6.46 | 8.50
1000 | 2.14 | 0 | 0 | 6.99 | 9.13

• Best total time (8.5 hrs) is 47% of the no-transfer time (18.01 hrs)
Problem Statement
• Humans can select a training sequence
– Results in faster training / better performance
• A meta-planning problem for agent learning
– Known vs. unknown final task?
[Figure: a sequence of MDPs leading to a final, possibly unknown, MDP]
• TL vs. multi-task vs. lifelong learning vs. generalization vs. concept drift
• Learning from Easy Missions
• Changing the length of the cart pole
Keepaway [Stone, Sutton, and Kuhlmann 2005]
[Figure: keepers K1, K2, K3 and takers T1, T2 on the field]
• Goal: maintain possession of the ball
• Both takers move towards the player with the ball
• The keeper with the ball may hold it or pass to either teammate
• 3 vs. 2: 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
• 4 vs. 3: 7 agents, 4 actions, 19 state variables
Learning Keepaway
• Sarsa update
– CMAC, RBF, and neural network approximation all successful
• Q^π(s, a): predicted number of steps the episode will last
– Reward = +1 for every timestep
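A minimal tabular Sarsa sketch under this reward scheme (+1 per timestep, so Q estimates how long the episode will continue); the environment interface is hypothetical, and the real agents use CMAC/RBF/neural-network approximation rather than a table:

    # Sketch: tabular Sarsa with reward +1 per timestep (hypothetical env API).
    import random
    from collections import defaultdict

    def sarsa_episode(env, Q, alpha=0.1, gamma=1.0, epsilon=0.1):
        """Run one on-policy Sarsa episode; Q maps (state, action) -> value."""
        def pick(s):  # epsilon-greedy action selection
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(s, a)])

        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            s2, done = env.step(a)  # keepaway pays reward +1 every timestep
            r = 1.0
            if done:
                Q[(s, a)] += alpha * (r - Q[(s, a)])
            else:
                a2 = pick(s2)
                Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
                s, a = s2, a2

    # Usage: Q = defaultdict(float); call sarsa_episode(env, Q) repeatedly.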
ρ's Effect on CMACs
[Figure: 3 vs. 2 and 4 vs. 3 CMAC tilings]
• For each weight in the 4 vs. 3 function approximator:
– Use the inter-task mapping to find the corresponding 3 vs. 2 weight and copy its value
Keepaway Hand-coded χ_A
• Hold_4v3 → Hold_3v2
• Pass1_4v3 → Pass1_3v2
• Pass2_4v3 → Pass2_3v2
• Pass3_4v3 → Pass2_3v2
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2
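This mapping is small enough to write out directly; a sketch (the string labels are my own naming):

    # Sketch: hand-coded chi_A from 4 vs. 3 actions to 3 vs. 2 actions.
    CHI_A = {
        "Hold_4v3":  "Hold_3v2",
        "Pass1_4v3": "Pass1_3v2",
        "Pass2_4v3": "Pass2_3v2",
        "Pass3_4v3": "Pass2_3v2",  # no Pass3 in 3 vs. 2: map to the most similar pass
    }

    def chi_A(a_target):
        return CHI_A[a_target]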
Example Transfer Domains
• Series of mazes with different goals [Fernandez and Veloso, 2006]
• Mazes with different structures [Konidaris and Barto, 2007]
• Keepaway with different numbers of players [Taylor and Stone, 2005]
• Keepaway to Breakaway [Torrey et al., 2005]
All of these tasks are drawn from the same domain
• Task: an MDP
• Domain: a setting for semantically similar tasks
What about cross-domain transfer?
• The source task could be much simpler
• Shows that the source and target can be less similar
Source Task: Ringworld
Ringworld (goal: avoid being tagged)
• 2 agents, 3 actions, 7 state variables
• Fully observable, discrete state space (Q-table with ~8,100 (s, a) pairs)
• Stochastic actions
• The opponent moves directly towards the player
• The player may stay or run towards a pre-defined location
vs. 3 vs. 2 Keepaway (goal: maintain possession of the ball)
[Figure: keepers K1, K2, K3 and takers T1, T2]
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space
• Stochastic actions
Source Task: Knight's Joust
Knight's Joust (goal: travel from the start to the goal line)
• 2 agents, 3 actions, 3 state variables
• Fully observable, discrete state space (Q-table with ~600 (s, a) pairs)
• Deterministic actions
• The opponent moves directly towards the player
• The player may move North, or take a knight's jump to either side
vs. 3 vs. 2 Keepaway (goal: maintain possession of the ball)
[Figure: keepers K1, K2, K3 and takers T1, T2]
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space
• Stochastic actions
Rule Transfer Overview
1. Learn a policy (π : S → A) in the source task
– TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π
3. Translate D_source → D_target (applies to the target task)
– State variables and actions can differ between the two tasks
4. Use D_target to learn a policy in the target task
• Allows for different learning methods and function approximators in the source and target tasks
Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target
Rule Transfer Details: Learn π
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
[Figure: source-task agent-environment loop (Action, State, Reward)]
• In this work we use Sarsa
– Q : S × A → Return
• Other learning methods are possible
Rule Transfer Details: Learn D_source
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
• Use the learned policy to record (State, Action) pairs
[Figure: the agent's visited states and chosen actions are logged into a (State, Action) table]
• Use JRip (RIPPER in Weka) to learn a decision list from them:
• IF s1 < 4 and s2 > 5 → a1
• ELSEIF s1 < 3 → a2
• ELSEIF s3 > 7 → a1
• …
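A sketch of the recording step and of what a learned decision list computes (the environment and policy interfaces are stand-ins; in the paper the list itself is produced by JRip, not written by hand):

    # Sketch: record (state, action) pairs from a learned source policy,
    # then represent a resulting decision list as a plain function.
    def record_pairs(env, policy, n_episodes=100):
        pairs = []
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                a = policy(s)         # action from the learned source policy
                pairs.append((s, a))  # training data for the rule learner (JRip)
                s, done = env.step(a)
        return pairs

    def d_source(s):
        """A decision list like the one on the slide above."""
        s1, s2, s3 = s
        if s1 < 4 and s2 > 5:
            return "a1"
        elif s1 < 3:
            return "a2"
        elif s3 > 7:
            return "a1"
        return "a_default"  # final catch-all rule (assumed)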
Rule Transfer Details: Translate(D_source) → D_target
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
• Inter-task mappings
• χ_x : s_target → s_source
– Given a state variable in the target task (some x from s = x_1, x_2, …, x_n)
– Returns the corresponding state variable in the source task
• χ_A : a_target → a_source
– Similar, but for actions
[Figure: each rule is translated into rule′ via χ_x and χ_A]
Rule Transfer Details: Translate(D_source) → D_target (example)
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
[Figure: keepers K1, K2, K3 and takers T1, T2]
χ_A: Stay ↔ Hold Ball; RunNear ↔ Pass to K2; RunFar ↔ Pass to K3
χ_x: dist(Player, Opponent) ↔ dist(K1, T1); …
Example:
IF dist(Player, Opponent) > 4 → Stay
becomes
IF dist(K1, T1) > 4 → Hold Ball
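A sketch of this translation step for rules of the simple comparison form above (the rule representation and the mapping tables are my own simplification of the Ringworld → 3 vs. 2 keepaway example):

    # Sketch: translate a source-task rule into a target-task rule by
    # rewriting its state variable and action through the inter-task mappings.
    VAR_MAP = {"dist(Player, Opponent)": "dist(K1, T1)"}  # source var -> target var
    ACT_MAP = {"Stay": "Hold Ball",                       # source action -> target action
               "RunNear": "Pass to K2",
               "RunFar": "Pass to K3"}

    def translate_rule(rule):
        """rule = (variable, comparator, threshold, action)."""
        var, cmp_op, thr, act = rule
        return (VAR_MAP[var], cmp_op, thr, ACT_MAP[act])

    # IF dist(Player, Opponent) > 4 -> Stay  becomes
    # IF dist(K1, T1) > 4 -> Hold Ball
    print(translate_rule(("dist(Player, Opponent)", ">", 4, "Stay")))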
Rule Transfer Details: Use D_target
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
• Many possible ways to use D_target:
– Value Bonus (shaping)
– Extra Action (initially force the agent to select it)
– Extra Variable (initially force the agent to select it)
• Assuming a TD learner in the target task
– Should generalize to other learning methods

Value Bonus: evaluate the agent's 3 actions in state s = s1, s2 and add a bonus to the action the rules suggest, D_target(s) = a2:
Q(s1, s2, a1) = 5
Q(s1, s2, a2) = 3 + 8
Q(s1, s2, a3) = 4

Extra Action: add a pseudo-action that executes the suggested action:
Q(s1, s2, a1) = 5
Q(s1, s2, a2) = 3
Q(s1, s2, a3) = 4
Q(s1, s2, a4) = 7 (take action a2)

Extra Variable: append the suggested action to the state (here s3 = D_target(s) = a2):
Q(s1, s2, a2, a1) = 5
Q(s1, s2, a2, a2) = 9
Q(s1, s2, a2, a3) = 4
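A sketch of how the three schemes modify a TD learner's action selection (the Q-table lookup and bonus value are illustrative):

    # Sketch: three ways to use D_target during target-task learning.
    def q_with_bonus(Q, s, actions, d_target, bonus=8.0):
        """Value Bonus (shaping): boost the rule-suggested action's value."""
        return {a: Q[(s, a)] + (bonus if a == d_target(s) else 0.0) for a in actions}

    def actions_with_extra(actions):
        """Extra Action: add a pseudo-action that executes D_target's suggestion."""
        return actions + ["follow_rules"]  # resolved to d_target(s) when executed

    def augmented_state(s, d_target):
        """Extra Variable: append the suggested action to the state."""
        return s + (d_target(s),)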
Results: Extra Action
[Figure: target-task learning curves. Legend: without transfer; Ringworld (20,000 source episodes); Knight's Joust (50,000 source episodes)]
Keepaway Transfer Learning Results

Source | Initial Performance | Asymptotic Performance | Accumulated Reward
No Transfer | 7.8 | 21.6 | 756.7
Ringworld | 11.9 | 23.0 | 842.0
Knight's Joust | 13.8 | 21.8 | 758.5