The ideals vs. the reality of science
• Ideal: the pursuit of verifiable answers. Reality: highly cited papers for your c.v.
• Ideal: the validation of our results by reproduction. Reality: convincing referees who did not see your code or data.
• Ideal: an altruistic, collective enterprise. Reality: a race to outrun your colleagues in front of the giant bear of grant funding.
Credit: Fernando Pérez @ PyCon 2014
• Voting on topics
• Exercise 4
• Final Project
• Advanced RL class
• Feedback on class presentations
Inter-Task Transfer
• Learning tabula rasa can be unnecessarily slow
• Humans can use past information
– e.g., soccer with different numbers of players
• Agents can leverage learned knowledge in novel tasks
– Bias learning: a speedup method
Primary Questions
Source task: S_SOURCE, A_SOURCE → Target task: S_TARGET, A_TARGET
• Is it possible to transfer learned knowledge?
• Is it possible to transfer without providing a task mapping?
• We only consider reinforcement learning tasks
– There is also lots of work in supervised settings
Value Function Transfer
ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T)

• The action-value function is transferred
• ρ is task-dependent: it relies on inter-task mappings
• Q_S is not defined on S_T and A_T

[Diagram: two agent–environment loops, one for the source task (state_S, action_S, reward_S) and one for the target task (state_T, action_T, reward_T); ρ maps Q_S: S_S × A_S → ℝ to Q_T: S_T × A_T → ℝ]
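The transfer functional ρ can be pictured as copying source Q-values through the inter-task mappings defined on the next slide. Below is a minimal sketch for the tabular case; all names (q_source, chi_x, chi_a) are illustrative, not from the original work, and for simplicity χ is applied to whole states here, whereas the slides define it per state variable.

```python
# A minimal sketch of the transfer functional rho for tabular Q-functions
# stored as dicts. chi_x maps a target state to a source state; chi_a maps
# a target action to a source action. Illustrative only.

def rho(q_source, target_states, target_actions, chi_x, chi_a):
    """Initialize a target-task Q-function from a learned source-task one."""
    q_target = {}
    for s_t in target_states:
        for a_t in target_actions:
            # Look up the value of the corresponding source (state, action).
            s_s, a_s = chi_x(s_t), chi_a(a_t)
            q_target[(s_t, a_t)] = q_source.get((s_s, a_s), 0.0)
    return q_target
```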
Inter-Task Mappings
• χ_x: s_target → s_source
– Given a state variable in the target task (some x from s = x_1, x_2, …, x_n)
– Return the corresponding state variable in the source task
• χ_A: a_target → a_source
– Similar, but for actions
• Intuitive mappings exist in some domains (oracle)
• Used to construct the transfer functional
[Diagram: the target task's state variables ⟨x_1 … x_n⟩ and actions {a_1 … a_m} are mapped by χ_x and χ_A to the source task's state variables ⟨x_1 … x_k⟩ and actions {a_1 … a_j}]
Example
• Cancer
• Castle attack
Credit: Lazaric

• Transfer can
– Reduce the need for instances in the target task
– Reduce the need for domain-expert knowledge
• What can change between tasks
– State space, state features
– A, T, R, goal state
– Learning method
[Plots: performance over time against a threshold of 8.5, comparing "Target: no transfer", "Target: with transfer", and "Target + Source: with transfer" (i.e., Task 2 from scratch, Task 2 with transfer, Task 1 + Task 2 with transfer)]
Transfer Evaluation Metrics
Set a threshold performance that the majority of agents can achieve with learning. Two distinct scenarios:
1. Target Time Metric: successful if target-task learning time is reduced
– The "sunk cost" of source training is ignored; only the target matters
– Appropriate when the source task(s) are independently useful
– AI goal: effectively utilize past knowledge
2. Total Time Metric: successful if the total (source + target) time is reduced
– Appropriate when the source task(s) are not useful in themselves
– Engineering goal: minimize total training
• The above measures "learning speed improvement"; transfer can also yield
– Asymptotic improvement
– Jumpstart improvement
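As a concrete reading of the two metrics, here is an illustrative sketch, assuming learning curves are given as lists of (time, performance) pairs; the function names, threshold default, and representation are my own assumptions, not from the slides.

```python
# Illustrative computation of the two time-to-threshold metrics.

def time_to_threshold(curve, threshold=8.5):
    """Return the first time at which performance reaches the threshold."""
    for t, perf in curve:
        if perf >= threshold:
            return t
    return float('inf')  # threshold never reached

def target_time_metric(target_curve, threshold=8.5):
    # Scenario 1: only target-task time counts ("sunk cost" ignored).
    return time_to_threshold(target_curve, threshold)

def total_time_metric(source_time, target_curve, threshold=8.5):
    # Scenario 2: source training time is charged as well.
    return source_time + time_to_threshold(target_curve, threshold)
```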
[Figure: "Value Function Transfer: Time to threshold in 4 vs. 3". x-axis: # of 3 vs. 2 episodes (0 to 6000); y-axis: simulator hours (0 to 35). Bars show avg. 3 vs. 2 time plus avg. 4 vs. 3 time; this total time is compared against the no-transfer target-task time.]
Results: Scaling up to 5 vs. 4

# 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
0                  | 0                  | 18.01              | 18.01
1000               | 1.86               | 13.12              | 14.98

# 3 vs. 2 episodes | 3 vs. 2 time (hrs) | # 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
500                | 1.05               | 500                | 0.99               | 6.46               | 8.5
1000               | 2.14               | 0                  | 0                  | 6.99               | 9.13

• Best total time (8.5 hrs) is 47% of the no-transfer time (18.01 hrs)
• All results are statistically significant when compared to no transfer
Problem Statement
• Humans can select a training sequence
– Results in faster training / better performance
• A meta-planning problem for agent learning
– Known vs. unknown final task?

[Diagram: sequences of MDPs leading to a final task, which may be known or unknown ("?")]

• TL vs. multi-task vs. lifelong learning vs. generalization vs. concept drift
• Examples: Learning from Easy Missions; changing the length of the cart pole
Keepaway [Stone, Sutton, and Kuhlmann 2005]

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]

3 vs. 2:
• Goal: maintain possession of the ball
• 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
• The keeper with the ball may hold it or pass to either teammate
• Both takers move towards the player with the ball

4 vs. 3:
• 7 agents, 4 actions, 19 state variables
Learning Keepaway
• Sarsa update
– CMAC, RBF, and neural network approximation all successful
• Q^π(s,a): predicted number of steps the episode will last
– Reward = +1 for every timestep
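For reference, here is a minimal tabular Sarsa backup, which is what the update above amounts to; the slides use CMAC/RBF/neural-net function approximation rather than a table, and the step size default below is my own illustrative choice. With reward +1 per timestep and γ = 1, Q(s, a) estimates the remaining episode length.

```python
# One on-policy TD (Sarsa) backup on a dict-based Q-table:
#   Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    q_sa = Q.get((s, a), 0.0)
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)
```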
ρ's Effect on CMACs
• For each weight in the 4 vs. 3 function approximator:
– Use the inter-task mapping to find the corresponding 3 vs. 2 weight

[Diagram: 4 vs. 3 CMAC weights mapped onto 3 vs. 2 CMAC weights]
Keepaway Hand-coded χ_A
• Hold_4v3 → Hold_3v2
• Pass1_4v3 → Pass1_3v2
• Pass2_4v3 → Pass2_3v2
• Pass3_4v3 → Pass2_3v2
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2, as in the sketch below.
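The hand-coded action mapping on this slide is small enough to write down directly; the identifier spellings below are mine, but the correspondences are exactly those listed above.

```python
# Hand-coded chi_A: 4 vs. 3 actions -> "similar" 3 vs. 2 actions.
chi_A = {
    'Hold_4v3':  'Hold_3v2',
    'Pass1_4v3': 'Pass1_3v2',
    'Pass2_4v3': 'Pass2_3v2',
    'Pass3_4v3': 'Pass2_3v2',  # no Pass3 in 3 vs. 2, so Pass2 is reused
}
```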
Example Transfer Domains
• Series of mazes with different goals [Fernandez and Veloso, 2006]
• Mazes with different structures [Konidaris and Barto, 2007]
• Keepaway with different numbers of players [Taylor and Stone, 2005]
• Keepaway to Breakaway [Torrey et al., 2005]

All of these tasks are drawn from the same domain
– Task: an MDP
– Domain: a setting for semantically similar tasks

What about cross-domain transfer?
– The source task could be much simpler
– Shows that source and target can be less similar
Source Task: Ringworld

Ringworld (goal: avoid being tagged)
• 2 agents, 3 actions, 7 state variables
• Fully observable, discrete state space (Q-table with ~8,100 (s,a) pairs), stochastic actions
• Opponent moves directly towards the player
• Player may stay or run towards a pre-defined location

3 vs. 2 Keepaway (goal: maintain possession of the ball)
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space, stochastic actions

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]
Source Task: Knight's Joust

Knight's Joust (goal: travel from start to the goal line)
• 2 agents, 3 actions, 3 state variables
• Fully observable, discrete state space (Q-table with ~600 (s,a) pairs), deterministic actions
• Opponent moves directly towards the player
• Player may move North, or take a knight's jump to either side

3 vs. 2 Keepaway (goal: maintain possession of the ball)
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space, stochastic actions

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]
Rule Transfer Overview
1. Learn a policy (π : S → A) in the source task
– TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π
3. Translate D_source → D_target (so it applies to the target task)
– State variables and actions can differ between the two tasks
4. Use D_target to learn a policy in the target task

This allows different learning methods and function approximators in the source and target tasks.

Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target
Rule Transfer Details: Learn π
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• In this work we use Sarsa
– Q : S × A → Return
• Other learning methods are possible

[Diagram: agent–environment loop (action, state, reward) in the source task]
Rule Transfer Details: Learn D_source
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Use the learned policy to record (state, action) pairs
• Use JRip (RIPPER in Weka) to learn a decision list, e.g.:
IF s1 < 4 and s2 > 5 → a1
ELSEIF s1 < 3 → a2
ELSEIF s3 > 7 → a1
…
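A sketch of this step, with scikit-learn's decision tree standing in for JRip/RIPPER (the slides use Weka); the function name and the max_depth choice are my own illustrative assumptions, not the authors' setup.

```python
# Record (state, action) pairs from the learned policy, then fit a small,
# interpretable classifier to them as a rule-like summary of pi.

from sklearn.tree import DecisionTreeClassifier

def summarize_policy(policy, states):
    """Record (s, pi(s)) pairs and fit a compact summary of pi."""
    X = [list(s) for s in states]   # state variables as features
    y = [policy(s) for s in states] # action chosen by the learned policy
    d_source = DecisionTreeClassifier(max_depth=4)  # small => readable rules
    d_source.fit(X, y)
    return d_source
```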
Rule Transfer Details: Translate(D_source) → D_target
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Inter-task mappings
– χ_x: s_target → s_source: given a state variable in the target task (some x from s = x_1, x_2, …, x_n), return the corresponding state variable in the source task
– χ_A: a_target → a_source: similar, but for actions
• Each rule is translated via χ_x and χ_A: rule → rule′
Rule Transfer Details: Translate (example)

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]

Correspondences (via χ_x and χ_A):
Stay ↔ Hold Ball
RunNear ↔ Pass to K2
RunFar ↔ Pass to K3
dist(Player, Opponent) ↔ dist(K1, T1)
… ↔ …

Example translation:
IF dist(Player, Opponent) > 4 → Stay
becomes
IF dist(K1, T1) > 4 → Hold Ball
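A sketch of the rule translation, assuming rules are (variable, operator, threshold, action) tuples. Note the mappings are inverted here because χ_x and χ_A run target → source, while translation rewrites source rules in target terms; the tuple representation and names are illustrative.

```python
# Inverted mappings (source term -> target term), taken from the slide.
inv_chi_x = {'dist(Player, Opponent)': 'dist(K1,T1)'}
inv_chi_A = {'Stay': 'Hold Ball',
             'RunNear': 'Pass to K2',
             'RunFar': 'Pass to K3'}

def translate_rule(rule):
    """Rewrite one source-task rule in target-task terms."""
    var, op, threshold, action = rule
    return (inv_chi_x[var], op, threshold, inv_chi_A[action])

# IF dist(Player, Opponent) > 4 -> Stay   becomes
# IF dist(K1,T1) > 4 -> Hold Ball
print(translate_rule(('dist(Player, Opponent)', '>', 4, 'Stay')))
```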
Rule Transfer Details: Use D_target
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Many possible ways to use D_target:
– Value Bonus (shaping)
– Extra Action (initially force the agent to select it)
– Extra Variable (initially force the agent to select it)
• Assuming a TD learner in the target task
– Should generalize to other learning methods

Example: evaluate the agent's 3 actions in state s = s1, s2, where D_target(s) = a2:
• Value Bonus: add a bonus (+8) to the action recommended by D_target:
Q(s1, s2, a1) = 5, Q(s1, s2, a2) = 3 + 8, Q(s1, s2, a3) = 4
• Extra Action: add a fourth action meaning "take the action D_target(s)":
Q(s1, s2, a1) = 5, Q(s1, s2, a2) = 3, Q(s1, s2, a3) = 4, Q(s1, s2, a4) = 7 (take action a2)
• Extra Variable: add D_target(s) as an extra state variable (here s3 = a2):
Q(s1, s2, a2, a1) = 5, Q(s1, s2, a2, a2) = 9, Q(s1, s2, a2, a3) = 4
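A minimal sketch of the Value Bonus variant, assuming a dict-based Q-table; the bonus size (8) comes from the example above, while the function and parameter names are illustrative, not from the slides.

```python
# Value Bonus: when evaluating actions in the target task, add a fixed
# bonus to the action the translated decision list recommends.

def action_values_with_bonus(Q, s, actions, d_target, bonus=8.0):
    recommended = d_target(s)  # e.g. a2 in the slide's example
    return {a: Q.get((s, a), 0.0) + (bonus if a == recommended else 0.0)
            for a in actions}
```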
Results: Extra Action

[Learning curves: target-task performance with and without transfer. Source-task training: Ringworld, 20,000 episodes; Knight's Joust, 50,000 episodes]
Keepaway Transfer Learning Results

Source         | Initial Performance | Asymptotic Performance | Accumulated Reward
No Transfer    | 7.8                 | 21.6                   | 756.7
Ringworld      | 11.9                | 23.0                   | 842.0
Knight's Joust | 13.8                | 21.8                   | 758.5