The ideals → reality of science


TRANSCRIPT

Page 1:

The ideals → reality of science

• The pursuit of verifiable answers → highly cited papers for your CV
• The validation of our results by reproduction → convincing referees who did not see your code or data
• An altruistic, collective enterprise → a race to outrun your colleagues in front of the giant bear of grant funding

Credit: Fernando Pérez @ PyCon 2014

Page 2:

• Voting on topics
• Exercise 4
• Final Project
• Advanced RL class
• Feedback on class presentations

Page 3:

Inter-Task Transfer

• Learning tabula rasa can be unnecessarily slow
• Humans can use past information
  – Soccer with different numbers of players
• Agents leverage learned knowledge in novel tasks
  – Bias learning: a speedup method

Page 4:

Primary Questions

Source: S_SOURCE, A_SOURCE
Target: S_TARGET, A_TARGET

• Is it possible to transfer learned knowledge?
• Is it possible to transfer without providing a task mapping?
• Only reinforcement learning tasks are considered
  – Lots of work exists in supervised settings

Page 5:

Value Function Transfer

ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T)

• The action-value function is transferred
• ρ is task-dependent: it relies on inter-task mappings
• Q_S is not defined on S_T and A_T, so ρ is needed to bridge the two tasks

[Diagram: source-task and target-task agent–environment loops (Action, State, Reward for each), with ρ mapping Q_S: S_S × A_S → ℝ to Q_T: S_T × A_T → ℝ]

Page 6:

Inter-Task Mappings

[Diagram: mappings between the source task and the target task]

Page 7:

Inter-Task Mappings

• χ_x: s_target → s_source
  – Given a state variable in the target task (some x from s = x_1, x_2, … x_n)
  – Return the corresponding state variable in the source task
• χ_A: a_target → a_source
  – Similar, but for actions
• Intuitive mappings exist in some domains (Oracle)
• Used to construct the transfer functional ρ

[Diagram: Target task with S_TARGET = ⟨x_1 … x_n⟩ and A_TARGET = {a_1 … a_m}; Source task with S_SOURCE = ⟨x_1 … x_k⟩ and A_SOURCE = {a_1 … a_j}; χ_x and χ_A map target state variables and actions to source ones]
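Concretely, the mappings and the transfer functional ρ from the previous two slides can be sketched in a few lines of Python. This is a minimal illustration assuming tabular Q-functions; the dictionaries chi_x and chi_A below are hypothetical stand-ins, not the slides' actual Keepaway mappings.

```python
# A minimal sketch of inter-task mappings and value function transfer for
# tabular Q-functions. The variable indices, action names, and mappings below
# are hypothetical; they only illustrate the roles of chi_x, chi_A, and rho.

# chi_A: target action -> source action
chi_A = {"hold": "hold", "pass1": "pass1", "pass2": "pass2", "pass3": "pass2"}

# chi_x: target state-variable index -> source state-variable index
chi_x = {0: 0, 1: 1, 2: 1}   # two target variables map onto one source variable

def rho(q_source, s_target, a_target, n_source_vars=2):
    """Initialize Q_T(s_target, a_target) from Q_S(chi_x(s_target), chi_A(a_target))."""
    # Build the corresponding source state: each source variable takes the
    # value of a target variable that chi_x sends to it.
    s_source = [None] * n_source_vars
    for target_idx, source_idx in chi_x.items():
        s_source[source_idx] = s_target[target_idx]
    a_source = chi_A[a_target]
    # Source (state, action) pairs never visited default to 0 (an assumption).
    return q_source.get((tuple(s_source), a_source), 0.0)

# Hypothetical learned source Q-table and a single target-task query:
q_source = {((2, 7), "pass2"): 4.5}
print(rho(q_source, s_target=(2, 7, 7), a_target="pass3"))   # -> 4.5
```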

Page 8:

Example

• Cancer
• Castle attack

Page 9:

Lazaric

• Transfer can
  – Reduce the need for instances in the target task
  – Reduce the need for domain expert knowledge
• What changes
  – State space, state features
  – A, T, R, goal state
  – Learning method

Page 10:

Transfer Evaluation Metrics

Set a threshold performance (e.g., 8.5) that the majority of agents can achieve with learning.

[Figure: performance vs. time curves for "Target: no transfer", "Target: with transfer", and "Target + Source: with transfer", each compared against the threshold]

Two distinct scenarios:

1. Target Time Metric: successful if target-task learning time is reduced
   – The "sunk cost" of source-task training is ignored
   – Source task(s) are independently useful
   – AI goal: effectively utilize past knowledge

2. Total Time Metric: successful if total (source + target) time is reduced
   – Only the target task matters; source task(s) are not independently useful
   – Engineering goal: minimize total training

Page 11:

• The previous metric measured "learning speed improvement"
• Can also have
  – Asymptotic improvement
  – Jumpstart improvement

Page 12:

[Figure: Value Function Transfer — time to threshold in 4 vs. 3. x-axis: # 3 vs. 2 episodes (0, 10, 50, 100, 250, 500, 1000, 3000, 6000); y-axis: simulator hours (0–35); bars show avg. 3 vs. 2 time plus avg. 4 vs. 3 time; total time is compared against the no-transfer target-task time]

Page 13:

Results: Scaling up to 5 vs. 4

Transfer from 4 vs. 3 only:

# 4 vs 3 episodes | 4 vs 3 time (hrs) | 5 vs 4 time (hrs) | Total time (hrs)
0                 | 0                 | 18.01             | 18.01
1000              | 1.86              | 13.12             | 14.98

Transfer from 3 vs. 2 and 4 vs. 3 (best case: 8.5 hrs, about 47% of the no-transfer time):

# 3 vs 2 episodes | 3 vs 2 time (hrs) | # 4 vs 3 episodes | 4 vs 3 time (hrs) | 5 vs 4 time (hrs) | Total time (hrs)
500               | 1.05              | 500               | 0.99              | 6.46              | 8.5
1000              | 2.14              | 0                 | 0                 | 6.99              | 9.13

• All results statistically significant when compared to no transfer

Page 14:

Problem Statement

• Humans can select a training sequence
• Results in faster training / better performance
• Meta-planning problem for agent learning
  – Known vs. unknown final task?

[Diagram: a sequence of MDPs leading to a final, possibly unknown, MDP]

Page 15:

• TL vs. multi-task vs. lifelong learning vs. generalization vs. concept drift
• Learning from Easy Missions
• Changing the length of the cart pole

Page 16:

Keepaway [Stone, Sutton, and Kuhlmann 2005]

[Diagram: 3 vs. 2 Keepaway field with keepers K1, K2, K3 and takers T1, T2]

3 vs. 2:
• Goal: maintain possession of the ball
• 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
• The keeper with the ball may hold the ball or pass to either teammate
• Both takers move towards the player with the ball

4 vs. 3:
• 7 agents, 4 actions, 19 state variables

Page 17:

Learning Keepaway

• Sarsa update
  – CMAC, RBF, and neural network function approximation successful
• Q^π(s, a): predicted number of steps the episode will last
  – Reward = +1 for every timestep
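As a reminder of what the Sarsa update on this slide computes, here is a minimal tabular sketch. The Keepaway agents actually use CMAC/RBF/neural-network function approximation; the tabular form and the parameter values below are illustrative assumptions only.

```python
from collections import defaultdict

alpha, gamma = 0.1, 1.0   # illustrative learning rate and discount factor
Q = defaultdict(float)     # Q[(state, action)] -> predicted steps the episode will last

def sarsa_update(s, a, r, s_next, a_next, done):
    """On-policy TD update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).

    With reward = +1 per timestep, Q(s, a) estimates how many steps the episode
    will last from (s, a), as stated on the slide.
    """
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```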

Page 18:

ρ's Effect on CMACs

• For each weight in the 4 vs. 3 function approximator:
  – Use the inter-task mapping to find the corresponding 3 vs. 2 weight

[Figure: corresponding CMAC tilings for 4 vs. 3 and 3 vs. 2]
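A minimal sketch of that weight-copying step, assuming each task's CMAC is a flat array of weights and that a hypothetical helper, corresponding_source_index, implements the inter-task mapping between tilings:

```python
import numpy as np

def transfer_cmac_weights(w_source, n_target_weights, corresponding_source_index):
    """Initialize the 4 vs. 3 (target) CMAC weights from the learned 3 vs. 2
    (source) weights: each target weight is copied from the source weight that
    the inter-task mapping associates with it."""
    w_target = np.zeros(n_target_weights)
    for i in range(n_target_weights):
        j = corresponding_source_index(i)   # hypothetical mapping between tilings
        if j is not None:                   # some target weights may have no source analogue
            w_target[i] = w_source[j]
    return w_target
```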

Page 19:

Keepaway Hand-coded χ_A

• Hold_4v3 → Hold_3v2
• Pass1_4v3 → Pass1_3v2
• Pass2_4v3 → Pass2_3v2
• Pass3_4v3 → Pass2_3v2

Actions in 4 vs. 3 have "similar" actions in 3 vs. 2
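Expressed in code, the hand-coded action mapping above is just a small lookup table. This is a sketch; the identifier strings are mine, but the correspondences are the ones listed on the slide.

```python
# chi_A for Keepaway: 4 vs. 3 (target) action -> 3 vs. 2 (source) action
chi_A_keepaway = {
    "Hold_4v3":  "Hold_3v2",
    "Pass1_4v3": "Pass1_3v2",
    "Pass2_4v3": "Pass2_3v2",
    "Pass3_4v3": "Pass2_3v2",  # no third teammate in 3 vs. 2, so reuse Pass2
}
```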

Page 20:

[Figure (repeated from Page 12): Value Function Transfer — time to threshold in 4 vs. 3; x-axis: # 3 vs. 2 episodes; y-axis: simulator hours; total time compared against the no-transfer target-task time]

Page 21:

Example Transfer Domains

• Series of mazes with different goals [Fernandez and Veloso, 2006]
• Mazes with different structures [Konidaris and Barto, 2007]

Page 22:

Example Transfer Domains

• Series of mazes with different goals [Fernandez and Veloso, 2006]
• Mazes with different structures [Konidaris and Barto, 2007]
• Keepaway with different numbers of players [Taylor and Stone, 2005]
• Keepaway to Breakaway [Torrey et al., 2005]

Page 23:

• All tasks are drawn from the same domain
  o Task: an MDP
  o Domain: a setting for semantically similar tasks
• What about Cross-Domain Transfer?
  o The source task could be much simpler
  o Show that source and target can be less similar


Page 24:

Source Task: Ringworld

Ringworld — Goal: avoid being tagged
• 2 agents, 3 actions, 7 state variables
• Fully observable
• Discrete state space (Q-table with ~8,100 s,a pairs)
• Stochastic actions
• Opponent moves directly towards the player
• Player may stay or run towards a pre-defined location

3 vs. 2 Keepaway — Goal: maintain possession of the ball
• 5 agents, 3 actions, 13 state variables
• Partially observable
• Continuous state space
• Stochastic actions

[Diagram: 3 vs. 2 Keepaway field with keepers K1–K3 and takers T1, T2]

Page 25:

Source Task: Knight's Joust

Knight's Joust — Goal: travel from the start to the goal line
• 2 agents, 3 actions, 3 state variables
• Fully observable
• Discrete state space (Q-table with ~600 s,a pairs)
• Deterministic actions
• Opponent moves directly towards the player
• Player may move North, or take a knight's jump to either side

3 vs. 2 Keepaway — Goal: maintain possession of the ball
• 5 agents, 3 actions, 13 state variables
• Partially observable
• Continuous state space
• Stochastic actions

[Diagram: 3 vs. 2 Keepaway field with keepers K1–K3 and takers T1, T2]

Page 26:

Rule Transfer Overview

1. Learn a policy (π: S → A) in the source task
   – TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π
3. Translate D_source → D_target (applies to the target task)
   – State variables and actions can differ in the two tasks
4. Use D_target to learn a policy in the target task

Allows different learning methods and function approximators in the source and target tasks

Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target

Page 27:

Rule Transfer Details: Learn π
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

[Diagram: source-task agent–environment loop (Action, State, Reward)]

• In this work we use Sarsa
  o Q: S × A → Return
• Other learning methods possible

Page 28:

Rule Transfer Details: Learn D_source
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Use the learned policy to record (State, Action) pairs
• Use JRip (RIPPER in Weka) to learn a decision list

[Diagram: agent–environment loop, recording a table of State → Action examples]

• IF s1 < 4 and s2 > 5 → a1
• ELSEIF s1 < 3 → a2
• ELSEIF s3 > 7 → a1
• …
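The learned decision list is an ordered set of rules evaluated top to bottom. A minimal sketch of how the example list above would be applied as a source-task policy; the state-variable names mirror the slide, and the fall-through default action is an assumption:

```python
def d_source(s1, s2, s3, default_action="a3"):
    """Apply the decision list from the slide: the first matching rule wins."""
    if s1 < 4 and s2 > 5:
        return "a1"
    elif s1 < 3:
        return "a2"
    elif s3 > 7:
        return "a1"
    return default_action  # fall-through case (assumed; not shown on the slide)
```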

Page 29:

Rule Transfer Details: Translate(D_source) → D_target
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Inter-task mappings
• χ_x: s_target → s_source
  – Given a state variable in the target task (some x from s = x_1, x_2, … x_n)
  – Return the corresponding state variable in the source task
• χ_A: a_target → a_source
  – Similar, but for actions

[Diagram: each rule is translated into a rule′ by applying χ_x and χ_A]

Page 30:

Rule Transfer Details: Translate(D_source) → D_target
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

[Diagram: 3 vs. 2 Keepaway field with keepers K1–K3 and takers T1, T2]

Action correspondences (via χ_A):
• Stay ↔ Hold Ball
• RunNear ↔ Pass to K2
• RunFar ↔ Pass to K3

State-variable correspondences (via χ_x):
• dist(Player, Opponent) ↔ dist(K1, T1)
• …

Example rule translation:
IF dist(Player, Opponent) > 4 → Stay
becomes
IF dist(K1, T1) > 4 → Hold Ball
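A minimal sketch of the translation step, assuming rules are stored as simple (variable, comparison, threshold, action) tuples and the correspondences above are available as dictionaries. The variable and action names mirror the slide; the data structure itself is an assumption:

```python
# Correspondences from the slide, stored source -> target for translation.
action_map = {"Stay": "Hold Ball", "RunNear": "Pass to K2", "RunFar": "Pass to K3"}
var_map = {"dist(Player, Opponent)": "dist(K1, T1)"}

def translate_rule(rule):
    """Rewrite one source-task rule into a target-task rule.

    A rule is (variable, comparison, threshold, action).
    """
    var, cmp_op, threshold, action = rule
    return (var_map[var], cmp_op, threshold, action_map[action])

# Reproduces the slide's example:
print(translate_rule(("dist(Player, Opponent)", ">", 4, "Stay")))
# -> ('dist(K1, T1)', '>', 4, 'Hold Ball')
```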

Page 31:

Rule Transfer Details: Use D_target
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Many possible ways to use D_target
  o Value Bonus
  o Extra Action
  o Extra Variable
• Assuming a TD learner in the target task
  o Should generalize to other learning methods

Example: evaluate the agent's 3 actions in state s = ⟨s1, s2⟩, where D_target(s) = a2:
Q(s1, s2, a1) = 5, Q(s1, s2, a2) = 3, Q(s1, s2, a3) = 4

• Value Bonus (shaping): add a bonus (e.g., +8) to the value of the action D_target recommends
• Extra Action (initially force the agent to select it): add a fourth action a4 = "take action a2", e.g. Q(s1, s2, a4) = 7
• Extra Variable (initially force the agent to select it): augment the state with an extra variable s3 holding D_target's recommendation, so Q is defined over ⟨s1, s2, s3⟩; e.g. Q(s1, s2, a2, a1) = 5, Q(s1, s2, a2, a2) = 9, Q(s1, s2, a2, a3) = 4
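As one concrete illustration of the Value Bonus variant, a sketch of biased action selection. The bonus size and the greedy selection rule are assumptions made for illustration; the slides do not fix them:

```python
def select_action(q_values, d_target_action, bonus=8.0):
    """Greedy selection over Q-values plus a shaping bonus for the action
    recommended by the translated decision list D_target."""
    biased = {a: q + (bonus if a == d_target_action else 0.0)
              for a, q in q_values.items()}
    return max(biased, key=biased.get)

# With the slide's numbers, the bonus makes a2 the greedy choice: 3 + 8 = 11.
print(select_action({"a1": 5, "a2": 3, "a3": 4}, d_target_action="a2"))  # -> a2
```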

Page 32:

Results: Extra Action

Conditions compared:
• Without Transfer
• Ringworld: 20,000 source episodes
• Knight's Joust: 50,000 source episodes

Keepaway Transfer Learning Results:

Source          | Initial Performance | Asymptotic Performance | Accumulated Reward
No Transfer     | 7.8                 | 21.6                   | 756.7
Ringworld       | 11.9                | 23.0                   | 842.0
Knight's Joust  | 13.8                | 21.8                   | 758.5