The ideals vs. the reality of science
• Ideal: the pursuit of verifiable answers. Reality: highly cited papers for your c.v.
• Ideal: the validation of our results by reproduction. Reality: convincing referees who did not see your code or data.
• Ideal: an altruistic, collective enterprise. Reality: a race to outrun your colleagues in front of the giant bear of grant funding.
Credit: Fernando Pérez @ PyCon 2014
• Voting on topics
• Exercise 4
• Final Project
• Advanced RL class
• Feedback on class presentations
Inter-Task Transfer
• Learning tabula rasa can be unnecessarily slow
• Humans can use past information
– e.g., soccer with different numbers of players
• Agents can leverage learned knowledge in novel tasks
– Bias learning: a speedup method
Primary Questions
Source task: S_SOURCE, A_SOURCE → Target task: S_TARGET, A_TARGET
• Is it possible to transfer learned knowledge?
• Is it possible to transfer without providing a task mapping?
• We only consider reinforcement learning tasks
– There is also lots of work in supervised settings
Value Function Transfer
ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T)

• The action-value function is transferred
• ρ is task-dependent: it relies on inter-task mappings
• Q_S is not defined on S_T and A_T

[Diagram: two agent–environment loops, one for the source task (state_S, action_S, reward_S) and one for the target task (state_T, action_T, reward_T); ρ maps Q_S: S_S × A_S → ℝ to Q_T: S_T × A_T → ℝ]
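The transfer functional ρ can be pictured as copying source Q-values through the inter-task mappings defined on the next slide. Below is a minimal sketch for the tabular case; all names (q_source, chi_x, chi_a) are illustrative, not from the original work, and for simplicity χ is applied to whole states here, whereas the slides define it per state variable.

```python
# A minimal sketch of the transfer functional rho for tabular Q-functions
# stored as dicts. chi_x maps a target state to a source state; chi_a maps
# a target action to a source action. Illustrative only.

def rho(q_source, target_states, target_actions, chi_x, chi_a):
    """Initialize a target-task Q-function from a learned source-task one."""
    q_target = {}
    for s_t in target_states:
        for a_t in target_actions:
            # Look up the value of the corresponding source (state, action).
            s_s, a_s = chi_x(s_t), chi_a(a_t)
            q_target[(s_t, a_t)] = q_source.get((s_s, a_s), 0.0)
    return q_target
```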
Inter-Task Mappings
• χ_x: s_target → s_source
– Given a state variable in the target task (some x from s = x_1, x_2, …, x_n)
– Return the corresponding state variable in the source task
• χ_A: a_target → a_source
– Similar, but for actions
• Intuitive mappings exist in some domains (oracle)
• Used to construct the transfer functional
[Diagram: the target task's state variables ⟨x_1 … x_n⟩ and actions {a_1 … a_m} are mapped by χ_x and χ_A to the source task's state variables ⟨x_1 … x_k⟩ and actions {a_1 … a_j}]
Example
• Cancer
• Castle attack
Credit: Lazaric

• Transfer can
– Reduce the need for instances in the target task
– Reduce the need for domain-expert knowledge
• What can change between tasks
– State space, state features
– A, T, R, goal state
– Learning method
[Plots: performance over time against a threshold of 8.5, comparing "Target: no transfer", "Target: with transfer", and "Target + Source: with transfer" (i.e., Task 2 from scratch, Task 2 with transfer, Task 1 + Task 2 with transfer)]
Transfer Evaluation Metrics
Set a threshold performance that the majority of agents can achieve with learning. Two distinct scenarios:
1. Target Time Metric: successful if target-task learning time is reduced
– The "sunk cost" of source training is ignored; only the target matters
– Appropriate when the source task(s) are independently useful
– AI goal: effectively utilize past knowledge
2. Total Time Metric: successful if the total (source + target) time is reduced
– Appropriate when the source task(s) are not useful in themselves
– Engineering goal: minimize total training
• The above measures "learning speed improvement"; transfer can also yield
– Asymptotic improvement
– Jumpstart improvement
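As a concrete reading of the two metrics, here is an illustrative sketch, assuming learning curves are given as lists of (time, performance) pairs; the function names, threshold default, and representation are my own assumptions, not from the slides.

```python
# Illustrative computation of the two time-to-threshold metrics.

def time_to_threshold(curve, threshold=8.5):
    """Return the first time at which performance reaches the threshold."""
    for t, perf in curve:
        if perf >= threshold:
            return t
    return float('inf')  # threshold never reached

def target_time_metric(target_curve, threshold=8.5):
    # Scenario 1: only target-task time counts ("sunk cost" ignored).
    return time_to_threshold(target_curve, threshold)

def total_time_metric(source_time, target_curve, threshold=8.5):
    # Scenario 2: source training time is charged as well.
    return source_time + time_to_threshold(target_curve, threshold)
```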
[Figure: "Value Function Transfer: Time to threshold in 4 vs. 3". x-axis: # of 3 vs. 2 episodes (0 to 6000); y-axis: simulator hours (0 to 35). Bars show avg. 3 vs. 2 time plus avg. 4 vs. 3 time; this total time is compared against the no-transfer target-task time.]
Results: Scaling up to 5 vs. 4

# 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
0                  | 0                  | 18.01              | 18.01
1000               | 1.86               | 13.12              | 14.98

# 3 vs. 2 episodes | 3 vs. 2 time (hrs) | # 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
500                | 1.05               | 500                | 0.99               | 6.46               | 8.5
1000               | 2.14               | 0                  | 0                  | 6.99               | 9.13

• Best total time (8.5 hrs) is 47% of the no-transfer time (18.01 hrs)
• All results are statistically significant when compared to no transfer
Problem Statement
• Humans can select a training sequence
– Results in faster training / better performance
• A meta-planning problem for agent learning
– Known vs. unknown final task?

[Diagram: sequences of MDPs leading to a final task, which may be known or unknown ("?")]

• TL vs. multi-task vs. lifelong learning vs. generalization vs. concept drift
• Examples: Learning from Easy Missions; changing the length of the cart pole
Keepaway [Stone, Sutton, and Kuhlmann 2005]

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]

3 vs. 2:
• Goal: maintain possession of the ball
• 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
• The keeper with the ball may hold it or pass to either teammate
• Both takers move towards the player with the ball

4 vs. 3:
• 7 agents, 4 actions, 19 state variables
Learning Keepaway
• Sarsa update
– CMAC, RBF, and neural network approximation all successful
• Q^π(s,a): predicted number of steps the episode will last
– Reward = +1 for every timestep
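For reference, here is a minimal tabular Sarsa backup, which is what the update above amounts to; the slides use CMAC/RBF/neural-net function approximation rather than a table, and the step size default below is my own illustrative choice. With reward +1 per timestep and γ = 1, Q(s, a) estimates the remaining episode length.

```python
# One on-policy TD (Sarsa) backup on a dict-based Q-table:
#   Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    q_sa = Q.get((s, a), 0.0)
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)
```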
ρ's Effect on CMACs
• For each weight in the 4 vs. 3 function approximator:
– Use the inter-task mapping to find the corresponding 3 vs. 2 weight

[Diagram: 4 vs. 3 CMAC weights mapped onto 3 vs. 2 CMAC weights]
Keepaway Hand-coded χ_A
• Hold_4v3 → Hold_3v2
• Pass1_4v3 → Pass1_3v2
• Pass2_4v3 → Pass2_3v2
• Pass3_4v3 → Pass2_3v2
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2, as in the sketch below.
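The hand-coded action mapping on this slide is small enough to write down directly; the identifier spellings below are mine, but the correspondences are exactly those listed above.

```python
# Hand-coded chi_A: 4 vs. 3 actions -> "similar" 3 vs. 2 actions.
chi_A = {
    'Hold_4v3':  'Hold_3v2',
    'Pass1_4v3': 'Pass1_3v2',
    'Pass2_4v3': 'Pass2_3v2',
    'Pass3_4v3': 'Pass2_3v2',  # no Pass3 in 3 vs. 2, so Pass2 is reused
}
```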
Example Transfer Domains
• Series of mazes with different goals [Fernandez and Veloso, 2006]
• Mazes with different structures [Konidaris and Barto, 2007]
• Keepaway with different numbers of players [Taylor and Stone, 2005]
• Keepaway to Breakaway [Torrey et al., 2005]

All of these tasks are drawn from the same domain
– Task: an MDP
– Domain: a setting for semantically similar tasks

What about cross-domain transfer?
– The source task could be much simpler
– Shows that source and target can be less similar
Source Task: Ringworld

Ringworld (goal: avoid being tagged)
• 2 agents, 3 actions, 7 state variables
• Fully observable, discrete state space (Q-table with ~8,100 (s,a) pairs), stochastic actions
• Opponent moves directly towards the player
• Player may stay or run towards a pre-defined location

3 vs. 2 Keepaway (goal: maintain possession of the ball)
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space, stochastic actions

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]
Source Task: Knight's Joust

Knight's Joust (goal: travel from start to the goal line)
• 2 agents, 3 actions, 3 state variables
• Fully observable, discrete state space (Q-table with ~600 (s,a) pairs), deterministic actions
• Opponent moves directly towards the player
• Player may move North, or take a knight's jump to either side

3 vs. 2 Keepaway (goal: maintain possession of the ball)
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space, stochastic actions

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]
Rule Transfer Overview
1. Learn a policy (π : S → A) in the source task
– TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π
3. Translate D_source → D_target (so it applies to the target task)
– State variables and actions can differ between the two tasks
4. Use D_target to learn a policy in the target task

This allows different learning methods and function approximators in the source and target tasks.

Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target
Rule Transfer Details: Learn π
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• In this work we use Sarsa
– Q : S × A → Return
• Other learning methods are possible

[Diagram: agent–environment loop (action, state, reward) in the source task]
Rule Transfer Details: Learn D_source
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Use the learned policy to record (state, action) pairs
• Use JRip (RIPPER in Weka) to learn a decision list, e.g.:
IF s1 < 4 and s2 > 5 → a1
ELSEIF s1 < 3 → a2
ELSEIF s3 > 7 → a1
…
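A sketch of this step, with scikit-learn's decision tree standing in for JRip/RIPPER (the slides use Weka); the function name and the max_depth choice are my own illustrative assumptions, not the authors' setup.

```python
# Record (state, action) pairs from the learned policy, then fit a small,
# interpretable classifier to them as a rule-like summary of pi.

from sklearn.tree import DecisionTreeClassifier

def summarize_policy(policy, states):
    """Record (s, pi(s)) pairs and fit a compact summary of pi."""
    X = [list(s) for s in states]   # state variables as features
    y = [policy(s) for s in states] # action chosen by the learned policy
    d_source = DecisionTreeClassifier(max_depth=4)  # small => readable rules
    d_source.fit(X, y)
    return d_source
```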
Rule Transfer Details: Translate(D_source) → D_target
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Inter-task mappings
– χ_x: s_target → s_source: given a state variable in the target task (some x from s = x_1, x_2, …, x_n), return the corresponding state variable in the source task
– χ_A: a_target → a_source: similar, but for actions
• Each rule is translated via χ_x and χ_A: rule → rule′
Rule Transfer Details: Translate (example)

[Diagram: 3 vs. 2 Keepaway, with keepers K1, K2, K3 and takers T1, T2]

Correspondences (via χ_x and χ_A):
Stay ↔ Hold Ball
RunNear ↔ Pass to K2
RunFar ↔ Pass to K3
dist(Player, Opponent) ↔ dist(K1, T1)
… ↔ …

Example translation:
IF dist(Player, Opponent) > 4 → Stay
becomes
IF dist(K1, T1) > 4 → Hold Ball
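A sketch of the rule translation, assuming rules are (variable, operator, threshold, action) tuples. Note the mappings are inverted here because χ_x and χ_A run target → source, while translation rewrites source rules in target terms; the tuple representation and names are illustrative.

```python
# Inverted mappings (source term -> target term), taken from the slide.
inv_chi_x = {'dist(Player, Opponent)': 'dist(K1,T1)'}
inv_chi_A = {'Stay': 'Hold Ball',
             'RunNear': 'Pass to K2',
             'RunFar': 'Pass to K3'}

def translate_rule(rule):
    """Rewrite one source-task rule in target-task terms."""
    var, op, threshold, action = rule
    return (inv_chi_x[var], op, threshold, inv_chi_A[action])

# IF dist(Player, Opponent) > 4 -> Stay   becomes
# IF dist(K1,T1) > 4 -> Hold Ball
print(translate_rule(('dist(Player, Opponent)', '>', 4, 'Stay')))
```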
Rule Transfer Details: Use D_target
(Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)

• Many possible ways to use D_target:
– Value Bonus (shaping)
– Extra Action (initially force the agent to select it)
– Extra Variable (initially force the agent to select it)
• Assuming a TD learner in the target task
– Should generalize to other learning methods

Example: evaluate the agent's 3 actions in state s = s1, s2, where D_target(s) = a2:
• Value Bonus: add a bonus (+8) to the action recommended by D_target:
Q(s1, s2, a1) = 5, Q(s1, s2, a2) = 3 + 8, Q(s1, s2, a3) = 4
• Extra Action: add a fourth action meaning "take the action D_target(s)":
Q(s1, s2, a1) = 5, Q(s1, s2, a2) = 3, Q(s1, s2, a3) = 4, Q(s1, s2, a4) = 7 (take action a2)
• Extra Variable: add D_target(s) as an extra state variable (here s3 = a2):
Q(s1, s2, a2, a1) = 5, Q(s1, s2, a2, a2) = 9, Q(s1, s2, a2, a3) = 4
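A minimal sketch of the Value Bonus variant, assuming a dict-based Q-table; the bonus size (8) comes from the example above, while the function and parameter names are illustrative, not from the slides.

```python
# Value Bonus: when evaluating actions in the target task, add a fixed
# bonus to the action the translated decision list recommends.

def action_values_with_bonus(Q, s, actions, d_target, bonus=8.0):
    recommended = d_target(s)  # e.g. a2 in the slide's example
    return {a: Q.get((s, a), 0.0) + (bonus if a == recommended else 0.0)
            for a in actions}
```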
Results: Extra Action

[Learning curves: target-task performance with and without transfer. Source-task training: Ringworld, 20,000 episodes; Knight's Joust, 50,000 episodes]
Keepaway Transfer Learning Results

Source         | Initial Performance | Asymptotic Performance | Accumulated Reward
No Transfer    | 7.8                 | 21.6                   | 756.7
Ringworld      | 11.9                | 23.0                   | 842.0
Knight's Joust | 13.8                | 21.8                   | 758.5