The ideals vs. reality of science
• The pursuit of verifiable answers vs. highly cited papers for your c.v.
• The validation of our results by reproduction vs. convincing referees who did not see your code or data
• An altruistic, collective enterprise vs. a race to outrun your colleagues in front of the giant bear of grant funding
Credit: Fernando Pérez @ PyCon 2014
• Voting on topics
• Exercise 4
• Final Project
• Advanced RL class
• Feedback on class presentations
Inter-Task Transfer
• Learning tabula rasa can be unnecessarily slow
• Humans can use past information
– Soccer with different numbers of players
• Agents can leverage learned knowledge in novel tasks
– Bias learning: a speedup method
Primary Questions
[Figure: a source task (S_SOURCE, A_SOURCE) mapped to a target task (S_TARGET, A_TARGET)]
• Is it possible to transfer learned knowledge?
• Is it possible to transfer without providing a task mapping?
• We only consider reinforcement learning tasks
– There is lots of work in supervised settings
Value Function Transfer
ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T)
• The action-value function is transferred
• ρ is task-dependent: it relies on inter-task mappings
• Q_S is not defined on S_T and A_T
[Figure: source-task agent-environment loop (State_S, Action_S, Reward_S) with Q_S : S_S × A_S → ℜ, connected by ρ to the target-task agent-environment loop (State_T, Action_T, Reward_T) with Q_T : S_T × A_T → ℜ]
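A minimal sketch of what ρ computes in the tabular case, assuming the inter-task mappings χ_x and χ_A introduced on the next slide (the actual agents use CMAC function approximation rather than a table, and all names here are illustrative):

    # Sketch: tabular value function transfer via inter-task mappings.
    # chi_x / chi_A map target states/actions back to source states/actions.
    def transfer_q(q_source, target_states, target_actions, chi_x, chi_A):
        """Initialize a target Q-table from a learned source Q-table."""
        q_target = {}
        for s_t in target_states:
            for a_t in target_actions:
                # rho: evaluate the source Q-function at the mapped (state, action)
                q_target[(s_t, a_t)] = q_source.get((chi_x(s_t), chi_A(a_t)), 0.0)
        return q_target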
Inter-Task Mappings
• χ_x : s_target → s_source
– Given a state variable in the target task (some x from s = x_1, x_2, …, x_n)
– Returns the corresponding state variable in the source task
• χ_A : a_target → a_source
– Similar, but for actions
• Intuitive mappings exist in some domains (oracle)
• Used to construct the transfer functional ρ
[Figure: χ_x and χ_A map the target task's state variables ⟨x_1 … x_n⟩ and actions {a_1 … a_m} (S_TARGET, A_TARGET) onto the source task's state variables ⟨x_1 … x_k⟩ and actions {a_1 … a_j} (S_SOURCE, A_SOURCE)]
Example
• Cancer
• Castle attack
Credit: Lazaric
• Transfer can
– Reduce the need for instances in the target task
– Reduce the need for domain expert knowledge
• What changes between tasks
– State space, state features
– A, T, R, goal state
– Learning method
[Figure: performance vs. time learning curves against a threshold of 8.5. Legend: target with no transfer ("Task 2 from scratch"), target with transfer, and target + source with transfer]
Transfer Evaluation Metrics
Set a threshold performance that the majority of agents can achieve with learning. Two distinct scenarios:
1. Target Time Metric: successful if target-task learning time is reduced
– The "sunk cost" of source-task training is ignored
– Source task(s) are independently useful
– AI goal: effectively utilize past knowledge
2. Total Time Metric: successful if total (source + target) time is reduced
– We only care about the target; source task(s) are not independently useful
– Engineering goal: minimize total training
• The previous examples measured learning speed improvement; transfer can also give
– Asymptotic improvement
– Jumpstart improvement
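A small sketch of how the two time-to-threshold metrics could be computed from logged learning curves (the helper names and curve format are assumptions; the experiments below measure time in simulator hours):

    # Sketch: time-to-threshold metrics for transfer evaluation.
    # curve: list of (elapsed_hours, performance) pairs, ordered by time.
    def time_to_threshold(curve, threshold=8.5):
        """Return the first time at which performance reaches the threshold."""
        for t, perf in curve:
            if perf >= threshold:
                return t
        return float("inf")  # threshold never reached

    def target_time_ok(target_curve, no_transfer_curve, threshold=8.5):
        # Target Time Metric: source-task training is treated as a sunk cost.
        return (time_to_threshold(target_curve, threshold)
                < time_to_threshold(no_transfer_curve, threshold))

    def total_time_ok(source_hours, target_curve, no_transfer_curve, threshold=8.5):
        # Total Time Metric: source-task training time counts against transfer.
        return (source_hours + time_to_threshold(target_curve, threshold)
                < time_to_threshold(no_transfer_curve, threshold))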
[Figure: Value Function Transfer — time to threshold in 4 vs. 3. X-axis: number of 3 vs. 2 episodes (0 to 6000); y-axis: simulator hours (0 to 35). Bars show avg. 3 vs. 2 time plus avg. 4 vs. 3 time; this total time is compared against the no-transfer target-task time]
Results: Scaling up to 5 vs. 4

# 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
0 | 0 | 18.01 | 18.01
1000 | 1.86 | 13.12 | 14.98

• All results statistically significant when compared to no transfer

# 3 vs. 2 episodes | 3 vs. 2 time (hrs) | # 4 vs. 3 episodes | 4 vs. 3 time (hrs) | 5 vs. 4 time (hrs) | Total time (hrs)
500 | 1.05 | 500 | 0.99 | 6.46 | 8.50
1000 | 2.14 | 0 | 0 | 6.99 | 9.13

• Best total time (8.5 hrs) is 47% of the no-transfer time (18.01 hrs)
Problem Statement
• Humans can select a training sequence
– Results in faster training / better performance
• A meta-planning problem for agent learning
– Known vs. unknown final task?
[Figure: a sequence of MDPs leading to a final, possibly unknown, MDP]
• TL vs. multi-task vs. lifelong learning vs. generalization vs. concept drift
• Learning from Easy Missions
• Changing the length of the cart pole
Keepaway [Stone, Sutton, and Kuhlmann 2005]
[Figure: keepers K1, K2, K3 and takers T1, T2 on the field]
• Goal: maintain possession of the ball
• Both takers move towards the player with the ball
• The keeper with the ball may hold it or pass to either teammate
• 3 vs. 2: 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
• 4 vs. 3: 7 agents, 4 actions, 19 state variables
Learning Keepaway
• Sarsa update
– CMAC, RBF, and neural network approximation all successful
• Q^π(s, a): predicted number of steps the episode will last
– Reward = +1 for every timestep
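A minimal tabular Sarsa sketch under this reward scheme (+1 per timestep, so Q estimates how long the episode will continue); the environment interface is hypothetical, and the real agents use CMAC/RBF/neural-network approximation rather than a table:

    # Sketch: tabular Sarsa with reward +1 per timestep (hypothetical env API).
    import random
    from collections import defaultdict

    def sarsa_episode(env, Q, alpha=0.1, gamma=1.0, epsilon=0.1):
        """Run one on-policy Sarsa episode; Q maps (state, action) -> value."""
        def pick(s):  # epsilon-greedy action selection
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(s, a)])

        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            s2, done = env.step(a)  # keepaway pays reward +1 every timestep
            r = 1.0
            if done:
                Q[(s, a)] += alpha * (r - Q[(s, a)])
            else:
                a2 = pick(s2)
                Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
                s, a = s2, a2

    # Usage: Q = defaultdict(float); call sarsa_episode(env, Q) repeatedly.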
ρ's Effect on CMACs
[Figure: 3 vs. 2 and 4 vs. 3 CMAC tilings]
• For each weight in the 4 vs. 3 function approximator:
– Use the inter-task mapping to find the corresponding 3 vs. 2 weight and copy its value
Keepaway Hand-coded χ_A
• Hold_4v3 → Hold_3v2
• Pass1_4v3 → Pass1_3v2
• Pass2_4v3 → Pass2_3v2
• Pass3_4v3 → Pass2_3v2
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2
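This mapping is small enough to write out directly; a sketch (the string labels are my own naming):

    # Sketch: hand-coded chi_A from 4 vs. 3 actions to 3 vs. 2 actions.
    CHI_A = {
        "Hold_4v3":  "Hold_3v2",
        "Pass1_4v3": "Pass1_3v2",
        "Pass2_4v3": "Pass2_3v2",
        "Pass3_4v3": "Pass2_3v2",  # no Pass3 in 3 vs. 2: map to the most similar pass
    }

    def chi_A(a_target):
        return CHI_A[a_target]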
Example Transfer Domains
• Series of mazes with different goals [Fernandez and Veloso, 2006]
• Mazes with different structures [Konidaris and Barto, 2007]
• Keepaway with different numbers of players [Taylor and Stone, 2005]
• Keepaway to Breakaway [Torrey et al., 2005]
All of these tasks are drawn from the same domain
• Task: an MDP
• Domain: a setting for semantically similar tasks
What about cross-domain transfer?
• The source task could be much simpler
• Shows that the source and target can be less similar
Source Task: Ringworld
Ringworld (goal: avoid being tagged)
• 2 agents, 3 actions, 7 state variables
• Fully observable, discrete state space (Q-table with ~8,100 (s, a) pairs)
• Stochastic actions
• The opponent moves directly towards the player
• The player may stay or run towards a pre-defined location
vs. 3 vs. 2 Keepaway (goal: maintain possession of the ball)
[Figure: keepers K1, K2, K3 and takers T1, T2]
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space
• Stochastic actions
Source Task: Knight's Joust
Knight's Joust (goal: travel from the start to the goal line)
• 2 agents, 3 actions, 3 state variables
• Fully observable, discrete state space (Q-table with ~600 (s, a) pairs)
• Deterministic actions
• The opponent moves directly towards the player
• The player may move North, or take a knight's jump to either side
vs. 3 vs. 2 Keepaway (goal: maintain possession of the ball)
[Figure: keepers K1, K2, K3 and takers T1, T2]
• 5 agents, 3 actions, 13 state variables
• Partially observable, continuous state space
• Stochastic actions
Rule Transfer Overview
1. Learn a policy (π : S → A) in the source task
– TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π
3. Translate D_source → D_target (applies to the target task)
– State variables and actions can differ between the two tasks
4. Use D_target to learn a policy in the target task
• Allows for different learning methods and function approximators in the source and target tasks
Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target
Rule Transfer Details: Learn π
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
[Figure: source-task agent-environment loop (Action, State, Reward)]
• In this work we use Sarsa
– Q : S × A → Return
• Other learning methods are possible
Rule Transfer Details: Learn D_source
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
• Use the learned policy to record (State, Action) pairs
[Figure: the agent's visited states and chosen actions are logged into a (State, Action) table]
• Use JRip (RIPPER in Weka) to learn a decision list from them:
• IF s1 < 4 and s2 > 5 → a1
• ELSEIF s1 < 3 → a2
• ELSEIF s3 > 7 → a1
• …
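A sketch of the recording step and of what a learned decision list computes (the environment and policy interfaces are stand-ins; in the paper the list itself is produced by JRip, not written by hand):

    # Sketch: record (state, action) pairs from a learned source policy,
    # then represent a resulting decision list as a plain function.
    def record_pairs(env, policy, n_episodes=100):
        pairs = []
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                a = policy(s)         # action from the learned source policy
                pairs.append((s, a))  # training data for the rule learner (JRip)
                s, done = env.step(a)
        return pairs

    def d_source(s):
        """A decision list like the one on the slide above."""
        s1, s2, s3 = s
        if s1 < 4 and s2 > 5:
            return "a1"
        elif s1 < 3:
            return "a2"
        elif s3 > 7:
            return "a1"
        return "a_default"  # final catch-all rule (assumed)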
Rule Transfer Details: Translate(D_source) → D_target
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
• Inter-task mappings
• χ_x : s_target → s_source
– Given a state variable in the target task (some x from s = x_1, x_2, …, x_n)
– Returns the corresponding state variable in the source task
• χ_A : a_target → a_source
– Similar, but for actions
[Figure: each rule is translated into rule′ via χ_x and χ_A]
Rule Transfer Details: Translate(D_source) → D_target (example)
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
[Figure: keepers K1, K2, K3 and takers T1, T2]
χ_A: Stay ↔ Hold Ball; RunNear ↔ Pass to K2; RunFar ↔ Pass to K3
χ_x: dist(Player, Opponent) ↔ dist(K1, T1); …
Example:
IF dist(Player, Opponent) > 4 → Stay
becomes
IF dist(K1, T1) > 4 → Hold Ball
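A sketch of this translation step for rules of the simple comparison form above (the rule representation and the mapping tables are my own simplification of the Ringworld → 3 vs. 2 keepaway example):

    # Sketch: translate a source-task rule into a target-task rule by
    # rewriting its state variable and action through the inter-task mappings.
    VAR_MAP = {"dist(Player, Opponent)": "dist(K1, T1)"}  # source var -> target var
    ACT_MAP = {"Stay": "Hold Ball",                       # source action -> target action
               "RunNear": "Pass to K2",
               "RunFar": "Pass to K3"}

    def translate_rule(rule):
        """rule = (variable, comparator, threshold, action)."""
        var, cmp_op, thr, act = rule
        return (VAR_MAP[var], cmp_op, thr, ACT_MAP[act])

    # IF dist(Player, Opponent) > 4 -> Stay  becomes
    # IF dist(K1, T1) > 4 -> Hold Ball
    print(translate_rule(("dist(Player, Opponent)", ">", 4, "Stay")))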
Rule Transfer Details: Use D_target
(Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target)
• Many possible ways to use D_target:
– Value Bonus (shaping)
– Extra Action (initially force the agent to select it)
– Extra Variable (initially force the agent to select it)
• Assuming a TD learner in the target task
– Should generalize to other learning methods

Value Bonus: evaluate the agent's 3 actions in state s = s1, s2 and add a bonus to the action the rules suggest, D_target(s) = a2:
Q(s1, s2, a1) = 5
Q(s1, s2, a2) = 3 + 8
Q(s1, s2, a3) = 4

Extra Action: add a pseudo-action that executes the suggested action:
Q(s1, s2, a1) = 5
Q(s1, s2, a2) = 3
Q(s1, s2, a3) = 4
Q(s1, s2, a4) = 7 (take action a2)

Extra Variable: append the suggested action to the state (here s3 = D_target(s) = a2):
Q(s1, s2, a2, a1) = 5
Q(s1, s2, a2, a2) = 9
Q(s1, s2, a2, a3) = 4
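A sketch of how the three schemes modify a TD learner's action selection (the Q-table lookup and bonus value are illustrative):

    # Sketch: three ways to use D_target during target-task learning.
    def q_with_bonus(Q, s, actions, d_target, bonus=8.0):
        """Value Bonus (shaping): boost the rule-suggested action's value."""
        return {a: Q[(s, a)] + (bonus if a == d_target(s) else 0.0) for a in actions}

    def actions_with_extra(actions):
        """Extra Action: add a pseudo-action that executes D_target's suggestion."""
        return actions + ["follow_rules"]  # resolved to d_target(s) when executed

    def augmented_state(s, d_target):
        """Extra Variable: append the suggested action to the state."""
        return s + (d_target(s),)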
Results: Extra Action
[Figure: target-task learning curves. Legend: without transfer; Ringworld (20,000 source episodes); Knight's Joust (50,000 source episodes)]
Keepaway Transfer Learning Results

Source | Initial Performance | Asymptotic Performance | Accumulated Reward
No Transfer | 7.8 | 21.6 | 756.7
Ringworld | 11.9 | 23.0 | 842.0
Knight's Joust | 13.8 | 21.8 | 758.5