Towards Real-Life Reinforcement Learning
Michael L. Littman, Rutgers University
Department of Computer Science
Rutgers Laboratory for Real-Life Reinforcement Learning
Where We’re Going
Introduce reinforcement learning
• why I think it’s exciting
Define the problem and current approaches
• highlight challenges of RL with real data
Current projects in my lab:
• efficient exploration
• rich sensors
• partial observability
• non-stationary environments
Why RL? Why AI?
• Make “Your plastic pal who’s fun to be with”
• Solve challenging problems in computer science
• Understand humanity
• Create useful tools
• Produce a decent movie
Creating Human-Level AI
Significant motivator in the early days
Big question still unanswered:
What kind of information do we need to put into our programs for them to be intelligent?
How ought we program intelligent machines?
• Program behavior?
• Program desires?
(Likely long haul even after we answer this.)
Impressive Accomplishment
Honda’s Asimo
• development began in 1999, building on 13 years of engineering experience.
• claimed “most advanced humanoid robot ever created”
• walks 1mph
And Yet…
Asimo is controlled/programmed by people:
• structure of the walk programmed in
• reactions to perturbations programmed in
• directed by technicians and puppeteers during the performance
• static stability
Crawl Before Walk
Impressive accomplishment:
• Fastest reported walk/crawl on an Aibo
• Gait pattern optimized automatically
Human “Crawling”
Seven types known:
• diagonal crawl
• rolling
• bottom shuffling
• commando
• “city crawl”
• one leg extended
• knee sliding
• skip it
Perhaps our programming isn’t for crawling at all, but for the desire for movement!
Reinforcement-Learning Hypothesis
Intelligent behavior arises from
the actions of an individual seeking to
maximize its received reward signals
in a complex and changing world.
Research program:
• identify where reward signals come from,
• develop algorithms that search the space of behaviors to maximize reward signals.
Find The Ball
Learn:
• which way to turn
• to minimize steps
• to see goal (ball)
• from camera input
• given experience.
The RL Problem
Input: <s1, a1, s2, r1>, <s2, a2, s3, r2>, …, st
Output: choose the at’s to maximize the discounted sum of the ri’s.
[Example experience: ⟨camera image⟩, right, ⟨camera image⟩, +1]
Problem Formalization: MDP
Most popular formalization: Markov decision process
Assume:
• States/sensations, actions discrete.
• Transitions, rewards stationary and Markov.
• Transition function: Pr(s’|s,a) = T(s,a,s’).
• Reward function: E[r|s,a] = R(s,a).
Then:
• Optimal policy: π*(s) = argmax_a Q*(s,a)
• where Q*(s,a) = R(s,a) + γ Σ_s’ T(s,a,s’) max_a’ Q*(s’,a’)
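When T and R are known, this fixed point can be computed by repeatedly applying the backup. A minimal value-iteration sketch in Python (the array shapes, discount, and tolerance here are illustrative assumptions, not from the talk):

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    # T[s, a, s'] = Pr(s'|s, a); R[s, a] = E[r|s, a]; both assumed given.
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                 # V(s') = max_a' Q(s', a')
        Q_new = R + gamma * T @ V         # Bellman backup for every (s, a)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# The optimal policy is then greedy: pi*(s) = argmax_a Q*(s, a), i.e. Q.argmax(axis=1).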
Find the Ball: MDP Version
• Actions: rotate left/right
• States: orientation
• Reward: +1 for facing ball, 0 otherwise
It Can Be Done: Q-learning
Since the optimal Q function is sufficient, use experience to estimate it (Watkins & Dayan 92).
Given <s, a, s’, r>:  Q(s,a) ← Q(s,a) + α_t (r + γ max_a’ Q(s’,a’) − Q(s,a))
If:
• all (s,a) pairs updated infinitely often
• Pr(s’|s,a) = T(s,a,s’), E[r|s,a] = R(s,a)
• Σ α_t = ∞, Σ α_t² < ∞
Then: Q(s,a) → Q*(s,a)
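A minimal tabular sketch of that update loop (the environment interface, the 1/n(s,a) step size, and the ε-greedy behavior policy are illustrative assumptions, not part of the convergence statement above):

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, gamma=0.9, epsilon=0.1):
    # env is assumed to expose reset() -> s and step(a) -> (s_next, r, done).
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]    # step sizes: sum is infinite, sum of squares is finite
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q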
Real-Life Reinforcement Learning
Emphasize learning with real* data.
Q-learning good, but might not be right here…
Mismatches to “Find the Ball” MDP:
• Efficient exploration: data is expensive
• Rich sensors: never see the same thing twice
• Aliasing: different states can look similar
• Non-stationarity: details change over time
* Or, if simulated, from simulators developed outside the AI community
Convergence of Policy
Logical disconnect:
• Q-learning converges in the limit to Q*(s,a).
• Best action is greedy: π*(s) = argmax_a Q*(s,a).
• But, for convergence, can’t starve (best) action.
Weak result (Singh, Jaakola, Littman, Szepesvári 00):
• GLIE (greedy in the limit with infinite exploration).
• Policy converges to optimal, not just Q values.
• Example: Decaying ε-greedy. On visit n(s) to state s, choose a random action with probability 1/n(s).
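A short sketch of that decaying ε-greedy rule (the state-indexed visit counter and names are illustrative):

import numpy as np

def glie_action(Q, s, visit_count):
    # On the n(s)-th visit to state s, choose a random action with probability 1/n(s);
    # otherwise act greedily with respect to the current Q estimates.
    visit_count[s] += 1
    if np.random.rand() < 1.0 / visit_count[s]:
        return int(np.random.randint(Q.shape[1]))   # explore
    return int(Q[s].argmax())                        # exploit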
Efficient Exploration
Limit is nice, but would like something faster.
Goal: Policy that’s ε-optimal with probability 1−δ
after a polynomial amount of experience.
E3 (Kearns & Singh 98):
• Use experience to estimate model (T and R).
• Find optimal greedy policy wrt the model.
• Use model uncertainty to guide exploration.
Similar to RMAX (Brafman & Tennenholtz 02).
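A rough sketch of the shared idea behind E3/RMAX as described above: estimate the model from counts and treat under-visited state-action pairs optimistically, so that planning with the resulting model drives exploration (the known-ness threshold m, reward bound r_max, and self-loop trick are my illustrative simplifications):

import numpy as np

def optimistic_model(counts, reward_sums, m=5, r_max=1.0):
    # counts[s, a, s'] and reward_sums[s, a] are tallied from experience.
    n_states, n_actions, _ = counts.shape
    n_sa = counts.sum(axis=2)                            # visits to each (s, a)
    T_hat = np.zeros_like(counts, dtype=float)
    R_hat = np.zeros((n_states, n_actions))
    known = n_sa >= m                                    # "known" state-action pairs
    T_hat[known] = counts[known] / n_sa[known][:, None]  # empirical transition estimates
    R_hat[known] = reward_sums[known] / n_sa[known]      # empirical mean rewards
    # Unknown pairs: pretend they self-loop with maximal reward, so the greedy
    # policy for this optimistic model is pulled toward exploring them.
    for s, a in zip(*np.where(~known)):
        T_hat[s, a, s] = 1.0
        R_hat[s, a] = r_max
    return T_hat, R_hat   # plan with, e.g., the value_iteration sketch above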
Model-Based Interval Estimation
For each state-action pair, keep a distribution & confidence intervals (Wiering 99; Strehl & Littman 04).
Of the distributions within the confidence interval, assume the most optimistic. (Can compute efficiently.)
Like the bandit case (Kaelbling 93; Fong 95): get optimal reward or learn something significant, fast.
[Diagram: chain MDP with states s1 … s5 and sMAX]
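A minimal sketch of using the confidence intervals optimistically: a bonus-style simplification that inflates the reward estimate by a term that shrinks like a confidence-interval width (this is not the exact MBIE computation; beta is an illustrative constant):

import numpy as np

def mbie_style_estimates(counts, reward_sums, beta=1.0):
    # counts[s, a, s'] and reward_sums[s, a] are tallied from experience.
    n_sa = np.maximum(counts.sum(axis=2), 1)            # avoid division by zero
    T_hat = counts / n_sa[..., None]                    # empirical transition estimates
    R_opt = reward_sums / n_sa + beta / np.sqrt(n_sa)   # optimistic reward (CI-width bonus)
    # Plan greedily with respect to (T_hat, R_opt); rarely tried pairs keep a large
    # bonus, so either we earn good reward or we quickly learn something significant.
    return T_hat, R_opt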
Some (non-real) MBIE Results
Each plot measures cumulative reward by trial.
Varied exploration parameters (6-arms MDP).
Text Filtering (real example)
[Diagram: document stream → filtering → relevance judgment → reward/penalty, feeding updates to a model of the user’s interest]
• Goal of the filtering system:
– Maximize the total reward over the long run
– Balance exploration and exploitation
Rich Sensors
Can treat experience as state via instance-based view.
[Stored instances: ⟨image⟩, right, ⟨image⟩, +1; ⟨image⟩, left, ⟨image⟩, +0; ⟨image⟩, right, ⟨image⟩, +0; ⟨image⟩, right, ⟨image⟩, +1; current query: ⟨image⟩, right, ?]
With an appropriate similarity function, we can build an approximate transition model and derive a policy (Ormoneit & Sen 02).
Allows us to combine with MBIE-style approaches.
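A hedged sketch of the instance-based idea: weight stored transitions by a similarity kernel to the current sensation and average their backed-up values (the Gaussian kernel and bandwidth are illustrative choices, not the exact construction in Ormoneit & Sen 02):

import numpy as np

def kernel_q_estimate(x, a, instances, q_next, gamma=0.9, bandwidth=1.0):
    # instances: stored tuples (x_i, a_i, x_i_next, r_i) with sensations as feature vectors.
    # q_next(x_next) should return the current estimate of max_a' Q(x_next, a').
    weights, targets = [], []
    for x_i, a_i, x_i_next, r_i in instances:
        if a_i != a:
            continue
        w = np.exp(-np.linalg.norm(x - x_i) ** 2 / (2 * bandwidth ** 2))  # similarity to x
        weights.append(w)
        targets.append(r_i + gamma * q_next(x_i_next))
    if not weights:
        return 0.0                                   # no similar experience for this action yet
    weights = np.array(weights)
    return float(np.dot(weights, targets) / weights.sum())   # kernel-weighted backup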
Exploration in Continuous Space
[Two plots of states visited in a continuous two-dimensional domain (x-axis: −1.2 to 0.3, y-axis: −0.07 to 0.07), with the GOAL region marked: random exploration on the left, efficient exploration on the right]
Random Exploration (graph on left)
• After 2000 steps, little progress is made
• Requires 20,000 steps on average to reach goal
Efficient Exploration (graph on right)
• Goal is reached after 1528 steps
Also robot and GPS data (non-Markovian)
with LeRoux
When Sensations Not Enough
Robust to weak non-Markovianness.
But, won’t take actions to gain information.
Network repair example (Littman, Ravi, Fenson, Howard 04).
• Recover from corrupted network interface config.
• Minimize time to repair.
• Info. gathering actions: PluggedIn, PingIp, PingLhost, PingGateway, DnsLookup, …
• Repair actions: RenewLease, UseCachedIP, FixIP.
Additional information helps to make the right choice.
How to Learn a Model?
Assume repair episodes are iid.
In each episode, some mix of actions taken.
New episode:
• Assume it is one of the previous episodes.
• Time-minimum plan under this assumption.
• “State” is subset of past episodes.
An instance-based approach to partial observability.
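A rough sketch of that episode-matching view (the data layout and the "all consistent episodes agree" stopping rule below are my illustrative reading of the approach, not the exact time-minimum planner):

def consistent_episodes(history, past_episodes):
    # history: dict mapping each info-gathering action tried so far to its observation.
    # Each past episode records the observations its actions produced and the repair
    # that eventually worked; the "state" is the subset still consistent with history.
    return [ep for ep in past_episodes
            if all(ep["observations"].get(a) == o for a, o in history.items())]

def choose_repair(history, past_episodes):
    # Simplest stand-in for the time-minimum plan: if every consistent past episode
    # agrees on the repair, perform it; otherwise gather more information first.
    repairs = {ep["repair"] for ep in consistent_episodes(history, past_episodes)}
    return repairs.pop() if len(repairs) == 1 else None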
Learning Network Troubleshooting
Recovery from a corrupted network interface configuration.
Java/Windows XP: minimize time to repair.
After 95 failure episodes
Non-Stationary Environments
Problem: To predict future events in the face of abrupt changes in the environment.
[Diagram: two options with income rates 2/3 and 1/3 and matching investment rates 2/3 and 1/3]
Animal behavior: match investment to income given multiple options.
Observation (Gallistel et al.): Abrupt changes in payoff rates result in abrupt changes in investment rates. Proposed change-detection algorithm.
with Diuk, Sharma
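Purely as an illustration of detecting such abrupt rate changes, a standard CUSUM-style test (not necessarily the algorithm referred to on the slide; the drift and threshold values are hypothetical):

def cusum_change_detector(rewards, target_rate, drift=0.05, threshold=2.0):
    # Accumulates deviations of the observed payoffs from target_rate in both directions
    # and flags the first step at which the accumulated evidence of a shift exceeds the
    # threshold.
    g_up = g_down = 0.0
    for t, r in enumerate(rewards):
        dev = r - target_rate
        g_up = max(0.0, g_up + dev - drift)      # evidence the payoff rate rose
        g_down = max(0.0, g_down - dev - drift)  # evidence the payoff rate fell
        if g_up > threshold or g_down > threshold:
            return t                             # change detected at step t
    return None                                  # no change detected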
Recognizing Changes in Disk Access
• Under real usage conditions, abrupt changes between usage modes.
• Detecting abrupt changes to mode can save energy
Portable computers use techniques such as disk spin-down/up to conserve energy. Given the history of the user’s disk accesses, the task is to predict how long it will be until the next disk access occurs.
Where We Went
Reinforcement learning: Lots of progress.
Let’s reconnect learning with real data:
• previous ideas contribute significantly
• model-based approaches showing promise
• new twists needed
• some fundamental new ideas needed:
– representation
– reward
– reasoning about change and partial observability
Large rewards yet to be found!