Towards Real-Life Reinforcement Learning
Michael L. Littman, Rutgers University
Department of Computer Science
Rutgers Laboratory for Real-Life Reinforcement Learning
Where We’re Going
Introduce reinforcement learning
• why I think it’s exciting
Define the problem and current approaches
• highlight challenges of RL with real data
Current projects in my lab:
• efficient exploration
• rich sensors
• partial observability
• non-stationary environments
Why RL? Why AI?
• Make “Your plastic pal who’s fun to be with”
• Solve challenging problems in computer science
• Understand humanity
• Create useful tools
• Produce a decent movie
Creating Human-Level AI
Significant motivator in the early days
Big question still unanswered:
What kind of information do we need to put into our programs for them to be intelligent?
How ought we program intelligent machines?
• Program behavior?
• Program desires?
(Likely long haul even after we answer this.)
Impressive Accomplishment
Honda’s Asimo
• development began in 1999, building on 13 years of engineering experience.
• claimed “most advanced humanoid robot ever created”
• walks 1mph
And Yet…
Asimo is controlled/programmed by people:
• structure of the walk programmed in
• reactions to perturbations programmed in
• directed by technicians and puppeteers during the performance
• static stability
Crawl Before Walk
Impressive accomplishment:
• Fastest reported walk/crawl on an Aibo
• Gait pattern optimized automatically
Human “Crawling”
Seven types known:
• diagonal crawl
• rolling
• bottom shuffling
• commando
• “city crawl”
• one leg extended
• knee sliding
• skip it
Perhaps our programming isn’t for crawling at all, but for the desire for movement!
Reinforcement-Learning Hypothesis
Intelligent behavior arises from
the actions of an individual seeking to
maximize its received reward signals
in a complex and changing world.
Research program:
• identify where reward signals come from,
• develop algorithms that search the space of behaviors to maximize reward signals.
Find The Ball
Learn:
• which way to turn
• to minimize steps
• to see goal (ball)
• from camera input
• given experience.
The RL Problem
Input: <s1, a1, s2, r1>, <s2, a2, s3, r2>, …, st
Output: choose the at’s to maximize the discounted sum of the ri’s.
[Example experience: ⟨camera image⟩, right, ⟨camera image⟩, +1]
Problem Formalization: MDP
Most popular formalization: Markov decision process
Assume:
• States/sensations, actions discrete.
• Transitions, rewards stationary and Markov.
• Transition function: Pr(s’|s,a) = T(s,a,s’).
• Reward function: E[r|s,a] = R(s,a).
Then:
• Optimal policy: π*(s) = argmax_a Q*(s,a)
• where Q*(s,a) = R(s,a) + γ Σ_s’ T(s,a,s’) max_a’ Q*(s’,a’)
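When T and R are known, this fixed point can be computed by repeatedly applying the backup. A minimal value-iteration sketch in Python (the array shapes, discount, and tolerance here are illustrative assumptions, not from the talk):

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    # T[s, a, s'] = Pr(s'|s, a); R[s, a] = E[r|s, a]; both assumed given.
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                 # V(s') = max_a' Q(s', a')
        Q_new = R + gamma * T @ V         # Bellman backup for every (s, a)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# The optimal policy is then greedy: pi*(s) = argmax_a Q*(s, a), i.e. Q.argmax(axis=1).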
Find the Ball: MDP Version
• Actions: rotate left/right
• States: orientation
• Reward: +1 for facing ball, 0 otherwise
It Can Be Done: Q-learning
Since the optimal Q function is sufficient, use experience to estimate it (Watkins & Dayan 92).
Given <s, a, s’, r>:  Q(s,a) ← Q(s,a) + α_t (r + γ max_a’ Q(s’,a’) − Q(s,a))
If:
• all (s,a) pairs updated infinitely often
• Pr(s’|s,a) = T(s,a,s’), E[r|s,a] = R(s,a)
• Σ α_t = ∞, Σ α_t² < ∞
Then: Q(s,a) → Q*(s,a)
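A minimal tabular sketch of that update loop (the environment interface, the 1/n(s,a) step size, and the ε-greedy behavior policy are illustrative assumptions, not part of the convergence statement above):

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, gamma=0.9, epsilon=0.1):
    # env is assumed to expose reset() -> s and step(a) -> (s_next, r, done).
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]    # step sizes: sum is infinite, sum of squares is finite
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q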
Real-Life Reinforcement Learning
Emphasize learning with real* data.
Q-learning good, but might not be right here…
Mismatches to “Find the Ball” MDP:
• Efficient exploration: data is expensive
• Rich sensors: never see the same thing twice
• Aliasing: different states can look similar
• Non-stationarity: details change over time
* Or, if simulated, from simulators developed outside the AI community
Convergence of Policy
Logical disconnect:
• Q-learning converges in the limit to Q*(s,a).
• Best action is greedy: π*(s) = argmax_a Q*(s,a).
• But, for convergence, can’t starve (best) action.
Weak result (Singh, Jaakola, Littman, Szepesvári 00):
• GLIE (greedy in the limit with infinite exploration).
• Policy converges to optimal, not just Q values.
• Example: Decaying ε-greedy. On visit n(s) to state s, choose a random action with probability 1/n(s).
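A short sketch of that decaying ε-greedy rule (the state-indexed visit counter and names are illustrative):

import numpy as np

def glie_action(Q, s, visit_count):
    # On the n(s)-th visit to state s, choose a random action with probability 1/n(s);
    # otherwise act greedily with respect to the current Q estimates.
    visit_count[s] += 1
    if np.random.rand() < 1.0 / visit_count[s]:
        return int(np.random.randint(Q.shape[1]))   # explore
    return int(Q[s].argmax())                        # exploit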
Efficient Exploration
Limit is nice, but would like something faster.
Goal: Policy that’s ε-optimal with probability 1−δ
after a polynomial amount of experience.
E3 (Kearns & Singh 98):
• Use experience to estimate model (T and R).
• Find optimal greedy policy wrt the model.
• Use model uncertainty to guide exploration.
Similar to RMAX (Brafman & Tennenholtz 02).
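A rough sketch of the shared idea behind E3/RMAX as described above: estimate the model from counts and treat under-visited state-action pairs optimistically, so that planning with the resulting model drives exploration (the known-ness threshold m, reward bound r_max, and self-loop trick are my illustrative simplifications):

import numpy as np

def optimistic_model(counts, reward_sums, m=5, r_max=1.0):
    # counts[s, a, s'] and reward_sums[s, a] are tallied from experience.
    n_states, n_actions, _ = counts.shape
    n_sa = counts.sum(axis=2)                            # visits to each (s, a)
    T_hat = np.zeros_like(counts, dtype=float)
    R_hat = np.zeros((n_states, n_actions))
    known = n_sa >= m                                    # "known" state-action pairs
    T_hat[known] = counts[known] / n_sa[known][:, None]  # empirical transition estimates
    R_hat[known] = reward_sums[known] / n_sa[known]      # empirical mean rewards
    # Unknown pairs: pretend they self-loop with maximal reward, so the greedy
    # policy for this optimistic model is pulled toward exploring them.
    for s, a in zip(*np.where(~known)):
        T_hat[s, a, s] = 1.0
        R_hat[s, a] = r_max
    return T_hat, R_hat   # plan with, e.g., the value_iteration sketch above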
Model-Based Interval Estimation
For each state-action pair, keep a distribution & confidence intervals (Wiering 99; Strehl & Littman 04).
Of the distributions within the confidence interval, assume the most optimistic. (Can compute efficiently.)
Like the bandit case (Kaelbling 93; Fong 95): get optimal reward or learn something significant, fast.
[Diagram: chain MDP with states s1 … s5 and sMAX]
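A minimal sketch of using the confidence intervals optimistically: a bonus-style simplification that inflates the reward estimate by a term that shrinks like a confidence-interval width (this is not the exact MBIE computation; beta is an illustrative constant):

import numpy as np

def mbie_style_estimates(counts, reward_sums, beta=1.0):
    # counts[s, a, s'] and reward_sums[s, a] are tallied from experience.
    n_sa = np.maximum(counts.sum(axis=2), 1)            # avoid division by zero
    T_hat = counts / n_sa[..., None]                    # empirical transition estimates
    R_opt = reward_sums / n_sa + beta / np.sqrt(n_sa)   # optimistic reward (CI-width bonus)
    # Plan greedily with respect to (T_hat, R_opt); rarely tried pairs keep a large
    # bonus, so either we earn good reward or we quickly learn something significant.
    return T_hat, R_opt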
Some (non-real) MBIE Results
Each plot measures cumulative reward by trial.
Varied exploration parameters (6-arms MDP).
Text Filtering (real example)
[Diagram: document stream → filtering → relevance judgment → reward/penalty, feeding updates to a model of the user’s interest]
• Goal of the filtering system:
– Maximize the total reward over the long run
– Balance exploration and exploitation
Rich Sensors
Can treat experience as state via instance-based view.
[Stored instances: ⟨image⟩, right, ⟨image⟩, +1; ⟨image⟩, left, ⟨image⟩, +0; ⟨image⟩, right, ⟨image⟩, +0; ⟨image⟩, right, ⟨image⟩, +1; current query: ⟨image⟩, right, ?]
With an appropriate similarity function, we can build an approximate transition model and derive a policy (Ormoneit & Sen 02).
Allows us to combine with MBIE-style approaches.
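A hedged sketch of the instance-based idea: weight stored transitions by a similarity kernel to the current sensation and average their backed-up values (the Gaussian kernel and bandwidth are illustrative choices, not the exact construction in Ormoneit & Sen 02):

import numpy as np

def kernel_q_estimate(x, a, instances, q_next, gamma=0.9, bandwidth=1.0):
    # instances: stored tuples (x_i, a_i, x_i_next, r_i) with sensations as feature vectors.
    # q_next(x_next) should return the current estimate of max_a' Q(x_next, a').
    weights, targets = [], []
    for x_i, a_i, x_i_next, r_i in instances:
        if a_i != a:
            continue
        w = np.exp(-np.linalg.norm(x - x_i) ** 2 / (2 * bandwidth ** 2))  # similarity to x
        weights.append(w)
        targets.append(r_i + gamma * q_next(x_i_next))
    if not weights:
        return 0.0                                   # no similar experience for this action yet
    weights = np.array(weights)
    return float(np.dot(weights, targets) / weights.sum())   # kernel-weighted backup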
Exploration in Continuous Space
[Two plots of states visited in a continuous two-dimensional domain (x-axis: −1.2 to 0.3, y-axis: −0.07 to 0.07), with the GOAL region marked: random exploration on the left, efficient exploration on the right]
Random Exploration (graph on left)
• After 2000 steps, little progress is made
• Requires 20,000 steps on average to reach goal
Efficient Exploration (graph on right)
• Goal is reached after 1528 steps
Also robot and GPS data (non-Markovian)
with LeRoux
When Sensations Not Enough
Robust to weak non-Markovianness.
But, won’t take actions to gain information.
Network repair example (Littman, Ravi, Fenson, Howard 04).
• Recover from corrupted network interface config.
• Minimize time to repair.
• Info. gathering actions: PluggedIn, PingIp, PingLhost, PingGateway, DnsLookup, …
• Repair actions: RenewLease, UseCachedIP, FixIP.
Additional information helps to make the right choice.
How to Learn a Model?
Assume repair episodes are iid.
In each episode, some mix of actions taken.
New episode:
• Assume it is one of the previous episodes.
• Time-minimum plan under this assumption.
• “State” is subset of past episodes.
An instance-based approach to partial observability.
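A rough sketch of that episode-matching view (the data layout and the "all consistent episodes agree" stopping rule below are my illustrative reading of the approach, not the exact time-minimum planner):

def consistent_episodes(history, past_episodes):
    # history: dict mapping each info-gathering action tried so far to its observation.
    # Each past episode records the observations its actions produced and the repair
    # that eventually worked; the "state" is the subset still consistent with history.
    return [ep for ep in past_episodes
            if all(ep["observations"].get(a) == o for a, o in history.items())]

def choose_repair(history, past_episodes):
    # Simplest stand-in for the time-minimum plan: if every consistent past episode
    # agrees on the repair, perform it; otherwise gather more information first.
    repairs = {ep["repair"] for ep in consistent_episodes(history, past_episodes)}
    return repairs.pop() if len(repairs) == 1 else None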
Learning Network Troubleshooting
Recovery from a corrupted network interface configuration.
Java/Windows XP: minimize time to repair.
After 95 failure episodes
Non-Stationary Environments
Problem: To predict future events in the face of abrupt changes in the environment.
[Diagram: two options with income rates 2/3 and 1/3 and matching investment rates 2/3 and 1/3]
Animal behavior: match investment to income given multiple options.
Observation (Gallistel et al.): Abrupt changes in payoff rates result in abrupt changes in investment rates. Proposed change-detection algorithm.
with Diuk, Sharma
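Purely as an illustration of detecting such abrupt rate changes, a standard CUSUM-style test (not necessarily the algorithm referred to on the slide; the drift and threshold values are hypothetical):

def cusum_change_detector(rewards, target_rate, drift=0.05, threshold=2.0):
    # Accumulates deviations of the observed payoffs from target_rate in both directions
    # and flags the first step at which the accumulated evidence of a shift exceeds the
    # threshold.
    g_up = g_down = 0.0
    for t, r in enumerate(rewards):
        dev = r - target_rate
        g_up = max(0.0, g_up + dev - drift)      # evidence the payoff rate rose
        g_down = max(0.0, g_down - dev - drift)  # evidence the payoff rate fell
        if g_up > threshold or g_down > threshold:
            return t                             # change detected at step t
    return None                                  # no change detected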
Recognizing Changes in Disk Access
• Under real usage conditions, abrupt changes between usage modes.
• Detecting abrupt changes to mode can save energy
Portable computers use techniques such as disk spin-down/up to conserve energy. Given the history of the user’s disk accesses, the task is to predict how long it will be until the next disk access occurs.
Where We Went
Reinforcement learning: Lots of progress.
Let’s reconnect learning with real data:
• previous ideas contribute significantly
• model-based approaches showing promise
• new twists needed
• some fundamental new ideas needed:
– representation
– reward
– reasoning about change and partial observability
Large rewards yet to be found!