hierarchical reinforcement learning

54
Hierarchical Reinforcement Learning Mausam Survey and Comparison of HRL technique

Post on 30-Dec-2015

31 views

Category:

Documents

Tags:

• policy s

DESCRIPTION

Hierarchical Reinforcement Learning. Mausam. [A Survey and Comparison of HRL techniques]. The Outline of the Talk. MDPs and Bellman’s curse of dimensionality. RL: Simultaneous learning and planning. Explore avenues to speed up RL. Illustrate prominent HRL methods. - PowerPoint PPT Presentation

TRANSCRIPT

Hierarchical Reinforcement Learning

Mausam

[A Survey and Comparison of HRL techniques]

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.

RL: Simultaneous learning and planning.Explore avenues to speed up RL.

Illustrate prominent HRL methods.Compare prominent HRL methods.

Discuss future research.Summarise

Decision Making

Environment

Percept Action

What action next?

Slide courtesy Dan Weld

Personal Printerbot

States (S) : {loc,has-robot-printout, user-loc,has-user-printout},map

Actions (A) :{moven,moves,movee,movew, extend-arm,grab-page,release-pages}

Reward (R) : if h-u-po +20 else -1Goal (G) : All states with h-u-po true.

Start state : A state with h-u-po false.

Episodic Markov Decision Process

hS, A, P, R, G, s0i S : Set of environment states. A : Set of available actions. P : Probability Transition model. P(s’|

s,a)* R : Reward model. R(s)* G : Absorbing goal states. s0 : Start state. : Discount factor**.

* Markovian assumption.** bounds R for infinite horizon.

Episodic MDP ´ MDP with

absorbing goals

Episodic MDP ´ MDP with

absorbing goals

Goal of an Episodic MDP

Find a policy (S ! A), which:maximises expected discounted reward

for a a fully observable* Episodic MDP.if agent is allowed to execute for an

indefinite horizon.

* Non-noisy complete information perceptors

Solution of an Episodic MDP

Define V*(s) : Optimal reward starting in state s.

Value Iteration : Start with an estimate of V*(s) and successively re-estimate it to converge to a fixed point.

Complexity of Value Iteration

Each iteration – polynomial in |S|Number of iterations – polynomial in

|S|Overall – polynomial in |S|

Polynomial in |S| - |S| : exponential in number of features in the domain*.

* Bellman’s curse of dimensionality

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.

RL: Simultaneous learning and planning.

Explore avenues to speed up RL. Illustrate prominent HRL methods. Compare prominent HRL methods.

Discuss future research. Summarise

Learning

Environment

Data

•Gain knowledge•Gain understanding•Gain skills•Modification of behavioural tendency

Decision Making while Learning*Environment

PerceptsDatum Action

What action next?

•Gain knowledge•Gain understanding•Gain skills•Modification of behavioural tendency

* Known as ReinforcementLearning

Reinforcement LearningUnknown P and reward R.Learning Component : Estimate the P

and R values via data observed from the environment.

Planning Component : Decide which actions to take that will maximise reward.

Exploration vs. Exploitation GLIE (Greedy in Limit with

Infinite Exploration)

Learning

Model-based learningLearn the model, and do planningRequires less data, more computation

Model-free learningPlan without learning an explicit modelRequires a lot of data, less computation

Q-LearningInstead of learning, P and R, learn Q*

directly. Q*(s,a) : Optimal reward starting in

s, if the first action is a, and after that the optimal policy is followed.

Q* directly defines the optimal policy:

Optimal policy is the action with maximum

Q* value.

Optimal policy is the action with maximum

Q* value.

Q-Learning

Given an experience tuple hs,a,s’,ri

Under suitable assumptions, and GLIE exploration Q-Learning

converges to optimal.

New estimate of Q value

New estimate of Q value

Old estimate of Q value

Old estimate of Q value

Semi-MDP: When actions take time.

The Semi-MDP equation:

Semi-MDP Q-Learning equation:

where experience tuple is hs,a,s’,r,Ni

r = accumulated discounted reward while action a was executing.

Printerbot

Paul G. Allen Center has 85000 sq ft space

Each floor ~ 85000/7 ~ 12000 sq ftDiscretise location on a floor: 12000

parts.State Space (without map) :

2*2*12000*12000 --- very large!!!!!How do humans do the

decision making?

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.

RL: Simultaneous learning and planning.Explore avenues to speedup RL.Illustrate prominent HRL methods.Compare prominent HRL methods.

Discuss future research. Summarise

1. The Mathematical Perspective

A Structure Paradigm S : Relational MDP A : Concurrent MDP P : Dynamic Bayes Nets R : Continuous-state MDP G : Conjunction of state

variables V : Algebraic Decision Diagrams : Decision List (RMDP)

2. Modular Decision Making

2. Modular Decision Making

•Go out of room•Walk in hallway•Go in the room

2. Modular Decision Making

Humans plan modularly at different granularities of understanding.

Going out of one room is similar to going out of another room.

Navigation steps do not depend on whether we have the print out or not.

3. Background Knowledge

Classical Planners using additional control knowledge can scale up to larger problems.

(E.g. : HTN planning, TLPlan)What forms of control knowledge can

we provide to our Printerbot?First pick printouts, then deliver them.Navigation – consider rooms, hallway,

separately, etc.

A mechanism that exploits all three avenues : Hierarchies

1. Way to add a special (hierarchical) structure on different parameters of an MDP.

2. Draws from the intuition and reasoning in human decision making.

3. Way to provide additional control knowledge to the system.

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.

RL: Simultaneous learning and planning.Explore avenues to speedup RL.

Illustrate prominent HRL methods.Compare prominent HRL methods.

Discuss future research. Summarise

HierarchyHierarchy of : Behaviour, Skill,

Module, SubTask, Macro-action, etc.picking the pagescollision avoidancefetch pages phasewalk in hallway

HRL ´ RL with temporally extended actions

Hierarchical Algos ´ Gating Mechanism

Hierarchical Learning•Learning the gating function•Learning the individual behaviours•Learning both

*

*Can be a multi- level hierarchy.

g is a gateg is a gate

bi is a behaviour

bi is a behaviour

Option : Movee until end of hallway

Start : Any state in the hallway.

Execute : policy as shown.

Terminate : when s is end of hallway.

Options [Sutton, Precup, Singh’99]

An option is a well defined behaviour.o = h Io, o, o i

Io : Set of states (IoµS) in which o can be initiated.

os : Policy (S!A*) when o is executing.

o(s) : Probability that o terminates in s.

*Can be a policy over lower level options.

Learning

An option is temporally extended action with well defined policy.

Set of options (O) replaces the set of actions (A)

Learning occurs outside options.Learning over options ´ Semi MDP Q-

Learning.

Machine: Movee + Collision Avoidance

Movew Moven Moven Return

Movew Moves Moves Return

Movee Choose

Return

End of hallway

: End of hallway

Obstacle

Call M1

Call M2

M1

M2

Hierarchies of Abstract Machines[Parr, Russell’97]

A machine is a partial policy represented by a Finite State Automaton.

Node :Execute a ground action.Call a machine as a subroutine.Choose the next node.Return to the calling machine.

Hierarchies of Abstract Machines

A machine is a partial policy represented by a Finite State Automaton.

Node :Execute a ground action.Call a machine as subroutine.Choose the next node.Return to the calling machine.

LearningLearning occurs within machines, as

machines are only partially defined.Flatten all machines out and consider

states [s,m] where s is a world state, and m, a machine node ´ MDP

reduce(SoM) : Consider only states where machine node is a choice node ´ Semi-MDP.

Learning ¼ Semi-MDP Q-Learning

[Dietterich’00]

Root

Take GiveNavigate(loc)

DeliverFetch

Extend-arm Extend-armGrab Release

MoveeMovewMovesMoven

unordered

unordered

MAXQ Decomposition

Define C([s,i],j) as the reward received in i after j finishes.

Q([s,Fetch],Navigate(prr)) = V([s,Navigate(prr)])+C([s,Fetch],Navigate(prr))*

Express V in terms of CLearn C, instead of learning Q

*Observe the context-free nature of Q-value

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.

RL: Simultaneous learning and planning.Explore avenues to speedup RL.

Illustrate prominent HRL methods.Compare prominent HRL methods.

Discuss future research. Summarise

1. State AbstractionAbstract state : A state having

fewer state variables; different world states maps to the same abstract state.

If we can reduce some state variables, then we can reduce on the learning time considerably!

We may use different abstract states for different macro-actions.

State Abstraction in MAXQRelevance : Only some variables are

relevant for the task.Fetch : user-loc irrelevantNavigate(printer-room) : h-r-po,h-u-po,user-

locFewer params for V of lower levels.

Funnelling : Subtask maps many states to smaller set of states. Fetch : All states map to h-r-po=true,

loc=pr.room. Fewer params for C of higher levels.

State Abstraction in Options, HAM

Options : Learning required only in states that are terminal states for some option.

HAM : Original work has no abstraction.Extension: Three-way value

decomposition*:Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])

Similar abstractions are employed.*[Andre,Russell’02]

2. Optimality

Hierarchical Optimalityvs.

Recursive Optimality

OptimalityOptions :

HierarchicalUse (A [ O) : Global**Interrupt options

HAM : Hierarchical*MAXQ : Recursive*

* Can define eqns for both optimalities**Adv. of using macro-actions maybe lost.

3. Language ExpressivenessOption

Can only input a complete policyHAM

Can input a complete policy.Can input a task hierarchy.Can represent “amount of effort”.Later extended to partial programs.

MAXQCannot input a policy (full/partial)

4. Knowledge Requirements

OptionsRequires complete specification of policy.One could learn option policies – given

Medium requirementsMAXQ

Minimal requirements

5. Models advancedOptions : ConcurrencyHAM : Richer representation,

ConcurrencyMAXQ : Continuous time, state, actions;

Multi-agents, Average-reward.In general, more researchers have

followed MAXQLess input knowledgeValue decomposition

S : Options, MAXQ A : All P : None R : MAXQ G : All V : MAXQ : All

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.

RL: Simultaneous learning and planning.Explore avenues to speedup RL.

Illustrate prominent HRL methods.Compare prominent HRL methods.

Discuss future research. Summarise

Directions for Future Research

Bidirectional State AbstractionsHierarchies over other RL

researchModel based methodsFunction Approximators

Probabilistic PlanningHierarchical P and Hierarchical R

Imitation Learning

Directions for Future Research

TheoryBounds (goodness of hierarchy)Non-asymptotic analysis

Automated Discovery Discovery of HierarchiesDiscovery of State Abstraction

Apply…

Applications

Toy Robot Flight SimulatorAGV SchedulingKeepaway

soccer

Parts

Assemblies

Ware-house

P2 P1

P3 P4

D2

D3 D4

D1

Images courtesy various sources

Thinking Big…

"... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*.” -- David Andre

Use planners, theorem provers, etc. as components in big hierarchical solver.

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.

RL: Simultaneous learning and planning.Explore avenues to speedup RL.

Illustrate prominent HRL methods.Compare prominent HRL methods.

Discuss future research. Summarise

How to choose appropriate hierarchy

Look at available domain knowledgeIf some behaviours are completely

specified – optionsIf some behaviours are partially

specified – HAMIf less domain knowledge available –

MAXQWe can use all three to specify

different behaviours in tandem.

Main ideas in HRL community

Hierarchies speedup learningValue function decompositionState Abstractions Greedy non-hierarchical executionContext-free learning and pseudo-

rewardsPolicy improvement by re-estimation

and re-learning.