hierarchical reinforcement learning ersin basaran 19/03/2005

Hierarchical Reinforcement Learning

Ersin Basaran, 19/03/2005

Outline

Reinforcement Learning
  RL Agent
  Policy

Hierarchical Reinforcement Learning
  The Need
  Sub-Goal Detection
  State Clusters
  Border States
  Continuous State and/or Action Spaces
  Options
  Macro Q-Learning with Parallel Option Discovery

Experimental Results

Reinforcement Learning

The agent observes the state and takes an action according to the policy

The policy is a function from the state space to the action space

The policy can be deterministic or non-deterministic

State and action spaces can be discrete, continuous or hybrid
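The difference between a deterministic and a non-deterministic policy can be sketched as follows. The two-state, two-action problem and all names below are hypothetical and chosen only for illustration.

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: exactly one action per state.
deterministic_policy = {"s0": "right", "s1": "left"}

# Non-deterministic (stochastic) policy: a probability distribution
# over actions for each state.
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def act_deterministic(state):
    """Return the single action the policy assigns to this state."""
    return deterministic_policy[state]

def act_stochastic(state, rng):
    """Sample an action from the policy's distribution for this state."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [act_stochastic("s0", rng) for _ in range(1000)]
```

In state "s0" the deterministic policy always returns "right", while the stochastic one returns "right" in roughly 80% of the samples.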

RL Agent

No model of the environment

The agent observes state s, takes action a, and goes into state s', observing reward r

The agent tries to maximize the total expected reward (return)

Finite state machine model

[Diagram: transition from state S to state S', labeled with action a and reward r]
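The interaction loop on this slide (observe s, take a, land in s' with reward r, no model of the environment) can be sketched with tabular Q-learning. The slides do not name a learning rule at this point, so Q-learning, the 5-state chain environment, and the values of alpha, gamma and epsilon are all assumptions made for illustration.

```python
import random

N_STATES = 5        # hypothetical chain: states 0..4, goal at state 4
ACTIONS = (-1, +1)  # move left / move right

def step(s, a):
    """Environment dynamics. The agent only sees the returned (s', r)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)  # explore
            else:
                # exploit, breaking ties between equal values at random
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s_next, r = step(s, a)
            # move Q(s, a) toward r + gamma * max_b Q(s', b)
            best_next = max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q

q = q_learning()
# Greedy policy read off the learned table for the non-terminal states.
greedy = {s: max(ACTIONS, key=lambda b: q[(s, b)]) for s in range(N_STATES - 1)}
```

The agent never calls step to plan ahead; it only uses the observed (s, a, r, s') tuples, which is what "no model of the environment" means here.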

Policy

In a flat RL model, the policy is a map from each state to a primitive action

Under the optimal policy, the action taken by the agent yields the highest expected return at each step

Can be kept in tabular format for small state and action spaces

Function approximators can be used for large state or action spaces (or continuous ones)
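A minimal sketch of the two representations: a lookup table for a small discrete problem, and a linear function approximator for a continuous state. The Q-values, the feature map and the weights are hypothetical numbers chosen for illustration, not learned values.

```python
ACTIONS = ("left", "right")

# Tabular case: one stored value per (state, action) pair.
q_table = {
    (0, "left"): 0.1, (0, "right"): 0.7,
    (1, "left"): 0.4, (1, "right"): 0.2,
}

def greedy_tabular(state):
    """Pick the action with the highest stored value for this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

# Approximate case: Q(s, a) is modeled as w_a . phi(s) for a
# continuous state s in [0, 1]. One weight vector per action.
weights = {"left": [0.5, -1.0], "right": [0.0, 1.0]}

def features(s):
    """Feature vector phi(s): a bias term plus the raw state."""
    return [1.0, s]

def q_approx(s, a):
    return sum(w * f for w, f in zip(weights[a], features(s)))

def greedy_approx(s):
    """Pick the action with the highest approximated value."""
    return max(ACTIONS, key=lambda a: q_approx(s, a))
```

The table needs one entry per state-action pair, while the approximator stores only two weights per action and generalizes across all states in the interval.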

The Need For Hierarchical RL

Increases performance

Applying RL to problems with large action and/or state spaces becomes feasible

Detection of sub-goals can help the agent build abstract actions defined over the primitive actions

Sub-goals and abstract actions can be reused in different tasks on the same domain, so knowledge is transferred between tasks

The policy of the agent can be translated into natural language

Sub-goal Detection

A sub-goal can be a single state, a subset of the state space, or a constraint in the state space

Reaching a sub-goal should help the agent reach the main goal (to get the highest return)

Sub-goals must be discovered by the agent autonomously

State Clusters

The states in a cluster are strongly connected to each other

The number of state transitions among clusters is small

The states at the two ends of a state transition between two different clusters are sub-goal candidates

Clusters can be hierarchical
  Different clusters can be in the same cluster at a higher level
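The cluster idea can be sketched as follows, assuming the cluster assignment is already known (the slides do not say how clusters are computed). The two-room transition list and the cluster labels are hypothetical.

```python
# Observed state transitions (s, s') in a hypothetical two-room world.
# States 0..3 form room A, states 4..7 form room B; 3 <-> 4 is the doorway.
transitions = [
    (0, 1), (1, 2), (2, 3), (3, 0), (1, 3),
    (3, 4), (4, 3),
    (4, 5), (5, 6), (6, 7), (7, 4), (5, 7),
]

# Assumed to be given by some clustering step that is not shown here.
cluster_of = {0: "A", 1: "A", 2: "A", 3: "A",
              4: "B", 5: "B", 6: "B", 7: "B"}

def subgoal_candidates(transitions, cluster_of):
    """Return the transitions that cross clusters and their end states."""
    crossing = [(s, t) for s, t in transitions
                if cluster_of[s] != cluster_of[t]]
    candidates = sorted({state for pair in crossing for state in pair})
    return crossing, candidates

crossing, candidates = subgoal_candidates(transitions, cluster_of)
```

Only 2 of the 12 transitions cross between the rooms, and their end states (3 and 4, the doorway) come out as the sub-goal candidates.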

Border States

Some actions cannot be applied in some states. These states are defined as border states

Border states are assumed to form a transition sequence: we can travel through the border states by taking some actions

Each end of this transition sequence is a candidate sub-goal, assuming the agent has sufficiently explored the environment

Border State Detection

For discrete action and state spaces:

  F(s): the set of states which can be reached from state s in one time unit

  G(s): if an action in G(s) is applied at state s, no state transition occurs

  H(s): if an action in H(s) is applied at state s, the agent moves to a different state
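The three sets can be computed directly when the transitions are deterministic. The sketch below uses a hypothetical 3x3 grid world whose outer walls block movement. It also assumes that F(s) excludes s itself, which the slide does not state.

```python
WIDTH, HEIGHT = 3, 3
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(s, a):
    """Deterministic move; an action that would leave the grid is blocked."""
    x, y = s
    dx, dy = ACTIONS[a]
    nx, ny = x + dx, y + dy
    if 0 <= nx < WIDTH and 0 <= ny < HEIGHT:
        return (nx, ny)
    return s

def F(s):
    """States reachable from s in one time unit (s itself excluded)."""
    return {step(s, a) for a in ACTIONS} - {s}

def G(s):
    """Actions that cause no state transition when applied at s."""
    return {a for a in ACTIONS if step(s, a) == s}

def H(s):
    """Actions that move the agent to a different state."""
    return {a for a in ACTIONS if step(s, a) != s}

corner, center = (0, 0), (1, 1)
```

At the corner (0, 0) the set G contains "up" and "left", so the corner is a border state. At the center (1, 1) the set G is empty.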

Border State Detection

Detect the longest state sequence $s_0, s_1, s_2, \ldots, s_{k-1}, s_k$ which satisfies the following constraints:

  $s_i \in F(s_{i+1})$ or $s_{i+1} \in F(s_i)$ for $0 \le i < k$

  $G(s_i) \cap G(s_{i+1}) \neq \emptyset$ for $0 < i < k-1$

  $H(s_0) \cap G(s_1) \neq \emptyset$

  $H(s_k) \cap G(s_{k-1}) \neq \emptyset$

$s_0$ and $s_k$ are candidate sub-goals
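A minimal sketch of the detection step, under one reading of the constraints: consecutive states are adjacent through F, interior neighbours share at least one blocked action (their G sets intersect), and each end state can apply an action that is blocked at its neighbour (H of the end intersects G of the neighbour). Treating the set relations as non-empty intersections is an assumption, and so are the five-state example and the brute-force search, which is only practical for tiny state spaces.

```python
def is_border_sequence(seq, F, G, H):
    """Check the constraints for a candidate sequence s_0..s_k."""
    k = len(seq) - 1
    if k < 1:
        return False
    # Consecutive states must be one-step neighbours in either direction.
    for i in range(k):
        if seq[i] not in F[seq[i + 1]] and seq[i + 1] not in F[seq[i]]:
            return False
    # Interior states (0 < i < k-1) share a blocked action with the next one.
    for i in range(1, k - 1):
        if not (G[seq[i]] & G[seq[i + 1]]):
            return False
    # Each end can apply an action that is blocked at its neighbour.
    return bool(H[seq[0]] & G[seq[1]]) and bool(H[seq[k]] & G[seq[k - 1]])

def longest_border_sequence(states, F, G, H):
    """Exhaustive search over simple paths (exponential time)."""
    best = []

    def extend(seq):
        nonlocal best
        if len(seq) > len(best) and is_border_sequence(seq, F, G, H):
            best = list(seq)
        last = seq[-1]
        for nxt in states:
            if nxt not in seq and (nxt in F[last] or last in F[nxt]):
                extend(seq + [nxt])

    for s in states:
        extend([s])
    return best

# Hypothetical strip of states 0..4 running along a wall:
# "up" is blocked at states 1, 2 and 3 and free at both ends.
states = [0, 1, 2, 3, 4]
F = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
G = {0: set(), 1: {"up"}, 2: {"up"}, 3: {"up"}, 4: set()}
H = {0: {"up", "left", "right"}, 1: {"left", "right"}, 2: {"left", "right"},
     3: {"left", "right"}, 4: {"up", "left", "right"}}

sequence = longest_border_sequence(states, F, G, H)
candidates = (sequence[0], sequence[-1])
```

In this example the whole strip is returned, and the two states just past the ends of the wall become the candidate sub-goals.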

Border States on Continuous State and Action Spaces

The environment is assumed to be bounded

State and action vectors can include both continuous and discrete dimensions

The derivative of the state vector with respect to the action vector can be used

The border state regions must have small derivatives for some action vectors

A large change in these derivatives is the indication of border state regions
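The derivative test can be sketched with a central finite difference. The one-dimensional environment, the step size eps and the 0.5 threshold are assumptions made for illustration; the slides do not specify how the derivative is estimated.

```python
def step(x, a):
    """Hypothetical bounded world: position x in [0, 1], action a is a shift."""
    return min(max(x + a, 0.0), 1.0)

def d_state_d_action(x, a, eps=1e-3):
    """Central finite-difference estimate of d(next state) / d(action)."""
    return (step(x, a + eps) - step(x, a - eps)) / (2 * eps)

def border_states(xs, a, threshold=0.5):
    """States where the next state barely responds to the action a."""
    return [x for x in xs if abs(d_state_d_action(x, a)) < threshold]

xs = [i / 10 for i in range(11)]         # sample states 0.0, 0.1, ..., 1.0
right_border = border_states(xs, 0.05)   # pushing right
left_border = border_states(xs, -0.05)   # pushing left
```

In the interior the derivative is close to 1 because the state follows the action. At the bound it drops to 0 for the action that pushes against the wall, and that drop is what marks the border region.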

Options

An option is a policy

It can be local (defined on a subset of the state space) or global

The option policy can use primitive actions or other options

It is hierarchical

Used to reach sub-goals

Macro Q-Learning with Parallel Option Discovery

The agent starts with no sub-goals and no options

It detects the sub-goals and learns the option policies and the main policy simultaneously

Options are added to and removed from the model according to the sub-goal detection algorithm

When a possible sub-goal is detected, a new option is added to the model to learn the policy that reaches this sub-goal

All option policies are updated in parallel

The agent generates an internal reward if a sub-goal is reached
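The parallel update with an internal reward can be sketched as follows. The class name SubgoalOption, the helper names, the reward of 1.0 and the one-step Q-learning rule are assumptions; the slides give no exact update formula at this point.

```python
class SubgoalOption:
    """Hypothetical container: one sub-goal and one Q-table per option."""

    def __init__(self, subgoal, actions):
        self.subgoal = subgoal
        self.actions = actions
        self.q = {}  # (state, action) -> value, defaults to 0.0

    def internal_reward(self, s_next):
        """Reward generated by the agent itself when the sub-goal is reached."""
        return 1.0 if s_next == self.subgoal else 0.0

    def update(self, s, a, s_next, alpha=0.5, gamma=0.9):
        """One-step Q-learning on the option's own table."""
        target = self.internal_reward(s_next)
        if s_next != self.subgoal:  # the sub-goal ends the option
            target += gamma * max(self.q.get((s_next, b), 0.0)
                                  for b in self.actions)
        old = self.q.get((s, a), 0.0)
        self.q[(s, a)] = old + alpha * (target - old)

def on_subgoal_detected(options, subgoal, actions):
    """Add a new option for a newly detected sub-goal."""
    if all(o.subgoal != subgoal for o in options):
        options.append(SubgoalOption(subgoal, actions))

def update_all(options, s, a, s_next):
    """Every option learns from the same primitive transition."""
    for o in options:
        o.update(s, a, s_next)

ACTIONS = (-1, +1)
options = []                                 # the agent starts with none
on_subgoal_detected(options, 2, ACTIONS)
on_subgoal_detected(options, 0, ACTIONS)
on_subgoal_detected(options, 2, ACTIONS)     # duplicate, ignored
update_all(options, s=1, a=+1, s_next=2)
update_all(options, s=1, a=-1, s_next=0)
```

A single transition updates every option's table, but only the option whose sub-goal was reached receives a non-zero internal reward.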

Macro Q-Learning with Parallel Option Discovery

An option is defined by the following: $O = (\pi_o, \beta_o, I_o, Q_o, r_o)$, where $Q_o$ is the Q-values for the option and $r_o$ is the internal reward signal associated with the option

The intra-option learning method is used
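A minimal sketch of the option tuple and of a one-step intra-option Q-learning update for the main, option-level values. Reading the first three components as the option policy, the termination condition and the initiation set of the standard options framework is an assumption. The field names, the chain example and the simplification that every option is available in the next state are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Set, Tuple

State = Hashable
Action = Hashable

@dataclass
class Option:
    name: str
    policy: Callable[[State], Action]        # option policy
    termination: Callable[[State], float]    # probability of stopping in a state
    initiation: Set[State]                   # states where the option may start
    q: Dict[Tuple[State, Action], float] = field(default_factory=dict)  # Q_o
    internal_reward: Callable[[State], float] = lambda s: 0.0           # r_o

def intra_option_update(q_main, options, s, a, r, s_next,
                        alpha=0.1, gamma=0.9):
    """Update every option that would have chosen action a in state s."""
    best_next = max(q_main.get((s_next, o.name), 0.0) for o in options)
    for o in options:
        if s in o.initiation and o.policy(s) == a:
            beta = o.termination(s_next)
            # Value of s': continue the option, or stop and pick the best one.
            cont = q_main.get((s_next, o.name), 0.0)
            u = (1.0 - beta) * cont + beta * best_next
            old = q_main.get((s, o.name), 0.0)
            q_main[(s, o.name)] = old + alpha * (r + gamma * u - old)

# Hypothetical chain 0..2 with one option per direction.
go_right = Option("go_right", policy=lambda s: +1,
                  termination=lambda s: 1.0 if s == 2 else 0.0,
                  initiation={0, 1})
go_left = Option("go_left", policy=lambda s: -1,
                 termination=lambda s: 1.0 if s == 0 else 0.0,
                 initiation={1, 2})

q_main = {}
intra_option_update(q_main, [go_right, go_left], s=1, a=+1, r=1.0, s_next=2)
```

Only go_right is consistent with the action taken, so only its value for state 1 changes. An option can therefore learn from a primitive step even when a different option is in control.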

Experiments

[Figure panels: Flat RL, Hierarchical RL]

Options in HRL

Questions and Suggestions!!!
