Evolutionary Algorithms for Reinforcement Learning


Page 1: Evolutionary Algorithms for Reinforcement Learning

Evolutionary Algorithms for Reinforcement Learning

David E. Moriarty, Alan C. Schultz, John J. Grefenstette

Page 2: Evolutionary Algorithms for Reinforcement Learning

Overview

– Reinforcement Learning
– TD Algorithms for Reinforcement Learning
– Evolutionary Algorithms for RL
– Policy Representations in EARL
– Fitness and Credit Assignment in EARL
– Strengths of EARL
– Limitations of EARL

Page 3: Evolutionary Algorithms for Reinforcement Learning

Reinforcement Learning

A flexible approach to the design of intelligent agents in situations for which both planning and supervised learning are impractical.

The goal is to solve sequential decision tasks through trial-and-error interactions with the environment. At any given time step $t$, the agent perceives its state $s_t$ and selects an action $a_t$. The system responds by giving the agent a numerical reward $r_t$ and moving into state $s_{t+1}$.
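A minimal sketch of this interaction loop in Python (the env and policy objects, with their reset/step methods, are hypothetical stand-ins for whatever task and agent are being studied, not details from the talk):

    def run_episode(env, policy, max_steps=100):
        # One trial of the agent-environment loop described above (sketch only;
        # `env` and `policy` are illustrative, not part of the original talk).
        total_reward = 0.0
        state = env.reset()                         # perceive initial state s_0
        for t in range(max_steps):
            action = policy(state)                  # select action a_t in state s_t
            state, reward, done = env.step(action)  # receive r_t, enter s_{t+1}
            total_reward += reward
            if done:
                break
        return total_reward                         # cumulative reward of the trial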

Page 4: Evolutionary Algorithms for Reinforcement Learning

Reinforcement Learning

The agent's goal is to learn a policy, $\pi: S \to A$. The optimal policy $\pi^*$ is typically defined as the policy that produces the greatest cumulative reward over all states:

$\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all $s \in S$,

where $V^{\pi}(s)$ is the cumulative reward received from state $s$ using policy $\pi$:

Infinite horizon (discounted): $V^{\pi}(s) = E\big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s\big]$, with discount factor $0 \le \gamma < 1$

Finite horizon: $V^{\pi}(s) = E\big[\sum_{t=0}^{h} r_{t} \mid s_0 = s\big]$

Page 5: Evolutionary Algorithms for Reinforcement Learning

Reinforcement Learning

The agent's state descriptions are usually identified with the values returned by its sensors. Often the sensors do not give the agent complete state information (partial observability).

RL provides a flexible approach to the design of intelligent agents in situations for which both planning and supervised learning are impractical.

RL can be applied to problems for which significant domain knowledge is either unavailable or costly to obtain.

Page 6: Evolutionary Algorithms for Reinforcement Learning

Reinforcement Learning

Policy space vs. value-function space. Goal: find an optimal policy $\pi^*$.

Policy-space search
– Maintain explicit representations of policies and modify them through a variety of search operators.
– Examples: dynamic programming, value iteration, simulated annealing, evolutionary algorithms.

Value-function-space search
– Attempt to learn the value function $V^*$, which returns the expected cumulative reward for the optimal policy from each state.
– Example: TD algorithms.

Page 7: Evolutionary Algorithms for Reinforcement Learning

Temporal Difference Algorithms

Use observations of prediction differences from consecutive states to update value predictions.

Value function update rule (TD(0)):

$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_t + \gamma V(s_{t+1}) - V(s_t) \big]$

Q-learning: compute the Q function, a value function that represents the expected value of taking action $a$ in state $s$ and behaving optimally thereafter.

Q function update rule:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]$
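A minimal sketch of the Q-learning update as code (tabular case; the dict-based Q table and parameter names are illustrative, not from the talk):

    from collections import defaultdict

    Q = defaultdict(float)        # tabular Q: Q[(state, action)], default 0.0
    alpha, gamma = 0.1, 0.9       # learning rate and discount factor

    def q_update(s, a, r, s_next, actions):
        # TD target: immediate reward plus discounted best next-state value.
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])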

Page 8: Evolutionary Algorithms for Reinforcement Learning

Evolutionary Algorithms for Reinforcement Learning (EARL)

Policy-space search using an evolutionary algorithm.

Requirements of an EA:
– An appropriate mapping between the search space and the space of chromosomes.
– An appropriate fitness function.

For many problems, an EA can be applied in a relatively straightforward manner.

The most critical design choice in an EA is the representation.
– It is a form of search bias, similar to biases in other ML methods.
– EAs are sensitive to the choice of representation.

Page 9: Evolutionary Algorithms for Reinforcement Learning

A Simple EARL

Use a single chromosome per policy, with a single gene associated with each observed state.

Each gene's value represents the action associated with the corresponding state.

Fitness can be evaluated during a single trial (deterministic case) or averaged over a sample of trials.

Basic crossover and mutation operators are used (a toy implementation follows).
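A toy, runnable sketch of this simple EARL in Python (the corridor task, population sizes, and operator rates are all illustrative assumptions, not details from the talk):

    import random

    N_STATES = 10                  # toy corridor: states 0..9, goal at state 9
    ACTIONS = [-1, +1]             # move left or move right

    def fitness(policy, max_steps=30):
        # One trial: fitness is higher the faster the agent reaches the goal.
        state = 0
        for step in range(max_steps):
            if state == N_STATES - 1:
                return max_steps - step
            state = min(max(state + ACTIONS[policy[state]], 0), N_STATES - 1)
        return 0                   # goal never reached

    def evolve(pop_size=20, generations=30, p_mut=0.05):
        # Chromosome = policy: one gene (an action index) per observed state.
        pop = [[random.randrange(len(ACTIONS)) for _ in range(N_STATES)]
               for _ in range(pop_size)]
        for _ in range(generations):
            ranked = sorted(pop, key=fitness, reverse=True)
            parents = ranked[:pop_size // 2]            # truncation selection
            children = []
            while len(children) < pop_size:
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, N_STATES)     # one-point crossover
                child = a[:cut] + b[cut:]
                child = [random.randrange(len(ACTIONS))
                         if random.random() < p_mut else g
                         for g in child]                # per-gene mutation
                children.append(child)
            pop = children
        return max(pop, key=fitness)

    best = evolve()
    print(best, fitness(best))

Because this toy environment is deterministic, a single trial per policy suffices for the fitness evaluation; a stochastic task would average over several trials, as the slide notes.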

Page 10: Evolutionary Algorithms for Reinforcement Learning

Policy Representations in EARL

Single-Chromosome Representation

Rule-based representation
– A set of condition-action rules.

Neural-net-based representation
– Use a function approximator and use the EA to adjust its parameters (sketched below).
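A minimal sketch of evolving the parameters of such a function approximator (a linear policy and a simple (1+1)-style perturbation search; all names and constants are illustrative assumptions):

    import random

    def act(weights, obs):
        # Tiny linear policy: the sign of the dot product picks the action.
        return 0 if sum(w * o for w, o in zip(weights, obs)) < 0 else 1

    def evolve_weights(fitness, n_inputs, steps=200, sigma=0.1):
        # (1+1)-style evolution: perturb all weights, keep the child if no worse.
        best = [random.uniform(-1, 1) for _ in range(n_inputs)]
        for _ in range(steps):
            child = [w + random.gauss(0, sigma) for w in best]
            if fitness(child) >= fitness(best):
                best = child
        return best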

Page 11: Evolutionary Algorithms for Reinforcement Learning

Policy Representations in EARL

Distributed Representation
– Allows evolution to work at a more detailed level.
– Permits the user to exploit background knowledge.

Rule-based representation: Learning Classifier Systems (LCS)
– Use an EA to evolve if-then rules, called classifiers, that map sensory input to an appropriate action.
– When sensory input is received, it is posted on the message list. If the left-hand side of a classifier matches a message on the message list, its right-hand side is posted on the message list. These new messages may subsequently trigger other classifiers (a sketch follows).
– Each chromosome represents a single decision rule, and the entire population represents the agent's policy.
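A minimal sketch of one classifier match cycle (ternary conditions with '#' wildcards, in the style of Holland's LCS; the particular rules shown are made up for illustration):

    def matches(condition, message):
        # '#' is a wildcard: a condition matches if every non-# bit agrees.
        return all(c == '#' or c == m for c, m in zip(condition, message))

    # Each classifier: (condition, message to post). Rules are illustrative.
    classifiers = [
        ("1##0", "0110"),
        ("01##", "1000"),
    ]

    def step(messages):
        # Fire every classifier whose condition matches some current message;
        # the messages it posts may trigger other classifiers next cycle.
        return [out for cond, out in classifiers
                if any(matches(cond, m) for m in messages)]

    print(step(["1010"]))    # sensory input posted on the message list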

Page 12: Evolutionary Algorithms for Reinforcement Learning

Policy Representations in EARL

[Figure: Holland's Learning Classifier System.]

[Figure: An LCS population for a grid world.]

Page 13: Evolutionary Algorithms for Reinforcement Learning

Policy Representations in EARL

Distributed neural-net-based representation
– Use a population of neurons and a population of network blueprints (sketched below).
– This uses the a priori knowledge that individual neurons are building blocks in neural networks.
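A minimal sketch of the two-population idea (a fixed-size hidden layer assembled from a blueprint of neuron indices; the sizes, the activation function, and all names are illustrative assumptions):

    import random

    N_NEURONS, HIDDEN, N_IN, N_OUT = 40, 4, 3, 2

    # Population of neurons: each one is a hidden unit's weight vectors.
    neurons = [([random.uniform(-1, 1) for _ in range(N_IN)],
                [random.uniform(-1, 1) for _ in range(N_OUT)])
               for _ in range(N_NEURONS)]

    # Population of blueprints: each lists the neurons that form one network.
    blueprints = [random.sample(range(N_NEURONS), HIDDEN) for _ in range(20)]

    def forward(blueprint, obs):
        # Assemble a network from the blueprint's neurons and evaluate it.
        out = [0.0] * N_OUT
        for idx in blueprint:
            w_in, w_out = neurons[idx]
            h = max(0.0, sum(w * o for w, o in zip(w_in, obs)))  # hidden unit
            for j in range(N_OUT):
                out[j] += w_out[j] * h
        return out

    print(forward(blueprints[0], [0.5, -0.2, 1.0]))

Fitness would be assigned to blueprints by evaluating their networks, and to individual neurons through the performance of the networks in which they participate.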

Page 14: Evolutionary Algorithms for Reinforcement Learning

Fitness and Credit Assignment in EARL

Policy-Level Credit Assignment

How to apportion the rewards of a sequence of decisions to individual decisions.

– In EARL, credit is implicitly assigned over extended sequences, since policies that prescribe poor individual decisions will have fewer offspring.

– In TD, the immediate reward and the estimated payoff are explicitly propagated back.

Page 15: Evolutionary Algorithms for Reinforcement Learning

Fitness and Credit Assignment in EARL

Subpolicy Credit Assignment

For distributed-representation EARLs, fitness is explicitly assigned to individual components.

Classifier systems
– Each classifier has a strength, which is updated using a TD-like method called the bucket brigade algorithm (sketched below).

SAMUEL
– Each gene maintains a quantity called strength.
– Strength plays a role in resolving conflicts and triggering mutation.
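A minimal sketch of a bucket-brigade-style strength update (each firing classifier pays a fraction of its strength as a bid to its predecessor in the activation chain; the constants and the simplified chain structure are illustrative assumptions):

    BID_FRACTION = 0.1

    def bucket_brigade(chain, strengths, reward):
        # `chain` is the sequence of classifier ids that fired, in order.
        # Each classifier pays a bid back to the one that fired before it;
        # the external reward goes to the last classifier in the chain.
        for prev, cur in zip(chain, chain[1:]):
            bid = BID_FRACTION * strengths[cur]
            strengths[cur] -= bid
            strengths[prev] += bid
        strengths[chain[-1]] += reward

    strengths = {"c1": 10.0, "c2": 10.0, "c3": 10.0}
    bucket_brigade(["c1", "c2", "c3"], strengths, reward=5.0)
    print(strengths)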

Page 16: Evolutionary Algorithms for Reinforcement Learning

Strengths of EARL

Scaling Up to Large State Spaces

Policy generalization
– Most EARLs specify the policy at a level of abstraction higher than an explicit mapping from observed states to actions.

Policy selection
– Attention is focused on profitable actions only, reducing the space requirements for policies.

Page 17: Evolutionary Algorithms for Reinforcement Learning

Strengths of EARL

Dealing with Incomplete State Information
– Implicitly distinguishes among ambiguous states.
– More robust than simple TD methods.

[Figures: a partially observable environment and the policy obtained.]

Page 18: Evolutionary Algorithms for Reinforcement Learning

Strengths of EARL

Simple TD methods are vulnerable to hidden state problems.

Because EARL methods associate credit with entire policies, they rely more on the net results of decision sequences than on sensor information that may be ambiguous.

The agent itself remains unable to distinguish the two blue states, but the EARL implicitly distinguishes among ambiguous states by rewarding policies that avoid the bad states.

Additional features, such as the agent's previous decisions and observations, can help disambiguate the two blue states.

Page 19: Evolutionary Algorithms for Reinforcement Learning

Strengths of EARL

Non-stationary Environments

As long as the environment changes slowly with respect to the time required to evaluate a population of policies, the population should be able to track a changing fitness landscape without any alteration of the algorithm.

Algorithms for ensuring diversity in evolving populations can be used:
– Fitness sharing (sketched below)
– Crowding
– Local mating
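A minimal sketch of fitness sharing (a simple count-based variant: raw fitness is divided by the number of similar individuals; the Hamming distance measure and threshold are illustrative assumptions):

    def hamming(a, b):
        # Genotype distance: number of genes on which two chromosomes differ.
        return sum(x != y for x, y in zip(a, b))

    def shared_fitness(pop, raw_fitness, sigma=2):
        # Divide each individual's raw fitness by its niche count: the number
        # of population members within distance sigma (including itself).
        return [raw_fitness(ind) /
                sum(1 for other in pop if hamming(ind, other) <= sigma)
                for ind in pop]

Individuals in crowded regions of the search space have their fitness discounted, which keeps several niches alive in the population at once.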

Page 20: Evolutionary Algorithms for Reinforcement Learning

Strengths of EARL

EARLs with distributed policy representations achieve diversity automatically and are well suited for adaptation in dynamic environments.

If the learning system can detect changes in the environment, an even more direct response is possible.

Anytime learning
– When an environmental change is detected, the population of policies is partially reinitialized, using previously learned policies selected on the basis of similarity between the previously encountered environment and the current environment.
– Having a population of policies helps buffer the system against errors in detecting environmental changes.

Page 21: Evolutionary Algorithms for Reinforcement Learning

Limitations of EARL

Online learning
– Evaluating a large population of policies requires a large number of experiences.
– It may be dangerous to permit an agent to perform random actions.
– Both objections apply to TD methods as well.

Rare states
– TD methods maintain statistics concerning every state-action pair.
– In an EARL, rare-state information may eventually be lost due to mutation.

Proofs of optimality
– Q-learning has a proof of optimality.
– No general theoretical tools are available that can be applied to realistic EARL problems.