
Reinforcement learning

Emergent behaviour

Andrew Swann

Information Engineering

Strategic Research Centre

Rolls-Royce plc


2

Introduction

• Emerge means come into view, unplanned
• Emergence without guidance is problematic
• Reinforcement means rewarding desired behaviour
• No “teacher” to specify correct behaviour
• Behaviour must emerge during learning


3

Literature review

• This talk is a literature review of reinforcement learning.

• There is no need to make notes of the literature references because these are included in the handouts.

• Covers:
  1. Introducing reinforcement learning
  2. Scalability
  3. Applications
  4. Multi-agent systems


4

The promise of reinforcement learning

• “Its promise is beguiling – a way of programming agents by reward and punishment without needing to specify how the task is to be achieved. But there are formidable computational obstacles to fulfilling this promise.”

• Kaelbling, L., & Littman, M., & Moore, A., “Reinforcement learning: a survey”, Journal of Artificial Intelligence Research, 4, 1996, pp. 237-285.


5

Reinforcement learning framework

[Diagram: the agent receives situation s and reward r from the environment, and sends action a back to the environment.]
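A minimal sketch of this loop in code may help make the framework concrete. The Environment and Agent interfaces below are assumptions for illustration only; they are not taken from any of the cited papers.

```python
# Illustrative agent-environment loop (interfaces are assumed, not from the cited work).

class Environment:
    def reset(self):
        """Return the initial situation s."""
        raise NotImplementedError

    def step(self, action):
        """Apply action a and return (next situation s, reward r, done flag)."""
        raise NotImplementedError


class Agent:
    def act(self, situation):
        """Choose an action a given situation s."""
        raise NotImplementedError

    def learn(self, situation, action, reward, next_situation):
        """Update internal estimates from the observed transition."""
        raise NotImplementedError


def run_episode(env, agent):
    """One pass around the loop in the diagram above."""
    s = env.reset()
    done = False
    while not done:
        a = agent.act(s)                  # agent chooses action a
        s_next, r, done = env.step(a)     # environment returns situation s and reward r
        agent.learn(s, a, r, s_next)
        s = s_next
```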


6

Temporal difference learning

• Temporal difference algorithm “TD”:

$V(\mathrm{state}_t) \leftarrow V(\mathrm{state}_t) + \alpha \big( r + V(\mathrm{state}_{t+1}) - V(\mathrm{state}_t) \big)$

• $\alpha$ is the learning rate, normally < 1. r is the reward.
• A perfect V would not be altered by this update: the bracketed term would be zero.
• “Evolutionary Algorithms for Reinforcement Learning”, Moriarty, D., & Schultz, A., & Grefenstette, J., Journal of Artificial Intelligence Research, 11, 1999, pp. 241-276.
• This assumes a relatively small number of possible states.
• Over many iterations, the update rule adjusts the value of each state so that it agrees with its successors and, eventually, with the reward received at the end.
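A tabular sketch of this update rule, assuming (as the slide does) a small discrete state space; the episode format, a list of (state, reward) pairs, is an assumption made for illustration.

```python
from collections import defaultdict

def td_learn(episodes, alpha=0.1):
    """Tabular TD update: V(state_t) += alpha * (r + V(state_{t+1}) - V(state_t)).

    Each episode is assumed to be a list of (state, reward) pairs, with the
    reward usually zero until the final step.
    """
    V = defaultdict(float)  # every state's value starts at 0
    for episode in episodes:
        # pair each step with its successor and apply the update
        for (s, r), (s_next, _) in zip(episode, episode[1:]):
            V[s] += alpha * (r + V[s_next] - V[s])
        # final step has no successor: the target is just its reward
        s_last, r_last = episode[-1]
        V[s_last] += alpha * (r_last - V[s_last])
    return V

# Example: a three-state corridor rewarded only at the end
values = td_learn([[("A", 0.0), ("B", 0.0), ("C", 1.0)]] * 100)
```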


7

Q-learning

• The Q function represents the expected value of taking action a in state s and acting optimally thereafter.
• Removes the need for a model of the environment.
• The policy is computed by maximising the Q value over actions a.
• Propagates Q values back through time, without needing to predict the next state given the current state and action.
• Some noise is introduced into action selection, in order to promote exploration.

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r(s_t) + \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$
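A tabular Q-learning sketch with ε-greedy exploration as the “noise” in action selection. The environment interface matches the earlier framework sketch, and the discount factor is omitted to mirror the slide’s formula; both are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = defaultdict(float)  # Q[(state, action)], initially 0

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # exploration noise: occasionally pick a random action
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # no model of the environment is needed: only the observed transition
            target = r if done else r + max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```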


8

Scalability


9

Bias

• There are a variety of reinforcement learning techniques that work effectively on a variety of small problems.

• These do not scale to large problems unless knowledge is incorporated as bias to the agent system.

• This is because the number of states increases exponentially with the number of state variables.

• Kaelbling, L., & Littman, M., & Moore, A., “Reinforcement learning: a survey”, Journal of Artificial Intelligence Research, 4, 1996, pp. 237-285.


10

Bias

• Shaping means training on a simplified version of the problem before the actual problem (see the sketch after this list).

• Local reinforcement means training on parts of the problem separately. This is not always possible.

• The more bias is used, the less “beguiling” reinforcement learning becomes, because more work must be done by the programmer.
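A sketch of shaping as a training schedule. The make_env(difficulty) factory and the specific difficulty steps are assumptions for illustration; the key point is that the Q table learned on the simplified stages carries over to the actual problem.

```python
import random
from collections import defaultdict

def train_with_shaping(make_env, actions, difficulties=(0.25, 0.5, 1.0),
                       episodes_per_stage=500, alpha=0.1, epsilon=0.1):
    """Shaping: train on simplified versions of the task before the actual one.

    `make_env(d)` is assumed to build an environment whose difficulty d runs
    from a simplification (d < 1.0) up to the actual problem (d = 1.0).
    """
    Q = defaultdict(float)  # kept across stages, so easy-stage knowledge carries over
    for d in difficulties:
        env = make_env(d)
        for _ in range(episodes_per_stage):
            s = env.reset()
            done = False
            while not done:
                a = (random.choice(actions) if random.random() < epsilon
                     else max(actions, key=lambda a2: Q[(s, a2)]))
                s_next, r, done = env.step(a)
                target = r if done else r + max(Q[(s_next, a2)] for a2 in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
    return Q
```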


11

Infinite states

• RL can be applied to real-valued state spaces by generalisation, spreading information between similar states.
• Achieved by modelling the reward landscape using a neural network.
• State s and action a are the inputs, Q is the output.
• Rummery, G., & Niranjan, M., “On-line Q-learning using connectionist systems”, Technical Report 166, Cambridge University Engineering Department, September 1994.
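A minimal sketch of such a function approximator; PyTorch is used here purely as an assumption for illustration (Rummery & Niranjan used their own connectionist implementation). State and action are concatenated as the input and a single Q estimate is the output.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Multilayer perceptron mapping (state, action) -> estimated Q value."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single Q estimate
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def td_step(qnet, optimiser, state, action, target_q):
    """One gradient step towards a TD target (target_q would be computed as
    reward plus the best next-state Q value, held fixed during the step)."""
    loss = nn.functional.mse_loss(qnet(state, action), target_q)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```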


12

Example of emergence

• A robot moves around a room to reach a target.
• A random-jump option was added to the reinforcement learning system.
• The system produced an unanticipated solution: jump randomly until close to the target.
• This was more efficient than guided steps up to that point.


13

MLP versus alternatives

• Rummery reports that the multilayer perceptron (“MLP”) scales well to large input and state spaces.
• Radial basis functions and CMACs do not.
• [Rummery & Niranjan 94] Rummery, G., & Niranjan, M., “On-line Q-learning using connectionist systems”, Technical Report 166, Cambridge University Engineering Department, 1994.
• [Pyeatt & Howe 98] find decision trees more effective than the MLP, because the MLP “forgets” parts of the state space.
• Pyeatt, L., & Howe, A., “Decision tree function approximation in reinforcement learning”, Technical Report CS-98-112, Colorado State University, http://www.cs.colostate.edu/~pyeatt


14

MLP versus alternatives

• Pyeatt & Howe apply their methods to the pole-balancing problem, a simulated mountain car, and a simulated racing car.

• They found that Q-learning is less robust for cooperative learning than for single-agent learning.


15

RL problems are hard

• Even the best RL algorithms converge too slowly if rewards have to wait until the goal is reached [Caironi & Dorigo 94].
• A human trainer can solve this problem by assessing the agent’s progress and rewarding it as appropriate.
• Analogous to “clicker” training of dogs.
• The trainer’s reward should be learned separately from the environment reward, so that the trainer can later be disconnected (see the sketch below).
• Caironi, P., & Dorigo, M., “Training and delayed reinforcements in Q-learning agents”, Technical Report, University of Brussels, 1994.
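A sketch of keeping the trainer’s reward separate from the environment’s so that the trainer can later be disconnected. The two-table arrangement below is an illustrative assumption, not the exact scheme of Caironi & Dorigo.

```python
from collections import defaultdict

class TrainedQAgent:
    """Keeps separate value estimates for environment reward and trainer reward,
    so the trainer's contribution can be switched off after training."""

    def __init__(self, actions, alpha=0.1):
        self.actions = actions
        self.alpha = alpha
        self.Q_env = defaultdict(float)      # learned from environment reward
        self.Q_trainer = defaultdict(float)  # learned from trainer feedback
        self.use_trainer = True

    def value(self, s, a):
        q = self.Q_env[(s, a)]
        if self.use_trainer:
            q += self.Q_trainer[(s, a)]
        return q

    def act(self, s):
        return max(self.actions, key=lambda a: self.value(s, a))

    def learn(self, s, a, s_next, env_reward, trainer_reward=0.0):
        # update each table against its own reward signal
        for Q, r in ((self.Q_env, env_reward), (self.Q_trainer, trainer_reward)):
            target = r + max(Q[(s_next, a2)] for a2 in self.actions)
            Q[(s, a)] += self.alpha * (target - Q[(s, a)])

    def disconnect_trainer(self):
        self.use_trainer = False
```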


16

Competing agents

• Littman considers multi-agent reinforcement learning with exactly two competing agents, one maximising a term and the other minimising it, in a simple artificial game. A deterministic player can easily be beaten: think of the game “rock, paper, scissors”.

• Littman, M., “Markov games as a framework for multi-agent reinforcement learning”, Proceedings of the 11th International Conference on Machine Learning, pp. 157-163, New Brunswick, New Jersey, 1994.


17

Competing agents

• Littman found that training two Q-learning players against each other gave far better results than training one against a random player.

• Surprising, as a random player should improve exploration.


18

Applications


19

Backgammon

• [Tesauro 95] reports extraordinarily good results using RL to play Backgammon.

• World class play was achieved using a neural network.

• However, this level of success has not been achieved on any other problem.

• Tesauro, G., “Temporal difference learning and TD-Gammon”, Communications of the ACM, March 1995, Vol 38, no. 3.


20

Dynamic JSS

• [Aydin & Oztemel 2000] apply RL agents to dynamic job-shop scheduling (“JSS”).
• Jobs may arrive at the shop at any time.
• Generalisation is achieved using hard-c-means instead of a neural net. Successful results were achieved.
• Aydin, M., & Oztemel, E., “Dynamic job-shop scheduling using reinforcement learning agents”, Robotics and Autonomous Systems, 33, 2000, pp. 169-178.


21

Deadlocks

• [Ohkura et al] apply RL to a system allowing deadlocks. Straightforward RL did not work because deadlocks were formed.

• This problem was solved by using two learning units in each agent, one deciding when to be “altruistic”.

• The application is an extremely simplified navigation problem.

• Ohkura, K., & Sera, K., & Ueda, K., “A learning multi-agent approach to dynamic scheduling for autonomous distributed manufacturing systems”, Technical Report, Faculty of Engineering, Kobe University, Japan, 1998.


22

Elevator control

• [Crites & Barto 97] apply multiple reinforcement learning agents to the problem of elevator group control.
• One agent per elevator car (see the sketch below).
• Reward is based on passenger waiting times.
• Groups of simple agents show interesting group behaviour, but it cannot yet be predicted.
• Reaching predefined goals therefore needs RL.
• Crites, R., & Barto, A., “Elevator group control using multiple reinforcement learning agents”, Dept. of Computer Science, University of Massachusetts, Amherst, MA 01003, USA, 1997.
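A sketch of the one-agent-per-car arrangement with a shared reward based on waiting times. The building interface and the simple per-car Q-learning here are assumptions for illustration, not Crites & Barto’s actual formulation.

```python
import random
from collections import defaultdict

class CarAgent:
    """One Q-learning agent per elevator car (illustrative sketch)."""

    def __init__(self, actions, alpha=0.1, epsilon=0.1):
        self.actions = actions
        self.alpha = alpha
        self.epsilon = epsilon
        self.Q = defaultdict(float)

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def learn(self, s, a, reward, s_next):
        target = reward + max(self.Q[(s_next, a2)] for a2 in self.actions)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])

def run_step(building, agents):
    """All cars act, then every agent learns from a shared reward based on
    passenger waiting times (negative waiting time, so shorter waits are better).
    `building` is an assumed environment with observe/apply/total_waiting_time."""
    states = [building.observe(i) for i in range(len(agents))]
    actions = [agent.act(s) for agent, s in zip(agents, states)]
    building.apply(actions)
    reward = -building.total_waiting_time()
    next_states = [building.observe(i) for i in range(len(agents))]
    for agent, s, a, s_next in zip(agents, states, actions, next_states):
        agent.learn(s, a, reward, s_next)
```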


23

Interactive browsing

• [Leuski 2000] applies RL to interactive browsing, which is his main subject of study.
• RL using neural networks.
• As the user examines the supposedly most relevant documents and reclassifies them, the system revises its estimates on that basis.
• It is not clear that this is really an RL problem, since the user specifies relevance, and this is the value to be estimated.


24

Interactive browsing

• Leuski, A., “Relevance and reinforcement in interactive browsing”, Technical Report, Centre for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts, Amherst, MA 01003, USA, 2000.

• http://www.ece.unh.edu/robots/cmac.htm


25

The Edge of Chaos

• Among many others, [Sims 94] claims that improved performance can be found on “the edge of chaos”, that is, the border between order and chaos.

• At this point the components of a system never quite lock and never dissolve into turbulence either.

• The edge of chaos is a battle between stagnation and anarchy.


26

The Edge of Chaos

• Sims asks us to remember that a flock is not a bird.
• However, a flock is not a business solution either, and Sims does not address the problem of evolving functionality.
• It is for this reason that RL is considered necessary for effective emergent behaviour.
• Sims, K., “Evolved Virtual Creatures: Examples from work in progress”, presentation, 1994.


27

Multi-Agent Systems (“MAS”)


28

Multi-Agent Systems (“MAS”)

• Sims gives a useful but highly speculative report of some hoped-for milestones for agent research.
• These are credited to:
• Murch, R., & Johnson, T., “Intelligent Software Agents”, Prentice Hall, Upper Saddle River, New Jersey, 1999.


29

MAS milestones

• 2005: Host-based stand-alone agents search the internet.

• 2005: Host-based and capable of negotiating with computers and other agents, involving many business and personal functions.

• 2010: Agents are mobile and highly personalized, but standalone.

• 2010: Agents are mobile and capable of negotiating with computers and other agents.

• 2020: Agents employ subagents.


30

MAS milestones

• 2050: Agents can activate and inhabit real-world robotics and pursue goals beyond software.

• 2050: Agents are self-replicating and can design agents to specific needs, independent and self-motivated.


31

MAS programming paradigm

• [Wong & Mikler 99] ignore agent learning, and consider MAS instead as a programming paradigm.
• They claim that it will scale to large problems.
• Wong, J., & Mikler, A., “Intelligent mobile agents in large distributed autonomous cooperative systems”, The Journal of Systems & Software, 47, 1999, pp. 75-87.


32

Football playing agents

• [Asada et al 95] make each agent learn to cope with other competitive agents, using the idea of goal scoring from football.

• Real robots were used, carrying out vision-based reinforcement learning.

• Shaping was used. The opponent goalkeeper was first stationary, then moving slowly.

• Final goal (sic) is learning to cooperatively pass the ball. [Some Premiership strikers take note!]


33

Football playing agents

• Asada, M., & Uchibe, E., & Hosoda, K., “Agents that learn from other competitive agents”, Technical Report, Department of Mechanical Engineering for Computer Controlled Machinery, Osaka University, 2-1 Yamadaoka, Suita, Osaka 565, Japan, 1995.


34

Agents and humans

• [Weiss & Dillenbourg 97] argue that the interaction of a large number of software agents should resemble human cooperation.

• In human subjects, individuals who provide explanations learn more than those who do not.

• Agent work on this basis has not been done, and may be optimistic in the short and medium term.

• Weiss, G., & Dillenbourg, P., “What is ‘multi’ in multi-agent learning?”, 1997.


35

Modelling other agents

• Hu & Wellman used agents that learn each other’s behaviour.
• The learned model correctly predicts the learned reactions of the other agent, but it does not learn how the other agent would react to different behaviour.
• They do not introduce noise to help exploration.
• Hu, J., & Wellman, M., “Self-fulfilling bias in multi-agent learning”, Proceedings of the International Conference on Multi-Agent Systems, pp. 118-125, Kyoto, 1996.


36

Predator prey

• Haynes & Sen consider learning predators and prey.
• Proximity to the prey was used as a guide.
• Haynes, T., & Sen, S., “Evolving behavioural strategies in predators and prey”, Adaptation and Learning in Multi-Agent Systems: Lecture Notes in Artificial Intelligence, Springer-Verlag, 1996, pp. 113-126.


37

Predators and prey

• Tan used hunter agents and prey agents.
• Predator performance improved when the predators could communicate their sensations.
• Tan, M., “Multi-agent reinforcement learning: independent versus cooperative agents”, Proceedings of the 10th International Conference on Machine Learning, pp. 330-337, Amherst, MA, 1993.


38

Harder predator prey

• Sandholm and Crites report that just slightly harder predator-prey problems give disappointing results.

• Associating current actions with future payoffs is hard.

• Sandholm, T., & Crites, R., “Multiagent reinforcement learning in the iterated prisoner’s dilemma”, Biosystems, 37, pp. 147-166, 1996.


39

Prisoner’s dilemma

• Sandholm & Crites also considered the iterated prisoner’s dilemma.

• Learning agents learned to cooperate with a tit-for-tat player.

• Exploration improved learning.


40

Pushing a block

• Sen and associates applied RL for MAS to pushing a block.
• The state space was quantised to 20 states (see the sketch below).
• The agents learned coordinated action without even being aware of each other.
• Sen, S., & Sekaran, M., & Hale, J., “Learning to coordinate without sharing information”, Proceedings of the 12th National Conference on Artificial Intelligence, AAAI Press/MIT Press, 1994, pp. 426-431.
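As an illustration of that kind of quantisation (the coordinate range and 20-level split below are assumptions, not Sen et al.’s exact encoding):

```python
def quantise(position, lower, upper, levels=20):
    """Map a continuous coordinate into one of `levels` discrete states."""
    fraction = (position - lower) / (upper - lower)
    return min(levels - 1, max(0, int(fraction * levels)))

# Example: a block coordinate between 0.0 and 10.0 becomes a state index 0..19
state_index = quantise(3.7, 0.0, 10.0)  # -> 7
```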


41

Pushing a block

[Diagram: agents pushing a block along the ideal path towards the goal position.]


42

Summary

• Littman found that training two Q-learning players against each other gave far better results than training one against a random player.

• Surprising because a random player improves exploration.

• Pyeatt & Howe found that Q-learning is less robust for cooperative learning than for single-agent learning.

• Crites & Barto argue that bottom-up learning produces unpredictable results, and RL is needed for predefined goals.


43

Summary

• There are a variety of reinforcement learning techniques that work effectively on a variety of small problems.

• However, these do not scale to large problems unless knowledge is incorporated as bias to the agent system (Kaelbling et al).

• If this claim is true, it limits the usefulness of reinforcement learning.

• Most work on RL seems to be done at the Computer Science Department at the University of Massachusetts.