emilie kaufmann - inriachercheurs.lille.inria.fr/ekaufman/rlcours0.pdf · emilie kaufmannjcristal -...

Reinforcement Learning

Emilie Kaufmann([email protected])

Ecole Centrale de Lille, 2019/2020

Emilie Kaufmann | CRIStAL

This Reinforcement Learning (RL) class

▶ 8 lectures (2 hours each)

▶ 4 practical sessions (1 on bandits, 3 on RL)

▶ project presentation morning : January 27th, 2020 (4 hours)

Evaluation : one project (groups of 1 or 2)

▶ List of projects available on December 10th

Class jointly taught with :

▶ Olivier Pietquin (Google Brain)

Deep Reinforcement Learning (4 hours)

▶ Omar Darwiche-Domingues (Inria SequeL)

Practical Sessions of Reinforcement Learning (6 hours)


Useful References

▶ Some books

▶ many research papers (references in the slides)

▶ material from the first RL Summer School : https://rlss.inria.fr/program/



Reinforcement Learning :

Introduction


What is Reinforcement Learning ?

→ learning by “trial and error”

→ learning to behave in an unknown, stochastic environment by maximizing some real-valued reward signal

Example : learning to bike without a perfect knowledge of physics


Key RL concepts

A learning agent sequentially interacts with its environment by performing actions. Each action

▶ provides an instantaneous reward

▶ leads to an evolution of the agent’s state

Agent’s goal : act so as to maximize its total reward

source : Wikipedia
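The interaction loop described on this slide can be sketched in a few lines of Python. This is only an illustration: the environment `TwoStateEnv`, its 0.8 switching probability, its reward rule, and the policy `move_until_rewarded` are hypothetical names and values chosen for the example, not material from the course.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with states {0, 1} and actions {0: stay, 1: move}."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # "move" switches the state with probability 0.8 (assumed value);
        # "stay" never changes the state.
        if action == 1 and self.rng.random() < 0.8:
            self.state = 1 - self.state
        # Instantaneous reward: 1 when the agent ends up in state 1, else 0.
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward


def run_episode(env, policy, horizon=10):
    """Let the agent act for `horizon` steps and return its total reward."""
    state = env.state
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state)            # the agent performs an action
        state, reward = env.step(action)  # the state evolves, a reward is received
        total_reward += reward
    return total_reward


def move_until_rewarded(state):
    """Hypothetical hand-written policy: move from state 0, stay in state 1."""
    return 1 if state == 0 else 0


total = run_episode(TwoStateEnv(seed=0), move_until_rewarded, horizon=10)
print("total reward:", total)
```

Here the policy is fixed by hand. In reinforcement learning the agent does not know how `step` behaves and has to improve its policy from the rewards it observes.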


Key RL concepts

Keywords (high-level) :

▶ Reward : instantaneous feedback received after acting

▶ Value : total reward the agent can get in some state

▶ Policy : strategy to choose an action in a given state

Agent’s goal : find a policy that maximizes the value in each state

source : Wikipedia
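One common way to make these keywords precise is given below as a sketch. The discount factor $\gamma \in [0,1)$, the infinite horizon, and the notation $s_t, a_t, r_t$ for the state, action and reward at time $t$ are assumptions made for this illustration; the slide itself only speaks of "total reward".

```latex
% Value of a policy \pi in state s: expected (discounted) total reward
% collected when starting from s and choosing actions with \pi.
V^{\pi}(s) = \mathbb{E}^{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s \right]

% Agent's goal: find a policy \pi^{\star} that maximizes the value in each state.
V^{\pi^{\star}}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all states } s
```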


RL successes : Games (1/2)

From Backgammon...

1992, TD-gammon

... to Go

2015, AlphaGo

2017, AlphaGo Zero

→ RL agents learn new types of strategies


RL successes : Games (2/2)

▶ Learning to play from pixels (and rewards) : Atari Games

2010+ Deep Reinforcement Learning

▶ Recent challenges : multi-player / partial information games

OpenAI Five (2019), Pluribus (2019)


RL successes : Content Optimization

▶ online advertisement

→ action : display an ad / reward : click

▶ (sequential) recommender systems

→ action : recommend a movie / reward : rating
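The advertisement example (action : display an ad, reward : click) can be simulated as a multi-armed bandit. The sketch below uses UCB1, one specific UCB-type index chosen here for illustration. The function name `ucb1_ads`, the click probabilities, and the number of rounds are made-up assumptions, not values from the course.

```python
import math
import random


def ucb1_ads(click_probs, rounds, seed=0):
    """Simulate UCB1 choosing which ad to display.

    click_probs: assumed click probability of each ad (unknown to the learner,
                 only used to simulate the clicks).
    Returns the number of displays and the number of clicks per ad.
    """
    rng = random.Random(seed)
    n_ads = len(click_probs)
    counts = [0] * n_ads    # how many times each ad was displayed
    clicks = [0.0] * n_ads  # total reward (clicks) collected by each ad

    for t in range(1, rounds + 1):
        if t <= n_ads:
            ad = t - 1  # initialization: display each ad once
        else:
            # UCB1 index: empirical click rate plus an exploration bonus
            ad = max(
                range(n_ads),
                key=lambda a: clicks[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < click_probs[ad] else 0.0  # simulated click
        counts[ad] += 1
        clicks[ad] += reward
    return counts, clicks


counts, clicks = ucb1_ads([0.05, 0.10, 0.20], rounds=5000)
print("displays per ad:", counts)
print("clicks per ad:", clicks)
```

The exploration bonus shrinks as an ad is displayed more often, so ads with a low observed click rate are tried less and less without ever being ruled out completely.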


RL : Many potential applications

▶ Smart grid / microgrid management

source : ScienceDirect.com

Actions :

▶ charge or discharge storage systems

▶ turn on or off a renewable energy source

▶ buy energy from the market ...

Reward : - Cost


RL : Many potential applications

▶ Autonomous robotics

▶ Self-driving cars ?


History of RL

• Learning to behave from rewards : an old idea from psychology

▶ 1900s : observation of animal behavior (e.g. Thorndike 1911, “Law of Effect”)

“Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will [...] be more likely to recur.”

▶ 1920s : Pavlov’s work on conditional reflexes, first occurrence of “reinforcement” in animal learning

source : Wikipedia


History of RL

• Learning to behave from rewards : does it happen in the brain ?

▶ Olds and Milner 1954 : first experiments on electrical brain stimulation for controlling rat behavior

→ hypothesis that dopamine broadcasts a reward signal to the brain

▶ Today’s RL Dopamine :)

https://github.com/google/dopamine



History of RL

• Some steps towards computational RL

▶ 1950s, Shannon’s machines : “Theseus”, a mouse finding its way out of a maze, a chess player, a Rubik’s cube solver

▶ 1957, Bellman : Dynamic Programming (control of dynamical systems)

▶ 1961, Minsky : “Steps Toward Artificial Intelligence”

▶ 1978, Sutton : Temporal Difference Learning (artificial intelligence)

▶ 1989, Watkins : Q-Learning algorithm

Nowadays, reinforcement learning is mostly formalized as learning an optimal policy in an incompletely-known Markov Decision Process.


RL ⊆ ML

RL is also viewed as a sub-field of Machine Learning

3 types of Machine Learning (ML) tasks :

Supervised Learning

Learn to make predictions, based on a large batch of data for which the target variable is observed

Unsupervised Learning

Find some latent structure in data (clusters, low-rank structure...)

Reinforcement Learning

Learn to take decisions / influence the data collection process


Outline of the class

• Lecture 1. Markov Decision Processes (MDP), a formalization for reinforcement learning problem(s)

• Lecture 2. One-state, several actions : solving multi-armed bandits. UCB algorithms, Thompson Sampling

• Lecture 3. Solving an MDP with known parameters. Dynamic Programming, Value/Policy Iteration

• Lecture 4. First Reinforcement Learning algorithms. TD Learning, Q-Learning

• Lecture 5. Approximate Dynamic Programming

• Lecture 6. Deep Reinforcement Learning (O. Pietquin)

• Lecture 7. Policy Gradient Methods (O. Pietquin)

• Lecture 8. Bandit tools for RL. Bandit-based exploration, Monte-Carlo Tree Search Methods

