Page 1
Thomas Trappenberg
Reinforcement learning
Page 2
Three kinds of learning:
1.Supervised learning
2. Unsupervised Learning
3. Reinforcement learning
Supervised: a detailed teacher provides the desired output y for a given input x. Training set {x, y}; find an appropriate mapping function y = h(x; w).

Unsupervised: unlabeled samples are provided from which the system has to figure out good representations. Training set {x}; find sparse basis functions b_i so that x = Σ_i c_i b_i.

Reinforcement: delayed feedback from the environment in the form of reward or punishment when reaching state s with action a: reward r(s, a). Find the optimal policy a = π*(s). These are the most general learning circumstances.
Page 3
Maximize expected Utility
www.koerding.com
Page 4
2. Reinforcement learning
[Grid-world figure: each non-terminal state carries a reward of -0.1 per step]
From Russell and Norvig
Page 5
Markov Decision Process (MDP)
Page 6
Goal: maximize total expected payoff
Two important quantities
policy: a = π(s)

value function: V^π(s) = E[ Σ_t γ^t r_t | s_0 = s, π ]
Optimal Control
Page 7
Calculate value function (dynamic programming)
Richard Bellman (1920-1984)
Bellman equation for policy π:

V^π(s) = r(s) + γ Σ_s' P(s'|s, π(s)) V^π(s')
Deterministic policies to simplify notation
Solution: Analytic or Incremental
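The two solution routes can be sketched on a tiny two-state MDP (all numbers below are made up for illustration): solve the linear system (I - γP)V = r directly, or iterate the Bellman backup until it converges.

```python
# Policy evaluation for a fixed policy on a 2-state MDP (illustrative numbers).
gamma = 0.9
r = [1.0, 0.0]                      # reward r(s) in each state
P = [[0.5, 0.5],                    # P[s][s'] under the fixed policy
     [0.2, 0.8]]

# Analytic: solve (I - gamma*P) V = r by hand for 2 states
# (for larger MDPs one would use a linear-algebra routine).
a11 = 1 - gamma * P[0][0]; a12 = -gamma * P[0][1]
a21 = -gamma * P[1][0];    a22 = 1 - gamma * P[1][1]
det = a11 * a22 - a12 * a21
V_analytic = [( a22 * r[0] - a12 * r[1]) / det,
              (-a21 * r[0] + a11 * r[1]) / det]

# Incremental: repeat the Bellman backup V <- r + gamma * P V
V = [0.0, 0.0]
for _ in range(1000):
    V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(2))
         for s in range(2)]

print(V_analytic)
print(V)   # converges to the analytic solution
```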
Page 8
Remark on different formulations:
Some (like Sutton and Alpaydin, but not Russell & Norvig) define the value as the reward at the next state plus all the following rewards,

V^π(s) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s ]

instead of

V^π(s) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s ]
Page 9
Policy Iteration
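A minimal policy-iteration sketch on an assumed three-state, two-action toy MDP: alternate iterative policy evaluation with greedy improvement until the policy stops changing.

```python
# Policy iteration on a tiny 3-state, 2-action MDP (illustrative numbers).
gamma = 0.9
nS, nA = 3, 2
# P[s][a] = list of (prob, next_state); R[s][a] = immediate reward.
P = [[[(1.0, 1)], [(1.0, 2)]],
     [[(1.0, 0)], [(1.0, 2)]],
     [[(1.0, 2)], [(1.0, 2)]]]
R = [[0.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]

def q(s, a, V):
    return R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])

policy = [0] * nS
while True:
    # policy evaluation (iterative, run long enough to converge)
    V = [0.0] * nS
    for _ in range(500):
        V = [q(s, policy[s], V) for s in range(nS)]
    # greedy policy improvement
    new_policy = [max(range(nA), key=lambda a: q(s, a, V)) for s in range(nS)]
    if new_policy == policy:
        break                      # policy stable: optimal
    policy = new_policy

print(policy, V)
```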
Page 10
Value Iteration
Bellman equation for the optimal policy:

V*(s) = r(s) + γ max_a Σ_s' P(s'|s, a) V*(s')
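Iterating this backup directly gives value iteration; a sketch on an assumed two-state, two-action toy problem with deterministic transitions:

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- r(s) + gamma * max_a V(next(s, a))
# Illustrative 2-state, 2-action MDP with deterministic transitions.
gamma = 0.9
r = [0.0, 1.0]                 # reward r(s) in each state
nxt = [[0, 1], [0, 1]]         # nxt[s][a] = successor state
V = [0.0, 0.0]
for _ in range(200):
    V = [r[s] + gamma * max(V[nxt[s][a]] for a in range(2))
         for s in range(2)]
# read off the greedy policy from the converged value function
policy = [max(range(2), key=lambda a, s=s: V[nxt[s][a]]) for s in range(2)]
print(V, policy)
```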
Page 11
But:

- Environment not known a priori
- Observability of states
- Curse of dimensionality

Solution:

- Online learning (TD)
- POMDPs
- Model-based RL
Page 12
POMDP:

Partially observable MDPs can be reduced to MDPs by considering belief states b, i.e. probability distributions over the hidden state.
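The belief update behind this reduction is a Bayes filter: after taking action a and observing o, the new belief is b'(s') ∝ O(o|s') Σ_s P(s'|s,a) b(s). A sketch with made-up model numbers:

```python
# Bayes filter: update a belief state b after action a and observation o.
# P[a][s][s2] = transition model, O[s][o] = observation model (made-up numbers).
P = [[[0.9, 0.1],
      [0.2, 0.8]]]                 # a single action a=0
O = [[0.7, 0.3],                   # O[s][o]
     [0.1, 0.9]]

def belief_update(b, a, o):
    # predict: push the belief through the transition model
    pred = [sum(P[a][s][s2] * b[s] for s in range(2)) for s2 in range(2)]
    # correct: weight by the observation likelihood and renormalize
    post = [O[s2][o] * pred[s2] for s2 in range(2)]
    z = sum(post)
    return [p / z for p in post]

b = belief_update([0.5, 0.5], a=0, o=1)
print(b)   # observation o=1 shifts belief toward state 1
```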
Page 13
Online value function estimation (TD learning)

What if the environment is not completely known? If the environment is not known, use a Monte Carlo method with bootstrapping:

V(s_t) ← V(s_t) + α [ r_t + γ V(s_{t+1}) − V(s_t) ]

- V(s_t): the expected payoff before taking the step
- r_t + γ V(s_{t+1}): the expected payoff after taking the step = actual reward plus the discounted expected payoff of the next step
- their difference is the temporal difference (TD) error
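In code the TD update looks as follows; the transition matrix below is an assumption used only to simulate the environment, it is not known to the learner:

```python
import random

# Tabular TD(0) value estimation on a 2-state Markov reward process
# (transition and reward numbers are illustrative).
random.seed(0)
gamma, alpha = 0.9, 0.1
P = [[0.5, 0.5], [0.2, 0.8]]      # used only to sample environment steps
r = [1.0, 0.0]                    # reward received in each state

V = [0.0, 0.0]
s = 0
for _ in range(20000):
    s2 = random.choices([0, 1], weights=P[s])[0]   # environment step
    td_error = r[s] + gamma * V[s2] - V[s]         # temporal difference
    V[s] += alpha * td_error                       # bootstrapped update
    s = s2

print(V)  # approaches the true values of the underlying chain
```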
Page 14
Online optimal control: Exploitation versus Exploration
On-policy TD learning: Sarsa

Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') − Q(s,a) ]

Off-policy TD learning: Q-learning

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]
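A sketch contrasting the two updates on an assumed deterministic toy environment; the only difference is the backup target (the Sarsa variant is shown as a comment):

```python
import random

random.seed(1)
gamma, alpha, eps = 0.9, 0.2, 0.1
# Deterministic toy environment: nxt[s][a], rew[s][a] (illustrative numbers).
nxt = [[0, 1], [0, 1]]
rew = [[0.0, 0.0], [0.0, 1.0]]

Q = [[0.0, 0.0], [0.0, 0.0]]

def eps_greedy(s):
    if random.random() < eps:
        return random.randrange(2)                   # explore
    return max(range(2), key=lambda a: Q[s][a])      # exploit

s, a = 0, eps_greedy(0)
for _ in range(5000):
    s2, r = nxt[s][a], rew[s][a]
    a2 = eps_greedy(s2)
    # Sarsa (on-policy) would back up the action actually taken next:
    #   target = r + gamma * Q[s2][a2]
    # Q-learning (off-policy) backs up the greedy value of the next state:
    target = r + gamma * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])
    s, a = s2, a2

print(Q)  # Q[1][1] is largest: taking action 1 in state 1 pays off
```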
Page 15
Model-based RL: TD(λ)
Instead of the tabular methods mainly discussed before, use a function approximator V(s; w) with parameters w and a gradient-descent step (Sutton 1988), for example a neural network with weights w and the corresponding delta learning rule

Δw = α Σ_{t=1}^{m} (r − V(s_t; w)) ∇_w V(s_t; w)

when updating the weights after an episode of m steps. The only problem is that we receive the feedback r only after the m-th step, so we need to keep a memory (trace) of the sequence.
Page 16
Model-based RL: TD(λ) … alternative formulation
We can write the error as a telescoping sum,

r − V_t = Σ_{k=t}^{m} (V_{k+1} − V_k),   with V_{m+1} ≡ r,

and putting this into the formula and rearranging the sum gives

Δw = α Σ_{t=1}^{m} (V_{t+1} − V_t) Σ_{k=1}^{t} ∇_w V_k

We still need to keep the cumulative sum of the derivative terms, but otherwise it already looks closer to bootstrapping.
Page 17
Model-based RL: TD(λ)
We now introduce a new algorithm by weighting recent gradients more than those in the distant past:

Δw = α Σ_{t=1}^{m} (V_{t+1} − V_t) Σ_{k=1}^{t} λ^{t−k} ∇_w V_k

This is called the TD(λ) rule. For λ = 1 we recover the rule above, TD(1). Also interesting is the other extreme, TD(0),

Δw = α Σ_{t=1}^{m} (V_{t+1} − V_t) ∇_w V_t,

which uses the prediction V_{t+1} as the supervision signal for step t. Otherwise this is equivalent to supervised learning and can easily be generalized to hidden-layer networks.
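A sketch of the TD(λ) rule for a linear value approximator V(x) = wᵀx, where the eligibility trace keeps the λ-weighted sum of past gradients (the episodic toy task and constants are assumptions):

```python
# Linear TD(lambda): V(x) = dot(w, x); the eligibility trace e keeps the
# lambda-weighted sum of past gradients (for a linear V, the gradient is x).
gamma, alpha, lam = 1.0, 0.1, 0.8

# Toy episodic task: three states with one-hot features, reward 1 at the end.
episode = [([1, 0, 0], 0.0), ([0, 1, 0], 0.0), ([0, 0, 1], 1.0)]

w = [0.0, 0.0, 0.0]

def V(x):
    return sum(wi * xi for wi, xi in zip(w, x))

for _ in range(200):
    e = [0.0, 0.0, 0.0]                     # eligibility trace
    for t in range(len(episode)):
        x, r = episode[t]
        x_next = episode[t + 1][0] if t + 1 < len(episode) else None
        v_next = V(x_next) if x_next is not None else 0.0  # terminal: V = 0
        delta = r + gamma * v_next - V(x)   # TD error
        e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]  # decay + gradient
        w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]

print(w)  # each state's value approaches the return from that state (1.0)
```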
Page 18
Free-Energy-Based RL:
This can be generalized to Boltzmann machines (Sallans & Hinton 2004)
Paul Hollensen: Sparse, topographic RBM successfully learns to drive the e-puck and avoid obstacles, given training data (proximity sensors, motor speeds)
Page 19

Page 20
Ivan Pavlov (1849-1936), Nobel Prize 1904
Classical Conditioning
Rescorla-Wagner Model (1972)
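The Rescorla-Wagner update, ΔV_i = ε(λ − Σ_j V_j) for each stimulus i present on a trial, reproduces the classic blocking effect; a small simulation (learning rate and trial counts are assumptions):

```python
# Rescorla-Wagner: every stimulus present on a trial updates its
# associative strength by dV_i = eps * (lam - V_total).
eps = 0.1

def train(V, stimuli_present, lam, trials):
    for _ in range(trials):
        v_total = sum(V[s] for s in stimuli_present)
        for s in stimuli_present:
            V[s] += eps * (lam - v_total)
    return V

V = {"A": 0.0, "B": 0.0}
train(V, ["A"], lam=1.0, trials=50)        # phase 1: A alone -> reward
train(V, ["A", "B"], lam=1.0, trials=50)   # phase 2: A+B -> reward
print(V)  # A already predicts the reward, so B learns almost nothing: blocking
```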
Page 21
[Figure panels: Stimulus B + Stimulus A → reward; Stimulus A → no reward]
Wolfram Schultz
Reward Signals in the Brain
Page 22
Page 23
Maia & Frank 2011
Disorders with effects on the dopamine system:

- Parkinson's disease
- Tourette's syndrome
- ADHD
- Drug addiction
- Schizophrenia
Page 24
Conclusion and Outlook
Three basic categories of learning:
Supervised: lots of progress through statistical learning theory; kernel machines, graphical models, etc.

Unsupervised: hot research area with some progress; deep temporal learning

Reinforcement: important topic in animal behavior; model-based RL