A Brief Introduction to Reinforcement Learning
TRANSCRIPT
Outline
• Characteristics of Reinforcement Learning (RL)
• Components of RL (MDP, value, policy, Bellman)
• Planning (policy iteration, value iteration)
• Model-free Prediction (MC, TD)
• Model-free Control (Q-Learning)
• Deep Reinforcement Learning (DQN)
Characteristics of RL: SL VS RL
• Supervised Learning
  • i.i.d. data
  • direct and strong supervision (label: what is the right thing to do)
  • instantaneous feedback
• Reinforcement Learning
  • sequential, non-i.i.d. data
  • no supervisor, only a reward signal (reward: whether what you did was good or bad)
  • delayed feedback
Components of RL: MDP
• A general framework for sequential decision making
• An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
  • $\mathcal{S}$: states
  • $\mathcal{A}$: actions
  • $\mathcal{P}$: transition probability, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • $\mathcal{R}$: reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$: discount factor, $\gamma \in [0, 1]$
• Markov property:
  "The future is independent of the past given the present"
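As a concrete illustration of the tuple, here is one way the pieces might be laid out in Python; the two-state MDP below is invented for demonstration, not taken from the lecture:

```python
# A toy MDP as plain Python dicts: P[s][a] is a list of (prob, next_state)
# pairs, R[s][a] is the expected immediate reward, gamma is the discount factor.
S = ["s0", "s1"]
A = ["stay", "move"]
P = {
    "s0": {"stay": [(1.0, "s0")], "move": [(0.9, "s1"), (0.1, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "move": [(0.9, "s0"), (0.1, "s1")]},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}
gamma = 0.9
```

The planning sketches later in this transcript reuse this dict layout.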
Components of RL: Policy & Return & Value
• Policy: $\pi(a|s) = \mathbb{P}[A_t = a \mid S_t = s]$
• Return: $G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
• State-value function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
• Action-value function: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
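As a quick worked example of the return formula, a few lines of Python (the reward sequence is invented for illustration):

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a finite episode.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```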
Components of RL: Bellman Equations
• Bellman Expectation Equation
  $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
  $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$
Components of RL: Bellman Equations
• Bellman Optimality Equation
  $v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$
Components of RL: Prediction VS Control
• Prediction
  given a policy, evaluate how much reward you can get by following that policy
• Control
  find an optimal policy that maximizes the cumulative future reward
Components of RL: Planning VS Learning
• Planning
  • the underlying MDP is known
  • the agent only needs to perform computations on the given model
  • dynamic programming (policy iteration, value iteration)
• Learning
  • the underlying MDP is initially unknown
  • the agent needs to interact with the environment
  • model-free (learn value / policy) / model-based (learn a model, plan on it)
Planning: Dynamic Programming
• Applied when optimal solutions can be decomposed into subproblems
• For prediction (iterative policy evaluation):
  • Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$
  • Output: value function $v_\pi$
• For control (policy iteration, value iteration):
  • Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
  • Output: optimal value function $v_*$ and optimal policy $\pi_*$
Planning: Iterative Policy Evaluation
• Iterative application of the Bellman Expectation backup:
  $v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$
  $v_1 \to v_2 \to \dots \to v_\pi$
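A minimal Python sketch of this backup, assuming the toy-MDP dict layout from the earlier sketch (P[s][a] as (prob, next_state) pairs, R[s][a] as expected rewards) plus pi[s][a] as action probabilities; all names are illustrative:

```python
# Iterative policy evaluation: sweep the Bellman expectation backup
# synchronously until the value function stops changing.
def policy_evaluation(S, A, P, R, gamma, pi, theta=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        new_V = {}
        for s in S:
            new_V[s] = sum(
                pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                for a in A
            )
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < theta:
            return V
```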
Planning: Policy Iteration
• Evaluate the given policy and get: $v_\pi$
• Get an improved policy by acting greedily: $\pi' = \mathrm{greedy}(v_\pi)$
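A sketch of the full loop, reusing policy_evaluation from the sketch above; here greedy(v_π) is realized by a one-step lookahead from V, and the dict layout remains an illustrative assumption:

```python
# One-step lookahead Q-values computed from V, for greedy improvement.
def q_from_v(A, P, R, gamma, V, s):
    return {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in A}

# Policy iteration: alternate evaluation and greedy improvement
# until the policy no longer changes.
def policy_iteration(S, A, P, R, gamma):
    pi = {s: {a: 1.0 / len(A) for a in A} for s in S}  # start uniform random
    while True:
        V = policy_evaluation(S, A, P, R, gamma, pi)
        stable = True
        for s in S:
            q = q_from_v(A, P, R, gamma, V, s)
            best = max(q, key=q.get)
            greedy = {a: float(a == best) for a in A}  # deterministic greedy policy
            if greedy != pi[s]:
                stable = False
            pi[s] = greedy
        if stable:
            return V, pi
```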
Planning: Value Iteration
• Iterative application of the Bellman Optimality backup:
  $v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$
  $v_1 \to v_2 \to \dots \to v_*$
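The corresponding sketch for value iteration, with the same assumed dict layout; the greedy policy with respect to the converged V is optimal:

```python
# Value iteration: sweep the Bellman optimality backup until convergence.
def value_iteration(S, A, P, R, gamma, theta=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        new_V = {}
        for s in S:
            new_V[s] = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in A
            )
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < theta:
            return V
```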
Planning: Synchronous DP Algorithms
Problem    | Bellman Equation                                  | Algorithm
Prediction | Bellman Expectation Equation                      | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + greedy improvement | Policy Iteration
Control    | Bellman Optimality Equation                       | Value Iteration
Recap: Components of RL: Planning VS Learning
(see the Planning VS Learning slide above)
Model-free Prediction: MC VS TD
• Monte Carlo (MC) Learning
  • learns from complete trajectories, no bootstrapping
  • estimates values by looking at sample returns: the empirical mean return
• Temporal Difference (TD) Learning
  • learns from incomplete episodes, by bootstrapping: the remainder of the trajectory is substituted with our estimate
  • updates a guess towards a guess
Model-free Prediction: MC
• Goal: learn $v_\pi$ from episodes of experience under policy $\pi$
• Recall: the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
• Recall: the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
• MC policy evaluation (every-visit MC): uses the empirical mean return instead of the expected return
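A sketch of every-visit MC prediction; the episode format, a list of (state, reward) pairs where the reward is the R_{t+1} received after leaving that state, is an assumption made for illustration:

```python
# Every-visit MC: average the observed returns that follow every
# visit to each state, across a batch of complete episodes.
from collections import defaultdict

def mc_prediction(episodes, gamma):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):  # compute G_t backwards
            g = reward + gamma * g
            returns_sum[state] += g
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```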
Model-free Prediction: MC -> TD
• Goal: learn $v_\pi$ from episodes of experience under policy $\pi$
• MC: updates $V(S_t)$ towards the actual return $G_t$:
  $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$
• TD: updates $V(S_t)$ towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$:
  $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$
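A matching TD(0) sketch; the transition-stream format (s, r, s_next, done) is again an illustrative assumption:

```python
# TD(0): after each step, move V(s) a little towards the
# bootstrapped target r + gamma * V(s'), without waiting for the episode to end.
from collections import defaultdict

def td0_prediction(transitions, gamma, alpha=0.1):
    V = defaultdict(float)
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])  # update a guess towards a guess
    return dict(V)
```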
Model-free Prediction: MC VS TD: Driving Home
Model-free Prediction: MC Backup
Model-free Prediction: TD Backup
Model-free Prediction: DP Backup
Model-free Prediction: Unified View
Model-free Control: Why model-free?
• The MDP is unknown: but experience can be sampled
• The MDP is known: but too big to use, except to sample from it
Recap: Planning: Policy Iteration
(see the Policy Iteration slide above)
Model-free Control: Generalized Policy Iteration
• Evaluate the given policy and get: $v_\pi$
• Get an improved policy by acting greedily: $\pi' = \mathrm{greedy}(v_\pi)$
Model-free Control: V -> Q
• Greedy policy improvement over V(s) requires a model of the MDP:
  $\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \mathcal{P}^a_{ss'} V(s') \right)$
• Greedy policy improvement over Q(s, a) is model-free:
  $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$
Model-free Control: SARSA
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$
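A tabular SARSA sketch against a hypothetical env object whose reset() returns a state and whose step(a) returns (next_state, reward, done); the interface and hyperparameters are assumptions for illustration:

```python
# SARSA (on-policy TD control) with epsilon-greedy exploration.
import random
from collections import defaultdict

def sarsa(env, actions, episodes, gamma=0.99, alpha=0.1, eps=0.1):
    Q = defaultdict(float)

    def pick(s):  # epsilon-greedy with respect to the current Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = pick(s2)
            target = r if done else r + gamma * Q[(s2, a2)]  # bootstrap on the action actually taken next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```

SARSA is on-policy: the target uses the action the behaviour policy actually takes next, so exploration noise feeds into the learned values.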
Model-free Control: Q-Learning
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$
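A Q-learning sketch with the same hypothetical env interface as the SARSA sketch; the loop is identical except for the target:

```python
# Q-learning (off-policy TD control): the target bootstraps from
# max_a' Q(s', a') rather than from the action actually taken.
import random
from collections import defaultdict

def q_learning(env, actions, episodes, gamma=0.99, alpha=0.1, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:  # epsilon-greedy behaviour policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, x)] for x in actions)
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

Because the target uses the greedy maximum regardless of the action taken, Q-learning learns about the greedy policy while following an exploratory one.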
Model-free Control: SARSA VS Q-Learning
Model-free Control: DP VS TD
Deep Reinforcement Learning: Why?
• So far we have represented the value function by a lookup table
  • every state s has an entry V(s)
  • every state-action pair (s, a) has an entry Q(s, a)
• Problem with large MDPs:
  • too many states and/or actions to store in memory
  • too slow to learn the value of each state individually
Deep Reinforcement Learning: How to?
• Use deep networks to represent:
  • the value function (value-based methods)
  • the policy (policy-based methods)
  • the model (model-based methods)
• Optimize the value function / policy / model end-to-end
Deep Reinforcement Learning: Q-learning -> DQN
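A minimal sketch of the DQN idea in PyTorch: approximate Q(s, ·) with a network, sample past transitions from a replay buffer, and regress towards a target computed from a periodically-synced frozen copy of the network. Network sizes, the loss choice, and hyperparameters are invented for illustration:

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, x):
        return self.net(x)  # one Q-value per action

q_net, target_net = QNet(4, 2), QNet(4, 2)
target_net.load_state_dict(q_net.state_dict())  # frozen copy, re-synced periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # stores (s, a, r, s2, done) tuples
gamma = 0.99

def train_step(batch_size=32):
    # assumes the agent loop has pushed at least batch_size transitions into replay
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # TD target from the frozen network
        target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Replay breaks the correlation between consecutive samples (restoring something closer to the i.i.d. setting of supervised learning), and the frozen target network keeps the regression target from moving at every step.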
Deep Reinforcement Learning: AI = RL + DL
• Reinforcement Learning (RL)
  • a general-purpose framework for decision making
  • learn policies to maximize future reward
• Deep Learning (DL)
  • a general-purpose framework for representation learning
  • given an objective, learn the representation required to achieve it
• DRL: a single agent which can solve any human-level task
  • RL defines the objective
  • DL gives the mechanism
  • RL + DL = general intelligence
Some Recommendations
• Reinforcement Learning, David Silver's lectures on YouTube
• Reinforcement Learning: An Introduction, Richard Sutton & Andrew Barto, 2nd Edition
• DQN Nature paper: Human-level Control Through Deep Reinforcement Learning
• Flappy Bird:
  • Tabular RL: https://github.com/SarvagyaVaish/FlappyBirdRL
  • Deep RL: https://github.com/songrotek/DRL-FlappyBird
• Many third-party implementations; just search for "deep reinforcement learning", "dqn", "a3c" on GitHub
• My implementations in PyTorch: https://github.com/jingweiz/pytorch-rl