11-777 Multimodal Machine Learning Fall 2020
TRANSCRIPT
Intro to Reinforcement Learning, Part I
11-777 Multimodal Machine Learning Fall 2020
Paul Liang
[email protected], @pliang279
Admin
Matching for midterm presentation
→ Give feedback to relevant teams
Due by Wednesday 10/28 8pm ET
https://forms.gle/fEQgmk4g6Yfop6Ct5
Admin
Instructions for the midterm presentations are on piazza resources:
- https://piazza.com/class_profile/get_resource/kcnr11wq24q6z7/kg742kik5kv6tg
- Deadline for pre-recorded presentation: Friday, November 13th, 2020 at 8pm ET
- 7 minutes, mostly about error analysis and updated ideas; don't try to present everything…
Instructions for midterm report are also online:
- https://piazza.com/class_profile/get_resource/kcnr11wq24q6z7/kgraer741fw3n4
- Deadline: Sunday, November 15th, 2020
- 8 pages for teams of 3 and 9 pages for the other teams
- Multimodal baselines, error analysis, proposed ideas
Used Materials
Acknowledgement: Some of the material and slides for this lecture were borrowed from the Deep RL Bootcamp at UC Berkeley organized by Pieter Abbeel, Yan Duan, Xi Chen, and Andrej Karpathy, as well as Katerina Fragkiadaki and Ruslan Salakhutdinov’s 10-703 course at CMU, who in turn borrowed much from Rich Sutton’s class and David Silver’s class on Reinforcement Learning.
Contents
● Introduction to RL
● Markov Decision Processes (MDPs)
● Solving known MDPs using value and policy iteration
● Solving unknown MDPs using function approximation and Q-learning
Reinforcement Learning
ALVINN, 1989 · AlphaGo, 2016 · DQN, 2015
Reinforcement Learning
Trajectory
Markov Decision Processes (MDPs)
Trajectory
A state should summarize all past information and have the Markov property (Markov assumption + fully observable).
- We should be able to throw away the history once the state is known
- If some information is only partially observable: Partially Observable MDP (POMDP)
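The property itself appears as an equation on the slide; in standard notation it says the next state depends only on the current state and action:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$$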
We aim to maximize the total discounted reward, i.e. the return, computed with a discount factor γ.
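The return equation itself is an image on the slide; the standard form, with discount factor $\gamma \in [0, 1]$:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$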
Policy
Definition: A policy is a distribution over actions given states
- A policy fully defines the behavior of an agent
- The policy is stationary (time-independent)
- During learning, the agent changes its policy as a result of experience
Special case: deterministic policies
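In symbols (the slide's notation, reconstructed):

$$\pi(a \mid s) = P[a_t = a \mid s_t = s], \qquad \text{deterministic case: } a = \pi(s)$$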
Learn the optimal policy to maximize return
Goal:
Return:
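Both equations are images on the slide; standard reconstructions:

$$\text{Goal: } \pi^* = \arg\max_\pi \mathbb{E}_\pi\left[\sum_{t \ge 0} \gamma^t r_t\right], \qquad \text{Return: } \sum_{t \ge 0} \gamma^t r_t$$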
Reinforcement Learning vs Supervised Learning

| Reinforcement Learning | Supervised Learning |
| --- | --- |
| Sequential decision making | One-step decision making |
| Maximize cumulative reward | Maximize immediate reward |
| Sparse rewards | Dense supervision |
| Environment may be unknown | Environment always known |
Intersection between RL and supervised learning: imitation learning!

Obtain expert trajectories (e.g. human driver/video demonstrations):

D = {(s0, a*0), (s1, a*1), (s2, a*2), …}

Perform supervised learning by predicting the expert action (a minimal sketch of this idea follows below).

But:
- distribution mismatch between training and testing
- hard to recover from sub-optimal states
- sometimes not safe/possible to collect expert trajectories
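A minimal behavior-cloning sketch of that supervised step (the shapes, random data, and linear softmax policy are all hypothetical stand-ins for whatever model the slides use):

```python
# Behavior cloning: supervised classification of the expert's action from the state.
import numpy as np

rng = np.random.default_rng(0)
n, state_dim, n_actions = 1000, 8, 4

# Stand-in for D = {(s_i, a*_i)}: random states with expert action labels.
states = rng.normal(size=(n, state_dim))
expert_actions = rng.integers(n_actions, size=n)

W = np.zeros((state_dim, n_actions))  # linear policy logits

for _ in range(200):  # plain softmax regression on (s, a*) pairs
    logits = states @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), expert_actions] -= 1.0   # d(cross-entropy)/d(logits)
    W -= 0.1 * states.T @ p / n

policy = lambda s: np.argmax(s @ W)  # imitate: pick the predicted expert action
```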
Recap: learn the optimal policy to maximize return (goal and return as defined above).
State and action value functions

- Definition: the state-value function V^π(s) of an MDP is the expected return starting from state s and following policy π
- Definition: the action-value function Q^π(s,a) is the expected return starting from state s, taking action a, and then following policy π

Both capture long-term reward.
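In symbols:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right]$$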
Optimal state and action value functions

- Definition: the optimal state-value function V*(s) is the maximum value function over all policies
- Definition: the optimal action-value function Q*(s,a) is the maximum action-value function over all policies
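In symbols:

$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s, a) = \max_\pi Q^\pi(s, a)$$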
Solving MDPs
Value functions
- Value functions measure the goodness of a particular state or state/action pair: how good is it for an agent to be in a particular state or execute a particular action at a particular state, for a given policy
- Optimal value functions measure the best possible states or state/action pairs under all possible policies
Relationships between state and action values

State value functions ↔ Action value functions
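The slides' equations aren't captured in the transcript; the standard relationships are:

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a), \qquad Q^\pi(s, a) = \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma V^\pi(s')\right]$$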
Obtaining the optimal policy

The optimal policy can be found by maximizing over Q*(s,a).

It can also be found by maximizing over V*(s') with a one-step look-ahead.
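In symbols (standard form):

$$\pi^*(s) = \arg\max_a Q^*(s, a) = \arg\max_a \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma V^*(s')\right]$$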
Bellman expectation

Recursively, the return decomposes into the immediate reward plus the discounted return from the next step; taking expectations gives the Bellman expectation equations.

So, how do we find Q*(s,a) and V*(s)?
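The equations themselves are images on the slides; the standard forms:

$$G_t = r_{t+1} + \gamma G_{t+1}$$

$$V^\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\right], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\left[r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a\right]$$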
Bellman expectation for state value functions
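Expanding the expectation over the backup diagram (a standard reconstruction of the slide's equation):

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma V^\pi(s')\right]$$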
Bellman expectation for action value functions
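And likewise for action values:

$$Q^\pi(s, a) = \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]$$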
Solving the Bellman expectation equations

Solve the linear system:
- variables: V^π(s) for all s
- constants: p(s'|s,a), r(s,a,s')

Or solve by iterative methods.
Policy Iteration

1. Policy evaluation: iterate until convergence
2. Policy improvement: find the best action according to a one-step look-ahead

Repeat until the policy converges. Guaranteed to converge to the optimal policy. (A sketch follows below.)
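A compact sketch of policy iteration on a toy known MDP (the arrays `P` and `R` are hypothetical stand-ins for the slide's tables):

```python
# Policy iteration for a small known MDP.
# P[s, a, s'] are transition probabilities, R[s, a, s'] rewards.
import numpy as np

n_s, n_a, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # random valid transitions
R = rng.normal(size=(n_s, n_a, n_s))

pi = np.zeros(n_s, dtype=int)                      # start from an arbitrary policy
while True:
    # 1. Policy evaluation: iterate the Bellman expectation update to convergence
    V = np.zeros(n_s)
    for _ in range(1000):
        V_new = np.array([P[s, pi[s]] @ (R[s, pi[s]] + gamma * V) for s in range(n_s)])
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    # 2. Policy improvement: greedy one-step look-ahead
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V)  # Q[s,a] = sum_s' P * (R + gamma*V)
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):                  # policy converged: it is optimal
        break
    pi = pi_new
```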
Bellman optimality for state value functions

The value of a state under an optimal policy must equal the expected return for the best action from that state.

For the Bellman expectation equations we summed over all leaves; here we choose the best branch.
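Reconstructed in standard notation:

$$V^*(s) = \max_a \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma V^*(s')\right]$$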
Bellman optimality for action value functions

Again, for the Bellman expectation equations we summed over all leaves; here we choose the best branch.
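$$Q^*(s, a) = \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$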
Solving the Bellman optimality equations

The max makes these equations nonlinear, so, unlike the expectation equations, there is no closed-form linear solution: solve by iterative methods.
Value Iteration

Find the best action according to a one-step look-ahead.

Repeat until the policy converges. Guaranteed to converge to the optimal policy. (A sketch follows below.)

Slides from Fragkiadaki
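A value-iteration sketch under the same toy-MDP assumptions as above (`P[s,a,s']` transitions and `R[s,a,s']` rewards are hypothetical):

```python
# Value iteration: repeatedly apply the Bellman optimality backup.
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_s = P.shape[0]
    V = np.zeros(n_s)
    while True:
        # One-step look-ahead: Q[s,a] = sum_s' P[s,a,s'] * (R[s,a,s'] + gamma*V[s'])
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V)
        V_new = Q.max(axis=1)                  # choose the best branch
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # optimal values and greedy policy
        V = V_new
```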
Q-Value Iteration
Slides from Fragkiadaki
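The update on the slide is not in the transcript; the standard Q-value iteration backup:

$$Q_{k+1}(s, a) \leftarrow \sum_{s'} p(s' \mid s, a)\left[r(s, a, s') + \gamma \max_{a'} Q_k(s', a')\right]$$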
Summary: Exact methods

Fully known MDP: states, transitions, rewards.

- Bellman expectation equations → policy iteration, Q-policy iteration
- Bellman optimality equations → value iteration, Q-value iteration

Repeat until the policy converges. Guaranteed to converge to the optimal policy.

Limitations:
- Iterating over and storing all states and actions requires a small, discrete state and action space
- The update equations require a fully observable MDP and known transitions
Solving unknown MDPs using function approximation
Recap: Q-value iteration
This is problematic when we do not know the transitions.
Slides from Fragkiadaki
Tabular Q-learning

Estimate Q(s,a) through simulation and exploration.

Slides from Fragkiadaki
Tabular Q-learning update

Key idea: implicitly estimate the transitions via simulation; a learning rate α controls how far each sample moves the estimate.
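The update itself (standard form, matching the "old estimate" and "target" labels on the summary slides):

$$Q(s, a) \leftarrow \underbrace{(1 - \alpha)\, Q(s, a)}_{\text{old estimate}} + \alpha\, \underbrace{\left[r + \gamma \max_{a'} Q(s', a')\right]}_{\text{target}}$$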
Tabular Q-learning: from Bellman optimality to sample-based updates

Slides from Fragkiadaki
Epsilon-greedy

Poor estimates of Q(s,a) at the start: bad initial estimates can drive the policy into a sub-optimal region from which it never explores further. So, with probability ε, take a random action; otherwise act greedily. Gradually decrease epsilon as the policy is learned. (A sketch follows below.)
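A tabular Q-learning sketch with epsilon-greedy exploration (the `env` interface is hypothetical: `reset()` returns a discrete state, `step(a)` returns `(next_state, reward, done)`):

```python
# Tabular Q-learning with a decaying epsilon-greedy exploration schedule.
import numpy as np

def q_learning(env, n_s, n_a, episodes=5000, alpha=0.1, gamma=0.99,
               eps_start=1.0, eps_end=0.05):
    Q = np.zeros((n_s, n_a))                                 # |S| x |A| table
    rng = np.random.default_rng(0)
    for ep in range(episodes):
        eps = max(eps_end, eps_start * (1 - ep / episodes))  # decay epsilon
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability eps, otherwise act greedily
            a = rng.integers(n_a) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = env.step(a)
            target = r + gamma * (0.0 if done else Q[s2].max())
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # mix old estimate & target
            s = s2
    return Q
```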
Convergence
Slides from Abbeel
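The conditions on this slide aren't captured in the transcript; the standard result (Watkins & Dayan) is that tabular Q-learning converges to Q* provided every state-action pair is visited infinitely often and the learning rates satisfy

$$\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty$$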
Tabular Q-learning

- Tabular: keep a |S| x |A| table of Q(s,a)
- Still requires a small, discrete state and action space
- How can we generalize to unseen states?
Summary: Tabular Q-learning

MDP with unknown transitions: replace the true expectation over transitions in the Bellman optimality equations with sampled estimates (the update mixes the old estimate with a new target).

Simulation and exploration; epsilon-greedy is important!

Tabular: keep a |S| x |A| table of Q(s,a). This still requires a small, discrete state and action space. How can we generalize to unseen states?
Deep Q-learning
DQN, 2015
Q-learning with function approximation to extract informative features from high-dimensional input states.
Represent the value function by a Q-network with weights w:
+ high-dimensional, continuous states
+ generalization to new states
Slides from Fragkiadaki
Deep Q-learning

- Optimal Q-values should obey the Bellman equation
- Treat the right-hand side as a target
- Minimize the MSE loss by stochastic gradient descent
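The loss (standard DQN form):

$$L(w) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)^2\right]$$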
- This converges to Q* using a table-lookup representation
- But it diverges using neural networks, due to:
  - correlations between samples
  - non-stationary targets
Experience replay

- To remove correlations, build a dataset D from the agent's own experience (exploration with epsilon-greedy is important!)
- Sample random mini-batches of transitions (s,a,r,s') from D
Fixed Q-targets

- Sample a random mini-batch of transitions (s,a,r,s') from D
- Compute the Q-learning targets w.r.t. old, fixed parameters w⁻
- Optimize the MSE between the Q-network and the Q-learning targets using stochastic gradient descent
- Update w⁻ with the updated w every ~1000 iterations
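A training-step sketch combining both fixes (PyTorch; the network sizes, buffer, and names are illustrative, not the DQN paper's exact architecture):

```python
# One DQN training step with experience replay and fixed Q-targets.
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # w- starts equal to w
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                   # dataset D of transitions
# each entry: (s, a, r, s2, done) as tensors, e.g.
# replay.append((s, torch.tensor(a), torch.tensor(r), s2, torch.tensor(float(done))))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)           # Q(s,a; w)
    with torch.no_grad():                                      # targets use fixed w-
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)                   # MSE to the target
    opt.zero_grad(); loss.backward(); opt.step()

# every ~1000 steps: target_net.load_state_dict(q_net.state_dict())
```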
Deep Q-learning for Atari

Stack recent frames as input to encourage the Markov property.

Slides from Fragkiadaki
Superhuman results

Games that need reaction speed, have short-term rewards, and require little exploration: deep RL >>> humans.

Games with very long-term rewards that require more exploration and knowledge of complex dynamics (e.g. key, ladder): a challenge for deep RL.

Slides from Fragkiadaki
Superhuman results on Montezuma's Revenge

Encourage the agent to explore its environment by maximizing curiosity, i.e. how well can I predict my environment?
1. less training data
2. stochastic
3. unknown dynamics
So I should explore more.

Burda et al., ICLR 2019
Recap (Summary: Exact methods): fully known MDP (states, transitions, rewards); the Bellman expectation/optimality equations give policy, Q-policy, value, and Q-value iteration; limited to small, discrete, fully observable MDPs with known transitions.
Recap (Summary: Tabular Q-learning): MDP with unknown transitions; replace the true expectation over transitions with sampled estimates; still requires a small, discrete state and action space.
Summary: Deep Q-learning

The same old-estimate/target update, now with a function approximator:
- works for high-dimensional state and action spaces
- generalizes to unseen states
- stochastic gradient descent + experience replay + fixed Q-targets
Applications: RL and Language
RL and Language
Luketina et al., IJCAI 2019
Language-conditional RL
● Instruction following
● Rewards from instructions
● Language in the observation and action space
Language-conditional RL: Instruction following

● Navigation via instruction following

Challenges: fusion, alignment, grounding language, recognizing objects, navigating to objects, generalizing to unseen objects

Chaplot et al., AAAI 2018; Misra et al., EMNLP 2017
Language-conditional RL: Instruction following (Chaplot et al., AAAI 2018)

● Interaction with the environment
● Gated attention via element-wise product (fusion, alignment, grounding language, recognizing objects); see the sketch below
● Policy learning
● Grounding is important for generalization: blue armor, red pillar → blue pillar
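A sketch of gated-attention fusion as I read it from the paper (module and dimension names are illustrative): the instruction embedding produces a per-channel sigmoid gate that multiplies the convolutional feature maps element-wise.

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    # Fuse an instruction embedding with conv features via an element-wise product.
    def __init__(self, text_dim, n_channels):
        super().__init__()
        self.gate = nn.Linear(text_dim, n_channels)   # one gate value per channel

    def forward(self, img_feats, text_emb):
        # img_feats: (B, C, H, W) conv features; text_emb: (B, text_dim)
        g = torch.sigmoid(self.gate(text_emb))            # (B, C) gates in [0, 1]
        return img_feats * g.unsqueeze(-1).unsqueeze(-1)  # broadcast over H, W
```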
Language-conditional RL: Rewards from instructions

Montezuma's Revenge: a sparse, long-term reward problem. General solution: reward shaping via auxiliary rewards.

Curiosity encourages the agent to explore its environment: how well can I predict my environment? Less training data, stochasticity, or unknown dynamics mean I should explore more. (Pathak et al., ICML 2017; Burda et al., ICLR 2019)

Natural language for reward shaping (Goyal et al., IJCAI 2019): e.g. "Jump over the skull while going to the left"; annotations came from Amazon Mturk :-( (annotators were asked to play the game and describe entities). These intermediate rewards speed up learning and encourage the agent to take actions related to the instructions.
Language-conditional RL: Language in S and A
● Embodied QA: Navigation + QA
Das et al., CVPR 2018. Most methods are similar to instruction following.
Language-assisted RL
● Language for communicating domain knowledge
● Language for structuring policies
Language-assisted RL: Domain knowledge

● Properties of entities in the environment are annotated by language, from Amazon Mturk :-( (annotators were asked to play the game and describe entities)

This is a fusion problem.

Grounded language learning helps to ground the meaning of text to the dynamics, transitions, and rewards. Language helps in multi-task learning and transfer learning.

Narasimhan et al., JAIR 2018
Language-assisted RL: Domain knowledge

● Learning to read instruction manuals (Branavan et al., JAIR 2012)

1. Choose relevant sentences
2. Label words as action-description (A), state-description (S), or background

Grounded language learning: ground the meaning of text to the dynamics, transitions, and rewards. Language helps in learning: it matters most at the start, when you don't have a good policy; afterwards, the model relies on game features.
Language for structuring policies

● Composing modules for Embodied QA

Das et al., CoRL 2018