CSC 411: Lecture 19: Reinforcement Learning

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

November 29, 2016

Today

Learn to play games

Reinforcement Learning

Playing Games: Atari

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Playing Games: Super Mario

https://www.youtube.com/watch?v=wfL4L_l4U9A

Making Pancakes!

https://www.youtube.com/watch?v=W_gxLKSsSIE

Reinforcement Learning Resources

RL tutorial – on course website

Reinforcement Learning: An Introduction, Sutton & Barto (1998)

Reinforcement Learning

Learning algorithms differ in the information available to the learner:

- Supervised: correct outputs
- Unsupervised: no feedback, must construct a measure of good output
- Reinforcement learning

More realistic learning scenario:

- Continuous stream of input information, and actions
- Effects of an action depend on the state of the world
- Obtain a reward that depends on the world state and actions: not the correct response, just some feedback

Reinforcement Learning

[pic from: Peter Abbeel]

Example: Tic Tac Toe, Notation

Formulating Reinforcement Learning

World described by a discrete, finite set of states and actions

At every time step t, we are in a state s_t, and we:

- Take an action a_t (possibly null action)
- Receive some reward r_{t+1}
- Move into a new state s_{t+1}

An RL agent may include one or more of these components:

- Policy π: agent's behaviour function
- Value function: how good is each state and/or action
- Model: agent's representation of the environment
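
To make this loop concrete, here is a minimal Python sketch of the agent-environment interaction described above; the Environment and Agent classes and their toy dynamics are illustrative assumptions, not part of the lecture.

    import random

    class Environment:
        """Hypothetical environment: step(state, action) -> (next_state, reward)."""
        def reset(self):
            return 0                                   # initial state s_0

        def step(self, state, action):
            next_state = (state + action) % 5          # toy deterministic transition
            reward = 1.0 if next_state == 0 else 0.0   # toy reward r_{t+1}
            return next_state, reward

    class Agent:
        """Hypothetical agent: its policy just picks a random action."""
        def act(self, state):
            return random.choice([1, 2])

    env, agent = Environment(), Agent()
    state, total_reward = env.reset(), 0.0
    for t in range(20):                                # at each step t: take a_t, receive r_{t+1}, move to s_{t+1}
        action = agent.act(state)
        state, reward = env.step(state, action)
        total_reward += reward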

Policy

A policy is the agent’s behaviour.

It’s a selection of which action to take, based on the current state

Deterministic policy: a = π(s)

Stochastic policy: π(a|s) = P[a_t = a | s_t = s]

[Slide credit: D. Silver]
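
As a small illustration (not from the slides; the state and action names are made up), the two kinds of policy in Python:

    import random

    ACTIONS = ["N", "E", "S", "W"]

    def deterministic_policy(state):
        # a = pi(s): a fixed lookup from state to action
        table = {"s0": "N", "s1": "E"}                 # hypothetical state -> action table
        return table.get(state, "N")

    def stochastic_policy(state):
        # pi(a|s) = P[a_t = a | s_t = s]: sample from a distribution over actions
        probs = [0.7, 0.1, 0.1, 0.1]                   # toy probabilities; in general they depend on the state
        return random.choices(ACTIONS, weights=probs, k=1)[0]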

Value Function

Value function is a prediction of future reward

Used to evaluate the goodness/badness of states

Our aim will be to maximize the value function (the total reward we receive over time): find the policy with the highest expected reward

By following a policy π, the value function is defined as:

V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ···

γ is called a discount rate, and it is always 0 ≤ γ ≤ 1

If γ is close to 1, rewards further in the future count more, and we say that the agent is “farsighted”

γ is less than 1 because there is usually a time limit to the sequence of actions needed to solve a task (we prefer rewards sooner rather than later)

[Slide credit: D. Silver]
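
A quick numerical sketch of this discounted sum, using made-up rewards:

    GAMMA = 0.9                        # discount rate, 0 <= gamma <= 1
    rewards = [0.0, 0.0, 1.0, 0.5]     # hypothetical r_t, r_{t+1}, r_{t+2}, r_{t+3}

    # V^pi(s_t) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    value = sum(GAMMA**i * r for i, r in enumerate(rewards))
    print(value)                       # 0.81*1.0 + 0.729*0.5 = 1.1745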

Model

The model describes the environment by a distribution over rewards and state transitions:

P(s_{t+1} = s′, r_{t+1} = r′ | s_t = s, a_t = a)

We assume the Markov property: the future depends on the past only through the current state
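
One possible in-memory representation of such a model (a sketch with made-up numbers and state names): a table mapping (s, a) to a distribution over (s′, r′) pairs.

    # P(s_{t+1} = s', r_{t+1} = r' | s_t = s, a_t = a) as a nested dict:
    # model[(s, a)] = {(s_prime, r_prime): probability, ...}
    model = {
        ("s0", "N"): {("s1", 0.0): 0.8, ("s0", 0.0): 0.2},
        ("s1", "N"): {("s2", 1.0): 1.0},
    }

    def transition_prob(model, s, a, s_prime, r_prime):
        # Look up P(s', r' | s, a); unlisted outcomes have probability 0
        return model.get((s, a), {}).get((s_prime, r_prime), 0.0)

    print(transition_prob(model, "s0", "N", "s1", 0.0))   # 0.8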

Maze Example

Rewards: −1 per time-step

Actions: N, E, S, W

States: Agent’s location

[Slide credit: D. Silver]

Maze Example

Arrows represent policy π(s) for each state s

[Slide credit: D. Silver]

Maze Example

Numbers represent value V^π(s) of each state s

[Slide credit: D. Silver]

Example: Tic-Tac-Toe

Consider the game tic-tac-toe:

- reward: win/lose/tie the game (+1 / −1 / 0) [only at the final move in a given game]
- state: positions of X’s and O’s on the board
- policy: mapping from states to actions
  - based on rules of game: choice of one open position
- value function: prediction of reward in future, based on current state

In tic-tac-toe, since the state space is tractable, we can use a table to represent the value function

RL & Tic-Tac-Toe

Each board position (taking into account symmetry) has some probability of winning

Simple learning process:

- start with all values = 0.5
- policy: choose the move with the highest probability of winning, given the current legal moves from the current state
- update entries in the table based on the outcome of each game
- after many games the value function will represent the true probability of winning from each state

Can try an alternative policy: sometimes select moves randomly (exploration)
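
A minimal sketch of this table-based learner (the update scheme and hyperparameters are illustrative assumptions, not the lecture's code): keep a dict from board states to estimated win probability, usually play greedily, sometimes explore, and nudge visited states toward the game's outcome afterwards.

    import random

    values = {}                              # board state -> estimated probability of winning
    ALPHA, EPSILON = 0.1, 0.1                # step size and exploration rate (hypothetical settings)

    def value(state):
        return values.get(state, 0.5)        # start with all values = 0.5

    def choose_move(state, legal_moves, next_state_fn):
        # Mostly greedy: pick the move whose resulting state has the highest value;
        # occasionally pick a random move instead (exploration).
        if random.random() < EPSILON:
            return random.choice(legal_moves)
        return max(legal_moves, key=lambda m: value(next_state_fn(state, m)))

    def update_after_game(visited_states, outcome):
        # outcome: +1 win, 0 tie, -1 loss; pull each visited state toward the result.
        target = {1: 1.0, 0: 0.5, -1: 0.0}[outcome]
        for s in visited_states:
            values[s] = value(s) + ALPHA * (target - value(s))

Here next_state_fn is a hypothetical helper that returns the board reached by playing move m in the given state.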

Basic Problems

Markov Decision Problem (MDP): a tuple (S, A, P, γ), where P is

P(s_{t+1} = s′, r_{t+1} = r′ | s_t = s, a_t = a)

Standard MDP problems:

1. Planning: given a complete Markov decision problem as input, compute the policy with optimal expected return

2. Learning: we don’t know which states are good or what the actions do; we must try out the actions and states to learn what to do

[P. Abbeel]

Example of Standard MDP Problem

1. Planning: given a complete Markov decision problem as input, compute the policy with optimal expected return

2. Learning: only have access to experience in the MDP; learn a near-optimal strategy

We will focus on learning, but discuss planning along the way

Exploration vs. Exploitation

If we knew how the world works (embodied in P), then the policy should be deterministic:

- just select the optimal action in each state

Reinforcement learning is like trial-and-error learning

The agent should discover a good policy from its experiences of the environment, without losing too much reward along the way

Since we do not have complete knowledge of the world, taking what appears to be the optimal action may prevent us from finding better states/actions

Interesting trade-off:

- immediate reward (exploitation) vs. gaining knowledge that might enable higher future reward (exploration)
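
A common way to manage this trade-off (ε-greedy; one standard choice, not something the slides prescribe) is to exploit most of the time and explore with a small probability ε:

    import random

    EPSILON = 0.1                            # exploration probability (hypothetical setting)

    def epsilon_greedy(q_values, state, actions):
        # Exploit: the action currently believed best; explore: a uniformly random action.
        if random.random() < EPSILON:
            return random.choice(actions)                                     # exploration
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))      # exploitation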

Examples

Restaurant Selection

- Exploitation: Go to your favourite restaurant
- Exploration: Try a new restaurant

Online Banner Advertisements

- Exploitation: Show the most successful advert
- Exploration: Show a different advert

Oil Drilling

- Exploitation: Drill at the best known location
- Exploration: Drill at a new location

Game Playing

- Exploitation: Play the move you believe is best
- Exploration: Play an experimental move

[Slide credit: D. Silver]

MDP Formulation

Goal: find the policy π that maximizes the expected accumulated future rewards V^π(s_t), obtained by following π from state s_t:

V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ··· = Σ_{i=0}^∞ γ^i r_{t+i}

Game show example:

- assume a series of questions, increasingly difficult, but with increasing payoff
- choice: accept accumulated earnings and quit; or continue and risk losing everything

Notice that:

V^π(s_t) = r_t + γ V^π(s_{t+1})
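
A tiny check (made-up rewards) that the recursive identity matches the direct discounted sum:

    GAMMA = 0.9
    rewards = [1.0, 2.0, 4.0]                # hypothetical r_t, r_{t+1}, r_{t+2}; zero afterwards

    def value(i):
        # V(s_i) = r_i + gamma * V(s_{i+1}), with V = 0 past the last reward
        if i >= len(rewards):
            return 0.0
        return rewards[i] + GAMMA * value(i + 1)

    direct = sum(GAMMA**i * r for i, r in enumerate(rewards))
    print(value(0), direct)                  # both are approximately 6.04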

What to Learn

We might try to learn the function V (which we write as V∗)

V∗(s) = max_a [r(s, a) + γ V∗(δ(s, a))]

Here δ(s, a) gives the next state, if we perform action a in current state s

We could then do a lookahead search to choose the best action from any state s:

π∗(s) = arg max_a [r(s, a) + γ V∗(δ(s, a))]

But there’s a problem:

- This works well if we know δ() and r()
- But when we don’t, we cannot choose actions this way
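
For illustration only (the r, delta, and v_star callables are hypothetical stand-ins for a known model), the lookahead choice looks like this; the slide's point is that without δ and r this computation is not available:

    GAMMA = 0.9

    def best_action(state, actions, r, delta, v_star):
        # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]  -- requires the model (r, delta)
        return max(actions, key=lambda a: r(state, a) + GAMMA * v_star(delta(state, a)))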

Q Learning

Define a new function very similar to V∗:

Q(s, a) = r(s, a) + γ V∗(δ(s, a))

If we learn Q, we can choose the optimal action even without knowing δ!

π∗(s) = arg max_a [r(s, a) + γ V∗(δ(s, a))] = arg max_a Q(s, a)

Q is then the evaluation function we will learn
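
In contrast to the previous slide, action selection now needs only the learned Q values and no model; a minimal sketch (the dict-based Q-table is an assumption):

    def best_action_from_q(q_table, state, actions):
        # pi*(s) = argmax_a Q(s, a): no delta() or r() required, only the learned table
        return max(actions, key=lambda a: q_table.get((state, a), 0.0))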

Training Rule to Learn Q

Q and V∗ are closely related:

V∗(s) = max_a Q(s, a)

So we can write Q recursively:

Q(s_t, a_t) = r(s_t, a_t) + γ V∗(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)

Let Q̂ denote the learner’s current approximation to Q

Consider the training rule

Q̂(s, a) ← r(s, a) + γ max_{a′} Q̂(s′, a′)

where s′ is the state resulting from applying action a in state s

Page 88: CSC 411: Lecture 19: Reinforcement Learning › ~urtasun › courses › CSC411_Fall16 › 19_rl.pdfCSC 411: Lecture 19: Reinforcement Learning Richard Zemel, Raquel Urtasun and Sanja

Q Learning for Deterministic World

For each s, a initialize the table entry Q̂(s, a) ← 0

Start in some initial state s

Do forever:

- Select an action a and execute it
- Receive immediate reward r
- Observe the new state s′
- Update the table entry for Q̂(s, a) using the Q learning rule:

  Q̂(s, a) ← r(s, a) + γ max_{a′} Q̂(s′, a′)

- s ← s′

If we reach an absorbing state, restart in the initial state and run through the “Do forever” loop until we again reach an absorbing state

Zemel, Urtasun, Fidler (UofT) CSC 411: 19-Reinforcement Learning November 29, 2016 33 / 38
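
The loop above maps directly onto a short program. The sketch below assumes the environment is given as two user-supplied functions, step(s, a) for the deterministic next state and reward(s, a) for the immediate reward, plus a predicate is_absorbing(s); these names, the random action choice, and the fixed episode count are illustrative assumptions rather than part of the slides.

    import random
    from collections import defaultdict

    def q_learning_deterministic(actions, step, reward, is_absorbing,
                                 initial_state, gamma=0.9, episodes=100):
        """Tabular Q learning for a deterministic world, following the loop above."""
        Q = defaultdict(float)                       # Q̂(s, a) initialized to 0
        for _ in range(episodes):                    # restart from the initial state
            s = initial_state
            while not is_absorbing(s):               # run until an absorbing state is reached
                a = random.choice(actions)           # select an action a and execute it
                r = reward(s, a)                     # receive immediate reward r
                s_next = step(s, a)                  # observe the new state s'
                # Q learning rule: Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
                Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                s = s_next                           # s <- s'
        return Q

Random action selection keeps the sketch short; the exploration/exploitation slide further below discusses better ways to choose actions during learning.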

Updating Estimated Q

Assume the robot is in state s1; some of its current estimates of Q are as shown in the slide's figure (the actions available from the next state s2 have estimates 63, 81 and 100); it executes a rightward move

    Q̂(s1, a_right) ← r + γ max_a′ Q̂(s2, a′)
                   ← r + 0.9 max{63, 81, 100}
                   ← 90

(the immediate reward r is 0 in this example)

Important observation: at each time step (taking an action a in state s), only one entry of Q̂ will change (the entry Q̂(s, a))

Notice that if rewards are non-negative, then the Q̂ values only increase from 0 and approach the true Q
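
The update on this slide can be checked with a few lines of arithmetic; the action labels for the three estimates out of s2 and the reward of 0 are assumptions made only for this check.

    gamma = 0.9
    # Current estimates Q̂(s2, a') for the three actions available from s2 (values from the slide).
    q_s2 = {"a1": 63.0, "a2": 81.0, "a3": 100.0}

    r = 0.0  # immediate reward for the rightward move (assumed 0, as the result 90 implies)
    q_s1_right = r + gamma * max(q_s2.values())
    print(q_s1_right)   # 90.0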

Q Learning: Summary

Training set consists of a series of intervals (episodes): a sequence of (state, action, reward) triples, ending at an absorbing state

Each executed action a results in a transition from state s_i to s_j; the algorithm updates Q̂(s_i, a) using the learning rule

Intuition for a simple grid world with reward only upon entering the goal state → Q estimates improve from the goal state back:

1. All Q̂(s, a) start at 0
2. First episode – only the Q̂(s, a) for the transition leading to the goal state is updated
3. Next episode – if we go through this next-to-last transition, we will update Q̂(s, a) another step back
4. Eventually the information from transitions with non-zero reward propagates throughout the state-action space
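
This backward propagation is easy to watch on a toy problem; the 4-state chain with a single "move right" action and three episodes below is invented purely to illustrate points 1–4, it is not an example from the lecture.

    gamma = 0.9
    n_states = 4                                  # states 0..3; state 3 is the absorbing goal
    Q = {s: 0.0 for s in range(n_states - 1)}     # one action per state, so Q is indexed by state

    for episode in range(3):
        for s in range(n_states - 1):             # walk the chain from left to right
            r = 1.0 if s == n_states - 2 else 0.0             # reward only on entering the goal
            next_q = 0.0 if s + 1 == n_states - 1 else Q[s + 1]
            Q[s] = r + gamma * next_q             # deterministic Q-learning update
        print(episode + 1, Q)
    # episode 1: only Q[2] changes (to 1.0), the transition into the goal
    # episode 2: Q[1] becomes 0.9, one step further back
    # episode 3: Q[0] becomes 0.81, and the reward has propagated through the whole chain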

Q Learning: Exploration/Exploitation

We have not specified how actions are chosen during learning

Can choose actions to maximize Q̂(s, a)

Good idea?

Can instead employ stochastic action selection (policy):

    P(a_i | s) = exp(k Q̂(s, a_i)) / Σ_j exp(k Q̂(s, a_j))

Can vary k during learning

▶ more exploration early on, shifting towards exploitation later
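
A minimal sketch of this softmax (Boltzmann) action selection, assuming Q̂ is stored as a dict keyed by (state, action); the function name and the comment about a schedule for k are illustrative assumptions.

    import math
    import random

    def softmax_action(Q, s, actions, k):
        """Sample an action with P(a_i|s) proportional to exp(k * Q̂(s, a_i))."""
        prefs = [k * Q.get((s, a), 0.0) for a in actions]
        m = max(prefs)                                    # subtract the max for numerical stability
        weights = [math.exp(p - m) for p in prefs]
        total = sum(weights)
        probs = [w / total for w in weights]
        return random.choices(actions, weights=probs, k=1)[0]

    # Small k -> nearly uniform choice (exploration); large k -> nearly greedy (exploitation).
    # One possible schedule is to grow k with the episode number.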

Non-deterministic Case

What if the reward and next state are non-deterministic?

We redefine V and Q as expected values over the randomness in rewards and transitions:

    V^π(s) = E_π[r_t + γ r_{t+1} + γ² r_{t+2} + · · · ]
           = E_π[ Σ_{i=0}^∞ γ^i r_{t+i} ]

and

    Q(s, a) = E[r(s, a) + γ V∗(δ(s, a))]
            = E[r(s, a) + γ Σ_{s′} p(s′|s, a) max_a′ Q(s′, a′)]
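
When p(s′|s, a) and the expected reward are known, the second expression can be evaluated directly; the tiny two-state model below is made up purely to illustrate the formula, it is not an example from the lecture.

    gamma = 0.9

    # Hypothetical model for one state-action pair ("A", "go"):
    p_next = {"B": 0.8, "A": 0.2}     # p(s' | s="A", a="go")
    expected_r = 1.0                  # E[r("A", "go")]

    # Assumed current values of max_a' Q(s', a') for each possible next state.
    max_q = {"A": 2.0, "B": 5.0}

    q_A_go = expected_r + gamma * sum(p * max_q[s2] for s2, p in p_next.items())
    print(q_A_go)   # 1.0 + 0.9 * (0.8*5.0 + 0.2*2.0) ≈ 4.96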

Non-deterministic Case: Learning Q

The deterministic training rule does not converge here (it can keep changing Q̂ even if initialized to the true Q values)

So we modify the training rule to change Q̂ more slowly:

    Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_a′ Q̂_{n−1}(s′, a′)]

where s′ is the state we land in after taking action a in state s, a′ indexes the actions that can be taken in state s′, and

    α_n = 1 / (1 + visits_n(s, a))

where visits_n(s, a) is the number of times action a has been taken in state s
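
A minimal sketch of this decaying-learning-rate update, using the same dictionary-based table as the earlier sketches; the function and variable names are illustrative assumptions.

    from collections import defaultdict

    Q = defaultdict(float)        # Q̂(s, a), initialized to 0
    visits = defaultdict(int)     # visits_n(s, a): times action a has been taken in state s

    def q_update_stochastic(s, a, r, s_next, actions, gamma=0.9):
        """Q̂_n(s,a) <- (1 - α_n) Q̂_{n-1}(s,a) + α_n [r + gamma * max_a' Q̂_{n-1}(s', a')]."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)])          # α_n = 1 / (1 + visits_n(s, a))
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target

    # Example: one observed noisy transition from state 1 to state 2 under action "right".
    q_update_stochastic(s=1, a="right", r=0.5, s_next=2, actions=["left", "right"])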
