Approximate Q-Learning

Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]
Q-Learning

For all s, a: initialize Q(s,a) = 0.
Repeat forever:
  Where are you? s. Choose some action a. Execute it in the real world: (s, a, r, s').
  Do update: Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
Equivalently: Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
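A minimal sketch of this loop in Python (the `env.reset`/`env.step` interface, the action list, and the ε-greedy exploration choice are illustrative assumptions, not part of the slides):

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.5, gamma=0.9, epsilon=0.1, steps=10_000):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)               # Q[(s, a)], implicitly initialized to 0
    s = env.reset()                      # where are you?
    for _ in range(steps):
        # choose some action a (epsilon-greedy here; any exploration policy works)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a2: Q[(s, a2)])
        s2, r, done = env.step(a)        # execute it in the real world: (s, a, r, s')
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # do update
        s = env.reset() if done else s2
    return Q
```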
Example: Pacman

Let's say we discover through experience that this state is bad:
Or even this one!
[video: Q-learning, no features, 50 learning trials]

[video: Q-learning, no features, 1000 learning trials]
Feature-Based Representations

Solution: describe states with a vector of features (aka "properties").
– Features = functions from states to ℝ (often 0/1) capturing important properties of the state
– Examples:
  • Distance to closest ghost or dot
  • Number of ghosts
  • 1 / (distance to dot)²
  • Is Pacman in a tunnel? (0/1)
  • Is state the exact state on this slide?
  • … etc.
– Can also describe a q-state (s, a) with features (e.g., action moves closer to food)
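As a sketch, a feature function for a q-state might look like this in Python (the state methods `successor`, `closest_food_distance`, etc. are hypothetical names used only for illustration):

```python
def features(s, a):
    """Map a q-state (s, a) to a small feature vector (all names hypothetical).

    Assumes the state object can produce the successor state after taking a,
    plus a few distance helpers."""
    s2 = s.successor(a)
    return {
        "bias": 1.0,
        "dist-to-closest-food": 1.0 / (1.0 + s2.closest_food_distance()),
        "num-ghosts-1-step-away": s2.num_ghosts_one_step_away(),
        "in-tunnel": 1.0 if s2.in_tunnel() else 0.0,
    }
```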
How to Use Features?

Using features, we can represent V and/or Q as follows:
V(s) = g(f1(s), f2(s), …, fn(s))
Q(s,a) = g(f1(s,a), f2(s,a), …, fn(s,a))

What should we use for g (and f)?
Linear Combination

• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states sharing features may actually have very different values!
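With features stored as a name → value dict, the linear form is a one-liner in Python (a sketch, reusing the hypothetical feature dict from above):

```python
def q_value(w, feats):
    """Linear Q-function: Q(s,a) = w1*f1(s,a) + ... + wn*fn(s,a)."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())
```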
Approximate Q-Learning

• Q-learning with linear Q-functions:
  Exact Q's:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q's:  w_i ← w_i + α · difference · f_i(s,a)
  where difference = [ r + γ max_{a'} Q(s',a') ] − Q(s,a)
• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
• Formal justification: in a few slides!
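A hedged sketch of the full update in Python, assuming a `features(s, a)` function as above and a weight dict `w` (all interfaces illustrative):

```python
def approx_q_update(w, s, a, r, s2, actions, features,
                    alpha=0.01, gamma=0.9, done=False):
    """One approximate Q-learning step: w_i <- w_i + alpha * difference * f_i(s,a)."""
    def q(state, action):
        return sum(w.get(k, 0.0) * v for k, v in features(state, action).items())
    # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
    q_next = 0.0 if done else max(q(s2, a2) for a2 in actions)
    difference = (r + gamma * q_next) - q(s, a)
    for k, v in features(s, a).items():
        w[k] = w.get(k, 0.0) + alpha * difference * v   # only active features change
    return w
```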
Example: Pacman Features

Q(s,a) = w1 · f_DOT(s,a) + w2 · f_GST(s,a)

f_DOT(s,a) = 1 / (distance to closest food after taking a)
f_GST(s,a) = distance to closest ghost after taking a

f_DOT(s, NORTH) = 0.5
f_GST(s, NORTH) = 1.0
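For illustration, suppose the learned weights were w1 = 4.0 and w2 = −1.0 (these particular values are an assumption, not from this slide): food proximity is rewarded, ghost proximity penalized. Then moving NORTH would score

Q(s, NORTH) = 4.0 · 0.5 + (−1.0) · 1.0 = 1.0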
Example: Q-Pacman

[Demo: approximate Q-learning Pacman, α = 0.004]
Video of Demo: Approximate Q-Learning -- Pacman
Sidebar: Q-Learning and Least Squares
Linear Approximation: Regression

[plots: fitting a line to points in one dimension, and a plane in two dimensions]
Prediction (one feature): ŷ = w0 + w1 f1(x)
Prediction (two features): ŷ = w0 + w1 f1(x) + w2 f2(x)
Optimization: Least Squares

[plot: a fitted line, an observation y, the prediction ŷ, and the error or "residual" between them]

total error = Σ_i (y_i − ŷ_i)² = Σ_i ( y_i − Σ_k w_k f_k(x_i) )²
Minimizing Error

Approximate q update explained. Imagine we had only one point x, with features f(x), target value y, and weights w:

error(w) = ½ ( y − Σ_k w_k f_k(x) )²
∂error(w)/∂w_m = −( y − Σ_k w_k f_k(x) ) f_m(x)
w_m ← w_m + α ( y − Σ_k w_k f_k(x) ) f_m(x)

The approximate q update has the same form, with "target" r + γ max_{a'} Q(s',a') and "prediction" Q(s,a):

w_m ← w_m + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] f_m(s,a)
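The single-point update as a Python sketch (list-based weights and features are an illustrative choice):

```python
def gradient_step(w, f_x, y, alpha=0.1):
    """One gradient-descent step on error(w) = 1/2 * (y - w . f(x))^2:
    w_m <- w_m + alpha * (y - prediction) * f_m(x)."""
    prediction = sum(wm * fm for wm, fm in zip(w, f_x))
    residual = y - prediction                     # "target" minus "prediction"
    return [wm + alpha * residual * fm for wm, fm in zip(w, f_x)]
```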
Overfitting: Why Limiting Capacity Can Help

[plot: a degree-15 polynomial wildly overshooting the training points]
Simple Problem

Given: features of the current state. Predict: will Pacman die on the next step?

Just one feature. See a pattern?
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives

Learn: ghost one step away → Pacman dies!
What if we add more features?

§ Ghost one step away, score 211, Pacman dies
§ Ghost one step away, score 341, Pacman dies
§ Ghost one step away, score 231, Pacman dies
§ Ghost one step away, score 121, Pacman dies
§ Ghost one step away, score 301, Pacman lives
§ Ghost more than one step away, score 205, Pacman lives
§ Ghost more than one step away, score 441, Pacman lives
§ Ghost more than one step away, score 219, Pacman lives
§ Ghost more than one step away, score 199, Pacman lives
§ Ghost more than one step away, score 331, Pacman lives
§ Ghost more than one step away, score 251, Pacman lives

Learn: ghost one step away AND score is NOT a prime number → Pacman dies!
There's fitting, and there's…

[plot: degree-1 polynomial fit to the data]
[plot: degree-2 polynomial fit to the same data]
[plot: degree-15 polynomial fit to the same data] …Overfitting
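The effect is easy to reproduce with NumPy's least-squares polynomial fit; this sketch (synthetic data with an assumed noise level) shows training error shrinking as the degree grows, even though the high-degree curve is garbage between the points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 21)                      # inputs (rescaled for stability)
y = 10 * x + rng.normal(0.0, 2.0, size=x.shape)    # noisy linear data

for degree in (1, 2, 15):
    coeffs = np.polyfit(x, y, degree)              # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: training MSE = {mse:.3f}")
# Training error always shrinks with degree, but the degree-15 curve
# oscillates wildly between points: it fits the noise, not the signal.
```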
Approximating the Q Function

• Linear approximation
• Could also use a deep neural network to represent Q(s,a)
  – https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

DeepMind Atari: https://www.youtube.com/watch?v=V1eYniJ0Rnk
DQN Results on Atari

Slide adapted from David Silver
Approximating the Q Function

Linear approximation:
[diagram: features f1(s,a), f2(s,a), …, fm(s,a) combined linearly to produce Q]

Neural approximation (nonlinear):
[diagram: the same features fed through sigmoid units, a = h(z) = 1 / (1 + e^{−z})]
Deep Representations

• A deep representation is a composition of many functions:
  x → h1 → … → hn → y → l, with weights w1, …, wn entering at each layer
• Its gradient can be backpropagated by the chain rule:
  ∂l/∂hn = (∂l/∂y)(∂y/∂hn), …, ∂l/∂h1 = (∂l/∂h2)(∂h2/∂h1), ∂l/∂x = (∂l/∂h1)(∂h1/∂x)
  and the weight gradients are ∂l/∂wk = (∂l/∂hk)(∂hk/∂wk)

Slide adapted from David Silver
Multi-Layer Perceptron

[diagram: inputs [X1, X2, X3] at layer i, hidden layer j, outputs [Y1, Y2] at layer k; weights v_ij between input and hidden, w_jk between hidden and output; each unit applies the sigmoid a = h(z) = 1 / (1 + e^{−z})]

• Multiple layers
• Feed-forward
• Connected weights
• 1-of-N output

z_j = Σ_i w_ij x_i
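A forward pass for such a network, as a NumPy sketch (the layer sizes and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # h(z) = 1 / (1 + e^{-z})

def mlp_forward(x, V, W):
    """Feed-forward pass: z_j = sum_i v_ij * x_i, a_j = h(z_j); same for output."""
    hidden = sigmoid(V @ x)                 # input layer i -> hidden layer j
    return sigmoid(W @ hidden)              # hidden layer j -> output layer k

x = np.array([0.1, 0.7, 0.3])                        # [X1, X2, X3]
V = np.random.default_rng(0).normal(size=(4, 3))     # v_ij: 3 inputs -> 4 hidden units
W = np.random.default_rng(1).normal(size=(2, 4))     # w_jk: 4 hidden -> 2 outputs
print(mlp_forward(x, V, W))                          # [Y1, Y2]
```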
Training via Stochastic Gradient Descent

• Sample the gradient of the expected loss L(w) = E[l]:
  ∂l/∂w ∼ E[∂l/∂w] = ∂L(w)/∂w
• Adjust w down the sampled gradient:
  Δw ∝ −∂l/∂w
Gradient Descent

• The critic returns an error which, in combination with the function approximator, can be used to create an error function.
• The partial differential of this error function (the gradient) can now be used to update the internal variables in the function approximator (and critic).

Slide adapted from David Silver
Aka… Backpropagation

[diagram: network with input layer i, hidden layer j, output layer k; weights v_ij and w_jk]

• Minimize the error of the calculated output
• Adjust the weights by gradient descent
• Procedure: a forward phase, then backpropagation of errors
• For each sample, multiple epochs
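A minimal sketch of one forward/backward pass for a one-hidden-layer sigmoid network with squared-error loss (shapes and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, V, W, alpha=0.1):
    """One forward phase plus backpropagation of errors; updates V, W in place."""
    # forward phase
    h = sigmoid(V @ x)                           # hidden activations
    out = sigmoid(W @ h)                         # network output
    # backpropagation of errors (chain rule; sigmoid'(z) = a * (1 - a))
    delta_out = (out - y) * out * (1 - out)      # error signal at output layer k
    delta_hid = (W.T @ delta_out) * h * (1 - h)  # error signal at hidden layer j
    W -= alpha * np.outer(delta_out, h)          # adjust weights w_jk
    V -= alpha * np.outer(delta_hid, x)          # adjust weights v_ij
    return V, W
```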
Weight Sharing

A recurrent neural network shares weights between time-steps:
[diagram: … → h_t → h_{t+1} → …, with inputs x_t, x_{t+1} and outputs y_t, y_{t+1}; every transition uses the same weights w]

A convolutional neural network shares weights between local regions:
[diagram: hidden units h1, h2 each computed from a local window of x using the same weights w1, w2]

Slide adapted from David Silver
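A toy illustration of convolutional weight sharing: the same two weights are applied at every position of the input (sizes and values are made up):

```python
import numpy as np

def conv1d_shared(x, w):
    """Convolution: the same weights w are applied to every local region of x."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, -0.5])                  # shared weights [w1, w2]
print(conv1d_shared(x, w))                 # h1, h2, ... all computed with the same w
```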
Recap: Approximate Q-Learning

• Optimal Q-values should obey the Bellman equation:
  Q*(s,a) = E_{s'} [ r + γ max_{a'} Q*(s',a') | s, a ]
• Treat the right-hand side r + γ max_{a'} Q(s',a',w) as a target
• Minimise the MSE loss by stochastic gradient descent:
  l = ( r + γ max_{a'} Q(s',a',w) − Q(s,a,w) )²
• Converges to Q* using a table-lookup representation
• But diverges using neural networks due to:
  – Correlations between samples
  – Non-stationary targets

Slide adapted from David Silver
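As a sketch, the squared TD error for one transition (the `q(state) → {action: value}` interface is an assumed stand-in for a parametrised Q-function):

```python
def td_loss(q, transition, gamma=0.99):
    """Squared TD error: l = (r + gamma * max_a' Q(s',a',w) - Q(s,a,w))^2.

    `q(state)` is assumed to return a dict mapping action -> value under weights w."""
    s, a, r, s2, done = transition
    target = r if done else r + gamma * max(q(s2).values())  # treated as fixed
    return (target - q(s)[a]) ** 2
```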
Deep Q-Networks (DQN): Experience Replay

To remove correlations, build a data-set from the agent's own experience:
(s1, a1, r2, s2)
(s2, a2, r3, s3)
(s3, a3, r4, s4)
…
(s_t, a_t, r_{t+1}, s_{t+1})   →   (s, a, r, s')

Sample experiences from the data-set and apply the update:
l = ( r + γ max_{a'} Q(s',a',w⁻) − Q(s,a,w) )²

To deal with non-stationarity, the target parameters w⁻ are held fixed.

Slide adapted from David Silver
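A minimal replay-buffer sketch in Python (the capacity and interface are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) transitions; sampling uniformly at random
    breaks the correlations between consecutive experiences."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experience is evicted first

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```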
DQN in Atari

• End-to-end learning of values Q(s,a) from pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Output is Q(s,a) for 18 joystick/button positions
• Reward is the change in score for that step
• Network architecture and hyperparameters fixed across all games

Slide adapted from David Silver
DeepMind Resources

See also: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
That's All for Reinforcement Learning!

• Very tough problem: how to perform any task well in an unknown, noisy environment!
• Traditionally used mostly for robotics, but…
  – Google DeepMind: RL applied to data center power usage

[diagram: a reinforcement learning agent turns data (experiences with the environment) into a policy (how to act in the future)]
That's All for Reinforcement Learning!

Lots of open research areas:
– How to best balance exploration and exploitation?
– How to deal with cases where we don't know a good state/feature representation?
Conclusion

• We're done with Part I: Search and Planning!
• We've seen how AI methods can solve problems in:
  – Search
  – Constraint Satisfaction Problems
  – Games
  – Markov Decision Problems
  – Reinforcement Learning
• Next up: Part II: Uncertainty and Learning!