Approximate Q-Learning

Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]
Q-Learning

For all s, a: initialize Q(s,a) = 0.
Repeat forever:
  Where are you? s. Choose some action a. Execute it in the real world: (s, a, r, s').
  Do update: Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
Equivalently: Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
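A minimal sketch of this loop in Python (the `env.reset`/`env.step` interface, the action list, and the ε-greedy exploration choice are illustrative assumptions, not part of the slides):

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.5, gamma=0.9, epsilon=0.1, steps=10_000):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)               # Q[(s, a)], implicitly initialized to 0
    s = env.reset()                      # where are you?
    for _ in range(steps):
        # choose some action a (epsilon-greedy here; any exploration policy works)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a2: Q[(s, a2)])
        s2, r, done = env.step(a)        # execute it in the real world: (s, a, r, s')
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # do update
        s = env.reset() if done else s2
    return Q
```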
Example: Pacman

Let's say we discover through experience that this state is bad:
Or even this one!
[video: Q-learning, no features, 50 learning trials]

[video: Q-learning, no features, 1000 learning trials]
Feature-Based Representations

Solution: describe states with a vector of features (aka "properties").
– Features = functions from states to ℝ (often 0/1) capturing important properties of the state
– Examples:
  • Distance to closest ghost or dot
  • Number of ghosts
  • 1 / (distance to dot)²
  • Is Pacman in a tunnel? (0/1)
  • Is state the exact state on this slide?
  • … etc.
– Can also describe a q-state (s, a) with features (e.g., action moves closer to food)
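As a sketch, a feature function for a q-state might look like this in Python (the state methods `successor`, `closest_food_distance`, etc. are hypothetical names used only for illustration):

```python
def features(s, a):
    """Map a q-state (s, a) to a small feature vector (all names hypothetical).

    Assumes the state object can produce the successor state after taking a,
    plus a few distance helpers."""
    s2 = s.successor(a)
    return {
        "bias": 1.0,
        "dist-to-closest-food": 1.0 / (1.0 + s2.closest_food_distance()),
        "num-ghosts-1-step-away": s2.num_ghosts_one_step_away(),
        "in-tunnel": 1.0 if s2.in_tunnel() else 0.0,
    }
```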
How to Use Features?

Using features, we can represent V and/or Q as follows:
V(s) = g(f1(s), f2(s), …, fn(s))
Q(s,a) = g(f1(s,a), f2(s,a), …, fn(s,a))

What should we use for g (and f)?
Linear Combination

• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states sharing features may actually have very different values!
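With features stored as a name → value dict, the linear form is a one-liner in Python (a sketch, reusing the hypothetical feature dict from above):

```python
def q_value(w, feats):
    """Linear Q-function: Q(s,a) = w1*f1(s,a) + ... + wn*fn(s,a)."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())
```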
Approximate Q-Learning

• Q-learning with linear Q-functions:
  Exact Q's:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q's:  w_i ← w_i + α · difference · f_i(s,a)
  where difference = [ r + γ max_{a'} Q(s',a') ] − Q(s,a)
• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
• Formal justification: in a few slides!
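A hedged sketch of the full update in Python, assuming a `features(s, a)` function as above and a weight dict `w` (all interfaces illustrative):

```python
def approx_q_update(w, s, a, r, s2, actions, features,
                    alpha=0.01, gamma=0.9, done=False):
    """One approximate Q-learning step: w_i <- w_i + alpha * difference * f_i(s,a)."""
    def q(state, action):
        return sum(w.get(k, 0.0) * v for k, v in features(state, action).items())
    # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
    q_next = 0.0 if done else max(q(s2, a2) for a2 in actions)
    difference = (r + gamma * q_next) - q(s, a)
    for k, v in features(s, a).items():
        w[k] = w.get(k, 0.0) + alpha * difference * v   # only active features change
    return w
```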
Example: Pacman Features

Q(s,a) = w1 · f_DOT(s,a) + w2 · f_GST(s,a)

f_DOT(s,a) = 1 / (distance to closest food after taking a)
f_GST(s,a) = distance to closest ghost after taking a

f_DOT(s, NORTH) = 0.5
f_GST(s, NORTH) = 1.0
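For illustration, suppose the learned weights were w1 = 4.0 and w2 = −1.0 (these particular values are an assumption, not from this slide): food proximity is rewarded, ghost proximity penalized. Then moving NORTH would score

Q(s, NORTH) = 4.0 · 0.5 + (−1.0) · 1.0 = 1.0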
Example: Q-Pacman

[Demo: approximate Q-learning Pacman, α = 0.004]
Video of Demo: Approximate Q-Learning -- Pacman
Sidebar: Q-Learning and Least Squares
Linear Approximation: Regression

[plots: fitting a line to points in one dimension, and a plane in two dimensions]
Prediction (one feature): ŷ = w0 + w1 f1(x)
Prediction (two features): ŷ = w0 + w1 f1(x) + w2 f2(x)
Optimization: Least Squares

[plot: a fitted line, an observation y, the prediction ŷ, and the error or "residual" between them]

total error = Σ_i (y_i − ŷ_i)² = Σ_i ( y_i − Σ_k w_k f_k(x_i) )²
Minimizing Error

Approximate q update explained. Imagine we had only one point x, with features f(x), target value y, and weights w:

error(w) = ½ ( y − Σ_k w_k f_k(x) )²
∂error(w)/∂w_m = −( y − Σ_k w_k f_k(x) ) f_m(x)
w_m ← w_m + α ( y − Σ_k w_k f_k(x) ) f_m(x)

The approximate q update has the same form, with "target" r + γ max_{a'} Q(s',a') and "prediction" Q(s,a):

w_m ← w_m + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] f_m(s,a)
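The single-point update as a Python sketch (list-based weights and features are an illustrative choice):

```python
def gradient_step(w, f_x, y, alpha=0.1):
    """One gradient-descent step on error(w) = 1/2 * (y - w . f(x))^2:
    w_m <- w_m + alpha * (y - prediction) * f_m(x)."""
    prediction = sum(wm * fm for wm, fm in zip(w, f_x))
    residual = y - prediction                     # "target" minus "prediction"
    return [wm + alpha * residual * fm for wm, fm in zip(w, f_x)]
```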
Overfitting: Why Limiting Capacity Can Help

[plot: a degree-15 polynomial wildly overshooting the training points]
Simple Problem

Given: features of the current state. Predict: will Pacman die on the next step?

Just one feature. See a pattern?
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman dies
§ Ghost one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives
§ Ghost more than one step away, Pacman lives

Learn: ghost one step away → Pacman dies!
What if we add more features?

§ Ghost one step away, score 211, Pacman dies
§ Ghost one step away, score 341, Pacman dies
§ Ghost one step away, score 231, Pacman dies
§ Ghost one step away, score 121, Pacman dies
§ Ghost one step away, score 301, Pacman lives
§ Ghost more than one step away, score 205, Pacman lives
§ Ghost more than one step away, score 441, Pacman lives
§ Ghost more than one step away, score 219, Pacman lives
§ Ghost more than one step away, score 199, Pacman lives
§ Ghost more than one step away, score 331, Pacman lives
§ Ghost more than one step away, score 251, Pacman lives

Learn: ghost one step away AND score is NOT a prime number → Pacman dies!
There's fitting, and there's…

[plot: degree-1 polynomial fit to the data]
[plot: degree-2 polynomial fit to the same data]
[plot: degree-15 polynomial fit to the same data] …Overfitting
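The effect is easy to reproduce with NumPy's least-squares polynomial fit; this sketch (synthetic data with an assumed noise level) shows training error shrinking as the degree grows, even though the high-degree curve is garbage between the points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 21)                      # inputs (rescaled for stability)
y = 10 * x + rng.normal(0.0, 2.0, size=x.shape)    # noisy linear data

for degree in (1, 2, 15):
    coeffs = np.polyfit(x, y, degree)              # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: training MSE = {mse:.3f}")
# Training error always shrinks with degree, but the degree-15 curve
# oscillates wildly between points: it fits the noise, not the signal.
```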
Approximating the Q Function

• Linear approximation
• Could also use a deep neural network to represent Q(s,a)
  – https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

DeepMind Atari: https://www.youtube.com/watch?v=V1eYniJ0Rnk
DQN Results on Atari

Slide adapted from David Silver
Approximating the Q Function

Linear approximation:
[diagram: features f1(s,a), f2(s,a), …, fm(s,a) combined linearly to produce Q]

Neural approximation (nonlinear):
[diagram: the same features fed through sigmoid units, a = h(z) = 1 / (1 + e^{−z})]
Deep Representations

• A deep representation is a composition of many functions:
  x → h1 → … → hn → y → l, with weights w1, …, wn entering at each layer
• Its gradient can be backpropagated by the chain rule:
  ∂l/∂hn = (∂l/∂y)(∂y/∂hn), …, ∂l/∂h1 = (∂l/∂h2)(∂h2/∂h1), ∂l/∂x = (∂l/∂h1)(∂h1/∂x)
  and the weight gradients are ∂l/∂wk = (∂l/∂hk)(∂hk/∂wk)

Slide adapted from David Silver
Multi-Layer Perceptron

[diagram: inputs [X1, X2, X3] at layer i, hidden layer j, outputs [Y1, Y2] at layer k; weights v_ij between input and hidden, w_jk between hidden and output; each unit applies the sigmoid a = h(z) = 1 / (1 + e^{−z})]

• Multiple layers
• Feed-forward
• Connected weights
• 1-of-N output

z_j = Σ_i w_ij x_i
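A forward pass for such a network, as a NumPy sketch (the layer sizes and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # h(z) = 1 / (1 + e^{-z})

def mlp_forward(x, V, W):
    """Feed-forward pass: z_j = sum_i v_ij * x_i, a_j = h(z_j); same for output."""
    hidden = sigmoid(V @ x)                 # input layer i -> hidden layer j
    return sigmoid(W @ hidden)              # hidden layer j -> output layer k

x = np.array([0.1, 0.7, 0.3])                        # [X1, X2, X3]
V = np.random.default_rng(0).normal(size=(4, 3))     # v_ij: 3 inputs -> 4 hidden units
W = np.random.default_rng(1).normal(size=(2, 4))     # w_jk: 4 hidden -> 2 outputs
print(mlp_forward(x, V, W))                          # [Y1, Y2]
```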
Training via Stochastic Gradient Descent

• Sample the gradient of the expected loss L(w) = E[l]:
  ∂l/∂w ∼ E[∂l/∂w] = ∂L(w)/∂w
• Adjust w down the sampled gradient:
  Δw ∝ −∂l/∂w
Gradient Descent

• The critic returns an error which, in combination with the function approximator, can be used to create an error function.
• The partial differential of this error function (the gradient) can now be used to update the internal variables in the function approximator (and critic).

Slide adapted from David Silver
Aka… Backpropagation

[diagram: network with input layer i, hidden layer j, output layer k; weights v_ij and w_jk]

• Minimize the error of the calculated output
• Adjust the weights by gradient descent
• Procedure: a forward phase, then backpropagation of errors
• For each sample, multiple epochs
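A minimal sketch of one forward/backward pass for a one-hidden-layer sigmoid network with squared-error loss (shapes and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, V, W, alpha=0.1):
    """One forward phase plus backpropagation of errors; updates V, W in place."""
    # forward phase
    h = sigmoid(V @ x)                           # hidden activations
    out = sigmoid(W @ h)                         # network output
    # backpropagation of errors (chain rule; sigmoid'(z) = a * (1 - a))
    delta_out = (out - y) * out * (1 - out)      # error signal at output layer k
    delta_hid = (W.T @ delta_out) * h * (1 - h)  # error signal at hidden layer j
    W -= alpha * np.outer(delta_out, h)          # adjust weights w_jk
    V -= alpha * np.outer(delta_hid, x)          # adjust weights v_ij
    return V, W
```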
Weight Sharing

A recurrent neural network shares weights between time-steps:
[diagram: … → h_t → h_{t+1} → …, with inputs x_t, x_{t+1} and outputs y_t, y_{t+1}; every transition uses the same weights w]

A convolutional neural network shares weights between local regions:
[diagram: hidden units h1, h2 each computed from a local window of x using the same weights w1, w2]

Slide adapted from David Silver
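A toy illustration of convolutional weight sharing: the same two weights are applied at every position of the input (sizes and values are made up):

```python
import numpy as np

def conv1d_shared(x, w):
    """Convolution: the same weights w are applied to every local region of x."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, -0.5])                  # shared weights [w1, w2]
print(conv1d_shared(x, w))                 # h1, h2, ... all computed with the same w
```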
Recap: Approximate Q-Learning

• Optimal Q-values should obey the Bellman equation:
  Q*(s,a) = E_{s'} [ r + γ max_{a'} Q*(s',a') | s, a ]
• Treat the right-hand side r + γ max_{a'} Q(s',a',w) as a target
• Minimise the MSE loss by stochastic gradient descent:
  l = ( r + γ max_{a'} Q(s',a',w) − Q(s,a,w) )²
• Converges to Q* using a table-lookup representation
• But diverges using neural networks due to:
  – Correlations between samples
  – Non-stationary targets

Slide adapted from David Silver
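As a sketch, the squared TD error for one transition (the `q(state) → {action: value}` interface is an assumed stand-in for a parametrised Q-function):

```python
def td_loss(q, transition, gamma=0.99):
    """Squared TD error: l = (r + gamma * max_a' Q(s',a',w) - Q(s,a,w))^2.

    `q(state)` is assumed to return a dict mapping action -> value under weights w."""
    s, a, r, s2, done = transition
    target = r if done else r + gamma * max(q(s2).values())  # treated as fixed
    return (target - q(s)[a]) ** 2
```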
Deep Q-Networks (DQN): Experience Replay

To remove correlations, build a data-set from the agent's own experience:
(s1, a1, r2, s2)
(s2, a2, r3, s3)
(s3, a3, r4, s4)
…
(s_t, a_t, r_{t+1}, s_{t+1})   →   (s, a, r, s')

Sample experiences from the data-set and apply the update:
l = ( r + γ max_{a'} Q(s',a',w⁻) − Q(s,a,w) )²

To deal with non-stationarity, the target parameters w⁻ are held fixed.

Slide adapted from David Silver
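A minimal replay-buffer sketch in Python (the capacity and interface are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) transitions; sampling uniformly at random
    breaks the correlations between consecutive experiences."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experience is evicted first

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```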
DQN in Atari

• End-to-end learning of values Q(s,a) from pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Output is Q(s,a) for 18 joystick/button positions
• Reward is the change in score for that step
• Network architecture and hyperparameters fixed across all games

Slide adapted from David Silver
DeepMind Resources

See also: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
That's All for Reinforcement Learning!

• Very tough problem: how to perform any task well in an unknown, noisy environment!
• Traditionally used mostly for robotics, but…
  – Google DeepMind: RL applied to data center power usage

[diagram: a reinforcement learning agent turns data (experiences with the environment) into a policy (how to act in the future)]
That's All for Reinforcement Learning!

Lots of open research areas:
– How to best balance exploration and exploitation?
– How to deal with cases where we don't know a good state/feature representation?
Conclusion

• We're done with Part I: Search and Planning!
• We've seen how AI methods can solve problems in:
  – Search
  – Constraint Satisfaction Problems
  – Games
  – Markov Decision Problems
  – Reinforcement Learning
• Next up: Part II: Uncertainty and Learning!