Kernelized Value Function Approximation for Reinforcement Learning Gavin Taylor and Ronald Parr Duke University


• Kernelized Value Function Approximation for Reinforcement Learning
Gavin Taylor and Ronald Parr
Duke University

• Overview

• Overview - Contributions
- Construct new model-based VFA
- Equate novel VFA with previous work
- Decompose Bellman Error into reward and transition error
- Use decomposition to understand VFA
[Figure: reward error and transition error combining into Bellman Error]

• Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions

• Markov Reward Processes
M = (S, P, R, γ)
Value: V(s) = expected, discounted sum of rewards from state s
Bellman equation: V(s) = R(s) + γ Σ_{s'} P(s'|s) V(s')

Bellman equation in matrix notation: V = R + γPV
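As a minimal numerical sketch of the matrix form, the Bellman equation V = R + γPV can be solved in closed form; the 2-state MRP below is hypothetical, not from the talk:

```python
import numpy as np

# Hypothetical 2-state Markov reward process (illustrative numbers only).
gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])   # P[i, j] = probability of moving from state i to j
R = np.array([1.0, 0.0])     # expected immediate reward in each state

# Bellman equation in matrix form: V = R + gamma * P V
# Rearranging gives the closed form V = (I - gamma * P)^{-1} R.
V = np.linalg.solve(np.eye(2) - gamma * P, R)

print(V)  # V satisfies V = R + gamma * P @ V
```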

• Kernels
Properties:
- Symmetric function between two points: k(s, s') = k(s', s)
- PSD K-matrix
Uses:
- Dot-product in high-dimensional space (kernel trick)
- Gain expressiveness
Risks:
- Overfitting
- High computational cost

• Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions

• Kernelized Regression
Apply kernel trick to least-squares regression: f(x) = k(x)^T (K + Λ)^{-1} t

t: target values
K: kernel matrix, where K_ij = k(x_i, x_j)
k(x): column vector, where k(x)_i = k(x, x_i)
Λ: regularization matrix
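A minimal NumPy sketch of this kernelized, regularized regression; the Gaussian kernel, bandwidth, and 1-d training data are illustrative assumptions, not from the talk:

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 * bandwidth^2)) for 1-d inputs
    return np.exp(-(a - b) ** 2 / (2 * bandwidth ** 2))

# Training inputs x and targets t (hypothetical 1-d data).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.sin(x)

K = gaussian_kernel(x[:, None], x[None, :])  # K_ij = k(x_i, x_j)
Lam = 0.1 * np.eye(len(x))                   # regularization matrix

# Prediction at a query point x*: f(x*) = k(x*)^T (K + Lam)^{-1} t
alpha = np.linalg.solve(K + Lam, t)

def predict(x_star):
    return gaussian_kernel(x_star, x) @ alpha
```

Without the regularization matrix the predictor interpolates the targets exactly; Λ trades that exact fit for smoothness.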

• Kernel-Based Models
Approximate reward model: r̂ = K(K + Λ_r)^{-1} r

Approximate transition model
- Want to predict k(s') (not s')
- Construct matrix K', where K'_ij = k(s_i', s_j)
- Approximate next kernel values: K̂' = K(K + Λ_P)^{-1} K'

• Model-based Value Function

• Model-based Value Function
Unregularized: V(s) = k(s)^T (K − γK')^{-1} r
Regularized: V(s) = k(s)^T (K − γK̂')^{-1} r̂
Whole state space: V = K(K − γK')^{-1} r
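A toy check of the unregularized form V(s) = k(s)^T (K − γK')^{-1} r: on a deterministic 3-state chain with a delta kernel (assumptions chosen so the kernel regression is exact), the model-based value function should reproduce the true values:

```python
import numpy as np

# Deterministic chain 0 -> 1 -> 2 -> 2, sampled once per state (illustrative).
gamma = 0.9
states = np.array([0, 1, 2])
next_states = np.array([1, 2, 2])
r = np.array([0.0, 0.0, 1.0])

# Delta kernel: k(s, u) = 1 if s == u else 0.
K = (states[:, None] == states[None, :]).astype(float)            # K_ij = k(s_i, s_j)
K_next = (next_states[:, None] == states[None, :]).astype(float)  # K'_ij = k(s_i', s_j)

# Model-based value function over the sampled states: V = K (K - gamma K')^{-1} r
V = K @ np.linalg.solve(K - gamma * K_next, r)

# With this kernel, K' coincides with the true transition matrix,
# so compare with the exact Bellman solution V_true = (I - gamma P)^{-1} R.
V_true = np.linalg.solve(np.eye(3) - gamma * K_next, r)
```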

• Previous Work
Kernel Least-Squares Temporal Difference Learning (KLSTD) [Xu et al., 2005]
- Rederive LSTD, replacing dot products with kernels
- No regularization
Gaussian Process Temporal Difference Learning (GPTD) [Engel et al., 2005]
- Model value directly with a GP
Gaussian Processes in Reinforcement Learning (GPRL) [Rasmussen and Kuss, 2004]
- Model transitions and value with GPs
- Deterministic reward

• Equivalency
Σ: GPTD noise parameter
Λ: GPRL regularization parameter

[Table: value function and its model-based equivalent for each method — KLSTD, GPTD, GPRL, and Model-based [T&P '09]]

• Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions

• Model Error
Error in reward approximation: Δ_r = r − r̂

Error in transition approximation: Δ_K' = K' − K̂'
K': expected next kernel values
K̂': approximate next kernel values

• Bellman Error
Bellman Error is a linear combination of reward and transition errors:
BE = Δ_r + γ Δ_K' w, where w are the value-function weights
[Figure: reward error and transition error combining into Bellman Error]
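A numerical sketch of this decomposition (the random synthetic transitions, Gaussian kernel, and regularizer Λ = 0.1·I are all illustrative assumptions): after fitting reward and transition models by kernel regression and solving the model-based fixed point, the Bellman error at the samples equals the reward error plus γ times the transition error applied to the weights:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

# Synthetic sampled transitions (purely illustrative, not the talk's domain).
X = rng.normal(size=(4, 2))   # sampled states
Xn = rng.normal(size=(4, 2))  # sampled next states
r = rng.normal(size=4)        # sampled rewards

def kmat(A, B):
    # Gaussian kernel matrix: entry (i, j) is exp(-|A_i - B_j|^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

K = kmat(X, X)        # K_ij = k(s_i, s_j)
K_next = kmat(Xn, X)  # K'_ij = k(s_i', s_j), the next kernel values

# Regularized kernel-regression approximations of reward and next kernel values.
Lam = 0.1 * np.eye(4)
r_hat = K @ np.linalg.solve(K + Lam, r)
K_next_hat = K @ np.linalg.solve(K + Lam, K_next)

# Value weights w solve the fixed point K w = r_hat + gamma * K_next_hat w.
w = np.linalg.solve(K - gamma * K_next_hat, r_hat)

# Bellman error at the samples vs. reward error + gamma * transition error * w.
bellman_error = r + gamma * K_next @ w - K @ w
decomposition = (r - r_hat) + gamma * (K_next - K_next_hat) @ w
```

The two vectors agree by construction of the fixed point, which is the algebraic content of the decomposition.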

• Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions

• Experiments
Version of the two-room problem [Mahadevan & Maggioni, 2006]
Use Bellman Error decomposition to tune regularization parameters
[Figure: reward function of the two-room domain]

• Experiments

• Conclusion
- Novel, model-based view of kernelized RL built around kernel regression
- Previous work differs from model-based view only in approach to regularization
- Bellman Error can be decomposed into transition and reward error
- Transition and reward error can be used to tune parameters

• Thank you!

• What about policy improvement?
- Wrap policy iteration around kernelized VFA; example: KLSPI
- Bellman error decomposition will be policy dependent
- Choice of regularization parameters may be policy dependent
- Our results do not apply to SARSA variants of kernelized RL, e.g., GPSARSA

• What's left?
Kernel selection (not just parameter tuning)
- Varying kernel parameters across states
- Combining kernels (see Kolter & Ng '09)
Computational costs in large problems
- K is #samples × #samples
- Inverting K is expensive
- Role of sparsification, interaction w/ regularization

• Comparing model-based approaches
Transition model
- GPRL: models s' as a GP
- T&P: approximates k(s') given k(s)
Reward model
- GPRL: deterministic reward
- T&P: reward approximated with regularized, kernelized regression

• Don't you have to know the model?
For our experiments & graphs: reward and transition errors calculated with the true R and K'

In practice: Cross-validation could be used to tune parameters to minimize reward and transition errors

• Why is the GPTD regularization term asymmetric?
- GPTD is equivalent to T&P when …
- Can be viewed as propagating the regularizer through the transition model
- Is this a good idea? Our contribution: tools to evaluate this question

• What about Variances?
- Variances can play an important role in Bayesian interpretations of kernelized RL
- Can guide exploration
- Can ground regularization parameters
- Our analysis focuses on the mean; variances are a valid topic for future work

• Does this apply to the recent work of Farahmand et al.?
- Not directly
- All methods assume (s, r, s') data
- Farahmand et al. include next states (s') in their kernel, i.e., k(s, s') and k(s', s')
- Previous work, and ours, includes only s in the kernel: k(s, s)

• How is This Different from Parr et al. ICML 2008?
- Parr et al. considers linear fixed point solutions, not kernelized methods
- Equivalence between linear fixed point methods was fairly well understood already

Our contribution:
- We provide a unifying view of previous kernel-based methods
- We extend the equivalence between model-based and direct methods to the kernelized case

We are dealing with kernelized value-function approximations of Markov Processes. In past work there were two different ways of producing such an approximation: first, solve for the value directly; second, solve for a model first, then solve for the value function given the model. These were different techniques with no unified approach; we will show they are the same. Kernels are commonly used in support vector machines, kernel regression, and GPs. A GP is a regression that returns a distribution, which gives you a measure of uncertainty. We'll be building our model-based VFA using kernelized regression, so let's run through that: from the training data, we are going to predict our next kernel values. Note, this is a model-free approach that approximates the value directly.