# Kernelized Value Function Approximation for Reinforcement Learning

Post on 12-Jan-2016


Kernelized Value Function Approximation for Reinforcement Learning
Gavin Taylor and Ronald Parr
Duke University

Overview

Overview - Contributions
- Construct a new model-based VFA
- Equate the novel VFA with previous work
- Decompose the Bellman Error into reward error and transition error
- Use the decomposition to understand VFA

(Diagram: Bellman Error split into reward error and transition error)

Outline
- Motivation, Notation, and Framework
- Kernel-Based Models
- Model-Based VFA
- Interpretation of Previous Work
- Bellman Error Decomposition
- Experimental Results and Conclusions

Markov Reward Processes
M = (S, P, R, γ)
Value: V(s) = expected, discounted sum of rewards from state s
Bellman equation: V(s) = R(s) + γ Σ_{s'} P(s'|s) V(s')

Bellman equation in matrix notation: V = R + γPV, which rearranges to V = (I − γP)⁻¹R
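The matrix form of the Bellman equation can be solved directly as a linear system. A minimal sketch, using a hypothetical 3-state MRP (the transition and reward numbers are illustrative, not from the talk):

```python
import numpy as np

# Hypothetical 3-state Markov reward process (numbers are illustrative only).
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])  # P[i, j] = Pr(s' = j | s = i)
R = np.array([0.0, 1.0, 5.0])    # expected immediate reward in each state

# Bellman equation in matrix form: V = R + gamma * P V
# Rearranged: (I - gamma * P) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)

# The solution satisfies the Bellman equation.
print(np.allclose(V, R + gamma * P @ V))  # True
```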

Kernels
Properties:
- Symmetric function between two points: k(s, s') = k(s', s)
- Positive semi-definite (PSD) K-matrix
Uses:
- Dot product in a high-dimensional space (kernel trick)
- Gain expressiveness
Risks:
- Overfitting
- High computational cost
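As a concrete illustration of these properties, a Gaussian (RBF) kernel is symmetric and yields a PSD kernel matrix. A quick numerical check; the bandwidth and sample states are arbitrary choices, not from the talk:

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Symmetric: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) = k(y, x)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

# Kernel matrix over a small sample of 1-D states.
states = np.array([[0.0], [0.5], [1.0], [2.0]])
K = np.array([[gaussian_kernel(a, b) for b in states] for a in states])

print(np.allclose(K, K.T))                           # True: symmetric
print(bool(np.all(np.linalg.eigvalsh(K) > -1e-10)))  # True: PSD up to rounding
```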

OutlineMotivation, Notation, and FrameworkKernel-Based ModelsModel-Based VFAInterpretation of Previous WorkBellman Error DecompositionExperimental Results and Conclusions

Kernelized Regression
Apply the kernel trick to least-squares regression:

f(x) = k(x)ᵀ(K + Λ)⁻¹t

- t: target values
- K: kernel matrix, where Kᵢⱼ = k(xᵢ, xⱼ)
- k(x): column vector, where k(x)ᵢ = k(x, xᵢ)
- Λ: regularization matrix
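The regression above can be sketched in a few lines. This assumes a Gaussian kernel and a diagonal regularizer Λ = 0.1·I, both illustrative choices:

```python
import numpy as np

def k(x, y, bw=0.5):
    # Gaussian kernel (illustrative choice)
    return np.exp(-(x - y) ** 2 / (2 * bw ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 20)               # training inputs
t = np.sin(X) + 0.05 * rng.standard_normal(20)  # noisy targets

K = k(X[:, None], X[None, :])        # K[i, j] = k(x_i, x_j)
Lam = 0.1 * np.eye(20)               # regularization matrix Lambda
alpha = np.linalg.solve(K + Lam, t)  # (K + Lambda)^{-1} t

def predict(x):
    # f(x) = k(x)^T (K + Lambda)^{-1} t
    return k(x, X) @ alpha

print(float(predict(np.pi / 2)))  # close to sin(pi/2) = 1
```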

Kernel-Based Models
Approximate reward model: fit the sampled rewards with kernelized regression, r̂(s) = k(s)ᵀ(K + Σᵣ)⁻¹r

Approximate transition model:
- Want to predict k(s′) (not s′ itself)
- Construct matrix K′, where K′ᵢⱼ = k(s′ᵢ, sⱼ)
- Predict expected next kernel values by regressing from K to K′
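Both models reduce to kernelized regressions over the training samples. A sketch under assumed choices (Gaussian kernel, a synthetic 1-D chain, diagonal regularizers); none of these specifics come from the talk:

```python
import numpy as np

def k(x, y, bw=0.3):
    return np.exp(-(x - y) ** 2 / (2 * bw ** 2))

# Hypothetical (s, r, s') samples from a 1-D chain that drifts right.
rng = np.random.default_rng(1)
S = rng.uniform(0, 1, 15)            # sampled states s_i
S_next = np.clip(S + 0.1, 0, 1)      # next states s'_i
r = S.copy()                         # reward grows with the state

K = k(S[:, None], S[None, :])            # K[i, j]  = k(s_i, s_j)
K_next = k(S_next[:, None], S[None, :])  # K'[i, j] = k(s'_i, s_j)

# Approximate reward model: r_hat = K (K + Sigma_r)^{-1} r
Sigma_r = 0.01 * np.eye(15)
r_hat = K @ np.linalg.solve(K + Sigma_r, r)

# Approximate transition model: predict next kernel values k(s')
# by regressing each column of K' on the kernel features of s.
Sigma_p = 0.01 * np.eye(15)
K_next_hat = K @ np.linalg.solve(K + Sigma_p, K_next)

print(float(np.max(np.abs(r_hat - r))))  # max reward-fit error at the samples
```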

Model-based Value Function
Solve for the fixed point of the approximate reward and transition models:
- Unregularized: V = K(K − γK′)⁻¹r
- Regularized: the regularization matrices of the reward and transition models enter the inverse; different placements recover different prior methods
- Whole state space: evaluate V(s) = k(s)ᵀw at any state s
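A sketch of the unregularized and regularized solutions, using an assumed Gaussian kernel and synthetic (s, r, s′) samples; the regularizer value and its placement here are illustrative:

```python
import numpy as np

def k(x, y, bw=0.2):
    return np.exp(-(x - y) ** 2 / (2 * bw ** 2))

gamma = 0.9
rng = np.random.default_rng(2)
S = rng.uniform(0, 1, 12)            # sampled states
S_next = np.clip(S + 0.1, 0, 1)      # sampled next states
r = S.copy()                         # illustrative rewards

K = k(S[:, None], S[None, :])            # K[i, j]  = k(s_i, s_j)
K_next = k(S_next[:, None], S[None, :])  # K'[i, j] = k(s'_i, s_j)

# Unregularized: V = K (K - gamma K')^{-1} r  (can be ill-conditioned)
w = np.linalg.solve(K - gamma * K_next, r)

# Regularized: add a regularization matrix inside the inverse
# (one of several possible placements).
Lam = 1e-3 * np.eye(12)
w_reg = np.linalg.solve(K - gamma * K_next + Lam, r)
V = K @ w_reg

# Whole state space: V(s) = k(s)^T w_reg at any query state s
def value(s):
    return k(s, S) @ w_reg

print(V.shape)  # (12,)
```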

Previous Work
- Kernel Least-Squares Temporal Difference Learning (KLSTD) [Xu et al., 2005]: rederives LSTD, replacing dot products with kernels; no regularization
- Gaussian Process Temporal Difference Learning (GPTD) [Engel et al., 2005]: models value directly with a GP
- Gaussian Processes in Reinforcement Learning (GPRL) [Rasmussen and Kuss, 2004]: models transitions and value with GPs; deterministic reward

Equivalency: each prior method matches the model-based value function for a particular choice of regularizer; the GPTD noise parameter and the GPRL regularization parameter play the role of the regularization matrices.

(Table: each method, KLSTD, GPTD, GPRL, and Model-based [T&P '09], with its value function and its model-based equivalent)


Model Error
Error in reward approximation: Δᵣ = r − r̂

Error in transition approximation: Δ_K′ = K′ − K̂′
- K′: expected next kernel values
- K̂′: approximate next kernel values

Bellman Error
BE = Δᵣ + γΔ_K′w
The Bellman Error is a linear combination of the reward and transition errors.
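The decomposition can be checked numerically: with the model-based fixed-point weights w, the sampled Bellman error equals the reward error plus γ times the transition error applied to w. A sketch with assumed synthetic data (Gaussian kernel; the regularizer values are arbitrary):

```python
import numpy as np

def k(x, y, bw=0.1):
    return np.exp(-(x - y) ** 2 / (2 * bw ** 2))

gamma = 0.9
rng = np.random.default_rng(3)
S = rng.uniform(0, 1, 10)                # sampled states
S_next = np.clip(S + 0.1, 0, 1)          # sampled next states
r = S.copy()                             # illustrative rewards

K = k(S[:, None], S[None, :])
K_next = k(S_next[:, None], S[None, :])  # K': expected next kernel values

# Regularized reward and transition models (regularizer values are assumptions).
Sig_r = 0.05 * np.eye(10)
Sig_p = 0.05 * np.eye(10)
r_hat = K @ np.linalg.solve(K + Sig_r, r)            # approximate rewards
K_next_hat = K @ np.linalg.solve(K + Sig_p, K_next)  # approximate next kernels

# Model-based fixed point: K w = r_hat + gamma * K_next_hat w
w = np.linalg.solve(K - gamma * K_next_hat, r_hat)

# Bellman error at the samples, and its decomposition.
bellman_error = r + gamma * K_next @ w - K @ w
reward_error = r - r_hat                 # Delta_r
transition_error = K_next - K_next_hat   # Delta_K'

print(np.allclose(bellman_error,
                  reward_error + gamma * transition_error @ w))  # True
```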

Experiments
- Version of the two-room problem [Mahadevan & Maggioni, 2006]
- Use the Bellman Error decomposition to tune regularization parameters

(Figure: reward function for the two-room domain)

Experiments
(Figures: experimental results)

Conclusion
- Novel, model-based view of kernelized RL built around kernel regression
- Previous work differs from the model-based view only in its approach to regularization
- The Bellman Error can be decomposed into transition and reward error
- Transition and reward error can be used to tune parameters

Thank you!

What about policy improvement?
- Wrap policy iteration around kernelized VFA; example: KLSPI
- The Bellman error decomposition will be policy dependent
- The choice of regularization parameters may be policy dependent
- Our results do not apply to SARSA variants of kernelized RL, e.g., GPSARSA

What's left?
- Kernel selection (not just parameter tuning)
  - Varying kernel parameters across states
  - Combining kernels (see Kolter & Ng '09)
- Computational costs in large problems
  - K is #samples × #samples, and inverting K is expensive
  - Role of sparsification, interaction with regularization

Comparing model-based approaches
Transition model:
- GPRL: models s′ as a GP
- T&P: approximates k(s′) given k(s)
Reward model:
- GPRL: deterministic reward
- T&P: reward approximated with regularized, kernelized regression

Don't you have to know the model?
- For our experiments and graphs: reward and transition errors were calculated with the true R and K′
- In practice: cross-validation could be used to tune parameters to minimize reward and transition errors

Why is the GPTD regularization term asymmetric?
- GPTD is equivalent to T&P for a particular (asymmetric) choice of regularizer
- This can be viewed as propagating the regularizer through the transition model
- Is this a good idea? Our contribution: tools to evaluate this question

What about variances?
- Variances can play an important role in Bayesian interpretations of kernelized RL
- They can guide exploration and ground regularization parameters
- Our analysis focuses on the mean; variances are a valid topic for future work

Does this apply to the recent work of Farahmand et al.?
- Not directly
- All methods assume (s, r, s′) data
- Farahmand et al. include next states (s′) in their kernel, i.e., k(s, s′) and k(s′, s′)
- Previous work, and ours, includes only s in the kernel: k(s, s)

How is this different from Parr et al. ICML 2008?
- Parr et al. considers linear fixed-point solutions, not kernelized methods
- The equivalence between linear fixed-point methods was fairly well understood already

Our contribution:
- We provide a unifying view of previous kernel-based methods
- We extend the equivalence between model-based and direct methods to the kernelized case

Speaker notes:
- Dealing with kernelized value-function approximations of Markov processes: past work gives two different ways of producing such an approximation. First, solve directly; second, solve for the model first, then solve for the value function given the model. Different techniques, no unified approach; we will show they are the same.
- Kernels are commonly used in support vector machines, kernel regression, and GPs. A GP is a regression that returns a distribution, giving you a measure of uncertainty.
- We'll be building our model-based VFA using kernelized regression, so let's run through that: given training data, we are going to predict our next kernel values.
- Note: this is a model-free approach that approximates the value directly.