TRANSCRIPT
A Bayesian Approach to Empirical Local Linearization for Robotics
Jo-Anne Ting (1), Aaron D'Souza (2), Sethu Vijayakumar (3), Stefan Schaal (1)
(1) University of Southern California, (2) Google, Inc., (3) University of Edinburgh
ICRA 2008, May 23, 2008
Outline
• Motivation
• Past & related work
• Bayesian locally weighted regression
• Experimental results
• Conclusions
Motivation
• Locally linear methods have been shown to be useful for robot control (e.g., learning internal models of high-dimensional systems for feedforward control, or local linearizations for optimal control & reinforcement learning).
• A key problem is to find the "right" size of the local region for a linearization, as in locally weighted regression.
• Existing methods* either use cross-validation techniques or complex statistical hypothesis testing, or require significant manual parameter tuning for good & stable performance.
*e.g., supersmoothing (Friedman, '84), LWPR (Vijayakumar et al., '05), (Fan & Gijbels, '92 & '95)
Quick Review of Locally Weighted Regression
• Given a nonlinear regression problem, $y = f(\mathbf{x}) + \epsilon$, our goal is to approximate a locally linear model at each query point $\mathbf{x}_q$ in order to make the prediction $y_q = \mathbf{b}^T \mathbf{x}_q$.
• We compute the measure of locality for each data sample with a spatial weighting kernel K, e.g., $w_i = K(\mathbf{x}_i, \mathbf{x}_q, h)$.
• If we can find the "right" local regime for each $\mathbf{x}_q$, nonlinear function approximation may be solved accurately and efficiently (see the sketch below).
Previous methods may:
i) be sensitive to initial values
ii) require tuning/setting of open parameters
iii) be computationally involved
[Figure: weighting kernel over the input space, axes X and Y]
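For concreteness, here is a minimal numpy sketch of the generic locally weighted regression prediction described above. The Gaussian kernel, the appended bias term, and the small ridge term are illustrative assumptions, not choices prescribed by the slides, which only specify the generic form $w_i = K(\mathbf{x}_i, \mathbf{x}_q, h)$ and $y_q = \mathbf{b}^T \mathbf{x}_q$.

```python
import numpy as np

def lwr_predict(X, y, x_q, h, ridge=1e-8):
    """Locally weighted linear regression prediction at a query point x_q.

    X: (N, d) inputs, y: (N,) targets, x_q: (d,) query, h: bandwidth.
    Gaussian kernel, bias term, and ridge term are illustrative choices.
    """
    # Spatial weights w_i = K(x_i, x_q, h): here a Gaussian kernel.
    d2 = np.sum((X - x_q) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / h ** 2)

    # Weighted least squares for the local linear model y ~ b^T x (+ bias).
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])       # append bias column
    W = np.diag(w)
    A = Xa.T @ W @ Xa + ridge * np.eye(Xa.shape[1])
    b = np.linalg.solve(A, Xa.T @ W @ y)

    # Prediction y_q = b^T x_q (plus the bias).
    return b[:-1] @ x_q + b[-1]

# Example: noisy 1-D sine, query at x = 0.3 (illustrative data only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(lwr_predict(X, y, np.array([0.3]), h=0.2))
```

Everything hinges on the bandwidth h: too small and the fit is noisy, too large and the linear model averages over the nonlinearity, which is exactly the open-parameter problem the slides address.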
Bayesian Locally Weighted Regression
• Our variational Bayesian algorithm:
i. Learns both b and the optimal h
ii. Handles high-dimensional data
iii. Associates a scalar indicator weight w_i with each data sample
• We assume the following prior distributions:
[Graphical model: for each data sample i = 1,..,N and each input dimension m = 1,..,d, the observed input x_im and output y_i, the latent indicator weight w_im, the regression coefficients b_m, the noise variance σ² with hyperparameters n and σ_N², and the bandwidths h_m with hyperparameters a_hm and b_hm]
$$p(y_i \mid \mathbf{x}_i) \sim \text{Normal}(\mathbf{b}^T \mathbf{x}_i, \sigma^2)$$
$$p(\mathbf{b} \mid \sigma^2) \sim \text{Normal}(\mathbf{0}, \sigma^2 \boldsymbol{\Sigma}_{b0})$$
$$p(\sigma^2) \sim \text{Scaled-Inv-}\chi^2(n, \sigma_N^2)$$

where each data sample has a weight $w_i$:

$$w_i = \prod_{m=1}^{d} w_{im}, \quad p(w_{im}) \sim \text{Bernoulli}\!\left(\left[1 + (x_{im} - x_{qm})^{r} h_m\right]^{-1}\right), \quad h_m \sim \text{Gamma}(a_{h_m}, b_{h_m})$$
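Read as a kernel, the Bernoulli mean above shrinks toward zero as x_im moves away from the query x_qm, at a rate controlled by the learnt bandwidth h_m. Below is a minimal numpy sketch that evaluates these per-dimension means and their product weight w_i; treating the Bernoulli means as soft weights, fixing the exponent r = 2, and using the absolute distance are illustrative assumptions, not statements from the slides.

```python
import numpy as np

def kernel_weights(X, x_q, h, r=2):
    """Soft weights implied by the Bernoulli prior on w_im.

    q_im = 1 / (1 + |x_im - x_qm|^r * h_m) follows the slide's formula
    (absolute distance and r = 2 are assumptions to keep it positive).
    Larger h_m shrinks the local regime in dimension m.
    """
    q = 1.0 / (1.0 + np.abs(X - x_q) ** r * h)   # (N, d) per-dimension weights
    return q.prod(axis=1)                        # w_i = prod_m w_im

# Example: two input dimensions with very different learnt bandwidths.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
x_q = np.zeros(2)
h = np.array([100.0, 0.1])   # dim 0: narrow local regime, dim 1: nearly global
print(kernel_weights(X, x_q, h))
```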
Inference Procedure
• We can treat this as an EM learning problem (Dempster & Laird, '77): maximize $L$, where

$$L = \sum_{i=1}^{N} \log p(y_i, w_i, \mathbf{b}, \mathbf{z}, \sigma^2, h \mid \mathbf{x}_i)$$

which is written out as

$$L = \sum_{i=1}^{N} w_i \log p(y_i \mid \mathbf{x}_i, \mathbf{b}, \sigma^2) + \sum_{i=1}^{N}\sum_{m=1}^{d} \log p(w_{im}) + \log p(\mathbf{b} \mid \sigma^2) + \log p(\sigma^2) + \log p(h)$$

• We use a variational factorial approximation of the true joint posterior distribution* (e.g., Ghahramani & Beal, '00) and a variational approximation on concave/convex functions, as suggested by Jaakkola & Jordan ('00), to get analytically tractable inference.

*$Q(\mathbf{b}, \mathbf{z}, \sigma^2, h) = Q(\mathbf{b}, \mathbf{z}, \sigma^2)\, Q(h)$
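As a rough illustration of the structure of L, the sketch below evaluates its two data-dependent sums for given current estimates of b, σ², and h, reusing the soft-weight interpretation from the previous sketch. The prior terms log p(b | σ²), log p(σ²), and log p(h) are omitted, the binary indicators are replaced by their Bernoulli means, and the actual variational E/M update equations of the paper are not reproduced here; all of these are simplifying assumptions for illustration only.

```python
import numpy as np

def data_terms_of_L(X, y, x_q, b, sigma2, h, r=2):
    """Data-dependent terms of L for current estimates of b, sigma^2, h.

    The binary w_im are replaced by their Bernoulli means q_im (a soft,
    EM-style relaxation used only for illustration); prior terms over
    b, sigma^2, and h are left out. r = 2 is an assumed exponent.
    """
    q = 1.0 / (1.0 + np.abs(X - x_q) ** r * h)       # Bernoulli means q_im
    w = q.prod(axis=1)                               # w_i = prod_m w_im

    # sum_i w_i log p(y_i | x_i, b, sigma^2), with p = Normal(b^T x_i, sigma^2)
    resid = y - X @ b
    data_term = np.sum(w * (-0.5 * np.log(2 * np.pi * sigma2)
                            - resid ** 2 / (2 * sigma2)))

    # sum_i sum_m log p(w_im), evaluated at the soft indicators q_im
    weight_term = np.sum(q * np.log(q) + (1 - q) * np.log1p(-q))

    return data_term + weight_term

# Example call with arbitrary current estimates (illustrative values only).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.standard_normal(50)
print(data_terms_of_L(X, y, np.zeros(2), np.array([1.0, -0.5]), 0.01,
                      np.array([5.0, 5.0])))
```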
Important Things to Note
• For each local model, our algorithm:
i. Learns the optimal bandwidth value, h (i.e., the "appropriate" local regime)
ii. Is linear in the number of input dimensions per EM iteration (for an extended model with intermediate hidden variables, z, introduced for fast computation)
iii. Provides a natural framework to incorporate prior knowledge of the strong (or weak) presence of noise
Experimental Results: Synthetic data
[Figure: function with discontinuity + N(0, 0.3025) output noise; function with increasing curvature + N(0, 0.01) output noise; axes X vs. Y]
Experimental Results: Synthetic data
[Figure: function with peak + N(0, 0.09) output noise; straight line (notice that "flat" kernels are learnt)]
Experimental Results: Synthetic data
[Figure: 2D "cross" function* + N(0, 0.01) output noise; panels show the target function, the fit from kernel shaping, Gaussian process regression, and the kernels learnt by kernel shaping]
*Training data has 500 samples; mean-zero noise with variance 0.01 is added to the outputs.
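For readers who want to reproduce a comparable setup, the sketch below generates 500 noisy training samples from one commonly used form of the 2D "cross" function in the locally weighted learning literature. Whether the slides use this exact functional form and the [-1, 1]² input range is an assumption; only the sample count and noise variance come from the footnote above.

```python
import numpy as np

def cross_function(x1, x2):
    """A common form of the 2D "cross" benchmark function; the exact form
    used in the slides is an assumption."""
    return np.maximum.reduce([np.exp(-10 * x1 ** 2),
                              np.exp(-50 * x2 ** 2),
                              1.25 * np.exp(-5 * (x1 ** 2 + x2 ** 2))])

# 500 training samples with mean-zero output noise of variance 0.01,
# matching the setup described in the footnote; input range is assumed.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = cross_function(X[:, 0], X[:, 1]) + rng.normal(0.0, np.sqrt(0.01), size=500)
```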
Experimental Results: Robot arm data
• Given a kinematics problem for a 7-DOF robot arm, $\mathbf{p} = f(\boldsymbol{\theta})$, where the input data consists of the 7 arm joint angles $\boldsymbol{\theta}$ and $\mathbf{p} = [x\ y\ z]^T$ is the resulting position of the arm's end effector in Cartesian space, we want to estimate the Jacobian, J, for the purpose of establishing that the algorithm does the right thing for each local regression problem:

$$\frac{d\mathbf{p}}{dt} = \underbrace{\frac{df(\boldsymbol{\theta})}{d\boldsymbol{\theta}}}_{\mathbf{J}\,=\,?}\,\frac{d\boldsymbol{\theta}}{dt}$$

(a local linearization of $f$ at a query posture yields an estimate of J; see the sketch below)
• For a particular local linearization problem, we compare the estimated Jacobian using BLWR, $\mathbf{J}_{BLWR}$, to the:
• analytically computed Jacobian, $\mathbf{J}_A$
• estimated Jacobian using locally weighted regression, $\mathbf{J}_{LWR}$ (where the optimal distance metric is found with cross-validation)
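To make the connection between local linearization and Jacobian estimation concrete, here is a minimal numpy sketch that fits a weighted linear model p ≈ Jθ + const around a query posture and reads the slope off as the Jacobian estimate. It is a generic LWR stand-in with a Gaussian kernel and a toy forward map, not the Bayesian algorithm or the real arm kinematics from the slides; the bandwidth, sample sizes, and toy map are illustrative assumptions.

```python
import numpy as np

def estimate_jacobian_lwr(thetas, ps, theta_q, h, ridge=1e-8):
    """Estimate J = df/dtheta at theta_q from samples (theta_i, p_i) by
    locally weighted linear regression (generic Gaussian-kernel LWR,
    not the slides' Bayesian algorithm).

    thetas: (N, 7) joint angles, ps: (N, 3) end-effector positions.
    Returns a (3, 7) Jacobian estimate.
    """
    w = np.exp(-0.5 * np.sum((thetas - theta_q) ** 2, axis=1) / h ** 2)
    Ta = np.hstack([thetas, np.ones((thetas.shape[0], 1))])  # add constant term
    W = np.diag(w)
    A = Ta.T @ W @ Ta + ridge * np.eye(Ta.shape[1])
    B = np.linalg.solve(A, Ta.T @ W @ ps)        # (8, 3): stacked [J^T; offset]
    return B[:-1].T                              # local slope = Jacobian estimate

# Tiny check with a toy "kinematics" map (not a real arm model).
rng = np.random.default_rng(0)
A_true = rng.standard_normal((3, 7))
f = lambda th: np.sin(th @ A_true.T)             # toy forward map; true J at 0 is A_true
theta_q = np.zeros(7)
thetas = theta_q + 0.05 * rng.standard_normal((500, 7))
ps = np.array([f(t) for t in thetas])
J_hat = estimate_jacobian_lwr(thetas, ps, theta_q, h=0.05)
print(np.abs(J_hat - A_true).max())              # should be small
```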
Angular & Magnitude Differences of Jacobians
• We compare each of the estimated Jacobian matrices, $\mathbf{J}_{LWR}$ & $\mathbf{J}_{BLWR}$, with the analytically computed Jacobian, $\mathbf{J}_A$.
• Specifically, we calculate the angular & magnitude differences between the row vectors of the Jacobian matrices (see the sketch below).
• Observations:
• BLWR & LWR (with an optimally tuned distance metric) perform similarly.
• The problem is ill-conditioned and not as easy to solve as it may appear.
• Angular differences for J2 are large, but the magnitudes of the vectors are small.
[Figure: e.g., the angle between the 1st row vector of J_A and the 1st row vector of J_BLWR]
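The quantities tabulated in the appendix slide can be computed with a few lines of numpy; the sketch below returns, for each row i, the angle between J_A,i and the corresponding estimated row, and the absolute difference of their norms. The example matrices at the end are arbitrary illustrative values, not the arm data.

```python
import numpy as np

def row_differences(J_a, J_est):
    """Angular (degrees) and magnitude differences between corresponding
    row vectors of an analytical Jacobian J_a and an estimated one J_est."""
    angles, mag_diffs = [], []
    for a, e in zip(J_a, J_est):
        cos = a @ e / (np.linalg.norm(a) * np.linalg.norm(e))
        angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
        mag_diffs.append(abs(np.linalg.norm(a) - np.linalg.norm(e)))
    return np.array(angles), np.array(mag_diffs)

# Example with arbitrary 3x7 matrices (illustrative values only).
rng = np.random.default_rng(0)
J_a = rng.standard_normal((3, 7))
J_est = J_a + 0.1 * rng.standard_normal((3, 7))
print(row_differences(J_a, J_est))
```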
Conclusions
• We have a Bayesian formulation of spatially locally adaptive kernels that:
i. Learns the optimal bandwidth value, h (i.e., the "appropriate" local regime)
ii. Is computationally efficient
iii. Provides a natural framework to incorporate prior knowledge of the noise level
• Extensions to high-dimensional data with redundant & irrelevant input dimensions, an incremental version, embedding in other nonlinear methods, etc. are ongoing.
Angular & Magnitude Differences of Jacobians
Between analytical Jacobian J_A & inferred Jacobian J_BLWR:

Ji | ∠J_A,i − ∠J_BLWR,i | abs(|J_A,i| − |J_BLWR,i|) | |J_A,i| | |J_BLWR,i|
J1 | 19 degrees | 0.1129 | 0.5280 | 0.6464
J2 | 79 degrees | 0.2353 | 0.2780 | 0.0427
J3 | 25 degrees | 0.1071 | 0.4687 | 0.5758

Between analytical Jacobian J_A & inferred Jacobian of LWR (with D = 0.1), J_LWR:

Ji | ∠J_A,i − ∠J_LWR,i | abs(|J_A,i| − |J_LWR,i|) | |J_A,i| | |J_LWR,i|
J1 | 16 degrees | 0.1182 | 0.5280 | 0.6411
J2 | 85 degrees | 0.2047 | 0.2780 | 0.0734
J3 | 27 degrees | 0.1216 | 0.4687 | 0.5903
Observations:
i) BLWR & LWR (with an optimally tuned D) perform similarly.
ii) The problem is ill-conditioned (the condition number is very high, ~1e5).
iii) Angular differences for J2 are large, but the magnitudes of the vectors are small.