
This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail.


This material is protected by copyright and other intellectual property rights, and duplication or sale of all or part of any of the repository collections is not permitted, except that material may be duplicated by you for your research use or educational purposes in electronic or print form. You must obtain permission for any other use. Electronic or print copies may not be offered, whether for sale or otherwise to anyone who is not an authorised user.

Hazara, Murtaza; Kyrki, Ville
Model selection for incremental learning of generalizable movement primitives

Published in: Proceedings of the 2017 18th International Conference on Advanced Robotics, ICAR 2017

DOI: 10.1109/ICAR.2017.8023633

Published: 30/08/2017

Document Version: Peer reviewed version

Please cite the original version: Hazara, M., & Kyrki, V. (2017). Model selection for incremental learning of generalizable movement primitives. In Proceedings of the 2017 18th International Conference on Advanced Robotics, ICAR 2017 (pp. 359-366). [8023633] IEEE. https://doi.org/10.1109/ICAR.2017.8023633


Model Selection for Incremental Learning of Generalizable Movement Primitives

Murtaza Hazara and Ville Kyrki

Abstract— Although motor primitives (MPs) have been studied extensively, much less attention has been devoted to studying their generalization to new situations. To cope with varying conditions, an MP's policy encoding must support generalization over task parameters to avoid learning separate primitives for each condition. Local and linear parameterized models have been proposed to interpolate over task parameters to provide limited generalization.

In this paper, we present a global parametric motion primitive which allows generalization beyond local or linear models. Primitives are modelled using a linear basis function model with global non-linear basis functions. Using the global parametric model, we developed an online incremental learning framework for constructing a database of MPs from a single human demonstration. Above all, we propose a model selection method that can choose an optimal model complexity even with few training samples, which makes it suitable for online incremental learning. Experiments with a ball-in-a-cup task with varying string lengths demonstrate that the global parametric approach can successfully extract underlying regularities in a database of MPs, leading to enhanced generalization capability of the parametric MPs and increased speed (convergence rate) of learning. Furthermore, it significantly excels over locally weighted regression both in terms of inter- and extrapolation.

I. INTRODUCTION

Learning a skill in a perturbed environment often requires practising it under various conditions. For example, to learn to score in basketball, an individual needs to practice throwing from different locations. Subsequently, generalizing to a new situation (e.g. location) becomes easier as the individual incrementally learns the underlying regularities of the task. Incremental learning has been studied [1] in the context of iterative learning control (ILC), where a desired trajectory is adapted to a known reference trajectory in an online, incremental manner. However, in this paper, we propose incremental learning in the context of reinforcement learning (RL), where such a reference trajectory is unknown. In fact, finding the reference trajectory for the new perturbed environment is our objective.

In this paper, we propose a new global parametric learning from demonstration (LfD) approach for generalizing an imitated task to new unseen situations. We selected the ball-in-a-cup task to assess how effective our method is in generalizing from an initial demonstration with

M. Hazara and V. Kyrki are with the School of Electrical Engineering, Aalto University, Finland. murtaza.hazara@aalto.fi, ville.kyrki@aalto.fi

This work was supported by the Academy of Finland, decisions 264239 and 268580.

Fig. 1: (a) Ball-in-a-cup game with two different string lengths. (b) Models fit to 7 points of a non-linear function. Only the second order model captures the global pattern, while the GPR model tends to the mean when extrapolating, and the linear model either leans toward the mean (when trained with all samples) or finds only the local pattern (when fit to only the two points marked in green).

a certain string length to changed lengths. We demonstrate that our approach is capable of interpolation (new string length in the range of the training set) and extrapolation (string length outside the range of the training set). The kinematics of the initial demonstration are encoded using a Dynamic Movement Primitive (DMP). Afterwards, the shape parameters of the DMP are optimized using PoWER [2] for a new task parameter (string length) and added to a database (DB) of MPs. The global model is updated in an online fashion and its complexity is controlled as new MPs are added to the database. The global model can provide RL with a good initial policy, speeding up the learning process, and in return RL provides the global model with more training data, leading to an enhanced prediction of policies for new unseen task parameters.

The main contribution of this paper is providing an incremental learning framework in the context of RL for constructing a DB of MPs. The underlying regularities of this DB are extracted online using a novel global parametric model generalizing task parameters to policy parameters. Above all, we propose a novel penalized log-likelihood based model selection for controlling the complexity of the global model as new MPs are added to the DB. We also show that this model selection works even with few training samples, indicating its superiority for online incremental learning over traditional model selection methods such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and leave-one-out cross validation. Our experiments also demonstrate that our global parametric approach generalizes significantly better than LWR in terms of both inter- and extrapolation.

II. RELATED WORK

To adapt LfD models to new environments, the model parameters need to be adjusted according to parameters characterizing the new environment or task. Existing generalizable LfD models can be categorized as (i) generalization by design, where the parameters are explicit in the model structure such as the goal of a DMP; (ii) generalization based on interpolation, which uses a weighted combination of training models; and (iii) generalization by global linear models, where the model parameters depend linearly on the environment/task parameters.

Kober et al. [3] utilized a cost-regularized kernel regression (based on Gaussian process regression) for learning the mapping of new situations to meta-parameters including the initial position, goal, amplitude and duration of the MPs; they model the automatic adjustment of the meta-parameters as an RL problem. The approach then allows adjusting these designed aspects of motion based on the task but does not enable modifying other characteristics of the trajectory (policy), making the approach suitable for learning tasks where the DMPs are adapted spatially and temporally without changing the overall shape of the motion, but unsuitable for tasks where the dynamics during motion are changed, such as in this paper.

Researchers have recently shown interest also in generalizing DMP shape parameters to new situations [4], [5], [6], [7], [8], [9]. The approaches are primarily based on local regression methods. For example, Da Silva et al. [4] extract lower dimensional manifolds (latent space) from learned policies using the ISOMAP algorithm; they achieve a generalizable policy by mapping the manifolds (representing task parameters) to DMP shape parameters. Support Vector Machines with local Gaussian kernels are used for learning the mapping. Similarly, Forte et al. [5] utilize Gaussian process regression (GPR) to learn the mapping of task parameters to DMP shape parameters. Ude et al. [6] and Nemec et al. [8] utilize Locally Weighted Regression (LWR) and kernels with positive weights for learning the mapping. Stulp et al. [7] learn the original shape parameters and generalize them with a single regression using Gaussian kernels. Mülling et al. [9] propose a linear mixture of MPs for generalization and refine the generalized behavior using RL.

Although the local regression approaches might interpolate within the range of demonstrations, their extrapolation capability is not guaranteed, which is mainly because the kernels learn the local structure; thus they typically tend towards the mean of the training data when extrapolating. This problem is illustrated using a simple example in Fig. 1b, which shows local (GPR) and global (linear, second order) models fit to a set of points. When a line is fit to two points close to each other, it learns the local pattern, allowing interpolation and extrapolation in a small neighborhood. If the line fit is made using all points, the fit tends to become poor everywhere. A local regression such as GPR performs well in interpolation, but not in extrapolation. When a well fitting higher order global model can be found, it typically outperforms the others in extrapolation.
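The contrast can be reproduced with a short NumPy snippet: a second order polynomial fit to points sampled from a curved function keeps following the trend outside the training range, while a straight-line fit to all points does not. This toy example only mimics the behaviour shown in Fig. 1b; the data and the GPR model are not reproduced here.

```python
import numpy as np

# toy non-linear data (illustrative, not the data of Fig. 1b)
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
y = 10.0 + 0.01 * (x - 5.0) ** 2 + np.random.normal(0.0, 0.2, x.size)

lin = np.polyfit(x, y, 1)    # global linear model fit to all points
quad = np.polyfit(x, y, 2)   # global second order model

x_new = 45.0                 # extrapolation point outside the training range
print(np.polyval(lin, x_new), np.polyval(quad, x_new))
# the quadratic prediction keeps following the curvature; the line falls short of it
```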

Carrera et al. [10] developed a parametric MP model based on a mixture of several DMPs. First, they record multiple demonstrations. A DMP model is fit to each demonstration, and a parametric value is assigned to it representing the task environment in which the demonstration was recorded. Then, they calculate the influence of each model using a distance function between the model parameters and the parametric value describing the current environment perturbation. Using the influence of each model as its mixing coefficient, the mixture of models is computed at the acceleration level. Since it is a linear combination with positive coefficients, the mixture model is not expected to be capable of extrapolation. Calinon et al. [11] proposed an MP model based on a Gaussian mixture model and generalize it to new situations using expectation maximization. Although their model is capable of linear extrapolation, it is only applicable when the task parameters can be represented in the form of coordinate systems. All things considered, few researchers [12] have considered the extrapolation capability in generalizing task parameters to model parameters, which is the main focus of this paper.

III. METHOD

In this section, we review dynamic movement primitives (DMPs). After that, we describe our global parametric dynamic movement primitives (GPDMPs) method, which incorporates both linear and non-linear parametric models. Besides that, we review traditional model selection approaches and then explain our penalized log-likelihood based model selection method.

A. Dynamic Movement Primitives

DMPs encode a policy for a one-dimensional system using two differential equations. The first differential equation

$\dot{z} = -\tau \alpha_z z$   (1)

formulates a canonical system, where z denotes the phase of a movement; $\tau = 1/T$ represents the time constant, where T is the duration of a demonstrated motion, and $\alpha_z$ is a constant controlling the speed of the canonical system. This first order system resembles an adjustable clock driving the transform system

$\frac{1}{\tau}\ddot{x} = \alpha_x(\beta_x(g - x) - \dot{x}) + f(z; \mathbf{w})$   (2)

consisting of a simple linear dynamical system acting like a spring damper perturbed by a non-linear component (forcing function) $f(z; \mathbf{w})$. x denotes the state of the system, and g represents the goal. The linear system is critically damped by setting the gains as $\alpha_x = 4\beta_x$. The forcing function

$f(z; \mathbf{w}) = \mathbf{w}^T \mathbf{g}$   (3)

controls the trajectory of the system using a time-parameterized kernel vector $\mathbf{g}$ and a modifiable policy parameter vector (shape parameters) $\mathbf{w}$. Each element of the kernel vector

$[\mathbf{g}]_n = \frac{\psi_n(z)\, z}{\sum_{i=1}^{N}\psi_i(z)}\,(g - x_0)$   (4)

is determined by a normalized basis function $\psi_n(z)$ multiplied by the phase variable z and the scaling factor $(g - x_0)$, allowing for the spatial scaling of the resulting trajectory. Normally, a radial basis function (RBF) kernel

$\psi_n(z) = \exp(-h_n(z - c_n)^2)$   (5)

is selected as the basis function. The centres of the kernels ($c_n$) are usually equispaced in time, spanning the whole demonstrated trajectory. It is also a common practice to choose the same temporal width ($h_n = \frac{2}{3|c_n - c_{n-1}|}$) for all kernels. Furthermore, the contribution of the non-linear component (3) decays exponentially by including the phase variable z in the kernels. Hence, the transform system (2) converges to the goal g.

The shape parameter vector $\mathbf{w}$ can be learned using weighted linear regression [13]: firstly, the nominal forcing function $f^{ref}$ is retrieved by integrating the transform system (2) with respect to a human demonstration $x^{demo}$; next, the shape parameter for every kernel is estimated using

$[\mathbf{w}]_n = (\mathbf{Z}^T \Psi \mathbf{Z})^{-1} \mathbf{Z}^T \Psi \mathbf{f}^{ref}$   (6)

where $[\mathbf{f}^{ref}]_t = f^{ref}_t$, $[\mathbf{Z}]_t = z_t$, and $\Psi = \mathrm{diag}(\psi^n_1, \ldots, \psi^n_t, \ldots, \psi^n_T)$.
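For concreteness, a minimal single-DoF sketch of this imitation step and of the subsequent rollout is given below. The gain values, the kernel placement, and the width of the first kernel are illustrative assumptions, not the exact settings used in the paper.

```python
import numpy as np

def fit_dmp_weights(x_demo, dt, n_kernels=55, alpha_x=25.0, alpha_z=4.0):
    """Imitate a 1-D demonstration: invert the transform system (2) to obtain
    f_ref and solve the weighted regression of Eq. (6) kernel by kernel."""
    T = len(x_demo) * dt
    tau = 1.0 / T
    beta_x = alpha_x / 4.0                      # critically damped spring damper
    g, x0 = x_demo[-1], x_demo[0]
    xd = np.gradient(x_demo, dt)
    xdd = np.gradient(xd, dt)

    t = np.arange(len(x_demo)) * dt
    z = np.exp(-tau * alpha_z * t)              # closed-form solution of Eq. (1)
    centers = np.exp(-tau * alpha_z * np.linspace(0.0, T, n_kernels))
    widths = np.empty(n_kernels)
    widths[1:] = 2.0 / (3.0 * np.abs(np.diff(centers)))
    widths[0] = widths[1]                       # assumed width for the first kernel

    f_ref = xdd / tau - alpha_x * (beta_x * (g - x_demo) - xd)   # inverted Eq. (2)
    Z = z * (g - x0)                            # regressor including spatial scaling

    w = np.empty(n_kernels)
    for n in range(n_kernels):
        psi = np.exp(-widths[n] * (z - centers[n]) ** 2)         # Eq. (5)
        w[n] = (Z * psi) @ f_ref / ((Z * psi) @ Z + 1e-12)       # Eq. (6), per kernel
    return w, centers, widths

def rollout_dmp(w, centers, widths, x0, g, T, dt, alpha_x=25.0, alpha_z=4.0):
    """Reproduce the motion by Euler integration of Eqs. (1)-(4)."""
    tau, beta_x = 1.0 / T, alpha_x / 4.0
    x, xd, z, traj = x0, 0.0, 1.0, [x0]
    for _ in range(int(T / dt)):
        psi = np.exp(-widths * (z - centers) ** 2)
        f = w @ (psi * z / np.sum(psi)) * (g - x0)               # Eqs. (3)-(4)
        xdd = tau * (alpha_x * (beta_x * (g - x) - xd) + f)      # Eq. (2)
        xd += xdd * dt
        x += xd * dt
        z += -tau * alpha_z * z * dt                             # Eq. (1)
        traj.append(x)
    return np.array(traj)
```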

B. Global Parametric Dynamic Movement Primitives

Using DMPs, a task can be imitated from a human demonstration; however, the reproduced task cannot be adapted to different environment conditions. To overcome this limitation, we have integrated a parametric model into DMPs, capturing the variability of a task from multiple demonstrations. We transform the basic forcing function (3) into a parametric forcing function

$f(z, \mathbf{l}; \mathbf{w}) = \mathbf{w}(\mathbf{l})^T \mathbf{g}$   (7)

where the kernel weight vector $\mathbf{w}$ is parametrized using a parameter vector $\mathbf{l}$ of measurable environment factors.

We model the dependency of the weights with respect to the parameters as a linear combination of J basis vectors $\mathbf{v}_i$ with coefficients depending on the parameters in a non-linear fashion,

$\mathbf{w}(\mathbf{l}) = \sum_{i=0}^{J} \phi_i(\mathbf{l}) \mathbf{v}_i = \mathbf{V}^T \boldsymbol{\phi}(\mathbf{l})$   (8)

where $\mathbf{V}$ is a $J \times N$ matrix of parameters, with N referring to the number of kernels in $\mathbf{g}$. $\boldsymbol{\phi}(\mathbf{l})$ is a J-dimensional column vector with elements $\phi_j(\mathbf{l})$. The non-linear basis $\phi_j(l)$ for a polynomial model in one parameter is $\phi_j(l) = l^j$.

The formulation captures linear models such as [12] as a special case. Considering a single parameter l for presentational simplicity, the linear model can be written

$\mathbf{w}(l) = l \mathbf{v}_1 + \mathbf{v}_0$.   (9)

For a chosen non-linear basis (known functions $\phi_i$), the basis vectors can be calculated by minimizing the difference between the modelled and the initial non-parametric DMP shape parameters,

$\underset{\mathbf{V}}{\arg\min} \sum_{k=1}^{K} \|\mathbf{w}(\mathbf{l}_k) - \mathbf{w}_k\|^2$   (10)

where $\mathbf{w}_k$ denotes the initial weight vector of a non-parametric DMP optimized for parameter values $\mathbf{l}_k$. The initial weights can be merely imitated from a human demonstration using (6) or improved using a policy search method [14]. In either case, reproducing an imitated task using $\mathbf{w}_k$ should lead to a successful performance in an environment parametrized by $\mathbf{l}_k$.

In order to solve (10), one needs to construct the design matrix

$\Phi = \begin{bmatrix} \phi_1(\mathbf{l}_1) & \phi_2(\mathbf{l}_1) & \cdots & \phi_J(\mathbf{l}_1) \\ \phi_1(\mathbf{l}_2) & \phi_2(\mathbf{l}_2) & \cdots & \phi_J(\mathbf{l}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(\mathbf{l}_K) & \phi_2(\mathbf{l}_K) & \cdots & \phi_J(\mathbf{l}_K) \end{bmatrix}$   (11)

where K denotes the number of initial DMP weight vectors, which must be equal to or greater than the order of the model J to avoid an under-constrained optimization problem. Furthermore, the rows of the target matrix

$\mathbf{W} = \begin{bmatrix} \mathbf{w}_1^T \\ \vdots \\ \mathbf{w}_K^T \end{bmatrix}$   (12)

represent the initial DMP weight vectors. We can minimize (10) with respect to the matrix of basis vectors $\mathbf{V}$, giving

$\mathbf{V} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{W}$.   (13)
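The fit of Eq. (13) and the prediction of Eq. (8) reduce to a few lines of NumPy. This sketch assumes a single scalar task parameter and a polynomial basis that includes the constant term $\phi_0(l) = 1$, so that a zero order model is simply the mean weight vector; the function names are illustrative.

```python
import numpy as np

def poly_design_matrix(lengths, order):
    """Design matrix Phi of Eq. (11) for the polynomial basis phi_j(l) = l**j."""
    lengths = np.asarray(lengths, dtype=float)
    return np.vander(lengths, N=order + 1, increasing=True)   # columns l^0 ... l^order

def fit_gpdmp(lengths, W, order):
    """Solve Eq. (13). W is K x N with one optimized DMP weight vector per row (Eq. 12)."""
    Phi = poly_design_matrix(lengths, order)
    V, *_ = np.linalg.lstsq(Phi, W, rcond=None)   # numerically safer than the normal equations
    return V

def predict_weights(V, l, order):
    """Evaluate Eq. (8): w(l) = V^T phi(l) for a new task parameter l."""
    phi = np.array([l ** j for j in range(order + 1)])
    return V.T @ phi
```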

C. Model selection

The order of complexity of a parametric regression model needs to be selected to overcome the so-called over-fitting problem. The model complexity could be determined for every single policy parameter (e.g. the weight of every single DMP kernel). However, in such a univariate model, the correlations among policy parameters are ignored, resulting in very non-smooth trajectories for new task parameters. Hence, we consider a single model complexity for the global model.

The model choice for a particular application is a compromise between the complexity of the modelled attractor landscape and overfitting due to insufficient data. The best generalization can be achieved by choosing an optimal order of complexity for the model, which is addressed in a model selection method such as cross validation, AIC, or BIC.


Leave-one-out cross validation is a traditional model selection technique which separates the training data into a training set and a validation set for assessing the generalization performance. On the other hand, model selection methods based on information criteria, such as AIC and BIC, exploit the whole training set for both training the model and choosing its complexity. Both of these methods are based on penalized log-likelihood, where the penalty term constrains the model complexity. AIC selects the model which maximizes

$A_I = \log p(\mathbf{W} \,|\, \Theta) - J$   (14)

where $p(\mathbf{W} \,|\, \Theta)$ denotes the likelihood of the training data; however, AIC favours a higher order of complexity, resulting in overfitting.

BIC puts a stronger penalty on the number of free parameters (in this case J), thereby controlling the complexity of the model. The optimal model order can be selected by minimizing the BIC cost

$B = -2 \log p(\mathbf{W} \,|\, \Theta) + J \log K$.   (15)

We assume a linear regression model with additive white Gaussian noise

$\mathbf{W} = \Phi \mathbf{V} + \mathbf{E}$   (16)

where $\mathbf{E}$ is the error matrix

$\mathbf{E} = \begin{bmatrix} \boldsymbol{\varepsilon}_1^T \\ \vdots \\ \boldsymbol{\varepsilon}_K^T \end{bmatrix}$   (17)

with each row

$\boldsymbol{\varepsilon}_i = \mathbf{w}_i - \hat{\mathbf{V}}^T \boldsymbol{\phi}(\mathbf{l}_i)$   (18)

representing the difference between the i-th training sample $\mathbf{w}_i$ and its prediction $\hat{\mathbf{V}}^T \boldsymbol{\phi}(\mathbf{l}_i)$. Hence, the likelihood of the data is

$\log p(\mathbf{W} \,|\, \Theta) = \log \prod_{i=1}^{K} \mathcal{N}(\mathbf{w}_i \,|\, \mathbf{l}_i, \hat{\mathbf{V}}, \hat{\Sigma}) = -\frac{KN}{2}\log(2\pi) - \frac{K}{2}\log(\det(\hat{\Sigma})) - \frac{1}{2}\mathrm{tr}\{(\mathbf{W} - \Phi\hat{\mathbf{V}})^T(\mathbf{W} - \Phi\hat{\mathbf{V}})\hat{\Sigma}^{-1}\}$   (19)

where N is the number of DMP kernels (the size of $\mathbf{g}$ in (4)); $\hat{\mathbf{V}}$ denotes the maximum likelihood estimate (MLE) of the parameter matrix (using (13)); and $\hat{\Sigma}$ represents the MLE estimate of the covariance

$\hat{\Sigma} = \frac{1}{K}(\mathbf{W} - \Phi\hat{\mathbf{V}})^T(\mathbf{W} - \Phi\hat{\mathbf{V}})$   (20)

of the DMP weight vector $\mathbf{w}$. After eliminating the constant term $-\frac{KN}{2}\log(2\pi)$ in (19), one can rewrite (15) as

$B = K\log(\det(\hat{\Sigma})) + \mathrm{tr}((\mathbf{W} - \Phi\hat{\mathbf{V}})^T(\mathbf{W} - \Phi\hat{\mathbf{V}})\hat{\Sigma}^{-1}) + J\log K$.   (21)

The determinant of the MLE estimate of the covariance, $\hat{\Sigma}$, will be zero with few training samples, causing the BIC cost to go toward negative infinity. Hence, we have modified the traditional definition of the BIC cost to

$B_M = -2\log p(\mathbf{E} \,|\, \Theta) + J\log K$   (22)

by considering the distribution of the noise, which can be written as

$\log p(\mathbf{E} \,|\, \Theta) = \log \prod_{i=1}^{K} \mathcal{N}(\boldsymbol{\varepsilon}_i \,|\, \mathbf{0}, \bar{\Sigma}) = -\frac{KN}{2}\log(2\pi) - \frac{K}{2}\log(\det(\bar{\Sigma})) - \frac{1}{2}\sum_{i=1}^{K} \boldsymbol{\varepsilon}_i^T \bar{\Sigma}^{-1} \boldsymbol{\varepsilon}_i$
$= -\frac{KN}{2}\log(2\pi) - \frac{K}{2}\log(\det(\bar{\Sigma})) - \frac{1}{2}\sum_{i=1}^{K} (\mathbf{w}_i - \hat{\mathbf{V}}^T\boldsymbol{\phi}(\mathbf{l}_i))^T \bar{\Sigma}^{-1} (\mathbf{w}_i - \hat{\mathbf{V}}^T\boldsymbol{\phi}(\mathbf{l}_i))$
$= -\frac{KN}{2}\log(2\pi) - \frac{K}{2}\log(\det(\bar{\Sigma})) - \frac{1}{2}\mathrm{tr}((\mathbf{W} - \Phi\hat{\mathbf{V}})^T(\mathbf{W} - \Phi\hat{\mathbf{V}})\bar{\Sigma}^{-1})$   (23)

due to the i.i.d. assumption on the white additive noise ($\boldsymbol{\varepsilon}_i \sim \mathcal{N}(\mathbf{0}, \bar{\Sigma})$), where $\bar{\Sigma}$ represents a constant covariance matrix which needs to be determined prior to the model selection process. In our experiments, we selected a scaled identity matrix $s\mathbf{I}$ as the constant covariance matrix, where s denotes the scale. The scale can be determined with respect to the magnitude of the error (the difference between $\mathbf{w}_k$ and $\hat{\mathbf{V}}^T\boldsymbol{\phi}(\mathbf{l}_k)$). A simple way to estimate the scale is to look at the largest eigenvalue of the MLE estimate of the covariance matrix (20) with linear fitting. After eliminating the constant terms $-\frac{KN}{2}\log(2\pi)$ and $-\frac{K}{2}\log(\det(\bar{\Sigma}))$ in (23), one can rewrite the modified BIC cost (22) as

$B_M = \mathrm{tr}((\mathbf{W} - \Phi\hat{\mathbf{V}})^T(\mathbf{W} - \Phi\hat{\mathbf{V}})\bar{\Sigma}^{-1}) + J\log K$.   (24)

The model which minimizes $B_M$ will be selected. The first term in $B_M$ favours a higher order model while the second term discourages a very high order; hence, it favours accurate prediction while avoiding over-fitting. Since the determinant of the MLE covariance does not appear in our modified BIC $B_M$ (24), it is more suitable than the traditional BIC when training samples are scarce.
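A sketch of the resulting selection loop is shown below. It reuses the polynomial design matrix from the previous sketch, and fixes the constant covariance to a scaled identity $\bar{\Sigma} = s\mathbf{I}$ whose scale is taken from the largest eigenvalue of the residual covariance of a linear fit, as described above; the variable names and guard values are illustrative.

```python
import numpy as np

def modified_bic(W, Phi, sigma_bar_inv):
    """Modified BIC cost of Eq. (24) for one candidate model."""
    K, J = Phi.shape                                  # J = number of basis functions
    V, *_ = np.linalg.lstsq(Phi, W, rcond=None)       # least-squares estimate, Eq. (13)
    E = W - Phi @ V                                   # residual matrix, Eq. (16)
    fit_term = np.trace(E.T @ E @ sigma_bar_inv)      # data-fit term of Eq. (24)
    return fit_term + J * np.log(K)                   # complexity penalty

def select_order(lengths, W, max_order):
    """Pick the polynomial order minimizing the modified BIC."""
    K, N = W.shape
    # constant covariance: scaled identity with the scale estimated from a linear fit
    Phi_lin = poly_design_matrix(lengths, 1)
    V_lin, *_ = np.linalg.lstsq(Phi_lin, W, rcond=None)
    E_lin = W - Phi_lin @ V_lin
    s = float(np.max(np.linalg.eigvalsh(E_lin.T @ E_lin / K)))   # largest eigenvalue of Eq. (20)
    s = max(s, 1e-6)                                  # guard against a degenerate residual
    sigma_bar_inv = np.eye(N) / s

    costs = {}
    for order in range(0, min(max_order, K - 1) + 1):  # need K >= number of basis functions
        Phi = poly_design_matrix(lengths, order)
        costs[order] = modified_bic(W, Phi, sigma_bar_inv)
    return min(costs, key=costs.get), costs
```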

D. Reinforcement Learning

Executing a DMP with imitated shape parameters might not lead to a successful reproduction of a task. One way to refine the shape parameters is to learn them through trial-and-error using policy search reinforcement learning (RL). Next, we briefly review the state-of-the-art policy search method PoWER [2], which was used in this work to optimize individual primitives.

PoWER (see Algorithm 1) updates the DMP shape parameters θ ≡ w iteratively. In each iteration, (several) stochastic roll-out(s) of the task is performed, each of which is achieved by adding random (Gaussian) noise to the DMP shape parameters. Each noisy vector is weighted by the returned accumulated reward. Hence, the higher the returned reward, the more the noisy vector contributes to the updated policy parameters. This exploration process continues until the algorithm converges to the optimal policy.

Algorithm 1 PoWER [2].

Input: the initial policy parameters θ, the exploration variance Σ
1: repeat
2:   Sample: Perform rollout(s) using $\mathbf{a} = (\boldsymbol{\theta} + \boldsymbol{\varepsilon}_t)^T \boldsymbol{\phi}(\mathbf{s}, t)$ with $\boldsymbol{\varepsilon}_t^T \boldsymbol{\phi}(\mathbf{s}, t) \sim \mathcal{N}(0, \boldsymbol{\phi}(\mathbf{s}, t)^T \Sigma \boldsymbol{\phi}(\mathbf{s}, t))$ as the stochastic policy and collect all $(t, \mathbf{s}_t^h, \mathbf{a}_t^h, \mathbf{s}_{t+1}^h, \boldsymbol{\varepsilon}_t^h, r_{t+1}^h)$ for $t = \{1, 2, \ldots, T+1\}$.
3:   Estimate: Use the unbiased estimate $Q^{\pi}(\mathbf{s}, \mathbf{a}, t) = \sum_{\tilde{t}=t}^{T} r(\mathbf{s}_{\tilde{t}}, \mathbf{a}_{\tilde{t}}, \mathbf{s}_{\tilde{t}+1}, \tilde{t})$.
4:   Reweight: Compute importance weights and reweight rollouts, discarding low-importance rollouts.
5:   Update the policy using $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \dfrac{\left\langle \sum_{t=1}^{T} \boldsymbol{\varepsilon}_t Q^{\pi}(\mathbf{s}, \mathbf{a}, t) \right\rangle_{w(\tau)}}{\left\langle \sum_{t=1}^{T} Q^{\pi}(\mathbf{s}, \mathbf{a}, t) \right\rangle_{w(\tau)}}$.
6: until convergence $\boldsymbol{\theta}_{k+1} \approx \boldsymbol{\theta}_k$
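A simplified, per-rollout variant of this update, in which a single exploration vector is drawn per rollout and the return is used directly as the importance weight, can be sketched as follows. This compresses the time-step formulation of Algorithm 1 into an episodic form and is illustrative rather than a faithful re-implementation.

```python
import numpy as np

def power_update(theta, Sigma, evaluate_return, n_rollouts=10, n_elite=5):
    """One PoWER-style iteration: sample perturbed parameters, keep the
    best rollouts, and update theta by return-weighted averaging of the noise."""
    eps = np.random.multivariate_normal(np.zeros(len(theta)), Sigma, size=n_rollouts)
    returns = np.array([evaluate_return(theta + e) for e in eps])
    elite = np.argsort(returns)[-n_elite:]            # keep the highest-return rollouts
    weights = returns[elite]
    theta_new = theta + (weights @ eps[elite]) / (np.sum(weights) + 1e-12)
    return theta_new, float(returns.max())
```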

The structure of the noise is a key element influencing the convergence speed of a policy search method, but the choice is a trade-off. In the case of DMP shape parameters and uncorrelated noise, a high noise variance causes large accelerations of the system, posing a safety hazard and possibly surpassing the physical capabilities of the robot. In contrast, a low noise variance makes the learning process slow.

To address this trade-off, we propose to use correlated noise instead of the uncorrelated noise employed in earlier works. Since the elements of a DMP parameter vector correspond to temporally ordered perturbing forces, we want to control their temporal statistics. To achieve this, an intuitive structure for the covariance matrix $\Sigma = \mathbf{R}^{-1}$ can be used, where the quadratic control cost matrix

$\mathbf{R} = \sum_{j=1}^{F} h_j \mathbf{A}_j^T \mathbf{A}_j$   (25)

is a weighted combination of quadratic costs related to finite difference matrices $\mathbf{A}_1 \cdots \mathbf{A}_F$. $h_j$ denotes the weight of the j-th finite difference matrix, where j is the order of differentiation. This structure then allows us to control statistics of any order. In the experiments, we consider variation only in acceleration (second order). Thus, $h_2 = 1$ and all other weights $h_j = 0$, $j \neq 2$, and the second order finite difference matrix $\mathbf{A}_2$ can be written

$\mathbf{A}_2 = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ -2 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 1 & -2 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -2 & 1 \\ 0 & 0 & 0 & \cdots & 0 & 1 & -2 \\ 0 & 0 & 0 & \cdots & 0 & 0 & 1 \end{bmatrix}$.   (26)

With this covariance matrix, the noise signal is smooth (see Fig. 2a) due to limited acceleration, and it has a small magnitude at the beginning and at the end of the trajectory. Hence, safe exploration is provided. It is worth mentioning that a similar covariance matrix has been applied in [15] and [16] for direct trajectory encoding.

In order to control the magnitude of the noise, we used a further modified covariance matrix $\Sigma = \gamma\beta\mathbf{R}^{-1}$, where γ is a constant controlling the initial magnitude and the convergence factor

$\beta = \frac{1}{\sum_{i=1}^{I} r_i^2}$   (27)

reduces the magnitude of the noise as the policy search algorithm converges to the optimal policy.
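The construction of this exploration covariance (Eqs. 25-27) can be sketched as follows. The rectangular shape of $\mathbf{A}_2$ follows the banded structure of Eq. (26); the values of γ and of the accumulated returns are illustrative.

```python
import numpy as np

def second_order_diff_matrix(n):
    """Finite difference matrix A2 of Eq. (26): each column holds the stencil [1, -2, 1]."""
    A2 = np.zeros((n + 2, n))
    for j in range(n):
        A2[j, j], A2[j + 1, j], A2[j + 2, j] = 1.0, -2.0, 1.0
    return A2

def exploration_covariance(n, gamma, returns):
    """Correlated exploration covariance Sigma = gamma * beta * R^-1 (Eqs. 25 and 27),
    with only the acceleration term active (h_2 = 1, all other h_j = 0)."""
    A2 = second_order_diff_matrix(n)
    R = A2.T @ A2                                     # Eq. (25) with h_2 = 1
    beta = 1.0 / np.sum(np.asarray(returns, dtype=float) ** 2)   # Eq. (27)
    return gamma * beta * np.linalg.inv(R)

# smooth exploration noise for a 55-kernel DMP (illustrative values)
Sigma = exploration_covariance(55, gamma=1.0, returns=[0.2] * 11)
eps = np.random.multivariate_normal(np.zeros(55), Sigma)
```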

Fig. 2: (a) Noise ε sampled from a zero mean multivariate Gaussian distribution $\mathcal{N}(0, \Sigma)$ with γ = 1 and β = 0.01. (b) Mean and variance of returns over 12 trials where 11 initial roll-outs were used before re-weighting the DMP weights.

IV. EXPERIMENTAL EVALUATION

We studied experimentally the generalization performance of the proposed model using a Ball-in-a-Cup task taught to a KUKA LBR 4+ initially using kinesthetic teaching. In this section, we explain the incremental learning scenario and compare it with LWR and non-incremental learning in terms of convergence speed and the extrapolation capability of the model.

A. Ball-in-a-Cup Task

The Ball-in-a-Cup game consists of a cup, a string, and a ball; the ball is attached to the cup by the string (see Fig. 1a). The objective of the game is to get the ball in the cup by moving the cup in a suitable fashion. In practice, the cup needs to be moved back and forth at first; then, a movement is induced on the cup, thus pulling the ball up and catching it with the cup.


We chose the Ball-in-a-Cup game because variation in the environment can be generated simply by changing the string length. The string length is observable and easy to evaluate, thus providing a suitable parameter representing the environment variation. Nevertheless, changing the length requires a complex change in the motion to succeed in the game. Hence, the generalization capability of a parametric LfD model can be easily assessed using this game.

The state of the robot is defined in a seven dimensional space $X = \{x, y, z, q_x, q_y, q_z, q_w\}$, where $X_p = \{x, y, z\}$ represents the position of the robot end effector (cup), and $X_q = \{q_x, q_y, q_z, q_w\}$ formulates its orientation using a quaternion. The ball-in-a-cup is essentially a two-dimensional game and thus only motion along two axes, y and z, was used. In the demonstration phase, the robot was set compliant along y and z, while it was set stiff rotationally and along x, which were considered as constant states. The plane spanning y and z was orthogonal with respect to the table upon which the robot was mounted (see Fig. 1a).

The trajectories along y and z were encoded using separate DMPs with the same number of parameters. We found experimentally that 55 kernels (shape parameters) were required so that the reproduced movement was able to put the ball above the rim of the cup in the execution phase. In total, 110 shape parameters were then learned using weighted linear regression (6). However, using these initial shape parameters, the reproduced movement did not put the ball back into the cup. Hence, the shape parameters were optimized in a trial-and-error fashion using RL as described in Sec. III-D.

The reward function is the most fundamental ingredient of RL. We formulated the reward function similarly to [2] as

$r(t) = \begin{cases} e^{-\alpha d^2}, & \text{if } t = t_c, \\ 0, & \text{otherwise} \end{cases}$   (28)

where $t_c$ denotes the time instant when the ball crosses the rim of the cup with a downward motion; d represents the horizontal distance between the rim of the cup and the centre of the ball; and α is a scaling parameter set to 100 in our experiments. The closer the ball is to the rim of the cup, the higher the reward. As the shape parameters are fine-tuned in a trial-and-error approach, the ball gets closer to the cup. Furthermore, the reward is zero if the ball does not reach above the rim of the cup.
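A direct transcription of this reward could look as follows; detecting the crossing instant $t_c$ and measuring the distance d are abstracted into the inputs, which is an assumption of this sketch.

```python
import numpy as np

def ball_in_cup_reward(t, t_c, d, alpha=100.0):
    """Reward of Eq. (28): non-zero only at the instant t_c when the ball crosses the
    rim moving downward; d is the horizontal ball-to-rim distance at that instant."""
    return float(np.exp(-alpha * d ** 2)) if t == t_c else 0.0
```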

B. Incremental Learning

Learning without a prior is a very time-consuming process. The speed of RL improves significantly when the learning process is started from a good initial policy. In fact, it is customary to initiate the policy search process with a policy that imitates a human demonstration. However, the optimized policy is not guaranteed to work successfully in a new environment characterized by a different task parameter. In this case, a policy must be optimized for the

Fig. 3: Validity ranges (red lines) of models learned from databases of different sizes. (a) Zero order (model selection) global model trained on only one MP. (b) Zero order model trained on 2 MPs. (c) Zero order (model selection) model trained on 3 MPs. (d), (e), (f), and (g) represent first, second, third, and fourth order models trained on the same DB of MPs as in (c). (h) Locally weighted regression.

new task parameter. Starting the optimization process from an imitated policy is still time-consuming. The speed of the learning process can increase with a more accurate initial policy, such as the policy optimized for the closest task parameter. In this way, a database of MPs can be created.

The underlying regularities in this database can be extracted in an online manner for enhancing the extrapolation capabilities and speeding up the learning process. In fact, we construct a global model in an online fashion and control its complexity as new MPs are added to the database. The global model can provide the policy search process with an initial policy which is more accurate than the policy optimized for the closest task parameter. Utilizing only the policy optimized for the closest task parameter is equivalent to using a DB of only one MP and a global model of order zero.
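The incremental scheme can be summarized in a short sketch that reuses the fitting and model selection functions outlined in Section III; the function for PoWER optimization is assumed to be available, and all names are illustrative.

```python
import numpy as np

def incremental_learning(initial_length, initial_w, new_lengths, optimize_with_power):
    """Grow the database of MPs and refit the global model after every new task parameter.
    `optimize_with_power(l, w_init)` is assumed to return PoWER-optimized weights for length l."""
    lengths, W = [initial_length], [initial_w]            # DB seeded from a single demonstration
    for l in new_lengths:
        order, _ = select_order(lengths, np.array(W), max_order=4)   # modified BIC, Eq. (24)
        V = fit_gpdmp(lengths, np.array(W), order)                    # global model, Eq. (13)
        w_init = predict_weights(V, l, order)             # initial policy for the new length
        w_opt = optimize_with_power(l, w_init)            # refine by trial and error (Sec. III-D)
        lengths.append(l)                                 # add the new MP to the database
        W.append(w_opt)
    return lengths, np.array(W)
```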

We studied first the generalization performance of global models with different complexities and compared them with LWR. We started the incremental learning process in a Ball-in-a-cup game with a string length of 37 cm. The model selection indicated a zero order global model. With this model, the game could be re-enacted successfully for string lengths of 35 to 38 cm. Next, we used the model for providing an initial policy for RL and optimized it for a string length of 34 cm. This newly optimized MP was added to the database, and subsequently the global model and its complexity were updated. The model selection again indicated zeroth order for the complexity of the global model. This model works successfully for string lengths of 34 to 38 cm. An initial policy for a string length of 33 cm was estimated using the model learned from the current database, optimized using PoWER, and then added to the database. With this database of three MPs, the model selection still indicated a zero order model. These results are depicted in Fig. 3, where X indicates training samples in the database, and the red line the validity region of the model


Fig. 4: (a) RL convergence rate of incremental vs. non-incremental learning. The number of iterations in (a) includes the 11 initial roll-outs. (b) The convergence rate of incremental vs. non-incremental learning for a string length of 40 cm. The rewards of the 11 initial roll-outs are not shown in (b).

where 10 consecutive roll-outs of the same policy were successful. Fitted to the DB of the same three MPs, a higher order model such as linear (Fig. 3.d), second (Fig. 3.e), third (Fig. 3.f), and fourth order (Fig. 3.g) could not improve the extrapolation capability. This indicates that the proposed model selection is able to identify the required complexity. Moreover, LWR (see Fig. 3.h) could extrapolate to a string length of 32 cm because there are more training samples nearby, but it lost both inter- and extrapolation capability for string lengths of 37–38 cm. This indicates superior generalization capabilities of global models in this task.

C. Incremental vs Non-incremental learning

We next studied the effect of incremental learning on the convergence rate of RL for optimizing the policy parameters for a new task parameter. As a starting point, the model trained on MPs for string lengths of 33, 34, and 37 cm could interpolate among the training samples and extrapolate to a length of 39 cm. When the MP estimated by this model was used as the starting point of RL for a string length of 40 cm, 18 roll-outs (including the 11 initial roll-outs) of RL were needed. On the other hand, when only the MP for length 37 cm was used as the initial policy, the policy search took 97 roll-outs (including the 11 initial roll-outs) to optimize the MP (compare the red vs. blue in Fig. 4.b). Studying other string lengths, incremental learning consistently led to faster convergence when extrapolating (compare the red vs. blue in Fig. 4.a), demonstrating that incremental learning speeds up the learning process by providing a more accurate starting point.

D. Model Selection

We finally studied whether the proposed model selection criterion can choose an optimal complexity for both incremental and non-incremental learning. The MP optimized for a string length of 40 cm was added to the database. The global model was constructed for different orders of complexity. The model selection criterion indicated 2 (parabolic) as the order for the Y direction and zero for the Z

Fig. 5: Validity ranges (red lines) of models learned from the database of incrementally learned MPs. (a) Model selection indicates 2nd order for Y and zero order for Z. (b), (c), (d), (e), and (f) represent zero, first, second, third, and fourth order models trained on the same DB of MPs as in (a). (g) Locally weighted regression.

direction. This model (Fig. 5.a) could inter- and extrapolate between 31 and 40 cm. Constant (Fig. 5.b) and linear (Fig. 5.c) models were not sufficient, as they could not even interpolate in the whole range. The interpolation and extrapolation range of a 2nd order model (Fig. 5.d) was the same as that of the model selected by the proposed criterion (Fig. 5.a), but the model selection gives a simpler (zero order) model for the Z direction. Furthermore, higher order models (see Fig. 5.e and 5.f) improved neither interpolation nor extrapolation capability. Besides, a higher order model can overfit to the training samples, leading to very limited extrapolation. For example, a 4th order model fit to the database (see Fig. 5.f) led to a bad trajectory (see Fig. 7) for a string length of 45 cm. Therefore, it was not safe to exploit a very high order model for extrapolating to a distant task parameter. A similar problem was encountered with LWR, which yielded too high an acceleration for both Y and Z in the beginning of the trajectory (marked with a black star in Fig. 7). Hence, model selection is a key process for constructing an online incremental DB of MPs using parametric models. All in all, the proposed model selection method resulted in the simplest yet most effective and safest model, excelling over LWR (Fig. 5.g).

To study the effectiveness of model selection in non-incremental learning, we constructed models of varying complexity from a DB of MPs learned non-incrementally. The validity ranges of the models are shown in Fig. 6. Linear (Fig. 6.b) and second-order (Fig. 6.c) models had the largest validity region. The model selection criterion chose the linear model, which was the simplest with the best extrapolation capability.

The extrapolation capability dwindled when the non-incremental DB was used (compare Fig. 6.b and Fig. 5.a). Moreover, higher order models (Fig. 6.d and e) lost their interpolation capability since each MP in the non-incremental database was optimized by starting from the policy optimized for the closest task parameter, thus ignoring the underlying regularities in the distant task parameter space.


Fig. 6: Validity ranges (red lines) of models learned from databases of non-incrementally learned MPs. (a), (b), (c), (d), (e) represent zero, first, second, third, and fourth order models trained on the same DB of MPs as in (a). Model selection chose (b) as the best model. (f) Locally weighted regression.

Fig. 7: Extrapolated trajectories for a string length of 45 cm. (a) Trajectories in the Y direction. (b) Trajectories in Z. The optimal trajectory for a string length of 40 cm (OT), the 4th order global model (4th order), the global model with complexity determined by model selection (MS), and the LWR trajectory (LWR) are shown. In (b), the model selection and optimal trajectories overlap because model selection indicates zero as the best complexity for the Z direction.

Similarly, LWR on the non-incremental DB (Fig. 6.f) performed worse than our approach (Fig. 6.b). These results indicate that model selection and incremental learning are key ingredients for learning generalizable motion primitives.

V. CONCLUSION

In this paper, we proposed a global model for parametric MPs. The model maps a task parameter to policy parameters, allowing for generalizing a policy to new situations and incremental construction of a database of primitives. The complexity of the model is updated online as a new training example is added to the database. Only a single human demonstration is then needed for constructing the database. Our experiments demonstrated that the global model can provide RL with a more accurate initial policy, resulting in faster convergence; in return, RL provides the global model with additional training examples, leading to a better predictive accuracy. Comparing incremental versus non-incremental learning, we showed that incremental learning performs better than the non-incremental approach both in terms of generalization and the speed of learning.

All things considered, the global model is simple and can be scaled to accommodate non-linearities in the task space. Furthermore, we proposed a novel penalized log-likelihood based model selection method which is integral for constructing an online incremental DB of MPs using a parametric approach. Experiments showed that the model selection approach leads to a global model which is simple, overcomes over-fitting, and performs better than locally weighted regression both in terms of inter- and extrapolation. It also works even with few training samples, which indicates its suitability for online incremental learning.

REFERENCES

[1] A. Gams, M. Denisa, and A. Ude, "Learning of parametric coupling terms for robot-environment interaction," in Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on, pp. 304–309, IEEE, 2015.

[2] J. Kober and J. R. Peters, "Policy search for motor primitives in robotics," in Advances in Neural Information Processing Systems, pp. 849–856, 2009.

[3] J. Kober, E. Oztop, and J. Peters, "Reinforcement learning to adjust robot movements to new situations," in IJCAI Proceedings - International Joint Conference on Artificial Intelligence, vol. 22, p. 2650, 2011.

[4] B. Da Silva, G. Konidaris, and A. Barto, "Learning parameterized skills," arXiv preprint arXiv:1206.6398, 2012.

[5] D. Forte, A. Gams, J. Morimoto, and A. Ude, "On-line motion synthesis and adaptation using a trajectory database," Robotics and Autonomous Systems, vol. 60, no. 10, pp. 1327–1339, 2012.

[6] A. Ude, A. Gams, T. Asfour, and J. Morimoto, "Task-specific generalization of discrete and periodic dynamic movement primitives," IEEE Transactions on Robotics, vol. 26, no. 5, pp. 800–815, 2010.

[7] F. Stulp, G. Raiola, A. Hoarau, S. Ivaldi, and O. Sigaud, "Learning compact parameterized skills with a single regression," in 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 417–422, IEEE, 2013.

[8] B. Nemec, R. Vuga, and A. Ude, "Efficient sensorimotor learning from multiple demonstrations," Advanced Robotics, vol. 27, no. 13, pp. 1023–1031, 2013.

[9] K. Mülling, J. Kober, O. Kroemer, and J. Peters, "Learning to select and generalize striking movements in robot table tennis," The International Journal of Robotics Research, vol. 32, no. 3, pp. 263–279, 2013.

[10] A. Carrera, N. Palomeras, N. Hurtós, P. Kormushev, and M. Carreras, "Learning multiple strategies to perform a valve turning with underwater currents using an I-AUV," in OCEANS 2015 - Genova, pp. 1–8, IEEE, 2015.

[11] S. Calinon, T. Alizadeh, and D. G. Caldwell, "On improving the extrapolation capability of task-parameterized movement models," in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 610–616, IEEE, 2013.

[12] T. Matsubara, S.-H. Hyon, and J. Morimoto, "Learning parametric dynamic movement primitives from multiple demonstrations," Neural Networks, vol. 24, no. 5, pp. 493–500, 2011.

[13] A. J. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," in Robotics and Automation, 2002. Proceedings. ICRA '02. IEEE International Conference on, vol. 2, pp. 1398–1403, IEEE, 2002.

[14] J. R. Peters, Machine Learning of Motor Skills for Robotics. PhD thesis, University of Southern California, 2007.

[15] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, "Learning force control policies for compliant manipulation," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 4639–4644, IEEE, 2011.

[16] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal, "STOMP: Stochastic trajectory optimization for motion planning," in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 4569–4574, IEEE, 2011.