
IROS2018 Tutorial Mo-TUT-2

From Least Squares Regression to High-dimensional Motion Primitives

Freek Stulp, Sylvain Calinon, Gerhard Neumann

Generation vs. Recall

159 + 72 = ???

159 + 72 = (159 + 2) + (72 − 2) = 161 + 70 = 231   "generation"

159 + 72 = (9 + 2) + (150 + 70) = 11 + 220 = 231   "generation"

159 + 72 = 231   "recall"

The distinction between these two strategies is important in cognitive science, artificial intelligence, robotics, teaching

Motion Generation vs. Motion Recall

Motion primitives in nature

Giszter, S.; Mussa-Ivaldi, F. & Bizzi, E. Convergent force fields organized in the frog’s spinal cord. Journal of Neuroscience, 1993.

Flash, T. & Hochner, B. Motor Primitives in Vertebrates and Invertebrates. Current Opinion in Neurobiology, 2005.

Motion primitives for robots?

• Couple degrees of freedom to deal with high-dimensional systems
• Sequencing and superposition of MPs for more complex tasks
• Low-dimensional parameterization of MPs enables learning
• MPs can be bootstrapped with demonstrations
• Direct mappings between task parameters and MP parameters

Motion primitives for robots!

Ijspeert, A. J.; Nakanishi, J. & Schaal, S. Movement imitation with nonlinear dynamical systems in humanoid robots. ICRA, 2002.

Schedule

9:00 – 9:15    Introduction
9:15 – 10:45   Regression Tutorial (Freek Stulp)
10:45 – 11:00  Motion Primitives 1 (Sylvain Calinon)
11:00 – 11:30  Coffee Break
11:30 – 12:15  Motion Primitives 1, cont. (Sylvain Calinon)
12:15 – 13:15  Motion Primitives 2 (Gerhard Neumann)
13:15 – 13:30  Wrap up

Regression Tutorial
IROS’18 Tutorial

Freek Stulp
Institute of Robotics and Mechatronics, German Aerospace Center (DLR)

Autonomous Systems and Robotics, ENSTA-ParisTech

01.10.2018

What Is Regression?

Estimating a relationship between input variables and continuous output variables from data

Application: Dynamic parameter estimation

An, C.; Atkeson, C. and Hollerbach, J. (1985). Estimation of inertial parameters of rigid body links of manipulators [404]. IEEE Conference on Decision and Control.

Application: Programming by demonstration

Rozo, L.; Calinon, S.; Caldwell, D. G.; Jimenez, P. and Torras, C. (2016). Learning Physical Collaborative Robot Behaviors from Human Demonstrations. IEEE Trans. on Robotics.

Calinon, S.; Guenter, F. and Billard, A. (2007). On Learning, Representing and Generalizing a Task in a Humanoid Robot [725]. IEEE Transactions on Systems, Man and Cybernetics.

Application: Biosignal Processing

Gijsberts, A., Bohra, R., Sierra González, D., Werner, A., Nowak, M., Caputo, B., Roa, M. and Castellini, C. (2014). Stable myoelectric control of a hand prosthesis using non-linear incremental learning. Frontiers in Neurorobotics.

What Is Not Regression?

Training data: {(x_n, y_n)}_{n=1}^N, with input x_n ∈ X and target y_n ∈ Y for all n

Supervised learning (targets available):
• Regression: targets available, Y ⊆ R^M
• Classification: targets available, Y ⊆ {1, …, K}
Reinforcement learning: no targets, only rewards r_n ∈ R
Unsupervised learning: no targets at all

Regression – Assumptions about the Function

linear | locally linear | smooth | none

→ Linear Least Squares

A.M. Legendre (1805). Nouvelles méthodes pour la détermination des orbites des comètes [519]. Firmin Didot.

C.F. Gauss (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum [943].

Linear Least Squares

Linear model: f(x) = aᵀx

• X is the N × D "design matrix": row n is the D-dimensional data point xₙᵀ = (x_{n,1}, …, x_{n,D})
• y = (y_1, …, y_N)ᵀ is the corresponding vector of targets

Which line fits the data best?
1 Define fitting criterion
2 Optimize a w.r.t. criterion

1 Define fitting criterion: sum of squared residuals

S(a) = Σ_{n=1}^N r_n²                  (1)
     = Σ_{n=1}^N (y_n − f(x_n))²       (2)

Applied to a linear model:

S(a) = Σ_{n=1}^N (y_n − aᵀx_n)²        (3)
     = (y − Xa)ᵀ(y − Xa)               (4)

2 Optimize a w.r.t. criterion: minimize the sum of squared residuals S(a)

a∗ = argmin_a S(a) = argmin_a (y − Xa)ᵀ(y − Xa)

Quadratic cost: when is its derivative 0?

S′(a) = 2(XᵀXa − Xᵀy)

a∗ = (XᵀX)⁻¹Xᵀy

Nice! Closed-form solution to find a∗; the fitted model is f(x) = a∗ᵀx.

Offset trick: to also fit an offset, f(x) = aᵀx + b = [a; b]ᵀ[x; 1], i.e. append a column of ones to the design matrix, so that row n of X becomes (x_{n,1}, …, x_{n,D}, 1).
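The closed-form solution above maps directly onto a few lines of NumPy. The following is a minimal sketch (not part of the original slides), with made-up data for illustration; it uses a linear solve rather than an explicit matrix inverse.

```python
# Minimal NumPy sketch of linear least squares with the offset trick (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 2
X = rng.uniform(-1.0, 1.0, size=(N, D))            # N x D design matrix
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.1 * rng.standard_normal(N)

# Offset trick: append a column of ones so f(x) = a^T x + b becomes [a; b]^T [x; 1].
X1 = np.hstack([X, np.ones((N, 1))])

# a* = (X^T X)^{-1} X^T y, computed with a solve instead of an explicit inverse.
a_star = np.linalg.solve(X1.T @ X1, X1.T @ y)

def f(x):
    """Fitted linear model f(x) = a*^T [x; 1]."""
    return np.append(x, 1.0) @ a_star

print("slopes and offset:", a_star)                 # close to [2.0, -1.0, 0.5]
print("prediction at x = [0.2, 0.3]:", f(np.array([0.2, 0.3])))
```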

Outline

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."

Weighted Least Squares

Weighted Linear Least Squares

Idea: it is more important to fit some points than others.

• Importance ≡ weight w_n
• Example weightings: manual, boxcar function, Gaussian function
  w_n = φ(x_n, θ) = exp(−½ (x_n − c)ᵀ Σ⁻¹ (x_n − c)),   with θ = (c, Σ)

1 Define fitting criterion: weighted residuals

S(a) = Σ_{n=1}^N w_n (y_n − aᵀx_n)²      (5)
     = (y − Xa)ᵀW(y − Xa),               (6)

with W the diagonal matrix of weights, W_{nn} = w_n.

2 Optimize a w.r.t. criterion

a∗ = (XᵀWX)⁻¹XᵀWy.                       (7)
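A minimal NumPy sketch of the weighted solution (7), using the one-dimensional special case of the Gaussian weighting function; the data, center c and width are made up for illustration.

```python
# Minimal NumPy sketch of weighted linear least squares (synthetic 1-D data).
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = np.linspace(-2.0, 2.0, N)
X = np.column_stack([x, np.ones(N)])                # offset trick included
y = 1.5 * x - 0.3 + 0.2 * rng.standard_normal(N)

# Gaussian weighting function w_n = exp(-1/2 (x_n - c)^2 / sigma^2), centered at c = 0.5.
c, sigma = 0.5, 0.4
w = np.exp(-0.5 * ((x - c) / sigma) ** 2)
W = np.diag(w)                                      # diagonal weight matrix

# a* = (X^T W X)^{-1} X^T W y
a_star = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print("weighted fit (slope, offset):", a_star)
```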

Locally Weighted Regressions

In robotics, functions usually non-linear. But often locally linear!

Idea: do multiple, independent, locally weighted least squares regressions.

William S. Cleveland; Susan J. Devlin (1988). Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting [4074]. Journal of the American Statistical Association.

Atkeson, C. G.; Moore, A. W. and Schaal, S. (1997). Locally Weighted Learning for Control [2160]. Artificial Intelligence Review.

• Idea: multiple, independent, locally weighted least squares regressions
• Locally: radial weighting function with different centers ("receptive field")

for e = 1 … E
  for n = 1 … N
    W_{nn,e} = g(x_n, c_e, Σ)
  a_e = (XᵀW_eX)⁻¹XᵀW_ey.                  (8)

Resulting model:

f(x) = Σ_{e=1}^E φ(x, θ_e) · (a_eᵀx).      (9)

(φ must be normalized)
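A rough NumPy sketch of the loop above on a 1-D toy problem. The centers, width and test function are arbitrary choices for illustration, and the offset trick is used inside each local model.

```python
# Minimal NumPy sketch of locally weighted regression on a 1-D toy problem.
import numpy as np

rng = np.random.default_rng(2)
N, E = 200, 10
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

X = np.column_stack([x, np.ones(N)])                # linear sub-models with offset
centers = np.linspace(0.0, 1.0, E)
sigma = 0.08

def phi(xq, c):
    """Gaussian receptive field."""
    return np.exp(-0.5 * ((xq - c) / sigma) ** 2)

# One weighted least squares fit per receptive field e: a_e = (X^T W_e X)^{-1} X^T W_e y
A = np.zeros((E, 2))
for e, c in enumerate(centers):
    W_e = np.diag(phi(x, c))
    A[e] = np.linalg.solve(X.T @ W_e @ X, X.T @ W_e @ y)

def predict(xq):
    """f(x) = sum_e phi_e(x) (a_e^T [x; 1]) with normalized weights."""
    w = phi(xq, centers)
    w = w / w.sum()                                 # phi must be normalized
    sub = A @ np.array([xq, 1.0])                   # outputs of the E linear sub-models
    return w @ sub

print("prediction at x = 0.25:", predict(0.25), "(true value about", np.sin(2 * np.pi * 0.25), ")")
```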

Variations of Locally Weighted Regressions

Receptive Field Weighted Regression
• Incremental, not batch
• E, centers c_{1…E} and widths Σ_{1…E} determined automatically
• Disadvantage: many open parameters

Schaal, S. and Atkeson, C. G. (1997). Receptive Field Weighted Regression [34]. Technical Report TR-H-209, ATR Human Information Processing Laboratories.

Locally Weighted Projection Regression
• As RFWR, but also performs dimensionality reduction within each receptive field

Vijayakumar, S. and Schaal, S. (2000). Locally Weighted Projection Regression … [208]. International Conference on Machine Learning.

Outline

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."
• Regularization: "Avoid large parameter vectors."

Regularization

• Idea: penalize large parameter vectors to avoid overfitting / achieve sparse parameter vectors

a∗ = argmin_a ( ½‖y − Xa‖²  +  (λ/2)‖a‖² )       (10)
                 fit data       small parameters

L2-norm for ‖a‖: ‖a‖₂ = (Σ_{d=1}^D |a_d|²)^{1/2}   (Euclidean distance)
• a∗ = (λI + XᵀX)⁻¹Xᵀy.
• still closed form: "Tikhonov Regularization", "Ridge Regression"

L1-norm for ‖a‖: ‖a‖₁ = Σ_{d=1}^D |a_d|   (Manhattan distance)
• no closed-form solution …
• "LASSO Regularization"

Michael Littman and Charles Isbell feat. Infinite Harmony, "Overfitting A Cappella"

Use a combination of L1 and L2: "Elastic Nets"
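A rough sketch of the two regularizers side by side (not from the slides): ridge keeps the closed form above, while LASSO needs an iterative solver; scikit-learn's Lasso is used here under the assumption that scikit-learn is installed, and the data is synthetic.

```python
# Sketch of L2 (ridge, closed form) vs. L1 (LASSO, iterative) regularization.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
N, D = 60, 8
X = rng.standard_normal((N, D))
a_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])   # sparse ground truth
y = X @ a_true + 0.1 * rng.standard_normal(N)

# Ridge / Tikhonov: a* = (lambda I + X^T X)^{-1} X^T y  (still closed form)
lam = 1.0
a_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# LASSO: no closed form; use the coordinate-descent solver implemented in scikit-learn.
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", np.round(a_ridge, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))   # several driven exactly to zero
```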

Beyond squares

a∗ = argmin_a ( ½‖y − Xa‖²  +  (λ/2)‖a‖² )       (11)

• Penalty on residuals r_n (fit data)
• Penalty on parameters a (regularization)

Both penalties can be chosen differently than the square.

Linear Support Vector Regression

• No closed-form solution, but efficient optimizers exist
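As a hedged illustration of "efficient optimizers exist": scikit-learn's LinearSVR is one such solver for the ε-insensitive loss with an L2 penalty. This assumes scikit-learn is installed; the data and parameter values are arbitrary.

```python
# Sketch: linear support vector regression solved with scikit-learn's LinearSVR.
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=(80, 1))
y = 3.0 * x[:, 0] + 0.5 + 0.1 * rng.standard_normal(80)

# epsilon-insensitive penalty on the residuals, L2 penalty on the parameters.
svr = LinearSVR(epsilon=0.05, C=10.0, max_iter=10000).fit(x, y)
print("slope:", svr.coef_, "offset:", svr.intercept_)
```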

Outline (continued): next, Radial Basis Function Network + friends – "Project data into feature space. Do least squares in this space."

Radial Basis Function Network

f(x) = Σ_{e=1}^E w_e · φ(x, c_e).          (12)

With general basis-function parameters θ_e:

f(x) = Σ_{e=1}^E w_e · φ(x, θ_e).          (13)

Feature matrix (analogous to design matrix X):

Θ = [φ(x_n, c_e)]  –  the N × E matrix whose entry in row n, column e is φ(x_n, c_e)    (14)

Least squares solution:

w∗ = (ΘᵀΘ)⁻¹Θᵀy.                           (15)
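A minimal NumPy sketch of (14)–(15), with hand-picked Gaussian centers and width on made-up 1-D data.

```python
# Minimal NumPy sketch of a radial basis function network fitted by least squares.
import numpy as np

rng = np.random.default_rng(5)
N, E = 150, 12
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

centers = np.linspace(0.0, 1.0, E)
sigma = 0.08

# Feature matrix Theta, with Theta[n, e] = phi(x_n, c_e)
Theta = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2)   # N x E

# w* = (Theta^T Theta)^{-1} Theta^T y
w_star = np.linalg.solve(Theta.T @ Theta, Theta.T @ y)

def predict(xq):
    """f(x) = sum_e w_e phi(x, c_e)."""
    feats = np.exp(-0.5 * ((xq - centers) / sigma) ** 2)
    return feats @ w_star

print("prediction at x = 0.25:", predict(0.25))
```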

Kernel Ridge Regression

• Like an RBFN, but every data point is the center of a basis function
• Uses L2 regularization

f(x) = Σ_{n=1}^N w_n · k(x, x_n).          (16)

"Gram matrix" (analogous to design matrix X):

K(X,X) = [k(x_i, x_j)]  –  the N × N matrix whose entry in row i, column j is k(x_i, x_j)    (17)

w∗ = (KᵀK)⁻¹Kᵀy                            (18)
   = K⁻¹y,                                 (19)

w∗ = (λI + K)⁻¹y   with L2 regularization  (20)
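A minimal NumPy sketch of (20): every data point becomes a basis-function center, and a small λ keeps the Gram matrix well conditioned. Kernel and parameters are arbitrary choices for illustration.

```python
# Minimal NumPy sketch of kernel ridge regression.
import numpy as np

rng = np.random.default_rng(6)
N = 100
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

def kernel(a, b, length_scale=0.1):
    """Squared-exponential kernel k(a, b)."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

K = kernel(x, x)                                    # N x N Gram matrix
lam = 1e-3
w_star = np.linalg.solve(lam * np.eye(N) + K, y)    # w* = (lambda I + K)^{-1} y

def predict(xq):
    """f(x) = sum_n w_n k(x, x_n)."""
    return kernel(np.atleast_1d(xq), x) @ w_star

print("prediction at x = 0.25:", predict(0.25))
```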

(Radial) Basis Function Networks

Beyond radial basis functions
• Cosines: Ridge Regression with Random Fourier Features
• Sigmoids: Extreme Learning Machines (MLFF with 1 hidden layer)
• Boxcars: model trees (as decision trees, but for regression)
• Kernels: every data point is the center of a radial basis function

Since least squares is at the heart of all of these:
• incremental versions ← recursive least squares (see the sketch below)
• apply L2 regularization (still closed form)
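A rough sketch of the recursive-least-squares idea mentioned above: the batch ridge solution is updated one sample at a time via the Sherman–Morrison identity. The random cosine features and all parameter values are assumptions for illustration only.

```python
# Minimal sketch of recursive least squares: the batch solution
# w* = (lambda I + Theta^T Theta)^{-1} Theta^T y, updated one sample at a time.
import numpy as np

def rls_init(n_features, lam=1e-3):
    """Start from the regularized 'no data' solution."""
    return np.zeros(n_features), np.eye(n_features) / lam      # w, P

def rls_update(w, P, theta_n, y_n):
    """Incorporate one feature vector theta_n with target y_n."""
    Pt = P @ theta_n
    gain = Pt / (1.0 + theta_n @ Pt)
    w = w + gain * (y_n - theta_n @ w)
    P = P - np.outer(gain, Pt)
    return w, P

# Toy usage with random cosine features (in the spirit of random Fourier features).
rng = np.random.default_rng(7)
E = 20
omega, phase = rng.normal(0.0, 5.0, E), rng.uniform(0.0, 2 * np.pi, E)
features = lambda x: np.cos(omega * x + phase)

w, P = rls_init(E)
for _ in range(500):
    x_n = rng.uniform(0.0, 1.0)
    y_n = np.sin(2 * np.pi * x_n) + 0.05 * rng.standard_normal()
    w, P = rls_update(w, P, features(x_n), y_n)

print("incremental prediction at x = 0.25:", features(0.25) @ w)
```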

Freek, aren’t you being a bit shallow?

• Deep learning is great when you
  • do not know the features
  • know the features to be hierarchically organized

Rajeswaran A, Lowrey K, Todorov E and Kakade S. (2017). Towards generalization and simplicity in continuous control. Neural Information Processing Systems (NIPS).

A neural network perspective

All these models can be considered (degenerate) neural networks!

Backpropagation can be used in all these models!


Linear model

Figure: Network representation of a linear model (inputs x_1 … x_D with weights a_1 … a_D feeding a summation output y). Activation is… linear!

RBFN

f(x) = Σ_{e=1}^E b_e φ(x, θ_e)

Figure: The RBFN model as a network (inputs x_1 … x_D, basis functions φ_1 … φ_E with output weights b_1 … b_E). φ_e is an abbreviation of φ(x, θ_e).

RRRFF

Figure: The RRRFF model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

SVR

Figure: The SVR model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

Regression trees

Figure: The regression trees model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

Extreme learning machine

Figure: The extreme learning machine model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

Extreme Learning Machine vs. (Deep) Neural Networks

• ELM: sigmoid activation function, one hidden layer with random (untrained) features
• ANN: sigmoid activation function, hidden layers, learned features
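A minimal sketch of the ELM recipe described above, with synthetic data and arbitrary hidden-layer sizes: only the output weights are fitted, by (regularized) least squares.

```python
# Minimal sketch of an extreme learning machine: random sigmoid hidden layer,
# output weights fitted by ridge-regularized least squares.
import numpy as np

rng = np.random.default_rng(8)
N, D, E = 200, 1, 50
X = rng.uniform(-1.0, 1.0, size=(N, D))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(N)

# Random, untrained hidden layer (this is what distinguishes ELM from backprop-trained nets).
W_hidden = rng.normal(0.0, 3.0, size=(D, E))
b_hidden = rng.uniform(-1.0, 1.0, size=E)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Phi = sigmoid(X @ W_hidden + b_hidden)              # N x E random feature matrix

# Output weights by ridge-regularized least squares (closed form).
lam = 1e-3
beta = np.linalg.solve(lam * np.eye(E) + Phi.T @ Phi, Phi.T @ y)

x_query = np.array([[0.2]])
print("prediction:", sigmoid(x_query @ W_hidden + b_hidden) @ beta)
```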

KRR and GPR

f(x) = Σ_{n=1}^N w_n φ_n(x)

Figure: The function model used in KRR and GPR, as a network (inputs x_1 … x_D, one basis function φ_n per data point, output weights w_1 … w_N).

Locally weighted regression

Figure: Function model in Locally Weighted Regressions, represented as a feedforward neural network. The functions φ_e(x) generate the weights w_e from the hidden nodes – which contain linear sub-models (a_eᵀx + b_e) – to the output node. Here, φ_e is an abbreviation of φ(x, θ_e).

Conclusion

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."

Two model families emerge: a Weighted Sum of Linear Models (locally weighted regression and friends) and a Weighted Sum of Basis Functions (RBFN and friends).

Conclusion: Generic batch regression flow-chart

• Algorithm – least squares: a∗ = (λI + XᵀX)⁻¹Xᵀy
• Model – linear model: f(x) = aᵀx
• Meta-parameters – regularization: λ
• Model parameters – slopes: a

Figure: Flow chart of generic batch regression. The regression algorithm takes training data (inputs/targets), a model, and meta-parameters, and outputs the model parameters; "Evaluate" then maps a novel input to an output (prediction) using the model and its parameters.

Conclusion: Unified Model

f(x) = Σ_{e=1}^E φ(x, θ_e) · (b_e + a_eᵀx)    Weighted sum of linear models      (21)

f(x) = Σ_{e=1}^E φ(x, θ_e) · w_e              Weighted sum of basis functions    (22)

(22) is a special case of (21) with a_e = 0 and b_e ≡ w_e

Freek Stulp and Olivier Sigaud (2015). Many regression algorithms, one unified model - A review. Neural Networks.
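A tiny sketch (not from the slides) of evaluating the unified model: eq. (21) as a weighted sum of linear models, and the special case (22) obtained by zeroing the slopes. Centers, slopes and offsets are random placeholders.

```python
# Sketch of the unified model: eq. (21), with eq. (22) as the a_e = 0 special case.
import numpy as np

E, D = 5, 2
rng = np.random.default_rng(9)
centers = rng.uniform(-1.0, 1.0, size=(E, D))       # theta_e reduced to a center, for illustration
A = rng.standard_normal((E, D))                     # slopes a_e
b = rng.standard_normal(E)                          # offsets b_e

def phi(x):
    """Normalized Gaussian basis functions phi(x, theta_e)."""
    w = np.exp(-0.5 * np.sum((x - centers) ** 2, axis=1) / 0.3 ** 2)
    return w / w.sum()

def f_unified(x, A, b):
    """Eq. (21): f(x) = sum_e phi(x, theta_e) (b_e + a_e^T x)."""
    return phi(x) @ (b + A @ x)

x = np.array([0.1, -0.4])
print("weighted sum of linear models:  ", f_unified(x, A, b))
print("weighted sum of basis functions:", f_unified(x, np.zeros_like(A), b))   # eq. (22)
```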

Conclusion

Mixture of linear models (unified model): f(x) = Σ_{e=1}^E φ(x, θ_e) · (a_eᵀx + b_e)
  sub-models: a_eᵀx + b_e,  weights: φ(x, θ_e)
  • Gaussian BFs: GMR (Gaussian Mixture R.), LWR (Locally Weighted R.), RFWR (Receptive Field Weighted R.), LWPR (Locally Weighted Projection R.)
  • Any BFs: XCSF
  • Box-car BFs: M5 (Model Trees)
  • E = 1, linear model f(x) = aᵀx + b: LLS (Linear Least Squares), (Weighted Linear Least Squares)

Weighted sum of basis functions (a = 0): f(x) = Σ_{e=1}^E φ(x, θ_e) · b_e
  sub-models: φ(x, θ_e),  weights: b_e
  • Radial φ(‖x − c‖): RBFN (Radial Basis Function Network)
  • Kernel φ(x, x′): KRR (Kernel Ridge R.), GPR (Gaussian Process R.)
  • Cosine BFs: iRFRLS, I-SSGPR
  • Box-car BFs: CART (Regression Trees)
  • Sigmoid φ(⟨x, w⟩): ELM (Extreme Learning Machine), Backpropagation

Figure: Classification of regression algorithms, based only on the model used to represent the underlying function.

Many toolkits available

• Python
  • scikit-learn: http://scikit-learn.org
  • StatsModels: http://www.statsmodels.org/
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
• Matlab
  • curvefit: https://www.mathworks.com/help/curvefit/linear-and-nonlinear-regression.html
  • PbDlib: http://calinon.ch/codes.htm
• C++
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
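As a small, hedged usage sketch of the first toolkit on the list (assuming scikit-learn is installed; kernels and parameters are arbitrary): kernel ridge regression and Gaussian process regression fitted to the same toy data, with GPR also returning a variance estimate.

```python
# Quick sketch of two scikit-learn regressors on the same toy data.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(10)
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)

krr = KernelRidge(kernel="rbf", alpha=1e-2, gamma=15.0).fit(X, y)
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2).fit(X, y)

X_query = np.array([[0.25]])
mean, std = gpr.predict(X_query, return_std=True)
print("KRR prediction:", krr.predict(X_query))
print("GPR prediction:", mean, "+/-", std)          # GPR also estimates variance
```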

Personal Favourites

Gaussian process regression
+ Very few assumptions
+ Meta-parameters estimated from the data itself
+ Estimates variance also
+ Works in high dimensions
- Training/query times increase with amount of data
- Not easy to make incremental

Gaussian mixture regression
+ Estimates variance also
+ Algorithm is inherently incremental
+ Some meta-parameters, but easy to tune
+ Fast training times
- Training only stable for low input dimensions

Locally Weighted Regressions
+ Fast query times, fast training
+ Few meta-parameters, and easy to set
+ Stable learning results (batch)
- Not incremental
- No variance estimate

Deep Learning
+ Automatic extraction of (hierarchical) features

Conclusion

• Don’t think about these regression algorithms as being unique: they are similar algorithms that use different subsets of algorithmic features
• All these models are essentially shallow neural networks with different basis functions

Thank you for your attention!


Appendix

Regression Tutorial – Freek Stulp

Gaussian Process Regression

"Given a Gaussian process on some topological space T, with a continuous covariance kernel C(·,·): T × T → R, we can associate a Hilbert space, which is the reproducing kernel Hilbert space of real-valued functions on T, with C as kernel function."

Instead of screaming, let’s talk about what it means to be smooth.

Gaussian Process Regression

• Points that are close in the input space should be close in the output space.
  • Cities that are close geographically have similar temperatures (on average)
  • Taller people have larger shoe sizes (on average)
• Shoe size covaries with height

Gaussian Process Regression – Covariance Function

The covariance function yields the covariance matrix (Gram matrix), here over five city locations:

K(X,X) =
          xAug  xMuc  xWar  xMin  xMos
  xAug    1.00  0.96  0.42  0.02  0.00
  xMuc    0.96  1.00  0.59  0.04  0.00
  xWar    0.42  0.59  1.00  0.32  0.10
  xMin    0.02  0.04  0.32  1.00  0.80
  xMos    0.00  0.00  0.10  0.80  1.00

• Remarks
  • Basis function has a very specific interpretation: covariance
  • No temperature measurements y have been made yet
  • Prior: assume temperature is 0 °C

Question: Expected temperature in Munich, given 9 °C in Augsburg?
(condition on TAug = 9, i.e. E[TMuc | TAug = 9])
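A short worked version of this question (not on the slides), under the stated zero-mean, unit-variance prior: conditioning a Gaussian gives E[TMuc | TAug = 9] = k(xMuc, xAug) · k(xAug, xAug)⁻¹ · 9, which is the same expression as y_q = k(x_q, X) K(X,X)⁻¹ y used in the example that follows.

```python
# Worked conditioning example with the covariance values from the table above.
import numpy as np

k_MucAug = 0.96          # covariance between Munich and Augsburg
k_AugAug = 1.00          # prior variance at Augsburg
T_Aug = 9.0              # measured temperature in Augsburg

T_Muc_expected = k_MucAug / k_AugAug * T_Aug
print("E[TMuc | TAug = 9] =", T_Muc_expected)       # 8.64 degrees C
```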

Gaussian Process Regression – Example

With one measured city (Augsburg):

k(xMuc, xAug) = 0.96     k(xWar, xAug) = 0.42     k(xMos, xAug) = 0.00

With two measured cities:

k(xMuc, [xAug xMin]) = [0.96 0.04]
k(xWar, [xAug xMin]) = [0.42 0.32]
k(xMos, [xAug xMin]) = [0.00 0.80]

K(X,X) =
          xAug  xMin
  xAug    1.00  0.02
  xMin    0.02  1.00

What are the plane slopes?

y_q = k(x_q, X) K(X,X)⁻¹y          (23)
      (see above)  (least squares!)

Kernel Regression

f(x) = Σ_{n=1}^N w_n · k(x, x_n),      w∗ = K(X,X)⁻¹y

The more measurements become available, the more certain we become.
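A minimal NumPy sketch of (23) with the two-measurement numbers above. The 9 °C measurement in Augsburg comes from the slides; the second measurement value is made up purely so the example runs.

```python
# Sketch of y_q = k(x_q, X) K(X,X)^{-1} y with the numbers from the example above.
import numpy as np

K = np.array([[1.00, 0.02],
              [0.02, 1.00]])                        # K(X, X) for the two measured cities
y = np.array([9.0, -5.0])                           # temperatures; -5 is a made-up value

# Kernel vectors k(x_q, X) for the three query cities, taken from the slides.
k_query = {
    "xMuc": np.array([0.96, 0.04]),
    "xWar": np.array([0.42, 0.32]),
    "xMos": np.array([0.00, 0.80]),
}

w_star = np.linalg.solve(K, y)                      # the "least squares!" part: K^{-1} y
for city, k_q in k_query.items():
    print(f"expected temperature at {city}: {k_q @ w_star:.2f}")
```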
