
IROS2018 Tutorial Mo-TUT-2

From Least Squares Regression to High-dimensional Motion Primitives

Freek Stulp, Sylvain Calinon, Gerhard Neumann

Generation vs. Recall

159 + 72 = ???

159 + 72 = (159 + 2) + (72 − 2) = 161 + 70 = 231   "generation"

159 + 72 = (9 + 2) + (150 + 70) = 11 + 220 = 231   "generation"

159 + 72 = 231   "recall"

The distinction between these two strategies is important in cognitive science, artificial intelligence, robotics, teaching

Motion Generation vs. Motion Recall

Motion primitives in nature

Giszter, S.; Mussa-Ivaldi, F. & Bizzi, E. Convergent force fields organized in the frog’s spinal cord. Journal of Neuroscience, 1993.

Flash, T. & Hochner, B. Motor Primitives in Vertebrates and Invertebrates. Current Opinion in Neurobiology, 2005.

Motion primitives for robots?

• Couple degrees of freedom to deal with high-dimensional systems
• Sequencing and superposition of MPs for more complex tasks
• Low-dimensional parameterization of MPs enables learning
• MPs can be bootstrapped with demonstrations
• Direct mappings between task parameters and MP parameters

Motion primitives for robots!

Ijspeert, A. J.; Nakanishi, J. & Schaal, S. Movement imitation with nonlinear dynamical systems in humanoid robots. ICRA, 2002.

Schedule

9:00 – 9:15    Introduction
9:15 – 10:45   Regression Tutorial (Freek Stulp)
10:45 – 11:00  Motion Primitives 1 (Sylvain Calinon)
11:00 – 11:30  Coffee Break
11:30 – 12:15  Motion Primitives 1, cont. (Sylvain Calinon)
12:15 – 13:15  Motion Primitives 2 (Gerhard Neumann)
13:15 – 13:30  Wrap up

Regression Tutorial
IROS’18 Tutorial

Freek Stulp
Institute of Robotics and Mechatronics, German Aerospace Center (DLR)

Autonomous Systems and Robotics, ENSTA-ParisTech

01.10.2018

What Is Regression?

Estimating a relationship between input variables and continuous output variables from data

Application: Dynamic parameter estimation

An, C.; Atkeson, C. and Hollerbach, J. (1985). Estimation of inertial parameters of rigid body links of manipulators [404]. IEEE Conference on Decision and Control.

Application: Programming by demonstration

Rozo, L.; Calinon, S.; Caldwell, D. G.; Jimenez, P. and Torras, C. (2016). Learning Physical Collaborative Robot Behaviors from Human Demonstrations. IEEE Trans. on Robotics.

Calinon, S.; Guenter, F. and Billard, A. (2007). On Learning, Representing and Generalizing a Task in a Humanoid Robot [725]. IEEE Transactions on Systems, Man and Cybernetics.

Application: Biosignal Processing

Gijsberts, A., Bohra, R., Sierra González, D., Werner, A., Nowak, M., Caputo, B., Roa, M. and Castellini, C. (2014). Stable myoelectric control of a hand prosthesis using non-linear incremental learning. Frontiers in Neurorobotics.

What Is Not Regression?

Training data: {(x_n, y_n)}_{n=1}^N, with input x_n ∈ X and target y_n ∈ Y for all n

Supervised learning (targets available):
• Regression: targets available, Y ⊆ R^M
• Classification: targets available, Y ⊆ {1, …, K}
Reinforcement learning: no targets, only rewards r_n ∈ R
Unsupervised learning: no targets at all

Regression – Assumptions about the Function

linear | locally linear | smooth | none

→ Linear Least Squares

A.M. Legendre (1805). Nouvelles méthodes pour la détermination des orbites des comètes [519]. Firmin Didot.

C.F. Gauss (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum [943].

Linear Least Squares

Linear model: f(x) = aᵀx

• X is the N × D "design matrix": row n is the D-dimensional data point xₙᵀ = (x_{n,1}, …, x_{n,D})
• y = (y_1, …, y_N)ᵀ is the corresponding vector of targets

Which line fits the data best?
1 Define fitting criterion
2 Optimize a w.r.t. criterion

1 Define fitting criterion: sum of squared residuals

S(a) = Σ_{n=1}^N r_n²                  (1)
     = Σ_{n=1}^N (y_n − f(x_n))²       (2)

Applied to a linear model:

S(a) = Σ_{n=1}^N (y_n − aᵀx_n)²        (3)
     = (y − Xa)ᵀ(y − Xa)               (4)

2 Optimize a w.r.t. criterion: minimize the sum of squared residuals S(a)

a∗ = argmin_a S(a) = argmin_a (y − Xa)ᵀ(y − Xa)

Quadratic cost: when is its derivative 0?

S′(a) = 2(XᵀXa − Xᵀy)

a∗ = (XᵀX)⁻¹Xᵀy

Nice! Closed-form solution to find a∗; the fitted model is f(x) = a∗ᵀx.

Offset trick: to also fit an offset, f(x) = aᵀx + b = [a; b]ᵀ[x; 1], i.e. append a column of ones to the design matrix, so that row n of X becomes (x_{n,1}, …, x_{n,D}, 1).
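The closed-form solution above maps directly onto a few lines of NumPy. The following is a minimal sketch (not part of the original slides), with made-up data for illustration; it uses a linear solve rather than an explicit matrix inverse.

```python
# Minimal NumPy sketch of linear least squares with the offset trick (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 2
X = rng.uniform(-1.0, 1.0, size=(N, D))            # N x D design matrix
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.1 * rng.standard_normal(N)

# Offset trick: append a column of ones so f(x) = a^T x + b becomes [a; b]^T [x; 1].
X1 = np.hstack([X, np.ones((N, 1))])

# a* = (X^T X)^{-1} X^T y, computed with a solve instead of an explicit inverse.
a_star = np.linalg.solve(X1.T @ X1, X1.T @ y)

def f(x):
    """Fitted linear model f(x) = a*^T [x; 1]."""
    return np.append(x, 1.0) @ a_star

print("slopes and offset:", a_star)                 # close to [2.0, -1.0, 0.5]
print("prediction at x = [0.2, 0.3]:", f(np.array([0.2, 0.3])))
```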

Outline

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."

Weighted Least Squares

Weighted Linear Least Squares

Idea: it is more important to fit some points than others.

• Importance ≡ weight w_n
• Example weightings: manual, boxcar function, Gaussian function
  w_n = φ(x_n, θ) = exp(−½ (x_n − c)ᵀ Σ⁻¹ (x_n − c)),   with θ = (c, Σ)

1 Define fitting criterion: weighted residuals

S(a) = Σ_{n=1}^N w_n (y_n − aᵀx_n)²      (5)
     = (y − Xa)ᵀW(y − Xa),               (6)

with W the diagonal matrix of weights, W_{nn} = w_n.

2 Optimize a w.r.t. criterion

a∗ = (XᵀWX)⁻¹XᵀWy.                       (7)
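A minimal NumPy sketch of the weighted solution (7), using the one-dimensional special case of the Gaussian weighting function; the data, center c and width are made up for illustration.

```python
# Minimal NumPy sketch of weighted linear least squares (synthetic 1-D data).
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = np.linspace(-2.0, 2.0, N)
X = np.column_stack([x, np.ones(N)])                # offset trick included
y = 1.5 * x - 0.3 + 0.2 * rng.standard_normal(N)

# Gaussian weighting function w_n = exp(-1/2 (x_n - c)^2 / sigma^2), centered at c = 0.5.
c, sigma = 0.5, 0.4
w = np.exp(-0.5 * ((x - c) / sigma) ** 2)
W = np.diag(w)                                      # diagonal weight matrix

# a* = (X^T W X)^{-1} X^T W y
a_star = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print("weighted fit (slope, offset):", a_star)
```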

Locally Weighted Regressions

In robotics, functions usually non-linear. But often locally linear!

Idea: do multiple, independent, locally weighted least squares regressions.

William S. Cleveland; Susan J. Devlin (1988). Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting [4074]. Journal of the American Statistical Association.

Atkeson, C. G.; Moore, A. W. and Schaal, S. (1997). Locally Weighted Learning for Control [2160]. Artificial Intelligence Review.

• Idea: multiple, independent, locally weighted least squares regressions
• Locally: radial weighting function with different centers ("receptive field")

for e = 1 … E
  for n = 1 … N
    W_{nn,e} = g(x_n, c_e, Σ)
  a_e = (XᵀW_eX)⁻¹XᵀW_ey.                  (8)

Resulting model:

f(x) = Σ_{e=1}^E φ(x, θ_e) · (a_eᵀx).      (9)

(φ must be normalized)
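A rough NumPy sketch of the loop above on a 1-D toy problem. The centers, width and test function are arbitrary choices for illustration, and the offset trick is used inside each local model.

```python
# Minimal NumPy sketch of locally weighted regression on a 1-D toy problem.
import numpy as np

rng = np.random.default_rng(2)
N, E = 200, 10
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

X = np.column_stack([x, np.ones(N)])                # linear sub-models with offset
centers = np.linspace(0.0, 1.0, E)
sigma = 0.08

def phi(xq, c):
    """Gaussian receptive field."""
    return np.exp(-0.5 * ((xq - c) / sigma) ** 2)

# One weighted least squares fit per receptive field e: a_e = (X^T W_e X)^{-1} X^T W_e y
A = np.zeros((E, 2))
for e, c in enumerate(centers):
    W_e = np.diag(phi(x, c))
    A[e] = np.linalg.solve(X.T @ W_e @ X, X.T @ W_e @ y)

def predict(xq):
    """f(x) = sum_e phi_e(x) (a_e^T [x; 1]) with normalized weights."""
    w = phi(xq, centers)
    w = w / w.sum()                                 # phi must be normalized
    sub = A @ np.array([xq, 1.0])                   # outputs of the E linear sub-models
    return w @ sub

print("prediction at x = 0.25:", predict(0.25), "(true value about", np.sin(2 * np.pi * 0.25), ")")
```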

Variations of Locally Weighted Regressions

Receptive Field Weighted Regression
• Incremental, not batch
• E, centers c_{1…E} and widths Σ_{1…E} determined automatically
• Disadvantage: many open parameters

Schaal, S. and Atkeson, C. G. (1997). Receptive Field Weighted Regression [34]. Technical Report TR-H-209, ATR Human Information Processing Laboratories.

Locally Weighted Projection Regression
• As RFWR, but also performs dimensionality reduction within each receptive field

Vijayakumar, S. and Schaal, S. (2000). Locally Weighted Projection Regression … [208]. International Conference on Machine Learning.

Outline

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."
• Regularization: "Avoid large parameter vectors."

Regularization

• Idea: penalize large parameter vectors to avoid overfitting / achieve sparse parameter vectors

a∗ = argmin_a ( ½‖y − Xa‖²  +  (λ/2)‖a‖² )       (10)
                 fit data       small parameters

L2-norm for ‖a‖: ‖a‖₂ = (Σ_{d=1}^D |a_d|²)^{1/2}   (Euclidean distance)
• a∗ = (λI + XᵀX)⁻¹Xᵀy.
• still closed form: "Tikhonov Regularization", "Ridge Regression"

L1-norm for ‖a‖: ‖a‖₁ = Σ_{d=1}^D |a_d|   (Manhattan distance)
• no closed-form solution …
• "LASSO Regularization"

Michael Littman and Charles Isbell feat. Infinite Harmony, "Overfitting A Cappella"

Use a combination of L1 and L2: "Elastic Nets"
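A rough sketch of the two regularizers side by side (not from the slides): ridge keeps the closed form above, while LASSO needs an iterative solver; scikit-learn's Lasso is used here under the assumption that scikit-learn is installed, and the data is synthetic.

```python
# Sketch of L2 (ridge, closed form) vs. L1 (LASSO, iterative) regularization.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
N, D = 60, 8
X = rng.standard_normal((N, D))
a_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])   # sparse ground truth
y = X @ a_true + 0.1 * rng.standard_normal(N)

# Ridge / Tikhonov: a* = (lambda I + X^T X)^{-1} X^T y  (still closed form)
lam = 1.0
a_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# LASSO: no closed form; use the coordinate-descent solver implemented in scikit-learn.
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", np.round(a_ridge, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))   # several driven exactly to zero
```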

Beyond squares

a∗ = argmin_a ( ½‖y − Xa‖²  +  (λ/2)‖a‖² )       (11)

• Penalty on residuals r_n (fit data)
• Penalty on parameters a (regularization)

Both penalties can be chosen differently than the square.

Linear Support Vector Regression

• No closed-form solution, but efficient optimizers exist
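As a hedged illustration of "efficient optimizers exist": scikit-learn's LinearSVR is one such solver for the ε-insensitive loss with an L2 penalty. This assumes scikit-learn is installed; the data and parameter values are arbitrary.

```python
# Sketch: linear support vector regression solved with scikit-learn's LinearSVR.
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=(80, 1))
y = 3.0 * x[:, 0] + 0.5 + 0.1 * rng.standard_normal(80)

# epsilon-insensitive penalty on the residuals, L2 penalty on the parameters.
svr = LinearSVR(epsilon=0.05, C=10.0, max_iter=10000).fit(x, y)
print("slope:", svr.coef_, "offset:", svr.intercept_)
```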

Outline (continued): next, Radial Basis Function Network + friends – "Project data into feature space. Do least squares in this space."

Radial Basis Function Network

f(x) = Σ_{e=1}^E w_e · φ(x, c_e).          (12)

With general basis-function parameters θ_e:

f(x) = Σ_{e=1}^E w_e · φ(x, θ_e).          (13)

Feature matrix (analogous to design matrix X):

Θ = [φ(x_n, c_e)]  –  the N × E matrix whose entry in row n, column e is φ(x_n, c_e)    (14)

Least squares solution:

w∗ = (ΘᵀΘ)⁻¹Θᵀy.                           (15)
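A minimal NumPy sketch of (14)–(15), with hand-picked Gaussian centers and width on made-up 1-D data.

```python
# Minimal NumPy sketch of a radial basis function network fitted by least squares.
import numpy as np

rng = np.random.default_rng(5)
N, E = 150, 12
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

centers = np.linspace(0.0, 1.0, E)
sigma = 0.08

# Feature matrix Theta, with Theta[n, e] = phi(x_n, c_e)
Theta = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2)   # N x E

# w* = (Theta^T Theta)^{-1} Theta^T y
w_star = np.linalg.solve(Theta.T @ Theta, Theta.T @ y)

def predict(xq):
    """f(x) = sum_e w_e phi(x, c_e)."""
    feats = np.exp(-0.5 * ((xq - centers) / sigma) ** 2)
    return feats @ w_star

print("prediction at x = 0.25:", predict(0.25))
```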

Kernel Ridge Regression

• Like an RBFN, but every data point is the center of a basis function
• Uses L2 regularization

f(x) = Σ_{n=1}^N w_n · k(x, x_n).          (16)

"Gram matrix" (analogous to design matrix X):

K(X,X) = [k(x_i, x_j)]  –  the N × N matrix whose entry in row i, column j is k(x_i, x_j)    (17)

w∗ = (KᵀK)⁻¹Kᵀy                            (18)
   = K⁻¹y,                                 (19)

w∗ = (λI + K)⁻¹y   with L2 regularization  (20)
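A minimal NumPy sketch of (20): every data point becomes a basis-function center, and a small λ keeps the Gram matrix well conditioned. Kernel and parameters are arbitrary choices for illustration.

```python
# Minimal NumPy sketch of kernel ridge regression.
import numpy as np

rng = np.random.default_rng(6)
N = 100
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

def kernel(a, b, length_scale=0.1):
    """Squared-exponential kernel k(a, b)."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

K = kernel(x, x)                                    # N x N Gram matrix
lam = 1e-3
w_star = np.linalg.solve(lam * np.eye(N) + K, y)    # w* = (lambda I + K)^{-1} y

def predict(xq):
    """f(x) = sum_n w_n k(x, x_n)."""
    return kernel(np.atleast_1d(xq), x) @ w_star

print("prediction at x = 0.25:", predict(0.25))
```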

(Radial) Basis Function Networks

Beyond radial basis functions
• Cosines: Ridge Regression with Random Fourier Features
• Sigmoids: Extreme Learning Machines (MLFF with 1 hidden layer)
• Boxcars: model trees (as decision trees, but for regression)
• Kernels: every data point is the center of a radial basis function

Since least squares is at the heart of all of these:
• incremental versions ← recursive least squares (see the sketch below)
• apply L2 regularization (still closed form)
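A rough sketch of the recursive-least-squares idea mentioned above: the batch ridge solution is updated one sample at a time via the Sherman–Morrison identity. The random cosine features and all parameter values are assumptions for illustration only.

```python
# Minimal sketch of recursive least squares: the batch solution
# w* = (lambda I + Theta^T Theta)^{-1} Theta^T y, updated one sample at a time.
import numpy as np

def rls_init(n_features, lam=1e-3):
    """Start from the regularized 'no data' solution."""
    return np.zeros(n_features), np.eye(n_features) / lam      # w, P

def rls_update(w, P, theta_n, y_n):
    """Incorporate one feature vector theta_n with target y_n."""
    Pt = P @ theta_n
    gain = Pt / (1.0 + theta_n @ Pt)
    w = w + gain * (y_n - theta_n @ w)
    P = P - np.outer(gain, Pt)
    return w, P

# Toy usage with random cosine features (in the spirit of random Fourier features).
rng = np.random.default_rng(7)
E = 20
omega, phase = rng.normal(0.0, 5.0, E), rng.uniform(0.0, 2 * np.pi, E)
features = lambda x: np.cos(omega * x + phase)

w, P = rls_init(E)
for _ in range(500):
    x_n = rng.uniform(0.0, 1.0)
    y_n = np.sin(2 * np.pi * x_n) + 0.05 * rng.standard_normal()
    w, P = rls_update(w, P, features(x_n), y_n)

print("incremental prediction at x = 0.25:", features(0.25) @ w)
```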

Freek, aren’t you being a bit shallow?

• Deep learning is great when you
  • do not know the features
  • know the features to be hierarchically organized

Rajeswaran A, Lowrey K, Todorov E and Kakade S. (2017). Towards generalization and simplicity in continuous control. Neural Information Processing Systems (NIPS).

A neural network perspective

All these models can be considered (degenerate) neural networks!

Backpropagation can be used in all these models!


Linear model

Figure: Network representation of a linear model (inputs x_1 … x_D with weights a_1 … a_D feeding a summation output y). Activation is… linear!

RBFN

f(x) = Σ_{e=1}^E b_e φ(x, θ_e)

Figure: The RBFN model as a network (inputs x_1 … x_D, basis functions φ_1 … φ_E with output weights b_1 … b_E). φ_e is an abbreviation of φ(x, θ_e).

RRRFF

Figure: The RRRFF model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

SVR

Figure: The SVR model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

Regression trees

Figure: The regression trees model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

Extreme learning machine

Figure: The extreme learning machine model; same network structure, f(x) = Σ_{e=1}^E b_e φ(x, θ_e).

Extreme Learning Machine vs. (Deep) Neural Networks

• ELM: sigmoid activation function, one hidden layer with random (untrained) features
• ANN: sigmoid activation function, hidden layers, learned features
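A minimal sketch of the ELM recipe described above, with synthetic data and arbitrary hidden-layer sizes: only the output weights are fitted, by (regularized) least squares.

```python
# Minimal sketch of an extreme learning machine: random sigmoid hidden layer,
# output weights fitted by ridge-regularized least squares.
import numpy as np

rng = np.random.default_rng(8)
N, D, E = 200, 1, 50
X = rng.uniform(-1.0, 1.0, size=(N, D))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(N)

# Random, untrained hidden layer (this is what distinguishes ELM from backprop-trained nets).
W_hidden = rng.normal(0.0, 3.0, size=(D, E))
b_hidden = rng.uniform(-1.0, 1.0, size=E)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Phi = sigmoid(X @ W_hidden + b_hidden)              # N x E random feature matrix

# Output weights by ridge-regularized least squares (closed form).
lam = 1e-3
beta = np.linalg.solve(lam * np.eye(E) + Phi.T @ Phi, Phi.T @ y)

x_query = np.array([[0.2]])
print("prediction:", sigmoid(x_query @ W_hidden + b_hidden) @ beta)
```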

KRR and GPR

f(x) = Σ_{n=1}^N w_n φ_n(x)

Figure: The function model used in KRR and GPR, as a network (inputs x_1 … x_D, one basis function φ_n per data point, output weights w_1 … w_N).

Locally weighted regression

Figure: Function model in Locally Weighted Regressions, represented as a feedforward neural network. The functions φ_e(x) generate the weights w_e from the hidden nodes – which contain linear sub-models (a_eᵀx + b_e) – to the output node. Here, φ_e is an abbreviation of φ(x, θ_e).

Conclusion

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."

Two model families emerge: a Weighted Sum of Linear Models (locally weighted regression and friends) and a Weighted Sum of Basis Functions (RBFN and friends).

Conclusion: Generic batch regression flow-chart

• Algorithm – least squares: a∗ = (λI + XᵀX)⁻¹Xᵀy
• Model – linear model: f(x) = aᵀx
• Meta-parameters – regularization: λ
• Model parameters – slopes: a

Figure: Flow chart of generic batch regression. The regression algorithm takes training data (inputs/targets), a model, and meta-parameters, and outputs the model parameters; "Evaluate" then maps a novel input to an output (prediction) using the model and its parameters.

Conclusion: Unified Model

f(x) = Σ_{e=1}^E φ(x, θ_e) · (b_e + a_eᵀx)    Weighted sum of linear models      (21)

f(x) = Σ_{e=1}^E φ(x, θ_e) · w_e              Weighted sum of basis functions    (22)

(22) is a special case of (21) with a_e = 0 and b_e ≡ w_e

Freek Stulp and Olivier Sigaud (2015). Many regression algorithms, one unified model - A review. Neural Networks.
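A tiny sketch (not from the slides) of evaluating the unified model: eq. (21) as a weighted sum of linear models, and the special case (22) obtained by zeroing the slopes. Centers, slopes and offsets are random placeholders.

```python
# Sketch of the unified model: eq. (21), with eq. (22) as the a_e = 0 special case.
import numpy as np

E, D = 5, 2
rng = np.random.default_rng(9)
centers = rng.uniform(-1.0, 1.0, size=(E, D))       # theta_e reduced to a center, for illustration
A = rng.standard_normal((E, D))                     # slopes a_e
b = rng.standard_normal(E)                          # offsets b_e

def phi(x):
    """Normalized Gaussian basis functions phi(x, theta_e)."""
    w = np.exp(-0.5 * np.sum((x - centers) ** 2, axis=1) / 0.3 ** 2)
    return w / w.sum()

def f_unified(x, A, b):
    """Eq. (21): f(x) = sum_e phi(x, theta_e) (b_e + a_e^T x)."""
    return phi(x) @ (b + A @ x)

x = np.array([0.1, -0.4])
print("weighted sum of linear models:  ", f_unified(x, A, b))
print("weighted sum of basis functions:", f_unified(x, np.zeros_like(A), b))   # eq. (22)
```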

Conclusion

Mixture of linear models (unified model): f(x) = Σ_{e=1}^E φ(x, θ_e) · (a_eᵀx + b_e)
  sub-models: a_eᵀx + b_e,  weights: φ(x, θ_e)
  • Gaussian BFs: GMR (Gaussian Mixture R.), LWR (Locally Weighted R.), RFWR (Receptive Field Weighted R.), LWPR (Locally Weighted Projection R.)
  • Any BFs: XCSF
  • Box-car BFs: M5 (Model Trees)
  • E = 1, linear model f(x) = aᵀx + b: LLS (Linear Least Squares), (Weighted Linear Least Squares)

Weighted sum of basis functions (a = 0): f(x) = Σ_{e=1}^E φ(x, θ_e) · b_e
  sub-models: φ(x, θ_e),  weights: b_e
  • Radial φ(‖x − c‖): RBFN (Radial Basis Function Network)
  • Kernel φ(x, x′): KRR (Kernel Ridge R.), GPR (Gaussian Process R.)
  • Cosine BFs: iRFRLS, I-SSGPR
  • Box-car BFs: CART (Regression Trees)
  • Sigmoid φ(⟨x, w⟩): ELM (Extreme Learning Machine), Backpropagation

Figure: Classification of regression algorithms, based only on the model used to represent the underlying function.

Many toolkits available

• Python
  • scikit-learn: http://scikit-learn.org
  • StatsModels: http://www.statsmodels.org/
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
• Matlab
  • curvefit: https://www.mathworks.com/help/curvefit/linear-and-nonlinear-regression.html
  • PbDlib: http://calinon.ch/codes.htm
• C++
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
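As a small, hedged usage sketch of the first toolkit on the list (assuming scikit-learn is installed; kernels and parameters are arbitrary): kernel ridge regression and Gaussian process regression fitted to the same toy data, with GPR also returning a variance estimate.

```python
# Quick sketch of two scikit-learn regressors on the same toy data.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(10)
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(60)

krr = KernelRidge(kernel="rbf", alpha=1e-2, gamma=15.0).fit(X, y)
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2).fit(X, y)

X_query = np.array([[0.25]])
mean, std = gpr.predict(X_query, return_std=True)
print("KRR prediction:", krr.predict(X_query))
print("GPR prediction:", mean, "+/-", std)          # GPR also estimates variance
```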

Personal Favourites

Gaussian process regression
+ Very few assumptions
+ Meta-parameters estimated from the data itself
+ Estimates variance also
+ Works in high dimensions
- Training/query times increase with amount of data
- Not easy to make incremental

Gaussian mixture regression
+ Estimates variance also
+ Algorithm is inherently incremental
+ Some meta-parameters, but easy to tune
+ Fast training times
- Training only stable for low input dimensions

Locally Weighted Regressions
+ Fast query times, fast training
+ Few meta-parameters, and easy to set
+ Stable learning results (batch)
- Not incremental
- No variance estimate

Deep Learning
+ Automatic extraction of (hierarchical) features

Conclusion

• Don’t think about these regression algorithms as being unique: they are similar algorithms that use different subsets of algorithmic features
• All these models are essentially shallow neural networks with different basis functions

Thank you for your attention!


Appendix

Regression Tutorial – Freek Stulp

Gaussian Process Regression

"Given a Gaussian process on some topological space T, with a continuous covariance kernel C(·,·): T × T → R, we can associate a Hilbert space, which is the reproducing kernel Hilbert space of real-valued functions on T, with C as kernel function."

Instead of screaming, let’s talk about what it means to be smooth.

Gaussian Process Regression

• Points that are close in the input space should be close in the output space.
  • Cities that are close geographically have similar temperatures (on average)
  • Taller people have larger shoe sizes (on average)
• Shoe size covaries with height

Gaussian Process Regression – Covariance Function

The covariance function yields the covariance matrix (Gram matrix), here over five city locations:

K(X,X) =
          xAug  xMuc  xWar  xMin  xMos
  xAug    1.00  0.96  0.42  0.02  0.00
  xMuc    0.96  1.00  0.59  0.04  0.00
  xWar    0.42  0.59  1.00  0.32  0.10
  xMin    0.02  0.04  0.32  1.00  0.80
  xMos    0.00  0.00  0.10  0.80  1.00

• Remarks
  • Basis function has a very specific interpretation: covariance
  • No temperature measurements y have been made yet
  • Prior: assume temperature is 0 °C

Question: Expected temperature in Munich, given 9 °C in Augsburg?
(condition on TAug = 9, i.e. E[TMuc | TAug = 9])
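A short worked version of this question (not on the slides), under the stated zero-mean, unit-variance prior: conditioning a Gaussian gives E[TMuc | TAug = 9] = k(xMuc, xAug) · k(xAug, xAug)⁻¹ · 9, which is the same expression as y_q = k(x_q, X) K(X,X)⁻¹ y used in the example that follows.

```python
# Worked conditioning example with the covariance values from the table above.
import numpy as np

k_MucAug = 0.96          # covariance between Munich and Augsburg
k_AugAug = 1.00          # prior variance at Augsburg
T_Aug = 9.0              # measured temperature in Augsburg

T_Muc_expected = k_MucAug / k_AugAug * T_Aug
print("E[TMuc | TAug = 9] =", T_Muc_expected)       # 8.64 degrees C
```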

Gaussian Process Regression – Example

With one measured city (Augsburg):

k(xMuc, xAug) = 0.96     k(xWar, xAug) = 0.42     k(xMos, xAug) = 0.00

With two measured cities:

k(xMuc, [xAug xMin]) = [0.96 0.04]
k(xWar, [xAug xMin]) = [0.42 0.32]
k(xMos, [xAug xMin]) = [0.00 0.80]

K(X,X) =
          xAug  xMin
  xAug    1.00  0.02
  xMin    0.02  1.00

What are the plane slopes?

y_q = k(x_q, X) K(X,X)⁻¹y          (23)
      (see above)  (least squares!)

Kernel Regression

f(x) = Σ_{n=1}^N w_n · k(x, x_n),      w∗ = K(X,X)⁻¹y

The more measurements become available, the more certain we become.
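A minimal NumPy sketch of (23) with the two-measurement numbers above. The 9 °C measurement in Augsburg comes from the slides; the second measurement value is made up purely so the example runs.

```python
# Sketch of y_q = k(x_q, X) K(X,X)^{-1} y with the numbers from the example above.
import numpy as np

K = np.array([[1.00, 0.02],
              [0.02, 1.00]])                        # K(X, X) for the two measured cities
y = np.array([9.0, -5.0])                           # temperatures; -5 is a made-up value

# Kernel vectors k(x_q, X) for the three query cities, taken from the slides.
k_query = {
    "xMuc": np.array([0.96, 0.04]),
    "xWar": np.array([0.42, 0.32]),
    "xMos": np.array([0.00, 0.80]),
}

w_star = np.linalg.solve(K, y)                      # the "least squares!" part: K^{-1} y
for city, k_q in k_query.items():
    print(f"expected temperature at {city}: {k_q @ w_star:.2f}")
```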
