
APPLIED MACHINE LEARNING

MACHINE LEARNING
Linear and Weighted Regression
Support Vector Regression

Classification (reminder)

Maps an N-dimensional input x ∈ ℝ^N to discrete values y.
E.g.: x = [Length, Color] → "Banana" or "Apple".

[Figure: bananas and apples plotted by Length and Color.]

How to estimate a continuous output value y?

Regression: introduction

Maps an N-dimensional input x ∈ ℝ^N to continuous values y ∈ ℝ.
E.g.: input: Income (GDP); output: continuous value of life satisfaction.

[Figure: life satisfaction (3–7) vs. income (GDP 2003, log scale, 0–40 000) for Bangladesh, India, Nigeria, Cambodia, China, Italy, Japan and the US.]

Regression: introduction

Maps an N-dimensional input x ∈ ℝ^N to continuous values y ∈ ℝ: Income (GDP) → continuous value of life satisfaction.

[Same figure as above.] Query point: Russia, GDP = 30 000 → estimated life satisfaction = 6.5.

Example of Use of Regressive Methods

Predict the number of diplomas that will be awarded in the next ten years across the two EPFs; the number of diplomas follows a non-linear curve as a function of time.

(School of Engineering – Section Microtechnique, 2004, A. Billard – Adapted from Blei 99 and Dorr & Montz 2004)

Example of Use of Regressive Methods

Predict the velocity of the robot given its position: ẋ = f(x), with x* the target.

Example of Use of Regressive Methods

[Figure only.]

Linear Regression

Linear regression searches for a linear mapping between input x and output y, parametrized by the slope vector w and the intercept b:

y = f(x; w, b) = wᵀx + b

[Figure: line y = wᵀx + b with intercept b.]

Linear Regression

Linear regression searches for a linear mapping between input x and output y, parametrized by the slope vector w and the intercept b:

y = f(x; w, b) = wᵀx + b

One can omit the intercept by centering the data:

y' = y − ȳ and x' = x − x̄, with x̄, ȳ the means of x and y.
y' = wᵀx' + b', with b' = b + wᵀx̄ − ȳ.
The least-square estimate of b is b* = ȳ − wᵀx̄, hence b' = 0
⇒ y' = wᵀx'.
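As a quick numerical check of this identity, here is a minimal sketch (illustrative only; it assumes NumPy and synthetic data, and the names are not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # M = 100 points, N = 2 features
w_true, b_true = np.array([1.5, -2.0]), 0.7
y = X @ w_true + b_true + 0.01 * rng.normal(size=100)

# Center the data: x' = x - x_mean, y' = y - y_mean
Xc, yc = X - X.mean(axis=0), y - y.mean()

# Fit the slope on the centered data (no intercept needed), then recover b
w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
b = y.mean() - X.mean(axis=0) @ w                # b* = y_mean - w^T x_mean
print(w, b)                                      # close to w_true and b_true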

Linear Regression

Without the intercept, linear regression searches for a linear mapping between input x and output y, parametrized by the slope vector w:

y = f(x; w) = wᵀx

[Figure: line y = wᵀx through the origin.]

Linear Regression

M pairs of training points X = [x¹ ... x^M] and y = [y¹ ... y^M], with xⁱ ∈ ℝ^N and yⁱ ∈ ℝ.

Find the optimal parameter w through least-square regression:

w* = argmin_w (1/2) Σ_{i=1}^M (wᵀxⁱ − yⁱ)²

Partial differentiation yields an analytical solution:

w* = (XᵀX)⁻¹ Xᵀ Y
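A minimal NumPy sketch of this closed-form solution (illustrative; here X is stored as an M x N matrix with one datapoint per row, the usual NumPy layout):

import numpy as np

def least_squares(X, y):
    """Ordinary least squares: w* = (X^T X)^(-1) X^T y."""
    # Solving the normal equations is more stable than forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example: noisy samples of y = w^T x
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=200)
print(least_squares(X, y))                       # approximately [2.0, -1.0, 0.5]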

Weighted Linear Regression

Standard linear regression: all points have equal weight (ŷ: the estimator).

Regression through weighted least squares:

w* = argmin_w (1/2) Σ_{i=1}^M βⁱ (wᵀxⁱ − yⁱ)², with weights β¹, ..., β^M.

Weighted Linear Regression

Regression through weighted least squares:

w* = argmin_w (1/2) Σ_{i=1}^M βⁱ (wᵀxⁱ − yⁱ)²

[Figure: points in red have large weights.]

Weighted Linear Regression

[Second example figure, same cost function: points in red have large weights.]

Weighted Linear Regression

Assuming a set of weights βⁱ for all datapoints, we set B a diagonal matrix with entries √β¹, ..., √β^M (the square roots of the weights).

Change of variable: Z = BX and v = By.

Minimizing for MSE, one gets an estimator for y at the query point:

ŷ = xᵀw = xᵀ (ZᵀZ)⁻¹ Zᵀ v

Contrast with the solution for un-weighted linear regression:

w* = (XᵀX)⁻¹ Xᵀ y
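A compact NumPy sketch of weighted least squares via this change of variables (illustrative; the weights beta are assumed to be given):

import numpy as np

def weighted_least_squares(X, y, beta):
    """Minimize sum_i beta_i (w^T x_i - y_i)^2."""
    B = np.diag(np.sqrt(beta))                   # diagonal matrix with entries sqrt(beta_i)
    Z, v = B @ X, B @ y                          # change of variables Z = BX, v = By
    return np.linalg.solve(Z.T @ Z, Z.T @ v)     # w = (Z^T Z)^(-1) Z^T v

# Example: points near x = 0 get ten times more weight
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)
beta = np.where(np.abs(X[:, 0]) < 0.2, 10.0, 1.0)
print(weighted_least_squares(X, y, beta))        # close to [3.0]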

Limitations of Linear Regression

Regression through weighted least squares,

w* = argmin_w (1/2) Σ_{i=1}^M βⁱ (wᵀxⁱ − yⁱ)², βⁱ: constant weights,

assumes that a single linear dependency applies everywhere. This is not true for data sets with local dependencies.

Limitations of Linear Regression

Regression through weighted least squares with constant weights assumes that a single linear dependency applies everywhere, which is not true for data sets with local dependencies.

It would be useful to design a regression method that best estimates the linear dependencies locally.

Locally Weighted Regression

The weights become a function of the query point x:

βⁱ(x) = K(d(xⁱ, x)), with K(d(xⁱ, x)) = e^(−d(xⁱ, x)) and d(xⁱ, x) = ‖xⁱ − x‖².

The estimate is determined through the local influence of each group of datapoints:

ŷ(x) = Σ_{i=1}^M βⁱ(x) yⁱ / Σ_{j=1}^M βʲ(x)

Locally Weighted Regression

ŷ(x) = Σ_{i=1}^M βⁱ(x) yⁱ / Σ_{j=1}^M βʲ(x), with βⁱ(x) = K(d(xⁱ, x)).

The estimate is determined through the local influence of each group of datapoints around the query point x; this generates a smooth function ŷ(x).

Locally Weighted Regression

ŷ(x) = Σ_{i=1}^M βⁱ(x) yⁱ / Σ_{j=1}^M βʲ(x), with βⁱ(x) = K(d(xⁱ, x)).

Model-free regression! There is no longer an explicit model of the form y = wᵀx. The regression is computed at each query point and depends on the training points.

Locally Weighted Regression

The estimate minimizes a local cost function at x, the query point:

min_ŷ J(x) = min_ŷ Σ_{i=1}^M (ŷ − yⁱ)² K(d(x, xⁱ))

The optimal solution to this local cost function is the weighted average above:

ŷ(x) = Σ_{i=1}^M βⁱ(x) yⁱ / Σ_{j=1}^M βʲ(x), with βⁱ(x) = K(d(x, xⁱ)).
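A short Python sketch of this kernel-weighted estimate (illustrative; it uses the Gaussian-type kernel on the squared distance assumed above):

import numpy as np

def lwr_predict(x_query, X, y):
    """Locally weighted estimate y_hat(x) = sum_i beta_i(x) y_i / sum_j beta_j(x)."""
    d = np.sum((X - x_query) ** 2, axis=1)       # squared distances d(x_i, x)
    beta = np.exp(-d)                            # kernel weights K(d) = exp(-d)
    return beta @ y / beta.sum()

# Example: nonlinear data, prediction at a single query point
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
print(lwr_predict(np.array([1.0]), X, y))        # close to sin(1.0) ≈ 0.84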

Locally Weighted Regression

The estimate ŷ(x) = Σᵢ βⁱ(x) yⁱ / Σⱼ βʲ(x), with βⁱ(x) = K(d(x, xⁱ)), raises two questions:
Which training points? Which kernel?


Exercise Session Part I

Data-driven Regression

[Figure: blue: true function; red: estimated function.]

Good prediction depends on the choice of datapoints.

Data-driven Regression

Good prediction depends on the choice of datapoints. The more datapoints, the better the fit, but computational costs increase dramatically with the number of datapoints.

Data-driven Regression

There are several methods in ML for performing non-linear regression. They differ in the objective function and in the number of parameters.

Gaussian Process Regression (GPR) uses all datapoints (model-free).
(Gaussian Process Regression is not covered in class and not examined in the final exam!)

Data-driven Regression

Gaussian Process Regression (GPR) uses all datapoints (model-free).
Support Vector Regression (SVR) picks a subset of the datapoints (the support vectors).

Data-driven Regression

Gaussian Process Regression (GPR) uses all datapoints (model-free).
Support Vector Regression (SVR) picks a subset of the datapoints (the support vectors).
Gaussian Mixture Regression (GMR) generates a new set of datapoints (the centers of the Gaussian functions).

Data-driven Regression

An estimate of the noise is important to measure the goodness of fit.

Data-driven Regression

Support Vector Regression (SVR) assumes an estimate of the noise model (the ε-tube) and then computes f directly within a noise tolerance.

Data-driven Regression

Gaussian Mixture Regression (GMR) builds a local estimate of the noise model through the variance of the system.


Support Vector Regression

Assume a nonlinear mapping f, s.t. y = f(x). How can we estimate f to best predict the pairs of training points {xⁱ, yⁱ}, i = 1, ..., M?

How to generalize the support vector machine framework for classification to estimate continuous functions?
1. Assume a non-linear mapping through feature space and then perform linear regression in feature space.
2. Supervised learning – minimize an error function.

First, determine a way to measure the error on the testing set in the linear case!

Support Vector Regression

Assume a linear mapping f, s.t. y = f(x) = wᵀx + b.

How can we estimate w and b to best predict the pairs of training points {xⁱ, yⁱ}, i = 1, ..., M? Measure the error on the prediction.

b is estimated, as in SVM, through least-square regression on the support vectors; hence we ignore it for the rest of the development.

Support Vector Regression

Set an upper bound ε on the error and consider as correctly classified all points such that |f(x) − y| ≤ ε.

Penalize only the datapoints that are not contained in the ε-tube.

[Figure: y = f(x) = wᵀx + b with an ε-tube around it.]
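A tiny sketch of this ε-insensitive penalty (illustrative; it shows the loss only, not the full optimization):

import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the eps-tube, linear penalty |f(x) - y| - eps outside it."""
    return np.maximum(np.abs(y_pred - y_true) - eps, 0.0)

# Predictions at distance 0.05 (inside the tube) and 0.30 (outside) from the targets
print(eps_insensitive_loss(np.array([0.0, 0.0]), np.array([0.05, 0.30])))
# -> [0.  0.2]: only the point outside the tube is penalized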

Support Vector Regression

ε-margin: the ε-margin is a measure of the width of the ε-insensitive tube, i.e. a measure of the precision of the regression.

A small ‖w‖ corresponds to a small slope for f; in the linear case, f is more horizontal.

Support Vector Regression

A large ‖w‖ corresponds to a large slope for f; in the linear case, f is more vertical.

The flatter the slope of the function f, the larger the ε-margin. To maximize the margin, we must minimize the norm of w.

Support Vector Regression

This can be rephrased as a constraint-based optimization problem of the form:

minimize (1/2) ‖w‖²

subject to  (wᵀxⁱ + b) − yⁱ ≤ ε  and  yⁱ − (wᵀxⁱ + b) ≤ ε,  i = 1, ..., M.

We still need to penalize points outside the ε-insensitive tube.

Support Vector Regression

To penalize points outside the ε-insensitive tube, introduce slack variables ξⁱ, ξⁱ* ≥ 0:

minimize (1/2) ‖w‖² + (C/M) Σ_{i=1}^M (ξⁱ + ξⁱ*)

subject to  (wᵀxⁱ + b) − yⁱ ≤ ε + ξⁱ,
            yⁱ − (wᵀxⁱ + b) ≤ ε + ξⁱ*,
            ξⁱ ≥ 0, ξⁱ* ≥ 0.

[Figure: the slack variables ξⁱ, ξⁱ* measure the distance of points lying outside the tube.]

Support Vector Regression

All points outside the ε-tube become support vectors (same constrained optimization problem as above).

We now have the solution to the linear regression problem. How can we generalize this to the nonlinear case?

Support Vector Regression

Lift x into feature space and then perform linear regression in feature space.

Linear case:      y = f(x) = ⟨w, x⟩ + b
Non-linear case:  x → φ(x),  y = f(x) = ⟨w, φ(x)⟩ + b

w lives in feature space!

Support Vector Regression

In feature space, we obtain the same constrained optimization problem:

minimize (1/2) ‖w‖² + (C/M) Σ_{i=1}^M (ξⁱ + ξⁱ*)

subject to  ⟨w, φ(xⁱ)⟩ + b − yⁱ ≤ ε + ξⁱ,
            yⁱ − ⟨w, φ(xⁱ)⟩ − b ≤ ε + ξⁱ*,
            ξⁱ ≥ 0, ξⁱ* ≥ 0.

Support Vector Regression

Again, we can solve this quadratic problem by introducing sets of Lagrange multipliers αⁱ, αⁱ* and writing the Lagrangian (Lagrangian = objective function + λ · constraints):

L(w, ξ, ξ*, b) = (1/2) ‖w‖² + (C/M) Σ_{i=1}^M (ξⁱ + ξⁱ*)
                 − Σ_{i=1}^M αⁱ (ε + ξⁱ − yⁱ + ⟨w, φ(xⁱ)⟩ + b)
                 − Σ_{i=1}^M αⁱ* (ε + ξⁱ* + yⁱ − ⟨w, φ(xⁱ)⟩ − b)

Support Vector Regression

The multipliers αⁱ ≥ 0 and αⁱ* ≥ 0 correspond to the constraints on points lying on either side of the ε-tube.

αⁱ and αⁱ* are non-zero only for points that do not satisfy the constraints, i.e. points outside the ε-tube.

Support Vector Regression

Requiring that the partial derivatives of the Lagrangian are all zero:

∂L/∂w = w − Σ_{i=1}^M (αⁱ − αⁱ*) φ(xⁱ) = 0   ⇒   w = Σ_{i=1}^M (αⁱ − αⁱ*) φ(xⁱ)
(w is a linear combination of the support vectors)

∂L/∂b = Σ_{i=1}^M (αⁱ* − αⁱ) = 0   ⇒   Σ_{i=1}^M αⁱ = Σ_{i=1}^M αⁱ*
(rebalancing the effect of the support vectors on both sides of the ε-tube)

Recall that αⁱ and αⁱ* are non-zero only for points that do not satisfy the constraints (points outside the ε-tube).

Support Vector Regression

Replacing these in the primal Lagrangian, we get the dual optimization problem:

max_{α, α*}  −(1/2) Σ_{i,j=1}^M (αⁱ − αⁱ*)(αʲ − αʲ*) ⟨φ(xⁱ), φ(xʲ)⟩
             − ε Σ_{i=1}^M (αⁱ + αⁱ*) + Σ_{i=1}^M yⁱ (αⁱ − αⁱ*)

subject to  Σ_{i=1}^M (αⁱ − αⁱ*) = 0  and  αⁱ, αⁱ* ∈ [0, C/M].

Kernel trick: ⟨φ(xⁱ), φ(xʲ)⟩ = k(xⁱ, xʲ).

Support Vector Regression

The solution is given by:

y = f(x) = ⟨w, φ(x)⟩ + b = Σ_{i=1}^M (αⁱ − αⁱ*) ⟨φ(xⁱ), φ(x)⟩ + b = Σ_{i=1}^M (αⁱ − αⁱ*) k(xⁱ, x) + b

The (αⁱ − αⁱ*) are linear coefficients (the Lagrange multipliers of each constraint). If one uses the RBF kernel, f is a sum of M un-normalized isotropic Gaussians centered on each training datapoint (kernel trick: k(xⁱ, x) = ⟨φ(xⁱ), φ(x)⟩).
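An illustrative sketch of evaluating this solution for given multipliers, offset b and an RBF kernel (all variable names and values are placeholders, not from the slides):

import numpy as np

def rbf_kernel(Xi, x, gamma=10.0):
    """k(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    return np.exp(-gamma * np.sum((Xi - x) ** 2, axis=-1))

def svr_predict(x, X_sv, alpha, alpha_star, b, gamma=10.0):
    """f(x) = sum_i (alpha_i - alpha_i*) k(x_i, x) + b over the support vectors."""
    return (alpha - alpha_star) @ rbf_kernel(X_sv, x, gamma) + b

# Toy usage with made-up multipliers for three support vectors
X_sv = np.array([[0.0], [1.0], [2.0]])
alpha, alpha_star = np.array([0.5, 0.0, 0.3]), np.array([0.0, 0.4, 0.0])
print(svr_predict(np.array([1.5]), X_sv, alpha, alpha_star, b=0.1))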

Support Vector Regression

y = f(x) = Σ_{i=1}^M (αⁱ − αⁱ*) k(xⁱ, x) + b

The kernel places a Gaussian function on each support vector.

Support Vector Regression

y = f(x) = Σ_{i=1}^M (αⁱ − αⁱ*) k(xⁱ, x) + b

The Lagrange multipliers define the importance of each Gaussian function.

[Figure: y = f(x) built from weighted Gaussian bumps placed on the support vectors x¹, ..., x⁶; f(x) converges to b where the effect of the support vectors vanishes.]

Support Vector Regression

SVR gives the following estimate for each pair of datapoints (xʲ, yʲ), j = 1, ..., M:

yʲ ≈ Σ_{i=1}^M (αⁱ − αⁱ*) k(xⁱ, xʲ) + b

An estimate of b can be computed using the above:

b = (1/M) Σ_{j=1}^M ( yʲ − Σ_{i=1}^M (αⁱ − αⁱ*) k(xⁱ, xʲ) )
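A short sketch of this averaged offset estimate (illustrative; it reuses the placeholder RBF kernel and multiplier names from the sketch above):

import numpy as np

def estimate_b(X, y, alpha, alpha_star, gamma=10.0):
    """b = mean_j ( y_j - sum_i (alpha_i - alpha_i*) k(x_i, x_j) )."""
    # Gram matrix K[i, j] = k(x_i, x_j) for the RBF kernel
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * d2)
    f_without_b = (alpha - alpha_star) @ K       # predictions with b left out
    return np.mean(y - f_without_b)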

ε-SVR: Hyperparameters

The solution to SVR we just saw is referred to as ε-SVR. In the constrained optimization problem

minimize (1/2) ‖w‖² + (C/M) Σ_{i=1}^M (ξⁱ + ξⁱ*), subject to the ε-tube constraints,

there are two hyperparameters:
C controls the penalty term on a poor fit;
ε determines the minimal required precision.
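A minimal scikit-learn sketch of fitting ε-SVR with these hyperparameters (illustrative; note that scikit-learn parametrizes the RBF kernel by gamma, so the "kernel width" quoted on the following slides does not map one-to-one onto gamma):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

# C: penalty on points outside the tube; epsilon: tube half-width (no penalty within it)
model = SVR(kernel="rbf", C=100.0, epsilon=0.1, gamma=1.0)
model.fit(X, y)

print(model.support_.size, "support vectors out of", len(X))
print(model.predict(np.array([[2.5]])))          # close to sin(2.5) ≈ 0.60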

ε-SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. Here the fit uses C = 100, ε = 0.1, kernel width = 0.01.

ε-SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. Here the fit uses C = 100, ε = 0.01, kernel width = 0.01: overfitting.

ε-SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. Here the fit uses C = 100, ε = 0.05, kernel width = 0.01: the effect of the kernel width on the fit is reduced by choosing appropriate hyperparameters.

ε-SVR: Effect of Hyperparameters

Note: MLDemos does not display the support vectors if there is more than one point for the same x!

Summary

Linear regression can be solved through least-mean-square estimation and yields an optimal analytical solution.

Weighted regression offers the possibility to perform a local regression and also yields an optimal analytical solution. The estimate is no longer global and is computed around each group of datapoints.

Support Vector Regression performs regression on a non-linear function and determines the important points automatically. The estimate is globally optimal.

Next: Examples of Applications of SVR

Catching Objects in Flight: Model the Object's Dynamics

Build a model of the dynamics, ẋ = f(x) = Σ_{i=1}^M (αⁱ − αⁱ*) k(xⁱ, x) + b, using Support Vector Regression; compute its derivative (closed form); use the model in an Extended Kalman Filter for real-time tracking.

Application of SVR: Mapping Eyes to Gaze

• Designed for children from 1 year of age.
• Fruit of 3 years of development (Lorenzo Piccardi, Jean-Baptiste Keller, Martin Duvanel, Olivier Barbey, Karim Benmachiche, Dario Poggiali, Dave Bergomi, Basilio Noris).
• 2 cameras, 2 microphones, 1 mirror; 96° x 96° field of view, 25 Hz / 50 Hz, 180 g.

www.pomelo-technologies.com

Application of SVR: Mapping Eyes to Gaze

B. Noris, J.-B. Keller and A. Billard. A Wearable Gaze Tracking System for Children in Unconstrained Environments. Computer Vision and Image Understanding, 2011.

Application of SVR: Mapping Eyes to Gaze

[Figure: the 96° x 96° field of view.]

Application of SVR: Mapping Eyes to Gaze

Use Support Vector Regression to learn the mapping from eye appearance to gaze coordinates.

Application of SVR: Mapping Eyes to Gaze

We collect images of the eyes and the directions of the gaze, and normalize the images through high-pass filtering.

We then learn the mapping Eye Image → Position in Image through Support Vector Regression (SVR).

Application of SVR: Mapping Eyes to Gaze

Different elements give different cues: the pupil, iris and cornea; wrinkles, eyelids and eyelashes. These cues are fed to the Support Vector Regression.

From object recognition using eye tracking to reconstructing the path in a shop.

www.pomelo-technologies.com

Monitoring Consumers' Visual Behavior

Gaze tracking using SVR; object detection using SVM.

MSc Projects in Industry