04 Machine Learning - Supervised Linear Classifier

Machine Learning for Data Mining: Linear Classifiers
Andres Mendez-Vazquez
May 23, 2016



Outline

1 Introduction
   The Simplest Functions
   Splitting the Space
   The Decision Surface

2 Developing an Initial Solution
   Gradient Descent Procedure
   The Geometry of a Two-Category Linearly-Separable Case
      Basic Method
   Minimum Squared Error Procedure
      The Error Idea
      The Final Error Equation
      The Data Matrix
      Multi-Class Solution
      Issues with Least Squares!!!
      What about Numerical Stability?


What is it?

First of all, we have a parametric model!!! Here, the model is a hyperplane:

g(x) = w^T x + w_0    (1)

In the case of R^2, we have the following function:

g(x) = w_1 x_1 + w_2 x_2 + w_0    (2)
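A minimal sketch of this discriminant in NumPy, assuming hand-picked values for w and w_0 (they are not from the slides, just an illustration); the sign of g(x) is what will later decide the class.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

# Hypothetical parameters in R^2, for illustration only
w = np.array([2.0, -1.0])
w0 = 0.5

x = np.array([1.0, 3.0])
print(g(x, w, w0))                                   # value of the discriminant
print("omega_1" if g(x, w, w0) > 0 else "omega_2")   # classify by the sign
```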



Splitting the Space R^2

Using a simple straight line, R^2 is split into two class regions (figure: the two classes on either side of the line).


Defining a Decision Surface

The equation g(x) = 0 defines a decision surface separating the elements of the classes ω_1 and ω_2.

When g(x) is linear, the decision surface is a hyperplane. Given that x_1 and x_2 are both on the decision surface:

w^T x_1 + w_0 = 0
w^T x_2 + w_0 = 0

Thus

w^T x_1 + w_0 = w^T x_2 + w_0    (3)


Defining a Decision Surface

Thus

w^T (x_1 − x_2) = 0    (4)

Remark: Any vector lying in the hyperplane is perpendicular to w, i.e. w is normal to the hyperplane.


Therefore

The space is split into two regions by the hyperplane H (figure: example in R^3).

Some Properties of the Hyperplane

Given that g(x) > 0 if x ∈ R_1 (figure: the positive half-space defined by H).

It is more

We can say the following:
   Any x ∈ R_1 is on the positive side of H.
   Any x ∈ R_2 is on the negative side of H.

In addition, g(x) can give us a way to obtain the distance from x to the hyperplane H. First, we express any x as follows:

x = x_p + r (w / ‖w‖)

Where
   x_p is the normal projection of x onto H.
   r is the desired distance:
      Positive, if x is on the positive side.
      Negative, if x is on the negative side.


We have something like this

(Figure: x decomposed as x_p + r w/‖w‖ relative to the hyperplane H.)

Now

Since g(x_p) = 0, we have that

g(x) = g(x_p + r w/‖w‖)
     = w^T (x_p + r w/‖w‖) + w_0
     = w^T x_p + w_0 + r (w^T w)/‖w‖
     = g(x_p) + r ‖w‖^2/‖w‖
     = r ‖w‖

Then, we have

r = g(x)/‖w‖    (5)
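A short sketch of Equation (5), reusing the same hypothetical w and w_0 as before: the signed distance r = g(x)/‖w‖ is positive on the positive side of H and negative on the other.

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane H."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

w, w0 = np.array([2.0, -1.0]), 0.5                    # hypothetical parameters
print(signed_distance(np.array([1.0, 3.0]), w, w0))   # negative: negative side of H
print(signed_distance(np.array([3.0, 1.0]), w, w0))   # positive: positive side of H
```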


In particular

The distance from the origin to H:

r = g(0)/‖w‖ = (w^T 0 + w_0)/‖w‖ = w_0/‖w‖    (6)

Remarks
   If w_0 > 0, the origin is on the positive side of H.
   If w_0 < 0, the origin is on the negative side of H.
   If w_0 = 0, g(x) has the homogeneous form w^T x and the hyperplane passes through the origin.


In addition...

If we do the following

g(x) = w_0 + Σ_{i=1}^{d} w_i x_i = Σ_{i=0}^{d} w_i x_i    (7)

by making x_0 = 1 and

y = (1, x_1, ..., x_d)^T = (1, x^T)^T

where y is called an augmented feature vector.
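A sketch of the augmentation, assuming a small made-up data matrix whose rows are the samples: each x gets a leading 1, so w_0 folds into a single augmented weight vector.

```python
import numpy as np

def augment(X):
    """Prepend x0 = 1 to every sample: y = (1, x1, ..., xd)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[1.0, 3.0],
              [3.0, 1.0]])            # made-up samples, one per row
Y = augment(X)                        # augmented feature vectors
w_aug = np.array([0.5, 2.0, -1.0])    # hypothetical (w0, w1, w2)
print(Y @ w_aug)                      # equals w^T x + w0 for each sample
```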


In a similar way

We have the augmented weight vector

w_aug = (w_0, w_1, ..., w_d)^T = (w_0, w^T)^T

Remarks
   The addition of a constant component to x preserves all the distance relationships between samples.
   The resulting y vectors all lie in a d-dimensional subspace, which is the x-space itself.


More Remarks

In addition
   The hyperplane decision surface defined by w_aug^T y = 0 passes through the origin in y-space, even though the corresponding hyperplane H can be in any position of the x-space.
   The distance from y to this surface is |w_aug^T y| / ‖w_aug‖, or |g(x)| / ‖w_aug‖.
   Since ‖w_aug‖ ≥ ‖w‖, this distance is less than or equal to the distance from x to H.

This mapping is quite useful, because we only need to find a single weight vector w_aug instead of finding the weight vector w and the threshold w_0 separately.



Initial Supposition

Suppose we have n samples x_1, x_2, ..., x_n, some labeled ω_1 and some labeled ω_2.

We want a weight vector w such that
   w^T x_i > 0, if x_i ∈ ω_1.
   w^T x_i < 0, if x_i ∈ ω_2.

We suggest the following normalization: we replace all the samples x_i ∈ ω_2 by their negative vectors!!!
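A sketch of this normalization on made-up augmented samples with ±1 labels (none of the numbers come from the slides): after flipping the sign of the ω_2 samples, a separating vector is simply any w with w^T x_i > 0 for every row.

```python
import numpy as np

# Hypothetical augmented samples (rows) and labels: +1 for omega_1, -1 for omega_2
Y = np.array([[1.0,  1.0, 3.0],
              [1.0,  3.0, 1.0],
              [1.0, -2.0, 0.5]])
labels = np.array([1, 1, -1])

Y_norm = Y * labels[:, None]      # negate the samples labeled omega_2

w = np.array([0.0, 1.0, 0.0])     # a candidate weight vector
print(np.all(Y_norm @ w > 0))     # True: w is a separating (solution) vector
```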


The Usefulness of the Normalization

Once the normalization is done, we only need to look for a weight vector w such that w^T x_i > 0 for all the samples.

The name of this weight vector: it is called a separating vector or solution vector.


Here, we have the solution region for w

Do not confuse this region with the decision region!!! (Figure: the separating plane and the solution space for w.)

Remark: w is not unique!!! We can have different w's solving the problem.


Here, we have the solution region for w under normalization

Do not confuse this region with the decision region!!! (Figure: the "separating" plane and the solution space after normalization.)

Remark: w is not unique!!!


How do we get this w?

In order to be able to do this, we need to impose constraints on the problem.

Possible constraints!!!
   To find a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.
   To find the minimum-length weight vector satisfying w^T x_i ≥ b for all i, where b is a positive constant called the margin.
   Here the solution region resulting from the intersection of the half-spaces with w^T x_i ≥ b > 0 lies within the previous solution region!!!


We have then

A new boundary, shifted by a distance b/‖x_i‖ (figure: the shrunken solution region).


Gradient Descent

For this, we will define a criterion function J(w): a classic optimization setup.

The basic procedure is as follows:
   1 Start with a random weight vector w(1).
   2 Compute the gradient vector ∇J(w(1)).
   3 Obtain w(2) by moving from w(1) in the direction of steepest descent, i.e. along the negative of the gradient, by using the following update:

w(k + 1) = w(k) − η(k) ∇J(w(k))    (8)


What is η(k)?

Here, η(k) is a positive scale factor, or learning rate!!!

The basic algorithm looks like this:

Algorithm 1 (Basic gradient descent)
   1 begin initialize w, criterion θ, η(·), k = 0
   2    do k = k + 1
   3       w = w − η(k) ∇J(w)
   4    until ‖η(k) ∇J(w)‖ < θ
   5 return w

Problem!!! How to choose the learning rate?
   If η(k) is too small, convergence is quite slow!!!
   If η(k) is too large, the correction will overshoot and can even diverge!!!
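A minimal sketch of Algorithm 1, assuming a fixed learning rate and a simple quadratic criterion chosen only for illustration (neither comes from the slides).

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.1, theta=1e-6, max_iter=1000):
    """Basic gradient descent: w <- w - eta * grad_J(w) until the step is tiny."""
    w = w0.astype(float)
    for _ in range(max_iter):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# Illustration: J(w) = ||w - c||^2 has gradient 2 (w - c) and minimizer c.
c = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2.0 * (w - c), np.zeros(2)))   # approx. [1, -2]
```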


Using the Taylor second-order expansion around the value w(k)

We do the following:

J(w) ≈ J(w(k)) + ∇J^T (w − w(k)) + (1/2)(w − w(k))^T H (w − w(k))    (9)

Remark: This is known as the Taylor second-order expansion!!!

Here, we have
   ∇J is the vector of partial derivatives ∂J/∂w_i evaluated at w(k).
   H is the Hessian matrix of second partial derivatives ∂²J/(∂w_i ∂w_j) evaluated at w(k).


Then

We substitute (Eq. 8) into (Eq. 9), using

w(k + 1) − w(k) = −η(k) ∇J(w(k))    (10)

We have then

J(w(k + 1)) ≅ J(w(k)) + ∇J^T (−η(k) ∇J(w(k))) + (1/2)(−η(k) ∇J(w(k)))^T H (−η(k) ∇J(w(k)))

Finally, we have

J(w(k + 1)) ≅ J(w(k)) − η(k) ‖∇J‖² + (1/2) η²(k) ∇J^T H ∇J    (11)


Differentiate with respect to η(k) and set the result equal to zero

We have then

−‖∇J‖² + η(k) ∇J^T H ∇J = 0    (12)

Finally

η(k) = ‖∇J‖² / (∇J^T H ∇J)    (13)

Remark: This is the optimal step size!!!

Problem!!! Calculating H can be quite expensive!!!
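A small sketch of Equation (13) for a quadratic criterion J(w) = ½ w^T H w − b^T w, where the Hessian is the constant matrix H, so the optimal step can be computed exactly; the values of H and b are made up for illustration.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # Hessian of the quadratic criterion
b = np.array([1.0, 1.0])
grad = lambda w: H @ w - b            # gradient of J(w) = 0.5 w^T H w - b^T w

w = np.zeros(2)
for _ in range(20):
    g = grad(w)
    if np.linalg.norm(g) < 1e-12:
        break
    eta = (g @ g) / (g @ H @ g)       # optimal step size, Eq. (13)
    w = w - eta * g

print(w)                              # close to the true minimizer H^{-1} b
print(np.linalg.solve(H, b))
```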


We can have an adaptive line search!!!

We can use the idea of keeping everything fixed except η(k). Then, we can consider the following one-dimensional function:

f(η(k)) = J(w(k) − η(k) ∇J(w(k)))

We can optimize it using line search methods.

Line Search Methods
   Backtracking line search
   Bisection method
   Golden ratio (golden-section) search
   Etc.


Example: Golden Ratio

Imagine that you have a function f : L → R defined along the search line L (figure: three probe points bracketing the minimum).

Where: choose a and b such that (a + b)/a = a/b (the golden ratio).


The process is as follows

Given f_1, f_2, f_3, where
   f_1 = f(x_1)
   f_2 = f(x_2)
   f_3 = f(x_3)

We have then: if f_2 is smaller than f_1 and f_3, then the minimum lies in [x_1, x_3].

Now, we generate x_4 with f_4 = f(x_4) in the largest subinterval, e.g. [x_2, x_3].


Finally

Two cases
   If f_4a > f_2, then the minimum lies between x_1 and x_4, and the new triplet is x_1, x_2, x_4.
   If f_4b < f_2, then the minimum lies between x_2 and x_3, and the new triplet is x_2, x_4, x_3.

Then, repeat the procedure!!!

For more, please read the paper "Sequential Minimax Search for a Maximum" by J. Kiefer.
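A sketch of golden-section search, assuming f is unimodal on the bracket [lo, hi]; in the gradient-descent setting f(η) would be J(w(k) − η ∇J(w(k))), but the test function below is just an illustration.

```python
import math

def golden_section_search(f, lo, hi, tol=1e-6):
    """Golden-section search for the minimum of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0        # 1/phi, about 0.618
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:                   # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
        else:                         # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
    return 0.5 * (lo + hi)

# Illustration: minimize f(eta) = (eta - 0.3)^2 over [0, 1]
print(golden_section_search(lambda t: (t - 0.3) ** 2, 0.0, 1.0))   # about 0.3
```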


We have another method...

Differentiate the second-order Taylor expansion with respect to w:

J(w) = J(w(k)) + ∇J^T (w − w(k)) + (1/2)(w − w(k))^T H (w − w(k))

We get

∇J + Hw − Hw(k) = 0    (14)

Thus

Hw = Hw(k) − ∇J
H^{-1} Hw = H^{-1} Hw(k) − H^{-1} ∇J
w = w(k) − H^{-1} ∇J


The Newton-Raphson Algorithm

We have the following algorithm:

Algorithm 2 (Newton descent)
   1 begin initialize w, criterion θ
   2    do k = k + 1
   3       w = w − H^{-1} ∇J(w)
   4    until ‖H^{-1} ∇J(w)‖ < θ
   5 return w
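A compact sketch of Algorithm 2, assuming J is twice differentiable with an invertible Hessian near the solution; it solves H s = ∇J instead of forming H^{-1} explicitly, and the quadratic test criterion is made up for illustration.

```python
import numpy as np

def newton_descent(grad_J, hess_J, w0, theta=1e-8, max_iter=50):
    """Newton descent: w <- w - H^{-1} grad_J(w) until the step is small."""
    w = w0.astype(float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess_J(w), grad_J(w))   # equals H^{-1} grad_J(w)
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# Illustration: J(w) = 0.5 w^T A w - b^T w, so grad = Aw - b and Hessian = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton_descent(lambda w: A @ w - b, lambda w: A, np.zeros(2)))
# a single Newton step already lands on the minimizer of a quadratic criterion
```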



Initial Setup

Important: we move away from our initial normalization of the samples!!!

Now, we are going to use the method known as Minimum Squared Error.


Now, assume the following

Imagine that your problem has two classes ω_1 and ω_2 in R^2:
   1 They are linearly separable!!!
   2 You need to label them.

We have a problem!!! Which is the problem?

We do not know the hyperplane!!! Thus, what distance does each point have to the hyperplane?


A Simple Solution For Our Quandary

Label the classes:
   ω_1 =⇒ +1
   ω_2 =⇒ −1

We produce the following labels:
   1 If x ∈ ω_1, then y_ideal = g_ideal(x) = +1.
   2 If x ∈ ω_2, then y_ideal = g_ideal(x) = −1.

Remark: We have a problem with these labels!!!



Now, What?

Assume the true function is given by

y_noise = g_noise(x) = w^T x + w_0 + ε    (15)

where the noise term satisfies ε ∼ N(µ, σ²).

Thus, we can write

y_noise = g_noise(x) = g_ideal(x) + ε    (16)


Thus, we have

What to do?

ε = y_noise − g_ideal(x)    (17)

Graphically: (figure showing the offsets ε between the noisy labels and the ideal function).



Sum Over All Errors

We can do the following:

J(w) = Σ_{i=1}^{N} ε_i² = Σ_{i=1}^{N} (y_i − g_ideal(x_i))²    (18)

Remark: this is known as least squares (fitting the vertical offsets!!!).

Generalize
If the dimensionality of each sample (data point) is d, you can extend each sample vector to x^T = (1, x'), and we have:

Σ_{i=1}^{N} (y_i − x_i^T w)² = (y − Xw)^T (y − Xw) = ‖y − Xw‖²_2    (19)



What is X?

It is the data matrix

X = [ 1  (x_1)_1  ···  (x_1)_j  ···  (x_1)_d
      ⋮     ⋮           ⋮            ⋮
      1  (x_i)_1  ···  (x_i)_j  ···  (x_i)_d
      ⋮     ⋮           ⋮            ⋮
      1  (x_N)_1  ···  (x_N)_j  ···  (x_N)_d ]    (20)

We know the following matrix-calculus identities:

d(x^T A x)/dx = Ax + A^T x,    d(Ax)/dx = A    (21)


Note about other representations

We could instead have x^T = (x_1, x_2, ..., x_d, 1), and thus

X = [ (x_1)_1  ···  (x_1)_j  ···  (x_1)_d  1
         ⋮           ⋮            ⋮       ⋮
      (x_i)_1  ···  (x_i)_j  ···  (x_i)_d  1
         ⋮           ⋮            ⋮       ⋮
      (x_N)_1  ···  (x_N)_j  ···  (x_N)_d  1 ]    (22)

We can expand our quadratic formula!!!

Thus

(y − Xw)^T (y − Xw) = y^T y − w^T X^T y − y^T Xw + w^T X^T Xw    (23)

Deriving with respect to w, setting the result to zero, and assuming that X^T X is invertible, we obtain

w = (X^T X)^{-1} X^T y    (24)

Note: X^T X is always positive semi-definite. If it is also invertible, it is positive definite.

Thus, how do we get the discriminant function? Any ideas?


The Final Discriminant Function

Very simple!!!

g(x) = x^T w = x^T (X^T X)^{-1} X^T y    (25)
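A sketch of Equations (24) and (25) on made-up data, assuming ±1 targets and the augmented data matrix from before; np.linalg.lstsq solves the same least-squares problem more stably than forming (X^T X)^{-1} explicitly.

```python
import numpy as np

# Hypothetical 2-D samples and +/-1 targets (illustration only)
X_raw = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])    # augmented data matrix

# w = (X^T X)^{-1} X^T y, computed via a least-squares solver
w, *_ = np.linalg.lstsq(X, y, rcond=None)

g = lambda x: np.dot(np.r_[1.0, x], w)                  # g(x) = x^T w, Eq. (25)
print(w)
print(g(np.array([2.5, 1.5])), g(np.array([-1.5, -1.5])))   # positive, then negative
```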

Pseudo-inverse of a Matrix

Definition
Suppose that A ∈ R^{m×n} has full column rank, rank(A) = n. We call the matrix

A⁺ = (A^T A)^{-1} A^T

the pseudo-inverse of A.

Reason
A⁺ inverts A on its image.

What?
If w ∈ image(A), then there is some v ∈ R^n such that w = Av. Hence:

A⁺ w = A⁺ Av = (A^T A)^{-1} A^T A v = v


What lives where?

We have
   X ∈ R^{N×(d+1)}
   image(X) = span{X_col_1, ..., X_col_{d+1}}
   x_i ∈ R^d
   w ∈ R^{d+1}
   X_col_i, y ∈ R^N

Basically, y, the list of desired outputs, is being projected onto

span{X_col_1, ..., X_col_{d+1}}    (26)

by the projection operator X(X^T X)^{-1} X^T.


Geometric Interpretation

We have
   1 The image of the mapping w ↦ Xw is a linear subspace of R^N.
   2 As w runs through all points of R^{d+1}, the value Xw runs through all points of the image space image(X) = span{X_col_1, ..., X_col_{d+1}}.
   3 Each w defines one point Xw = Σ_{j=0}^{d} w_j X_col_j.
   4 The least-squares w is the one for which Xw minimizes the distance to y, i.e. it attains d(y, image(X)).


Geometrically

(Figure: y is projected orthogonally onto image(X) by the operator X(X^T X)^{-1} X^T; Xw is the closest point to y in that subspace.)


Page 159: 04 Machine Learning - Supervised Linear Classifier

Multi-Class Solution

What to do?
1   We might reduce the problem to c − 1 two-class problems.
2   We might use c(c − 1)/2 linear discriminants, one for every pair of classes.

However, both options can leave regions of the feature space where the classification is ambiguous.

58 / 85


Page 162: 04 Machine Learning - Supervised Linear Classifier

What to do?

Define c linear discriminant functions

g_i(x) = w_i^T x + w_{i0} for i = 1, ..., c (27)

This is known as a linear machine
Rule: if g_k(x) > g_j(x) for all j ≠ k ⟹ x ∈ ω_k

Nice Properties (they can be proved!!!)
1   Decision regions are singly connected.
2   Decision regions are convex.

59 / 85
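A small sketch of the decision rule of a linear machine (the weight values below are made up purely for illustration): each class gets one discriminant, and a point is assigned to the class with the largest response.

```python
import numpy as np

# Rows are the augmented weight vectors [w_k0, w_k1, ..., w_kd], one per class
W_T = np.array([[ 0.5,  1.0, -0.3],
                [-0.2, -0.8,  0.9],
                [ 0.1,  0.2,  0.4]])   # c = 3 classes, d = 2 features

def classify(x):
    """Return the index k of the class with the largest discriminant g_k(x)."""
    x_aug = np.concatenate(([1.0], x))   # augmented input [1, x_1, ..., x_d]
    g = W_T @ x_aug                      # g_k(x) = w_k^T x + w_k0 for every k
    return int(np.argmax(g))

print(classify(np.array([0.7, -0.1])))
```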


Page 165: 04 Machine Learning - Supervised Linear Classifier

Proof of Properties

Proof

Actually quite simple. Take two points x_A and x_B in the decision region of class ω_k and consider

y = λ x_A + (1 − λ) x_B

with λ ∈ (0, 1).

60 / 85


Page 167: 04 Machine Learning - Supervised Linear Classifier

Proof of Properties

We know that

g_k(y) = w_k^T (λ x_A + (1 − λ) x_B) + w_{k0}
       = λ w_k^T x_A + λ w_{k0} + (1 − λ) w_k^T x_B + (1 − λ) w_{k0}
       = λ g_k(x_A) + (1 − λ) g_k(x_B)
       > λ g_j(x_A) + (1 − λ) g_j(x_B)
       = g_j(λ x_A + (1 − λ) x_B)
       = g_j(y)

for all j ≠ k.

Or... y belongs to the region of class k defined by the rule!!!
The region is convex and singly connected because y is an arbitrary convex combination of two of its points.

61 / 85
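A quick numerical sanity check of this convexity argument (random weights, purely illustrative): pick two points classified into the same class and verify that every point on the segment between them receives the same label.

```python
import numpy as np

rng = np.random.default_rng(1)
c, d = 4, 2
W_T = rng.normal(size=(c, d + 1))    # one random augmented weight vector per class

def label(x):
    return int(np.argmax(W_T @ np.concatenate(([1.0], x))))

# Find two points with the same label, then test the segment between them
xA, xB = rng.normal(size=d), rng.normal(size=d)
while label(xA) != label(xB):
    xA, xB = rng.normal(size=d), rng.normal(size=d)

k = label(xA)
lambdas = np.linspace(0.0, 1.0, 101)
print(all(label(l * xA + (1 - l) * xB) == k for l in lambdas))  # expected: True
```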


Page 175: 04 Machine Learning - Supervised Linear Classifier

However!!!

Not so nice properties!!!
Convex, singly connected decision regions limit the power of the classifier: classes whose true regions are not convex (for example, multimodal classes) cannot be captured by a linear machine.

62 / 85

Page 176: 04 Machine Learning - Supervised Linear Classifier

How do we train this Linear Machine?

We know that each class ω_k is described by

g_k(x) = w_k^T x + w_{k0} where k = 1, ..., c

We then design a single machine

g(x) = W^T x (28)

where x is in augmented form (a leading 1 for the bias term).

63 / 85


Page 178: 04 Machine Learning - Supervised Linear Classifier

Where

We have the following

W^T = [ w_{10}  w_{11}  w_{12}  ···  w_{1d}
        w_{20}  w_{21}  w_{22}  ···  w_{2d}
        w_{30}  w_{31}  w_{32}  ···  w_{3d}
          ⋮       ⋮       ⋮            ⋮
        w_{c0}  w_{c1}  w_{c2}  ···  w_{cd} ]   (29)

i.e. row k holds the augmented weight vector (bias plus weights) of class ω_k.

What about the labels?
OK, we know how to do it with 2 classes. What about many classes?

64 / 85


Page 180: 04 Machine Learning - Supervised Linear Classifier

How do we train this Linear Machine?

Use a vector t_i of dimensionality c to identify the class of each element, e.g. a one-of-c (one-hot) encoding with a 1 at the position of the class of x_i and 0 elsewhere.

We then have the following dataset

{x_i, t_i} for i = 1, 2, ..., N

We build the following matrix of target vectors

T = [ t_1^T
      t_2^T
       ⋮
      t_N^T ]   (30)

65 / 85
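A short sketch of one way to build such a target matrix (assuming the one-of-c encoding mentioned above; the labels and sizes are made up):

```python
import numpy as np

labels = np.array([0, 2, 1, 2, 0])   # class index of each of the N = 5 samples
c = 3                                 # number of classes

# T has one row t_i^T per sample: a 1 in the column of its class, 0 elsewhere
T = np.zeros((labels.size, c))
T[np.arange(labels.size), labels] = 1.0
print(T)
```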


Page 182: 04 Machine Learning - Supervised Linear Classifier

Thus, we create the following Matrix

A matrix containing all the required information

XW − T (31)

Where row i of XW is the vector

[x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c] (32)

Remark: it is the result of multiplying row i of X against W in the product XW.

That row is compared with the vector t_i^T of T by taking the difference

ε_i = [x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c] − t_i^T (33)

66 / 85


Page 185: 04 Machine Learning - Supervised Linear Classifier

What do we want?

We want, for each sample, the quadratic error (1/2)‖ε_i‖^2.

The squared entries of every ε_i are collected in the diagonal of the matrix

(XW − T)^T (XW − T)

so we can use the trace function to obtain the desired total error

J(·) = (1/2) Σ_{i=1}^{N} ‖ε_i‖^2 (34)

67 / 85
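A tiny check of this trace identity with random matrices (shapes chosen only for illustration): the trace of (XW − T)^T (XW − T) equals the sum of the squared row errors ‖ε_i‖^2.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, c = 10, 3, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
W = rng.normal(size=(d + 1, c))
T = rng.normal(size=(N, c))

E = X @ W - T                                   # row i is epsilon_i
total_from_trace = 0.5 * np.trace(E.T @ E)      # J(W) via the trace
total_from_rows = 0.5 * np.sum(np.linalg.norm(E, axis=1) ** 2)
print(np.isclose(total_from_trace, total_from_rows))  # expected: True
```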


Page 188: 04 Machine Learning - Supervised Linear Classifier

Then

The trace allows us to express the total error

J(W) = (1/2) Trace{(XW − T)^T (XW − T)} (35)

Thus, by the same derivative method (∇_W J = X^T (XW − T) = 0), we obtain

W = (X^T X)^{-1} X^T T = X^+ T (36)

68 / 85


Page 190: 04 Machine Learning - Supervised Linear Classifier

How do we train this Linear Machine?

Thus, we obtain the discriminant

g(x) = W^T x = T^T (X^+)^T x (37)

69 / 85
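Putting the pieces together, here is a hedged end-to-end sketch of this least-squares linear machine on synthetic data (the class means and sizes are invented; the pseudo-inverse plays the role of X^+):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 3-class data in R^2
means = np.array([[0, 0], [4, 0], [0, 4]])
X_raw = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in means])
labels = np.repeat(np.arange(3), 50)

# Augmented data matrix X and one-hot target matrix T
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
T = np.zeros((labels.size, 3))
T[np.arange(labels.size), labels] = 1.0

# W = X^+ T  (equation (36)); columns of W are the augmented class weight vectors
W = np.linalg.pinv(X) @ T

# Classify: g(x) = W^T x, take the argmax over classes
pred = np.argmax(X @ W, axis=1)
print("training accuracy:", np.mean(pred == labels))
```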


Page 192: 04 Machine Learning - Supervised Linear Classifier

Issues with Least Squares

Robustness
1   Least squares works only if X has full column rank, i.e. if X^T X is invertible.
2   If X^T X is almost not invertible (ill-conditioned), least squares is numerically unstable.
     Statistical consequence: high variance of the predictions.

Not suited for high-dimensional data
1   Modern problems have many dimensions/features/predictors (possibly thousands).
2   Only a few of these may be important, so the method needs
     some form of feature selection, and
     possibly some type of regularization.

Why?
1   It treats all dimensions equally.
2   Relevant dimensions are averaged with irrelevant ones.

A small illustration of the instability is sketched below.

71 / 85
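A minimal sketch (entirely synthetic data) of how near-collinear columns make X^T X ill-conditioned and the least-squares weights unstable under tiny perturbations of y:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x1 = rng.normal(size=N)
x2 = x1 + 1e-6 * rng.normal(size=N)        # almost a copy of x1: near-collinear column
X = np.column_stack([np.ones(N), x1, x2])  # augmented data matrix

print("condition number of X^T X:", np.linalg.cond(X.T @ X))

y = 2.0 * x1 + rng.normal(scale=0.1, size=N)
w1, *_ = np.linalg.lstsq(X, y, rcond=None)
w2, *_ = np.linalg.lstsq(X, y + 1e-4 * rng.normal(size=N), rcond=None)  # tiny change in y
print("weight change:", np.linalg.norm(w1 - w2))  # can be large despite the tiny perturbation
```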


Page 201: 04 Machine Learning - Supervised Linear Classifier

Issues with Least Squares

Problem with Outliers
[Figure: two panels comparing the least-squares fit with "No Outliers" and with "Outliers".]

72 / 85

Page 202: 04 Machine Learning - Supervised Linear Classifier

Issues with Least Squares

What about the Linear Machine?
Please, run the algorithm and tell me...

73 / 85

Page 203: 04 Machine Learning - Supervised Linear Classifier

What to Do About Numerical Stability?

Regularity
A matrix which is not invertible is also called a singular matrix. A matrix which is invertible (not singular) is called regular.

In computations
Intuitions:
1   A singular matrix maps an entire linear subspace onto a single point.
2   If a matrix maps points that are far away from each other to points that are very close to each other, it almost behaves like a singular matrix.

This mapping behaviour is related to the eigenvalues!!!
Large positive eigenvalues ⇒ the mapping stretches vectors strongly in the corresponding directions.
Small positive eigenvalues ⇒ the mapping almost collapses the corresponding directions (see the sketch below).

74 / 85
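A brief sketch of this intuition with a nearly singular 2×2 symmetric matrix (the numbers are arbitrary): the direction with the tiny eigenvalue is almost annihilated, so two far-apart points along it land almost on top of each other.

```python
import numpy as np

# Symmetric matrix with one large and one tiny eigenvalue
Q, _ = np.linalg.qr(np.array([[1.0, 1.0], [1.0, -2.0]]))   # orthonormal eigenvector basis
A = Q @ np.diag([3.0, 1e-8]) @ Q.T

print("eigenvalues:", np.linalg.eigvalsh(A))

# Two points far apart along the small-eigenvalue direction
p1 = 10.0 * Q[:, 1]
p2 = -10.0 * Q[:, 1]
print("distance before:", np.linalg.norm(p1 - p2))            # 20
print("distance after: ", np.linalg.norm(A @ p1 - A @ p2))    # ~2e-7: almost the same point
```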



Page 209: 04 Machine Learning - Supervised Linear Classifier

What to Do About Numerical Stability?

All this comes from the following statement
A positive semi-definite matrix A is singular ⟺ its smallest eigenvalue is 0.

Consequence for Statistics
If a statistical prediction involves the inverse of an almost-singular matrix, the predictions become unreliable (high variance).

76 / 85


Page 211: 04 Machine Learning - Supervised Linear Classifier

What can be done?

Ridge Regression
Ridge regression is a modification of least squares. It tries to make least squares more robust when X^T X is almost singular.

The solution

w_Ridge = (X^T X + λI)^{-1} X^T y (38)

where λ is a tuning parameter.

Thus, since X^T X is positive semi-definite, we can do the following
Assume that ξ_1, ξ_2, ..., ξ_{d+1} are eigenvectors of X^T X with eigenvalues λ_1, λ_2, ..., λ_{d+1}:

(X^T X + λI) ξ_i = (λ_i + λ) ξ_i (39)

i.e. λ_i + λ is an eigenvalue of (X^T X + λI), so every eigenvalue is shifted away from zero.

77 / 85


Page 214: 04 Machine Learning - Supervised Linear Classifier

What does this mean?

Something Notable
You can control the near-singularity by monitoring the smallest eigenvalue.

Thus
We add an appropriate tuning value λ.

78 / 85


Page 216: 04 Machine Learning - Supervised Linear Classifier

Thus, what do we need to do?

Process
1   Find the eigenvalues of X^T X.
2   If all of them are clearly bigger than zero, we are fine!!!
3   Otherwise, look at the smallest one and tune λ accordingly.
4   Build w_Ridge = (X^T X + λI)^{-1} X^T y.

A possible implementation is sketched below.

79 / 85
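A hedged sketch of this process (the threshold and the choice of λ are illustrative, not prescribed by the slides):

```python
import numpy as np

def ridge_fit(X, y, lam=None, eps=1e-8):
    """Least squares with an optional ridge term chosen from the eigenvalues of X^T X."""
    G = X.T @ X
    eigvals = np.linalg.eigvalsh(G)          # step 1: eigenvalues of X^T X (ascending)
    if lam is None:
        # steps 2-3: if the smallest eigenvalue is too small, add a tuning value
        lam = 0.0 if eigvals[0] > eps * eigvals[-1] else eps * eigvals[-1]
    # step 4: w_Ridge = (X^T X + lam I)^(-1) X^T y, solved without forming the inverse
    return np.linalg.solve(G + lam * np.eye(G.shape[0]), X.T @ y)

# Example with near-collinear columns
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, x1 + 1e-6 * rng.normal(size=100)])
y = 2 * x1 + rng.normal(scale=0.1, size=100)
print(ridge_fit(X, y))
```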


Page 220: 04 Machine Learning - Supervised Linear Classifier

What about Thousands of Features?

There is a technique for that
The Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Robert Tibshirani, which penalizes the L1 norm L1 = Σ_{i=1}^{d} |w_i|.

The least squared error then takes the form

Σ_{i=1}^{N} (y_i − x_i^T w)^2 + λ Σ_{i=1}^{d} |w_i| (40)

However
You have other regularizations, such as L2 = sqrt(Σ_{i=1}^{d} |w_i|^2); a sketch comparing the two follows.

80 / 85
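A brief sketch of the effect of the two penalties using scikit-learn (assuming it is installed; the data and regularization strengths are made up): the L1 penalty drives many weights exactly to zero, while the L2 (ridge) penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
N, d = 100, 20
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                 # only 3 of the 20 features matter
y = X @ w_true + rng.normal(scale=0.5, size=N)

lasso = Lasso(alpha=0.1).fit(X, y)            # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)            # L2 penalty

print("nonzero LASSO weights:", np.sum(lasso.coef_ != 0))   # typically close to 3
print("nonzero ridge weights:", np.sum(ridge.coef_ != 0))   # typically all 20
```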


Page 223: 04 Machine Learning - Supervised Linear Classifier

Graphically

The first area corresponds to the L1 regularization, and the second one?

81 / 85

Page 224: 04 Machine Learning - Supervised Linear Classifier

Graphically

Yes, the circle defined by L2 = sqrt(Σ_{i=1}^{d} |w_i|^2).

82 / 85

Page 225: 04 Machine Learning - Supervised Linear Classifier

The seminal paper by Robert Tibshirani

An initial study of this regularization can be seen in "Regression Shrinkage and Selection via the LASSO" by Robert Tibshirani, 1996.

83 / 85

Page 226: 04 Machine Learning - Supervised Linear Classifier

This is out of the scope of this class

However, it is worth noticing that the most efficient method for solving LASSO problems is
"Pathwise Coordinate Optimization" by Jerome Friedman, Trevor Hastie, Holger Höfling and Robert Tibshirani.

Nevertheless
It will be a great seminar paper!!!

84 / 85


Page 228: 04 Machine Learning - Supervised Linear Classifier

Exercises

Duda and Hart
Chapter 5: 1, 3, 4, 7, 13, 17

Bishop
Chapter 4: 4.1, 4.4, 4.7

Theodoridis
Chapter 3 - Problems: 3.6 (using Python)
Chapter 3 - Computer Experiments: 3.1 (using Python), 3.2 (using Python and Newton's method)

85 / 85
