
Machine Learning in Computer Vision
MVA 2011 - Lecture Notes

Iasonas Kokkinos
Ecole Centrale Paris

[email protected]

October 18, 2011


Chapter 1

Introduction

This is an informal set of notes for the first part of the class. They are by no means complete, and do not substitute attending class. My main intent is to give you a skimmed version of the class slides, which you will hopefully find easier to consult.

These notes were compiled based on several standard textbooks including:

• T. Hastie, R. Tibshirani and J. Friedman, ‘Elements of Statistical Learning’, Springer 2009.

• V. Vapnik, ‘The Nature of Statistical Learning Theory’, Springer 2000.

• C. Bishop, ‘Pattern Recognition and Machine Learning’, Springer 2006.

Apart from these, helpful resources can be found online; in particular Andrew Ng has a highly-recommended set of handouts here:
http://cs229.stanford.edu/materials.html

while the more theoretically inclined can consult Robert Schapire's course:
http://www.cs.princeton.edu/courses/archive/spring08/cos511/schedule.html

and Peter Bartlett's course:
http://www.cs.berkeley.edu/~bartlett/courses/281b-sp08/

for more extensive introductions to learning theory, boosting and SVMs.

As this is the first edition of these lecture notes, there will inevitably be inconsistencies in notation with the slides, typos, etc.; your feedback will be most valuable in improving them for the next year.


Chapter 2

Linear Regression

Assumed form of discriminant function:

f(x, w) = \sum_{j=1}^{M} w_j x_j + w_0 = \sum_{j=0}^{M} w_j x_j, \qquad x_0 = 1    (2.1)

In the more general case:

f(x, w) = \sum_{j=0}^{M} w_j B_j(x)    (2.2)

where B_j(x) can be a nonlinear function of x. We want to minimize the empirical loss (= training loss) of our classifier, quantified in terms of the following criterion:

L(w) = \sum_i l(y^i, f(x^i, w)) = \sum_{i=1}^{N} (y^i - f(x^i; w))^2 = \sum_{i=1}^{N} \Big( y^i - \sum_{j=1}^{M} w_j x^i_j \Big)^2

We introduce vector/matrix notation:

y = [y^1, y^2, \ldots, y^N]^T  (N \times 1), \qquad
X = \begin{pmatrix} x^1_1 & x^1_2 & \ldots & x^1_M \\ \vdots & & & \vdots \\ x^N_1 & x^N_2 & \ldots & x^N_M \end{pmatrix}  (N \times M),

and the cost becomes:

L(w) = (y - Xw)^T (y - Xw) = e^T e,


where e = y - Xw is called the residual vector. This is why the criterion is also known as the residual-sum-of-squares cost.

Optimizing the training criterion gives:

\frac{\partial L(w)}{\partial w} = -2 X^T (y - Xw) = 0
X^T y = X^T X w
w = (X^T X)^{-1} X^T y    (2.3)
\hat{y} = X w = X (X^T X)^{-1} X^T y    (2.4)

2.1 Ridge regression

What happens when the regression coefficients are correlated? As an extreme case, consider x^i_1 = x^i_2, ∀i. Then the matrix X^T X becomes singular (in general, ill-conditioned) and the corresponding coefficients can take on arbitrary values: a large positive value of one coefficient can be canceled by a large negative one.

Therefore we regularize our problem by introducing a penalty C(w) on the expansion coefficients:

L(w, \lambda) = e^T e + \lambda C(w)    (2.5)

where λ determines the tradeoff between the data term e^T e and the regularizer. For the case of ridge regression, we penalize the l_2 norm of the weight vector, i.e. \|w\|_2^2 = \sum_{k=1}^{M} w_k^2 = w^T w. The cost function becomes:

L_{ridge}(w) = e^T e + \lambda w^T w    (2.6)
             = y^T y - 2 w^T X^T y + w^T (X^T X + \lambda I) w

The condition for an optimum of L_{ridge}(w) gives:

w^* = (X^T X + \lambda I)^{-1} X^T y    (2.7)

So the estimated labels will be:

\hat{y} = X w^*    (2.8)
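A minimal sketch of the ridge estimate (2.7), again on synthetic data of my own choosing: with two identical columns the ordinary least-squares system is singular, while the ridge solution stays bounded.

    import numpy as np

    def ridge_fit(X, y, lam):
        # Ridge estimate (2.7): w* = (X^T X + lam*I)^{-1} X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    # Two perfectly correlated columns: X^T X is singular, yet ridge stays well-behaved.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    X = np.column_stack([x1, x1])
    y = 3 * x1 + 0.1 * rng.normal(size=100)
    print(ridge_fit(X, y, lam=1e-2))   # roughly [1.5, 1.5]: the weight is split between the two copies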

2.1.1 Lasso

Instead of the l_2 norm we now penalize the l_1 norm of the vector w:

E_{Lasso}(w, \lambda) = (y - Xw)^T (y - Xw) + \lambda \sum_{k=1}^{M} |w_k|    (2.9)

The derivative of E_{Lasso} w.r.t. w_k at any point (with w_k \ne 0) is given by:

\frac{\partial E_{Lasso}(w)}{\partial w_k} = \lambda \, \mathrm{sign}(w_k) - 2 \sum_{i=1}^{N} (y^i - x^{iT} w) x^i_k

If we use, for instance, gradient descent to minimize E_{Lasso}, the rate at which the penalty pushes w_k towards zero is not affected by the magnitude of w_k, but only by its sign. Note that this is in contrast to the ridge regression case:

\frac{\partial E_{Ridge}(w)}{\partial w_k} = 2 \lambda w_k - 2 \sum_{i=1}^{N} (y^i - x^{iT} w) x^i_k,

where small-in-magnitude values of w_k 'can survive', as they are more slowly suppressed as they decrease. This difference, due to the different norms used, leads to sparser solutions in Lasso.
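A small numeric illustration of this contrast (my own, not from the notes): the l_1 penalty pulls every nonzero coefficient with the same force λ, while the l_2 pull is proportional to the coefficient and fades as it approaches zero.

    import numpy as np

    lam = 0.5
    w = np.array([2.0, 0.1, 0.001])
    grad_l1 = lam * np.sign(w)     # l1 penalty: constant pull of magnitude lam, whatever |w_k| is
    grad_l2 = lam * 2.0 * w        # l2 penalty: pull proportional to w_k, vanishes as w_k -> 0
    print(grad_l1)                 # [0.5 0.5 0.5]
    print(grad_l2)                 # [2.0 0.1 0.001]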

2.2 Selection of λ

If we optimize w.r.t. λ directly, λ = 0 is the obvious, but trivial, solution. This solution will suffer from ill-posedness: as mentioned earlier, the matrix X^T X may be poorly conditioned for correlated X. So the parameters of the model may exhibit high variance, depending on the specific training set used.

What λ does is regularize our problem, i.e. impose some constraints that render it well-posed. We first see why this is so.

2.2.1 Bias-variance decomposition of the error

Assume the actual generation process is

y = g(x) + \epsilon,    (2.10)

with noise of variance \sigma^2, while our model is \hat{y} = \hat{g}(x) = f_w(x), with w = h(\lambda, S) estimated from the training set S. The expected generalization error at a point x_0 is:

Err(x_0) = E[(y - \hat{g}(x_0))^2]    (2.11)
         = E[(y - g(x_0) + g(x_0) - \hat{g}(x_0))^2]
         = E[((y - g(x_0)) + (g(x_0) - E[\hat{g}(x_0)]) + (E[\hat{g}(x_0)] - \hat{g}(x_0)))^2]
         = \sigma^2 + \underbrace{(g(x_0) - E[\hat{g}(x_0)])^2}_{Bias^2} + \underbrace{E[(E[\hat{g}(x_0)] - \hat{g}(x_0))^2]}_{Variance}    (2.12)

(the cross terms vanish since E[\epsilon] = 0 and the test noise is independent of \hat{g}).

The bias is a decreasing function of model complexity, while the variance is an increasing one.

2.2.2 Cross Validation

Setting λ allows us to control the tradeoff between bias and variance. Consider two extreme cases: λ = 0 gives no regularization, an ill-posed problem and high variance; λ = ∞ gives a constant model and high bias. The best value lies somewhere in between, and we use cross-validation to estimate it, as sketched below.
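A sketch of K-fold cross-validation for selecting λ in ridge regression; the function name and the fold mechanics are my own, hypothetical choices.

    import numpy as np

    def cv_ridge(X, y, lambdas, n_folds=5, seed=0):
        # Return the lambda with the smallest average validation error over the folds.
        N = X.shape[0]
        idx = np.random.default_rng(seed).permutation(N)
        folds = np.array_split(idx, n_folds)
        scores = []
        for lam in lambdas:
            errs = []
            for k in range(n_folds):
                val = folds[k]
                trn = np.hstack([folds[j] for j in range(n_folds) if j != k])
                w = np.linalg.solve(X[trn].T @ X[trn] + lam * np.eye(X.shape[1]),
                                    X[trn].T @ y[trn])
                errs.append(np.mean((y[val] - X[val] @ w) ** 2))
            scores.append(np.mean(errs))
        return lambdas[int(np.argmin(scores))]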


2.3 SVD-based interpretations (optional reading)

Consider the following Singular Value Decomposition of the N × M matrix X:¹

X = U D V^T, where U is N × M, D is M × M and V^T is M × M, and

D = \mathrm{diag}(d_1, d_2, \ldots, d_{\min(M,N)}, 0, \ldots, 0), \qquad d_i \ge d_{i+1} \ge 0,

with \max(M - N, 0) trailing zeros, while the columns of U and V are orthogonal and unit-norm. We remind the reader that N is the number of training points and M the dimensionality of the features.

A first thing to note is that if we have zero-mean ('centered') data, X^T X is proportional to the covariance matrix of the data. Moreover we have:

X^T X = V D^T U^T U D V^T = V D^T D V^T = V D^2 V^T,    (2.14)

so the columns of V are the eigenvectors of X^T X, with eigenvalues d_j^2.

Further, we can use the SVD decomposition to interpret certain quantities obtained in the previous sections. First, for (2.4), which gives the estimated output \hat{y} in linear regression, we have:

\hat{y} = X (X^T X)^{-1} X^T y
        = U D V^T (V D^2 V^T)^{-1} V D^T U^T y
        = U (U^T y).

We can interpret this expression as follows:
U: its columns form an orthonormal basis for an M-dimensional subspace of R^N.
C_h = U^T h: gives us the expansion coefficients for the projection of a vector h on this basis.
\tilde{h} = U C_h: is the reconstruction of h on the basis U.

We therefore have that U^T y gives the projection coefficients of y on U, so \hat{y} = U U^T y is the projection of y on the basis U.

For the case of ridge regression, using the expression in (2.8) we get:

\hat{y} = X (X^T X + \lambda I)^{-1} X^T y
        = U D (D^2 + \lambda I)^{-1} D U^T y
        = \sum_{j=1}^{M} u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y

¹This version of the SVD is slightly non-typical; it is the one used in Hastie, Tibshirani and Friedman. The default is for D to be N × M. Typical SVD, for N > M:

X = U D V^T, with U of size N × N, D of size N × M and V^T of size M × M,
D = \begin{bmatrix} \mathrm{diag}(d_1, d_2, \ldots, d_M) \\ 0_{(N-M) \times M} \end{bmatrix}, \qquad d_i \ge d_{i+1} \ge 0    (2.13)


Hence the name 'shrinkage': the expansion coefficients for the reconstruction of \hat{y} on the subspace spanned by U are shrunk by a factor d_j^2 / (d_j^2 + \lambda) < 1 w.r.t. the values used in linear regression.


Chapter 3

Logistic Regression

3.1 Probabilistic transition from linear to logistic regression

Consider that the labels are generated according to the linear model plus Gaussian noise assumption:

P(y^i | x^i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^i - w^T x^i)^2}{2\sigma^2} \right).    (3.1)

We consider as a 'merit function' the likelihood of the data:

C(w) = P(y | X, w) = \prod_i P(y^i | x^i, w),

or, more conveniently, the log-likelihood:

\log C(w) = \log P(y | X, w) = \sum_i \log P(y^i | x^i, w)
          = c - \frac{1}{2\sigma^2} \sum_i (y^i - w^T x^i)^2
          = c - \frac{1}{2\sigma^2} (y - Xw)^T (y - Xw)
          = c - \frac{1}{2\sigma^2} e^T e    (3.2)

So minimizing the least-squares criterion amounts to maximizing the log-likelihood of the labels under the model in (3.1). But this model makes sense only for continuous observations. In classification the observations are discrete (labels). We therefore consider using a Bernoulli distribution instead:

P(y = 1 | x, w) = f(x, w)
P(y = 0 | x, w) = 1 - f(x, w).


We can compactly express these two equations as:

P(y | x) = (f(x, w))^y (1 - f(x, w))^{1-y}    (3.3)

We can now write the (log-)likelihood of the data as:

C(w) = P(y | X, w) = \prod_i P(y^i | x^i, w)

\log C(w) = \log P(y | X, w) = \sum_i \log P(y^i | x^i, w)
          = \sum_i \left[ y^i \log f(x^i, w) + (1 - y^i) \log(1 - f(x^i, w)) \right]    (3.4)

Now consider using the following expression for f(x, w):

f(x, w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}    (3.5)

g(a) = \frac{1}{1 + \exp(-a)} is called the sigmoidal function and has several interesting properties, the most useful ones being:

\frac{dg}{da} = g(a)(1 - g(a)), \qquad 1 - g(a) = g(-a)

Plugging (3.5) into (3.4), this is the criterion we should be maximizing:

\log C(w) = \log P(y | X, w) = \sum_{i=1}^{N} \left[ y^i \log g(w^T x^i) + (1 - y^i) \log(1 - g(w^T x^i)) \right]    (3.6)

A simple approach to maximize it is to use gradient ascent:

w_k^{t+1} = w_k^t + \alpha \left. \frac{\partial \log C(w)}{\partial w_k} \right|_{w^t}    (3.7)

Using the expression in (3.6), we have:

\frac{\partial \log C(w)}{\partial w_k}
 = \sum_{i=1}^{N} \left[ y^i \frac{1}{g(w^T x^i)} \frac{\partial g(w^T x^i)}{\partial w_k} + (1 - y^i) \frac{1}{1 - g(w^T x^i)} \left( -\frac{\partial g(w^T x^i)}{\partial w_k} \right) \right]
 = \sum_{i=1}^{N} \left[ y^i \frac{1}{g(w^T x^i)} - (1 - y^i) \frac{1}{1 - g(w^T x^i)} \right] g(w^T x^i)(1 - g(w^T x^i)) \frac{\partial w^T x^i}{\partial w_k}
 = \sum_{i=1}^{N} \left[ y^i (1 - g(w^T x^i)) - (1 - y^i) g(w^T x^i) \right] x^i_k
 = \sum_{i=1}^{N} \left[ y^i - g(w^T x^i) \right] x^i_k


We can write this more compactly by using vector notation; using the N × 1 vector g with elements

g_i = g(w^T x^i)    (3.8)

and the N × 1-dimensional k-th column X_k of the N × M matrix X, we have:

\frac{\partial \log C(w)}{\partial w_k} = [y - g]^T X_k = X_k^T [y - g]

Stacking the partial derivatives together we form the Jacobian:

J(w) = \frac{d \log C(w)}{dw} = \begin{pmatrix} X_1^T [y - g] \\ \ldots \\ X_M^T [y - g] \end{pmatrix} = X^T [y - g]

The gradient-ascent update for w will be:

w^{t+1} = w^t + \alpha J(w^t)
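A minimal numpy sketch of this gradient-ascent scheme (my own illustration; the step size α would need tuning or scaling by 1/N in practice):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Batch gradient ascent on the log-likelihood (3.6); y has entries in {0, 1}.
    def logreg_gradient_ascent(X, y, alpha=0.1, n_iters=1000):
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            g = sigmoid(X @ w)                 # vector of g(w^T x^i)
            w = w + alpha * (X.T @ (y - g))    # J(w) = X^T (y - g)
        return w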

A more elaborate, but more efficient, technique is the Newton-Raphson method. Consider first Newton's method for finding the roots of a 1-D function f(θ):

\theta' = \theta - \frac{f(\theta)}{f'(\theta)}

If we want to maximize the function, we need to find the roots of its derivative f'; so we apply Newton's method to f'(θ):

\theta' = \theta - \frac{f'(\theta)}{f''(\theta)}

In our case we want to maximize a multidimensional function, which leads us to the Newton-Raphson update:

w' = w - (H(w))^{-1} J(w).    (3.9)

We already saw how to compute the Jacobian. For the (k, j)-th element of the Hessian we have:

H(w)(k, j) = \frac{\partial^2 \log C(w)}{\partial w_k \partial w_j}
           = \frac{\partial \sum_{i=1}^{N} [y^i - g(w^T x^i)] x^i_k}{\partial w_j}
           = -\sum_{i=1}^{N} \frac{\partial g(w^T x^i)}{\partial w_j} x^i_k
           = -\sum_{i=1}^{N} g(w^T x^i)(1 - g(w^T x^i)) x^i_j x^i_k,

i.e. H = -X^T R X, where R is diagonal with R_{i,i} = g(w^T x^i)(1 - g(w^T x^i)).    (3.10)


This Newton-Raphson update is called 'Iteratively Reweighted Least Squares' (IRLS) in statistics; to see why this is so, we rewrite the update as:

w' = w + (X^T R X)^{-1} X^T (y - g)
   = (X^T R X)^{-1} \left[ (X^T R X) w + X^T (y - g) \right]
   = (X^T R X)^{-1} X^T R \left[ X w + R^{-1} (y - g) \right]
   = (X^T R X)^{-1} X^T R z,

where we have introduced the 'adjusted response':

z = X w + R^{-1} (y - g), \qquad z_i = \underbrace{x^{iT} w}_{\text{previous value}} + \frac{1}{g_i (1 - g_i)} (y^i - g_i).    (3.11)

Using this new vector z, we can interpret w' as the solution to a weighted least-squares problem:

w' = \arg\min_w \sum_{i=1}^{N} R_{i,i} (z_i - w^T x^i)^2    (3.12)

Note that apart from z, the matrix R is also updated at each iteration; this explains the 'iteratively re-' prefix in the 'IRLS' name.
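A sketch of the IRLS iteration (3.9)-(3.12) in numpy; the small jitter added to the weighted system is a numerical safeguard of mine, not part of the notes.

    import numpy as np

    def logreg_irls(X, y, n_iters=20, jitter=1e-8):
        N, M = X.shape
        w = np.zeros(M)
        for _ in range(n_iters):
            g = 1.0 / (1.0 + np.exp(-(X @ w)))
            R = g * (1.0 - g)                            # diagonal of R
            z = X @ w + (y - g) / np.maximum(R, 1e-12)   # adjusted response (3.11)
            # weighted least squares: w = (X^T R X)^{-1} X^T R z
            A = X.T @ (R[:, None] * X) + jitter * np.eye(M)
            w = np.linalg.solve(A, X.T @ (R * z))
        return w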

3.2 Log-loss

We simplify the expression for our training criterion to see what logistic regression does. We therefore introduce:

y_\pm = 2 y_b - 1    (3.13)
y_b \in \{0, 1\}    (3.14)
y_\pm \in \{-1, 1\}    (3.15)
y_b = \frac{y_\pm + 1}{2},    (3.16)

where y_b is the label we have been using so far and y_\pm is a new label that makes certain expressions easier.

Specifically, we have:

P(y = 1 | f(x)) = \frac{1}{1 + \exp(-f(x))}    (3.17)
P(y = -1 | f(x)) = 1 - P(y = 1 | f(x))    (3.18)
                 = \frac{1}{\exp(f(x)) + 1}    (3.19)
P(y^i | f(x^i)) = \frac{1}{1 + \exp(-y^i f(x^i))}    (3.20)


So we can write:

L(w) = -\log C(w) = \sum_{i=1}^{N} \log(1 + \exp(-y^i_\pm f(x^i)))    (3.21)

The quantity

l(y^i, x^i) = \log(1 + \exp(-y^i_\pm f(x^i)))    (3.22)

is called the log-loss.


Chapter 4

Support Vector Machines

4.1 Large Margin classifiers

In linear classification we are concerned with the construction of a separating hyperplane: w^T x = 0.

For instance, logistic regression provides an estimate of:

P(y = 1 | x; w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}

and provides a decision rule using:

y^i = \begin{cases} 1 & \text{if } P(y = 1 | x; w) \ge 0.5 \\ 0 & \text{if } P(y = 1 | x; w) < 0.5 \end{cases}    (4.1)

Intuitively, a 'confident' decision is made when |w^T x| is large. So we would like to have:

w^T x^i \gg 0 \text{ if } y^i = 1, \qquad w^T x^i \ll 0 \text{ if } y^i = -1,

or, combined:

y^i (w^T x^i) \gg 0.

Given a training set, we want to find a decision boundary that allows us to make correct and confident predictions on all training examples.

For points near the decision boundary, even though we may be correct, small changes in the training set could cause different classifications; so our classifier will not be confident. If a point is far from the separating hyperplane then we are more confident in our predictions. In other words, we would like all points to be far from (and on the correct side of) the decision plane.


4.1.1 Notation

Instead of

h(x) = g(w^T x)    (4.2)

we now consider classifiers of the form:

h(x) = g(w^T x + b)    (4.3)

(b = w_0), where g(z) = 1 if z > 0 and g(z) = -1 if z < 0. Unlike logistic regression, where g gives a probability estimate, here the classifier directly outputs a decision.

4.1.2 Margins

The 'confidence scores' described above are called functional margins: scaling w, b by two doubles the margin, without essentially changing the classifier.

We therefore now turn to geometric margins, which are defined in terms of the unit-norm vector w/|w|.

We first express any point as:

x = x_\perp + \gamma \frac{w}{|w|},

where x_\perp is the projection of x on the decision plane (w^T x_\perp + b = 0), w/|w| is a unit-norm vector perpendicular to the decision plane, and γ is thus the distance of x from the decision plane. Solving for γ we have:

w^T x = w^T x_\perp + \gamma \frac{w^T w}{|w|}
w^T x = -b + \gamma |w|
\gamma = \frac{w^T x + b}{|w|} = \frac{w^T}{|w|} x + \frac{b}{|w|}

The above analysis works for the case where y = 1. If y = -1 we would like w^T x + b to be negative, and we therefore write:

\gamma^i = y^i \left( \frac{w^T}{|w|} x^i + \frac{b}{|w|} \right)

For a training set \{x^i, y^i\} and a classifier w we define its geometric margin as:

\gamma = \min_i \gamma^i,    (4.4)

which indicates the worst performance of the classifier on the training set.
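As a small illustration (my own; the helper name is hypothetical), the geometric margin (4.4) of a given linear classifier can be computed directly:

    import numpy as np

    def geometric_margin(w, b, X, y):
        # Smallest signed distance y^i (w^T x^i + b) / ||w|| over the training set, eq. (4.4)
        return np.min(y * (X @ w + b) / np.linalg.norm(w))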


4.1.3 Introducing margins in classification

Our task is to classify all points with maximal geometric margin. If we consider the case where the points are separable, we have the following optimization problem:

\max_{\gamma, w, b} \gamma
\text{s.t. } y^i (w^T x^i + b) \ge \gamma, \quad i = 1 \ldots M
\qquad |w| = 1

In words: maximize γ, such that all points have at least γ geometric margin. It is hard to incorporate the |w| = 1 constraint (it is non-convex; for two-dimensional inputs, w should lie on a circle). We therefore divide both sides by |w|:

\max_{\gamma', w, b} \frac{\gamma'}{|w|}
\text{s.t. } y^i (w^T x^i + b) \ge \gamma', \quad i = 1 \ldots M

From the constraint, since w is no longer constrained to have unit norm, we realize that γ' is a functional margin. Suppose we find a combination of w, b that solves the optimization problem. We get a one-dimensional family of solutions by scaling w, b, γ' by a common constant. We remove this ambiguity by considering only those values of w, b for which γ' = 1, which gives us the following problem:

\max_{w, b} \frac{1}{|w|}
\text{s.t. } y^i (w^T x^i + b) \ge 1, \quad i = 1 \ldots M

Equivalently:

\min_{w, b} \frac{1}{2} |w|^2
\text{s.t. } y^i (w^T x^i + b) \ge 1, \quad i = 1 \ldots M

So we end up with a quadratic optimization problem, involving a convex (quadratic) function of w and linear constraints on w.

4.2 Duality

Consider a general constrained optimization problem:

\min_w f(w)
\text{s.t. } h_i(w) = 0, \quad i = 1 \ldots l
\qquad g_i(w) \le 0, \quad i = 1 \ldots m

This is equivalent to the unconstrained optimization:

\min_w f_{uc}(w) = f(w) + \sum_{i=1}^{l} I_0(h_i(w)) + \sum_{i=1}^{m} I_+(g_i(w))    (4.5)


where I_0, I_+ are hard constraint penalties:

I_0(x) = \begin{cases} 0, & x = 0 \\ \infty, & x \ne 0 \end{cases}, \qquad
I_+(x) = \begin{cases} 0, & x \le 0 \\ \infty, & x > 0 \end{cases}

We form a lower bound of this function by replacing each hard constraint with a soft one:

I_0(x_i) \to \lambda_i x_i, \qquad I_+(x_i) \to \mu_i x_i, \quad \mu_i \ge 0,

where λ, μ are penalties for violating the constraints. This gives us the Lagrangian:

L(w, \lambda, \mu) = f(w) + \sum_{i=1}^{l} \lambda_i h_i(w) + \sum_{i=1}^{m} \mu_i g_i(w), \qquad \mu_i \ge 0 \ \forall i

We observe that

f_{uc}(w) = \max_{\lambda, \mu: \mu_i \ge 0} L(w, \lambda, \mu).    (4.6)

Therefore

f(w^*) = p^* = \min_w \max_{\lambda, \mu: \mu_i \ge 0} L(w, \lambda, \mu).    (4.7)

Now consider the Lagrange dual function:

\theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu).    (4.8)

θ(λ, μ) is always a lower bound on the optimal value p^* of the original problem. If the original problem has a (feasible) solution w^*, then:

L(w^*, \lambda, \mu) = f(w^*) + \sum_{i=1}^{l} \lambda_i h_i(w^*) + \sum_{i=1}^{m} \mu_i g_i(w^*)
                     \overset{w^*\ \text{feasible}}{=} f(w^*) + \sum_{i=1}^{l} \lambda_i \cdot 0 + \sum_{i=1}^{m} \mu_i \underbrace{g_i(w^*)}_{\le 0}
                     \overset{\mu_i \ge 0}{\le} f(w^*).

So

\theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu) \le L(w^*, \lambda, \mu) \le f(w^*).    (4.9)

So we have a function θ(λ, μ) that lower-bounds the minimum cost of our original problem. We want to make this bound as tight as possible, i.e. maximize it.


New task (dual problem):

\max_{\lambda, \mu} \theta(\lambda, \mu)
\text{s.t. } \mu_i \ge 0 \ \forall i    (4.10)

In general,

d^* = \max_{\lambda, \mu: \mu_i \ge 0} \theta(\lambda, \mu)
    = \max_{\lambda, \mu: \mu_i \ge 0} \min_w L(w, \lambda, \mu)
    \le \min_w \max_{\lambda, \mu: \mu_i \ge 0} L(w, \lambda, \mu)
    = \min_w f_{uc}(w) = p^*

If f(w) and the g_i(w) are convex, the h_i(w) are affine, and the primal and dual problems are feasible (e.g. Slater's condition holds), then the bound is tight, i.e.

d^* = p^*,

which means that there exist w^*, \lambda^*, \mu^* meeting all constraints while satisfying:

f(w^*) = \theta(\lambda^*, \mu^*)    (4.11)

At that point we have:

f(w^*) = \theta(\lambda^*, \mu^*)
       = \inf_w \left[ f(w) + \sum_{i=1}^{l} \lambda^*_i h_i(w) + \sum_{i=1}^{m} \mu^*_i g_i(w) \right]
       \le f(w^*) + \sum_{i=1}^{l} \lambda^*_i h_i(w^*) + \sum_{i=1}^{m} \mu^*_i g_i(w^*)
       \le f(w^*)

Therefore \mu^*_i g_i(w^*) = 0, \ \forall i ('complementary slackness').

4.2.1 Karush-Kuhn-Tucker Conditions

L(w^*, \lambda^*, \mu^*) is a saddle point (a minimum w.r.t. w, a maximum w.r.t. λ, μ). From the minimum w.r.t. w^* we have that

\nabla f(w^*) + \sum_{i=1}^{l} \lambda_i \nabla h_i(w^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(w^*) = 0


From the maximum w.r.t. λ, μ we have that the constraints are satisfied. Combining with the complementary slackness conditions we have the KKT conditions:

h_i(w^*) = 0
g_i(w^*) \le 0
\mu_i g_i(w^*) = 0
\mu_i \ge 0
\nabla f(w^*) + \sum_{i=1}^{l} \lambda_i \nabla h_i(w^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(w^*) = 0

4.2.2 Dual Problem for Large-Margin training

The primal problem is the objective we set up for max-margin training:

\min_{w, b} \frac{1}{2} |w|^2
\text{s.t. } -y^i (w^T x^i + b) + 1 \le 0, \quad i = 1 \ldots M

Its Lagrangian will be:

L(w, b, \mu) = \frac{1}{2} |w|^2 - \sum_{i=1}^{M} \mu_i [y^i (w^T x^i + b) - 1]    (4.12)

At an optimum we have a minimum w.r.t. w, which gives:

0 = w^* - \sum_{i=1}^{M} \mu_i y^i x^i \quad\Rightarrow\quad w^* = \sum_{i=1}^{M} \mu_i y^i x^i,

and a minimum w.r.t. b, which gives:

0 = \sum_{i=1}^{M} \mu_i y^i

Plugging the optimal values into the Lagrangian we get:

\theta(\mu) = L(w^*, b^*, \mu)
            = \frac{1}{2} |w^*|^2 - \sum_{i=1}^{M} \mu_i [y^i (w^{*T} x^i + b) - 1]
            = \frac{1}{2} \Big( \sum_{i=1}^{M} \mu_i y^i x^i \Big)^T \Big( \sum_{j=1}^{M} \mu_j y^j x^j \Big) - \sum_{i=1}^{M} \mu_i \Big[ y^i \Big( \big( \sum_{j=1}^{M} \mu_j y^j x^j \big)^T x^i + b \Big) - 1 \Big]
            = \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \mu_i \mu_j y^i y^j (x^i)^T (x^j) - b \sum_{i=1}^{M} \mu_i y^i


We have thereby eliminated w^* from the problem, and the last term vanishes since \sum_i \mu_i y^i = 0. For reasons that will become clear later, we write (x^i)^T (x^j) = \langle x^i, x^j \rangle (inner product). The new optimization problem becomes:

\max_\mu \theta(\mu) = \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \mu_i \mu_j y^i y^j \langle x^i, x^j \rangle
\text{s.t. } \mu_i \ge 0 \ \forall i, \qquad \sum_{i=1}^{M} \mu_i y^i = 0

If we solve the problem above w.r.t. μ, we can use μ to determine w^*. One important note is that, since

w^* = \sum_{i=1}^{M} \mu_i y^i x^i,    (4.13)

the final decision boundary orientation is a linear combination of the training points. Moreover, from complementary slackness we have:

\mu_i g_i(w^*) = 0,    (4.14)

where

g_i(w^*) = 1 - y^i (w^T x^i + b) \le 0, \quad \text{i.e.} \quad y^i (w^T x^i + b) \ge 1.    (4.15)

Therefore if \mu_i \ne 0 then y^i (w^T x^i + b) = 1. As w^* is formed from the set of points for which \mu_i \ne 0, this means that only the points that lie on the margin influence the solution. These points are called Support Vectors.

From the set S of support vectors we can determine b^*:

y^i (w^T x^i + b) = 1, \ \forall i \in S
\qquad b^* = \frac{1}{N_S} \sum_{i \in S} \left( y^i - w^T x^i \right)    (4.16)
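A sketch of how, once the dual variables μ are available (from whichever QP solver one prefers), w^* and b^* can be recovered as in (4.13) and (4.16); the function name and the tolerance used to detect support vectors are my own choices.

    import numpy as np

    def primal_from_dual(mu, X, y, tol=1e-8):
        w = (mu * y) @ X                  # w* = sum_i mu_i y^i x^i, eq. (4.13)
        S = mu > tol                      # support vectors: mu_i > 0
        b = np.mean(y[S] - X[S] @ w)      # b* averaged over the support vectors, eq. (4.16)
        return w, b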

4.2.3 Non-separable data

To deal with the case where the data cannot be separated, we introduce positive 'slack variables' \xi_i, i = 1 \ldots M in the constraints:

w^T x^i + b \ge 1 - \xi_i, \quad y^i = 1
w^T x^i + b \le -1 + \xi_i, \quad y^i = -1

or equivalently:

y^i (w^T x^i + b) \ge 1 - \xi_i


If \xi_i > 1 an error occurs, and \sum_i \xi_i is an upper bound on the number of errors. We consider a new problem:

\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i
\text{s.t. } y^i (w^T x^i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,    (4.17)

where C determines the tradeoff between the large-margin term (low \|w\|) and the low-error constraint.

The Lagrangian of this problem is:

L(w, b, \xi, \mu, \nu) = \frac{1}{2} w^T w + C \sum_{i=1}^{M} \xi_i - \sum_{i=1}^{M} \mu_i [y^i (w^T x^i + b) - 1 + \xi_i] - \sum_{i=1}^{M} \nu_i \xi_i,

and its dual becomes:

\max_\mu \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} y^i y^j \mu_i \mu_j \langle x^i, x^j \rangle
\text{s.t. } 0 \le \mu_i \le C, \qquad \sum_{i=1}^{M} \mu_i y^i = 0.

The KKT conditions are similar; an interesting change is that we now have:

C - \mu_i - \nu_i = 0    (4.18)
y^i (\langle x^i, w \rangle + b) - 1 + \xi_i \ge 0    (4.19)
\mu_i [y^i (\langle x^i, w \rangle + b) - 1 + \xi_i] = 0    (4.20)
\nu_i \xi_i = 0    (4.21)
\xi_i \ge 0    (4.22)
\mu_i \ge 0    (4.23)
\nu_i \ge 0    (4.24)

which gives:

y^i (\langle x^i, w \rangle + b) > 1 \ \overset{(4.20),\ \xi_i \ge 0}{\Longrightarrow} \ \mu_i = 0
y^i (\langle x^i, w \rangle + b) < 1 \ \overset{(4.19)}{\Longrightarrow} \ \xi_i > 0 \ \overset{(4.21)}{\Longrightarrow} \ \nu_i = 0 \ \overset{(4.18)}{\Longrightarrow} \ \mu_i = C
y^i (\langle x^i, w \rangle + b) = 1 \ \Longrightarrow \ \mu_i \in [0, C]


\mu_i \ne 0 \ \overset{(4.20)}{\Longrightarrow} \ \xi_i = 1 - y^i (\langle x^i, w \rangle + b)    (4.25)
\mu_i = 0 \ \overset{(4.18)}{\Longrightarrow} \ \nu_i = C \ \overset{(4.21)}{\Longrightarrow} \ \xi_i = 0    (4.26)
\xi_i \ge 0    (4.27)

Combining the above, the optimal slacks are:

\xi_i = \max(0, 1 - y^i (\langle x^i, w \rangle + b))    (4.28)

4.2.4 Kernels

Up to now, the decision was based on the sign of

y = w^T x = \sum_{i=1}^{N} w_i x_i    (4.29)

What if we used nonlinear transformations of the data?

y = w^T \phi(x) = \sum_{k=1}^{K} w_k \phi_k(x)    (4.30)

As a concrete example, consider

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad
\phi(x) = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}

Using such features would allow us to go from linear to quadratic decision boundaries, defined by:

\sum_i w_i \phi_i(x) = 0    (4.31)

We observe that, in the linear case, for any new point x the decision will have the form:

y = \mathrm{sign}(\langle w, x \rangle + b) \overset{\mu'_i = \mu_i y^i}{=} \mathrm{sign}\Big( \sum_{i=1}^{M} \mu'_i \langle x^i, x \rangle + b \Big)    (4.32)

So all we need to know in order to compute the response at a new x is its inner product with the Support Vectors. So if we plug in our new features, we expect to get a classifier of the form:

y = \mathrm{sign}\Big( \sum_{i=1}^{M} \mu'_i \langle \phi(x^i), \phi(x) \rangle + b \Big)    (4.33)


Actually, everything in the original algorithm was written in terms of inner products between input vectors. So we can write everything in terms of inner products of our new features φ; specifically, the dual optimization problem becomes:

\max_\mu \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} y^i y^j \mu_i \mu_j \langle \phi(x^i), \phi(x^j) \rangle
\text{s.t. } 0 \le \mu_i \le C, \qquad \sum_{i=1}^{M} \mu_i y^i = 0

If we now define a kernel between two points x and y as:

K(x, y) = \langle \phi(x), \phi(y) \rangle

and replace \langle x, y \rangle with K(x, y), we can write the same algorithm:

y = \mathrm{sign}\Big( \sum_{i=1}^{M} \mu_i y^i K(x^i, x) + b \Big)    (4.34)

\max_\mu \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} y^i y^j \mu_i \mu_j K(x^i, x^j)
\text{s.t. } 0 \le \mu_i \le C, \qquad \sum_{i=1}^{M} \mu_i y^i = 0

As a concrete example, consider the mapping:

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad
\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}

Then

\langle \phi(x), \phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2
                                 = (x_1 y_1 + x_2 y_2)^2
                                 = (x^T y)^2
                                 = K(x, y)
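A quick numerical check of this identity (my own illustration, on two arbitrary points):

    import numpy as np

    # phi(x) = (x1^2, x2^2, sqrt(2) x1 x2) realises K(x, y) = (x^T y)^2
    phi = lambda v: np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])
    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(y), (x @ y) ** 2)   # both give 1.0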

In the same way, we can use K(x, y) = (x^T y + c)^d to implement a mapping into a \binom{N + d}{d}-dimensional feature space.


Another case is the Gaussian (RBF) kernel:

K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)    (4.35)

It can be shown that this amounts to an inner product in an infinite-dimensional feature space.

K can be a similarity (distance-like) measure among inputs, even though we may not have a straightforward feature-space interpretation; it allows us to apply SVMs to strings, graphs, documents etc. Most importantly, it allows us to work in very high-dimensional spaces without necessarily finding and/or computing the mappings.

Under what conditions does a kernel function represent an inner product in some space? We want to have K_{ij} = K(x^i, x^j) = K(x^j, x^i) = K_{ji}, where K is the Gram matrix; for m points this should hold \forall i = 1 \ldots m, j = 1 \ldots m. Consider now the quantity:

z^T K z = \sum_i \sum_j z_i K_{ij} z_j
        = \sum_i \sum_j z_i \phi^T(x^i) \phi(x^j) z_j
        = \sum_i \sum_j z_i \sum_k \phi_k(x^i) \phi_k(x^j) z_j
        = \sum_k \sum_i \sum_j z_i \phi_k(x^i) \phi_k(x^j) z_j
        = \sum_k \Big( \sum_i z_i \phi_k(x^i) \Big)^2 \ge 0

K is therefore positive semi-definite and symmetric. This leads to the notion of a 'Mercer kernel': consider a kernel K : R^N \times R^N \to R; K represents an inner product if and only if it is symmetric and positive semi-definite (i.e. every Gram matrix it generates is symmetric PSD).
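A small numerical check of this property for the Gaussian kernel of (4.35) on randomly sampled points (my own illustration; the tolerance guards against round-off):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * 1.0 ** 2))          # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)   # symmetric and PSD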

4.3 Hinge vs. Log-loss

Consider the SVM training criterion:

L(w) = \sum_{i=1}^{m} \xi_i + \frac{1}{2C} w^T w
     = \sum_{i=1}^{m} [1 - y^i (w^T x^i)]_+ + \lambda w^T w,    (4.36)

where [1 - y^i t^i]_+, with t^i = w^T x^i, is the hinge loss.


Compared to regularized logistic regression,

L(w) = \sum_{i=1}^{m} \log(1 + \exp(-y^i (w^T x^i))) + \lambda w^T w,

the hinge loss is more abrupt: as soon as y^i (w^T x^i) > 1, the i-th point is declared 'well-classified' and no longer contributes to the solution (i.e. it is not a support vector).
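A small numerical comparison of the two losses as a function of the margin y^i (w^T x^i) (my own illustration):

    import numpy as np

    margins = np.linspace(-2, 2, 9)
    hinge = np.maximum(0.0, 1.0 - margins)
    logloss = np.log(1.0 + np.exp(-margins))
    for m, h, l in zip(margins, hinge, logloss):
        print(f"{m:+.1f}  hinge={h:.3f}  log-loss={l:.3f}")   # hinge is exactly 0 once the margin exceeds 1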


Chapter 5

Adaboost

In many cases it is easy to find rules of thumb that are often correct, but hard to find a single highly accurate prediction rule.

Boosting is a general method for converting a set of rules of thumb into a highly accurate prediction rule. Assuming that a given weak learning algorithm can consistently find classifiers (rules of thumb) at least slightly better than random, say with 51% accuracy in a two-class setting, then given sufficient data a boosting algorithm can provably construct a single classifier with very high accuracy.

5.1 AdaBoost

Adaboost is a boosting algorithm that provides justifiable answers to the following questions:

• First, how to choose examples on each round? Adaboost concentrates on the hardest examples, i.e. those most often misclassified by the previous rules of thumb.

• Second, how to combine rules of thumb into a single prediction rule? Adaboost proposes to take a weighted majority vote of these rules of thumb.

Specifically, we consider that we are provided with a training set (x^i, y^i), x^i \in X, y^i \in \{-1, 1\}, i = 1, \ldots, N. Adaboost maintains a probability density on these data which is initially uniform, and along the way places larger weight on harder examples. At each round a weak classifier h_t (a 'rule of thumb') is identified that has the best performance on the weighted data, and is used to update (i) the classifier's expression and (ii) the probability density on the data.


Specifically, Adaboost can be described in terms of the following algorithm:

Initially, set D_1(i) = \frac{1}{N}, \ \forall i.
For t = 1 \ldots T:

• Find the weak classifier h_t : X \to \{-1, 1\} with the smallest weighted error

\epsilon_t = \frac{\sum_{i=1}^{N} D^i_t [y^i \ne h_t(x^i)]}{\sum_i D^i_t}    (5.1)

• Set \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}

• Update the distribution

D^i_{t+1} = \frac{D^i_t}{Z_t} \times \begin{cases} \exp(-\alpha_t), & \text{if } y^i = h_t(x^i) \\ \exp(\alpha_t), & \text{if } y^i \ne h_t(x^i) \end{cases}
          = \frac{D^i_t}{Z_t} \exp(-\alpha_t y^i h_t(x^i)),

where, to guarantee that D_{t+1} remains a proper density, we normalize by:

Z_t = \sum_i D^i_t \exp(-\alpha_t y^i h_t(x^i))

• Combine the classifier results (weighted vote):

f(x) = \sum_t \alpha_t h_t(x)

• Output the final classifier:

H(x) = \mathrm{sign}(f(x))    (5.2)
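A compact sketch of this algorithm with axis-aligned decision stumps as weak learners (my own illustration; the stump search and all function names are hypothetical choices, and labels are assumed to be in {-1, +1}):

    import numpy as np

    def fit_stump(X, y, D):
        # Exhaustive search over feature, threshold and polarity for the lowest weighted error.
        best = (None, np.inf)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(X[:, j] - thr + 1e-12)
                    err = D @ (pred != y)
                    if err < best[1]:
                        best = ((j, thr, s), err)
        return best

    def adaboost(X, y, T=20):
        N = len(y)
        D = np.full(N, 1.0 / N)
        model = []
        for _ in range(T):
            (j, thr, s), eps = fit_stump(X, y, D)
            alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
            pred = s * np.sign(X[:, j] - thr + 1e-12)
            D = D * np.exp(-alpha * y * pred)   # reweight, then renormalize (divide by Z_t)
            D /= D.sum()
            model.append((alpha, j, thr, s))
        return model

    def predict(model, X):
        f = sum(a * s * np.sign(X[:, j] - thr + 1e-12) for a, j, thr, s in model)
        return np.sign(f)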

5.1.1 Training Error Analysis

We first introduce the edge γ_t of the weak learner at the t-th round:

\epsilon_t = \frac{1}{2} - \underbrace{\gamma_t}_{\text{Edge}}

We will prove that:

\underbrace{\sum_{i=1}^{N} \frac{1}{N} [y^i \ne H(x^i)]}_{\text{Final Training Error}} \le \prod_{t=1}^{T} \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le \exp\Big( -2 \sum_{t=1}^{T} \gamma_t^2 \Big)


The challenge lies in proving the first inequality; the second comes from the definition of the edge, while the third from the fact that \sqrt{1 - 2x} \le \exp(-x), x \in [0, \frac{1}{2}]. We have required that the weak learner is always correct more than half of the time on the training data, i.e. that \gamma_t > 0 \ \forall t; so if \gamma_t > \gamma > 0 \ \forall t we will then have:

\sum_{i=1}^{N} \frac{1}{N} [y^i \ne H(x^i)] \le \exp(-2T\gamma^2),    (5.3)

which means that the number of training errors will be exponentially decreasing in time.

First, by recursing, we have that:

D^i_{t+1} = \frac{D^i_t}{Z_t} \exp(-\alpha_t y^i h_t(x^i))
          = \frac{D^i_{t-1}}{Z_{t-1} Z_t} \exp\big( -[\alpha_t y^i h_t(x^i) + \alpha_{t-1} y^i h_{t-1}(x^i)] \big)
          = \ldots    (5.4)
          = \frac{1}{N} \frac{\exp\big( -y^i \sum_{k=1}^{t} \alpha_k h_k(x^i) \big)}{\prod_{k=1}^{t} Z_k},

or:

D^i_{T+1} = \frac{1}{N} \frac{\exp(-y^i f(x^i))}{\prod_{k=1}^{T} Z_k},    (5.5)

which expresses the distribution on the data in terms of the classifier's score on them. We use this result to obtain:

\sum_{i=1}^{N} \frac{1}{N} [y^i \ne H(x^i)] \overset{H(x) = \mathrm{sign} f(x)}{=} \frac{1}{N} \sum_i \begin{cases} 1 & y^i f(x^i) \le 0 \\ 0 & y^i f(x^i) > 0 \end{cases}
 \le \sum_{i=1}^{N} \frac{1}{N} \exp(-y^i f(x^i))
 = \sum_{i=1}^{N} D^i_{T+1} \prod_{k=1}^{T} Z_k
 = \prod_{k=1}^{T} Z_k,

which relates the number of errors to the normalizing constants Z_k at each round.


We set out to prove that:

\underbrace{\sum_{i=1}^{N} \frac{1}{N} [y^i \ne H(x^i)]}_{\text{Final Training Error}} \le \prod_{t=1}^{T} \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right],    (5.6)

so, given the last result, all we need to prove is that:

2 \sqrt{\epsilon_t (1 - \epsilon_t)} = Z_t

The key to this is the choice we made for \alpha_t:

\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}    (5.7)

Specifically, we have:

Z_t = \sum_{i=1}^{N} D^i_t \exp(-\alpha_t y^i h_t(x^i))
    = \sum_{i: y^i \ne h_t(x^i)} D^i_t \exp(\alpha_t) + \sum_{i: y^i = h_t(x^i)} D^i_t \exp(-\alpha_t)
    = \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t)
    \overset{\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}}{=} \epsilon_t \frac{\sqrt{1 - \epsilon_t}}{\sqrt{\epsilon_t}} + (1 - \epsilon_t) \frac{\sqrt{\epsilon_t}}{\sqrt{1 - \epsilon_t}}
    = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}

This concludes the proof.

5.1.2 Adaboost as coordinate descent

We demonstrated that boosting optimizes the following upper bound on the misclassification cost:

\sum_{i=1}^{N} \frac{1}{N} [y^i \ne H(x^i)] \le \sum_{i=1}^{N} \frac{1}{N} \exp(-y^i f(x^i))    (5.8)
 = \prod_{t=1}^{T} Z_t
 = \prod_{t=1}^{T} \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right]    (5.9)


Actually, the choice of \alpha_t is such that Z_t is minimized at each iteration:

\frac{\partial \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right]}{\partial \alpha_t} = 0
\epsilon_t \exp(\alpha_t) - (1 - \epsilon_t) \exp(-\alpha_t) = 0
\epsilon_t \exp(2\alpha_t) = 1 - \epsilon_t
\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t},

which gives

\epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}    (5.10)

So choosing the best (in the sense of minimizing Z_t, and hence (5.8)) value of \alpha_t tells us that Z_t will equal 2\sqrt{\epsilon_t(1 - \epsilon_t)}, where \epsilon_t < 1/2 is the weighted error of the weak learner. Moreover, \sqrt{\epsilon_t(1 - \epsilon_t)} is increasing in \epsilon_t on [0, 1/2], so finding the weak learner with the minimal \epsilon_t guarantees that we most quickly decrease (5.8).

We can conceptually describe what this amounts to as follows: we have at our disposal the responses of all weak learners (an infinite number of functions in general), and consider combining them with a weight vector, initially set to zero. In this setting, the Adaboost algorithm can be interpreted as coordinate descent (an optimization technique that changes a single dimension of the optimized vector at each iteration): we pick the 'dimension' (weak learner h_t) and 'step' (voting strength \alpha_t) that result in the sharpest decrease in the cost.

So we can see Adaboost as a greedy, but particularly effective, optimization procedure.

5.1.3 Adaboost as building an additive model

A similar interpretation of Adaboost is in terms of fitting an additive model, by finding at each step the optimal perturbation of the previously constructed decision rule. Specifically, consider that we have f(x) at round t and we want to find how to optimally perturb it by:

f'(x) = f(x) + \alpha h(x),    (5.11)


where h(x) is a weak learner and α is the perturbation scale. To consider the impact that this perturbation will have on the cost we use a Taylor expansion as follows:

\sum_{i=1}^{N} \exp(-y^i f'(x^i)) = \sum_{i=1}^{N} \exp(-y^i [f(x^i) + \alpha h(x^i)])
 \simeq \sum_{i=1}^{N} \exp(-y^i f(x^i)) \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} (y^i)^2 h^2(x^i) \right]
 = \sum_{i=1}^{N} \exp(-y^i f(x^i)) \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right]
 = Z \sum_{i=1}^{N} \underbrace{\frac{\exp(-y^i f(x^i))}{Z}}_{D^i} \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right],

where Z = \sum_{i=1}^{N} \exp(-y^i f(x^i)). So our goal is to minimize:

\sum_{i=1}^{N} D^i \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right]    (5.12)

Finding the optimal h amounts to solving:

h = \arg\max_h \sum_{i=1}^{N} D^i y^i h(x^i),    (5.13)

and since \sum_{i=1}^{N} D^i y^i h(x^i) = 1 - 2\epsilon, where ϵ is the weighted error of h under D, picking the classifier with minimal ϵ amounts to finding the optimal h. When optimizing w.r.t. α (on the exact, rather than the approximated, cost) we get:

\alpha = \arg\min_\alpha \sum_{i=1}^{N} D^i \exp(-\alpha y^i h(x^i)) = \arg\min_\alpha \left[ \epsilon \exp(\alpha) + (1 - \epsilon) \exp(-\alpha) \right]    (5.14)

So, once more, we recover the Adaboost algorithm.

5.1.4 Adaboost as a method for estimating the posterior

Consider the expectation of the exponential loss:

E_{X,Y} \exp(-y f(x)) = \int_x \sum_y \exp(-y f(x)) P(x, y) \, dx
                      = \int_x \sum_y \exp(-y f(x)) P(y | x) P(x) \, dx
                      = \int_x \left[ P(y = 1 | x) \exp(-f(x)) + P(y = -1 | x) \exp(f(x)) \right] P(x) \, dx


If a function f(x) were to minimize this cost, at each point x it should satisfy

f(x) = \arg\min_g \left[ P(y = 1 | x) \exp(-g) + P(y = -1 | x) \exp(g) \right],

or

f(x) = \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}, \qquad P(y = 1 | x) = \frac{1}{1 + \exp(-2 f(x))}.

In Adaboost we work with functions that are expressed as sums of weak learners. So the optimal f(x) will not necessarily be given by the expression \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}, but rather will be an approximation to it formed by a linear combination of weak learners.

5.2 Learning Theory

For pattern recognition we are interested in:

• Small sample analysis.

• Bounds on worst performance.

Hoeffding inequality: let Z_1, \ldots, Z_m be i.i.d., drawn from a Bernoulli distribution: P(Z_i) = \phi^{Z_i} (1 - \phi)^{1 - Z_i}. Consider their mean \hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i. Then for any γ > 0,

P(|\phi - \hat{\phi}| > \gamma) \le 2 \exp(-2\gamma^2 m)

On a training set \{x^i, y^i\}, i = 1 \ldots m, consider the training error of a hypothesis h:

\hat{e}(h) = \frac{1}{m} \sum_{i=1}^{m} [h(x^i) \ne y^i]

and the generalization error:

e(h) = P_{(x,y) \sim D}(h(x) \ne y).

How different will \hat{e} be from e if the training set is drawn from D?

Consider a hypothesis class: the set of all classifiers considered, e.g. for linear classification, all linear classifiers. Consider that we have a finite class,

H = \{h_1, \ldots, h_K\},    (5.15)

e.g. a small set of decision stumps for a few thresholds, with no parameters to estimate. Take any of these classifiers, h_i. Consider the Bernoulli variable Z generated by sampling (x, y) \sim D and indicating whether h_i(x) \ne y. We can write the empirical error in terms of this Bernoulli variable:

\hat{e}(h_i) = \frac{1}{m} \sum_{j=1}^{m} Z_j


The mean of this variable is e(h_i), i.e. the generalization error of the i-th classifier. We then have, from the Hoeffding inequality:

P(|e(h_i) - \hat{e}(h_i)| > \gamma) \le 2 \exp(-2\gamma^2 m)

For large m, the training error will be close to the generalization error. Now consider all classifiers in the class:

P(\exists h \in H : |e(h) - \hat{e}(h)| > \gamma) \le \sum_{i=1}^{K} P(|e(h_i) - \hat{e}(h_i)| > \gamma)
 \le \sum_{k=1}^{K} 2 \exp(-2\gamma^2 m)
 = 2 K \exp(-2\gamma^2 m)

Here the first inequality is the union bound:

P(A_1 \cup A_2 \cup \ldots \cup A_K) \le P(A_1) + \ldots + P(A_K)

Suppose we have m training points, K hypotheses, and want to find a γ such that P(|e - \hat{e}| > \gamma) \le \delta. Plugging into the above equation we have \gamma = \sqrt{\frac{1}{2m} \log \frac{2K}{\delta}}. So with probability 1 - δ we have, for all h \in H:

|e(h) - \hat{e}(h)| \le \sqrt{\frac{1}{2m} \log \frac{2K}{\delta}}    (5.16)

What does this mean? Consider h^* to be the best possible hypothesis in terms of generalization error, and \hat{h} the best in terms of training error. We then have (the middle step uses the fact that \hat{h} minimizes the training error):

e(\hat{h}) \le \hat{e}(\hat{h}) + \gamma
           \le \hat{e}(h^*) + \gamma
           \le e(h^*) + 2\gamma    (5.17)

The generalization error of the chosen hypothesis is not much larger than that of the optimal one.

Bias/Variance: if we have a small hypothesis set, we may have a large e(\hat{h}), but it will not differ much from e(h^*). If we have a large hypothesis set, e(h^*) becomes smaller, but γ increases.
