
Midterm review

CS 446

1. Lecture review

(Lec1.) Basic setting: supervised learning

Training data: labeled examples

(x1, y1), (x2, y2), . . . , (xn, yn)

where

- each input x_i is a machine-readable description of an instance (e.g., image, sentence), and

- each corresponding label y_i is an annotation relevant to the task—typically not easy to obtain automatically.

Goal: learn a function f from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs. (Note: zero training error is easy; test/population error is what matters.)

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label.]

1 / 61


(Lec2.) k-nearest neighbors classifier

Given: labeled examples D := {(x_i, y_i)}_{i=1}^n.

Predictor: f_{D,k} : X → Y. On input x,

1. Find the k points x_{i_1}, x_{i_2}, ..., x_{i_k} among {x_i}_{i=1}^n "closest" to x (the k nearest neighbors).

2. Return the plurality label among y_{i_1}, y_{i_2}, ..., y_{i_k}.

(Break ties in both steps arbitrarily.)

2 / 61
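Not part of the original slides: a minimal NumPy sketch of this predictor, to make the two steps concrete. The Euclidean distance and the toy data are assumptions chosen purely for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the plurality label among the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # prints 1 for this toy data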

(Lec2.) Choosing k

The hold-out set approach

1. Pick a subset V ⊂ S (hold-out set, a.k.a. validation set).

2. For each k ∈ {1, 3, 5, . . .}:

   - Construct the k-NN classifier f_{S\V,k} using S \ V.

   - Compute the error rate of f_{S\V,k} on V (the "hold-out error rate").

3. Pick the k that gives the smallest hold-out error rate.

(There are many other approaches.)

3 / 61
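A minimal sketch of this hold-out procedure, reusing the hypothetical knn_predict helper from the previous snippet; the split fraction and the candidate values of k are arbitrary choices, not recommendations from the lecture.

import numpy as np

def holdout_error(X_fit, y_fit, X_val, y_val, k):
    """Error rate on the hold-out set V of the k-NN classifier built from S \\ V."""
    preds = np.array([knn_predict(X_fit, y_fit, x, k) for x in X_val])
    return np.mean(preds != y_val)

def choose_k(X, y, val_frac=0.25, ks=(1, 3, 5, 7)):
    """Hold-out approach: split off a validation set, return the k with smallest hold-out error."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    val, fit = idx[:n_val], idx[n_val:]
    errors = {k: holdout_error(X[fit], y[fit], X[val], y[val], k) for k in ks}
    return min(errors, key=errors.get)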


(Lec2.) Decision trees

Directly optimize tree structure for good classification.

A decision tree is a function f : X → Y, represented by a binary tree in which:

- Each internal node is associated with a splitting rule g : X → {0, 1}.

- Each leaf node is associated with a label y ∈ Y.

When X = R^d, typically one only considers splitting rules of the form

g(x) = 1{x_i > t}

for some i ∈ [d] and t ∈ R. These are called axis-aligned or coordinate splits.

(Notation: [d] := {1, 2, . . . , d}.)

[Example tree: the root tests x_1 > 1.7; one child is the leaf y = 1, the other tests x_2 > 2.8, whose children are the leaves y = 2 and y = 3.]

4 / 61



(Lec2.) Decision tree example

[Scatter plot of the iris data: x-axis is sepal length/width, y-axis is petal length/width, with the axis-aligned splits below overlaid.]

Classifying irises by sepal and petal measurements:

- X = R^2, Y = {1, 2, 3}

- x_1 = ratio of sepal length to width

- x_2 = ratio of petal length to width

[Learned tree: the root tests x_1 > 1.7; one child is the leaf y = 1, the other tests x_2 > 2.8, whose children are the leaves y = 2 and y = 3.]

5 / 61

(Lec2.) Nearest neighbors and decision trees

Today we covered two standard machine learning methods.

[Scatter plot of example data in the (x_1, x_2) plane.]

Nearest neighbors. Training/fitting: memorize the data. Testing/predicting: find the k closest memorized points, return the plurality label. Overfitting? Vary k.

Decision trees. Training/fitting: greedily partition the space, reducing "uncertainty". Testing/predicting: traverse the tree, output the leaf label. Overfitting? Limit or prune the tree.

Note: both methods can output real numbers (regression, not classification); return the median/mean of the neighbors, or of the points reaching the leaf.

6 / 61

(Lec3-4.) ERM setup for least squares.

- Predictors/model: f(x) = w^T x; a linear predictor/regressor. (For linear classification: x ↦ sgn(w^T x).)

- Loss/penalty: the least squares loss

  ℓ_ls(ŷ, y) = ℓ_ls(y, ŷ) = (ŷ − y)^2.

  (Some conventions scale this by 1/2.)

- Goal: minimize the least squares empirical risk

  R_ls(f) = (1/n) ∑_{i=1}^n ℓ_ls(y_i, f(x_i)) = (1/n) ∑_{i=1}^n (y_i − f(x_i))^2.

- Specifically, we choose w ∈ R^d according to

  arg min_{w ∈ R^d} R_ls(x ↦ w^T x) = arg min_{w ∈ R^d} (1/n) ∑_{i=1}^n (y_i − w^T x_i)^2.

- More generally, this is the ERM approach: pick a model and minimize the empirical risk over the model parameters.

7 / 61

(Lec3-4.) ERM in general

- Pick a family of models/predictors F. (For today, we use linear predictors.)

- Pick a loss function ℓ. (For today, we chose the squared loss.)

- Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in PyTorch: just pick a model, a loss, and an optimizer, and tell it to minimize.

8 / 61


(Lec3-4.) Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [x_1^T; … ; x_n^T],   b := (1/√n) [y_1; … ; y_n],

i.e., the i-th row of A is x_i^T/√n and the i-th entry of b is y_i/√n.

We can write the empirical risk as

R(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to (A^T A) w = A^T b,

a system of linear equations called the normal equations.

In an upcoming lecture we'll prove that every critical point of R is a minimizer of R.

9 / 61
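A small NumPy illustration, not from the lecture: solving the normal equations on synthetic data, with np.linalg.lstsq (which minimizes ‖Aw − b‖^2 directly) used only as a cross-check.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Build A and b with the 1/sqrt(n) scaling from the slide (it cancels in the solution).
A = X / np.sqrt(n)
b = y / np.sqrt(n)

# Solve the normal equations (A^T A) w = A^T b.
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Equivalent, numerically preferable route.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True (here A^T A is invertible)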


(Lec3-4.) Full (factorization) SVD (new slide)

Given M ∈ R^{n×d}, let M = USV^T denote the singular value decomposition (SVD), where

- U ∈ R^{n×n} is orthonormal, thus U^T U = U U^T = I,

- V ∈ R^{d×d} is orthonormal, thus V^T V = V V^T = I,

- S ∈ R^{n×d} has singular values s_1 ≥ s_2 ≥ · · · ≥ s_{min{n,d}} along the diagonal and zeros elsewhere, where the number of positive singular values equals the rank of M.

Some facts:

- The SVD is not unique when the singular values are not distinct; e.g., we can write I = U I U^T where U is any orthonormal matrix.

- The pseudoinverse S^+ ∈ R^{d×n} of S is obtained by starting with S^T and taking the reciprocal of each positive entry.

- The pseudoinverse of M is M^+ = V S^+ U^T.

- If M^{−1} exists, then M^{−1} = M^+.

10 / 61
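A quick NumPy check, added here for illustration, of the factorization and pseudoinverse facts above on a random matrix.

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))

# Full SVD: U is 5x5, Vt is 3x3, s holds the singular values (descending).
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Rebuild the 5x3 "S" matrix and check M = U S V^T.
S = np.zeros((5, 3))
S[:3, :3] = np.diag(s)
print(np.allclose(M, U @ S @ Vt))          # True

# Pseudoinverse via the SVD, compared against NumPy's pinv.
S_plus = np.zeros((3, 5))
S_plus[:3, :3] = np.diag(1.0 / s)          # reciprocal of the positive singular values
print(np.allclose(Vt.T @ S_plus @ U.T, np.linalg.pinv(M)))  # True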

(Lec3-4.) Thin (decomposition) SVD (new slide)

Given M ∈ R^{n×d}, (s, u, v) is a singular value with corresponding left and right singular vectors if Mv = su and M^T u = sv.

The thin SVD of M is M = ∑_{i=1}^r s_i u_i v_i^T, where r is the rank of M, and

- left singular vectors (u_1, . . . , u_r) are orthonormal (but we might have r < min{n, d}!) and span the column space of M,

- right singular vectors (v_1, . . . , v_r) are orthonormal (but we might have r < min{n, d}!) and span the row space of M,

- singular values s_1 ≥ · · · ≥ s_r > 0.

Some facts:

- Pseudoinverse M^+ = ∑_{i=1}^r (1/s_i) v_i u_i^T.

- (u_i)_{i=1}^r span t…

11 / 61

(Lec3-4.) SVD and least squares

Recall: we’d like to find w such that

A^T A w = A^T b.

If w = A^+ b, then

A^T A w = (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r s_i u_i v_i^T)(∑_{i=1}^r (1/s_i) v_i u_i^T) b
        = (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r u_i u_i^T) b = A^T b.

Henceforth, define w_ols := A^+ b as the OLS solution. (OLS = "ordinary least squares".)

Note: in general, A A^+ = ∑_{i=1}^r u_i u_i^T ≠ I.

12 / 61

(Lec3-4.) Normal equations imply optimality

Consider w with A^T A w = A^T y, and any w′; then

‖Aw′ − y‖^2 = ‖Aw′ − Aw + Aw − y‖^2
            = ‖Aw′ − Aw‖^2 + 2(Aw′ − Aw)^T(Aw − y) + ‖Aw − y‖^2.

Since

(Aw′ − Aw)^T(Aw − y) = (w′ − w)^T(A^T A w − A^T y) = 0,

then ‖Aw′ − y‖^2 = ‖Aw′ − Aw‖^2 + ‖Aw − y‖^2 ≥ ‖Aw − y‖^2. This means w is optimal.

Moreover, writing A = ∑_{i=1}^r s_i u_i v_i^T,

‖Aw′ − Aw‖^2 = (w′ − w)^T (A^T A)(w′ − w) = (w′ − w)^T (∑_{i=1}^r s_i^2 v_i v_i^T)(w′ − w),

so w′ is also optimal iff w′ − w is in the right nullspace of A.

(We’ll revisit all this with convexity later.)

13 / 61


(Lec3-4.) Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find a minimizer of

R(w) + λ‖w‖_2^2

over w ∈ R^d.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

- This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)

  Explicit solution: (A^T A + λI)^{−1} A^T b.

- The parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data-fitting term R(w).

- Choose λ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)

Note: another popular regularizer for linear regression is ℓ_1.

14 / 61
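An illustrative sketch, not from the slides: the ridge closed form (A^T A + λI)^{−1} A^T b in NumPy, exercised in an n < d case where λ > 0 makes the solution unique. The data is random.

import numpy as np

def ridge(A, b, lam):
    """Ridge regression: minimize ||Aw - b||^2 + lam * ||w||^2 via the closed form."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Even with n < d (more features than examples), lam > 0 gives a unique solution.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))
b = rng.normal(size=5)
w = ridge(A, b, lam=0.1)
print(w.shape)  # (20,)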


(Lec5-6.) Geometry of linear classifiers

[Figure: a hyperplane H through the origin with normal vector w, drawn in the (x_1, x_2) plane.]

A hyperplane in R^d is a linear subspace of dimension d − 1.

- A hyperplane in R^2 is a line.

- A hyperplane in R^3 is a plane.

- As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be specified by a (non-zero) normal vector w ∈ R^d.

The hyperplane with normal vector w is the set of points orthogonal to w:

H = {x ∈ R^d : x^T w = 0}.

Given w and its corresponding H: H separates the points labeled positive {x : w^T x > 0} from those labeled negative {x : w^T x < 0}.

15 / 61

(Lec5-6.) Classification with a hyperplane

[Figure: hyperplane H with normal w; a point x projects onto span{w} with coordinate ‖x‖_2 · cos θ.]

The projection of x onto span{w} (a line) has coordinate

‖x‖_2 · cos(θ),

where

cos θ = x^T w / (‖w‖_2 ‖x‖_2).

(The distance to the hyperplane is ‖x‖_2 · | cos(θ)|.)

The decision boundary is the hyperplane (oriented by w):

x^T w > 0 ⇐⇒ ‖x‖_2 · cos(θ) > 0 ⇐⇒ x is on the same side of H as w.

What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?

16 / 61


(Lec5-6.) Linear separability

Is it always possible to find w with sign(w^T x_i) = y_i? Is it always possible to find a hyperplane separating the data? (Appending a 1 means it need not go through the origin.)

[Two scatter plots: a linearly separable dataset, and a dataset that is not linearly separable.]

17 / 61

(Lec5-6.) Cauchy-Schwarz (new slide)

Cauchy-Schwarz inequality. |a^T b| ≤ ‖a‖ · ‖b‖.

Proof. If ‖a‖ = ‖b‖,

0 ≤ ‖a − b‖^2 = ‖a‖^2 − 2 a^T b + ‖b‖^2 = 2‖a‖ · ‖b‖ − 2 a^T b,

which rearranges to give a^T b ≤ ‖a‖ · ‖b‖.

For the case ‖a‖ < ‖b‖, apply the preceding to the rescaled pair (a, ‖a‖ b / ‖b‖), which have equal norms:

a^T b = (‖b‖/‖a‖) · ( a^T ( ‖a‖ b / ‖b‖ ) ) ≤ (‖b‖/‖a‖) · ‖a‖^2 = ‖a‖ · ‖b‖.

For the absolute value, apply the preceding to (a, −b). □

18 / 61


(Lec5-6.) Logistic loss

Let's state our classification goal with a generic margin loss ℓ:

R_ℓ(w) = (1/n) ∑_{i=1}^n ℓ(y_i w^T x_i);

the key properties we want:

- ℓ is continuous;

- ℓ(z) ≥ c · 1[z ≤ 0] = c · ℓ_zo(z) for some c > 0 and any z ∈ R, which implies R_ℓ(w) ≥ c · R_zo(w);

- ℓ′(0) < 0 (pushes examples from the wrong side to the right side).

Examples.

- Squared loss, written in margin form: ℓ_ls(z) := (1 − z)^2; note ℓ_ls(yŷ) = (1 − yŷ)^2 = y^2 (1 − yŷ)^2 = (y − ŷ)^2, using y^2 = 1 for y ∈ {−1, +1}.

- Logistic loss: ℓ_log(z) = ln(1 + exp(−z)).

19 / 61
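For concreteness, a NumPy sketch of the margin-form empirical logistic risk; this is my own addition, the data is synthetic, and np.logaddexp is used only for numerical stability.

import numpy as np

def logistic_loss(margins):
    """Logistic loss ln(1 + exp(-z)) applied to margins z_i = y_i * w^T x_i."""
    # np.logaddexp(0, -z) computes ln(1 + exp(-z)) stably.
    return np.logaddexp(0.0, -margins)

def empirical_logistic_risk(w, X, y):
    """Average logistic loss over a dataset with labels y in {-1, +1}."""
    return np.mean(logistic_loss(y * (X @ w)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
print(empirical_logistic_risk(np.array([1.0, 0.0]), X, y))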


(Lec5-6.) Squared and logistic losses on linearly separable data I

[Contour plots of the fitted linear predictors on linearly separable data; left panel: logistic loss, right panel: squared loss.]

20 / 61

(Lec5-6.) Squared and logistic losses on linearly separable data II

[Contour plots of the two fitted predictors; left panel: logistic loss, right panel: squared loss.]

21 / 61

(Lec5-6.) Logistic risk and separation

If there exists a perfect linear separator, empirical logistic risk minimization should find it.

Theorem.

If there exists w with y_i w^T x_i > 0 for all i, then every w with R_log(w) < ln(2)/(2n) + inf_v R_log(v) also satisfies y_i w^T x_i > 0.

Proof. Omitted.

22 / 61


(Lec5-6.) Least squares and logistic ERM

Least squares:

- Take the gradient of ‖Aw − b‖^2 and set it to 0; obtain the normal equations A^T A w = A^T b.

- One choice is the minimum norm solution A^+ b.

Logistic loss:

- Take the gradient of R_log(w) = (1/n) ∑_{i=1}^n ln(1 + exp(−y_i w^T x_i)) and set it to 0 ???

Remark. Is A^+ b a "closed form expression"?

23 / 61


(Lec5-6.) Gradient descent

Given a function F : R^d → R, gradient descent is the iteration

w_{i+1} := w_i − η_i ∇_w F(w_i),

where w_0 is given, and η_i is a learning rate / step size.

[Contour plot illustrating gradient descent on a two-dimensional objective.]

Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.

24 / 61
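A minimal sketch, assuming the least squares setup from earlier slides: plain gradient descent with a constant step size, compared against the pseudoinverse solution. The step size and iteration count are ad hoc choices for this random problem.

import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=500):
    """Run the iteration w_{i+1} = w_i - eta * grad(w_i) starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Least squares risk R(w) = ||Aw - b||^2 has gradient 2 A^T (Aw - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3)) / np.sqrt(50)
b = rng.normal(size=50) / np.sqrt(50)
grad_ls = lambda w: 2.0 * A.T @ (A @ w - b)

w_gd = gradient_descent(grad_ls, np.zeros(3), eta=0.3, steps=2000)
w_exact = np.linalg.pinv(A) @ b
print(np.allclose(w_gd, w_exact, atol=1e-4))  # True for this well-conditioned instance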


(Lec5-6.) Multiclass?

All our methods so far handle multiclass:

- k-NN and decision trees: plurality label.

- Least squares: arg min_{W ∈ R^{d×k}} ‖AW − B‖_F^2 with B ∈ R^{n×k}; W ∈ R^{d×k} is k separate linear regressors in R^d.

How about linear classifiers?

- At prediction time, x ↦ arg max_y f(x)_y.

- As in the binary case: interpretation f(x)_y = Pr[Y = y | X = x].

What is a good loss function?

25 / 61

(Lec5-6.) Cross-entropy

Given two probability vectors p, q ∈ Δ_k = {p ∈ R^k_{≥0} : ∑_i p_i = 1},

H(p, q) = −∑_{i=1}^k p_i ln q_i    (cross-entropy).

- If p = q, then H(p, q) = H(p) (entropy); indeed

  H(p, q) = −∑_{i=1}^k p_i ln( p_i · q_i / p_i ) = H(p) + KL(p, q)    (entropy + KL divergence).

  Since KL ≥ 0, and moreover KL = 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.

- Choose the encoding y = e_y for y ∈ {1, . . . , k}, and ŷ ∝ exp(f(x)) with f : R^d → R^k;

  ℓ_ce(y, f(x)) = H(y, ŷ) = −∑_{i=1}^k y_i ln[ exp(f(x)_i) / ∑_{j=1}^k exp(f(x)_j) ]
                = −ln[ exp(f(x)_y) / ∑_{j=1}^k exp(f(x)_j) ]
                = −f(x)_y + ln ∑_{j=1}^k exp(f(x)_j).

(In PyTorch, use torch.nn.CrossEntropyLoss()(f(x), y).)

26 / 61
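To connect the formula to that PyTorch call, a small check (my own addition) that torch.nn.CrossEntropyLoss matches −f(x)_y + ln ∑_j exp(f(x)_j) averaged over a batch; the logits and labels here are random.

import torch

# Logits f(x) for a batch of 4 examples and k = 3 classes, plus integer labels y.
logits = torch.randn(4, 3)
y = torch.tensor([0, 2, 1, 2])

# Library cross-entropy (the softmax is baked in, exactly as on the slide).
loss_lib = torch.nn.CrossEntropyLoss()(logits, y)

# Manual version: -f(x)_y + log sum_j exp(f(x)_j), averaged over the batch.
loss_manual = (-logits[torch.arange(4), y] + torch.logsumexp(logits, dim=1)).mean()

print(torch.allclose(loss_lib, loss_manual))  # True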

(Lec5-6.) Cross-entropy, classification, and margins

The zero-one loss for classification is

ℓ_zo(y, f(x)) = 1[ y ≠ arg max_j f(x)_j ].

In the multiclass case, we can define the margin as

f(x)_y − max_{j ≠ y} f(x)_j,

interpreted as "the distance by which f is correct". (Can be negative!)

Since ln ∑_j exp(z_j) ≈ max_j z_j, the cross-entropy satisfies

ℓ_ce(y, f(x)) = −f(x)_y + ln ∑_j exp(f(x)_j) ≈ −f(x)_y + max_j f(x)_j,

thus minimizing cross-entropy maximizes margins.

27 / 61

(Lec7-8.) The ERM perspective

These lectures will follow an ERM perspective on deep networks:

- Pick a model/predictor class (network architecture). (We will spend most of our time on this!)

- Pick a loss/risk. (We will almost always use cross-entropy!)

- Pick an optimizer. (We will mostly treat this as a black box!)

The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.

28 / 61

(Lec7-8.) Basic deep networks

A self-contained expression is

x ↦ σ_L( W_L σ_{L−1}( · · · (W_2 σ_1(W_1 x + b_1) + b_2) · · · ) + b_L ),

with the equivalent "functional form"

x ↦ (f_L ∘ · · · ∘ f_1)(x), where f_i(z) = σ_i(W_i z + b_i).

Some further details (many more to come!):

- (W_i)_{i=1}^L with W_i ∈ R^{d_i × d_{i−1}} are the weights, and (b_i)_{i=1}^L are the biases.

- (σ_i)_{i=1}^L with σ_i : R^{d_i} → R^{d_i} are called nonlinearities, or activations, or transfer functions, or link functions.

- This is only the basic setup; many things can and will change, please ask many questions!

29 / 61
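A minimal PyTorch sketch, not from the slides, of this basic form with L = 2, a ReLU σ_1, and an identity last layer; the layer widths are arbitrary.

import torch

# Two-layer ReLU network: x -> W2 * relu(W1 x + b1) + b2.
d, d1, k = 4, 16, 3
model = torch.nn.Sequential(
    torch.nn.Linear(d, d1),   # W1, b1
    torch.nn.ReLU(),          # sigma_1
    torch.nn.Linear(d1, k),   # W2, b2 (no final nonlinearity; cross-entropy adds the softmax)
)

x = torch.randn(8, d)          # a minibatch of 8 inputs
print(model(x).shape)          # torch.Size([8, 3])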

(Lec7-8.) Choices of activation

Basic form:

x ↦ σ_L( W_L σ_{L−1}( · · · W_2 σ_1(W_1 x + b_1) + b_2 · · · ) + b_L ).

Choices of activation (univariate, coordinate-wise):

- Indicator/step/Heaviside/threshold z ↦ 1[z ≥ 0]. This was the original choice (1940s!).

- Sigmoid σ_s(z) := 1/(1 + exp(−z)). This was popular roughly 1970s - 2005?

- Hyperbolic tangent z ↦ tanh(z). Similar to the sigmoid, used during the same interval.

- Rectified Linear Unit (ReLU) σ_r(z) = max{0, z}. It (and slight variants, e.g., Leaky ReLU, ELU, . . . ) are the dominant choice now; popularized in the "Imagenet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).

- Identity z ↦ z; we'll often use this as the last layer when we use the cross-entropy loss.

- NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.

30 / 61

(Lec7-8.) “Architectures” and “models”

Basic form:

x ↦ σ_L( W_L σ_{L−1}( · · · W_2 σ_1(W_1 x + b_1) + b_2 · · · ) + b_L ).

((W_i, b_i))_{i=1}^L, the weights and biases, are the parameters.

Let's roll them into W := ((W_i, b_i))_{i=1}^L,

and consider the network as a two-parameter function F_W(x) = F(x; W).

- The model or class of functions is {F_W : all possible W}. F (both arguments unset) is also called an architecture.

- When we fit/train/optimize, typically we leave the architecture fixed and vary W to minimize the risk.

(More on this in a moment.)

31 / 61

(Lec7-8.) ERM recipe for basic networks

Standard ERM recipe:

- First we pick a class of functions/predictors; for deep networks, that means an F(·, ·).

- Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:

  arg min_W (1/n) ∑_{i=1}^n ℓ_ce(y_i, F(x_i; W))
  = arg min_{(W_i, b_i)_{i=1}^L} (1/n) ∑_{i=1}^n ℓ_ce(y_i, F(x_i; ((W_i, b_i))_{i=1}^L))
  = arg min_{(W_i, b_i)_{i=1}^L} (1/n) ∑_{i=1}^n ℓ_ce(y_i, σ_L(· · · σ_1(W_1 x_i + b_1) · · ·)),

  where the minimization is over W_1 ∈ R^{d×d_1}, b_1 ∈ R^{d_1}, . . . , W_L ∈ R^{d_{L−1}×d_L}, b_L ∈ R^{d_L}.

- Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works.

32 / 61
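A compact sketch of this recipe in PyTorch: a model, the cross-entropy loss, and a gradient-descent-variant optimizer. The random data and hyperparameters are placeholders for illustration, not recommendations.

import torch

X = torch.randn(256, 4)
y = torch.randint(0, 3, (256,))

model = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
loss_fn = torch.nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)   # empirical risk on this (full-batch) dataset
    loss.backward()               # gradients via backpropagation
    opt.step()                    # one gradient descent update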

(Lec7-8.) Sometimes, linear just isn’t enough

[Two decision-boundary contour plots on the same dataset.]

Linear predictor: x ↦ w^T [x; 1]. Some blue points are misclassified.

ReLU network: x ↦ W_2 σ_r(W_1 x + b_1) + b_2. 0 misclassifications!

33 / 61

(Lec7-8.) Classical example: XOR

Classical "XOR problem" (Minsky-Papert '69). (Check Wikipedia for "AI Winter".)

Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.

Picture proof. Recall: linear classifiers correspond to separating hyperplanes.

- If it splits the blue points, it's incorrect on one of them.

- If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.

34 / 61


(Lec7-8.) One layer was not enough. How about two?

Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, . . . ). Given any continuous function f : R^d → R and any ε > 0, there exist parameters (W_1, b_1, W_2) so that

sup_{x ∈ [0,1]^d} | f(x) − W_2 σ(W_1 x + b_1) | ≤ ε,

as long as σ is "reasonable" (e.g., ReLU or sigmoid or threshold).

Remarks.

- Together with the XOR example, this justifies using nonlinearities.

- It does not justify (very) deep networks.

- It only says these networks exist, not that we can optimize for them!

35 / 61


(Lec7-8.) General graph-based view

Classical graph-based perspective.

- The network is a directed acyclic graph; sources are inputs, sinks are outputs, and intermediate nodes compute z ↦ σ(w^T z + b) (with their own (σ, w, b)).

- Nodes at distance 1 from the inputs are the first layer, distance 2 is the second layer, and so on.

"Modern" graph-based perspective.

- Edges in the graph can be multivariate, meaning vectors or general tensors, and not just scalars.

- Edges will often "skip" layers; "layer" is therefore ambiguous.

- Diagram conventions differ; e.g., tensorflow graphs include nodes for parameters.

36 / 61

(Lec7-8.) 2-D convolution in deep networks (pictures)

(Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.)

37 / 61


(Lec7-8.) Softmax

Replace the vector input z with z′ ∝ e^z, meaning

z ↦ ( e^{z_1} / ∑_j e^{z_j}, . . . , e^{z_k} / ∑_j e^{z_j} ).

- Converts the input into a probability vector; useful for interpreting the network output as Pr[Y = y | X = x].

- We have baked it into our cross-entropy definition; last lecture's networks with cross-entropy training had an implicit softmax.

- If some coordinate j of z dominates the others, then the softmax is close to e_j.

38 / 61

(Lec7-8.) Max pooling

Input (5×5):

2 2 3 0 3
0 0 1 0 3
0 0 2 1 2
0 2 2 3 1
1 2 3 1 0

Output of 3×3 max pooling with stride 1 (3×3):

3.0 3.0 3.0
2.0 3.0 3.0
3.0 3.0 3.0

(Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.)

- Often used together with convolution layers; shrinks/downsamples the input.

- Another variant is average pooling.

- Implementation: torch.nn.MaxPool2d.

39 / 61
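As a sanity check (my own addition), torch.nn.MaxPool2d reproduces the grid above under my reading of the figure as a 3×3 window with stride 1.

import torch

x = torch.tensor([[2, 2, 3, 0, 3],
                  [0, 0, 1, 0, 3],
                  [0, 0, 2, 1, 2],
                  [0, 2, 2, 3, 1],
                  [1, 2, 3, 1, 0]], dtype=torch.float32)

pool = torch.nn.MaxPool2d(kernel_size=3, stride=1)
out = pool(x.reshape(1, 1, 5, 5))   # MaxPool2d expects (batch, channels, height, width)
print(out.squeeze())                # tensor([[3., 3., 3.], [2., 3., 3.], [3., 3., 3.]])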


(Lec9-10.) Multivariate network single-example gradients

Define G_j(W) = σ_j(W_j · · · σ_1(W_1 x + b_1) · · ·). The multivariate chain rule tells us

∇_W F(Wx) = J^T x^T,

and J ∈ R^{l×k} is the Jacobian matrix of F : R^k → R^l at Wx, the matrix of all coordinate-wise derivatives.

dG_L/dW_L = J_L^T G_{L−1}(W)^T,        dG_L/db_L = J_L^T,
...
dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)^T G_{j−1}(W)^T,        dG_L/db_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)^T,

with J_j the Jacobian of σ_j at W_j G_{j−1}(W) + b_j. For example, with σ_j that is coordinate-wise σ : R → R,

J_j is diag( σ′([W_j G_{j−1}(W) + b_j]_1), . . . , σ′([W_j G_{j−1}(W) + b_j]_{d_j}) ).

40 / 61


(Lec9-10.) Initialization

Recall dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)^T G_{j−1}(W)^T.

- What if we set W = 0? What if σ = σ_r is a ReLU?

- What if we set two rows of W_j (two nodes) identically?

- Resolving this issue is called symmetry breaking.

- Standard linear/dense layer initializations:

  N(0, 2/d_in)              "He et al.",
  N(0, 2/(d_in + d_out))    Glorot/Xavier,
  U(−1/√d_in, 1/√d_in)      torch default.

(Convolution layers are adjusted to have similar distributions.)

Random initialization is emerging as a key story in deep networks!

41 / 61


(Lec9-10; SGD slide.) Minibatches

We used the linearity of gradients to write

∇_w R(w) = (1/n) ∑_{i=1}^n ∇_w ℓ(F(x_i; w), y_i).

What happens if we replace ((x_i, y_i))_{i=1}^n with a minibatch ((x′_i, y′_i))_{i=1}^b?

- Random minibatch =⇒ the two gradients are equal in expectation.

- Most torch layers take minibatch input:

  - torch.nn.Linear has input shape (b, d), output (b, d′).

  - torch.nn.Conv2d has input shape (b, c, h, w), output (b, c′, h′, w′).

- This is used heavily outside deep learning as well. It is an easy way to use parallel floating point operations (as on GPUs and CPUs).

- Setting the batch size is black magic and depends on many things (prediction problem, GPU characteristics, . . . ).

42 / 61
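A quick, purely illustrative shape check of the minibatch conventions listed above; the sizes are arbitrary.

import torch

linear = torch.nn.Linear(10, 5)                 # d = 10, d' = 5
conv = torch.nn.Conv2d(3, 8, kernel_size=3)     # c = 3, c' = 8

xb = torch.randn(32, 10)                        # minibatch of b = 32 vectors
imgs = torch.randn(32, 3, 28, 28)               # minibatch of b = 32 images, (c, h, w) = (3, 28, 28)

print(linear(xb).shape)                         # torch.Size([32, 5])
print(conv(imgs).shape)                         # torch.Size([32, 8, 26, 26])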

(Lec9-10.) Convex sets

A set S is convex if, for every pair of points {x, x′} in S, the line segment between x and x′ is also contained in S. ({x, x′} ⊆ S =⇒ [x, x′] ⊆ S.)

[Four example shapes: convex, not convex, convex, convex.]

Examples:

- All of R^d.

- The empty set.

- Half-spaces: {x ∈ R^d : a^T x ≤ b}.

- Intersections of convex sets.

- Polyhedra: {x ∈ R^d : Ax ≤ b} = ⋂_{i=1}^m {x ∈ R^d : a_i^T x ≤ b_i}.

- Convex hulls: conv(S) := {∑_{i=1}^k α_i x_i : k ∈ N, x_i ∈ S, α_i ≥ 0, ∑_{i=1}^k α_i = 1}.

(Infinite convex hulls: intersection of all convex supersets.)

43 / 61


(Lec9-10.) Convex functions from convex sets

The epigraph of a function f is the area above the curve:

epi(f) := {(x, y) ∈ R^{d+1} : y ≥ f(x)}.

A function is convex if its epigraph is convex.

[Two example plots: f is not convex; f is convex.]

44 / 61

(Lec9-10.) Convex functions (standard definition)

A function f : R^d → R is convex if for any x, x′ ∈ R^d and α ∈ [0, 1],

f((1 − α)x + αx′) ≤ (1 − α) · f(x) + α · f(x′).

[Two plots, with x and x′ marked: f is not convex; f is convex.]

Examples:

- f(x) = cx for any c > 0 (on R).

- f(x) = |x|^c for any c ≥ 1 (on R).

- f(x) = b^T x for any b ∈ R^d.

- f(x) = ‖x‖ for any norm ‖·‖.

- f(x) = x^T A x for symmetric positive semidefinite A.

- f(x) = ln(∑_{i=1}^d exp(x_i)), which approximates max_i x_i.

45 / 61


(Lec9-10.) Convexity of differentiable functions

Differentiable functions

If f : R^d → R is differentiable, then f is convex if and only if

f(x) ≥ f(x_0) + ∇f(x_0)^T (x − x_0)

for all x, x_0 ∈ R^d.

Note: this implies increasing slopes: (∇f(x) − ∇f(y))^T (x − y) ≥ 0.

[Plot: f and its tangent line a(x) = f(x_0) + f′(x_0)(x − x_0) at x_0.]

Twice-differentiable functions

If f : R^d → R is twice-differentiable, then f is convex if and only if

∇^2 f(x) ⪰ 0

for all x ∈ R^d (i.e., the Hessian, or matrix of second derivatives, is positive semi-definite for all x).

46 / 61


(Lec9-10.) Convex optimization problems

Standard form of a convex optimization problem:

min_{x ∈ R^d} f_0(x)   s.t.  f_i(x) ≤ 0 for all i = 1, . . . , n,

where f_0, f_1, . . . , f_n : R^d → R are convex functions.

Fact: the feasible set

A := {x ∈ R^d : f_i(x) ≤ 0 for all i = 1, 2, . . . , n}

is a convex set.

(SVMs next week will give us an example.)

47 / 61


(Lec9-10.) Subgradients: Jensen’s inequality.

If f : R^d → R is convex, then E f(X) ≥ f(EX).

Proof. Set y := EX, and pick any s ∈ ∂f(EX). Then

E f(X) ≥ E( f(y) + s^T(X − y) ) = f(y) + s^T E(X − y) = f(y).

Note. This inequality comes up often!

48 / 61

(Lec11-12.) Maximum margin linear classifier

The solution w to the following mathematical optimization problem:

min_{w ∈ R^d} (1/2)‖w‖_2^2   s.t.  y x^T w ≥ 1 for all (x, y) ∈ S

gives the linear classifier with the largest minimum margin on S—i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

This is a convex optimization problem; it can be solved in polynomial time.

If there is a solution (i.e., S is linearly separable), then the solution is unique.

We can solve this in a variety of ways (e.g., projected gradient descent); we will work with the dual.

Note: We can also explicitly include the affine expansion, so the decision boundary need not pass through the origin. We'll do our derivations without it.

49 / 61


(Lec11-12.) SVM Duality summary

Lagrangian

L(w, α) = (1/2)‖w‖_2^2 + ∑_{i=1}^n α_i (1 − y_i x_i^T w).

The primal maximum margin problem was

P(w) = max_{α ≥ 0} L(w, α) = max_{α ≥ 0} [ (1/2)‖w‖_2^2 + ∑_{i=1}^n α_i (1 − y_i x_i^T w) ].

Dual problem:

D(α) = min_{w ∈ R^d} L(w, α) = L(∑_{i=1}^n α_i y_i x_i, α)
     = ∑_{i=1}^n α_i − (1/2) ‖∑_{i=1}^n α_i y_i x_i‖_2^2
     = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

Given a dual optimum α,

- the corresponding primal optimum is w = ∑_{i=1}^n α_i y_i x_i;

- strong duality holds: P(w) = D(α);

- α_i > 0 implies y_i x_i^T w = 1, and these x_i are the support vectors.

50 / 61

(Lec11-12.) Nonseparable case: bottom line

Unconstrained primal:

min_{w ∈ R^d} (1/2)‖w‖^2 + C ∑_{i=1}^n [1 − y_i x_i^T w]_+.

Dual:

max_{α ∈ R^n, 0 ≤ α_i ≤ C} ∑_{i=1}^n α_i − (1/2) ‖∑_{i=1}^n α_i y_i x_i‖^2.

The dual solution α gives the primal solution w = ∑_{i=1}^n α_i y_i x_i.

Remarks.

- Can take C → ∞ to recover the separable case.

- The dual is a constrained convex quadratic (it can be solved with projected gradient descent).

- Some presentations include a bias in the primal (x_i^T w + b); this introduces a constraint ∑_{i=1}^n y_i α_i = 0 in the dual.

- Some presentations replace 1/2 and C with λ/2 and 1/n, respectively.

51 / 61
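An illustrative sketch, not the dual route the lectures pursue: subgradient descent on the unconstrained primal (1/2)‖w‖^2 + C ∑_i [1 − y_i x_i^T w]_+, with made-up, well-separated Gaussian data and an ad hoc step size.

import numpy as np

def svm_primal_subgradient(X, y, C=1.0, eta=0.01, steps=2000):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i [1 - y_i x_i^T w]_+ (labels in {-1, +1})."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                                  # examples with nonzero hinge loss
        subgrad = w - C * (y[active, None] * X[active]).sum(axis=0)
        w = w - eta * subgrad
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+1.5, size=(20, 2)), rng.normal(loc=-1.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w = svm_primal_subgradient(X, y)
print(np.mean(np.sign(X @ w) == y))                           # training accuracy, typically 1.0 here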


(Lec11-12.) Looking at the dual again

The SVM dual problem only depends on the x_i through the inner products x_i^T x_j:

max_{α_1, α_2, . . . , α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

If we use a feature expansion (e.g., quadratic expansion) x ↦ φ(x), this becomes

max_{α_1, α_2, . . . , α_n ≥ 0} ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j).

The solution w = ∑_{i=1}^n α_i y_i φ(x_i) is used in the following way:

x ↦ φ(x)^T w = ∑_{i=1}^n α_i y_i φ(x)^T φ(x_i).

Key insight:

- Training and prediction only use φ(x)^T φ(x′), never an isolated φ(x);

- Sometimes computing φ(x)^T φ(x′) is much easier than computing φ(x).

52 / 61


(Lec11-12.) Quadratic expansion

- φ : R^d → R^{1 + 2d + (d choose 2)}, where

  φ(x) = (1, √2 x_1, . . . , √2 x_d, x_1^2, . . . , x_d^2, √2 x_1 x_2, . . . , √2 x_1 x_d, . . . , √2 x_{d−1} x_d).

  (Don't mind the √2's . . . )

- Computing φ(x)^T φ(x′) in O(d) time:

  φ(x)^T φ(x′) = (1 + x^T x′)^2.

- Much better than d^2 time.

- What if we change the exponent "2"?

- What if we replace the additive "1" with 0?

53 / 61
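A numeric check, added here for illustration, that the explicit quadratic expansion matches the O(d) kernel evaluation (1 + x^T x′)^2 on random vectors.

import numpy as np

def phi(x):
    """Explicit quadratic feature expansion (with the sqrt(2) scalings from the slide)."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=4), rng.normal(size=4)

# The O(d) kernel evaluation matches the explicit O(d^2) feature expansion.
print(np.isclose(phi(x) @ phi(xp), (1.0 + x @ xp) ** 2))  # True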


(Lec11-12; RBF kernel.) Infinite dimensional feature expansion

For any σ > 0, there is an infinite feature expansion φ : R^d → R^∞ such that

φ(x)^T φ(x′) = exp( −‖x − x′‖_2^2 / (2σ^2) ),

which can be computed in O(d) time.

(This is called the Gaussian kernel with bandwidth σ.)

54 / 61

(Lec11-12.) Kernels

A (positive definite) kernel function K : X × X → R is a symmetric function satisfying:

For any x_1, x_2, . . . , x_n ∈ X, the n×n matrix whose (i, j)-th entry is K(x_i, x_j) is positive semidefinite.

(This matrix is called the Gram matrix.)

For any kernel K, there is a feature mapping φ : X → H such that

φ(x)^T φ(x′) = K(x, x′).

H is a Hilbert space—i.e., a special kind of inner product space—called the Reproducing Kernel Hilbert Space corresponding to K.

55 / 61


(Lec11-12.) Kernel SVMs (Boser, Guyon, and Vapnik, 1992)

Solve

max_{α1, α2, …, αn ≥ 0}  ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj K(xi, xj).

Solution w = ∑_{i=1}^n αi yi φ(xi) is used in the following way:

x ↦ φ(x)ᵀw = ∑_{i=1}^n αi yi K(x, xi).

I To represent the classifier, need to keep the support vector examples (xi, yi) and corresponding αi's.

I To compute the prediction on x, iterate through the support vector examples and compute K(x, xi) for each support vector xi . . .

Very similar to the nearest neighbor classifier: the predictor is represented using (a subset of) the training data.

56 / 61
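A minimal sketch of this prediction rule, assuming the dual has already been solved: only the support vectors, their labels and α's, and the kernel K are stored. The particular numbers below are illustrative, not the output of a solver.

```python
import numpy as np

def kernel_svm_predict(x, support_X, support_y, alpha, K):
    # sign of sum_i alpha_i y_i K(x, x_i), summed over the stored support vectors
    scores = np.array([K(x, xi) for xi in support_X])
    return np.sign(np.sum(alpha * support_y * scores))

K = lambda x, xp: (1.0 + x @ xp) ** 2                         # quadratic kernel
support_X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
support_y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.3, 0.7])                             # made-up dual variables

print(kernel_svm_predict(np.array([0.5, 0.5]), support_X, support_y, alpha, K))
```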


(Lec11-12.) Nonlinear support vector machines (again)

(Figure: contour plots of the learned decision functions on the same 2-D dataset for four predictors: a ReLU network; a quadratic SVM; an RBF SVM with σ = 1; and an RBF SVM with σ = 0.1.)

57 / 61

(Lec13.) The Perceptron Algorithm

Perceptron update (Rosenblatt ’58): initialize w := 0, and thereafter

w ← w + 1[y wᵀx ≤ 0] y x.

Remarks.

I Can interpret the algorithm as: either we are correct with a margin (y wᵀx > 0) and we do nothing, or we are not and we update w ← w + yx.

I Therefore: if we update, we do so by rotating towards yx.

I This makes sense: (w + yx)ᵀ(yx) = wᵀ(yx) + ‖x‖²; i.e., we increase wᵀ(yx).


Scenario 1 (figure): the current vector wt is comparable to xt in length, with yt = 1 and xtᵀwt ≤ 0. After the update, wt+1 correctly classifies (xt, yt).

Scenario 2 (figure): the current vector wt is much longer than xt, with yt = 1 and xtᵀwt ≤ 0. After the update, wt+1 still does not correctly classify (xt, yt).

Not obvious that Perceptron will eventually terminate! (We'll return to this.)

58 / 61
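A minimal sketch of the update loop on linearly separable toy data; the data generation and the `max_passes` safeguard are assumptions for this example (the lecture's update itself has no such cap).

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    # Rosenblatt's update: whenever y_i w^T x_i <= 0, set w <- w + y_i x_i.
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w = w + yi * xi
                mistakes += 1
        if mistakes == 0:     # a full pass with no updates: every example has y_i w^T x_i > 0
            break
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(20, 2))
y = np.sign(X @ w_true)
y[y == 0] = 1.0                      # avoid zero labels on the boundary

w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))       # True once the loop has converged
```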

2. Homework review

Nice homework problems

I HW1-P3: SVD practice.

I HW2-P1: SVD practice; definitions of ‖A‖2 and ‖A‖F.

I HW3-P1: convexity practice.

I HW3-P2: deep network and probability practice.

I HW3-P4, HW3-P5: kernel practice.

59 / 61

3. Stuff added 3/12/2019

Belaboring Cauchy-Schwarz

Theorem (Cauchy-Schwarz). |aᵀb| ≤ ‖a‖ · ‖b‖.

Proof (another one. . . ). If either a or b is zero, then aᵀb = 0 = ‖a‖ · ‖b‖. If both a and b are nonzero, note for any r > 0 that

0 ≤ ‖ra − b/r‖² = r²‖a‖² − 2aᵀb + ‖b‖²/r²,

which can be rearranged into

aᵀb ≤ r²‖a‖²/2 + ‖b‖²/(2r²).

Choosing r = √(‖b‖/‖a‖),

aᵀb ≤ (‖a‖²/2)(‖b‖/‖a‖) + (‖b‖²/2)(‖a‖/‖b‖) = ‖a‖ · ‖b‖.

Lastly, applying the bound to (a, −b) gives

−aᵀb = aᵀ(−b) ≤ ‖a‖ · ‖−b‖ = ‖a‖ · ‖b‖,

so together it follows that |aᵀb| ≤ ‖a‖ · ‖b‖. □

60 / 61
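Not part of the proof; just a quick numerical sanity check of the statement and of the equality case (b parallel to a), with arbitrary random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

lhs = abs(a @ b)
rhs = np.linalg.norm(a) * np.linalg.norm(b)
print(lhs <= rhs)                                        # the inequality holds

# Equality (up to round-off) when b is a scalar multiple of a.
print(np.isclose(abs(a @ (3 * a)), np.linalg.norm(a) * np.linalg.norm(3 * a)))
```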


SVD again

1. SV triples: (s, u, v) satisfies Mv = su and Mᵀu = sv.

2. Thin decomposition SVD: M = ∑_{i=1}^r si ui viᵀ.

3. Full factorization SVD: M = USVᵀ.

4. "Operational" view of SVD: for M ∈ R^{n×d},

M = [ u1 ⋯ ur ur+1 ⋯ un ] · S · [ v1 ⋯ vr vr+1 ⋯ vd ]ᵀ,

where S ∈ R^{n×d} has s1, …, sr on its leading diagonal and zeros everywhere else.

The first parts of U, V span the column / row space (respectively); the second parts span the left / right nullspaces (respectively).

Personally: I internalize the SVD and use it to reason about matrices. E.g., the "rank-nullity theorem".

61 / 61
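A small numpy sketch (illustrative, not from the lecture) connecting the thin SVD, the operational view, and rank-nullity on a random low-rank matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 3))    # a 5x3 matrix with rank <= 2

U, s, Vt = np.linalg.svd(M, full_matrices=True)          # full SVD: U is 5x5, Vt is 3x3
r = int(np.sum(s > 1e-10))                               # numerical rank

# Thin SVD: the first r singular triples already reconstruct M.
M_thin = (U[:, :r] * s[:r]) @ Vt[:r, :]
print(np.allclose(M, M_thin))

# The trailing right singular vectors span the nullspace (M v = 0),
# consistent with rank-nullity: r + dim(nullspace) = d.
print(np.allclose(M @ Vt[r:, :].T, 0.0), r + (Vt.shape[0] - r) == M.shape[1])
```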

