# Linear regression

CS 446

• Linear regression

CS 446

• 1. Overview

• todo

check some continuity bugs; make sure nothing is missing from old lectures (both mine and Daniel's); fix some of those bugs, like b replacing y; delete the excess material from the end; add a proper summary slide which boils down concepts and reduces student worry.

1 / 94

• Lecture 1: supervised learning

Training data: labeled examples

(x1, y1), (x2, y2), . . . , (xn, yn)

where

I each input xi is a machine-readable description of an instance (e.g., image, sentence), and

I each corresponding label yi is an annotation relevant to the task, typically not easy to obtain automatically.

Goal: learn a function f̂ from labeled examples that accurately “predicts” the labels of new (previously unseen) inputs.

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → predicted label.]

2 / 94

• Lecture 2: nearest neighbors and decision trees

[Scatter plot: labeled points in the (x1, x2) plane.]

Nearest neighbors. Training/fitting: memorize the data. Testing/predicting: find the k closest memorized points, return the plurality label. Overfitting? Vary k.

Decision trees. Training/fitting: greedily partition space, reducing “uncertainty”. Testing/predicting: traverse the tree, output the leaf label. Overfitting? Limit or prune the tree.

3 / 94
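As a quick sketch of the nearest-neighbor rule recapped above (a minimal illustration with made-up points and numpy, not the course's reference implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the plurality label among the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of k closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# toy data: two clusters with labels 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.0])))  # 0
```

Varying k here is exactly the overfitting knob mentioned on the slide.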

• Lectures 3-4: linear regression

[Scatter plot: duration vs. delay.]

Linear regression / least squares.

Our first (of many!) linear prediction methods.

Today:

I Example.

I How to solve it: ERM and SVD.

I Features.

Next lecture: advanced topics, including overfitting.

4 / 94

• 2. Example: Old Faithful

• Prediction problem: Old Faithful geyser (Yellowstone)

Task: Predict time of next eruption.

5 / 94

• Time between eruptions

Historical records of eruptions:

[Timeline: eruption start/end times a0, b0, a1, b1, a2, …, with inter-eruption gaps Y1, Y2, Y3.]

Time until next eruption: Yi := ai − bi−1.

Prediction task: At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On “Old Faithful” data:

I Using 136 past observations, we form mean estimate µ̂ = 70.7941.

I Can we do better?

6 / 94
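The mean estimate above is just the constant predictor; a minimal numpy sketch (the five waiting times below are made up for illustration — the real dataset has 136 observations):

```python
import numpy as np

# hypothetical waiting times between eruptions, in minutes
waits = np.array([55.0, 80.0, 62.0, 86.0, 71.0])

# the best constant predictor under squared loss is the sample mean
mu_hat = waits.mean()
print(mu_hat)  # 70.8
```

Every future eruption gets the same prediction mu_hat; the rest of the lecture is about doing better by using side-information.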


• Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.

[Scatter plot: duration of last eruption vs. time until next eruption.]

7 / 94


• Using side-information

At prediction time t, duration of last eruption is available as side-information.

[Timeline: past eruptions …, an−1, bn−1, an, bn, then the current time t; the label Y is the time until the next eruption, and the side-information X is the duration of the last eruption.]

IID model for supervised learning: (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid random pairs (i.e., labeled examples).

X takes values in X (e.g., X = R), Y takes values in R.

1. We observe (X1, Y1), . . . , (Xn, Yn), and then choose a prediction function (a.k.a. predictor)

f̂ : X → R.

This is called “learning” or “training”.

2. At prediction time, observe X, and form prediction f̂(X).

How should we choose f̂ based on data? Recall:

I The model is our choice.

I We must contend with overfitting, bad fitting algorithms, . . .

8 / 94


• 3. Least squares and linear regression

• Which line?

[Scatter plot: duration vs. delay.]

Let’s predict with a linear regressor:

ŷ := wT(x, 1),

where w ∈ R2 is learned from data.

Remark: appending 1 makes this an affine function x ↦ w1x + w2. (More on this later. . . )

If the data lies along a line, we should output that line. But what if not?

9 / 94


• ERM setup for least squares.

I Predictors/model: f̂(x) = wTx; a linear predictor/regressor. (For linear classification: x ↦ sgn(wTx).)

I Loss/penalty: the least squares loss

ℓls(y, ŷ) = (y − ŷ)².

(Some conventions scale this by 1/2.)

I Goal: minimize the least squares empirical risk

R̂ls(f̂) = (1/n) ∑_{i=1}^n ℓls(yi, f̂(xi)) = (1/n) ∑_{i=1}^n (yi − f̂(xi))².

I Specifically, we choose w ∈ Rd according to

argmin_{w∈Rd} R̂ls(x ↦ wTx) = argmin_{w∈Rd} (1/n) ∑_{i=1}^n (yi − wTxi)².

I More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.

10 / 94
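The empirical risk on the slide can be sketched directly in numpy (the three data points below are made up; the appended 1 is the affine trick from earlier):

```python
import numpy as np

def empirical_risk(w, X, y):
    """Least squares empirical risk: (1/n) * sum_i (y_i - w^T x_i)^2."""
    preds = X @ w
    return np.mean((y - preds) ** 2)

# rows are x_i with a 1 appended; here y = 2x exactly
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([2.0, 4.0, 6.0])

print(empirical_risk(np.array([2.0, 0.0]), X, y))  # 0.0 — data lies on a line
print(empirical_risk(np.array([0.0, 0.0]), X, y) > 0)  # True
```

ERM then amounts to searching over w for the smallest value of this function.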

• ERM in general

I Pick a family of models/predictors F .(For today, we use linear predictors.)

I Pick a loss function `.(For today, we choose squared loss.)

I Minimize the empirical risk over the model parameters.

We haven’t discussed: true risk and overfitting; how to minimize; why this is agood idea.

Remark: ERM is convenient in pytorch: just pick a model, a loss, and an optimizer, and tell it to minimize.

11 / 94


• Least squares ERM in pictures

Red dots: data points.

Affine hyperplane: our predictions (via the affine expansion (x1, x2) ↦ (1, x1, x2)).

ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.

12 / 94

• Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [xT1; . . . ; xTn],  b := (1/√n) (y1, . . . , yn)T,

i.e., A has the (scaled) xTi as rows and b stacks the (scaled) yi.

We can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (yi − xTiw)² = ‖Aw − b‖2².

A necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(ATA)w = ATb,

a system of linear equations called the normal equations.

In an upcoming lecture we’ll prove every critical point of R̂ is a minimizer of R̂.

13 / 94
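A minimal numpy sketch of the normal equations on randomly generated data (names and sizes are illustrative): when ATA is invertible, solving (ATA)w = ATb agrees with a direct least squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.normal(size=(n, d)) / np.sqrt(n)   # rows: scaled x_i^T
b = rng.normal(size=n) / np.sqrt(n)        # stacked scaled y_i

# Solve the normal equations (A^T A) w = A^T b
w = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq minimizes ||Aw - b||^2 directly; the two agree here
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w, w_lstsq))  # True
```

With n ≫ d and generic random data, ATA is invertible, so the solution is unique; the rank-deficient case is treated via the pseudoinverse below.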


• Summary on ERM and linear regression

Procedure:

I Form matrix A and vector b with the data (resp. xi, yi) as rows. (The scaling factor 1/√n is not standard; it doesn’t change the solution.)

I Find w satisfying the normal equations ATAw = ATb. (E.g., via Gaussian elimination, taking time O(nd²).)

I In general, solutions are not unique. (Why not?)

I If ATA is invertible, we can choose the (unique) (ATA)⁻¹ATb.

I Recall our original conundrum: we want to fit some line. We chose least squares; it gives one (family of) choice(s). Next lecture, with logistic regression, we get another.

I Note: if Aw = b for some w, then the data lies along a line, and we might as well not worry about picking a loss function.

I Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with AT(Aw − b) = 0, which does have solutions. . .

14 / 94


• 4. SVD and least squares

• SVD

Recall the Singular Value Decomposition (SVD) M = USVT ∈ Rm×n, where

I U ∈ Rm×r is orthonormal, S ∈ Rr×r is diag(s1, . . . , sr) with s1 ≥ s2 ≥ · · · ≥ sr ≥ 0, and V ∈ Rn×r is orthonormal, with r := rank(M). (If r = 0, use the convention S = 0 ∈ R1×1.)

I This convention is sometimes called the thin SVD.

I Another notation is to write M = ∑_{i=1}^r si ui viT. This avoids the issue with 0 (an empty sum is 0). Moreover, this notation makes it clear that (ui)_{i=1}^r span the column space and (vi)_{i=1}^r span the row space of M.

I The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of MTM and MMT.

I Note: numpy and pytorch have SVD (interfaces differ slightly). Determining r runs into numerical issues.

15 / 94

• Pseudoinverse

Let the SVD M = ∑_{i=1}^r si ui viT be given.

I Define the pseudoinverse M⁺ = ∑_{i=1}^r (1/si) vi uiT. (If 0 = M ∈ Rm×n, then 0 = M⁺ ∈ Rn×m.)

I Alternatively, define the pseudoinverse S⁺ of a diagonal matrix to be ST but with reciprocals of the non-zero elements; then M⁺ = VS⁺UT.

I Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).

I If M⁻¹ exists, then M⁻¹ = M⁺ (why?).

16 / 94
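The pseudoinverse definition above can be checked against numpy's built-in (the rank-1 matrix here is an arbitrary example):

```python
import numpy as np

M = np.array([[3.0, 0.0],
              [0.0, 0.0]])  # rank 1

# thin SVD via numpy; full_matrices=False gives the "factorization form"
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = int(np.sum(s > 1e-12))  # numerical rank (thresholding, as the slide warns)

# M^+ = sum_{i=1}^r (1/s_i) v_i u_i^T, built from the top-r singular triples
M_pinv = (Vt[:r].T / s[:r]) @ U[:, :r].T

print(np.allclose(M_pinv, np.linalg.pinv(M)))  # True
```

Note the explicit rank threshold 1e-12: as the slide says, determining r numerically requires a tolerance choice.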

• SVD and least squares

Recall: we’d like to find w such that

ATAw = ATb.

If w = A⁺b, then

ATAw = (∑_{i=1}^r si vi uiT)(∑_{i=1}^r si ui viT)(∑_{i=1}^r (1/si) vi uiT) b = (∑_{i=1}^r si vi uiT)(∑_{i=1}^r ui uiT) b = ATb.

Henceforth, define ŵols = A⁺b as the OLS solution. (OLS = “ordinary least squares”.)

Note: in general, AA⁺ = ∑_{i=1}^r ui uiT ≠ I.

17 / 94
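A numerical check of the claim on this slide, on a deliberately rank-deficient example (the data is synthetic): ŵols = A⁺b satisfies the normal equations even when ATA is singular.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
A[:, 3] = A[:, 0] + A[:, 1]  # force rank deficiency: solutions are not unique
b = rng.normal(size=6)

w_ols = np.linalg.pinv(A) @ b  # A^+ b, the OLS solution

# the normal equations hold even though A^T A is not invertible
print(np.allclose(A.T @ A @ w_ols, A.T @ b))  # True
```

Any w with the same Aw would do as well; A⁺b is the canonical choice (it turns out to be the minimum-norm one).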

• 5. Summary of linear regression so far

• Main points

I Model/function/predictor class of linear regressors x ↦ wTx.

I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.

I ERM solution for least squares: pick w satisfying ATAw = ATb, which is not unique; one canonical choice is the ordinary least squares solution A⁺b.

18 / 94

• Part 2 of linear regression lecture. . .

• Recap on SVD. (A messy slide, I’m sorry.)

Suppose 0 ≠ M ∈ Rn×d, thus r := rank(M) > 0.

I “Decomposition form” thin SVD: M = ∑_{i=1}^r si ui viT with s1 ≥ · · · ≥ sr > 0; M⁺ = ∑_{i=1}^r (1/si) vi uiT, and in general M⁺M = ∑_{i=1}^r vi viT ≠ I.

I “Factorization form” thin SVD: M = USVT, with U ∈ Rn×r and V ∈ Rd×r having orthonormal columns (so UTU = VTV = Ir, though UUT and VVT are not identity matrices in general), and S = diag(s1, . . . , sr) ∈ Rr×r with s1 ≥ · · · ≥ sr > 0; pseudoinverse M⁺ = VS⁻¹UT, and in general M⁺M and MM⁺ differ from each other and from I.

I Full SVD: M = Uf Sf VfT, with Uf ∈ Rn×n and Vf ∈ Rd×d orthonormal and full rank, so UfTUf and VfTVf are identity matrices; Sf ∈ Rn×d is zero everywhere except the first r diagonal entries, which are s1 ≥ · · · ≥ sr > 0; pseudoinverse M⁺ = Vf Sf⁺ UfT, where Sf⁺ is obtained by transposing Sf and taking reciprocals of the nonzero entries, and in general M⁺M and MM⁺ differ from each other and from I. Additional property: agreement with the eigendecompositions of MMT and MTM.

The “full SVD” adds columns to U and V which hit zeros of S and therefore don’t matter (as a sanity check, verify for yourself that all these SVDs are equal).

19 / 94

• Recap on SVD, zero matrix case

Suppose 0 = M ∈ Rn×d, thus r := rank(M) = 0.

I In all types of SVD, M+ is MT (another zero matrix).

I Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and MTu = sv; the zero matrix therefore has no singular values (or left/right singular vectors).

I “Factorization form thin SVD” becomes a little messy.

20 / 94

• 6. More on the normal equations

• Recall our matrix notation

Let labeled examples ((xi, yi))_{i=1}^n be given.

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [xT1; . . . ; xTn],  b := (1/√n) (y1, . . . , yn)T.

We can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (yi − xTiw)² = ‖Aw − b‖2².

A necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(ATA)w = ATb,

a system of linear equations called the normal equations.

We’ll now finally show that the normal equations imply optimality.

21 / 94


• Normal equations imply optimality

Consider w with ATAw = ATb, and any w′; then

‖Aw′ − b‖² = ‖Aw′ − Aw + Aw − b‖² = ‖Aw′ − Aw‖² + 2(Aw′ − Aw)T(Aw − b) + ‖Aw − b‖².

Since

(Aw′ − Aw)T(Aw − b) = (w′ − w)T(ATAw − ATb) = 0,

then ‖Aw′ − b‖² = ‖Aw′ − Aw‖² + ‖Aw − b‖² ≥ ‖Aw − b‖². This means w is optimal.

Moreover, writing A = ∑_{i=1}^r si ui viT,

‖Aw′ − Aw‖² = (w′ − w)T(ATA)(w′ − w) = (w′ − w)T (∑_{i=1}^r si² vi viT) (w′ − w),

so w′ is also optimal iff w′ − w is in the right nullspace of A.

(We’ll revisit all this with convexity later.)

22 / 94


• Geometric interpretation of least squares ERM

Let aj ∈ Rn be the j-th column of matrix A ∈ Rn×d, so A = [a1 · · · ad].

Minimizing ‖Aw − b‖2² is the same as finding the vector b̂ ∈ range(A) closest to b.

The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ Rd}.

[Figure: b and its projection b̂ onto the plane spanned by a1, a2.]

I b̂ is uniquely determined; indeed, b̂ = AA⁺b = ∑_{i=1}^r ui uiT b.

I If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.

If rank(A) < d, then the ERM solution is not unique.

To get w from b̂: solve the system of linear equations Aw = b̂.

23 / 94
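The projection picture can be verified numerically (sizes and data are arbitrary): the residual b − b̂ is orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 2))   # columns a_1, a_2 span a plane in R^5
b = rng.normal(size=5)

b_hat = A @ np.linalg.pinv(A) @ b  # orthogonal projection AA^+ b onto range(A)
residual = b - b_hat

# the residual is orthogonal to every column of A
print(np.allclose(A.T @ residual, 0.0))  # True
```

Orthogonality of the residual to range(A) is exactly the statement AT(Aw − b) = 0, i.e., the normal equations.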


• 7. Features

• Enhancing linear regression models with features

Linear functions alone are restrictive, but become powerful with creative side-information, or features.

Idea: Predict with x ↦ wTφ(x), where φ is a feature mapping.

Examples:

1. Non-linear transformation of existing variables: for x ∈ R,

φ(x) = ln(1 + x).

2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}d,

φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).

3. Trigonometric expansion: for x ∈ R,

φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

4. Polynomial expansion: for x = (x1, . . . , xd) ∈ Rd,

φ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , x_{d−1}xd).

24 / 94
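A small sketch of example 3 (the truncation to two frequencies and the target y = sin(x) are illustrative choices): a linear fit in the feature space recovers a non-linear function of x.

```python
import numpy as np

def phi(x):
    """Trigonometric expansion of a scalar, truncated to two frequencies."""
    return np.array([1.0, np.sin(x), np.cos(x), np.sin(2 * x), np.cos(2 * x)])

# fit y = sin(x): non-linear in x, but exactly linear in phi(x)
xs = np.linspace(0.0, 6.0, 40)
Phi = np.stack([phi(x) for x in xs])   # feature matrix, one row per x
y = np.sin(xs)

w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.allclose(Phi @ w, y, atol=1e-8))  # True
```

The learned w is (up to floating-point error) (0, 1, 0, 0, 0): the model picks out the sin(x) coordinate.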


• Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

I Physician suggests that body temperature is relevant, specifically the (square) deviation from normal body temperature:

φ(x) = (xtemp − 98.6)².

I Instead, use

φ(x) = (1, xtemp, xtemp²).

Can learn coefficients w such that

wTφ(x) = (xtemp − 98.6)²,

or any other quadratic polynomial in xtemp (which may be better!).

25 / 94

Quadratic function f : R → R:

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R.

This can be written as a linear function of φ(x), where

φ(x) := (1, x, x²),

since

f(x) = wTφ(x)

where w = (c, b, a).

For a multivariate quadratic function f : Rd → R, use

φ(x) := (1, x1, . . . , xd [linear terms], x1², . . . , xd² [squared terms], x1x2, . . . , x1xd, . . . , x_{d−1}xd [cross terms]).

26 / 94
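A quick check of the univariate case (the particular quadratic 2x² − 3x + 1 is an arbitrary example): least squares on φ(x) = (1, x, x²) recovers w = (c, b, a).

```python
import numpy as np

# f(x) = 2x^2 - 3x + 1 is linear in phi(x) = (1, x, x^2) with w = (1, -3, 2)
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
Phi = np.stack([np.array([1.0, x, x * x]) for x in xs])
y = 2 * xs**2 - 3 * xs + 1

w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.allclose(w, [1.0, -3.0, 2.0]))  # True
```

Five distinct points are more than enough to pin down the three coefficients exactly.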


• Affine expansion and “Old Faithful”

Woodward needed an affine expansion for “Old Faithful” data:

φ(x) := (1, x).

[Scatter plot: duration of last eruption vs. time until next eruption, with the fitted affine function overlaid.]

Affine function fa,b : R→ R for a, b ∈ R,

fa,b(x) = a+ bx,

is a linear function fw of φ(x) for w = (a, b).

(This easily generalizes to multivariate affine functions.)

27 / 94
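A minimal sketch of the affine expansion on synthetic eruption data (the intercept 33 and slope 10.7 below are invented for illustration, not Woodward's actual fit):

```python
import numpy as np

# made-up (duration, delay) pairs lying exactly on delay = 33 + 10.7 * duration
durations = np.array([1.8, 2.2, 3.5, 4.1, 4.7])
delays = 33.0 + 10.7 * durations

# phi(x) = (1, x): one row per example
Phi = np.stack([np.ones_like(durations), durations], axis=1)

w, *_ = np.linalg.lstsq(Phi, delays, rcond=None)
print(np.allclose(w, [33.0, 10.7]))  # True
```

With noisy real data the fit would not be exact, but the recipe — append a 1, then run least squares — is the same.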


• Final remarks on features

I “Feature engineering” can drastically change the power of a model.

I Some people consider it messy, unprincipled, pure “trial-and-error”.

I Deep learning is somewhat touted as removing some of this, but it doesn’t do so completely (e.g., it took a lot of work to come up with the “convolutional neural network”; side question: who came up with that?).

28 / 94

• 8. Statistical view of least squares; maximum likelihood

• Maximum likelihood estimation (MLE) refresher

Parametric statistical model: P = {Pθ : θ ∈ Θ}, a collection of probability distributions for the observed data.

I Θ: parameter space.

I θ ∈ Θ: a particular parameter (or parameter vector).

I Pθ: a particular probability distribution for observed data.

Likelihood of θ ∈ Θ given observed data x: For discrete X ∼ Pθ with probability mass function pθ,

L(θ) := pθ(x).

For continuous X ∼ Pθ with probability density function fθ,

L(θ) := fθ(x).

Maximum likelihood estimator (MLE): Let θ̂ be the θ ∈ Θ of highest likelihood given the observed data.

29 / 94
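A small sketch of the MLE idea for a Gaussian mean with known variance (the four observations are made up): the sample mean attains the highest log-likelihood among the candidates tried.

```python
import numpy as np

def log_likelihood(mu, x, sigma=1.0):
    """Log-likelihood of mean mu for iid N(mu, sigma^2) observations x."""
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma**2))

x = np.array([1.0, 2.0, 4.0, 5.0])
mu_mle = x.mean()  # for a Gaussian, the MLE of the mean is the sample mean

candidates = [0.0, 2.0, 6.0]
print(all(log_likelihood(mu_mle, x) >= log_likelihood(m, x)
          for m in candidates))  # True
```

Maximizing this Gaussian log-likelihood is the same as minimizing a sum of squares, which is the statistical view of least squares this section develops.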


• Distributions over labeled examples

X : Space of possible side-information (feature space).
Y: Space of possible outcomes (label space or output space).

Distribution P of random pair (X,Y ) taking values in X × Y can be thought of in two parts:

1. Marginal distribution PX of X:
PX is a probability distribution on X .

2. Conditional distribution PY |X=x of Y given X = x for each x ∈ X :
PY |X=x is a probability distribution on Y.

This lecture: Y = R (regression problems).

30 / 94


• Optimal predictor

What function f : X → R has smallest (squared loss) risk

R(f) := E[(f(X)− Y )2]?

Note: earlier we discussed empirical risk.

I Conditional on X = x, the minimizer of conditional risk

ŷ 7→ E[(ŷ − Y )2 | X = x]

is the conditional mean E[Y | X = x].

I Therefore, the function f? : X → R where

f?(x) = E[Y | X = x], x ∈ X ,

has the smallest risk.

I f? is called the regression function or conditional mean function.

31 / 94
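The first bullet can be checked numerically for a constant prediction: among all values ŷ, the sample mean minimizes the average squared error. The sample below is an arbitrary illustration:

```python
import numpy as np

# The value yhat minimizing the average of (yhat - y)^2 is the sample mean.
y = np.array([2.0, 3.0, 5.0, 10.0])

def avg_sq_loss(yhat, y):
    return np.mean((yhat - y) ** 2)

candidates = np.linspace(0.0, 12.0, 1201)          # grid of constant predictions
best = candidates[np.argmin([avg_sq_loss(c, y) for c in candidates])]
# best should be (numerically) the sample mean, y.mean() = 5.0
```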


• Linear regression models

When side-information is encoded as vectors of real numbers x = (x1, . . . , xd) (called features or variables), it is common to use a linear regression model, such as the following:

Y |X = x ∼ N(xTw, σ2), x ∈ Rd.

I Parameters: w = (w1, . . . , wd) ∈ Rd, σ2 > 0.

I X = (X1, . . . , Xd), a random vector (i.e., a vector of random variables).

I Conditional distribution of Y given X is normal.

I Marginal distribution of X not specified.

In this model, the regression function f? is a linear function fw : Rd → R,

fw(x) = xTw = Σ_{i=1}^d xi wi, x ∈ Rd.

(We’ll often refer to fw just by w.)

[Figure: data (x, y) for x ∈ [−1, 1], y ∈ [−5, 5], with the regression line f?.]

32 / 94


• Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise:
(X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, with

Y |X = x ∼ N(xTw, σ2), x ∈ Rd.

(Traditional to study linear regression in context of this model.)

Log-likelihood of (w, σ2), given data (Xi, Yi) = (xi, yi) for i = 1, . . . , n:

Σ_{i=1}^n { −(1/(2σ2)) (xiTw − yi)2 + (1/2) ln(1/(2πσ2)) } + { terms not involving (w, σ2) }.

The w that maximizes the log-likelihood is also the w that minimizes

(1/n) Σ_{i=1}^n (xiTw − yi)2.

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

33 / 94
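A small numpy check of this equivalence on synthetic data generated from the model; the dimensions, noise level, true parameter, and σ² below are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
A = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])           # assumed "true" parameter
y = A @ w_true + 0.1 * rng.normal(size=n)     # Gaussian noise, sigma = 0.1

sigma2 = 0.01   # any fixed sigma^2 > 0; the maximizing w does not depend on it

def log_likelihood(w):
    resid = A @ w - y
    return np.sum(-resid**2 / (2 * sigma2) + 0.5 * np.log(1 / (2 * np.pi * sigma2)))

# Least squares minimizer of (1/n) sum_i (x_i^T w - y_i)^2 ...
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# ... is also the maximizer of the Gaussian log-likelihood:
# random perturbations of w_hat never increase it.
perturbed = [w_hat + 0.1 * rng.normal(size=d) for _ in range(100)]
no_improvement = all(log_likelihood(w_hat) >= log_likelihood(w) for w in perturbed)
```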


• Empirical distribution and empirical risk

Empirical distribution Pn on (x1, y1), . . . , (xn, yn) has probability mass function pn given by

pn((x, y)) := (1/n) Σ_{i=1}^n 1{(x, y) = (xi, yi)}, (x, y) ∈ Rd × R.

Plug-in principle: Goal is to find function f that minimizes (squared loss) risk

R(f) = E[(f(X)− Y )2].

But we don’t know the distribution P of (X,Y ).

Replace P with Pn → Empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) Σ_{i=1}^n (f(xi)− yi)2.

(“Plug-in principle” is used throughout statistics in this same way.)

34 / 94
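The empirical risk R̂(f) is just the average of squared errors over the sample; a sketch with made-up data:

```python
import numpy as np

# Empirical squared-loss risk of a predictor f over a sample (made-up data).
def empirical_risk(f, xs, ys):
    return np.mean((f(xs) - ys) ** 2)

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 1.9, 4.1, 5.9])   # roughly y = 2x

risk_good = empirical_risk(lambda x: 2.0 * x, xs, ys)   # near-perfect predictor
risk_bad = empirical_risk(lambda x: -x, xs, ys)         # badly mismatched predictor
```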


• Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns

ŵ ∈ arg min_{w ∈ Rd} R̂(w),

which coincides with MLE under the basic linear regression model.

In general:

I MLE makes sense in context of statistical model for which it is derived.

I ERM makes sense in context of general iid model for supervised learning.

Further remarks.

I In MLE, we assume a model, and we not only maximize likelihood, but can try to argue we “recover” a “true” parameter.

I In ERM, by default there is no assumption of a “true” parameter to recover.

Useful examples: medical testing, gene expression, . . .

35 / 94


• Old faithful data under this least squares statistical model

Recall our data, consisting of historical records of eruptions:

[Timeline: alternating eruption start times ai and end times bi: a0 b0 a1 b1 a2 b2 a3 b3 . . . , with the gaps between eruptions labeled Y1, Y2, Y3.]

Statistical model (not just IID!): Y1, . . . , Yn, Y ∼iid N(µ, σ2).

I Data: Yi := ai − bi−1, i = 1, . . . , n.

(Admittedly not a great model, since durations are non-negative.)

Task:
At later time t (when an eruption ends), predict time of next eruption t+ Y .
For the linear regression model, we’ll assume

Y |X = x ∼ N(xTw, σ2), x ∈ Rd.

(This extends the model above if we add the “1” feature.)

36 / 94


• 9. Regularization and ridge regression

• Inductive bias

Suppose ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.

I Fact: The shortest solution ŵ to (ATA)w = ATb is always unique.

I Obtain ŵ via ŵ = A+b

where A+ is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea?

I Data does not give reason to choose a shorter w over a longer w.

I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w? is short), not for others.

All learning algorithms encode some kind of inductive bias.

37 / 94
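A sketch of the min-norm fact using numpy’s pseudoinverse on a hypothetical underdetermined problem: any other exact solution differs by a null-space direction and is strictly longer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 5                        # more features than examples: many exact fits
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# Minimum-norm solution of (A^T A) w = A^T b via the Moore-Penrose pseudoinverse.
w_min = np.linalg.pinv(A) @ b

# Any direction in the null space of A gives another exact solution...
null_dir = np.linalg.svd(A)[2][-1]    # right singular vector with zero singular value
w_other = w_min + null_dir

# ...but w_min is orthogonal to the null space, so w_other is strictly longer.
```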


• Example

ERM with scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Figures: training data alone; training data with some arbitrary ERM; training data with the least `2 norm ERM; plotted for x ∈ [0, 6], f(x) ∈ [−2.5, 2.5].]

It is not a given that the least norm ERM is better than the other ERM!

38 / 94
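A sketch of a truncated version of this feature expansion and its least-norm ERM fit; the truncation level K and the target sin(x) are choices for the demo, not from the slides:

```python
import numpy as np

# Truncated version of the scaled trigonometric expansion (K = 3 is a choice here).
def phi(x, K=3):
    feats = [np.ones_like(x)]
    for k in range(1, K + 1):
        feats += [np.sin(k * x) / k, np.cos(k * x) / k]
    return np.stack(feats, axis=-1)   # shape (len(x), 2K + 1)

x = np.linspace(0, 2 * np.pi, 8, endpoint=False)
y = np.sin(x)                         # a target that lies in the span of phi

# Least-norm ERM: the pseudoinverse picks the shortest weight vector fitting the data.
w = np.linalg.pinv(phi(x)) @ y
```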

• Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find minimizer of

R̂(w) + λ‖w‖22

over w ∈ Rd.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression.

(λ = 0 is ERM / Ordinary Least Squares.)

Explicit solution: ŵ = (ATA+ λI)−1ATb.

I Parameter λ controls how much attention is paid to the regularizer ‖w‖22 relative to the data-fitting term R̂(w).

I Choose λ using cross-validation.

Note: in deep networks, this regularization is called “weight decay”. (Why?)
Note: another popular regularizer for linear regression is `1.

39 / 94
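A sketch of the closed form from this slide on a hypothetical n < d problem; note the slide absorbs any 1/n factor from R̂ into λ, and the sizes and λ values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 10                      # n < d: plain ERM would not be unique
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def ridge(A, b, lam):
    # Ridge solution in the slide's form (A^T A + lam I)^{-1} A^T b.
    # Unique for any lam > 0, even when A^T A is singular.
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

w_small = ridge(A, b, 1e-6)       # near the min-norm ERM solution
w_big = ridge(A, b, 100.0)        # heavy shrinkage toward zero
```

Larger λ pays more attention to the regularizer, so the solution shrinks.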


• 10. True risk and overfitting

• Statistical interpretation of ERM

Let (X, Y ) ∼ P , where P is some distribution on Rd × R.
Which w have smallest risk R(w) = E[(XTw − Y )2]?

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to

E[XXT]w = E[YX],

a system of linear equations called the population normal equations.

It can be proved that every critical point of R is a minimizer of R.

Looks familiar?

If (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, then

E[ATA] = E[XXT] and E[ATb] = E[YX],

so ERM can be regarded as a plug-in estimator for a minimizer of R.

40 / 94
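A sketch of the plug-in view: the sample averages (1/n)ATA and (1/n)ATb estimate E[XXT] and E[YX], and solving the estimated normal equations recovers the ERM solution. The true parameter and sample size here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5000, 2
w_star = np.array([1.5, -0.5])        # assumed "true" parameter for the demo
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(size=n)   # noise keeps w_star the risk minimizer

# Plug-in estimates of E[XX^T] and E[YX] via sample averages.
Sigma_hat = (X.T @ X) / n
m_hat = (X.T @ y) / n

# Solving the estimated population normal equations == ERM (the 1/n cancels).
w_hat = np.linalg.solve(Sigma_hat, m_hat)
```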


• Risk of ERM

IID model: (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid, taking values in R^d × R.

Let w⋆ be a minimizer of R over all w ∈ R^d, i.e., w⋆ satisfies the population normal equations

E[XX^T] w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).

I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

“asymptotically”, where W := E[XX^T]^{−1/2} X and ε := Y − X^T w⋆.

41 / 94

• Risk of ERM analysis (rough sketch)

Let εi := Yi − Xi^T w⋆ for each i = 1, . . . , n, so

E[εi Xi] = E[Yi Xi] − E[Xi Xi^T] w⋆ = 0

and

√n (ŵ − w⋆) = ( (1/n) ∑_{i=1}^n Xi Xi^T )^{−1} (1/√n) ∑_{i=1}^n εi Xi.

1. By the LLN: (1/n) ∑_{i=1}^n Xi Xi^T →p E[XX^T].

2. By the CLT: (1/√n) ∑_{i=1}^n εi Xi →d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ŵ − w⋆) is

√n (ŵ − w⋆) →d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps give

n ( E[(X^T ŵ − Y)^2] − E[(X^T w⋆ − Y)^2] ) →d ‖ E[XX^T]^{−1/2} cov(εX)^{1/2} Z ‖₂².

The random variable on the RHS is “concentrated” around its mean tr(cov(εW)).

42 / 94
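The CLT conclusion can be verified by Monte Carlo: simulate many replications, collect √n(ŵ − w⋆), and check that its empirical covariance matches the asymptotic one, E[XX^T]^{−1} cov(εX) E[XX^T]^{−1}. A minimal sketch, assuming numpy; the model below (standard Gaussian X, independent noise) is an illustrative choice under which cov(εX) = σ² E[XX^T] = σ² I:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, reps = 2, 2000, 2000
w_star = np.array([1.0, -1.0])
sigma = 0.5

# Under this model, E[XX^T] = I and cov(eps X) = sigma^2 I, so the
# asymptotic covariance E[XX^T]^{-1} cov(eps X) E[XX^T]^{-1} is sigma^2 I.
devs = np.empty((reps, d))
for r in range(reps):
    X = rng.normal(size=(n, d))
    y = X @ w_star + rng.normal(scale=sigma, size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # ERM / least squares
    devs[r] = np.sqrt(n) * (w_hat - w_star)

emp_cov = np.cov(devs.T)
print(emp_cov)  # ≈ sigma^2 * I = 0.25 * I
```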

• Risk of ERM: postscript

I The analysis does not assume that the linear regression model is “correct”; the data distribution need not come from the normal linear regression model.

I The only assumptions are those needed for the LLN and CLT to hold.

I However, if the normal linear regression model holds, i.e.,

Y | X = x ∼ N(x^T w⋆, σ²),

then the bound from the theorem becomes

R(ŵ) − R(w⋆) = O( σ² d / n ),

which is familiar to those who have taken introductory statistics.

I With more work, one can also prove a non-asymptotic risk bound of similar form.

I In the homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.

43 / 94
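The σ²d/n rate under the normal model is easy to see empirically. With E[XX^T] = I, the excess risk is R(ŵ) − R(w⋆) = ‖ŵ − w⋆‖², so averaging it over many replications should give roughly σ²d/n. A minimal simulation sketch, assuming numpy; the constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, sigma, reps = 5, 1000, 1.0, 500
w_star = rng.normal(size=d)

# With E[XX^T] = I, excess risk R(w_hat) - R(w_star) = ||w_hat - w_star||^2.
excess = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=(n, d))
    y = X @ w_star + rng.normal(scale=sigma, size=n)   # normal linear model
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    excess[r] = np.sum((w_hat - w_star) ** 2)

print(excess.mean(), sigma**2 * d / n)  # both ≈ 0.005
```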

• Risk vs empirical risk

Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)
2. True risk of ERM: R(ŵ)

Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].

(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is “small”, but true risk is “much higher”.

44 / 94
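The inequality E[R̂(ŵ)] ≤ E[R(ŵ)] is visible in simulation: fit least squares on a small training sample, measure the empirical risk on that sample, and approximate the true risk with a large fresh sample. A minimal sketch, assuming numpy; the model and sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, reps = 5, 20, 2000      # small n so the train/test gap is visible
w_star = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    return X, X @ w_star + rng.normal(size=m)

train_risk = np.empty(reps)
test_risk = np.empty(reps)
for r in range(reps):
    X, y = sample(n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    train_risk[r] = np.mean((X @ w_hat - y) ** 2)    # empirical risk R_hat(w_hat)
    Xt, yt = sample(2000)                            # fresh sample ≈ true risk
    test_risk[r] = np.mean((Xt @ w_hat - yt) ** 2)

print(train_risk.mean(), test_risk.mean())  # train risk below test risk on average
```

On average the empirical risk sits below the noise level σ² = 1 while the true risk sits above it: exactly the gap the theorem describes.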

• Overfitting example

(X1, Y1), . . . , (Xn, Yn), (X, Y) are iid; X is a continuous random variable in R.

Suppose we use the degree-k polynomial expansion

φ(x) = (1, x, x^2, . . . , x^k), x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be