
Linear regression
CS 446

1. Overview

todo
I check some continuity bugs
I make sure nothing missing from old lectures (both mine and Daniel's)
I fix some of those bugs, like b replacing y
I delete the excess material from end
I add proper summary slide which boils down concepts and reduces student worry
1 / 94

Lecture 1: supervised learning
Training data: labeled examples
(x1, y1), (x2, y2), . . . , (xn, yn)
where
I each input xi is a machine-readable description of an instance (e.g., image, sentence), and
I each corresponding label yi is an annotation relevant to the task; typically not easy to obtain automatically.
Goal: learn a function f̂ from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs.
[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label.]
2 / 94

Lecture 2: nearest neighbors and decision trees
[Scatter plot: example data in coordinates (x_1, x_2).]

Nearest neighbors.
Training/fitting: memorize the data.
Testing/predicting: find the k closest memorized points, return the plurality label.
Overfitting? Vary k.

Decision trees.
Training/fitting: greedily partition the space, reducing "uncertainty".
Testing/predicting: traverse the tree, output the leaf label.
Overfitting? Limit or prune the tree.
3 / 94

Lectures 3-4: linear regression
[Scatter plot: duration of last eruption (1.5-5.0) vs. delay until next eruption (50-90).]
Linear regression / least squares.
Our first (of many!) linear prediction methods.
Today:
I Example.
I How to solve it: ERM and SVD.
I Features.
Next lecture: advanced topics, including overfitting.
4 / 94

2. Example: Old Faithful

Prediction problem: Old Faithful geyser (Yellowstone)
Task: Predict time of next eruption.
5 / 94

Time between eruptions
Historical records of eruptions:
[Timeline: eruption start/end times a_0, b_0, a_1, b_1, a_2, b_2, a_3, . . . , with waiting times Y_1, Y_2, Y_3 between eruptions.]
Time until next eruption: Yi := ai − bi−1.
Prediction task: At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.
On “Old Faithful” data:
I Using 136 past observations, we form mean estimate µ̂ = 70.7941.
I Can we do better?
6 / 94
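The mean baseline above can be sketched in a few lines of numpy. The waiting times below are hypothetical stand-ins for the 136 recorded observations, not the real dataset.

```python
import numpy as np

# Hypothetical stand-in for the "Old Faithful" waiting times; the real
# 136 observations are not reproduced here.
rng = np.random.default_rng(0)
waits = rng.uniform(45.0, 95.0, size=136)

# Baseline predictor: always predict the sample mean of past waiting times.
mu_hat = waits.mean()

# Its empirical squared error is the (biased) sample variance.
risk = np.mean((waits - mu_hat) ** 2)
```

The mean is the constant that minimizes average squared error, which is why it is the natural "no side-information" baseline to beat.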


Looking at the data
Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.
[Scatter plot: duration of last eruption (1.5-5.5) vs. time until next eruption (50-90).]
7 / 94


Using side-information
At prediction time t, the duration of the last eruption is available as side-information.
[Timeline: past eruptions (a_{n−1}, b_{n−1}), (a_n, b_n), . . . , current time t; Y is the time until the next eruption, and X is the duration of the last eruption.]
IID model for supervised learning: (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).
X takes values in X (e.g., X = R), Y takes values in R.
1. We observe (X_1, Y_1), . . . , (X_n, Y_n), and then choose a prediction function (a.k.a. predictor)
f̂ : X → R.
This is called "learning" or "training".
2. At prediction time, observe X, and form prediction f̂(X).
How should we choose f̂ based on data? Recall:
I The model is our choice.
I We must contend with overfitting, bad fitting algorithms, . . .
8 / 94


3. Least squares and linear regression

Which line?
[Scatter plot: duration (1.5-5.0) vs. delay (50-90), with candidate regression lines.]
Let's predict with a linear regressor:

ŷ := w^T (x, 1),

where w ∈ R^2 is learned from data.
Remark: appending 1 makes this an affine function x 7→ w_1 x + w_2. (More on this later. . . )
If the data lies along a line, we should output that line. But what if not?
9 / 94


ERM setup for least squares.
I Predictors/model: f̂(x) = w^T x; a linear predictor/regressor. (For linear classification: x 7→ sgn(w^T x).)
I Loss/penalty: the least squares loss

ℓ_ls(y, ŷ) = (y − ŷ)^2.

(Some conventions scale this by 1/2.)
I Goal: minimize the least squares empirical risk

R̂_ls(f̂) = (1/n) ∑_{i=1}^n ℓ_ls(y_i, f̂(x_i)) = (1/n) ∑_{i=1}^n (y_i − f̂(x_i))^2.

I Specifically, we choose w ∈ R^d according to

arg min_{w ∈ R^d} R̂_ls(x 7→ w^T x) = arg min_{w ∈ R^d} (1/n) ∑_{i=1}^n (y_i − w^T x_i)^2.

I More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.
10 / 94
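The ERM objective above is easy to evaluate directly. A minimal numpy sketch, with made-up duration/delay pairs standing in for real data:

```python
import numpy as np

# Hypothetical durations x_i and delays y_i (not the real dataset).
x = np.array([1.8, 2.2, 3.5, 4.1, 4.7])
y = np.array([56.0, 58.0, 74.0, 81.0, 88.0])

# Affine expansion: each row is (x_i, 1), matching the slide's w^T (x, 1).
X = np.column_stack([x, np.ones_like(x)])

def empirical_risk(w):
    """Least squares empirical risk (1/n) sum_i (y_i - w^T x_i)^2."""
    return np.mean((y - X @ w) ** 2)

# ERM: minimize over w; here via the least squares solver as a preview.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Any other choice of w gives empirical risk at least as large as `empirical_risk(w_hat)`.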

ERM in general
I Pick a family of models/predictors F .(For today, we use linear predictors.)
I Pick a loss function `.(For today, we chose squared loss.)
I Minimize the empirical risk over the model parameters.
We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.
Remark: ERM is convenient in pytorch: just pick a model, a loss, an optimizer, and tell it to minimize.
11 / 94


Least squares ERM in pictures
Red dots: data points.
Affine hyperplane: our predictions (via the affine expansion (x_1, x_2) 7→ (1, x_1, x_2)).
ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.
12 / 94

Empirical risk minimization in matrix notation
Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [← x_1^T → ; . . . ; ← x_n^T →],   b := (1/√n) (y_1, . . . , y_n)^T.

Can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.
Necessary condition for w to be a minimizer of R̂:
∇R̂(w) = 0, i.e., w is a critical point of R̂.
This translates to

(A^T A) w = A^T b,

a system of linear equations called the normal equations.
In upcoming lecture we’ll prove every critical point of R̂ is a minimizer of R̂.
13 / 94
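When A^T A is invertible, the normal equations can be solved directly. A small numpy sketch with synthetic data (the 1/√n scaling is kept only to mirror the slide; it cancels in the solution):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
A = rng.normal(size=(n, d)) / np.sqrt(n)   # rows are x_i^T / sqrt(n)
b = rng.normal(size=n) / np.sqrt(n)        # entries are y_i / sqrt(n)

# Normal equations: (A^T A) w = A^T b.
w = np.linalg.solve(A.T @ A, A.T @ b)

# Equivalently: the residual Aw - b is orthogonal to the columns of A.
grad = A.T @ (A @ w - b)
```

For this random A, A^T A is invertible with probability one, so `solve` succeeds; the rank-deficient case is handled on the next slides.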


Summary on ERM and linear regression
Procedure:
I Form matrix A and vector b with the data (resp. x_i, y_i) as rows. (The scaling factor 1/√n is not standard; it doesn't change the solution.)
I Find w satisfying the normal equations A^T A w = A^T b. (E.g., via Gaussian elimination, taking time O(nd^2).)
I In general, solutions are not unique. (Why not?)
I If ATA is invertible, can choose (unique) (ATA)−1ATb.
I Recall our original conundrum: we want to fit some line. We chose least squares; it gives one (family of) choice(s). Next lecture, with logistic regression, we get another.
I Note: if Aw = b for some w, then the data lies along a line, and we might as well not worry about picking a loss function.
I Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with A^T(Aw − b) = 0, which does have solutions. . .
14 / 94
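On the non-uniqueness point above: when A^T A is singular, `numpy.linalg.pinv` still returns a solution of the normal equations (the minimum-norm one). A sketch with a deliberately rank-deficient A:

```python
import numpy as np

# Duplicate column, so A^T A is singular and the normal equations
# have infinitely many solutions.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# pinv picks the minimum-norm solution A^+ b.
w_pinv = np.linalg.pinv(A) @ b

# Any vector in the nullspace of A can be added without changing Aw.
null_dir = np.array([1.0, -1.0])
```

Here `w_pinv + t * null_dir` solves the normal equations for every t, which is exactly the "solutions are not unique" phenomenon.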


4. SVD and least squares

SVD
Recall the Singular Value Decomposition (SVD) M = USV T ∈ Rm×n, where
I U ∈ R^{m×r} is orthonormal, S ∈ R^{r×r} is diag(s_1, . . . , s_r) with s_1 ≥ s_2 ≥ · · · ≥ s_r ≥ 0, and V ∈ R^{n×r} is orthonormal, with r := rank(M). (If r = 0, use the convention S = 0 ∈ R^{1×1}.)
I This convention is sometimes called the thin SVD.
I Another notation is to write M = ∑_{i=1}^r s_i u_i v_i^T. This avoids the issue with 0 (the empty sum is 0). Moreover, this notation makes it clear that (u_i)_{i=1}^r span the column space and (v_i)_{i=1}^r span the row space of M.
I The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of M^T M and MM^T.
I Note: numpy and pytorch both provide the SVD (their interfaces differ slightly). Determining r runs into numerical issues.
15 / 94
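A sketch of the thin SVD and the numerical rank determination mentioned above, using numpy. The tolerance rule here is one common choice, not the only one:

```python
import numpy as np

M = np.array([[3.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])

# numpy returns the full factors by default; full_matrices=False gives
# shapes closer to the thin SVD described on this slide.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Numerical rank: count singular values above a tolerance, since tiny
# nonzero values may be floating-point noise rather than true rank.
tol = s.max() * max(M.shape) * np.finfo(float).eps
r = int((s > tol).sum())
```

Note that numpy reports a full vector of singular values (including zeros), so "r" must be recovered by thresholding, which is the numerical issue the slide warns about.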

Pseudoinverse
Let the SVD M = ∑_{i=1}^r s_i u_i v_i^T be given.

I Define the pseudoinverse M^+ = ∑_{i=1}^r (1/s_i) v_i u_i^T.
(If 0 = M ∈ R^{m×n}, then 0 = M^+ ∈ R^{n×m}.)
I Alternatively, define the pseudoinverse S^+ of a diagonal matrix to be S^T but with reciprocals of the nonzero elements; then M^+ = V S^+ U^T.
I Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).
I If M^{−1} exists, then M^{−1} = M^+ (why?).
16 / 94
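The decomposition-form pseudoinverse can be checked against numpy's built-in `pinv`. A small sketch with a random matrix that has full column rank:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 3))   # full column rank with probability one

# Build M^+ from the thin SVD: sum_i (1/s_i) v_i u_i^T.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_plus = (Vt.T / s) @ U.T     # columns of Vt.T are v_i, scaled by 1/s_i
```

Since rank(M) = 3 here, M^+ M = I (a left inverse), illustrating the slide's remark that M^+ coincides with M^{-1} whenever the latter exists.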

SVD and least squares
Recall: we’d like to find w such that
ATAw = ATb.
If w = A^+ b, then

A^T A w = (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r s_i u_i v_i^T)(∑_{i=1}^r (1/s_i) v_i u_i^T) b
= (∑_{i=1}^r s_i v_i u_i^T)(∑_{i=1}^r u_i u_i^T) b = A^T b.

Henceforth, define ŵ_ols = A^+ b as the OLS solution.
(OLS = "ordinary least squares".)

Note: in general, AA^+ = ∑_{i=1}^r u_i u_i^T ≠ I.
17 / 94
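A numeric sanity check of this slide's claim: ŵ_ols = A^+ b satisfies the normal equations even when A^T A is not invertible. The matrix below is made rank-deficient on purpose:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))
A[:, 3] = A[:, 0] + A[:, 1]    # force rank(A) = 3, so A^T A is singular
b = rng.normal(size=6)

# The OLS solution from the slide: w_ols = A^+ b.
w_ols = np.linalg.pinv(A) @ b
```

Even though `np.linalg.solve(A.T @ A, A.T @ b)` would fail here, the pseudoinverse route always produces a solution of A^T A w = A^T b.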

5. Summary of linear regression so far

Main points
I Model/function/predictor class of linear regressors x 7→ w^T x.
I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
I ERM solution for least squares: pick w satisfying A^T A w = A^T b, which is not unique; one unique choice is the ordinary least squares solution A^+ b.
18 / 94

Part 2 of linear regression lecture. . .

Recap on SVD. (A messy slide, I’m sorry.)
Suppose 0 ≠ M ∈ R^{n×d}, thus r := rank(M) > 0.

I "Decomposition form" thin SVD: M = ∑_{i=1}^r s_i u_i v_i^T with s_1 ≥ · · · ≥ s_r > 0, and M^+ = ∑_{i=1}^r (1/s_i) v_i u_i^T, and in general M^+ M = ∑_{i=1}^r v_i v_i^T ≠ I.
I "Factorization form" thin SVD: M = USV^T, where U ∈ R^{n×r} and V ∈ R^{d×r} have orthonormal columns but UU^T and VV^T are not identity matrices in general, and S = diag(s_1, . . . , s_r) ∈ R^{r×r} with s_1 ≥ · · · ≥ s_r > 0; pseudoinverse M^+ = V S^{−1} U^T, and in general M^+ M ≠ MM^+ ≠ I.
I Full SVD: M = U_f S_f V_f^T, where U_f ∈ R^{n×n} and V_f ∈ R^{d×d} are orthonormal and full rank, so U_f^T U_f and V_f^T V_f are identity matrices, and S_f ∈ R^{n×d} is zero everywhere except the first r diagonal entries, which are s_1 ≥ · · · ≥ s_r > 0; pseudoinverse M^+ = V_f S_f^+ U_f^T, where S_f^+ is obtained by transposing S_f and then taking reciprocals of its nonzero entries, and in general M^+ M ≠ MM^+ ≠ I. Additional property: agreement with the eigendecompositions of MM^T and M^T M.

The "full SVD" adds columns to U and V which hit zeros of S and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
19 / 94

Recap on SVD, zero matrix case
Suppose 0 = M ∈ Rn×d, thus r := rank(M) = 0.
I In all types of SVD, M+ is MT (another zero matrix).
I Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and M^T u = sv; the zero matrix therefore has no singular values (or left/right singular vectors).
I “Factorization form thin SVD” becomes a little messy.
20 / 94

6. More on the normal equations

Recall our matrix notation
Let labeled examples ((x_i, y_i))_{i=1}^n be given.
Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [← x_1^T → ; . . . ; ← x_n^T →],   b := (1/√n) (y_1, . . . , y_n)^T.

Can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(A^T A) w = A^T b,

a system of linear equations called the normal equations.
We’ll now finally show that normal equations imply optimality.
21 / 94


Normal equations imply optimality
Consider w with A^T A w = A^T b, and any w′; then

‖Aw′ − b‖^2 = ‖Aw′ − Aw + Aw − b‖^2
= ‖Aw′ − Aw‖^2 + 2(Aw′ − Aw)^T(Aw − b) + ‖Aw − b‖^2.

Since

(Aw′ − Aw)^T(Aw − b) = (w′ − w)^T(A^T A w − A^T b) = 0,

then ‖Aw′ − b‖^2 = ‖Aw′ − Aw‖^2 + ‖Aw − b‖^2 ≥ ‖Aw − b‖^2. This means w is optimal.

Moreover, writing A = ∑_{i=1}^r s_i u_i v_i^T,

‖Aw′ − Aw‖^2 = (w′ − w)^T (A^T A) (w′ − w) = (w′ − w)^T (∑_{i=1}^r s_i^2 v_i v_i^T) (w′ − w),

so w′ is also optimal iff w′ − w is in the (right) nullspace of A.
(We’ll revisit all this with convexity later.)
22 / 94
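The characterization above (w′ optimal iff w′ − w lies in the nullspace of A) can be checked numerically. A sketch with a rank-deficient A whose nullspace direction is known by construction:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 3))
A[:, 2] = A[:, 0]                  # rank 2: nullspace contains (1, 0, -1)
b = rng.normal(size=8)

w = np.linalg.pinv(A) @ b          # one solution of the normal equations
null_dir = np.array([1.0, 0.0, -1.0])

def risk(v):
    """Squared residual ||Av - b||^2 (the quantity being minimized)."""
    return np.sum((A @ v - b) ** 2)
```

Shifting w along `null_dir` leaves the risk unchanged, while shifting in any direction outside the nullspace strictly increases it.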


Geometric interpretation of least squares ERM
Let a_j ∈ R^n be the jth column of the matrix A ∈ R^{n×d}, so

A = [a_1 · · · a_d] (columns).

Minimizing ‖Aw − b‖_2^2 is the same as finding the vector b̂ ∈ range(A) closest to b.
The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: b projected orthogonally onto the plane spanned by a_1 and a_2, giving b̂.]

I b̂ is uniquely determined; indeed, b̂ = AA^+ b = ∑_{i=1}^r u_i u_i^T b.
I If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a_1, . . . , a_d.

If rank(A) < d, then the ERM solution is not unique.
To get w from b̂: solve the system of linear equations Aw = b̂.
23 / 94
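The projection b̂ = AA^+ b can be verified directly: AA^+ is a symmetric idempotent matrix, and the residual b − b̂ is orthogonal to range(A). A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

# Orthogonal projection of b onto range(A), written as A A^+ b.
P = A @ np.linalg.pinv(A)
b_hat = P @ b

# Residual component of b that lies outside range(A).
residual = b - b_hat
```

Projecting twice changes nothing (P^2 = P), and the residual is perpendicular to every column of A, matching the picture on this slide.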


7. Features

Enhancing linear regression models with features
Linear functions alone are restrictive, but become powerful with creative side-information, or features.
Idea: Predict with x 7→ wTφ(x), where φ is a feature mapping.
Examples:
1. Nonlinear transformations of existing variables: for x ∈ R,
φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}d,
φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).
3. Trigonometric expansion: for x ∈ R,
φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).
4. Polynomial expansion: for x = (x_1, . . . , x_d) ∈ R^d,

φ(x) = (1, x_1, . . . , x_d, x_1^2, . . . , x_d^2, x_1 x_2, . . . , x_1 x_d, . . . , x_{d−1} x_d).
24 / 94


Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
I Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

φ(x) = (x_temp − 98.6)^2.
I What if you didn’t know about this magic constant 98.6?
I Instead, use

φ(x) = (1, x_temp, x_temp^2).

Can learn coefficients w such that

w^T φ(x) = (x_temp − 98.6)^2,

or any other quadratic polynomial in x_temp (which may be better!).
25 / 94

Quadratic expansion
Quadratic function f : R → R,

f(x) = ax^2 + bx + c, x ∈ R,

for a, b, c ∈ R.

This can be written as a linear function of φ(x), where

φ(x) := (1, x, x^2),

since
f(x) = w^T φ(x)
where w = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

φ(x) := (1, x_1, . . . , x_d, x_1^2, . . . , x_d^2, x_1 x_2, . . . , x_1 x_d, . . . , x_{d−1} x_d),

grouping linear terms, squared terms, and cross terms.
26 / 94
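A tiny sketch of the univariate case: any quadratic ax^2 + bx + c is exactly linear in the features φ(x) = (1, x, x^2), with weight vector w = (c, b, a).

```python
import numpy as np

def quadratic_features(x):
    """phi(x) = (1, x, x^2) for a scalar x, as on the slide."""
    return np.array([1.0, x, x * x])

# A particular quadratic f(x) = a x^2 + b x + c, rewritten as w^T phi(x).
a, b, c = 2.0, -3.0, 0.5
w = np.array([c, b, a])
x0 = 1.7
prediction = w @ quadratic_features(x0)
```

Least squares over φ(x) can therefore fit any quadratic, even though the predictor is "linear" in w.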


Affine expansion and “Old Faithful”
Woodward needed an affine expansion for “Old Faithful” data:
φ(x) := (1, x).
[Plot: duration of last eruption (0-6) vs. time until next eruption (0-100), with a fitted affine function.]
Affine function f_{a,b} : R → R for a, b ∈ R,

f_{a,b}(x) = a + bx,

is a linear function f_w of φ(x) for w = (a, b).
(This easily generalizes to multivariate affine functions.)
27 / 94
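Fitting the affine expansion by least squares is a single solver call. The duration/delay pairs below are hypothetical stand-ins for Woodward's data:

```python
import numpy as np

# Hypothetical duration/delay pairs (not the real "Old Faithful" records).
durations = np.array([1.8, 2.0, 3.3, 4.1, 4.5, 5.0])
delays = np.array([54.0, 55.0, 72.0, 80.0, 84.0, 90.0])

# Affine expansion phi(x) = (1, x): a column of ones next to the durations.
Phi = np.column_stack([np.ones_like(durations), durations])
w, *_ = np.linalg.lstsq(Phi, delays, rcond=None)
a, b = w                           # prediction: a + b * duration
```

The intercept a and slope b recover exactly the f_{a,b} on this slide, and the residual is orthogonal to both feature columns.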


Final remarks on features
I “Feature engineering” can drastically change the power of a model.
I Some people consider it messy, unprincipled, pure "trial-and-error".
I Deep learning is somewhat touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the "convolutional neural network"; side question: who came up with that?).
28 / 94

8. Statistical view of least squares; maximum likelihood

Maximum likelihood estimation (MLE) refresher
Parametric statistical model: P = {P_θ : θ ∈ Θ}, a collection of probability distributions for observed data.

I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).
I P_θ: a particular probability distribution for observed data.

Likelihood of θ ∈ Θ given observed data x:
For discrete X ∼ P_θ with probability mass function p_θ,

L(θ) := p_θ(x).

For continuous X ∼ P_θ with probability density function f_θ,

L(θ) := f_θ(x).

Maximum likelihood estimator (MLE): Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94
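A minimal MLE sketch in the spirit of this refresher: for a Gaussian with known variance 1, the sample mean maximizes the log-likelihood of the observed data. (The data here is simulated.)

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # simulated observations

def log_likelihood(theta):
    # log prod_i f_theta(x_i) for N(theta, 1), dropping additive constants
    return -0.5 * np.sum((x - theta) ** 2)

# The maximizer over theta is the sample mean.
theta_hat = x.mean()
```

This is the same "minimize a sum of squares" structure as least squares, which is the connection this section develops.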

Maximum likelihood estimation (MLE) refresher
Parametric statistical model:P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).I Pθ: a particular probability distribution for observed data.
Likelihood of θ ∈ Θ given observed data x:For discrete X ∼ Pθ with probability mass function pθ,
L(θ) := pθ(x).
For continuous X ∼ Pθ with probability density function fθ,
L(θ) := fθ(x).
Maximum likelihood estimator (MLE):Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94

Maximum likelihood estimation (MLE) refresher
Parametric statistical model:P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).
I Pθ: a particular probability distribution for observed data.
Likelihood of θ ∈ Θ given observed data x:For discrete X ∼ Pθ with probability mass function pθ,
L(θ) := pθ(x).
For continuous X ∼ Pθ with probability density function fθ,
L(θ) := fθ(x).
Maximum likelihood estimator (MLE):Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94

Maximum likelihood estimation (MLE) refresher
Parametric statistical model:P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).I Pθ: a particular probability distribution for observed data.
Likelihood of θ ∈ Θ given observed data x:For discrete X ∼ Pθ with probability mass function pθ,
L(θ) := pθ(x).
For continuous X ∼ Pθ with probability density function fθ,
L(θ) := fθ(x).
Maximum likelihood estimator (MLE):Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94

Maximum likelihood estimation (MLE) refresher
Parametric statistical model:P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).I Pθ: a particular probability distribution for observed data.
Likelihood of θ ∈ Θ given observed data x:For discrete X ∼ Pθ with probability mass function pθ,
L(θ) := pθ(x).
For continuous X ∼ Pθ with probability density function fθ,
L(θ) := fθ(x).
Maximum likelihood estimator (MLE):Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94

Maximum likelihood estimation (MLE) refresher
Parametric statistical model:P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).I Pθ: a particular probability distribution for observed data.
Likelihood of θ ∈ Θ given observed data x:For discrete X ∼ Pθ with probability mass function pθ,
L(θ) := pθ(x).
For continuous X ∼ Pθ with probability density function fθ,
L(θ) := fθ(x).
Maximum likelihood estimator (MLE):Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94
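A minimal numerical sketch of the refresher above, assuming a toy Gaussian model P = {N(µ, 1) : µ ∈ R} with made-up data: grid-search the log-likelihood and compare the maximizer to the sample mean, which is the known closed-form MLE for this model. (All names and numbers here are illustrative, not part of the lecture.)

```python
import numpy as np

# Hypothetical observed data from N(3, 1); the model fixes sigma^2 = 1.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=500)

def log_likelihood(mu, x):
    # log L(mu) = sum_i log f_mu(x_i) for the N(mu, 1) density.
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

# Grid search over a parameter space Theta = [0, 6].
grid = np.linspace(0.0, 6.0, 6001)
mu_hat = grid[np.argmax([log_likelihood(m, x) for m in grid])]
# For this model, mu_hat should match the sample mean up to grid resolution.
```

For this Gaussian model the grid maximizer agrees with the analytic MLE (the sample mean) up to the 0.001 grid spacing.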

Distributions over labeled examples

X : space of possible side information (feature space).
Y : space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in X × Y can be thought of in two parts:

1. Marginal distribution PX of X:
PX is a probability distribution on X.

2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ X:
P_{Y|X=x} is a probability distribution on Y.

This lecture: Y = R (regression problems).

30 / 94

Optimal predictor

What function f : X → R has the smallest (squared loss) risk

R(f) := E[(f(X) − Y)²]?

Note: earlier we discussed empirical risk.

I Conditional on X = x, the minimizer of the conditional risk

ŷ ↦ E[(ŷ − Y)² | X = x]

is the conditional mean E[Y | X = x].
I Therefore, the function f⋆ : X → R where

f⋆(x) = E[Y | X = x], x ∈ X,

has the smallest risk.
I f⋆ is called the regression function or conditional mean function.

31 / 94
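The claim that the conditional mean minimizes the conditional squared-loss risk can be checked by Monte Carlo. The sketch below uses a hypothetical conditional distribution Y | X = 1 ∼ N(2, 1) (not from the lecture) and scans candidate predictions ŷ:

```python
import numpy as np

# Draws of Y given X = 1 under an assumed conditional law N(2, 1).
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Estimate the conditional risk E[(yhat - Y)^2 | X = 1] for each candidate yhat.
candidates = np.linspace(0.0, 4.0, 401)
risks = [np.mean((c - y) ** 2) for c in candidates]
best = candidates[np.argmin(risks)]
# best should sit at the (sample) conditional mean, close to E[Y | X = 1] = 2.
```

Because the estimated risk is a quadratic in ŷ minimized at the sample mean, the grid minimizer lands on the candidate nearest that mean.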

Linear regression models

When side information is encoded as vectors of real numbers x = (x1, …, xd) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ∼ N(xᵀw, σ²), x ∈ Rᵈ.

I Parameters: w = (w1, …, wd) ∈ Rᵈ, σ² > 0.
I X = (X1, …, Xd), a random vector (i.e., a vector of random variables).
I The conditional distribution of Y given X is normal.
I The marginal distribution of X is not specified.

In this model, the regression function f⋆ is the linear function f_w : Rᵈ → R,

f_w(x) = xᵀw = ∑_{i=1}^d xᵢwᵢ, x ∈ Rᵈ.

(We'll often refer to f_w just by w.)

[Figure: one-dimensional example showing (x, y) data with the regression line f⋆.]

32 / 94

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X1, Y1), …, (Xn, Yn), (X, Y) are iid, with

Y | X = x ∼ N(xᵀw, σ²), x ∈ Rᵈ.

(It is traditional to study linear regression in the context of this model.)

Log-likelihood of (w, σ²), given data (Xi, Yi) = (xi, yi) for i = 1, …, n:

∑_{i=1}^n { −(1/(2σ²)) (xᵢᵀw − yᵢ)² + (1/2) ln(1/(2πσ²)) } + { terms not involving (w, σ²) }.

The w that maximizes the log-likelihood is also the w that minimizes

(1/n) ∑_{i=1}^n (xᵢᵀw − yᵢ)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

33 / 94
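The MLE/least-squares equivalence above can be seen numerically: under the Gaussian-noise model, the w maximizing the log-likelihood is the least-squares w. A sketch on synthetic data (sizes, seed, and the "true" parameter are all made up for illustration), computing the solution two ways:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.normal(size=(n, d))                 # rows are feature vectors x_i
w_true = np.array([1.0, -2.0, 0.5])         # hypothetical "true" parameter
y = A @ w_true + 0.1 * rng.normal(size=n)   # Gaussian-noise model

# Least squares via the normal equations (A^T A) w = A^T y ...
w_normal_eq = np.linalg.solve(A.T @ A, A.T @ y)
# ... and via a library least-squares routine; both minimize sum_i (x_i^T w - y_i)^2.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Here A has full column rank, so the minimizer is unique and the two computations agree.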

Empirical distribution and empirical risk

The empirical distribution Pₙ on (x1, y1), …, (xn, yn) has probability mass function pₙ given by

pₙ((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (xᵢ, yᵢ)}, (x, y) ∈ Rᵈ × R.

Plug-in principle: The goal is to find a function f that minimizes the (squared loss) risk

R(f) = E[(f(X) − Y)²].

But we don't know the distribution P of (X, Y).

Replace P with Pₙ → empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(xᵢ) − yᵢ)².

(The "plug-in principle" is used throughout statistics in this same way.)

34 / 94
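The formula for R̂ transcribes directly into code. A sketch on synthetic data (all names and numbers are illustrative); any w, including the ERM solution, can be scored this way:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=50)

def empirical_risk(w, X, y):
    # R-hat(w) = (1/n) * sum_i (x_i^T w - y_i)^2
    return np.mean((X @ w - y) ** 2)

# The least-squares (ERM) solution minimizes the empirical risk ...
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# ... so it scores no worse than any other w, e.g. the zero vector.
```
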

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss, ERM returns

ŵ ∈ arg min_{w ∈ Rᵈ} R̂(w),

which coincides with the MLE under the basic linear regression model.

In general:
I MLE makes sense in the context of the statistical model for which it is derived.
I ERM makes sense in the context of the general iid model for supervised learning.

Further remarks:
I In MLE, we assume a model, and we not only maximize likelihood but can try to argue that we "recover" a "true" parameter.
I In ERM, by default there is no assumption of a "true" parameter to recover.
Useful examples: medical testing, gene expression, . . .

35 / 94

Old Faithful data under this least squares statistical model

Recall our data, consisting of historical records of eruptions:

[Figure: timeline of eruption end/start times b₀, a₁, b₁, a₂, …, with inter-eruption times Y₁, Y₂, Y₃ marked, and the time Y to be predicted at a later time t.]

Statistical model (not just iid!): Y1, …, Yn, Y ∼iid N(µ, σ²).
I Data: Yᵢ := aᵢ − bᵢ₋₁, i = 1, …, n.
(Admittedly not a great model, since durations are nonnegative.)

Task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

For the linear regression model, we'll assume

Y | X = x ∼ N(xᵀw, σ²), x ∈ Rᵈ.

(This extends the model above if we add the "1" feature.)

36 / 94

9. Regularization and ridge regression

Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.
I Fact: The shortest solution ŵ to (AᵀA)w = Aᵀb is always unique.
I Obtain ŵ via ŵ = A⁺b, where A⁺ is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea?
I The data does not give a reason to choose a shorter w over a longer w.
I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the "true" w⋆ is short), and not for others.

All learning algorithms encode some kind of inductive bias.

37 / 94
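The pseudoinverse fact above can be demonstrated on an underdetermined toy system (n < d, synthetic data): there are infinitely many solutions of (AᵀA)w = Aᵀb, and A⁺b is the one of least Euclidean norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10                       # fewer equations than unknowns
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# Least-norm solution via the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(A) @ b

# Another solution: add any vector from the null space of A.
null_component = (np.eye(d) - np.linalg.pinv(A) @ A) @ np.ones(d)
w_other = w_min_norm + null_component
# Both fit the data equally well, but w_min_norm is shorter.
```

Since the null-space component is orthogonal to w_min_norm, adding it can only increase the norm.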

Example

ERM with scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), …).

[Figures: the training data alone; the training data with some arbitrary ERM fit; the training data with the least ℓ₂ norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!

38 / 94

Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find the minimizer of

R̂(w) + λ‖w‖₂²

over w ∈ Rᵈ.

Fact: If λ > 0, then the solution is always unique (even if n < d)!
I This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
I Explicit solution: ŵ = (AᵀA + λI)⁻¹Aᵀb.
I The parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R̂(w).
I Choose λ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ₁.

39 / 94
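A sketch of the closed form above on synthetic data, in the convention where the objective is ‖Aw − b‖₂² + λ‖w‖₂² (if R̂ carries a 1/n factor, λ is rescaled accordingly). The point is the uniqueness claim: even with n < d, where AᵀA is singular, AᵀA + λI is positive definite, so the solve succeeds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10                       # fewer examples than features
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
lam = 0.1                          # regularization strength, lambda > 0

# A^T A has rank at most n = 5 here, so it is singular; adding lam*I makes
# the system positive definite and the ridge minimizer unique.
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
```

In practice λ would be chosen by cross-validation, as the slide notes.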

10. True risk and overfitting

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on Rᵈ × R.
Which w have the smallest risk R(w) = E[(Xᵀw − Y)²]?

Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.

This translates to

E[XXᵀ]w = E[Y X],

a system of linear equations called the population normal equations.
It can be proved that every critical point of R is a minimizer of R.

Looks familiar? If (X1, Y1), …, (Xn, Yn), (X, Y) are iid, then

E[AᵀA] = E[XXᵀ] and E[Aᵀb] = E[Y X],

so ERM can be regarded as a plug-in estimator for a minimizer of R.

40 / 94
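The plug-in view above can be illustrated with a simulation (synthetic data; all names and sizes are made up): replace E[XXᵀ] and E[Y X] in the population normal equations by their sample averages, and solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 2
X = rng.normal(size=(n, d))              # here E[XX^T] = I by construction
w_star = np.array([1.0, -1.0])
y = X @ w_star + rng.normal(size=n)

second_moment = X.T @ X / n              # plug-in estimate of E[XX^T]
cross_moment = X.T @ y / n               # plug-in estimate of E[YX]
w_hat = np.linalg.solve(second_moment, cross_moment)
# With n large, the plug-in solution w_hat is close to the minimizer w_star.
```
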

Risk of ERM

IID model: (X1, Y1), …, (Xn, Yn), (X, Y) are iid, taking values in Rᵈ × R.

Let w⋆ be a minimizer of R over all w ∈ Rᵈ, i.e., w⋆ satisfies the population normal equations

E[XXᵀ]w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).
I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

"asymptotically", where W := E[XXᵀ]^(−1/2) X and ε := Y − Xᵀw⋆.

41 / 94

Risk of ERM analysis (rough sketch)
Let εᵢ := Yᵢ − Xᵢᵀw⋆ for each i = 1, . . . , n, so
E[εᵢXᵢ] = E[YᵢXᵢ] − E[XᵢXᵢᵀ]w⋆ = 0
and
√n (ŵ − w⋆) = ( (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ )⁻¹ (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ.
1. By LLN: (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ →ᵖ E[XXᵀ]
2. By CLT: (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ →ᵈ cov(εX)^(1/2) Z, where Z ∼ N(0, I).
Therefore, the asymptotic distribution of √n (ŵ − w⋆) is
√n (ŵ − w⋆) →ᵈ E[XXᵀ]⁻¹ cov(εX)^(1/2) Z.
A few more steps gives
n ( E[(Xᵀŵ − Y)²] − E[(Xᵀw⋆ − Y)²] ) →ᵈ ‖ E[XXᵀ]^(−1/2) cov(εX)^(1/2) Z ‖₂².
The random variable on the RHS is “concentrated” around its mean tr(cov(εW)).
42 / 94
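The CLT step can be visualized with a simulation. This sketch again assumes a Gaussian design with E[XXᵀ] = I and independent N(0, σ²) noise, so the asymptotic covariance E[XXᵀ]⁻¹ cov(εX) E[XXᵀ]⁻¹ simplifies to σ²I; the empirical covariance of √n(ŵ − w⋆) over many trials should be close to it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, sigma, trials = 3, 2000, 1.0, 1000
w_star = np.array([0.5, -1.0, 2.0])

devs = np.empty((trials, d))
for t in range(trials):
    A = rng.normal(size=(n, d))
    b = A @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.lstsq(A, b, rcond=None)[0]
    devs[t] = np.sqrt(n) * (w_hat - w_star)   # the rescaled deviation

# asymptotic covariance is E[XX^T]^{-1} cov(eps X) E[XX^T]^{-1} = sigma^2 I here
print(np.round(np.cov(devs, rowvar=False), 2))
```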

Risk of ERM: postscript
I Analysis does not assume that the linear regression model is “correct”; the data distribution need not be from the normal linear regression model.
I Only assumptions are those needed for the LLN and CLT to hold.
I However, if the normal linear regression model holds, i.e.,
Y | X = x ∼ N(xᵀw⋆, σ²),
then the bound from the theorem becomes
R(ŵ) − R(w⋆) = O( σ²d / n ),
which is familiar to those who have taken introductory statistics.
I With more work, can also prove a non-asymptotic risk bound of similar form.
I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.
43 / 94

Risk vs empirical risk
Let ŵ be the ERM solution.
1. Empirical risk of ERM: R̂(ŵ)
2. True risk of ERM: R(ŵ)
Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].
(Empirical risk can sometimes be larger than true risk, but not on average.)
Overfitting: empirical risk is “small”, but true risk is “much higher”.
44 / 94
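The gap between the two risks shows up clearly when d is a sizable fraction of n. A sketch under the same assumed Gaussian design (E[XXᵀ] = I, independent N(0, σ²) noise, so the true risk has the closed form R(w) = ‖w − w⋆‖² + σ²):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma, trials = 10, 20, 1.0, 500
w_star = rng.normal(size=d)

emp_risk, true_risk = [], []
for _ in range(trials):
    A = rng.normal(size=(n, d))
    b = A @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.lstsq(A, b, rcond=None)[0]
    emp_risk.append(np.mean((A @ w_hat - b) ** 2))          # R_hat(w_hat)
    # with E[XX^T] = I: R(w) = ||w - w_star||^2 + sigma^2
    true_risk.append(np.sum((w_hat - w_star) ** 2) + sigma**2)

print(np.mean(emp_risk), np.mean(true_risk))
```

On average the empirical risk is noticeably below the true risk, the direction the theorem predicts; driving n down toward d makes the gap, i.e., the overfitting, worse.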

Overfitting example
(X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid; X is a continuous random variable in R.
Suppose we use the degree-k polynomial expansion
φ(x) = (1, x, x², . . . , xᵏ), x ∈ R,
so the dimension is d = k + 1.
Fact: Any function on ≤ k + 1 points can be