-
Linear regression
CS 446
-
1. Overview
-
todo
I check some continuity bugs
I make sure nothing is missing from old lectures (both mine and Daniel's)
I fix some of those bugs, like b replacing y
I delete the excess material from the end
I add a proper summary slide which boils down concepts and reduces student worry
1 / 94
-
Lecture 1: supervised learning
Training data: labeled examples
(x1, y1), (x2, y2), . . . , (xn, yn)
where
I each input xi is a machine-readable description of an instance (e.g., image, sentence), and

I each corresponding label yi is an annotation relevant to the task, typically not easy to automatically obtain.

Goal: learn a function f̂ from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs.

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label.]
2 / 94
-
Lecture 2: nearest neighbors and decision trees
[Scatter plot: labeled points in the (x1, x2) plane, both axes from -1.0 to 1.0.]

Nearest neighbors.
Training/fitting: memorize data.
Testing/predicting: find k closest memorized points, return plurality label.
Overfitting? Vary k.

Decision trees.
Training/fitting: greedily partition space, reducing "uncertainty".
Testing/predicting: traverse tree, output leaf label.
Overfitting? Limit or prune tree.
3 / 94
-
Lectures 3-4: linear regression
[Scatter plot: duration of last eruption (1.5 to 5.0) vs. delay (50 to 90).]

Linear regression / least squares.

Our first (of many!) linear prediction methods.

Today:
I Example.
I How to solve it; ERM, and SVD.
I Features.

Next lecture: advanced topics, including overfitting.
4 / 94
-
2. Example: Old Faithful
-
Prediction problem: Old Faithful geyser (Yellowstone)
Task: Predict time of next eruption.
5 / 94
-
Time between eruptions
Historical records of eruptions:
[Timeline: eruption start/end times a0, b0, a1, b1, a2, b2, . . . , with the gaps between eruptions labeled Y1, Y2, Y3, up to the current time t.]

Time until next eruption: Y_i := a_i − b_{i−1}.

Prediction task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.
On “Old Faithful” data:
I Using 136 past observations, we form mean estimate µ̂ = 70.7941.
I Can we do better?
6 / 94
-
Looking at the data
Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.

[Scatter plot: duration of last eruption (1.5 to 5.5) vs. time until next eruption (50 to 90).]
7 / 94
-
Using side-information
At prediction time t, duration of last eruption is available as side-information.
[Timeline: past eruptions up to (a_{n−1}, b_{n−1}), (a_n, b_n); at the current time t, the duration X of the last eruption is available, and Y is the time until the next eruption.]

IID model for supervised learning:
(X1, Y1), . . . , (Xn, Yn), (X, Y) are iid random pairs (i.e., labeled examples).

X takes values in X (e.g., X = R); Y takes values in R.

1. We observe (X1, Y1), . . . , (Xn, Yn), and then choose a prediction function (a.k.a. predictor)

   f̂ : X → R.

   This is called "learning" or "training".
2. At prediction time, observe X, and form prediction f̂(X).
How should we choose f̂ based on data? Recall:
I The model is our choice.
I We must contend with overfitting, bad fitting algorithms, . . .
8 / 94
-
3. Least squares and linear regression
-
Which line?
[Scatter plot: duration (1.5 to 5.0) vs. delay (50 to 90).]

Let's predict with a linear regressor:

ŷ := wᵀ(x, 1),

where w ∈ R² is learned from data.

Remark: appending 1 makes this an affine function x ↦ w₁x + w₂. (More on this later. . . )

If data lies along a line, we should output that line.
But what if not?
9 / 94
-
ERM setup for least squares.
I Predictors/model: f̂(x) = wᵀx; a linear predictor/regressor.
  (For linear classification: x ↦ sgn(wᵀx).)

I Loss/penalty: the least squares loss

  ℓ_ls(y, ŷ) = (y − ŷ)².

  (Some conventions scale this by 1/2.)

I Goal: minimize the least squares empirical risk

  R̂_ls(f̂) = (1/n) ∑_{i=1}^n ℓ_ls(y_i, f̂(x_i)) = (1/n) ∑_{i=1}^n (y_i − f̂(x_i))².

I Specifically, we choose w ∈ R^d according to

  argmin_{w ∈ R^d} R̂_ls(x ↦ wᵀx) = argmin_{w ∈ R^d} (1/n) ∑_{i=1}^n (y_i − wᵀx_i)².

I More generally, this is the ERM approach:
  pick a model and minimize empirical risk over the model parameters.
10 / 94
-
ERM in general
I Pick a family of models/predictors F. (For today, we use linear predictors.)

I Pick a loss function ℓ. (For today, we chose squared loss.)

I Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in pytorch: just pick a model, a loss, an optimizer, and tell it to minimize (see the sketch below).
11 / 94
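To make the pytorch remark concrete, here is a minimal sketch of the ERM recipe (model, loss, optimizer); the data, learning rate, and iteration count are made-up placeholders, not anything from the lecture.

```python
# Minimal ERM sketch in PyTorch (assumed setup, synthetic data): pick a linear
# model, the squared loss, and an optimizer, then minimize the empirical risk.
import torch

n, d = 100, 2
X = torch.randn(n, d)                                       # inputs, one row per example
y = X @ torch.tensor([2.0, -1.0]) + 0.1 * torch.randn(n)    # synthetic labels

model = torch.nn.Linear(d, 1, bias=False)                   # the model: x -> w^T x
loss_fn = torch.nn.MSELoss()                                # the loss: squared error
opt = torch.optim.SGD(model.parameters(), lr=0.1)           # the optimizer

for _ in range(500):                                        # minimize empirical risk
    opt.zero_grad()
    risk = loss_fn(model(X).squeeze(1), y)
    risk.backward()
    opt.step()
```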
-
Least squares ERM in pictures
Red dots: data points.
Affine hyperplane: our predictions (via affine expansion (x1, x2) ↦ (1, x1, x2)).

ERM: minimize sum of squared vertical lengths from hyperplane to points.
12 / 94
-
Empirical risk minimization in matrix notation
Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [ x1ᵀ ; . . . ; xnᵀ ]  (rows x_iᵀ),    b := (1/√n) (y1, . . . , yn)ᵀ.

Can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (y_i − x_iᵀw)² = ‖Aw − b‖₂².

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

In an upcoming lecture we'll prove every critical point of R̂ is a minimizer of R̂.
13 / 94
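As a concrete illustration of this matrix notation, here is a small numpy sketch (synthetic data, not from the slides) that forms A and b and finds a w satisfying the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                                   # rows x_i^T
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                                            # the 1/sqrt(n) scaling is optional
b = y / np.sqrt(n)

# Find a least squares solution; lstsq works even when A^T A is singular,
# where np.linalg.solve(A.T @ A, A.T @ b) would fail.
w, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(A.T @ A @ w, A.T @ b))                      # normal equations hold
```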
-
Summary on ERM and linear regression
Procedure:
I Form matrix A and vector b with the data (resp. xi, yi) as rows.
  (The scaling factor 1/√n is not standard; it doesn't change the solution.)

I Find w satisfying the normal equations AᵀAw = Aᵀb.
  (E.g., via Gaussian elimination, taking time O(nd²).)

I In general, solutions are not unique. (Why not?)

I If AᵀA is invertible, can choose the (unique) (AᵀA)⁻¹Aᵀb.

I Recall our original conundrum: we want to fit some line.
  We chose least squares; it gives one (family of) choice(s).
  Next lecture, with logistic regression, we get another.

I Note: if Aw = b for some w, then the data lies along a line, and we might as well not worry about picking a loss function.

I Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with Aᵀ(Aw − b) = 0, which does have solutions. . .
14 / 94
-
4. SVD and least squares
-
SVD
Recall the Singular Value Decomposition (SVD) M = USVᵀ ∈ R^{m×n}, where

I U ∈ R^{m×r} is orthonormal, S ∈ R^{r×r} is diag(s1, . . . , sr) with s1 ≥ s2 ≥ · · · ≥ sr ≥ 0, and V ∈ R^{n×r} is orthonormal, with r := rank(M). (If r = 0, use the convention S = 0 ∈ R^{1×1}.)

I This convention is sometimes called the thin SVD.

I Another notation is to write M = ∑_{i=1}^r s_i u_i v_iᵀ. This avoids the issue with 0 (an empty sum is 0). Moreover, this notation makes it clear that (u_i)_{i=1}^r span the column space and (v_i)_{i=1}^r span the row space of M.

I The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of MᵀM and MMᵀ.

I Note: numpy and pytorch have SVD (interfaces slightly differ). Determining r runs into numerical issues (see the sketch below).
15 / 94
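A small numpy sketch (an assumption, not from the slides): compute a thin-style SVD and estimate the rank r with a tolerance, since exact zeros are rare in floating point.

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 0.0]])                              # a rank-1 example

U, s, Vt = np.linalg.svd(M, full_matrices=False)        # "thin"-style SVD
tol = max(M.shape) * np.finfo(float).eps * s[0]         # tolerance for the numerical rank
r = int(np.sum(s > tol))

# Keep the top-r singular triples: M = sum_{i=1}^r s_i u_i v_i^T.
U_r, s_r, V_r = U[:, :r], s[:r], Vt[:r, :].T
print(r, np.allclose(M, U_r @ np.diag(s_r) @ V_r.T))
```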
-
Pseudoinverse
Let the SVD M = ∑_{i=1}^r s_i u_i v_iᵀ be given.

I Define the pseudoinverse M⁺ = ∑_{i=1}^r (1/s_i) v_i u_iᵀ.
  (If 0 = M ∈ R^{m×n}, then 0 = M⁺ ∈ R^{n×m}.)

I Alternatively, define the pseudoinverse S⁺ of a diagonal matrix to be Sᵀ but with reciprocals of the non-zero elements; then M⁺ = V S⁺Uᵀ.

I Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).

I If M⁻¹ exists, then M⁻¹ = M⁺ (why?).
16 / 94
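A sketch (assumed, not from the slides): build M⁺ from the thin SVD and compare against numpy's built-in pinv.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 4))   # a 5x4 matrix with rank <= 3

U, s, Vt = np.linalg.svd(M, full_matrices=False)
tol = max(M.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

# M^+ = sum_{i=1}^r (1/s_i) v_i u_i^T  =  V_r diag(1/s_r) U_r^T
M_pinv = Vt[:r, :].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

print(np.allclose(M_pinv, np.linalg.pinv(M)))           # should be True
```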
-
SVD and least squares
Recall: we’d like to find w such that
ATAw = ATb.
If w = A+b, then
ATAw =
r∑i=1
siviuTi
r∑i=1
siuivTi
r∑i=1
1
siviu
Ti
b=
r∑i=1
siviuTi
r∑i=1
uiuTi
b = ATb.Henceforth, define ŵols = A
+b as the OLS solution.(OLS = “ordinary least squares”.)
Note: in general, AA+ =∑ri=1 uiu
Ti 6= I.
17 / 94
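A quick numerical sketch (assumed, synthetic data): ŵ_ols = A⁺b satisfies the normal equations even when AᵀA is singular (here d > n, so it must be).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8                                    # more features than examples
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_ols = np.linalg.pinv(A) @ b                  # A^+ b

print(np.allclose(A.T @ A @ w_ols, A.T @ b))   # the normal equations hold
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_ols, w_lstsq))             # lstsq returns the same (minimum-norm) solution
```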
-
5. Summary of linear regression so far
-
Main points
I Model/function/predictor class of linear regressors x ↦ wᵀx.

I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.

I ERM solution for least squares: pick w satisfying AᵀAw = Aᵀb, which is not unique; one unique choice is the ordinary least squares solution A⁺b.
18 / 94
-
Part 2 of linear regression lecture. . .
-
Recap on SVD. (A messy slide, I’m sorry.)
Suppose 0 ≠ M ∈ R^{n×d}, thus r := rank(M) > 0.

I "Decomposition form" thin SVD: M = ∑_{i=1}^r s_i u_i v_iᵀ with s1 ≥ · · · ≥ sr > 0, and M⁺ = ∑_{i=1}^r (1/s_i) v_i u_iᵀ, and in general M⁺M = ∑_{i=1}^r v_i v_iᵀ ≠ I.

I "Factorization form" thin SVD: M = USVᵀ, where U ∈ R^{n×r} and V ∈ R^{d×r} are orthonormal but UUᵀ and V Vᵀ are not identity matrices in general, and S = diag(s1, . . . , sr) ∈ R^{r×r} with s1 ≥ · · · ≥ sr > 0; pseudoinverse M⁺ = V S⁻¹Uᵀ, and in general M⁺M ≠ MM⁺ ≠ I.

I Full SVD: M = U_f S_f V_fᵀ, where U_f ∈ R^{n×n} and V_f ∈ R^{d×d} are orthonormal and full rank, so U_fᵀU_f and V_fᵀV_f are identity matrices, and S_f ∈ R^{n×d} is zero everywhere except the first r diagonal entries, which are s1 ≥ · · · ≥ sr > 0; pseudoinverse M⁺ = V_f S_f⁺ U_fᵀ, where S_f⁺ is obtained by transposing S_f and then inverting the nonzero entries, and in general M⁺M ≠ MM⁺ ≠ I. Additional property: agreement with the eigendecompositions of MMᵀ and MᵀM.

The "full SVD" adds columns to U and V which hit zeros of S and therefore don't matter
(as a sanity check, verify for yourself that all these SVDs are equal).
19 / 94
-
Recap on SVD, zero matrix case
Suppose 0 = M ∈ Rn×d, thus r := rank(M) = 0.
I In all types of SVD, M+ is MT (another zero matrix).
I Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and Mᵀu = sv; the zero matrix therefore has no singular values (or left/right singular vectors).
I “Factorization form thin SVD” becomes a little messy.
20 / 94
-
6. More on the normal equations
-
Recall our matrix notation
Let labeled examples ((x_i, y_i))_{i=1}^n be given.

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [ x1ᵀ ; . . . ; xnᵀ ]  (rows x_iᵀ),    b := (1/√n) (y1, . . . , yn)ᵀ.

Can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (y_i − x_iᵀw)² = ‖Aw − b‖₂².

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

We'll now finally show that the normal equations imply optimality.
21 / 94
-
Normal equations imply optimality
Consider w with AᵀAw = Aᵀb, and any w′; then

‖Aw′ − b‖² = ‖Aw′ − Aw + Aw − b‖²
           = ‖Aw′ − Aw‖² + 2(Aw′ − Aw)ᵀ(Aw − b) + ‖Aw − b‖².

Since

(Aw′ − Aw)ᵀ(Aw − b) = (w′ − w)ᵀ(AᵀAw − Aᵀb) = 0,

then ‖Aw′ − b‖² = ‖Aw′ − Aw‖² + ‖Aw − b‖² ≥ ‖Aw − b‖². This means w is optimal.

Moreover, writing A = ∑_{i=1}^r s_i u_i v_iᵀ,

‖Aw′ − Aw‖² = (w′ − w)ᵀ(AᵀA)(w′ − w) = (w′ − w)ᵀ (∑_{i=1}^r s_i² v_i v_iᵀ) (w′ − w),

so w′ is also optimal iff w′ − w is in the (right) nullspace of A.

(We'll revisit all this with convexity later.)
22 / 94
-
Geometric interpretation of least squares ERM
Let a_j ∈ R^n be the j-th column of the matrix A ∈ R^{n×d}, so

A = [ a1 · · · ad ]   (columns a1, . . . , ad).

Minimizing ‖Aw − b‖₂² is the same as finding the vector b̂ ∈ range(A) closest to b.

The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: b projected orthogonally onto the plane spanned by a1 and a2, giving b̂.]

I b̂ is uniquely determined; indeed, b̂ = AA⁺b = ∑_{i=1}^r u_i u_iᵀ b.

I If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.

If rank(A) < d, then the ERM solution is not unique.

To get w from b̂: solve the system of linear equations Aw = b̂.
23 / 94
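A numpy sketch of this picture (assumed, synthetic data): b̂ = AA⁺b is the orthogonal projection of b onto range(A), so the residual b − b̂ is orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))   # n = 6, d = 4, rank 2 < d
b = rng.normal(size=6)

b_hat = A @ np.linalg.pinv(A) @ b                       # orthogonal projection of b onto range(A)

print(np.allclose(A.T @ (b - b_hat), 0.0))              # residual is orthogonal to every column a_j
```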
-
7. Features
-
Enhancing linear regression models with features
Linear functions alone are restrictive, but become powerful with creative side-information, or features.

Idea: Predict with x ↦ wᵀφ(x), where φ is a feature mapping.

Examples (see the sketch after this list):

1. Non-linear transformations of existing variables: for x ∈ R,

   φ(x) = ln(1 + x).

2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}^d,

   φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).

3. Trigonometric expansion: for x ∈ R,

   φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

4. Polynomial expansion: for x = (x1, . . . , xd) ∈ R^d,

   φ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , xd−1xd).
24 / 94
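A minimal sketch of the feature-map idea (assumed, synthetic data, not from the slides): build φ(x) with a trigonometric expansion, then run ordinary least squares on the transformed inputs.

```python
import numpy as np

def phi(x, K=3):
    """Map scalar x to (1, sin(x), cos(x), ..., sin(Kx), cos(Kx))."""
    feats = [1.0]
    for k in range(1, K + 1):
        feats += [np.sin(k * x), np.cos(k * x)]
    return np.array(feats)

rng = np.random.default_rng(4)
x = rng.uniform(0, 2 * np.pi, size=40)
y = np.sin(x) + 0.5 * np.cos(2 * x) + 0.05 * rng.normal(size=40)

Phi = np.stack([phi(xi) for xi in x])          # n x (2K+1) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # linear regression on the features

y_hat = Phi @ w                                # predictions x -> w^T phi(x)
print(float(np.mean((y_hat - y) ** 2)))        # empirical squared-loss risk
```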
-
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
I A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

  φ(x) = (x_temp − 98.6)².

I What if you didn't know about this magic constant 98.6?

I Instead, use

  φ(x) = (1, x_temp, x_temp²).

Can learn coefficients w such that

wᵀφ(x) = (x_temp − 98.6)²,

or any other quadratic polynomial in x_temp (which may be better!); see the sketch below.
25 / 94
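A worked sketch with made-up temperature data (not from the lecture): with features (1, t, t²) the learned weights can recover the "magic constant", since (t − c)² = c² − 2ct + t².

```python
import numpy as np

rng = np.random.default_rng(5)
t = rng.uniform(96.0, 102.0, size=200)             # hypothetical temperatures
y = (t - 98.6) ** 2 + 0.01 * rng.normal(size=200)  # outcome the physician had in mind

Phi = np.stack([np.ones_like(t), t, t ** 2], axis=1)   # phi(x) = (1, t, t^2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# For an exact fit, w = (c^2, -2c, 1) with c = 98.6; recover c from w[1].
print(w, -w[1] / 2)
```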
-
Quadratic expansion
Quadratic function f : R → R,

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R.

This can be written as a linear function of φ(x), where

φ(x) := (1, x, x²),

since f(x) = wᵀφ(x) where w = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

φ(x) := (1, x1, . . . , xd  [linear terms],  x1², . . . , xd²  [squared terms],  x1x2, . . . , x1xd, . . . , xd−1xd  [cross terms]).
26 / 94
-
Affine expansion and “Old Faithful”
Woodward needed an affine expansion for the "Old Faithful" data:

φ(x) := (1, x).

[Scatter plot: duration of last eruption vs. time until next eruption, with an affine function fit through the data.]

The affine function f_{a,b} : R → R for a, b ∈ R,

f_{a,b}(x) = a + bx,

is a linear function f_w of φ(x) for w = (a, b).

(This easily generalizes to multivariate affine functions; see the sketch below.)
27 / 94
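A sketch with made-up numbers standing in for the Old Faithful records (the coefficients and noise level are assumptions, not the real data): fit an affine predictor delay ≈ a + b·duration via least squares on φ(x) = (1, x).

```python
import numpy as np

rng = np.random.default_rng(6)
duration = rng.uniform(1.5, 5.0, size=136)                    # minutes (synthetic)
delay = 33.0 + 10.7 * duration + 6.0 * rng.normal(size=136)   # synthetic, NOT the real data

A = np.stack([np.ones_like(duration), duration], axis=1)      # rows phi(x_i) = (1, x_i)
w, *_ = np.linalg.lstsq(A, delay, rcond=None)
a, b = w
print(f"predicted delay ≈ {a:.1f} + {b:.1f} * duration")
```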
-
Final remarks on features
I “Feature engineering” can drastically change the power of a model.
I Some people consider it messy, unprincipled, pure “trial-and-error”.
I Deep learning is somewhat touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the "convolutional neural network" (side question: who came up with that?)).
28 / 94
-
8. Statistical view of least squares; maximum likelihood
-
Maximum likelihood estimation (MLE) refresher
Parametric statistical model:
P = {P_θ : θ ∈ Θ}, a collection of probability distributions for observed data.

I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).
I P_θ: a particular probability distribution for observed data.

Likelihood of θ ∈ Θ given observed data x:
For discrete X ∼ P_θ with probability mass function p_θ,

L(θ) := p_θ(x).

For continuous X ∼ P_θ with probability density function f_θ,

L(θ) := f_θ(x).

Maximum likelihood estimator (MLE):
Let θ̂ be the θ ∈ Θ of highest likelihood given the observed data.
29 / 94
-
Distributions over labeled examples
X: Space of possible side-information (feature space).
Y: Space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in X × Y can be thought of in two parts:

1. Marginal distribution P_X of X:
   P_X is a probability distribution on X.

2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ X:
   P_{Y|X=x} is a probability distribution on Y.
This lecture: Y = R (regression problems).
30 / 94
-
Optimal predictor
What function f : X → R has the smallest (squared loss) risk

R(f) := E[(f(X) − Y)²]?

Note: earlier we discussed empirical risk.

I Conditional on X = x, the minimizer of the conditional risk

  ŷ ↦ E[(ŷ − Y)² | X = x]

  is the conditional mean E[Y | X = x].

I Therefore, the function f⋆ : R → R where

  f⋆(x) = E[Y | X = x], x ∈ R,

  has the smallest risk.

I f⋆ is called the regression function or conditional mean function.
31 / 94
-
Linear regression models
When side-information is encoded as vectors of real numbers x = (x1, . . . , xd) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ∼ N(xᵀw, σ²), x ∈ R^d.

I Parameters: w = (w1, . . . , wd) ∈ R^d, σ² > 0.
I X = (X1, . . . , Xd), a random vector (i.e., a vector of random variables).
I The conditional distribution of Y given X is normal.
I The marginal distribution of X is not specified.

In this model, the regression function f⋆ is the linear function f_w : R^d → R,

f_w(x) = xᵀw = ∑_{i=1}^d x_i w_i, x ∈ R^d.

(We'll often refer to f_w just by w.)

[Plot: the line f⋆ through noisy (x, y) points, x ∈ [−1, 1], y ∈ [−5, 5].]
32 / 94
-
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise:
(X1, Y1), . . . , (Xn, Yn), (X, Y) are iid, with

Y | X = x ∼ N(xᵀw, σ²), x ∈ R^d.

(Traditional to study linear regression in the context of this model.)

Log-likelihood of (w, σ²), given data (X_i, Y_i) = (x_i, y_i) for i = 1, . . . , n:

∑_{i=1}^n { −(1/(2σ²)) (x_iᵀw − y_i)² + (1/2) ln(1/(2πσ²)) } + { terms not involving (w, σ²) }.

The w that maximizes the log-likelihood is also the w that minimizes

(1/n) ∑_{i=1}^n (x_iᵀw − y_i)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .
33 / 94
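A small numerical sketch of this equivalence (assumed, synthetic data): for a fixed σ, the Gaussian negative log-likelihood in w is the squared loss up to an additive constant and a positive scale, so both criteria prefer the same w.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 0.0, -2.0]) + sigma * rng.normal(size=n)

def neg_log_lik(w):
    # -sum_i log N(y_i; x_i^T w, sigma^2), dropping w-independent constants
    return np.sum((X @ w - y) ** 2) / (2 * sigma ** 2)

def emp_risk(w):
    return np.mean((X @ w - y) ** 2)

w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
w_other = w_erm + np.array([0.1, -0.1, 0.0])

# The ERM minimizer also has the smaller (i.e., better) negative log-likelihood.
print(neg_log_lik(w_erm) < neg_log_lik(w_other), emp_risk(w_erm) < emp_risk(w_other))
```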
-
Empirical distribution and empirical risk
The empirical distribution P_n on (x1, y1), . . . , (xn, yn) has probability mass function p_n given by

p_n((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: The goal is to find a function f that minimizes the (squared loss) risk

R(f) = E[(f(X) − Y)²].

But we don't know the distribution P of (X, Y).

Replace P with P_n → empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(x_i) − y_i)².
(“Plug-in principle” is used throughout statistics in this same way.)
34 / 94
-
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns

ŵ ∈ argmin_{w ∈ R^d} R̂(w),

which coincides with the MLE under the basic linear regression model.

In general:

I MLE makes sense in the context of the statistical model for which it is derived.
I ERM makes sense in the context of the general iid model for supervised learning.

Further remarks.

I In MLE, we assume a model, and we not only maximize likelihood, but can try to argue that we "recover" a "true" parameter.
I In ERM, by default there is no assumption of a "true" parameter to recover.

Useful examples: medical testing, gene expression, . . .
35 / 94
-
Old faithful data under this least squares statistical model
Recall our data, consisting of historical records of eruptions:
[Timeline: eruption start/end times with the gaps between eruptions labeled Y1, Y2, Y3.]

Statistical model (not just IID!): Y1, . . . , Yn, Y ∼_iid N(µ, σ²).

I Data: Y_i := a_i − b_{i−1}, i = 1, . . . , n.
  (Admittedly not a great model, since durations are non-negative.)

Task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

For the linear regression model, we'll assume

Y | X = x ∼ N(xᵀw, σ²), x ∈ R^d.
(This extends the model above if we add the “1” feature.)
36 / 94
-
9. Regularization and ridge regression
-
Inductive bias
Suppose the ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.

I Fact: The shortest solution ŵ to (AᵀA)w = Aᵀb is always unique.

I Obtain ŵ via

  ŵ = A⁺b,

  where A⁺ is the (Moore-Penrose) pseudoinverse of A (see the sketch below).

Why should this be a good idea?

I The data does not give a reason to choose a shorter w over a longer w.

I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the "true" w⋆ is short), not for others.
All learning algorithms encode some kind of inductive bias.
37 / 94
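A sketch (assumed, synthetic data): when AᵀA is singular there are many solutions to the normal equations; A⁺b picks the one with the smallest norm, and adding any nullspace direction gives another solution with the same training predictions but larger norm.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 4, 6                              # underdetermined: infinitely many ERM solutions
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ b            # minimum-norm solution

# Any vector in the nullspace of A can be added without changing Aw.
_, _, Vt = np.linalg.svd(A)
null_dir = Vt[-1]                        # right singular vector with A @ null_dir ≈ 0
w_other = w_min + 3.0 * null_dir

print(np.allclose(A @ w_min, A @ w_other))              # same predictions on the training data
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))  # but w_min is shorter
```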
-
Example
ERM with a scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Plots: the training data; an arbitrary ERM fit; the least ℓ2 norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!
38 / 94
-
Regularized ERM
Combine the two concerns: For a given λ ≥ 0, find the minimizer of

R̂(w) + λ‖w‖₂²

over w ∈ R^d.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression.
  (λ = 0 is ERM / ordinary least squares.)

Explicit solution: (AᵀA + λI)⁻¹Aᵀb (see the sketch below).

I Parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data fitting term R̂(w).

I Choose λ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ1.
39 / 94
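Not from the slides: a minimal sketch of the closed-form ridge solution above, on synthetic data. Whether R̂ carries a 1/n factor only rescales λ, and the λ values below are placeholders; in practice λ would be chosen by cross-validation.

```python
import numpy as np

def ridge(A, b, lam):
    """Closed-form ridge solution (A^T A + lam * I)^{-1} A^T b."""
    d = A.shape[1]
    # For lam > 0 the matrix A^T A + lam * I is positive definite, so the solve
    # succeeds (and the solution is unique) even when n < d.
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(0)
n, d = 20, 50                        # more features than examples
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

for lam in [1e-3, 1e-1, 10.0]:
    w = ridge(A, b, lam)
    print(lam, np.linalg.norm(w))    # larger lam pays more attention to the regularizer: shorter w
```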
-
10. True risk and overfitting
-
Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R.
Which w have smallest risk R(w) = E[(X^T w − Y)^2]?

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to

E[X X^T] w = E[Y X],

a system of linear equations called the population normal equations.

It can be proved that every critical point of R is a minimizer of R.

Looks familiar?

If (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, then

E[(1/n) A^T A] = E[X X^T] and E[(1/n) A^T b] = E[Y X],

so ERM can be regarded as a plug-in estimator for a minimizer of R.

40 / 94
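A small Monte Carlo illustration (my own construction, with a made-up distribution P, not from the slides) of the plug-in view: solving the empirical normal equations, i.e., ERM, approaches the population minimizer as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100_000
w_star = np.array([1.0, -2.0, 0.5])

# Made-up P: X standard normal in R^3, Y = X^T w_star + noise,
# so the population minimizer of R is exactly w_star.
X = rng.normal(size=(n, d))
Y = X @ w_star + rng.normal(size=n)

# Empirical counterparts of E[X X^T] and E[Y X] ...
second_moment = X.T @ X / n
cross_moment = X.T @ Y / n

# ... so solving the empirical normal equations is ERM, a plug-in estimate of w_star.
w_erm = np.linalg.solve(second_moment, cross_moment)
print(w_erm)    # close to [1.0, -2.0, 0.5]
```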
-
Risk of ERM

IID model: (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, taking values in R^d × R.

Let w⋆ be a minimizer of R over all w ∈ R^d, i.e., w⋆ satisfies the population normal equations

E[X X^T] w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).

I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

“asymptotically”, where W := E[X X^T]^{-1/2} X and ε := Y − X^T w⋆.

41 / 94
-
Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T w⋆ for each i = 1, . . . , n, so

E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] w⋆ = 0

and

√n (ŵ − w⋆) = ( (1/n) ∑_{i=1}^n X_i X_i^T )^{-1} (1/√n) ∑_{i=1}^n ε_i X_i.

1. By the LLN: (1/n) ∑_{i=1}^n X_i X_i^T → E[X X^T] (in probability).

2. By the CLT: (1/√n) ∑_{i=1}^n ε_i X_i → cov(εX)^{1/2} Z (in distribution), where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ŵ − w⋆) is

√n (ŵ − w⋆) → E[X X^T]^{-1} cov(εX)^{1/2} Z (in distribution).

A few more steps give

n ( E[(X^T ŵ − Y)^2] − E[(X^T w⋆ − Y)^2] ) → ‖E[X X^T]^{-1/2} cov(εX)^{1/2} Z‖_2^2 (in distribution).

The random variable on the RHS is “concentrated” around its mean tr(cov(εW)).

42 / 94
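The asymptotics can be checked numerically. Below is a rough Monte Carlo sketch (my own construction, not from the slides) in which E[X X^T] = I and ε is independent noise, so the asymptotic covariance of √n (ŵ − w⋆) should be cov(εX) = σ^2 I.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 2, 500, 2000
w_star = np.array([1.0, -1.0])
sigma = 0.5

draws = []
for _ in range(trials):
    X = rng.normal(size=(n, d))              # E[X X^T] = I
    eps = sigma * rng.normal(size=n)         # noise independent of X
    Y = X @ w_star + eps
    w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    draws.append(np.sqrt(n) * (w_hat - w_star))

# Empirical covariance of sqrt(n) (w_hat - w_star) vs. the asymptotic prediction
# E[X X^T]^{-1} cov(eps X) E[X X^T]^{-1} = sigma^2 I = 0.25 I here.
print(np.cov(np.array(draws).T))
```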
-
Risk of ERM: postscript

I The analysis does not assume that the linear regression model is “correct”; the data distribution need not come from the normal linear regression model.

I The only assumptions are those needed for the LLN and CLT to hold.

I However, if the normal linear regression model holds, i.e.,

Y | X = x ∼ N(x^T w⋆, σ^2),

then the bound from the theorem becomes

R(ŵ) − R(w⋆) = O( σ^2 d / n ),

which is familiar to those who have taken introductory statistics.

I With more work, one can also prove a non-asymptotic risk bound of similar form.

I In the homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.

43 / 94
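A quick simulation (my own, with placeholder parameters, not from the slides) of the well-specified case: the average excess risk of ERM roughly tracks σ^2 d / n.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 10, 1.0, 500
w_star = rng.normal(size=d)

def avg_excess_risk(n):
    # With X ~ N(0, I), the excess risk R(w_hat) - R(w_star) equals ||w_hat - w_star||^2.
    total = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        Y = X @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
        total += np.sum((w_hat - w_star) ** 2)
    return total / trials

for n in [50, 100, 200, 400]:
    print(n, avg_excess_risk(n), sigma**2 * d / n)   # the last two columns roughly agree
```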
-
Risk vs empirical risk

Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)
2. True risk of ERM: R(ŵ)

Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].

(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is “small”, but true risk is “much higher”.

44 / 94
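Illustration (not from the slides; synthetic data): averaged over many training sets, the empirical risk of the ERM solution stays below its true risk, which is estimated here on a large fresh test set.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, trials, sigma = 10, 30, 10_000, 300, 1.0
w_star = rng.normal(size=d)

train_risks, true_risks = [], []
for _ in range(trials):
    X = rng.normal(size=(n, d))
    Y = X @ w_star + sigma * rng.normal(size=n)
    w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

    train_risks.append(np.mean((X @ w_hat - Y) ** 2))       # empirical risk of ERM

    X_test = rng.normal(size=(n_test, d))                   # fresh data estimates the true risk
    Y_test = X_test @ w_star + sigma * rng.normal(size=n_test)
    true_risks.append(np.mean((X_test @ w_hat - Y_test) ** 2))

print(np.mean(train_risks), np.mean(true_risks))  # average empirical risk < average true risk
```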
-
Overfitting example
(X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid; X is continuous random variable in R.Suppose we use degree-k polynomial expansion
φ(x) = (1, x1, . . . , xk), x ∈ R,
so dimension is d = k + 1.
Fact: Any function on ≤ k + 1 points can be