
  • Linear regression

    CS 446

  • 1. Overview

  • todo

    check some continuity bugs
    make sure nothing missing from old lectures (both mine and Daniel's)
    fix some of those bugs, like b replacing y
    delete the excess material from end
    add proper summary slide which boils down concepts and reduces student worry

    1 / 94

  • Lecture 1: supervised learning

    Training data: labeled examples

    (x1, y1), (x2, y2), . . . , (xn, yn)

    where

    I each input xi is a machine-readable description of an instance (e.g., image, sentence), and

    I each corresponding label yi is an annotation relevant to the task; typically not easy to automatically obtain.

    Goal: learn a function f̂ from labeled examples that accurately “predicts” the labels of new (previously unseen) inputs.

    [Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label.]

    2 / 94

  • Lecture 2: nearest neighbors and decision trees

    [Scatter plot of labeled training points in the (x1, x2) plane.]

    Nearest neighbors. Training/fitting: memorize data. Testing/predicting: find k closest memorized points, return plurality label. Overfitting? Vary k.

    Decision trees. Training/fitting: greedily partition space, reducing “uncertainty”. Testing/predicting: traverse tree, output leaf label. Overfitting? Limit or prune tree.

    3 / 94

  • Lectures 3-4: linear regression

    [Scatter plot: duration (1.5 to 5.0) vs. delay (50 to 90).]

    Linear regression / least squares.

    Our first (of many!) linear prediction methods.

    Today:

    I Example.

    I How to solve it; ERM, and SVD.

    I Features.

    Next lecture: advanced topics, including overfitting.

    4 / 94

  • 2. Example: Old Faithful

  • Prediction problem: Old Faithful geyser (Yellowstone)

    Task: Predict time of next eruption.

    5 / 94

  • Time between eruptions

    Historical records of eruptions:

    [Timeline of eruption start times a0, a1, a2, . . . and end times b0, b1, b2, . . . , with inter-eruption gaps Y1, Y2, Y3.]

    Time until next eruption: Yi := ai − bi−1.

    Prediction task: At later time t (when an eruption ends), predict time of next eruption t + Y.

    On “Old Faithful” data:

    I Using 136 past observations, we form mean estimate µ̂ = 70.7941.

    I Can we do better?

    6 / 94


  • Looking at the data

    Naturalist Harry Woodward observed that time until the next eruption seems to be related to duration of last eruption.

    [Scatter plot: duration of last eruption (1.5 to 5.5) vs. time until next eruption (50 to 90).]

    7 / 94


  • Using side-information

    At prediction time t, duration of last eruption is available as side-information.

    [Timeline: past eruptions (an−1, bn−1, an, bn, . . . ) form the data; at the current time t, the duration of the last eruption is available as X, and Y is the time until the next eruption.]

    IID model for supervised learning: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid random pairs (i.e., labeled examples).

    X takes values in X (e.g., X = R), Y takes values in R.

    1. We observe (X1, Y1), . . . , (Xn, Yn), and then choose a prediction function (a.k.a. predictor)

    f̂ : X → R.

    This is called “learning” or “training”.

    2. At prediction time, observe X, and form prediction f̂(X).

    How should we choose f̂ based on data? Recall:

    I The model is our choice.

    I We must contend with overfitting, bad fitting algorithms, . . .

    8 / 94


  • 3. Least squares and linear regression

  • Which line?

    [Scatter plot: duration (1.5 to 5.0) vs. delay (50 to 90).]

    Let’s predict with a linear regressor:

    ŷ := wT(x, 1),

    where w ∈ R2 is learned from data.

    Remark: appending 1 makes this an affine function x 7→ w1x + w2. (More on this later. . . )

    If data lies along a line, we should output that line. But what if not?

    9 / 94


  • ERM setup for least squares.

    I Predictors/model: f̂(x) = wTx; a linear predictor/regressor. (For linear classification: x 7→ sgn(wTx).)

    I Loss/penalty: the least squares loss

    ℓls(y, ŷ) = (y − ŷ)².

    (Some conventions scale this by 1/2.)

    I Goal: minimize the least squares empirical risk

    R̂ls(f̂) = (1/n) ∑i ℓls(yi, f̂(xi)) = (1/n) ∑i (yi − f̂(xi))².

    I Specifically, we choose w ∈ Rd according to

    arg minw∈Rd R̂ls(x 7→ wTx) = arg minw∈Rd (1/n) ∑i (yi − wTxi)².

    I More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.

    10 / 94
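
To make this ERM recipe concrete, here is a minimal numpy sketch (not from the slides; the synthetic data and all numbers are made up for illustration): it forms a small dataset, evaluates the least squares empirical risk for a candidate w, and minimizes it with a standard solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: inputs with a 1 appended, noisy affine labels.
n = 136
duration = rng.uniform(1.5, 5.0, size=n)
X = np.column_stack([duration, np.ones(n)])            # each row is (x_i, 1)
y = 10.0 * duration + 35.0 + rng.normal(0, 5.0, size=n)

def empirical_risk(w):
    """Least squares empirical risk (1/n) sum_i (y_i - w^T x_i)^2."""
    return np.mean((y - X @ w) ** 2)

# ERM: minimize the empirical risk over w (lstsq does exactly this).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("ERM solution w_hat:", np.round(w_hat, 3))
print("risk at w_hat:", round(empirical_risk(w_hat), 3))
print("risk at w = 0:", round(empirical_risk(np.zeros(2)), 3))
```

The same minimizer reappears below as a solution of the normal equations.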

  • ERM in general

    I Pick a family of models/predictors F. (For today, we use linear predictors.)

    I Pick a loss function ℓ. (For today, we chose squared loss.)

    I Minimize the empirical risk over the model parameters.

    We haven’t discussed: true risk and overfitting; how to minimize; why this is a good idea.

    Remark: ERM is convenient in pytorch: just pick a model, a loss, an optimizer, and tell it to minimize.

    11 / 94
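
As a hedged illustration of the pytorch remark above (a generic sketch, not code from the course; the toy data and hyperparameters are arbitrary): pick a model, a loss, an optimizer, and tell it to minimize.

```python
import torch

torch.manual_seed(0)

# Toy data: y is roughly 2x + 1 with a little noise.
X = torch.rand(200, 1)
y = 2.0 * X + 1.0 + 0.1 * torch.randn(200, 1)

model = torch.nn.Linear(1, 1)                      # the model (linear predictor)
loss_fn = torch.nn.MSELoss()                       # the loss (squared loss)
opt = torch.optim.SGD(model.parameters(), lr=0.5)  # the optimizer

for _ in range(500):                               # minimize the empirical risk
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(model.weight.item(), model.bias.item(), loss.item())
```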


  • Least squares ERM in pictures

    Red dots: data points.

    Affine hyperplane: our predictions (via affine expansion (x1, x2) 7→ (1, x1, x2)).

    ERM: minimize sum of squared vertical lengths from hyperplane to points.

    12 / 94

  • Empirical risk minimization in matrix notation

    Define n× d matrix A and n× 1 column vector b by

    A := (1/√n) · (the matrix with rows x1T, . . . , xnT),    b := (1/√n) · (y1, . . . , yn)T.

    Can write empirical risk as

    R̂(w) = (1/n) ∑i (yi − xiTw)² = ‖Aw − b‖₂².

    Necessary condition for w to be a minimizer of R̂:

    ∇R̂(w) = 0, i.e., w is a critical point of R̂.

    This translates to (ATA)w = ATb,

    a system of linear equations called the normal equations.

    In an upcoming lecture we’ll prove every critical point of R̂ is a minimizer of R̂.

    13 / 94
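
A small numpy sketch of this matrix formulation (illustrative only; the data is synthetic, not from the lecture): build A and b, check the empirical risk identity, and solve the normal equations directly.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 136, 2
X = np.column_stack([rng.uniform(1.5, 5.0, size=n), np.ones(n)])  # rows x_i^T
y = X @ np.array([10.0, 35.0]) + rng.normal(0, 5.0, size=n)

# A has rows x_i^T / sqrt(n); b has entries y_i / sqrt(n).
A = X / np.sqrt(n)
b = y / np.sqrt(n)

w_any = rng.normal(size=d)
risk_sum = np.mean((y - X @ w_any) ** 2)           # (1/n) sum_i (y_i - x_i^T w)^2
risk_mat = np.sum((A @ w_any - b) ** 2)            # ||A w - b||_2^2
print("same empirical risk:", np.isclose(risk_sum, risk_mat))

# Solve the normal equations (A^T A) w = A^T b.
w = np.linalg.solve(A.T @ A, A.T @ b)
print("w from normal equations:", np.round(w, 3))
# The 1/sqrt(n) scaling does not change the solution.
print("same without scaling:", np.allclose(w, np.linalg.solve(X.T @ X, X.T @ y)))
```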


  • Summary on ERM and linear regression

    Procedure:

    I Form matrix A and vector b with data (resp. xi, yi) as rows. (Scaling factor 1/√n is not standard, doesn’t change solution.)

    I Find w satisfying the normal equations ATAw = ATb. (E.g., via Gaussian elimination, taking time O(nd²).)

    I In general, solutions are not unique. (Why not?)

    I If ATA is invertible, can choose (unique) (ATA)−1ATb.

    I Recall our original conundrum: we want to fit some line. We chose least squares; it gives one (family of) choice(s). Next lecture, with logistic regression, we get another.

    I Note: if Aw = b for some w, then data lies along a line, and we might as well not worry about picking a loss function.

    I Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with AT(Aw − b) = 0, which does have solutions. . .

    14 / 94


  • 4. SVD and least squares

  • SVD

    Recall the Singular Value Decomposition (SVD) M = USV T ∈ Rm×n, where

    I U ∈ Rm×r is orthonormal, S ∈ Rr×r is diag(s1, . . . , sr) with s1 ≥ s2 ≥ · · · ≥ sr ≥ 0, and V ∈ Rn×r is orthonormal, with r := rank(M). (If r = 0, use the convention of S = 0 ∈ R1×1.)

    I This convention is sometimes called the thin SVD.

    I Another notation is to write M = ∑i siuiviT. This avoids the issue with 0 (empty sum is 0). Moreover, this notation makes it clear that u1, . . . , ur span the column space and v1, . . . , vr span the row space of M.

    I The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of MTM and MMT.

    I Note: numpy and pytorch have SVD (interfaces differ slightly). Determining r runs into numerical issues.

    15 / 94
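
Since the slide mentions that numpy and pytorch provide the SVD, here is a small numpy sketch (the matrix sizes and tolerance choice are my own, not from the course) that computes an economy SVD of a rank-deficient matrix and estimates r numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

# A rank-deficient matrix so the "thin" part is visible (sizes are illustrative).
m, n, r = 6, 4, 2
M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank(M) = 2

# full_matrices=False gives the economy factors; truncating to the numerical
# rank below gives the thin SVD used on the slide.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Determine the numerical rank by thresholding the singular values.
tol = max(M.shape) * np.finfo(M.dtype).eps * s[0]
rank = int(np.sum(s > tol))
print("numerical rank:", rank)

# Keep only the first `rank` singular triples.
U_r, s_r, Vt_r = U[:, :rank], s[:rank], Vt[:rank, :]
print("reconstruction ok:", np.allclose(M, U_r @ np.diag(s_r) @ Vt_r))
```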

  • Pseudoinverse

    Let SVD M = ∑i siuiviT be given.

    I Define pseudoinverse M+ = ∑i (1/si) viuiT.
    (If 0 = M ∈ Rm×n, then 0 = M+ ∈ Rn×m.)

    I Alternatively, define pseudoinverse S+ of a diagonal matrix to be ST but with reciprocals of the non-zero elements; then M+ = V S+UT.

    I Also called Moore-Penrose Pseudoinverse; it is unique, even though the SVD is not unique (why not?).

    I If M−1 exists, then M−1 = M+ (why?).

    16 / 94
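
A quick numerical check of this definition (a sketch with a made-up matrix; np.linalg.pinv is the built-in Moore-Penrose pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 3))   # rank 2, not invertible

# Pseudoinverse from the thin SVD: M+ = sum_i (1/s_i) v_i u_i^T.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
tol = max(M.shape) * np.finfo(M.dtype).eps * s[0]
r = int(np.sum(s > tol))
M_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

# Same result as the built-in pseudoinverse.
print("matches np.linalg.pinv:", np.allclose(M_pinv, np.linalg.pinv(M)))
# In general M+ M and M M+ are projections, not identities.
print("M+ M == I ?", np.allclose(M_pinv @ M, np.eye(3)))
```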

  • SVD and least squares

    Recall: we’d like to find w such that

    ATAw = ATb.

    If w = A+b, then

    ATAw = (∑i siviuiT)(∑i siuiviT)(∑i (1/si) viuiT) b = (∑i siviuiT)(∑i uiuiT) b = ATb.

    Henceforth, define ŵols = A+b as the OLS solution.
    (OLS = “ordinary least squares”.)

    Note: in general, AA+ = ∑i uiuiT ≠ I.

    17 / 94
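
This can be checked numerically; in the sketch below (synthetic, under-determined data of my own choosing) ŵols = A⁺b satisfies the normal equations and matches numpy's least squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical under-determined problem: more features than examples,
# so A^T A is singular and the normal equations have many solutions.
n, d = 5, 8
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# OLS solution from the slide: w_ols = A^+ b.
w_ols = np.linalg.pinv(A) @ b

# It satisfies the normal equations (A^T A) w = A^T b ...
print("normal equations hold:", np.allclose(A.T @ A @ w_ols, A.T @ b))
# ... and matches lstsq, which also returns the minimum-norm solution.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print("same as lstsq:", np.allclose(w_ols, w_lstsq))
```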

  • 5. Summary of linear regression so far

  • Main points

    I Model/function/predictor class of linear regressors x 7→ wTx.

    I ERM principle: we chose a loss (least squares) and find a good predictor by minimizing empirical risk.

    I ERM solution for least squares: pick w satisfying ATAw = ATb, which is not unique; one unique choice is the ordinary least squares solution A+b.

    18 / 94

  • Part 2 of linear regression lecture. . .

  • Recap on SVD. (A messy slide, I’m sorry.)

    Suppose 0 ≠ M ∈ Rn×d, thus r := rank(M) > 0.

    I “Decomposition form” thin SVD: M = ∑i siuiviT with s1 ≥ · · · ≥ sr > 0, and M+ = ∑i (1/si) viuiT; in general M+M = ∑i viviT ≠ I.

    I “Factorization form” thin SVD: M = USVT, where U ∈ Rn×r and V ∈ Rd×r have orthonormal columns but UUT and VVT are not identity matrices in general, and S = diag(s1, . . . , sr) ∈ Rr×r with s1 ≥ · · · ≥ sr > 0; pseudoinverse M+ = V S−1UT, and in general M+M ≠ MM+ ≠ I.

    I Full SVD: M = UfSfVfT, where Uf ∈ Rn×n and Vf ∈ Rd×d are orthonormal and full rank, so UfTUf and VfTVf are identity matrices, and Sf ∈ Rn×d is zero everywhere except the first r diagonal entries, which are s1 ≥ · · · ≥ sr > 0; pseudoinverse M+ = VfSf+UfT, where Sf+ is obtained by transposing Sf and then taking reciprocals of the nonzero entries, and in general M+M ≠ MM+ ≠ I. Additional property: agreement with the eigendecompositions of MMT and MTM.

    The “full SVD” adds columns to U and V which hit zeros of S and therefore don’t matter (as a sanity check, verify for yourself that all these SVDs are equal).

    19 / 94

  • Recap on SVD, zero matrix case

    Suppose 0 = M ∈ Rn×d, thus r := rank(M) = 0.

    I In all types of SVD, M+ is MT (another zero matrix).

    I Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and MTu = sv; the zero matrix therefore has no singular values (or left/right singular vectors).

    I “Factorization form thin SVD” becomes a little messy.

    20 / 94

  • 6. More on the normal equations

  • Recall our matrix notation

    Let labeled examples ((xi, yi))i=1,...,n be given.

    Define n× d matrix A and n× 1 column vector b by

    A := (1/√n) · (the matrix with rows x1T, . . . , xnT),    b := (1/√n) · (y1, . . . , yn)T.

    Can write empirical risk as

    R̂(w) = (1/n) ∑i (yi − xiTw)² = ‖Aw − b‖₂².

    Necessary condition for w to be a minimizer of R̂:

    ∇R̂(w) = 0, i.e., w is a critical point of R̂.

    This translates to (ATA)w = ATb,

    a system of linear equations called the normal equations.

    We’ll now finally show that normal equations imply optimality.

    21 / 94


  • Normal equations imply optimality

    Consider w with ATAw = ATy, and any w′; then

    ‖Aw′ − y‖2 = ‖Aw′ −Aw +Aw − y‖2

    = ‖Aw′ −Aw‖2 + 2(Aw′ −Aw)T(Aw − y) + ‖Aw − y‖2.

    Since

    (Aw′ −Aw)T(Aw − y) = (w′ −w)T(ATAw −ATy) = 0,

    then ‖Aw′ − y‖2 = ‖Aw′ −Aw‖2 + ‖Aw − y‖2. This means w is optimal.

    Moreover, writing A = ∑i siuiviT,

    ‖Aw′ −Aw‖2 = (w′ −w)T(ATA)(w′ −w) = (w′ −w)T (∑i si²viviT) (w′ −w),

    so w′ is optimal iff w′ −w is in the right nullspace of A.

    (We’ll revisit all this with convexity later.)

    22 / 94
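
A numerical illustration of the two claims above (a sketch on random data; the wide shape is chosen so A has a nontrivial nullspace):

```python
import numpy as np

rng = np.random.default_rng(5)

n, d = 4, 6
A = rng.normal(size=(n, d))
y = rng.normal(size=n)

# A solution of the normal equations.
w = np.linalg.pinv(A) @ y

def risk(v):
    return np.sum((A @ v - y) ** 2)

# Pythagorean identity: for any w',
#   ||Aw' - y||^2 = ||Aw' - Aw||^2 + ||Aw - y||^2,  so w is a minimizer.
w_prime = rng.normal(size=d)
lhs = risk(w_prime)
rhs = np.sum((A @ w_prime - A @ w) ** 2) + risk(w)
print("identity holds:", np.isclose(lhs, rhs))

# Adding a nullspace direction to w leaves the risk unchanged.
_, _, Vt = np.linalg.svd(A)
null_dir = Vt[-1]                 # right singular vector with (numerically) zero singular value
print("A @ null_dir ~ 0:", np.allclose(A @ null_dir, 0))
print("same risk:", np.isclose(risk(w), risk(w + 3.0 * null_dir)))
```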


  • Geometric interpretation of least squares ERM

    Let aj ∈ Rn be the j-th column of matrix A ∈ Rn×d, so

    A = [a1 · · · ad]  (columns a1, . . . , ad).

    Minimizing ‖Aw − b‖22 is the same as finding vector b̂ ∈ range(A) closest to b.

    Solution b̂ is orthogonal projection of b onto range(A) = {Aw : w ∈ Rd}.

    [Figure: b projected orthogonally onto the plane spanned by a1 and a2.]

    I b̂ is uniquely determined; indeed, b̂ = AA+b = (∑i uiuiT) b.

    I If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.

    If rank(A) < d, then the ERM solution is not unique.

    To get w from b̂: solve the system of linear equations Aw = b̂.

    23 / 94
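
The projection picture can be verified directly (a small sketch with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(6)

n, d = 10, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# Orthogonal projection of b onto range(A): b_hat = A A^+ b.
b_hat = A @ np.linalg.pinv(A) @ b

# The residual b - b_hat is orthogonal to every column of A ...
print("residual orthogonal to columns:", np.allclose(A.T @ (b - b_hat), 0))
# ... and b_hat = A w for the least squares w.
w, *_ = np.linalg.lstsq(A, b, rcond=None)
print("b_hat equals A w:", np.allclose(b_hat, A @ w))
```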


  • 7. Features

  • Enhancing linear regression models with features

    Linear functions alone are restrictive, but become powerful with creative side-information, or features.

    Idea: Predict with x 7→ wTφ(x), where φ is a feature mapping.

    Examples:

    1. Non-linear transformations of existing variables: for x ∈ R,

    φ(x) = ln(1 + x).

    2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}d,

    φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).

    3. Trigonometric expansion: for x ∈ R,

    φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

    4. Polynomial expansion: for x = (x1, . . . , xd) ∈ Rd,

    φ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , xd−1xd).

    24 / 94
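
As an illustration of the feature-map idea (a sketch with made-up data; the feature map follows the trigonometric expansion of example 3 above, truncated after two frequencies):

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up 1-d problem where a plain line underfits.
x = rng.uniform(-3, 3, size=200)
y = np.sin(2 * x) + rng.normal(0, 0.1, size=200)

def phi(x):
    """Truncated trigonometric expansion (1, sin x, cos x, sin 2x, cos 2x)."""
    return np.column_stack([np.ones_like(x), np.sin(x), np.cos(x),
                            np.sin(2 * x), np.cos(2 * x)])

# Linear regression is unchanged: just replace x_i by phi(x_i).
w, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
line = np.column_stack([np.ones_like(x), x])
w_line, *_ = np.linalg.lstsq(line, y, rcond=None)

print("MSE with features:", round(np.mean((phi(x) @ w - y) ** 2), 3))
print("MSE with a line  :", round(np.mean((line @ w_line - y) ** 2), 3))
```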


  • Example: Taking advantage of linearity

    Suppose you are trying to predict some health outcome.

    I Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

    φ(x) = (xtemp − 98.6)2.

    I What if you didn’t know about this magic constant 98.6?

    I Instead, use φ(x) = (1, xtemp, xtemp²).

    Can learn coefficients w such that

    wTφ(x) = (xtemp − 98.6)2,

    or any other quadratic polynomial in xtemp (which may be better!).

    25 / 94
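
A sketch of this example with simulated data (the outcome is generated exactly from the physician's feature, so the expansion (1, x, x²) can recover the constant 98.6 without being told it):

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data whose outcome really is (x_temp - 98.6)^2 plus noise.
temps = rng.normal(98.6, 1.0, size=500)
outcome = (temps - 98.6) ** 2 + rng.normal(0, 0.05, size=500)

Phi = np.column_stack([np.ones_like(temps), temps, temps ** 2])
w, *_ = np.linalg.lstsq(Phi, outcome, rcond=None)

# w^T phi(x) = w0 + w1 x + w2 x^2 is a quadratic whose minimizer is -w1/(2 w2);
# it should sit near the "magic constant" 98.6.
print("recovered normal temperature:", round(-w[1] / (2 * w[2]), 3))
print("fit MSE:", round(np.mean((Phi @ w - outcome) ** 2), 4))
```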

  • Quadratic expansion

    Quadratic function f : R→ R

    f(x) = ax² + bx + c, x ∈ R,

    for a, b, c ∈ R.

    This can be written as a linear function of φ(x), where

    φ(x) := (1, x, x²),

    since f(x) = wTφ(x)

    where w = (c, b, a).

    For a multivariate quadratic function f : Rd → R, use

    φ(x) := (1, x1, . . . , xd (linear terms), x1², . . . , xd² (squared terms), x1x2, . . . , x1xd, . . . , xd−1xd (cross terms)).

    26 / 94
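
A small helper that builds this expansion programmatically (my own sketch; for d input coordinates it produces 1 + d + d + d(d−1)/2 features):

```python
import numpy as np
from itertools import combinations

def quadratic_expansion(x):
    """phi(x) = (1, linear terms, squared terms, cross terms) for x in R^d."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

# d = 3 gives 1 + 3 + 3 + 3 = 10 features.
print(quadratic_expansion([1.0, 2.0, 3.0]))
```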


  • Affine expansion and “Old Faithful”

    Woodward needed an affine expansion for “Old Faithful” data:

    φ(x) := (1, x).

    [Plot: duration of last eruption (0 to 6) vs. time until next eruption (0 to 100), with the fitted affine function overlaid.]

    Affine function fa,b : R→ R for a, b ∈ R,

    fa,b(x) = a+ bx,

    is a linear function fw of φ(x) for w = (a, b).

    (This easily generalizes to multivariate affine functions.)

    27 / 94
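
A sketch of the affine fit on synthetic stand-in data (the numbers are invented, not the actual Old Faithful records), compared against the feature-free mean predictor µ̂ from the start of the lecture:

```python
import numpy as np

rng = np.random.default_rng(9)

# Synthetic stand-in: delay depends affinely on the last eruption's duration, plus noise.
n = 136
duration = rng.uniform(1.5, 5.0, size=n)
delay = 33.0 + 10.7 * duration + rng.normal(0, 6.0, size=n)

# Affine expansion phi(x) = (1, x), then ordinary least squares.
Phi = np.column_stack([np.ones(n), duration])
w, *_ = np.linalg.lstsq(Phi, delay, rcond=None)
print("fitted affine predictor a + b*x, with (a, b) =", np.round(w, 2))

# Compare to the feature-free mean predictor mu_hat.
mu_hat = delay.mean()
print("MSE, mean predictor  :", round(np.mean((delay - mu_hat) ** 2), 2))
print("MSE, affine predictor:", round(np.mean((delay - Phi @ w) ** 2), 2))
```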


  • Final remarks on features

    I “Feature engineering” can drastically change the power of a model.

    I Some people consider it messy, unprincipled, pure “trial-and-error”.

    I Deep learning is somewhat touted as removing some of this, but it doesn’t do so completely (e.g., it took a lot of work to come up with the “convolutional neural network” (side question: who came up with that?)).

    28 / 94

  • 8. Statistical view of least squares; maximum likelihood

  • Maximum likelihood estimation (MLE) refresher

    Parametric statistical model: P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.

    I Θ: parameter space.

    I θ ∈ Θ: a particular parameter (or parameter vector).

    I Pθ: a particular probability distribution for observed data.

    Likelihood of θ ∈ Θ given observed data x: For discrete X ∼ Pθ with probability mass function pθ,

    L(θ) := pθ(x).

    For continuous X ∼ Pθ with probability density function fθ,

    L(θ) := fθ(x).

    Maximum likelihood estimator (MLE): Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.

    29 / 94
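
A tiny illustration of the definition (simulated data; the grid search is just for exposition): for iid N(µ, σ²) observations, the µ of highest likelihood coincides with the sample mean, which is the µ̂ used earlier in the lecture.

```python
import numpy as np

rng = np.random.default_rng(10)
y = rng.normal(70.8, 13.0, size=136)     # made-up stand-in for the gap data

def log_likelihood(mu, sigma2):
    return np.sum(-0.5 * (y - mu) ** 2 / sigma2 - 0.5 * np.log(2 * np.pi * sigma2))

mus = np.linspace(60, 80, 2001)
lls = [log_likelihood(mu, y.var()) for mu in mus]
mu_mle = mus[int(np.argmax(lls))]

# The grid maximizer agrees with the sample mean up to the grid resolution (0.01).
print("grid-search MLE:", round(mu_mle, 2), " sample mean:", round(y.mean(), 2))
```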


  • Distributions over labeled examples

    X : Space of possible side-information (feature space).
    Y : Space of possible outcomes (label space or output space).

    Distribution P of random pair (X,Y ) taking values in X × Y can be thought of in two parts:

    1. Marginal distribution PX of X:

    PX is a probability distribution on X .

    2. Conditional distribution PY |X=x of Y given X = x for each x ∈ X : PY |X=x is a probability distribution on Y.

    This lecture: Y = R (regression problems).

    30 / 94


  • Optimal predictor

    What function f : X → R has smallest (squared loss) risk

    R(f) := E[(f(X)− Y )2]?

    Note: earlier we discussed empirical risk.

    I Conditional on X = x, the minimizer of conditional risk

    ŷ 7→ E[(ŷ − Y )2 | X = x]

    is the conditional mean E[Y | X = x].

    I Therefore, the function f? : R→ R where

    f?(x) = E[Y | X = x], x ∈ R

    has the smallest risk.

    I f? is called the regression function or conditional mean function.

    31 / 94
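
A Monte Carlo sanity check of this claim (simulated data where the conditional mean is known to be sin(x); the competing predictors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)

N = 200_000
X = rng.uniform(-3, 3, size=N)
Y = np.sin(X) + rng.normal(0, 0.5, size=N)        # E[Y | X = x] = sin(x)

def risk(f):
    """Monte Carlo estimate of E[(f(X) - Y)^2]."""
    return round(float(np.mean((f(X) - Y) ** 2)), 3)

print("conditional mean f*(x) = sin(x):", risk(np.sin))
print("another predictor f(x) = x     :", risk(lambda x: x))
print("best constant predictor E[Y]   :", risk(lambda x: np.full_like(x, Y.mean())))
```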


  • Linear regression models

    When side-information is encoded as vectors of real numbers x = (x1, . . . , xd) (called features or variables), it is common to use a linear regression model, such as the following:

    Y |X = x ∼ N(xTw, σ2), x ∈ Rd.

    I Parameters: w = (w1, . . . , wd) ∈ Rd, σ2 > 0.

    I X = (X1, . . . , Xd), a random vector (i.e., a vector of random variables).

    I Conditional distribution of Y given X is normal.

    I Marginal distribution of X not specified.

    In this model, the regression function f? is a linear function fw : Rd → R,

    fw(x) = xTw = ∑i xiwi, x ∈ Rd.

    (We’ll often refer to fw just by w.)

    [Plot: x vs. y with the regression function f* drawn through the data.]

    32 / 94


  • Maximum likelihood estimation for linear regression

    Linear regression model with Gaussian noise: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, with

    Y |X = x ∼ N(xTw, σ2), x ∈ Rd.

    (Traditional to study linear regression in context of this model.)

    Log-likelihood of (w, σ2), given data (Xi, Yi) = (xi, yi) for i = 1, . . . , n:

    ∑i { −(1/(2σ²)) (xiTw − yi)² + (1/2) ln(1/(2πσ²)) } + { terms not involving (w, σ²) }.

    The w that maximizes the log-likelihood is also the w that minimizes

    (1/n) ∑i (xiTw − yi)².

    This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

    33 / 94
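
A numerical check of this equivalence (a sketch; the true w, σ, and sizes are invented): for any fixed σ², the least squares w has log-likelihood at least as large as nearby alternatives.

```python
import numpy as np

rng = np.random.default_rng(12)

# Simulate from the model Y | X = x ~ N(x^T w, sigma^2)  (w and sigma made up).
n, d = 200, 3
w_true, sigma = np.array([1.0, -2.0, 0.5]), 0.7
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(0, sigma, size=n)

def log_likelihood(w, sigma2):
    resid = y - X @ w
    return np.sum(-0.5 * resid ** 2 / sigma2 + 0.5 * np.log(1.0 / (2 * np.pi * sigma2)))

# Least squares / ERM solution.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# For fixed sigma^2, no w has higher likelihood than the least squares w.
sigma2 = 1.0
perturbations = [w_ls + 0.1 * rng.normal(size=d) for _ in range(5)]
print("log-lik at w_ls :", round(log_likelihood(w_ls, sigma2), 3))
print("log-lik nearby  :", [round(log_likelihood(w, sigma2), 3) for w in perturbations])
```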


  • Empirical distribution and empirical risk

    Empirical distribution Pn on (x1, y1), . . . , (xn, yn) has probability mass function pn given by

    pn((x, y)) := (1/n) ∑i 1{(x, y) = (xi, yi)}, (x, y) ∈ Rd × R.

    Plug-in principle: Goal is to find function f that minimizes (squared loss) risk

    R(f) = E[(f(X)− Y )2].

    But we don’t know the distribution P of (X,Y ).

    Replace P with Pn → Empirical (squared loss) risk R̂(f):

    R̂(f) := (1/n) ∑i (f(xi) − yi)².

    (“Plug-in principle” is used throughout statistics in this same way.)

    34 / 94


  • Empirical risk minimization

    Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

    For linear functions and squared loss: ERM returns

    ŵ ∈ arg minw∈Rd R̂(w),

    which coincides with MLE under the basic linear regression model.

    In general:

    I MLE makes sense in context of statistical model for which it is derived.

    I ERM makes sense in context of general iid model for supervised learning.

    Further remarks.

    I In MLE, we assume a model, and we not only maximize likelihood, but can try to argue we “recover” a “true” parameter.

    I In ERM, by default there is no assumption of a “true” parameter to recover.

    Useful examples: medical testing, gene expression, . . .

    35 / 94


  • Old faithful data under this least squares statistical model

    Recall our data, consisting of historical records of eruptions:

    [Timeline of eruption start/end times with inter-eruption gaps Y1, Y2, Y3.]

    Statistical model (not just IID!): Y1, . . . , Yn, Y ∼iid N(µ, σ²).

    I Data: Yi := ai − bi−1, i = 1, . . . , n.

    (Admittedly not a great model, since durations are non-negative.)

    Task: At later time t (when an eruption ends), predict time of next eruption t + Y. For the linear regression model, we’ll assume

    Y |X = x ∼ N(xTw, σ2), x ∈ Rd.

    (This extends the model above if we add the “1” feature.)

    36 / 94


  • 9. Regularization and ridge regression

  • Inductive bias

    Suppose ERM solution is not unique. What should we do?

    One possible answer: Pick the w of shortest length.

    I Fact: The shortest solution ŵ to (AᵀA)w = Aᵀb is always unique.

    I Obtain ŵ via ŵ = A⁺b,

    where A⁺ is the (Moore-Penrose) pseudoinverse of A.

    Why should this be a good idea?

    I Data does not give reason to choose a shorter w over a longer w.

    I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w⋆ is short), not for others.

    All learning algorithms encode some kind of inductive bias.

    37 / 94
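
    A small numpy sketch of this choice on synthetic underdetermined data (all numbers are illustrative): when n < d the normal equations have many solutions, and the pseudoinverse picks the one of least norm.

        import numpy as np

        rng = np.random.default_rng(1)
        n, d = 5, 10                          # more features than examples
        A = rng.normal(size=(n, d))
        b = rng.normal(size=n)

        w_short = np.linalg.pinv(A) @ b       # minimum-norm solution A^+ b

        # Adding any null-space direction of A gives another solution with the
        # same fit on the data but a larger norm.
        null_dir = np.linalg.svd(A)[2][-1]    # right singular vector with zero singular value
        w_long = w_short + 3.0 * null_dir

        print(np.allclose(A @ w_short, A @ w_long))              # same predictions on the data
        print(np.linalg.norm(w_short), np.linalg.norm(w_long))   # w_short is shorter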


  • Example

    ERM with scaled trigonometric feature expansion:

    φ(x) = (1, sin(x), cos(x), ½ sin(2x), ½ cos(2x), ⅓ sin(3x), ⅓ cos(3x), . . . ).

    [Plots of f(x) against x ∈ [0, 6]: the training data alone, the training data with an arbitrary ERM fit, and the training data with the least ℓ2 norm ERM fit.]

    It is not a given that the least norm ERM is better than the other ERM!

    38 / 94

  • Regularized ERM

    Combine the two concerns: For a given λ ≥ 0, find minimizer of

    R̂(w) + λ‖w‖₂²

    over w ∈ Rd.

    Fact: If λ > 0, then the solution is always unique (even if n < d)!

    I This is called ridge regression.

    (λ = 0 is ERM / Ordinary Least Squares.)

    Explicit solution: ŵ = (AᵀA + λI)⁻¹Aᵀb.

    I Parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R̂(w).

    I Choose λ using cross-validation.

    Note: in deep networks, this regularization is called “weight decay”. (Why?)

    Note: another popular regularizer for linear regression is ℓ1.

    39 / 94
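
    A minimal numpy version of the closed form above (using the slide’s formula; if R̂ is taken as the average rather than the sum of squared errors, the λ passed here corresponds to n times the λ in the objective), plus a crude hold-out search for λ standing in for cross-validation:

        import numpy as np

        def ridge_fit(A, b, lam):
            # Ridge regression: solve (A^T A + lam * I) w = A^T b.
            d = A.shape[1]
            return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

        def pick_lambda(A_tr, b_tr, A_val, b_val, grid=(1e-3, 1e-2, 1e-1, 1.0, 10.0)):
            # Pick the lambda whose ridge solution has the smallest validation error.
            def val_err(lam):
                w = ridge_fit(A_tr, b_tr, lam)
                return np.mean((A_val @ w - b_val) ** 2)
            return min(grid, key=val_err)

    For any λ > 0 the matrix AᵀA + λI is positive definite, so np.linalg.solve always succeeds and the minimizer is unique, matching the uniqueness claim above.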


  • 10. True risk and overfitting

  • Statistical interpretation of ERM

    Let (X, Y) ∼ P, where P is some distribution on Rᵈ × R.

    Which w have smallest risk R(w) = E[(Xᵀw − Y)²]?

    Necessary condition for w to be a minimizer of R:

    ∇R(w) = 0, i.e., w is a critical point of R.

    This translates to

    E[XXᵀ]w = E[Y X],

    a system of linear equations called the population normal equations.

    It can be proved that every critical point of R is a minimizer of R.

    Looks familiar?

    If (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, then

    E[(1/n)AᵀA] = E[XXᵀ] and E[(1/n)Aᵀb] = E[Y X],

    so ERM can be regarded as a plug-in estimator for a minimizer of R.

    40 / 94
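
    A quick Monte Carlo check of the plug-in view (the distribution below is a synthetic choice for illustration): as n grows, the solution of the empirical normal equations approaches the population minimizer w_star.

        import numpy as np

        rng = np.random.default_rng(3)
        d = 3
        Sigma = np.array([[1.0, 0.3, 0.0],
                          [0.3, 1.0, 0.2],
                          [0.0, 0.2, 1.0]])            # E[X X^T] for zero-mean X
        w_star = np.array([2.0, -1.0, 0.5])

        for n in (100, 10_000):
            X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
            Y = X @ w_star + rng.normal(size=n)        # noise independent of X, so E[Y X] = Sigma w_star
            w_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # empirical normal equations
            print(n, np.linalg.norm(w_hat - w_star))   # shrinks as n grows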


  • Risk of ERM

    IID model: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, taking values in Rd × R.

    Let w⋆ be a minimizer of R over all w ∈ Rᵈ, i.e., w⋆ satisfies the population normal equations

    E[XXᵀ]w⋆ = E[Y X].

    I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).

    I What about when ERM solution is unique?

    Theorem. Under mild assumptions on the distribution of X,

    R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

    “asymptotically”, where W := E[XXᵀ]^(−1/2) X and ε := Y − Xᵀw⋆.

    41 / 94
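
    A simulation sketch of the 1/n rate in the theorem, in a well-specified special case where E[XXᵀ] = I so the excess risk has the closed form R(w) − R(w⋆) = ‖w − w⋆‖₂² (all numbers are arbitrary illustrative choices):

        import numpy as np

        rng = np.random.default_rng(5)
        d, sigma, trials = 5, 0.5, 200
        w_star = rng.normal(size=d)

        def avg_excess_risk(n):
            gaps = []
            for _ in range(trials):
                X = rng.normal(size=(n, d))                   # E[X X^T] = I
                Y = X @ w_star + sigma * rng.normal(size=n)
                w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
                gaps.append(np.sum((w_hat - w_star) ** 2))    # R(w_hat) - R(w_star) here
            return np.mean(gaps)

        for n in (50, 100, 200, 400):
            print(n, avg_excess_risk(n))   # roughly halves as n doubles (about sigma^2 d / n)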


  • Risk of ERM analysis (rough sketch)

    Let εᵢ := Yᵢ − Xᵢᵀw⋆ for each i = 1, . . . , n, so

    E[εᵢXᵢ] = E[YᵢXᵢ] − E[XᵢXᵢᵀ]w⋆ = 0

    and

    √n(ŵ − w⋆) = ( (1/n) ∑ᵢ XᵢXᵢᵀ )⁻¹ (1/√n) ∑ᵢ εᵢXᵢ.

    1. By the LLN: (1/n) ∑ᵢ XᵢXᵢᵀ → E[XXᵀ] in probability.

    2. By the CLT: (1/√n) ∑ᵢ εᵢXᵢ → cov(εX)^(1/2) Z in distribution, where Z ∼ N(0, I).

    Therefore, the asymptotic distribution of √n(ŵ − w⋆) is

    √n(ŵ − w⋆) → E[XXᵀ]⁻¹ cov(εX)^(1/2) Z in distribution.

    A few more steps give

    n( E[(Xᵀŵ − Y)²] − E[(Xᵀw⋆ − Y)²] ) → ‖E[XXᵀ]^(−1/2) cov(εX)^(1/2) Z‖₂² in distribution.

    The random variable on the RHS is “concentrated” around its mean tr(cov(εW)).

    42 / 94


  • Risk of ERM: postscript

    I The analysis does not assume that the linear regression model is “correct”; the data distribution need not come from the normal linear regression model.

    I Only assumptions are those needed for LLN and CLT to hold.

    I However, if the normal linear regression model holds, i.e.,

    Y | X = x ∼ N(xᵀw⋆, σ²),

    then the bound from the theorem becomes

    R(ŵ) − R(w⋆) = O( σ²d / n ),

    which is familiar to those who have taken introductory statistics.

    I With more work, can also prove non-asymptotic risk bound of similar form.

    I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.

    43 / 94
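
    To see where σ²d comes from: under that model ε has conditional mean 0 and conditional variance σ² given X, so with W := E[XXᵀ]^(−1/2) X as before (and using E[εW] = 0),

        \operatorname{cov}(\varepsilon W)
          = \mathbb{E}\!\left[\varepsilon^{2} W W^{\top}\right]
          = \sigma^{2}\,\mathbb{E}[X X^{\top}]^{-1/2}\,\mathbb{E}[X X^{\top}]\,\mathbb{E}[X X^{\top}]^{-1/2}
          = \sigma^{2} I_{d},
        \qquad
        \operatorname{tr}\!\bigl(\operatorname{cov}(\varepsilon W)\bigr) = \sigma^{2} d .

    Plugging this into the theorem on the earlier slide gives the σ²d/n form.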


  • Risk vs empirical risk

    Let ŵ be ERM solution.

    1. Empirical risk of ERM: R̂(ŵ)

    2. True risk of ERM: R(ŵ)

    Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].

    (Empirical risk can sometimes be larger than true risk, but not on average.)

    Overfitting: empirical risk is “small”, but true risk is “much higher”.

    44 / 94
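
    A simulation sketch of this gap in a well-specified Gaussian setting (dimensions, sample size, and noise level are arbitrary choices); here the true risk of ŵ has the closed form ‖ŵ − w⋆‖₂² + 1, so both sides are easy to estimate:

        import numpy as np

        rng = np.random.default_rng(4)
        d, n, trials = 10, 20, 500
        w_star = rng.normal(size=d)
        emp_risks, true_risks = [], []
        for _ in range(trials):
            X = rng.normal(size=(n, d))                             # E[X X^T] = I
            Y = X @ w_star + rng.normal(size=n)                     # noise variance 1
            w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
            emp_risks.append(np.mean((X @ w_hat - Y) ** 2))
            true_risks.append(np.sum((w_hat - w_star) ** 2) + 1.0)  # R(w_hat) in this setting
        print(np.mean(emp_risks), np.mean(true_risks))              # empirical risk is much smaller on average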


  • Overfitting example

    (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid; X is a continuous random variable in R.

    Suppose we use the degree-k polynomial expansion

    φ(x) = (1, x, x², . . . , xᵏ), x ∈ R,

    so dimension is d = k + 1.

    Fact: Any function on ≤ k + 1 points can be