18-661 Introduction to Machine Learning
Linear Regression – I
Spring 2020
ECE – Carnegie Mellon University
Outline
1. Recap of MLE/MAP
2. Linear Regression
Motivation
Algorithm
Univariate solution
Multivariate Solution
Probabilistic interpretation
Computational and numerical optimization
Recap of MLE/MAP
Dogecoin
• Scenario: You find a coin on the ground.
• You ask yourself: Is this a fair or biased coin? What is the probability that I flip heads?
• You flip the coin 10 times . . .
• It comes up as 'H' 8 times and 'T' 2 times
• Can we learn from this data?
Machine Learning Pipeline
[Figure: ML pipeline: data → feature extraction → model & parameters → optimization → evaluation → intelligence]
Two approaches that we discussed:
• Maximum Likelihood Estimation (MLE)
• Maximum a Posteriori Estimation (MAP)
Maximum Likelihood Estimation (MLE)
• Data: Observed set D of nH heads and nT tails
• Model: Each flip follows a Bernoulli distribution:
P(H) = θ, P(T) = 1 − θ, θ ∈ [0, 1]
Thus, the likelihood of observing the sequence D is
P(D | θ) = θ^nH (1 − θ)^nT
• Question: Given this model and the data we've observed, can we calculate an estimate of θ?
• MLE: Choose the θ that maximizes the likelihood of the observed data:
θMLE = arg max_θ P(D | θ) = arg max_θ log P(D | θ) = nH / (nH + nT)
MAP for Dogecoin
θMAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)
• Recall that P(D | θ) = θ^nH (1 − θ)^nT
• How should we set the prior, P(θ)?
• A common choice for a binomial likelihood is the Beta distribution, θ ∼ Beta(α, β):
P(θ) = (1 / B(α, β)) · θ^(α−1) (1 − θ)^(β−1)
• Interpretation: α = number of expected heads, β = number of expected tails. A larger value of α + β denotes more confidence (and smaller variance).
Putting it all together
θMLE = nH / (nH + nT)
θMAP = (α + nH − 1) / (α + β + nH + nT − 2)
• Suppose θ* := 0.5 and we observe: D = {H, H, T, T, T, T}
• Scenario 1: We assume θ ∼ Beta(4, 4). Which is more accurate – θMLE or θMAP?
• θMAP = 5/12, θMLE = 1/3
• Scenario 2: We assume θ ∼ Beta(1, 7). Which is more accurate – θMLE or θMAP?
• θMAP = 1/6, θMLE = 1/3
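As a sanity check, both scenarios can be computed directly from these formulas. A small Python sketch (the variable names are ours):

n_H, n_T = 2, 4                      # observed D = {H, H, T, T, T, T}

theta_mle = n_H / (n_H + n_T)        # = 1/3

for alpha, beta in [(4, 4), (1, 7)]:
    theta_map = (alpha + n_H - 1) / (alpha + beta + n_H + n_T - 2)
    print(alpha, beta, theta_map)
# Beta(4, 4): theta_MAP = 5/12 ≈ 0.417, closer to θ* = 0.5 than the MLE
# Beta(1, 7): theta_MAP = 1/6 ≈ 0.167, further from θ* than the MLE

A well-chosen prior pulls the estimate toward the truth; a poorly chosen one pulls it away.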
Linear Regression
Outline: Motivation; Algorithm (univariate and multivariate solutions); Probabilistic interpretation; Computational and numerical optimization
Task 1: Regression
How much should you sell your house for?
[Figure: three panels (data → regression → intelligence), each plotting price ($) against house size:
input: houses & features; learn: the x → y relationship; predict: y (continuous)]
Course Covers: Linear/Ridge Regression, Loss Function, SGD, Feature Scaling, Regularization, Cross Validation
Supervised Learning
In a supervised learning problem, you have access to input variables (X) and outputs (Y), and the goal is to predict an output given an input.
• Examples:
• Housing prices (Regression): predict the price of a house based on features (size, location, etc.)
• Cat vs. Dog (Classification): predict whether a picture is of a cat or a dog
Regression
Predicting a continuous outcome variable:
• Predicting a company's future stock price using its profit and other financial info
• Predicting annual rainfall based on local flora and fauna
• Predicting distance from a traffic light using LIDAR measurements
Magnitude of the error matters:
• We can measure 'closeness' of predictions and labels, leading to different ways to evaluate prediction errors.
• Predicting stock price: better to be off by $1 than by $20
• Predicting distance from a traffic light: better to be off by 1 m than by 10 m
• We should choose learning models and algorithms accordingly.
Ex: predicting the sale price of a house
Retrieve historical sales records (this will be our training data).
[Figure: features used to predict]
[Figure: correlation between square footage and sale price, showing a roughly linear relationship]
Sale price ≈ price per sqft × square footage + fixed expense
Data Can be Compactly Represented by Matrices
[Figure: scatter of price ($) vs. house size with a fitted line]
• Learn the parameters (w0, w1) of the orange line y = w1·x + w0:
House 1 (1000 sqft): 1000 × w1 + w0 = 200,000
House 2 (2000 sqft): 2000 × w1 + w0 = 350,000
• Can represent compactly in matrix notation:

[ 1000  1 ] [ w1 ]   [ 200,000 ]
[ 2000  1 ] [ w0 ] = [ 350,000 ]
Some Concepts That You Should Know
• Invertibility of matrices and computing inverses
• Vector norms (ℓ2, Frobenius, etc.) and inner products
• Eigenvalues and eigenvectors
• Singular value decomposition
• Covariance matrices and positive semi-definiteness
Excellent resources:
• The "Essence of Linear Algebra" YouTube series
• Prof. Gilbert Strang's course at MIT
Matrix Inverse
• Let us solve the house-price prediction problem:

[ 1000  1 ] [ w1 ]   [ 200,000 ]
[ 2000  1 ] [ w0 ] = [ 350,000 ]                      (1)

[ w1 ]   [ 1000  1 ]⁻¹ [ 200,000 ]
[ w0 ] = [ 2000  1 ]   [ 350,000 ]                    (2)

       = (1/−1000) [ 1      −1   ] [ 200,000 ]
                   [ −2000  1000 ] [ 350,000 ]        (3)

       = (1/−1000) [ −150,000 ]
                   [ −5 × 10⁷ ]                       (4)

[ w1 ]   [ 150    ]
[ w0 ] = [ 50,000 ]                                   (5)
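For a system this small, the same computation is a few lines of NumPy (a sketch; in practice one calls a linear solver rather than forming the inverse explicitly):

import numpy as np

A = np.array([[1000., 1.],            # row per house: [sqft, 1]
              [2000., 1.]])
y = np.array([200_000., 350_000.])    # sale prices

w = np.linalg.solve(A, y)             # solves A w = y without explicit inversion
print(w)                              # [150., 50000.] -> $150/sqft, $50,000 fixed expense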
You could have data from many houses
• Sale price = price per sqft × square footage + fixed expense + unexplainable stuff
• Want to learn the price per sqft and the fixed expense
• Training data: past sales records

sqft    sale price
2000    800K
2100    907K
1100    312K
5500    2,600K
...     ...

Problem: there isn't a w = [w1, w0]ᵀ that will satisfy all the equations
Want to predict the best price per sqft and fixed expense
• Sale price = price per sqft × square footage + fixed expense + unexplainable stuff
• Want to learn the price per sqft and the fixed expense
• Training data: past sales records

sqft    sale price   prediction
2000    810K         720K
2100    907K         800K
1100    312K         350K
5500    2,600K       2,600K
...     ...          ...
Reduce prediction error
How to measure errors?
• absolute difference: |prediction − sale price|
• squared difference: (prediction − sale price)² [differentiable!]

sqft    sale price   prediction   abs error   squared error
2000    810K         720K         90K         90² = 8100
2100    907K         800K         107K        107² = 11,449
1100    312K         350K         38K         38² = 1444
5500    2,600K       2,600K       0           0

(squared errors in units of K²)
Geometric Illustration: Each house corresponds to one line
[Figure: the lines c1ᵀw = y1, c2ᵀw = y2, c3ᵀw = y3, c4ᵀw = y4 in parameter space, with the point w marked]

r = [ c1ᵀw − y1 ]
    [ c2ᵀw − y2 ]
    [    ...    ]
    [ c4ᵀw − y4 ]  = Aw − y

(here A is the matrix whose rows are the cᵢᵀ; in our later notation, X)
• Want to find w that minimizes the difference between Xw and y
• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar
Norms and Loss Functions
• A vector norm is any function f : Rⁿ → R with
• f(x) ≥ 0, and f(x) = 0 ⟺ x = 0
• f(ax) = |a| f(x) for a ∈ R
• triangle inequality: f(x + y) ≤ f(x) + f(y)
• e.g., ℓ2 norm: ‖x‖₂ = √(xᵀx) = √(Σ_{i=1}^n xᵢ²)
• e.g., ℓ1 norm: ‖x‖₁ = Σ_{i=1}^n |xᵢ|
• e.g., ℓ∞ norm: ‖x‖∞ = maxᵢ |xᵢ|
[Figure: from inside to outside, the ℓ1, ℓ2, and ℓ∞ norm balls]
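These norms are all one-liners in NumPy, e.g. (the sample vector is illustrative):

import numpy as np

x = np.array([3., -4., 0.])
print(np.linalg.norm(x, 1))        # l1 norm: |3| + |-4| + |0| = 7
print(np.linalg.norm(x, 2))        # l2 norm: sqrt(9 + 16) = 5
print(np.linalg.norm(x, np.inf))   # l-infinity norm: max |x_i| = 4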
Minimize squared errors
Our model:
Sale price = price per sqft × square footage + fixed expense + unexplainable stuff
Training data:

sqft    sale price   prediction   error   squared error
2000    810K         720K         90K     90² = 8100
2100    907K         800K         107K    107² = 11,449
1100    312K         350K         38K     38² = 1444
5500    2,600K       2,600K       0       0
...     ...
Total: 8100 + 11,449 + 1444 + 0 + · · ·

Aim: Adjust the price per sqft and fixed expense such that the sum of the squared errors is minimized, i.e., the unexplainable stuff is minimized.
Linear Regression – Algorithm
Linear regression
Setup:
• Input: x ∈ R^D (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w0 + Σ_{d=1}^D wd·xd = w0 + wᵀx
• w = [w1 w2 · · · wD]ᵀ: weights, parameters, or parameter vector
• w0 is called the bias
• Sometimes we also call w = [w0 w1 w2 · · · wD]ᵀ the parameters
• Training data: D = {(xn, yn), n = 1, 2, . . . , N}
Minimize the residual sum of squares:
RSS(w) = Σ_{n=1}^N [yn − f(xn)]² = Σ_{n=1}^N [yn − (w0 + Σ_{d=1}^D wd·xnd)]²
Linear Regression – Univariate solution
A simple case: x is just one-dimensional (D = 1)
Residual sum of squares:
RSS(w) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1·xn)]²
What kind of function is this? CONVEX (it has a unique global minimum)
A simple case: x is just one-dimensional (D = 1)
Stationary points: take the derivative with respect to each parameter and set it to zero:
∂RSS(w)/∂w0 = 0 ⇒ −2 Σ_n [yn − (w0 + w1·xn)] = 0
∂RSS(w)/∂w1 = 0 ⇒ −2 Σ_n [yn − (w0 + w1·xn)]·xn = 0
A simple case: x is just one-dimensional (D = 1)
∂RSS(w)/∂w0 = 0 ⇒ −2 Σ_n [yn − (w0 + w1·xn)] = 0
∂RSS(w)/∂w1 = 0 ⇒ −2 Σ_n [yn − (w0 + w1·xn)]·xn = 0
Simplify these expressions to get the "Normal Equations":
Σ yn = N·w0 + w1 Σ xn
Σ xn·yn = w0 Σ xn + w1 Σ xn²
Solving the system, we obtain the least squares coefficient estimates:
w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1·x̄
where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.
Example

sqft (1000's)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

Residual sum of squares:
RSS(w) = Σ_n [yn − f(xn)]² = Σ_n [yn − (w0 + w1·xn)]²
The w1 and w0 that minimize this are given by:
w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1·x̄
where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.
Plugging in the data: w1 = 1.6, w0 = 0.45
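The arithmetic can be checked with a short NumPy sketch (for this data, x̄ = 1.75 and ȳ = 3.25):

import numpy as np

x = np.array([1., 2., 1.5, 2.5])   # sqft (1000's)
y = np.array([2., 3.5, 3., 4.5])   # sale price (100k)

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w1, w0)                      # 1.6 0.45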
Linear Regression – Multivariate Solution
Least Mean Squares when x is D-dimensional

sqft (1000's)   bedrooms   bathrooms   sale price (100k)
1               2          1           2
2               2          2           3.5
1.5             3          2           3
2.5             4          2.5         4.5

RSS(w) in matrix form:
RSS(w) = Σ_n [yn − (w0 + Σ_d wd·xnd)]² = Σ_n [yn − wᵀxn]²,
where we have redefined some variables (by augmenting):
x ← [1 x1 x2 . . . xD]ᵀ,   w ← [w0 w1 w2 . . . wD]ᵀ
Least Mean Squares when x is D-dimensional
RSS(w) in matrix form:
RSS(w) = Σ_n [yn − (w0 + Σ_d wd·xnd)]² = Σ_n [yn − wᵀxn]²,
where we have redefined some variables (by augmenting):
x ← [1 x1 x2 . . . xD]ᵀ,   w ← [w0 w1 w2 . . . wD]ᵀ
which leads to
RSS(w) = Σ_n (yn − wᵀxn)(yn − xnᵀw)
       = Σ_n (wᵀxn·xnᵀw − 2·yn·xnᵀw) + const.
       = { wᵀ(Σ_n xn·xnᵀ) w − 2 (Σ_n yn·xnᵀ) w } + const.
RSS(w) in new notation
From the previous slide:
RSS(w) = { wᵀ(Σ_n xn·xnᵀ) w − 2 (Σ_n yn·xnᵀ) w } + const.
Design matrix and target vector:

X = [ x1ᵀ ]
    [ x2ᵀ ]
    [  ⋮  ] ∈ R^(N×(D+1)),
    [ xNᵀ ]

y = [ y1 ]
    [ y2 ]
    [ ⋮  ] ∈ R^N
    [ yN ]

Compact expression:
RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀw } + const
Example: RSS(w) in compact form

sqft (1000's)   bedrooms   bathrooms   sale price (100k)
1               2          1           2
2               2          2           3.5
1.5             3          2           3
2.5             4          2.5         4.5

Design matrix and target vector:

X = [ 1  1    2  1   ]        y = [ 2   ]
    [ 1  2    2  2   ]            [ 3.5 ]
    [ 1  1.5  3  2   ],           [ 3   ]
    [ 1  2.5  4  2.5 ]            [ 4.5 ]

Compact expression:
RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀw } + const
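A sketch of building this design matrix and evaluating RSS(w) at a candidate w (the rss helper is ours):

import numpy as np

# Columns: intercept, sqft (1000's), bedrooms, bathrooms
X = np.array([[1, 1.0, 2, 1.0],
              [1, 2.0, 2, 2.0],
              [1, 1.5, 3, 2.0],
              [1, 2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

def rss(w, X, y):
    r = X @ w - y          # residual vector
    return r @ r           # ||Xw - y||_2^2

print(rss(np.zeros(4), X, y))   # at w = 0 this is just ||y||^2 = 45.5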
Solution in matrix form
Compact expression:
RSS(w) = ‖Xw − y‖₂² = { wᵀXᵀXw − 2 (Xᵀy)ᵀw } + const
Gradients of linear and quadratic functions:
• ∇x(bᵀx) = b
• ∇x(xᵀAx) = 2Ax (for symmetric A)
Normal equation:
∇w RSS(w) = 2XᵀXw − 2Xᵀy = 0
This leads to the least-mean-squares (LMS) solution
wLMS = (XᵀX)⁻¹ Xᵀy
Example: RSS(w) in compact form

sqft (1000's)   bedrooms   bathrooms   sale price (100k)
1               2          1           2
2               2          2           3.5
1.5             3          2           3
2.5             4          2.5         4.5

Write the least-mean-squares (LMS) solution:
wLMS = (XᵀX)⁻¹ Xᵀy
We can use solvers in Matlab, Python, etc. to compute this for any given X and y.
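For instance, in Python/NumPy (a sketch; np.linalg.lstsq is the numerically preferred route, since explicitly inverting XᵀX can be unstable):

import numpy as np

X = np.array([[1, 1.0, 2, 1.0],
              [1, 2.0, 2, 2.0],
              [1, 1.5, 3, 2.0],
              [1, 2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

w = np.linalg.solve(X.T @ X, X.T @ y)          # normal equations
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # same answer, more stable

print(w)   # [0.25, 1.25, 0.125, 0.25]; with 4 points and 4 parameters this toy fit happens to be exact (RSS = 0)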
Exercise: RSS(w) in compact form
Using the general least-mean-squares (LMS) solution
wLMS = (XᵀX)⁻¹ Xᵀy,
recover the univariate solution that we computed earlier:
w1 = Σ (xn − x̄)(yn − ȳ) / Σ (xn − x̄)²   and   w0 = ȳ − w1·x̄
where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn.
Exercise: RSS(w) in compact form
For the 1-D case, augment each input so the design matrix has rows [1 xn]. Then the least-mean-squares solution is
wLMS = (XᵀX)⁻¹ Xᵀy
with

XᵀX = [ N     N·x̄    ]        Xᵀy = [ Σ_n yn    ]
      [ N·x̄   Σ_n xn² ],            [ Σ_n xn·yn ]

so that

[ w0 ]                       [ ȳ·Σ(xn − x̄)² − x̄·Σ(xn − x̄)(yn − ȳ) ]
[ w1 ] = (1 / Σ(xn − x̄)²) ·  [ Σ(xn − x̄)(yn − ȳ)                   ]

where x̄ = (1/N) Σ_n xn and ȳ = (1/N) Σ_n yn. Reading off the entries recovers w1 = Σ(xn − x̄)(yn − ȳ) / Σ(xn − x̄)² and w0 = ȳ − w1·x̄.
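A numerical check that the two routes agree, reusing the earlier univariate example (a sketch):

import numpy as np

x = np.array([1., 2., 1.5, 2.5])
y = np.array([2., 3.5, 3., 4.5])

# Matrix route: augment x with a column of ones
X = np.column_stack([np.ones_like(x), x])
w0, w1 = np.linalg.solve(X.T @ X, X.T @ y)

# Univariate route from the normal equations
x_bar, y_bar = x.mean(), y.mean()
w1_uni = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0_uni = y_bar - w1_uni * x_bar

print(np.allclose([w0, w1], [w0_uni, w1_uni]))   # True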
Linear Regression – Probabilistic interpretation
Why is minimizing RSS sensible?
[Figure: the lines c1ᵀw = y1, . . . , c4ᵀw = y4 in parameter space, with the point w marked]

r = [ c1ᵀw − y1 ]
    [ c2ᵀw − y2 ]
    [    ...    ]
    [ c4ᵀw − y4 ]  = Aw − y

• Want to find w that minimizes the difference between Xw and y
• But since this is a vector, we need an operator that can map the residual vector r(w) = y − Xw to a scalar
• We take the sum of the squares of the elements of r(w)
Why is minimizing RSS sensible?
Probabilistic interpretation
• Noisy observation model:
Y = w0 + w1·X + η
where η ∼ N(0, σ²) is a Gaussian random variable
• Conditional likelihood of one training sample:
p(yn | xn) = N(w0 + w1·xn, σ²) = (1 / (√(2π)·σ)) · exp(−[yn − (w0 + w1·xn)]² / (2σ²))
Probabilistic interpretation (cont'd)
Log-likelihood of the training data D (assuming i.i.d. samples):
log P(D) = log Π_{n=1}^N p(yn | xn) = Σ_n log p(yn | xn)
         = Σ_n { −[yn − (w0 + w1·xn)]² / (2σ²) − log(√(2π)·σ) }
         = −(1/(2σ²)) Σ_n [yn − (w0 + w1·xn)]² − (N/2)·log σ² − N·log √(2π)
         = −(1/2) { (1/σ²) Σ_n [yn − (w0 + w1·xn)]² + N·log σ² } + const
What is the relationship between minimizing RSS and maximizing the log-likelihood?
Maximum likelihood estimation
Estimating σ, w0, and w1 can be done in two steps:
• Maximize over w0 and w1:
max log P(D) ⇔ min Σ_n [yn − (w0 + w1·xn)]²  ← this is RSS(w)!
• Maximize over s = σ²:
∂ log P(D)/∂s = −(1/2) { −(1/s²) Σ_n [yn − (w0 + w1·xn)]² + N·(1/s) } = 0
→ σ*² = s* = (1/N) Σ_n [yn − (w0 + w1·xn)]²
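Applied to the earlier univariate example, the noise estimate is one line (a sketch; the fitted w0 = 0.45, w1 = 1.6 come from the earlier slide):

import numpy as np

x = np.array([1., 2., 1.5, 2.5])
y = np.array([2., 3.5, 3., 4.5])
w0, w1 = 0.45, 1.6                       # RSS-minimizing fit from before

residuals = y - (w0 + w1 * x)
sigma2_mle = np.mean(residuals ** 2)     # sigma*^2 = (1/N) * sum of squared residuals
print(sigma2_mle)                        # 0.0125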
How does this probabilistic interpretation help us?
• It gives a solid footing to our intuition: minimizing RSS(w) is a
sensible thing based on reasonable modeling assumptions.
• Estimating σ∗ tells us how much noise there is in our predictions.
For example, it allows us to place confidence intervals around our
predictions.
Linear Regression – Computational and numerical optimization
Computational complexity of the Least Squares Solution
Bottleneck of computing the solution
w = (XᵀX)⁻¹ Xᵀy:
• the matrix multiplication XᵀX ∈ R^((D+1)×(D+1))
• inverting the matrix XᵀX
How many operations do we need?
• O(ND²) for the matrix multiplication
• O(D³) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for the matrix inversion
• Impractical for very large D or N
Alternative method: Batch Gradient Descent
(Batch) Gradient descent:
• Initialize w to w^(0) (e.g., randomly); set t = 0; choose η > 0
• Loop until convergence:
1. Compute the gradient: ∇RSS(w) = Xᵀ(Xw^(t) − y)
2. Update the parameters: w^(t+1) = w^(t) − η·∇RSS(w)
3. t ← t + 1
What is the complexity of each iteration? O(ND)
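A minimal batch gradient descent sketch (the constant 2 in the gradient is absorbed into η, as on the slide; the values of η and the iteration count are illustrative and must be tuned per problem):

import numpy as np

def batch_gd(X, y, eta=0.1, n_iters=1000):
    """Minimize RSS(w) = ||Xw - y||^2 by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)   # O(ND) work per iteration
        w -= eta * grad
    return w

# On the univariate housing example it recovers the closed-form answer:
X = np.column_stack([np.ones(4), np.array([1., 2., 1.5, 2.5])])
y = np.array([2., 3.5, 3., 4.5])
print(batch_gd(X, y))              # ~ [0.45, 1.6]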
Why would this work?
If gradient descent converges, it will converge to the same solution as using matrix inversion.
This is because RSS(w) is a convex function of its parameters w.
Hessian of RSS:
RSS(w) = wᵀXᵀXw − 2 (Xᵀy)ᵀw + const
⇒ ∂²RSS(w)/∂w∂wᵀ = 2XᵀX
XᵀX is positive semidefinite, because for any v,
vᵀXᵀXv = ‖Xv‖₂² ≥ 0
Stochastic gradient descent (SGD)
Widrow-Hoff rule: update the parameters using one example at a time
• Initialize w to some w^(0); set t = 0; choose η > 0
• Loop until convergence:
1. Randomly choose a training sample xt
2. Compute its contribution to the gradient: gt = (xtᵀw^(t) − yt)·xt
3. Update the parameters: w^(t+1) = w^(t) − η·gt
4. t ← t + 1
How does the complexity per iteration compare with gradient descent?
• O(ND) for gradient descent versus O(D) for SGD
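An SGD sketch in the same style (with a constant η the iterates hover near the optimum rather than settling exactly on it; η, iteration count, and seed are illustrative):

import numpy as np

def sgd(X, y, eta=0.01, n_iters=20_000, seed=0):
    """Widrow-Hoff updates: one randomly chosen sample per step, O(D) each."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        t = rng.integers(N)              # pick a training sample at random
        g = (X[t] @ w - y[t]) * X[t]     # that sample's gradient contribution
        w -= eta * g
    return w

X = np.column_stack([np.ones(4), np.array([1., 2., 1.5, 2.5])])
y = np.array([2., 3.5, 3., 4.5])
print(sgd(X, y))   # noisy estimate near [0.45, 1.6]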
SGD versus Batch GD
• SGD reduces per-iteration complexity from O(ND) to O(D)
• But it is noisier and can take longer to converge
How to Choose the Learning Rate η in Practice?
• Try 0.0001, 0.001, 0.01, 0.1, etc. on a validation dataset (more on this later) and choose the one that gives the fastest, stable convergence
• Reduce η by a constant factor (e.g., 10) when learning saturates, so that we can get closer to the true minimum
• More advanced learning-rate schedules, such as AdaGrad, Adam, and AdaDelta, are used in practice
Mini-Summary
• Linear regression predicts via a linear combination of features:
f : x → y, with f(x) = w0 + Σ_d wd·xd = w0 + wᵀx
• If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters
• Probabilistic interpretation: maximum likelihood estimation, assuming the residual is Gaussian distributed
• Gradient descent and mini-batch SGD can overcome the computational issues