18.650 – Fundamentals of Statistics
6. Linear Regression
Goals
Consider two random variables X and Y. For example,

1. X is the amount of $ spent on Facebook ads and Y is the total conversion rate;
2. X is the age of the person and Y is the number of clicks.

Given two random variables (X, Y), we can ask the following questions:

- How to predict Y from X?
- Error bars around this prediction?
- How many more conversions Y for an additional dollar spent?
- Does the number of clicks even depend on age?
- What if X is a random vector? For example, X = (X1, X2), where X1 is the amount of $ spent on Facebook ads and X2 is the duration in days of the campaign.
Conversions vs. amount spent

(figure)

Clicks vs. age

(figure)
Modeling assumptions
(Xi, Yi), i = 1, ..., n are i.i.d. from some unknown joint distribution IP.

IP can be described entirely by (assuming all exist):

- either a joint PDF h(x, y),
- or the marginal density of X, h(x) = ∫ h(x, y) dy, together with the conditional density

    h(y|x) = h(x, y) / h(x).

The conditional density h(y|x) answers all our questions: it contains all the information about Y given X.
Partial modeling
We can also describe the distribution only partially, e.g., using:

- The expectation of Y: IE[Y].
- The conditional expectation of Y given X = x. The function

    x ↦ f(x) := IE[Y | X = x] = ∫ y h(y|x) dy

  is called the regression function (a minimal estimation sketch follows after this list).
- Other possibilities:
  - the conditional median m(x), defined by ∫_{−∞}^{m(x)} h(y|x) dy = 1/2;
  - conditional quantiles;
  - the conditional variance (not informative about location).
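As an illustration, the regression function f(x) = IE[Y | X = x] can be estimated nonparametrically by local averaging: group the Xi into bins and average the Yi within each bin. A minimal sketch in Python; the simulated model (f(x) = 2 + 0.5x with Gaussian noise) is an assumption chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from an assumed model: f(x) = 2 + 0.5 x, Gaussian noise.
n = 1000
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

# Estimate f(x) = E[Y | X = x] by averaging Y within bins of X.
bins = np.linspace(0, 10, 21)          # 20 equal-width bins on [0, 10]
idx = np.digitize(X, bins)             # bin index of each X_i
f_hat = [Y[idx == k].mean() for k in range(1, len(bins))]
centers = (bins[:-1] + bins[1:]) / 2   # bin midpoints
print(list(zip(centers, f_hat)))       # binned estimate of x -> f(x)
```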
Conditional expectation and standard deviation

(figure)

Conditional expectation

(figure)

Conditional density and conditional quantiles

(figure)

Conditional distribution: boxplots

(figure)
Linear regression
We first focus on modeling the regression function f(x) = IE[Y | X = x].

- There are too many possible regression functions f (a nonparametric problem).
- It is useful to restrict attention to simple functions that are described by a few parameters.
- Simplest choice: linear (or affine) functions,

    f(x) = a + bx.

Under this assumption, we talk about linear regression.
Nonparametric regression

(figure)

Linear regression

(figure)
Probabilistic analysis
- Let X and Y be two real random variables (not necessarily independent) with two moments and such that var(X) > 0.
- The theoretical linear regression of Y on X is the line x ↦ a* + b*x, where

    (a*, b*) = argmin_{(a,b) ∈ IR²} IE[(Y − a − bX)²].

- Setting the partial derivatives to zero gives (the derivation is sketched below)

    b* = cov(X, Y) / var(X),
    a* = IE[Y] − b* IE[X] = IE[Y] − (cov(X, Y) / var(X)) IE[X].
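For completeness, a short version of this derivation (added here; the slide only states the result). Differentiate the quadratic risk in a and b and set both derivatives to zero:

```latex
\begin{align*}
\frac{\partial}{\partial a}\,\mathbb{E}\!\left[(Y - a - bX)^2\right]
  &= -2\,\mathbb{E}[Y - a - bX] = 0
  \;\Rightarrow\; a = \mathbb{E}[Y] - b\,\mathbb{E}[X],\\
\frac{\partial}{\partial b}\,\mathbb{E}\!\left[(Y - a - bX)^2\right]
  &= -2\,\mathbb{E}[X(Y - a - bX)] = 0
  \;\Rightarrow\; \mathbb{E}[XY] - a\,\mathbb{E}[X] - b\,\mathbb{E}[X^2] = 0.
\end{align*}
Substituting the first equation into the second gives
\[
  b\left(\mathbb{E}[X^2] - \mathbb{E}[X]^2\right)
    = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y],
  \qquad\text{i.e.}\qquad
  b^* = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}.
\]
```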
Noise
Clearly the points are not exactly on the line x ↦ a* + b*x if var(Y | X = x) > 0. The random variable ε = Y − (a* + b*X) is called the noise and satisfies

    Y = a* + b*X + ε,

with

- IE[ε] = 0, and
- cov(X, ε) = 0 (a short verification follows below).
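Both properties follow directly from the formulas for a* and b*; a one-line verification of each (added here, not on the slide):

```latex
\begin{align*}
\mathbb{E}[\varepsilon]
  &= \mathbb{E}[Y] - a^* - b^*\,\mathbb{E}[X]
   = \mathbb{E}[Y] - \bigl(\mathbb{E}[Y] - b^*\,\mathbb{E}[X]\bigr) - b^*\,\mathbb{E}[X] = 0,\\
\operatorname{cov}(X, \varepsilon)
  &= \operatorname{cov}(X, Y) - b^*\operatorname{var}(X)
   = \operatorname{cov}(X, Y)
     - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}\,\operatorname{var}(X) = 0.
\end{align*}
```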
Statistical problem
In practice, a* and b* need to be estimated from data.

- Assume that we observe n i.i.d. random pairs (X1, Y1), ..., (Xn, Yn) with the same distribution as (X, Y):

    Yi = a* + b*Xi + εi.

- We want to estimate a* and b*.
Least squares
Definition. The least squares estimator (LSE) of (a*, b*) is the minimizer of the sum of squared errors:

    (â, b̂) = argmin_{(a,b) ∈ IR²} Σ_{i=1}^n (Yi − a − bXi)².

It is given by

    b̂ = ( (1/n) Σ XiYi − X̄ Ȳ ) / ( (1/n) Σ Xi² − X̄² ),
    â = Ȳ − b̂ X̄,

where X̄ = (1/n) Σ Xi and Ȳ = (1/n) Σ Yi denote the sample averages.
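A minimal numerical sketch of these formulas; the simulated data and the true values a* = 2, b* = 0.5 are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from an assumed model Y = 2 + 0.5 X + noise.
n = 500
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

# Least squares estimator via the closed-form formulas above.
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean() ** 2)
a_hat = Y.mean() - b_hat * X.mean()
print(a_hat, b_hat)  # should be close to (2.0, 0.5)
```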
Residuals

(figure)
Multivariate regression
Yi = Xi^T β* + εi,   i = 1, ..., n.

- Vector of explanatory variables or covariates: Xi ∈ IRp (w.l.o.g., assume its first coordinate is 1).
- Response / dependent variable: Yi.
- β* = (a*, b*^T)^T; its first coordinate β*₁ (= a*) is called the intercept.
- {εi}_{i=1,...,n}: noise terms satisfying cov(Xi, εi) = 0.

Definition. The least squares estimator (LSE) of β* is the minimizer of the sum of squared errors:

    β̂ = argmin_{β ∈ IRp} Σ_{i=1}^n (Yi − Xi^T β)².
LSE in matrix form
- Let Y = (Y1, ..., Yn)^T ∈ IRn.
- Let X be the n × p matrix whose rows are X1^T, ..., Xn^T (X is called the design matrix).
- Let ε = (ε1, ..., εn)^T ∈ IRn (unobserved noise), so that

    Y = Xβ* + ε,   with β* unknown.

- The LSE β̂ satisfies

    β̂ = argmin_{β ∈ IRp} ‖Y − Xβ‖₂².
Closed form solution
- Assume that rank(X) = p.
- Analytic computation of the LSE:

    β̂ = (X^T X)⁻¹ X^T Y.

- Geometric interpretation of the LSE: Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:

    Xβ̂ = PY,   where P = X(X^T X)⁻¹ X^T.
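A minimal sketch of the closed-form solution; the simulated design and β* = (2, 0.5, −1)^T are assumptions for illustration. In practice, solving the normal equations (or using np.linalg.lstsq) is numerically preferable to forming (X^T X)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated design: n = 200 observations, p = 3 covariates (first column = 1).
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_star = np.array([2.0, 0.5, -1.0])  # assumed true parameter
Y = X @ beta_star + rng.normal(scale=1.0, size=n)

# Closed-form LSE: solve the normal equations X^T X beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, numerically more stable alternative:
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat, beta_lstsq)  # both close to beta_star
```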
Statistical inference
To make inference (confidence regions, tests) we need more assumptions.

Assumptions:

- The design matrix X is deterministic and rank(X) = p.
- The model is homoscedastic: ε1, ..., εn are i.i.d.
- The noise vector ε is Gaussian:

    ε ∼ Nn(0, σ²In),

  for some known or unknown σ² > 0.
Properties of the LSE

- LSE = MLE (under the Gaussian noise assumption above).
- Distribution of β̂:

    β̂ ∼ Np(β*, σ²(X^T X)⁻¹).

- Quadratic risk of β̂:

    IE[‖β̂ − β*‖₂²] = σ² tr((X^T X)⁻¹).

- Prediction error:

    IE[‖Y − Xβ̂‖₂²] = σ²(n − p).

- Unbiased estimator of σ² (a numerical sketch follows below):

    σ̂² = ‖Y − Xβ̂‖₂² / (n − p).

Theorem

- (n − p) σ̂²/σ² ∼ χ²_{n−p}.
- β̂ ⊥⊥ σ̂².
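Continuing the matrix sketch above (same simulated X, Y, beta_hat, n, p; the noise there was drawn with σ = 1), the unbiased variance estimate is:

```python
# Unbiased estimator of the noise variance (continues the previous sketch).
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)   # ||Y - X beta_hat||^2 / (n - p)
print(sigma2_hat)  # close to the assumed sigma^2 = 1.0
```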
Significance tests

- Test whether the j-th explanatory variable is significant in the linear regression (1 ≤ j ≤ p).
- H0: β*j = 0  vs.  H1: β*j ≠ 0.
- If γj is the j-th diagonal coefficient of (X^T X)⁻¹ (γj > 0), then

    (β̂j − β*j) / √(σ̂² γj) ∼ t_{n−p}.

- Let T_n^(j) = β̂j / √(σ̂² γj).
- Test with nonasymptotic level α ∈ (0, 1):

    R_{j,α} = { |T_n^(j)| > q_{α/2}(t_{n−p}) },

  where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.
- We can also compute p-values (a sketch follows below).
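A minimal sketch of this test, continuing the simulation above (scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

# Test H0: beta*_j = 0 for each coordinate j (continues the previous sketch).
gamma = np.diag(np.linalg.inv(X.T @ X))      # gamma_j > 0
T = beta_hat / np.sqrt(sigma2_hat * gamma)   # statistics T_n^(j), j = 1..p

alpha = 0.05
q = stats.t.ppf(1 - alpha / 2, df=n - p)     # (1 - alpha/2)-quantile of t_{n-p}
reject = np.abs(T) > q                       # rejection region R_{j,alpha}
p_values = 2 * stats.t.sf(np.abs(T), df=n - p)
print(reject, p_values)
```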
Bonferroni’s test
- Test whether a group of explanatory variables is significant in the linear regression.
- H0: β*j = 0 ∀j ∈ S  vs.  H1: ∃j ∈ S, β*j ≠ 0, where S ⊆ {1, ..., p}.
- Bonferroni's test: R_{S,α} = ∪_{j∈S} R_{j,α/k}, where k = |S|.
- This test has nonasymptotic level at most α (a sketch follows below).
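In the sketch above, Bonferroni's test for a group S amounts to running each single-coordinate test at level α/k; the group S chosen here is an arbitrary illustrative choice:

```python
# Bonferroni's test of H0: beta*_j = 0 for all j in S (continues the sketch).
S = [1, 2]                                            # illustrative group of coordinates
k = len(S)
q_bonf = stats.t.ppf(1 - (alpha / k) / 2, df=n - p)   # quantile at level alpha/k
reject_group = any(abs(T[j]) > q_bonf for j in S)     # union of R_{j, alpha/k}
print(reject_group)
```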
Remarks
- Linear regression exhibits correlations, NOT causality.
- Normality of the noise: one can use goodness-of-fit tests to check whether the residuals ε̂i = Yi − Xi^T β̂ are Gaussian.
- Deterministic design: if X is not deterministic, all of the above can be understood conditionally on X, provided the noise is assumed to be Gaussian conditionally on X.