18.650 – Fundamentals of Statistics
6. Linear Regression
Goals
Consider two random variables X and Y. For example,

1. X is the amount of $ spent on Facebook ads and Y is the total conversion rate;
2. X is the age of the person and Y is the number of clicks.

Given two random variables (X, Y), we can ask the following questions:

- How to predict Y from X?
- Error bars around this prediction?
- How many more conversions Y for an additional dollar spent?
- Does the number of clicks even depend on age?
- What if X is a random vector? For example, X = (X1, X2), where X1 is the amount of $ spent on Facebook ads and X2 is the duration in days of the campaign.
Conversions vs. amount spent

(figure)

Clicks vs. age

(figure)
Modeling assumptions
(Xi, Yi), i = 1, ..., n are i.i.d. from some unknown joint distribution IP.

IP can be described entirely by (assuming all exist):

- either a joint PDF h(x, y),
- or the marginal density of X, h(x) = ∫ h(x, y) dy, together with the conditional density

    h(y|x) = h(x, y) / h(x).

The conditional density h(y|x) answers all our questions: it contains all the information about Y given X.
Partial modeling
We can also describe the distribution only partially, e.g., using:

- The expectation of Y: IE[Y].
- The conditional expectation of Y given X = x. The function

    x ↦ f(x) := IE[Y | X = x] = ∫ y h(y|x) dy

  is called the regression function (a minimal estimation sketch follows after this list).
- Other possibilities:
  - the conditional median m(x), defined by ∫_{−∞}^{m(x)} h(y|x) dy = 1/2;
  - conditional quantiles;
  - the conditional variance (not informative about location).
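As an illustration, the regression function f(x) = IE[Y | X = x] can be estimated nonparametrically by local averaging: group the Xi into bins and average the Yi within each bin. A minimal sketch in Python; the simulated model (f(x) = 2 + 0.5x with Gaussian noise) is an assumption chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from an assumed model: f(x) = 2 + 0.5 x, Gaussian noise.
n = 1000
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

# Estimate f(x) = E[Y | X = x] by averaging Y within bins of X.
bins = np.linspace(0, 10, 21)          # 20 equal-width bins on [0, 10]
idx = np.digitize(X, bins)             # bin index of each X_i
f_hat = [Y[idx == k].mean() for k in range(1, len(bins))]
centers = (bins[:-1] + bins[1:]) / 2   # bin midpoints
print(list(zip(centers, f_hat)))       # binned estimate of x -> f(x)
```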
Conditional expectation and standard deviation

(figure)

Conditional expectation

(figure)

Conditional density and conditional quantiles

(figure)

Conditional distribution: boxplots

(figure)
Linear regression
We first focus on modeling the regression function f(x) = IE[Y | X = x].

- There are too many possible regression functions f (a nonparametric problem).
- It is useful to restrict attention to simple functions that are described by a few parameters.
- Simplest choice: linear (or affine) functions,

    f(x) = a + bx.

Under this assumption, we talk about linear regression.
Nonparametric regression

(figure)

Linear regression

(figure)
Probabilistic analysis
- Let X and Y be two real random variables (not necessarily independent) with two moments and such that var(X) > 0.
- The theoretical linear regression of Y on X is the line x ↦ a* + b*x, where

    (a*, b*) = argmin_{(a,b) ∈ IR²} IE[(Y − a − bX)²].

- Setting the partial derivatives to zero gives (the derivation is sketched below)

    b* = cov(X, Y) / var(X),
    a* = IE[Y] − b* IE[X] = IE[Y] − (cov(X, Y) / var(X)) IE[X].
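For completeness, a short version of this derivation (added here; the slide only states the result). Differentiate the quadratic risk in a and b and set both derivatives to zero:

```latex
\begin{align*}
\frac{\partial}{\partial a}\,\mathbb{E}\!\left[(Y - a - bX)^2\right]
  &= -2\,\mathbb{E}[Y - a - bX] = 0
  \;\Rightarrow\; a = \mathbb{E}[Y] - b\,\mathbb{E}[X],\\
\frac{\partial}{\partial b}\,\mathbb{E}\!\left[(Y - a - bX)^2\right]
  &= -2\,\mathbb{E}[X(Y - a - bX)] = 0
  \;\Rightarrow\; \mathbb{E}[XY] - a\,\mathbb{E}[X] - b\,\mathbb{E}[X^2] = 0.
\end{align*}
Substituting the first equation into the second gives
\[
  b\left(\mathbb{E}[X^2] - \mathbb{E}[X]^2\right)
    = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y],
  \qquad\text{i.e.}\qquad
  b^* = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}.
\]
```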
Noise
Clearly the points are not exactly on the line x ↦ a* + b*x if var(Y | X = x) > 0. The random variable ε = Y − (a* + b*X) is called the noise and satisfies

    Y = a* + b*X + ε,

with

- IE[ε] = 0, and
- cov(X, ε) = 0 (a short verification follows below).
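Both properties follow directly from the formulas for a* and b*; a one-line verification of each (added here, not on the slide):

```latex
\begin{align*}
\mathbb{E}[\varepsilon]
  &= \mathbb{E}[Y] - a^* - b^*\,\mathbb{E}[X]
   = \mathbb{E}[Y] - \bigl(\mathbb{E}[Y] - b^*\,\mathbb{E}[X]\bigr) - b^*\,\mathbb{E}[X] = 0,\\
\operatorname{cov}(X, \varepsilon)
  &= \operatorname{cov}(X, Y) - b^*\operatorname{var}(X)
   = \operatorname{cov}(X, Y)
     - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}\,\operatorname{var}(X) = 0.
\end{align*}
```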
Statistical problem
In practice, a* and b* need to be estimated from data.

- Assume that we observe n i.i.d. random pairs (X1, Y1), ..., (Xn, Yn) with the same distribution as (X, Y):

    Yi = a* + b*Xi + εi.

- We want to estimate a* and b*.
Least squares
Definition. The least squares estimator (LSE) of (a*, b*) is the minimizer of the sum of squared errors:

    (â, b̂) = argmin_{(a,b) ∈ IR²} Σ_{i=1}^n (Yi − a − bXi)².

It is given by

    b̂ = ( (1/n) Σ XiYi − X̄ Ȳ ) / ( (1/n) Σ Xi² − X̄² ),
    â = Ȳ − b̂ X̄,

where X̄ = (1/n) Σ Xi and Ȳ = (1/n) Σ Yi denote the sample averages.
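A minimal numerical sketch of these formulas; the simulated data and the true values a* = 2, b* = 0.5 are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from an assumed model Y = 2 + 0.5 X + noise.
n = 500
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=n)

# Least squares estimator via the closed-form formulas above.
b_hat = (np.mean(X * Y) - X.mean() * Y.mean()) / (np.mean(X**2) - X.mean() ** 2)
a_hat = Y.mean() - b_hat * X.mean()
print(a_hat, b_hat)  # should be close to (2.0, 0.5)
```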
Residuals

(figure)
Multivariate regression
Yi = Xi^T β* + εi,   i = 1, ..., n.

- Vector of explanatory variables or covariates: Xi ∈ IRp (w.l.o.g., assume its first coordinate is 1).
- Response / dependent variable: Yi.
- β* = (a*, b*^T)^T; its first coordinate β*₁ (= a*) is called the intercept.
- {εi}_{i=1,...,n}: noise terms satisfying cov(Xi, εi) = 0.

Definition. The least squares estimator (LSE) of β* is the minimizer of the sum of squared errors:

    β̂ = argmin_{β ∈ IRp} Σ_{i=1}^n (Yi − Xi^T β)².
LSE in matrix form
- Let Y = (Y1, ..., Yn)^T ∈ IRn.
- Let X be the n × p matrix whose rows are X1^T, ..., Xn^T (X is called the design matrix).
- Let ε = (ε1, ..., εn)^T ∈ IRn (unobserved noise), so that

    Y = Xβ* + ε,   with β* unknown.

- The LSE β̂ satisfies

    β̂ = argmin_{β ∈ IRp} ‖Y − Xβ‖₂².
Closed form solution
- Assume that rank(X) = p.
- Analytic computation of the LSE:

    β̂ = (X^T X)⁻¹ X^T Y.

- Geometric interpretation of the LSE: Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:

    Xβ̂ = PY,   where P = X(X^T X)⁻¹ X^T.
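A minimal sketch of the closed-form solution; the simulated design and β* = (2, 0.5, −1)^T are assumptions for illustration. In practice, solving the normal equations (or using np.linalg.lstsq) is numerically preferable to forming (X^T X)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated design: n = 200 observations, p = 3 covariates (first column = 1).
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_star = np.array([2.0, 0.5, -1.0])  # assumed true parameter
Y = X @ beta_star + rng.normal(scale=1.0, size=n)

# Closed-form LSE: solve the normal equations X^T X beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, numerically more stable alternative:
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat, beta_lstsq)  # both close to beta_star
```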
Statistical inference
To make inference (confidence regions, tests) we need more assumptions.

Assumptions:

- The design matrix X is deterministic and rank(X) = p.
- The model is homoscedastic: ε1, ..., εn are i.i.d.
- The noise vector ε is Gaussian:

    ε ∼ Nn(0, σ²In),

  for some known or unknown σ² > 0.
Properties of the LSE

- LSE = MLE (under the Gaussian noise assumption above).
- Distribution of β̂:

    β̂ ∼ Np(β*, σ²(X^T X)⁻¹).

- Quadratic risk of β̂:

    IE[‖β̂ − β*‖₂²] = σ² tr((X^T X)⁻¹).

- Prediction error:

    IE[‖Y − Xβ̂‖₂²] = σ²(n − p).

- Unbiased estimator of σ² (a numerical sketch follows below):

    σ̂² = ‖Y − Xβ̂‖₂² / (n − p).

Theorem

- (n − p) σ̂²/σ² ∼ χ²_{n−p}.
- β̂ ⊥⊥ σ̂².
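Continuing the matrix sketch above (same simulated X, Y, beta_hat, n, p; the noise there was drawn with σ = 1), the unbiased variance estimate is:

```python
# Unbiased estimator of the noise variance (continues the previous sketch).
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)   # ||Y - X beta_hat||^2 / (n - p)
print(sigma2_hat)  # close to the assumed sigma^2 = 1.0
```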
Significance tests

- Test whether the j-th explanatory variable is significant in the linear regression (1 ≤ j ≤ p).
- H0: β*j = 0  vs.  H1: β*j ≠ 0.
- If γj is the j-th diagonal coefficient of (X^T X)⁻¹ (γj > 0), then

    (β̂j − β*j) / √(σ̂² γj) ∼ t_{n−p}.

- Let T_n^(j) = β̂j / √(σ̂² γj).
- Test with nonasymptotic level α ∈ (0, 1):

    R_{j,α} = { |T_n^(j)| > q_{α/2}(t_{n−p}) },

  where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.
- We can also compute p-values (a sketch follows below).
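A minimal sketch of this test, continuing the simulation above (scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

# Test H0: beta*_j = 0 for each coordinate j (continues the previous sketch).
gamma = np.diag(np.linalg.inv(X.T @ X))      # gamma_j > 0
T = beta_hat / np.sqrt(sigma2_hat * gamma)   # statistics T_n^(j), j = 1..p

alpha = 0.05
q = stats.t.ppf(1 - alpha / 2, df=n - p)     # (1 - alpha/2)-quantile of t_{n-p}
reject = np.abs(T) > q                       # rejection region R_{j,alpha}
p_values = 2 * stats.t.sf(np.abs(T), df=n - p)
print(reject, p_values)
```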
Bonferroni’s test
- Test whether a group of explanatory variables is significant in the linear regression.
- H0: β*j = 0 ∀j ∈ S  vs.  H1: ∃j ∈ S, β*j ≠ 0, where S ⊆ {1, ..., p}.
- Bonferroni's test: R_{S,α} = ∪_{j∈S} R_{j,α/k}, where k = |S|.
- This test has nonasymptotic level at most α (a sketch follows below).
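In the sketch above, Bonferroni's test for a group S amounts to running each single-coordinate test at level α/k; the group S chosen here is an arbitrary illustrative choice:

```python
# Bonferroni's test of H0: beta*_j = 0 for all j in S (continues the sketch).
S = [1, 2]                                            # illustrative group of coordinates
k = len(S)
q_bonf = stats.t.ppf(1 - (alpha / k) / 2, df=n - p)   # quantile at level alpha/k
reject_group = any(abs(T[j]) > q_bonf for j in S)     # union of R_{j, alpha/k}
print(reject_group)
```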
Remarks
- Linear regression exhibits correlations, NOT causality.
- Normality of the noise: one can use goodness-of-fit tests to check whether the residuals ε̂i = Yi − Xi^T β̂ are Gaussian.
- Deterministic design: if X is not deterministic, all of the above can be understood conditionally on X, provided the noise is assumed to be Gaussian conditionally on X.