Applied Statistics (uniroma1.it, 2019-11-11)
Multiple regression OLS estimation A real example Fitted values and residuals R2 Properties OLS estimator
Applied Statistics
Lecturer: Cristina Mollica
Outline of the lecture
Introduction
Multiple linear regression model
Ordinary least squares estimation
Examples
Fitted values and residuals
Coefficient of determination R2
Properties of the OLS estimator
Introduction
Regression models are used to describe how one or perhaps a few response variables depend on other explanatory variables.

The idea of regression is at the core of much statistical modelling, because the question "what happens to y when x varies?" is central to many investigations.

The main goal is to gain an understanding of the relation between them and to make predictions of the response from the knowledge of the explanatory variables.

There is usually a single response, treated as random, and a set of explanatory variables, which can be both quantitative and categorical variables and are treated as non-stochastic.

We are going to generalize the theory of the simple linear regression model to the case where a single response depends linearly on K ≥ 2 covariates (multiple regression model).
Multiple regression
The multiple linear regression model assumes that the r.v. Yi satisfies

Yi = β0 + β1 xi1 + · · · + βK xiK + εi = xi^t β + εi,   i = 1, …, n

where:

xi^t = (1, xi1, …, xiK) is the 1 × (K + 1) vector of covariates associated with the i-th observation (known constants)

β = (β0, β1, …, βK)^t is a (K + 1) × 1 vector of unknown regression parameters

xi^t β is the deterministic component of the model, referred to as the linear predictor, that is, a linear combination (in the parameters) of the covariates

εi is the random (unobserved) error term perturbing the linear relation and inducing the discrepancy between Yi and xi^t β
Matrix notation
Econometricians make frequent use of the matrix notation:

Y = (Y1, …, Yi, …, Yn)^t    β = (β0, …, βk, …, βK)^t    ε = (ε1, …, εi, …, εn)^t

    | 1  x11  …  x1K |
    | …  …    …  …   |
X = | 1  xi1  …  xiK |
    | …  …    …  …   |
    | 1  xn1  …  xnK |
Matrix notation
Using the matrix notation, where:
Y is the n × 1 vector of the response observations
X is the n × (K + 1) design matrix with the covariate values
β is the (K + 1) × 1 vector of regression coefficients
ε is the n × 1 vector of unknown error terms
the regression equation can be concisely written as
Y = Xβ + ε
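As a quick illustration of this matrix form (a sketch of ours, not from the slides; NumPy and all variable names are our own choices), the model Y = Xβ + ε can be simulated directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 2                      # n observations, K covariates

# Design matrix X: a leading column of ones plus K covariate columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])  # n x (K+1)

beta = np.array([1.0, 2.0, -0.5])        # true (K+1)-vector of coefficients
eps = rng.normal(scale=0.3, size=n)      # error terms: mean 0, constant variance

Y = X @ beta + eps                        # the regression equation Y = X beta + eps
print(X.shape, Y.shape)                   # (50, 3) (50,)
```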
The Gauss-Markov assumptions
For a linear regression model

Y = Xβ + ε

the Gauss-Markov assumptions concern the errors εi and their relation with the xi:

E[εi] = 0
V(εi) = σ2, not depending on i (homoscedasticity)
COV(εi, εj) = 0 for all i ≠ j (uncorrelated errors)
ε1, …, εn and x1, …, xn are independent

The error terms are uncorrelated drawings from a distribution with mean 0 and constant variance σ2. Using the matrix notation,

E[ε] = 0 and V(ε) = σ2 In
Example: simple regression
For the straight-line regression model yi = β0 + β1 xi + εi for i = 1, …, n, the matrix form of the model is

| y1 |   | 1  x1 |            | ε1 |
| y2 | = | 1  x2 |  ( β0 )  + | ε2 |
| …  |   | …  …  |  ( β1 )    | …  |
| yn |   | 1  xn |            | εn |

so X is an n × 2 matrix and β a 2 × 1 vector of parameters.
Example: two groups comparison
Suppose that the response variable y has been observed on two groups of observations of size n1 and n2. Let y1i for i = 1, …, n1 be the observations of the first group and let y2i for i = 1, …, n2 be the observations of the second group.

Let β0 and β0 + β1 be the means of the variable y in the two groups. Hence

y1i = β0 + ε1i,   i = 1, …, n1
y2i = β0 + β1 + ε2i,   i = 1, …, n2
Example: two groups comparison
We can write the model for the two groups comparison in matrix notation y = Xβ + ε, where

y = (y11, …, y1n1, y21, …, y2n2)^t    ε = (ε11, …, ε1n1, ε21, …, ε2n2)^t

    | 1  0 |
    | …  … |
    | 1  0 |
X = | 1  1 |          β = ( β0 )
    | …  … |              ( β1 )
    | 1  1 |

with the first n1 rows of X equal to (1, 0) and the last n2 rows equal to (1, 1).
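The dummy-coded design matrix above can be built mechanically; a minimal sketch (our own, assuming NumPy, with arbitrary example group sizes):

```python
import numpy as np

n1, n2 = 4, 3   # group sizes (arbitrary example values)

# First n1 rows are (1, 0) for group 1, last n2 rows are (1, 1) for group 2
X = np.column_stack([np.ones(n1 + n2),
                     np.repeat([0.0, 1.0], [n1, n2])])
print(X)
```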
Example: polynomial regression
Suppose that the response is a polynomial function of a single covariate,

yi = β0 + β1 xi + · · · + βK xi^K + εi

This is a useful tool to fit, for example, a quadratic or cubic trend to the data, in which case we would have K = 2 or K = 3 respectively. Then

| y1 |   | 1  x1  x1^2  …  x1^K |  | β0 |   | ε1 |
| y2 | = | 1  x2  x2^2  …  x2^K |  | β1 | + | ε2 |
| …  |   | …  …   …     …  …    |  | …  |   | …  |
| yn |   | 1  xn  xn^2  …  xn^K |  | βK |   | εn |

Note that the model y = Xβ + ε is linear in the parameters β. Polynomial regression can be written as a linear model because of its linearity not in x, but in β.
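Assuming NumPy (the slides do not prescribe any software), the polynomial design matrix can be produced with `np.vander`; a sketch of ours:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
K = 2   # quadratic trend

# increasing=True orders the columns as 1, x, x^2, ..., x^K,
# matching the layout of the design matrix above
X = np.vander(x, N=K + 1, increasing=True)
print(X)
```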
Ordinary least squares
The OLS estimate of β is obtained as the solution of the following optimization problem

β̂ = argmin_{β ∈ R^(K+1)} SS(β)

where the objective function to be minimized is the sum of squares

SS(β) = Σ_{i=1}^n (yi − xi^t β)^2 = (y − Xβ)^t (y − Xβ)
Ordinary least squares
By differentiating SS(β) with respect to β and setting the derivatives equal to zero, one has to solve the system

∂SS(β)/∂β0 = −2 Σ_{i=1}^n (yi − xi^t β) = 0
…
∂SS(β)/∂βk = −2 Σ_{i=1}^n (yi − xi^t β) xik = 0
…
∂SS(β)/∂βK = −2 Σ_{i=1}^n (yi − xi^t β) xiK = 0

With the matrix notation, this corresponds to

∂SS(β)/∂β = −2 (y − Xβ)^t X = (0, …, 0)
Ordinary least squares
In matrix form these amount to the equations

(y − Xβ)^t X = (0, …, 0)

that is

X^t (y − Xβ) = (0, …, 0)^t

which imply that the estimate satisfies

X^t y = X^t X β̂

Provided the (K + 1) × (K + 1) matrix X^t X is of full rank,

β̂ = (X^t X)^{-1} X^t y = (Σ_{i=1}^n xi xi^t)^{-1} Σ_{i=1}^n xi yi

is the system solution.
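Numerically, the normal equations are usually solved directly rather than by forming the inverse; a sketch of ours with NumPy and simulated data (all names and values are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=n)

# Solve X^t X beta = X^t y instead of computing (X^t X)^{-1} explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same sum of squares and should agree
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```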
Ordinary least squares
Moreover, the (r, s) element of the matrix of second derivatives of SS(β) is

∂^2 SS(β) / ∂βr ∂βs = 2 Σ_{i=1}^n xir xis.

Hence, the matrix of second derivatives of SS(β) is 2 X^t X, which is a positive semidefinite matrix.

Thus, β̂ = (X^t X)^{-1} X^t y is the value that minimizes SS(β). The minimum value of the objective function

SS(β̂) = Σ_{i=1}^n (yi − xi^t β̂)^2 = (y − X β̂)^t (y − X β̂) = (y − ŷ)^t (y − ŷ)

is called the residual sum of squares.
Examples 1: simple regression
In simple cases it is possible to have analytical expressions for the least squares estimates. For example, in the straight-line regression model

yi = β0 + β1 xi + εi,   i = 1, …, n

the X matrix of the representation y = Xβ + ε is

    | 1  x1 |
X = | 1  x2 |
    | …  …  |
    | 1  xn |

Then, we have that

X^t X = | n             Σ_{i=1}^n xi   |        X^t y = | Σ_{i=1}^n yi    |
        | Σ_{i=1}^n xi  Σ_{i=1}^n xi^2 |                | Σ_{i=1}^n xi yi |
Examples 1: simple regression
Moreover (all sums running over i = 1, …, n)

(X^t X)^{-1} = [1 / (n Σ xi^2 − (Σ xi)^2)]  | Σ xi^2   −Σ xi |
                                            | −Σ xi     n    |

so that

β̂ = (X^t X)^{-1} X^t y
   = [1 / (n Σ xi^2 − (Σ xi)^2)]  | Σ yi Σ xi^2 − Σ xi Σ xi yi |
                                  | n Σ xi yi − Σ xi Σ yi      |

Now let sxy = (1/n) Σ (xi − x̄)(yi − ȳ), sy^2 = (1/n) Σ (yi − ȳ)^2 and sx^2 = (1/n) Σ (xi − x̄)^2. After some algebra we have

β̂ = | ȳ − x̄ sxy / sx^2 |
    | sxy / sx^2        |
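The closed-form expressions can be checked against the matrix formula; a small self-contained check of ours, assuming NumPy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=30)

# Closed-form slope and intercept via the 1/n moments used above
sxy = np.mean((x - x.mean()) * (y - y.mean()))
sx2 = np.mean((x - x.mean()) ** 2)
b1 = sxy / sx2
b0 = y.mean() - x.mean() * b1

# Same numbers from beta = (X^t X)^{-1} X^t y
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta, [b0, b1]))  # True
```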
Examples 2: binary covariate
For the two groups comparison

y1i = β0 + ε1i,   i = 1, …, n1
y2i = β0 + β1 + ε2i,   i = 1, …, n2

we have

y = (y11, …, y1n1, y21, …, y2n2)^t

    | 1  0 |
    | …  … |
    | 1  0 |
X = | 1  1 |
    | …  … |
    | 1  1 |

(first n1 rows (1, 0), last n2 rows (1, 1)).
Examples 2: binary covariate
To obtain the least squares estimates observe that

X^t X = | n1 + n2  n2 |
        | n2       n2 |

(X^t X)^{-1} = [1 / (n1 n2)] | n2    −n2     | = | 1/n1    −1/n1       |
                             | −n2   n1 + n2 |   | −1/n1   1/n1 + 1/n2 |

X^t y = | Σ_{i=1}^{n1} y1i + Σ_{i=1}^{n2} y2i | = | n1 ȳ1 + n2 ȳ2 |
        | Σ_{i=1}^{n2} y2i                    |   | n2 ȳ2         |

hence

β̂ = (X^t X)^{-1} X^t y = | ȳ1      |
                          | ȳ2 − ȳ1 |
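One can verify numerically that the OLS fit with a single dummy covariate reproduces the group means; a toy check of ours, with made-up data and NumPy:

```python
import numpy as np

y1 = np.array([2.0, 3.0, 4.0])   # group 1 (n1 = 3), mean 3
y2 = np.array([5.0, 7.0])        # group 2 (n2 = 2), mean 6
y = np.concatenate([y1, y2])

X = np.column_stack([np.ones(5), np.repeat([0.0, 1.0], [3, 2])])
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(beta)  # [3. 3.]: the group-1 mean and the difference of the group means
```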
Examples 2: categorical covariate
The two groups comparison can be extended to G ≥ 2 groups

y1i = β0 + ε1i,   i = 1, …, n1
y2i = β0 + β1 + ε2i,   i = 1, …, n2
…
yGi = β0 + βG−1 + εGi,   i = 1, …, nG

Let yg = (yg1, …, ygng)^t for g = 1, …, G, so that y = (y1^t, …, yG^t)^t and

    | 1n1  0n1  …  0n1 |
X = | 1n2  1n2  …  0n2 |
    | …    …    …  …   |
    | 1nG  0nG  …  1nG |

where 1ng and 0ng denote vectors of ng ones and zeros. Then

β̂ = (X^t X)^{-1} X^t y = | ȳ1      |
                          | ȳ2 − ȳ1 |
                          | …       |
                          | ȳG − ȳ1 |
Example 2: Wages data
Consider the wages data. We can extend our regression model with additional explanatory variables, such as the years of schooling (schooli) and the experience in years (experi). The model is

Yi = β0 + β1 malei + β2 schooli + β3 experi + εi

The model is now interpreted as describing the conditional expected wage of an individual given his or her gender, years of schooling and experience.
Example 2: Wages data
The model is estimated as follows:

ŷi = −3.38 + 1.34 malei + 0.64 schooli + 0.12 experi

The coefficient of male measures the expected wage difference between males and females with the same schooling and experience: if we compare an arbitrary male and female with the same schooling and the same experience, the expected wage differential is 1.34.
Example 2: Wages data
The coefficient of school measures the expected wage difference between two individuals with the same experience and the same gender, where one has one additional year of schooling. And what about the parameter of experience?

In general, the coefficients in a multiple regression model can only be interpreted under a ceteris paribus condition, which says that the other variables included in the model are held constant.
Fitted values
The fitted or predicted values for y are given by

ŷi = xi^t β̂,   i = 1, …, n.

In vector terms,

ŷ = X β̂ = X (X^t X)^{-1} X^t y = H y

In linear algebra, the matrix H = X (X^t X)^{-1} X^t is known as a projection matrix: it orthogonally projects the vector y onto the space spanned by the columns of X. The matrix H is more frequently known as the hat matrix, because it transforms y into ŷ (it "puts a hat" on y).
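A small numerical check of these facts (a sketch of ours, assuming NumPy; forming H explicitly is fine for illustration but avoided in practice for large n):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
y_hat = H @ y                          # fitted values y_hat = H y

# H is symmetric and idempotent, and H X = X (columns of X are left unchanged)
print(np.allclose(H, H.T), np.allclose(H @ H, H), np.allclose(H @ X, X))
```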
Residuals
The unobservable error εi = yi − xi^t β is estimated by the i-th residual

ei = yi − xi^t β̂

In vector terms,

e = y − X β̂ = y − H y = (I − H) y = M y

where I is the n × n identity matrix and M = I − H is called the residual matrix.
Hat and residual matrices
The projection matrix is symmetric and idempotent, that is

H^t = H and HH = H

The same properties hold for the matrix M.

Moreover, the vector of residuals e = My and the vector of fitted values ŷ = Hy are orthogonal. In fact,

ŷ^t e = (Hy)^t (My) = y^t H^t (I − H) y = y^t (H − H) y = 0
Deviance decomposition and R2
Finally, note that by the orthogonality of ŷ and e, we have

Σ_{i=1}^n yi^2 = y^t y = (e + ŷ)^t (e + ŷ) = ŷ^t ŷ + e^t e = Σ_{i=1}^n ŷi^2 + Σ_{i=1}^n ei^2

The overall sum of squares of the data equals the sum of squares of the fitted model plus the residual sum of squares. Thanks to the fact that yi and ŷi have the same mean ȳ, one can write

Σ_{i=1}^n yi^2 − n ȳ^2 = Σ_{i=1}^n ŷi^2 − n ȳ^2 + Σ_{i=1}^n ei^2

n (Σ_{i=1}^n yi^2 / n − ȳ^2) = n (Σ_{i=1}^n ŷi^2 / n − ȳ^2) + Σ_{i=1}^n ei^2

Σ_{i=1}^n (yi − ȳ)^2 = Σ_{i=1}^n (ŷi − ȳ)^2 + Σ_{i=1}^n ei^2

known as the deviance decomposition.
Deviance decomposition and R2
The deviance decomposition is briefly indicated as

TSS = ESS + RSS

implying that the total deviance TSS of the response variable y is equal to the deviance of the predicted values ŷ, called the explained deviance ESS of the estimated multiple regression model, plus the deviance of the residuals, that is, the residual sum of squares RSS resulting from the OLS minimization.

This decomposition is exploited to derive the coefficient of determination R2 to assess the goodness-of-fit of the estimated model:

R2 = ESS / TSS = 1 − RSS / TSS ∈ [0, 1]

It quantifies the portion of the original variability recovered by the fitted model.
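The decomposition and the two equivalent forms of R2 can be verified numerically; a simulated sketch of ours, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta
e = y - y_hat

TSS = np.sum((y - y.mean()) ** 2)       # total deviance
ESS = np.sum((y_hat - y.mean()) ** 2)   # explained deviance
RSS = np.sum(e ** 2)                    # residual sum of squares

R2 = ESS / TSS
print(np.isclose(TSS, ESS + RSS), np.isclose(R2, 1 - RSS / TSS))  # True True
```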
Properties of the OLS estimator
When the Gauss-Markov conditions hold:

1. The OLS estimator β̂ is unbiased: E(β̂) = β. In fact

β̂ = (X^t X)^{-1} X^t y = (X^t X)^{-1} X^t [Xβ + ε] = β + (X^t X)^{-1} X^t ε

Hence,

E(β̂) = E(β + (X^t X)^{-1} X^t ε) = β + (X^t X)^{-1} X^t E(ε) = β.

2. The variance of β̂ is V(β̂) = σ^2 (Σ_{i=1}^n xi xi^t)^{-1} = σ^2 (X^t X)^{-1}:

V(β̂) = V(β + (X^t X)^{-1} X^t ε) = V((X^t X)^{-1} X^t ε)
      = (X^t X)^{-1} X^t σ^2 I ((X^t X)^{-1} X^t)^t = σ^2 (X^t X)^{-1}

3. Gauss-Markov theorem: the OLS estimator is the best linear unbiased estimator (BLUE) of β.
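Unbiasedness can be illustrated by simulation: averaging the OLS estimate over many error draws for a fixed design should recover β. A sketch of ours with arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])  # fixed design
beta = np.array([1.0, 2.0])
sigma = 0.5

# Re-draw the error vector many times and average the resulting OLS estimates
estimates = [np.linalg.solve(X.T @ X,
                             X.T @ (X @ beta + rng.normal(scale=sigma, size=n)))
             for _ in range(5000)]
mean_est = np.mean(estimates, axis=0)

print(np.allclose(mean_est, beta, atol=0.05))  # True: the average is close to beta
```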
Gauss-Markov theorem
Let β̃ = Ay be a linear estimator of β, where A is a (K + 1) × n matrix. Assume that β̃ is unbiased, that is,

E(β̃) = E(Ay) = AXβ = β for every β

hence AX = I. Then, under the Gauss-Markov assumptions,

V(β̃) − V(β̂)

is a positive semidefinite matrix.
Gauss-Markov theorem
In fact,

V(β̃) − V(β̂) = A σ^2 I A^t − σ^2 (X^t X)^{-1}
             = σ^2 {A A^t − A X (X^t X)^{-1} X^t A^t}
             = σ^2 A (I − H) A^t = σ^2 A (I − H)(I − H)^t A^t

which is positive semidefinite. This result shows that β̂ has the smallest variance in finite samples among all linear unbiased estimators of β, provided that the Gauss-Markov assumptions hold. Of course, nonlinear estimators may have smaller variance.
Properties of the OLS estimator
To estimate the variance of β̂, we need to replace the unknown error variance σ^2 with an estimate. An unbiased estimator of σ^2 is

s^2 = [1 / (n − (K + 1))] Σ_{i=1}^n ei^2 = e^t e / (n − (K + 1))

where n is the number of observations and (K + 1) the number of regressors in the model (including the intercept).

The naive estimate of σ^2, given by

σ̂^2 = (1/n) Σ_{i=1}^n ei^2 = e^t e / n = (y − ŷ)^t (y − ŷ) / n = (1/n) Σ_{i=1}^n (yi − ŷi)^2

is, therefore, a biased estimate of σ^2.
Properties of the OLS estimator
Hence the variance-covariance matrix of β̂ can be estimated as

V̂(β̂) = s^2 (X^t X)^{-1}

We define the standard error of β̂ as the quantity

SE(β̂) = sqrt(V̂(β̂))

The standard error of β̂ is a measure of the precision of the estimator. We will define

SE(β̂k) = s √ckk

where ckk is the (k, k) element of (Σ_{i=1}^n xi xi^t)^{-1} = (X^t X)^{-1}.