
Multiple Regression Analysis

Ram Akella, University of California Berkeley Silicon Valley Center/SC

Lecture 2, January 26, 2011

Introduction

We extend the concept of simple linear regression as we investigate a response y which is affected by several independent variables x1, x2, x3, ..., xk.

Our objective is to use the information provided by the xi to predict the value of y.

Example

Rating Prediction

We have a database of movie ratings (on a scale of 1 to 10) from various users. We can predict a missing rating for a movie m from user Y based on:
- other movie ratings from the same user
- ratings given to movie m by other users

People               Airplane   Matrix    Room with a View  ...   Hidalgo
                     (comedy)   (action)  (romance)               (action)
Joe (27,M,70k)       9          7         2                       7
Carol (53,F,20k)     8                    9
...
Kumar (25,M,22k)     9          3                                 6
Ua (48,M,81k)        4          7         ?                 ?     ?

(Blank cells are unrated movies; the task is to fill in the ? entries for user Ua.)

Illustrative Example

Body Fat Prediction

Let y be the body fat index of an individual. This might be a function of several variables:
x1 = height
x2 = weight
x3 = abdomen measurement
x4 = hip measurement

We want to predict y using knowledge of x1, x2, x3, and x4.

Formalization of the Regression Model

For each observation i, the expected value of the dependent variable y conditional on the information $x_1, x_2, \ldots, x_p$ is given by:

$E(y \mid x_1, x_2, \ldots, x_p) = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p$

We add a noise term $\varepsilon_i$ to this expectation, so the value of y becomes:

$y_i = E(y \mid x_1, x_2, \ldots, x_p) + \varepsilon_i$

Combining both equations we have:

$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \ldots + b_p x_{ip} + \varepsilon_i$

Formalization of the Regression Model

We can express the regression model in matrix terms:

$y = Xb + \varepsilon$

where y is a vector of order (n x 1), b is a vector of order ((p+1) x 1), and X is configured as:

$X = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ 1 & X_{31} & X_{32} & \cdots & X_{3p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}$

The column of 1s corresponds to the dummy variable that is multiplied by the parameter $b_0$.

Assumptions of the Regression Model

Assumptions of the data matrix X:
- It is fixed for fitting purposes
- It is full rank

Assumptions of the random variable $\varepsilon$:
- The $\varepsilon_i$ are independent
- They have mean 0 and common variance $\sigma^2$ for any set $x_1, x_2, \ldots, x_k$
- They have a normal distribution

Method of Least Squares

The method of least squares is used to estimate the values of b that minimize the sum of squared differences between the observations $y_i$ and the fitted values formed by these parameters and the variables $x_1, x_2, \ldots, x_k$.

Mechanics of the Least Squares Estimator in multiple variables

The objective is to choose $\hat{b}$ to minimize:

$S(\hat{b}) = (y - X\hat{b})'(y - X\hat{b})$

We differentiate the expression above with respect to $\hat{b}$ and obtain:

$-2X'(y - X\hat{b}) = 0$

Solving for $\hat{b}$ we obtain:

$\hat{b} = (X'X)^{-1}X'y$
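As a quick illustration (not part of the lecture), this estimator can be computed directly with NumPy. A minimal sketch, with made-up data; solving the normal equations is used rather than forming the inverse explicitly, which is numerically preferable:

```python
import numpy as np

def ols_fit(X, y):
    """Least squares estimate b_hat = (X'X)^{-1} X'y.

    X: (n, p+1) design matrix whose first column is all ones.
    y: (n,) response vector.
    """
    # Solving the normal equations is more stable than inverting X'X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny illustration with made-up numbers (not the slide data):
X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]])
y = np.array([1.1, 2.0, 2.9, 4.2])
print(ols_fit(X, y))  # intercept, coefficient of x1, coefficient of x2
```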

Mechanics of the Least Squares Estimator in multiple variables

The variance of the estimator is:

$\text{var}(\hat{b}) = E[(\hat{b} - b)(\hat{b} - b)']$

We know that:

$\hat{b} - b = (X'X)^{-1}X'\varepsilon$

and the expected value of $\varepsilon\varepsilon'$ is:

$E[\varepsilon\varepsilon'] = \sigma^2 I$

Substituting in the equation above:

$\text{var}(\hat{b}) = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}]$
$\text{var}(\hat{b}) = (X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1}$
$\text{var}(\hat{b}) = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1}$
$\text{var}(\hat{b}) = \sigma^2 (X'X)^{-1}$

Properties of the least squares estimator

The estimator $\hat{b}$ is unbiased:

$E(\hat{b} - b) = 0$

Among linear unbiased estimators of b, $\hat{b}$ has the lowest variance (the Gauss-Markov property):

$\text{var}(\hat{b}) = \sigma^2 (X'X)^{-1}$
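A short sketch of this covariance in code (illustrative, not from the slides; it assumes $\sigma^2$ is estimated from the residuals as $s^2 = \sum_i e_i^2 / (n - p - 1)$, consistent with the degrees of freedom used later in the F and t tests):

```python
import numpy as np

def coef_covariance(X, y):
    """Estimated covariance of b_hat: s^2 (X'X)^{-1}."""
    n, k = X.shape                      # k = p + 1 columns incl. intercept
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b_hat
    s2 = resid @ resid / (n - k)        # unbiased estimate of sigma^2
    return s2 * np.linalg.inv(X.T @ X)  # diagonal = squared standard errors
```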

Example

A computer database in a small community contains: the listed selling price y (in thousands of dollars), the amount of living area x1 (in hundreds of square feet), and the number of floors x2, bedrooms x3, and bathrooms x4, for n = 15 randomly selected residences currently on the market.

Property  y      x1  x2  x3  x4
1         69.0   6   1   2   1
2         118.5  10  1   2   2
3         116.5  10  1   3   2
...       ...    ... ... ... ...
15        209.9  21  2   4   3

Fit a first-order model to the data using the method of least squares.

Example

The first-order model is:

$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + \varepsilon$

With a leading column of 1s for the intercept, the design matrix and response are:

$X = \begin{pmatrix} 1 & 6 & 1 & 2 & 1 \\ 1 & 10 & 1 & 2 & 2 \\ 1 & 10 & 1 & 3 & 2 \\ \vdots & & & & \\ 1 & 21 & 2 & 4 & 3 \end{pmatrix}, \quad y = \begin{pmatrix} 69.0 \\ 118.5 \\ 116.5 \\ \vdots \\ 209.9 \end{pmatrix}$

$\hat{b} = (X'X)^{-1}X'y = (18.763,\ 6.2698,\ -16.203,\ -2.673,\ 30.271)'$

Some Questions

1. How well does the model fit?
2. How strong is the relationship between y and the predictor variables?
3. Have any assumptions been violated?
4. How good are the estimates and predictions?

To answer these questions we need the n observations on the response y and the independent variables $x_1, x_2, x_3, \ldots, x_k$.

Residuals

The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$ is a residual, defined as:

$e_i = y_i - \hat{y}_i \sim N(0, \sigma^2)$

If we square the residuals and sum them, we obtain a chi-square distribution:

$\sum_i e_i^2 / \sigma^2 \sim \chi^2$

If the normal assumption is valid, the plot of the residuals should appear as a random scatter around the zero center line.

If not, you will see a pattern in the residuals.


Residuals versus Fits

If we see a pattern in the residuals, then a linear model may not be appropriate for the data: the descriptors and the predicted variable do not follow a linear relationship. We can transform the descriptors in order to achieve a better fit.
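A minimal sketch of such a diagnostic plot, assuming a design matrix X, response y, and fitted coefficients b_hat from an earlier fit (the names are illustrative):

```python
import matplotlib.pyplot as plt

def residuals_vs_fits(X, y, b_hat):
    """Plot residuals e = y - X b_hat against the fitted values."""
    fitted = X @ b_hat
    resid = y - fitted
    plt.scatter(fitted, resid)
    plt.axhline(0.0, linestyle="--")  # zero center line
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residuals versus Fits")
    plt.show()
```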

How good is the fit?

Our objective in regression is to choose the parameters $b_0, b_1, \ldots, b_k$ that provide the most accurate fitted values of y by minimizing the uncertainty that remains after using the information about X. This uncertainty may be denoted by the measure $R^2$, which is equal to:

$R^2 = \frac{\sum_i (y_i - \bar{y})^2 - \sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

How good is the fit?

We are interested in a high value of $R^2$ when evaluating our fit. One drawback of $R^2$ is that whenever an independent variable is added to the model this measure always increases (while consuming a degree of freedom), so we calculate the adjusted $R^2$ to weigh the improvement in fit against the cost in degrees of freedom:

$R^2_{adj} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_i (y_i - \bar{y})^2 / (n - 1)}$

where p is the number of predictor variables and n is the number of samples used to fit the model.
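A direct transcription of these two formulas into code (illustrative; y and y_hat are assumed to be NumPy arrays of observed and fitted values):

```python
import numpy as np

def r_squared(y, y_hat, n_predictors):
    """R^2 and adjusted R^2 as defined above (p = number of predictors)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (ss_res / (n - n_predictors - 1)) / (ss_tot / (n - 1))
    return r2, r2_adj
```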

Is it significant?

The first question to ask is whether the regression model is of any use in predicting y.

If it is not, then the value of y does not change, regardless of the values of the independent variables $x_1, x_2, \ldots, x_k$. This implies that the partial regression coefficients $b_1, b_2, \ldots, b_k$ are all zero:

$H_0: b_1 = b_2 = \ldots = b_k = 0$ versus $H_a$: at least one $b_i$ is not zero

F statistics

An F statistic is the ratio of two independent $\chi^2$ random variables, each divided by its respective degrees of freedom. The key point is to show that both are independent.

$F = \frac{\sum_i (\hat{y}_i - \bar{y})^2 / p}{\sum_i (y_i - \hat{y}_i)^2 / (n - p - 1)}$

Is it significant?

We test the model using the F test. This consists of the ratio of the mean square due to regression and the mean square of the residual terms (e). If the model is useful, the value of F will be large.

$F = \frac{\sum_i (\hat{y}_i - \bar{y})^2 / p}{\sum_i (y_i - \hat{y}_i)^2 / (n - p - 1)}$

which is an F statistic with (p, n - p - 1) degrees of freedom.
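A sketch of this overall F test (illustrative; the p-value uses SciPy's F survival function, which is not mentioned on the slides):

```python
import numpy as np
from scipy import stats

def f_test(y, y_hat, n_predictors):
    """Overall F statistic with (p, n - p - 1) degrees of freedom."""
    n, p = len(y), n_predictors
    msr = np.sum((y_hat - y.mean()) ** 2) / p      # regression mean square
    mse = np.sum((y - y_hat) ** 2) / (n - p - 1)   # residual mean square
    f = msr / mse
    p_value = stats.f.sf(f, p, n - p - 1)          # P(F_{p, n-p-1} > f)
    return f, p_value
```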

Analysis of variance table

Source      df         SS        MS               F
Regression  k          SSR       SSR/k            MSR/MSE
Error       n - k - 1  SSE       SSE/(n - k - 1)
Total       n - 1      Total SS

SS = sum of squares
df = degrees of freedom
MS = mean squares
F = F statistic

Table of variance of the real estate example:

Source          DF  SS       MS      F      P
Regression      4   15913.0  3978.3  84.80  0.000
Residual Error  10  469.1    46.9
Total           14  16382.2

S = 6.849   R-Sq = 97.1%   R-Sq(adj) = 96.0%


Testing Individual Parameters

• Is a particular independent variable useful in the model, in the presence of all the other independent variables? The test statistic is a function of $\hat{b}_i$, our best estimate of $b_i$:

$H_0: b_i = 0$ versus $H_a: b_i \neq 0$

t statistics

A t distribution is defined as a standard normal random variable divided by the square root of an independent chi-square random variable over its degrees of freedom.

If we want to test the significance of a particular coefficient, we calculate the t value of the coefficient:

Test statistic: $t = \frac{\hat{b}_i}{\sqrt{\widehat{\text{var}}(\hat{b}_i)}} = \frac{\hat{b}_i}{\hat{\sigma}\sqrt{v_{ii}}}$

where $v_{ii}$ is the element (i, i) of the matrix $(X'X)^{-1}$. The statistic has a t distribution with error df = n - p - 1. We reject $b_i = 0$ if $|t_0| > t_{\alpha/2,\, n-p-1}$.
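A sketch computing these t statistics for all coefficients at once (illustrative; X is assumed to include the intercept column of 1s):

```python
import numpy as np
from scipy import stats

def t_tests(X, y):
    """t statistic and two-sided p-value for each coefficient."""
    n, k = X.shape                        # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    b_hat = XtX_inv @ X.T @ y
    resid = y - X @ b_hat
    s2 = resid @ resid / (n - k)          # estimate of sigma^2
    se = np.sqrt(s2 * np.diag(XtX_inv))   # standard errors sqrt(s^2 v_ii)
    t = b_hat / se
    p_values = 2.0 * stats.t.sf(np.abs(t), n - k)
    return t, p_values
```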

The Real Estate Problem

In the presence of the other three independent variables, is the number of bedrooms significant in predicting the list price of homes? Test using α = .05.

Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths

The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor  Coef     SE Coef  T      P
Constant   18.763   9.207    2.04   0.069
SqFeet     6.2698   0.7252   8.65   0.000
NumFlrs    -16.203  6.212    -2.61  0.026
Bdrms      -2.673   4.494    -0.59  0.565
Baths      30.271   6.849    4.42   0.001

Since the p-value for Bdrms (0.565) is greater than α = .05, the number of bedrooms is not a significant predictor in the presence of the other three variables.

Detecting problems in the regression

Multicollinearity
This problem is related to high correlation between the independent variables. Multicollinearity leads to instability in the calculation of the inverse $(X'X)^{-1}$. We can measure it by calculating the condition index (CI), given by:

$CI = \frac{d_{max}}{d_{min}}$

where $d_{max}$ and $d_{min}$ are the maximum and minimum values of the diagonal matrix D obtained from the SVD decomposition of X. The higher the value, the more unstable the inverse of the matrix is. The matrix exhibits multicollinearity if CI > 30.
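A minimal sketch of this check (illustrative), using NumPy's SVD to obtain the singular values of X:

```python
import numpy as np

def condition_index(X):
    """CI = d_max / d_min from the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    return d[0] / d[-1]

# A CI above ~30 signals multicollinearity per the rule of thumb above.
```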

Detecting problems in the regression

Heteroscedasticity
This occurs when the assumption of the same variance $\sigma^2$ for all the error terms $\varepsilon_i$ is violated:

$y = Xb + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 W)$

This leads to low efficiency in the calculation of $\hat{b}$: the estimator may no longer have the lowest variance, and its usual variance estimate may be biased. We can address this problem by transforming the dependent variable (e.g., taking its logarithm), or by using weighted least squares to estimate $\hat{b}$:

$\hat{b} = (X'W^{-1}X)^{-1}X'W^{-1}y$
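A direct sketch of this weighted least squares estimator (illustrative; W is assumed to be a known (n x n) covariance structure, e.g. a diagonal matrix of per-observation variances):

```python
import numpy as np

def wls_fit(X, y, W):
    """Weighted least squares: b_hat = (X' W^{-1} X)^{-1} X' W^{-1} y."""
    W_inv = np.linalg.inv(W)
    return np.linalg.solve(X.T @ W_inv @ X, X.T @ W_inv @ y)

# Example weight structure: W = np.diag(per_observation_variances)
```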

Detecting problems in the regression

Autocorrelation is the problem of consecutive error terms in time-series data being correlated. The consequences of this problem are similar to those of heteroscedasticity. We can detect it by plotting the residuals and looking for patterns, or by using the Durbin-Watson test.

Durbin-Watson Test

This test is given by the ratio:

$DW = \frac{\sum_t (e_t - e_{t-1})^2}{\sum_t e_t^2}$

where the $e_t$ are the residuals. This is approximately equal to $2(1 - r)$, where r is the autocorrelation of consecutive residuals; if DW is close to 2, there is no evidence of autocorrelation.
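A one-function sketch of the statistic (illustrative):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2.

    Values near 2 suggest no first-order autocorrelation.
    """
    diff = np.diff(resid)                 # e_t - e_{t-1}
    return np.sum(diff ** 2) / np.sum(resid ** 2)
```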

Comparing Regression Models

To fairly compare two models, we can use:
- the adjusted R²
- the F test

Both take into account the difference in degrees of freedom between the models.

Estimation and Prediction

Once you have:
- determined that the regression line is useful
- used the diagnostic plots to check for violations of the regression assumptions

you are ready to use the regression line to predict a particular value of y for a given value of x.

Estimation and Prediction

Enter the appropriate values of x1, x2, ..., xk into the fitted equation:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k$

Particular values of y are more difficult to predict than the mean of y, so a wider range of values is required in the prediction interval.

The Real Estate Problem

Estimate the average list price for a home with 1000 square feet of living space, one floor, 3 bedrooms and two baths with a 95% confidence interval.

Predicted Values for New Observations
New Obs  Fit     SE Fit  95.0% CI           95.0% PI
1        117.78  3.11    (110.86, 124.70)   (101.02, 134.54)

Values of Predictors for New Observations
New Obs  SqFeet  NumFlrs  Bdrms  Baths
1        10.0    1.00     3.00   2.00

We estimate that the average list price will be between $110,860 and $124,700 for a home like this.
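A sketch of how such intervals are computed, using the standard formulas $\hat{y}_0 \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{h}$ for the CI and $\hat{y}_0 \pm t_{\alpha/2,\,n-p-1}\, s\sqrt{1+h}$ with $h = x_0'(X'X)^{-1}x_0$, which the slides report but do not derive (the function and variable names are illustrative):

```python
import numpy as np
from scipy import stats

def predict_intervals(X, y, x0, alpha=0.05):
    """CI for the mean response and PI for a new observation at x0.

    x0 must include the leading 1 for the intercept.
    """
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b_hat = XtX_inv @ X.T @ y
    resid = y - X @ b_hat
    s = np.sqrt(resid @ resid / (n - k))      # residual std. deviation
    y0 = x0 @ b_hat                           # point prediction (Fit)
    h = x0 @ XtX_inv @ x0                     # leverage of the new point
    t = stats.t.ppf(1.0 - alpha / 2.0, n - k)
    ci = (y0 - t * s * np.sqrt(h), y0 + t * s * np.sqrt(h))
    pi = (y0 - t * s * np.sqrt(1.0 + h), y0 + t * s * np.sqrt(1.0 + h))
    return y0, ci, pi
```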

Example: Body Fat Calculation

Predict the body fat index of an individual based on the following measures:

X1 = age in years
X2 = weight in lbs
X3 = height in inches
X4 = neck measure in cm
X5 = chest measure in cm
X6 = abdomen measure in cm
X7 = hip measure in cm
X8 = thigh measure in cm

Example: data set

age  weight  height  neck  chest  abdomen  hip    thigh  | bodyfat
23   154.25  67.75   36.2  93.1   85.2     94.5   59.0   | 12.3
22   173.25  72.25   38.5  93.6   83.0     98.7   58.7   | 6.1
22   154.00  66.25   34.0  95.8   87.9     99.2   59.6   | 25.3
26   184.75  72.25   37.4  101.8  86.4     101.2  60.1   | 10.4
24   184.25  71.25   34.4  97.3   100.0    101.9  63.2   | 28.7
24   210.25  74.75   39.0  104.5  94.4     107.8  66.0   | 20.9
26   181.00  69.75   36.4  105.1  90.7     100.3  58.4   | 19.2
25   176.00  72.50   37.8  99.6   88.5     97.1   60.0   | 12.4
25   191.00  74.00   38.1  100.9  82.5     99.9   62.9   | 4.1
23   198.25  73.50   42.1  99.6   88.6     104.1  63.1   | 11.7
26   186.25  74.50   38.5  101.5  83.6     98.2   59.7   | 7.1
27   216.00  76.00   39.4  103.6  90.9     107.7  66.2   | 7.8
32   180.50  69.50   38.4  102.0  91.6     103.9  63.4   | 20.8

The eight measurement columns form X and the bodyfat column is Y.

Regression Analysis

The result of the regression analysis is the following:

Feature  Value
b0       -23.2763
b1       0.0265
b2       -0.0980
b3       -0.0933
b4       -0.5725
b5       0.0258
b6       0.9576
b7       -0.2475
b8       0.3493

Measure         Value     P value
R²              0.7304
R² adj          0.7237
F               81.9439   0
Total variance  19.4037

Example

Residuals Plot [figure: residuals for each observation (x-axis 0 to 300) scattered between -15 and 15 around the zero line]

Example

The prediction for a new sample:

P = (74, 207.5, 70, 40.8, 112.4, 108.5, 107.1, 59.3)

b = (-23.2763, 0.0265, -0.0980, -0.0933, -0.5725, 0.0258, 0.9576, -0.2475, 0.3493)'

$Y = b_0 + P \cdot (b_1, \ldots, b_8)' = 29.2801$
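This prediction can be checked numerically. A sketch using the values printed above; note that because the coefficients shown are rounded, the result will differ slightly from the slide's 29.2801:

```python
import numpy as np

# Coefficients and new sample as printed on the slides (rounded values).
b = np.array([-23.2763, 0.0265, -0.0980, -0.0933, -0.5725,
              0.0258, 0.9576, -0.2475, 0.3493])
p = np.array([74, 207.5, 70, 40.8, 112.4, 108.5, 107.1, 59.3])

y_hat = b[0] + p @ b[1:]   # intercept plus weighted features
print(y_hat)               # compare with the slide's reported 29.2801
```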