
Page 1: Autocorrelation in Regression Analysis

Autocorrelation in Regression Analysis

• Tests for Autocorrelation
• Examples
• Durbin-Watson Tests
• Modeling Autoregressive Relationships

Page 2: Autocorrelation in Regression Analysis

What causes autocorrelation?

• Misspecification

• Data Manipulation
  – Before receipt
  – After receipt

• Event Inertia

• Spatial ordering

Page 3: Autocorrelation in Regression Analysis

Checking for Autocorrelation

• Test: Durbin-Watson statistic:

  d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}, for n and K-1 d.f.

The d-statistic is read against the following zones on a scale from 0 to 4:

  0 .......... d-lower      Positive autocorrelation (clearly evident)
  d-lower .... d-upper      Zone of indecision (ambiguous; cannot rule out autocorrelation)
  d-upper .... 4-d-upper    No autocorrelation (autocorrelation is not evident)
  4-d-upper .. 4-d-lower    Zone of indecision (ambiguous; cannot rule out autocorrelation)
  4-d-lower .. 4            Negative autocorrelation (clearly evident)

Page 4: Autocorrelation in Regression Analysis

Consider the following regression:

Because this is time-series data, we should consider the possibility of autocorrelation. To run the Durbin-Watson test, we first declare the data as time series with the tsset command, and then use the dwstat command, as sketched below.
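A minimal sketch of that sequence (the name of the time index variable, time, is an assumption; price, ice, and quantity match the output below; newer versions of Stata use estat dwatson in place of dwstat):

  tsset time                    // declare the data as a time series
  regress price ice quantity    // estimate the OLS model
  dwstat                        // report the Durbin-Watson d-statistic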

Durbin-Watson d-statistic( 3, 328) = .2109072

      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   52.63
       Model |  .354067287     2  .177033643           Prob > F      =  0.0000
    Residual |  1.09315071   325  .003363541           R-squared     =  0.2447
-------------+------------------------------           Adj R-squared =  0.2400
       Total |    1.447218   327  .004425743           Root MSE      =    .058

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |    .060075    .006827     8.80   0.000     .0466443    .0735056
    quantity |  -2.27e-06   2.91e-07    -7.79   0.000    -2.84e-06   -1.69e-06
       _cons |   .2783773   .0077177    36.07   0.000     .2631944    .2935602
------------------------------------------------------------------------------

Page 5: Autocorrelation in Regression Analysis

Find the d-upper and d-lower

• Check a Durbin-Watson table for the numbers for d-upper and d-lower.
• http://hadm.sph.sc.edu/courses/J716/Dw.html
• For n = 20 and k = 2, α = .05 the values are:
  – Lower = 1.643
  – Upper = 1.704

Durbin's alternative test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
          1  |      1292.509                1                   0.0000
---------------------------------------------------------------------------
   H0: no serial correlation

Page 6: Autocorrelation in Regression Analysis

Alternatives to the d-statistic

• The d-statistic is not valid in models with a lagged dependent variable.
  – In the case of a lagged LHS variable you must use Durbin's alternative (Durbin-a) test (the command is durbina in Stata); see the sketch below.
• Also, the d-statistic tests only for first-order autocorrelation. In other instances you may use the Durbin-a test.
  – Why would you suspect other than 1st-order autocorrelation?
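A hedged sketch (durbina is the user-written command named above; newer Stata ships the equivalent estat durbinalt, used here):

  regress price ice quantity L.price   // model with a lagged LHS variable
  estat durbinalt                      // Durbin's alternative test, first order
  estat durbinalt, lags(4)             // check for higher-order serial correlation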

Page 7: Autocorrelation in Regression Analysis

The Runs Test

• An alternative to the D-W test is a formalized examination of the signs of the residuals. We would expect the signs of the residuals to be random in the absence of autocorrelation.

• The first step is to estimate the model and predict the residuals, as sketched below.
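In Stata this first step might look like the following (the residual variable name e is an arbitrary choice):

  quietly regress price ice quantity
  predict e, resid                     // store the residuals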

Page 8: Autocorrelation in Regression Analysis

Runs continued

• Next, order the signs of the residuals against time (or spatial ordering in the case of cross-sectional data) and see if there are excessive “runs” of positives or negatives. Alternatively, you can graph the residuals and look for the same trends.

Page 9: Autocorrelation in Regression Analysis

Runs test continued

Where n = number of observations, n1 = the number of + signs, n2 = the number of – signs, and k = the number of runs:

  E(k) = \frac{2 n_1 n_2}{n_1 + n_2} + 1

  \sigma_k^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}

The final step is to use the expected mean and deviation in a standard t-test.

Stata does this automatically with the runtest command!
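Continuing the sketch above, the whole test is a single command; threshold(0) splits the residuals into + and – signs:

  runtest e, threshold(0)              // runs test on the signs of the residuals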

Page 10: Autocorrelation in Regression Analysis

Visual diagnosis of autocorrelation (in a single series)

• A correlogram is a good tool to identify if a series is autocorrelated

[Correlogram: autocorrelations of price, lags 0–40, with Bartlett's formula for MA(q) 95% confidence bands]
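In Stata, a correlogram like this one can be produced with the ac command (the 40-lag horizon matches the axis above):

  ac price, lags(40)                   // autocorrelation plot with Bartlett 95% bands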

Page 11: Autocorrelation in Regression Analysis

Dealing with autocorrelation

• D-W is not appropriate for auto-regressive (AR) models, where:

  Y_t = b_0 + b_1 Y_{t-1} + b_2 X_2 + \ldots

• In this case, we use the Durbin alternative test.
• For AR models, we need to explicitly estimate the correlation between Y_t and Y_{t-1} as a model parameter.
• Techniques:
  – AR1 models (closest to regression; 1st order only)
  – ARIMA (any order)

Page 12: Autocorrelation in Regression Analysis

Dealing with Autocorrelation

• There are several approaches to resolving problems of autocorrelation:
  – Lagged dependent variables
  – Differencing the dependent variable
  – GLS
  – ARIMA

Page 13: Autocorrelation in Regression Analysis

Lagged dependent variables

• The most common solution.
  – Simply create a new variable that equals Y at t-1, and use it as a RHS variable.
• To do this in Stata, simply use the generate command with the new variable equal to L.variable (a worked sketch follows this list):
  – gen lagy = L.y
  – gen laglagy = L2.y
• This correction should be based on a theoretical belief about the specification.
• May cause more problems than it solves.
• Also costs a degree of freedom (one lost observation).
  – There are several advanced techniques for dealing with this as well.
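Applied to the running example (lagprice is an illustrative name; the data must already be tsset):

  gen lagprice = L.price
  regress price ice quantity lagprice  // lagged dependent variable on the RHS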

Page 14: Autocorrelation in Regression Analysis

Differencing

• Differencing is simply the act of subtracting the previous observation's value from the current observation:

  D.x = x_t - x_{t-1}

• To do this in Stata, again use the generate command with a capital D (instead of the L for lags), as sketched below.
  – This process is effective; however, it is an EXPENSIVE correction.
  – This technique "throws away" long-term trends.
  – Assumes that rho = 1 exactly.
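A minimal sketch (dprice is an illustrative name):

  gen dprice = D.price                 // D.price = price_t - price_{t-1}
  regress dprice ice quantity          // re-estimate with the differenced DV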

Page 15: Autocorrelation in Regression Analysis

GLS and ARIMA

• GLS approaches use maximum likelihood to estimate rho and correct the model.
  – These are good corrections, and can be replicated in OLS.

• ARIMA is an acronym for Autoregressive Integrated Moving Average.
  – This process is a univariate "filter" used to cleanse variables of a variety of pathologies before analysis.

Page 16: Autocorrelation in Regression Analysis

Corrections based on Rho

• There are several ways to estimate rho, the simplest being to calculate it from the residuals:

  \hat{\rho} = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}

We then estimate the regression by transforming the regressors so that:

  x^*_t = x_t - \hat{\rho} x_{t-1}   and   y^*_t = y_t - \hat{\rho} y_{t-1}

This gives the regression:

  y^* = \beta_0 + \beta_1 x^*
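A hedged Stata sketch of this manual transformation, shown for a single regressor (the auxiliary regression of e on L.e only approximates the formula above, and all generated variable names are illustrative):

  quietly regress price ice quantity
  predict e, resid
  quietly regress e L.e, noconstant    // slope approximates rho-hat
  scalar rhohat = _b[L.e]
  gen ystar = price - rhohat*L.price   // y* = y_t - rho-hat * y_{t-1}
  gen xstar = ice - rhohat*L.ice       // x* = x_t - rho-hat * x_{t-1}
  regress ystar xstar                  // y* = b0 + b1 x*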

Page 17: Autocorrelation in Regression Analysis

High tech solutions

• Stata also offers the option of estimating the model with AR corrections (with multiple ways of estimating rho). There is also what is known as a Prais-Winsten regression, which generates values for the lost observation.

• For the truly adventurous, there is also the option of doing a full ARIMA model.
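The Prais-Winsten output on the next slide comes from the prais command; a minimal sketch:

  prais price ice quantity             // iterated Prais-Winsten AR(1) regression
  prais price ice quantity, corc       // Cochrane-Orcutt variant (drops the first observation)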

Page 18: Autocorrelation in Regression Analysis

Prais-Winsten regression

Prais-Winsten AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   15.39
       Model |  .012722308     2  .006361154           Prob > F      =  0.0000
    Residual |  .134323736   325  .000413304           R-squared     =  0.0865
-------------+------------------------------           Adj R-squared =  0.0809
       Total |  .147046044   327  .000449682           Root MSE      =  .02033

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |   .0098603   .0059994     1.64   0.101    -.0019422    .0216629
    quantity |  -1.11e-07   1.70e-07    -0.66   0.512    -4.45e-07    2.22e-07
       _cons |   .2517135   .0195727    12.86   0.000     .2132082    .2902188
-------------+----------------------------------------------------------------
         rho |   .9436986
------------------------------------------------------------------------------
Durbin-Watson statistic (original)    0.210907
Durbin-Watson statistic (transformed) 1.977062

Page 19: Autocorrelation in Regression Analysis

ARIMA

• The ARIMA model allows us to test the hypothesis of autocorrelation and remove it from the data.

• This is an iterative process akin to the purging we did when creating the ystar variable.
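The model on the next slide fits price with a single AR(1) term; a sketch of the command:

  arima price, ar(1)                   // the L1. ar coefficient is the estimate of rho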

Page 20: Autocorrelation in Regression Analysis

The model

[Slide callouts point to the significant L1. lag coefficient, which is the estimate of rho]

ARIMA regression

Sample: 1 to 328                                Number of obs   =        328
                                                Wald chi2(1)    =    3804.80
Log likelihood = 811.6018                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
       _cons |   .2558135   .0207937    12.30   0.000     .2150587    .2965683
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9567067     .01551    61.68   0.000     .9263076    .9871058
-------------+----------------------------------------------------------------
      /sigma |   .0203009    .000342    59.35   0.000     .0196305    .0209713
------------------------------------------------------------------------------

Page 21: Autocorrelation in Regression Analysis

The residuals of the ARIMA model

There are a few significant lags a ways back. Generally we should expect some, but this mess is probably an indicator of a seasonal trend (well beyond the scope of this lecture)!

[Correlogram: autocorrelations of the ARIMA residuals e, lags 0–40, with Bartlett's formula for MA(q) 95% confidence bands]

Page 22: Autocorrelation in Regression Analysis

ARIMA with a covariate

ARIMA regression

Sample: 1 to 328                                Number of obs   =        328
                                                Wald chi2(3)    =    3569.57
Log likelihood = 812.9607                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
         ice |   .0095013   .0064945     1.46   0.143    -.0032276    .0222303
    quantity |  -1.04e-07   1.22e-07    -0.85   0.393    -3.43e-07    1.35e-07
       _cons |   .2531552   .0220777    11.47   0.000     .2098838    .2964267
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9542692     .01628    58.62   0.000     .9223611    .9861773
-------------+----------------------------------------------------------------
      /sigma |   .0202185   .0003471    58.25   0.000     .0195382    .0208988
------------------------------------------------------------------------------
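A sketch of the command behind this output:

  arima price ice quantity, ar(1)      // covariates plus an AR(1) disturbance term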

Page 23: Autocorrelation in Regression Analysis

Final thoughts

• Each correction has a "best" application.
  – If we wanted to evaluate a mean shift (a dummy-variable-only model), calculating rho would not be a good choice; there we would want to use the lagged dependent variable.
  – Also, where we want to test the effect of inertia, it is probably better to use the lag.

Page 24: Autocorrelation in Regression Analysis

Final Thoughts Continued

– In small N, calculating rho tends to be more accurate.
– ARIMA is one of the best options; however, it is very complicated!
– When dealing with time, the number of time periods and the spacing of the observations are VERY IMPORTANT!
– When using estimates of rho, a good rule of thumb is to make sure you have 25-30 time points at a minimum. More if the observations are too close together for the process you are observing!

Page 25: Autocorrelation in Regression Analysis

Next Time:

• Review for Exam
  – Plenary Session

• Exam Posting
  – Available after class Wednesday