TRANSCRIPT
1
Overview of Time Series and Forecasting:
Data taken over time (usually equally spaced)
Yt = data at time t
µ = mean (constant over time)
Models:
“Autoregressive”
(Yt − µ) = α1(Yt-1 − µ) + α2(Yt-2 − µ) + … + αp(Yt-p − µ) + et
et independent, constant variance: “White Noise”
How to find p? Regress Y on lags.
PACF Partial Autocorrelation Function
(1) Regress Yt on Yt-1 then Yt on Yt-1 and Yt-2
then Yt on Yt-1,Yt-2, Yt-3 etc.
(2) Plot last lag coefficients versus lags.
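The two-step recipe above can be sketched numerically. A minimal sketch (not PROC ARIMA's implementation): the Durbin-Levinson recursion turns autocorrelations into the "last lag coefficients," which is equivalent to running the successive regressions. The AR(2) coefficients 1.57 and -0.67 below come from the silver example fit later in these notes.

```python
def pacf_from_acf(rho):
    """Durbin-Levinson recursion: rho[j-1] holds the lag-j autocorrelation;
    returns the partial autocorrelations (the 'last lag' coefficients)."""
    pacf, phi_prev = [], []
    for k in range(1, len(rho) + 1):
        if k == 1:
            phi = [rho[0]]
        else:
            num = rho[k - 1] - sum(phi_prev[j] * rho[k - 2 - j] for j in range(k - 1))
            den = 1 - sum(phi_prev[j] * rho[j] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi.append(phi_kk)
        pacf.append(phi[-1])
        phi_prev = phi
    return pacf

# Theoretical ACF of the AR(2) model (Yt - mu) = 1.57(Yt-1 - mu) - 0.67(Yt-2 - mu) + et
a1, a2 = 1.57, -0.67
rho = [a1 / (1 - a2)]             # Yule-Walker: rho(1) = a1/(1 - a2)
rho.append(a1 * rho[0] + a2)      # rho(2) = a1*rho(1) + a2
for _ in range(3):                # rho(k) = a1*rho(k-1) + a2*rho(k-2)
    rho.append(a1 * rho[-1] + a2 * rho[-2])

pacf = pacf_from_acf(rho)         # spikes at lags 1 and 2, then ~0
```

The lag-2 partial autocorrelation recovers α2 = -0.67 and all later lags are numerically zero: exactly the "two spikes, p = 2" pattern the slides read off the plot.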
2
Example 1: Supplies of Silver in NY commodities exchange:
Getting the PACF (and other identifying plots). SAS™ code:
PROC ARIMA data=silver plots(unpack)=all;
  identify var=silver;
run;
(™ SAS and its products are registered trademarks of SAS Institute, Cary NC.)
3
PACF
“Spikes” outside the 2 standard error bands are statistically significant.
Two spikes ⇒ p = 2
(Yt − µ) = α1(Yt-1 − µ) + α2(Yt-2 − µ) + et
How to estimate µ and the α's? PROC ARIMA's ESTIMATE statement, using maximum likelihood (the ml option):
PROC ARIMA data=silver plots(unpack)=all;
  identify var=silver;
  estimate p=2 ml;
4
Maximum Likelihood Estimation Parameter Estimate Standard Error t Value Approx
Pr > |t| Lag
MU 668.29592 38.07935 17.55 <.0001 0
AR1,1 1.57436 0.10186 15.46 <.0001 1
AR1,2 -0.67483 0.10422 -6.48 <.0001 2
(Yt − µ) − α1(Yt-1 − µ) − α2(Yt-2 − µ) = et
(Yt − 668) − 1.57(Yt-1 − 668) + 0.67(Yt-2 − 668) = et
(Yt − 668) = 1.57(Yt-1 − 668) − 0.67(Yt-2 − 668) + et
5
Backshift notation: B(Yt) = Yt-1, B²(Yt) = B(B(Yt)) = Yt-2, so the fitted model is
(1 − 1.57B + 0.67B²)(Yt − 668) = et
SAS output (uses backshift):
Autoregressive Factors
Factor 1: 1 - 1.57436 B**(1) + 0.67483 B**(2)
Checks:
(1) Overfit (try AR(3) )
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag
MU 664.88129 35.21080 18.88 <.0001 0
AR1,1 1.52382 0.13980 10.90 <.0001 1
AR1,2 -0.55575 0.24687 -2.25 0.0244 2
AR1,3 -0.07883 0.14376 -0.55 0.5834 3
(2) Residual autocorrelations
Residual rt
Residual autocorrelation at lag j: Corr(rt, rt-j) = ρ(j)
6
Box-Pierce Q statistic: estimate k of these autocorrelations, square and sum them, and multiply by the sample size n. PROC ARIMA reports k in sets of 6. The limit distribution is chi-square if the errors are independent. A later modification, the Box-Ljung (Ljung-Box) statistic, tests H0: residuals uncorrelated:

Q = n(n+2) Σj=1..k ρ̂(j)² / (n−j)
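A hand-rolled version of the statistic (a sketch, not SAS's implementation):

```python
def ljung_box_q(resid, k):
    """Box-Ljung Q = n(n+2) * sum_{j=1..k} rhohat(j)^2 / (n-j); compare to a
    chi-square with k minus (number of estimated ARMA parameters) df.
    Dropping the (n+2)/(n-j) correction gives the original Box-Pierce n*sum."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    denom = sum(d * d for d in dev)
    q = 0.0
    for j in range(1, k + 1):
        rho_j = sum(dev[t] * dev[t - j] for t in range(j, n)) / denom
        q += rho_j * rho_j / (n - j)
    return n * (n + 2) * q
```

For the silver residuals in the output below, the analogous SAS value at "To Lag 6" is 3.49 on 4 df (6 lags minus 2 estimated AR parameters).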
SAS output:
Autocorrelation Check of Residuals To
Lag Chi-
Square DF Pr >
ChiSq Autocorrelations
6 3.49 4 0.4794 -0.070 -0.049 -0.080 0.100 -0.112 0.151
12 5.97 10 0.8178 0.026 -0.111 -0.094 -0.057 0.006 -0.110
18 10.27 16 0.8522 -0.037 -0.105 0.128 -0.051 0.032 -0.150
24 16.00 22 0.8161 -0.110 0.066 -0.039 0.057 0.200 -0.014
Residuals uncorrelated ⇒ residuals are white noise ⇒ residuals are unpredictable.
7
SAS computes Box-Ljung on original data too.
Autocorrelation Check for White Noise To
Lag Chi-
Square DF Pr >
ChiSq Autocorrelations
6 81.84 6 <.0001 0.867 0.663 0.439 0.214 -0.005 -0.184
12 142.96 12 <.0001 -0.314 -0.392 -0.417 -0.413 -0.410 -0.393
Data autocorrelated ⇒ predictable!
Note: All p-values are based on an assumption called “stationarity” discussed later.
How to predict?
(Yt − µ) − α1(Yt-1 − µ) − α2(Yt-2 − µ) = et
One step prediction:
Ŷt+1 = µ + α1(Yt − µ) + α2(Yt-1 − µ),  future error = et+1
Two step prediction:
Ŷt+2 = µ + α1(Ŷt+1 − µ) + α2(Yt − µ),  error = et+2 + α1et+1
Prediction error variances (σ² = variance(et)): σ², (1 + α1²)σ², …
8
From prediction error variances, get 95% prediction intervals. Can estimate variance of et from past data. SAS PROC ARIMA does it all for you!
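The recursions above with the fitted silver values (µ = 668, α1 = 1.57, α2 = -0.67); the two most recent observations here are made-up numbers for illustration, not silver data:

```python
mu, a1, a2 = 668.0, 1.57, -0.67   # silver AR(2) estimates (rounded)
y_t, y_tm1 = 700.0, 690.0          # hypothetical last two observations

# One step ahead: plug in the two most recent observed values
f1 = mu + a1 * (y_t - mu) + a2 * (y_tm1 - mu)
# Two steps ahead: replace the unknown Y[t+1] by its own forecast
f2 = mu + a1 * (f1 - mu) + a2 * (y_t - mu)

# Prediction error variance factors: sigma^2, then (1 + a1^2) sigma^2, ...
var_factor_1 = 1.0
var_factor_2 = 1.0 + a1 ** 2
```

A 95% prediction interval is then forecast ± 1.96 times the square root of the corresponding error variance, which is what PROC ARIMA reports.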
Moving Average, MA(q), and ARMA(p,q) models
MA(1) Yt = µ + et - θet-1 Variance (1+θ2)σ2
Yt-1 = µ + et-1 - θet-2 ρ(1)=-θ/(1+θ2)
Yt-2 = µ + et-2 - θet-3 ρ(2)=0/(1+θ2)=0
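The ρ(1) formula is easy to check numerically; for example θ = 0.5 gives ρ(1) = -0.4 (a sketch):

```python
def ma1_rho1(theta):
    """MA(1): Yt = mu + et - theta*et-1.
    Var(Y) = (1 + theta^2) sigma^2 and Cov(Yt, Yt-1) = -theta*sigma^2,
    so rho(1) = -theta/(1 + theta^2). rho(j) = 0 for j >= 2 because
    e's more than one lag apart never overlap."""
    return -theta / (1 + theta ** 2)
```

Note |ρ(1)| can never exceed 0.5 (the extreme is at θ = ±1), a quick sanity check on any sample ACF you suspect came from an MA(1).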
9
Autocorrelation function “ACF” (ρ(j)) is 0 after lag q for MA(q). PACF is useless for identifying q in MA(q).
PACF drops to 0 after lag 3 ⇒ AR(3), p = 3
ACF drops to 0 after lag 2 ⇒ MA(2), q = 2
Neither drops ⇒ ARMA(p,q), p = ___, q = ____
(Yt − µ) − α1(Yt-1 − µ) − … − αp(Yt-p − µ) = et − θ1et-1 − … − θqet-q
Example 2: Iron and Steel Exports.
PROC ARIMA plots(unpack)=all;
  Identify VAR=EXPORT;
10
ACF (could be MA(1)): spike at lags 0, 1.  PACF (could be AR(1)): no spike at lag 0.
Estimate P=1 ML;
Estimate Q=2 ML;
Estimate Q=1 ML;

Maximum Likelihood Estimation
Parameter  Estimate   t Value  Approx Pr > |t|  Lag
AR(1):
MU        4.42129   10.28   <.0001  0
AR1,1     0.46415    3.42   0.0006  1
MA(2):
MU        4.43237   11.41   <.0001  0
MA1,1    -0.54780   -3.53   0.0004  1
MA1,2    -0.12663   -0.82   0.4142  2
MA(1):
MU        4.42489   12.81   <.0001  0
MA1,1    -0.49072   -3.59   0.0003  1
How to choose? AIC - smaller is better
11
AIC 165.8342 (MA(1)) AIC 166.3711 (AR(1)) AIC 167.1906 (MA(2))
Forecast lead=5 out=out1 id=date interval=year;
Example 3: Brewers’ Proportion Won
Mean of Working Series 0.478444 Standard Deviation 0.059934 Number of Observations 45
12
Autocorrelations
Lag  Correlation  Std Error
 0    1.00000     0
 1    0.52076     0.149071
 2    0.18663     0.185136
 3    0.11132     0.189271
 4    0.11490     0.190720
 5   -0.00402     0.192252
 6   -0.14938     0.192254
 7   -0.13351     0.194817
 8   -0.06019     0.196840
 9   -0.05246     0.197248
10   -0.20459     0.197558
11   -0.22159     0.202211
12   -0.24398     0.207537
("." marks two standard errors in the SAS plot)
Could be MA(1).
Autocorrelation Check for White Noise
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6   17.27   6  0.0084   0.521  0.187  0.111  0.115 -0.004 -0.149
12   28.02  12  0.0055  -0.134 -0.060 -0.052 -0.205 -0.222 -0.244
NOT white noise!
SAS Code:
proc arima data=brewers;
  identify var=Win_Pct nlag=12;
run;
  estimate q=1 ml;
13
Maximum Likelihood Estimation
Parameter  Estimate  Standard Error  t Value  Approx Pr > |t|  Lag
MU       0.47791   0.01168   40.93  <.0001  0
MA1,1   -0.50479   0.13370   -3.78  0.0002  1
AIC -135.099
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    3.51   5  0.6219   0.095  0.161  0.006  0.119  0.006 -0.140
12   11.14  11  0.4313  -0.061 -0.072  0.066 -0.221 -0.053 -0.242
18   13.54  17  0.6992   0.003 -0.037 -0.162 -0.010 -0.076 -0.011
24   17.31  23  0.7936  -0.045 -0.035 -0.133 -0.087 -0.114  0.015
Estimated Mean 0.477911
Moving Average Factors
Factor 1: 1 + 0.50479 B**(1)
Partial Autocorrelations
Lag  Correlation
 1    0.52076
 2   -0.11603
 3    0.08801
 4    0.04826
 5   -0.12646
 6   -0.12989
 7    0.01803
 8    0.01085
 9   -0.02252
10   -0.20351
11   -0.03129
12   -0.18464
OR … could be AR(1).
14
estimate p=1 ml;
Maximum Likelihood Estimation
Parameter  Estimate  Standard Error  t Value  Approx Pr > |t|  Lag
MU       0.47620   0.01609   29.59  <.0001  0
AR1,1    0.53275   0.12750    4.18  <.0001  1
AIC -136.286 (vs. -135.099)
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    3.57   5  0.6134   0.050 -0.133 -0.033  0.129  0.021 -0.173
12    8.66  11  0.6533  -0.089  0.030  0.117 -0.154 -0.065 -0.181
18   10.94  17  0.8594   0.074  0.027 -0.161  0.010 -0.019  0.007
24   13.42  23  0.9423   0.011 -0.012 -0.092 -0.081 -0.106  0.013
Model for variable Win_pct
Estimated Mean 0.476204
Autoregressive Factors
Factor 1: 1 - 0.53275 B**(1)
Conclusions for Brewers:
Both models have statistically significant
parameters.
Both models are sufficient (no lack of fit)
15
Predictions from MA(1):
First one uses correlations
The rest are on the mean.
Predictions for AR(1):
Converge exponentially fast toward mean
Not much difference but AIC prefers AR(1)
16
Stationarity
(1) Mean constant (no trends)
(2) Variance constant
(3) Covariance γ(j) and correlation ρ(j) = γ(j)/γ(0) between Yt and Yt-j depend only on j
ARMA(p,q) model
(Yt − µ) − α1(Yt-1 − µ) − … − αp(Yt-p − µ) = et − θ1et-1 − … − θqet-q
Stationarity is guaranteed whenever the solutions of the equation (roots of the polynomial)
X^p − α1X^(p-1) − α2X^(p-2) − … − αp = 0
are all < 1 in magnitude.
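For p = 2 the characteristic equation is just a quadratic. A sketch of the check, applied to the silver AR(2) estimates (α1 = 1.57436, α2 = -0.67483):

```python
import cmath

def ar2_roots(a1, a2):
    """Roots of the AR(2) characteristic equation X^2 - a1*X - a2 = 0."""
    disc = cmath.sqrt(a1 * a1 + 4 * a2)   # complex sqrt handles negative discriminant
    return (a1 + disc) / 2, (a1 - disc) / 2

# Silver fit: a complex pair with modulus sqrt(0.67483) ~ 0.82 < 1 => stationary
r1, r2 = ar2_roots(1.57436, -0.67483)
stationary = max(abs(r1), abs(r2)) < 1
```

The same function flags the unit root case: ar2_roots(1.6, -0.6) returns roots 1.0 and 0.6, the nonstationary example worked in the slides that follow.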
17
Examples
(1) Yt − µ = 0.8(Yt-1 − µ) + et;  X − 0.8 = 0, X = 0.8 ⇒ stationary
(2) Yt − µ = 1.00(Yt-1 − µ) + et ⇒ nonstationary
Note: Yt= Yt-1 + et Random walk
(3) Yt−µ = 1.6(Yt-1−µ) − 0.6(Yt-2−µ)+ et
“characteristic polynomial”
X² − 1.6X + 0.6 = 0 ⇒ X = 1 or X = 0.6
⇒ nonstationary (unit root X = 1)
(Yt−µ)−(Yt-1−µ) =0.6[(Yt-1−µ)− (Yt-2−µ)]+ et
(Yt−Yt-1) =0.6(Yt-1− Yt-2) + et
First differences form stationary AR(1) process!
18
No mean – no mean reversion – no gravity pulling toward the mean.
(4) Yt − µ = 1.60(Yt-1 − µ) − 0.63(Yt-2 − µ) + et
X² − 1.60X + 0.63 = 0 ⇒ X = 0.9 or X = 0.7
|roots| < 1 ⇒ stationary
(Yt − µ) − (Yt-1 − µ) = −0.03(Yt-1 − µ) + 0.63[(Yt-1 − µ) − (Yt-2 − µ)] + et
Yt − Yt-1 = −0.03(Yt-1 − µ) + 0.63(Yt-1 − Yt-2) + et
Unit Root testing (H0:Series has a unit root)
Regress
Yt−Yt-1 on Yt-1 and (Yt-1−Yt-2)
Look at t test for Yt-1. If it is significantly negative then stationary.
19
Problem: Distribution of “t statistic” is not t distribution under unit root hypothesis. Distribution looks like this histogram:
(1 million random walks of length n=100)
Overlays: N(sample mean & variance) N(0,1)
Correct distribution: Dickey-Fuller test in PROC ARIMA.
-2.89 is the correct (left) 5th %ile
46% of t’s are less than -1.645
(the normal 5th percentile)
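That 46% figure is easy to reproduce in miniature — a few thousand replications instead of a million. The sketch below regresses Yt − Yt-1 on an intercept and Yt-1 (the "single mean" case) for simulated random walks:

```python
import random

def df_t_single_mean(y):
    """t statistic for Y[t-1] in the regression of (Y[t]-Y[t-1]) on 1, Y[t-1]."""
    x = y[:-1]                                    # lagged level Y[t-1]
    d = [y[t + 1] - y[t] for t in range(len(y) - 1)]  # first difference
    n = len(d)
    xbar, dbar = sum(x) / n, sum(d) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (di - dbar) for xi, di in zip(x, d)) / sxx
    a = dbar - b * xbar
    sse = sum((di - a - b * xi) ** 2 for xi, di in zip(x, d))
    s2 = sse / (n - 2)
    return b / (s2 / sxx) ** 0.5

rng = random.Random(1)
reps, n, below = 2000, 100, 0
for _ in range(reps):
    y = [0.0]
    for _ in range(n):                 # random walk: Y[t] = Y[t-1] + e[t]
        y.append(y[-1] + rng.gauss(0, 1))
    if df_t_single_mean(y) < -1.645:
        below += 1
frac = below / reps   # roughly 0.46 -- nowhere near the nominal 5%
```

The t statistics pile up well to the left of a standard normal, which is exactly why the Dickey-Fuller tables (5th percentile near -2.89) are needed instead of -1.645.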
20
Example 1: Brewers
proc arima data=brewers;
  identify var=Win_Pct nlag=12 stationarity=(ADF=0);

Dickey-Fuller Unit Root Tests
Type          Lags      Rho   Pr < Rho    Tau   Pr < Tau
Zero Mean       0    -0.1803    0.6376   -0.22    0.6002
Single Mean     0   -21.0783    0.0039   -3.75    0.0062
Trend           0   -21.1020    0.0287   -3.68    0.0347

Conclusion: reject H0 (unit root), so the Brewers series is stationary (mean reverting).
0 lags ⇒ no lagged differences needed in the model (just regress Yt−Yt-1 on Yt-1).
21
Example 2: Stocks of silver revisited
Needed AR(2) (2 lags) so regress
Yt-Yt-1 (D_Silver) on
Yt-1 (L_Silver) and Yt-1-Yt-2 (D_Silver_1)
PROC REG: Parameter Estimates
Variable     DF  Estimate   t Value  Pr > |t|
Intercept     1  75.58073    2.76    0.0082
L_Silver      1  -0.11703   -2.78    0.0079   (wrong distribution)
D_Silver_1    1   0.67115    6.21   <.0001    OK
22
PROC ARIMA: Augmented Dickey-Fuller Unit Root Tests
Type          Lags      Rho   Pr < Rho    Tau   Pr < Tau
Zero Mean       1    -0.2461    0.6232   -0.28    0.5800
Single Mean     1   -17.7945    0.0121   -2.78    0.0689   OK
Trend           1   -15.1102    0.1383   -2.63    0.2697
Same t statistic, corrected p-value!
Conclusion: unit root ⇒ difference the series.
1 lag ⇒ need 1 lagged difference in the model (regress Yt−Yt-1 on Yt-1 and Yt-1−Yt-2).
PROC ARIMA data=silver;
  identify var=silver(1) stationarity=(ADF=(0));
  estimate p=1 ml;
  forecast lead=24 out=outN ID=date Interval=month;
23
Unit root forecast & forecast interval
PROC AUTOREG
Fits a regression model (least squares)
Fits stationary autoregressive model to error terms
Refits accounting for autoregressive errors.
Example 3: AUTOREG Harley-Davidson closing stock prices 2009-present.
24
proc autoreg data=Harley;
model close=date/ nlag=15 backstep; run;
One by one, AUTOREG eliminates insignificant lags then:
Estimates of Autoregressive Parameters Lag Coefficient Standard Error t Value
1 -0.975229 0.006566 -148.53
25
Final model: Parameter Estimates
Variable DF Estimate Standard Error t Value Approx Pr > |t|
Intercept 1 -412.1128 35.2646 -11.69 <.0001
Date 1 0.0239 0.001886 12.68 <.0001
Error term Zt satisfies Zt-0.97Zt-1=et.
Example 3 ARIMA: Harley-Davidson closing stock prices 2009-present. (vs. AUTOREG)
Apparent upward movement: Linear trend or nonstationary?
Regress
Yt – Yt-1 on 1, t, Yt-1 (& lagged differences)
26
H0: Yt= β + Yt-1 + et “random walk with drift”
H1: Yt=α+βt + Zt with Zt stationary AR(p)
New distribution for Yt-1 t-test
With trend
27
Without trend
1 million simulations - runs in 7 seconds!
SAS code for Harley stock closing price
proc arima data=Harley;
  identify var=close stationarity=(adf) crosscor=(date) noprint;
  Estimate input=(date) p=1 ml;
  forecast lead=120 id=date interval=weekday out=out1;
run;
28
Stationarity test (0,1,2 lagged differences):
Augmented Dickey-Fuller Unit Root Tests
Type Lags Rho Pr < Rho Tau Pr < Tau
Zero Mean 0 0.8437 0.8853 1.14 0.9344
1 0.8351 0.8836 1.14 0.9354
2 0.8097 0.8786 1.07 0.9268
Single Mean 0 -2.0518 0.7726 -0.87 0.7981
1 -1.7772 0.8048 -0.77 0.8278
2 -1.8832 0.7925 -0.78 0.8227
Trend 0 -27.1559 0.0150 -3.67 0.0248
1 -26.9233 0.0158 -3.64 0.0268
2 -29.4935 0.0089 -3.80 0.0171
Conclusion: stationary around a linear trend.
Estimates: trend + AR(1)
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
MU -412.08104 35.45718 -11.62 <.0001 0 Close 0
AR1,1 0.97528 0.0064942 150.18 <.0001 1 Close 0
NUM1 0.02391 0.0018961 12.61 <.0001 0 Date 0
29
Autocorrelation Check of Residuals
To Lag Chi-Square DF Pr > ChiSq Autocorrelations
6 3.20 5 0.6694 -0.005 0.044 -0.023 0.000 0.017 0.005
12 6.49 11 0.8389 -0.001 0.019 0.003 -0.010 0.049 -0.003
18 10.55 17 0.8791 0.041 -0.026 -0.022 -0.023 0.007 -0.011
24 16.00 23 0.8553 0.014 -0.037 0.041 -0.020 -0.032 0.003
30 22.36 29 0.8050 0.013 -0.026 0.028 0.051 0.036 0.000
36 24.55 35 0.9065 0.037 0.016 -0.012 0.002 -0.007 0.001
42 29.53 41 0.9088 -0.007 -0.021 0.029 0.030 -0.033 0.030
48 49.78 47 0.3632 0.027 -0.009 -0.097 -0.026 -0.074 0.026
30
NCSU Energy Demand
Type of day
Class Days
Work Days (no classes)
Holidays & weekends.
Temperature, Season of Year
Step 1: Make some plots of energy demand vs. temperature and season. Use type of day as color.
Seasons: S = A sin(2πt/365), C = B cos(2πt/365)
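The S and C regressors are one sine-cosine pair at the annual period. A sketch of how they are built (the day index t here is hypothetical, not the actual energy data):

```python
import math

# One year of daily harmonic regressors at the annual period
S = [math.sin(2 * math.pi * t / 365) for t in range(366)]
C = [math.cos(2 * math.pi * t / 365) for t in range(366)]

# The regression estimates the amplitudes A and B. Since
# A*sin(x) + B*cos(x) = R*sin(x + phase) with R = sqrt(A^2 + B^2),
# the pair lets least squares place the seasonal peak at any day of the year.
```

This is why both S and C are included: one sine alone fixes the phase, while the pair leaves the timing of the seasonal peak to the data.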
31
Step 2: PROC AUTOREG with all inputs:
PROC AUTOREG data=energy;
  MODEL DEMAND = TEMP TEMPSQ CLASS WORK S C / NLAG=15 BACKSTEP DWPROB;
  output out=out3 predicted=p predictedm=pm residual=r residualm=rm;
run;
Estimates of Autoregressive Parameters
Lag Coefficient Standard Error t Value
1 -0.559658 0.043993 -12.72
5 -0.117824 0.045998 -2.56
7 -0.220105 0.053999 -4.08
8 0.188009 0.059577 3.16
9 -0.108031 0.051219 -2.11
12 0.110785 0.046068 2.40
14 -0.094713 0.045942 -2.06
Autocorrelation at 1, 7, 14, and others. After autocorrelation adjustments, trust t tests etc.
32
Parameter Estimates
Variable DF Estimate Standard Error t Value Approx Pr > |t|
Intercept 1 6076 296.5261 20.49 <.0001
TEMP 1 28.1581 3.6773 7.66 <.0001
TEMPSQ 1 0.6592 0.1194 5.52 <.0001
CLASS 1 1159 117.4507 9.87 <.0001
WORK 1 2769 122.5721 22.59 <.0001
S 1 -764.0316 186.0912 -4.11 <.0001
C 1 -520.8604 188.2783 -2.77 0.0060
Residuals from regression part. Large residual on workday near Christmas. Add dummy variable.
[Plot: regression residuals rm vs. DATE, 01JUL79–01JUL80; point colors: non-work, work, class]
Big negative residual on Jan. 2 – need a better model?
33
Same idea: PROC ARIMA Step 1: Graphs Step 2: Regress on inputs, diagnose residual autocorrelation:
Not white noise (bottom right) Activity (bars) at lag 1, 7, 14
34
(3) Estimate the resulting model from diagnostics plus trial and error:
estimate input=(temp tempsq class work s c) p=1 q=(1,7,14) ml;
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
MU 6183.1 300.87297 20.55 <.0001 0 DEMAND 0
MA1,1 0.11481 0.07251 1.58 0.1133 1 DEMAND 0
MA1,2 -0.18467 0.05415 -3.41 0.0006 7 DEMAND 0
MA1,3 -0.13326 0.05358 -2.49 0.0129 14 DEMAND 0
AR1,1 0.73980 0.05090 14.53 <.0001 1 DEMAND 0
NUM1 26.89511 3.83769 7.01 <.0001 0 TEMP 0
NUM2 0.64614 0.12143 5.32 <.0001 0 TEMPSQ 0
NUM3 912.80536 122.78189 7.43 <.0001 0 CLASS 0
NUM4 2971.6 123.94067 23.98 <.0001 0 WORK 0
NUM5 -767.41131 174.59057 -4.40 <.0001 0 S 0
NUM6 -553.13620 182.66142 -3.03 0.0025 0 C 0
(Note: class days get class effect plus work effect)
35
(5) Check model fit (stats look OK):
Autocorrelation Check of Residuals
To Lag Chi-Square DF Pr > ChiSq Autocorrelations
6 2.86 2 0.2398 -0.001 -0.009 -0.053 -0.000 0.050 0.047
12 10.71 8 0.2188 0.001 -0.034 0.122 0.044 -0.039 -0.037
18 13.94 14 0.4541 -0.056 0.013 -0.031 0.048 -0.006 -0.042
24 16.47 20 0.6870 -0.023 -0.028 0.039 -0.049 0.020 -0.029
30 24.29 26 0.5593 0.006 0.050 -0.098 0.077 -0.002 0.039
36 35.09 32 0.3239 -0.029 -0.075 0.057 -0.001 0.121 -0.047
42 39.99 38 0.3817 0.002 -0.007 0.088 0.019 -0.004 0.060
48 43.35 44 0.4995 -0.043 0.043 -0.027 -0.047 -0.019 -0.032
36
Looking for “outliers” that can be explained: PROC ARIMA, OUTLIER statement. Available types:
(1) Additive (single outlier)
(2) Level shift (sudden change in mean)
(3) Temporary change (level shift for k contiguous time points – you specify k)
NCSU energy: tested every point – 365 tests. Adjust for multiple testing: 0.05/365 = .0001369863 (Bonferroni).
OUTLIER type=additive alpha=.0001369863 id=date;
FORMAT date weekdate.;
run;
/******************************************************
January 2, 1980 Wednesday: Hangover Day :-)
March 3, 1980 Monday: On the afternoon and evening of March 2, 1980, North Carolina experienced a major winter storm with heavy snow across the entire state and near blizzard conditions in the eastern part of the state. Widespread snowfall totals of 12 to 18 inches were observed over Eastern North Carolina, with localized amounts ranging up to 22 inches at Morehead City and 25 inches at Elizabeth City, with unofficial reports of up to 30 inches at Emerald Isle and Cherry Point (Figure 1). This was one of the great snowstorms in Eastern North Carolina history. What made this storm so remarkable was the combination of snow, high winds, and very cold temperatures.
May 10, 1980 Saturday: Graduation!
*****************************************************/;
Outlier Details
Obs Time ID Type Estimate Chi-Square Approx Prob>ChiSq
186 Wednesday Additive -3250.9 87.76 <.0001
315 Saturday Additive 1798.1 28.19 <.0001
247 Monday Additive -1611.8 22.65 <.0001
Outlier Details
Obs Time ID Type Estimate Chi-Square Approx Prob>ChiSq
186 02-JAN-1980 Additive -3250.9 87.76 <.0001
315 10-MAY-1980 Additive 1798.1 28.19 <.0001
247 03-MAR-1980 Additive -1611.8 22.65 <.0001
38
Outliers: Jan 2 (hangover day!), March 3 (snowstorm), May 10 (graduation day). AR(1) ‘rebound’ ⇒ outlying residuals the next day too. Add dummy variables for the explainable outliers:
data next;
  merge outarima energy;
  by date;
  hangover = (date="02Jan1980"d);
  storm = (date="03Mar1980"d);
  graduation = (date="10May1980"d);
Proc ARIMA data=next;
  identify var=demand crosscor=(temp tempsq class work s c hangover graduation storm) noprint;
  estimate input=(temp tempsq class work s c hangover graduation storm) p=1 q=(7,14) ml;
  forecast lead=0 out=outARIMA2 id=date interval=day;
run;
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
MU 6127.4 259.43918 23.62 <.0001 0 DEMAND 0
MA1,1 -0.25704 0.05444 -4.72 <.0001 7 DEMAND 0
MA1,2 -0.10821 0.05420 -2.00 0.0459 14 DEMAND 0
AR1,1 0.76271 0.03535 21.57 <.0001 1 DEMAND 0
NUM1 27.89783 3.15904 8.83 <.0001 0 TEMP 0
NUM2 0.54698 0.10056 5.44 <.0001 0 TEMPSQ 0
39
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
NUM3 626.08113 104.48069 5.99 <.0001 0 CLASS 0
NUM4 3258.1 105.73971 30.81 <.0001 0 WORK 0
NUM5 -757.90108 181.28967 -4.18 <.0001 0 S 0
NUM6 -506.31892 184.50221 -2.74 0.0061 0 C 0
NUM7 -3473.8 334.16645 -10.40 <.0001 0 hangover 0
NUM8 2007.1 331.77424 6.05 <.0001 0 graduation 0
NUM9 -1702.8 333.79141 -5.10 <.0001 0 storm 0
Constant Estimate 1453.963
Variance Estimate 181450
Std Error Estimate 425.9695
AIC 5484.728
SBC 5535.462
Number of Residuals 366
40
Model looks fine.
AUTOREG – regression with AR(p) errors.
ARIMA – regressors, differencing, ARMA(p,q) errors.
SEASONALITY: many economic and environmental series show seasonality, either
(1) very regular (“deterministic”) or
(2) slowly changing (“stochastic”).
41
Example 1: NC accident reports involving deer.
Method 1: regression.
PROC REG data=deer;
  model deer = X11;   /* X11: 1 in Nov, 0 otherwise */
Parameter Estimates
Variable    DF  Estimate     Standard Error  t Value  Pr > |t|
Intercept    1  1181.09091    78.26421  15.09  <.0001
X11          1  2578.50909   271.11519   9.51  <.0001

Looks like December and October need dummies too!
PROC REG data=deer;
  model deer = X10 X11 X12;
Parameter Estimates
Variable    DF  Estimate     Standard Error  t Value  Pr > |t|
Intercept    1   929.40000    39.13997  23.75  <.0001
X10          1  1391.20000   123.77145  11.24  <.0001
X11          1  2830.20000   123.77145  22.87  <.0001
X12          1  1377.40000   123.77145  11.13  <.0001
The average of January through September is 929 crashes per month. Add 1391 in October, 2830 in November, 1377 in December.
(graph on next page)
42
Try dummies for all but one month (we need an “average of the rest,” so at least one month must be left out):
PROC REG data=deer;
  model deer = X1 X2 … X10 X11;
Parameter Estimates
Variable    DF  Estimate      Standard Error  t Value  Pr > |t|
Intercept    1   2306.80000    81.42548  28.33  <.0001
X1           1   -885.80000   115.15301  -7.69  <.0001
X2           1  -1181.40000   115.15301 -10.26  <.0001
X3           1  -1220.20000   115.15301 -10.60  <.0001
X4           1  -1486.80000   115.15301 -12.91  <.0001
X5           1  -1526.80000   115.15301 -13.26  <.0001
X6           1  -1433.00000   115.15301 -12.44  <.0001
X7           1  -1559.20000   115.15301 -13.54  <.0001
X8           1  -1646.20000   115.15301 -14.30  <.0001
X9           1  -1457.20000   115.15301 -12.65  <.0001
X10          1     13.80000   115.15301   0.12  0.9051
X11          1   1452.80000   115.15301  12.62  <.0001
The “average of the rest” is just the December mean, 2307. Subtract 886 in January, add 1452 in November. October (X10) is not significantly different from December.
43
Residuals for the Deer Crash data look like a trend – add a trend (date):
PROC REG data=deer;
  model deer = date X1 X2 … X10 X11;
Parameter Estimates
Variable    DF  Estimate      Standard Error  t Value  Pr > |t|
Intercept    1  -1439.94000   547.36656  -2.63  0.0115
X1           1   -811.13686    82.83115  -9.79  <.0001
X2           1  -1113.66253    82.70543 -13.47  <.0001
X3           1  -1158.76265    82.60154 -14.03  <.0001
X4           1  -1432.28832    82.49890 -17.36  <.0001
X5           1  -1478.99057    82.41114 -17.95  <.0001
X6           1  -1392.11624    82.33246 -16.91  <.0001
X7           1  -1525.01849    82.26796 -18.54  <.0001
X8           1  -1618.94416    82.21337 -19.69  <.0001
X9           1  -1436.86982    82.17106 -17.49  <.0001
X10          1     27.42792    82.14183   0.33  0.7399
X11          1   1459.50226    82.12374  17.77  <.0001
date         1      0.22341     0.03245   6.88  <.0001
44
Trend is 0.22 more accidents per day (1 per 5 days) and is significantly different from 0.
What about autocorrelation?
Method 2: PROC AUTOREG
PROC AUTOREG data=deer;
  model deer = date X1 - X11 / nlag=13 backstep;
Backward Elimination of Autoregressive Terms
Lag  Estimate   t Value  Pr > |t|
 6  -0.003105  -0.02  0.9878
11   0.023583   0.12  0.9029
 4  -0.032219  -0.17  0.8641
 9  -0.074854  -0.42  0.6796
 5   0.064228   0.44  0.6610
13  -0.081846  -0.54  0.5955
12   0.076075   0.56  0.5763
 8  -0.117946  -0.81  0.4205
10  -0.127661  -0.95  0.3489
 7   0.153680   1.18  0.2458
 2   0.254137   1.57  0.1228
 3  -0.178895  -1.37  0.1781
Preliminary MSE 10421.3
Estimates of Autoregressive Parameters
Lag  Coefficient  Standard Error  t Value
 1   -0.459187    0.130979   -3.51
45
Parameter Estimates
Variable    DF  Estimate   Standard Error  t Value  Approx Pr > |t|
Intercept    1  -1631      857.3872   -1.90  0.0634
date         1   0.2346      0.0512    4.58  <.0001
X1           1  -789.7592   64.3967  -12.26  <.0001
X2           1  -1100       74.9041  -14.68  <.0001
X3           1  -1149       79.0160  -14.54  <.0001
X4           1  -1424       80.6705  -17.65  <.0001
X5           1  -1472       81.2707  -18.11  <.0001
X6           1  -1386       81.3255  -17.04  <.0001
X7           1  -1519       80.9631  -18.76  <.0001
X8           1  -1614       79.9970  -20.17  <.0001
X9           1  -1432       77.8118  -18.40  <.0001
X10          1   31.3894    72.8112    0.43  0.6684
X11          1   1462       60.4124   24.20  <.0001
Method 3: PROC ARIMA
PROC ARIMA plots=(forecast(forecast));
  IDENTIFY var=deer crosscor=(date X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11);
  ESTIMATE input=(date X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11) p=1 ML;
  FORECAST lead=12 id=date interval=month;
run;
Maximum Likelihood Estimation
Parameter  Estimate    Standard Error  t Value  Approx Pr > |t|  Lag  Variable  Shift
MU        -1640.9     877.41683   -1.87  0.0615  0  deer  0
AR1,1      0.47212      0.13238    3.57  0.0004  1  deer  0
NUM1       0.23514      0.05244    4.48  <.0001  0  date  0
NUM2    -789.10728     64.25814  -12.28  <.0001  0  X1    0
NUM3      -1099.3      74.93984  -14.67  <.0001  0  X2    0
46
NUM4      -1148.2      79.17135  -14.50  <.0001  0  X3   0
NUM5      -1423.7      80.90397  -17.60  <.0001  0  X4   0
NUM6      -1471.6      81.54553  -18.05  <.0001  0  X5   0
NUM7      -1385.4      81.60464  -16.98  <.0001  0  X6   0
NUM8      -1518.9      81.20724  -18.70  <.0001  0  X7   0
NUM9      -1613.4      80.15788  -20.13  <.0001  0  X8   0
NUM10     -1432.0      77.82871  -18.40  <.0001  0  X9   0
NUM11     31.46310     72.61873    0.43  0.6648  0  X10  0
NUM12      1462.1      59.99732   24.37  <.0001  0  X11  0
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    5.57   5  0.3504   0.042 -0.175  0.146  0.178  0.001 -0.009
12   10.41  11  0.4938  -0.157 -0.017  0.102  0.115 -0.055 -0.120
18   22.18  17  0.1778   0.158  0.147 -0.183 -0.160  0.189 -0.008
24   32.55  23  0.0893  -0.133  0.013 -0.095  0.005  0.101 -0.257
Autoregressive Factors
Factor 1: 1 - 0.47212 B**(1)
Model 3: Differencing. Compute and model Dt = Yt − Yt-12.
Removes seasonality. Removes a linear trend.
Use (at least) q=(12): et − θet-12.
(A) If θ is near 1, you've overdifferenced.
(B) If 0 < θ < 1, this is the seasonal exponential smoothing model.
47
Yt − Yt-1 = et − θet-1
et = Yt − Yt-1 + θet-1 = Yt − [(1 − θ)(Yt-1 + θYt-2 + θ²Yt-3 + …)]
Yt = (1 − θ)[Yt-1 + θYt-2 + θ²Yt-3 + θ³Yt-4 + …] + et
The forecast is a weighted (exponentially smoothed) average of past values:
Ŷt = (1 − θ)[Yt-1 + θYt-2 + θ²Yt-3 + θ³Yt-4 + …]
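Writing the weights as (1−θ)θ^j shows why case (B) is called exponential smoothing: the weights decay geometrically and sum to one. A sketch with a finite history (so the weights sum to 1 − θ^N):

```python
def exp_smooth_forecast(past, theta):
    """Forecast = (1-theta)*[Y[t-1] + theta*Y[t-2] + theta^2*Y[t-3] + ...],
    with `past` listed most recent first."""
    return (1 - theta) * sum(theta ** j * y for j, y in enumerate(past))

theta = 0.9
# Geometric weights: (1-theta)*sum(theta^j) = 1 - theta^200 ~ 1
weight_sum = (1 - theta) * sum(theta ** j for j in range(200))
# Forecasting a constant series (essentially) returns the constant
f = exp_smooth_forecast([100.0] * 200, theta)
```

Because the weights sum to one, the forecast is a genuine weighted average of the past, with recent values counting most.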
IDENTIFY var=deer(12) nlag=25;
ESTIMATE P=1 Q=(12) ml;
run;
Maximum Likelihood Estimation
Parameter  Estimate   Standard Error  t Value  Approx Pr > |t|  Lag
MU       85.73868   16.78380   5.11  <.0001   0
MA1,1     0.89728    0.94619   0.95  0.3430  12
AR1,1     0.46842    0.11771   3.98  <.0001   1
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    4.31   4  0.3660   0.053 -0.161  0.140  0.178 -0.026 -0.013
12    7.47  10  0.6801  -0.146 -0.020  0.105  0.131 -0.035 -0.029
18   18.02  16  0.3226   0.198  0.167 -0.143 -0.154  0.183  0.002
24   24.23  22  0.3355  -0.127  0.032 -0.083  0.022  0.134 -0.155
48
The lag-12 MA estimate is somewhat close to 1 with a large standard error; the model is OK but not best. Variance estimate 15,122 (vs. 13,431 for the dummy variable model). Forecasts are similar 2 years out.
49
Accounting for changes in trend: Nenana Ice Classic data. Exact time (day and time) of thaw of the Tanana River at Nenana, Alaska:
1917 Apr 30 11:30 a.m.
1918 May 11 9:33 a.m.
1919 May 3 2:33 p.m.
(more data)
2010 Apr 29 6:22 p.m.
2011 May 04 4:24 p.m.
2012 Apr 23 7:39 p.m.
When the tripod moves downstream, that is the unofficial start of spring.
50
Get a “ramp” with PROC NLIN:  ____/
X = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 …
PROC NLIN data=all;
  PARMS point=1960 int=126 slope=-.2;
  X = (year-point)*(year>point);
  MODEL break = int + slope*X;
  OUTPUT out=out2 predicted=p residual=r;
Parameter  Estimate  Approx Std Error  Approximate 95% Confidence Limits
point      1965.4       11.2570        1943.0    1987.7
int         126.0        0.7861         124.5     127.6
slope      -0.1593       0.0592        -0.2769   -0.0418
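PROC NLIN's ramp fit can be mimicked with a grid search over the breakpoint, fitting the straight line by least squares at each candidate. The data below are a made-up noiseless ramp for illustration, not the Nenana series:

```python
def fit_ramp(years, y, point):
    """Least squares for y = int + slope * max(year - point, 0)."""
    x = [max(yr - point, 0.0) for yr in years]
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    inter = ybar - slope * xbar
    sse = sum((yi - inter - slope * xi) ** 2 for xi, yi in zip(x, y))
    return inter, slope, sse

# Toy series: flat at 126 until 1965, then declining by 0.2 per year
years = list(range(1917, 2013))
y = [126.0 - 0.2 * max(yr - 1965, 0) for yr in years]

# Pick the candidate breakpoint with the smallest error sum of squares
best = min(range(1920, 2010), key=lambda p: fit_ramp(years, y, p)[2])
```

NLIN instead solves for the breakpoint as a continuous parameter, which is how it also produces the approximate standard errors shown above.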
PROC SGPLOT data=out2;
  SERIES Y=break X=year;
  SERIES Y=p X=year / lineattrs=(color=red thickness=2);
  REFLINE 1965.4 / axis=X;
run; quit;
51
What about autocorrelation? Final ramp: Xt = (year-1965.4)*(year>1965);
PROC ARIMA;
  IDENTIFY var=break crosscor=(ramp) noprint;
  ESTIMATE input=(ramp);
  FORECAST lead=5 id=year out=out1;
run;
PROC ARIMA generates diagnostic plots:
52
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    6.93   6  0.3275  -0.025  0.067 -0.027 -0.032 -0.152 -0.192
12   14.28  12  0.2834   0.086 -0.104 -0.074  0.184 -0.041 -0.091
18   20.30  18  0.3164  -0.086  0.111  0.041 -0.073  0.142 -0.066
24   21.37  24  0.6166  -0.015 -0.020 -0.036 -0.059 -0.047 -0.030
Example 2: Visitors to St. Petersburg/Clearwater
http://www.pinellascvb.com/statistics/2013-04-VisitorProfile.pdf
Model 1: Seasonal dummy variables + trend + AR(p) (REG - R2>97%)
53
proc reg data=aaem.stpete;
  model visitors = t m1-m11;
  output out=out4 predicted=P residual=R UCL=u95 LCL=l95;
run;
PROC SGPLOT data=out3;
  BAND lower=l95 upper=u95 X=date;
  SERIES Y=P X=date;
  SCATTER Y=visitors X=date / datalabel=month datalabelattrs=(color=red size=0.3 cm);
  SERIES Y=U95 X=date / lineattrs=(color=red thickness=0.8);
  SERIES Y=L95 X=date / lineattrs=(color=red thickness=0.8);
  REFLINE "01apr2013"d / axis=x;
  where 2011<year(date)<2015;
run;
54
PROC SGPLOT data=out4;
  NEEDLE Y=r X=date;
run;
Definitely autocorrelated. Slowly changing mean? Try a seasonal span-difference model:
PROC ARIMA data=StPete plots=forecast(forecast);
  IDENTIFY var=visitors(12);
55
A typical ACF for an ARMA(1,1) has an initial dropoff followed by exponential decay – try ARMA(1,1) on the span-12 differences.
ESTIMATE P=1 Q=1 ml;
FORECAST lead=44 id=date interval=month out=outARIMA; run;

Maximum Likelihood Estimation
Parameter  Estimate    Standard Error  t Value  Approx Pr > |t|  Lag
MU       0.0058962   0.0029586    1.99  0.0463  0
MA1,1    0.32635     0.11374      2.87  0.0041  1
AR1,1    0.79326     0.07210     11.00  <.0001  1

Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    1.81   4  0.7699   0.022 -0.055 -0.000  0.011 -0.028 -0.071
12   14.87  10  0.1367   0.109  0.044  0.003  0.060  0.211 -0.064
18   16.41  16  0.4248   0.010 -0.012  0.036  0.014 -0.066  0.038
24   17.30  22  0.7468  -0.056  0.020  0.004  0.018 -0.009 -0.017
30   34.15  28  0.1961  -0.131 -0.052 -0.144  0.134  0.026 -0.133
36   45.08  34  0.0969  -0.102  0.051  0.135 -0.107  0.014 -0.074
56
PROC SGPLOT data=outARIMA;
  SERIES X=date Y=residual;
run;
57
Summary:
PROC ARIMA – fits autoregressive moving average models. Use the ACF and PACF to identify p = # autoregressive lags and q = # moving average lags.
Stationarity – mean-reverting models versus unit roots (random walk type models). Graphics and the DF test (and others) are available.
Diagnostics – errors should be white noise – Ljung-Box test to check.
PROC AUTOREG – regression with autoregressive errors (ARIMA can also handle X variables).
PROC NLIN – to estimate slope changes (least squares).
PROC SGPLOT – new graphics routines.