TRANSCRIPT
1
Overview of Time Series and Forecasting:
Data taken over time (usually equally spaced)
Yt = data at time t
µ = mean (constant over time)
Models:
“Autoregressive”
(Yt − µ) = α1(Yt-1 − µ) + α2(Yt-2 − µ) + … + αp(Yt-p − µ) + et
et independent, constant variance: “White Noise”
How to find p? Regress Y on lags.
PACF Partial Autocorrelation Function
(1) Regress Yt on Yt-1 then Yt on Yt-1 and Yt-2
then Yt on Yt-1,Yt-2, Yt-3 etc.
(2) Plot last lag coefficients versus lags.
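The two-step recipe above can be sketched numerically. A minimal sketch (not PROC ARIMA's implementation): the Durbin-Levinson recursion turns autocorrelations into the "last lag coefficients," which is equivalent to running the successive regressions. The AR(2) coefficients 1.57 and -0.67 below come from the silver example fit later in these notes.

```python
def pacf_from_acf(rho):
    """Durbin-Levinson recursion: rho[j-1] holds the lag-j autocorrelation;
    returns the partial autocorrelations (the 'last lag' coefficients)."""
    pacf, phi_prev = [], []
    for k in range(1, len(rho) + 1):
        if k == 1:
            phi = [rho[0]]
        else:
            num = rho[k - 1] - sum(phi_prev[j] * rho[k - 2 - j] for j in range(k - 1))
            den = 1 - sum(phi_prev[j] * rho[j] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi.append(phi_kk)
        pacf.append(phi[-1])
        phi_prev = phi
    return pacf

# Theoretical ACF of the AR(2) model (Yt - mu) = 1.57(Yt-1 - mu) - 0.67(Yt-2 - mu) + et
a1, a2 = 1.57, -0.67
rho = [a1 / (1 - a2)]             # Yule-Walker: rho(1) = a1/(1 - a2)
rho.append(a1 * rho[0] + a2)      # rho(2) = a1*rho(1) + a2
for _ in range(3):                # rho(k) = a1*rho(k-1) + a2*rho(k-2)
    rho.append(a1 * rho[-1] + a2 * rho[-2])

pacf = pacf_from_acf(rho)         # spikes at lags 1 and 2, then ~0
```

The lag-2 partial autocorrelation recovers α2 = -0.67 and all later lags are numerically zero: exactly the "two spikes, p = 2" pattern the slides read off the plot.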
2
Example 1: Supplies of Silver in NY commodities exchange:
Getting the PACF (and other identifying plots). SAS™ code:
PROC ARIMA data=silver plots(unpack)=all;
  identify var=silver;
run;
(™ SAS and its products are registered trademarks of SAS Institute, Cary NC.)
3
PACF
“Spikes” outside the 2 standard error bands are statistically significant.
Two spikes ⇒ p = 2
(Yt − µ) = α1(Yt-1 − µ) + α2(Yt-2 − µ) + et
How to estimate µ and the α's? PROC ARIMA's ESTIMATE statement, using maximum likelihood (the ml option):
PROC ARIMA data=silver plots(unpack)=all;
  identify var=silver;
  estimate p=2 ml;
4
Maximum Likelihood Estimation Parameter Estimate Standard Error t Value Approx
Pr > |t| Lag
MU 668.29592 38.07935 17.55 <.0001 0
AR1,1 1.57436 0.10186 15.46 <.0001 1
AR1,2 -0.67483 0.10422 -6.48 <.0001 2
(Yt − µ) − α1(Yt-1 − µ) − α2(Yt-2 − µ) = et
(Yt − 668) − 1.57(Yt-1 − 668) + 0.67(Yt-2 − 668) = et
(Yt − 668) = 1.57(Yt-1 − 668) − 0.67(Yt-2 − 668) + et
5
Backshift notation: B(Yt) = Yt-1, B²(Yt) = B(B(Yt)) = Yt-2, so the fitted model is
(1 − 1.57B + 0.67B²)(Yt − 668) = et
SAS output (uses backshift):
Autoregressive Factors
Factor 1: 1 - 1.57436 B**(1) + 0.67483 B**(2)
Checks:
(1) Overfit (try AR(3) )
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag
MU 664.88129 35.21080 18.88 <.0001 0
AR1,1 1.52382 0.13980 10.90 <.0001 1
AR1,2 -0.55575 0.24687 -2.25 0.0244 2
AR1,3 -0.07883 0.14376 -0.55 0.5834 3
(2) Residual autocorrelations
Residual rt
Residual autocorrelation at lag j: Corr(rt, rt-j) = ρ(j)
6
Box-Pierce Q statistic: estimate k of these autocorrelations, square and sum them, and multiply by the sample size n. PROC ARIMA reports k in sets of 6. The limit distribution is chi-square if the errors are independent. A later modification, the Box-Ljung (Ljung-Box) statistic, tests H0: residuals uncorrelated:

Q = n(n+2) Σj=1..k ρ̂(j)² / (n−j)
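A hand-rolled version of the statistic (a sketch, not SAS's implementation):

```python
def ljung_box_q(resid, k):
    """Box-Ljung Q = n(n+2) * sum_{j=1..k} rhohat(j)^2 / (n-j); compare to a
    chi-square with k minus (number of estimated ARMA parameters) df.
    Dropping the (n+2)/(n-j) correction gives the original Box-Pierce n*sum."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    denom = sum(d * d for d in dev)
    q = 0.0
    for j in range(1, k + 1):
        rho_j = sum(dev[t] * dev[t - j] for t in range(j, n)) / denom
        q += rho_j * rho_j / (n - j)
    return n * (n + 2) * q
```

For the silver residuals in the output below, the analogous SAS value at "To Lag 6" is 3.49 on 4 df (6 lags minus 2 estimated AR parameters).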
SAS output:
Autocorrelation Check of Residuals To
Lag Chi-
Square DF Pr >
ChiSq Autocorrelations
6 3.49 4 0.4794 -0.070 -0.049 -0.080 0.100 -0.112 0.151
12 5.97 10 0.8178 0.026 -0.111 -0.094 -0.057 0.006 -0.110
18 10.27 16 0.8522 -0.037 -0.105 0.128 -0.051 0.032 -0.150
24 16.00 22 0.8161 -0.110 0.066 -0.039 0.057 0.200 -0.014
Residuals uncorrelated ⇒ residuals are white noise ⇒ residuals are unpredictable.
7
SAS computes Box-Ljung on original data too.
Autocorrelation Check for White Noise To
Lag Chi-
Square DF Pr >
ChiSq Autocorrelations
6 81.84 6 <.0001 0.867 0.663 0.439 0.214 -0.005 -0.184
12 142.96 12 <.0001 -0.314 -0.392 -0.417 -0.413 -0.410 -0.393
Data autocorrelated ⇒ predictable!
Note: All p-values are based on an assumption called “stationarity” discussed later.
How to predict?
(Yt − µ) − α1(Yt-1 − µ) − α2(Yt-2 − µ) = et
One step prediction:
Ŷt+1 = µ + α1(Yt − µ) + α2(Yt-1 − µ),  future error = et+1
Two step prediction:
Ŷt+2 = µ + α1(Ŷt+1 − µ) + α2(Yt − µ),  error = et+2 + α1et+1
Prediction error variances (σ² = variance(et)): σ², (1 + α1²)σ², …
8
From prediction error variances, get 95% prediction intervals. Can estimate variance of et from past data. SAS PROC ARIMA does it all for you!
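The recursions above with the fitted silver values (µ = 668, α1 = 1.57, α2 = -0.67); the two most recent observations here are made-up numbers for illustration, not silver data:

```python
mu, a1, a2 = 668.0, 1.57, -0.67   # silver AR(2) estimates (rounded)
y_t, y_tm1 = 700.0, 690.0          # hypothetical last two observations

# One step ahead: plug in the two most recent observed values
f1 = mu + a1 * (y_t - mu) + a2 * (y_tm1 - mu)
# Two steps ahead: replace the unknown Y[t+1] by its own forecast
f2 = mu + a1 * (f1 - mu) + a2 * (y_t - mu)

# Prediction error variance factors: sigma^2, then (1 + a1^2) sigma^2, ...
var_factor_1 = 1.0
var_factor_2 = 1.0 + a1 ** 2
```

A 95% prediction interval is then forecast ± 1.96 times the square root of the corresponding error variance, which is what PROC ARIMA reports.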
Moving Average, MA(q), and ARMA(p,q) models
MA(1) Yt = µ + et - θet-1 Variance (1+θ2)σ2
Yt-1 = µ + et-1 - θet-2 ρ(1)=-θ/(1+θ2)
Yt-2 = µ + et-2 - θet-3 ρ(2)=0/(1+θ2)=0
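The ρ(1) formula is easy to check numerically; for example θ = 0.5 gives ρ(1) = -0.4 (a sketch):

```python
def ma1_rho1(theta):
    """MA(1): Yt = mu + et - theta*et-1.
    Var(Y) = (1 + theta^2) sigma^2 and Cov(Yt, Yt-1) = -theta*sigma^2,
    so rho(1) = -theta/(1 + theta^2). rho(j) = 0 for j >= 2 because
    e's more than one lag apart never overlap."""
    return -theta / (1 + theta ** 2)
```

Note |ρ(1)| can never exceed 0.5 (the extreme is at θ = ±1), a quick sanity check on any sample ACF you suspect came from an MA(1).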
9
Autocorrelation function “ACF” (ρ(j)) is 0 after lag q for MA(q). PACF is useless for identifying q in MA(q).
PACF drops to 0 after lag 3 ⇒ AR(3), p = 3
ACF drops to 0 after lag 2 ⇒ MA(2), q = 2
Neither drops ⇒ ARMA(p,q), p = ___, q = ____
(Yt − µ) − α1(Yt-1 − µ) − … − αp(Yt-p − µ) = et − θ1et-1 − … − θqet-q
Example 2: Iron and Steel Exports.
PROC ARIMA plots(unpack)=all;
  Identify VAR=EXPORT;
10
ACF (could be MA(1)): spike at lags 0, 1.  PACF (could be AR(1)): no spike at lag 0.
Estimate P=1 ML;
Estimate Q=2 ML;
Estimate Q=1 ML;

Maximum Likelihood Estimation
Parameter  Estimate   t Value  Approx Pr > |t|  Lag
AR(1):
MU        4.42129   10.28   <.0001  0
AR1,1     0.46415    3.42   0.0006  1
MA(2):
MU        4.43237   11.41   <.0001  0
MA1,1    -0.54780   -3.53   0.0004  1
MA1,2    -0.12663   -0.82   0.4142  2
MA(1):
MU        4.42489   12.81   <.0001  0
MA1,1    -0.49072   -3.59   0.0003  1
How to choose? AIC - smaller is better
11
AIC 165.8342 (MA(1)) AIC 166.3711 (AR(1)) AIC 167.1906 (MA(2))
Forecast lead=5 out=out1 id=date interval=year;
Example 3: Brewers’ Proportion Won
Mean of Working Series 0.478444 Standard Deviation 0.059934 Number of Observations 45
12
Autocorrelations
Lag  Correlation  Std Error
 0    1.00000     0
 1    0.52076     0.149071
 2    0.18663     0.185136
 3    0.11132     0.189271
 4    0.11490     0.190720
 5   -0.00402     0.192252
 6   -0.14938     0.192254
 7   -0.13351     0.194817
 8   -0.06019     0.196840
 9   -0.05246     0.197248
10   -0.20459     0.197558
11   -0.22159     0.202211
12   -0.24398     0.207537
("." marks two standard errors in the SAS plot)
Could be MA(1).
Autocorrelation Check for White Noise
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6   17.27   6  0.0084   0.521  0.187  0.111  0.115 -0.004 -0.149
12   28.02  12  0.0055  -0.134 -0.060 -0.052 -0.205 -0.222 -0.244
NOT white noise!
SAS Code:
proc arima data=brewers;
  identify var=Win_Pct nlag=12;
run;
  estimate q=1 ml;
13
Maximum Likelihood Estimation
Parameter  Estimate  Standard Error  t Value  Approx Pr > |t|  Lag
MU       0.47791   0.01168   40.93  <.0001  0
MA1,1   -0.50479   0.13370   -3.78  0.0002  1
AIC -135.099
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    3.51   5  0.6219   0.095  0.161  0.006  0.119  0.006 -0.140
12   11.14  11  0.4313  -0.061 -0.072  0.066 -0.221 -0.053 -0.242
18   13.54  17  0.6992   0.003 -0.037 -0.162 -0.010 -0.076 -0.011
24   17.31  23  0.7936  -0.045 -0.035 -0.133 -0.087 -0.114  0.015
Estimated Mean 0.477911
Moving Average Factors
Factor 1: 1 + 0.50479 B**(1)
Partial Autocorrelations
Lag  Correlation
 1    0.52076
 2   -0.11603
 3    0.08801
 4    0.04826
 5   -0.12646
 6   -0.12989
 7    0.01803
 8    0.01085
 9   -0.02252
10   -0.20351
11   -0.03129
12   -0.18464
OR … could be AR(1).
14
estimate p=1 ml;
Maximum Likelihood Estimation
Parameter  Estimate  Standard Error  t Value  Approx Pr > |t|  Lag
MU       0.47620   0.01609   29.59  <.0001  0
AR1,1    0.53275   0.12750    4.18  <.0001  1
AIC -136.286 (vs. -135.099)
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    3.57   5  0.6134   0.050 -0.133 -0.033  0.129  0.021 -0.173
12    8.66  11  0.6533  -0.089  0.030  0.117 -0.154 -0.065 -0.181
18   10.94  17  0.8594   0.074  0.027 -0.161  0.010 -0.019  0.007
24   13.42  23  0.9423   0.011 -0.012 -0.092 -0.081 -0.106  0.013
Model for variable Win_pct
Estimated Mean 0.476204
Autoregressive Factors
Factor 1: 1 - 0.53275 B**(1)
Conclusions for Brewers:
Both models have statistically significant
parameters.
Both models are sufficient (no lack of fit)
15
Predictions from MA(1):
First one uses correlations
The rest are on the mean.
Predictions for AR(1):
Converge exponentially fast toward mean
Not much difference but AIC prefers AR(1)
16
Stationarity
(1) Mean constant (no trends)
(2) Variance constant
(3) Covariance γ(j) and correlation ρ(j) = γ(j)/γ(0) between Yt and Yt-j depend only on j
ARMA(p,q) model
(Yt − µ) − α1(Yt-1 − µ) − … − αp(Yt-p − µ) = et − θ1et-1 − … − θqet-q
Stationarity is guaranteed whenever the solutions of the equation (roots of the polynomial)
X^p − α1X^(p-1) − α2X^(p-2) − … − αp = 0
are all < 1 in magnitude.
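For p = 2 the characteristic equation is just a quadratic. A sketch of the check, applied to the silver AR(2) estimates (α1 = 1.57436, α2 = -0.67483):

```python
import cmath

def ar2_roots(a1, a2):
    """Roots of the AR(2) characteristic equation X^2 - a1*X - a2 = 0."""
    disc = cmath.sqrt(a1 * a1 + 4 * a2)   # complex sqrt handles negative discriminant
    return (a1 + disc) / 2, (a1 - disc) / 2

# Silver fit: a complex pair with modulus sqrt(0.67483) ~ 0.82 < 1 => stationary
r1, r2 = ar2_roots(1.57436, -0.67483)
stationary = max(abs(r1), abs(r2)) < 1
```

The same function flags the unit root case: ar2_roots(1.6, -0.6) returns roots 1.0 and 0.6, the nonstationary example worked in the slides that follow.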
17
Examples
(1) Yt − µ = 0.8(Yt-1 − µ) + et;  X − 0.8 = 0, X = 0.8 ⇒ stationary
(2) Yt − µ = 1.00(Yt-1 − µ) + et ⇒ nonstationary
Note: Yt= Yt-1 + et Random walk
(3) Yt−µ = 1.6(Yt-1−µ) − 0.6(Yt-2−µ)+ et
“characteristic polynomial”
X² − 1.6X + 0.6 = 0 ⇒ X = 1 or X = 0.6
⇒ nonstationary (unit root X = 1)
(Yt−µ)−(Yt-1−µ) =0.6[(Yt-1−µ)− (Yt-2−µ)]+ et
(Yt−Yt-1) =0.6(Yt-1− Yt-2) + et
First differences form stationary AR(1) process!
18
No mean – no mean reversion – no gravity pulling toward the mean.
(4) Yt − µ = 1.60(Yt-1 − µ) − 0.63(Yt-2 − µ) + et
X² − 1.60X + 0.63 = 0 ⇒ X = 0.9 or X = 0.7
|roots| < 1 ⇒ stationary
(Yt − µ) − (Yt-1 − µ) = −0.03(Yt-1 − µ) + 0.63[(Yt-1 − µ) − (Yt-2 − µ)] + et
Yt − Yt-1 = −0.03(Yt-1 − µ) + 0.63(Yt-1 − Yt-2) + et
Unit Root testing (H0:Series has a unit root)
Regress
Yt−Yt-1 on Yt-1 and (Yt-1−Yt-2)
Look at t test for Yt-1. If it is significantly negative then stationary.
19
Problem: Distribution of “t statistic” is not t distribution under unit root hypothesis. Distribution looks like this histogram:
(1 million random walks of length n=100)
Overlays: N(sample mean & variance) N(0,1)
Correct distribution: Dickey-Fuller test in PROC ARIMA.
-2.89 is the correct (left) 5th %ile
46% of t’s are less than -1.645
(the normal 5th percentile)
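That 46% figure is easy to reproduce in miniature — a few thousand replications instead of a million. The sketch below regresses Yt − Yt-1 on an intercept and Yt-1 (the "single mean" case) for simulated random walks:

```python
import random

def df_t_single_mean(y):
    """t statistic for Y[t-1] in the regression of (Y[t]-Y[t-1]) on 1, Y[t-1]."""
    x = y[:-1]                                    # lagged level Y[t-1]
    d = [y[t + 1] - y[t] for t in range(len(y) - 1)]  # first difference
    n = len(d)
    xbar, dbar = sum(x) / n, sum(d) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (di - dbar) for xi, di in zip(x, d)) / sxx
    a = dbar - b * xbar
    sse = sum((di - a - b * xi) ** 2 for xi, di in zip(x, d))
    s2 = sse / (n - 2)
    return b / (s2 / sxx) ** 0.5

rng = random.Random(1)
reps, n, below = 2000, 100, 0
for _ in range(reps):
    y = [0.0]
    for _ in range(n):                 # random walk: Y[t] = Y[t-1] + e[t]
        y.append(y[-1] + rng.gauss(0, 1))
    if df_t_single_mean(y) < -1.645:
        below += 1
frac = below / reps   # roughly 0.46 -- nowhere near the nominal 5%
```

The t statistics pile up well to the left of a standard normal, which is exactly why the Dickey-Fuller tables (5th percentile near -2.89) are needed instead of -1.645.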
20
Example 1: Brewers
proc arima data=brewers;
  identify var=Win_Pct nlag=12 stationarity=(ADF=0);

Dickey-Fuller Unit Root Tests
Type          Lags      Rho   Pr < Rho    Tau   Pr < Tau
Zero Mean       0    -0.1803    0.6376   -0.22    0.6002
Single Mean     0   -21.0783    0.0039   -3.75    0.0062
Trend           0   -21.1020    0.0287   -3.68    0.0347

Conclusion: reject H0 (unit root), so the Brewers series is stationary (mean reverting).
0 lags ⇒ no lagged differences needed in the model (just regress Yt−Yt-1 on Yt-1).
21
Example 2: Stocks of silver revisited
Needed AR(2) (2 lags) so regress
Yt-Yt-1 (D_Silver) on
Yt-1 (L_Silver) and Yt-1-Yt-2 (D_Silver_1)
PROC REG: Parameter Estimates
Variable     DF  Estimate   t Value  Pr > |t|
Intercept     1  75.58073    2.76    0.0082
L_Silver      1  -0.11703   -2.78    0.0079   (wrong distribution)
D_Silver_1    1   0.67115    6.21   <.0001    OK
22
PROC ARIMA: Augmented Dickey-Fuller Unit Root Tests
Type          Lags      Rho   Pr < Rho    Tau   Pr < Tau
Zero Mean       1    -0.2461    0.6232   -0.28    0.5800
Single Mean     1   -17.7945    0.0121   -2.78    0.0689   OK
Trend           1   -15.1102    0.1383   -2.63    0.2697
Same t statistic, corrected p-value!
Conclusion: unit root ⇒ difference the series.
1 lag ⇒ need 1 lagged difference in the model (regress Yt−Yt-1 on Yt-1 and Yt-1−Yt-2).
PROC ARIMA data=silver;
  identify var=silver(1) stationarity=(ADF=(0));
  estimate p=1 ml;
  forecast lead=24 out=outN ID=date Interval=month;
23
Unit root forecast & forecast interval
PROC AUTOREG
Fits a regression model (least squares)
Fits stationary autoregressive model to error terms
Refits accounting for autoregressive errors.
Example 3: AUTOREG Harley-Davidson closing stock prices 2009-present.
24
proc autoreg data=Harley;
model close=date/ nlag=15 backstep; run;
One by one, AUTOREG eliminates insignificant lags then:
Estimates of Autoregressive Parameters Lag Coefficient Standard Error t Value
1 -0.975229 0.006566 -148.53
25
Final model: Parameter Estimates
Variable DF Estimate Standard Error t Value Approx Pr > |t|
Intercept 1 -412.1128 35.2646 -11.69 <.0001
Date 1 0.0239 0.001886 12.68 <.0001
Error term Zt satisfies Zt-0.97Zt-1=et.
Example 3 ARIMA: Harley-Davidson closing stock prices 2009-present. (vs. AUTOREG)
Apparent upward movement: Linear trend or nonstationary?
Regress
Yt – Yt-1 on 1, t, Yt-1 (& lagged differences)
26
H0: Yt= β + Yt-1 + et “random walk with drift”
H1: Yt=α+βt + Zt with Zt stationary AR(p)
New distribution for Yt-1 t-test
With trend
27
Without trend
1 million simulations - runs in 7 seconds!
SAS code for Harley stock closing price
proc arima data=Harley;
  identify var=close stationarity=(adf) crosscor=(date) noprint;
  Estimate input=(date) p=1 ml;
  forecast lead=120 id=date interval=weekday out=out1;
run;
28
Stationarity test (0,1,2 lagged differences):
Augmented Dickey-Fuller Unit Root Tests
Type Lags Rho Pr < Rho Tau Pr < Tau
Zero Mean 0 0.8437 0.8853 1.14 0.9344
1 0.8351 0.8836 1.14 0.9354
2 0.8097 0.8786 1.07 0.9268
Single Mean 0 -2.0518 0.7726 -0.87 0.7981
1 -1.7772 0.8048 -0.77 0.8278
2 -1.8832 0.7925 -0.78 0.8227
Trend 0 -27.1559 0.0150 -3.67 0.0248
1 -26.9233 0.0158 -3.64 0.0268
2 -29.4935 0.0089 -3.80 0.0171
Conclusion: stationary around a linear trend.
Estimates: trend + AR(1)
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
MU -412.08104 35.45718 -11.62 <.0001 0 Close 0
AR1,1 0.97528 0.0064942 150.18 <.0001 1 Close 0
NUM1 0.02391 0.0018961 12.61 <.0001 0 Date 0
29
Autocorrelation Check of Residuals
To Lag Chi-Square DF Pr > ChiSq Autocorrelations
6 3.20 5 0.6694 -0.005 0.044 -0.023 0.000 0.017 0.005
12 6.49 11 0.8389 -0.001 0.019 0.003 -0.010 0.049 -0.003
18 10.55 17 0.8791 0.041 -0.026 -0.022 -0.023 0.007 -0.011
24 16.00 23 0.8553 0.014 -0.037 0.041 -0.020 -0.032 0.003
30 22.36 29 0.8050 0.013 -0.026 0.028 0.051 0.036 0.000
36 24.55 35 0.9065 0.037 0.016 -0.012 0.002 -0.007 0.001
42 29.53 41 0.9088 -0.007 -0.021 0.029 0.030 -0.033 0.030
48 49.78 47 0.3632 0.027 -0.009 -0.097 -0.026 -0.074 0.026
30
NCSU Energy Demand
Type of day
Class Days
Work Days (no classes)
Holidays & weekends.
Temperature, Season of Year
Step 1: Make some plots of energy demand vs. temperature and season. Use type of day as color.
Seasons: S = A sin(2πt/365), C = B cos(2πt/365)
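The S and C regressors are one sine-cosine pair at the annual period. A sketch of how they are built (the day index t here is hypothetical, not the actual energy data):

```python
import math

# One year of daily harmonic regressors at the annual period
S = [math.sin(2 * math.pi * t / 365) for t in range(366)]
C = [math.cos(2 * math.pi * t / 365) for t in range(366)]

# The regression estimates the amplitudes A and B. Since
# A*sin(x) + B*cos(x) = R*sin(x + phase) with R = sqrt(A^2 + B^2),
# the pair lets least squares place the seasonal peak at any day of the year.
```

This is why both S and C are included: one sine alone fixes the phase, while the pair leaves the timing of the seasonal peak to the data.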
31
Step 2: PROC AUTOREG with all inputs:
PROC AUTOREG data=energy;
  MODEL DEMAND = TEMP TEMPSQ CLASS WORK S C / NLAG=15 BACKSTEP DWPROB;
  output out=out3 predicted=p predictedm=pm residual=r residualm=rm;
run;
Estimates of Autoregressive Parameters
Lag Coefficient Standard Error t Value
1 -0.559658 0.043993 -12.72
5 -0.117824 0.045998 -2.56
7 -0.220105 0.053999 -4.08
8 0.188009 0.059577 3.16
9 -0.108031 0.051219 -2.11
12 0.110785 0.046068 2.40
14 -0.094713 0.045942 -2.06
Autocorrelation at 1, 7, 14, and others. After autocorrelation adjustments, trust t tests etc.
32
Parameter Estimates
Variable DF Estimate Standard Error t Value Approx Pr > |t|
Intercept 1 6076 296.5261 20.49 <.0001
TEMP 1 28.1581 3.6773 7.66 <.0001
TEMPSQ 1 0.6592 0.1194 5.52 <.0001
CLASS 1 1159 117.4507 9.87 <.0001
WORK 1 2769 122.5721 22.59 <.0001
S 1 -764.0316 186.0912 -4.11 <.0001
C 1 -520.8604 188.2783 -2.77 0.0060
Residuals from regression part. Large residual on workday near Christmas. Add dummy variable.
[Plot: regression residuals rm vs. DATE, 01JUL79–01JUL80; point colors: non-work, work, class]
Big negative residual on Jan. 2 – need a better model?
33
Same idea: PROC ARIMA Step 1: Graphs Step 2: Regress on inputs, diagnose residual autocorrelation:
Not white noise (bottom right) Activity (bars) at lag 1, 7, 14
34
(3) Estimate the resulting model from diagnostics plus trial and error:
estimate input=(temp tempsq class work s c) p=1 q=(1,7,14) ml;
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
MU 6183.1 300.87297 20.55 <.0001 0 DEMAND 0
MA1,1 0.11481 0.07251 1.58 0.1133 1 DEMAND 0
MA1,2 -0.18467 0.05415 -3.41 0.0006 7 DEMAND 0
MA1,3 -0.13326 0.05358 -2.49 0.0129 14 DEMAND 0
AR1,1 0.73980 0.05090 14.53 <.0001 1 DEMAND 0
NUM1 26.89511 3.83769 7.01 <.0001 0 TEMP 0
NUM2 0.64614 0.12143 5.32 <.0001 0 TEMPSQ 0
NUM3 912.80536 122.78189 7.43 <.0001 0 CLASS 0
NUM4 2971.6 123.94067 23.98 <.0001 0 WORK 0
NUM5 -767.41131 174.59057 -4.40 <.0001 0 S 0
NUM6 -553.13620 182.66142 -3.03 0.0025 0 C 0
(Note: class days get class effect plus work effect)
35
(5) Check model fit (stats look OK):
Autocorrelation Check of Residuals
To Lag Chi-Square DF Pr > ChiSq Autocorrelations
6 2.86 2 0.2398 -0.001 -0.009 -0.053 -0.000 0.050 0.047
12 10.71 8 0.2188 0.001 -0.034 0.122 0.044 -0.039 -0.037
18 13.94 14 0.4541 -0.056 0.013 -0.031 0.048 -0.006 -0.042
24 16.47 20 0.6870 -0.023 -0.028 0.039 -0.049 0.020 -0.029
30 24.29 26 0.5593 0.006 0.050 -0.098 0.077 -0.002 0.039
36 35.09 32 0.3239 -0.029 -0.075 0.057 -0.001 0.121 -0.047
42 39.99 38 0.3817 0.002 -0.007 0.088 0.019 -0.004 0.060
48 43.35 44 0.4995 -0.043 0.043 -0.027 -0.047 -0.019 -0.032
36
Looking for “outliers” that can be explained: PROC ARIMA, OUTLIER statement. Available types:
(1) Additive (single outlier)
(2) Level shift (sudden change in mean)
(3) Temporary change (level shift for k contiguous time points – you specify k)
NCSU energy: tested every point – 365 tests. Adjust for multiple testing: 0.05/365 = .0001369863 (Bonferroni).
OUTLIER type=additive alpha=.0001369863 id=date;
FORMAT date weekdate.;
run;
/******************************************************
January 2, 1980 Wednesday: Hangover Day :-)
March 3, 1980 Monday: On the afternoon and evening of March 2, 1980, North Carolina experienced a major winter storm with heavy snow across the entire state and near blizzard conditions in the eastern part of the state. Widespread snowfall totals of 12 to 18 inches were observed over Eastern North Carolina, with localized amounts ranging up to 22 inches at Morehead City and 25 inches at Elizabeth City, with unofficial reports of up to 30 inches at Emerald Isle and Cherry Point (Figure 1). This was one of the great snowstorms in Eastern North Carolina history. What made this storm so remarkable was the combination of snow, high winds, and very cold temperatures.
May 10, 1980 Saturday: Graduation!
*****************************************************/;
Outlier Details
Obs Time ID Type Estimate Chi-Square Approx Prob>ChiSq
186 Wednesday Additive -3250.9 87.76 <.0001
315 Saturday Additive 1798.1 28.19 <.0001
247 Monday Additive -1611.8 22.65 <.0001
Outlier Details
Obs Time ID Type Estimate Chi-Square Approx Prob>ChiSq
186 02-JAN-1980 Additive -3250.9 87.76 <.0001
315 10-MAY-1980 Additive 1798.1 28.19 <.0001
247 03-MAR-1980 Additive -1611.8 22.65 <.0001
38
Outliers: Jan 2 (hangover day!), March 3 (snowstorm), May 10 (graduation day). AR(1) ‘rebound’ ⇒ outlying residuals the next day too. Add dummy variables for the explainable outliers:
data next;
  merge outarima energy;
  by date;
  hangover = (date="02Jan1980"d);
  storm = (date="03Mar1980"d);
  graduation = (date="10May1980"d);
Proc ARIMA data=next;
  identify var=demand crosscor=(temp tempsq class work s c hangover graduation storm) noprint;
  estimate input=(temp tempsq class work s c hangover graduation storm) p=1 q=(7,14) ml;
  forecast lead=0 out=outARIMA2 id=date interval=day;
run;
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
MU 6127.4 259.43918 23.62 <.0001 0 DEMAND 0
MA1,1 -0.25704 0.05444 -4.72 <.0001 7 DEMAND 0
MA1,2 -0.10821 0.05420 -2.00 0.0459 14 DEMAND 0
AR1,1 0.76271 0.03535 21.57 <.0001 1 DEMAND 0
NUM1 27.89783 3.15904 8.83 <.0001 0 TEMP 0
NUM2 0.54698 0.10056 5.44 <.0001 0 TEMPSQ 0
39
Maximum Likelihood Estimation
Parameter Estimate Standard Error t Value Approx Pr > |t|
Lag Variable Shift
NUM3 626.08113 104.48069 5.99 <.0001 0 CLASS 0
NUM4 3258.1 105.73971 30.81 <.0001 0 WORK 0
NUM5 -757.90108 181.28967 -4.18 <.0001 0 S 0
NUM6 -506.31892 184.50221 -2.74 0.0061 0 C 0
NUM7 -3473.8 334.16645 -10.40 <.0001 0 hangover 0
NUM8 2007.1 331.77424 6.05 <.0001 0 graduation 0
NUM9 -1702.8 333.79141 -5.10 <.0001 0 storm 0
Constant Estimate 1453.963
Variance Estimate 181450
Std Error Estimate 425.9695
AIC 5484.728
SBC 5535.462
Number of Residuals 366
40
Model looks fine.
AUTOREG – regression with AR(p) errors.
ARIMA – regressors, differencing, ARMA(p,q) errors.
SEASONALITY: many economic and environmental series show seasonality, either
(1) very regular (“deterministic”) or
(2) slowly changing (“stochastic”).
41
Example 1: NC accident reports involving deer.
Method 1: regression.
PROC REG data=deer;
  model deer = X11;   /* X11: 1 in Nov, 0 otherwise */
Parameter Estimates
Variable    DF  Estimate     Standard Error  t Value  Pr > |t|
Intercept    1  1181.09091    78.26421  15.09  <.0001
X11          1  2578.50909   271.11519   9.51  <.0001

Looks like December and October need dummies too!
PROC REG data=deer;
  model deer = X10 X11 X12;
Parameter Estimates
Variable    DF  Estimate     Standard Error  t Value  Pr > |t|
Intercept    1   929.40000    39.13997  23.75  <.0001
X10          1  1391.20000   123.77145  11.24  <.0001
X11          1  2830.20000   123.77145  22.87  <.0001
X12          1  1377.40000   123.77145  11.13  <.0001
The average of January through September is 929 crashes per month. Add 1391 in October, 2830 in November, 1377 in December.
(graph on next page)
42
Try dummies for all but one month (we need an “average of the rest,” so at least one month must be left out):
PROC REG data=deer;
  model deer = X1 X2 … X10 X11;
Parameter Estimates
Variable    DF  Estimate      Standard Error  t Value  Pr > |t|
Intercept    1   2306.80000    81.42548  28.33  <.0001
X1           1   -885.80000   115.15301  -7.69  <.0001
X2           1  -1181.40000   115.15301 -10.26  <.0001
X3           1  -1220.20000   115.15301 -10.60  <.0001
X4           1  -1486.80000   115.15301 -12.91  <.0001
X5           1  -1526.80000   115.15301 -13.26  <.0001
X6           1  -1433.00000   115.15301 -12.44  <.0001
X7           1  -1559.20000   115.15301 -13.54  <.0001
X8           1  -1646.20000   115.15301 -14.30  <.0001
X9           1  -1457.20000   115.15301 -12.65  <.0001
X10          1     13.80000   115.15301   0.12  0.9051
X11          1   1452.80000   115.15301  12.62  <.0001
The “average of the rest” is just the December mean, 2307. Subtract 886 in January, add 1452 in November. October (X10) is not significantly different from December.
43
Residuals for the Deer Crash data look like a trend – add a trend (date):
PROC REG data=deer;
  model deer = date X1 X2 … X10 X11;
Parameter Estimates
Variable    DF  Estimate      Standard Error  t Value  Pr > |t|
Intercept    1  -1439.94000   547.36656  -2.63  0.0115
X1           1   -811.13686    82.83115  -9.79  <.0001
X2           1  -1113.66253    82.70543 -13.47  <.0001
X3           1  -1158.76265    82.60154 -14.03  <.0001
X4           1  -1432.28832    82.49890 -17.36  <.0001
X5           1  -1478.99057    82.41114 -17.95  <.0001
X6           1  -1392.11624    82.33246 -16.91  <.0001
X7           1  -1525.01849    82.26796 -18.54  <.0001
X8           1  -1618.94416    82.21337 -19.69  <.0001
X9           1  -1436.86982    82.17106 -17.49  <.0001
X10          1     27.42792    82.14183   0.33  0.7399
X11          1   1459.50226    82.12374  17.77  <.0001
date         1      0.22341     0.03245   6.88  <.0001
44
Trend is 0.22 more accidents per day (1 per 5 days) and is significantly different from 0.
What about autocorrelation?
Method 2: PROC AUTOREG
PROC AUTOREG data=deer;
  model deer = date X1 - X11 / nlag=13 backstep;
Backward Elimination of Autoregressive Terms
Lag  Estimate   t Value  Pr > |t|
 6  -0.003105  -0.02  0.9878
11   0.023583   0.12  0.9029
 4  -0.032219  -0.17  0.8641
 9  -0.074854  -0.42  0.6796
 5   0.064228   0.44  0.6610
13  -0.081846  -0.54  0.5955
12   0.076075   0.56  0.5763
 8  -0.117946  -0.81  0.4205
10  -0.127661  -0.95  0.3489
 7   0.153680   1.18  0.2458
 2   0.254137   1.57  0.1228
 3  -0.178895  -1.37  0.1781
Preliminary MSE 10421.3
Estimates of Autoregressive Parameters
Lag  Coefficient  Standard Error  t Value
 1   -0.459187    0.130979   -3.51
45
Parameter Estimates
Variable    DF  Estimate   Standard Error  t Value  Approx Pr > |t|
Intercept    1  -1631      857.3872   -1.90  0.0634
date         1   0.2346      0.0512    4.58  <.0001
X1           1  -789.7592   64.3967  -12.26  <.0001
X2           1  -1100       74.9041  -14.68  <.0001
X3           1  -1149       79.0160  -14.54  <.0001
X4           1  -1424       80.6705  -17.65  <.0001
X5           1  -1472       81.2707  -18.11  <.0001
X6           1  -1386       81.3255  -17.04  <.0001
X7           1  -1519       80.9631  -18.76  <.0001
X8           1  -1614       79.9970  -20.17  <.0001
X9           1  -1432       77.8118  -18.40  <.0001
X10          1   31.3894    72.8112    0.43  0.6684
X11          1   1462       60.4124   24.20  <.0001
Method 3: PROC ARIMA
PROC ARIMA plots=(forecast(forecast));
  IDENTIFY var=deer crosscor=(date X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11);
  ESTIMATE input=(date X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11) p=1 ML;
  FORECAST lead=12 id=date interval=month;
run;
Maximum Likelihood Estimation
Parameter  Estimate    Standard Error  t Value  Approx Pr > |t|  Lag  Variable  Shift
MU        -1640.9     877.41683   -1.87  0.0615  0  deer  0
AR1,1      0.47212      0.13238    3.57  0.0004  1  deer  0
NUM1       0.23514      0.05244    4.48  <.0001  0  date  0
NUM2    -789.10728     64.25814  -12.28  <.0001  0  X1    0
NUM3      -1099.3      74.93984  -14.67  <.0001  0  X2    0
46
NUM4      -1148.2      79.17135  -14.50  <.0001  0  X3   0
NUM5      -1423.7      80.90397  -17.60  <.0001  0  X4   0
NUM6      -1471.6      81.54553  -18.05  <.0001  0  X5   0
NUM7      -1385.4      81.60464  -16.98  <.0001  0  X6   0
NUM8      -1518.9      81.20724  -18.70  <.0001  0  X7   0
NUM9      -1613.4      80.15788  -20.13  <.0001  0  X8   0
NUM10     -1432.0      77.82871  -18.40  <.0001  0  X9   0
NUM11     31.46310     72.61873    0.43  0.6648  0  X10  0
NUM12      1462.1      59.99732   24.37  <.0001  0  X11  0
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    5.57   5  0.3504   0.042 -0.175  0.146  0.178  0.001 -0.009
12   10.41  11  0.4938  -0.157 -0.017  0.102  0.115 -0.055 -0.120
18   22.18  17  0.1778   0.158  0.147 -0.183 -0.160  0.189 -0.008
24   32.55  23  0.0893  -0.133  0.013 -0.095  0.005  0.101 -0.257
Autoregressive Factors
Factor 1: 1 - 0.47212 B**(1)
Model 3: Differencing. Compute and model Dt = Yt − Yt-12.
Removes seasonality. Removes a linear trend.
Use (at least) q=(12): et − θet-12.
(A) If θ is near 1, you've overdifferenced.
(B) If 0 < θ < 1, this is the seasonal exponential smoothing model.
47
Yt − Yt-1 = et − θet-1
et = Yt − Yt-1 + θet-1 = Yt − [(1 − θ)(Yt-1 + θYt-2 + θ²Yt-3 + …)]
Yt = (1 − θ)[Yt-1 + θYt-2 + θ²Yt-3 + θ³Yt-4 + …] + et
The forecast is a weighted (exponentially smoothed) average of past values:
Ŷt = (1 − θ)[Yt-1 + θYt-2 + θ²Yt-3 + θ³Yt-4 + …]
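Writing the weights as (1−θ)θ^j shows why case (B) is called exponential smoothing: the weights decay geometrically and sum to one. A sketch with a finite history (so the weights sum to 1 − θ^N):

```python
def exp_smooth_forecast(past, theta):
    """Forecast = (1-theta)*[Y[t-1] + theta*Y[t-2] + theta^2*Y[t-3] + ...],
    with `past` listed most recent first."""
    return (1 - theta) * sum(theta ** j * y for j, y in enumerate(past))

theta = 0.9
# Geometric weights: (1-theta)*sum(theta^j) = 1 - theta^200 ~ 1
weight_sum = (1 - theta) * sum(theta ** j for j in range(200))
# Forecasting a constant series (essentially) returns the constant
f = exp_smooth_forecast([100.0] * 200, theta)
```

Because the weights sum to one, the forecast is a genuine weighted average of the past, with recent values counting most.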
IDENTIFY var=deer(12) nlag=25;
ESTIMATE P=1 Q=(12) ml;
run;
Maximum Likelihood Estimation
Parameter  Estimate   Standard Error  t Value  Approx Pr > |t|  Lag
MU       85.73868   16.78380   5.11  <.0001   0
MA1,1     0.89728    0.94619   0.95  0.3430  12
AR1,1     0.46842    0.11771   3.98  <.0001   1
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    4.31   4  0.3660   0.053 -0.161  0.140  0.178 -0.026 -0.013
12    7.47  10  0.6801  -0.146 -0.020  0.105  0.131 -0.035 -0.029
18   18.02  16  0.3226   0.198  0.167 -0.143 -0.154  0.183  0.002
24   24.23  22  0.3355  -0.127  0.032 -0.083  0.022  0.134 -0.155
48
The lag-12 MA estimate is somewhat close to 1 with a large standard error; the model is OK but not best. Variance estimate 15,122 (vs. 13,431 for the dummy variable model). Forecasts are similar 2 years out.
49
Accounting for changes in trend: Nenana Ice Classic data. Exact time (day and time) of thaw of the Tanana River at Nenana, Alaska:
1917 Apr 30 11:30 a.m.
1918 May 11 9:33 a.m.
1919 May 3 2:33 p.m.
(more data)
2010 Apr 29 6:22 p.m.
2011 May 04 4:24 p.m.
2012 Apr 23 7:39 p.m.
When the tripod moves downstream, that is the unofficial start of spring.
50
Get a “ramp” with PROC NLIN:  ____/
X = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 …
PROC NLIN data=all;
  PARMS point=1960 int=126 slope=-.2;
  X = (year-point)*(year>point);
  MODEL break = int + slope*X;
  OUTPUT out=out2 predicted=p residual=r;
Parameter  Estimate  Approx Std Error  Approximate 95% Confidence Limits
point      1965.4       11.2570        1943.0    1987.7
int         126.0        0.7861         124.5     127.6
slope      -0.1593       0.0592        -0.2769   -0.0418
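PROC NLIN's ramp fit can be mimicked with a grid search over the breakpoint, fitting the straight line by least squares at each candidate. The data below are a made-up noiseless ramp for illustration, not the Nenana series:

```python
def fit_ramp(years, y, point):
    """Least squares for y = int + slope * max(year - point, 0)."""
    x = [max(yr - point, 0.0) for yr in years]
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    inter = ybar - slope * xbar
    sse = sum((yi - inter - slope * xi) ** 2 for xi, yi in zip(x, y))
    return inter, slope, sse

# Toy series: flat at 126 until 1965, then declining by 0.2 per year
years = list(range(1917, 2013))
y = [126.0 - 0.2 * max(yr - 1965, 0) for yr in years]

# Pick the candidate breakpoint with the smallest error sum of squares
best = min(range(1920, 2010), key=lambda p: fit_ramp(years, y, p)[2])
```

NLIN instead solves for the breakpoint as a continuous parameter, which is how it also produces the approximate standard errors shown above.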
PROC SGPLOT data=out2;
  SERIES Y=break X=year;
  SERIES Y=p X=year / lineattrs=(color=red thickness=2);
  REFLINE 1965.4 / axis=X;
run; quit;
51
What about autocorrelation? Final ramp: Xt = (year-1965.4)*(year>1965);
PROC ARIMA;
  IDENTIFY var=break crosscor=(ramp) noprint;
  ESTIMATE input=(ramp);
  FORECAST lead=5 id=year out=out1;
run;
PROC ARIMA generates diagnostic plots:
52
Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    6.93   6  0.3275  -0.025  0.067 -0.027 -0.032 -0.152 -0.192
12   14.28  12  0.2834   0.086 -0.104 -0.074  0.184 -0.041 -0.091
18   20.30  18  0.3164  -0.086  0.111  0.041 -0.073  0.142 -0.066
24   21.37  24  0.6166  -0.015 -0.020 -0.036 -0.059 -0.047 -0.030
Example 2: Visitors to St. Petersburg/Clearwater
http://www.pinellascvb.com/statistics/2013-04-VisitorProfile.pdf
Model 1: Seasonal dummy variables + trend + AR(p) (REG - R2>97%)
53
proc reg data=aaem.stpete;
  model visitors = t m1-m11;
  output out=out4 predicted=P residual=R UCL=u95 LCL=l95;
run;
PROC SGPLOT data=out3;
  BAND lower=l95 upper=u95 X=date;
  SERIES Y=P X=date;
  SCATTER Y=visitors X=date / datalabel=month datalabelattrs=(color=red size=0.3 cm);
  SERIES Y=U95 X=date / lineattrs=(color=red thickness=0.8);
  SERIES Y=L95 X=date / lineattrs=(color=red thickness=0.8);
  REFLINE "01apr2013"d / axis=x;
  where 2011<year(date)<2015;
run;
54
PROC SGPLOT data=out4;
  NEEDLE Y=r X=date;
run;
Definitely autocorrelated. Slowly changing mean? Try a seasonal span-difference model:
PROC ARIMA data=StPete plots=forecast(forecast);
  IDENTIFY var=visitors(12);
55
A typical ACF for an ARMA(1,1) has an initial dropoff followed by exponential decay – try ARMA(1,1) on the span-12 differences.
ESTIMATE P=1 Q=1 ml;
FORECAST lead=44 id=date interval=month out=outARIMA; run;

Maximum Likelihood Estimation
Parameter  Estimate    Standard Error  t Value  Approx Pr > |t|  Lag
MU       0.0058962   0.0029586    1.99  0.0463  0
MA1,1    0.32635     0.11374      2.87  0.0041  1
AR1,1    0.79326     0.07210     11.00  <.0001  1

Autocorrelation Check of Residuals
To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
 6    1.81   4  0.7699   0.022 -0.055 -0.000  0.011 -0.028 -0.071
12   14.87  10  0.1367   0.109  0.044  0.003  0.060  0.211 -0.064
18   16.41  16  0.4248   0.010 -0.012  0.036  0.014 -0.066  0.038
24   17.30  22  0.7468  -0.056  0.020  0.004  0.018 -0.009 -0.017
30   34.15  28  0.1961  -0.131 -0.052 -0.144  0.134  0.026 -0.133
36   45.08  34  0.0969  -0.102  0.051  0.135 -0.107  0.014 -0.074
56
PROC SGPLOT data=outARIMA;
  SERIES X=date Y=residual;
run;
57
Summary:
PROC ARIMA – fits autoregressive moving average models. Use the ACF and PACF to identify p = # autoregressive lags and q = # moving average lags.
Stationarity – mean-reverting models versus unit roots (random walk type models). Graphics and the DF test (and others) are available.
Diagnostics – errors should be white noise – Ljung-Box test to check.
PROC AUTOREG – regression with autoregressive errors (ARIMA can also handle X variables).
PROC NLIN – to estimate slope changes (least squares).
PROC SGPLOT – new graphics routines.