Google Trend Time Series Modeling



Midterm Report

Taylor Hines

April 20, 2015

Introduction

In this report I will analyze time series data from Google Trends. The data is a set of weekly values from an index of search frequency generated by Google for an unknown search term. First, I will perform some exploratory data analysis, considering whether a transformation is necessary and removing trend and seasonality. After that, I will consider a plausible space of model specifications. Then I will run automated model selection algorithms seeking to minimize AIC. Lastly, I will confirm these results using cross-validation.

EDA

Let's load the data and take a look at a basic plot:

library(stargazer)

##
## Please cite as:
##
##  Hlavac, Marek (2014). stargazer: LaTeX code and ASCII text for well-formatted regression and summary statistics tables.
##  R package version 5.1. http://CRAN.R-project.org/package=stargazer

setwd("~/Documents/CAL/Spring 15/Stat 153/Exams")

data2 <- read.csv("q2train.csv")
plot(data2$Data, type="l", ylab = "Search Popularity", xlab="Week", main="Google Trend Data")

[Figure: "Google Trend Data". Line plot; x-axis: Week (0 to 500), y-axis: Search Popularity (40 to 100).]


The data contains 484 observations, or just over 9 years of data. The periodic form indicates yearly seasonality (with a period of 52 weeks), in addition to a mildly declining trend. Furthermore, the data appears mildly heteroskedastic, as the amplitude, and therefore the variance, decreases over time. As such, it may be reasonable to consider an appropriate Box-Cox transformation of the data, perhaps by taking a log:

trans2 <- log(data2$Data)
plot(trans2, type="l", ylab = "Log Search Popularity", xlab="Week", main="Transformed Trend Data")

[Figure: "Transformed Trend Data". Line plot; x-axis: Week (0 to 500), y-axis: Log Search Popularity (3.2 to 4.4).]
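The log used above is the lambda = 0 member of the Box-Cox family. As a minimal sketch of that family (the helper name `box.cox` and the sample values are illustrative assumptions, not part of the original analysis):

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda = 0 limit.
# box.cox is a hypothetical helper, not a function used in the report.
box.cox <- function(x, lambda) {
  stopifnot(all(x > 0))            # the transform is only defined for positive data
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x <- c(40, 60, 80, 100)            # illustrative values on the scale of the index
log.x  <- box.cox(x, 0)            # identical to log(x)
near.x <- box.cox(x, 1e-6)         # a tiny lambda is numerically close to the log
sqrt.x <- box.cox(x, 0.5)          # a milder, square-root-type transform
```

Values of lambda between 0 and 1 compress large observations less aggressively than the log, which is one way to check whether the log is stronger than the data needs.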

As we can see in the plot, the log transformation stabilized the variance nicely, as well as linearizing the trend. Now I will difference the data to remove the trend and seasonality. It's hard to tell if the trend is completely linear or if it has a quadratic term. As such, I will first take a difference at the yearly lag (52), and then take a first difference to see if any further differencing is necessary:

diff.data <- diff(diff(trans2, 52))
plot(diff.data, type="l", ylab = "Transformed Search Popularity",
     xlab="Week", main="Log Transformed Twice Differenced Data")


[Figure: "Log Transformed Twice Differenced Data". Line plot; x-axis: Week (0 to 400), y-axis: Transformed Search Popularity (-0.3 to 0.3).]

mean(diff.data)

## [1] 0.0002733

At this point we should have stationary data that we can fit an ARMA model to. First differencing was sufficient to remove the trend, and although there are some extreme values, they do not seem to be seasonally spaced. Furthermore, the data has a mean very close to zero, so there won't be a drift term in the model.
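To illustrate why a seasonal difference followed by a first difference leaves a mean near zero, here is a small sketch on synthetic data. The trend slope, seasonal amplitude, and noise level are assumptions chosen only to resemble the log series; the real data is not used.

```r
set.seed(153)                                   # arbitrary seed for reproducibility
week   <- 1:484                                 # same length as the training series
season <- 0.3 * sin(2 * pi * week / 52)         # assumed yearly pattern, period 52
trend  <- 4.2 - 0.001 * week                    # assumed mild linear decline
x <- trend + season + rnorm(484, sd = 0.05)     # synthetic stand-in for trans2

# Seasonal difference (lag 52) cancels the periodic part and turns the linear
# trend into a constant; the first difference then removes that constant.
d <- diff(diff(x, lag = 52))
n.left <- length(d)                             # 484 - 52 - 1 = 431 observations remain
mean.d <- mean(d)                               # close to zero
```

The two differences cost 53 observations, which is why the differenced series plotted above ends before week 484.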

ACF and PACF

Now I will plot the ACF and PACF to begin to consider the form of the underlying ARMA model:

par(mfrow=c(2,1), mar = c(4, 4, 3, 3))
acf(diff.data, lag.max = 105, xlim= c(4, 105), ylim=c(-.4,.4), main = "ACF and PCF Plots")
pacf(diff.data, lag.max = 105, xlim= c(4, 105), ylim=c(-.4,.4), main = "")


[Figure: "ACF and PCF Plots". Top panel: ACF against Lag (0 to 100); bottom panel: Partial ACF against Lag (0 to 100); both y-axes from -0.4 to 0.4.]

Here one can clearly see that we will want to consider a seasonal model. In both the ACF and PACF plots we see significant spikes at the first seasonal lag (52), but not at the second (104). It therefore seems reasonable to consider (1, 1, 1)_52 for the seasonal component of the SARIMA model. As for the R implementation, we can fit a SARIMA model directly rather than manually performing the differencing and then fitting an ARMA model. As mentioned, the mean of the differenced data is almost exactly zero, so we do not need an intercept for a drift term, and the fact that R's arima does not fit one for differenced models won't be a problem. It is quite a bit more difficult to see what is happening within the seasonal lag. One possibility is an MA(2), as we see two spikes in the ACF and then it mostly drops off (it's hard to know whether the spike at lag 6 is noise or not). The PACF seems to just taper off, so it may make sense to fit only MA terms in the non-seasonal component. However, to confirm this I will also consider models with a few AR terms and compare their performance.
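The visual check at the seasonal lags can also be done numerically. The sketch below uses a synthetic seasonal MA(1) series as a stand-in for `diff.data` (the coefficient -0.6 and the seed are assumptions) and compares the sample ACF at lags 52 and 104 with the approximate 95% white-noise band that R draws on its ACF plots.

```r
set.seed(52)
n <- 431                                   # length of the twice-differenced series
e <- rnorm(n + 52)
x <- e[53:(n + 52)] - 0.6 * e[1:n]         # synthetic seasonal MA(1) with period 52

a <- acf(x, lag.max = 105, plot = FALSE)
bound <- qnorm(0.975) / sqrt(n)            # approximate 95% band under white noise
lags <- c(52, 104)
# a$acf[1] holds lag 0, so lag k is stored at index k + 1
check <- data.frame(lag = lags,
                    acf = a$acf[lags + 1],
                    significant = abs(a$acf[lags + 1]) > bound)
check
```

A spike at lag 52 with nothing at lag 104 in the ACF is the pattern a seasonal MA(1) term produces; replacing `x` with `diff.data` would run the same check on the real series.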

AIC

For the first set of diagnostics I will fit a variety of models and report the AIC values. While I have good reason to suspect a SARIMA model, I will test this against the base case where we still difference at the seasonal lag but do not fit the seasonal ARMA parameters P and Q. As such, the 4 sets of models considered will be:


ARIMA (p, 1, q) x (0, 1, 0)_52
ARIMA (p, 1, q) x (1, 1, 0)_52
ARIMA (p, 1, q) x (0, 1, 1)_52
ARIMA (p, 1, q) x (1, 1, 1)_52

Given the ambiguity of the order of the non-seasonal component, I will create 4 matrices, one for each set of models, with orders for AR and MA from 0 to 4:

# Some parameter combinations don't lead to fitted models due to convergence issues,
# errors, and other numerical problems. This "robust" version of arima
# returns NA for any model that raises a warning or an error while fitting.
robust.arima <- function(data, ar, diff = 0, ma,
                         Ps = 0, Ds = 0, Qs = 0, per = NA, num.method = "CSS-ML"){
  tryCatch(arima(data, order = c(ar, diff, ma),
                 seasonal = list(order = c(Ps, Ds, Qs), period = per),
                 method = num.method, optim.control = list(maxit = 1000)),
           warning = function(w){NA},
           error = function(e){NA})
}

# Fit every (AR, MA) order combination and collect the AIC values in a matrix
# (rows: AR order, columns: MA order).
model.selec <- function(data, ar.param, diff = 0, ma.param,
                        Ps = 0, Ds = 0, Qs = 0, per = NA, num.method = "CSS-ML"){
  aic.matrix <- sapply(ma.param, function(ma){
    sapply(ar.param, function(ar){
      model <- robust.arima(data, ar, diff, ma, Ps, Ds, Qs, per, num.method)
      print(model)
      if(inherits(model, "Arima")){return(model$aic)}
      return(NA)
    })
  })
  # Label with the function's own arguments rather than the globals ar and ma
  colnames(aic.matrix) <- paste("ma", ma.param)
  rownames(aic.matrix) <- paste("ar", ar.param)
  return(aic.matrix)
}

ar = 0:4
ma = 0:4
aics1 <- model.selec(trans2, ar, 1, ma, 1, 1, 1, per=52, num.method = "ML")
aic.m1 <- data.frame(round(aics1,1))
aics2 <- model.selec(trans2, ar, 1, ma, 0, 1, 1, per=52, num.method = "ML")
aic.m2 <- data.frame(round(aics2,1))
aics3 <- model.selec(trans2, ar, 1, ma, 1, 1, 0, per=52, num.method = "ML")
aic.m3 <- data.frame(round(aics3,1))
aics4 <- model.selec(trans2, ar, 1, ma, 0, 1, 0, per=52, num.method = "ML")
aic.m4 <- data.frame(round(aics4,1))

stargazer(aic.m3, type="text", summary=FALSE)

Table 1: ARIMA (p, 1, q) x (0, 1, 0)_52

         MA(0)     MA(1)     MA(2)     MA(3)     MA(4)
AR(0)   -845.6    -956.7    -991.3    -992.2    -990.8
AR(1)   -890.4    -994.5    -992.6    -990.6    -988.8
AR(2)   -916.6    -992.6    -990.5    -988.6    -993.1
AR(3)   -929.2    -990.6    -988.8    -996.5    -989.5
AR(4)   -942.3    -989.0    -986.6    -990.0    -992.7


Table 2: ARIMA (p, 1, q) x (1, 1, 1)_52

         MA(0)       MA(1)       MA(2)       MA(3)       MA(4)
AR(0)   -921.1      -1,065.8    -1,083.2    -1,082.0    -1,080.3
AR(1)   -981.1      N/A         N/A         -1,079.7    -1,078.1
AR(2)   -1,013.0    N/A         N/A         N/A         -1,085.4
AR(3)   -1,025.6    N/A         N/A         -1,079.9    -1,074.2
AR(4)   -1,037      -1,078.6    N/A         N/A         N/A

Table 3: ARIMA (p, 1, q) x (0, 1, 1)_52

         MA(0)       MA(1)       MA(2)       MA(3)       MA(4)
AR(0)   -918.2      -1,058      -1,078.6    -1,077.4    -1,075.5
AR(1)   -973.2      -1,079.3    -1,077.4    -1,075.3    -1,073.5
AR(2)   -1,004.8    -1,077.4    -1,075.5    -1,079.5    -1,082.2
AR(3)   -1,019.5    -1,075.4    -1,073.5    -1,082      -1,071.7
AR(4)   -1,031.2    -1,073.7    -1,077.5    -1,080.8    -1,070.3

Table 4: ARIMA (p, 1, q) x (1, 1, 0)_52

         MA(0)       MA(1)       MA(2)       MA(3)       MA(4)
AR(0)   -919.9      -1,060.6    -1,078.7    -1,078.1    -1,077
AR(1)   -980.8      N/A         -1,072.5    -1,075.5    N/A
AR(2)   -1,011.3    N/A         N/A         N/A         -1,076.6
AR(3)   -1,021.9    N/A         N/A         -1,083.5    -1,072
AR(4)   -1,033.3    -1,075.3    N/A         N/A         -1,072

From these tables I can conclude that we should be fitting a SARIMA model. Table 1 is the baseline where all models have P = 0 and Q = 0 for the seasonal parameters. No model there reaches an AIC below -1,000. Table 2 is our hypothesized specification: ARIMA (p, 1, q) x (1, 1, 1)_52.

The large number of N/A values is a result of the errors and warnings R threw because of numerical problems such as convergence issues. I experimented with modifying the fitting algorithm options (Maximum Likelihood vs. Conditional Sum of Squares), but was unable to work around many of these errors. In any case, this table shows good results for one of the simpler models: ARIMA (0, 1, 2) x (1, 1, 1)_52 with an AIC of -1,083.2.

However, this model did not have the lowest AIC in the group. Rather, ARIMA (2, 1, 4) x (1, 1, 1)_52 had the lowest, with an AIC of -1,085.4. This model has quite a few more parameters and a much more complex form, so there is a chance it will overfit. I can test this hypothesis, as well as the other top specifications from Tables 3 and 4, using cross-validation.

Table 5: Top Candidate Models

Model Specification                  AIC
ARIMA (0, 1, 2) x (1, 1, 1)_52      -1,083.2
ARIMA (2, 1, 4) x (1, 1, 1)_52      -1,085.4
ARIMA (2, 1, 4) x (0, 1, 1)_52      -1,082.2
ARIMA (4, 1, 3) x (1, 1, 0)_52      -1,083.5
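The parameter counts behind the overfitting concern can be tallied directly from Table 5. This is only a bookkeeping sketch; the data frame layout and column names are assumptions made for illustration.

```r
# Orders and AIC values copied from Table 5
candidates <- data.frame(
  p = c(0, 2, 2, 4), q = c(2, 4, 4, 3),    # non-seasonal AR and MA orders
  P = c(1, 1, 0, 1), Q = c(1, 1, 1, 0),    # seasonal AR and MA orders
  aic = c(-1083.2, -1085.4, -1082.2, -1083.5)
)
# Number of estimated ARMA coefficients in each specification
candidates$n.coef <- with(candidates, p + q + P + Q)
candidates[order(candidates$n.coef), ]
```

The first model estimates 4 coefficients while the others estimate 7 or 8, for AIC differences of at most about 3 points.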


Cross Validation

While all of these models have similar AIC values, there is always a chance that the models with more degrees of freedom are overfitting. The best way to test this would be to fit these models and test them against an unseen test set. However, given that I do not have access to any additional data, the best strategy is to use a cross-validation procedure to simulate this process.

The training data contains 484 observations, or 9.3 years. As such, I can begin by fitting the models to the first 4.3 years and predicting the following 52 observations. I then repeat this procedure after fitting to the first 5.3 years, and so on. This yields 5 different predictions for a rolling 52-week prediction window. I can then average these 5 prediction errors to determine which model performs best.
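The fold boundaries implied by this procedure can be written out explicitly. The sketch below only does the index arithmetic; the variable names are illustrative.

```r
n <- 484                          # observations in the training data
h <- 52                           # forecast horizon: one year of weekly data
n.folds <- 5
start <- n - n.folds * h          # 224 weeks, about 4.3 years

train.end  <- start + (0:(n.folds - 1)) * h   # last week used for fitting
test.start <- train.end + 1                   # first forecast week
test.end   <- train.end + h                   # last forecast week
folds <- data.frame(fold = 1:n.folds, train.end, test.start, test.end)
folds
```

Each fold trains on weeks 1 through `train.end` and is scored on the next 52 weeks, so the final fold's test window ends exactly at week 484.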

start = length(trans2) - 5*52
indices = seq(224, 484, by=52)

# Rolling-origin cross-validation: fit on an expanding window, forecast the next
# 52 weeks, and return the sum of squared forecast errors for each of the 5 folds.
cross.validation <- function(data, ar, diff = 0, ma,
                             Ps = 0, Ds = 0, Qs = 0, per = NA, num.method = "CSS-ML"){
  start <- length(data) - 5*52
  sapply(0:4, function(yr){
    train <- data[1:(start + yr*52)]
    test <- data[((start + yr*52) + 1):((start + yr*52) + 52)]
    # Fit on the training window only, so that the test year stays unseen
    model <- arima(train, order = c(ar, diff, ma),
                   seasonal = list(order = c(Ps, Ds, Qs), period = per),
                   method = num.method, optim.control = list(maxit = 1000))
    preds <- predict(model, n.ahead = 52)$pred
    sse <- sum((test - preds)^2)
    print(sse)
    return(sse)
  })
}

cv.model1 <- cross.validation(trans2, 0, 1, 2, 1, 1, 1, 52)
cv.model2 <- cross.validation(trans2, 2, 1, 4, 1, 1, 1, 52)
cv.model3 <- cross.validation(trans2, 2, 1, 4, 0, 1, 1, 52)
cv.model4 <- cross.validation(trans2, 4, 1, 3, 1, 1, 0, 52)

cv.df <- data.frame(rbind(round(cv.model1, 3), round(cv.model2, 3),
                          round(cv.model3, 3), round(cv.model4, 3)))
cv.means <- apply(cv.df, MARGIN=1, mean)
cv.df <- cbind(cv.df, round(cv.means, 3))

fold.names <- paste("pred year", 1:5)
# The last column holds the mean error across folds, not the AIC
colnames(cv.df) <- c(fold.names, "mean error")
stargazer(cv.df, type="text", summary=FALSE)

Table 6: Candidate Models with C.V. Results

Model Specification               Pred. 1   Pred. 2   Pred. 3   Pred. 4   Pred. 5   Mean Error   AIC
ARIMA (0, 1, 2) x (1, 1, 1)_52    0.257     0.227     0.220     0.147     0.084     0.187        -1,083.2
ARIMA (2, 1, 4) x (1, 1, 1)_52    0.258     0.232     0.223     0.152     0.083     0.190        -1,085.4
ARIMA (2, 1, 4) x (0, 1, 1)_52    0.271     0.243     0.192     0.151     0.059     0.183        -1,082.2
ARIMA (4, 1, 3) x (1, 1, 0)_52    0.276     0.251     0.289     0.162     0.090     0.214        -1,083.5

All 4 models have impressively low prediction error, around 0.2 summed over 52 periods. For scale, because the data was log transformed, typical observations had values between 3 and 4. Of the 4 models, Models 3 and 1 had the lowest mean errors (0.183 and 0.187), despite Model 2's superior AIC score. Given the extremely similar results, my preference is for the most parsimonious model, Model 1: ARIMA (0, 1, 2) x (1, 1, 1)_52. This result reinforces the early qualitative assessment of the ACF and PACF plots, which graphically supported the seasonal (1, 1, 1) specification. I can now fit this model to the entire training set and generate predictions for the following 104 weekly periods.
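To put the errors in Table 6 on a more interpretable scale, the sketch below converts them to a per-week root mean squared error. It assumes the tabulated values are sums of squared log-scale errors over 52 weeks, which is what the cross-validation code computes; the variable names are illustrative.

```r
# Mean error column of Table 6, one entry per candidate model
mean.sse <- c(model1 = 0.187, model2 = 0.190, model3 = 0.183, model4 = 0.214)
h <- 52                                   # weeks in each prediction window

rmse.log  <- sqrt(mean.sse / h)           # typical weekly error on the log scale
pct.error <- 100 * (exp(rmse.log) - 1)    # rough relative error on the original scale
round(rmse.log, 3)
round(pct.error, 1)
```

Under that assumption every model is off by roughly 0.06 log units in a typical week, or about 6 percent of the index value, which shows how small the gaps between the candidates are.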

Best Model and Diagnostics:

best.model <- arima(trans2, order = c(0, 1, 2), seasonal = list(order = c(1, 1, 1), period = 52))
best.model

##
## Call:
## arima(x = trans2, order = c(0, 1, 2), seasonal = list(order = c(1, 1, 1), period = 52))
##
## Coefficients:
##          ma1     ma2    sar1    sma1
##       -0.678  -0.213  -0.278  -0.274
## s.e.   0.046   0.047   0.100   0.101
##
## sigma^2 estimated as 0.00445:  log likelihood = 546.6,  aic = -1083
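The AIC in this output can be reproduced by hand from the reported log likelihood, using AIC = -2 log L + 2k. In R's arima the parameter count k includes the innovation variance in addition to the coefficients. The variable names below are illustrative.

```r
loglik <- 546.6        # log likelihood reported in the output above
n.coef <- 4            # ma1, ma2, sar1, sma1
k <- n.coef + 1        # plus one for the innovation variance sigma^2
aic <- -2 * loglik + 2 * k
aic                    # -1083.2, matching the value in Tables 2 and 5
```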

tsdiag(best.model)


[Figure: tsdiag output in three panels. "Standardized Residuals" against Time (0 to 500); "ACF of Residuals" against Lag (0 to 25); "p values for Ljung-Box statistic" against lag (1 to 10).]

Looking at the diagnostic plots, the fit seems very strong. There are no meaningful correlations in the ACF plot, nor is there any evidence of trend or seasonality in the residuals plot.
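The Ljung-Box p-values in the bottom diagnostic panel can also be computed directly with `Box.test`. The sketch below runs the test on a synthetic MA(2) fit rather than on `best.model`; the simulated coefficients and seed are assumptions.

```r
set.seed(10)
x <- arima.sim(model = list(ma = c(-0.6, -0.2)), n = 400)   # synthetic MA(2) series
fit <- arima(x, order = c(0, 0, 2), include.mean = FALSE)

# fitdf subtracts the number of estimated ARMA coefficients from the degrees of freedom
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 2)
lb$p.value    # a large p-value means no evidence of leftover autocorrelation
```

For the chosen SARIMA model the analogous call would be `Box.test(residuals(best.model), lag = 10, type = "Ljung-Box", fitdf = 4)`, since four coefficients were estimated.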

Plot our Predictions:

m = 104
logfcast = predict(best.model, n.ahead = m)

fcast = exp(logfcast$pred)

newx = 1:(length(trans2) + m)
lognewy = c(trans2, logfcast$pred)
newy <- exp(lognewy)

plot(newx, newy, type = "l", main="Original Data With Forecast",
     ylab = "Search Popularity", xlab="Week")

points(newx[((length(trans2)+1):(length(trans2) + m))],
       newy[((length(trans2)+1):(length(trans2) + m))], col = "blue", type = "l")

legend("topright", legend = c("Original Data", "Predictions"),
       col = c("black", "blue"), lty = c(1, 1))

[Figure: "Original Data With Forecast". Line plot; x-axis: Week (0 to 600), y-axis: Search Popularity (40 to 100); legend: Original Data (black), Predictions (blue).]
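The forecast above shows point predictions only. As a hedged sketch of how approximate 95% intervals could be added, `predict` also returns standard errors on the log scale, and exponentiating the interval endpoints maps them back to the original scale. The example uses a synthetic log-scale series and a simple MA(2) fit, both assumptions, so that it runs without the original data.

```r
set.seed(104)
# Synthetic log-scale series standing in for trans2
log.x <- 4 + arima.sim(model = list(ma = c(-0.6, -0.2)), n = 300, sd = 0.07)
fit <- arima(log.x, order = c(0, 0, 2))

m <- 104
log.fc <- predict(fit, n.ahead = m)       # $pred and $se are both on the log scale
z <- qnorm(0.975)
fc <- data.frame(
  week   = length(log.x) + 1:m,
  median = as.numeric(exp(log.fc$pred)),
  lower  = as.numeric(exp(log.fc$pred - z * log.fc$se)),
  upper  = as.numeric(exp(log.fc$pred + z * log.fc$se))
)
head(fc)
```

Under a Gaussian model on the log scale, the exponentiated point forecast is the median of the forecast distribution on the original scale rather than its mean, which is why the column is labeled `median`.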

Conclusion

In seeking to fit this Google Trend dataset I confronted a number of issues. Due to heteroskedasticity as well as linear and seasonal trend, the data had to be massaged in order to get a stationary data set. After a log transformation and a first and seasonal difference, I was able to consider a variety of ARIMA models. Since the data didn't show any drift, I was able to directly fit a variety of SARIMA models with various parameters at the seasonal lag as well as within the seasonal period. After performing automated fits across a (p x q) x (P x Q) search space, I restricted my attention to the 4 best candidate models. At least 2 of these models seemed to have more parameters than were necessary. As such, I performed cross-validation with a rolling 52-week window. This allowed me to generate fits, predictions, and errors for 5 different year-long periods. In the end the 4 candidate models all performed similarly, all with small average errors. As such, I chose the most parsimonious model: ARIMA (0, 1, 2) x (1, 1, 1)_52. The visual plot of predictions certainly seems to be a plausible future path of this series.
