
Statistical Analysis of Google Flu Trends

A PROJECT

SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

OF THE UNIVERSITY OF MINNESOTA

BY

Melissa Sandahl

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

MASTER OF SCIENCE

Yang Li and Kang James

July, 2016

© Melissa Sandahl 2016

ALL RIGHTS RESERVED

Acknowledgements

Firstly, I would like to thank my advisors Yang Li and Kang James for all of their time, guidance and expert support while completing this project. I would also like to thank Xuan Li for serving on my committee and the positive support. Lastly, I am very thankful for everything that the University’s Mathematics & Statistics department has provided me with and helped me accomplish over the last two years.

Abstract

Predicting the behavior of influenza is crucial to helping health officials prepare for and decrease possible outbreaks of the infectious disease. This project discusses methods for testing Google flu count data taken from 2008 - 2014 for spatial autocorrelation, seasonality and temporal effects. We will generate an appropriate seasonal ARIMA model to fit the data for the overall nation as well as use the statistical program R to develop multiple state models. Lastly, the Ljung-Box test will be applied to test for goodness of fit and model adequacy. The goal of this project is to be able to forecast future influenza outbreaks from Google flu trends across the United States in hopes of increasing preparation standards.

Contents

Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
2 Spatial Data
2.1 Spatial Weights Matrix
2.2 Spatial Dependence
2.2.1 Global Test for Spatial Autocorrelation
2.2.2 Local Test for Spatial Autocorrelation
3 Time Series Analysis
4 Seasonal ARIMA Model
4.1 The General Model
4.1.1 Simple ARIMA Example
4.2 ACF and PACF
4.3 R ARIMA Model
4.4 State Models
4.5 US Model Accuracy
4.6 State Model Accuracy
5 Model Forecasting
6 Conclusion and Discussion
References
Appendix A. Spatial Matrix
Appendix B. Moran I Proof

List of Tables

2.1 Global Moran Is: January 2008 - June 2011
2.2 Global Moran Is: July 2011 - December 2014
4.1 ARIMA model generated from ACF and PACF plots after differencing
4.2 Results from ARIMA model generated from R
4.3 Seasonal ARIMA models selected by R: Alabama - Montana
4.4 Seasonal ARIMA models selected by R: Nebraska - Wyoming
4.5 Seasonal ARIMA model statistics: Alabama - Montana
4.6 Seasonal ARIMA model statistics: Nebraska - Wyoming

List of Figures

2.1 Average Monthly Flu Counts from 2008 - 2014
2.2 Average Monthly Flu Counts Per Person from 2008 - 2014
2.3 Time Series of observed Global Morans
2.4 Local Moran I values for Alabama
2.5 Local Moran I values for Montana
2.6 Local Moran I values for North Dakota
2.7 Local Moran I values for South Dakota
2.8 Time series of Local Moran I values for Alabama, Montana and North Dakota
3.1 Time Series of original data collected
3.2 Time series of data after it was logged
4.1 ACF and PACF for US monthly data
4.2 Time Series of the data after 12 month differencing
4.3 ACF and PACF for US monthly data after differenced for 12 months
4.4 Time Series, ACF, and PACF of the model’s residuals
5.1 Forecast for ARIMA(1, 0, 1) × (2, 0, 0)12 model
A.1 Map of the continental states with corresponding numbers [13]
A.2 List of Neighbors for the 48 States [13]

Chapter 1

Introduction

Accurate models that predict influenza outbreaks across the United States can help public health officials track and prepare for future events and, ideally, reduce the number of deaths. Large data sets on diseases and epidemics can now be collected quickly and easily through internet-based programs. However, generating useful models to help predict epidemics is largely dependent on the availability and accuracy of this data [14].

The data for this project came from Google Flu Trends [5]. Google keeps track of the number of times that the word ’flu’ or flu-like symptoms are searched on its website and maintains a database with this information on a weekly basis. The data is collected both nationwide and for each individual state.

Studies concerning Google-search-based tracking models have been done, such as the AutoRegression with Google data (ARGO) model developed by Yang, Santillana and Kou [14] and the Google Flu Trends state-space SEIR model by Dukic, Lopes and Polson [3]. Both models were created with the hope of tracking disease behavior using different temporal and spatial inputs.

We will start the project by using the data collected for each state to investigate whether any spatial autocorrelation is present. When testing for spatial autocorrelation, we must build a spatial weights matrix which depicts the locational similarities between the areas in the study. We will test for both global and local spatial autocorrelation in the data in order to get more accurate results.

Next, we will analyze the data as a seasonal time series. Using the ACF and PACF plots, we will determine which lags in the data are significant in order to develop an appropriate seasonal ARIMA model that fits the data. For this project we will assess the model’s adequacy using the Ljung-Box test, which tests whether the model’s residuals are independently distributed.

Our goal for this project is to create a model that can forecast future behavior of Google flu trends at the nationwide and/or state level. In doing so, we would be able to predict whether one state will have a rise in flu cases based on the frequency of Google flu counts in neighboring states and previous time periods.

Chapter 2

Spatial Data

Spatial data is the term used to describe data that has a spatial or geographical component associated with it. This data can be characteristic or, more commonly, numerical observations.

There are two common spatial structures that are used when modeling regional spatial data. One method determines the proximity of areas based on distances between the centroids of each areal unit. It is assumed that the observation is made at the centroid of each region, and a spatial covariance structure is then developed based on the distances between the centroids. This strategy does not take into account the fact that the centroid does not accurately describe the behavior across the entire region [13].

The second method, and the one used in this project, uses a neighborhood structure to create the spatial covariance matrix. Here the proximity of the areal units is determined by the borders shared in the regional lattice structure. Thus, regions are considered neighbors when they share a common border. Unfortunately, this method uses irregular lattices and has not been studied in as much depth as the first method.

In this project, we used area data of Google flu counts for the 48 continental U.S. states, collected from Google [5]. Area data is defined as the type of spatial data where the observations are associated with a fixed set of areal units. As stated before, we used a neighborhood structure that contained areas/zones with irregular boundaries [4].

2.1 Spatial Weights Matrix

We used what is known as a spatial weights matrix to describe the spatial relationships within our data. Since we wanted to investigate whether a rise in Google Flu Counts in one state would affect the states surrounding it, we decided to use the neighbors of each state to generate our spatial weights matrix. Thus, when two states, areas $i$ and $j$, share a common border, the entries $W_{i,j}$ and $W_{j,i}$ are given a value of 1. All diagonal entries $W_{i,i}$ are assigned a value of 0 [4].

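The Spatial Weights Matrix $\tilde{W}$ of the 48 continental U.S. states is:

\[
\tilde{W} =
\begin{pmatrix}
0 & W_{1,2} & \cdots & W_{1,48} \\
W_{2,1} & 0 & \cdots & W_{2,48} \\
\vdots & \vdots & \ddots & \vdots \\
W_{48,1} & W_{48,2} & \cdots & 0
\end{pmatrix},
\qquad
W_{i,j} =
\begin{cases}
1 & \text{if area } j \text{ is neighbors with area } i, \\
0 & \text{otherwise.}
\end{cases}
\]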

See Appendix A for a map of the 48 areal units labeled in their corresponding alphabetical order and a list of all the neighbors for each state. Before moving on with the data, we row-standardized $\tilde{W}$ so that each entry can be interpreted as the portion of spatial influence that area $j$ has on area $i$.
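The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-area neighbor list is made up (it is not the 48-state list from Appendix A), and the helper names `binary_weights` and `row_standardize` are hypothetical.

```python
# Hypothetical toy example: four areas instead of the 48 states.
# The neighbor lists below are invented for illustration only.
neighbors = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1],
    3: [1],
}


def binary_weights(neighbors):
    """Return the n x n matrix with W[i][j] = 1 when areas i and j share a border."""
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i, adjacent in neighbors.items():
        for j in adjacent:
            W[i][j] = 1.0
    return W


def row_standardize(W):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    standardized = []
    for row in W:
        total = sum(row)
        standardized.append([w / total if total > 0 else 0.0 for w in row])
    return standardized


W = binary_weights(neighbors)
W_std = row_standardize(W)
```

After row standardization, area 1 (three neighbors) gives each neighbor a weight of 1/3, while area 3 (one neighbor) gives its single neighbor a weight of 1.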


2.2 Spatial Dependence

In order to create a useful model with this data, we must first check whether there is spatial dependence among the observations. We will investigate whether there is spatial autocorrelation in the number of Google Flu Counts based on the proximity of the 48 continental states. We will run two tests to check for spatial dependence: the first on a global scale, and the second looking at local spatial dependence for each location.

Before we ran any tests, we wanted to look at what was happening with the flu counts per person in each state at multiple time periods. Figure 2.2 below reveals what happened every other month across the United States for the year 2014. We observed that a larger percentage of people seem to search flu trends in the upper Midwest all year round than in other areas across the country.

When looking for spatial dependence, we are comparing the similarity between the observations for each of the 48 locations. We will use the following variables to help us test for spatial autocorrelation:

$n$: number of areas in the sample,
$i, j$: two different areas in the sample,
$z_i$: observation collected from area $i$,
$\bar{z}$: average of all $n$ observations,
$W_{ij}$: similarity in location of areas $i$ and $j$,
$M_{ij}$: similarity in observations of areas $i$ and $j$.

To see visually what the counts we collected looked like, we took the average monthly data from 2008 - 2014 and generated density plots of the geographical region. Figure 2.1 displays the density counts for every other month.

We noticed that the states with the largest density of counts were the ones with the largest populations, e.g. Texas, Arizona, California. For that reason, Figure 2.1 is not useful for making any conjectures, since population size rather than spatial effects would cause those states to have higher Google flu counts. To account for this, we divided the monthly counts by each state’s population so we could use per capita data. We then generated the same density plots with the monthly average per capita data; see Figure 2.2.

When we graphed the newly transformed data, we saw that clustering no longer happens primarily in the southern region of the US. Instead, clustering now happens mainly in the upper Midwest. One possible explanation is that this region generally has harsher winters and colder temperatures year round, which could increase the chances of contracting influenza.

[Figure 2.1: six map panels (Average January, March, May, July, September and November Counts); axes x and y; color scale: Count]

Figure 2.1: Average Monthly Flu Counts from 2008 - 2014

[Figure 2.2: six map panels (Average January, March, May, July, September and November Logged Counts); axes x and y; color scale: Count]

Figure 2.2: Average Monthly Flu Counts Per Person from 2008 - 2014


2.2.1 Global Test for Spatial Autocorrelation

Global spatial autocorrelation measures and tests use the entire spatial weights matrix $\tilde{W}$ to determine whether there is spatial autocorrelation over the total area in the study. Local measures, in contrast, calculate a statistic for each area in the study and use a smaller, restricted set of areal units [4].

When measuring global spatial autocorrelation, we compared the similarities in the observations $M_{ij}$ with the similarities in the locations $W_{ij}$ by using the cross-product:

\[
\sum_{i=1}^{n} \sum_{j=1}^{n} M_{ij} W_{ij} \tag{2.1}
\]

The two most commonly used measures of spatial autocorrelation for areal units are the Moran’s I and Geary’s c statistics. Both statistics determine the overall degree of spatial correlation in the data set as a whole.

For this project, we used Moran’s I statistic to determine whether global spatial autocorrelation is present in our data. Moran’s I statistic uses cross-products to measure value similarity, $M_{ij} = (z_i - \bar{z})(z_j - \bar{z})$, whereas Geary’s c uses squared differences, $(z_i - z_j)^2$.

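The global Moran I statistic is:

\[
I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij}(z_i - \bar{z})(z_j - \bar{z})}{\left( \sum_{i=1}^{n} \sum_{j \neq i} W_{ij} \right) \sum_{i=1}^{n} (z_i - \bar{z})^2} \tag{2.2}
\]

\[
E[I] = -\frac{1}{n - 1} \tag{2.3}
\]

\[
\mathrm{var}(I) = \frac{n^2 (n - 1) W_1 - n (n - 1) W_2 - 2 W_0^2}{(n + 1)(n - 1)^2 W_0^2} \tag{2.4}
\]

where

\[
W_0 = \sum_{i=1}^{n} \sum_{j \neq i} W_{ij} \tag{2.5}
\]

\[
W_1 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j \neq i} (W_{ij} + W_{ji})^2 \tag{2.6}
\]

\[
W_2 = \sum_{k=1}^{n} \left( \sum_{j=1}^{n} W_{kj} + \sum_{i=1}^{n} W_{ik} \right)^2 \tag{2.7}
\]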

Please see Appendix B for the proof of the expected value for Moran I.

The null hypothesis associated with the Moran I statistic is that the observed values are randomly distributed across the areal units, i.e. there is no spatial autocorrelation. To test for the significance of spatial autocorrelation, R will randomly assign the observations to the areal units and calculate the Moran I for a large number of these random assignments. The observed Moran I is then compared to this random set of Is, and if the observed I falls below the 5th percentile or above the 95th percentile, then spatial autocorrelation is present at the $\alpha = 0.05$ level [6]. Therefore, we looked for p-values that were significant (< 0.05), which would imply that we could reject the null hypothesis and conclude that spatial autocorrelation is present [4].

When positive spatial autocorrelation is present, for large data sets, the observed Moran I statistic will be large relative to the expected value of the statistic under the null hypothesis of no spatial relation. To see this, consider two neighboring areas $i$ and $j$ that both have high observation values. Both values will be larger than the average, $\bar{z}$, so the cross product $(z_i - \bar{z})(z_j - \bar{z})$ will be a large positive value.
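To make the statistic and the random-assignment test concrete, here is a minimal Python sketch. It is not the R routine used in the project: the function names, the toy "path" layout in which each area borders only the next one, and the one-sided pseudo p-value are all assumptions chosen for illustration.

```python
import random


def morans_i(z, W):
    """Global Moran's I for observations z and an n x n weights matrix W."""
    n = len(z)
    z_bar = sum(z) / n
    dev = [zi - z_bar for zi in z]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    w0 = sum(W[i][j] for i, j in pairs)
    cross = sum(W[i][j] * dev[i] * dev[j] for i, j in pairs)
    return (n / w0) * cross / sum(d * d for d in dev)


def expected_i(n):
    """Expected value of Moran's I under the null hypothesis."""
    return -1.0 / (n - 1)


def permutation_p_value(z, W, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of random assignments with I >= observed."""
    rng = random.Random(seed)
    observed = morans_i(z, W)
    shuffled = list(z)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, W) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


def path_weights(n):
    """Hypothetical layout: area k borders only areas k - 1 and k + 1."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        W[i][i + 1] = 1.0
        W[i + 1][i] = 1.0
    return W


W4 = path_weights(4)
i_gradient = morans_i([1, 2, 3, 4], W4)     # neighbors have similar values
i_alternating = morans_i([1, 4, 1, 4], W4)  # neighbors have dissimilar values
p_gradient = permutation_p_value(list(range(1, 13)), path_weights(12))
```

With similar neighbors the statistic is positive, with alternating neighbors it is negative, and for n = 48 areas `expected_i(48)` rounds to the -0.021 reported as the expected I in Tables 2.1 and 2.2.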

Using the Moran.I() command in the package {ape} in R, we found the Global Moran I statistics for all 84 time periods in the study. These results are shown in Tables 2.1 and 2.2 below.

Date Observed I Expected I St. Dev P val

Jan-08 0.066 -0.021 0.096 0.367

Feb-08 0.138 -0.021 0.095 0.094

Mar-08 0.198 -0.021 0.092 0.018

Apr-08 0.119 -0.021 0.093 0.133

May-08 0.062 -0.021 0.092 0.364

Jun-08 0.039 -0.021 0.091 0.509

Jul-08 0.024 -0.021 0.089 0.615

Aug-08 -0.003 -0.021 0.086 0.831

Sep-08 -0.036 -0.021 0.086 0.864

Oct-08 -0.054 -0.021 0.088 0.714

Nov-08 -0.047 -0.021 0.091 0.778

Dec-08 0.017 -0.021 0.093 0.682

Jan-09 0.023 -0.021 0.090 0.627

Feb-09 0.098 -0.021 0.094 0.208

Mar-09 0.115 -0.021 0.093 0.141

Apr-09 0.141 -0.021 0.095 0.088

May-09 0.050 -0.021 0.091 0.432

Jun-09 0.121 -0.021 0.089 0.111

Jul-09 0.018 -0.021 0.087 0.655

Aug-09 0.096 -0.021 0.095 0.218

Sep-09 -0.001 -0.021 0.097 0.837

Oct-09 0.183 -0.021 0.094 0.030

Nov-09 0.133 -0.021 0.093 0.098

Dec-09 -0.026 -0.021 0.093 0.958

Jan-10 -0.030 -0.021 0.091 0.924

Feb-10 -0.011 -0.021 0.095 0.910

Mar-10 0.016 -0.021 0.094 0.694

Apr-10 0.041 -0.021 0.093 0.505

May-10 0.034 -0.021 0.092 0.544

Jun-10 0.012 -0.021 0.091 0.715

Jul-10 0.014 -0.021 0.090 0.697

Aug-10 0.007 -0.021 0.089 0.751

Sep-10 -0.038 -0.021 0.092 0.858

Oct-10 -0.036 -0.021 0.093 0.872

Nov-10 -0.030 -0.021 0.093 0.929

Dec-10 -0.001 -0.021 0.095 0.833

Jan-11 -0.020 -0.021 0.095 0.986

Feb-11 0.085 -0.021 0.096 0.270

Mar-11 0.188 -0.021 0.095 0.027

Apr-11 0.147 -0.021 0.095 0.076

May-11 0.079 -0.021 0.094 0.285

Jun-11 0.039 -0.021 0.092 0.510

Table 2.1: Global Moran Is: January 2008 - June 2011


Date Observed I Expected I St. Dev P val

Jul-11 0.018 -0.021 0.089 0.659

Aug-11 -0.008 -0.021 0.089 0.884

Sep-11 -0.051 -0.021 0.091 0.745

Oct-11 -0.040 -0.021 0.090 0.832

Nov-11 -0.030 -0.021 0.091 0.923

Dec-11 -0.043 -0.021 0.094 0.816

Jan-12 -0.019 -0.021 0.094 0.983

Feb-12 0.017 -0.021 0.095 0.691

Mar-12 0.123 -0.021 0.095 0.128

Apr-12 0.050 -0.021 0.094 0.450

May-12 0.021 -0.021 0.093 0.651

Jun-12 0.015 -0.021 0.090 0.685

Jul-12 0.004 -0.021 0.087 0.775

Aug-12 -0.031 -0.021 0.087 0.911

Sep-12 -0.072 -0.021 0.089 0.569

Oct-12 -0.010 -0.021 0.092 0.899

Nov-12 0.050 -0.021 0.097 0.461

Dec-12 0.089 -0.021 0.096 0.250

Jan-13 0.182 -0.021 0.096 0.034

Feb-13 0.071 -0.021 0.096 0.338

Mar-13 0.000 -0.021 0.094 0.824

Apr-13 0.002 -0.021 0.093 0.805

May-13 -0.008 -0.021 0.090 0.887

Jun-13 -0.005 -0.021 0.090 0.861

Jul-13 -0.013 -0.021 0.091 0.931

Aug-13 0.002 -0.021 0.095 0.808

Sep-13 -0.079 -0.021 0.094 0.538

Oct-13 -0.045 -0.021 0.095 0.802

Nov-13 -0.024 -0.021 0.095 0.977

Dec-13 0.199 -0.021 0.097 0.022

Jan-14 0.064 -0.021 0.096 0.376

Feb-14 -0.006 -0.021 0.093 0.870

Mar-14 -0.022 -0.021 0.094 0.990

Apr-14 -0.008 -0.021 0.092 0.887

May-14 -0.036 -0.021 0.092 0.877

Jun-14 0.001 -0.021 0.094 0.813

Jul-14 -0.016 -0.021 0.092 0.958

Aug-14 -0.061 -0.021 0.083 0.637

Sep-14 -0.027 -0.021 0.084 0.950

Oct-14 0.002 -0.021 0.095 0.804

Nov-14 -0.019 -0.021 0.095 0.983

Dec-14 0.151 -0.021 0.095 0.068

Table 2.2: Global Moran Is: July 2011 - December 2014


A graph of the observed Global Moran I values for all 84 time periods is shown below in Figure 2.3. The reference line for the expected Global Moran I is also included in the plot, along with one standard deviation above and below the observed value. Most of the observed Global Moran values lie within one standard deviation of the expected value, and testing the I values from Tables 2.1 and 2.2 identified only five months that displayed significant spatial autocorrelation (March 2008, October 2009, March 2011, January 2013 and December 2013). This result is not very strong, and we could not conclude that there is global spatial autocorrelation in the data. Thus, we decided to test each state for any local spatial autocorrelation.

[Figure 2.3: plot titled "Global Moran Is"; x-axis: Time (2008 to 2014); y-axis: Observed Global Moran Is (−1.0 to 1.0)]

Figure 2.3: Time Series of observed Global Morans

2.2.2 Local Test for Spatial Autocorrelation

Since we did not find any significant global spatial autocorrelation, our next step was to test for signs of local spatial autocorrelation. Local statistics are used to determine whether each areal unit has a large amount of spatial clustering or whether there are notable similarities in the observations of surrounding areas. Local indicators of spatial association (LISA) were created by Luc Anselin to determine the influence of each individual observation rather than looking at the entire sample area [2].

The test for local spatial autocorrelation utilizes the cross product:

\[
\sum_{j=1}^{n} M_{ij} W_{ij} \tag{2.8}
\]

which compares a specific observation or areal unit with its neighbors. As in the global test, $M_{ij} = (z_i - \bar{z})(z_j - \bar{z})$. We also now have neighborhood sets $J_i$, which are the collections of neighbors for area $i$, and $\bar{z}$ now represents the average observation value for just area $i$’s neighboring states.

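The local Moran $I_i$ statistic for area $i$ is:

\[
I_i = (z_i - \bar{z}) \sum_{j \in J_i} W_{ij}(z_j - \bar{z}) \tag{2.9}
\]

\[
E[I_i] = -\frac{1}{n - 1} \sum_{j=1}^{n} W_{ij} \tag{2.10}
\]

\[
\mathrm{var}[I_i] = \frac{1}{n - 1} W_{i(2)} (n - b_2) + \frac{2}{(n - 1)(n - 2)} W_{i(kh)} (2 b_2 - n) - \frac{1}{(n - 1)^2} \tilde{W}_i^2 \tag{2.11}
\]

where

\[
W_{i(2)} = \sum_{j \neq i} W_{ij}^2 \tag{2.12}
\]

\[
2 W_{i(kh)} = \sum_{k \neq i} \sum_{h \neq i} W_{ik} W_{ih} \tag{2.13}
\]

\[
\tilde{W}_i = \sum_{j=1}^{n} W_{ij} \tag{2.14}
\]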

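One thing to notice is that the sum of $I_i$ over all $i$ is proportional to the global Moran I statistic in Equation 2.2, since it reproduces the double sum in its numerator:

\[
\sum_{i=1}^{n} I_i \equiv \sum_{i=1}^{n} (z_i - \bar{z}) \sum_{j \in J_i} W_{ij}(z_j - \bar{z}) \tag{2.15}
\]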

Using the localmoran() command in the package {spdep} in R, we again calculated Moran I statistics for each of the 84 time periods. However, when testing for local spatial autocorrelation we generated a test statistic for each of the 48 areas for each time period. Thus, we observed 4,032 local Moran I statistics.
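As a rough illustration of Equation 2.9, the following Python sketch computes one local statistic per area for a single time period. It assumes the unscaled form written in Equation 2.9 with the overall mean for $\bar{z}$; R implementations may scale the statistic differently, so the numbers are not meant to match spdep output. The four-area data and the name `local_morans` are hypothetical.

```python
def local_morans(z, W):
    """Unscaled local Moran statistics, one per area, following Equation 2.9."""
    n = len(z)
    z_bar = sum(z) / n  # assumption: overall mean of all n observations
    dev = [zi - z_bar for zi in z]
    return [
        dev[i] * sum(W[i][j] * dev[j] for j in range(n) if j != i)
        for i in range(n)
    ]


# Hypothetical layout: four areas in a row, each bordering the next.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
z = [1.0, 2.0, 3.0, 4.0]

local = local_morans(z, W)

# Summing the local values recovers the numerator of the global statistic,
# so rescaling by n / (W0 * sum of squared deviations) gives the global Moran I.
n = len(z)
w0 = sum(sum(row) for row in W)
ss = sum((zi - sum(z) / n) ** 2 for zi in z)
global_i = n / (w0 * ss) * sum(local)
```

Repeating this for 48 areas and 84 months yields the 48 × 84 = 4,032 statistics mentioned above.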

From these results, we generated a graph for each state that includes the calculated test statistic, one standard deviation above and below it, and the expected $I_i$ value for reference. From these, we found that only six states were consistently further than one standard deviation away from the expected value: Delaware, Montana, North Dakota, South Dakota, Texas and Wyoming. The corresponding graphs for Montana, North Dakota and South Dakota are displayed below, along with the plot for Alabama, which did not display significant spatial autocorrelation.

[Figure 2.4: plot titled "Local Moran I: Alabama"; x-axis: Time (2008 to 2014); y-axis: Observed Local Moran Is (−10 to 10)]

Figure 2.4: Local Moran I values for Alabama

[Figure 2.5: plot titled "Local Moran I: Montana"; x-axis: Time (2008 to 2014); y-axis: Observed Local Moran Is (−10 to 10)]

Figure 2.5: Local Moran I values for Montana

[Figure 2.6: plot titled "Local Moran I: North Dakota"; x-axis: Time (2008 to 2014); y-axis: Observed Local Moran Is (−10 to 10)]

Figure 2.6: Local Moran I values for North Dakota

[Figure 2.7: plot titled "Local Moran I: South Dakota"; x-axis: Time (2008 to 2014); y-axis: Observed Local Moran Is (−10 to 10)]

Figure 2.7: Local Moran I values for South Dakota

[Figure 2.8: plot titled "Observed Local Moran I values"; x-axis: Time (2008 to 2014); y-axis: Observed Local Moran I's (−5 to 15); legend: Alabama, Montana, North Dakota]

Figure 2.8: Time series of Local Moran I values for Alabama, Montana and North Dakota

Figure 2.8 displays the trends in the observed local Moran Is for three of the states. Alabama is used as a reference for a state that does not display any type of local spatial autocorrelation, whereas both Montana’s and North Dakota’s Moran Is indicated that they had significant local spatial autocorrelation.

However, despite having six states with significant local spatial autocorrelation, our results are similar to those for global spatial autocorrelation. We concluded that our data did not show enough significant local spatial autocorrelation to justify including it in our model building.

Chapter 3

Time Series Analysis

For this project, we are using average monthly data observed over 7 years, for a total of 84 data points for each location. Since the data we collected was weekly, we manually converted it to monthly averages. Weekly data was listed by the first day of the week in which the counts started. Thus, if a week started at the end of January but also contained days in February, it was only counted toward January’s average. A time series plot of the raw data is shown in Figure 3.1 below. Notice that there is a large amount of variance between the peaks.
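The weekly-to-monthly conversion rule described above can be sketched as follows. The conversion in the project was done manually, so this Python version is only an illustration; the function name `monthly_averages` and the three weekly counts are invented.

```python
from collections import defaultdict
from datetime import date


def monthly_averages(weekly):
    """Average weekly counts by the month in which each week starts.

    weekly: list of (week_start_date, count) pairs.
    Returns a dict mapping (year, month) to the mean weekly count.
    """
    buckets = defaultdict(list)
    for start, count in weekly:
        buckets[(start.year, start.month)].append(count)
    return {key: sum(values) / len(values) for key, values in sorted(buckets.items())}


# Hypothetical counts, for illustration only.
weekly = [
    (date(2008, 1, 20), 100),
    (date(2008, 1, 27), 140),  # runs into February but is counted as January
    (date(2008, 2, 3), 200),
]
monthly = monthly_averages(weekly)
```

The week starting on 27 January contributes only to the January average, matching the rule stated in the text.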

Looking at Figure 3.2, we can see that after we take the log of the data, there is much less variance in the output. The data now ranges roughly from 6 to 10, versus 50 to 1100 with the raw data. Also, since our data is in the form of counts, taking the log will help normalize the data. We will continue our project by using the log-transformed data to generate useful models.

Another option we could have tried would have been to difference the data. Differencing is commonly used for a seasonal time series. Seasonality typically causes a time series to be nonstationary, since the season is the main factor affecting the output at certain time periods. Differencing, i.e. computing the difference between consecutive observations, can help make a time series stationary and remove patterns in the data. However, we decided not to use differencing and to keep the seasonal effects in the data for modeling purposes that we will discuss later in the project [7].

Next, notice that there is an obvious spike in the data that occurs once every year. These spikes represent flu season, which typically peaks between December and February, with the majority of years peaking in February [12]. Because of this, we incorporated a 12 month seasonal effect into our models.

Another observation of the data is the atypical spike for the 2012 - 2013 flu season. According to the Centers for Disease Control and Prevention (CDC), the flu vaccine was 52% effective at preventing acute respiratory illness that required medical attention that year. However, the vaccine’s effectiveness was lower for those aged 65 years and older, which could have been one factor in the flu season’s severity [12].

[Figure 3.1: plot titled "Time Series of Monthly Average US Raw Data"; x-axis: Dates (2008 to 2014); y-axis: Raw Monthly Average Flu Counts (2000 to 10000)]

Figure 3.1: Time Series of original data collected

[Figure 3.2: plot titled "Time Series of Monthly Average US Logged Data"; x-axis: Dates (2008 to 2014); y-axis: Logged Monthly Average Flu Counts (6.0 to 9.0)]

Figure 3.2: Time series of data after it was logged

Chapter 4

Seasonal ARIMA Model

Since we are using monthly data to help us determine a model for predicting Google flu counts, we would expect our model to be seasonal. Seasonality in a time series means there is a pattern present that repeats every $m$ time periods. Thus, since flu season generally peaks around the same time each year, we would expect our data to have a recurring pattern every 12 months.

4.1 The General Model

The seasonal ARIMA model includes both seasonal and non-seasonal autoregressive and moving average components. The model also incorporates the differencing that was mentioned earlier. The backshift operator, $B$, is defined by:

\[
B^{m} x_t = x_{t-m} \tag{4.1}
\]

The notation for the general model is:

\[
\mathrm{ARIMA}(p, d, q) \times (P, D, Q)_m
\]

$p$: non-seasonal AR order,
$d$: non-seasonal differencing,
$q$: non-seasonal MA order,
$P$: seasonal AR order,
$D$: seasonal differencing,
$Q$: seasonal MA order,
$m$: number of months per season.

The model can be written as:

\[
\Phi(B^m)\,\phi(B)\,(1 - B)^d (1 - B^m)^D (x_t - \mu) = \Theta(B^m)\,\theta(B)\,w_t \tag{4.2}
\]

The non-seasonal AR component is written as:

\[
\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p \tag{4.3}
\]

The non-seasonal MA component is written as:

\[
\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q \tag{4.4}
\]

The seasonal AR component is written as:

\[
\Phi(B^m) = 1 - \Phi_1 B^m - \dots - \Phi_P B^{mP} \tag{4.5}
\]

The seasonal MA component is written as:

\[
\Theta(B^m) = 1 + \Theta_1 B^m + \dots + \Theta_Q B^{mQ} \tag{4.6}
\]

4.1.1 Simple ARIMA Example

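To illustrate a simple example of a seasonal ARIMA model, let’s look at an $\mathrm{ARIMA}(0, 1, 1) \times (1, 0, 1)_{12}$ model.

Our model would start as:

\[
\Phi(B^{12})(1 - B)(x_t - \mu) = \Theta(B^{12})\,\theta(B)\,w_t \tag{4.7}
\]

Now, replace $(x_t - \mu)$ with $z_t$ and substitute in the appropriate equations to get:

\[
(1 - \Phi B^{12})(1 - B) z_t = (1 + \Theta B^{12})(1 + \theta B) w_t \tag{4.8}
\]

\[
(1 - B - \Phi B^{12} + \Phi B^{13}) z_t = (1 + \theta B + \Theta B^{12} + \Theta\theta B^{13}) w_t \tag{4.9}
\]

Thus, we get the resulting equation:

\[
z_t = z_{t-1} + \Phi z_{t-12} - \Phi z_{t-13} + w_t + \theta w_{t-1} + \Theta w_{t-12} + \Theta\theta w_{t-13} \tag{4.10}
\]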
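The expansion of the operator polynomials can be checked mechanically. The Python sketch below represents a polynomial in the backshift operator $B$ as a dictionary from power to coefficient and multiplies the two factors on the AR side of an ARIMA(0, 1, 1) × (1, 0, 1) model with period 12, where the seasonal factor is a polynomial in $B^{12}$. The helper name `poly_mul` and the value Φ = 0.5 are assumptions made purely for illustration.

```python
def poly_mul(a, b):
    """Multiply two polynomials in B given as {power: coefficient} dicts."""
    out = {}
    for power_a, coef_a in a.items():
        for power_b, coef_b in b.items():
            power = power_a + power_b
            out[power] = out.get(power, 0.0) + coef_a * coef_b
    return out


Phi = 0.5  # hypothetical seasonal AR coefficient

seasonal_ar = {0: 1.0, 12: -Phi}   # 1 - Phi * B^12
difference = {0: 1.0, 1: -1.0}     # 1 - B

ar_side = poly_mul(seasonal_ar, difference)
# ar_side holds the coefficients of 1 - B - Phi * B^12 + Phi * B^13
```

Moving every term except the constant to the right-hand side flips the signs, which gives the $z_{t-1}$, $z_{t-12}$ and $z_{t-13}$ terms of the model equation.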


4.2 ACF and PACF

Next, we created ACF and PACF plots for the nationwide data which are shown in

Figure 4.1.

[Figure 4.1: two panels for "Series: USData"; x-axis: LAG (0 to 30); y-axes: ACF and PACF (−0.5 to 1.0)]

Figure 4.1: ACF and PACF for US monthly data

We observed that the ACF plot in Figure 4.1 has cyclic spikes that do not seem to dampen in magnitude at any lag value. Thus, it would be difficult to determine a model for this data based on these plots.
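To show why a strongly seasonal series produces cyclic ACF spikes, here is a small Python sketch of the sample autocorrelation function. It is not the plotting routine used for Figure 4.1; the pure sine wave with period 12 stands in for the flu series, and the 1.96/√n bound is the usual approximate significance band, stated here as an assumption.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations of x for lags 0, 1, ..., max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [value - mean for value in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Hypothetical series: 84 months of a pure 12-month cycle.
x = [math.sin(2 * math.pi * t / 12) for t in range(84)]
acf = sample_acf(x, 24)

# Approximate 95% band for judging whether a lag is significant (assumption).
bound = 1.96 / math.sqrt(len(x))
```

For this toy series the autocorrelation is strongly negative at lag 6 and strongly positive at lags 12 and 24, the same slowly decaying cyclic pattern described for the undifferenced data.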

With seasonal data, it is sometimes helpful to analyze the ACF and PACF plots after differencing the data. Seasonality commonly brings about nonstationarity, since it is typical for seasonal data to fluctuate and peak during certain time periods. Thus, we applied a 12 month difference to our data; the time series plot after the differencing can be found in Figure 4.2. Notice that the new plot no longer has the consistent peak and valley pattern that was present in Figure 3.2. We then generated ACF and PACF plots for the data after applying the 12 month difference, which can be found in Figure 4.3.
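A 12 month difference subtracts from each observation the value from the same month one year earlier. The Python sketch below illustrates this on an invented series; the function name `seasonal_difference` and the toy data are assumptions, not the project's code.

```python
def seasonal_difference(x, lag=12):
    """Return x[t] - x[t - lag] for every t where both values exist."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


# Hypothetical series: a fixed 12-month pattern plus a slow upward trend.
season = [3.0, 2.5, 1.5, 1.0, 0.5, 0.2, 0.1, 0.2, 0.6, 1.2, 2.0, 2.8]
x = [season[t % 12] + 0.1 * t for t in range(36)]

diffed = seasonal_difference(x)
```

The repeating pattern cancels out and only the year-over-year change remains. The differenced series is shorter by one season, so 84 monthly values would leave 72.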

[Figure 4.2: plot titled "Time Series after Differenced for 12 Months"; x-axis: Dates (2008 to 2014); y-axis: 12 Month Differenced Flu Counts (−1.5 to 1.5)]

Figure 4.2: Time Series of the data after 12 month differencing

[Figure 4.3: two panels for "Series: Differenced"; x-axis: LAG (0 to 30); y-axes: ACF and PACF (−0.5 to 1.0)]

Figure 4.3: ACF and PACF for US monthly data after differenced for 12 months

From analyzing Figure 4.3 we were able to make an educated guess about which parameters to include in a seasonal ARIMA model. We used the ACF plot to help us determine the seasonal and non-seasonal MA orders and the PACF plot for the seasonal and non-seasonal AR orders.

Looking at just the first couple of lags in the ACF, we determined that a non-seasonal MA order of 1 would fit, since there is only a spike at lag 1. Next, we analyzed lags 12, 24 and 36 and decided to use a seasonal MA order of 1 as well, since lag 12 seemed to be the only significant seasonal lag.

We used the same methodology for picking the AR components of the model by analyzing the PACF plot. We decided to use a non-seasonal AR order of 2 and a seasonal AR order of 0.

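Thus, we created the following model:

\[
\mathrm{ARIMA}(2, 0, 1) \times (0, 0, 1)_{12}
\]

Applying these parameters to equation (4.2) we get:

\[
\phi(B)(x_t - \mu) = \Theta(B^{12})\,\theta(B)\,w_t \tag{4.11}
\]

\[
(1 - \phi_1 B - \phi_2 B^2)(x_t - \mu) = (1 + \Theta B^{12})(1 + \theta B) w_t \tag{4.12}
\]

For simplicity, let $z_t = (x_t - \mu)$ and multiply the two polynomials on the right side:

\[
(1 - \phi_1 B - \phi_2 B^2) z_t = (1 + \Theta B^{12} + \theta B + \Theta\theta B^{13}) w_t \tag{4.13}
\]

\[
z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + w_t + \Theta w_{t-12} + \theta w_{t-1} + \Theta\theta w_{t-13} \tag{4.14}
\]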


Type Coef S.E.

AR 1 (φ1) 1.0545 0.1558

AR 2 (φ2) -0.4153 0.1504

MA 1 (θ) 0.5563 0.1603

SMA 1 (Θ) 0.5034 0.1276

Constant 7.1957 0.1713

AIC = 26.95, σ̂² = 0.0651

Table 4.1: ARIMA model generated from ACF and PACF plots after differencing

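Substituting the estimates from Table 4.1 into equation (4.14), our final model is:

\[
z_t = 1.0545 z_{t-1} - 0.4153 z_{t-2} + w_t + 0.5034 w_{t-12} + 0.5563 w_{t-1} + 0.2800 w_{t-13} \tag{4.15}
\]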

4.3 R ARIMA Model

Now that we have analyzed the ACF and PACF plots for the data and generated a model from them, we decided to see what type of model R would choose to fit the data. We used the auto.arima() command in the package {forecast} in R to help us develop a potentially different model. This function selects the best model based on the Akaike information criterion (AIC). The AIC is used to estimate the quality of a model in relation to other models. Thus, it will determine which model is the most useful out of a set of candidate models; it will not determine the significance of an individual model.

Our results were as follows:

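$$\text{ARIMA}(1, 0, 1) \times (2, 0, 0)_{12}$$

Applying these parameters to equation (4.1.1) we get:

$$\Phi(B^{12})\,\phi(B)(x_t - \mu) = \theta(B)\,w_t \tag{4.16}$$

$$(1 - \Phi_1 B^{12} - \Phi_2 B^{24})(1 - \phi B)(x_t - \mu) = (1 + \theta B)\,w_t \tag{4.17}$$

For simplicity, let $z_t = (x_t - \mu)$ and multiply the first two polynomials on the left side to get:

$$(1 - \Phi_1 B^{12} - \Phi_2 B^{24} - \phi B + \Phi_1\phi B^{13} + \Phi_2\phi B^{25})\,z_t = (1 + \theta B)\,w_t \tag{4.18}$$

$$z_t = \Phi_1 z_{t-12} + \Phi_2 z_{t-24} + \phi z_{t-1} - \Phi_1\phi z_{t-13} - \Phi_2\phi z_{t-25} + w_t + \theta w_{t-1} \tag{4.19}$$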

Type          Coef      S.E.
AR 1 (φ)      0.6375    0.0913
MA 1 (θ)      0.7387    0.0920
SAR 1 (Φ1)    0.3779    0.1091
SAR 2 (Φ2)    0.3785    0.1251
Constant      7.2354    0.3217

AIC = 15.57, σ̂² = 0.0568

Table 4.2: Results from ARIMA model generated from R

Our final model is:

$$z_t = 0.3779 z_{t-12} + 0.3785 z_{t-24} + 0.6375 z_{t-1} - 0.2409 z_{t-13} - 0.2413 z_{t-25} + w_t + 0.7387 w_{t-1} \tag{4.20}$$
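
The cross terms at lags 13 and 25 come from multiplying the seasonal and non-seasonal AR polynomials. As a quick check, the sketch below repeats that multiplication numerically in Python with the estimates from Table 4.2. The dictionary representation of a polynomial in B and the helper name are illustrative choices, not part of the original analysis.

```python
def multiply(p, q):
    """Multiply two polynomials in the backshift operator B.

    Each polynomial is a dict mapping a power of B to its coefficient.
    """
    out = {}
    for i, a in p.items():
        for j, b in q.items():
            out[i + j] = out.get(i + j, 0.0) + a * b
    return out


# Estimates from Table 4.2.
phi, Phi1, Phi2 = 0.6375, 0.3779, 0.3785

seasonal_ar = {0: 1.0, 12: -Phi1, 24: -Phi2}   # 1 - Phi1 B^12 - Phi2 B^24
regular_ar = {0: 1.0, 1: -phi}                 # 1 - phi B

ar_side = multiply(seasonal_ar, regular_ar)

# Moving every term except z_t to the right-hand side flips the signs.
rhs = {lag: -coef for lag, coef in sorted(ar_side.items()) if lag != 0}
for lag, coef in rhs.items():
    print(f"z[t-{lag}]: {coef:+.4f}")
```

The printed coefficients for lags 1, 12, 13, 24 and 25 should match the autoregressive terms of the final model to four decimal places.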

Our first model, which we developed by analyzing the ACF and PACF plots, had an AIC value of 26.95, whereas the model chosen by R had an AIC value of only 15.57. Thus, we concluded that the model generated by R, ARIMA(1, 0, 1) × (2, 0, 0)₁₂, is a better model for the data than the one we specified ourselves.
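
One way to put the AIC gap in perspective is the relative likelihood exp((AIC_min − AIC)/2), a standard companion to AIC comparisons. The Python sketch below applies it to the two AIC values reported above; it is an added illustration, and the dictionary keys are only descriptive labels.

```python
import math


def relative_likelihood(aic, aic_best):
    """exp((AIC_min - AIC) / 2): plausibility of a model relative to the best one."""
    return math.exp((aic_best - aic) / 2.0)


# AIC values reported in Tables 4.1 and 4.2.
aic_values = {
    "ARIMA(2,0,1)x(0,0,1)12 (from ACF/PACF)": 26.95,
    "ARIMA(1,0,1)x(2,0,0)12 (auto.arima)": 15.57,
}

best = min(aic_values, key=aic_values.get)
for name, aic in aic_values.items():
    delta = aic - aic_values[best]
    rel = relative_likelihood(aic, aic_values[best])
    print(f"{name}: delta AIC = {delta:.2f}, relative likelihood = {rel:.4f}")
```

Both models are fitted without differencing, so their AIC values refer to the same series and can be compared directly.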

4.4 State Models

Next, we ran auto.arima() on each of the 48 individual states in order to gather

state-specific models. The models that R chose for each state are listed below in Tables

4.3 and 4.4.

State           AR   MA   SAR   SMA   Period   NonSDiff   SDiff       AIC
Alabama          1    1     0     2       12          0       1    56.393
Arizona          2    0     1     1       12          0       0    51.056
Arkansas         1    3     1     2       12          0       0    45.365
California       2    0     2     0       12          0       0    19.249
Colorado         1    1     1     1       12          0       0    52.624
Connecticut      1    1     2     0       12          0       0    92.317
Delaware         1    1     2     0       12          0       0    16.141
Florida          1    1     1     0       12          0       1   -10.093
Georgia          1    1     2     0       12          0       0    42.990
Idaho            3    0     2     0       12          0       0    73.055
Illinois         1    1     0     1       12          0       1    21.765
Indiana          2    0     1     0       12          0       1    45.295
Iowa             1    1     1     0       12          0       1    56.480
Kansas           1    1     0     1       12          0       1    64.199
Kentucky         2    0     2     0       12          0       0    73.715
Louisiana        1    1     2     0       12          0       0    31.711
Maine            0    2     2     0       12          0       0    89.355
Maryland         1    1     1     0       12          0       1    -0.696
Massachusetts    2    0     1     0       12          0       1    75.148
Michigan         1    1     1     0       12          0       1     4.383
Minnesota        1    1     2     0       12          0       0    60.982
Mississippi      2    0     2     0       12          0       0    75.025
Missouri         1    1     0     2       12          0       1    34.746
Montana          3    1     1     2       12          0       0    79.302

Table 4.3: Seasonal ARIMA models selected by R: Alabama - Montana

State           AR   MA   SAR   SMA   Period   NonSDiff   SDiff       AIC
Nebraska         2    0     1     0       12          0       0    85.521
Nevada           2    0     2     0       12          0       0    29.620
New Hampshire    1    1     2     0       12          0       0   100.658
New Jersey       1    1     1     2       12          0       0    62.443
New Mexico       1    2     1     1       12          0       0    30.337
New York         1    1     2     0       12          0       0    35.076
North Carolina   1    1     2     0       12          0       0    62.608
North Dakota     1    1     2     0       12          0       1    83.835
Ohio             2    0     2     0       12          0       0    37.356
Oklahoma         4    0     2     0       12          0       0    62.203
Oregon           2    0     2     0       12          0       0    65.044
Pennsylvania     1    1     2     0       12          0       0    -9.615
Rhode Island     2    0     1     0       12          0       1    92.514
South Carolina   2    0     1     1       12          0       0    77.623
South Dakota     2    0     1     1       12          0       0   100.126
Tennessee        1    1     1     1       12          0       0    63.286
Texas            1    1     2     0       12          0       0    30.807
Utah             0    2     1     2       12          0       0    62.374
Vermont          2    1     1     0       12          0       0    80.656
Virginia         2    0     2     0       12          0       0     6.013
Washington       1    1     1     2       12          0       0    36.091
West Virginia    2    0     2     0       12          0       0    40.571
Wisconsin        1    1     1     0       12          0       1    30.775
Wyoming          2    1     0     0       12          0       0    70.470

Table 4.4: Seasonal ARIMA models selected by R: Nebraska - Wyoming

Notice that the model specifications vary considerably across the 48 states. ARIMA(1, 0, 1) × (2, 0, 0)₁₂ is the most common model in Tables 4.3 and 4.4, which makes sense since it is also the model that R chose to describe the flu counts for the country as a whole. However, 13 states include a seasonal difference in their model, which is something we did not include in the national model.
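
The tallies quoted in this paragraph can be reproduced directly from the two tables. The Python sketch below transcribes the AR, MA, SAR, SMA and SDiff columns of Tables 4.3 and 4.4 (NonSDiff is 0 for every state) and counts the specifications; the parsing scheme and variable names are illustrative.

```python
from collections import Counter

# State, AR, MA, SAR, SMA, SDiff, transcribed from Tables 4.3 and 4.4.
ROWS = """
Alabama 1 1 0 2 1
Arizona 2 0 1 1 0
Arkansas 1 3 1 2 0
California 2 0 2 0 0
Colorado 1 1 1 1 0
Connecticut 1 1 2 0 0
Delaware 1 1 2 0 0
Florida 1 1 1 0 1
Georgia 1 1 2 0 0
Idaho 3 0 2 0 0
Illinois 1 1 0 1 1
Indiana 2 0 1 0 1
Iowa 1 1 1 0 1
Kansas 1 1 0 1 1
Kentucky 2 0 2 0 0
Louisiana 1 1 2 0 0
Maine 0 2 2 0 0
Maryland 1 1 1 0 1
Massachusetts 2 0 1 0 1
Michigan 1 1 1 0 1
Minnesota 1 1 2 0 0
Mississippi 2 0 2 0 0
Missouri 1 1 0 2 1
Montana 3 1 1 2 0
Nebraska 2 0 1 0 0
Nevada 2 0 2 0 0
New Hampshire 1 1 2 0 0
New Jersey 1 1 1 2 0
New Mexico 1 2 1 1 0
New York 1 1 2 0 0
North Carolina 1 1 2 0 0
North Dakota 1 1 2 0 1
Ohio 2 0 2 0 0
Oklahoma 4 0 2 0 0
Oregon 2 0 2 0 0
Pennsylvania 1 1 2 0 0
Rhode Island 2 0 1 0 1
South Carolina 2 0 1 1 0
South Dakota 2 0 1 1 0
Tennessee 1 1 1 1 0
Texas 1 1 2 0 0
Utah 0 2 1 2 0
Vermont 2 1 1 0 0
Virginia 2 0 2 0 0
Washington 1 1 1 2 0
West Virginia 2 0 2 0 0
Wisconsin 1 1 1 0 1
Wyoming 2 1 0 0 0
""".strip().splitlines()

models = {}
for line in ROWS:
    parts = line.split()
    state = " ".join(parts[:-5])          # state names may contain spaces
    ar, ma, sar, sma, sdiff = (int(v) for v in parts[-5:])
    # Stored as ((p, d, q), (P, D, Q)) with d = 0 for every state.
    models[state] = ((ar, 0, ma), (sar, sdiff, sma))

counts = Counter(models.values())
most_common_spec, n_states = counts.most_common(1)[0]
n_seasonal_diff = sum(1 for _, seasonal in models.values() if seasonal[1] == 1)
print(most_common_spec, n_states, n_seasonal_diff)
```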

4.5 US Model Accuracy

We then made a time series plot of the residuals of the ARIMA(1, 0, 1) × (2, 0, 0)₁₂ model, as well as their ACF and PACF plots, which are shown in Figure 4.4 below.

[Three panels: time series of the residuals (res), ACF versus Lag, and PACF versus Lag, lags 0 to 25]

Figure 4.4: Time series, ACF, and PACF of the model's residuals

There is no apparent trend in the time series of the residuals, which indicates a decent model. There are also no significant spikes in either the ACF or the PACF plot, which suggests that no autocorrelation remains in the residuals. Based on these plots, we concluded that this model describes the data sufficiently well.

Next, we ran a Ljung-Box test on the model to help us further determine the model’s

usefulness [8]. The Ljung-Box test is defined as:

H0: The data are independently distributed

Ha: The data are not independently distributed (correlation is present)

with the test statistic:

$$Q = T(T+2) \sum_{g=1}^{h} \frac{\hat{\rho}_g^2}{T-g}$$

where

T is the length of the time series,
ρ̂_g is the sample autocorrelation at lag g,
h is the number of lags being tested,
K is the number of model parameters.

Under the null hypothesis, the test statistic Q typically follows a chi-squared distribution with h degrees of freedom, χ²(h). However, since we are applying the test to the residuals of an ARIMA model, we are testing whether the errors are independent and must adjust the degrees of freedom accordingly. For a seasonal ARIMA model, the degrees of freedom are the number of lags being tested less the number of model parameters, h − K. Thus, at significance level α, the rejection region for the hypothesis of randomness in the residuals is:

$$Q > \chi^2_{1-\alpha,\, h-K}$$
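
The test statistic defined above can be computed in a few lines. The following Python sketch is an illustration with a hypothetical function name and a toy residual series chosen so the result can be checked by hand; it is not the R routine used for the results in this chapter.

```python
def ljung_box_q(residuals, h):
    """Ljung-Box statistic Q = T (T + 2) * sum_{g=1}^{h} rho_g^2 / (T - g)."""
    T = len(residuals)
    mean = sum(residuals) / T
    c0 = sum((r - mean) ** 2 for r in residuals)
    total = 0.0
    for g in range(1, h + 1):
        cg = sum(
            (residuals[t] - mean) * (residuals[t - g] - mean)
            for t in range(g, T)
        )
        rho_g = cg / c0                  # sample autocorrelation at lag g
        total += rho_g ** 2 / (T - g)
    return T * (T + 2) * total


# Toy residuals: rho_1 = -3/4, so Q = 4 * 6 * (0.5625 / 3) = 4.5 for h = 1.
toy_residuals = [1.0, -1.0, 1.0, -1.0]
q = ljung_box_q(toy_residuals, h=1)
print(q)
```

The value of Q is then compared with a chi-squared quantile on h − K degrees of freedom, as in the rejection region above.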

According to Rob Hyndman and George Athanasopoulos [9], there is no standard value for h, but as a rule of thumb for seasonal data one should let h = min(2m, T/5), where m is the seasonal period and T is the length of the time series.

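Applying this test to the model generated by R at the 5% significance level (α = 0.05), with m = 12 and T = 84, we get:

$$h = \min(2 \cdot 12,\ 84/5) = \min(24,\ 16.8) \approx 17,$$

$$Q = 8.1738,$$

$$\chi^2_{0.95,\,13} = 22.36,$$

$$p\text{-value} = 0.8321,$$

$$\text{AIC} = 15.57.$$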

We would reject the hypothesis of randomness in the residuals if Q > 22.36. Since the test statistic for our model is 8.1738, we fail to reject the null hypothesis and conclude that the residuals are consistent with being independently distributed.
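
The critical value and p-value quoted above can be checked without a statistics package. The Python sketch below evaluates the chi-squared distribution through the series for the regularized lower incomplete gamma function and inverts it by bisection. The helper names are hypothetical, and K = 4 (counting φ, θ, Φ1 and Φ2 but not the constant) is an assumption inferred from the 13 degrees of freedom reported in the text.

```python
import math


def chi2_cdf(x, df):
    """Chi-squared CDF via the power series of the regularized
    lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total and n < 10_000:
        n += 1
        term *= y / (a + n)
        total += term
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))


def chi2_quantile(p, df, lo=0.0, hi=200.0):
    """Invert the CDF by bisection (adequate for the small df used here)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


m, T = 12, 84                    # seasonal period and series length from the text
K = 4                            # assumed: phi, theta, Phi1, Phi2
h = round(min(2 * m, T / 5))     # rule of thumb, rounded to a whole number of lags
df = h - K
Q = 8.1738                       # reported test statistic

critical = chi2_quantile(0.95, df)
p_value = 1.0 - chi2_cdf(Q, df)
print(h, df, round(critical, 2), round(p_value, 4))
```

The same two functions can be applied to the Q statistics and degrees of freedom reported for the individual state models.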

4.6 State Model Accuracy

Next, we ran the Ljung-Box test on the models that R generated for each individual state. The test statistics, degrees of freedom, p-values and AIC values are listed in Tables 4.5 and 4.6 below. Using the same test as for the overall US model, we reject the hypothesis of independence only for the state of Arkansas at the α = 0.05 level. We conclude that the residuals of the models for the remaining 47 states appear to be independently distributed and that these models are useful for our data.

State           Q-statistic   DF  p-value       AIC
Alabama              13.478   14    0.489    56.393
Arizona               9.418   13    0.741    51.056
Arkansas             19.891   10    0.030    45.365
California            8.344   13    0.820    19.249
Colorado              4.444   13    0.985    52.624
Connecticut          16.314   13    0.233    92.317
Delaware             10.380   13    0.663    16.141
Florida               9.847   15    0.829   -10.093
Georgia              13.392   13    0.418    42.990
Idaho                 9.476   12    0.662    73.055
Illinois             14.532   15    0.486    21.765
Indiana              14.995   15    0.452    45.295
Iowa                 11.571   15    0.711    56.480
Kansas                8.317   15    0.910    64.199
Kentucky             10.906   13    0.619    73.715
Louisiana             7.865   13    0.852    31.711
Maine                11.078   13    0.604    89.355
Maryland              7.327   15    0.948    -0.696
Massachusetts        10.519   15    0.786    75.148
Michigan             13.373   15    0.574     4.383
Minnesota             9.025   13    0.771    60.982
Mississippi          21.479   13    0.064    75.025
Missouri             11.488   14    0.647    34.746
Montana               6.296   10    0.790    79.302

Table 4.5: Seasonal ARIMA model statistics: Alabama - Montana

State           Q-statistic   DF  p-value       AIC
Nebraska             16.957   14    0.258    85.521
Nevada                3.778   13    0.993    29.620
New Hampshire        12.350   13    0.499   100.658
New Jersey           10.271   12    0.592    62.443
New Mexico            3.737   12    0.988    30.337
New York             14.578   13    0.334    35.076
North Carolina       12.101   13    0.519    62.608
North Dakota         11.379   14    0.656    83.835
Ohio                 17.240   13    0.189    37.356
Oklahoma             11.168   11    0.429    62.203
Oregon               10.200   13    0.677    65.044
Pennsylvania         12.022   13    0.526    -9.615
Rhode Island         11.614   15    0.708    92.514
South Carolina        6.925   13    0.906    77.623
South Dakota         10.969   13    0.613   100.126
Tennessee             9.210   13    0.757    63.286
Texas                 5.074   13    0.974    30.807
Utah                  9.045   12    0.699    62.374
Vermont              13.333   13    0.422    80.656
Virginia             13.872   13    0.383     6.013
Washington            7.749   12    0.804    36.091
West Virginia         7.615   13    0.868    40.571
Wisconsin            10.877   15    0.761    30.775
Wyoming               8.633   14    0.854    70.470

Table 4.6: Seasonal ARIMA model statistics: Nebraska - Wyoming

Chapter 5

Model Forecasting

Our final task is to forecast future values from the model that we generated. To do this, we used R's forecast() command and applied it to our ARIMA(1, 0, 1) × (2, 0, 0)₁₂ model. The resulting forecast is shown below in Figure 5.1.

[Plot "24 Month Forecast"; x-axis: years 2008 to 2016; y-axis: values from 6.0 to 9.0]

Figure 5.1: Forecast for the ARIMA(1, 0, 1) × (2, 0, 0)₁₂ model

The black line on the left represents the actual data that we collected, whereas the blue line shows the model's forecast for the following 24 months. The shaded areas are prediction intervals for the forecast: the light grey region is the 95% prediction interval and the purple area is the 80% prediction interval.

The dotted line in Figure 5.1 is data that we collected from Google for January 2015 - July 2015. From this graph, the forecast looks to be a good fit, since the observed values stay inside both the 80% and 95% prediction intervals.
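
To give a sense of how such prediction intervals widen with the horizon, the sketch below computes approximate interval half-widths from the ψ-weights of equation (4.20) and the innovation variance in Table 4.2. It is a textbook approximation in Python that assumes Gaussian errors and ignores parameter uncertainty; it is not necessarily the exact computation performed by forecast(), and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def psi_weights(ar, ma, n):
    """First n psi-weights of z_t = sum_i ar[i] z_{t-i} + w_t + sum_j ma[j] w_{t-j}.

    ar and ma are dicts mapping a lag to its coefficient.
    """
    psi = [1.0]
    for j in range(1, n):
        value = ma.get(j, 0.0)
        for lag, coef in ar.items():
            if lag <= j:
                value += coef * psi[j - lag]
        psi.append(value)
    return psi


# Coefficients of equation (4.20) and sigma^2 from Table 4.2.
ar = {1: 0.6375, 12: 0.3779, 13: -0.2409, 24: 0.3785, 25: -0.2413}
ma = {1: 0.7387}
sigma2 = 0.0568

horizon = 24
psi = psi_weights(ar, ma, horizon)


def half_width(h, level):
    """Half-width of the h-step-ahead interval at the given coverage level."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z * math.sqrt(sigma2 * sum(p * p for p in psi[:h]))


for h in (1, 12, 24):
    print(h, round(half_width(h, 0.80), 3), round(half_width(h, 0.95), 3))
```

The half-widths are on the log scale of the modelled series and are added to and subtracted from the point forecast at each horizon.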

Chapter 6

Conclusion and Discussion

Starting with weekly flu count data for 2008 - 2014, we converted it to monthly averages and log-transformed it to normalize the data and reduce its variance. We were then left with a seasonal time series ready to be analyzed for spatial and temporal effects.
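
The preprocessing step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: each week is assigned to the calendar month of its date, the natural logarithm is used, and the weekly counts shown are hypothetical rather than taken from the Google data.

```python
import math
from collections import defaultdict
from datetime import date


def monthly_log_means(weekly):
    """Average weekly counts within each calendar month, then take natural logs.

    weekly is a list of (date, count) pairs; the result maps (year, month)
    to log(mean count).
    """
    buckets = defaultdict(list)
    for day, count in weekly:
        buckets[(day.year, day.month)].append(count)
    return {key: math.log(sum(v) / len(v)) for key, v in sorted(buckets.items())}


# Hypothetical weekly counts for illustration only.
weekly = [
    (date(2008, 1, 6), 1200), (date(2008, 1, 13), 1400),
    (date(2008, 1, 20), 1600), (date(2008, 1, 27), 1800),
    (date(2008, 2, 3), 2000), (date(2008, 2, 10), 2200),
]

monthly = monthly_log_means(weekly)
for key, value in monthly.items():
    print(key, round(value, 4))
```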

Our first step was to test for both global and local spatial autocorrelation, using Moran's I statistic for both. We did not find enough evidence of significant global spatial autocorrelation. Although a handful of states displayed local spatial autocorrelation, we decided this was not enough evidence to keep spatial analysis in our model. Thus, we continued without spatial effects and looked only at temporal effects.

We created two potential seasonal ARIMA models: one from analyzing the ACF and PACF plots of the seasonally differenced data, and the other generated by R. We then used the AIC values to choose between the two models. The lower AIC value led us to choose the model that R selected for the data.

Once we had our final model, we applied the Ljung-Box test to assess the overall adequacy of the seasonal ARIMA model. At the 5% significance level, we failed to reject the null hypothesis that the residuals are independently distributed, so we found no evidence that correlation was present. Thus, our model was an adequate fit for the data.

Lastly, we created a 24-month forecast from the model. We generated the 80% and 95% prediction intervals for the 24-month forecast and then plotted 7 months of data from 2015 to see how well the model could forecast. The actual data from January - July 2015 stayed within both prediction intervals, so we accepted this forecast and would suggest using it for predicting future Google flu trends across the United States.

References

[1] Anselin, Luc. "Spatial Econometrics." A Companion to Theoretical Econometrics, Chapter 14 (2003): 310-325.

[2] Anselin, Luc. "Local Indicators of Spatial Association - LISA." Geographical Analysis 27 (2): 93-115.

[3] Dukic, Vanja, Hedibert F. Lopes and Nicholas G. Polson. "Tracking Epidemics With Google Flu Trends Data and a State-Space SEIR Model." Journal of the American Statistical Association 107:500, 1410-1426. Web.

[4] Fischer, Manfred M., and Jinfeng Wang. Spatial Data Analysis: Models, Methods and Techniques. Heidelberg: Springer, 2011. Springer Link. Springer International Publishing. Web.

[5] Flu Trends. Google. N.p., 2015. Web. 22 July 2015, available at https://www.google.org/flutrends/about/data/flu/us/data.txt

[6] Franklin, Meredith. Spatial Statistics. USC Keck School of Medicine, 2013. Web. 6 July 2016.

[7] Hyndman, Rob J., and George Athanasopoulos. "8.1 Stationarity and Differencing." OTexts, 2016. Web.

[8] Hyndman, Rob J. "Thoughts on the Ljung-Box test." WordPress, 2014. Web. 02 July 2016.

[9] Hyndman, Rob J., and George Athanasopoulos. "8.9 Seasonal ARIMA Models." OTexts, 2016. Web.

[10] LeSage, James P. Lecture 1: Maximum likelihood estimation of spatial regression models (2004), available at www4.fe.uc.pt/spatial/doc/lecture1.pdf. Web.

[11] Pace, R. Kelley, Ronald Barry, Otis W. Gilley and C.F. Sirmans. "A method for spatial-temporal forecasting with an application to real estate prices." International Journal of Forecasting 16 (2000): n. pag. Web.

[12] "The Flu Season." Centers for Disease Control and Prevention. Centers for Disease Control and Prevention, 22 Oct. 2014. Web.

[13] Wall, Melanie M. "A close look at the spatial structure implied by the CAR and SAR models." Journal of Statistical Planning and Inference 121 (2004): 311-324. Web.

[14] Yang, Shihao, Mauricio Santillana and S. C. Kou. "Accurate estimation of influenza epidemics using Google search data via ARGO." Proceedings of the National Academy of Sciences 112.47 (2015): 14473-4478. Web.

Appendix A

Spatial Matrix

The spatial weights matrix W which is used for finding spatial autocorrelation is

determined by the collection of neighboring states for each state. The geographical map

of the states numbered in alphabetical order is displayed below in Figure A.1. The

corresponding list of neighbors associated with the map is found in Figure A.2 [13].

Figure A.1: Map of the continental states with corresponding numbers [13]

Figure A.2: List of Neighbors for the 48 States [13]

Appendix B

Moran I Proof

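$$I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j \neq i} W_{ij}\,(z_i - \bar{z})(z_j - \bar{z})}{\sum_{i=1}^{n}(z_i - \bar{z})^2}$$

$$E[I] = \frac{-1}{n - 1}$$

Proof

Start by noting:

1. $z_1, \ldots, z_n$ are given and $\sum_{i=1}^{n}(z_i - \bar{z})^2$ is a constant.

2. For $k \neq l$, $E[(z_k - \bar{z})(z_l - \bar{z})]$ is the same $\forall\ k, l$.

3. For $k \neq l$, since the deviations sum to zero we have $\sum_{l \neq k}(z_l - \bar{z}) = -(z_k - \bar{z})$, so

$$E[(z_k - \bar{z})(z_l - \bar{z})] = \frac{\sum_{k}\sum_{l \neq k}(z_k - \bar{z})(z_l - \bar{z})}{n(n-1)} = \frac{\sum_{k}(z_k - \bar{z})\left(-(z_k - \bar{z})\right)}{n(n-1)} = -\frac{\sum_{k}(z_k - \bar{z})^2}{n(n-1)}$$

4. $\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij} = \sum_{i=1}^{n}\sum_{j \neq i} W_{ij}$ since $W_{kk} = 0\ \forall k$.

Now we can algebraically prove:

$$E[I] = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j \neq i} W_{ij}\,E[(z_i - \bar{z})(z_j - \bar{z})]}{\sum_{i=1}^{n}(z_i - \bar{z})^2}$$

$$= \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}} \cdot \frac{\left(-\sum_{k}(z_k - \bar{z})^2\right)\sum_{i=1}^{n}\sum_{j \neq i} W_{ij}}{n(n-1)\sum_{i=1}^{n}(z_i - \bar{z})^2}$$

$$= -\frac{1}{n-1} \qquad \blacksquare$$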
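
The result can also be checked numerically. Under the randomization assumption used in the proof, every assignment of the observed values to the regions is equally likely, so averaging Moran's I over all permutations of z should give exactly −1/(n − 1). The Python sketch below does this for a hypothetical 5-region chain; the weights matrix and the values of z are made up for illustration.

```python
from itertools import permutations


def morans_i(z, W):
    """Moran's I for values z and a spatial weights matrix W with zero diagonal."""
    n = len(z)
    zbar = sum(z) / n
    dev = [v - zbar for v in z]
    s0 = sum(W[i][j] for i in range(n) for j in range(n))
    num = sum(
        W[i][j] * dev[i] * dev[j]
        for i in range(n) for j in range(n) if i != j
    )
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


# Hypothetical example: five regions in a chain, 0-1-2-3-4.
W = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
z = [2.0, 3.5, 1.0, 4.0, 6.5]

# Average I over all 5! = 120 equally likely arrangements of the values.
values = [morans_i(list(p), W) for p in permutations(z)]
expected = sum(values) / len(values)
print(expected, -1 / (len(z) - 1))
```

Any other weights matrix with a zero diagonal, or any other set of values that are not all equal, should give the same average.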
