
Modeling over-dispersed crash data with a long tail: Examining the accuracy of the dispersion parameter in negative binomial models

By

Yajie Zou, Ph.D. Research Associate

Department of Civil and Environmental Engineering University of Washington, Seattle, Washington 98195-2700

Phone: 936-245-5628, fax: 206-543-5965 E-mail: [email protected]

Lingtao Wu (1)

Ph.D. Candidate, Zachry Department of Civil Engineering Texas A&M University, 3136 TAMU

College Station, Texas 77843-3136 Phone: 979-587-3518, fax: 979-845-6481

E-mail: [email protected]

Dominique Lord, Ph.D. Associate Professor and Zachry Development Professor I

Zachry Department of Civil Engineering Texas A&M University, 3136 TAMU

College Station, Texas 77843-3136 Phone: 979-458-3949, fax: 979-845-6481

E-mail: [email protected]

(1) Corresponding author


ABSTRACT

Although many statistical models have been proposed for modeling motor vehicle crashes, the most commonly used statistical tool remains the Negative Binomial (NB) model. Crash data collected for safety studies may exhibit over-dispersion and a long tail (i.e., a few sites have an unusually high number of crashes). However, some studies have shown that NB models cannot adequately handle over-dispersed count data with a long tail. So far, no work has investigated the performance of the dispersion parameter of the NB model when analyzing over-dispersed crash data with a long tail. The dispersion parameter of the NB model plays an important role in various types of transportation safety analysis. The first objective of this study is to examine whether the dispersion parameter can truly reflect the level of dispersion in over-dispersed crash data with a long tail. The second objective is to determine whether the dispersion term of the Sichel (SI) model can be used as an alternative to the dispersion parameter of the NB model. To accomplish these objectives, crash data sets are simulated from NB and SI regression models using different values describing the mean and the dispersion level. For the simulated data sets, the dispersion parameter and dispersion term are estimated and compared to the true values. To complement the output of the simulation study, crash data collected in Texas are also used to compare the dispersion parameter and dispersion term. The results suggest that the dispersion parameter of the NB model can erroneously estimate the level of dispersion in over-dispersed count data with a long tail, and that the dispersion term of the SI model is more reliable in estimating the true level of dispersion. Thus, considering the findings in this study, it is believed that the dispersion term may offer a viable alternative for analyzing over-dispersed crash data with a long tail.

Keywords: Sichel; negative binomial; dispersion parameter; traffic crashes; empirical Bayes


1. Introduction

From a statistical point of view, the occurrence of highway crashes can be treated as random events by assuming that there is an underlying mean crash rate for each individual site (Park et al., 2010). What makes the analysis difficult in modeling crash data is that the crash data are often found to exhibit over-dispersion, meaning that the variance is greater than the mean (Park and Lord, 2009). Lord et al. (2005) provided a fundamental definition that the over-dispersion arises from the actual nature of the crash process. To accommodate over-dispersion in crash data, many mixed-Poisson models have been proposed by transportation safety analysts, such as the negative binomial (NB, also known as Poisson-gamma) models (Poch and Mannering, 1996; Miaou and Lord, 2003), zero-inflated models (Shankar et al., 1997), the Poisson-lognormal (Aguero-Valverde and Jovanis, 2008), the Conway-Maxwell-Poisson (Lord et al., 2008b), the Poisson-Weibull (Cheng et al., 2013), etc. (for a comprehensive review of the mixed-Poisson models used in transportation safety analysis, see Mannering and Bhat (2014)). These statistical models are in fact used as an approximation for modeling crash data. Among these mixed-Poisson models, the NB model remains the most frequently used statistical model for accommodating the over-dispersion observed in the crash data (Lord and Mannering, 2010). Reasons for the popularity of the NB models include: (1) the NB model provides a simple way to manipulate the relationship between the mean and the variance (Lord and Mannering, 2010); (2) the dispersion parameter of the NB model plays an important role in transportation safety analysis. 
Besides the mixed-Poisson models, the random parameters count models (Anastasopoulos and Mannering, 2009; Chen and Tarko, 2014), finite mixture and Markov switching models (Park and Lord, 2009; Malyshkina et al., 2009; Zou et al., 2013b; Zou et al., 2014), generalized ordered-response models (Castro et al., 2012; Bhat et al., 2014) and quantile regression models (Qin and Reyes, 2011) have been proposed for analyzing the crash-frequency data.

The dispersion parameter of the NB model is critical for estimating the weight factor of the empirical Bayes (EB) method (Hauer et al., 1988; Hauer, 1997) and for building confidence intervals for evaluating and screening highway projects (Wood, 2005). Since the above two types of analysis are commonly used in highway safety, it is necessary to obtain reliable estimates of the dispersion parameter. It has been shown that the low sample mean and small sample size can significantly influence the estimation of the dispersion parameter of NB models using the maximum likelihood estimation method and Bayesian method (Maher and Summersgill, 1996; Lord, 2006; Lord and Miranda-Moreno, 2008). To avoid or minimize an unreliably estimated dispersion parameter, Lord (2006) also summarized the minimum sample size for different sample means.


For NB models, the gamma distribution assumed for the probabilistic error term related to the mean of the Poisson variable can be restrictive in terms of its ability to account for heterogeneity across observations (Park et al., 2010). For example, Guo and Trivedi (2002) reported that NB regression models have difficulty modeling heavily over-dispersed data with a long tail and a relatively high mean value because a negligible probability is usually assigned to high counts. Recently, the Sichel distribution (SI, also known as the Poisson-generalized inverse Gaussian distribution) was introduced by Zou et al. (2013a) for calculating EB estimates. The SI distribution is a compound Poisson distribution, which mixes the Poisson distribution with the generalized inverse Gaussian distribution. Previous studies (Stein et al., 1987; Gupta and Ong, 2005) have shown that the SI distribution is useful as a model for over-dispersed count data with a long tail. Among the different mixed-Poisson models, the NB and SI models both have a quadratic variance-mean relationship. Similar to the dispersion parameter of the NB model, a dispersion term of the SI model can be defined to measure the level of dispersion in the data. This dispersion term can be easily used by transportation safety analysts to obtain reliable EB estimates within the SI modeling framework (Zou et al., 2013a).

Considering the importance of the dispersion parameter of the NB model in transportation safety analysis, the objective of this study is to examine whether or not the traditionally used dispersion parameter can truly reflect the level of dispersion in over-dispersed crash data with a long tail and whether the dispersion term of the SI model can be used as an alternative to the dispersion parameter. To accomplish the objectives of this study, crash data sets are simulated from NB and SI models using different combinations of fixed regression parameters describing the mean and the dispersion level. For the simulated datasets, the dispersion parameter and dispersion term are estimated and compared to the true values. The simulation analysis is carried out in this study for the following reason: when analyzing real crash data, the true values of regression parameters and the dispersion level of the crash data are seldom known in practice. In contrast, in a simulation, it is possible to generate crash data with known regression parameters and dispersion levels. The simulation analysis was used in previous transportation safety studies (Lord, 2006; Francis et al., 2012) to characterize the performance of different estimators. To complement the output of the simulation study, crash data collected in Texas are also used to compare the dispersion parameter and dispersion term.

2. Methodology

The NB models have the following probabilistic structure: the number of crashes $Y_{it}$ at the $i$th site and time period $t$, when conditional on its mean $\mu_{it}$, is Poisson distributed and independent over all sites and time periods (Miaou and Lord, 2003):

$$Y_{it} \mid \mu_{it} \sim \text{Poisson}(\mu_{it}), \quad i = 1, 2, \ldots, I \ \text{and} \ t = 1, 2, \ldots, T \qquad (1)$$

The mean of the Poisson is structured as:

$$\mu_{it} = f(X; \boldsymbol{\beta}) \exp(e_{it})$$

where,

$f(\cdot)$ is a function of the covariates;

$\boldsymbol{\beta}$ is a vector of unknown coefficients; and,

$e_{it}$ is the model error independent of all the covariates, and $\exp(e_{it})$ is assumed to be independent and gamma distributed with a mean equal to 1 and a variance $\alpha$.

Then, it can be derived that $Y_{it}$, conditional on $\mu_{it}$ and $\alpha$, is distributed as a NB random variable with a mean $\mu_{it}$ and a variance $\mu_{it} + \alpha\mu_{it}^2$. The probability density function (PDF) of the NB model is defined as follows (for the complete derivation of the NB model, see Hilbe (2011)):

$$f(y_{it} \mid \mu_{it}, \alpha) = \frac{\Gamma\left(y_{it} + \frac{1}{\alpha}\right)}{\Gamma\left(\frac{1}{\alpha}\right)\Gamma(y_{it} + 1)} \left(\frac{\alpha\mu_{it}}{1 + \alpha\mu_{it}}\right)^{y_{it}} \left(\frac{1}{1 + \alpha\mu_{it}}\right)^{1/\alpha} \qquad (2)$$

where,

$y_{it}$ = response variable for observation $i$ and time period $t$;

$\mu_{it}$ = mean response of the observation $i$ and time period $t$; and,

$\alpha$ = dispersion parameter.


Compared to the Poisson distribution, the NB distribution can allow for over-dispersion. If $\alpha \to 0$, the crash variance equals the crash mean and the NB model converges to the Poisson model.
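As a quick numerical check of Eq. (2), the following Python sketch evaluates the NB probability function on the log scale and sums it over a long range of counts. The function names and the values chosen for the mean (5.5) and the dispersion parameter (0.5) are illustrative assumptions, not part of the original analysis, which was carried out in R.

```python
import math

def nb_log_pmf(y, mu, alpha):
    """Log of Eq. (2): NB probability of count y, mean mu, dispersion alpha."""
    inv = 1.0 / alpha
    return (math.lgamma(y + inv) - math.lgamma(inv) - math.lgamma(y + 1)
            + y * math.log(alpha * mu / (1.0 + alpha * mu))
            - inv * math.log(1.0 + alpha * mu))

def nb_moments(mu, alpha, y_max=2000):
    """Total probability, mean and variance from summing the pmf over 0..y_max."""
    probs = [math.exp(nb_log_pmf(y, mu, alpha)) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var

# Illustrative values: the variance should be close to 5.5 + 0.5 * 5.5**2 = 20.625
total, mean, var = nb_moments(mu=5.5, alpha=0.5)
print(total, mean, var)
```

Evaluating `nb_log_pmf` with a very small dispersion value (for example 1e-6) gives probabilities that agree with the Poisson probabilities to several decimal places, in line with the limiting case described above.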

The SI distribution has recently been used for modeling motor vehicle crashes (Zou et al., 2013a). It can be shown that the SI models have the following probabilistic structure: the number of crashes $Y_{it}$ at the $i$th site and time period $t$, when conditional on its mean $\mu_{it}$, is Poisson distributed and independent over all sites and time periods:

$$Y_{it} \mid \mu_{it} \sim \text{Poisson}(\mu_{it}), \quad i = 1, 2, \ldots, I \ \text{and} \ t = 1, 2, \ldots, T \qquad (3)$$

The mean of the Poisson is structured as:

$$\mu_{it} = f(X; \boldsymbol{\beta}) \exp(e_{it})$$

where,

$f(\cdot)$ is a function of the covariates;

$\boldsymbol{\beta}$ is a vector of unknown coefficients; and,

$e_{it}$ is the model error independent of all the covariates, and $\exp(e_{it})$ is assumed to be independent and generalized inverse Gaussian distributed with a mean equal to 1 and a variance $2\sigma(\nu+1)/c + 1/c^2 - 1$.

Then, it can be shown that $Y_{it}$, conditional on $\mu_{it}$ and $2\sigma(\nu+1)/c + 1/c^2 - 1$, is distributed as a SI random variable with a mean $\mu_{it}$ and a variance $\mu_{it} + \mu_{it}^2\left(2\sigma(\nu+1)/c + 1/c^2 - 1\right)$. The PDF of the SI distribution, $SI(\mu_{it}, \sigma, \nu)$, is given by,

$$p(y_{it} \mid \mu_{it}, \sigma, \nu) = \frac{(\mu_{it}/c)^{y_{it}} \, K_{y_{it}+\nu}(\alpha)}{K_{\nu}(1/\sigma) \, y_{it}! \, (\alpha\sigma)^{y_{it}+\nu}} \qquad (4)$$

where,

$y_{it}$ = response variable;

$\mu_{it}$ = mean response of the observation $i$;

$\sigma$ = scale parameter, $\sigma > 0$;

$\nu$ = shape parameter;

$\alpha^2 = \sigma^{-2} + 2\mu_{it}(c\sigma)^{-1}$ (in Eq. (4), $\alpha$ is an auxiliary quantity and not the NB dispersion parameter);

$c = \dfrac{K_{\nu+1}(1/\sigma)}{K_{\nu}(1/\sigma)}$; and,

$K_{\lambda}(t) = \dfrac{1}{2}\displaystyle\int_0^{\infty} x^{\lambda-1} \exp\left\{-\dfrac{1}{2}t\left(x + x^{-1}\right)\right\} dx$ is the modified Bessel function of the third kind.

When $\sigma \to \infty$ and $\nu > 0$, it can be shown that the SI distribution can be reduced to the NB distribution. Note that the NB and SI models both have the quadratic variance-mean relationship, that is $VAR(y_{it}) = \mu_{it} + h(\cdot)\,\mu_{it}^2$, where $\mu_{it} = E(y_{it})$ and $h(\cdot)$ is a function of the parameters of the mixing distribution. For the NB model, $h(\cdot) = \alpha$ is defined as the dispersion parameter; on the other hand, for the SI model, $h(\sigma, \nu) = 2\sigma(\nu+1)/c + 1/c^2 - 1$ can be viewed as a dispersion term. Similar to the dispersion parameter in the NB model, this dispersion term can also be used to measure the level of dispersion. For over-dispersed crash data, the SI model is usually more flexible than the NB model. This is because the variance to mean function for the NB model is defined as

$$\frac{VAR(y_{it})}{E(y_{it})} = \frac{\mu_{it} + \alpha\mu_{it}^2}{\mu_{it}} = 1 + \alpha\mu_{it},$$

while for the SI model, the variance to mean function is

$$\frac{VAR(y_{it})}{E(y_{it})} = \frac{\mu_{it} + \mu_{it}^2\left(2\sigma(\nu+1)/c + 1/c^2 - 1\right)}{\mu_{it}} = 1 + \mu_{it}\left[2\sigma(\nu+1)/c + 1/c^2 - 1\right].$$

Note that the SI model has three different parameters; for crash data with a fixed mean and a fixed variance to mean ratio, many combinations of the parameters $\sigma$ and $\nu$ are therefore possible.
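The quadratic variance-mean relationship shared by the two models can be traced to the law of total variance. The short derivation below is a sketch under the stated mixing assumptions; the symbol $z$ for the multiplicative error $\exp(e_{it})$ is a shorthand introduced here for illustration, with $E(z) = 1$ and $VAR(z) = h$, and $\mu$ denotes the marginal mean.

```latex
\begin{aligned}
Y \mid z &\sim \text{Poisson}(\mu z), \qquad E(z) = 1, \quad VAR(z) = h \\
E(Y) &= E\big[E(Y \mid z)\big] = \mu\,E(z) = \mu \\
VAR(Y) &= E\big[VAR(Y \mid z)\big] + VAR\big[E(Y \mid z)\big] \\
       &= E(\mu z) + VAR(\mu z) \\
       &= \mu + h\,\mu^{2}
\end{aligned}
```

With a gamma-distributed $z$ the variance $h$ is the NB dispersion parameter, and with a generalized inverse Gaussian $z$ it is the SI dispersion term, so only the form of $h$ differs between the two models.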


Since the EB estimates can be used to identify high-frequency crash sites and assess the effects of implemented treatments, it is important to obtain reliable EB estimates. The dispersion parameter of NB models has extensively been used in the EB method. Similar to the dispersion parameter, the dispersion term of the SI model can also be easily used by practitioners to obtain reliable EB estimates. Within the SI modeling framework, the long term mean for a site i using the EB method is given by (for the complete derivation, see Zou et al. (2013a)):

$$\hat{\theta}_i = w_i\,\hat{\mu}_i + (1 - w_i)\,y_i \qquad (5)$$

where $\hat{\theta}_i$ is the EB estimate of the expected number of crashes per year for site $i$; $\hat{\mu}_i$ is the estimated number of crashes by crash prediction models for given site $i$ (estimated using a SI model); $w_i = \dfrac{1}{1 + \hat{\mu}_i\,h(\sigma, \nu)}$ is the weight factor estimated as a function of $\hat{\mu}_i$ and $h(\sigma, \nu) = 2\sigma(\nu+1)/c + 1/c^2 - 1$; and $y_i$ is the observed number of crashes per year at site $i$.
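The EB calculation in Eq. (5) is a weighted average, which the following Python sketch makes concrete. The function name and the input values (a predicted mean of 5 crashes per year, 9 observed crashes and a dispersion value of 0.5) are hypothetical and chosen only for illustration; the same function applies whether the dispersion value is the NB dispersion parameter or the SI dispersion term.

```python
def eb_estimate(mu_hat, y_obs, dispersion):
    """Eq. (5): weighted average of the model prediction and the observed count."""
    w = 1.0 / (1.0 + mu_hat * dispersion)      # weight factor w_i
    return w * mu_hat + (1.0 - w) * y_obs, w

# Hypothetical site: predicted 5 crashes/year, observed 9, dispersion 0.5
eb, w = eb_estimate(mu_hat=5.0, y_obs=9.0, dispersion=0.5)
print(round(w, 4), round(eb, 4))   # prints 0.2857 7.8571
```

A larger dispersion value lowers the weight on the model prediction and pulls the EB estimate toward the observed count, which is why an inaccurate dispersion estimate carries over directly into the EB output.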

So far, the NB distribution is the model most frequently used by transportation safety analysts for calculating the EB estimates (Huang et al., 2009; Cheng and Washington, 2005), and the dispersion parameter can be assumed to be fixed for the entire dataset (Miaou, 1996) or varying over different sites and periods (Hauer, 2001; Miaou and Lord, 2003). The dispersion term of SI models may provide a reliable estimate of the level of dispersion in the data. Recently, two studies (Zou et al., 2013a; Wu et al., 2014) have compared the effect of the dispersion parameter and dispersion term on the precision of the EB analysis. Zou et al. (2013a) found that the selection of the crash prediction model (i.e., the SI or NB model) affects the value of the weight factor used for estimating the EB output. Moreover, Wu et al. (2014) conducted a simulation study, and the results suggest that the SI-based EB method can consistently identify crash-prone sites better than the NB-based EB method.

Although many parameter estimation methods are available for estimating the dispersion parameter and dispersion term, three common methods used by transportation safety modelers are the method of moments, the weighted regression analysis and the maximum likelihood estimation (MLE). Previously, Lord (2006) compared these three estimators and found that MLE can usually provide better estimation results than the other two estimators. Thus, the MLE method is adopted in this study. More details about the parameter estimation are given in Rigby et al. (2008).
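For intuition about the simplest of the three estimators, the sketch below applies a moment estimate to the special case without covariates, where all sites share one mean and the relationship variance = mean + dispersion x mean^2 can be solved for the dispersion. The no-covariate setting, the function name and the toy counts are assumptions made for illustration; this is not the MLE procedure adopted in this study.

```python
import statistics

def mom_dispersion(counts):
    """Moment estimate of the dispersion from Var = mean + alpha * mean**2."""
    mean = statistics.mean(counts)
    var = statistics.variance(counts)      # sample variance, n - 1 denominator
    return (var - mean) / mean ** 2

# Toy counts with one unusually high site
print(mom_dispersion([0, 0, 1, 1, 2, 2, 3, 7]))
```

Because the estimate depends on the sample variance, a single site with a very high count can shift it considerably, which is one reason the estimator is sensitive to long-tailed data.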


3. Simulation protocol

This section describes the methodology and simulation protocol used for estimating the dispersion parameter and the dispersion term under different scenarios. In order to construct crash data that are convincingly similar to empirical crash data, we first summarize the results of previous studies on the application of Poisson-gamma models in traffic safety. Table 1 provides a summary of eight crash datasets from seven published papers. This table includes statistics on crash counts, type and location of crash sites, explanatory variables and reported dispersion parameters for NB models. The mean number of crashes for eight datasets ranges from 0.29 to 17.56 and the corresponding standard deviation is between 0.69 and 36.64. As documented in these seven studies, for various traffic facilities, the reported dispersion parameters for NB regression models are all below 1 except for the value reported by Kumara and Chin (2003). The high dispersion parameter ($\alpha = 1.69$) found in that study is probably explained by the preponderance of zeros in the data (about 80% of three-legged intersections report zero crashes). Overall, Table 1 provides some guidelines for assigning the value of the dispersion parameter in our simulation framework.


Table 1. Characteristics of crash datasets and reported values of dispersion parameter.

| Study | Location | Crashes: Min | Max | Mean | SD* | Explanatory variables | Observations | Reported value of dispersion parameter |
|---|---|---|---|---|---|---|---|---|
| Anastasopoulos and Mannering (2009) | Rural interstate highways | 0 | 329 | 17.56 | 36.64 | Pavement characteristics, geometric characteristics and traffic flow characteristics | 322 | 0.88 |
| Chang (2005) | Freeway | 0 | 7 | 0.67 | 1.00 | Geometric characteristics, traffic flow characteristics and weather information | 1992 | 0.22 |
| El-Basyouny and Sayed (2006) | Urban arterial | 1 | 264 | 49.35 | 45.84 | Geometric characteristics and traffic flow characteristics | 386 | 0.35 |
| Lord et al. (2008b) | Urban intersection | 0 | 54 | 11.56 | 10.02 | Traffic flow characteristics | 868 | 0.14 |
| Lord et al. (2008b) | Rural highways | 0 | 108 | 4.89 | 8.45 | Geometric characteristics and traffic flow characteristics | 3220 | 0.31 |
| Kumara and Chin (2003) | Urban intersection | 0 | 6 | 0.29 | 0.69 | Geometric characteristics, traffic flow characteristics and traffic device information | 2780 | 1.69 |
| Chin and Quddus (2003) | Urban intersection | 0 | 11 | 4.39 | N/A | Geometric characteristics, traffic flow characteristics and traffic device information | 832 | 0.30 |
| Miaou (1994) | Rural interstate highways | 0 | 8 | 0.20 | N/A | Geometric characteristics and traffic flow characteristics | 8263 | 0.95 |

* SD = Standard Deviation


The following section presents the simulation protocol to illustrate the performance of the dispersion term of the SI model in estimating the dispersion parameter of the NB model. Two different experiments are designed. In the first experiment, the NB data were generated and the NB and SI regression models were estimated using the MLE method. In the second experiment, the SI data were generated and the NB and SI regression models were estimated.

3.1. Experiment one

In order to examine the accuracy of parameter estimates from the SI models, 100 datasets with 1,000 observations each were randomly generated for each of 15 different scenarios corresponding to different dispersion parameters and sample means. The 15 scenarios include simulated datasets with the following dispersion parameter: $\alpha$ = 0.25, 0.5, 0.75, 0.95 and 1.5. For each dispersion parameter, three different sample means were considered: high mean (HM) ($\beta_0 = 2.4$, $\beta_1 = 0.15$, $\beta_2 = 0.15$ and resulting sample mean is approximately 11.0); moderate mean (MM) ($\beta_0 = 1.7$, $\beta_1 = 0.15$, $\beta_2 = 0.15$ and resulting sample mean is approximately 5.5); and low mean (LM) ($\beta_0 = 0.2$, $\beta_1 = 0.15$, $\beta_2 = 0.15$ and resulting sample mean is approximately 1.2). For each scenario, the parameter estimates from NB and SI models were compared to the known parameter values that had been used to generate crash datasets. The following paragraph summarizes the simulation procedure for experiment one. The simulation setting considered in this study was first proposed by Francis et al. (2012).

The values of the dispersion parameter were selected according to the findings in Table 1.

In experiment one, we generated the NB data using the following steps:

(1) Simulate a value for the covariates $X_1$ and $X_2$ from a uniform distribution on [0, 1], respectively.

(2) Generate a mean $\mu_i$ for observation $i$ according to the following equation with known regression parameters:

$$\mu_i = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)$$

(3) Generate a discrete count $Y_i$ given that the mean for observation $i$ is gamma distributed with the dispersion parameter $\alpha$ and mean equal to 1:

$$Y_i \sim \text{Poisson}(\lambda_i)$$

$$\lambda_i = \mu_i \exp(\varepsilon_i)$$

$$\exp(\varepsilon_i) \sim \text{gamma}(1, \alpha)$$

(4) Repeat steps (1) to (3) 1,000 times.
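The steps above can be sketched in Python with the standard library only. This is an illustrative sketch, not the code used in the study: the function names are hypothetical, the Poisson draw is implemented by counting unit-rate exponential arrivals, and the gamma multiplier uses shape 1/alpha and scale alpha so that its mean is 1 and its variance is alpha. The negative sign on the second slope coefficient is an assumption: it is used here because it reproduces the stated sample means (about 1.2, 5.5 and 11.0), whereas two coefficients of +0.15 would give means roughly 16% larger.

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson(lam) draw: count unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_nb_dataset(beta, alpha, n=1000, seed=None):
    """Steps (1)-(4): returns a list of (x1, x2, y) tuples."""
    rng = random.Random(seed)
    b0, b1, b2 = beta
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()            # step (1)
        mu = math.exp(b0 + b1 * x1 + b2 * x2)          # step (2)
        # step (3): gamma multiplier with mean 1 and variance alpha
        lam = mu * rng.gammavariate(1.0 / alpha, alpha)
        rows.append((x1, x2, poisson_draw(lam, rng)))
    return rows

# Moderate-mean scenario; the sign of the last coefficient is an assumption.
data = simulate_nb_dataset(beta=(1.7, 0.15, -0.15), alpha=0.5, n=1000, seed=1)
counts = [y for _, _, y in data]
print(sum(counts) / len(counts), max(counts))
```

Calling the function 100 times with different seeds would give one scenario's worth of replicate datasets, each of which can then be passed to an NB or SI fitting routine.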

3.2. Experiment two

Since the dispersion term $h(\sigma, \nu) = 2\sigma(\nu+1)/c + 1/c^2 - 1$ of the SI model has two different parameters, there exist many possible values for parameters $\sigma$ and $\nu$ in assigning a certain value to the dispersion term. In order to adequately generate crash datasets that are similar to empirical crash data, observed crash data are used to help select the possible values for parameters $\sigma$ and $\nu$. Specifically, SI models were applied to five crash datasets with different crash means and variances, which might reasonably reflect different traffic sites with different safety performance. The estimated scale parameter $\sigma$ and shape parameter $\nu$ are provided in Table 2. The reader is referred to the studies listed in Table 2 for details of datasets, considered explanatory variables and functional forms. Shape and scale parameters are assumed to be fixed throughout this paper.


Table 2. Estimated values of scale and shape parameters for crash datasets with different characteristics.

| Datasets | Crashes: Min | Max | Mean | SD* | Observations | Scale parameter $\sigma$ | Shape parameter $\nu$ |
|---|---|---|---|---|---|---|---|
| Michigan data (Geedipally et al., 2012) | 0 | 61 | 0.68 | 1.77 | 33,970 | 1.47 | -2.393 |
| Texas data (Zou et al., 2013a) | 0 | 97 | 2.84 | 5.69 | 1499 | 1.62 | -3.51 |
| Washington data (Lord et al., 2008a) | 0 | 62 | 4.80 | 6.10 | 476 | 0.90 | -4.807 |
| California data (Lord et al., 2008a) | 0 | 217 | 10.90 | 22.10 | 356 | 0.74 | -1.051 |
| Indiana data (Cheng et al., 2013) | 0 | 329 | 17.56 | 36.64 | 338 | 2.96E+14 | -1.92 |

* SD = Standard Deviation

The shape parameter $\nu$ takes negative values, between -5 and -1, in all five datasets. The scale parameter $\sigma$ is defined to be positive, and most datasets report small values for $\sigma$, with the exception of the Indiana data. The modeling results for the Indiana data indicate that the SI model is converging to the Poisson inverse Gaussian model. To make the results gained from this experiment applicable to empirical data, the values for parameters $\sigma$ and $\nu$ were selected to represent the underlying characteristics of true crash count distributions. The parameters that were assigned in simulating the crash data are given in Table 3.

Table 3. Assigned parameters for scale and shape parameter in five scenarios.

| Scenario | Scale parameter $\sigma$ | Shape parameter $\nu$ | Dispersion term* |
|---|---|---|---|
| 1 | 0.5 | -5.3 | 0.25 |
| 2 | 1 | -3.55 | 0.5 |
| 3 | 1 | -2.5 | 0.75 |
| 4 | 1.5 | -2.52 | 0.95 |
| 5 | 2.5 | -2.2 | 1.55 |

* Dispersion term = $h(\sigma, \nu) = 2\sigma(\nu+1)/c + 1/c^2 - 1$
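To see how a pair of scale and shape values maps to a dispersion term, the following Python sketch evaluates the modified Bessel function numerically and then applies the dispersion-term formula. It is a sketch under stated assumptions: the function names are hypothetical, the Bessel function is computed with a plain trapezoid rule after the substitution x = exp(u) in its integral definition, and the integration range is chosen for moderate scale values such as those in Table 3 (it would not be adequate for an extremely large scale value).

```python
import math

def bessel_k(order, t, step=0.01, upper=20.0):
    """K_order(t) by the trapezoid rule on the equivalent integral
    int_0^inf exp(-t*cosh(u)) * cosh(order*u) du   (from x = exp(u))."""
    lam = abs(order)                       # K is symmetric in its order
    total = 0.5 * math.exp(-t)             # half weight at u = 0
    for k in range(1, int(upper / step) + 1):
        u = k * step
        decay = t * math.cosh(u)
        total += 0.5 * (math.exp(lam * u - decay) + math.exp(-lam * u - decay))
    return total * step

def dispersion_term(sigma, nu):
    """h(sigma, nu) = 2*sigma*(nu + 1)/c + 1/c**2 - 1, with c a Bessel ratio."""
    c = bessel_k(nu + 1.0, 1.0 / sigma) / bessel_k(nu, 1.0 / sigma)
    return 2.0 * sigma * (nu + 1.0) / c + 1.0 / c ** 2 - 1.0

# Scale and shape values from the five scenarios
for sigma, nu in [(0.5, -5.3), (1.0, -3.55), (1.0, -2.5), (1.5, -2.52), (2.5, -2.2)]:
    print(sigma, nu, round(dispersion_term(sigma, nu), 3))
```

For scenario 3 the Bessel functions have half-integer order, so c = 2/7 in closed form and the dispersion term equals 0.75 exactly, which gives a convenient check on the numerical routine.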

In order to investigate the accuracy of parameter estimates from the NB models for SI data, 100 datasets with 1,000 observations each were randomly generated for each of 15 different scenarios corresponding to different dispersion terms and sample means. The 15 scenarios include simulated datasets with the following dispersion term: $h(\sigma, \nu) = 2\sigma(\nu+1)/c + 1/c^2 - 1$ = 0.25, 0.5, 0.75, 0.95 and 1.55. For each dispersion term, three different sample means were considered: high mean (HM) ($\beta_0 = 2.4$, $\beta_1 = 0.15$, $\beta_2 = 0.15$ and resulting sample mean is approximately 11.0); moderate mean (MM) ($\beta_0 = 1.7$, $\beta_1 = 0.15$, $\beta_2 = 0.15$ and resulting sample mean is approximately 5.5); and low mean (LM) ($\beta_0 = 0.2$, $\beta_1 = 0.15$, $\beta_2 = 0.15$ and resulting sample mean is approximately 1.2). For each scenario, the parameter estimates from NB and SI models were compared to the known regression parameter values. The simulation procedure for experiment two is summarized as follows:

In experiment two, we generated the SI data using the following steps:

(1) Simulate a value for the covariates $X_1$ and $X_2$ from a uniform distribution on [0, 1], respectively.

(2) Generate a mean $\mu_i$ for observation $i$ according to the following equation with known regression parameters:

$$\mu_i = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)$$

(3) Generate a discrete count $Y_i$ given that the mean for observation $i$ is generalized inverse Gaussian (GIG) distributed with the scale parameter $\sigma$, shape parameter $\nu$ and mean equal to 1:

$$Y_i \sim \text{Poisson}(\lambda_i)$$

$$\lambda_i = \mu_i \exp(\varepsilon_i)$$

$$\exp(\varepsilon_i) \sim \text{GIG}(1, \sigma, \nu)$$

(4) Repeat steps (1) to (3) 1,000 times.
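The SI steps can be sketched in the same way, the only new ingredient being a draw from the mean-one GIG distribution. The sketch below is an illustration, not the study's code. It assumes the GIG density of the multiplier g is proportional to g^(nu-1) * exp(-(c*g + 1/(c*g)) / (2*sigma)), which is the form that yields the variance 2*sigma*(nu+1)/c + 1/c^2 - 1. Instead of an exact sampler, it discretises the density on a fine grid in log space and rescales the grid to mean 1, so no Bessel function is needed. The function names, the grid settings and the negative sign on the second slope coefficient are assumptions.

```python
import math
import random

def gig_grid(sigma, nu, step=0.005, half_width=12.0):
    """Discretised mean-1 GIG multiplier. With w = exp(u), the density of u
    is proportional to exp(nu*u - cosh(u)/sigma); dividing w by its mean
    gives a multiplier with mean 1."""
    n = int(2 * half_width / step) + 1
    us = [-half_width + k * step for k in range(n)]
    weights = [math.exp(nu * u - math.cosh(u) / sigma) for u in us]
    mean_w = sum(p * math.exp(u) for p, u in zip(weights, us)) / sum(weights)
    values = [math.exp(u) / mean_w for u in us]
    return values, weights

def poisson_draw(lam, rng):
    """Poisson(lam) draw: count unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_si_dataset(beta, sigma, nu, n=1000, seed=None):
    """Steps (1)-(4) with a grid-based GIG multiplier; returns (x1, x2, y) tuples."""
    rng = random.Random(seed)
    values, weights = gig_grid(sigma, nu)
    multipliers = rng.choices(values, weights=weights, k=n)   # step (3), GIG part
    b0, b1, b2 = beta
    rows = []
    for g in multipliers:
        x1, x2 = rng.random(), rng.random()                   # step (1)
        mu = math.exp(b0 + b1 * x1 + b2 * x2)                 # step (2)
        rows.append((x1, x2, poisson_draw(mu * g, rng)))      # step (3), count
    return rows

# Scenario 3 of Table 3 at the moderate mean; coefficient sign is an assumption.
data = simulate_si_dataset((1.7, 0.15, -0.15), sigma=1.0, nu=-2.5, n=1000, seed=1)
counts = [y for _, _, y in data]
print(sum(counts) / len(counts), max(counts))
```

The grid approach trades exactness for simplicity: the multiplier takes one of a few thousand closely spaced values, which is adequate for illustrating the long upper tail but is not a substitute for a dedicated GIG sampler.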


The probability density functions of the gamma and generalized inverse Gaussian distributions can be found in Rigby et al. (2008). Note that the gamma is a limiting distribution of the GIG, obtained by letting $\sigma \to \infty$ for $\nu > 0$.

The coefficients of the NB and SI models were estimated using the gamlss package (Rigby and Stasinopoulos, 2013) in the software R. At the end of the experiments, the estimated regression parameters, dispersion parameter and dispersion term were recorded and compared to the known parameter values.

4. Simulation results

4.1 Characteristics of the simulated datasets

Tables 4 and 5 show the characteristics of crash counts for the 100 simulated datasets (each dataset contains 1,000 observations) generated from NB and SI regression models, respectively. For the same simulation setting (i.e., crash mean and dispersion level), the distributions of crash counts simulated from the NB and SI regression models have a similar pattern. Crash counts generated from SI models usually have a longer tail than crash counts generated from NB models. Moreover, as the dispersion parameter/term increases, the difference in the tail behavior (see the 100% quantile column in Tables 4 and 5) becomes more pronounced. The reason for this difference is that NB models can only generate over-dispersed count data with a relatively short tail.


Table 4 Characteristics of crash count for the 100 simulated datasets generated from NB regression models.

Average values of the q-quantiles (a) of crash count ($\beta_0 = 0.2$)

| Dispersion parameter | 0% | 10% | 30% | 50% | mean | 70% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|
| 0.25 | 0 | 0 (b) | 0 | 1 | 1.222 | 2 | 3 | 7.67 |
| 0.5 | 0 | 0 | 0 | 1 | 1.218 | 1.95 | 3 | 9.4 |
| 0.75 | 0 | 0 | 0 | 1 | 1.221 | 1.726 | 3.06 | 10.42 |
| 0.95 | 0 | 0 | 0 | 1 | 1.219 | 1.562 | 3.123 | 11.82 |
| 1.5 | 0 | 0 | 0 | 0.57 | 1.225 | 1.146 | 3.585 | 14.84 |

Average values of the q-quantiles of crash count ($\beta_0 = 1.7$)

| Dispersion parameter | 0% | 10% | 30% | 50% | mean | 70% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|
| 0.25 | 0 | 1.318 | 3.01 (c) | 5 | 5.493 | 7 | 10.255 | 24.03 |
| 0.5 | 0 | 1 | 2.754 | 4.315 | 5.480 | 6.933 | 11.464 | 31.43 |
| 0.75 | 0 | 0.098 | 2 | 4.01 | 5.501 | 6.909 | 12.457 | 38.98 |
| 0.95 | 0 | 0 | 1.914 | 3.82 | 5.461 | 6.78 | 13.095 | 43.11 |
| 1.5 | 0 | 0 | 1 | 2.985 | 5.511 | 6.336 | 14.469 | 57.59 |

Average values of the q-quantiles of crash count ($\beta_0 = 2.4$)

| Dispersion parameter | 0% | 10% | 30% | 50% | mean | 70% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|
| 0.25 | 0 | 3.95 | 7.01 | 9.98 | 11.022 | 13.406 | 19.658 | 43.89 |
| 0.5 | 0 | 2.049 | 5.668 | 9.08 | 11.042 | 13.675 | 22.395 | 59.07 |
| 0.75 | 0 | 1.109 | 4.537 | 8.26 | 11.072 | 13.593 | 24.526 | 72.83 |
| 0.95 | 0 | 0.99 | 3.894 | 7.715 | 11.140 | 13.469 | 26.002 | 85.72 |
| 1.5 | 0 | 0 | 2.2 | 6.12 | 11.075 | 12.735 | 28.533 | 112.8 |

(a) The q-quantile of a set of values divides them so that q% of the values lie below and (100-q)% of the values lie above; (b) Average values of the 100 10% quantiles for the 100 simulated datasets; (c) Average values of the 100 30% quantiles for the 100 simulated datasets.


Table 5 Characteristics of crash count for the 100 simulated datasets generated from SI regression models

Average values of the q-quantiles of crash count ($\beta_0 = 0.2$)

| Dispersion term | 0% | 10% | 30% | 50% | mean | 70% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|
| 0.25 | 0 | 0 | 0.037 | 1 | 1.220 | 1.98 | 3 | 7.54 |
| 0.5 | 0 | 0 | 0 | 1 | 1.227 | 1.886 | 3 | 11.98 |
| 0.75 | 0 | 0 | 0 | 1 | 1.222 | 1.613 | 3 | 14.5 |
| 0.95 | 0 | 0 | 0 | 1 | 1.222 | 1.375 | 3 | 20.31 |
| 1.55 | 0 | 0 | 0 | 1 | 1.223 | 1.043 | 3 | 35.28 |

Average values of the q-quantiles of crash count ($\beta_0 = 1.7$)

| Dispersion term | 0% | 10% | 30% | 50% | mean | 70% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|
| 0.25 | 0 | 2 | 3.404 | 5 | 5.479 | 6.843 | 9.891 | 23.95 |
| 0.5 | 0 | 1.057 | 3 | 4.355 | 5.500 | 6.369 | 10.585 | 45.88 |
| 0.75 | 0 | 1 | 2.874 | 4 | 5.487 | 6.11 | 11.066 | 59.72 |
| 0.95 | 0 | 1 | 2.357 | 4 | 5.490 | 5.99 | 11.134 | 87.59 |
| 1.55 | 0 | 1 | 2 | 3.685 | 5.419 | 5.833 | 11.104 | 128.49 |

Average values of the q-quantiles of crash count ($\beta_0 = 2.4$)

| Dispersion term | 0% | 10% | 30% | 50% | mean | 70% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|
| 0.25 | 0.47 | 4.959 | 7.621 | 10.01 | 11.053 | 13.003 | 18.566 | 45.99 |
| 0.5 | 0.09 | 3.855 | 6.301 | 9.02 | 11.004 | 12.552 | 20.223 | 86.8 |
| 0.75 | 0 | 3.01 | 5.85 | 8.245 | 10.978 | 12.066 | 21.349 | 111.84 |
| 0.95 | 0 | 2.978 | 5.127 | 7.97 | 11.093 | 11.819 | 21.543 | 182.67 |
| 1.55 | 0 | 2.119 | 4.858 | 7.16 | 11.061 | 11.076 | 21.652 | 286.2 |

4.2. Results for experiment one

Table 6 shows the means and standard deviations of estimated values for regression parameters $\beta_k$ (k = 0, 1, 2) at $\alpha$ = 0.25, 0.5, 0.75, 0.95 and 1.5 for three sample means. As indicated in Table 6, the regression parameter estimates of $\beta_k$ from NB models are reliable under all scenarios. Interestingly, the estimation results from SI models are very similar to those from NB models. In fact, the means and standard deviations of estimated values for NB and SI models are almost identical for many scenarios. Note that for the low sample mean scenario, the standard deviation is slightly larger compared with the standard deviations for moderate and high sample mean scenarios, indicating the regression parameter estimates tend to be less stable when the sample mean is low. Overall, for regression parameters $\beta_k$ (k = 0, 1, 2), the parameter estimates from NB and SI models are generally reliable for different dispersion parameters and sample means.

Table 6 also presents the means and standard deviations of the estimated values for the dispersion parameter $\alpha$ under different scenarios. Note that the dispersion term of the SI model is calculated using the equation $h(\sigma, \nu) = 2\sigma(\nu + 1)/c + 1/c^2 - 1$ and is considered as the estimated dispersion parameter for the simulated NB data. It can be observed that the dispersion term of the SI models can adequately estimate the dispersion parameter of the NB models. For each scenario, one notable feature is that the standard deviation is generally larger when the sample mean is low, even with a sample size equal to 1,000. For example, when the true dispersion parameter $\alpha$ = 0.25, the standard deviation of the estimated values from NB models is 0.05 for the low sample mean, 0.02 for the moderate sample mean and 0.01 for the high sample mean. This finding suggests that the dispersion parameter estimated from data characterized by low sample means can be relatively unreliable even if the sample size is sufficient.
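The dispersion term can be evaluated numerically once $c$ is known. The sketch below assumes $c = K_{\nu+1}(1/\sigma)/K_{\nu}(1/\sigma)$, the ratio of modified Bessel functions of the second kind used in the usual Sichel parameterization. Since $c$ is not defined in this section, that choice is an assumption. The Bessel function is approximated with a plain trapezoid rule, and the helper names `bessel_k` and `dispersion_term` are hypothetical.

```python
import math


def bessel_k(order, z, step=1e-3):
    """Approximate K_order(z) for z > 0 and moderate orders.

    Uses K_v(z) = integral from 0 to infinity of exp(-z cosh t) cosh(v t) dt
    with a trapezoid rule; good enough for a sketch, not a library routine.
    """
    total = 0.5 * math.exp(-z)  # half-weighted endpoint at t = 0
    t = step
    while z * math.cosh(t) < 700.0:  # beyond this the integrand is negligible
        total += math.exp(-z * math.cosh(t)) * math.cosh(order * t)
        t += step
    return total * step


def dispersion_term(sigma, nu):
    """h(sigma, nu) = 2*sigma*(nu + 1)/c + 1/c**2 - 1, with c assumed as above."""
    z = 1.0 / sigma
    c = bessel_k(nu + 1.0, z) / bessel_k(nu, z)
    return 2.0 * sigma * (nu + 1.0) / c + 1.0 / c ** 2 - 1.0


# With nu = -0.5 the ratio c equals 1, so h reduces to sigma.
print(dispersion_term(0.7, -0.5))
# Scale parameters reported for subset 3 of the Texas data (Table 11).
print(dispersion_term(0.35, -1.78))
```

Under this assumed definition of $c$, the second call gives a value close to the dispersion term of 0.35 reported for subset 3 in Table 11, which suggests the reconstruction is consistent with the fitted models.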


Table 6. Simulation results under different scenarios in experiment one.

$\beta_0$ = 0.2 (LM) $\beta_0$ = 1.7 (MM) $\beta_0$ = 2.4 (HM)

Estimated values for regression parameter $\beta_0$ under different scenarios

Scenarios NB SI NB SI NB SI

$\alpha$=0.25 0.20^a (0.08)^b 0.21 (0.08) 1.70 (0.05) 1.69 (0.05) 2.40 (0.05) 2.40 (0.05)

$\alpha$=0.5 0.19 (0.10) 0.21 (0.10) 1.70 (0.07) 1.70 (0.07) 2.40 (0.06) 2.40 (0.06)

$\alpha$=0.75 0.18 (0.10) 0.19 (0.11) 1.70 (0.08) 1.70 (0.09) 2.41 (0.07) 2.40 (0.07)

$\alpha$=0.95 0.19 (0.11) 0.19 (0.11) 1.70 (0.08) 1.70 (0.08) 2.40 (0.07) 2.40 (0.07)

$\alpha$=1.5 0.21 (0.12) 0.21 (0.12) 1.70 (0.11) 1.70 (0.11) 2.39 (0.12) 2.39 (0.12)

Estimated values for regression parameter $\beta_1 = 0.15$ under different scenarios

Scenarios NB SI NB SI NB SI

$\alpha$=0.25 0.14 (0.12) 0.13 (0.11) 0.15 (0.07) 0.15 (0.07) 0.16 (0.07) 0.16 (0.07)

$\alpha$=0.5 0.16 (0.12) 0.17 (0.13) 0.13 (0.10) 0.13 (0.10) 0.15 (0.10) 0.15 (0.10)

$\alpha$=0.75 0.17 (0.14) 0.17 (0.14) 0.15 (0.11) 0.15 (0.11) 0.14 (0.10) 0.15 (0.09)

$\alpha$=0.95 0.15 (0.14) 0.15 (0.14) 0.16 (0.12) 0.16 (0.12) 0.15 (0.11) 0.15 (0.11)

$\alpha$=1.5 0.14 (0.14) 0.14 (0.14) 0.15 (0.14) 0.15 (0.14) 0.16 (0.13) 0.16 (0.13)

Estimated values for regression parameter $\beta_2 = -0.15$ under different scenarios

Scenarios NB SI NB SI NB SI

$\alpha$=0.25 -0.15 (0.11) -0.15 (0.11) -0.15 (0.08) -0.15 (0.08) -0.16 (0.05) -0.16 (0.06)

$\alpha$=0.5 -0.15 (0.13) -0.14 (0.13) -0.13 (0.09) -0.13 (0.09) -0.14 (0.08) -0.14 (0.08)

$\alpha$=0.75 -0.15 (0.13) -0.15 (0.14) -0.13 (0.11) -0.13 (0.11) -0.16 (0.10) -0.16 (0.10)

$\alpha$=0.95 -0.15 (0.13) -0.15 (0.13) -0.16 (0.12) -0.15 (0.12) -0.16 (0.10) -0.16 (0.10)

$\alpha$=1.5 -0.18 (0.16) -0.18 (0.16) -0.15 (0.14) -0.15 (0.14) -0.15 (0.15) -0.15 (0.15)

Estimated values for dispersion parameter $\alpha$ under different scenarios

Scenarios NB SI NB SI NB SI

$\alpha$=0.25 0.25 (0.05) 0.26 (0.05) 0.25 (0.02) 0.25 (0.02) 0.25 (0.01) 0.25 (0.01)

$\alpha$=0.5 0.49 (0.07) 0.51 (0.08) 0.50 (0.04) 0.51 (0.04) 0.50 (0.02) 0.50 (0.03)

$\alpha$=0.75 0.76 (0.09) 0.78 (0.10) 0.74 (0.04) 0.76 (0.05) 0.74 (0.04) 0.75 (0.04)

$\alpha$=0.95 0.94 (0.09) 0.97 (0.10) 0.95 (0.05) 0.97 (0.06) 0.95 (0.05) 0.96 (0.06)

$\alpha$=1.5 1.50 (0.12) 1.53 (0.13) 1.49 (0.08) 1.52 (0.10) 1.50 (0.07) 1.53 (0.08)

a, mean; b, standard deviation; LM, the low sample mean scenario; MM, the moderate sample mean scenario; HM, the high sample mean scenario.


In summary, the simulation results for the NB data have shown the following characteristics:

(1) The estimates from both NB and SI models are very close to the true regression parameters under all scenarios;

(2) The dispersion parameter and dispersion term performed very well in estimating the true dispersion parameter under all scenarios;

(3) The dispersion parameter estimated from NB and SI models becomes less stable for the low sample mean, even when the sample size is sufficient.

4.3. Results for experiment two

In this experiment, the crash datasets were generated from the SI models with known regression parameters. The simulated datasets were then used to estimate the regression parameters of NB and SI models, respectively.

Figures 1-3 show the boxplots of the estimated values for regression parameters $\beta_k$ ($k$ = 0, 1, 2) at dispersion term $h(\sigma, \nu)$ = 0.25, 0.5, 0.75, 0.95 and 1.55 for three sample means. The regression parameter estimates from NB models are similar to those from SI models, and the estimated values become slightly unstable when the sample mean is low. One interesting characteristic worth noting is that the parameter estimates from NB models are less reliable than those from SI models when the dispersion term is large (for example, see subfigures (d) and (e) in Figures 1-3).

Figure 1. Boxplots of estimated values for regression parameter $\beta_0$ under different scenarios in experiment two. Panels: (a) dispersion term = 0.25; (b) dispersion term = 0.5; (c) dispersion term = 0.75; (d) dispersion term = 0.95; (e) dispersion term = 1.55. LM, the low sample mean scenario; MM, the moderate sample mean scenario; HM, the high sample mean scenario. True parameter values are indicated by red horizontal lines.

Figure 2. Boxplots of estimated values for regression parameter $\beta_1$ under different scenarios in experiment two. Panels: (a) dispersion term = 0.25; (b) dispersion term = 0.5; (c) dispersion term = 0.75; (d) dispersion term = 0.95; (e) dispersion term = 1.55. LM, MM, HM and the red horizontal lines have the same meaning as those in Figure 1.

Figure 3. Boxplots of estimated values for regression parameter $\beta_2$ under different scenarios in experiment two. Panels: (a) dispersion term = 0.25; (b) dispersion term = 0.5; (c) dispersion term = 0.75; (d) dispersion term = 0.95; (e) dispersion term = 1.55. LM, MM, HM and the red horizontal lines have the same meaning as those in Figure 1.

Figure 4 presents the boxplots of the estimated values for the dispersion term under different scenarios. The dispersion parameter of NB models is considered as the estimated dispersion term for the simulated SI data. The subfigures fall into three general types. In subfigure 4 (a), the estimates from NB and SI models both under-estimate the dispersion term, especially when the sample mean is moderate or high, and the SI models provide slightly larger estimates than the NB models. In subfigures 4 (b)-(d), the estimates of the dispersion term from SI models are generally appropriate regardless of the sample mean, while the estimated values from NB models are usually lower than the true dispersion term. In subfigure 4 (e), although the estimates from the SI models seem to be adequate, the distribution of the estimated values is right skewed, which increases the mean of the estimated values of the dispersion term. As expected, the NB estimator is seriously biased under this scenario.

Figure 4. Boxplots of estimated values for dispersion term under different scenarios in experiment two. Panels: (a) dispersion term = 0.25; (b) dispersion term = 0.5; (c) dispersion term = 0.75; (d) dispersion term = 0.95; (e) dispersion term = 1.55. LM, MM, HM and the red horizontal lines have the same meaning as those in Figure 1.

In summary, the simulation results for the SI data have shown the following characteristics:

(1) The estimates from NB and SI models are adequate for regression parameters under all scenarios;

(2) Under all scenarios, the dispersion parameter of NB models consistently provides significantly biased estimates of the true dispersion term, especially when the simulated crash count has a very long tail;

(3) SI models performed well in estimating the dispersion term, except for dispersion term = 0.25.

4.4. Estimation bias

The bias of an estimator is defined as the difference between the estimator's expected value and the true value of the parameter being estimated (Francis et al., 2012). The estimation bias of the dispersion parameter is calculated as follows:

$\text{Bias} = E(\hat{\alpha}) - \alpha$ (6)

where $\alpha$ is the true value of the dispersion parameter and $\hat{\alpha}$ is the estimator.

The estimation bias of the dispersion term $h(\sigma, \nu) = 2\sigma(\nu + 1)/c + 1/c^2 - 1$ is calculated as follows:

$\text{Bias} = E(\hat{h}(\sigma, \nu)) - h(\sigma, \nu)$ (7)

where $h(\sigma, \nu)$ is the true value of the dispersion term and $\hat{h}(\sigma, \nu)$ is the estimator.

The bias of the dispersion parameter $\alpha$ and dispersion term $h(\sigma, \nu)$ under each scenario is calculated as the difference between their average estimates from the 100 replications and the true parameter values assigned in each scenario.
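As a small worked illustration of this calculation, the sketch below averages a set of replicated estimates and subtracts the true value. The function name and the three replicated values are hypothetical. Only the final check uses a figure from the paper: the NB mean estimate of 0.76 reported in Table 6 for the low sample mean scenario with a true dispersion parameter of 0.75.

```python
from statistics import mean


def estimation_bias(estimates, true_value):
    """Average of the replicated estimates minus the true parameter value."""
    return mean(estimates) - true_value


# Hypothetical estimates from three replications with a true value of 0.75.
replicated = [0.74, 0.78, 0.76]
print(round(estimation_bias(replicated, 0.75), 2))

# Using the average estimate reported in Table 6 (NB, LM, true value 0.75).
print(round(estimation_bias([0.76], 0.75), 2))
```

The second call reproduces the bias of 0.01 listed for that scenario in Table 7.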

Tables 7 and 8 provide the estimation bias for the dispersion parameter and dispersion term, respectively. For experiment one, the estimation bias is negligible for all scenarios, which means the dispersion term can be used as a robust estimate of the dispersion parameter when the data are generated from NB models. For experiment two, the magnitude of the estimation bias for NB models is consistently larger than that for SI models, especially when the dispersion term = 0.75, 0.95 and 1.55. Moreover, for the same dispersion term, the estimation results from NB models deteriorate as the sample mean increases. On the other hand, for SI models, the parameter estimation results are usually acceptable except for the last scenario (dispersion term = 1.55). As shown in Figure 4 (e), the estimated values from SI models are unstable and a few large outliers can be found at the upper end of the boxplot. These outliers significantly increase the average estimates and thus result in unsatisfactory estimation bias.


Table 7. Estimation bias for the dispersion parameter in experiment one

Dispersion parameter Scenario Estimation bias (NB) Estimation bias (SI)

0.25 HM 0.00 0.00

0.25 MM 0.00 0.00

0.25 LM 0.00 0.01

0.5 HM 0.00 0.00

0.5 MM 0.00 0.01

0.5 LM -0.01 0.01

0.75 HM -0.01 0.00

0.75 MM -0.01 0.01

0.75 LM 0.01 0.03

0.95 HM 0.00 0.01

0.95 MM 0.00 0.02

0.95 LM -0.01 0.02

1.5 HM 0.00 0.03

1.5 MM -0.01 0.02

1.5 LM 0.00 0.03

Table 8. Estimation bias for the dispersion term in experiment two

Dispersion term Scenario Estimation bias (NB) Estimation bias (SI)

0.25 HM -0.09 -0.07

0.25 MM -0.08 -0.07

0.25 LM -0.08 -0.07

0.5 HM -0.16 -0.02

0.5 MM -0.14 -0.03

0.5 LM -0.09 -0.02

0.75 HM -0.29 0.00

0.75 MM -0.26 -0.03

0.75 LM -0.16 0.01

0.95 HM -0.40 0.17

0.95 MM -0.36 0.16

0.95 LM -0.23 0.11

1.55 HM -0.86 0.52

1.55 MM -0.80 0.31

1.55 LM -0.55 0.53


5. Observed data

One observed crash dataset was used to examine whether the dispersion term can be used as an alternative to the dispersion parameter. This dataset contains crash data collected on 1,499 four-lane undivided rural segments in Texas over a five-year period from 1997 to 2001. The data were collected as part of the NCHRP 17-29 research project (Lord et al., 2008a) and have been used extensively in previous studies. The mean and variance of the crash data are equal to 2.84 and 32.4, respectively. Note that the crash count has a long tail (the maximum number of crashes is 97). Table 9 provides the summary statistics for the Texas data.

Table 9. Summary statistics of characteristics for individual road segments in the Texas data

Variable Min Max Mean (SD†) Sum

Number of crashes (5 years) 0 97 2.84 (5.69) 4253

Average daily traffic (ADT) over the 5 years (F) 42 24800 6613.61 (4010.01) -

Lane Width (LW) 9.75 16.5 12.57 (1.59) -

Total Shoulder Width (SW) 0 40 9.96 (8.02) -

Curve Density (CD) 0 18.07 1.43 (2.35) -

Segment Length (L) (miles) 0.1 6.28 0.55 (0.67) 830.49

† SD = Standard Deviation.

Three subsets of 1,000 observations were randomly sampled from the whole dataset. For the entire dataset and each subset, the NB and SI models were fitted and the dispersion parameter and dispersion term were calculated, respectively. The mean functional form is adopted as follows:

$\mu_i = \beta_0 L_i F_i^{\beta_1} e^{\beta_2 LW_i + \beta_3 SW_i + \beta_4 CD_i}$ (8)

where $\mu_i$ is the estimated number of crashes at segment $i$; $L_i$ is the segment length in miles for segment $i$; $F_i$ is the flow (ADT over five years) traveling on segment $i$; $LW_i$ is the lane width for segment $i$; $SW_i$ is the total shoulder width in feet for segment $i$; $CD_i$ is the curve density (curves per mile) for segment $i$; and $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)'$ are the estimated coefficients.
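To make the functional form concrete, the sketch below evaluates Eq. (8) for a single segment. It is illustrative only. The function name is hypothetical, the coefficient values are the rounded full-dataset NB estimates from Table 10, and the inputs are the variable means from Table 9, so the result is a rough plug-in value rather than a reported prediction.

```python
import math


def predicted_crashes(length, adt, lane_width, shoulder_width, curve_density,
                      ln_beta0, beta1, beta2, beta3, beta4):
    """Mean number of crashes under Eq. (8): beta0 * L * F**beta1 * exp(...)."""
    linear = beta2 * lane_width + beta3 * shoulder_width + beta4 * curve_density
    return math.exp(ln_beta0) * length * adt ** beta1 * math.exp(linear)


# Rounded full-dataset NB coefficients (Table 10), used here as an assumption.
coefficients = dict(ln_beta0=-7.95, beta1=0.97, beta2=-0.05, beta3=-0.01, beta4=0.07)

# A segment with the average characteristics listed in Table 9.
mu = predicted_crashes(length=0.55, adt=6613.61, lane_width=12.57,
                       shoulder_width=9.96, curve_density=1.43, **coefficients)
print(mu)
```

Segment length enters with an exponent of one, so it acts as an exposure term: doubling the length doubles the predicted number of crashes while the other effects stay multiplicative.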

The modeling results for NB and SI models are provided in Tables 10 and 11, respectively. First, for the Texas data, the goodness-of-fit statistics (log-likelihood, Akaike information criterion (AIC) and Bayesian information criterion (BIC)) indicate that the entire crash dataset and subsets 1 and 2 can be better described by SI models. In addition, for the full dataset and the three subsets, the estimated dispersion parameters are all smaller than the estimated dispersion terms, and the relative difference can be as high as 50% (for example, see the estimated values for subset 2). The modeling results using the real crash data appear to correspond with the outcome of experiment two, in which the dispersion parameter of NB models consistently under-estimated the dispersion level of the simulated data when analyzing over-dispersed count data with a long tail. Since SI models are preferred over NB models in describing the Texas data based on the goodness-of-fit statistics, the estimated dispersion term may better reflect the actual level of dispersion of this crash dataset. Second, the estimated values of the regression coefficients (i.e., $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$) from NB and SI models differ only slightly for all tested datasets. The results support the finding in the two simulation experiments that both models provide similar regression coefficient estimates.
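The goodness-of-fit comparison and the relative difference quoted above can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the text: the log-likelihood rows in Tables 10 and 11 are read as magnitudes of the log-likelihood, and the parameter counts are taken as 6 for the NB model (five regression coefficients plus the dispersion parameter) and 7 for the SI model (five coefficients plus two further parameters). Function names are hypothetical.

```python
import math


def aic(neg_loglik, n_params):
    """Akaike information criterion from the magnitude of the log-likelihood."""
    return 2.0 * neg_loglik + 2.0 * n_params


def bic(neg_loglik, n_params, n_obs):
    """Bayesian information criterion from the magnitude of the log-likelihood."""
    return 2.0 * neg_loglik + n_params * math.log(n_obs)


def relative_difference(dispersion_term, dispersion_parameter):
    """Difference between the SI term and the NB parameter, relative to the latter."""
    return (dispersion_term - dispersion_parameter) / dispersion_parameter


# Full dataset: NB (assumed 6 parameters) versus SI (assumed 7 parameters).
print(aic(2561.39, 6), bic(2561.39, 6, 1499))
print(aic(2550.23, 7), bic(2550.23, 7, 1499))

# Subset 2: dispersion term 0.60 (SI) versus dispersion parameter 0.40 (NB).
print(relative_difference(0.60, 0.40))
```

With these assumptions the computed AIC and BIC agree with the tabulated values to within rounding, and the subset 2 comparison gives the 50% relative difference mentioned in the text.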

Table 10. Modeling results for full dataset and three subsets from NB models

NB estimate Full dataset Subset 1* Subset 2 Subset 3

Value SE Value SE Value SE Value SE

Intercept $\ln(\beta_0)$ -7.95 0.42 -7.73 0.49 -8.33 0.53 -8.08 0.50

Ln(ADT) $\beta_1$ 0.97 0.05 0.97 0.05 1.00 0.06 1.01 0.05

Lane Width $\beta_2$ -0.05 0.02 -0.07 0.02 -0.04 0.02 -0.07 0.02

Total Shoulder Width $\beta_3$ -0.01 0.00 -0.01 0.00 -0.01 0.00 -0.01 0.00

Curve Density $\beta_4$ 0.07 0.01 0.07 0.01 0.06 0.01 0.07 0.01

Dispersion parameter $\alpha$ 0.39 0.08 0.34 0.12 0.40 0.09 0.31 0.12

Observations 1499 1000 1000 1000

Log-likelihood 2561.39 1679.92 1729.10 1637.22

AIC 5134.77 3371.85 3470.20 3286.44

BIC 5166.65 3401.30 3499.65 3315.88

* Maximum number of crashes for subsets 1, 2 and 3 are 97, 97 and 41, respectively.


Table 11. Modeling results for full dataset and three subsets from SI models.

SI estimate Full dataset Subset 1 Subset 2 Subset 3

Value SE Value SE Value SE Value SE

Intercept $\ln(\beta_0)$ -8.00 0.39 -7.87 0.44 -8.26 0.49 -8.14 0.44

Ln(ADT) $\beta_1$ 0.99 0.04 0.98 0.04 1.00 0.05 1.02 0.05

Lane Width $\beta_2$ -0.06 0.02 -0.07 0.02 -0.04 0.02 -0.07 0.02

Total Shoulder Width $\beta_3$ -0.01 0.00 -0.01 0.00 -0.01 0.00 -0.01 0.00

Curve Density $\beta_4$ 0.06 0.01 0.07 0.01 0.06 0.01 0.07 0.01

Scale parameter $\sigma$ 1.62 1.01 0.44 0.18 1.80 1.37 0.35 0.15

Scale parameter $\nu$ -3.52 0.27 -2.35 0.86 -3.49 0.31 -1.78 1.49

Dispersion term $h(\sigma, \nu)$ 0.58 - 0.41 - 0.60 - 0.35 -

Observations 1499 1000 1000 1000

Log-likelihood 2550.23 1675.70 1722.50 1635.46

AIC 5114.47 3365.40 3459.01 3284.93

BIC 5151.65 3399.75 3493.36 3319.29

6. Discussion

The results of this study deserve further discussion. Although different models have been proposed for analyzing over-dispersed data, the NB model is still frequently used by traffic safety researchers. However, the results from the simulation experiments raise a few issues about the application of NB models in analyzing over-dispersed crash data with a long tail. Based on the simulation results in this study, the following conclusions can be made: (1) When the crash data are generated from NB models, the dispersion parameter and dispersion term both perform very well in estimating the true dispersion parameter. (2) If the crash data are generated from SI models (the simulated crash counts have a long tail), then the dispersion parameter of NB models consistently provides biased estimates of the true dispersion term. In summary, when a long tail is present in the crash count, the dispersion parameter of the NB model can be biased, and the dispersion term of the SI model is more likely to reveal the true level of dispersion in the over-dispersed crash data. In addition to the SI model, other models (e.g., Poisson-lognormal, finite mixture model, quantile regression, etc.) can also be used to analyze over-dispersed crash data with a long tail. However, unlike the NB or SI models, the output of these models cannot be directly used in calculating EB estimates.
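To illustrate why the accuracy of the dispersion estimate matters for EB calculations, the sketch below uses the standard NB-based EB weighting, in which the weight on the model prediction is $w = 1/(1 + \alpha\mu)$. It is a sensitivity illustration only. Plugging the SI dispersion term into the NB weight is an assumption made for comparison and is not the SI-based EB estimator, and the observed count of 10 is hypothetical.

```python
def eb_estimate(predicted_mean, observed_count, dispersion):
    """NB-based EB estimate: weighted average of prediction and observation."""
    weight = 1.0 / (1.0 + dispersion * predicted_mean)
    return weight * predicted_mean + (1.0 - weight) * observed_count


predicted_mean = 2.84  # sample mean of the Texas data, used as a stand-in prediction
observed_count = 10    # hypothetical observed crash count at one site

# Dispersion values estimated for the full dataset in Tables 10 and 11.
low = eb_estimate(predicted_mean, observed_count, 0.39)
high = eb_estimate(predicted_mean, observed_count, 0.58)
print(low, high)
```

A smaller dispersion value puts more weight on the model prediction, so an under-estimated dispersion level pulls the EB estimate of a high-count site toward the predicted mean.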

Note that this study constrains the regression parameters to be fixed across observations. If the over-dispersed crash data contain unobserved heterogeneity, a random-parameters model, which allows the regression parameters to vary from observation to observation, should be considered. Thus, in the future, it would be useful to compare the estimation of the dispersion parameter using the random-parameters negative binomial model and the random-parameters Sichel model. It is possible that, if a random-parameters model is used, the difference in estimating the dispersion parameter between NB and SI models could be slight. In that case, the application of the dispersion term of the SI model may become unnecessary.

7. Summary and conclusions

Given the importance of the dispersion parameter in various types of transportation safety studies, the objective of this paper was to investigate whether the dispersion parameter can truly reflect the level of dispersion in over-dispersed crash data with a long tail and whether the dispersion term of the SI model can be used as an alternative. The performance of the dispersion parameter and dispersion term was examined using simulated datasets generated from various NB and SI regression models with fixed regression parameters. Appropriate sample means and dispersion levels were selected to generate over-dispersed datasets that are similar to empirical crash data. It was found that crash counts simulated from SI regression models usually have a longer tail than crash counts generated from NB regression models. Moreover, the simulation results show that the dispersion parameter of NB models consistently under-estimated the dispersion level for the over-dispersed crash data generated from SI regression models, and that the newly introduced dispersion term of SI models can estimate the true level of dispersion with small estimation bias. Overall, considering that the dispersion parameter can be a biased estimator of the level of dispersion in the data, the dispersion term may offer a viable alternative for analyzing over-dispersed crash data with a long tail. For future work, it would be useful to implement the SI-based EB method for identifying crash-prone sites using crash severity data. Some new criteria (Cheng and Washington, 2008) can be considered to evaluate the effectiveness of the SI-based EB and the NB-based EB methods.


References

Aguero-Valverde, J., Jovanis, P.P., 2008. Analysis of road crash frequency with spatial models. Transportation Research Record 2061, 55-63.

Anastasopoulos, P.C., Mannering, F.L., 2009. A note on modeling vehicle accident frequencies with random-parameters count models. Accident Analysis and Prevention 41 (1), 153-159.

Bhat, C.R., Born, K., Sidharthan, R., Bhat, P.C., 2014. A count data model with endogenous covariates: Formulation and application to roadway crash frequency at intersections. Analytic Methods in Accident Research 1, 53-71.

Castro, M., Paleti, R., Bhat, C.R., 2012. A latent variable representation of count data models to accommodate spatial and temporal dependence: Application to predicting crash frequency at intersections. Transportation Research Part B 46 (1), 253-272.

Chang, L.-Y., 2005. Analysis of freeway accident frequencies: Negative binomial regression versus artificial neural network. Safety Science 43 (8), 541-557.

Chen, E., Tarko, A.P., 2014. Modeling safety of highway work zones with random parameters and random effects models. Analytic Methods in Accident Research 1, 86-95.

Cheng, L., Geedipally, S.R., Lord, D., 2013. The Poisson-Weibull generalized linear model for analyzing motor vehicle crash data. Safety Science 54, 38-42.

Cheng, W., Washington, S., 2008. New criteria for evaluating methods of identifying hot spots. Transportation Research Record 2083, 76-85.

Cheng, W., Washington, S.P., 2005. Experimental evaluation of hotspot identification methods. Accident Analysis and Prevention 37 (5), 870-881.

Chin, H.C., Quddus, M.A., 2003. Modeling count data with excess zeroes: An empirical application to traffic accidents. Sociological Methods and Research 32 (1), 90-116.

El-Basyouny, K., Sayed, T., 2006. Comparison of two negative binomial regression techniques in developing accident prediction models. Transportation Research Record 1950, 9-16.

Francis, R.A., Geedipally, S.R., Guikema, S.D., Dhavala, S.S., Lord, D., Larocca, S., 2012. Characterizing the performance of the Conway-Maxwell Poisson generalized linear model. Risk Analysis 32 (1), 167-183.

Geedipally, S.R., Lord, D., Dhavala, S.S., 2012. The negative binomial-Lindley generalized linear model: Characteristics and application using crash data. Accident Analysis and Prevention 45, 258-265.

Guo, J., Trivedi, P., 2002. Flexible parametric models for long-tailed patent count distributions. Oxford Bulletin of Economics and Statistics 64, 63-82.

Gupta, R.C., Ong, S., 2005. Analysis of long-tailed count data by Poisson mixtures. Communications in Statistics - Theory and Methods 34 (3), 557-573.

Hauer, E., 1997. Observational before-after studies in road safety: Estimating the effect of highway and traffic engineering measures on road safety, Tarrytown, N.Y., U.S.A., Pergamon.

Hauer, E., 2001. Overdispersion in modelling accidents on road sections and in empirical Bayes estimation. Accident Analysis and Prevention 33 (6), 799-808.

Hauer, E., Ng, J.C., Lovell, J. 1988. Estimation of safety at signalized intersections (with discussion and closure).

Hilbe, J.M., 2011. Negative binomial regression, Cambridge University Press.

Huang, H., Chin, H.C., Haque, M.M., 2009. Empirical evaluation of alternative approaches in identifying crash hot spots. Transportation Research Record 2103, 32-41.

Kumara, S., Chin, H.C., 2003. Modeling accident occurrence at signalized tee intersections with special emphasis on excess zeros. Traffic Injury Prevention 4 (1), 53-57.

Lord, D., 2006. Modeling motor vehicle crashes using Poisson-gamma models: Examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accident Analysis and Prevention 38 (4), 751-766.

Lord, D., Geedipally, S.R., Persaud, B.N., Washington, S.P., Van Schalkwyk, I., Ivan, J.N., Lyon, C., Jonsson, T., 2008a. Methodology to predict the safety performance of rural multilane highways. NCHRP Project 17-29, Texas Transportation Institute, TX, U.S.

Lord, D., Guikema, S.D., Geedipally, S.R., 2008b. Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes. Accident Analysis and Prevention 40 (3), 1123-1134.

Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives. Transportation Research Part A 44 (5), 291-305.

Lord, D., Miranda-Moreno, L.F., 2008. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: A Bayesian perspective. Safety Science 46 (5), 751-770.

Lord, D., Washington, S.P., Ivan, J.N., 2005. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: Balancing statistical fit and theory. Accident Analysis and Prevention 37 (1), 35-46.

Maher, M.J., Summersgill, I., 1996. A comprehensive methodology for the fitting of predictive accident models. Accident Analysis and Prevention 28 (3), 281-296.

Malyshkina, N.V., Mannering, F.L., Tarko, A.P., 2009. Markov switching negative binomial models: An application to vehicle accident frequencies. Accident Analysis and Prevention 41 (2), 217-226.

Mannering, F.L., Bhat, C.R., 2014. Analytic methods in accident research: Methodological frontier and future directions. Analytic Methods in Accident Research 1, 1-22.

Miaou, S.-P., 1994. The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis and Prevention 26 (4), 471-482.

Miaou, S.-P., 1996. Measuring the goodness-of-fit of accident prediction models. FHWA-RD-96-040, Virginia, U.S.

Miaou, S.-P., Lord, D., 2003. Modeling traffic crash flow relationships for intersections - dispersion parameter, functional form, and Bayes versus empirical Bayes methods. Transportation Research Record 1840, 31-40.

Park, B.-J., Lord, D., 2009. Application of finite mixture models for vehicle crash data analysis. Accident Analysis and Prevention 41 (4), 683-691.

Park, B.-J., Lord, D., Hart, J.D., 2010. Bias properties of Bayesian statistics in finite mixture of negative binomial regression models in crash data analysis. Accident Analysis and Prevention 42 (2), 741-749.

Poch, M., Mannering, F., 1996. Negative binomial analysis of intersection-accident frequencies. Journal of Transportation Engineering 122 (2), 105-113.

Qin, X., Reyes, P.E., 2011. Conditional quantile analysis for crash count data. Journal of Transportation Engineering 137 (9), 601-607.

Rigby, B., Stasinopoulos, M., 2013. A flexible regression approach using GAMLSS in R <http://www.gamlss.org/wp-content/uploads/2013/01/Lancaster-booklet.pdf> (accessed July 28, 2013).

Rigby, R., Stasinopoulos, D., Akantziliotou, C., 2008. A framework for modelling overdispersed count data, including the Poisson-shifted generalized inverse Gaussian distribution. Computational Statistics and Data Analysis 53 (2), 381-393.

Shankar, V., Milton, J., Mannering, F., 1997. Modeling accident frequencies as zero-altered probability processes: An empirical inquiry. Accident Analysis and Prevention 29 (6), 829-837.

Stein, G.Z., Zucchini, W., Juritz, J.M., 1987. Parameter estimation for the Sichel distribution and its multivariate extension. Journal of the American Statistical Association 82 (399), 938-944.

Wood, G., 2005. Confidence and prediction intervals for generalised linear accident models. Accident Analysis and Prevention 37 (2), 267-273.

Wu, L., Zou, Y., Lord, D., 2014. Comparison of Sichel and negative binomial models in hotspot identification. Transportation Research Record, forthcoming.

Zou, Y., Lord, D., Zhang, Y., Peng, Y., 2013a. Comparison of Sichel and negative binomial models in estimating empirical Bayes estimates. Transportation Research Record 2392, 11-21.

Zou, Y., Zhang, Y., Lord, D., 2013b. Application of finite mixture of negative binomial regression models with varying weight parameters for vehicle crash data analysis. Accident Analysis and Prevention 50, 1042-1051.

Zou, Y., Zhang, Y., Lord, D., 2014. Analyzing different functional forms of the varying weight parameter for finite mixture of negative binomial regression models. Analytic Methods in Accident Research 1, 39-52.