validation of cmfs derived from cross sectional studies using … · 2014-12-16 · 7 which...

20
Paper No.: 15-1467 Validation of CMFs Derived from Cross-Sectional Studies Using Regression Models By Lingtao Wu* Ph.D. Candidate, Zachry Department of Civil Engineering Texas A&M University, 3136 TAMU College Station, Texas 77843-3136 Phone: 979-587-3518, fax: 979-845-6481 Email: [email protected] Dominique Lord, Ph.D. Associate Professor and Zachry Development Professor I Zachry Department of Civil Engineering Texas A&M University, 3136 TAMU College Station, Texas 77843-3136 Phone: 979-458-3949, fax: 979-845-6481 Email: [email protected] Yajie Zou, Ph.D. Research Associate Department of Civil and Environmental Engineering 133B More Hall, Box 352700 University of Washington, Seattle, Washington 98195-2700 Phone: 936-245-5628, fax: 206-543-5965 Email: [email protected] Word count: 6,934 Words (5,434 Text + 6 Tables) + 34 References November, 2014 *Corresponding author

Upload: others

Post on 03-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Paper No.: 15-1467

Validation of CMFs Derived from Cross-Sectional Studies

Using Regression Models

By

Lingtao Wu*

Ph.D. Candidate, Zachry Department of Civil Engineering

Texas A&M University, 3136 TAMU

College Station, Texas 77843-3136

Phone: 979-587-3518, fax: 979-845-6481

Email: [email protected]

Dominique Lord, Ph.D.

Associate Professor and Zachry Development Professor I

Zachry Department of Civil Engineering

Texas A&M University, 3136 TAMU

College Station, Texas 77843-3136

Phone: 979-458-3949, fax: 979-845-6481

Email: [email protected]

Yajie Zou, Ph.D.

Research Associate

Department of Civil and Environmental Engineering

133B More Hall, Box 352700

University of Washington, Seattle, Washington 98195-2700

Phone: 936-245-5628, fax: 206-543-5965

Email: [email protected]

Word count: 6,934 Words (5,434 Text + 6 Tables) + 34 References

November, 2014

*Corresponding author

Page 2: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 1

ABSTRACT 1

Crash modification factors (CMFs) can be used to capture the safety effects of countermeasures 2

and play significant roles in traffic safety management. As an alternative to the before-after 3

study, the regression model method has been widely used for estimating CMFs. Although before-4

after studies are considered to be superior, the use of regression models for estimating CMFs has 5

never been fully investigated. This paper consequently sought to examine the conditions in 6

which regression models could be used for such purpose. CMFs for three variables, lane width, 7

curve density and pavement friction, were assumed and used for generating random crash counts. 8

Then, CMFs were derived from regression models using the simulated crash data for three 9

different scenarios. The results were then compared with the assumed true value. The study 10

results showed that (1) the CMFs derived from the regression models should be unbiased when 11

all factors affecting traffic safety are identical in all segments, except those of interest; (2) if 12

some factors having minor safety effects are omitted from the models, the accuracy of estimated 13

CMFs can still be acceptable; (3) if some factors already known to have significant effects on 14

crash risk are omitted, the CMFs derived from the regression models are generally unreliable. 15

Thus, depending on the missing variables that are not included in the model, the transportation 16

safety analyst can decide if the CMFs developed from the regression models should be used for 17

highway safety applications. 18

Page 3: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 2

INTRODUCTION 1

A crash modification factor (CMF) is a multiplicative factor that can be used to capture changes 2

in the expected number of crashes when a given countermeasure or a modification in geometric 3

and operational characteristics of a specific site is implemented (1; 2). CMFs play significant 4

roles in roadway safety management, including in safety effect evaluation, crash prediction, 5

hotspot identification, countermeasure selection, and the evaluation of design exemptions. 6

Several methods have been proposed for developing CMFs, such as before-after (e.g., naïve or 7

simple before-after, before-after with comparison group and empirical Bayes (EB) before-after), 8

cross-sectional (e.g., regression models and case-control), expert panel studies etc. (2). Among 9

these methods, before-after and cross-sectional studies are the most popular approaches (3). The 10

CMFs derived from before-after studies are based on the comparison of safety performance 11

before and after the implementation of one or several treatments (or changes in the 12

characteristics of the site(s)). Those derived from cross-sectional studies are based on the 13

comparison in the safety performance of sites that have a specific feature with those that do not 14

or are analyzed simultaneously based on datasets that contain a mixture of sites with different 15

characteristics. 16

Over the last 15 years or so, the before-after study has been considered to be the best 17

approach for developing CMFs (2; 4). The CMFs derived from before-after studies are usually 18

believed to be more reliable than those produced from cross-sectional studies because it can 19

directly account for changes that occurred at the sites investigated (5). However, although the 20

before-after analysis is considered superior, high-quality CMFs derived from this approach is 21

dependent on the availability of data (e.g., data availability for the before period, etc.) and the 22

sample size (e.g., number of sites where the treatment of interest has been implemented, etc.). 23

Furthermore, the estimated CMF can be biased if the regression-to-the-mean (RTM) and site 24

selection effects are not properly accounted for in the before-after study (5-7). Lord and Kuo (7) 25

even noted that the EB method can still be plagued by significant biases if the data collected for 26

the treatment and control groups do not share the exact same characteristics. 27

Given the limitations described above, researchers have proposed that cross-sectional 28

studies could be used for developing CMFs (8-10). Although different types of cross-sectional 29

studies have been proposed over the years, the regression model remains the method of choice 30

for estimating CMFs, as reflected by the large number of CMFs documented in the Highway 31

Safety Manual (HSM) (11) and FHWA CMF Clearinghouse (1). 32

Even though regression models are popular for developing CMFs, some researchers have 33

criticized their use for such purpose because they may not properly capture the relationship 34

between crashes and the variables influencing safety (12). In this context, so far, nobody has 35

examined the statistical performance of CMFs that are developed from regression models. Thus, 36

the primary objective of this study is to comprehensively investigate the robustness and accuracy 37

of the CMFs derived from regression models. A secondary objective is to describe the conditions 38

when the CMFs developed from regression models become unreliable and potentially biased. 39

The study objectives were accomplished using simulated data for three different scenarios. 40

Page 4: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 3

BACKGROUND 1

This section briefly describes the different methods that have been proposed for estimating 2

CMFs. The description focuses on their advantages and limitations. 3

The CMF for a countermeasure derived from a before-after study is estimated by the 4

change in the number of crashes occurring in a period before the improvement and the number 5

occurring after the improvement (2; 3). Four types of before-after studies have been proposed to 6

estimate CMFs: naïve before-after, before-after with comparison group, EB before-after and full 7

Bayes (FB) before-after studies. The naïve before-after study simply assumes the crash 8

performance before improvement is a good estimate of what would be in the after period if the 9

countermeasure had not been implemented (3; 5). This approach is considered to be less reliable, 10

because it does not account for changes unrelated to the countermeasure. The before-after study 11

with comparison group and EB before-after methods were then proposed to overcome this issue. 12

Readers are referred to (2) for more details of these CMF developing methods. 13

Even though multiple before-after studies have been developed and widely used to 14

estimate CMFs, the comparison results from before-after studies may be inaccurate if the 15

following issues are not properly accounted for: 16

Sample size: There might be inadequate samples of sites where the countermeasures of 17

interest have been implemented. This will lead to statistical uncertainty (4). 18

RTM effect: This bias is related to the level of correlation for sites that are evaluated 19

during different time periods. Sites that have large (or very small) values in one time 20

period (say before) are expected to regress towards the mean in the subsequent period (5; 21

13). 22

Site selection bias: This is related to the RTM, but its effects are different in that the sites 23

are selected based on a known or unknown entry criteria (e.g., 5 crashes per year). These 24

entry criteria lead to a truncated distribution, which influences the before-after estimate 25

(7). 26

Mixed safety effects: This bias or issue is related to when more than two or more 27

countermeasures are simultaneously implemented at a roadway site, and there can be 28

changes in traffic volume, weather, etc. after the implementation of treatments. This 29

makes it difficult to evaluate the safety effect of a single countermeasure (2; 4). 30

In contrast to before-after studies, cross-sectional studies compare the safety performance 31

of a site or group of sites with the treatment of interest to similar sites without the treatment in a 32

single point in time (2). The cross-sectional studies for developing CMFs can be regrouped into 33

three categories: regression, case control and cohort methods. The regression method is currently 34

the most frequently used approach because of its simplicity. It is usually accomplished through 35

multiple variable regression models or safety performance functions (SPFs). The SPFs can be 36

used to quantify the effect of a specific variable on the crash occurrence and CMFs are then 37

derived from the model coefficients (2; 10). 38

Many models have been proposed to predict safety performance and hence to develop 39

CMFs or crash modification functions (CM-Functions) (14). Although recent studies have 40

introduced some new models for transportation safety analysis (15-18), the Generalized Linear 41

Model (GLM) with a negative binomial (NB) error structure is still the most popular method for 42

Page 5: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 4

modeling traffic crashes. Although models have been extensively used in traffic safety studies, 1

there are still some limitations with this approach: 2

Similarity in crash risk: A primary premise of a cross-sectional study is that all locations 3

are similar to each other in all other factors affecting crash risk (2). However, this 4

assumption seems to be unattainable in practice. 5

Omitted variables: A variety of variables can influence crash risk, but not all of them are 6

measurable or can be captured in practice for model inclusion. It is common that some 7

SPFs were developed with limited variables, for example, using the traffic volume as the 8

only variable in the model. This can lead to biased parameter estimate and incorrect 9

CMFs (14). 10

Functional form: Functional form establishes the relationship between expected crashes 11

and explanatory variables and is a critical part of the modeling process. So far, various 12

forms have been used to link crashes to explanatory variables. But the modeling results 13

tend to be inconsistent when using different functional forms (19). Thus, the CMFs 14

derived from regression models can be biased. 15

There are also several other known issues with the crash modeling that will affect the 16

CMFs derived from regression methods. Readers are referred to Lord and Mannering (14) for 17

more details of these issues. 18

Given the substantive issues associated with the before-after study and regression model 19

method, it is not surprising that CMFs produced from these two approaches are not identical (2; 20

20; 21). The same approach and dataset can also generate different CMFs when using different 21

models (16; 22-24). Compared to the regression model, the before-after study has lower within-22

subject variability since it directly accounts for changes that have occurred at the study sites (7). 23

Before-after studies are less proned to confounding factors compared to cross-sectional studies 24

(25). Furthermore, well-designed observational before-after studies provide advantages over 25

other safety countermeasure evaluation methods (4). CMFs derived from regression models are 26

suggested to be compared with those from before-after studies (2). However, no study has 27

investigated whether or not the CMFs derived from regression models really reflect the true 28

safety effects of corresponding treatments. Considering the fact that a large number of CMFs in 29

the HSM (11) and CMF Clearinghouse (1) were derived from regression models, it is necessary 30

to evaluate the accuracy of CMFs estimated from this kind of approach. 31

METHODOLOGY 32

To examine if CMFs derived from regression models, defined below as SPFs, really reflect the 33

true safety effects of roadway features or treatments, the researchers derived CMFs using 34

simulated data. Since the exact safety effect of a feature or treatment is hardly known in practive, 35

this makes it extremely difficult to examine the CMFs when observed crash data are used. By 36

analyzing simulated data, one can compare the CMFs estimated from SPFs with the assumed 37

CMF values. The simulation experiment used in this study is described in the following section. 38

Page 6: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 5

Simulation Protocol 1

The simulation experiment for evaluating the CMFs derived from SPFs was proposed by Hauer 2

(12). First, CMFs for some highway geometric features are assumed. Then, random crash counts 3

are simulated based on the assigned values of CMFs. Finally, the CMFs are estimated from the 4

simulated crash data and are compared with the true CMFs. The simulation procedures are 5

described in detail below: 6

Step 1: Assign Initial Values 7

Assume CMFs for highway geometric features of interest. This study will assume an 8

exponential relationship between a highway geometric feature and its CMF. For example, we can 9

assume that the CMF for lane width is _LW AssumedCMF , that is the expected crash frequency will 10

be multiplied or divided by _LW AssumedCMF if the lane width increases or decreases by one foot. 11

Step 2: Calculate Mean Values 12

Calculate the true crash means for each segment using SPFs and assumed CMFs using 13

Equation 1 (11). 14

, , 1, 2, ,( )true i spf i i i m iN N CMF CMF CMF C (1) 15

Where, 16

,true iN = true crash mean for roadway segment i for a certain time period (i.e., one year). 17

The true crash mean is the theoretical number of crashes that occur on a segment, it is used to 18

generate random crash counts in this study; 19

,spf iN = crash mean for roadway segment i for the base conditions, generated from an 20

SPF; 21

,j iCMF = assumed CMF specific to geometric feature type j of segment i , 22

1,2, ,j m ; 23

m = the total number of variables or geometric features of interest; and 24

C = calibration factor to adjust SPF for local conditions, and is assumed to be 1.0 for all 25

segments in this study. 26

The SPF in this study is adopted from the HSM (11) for rural two-lane highways (the 27

same as the data used in this paper), as shown in Equation 2. 28

6 0.312 4

, 365 10 2.67 10spf i i i i iN AADT L e L AADT (2) 29

Where, 30

iAADT = average annual daily traffic volume (vehicles per day) of segment i ; and 31

iL = length of roadway segment i , (mile). 32

33

34

Page 7: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 6

Step 3: Generate Discrete Counts 1

Generate random counts iY given that the mean for segment i is gamma distributed with 2

the dispersion parameter (the inverse dispersion parameter 1/ ) and mean equal to 1 (26): 3

, exp( )i true i iN (3a) 4

exp( ) ~ (1, )i gamma (3b) 5

~ ( )i iY Poisson (3c) 6

Where, 7

i = Poisson mean for segment i for a certain time period; 8

i = model error independent of all the covariates and exp( )i is assumed to be 9

independent and gamma distributed with mean equal to 1 and dispersion parameter ; 10

and, 11

iY = randomly generated crash counts for segment i for a certain time period. 12

Thus, the simulated crash counts follow Poisson-Gamma or NB distribution with 13

parameters and i . The probability density function (PDF) is given by Equation (4) (26). 14

( )( ; , ) ( ) ( )

( ) !iyi i

i i

i i i

yf y

y

(4) 15

Where, 16

iy = crash count for segment i for a certain time period; 17

i = the crash mean during a period for segment i ; and, 18

= inverse dispersion parameter. 19

Step 4: Estimate CMFs from the Simulated Crash data Using NB Regression Models 20

As has been documented in the background, many models and functional forms have 21

been proposed to predict crashes. For this study, we selected the most commonly used GLM 22

model and functional form, as shown in Equation 5 (23). Note that a different parameter for 23

describing the mean of the site, i , was used for estimating the models (compared to the one 24

used for the simulation, i ). 25

1

0

2

( ) ( )n

i i j j

j

E L AADT exp x

(5) 26

Where, 27

( )iE = the estimated crash mean during a period for segment i ; 28

jx = a series of variables, such as the lane width of segment i ; and, 29

Page 8: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 7

0 1, , , n = coefficients to be estimated. 1

For the goodness-of-fit (GOF) of the models, we used the following three methods: (1) 2

Akaike information criterion (AIC), (2) Mean absolute deviance (MAD), and (3) Mean-squared 3

predictive error (MSPE). For more information about MAD and MSPE, readers are referred to 4

(27). 5

Once the model is fitted and coefficients are estimated using the simulated crash data, the 6

CM-Function for variable j can then be derived as (2; 23): 7

, 0,[ ( )]x j j jCMF exp x x (6) 8

Where, 9

j = estimated coefficient for variable j ; 10

x = value of variable j , such as lane width, curve density; 11

0, jx = base condition defined for variable j , usually 12 ft for lane width; and, 12

,x jCMF = CMF specific to variable j with value of x . 13

This also indicates the CMF derived from the SPF for variable j is ( )j jCMF exp , 14

meaning the expected crash frequency will be multiplied or divided by jCMF if the variable j 15

increases or decreases by one unit. 16

Repeat Steps 2 to 4 100 times, calculate the mean and the standard deviation of the 17

estimated CMF values for each variable. 18

Step 5: Evaluate the CMF Derived from the NB Model 19

Two indexes, estimation bias and error percentage, are used to evaluate the CMF derived 20

from SPFs. They are shown in Equations 7 and 8. The smaller is the error percentage, the more 21

accurate the CMF derived from SPFs will be. 22

_ _=j j Assumed j C SCMF CMF (7) 23

_

100j

j

j Assumed

eCMF

(8) 24

Where, 25

j = estimation bias of CMF for variable j ; 26

je = error percentage of CMF for variable j , (%); 27

_j AssumedCMF = assumed CMF value for variable j ; and 28

_j C SCMF = CMF derived from the SPF for variable j . 29

Page 9: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 8

Simulation Example 1

This section provides an example to illustrate the various steps used for generating crash data 2

and the method of evaluating CMFs. 3

Table 1 below further shows snapshots of the simulated crash counts for N segments 4

considering a single variable of lane width. In this table, the CMF is assumed to be 0.9, which 5

means the increase of one foot in lane width can decrease the predicted number of crashes by 10 6

percent (1.0-0.9), with the baseline condition for a lane width equal to 12 ft. So, the CMF 7

specific to a segment i can be calculated as Equation 9. 8

TABLE 1 Example of Simulated Crash Counts for N Segments ( 2 or 0.5 ) 9

Seg. Length

(mile) AADT

LW a

(ft) CMFs spfN

trueN Yr 1 Yr 2 Yr 3

1 0.113 15360 11 1.111 0.464 0.515 0 0 1

2 0.213 18420 8 1.524 1.048 1.598 1 3 2

3 0.125 4260 9 1.372 0.142 0.195 0 0 1

4 0.161 10600 10 1.235 0.456 0.563 1 0 1

5 0.196 12560 12 1.000 0.658 0.658 1 0 2

... ... ... ... ... ... ... ... ... ...

N 0.234 4580 13 0.900 0.286 0.258 0 1 1

a - LW = Lane Width 10

11

The CMF for lane width is given by: 12

12

, 0.9 i

i

LW

LW iCMF

(9) 13

Where, 14

iLW = lane width of segment i (ft); and 15

,iLW iCMF = specific CMF for lane width of segment i . 16

Thus, the true crash mean of segment i will be calculated as (recall that the calibration 17

factor is assumed to be 1.0 for all segments in this study): 18

, ,,true i LW ispf iN N CMF (10) 19

Then, the exp( )i of each segment is randomly generated based on a Gamma distribution 20

with parameters mean = 1 and dispersion parameter = , which has the value of 2 in Table 1. i 21

of segment i is then calculated by multiplying ,true iN and exp( )i , as shown in Equation (3b). 22

After, a sequence of Poisson counts can be generated based on the mean for each 23

segment. Three years of simulated crash counts are shown in the last three columns in Table 1. 24

The theoretical function form of these crash counts is shown in Equation 11. 25

124, ,, 2.67 10 0.9 iLW

true i i iLW ispf iN N CMF L AADT (11a) 26

Page 10: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 9

Or equivalently, 1

4, 9.45 10 ( 0.105 )true i i i iN L AADT exp LW (11b) 2

The simulated crash data was analyzed using the NB regression model. The mean 3

functional form is provided in Equation 12. 4

1

0 2( ) ( )i i iE L AADT exp LW (12) 5

The coefficients of Equation 12 are estimated using a NB regression model in MASS 6

package (28) within the software R. The GOF measures are calculated using Metrics package 7

(29). The modeling output is shown in Table 2. The p-values indicate the variables are 8

statistically significant at the 1% level in this example. And, the small MAD and MSPE show the 9

modeling result performs well (given the simulated data). 10

TABLE 2 Modeling Output of the Example Data (CMF = 0.9, 0.5 , NB Model) 11

Model Variable Theo. Value a Coef. Value

b SE

c p-Value

Intercept [ 0( )ln ] 4(9.45 10 )ln =-6.96 -6.810 0.340 5.3911E-89

Ln(AADT) ( 1 ) 1.00 0.960 0.036 9.996E-161

Lane Width ( 2 ), ft -0.105 -0.089 0.012 1.5953E-13

AIC 20614.5

MAD 0.214

MSPE 0.244

a – theoretical value; b – estimated coefficient value; c – SE = standard error. 12

13

Based on the fitting result, the CM-Function for lane width derived from this SPF is 14

12

2[ ( 12)] 0.915LW

LWCMF exp LW (13) 15

The value of 12 in Equation 13 reflects the base condition for lane width, which means 16

that the CMF derived from prediction model is equal to 1.0. In this case, with an increment of 17

one foot in lane width, the crash mean is expected to be multiplied by 2 0.915e . 18

The bias between the assumed CMF and that from this SPF is calculated as: 19

= 0.90 0.915 0.015Assumed SPFCMF CMF 20

And the error percentage is: 21

0.015100 100 1.69(%)

0.90Assumed

biase

CMF

22

By repeating the Steps 2 to 4 100 times, 100 CMFs can be estimated. The mean and 23

standard deviation of CMFs and mean of GOF measures can be calculated. The estimation bias 24

and error percentage are then calculated based upon the mean value of derived CMFs. 25

For illustration purposes, this example only considers one variable, lane width. Three 26

scenarios were examined in this study to accommodate more complex situations with different 27

Page 11: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 10

levels of dispersions (i.e., inverse dispersion parameters) and two additional variables, curve 1

density and pavement friction. The simulation settings for these three scenarios are described 2

below. 3

Scenario I: Consider one variable, the lane width, only. Different CMF values are used for lane 4

width and the inverse dispersion parameter equals 0.5, 1.0 and 2.0, respectively. 5

Scenario II: Consider three variables, lane width, curve density and pavement friction. A fixed 6

CMF value is assigned for each of the three variables. The inverse dispersion 7

parameter equals 0.5, 1.0 and 2.0, respectively. 8

Scenario III: Consider three variables, but only one variable is included in the SPF; this falls 9

under the “omitted bias variables” (14). The inverse dispersion parameter equals 10

0.5, 1.0 and 2.0, respectively. 11

DATA DESCRIPTION 12

The roadway data used in this study contains 1,492 rural two-lane highway segments in Texas. 13

The variables include segment length, Annual Average Daily Traffic (AADT), lane width, curve 14

density (i.e., curves/mile) and pavement friction. Pavement friction is the force that resists the 15

relative motion between a vehicle tire and a pavement surface (30). Generally, higher pavement 16

friction is linked to safer roads. The segment length and AADT are based on actual values from 17

the Texas data, while the other three are hypothetical variables created specifically for this study. 18

The lane widths were generated from a discrete uniform distribution with parameters 8 and 13. 19

The curve density and pavement friction were generated from continuous uniform distributions. 20

For the curve density, the parameters were 0 and 16. And for pavement friction, the parameters 21

were 16 and 48. The summary statistics of these variables are shown in Table 3. 22

Table 3 Summary Statistics of Highway Segments 23

Variable Sample Size Min. Max Mean (SD c)

Length (mile) 1492 0.1 6.3 0.55 (0.67)

AADT 1492 502 24800 6643.9 (3996.4)

Lane Width (ft) 1492 8.0 13.0 10.47 (1.74)

CD a (per mile) 1492 0.02 16.0 8.1 (4.66)

PF b 1492 16.0 47.9 31.9 (9.08)

a – CD = Curve Density; b – PF = Pavement Friction; c - SD = Standard Deviation. 24

25

It is important to point out that this study selected three geometric features and the CMFs 26

are mainly assumed based on their practical values (i.e., from HSM, CMF Clearinghouse, etc.) to 27

reflect as close as possible the characteristics related to variables that can influence crash risk. 28

However, it does not have to be so. With the simulation protocol, it would be possible for other 29

researchers to use variables and ranges based on characteristics associated with the roadway 30

entities in which the researchers have detailed information on these characteristics. 31

Page 12: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 11

RESULTS 1

Scenario I: Consider lane width only 2

In this scenario, only the lane width is considered. All other factors affecting crash risk are 3

assumed to be identical among all segments. The assumed CMF for lane width varies from 0.85 4

to 1.05 with an increment of 0.05. The theoretical function of the generated crash counts in this 5

scenario is shown in Equation 14, which is similar to Equation 11, but the coefficient of lane 6

width varies.

7

4

, , , 2.67 10 [ ( 12)]true i spf i LW i i i LW iN N CMF L AADT exp LW (14a) 8

Or equivalently, 9

, ( )true i off i i LW iN L AADT exp LW (14b) 10

Where, 11

LW = coefficient of lane width, varies between -0.163, -0.105, -0.051, 0 and 0.049, 12

corresponding to the assumed CMFs equal to 0.85, 0.90, 0.95, 1.0 and 1.05, respectively; and, 13

off = offset coefficient, varies between 0.0019, 0.0010, 0.0005, 0.0003 and 0.0002, 14

corresponding to the assumed CMFs equal to 0.85, 0.90, 0.95, 1.0 and 1.05, respectively. 15

The CMF for each assumed value is derived from the model using the same procedures 16

illustrated in the simulation example. The considered functional form is provided in Equation 12 17

(now 15 below). 18

1

0 2( ) ( )i i iE L AADT exp LW (15) 19

The fitting results are shown in Table 4. It can be seen that for each of the assumed CMFs, 20

the estimation bias between the CMF derived from the SPFs and the assumed value is relatively 21

small under different simulation settings. The estimation bias is less than 0.005 for all scenarios, 22

and the error is within 0.5 percent. The small standard deviation of CMFs (second column in 23

Table 4) also indicates the CMFs derived from the experiments are consistent. 24

Based on the result of Scenario I, it can be concluded that the CMFs derived from this 25

model can reflect the true safety performance of lane width when considering this variable only. 26

In other words, if a regression model is based on a group of roadway segments that are ideally 27

identical in all factors affecting traffic safety, except the segment length, AADT and lane width, 28

the CMFs for lane width derived from the SPF should be unbiased. 29

Page 13: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 12

Table 4 Results of Scenario I 1

Theo.

CMF a

CMF (SD) b Bias E

c AIC

d MAD

e MSPE

f

0.5

0.85 0.849 (0.014) 0.001 0.127 11127.63 0.047 0.013

0.90 0.901 (0.014) -0.001 0.107 10681.33 0.042 0.011

0.95 0.949 (0.015) 0.001 0.114 10277.12 0.041 0.010

1.00 0.998 (0.015) 0.002 0.170 9901.71 0.037 0.008

1.05 1.051 (0.018) -0.001 0.109 9540.71 0.036 0.008

1.0

0.85 0.849 (0.018) 0.001 0.149 11289.108 0.063 0.026

0.90 0.902 (0.017) -0.002 0.246 10800.501 0.053 0.017

0.95 0.945 (0.017) 0.005 0.498 10390.874 0.050 0.015

1.00 1.002 (0.024) -0.002 0.233 9995.650 0.048 0.013

1.05 1.051 (0.019) -0.001 0.076 9643.008 0.043 0.012

2.0

0.85 0.853 (0.022) -0.003 0.395 11020.611 0.083 0.044

0.90 0.899 (0.023) 0.001 0.099 10574.470 0.068 0.029

0.95 0.95 (0.024) 0.000 0.018 g 10185.386 0.065 0.026

1.00 0.996 (0.026) 0.004 0.436 9827.826 0.062 0.025

1.05 1.05 (0.025) 0.000 0.032 g 9454.310 0.056 0.022

a – theoretical CMF; b – mean of CMFs from 100 experiments, SD is the Standard Deviation of 2

the 100 CMFs; c – E is the error percentage, %; d, e, f – each is the mean value of the 3

corresponding GOF measure of the 100 results; g – the non-zero error percentage with zero bias 4

is caused by the rounding off during calculation. 5

6

Scenario II: Consider three variables with fixed CMFs 7

In scenario I, only one variable, the lane width, was considered. Scenario II considers a more 8

practical condition: considering lane width, curve density and pavement friction. In this scenario, 9

we assume each of these three variables will influence crash risk, but they are not identical 10

among all segments. The objective is to examine whether the CMFs of multiple variables derived 11

from SPFs are reliable. 12

Page 14: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 13

CMFs for lane width, curve density and pavement friction are assumed to be fixed, and 1

they are 0.90, 1.072 and 0.973, respectively. For curve density, the CMF = 1.072 means that if 2

the curve density of a segment increases by 1 per mile, the expected crash number will increase 3

by 7.2 percent (1.072 - 1.0). And the baseline for curve density is 0 per mile. So, if the curve 4

density of a segment is 0, the specific CMF for curve density of this segment is 1.0. For the 5

pavement friction, the CMF = 0.973 means that if the pavement friction of a segment increases 6

by 1 unit, the expected crash number will decrease by 2.7 percent (1.0 – 0.973). The baseline is 7

32. 8

The theoretical function form of the crash data in scenario II is shown in Equation 16. 9

And the fitting equation for this scenario is shown in Equation 17. 10

, , , , ,true i spf i LW i CD i PF iN N CMF CMF CMF 11

12 3242.67 10 0.9 1.072 0.973i i iLW CD PF

i iL AADT (16a) 12

Or equivalently, 13

, 0.0023 ( 0.105 0.070 0.027 )true i i i i i iN L AADT exp LW CD PF (16b) 14

1

0 2 3 4( ) ( )i i i i iE L AADT exp LW CD PF (17) 15

Where, 16

iCD = curve density of segment i (per mile); 17

,CD iCMF = specific CMF value for curve density of segment i ; 18

iPF = pavement friction of segment i ; 19

,PF iCMF = specific CMF value for pavement friction of segment i ; and, 20

2 3 4, , = coefficients to be estimated for lane width, curve density and pavement 21

friction, respectively. 22

The result of this scenario is shown in Table 5. It can be seen that the CMFs derived from 23

SPFs for all of the three variables are very close to the assumed values. The bias and error 24

percentage are small. The result is quite similar as that of scenario I. This means that, for fixed 25

CMFs in this scenario, the regression model is able to derive reliable CMFs for the three 26

variables. Furthermore, two other scenarios with more variables (five and eight in total, 27

respectively) have been analyzed, the results (not documented here due to the word limit) are 28

consistent with that the results documented here with three variables. 29

Based on this experiment, the CMFs derived from SPFs can reflect the true safety 30

performance when considering multiple variables and assuming other safety factors are identical 31

among all segments. 32

Page 15: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 14

Table 5 Results of Scenario II 1

Variable*

Theo .

CMFa

CMF (SD) b Bias E

c AIC

d MAD

e MSPE

f

0.5

LW 0.900 0.900 (0.013) 0.000 0.023 g

CD 1.072 1.073 (0.006) -0.001 0.051 13864.7 0.10 0.06

PF 0.973 0.973 (0.002) 0.000 0.040 g

1.0

LW 0.900 0.897 (0.014) 0.003 0.366

CD 1.072 1.072 (0.008) 0.000 0.019 g 14072.3 0.13 0.10

PF 0.973 0.973 (0.004) 0.000 0.026 g

2.0

LW 0.900 0.903 (0.023) -0.003 0.354

CD 1.072 1.072 (0.009) 0.000 0.032 g 13736.2 0.16 0.17

PF 0.973 0.972 (0.004) 0.001 0.063

a, b, c, d, e, f, g - the same notes as those in Table 4; * - LW = Lane Width, CD = Curve Density, 2

PF = Pavement Friction. 3

4

Scenario III: Consider three variables, but omit two in models 5

In scenario II, although three variables were considered and included in the model, other factors 6

affecting traffic safety were assumed to be identical among all segments. However, this is not the 7

case in most crash prediction studies. Not all the factors affecting crashes are known or able to be 8

captured by the model. In scenario III, another condition is considered: both curve density and 9

pavement friction are associated with crash risk, but only the lane width is included in the model. 10

The assumed CMFs for lane width, curve density and pavement friction are 0.90, 1.072 11

and 0.973, respectively, the same as those in scenario II. The theoretical function of scenario III 12

is the same as that of scenario II, as shown in Equation 16b (now 18 below). In this scenario, the 13

curve density and pavement friction are excluded from the model. So, the model for scenario III 14

is basically same as that of scenario I, as shown in Equation 12 above (now 19 below). 15

, 0.0023 ( 0.105 0.070 0.027 )true i i i i i iN L AADT exp LW CD PF (18) 16

1

0 2( ) ( )i i iE L AADT exp LW (19) 17

The results for scenario III are shown in Table 6. The derived CMF for lane width in this 18

scenario is also close to the assumed value 0.90. The bias is relatively small and the error is 19

within 1.2 percent in this experiment. Generally, the CMF for lane width in this experiment is 20

reliable. 21

However, when compared with the results in scenarios I and II, both the bias and error 22

percentage become large in scenario III. That is, when some factors affecting traffic safety are 23

Page 16: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 15

omitted in the models, the bias for the CMF will be higher. Meanwhile, the MAD and MSPE 1

also increase greatly, indicating the modeling result becomes less reliable. Similar scenarios with 2

CMFs for lane width of 0.85 as well as 1.05 and constant CMFs for curve density and pavement 3

friction have been analyzed, and the results are consistent. 4

5

Table 6 Results of Scenario III 6

Theo.

CMF a

CMF (SD) b Bias E

c AIC

d MAD

e MSPE

f

0.5

0.90 0.898 (0.014) 0.002 0.211 14395.9 0.75 2.45

1.0

0.90 0.890 (0.026) 0.010 1.111 14479.1 0.75 2.49

2.0

0.90 0.898 (0.023) 0.002 0.206 13975.8 0.75 2.51

a, b, c, d, e, f - the same notes as those in Table 4. 7

8

In scenario III, the assumed CMFs for curve density and pavement friction are close to 9

1.0. This means the change of one unit of these two variables will have relatively small effect or 10

association on crash risk. In other words, if minor factors are omitted in the SPFs, the result may 11

still be acceptable. However, the bias may become unacceptable if the omitted factors have a 12

strong relationship with crashes. Further analyses were conducted to examine this hypothesis. 13

For example, the assumed CMF for curve density was augmented to 1.2 and 1.3, which led to a 14

significantly increase in the error. At the same time, the MAD and MSPE values also increased 15

significantly. Then, the CMF for curve density was increased to 1.4, at which point the 16

maximum likelihood algorithm for estimating coefficients failed to converge. Therefore, when 17

major factors are omitted in the SPFs, the CMFs derived may become unreliable. This indicates 18

that the model suffers from the omitted-variable bias (14). 19

DISCUSSION AND CONCLUSIONS 20

This research has sought to examine the conditions in which regression models could be used for 21

developing CMFs. CMFs for three variables, lane width, curve density and pavement friction, 22

were assumed and used for generating random crash counts based on a NB modeling structure. 23

Then, CMFs were derived from GLM models using the simulated crash data for three different 24

scenarios. Finally, the modeling results were compared with the assumed true value. 25

The primary conclusions can be summarized as follows: 1) the CMFs derived from 26

regression models should be unbiased when all factors associated with the crash risk are identical 27

in all segments, except the factors of interest; 2) based on the simulation results, if some factors 28

having minor safety effects are omitted in the SPFs, the derived CMFs are still reliable, but the 29

error and bias become larger; 3) When some factors known to significantly influence crash risk 30

Page 17: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 16

are omitted, the CMFs derived from regression models become unreliable. According to these 1

findings, safety analysts are recommended to examine whether or not the segments or sites have 2

significant differences in some safety associated factors other than those included in the models. 3

If these factors have significant effects on safety (e.g., relatively large or small CMFs are found 4

from HSM, CMF Clearinghouse, and other relevant documents or peer-reviewed papers), 5

attention should be paid to the use of the CMFs derived from the models. 6

It is worth mentioning that the CMFs for all the three variables in this paper are assumed 7

to have an exponential relationship with their corresponding variables. This is, in fact, a linear 8

relationship between the predicted crash risk and the changes in some variable (in the 9

logarithmic form). Two discussion points need to be addressed. First, this linear relationship is 10

consistent with the GLM model used in this study. The expected crash mean will always be 11

multiplied by a constant factor when a variable increases by one unit, regardless of the original 12

value of the variable. This might explain the small bias and error percentage observed in this 13

study. Second, this relationship is monotonic. For example, if the CMF for lane width is less than 14

1.0, the expected crash mean will always decrease when the lane width increases. However, 15

recent research has shown that this may not be the case. Some variables have been shown to have 16

non-monotonic relationships with crash risk (31; 32). Under such conditions, the GLM method 17

may not be applicable, since the CMFs produced from these models will be biased, especially at 18

the boundary conditions. Recently, other statistical models have been proposed to address this 19

problem (22; 33). It is therefore suggested to examine non-monotonic relationships between 20

CMFs and variables using an approach similar to the one documented in this paper. Finally, the 21

factors influencing safety in this study were assumed to be “independent” of each other, which 22

may also not be realistic in practice (2). Further work is needed to examine this issue in the 23

context of regression models. 24

Although before-after studies have been considered to be the state-of-the-art method and 25

are always preferred for developing CMFs, recent studies have pointed out that the before-after 26

study can also be biased (7; 34). Hence, further research is needed to provide guidelines when a 27

cross-sectional study should be used over the before-after study, and vice versa, as a function of 28

the characteristics of the data. 29

In conclusion, the work in this paper is a first step on better understanding the 30

development of CMFs using regression models. A solid model and proper assumption of the 31

relationship between variables and their safety effects are the base for developing reliable CMFs 32

when using the regression method. This study started from the simple form of CMFs (i.e., 33

independent of each other, in linear form) and commonly used model (i.e., GLM based on the 34

NB distribution). The sample size, which is assumed to be large enough, will also influence the 35

CMFs. The topic of this paper can be expanded to other areas, such as independence of variables, 36

non-linear relationships between crash risk and variables. Some of the proposed work is ongoing 37

and the results will be documented in future manuscripts. 38

REFERENCES 39

[1] FHWA. CMF Clearinghouse Brochure. 40

http://www.cmfclearinghouse.org/collateral/CMF_brochure.pdf. Accessed January 20, 2014. 41

Page 18: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 17

[2] Gross, F., B. Persaud, and C. Lyon. A Guide to Developing Quality Crash Modification 1

Factors. Publication FHWA-SA-10-032. FHWA, U.S. Department of Transportation, 2010. 2

[3] Shen, J., and A. Gan. Development of Crash Reduction Factors: Methods, Problems, and 3

Research Needs. Transportation Research Record, No. 1840, 2003, pp. 50-56. 4

[4] Gross, F., and E. T. Donnell. Case–Control and Cross-Sectional Methods for Estimating 5

Crash Modification Factors: Comparisons from Roadway Lighting and Lane and Shoulder Width 6

Safety Effect Studies. Journal of safety research, Vol. 42, No. 2, 2011, pp. 117-129. 7

[5] Hauer, E. Observational Before-After Studies in Road Safety: Estimating the Effect of 8

Highway and Traffic Engineering Measures on Road Safety. Pergamon, Tarrytown, N.Y., U.S.A., 9

1997. 10

[6] Davis, G. A. Accident Reduction Factors and Causal Inference in Traffic Safety Studies: A 11

Review. Accident Analysis & Prevention, Vol. 32, No. 1, 2000, pp. 95-109. 12

[7] Lord, D., and P.-F. Kuo. Examining the Effects of Site Selection Criteria for Evaluating the 13

Effectiveness of Traffic Safety Countermeasures. Accident Analysis & Prevention, Vol. 47, 2012, 14

pp. 52-63. 15

[8] Noland, R. B. Traffic Fatalities and Injuries: the Effect of Changes in Infrastructure and 16

Other Trends. Accident Analysis & Prevention, Vol. 35, No. 4, 2003, pp. 599-611. 17

[9] Bonneson, J., and M. P. Pratt. Highway Safety Design Workshops. Publication FHWA/TX-18

10/5-4703-01-1. Texas Transportation Institute, 2010. 19

[10] Tarko, A. P., S. Eranky, K. C. Sinha, and R. Scinteie. An Attempt to Develop Crash 20

Reduction Factors Using Regression Technique. Presented at the 78th Annual Meeting of 21

Transportation Research Board, Washington, D.C., 1999. 22

[11] AASHTO. Highway Safety Manual. American Association of State Highway and 23

Transportation Officials, Washington, D.C., 2010. 24

[12] Hauer, E. Trustworthiness of Safety Performance Functions.In The Transportation Research 25

Board (TRB) 93rd Annual Meeting, Transportation Research Board, Washington, D.C., 2014. 26

[13] Hauer, E., and B. Persaud. Common Bias in Before-and-After Accident Comparisons and 27

Its Elimination. Transportation Research Record, Vol. 905, 1983, pp. 164-174. 28

[14] Lord, D., and F. Mannering. The Statistical Analysis of Crash-Frequency Data: A Review 29

and Assessment of Methodological Alternatives. Transportation Research Part A: Policy and 30

Practice, Vol. 44, No. 5, 2010, pp. 291-305. 31

[15] Park, B.-J., D. Lord, and C. Lee. Finite Mixture Modeling for Vehicle Crash Data with 32

Application to Hotspot Identification. Accident Analysis & Prevention, Vol. 71, 2014, pp. 319-33

326. 34

[16] Chen, Y., and B. Persaud. Methodology to Develop Crash Modification Functions for Road 35

Safety Treatments with Fully Specified and Hierarchical Models. Accident Analysis & 36

Prevention, Vol. 70, 2014, pp. 131-139. 37

[17] Zou, Y., D. Lord, Y. Zhang, and Y. Peng. Comparison of Sichel and Negative Binomial 38

Models in Estimating Empirical Bayes Estimates. Transportation Research Record, Vol. 2392, 39

2013, pp. 11-21. 40

Page 19: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 18

[18] Wu, L., Y. Zou, and D. Lord. Comparison of Sichel and Negative Binomial Models in 1

Hotspot Identification. Transportation Research Record, Vol. Forthcoming, 2014. 2

[19] Miaou, S.-P., and D. Lord. Modeling Traffic Crash Flow Relationships for Intersections - 3

Dispersion Parameter, Functional Form, and Bayes Versus Empirical Bayes Methods. 4

Transportation Research Record, Vol. 1840, 2003, pp. 31-40. 5

[20] Gross, F., C. Lyon, B. Persaud, and R. Srinivasan. Safety Effectiveness of Converting 6

Signalized Intersections to Roundabouts. Accident Analysis & Prevention, Vol. 50, 2013, pp. 7

234-241. 8

[21] Rodegerdts, L., M. Blogg, E. Wemple, E. Myers, M. Kyte, M. Dixon, G. List, A. Flannery, 9

R. Troutbeck, W. Brilon, N. Wu, B. Persaud, C. Lyon, D. Harkey, and D. Carter. Roundabouts in 10

the United States. Publication 0-309-09874-2. FHWA, U.S. Department of Transportation, 2007. 11

[22] Li, X., D. Lord, and Y. Zhang. Development of Accident Modification Factors for Rural 12

Frontage Road Segments in Texas Using Generalized Additive Models. Journal of 13

Transportation Engineering, Vol. 137, No. 1, 2011, pp. 74-83. 14

[23] Lord, D., and J. A. Bonneson. Development of Accident Modification Factors for Rural 15

Frontage Road Segments in Texas. Transportation Research Record, Vol. 2023, 2007, pp. 20-27. 16

[24] Hauer, E. Cause, Effect and Regression in Road Safety: A Case Study. Accident Analysis & 17

Prevention, Vol. 42, No. 4, 2010, pp. 1128-1135. 18

[25] Carter, D., R. Srinivasan, F. Gross, and F. Council. Recommended Protocols for Developing 19

Crash Modification Factors. http://www.cmfclearinghouse.org/collateral/CMF_Protocols.pdf. 20

Accessed June 26, 2014. 21

[26] Lord, D. Modeling Motor Vehicle Crashes Using Poisson-Gamma Models: Examining the 22

Effects of Low Sample Mean Values and Small Sample Size on the Estimation of the Fixed 23

Dispersion Parameter. Accident Analysis & Prevention, Vol. 38, No. 4, 2006, pp. 751-766. 24

[27] Lord, D., S. D. Guikema, and S. R. Geedipally. Application of The Conway-Maxwell-25

Poisson Generalized Linear Model for Analyzing Motor Vehicle Crashes. Accident Analysis & 26

Prevention, Vol. 40, No. 3, 2008, pp. 1123-1134. 27

[28] Ripley, B., B. Venables, D. M. Bates, K. Hornik, A. Gebhardt, and D. Firth. Package 28

'MASS'. http://cran.r-project.org/web/packages/MASS/MASS.pdf. Accessed May 20, 2014. 29

[29] Hamner, B. Package 'Metrics'. http://cran.r-project.org/web/packages/Metrics/Metrics.pdf. 30

Accessed May 22, 2014. 31

[30] Hall, J. W., K. L. Smith, L. Titus-Glover, J. C. Wambold, T. J. Yager, and Z. Rado. Guide 32

for Pavement Friction. Transportation Research Board. 33

http://onlinepubs.trb.org/onlinepubs/nchrp/nchrp_w108.pdf. Accessed November 11, 2014. 34

[31] Gross, F., P. P. Jovanis, K. Eccles, and K.-Y. Chen. Safety Evaluation of Lane and Shoulder 35

Width Combinations on Rural, Two Lane, Undivided Roads. Publication FHWA-HRT-09-031. 36

FHWA, U.S. Department of Transportation, 2009. 37

[32] Hauer, E. Statistical road safety modeling. Statistical Methods and Safety Data Analysis and 38

Evaluation, No. 1897, 2004, pp. 81-87. 39

Page 20: Validation of CMFs Derived from Cross Sectional Studies Using … · 2014-12-16 · 7 which regression models could be used for such purpose. CMFs for three variables, lane width,

Wu et al. 19

[33] Zhang, Y., and L. Xie. Crash Frequency Analysis of Different Types of Urban Roadway 1

Segments Using Generalized Additive Model. Journal of safety research, Vol. 43, No. 2, 2012, 2

pp. 107-114. 3

[34] Kuo, P.-F., and D. Lord. Accounting for Site-Selection Bias in Before-After Studies for 4

Continuous Distributions: Characteristics and Application Using Speed Data. Transportation 5

Research Part A: Policy and Practice, Vol. 49, 2013, pp. 256-269. 6