
Transportation Research Part C 19 (2011) 1364–1376


A genetic algorithm-based method for improving quality of travel time prediction intervals

Abbas Khosravi a,*, Ehsan Mazloumi b, Saeid Nahavandi a,b,c, Doug Creighton a, J.W.C. Van Lint c

a Centre for Intelligent Systems Research (CISR), Deakin University, Geelong, VIC 3216, Australia
b Institute of Transport Studies, Department of Civil Engineering, Monash University, Melbourne, Australia
c Department of Transport and Planning, Faculty of Civil Engineering and Geosciences, Delft University of Technology, 2600 Delft, The Netherlands

ARTICLE INFO

Article history:
Received 24 April 2010
Received in revised form 19 February 2011
Accepted 11 April 2011

Keywords:
Travel time
Prediction interval
Neural network
Genetic algorithm

0968-090X/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.trc.2011.04.002

* Corresponding author. E-mail address: [email protected] (A. Khosravi).

ABSTRACT

The transportation literature is rich in the application of neural networks for travel time prediction. The uncertainty prevailing in operation of transportation systems, however, highly degrades prediction performance of neural networks. Prediction intervals for neural network outcomes can properly represent the uncertainty associated with the predictions. This paper studies an application of the delta technique for the construction of prediction intervals for bus and freeway travel times. The quality of these intervals strongly depends on the neural network structure and a training hyperparameter. A genetic algorithm-based method is developed that automates the neural network model selection and adjustment of the hyperparameter. Model selection and parameter adjustment is carried out through minimization of a prediction interval-based cost function, which depends on the width and coverage probability of constructed prediction intervals. Experiments conducted using the bus and freeway travel time datasets demonstrate the suitability of the proposed method for improving the quality of constructed prediction intervals in terms of their length and coverage probability.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Travel time prediction problem

Access to accurate travel time information is widely acknowledged to have the potential to increase the reliability in road networks (Levinson, 2003) and to alleviate congestion and its negative environmental and societal side effects. From the travelers' perspective, information on future travel times can reduce the uncertainty in decision making in regard to departure time, route and mode choice (Mannering, 1989; Denant-Boèmont and Petiot, 2003; Bhat and Sardesai, 2006), which in turn can lessen travelers' stress and anxiety (Bates et al., 2001; Lam and Small, 2001). From the operators' point of view, information on future travel times can be used to present the current traffic state in a network and to fully identify the problematic locations/routes (van Lint, 2008).

Travel time prediction is a complex problem because travel times result from nonlinear interactions of heterogeneous groups of driver–vehicle combinations, each characterized by their own specific technical and behavioral properties, such as vehicle dimensions, acceleration characteristics, and driving styles (aggressive or conservative) (van Lint, 2006). Travel times are also influenced by exogenous factors that are often completely beyond the analyst's capacity to predict, such as weather and traffic incidents. There exists a wide range of methodologies adopted to predict travel times in the literature.




More complicated data-driven models have shown promising ability for travel time prediction by learning the complex traffic dynamics directly from data on the route under study. Examples of these methodologies are Kalman filtering (Chien and Kuchipudi, 2003; Antoniou et al., 2007; Jula et al., 2008) and generalized linear regression (Zhang and Rice, 2003). Neural Networks (NNs), which have been widely used for travel time prediction (van Lint, 2008; van Lint et al., 2005; Rilett and Park, 2001; Liu et al., 2008; Kalaputapu and Demetsky, 1995; Jeong and Rilett, 2005), are of particular interest in this study. With interesting features such as being universal approximators (Bishop, 1995), NNs have proven to be reliable predictive tools in transportation engineering (Dougherty, 1995).

Traditionally, NNs are used to produce the average value of a dependent variable when a certain set of deterministic input values is given (Bishop, 1995). However, these average values are not able to perfectly account for the uncertainty and variability in travel times. Factors such as different driving behaviors, pedestrian crossings (on arterial roads), parking manoeuvres, and, most importantly, traffic signals contribute to the variability of travel times. More specifically, when the aim is to predict bus travel times, a great share of the uncertainty is caused by the variation in the number of passengers waiting for service at different bus stops on any given day (Mazloumi et al., 2011a). The accumulation of these temporal factors results in multiple realities for future travel times, even under fixed deterministic conditions. In other words, as there are many unpredictable exogenous factors excluded from the modeling process, it is highly likely that what happens in reality differs from what the models predict. Therefore, the point prediction performance of models drops significantly and their reliability is reduced, mainly due to the presence of uncertainty.

Due to the above-mentioned sources of uncertainty in travel time values, it makes more sense to predict a likely range for travel times rather than a single point. However, only a few studies have dealt with the issue of predicting travel time variability. Fu and Rilett (1998) and Pattanamekar et al. (2003) presented a set of analytic functions for both the mean and variance of travel times. They then used a Taylor's series expansion to derive the mean and variance of travel times. Liu et al. (2005) developed a macroscopic model for urban link travel time prediction based on measurements collected by single loop detectors. The method divides travel times into two components: link cruising times and intersection delays. They then presented a set of analytic equations to derive the mean and the variance of link travel time. Li et al. (2006) examined the variability between journey times experienced by different vehicles traveling over the same route at the same time. They showed that vehicle-to-vehicle travel time variability has an S-shaped relationship with mean travel time. Li et al. (2006) developed fuzzy NNs to predict mean travel time and used the S-shaped relationship to predict vehicle-to-vehicle travel time variability.

1.2. Prediction Intervals to account for uncertainties in travel times

The uncertainty of travel time prediction (and in general any prediction) can be described and measured through a Prediction Interval (PI). A PI is a range of values that has a predetermined probability (confidence level) of containing the average value of the future observations. In fact, PIs are confidence intervals constructed for the unseen observations, and are therefore wider than confidence intervals (Heskes, 1997). There are a few methods for constructing PIs for NN outcomes. In the literature, the bootstrap is a widely used technique for PI construction (Efron, 1979; Heskes, 1997). This technique is essentially a resampling method that yields satisfactorily good PIs with a high coverage probability. The main problem with this method is that its computational load in the development and utilization stages is high (Veaux et al., 1998; Rivals and Personnaz, 2000). The Bayesian technique is based on Bayesian statistics and has a strong mathematical foundation (Bishop, 1995; MacKay, 1992). The mean-variance PI construction method, sometimes called the Nix-Weigend method (Ding and He, 2003), is based on the assumption that the prediction error variance can be estimated using the inputs of the main NN predictor (Mazloumi et al., 2011a,b). It has been argued that this method does not cover all sources of uncertainty and underestimates the variance (Dybowski and Roberts, 2001; Ding and He, 2003). The delta technique (Hwang and Ding, 1997; Veaux et al., 1998) is based on representing and interpreting NNs as nonlinear regression models. It allows applying standard asymptotic theory to them for constructing PIs. The main assumption in this method is that prediction errors are Gaussian with zero mean and constant variance.
Comprehensive simulation studies conducted in Khosravi (2010) show that the quality of PIs constructed using the delta technique is superior to those constructed using the other NN-based PI construction techniques. Accordingly, the main focus of this research is on the use of the delta technique for PI construction. Previous applications have included temperature prediction (Lu and Viljanen, 2009), boring process prediction (Yu et al., 2006), modeling of the solder paste deposition process (Ho et al., 2001), and baggage handling system analysis (Khosravi et al., 2009, 2010a).

1.3. Research objectives

The purpose of this study is threefold. First, it aims to incorporate the uncertainty in travel time prediction by developing PIs. The delta technique will be used to construct PIs for bus and freeway travel times. Second, it investigates the applicability of measures allowing quantitative assessment of PIs. These measures can be easily applied for comparing the performance of different PI construction techniques. Finally, the study focuses on how the quality of PIs constructed using the delta technique can be improved (minimizing their length and improving their coverage probability). A new cost function is developed based on the quantitative measures for the evaluation of PIs. The purpose of the optimization is to determine the optimal NN structure (the number of neurons in the hidden layers) and to find the best value for a training hyperparameter. As the proposed cost function is highly nonlinear and complex, its mathematical minimization is not possible. Even if it were, gradient-based methods are highly likely to be trapped in local minima, leading to imperfect results. Therefore, a genetic algorithm (GA) is applied for the minimization of the proposed cost function. The optimal structure and hyperparameter are found through the minimization process, resulting in higher-quality NN-based PIs.

The topic of automating NN structure selection using Evolutionary Algorithms (EAs), such as GA, is not new. Many researchers have proposed and applied EA-based methods to improve the efficiency of the underlying NNs in regression and classification problems (Bornholdt and Graudenz, 1992; Vonk and Jain, 1997). What makes this work distinct from others is (i) its application domain, and (ii) its focus on improving the quality of constructed PIs for outcomes of NNs. The literature focuses mainly on improving the generalization power of NNs through minimizing error-based cost functions such as Mean Squared Error (MSE) or Mean Absolute Percentage Error (MAPE). In contrast, this paper aims at the minimization of a PI-based objective function in order to improve the quality of constructed PIs.

The rest of this paper is organized as follows. Section 2 provides a brief review of fundamental theories of the delta technique and GA. PI assessment measures are explained in Section 3. Section 4 introduces the new cost function and the methodology proposed for improving the quality of constructed PIs. Experimental results using bus and freeway travel time datasets are demonstrated in Section 5. Finally, Section 6 concludes the paper with some remarks for further study in this domain.

2. Theory and background

This section briefly introduces the delta technique and GA, which are adopted in the upcoming sections.

2.1. Delta technique for PI construction

The delta technique is based on the representation and interpretation of NNs as nonlinear regression models. This allows applying standard asymptotic theory to them for constructing PIs. Accordingly, one may formulate NNs as

$$y_i = f(X_i, w^*) + \epsilon_i, \qquad i = 1, 2, \ldots, n \tag{1}$$

where $X_i$ and $y_i$ are the $i$th set of inputs ($m$ independent variables) and the corresponding target ($n$ observations), respectively. $\epsilon_i$ is noise with zero mean. $f(\cdot)$ with parameters $w^*$ is the nonlinear function representing the true regression function. $\hat{w}$, an estimate of $w^*$, can be obtained through minimization of the Sum of Squared Error (SSE) cost function,

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \bigl(y_i - f(X_i, \hat{w})\bigr)^2 \tag{2}$$

A first-order Taylor's expansion of $f(X_i, \hat{w})$ around the true values of the model parameters ($w^*$) can be expressed as

$$\hat{y}_i = f(X_i, w^*) + g_{y_i}^{T}(\hat{w} - w^*) \tag{3}$$

where $g_{y_i}$ is the gradient of $f(\cdot)$ (here, the NN model) with respect to its parameters, $w$, evaluated at $w^*$. With the assumption that the $\epsilon_i$ in (1) are independently and normally distributed, $N(0, \sigma^2)$, the $(1-\alpha)\%$ PI for $y_i$ is

$$\hat{y}_i \pm t_{df}^{1-\frac{\alpha}{2}}\, s\, \sqrt{1 + g_{y_i}^{T}\,(J^{T}J)^{-1}\, g_{y_i}} \tag{4}$$

where $t_{df}^{1-\frac{\alpha}{2}}$ is the $1-\frac{\alpha}{2}$ quantile of a cumulative $t$-distribution function with $df$ degrees of freedom. $df$ is the difference between the number of training samples, $n$, and the number of NN parameters, $p$. $s$ is the estimate of the standard deviation, and $J$ is the Jacobian matrix of the NN model with respect to its parameters.

Inclusion of weight decay terms in (2) and training NNs using the Weight Decay Cost Function (WDCF) may improve the generalization power of NNs (Bishop, 1995),

$$\mathrm{WDCF} = \mathrm{SSE} + \lambda\, w^{T} w \tag{5}$$

where $\lambda$ is the regularizing factor. Constructing PIs based on (5) will yield the following PIs (Veaux et al., 1998):

$$\hat{y}_i \pm t_{df}^{1-\frac{\alpha}{2}}\, s\, \sqrt{1 + g^{T}\,(J^{T}J + \lambda I)^{-1}(J^{T}J)(J^{T}J + \lambda I)^{-1}\, g} \tag{6}$$

Further information about this method and its mathematical discussion can be found in Hwang and Ding (1997) and Veaux et al. (1998).
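To make the mechanics of Eqs. (4) and (6) concrete, the sketch below constructs delta-technique PIs for a toy nonlinear regression model standing in for the NN. It is a sketch under simplifying assumptions: the model, data, and parameter values are invented, the gradient is analytic, and the normal quantile 1.645 approximates $t_{df}^{1-\frac{\alpha}{2}}$ at the 90% level.

```python
# Delta-technique PI sketch (Eqs. (4) and (6)) on a toy nonlinear model.
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # toy nonlinear model playing the role of the NN predictor
    return w[0] * (1.0 - np.exp(-w[1] * x))

def grad(x, w):
    # analytic gradient of f with respect to w; rows form the Jacobian J
    return np.stack([1.0 - np.exp(-w[1] * x),
                     w[0] * x * np.exp(-w[1] * x)], axis=-1)

# synthetic training data (zero-mean Gaussian noise, as the method assumes)
w_true = np.array([10.0, 0.5])
x = np.linspace(0.1, 8.0, 60)
y = f(x, w_true) + rng.normal(0.0, 0.4, size=x.size)

# crude Gauss-Newton fit for w_hat (an NN would be trained instead)
w = np.array([8.0, 0.3])
for _ in range(50):
    step = np.linalg.lstsq(grad(x, w), y - f(x, w), rcond=None)[0]
    w = w + step

J = grad(x, w)
n, p = x.size, w.size
s = np.sqrt(np.sum((y - f(x, w)) ** 2) / (n - p))  # residual std estimate
z = 1.645  # normal approximation to the t quantile at the 90% level

def pi_delta(x0, lam=0.0):
    """PI at x0: Eq. (4) when lam == 0, the weight-decay form (6) otherwise."""
    g = grad(np.array([x0]), w)[0]
    JtJ = J.T @ J
    if lam == 0.0:
        core = g @ np.linalg.inv(JtJ) @ g
    else:
        A = np.linalg.inv(JtJ + lam * np.eye(p))
        core = g @ A @ JtJ @ A @ g
    half = z * s * np.sqrt(1.0 + core)
    y0 = f(np.array([x0]), w)[0]
    return y0 - half, y0 + half

lo, hi = pi_delta(4.0)
print(f"90% PI at x = 4.0: [{lo:.2f}, {hi:.2f}]")
```

Passing a positive `lam` switches from Eq. (4) to the regularized form of Eq. (6), which is the variant the rest of the paper optimizes.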

2.2. Genetic algorithm

The genetic algorithm (GA) is a gradient-free, stochastic optimization method that uses the idea of survival of the fittest and natural selection (Holland, 1975; Goldberg, 1989). GA concurrently evaluates (evaluation step) a set of solutions and converges towards more competitive solutions by applying reproduction, cross over, and mutation mechanisms. The reproduction mechanism guarantees preserving the good solutions during the optimization process. Retaining the good


features of the parents is the main principle of the cross over mechanism. The mutation mechanism also helps avoid getting trapped in local minima through randomly generating new solutions.

The virtues of the GA method for optimization are as follows. First, as no derivative information is required during the search, GA performs well in conjunction with non-differentiable cost functions. Second, GA is stochastic, so it has a better chance of exploring the entire design space and reaching the global optimum. This indicates that GA is a suitable technique for the minimization of highly bumpy cost functions. In fact, GA can be implemented without invoking ad-hoc assumptions related to smoothness, differentiability, and continuity of the objective functions. Furthermore, its parallelism makes the GA an excellent candidate for finding competitive solutions in a multimodal search space. Detailed information about GA can be found in Holland (1975), Goldberg (1989), and Chambers (2001).
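The mechanisms above can be illustrated with a minimal GA in plain Python, minimizing a deliberately multimodal ("bumpy") one-dimensional cost function. The population size, elite count, mutation rate, and arithmetic crossover below are illustrative choices, not those used in the paper.

```python
# Minimal GA sketch: evaluation, elitism (reproduction), crossover, mutation.
import math
import random

random.seed(1)

def cost(x):
    # multimodal cost function with many local minima; global minimum at x = 0
    return x * x + 10.0 * (1.0 - math.cos(3.0 * x))

POP, ELITE, GENS = 30, 2, 80
pop = [random.uniform(-10.0, 10.0) for _ in range(POP)]

for _ in range(GENS):
    pop.sort(key=cost)                            # evaluation step
    elite = pop[:ELITE]                           # reproduction: best survive
    children = []
    while len(children) < POP - ELITE:
        a, b = random.sample(pop[:POP // 2], 2)   # parent selection
        child = 0.5 * (a + b)                     # (arithmetic) crossover
        if random.random() < 0.2:                 # mutation escapes local minima
            child += random.gauss(0.0, 1.0)
        children.append(child)
    pop = elite + children

best = min(pop, key=cost)
print(f"best x = {best:.3f}, cost = {cost(best):.3f}")
```

Because elitism never discards the best individual, the best cost is non-increasing over generations, which is exactly the "preserving the good solutions" property described in the text.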

3. PI assessment indexes

A PI comprises upper and lower bounds within which the future observation is expected to lie with a prescribed probability.¹ This probability is theoretically very important for the construction of PIs and is called the confidence level ((1 − α)%). It is expected that the coverage probability of PIs will asymptotically approach the nominal level of confidence. Accordingly, the PI Coverage Probability (PICP) is the natural measure of the quality of constructed PIs (Khosravi et al., 2010a,b, 2011a,b,c),


$$\mathrm{PICP} = \frac{1}{n}\sum_{i=1}^{n} c_i \tag{7}$$

where $c_i = 1$ if $y_i \in [L(X_i), U(X_i)]$; otherwise $c_i = 0$. $L(X_i)$ and $U(X_i)$ are the lower and upper bounds of the PI corresponding to the $i$th sample. If the empirical PICP is far below its nominal value, the first conclusion is that the constructed PIs are not reliable at all. This measure has been reported in almost all studies related to PIs as an indication of how good the constructed PIs are.
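Eq. (7) translates directly into code: PICP is simply the fraction of targets that fall inside their intervals. The numbers below are invented for illustration.

```python
# PICP (Eq. (7)): fraction of targets covered by their intervals.
def picp(targets, lower, upper):
    hits = sum(1 for y, lo, hi in zip(targets, lower, upper) if lo <= y <= hi)
    return hits / len(targets)

y     = [4.2, 5.1, 3.8, 6.0, 5.5]
lower = [3.5, 4.0, 3.9, 5.2, 5.0]
upper = [5.0, 5.5, 4.5, 6.5, 6.2]
print(picp(y, lower, upper))  # 4 of 5 targets covered -> 0.8
```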

An important point is that PICP by itself does not describe all characteristics of PIs. In fact, it is just an indication of how well PIs cover the targets. If one selects PIs to be the extreme values of the targets, a perfect PICP is always achievable (100% coverage). Practically, PIs that are too wide are not useful, as they carry no information about the variation of the targets. Therefore, it is essential to define an index for quantifying the length of PIs. The Mean Prediction Interval Length (MPIL) is defined as follows:

$$\mathrm{MPIL} = \frac{1}{n}\sum_{i=1}^{n}\bigl(U(X_i) - L(X_i)\bigr) \tag{8}$$

Assuming that the target range, R, is known, the Normalized MPIL (NMPIL) (Khosravi et al., 2010a,b, 2011a,b,c) can be calculated as follows:

$$\mathrm{NMPIL} = \frac{\mathrm{MPIL}}{R} \tag{9}$$

Normalization against the range of the target allows objective comparison of PIs constructed for different targets. In fact, NMPIL is a dimensionless measure representing the average length of PIs as a percentage of the range of the underlying target. In the case of using the extreme target values as the upper and lower bounds of PIs, both NMPIL and PICP will be 100%. This indicates that PICP and NMPIL have a direct relationship. Under equal conditions, a higher NMPIL will usually result in a higher PICP.²
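Eqs. (8) and (9) are equally direct: MPIL averages the interval widths, and NMPIL divides by the target range R so that intervals for different targets can be compared. The values below are invented for illustration.

```python
# MPIL (Eq. (8)) and NMPIL (Eq. (9)).
def mpil(lower, upper):
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

def nmpil(lower, upper, targets):
    r = max(targets) - min(targets)   # target range R
    return mpil(lower, upper) / r

y     = [10.0, 14.0, 12.0, 18.0]
lower = [ 9.0, 12.5, 11.0, 16.0]
upper = [11.0, 15.5, 13.0, 19.0]
print(mpil(lower, upper))        # average width = 2.5
print(nmpil(lower, upper, y))    # 2.5 / 8 = 0.3125
```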

The ideal case is when PICP is equal to or higher than its nominal value and NMPIL is as small as possible (narrow PIs). As this may not happen in reality, a combinational measure is required to carry quantitative information about how wide PIs are and how well they cover the targets. As PICP is the key index of the quality of PIs, the measure should be highly sensitive to small changes in PICP. In other words, the new measure should heavily penalize PIs whose PICP is below the nominal confidence level, regardless of the lengths of the PIs (as measured by NMPIL). The following Coverage Length-based Criterion (CLC) has all these characteristics:

$$\mathrm{CLC} = \mathrm{NMPIL}\,\bigl(1 + e^{-\eta\,(\mathrm{PICP}-\mu)}\bigr) \tag{10}$$

where $\eta$ and $\mu$ are two hyperparameters determining the level of penalty assigned to PIs with a low coverage probability. The role of $\eta$ is to magnify any small difference between PICP and $\mu$.

A three-dimensional graphical representation of CLC for different values of $\eta$, $\mu$, and PICP is provided in Fig. 1. NMPIL is assumed to be 40%. The two plots correspond to the two values of $\eta$ (100 and 200). A larger $\eta$ means a more rapid jump (higher penalty) in the value of CLC if $\mathrm{PICP} - \mu < 0$. In the two-dimensional space of PICP and $\mu$, if $\mathrm{PICP} \geq \mu$ (half of the space), CLC will be almost the same as NMPIL. In the other half-space, CLC rises quickly, regardless of how small NMPIL is.
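The penalty behavior described above is easy to verify numerically. The sketch below evaluates Eq. (10) with NMPIL = 40% and illustrative hyperparameters (η = 100, μ = 0.9): above μ, CLC stays close to NMPIL; below μ, it grows by orders of magnitude.

```python
# CLC (Eq. (10)) and its asymmetric penalty around mu.
import math

def clc(nmpil_val, picp_val, eta, mu):
    return nmpil_val * (1.0 + math.exp(-eta * (picp_val - mu)))

# PICP above mu: CLC is almost identical to NMPIL
print(clc(0.40, 0.95, 100, 0.90))
# PICP below mu: CLC explodes, no matter how narrow the PIs are
print(clc(0.40, 0.85, 100, 0.90))
```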

¹ Other names for PIs used in the literature are prediction bounds, prediction limits, interval prediction, and prediction region (Jorgensen and Sjoberg, 2003).
² Both MPIL and NMPIL equally treat (very) wide and narrow PIs. Another idea is to emphasize the wideness of PIs through considering higher powers, for instance 2 or 3, in these two measures. From the conceptual perspective, this is very similar to the difference between Mean Squared Error (MSE) and Mean Absolute Error (MAE).


Fig. 1. Profile of CLC for different values of $\mu$, $\eta$, and PICP while setting NMPIL to 40%.


4. PI optimization algorithm

4.1. Formulation of the optimization problem

There are two types of parameters affecting the quality of constructed PIs for NN outcomes³:

• The first set is associated with the NN structure and its learning capacity. As a data-driven model, an NN's prediction performance highly depends on its learning capacity. This capacity is directly related to its complexity, as determined by the number of neurons in the hidden layer(s). Assuming $n_i$ is the number of neurons in the $i$th hidden layer of an $L$-layer NN, there are $L$ independent parameters to be determined ($n_i$, $i = 1, \ldots, L$).
• The quality of constructed PIs also directly depends on the value of the regularizing factor, $\lambda$, used in (6). As discussed before, using WDCF leads to more reliable PIs. However, the optimal value of $\lambda$ has not been discussed in the literature from a PI-based perspective.

According to the above discussion, there are L + 1 parameters to be determined optimally. Traditionally, the process of NN model selection is carried out based on error-based measures such as SSE, MAPE, the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) (Bishop, 1995; Qi and Zhang, 2001). This approach is theoretically and practically justifiable if the purpose of modeling and analysis is point prediction. If NNs are going to be used for PI construction, it is more reasonable to configure and develop them based on the key characteristics of PIs. A set of optimal PIs constructed using NNs in this manner will have an improved quality in terms of their length and coverage probability.

To achieve this goal, NN structure selection and the optimal tuning of λ can be accomplished through the minimization of a PI-based cost function. The combinational measure proposed in (10) can be interpreted and used as a cost function for determining the optimal values of the L + 1 parameters. As both NMPIL and PICP contribute to CLC, it is guaranteed that the final PIs will have a small NMPIL and a high PICP. The optimization process aims at finding the number of neurons in the L hidden layers and λ through the minimization of CLC. This can be expressed in mathematical terms as follows,


$$w_{\mathrm{opt}} = \underset{n_i,\, \lambda}{\arg\min}\ \mathrm{CLC} \tag{11}$$

Hereafter, we refer to the set of L + 1 parameters as the decision variables.

4.2. Optimization algorithm

The traditional cost functions for training NNs, for instance SSE defined in (2) and WDCF defined in (5), are mathematically differentiable. Therefore, gradient-based methods can be applied for their minimization (Bishop, 1995).

³ NN learning capacity and its prediction performance depend on many other factors, including the type of the activation functions, the training algorithm, and the termination of the training process. Optimal determination of all these parameters and factors is beyond the scope of this paper.


However, gradient-based approaches are difficult to apply to the minimization of CLC. The main reason is that the calculation of the mathematical characteristics of CLC, including its derivatives, is very complex. As some of the decision variables, such as the number of neurons, change in a discrete fashion, the cost function is not differentiable. Apart from this, CLC minimization through gradient-based methods may not lead to the best results. Gradient-based methods are notorious for being susceptible to local minima. As CLC is highly volatile in the search space, the risk of being trapped in a local minimum is high.

With regard to this discussion, stochastic, gradient-free methods are the best available option for solving this optimization problem. In this study, GA is applied to find the optimal set of decision variables. The procedure for the minimization of CLC and the optimal determination of the NN structure and its regularizing factor is described below.

4.2.1. Splitting data
First, we define three randomly selected datasets. Two of these datasets are used for training (D1 and D2), and the last one is used for testing the performance of the developed NN for the construction of PIs (DTest). There is no limitation on the number of samples in each set, but it is reasonable to set them so that $n_{D_1} > n_{D_2} > n_{D_{Test}}$, where $n$ is the cardinality of the set.
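Such a split can be sketched in a few lines. The 50/25/25 proportions below match those used later in Section 5 of this paper; the helper name and seed are arbitrary choices for illustration.

```python
# Random 50/25/25 split into D1 (training), D2 (PI evaluation inside the GA
# loop), and DTest (held-out testing).
import random

def split_data(samples, seed=0):
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # fixed seed -> reproducible split
    n1 = len(samples) // 2                    # 50% -> D1
    n2 = n1 + len(samples) // 4               # next 25% -> D2
    d1 = [samples[i] for i in idx[:n1]]
    d2 = [samples[i] for i in idx[n1:n2]]
    d_test = [samples[i] for i in idx[n2:]]   # remaining 25% -> DTest
    return d1, d2, d_test

d1, d2, d_test = split_data(list(range(100)))
print(len(d1), len(d2), len(d_test))  # 50 25 25
```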

4.2.2. Optimization initialization
An initial population is randomly generated for the L + 1 parameters of the optimization process. The number of neurons in the hidden layers and the regularizing factor can be selected arbitrarily. In a more systematic approach, the methodology proposed in Khosravi et al. (2010a) can be applied to determine an upper bound for the initial number of neurons in the hidden layers. A sufficiently large value is initially assigned to CLC and recorded as CLC_opt.

4.2.3. NN training
For the current population, NN models are trained using the traditional cost function defined in (5). The Levenberg-Marquardt algorithm (Bishop, 1995) is employed for adjusting NN parameters. Training is completed using sample set D1.

4.2.4. PI construction
The standard deviation, s, is first estimated using D1, and then PIs are constructed for the samples in D2 using (6). The dataset used for training the NN models (D1) is not used for constructing PIs.

4.2.5. PI evaluation
CLC is calculated for the constructed PIs and named CLC_new. The value of CLC is an indication of the suitability of the corresponding NN structure and λ. These are sorted and recorded for the evaluation and generation of the new population.

4.2.6. CLC comparison
If the minimum of CLC_new is lower than CLC_opt, the optimal decision variables are replaced with the new decision variables. If not, they are kept unchanged for this generation of the optimization process.

4.2.7. Generating a new set of parameters
If no termination criterion is satisfied, the method generates a new population set through applying the reproduction (elitism), cross over, and mutation mechanisms/operators. Elitism retains a certain number of individuals (the elite count) with the lowest CLC. This guarantees that the fittest solutions always survive in each generation. A selection function is applied to choose the parents for the next generation based on the associated CLCs. The mutation operator enables random generation of new populations. The population size and cross over fraction determine the reproduction, cross over, and mutation quantities. After completion of the population generation, the process returns to Step 4.2.3 and the optimization continues.

4.2.8. Algorithm termination
The optimization algorithm terminates if any of the following conditions is met:

• the maximum number of iterations is reached;
• no further improvement is observed for a specific number of consecutive iterations; or
• a pre-advised value for the fitness function (here, CLC) is achieved.

4.2.9. Optimal PI construction
The optimal set of decision variables is applied to construct PIs for DTest.

The optimization algorithm discussed above is a PI-based algorithm. It is guided by a fitness function that depends on the quality of PIs in terms of their length and coverage probability. The inclusion of two different datasets during the optimization process reduces the chance of overfitting. If the NN is overfitted (when trained using D1), its CLC for D2 will dramatically rise, and therefore that set of NN decision variables will be discarded automatically.
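The loop formed by Steps 4.2.2 through 4.2.9 can be sketched as follows. The NN training and delta-technique PI construction are mocked out: `evaluate_clc` below is a synthetic stand-in for "train on D1, build PIs on D2, score with CLC", with an arbitrarily placed minimum, so the skeleton only shows how a GA searches over the decision variables (n1, n2, λ) with elitism, single-point crossover, and mutation. It is not the paper's implementation.

```python
# GA skeleton for Steps 4.2.2-4.2.9, with a mocked fitness function.
import random

random.seed(2)

def evaluate_clc(n1, n2, lam):
    # stand-in for the real pipeline; minimum placed at (5, 3, 0.2)
    return (n1 - 5) ** 2 + (n2 - 3) ** 2 + 10.0 * (lam - 0.2) ** 2

def random_individual():
    # decision variables: neurons per hidden layer in [1, 10], lambda in [0, 1]
    return (random.randint(1, 10), random.randint(1, 10), random.random())

def crossover(a, b):
    # single-point crossover on the (n1, n2, lambda) chromosome
    point = random.randint(1, 2)
    return a[:point] + b[point:]

def mutate(ind):
    n1, n2, lam = ind
    if random.random() < 0.5:
        n1 = min(10, max(1, n1 + random.choice([-1, 1])))
    if random.random() < 0.5:
        n2 = min(10, max(1, n2 + random.choice([-1, 1])))
    lam = min(1.0, max(0.0, lam + random.gauss(0.0, 0.05)))
    return (n1, n2, lam)

pop = [random_individual() for _ in range(10)]          # initialization
best = min(pop, key=lambda ind: evaluate_clc(*ind))

for _ in range(40):
    pop.sort(key=lambda ind: evaluate_clc(*ind))        # PI evaluation
    elite = pop[:2]                                     # elitism (count 2)
    children = [mutate(crossover(*random.sample(pop[:5], 2)))
                for _ in range(8)]
    pop = elite + children
    cand = min(pop, key=lambda ind: evaluate_clc(*ind))
    if evaluate_clc(*cand) < evaluate_clc(*best):       # CLC comparison step
        best = cand

print("best (n1, n2, lambda):", best)
```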

The evolutionary PI-based approach proposed here only evolves the NN architecture and its regularizing factor. It does not cover the NN weights and bias parameters. NN parameters are determined using the traditional error-minimization learning techniques. As indicated in the literature (Yao, 1999), a problem with the evolution of architectures without parameters


is the presence of noise in the fitness function. Different random initializations of NN parameters may yield NNs with similar structures yet totally different prediction and generalization power (Yao and Liu, 1997). The presence of noise may mislead the optimization process, resulting in convergence to a non-optimal solution. This problem can be avoided in two ways:

• Training an NN architecture many times using different initial parameters and then averaging the obtained results (Fiszelew et al., 2007). This methodology reduces the chance of selecting the wrong set of optimal parameters during the optimization process. However, the computational load is of concern when employing this method.
• Initializing NN parameters using a pre-determined set. This permanently eliminates the noise from the fitness function and results in NNs with steady performance. Despite this, the suitability of the pre-determined set of parameters is questionable, as it has a strong effect on the performance of the trained NNs (Bishop, 1995).

In this study, the latter solution was applied to remove the effects of noise. This decision was made based on preliminary experiments conducted using both techniques. It was found that the second method leads to a more stable optimization.
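The second option amounts to seeding the weight initializer so that a given architecture always starts from identical parameters, which makes the fitness of a chromosome deterministic. The initializer below is a generic illustration, not the paper's implementation.

```python
# Fixed (seeded) weight initialization: the same architecture always starts
# from the same parameters, so the GA's fitness function is noise-free.
import random

def init_weights(n_in, n_hidden, seed=42):
    rng = random.Random(seed)  # same seed -> identical starting weights
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
            for _ in range(n_hidden)]

w_a = init_weights(4, 6)
w_b = init_weights(4, 6)
print(w_a == w_b)  # identical initializations -> deterministic fitness
```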

5. Experimental results

The effectiveness of the proposed algorithm for improving the quality of PIs is examined using two different transportation datasets. In the first part of this section, some practical aspects of the optimization method are discussed; then the results for the two transportation datasets are presented.

5.1. GA-based optimization implementation

Table 1 lists the parameters used in the optimization algorithm. The level of confidence associated with all PIs is 90%. In all the experiments, we randomly divide the data set into the first training dataset (50%), the second training dataset (25%), and the test dataset (25%). These datasets respectively correspond to D1, D2, and DTest. To obtain a quantitative evaluation of the constructed PIs, PICP, NMPIL, and CLC are computed for each case. The cross over method is single point: a crossover point is randomly selected within a chromosome, and the two parent chromosomes are interchanged at this point to produce two new offspring. While λ can be any number between 0 and 1, n1 and n2 are integers between 1 and 10.

The set of parameters summarized in Table 1 was obtained through trial and error. The preliminary experiments were carried out with the aim of ensuring convergence of the optimization in a short time (a minimum number of generations) and avoiding premature convergence.

5.2. Bus travel time (BTT)

Data for the first case study were supplied from GPS-equipped buses operating on an 8 km segment of bus route 246 in inner Melbourne, Australia. This segment comprises four sections (equal in length), defined as the distance between consecutive timing point stops. Arrival and departure times of buses are recorded at timing point stops to monitor service consistency and schedule adherence. Bus headways vary from ten minutes in peak hours to approximately half an hour in the off-peak. Buses in this section operate in mixed traffic and there is no separate lane allocated to them.

This paper uses a dataset including weekday travel times in one direction of the bus route. The dataset provides the travel times of almost 1800 trips corresponding to each route section, collected over a period of six months in 2007. Fig. 2 depicts the travel times of buses over the day for each section of the bus route. The scatter plots illustrate a considerable variability in the travel times at each time over different days. The extent of variation in travel times at a certain time is attributable to a range of factors, including variations in passenger demand and traffic flow over different days, various signal delays experienced by different buses, and variation in the driving style of bus drivers on different days (Mazloumi et al., 2010). To better understand the data used in the study, the statistical characteristics of the four datasets are reported in Table 2.

Table 1
Parameters used in experiments and GA-based optimization method.

Parameter                           Numerical value
α                                   0.1
η                                   200
μ                                   0.875
Population size                     10
Parent selection                    Stochastic uniform
Reproduction (elite count)          2
Crossover fraction (single point)   0.7
Number of NN layers                 2
Number of neurons                   [1, 10]
D1                                  50% of all samples
D2                                  25% of all samples
DTest                               25% of all samples


Fig. 2. The scatter plot of travel times on four sections over the day: section one (top-left), section two (top-right), section three (bottom-left), and section four (bottom-right).

Table 2
Statistical characteristics of bus travel times for four sections (all measures in seconds).

Section   Min   Max    Mean   SD    Median   Mode
One       104   1049   328    130   294      240
Two       153   866    385    109   377      300
Three     158   774    363    109   346      300
Four      179   794    426    118   406      420


Section one has the largest range of travel times (945 s). Travel times are on average longer in section four and shorter in section one. The standard deviations of the four datasets are almost the same, indicating a similar variation in travel time values over different days.
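The measures reported in Table 2 are standard descriptive statistics; for one route section they can be computed as below (`travel_times` is a hypothetical list of per-trip travel times in seconds):

```python
import statistics

def summarize(travel_times):
    """Descriptive statistics of a list of travel times, as in Table 2."""
    return {
        "min": min(travel_times),
        "max": max(travel_times),
        "mean": statistics.mean(travel_times),
        "sd": statistics.stdev(travel_times),    # sample standard deviation
        "median": statistics.median(travel_times),
        "mode": statistics.mode(travel_times),
    }
```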

The explanatory variables used for prediction include: (1) the average degree of saturation for each intermediate signalized intersection in the last 15 min interval prior to the departure of the bus from the upstream timing point stop, and (2) the schedule adherence, quantified by subtracting the observed arrival time from the scheduled arrival time at each timing point. These variables were found to affect bus travel times in the dataset used in this paper (Mazloumi et al., 2011a).

We first examine the performance of NN models for travel time prediction. A two-layer NN model is used to approximate the relationship between the dependent and independent variables. The number of neurons is varied between 1 and 10 in each layer. Experiments are repeated five times to remove the effects of random initialization (in total, 500 NNs were trained and tested for each dataset). The coefficient of determination (R2) is calculated as the performance assessment measure. The best averaged results for the test samples are shown in Table 3. The small values of R2 indicate the inability of NNs to explain the stochasticity in the bus travel times. The best R2 for bus travel times is for section one (46.29%). The smallness of R2 is not attributable to the NN structure or its training process, because we have examined different structures (through varying the number of hidden neurons) and repeated the training process five times.

Table 3
R2 for the best prediction results using neural network models.

Dataset               Section         R2 (%)
Bus travel time       Section one     46.29
                      Section two     30.06
                      Section three   38.80
                      Section four    25.42
Freeway travel time   –               83.73
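The R2 values in Table 3 follow the usual definition of the coefficient of determination; a minimal sketch:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect predictor yields 1.0; predicting the mean of the targets yields 0.0, so the small values in Table 3 mean the point predictions explain little of the variance.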

The high level of uncertainty in the bus travel time datasets greatly contributes to the poor performance of NN models for travel time prediction. Bus travel times are affected by traffic lights, passenger demand, and a high rate of congestion during rush hours. As many of these factors/variables are not measurable, the level of uncertainty associated with the travel time predictions generated by NNs is high. This justifies the need for the construction of PIs and their application instead of point predictions.

Fig. 3 shows the variation of the fitness function (CLC) for the bus travel time datasets. From top to bottom, the plots correspond to the four sections of the bus route. The CLC value decreases as the optimization process proceeds and converges to an acceptable solution. Convergence is achieved in less than 40 generations, indicating the low computational load of the optimization algorithm. As the CLCs are lower than 100, the PICPs have been greater than or equal to the nominal confidence level (90%). The amount of CLC reduction varies depending on the type of dataset and the suitability of the initial population. The best improvement is for section four, where the CLC is reduced from 68.16 to 59.60. As the final values of the decision variables, k, n1, and n2, differ from the initial ones, it is reasonable to conclude that the initial values were not optimal in terms of the PI characteristics.

Hereafter, the subscript "opt" is used for all measures computed and models developed using the proposed optimization algorithm. Likewise, the subscript "init" indicates those developed and computed based on the initial values of the decision variables. Upon completion of the optimization stage, NNopt is applied for the construction of PIs for the test samples, DTest. PICPopt, NMPILopt, and CLCopt for the test samples are summarized in Table 4. These measures should be compared with PICPinit, NMPILinit, and CLCinit for an objective assessment of the effectiveness of the proposed method. The tabulated results clearly show that application of the proposed method enhances the quality of the constructed PIs. For the four sections, CLCopt is always lower than CLCinit. While the largest improvement is for the first and fourth datasets, the smallest one is for the second dataset. Although the length of the PIs has been reduced, the CLC reduction is mainly attributable to the improvement of the coverage probability of the PIs, in particular for sections one and four. As discussed in Section 3, CLC is highly sensitive to the smallness of PICP; therefore, improvement of PICP is the main concern throughout the optimization process. Whenever PICP has been lower than the nominal confidence level (90%), the optimization algorithm has discarded the current solution and looked for more competitive solutions. This, in turn, has improved the coverage probability of the PIs and made it closer to the nominal one. This has been achieved through the appropriate adjustment of the NN structure and its regularizing parameter in the optimization algorithm.

Fig. 4 shows one hundred PIs for travel times in section three of the bus route. Depending on the level of uncertainty in the samples and data, the lengths of the PIs vary to cover the targets. There are cases where targets are not covered by PIs, for instance samples 4 and 15. As per the results in Table 4, the demonstrated PIs cover more than 92% of the targets (yellow squares lie within the vertical lines).
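The two PI quality measures used throughout this section can be sketched as follows. The exact normalization of NMPIL is not restated in this section; here it is assumed to be the mean PI length divided by the range of the target values, so both measures are percentages.

```python
def picp(targets, lower, upper):
    """Prediction interval coverage probability (%): share of targets
    falling inside their interval [lower, upper]."""
    covered = sum(1 for t, l, u in zip(targets, lower, upper) if l <= t <= u)
    return 100.0 * covered / len(targets)

def nmpil(targets, lower, upper):
    """Normalized mean PI length (%): mean interval width divided by the
    range of the targets (assumed normalization)."""
    mean_len = sum(u - l for l, u in zip(lower, upper)) / len(targets)
    return 100.0 * mean_len / (max(targets) - min(targets))
```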

Shortening the length of the PIs through the optimization algorithm does not jeopardize their coverage probability. For all sections, applying the optimization method not only reduces the length of the PIs (measured by NMPIL), but also improves their coverage probability. This implicitly means that widening PIs to achieve a greater coverage probability is not an optimal decision.

5.3. Freeway travel time (FTT)

As the second case study, the methodology proposed in this paper is also applied to an 8.5 km long route of the A12 freeway in the Netherlands, from an on-ramp (Zoetermeer) to an off-ramp (Voorburg). Cameras record the license plates of vehicles at both the on-ramp and the off-ramp. Individual travel times, determined by matching license plates, were collected for 95 days in the winter and spring of 2007. The data were filtered for outliers, which were considerable in number, mainly because only four characters out of six are recorded due to privacy legislation. After filtering the data and inspecting them visually, the travel times of the vehicles leaving in the same 5-min time period were averaged. A total of 39 peak periods of approximately 3.5 h each were selected from the dataset.

Over the freeway section under study, there are 19 double-loop detectors collecting speeds and flow rates every minute. The speed data were averaged over the 5-min time periods to be consistent with the travel time data. No additional information was available on the occurrence of, for example, incidents or accidents. The travel times lie between 273 s and 1280 s, which indicates that congestion occurred frequently, leading to long travel times. The coefficient of variation of the travel times is 43.02%, indicating high variability of the travel times around the mean. We use the data in the last 5-min interval to predict the freeway travel times. For instance, the dataset of 08:55 is used to predict travel times at 09:00, and so forth. Fig. 5 shows the scatter plot of the freeway travel times used in this study. During the peak hour (08:00–09:00 am), the travel times have a large variability, making point prediction problematic and prone to error.
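The 5-min averaging of individual vehicle travel times described above can be sketched as follows; the record layout is a hypothetical one (departure timestamp, travel time in seconds), as the paper does not describe its data structures.

```python
from datetime import datetime
from collections import defaultdict

def five_min_averages(records):
    """Average individual travel times over 5-min departure periods.
    records: list of (departure_datetime, travel_time_seconds)."""
    bins = defaultdict(list)
    for ts, tt in records:
        # Floor the timestamp to the start of its 5-min period.
        floored = ts.replace(minute=ts.minute - ts.minute % 5,
                             second=0, microsecond=0)
        bins[floored].append(tt)
    return {k: sum(v) / len(v) for k, v in sorted(bins.items())}
```

The averaged series at 08:55 then serves as the input for predicting the travel time at 09:00, one step ahead.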

The convergence history and the variation of the decision variables for the FTT dataset are shown in Fig. 6. Similar to the case of BTT, the optimization process converges very quickly to the optimal solution and finds the best structure and regularizing factor for the underlying NN. The fitness function drops from 34.11 to 31.38 during the optimization of the decision variables.

The effectiveness of the optimization algorithm is also verified using the FTT test dataset. PICP, NMPIL, and CLC are tabulated in Table 4 for the test samples. PICPinit is far from the nominal confidence level, resulting in a significantly large CLCinit. After applying the optimization algorithm, PICPopt is above 90% and the length of the PIs is acceptable.


Fig. 3. Evolution of CLC and decision variables during the optimization process for the bus travel time dataset.


For the case of FTT, the NN models tend to have a more complex structure compared to the BTT datasets. The larger number of inputs for the FTT dataset warrants this additional complexity.

The main advantage of the proposed method is that it gives flexibility to develop custom-made PIs with specific characteristics. The quality of the PIs directly depends on how η and μ are defined in (10). If μ is set very close to the nominal confidence level ((1 − α)%), having PIs with an excellent coverage probability will be the main concern of the optimization algorithm. Large values of η, as a magnifier, also emphasize this concern. In contrast, if smaller values are selected for μ, the optimization algorithm strongly squeezes the PIs while keeping their PICP close to 100μ%.
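The exact form of (10) appears earlier in the paper; the sketch below assumes the coverage–length-based form used in the authors' related work, in which the interval length is inflated by an exponential penalty whenever PICP falls below the level μ, with η acting as a magnifier. With η = 200 and μ = 0.875 (Table 1), and PICP expressed as a fraction, this form reproduces the CLC values reported in Table 4.

```python
import math

def clc(picp_frac, nmpil_pct, eta=200.0, mu=0.875):
    """Assumed coverage-length-based cost: NMPIL multiplied by an
    exponential penalty that explodes when PICP drops below mu."""
    return nmpil_pct * (1.0 + math.exp(-eta * (picp_frac - mu)))
```

For example, the initial freeway PIs (PICP = 57.91%, NMPIL = 15.69%) yield a cost on the order of 10^26, while the optimized ones (PICP = 91.24%, NMPIL = 30.73%) yield a cost barely above the NMPIL itself, matching Table 4.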

The robust ability of the optimal PIs to encompass travel time values shows promise for a wide range of applications in transportation engineering. Provision of this information to passengers will ease the stress caused by the uncertainty in future travel times. The intervals reflect the level of variability in travel times, so they can provide a basis for measuring system reliability. More specifically, within the public transport domain, the availability of these intervals will assist planners in schedule design and in defining optimal slack times in the timetable development process (Wirasinghe and Liu, 1995). Furthermore, as adopted in Kim and Rilett (2005), these intervals can enhance the quality of transit signal priority schemes by providing an interval in which buses are highly likely to arrive at a certain signalized intersection.

Table 4
Comparison of characteristics of PIs developed using the optimized and initial NNs for the bus and freeway travel time datasets.

Dataset                      PICPinit (%)   NMPILinit (%)   CLCinit        PICPopt (%)   NMPILopt (%)   CLCopt
Section one of bus route     81.51          32.79           5.24 × 10^6    88.08         30.98          40.74
Section two of bus route     89.20          47.95           49.57          89.20         46.42          47.98
Section three of bus route   86.99          49.80           188.92         92.01         44.20          44.21
Section four of bus route    81.64          61.28           7.57 × 10^6    91.07         60.63          60.67
Freeway                      57.91          15.69           7.93 × 10^26   91.24         30.73          30.75

Fig. 4. PIs for bus travel times in section three.

Fig. 5. Scatter plot of the freeway travel time during the day.


Fig. 6. Evolution of CLC and decision variables during the optimization process for the freeway travel time dataset.


6. Conclusion

An algorithmic approach has been proposed to construct optimal prediction intervals for bus and freeway travel times. The method uses a genetic algorithm to determine the number of neurons in the first and second hidden layers of the neural network models. It also optimally adjusts the regularizing factor used for training the neural networks. Optimization was carried out through minimization of a prediction interval-based fitness function. The prediction interval coverage probability and the mean prediction interval length constituted the core of the proposed fitness function. Numerical studies were conducted using the bus and freeway travel time datasets. The demonstrated results showed that the proposed method quickly determines the optimal neural network structure and its training hyperparameter. This not only removes the hassle of time-consuming trial-and-error approaches for determining the optimal neural network structure, but also leads to high quality prediction intervals.

The obtained results can be improved in different ways. Bus and freeway travel times vary widely throughout the day depending on different known and unknown factors. Currently, the lack of information about many of these variables is regarded as uncertainty. The availability of more measurements and information about different variables will positively contribute to the improvement of the results, in particular reducing the length of the PIs. Optimal initialization of the NN parameters can enhance the convergence speed of the proposed optimization method. A comparative study of initialization algorithms is presented in Yam and Chow (2001) and Zhang et al. (2004). To make the method more effective, it is helpful to adjust the NN parameters through the minimization of PI-based cost functions. Broadly speaking, the optimized prediction intervals can be used in many applications where decisions have previously been made based on point prediction values. This includes real-time operational planning and scheduling, bus timetable modification, and advanced traveler information systems.

Acknowledgment

This research was fully supported by the Centre for Intelligent Systems Research (CISR) at Deakin University. The authors would like to acknowledge Ventura Bus Company and VicRoads for supplying the GPS and SCATS data, respectively, for this research. The data from the Dutch freeways were obtained from the TU Delft Regiolab data server (www.regiolab-delft.nl), whereas the travel times were kindly provided by Vialis NL.

References

Antoniou, C., Ben-Akiva, M., Koutsopoulos, H., 2007. Nonlinear Kalman filtering algorithms for on-line calibration of dynamic traffic assignment models. IEEE Transactions on Intelligent Transportation Systems 8 (4), 661–670.

Bates, J., Polak, J., Jones, P., Cook, A., 2001. The valuation of reliability for personal travel. Transportation Research Part E: Logistics and Transportation Review 37 (2–3), 191–229.

Bhat, C.R., Sardesai, R., 2006. The impact of stop-making and travel time reliability on commute mode choice. Transportation Research Part B: Methodological 40 (9), 709–730.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Bornholdt, S., Graudenz, D., 1992. General asymmetric neural networks and structure design by genetic algorithms. Neural Networks 5 (2), 327–334.

Chambers, L., 2001. The Practical Handbook of Genetic Algorithms: Applications. CRC Press.

Chien, S.I.-J., Kuchipudi, C.M., 2003. Dynamic travel time prediction with real-time and historic data. Journal of Transportation Engineering 129 (6), 608–616.

Denant-Boèmont, L., Petiot, R., 2003. Information value and sequential decision-making in a transport setting: an experimental study. Transportation Research Part B: Methodological 37 (4), 365–386.

Ding, A., He, X., 2003. Backpropagation of pseudo-errors: neural networks that are adaptive to heterogeneous noise. IEEE Transactions on Neural Networks 14 (2), 253–262.

Dougherty, M., 1995. A review of neural networks applied to transport. Transportation Research Part C 3, 247–260.

Dybowski, R., Roberts, S., 2001. Confidence intervals and prediction intervals for feed-forward neural networks. In: Dybowski, R., Gant, V. (Eds.), Clinical Applications of Artificial Neural Networks. Cambridge University Press, pp. 298–326.

Efron, B., 1979. Bootstrap methods: another look at the jackknife. The Annals of Statistics 7 (1), 1–26.

Fiszelew, A., Britos, P., Ochoa, A., Merlino, H., Fernández, E., García-Martínez, R., 2007. Finding optimal neural network architecture using genetic algorithms. Research in Computing Science 27, 15–24.

Fu, L., Rilett, L.R., 1998. Expected shortest paths in dynamic and stochastic traffic networks. Transportation Research Part B: Methodological 32 (7), 499–516.

Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

Heskes, T., 1997. Practical confidence and prediction intervals. In: Mozer, M., Jordan, M., Petsche, T. (Eds.), Neural Information Processing Systems, vol. 9. MIT Press, pp. 176–182.

Ho, S., Xie, M., Tang, L., Xu, K., Goh, T., 2001. Neural network modeling with confidence bounds: a case study on the solder paste deposition process. IEEE Transactions on Electronics Packaging Manufacturing 24 (4), 323–332.

Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press.

Hwang, J.T.G., Ding, A.A., 1997. Prediction intervals for artificial neural networks. Journal of the American Statistical Association 92 (438), 748–757.

Jeong, R., Rilett, L., 2005. Prediction model of bus arrival time for real-time applications. Transportation Research Record 1927, 195–204.

Jorgensen, M., Sjoberg, D.I.K., 2003. An effort prediction interval approach based on the empirical distribution of previous estimation accuracy. Information and Software Technology 45 (3), 123–136.

Jula, H., Dessouky, M., Ioannou, P., 2008. Real-time estimation of travel times along the arcs and arrival times at the nodes of dynamic stochastic networks. IEEE Transactions on Intelligent Transportation Systems 9 (1), 97–110.

Kalaputapu, R., Demetsky, M., 1995. Modelling schedule deviations of buses using automatic vehicle location data and artificial neural networks. Transportation Research Record 1497, 44–52.

Khosravi, A., Nahavandi, S., Creighton, D., 2009. Constructing prediction intervals for neural network metamodels of complex systems. In: International Joint Conference on Neural Networks (IJCNN), pp. 1576–1582.

Khosravi, A., 2010. Construction and Optimisation of Prediction Intervals for Neural Network Models. Ph.D. thesis, Deakin University.

Khosravi, A., Nahavandi, S., Creighton, D., 2010a. A prediction interval-based approach to determine optimal structures of neural network metamodels. Expert Systems with Applications 37, 2377–2387.

Khosravi, A., Nahavandi, S., Creighton, D., 2010b. Construction of optimal prediction intervals for load forecasting problem. IEEE Transactions on Power Systems 25, 1496–1503.

Khosravi, A., Nahavandi, S., Creighton, D., Atiya, A.F., 2011a. A lower upper bound estimation method for construction of neural network-based prediction intervals. IEEE Transactions on Neural Networks 22 (3), 337–346.

Khosravi, A., Mazloumi, E., Nahavandi, S., Creighton, D., Van Lint, J.W.C., 2011b. Prediction intervals to account for uncertainties in travel time prediction. IEEE Transactions on Intelligent Transportation Systems. doi:10.1109/TITS.2011.2106209.

Khosravi, A., Nahavandi, S., Creighton, D., 2011c. Construction of optimal prediction intervals for load forecasting problem. IEEE Transactions on Fuzzy Systems. doi:10.1109/TFUZZ.2011.2130529.

Kim, W., Rilett, L., 2005. Improved transit signal priority system for networks with nearside bus stops. Transportation Research Record 1925 (1), 205–214.

Lam, T.C., Small, K.A., 2001. The value of time and reliability: measurement from a value pricing experiment. Transportation Research Part E: Logistics and Transportation Review 37 (2–3), 231–251.

Levinson, D., 2003. The value of advanced traveler information systems for route choice. Transportation Research Part C: Emerging Technologies 11 (1), 75–87.

Li, R., Rose, G., Sarvi, M., 2006. Using automatic vehicle identification data to gain insight into travel time variability and its causes. Transportation Research Record 1954, 24–32.

Li, R., 2006. Enhancing Motorway Travel Time Prediction Models through Explicit Incorporation of Travel Time Variability. Ph.D. thesis, Monash University, Melbourne, Australia.

Liu, H., van Zuylen, H., van Lint, H., Chen, Y., Zhang, K., 2005. Prediction of urban travel times with intersection delays. In: Proceedings of the 8th International IEEE Conference on Intelligent Transportation Systems, pp. 402–407.

Liu, H., 2008. Travel Time Prediction for Urban Networks. Ph.D. thesis, Delft University of Technology, The Netherlands.

Lu, T., Viljanen, M., 2009. Prediction of indoor temperature and relative humidity using neural network models: model comparison. Neural Computing & Applications 18 (4), 345–357.

MacKay, D.J.C., 1992. The evidence framework applied to classification networks. Neural Computation 4 (5), 720–736.

Mannering, F.L., 1989. Poisson analysis of commuter flexibility in changing routes and departure times. Transportation Research Part B: Methodological 23 (1), 53–60.

Mazloumi, E., Currie, G., Rose, G., 2010. Using GPS data to gain insight into public transport travel time variability. Journal of Transportation Engineering 136 (7), 623–631.

Mazloumi, E., Rose, G., Currie, G., Sarvi, M., 2011a. An integrated framework to predict bus travel time and its variability using traffic flow data. Journal of Intelligent Transportation Systems 15 (2), 75–90.

Mazloumi, E., Rose, G., Currie, G., Moridpour, S., 2011b. Prediction intervals to account for uncertainties in neural network predictions: methodology and application in bus travel time prediction. Engineering Applications of Artificial Intelligence 24 (3), 534–542.

Pattanamekar, P., Park, D., Rilett, L.R., Lee, J., Lee, C., 2003. Dynamic and stochastic shortest path in transportation networks with two components of travel time uncertainty. Transportation Research Part C: Emerging Technologies 11 (5), 331–354.

Qi, M., Zhang, G.P., 2001. An investigation of model selection criteria for neural network time series forecasting. European Journal of Operational Research 132 (3), 666–680.

Rilett, L., Park, D., 2001. Direct forecasting of freeway corridor travel times using spectral basis neural networks. Transportation Research Record 1752, 140–147.

Rivals, I., Personnaz, L., 2000. Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks 13 (4–5), 463–484.

van Lint, J.W.C., 2006. Reliable real-time framework for short-term freeway travel time prediction. Journal of Transportation Engineering 132 (12), 921–932.

van Lint, J., 2008. Online learning solutions for freeway travel time prediction. IEEE Transactions on Intelligent Transportation Systems 9 (1), 38–47.

van Lint, J., Hoogendoorn, S., van Zuylen, H., 2005. Accurate freeway travel time prediction with state-space neural networks under missing data. Transportation Research Part C: Emerging Technologies 13 (5–6), 347–369.

de Veaux, R.D., Schumi, J., Schweinsberg, J., Ungar, L.H., 1998. Prediction intervals for neural networks via nonlinear regression. Technometrics 40 (4), 273–282.

Vonk, E., Jain, L.C., Johnson, R.P., 1997. Automatic Generation of Neural Network Architecture Using Evolutionary Computation. World Scientific, Singapore/River Edge, NJ.

Wirasinghe, S., Liu, G., 1995. Determination of the number and locations of time points in transit schedule design – case of a single run. Annals of Operations Research 60 (1), 161–191.

Yam, J., Chow, T., 2001. Feedforward networks training speed enhancement by optimal initialization of the synaptic coefficients. IEEE Transactions on Neural Networks 12 (2), 430–434.

Yao, X., 1999. Evolving artificial neural networks. Proceedings of the IEEE 87 (9), 1423–1447.

Yao, X., Liu, Y., 1997. A new evolutionary system for evolving artificial neural networks. IEEE Transactions on Neural Networks 8 (3), 694–713.

Yu, G., Qiu, H., Djurdjanovic, D., Lee, J., 2006. Feature signature prediction of a boring process using neural network modeling with confidence bounds. The International Journal of Advanced Manufacturing Technology 30 (7), 614–621.

Zhang, X., Rice, J.A., 2003. Short-term travel time prediction. Transportation Research Part C: Emerging Technologies 11 (3–4), 187–210.

Zhang, X.M., Chen, Y.Q., Ansari, N., Shi, Y.Q., 2004. Mini–max initialization for function approximation. Neurocomputing 57, 389–409.