Data-Driven Mesoscopic Simulation of Large-Scale Surface Transit Networks
by
Bo Wen Wen
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Department of Civil Engineering University of Toronto
© Copyright by Bo Wen Wen 2017
Data-Driven Mesoscopic Simulation of Large-Scale Surface
Transit Networks
Bo Wen Wen
Master of Applied Science
Department of Civil Engineering
University of Toronto
2017
Abstract
The planning of transit services, assessment of operational strategies, and evaluation of service
changes can benefit tremendously from high-fidelity transit network models. Traditional
microsimulation models are infeasible for large networks due to onerous model construction and
calibration and prohibitive computational requirements. They are typically only used to model
individual corridors or small sub-networks. This study presents a data-driven mesoscopic
simulation method that models surface transit movement based on open data and machine learning.
After a comprehensive comparison of running speed models using multiple linear regression,
support vector machine, linear mixed effect model, regression tree and random forest, the random
forest running speed models and lognormal dwell time distribution models were used to perform
stop-to-stop mesoscopic simulations. The model results adequately replicated variation in
headways, delays, and dwell times. Validation at the stop level and the route level demonstrated
the need to capture passenger demand and congestion variations in future studies.
Acknowledgement
This research was made possible by the generous support of our industry research partner,
ARUP, as well as the grants and scholarships from the Natural Sciences and Engineering Research
Council of Canada.
I want to express my gratitude for Prof. Amer Shalaby’s supervision. His support,
guidance, and vision encouraged me to overcome many challenges. We have accomplished a great
deal together and I believe this thesis brings the transit research community closer to a world with
growing presence of data and artificial intelligence.
Thank you, Prof. Shoshanna Saxe, for providing valuable feedback on my thesis. I want to
also express gratitude for Siva Srikukenthiran’s efforts in making this project possible.
I want to thank the staff working at the Toronto Open Data Team for their efforts in making
quality transportation data available.
Special thanks to Kenny Ling from TTC for providing very valuable suggestions for
processing AVL data; you saved me countless hours in figuring out the best way to structure
AVL data for analysis.
I am forever indebted to the open source developers of the many R packages that I used
and the Python packages I tested, as well as to the authors of the countless Stack Overflow
posts that taught me how to code in C#, R, and Python.
Thank you to all the wonderful people who worked very hard with me along the way and
cheered me on day after day: Greg and Greg, Paula, Yishu, Ariel, Teddy, Sami, Adam, Nancy.
Finally, thank you to my family for supporting me and believing in me. My parents, Lina,
and David. Your love made everything possible in the most trying times.
Table of Contents
List of Tables ................................................................................................................................ vii
List of Figures .............................................................................................................................. viii
Glossary ......................................................................................................................................... xi
Modelling Terms ....................................................................................................................... xi
Programming Terms................................................................................................................. xii
Chapter 1 Introduction ..............................................................................................................1
1.1 Thesis Objectives .................................................................................................................2
1.2 Surface Transit Modelling Approach ...................................................................................3
1.3 The Nexus Platform .............................................................................................................7
1.4 Thesis Outline ......................................................................................................................8
Chapter 2 Literature Review...................................................................................................10
2.1 Simulation Level of Detail .................................................................................................10
2.2 Simulation Models for Transit ...........................................................................................13
2.3 Statistical Learning Models ...............................................................................................16
2.4 Research Opportunities ......................................................................................................21
Chapter 3 Modelling Framework ...........................................................................................23
3.1 Functional Requirements ...................................................................................................23
3.2 Simulation Framework.......................................................................................................25
3.3 Detailed Functional Design................................................................................................28
3.4 Summary ............................................................................................................................35
Chapter 4 Data Collection ......................................................................................................36
4.1 Data Collection Methods ...................................................................................................36
4.2 Open Data Formats and Collections ..................................................................................39
4.3 Summary ............................................................................................................................50
Chapter 5 Data Processing ......................................................................................................51
5.1 Data Loading ......................................................................................................................52
5.2 Preprocessing of AVL data ................................................................................................57
5.3 AVL Trip Data Processing.................................................................................................61
5.4 Data Export ........................................................................................................................77
5.5 Summary ............................................................................................................................79
Chapter 6 Model Estimation ...................................................................................................80
6.1 Running Speed Model........................................................................................................81
6.2 Dwell Time Model ...........................................................................................................112
6.3 Summary ..........................................................................................................................115
Chapter 7 Simulation Procedures .........................................................................................116
7.1 Load Trained Models .......................................................................................................117
7.2 Initialize Simulation Data ................................................................................................118
7.3 Iterative Predictions .........................................................................................................119
7.4 Simulation Result Outputs ...............................................................................................122
7.5 Analytic Reports ..............................................................................................................126
7.6 Summary ..........................................................................................................................129
Chapter 8 Results of Case Study ..........................................................................................130
8.1 Case Study Background ...................................................................................................130
8.2 Summary of Data .............................................................................................................132
8.3 Running Speed Model......................................................................................................135
8.4 Dwell Time Distribution Model Result ...........................................................................140
8.5 Simulation with Random Forest ......................................................................................147
8.6 Simulation with Linear Mixed Effect Model ...................................................................152
8.7 Summary ..........................................................................................................................157
Chapter 9 Conclusion ...........................................................................................................158
9.1 Summary of Results .........................................................................................................158
9.2 Research Contributions ....................................................................................................162
9.3 Future Research ...............................................................................................................163
Bibliography ................................................................................................................................164
Appendix A Selected Programming Code ................................................................................177
A.1 Data Collection Algorithm ...............................................................................................177
A.2 Data Processing Algorithm ..............................................................................................203
A.3 Model Estimation Algorithm ...........................................................................................277
A.4 Model Simulation Algorithm ...........................................................................................293
A.5 Analytic Report Algorithm ..............................................................................................301
Appendix B Software Repository.............................................................................................315
Appendix C TRB Paper 2016 ...................................................................................................316
Appendix D TransitData 2017 Abstract ...................................................................................337
List of Tables
Table 1. Fields for GTFS CSV files.............................................................................................. 41
Table 2. Vehicle object fields for NextBus AVL XML files ........................................................ 44
Table 3. Closure object fields for Road Restriction XML files .................................................... 45
Table 4. Objects and fields in weather JSON files ....................................................................... 47
Table 5. Processed GTFS Database table information ................................................................. 53
Table 6. List of possible variables, attributes, by data sources ..................................................... 78
Table 7. Variables for route level basic model ............................................................................. 83
Table 8. Variables for network level advanced model analysis .................................................... 84
Table 9. Comparing linear and RBF kernels for support vector machine .................................... 94
Table 10. Random forest performances with increasing number of trees .................................. 107
Table 11. A sample of simulated and scheduled trip summary data .......................................... 124
Table 12. A sample of observed trip summary data from test data set ....................................... 125
Table 13. Detailed summary of open data sources ..................................................................... 134
Table 14. Comparison of route-level running speed models for 34-Eglinton East .................... 136
Table 15. Comparison of route-level running speed models for 54-Lawrence East .................. 137
Table 16. Comparison of route-level running speed models for 504-King ................................ 137
Table 17. Comparison of route-level running speed models for 512-St. Clair ........................... 138
Table 18. Comparison of five types of network-level running speed models ............................ 139
List of Figures
Figure 1. Nexus simulation platform .............................................................................................. 8
Figure 2. Nexus surface simulator automatic modelling pipeline framework design .................. 27
Figure 3. Classes and data object references within the surface transit simulator tool ................ 30
Figure 4. Functional component calls within the surface transit simulator tool ........................... 34
Figure 5. Data collection for archival data .................................................................................... 37
Figure 6. Data collection for real-time online API data................................................................ 39
Figure 7. Data tool objects of the surface transit simulator .......................................................... 40
Figure 8. Data Processing Flow .................................................................................................... 51
Figure 9. Data points that cannot be used to produce trips, left: points with default location, right:
points in trip with no path travelled .............................................................................................. 58
Figure 10. Non-revenue trip for AVL points labelled as King Street shuttle bus ........................ 59
Figure 11. Multiple eastbound and westbound trips (over 10) on St Clair had the same trip
information, labelled as St Clair shuttle bus ................................................................................. 60
Figure 12. Trip forming using SQL database query, top: TTCGPS table containing AVL GPS
points, bottom: TTCGPSTRIPS table containing AVL trips........................................................ 62
Figure 13. Stop sequence trimming to delete stops not traversed by the observed trip ................ 65
Figure 14. Arrival and dwell time determination for a sample AVL trip ..................................... 66
Figure 15. Delay determination for all previous stop times served .............................................. 67
Figure 16. Delay determination for a trip with missed scheduled arrival times ........................... 69
Figure 17. Delay determination for a trip following the trip with missed arrival times ............... 69
Figure 18. Headway determination demonstration ....................................................................... 70
Figure 19. Generation of link data using trip data ........................................................................ 73
Figure 20. Model estimation process flow for running speed model ........................................... 81
Figure 21. Separating and classifying data with support vector machine..................................... 90
Figure 22. Clustering pattern for links for route number 192 ....................................................... 98
Figure 23. Clustering pattern for links for route number 504 ....................................................... 99
Figure 24. Clustering pattern for links for route number 196 ....................................................... 99
Figure 25. Clustering pattern for links for route number 510 ..................................................... 100
Figure 26. Diagram of a simple regression tree model ............................................................... 102
Figure 27. Relative error of regression tree with increasing number of splits ............................ 102
Figure 28. Illustration of a random forest (R. Hänsch & O. Hellwich, 2015) ............................ 105
Figure 29. RMSE decreases and training time increases with increasing number of trees ........ 107
Figure 30. Model estimation process flow for dwell time model ............................................... 113
Figure 31. Simulation procedure process flow ........................................................................... 116
Figure 32. Flowchart of model simulation procedure ................................................................. 120
Figure 33. An example of time-distance diagram for Route 192 Airport Rocket NB ................ 127
Figure 34. Examples of route speed histograms for Route 192 Airport Rocket NB .................. 128
Figure 35. Examples of stop delay distribution curve ................................................................ 129
Figure 36. The Toronto Transit Commission downtown network map (Toronto Transit
Commission, 2017c) ................................................................................................................... 131
Figure 37. Dwell time model parameters at stops on coloured bubble map ............................... 142
Figure 38. Observed dwell times at stops on coloured bubble map ........................................... 144
Figure 39. Predicted dwell times at stops on coloured bubble map ............................................ 145
Figure 40. Chi-square goodness of fit test for dwell time predictions ........................................ 146
Figure 41. Time distance diagrams, 504 King WB and EB, for scheduled, observed and
simulated trips, simulated using Random Forest running speed model ...................................... 148
Figure 42. Route speed validation for four major TTC routes, simulated using Random Forest
running speed model ................................................................................................................... 149
Figure 43. Stop delay validation for four routes at 14 major TTC transfer stops, simulated using
Random Forest running speed model ......................................................................................... 151
Figure 44. Time distance diagrams, 504 King WB and EB, for schedule, observed and simulated
trips, simulated using Linear Mixed Effects running speed model ............................................ 153
Figure 45. Route speed validation for four major TTC routes, simulated using Linear Mixed
Effects running speed model ....................................................................................................... 154
Figure 46. Stop delay validation for four routes at 14 major TTC transfer stops, simulated using
Linear Mixed Effects running speed model ................................................................................ 156
Glossary
Modelling Terms
• ROW: right of way; in transit, the area through which transit vehicles pass.
• Dedicated ROW: an area through which only transit vehicles may pass.
• Shared ROW: a shared area of road that both transit vehicles and other vehicles may occupy.
• Streetcar: light rail transit vehicles operating in city streets.
• Subway: heavy rail vehicles operating underground.
• AVL: automatic vehicle location systems, which determine the locations of transit vehicles.
• GTFS: General Transit Feed Specification is a common format for public transportation
schedules. It contains geographical information regarding each transit trip and schedule
(Google Inc., 2016).
• Transit Trip: a sequence of stops along a route that occur at specific times (Google Inc.,
2016).
• Transit Route: a group of trips organized as a single service (Google Inc., 2016).
• Transit Stop: locations where passengers board or alight from transit vehicles (Google Inc.,
2016).
• Vehicle Block: the complete itinerary of a transit vehicle for one day, including revenue
and non-revenue trips. Vehicle blocks in the context of GTFS data include only the revenue
trips since the GTFS schedules do not contain non-revenue trip information (Google Inc.,
2016).
• GTFS Shapes: lines with a sequence of geographical location data to represent transit
routes on a map (Google Inc., 2016).
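The GTFS relationships defined above (routes grouping trips, trips visiting stops at scheduled times, blocks tying trips to one vehicle's itinerary) can be sketched with simple data classes. This is an illustrative model only; the field names echo GTFS conventions but are not the thesis's code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StopTime:
    stop_id: str        # Transit Stop served by the trip
    arrival_time: str   # scheduled arrival, "HH:MM:SS"

@dataclass
class Trip:
    trip_id: str
    block_id: str       # Vehicle Block this trip belongs to
    stop_times: List[StopTime] = field(default_factory=list)

@dataclass
class Route:
    route_id: str
    trips: List[Trip] = field(default_factory=list)  # a route is a group of trips

# A hypothetical route with one trip serving two stops:
trip = Trip("t1", "block7", [StopTime("s1", "08:00:00"), StopTime("s2", "08:05:00")])
route = Route("504", [trip])
```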
Programming Terms
• API: application programming interface is a set of clearly defined methods of
communication between software components.
• Web API: an interface that uses Hypertext Transfer Protocol (HTTP) request messages to
retrieve response messages. Common response message formats are XML, JSON, and
CSV.
• CSV: comma-separated value format is a commonly used data storage format for tables of
values where column values are separated by a comma and rows are separated by newlines.
• XML: Extensible Markup Language is a language that defines a set of encoding rules for
documents in a way that is both human-readable and machine-readable. A key
characteristic of the XML format is the use of begin tag <aTag> and end tag </aTag>.
• JSON: JavaScript Object Notation is a data object transmission standard popularized by
JavaScript, which is both human-readable and machine-readable. A key characteristic of
the JSON format is the use of curly brackets and colons to define objects.
• Database: a collection of data organized for fast access by an application. It usually
contains a set of tables with defined data columns.
• Database Table: a matrix of related data with defined data columns and data types. Often,
a key can be defined for a table column for fast data row retrieval. Properties such as unique
or default values can be defined for convenient data insertions.
• SQL: Structured Query Language, the family of programming languages used to execute
commands on a database. Common SQL commands include CREATE, DROP, INSERT,
SELECT, UPDATE, and VACUUM.
• SQLite: a SQL database engine optimized for file-based database operations; in
particular, it does not require a separate client-server process.
• File: a collection of data stored, usually on disk. A file may store data in different formats,
such as binary, text, CSV, JSON, XML, etc.
• Memory: a short-term, volatile data storage medium.
• Disk: a long-term, non-volatile data storage device.
• Data Type: a data type can be a primitive, such as string (text), int (integer), long (large
integer), or double (decimal number). It may also be a complex type that contains
many data types defined within a class.
• Class: an extensible program-code template for creating objects. It defines the data types as
well as the procedures (methods) within the class.
• Method: detailed instructions to perform a certain procedure.
• Data Object: an in-memory object containing data of a specific primitive or class type.
• List: a type of object that contains many items of the same data type.
• Dictionary: a type of object that contains values of many items of the same data type, with
each item having a key that can be used for locating the item.
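To make the three Web API response formats above concrete, the following sketch parses one hypothetical vehicle record expressed in CSV, XML, and JSON using Python's standard library (all field names and values are invented for illustration):

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical vehicle record in the three common response formats:
csv_text = "vehicle_id,route,lat,lon\n4012,504,43.6487,-79.3975\n"
xml_text = '<vehicle id="4012" routeTag="504" lat="43.6487" lon="-79.3975"/>'
json_text = '{"vehicle_id": "4012", "route": "504", "lat": 43.6487, "lon": -79.3975}'

# CSV: rows of comma-separated column values under a header row
row = next(csv.DictReader(io.StringIO(csv_text)))

# XML: begin/end (here self-closing) tags carrying attributes
elem = ET.fromstring(xml_text)

# JSON: curly brackets and colons defining an object
obj = json.loads(json_text)

assert row["vehicle_id"] == elem.get("id") == obj["vehicle_id"] == "4012"
```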
Chapter 1
Introduction
In an era when cities are growing at an ever-increasing rate while infrastructure
improvements lag behind, transit agencies struggle to maintain service reliability in the face of
chronic congestion. Maintaining service reliability is even more difficult for routes running at
capacity with short headways, which leave transit agencies few options to recover from heavy
congestion and long delays. Furthermore, these delays from surface transit can
cause irregular crowding at major stations, spreading the effects of surface disruptions to other
parts of the transit network such as transit stations, as well as subway and rail systems. Currently,
transit agencies handle these service disruptions and irregularities in an ad-hoc fashion. This is
partly due to a lack of analytical tools to model and analyse network-level impacts of response
strategies. As such, the cascading effects of delays due to surface transit congestion on a
multimodal transit network are rarely quantified and thus poorly addressed. The difficulty of
operating reliable transit service in a congested network is evident in the Toronto Transit
Commission's (TTC) network. According to the TTC's performance scorecard for May 2017, only 58.4%
of the streetcars and 75.7% of the buses departed on-time, falling short of the 90% target TTC had
set for itself (Toronto Transit Commission, 2017a).
Microsimulation models are commonly used to assess transit operational performance on
a few selected routes. However, when the impact of policy decisions and strategies need to be
evaluated at the network level, microsimulation models are too resource intensive to build and
computationally intensive to calibrate (Casas, Perarnau, & Torday, 2011). Calculation of vehicle
trajectories using car-following models become intractable when the network is large and
congested. A new method of simulation is needed for the evaluation of large-scale transit networks.
The adoption of modern transit technologies such as automatic vehicle location (AVL),
automatic passenger counter (APC), and automatic fare collection (AFC) systems provides transit
agencies with a large quantity of data and the opportunity to develop data-driven transit models
(Wilson, 2016). Existing transit simulation models have a primary focus on characterizing the
passenger demand using survey data and AFC data with the assumption that all transit vehicles
operate according to schedule (Gaudette, Chapleau, & Spurr, 2016; Kucirek, 2012; Weiss,
Mahmoud, Kucirek, & Habib, 2014). Based on TTC’s performance score, running transit services
to schedule in a congested network such as Toronto is very difficult to achieve (Toronto Transit
Commission, 2017a). As such, many other studies account for the various effects that influence
transit travel times using machine learning models in real-time bus arrival prediction algorithms
(Bai, Peng, Lu, & Sun, 2015; X. Chen, Liu, Xia, & Chien, 2004; Elhenawy, Chen, & Rakha, 2014;
Farid, Christofa, & Paget-Seekins, 2016; Kormaksson, Barbosa, Vieira, & Zadrozny, 2014;
Rashidi, Ranjitkar, & Hadas, 2014; Shalaby & Farhan, 2004; Yu, Yang, & Yao, 2006). These bus
travel time prediction techniques are capable of representing the stochastic behavior of transit
travel times in a congested network. However, they have not previously been applied in a large-
scale transit simulation application.
This study presents an innovative approach to model transit vehicle movement using open
data sources and machine learning algorithms. This approach extracts and combines various data
sources, trains statistical models, and simulates surface transit movement. More importantly, data-
driven models can be trained on recent data to more accurately represent the various impacts on
transit operations. Data-driven transit simulation models such as the one presented in this thesis
can help transit agencies evaluate their operational strategies and plans using the data produced by
modern transit technologies.
1.1 Thesis Objectives
The primary objective of this thesis is to model and simulate transit vehicle movements for
large-scale networks accurately and efficiently, utilizing big data and open data. In particular, the
aim of this thesis is to identify and evaluate appropriate methods for travel time and dwell time
models, which would characterize the mesoscopic transit vehicle movements under dynamic
operating conditions. This type of modelling technique would allow for assessment of transit
operational performance at the network level. By utilizing open data, these models can be quickly
updated for real-time decision support. A secondary objective of the data-driven mesoscopic transit
simulator is to integrate with multimodal simulation platforms to assess network level impacts
across different modes of public transport. These objectives can be achieved by developing the
surface simulator as a modular automatic modelling pipeline with the use of open source machine
learning software packages. This thesis demonstrates the capability of the simulator to replicate
surface transit movements for the Toronto Transit Commission network.
1.2 Surface Transit Modelling Approach
An appropriate modelling approach was developed to achieve the study objectives. This
thesis uses multiple sources of open data and open-source statistical models to identify the factors
influencing transit movements. These factors were used to develop the transit model, which was
then used as the engine to perform surface transit simulation.
1.2.1 Open Data and Big Data
Recent progress in public transit network modelling has been made possible using
automatic data collection systems (ADCS), general transit feed specification (GTFS), and other
transportation-related open data such as weather, road restriction and traffic intersection data. The
advances in public transit data have enabled the use of statistical learning methods that require
large quantities of data.
ADCS, including automatic vehicle location systems (AVL), automatic passenger counting
systems (APC), and automatic fare collection systems (AFC), are capable of collecting real-time
data and archiving historical data (Wilson, 2016). Bus locations collected by GPS-based AVL
systems are readily available in real time, and a large sample of this data can be collected by
ADCS at low marginal cost. Bus location data can then be used to model transit vehicle speeds. If
available at the network level, APC and AFC data can be valuable in modelling passenger demand
and enable more realistic representation of passenger movements across a transit network. The
AVL bus location data is sufficient for determining the travel speed of vehicles along a transit route;
however, to quantify the effects of various factors on transit travel speeds, additional open data
sources are required.
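As an illustrative sketch of how AVL GPS pings yield travel speeds, the average speed between two consecutive timestamped points can be computed from a great-circle (haversine) distance. The coordinates and timestamps below are hypothetical, and the thesis's actual processing (Chapter 5) is more involved:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two GPS points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def speed_kmh(p1, p2):
    """Average speed between two timestamped AVL points (t in seconds)."""
    dist_km = haversine_km(p1["lat"], p1["lon"], p2["lat"], p2["lon"])
    dt_h = (p2["t"] - p1["t"]) / 3600.0
    return dist_km / dt_h

# Two hypothetical AVL pings 60 s apart along a downtown street:
p1 = {"lat": 43.6487, "lon": -79.3975, "t": 0}
p2 = {"lat": 43.6500, "lon": -79.3900, "t": 60}
print(round(speed_kmh(p1, p2), 1))  # roughly 37 km/h for this pair
```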
To fully uncover the wealth of transit travel patterns from AVL data and enable their use
beyond descriptive statistics, key transit operational characteristics must be identified using transit
schedules. GTFS is a standardized open data format for transit schedule data (Goldstein & Dyson,
2013). GTFS data contains trip, schedule, route, and stop information for the entire transit
network. Using the GTFS transit schedule data, the AVL bus location data of a route can be
matched to its GTFS schedule data. This enables the computation of many transit operational
characteristics, including stop times at all the stops along the route, headways at stops, and delays
at stops.
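As a hedged illustration of that computation (the full matching pipeline is described in Chapter 5), once observed AVL arrivals at a stop are paired with their scheduled times, delays and headways follow from simple differences. All values below are hypothetical:

```python
# Matched observed vs. scheduled arrival times (seconds after midnight)
# at one stop, in order of arrival -- values are invented for illustration.
arrivals = [
    {"trip": "t1", "observed": 28920, "scheduled": 28800},  # 08:02 vs 08:00
    {"trip": "t2", "observed": 29400, "scheduled": 29160},  # 08:10 vs 08:06
]

# Delay at the stop: observed minus scheduled arrival time (s).
for a in arrivals:
    a["delay"] = a["observed"] - a["scheduled"]

# Headway at the stop: time between consecutive observed arrivals (s).
headways = [b["observed"] - a["observed"] for a, b in zip(arrivals, arrivals[1:])]

print([a["delay"] for a in arrivals], headways)  # [120, 240] [480]
```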
To understand the effects of additional factors such as weather, road restrictions, and
attributes of traffic intersections on transit movements, additional transportation-related data
sources must be matched to the AVL data. This thesis uses weather data, road restriction data,
and traffic intersection data to identify numerous link and route characteristics that can impact
transit movements.
By incorporating these additional open data sources, the policy-relevant factors affecting surface transit travel times can be assessed: the number of signalized intersections, the number of vehicular turns, the presence of a dedicated transit lane, the presence of transit signal priority, and transit stop locations. These factors make the model policy sensitive, allowing it to assess the degree of impact that policy changes can have on transit performance.
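For illustration, the policy-relevant attributes above can be assembled into one structured record per link before model estimation. The field names below are hypothetical stand-ins, not the actual variable names used in this study:

```python
from dataclasses import dataclass

@dataclass
class LinkFeatures:
    """Illustrative structured attributes for one transit link."""
    length_m: float         # link length
    n_signals: int          # signalized intersections on the link
    n_turns: int            # vehicular turns made by the route
    has_transit_lane: bool  # dedicated transit lane present
    has_tsp: bool           # transit signal priority present
    precip_mm: float        # precipitation from the weather feed
    road_restriction: bool  # active road restriction on the link

def to_vector(f: LinkFeatures):
    """Flatten the record into a numeric vector for a regression model."""
    return [f.length_m, f.n_signals, f.n_turns,
            int(f.has_transit_lane), int(f.has_tsp),
            f.precip_mm, int(f.road_restriction)]
```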
1.2.2 Link Representation
The method used for data processing depends on how transit movements are represented, and thus a consistent link representation needs to be established early in this study. A model of transit movement must capture both running speed and dwell time (Padmanaban, Vanajakshi, & Subramanian, 2009). Because running speed and dwell time
of a transit vehicle are associated with the transit link the vehicle is traversing, an accurate
modelling of transit speeds and travel times requires a consistent representation of the link. There
are a few options available for representing links: by terminal stations, by time points, and by
stops.
Terminal-station-based segments, or route-level segments, provide better prediction accuracy for average speeds than stop-based segments, because shorter links exhibit higher speed variation (W. X. Hu & Shalaby, 2017). However, route-level segments are not
suitable for this study because they cannot provide arrival patterns at major subway stations for
transit lines that do not terminate at these stations (M. Chen, Yu, Zhang, & Guo, 2009). The ability
to generate transit vehicle arrival patterns at major subway stations is important for this study.
Time-point-based segments can be a suitable approach and were used by many previous
studies (X. Chen et al., 2004; Yu et al., 2006). However, time points are generally not marked in
GTFS schedule data, and the determination of time points is not standardized across transit agencies (City of Toronto, 2017a). This makes time-point-based segments difficult to implement
in a consistent way. In addition, the process of aggregating multiple stops into a single link can exclude stops of interest, which creates issues during simulation when attempting to produce arrival patterns at key stations. With key stops excluded, it is difficult to assess policies such as stop relocation, stop removal, and stop addition, since many stop locations are aggregated into one time-point-based segment.
The use of stop-based links is not common in transit travel time studies, since achieving high model fitness and accuracy is difficult: shorter links are subject to a higher degree of random effects on travel time (W. X. Hu & Shalaby, 2017).
However, there are several benefits to stop-based links. Firstly, stop-based links represent the movement of transit vehicles between stops along an entire route. This provides greater simulation
fidelity. Secondly, the modelling of stop-based links enables detailed simulation of transit
passenger travel demand from origin to destination, when such information is available. Passenger
demand is an important aspect of transit models. Finally, stop-based link representations are
consistent with those used by standard transit schedule data such as GTFS. This allows for a
consistent mapping between GTFS schedule times and the calculated AVL stop times, which can
be advantageous in evaluating transit services across stops and stations. This thesis will explore
the use of stop-based link representation for transit models.
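As a sketch of this representation, the ordered stop times of a GTFS trip convert directly into stop-based links, each carrying its scheduled traversal time (illustrative code, not the thesis implementation):

```python
def schedule_link_times(stop_times):
    """Build stop-based links from GTFS-style stop times.

    stop_times: ordered list of (stop_id, seconds_past_midnight) for one
    trip. Returns one record per consecutive stop pair with the
    scheduled traversal time in seconds.
    """
    links = []
    for (s1, t1), (s2, t2) in zip(stop_times, stop_times[1:]):
        links.append({"from": s1, "to": s2, "sched_sec": t2 - t1})
    return links
```

The same link keys can then index the AVL-derived observed times, giving the consistent GTFS-to-AVL mapping described above.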
1.2.3 Statistical Models for Transit Movements
With the goal of providing an accurate representation of transit travel patterns, structured data obtained from various open data sources were used to estimate running speed and dwell time models. To estimate these models, the raw, unstructured data must first be processed into a list of attributes that can serve as model variables; these constitute the structured data. Once the structured data were obtained through data processing, the modelling methods for the running speed and dwell time models were determined and evaluated.
In the field of machine learning, model accuracy, model training time, and model prediction time are important criteria for assessing the suitability of a method (Lim, Loh, & Shih, 2000). Since this study uses various machine learning methods to model transit travel times, model selection will be based on a trade-off between accurate representation, training time, and simulation time. More complex modelling methods may allow more flexibility
in model representation and improve model fitness, but such flexibility can impose a heavier
computational cost in model training and in generating predictions for simulation (Caruana &
Niculescu-Mizil, 2006; Lim et al., 2000). Balancing these trade-offs is important in producing
useful models, especially for application in large-scale networks.
While there are many previous studies on travel time models for transit vehicles, they
generally model up to several transit routes (Bai et al., 2015; X. Chen et al., 2004; Elhenawy et al.,
2014; Farid et al., 2016; Kormaksson et al., 2014; Rashidi et al., 2014; Shalaby & Farhan, 2004;
Yu et al., 2006). For the estimation of a large-scale transit running speed model involving a large
number of routes and trips, data clustering can become an issue since particular transit links and
route environments may have significant differences in running speed (Lee, Si, Chen, & Chen,
2012). An important aspect of this thesis is to properly characterize the data clustering effects on
running speeds due to link attributes. Providing an accurate representation of running speed, with feasible training time and an efficient simulation method, is critical to the simulation of surface transit for Nexus.
1.2.4 Simulation Demonstration
Stop-based running speed and dwell time models were used as the engine to drive the simulation of transit vehicles travelling across a series of transit links. To demonstrate the capability of the surface transit simulation engine, all transit routes over the study period were simulated based on the scheduled release of vehicles from the terminal station for the first trip of each vehicle block. The simulation case study demonstrates the capabilities of data-driven transit simulation models.
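The core of such an engine can be sketched as a loop that alternates link running times and stop dwell times. Here the speed model is a placeholder for the fitted model, and the lognormal dwell parameters are purely illustrative:

```python
import random

def simulate_trip(links, speed_model, rng):
    """Simulate one vehicle trip stop to stop.

    links: list of dicts with a 'length_m' entry (plus any features the
    speed model needs). speed_model: callable link -> running speed in
    m/s, standing in for the fitted model. Dwell times are drawn from a
    lognormal distribution with illustrative parameters.
    Returns cumulative arrival times (s) at each downstream stop.
    """
    t = 0.0
    arrivals = []
    for link in links:
        t += link["length_m"] / speed_model(link)  # running time
        t += rng.lognormvariate(2.5, 0.5)          # dwell time at stop
        arrivals.append(t)
    return arrivals
```

Repeating this loop for every trip released from its terminal according to schedule yields simulated arrival patterns across the network.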
1.2.5 Toronto Transit Commission Case Study
The Toronto Transit Commission (TTC) bus and streetcar network was chosen to illustrate the ability of large-scale statistical running speed and dwell time models, with stop-based link representation and open data, to accurately represent the travel patterns of surface transit vehicles. This study uses the TTC GTFS schedule data, TTC NextBus AVL real-time data,
Toronto intersection data, Toronto road restriction data, and Toronto weather data to characterize
the attributes and conditions of transit travel. The statistical models were evaluated on their ability
to predict running speeds and dwell times across the entire network for simulation purposes. The
variables used for the statistical models were based on the available open data sources and their
statistical significance; the statistical models can be extended in future studies. The framework
used in this study can be applied to other transit networks as well.
1.3 The Nexus Platform
The capability of the data-driven surface transit model can be extended when used in
conjunction with a multimodal transit simulation platform such as the Nexus simulation platform.
The Nexus simulation platform, a high-fidelity multimodal transit modelling system currently
under development, is capable of representing the dynamic behaviour of transit lines, stations, and
passenger travel behaviour (Srikukenthiran, 2015). By connecting train, station and surface transit
simulators, Nexus can model the complete journey of any transit passenger traversing through a
multimodal transit network. This enables the Nexus platform to model the interaction between
passenger movements, trains, and surface transit vehicles; this provides a representation of the
performance at major transit transfer locations. The interactions between different specialized
simulators are shown in Figure 1. The modular design of the Nexus platform allows for the
independent development of models for different parts of the network, which is advantageous in
the construction of large-scale network models. The data-driven surface transit model should likewise be designed to be modular, allowing its integration with the Nexus simulation platform.
Figure 1. Nexus simulation platform
1.4 Thesis Outline
There are nine chapters in this thesis. Following this introductory chapter (Chapter 1), a
literature review of state-of-the-art practices in transit and transportation simulation is presented
in Chapter 2. A detailed exploration of different simulation models for transit provides a deeper
understanding of the need for a data-driven mesoscopic model for large-scale transit networks. In
addition, the various methods used to estimate the statistical transit models are reviewed to
establish a list of potential modelling methods.
The framework used in the implementation of the large-scale transit simulation model is
detailed in Chapter 3. After detailing the various components of the programmed implementation that enable data collection, data processing, model estimation, and model simulation, the thesis illustrates the algorithms used by each of these components.
The development of a large-scale transit model requires a method for data collection
(Chapter 4). The use of online application programming interfaces (APIs) and multiple program
threads for simultaneous real-time data collection was critical to the success in harnessing
information from multiple open data sources. After the completion of data collection, data
processing of open data from real-time and archival sources enabled the fusion of multiple data
sources using spatiotemporal map matching techniques (Chapter 5).
Using the structured data obtained via data processing, several machine learning algorithms
were used to estimate and evaluate the potential statistical learning models (Chapter 6). The
recommended modelling method is based on the ability of the model to more accurately predict
the running speeds and dwell times of the vehicles across the transit network. To demonstrate the
implementation of the recommended models as the transit vehicle movement engine for simulation
applications, a procedure used to generate simulated trips based on a base case simulation scenario
was developed (Chapter 7). Lastly, the modelling and simulation demonstration results are presented for a case study of the TTC network (Chapter 8). The case study demonstrates the
possible application of the statistical transit running speed and dwell time models for the Nexus
platform.
Finally, the closing chapter provides a summary of the methods, results, and findings of the thesis. The thesis ends with suggestions of possible extensions to this study and establishes potential future research directions (Chapter 9).
Chapter 2
Literature Review
The modelling and simulation of transit vehicles for large networks is a challenging task due to high computational requirements. Often, studies on transit modelling have focused on isolated areas of networks where the factors affecting transit movement were specific to a limited number of routes (Bai et al., 2015; Yu, Lam, & Tam, 2011). Since different transit modelling approaches impose different computational requirements, selecting an appropriate modelling method is critical to the success of the simulation framework when the network is large.
Expanding on the surface transit modelling approach we identified in the introduction, the
various methods used by previous studies are explored. Firstly, the trade-off between the level of
detail and computational requirements for traffic and transit modelling is discussed. Then, the
statistical methods used for transit travel time models are comprehensively reviewed. In addition,
the methods used by previous studies to model transit using open data such as GTFS and AVL
provide an in-depth understanding of the benefits of combining data sets to enhance models and
analyses. Finally, an appropriate method that addresses the modelling needs of a large-scale transit
network is presented.
2.1 Simulation Level of Detail
The level of detail required in simulation models depends on the nature of the research
study. There are three major types of simulation models as classified by the level of detail:
macroscopic, microscopic, and mesoscopic models. In the following sections, the abilities of various simulation models to address research questions at different levels of modelling detail are discussed.
2.1.1 Macroscopic Models
Macroscopic models describe traffic flow in aggregate variables such as flow, density and
speed and they include a lower number of parameters than microscopic models (Spiliopoulou,
Kontorinaki, Papageorgiou, & Kopelias, 2014). There are three main categories of previous studies
estimating traffic states: replicating traffic flow along congested freeway segments, using field
data and recursive filtering algorithms to dynamically adjust analytical traffic flow models, and
employing statistical or machine learning algorithm to generate short term predictions of traffic
states (Yang, Haghani, & Qiao, 2013). Like other models, macroscopic models require proper calibration to generate realistic traffic states. Given the complex nature of transit systems, particularly transit vehicles stopping for passengers in different lanes and road segments, macroscopic models are less popular for transit applications. More importantly, traditional macroscopic flow
models do not track the movements of individual vehicles and this limitation makes modelling of
transit movements difficult.
2.1.2 Microscopic Simulation Models
In contrast to macroscopic models, microscopic models simulate traffic at a high level of detail using models such as the car-following model of traffic flow. Microscopic models can simulate individual vehicles and their detailed movements. Previous studies demonstrated the
capabilities of microscopic simulation software, Paramics, in simulating various interactions
between transit vehicles, passengers, and traffic, such as bus holding, active transit signal priority
for buses, skip-stop operations, and bus station passenger transfers (Fernandez, Cortes, & Burgos,
2010). However, the vehicle routing algorithms of microscopic simulations suffer from a nonlinear increase in computational requirements as network size increases (Cortés, Pagès, & Jayakrishnan,
2005). To overcome this limitation, mesoscopic simulation tools are used for vehicle route
decisions (Cortés et al., 2005). While microscopic simulation models generate a high level of
detail, the calibration of large networks is very difficult and computational requirements can be prohibitive.
2.1.3 Mesoscopic Simulation Models
While macroscopic models may lack the level of detail required for transit modelling and microscopic models may not be suitable for large-scale applications, mesoscopic models can be a suitable alternative that fulfils both of these needs. Mesoscopic simulation models represent individual vehicles but treat each roadway segment as a queue; lanes are not represented explicitly (Cats, Burghout, Toledo, & Koutsopoulos, 2010). One example of a mesoscopic simulation platform is Mezzo. Unlike microscopic simulation models, the links of Mezzo models have a running part that contains vehicles not delayed by downstream congestion and a queueing part that extends upstream from the end of the link when capacity is exceeded; the exit time of vehicles entering the queue is calculated as a function of the density in the running part (Cats et al., 2010).
Mesoscopic models like Mezzo can simulate bus routes and collect stop-level statistics and are
suitable alternatives for large-scale applications.
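A toy version of a density-dependent link traversal time conveys the idea behind such queue-based links; the linear speed-density relation below is a Greenshields-style stand-in, not Mezzo's actual function:

```python
def link_exit_time(enter_time, length_m, free_speed, density, jam_density):
    """Illustrative mesoscopic link traversal: speed on the running part
    falls linearly with density, so the exit time of an entering vehicle
    is a function of the current density on the link."""
    frac = min(density / jam_density, 0.95)  # cap to keep speed positive
    speed = free_speed * (1.0 - frac)
    return enter_time + length_m / speed
```

On an empty 300 m link with a 10 m/s free speed the traversal takes 30 s; at half the jam density it doubles to 60 s.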
2.1.4 Determining Simulation Level of Detail
The level of detail required in simulation models depends on the nature of the research
study. For instance, macroscopic models often address the relationships between flow, density and
speed but do not have the ability to trace individual vehicles (Spiliopoulou et al., 2014). In contrast, microscopic simulation models based on the car-following model provide detailed vehicle trajectories of all vehicles in the network at every time step but impose a higher computational
requirement (Cortés et al., 2005). Finally, mesoscopic simulation platforms such as Mezzo use a queueing model to represent the movement of particular vehicles along transit links, balancing the need to trace specific vehicles against computational requirements (Cats et al., 2010). The level of
detail of the models can impact computational requirements as well as the type of analysis that can
be performed. For studies interested in roadway capacity analysis and the identification of traffic bottlenecks, a macroscopic model is most suitable and most efficient. Capacity analysis alone, however, may not be sufficient for studies requiring the characterization of vehicle trajectories. In the
case of public transit modelling, microscopic simulation is preferred when computational
requirements are not a concern, such as for smaller networks (Cortés et al., 2005). However,
mesoscopic simulation may be desired if only a subset of vehicle trajectories must be determined,
such as in the case for large-scale transit simulations (Cats et al., 2010).
2.2 Simulation Models for Transit
As open data sources such as AVL, APC, and AFC data became more readily available from various transit agencies, a standard format was needed for transit data (Wilson, 2016). The
Google Transit Feed Specification (GTFS) common format enhances the interoperability of transit
data, thus allowing the development of applications using these data (Catala, Dowling, &
Hayward, 2011). In the following two sections, 2.2.1 and 2.2.2, the current state of the art in the
use of open data sources for transit simulation are discussed. These simulation models currently
place an emphasis on passenger demand modelling but do not model the stochastic travel time
behavior of surface transit vehicles, instead assuming transit vehicles proceed according to
schedule.
2.2.1 MATSim Model using GTFS and TTS
While GTFS data had previously been used to develop traveller information applications,
Kucirek has built a multimodal network model of the Greater Toronto and Hamilton Area (GTHA)
using GTFS data and Transportation Tomorrow Survey (TTS) data (Kucirek, 2012). The GTFS
data allowed Kucirek to construct scheduled vehicle trips. Kucirek used MATSim, a Java-based open-source simulation platform, to construct the simulation network based on GTFS data
(Kucirek, 2012). Kucirek converted an existing EMME2 model of the GTHA into the MATSim
network, creating the network nodes and links properties in the MATSim network (Kucirek, 2012).
More importantly, Kucirek used the semi-automated map-matching procedure developed by
Ordonez & Erath (Ordonez & Erath, 2011). A notable problem with Ordonez & Erath’s method is
the mismatch between the locations of GTFS bus stops and the simulation model's bus stops. To
resolve this, Kucirek referenced multiple GTFS stops to a single geo-coded network link (Kucirek,
2012). Using the model constructed in MATSim, Kucirek demonstrated that MATSim can be configured to include the effects of congestion in transit routing, and that it can assign traffic and transit volumes (Kucirek, 2012).
Building upon Kucirek’s work on transit assignment, Weiss proposed a dynamic
multimodal assignment using a similar model and approach (Weiss et al., 2014). Using GTFS data
and TTS data, Weiss constructed a multimodal network with some improvements. Firstly, he proposed additional solutions to the mismatched bus stop problem of Ordonez & Erath's method: increasing the resolution of the simulation network to match that of the GTFS data, and removing stops to match the lower resolution of the simulation network (Weiss et al., 2014). This
results in a better definition of transit stops. He defined different types of GTFS stop clusters and
computed an equivalent multimodal network solution, accurately identifying different bus stop types: transfer stations, intersection stations, and intermediate stops (Weiss et al., 2014).
Additionally, he reduced the size of the routing search space by proposing a more detailed
automated router network generation algorithm to consider more realistic transit transfers; this
reduced computation effort and generated more logical network routing results (Weiss et al.,
2014). Finally, Weiss noted a few issues with this assignment framework for the GTHA: the use of a flat fare for all transit agencies generated unrealistically high assignments on some transit lines; traveller demand was over-predicted on specific stretches of highways with road pricing; and trip-chaining behaviour, particularly between modes, was absent from the model (Weiss et al., 2014).
2.2.2 MATSim Model using GTFS and Smart Card Data
Rather than using retrospective telephone survey data such as the TTS, Gaudette
demonstrated the use of a highly detailed public transit microsimulation model using GTFS and
tap-in smart card data from the Société de transport de Laval (STL) bus network (Gaudette et al.,
2016). The use of smart card data allowed Gaudette to couple transit vehicle trips with passenger
actions; therefore, the passenger transfer behaviour can be represented (Gaudette et al., 2016).
Due to the moving fare box and lack of tap-out smart card data in Laval, determining the
exact boarding and alighting locations of passengers was challenging. Gaudette used four methods to match fare box transactions to boarding locations: using subway station fare boxes as anchors, matching vehicle blocks, matching route-time windows, and manual adjustments (Gaudette et al., 2016). After the boarding location was determined, the trip was classified as either a transfer from a subway trip or an initial boarding (Gaudette et al., 2016). For the alighting location, the transaction sequence of the smart card was used to determine where the user returned to the transit system; if the transaction sequence provided insufficient information, a proportional distribution calculation was used to determine the bus alighting location (Gaudette et al., 2016).
The benefit of population level data sets such as the smart card data is that they allow for
accurate analysis of low-ridership lines for which insufficient samples exist for a conventional
survey such as the TTS. Gaudette validated the results of the model using APC data. However, trip chaining between different modes, such as park-and-ride or kiss-and-ride, was still not captured.
Further research is needed to enrich data sources to provide true origin and destination points
(Gaudette et al., 2016).
2.2.3 Limitations of Simulation Models
Simulation models have been the standard of practice for assessing traffic congestion and transit operations since they provide a way for researchers to perform scenario analysis without the need to run controlled experiments in real life (Weiss et al., 2014). The reliability of these models
depends on the careful calibration of several components of the models such as travel demand
models (generation of the origin-destination matrix), traffic and transit assignment models, mode
choice models, and car-following or queueing models. The calibration of these various components within the simulation model can be difficult and requires extensive and expensive surveys (Kucirek, 2012).
Recent transit simulation models have advanced with the use of GTFS, smart card, and APC data. However, these microscopic models rely heavily on the assumption of on-time transit vehicle operations. As a result, they lack a stochastic representation of transit travel times due to the
effects of temporal variation, roadway incidents, weather, route characteristics, and link
characteristics. The primary focus of existing transit simulation models has been on passenger
demand and transit assignments; however, surface vehicle movement needs to be characterized in
order to determine passenger boarding and alighting at major transfer stations (Gaudette et al.,
2016). One of the key objectives of this thesis is to provide an accurate and efficient representation
of surface transit travel times across the entire transit network using existing data. Statistical
learning methods provide a means of achieving this goal. As such, various statistical learning
models for characterizing transit travel times are investigated in the following section.
2.3 Statistical Learning Models
With the advent of the global positioning system (GPS) based automatic vehicle location
(AVL) system, automatic passenger counter (APC) and other intelligent transportation systems
(ITS) for transit vehicles, statistical models including those based on machine learning algorithms
have been used by many previous studies for bus-arrival time and travel time prediction models
(Bai et al., 2015; X. Chen et al., 2004; Elhenawy et al., 2014; Farid et al., 2016; Kormaksson et
al., 2014; Rashidi et al., 2014; Shalaby & Farhan, 2004; Yu et al., 2006). These regression models
provide travel time predictions between stops and dwell times at stops and can be a
computationally efficient way to characterize the various effects on transit travel times. However,
they have not previously been applied in the setting of large-scale transit simulation, as is
performed in this thesis. In the following sections, 2.3.1 to 2.3.7, various modelling methods are reviewed to form an understanding of existing state-of-the-art statistical learning models for transit travel times. The limitations of each model in previous studies are assessed to determine
appropriate statistical learning models for use in this thesis.
2.3.1 Multiple Linear Regression
Multiple linear regression is one of the most commonly used regression models in research
as it reveals the degree of importance of independent variables (Bai et al., 2015). Using a linear
combination of independent predictor variables, multiple linear regression estimates the coefficients of the predictors by minimizing the sum of squared residuals. Multiple linear regression does this based on the fundamental assumptions of linearity, independence, normally distributed errors, and homoscedasticity (Marill, 2004). Early studies on bus travel time used multiple linear regression
to assess travel time and reliability on arterial roads (Polus, 1979). Later studies on bus travel time
compared multiple linear regression to other regression models such as artificial neural networks
and support vector machine (Bai et al., 2015; Jeong & Rilett, 2004). In this study, multiple linear
regression was used as the basis for model comparison when evaluating more advanced regression
models.
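A minimal example of such a fit, using ordinary least squares on synthetic link data (the variables and coefficients here are invented for illustration):

```python
import numpy as np

# Design matrix columns: intercept, number of signals, link length (m).
X = np.array([[1, 0, 400], [1, 1, 400], [1, 2, 600],
              [1, 3, 600], [1, 1, 500], [1, 2, 500]], dtype=float)
true_beta = np.array([9.0, -0.8, 0.002])  # assumed "true" coefficients
y = X @ true_beta                         # noise-free synthetic speeds
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimate
```

With noise-free data the estimated coefficients match the assumed ones exactly; with real AVL data the fit instead minimizes the residual sum of squares over noisy observations.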
2.3.2 Linear Mixed Effects Models
Another class of statistical model used for travel time prediction is the linear mixed effects model. Unlike most fixed-effects-only models such as multiple linear regression, the mixed effects model deals with heteroscedasticity by accounting for correlations due to the grouping of subjects, or repeated measurements on each subject, using random effects parameters (Seltman, 2016). A
varying-intercept linear mixed effects model has been used for a travel time prediction model for four regional bus routes in Rio de Janeiro (Kormaksson et al., 2014). This study found that the addition of a random intercept for each bus ride improved model fitness by correcting interpolation errors due to repeated measurements per bus trip (Kormaksson et al., 2014). Linear mixed effects models can be useful in modelling variables with random effects due to repeated sampling.
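The intuition behind a varying intercept can be sketched by computing each group's mean residual after a fixed-effects fit. A true mixed model would shrink these intercepts toward zero in proportion to group size and variance components; the unshrunk version below is only an illustration:

```python
from collections import defaultdict

def group_intercepts(groups, y, y_hat):
    """Per-group intercepts as mean residuals of repeated measurements,
    e.g. one group per bus trip. groups, y, y_hat are parallel lists of
    group labels, observations, and fixed-effects predictions."""
    resid = defaultdict(list)
    for g, obs, pred in zip(groups, y, y_hat):
        resid[g].append(obs - pred)
    return {g: sum(r) / len(r) for g, r in resid.items()}
```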
2.3.3 Kalman Filter Algorithm
Instead of modelling the effects of many explanatory variables on a response variable, the Kalman filter algorithm models the recursive states of that variable across time (Harvey, 1990). The Kalman filter is a process model that represents cyclic patterns between variables; it is estimated using a linear recursive predictive update algorithm (Shalaby & Farhan, 2004). A bus arrival and departure time prediction model for bus route number 5 in downtown Toronto was developed using the Kalman filter (Shalaby & Farhan, 2004). The running time and dwell time models
used the Kalman filter algorithm, as well as Automatic Vehicle Location (AVL) and Automatic
Passenger Count (APC) data (Shalaby & Farhan, 2004). The model successfully captured the
interaction between running times and dwell times; also, it outperformed the predictive ability of
multiple linear regression and neural network models against real-world data (Shalaby & Farhan,
2004). Since the Kalman filter algorithm is designed to be updated in real time during model run time, it is commonly used for real-time travel time prediction applications rather than for offline transit planning purposes.
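A scalar Kalman filter with a random-walk state model illustrates the recursive predict/update cycle for, say, a link running time; this is a generic sketch rather than the cited study's formulation:

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: prior state estimate and its variance; z: new observation
    (e.g. an AVL-derived running time); q: process noise variance;
    r: measurement noise variance. Returns the updated (x, p).
    """
    p = p + q                # predict: random-walk state model
    k = p / (p + r)          # Kalman gain
    x = x + k * (z - x)      # pull the estimate toward the observation
    p = (1 - k) * p          # reduce uncertainty after the update
    return x, p
```

Fed a stream of observations near 60 s, an initial estimate of 55 s converges toward 60 s while its variance shrinks, which is exactly the real-time updating property noted above.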
2.3.4 Artificial Neural Network
The modelling of the recursive states of travel time using the Kalman filter alone may not sufficiently capture the various effects on travel time. To address this concern, a study using weather and APC data for New Jersey Transit route number 62 demonstrated the combined use of the Kalman filter and artificial neural networks for a bus arrival time prediction model (X. Chen et al., 2004). The
artificial neural networks model was used to predict bus travel time between time points, while the
Kalman filter was used to perform an adjustment on arrival-time estimates for a trip based on latest
travel-time information (X. Chen et al., 2004). The Kalman filter is required because the artificial neural network model cannot dynamically adjust its predictions using the most recent information from the trip, while the artificial neural network model is required to provide an accurate baseline travel time estimate (X. Chen et al., 2004). More importantly, the combined
dynamic model with artificial neural networks and Kalman filter adjustments outperformed the
model with artificial neural networks alone (X. Chen et al., 2004).
2.3.5 Support Vector Machine
A few other studies have compared artificial neural networks to other models such as
support vector machine. Similar to artificial neural networks, the support vector machine does not require a specific functional form; rather than minimizing the empirical risk as artificial neural networks do, the support vector machine seeks to minimize an upper bound on the generalization error, consisting of the sum of the training error and a confidence term, a principle known as structural risk minimization (Yu et al., 2006). Using single-route data for route number 4 in the Dalian economic development zone, a support vector machine model was applied and its results were evaluated against a three-hidden-layer artificial neural network (Yu et al., 2006). The support vector machine
outperformed the artificial neural networks by 5% to 7% over four different patterns of real data.
In addition, the root-mean-squared errors (RMSEs) were more stable for the support vector machine, which can be attributed to its use of the structural risk minimization principle.
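The epsilon-insensitive loss that underlies support vector regression can be stated in a few lines; minimizing this loss plus a norm penalty on the weights is the structural-risk trade-off described above (illustrative sketch):

```python
def eps_insensitive_loss(y_true, y_pred, eps=1.0):
    """Total epsilon-insensitive loss: errors inside the eps tube cost
    nothing; errors beyond it are penalized linearly."""
    return sum(max(abs(t - p) - eps, 0.0) for t, p in zip(y_true, y_pred))
```

For instance, with eps = 1.0, a 0.5 s error is free while 2.0 s errors each contribute 1.0 to the loss.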
Extending this work, bus running and arrival data from multiple routes within three major roadway corridors in Hong Kong were used to compare support vector machines, artificial neural networks, k-nearest neighbour, and conventional linear regression (Yu et al., 2011). Similar to the results of previous studies, in general, artificial
neural networks performed worse than support vector machine, while artificial neural networks
outperformed k-nearest neighbour and linear regression models (Yu et al., 2011). Since artificial
neural networks models are only slightly better than k-nearest neighbour model, considering k-
nearest neighbour’s simple structure, k-nearest neighbour can serve as an alternative method for a
19
bus running time prediction (Yu et al., 2011). Additionally, it was shown that the use of multiple
routes’ data for support vector machine improved the accuracy of arrival time prediction by
approximately 20% in average mean absolute error compared with the use of single routes data in
previous studies (Yu et al., 2011). Support vector machine showed strong resistance to over-fitting
and performed well with a large set of data, in particular for multiple transit routes (Yu et al.,
2011).
2.3.6 Regression Trees
Unlike artificial neural networks and support vector machine, decision trees perform
classification by recursively partitioning the data into clusters, typically choosing splits that
minimize the Gini impurity (Charpentier, 2013). Decision trees can also perform regression by
partitioning data with continuous rather than discrete labels; such trees are referred to as
regression trees. Owing to the ability of regression trees to deal with data clustering, a study on
bus dwell times in Auckland, New Zealand, found that regression trees can outperform multiple
linear regression by addressing many of its limitations, such as multicollinearity and
non-normality of random errors (Rashidi et al., 2014). Another study on the short-term prediction
of bus travel time, using automatic vehicle location data for a street block in Boston,
Massachusetts, compared several machine learning models, including multiple linear regression,
support vector regression, and regression trees; it found that support vector regression
outperformed the regression tree, while the regression tree performed similarly to multiple linear
regression (Farid et al., 2016). The regression tree is a useful modelling method for datasets with
a high degree of data clustering.
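The partitioning idea behind regression trees can be illustrated with a minimal sketch. The Python fragment below (illustrative only, with hypothetical data) finds the single split of a one-dimensional predictor that minimizes the total squared error of the two resulting clusters; packages such as rpart apply this search recursively to grow a full tree:

```python
# Minimal sketch of how a regression tree partitions data: a single split
# (a "stump") chosen to minimize the total squared error of the two
# resulting clusters.

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, thr, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# Two clusters of hypothetical link speeds: slow links vs fast links.
thr, slow_mean, fast_mean = best_split([1, 2, 3, 10, 11, 12],
                                       [20, 22, 21, 40, 41, 39])
```

On clustered data like this, the first split already separates the two speed regimes, which is why regression trees cope well with data clustering effects.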
2.3.7 Random Forest
While regression trees can be useful on datasets with data clustering, the simplicity of their
data partitioning method can produce models that perform poorly on complex datasets with
highly nonlinear relationships. Tree-based ensemble methods such as random forest can improve
model performance by combining the outputs of many weak learners to generate more accurate
predictions (Breiman, 2001). Random forest can replicate complex and nonlinear relationships in
clustered data while reducing bias and overfitting, maintaining a low correlation between trees
through bootstrap sampling with replacement (Breiman, 2001).
Several studies on traffic and transit travel times demonstrate the strength of random
forest models. A study using INRIX traffic data showed that a random forest model responds
quickly to peak period changes and can accurately reproduce temporal variations in travel
speeds on the I-64 and I-264 highway segments between Newport News and Virginia Beach
(Elhenawy et al., 2014). Another study, on taxi travel time prediction for a Kaggle challenge,
demonstrated that random forest can accurately represent the length of taxi trips and
outperforms gradient boosting, another tree-based ensemble method (Hoch, 2015). In addition to
these applications to traffic and taxi data, a random forest and a k-nearest neighbour model were
used to model a bus route in the city of Chennai, India (Bahuleyan & Vanajakshi, 2017). Random
forest performed well for the complex intersection areas of the bus route, while k-nearest
neighbour was suitable only for mid-block sections without intersections (Bahuleyan &
Vanajakshi, 2017). Together, these studies suggest that random forest is a promising method for
modelling nonlinearity, data clustering effects, and spatiotemporal travel time variations.
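The bagging mechanism underlying random forest can be sketched as follows. This illustrative Python fragment is not Breiman's full algorithm (which also randomizes the features considered at each split); it shows only the core idea of training deliberately weak learners on bootstrap samples drawn with replacement and averaging their predictions:

```python
# Sketch of the bagging idea behind random forest (illustrative only):
# each weak learner is fitted on a bootstrap sample drawn with
# replacement, and the ensemble prediction averages the learners' outputs.
import random

def bootstrap_sample(data, rng):
    # Draw n points with replacement from the training data.
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """A deliberately weak learner: predict the mean y of the sample."""
    return sum(y for _, y in sample) / len(sample)

def bagged_predict(data, n_trees=100, seed=42):
    rng = random.Random(seed)
    preds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]
    return sum(preds) / len(preds)

# Hypothetical (feature, travel-time) observations on one link.
prediction = bagged_predict([(0, 10.0), (1, 12.0), (2, 11.0)])
```

Averaging across resampled learners reduces the variance of any single tree, which is the property the studies above exploit when modelling noisy travel time data.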
2.3.8 Choosing Statistical Learning Models
More complex statistical learning models such as artificial neural networks, support vector
machine, linear mixed effects models, regression trees, and random forest can potentially provide
better accuracy than multiple linear regression and fixed schedules (Yu et al., 2011). This
is due to their ability to model more complex relationships and to relax some of the fundamental
assumptions of multiple linear regression, such as linearity and homoscedasticity (Young, 2017).
Although previous studies have compared these modelling methods on networks of up to several
routes, the most suitable statistical learning model for modelling transit travel times across an
entire network, such as the TTC network with over 100 surface transit routes, has not previously
been determined. This study evaluates the modelling performance of multiple linear regression,
linear mixed-effects models, support vector machine, regression trees, and random forest.
2.4 Research Opportunities
Microscopic simulation can produce fairly complex transit behaviour with limited data
using travel demand models, traffic and transit assignment models, mode choice models, and car-
following models; however, the calibration of various components of microscopic simulation can
be very difficult for large networks due to the need for data-rich surveys of various components of
the network (Fernandez et al., 2010). Even though microscopic simulation provides a high level
of model details, it is very difficult to produce such a large-scale microscopic model for analysis
of disruptions and planning scenarios, especially for transit applications. Microscopic simulation
models are computationally expensive and rarely used for large networks (Cats et al., 2010).
Statistical learning models are generally considered computationally efficient, but
current transit models based on statistical learning do not provide sufficient modelling detail and
cover only one route or a few selected routes (Bai et al., 2015; X. Chen et al., 2004; Elhenawy et
al., 2014; Farid et al., 2016; Kormaksson et al., 2014; Rashidi et al., 2014; Shalaby & Farhan,
2004; Yu et al., 2006). Because statistical learning models are data-driven and require large data
sources, the availability of real-time data can often constrain the development of robust and
realistic models. Fortunately, the recent availability of real-time AVL, weather, road restriction,
and intersection characteristics data allows large-scale transit travel time models to be estimated
for the TTC network. Traditional transit travel time models are used primarily to assess the
impacts of various attributes on travel times, but they do not represent the movement of vehicles
and lack model detail. These models were based on limited data covering one or a few transit
routes, making them unsuitable for large-scale transit simulations.
The primary aim of this thesis is to take advantage of the mesoscopic simulation framework
established by previous studies and to fully utilize the computational efficiency of statistical
learning models. Past transit simulation models showed the promise of using GTFS data to model
a transit network according to its schedule. While simulation models based on schedules and
passenger loads provide valuable information to transit agencies, they do not realistically
represent the stochastic nature of transit travel and thus cannot be used to understand transit
service variability (Gaudette et al., 2016). On the other hand, statistical learning models of travel
times have identified the many factors affecting transit link and route travel times, but these
models are seldom applied in simulation settings to assess network performance due to their
limited model detail (Bai et al., 2015; X. Chen et al., 2004; Elhenawy et al., 2014; Farid et al.,
2016; Kormaksson et al., 2014; Rashidi et al., 2014; Shalaby & Farhan, 2004; Yu et al., 2006).
Microscopic simulation models are based on car-following models that characterize the
movements of all vehicles in the network, but they are computationally intensive (Cats et al., 2010;
Fernandez et al., 2010). On the other hand, statistical learning models characterize the travel
times of transit routes or links, but have not been applied in a transit simulation setting (Bai et
al., 2015; X. Chen et al., 2004; Elhenawy et al., 2014; Farid et al., 2016; Kormaksson et al., 2014;
Rashidi et al., 2014; Shalaby & Farhan, 2004; Yu et al., 2006). Each class of models offers a
unique set of advantages regarding model detail, data requirements, and computational
efficiency. This thesis uses a data-driven approach to generate detailed and computationally
efficient transit simulation models.
This thesis represents the movements of transit vehicles with statistical learning models of
travel times (i.e., running speeds and dwell times). The models were trained using several open
data sources, including AVL, GTFS, road restriction, weather, and intersection characteristics
data. State-of-the-art statistical learning algorithms, namely support vector machine, linear mixed
effects model, regression tree, and random forest, were assessed on their ability to generate
accurate and computationally efficient predictions. Finally, transit simulations based on the
statistical learning travel time models were performed to demonstrate the capabilities of
data-driven mesoscopic transit simulation. The next chapter provides more details on the
modelling framework and program design developed in this thesis.
Chapter 3
Modelling Framework
While previous studies on statistical models and mesoscopic simulations showed
promising modelling results, they provide little guidance on model implementation. This chapter
presents an innovative modelling framework for a data-driven mesoscopic transit simulation
model. Using the defined modelling framework based on intended use cases, a detailed functional
software program design was produced.
The program design of the model implementations was guided by three key principles.
Firstly, functional design approaches will be used to ensure modularity of the program
implementations. Modularity allows programs to evolve with changing data requirements and
algorithms. Secondly, the object-oriented programming paradigm will be used to construct
modular program components and structures. Thirdly, cross-language interoperability enables the
use of open source packages from different programming languages, such as R and C#, for rapid
model implementations.
In the following sections, a detailed design of the simulator program based on the three key
program design principles is presented. The first section establishes the programs’ functional
requirements based on potential use cases. Then, an overall simulation framework is suggested to
achieve the functional requirements. Finally, the detailed functional design section presents the
procedural components and data definitions of the simulator program.
3.1 Functional Requirements
The functional requirements of the mesoscopic transit simulation program were determined
based on the potential use cases. This section first defines the use cases of the simulator. Then, the
program requirements, related to the type of data and the relevant processing procedures, were
developed to address the specified use cases.
3.1.1 Use Cases
The use cases were established based on the data sampling level and variable specifications
used to estimate the travel time models (i.e., the running speed and dwell time models). Two
main use cases were considered for this study: (1) a basic analysis of a transit route, and (2) an
advanced analysis of the entire transit network. The basic analysis use case estimated travel time
models with sampling at the route level. By sampling travel time observations at the route level,
the variation due to route characteristics was held constant, while the travel time models captured
temporal effects such as time of day and day of the week, transit operational characteristics such
as stop headways, stop delays, and previous travel times, and basic link characteristics such as
link distance. The advanced analysis use case, in contrast, sampled observations at the network
level to train the travel time models. Its goal was to capture a wider range of route and link
characteristics that can affect transit travel speeds. In addition to the temporal effects and transit
operational characteristics, the network level model accounted for an expanded set of link
characteristics, including stop locations, link distance, number of signalized intersections, number
and type of turns made by transit vehicles between stops, traffic volumes, and pedestrian
volumes. The network level model also used route characteristics, such as route code, and
disruption characteristics, such as the presence of road restrictions and rain or snow
precipitation, to model route level variations.
3.1.2 Requirements for Use Cases
The requirements for the basic and advanced analysis use cases were complementary
and share many common components. The basic analysis, with sampling at the route level,
needed to account for temporal and transit operational characteristics. The variables needed for
the basic analysis were obtained from AVL and GTFS schedule data: by comparing the observed
arrivals and dwell times from the AVL trips with the scheduled arrivals from the GTFS schedule
trips, all of the basic analysis variables were derived. The advanced analysis, with sampling at
the network level, required more data. In addition to AVL and GTFS data, signalized intersection
locations, intersection volume data, road restriction data, and weather data were needed to obtain
the expanded link, route, and disruption characteristics. Link characteristics such as stop
locations, link distance, and number of signalized intersections were obtained from the
intersection data and scheduled stop locations. Route characteristics, identified by route
identifiers, accounted for route level differences across the network. Disruption characteristics
were obtained from the location and time of road restrictions and from weather data. Since the
basic analysis requirements were a subset of the advanced analysis requirements, the simulator
was designed to enable advanced analysis while also allowing users to produce models at the
route level. In summary, the functional requirements for the mesoscopic transit simulator include
the use of various open data sources to represent temporal effects, transit operational
characteristics, expanded link characteristics, route characteristics, and disruption characteristics
at the network or route level. This entails processing data to obtain model attributes, estimating
statistical models of running speed and dwell time, and simulating transit vehicles using those
statistical models. The ability to export and analyse the simulation results was a key function as
well. To achieve these functional requirements, a detailed program design was needed.
3.2 Simulation Framework
To address the computational and modelling challenges associated with large-scale transit
simulation, this study developed a data-driven method to efficiently simulate surface transit. This
thesis considered the functional requirements relevant to feature design and variable selection.
Based on these functional requirements, this thesis proposes an automatic modelling pipeline that
produces stop-to-stop mesoscopic surface transit simulations, using travel time models estimated
from open transit data. As illustrated in Figure 2, the transit simulation pipeline collects online
open data, processes the transit data, estimates data-driven models, and performs surface transit
simulation.
Using a system framework that gathers open data from multiple sources, the pipeline
extracted a rich array of attributes for each transit link, including arrival times, dwell times, delays,
headways, weather conditions, incidents, and transit link characteristics. This enabled the
estimation of running speed models based on regression methods and dwell time models based on
distribution fitting. The running speed model and dwell time model at each transit link were used
to simulate the movements of surface transit vehicles. Finally, the simulated transit trips were
evaluated based on their ability to replicate observed route cycle speeds, stop arrival patterns and
bunching patterns. The simulated transit trips can also be used by the Nexus coordination server
to model a multimodal transit network.
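The dwell time component of this pipeline can be illustrated with a brief sketch. The thesis fits lognormal dwell time distributions using the Math.Net Numerics estimator in C#; the Python fragment below shows an equivalent maximum likelihood fit on hypothetical observations, followed by the sampling step a simulation would use:

```python
# Illustrative sketch of the dwell time model: fit a lognormal
# distribution to observed dwell times at a stop by maximum likelihood
# (mean and standard deviation of the log-transformed data), then draw
# simulated dwell times from the fitted distribution.
import math
import random

def fit_lognormal(dwell_times):
    logs = [math.log(t) for t in dwell_times]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(var)

def sample_dwell(mu, sigma, rng):
    # A lognormal draw is the exponential of a normal draw.
    return math.exp(rng.gauss(mu, sigma))

observed = [12.0, 15.0, 20.0, 18.0, 25.0, 14.0]   # seconds, hypothetical
mu, sigma = fit_lognormal(observed)
rng = random.Random(1)
simulated = [sample_dwell(mu, sigma, rng) for _ in range(5)]
```

Because the lognormal is strictly positive and right-skewed, it is a natural candidate for dwell times, which cannot be negative and occasionally take large values.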
3.3 Detailed Functional Design
This section presents a detailed functional design of the simulator program that translates
the simulation framework into class objects and program components. Class objects within the
object-oriented programming context can store data using fields and perform operations using
methods. The class objects define the general structure of the simulator program. These class
objects show how data were transformed by a collection of procedures. The detailed program
components illustrate how different algorithms, defined by methods, were used together to
accomplish functions such as processing AVL data or estimating running speed and dwell time
models. The program structures outlined in this section were used to perform detailed functions
relevant to the Nexus surface simulator automatic modelling pipeline framework design.
3.3.1 Class Objects
Classes define the data structures of objects and the procedures of methods. In designing a
program that addresses the various aspects of the functional requirements, the classes must
be properly defined. As shown in Figure 3, there are four main groups of classes that carry out
functions by referencing each other: data tool classes, simulation modeller classes, model
engine classes, and report tool classes. The data tool classes comprise procedures to obtain
open data from online APIs, such as AVL data APIs, weather data APIs, and road restriction data
APIs.
The simulation modeller classes processed the raw API data into the formats used by the
simulator program. This included the use of GTFS data to determine actual stop times from the
AVL data. The model engine classes contained the data structure definitions and procedures used
in estimating running speed models in R and dwell time models in C#. Finally, the report tool
classes generated Excel reports and graphs produced with Excel and R.
The simulation modeller classes took the raw data formats from the online APIs and converted
them to the appropriate data formats. This conversion enabled the program to save only the data
fields relevant to this thesis, reducing disk consumption in the SQLite database and memory
consumption during data processing. Some of these data objects were:
• “SSVehGPSDataTable,”
• “SSINTNXSignalDataTable,”
• “SSWeatherDataTable,”
• “RouteProperties,”
• “SSIncidentDataTable” etc.
In addition to maintaining source data efficiently, many additional data objects were used to store
processed data. For instance, “SSVehGPSDataTable” was used to store processed data at the trip
level and link level. This type of data processing used various objects generated from the
“GTFSConvertSSim” class, such as:
• “SSimRouteRef,”
• “SSimRouteGroupRef,”
• “SSimStopRef,”
• “SSimScheduleRef,”
• “SSimServicePeriodsRef,” etc.
The “SSVehTripDataTable” contained trip level data processed from the source data in
“SSVehGPSDataTable” so stop times and dwell times could be stored for each AVL trip. In
addition, “SSLinkDataTable” and “SSLinkObjectTable” were produced using the trip level data
to organize AVL trip data into link level for model estimation.
The model engine classes used the class objects that contained the final set of model
variables (“SSLinkSimModelData” inherits from “SSLinkModelBaseData”) to generate statistical
model predictions and manage simulation variables such as vehicle arrival times and dwell times
at stops. At the conclusion of the simulation, the results were processed using the report tool
class (“ExcelReports”) to generate summary data and graphs for analysis and model
evaluation.
3.3.2 Program Components
Within each of the class objects, methods were defined to carry out the required procedures
of the program. Figure 4 illustrates how the various functional components of the program interact
to carry out the required functions by making calls to each other. As mentioned previously, the
simulator program functions include data collection, data processing, model estimation, and model
simulation with variables on temporal effects, transit operational characteristics, expanded link
characteristics, route characteristics, and disruption characteristics. Modular program components
each carry out a particular procedure that enables these functions.
The first group of methods handle the data collection. The “DownloadNextBusGPSData”
method queries real-time NextBus API online to retrieve AVL data and save these data offline for
processing (see section 0 for source code). To gather weather data, the “DownloadWeatherData”
method retrieves real-time weather information from OpenWeatherMap APIs (see section A.1.3
for source code). The Toronto road restriction data were gathered via the open data Toronto road
restriction API with the “DownloadTorontoIncidentData” method (see section A.1.4 for source
code). These methods were used to gather online real-time data needed for modelling.
Once the data were downloaded, the offline data files containing all of the required data
sources were loaded into the program. The modular design of the methods means that
“LoadDataFromOfflineFile” loaded various offline data into memory and updated them in the
SQLite database by calling different methods such as (see section A.1.5 for source code):
• “LoadGTFSReferenceData” to load GTFS data from GTFS database (see section A.1.6
for source code);
• “InitializeGPSTripDb” to load AVL GPS data and process GPS trips and “Update-
AndLoadGPSDbDataFromFiles” to load and update database table;
• “UpdateDbWeatherDataFromFiles” to load weather data from file and update database
table, or “GetWeatherDataFromDB” to load weather data from database table;
• “UpdateDbIncidentDataFromFiles” to load road restriction data from file and update
database table, or “GetTorInciDataFromDB” to load road restriction data from database table.
After the data were loaded, data processing could be done by calling the “ProcessGPSData”
method (see section A.2.1 for source code). The “ProcessGPSData” method checked whether the
AVL trip information contained fully processed stop times and dwell times, as well as the
respective link data with all the necessary attributes. The first stage of the data processing was to
determine the
schedule ID of the AVL or GPS trips. This was performed by the method
“ScheduleMatchingGPSTripWithGTFS” (see section A.2.2 for source code). The determination
of schedule ID allowed the “DetermineAllFeatureValues” method to proceed in finding the
observed stop times (see section A.2.3 for source code). The GPS trips were preprocessed and
organized for arrival and dwell time computation using the method
“ProcessAndOrganizeGPSPointsForTrip” (see section A.2.4 for source code). Then, the
“EstimatedArrivalAndDwellAlongScheduleRoute” method was called to determine arrival and
dwell times for every stop along the GPS trips (see section A.2.5 for source code). By comparing
the observed arrival and dwell times to the schedule information, all of the transit operational
characteristics could be found. After the trip data were processed, they were converted into
link data with the “InitializeModelLinksToDb” method (see section A.2.6 for source code).
Expanded link characteristics, route characteristics, and disruption characteristics were obtained
using the “GetIntxnIDsAndInitialDataForLinkObj” and “GetIntxnDataMissingVolAndStopLoc”
methods; these methods geotemporally match data records from different data sources to obtain
additional characteristics for a transit trip (see sections A.2.7 and A.2.8 for source code). In addition,
stop locations were obtained using the relative positions between stops and intersections; the stop
locations data were stored in a database table using the “UpdateStopDataToDB” method (see
section A.2.9 for source code). After link objects and link data were finalized, the
“UpdateAllLinksToDB” method was called to update the database with the link data information
(see section A.2.10 for source code). These link data were exported for model estimation and
simulation.
Once data processing was completed, the attributes associated with each observation on a
transit link could be used to estimate the running speed and dwell time models. The estimation of
running speed models was done in C# and R using the “RModel_RunningSpeedModelEstimation”
method in C# and “ME_TaskHandler” script in R (see section A.3.1 and A.3.3 for source code).
The following packages were used to estimate running speed models:
• the “MASS” package for multiple linear regression models (Ripley et al., 2017),
• the “e1071” and “liquidSVM” packages for support vector machines (Meyer et al., 2017;
Steinwart & Thomann, 2017),
• the “lme4” package for linear mixed-effects models (Bates et al., 2017, p. 4),
• the “rpart” package for regression trees (Therneau, Atkinson, & Ripley, 2017),
• the “ranger” package for fast implementation of random forest (Wright, 2017).
In addition, dwell time models were estimated in C# using the “Model_StopModelsEstimation”
method (see section A.3.2 for source code). The lognormal distribution estimator from the
Math.Net Numerics library in C# was used to estimate the parameters of the dwell time models
(Rüegg et al., 2017). After estimating running speed and dwell time models, transit trips
were initialized using a set of test data from “ImportLinkDataForModelRef” and GTFS schedule
trip data from “LoadGTFSReferenceData” (see section A.4.1 and section A.1.6 for source code).
Using the initialized trips, the “GetPredictionForLinkDataInR” method was able to generate
running speed and dwell time predictions for simulated transit trips (see section A.4.2 for source
code). Then, the “Model_GenerateReportsWithBinFileAndR” method generated detailed
summary data and graphs using the simulated transit trip results (see section A.5.1 for source
code). This method called three separate methods to generate three types of reports. The
“CreateTimeDistDiagramReport” method generated Excel line graphs showing the vehicle
trajectories of simulated, scheduled, and observed trips (see section A.5.2 for source code).
Another method, “CreateStopDelayReport”, produced stop level reports for model validation.
Similarly, “CreateRouteSpeedReport” generated route level reports for model validation.
3.4 Summary
This chapter introduced the modelling framework for the mesoscopic surface transit
simulator. Two different use cases, basic route level analysis and advanced network level analysis,
were established to determine the appropriate functional requirements. The functional
requirements for this simulator were to represent the temporal effects, transit operational
characteristics, expanded link characteristics, route characteristics, and disruption characteristics,
by obtaining model attributes through data processing, by estimating statistical models of running
speed and dwell time, and simulating transit vehicle movements using the statistical models. A
detailed simulation framework, consisting of various components of a software pipeline, was
constructed based on these function requirements (see Figure 2). Then, the function design
components of the simulator were detailed with respect to the class object references and program
component calls. The class objects contained data fields and methods to store data and carry out
functional procedures, respectively. In addition, the methods within each of those classes carried
out the specific procedures, using fields and method calls, to fulfil the functional requirements of
the program (see Figure 3). This chapter detailed the functional design of the surface simulator
program, based on function design approach, object-oriented programming paradigm, and cross-
language interoperability. The final program design is highly modular and extensible for additional
data sets, training new types of models, and new transit networks.
Chapter 4
Data Collection
One of the first steps in building an automatic modelling pipeline is data collection. This
chapter details the different procedures used to collect and manage data from online open data
sources. A collection of these data enabled the program to perform the functions required by its
functional design. Data for this thesis were obtained in two ways: historical open data archives and
real-time open data feeds. Real-time open data feeds required the use of online APIs with multiple
download threads for simultaneous real-time data collection. Archival open data required data
conversion to enable automatic data processing.
This chapter discusses the data collection methods used for transit network data, including
GTFS, AVL, road restriction, weather, and traffic intersection data. The choice of these data was
based on the previous literature on transit travel time regression models. In addition to transit
operational parameters such as headway and delay, previous studies have used disruptions,
weather conditions, time of day, and day of the week to model their various effects on transit
travel times (X. Chen et al., 2004; Yu et al., 2006). In this thesis, road restriction data were
collected to examine the effect that roadway closure disruptions have on transit travel times.
Weather data were also collected to examine the effect of weather parameters such as
precipitation. The specific time of day and day of the week of each transit trip were used to
examine temporal effects. In addition to the variables used in previous studies, this thesis also
accounts for the effect of transit stop locations using the traffic intersection data for Toronto.
4.1 Data Collection Methods
This section presents the processes used to collect historical open data archives and real-
time open data feeds. An established collection method ensured a consistent data structure,
data quality, and data storage format. A consistent data structure for particular data sets, or even
groups of similar data sets, allowed different data sets to be fused to create more informative
attributes for analysis. A standardized method of collecting data, especially when automated,
maintained a consistent data format for analysis. Finally, a predetermined data storage format
ensured readability of the data. This thesis used common open standard file and database
formats, such as Extensible Markup Language (XML) and JavaScript Object Notation (JSON), to
store the collected data.
4.1.1 Historical Open Data Archives
Historical open data archives that were stored in formats such as comma-separated value
(CSV) formats or text formats were converted into formats such as XML and JSON. XML and
JSON are human-readable because they are text-based with words and numbers that can be viewed
easily without the need for conversion. These formats are machine-readable because the words and
numbers containing data are qualified with the name and type of fields. The properties of XML
and JSON file format make them suitable candidates to deliver and store historical open data.
The flowchart below outlines the steps to convert historical open data archives into XML
or JSON file formats. The historical data were first downloaded manually. Then, these data were
unpackaged, manually validated, and saved as text-based CSV files. The prepared data were
converted to XML or JSON objects with a short script that read the text file and properly formatted
the data types of each field. Finally, the converted XML or JSON objects were saved. The schema
of these XML or JSON objects could be used to read these files into program memory.
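This conversion step can be sketched as follows. The sketch is illustrative only: the thesis's actual scripts are in the appendix, and the field names and types in FIELD_TYPES below are assumptions, not the actual schema.

```python
import csv
import json

# Illustrative field typing: these names and types are assumptions,
# not the thesis's actual schema.
FIELD_TYPES = {"stop_id": int, "stop_lat": float, "stop_lon": float}

def csv_to_json(csv_path: str, json_path: str) -> None:
    """Read a text-based CSV archive and save it as a typed JSON object."""
    records = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Format the data type of each field; unknown fields stay as text.
            records.append({k: FIELD_TYPES.get(k, str)(v) for k, v in row.items()})
    with open(json_path, "w") as f:
        json.dump(records, f, indent=2)
```

Because each value is written with its field name and type, the resulting JSON object is both human- and machine-readable, as described above.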
Figure 5. Data collection for archival data (manually download archival data → unpack archive and validate data → convert archive to XML or JSON object → save XML or JSON file)
4.1.2 Real-time Open Data Feeds
Traditionally, open data consisted of archives and records published periodically by organizations.
Recent efforts by governments and organizations to digitize real-time records have made available
many real-time data streams, such as automatic vehicle locations, weather conditions, and road
restrictions. Unlike archival data, real-time data streams require
continuous polling and downloads from APIs, as well as management of the downloaded data.
Using programming code developed for this thesis, the real-time online API data were downloaded
(see section A.1 for source code).
The first step in downloading real-time data was to initiate a download session, as shown in
Figure 6. In this step, an “end of the download” time was assigned and checked against the current
time. If the current time was before the “end of the download” time, the next download request
was scheduled based on the polling rate. For instance, with a polling rate of 20 seconds, the next
download request would take place 20 seconds after the previous one. The program then waited
until the scheduled execution time to send an asynchronous web client download request to the
web API server. After a brief wait, the web server returned an API response object, which was
converted into the desired data structure and checked for duplicates. The converted data were then
saved into memory or to disk. The download cycle continued until the “end of the download” time
was reached, at which point all the data were saved to disk.
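The polling cycle above might be sketched as follows. This is a simplified synchronous Python sketch; the thesis's implementation (section A.1) uses asynchronous C# WebClient requests, and the vehicle-ID/time duplicate key used here is an assumption for illustration.

```python
import time

def poll_api(fetch, end_time: float, polling_rate: float = 20.0):
    """Poll a web API until end_time, spacing requests by polling_rate seconds.

    `fetch` stands in for the asynchronous WebClient request; it should
    return a list of record dictionaries (the converted API response).
    """
    records, seen = [], set()
    next_request = time.time()
    while time.time() < end_time:            # check against "end of the download"
        time.sleep(max(0.0, next_request - time.time()))
        next_request += polling_rate          # schedule the next download request
        for rec in fetch():                   # API response converted to records
            key = (rec.get("vehID"), rec.get("GPStime"))
            if key not in seen:               # check for duplicates
                seen.add(key)
                records.append(rec)
    return records                            # saved to disk at the end
```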
Figure 6. Data collection for real-time online API data
4.2 Open Data Formats and Collections
There were five open data sources used by this thesis. The classes and data objects used to
read and convert these data are presented in Figure 7. In the following sections, the methods used
for data collection are outlined, the format of the downloaded data is presented, and the data
conversion process used to prepare the data for processing is discussed. The type of information
each open data source contains determined the attributes that could be obtained for modelling.
4.2.1 General Transit Feed Specification
General Transit Feed Specification (GTFS) data for the Toronto transit network contained
the schedule information of all transit trips in the network. The GTFS data were manually
downloaded from TransitFeeds (TransitFeeds, 2016). The format of the GTFS schedule data was
CSV. The CSV files contained data about the transit agency, service periods, routes, shapes of
routes, stop times, stops, and trips. The fields of each of the CSV files are outlined in the following
table. The data were then organized into vehicle trips using a GTFS converter,
“GTFSConvertSSim”. The details regarding how the GTFS converter processes the original CSV
files can be found in section 5.1.1. The GTFS data contained detailed information regarding transit
services at the network level; these data were used extensively during all stages of the simulation
such as data processing, model estimation, and model simulation, where the schedule data were
required.
Table 1. Fields for GTFS CSV files
CSV File Name of Field Description of Field
Agency.CSV agency_id the id number of agency
agency_name the name of the agency
agency_url the web address of the agency
Calendar.CSV service_id the id of the service period
Monday, Tuesday,
Wednesday, Thursday,
Friday, Saturday, Sunday
a list of the days this service period includes
start_date the start date of the service period
end_date the end date of the service period
Routes.CSV route_id the id of the route
agency_id the id number of the agency for the route
route_short_name a brief version of the route name
route_long_name the full version of the route name
Shapes.CSV shape_id the id of a shape
shape_pt_lat the latitude of a shape point
shape_pt_lon the longitude of a shape point
shape_pt_sequence the sequence or order of the point along the
shape
shape_dist_traveled the distance travelled from the first point
Stop_times.CSV trip_id the trip which the stop times belong to
arrival_time the time at which the vehicle arrived at the
stop
departure_time the time at which the vehicle departed from
the stop, same as arrival_time for the TTC
network
stop_id the id of the stop for the stop time
stop_sequence the sequence of the stop along the trip
shape_dist_traveled the amount of distance travelled from the
beginning of the shape
Stops.CSV stop_id the id of the stop
stop_code the code name of the stop used by the
agency
stop_name the full name of the stop used by the agency
stop_lat the latitude of the stop
stop_lon the longitude of the stop
Trips.CSV route_id the id of the route for the trip
service_id the id of the service period for the trip
trip_id the id of the trip
direction_id the direction of the trip
block_id the id of the block for the trip
shape_id the id of the shape for the trip
4.2.2 Automatic Vehicle Location
The NextBus data were downloaded via the NextBus public API (NextBus, 2016). The
“NextBus” web crawler (“SSWebCrawlerNextBus”) periodically retrieved the real-time AVL data
as “GPSXMLQueryResult” objects for the duration of the download (see section A.1.1
for source code). For every download, the “LastTime” value attached to each query batch was
used to determine the exact observation time of each vehicle location in the returned array of
“Vehicle” objects. The vehicle objects contained the fields detailed in the following table. The
observed “Vehicle” fields were converted to “SSVehGPSDataTable” records with assigned gpsID,
GPStime, vehID, longitude, latitude, direction, heading, and tripCode (which contained the route
number, branch number, and vehicle ID). The converted AVL data were used during data
processing to retrieve attributes such as arrival and dwell times.
Table 2. Vehicle object fields for NextBus AVL XML files
Field Name Field Description Example data
id the id of the vehicle 4187
routeTag route number of the vehicle 512
dirTag route number, direction, and
branch number of the vehicle
concatenated with an
underscore
512_1_512
lat latitude of the observed
vehicle
43.673367
lon longitude of the observed
vehicle
-79.463783
secsSinceReport the number of seconds since
the last report
10
heading the heading of the vehicle 253
GPStime the time of observation based
on download request time
(lastTime) and
secsSinceReport
lastTime = 1488362800
secsSinceReport = 10
GPStime = 1488362790
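The GPStime computation in the last row is simply the batch download time minus the vehicle's report lag:

```python
def gps_time(last_time: int, secs_since_report: int) -> int:
    """Observation time = batch download time (lastTime) minus secsSinceReport."""
    return last_time - secs_since_report
```

For the example in the table, gps_time(1488362800, 10) gives 1488362790.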
4.2.3 Road Restriction
The road restriction data were collected from the Toronto road restriction public API (City
of Toronto, 2015). Each road restriction download request generated a “Closures” object that
contained a list of “Closure” objects in XML format. Each “Closure” object contained a number
of fields detailed in the following table. The observed closure events were converted to a list of
“SSIncidentDataTable” objects with similar fields: incidentID, RoadName, NameOfLocation,
Longitude, Latitude, Planned, LastUpdatedTime, StartTime, EndTime, and ClosureDescription.
The converted objects with the above fields could then be loaded into memory for data processing.
The road restriction data were used to determine whether road restriction events were present along
transit links.
Table 3. Closure object fields for Road Restriction XML files
Field Name Field Description Example data
id the id of the road restriction R51072
Road road name VIMY AVE
Name detailed location of the road
restriction
VIMY AVE From LAWRENCE
AVE W To MACDONALD AVE
District the name of the city district Etobicoke York
Latitude the latitude location 43.702751
Longitude the longitude location -79.504901
RoadClass the type of road Major Arterial
Planned an indication of whether the event is
planned or unplanned
1 – yes
LastUpdated the time last updated 1486701720000
StartTime the start time of the restriction 1487062800000
EndTime the end time of the restriction 1487174400000
Description description of the nature of road
restriction
The northbound and southbound
lanes will be alternately occupied
due to construction.
4.2.4 Weather
The weather data were downloaded from the OpenWeatherMap public API
(OpenWeatherMap, 2017). Unlike the other open data formats, the downloaded weather data were
highly hierarchical and complex, with many data objects (see “SSWebCrawlerOpenWeather” in
Figure 7). The “RootObject” returned from a weather API query contained several objects with
multiple fields, as detailed in the following table. The information contained in the weather objects
was converted to an “SSWeatherDataTable” object with reduced data complexity and similar fields
such as:
• WeatherStationID: the station ID of the weather station, the same as the ID of the city
• WeatherTime: the time at which the weather was observed, obtained from dt_text
• Longitude: the longitude of the weather report
• Latitude: the latitude of the weather report
• WeatherCondition: the type of weather event, such as rain, sunny, etc.
• Temp: the temperature
• Humidity: the humidity level
• WindSpeed: the wind speed
• RainPptn: the 3-hour rain precipitation forecast
• SnowPptn: the 3-hour snow precipitation forecast
The data conversion drastically decreased the storage space and memory requirements for
weather data since the “SSWeatherDataTable” contained only the data needed for analysis and
excluded all the metadata retrieved from the API call. The converted objects with the above
fields were used for data processing to obtain weather attributes.
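The flattening of the nested API response into the reduced record might look like the following sketch. The real “SSWeatherDataTable” is a typed object rather than a dictionary; the keys here mirror the bullet list above, and the "wind" object is an assumption based on the full API response (it does not appear in the table below).

```python
def to_weather_row(resp: dict) -> dict:
    """Flatten a nested weather API response into a flat record, keeping only
    the fields needed for analysis and discarding the remaining metadata."""
    main = resp.get("main") or {}
    weather = resp.get("weather") or [{}]
    return {
        "WeatherStationID": resp.get("id"),
        "Longitude": (resp.get("coord") or {}).get("lon"),
        "Latitude": (resp.get("coord") or {}).get("lat"),
        "WeatherCondition": weather[0].get("main"),
        "Temp": main.get("temp"),
        "Humidity": main.get("humidity"),
        "WindSpeed": (resp.get("wind") or {}).get("speed"),  # "wind" object assumed
        "RainPptn": (resp.get("rain") or {}).get("3h", 0.0),
        "SnowPptn": (resp.get("snow") or {}).get("3h", 0.0),
    }
```

Dropping the unused metadata in this way is what yields the storage and memory savings noted above.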
Table 4. Objects and fields in weather JSON files
Object Description Example of Fields in the object
City the identification of
the city, with id and
name.
"id":7871305,"name":"Downsview"
Coord the coordinate of the
data, contains
longitude and
latitude
"lon":-79.49398, "lat":43.732029
Main the typical weather information, including temperature range, pressure, sea level, ground level, and humidity.
"temp":270.59, "temp_min":270.59, "temp_max":270.9,
"pressure":992.76, "sea_level":1027.49,
"grnd_level":992.76, "humidity":68
List an object that holds
detailed weather data
see the following rows
Clouds the type of clouds "clouds":null
Snow the expected 3-hour
snow precipitation
"snow":{"3h":0.065}
Rain the expected 3-hour
rain precipitation
"rain":{"3h":0.0}
Weather contains weather
display objects
properties
"weather":[{"id":600,"main":"Snow","description":"light
snow","icon":"13n"}]
Sys general and other
information about
the data point
"sys":{"country":null,"population":0,"pod":"n"}
4.2.5 Traffic Intersection Characteristics
The traffic intersection and signal data files were downloaded manually from Open Data
Toronto (City of Toronto, 2017c, 2017b). There were four data files for traffic intersection
characteristics: “flashing_beacons.XML”, “pedestrian_crossovers.XML”, “traffic_signals.XML”,
“traffic_vols.XML”. Each of the data files had corresponding XML objects (one object as
individual records, one object as the container of an array of records) as shown in Figure 7. The
first three data files contained detailed data about the intersections, such as intersection ID, street
names and geographical locations (longitude and latitude) of the flashing beacon device, pedestrian
crossover signals, and traffic light signals. These data were helpful in determining the locations
where transit vehicles may have encountered potential delays due to flashing beacon warnings,
pedestrians crossing mid-block sections of the street, or signal delays at intersections. The fourth
data file contained the traffic volume count data at most of the signalized intersections. The volume
data included the count date, 8-hour vehicle count and 8-hour pedestrian count in addition to
identifying information about the intersection. The volume data could be used to determine general
traffic levels at each intersection.
Although there were four different data files for traffic intersections, a common data type
was defined for loading all of the data files, since almost all of the fields existed in each of the files.
As such, the four data files were converted to “SSINTNXSignalDataTable” data type for data
processing. The “SSINTNXSignalDataTable” data type contained all of the fields present in any
of the four intersection data files: intersectionID (“pxID”), mainStreetName,
midBlockStreetName, sideStreetNames, Latitude, Longitude, countDate, vehVol, pedVol,
signal_system, mode_of_control, and type_Of_Intersection (signalized intersection, pedestrian
crossing, or flashing beacon). A common data type for all four data files allowed the traffic
volume data to be merged with intersection properties where the pxID values were identical. After data
conversion, the intersection data could be used during data processing to obtain intersection related
attributes between transit stops.
4.3 Summary
This chapter detailed the method used to perform data collection. This thesis used manual
download and data conversions to collect historical open data archives and used automatic multi-
threaded online API web requests to collect real-time open data feeds. The five major data sets
used in this study were GTFS, AVL, road restriction, weather, and traffic intersection data. The
format of the original data source was presented. The original data were converted to a new format
for lower data size, reduced data complexity, and common data type. The converted data types
could be loaded into memory and saved into SQLite database tables for data processing.
Chapter 5
Data Processing
While big data from open data sources provide researchers with vastly more information
to analyse and study, the value of open data may not be fully realized without a systematic and
computationally efficient method of data processing. Many of the open data formats were
unstructured, so they were not suitable for modelling and analysis. To retrieve valuable
information from these data, they had to be processed. Unstructured data are data with a low degree
of organization. Processing unstructured data into structured data with target labels and defined
attributes and variables enables the estimation of models. For example, the AVL source data was
composed of a large list of GPS points with information such as longitude, latitude, route number,
vehicle number, and direction of travel (S/N or E/W). These data labels could be used to process
the AVL data into vehicle trips with variables of interest such as trip ID, stop times, running
speeds, dwell times, and delays. After such processing, the structured AVL data could be analysed
and be used for modelling. To process these data efficiently, a cross-platform relational database
management system called SQLite was used to manage application data.
There were four stages involved in an automatic data processing procedure used for this
thesis: data loading, preprocessing, processing, and data export. The detailed processing
procedures for each data set following the four main stages are discussed in the following sections.
Figure 8. Data processing flow: data loading → preprocessing or preparation → processing (trip processing, map matching) → data export
5.1 Data Loading
Data loading refers to the process of copying source data from files or databases onto
memory objects used by the processing procedure. This section describes the formats of the
collected data files, and the process used to load them into memory. As open data formats evolve,
the data loading processes used can differ, but the general idea of loading these data to enable data
processing should remain the same. The following sections detail the data loading process of
GTFS, AVL, road restriction, weather, traffic intersection, and transit route property data.
5.1.1 General Transit Feed Specification
While the comma-separated value (CSV) format is commonly used, the GTFS CSV files are
quite difficult to use at runtime when retrieving attributes related to transit trips. This is due to the
unstructured nature of these CSV files. There were eight CSV files for the Toronto Transit
Commission (TTC) network: agency.CSV, calendar.CSV, calendar_dates.CSV, routes.CSV,
shapes.CSV, stop_times.CSV, stops.CSV, and trips.CSV. Each of these files contained all the
information corresponding to its file name. For instance, stops.CSV contained the stop ID,
stop code, and stop description, as well as the longitude and latitude of all the transit stops in the
network. The user of these data was expected to refer to the other CSVs for additional
data about the stops. The stop_times.CSV contained all the times at which a transit vehicle was
scheduled to stop, for each trip at each stop. Given the trip ID of those stop times, the stop times with stop
locations could then be traced to a trip in the trips.CSV file. Then, the trips.CSV contained further
information about the trip such as route id, which established the characteristics of the routes in
routes.CSV. The routes data had information about the agency, as well as the calendar dates which
the route serviced. The original GTFS CSV files minimized the storage requirements of these files
but made their use at runtime more difficult.
To use the GTFS CSV files efficiently at run time, they were processed into a network of
vehicle trips with their corresponding stop times and travel paths. For this thesis, a GTFS converter
was used to transform these CSV files into an SQLite database file with various tables on vehicle
trips, routes, and stops. This GTFS converter had been originally used by the Nexus Platform for
generating static schedule vehicle trips (Srikukenthiran, 2015). The GTFS converter obtained
schedule vehicle trips in 3 main steps. First, the converter read and extracted all the data from each
of the CSV files. Then, by making references to the data contained within the CSV tables, such as
stop, route, and stop times data, the converter built a schedule that contained all the information
regarding every vehicle trip. Finally, the trips linked to stop and route information by unique
integer IDs were extracted into an SQLite database with various tables such as trips, schedules,
routes, stops, etc. Using these data tables from SQLite database, the schedule data of the transit
network were obtained. A full description of all the GTFS tables can be found below.
Table 5. Processed GTFS Database table information
Names
(# Rows) *
Columns Descriptions
Agencies
(1)
agencyID, name Identified transit agencies. For this thesis, TTC
was the only agency name with agencyID = 1.
Service_Periods
(5)
serviceID, typeOfDay,
agencyID
Each service period was associated with a
specific transit agency. Typical service periods
include Weekday, Sunday, and Saturday as the
type of day. A special type of day is usually
assigned to holiday service periods.
Route_Groups
(199)
groupID, routeCode,
routeName, routeType,
agencyID
A route group referred to a collection of route
branches that belong to a route number, also
known as route code. This table also identifies
the type of route such as “surface variable”
(buses or shuttles), “surface fixed” (streetcars) or
“underground” (subways).
Routes
(3,405)
routeID, groupID, name This table described the route branches in a route
group, and linked the route branches to the main
route number.
Stops
(10,570)
stopID, stopCode,
Longitude, Latitude,
stopType, stopName
The location of stops was defined by longitude
and latitude, while the stops were identified by
stop ID, stop code and stop name.
Schedule
(3,405)
scheduleID, serviceID,
routeID, groupID,
routeStops,
routeStopDistances,
shapeID
The schedule table contained the list of stops and
stop shape distances, as well as the path (shape)
which the transit vehicle travelled. The list of
stops IDs, distances and path for a schedule was
stored in the sequencing service. Each schedule
corresponded to each route with a route name, as
well as a service period with a type of day.
Shapes
(1,515)
shapeID, path Each path in the Shapes table contained a
sequence of high-resolution geographical
locations, defined by longitude and latitude.
Trips
(133,361)
tripID, scheduleID,
blockID, shapeID,
stopTimes
The trip table contained all the scheduled transit
trips over each day of service. The sequence of
stop times in each trip corresponded to a
sequence of stops and stop distances in the
schedule table with the corresponding
scheduleID.
* based on TTC GTFS schedule data between February 12, 2017 and March 25, 2017.
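The converter's referencing logic described above (stop_times → trips → routes) can be sketched as follows, using simplified in-memory tables rather than the actual GTFSConvertSSim implementation; the table layouts are illustrative.

```python
def build_trip_schedules(trips, stop_times, routes):
    """Link each trip to its route and ordered stop times (a sketch of the
    converter's referencing step; table layouts are simplified)."""
    by_trip = {}
    for st in stop_times:                               # group stop times by trip
        by_trip.setdefault(st["trip_id"], []).append(st)
    schedule = {}
    for trip in trips:
        stops = sorted(by_trip.get(trip["trip_id"], []),
                       key=lambda st: st["stop_sequence"])
        schedule[trip["trip_id"]] = {
            "route": routes[trip["route_id"]],          # trips.CSV -> routes.CSV
            "stop_times": [(st["stop_id"], st["arrival_time"]) for st in stops],
        }
    return schedule
```

Performing this linking once and saving the result to a database is what avoids repeating the cross-file lookups on every data processing run.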
There were many benefits of converting the original CSV files to database tables during
the data loading step. Firstly, the process of converting the CSV files into linked trip objects with
detailed properties such as stop IDs, stop distances, stop times, and paths took a considerable
amount of time (approximately five minutes), while reading the processed GTFS data took
much less time. By processing the GTFS schedule during or before data loading, we reduced the
time associated with loading the routed transit trips from five minutes down to 30 seconds, for
each data processing run. Also, this eliminated the need to process GTFS data at run time, as
processed GTFS data could be simply saved to disk and be loaded quickly when they were needed
at the beginning of data processing. In addition, loading fully processed scheduled data allowed
for faster referencing during AVL data processing. Finally, this was beneficial to parts of the thesis
beyond data processing, such as model estimation and model simulation, as those algorithms relied
heavily on the ability to reference GTFS schedule as well for key information.
5.1.2 Automatic Vehicle Location (AVL)
Since the AVL data collected were stored in XML files divided up by each hour, the data
loading of AVL data consisted of two steps: loading XML files into memory and combining all
data objects. The file names of AVL XML data files used were in the format of “all-GPS-
yyyymmddhhmmss-yyyymmddhhmmss.XML.” The first and second date time string consisted of
the start and end year, month, day, hour, minute, and second, respectively. The date time string
used the UTC time to prevent duplicate names due to daylight saving time. Using this naming
convention, the XML file names falling within the study period were loaded into memory as a
dictionary of “SSVehGPSDataTable” indexed by GPS ID. After the XML files were loaded, each
AVL point was checked for uniqueness by the time of observation and vehicle ID. Duplicated data
were ignored during AVL data loading. After AVL data points were loaded into memory, they
were inserted into the “TTCGPS” table in the Surface Simulator (SSim) database. The SSim
database was a separate database from the GTFS database generated earlier and it was used to store
and process real-time transit related data. The data points from AVL data were used extensively
to compute various operational characteristics of the observed transit trips.
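The hourly UTC naming convention makes it straightforward to enumerate the files covering a study period; a minimal sketch (assuming exact one-hour files aligned to the hour):

```python
from datetime import datetime, timedelta

def hourly_avl_filenames(start: datetime, end: datetime):
    """List the hourly AVL file names covering [start, end), using the
    "all-GPS-<startUTC>-<endUTC>.XML" convention described above."""
    names, t = [], start
    while t < end:
        nxt = t + timedelta(hours=1)
        names.append("all-GPS-{:%Y%m%d%H%M%S}-{:%Y%m%d%H%M%S}.XML".format(t, nxt))
        t = nxt
    return names
```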
5.1.3 Road Restriction
The road restriction data also used XML format for storage, but due to the smaller sizes of
the data, this data was stored by day with a similar naming convention: “all-RR-
yyyymmddhhmmss-yyyymmddhhmmss.XML”. Using such a naming convention, all the required
road restriction data were loaded into the program as a dictionary of “SSIncidentDataTable”
indexed by incident ID and were ready to be used for processing. After loading these data into
memory, they were inserted into the “TORINC” table in the SSim Database. The road restriction
data were used to determine the presence of potential incidents due to emergency and road work,
which resulted in reduced road capacities and potential travel delays.
5.1.4 Weather
Similar to road restriction data, weather data had smaller sizes as well, as such, the weather
JSON files were stored by day with a naming convention such as: “GTA-Weather-
yyyymmddhhmmss- yyyymmddhhmmss.json”. All the required weather data were loaded into
memory as a dictionary of “SSWeatherDataTable”, indexed by weather record ID for data
processing. The weather data had detailed temperature and precipitation across many weather
stations for a defined geographical area such as the Greater Toronto Area (GTA).
5.1.5 Traffic Intersection Characteristics
Unlike AVL, road restriction, and weather data, traffic intersection data were archived data.
For easier data loading, the various traffic intersection data originally in CSV format were
converted into XML format. These traffic data files, namely “flashing_beacons.XML”,
“pedestrian_crossovers.XML”, “traffic_signals.XML”, and “traffic_vols.XML”, were loaded into
memory with a common object type named “SSINTNXSignalDataTable”, indexed by intersection
(px) ID. The intersection data were used to determine the intersection-related attributes for transit
network.
5.1.6 Transit Route Properties
While GTFS contains route identifications, additional properties of the transit route may
be needed to introduce more descriptive characteristics to transit routes. For this thesis, a custom
route properties data object, stored in XML format, was created to identify whether each route
number in the Toronto Transit Commission network is a streetcar route and whether it has a dedicated
right of way (ROW).
The route properties data were loaded during data loading step as a list of “RouteProperties”. These
data were used for determining route attributes during data processing.
5.2 Preprocessing of AVL data
Due to equipment and communication errors, a small fraction of the real-time AVL data
did not contain meaningful information for analysis. AVL data preprocessing addressed three main
data issues: empty data, non-revenue trip data, and wrongly labelled data. These erroneous data
points could cause problems in many steps of data processing, including the proper forming of
trips and computations of attributes. Due to these potential issues related to erroneous AVL data
points, the AVL data needed to be preprocessed.
5.2.1 Empty Data
Most modern AVL systems rely on Global Positioning Systems (GPS) to triangulate the
positions of transit vehicles and report them in real-time via cellular service or radio
communications. Often, during initialization of the GPS system, a default geographical location
stored by the GPS device is reported. This default location for the Toronto Transit Commission
AVL system was (latitude = 43.5 and longitude = -79), which is in the middle of Lake Ontario.
Often, a series of these default location data were received at the beginning of a vehicle’s operation
while its AVL system is trying to establish GPS triangulation. Data points with this default location
were considered empty data and were removed from the AVL data records during the data
preprocessing step.
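The default-location filter can be sketched as follows (a small tolerance is assumed, since reported coordinates are floating-point values):

```python
DEFAULT_LOCATION = (43.5, -79.0)  # TTC AVL initialization default, in Lake Ontario

def drop_empty_points(points, tolerance=1e-6):
    """Remove AVL points still reporting the GPS device's default location."""
    return [p for p in points
            if abs(p["lat"] - DEFAULT_LOCATION[0]) > tolerance
            or abs(p["lon"] - DEFAULT_LOCATION[1]) > tolerance]
```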
Another type of empty data was generated when bus drivers were changing the trip
information in the AVL system at the start of each new trip. When the AVL data points were
examined for each trip, some trips had very few points clustered around a single stop,
usually at a terminal station. These trips never took place, since there was no path of
travel; they likely resulted from incorrect entry of trip information when the bus driver
was entering the trip codes. While the driver was correcting such a mistake, these points were
reported to the AVL real-time feed. Since these data points contained no useful information,
they were removed from the AVL data records during the data preprocessing step.
Figure 9. Data points that cannot be used to produce trips, left: points with default
location, right: points in trip with no path travelled
5.2.2 Non-Revenue Trip Data
Occasionally, drivers entered the trip code and left the AVL system on while performing
a non-revenue trip between the garage and their first or last stop of service. This was evident
because, for some trips, the AVL points were off route at the beginning and/or end of the
trip, where the bus garages were located. It was important to exclude these non-revenue trips
because off-route points may lead to misidentification of stop times during data processing. To
remove these points, the proximity of each AVL point to the scheduled route stops was
determined. Points that were off the route and far away from route stops were identified as
belonging to the non-revenue portion, with no stops made en route to the garage. These non-
revenue points were then removed from the rest of the trip.
Figure 10. Non-revenue trip for AVL points labelled as King Street shuttle bus
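A sketch of the proximity test is given below; the 500 m threshold and the equirectangular distance approximation are illustrative assumptions, not the thesis's calibrated values.

```python
import math

def trim_off_route_points(points, route_stops, max_dist_m=500.0):
    """Drop AVL points farther than max_dist_m from every scheduled stop."""
    def dist_m(lat1, lon1, lat2, lon2):
        # Equirectangular approximation, adequate at city scale.
        kx = 111320.0 * math.cos(math.radians(lat1))
        return math.hypot((lat2 - lat1) * 110540.0, (lon2 - lon1) * kx)
    return [p for p in points
            if any(dist_m(p["lat"], p["lon"], s["lat"], s["lon"]) <= max_dist_m
                   for s in route_stops)]
```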
5.2.3 Incorrectly labelled Trips
In the original real-time AVL data, there were trips with identical trip codes that traversed
several cycles within the same trip before returning to the garage. For instance, in the
figure below, the trips were all labelled westbound, but half of them were travelling eastbound.
Ideally, the bus information should have been set to out-of-service for the non-revenue trips in
the eastbound direction, after customers had been served from St Clair station to St Clair West
station. Due to the temporary nature of these trips, the heading of the trip was not changed for
the duration of the shuttle service, in which the buses traversed back and forth between the two
stations.
For this type of incorrectly labelled trip, the chaining of points with the mislabelled trip
code resulted in one very long trip being formed, with points traversing back and forth between
the two stations. To properly compute stop times, delays, and headways, these chained trips
needed to be split into individual trips. In addition, the non-revenue portions of the trips, labelled
as travelling westbound when they were, in fact, travelling eastbound, were removed. Based on
careful observation and analysis of these incorrectly labelled trips, they were corrected so that
data processing on mislabelled temporary services could be completed.
Figure 11. Multiple eastbound and westbound trips (over 10) on St Clair had the same trip
information, labelled as St Clair shuttle bus
5.3 AVL Trip Data Processing
Once data preprocessing was completed, the AVL data were processed to extract the
various attributes required for model estimation. Firstly, the AVL trips were formed based on their
route number, vehicle identification, and direction of travel. Then, using the newly formed trips
with their route number (route code), the branch with the closest geographical match to the trip
trajectory was found and assigned to the trip. Once the specific branch the trip belonged to was determined, various attributes such as stop times, delays, and headways at each of the stops were computed. Finally, the trip data were converted into link data to allow the addition of attributes related to weather, road restrictions, intersections, and stop locations for each link. These AVL data processing steps converted the unstructured bus location data into a set of structured data variables with descriptive feature representations.
5.3.1 Forming AVL Trips
When AVL data were collected, the data received contained various properties such as
route code (i.e.: 32), route tag (i.e.: 32A), vehicle ID (i.e.: 1504), and direction (i.e.: 1 for
North/West and 0 for South/East). These properties were used to determine the sequence of AVL
data points that formed the AVL trips. The route code, route tag, and vehicle ID properties of each
data point were used to identify the specific vehicle on a route. To perform more efficient queries
in the database, these three properties were combined into a string (text) called trip code. For
example, a data point with route code of “504”, route tag of “504BUS” and vehicle ID of “7609”
had a trip code of “504_504BUS_7609”. Utilizing this trip code for every AVL data point, a select
query was performed on the “TTCGPS” table in the SSim Database. The SQL query was
conducted for each route number using a query statement such as “select * from TTCGPS where routeCode=504 order by tripCode ASC”. Then, by reading these points one by one, in order, and starting a new trip wherever the trip code and/or vehicle direction changed, a collection of unique transit trips was obtained from the AVL data points. In the example provided in the following figure, the query statement for route code 504, King Street, returned a total of 20,879 rows of AVL data points for the sample data from February 28, 2017. A sample of these data is shown, and the forming of two trips with trip IDs 8873 and 9843 is demonstrated.
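The trip-forming pass described above can be sketched as follows. This is a minimal illustration, not the actual implementation; the field names (“tripCode”, “direction”, “time”) are assumed for the example, although trip code and direction are the splitting criteria described in the text.

```python
def form_trips(points):
    """Group AVL points, already ordered by trip code and time, into trips,
    starting a new trip whenever the trip code or vehicle direction changes."""
    trips = []
    current = []
    for pt in points:
        if current and (pt["tripCode"] != current[-1]["tripCode"]
                        or pt["direction"] != current[-1]["direction"]):
            trips.append(current)
            current = []
        current.append(pt)
    if current:
        trips.append(current)
    return trips

# Illustrative sample: a direction change and a trip-code change each split a trip.
sample = [
    {"tripCode": "504_504BUS_7609", "direction": 1, "time": 100},
    {"tripCode": "504_504BUS_7609", "direction": 1, "time": 120},
    {"tripCode": "504_504BUS_7609", "direction": 0, "time": 140},  # turnaround
    {"tripCode": "504_504A_1504",   "direction": 0, "time": 160},  # new vehicle
]
print(len(form_trips(sample)))  # 3 trips
```

Reading the ordered query results once and splitting on these two fields avoids any pairwise comparison between points, which keeps the pass linear in the number of AVL records.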
Figure 12. Trip forming using SQL database query, top: TTCGPS table containing AVL
GPS points, bottom: TTCGPSTRIPS table containing AVL trips
5.3.2 Stop Sequences
The newly formed AVL trips were much more informative than scattered individual AVL
points, but stop level attributes were still missing. A sequence of stops the trips serviced needed to
be determined. After AVL trips were formed, these unique trips had a sequence of AVL or GPS
point IDs that were used to retrieve detailed information regarding each point. Using these GPS
point IDs and their corresponding data, the branch that the trip belonged to was determined. Once the branch of the trip had been determined, the sequence of stops the trip serviced was then found.
5.3.2.1 Trip Route Branch Matching
For the Toronto Transit Commission (TTC) system, each route number had many branch routes with different stop sequences. This allows the transit agency to serve the same corridor with varying levels of frequency across the line under a single route number. Some sections of routes with higher passenger demand receive more frequent service, while other sections, usually further away from the central business district, receive less frequent service. For example, route number 504 on King Street had two branch numbers: 504 (between Broadview Station and Dundas West Station) and 504B (between Broadview and Queen and Roncesvalles and Queen). Many of the trips’ branch routes could be found by matching the route tag from the AVL point data to the route name in the GTFS schedule data. For clarity, these individual branches were equivalent to routes in the GTFS database tables.
Unfortunately, not all the AVL trips serviced the exact stop sequences that the GTFS schedule suggested. Even in the sample data shown in the previous section, it was evident that shuttle buses and temporary services were often needed to address the operational challenges the transit agency encounters daily. In fact, according to TTC’s CEO reports, in the month of May there were 2,603 short-turned bus trips and 1,151 short-turned streetcar trips (Toronto Transit Commission, 2017a). These trips, along with the shuttle bus trips observed in the AVL data, made it necessary to determine the branch that each observed AVL trip was serving, especially if the trip was a short-turned or temporary trip.
The procedure to determine the best schedule match involved three main steps. Firstly, a list of all possible GTFS schedules with the correct time-of-day of service was drawn from the GTFS route group (i.e., a route number or route code such as 504); each schedule ID in the GTFS database has a unique route ID for a branch such as 504 or 504A, with a specific time-of-day code such as 3 for weekday. This step reduced the number of comparisons between the GTFS route stop sequences and the GPS points in the AVL trip. Then, a score based on the proximity of the GPS points in the AVL trip to the path of the GTFS schedule was used to determine the trip’s branch/route match. This match error score was the sum of all the GPS points’ distances from the path of the GTFS schedule, the distance between the first GPS point and the GTFS start stop, and the distance between the last GPS point and the GTFS end stop. Finally, using the match error score computed for each potential GTFS schedule, the GTFS schedules were ranked and the one with the lowest score was chosen as the match.
Equation 1. Match error score for AVL trip schedule matching
Match Error Score = ∆d_start + ∆d_end + ∑_{i=2}^{k−1} d_i

∆d_start: distance from the first GPS point to the closest GTFS start stop
∆d_end: distance from the last GPS point to the closest GTFS end stop
d_i: distance from a GPS point to the nearest point on the GTFS shape
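A minimal sketch of Equation 1 in Python follows. It is illustrative only: coordinates are assumed to be in a projected (planar) system, and the point-to-shape distance is simplified to the distance to the nearest shape vertex rather than to the nearest point on the shape’s line segments.

```python
import math

def dist(p, q):
    # Planar Euclidean distance; a projected coordinate system is assumed.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def point_to_path(p, path):
    # Simplification: distance to the nearest vertex of the GTFS shape,
    # rather than to the nearest point on its line segments.
    return min(dist(p, v) for v in path)

def match_error_score(gps_points, shape, start_stop, end_stop):
    """Equation 1: endpoint distances plus the summed distances of the
    interior GPS points from the candidate GTFS shape."""
    score = dist(gps_points[0], start_stop) + dist(gps_points[-1], end_stop)
    score += sum(point_to_path(p, shape) for p in gps_points[1:-1])
    return score
```

The candidate schedule with the lowest score, e.g. via `min(candidates, key=...)`, would then be chosen as the trip’s branch match.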
5.3.2.2 Stop Sequence Trimming
In an ideal situation, the GPS points from an AVL trip would align exactly with the locations of the stops from the matched GTFS schedule. In that case, the stop sequence from the matched GTFS schedule table could be used for the AVL trip without modification. However, this was evidently not the case, due to the presence of short-turn and temporary shuttle trips. To properly handle these trips, a custom stop sequence was determined for each AVL trip.
The stop trimming algorithm worked by first finding the closest stops to the first and last GPS points. This search was performed by computing the distances from each of these GPS points to all the stops in the chosen GTFS schedule. Once the indices of the likely first and last stops were determined, based on minimum distances, the algorithm deleted from the trip’s stop sequence list the stops that the AVL trip never visited. This algorithm is illustrated in the following figure.
Figure 13. Stop sequence trimming to delete stops not traversed by the observed trip
The stop sequence trimming step in the AVL trip data processing procedure enabled the analysis of short-turn trips and shuttle bus trips, which did not adhere to the standard stop sequences of scheduled trips. After stop sequence trimming, the AVL trip was ready for further processing and attribute extraction. To make the subsequent data processing tasks more efficient, the GPS points in the AVL trip were placed in lists inside a dictionary keyed by stop ID. Each list within the dictionary was assembled in order of the points’ distance from the start of the trip. This GPS-list dictionary allowed for more efficient retrieval of GPS points for stop times computation.
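Constructing the GPS-list dictionary can be sketched as follows; the field names (“stopID”, “distFromStart”) are illustrative assumptions, not the actual schema.

```python
def build_gps_dictionary(gps_points):
    """Map stop ID -> list of GPS points, each list ordered by the point's
    distance from the start of the trip (the GPS-list dictionary)."""
    by_stop = {}
    for pt in gps_points:
        by_stop.setdefault(pt["stopID"], []).append(pt)
    for pts in by_stop.values():
        pts.sort(key=lambda p: p["distFromStart"])
    return by_stop
```

Keying by stop ID turns the later stop-time searches into constant-time dictionary lookups followed by short, ordered list scans, rather than scans over the whole trip.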
5.3.3 Stop Times Calculations
After determining which GTFS schedule the AVL trip belonged to, the approximate stop times at each transit stop along the scheduled trip were estimated. For simplicity, stop times refer to both the arrival time and the departure time at a stop. To determine the arrival and departure times at a stop, a buffer area around the stop location was used; the buffer used in this thesis was 25 m around the stop location, as most modern GPS receivers can achieve accuracy better than 25 m under normal conditions. For each stop, among the points within that stop’s buffer zone, the GPS point with the earliest time indicated the arrival time while the one with the latest time indicated the departure time.
Algorithmically, this was done efficiently using the GPS-list dictionary. If the stop times of interest were for stop ID 2, the algorithm first tried to find the arrival GPS point by looking at the list under dictionary key stop ID 1. By scanning this list from end to beginning while examining each point’s distance from the transit stop, the earliest GPS point within the buffer zone of the stop of interest was found. After the arrival time was found, the list under dictionary key stop ID 2 was retrieved. This time, the algorithm scanned the list from beginning to end and found the latest point within the buffer zone of the stop of interest. Once the arrival point and departure point were determined for a stop, the arrival time and dwell time for the stop were determined, saved to the AVL trip object in memory, and updated in the database table row for the AVL trip.
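The two directional scans can be sketched as below. The sketch assumes a `dist_to_stop` callable giving each point’s distance from the stop of interest; the point fields (“t” for time, “d” for distance) are illustrative.

```python
BUFFER_M = 25.0

def find_stop_times(gps_by_stop, prev_stop_id, stop_id, dist_to_stop):
    """Arrival: earliest in-buffer point, found by scanning the previous
    stop's list from end to beginning. Departure: latest in-buffer point,
    found by scanning the current stop's list from beginning to end."""
    arrival = None
    for pt in reversed(gps_by_stop.get(prev_stop_id, [])):
        if dist_to_stop(pt) <= BUFFER_M:
            arrival = pt  # keeps being overwritten by earlier in-buffer points
    departure = None
    for pt in gps_by_stop.get(stop_id, []):
        if dist_to_stop(pt) <= BUFFER_M:
            departure = pt  # overwritten until the latest in-buffer point
    return arrival, departure
```

Because each list is ordered by distance travelled, each scan touches only the handful of points near the stop boundary rather than the whole trip.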
The above algorithm alone worked well for trips with dwell time over 20 seconds at every
stop (recall that the real-time update period for AVL data was approximately 20 seconds);
however, there were occasions where the transit vehicle skipped stops because no boarding or alighting took place there. This could be particularly common on less busy routes where request stops had been implemented. For the no-dwell-time situation, the algorithm interpolated stop arrival times using the upstream and downstream GPS points closest to the stop of interest. The stop arrival time was interpolated assuming constant velocity from the upstream to the downstream point, and the dwell time was assigned as zero. A sample result of the stop times algorithm is demonstrated in the following figure.
Figure 14. Arrival and dwell time determination for a sample AVL trip
It is important to understand that dwell times of less than 20 seconds cannot be captured by AVL data with 20-second resolution. For transit systems with higher-resolution AVL reporting, this algorithm will be able to capture smaller dwell times. Since excessive dwell times were often the cause of significant delays and operational challenges for the surface transit system, especially at major subway stations, 20-second-resolution AVL data are sufficient for an application that aims to characterize the effects of delays and transit operational challenges.
5.3.4 Delay Calculations
In this thesis, transit schedule delay was defined as the difference between the scheduled arrival time and the actual observed arrival time of a vehicle at a stop. As evident from the manner in which the AVL trip route branch match and stop sequence were determined, the large number of short-turned and temporary shuttle services on some of the transit routes introduced a few complications in delay calculations. One complication was the determination of the best-matched scheduled arrival time; another was properly determining delay when there were missed arrivals at a stop. Once these complications were addressed, a best-matched scheduled arrival time was assigned to each observed arrival time, and the delay was determined as the difference between the scheduled and actual arrival times at a stop.
The simplest case in determining the best-matched scheduled arrival was the first observed stop time at each stop. For these cases, it was assumed that all previously scheduled stop times had been served, so there was no complication in selecting the best-matched scheduled arrival time or in penalizing delay. In such circumstances, the best-matched scheduled arrival time was simply the scheduled time closest to the actual arrival time, and the delay was the time difference between the scheduled and actual arrival times. This is demonstrated in Figure 15.
Figure 15. Delay determination for all previous stop times served
A more general algorithm for determining delay accounted for missed stops, early arrivals, and delay penalization by keeping a detailed sequence of arrivals and their matches to the schedule. This algorithm first conducted a simple assignment of actual arrival times to scheduled times based on the smallest time difference; that is, it found the scheduled arrival time closest to each actual arrival time.
Then, the next step reassigned observed stop times with conflicting matches, in reverse order. For instance, when two actual arrival times were both assigned to a single scheduled arrival time in the simple assignment step, the earlier of the two was reassigned to an earlier scheduled arrival time. This step was done in reverse order, from latest to earliest, to ensure that later matches were not overwritten, as only the earlier of each pair of identical matches was reassigned, effectively penalizing these instances of bunching. The reverse assignment step was important for bunched vehicles, where multiple vehicles may serve the same scheduled arrival time: the earlier vehicle may have missed a previously scheduled arrival time by being delayed and thus arriving at the same time as a later trip. By performing reverse reassignment, the delay of these bunched vehicles was properly calculated.
Even after addressing the complications in determining the best match, not all scheduled arrival times received a match with an actual arrival time. This could happen with missed trips or short-turned trips resulting from excessive delays on the route, a situation particularly common on congested streetcar routes. For instance, a vehicle could be delayed as it progressed along its trip, eventually missing its own scheduled arrival time at all its downstream stops. To properly account for the delayed vehicle, the delay for missed scheduled arrival times was applied only to the first vehicle that missed them. In the algorithm, missed-delay penalization at a stop was applied to these missed scheduled arrival times for the situation demonstrated in Figure 16.
Figure 16. Delay determination for a trip with missed scheduled arrival times
The calculation of delay focused principally on the delay experienced from the customer’s point of view. The trips following the one in Figure 16 would be technically late as a result of the missed trip if only a simple match to the schedule were made. However, this was not a realistic delay estimation, because cumulative penalization of delays could produce artificial delays not experienced by users. Instead, missed-trip delays were accounted for using the missed-stop delay penalization algorithm demonstrated in Figure 16. From the customer’s point of view, the vehicles arriving after the trip described in Figure 16 simply followed the later scheduled arrival times. The way the algorithm handles delay calculations after missed stop times is demonstrated in Figure 17. Upon excessive delay on the earlier trip, the following trip served the next sequence of scheduled arrival times at the downstream stop. This may result in some early arrivals from the customer’s point of view. However, if the service fully recovered from such delay, vehicles could eventually catch up to their schedules, and the delay calculations reflected this by allowing a single scheduled arrival time to be matched with more than one actual arrival time, given that all previous arrival times had been served.
Figure 17. Delay determination for a trip following the trip with missed arrival times
After the final matching of scheduled arrival times with observed arrival times, the delay values for each stop served by each observed trip were determined. The method of delay calculation used in this thesis comprised a three-step algorithm: simple schedule time assignment, reverse assignment to resolve match conflicts, and missed-stop delay penalization for missed scheduled arrival times. This algorithm could handle complex transit operations with short-turned trips and missed stops to compute realistic delay values for transit trips in a large-scale network.
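The first two of these steps can be sketched as below. This is a simplified illustration under the assumption that times are plain numbers (e.g., seconds past midnight); the full algorithm additionally applies the missed-stop penalization described above.

```python
def match_arrivals(actual, scheduled):
    """Step 1: match each actual arrival to its nearest scheduled time.
    Step 2: resolve conflicts in reverse order, from latest to earliest,
    pushing the earlier of a conflicting pair back to an earlier scheduled
    slot, which penalizes bunched vehicles.
    Returns (actual, matched scheduled) pairs; delay = actual - matched."""
    matches = [min(range(len(scheduled)), key=lambda j: abs(a - scheduled[j]))
               for a in actual]
    for i in range(len(matches) - 1, 0, -1):
        if matches[i - 1] == matches[i] and matches[i] > 0:
            matches[i - 1] = matches[i] - 1
    return [(a, scheduled[j]) for a, j in zip(actual, matches)]
```

For example, with scheduled times [0, 10, 20] and two bunched arrivals at 12 and 13, the simple step matches both to 10; the reverse pass reassigns the earlier arrival to 0, giving it the larger delay, as intended for the vehicle that missed its slot.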
5.3.5 Headway Calculations
In this thesis, three values related to headways were computed for model estimation:
observed headway, scheduled headway, and headway ratio. The observed headway was calculated based on AVL data and previously computed arrival times. The scheduled headway was computed based on the stop times of GTFS trips. The headway ratio was the quotient of the observed headway over the scheduled headway, as shown in the following equation.
Equation 2. Headway ratio calculation
HR_x = HW_x / SHW_x

HR_x: headway ratio to be calculated at stop position x
HW_x: actual headway at stop position x
SHW_x: scheduled headway at stop position x
To determine headways at each stop of an observed trip, the previously computed arrival
times for all the trips were grouped by stops, as shown in the following figure.
Figure 18. Headway determination demonstration
When determining actual headways, the AVL trip ID of each stop time value and the calculated headways were tracked so that the actual headway values were organized by trips rather than by stops. This was particularly important for cases with insufficient stop times at a stop, for example when only one trip serviced it. When the headway for a stop in a trip could not be computed using observed arrival times, it was interpolated from the headways and stop distances at the closest upstream and downstream stops with computed headway values. The interpolation was a distance-weighted average of the upstream and downstream headways, based on distance from the start stop, as detailed in the following equation.
Equation 3. Headway interpolation using upstream and downstream headways
HW_x = HW_upstream,j + [(D_x − D_upstream,j) / (D_downstream,i − D_upstream,j)] × (HW_downstream,i − HW_upstream,j)

HW_x: missing headway to be calculated at stop position x
D_x: distance from the start stop at stop position x
HW_upstream,j: upstream headway at stop position j
HW_downstream,i: downstream headway at stop position i
D_upstream,j: distance from the start stop at stop position j
D_downstream,i: distance from the start stop at stop position i
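Equation 3 is a linear interpolation in distance, which can be sketched directly; units are illustrative (metres for distances, seconds for headways).

```python
def interpolate_headway(d_x, d_up, hw_up, d_down, hw_down):
    """Equation 3: distance-weighted interpolation of a missing headway
    between the closest upstream stop (d_up, hw_up) and downstream stop
    (d_down, hw_down), all distances measured from the start stop."""
    frac = (d_x - d_up) / (d_down - d_up)
    return hw_up + frac * (hw_down - hw_up)

# Stop halfway between upstream (300 s headway) and downstream (360 s headway):
print(interpolate_headway(150, 100, 300, 200, 360))  # 330.0 seconds
```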
The calculation of scheduled headway was similar to the calculation of observed headway, except that the stop times used were from the GTFS trip stop times during the period of interest. Using the calculated observed headway and scheduled headway, the headway ratio could be obtained. These values were important in characterizing the effect of transit congestion during operations and could indicate the occurrence of transit vehicle bunching.
5.3.6 Previous Speeds
For the determination of previous running speeds for each link-based observation, it was important to keep track of the order of trip arrivals at stops. In this thesis, the previous trip’s speed on the current link (previous trip speed) and the current trip’s speed on the previous link (previous link speed) were used. The previous trip speed accounted for the temporal lag effect of travel speed, while the previous link speed accounted for the spatial lag effect. These previous speed attributes ensured the travel time prediction was realistic and reasonable, and not dependent on operational and link attributes alone.
To keep track of the order of previous trip speeds so they could be retrieved quickly, an object named “PreviousTripIDByTripIDByStopID” was constructed and used to query previous trip speeds. For the retrieval of the previous link speed, the current trip ID and the previous stop ID were used.
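The two lookups can be sketched as nested dictionaries. The structure below mirrors the “PreviousTripIDByTripIDByStopID” object described above (stop ID to trip ID to the trip that previously served that stop); the speed-store layout, keyed by (trip ID, stop ID), is an assumption for illustration.

```python
def previous_trip_speed(speeds, prev_trip_by_trip_by_stop, stop_id, trip_id):
    """Speed of the previous trip on the same link (temporal lag effect)."""
    prev_trip = prev_trip_by_trip_by_stop.get(stop_id, {}).get(trip_id)
    return speeds.get((prev_trip, stop_id)) if prev_trip is not None else None

def previous_link_speed(speeds, trip_id, prev_stop_id):
    """Speed of the same trip on the previous link (spatial lag effect)."""
    return speeds.get((trip_id, prev_stop_id))
```

Both retrievals are constant-time dictionary lookups, which matters when they are repeated for every link observation in a large network.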
5.3.7 Conversion and Initialization of Link Data
With transit operational attributes found for each AVL trip and each stop within the trip, the trip-based data could be converted to link-based data. Stop-to-stop transit travel was established in the framework of this thesis as it provided an appropriate level of detail for transit simulation. Use of the link-based data and the attributes from map matching allowed the incorporation of transit operational attributes, disruption attributes, link attributes, and route attributes.
To perform the conversion to link data, links with unique names were created using the start stop code and end stop code, separated by an underscore (e.g., “3271_8373”). After these unique link names were created, an algorithm went through all the AVL trips, reading the processed data organized in sequence by start stop ID and populating all the values of the transit operational attributes previously calculated. The transit operational attributes at the start stop were: stop time, dwell time, link distance (also a link attribute), running speed, delay, headway, and previous speeds. The link attributes identified were linkID (with start and end stop codes), link distance, and scheduleID. Additional unprocessed attributes were initialized so they could be computed in the next steps of the data processing algorithm.
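The conversion can be sketched by pairing consecutive stop records of a trip; the link naming follows the underscore convention above, while the record fields shown are an illustrative subset of the attributes listed in the text.

```python
def make_link_name(start_stop_code, end_stop_code):
    """Unique link name from start and end stop codes, e.g. "3271_8373"."""
    return f"{start_stop_code}_{end_stop_code}"

def trip_to_links(trip_stops):
    """Convert a trip's ordered stop records into link records carrying the
    operational attributes computed at the start stop of each link."""
    links = []
    for start, end in zip(trip_stops, trip_stops[1:]):
        links.append({
            "linkID": make_link_name(start["stopCode"], end["stopCode"]),
            "dwellTime": start["dwellTime"],
            "delay": start["delay"],
            "headway": start["headway"],
        })
    return links
```

A trip with n stops thus yields n−1 link records, each identified by its start and end stop codes.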
Figure 19. Generation of link data using trip data
After the link data objects were populated in memory by the algorithm, these data were updated in the SQLite database, as shown in the above figure. In this figure, all attributes were processed. The determination of disruption attributes such as weather and incidents, link attributes such as the number of intersections and volumes, and route attributes such as route ID and schedule ID are discussed in the following sections.
5.3.8 Geospatial Map Matching
Geospatial map matching of AVL data to other datasets allowed additional attributes to be incorporated in the analyses conducted in this thesis. The general approach to map matching was to use geospatial identifiers across multiple datasets to match related records. The additional data attributes extracted using this approach included weather, road restriction, intersection, and stop type attributes.
5.3.8.1 Weather
As mentioned in the previous chapter, weather data consisted of fields such as longitude, latitude, and weather report time. For this thesis, weather data from eight unique weather stations were collected and loaded for processing: Agincourt, Downsview, Downtown Toronto, Etobicoke, Rexdale, Scarborough Station, Thornhill, and Vaughan. These weather stations provided distributed and complete coverage of the Toronto region that the Toronto Transit Commission serviced.
For each link data record, an algorithm determined the nearest weather station based on the distance between the link and the station. Then, the record from that station closest in time to the link data time was assigned to the link. Once all the link data within all the link objects were updated, in memory and in the database table, with weather data via the assignment of a weather data ID (WeatherSQID), weather information such as temperature, humidity, wind speed, rain precipitation, and snow precipitation could be retrieved quickly for each link record when exporting features (variables).
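The two selection steps, nearest station in space and nearest record in time, can be sketched as follows. Planar coordinates and numeric timestamps are assumed for illustration; the station and record fields are hypothetical.

```python
import math

def nearest_station(link_midpoint, stations):
    """Pick the weather station closest to the link; planar distance assumed."""
    return min(stations, key=lambda s: math.hypot(
        s["x"] - link_midpoint[0], s["y"] - link_midpoint[1]))

def nearest_record(station_records, link_time):
    """Pick the station's record closest in time to the link observation."""
    return min(station_records, key=lambda r: abs(r["time"] - link_time))
```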
5.3.8.2 Road Restriction
Road restriction records had time fields and location fields. Each record provided the start time, end time, and last updated time of the restriction, as well as the longitude and latitude of the road restriction event. Using the geospatial fields, the road restriction event that existed on a link data record was determined. The road restriction data were collected for the entire City of Toronto and were maintained by the City of Toronto Transportation Services and Open Data Team. This dataset provided complete coverage of the areas that the Toronto Transit Commission serviced.
Unlike the weather data, which used proximity for the geographical match, a road restriction record had to be on the link to be considered valid for that link. To check this, a buffer width of 25 metres along the link was used to determine whether the road restriction record fell anywhere on the link. If a road restriction record fell within the link buffer, its “IncidentSQID” was assigned to that link; if no record was found for the link, the “IncidentSQID” was set to negative one (-1) to indicate no restrictions. In addition to the restriction record being geographically on the link, the link data time had to fall between the start and end times of the restriction for it to be assigned to the link. The final step of road restriction matching was to update the link data table in the SQLite database.
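The combined spatial-and-temporal check can be sketched as a single predicate. The distance-to-link computation is assumed to be done elsewhere and passed in; record fields other than “IncidentSQID” are illustrative.

```python
def restriction_applies(restriction, dist_to_link_m, link_time, buffer_m=25.0):
    """Assign a restriction to a link only if it lies within the 25 m buffer
    along the link AND the link observation time falls inside the
    restriction's active window. Returns the IncidentSQID, or -1 for none."""
    on_link = dist_to_link_m <= buffer_m
    active = restriction["start"] <= link_time <= restriction["end"]
    return restriction["IncidentSQID"] if on_link and active else -1
```

Requiring both conditions prevents a restriction that is nearby but inactive, or active but off the link, from being attached to a link record.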
5.3.8.3 Intersection Attributes
The intersection and signals data from Open Data Toronto provided the geographical locations of traffic signals, flashing beacons, and pedestrian crossovers, as well as the latest pedestrian and traffic volumes for over 2,000 intersections (City of Toronto, 2017b, 2017c). Before matching, four separate XML files were loaded into memory into a list of common type “SSINTNXSignalDataTable” with no duplicate intersections. Each intersection contained fields such as longitude and latitude, while some intersections contained additional fields such as street names, signal types (TSP, actuated, and adaptive), and volumes (8-hour pedestrian and traffic).
Similar to the procedure used for road restriction data, the algorithm for intersection attributes searched for intersections falling on the link, using a 25-metre buffer width. Only the intersections between the start and end stops of the link were added to the list of intersection IDs for the link object.
In addition to keeping track of the list of intersections on each link, several attributes related to intersections were also determined. The following list describes the attributes found, their variable names (in brackets), and any additional sub-procedures required to obtain them (after the colon):
• stop types (“far-side” and “near-side”): to determine whether the stop type of the start stop or end stop was far-side or near-side, the algorithm checked for any intersection near the start and end stops. The tolerance for the distance between the intersection and the potential stop was 50 metres. For instance, if the algorithm found an intersection 30 metres downstream of the end stop, then the end stop was classified as a near-side stop. Conversely, if the algorithm found an intersection 30 metres upstream of the start stop, then the start stop was classified as a far-side stop.
• the number of signalized intersections (“Num_Intxn”): the number of signalized
intersections was computed by going over the list of intersections found to be on the link
and counting the number of intersections that were signalized intersections.
• the number of TSP equipped intersections (“Num_TSP_equipped”): the number of
TSP intersections was determined by going over the list of intersections found to be on the
link and counting the number of intersections that were TSP intersections.
• the number of pedestrian intersections (“Num_PedCross”): the number of pedestrian
intersections was determined by going over the list of intersections found to be on the link
and counting the number of intersections that were pedestrian crossing intersections.
• the number of left turns made by the transit vehicle at a signalized intersection (“Num_VehLtTurns”): left turns made by transit vehicles were detected when there was a change in the direction of the path connecting the start and end stops in the anti-clockwise direction. Such changes needed to be accounted for to properly determine whether a point was on the link.
• the number of right turns made by the transit vehicle at an intersection (“Num_VehRtTurns”): right turns made by transit vehicles were detected when there was a change in the direction of the path connecting the start and end stops in the clockwise direction.
• the number of through movements made by the transit vehicle at an intersection (“Num_VehThroughs”): no turn was detected because the change in direction within the path connecting the start and end stops was minimal (<45 degrees).
• average vehicle volume on the link (“AvgVehVol”): the average vehicle volumes across
all intersection volume counts within the link. If there was no volume record for vehicles
at an intersection, that volume was ignored.
• average pedestrian volume on the link (“AvgPedVol”): the average pedestrian volumes
across all intersection volume counts within the link. If there was no volume record for
pedestrians at an intersection, that volume was ignored.
• is streetcar route (“IsStreetcar”): based on route properties and route code of the link
object, the route characteristics were updated to the link object during the geospatial map
matching step.
• dedicated right of way (“IsDedicatedROW”): based on route properties and route code
of the link object, the route characteristics were updated to the link object during the
geospatial map matching step.
Using the algorithm that performed geospatial map matching between intersection data and link object data, additional attributes about the weather, road restrictions, stops, intersections, and routes were extracted. The technique of geospatial map matching was successfully applied to multiple data sources to enrich the breadth of attributes related to link data and link objects, allowing for a more descriptive travel time model.
5.4 Data Export
After data processing had been completed, link data with a set of variables could be exported for model estimation of travel time models. Depending on the specification of the travel time model, the list of variables needed for estimation may differ. Based on the information available in the processed dataset, the following table lists the possible variables that were exported into a file in comma-separated values (CSV) format.
Table 6. List of possible variables, attributes, by data sources

AVL (Observed Arrival Times and Dwell Times at Stops): RunningSpeed, DwellTime, LinkDist, PrevLinkRunningSpeed, PrevTripRunningSpeed, TerminalDelay, Delay, Headway, ScheduleHeadway, HeadwayRatio, Day, Time_mins

GTFS (Schedule Arrival Times): RouteCode, GtfsScheduleID, LinkName, StopSeq, StartStopID

Weather (Weather Locations and Conditions by ID): TotalPptn

Road Restriction (Road Restriction Locations and Conditions by ID): HasIncident

Intersection and Signal Data (Intersection Locations and Properties by ID): Num_VehLtTurns, Num_VehRtTurns, Num_VehThroughs, Num_TSP_equipped, Num_PedCross, AvgVehVol, AvgPedVol, IsStartStopNearSided, IsEndStopFarSided

Route Properties (Customized Route Information; Streetcar or Bus, Shared or Dedicated Right of Way): IsStreetcar, IsSeparatedROW
5.5 Summary
This chapter demonstrated an automatic data processing procedure that extracted attributes from multiple sources of open data, which could be used to generate descriptive variables and features for model estimation of a large-scale transit network. There were four major steps involved in data processing. Data loading allowed the program to perform its operations in memory. AVL data preprocessing eliminated erroneous data. Trip data processing formed AVL trips; computed arrival and dwell times, delays, and headways; found previous speeds; and performed geospatial map matching to enrich link data attributes. Finally, data export converted the processed data attributes into specific link data variables. Using this list of possible variables, different types of travel time models were estimated and evaluated.
Chapter 6
Model Estimation
Once a list of defined variables had been obtained from data processing, they were used to estimate models of transit travel times. While it was common for previous studies to use terminal-station-based and time-point-based segments to model transit travel times, these methods are less flexible when applied to simulation, since generating transit vehicle arrival patterns at only key transit stops limits simulation applications. This thesis examines the potential of representing travel times for stop-based links using open data sources, namely GTFS, AVL, weather, road restriction, and intersection data. It explores various regression methods to model running speed and estimates an appropriate distribution model for dwell time.
The decision to model running speed and dwell time separately had a few major benefits.
Firstly, since dwell time is the result of passenger boarding and alighting at stops, rather than of
the variables available in this study (such as weather, road restrictions, or the number of
intersections on the link), it was more sensible to exclude dwell time from being characterized by
the variables in the running speed model. While passenger boarding and alighting can be affected
by weather and road restriction conditions, these effects cannot be properly characterized without
detailed passenger demand data at each stop. This assumption needs to be addressed in future
studies and extensions of the modelling framework presented by this thesis. Secondly, separating
dwell time enabled this thesis to evaluate the effectiveness of distribution-based dwell time models
at representing transit travel times accurately. Finally, the separation of dwell time allows future
studies to extend this data-driven modelling technique by investigating the use of APC or AFC
data to model the effect of passenger demand on dwell times, enabling high fidelity data-driven
transit simulations. The rest of this chapter presents a detailed procedure to estimate dwell time
and running speed models to support the simulation of a data-driven mesoscopic transit model.
6.1 Running Speed Model
This thesis used four main steps to estimate running speed models. The processed data
was first loaded into the R workbench, formatted according to data type, and split into training
and test data. Then, the variables were defined and the model type was assigned. The most
computationally intensive step was model training, which used R packages to estimate the
running speed models. Finally, the estimated models were used to generate predictions, and these
predictions were evaluated against test data. Figure 20 illustrates the four steps of model
estimation.
Figure 20. Model estimation process flow for running speed model
6.1.1 Data Retrieval
The data retrieval step involved the validation of data types and the splitting of the data table into
training and test data. The processed data was saved in a comma-separated values (CSV) file with
data column names. These data were first loaded into the R workbench using the "fread" function
from the "data.table" package. When performing model estimations with R, it was important to
specify the correct data type of each column. For example, the route number variable (RouteCode)
was recognized as an integer value by default, but it should be a categorical value. This was
particularly important because, for many modelling methods, categorical variables are treated
differently when performing estimations. After the data were loaded and the data types validated,
using the “TrainingOrTest” variable field included in the CSV and the split function, separate
training data and test data tables were loaded into the R workbench.
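The retrieval step above can be sketched in a few lines. The thesis performed it in R with "fread" from "data.table"; the following Python/pandas equivalent is an illustrative sketch only, using column names from the processed dataset (RouteCode, RunningSpeed, TrainingOrTest) and an invented three-row mini-dataset in place of the real CSV file:

```python
import io
import pandas as pd

# Hypothetical CSV standing in for the processed link-data export.
csv = io.StringIO(
    "RouteCode,RunningSpeed,TrainingOrTest\n"
    "504,18.2,Training\n"
    "504,15.1,Test\n"
    "192,33.0,Training\n"
)

# Load the processed data, then validate column types: RouteCode is read
# as an integer by default but must be treated as a categorical variable.
data = pd.read_csv(csv)
data["RouteCode"] = data["RouteCode"].astype("category")

# Split into separate training and test tables using the TrainingOrTest field.
train = data[data["TrainingOrTest"] == "Training"]
test = data[data["TrainingOrTest"] == "Test"]
print(len(train), len(test))
```

The same pattern scales to the full export: read once, coerce each column to its intended type, then partition on the split field.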
6.1.2 Model Specifications
To estimate meaningful models, the list of required variables, which depends on the
specific use case and functional requirements of the model, needs to be specified. This study
established two potential use cases for demonstration: a basic analysis of a transit route, and an
advanced analysis of the entire transit network. The route level model was designed to account
for temporal effects, transit operational characteristics, and basic link characteristics; its variables
were mainly based on attributes extracted from AVL and GTFS data only. The advanced analysis
model, on the other hand, required additional link and route characteristics. With larger sample
sizes, weather and road restriction effects on travel time could be explored as well.
Many variables were used to model running speeds in the route level basic model. To
capture the effect of temporal factors on running speed, the variables day of the week (Day), as
well as the time of the day (Time_mins), were used for the route level model. In addition, headway,
delay, previous link travel speed, and previous trip travel speed variables were used to account for
the effects of transit operational characteristics. Finally, link characteristics, such as link distance
and link name, were included to capture link effects.
Due to the variability between transit routes, the network level advanced model for
running speed required additional variables to account for the spatial variation of links and
routes. With additional link characteristics, such as stop locations, link distance, left and right
turns made by transit vehicles, as well as vehicle and pedestrian volumes, the network level
model could account for the variation due to link characteristics across the transit network.
Furthermore, the use of "hasIncident" (road closure incidents) and "totalPptn" (precipitation) was
possible with the network level model, since the data covered a geographical scale sufficient to
span areas with different incident occurrences and total precipitation.
Using the respective variable specifications for each use case, this thesis trained and
evaluated many regression models to determine the most appropriate modelling technique for
each use case. A summary of the different model specifications, along with a description of each
variable, can be found in the following tables.
Table 7. Variables for route level basic model

RunningSpeed (Continuous): Running speed between two stops
prevLinkRunningSpeed (Continuous): Running speed on the previous link upstream of the current link
prevTripRunningSpeed (Continuous): The previous trip's running speed on the current link
Day (Categorical): Day of week (1 for Monday, 6 for Saturday, etc.)
Time_mins (Continuous): Time of day in minutes since midnight of the current day
linkDist (Continuous): Distance of the current link
TerminalDelay (Continuous): Estimated schedule delay at the terminal station of a vehicle trip
Delay (Continuous): Estimated schedule delay experienced by the vehicle on the link
Headway (Continuous): The amount of time between vehicles across transit links
HeadwayRatio (Continuous): The ratio between scheduled and estimated headway of the vehicle at a stop
linkName (Categorical): The name of the link
Table 8. Variables for network level advanced model analysis

RunningSpeed (Continuous, 0 to 120 kph): Running speed between two stops
RouteCode (Categorical, 163 levels): Route code of the travelling vehicle
hasIncident (Categorical, 0 or 1): Whether the link segment has a road restriction
prevLinkRunningSpeed (Continuous, 0 to 120 kph): Running speed on the previous link upstream of the current link
prevTripRunningSpeed (Continuous, 0 to 120 kph): The previous trip's running speed on the current link
Day (Categorical, 0 to 6): Day of week (1 for Monday, 6 for Saturday, etc.)
Time_mins (Continuous, 0 to 86,400 mins.): Time of day in minutes since midnight of the current day
linkDist (Continuous, 0 to 11,600 m): Distance of the current link
TerminalDelay (Continuous, -1000 to 3000 s): Estimated schedule delay at the terminal station of a vehicle trip
Delay (Continuous, -1000 to 3000 s): Estimated schedule delay experienced by the vehicle on the link
Headway (Continuous, 0 to 3000 s): The amount of time between vehicles across transit links
HeadwayRatio (Continuous, 0 to 10): The ratio between scheduled and estimated headway of the vehicle at a stop
totalPptn (Continuous, 0 to 10 mm): Total precipitation reported at the weather station nearest to the current link
num_VehLtTurns (Categorical, 0 to 2): Number of left turns by the transit vehicle
num_VehRtTurns (Categorical, 0 to 3): Number of right turns by the transit vehicle
num_VehThroughs (Categorical, 0 to 14): Number of through movements at intersections made by the transit vehicle on the link
num_TSP_equipped (Categorical, 0 to 6): Number of TSP-equipped intersections on the link
num_PedCross (Categorical, 0 to 3): Number of pedestrian crossings on the link
avgVehVol (Categorical, 0 to 20,000): Average vehicle volume of the link
avgPedVol (Categorical, 0 to 10,000): Average pedestrian volume of the link
isStartStopNearSided (Categorical, 0 or 1): Whether the start stop is near-sided
isEndStopFarSided (Categorical, 0 or 1): Whether the end stop is far-sided
isStreetcar (Categorical, 0 or 1): Whether the route on the link is a streetcar route
isSeparatedROW (Categorical, 0 or 1): Whether the link on the route has a separated right-of-way
linkName (Categorical, 9267 levels): The name of the link
6.1.3 Model Training
After the training data, test data, and variable specifications were loaded into the R workbench,
specific models could be trained using the corresponding statistics and machine learning packages
in R. This thesis investigated five types of regression models for the estimation of running speeds:
multiple linear regression, support vector machine, linear mixed effect regression, regression tree,
and random forest. The following sections present the theory (model form and estimation
technique), key assumptions, and application (statistical package used and parameter
specifications) for each model.
6.1.3.1 Multiple Linear Regression
Multiple linear regression (MLR) relates the target variable y to multiple predictor
variables xᵢ using a linear combination of the predictor variables (Young, 2017). The relationship
between the target variable and the predictor variables can be expressed as:
Equation 4. Functional form of MLR models

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε = βXᵀ + ε

where y: response/target variable,
β: estimated coefficients for predictor variables,
X: predictor variables,
ε: residuals
Estimation of the MLR β coefficients is done by minimizing the sum of squared errors over the
sample observations, where m is the number of sample observations. This is also known as
ordinary least squares estimation, where the variance of the model is minimized. The objective
function to be minimized can be expressed as follows:

Equation 5. Objective function of MLR models

min [ (1/2m) ∑ⱼ₌₁ᵐ (εⱼ)² ]

where the residual εⱼ = yⱼ − ŷⱼ is the difference between observed and predicted values
Assumptions
There are four fundamental assumptions associated with MLR: a linear relationship,
normally distributed errors, homoscedasticity, and independence. The linear relationship
assumption is often a reasonable one, and when the specific relationship is known, such as
ln(y) = x or y = x², a transformation can usually be applied. In the context of transit travel times,
determining and specifying the type of nonlinear relationship is often difficult due to varying
conditions and the presence of random variation in the data. This variation generates random
errors, which are accounted for by the residual term. In MLR, the residuals are assumed to be
normally distributed. This is a reasonable assumption when modelling continuous variables with
large data sets, and it can be evaluated by examining the distribution of residuals. Thirdly,
homoscedasticity means that the standard deviations of the normally distributed errors must not
vary across the ranges of the predictor variables xᵢ; that is, the errors of y are distributed in the
same way regardless of the values of the predictors.
Finally, MLR assumes independence between predictor variables, which requires the predictor
variables to have zero covariance with one another. This is often an issue in modelling since
many of the specified variables are correlated: in this study, vehicle and pedestrian volumes are
certainly correlated, and headway and delay, as well as the types of vehicle movements at
intersections and link distance, may be correlated as well. MLR can account for dependence
between variables with interaction terms; however, if the effects of the correlations are weak and
the number of correlated variables is large, interaction terms can impose heavy computational
requirements when estimating model parameters.
Applications
This thesis used the MASS package in R to estimate MLR models (Ripley et al., 2017).
A formula and training data were used as inputs for MLR model estimation. The formula consisted
of the target variable "RunningSpeed" and the list of predictor variables in Table 7 and Table
8, as part of the model specification. The training data set was obtained at the end of the data
retrieval step. The MLR model does not contain any hyperparameters that need to be tuned or
specified, since the only quantity in its objective function is the residual.
For this thesis, interaction terms were not considered due to the relatively large number of
interaction terms that would need to be tested, which would increase model training times
exponentially. For instance, to fully test interaction effects among all k = 20 variables, given that
the number of interaction parameters is 2ᵏ − k − 1, the number of interaction effect variables
would be over a million. It would be even more difficult given that about half the variables used
in this thesis are categorical, requiring dummy variables that would further increase the number
of interaction parameters needed.
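Equations 4 and 5 can be made concrete with a small sketch that recovers β by least squares on synthetic data. This is a generic numpy illustration, not the thesis's R/MASS estimation; the coefficient values and noise level are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: an intercept column plus two continuous
# predictors (standing in for variables such as link distance or headway).
X = np.column_stack([np.ones(200), rng.uniform(0.0, 1.0, (200, 2))])
beta_true = np.array([20.0, -3.0, 5.0])
y = X @ beta_true + rng.normal(0.0, 0.1, 200)

# Ordinary least squares: beta_hat minimizes (1/2m) * sum of squared
# residuals, exactly the objective in Equation 5.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With 200 observations and small noise, the estimated coefficients land very close to the true values, which is the behaviour OLS guarantees under the assumptions listed above.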
6.1.3.2 Support Vector Machine
Support vector machine is a machine learning algorithm that derives its ability to perform
classification and regression from hyperplane margin optimization, and uses kernel functions
to model highly nonlinear relationships (Hsu, Chang, & Lin, 2016).
Support vector machine finds an optimal hyperplane that maximizes the separation or
"margin" between groups of data with different properties. This is achieved by identifying the
data points on the edges of each group and making them the support vectors. Figure 21 shows
how support vector machine separates two groups of data (blue/small and red/big points) using
edge vectors that support the margin with maximum separation ε. This optimal hyperplane
maximization uses the objective function identified in Equation 6. Using the idea of the optimal
hyperplane, the support vectors serve as the model for classification and regression in support
vector machine.
Figure 21. Separating and classifying data with support vector machine
Equation 6. Objective function of support vector machine

min [ C ∑ᵢ₌₁ˡ ξᵢ + (1/2) wᵀw ],  ξᵢ ≥ 0

where w is the normal vector to the hyperplane,
C is a constant on the error terms (regularization parameter),
l is the sample size, and
ξᵢ are the estimation error terms (Cortes & Vapnik, 1995).
Traditionally, modelling nonlinear relationships is computationally expensive due to the
large number of polynomial terms in the higher dimensional functions required for higher
order variables. To deal with this issue, support vector machine uses the kernel trick to obtain
the dot product of the transformed vectors without explicitly evaluating a higher dimensional
function. Support vector machine achieves this with a kernel function; common kernel functions
are the linear kernel, radial basis function kernel, and polynomial kernel (Hsu et al., 2016). The
use of kernels allows the objective function to be evaluated efficiently to obtain the set of support
vectors that minimizes it.
Equation 7. Linear kernel

K(xᵢ, xⱼ) = xᵢᵀxⱼ

where xᵢ and xⱼ are the feature vectors of two data points

Equation 8. Radial basis function (RBF) kernel

K(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²)

where the hyperparameter γ > 0

Equation 9. Polynomial kernel

K(xᵢ, xⱼ) = (γxᵢᵀxⱼ + r)ᵈ

where the hyperparameter γ > 0, r is a constant coefficient, and d is the polynomial degree
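For concreteness, the three kernels above can be written directly as functions of two feature vectors. This is a generic numpy sketch with arbitrary hyperparameter values, independent of any SVM package:

```python
import numpy as np

def linear_kernel(xi, xj):
    # Equation 7: plain dot product of the two feature vectors.
    return xi @ xj

def rbf_kernel(xi, xj, gamma=0.5):
    # Equation 8: exp(-gamma * squared Euclidean distance), gamma > 0.
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def poly_kernel(xi, xj, gamma=0.5, r=1.0, d=3):
    # Equation 9: (gamma * <xi, xj> + r)^d, with polynomial degree d.
    return (gamma * (xi @ xj) + r) ** d

xi = np.array([1.0, 2.0])
xj = np.array([2.0, 0.0])
print(linear_kernel(xi, xj))   # 2.0
print(rbf_kernel(xi, xi))      # identical points give 1.0
```

Each function returns the implicit dot product in the transformed feature space, which is all the SVM objective ever needs.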
To adapt support vector machine for regression, support vector regression uses an ε-insensitive
error measure, in which small errors are ignored while larger errors contribute linearly to the
loss beyond a constant ε (Hastie, Friedman, & Tibshirani, 2008). This makes the fitting less
sensitive to outliers (Hastie et al., 2008). The soft margin hyperplane and kernel properties are
not unique to classification, and past studies have demonstrated that they are integral to support
vector regression (Smola & Schölkopf, 2004).
Equation 10. Support vector machine ε-insensitive loss function for regression

ξ_ε = |r| − ε if |r| ≥ ε, and 0 otherwise

where r is the residual; errors are penalized only when |r| ≥ ε
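The ε-insensitive loss can be sketched directly: residuals inside the ε tube incur no penalty, and larger residuals are penalized linearly. This is a minimal illustrative example, not tied to any particular SVR package:

```python
def eps_insensitive_loss(residual, eps=0.001):
    # Equation 10: zero inside the epsilon tube, |r| - eps outside it.
    return max(0.0, abs(residual) - eps)

print(eps_insensitive_loss(0.0005))  # inside the tube -> 0.0
print(eps_insensitive_loss(0.101))   # outside -> approximately 0.1
```

This is the property that makes support vector regression robust: small fluctuations in running speed contribute nothing to the objective, while genuine deviations are penalized only linearly.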
Assumptions
While support vector machine is a much more flexible algorithm than MLR, it does require
the independent and identically distributed (I.I.D.) assumption on the error term to hold
(Balakrishna, Raman, Trafalis, & Santosa, 2008). Unlike MLR, support vector machine can
account for correlation between predictor variables through kernel functions. Because kernels can
implicitly map higher order combinations of variables, nonlinear as well as correlated relationships
can be modelled. Depending on the kernel chosen, the degree and specification of the interaction
terms differ. A linear kernel support vector machine is similar to ridge regression, where a linear
combination of variables is modelled, while a polynomial kernel support vector machine also maps
combinations of variables at higher orders (Marsupial, 2012). In addition, the radial basis function
kernel can be formed as an infinite sum over polynomial kernels; it can therefore be used as a
universal kernel approximator to represent any decision boundary and nonlinear shape (Bernstein,
2015). Consequently, the radial basis function kernel is preferred over the polynomial kernel.
Applications
This thesis used the “e1071” and “liquidSVM” packages to estimate support vector
machine models in R (Meyer et al., 2017; Steinwart & Thomann, 2017). A formula, the type of
kernel function, as well as the hyperparameters associated with the kernel function need to be
specified for the estimation of support vector machine models. Hyperparameters such as the
regularization parameter (C), kernel function parameters (𝛾, 𝑑), and loss function parameter (𝜀)
should also be determined through k-fold cross-validation, then specified for the final support
vector machine model.
A k-fold cross validation procedure using a grid-search method was used to find the most
suitable set of C and γ parameters. The k-fold cross validation procedure divides the training data
into k subsets and repeatedly trains the model on all but one subset, validating on the held-out
subset (Hsu et al., 2016). This method of finding C and γ prevents overfitting while minimizing
the root mean squared error (Hsu et al., 2016). A common ε value of 0.001 was used since it was
reasonable to exclude very small estimation errors (Hsu et al., 2016).
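The grid search with k-fold cross validation can be sketched as follows. To keep the example self-contained, closed-form kernel ridge regression with an RBF kernel stands in for SVR (the thesis's actual estimation used the "e1071" and "liquidSVM" packages in R); the data, grids, and the 1/C regularization mapping are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, (120, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, 120)

def rbf(A, B, gamma):
    # Pairwise RBF kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def cv_rmse(C, gamma, k=5):
    # k-fold CV: train on k-1 folds, validate on the held-out fold.
    folds = np.array_split(rng.permutation(len(X)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        K = rbf(X[trn], X[trn], gamma)
        # Kernel ridge in closed form: 1/C plays the role of the
        # regularization strength (an SVR stand-in for illustration).
        alpha = np.linalg.solve(K + np.eye(len(trn)) / C, y[trn])
        pred = rbf(X[val], X[trn], gamma) @ alpha
        errs.append(np.sqrt(np.mean((pred - y[val]) ** 2)))
    return np.mean(errs)

# Grid search over (C, gamma), keeping the pair with the lowest CV RMSE.
grid = [(C, g) for C in (0.1, 1.0, 10.0) for g in (0.1, 1.0, 10.0)]
best = min(grid, key=lambda p: cv_rmse(*p))
print("best (C, gamma):", best)
```

The same loop structure applies regardless of the underlying learner: every candidate hyperparameter pair is scored by held-out error, so the selected pair is the one least prone to overfitting.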
This thesis used support vector machine to investigate whether modelling nonlinear
relationships with different kernels improved model performance. In a preliminary analysis, we
compared the linear kernel and the radial basis function kernel. We eliminated the polynomial
kernel due to its computational complexity and potential numerical difficulties. The polynomial
kernel has a larger number of hyperparameters, which would require exponentially more
computing time when selecting them through cross-validation (Hsu et al., 2016). Also, the
polynomial kernel may generate zero or infinite kernel values for larger degrees (Hsu et al.,
2016). Consequently, a preliminary analysis comparing the linear and radial basis function
kernels was done on a group of 25 routes; the resulting RMSE values can be found in Table 9.
Table 9. Comparing linear and RBF kernels for support vector machine
Route RMSE for Linear Kernels* (kph) RMSE for RBF Kernels* (kph)
5 12.5 13.7
6 13.0 11.6
7 13.2 12.8
8 17.4 12.7
9 12.7 13.8
11 21.4 13.9
12 16.2 21.8
14 18.0 10.1
15 17.1 13.9
16 14.7 14.5
17 17.1 14.8
20 15.2 13.3
21 15.5 15.0
22 8.3 10.0
23 12.8 11.8
24 18.8 16.1
25 16.8 13.6
26 14.7 15.0
28 13.8 12.6
29 11.4 12.6
30 14.9 14.5
31 19.0 12.2
32 16.5 14.4
33 7.9 10.8
Average 15.0 13.6
Total Runtime for Routes (min.)  10  120
Runtime per Route (min.)  0.42  5.0
* RMSE: root mean square error
Based on the results of the preliminary analysis of support vector machine kernel functions,
the RBF kernel can further reduce the estimation error for route level models. As such, a k-fold
cross validation procedure with the RBF kernel was used in estimating support vector machine
models.
In addition to the selection of a kernel function, another issue in applying support vector
machine is its limited scalability to large datasets. Since finding an optimal solution requires the
dot products of all the feature vectors as well as optimizing the objective function, the
computational time complexity of support vector machine is O(n³), where n is the number of
samples (C. Hu, Zhou, & Hu, 2014). Because the computational time of conventional support
vector machine grows cubically with training sample size, support vector machine may run into
difficulties for network level models with very large sample sizes.
6.1.3.3 Linear Mixed Effect Model
Linear mixed effects models are an extension of the MLR model that addresses the
heteroscedasticity often present in repeated measurements and sampling. The linear mixed
effects model does this by modelling random effect variables in addition to the fixed effect
variables modelled by MLR. Homoscedasticity refers to the assumption that the distribution of
the error term does not vary across different values of the fixed effect variables. The use of
random effect variables accounts for heteroscedasticity by varying the intercept and/or slope of
the linear function over the levels of the random effect variable (Seltman, 2016). The linear
mixed effects model is mathematically expressed by the following equation.
Equation 11. Functional form of linear mixed effects models

y = βXᵀ + uZᵀ + ε

where y: response/target variable,
β: estimated coefficients for fixed-effect predictor variables,
X: fixed-effect predictor variables,
u: estimated coefficients for random-effect predictor variables,
Z: random-effect predictor variables,
ε: residuals
The estimation of the linear mixed effects model β and u coefficients requires the joint
likelihood of the fixed and random effect variables to be maximized. The objective function for
the log joint distribution of (y, u) can be expressed as follows (Gumedze & Dunne, 2011),
assuming u and y follow a joint Gaussian distribution with the following mean and covariance.

Equation 12. Mean and covariance of u and y for the linear mixed effects model

  [u]       ( [ 0  ]      [ G    GZ′ ] )
  [y]  ~  N ( [ Xβ ] , σ² [ ZG   H   ] )

where the variance-covariance of y is var(y) = σ²(ZGZ′ + R) = σ²H,
Z is the design matrix for the random-effect variables,
G is the known covariance component of the random-effect variables,
R is the known covariance component of the residuals.
It was shown in the literature that the above joint Gaussian distribution can be expressed as
follows (Gumedze & Dunne, 2011).

Equation 13. Log joint distribution of y and u for the linear mixed effects model

u ~ N(0, σ²G)
y | u ~ N(Xβ + Zu, σ²R)
log f(y, u) = log f(y|u) + log f(u)
            = const − (1/2σ²){ (y − Xβ)′R⁻¹(y − Xβ) + u′(Z′R⁻¹Z + G⁻¹)u − 2(y − Xβ)′R⁻¹Zu }
Using the equations above, estimates for 𝛽 and 𝑢 can be found by maximizing the log joint
likelihood of y and u. By including random effect parameter estimates u, the linear mixed effects
model can address data clustering, as well as correlations due to random effects and repeated
observations.
Assumptions
The linear mixed effects model deals with data clustering by modelling the variation across
random effect levels. However, linear mixed effects models do have a few important modelling
assumptions. The target variable is assumed to be linearly related to the fixed effect variables
at each random effect level. As discussed for multiple linear regression, the inclusion of higher
order terms and their respective interaction terms is very computationally intensive. Even with
the use of the kernel trick in support vector machine, higher order polynomials still impose a
significant computational constraint on model training. As such, the linear version of the mixed
effects model was chosen for analysis as a trade-off for more efficient computation. The residuals
still carry all the previous assumptions of constant variance, independence, and normal
distribution. The key difference for the linear mixed effects model is that the random effect
variable now explains a component of what was previously the residual in MLR, accounting for
heteroscedasticity.
Applications
The “lme4” package in R was used to estimate linear mixed effects models (Bates et al.,
2017). One advantage of linear models is that there is no need to perform cross-validation to
estimate hyperparameters, making the algorithm more computationally efficient. In addition, only
a formula specifying the random and fixed effect variables is needed. For this thesis, a varying
intercept model with link name as the random effect variable was used to account for clustering
effects due to links. This decision was based on preliminary exploratory data analysis of variable
clustering.
In the following figures, Figure 22 to Figure 25, it can be observed that the relationships
between running speed, previous link running speed, and previous trip running speed exhibited
clustering effects by link name. Although the separation of the clusters differed across routes and
between links, a general pattern of clustering emerged whereby each link occupied a region of
the graph space. For example, for route 192, link "14092_14278" (from Pearson Terminal 1 to
Pearson Terminal 3) had more sparse changes in speed from the previous link and trip than its
link "2148_14719" (from Dundas St W at the East Mall Cres to Kipling Station), where the
changes in speed from the previous link and trip were minimal. For streetcar routes such as 504,
a clustering effect by link existed as well: while link "11596_11596" (Roncesvalles Ave at Queen
St W to Roncesvalles Ave at Queen St W North) occupied the lower speed regions, link
"4573_14992" (Roncesvalles Ave at Galley Ave to Roncesvalles Ave at Marion St South Side)
occupied higher speed regions.
Figure 22. Clustering pattern for links for route number 192
Figure 23. Clustering pattern for links for route number 504
Figure 24. Clustering pattern for links for route number 196
Figure 25. Clustering pattern for links for route number 510
6.1.3.4 Regression Trees
A regression tree is a nonparametric modelling technique based on classification and
regression trees (CART). It uses the idea of node splitting to partition data into different nodes,
forming a decision tree which models the outcome by applying the node splitting decisions based
on predictor variable values. The advantage of the regression tree is that it can determine the
variables most important for generating predictions, and use those variables to generate
predictions efficiently. To perform node splitting, the regression tree uses an impurity measure
for each node (Equation 14) and maximizes the decrease in impurity resulting from each node
split (Equation 15) (Berk, 2008). The objective of building trees is to maximize the decrease in
impurity with each node construction, resulting in a tree with minimum impurity.
Equation 14. Impurity of a node

i(τ) = ∑ᵢ (yᵢ − ȳ(τ))²

where τ is a node on the tree,
ȳ(τ) is the mean of the node, which is also the prediction value

Equation 15. Change in impurity due to node split

Δ(s, τ) = i(τ) − i(τ_L) − i(τ_R)

where s is the candidate split, and τ_L and τ_R are the resulting left and right child nodes
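Equations 14 and 15 can be illustrated with a tiny best-split search: for each candidate threshold on a predictor, compute the impurity drop Δ and keep the split that maximizes it. This is generic Python on invented data, not the rpart implementation:

```python
def impurity(ys):
    # Equation 14: sum of squared deviations from the node mean.
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split(xs, ys):
    # Equation 15: pick the threshold maximizing i(tau) - i(left) - i(right).
    parent = impurity(ys)
    best = None
    for t in sorted(set(xs))[1:]:
        left = [v for u, v in zip(xs, ys) if u < t]
        right = [v for u, v in zip(xs, ys) if u >= t]
        delta = parent - impurity(left) - impurity(right)
        if best is None or delta > best[1]:
            best = (t, delta)
    return best

# Speeds that jump once the predictor exceeds 4: the split is found there.
xs = [1, 2, 3, 5, 6, 7]
ys = [10.0, 11.0, 10.5, 30.0, 31.0, 29.5]
print(best_split(xs, ys))  # threshold 5 yields the largest impurity drop
```

A full tree simply applies this search recursively to each child node until a stopping criterion (minimum node size or minimum impurity decrease) is met.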
There are two strategies to avoid overfitting. The first is to set a minimum sample size,
which prevents nodes with very few samples from being formed (Berk, 2008); such nodes may
not generalize robustly. The second is to limit the size of the tree by disallowing node splits that
yield a very small change in impurity, referred to as "pruning" (Berk, 2008). Tree pruning for
regression requires the consideration of penalties based on AIC, which accounts for the number
of estimated parameters (Berk, 2008). By either of these strategies, the complexity of the tree
can be reduced and the effect of overfitting minimized.
One advantage of the regression tree is that it is easy to visualize, since the model itself
can be represented as a decision tree diagram. To demonstrate, a sample regression tree model
was trained and graphed using processed data for the TTC network. See Figure 26 for a diagram
of the tree and Figure 27 for a graph showing how model errors decrease with an increasing
number of node splits.
Figure 26. Diagram of a simple regression tree model
Figure 27. Relative error of regression tree with increasing number of splits
Assumptions
Since the regression tree is constructed by node splitting on variable values, it does not
model the linear or nonlinear relationship between the predictor variables and the response
variable. At every node split, the regression tree determines the point of best split along the
possible range of a predictor variable and among all predictor variables. Because the regression
tree is essentially a data partitioning procedure, each terminal node of the tree is assumed to hold
a relatively homogeneous set of samples, with each sample value as close to the mean of the node
as possible. This leads to the assumption that the samples within each regression tree node do not
hold linear or nonlinear relationships internally.
Another assumption of the regression tree stems from the way variables are chosen when
performing node splits. At each node split, the levels of categorical variables and the values of
continuous variables are assessed as possible split criteria. The variables that are more important
appear higher up in the tree, and important variables with many possible values or levels may
appear more than once, sometimes many times, on the tree. While important variables are
responsible for making multiple partitions on the tree, less important variables with weaker
effects may not appear on the tree nodes at all. As such, an assumption made by the regression
tree is that weak effects from less important variables do not contribute to model fitness.
Consequently, outliers and small samples due to rare occurrences, such as disruptions, have
minimal effects on the model.
Applications
The “rpart” package from R was used to estimate regression tree models (Therneau et al.,
2017). A formula, a minimum split size, a model complexity, and an optional prune complexity are
specified for the regression tree trainer to generate a tree. The formula, used unchanged across the
different methods, consisted of the response variable name with a list of predictor variable names. The
minimum split size has a default of 20, which is a reasonable choice for most modelling applications
(Therneau & Atkinson, 2017). In addition, the model complexity is the minimum decrease in
impurity before node splitting terminates; it is a stopping criterion for node splitting during model
estimation. Finally, the prune complexity is the maximum increase in impurity the model allows
when pruning (removing) nodes to limit overfitting.
There are two ways to prevent overfitting for regression trees. The k-fold cross-validation
procedure can find the best complexity parameter for the regression tree without overfitting.
As with the support vector machine models in this thesis, this procedure divides the entire data
set into k folds; for each fold, a model is trained on the remaining (k − 1) folds and its performance
is evaluated against the held-out fold. The model complexity with the best cross-validated
performance is then selected. Another way to reduce overfitting in a regression tree is by specifying
a pruning complexity parameter. After a tree with a certain complexity parameter (CP), for example
a train CP of 0.0015, has been trained, the tree pruning procedure can be initiated with a prune CP
larger than the train CP, such as 0.005. The pruned tree is less likely to overfit the data. For this
thesis, the k-fold cross-validation procedure was used, since it is a more robust means of preventing
overfitting and a very commonly used method to optimize model parameters.
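The k-fold procedure described above can be sketched as follows (a Python illustration with generic `train` and `evaluate` callables, not the thesis's R code): each fold in turn is held out for evaluation while a model is trained on the remaining folds.

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds."""
    folds, start = [], 0
    for j in range(k):
        size = n // k + (1 if j < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train, evaluate):
    """Train on k-1 folds, evaluate on the held-out fold, average the errors."""
    folds = k_fold_indices(len(data), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        test_set = [data[i] for i in held_out]
        train_set = [data[i] for i in range(len(data)) if i not in held]
        model = train(train_set)
        errors.append(evaluate(model, test_set))
    return sum(errors) / len(errors)
```

Running this loop once per candidate complexity parameter and keeping the CP with the lowest averaged error is the selection rule used in this chapter.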
6.1.3.5 Random Forest
Random forest is a tree ensemble learning method that is resistant to overfitting. It
is a combination of decision trees, be they regression trees or classification trees, where each tree is
constructed using a random and independent subsample of the data (Breiman, 2001). Random forest is
an ensemble method because it combines the predictions of many weaker predictors, or trees.
This makes the random forest a very flexible machine learning algorithm, as it can deal with
nonlinearity, variable dependencies, and heteroscedastic errors. Unlike a single tree, the random
forest can model weaker parameter effects, since weaker trees can contain splits of less important
variables with smaller effects. Most importantly, it has been shown that as the number of trees
grown increases, the test or cross-validation error converges to a limit by the law of large
numbers (see Equation 16) (Breiman, 2001). This means that the random forest model does not
suffer from the overfitting problem.
Equation 16. Random forest regression convergence with increasing number of trees
E_{X,Y}(Y − av_k h(X, Θ_k))² → E_{X,Y}(Y − E_Θ h(X, Θ))²
where h(X, Θ_k) is the prediction of an individual tree,
av_k h(X, Θ_k) is the average prediction over all individual trees,
Y is the labelled/true value,
E_Θ h(X, Θ) is the expected prediction over all individual trees (the converged result).
Random forest combines results from numerous trees, which enables it to deal with
clustered data and replicate nonlinear relationships. A key requirement of training a random forest
is the randomization of the subsample draws, which aims to achieve low correlations between trees
(Breiman, 2001). This is achieved by performing the subsample draws for training individual trees
independently and at random. Figure 28 shows that a random forest can contain
numerous trees covering a range of variables, giving rise to complex prediction behaviours.
Figure 28. Illustration of a random forest (R. Hänsch & O. Hellwich, 2015)
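The bagging-and-averaging mechanics can be sketched in Python (illustration only: the "trees" here are stubs that predict a sample mean, standing in for real regression trees; names are hypothetical):

```python
import random

def bootstrap_sample(data, rng):
    """Each tree sees an independent random subsample drawn with replacement."""
    return [rng.choice(data) for _ in data]

def train_stub_tree(sample):
    """Stand-in for a regression tree: predicts the sample mean everywhere."""
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean

def forest_predict(trees, x):
    """Ensemble prediction = average over all trees (av_k in Equation 16)."""
    return sum(t(x) for t in trees) / len(trees)

rng = random.Random(42)
data = [(i, 2.0 * i) for i in range(10)]  # toy (x, y) pairs
trees = [train_stub_tree(bootstrap_sample(data, rng)) for _ in range(100)]
```

Because each stub is trained on an independent bootstrap draw, individual predictions vary, but the average over many trees settles toward a stable value, which is the convergence behaviour Equation 16 formalizes.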
Assumptions
The model accuracy of any algorithm is dependent on the quality of the data, and data
quality is particularly important for the random forest. While the random forest has been shown to
deal well with noise or random errors, data that are not representative of the true relationships can
produce inaccurate models (Louppe, 2014). Beyond the need for high-quality data, the random
forest is a very flexible and robust nonparametric modelling method that can model arbitrarily
complex relations and handle heteroscedasticity, nonlinear relationships, and variable dependencies
(Louppe, 2014).
Applications
There are many open-source R packages available to train random forests. When choosing
a package, it is important to select one that optimizes the subsampling routine
to allow trees to be trained in parallel while minimizing memory usage. This thesis used the “ranger”
package for the training of the random forest (Wright, 2017). A formula and the number of trees must
be provided to the random forest trainer. The formula, like those of the previous models, consisted of the
response variable name with a list of predictor variable names. The number of trees determined
the number of subsamples needed to train that many trees. While increasing the number of trees
generally improves model accuracy and reduces error, a large number of trees requires substantial
memory and takes a long time to train. In this thesis, we found that 100 trees were sufficient
for a random forest to provide an accuracy close to the theoretical maximum with infinite trees,
while increasing the number of trees beyond 250 provided no detectable accuracy gains but
drastically increased memory and computational requirements.
Table 10 compares random forest running speed models with different numbers of trees.
As the number of trees increases, the error approaches a limit at around an RMSE of 9.33 kph.
While the error decreases exponentially toward this limit, the training time increases linearly with
the number of trees (see Table 10 and Figure 29). To balance training time and error, an appropriate
number of trees was selected for this study based on where the error seemed to plateau while the
training time continued to increase. Based on the errors in the table alone, without consideration
for training time, the random forest with 200 trees performed best. However, when considering
both factors, as shown in Figure 29, the random forest with 100 trees seemed to provide the best
trade-off between training time and error reduction. If a faster implementation is desired, then 50
trees would be a reasonable choice.
Table 10. Random forest performances with increasing number of trees
Number of Trees        | 10    | 25    | 50    | 100   | 150   | 200
R2                     | 0.301 | 0.340 | 0.353 | 0.359 | 0.362 | 0.363
MAPE                   | 33.7% | 32.9% | 32.7% | 32.5% | 32.5% | 32.4%
MAE                    | 7.429 | 7.214 | 7.147 | 7.109 | 7.098 | 7.086
RAE                    | 0.810 | 0.786 | 0.779 | 0.775 | 0.774 | 0.772
RMSE                   | 9.785 | 9.505 | 9.415 | 9.366 | 9.349 | 9.338
RRSE                   | 0.836 | 0.812 | 0.805 | 0.800 | 0.799 | 0.798
Reduction in RMSE      | 0.0%  | 2.86% | 3.89% | 4.45% | 4.66% | 4.78%
Training Time (min.)   | 1.53  | 3.66  | 7.10  | 14.68 | 21.85 | 29.06
Prediction Time (min.) | 0.06  | 0.11  | 0.18  | 0.33  | 0.48  | 0.63
Figure 29. RMSE decreases and training time increases with increasing number of trees
6.1.4 Model Evaluation
To evaluate the performance of the different modelling methods, six error and performance
measurements were computed using Equations 17 to 22. They can be used to select the model with
the best prediction performance. While a better model generally provides a lower error and better
fit, each measure captures a slightly different aspect of model performance. This section provides
the error definitions, what they indicate, and their possible ranges of values.
Common notations for calculating the model performance measurements are the observed values
𝑦𝑖 (y), the predicted values �̂�𝑖 (y-hat), and the average observed value �̅� (y-bar). Many measurements use
(𝑦𝑖 − �̅�) as a baseline. A model that predicts �̅� for every observation is often referred to as the
no-relationship horizontal line model, the average value model, or the default predictor.
6.1.4.1 Coefficient of determination (R2)
The coefficient of determination, or R squared, is the proportion of the variance explained
by the predictor variables (Allen, 1997). The unexplained error is found by computing
the sum of squared residuals (SSR). The total sum of squares (SST) measures the deviations of the
observations from the no-relationship horizontal line �̅�. The proportion of unexplained error is the
quotient of the two, adjusted by degrees of freedom, and the proportion of explained variance is
therefore one minus that quotient. The calculation of the coefficient of determination is shown in
Equation 17. A typical range of the coefficient of determination is from 0 to 1.
Equation 17. Coefficient of determination
R² = 1 − (SSR/df_e) / (SST/df_t)
where the sum of squared residuals: SSR = Σ_{i=1}^{n} (y_i − ŷ_i)²,
the total sum of squares: SST = Σ_{i=1}^{n} (y_i − ȳ)²,
the degrees of freedom: df_e = n − 1, df_t = n − p − 1.
While there is a variant of R squared called adjusted R squared that penalizes the
number of parameters, such penalization is not necessary for this study. The penalization is
supposed to limit the overfitting caused by adding an increasing number of parameters, but it is
only meaningful when comparing alternative multiple linear regression models. Many of the
models presented in this study have other mechanisms to prevent overfitting, such as the
regularization term C (or 1/lambda) for the support vector machine, regression tree pruning, and
the cross-validation procedures used in model training and validation. These methods are much
more powerful and provide a far better guarantee against overfitting than adjusted R squared. In
addition, the variable specifications used in this study do not differ across models; we relied on
the ability of each algorithm to determine the degree to which each variable contributes to the final
predictions. Since the variable specifications are the same across all alternative models, adjusted
R squared is simply not a meaningful measure. As such, the regular coefficient of determination
was used for this study.
6.1.4.2 Mean absolute percentage error (MAPE)
The mean absolute percentage error is a commonly used measure to evaluate forecast
predictions (Swanson, Tayman, & Bryan, 2011). MAPE has the statistical property that it uses all
observations and has minimal variability from sample to sample (Swanson et al., 2011). This error
measure is useful for cases where the relative deviations from the true value are more important
than the absolute deviations. The MAPE is simply the average of the relative errors between the
predicted and observed values, as shown in Equation 18. Typical MAPE values range from 0% to
100%.
Equation 18. Mean absolute percentage error
MAPE = (1/n) Σ_{i=1}^{n} |(ŷ_i − y_i) / y_i| × 100%
6.1.4.3 Mean absolute error (MAE)
The mean absolute error considers the unsigned deviation of the predicted value from the
observed value, regardless of the relative size of the prediction. This error measure provides the
average of the absolute deviations and is most suitable for uniformly distributed errors (Chai &
Draxler, 2014). MAE measures the average error without penalizing the size of individual errors,
thus indicating an average expected error without exaggerating the effect of outliers
(Witten, Frank, & Hall, 2011). The calculation of MAE is shown in Equation 19. MAE values can
range from 0 to infinity.
Equation 19. Mean absolute error
MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|
6.1.4.4 Relative absolute error (RAE)
The relative absolute error is the absolute error normalized by the residuals of the simple
average value predictor (Witten et al., 2011). The benefit of a relative error is that it compares the
current model against the simple horizontal line model (see Equation 20). The lower the RAE, the
better the model is relative to the simple horizontal line model. Typical RAE values range from 0
to 1, with 0 being a perfect model that explains all errors and 1 being a model equivalent to the
average value model; it is possible for a model to perform worse than the average value model.
Equation 20. Relative absolute error
RAE = Σ_{i=1}^{n} |ŷ_i − y_i| / Σ_{i=1}^{n} |y_i − ȳ|
6.1.4.5 Root mean square error (RMSE)
The root mean square error is one of the most commonly used error measurements for
comparing model performance (see Equation 21) (Witten et al., 2011). The underlying assumption
of RMSE is that the errors are unbiased and normally distributed (Chai & Draxler, 2014). Because
RMSE squares the deviations, it penalizes large errors to a greater extent than small errors, and it
is generally considered a better representation of normally distributed errors (Chai & Draxler,
2014). However, RMSE may overestimate error values if the errors are not strictly normal. A
typical value of RMSE can range from 0 to infinity.
Equation 21. Root mean square error
RMSE = √[(1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²]
6.1.4.6 Root relative square error (RRSE)
The root relative square error takes the RMSE and normalizes it by the error of the default
predictor (see Equation 22) (Witten et al., 2011). It combines the advantages of accurately
representing outliers and of comparing the model errors to the no-relationship horizontal line
model. A lower RRSE indicates a model that reduces outlier errors and total errors much better
than the average value model. Typical RRSE values range from 0 to 1, but values greater than 1
are possible if the model performs worse than the default predictor.
Equation 22. Root relative square error
RRSE = √[Σ_{i=1}^{n} (ŷ_i − y_i)² / Σ_{i=1}^{n} (y_i − ȳ)²]
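The six measures defined above can be computed together from the observed and predicted vectors, as in this Python sketch (note it uses the plain R², 1 − SSR/SST, rather than a degree-of-freedom-adjusted form; the function name is hypothetical):

```python
import math

def metrics(y, yhat):
    """Compute the six measures of Equations 17-22 for observed y
    and predicted yhat (plain R2, no df adjustment)."""
    n = len(y)
    ybar = sum(y) / n
    ssr = sum((a - p) ** 2 for a, p in zip(y, yhat))   # sum of squared residuals
    sst = sum((a - ybar) ** 2 for a in y)              # total sum of squares
    return {
        "R2": 1 - ssr / sst,
        "MAPE": 100.0 * sum(abs((p - a) / a) for a, p in zip(y, yhat)) / n,
        "MAE": sum(abs(p - a) for a, p in zip(y, yhat)) / n,
        "RAE": sum(abs(p - a) for a, p in zip(y, yhat))
               / sum(abs(a - ybar) for a in y),
        "RMSE": math.sqrt(ssr / n),
        "RRSE": math.sqrt(ssr / sst),
    }
```

The relative measures (RAE, RRSE) and the absolute measures (MAE, RMSE) share numerators, which is why a model that beats the default predictor shows both a low absolute error and a relative error below 1.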
6.1.4.7 Training and Prediction Time
The training and prediction times can differ greatly between machine learning algorithms
due to differing mathematical and software implementations. In this thesis, the time required to
train and test a machine learning model was recorded for each model as an additional model
selection criterion. The training time refers to the duration it takes for a machine learning package
to use the input training data and model specification to compute and return a trained model object.
The prediction time, on the other hand, refers to the duration it takes for the package to use the
input test data and trained model object to return a list of predictions, previously referred to as
�̂�𝑖 (y-hat). In general, the prediction time is lower than the training time. The training and
prediction times were reported in minutes.
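Wall-clock timing of this kind can be captured with a simple wrapper (a generic Python sketch; the thesis recorded its timings through its own R and C# tooling):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed time in minutes)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) / 60.0
```

Wrapping both the training call and the prediction call with the same helper keeps the two durations directly comparable across algorithms.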
6.2 Dwell Time Model
Previous studies have demonstrated that transit dwell times follow a lognormal distribution
(Li, Duan, & Yang, 2012). The characteristics of the lognormal distribution are consistent with many
of the properties of dwell time. Firstly, dwell times can only be positive. Secondly, dwell time is
a continuous variable. Finally, dwell times are often right skewed, with infrequent large dwell times
and frequent small to medium dwell times.
The lognormal distribution parameters of dwell times were estimated and evaluated. The
training data were loaded and all the dwell time observations were organized into stops. Then, a
lognormal distribution of dwell time was estimated for each stop. Finally, the dwell time
predictions across all stops in the network were compared against test data to evaluate the fitness
of the lognormal distribution models.
Figure 30. Model estimation process flow for dwell time model
6.2.1 Estimate Lognormal Distribution Parameters
The training data were loaded into a dictionary of dwell time lists, grouped by the
stop IDs of the dwell times. Using the dwell time observations at each stop, the parameters of
the lognormal distribution model were estimated using the Math.NET Numerics library in C#
(Rüegg et al., 2017).
The theory of estimating the lognormal distribution parameters is similar to that for the normal
distribution. The log-mean parameter was computed as the average of the log values of the
observations (see Equation 23). The shape parameter �̂�2 is the average squared deviation of the log
values from the log mean (see Equation 24). The log-mean parameter and shape parameter for each
stop defined a lognormal distribution that can be used to generate predictions.
Equation 23. Estimation of log mean parameter for lognormal distribution
μ̂ = (Σ_{k=1}^{n} ln x_k) / n
where x_k is a dwell time sample observation from the training data,
n is the number of samples.
Equation 24. Estimation of shape parameter for lognormal distribution
σ̂² = (Σ_{k=1}^{n} (ln x_k − μ̂)²) / n
where x_k is a dwell time sample observation from the training data,
n is the number of samples,
μ̂ is the log mean.
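Equations 23 and 24 amount to taking the mean and population variance of the log-transformed samples, as in this Python sketch (the function name is hypothetical; the thesis used Math.NET Numerics in C#):

```python
import math

def fit_lognormal(dwell_times):
    """Estimate the log-mean (Equation 23) and shape sigma^2 (Equation 24)
    from a list of positive dwell time samples."""
    logs = [math.log(x) for x in dwell_times]
    mu = sum(logs) / len(logs)
    sigma2 = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, sigma2
```

One (mu, sigma2) pair is fitted per stop and stored in the stop-indexed dictionary described above.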
6.2.2 Dwell Time Model Evaluations
The dwell time distribution for each stop was used to generate the dwell time predictions
during simulation. The chi-square test was used to assess the goodness of fit of the predicted dwell
times. Since the dwell times at each stop were estimated from AVL data points, the resolution of
the dwell times was 20 seconds, which means that a bin size of at least 20 seconds must be used to
conduct the chi-square test. The chi-square test statistic was computed from the predicted dwell
time frequencies (𝑃𝑖) and observed dwell time frequencies (𝑂𝑖) using Equation 25. The null
hypothesis that “the predicted dwell times follow the distribution of the observed dwell times” was
evaluated using the chi-square test statistic: if the computed chi-square value exceeds the critical
chi-square value, the null hypothesis is rejected.
Equation 25. Chi-squared test statistic
χ² = Σ (P_i − O_i)² / O_i
where P_i is the predicted dwell time frequency,
O_i is the observed dwell time frequency.
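The binning and test statistic of Equation 25 can be sketched as follows (a Python illustration; the 20-second bin width matches the AVL resolution, the function names are hypothetical, and skipping empty observed bins is an assumption of this sketch):

```python
def bin_counts(times, width=20, nbins=10):
    """Histogram dwell times (seconds) into fixed-width bins; the last bin
    absorbs any overflow."""
    counts = [0] * nbins
    for t in times:
        counts[min(int(t // width), nbins - 1)] += 1
    return counts

def chi_square_stat(observed, predicted):
    """Equation 25: sum of (P_i - O_i)^2 / O_i over the bins
    (bins with no observed counts are skipped in this sketch)."""
    return sum((p - o) ** 2 / o
               for o, p in zip(observed, predicted) if o > 0)
```

The resulting statistic is then compared against the critical chi-square value at the chosen significance level and degrees of freedom.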
6.3 Summary
This chapter presented the procedure this thesis used to estimate the running speed and
dwell time models. After retrieving the processed data and specifying model parameters, this thesis
trained five different machine learning models for running speed: multiple linear regression,
support vector machine, linear mixed effect model, regression tree, and random forest. The theory,
assumptions, and applications of each model type were discussed in detail. Finally, six different
model performance and error measurements were calculated to comprehensively compare the
ability of these models to capture errors under various assumptions. The running speed model with
the best fit and performance was selected to represent running speeds in the large-scale transit
simulation. Similarly, lognormal dwell time models for every stop across the network were
estimated using dwell times from the training data, and their goodness of fit was evaluated using
the chi-square test. The estimated models for running speeds and dwell times were used to generate
simulated transit trips.
Chapter 7
Simulation Procedures
The data-driven mesoscopic transit simulation model presented in this thesis used a running
speed regression model and a dwell time distribution model to represent the movement of transit
vehicles. This thesis demonstrates the capability of the data-driven simulation model using a
simplified simulation procedure to predict the network behaviour of a base case scenario.
Adaptations of this simulation procedure, with changes to the initial simulation parameters, can
assess the impacts of more complex planning scenarios. While more software development effort
may be required to apply the running speed and dwell time models to more complex planning
scenarios, this chapter focuses on addressing the fundamental challenges of simulating large-scale
transit networks with spatiotemporal variables.
Figure 31. Simulation procedure process flow
Three major challenges associated with performing mesoscopic simulations following the
loading of a running speed and dwell time model were identified (see Figure 31). Firstly, the
scheduled trips that represented the simulation scenarios needed to be converted into moving
vehicles that contain a series of stop-to-stop itinerary data, containing a list of predictor variable
values identical to the link data. Another challenge was to perform as many predictions in parallel
as possible while updating the vehicle itinerary data in the proper order. Many of the predictor
variables were spatiotemporal in nature, so they needed to be updated in order of space and time
as well; examples include previous link speed, previous trip speed, headway, and delay. To update
the vehicle itinerary data, this thesis grouped vehicles into batches, performed predictions, then
updated the next link's and next trip's vehicle itinerary data. Finally, after simulation, an important
challenge was to extract all the vehicle itinerary information to enable analysis and report
generation. The last section of this chapter presents the reporting functions used to generate
time-distance diagrams, route speed reports, and stop delay reports.
7.1 Load Trained Models
The running speed models presented in this thesis were trained in an R workbench. Since
the simulation procedure was written in C#, R scripts needed to be called from the C# environment.
R.NET is an in-process interoperability bridge to R from the .NET languages that is capable of
establishing R instances inside the .NET framework (Perraud, 2015). R.NET allows one to initiate
an R workbench instance from C# and perform any given R function. As such, the training and
loading of the running speed model were done in an R workbench instance inside C#. Before the
simulation begins, the procedure loads the trained running speed model into the in-memory R
workbench.
In contrast, the dwell time models were trained in the native C# environment, so there were
no interoperability considerations. The dwell time models were loaded into memory before the
start of the simulation and stored in a dictionary indexed by stop code. During the simulation run
time, the dwell time model for a specific stop was used to draw a random dwell time from the
estimated lognormal distribution.
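A per-stop draw reduces to sampling from the fitted lognormal distribution, for which Python's standard library provides `random.lognormvariate` (a sketch only; the thesis performed these draws with Math.NET in C#):

```python
import random

def draw_dwell_time(mu, sigma, rng=random):
    """Draw one dwell time (seconds) from the stop's fitted lognormal
    distribution, parameterized by log-mean mu and log-std sigma."""
    return rng.lognormvariate(mu, sigma)
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible, which is useful when comparing scenarios.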
7.2 Initialize Simulation Data
After loading the trained models, the simulation data representing the vehicles needed to be
prepared for model predictions. The simulation data object was a dictionary containing lists of
vehicle movements from one stop to another, indexed by GTFS trip ID. The initialization step
constructed this object from the GTFS scheduled trips and initialized the predictor variable values
needed for speed prediction.
To initialize the vehicle movement data, a list of the appropriate GTFS trips within the study
period was obtained from the GTFS trip data. The GTFS trips were first filtered by their service
ID to ensure that only weekday or weekend services, as appropriate, were included. Then, they
were filtered by the trip's start and end times: the start time of the trip was required to be before
the end of the study period, and the end time of the trip had to be after the start of the study period.
This ensured that all vehicles possibly travelling within the study period were captured, while
vehicles travelling entirely outside the study period were not modelled. After the list of GTFS trips
within the study period was determined, all the stop-to-stop vehicle movements for each trip were
generated, stored in a list, and sorted by time.
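The start/end-time filter amounts to a standard interval-overlap test, sketched here in Python (illustrative; times can be in any comparable units, and the function name is hypothetical):

```python
def overlaps_study_period(trip_start, trip_end, study_start, study_end):
    """A trip is kept iff it starts before the study period ends and
    ends after the study period starts."""
    return trip_start < study_end and trip_end > study_start
```

This single condition captures trips that begin before, end after, or lie entirely within the study period, while excluding trips entirely outside it.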
Before predictions of the response variables, running speed and dwell time, could be carried
out, the values of all the predictor variables needed to be specified. The initialization of the simulation
data assigns the following predictor variables: stop sequence, start stop ID and code, end stop ID
and code, time of day, route code, link name, start stop scheduled arrival time, end stop scheduled
arrival time, scheduled headway, total precipitation, incident occurrence, link distance, is streetcar,
is separated ROW, numbers of vehicle left-turn, right-turn, and through movements, number of
TSP-equipped intersections, number of pedestrian crossings, average vehicle volume along the
transit link, average pedestrian volume along the transit link, is start stop near-sided, and is end
stop far-sided.
The listed predictor variables correspond to the variables previously discussed for the
estimation of the running speed models. This part of the procedure populated the variables in two
ways. The simplest way is through referencing stop and schedule information in the GTFS. For
instance, stop sequence, start stop ID, end stop ID, time of day, route code, link name, start stop
scheduled arrival time, end stop scheduled arrival time, and scheduled headway were obtained
directly from the schedule information contained in the GTFS tables, which were accessible to the
model simulation. The remaining variables (total precipitation, incident occurrence, link distance,
is streetcar, is separated ROW, numbers of vehicle left-turn, right-turn, and through movements,
number of TSP-equipped intersections, number of pedestrian crossings, average vehicle volume
along the transit link, average pedestrian volume along the transit link, is start stop near-sided, and
is end stop far-sided) were obtained from the processed AVL, intersection, weather, and road
restriction data; therefore, a test-day data set was required for simulation. It is important to note
that only a few of the variables presented were day dependent: total precipitation and incident
occurrence. Modifying these variables at selected links would be a viable way to test different
scenarios across the network, such as heavy rain and multiple roadway incidents.
All the vehicle movement data were initialized using information from the GTFS schedules
and the test-day data (see the left side of Figure 32). These vehicle movement data were finally
placed into the simulation data object, Sim_ModelData_ByGtfsTripID, where the ordered list of
vehicle movements for each GTFS trip was indexed by the GTFS trip ID for easy retrieval.
7.3 Iterative Predictions
The simulation data object itself contained all the possible movements of vehicles from
one stop to another for each GTFS trip; however, a number of transit operational variables still
needed to be obtained before predictions could be made: previous link speed, previous trip speed,
terminal stop delay, stop delay, simulated headway, and headway ratio. The iterative prediction
algorithm updated these missing variables as predictions were carried out in an iterative manner
(see the right side of Figure 32). The iterative procedure ran until all of the vehicle movements
(samples) were predicted. The iterative prediction algorithm performed three major steps: creating
a sample batch ready for prediction, performing prediction on the batch, and updating variables
for the next round of predictions. Performing predictions and updates in batches allowed the model
simulations to be carried out in a computationally efficient manner.
Figure 32. Flowchart of model simulation procedure
Vehicle movements could become ready for prediction in two ways. If the vehicle
movement occurred at the first stop of the first trip of a vehicle block, then the movement was
always ready and would be predicted in the first batch. These first-stop-of-block vehicle
movements used default values for the transit operational variables: previous link speed = 0,
previous trip speed = 0, terminal stop delay = 0, stop delay = 0, simulated headway = scheduled
headway, headway ratio = 1. All other vehicle movements became ready once all of the above
operational variables were updated during the variable update step. The samples that were ready
for prediction received running speed and dwell time predictions.
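The readiness rule can be sketched as follows (a Python illustration; the field names are hypothetical stand-ins for the thesis's C# identifiers):

```python
OPERATIONAL_VARS = ("prev_link_speed", "prev_trip_speed", "terminal_delay",
                    "stop_delay", "sim_headway", "headway_ratio")

def first_of_block_defaults(sched_headway):
    """Default operational values for the first movement of a vehicle block."""
    return {"prev_link_speed": 0.0, "prev_trip_speed": 0.0,
            "terminal_delay": 0.0, "stop_delay": 0.0,
            "sim_headway": sched_headway, "headway_ratio": 1.0}

def ready_batch(movements):
    """A movement joins the next prediction batch once every operational
    variable has been filled in by the variable update step."""
    return [m for m in movements
            if all(m.get(v) is not None for v in OPERATIONAL_VARS)]
```

Alternating between `ready_batch` and the variable update step sweeps the network forward in space and time until every movement has been predicted.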
The prediction step involved collecting all of the ready samples and converting
them into a format that could be used to create predictions. Batched sample predictions reduced
compute time, since the model prediction method in R was optimized for parallelized computations
over large sample batches. The sample batch was first converted into an R.NET DataFrame
table object in C#, then passed to the R workbench. The DataFrame table, now residing inside the
R workbench, was used to generate running speed predictions, which were returned to C# as a
numeric array and used to update the current samples' running speed values. Dwell time predictions
for the current samples were obtained by performing random distribution draws using the dwell
time models of the start stops. Using the running speed and dwell time predictions, the departure
time at the start stop and the arrival time at the end stop could be computed for these samples as
well. At the end of the prediction step, the samples with predicted running speeds and dwell times
were added to the Sim_ModelPredOutput_ByGtfsTripID object, which contained all of the
predicted vehicle movements. At the same time, the previous GTFS trip ID, GTFS trip ID, block
ID, and block sequence were updated. These variables, along with running speed and dwell time,
were used to update the variables for the next rounds of prediction.
The variable update was an important step to ensure that the vehicle movements were
consistent and connected. The update was performed on every remaining sample with missing
variables. The previous link speed and the arrival time at the start stop were updated using the
speed prediction and the arrival time at the end stop from the previous link of the same trip. The
previous trip speed was updated using the speed prediction of the previous trip on the same link.
Using the arrival times of the current trip and the previous trip on the same link, the stop headway
and headway ratio were computed. Finally, the stop delay and terminal stop delay were computed
from the difference between the scheduled arrival time and the simulated arrival time at the start
stop. The variable update step kept the simulation consistent and allowed the next sample batch to
be formed for prediction.
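One step of the update can be sketched in Python (illustrative field names, not the thesis's C# identifiers): the just-predicted movement's results are propagated into the operational variables of the next movement.

```python
def update_after_prediction(next_link, prev_link_speed, arrival_at_start,
                            prev_trip_arrival, scheduled_arrival):
    """Propagate one predicted movement's results into the next movement's
    operational variables (all times in seconds; names illustrative)."""
    headway = arrival_at_start - prev_trip_arrival      # simulated headway
    next_link["prev_link_speed"] = prev_link_speed
    next_link["sim_headway"] = headway
    next_link["headway_ratio"] = headway / next_link["sched_headway"]
    next_link["stop_delay"] = arrival_at_start - scheduled_arrival
    return next_link
```

Because each update only needs the previous link of the same trip and the previous trip on the same link, the spatiotemporal ordering of the batches guarantees these inputs are already available.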
As predictions were carried out for every sample, the samples with completed predictions
were stored in a simulated trip object, Sim_ModelPredOutput_ByGtfsTripID. This object
contained the records of all the simulated vehicle movements of the GTFS trips. Using this
three-step iterative prediction procedure, a network of simulated vehicle movements was obtained
from the GTFS scheduled trips.
7.4 Simulation Result Outputs
The simulated trip object obtained from the iterative prediction step was exported into a
binary data object for debugging and software development use. In addition, the simulated trip
object was converted into comma-separated value (CSV) format for general analysis. This section
describes the format and use of these simulation result outputs.
7.4.1 Binary Simulation Data
The binary simulation data file, “CSharp_Sim_ModelDataPred.bin,” was a serialized
binary object of Sim_ModelPredOutput_ByGtfsTripID. This binary file was a dictionary object
containing lists of vehicle movements indexed by GTFS trip ID. One use of the binary file is to
allow the simulated vehicle movements to be saved and retrieved for report generation (discussed
in the next section). This way, simulation for a scenario can be saved and retrieved later for analysis
and reporting. Another use for the binary data file is for integration with Nexus. The vehicle
movements can be read by Nexus coordinator classes to connect the simulated surface transit
vehicle movements with the pedestrian and station models used by Nexus. A binary file can
efficiently save simulation results for future use.
7.4.2 Summary Data File
In addition to saving the simulation to binary, the simulation procedure also wrote
summary data in CSV format. Unlike the binary file, which is not human readable, the CSV files
could be used to analyse the fields of vehicle movements directly. Since every simulated vehicle
movement corresponded to a scheduled vehicle movement, a simulated-results CSV file was
exported containing both, with fields such as the simulated and scheduled arrival times at the
start stop. In contrast, there was no one-to-one correspondence with the observed vehicle
movements from the processed AVL data; therefore, a separate observed-results CSV file was
generated, containing all the data fields of the observed results. These CSV files were exported
from the final simulation data and the test data for simulation; they were used for general
analysis and reporting.
Table 11. A sample of simulated and scheduled trip summary data
StartStop_Scheduled  EndStop_Scheduled  Pred.RunningSpeed  StartStop_Simulated  EndStop_Simulated  Dwell  Route  LinkName  StopSeq
06:14:00 06:16:12 20.58 06:14:00 06:15:34 27 504 4582_4194 1
06:16:12 06:17:54 27.91 06:15:34 06:16:43 28 504 4194_4164 2
06:17:54 06:19:05 27.71 06:16:43 06:17:50 38 504 4164_4173 3
06:19:05 06:20:19 25.38 06:17:50 06:20:24 121 504 4173_4168 4
06:20:19 06:21:23 24.34 06:20:24 06:21:28 35 504 4168_4186 5
06:21:23 06:23:00 25.63 06:21:28 06:22:30 21 504 4186_4167 6
06:23:00 06:24:00 24.52 06:22:30 06:23:49 49 504 4167_4171 7
06:24:00 06:25:27 24.16 06:23:49 06:25:19 45 504 4171_4155 8
06:25:27 06:27:02 26.32 06:25:19 06:26:36 32 504 4155_13344 9
06:27:02 06:27:57 24.67 06:26:36 06:27:44 40 504 13344_4162 10
06:27:57 06:29:12 25.57 06:27:44 06:29:15 55 504 4162_4187 11
06:29:12 06:30:58 25.84 06:29:15 06:30:19 13 504 4187_4178 12
06:30:58 06:31:56 24.31 06:30:19 06:31:06 18 504 4178_4189 13
06:31:56 06:33:00 23.72 06:31:06 06:31:57 19 504 4189_4157 14
06:33:00 06:34:26 23.8 06:31:57 06:33:35 64 504 4157_10172 15
06:34:26 06:37:02 25.75 06:33:35 06:34:55 23 504 10172_4184 16
06:37:02 06:38:31 23.65 06:34:55 06:36:19 48 504 4184_4180 17
06:38:31 06:39:53 23.32 06:36:19 06:37:04 11 504 4180_4176 18
06:39:53 06:42:27 21.64 06:37:04 06:38:38 27 504 4176_4192 19
06:42:27 06:44:52 21.37 06:38:38 06:40:51 69 504 4192_4158 20
06:44:52 06:46:00 23.04 06:40:51 06:42:15 57 504 4158_4196 21
06:46:00 06:48:00 21.55 06:42:15 06:43:45 35 504 4196_4133 22
06:48:00 06:48:49 22.93 06:43:45 06:44:42 24 504 4133_4135 23
06:48:49 06:50:00 21.75 06:44:42 06:45:50 18 504 4135_4145 24
06:50:00 06:51:16 25.07 06:45:50 06:46:55 25 504 4145_4138 25
06:51:16 06:52:00 23.29 06:46:55 06:47:36 17 504 4138_4140 26
06:52:00 06:52:54 23.53 06:47:36 06:48:19 20 504 4140_4151 27
06:52:54 06:53:52 27.38 06:48:19 06:49:01 21 504 4151_4143 28
06:53:52 06:54:54 28.24 06:49:01 06:49:43 20 504 4143_4148 29
06:54:54 06:56:25 27.3 06:49:43 06:50:35 18 504 4148_15351 30
06:56:25 06:58:34 28.99 06:50:35 06:51:51 32 504 15351_3038 31
06:58:34 06:59:47 27.26 06:51:51 06:52:50 32 504 3038_3032 32
06:59:47 07:01:44 24.5 06:52:50 06:54:20 28 504 3032_3372 33
07:01:44 07:02:37 26.23 06:54:20 06:55:10 23 504 3372_9317 34
07:02:37 07:03:17 26.33 06:55:10 06:55:49 18 504 9317_1757 35
07:03:17 07:04:31 26.6 06:55:49 06:56:40 13 504 1757_1759 36
07:04:31 07:05:28 30.68 06:56:40 06:57:26 20 504 1759_1766 37
07:05:28 07:06:45 31.22 06:57:26 06:58:19 20 504 1766_1761 38
07:06:45 07:08:06 31.97 06:58:19 06:59:52 58 504 1761_1768 39
07:08:06 07:09:12 29.72 06:59:52 07:01:00 38 504 1768_1755 40
07:09:12 07:10:00 23.1 07:01:00 07:01:57 31 504 1755_14639 41
Table 12. A sample of observed trip summary data from test data set
StartStop_Observed  EndStop_Observed  RunningSpeed  Dwell  RouteCode  LinkName  StopSeq
06:14:06 06:18:48 7.27 0 504 11155_4194 1
06:18:48 06:19:23 32.66 0 504 4194_4164 2
06:19:23 06:20:19 14.25 0 504 4164_4173 3
06:20:19 06:21:03 18.85 0 504 4173_4168 4
06:21:03 06:21:36 21.68 0 504 4168_4186 5
06:21:36 06:22:28 20.14 0 504 4186_4167 6
06:22:28 06:23:21 14.17 0 504 4167_4171 7
06:23:21 06:24:37 14.22 0 504 4171_4155 8
06:24:37 06:26:28 13.28 22 504 4155_13344 9
06:26:28 06:27:00 21.24 0 504 13344_4162 10
06:27:00 06:28:03 21.73 20 504 4162_4187 11
06:28:03 06:29:24 19.8 15 504 4187_4178 12
06:29:24 06:30:14 14.43 0 504 4178_4189 13
06:30:14 06:31:00 16.04 0 504 4189_4157 14
06:31:00 06:31:58 14.04 0 504 4157_10172 15
06:31:58 06:33:42 17.88 21 504 10172_4184 16
06:33:42 06:34:51 12.23 0 504 4184_4180 17
06:34:51 06:35:53 18.09 19 504 4180_4176 18
06:35:53 06:37:50 12.48 0 504 4176_4192 19
06:37:50 06:40:50 16.79 98 504 4192_4158 20
06:40:50 06:41:47 16.74 20 504 4158_4196 21
06:41:47 06:43:52 9.37 0 504 4196_4133 22
06:43:52 06:44:31 19.5 0 504 4133_4135 23
06:44:31 06:45:38 23.41 20 504 4135_4145 24
06:45:38 06:46:53 13.27 0 504 4145_4138 25
06:46:53 06:47:31 14.85 0 504 4138_4140 26
06:47:31 06:48:29 9.27 0 504 4140_4151 27
06:48:29 06:49:04 16.46 0 504 4151_4143 28
06:49:04 06:49:30 23.9 0 504 4143_4148 29
06:49:30 06:51:12 14.42 39 504 4148_15351 30
06:51:12 06:52:55 12.52 0 504 15351_3038 31
06:52:55 06:53:30 20.83 0 504 3038_3032 32
06:53:30 06:55:10 19.18 21 504 3032_3372 33
06:55:10 06:58:26 12.83 140 504 3372_9317 34
06:58:26 06:58:50 23.02 0 504 9317_1757 35
06:58:50 07:00:19 14.87 21 504 1757_1759 36
07:00:19 07:00:54 22.45 0 504 1759_1766 37
07:00:54 07:01:39 23.25 0 504 1766_1761 38
07:01:39 07:02:28 22.69 0 504 1761_1768 39
07:02:28 07:03:10 21.47 0 504 1768_1755 40
07:03:10 07:05:09 12.34 64 504 1755_14639 41
7.5 Analytic Reports
While the summary data contained the complete vehicle movement records, analysing
them for the entire network, by stop or by route, can be very time-consuming. This thesis
presents ways to automate the analysis and reporting of vehicle movement data. The report
generation methods used the OfficeOpenXML tools from the EPPlus package as well as plotting
packages in R, accessed via R.NET from C#. Three types of analytic reports for transit vehicle
movements are demonstrated: time-distance diagrams for vehicle trajectory analysis, route speed
histograms for route-level analysis, and stop delay distributions for stop-level analysis.
7.5.1 Time Distance Diagrams
The time-distance diagram is commonly used to examine the trajectories of transit
vehicles along a route (see Figure 33). The y axis of the diagram is the distance from the start of
the route and the x axis is time. To generate this type of diagram properly, the vehicle movement
data of different trips needed to be grouped by route and direction. After the trips were grouped,
the reports were generated using the stop arrival and departure times of the scheduled, simulated,
and observed trips. The graph data, along with the time-distance diagram, were saved as a
Microsoft Excel file for that route and direction. This procedure was repeated for each route and
direction to generate time-distance diagrams for all trips.
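The data preparation behind such a diagram can be sketched as follows. The record layout here is hypothetical (the thesis's actual fields differ), but the idea matches the text: each stop contributes an arrival vertex and a departure vertex, so dwell time appears as a horizontal segment and running time as a sloped one.

```python
def trajectory_points(movements, stop_distance):
    """Build (time, distance) vertices for one trip's time-distance trajectory.

    movements: ordered list of dicts with 'link' ("A_B"), 'arr_start',
    'dep_start', and 'arr_end' times in seconds; stop_distance: dict mapping
    stop ID to distance from the route start.
    """
    pts = []
    for m in movements:
        start_stop = m["link"].split("_")[0]
        d = stop_distance[start_stop]
        pts.append((m["arr_start"], d))   # arrival at the start stop
        pts.append((m["dep_start"], d))   # departure after dwell (horizontal segment)
    # close the polyline with the arrival at the final stop of the last link
    last = movements[-1]
    pts.append((last["arr_end"], stop_distance[last["link"].split("_")[1]]))
    return pts
```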
7.5.2 Summary of Route Speeds
The route speed histograms show the frequency of route speed occurrences for the
simulated, scheduled, and observed trips (see Figure 34). These graphs showed how well the
scheduled and simulated vehicle movements resembled the observed vehicle movements at the
route level. To create these histograms, the route speeds were computed using the departure time
from the first stop and the arrival time at the last stop. Then, the route speeds were grouped by
route and direction. Labelling each route speed as "scheduled," "simulated," or "observed," the
histograms were constructed for each route and direction using the ggplot2 package in R. These
histograms allow researchers to evaluate and validate the simulation results at the route level.
Figure 34. Examples of route speed histograms for Route 192 Airport Rocket NB
7.5.3 Summary of Stop Delays
At the stop level, delays were chosen to evaluate the accuracy of vehicle arrivals. Since
stop delay is defined as the difference between the scheduled and actual stop times, the simulated
and observed delays could be calculated from the simulated and observed arrival times.
Simulation results that replicate the observed conditions well should produce a distribution of
stop delay values similar to the observed one. Grouping the stop delay values by route, stop delay
distribution curves were obtained. Kernel density estimation was applied to the stop delay data
using the ggplot2 package in R to produce probability density curves for the simulated and
observed stop delays.
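Kernel density estimation with a Gaussian kernel, as ggplot2's density geometry performs, amounts to averaging a Gaussian bump centred at each delay sample. A minimal sketch, with the bandwidth left to the caller (ggplot2 selects one automatically):

```python
import math

def gaussian_kde_pdf(delays, bandwidth):
    """Return a kernel density estimate f(x) over stop delay samples:
    the average of Gaussian kernels of width `bandwidth` centred at each sample."""
    n = len(delays)
    def f(x):
        return sum(
            math.exp(-0.5 * ((x - d) / bandwidth) ** 2)
            for d in delays
        ) / (n * bandwidth * math.sqrt(2 * math.pi))
    return f
```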
Figure 35. Examples of stop delay distribution curves
7.6 Summary
The simulation procedures used in this thesis resolved major challenges associated with
large-scale transit simulation. To enable efficient model loading, the challenges of working
across programming platforms (R and C#) were resolved by using R.NET to host an R engine
instance within C#. Simulation data were initialized to establish a base case simulation scenario
based on the GTFS schedule trips. Then, running speed and dwell time predictions were
generated in a computationally efficient manner using an iterative prediction procedure with data
sample batching. Using these predictions, a network of transit trips was simulated and the results
were exported for analysis. In addition, this thesis presented an innovative method to automate
the generation of time-distance diagrams, route speed histograms, and stop delay distribution
curves to evaluate and validate the simulation models. Finally, the simulated surface transit
model enabled analysis of network behaviours and allowed for the integration of multimodal
transit models.
Chapter 8
Results of Case Study
This chapter presents the modelling and simulation results of the data-driven mesoscopic
simulation model for the Toronto Transit Commission surface transit network. After a brief
description of the transit network, the model estimation results for the running speed model and
dwell time model are evaluated. The models that predicted running speed and dwell time well are
selected for simulation. The mesoscopic simulation results for the random forest and linear mixed
effect running speed models are presented. The simulation results are evaluated based on their
ability to replicate variations in headways, delays, and dwell times. In addition, route-level
validation based on route speed variations and stop-level validation based on stop delay variations
are used to assess model performances.
8.1 Case Study Background
This thesis used three days of training data in a particular week to estimate running speed
and dwell time models and perform evaluation of models using test data from the following week.
The case study network for this thesis was the Toronto Transit Commission (TTC) network.
8.1.1 Network Characteristics
The TTC surface transit network, operating within the City of Toronto, was used in this
study to demonstrate the application of this data-driven transit simulation method. The TTC
network consists of three subway lines, one short automated transit line, 11 streetcar lines, 140 bus
lines, and over 10,000 transit stops. The surface transit system is well-connected to the subway
system with 150 out of 154 bus and streetcar lines making 247 connections with the subway and
the rapid transit line (Toronto Transit Commission, 2017b). The TTC serves the dense downtown
area of Toronto, through which almost all of its streetcar lines run, with major connections at
every subway station (see Figure 36).
Figure 36. The Toronto Transit Commission downtown network map (Toronto Transit
Commission, 2017c)
8.1.2 Study Period
The study period selected for model training was from February 28, 2017 to March 2, 2017.
A large-scale weekday morning peak-period (6:00 to 9:00) running speed and dwell time model
was developed using the training data for this study period. To perform a statistical analysis on the
running speed model, test data from March 7, 2017 to March 9, 2017 were used. To validate the
running speed and dwell time models at the stop and route levels of the network, schedule trips
from a typical weekday, with reference test data from March 8, 2017, were used to simulate
transit trips. Leveraging the availability of open data and the capability of machine learning
algorithms, this case study demonstrates a data-driven method for large-scale transit simulation
modelling.
8.2 Summary of Data
The training and test data were necessary for the estimation and evaluation of models. This
thesis used five different data sets within the defined study periods: General Transit Feed
Specification (GTFS), Automatic Vehicle Location (AVL), road restriction, weather, and traffic
intersection data. A description, the type of link variables, as well as the quantity of each type of
data are detailed in Table 13.
A transit schedule contained a set of weekday, weekend, and holiday trips with relevant
stop and route information. Using the schedule information in the GTFS data set for the TTC, the
schedule arrival times, schedule headway, route number, and link distances were obtained for
every vehicle trip. For a typical AM-peak weekday schedule from 6 AM to 9 AM, there were 8304
scheduled transit trips. These transit trips were used extensively from data processing to model
evaluation.
The AVL data set for TTC contained the actual vehicle trajectories for the entire network.
Many of the link variables based on the observed vehicle position were obtained, such as the arrival
times, dwell times, running speeds, delays, and headways. Due to operational conditions, the TTC
may short-turn trips or dispatch additional surface transit service. This likely explains the
differences in the number of trips between the AVL data and the GTFS data. The number of AVL
trips ranged from 8350 to 8428 for the set of training and test data used.
The reported road events contained in the road restriction data covered the entire City of
Toronto. This thesis used the locations of road restrictions to determine if there were incident
occurrences on a particular transit link at a particular time. The incident occurrences associated
with each link were recorded to explore the effect of road closures and incidents on transit
running speeds. There were 734 road restriction events during the training data period and 766
during the test data period.
In addition to road events, the total precipitation on a transit link within a particular
period was determined based on the proximity of weather stations to the link. Using data from
the various weather stations in Toronto, 72 weather records were collected each day. These
records were used to determine the total precipitation on a transit link at a given time.
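The station-proximity assignment can be sketched as a nearest-neighbour lookup. The haversine distance and the tuple layout used here are illustrative assumptions, not the thesis's actual implementation.

```python
import math

def nearest_station(link_lat, link_lon, stations):
    """Pick the weather station closest to a transit link's reference point.

    stations: list of (station_id, lat, lon) tuples. Returns the station ID
    whose great-circle (haversine) distance to the link is smallest.
    """
    def haversine_km(lat1, lon1, lat2, lon2):
        r = 6371.0  # mean Earth radius, km
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = p2 - p1
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))
    return min(stations, key=lambda s: haversine_km(link_lat, link_lon, s[1], s[2]))[0]
```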
Finally, the traffic intersection data consist of information on all the signalized
intersections, pedestrian crossings, and flashing beacons in the City of Toronto. Using the
intersection data and the geometry of the scheduled routes from the GTFS data, various transit
link characteristics were determined, such as the number of left turns made by a transit vehicle,
the average vehicle volumes from upstream intersections, and the locations of transit stops.
There were 2269 records of intersections containing historical traffic and pedestrian volumes
and 71 records of minor intersections such as pedestrian crossings and flashing beacons.
Table 13. Detailed summary of open data sources
(* data reported for the period indicated)

GTFS
  Description: transit network schedule data over many time periods and days of the week
  Unprocessed fields: stop latitude, stop longitude, stop times, trip IDs with route IDs, block IDs, and service IDs
  Processed attributes: GTFS trips with sequences of stop IDs, stop distances (from start), and stop times, for each trip
  Quantity*: 8304 trips (typical weekday)

AVL
  Description: real-time data feed of the operating transit vehicle positions across the entire network
  Unprocessed fields: vehicle latitude, vehicle longitude, vehicle number, route identification, direction
  Processed attributes: AVL trips with a matching GTFS route, as well as sequences of stop IDs, stop distances (from start), and stop times
  Quantity*: 8381 trips (Feb 28); 8350 trips (Mar 1); 8403 trips (Mar 2); 8428 trips (Mar 7); 8395 trips (Mar 8); 8414 trips (Mar 9)

Road Restriction
  Description: reported road restriction events across the City of Toronto
  Unprocessed fields: incident latitude, incident longitude, start time, end time, incident ID
  Processed attributes: the ID of any incident that occurred on a link of the AVL trip was assigned to the link data
  Quantity*: 734 events (Feb 28 to Mar 2); 766 events (Mar 7 to Mar 9)

Weather
  Description: weather data from stations across the City of Toronto
  Unprocessed fields: station latitude, station longitude, weather report time, temperature, humidity, wind speed, 3-hr precipitation, and weather ID
  Processed attributes: the weather ID(s) that best represent the weather condition on a link of the AVL trip were assigned to the corresponding link data records
  Quantity*: 72 records (per day)

Traffic Intersection
  Description: traffic intersections in the City of Toronto
  Unprocessed fields: intersection latitude, intersection longitude, number of approaches, 8-hr pedestrian volumes, 8-hr vehicle volumes, intersection ID
  Processed attributes: a list of intersection ID(s) located within the transit link was assigned to the corresponding link data
  Quantity*: 2269 records (intersection volumes); 71 records (minor intersections)
8.3 Running Speed Model
Based on the two different use case scenarios established in Chapter 3, this thesis tested
running speed models at the route level and at the network level. The route-level running speed
models were designed for the basic analysis use case while the network-level running speed
models were intended for the advanced analysis use case. The variable specifications for each use
case can be found in section 6.1.2.
As discussed in Chapter 6, the five modelling methods assessed in this study were: multiple
linear regression (MLR), support vector machine (SVM), linear mixed effect (LME), regression
tree (RT), and random forest (RF) models. The definitions for the various error measures, as
presented in section 6.1.4, were used to evaluate model accuracy. In addition to model accuracy,
model training time and model prediction time were measured to assess the computational
efficiency of the models. The compute times were measured on an i7-7500U processor with
16 GB of DDR4-2400 memory and no GPU acceleration.
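The error measures compared in the tables that follow are precisely defined in section 6.1.4 of the thesis; the standard formulations, in which RAE and RRSE normalize the error against a mean-only predictor, can be sketched as:

```python
import math

def error_measures(obs, pred):
    """Standard formulations of MAE, RMSE, MAPE, and the relative errors
    RAE/RRSE, which divide by the error of always predicting the observed mean."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mae = sum(abs(o - p) for o, p in zip(obs, pred)) / n
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)
    mape = sum(abs((o - p) / o) for o, p in zip(obs, pred)) / n
    rae = sum(abs(o - p) for o, p in zip(obs, pred)) / sum(abs(o - mean_obs) for o in obs)
    rrse = math.sqrt(
        sum((o - p) ** 2 for o, p in zip(obs, pred))
        / sum((o - mean_obs) ** 2 for o in obs)
    )
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "RAE": rae, "RRSE": rrse}
```

An RAE or RRSE near 1 thus means the model is barely better than predicting the mean speed, which is why the relative errors in the tables below are informative alongside R2.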
8.3.1 Route Level Running Speed Model Results
Four major transit routes, two bus routes and two streetcar routes, were selected for
route-level running speed model analysis. The first bus route was the 34 Eglinton East, serving
the east-west Eglinton East corridor with subway connections at Eglinton and Kennedy stations.
The second was the 54 Lawrence East, serving the east-west Lawrence East corridor with rail
connections at Eglinton and Lawrence East stations and Rouge Hill GO station. The 504 King
streetcar route serves the east-west King Street corridor; it had a shared right of way throughout
the route and subway connections at St. Andrew and King stations. In contrast, the 512 St. Clair
streetcar route had a dedicated right of way; it served the east-west St. Clair Street corridor and
had subway connections at St. Clair and St. Clair West stations.
The route-level running speed results for all of the selected surface routes showed that the
LME model provided the highest R2 and the lowest error measurements (see Table 14, Table 15,
Table 16, and Table 17). The RMSEs of the LME model were 2.6% to 5.6% lower than those of
the MLR model. The RF model was not able to outperform the LME model, but it produced
higher accuracy than the MLR, SVM, and RT models. The performances of the MLR, SVM, and
RT models were similar (differing by less than 2%) across most routes, with the exception of the
512 St. Clair streetcar, for which the SVM and RT models outperformed the MLR model by
3.7% and 2.1%, respectively.
Regarding model training and prediction times, SVM was the most computationally
intensive model while MLR was the least. The MLR model was the fastest to train for all routes;
the LME, RT, RF, and SVM training times were longer than MLR's by approximate factors of
10, 230, 320, and 9000, respectively. Prediction times were generally much shorter than training
times. The MLR and RT prediction times were almost identical, while LME, RF, and SVM
predictions took longer than MLR's by factors of roughly 5, 10, and 150. Overall, MLR was the
fastest model and SVM the slowest for both training and prediction.
Table 14. Comparison of route-level running speed models for 34-Eglinton East
Model Type MLR SVM LME RT RF (100 trees)
R Package MASS liquidSVM LME4 RPART RANGER
R2 0.211 0.224 0.286 0.205 0.262
MAPE 0.475 0.459 0.447 0.453 0.411
MAE 7.797 7.707 7.271 7.786 7.438
RAE 0.873 0.863 0.814 0.872 0.833
RMSE 10.105 10.020 9.615 10.146 9.774
RRSE 0.888 0.881 0.845 0.892 0.859
Reduction in RMSE 0% 0.8% 4.8% -0.4% 3.3%
Training Time (sec.) 0.016 11.162 0.125 2.338 3.543
Prediction Time (sec.) 0.000 1.612 0.054 0.016 0.109
Table 15. Comparison of route-level running speed models for 54-Lawrence East
Model Type MLR SVM LME RT RF (100 trees)
R Package MASS liquidSVM LME4 RPART RANGER
R2 0.289 0.307 0.405 0.316 0.344
MAPE 0.397 0.391 0.346 0.380 0.369
MAE 8.208 8.064 7.377 7.969 7.778
RAE 0.830 0.815 0.746 0.806 0.786
RMSE 10.690 10.548 9.773 10.485 10.267
RRSE 0.843 0.832 0.771 0.827 0.810
Reduction in RMSE 0% 1.3% 8.6% 1.9% 4.0%
Training Time (sec.) 0.016 15.099 0.194 3.672 7.193
Prediction Time (sec.) 0.016 2.349 0.064 0.016 0.216
Table 16. Comparison of route-level running speed models for 504-King
Model Type MLR SVM LME RT RF (100 trees)
R Package MASS liquidSVM LME4 RPART RANGER
R2 0.107 0.115 0.223 0.102 0.153
MAPE 0.329 0.326 0.296 0.330 0.318
MAE 5.127 5.077 4.726 5.134 4.982
RAE 0.940 0.931 0.866 0.941 0.913
RMSE 6.816 6.784 6.359 6.834 6.639
RRSE 0.945 0.941 0.882 0.947 0.920
Reduction in RMSE 0% 0.5% 6.7% -0.3% 2.6%
Training Time (sec.) 0.017 31.790 0.327 0.662 3.838
Prediction Time (sec.) 0.011 2.286 0.076 0.012 0.158
Table 17. Comparison of route-level running speed models for 512-St. Clair
Model Type MLR SVM LME RT RF (100 trees)
R Package MASS liquidSVM LME4 RPART RANGER
R2 0.109 0.174 0.252 0.147 0.206
MAPE 0.419 0.378 0.334 0.385 0.363
MAE 6.596 6.258 5.842 6.348 6.099
RAE 0.931 0.884 0.825 0.897 0.861
RMSE 8.629 8.310 7.907 8.444 8.146
RRSE 0.944 0.909 0.865 0.924 0.891
Reduction in RMSE 0% 3.7% 8.4% 2.1% 5.6%
Training Time (sec.) 0.013 99.535 0.269 0.499 2.383
Prediction Time (sec.) 0.010 5.508 0.073 0.010 0.096
8.3.2 Network Level Running Speed Model Results
Based on the results presented in Table 18, the LME and RF models outperformed the
MLR, SVM, and RT models on all error measurements. Of the weaker models, MLR and SVM
had near-identical errors, while RT had a 3.5% higher RMSE. The LME model, using Link
Name as the random-effects variable, performed the best, yielding a 7.9% RMSE reduction
relative to the MLR model; the RF model produced a 5.9% reduction. The LME and RF R2
values (0.387 and 0.359) were the highest, reflecting their superior ability to capture the
variability in speeds across route segments and times of day.
The computing times for training and prediction reflected the computational efficiency of
the models. With training and test data sets of similar size (593,234 and 600,351 rows,
respectively), model prediction was typically 10 to 50 times faster than model training. MLR
was the fastest model, with 0.42 minutes of training time and 0.04 minutes of prediction time.
LME and RT were both computationally efficient, with less than 3 minutes of training time and
0.05 minutes of prediction time. SVM was the slowest, with over 36 minutes of training time and
3 minutes of prediction time. RF had a modest training time of 14.69 minutes and a prediction
time of 0.33 minutes.
Table 18. Comparison of five types of network-level running speed models
Model Type MLR SVM LME RT RF (100 trees)
R Package MASS liquidSVM LME4 RPART RANGER
R2 0.277 0.265 0.387 0.225 0.359
MAPE 0.355 0.358 0.311 0.372 0.325
MAE 7.625 7.677 6.902 7.911 7.109
RAE 0.831 0.837 0.752 0.862 0.775
RMSE 9.950 10.035 9.160 10.303 9.366
RRSE 0.850 0.858 0.783 0.881 0.800
Reduction in RMSE 0% -0.9% 7.9% -3.5% 5.9%
Training Time (min.) 0.419 36.272 2.629 1.021 14.681
Prediction Time (min.) 0.036 3.249 0.049 0.015 0.331
8.4 Dwell Time Distribution Model Result
This thesis estimated statistical dwell time models at the stop level, assuming that the
dwell time distributions were lognormal. The sigma and mu parameters of the lognormal
distributions were estimated using training data for the case study network. The resulting
parameters were presented on a coloured bubble map of the Toronto area (Figure 37). The results
of the lognormal distributions were assessed by comparing the averages and standard deviations
of the observed and predicted dwell times. Finally, the dwell time models were evaluated using a
chi-square goodness-of-fit test.
8.4.1 Lognormal Distribution Parameters
By training a dwell time model for each stop using the training data, the parameters of the
lognormal distributions, mu and sigma, were obtained. The mu parameter is the log average of
the training data. The sigma parameter, also known as the shape parameter, is the log standard
deviation computed from the log values and the log average of the training data. The estimation
of the lognormal distribution parameters was outlined in section 6.2.1.
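The parameter estimation amounts to taking the mean and standard deviation of the log dwell times. A minimal sketch (the population form of the standard deviation is shown here; the thesis's exact estimator is given in section 6.2.1):

```python
import math

def lognormal_params(dwell_times):
    """Estimate mu (log average) and sigma (log standard deviation) for one
    stop's lognormal dwell time distribution from positive dwell time samples."""
    logs = [math.log(t) for t in dwell_times]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    return mu, sigma
```

During simulation, a dwell time draw from the fitted distribution can then be made with Python's standard library as random.lognormvariate(mu, sigma).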
A total of 4488 dwell time models were trained for the Toronto network. For each stop's
dwell time model, a mu and a sigma parameter were obtained. These parameters were presented
as a coloured bubble map (see Figure 37), where the colour represents the mu parameter (log
average) and the size of the bubble represents the sigma parameter. A small green bubble
indicates a dwell time model with small mu and sigma values; a large red bubble indicates one
with large mu and sigma values. The specific values of each dwell time model can be found
online via the following URL:
https://bowenwen.carto.com/tables/table_2017_08_12_dwelltimemodel_v5_9/public/map.
The dwell time model parameter values showed variation in dwell time averages and
standard deviations across the transit network. In particular, Figure 37 showed that both mu and
sigma values were higher for transfer stops at and near major subway stations. The top five
subway stations with high mu and sigma values were:
1. Downsview station (mu = 4.63, sigma = 0.82),
2. Rosedale station (mu = 4.54, sigma = 0.69),
3. Broadview station (mu = 4.46, sigma = 1.05),
4. Lawrence East station (mu = 4.43, sigma = 0.95), and
5. Ossington station (mu = 4.42, sigma = 0.88).
In addition to subway stations, dwell time models with large mu and sigma were found for terminal
stations as well. The top five terminal stations with high mu and sigma values were:
1. Main Street station with stop code 14739 (mu = 4.87, sigma = 0.75),
2. Wynford Drive at Don Mills Road with stop code 4720 (mu = 4.83, sigma = 0.83),
3. Ferris Road at Curran Drive with stop code 2475 (mu = 4.74, sigma = 1.03),
4. Lower Jarvis Street at Queens Quay East with stop code 6477 (mu = 4.62, sigma = 0.92), and
5. Steeles Avenue West at Highway 27 with stop code 11522 (mu = 4.59, sigma = 0.99).
The dwell time parameter estimates showed variations of averages and standard deviations
across different stops within the network. The dwell time models were used to predict dwell
times during simulation.
Figure 37. Dwell time model parameters at stops on coloured bubble map
8.4.2 Dwell Time Model Evaluations
Dwell time models that represent the stochasticity of dwell times across stops well should
replicate the variation of observed dwell time averages and standard deviations. A comparison
between the predicted and observed dwell times at each stop for the simulated period was made.
Furthermore, a chi-square test was used to evaluate the goodness of fit of the dwell time model
at the network level.
8.4.2.1 Observed versus Predicted Dwell Times
The observed dwell time coloured bubble plot showed the variation in dwell times across
the network. The observed dwell times were obtained from the 1-day test data for simulation. The
colour of the bubble represented the observed dwell time averages while the size of the bubble
indicated the standard deviations. Figure 38 showed that most stops along many corridors had
low observed averages and standard deviations of dwell times. However, stops with major
transfers to other bus or streetcar routes and to the subway system had larger averages and
standard deviations. In particular, stops at subway stations had larger values.
For example, many of the stops at the subway stations on Yonge Street, University Avenue, and
Bloor Street had larger averages and standard deviations of dwell times. In addition, terminal stops
for transit routes were observed to have high averages and standard deviations of dwell time values.
The predicted dwell time coloured bubble plot was obtained using the dwell time models,
with distribution draws performed at simulation run time (see Figure 39). The predicted dwell
times were generated using random seed 100 during simulation so that results were comparable
between different running speed models. The predicted dwell time model results
showed variations in dwell time across the network. Stops at subway stations yielded larger
averages and standard deviations of predicted dwell times. These stops were on the Yonge Street,
University Avenue, and Bloor Street corridors. Further away from the Toronto downtown region,
larger averages and standard deviations of predicted dwell times were observed for terminal stop
stations.
8.4.2.2 Distribution Testing
On the network level, the predicted dwell times were evaluated using the chi-square test
for goodness of fit. The test compared the frequencies of observed and predicted dwell times,
binned at 20 seconds, using the test statistic χ² = Σi (Pi − Oi)² / Oi, where Pi is the predicted and
Oi the observed dwell time frequency in bin i. The frequencies and chi-square test statistics are
shown in Figure 40. At the 90% confidence level, no significant difference was observed
between the predicted and observed dwell times; the predicted dwell times are thus consistent
with the observed lognormal dwell time distributions.
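A minimal sketch of the statistic as defined above, computed over the 20-second bins (zero-frequency observed bins are skipped here; how the thesis handled them is not stated):

```python
def chi_square_stat(pred_freq, obs_freq):
    """Chi-square statistic chi2 = sum_i (P_i - O_i)^2 / O_i over dwell time bins.

    The result is compared against the chi-square critical value for
    (number of bins - 1) degrees of freedom at the chosen confidence level.
    """
    return sum((p - o) ** 2 / o for p, o in zip(pred_freq, obs_freq) if o > 0)
```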
Figure 40. Chi-square goodness of fit test for dwell time predictions
8.5 Simulation with Random Forest
The simulation results using random forest (RF) are presented using a series of time-
distance diagrams for a selected TTC route, as shown in Figure 41. The time-distance diagrams
showed the vehicle trajectories of schedule trips, observed trips, and simulated trips. To validate
the results of the model, the route speeds and stop delay distributions were examined for subway
stops on selected TTC routes.
8.5.1 Simulation Results
The time-distance diagram, Figure 41, illustrates the ability of the RF running speed model
and lognormal dwell time model to produce transit vehicle trajectories. Route 504 King, the busiest
streetcar route in Toronto with a daily ridership of about 65,000 passengers, was chosen for
analysis. The top row shows the scheduled vehicle trajectories with relatively constant speeds and
headways between all trips from 6 am to 9 am. However, the observed trajectories exhibited
variable headways and high bunching frequency. In addition, the running speeds across the trips
were not constant. The simulation model of this study could replicate reasonably well the effects
of dwell times, delays, and headways on the movements of transit vehicles (see the bottom row of
Figure 41). The simulated vehicle trajectories for the sample route reproduced operational patterns
similar to those observed in the test data with respect to vehicle bunching instances, stochastic
dwell times, and variable headways and delays. The simulated vehicle trajectories captured several
of the transit operational characteristics and provided a more realistic representation than
scheduled vehicle trajectories.
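The stop-to-stop propagation that produces these trajectories can be sketched as below, under assumed interfaces: `predict_speed` and `draw_dwell` are hypothetical stand-ins for the RF running speed model and the lognormal dwell time model, and the link lengths are placeholders.

```python
import numpy as np

rng = np.random.default_rng(100)

def predict_speed(link_index):
    """Placeholder for the RF running speed prediction (km/h)."""
    return 18.0 + 2.0 * (link_index % 3)

def draw_dwell(stop_index):
    """Placeholder lognormal dwell time draw (seconds)."""
    return rng.lognormal(3.0, 0.6)

def simulate_trip(dispatch_time_s, link_lengths_km):
    """Return (arrival_s, departure_s) at each stop along the route."""
    t = float(dispatch_time_s)
    trajectory = []
    for i, length in enumerate(link_lengths_km):
        t += length / predict_speed(i) * 3600.0   # traverse the link
        arrival = t
        t += draw_dwell(i)                        # serve the stop
        trajectory.append((arrival, t))
    return trajectory

trip = simulate_trip(6 * 3600, [0.4, 0.5, 0.3, 0.6])  # 6 am dispatch
```

Plotting cumulative distance against these arrival and departure times yields a time-distance trajectory of the kind shown in the figures.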
The computational performance of the simulation with the RF running speed model was
acceptable: the total gross simulation time was 2262 seconds (38 minutes) for a 3-hour network.
For smaller or less busy transit networks, the simulation time would be shorter.
For the following figure, the intersecting scheduled vehicle trajectories were due to the
planned dispatches of 504 King replacement buses towards Broadview during the training and test
data study periods.
Figure 41. Time distance diagrams, 504 King WB and EB, for scheduled, observed and
simulated trips, simulated using Random Forest running speed model
[Six time-distance panels: 504 KING WB Scheduled, Observed, and Simulated; 504 KING EB Scheduled, Observed, and Simulated. Each panel plots Distance from Start (km), 0–12, against Time (hh:mm:ss), 06:00–09:00.]
8.5.2 Route-based Validation
The route speed of a transit vehicle is important for the planning of transit services as it
determines the number of transit units and the frequency of service. The distribution of route
speeds of the simulated trips, generated by the RF running speed model, was examined to
demonstrate the ability of the simulation model to replicate the observed conditions (see Figure 42
for results of two major bus routes and two streetcar routes). While the schedule-based route speeds
exhibited low variability for each branch of a route, the simulated route speeds varied across
different trips and better reflected the observed conditions. However, the distributions between
simulated and observed route speeds were not identical. In general, the simulated route speeds
were more normally distributed with smaller variances. The average route speeds between the
simulated and observed trips matched well for most routes.
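The route speed measure compared in these histograms is simply route length divided by end-to-end travel time, computed per trip for scheduled, observed, and simulated trajectories. A minimal sketch with placeholder values:

```python
import numpy as np

route_length_km = 12.0                                  # hypothetical route length
travel_times_h = np.array([0.85, 0.92, 1.05, 0.78])     # per-trip travel times (h)

route_speeds = route_length_km / travel_times_h         # km/h for each trip
mean_speed = route_speeds.mean()                        # compared across sources
spread = route_speeds.std()                             # variability across trips
```

Comparing `mean_speed` and `spread` across the scheduled, observed, and simulated trip sets reproduces the route-level comparison described in the text.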
Figure 42. Route speed validation for four major TTC routes, simulated using Random
Forest running speed model
8.5.3 Station-based Validation
The delays at each stop are critical for the coordination of transfers as well as crowding at
major transfer stations. This study assessed the distributions of delays captured at 14 major transfer
stops, at subway stations, of four feeder bus and streetcar routes, using the kernel density
estimation for data smoothing. While only a few stops had near-identical simulated and observed
distributions, the simulation model was able to capture either the average delay or the shape of
the density for most stops. For stops where the center of the delay distribution curve was shifted,
such as the stops for 504 King EB, the route speed distribution was also shifted. The shapes of the
distributions did not match well for 512 St Clair. In addition, routes with less similar route speed
variances also had delay density shapes that did not match as well. Even though the simulation
models did not replicate the observed stop delay distributions exactly, they provided a reasonable
representation of the delays experienced.
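The kernel density smoothing used for these comparisons can be sketched as below; the delay samples are synthetic placeholders, in seconds.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
observed_delays = rng.normal(60, 30, size=500)     # placeholder observed delays
simulated_delays = rng.normal(55, 20, size=500)    # placeholder simulated delays

# Evaluate each smoothed density on a common grid for visual comparison.
grid = np.linspace(-60, 240, 301)
observed_density = gaussian_kde(observed_delays)(grid)
simulated_density = gaussian_kde(simulated_delays)(grid)

# Each smoothed curve integrates to roughly one over the plotted support.
observed_area = float(observed_density.sum() * (grid[1] - grid[0]))
```

Overlaying `observed_density` and `simulated_density` on the same grid gives the per-stop comparison curves described above.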
Figure 43. Stop delay validation for four routes at 14 major TTC transfer stops, simulated
using Random Forest running speed model
8.6 Simulation with Linear Mixed Effect Model
Using the linear mixed effects (LME) running speed model, simulated vehicle movements were
generated and presented as a series of time-distance diagrams for a selected TTC route. The
vehicle trajectories of scheduled trips, observed trips, and simulated trips are shown on the
time-distance diagrams (Figure 44). The simulated route speeds and stop delay distributions were
validated against the observed values for subway stops on selected TTC routes.
8.6.1 Simulation Results
The vehicle movements generated by the LME running speed model and lognormal dwell
time model were presented as a series of time-distance diagrams (Figure 44). Similar to the results
for the RF running speed model, the busy streetcar route 504 King was chosen for analysis. The
scheduled vehicle trajectories (top row) showed constant speeds and headways between all trips
from 6 am to 9 am. This did not reflect the observed trajectories, which had variable headways and
frequent bunching. The LME running speed model was able to produce non-constant running
speeds across the trips, as observed. The simulation model using LME replicated the effects of
dwell times, delays, and headways on vehicle movements (see the bottom row of Figure 44).
Operational patterns with vehicle bunching, stochastic dwell times, and variable headways
and delays were reproduced by the LME running speed model. The simulated vehicle trajectories
captured several of the transit operational characteristics and provided a more realistic
representation than scheduled vehicle trajectories.
The computational performance of the simulation with the LME running speed model was
acceptable. For a 3-hour TTC network, the total gross simulation time was 1304 seconds (22
minutes). For smaller or less busy transit networks, the simulation time would be shorter.
For the following figure, the intersecting scheduled vehicle trajectories were due to the
planned dispatches of 504 King replacement bus towards Broadview during the training and test
data study periods.
Figure 44. Time distance diagrams, 504 King WB and EB, for scheduled, observed and
simulated trips, simulated using Linear Mixed Effects running speed model
[Six time-distance panels: 504 KING WB Scheduled, Observed, and Simulated; 504 KING EB Scheduled, Observed, and Simulated. Each panel plots Distance from Start (km), 0–12, against Time (hh:mm:ss), 06:00–09:00.]
8.6.2 Route-based Validation
Route speeds are important for the planning of transit services, so their distributions were chosen
as a validation measure for the simulation model. The distribution of route speeds of the
simulated trips, generated using the LME model, was examined to demonstrate the ability of the
simulation model to replicate the observed conditions. The results for Eglinton and Lawrence East
buses and Saint Clair and King Streetcars were summarized by a series of route speed histograms
(see Figure 45). While the schedule-based route speeds exhibited low variability for each branch
of a route, the simulated route speeds varied across different trips and better reflected the observed
conditions. However, the distributions between simulated and observed route speeds were not
identical. In general, the simulated route speeds were more normally distributed with smaller
variances. The average route speeds between the simulated and observed trips matched well for
most routes.
Figure 45. Route speed validation for four major TTC routes, simulated using Linear Mixed
Effects running speed model
8.6.3 Station-based Validation
Scheduled transit services usually optimize crowding and the coordination of transfers at major
stations, and delays can often have adverse effects on both. The distributions of delays at 14 major
subway transfer stops along four major feeder bus and streetcar routes were generated using kernel
density estimation for data smoothing. Only a few stops had near-identical simulated and observed
distributions; however, the shapes and/or averages of the delay distributions were captured for
some stops. For stops where the center of the delay distribution curve was shifted, such as the stops
for 504 King EB, the route speed distribution was also shifted. The distributions for different bus
platforms at the same station, such as those for 512 St Clair, could also yield different results. For
instance, the delay distribution for bus platform 3 matched better than that for bus platform 4 at
St Clair station. Even though the simulation models did not replicate the observed stop delay
distributions exactly, they provided a reasonable representation of the delays experienced.
Figure 46. Stop delay validation for four routes at 14 major TTC transfer stops, simulated
using Linear Mixed Effects running speed model
8.7 Summary
This chapter presented data-driven mesoscopic transit simulation models for the Toronto
Transit Commission case study. The mesoscopic transit simulation model was made up of a
regression running speed model and a lognormal statistical dwell time model. In the case study
presented in this thesis, five different regression models for running speed were evaluated. The
random forest and linear mixed effect models were found to provide higher model accuracy and
reasonable computational efficiencies for application. The superior performances of the random
forest and linear mixed effect models were consistent for route-level and network-level running
speed models. The dwell time models were also evaluated based on their ability to replicate dwell
times variation across different stops in the network. The dwell time models were able to capture
the higher dwell time averages and larger standard deviations for major transfer stops intersecting
subway lines. A chi-square goodness of fit test was performed for dwell times at the network level.
It was found that the dwell time distributions passed this statistical test with 90% confidence.
Using the estimated running speed and dwell time models, mesoscopic transit simulations
were performed using the random forest and linear mixed effect running speed models. The
random dwell time distribution draws from the dwell time models were controlled using a seed
with a value of 100 during simulation runs in order to ensure comparability. This chapter presented
the simulation results for both models by showing the vehicle trajectories of a selected TTC route
on a series of time-distance diagrams. In addition, validations were performed on the simulation
results for four major feeder bus and streetcar routes. The route-based validation examined the
distribution of route speeds between scheduled, observed, and simulated trips. The stop-based
validation compared the distributions of stop delay values at major TTC subway stops along the
four major routes. The vehicle trajectories generated by the models were reasonably realistic, and
provided substantial improvements over the scheduled trajectories. The validation results did not
show identically matched distributions, but they replicated the stochasticity of transit operational
characteristics reasonably well. The overall results of the simulation were logical and promising.
Chapter 9
Conclusion
This thesis provided a framework to construct a data-driven surface transit simulator using
multiple open data sources. It was shown that by combining the use of GTFS, AVL, road
restriction, weather, and signalized intersection data, a data-driven transit movement simulation
model could be constructed using statistical models of running speeds and dwell times. The
conclusions reached based on the model results contributed to a better understanding of the data-
driven models for transit simulation. The research accomplishments presented throughout this
thesis are detailed in the research contribution section. Finally, future research directions to
improve the simulation model, to extend its capabilities, and to apply it to real world systems are
discussed.
9.1 Summary of Results
Data-driven simulation models require accurate running speed and dwell time models to
realistically predict the movement of transit vehicles across a network. A detailed summary of the
running speed models and dwell time models contributed to a better understanding of the transit
simulation model results. A summary of the validation results helped in identifying many potential
future research directions.
9.1.1 Running Speed Models
Even though the model specifications for route-level and network-level running speed
models were different, the relative performances of the models were similar. The linear mixed
effects (LME) method produced the best model prediction performances with the highest R2 and
lowest error measurements. The random forest (RF) model came a close second in model
prediction performances with 2% more errors for the network-level model. RF was about 10 times
slower on average than LME at model training and predictions. Based on the route-level and
network-level model results, the training time for LME with more variables and more training data
took considerably longer. Such an increase in training time was less severe for RF; nonetheless,
the training time for RF was still longer than that of LME. Prediction with RF took considerably longer with a
larger training sample. Complex and deeper trees with larger training data resulted in an
exponential increase in prediction times with RF. In contrast, the prediction time with LME
increased linearly. This is consistent with the understanding that predictions with LME depend
only on the complexity of variable specifications, not sample sizes. The differences in training and
prediction time characteristics, as well as model accuracies are important considerations for
simulation applications.
RF may be considered for simulation applications, despite the superior prediction
performance of LME, because it provides a more flexible implementation. Since the LME
model used link name as the random effects variable, it cannot be applied to links with no recorded
training data. In contrast, the RF model can handle clustered data with similar link characteristics
without having the link names explicitly specified. For simulations requiring the implementation
of new links, the RF model would be preferred, as it can predict running speeds for links similar
to those it was trained on.
Unlike RF, whose trees grow deeper as datasets grow larger, LME does not suffer from
increased model complexity with larger datasets. This results in a more stable model prediction
time for LME that increases linearly with network size. Although the LME model training became slower with more complex
variable specification, it was a more computationally efficient modelling method than RF with
notably better model performance. However, unlike RF, LME requires a random effect variable of
link name, which makes implementation of the LME model for new links difficult. If the addition
of new links was not needed and the data sizes were large, LME should be the preferred model.
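The reason RF can serve unseen links is that it learns from link characteristics rather than a link-name random effect. A sketch of this, with entirely synthetic features and coefficients (link length, signal count, and precipitation are hypothetical stand-ins for the thesis's predictors):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0.1, 1.0, n),              # link length (km)
    rng.integers(0, 4, n).astype(float),   # signalized intersections on link
    rng.uniform(0.0, 5.0, n),              # precipitation (mm/h)
])
# Synthetic running speed as a function of the characteristics plus noise.
speed = 25.0 + 10.0 * X[:, 0] - 3.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 1, n)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, speed)

# A link never seen in training, described only by its characteristics:
new_link = np.array([[0.4, 2.0, 1.0]])
predicted_speed = rf.predict(new_link)[0]
```

An LME model with link name as the grouping factor has no estimated random effect for such a link, which is the limitation the text describes.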
9.1.2 Dwell Time Models
The predicted dwell times generated using the dwell time models replicated the patterns of
the observed dwell times well. In particular, the predicted values captured longer dwell times with
higher variance nearby or at subway stations and terminal transit stops. The higher passenger
demand for boarding and alighting at transfer stations could explain the longer dwell times with
higher variance at the subway stations. The longer dwell times at terminal stations could be
explained by the higher passenger demand in the area. For example, the Main Street station on
Danforth Avenue had higher average and standard deviation of dwell times because the station
was a major transfer point between the 506 Carlton streetcar, subway transfer to line 2, and several
buses. On the other hand, some of the terminal stops such as Lower Jarvis Street at Queens Quay
East, Wynford Drive at Don Mills Road, and Gunns Loop at St Clair Avenue West do not have
heavy passenger demands. Such discrepancy may be due to the way that automatic vehicle location
(AVL) data were recorded at terminal stations. It was unclear whether all drivers followed the same
protocol of changing the vehicle head sign the moment the vehicle arrived at the terminal station.
Moreover, the dwell times at terminal stations may include waiting time for the start of the next
trip, in addition to the boarding and alighting times. This should not affect the simulation significantly
because dwell times for on-time arrivals at the terminal stop rarely affect the departure time of the
next trip, given that the previous trip arrived early. The assumption that buses and streetcars wait
at the terminal stop for their next scheduled trip departure is built into the simulation and can
overcome this limitation associated with the dwell time model. The lognormal
dwell time model used in this study replicated the observed dwell time distributions well; however,
there are a few issues which may have affected the model fit. Since the available AVL data had a
20-second resolution, some lower values of dwell times would not be captured, especially for parts
of the networks with poor GPS signal reception, such as areas with tall buildings and tree cover.
This may have led to biased data points, resulting in slightly left skewed observed dwell time
distribution when compared to the simulated values.
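The 20-second resolution effect described above can be illustrated with a small sketch: dwell times generated from a lognormal distribution are snapped to a 20-second grid (with dwells that round to zero never recorded) before the lognormal parameters are re-estimated. All values are synthetic.

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(0)
true_dwells = rng.lognormal(mean=3.2, sigma=0.5, size=3000)   # seconds

# AVL at 20-s resolution: each dwell snaps to the nearest multiple of 20 s,
# and very short dwells (rounding to zero) are lost entirely.
avl_dwells = np.round(true_dwells / 20.0) * 20.0
avl_dwells = avl_dwells[avl_dwells > 0]

# Refit a lognormal to the coarsened observations (location fixed at zero).
shape, loc, scale = lognorm.fit(avl_dwells, floc=0)
mu_hat, sigma_hat = np.log(scale), shape   # recovered lognormal parameters
```

Comparing `mu_hat` and `sigma_hat` with the generating values (3.2, 0.5) shows how the resolution and the lost short dwells bias the fitted distribution.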
9.1.3 Simulation Models
The simulation of transit movements was illustrated using the results based on running
speed and dwell time predictions with a “base case” GTFS schedule simulation scenario. This
thesis investigated the performances of the RF and LME running speed models. The results of the
simulated vehicle movements were assessed using a series of time-distance diagrams. Then, the
simulated vehicle movements were validated at the route and the stop levels.
The simulated vehicle trajectories were more realistic than the scheduled vehicle
trajectories. Based on the results on vehicle trajectories, both RF and LME were able to capture
vehicle bunching, distributions of dwell times, stochastic patterns of delay and variations in
headways. The results of RF and LME simulation models were similar. Unlike the observed data,
simulated vehicle trips showed more stable running speeds. The instability of running speed in the
observed vehicle trips could be the result of stochastic traffic congestion. However, even without
traffic congestion data, the patterns of transit travel were replicated generally well by the RF and
LME models. The data-driven models can be a viable way to simulate transit movements
efficiently.
The results of the data-driven simulation were validated at the route level and stop level.
While the simulated route speeds were more realistic than the scheduled route speeds, the
simulated and observed distributions were not identical. The simulated route speeds were more
normally distributed with lower variance than the observed route speeds. This could be due to
unknown disruptions, such as congestion, signal delays, or short-turned transit trips, which may
have shifted the observed route speed distributions and induced larger variance. This affected the
stop delay distributions, as evident in the shift in the means and variances of some stop delay
curves. The simulated trips captured the statistical variations of route speeds and stop delays much
better than the scheduled trips; however, additional data on congestion and transit disruptions
could potentially improve model fit.
While the simulation results from RF and LME were similar, LME was more computationally
efficient, requiring about half the simulation time of RF. The simulation times for both methods were
reasonable. In choosing whether to use RF or LME for simulation applications, the need to create
new links with existing link characteristics needs to be assessed. Since RF provides a more flexible
implementation of new links, it may be preferred over LME despite the higher computational cost.
However, if only the modelling of existing transit links is needed, LME should be the preferred
choice given its computational efficiency.
9.2 Research Contributions
This thesis presented an innovative, data-driven approach to transit simulation and made
significant contributions to many aspects of this approach. The data-driven transit
simulation used open source data to model surface transit movements with running speed models
and dwell time models. Using this modelling framework, this thesis collected and processed
multiple open source data to obtain variables required for modelling. Using GTFS schedule data,
the AVL data was processed into trips to generate stop-based variables such as stop arrival times,
dwell times, delays, and headways. Then, the spatiotemporal matching of the AVL trip data to
additional sources of open data such as weather, road restrictions, and signalized intersections data
provided policy-relevant variables for model estimation. These processed AVL data enabled the
modelling of running speed and dwell time. The second key contribution was the use of structured
data from multiple data sources to model running speed and dwell times. This thesis performed a
thorough comparison of different machine learning algorithms to model stop-to-stop transit
running speeds. It was shown that random forest and linear mixed effect models were most suitable
for network-level running speed models, and the lognormal dwell time distribution modelled dwell
time variations across the network well. Thirdly, this thesis developed a simulation procedure for
data-driven simulation, with standardized analytics reports. This research contribution lays the
foundation for event-based data-driven simulations and allows future researchers to explore ways
to further optimize data-driven simulation for large-scale network applications. Finally, this thesis
demonstrated the capabilities of the data-driven simulation using a case study on the Toronto
network.
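The stop-based variables derived from the matched AVL and GTFS data can be sketched with a minimal example: delay is actual minus scheduled arrival, and headway is the gap between successive vehicle arrivals at a stop. Times are synthetic, in seconds after midnight.

```python
import numpy as np

scheduled_arrivals = np.array([21600, 21960, 22320, 22680])   # GTFS, every 6 min
observed_arrivals = scheduled_arrivals + np.array([30, 95, -10, 240])  # AVL

delays = observed_arrivals - scheduled_arrivals     # + late, - early (seconds)
headways = np.diff(observed_arrivals)               # gaps between vehicles (s)
```

These per-stop series, repeated across the network and joined spatiotemporally with weather, road restriction, and signal data, form the model estimation dataset described above.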
9.3 Future Research
This study demonstrated an innovative concept for transit simulation using data-driven
machine learning models. These initial simulation results illustrate the need to incorporate
congestion in future studies to more accurately capture route speeds and stop delays. In addition,
more complex simulation scenarios could improve the realism of the simulation output as well.
Large transit agencies such as the TTC make changes to their services on a daily basis to address
transit disruptions and ease network congestion. Surface transit operational changes such as short-
turned trips or temporary shuttle buses could be useful to incorporate in future simulation models.
In future studies, advanced dwell time models that incorporate passenger demand data from
automatic fare payment or automatic passenger counter systems could be used to capture the
effects of crowding and dwell times on transit movements. Advanced dwell time models could handle more
complex scenarios with varying passenger demand. For example, the effects of stop additions,
relocations, and removals could be assessed by reallocating passenger demand among stops.
To further enhance the benefits of more advanced dwell time models, software integration
with pedestrian and station models could further improve the realism of the simulation model by
representing changing passenger demands across the different modes of transit. Such integration
may require the synchronization of time steps during the iterative prediction procedure to allow
simultaneous updates. The conversion of the current event-based running speed prediction
procedure to time-step-based requires further software development efforts.
Finally, the use of continuous learning algorithms to provide real-time transit simulation
models could enable transit agencies to quantify and optimize the effects of operational decisions.
A key benefit of the data-driven simulation model is that there is no need to construct a model from
the ground up. The simulation pipeline presented in this thesis could be converted into a continuous
learning application where the data download and data processing could take place continuously
to update the estimated running speed and dwell time models. Future research could use continuous
learning modelling methods to create online transit simulation models.
164
Bibliography
Allen, M. P. (1997). The coefficient of determination in multiple regression. In Understanding
Regression Analysis (pp. 91–95). Springer, Boston, MA. https://doi.org/10.1007/978-0-
585-25657-3_19
Bahuleyan, H., & Vanajakshi, L. D. (2017). Arterial Path-Level Travel-Time Estimation Using
Machine-Learning Techniques. Journal of Computing in Civil Engineering, 31(3).
https://doi.org/10.1061/(ASCE)CP.1943-5487.0000644
Bai, C., Peng, Z.-R., Lu, Q.-C., & Sun, J. (2015). Dynamic Bus Travel Time Prediction Models
on Road with Multiple Bus Routes. Computational Intelligence and Neuroscience, 2015,
e432389. https://doi.org/10.1155/2015/432389
Balakrishna, P., Raman, S., Trafalis, T. B., & Santosa, B. (2008). Support vector regression for
determining the minimum zone sphericity. The International Journal of Advanced
Manufacturing Technology, 35(9–10), 916–923. https://doi.org/10.1007/s00170-006-
0774-1
Bates, D., Maechler, M., Bolker, B., Walker, S., Christensen, R. H. B., Singmann, H., … Green,
P. (2017, April 19). Linear Mixed-Effects Models using “Eigen” and S4 (Package
“lme4”). CRAN Repository. Retrieved from https://cran.r-
project.org/web/packages/lme4/lme4.pdf
165
Berk, R. A. (2008). Classification and Regression Trees (CART). In Statistical Learning from a
Regression Perspective (pp. 1–65). Springer, New York, NY.
https://doi.org/10.1007/978-0-387-77501-2_3
Bernstein, M. N. (2015, August 29). The Radial Basis Function Kernel. University of Wisconsin
- Madison. Retrieved from
http://pages.cs.wisc.edu/~matthewb/pages/notes/pdf/svms/RBFKernel.pdf
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/A:1010933404324
Caruana, R., & Niculescu-Mizil, A. (2006). An Empirical Comparison of Supervised Learning
Algorithms. In Proceedings of the 23rd International Conference on Machine Learning
(pp. 161–168). New York, NY, USA: ACM. https://doi.org/10.1145/1143844.1143865
Casas, J., Perarnau, J., & Torday, A. (2011). The need to combine different traffic modelling
levels for effectively tackling large-scale projects adding a hybrid meso/micro approach.
Procedia - Social and Behavioral Sciences, 20, 251–262.
https://doi.org/10.1016/j.sbspro.2011.08.031
Catala, M., Dowling, S., & Hayward, D. (2011). Expanding the Google Transit Feed
Specification to Support Operations and Planning. Retrieved from
https://trid.trb.org/view.aspx?id=1127545
166
Cats, O., Burghout, W., Toledo, T., & Koutsopoulos, H. (2010). Mesoscopic Modeling of Bus
Public Transportation. Transportation Research Record: Journal of the Transportation
Research Board, 2188, 9–18. https://doi.org/10.3141/2188-02
Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error
(MAE)? – Arguments against avoiding RMSE in the literature. Geosci. Model Dev., 7(3),
1247–1250. https://doi.org/10.5194/gmd-7-1247-2014
Charpentier, A. (2013, January 27). Regression tree using Gini’s index. Retrieved May 19, 2017,
from http://freakonometrics.hypotheses.org/1279
Chen, M., Yu, L., Zhang, Y., & Guo, J. (2009). Analyzing urban bus service reliability at the
stop, route, and network levels. Transportation Research Part A: Policy and Practice,
43(8), 722–734. https://doi.org/10.1016/j.tra.2009.07.006
Chen, X., Liu, X., Xia, J., & Chien, S. I. (2004). A Dynamic Bus-Arrival Time Prediction Model
Based on APC Data. Computer-Aided Civil and Infrastructure Engineering, 19(5), 364–
376. https://doi.org/10.1111/j.1467-8667.2004.00363.x
Christoph Rüegg, Marcus Cuda, Jurgen Van Gael, Scott Stephens, Alexander Karatarakis, Eirik
Ovegard, … Gustavo Guerra. (2017, July 16). Math.NET Numerics. Retrieved September
20, 2017, from https://numerics.mathdotnet.com/
City of Toronto. (2015, October 7). Road Restrictions. Retrieved July 10, 2017, from
https://www1.toronto.ca/wps/portal/contentonly?vgnextoid=1af0e69ae554e410VgnVCM
10000071d60f89RCRD
167
City of Toronto. (2017a). TTC Routes and Schedules - Open Data Toronto. Retrieved September
22, 2017, from
https://www1.toronto.ca/wps/portal/contentonly?vgnextoid=96f236899e02b210VgnVC
M1000003dd60f89RCRD
City of Toronto. (2017b, April 19). Traffic Signal Vehicle and Pedestrian Volumes. Retrieved
April 19, 2017, from
http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=417aed3c99cc7310VgnVCM
1000003dd60f89RCRD
City of Toronto. (2017c, April 19). Traffic Signals Tabular. Retrieved April 19, 2017, from
http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=965b868b5535b210VgnVCM
1000003dd60f89RCRD
Cortés, C., Pagès, L., & Jayakrishnan, R. (2005). Microsimulation of Flexible Transit System
Designs in Realistic Urban Networks. Transportation Research Record: Journal of the
Transportation Research Board, 1923, 153–163. https://doi.org/10.3141/1923-17
Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Mach. Learn., 20(3), 273–297.
https://doi.org/10.1023/A:1022627411411
Elhenawy, M., Chen, H., & Rakha, H. A. (2014). Random Forest Travel Time Prediction
Algorithm Using Spatiotemporal Speed Measurements. Presented at the 21st Intelligent
Transportation Systems World Congress Conference Proceedings. Retrieved from
168
http://itswc.conferencespot.org/56062-itsa-1.1291464/t-005-1.1292028/f-023-
1.1292111/ts55-1.1292075/12307-1.1292120?qr=1
Farid, Y. Z., Christofa, E., & Paget-Seekins, L. (2016). Estimation of Short-Term Bus Travel
Time by Using Low-Resolution Automated Vehicle Location Data. Transportation
Research Record: Journal of the Transportation Research Board, 2539, 113–118.
https://doi.org/10.3141/2539-13
Fernandez, R., Cortes, C., & Burgos, V. (2010). Microscopic simulation of transit operations:
policy studies with the MISTRANSIT application programming interface. Transportation
Planning and Technology, 33(2), 157–176.
Gaudette, P., Chapleau, R., & Spurr, T. (2016). Bus Network Microsimulation with General
Transit Feed Specification and Tap-in-Only Smart Card Data. Transportation Research
Record: Journal of the Transportation Research Board, (2544). Retrieved from
https://trid.trb.org/view.aspx?id=1393935
Goldstein, B., & Dyson, L. (Eds.). (2013). Beyond transparency: open data and the future of
civic innovation. San Francisco: Code for America Press.
Google Inc. (2016, July 12). Overview | Static Transit. Retrieved August 8, 2017, from
https://developers.google.com/transit/gtfs/reference/
Gumedze, F. N., & Dunne, T. T. (2011). Parameter estimation and inference in the linear mixed
model. Linear Algebra and Its Applications, 435(8), 1920–1944.
https://doi.org/10.1016/j.laa.2011.04.015
Harvey, A. C. (1990). Forecasting, Structural Time Series Models and the Kalman Filter.
Cambridge University Press.
Hastie, T., Friedman, J., & Tibshirani, R. (2008). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction (2nd ed). Stanford, California: Springer.
Hoch, T. (2015). An Ensemble Learning Approach for the Kaggle Taxi Travel Time Prediction
Challenge. In Proceedings of the 2015 International Conference on ECML PKDD
Discovery Challenge - Volume 1526 (pp. 52–62). Aachen, Germany: CEUR-WS.org.
Retrieved from http://dl.acm.org/citation.cfm?id=3056172.3056179
Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2016, May 19). A Practical Guide to Support Vector
Classification. National Taiwan University. Retrieved from
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Hu, C., Zhou, B., & Hu, J. (2014). Fast Support Vector Data Description training using edge
detection on large datasets. In 2014 International Joint Conference on Neural Networks
(IJCNN) (pp. 2176–2182). https://doi.org/10.1109/IJCNN.2014.6889718
Hu, W. X., & Shalaby, A. (2017). Use of Automated Vehicle Location Data for Route- and
Segment-Level Analyses of Bus Route Reliability and Speed. Transportation Research
Record: Journal of the Transportation Research Board, 2649, 9–19.
https://doi.org/10.3141/2649-02
Perraud, J.-M. (2015, August 8). R.NET. Retrieved July 31, 2017, from
https://rdotnet.codeplex.com/Wikipage?ProjectName=rdotnet
Jeong, R., & Rilett, R. (2004). Bus arrival time prediction using artificial neural network model.
In Proceedings. The 7th International IEEE Conference on Intelligent Transportation
Systems (IEEE Cat. No.04TH8749) (pp. 988–993).
https://doi.org/10.1109/ITSC.2004.1399041
Kormaksson, M., Barbosa, L., Vieira, M. R., & Zadrozny, B. (2014). Bus Travel Time
Predictions Using Additive Models. arXiv:1411.7973 [Cs, Stat], 875–880.
https://doi.org/10.1109/ICDM.2014.107
Kucirek, P. (2012, September). Comparison between MATSim & EMME: Developing a
Dynamic, Activity-based Microsimulation Transit Assignment Model for Toronto.
University of Toronto, Toronto, ON. Retrieved from
https://tspace.library.utoronto.ca/bitstream/1807/33277/10/Kucirek_Peter_201211_MASc
_thesis.pdf
Lee, W.-C., Si, W., Chen, L.-J., & Chen, M. C. (2012). HTTP: A New Framework for Bus
Travel Time Prediction Based on Historical Trajectories. In Proceedings of the 20th
International Conference on Advances in Geographic Information Systems (pp. 279–
288). New York, NY, USA: ACM. https://doi.org/10.1145/2424321.2424357
Li, F., Duan, Z., & Yang, D. (2012). Dwell time estimation models for bus rapid transit stations.
Journal of Modern Transportation, 20(3), 168–177. https://doi.org/10.1007/BF03325795
Lim, T.-S., Loh, W.-Y., & Shih, Y.-S. (2000). A Comparison of Prediction Accuracy,
Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms.
Machine Learning, 40(3), 203–228. https://doi.org/10.1023/A:1007608224229
Louppe, G. (2014). Understanding Random Forests: From Theory to Practice. arXiv:1407.7502
[Stat]. Retrieved from http://arxiv.org/abs/1407.7502
Marill, K. A. (2004). Advanced Statistics: Linear Regression, Part II: Multiple Linear
Regression. Academic Emergency Medicine, 11(1), 94–102.
https://doi.org/10.1197/j.aem.2003.09.006
Marsupial, D. (2012, January 12). SVM, variable interaction and training data fit. Retrieved July
19, 2017, from https://stats.stackexchange.com/questions/21291/svm-variable-
interaction-and-training-data-fit
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F., Chang, C.-C., & Lin, C.-J.
(2017, February 2). Misc Functions of the Department of Statistics, Probability Theory
Group (Package “e1071”). CRAN Repository. Retrieved from https://cran.r-
project.org/web/packages/e1071/e1071.pdf
NextBus. (2016, January 25). NextBus Public XML Feed. Retrieved from
http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf
OpenWeatherMap. (2017). Weather API. Retrieved July 10, 2017, from
https://openweathermap.org/api
Ordonez, S., & Erath, A. (2011). Semi-Automatic Tool For Map-Matching Bus Routes On High-
Resolution Navigation Networks. In 16th International Conference of Hong Kong
Society for Transportation Studies. Hong Kong.
Padmanaban, R. P. S., Vanajakshi, L., & Subramanian, S. C. (2009). Estimation of bus travel
time incorporating dwell time for APTS applications. In 2009 IEEE Intelligent Vehicles
Symposium (pp. 955–959). https://doi.org/10.1109/IVS.2009.5164409
Polus, A. (1979). A Study of Travel Time and Reliability on Arterial Routes. Transportation,
8(2), 141–151. https://doi.org/10.1007/BF00167196
Hänsch, R., & Hellwich, O. (2015). Performance Assessment and Interpretation of Random
Forests by Three-Dimensional Visualizations. In Proceedings of the International
Conference on Information Visualization Theory and Application. Berlin, Germany.
Retrieved from http://www.rhaensch.de/vrf.html
Rashidi, S., Ranjitkar, P., & Hadas, Y. (2014). Modeling Bus Dwell Time with Decision Tree-
Based Methods. Transportation Research Record: Journal of the Transportation
Research Board, 2418, 74–83. https://doi.org/10.3141/2418-09
Ripley, B., Venables, B., Bates, D. M., Hornik, K., Gebhardt, A., & Firth, D. (2017, April 21).
Support Functions and Datasets for Venables and Ripley’s MASS (Package “MASS”).
CRAN Repository. Retrieved from https://cran.r-
project.org/web/packages/MASS/MASS.pdf
Seltman, H. J. (2016). Chapter 15: Mixed Models. In Experimental Design and Analysis (1st ed.,
pp. 357–376). Carnegie Mellon University. Retrieved from
http://www.stat.cmu.edu/~hseltman/309/Book/chapter15.pdf
Shalaby, A., & Farhan, A. (2004). Prediction Model of Bus Arrival and Departure Times Using
AVL and APC Data. Journal of Public Transportation, 7(1).
https://doi.org/http://dx.doi.org/10.5038/2375-0901.7.1.3
Smola, A. J., & Schölkopf, B. (2004). A Tutorial on Support Vector Regression. Statistics and
Computing, 14(3), 199–222. https://doi.org/10.1023/B:STCO.0000035301.49549.88
Spiliopoulou, A., Kontorinaki, M., Papageorgiou, M., & Kopelias, P. (2014). Macroscopic traffic
flow model validation at congested freeway off-ramp areas. Transportation Research
Part C: Emerging Technologies, 41, 18–29. https://doi.org/10.1016/j.trc.2014.01.009
Srikukenthiran, S. (2015, June). Integrated Microsimulation Modelling of Crowd and Subway
Network Dynamics for Disruption Management Support. Retrieved from
https://tspace.library.utoronto.ca/handle/1807/69521
Steinwart, I., & Thomann, P. (2017, July 19). A Fast and Versatile SVM Package (Package
“liquidSVM”). CRAN Repository. Retrieved from https://cran.r-
project.org/web/packages/liquidSVM/liquidSVM.pdf
Swanson, D. A., Tayman, J., & Bryan, T. M. (2011). MAPE-R: a rescaled measure of accuracy
for cross-sectional subnational population forecasts. Journal of Population Research,
28(2–3), 225–243. https://doi.org/10.1007/s12546-011-9054-5
Therneau, T. M., & Atkinson, E. J. (2017, March 12). An Introduction to Recursive
Partitioning Using the RPART Routines. R Project. Retrieved from https://cran.r-
project.org/web/packages/rpart/vignettes/longintro.pdf
Therneau, T., Atkinson, B., & Ripley, B. (2017, April 21). Recursive Partitioning and Regression
Trees (Package “rpart”). CRAN Repository. Retrieved from https://cran.r-
project.org/web/packages/rpart/rpart.pdf
Toronto Transit Commission. (2017a). Chief Executive Officer’s Report – July 2017 Update (p.
5). Retrieved from
https://www.ttc.ca/About_the_TTC/Commission_reports_and_information/Commission_
meetings/2017/July_12/Reports/1_Chief_Executive_Officer%27s_Report_July_2017_Up
date.pdf
Toronto Transit Commission. (2017b, January 1). TTC Operating Statistics Section One -
System Quick Facts. Retrieved July 31, 2017, from
https://www.ttc.ca/About_the_TTC/Operating_Statistics/2016/section_one.jsp
Toronto Transit Commission. (2017c, July 20). TTC Maps. Retrieved August 6, 2017, from
http://www.ttc.ca/Routes/General_Information/Maps/index.jsp
TransitFeeds. (2016, July). TTC GTFS - TransitFeeds. Retrieved from
http://transitfeeds.com/p/ttc/33
Weiss, A., Mahmoud, M. S., Kucirek, P., & Habib, K. N. (2014). Merging transit schedule
information with a planning network to perform dynamic multimodal assignment: lessons
from a case study of the Greater Toronto and Hamilton Area. Canadian Journal of Civil
Engineering, 41(10), 900–908. https://doi.org/10.1139/cjce-2014-0202
Wilson, N. (2016). Opportunities Provided by Automated Data Collection Systems. In
Restructuring Public Transport through Bus Rapid Transit. Policy Press Scholarship
Online. Retrieved from
http://policypress.universitypressscholarship.com/view/10.1332/policypress/9781447326
168.001.0001/upso-9781447326168-chapter-14
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools
and Techniques. Elsevier.
Wright, M. N. (2017, June 20). A Fast Implementation of Random Forests (Package “ranger”).
CRAN Repository. Retrieved from https://cran.r-
project.org/web/packages/ranger/ranger.pdf
Yang, L., Haghani, A., & Qiao, W. (2013). Macroscopic Traffic Flow Model for Estimation of
Real-Time Traffic State Along Signalized Arterial Corridor: Model Development and
Implementation. Transportation Research Record: Journal of the Transportation
Research Board, 2391, 142–153.
Young, D. S. (2017). Chapter 7: Multiple Regression. In Part II: Multiple Linear Regression (pp.
87–90). Retrieved from
https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/
files/pt2_multiple_linear_regression.pdf
Yu, B., Lam, W. H. K., & Tam, M. L. (2011). Bus arrival time prediction at bus stop with
multiple routes. Transportation Research Part C: Emerging Technologies, 19(6), 1157–
1170. https://doi.org/10.1016/j.trc.2011.01.003
Yu, B., Yang, Z., & Yao, B. (2006). Bus Arrival Time Prediction Using Support Vector
Machines. Journal of Intelligent Transportation Systems, 10(4), 151–158.
https://doi.org/10.1080/15472450600981009
Appendix A Selected Programming Code
A.1 Data Collection Algorithm
A.1.1 Next Bus Data Download Class (SSWebCrawlerNextBus)
public class SSWebCrawlerNextBus
{
    string xmlGPSFolderPath;
    string xmlGPSFilenameFormat;
    long fileSaveFreq; //time interval to save the permanent xml file - HOURLY = 3600
    long bkSaveFreq; //xml raw file backup period - raw xml files saved will be deleted after this time interval. = 60
    private long lasttimeTimeURLField;

    List<SSVehGPSDataTable> processedGPSDataFromXML;

    private ConcurrentDictionary<Tuple<long, string>, Vehicle> downloadedGPSXMLData_UNSAVED;

    public SSWebCrawlerNextBus(string xmlFolderPath, long xmlFileTimeIncreInSecs, long xmlFileBackupTimeInSecs)
    {
        xmlGPSFolderPath = xmlFolderPath;
        fileSaveFreq = xmlFileTimeIncreInSecs;   //how often files are saved
        bkSaveFreq = xmlFileBackupTimeInSecs;    //how often backup files are saved
        processedGPSDataFromXML = new List<SSVehGPSDataTable>();
        xmlGPSFilenameFormat = "all-GPS-{0}-{1}.xml";
    }

    /// <summary>
    /// Polling frequency should be equal or less than 20s for max polling rate
    /// </summary>
    /// <param name="pollingStart"></param>
    /// <param name="pollingEnd"></param>
    /// <param name="pollingFrequency"></param>
    /// <returns></returns>
    public int startLiveDownloadTask(DateTime pollingStart, DateTime pollingEnd, double pollingFrequency)
    {
        lasttimeTimeURLField = 0;
        int downloadState = 0;
        //NOTE: polling freq: 19s < bk freq: 60s < save freq: 3600s
        downloadedGPSXMLData_UNSAVED = new ConcurrentDictionary<Tuple<long, string>, Vehicle>(); //GPSTime & VehID as tuple index
        DateTime nextDownloadTime;  // = DateTime.Now.ToLocalTime().AddSeconds(pollingFrequency);
        DateTime nextBackupTime;    // = DateTime.Now.ToLocalTime().AddSeconds(bkSaveFreq);
        //DateTime nextSaveTime = DateTime.Now.ToLocalTime().AddSeconds(fileSaveFreq);
        DateTime nextSaveTime;      // = pollingStart.Date.AddSeconds(pollingStart.Hour * 3600 - 1 + fileSaveFreq); //save on the last second of the next hour
        long lastLogUpdatedSave = 0;
        string bkfilename = "current-GPSXML-bk.xml";

        if (pollingStart.DateTimeToEpochTime() >= DateTime.Now.ToLocalTime().DateTimeToEpochTime()) //future download - put thread to sleep until start - proper state
        {
            //return pollingStart.DateTimeToEpochTime() - downloadStartTime.DateTimeToEpochTime();
            Thread.Sleep(Convert.ToInt32(1000 * (pollingStart.DateTimeToEpochTime() - DateTime.Now.ToLocalTime().DateTimeToEpochTime() - 1)));
            nextDownloadTime = pollingStart.AddSeconds(0);
            nextBackupTime = pollingStart.AddSeconds(0);
            nextSaveTime = pollingStart.Date.AddSeconds(pollingStart.Hour * 3600 - 1 + fileSaveFreq); //assume file save frequency is at the hour
        }
        else if ((pollingStart.DateTimeToEpochTime() <= DateTime.Now.ToLocalTime().DateTimeToEpochTime()) &&
                 (pollingEnd.DateTimeToEpochTime() > DateTime.Now.ToLocalTime().DateTimeToEpochTime())) //some missing downloads - start period is before download can take place
        {
            //polling reschedule message
            LogUpdateEventArgs args = new LogUpdateEventArgs();
            args.logMessage = String.Format("NextBus Polling to Start at: {0}.",
                DateTime.Now.ToLocalTime().Date.AddSeconds(DateTime.Now.ToLocalTime().Hour * 3600 + fileSaveFreq).DateTimeISO8601Format());
            OnLogUpdate(args);

            Thread.Sleep(Convert.ToInt32(1000 * (3600 - (DateTime.Now.ToLocalTime().Minute * 60 + DateTime.Now.ToLocalTime().Second - 1))));
            downloadState = -1;
            nextDownloadTime = DateTime.Now.ToLocalTime();
            nextBackupTime = DateTime.Now.ToLocalTime();
            nextSaveTime = DateTime.Now.ToLocalTime().Date.AddSeconds(DateTime.Now.ToLocalTime().Hour * 3600 - 1 + fileSaveFreq); //assume file save frequency is at the hour
        }
        else
        {
            nextDownloadTime = new DateTime();
            nextBackupTime = new DateTime();
            nextSaveTime = new DateTime();
            downloadState = -2;
            return downloadState;
        }

        //initial save message
        if (lastLogUpdatedSave != nextSaveTime.DateTimeToEpochTime())
        {
            lastLogUpdatedSave = nextSaveTime.DateTimeToEpochTime();
            LogUpdateEventArgs args = new LogUpdateEventArgs();
            args.logMessage = String.Format("NextBus Upcoming Save: {0}.", nextSaveTime.DateTimeISO8601Format());
            OnLogUpdate(args);
        }

        //download started after a proper wait - proper state, or start immediate if some data can be downloaded (state: -1)
        while (pollingEnd.DateTimeToEpochTime() >= DateTime.Now.ToLocalTime().DateTimeToEpochTime())
        {
            List<long> nextOpTime = new List<long>();
            nextOpTime.Add(nextDownloadTime.DateTimeToEpochTime());
            nextOpTime.Add(nextBackupTime.DateTimeToEpochTime());
            nextOpTime.Add(nextSaveTime.DateTimeToEpochTime());
            nextOpTime.Sort();
            if (nextOpTime[0] > DateTime.Now.ToLocalTime().DateTimeToEpochTime())
            {
                if (lastLogUpdatedSave != nextSaveTime.DateTimeToEpochTime())
                {
                    lastLogUpdatedSave = nextSaveTime.DateTimeToEpochTime();
                    LogUpdateEventArgs args = new LogUpdateEventArgs();
                    args.logMessage = String.Format("NextBus Upcoming Save: {0}.", nextSaveTime.DateTimeISO8601Format());
                    OnLogUpdate(args);
                }

                long sleepTime = nextOpTime[0] - DateTime.Now.ToLocalTime().DateTimeToEpochTime();
                Thread.Sleep(Convert.ToInt32(1000 * sleepTime)); //brief sleep in between operations
            }
            //this condition structure prioritizes download to mem, then backup, then save
            if (nextDownloadTime.DateTimeToEpochTime() <= DateTime.Now.ToLocalTime().DateTimeToEpochTime())
            {
                nextDownloadTime = nextDownloadTime.AddSeconds(pollingFrequency);
                downloadCurrentGPSXMLFromWeb(pollingFrequency);
            }
            else if (nextBackupTime.DateTimeToEpochTime() <= DateTime.Now.ToLocalTime().DateTimeToEpochTime())
            {
                nextBackupTime = nextBackupTime.AddSeconds(bkSaveFreq);
                SSUtil.SerializeXMLObject(downloadedGPSXMLData_UNSAVED.Values.ToList(), xmlGPSFolderPath + bkfilename);
            }
            else if (nextSaveTime.DateTimeToEpochTime() <= DateTime.Now.ToLocalTime().DateTimeToEpochTime())
            {
                if (downloadedGPSXMLData_UNSAVED.Values.Count > 0) //avoid saving empty file
                {
                    string newfilename = string.Format(xmlGPSFilenameFormat,
                        nextSaveTime.AddSeconds(-fileSaveFreq + 1).ToUniversalTime().DateTimeNamingFormat(),
                        nextSaveTime.ToUniversalTime().DateTimeNamingFormat());
                    SSUtil.SerializeXMLObject(downloadedGPSXMLData_UNSAVED.Values.ToList(), xmlGPSFolderPath + newfilename);
                    downloadedGPSXMLData_UNSAVED.Clear(); //clear mem
                }
                nextSaveTime = nextSaveTime.AddSeconds(fileSaveFreq);
            }
        }
        return downloadState;
    }

    /// <summary>
    /// More info: https://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf
    /// </summary>
    /// <param name="pollingOrRequestFrequency"></param>
    private async void downloadCurrentGPSXMLFromWeb(double pollingOrRequestFrequency) //download current GPS XML data from NextBus with processing for GPSTime
    {
        try
        {
            LastTime dataDownloadTime = new LastTime();
            List<Vehicle> downloadedGPSXML = new List<Vehicle>();
            Uri uriToDownload = new Uri(String.Format(
                "http://webservices.nextbus.com/service/publicXMLFeed?command=vehicleLocations&a={0}&t={1}",
                "ttc", lasttimeTimeURLField));
            using (WebClient client = new WebClient())
            {
                //download string and write to file
                string xml = await client.DownloadStringTaskAsync(uriToDownload);
                string xmlString = Convert.ToString(xml);
                using (StringReader reader = new StringReader(xmlString))
                {
                    XmlSerializer serializer = new XmlSerializer(typeof(GPSXMLQueryResult));
                    string headerString = reader.ReadLine();
                    string currentVehString = reader.ReadToEnd();
                    MemoryStream memStream = new MemoryStream(Encoding.UTF8.GetBytes(currentVehString));
                    GPSXMLQueryResult currentXMLGPSPoints = (GPSXMLQueryResult)serializer.Deserialize(memStream);
                    dataDownloadTime = currentXMLGPSPoints.LastTime;
                    downloadedGPSXML = currentXMLGPSPoints.Vehicles;
                } //end using reader
            } //end using client

            long currentLasttimeTimeField = Convert.ToInt64(dataDownloadTime.Time);
            //find GPSTime based on objects received
            if (!dataDownloadTime.Time.Equals(null) && !(downloadedGPSXML.Count == 0))
            {
                long currentDownloadTime = currentLasttimeTimeField / 1000; //in seconds
                for (int i = 0; i < downloadedGPSXML.Count; i++)
                {
                    downloadedGPSXML[i].GPStime = currentDownloadTime - Convert.ToInt64(downloadedGPSXML[i].SecsSinceReport);
                }
            }
            //Add downloadedGPSXML to downloadedGPSXMLData_UNSAVED
            foreach (Vehicle aGPSPoint in downloadedGPSXML)
            {
                downloadedGPSXMLData_UNSAVED.AddOrUpdate(
                    new Tuple<long, string>(aGPSPoint.GPStime, aGPSPoint.Id),
                    aGPSPoint, (k, v) => aGPSPoint); //replaces existing if it exists
            }
            lasttimeTimeURLField = Convert.ToInt64(dataDownloadTime.Time); //in msec
        }
        catch (Exception e)
        {
        }
    }

    /// <summary>
    /// Method to retrieve GPS data
    /// </summary>
    /// <param name="RouteTags"></param>
    /// <param name="startTime"></param>
    /// <param name="endTime"></param>
    /// <returns></returns>
    public List<SSVehGPSDataTable> RetriveGPSData(List<string> RouteTags, DateTime startTime, DateTime endTime)
    {
        List<SSVehGPSDataTable> finalPackageOfGPSDataAllRoutes = new List<SSVehGPSDataTable>();
        //concurrent map since files are loaded in parallel below
        ConcurrentDictionary<string, List<SSVehGPSDataTable>> allDataInGPSDataFile = new ConcurrentDictionary<string, List<SSVehGPSDataTable>>();
        List<string> gpsDataFilenames = new List<string>(); //in order

        DateTime lastIntermediateTime = startTime;
        DateTime intermediateTime = startTime.Date.AddHours(startTime.Hour).AddSeconds(fileSaveFreq - 1); //intermediate time 59:59 (min:sec) from start
        //initiate filenames
        while (intermediateTime.DateTimeToEpochTime() <= endTime.DateTimeToEpochTime())
        {
            string currentfilename = string.Format(xmlGPSFilenameFormat,
                lastIntermediateTime.ToUniversalTime().DateTimeNamingFormat(),
                intermediateTime.ToUniversalTime().DateTimeNamingFormat());
            gpsDataFilenames.Add(currentfilename);
            //increment to read next file
            lastIntermediateTime = lastIntermediateTime.AddSeconds(fileSaveFreq);
            intermediateTime = intermediateTime.AddSeconds(fileSaveFreq);
        }

        //load all files in parallel
        ParallelOptions pOptions = new ParallelOptions();
        pOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
        Parallel.For(0, gpsDataFilenames.Count, pOptions, i =>
        {
            string currentfilename = gpsDataFilenames[i];
            allDataInGPSDataFile.TryAdd(currentfilename,
                convertRawGPSToSSFormat(SSUtil.DeSerializeXMLObject<List<Vehicle>>(xmlGPSFolderPath + currentfilename)));
        });
        //add each file, then each gps point, in order
        foreach (string currentfilename in gpsDataFilenames)
        {
            //filter by routeTag
            foreach (SSVehGPSDataTable aGPSPoint in allDataInGPSDataFile[currentfilename])
            {
                string aGPSPointRouteTag = aGPSPoint.TripCode == "NULL" ? "" : aGPSPoint.TripCode.Split('_')[0];
                if (RouteTags.Contains(aGPSPointRouteTag))
                {
                    finalPackageOfGPSDataAllRoutes.Add(aGPSPoint);
                }
            }
        }

        allDataInGPSDataFile.Clear();
        gpsDataFilenames.Clear();
        return finalPackageOfGPSDataAllRoutes; //return result list
    }

    private List<SSVehGPSDataTable> convertRawGPSToSSFormat(List<Vehicle> downloadedGPSData)
    {
        List<SSVehGPSDataTable> convertedData = new List<SSVehGPSDataTable>();

        foreach (Vehicle aRawGPS in downloadedGPSData)
        {
            if (aRawGPS.DirTag != null && aRawGPS.DirTag != "NULL")
            {
                SSVehGPSDataTable aConvertedData = new SSVehGPSDataTable();

                string routeNum = aRawGPS.RouteTag.Length > 0 ? aRawGPS.RouteTag : "-1";
                string dir = aRawGPS.DirTag == "NULL" ? "-1" : aRawGPS.DirTag.Split('_')[1];
                string RouteTag = aRawGPS.DirTag == "NULL" ? "-1" : aRawGPS.DirTag.Split('_')[2];
                string TripCode = string.Concat(routeNum, "_", RouteTag, "_", aRawGPS.Id);

                aConvertedData.gpsID = -1; //unassigned
                aConvertedData.GPStime = aRawGPS.GPStime;
                aConvertedData.vehID = Convert.ToInt32(aRawGPS.Id);
                aConvertedData.TripCode = TripCode;
                aConvertedData.Direction = Convert.ToInt32(dir);
                aConvertedData.Longitude = Convert.ToDouble(aRawGPS.Lon);
                aConvertedData.Latitude = Convert.ToDouble(aRawGPS.Lat);
                aConvertedData.Heading = Convert.ToInt32(aRawGPS.Heading);

                convertedData.Add(aConvertedData);
            }
        }

        return convertedData;
    }
}
A.1.2 Next Bus Data Download Method (DownloadNextBusGPSData)
public void DownloadNextBusGPSData() //long dataPackageTimeIncrement, long dataPacketTimeIncrement
{
    //read settings from text file - root dir
    DateRange dataDownloadPeriod = new DateRange();
    List<DateTime> dateList = new List<DateTime>();
    string datePattern = @"(?<date>(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}))"; //Define regex string
    Regex reg = new Regex(datePattern); //read log content
    string Content = File.ReadAllText(MainFolderPath + "DataDownload-Setting.txt"); //startupFolderSS
    MatchCollection matches = reg.Matches(Content); //run regex
    foreach (Match m in matches) //iterate over matches
    {
        DateTime date = new DateTime(2016, 1, 1, 0, 0, 0, DateTimeKind.Local);
        date = DateTime.Parse(m.Groups["date"].Value);
        dateList.Add(date);
    } //local time is assumed to be entered
    dataDownloadPeriod.start = dateList[0];
    dataDownloadPeriod.end = dateList[1];

    long duration = (dataDownloadPeriod.end.DateTimeToEpochTime() - DateTime.Now.DateTimeToEpochTime());
    TaskUpdateEventArgs args = new TaskUpdateEventArgs();
    args.eventMessage = String.Format("NextBus data polling started..."); // Time to Completion: {0} mins.", duration / 60
    args.expectedDuration = Convert.ToDouble(duration);
    args.progressBarPercent = 90.0;
    args.timeReached = DateTime.Now;
    OnTaskUpdate(args);

    //nextbusWebCrawler = new SSWebCrawlerNextBus(gpsXMLPath, gpsStorageIncreInSecs, gpsFileBackupIncrenSecs);
    int completedDownloadState = NextbusWebCrawler.startLiveDownloadTask(
        dataDownloadPeriod.start, dataDownloadPeriod.end, GpsPollingFreq);
}
A.1.3 Weather Data Download Method (DownloadWeatherData)
public void DownloadWeatherData() //long dataPackageTimeIncrement, long dataPacketTimeIncrement
{
    //read settings from text file - root dir
    DateRange dataDownloadPeriod = new DateRange();
    List<DateTime> dateList = new List<DateTime>();
    string datePattern = @"(?<date>(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}))"; //Define regex string
    Regex reg = new Regex(datePattern); //read log content
    string Content = File.ReadAllText(MainFolderPath + "DataDownload-Setting.txt"); //startupFolderSS
    MatchCollection matches = reg.Matches(Content); //run regex
    foreach (Match m in matches) //iterate over matches
    {
        DateTime date = new DateTime(2016, 1, 1, 0, 0, 0, DateTimeKind.Local);
        date = DateTime.Parse(m.Groups["date"].Value);
        dateList.Add(date);
    } //local time is assumed to be entered
    dataDownloadPeriod.start = dateList[0];
    dataDownloadPeriod.end = dateList[1];

    long duration = (dataDownloadPeriod.end.DateTimeToEpochTime() - DateTime.Now.DateTimeToEpochTime());
    TaskUpdateEventArgs args = new TaskUpdateEventArgs();
    args.eventMessage = String.Format("Weather data (GTA) polling started..."); // Time to Completion: {0} mins.", duration / 60
    args.expectedDuration = Convert.ToDouble(duration);
    args.progressBarPercent = 90.0;
    args.timeReached = DateTime.Now;
    OnTaskUpdate(args);

    //openWeatherWebCrawler = new SSWebCrawlerOpenWeather(weatherDataPath, weatherFileBackupIncrenSecs); //, weatherStorageIncreInSecs
    int completedDownloadState = OpenWeatherWebCrawler.startLiveDownloadTask(
        dataDownloadPeriod.start, dataDownloadPeriod.end, WeatherPollingFreq); //5 min polling should be sufficient
}
A.1.4 Toronto Incidents Data Download Method (DownloadTorontoIncidentData)
public void DownloadTorontoIncidentData() //long dataPackageTimeIncrement, long dataPacketTimeIncrement
{
    //read settings from text file - root dir
    DateRange dataDownloadPeriod = new DateRange();
    List<DateTime> dateList = new List<DateTime>();
    string datePattern = @"(?<date>(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}))"; //Define regex string
    Regex reg = new Regex(datePattern); //read log content
    string Content = File.ReadAllText(MainFolderPath + "DataDownload-Setting.txt"); //startupFolderSS
    MatchCollection matches = reg.Matches(Content); //run regex
    foreach (Match m in matches) //iterate over matches
    {
        DateTime date = new DateTime(2016, 1, 1, 0, 0, 0, DateTimeKind.Local);
        date = DateTime.Parse(m.Groups["date"].Value);
        dateList.Add(date);
    } //local time is assumed to be entered
    dataDownloadPeriod.start = dateList[0];
    dataDownloadPeriod.end = dateList[1];

    long duration = (dataDownloadPeriod.end.DateTimeToEpochTime() - DateTime.Now.DateTimeToEpochTime());
    TaskUpdateEventArgs args = new TaskUpdateEventArgs();
    args.eventMessage = String.Format("Incident data (TOR) polling started..."); // Time to Completion: {0} mins.", duration / 60
    args.expectedDuration = Convert.ToDouble(duration);
    args.progressBarPercent = 90.0;
    args.timeReached = DateTime.Now;
    OnTaskUpdate(args);

    int completedDownloadState = TorInciWebCrawler.startLiveDownloadTask(
        dataDownloadPeriod.start, dataDownloadPeriod.end, IncidentPollingFreq); //5 min polling should be sufficient
}
A.1.5 Offline Data Loading Method (LoadDataFromOfflineFiles)
public void LoadDataFromOfflineFiles(bool includeIncidentData, bool includeWeatherData)
{
    bool saveChangesToDb = true;
    UpdateAndLoadGPSDbDataFromFiles(saveChangesToDb); //update gps database and load new gps points to data object
    //1. Load GTFS and AVL data
    if (!File.Exists(DbFile_GTFS) || StartNewGTFSdbFiles)
    {
        GTFSDatabase = new GTFSConvertSSim(Path.GetDirectoryName(DbFile_GTFS), Path.GetFileName(DbFile_GTFS));

        GTFSDatabase.btn_Convert_and_Save_Click();
        GTFSDatabase.generateSQLiteDB_SSim();
    }
    LoadGTFSReferenceData(); //read GTFS data for network initialization
    InitializeGPSTripDB(false); //process new GPS points, let changes be saved after with incident and weather

    //2. Load and process Incident and Weather Data
    if (includeIncidentData)
    {
        //load Incident Data
        UpdateDbIncidentDataFromFiles();
        RdIncidentsTable = GetTorInciDataFromDB();
        //Compute incident value for newly loaded GPS points
    }
    if (includeWeatherData)
    {
        //load Weather Data
        UpdateDbWeatherDataFromFiles();
        WeatherTable = GetWeatherDataFromDB();
        //Compute weather value for newly loaded GPS points
    }
    SSFileReaderTorINTXN intxnReader = new SSFileReaderTorINTXN(
        TrafficSignalFile, TrafficIntnxVolFile, TrafficPedCrossFile, TrafficFlasingBeaconFile);
    IntxnSignalTable = intxnReader.getCombinedIntxnData();

    SaveSimulatorDbChanges(false);
    UpdateGUI_StatusBox("Source data loaded...", 100.0, 0.0);
}
A.1.6 GTFS Data Loading Method (LoadGTFSReferenceData)
private void LoadGTFSReferenceData()
{
    /*
     * Read key GTFS data on stops, routes, service periods and schedules, into respective objects:
     * Parallelized reads: Route Group and Route IDs, Shapes, Stops, Service Periods, Trips
     * Schedules and post-processing of schedule data are done after the parallelized reads
     */
    //Some Notes: https://manski.net/2012/10/sqlite-performance/ ==> for very few threads (my guess: thread count <= cpu count), multiple connections perform much better. However, for more threads, a single shared connection is the better choice.
    #region ParallelTasks for reading GTFS data from database

    //SQLiteConnection conn = new SQLiteConnection("Data Source=" + @dbFile_GTFS + "; Cache Size=10000; Page Size=4096")
    SQLiteConnection source = new SQLiteConnection(@"Data Source=" + DbFile_GTFS + "; Cache Size=10000; Page Size=4096");
    source.Open();
    SQLiteConnection conn_GTFSdb = new SQLiteConnection(@"Data Source=:memory:");
    conn_GTFSdb.Open();
    //copy db file to memory - need to save changes if not read only
    source.BackupDatabase(conn_GTFSdb, "main", "main", -1, null, 0);
    //source.Close();

    Parallel.Invoke(() =>
    {
        #region Read GTFS ROUTE GROUPS AND ROUTE IDs
        List<SSimRouteGroupRef> AllRouteGroupData = new List<SSimRouteGroupRef>();
        string sql = "select * from route_groups order by groupID";
        //select * from schedules where serviceID=4 order by serviceID (Query Example)
        using (SQLiteCommand command = new SQLiteCommand(sql, conn_GTFSdb))
        {
            SSimRouteGroupRef TempGroupData;
            int TempGroupID;
            List<int> TempRouteIDs;
            SQLiteDataReader reader = command.ExecuteReader();
            while (reader.Read())
            {
                TempGroupData = new SSimRouteGroupRef();
                TempGroupID = reader.GetInt32(0);
                TempGroupData.GroupID = TempGroupID;
                TempGroupData.RouteCode = reader.GetString(1);
                TempGroupData.RouteName = reader.GetString(2);
                string routeTyp = reader.GetString(3);
                TempGroupData.RouteType =
                    (routeTyp.Equals("surface_fixed") ? routeType.surface_fixed :
                    (routeTyp.Equals("underground") ? routeType.underground :
                    (routeTyp.Equals("surface_separated") ? routeType.surface_separated :
                    (routeTyp.Equals("surface_variable") ? routeType.surface_variable :
                    (routeTyp.Equals("aerial") ? routeType.aerial : routeType.aerial)))));

                TempGroupData.AgencyID = reader.GetInt32(4);

                string sql2_routesInGroup = "select * from routes where groupID=" + TempGroupID + " order by routeID";
                using (SQLiteCommand command_route = new SQLiteCommand(sql2_routesInGroup, conn_GTFSdb))
                {
                    SQLiteDataReader reader_route = command_route.ExecuteReader();
                    TempRouteIDs = new List<int>();
                    ConcurrentDictionary<int, SSimRouteRef> TempRouteInfo = new ConcurrentDictionary<int, SSimRouteRef>();//routeInfo by routeID - Temp

                    while (reader_route.Read())
                    {
                        SSimRouteRef TempSingleRouteInfo = new SSimRouteRef();
                        TempRouteIDs.Add(reader_route.GetInt32(0));
                        string name = reader_route.GetString(2).Replace(" - ", "^");
                        string[] nameArray = name.Split('^');
                        TempSingleRouteInfo.RouteTag = nameArray[1].Split(' ').First().ToUpper();
                        TempSingleRouteInfo.DirTag = Convert.ToInt32(name.Split('^').Last().Split(' ').Last());
                        TempSingleRouteInfo.Include = false; //default to exclude the route from analysis
                        TempSingleRouteInfo.Name = string.Format("{0} - {1}B", nameArray[1], nameArray[0][0]);//i.e.: 94A WELLESLEY TOWARDS OSSINGTON STATION - WB

                        TempRouteInfo.GetOrAdd(Convert.ToInt32(reader_route.GetInt32(0)), TempSingleRouteInfo);
                        TempSingleRouteInfo = null;
                    }
                    TempGroupData.RouteIDs = TempRouteIDs;
                    TempGroupData.RouteInfo = TempRouteInfo;
                    TempGroupData.serviceAvailByServiceP = new ConcurrentDictionary<int, bool>();
                    TempGroupData.mainRouteIDDir0ByServiceP = new ConcurrentDictionary<int, int>();
                    TempGroupData.mainRouteIDDir1ByServiceP = new ConcurrentDictionary<int, int>();
                    TempGroupData.subRouteIDDir0 = new List<Tuple<int, string, int>>();
                    TempGroupData.subRouteIDDir1 = new List<Tuple<int, string, int>>();

                    AllRouteGroupData.Add(TempGroupData);
                }
                TempGroupData = null;
                TempRouteIDs = null;
            }
        }
        List<SSimRouteGroupRef> SurfaceRouteGroupData = AllRouteGroupData.Where(o => (o.RouteType != routeType.aerial)).Where(o => (o.RouteType != routeType.underground)).ToList();
        GtfsRouteGroupTableByRouteCode = new ConcurrentDictionary<string, SSimRouteGroupRef>(SurfaceRouteGroupData.ToDictionary(v => v.RouteCode));
        GtfsRouteGroupTableByGroupID = new ConcurrentDictionary<int, SSimRouteGroupRef>(SurfaceRouteGroupData.ToDictionary(v => v.GroupID));
        //}
        #endregion
    }, // close first Action
    () =>
    {
        #region Read GTFS SHAPES
        GtfsShapeTable = new ConcurrentDictionary<string, RouteShape>();

        string sql = "select * from shapes order by shapeID";//DESC
        using (SQLiteCommand command = new SQLiteCommand(sql, conn_GTFSdb))
        {
            SQLiteDataReader reader = command.ExecuteReader();
            while (reader.Read())
            {
                RouteShape TempShape = new RouteShape();
                TempShape.SQLID = reader.GetInt32(0);
                //int n;
                //bool isNumeric = int.TryParse(TempShape.ShapeID, out n);
                //if (isNumeric)
                //    TempShape.SQLID = n;
                //else
                //    TempShape.SQLID = 0;
                TempShape.ShapeID = TempShape.SQLID.ToString();
                TempShape.Path = (List<Tuple<double, double, double>>)ByteArrayToObject((byte[])reader[1]);//Encoding.ASCII.GetBytes((byte[])reader[1])
                GtfsShapeTable.GetOrAdd(TempShape.ShapeID, TempShape);
                TempShape = null;
            }
        }
        #endregion
    }, //close second Action
    () =>
    {
        #region Read GTFS STOPS
        GtfsStopTable = new ConcurrentDictionary<int, SSimStopRef>();//contains all stop references, with GPS and stop types. TempStopID as reference.
        string sql = "select * from stops order by stopID";
        using (SQLiteCommand command = new SQLiteCommand(sql, conn_GTFSdb))
        {
            SSimStopRef TempStopData;
            int TempStopID;
            SQLiteDataReader reader = command.ExecuteReader();
            while (reader.Read())
            {
                TempStopData = new SSimStopRef();
                TempStopID = reader.GetInt32(0);
                TempStopData.StopID = TempStopID;
                TempStopData.StopCode = reader.GetString(1);
                TempStopData.Longitude = reader.GetDouble(2);
                TempStopData.Latitude = reader.GetDouble(3);
                string stopTyp = reader.GetString(4);
                TempStopData.StopType =
                    (stopTyp.Equals("platform_bus") ? stopType.platform_bus :
                    (stopTyp.Equals("platform_train") ? stopType.platform_train :
                    (stopTyp.Equals("stop") ? stopType.stop :
                    (stopTyp.Equals("doorway") ? stopType.doorway :
                    (stopTyp.Equals("vehicle_yard") ? stopType.vehicle_yard :
                    (stopTyp.Equals("subway_stop") ? stopType.subway_stop :
                    (stopTyp.Equals("timing_stop") ? stopType.timing_stop : stopType.timing_stop)))))));//subway_stop: modified stop type for subway transfer stations
                TempStopData.StopName = reader.GetString(5);

                //double check to make sure subway_stop is correctly assigned
                if ((TempStopData.StopName.Contains("STATION") && TempStopData.StopName.Contains(" AT ") && !TempStopData.StopName.Contains("STATION LOOP")) ||
                    (TempStopData.StopName.Contains("STATION)")))
                {
                    TempStopData.StopType = stopType.subway_stop;
                    //newStop.StopName += " - SUBWAY STOP";//Not needed
                }

                string connectingstopstring = reader.GetString(6);
                string[] connectingstopstringarray;
                int[] connectingstopInt;
                if (connectingstopstring != "none")
                {
                    connectingstopstringarray = connectingstopstring.Split(' ');
                    connectingstopInt = Array.ConvertAll(connectingstopstringarray, int.Parse);
                    TempStopData.ConnectingStopIDs = connectingstopInt.ToList();
                }
                TempStopData.scheduledStopTimesByServiceID = (Dictionary<int, List<TimeSpan>>)ByteArrayToObject((byte[])reader[7]);

                //string stopTimesString = reader.GetString(7);
                //Array.ConvertAll(stopTimesString.Split(' '), TimeSpan.Parse).ToList();

                GtfsStopTable.GetOrAdd(TempStopID, TempStopData);//uses stopId as index
                TempStopData = null;
            }
        }

        AssignDefaultDataProcessingSettings();//get all stops that need to have exceptions

        #endregion
    }, //close third Action
    () =>
    {
        #region read GTFS SERVICE PERIODS
        GtfsServicePeriodTable = new ConcurrentDictionary<int, SSimServicePeriodsRef>();//contains list of service period types and respective agency. Service ID as reference.
        string sql = "select * from servicePeriods order by serviceID";
        using (SQLiteCommand command = new SQLiteCommand(sql, conn_GTFSdb))
        {
            SSimServicePeriodsRef TempServiceData;
            int TempServiceID;
            SQLiteDataReader reader = command.ExecuteReader();
            while (reader.Read())
            {
                TempServiceData = new SSimServicePeriodsRef();
                TempServiceID = reader.GetInt32(0);
                TempServiceData.ServiceID = TempServiceID;
                string dayTyp = reader.GetString(1);
                TempServiceData.TypeOfDay =
                    (dayTyp.Equals("weekday") ? dayType.weekday :
                    (dayTyp.Equals("saturday") ? dayType.saturday :
                    (dayTyp.Equals("sunday") ? dayType.sunday :
                    (dayTyp.Equals("special") ? dayType.special : dayType.special))));//second special day not handled here
                //TempServiceData.agencyID = Convert.ToInt32(reader[2]);
                GtfsServicePeriodTable.GetOrAdd(TempServiceID, TempServiceData);
                //servicePeriodTable.Add(Convert.ToInt32(TempServiceData.typeOfDay), TempServiceData);//duplicated special daytype
                TempServiceData = null;
            }
        }
        #endregion
    }, //close fourth Action
    () =>
    {
        #region Read GTFS TRIPS - need to be run before schedule
        //ConcurrentDictionary<int, SSimGTFSTripRef>
        GtfsTripTable = new ConcurrentDictionary<int, SSimGTFSTripRef>();//contains list of service period types and respective agency. Service ID as reference.
        GtfsTripIDByScheduleID = new Dictionary<int, List<int>>();//use for post-processing schedule table
        string sql = "select * from trips order by TripID";//DESC
        using (SQLiteCommand command = new SQLiteCommand(sql, conn_GTFSdb))
        {
            SQLiteDataReader reader = command.ExecuteReader();
            while (reader.Read())
            {
                SSimGTFSTripRef TempTrip = new SSimGTFSTripRef();
                TempTrip.gtfsTripID = reader.GetInt32(0);
                TempTrip.ScheduleID = reader.GetInt32(1);
                TempTrip.BlockID = reader.GetInt32(2);
                //TempTrip.shapeID = Convert.ToInt32(reader[3]);
                TempTrip.StartStopID = reader.GetInt32(4);
                TempTrip.EndStopID = reader.GetInt32(5);
                long TempTicks = reader.GetInt64(6);
                TimeSpan TempTimeSpan = TimeSpan.FromTicks(TempTicks);
                TempTrip.ReleaseTime = TempTimeSpan;
                string stopTimesString = reader.GetString(7);
                TempTrip.StopTimes = Array.ConvertAll(stopTimesString.Split(' '), TimeSpan.Parse).ToList();
                //TempTrip.tripScheduleInfo = gtfsScheduleTable[TempTrip.scheduleID];

                GtfsTripTable.GetOrAdd(TempTrip.gtfsTripID, TempTrip);
                if (GtfsTripIDByScheduleID.ContainsKey(TempTrip.ScheduleID))
                {
                    GtfsTripIDByScheduleID[TempTrip.ScheduleID].Add(TempTrip.gtfsTripID);
                }
                else
                {
                    List<int> TempTripIDs = new List<int>();
                    TempTripIDs.Add(TempTrip.gtfsTripID);
                    GtfsTripIDByScheduleID.Add(TempTrip.ScheduleID, TempTripIDs);
                }

                TempTrip = null;
            }
        }
        #endregion
    } //close fifth Action
    ); //close parallel.invoke
    #endregion

    #region Read GTFS SCHEDULES with stop distances
    //Service Period ID as index for the masterScheduleTable - to get the SSimRouteScheduleRef
    //ConcurrentDictionary<int, SSimScheduleRef> scheduleTableByScheduleID = new ConcurrentDictionary<int, SSimScheduleRef>();//index by schedule id
    GtfsScheduleTable = new ConcurrentDictionary<int, SSimScheduleRef>();//index by schedule id
    GtfsScheduleIDLookupTable = new ConcurrentDictionary<int, SSimServiceScheduleRef>();//construct
    foreach (var servicep in GtfsServicePeriodTable)
    {
        //Dictionary<int, int> TempRouteIDbyScheduleID = new Dictionary<int, int>();
        int currentServiceID = servicep.Value.ServiceID;

        SSimServiceScheduleRef TempRouteScheduleData = new SSimServiceScheduleRef();
        ConcurrentDictionary<int, int> TempScheduleIDByRouteID = new ConcurrentDictionary<int, int>();

        string sql = "select * from schedules where serviceID=" + servicep.Value.ServiceID + " order by serviceID DESC";
        using (SQLiteCommand command = new SQLiteCommand(sql, conn_GTFSdb))
        {
            SQLiteDataReader reader = command.ExecuteReader();
            while (reader.Read())
            {
                SSimScheduleRef TempSingleScheduleData = new SSimScheduleRef();
                int TempScheduleID = reader.GetInt32(0);
                int TempServiceID = reader.GetInt32(1);
                int TempRouteID = reader.GetInt32(2);
                int TempGroupID = reader.GetInt32(3);
                string routestopstring = reader.GetString(4);
                string routestopdiststring = reader.GetString(5);
                string TempShapeID = "" + reader.GetInt32(6);

                //TempRouteIDbyScheduleID.Add(TempScheduleID, TempRouteID);
                TempSingleScheduleData.GroupID = TempGroupID;
                TempSingleScheduleData.RouteID = TempRouteID;
                TempSingleScheduleData.ScheduleID = TempScheduleID;
                TempSingleScheduleData.ServiceID = TempServiceID;
                TempSingleScheduleData.TripIDs = GtfsTripIDByScheduleID[TempSingleScheduleData.ScheduleID];
                TempSingleScheduleData.ShapeID = TempShapeID;
                if (routestopstring != "none")
                {
                    TempSingleScheduleData.StopIDs = Array.ConvertAll(routestopstring.Split(' '), int.Parse).ToList();
                }
                if (routestopdiststring != "none")
                {
                    TempSingleScheduleData.StopDistances = Array.ConvertAll(routestopdiststring.Split(' '), double.Parse).ToList();
                }

                //Exclude global except stops: such as stops that are not real (timing_stop)
                List<int> ExceptionStopIndices = TempSingleScheduleData.StopIDs.GetIntersectIndices(GlobalExceptStops);//(reverse numeric order) remove trip Stop IDs and Stop Distances that are in GlobalExceptStops list
                foreach (int removeIndex in ExceptionStopIndices)
                {
                    TempSingleScheduleData.StopIDs.RemoveAt(removeIndex);
                    TempSingleScheduleData.StopDistances.RemoveAt(removeIndex);
                    foreach (int tripID in TempSingleScheduleData.TripIDs)
                    {
                        GtfsTripTable[tripID].StopTimes.RemoveAt(removeIndex);
                    }
                }

                //Hotfix A: resolve duplicate stop IDs, check and remove them
                List<int> duplicateIndices = TempSingleScheduleData.StopIDs.DuplicateAllButFirstIndices();
                //process duplicates
                for (int i = (duplicateIndices.Count - 1); i >= 0; i--)
                {
                    TempSingleScheduleData.StopIDs.RemoveAt(duplicateIndices[i]);
                    TempSingleScheduleData.StopDistances.RemoveAt(duplicateIndices[i]);
                    //query all trips belonging to this schedule
                    foreach (int tripID in TempSingleScheduleData.TripIDs)
                    {
                        GtfsTripTable[tripID].StopTimes.RemoveAt(duplicateIndices[i]);
                    }
                }

                TempScheduleIDByRouteID.GetOrAdd(TempRouteID, TempScheduleID);//a copy for master schedule table - by service period then route id
                GtfsScheduleTable.GetOrAdd(TempScheduleID, TempSingleScheduleData);//simple schedule table lookup
                TempSingleScheduleData = null;
            }
        }
        TempRouteScheduleData.ServiceID = currentServiceID;
        TempRouteScheduleData.ScheduleIDByRouteID = TempScheduleIDByRouteID;
        //TempRouteScheduleData.routeIDbyScheduleID = TempRouteIDbyScheduleID;
        GtfsScheduleIDLookupTable.GetOrAdd(currentServiceID, TempRouteScheduleData);

        TempRouteScheduleData = null;
        //TempRouteIDbyScheduleID = null;
        TempScheduleIDByRouteID = null;
        //SSimSingleScheduleRef TempRouteScheduleData22 = masterScheduleTable[1].routeSchedule[4];//test if index 4 is route 4 or 5th position-->YES
    }

    //Remove trips that don't exist in schedule and route group
    SSimGTFSTripRef anotherTempTripRef;
    List<int> tripIDsToBeRemoved = new List<int>();
    foreach (int tripID in GtfsTripTable.Keys)
    {
        int scheduleID = GtfsTripTable[tripID].ScheduleID;
        if (!GtfsScheduleTable.ContainsKey(scheduleID))
        {
            tripIDsToBeRemoved.Add(tripID);
        }
        else
        {
            int groupID = GtfsScheduleTable[scheduleID].GroupID;
            if (!GtfsRouteGroupTableByGroupID.ContainsKey(groupID))
            {
                tripIDsToBeRemoved.Add(tripID);
            }
        }
    }
    foreach (int tripIDToRemove in tripIDsToBeRemoved)
    {
        GtfsTripTable.TryRemove(tripIDToRemove, out anotherTempTripRef);
    }

    //Remove zero size schedules and trips
    List<int> scheduleIDsToBeRemoved = new List<int>();
    foreach (int scheduleID in GtfsScheduleTable.Keys)
    {
        if (GtfsScheduleTable[scheduleID].StopIDs.Count == 0)
        {
            scheduleIDsToBeRemoved.Add(scheduleID);
        }
    }
    SSimScheduleRef tempScheduleRef;
    SSimRouteRef tempRouteRef;
    SSimGTFSTripRef tempTripRef;
    foreach (int scheduleIDToRemove in scheduleIDsToBeRemoved)
    {
        foreach (int tripIDToRemove in GtfsScheduleTable[scheduleIDToRemove].TripIDs)
        {
            GtfsTripTable.TryRemove(tripIDToRemove, out tempTripRef);
        }
        int groupID = GtfsScheduleTable[scheduleIDToRemove].GroupID;
        GtfsRouteGroupTableByGroupID[groupID].RouteIDs.Remove(scheduleIDToRemove);
        GtfsRouteGroupTableByGroupID[groupID].RouteInfo.TryRemove(scheduleIDToRemove, out tempRouteRef);
        GtfsScheduleTable.TryRemove(scheduleIDToRemove, out tempScheduleRef);
    }

    //Hotfix B: ensure scheduledStopTimesByServiceID = (Dictionary<int, List<TimeSpan>>) is correct and consistent
    //This hotfix overrides raw stop data from csv and uses the processed trip schedules generated by the gtfsconverter
    //This hotfix is needed because Hotfix A removes some duplicate stops' stop times.
    ConcurrentDictionary<int, Dictionary<int, List<TimeSpan>>> scheduleStopTimesByStopIDByServiceID = new ConcurrentDictionary<int, Dictionary<int, List<TimeSpan>>>();
    foreach (int tripID in GtfsTripTable.Keys)
    {
        int scheduleID = GtfsTripTable[tripID].ScheduleID;
        //foreach (int stopID in gtfsScheduleTable[scheduleID].StopIDs)
        for (int stopIndex = 0; stopIndex < GtfsScheduleTable[scheduleID].StopIDs.Count; stopIndex++)
        {
            //retrieve processed stop id and stop time value
            int stopID = GtfsScheduleTable[scheduleID].StopIDs[stopIndex];
            TimeSpan stopTime = GtfsTripTable[tripID].StopTimes[stopIndex];
            int serviceID = GtfsScheduleTable[scheduleID].ServiceID;
            //add to dictionary
            if (!scheduleStopTimesByStopIDByServiceID.ContainsKey(stopID))
            {
                scheduleStopTimesByStopIDByServiceID.TryAdd(stopID, new Dictionary<int, List<TimeSpan>>());
            }
            if (!scheduleStopTimesByStopIDByServiceID[stopID].ContainsKey(serviceID))
            {
                scheduleStopTimesByStopIDByServiceID[stopID].Add(serviceID, new List<TimeSpan>());
            }
            scheduleStopTimesByStopIDByServiceID[stopID][serviceID].Add(stopTime);
        }
    }
    // update gtfsStopTable object
    foreach (int stopID in scheduleStopTimesByStopIDByServiceID.Keys)
    {
        //sort scheduleStopTimesByStopIDByServiceID object TimeSpan Lists
        foreach (int serviceID in scheduleStopTimesByStopIDByServiceID[stopID].Keys)
        {
            scheduleStopTimesByStopIDByServiceID[stopID][serviceID].Sort();
            //if (gtfsStopTable[stopID].scheduledStopTimesByServiceID[serviceID].Count != scheduleStopTimesByStopIDByServiceID[stopID][serviceID].Count)
            //{
            //    int check = 1;//result of check: changes due to Hotfix A have been addressed
            //}
        }
        // now, update to gtfsStopTable Object
        GtfsStopTable[stopID].scheduledStopTimesByServiceID = scheduleStopTimesByStopIDByServiceID[stopID];
    }

    /* Post Processing of Schedule Data */
    //get the main or longest route within the route group
    //TempRouteSchedule = new ConcurrentDictionary<int, SSimSingleScheduleRef>();
    //ConcurrentDictionary<int, SSimScheduleRef> selectRouteSchedule;//the longest route
    //selectRouteSchedule = new ConcurrentDictionary<int, SSimScheduleRef>();//the longest route
    //TempRouteSchedule = new ConcurrentDictionary<int, SSimSingleScheduleRef>();//the Temp route with index
    //SSimScheduleLookupByServiceRef TempRouteScheduleData = new SSimScheduleLookupByServiceRef();//Temp route

    foreach (var group in GtfsRouteGroupTableByRouteCode)//separate schedule by group then route
    {
        foreach (var servicep in GtfsScheduleIDLookupTable)
        {
            int servID = servicep.Value.ServiceID;

            int maxStopDir0 = 0;
            int maxStopRouteIDDir0 = -1;
            int maxStopDir1 = 0;
            int maxStopRouteIDDir1 = -1;
            //List<char> subRouteCharsDir0 = new List<char>();
            //List<char> subRouteCharsDir1 = new List<char>();
            //List<int> subRouteIDDir0 = new List<int>();
            //List<int> subRouteIDDir1 = new List<int>();
            List<int> allRouteIDDir0 = new List<int>();
            List<int> allRouteIDDir1 = new List<int>();
            List<int> subRouteMaxStopDir0 = new List<int>();
            List<int> subRouteMaxStopDir1 = new List<int>();

            foreach (int route in group.Value.RouteIDs)
            {
                int routeLetterIndex = GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[route].RouteTag.Count() - 1;
                char TempRouteLetter = GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[route].RouteTag[routeLetterIndex];

                if (GtfsScheduleIDLookupTable[servID].ScheduleIDByRouteID.Keys.Contains(route))//check to make sure the route is in the schedule
                {
                    int currentScheduleID = GtfsScheduleIDLookupTable[servID].ScheduleIDByRouteID[route];

                    // find main route with maximum stops - include sub-route if it does have the maximum stops - NOTE: some routes have only sub-routes where others have only main.

                    if ((maxStopDir0 < GtfsScheduleTable[currentScheduleID].StopIDs.Count) && (GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[route].DirTag == 0))//dir 0
                    {
                        maxStopDir0 = GtfsScheduleTable[currentScheduleID].StopIDs.Count;
                        maxStopRouteIDDir0 = route;
                    }
                    else if ((maxStopDir1 < GtfsScheduleTable[currentScheduleID].StopIDs.Count) && (GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[route].DirTag == 1))//dir 1
                    {
                        maxStopDir1 = GtfsScheduleTable[currentScheduleID].StopIDs.Count;
                        maxStopRouteIDDir1 = route;
                    }

                    if (GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[route].DirTag == 0)//dir 0
                    {
                        allRouteIDDir0.Add(route);
                    }
                    else if (GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[route].DirTag == 1)//dir 1
                    {
                        allRouteIDDir1.Add(route);
                    }//end else if

                }//end if route within service p

            }//end foreach route

            //add only if there are stops for the respective schedule
            bool serviceAvail = false;
            if (maxStopRouteIDDir0 != -1)//main route - dir 0
            {
                //int TempScheduleID = GtfsScheduleIDLookupTable[servID].scheduleIDByRouteID[maxStopRouteIDDir0];
                //longestRouteSchedule.GetOrAdd(gtfsScheduleTable[TempScheduleID].routeID, gtfsScheduleTable[TempScheduleID]);

                GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[maxStopRouteIDDir0].Include = true;//update include to model
                GtfsRouteGroupTableByRouteCode[group.Key].mainRouteIDDir0ByServiceP.GetOrAdd(servicep.Key, maxStopRouteIDDir0);
                serviceAvail = true;
            }
            if (maxStopRouteIDDir1 != -1)//main route - dir 1
            {
                //longestRouteSchedule.GetOrAdd(unprocessedScheduleIDLookupTable[servID].scheduleIDByRouteID[maxStopRouteIDDir1].routeID, unprocessedScheduleIDLookupTable[servID].scheduleIDByRouteID[maxStopRouteIDDir1]);

                GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[maxStopRouteIDDir1].Include = true;//update include to model
                GtfsRouteGroupTableByRouteCode[group.Key].mainRouteIDDir1ByServiceP.GetOrAdd(servicep.Key, maxStopRouteIDDir1);
                serviceAvail = true;
            }
            if (allRouteIDDir0.Count != 0)//sub - dir 0
            {
                for (int subIndex = 0; subIndex < allRouteIDDir0.Count; subIndex++)
                {
                    int subRouteID = allRouteIDDir0[subIndex];
                    GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[subRouteID].Include = true;//update include to model
                    if (!subRouteID.Equals(maxStopRouteIDDir0))
                    {
                        string subRouteTag = GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[subRouteID].RouteTag;
                        Tuple<int, string, int> TempSubRouteInfo = new Tuple<int, string, int>(servicep.Key, subRouteTag, subRouteID);
                        GtfsRouteGroupTableByRouteCode[group.Key].subRouteIDDir0.Add(TempSubRouteInfo);
                        serviceAvail = true;
                    }
                }
            }
            if (allRouteIDDir1.Count != 0)//sub - dir 1
            {
                for (int subIndex = 0; subIndex < allRouteIDDir1.Count; subIndex++)
                {
                    int subRouteID = allRouteIDDir1[subIndex];
                    GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[subRouteID].Include = true;//update include to model
                    if (!subRouteID.Equals(maxStopRouteIDDir1))
                    {
                        string subRouteTag = GtfsRouteGroupTableByRouteCode[group.Key].RouteInfo[subRouteID].RouteTag;
                        Tuple<int, string, int> TempSubRouteInfo = new Tuple<int, string, int>(servicep.Key, subRouteTag, subRouteID);
                        GtfsRouteGroupTableByRouteCode[group.Key].subRouteIDDir1.Add(TempSubRouteInfo);
                        serviceAvail = true;
                    }
                }
            }
            //service availability determined
            GtfsRouteGroupTableByRouteCode[group.Key].serviceAvailByServiceP.GetOrAdd(servicep.Key, serviceAvail);

        }//end foreach service p
    }
    #endregion

    source.Close();
}
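The file-to-memory copy at the top of LoadGTFSReferenceData can be isolated as a small reusable pattern. The sketch below is illustrative only and assumes the System.Data.SQLite provider used in the listing; the class and method names are hypothetical, not part of the simulator.

```csharp
using System.Data.SQLite;

static class GtfsDbLoader
{
    // Open a GTFS SQLite file, copy it into an in-memory database, and
    // return the in-memory connection so that subsequent (possibly
    // parallel) reads avoid disk I/O. Note: changes made to the memory
    // copy are NOT written back to the file.
    public static SQLiteConnection LoadToMemory(string dbFilePath)
    {
        using (var source = new SQLiteConnection("Data Source=" + dbFilePath))
        {
            source.Open();
            var memory = new SQLiteConnection("Data Source=:memory:");
            memory.Open();
            // Copy the whole "main" database; -1 copies all pages in one step.
            source.BackupDatabase(memory, "main", "main", -1, null, 0);
            return memory; // caller is responsible for disposing
        }
    }
}
```

As the note in the listing observes, a single shared in-memory connection tends to perform better than per-thread connections once the thread count exceeds the CPU count.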
A.2 Data Processing Algorithm
A.2.1 Data Processing Main Method (ProcessGPSData)
public void ProcessGPSData(int mode, bool printResultLog, bool mergeShortLinks, double minLinkDistForMerge = 1000)
{
    //update status & progress bar based on seconds per hr of data, entire network
    UpdateGUI_StatusBox("Processing data...", 95.0, 700.0);
    //if link objects are loaded and processed, then no need to process any data
    if (LinkObjectTable.Count == 0)
    {
        //DATA PROCESSING: Calculating variable values - stop, stop distances, speed and headways
        bool hasChanges = ScheduleMatchingGPSTripWithGTFS(printResultLog);
        if (hasChanges)
        {
            UpdateGPSPointInDB(GpsPointTable);
            UpdateGPSTripInDB(GpsTripTable);
            SaveSimulatorDbChanges(false);
        }
        hasChanges = DetermineAllFeatureValues(printResultLog);// 1: schedule delay; 2: Speed; 3: Headways
        if (hasChanges)
        {
            UpdateGPSPointInDB(GpsPointTable);
            UpdateGPSTripInDB(GpsTripTable);
            SaveSimulatorDbChanges(false);
        }
        InitializeModelLinksToDb();
        ConcurrentDictionary<int, List<bool>> stopMergeRecordByScheduleID = new ConcurrentDictionary<int, List<bool>>();//for info
        //mergeAllModelLinks(out stopMergeRecordByScheduleID, mergeShortLinks, minLinkDistForMerge);
    }
    SaveSimulatorDbChanges(false);
    //raw and initially processed data tables no longer needed - clear mem
    GpsPointTable.Clear();
    //gpsTripTable.Clear();//do not clear trip table, it will be used later to export data
    //note gtfs and intersection tables are NOT cleared! As they may be needed later
    UpdateGUI_StatusBox("Data processed...", 100.0, 0.0);
    //runtime: full data processing: 90 mins
    GC.Collect();
    GC.WaitForPendingFinalizers();
}
A.2.2 Schedule Matching Method (ScheduleMatchingGPSTripWithGTFS)
private bool ScheduleMatchingGPSTripWithGTFS(bool printComputeResultLog)
{
    StringBuilder tripMatchAnalysis = new StringBuilder();
    tripMatchAnalysis.AppendLine(String.Format("{0},{1},{2},{3},{4},{5},{6}", "distScore", "distToStartStop", "distToEndStop", "gpsTripID", "matchedScheduleID", "numGPSPointsInTrip", "numGTFSStops"));//Header: distScore, distToStartStop, distToEndStop, gpsTripID, matchedScheduleID, numGPSPointsInTrip, numGTFSStops
    if (printComputeResultLog)
    {
        ExportGPSTripMatchCoordinates(GpsTripTable.Values.ToList(), @"ScheduleMatchingResultCoordinates-PreMatch\\");//prints results - no trims
    }

    /* Start Generic Initializations - 101 */
    //Get a list of possible GPS trips indexed for a particular date range
    //NOTE: GPS trips for which a GTFS trip match is found will be deleted from the list
    Dictionary<DateRange, List<int>> workloadScheduleMatch_GPSTripIDsByDateRange = new Dictionary<DateRange, List<int>>();
    List<SSVehTripDataTable> unmatchedTrips = GpsTripTable.Values.Where(v => v.GtfsScheduleID <= 0).ToList();//unmatched results
    List<int> remainingTripIDs = unmatchedTrips.Select(v => v.TripID).ToList();//select only the trips with -1 trip schedule IDs
    //vehGPSTripTable.Keys.ToList();
    if (remainingTripIDs.Count <= 0)
    {
        return false;
    }

    double dateRangeAssignmentToleranceInSecs = 300;//default: 5 mins. Check if dateRanges would overlap with this
    for (int n = 0; n < (AllDateRanges.Count - 1); n++)
    {
        if ((AllDateRanges[n].end.DateTimeToEpochTime() - AllDateRanges[n + 1].start.DateTimeToEpochTime()) > dateRangeAssignmentToleranceInSecs)
        {
            dateRangeAssignmentToleranceInSecs = (AllDateRanges[n].end.DateTimeToEpochTime() - AllDateRanges[n + 1].start.DateTimeToEpochTime());
        }
    }

    foreach (DateRange currentDate in AllDateRanges)
    {
        workloadScheduleMatch_GPSTripIDsByDateRange.Add(currentDate, new List<int>());
        List<int> matchedTripIDs = new List<int>();//for debug and efficiency purposes

        //foreach (int TripID in allTripIDs)
        for (int i = 0; i < remainingTripIDs.Count; i++)
        {
            //assign TripID to a dateRange
            int TripID = remainingTripIDs[i];
            long currentTripStart = GpsTripTable[TripID].startGPSTime;
            if ((currentDate.start.DateTimeToEpochTime() <= (currentTripStart + dateRangeAssignmentToleranceInSecs)) && (currentDate.end.DateTimeToEpochTime() >= (currentTripStart - dateRangeAssignmentToleranceInSecs)))
            {
                workloadScheduleMatch_GPSTripIDsByDateRange[currentDate].Add(TripID);
                remainingTripIDs.RemoveAt(i);
                i--;
            }
        }
    }

    //MAIN TASK OF THE METHOD, iterate through DateRange and then Trip IDs
    foreach (DateRange currentDay in AllDateRanges)
    {
        //find current day service id
        int currentDayServiceID = 3;//default weekday
        DateTime tripStartTime = currentDay.start;
        dayType tripDayType = tripStartTime.getDayType(new dayTypeDefinition());
        foreach (int serviceID in GtfsServicePeriodTable.Keys)
        {
            if (GtfsServicePeriodTable[serviceID].TypeOfDay == tripDayType)
            {
                currentDayServiceID = serviceID;
                break;
            }
        }
        /* End Generic Initializations - 101 */

        //find the list of TripIDs that fall within the current date range
        List<int> TripIDs = workloadScheduleMatch_GPSTripIDsByDateRange[currentDay];
        List<int> TripIDToBeRemoved = new List<int>();
        //List<int> TripIDs = vehGPSTripTable.Keys.ToList();
        foreach (int TripID in TripIDs)
        {
            SSVehTripDataTable aTrip = GpsTripTable[TripID];
            int foundScheduleID = -1;
            //int foundTripID = -1;//finding trip id is not realistic due to constant operational changes

            /*======= SCHEDULE & TRIP MATCHING & TRIP TRIMMING ===========*/
            //search gtfs trips for the same route
            int tripGroupID = GtfsRouteGroupTableByRouteCode[aTrip.RouteCode].GroupID;
            /*Code duplicate starts - Trip Setup*/
            string currentTripRouteCode = aTrip.RouteCode;
            //List<int> possibleRouteIDs = GtfsScheduleIDLookupTable[tripServiceID].ScheduleIDByRouteID.Keys.ToList();//list of route IDs within service period
            List<int> possibleRouteIDs = new List<int>(GtfsRouteGroupTableByRouteCode[currentTripRouteCode].RouteIDs);//search only within the correct route group

            /* Find GtfsScheduleID for vehGPSPointsTable and vehGPSTripTable */
            List<SSimScheduleRef> allSchedules = new List<SSimScheduleRef>();
            foreach (int aRouteID in possibleRouteIDs)
            {
                if (GtfsScheduleIDLookupTable[currentDayServiceID].ScheduleIDByRouteID.Keys.Contains(aRouteID))//make sure the route belongs to the serviceID
                {
                    //int aScheduleID = GtfsScheduleIDLookupTable[tripServiceID].ScheduleIDByRouteID[aRouteID];
                    //allSchedules.Add(gtfsScheduleTable[aScheduleID]);

                    //make sure the schedule is within study period
                    int aScheduleID = GtfsScheduleIDLookupTable[currentDayServiceID].ScheduleIDByRouteID[aRouteID];
                    //if (possibleGTFSScheduleByDateTime[tripStartTime.Date].Contains(aScheduleID))
                    //{
                    allSchedules.Add(GtfsScheduleTable[aScheduleID]);
                    //}
                }
            }

            //1. TRIP MATCHING - find and update the matching GTFS trip ID in TTCGPSTRIP table
            //   TRIP TRIMMING DATA - ideal gps start and gps end are determined during trip matching, indexed by schedule id
            Dictionary<int, int> idealGPSStart = new Dictionary<int, int>();//schedule id, ideal start index
            Dictionary<int, int> idealGPSEnd = new Dictionary<int, int>();//schedule id, ideal end index
            Dictionary<int, List<double>> indivProximityScores = new Dictionary<int, List<double>>();//schedule id, proximity scores for midpoints

            List<Tuple<double, double, double, SSimScheduleRef>> finalMatchedSchedules = new List<Tuple<double, double, double, SSimScheduleRef>>();//for primary search and start stop matching result
            Tuple<double, double, double, SSimScheduleRef> finalSchedule_trimmed;
            //GPS Point Index, Distance between GPS points and GTFS stop, and Schedule
            for (int j = 0; j < allSchedules.Count; j++)//for every possible schedule
            {
                //Use trip trimming data to detect wrong Direction trip matches
                int scheduleDir = GtfsRouteGroupTableByRouteCode[(currentTripRouteCode)].RouteInfo[allSchedules[j].RouteID].DirTag;
                if (scheduleDir == aTrip.Direction)//wrong Direction or non-moving trip
                {
                    //TRIP TRIMMING DATA
                    int bestGPSStartIndex = -1;
                    int bestGPSEndIndex = -1;
                    List<double> currentScheduleProximityScores = (new double[aTrip.GPSIDs.Count - 1]).Select(x => 0.0).ToList(); //new List<double>(aTrip.GPSIDs.Count - 1);//for midpoints ONLY - first and last elements will contain 0 as they are start and end.

                    //DISTANCE METRIC
                    SSimScheduleRef aSchedule = allSchedules[j];
                    int currentScheduleID = allSchedules[j].ScheduleID;
                    string currentShapeID = GtfsScheduleTable[currentScheduleID].ShapeID;
                    //Note: w*|*****x***y*|*z (x and y will both have positive distance to start and end stop whereas w and z will both have negative distances.
"|" are the start and end stops in GTFS shapes) 1514 1515 //Total Length of the Shape belonging to the trip 1516 double totalLengthOfCurrentSchedule = 1517 aSchedule.StopDistances.Last(); 1518 double startLengthOfCurrentSchedule = 1519 aSchedule.StopDistances.First(); 1520 GeoLocation dummyLocForOut = new GeoLocation(); 1521 //Start Stop Matching/Validating 1522 SSVehGPSDataTable startStopGPSPoint = 1523 GpsPointTable[aTrip.GPSIDs.First()]; 1524 double distToStartStop = 1525 AgencyShapeDistFromPointModForGPSPts(out dummyLocForOut, currentShapeID, new 1526 GeoLocation(startStopGPSPoint.Latitude, startStopGPSPoint.Longitude), 0.0, 1527 true) - startLengthOfCurrentSchedule; 1528 //End Stop Matching/Validating 1529
208
SSVehGPSDataTable endStopGPSPoint = 1530 GpsPointTable[aTrip.GPSIDs.Last()]; 1531 double distToEndStop = 1532 totalLengthOfCurrentSchedule - AgencyShapeDistFromPointModForGPSPts(out 1533 dummyLocForOut, currentShapeID, new GeoLocation(endStopGPSPoint.Latitude, 1534 endStopGPSPoint.Longitude), 0.0, true); 1535 //Mid-trip distance offsets - penalize schedule 1536 that doesn't have any overlap or less overlap 1537 double sumMidtripDistDiff = 0;//takes a long time 1538 for (int gpsIDIndex = 1; gpsIDIndex < 1539 (aTrip.GPSIDs.Count - 1); gpsIDIndex++) 1540 { 1541 int gpsID = aTrip.GPSIDs[gpsIDIndex]; 1542 double currentmidTripDist = 1543 ApproxMinDistToShapePath(currentShapeID, new 1544 GeoLocation(GpsPointTable[gpsID].Latitude, GpsPointTable[gpsID].Longitude)); 1545 sumMidtripDistDiff += currentmidTripDist; 1546 1547 currentScheduleProximityScores[gpsIDIndex] = 1548 currentmidTripDist;// Trip Trimming Info 1549 } 1550 //midtripDistDiff = midtripDistDiff / 1551 (aTrip.GPSIDs.Count - 2); 1552 1553 // Trip Trimming Computations 1554 //Note: midpoint distances to start and end stops 1555 are straight line distance evaluations 1556 List<GeoLocation> GPSTripPointLocations = (from r 1557 in aTrip.GPSIDs 1558 select 1559 new GeoLocation(GpsPointTable[r].Latitude, 1560 GpsPointTable[r].Longitude)).ToList(); 1561 GeoLocation startStop = new 1562 GeoLocation(GtfsStopTable[aSchedule.StopIDs.First()].Latitude, 1563 GtfsStopTable[aSchedule.StopIDs.First()].Longitude); 1564 1565 GeoLocation endStop = new 1566 GeoLocation(GtfsStopTable[aSchedule.StopIDs.Last()].Latitude, 1567 GtfsStopTable[aSchedule.StopIDs.Last()].Longitude); 1568 //Write a short method that finds the closest 1569 index, stops once the closest one has been found, rather than keep looking 1570 deeper. Effectively trimming the trips! 
1571 bestGPSStartIndex = 1572 GetClosestIndexToStop(GPSTripPointLocations, startStop, true);//closest point 1573 to Start Stop Location (first hit) 1574 bestGPSEndIndex = 1575 GetClosestIndexToStop(GPSTripPointLocations, endStop, false);//closest point 1576 to End Stop Location (first hit) 1577 1578 //Trip Trimming Result/Data 1579 idealGPSStart.Add(aSchedule.ScheduleID, 1580 bestGPSStartIndex); 1581 idealGPSEnd.Add(aSchedule.ScheduleID, 1582 bestGPSEndIndex); 1583
209
indivProximityScores.Add(aSchedule.ScheduleID, 1584 currentScheduleProximityScores);//save calculation for later 1585 1586 //Add ranking score(s) - without trip trimming 1587 //if (distToStartStop >= 0 && distToEndStop >= 0) 1588 //{ 1589 finalMatchedSchedules.Add(new Tuple<double, 1590 double, double, SSimScheduleRef>(Math.Abs(distToStartStop) + 1591 Math.Abs(distToEndStop) + sumMidtripDistDiff, distToStartStop, distToEndStop, 1592 aSchedule));//add as a possible candidate 1593 } 1594 } 1595 //Scoring: ranked schedules and assign the closest 1596 schedule id to the GPS trip 1597 finalMatchedSchedules.Sort((x, y) => 1598 x.Item1.CompareTo(y.Item1));//combined score-ASC 1599 //http://stackoverflow.com/questions/23991802/c-sharp-1600 tuple-list-multiple-sort 1601 foundScheduleID = 1602 finalMatchedSchedules[0].Item4.ScheduleID; 1603 aTrip.GtfsScheduleID = foundScheduleID; 1604 1605 if (foundScheduleID > 0) 1606 { 1607 //2. TRIP TRIMMING OPERATION 1608 //TRIM the trip based on midPointDistToStartStop & 1609 midPointDistToEndStop 1610 //trim end then start 1611 if (((aTrip.GPSIDs.Count - 1 - 1612 idealGPSEnd[foundScheduleID]) > 0) && (idealGPSEnd[foundScheduleID] > 1613 0))//size of removal > 0 1614 { 1615 1616 aTrip.GPSIDs.RemoveRange(idealGPSEnd[foundScheduleID], aTrip.GPSIDs.Count - 1 1617 - idealGPSEnd[foundScheduleID]); 1618 } 1619 if (idealGPSStart[foundScheduleID] > 0)//size of 1620 removal > 0 1621 { 1622 if (idealGPSStart[foundScheduleID] < 1623 idealGPSEnd[foundScheduleID]) 1624 { 1625 aTrip.GPSIDs.RemoveRange(0, 1626 idealGPSStart[foundScheduleID]); 1627 } 1628 else 1629 { 1630 aTrip.GPSIDs = new List<int>();//trip marked 1631 for removal 1632 } 1633 } 1634 //Process trimmed trip - update some fields 1635 if ((aTrip.GPSIDs.Count > 1))//at least 2 points or 1636 trip is flagged and removed & schedule has been found 1637
210
{ 1638 //Update Key Variables 1639 aTrip.startGPSTime = 1640 GpsPointTable[aTrip.GPSIDs[0]].GPStime;//update GPS start time 1641 tripStartTime = 1642 SSUtil.EpochTimeToUTCDateTime(GpsPointTable[aTrip.GPSIDs[0]].GPStime); 1643 1644 //Recalculate Distance Score 1645 //Total Length of the Shape belonging to the trip 1646 double totalLengthOfSchedule = 1647 GtfsScheduleTable[foundScheduleID].StopDistances.Last(); 1648 double startLengthOfSchedule = 1649 GtfsScheduleTable[foundScheduleID].StopDistances.First(); 1650 GeoLocation dummyLocForOut = new GeoLocation(); 1651 //Start Stop 1652 SSVehGPSDataTable startGPSPoint = 1653 GpsPointTable[aTrip.GPSIDs.First()]; 1654 double newDistToStartStop = 1655 AgencyShapeDistFromPointModForGPSPts(out dummyLocForOut, 1656 GtfsScheduleTable[foundScheduleID].ShapeID, new 1657 GeoLocation(startGPSPoint.Latitude, startGPSPoint.Longitude), 0.0, true) - 1658 startLengthOfSchedule; 1659 //End Stop 1660 SSVehGPSDataTable endGPSPoint = 1661 GpsPointTable[aTrip.GPSIDs.Last()]; 1662 double newDistToEndStop = totalLengthOfSchedule - 1663 AgencyShapeDistFromPointModForGPSPts(out dummyLocForOut, 1664 GtfsScheduleTable[foundScheduleID].ShapeID, new 1665 GeoLocation(endGPSPoint.Latitude, endGPSPoint.Longitude), 0.0, true); 1666 //Mid Stops 1667 double newMiddDistDiff = 0; 1668 for (int midDistIndex = 1669 idealGPSStart[foundScheduleID] + 1; midDistIndex < 1670 idealGPSEnd[foundScheduleID]; midDistIndex++) 1671 { 1672 newMiddDistDiff += 1673 indivProximityScores[foundScheduleID][midDistIndex]; 1674 } 1675 finalSchedule_trimmed = new Tuple<double, double, 1676 double, SSimScheduleRef>(Math.Abs(newDistToStartStop) + 1677 Math.Abs(newDistToEndStop) + newMiddDistDiff, newDistToStartStop, 1678 newDistToEndStop, GtfsScheduleTable[foundScheduleID]); 1679 1680 GpsTripTable[TripID] = aTrip;//save changes to 1681 master table object 1682 1683 if (printComputeResultLog) 1684 { 1685 1686 tripMatchAnalysis.AppendLine(String.Format("{0},{1},{2},{3},{4},{5},{6}", 
1687 finalSchedule_trimmed.Item1, finalSchedule_trimmed.Item2, 1688 finalSchedule_trimmed.Item3, aTrip.TripID, foundScheduleID, 1689 aTrip.GPSIDs.Count, finalSchedule_trimmed.Item4.StopIDs.Count));//Header: 1690 distScore, distToStartStop, distToEndStop, gpsTripID, matchedScheduleID, 1691
211
foundScheduleDelays, numGPSPointsInTrip, numGTFSStops 1692 (distToStartStop, distToEndStop, matchedTripID) 1693 } 1694 } 1695 else 1696 { 1697 if (printComputeResultLog) 1698 { 1699 1700 tripMatchAnalysis.AppendLine(String.Format("{0},{1},{2},{3},{4},{5},{6}", 1701 "unknown", "unknown", "unknown", aTrip.TripID, foundScheduleID, 1702 aTrip.GPSIDs.Count, "noMatch"));//Header: distScore, distToStartStop, 1703 distToEndStop, timeScore, distToStartStop, distToEndStop, gpsTripID, 1704 matchedScheduleID, matchedTripID, numGPSPointsInTrip, numGTFSStops 1705 } 1706 //delete these trips from vehGPSTripTable object 1707 SSVehTripDataTable removedTrip = null; 1708 GpsTripTable.TryRemove(TripID, out removedTrip); 1709 TripIDToBeRemoved.Add(removedTrip.TripID); 1710 //delete these trips from the table database 1711 }//end else 1712 }//end if 1713 }//end foreach trip 1714 DeleteFromTableDatabase("TTCGPSTRIPS", "TripID", 1715 TripIDToBeRemoved); 1716 }//end foreach dateRange 1717 1718 if (printComputeResultLog) 1719 { 1720 File.WriteAllText(LogFileFolder + @"TripMatchingResult.csv", 1721 tripMatchAnalysis.ToString()); 1722 ExportGPSTripMatchCoordinates(GpsTripTable.Values.ToList(), 1723 @"ScheduleMatchingResultCoordinates\\");//prints results 1724 } 1725 return true; 1726 } 1727 private void ExportGPSTripMatchCoordinates(List<SSVehTripDataTable> 1728 allGPSSchedules, string subFolderName) 1729 { 1730 string resultFolderPath = LogFileFolder + subFolderName; 1731 if (!Directory.Exists(resultFolderPath)) 1732 { 1733 Directory.CreateDirectory(resultFolderPath); 1734 } 1735 1736 //string outputFileName = logFileFolder + 1737 @"GPSTripsWithCoodinates.tsv"; 1738 StringBuilder outputText = new StringBuilder(); 1739 1740 foreach (SSVehTripDataTable gpsSchedule in allGPSSchedules) 1741 { 1742 string gps_coordinates = ""; 1743 foreach (int gpsID in gpsSchedule.GPSIDs) 1744 { 1745
212
gps_coordinates = gps_coordinates + 1746 String.Format("{0},{1}\n", GpsPointTable[gpsID].Latitude, 1747 GpsPointTable[gpsID].Longitude); 1748 } 1749 1750 string gtfs_stop_coordinates = ""; 1751 string gtfs_shape_coordinates = ""; 1752 string shapeID = "-1"; 1753 if (gpsSchedule.GtfsScheduleID > 0)// != -1) 1754 { 1755 SSimScheduleRef gtfsSchedule = 1756 GtfsScheduleTable[gpsSchedule.GtfsScheduleID]; 1757 shapeID = gtfsSchedule.ShapeID; 1758 foreach (int stopID in gtfsSchedule.StopIDs) 1759 { 1760 gtfs_stop_coordinates = gtfs_stop_coordinates + 1761 String.Format("{0},{1}\n", GtfsStopTable[stopID].Latitude, 1762 GtfsStopTable[stopID].Longitude); 1763 } 1764 foreach (Tuple<double, double, double> pathPoint in 1765 GtfsShapeTable[gtfsSchedule.ShapeID].Path) 1766 { 1767 gtfs_shape_coordinates = gtfs_shape_coordinates + 1768 String.Format("{0},{1}\n", pathPoint.Item1, pathPoint.Item2); 1769 } 1770 } 1771 1772 outputText.Append(String.Format("\"GPSTripID: 1773 {0}\"\n{1}\n\"GtfsScheduleID: {2}\"\n{3}\n\"GTFSShapeID: {4}\"\n{5}\n", 1774 gpsSchedule.TripID, gps_coordinates, gpsSchedule.GtfsScheduleID, 1775 gtfs_stop_coordinates, shapeID, gtfs_shape_coordinates)); 1776 File.WriteAllText(resultFolderPath + String.Format("{0}-{1}-1777 ", gpsSchedule.TripID, gpsSchedule.GtfsScheduleID) + 1778 "GPSTripsWithCoodinates.txt", outputText.ToString()); 1779 outputText.Clear(); 1780 } 1781 } 1782
1783
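The schedule-matching score used in the listing above is the sum of three distance terms: the absolute offset from the first GPS point to the schedule's first stop, the absolute offset from the last GPS point to the last stop, and the summed mid-trip distances to the shape (which penalizes candidate schedules with little route overlap). The lowest combined score wins. The sketch below illustrates the idea in Python for brevity; the function name and the 1-D distance-along-shape representation are simplifications, not part of the thesis code.

```python
def match_schedule(trip, schedules):
    """Pick the candidate schedule with the lowest combined distance score.

    score = |first GPS point - first stop| + |last GPS point - last stop|
            + sum of each midpoint's distance to its nearest stop.
    Positions are simplified to 1-D distances along the route shape.
    """
    scored = []
    for sched in schedules:
        if sched["direction"] != trip["direction"]:
            continue  # wrong-direction candidates are skipped, as in the listing
        d_start = abs(trip["gps"][0] - sched["stops"][0])
        d_end = abs(trip["gps"][-1] - sched["stops"][-1])
        d_mid = sum(min(abs(p - s) for s in sched["stops"])
                    for p in trip["gps"][1:-1])
        scored.append((d_start + d_end + d_mid, sched["id"]))
    scored.sort()  # ascending combined score
    return scored[0][1] if scored else -1  # -1 mirrors the "no match" sentinel
```

For example, a trip observed at positions [0.1, 5.0, 9.9] scores almost zero against a schedule with stops at [0, 5, 10] and much worse against one with stops at [3, 6, 9], so the first schedule is matched.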
A.2.3 Feature Processing Method (DetermineAllFeatureValues)
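The method listed below matches estimated stop arrivals to GTFS scheduled stop times in five steps (A-1 through A-5, described in its comments). Two of those steps can be sketched compactly: the simple nearest-time assignment (A-1) and the gap filling of missing stop delays from the nearest upstream value, with the first stop looking downstream (part of A-5). The Python functions below are an illustrative simplification, not the thesis implementation; times are plain seconds and missing delays are `None`.

```python
def simple_assignment(observed, scheduled):
    """Step A-1 sketch: match each observed arrival (seconds) to the index
    of the nearest scheduled stop time."""
    return [min(range(len(scheduled)), key=lambda i: abs(scheduled[i] - t))
            for t in observed]

def fill_missing_delays(delays):
    """Step A-5 sketch: a stop with no delay inherits the nearest upstream
    value; the first stop looks downstream instead. Returns -1.0 when no
    valid delay exists anywhere (a 'bad trip')."""
    out = list(delays)
    if out and out[0] is None:
        nxt = next((d for d in out if d is not None), None)
        out[0] = nxt if nxt is not None else -1.0
    for i in range(1, len(out)):
        if out[i] is None:
            out[i] = out[i - 1]  # inherit the upstream delay
    return out
```

So `simple_assignment([60, 350], [0, 300, 600])` matches the observations to schedule indices 0 and 1, and `fill_missing_delays([None, 30, None, 40])` yields [30, 30, 30, 40].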
private bool DetermineAllFeatureValues(bool printComputeResultLog) 1784 { 1785 int numTripAffected = 0; 1786 /*======= I-A1.Initialization of Estimated StopTimes - organize 1787 and process trip information ===========*/ 1788 //find the list of TripIDs that falls within the current date 1789 range 1790 //List<SSVehTripDataTable> allNewTrips = new 1791 List<SSVehTripDataTable>(); 1792 List<int> TripIDToBeRemoved = new List<int>(); 1793 //determine a list of trips needing processing - keep track trip 1794 processing 1795 List<SSVehTripDataTable> allTRIPs_ToBeOrganized = 1796 GpsTripTable.Values.Where(v => v.tripStopIDs == null).ToList(); 1797 numTripAffected += allTRIPs_ToBeOrganized.Count; 1798 List<int> allTripIDs_ToBeOrganized = 1799 allTRIPs_ToBeOrganized.Select(v => v.TripID).ToList(); 1800 allTRIPs_ToBeOrganized.Clear();//clear to min. mem usage 1801 1802 int numTripsRemovedHere = 0; 1803 int numTripsAddedFromSplit = 0; 1804 1805 while (allTripIDs_ToBeOrganized.Count > 0) 1806 { 1807 int TripID = allTripIDs_ToBeOrganized[0]; 1808 List<SSVehTripDataTable> newTripsFromSplit; 1809 bool removeThisTrip = 1810 ProcessAndOrganizeGPSPointsForTrip(TripID, out newTripsFromSplit); 1811 //Note: new trips are added to gpsTripTable() within the 1812 ProcessAndOrganizeGPSPointsForTrip() method. 1813 if (removeThisTrip) 1814 { 1815 TripIDToBeRemoved.Add(TripID);//done processing: deleted 1816 } 1817 if (newTripsFromSplit != null) 1818 { 1819 if (newTripsFromSplit.Count > 0) 1820 { 1821 List<int> newTripIDsFromSplit = (from r in 1822 newTripsFromSplit select r.TripID).ToList(); 1823 numTripsAddedFromSplit += newTripIDsFromSplit.Count; 1824 1825 allTripIDs_ToBeOrganized.AddRange(newTripIDsFromSplit);//additional 1826 processing tasks from split trips 1827 } 1828 } 1829 allTripIDs_ToBeOrganized.Remove(TripID);//done processing: 1830 finished 1831 } 1832 //REMOVE TRIPS 1833 //remove these trips from data object 1834
214
foreach (int removeTripID in TripIDToBeRemoved) 1835 { 1836 //delete these trips from vehGPSTripTable object 1837 SSVehTripDataTable removedTrip = null; 1838 GpsTripTable.TryRemove(removeTripID, out removedTrip); 1839 numTripsRemovedHere++; 1840 } 1841 //delete these trips from the table database 1842 DeleteFromTableDatabase("TTCGPSTRIPS", "TripID", 1843 TripIDToBeRemoved); 1844 TripIDToBeRemoved.Clear(); 1845 1846 /*======= I-A3.Initialization of Estimated StopTimes - compute 1847 then organize by stop id then by trip id ===========*/ 1848 Dictionary<int, Dictionary<int, TimeSpan>> 1849 computedArrivalTimesByStopIDByTripID = new Dictionary<int, Dictionary<int, 1850 TimeSpan>>(); 1851 List<SSVehTripDataTable> allTRIPs_NeedStopTimes = 1852 GpsTripTable.Values.Where(v => v.tripStopArrTimes == null).ToList(); 1853 numTripAffected += allTRIPs_ToBeOrganized.Count; 1854 List<int> allTripIDs_NeedStopTimes = 1855 allTRIPs_NeedStopTimes.Select(v => v.TripID).ToList(); 1856 allTRIPs_NeedStopTimes.Clear(); 1857 foreach (int TripID in GpsTripTable.Keys) 1858 //Parallel.ForEach(allTripIDs_AfterOrganize, (TripID) => 1859 { 1860 Dictionary<int, TimeSpan> TempStopTimesByStopID = new 1861 Dictionary<int, TimeSpan>(); 1862 Dictionary<int, double> TempDwellTimesByStopID; 1863 if (allTripIDs_NeedStopTimes.Contains(TripID))//needs to be 1864 calculated 1865 { 1866 bool complete = 1867 EstimatedArrivalAndDwellAlongScheduleRoute(TripID, out TempStopTimesByStopID, 1868 out TempDwellTimesByStopID);//out List<SSVehGPSDataTable> 1869 if (complete) 1870 { 1871 if 1872 (!computedArrivalTimesByStopIDByTripID.ContainsKey(TripID)) 1873 { 1874 computedArrivalTimesByStopIDByTripID.Add(TripID, 1875 TempStopTimesByStopID); 1876 } 1877 } 1878 else 1879 { 1880 TripIDToBeRemoved.Add(TripID); 1881 //updateGUI_LogBox(String.Format("StopTime 1882 Computation Failed. 
Line {0}, TripID {1} will be removed", "1732", TripID)); 1883 } 1884 } 1885 else//no need to calculate, can be retrived from saved data 1886 { 1887 //construct TempStopTimesByStopID from saved data 1888
215
for (int tripStopIndex = 0; tripStopIndex < 1889 GpsTripTable[TripID].tripStopIDs.Count; tripStopIndex++) 1890 { 1891 int tripStopID = 1892 GpsTripTable[TripID].tripStopIDs[tripStopIndex]; 1893 TimeSpan tripStopTime = 1894 GpsTripTable[TripID].tripStopArrTimes[tripStopIndex]; 1895 TempStopTimesByStopID.Add(tripStopID, tripStopTime); 1896 } 1897 if 1898 (!computedArrivalTimesByStopIDByTripID.ContainsKey(TripID)) 1899 { 1900 computedArrivalTimesByStopIDByTripID.Add(TripID, 1901 TempStopTimesByStopID); 1902 } 1903 } 1904 }//); 1905 1906 //REMOVE TRIPS 1907 //remove these trips from data object 1908 numTripsRemovedHere = 0; 1909 foreach (int removeTripID in TripIDToBeRemoved) 1910 { 1911 //delete these trips from vehGPSTripTable object 1912 SSVehTripDataTable removedTrip = null; 1913 GpsTripTable.TryRemove(removeTripID, out removedTrip); 1914 numTripsRemovedHere++; 1915 } 1916 //delete these trips from the table database 1917 DeleteFromTableDatabase("TTCGPSTRIPS", "TripID", 1918 TripIDToBeRemoved); 1919 TripIDToBeRemoved.Clear(); 1920 1921 ///*======= I-A4. 
Initialization of Estimated StopTimes - Update 1922 GPS Trips in the Data Object and Database ===========*/ 1923 ////update all trips variables in the table database 1924 //updateGPSTripInDB(vehGPSTripTable); 1925 1926 /* Start Generic Initializations - 101 */ 1927 //Get a list of possible GPS trips indexed for a particular date 1928 range 1929 //NOTE: GTFS trip found a GPS trip match will be deleted from 1930 list 1931 Dictionary<DateRange, List<int>> GPSTripIDsByDateRange = new 1932 Dictionary<DateRange, List<int>>(); 1933 List<SSVehTripDataTable> allTRIPs_NeedDelayAndHeadway = 1934 GpsTripTable.Values.Where(v => v.tripEstiHeadways == null).ToList(); 1935 numTripAffected += allTRIPs_NeedDelayAndHeadway.Count; 1936 //allTRIPs_NeedDelayAndHeadway.Select(v => v.TripID).ToList(); 1937 List<int> allTripIDs_NeedDelayAndHeadway = (from t in 1938 allTRIPs_NeedDelayAndHeadway.OrderBy(v => v.startGPSTime) select 1939 t.TripID).ToList();//ensure stoptimes processing is in order of time! 1940 allTRIPs_NeedDelayAndHeadway.Clear(); 1941
216
double dateRangeAssignmentToleranceInSecs = 300;//default: 5mins. 1942 Check if dateRanges would overlapp with this 1943 for (int n = 0; n < (AllDateRanges.Count - 1); n++) 1944 { 1945 if ((AllDateRanges[n].end.DateTimeToEpochTime() - 1946 AllDateRanges[n + 1].start.DateTimeToEpochTime()) > 1947 dateRangeAssignmentToleranceInSecs) 1948 { 1949 dateRangeAssignmentToleranceInSecs = 1950 (AllDateRanges[n].end.DateTimeToEpochTime() - AllDateRanges[n + 1951 1].start.DateTimeToEpochTime()); 1952 } 1953 } 1954 foreach (DateRange currentDate in AllDateRanges) 1955 { 1956 GPSTripIDsByDateRange.Add(currentDate, new List<int>()); 1957 List<int> matchedTripIDs = new List<int>();//for debug and 1958 efficiency purpose 1959 1960 //foreach (int TripID in allTripIDs) 1961 for (int i = 0; i < allTripIDs_NeedDelayAndHeadway.Count; 1962 i++) 1963 { 1964 //assign TripID to a dateRange 1965 int TripID = allTripIDs_NeedDelayAndHeadway[i]; 1966 long currentTripStart = 1967 GpsTripTable[TripID].startGPSTime; 1968 if ((currentDate.start.DateTimeToEpochTime() <= 1969 (currentTripStart + dateRangeAssignmentToleranceInSecs)) && 1970 (currentDate.end.DateTimeToEpochTime() >= (currentTripStart - 1971 dateRangeAssignmentToleranceInSecs))) 1972 { 1973 GPSTripIDsByDateRange[currentDate].Add(TripID); 1974 allTripIDs_NeedDelayAndHeadway.RemoveAt(i); 1975 i--; 1976 } 1977 } 1978 } 1979 1980 //MAIN TASK OF THE MEDTHOD 1981 foreach (DateRange currentDay in AllDateRanges) 1982 { 1983 //find current day service id 1984 int currentDayServiceID = 3;//default weekday 1985 DateTime tripStartTime = currentDay.start; 1986 dayType tripDayType = tripStartTime.getDayType(new 1987 dayTypeDefinition()); 1988 foreach (int serviceID in GtfsServicePeriodTable.Keys) 1989 { 1990 if (GtfsServicePeriodTable[serviceID].TypeOfDay == 1991 tripDayType) 1992 { 1993 currentDayServiceID = serviceID; 1994 break; 1995
217
} 1996 } 1997 /* End Generic Initializations - 101 */ 1998 List<int> currentDayTripIDs = 1999 GPSTripIDsByDateRange[currentDay]; 2000 2001 /*======= I-B2.Initialization of Scheduled StopTimes at Stops 2002 ===========*/ 2003 // initialize GTFS stop schedule within this service period, 2004 for all stops (specific to schedule delay calc) 2005 Dictionary<int, List<TimeSpan>> masterGTFSStopInfo = new 2006 Dictionary<int, List<TimeSpan>>();//index stopID, value stopTimes 2007 long bufferTime = 2 * 60;//2*60 mins buffer (from 2008 experimental estimations, 2 hours buffer will cover the worst of cases) 2009 TimeSpan dayStart = 2010 TimeSpan.FromTicks(currentDay.start.AddMinutes(-2011 bufferTime).ToLocalTime().Subtract(currentDay.start.ToLocalTime().Date).Ticks2012 ); 2013 TimeSpan dayEnd = 2014 TimeSpan.FromTicks(currentDay.end.AddMinutes(bufferTime).ToLocalTime().Subtra2015 ct(currentDay.start.ToLocalTime().Date).Ticks); 2016 foreach (SSimStopRef currentStop in GtfsStopTable.Values) 2017 { 2018 List<TimeSpan> singleStopTimes = new List<TimeSpan>(); 2019 2020 //stopTimes for current day service id only 2021 if 2022 (currentStop.scheduledStopTimesByServiceID.ContainsKey(currentDayServiceID)) 2023 { 2024 //SSimStopRef currentStop = 2025 gtfsStopTable[stopID];//check to make sure this is not a pointer 2026 for (int stIndex = 0; stIndex < 2027 currentStop.scheduledStopTimesByServiceID[currentDayServiceID].Count; 2028 stIndex++) 2029 { 2030 TimeSpan aStopTime = 2031 currentStop.scheduledStopTimesByServiceID[currentDayServiceID][stIndex]; 2032 if ((aStopTime.TotalSeconds >= 2033 dayStart.TotalSeconds) && (aStopTime.TotalSeconds <= dayEnd.TotalSeconds)) 2034 { 2035 singleStopTimes.Add(aStopTime); 2036 } 2037 } 2038 masterGTFSStopInfo.Add(currentStop.StopID, 2039 singleStopTimes); 2040 }//end if 2041 //else do nothing 2042 } 2043 2044 /*======= STOP TIMES CALCULATIONS - Done by 2045 computedArrivalTimesByStopIDByTripID() already ===========*/ 2046 2047 /*======= SPEED 
CALCULATIONS (FOR UTILITY) ===========*/ 2048
218
//TO-DO: technically not need, due to changes to the way data 2049 are processed 2050 for (int tripIndex = 0; tripIndex < currentDayTripIDs.Count; 2051 tripIndex++) 2052 { 2053 //note: assign TimeSpan value in the list to null once 2054 matched to avoid double matching using currentGTFSStopInfo. 2055 int TripID = currentDayTripIDs[tripIndex]; 2056 for (int gpsIndex = 0; gpsIndex < 2057 GpsTripTable[TripID].GPSIDs.Count; gpsIndex++) 2058 { 2059 int gpsID = GpsTripTable[TripID].GPSIDs[gpsIndex]; 2060 double gpsEstiSpeed = -1.0; 2061 if (GpsPointTable[gpsID].EstiAvgSpeed < 0)//check if 2062 it has been calculated already 2063 { 2064 if (gpsIndex > 0 && gpsIndex < 2065 (GpsTripTable[TripID].GPSIDs.Count)) 2066 { 2067 double distToGPS = 2068 GpsPointTable[gpsID].DistFromShapeStart; 2069 long timeToGPS = 2070 GpsPointTable[gpsID].GPStime; 2071 int prevGPSID = 2072 GpsTripTable[TripID].GPSIDs[gpsIndex - 1]; 2073 double distToPrevGPS = 2074 GpsPointTable[prevGPSID].DistFromShapeStart; 2075 long timeToPrevGPS = 2076 GpsPointTable[prevGPSID].GPStime; 2077 gpsEstiSpeed = ((distToGPS - distToPrevGPS) / 2078 (timeToGPS - timeToPrevGPS) * 3.6); 2079 } 2080 else if (gpsIndex == 0) 2081 { 2082 //assume at start of trip, speed is 0 2083 (vehicle generation) 2084 gpsEstiSpeed = 0.0; 2085 } 2086 2087 if (gpsEstiSpeed.Equals(-1.0)) 2088 { 2089 int debug = 0; 2090 } 2091 GpsPointTable[gpsID].EstiAvgSpeed = 2092 Math.Round(gpsEstiSpeed, 1); 2093 } 2094 } 2095 } 2096 /*======= SCHEDULE DELAY CALCULATIONS - A. Matching schedule 2097 stop times with estimated stop times (in 5 smaller steps) ===========*/ 2098 //Previous to this, a list of estimated stopTimes at each 2099 stop of each trip are computed using ProcessAndOrganizeGPSPointsForTrip 2100 (processing) and EstimatedStopTimesAlongScheduleRoute(compute) 2101
219
//Step A-1: simple assignment of observed stop time to 2102 schedule stop time based on lowest time difference 2103 //Step A-2: reverse assignment to reassign observed stop 2104 times with conflicted matches 2105 //Step A-3: Missed Stop Delay Penalization to reassign match 2106 based on missed stop. If stop seq is 1,2,5,6,7, revised is 1,2,3,6,7, where 2107 the delay due to missed stops is penalized. 2108 //Step A-4: backward construction of schedule versus 2109 estimated stopTimes. Based on the fixed match in A-3, construct final actual 2110 and schedule stop time match 2111 //Step A-5. Processing matches for stopTimes final match 2112 results - calculate delay at stops 2113 2114 //A-1: Simple Assignment 2115 Dictionary<int, List<Tuple<int, TimeSpan, int>>> 2116 stopTimesIndexMatchWithTripID_ByStopID = new Dictionary<int, List<Tuple<int, 2117 TimeSpan, int>>>();//Tuple contains stopTimes Index from masterGTFSStopInfo, 2118 estimated/calculated stopTime estimate and TripID - Used also by Headway 2119 Calc! 2120 for (int tripIndex = 0; tripIndex < currentDayTripIDs.Count; 2121 tripIndex++) 2122 { 2123 //note: assign TimeSpan value in the list to null once 2124 matched to avoid double matching using currentGTFSStopInfo. 
2125 int TripID = currentDayTripIDs[tripIndex]; 2126 Dictionary<int, Tuple<TimeSpan, TimeSpan>> 2127 Temp_stopTimesMatch_ByStopID = new Dictionary<int, Tuple<TimeSpan, 2128 TimeSpan>>(); 2129 if 2130 (computedArrivalTimesByStopIDByTripID.ContainsKey(TripID)) 2131 { 2132 foreach (int stopID in 2133 GpsTripTable[TripID].tripStopIDs)//note: stops are in order 2134 { 2135 //determine if this trip and stop served multiple 2136 stop times 2137 TimeSpan actualEstiStopTime = 2138 computedArrivalTimesByStopIDByTripID[TripID][stopID]; 2139 //organize computed tuples used for matching 2140 Tuple<int, TimeSpan, int> simpleMatchTuple; 2141 if (masterGTFSStopInfo[stopID].Count > 0) 2142 { 2143 //delay = delay to current stop + missing 2144 stopTimes from previous trips 2145 List<double> delays = (from r in 2146 masterGTFSStopInfo[stopID] select Math.Abs(r.TotalSeconds - 2147 actualEstiStopTime.TotalSeconds)).ToList(); 2148 int stopTimeMatchedIndex = 2149 delays.IndexOf(delays.Min()); 2150 simpleMatchTuple = new Tuple<int, TimeSpan, 2151 int>(stopTimeMatchedIndex, actualEstiStopTime, TripID); 2152 } 2153 else 2154 { 2155
220
simpleMatchTuple = new Tuple<int, TimeSpan, 2156 int>(-1, actualEstiStopTime, TripID);//no schedule - delay will be based on 2157 previous stop if available (later) 2158 } 2159 2160 //add to stopTimesIndexMatchWithTripID_ByStopID 2161 if 2162 (stopTimesIndexMatchWithTripID_ByStopID.ContainsKey(stopID)) 2163 { 2164 2165 stopTimesIndexMatchWithTripID_ByStopID[stopID].Add(simpleMatchTuple); 2166 } 2167 else 2168 { 2169 2170 stopTimesIndexMatchWithTripID_ByStopID.Add(stopID, new List<Tuple<int, 2171 TimeSpan, int>>()); 2172 2173 stopTimesIndexMatchWithTripID_ByStopID[stopID].Add(simpleMatchTuple); 2174 } 2175 }//end foreach 2176 }//end if 2177 }//end for 2178 Dictionary<int, Dictionary<int, Tuple<TimeSpan, TimeSpan>>> 2179 scheduleVsEstimatedStopTimes_ByTripIDByStopID = new Dictionary<int, 2180 Dictionary<int, Tuple<TimeSpan, TimeSpan>>>(); 2181 foreach (int stopID in 2182 stopTimesIndexMatchWithTripID_ByStopID.Keys) 2183 { 2184 //A-2. Reverse Reassignment 2185 // sort stop times matches so duplicated matches are 2186 taken care of: sort by time 2187 stopTimesIndexMatchWithTripID_ByStopID[stopID].Sort((x, 2188 y) => x.Item2.CompareTo(y.Item2)); 2189 // Revise using Backward check for repeated index: check 2190 and see if the list result needs to be revised - no index repeated, if so, 2191 revised according to the TimeSpan values 2192 int prevStopTimesIndex; 2193 int nextStopTimesIndex; 2194 List<int> stopTimesMatchIndexList = (from listItem in 2195 stopTimesIndexMatchWithTripID_ByStopID[stopID] select 2196 listItem.Item1).ToList(); 2197 for (int tupleListIndex = (stopTimesMatchIndexList.Count 2198 - 2); tupleListIndex >= 1; tupleListIndex--)//first and last are not changed, 2199 as they are ref points 2200 { 2201 int TripID = 2202 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item3; 2203 prevStopTimesIndex = 2204 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex + 1].Item1; 2205 nextStopTimesIndex = 2206 
stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex - 1].Item1; 2207 int currentStopTimesIndex = 2208 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item1; 2209
221
//decrease current matched stop index by one only if: 2210 last stop index value is the same as current, next stop index value is not 2211 the same as current, and the (current-1) stop index is not present in the 2212 list, then make current = (current-1) 2213 if (currentStopTimesIndex == prevStopTimesIndex && 2214 currentStopTimesIndex > nextStopTimesIndex && 2215 !stopTimesMatchIndexList.Contains(currentStopTimesIndex - 1)) 2216 { 2217 Tuple<int, TimeSpan, int> revisedTuple = new 2218 Tuple<int, TimeSpan, int>((currentStopTimesIndex - 1), 2219 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item2, 2220 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item3);//match 2221 with a stop index one previous to 2222 stopTimesMatchIndexList[tupleListIndex] = 2223 (currentStopTimesIndex - 1); 2224 2225 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex] = 2226 revisedTuple;//change previous result 2227 } 2228 } 2229 //compute scheduleVsEstimatedStopTimes_ByTripIDByStopID 2230 stopTimesMatchIndexList = (from listItem in 2231 stopTimesIndexMatchWithTripID_ByStopID[stopID] select 2232 listItem.Item1).ToList();//list must be updated 2233 int minIndexBound = stopTimesMatchIndexList.Min();//first 2234 trip's stop is a reference, do not decrease below that. 2235 for (int tupleListIndex = (stopTimesMatchIndexList.Count 2236 - 1); tupleListIndex >= 1; tupleListIndex--) 2237 { 2238 int currentStopTimesIndexMatch = 2239 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item1; 2240 int scheduleStopTimesIndex = 2241 currentStopTimesIndexMatch;//may be adjusted 2242 2243 //A-3. 
Missed Stop Delay Penalization 2244 //trying to find the earliest schedule stopTime that 2245 was not served - final fix of schedule stop time match 2246 while 2247 (!stopTimesMatchIndexList.Contains(scheduleStopTimesIndex - 1) && 2248 (scheduleStopTimesIndex - 1) > minIndexBound) 2249 { 2250 scheduleStopTimesIndex--; 2251 } 2252 if (scheduleStopTimesIndex != 2253 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item1)//If 2254 changed by the above while, update tuple data for record. 2255 { 2256 Tuple<int, TimeSpan, int> revisedTuple = new 2257 Tuple<int, TimeSpan, int>((scheduleStopTimesIndex), 2258 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item2, 2259 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item3);//match 2260 with a stop index one previous to 2261
222
2262 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex] = 2263 revisedTuple;//change previous result 2264 } 2265 }//end for tuple matches 2266 for (int tupleListIndex = 0; tupleListIndex < 2267 stopTimesIndexMatchWithTripID_ByStopID[stopID].Count; tupleListIndex++) 2268 { 2269 int gpsTripID = 2270 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item3; //final 2271 trip id 2272 TimeSpan estimatedStopTime = 2273 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item2; //final 2274 actual stop time 2275 int scheduleStopTimesIndex = 2276 stopTimesIndexMatchWithTripID_ByStopID[stopID][tupleListIndex].Item1; 2277 TimeSpan scheduledStopTime;//final matched schedule 2278 stop time 2279 if (scheduleStopTimesIndex >= 0 && 2280 scheduleStopTimesIndex < masterGTFSStopInfo[stopID].Count) 2281 { 2282 scheduledStopTime = 2283 masterGTFSStopInfo[stopID][scheduleStopTimesIndex]; 2284 } 2285 else 2286 { 2287 scheduledStopTime = new TimeSpan(7, 0, 0, 0, 0); 2288 //7 day as void type 2289 } 2290 //A-4: forward construction of schedule versus 2291 estimated stopTimes. 2292 //= 2293 masterGTFSStopInfo[stopID][scheduleStopTimesIndex]; 2294 //start of: 2295 scheduleVsEstimatedStopTimes_ByTripIDByStopID addition 2296 if 2297 (!scheduleVsEstimatedStopTimes_ByTripIDByStopID.ContainsKey(gpsTripID)) 2298 { 2299 2300 scheduleVsEstimatedStopTimes_ByTripIDByStopID.Add(gpsTripID, new 2301 Dictionary<int, Tuple<TimeSpan, TimeSpan>>()); 2302 } 2303 if 2304 (!scheduleVsEstimatedStopTimes_ByTripIDByStopID[gpsTripID].ContainsKey(stopID2305 )) 2306 { 2307 2308 scheduleVsEstimatedStopTimes_ByTripIDByStopID[gpsTripID].Add(stopID, new 2309 Tuple<TimeSpan, TimeSpan>(scheduledStopTime, estimatedStopTime)); 2310 } 2311 }//end for tuple matches 2312 stopTimesMatchIndexList.Clear();//done using 2313 }//end for stops 2314 2315
    //A-5: finds the delay at stops
    for (int tripIndex = 0; tripIndex < currentDayTripIDs.Count; tripIndex++)
    {
        int TripID = currentDayTripIDs[tripIndex];
        //uses scheduleVsEstimatedStopTimes_ByTripIDByStopID to find estimatedDelay_ByStopID
        Dictionary<int, double> estimatedDelay_ByStopID = new Dictionary<int, double>();//in secs - processed delay
        double maxTimeVal = 2 * 24 * 3600;
        //double averageDelay = 0.0;
        double lastDelay = maxTimeVal;
        //foreach (int stopID in vehGPSTripTable[TripID].tripStopIDs)//note: stops are in order
        for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
        {
            int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
            double currentDelay = maxTimeVal;
            double scheduleTime = scheduleVsEstimatedStopTimes_ByTripIDByStopID[TripID][stopID].Item1.TotalSeconds;
            double estimatedTime = scheduleVsEstimatedStopTimes_ByTripIDByStopID[TripID][stopID].Item2.TotalSeconds;
            if (scheduleTime >= maxTimeVal)//should never happen; if it does, an error occurred for this data
            {
                if (lastDelay < maxTimeVal)
                {
                    currentDelay = lastDelay;
                }
                else
                {
                    currentDelay = maxTimeVal;
                }
            }
            else
            {
                currentDelay = estimatedTime - scheduleTime;
            }

            if (currentDelay < maxTimeVal)
            {
                //averageDelay = (averageDelay * stopIDIndex + currentDelay) / (stopIDIndex + 1);
                lastDelay = currentDelay;
            }
            estimatedDelay_ByStopID[stopID] = currentDelay;
        }//end for

        //process stops with no delay value
        double upstreamNearbyDelay = maxTimeVal;
        for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
        {
            int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];

            if (stopIDIndex == 0 && (estimatedDelay_ByStopID[stopID] >= maxTimeVal))//has no upstream, look downstream
            {
                int searchStopIDIndex = stopIDIndex;
                bool foundVal = false;
                while (!foundVal && searchStopIDIndex < GpsTripTable[TripID].tripStopIDs.Count)
                {
                    int searchStopID = GpsTripTable[TripID].tripStopIDs[searchStopIDIndex];
                    if (estimatedDelay_ByStopID[searchStopID] < maxTimeVal)
                    {
                        estimatedDelay_ByStopID[stopID] = estimatedDelay_ByStopID[searchStopID];
                        foundVal = true;
                    }
                    searchStopIDIndex++;
                }

                if (!foundVal)
                {
                    estimatedDelay_ByStopID[stopID] = -1.0;//bad trip, no delay can be calculated
                }
            }
            else if (estimatedDelay_ByStopID[stopID] >= maxTimeVal)
            {
                estimatedDelay_ByStopID[stopID] = upstreamNearbyDelay;
            }
            upstreamNearbyDelay = estimatedDelay_ByStopID[stopID];
        }//end for stop

        //check and validate delay values - stops may be missing schedule stop times (if exceed studyDurationTolerance)
        double studyDurationTolerance = 3600;
        //(currentDay.getDurationHours() - (includeOneHourWarmUp ? 1 : 0)) * 3600;//too insensitive
        List<double> validDelays = estimatedDelay_ByStopID.Values.Where(o => o < studyDurationTolerance).Where(o => o > -studyDurationTolerance).ToList();
        if (validDelays.Count > 0)
        {
            double avgValidDelays = validDelays.Average();
            double lastValidDelay = double.MinValue;
            //process stops with missing schedule stop times - forward pass
            for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
            {
                int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
                if (Math.Abs(estimatedDelay_ByStopID[stopID] - avgValidDelays) > studyDurationTolerance)
                {
                    if (lastValidDelay > -100000)//if it has been assigned
                    {
                        estimatedDelay_ByStopID[stopID] = lastValidDelay;
                    }
                }
                else
                {
                    lastValidDelay = estimatedDelay_ByStopID[stopID];
                }
            }
            //process stops with missing schedule stop times - backward pass
            for (int stopIDIndex = (GpsTripTable[TripID].tripStopIDs.Count - 1); stopIDIndex >= 0; stopIDIndex--)
            {
                int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];

                if (Math.Abs(estimatedDelay_ByStopID[stopID] - avgValidDelays) > studyDurationTolerance)
                {
                    if (lastValidDelay > -100000)//if it has been assigned
                    {
                        estimatedDelay_ByStopID[stopID] = lastValidDelay;
                    }
                }
                else
                {
                    lastValidDelay = estimatedDelay_ByStopID[stopID];
                }
            }
        }

        //get tripEstiDelays matrix for the trip
        List<double> thisTripEstiDelays = new List<double>();
        for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
        {
            int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
            thisTripEstiDelays.Add(estimatedDelay_ByStopID[stopID]);
        }//end for stop
        //UPDATE GPS TRIP VARIABLE
        GpsTripTable[TripID].tripEstiDelays = new List<double>(thisTripEstiDelays);
        List<double> tempList = GpsTripTable[TripID].tripEstiDelays.Where(v => v == maxTimeVal).ToList();//unmatched results
        if (tempList.Count > 1)
        {
            if (!TripIDToBeRemoved.Contains(TripID))
            {
                TripIDToBeRemoved.Add(TripID);
            }
        }
    }//end for trip

    /*======= OBSERVED HEADWAY CALCULATIONS (AT STOPS) ===========*/
    //A-1: sort stopTimesIndexMatchWithTripID_ByStopID object into workable Dictionaries
    Dictionary<int, Dictionary<int, double>> EstiHeadwayByTripIDByStopID = new Dictionary<int, Dictionary<int, double>>();
    foreach (int stopID in stopTimesIndexMatchWithTripID_ByStopID.Keys)
    {
        //foreach (Tuple<int, TimeSpan, int> stopTimesIndexMatchTuple in stopTimesIndexMatchWithTripID_ByStopID[stopID])
        if (stopTimesIndexMatchWithTripID_ByStopID[stopID].Count > 1)
        {
            for (int TupleIndex = 0; TupleIndex < stopTimesIndexMatchWithTripID_ByStopID[stopID].Count; TupleIndex++)
            {
                //Headway: difference in time between prior arrival and current arrival at a stop.
                //current arrival
                Tuple<int, TimeSpan, int> curStopTimesIndexMatchTuple = stopTimesIndexMatchWithTripID_ByStopID[stopID][TupleIndex];
                //previous arrival, or next arrival for first trip at stop
                Tuple<int, TimeSpan, int> nextStopTimesIndexMatchTuple = TupleIndex > 0 ?
                    stopTimesIndexMatchWithTripID_ByStopID[stopID][TupleIndex - 1] :
                    stopTimesIndexMatchWithTripID_ByStopID[stopID][TupleIndex + 1];
                //TupleIndex < (stopTimesIndexMatchWithTripID_ByStopID[stopID].Count - 1) ? stopTimesIndexMatchWithTripID_ByStopID[stopID][TupleIndex + 1] : stopTimesIndexMatchWithTripID_ByStopID[stopID][TupleIndex - 1];
                int curTripID = curStopTimesIndexMatchTuple.Item3;
                //A-2: Calculate headway
                double estiHeadwayFromStopTimesInSecs = Math.Abs(curStopTimesIndexMatchTuple.Item2.TotalSeconds - nextStopTimesIndexMatchTuple.Item2.TotalSeconds);
                //A-3: Store calculated headway
                if (!EstiHeadwayByTripIDByStopID.ContainsKey(curTripID))
                {
                    EstiHeadwayByTripIDByStopID.Add(curTripID, new Dictionary<int, double>());
                }
                if (!EstiHeadwayByTripIDByStopID[curTripID].ContainsKey(stopID))
                {
                    EstiHeadwayByTripIDByStopID[curTripID].Add(stopID, estiHeadwayFromStopTimesInSecs);
                }
            }
        }
        //else do not attempt to calculate - rely on interpolations
    }//end foreach stop (trip loop not needed for this comp)
    //B-1: Compute and interpolate all observed headway values for trips' stops, aka process stops with no headway value
    for (int tripIndex = 0; tripIndex < currentDayTripIDs.Count; tripIndex++)
    {
        int TripID = currentDayTripIDs[tripIndex];
        if (EstiHeadwayByTripIDByStopID.ContainsKey(TripID))
        {
            //uses scheduleVsEstimatedStopTimes_ByTripIDByStopID to find estimatedDelay_ByStopID
            //Dictionary<int, double> EstimatedHeadway_ByStopID = EstiHeadwayByTripIDByStopID[TripID];//in secs - processed delay
            for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
            {
                int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
                if (!EstiHeadwayByTripIDByStopID[TripID].ContainsKey(stopID))//attempt to interpolate
                {
                    int downstreamSearchi = stopIDIndex < (GpsTripTable[TripID].tripStopIDs.Count - 1) ? stopIDIndex + 1 : stopIDIndex - 1;
                    int upstreamSearchj = stopIDIndex > 0 ? stopIDIndex - 1 : stopIDIndex + 1;
                    int nearbyStopID_DOWNSTREAM = -1;
                    int nearbyStopID_UPSTREAM = -1;
                    //try find downstream
                    while (nearbyStopID_DOWNSTREAM == -1 && downstreamSearchi <= (GpsTripTable[TripID].tripStopIDs.Count - 1))
                    {
                        if (EstiHeadwayByTripIDByStopID[TripID].ContainsKey(GpsTripTable[TripID].tripStopIDs[downstreamSearchi]))
                        {
                            nearbyStopID_DOWNSTREAM = GpsTripTable[TripID].tripStopIDs[downstreamSearchi];
                        }
                        else
                        {
                            downstreamSearchi++;
                        }
                    }
                    //try find upstream
                    while (nearbyStopID_UPSTREAM == -1 && upstreamSearchj >= 0)
                    {
                        if (EstiHeadwayByTripIDByStopID[TripID].ContainsKey(GpsTripTable[TripID].tripStopIDs[upstreamSearchj]))
                        {
                            nearbyStopID_UPSTREAM = GpsTripTable[TripID].tripStopIDs[upstreamSearchj];
                        }
                        else
                        {
                            upstreamSearchj--;
                        }
                    }
                    //both found, one or the other found, or none found
                    if (nearbyStopID_DOWNSTREAM >= 0 && nearbyStopID_UPSTREAM >= 0)
                    {
                        double targetStopDist = GpsTripTable[TripID].tripStopDistances[stopIDIndex];
                        double downstreamDist = GpsTripTable[TripID].tripStopDistances[downstreamSearchi];
                        double upstreamDist = GpsTripTable[TripID].tripStopDistances[upstreamSearchj];
                        double downstreamHW = EstiHeadwayByTripIDByStopID[TripID][nearbyStopID_DOWNSTREAM];
                        double upstreamHW = EstiHeadwayByTripIDByStopID[TripID][nearbyStopID_UPSTREAM];

                        double interpolatedHeadway = ((downstreamHW - upstreamHW) == 0 || (downstreamDist - upstreamDist) == 0 || (targetStopDist - upstreamDist) < 0) ? (upstreamHW) : (upstreamHW + (downstreamHW - upstreamHW) / (downstreamDist - upstreamDist) * (targetStopDist - upstreamDist));
                        EstiHeadwayByTripIDByStopID[TripID][stopID] = interpolatedHeadway;
                    }
                    else if (nearbyStopID_DOWNSTREAM >= 0)
                    {
                        EstiHeadwayByTripIDByStopID[TripID][stopID] = EstiHeadwayByTripIDByStopID[TripID][nearbyStopID_DOWNSTREAM];
                    }
                    else if (nearbyStopID_UPSTREAM >= 0)
                    {
                        EstiHeadwayByTripIDByStopID[TripID][stopID] = EstiHeadwayByTripIDByStopID[TripID][nearbyStopID_UPSTREAM];
                    }
                    //if none of the above conditions can be met, this stop will be missed => the headway computation for this trip would be invalid and removed.
                }//end if contains key stopid
            }//end for stop
            //C-1: Assign final headway, post-processed
            //get tripEstiHeadways matrix for the trip
            List<double> thisTripEstiHeadways = new List<double>();
            if (GpsTripTable[TripID].tripStopIDs.Count != EstiHeadwayByTripIDByStopID[TripID].Count)
            {
                TripIDToBeRemoved.Add(TripID);//remove trips that can't be processed
            }
            else
            {
                for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
                {
                    int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
                    thisTripEstiHeadways.Add(Math.Round(EstiHeadwayByTripIDByStopID[TripID][stopID], 1));
                }//end for stop
                //UPDATE GPS TRIP VARIABLE
                GpsTripTable[TripID].tripEstiHeadways = new List<double>(thisTripEstiHeadways);
                //delete criteria
                List<double> tempList = new List<double>();
                tempList.AddRange(GpsTripTable[TripID].tripEstiHeadways.Where(v => v >= currentDay.getDurationSeconds()).ToList());//invalid value
                tempList.AddRange(GpsTripTable[TripID].tripEstiDelays.Where(v => v >= currentDay.getDurationSeconds()).ToList());//invalid value
                if (tempList.Count > 1)
                {
                    if (!TripIDToBeRemoved.Contains(TripID))
                    {
                        TripIDToBeRemoved.Add(TripID);
                    }
                }
            }
        }//end if contains key TripID; if this cannot be met => the headway computation for this trip would be invalid and removed.
    }//end for trip

    /*======= SCHEDULED HEADWAY CALCULATIONS (AT STOPS) ===========*/
    //A-1: process all the matching schedule index
    Dictionary<int, Dictionary<int, double>> ScheduleHeadwayByTripIDByStopID = new Dictionary<int, Dictionary<int, double>>();
    foreach (int stopID in stopTimesIndexMatchWithTripID_ByStopID.Keys)
    {
        if (stopTimesIndexMatchWithTripID_ByStopID[stopID].Count > 1)
        {
            for (int TupleIndex = 0; TupleIndex < stopTimesIndexMatchWithTripID_ByStopID[stopID].Count; TupleIndex++)
            {
                //Schedule Headway: difference in time between current arrival and prior arrival at a stop, as scheduled.
                //current trip
                int curTripID = stopTimesIndexMatchWithTripID_ByStopID[stopID][TupleIndex].Item3;
                //current index
                int curStopTimeIndex = stopTimesIndexMatchWithTripID_ByStopID[stopID][TupleIndex].Item1;
                ////previous index, or next if this is the first stop time (less likely)
                //int prevStopTimeIndex = curStopTimeIndex > 0 ? curStopTimeIndex - 1 : curStopTimeIndex + 1;

                //only attempt to calculate if matches exist
                if (curStopTimeIndex >= 0)// && prevStopTimeIndex >= 0)
                {
                    //Adjust indices to take the average headway for 11 arrivals, i.e. 10 intervals, over that time period.
                    int maxIndex = masterGTFSStopInfo[stopID].Count - 1;
                    int earlierStopTimeIndex = Math.Max(curStopTimeIndex - 5, 0);
                    int laterStopTimeIndex = Math.Min(curStopTimeIndex + 5, maxIndex);

                    //A-2: Calculate schedule headway
                    double scheduleHeadwayInSecs = Math.Abs(masterGTFSStopInfo[stopID][laterStopTimeIndex].TotalSeconds - masterGTFSStopInfo[stopID][earlierStopTimeIndex].TotalSeconds) / (laterStopTimeIndex - earlierStopTimeIndex);

                    //A-3: Store calculated headway
                    if (!ScheduleHeadwayByTripIDByStopID.ContainsKey(curTripID))
                    {
                        ScheduleHeadwayByTripIDByStopID.Add(curTripID, new Dictionary<int, double>());
                    }
                    if (!ScheduleHeadwayByTripIDByStopID[curTripID].ContainsKey(stopID))
                    {
                        ScheduleHeadwayByTripIDByStopID[curTripID].Add(stopID, scheduleHeadwayInSecs);
                    }
                }
            }
        }
        //else do not attempt to calculate - rely on interpolations
    }//end foreach stop (trip loop not needed for this comp)
    //B-1: Compute and interpolate all schedule headway values for trips' stops, aka process stops with no headway value
    for (int tripIndex = 0; tripIndex < currentDayTripIDs.Count; tripIndex++)
    {
        int TripID = currentDayTripIDs[tripIndex];
        if (ScheduleHeadwayByTripIDByStopID.ContainsKey(TripID))
        {
            for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
            {
                int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
                if (!ScheduleHeadwayByTripIDByStopID[TripID].ContainsKey(stopID))//attempt to interpolate
                {
                    int downstreamSearchi = stopIDIndex < (GpsTripTable[TripID].tripStopIDs.Count - 1) ? stopIDIndex + 1 : stopIDIndex - 1;
                    int upstreamSearchj = stopIDIndex > 0 ? stopIDIndex - 1 : stopIDIndex + 1;
                    int nearbyStopID_DOWNSTREAM = -1;
                    int nearbyStopID_UPSTREAM = -1;
                    //try find downstream
                    while (nearbyStopID_DOWNSTREAM == -1 && downstreamSearchi <= (GpsTripTable[TripID].tripStopIDs.Count - 1))
                    {
                        if (ScheduleHeadwayByTripIDByStopID[TripID].ContainsKey(GpsTripTable[TripID].tripStopIDs[downstreamSearchi]))
                        {
                            nearbyStopID_DOWNSTREAM = GpsTripTable[TripID].tripStopIDs[downstreamSearchi];
                        }
                        else
                        {
                            downstreamSearchi++;
                        }
                    }
                    //try find upstream
                    while (nearbyStopID_UPSTREAM == -1 && upstreamSearchj >= 0)
                    {
                        if (ScheduleHeadwayByTripIDByStopID[TripID].ContainsKey(GpsTripTable[TripID].tripStopIDs[upstreamSearchj]))
                        {
                            nearbyStopID_UPSTREAM = GpsTripTable[TripID].tripStopIDs[upstreamSearchj];
                        }
                        else
                        {
                            upstreamSearchj--;
                        }
                    }
                    //both found, one or the other found, or none found
                    if (nearbyStopID_DOWNSTREAM >= 0 && nearbyStopID_UPSTREAM >= 0)
                    {
                        double targetStopDist = GpsTripTable[TripID].tripStopDistances[stopIDIndex];
                        double downstreamDist = GpsTripTable[TripID].tripStopDistances[downstreamSearchi];
                        double upstreamDist = GpsTripTable[TripID].tripStopDistances[upstreamSearchj];
                        double downstreamHW = ScheduleHeadwayByTripIDByStopID[TripID][nearbyStopID_DOWNSTREAM];
                        double upstreamHW = ScheduleHeadwayByTripIDByStopID[TripID][nearbyStopID_UPSTREAM];

                        double interpolatedHeadway = ((downstreamHW - upstreamHW) == 0 || (downstreamDist - upstreamDist) == 0 || (targetStopDist - upstreamDist) < 0) ? (upstreamHW) : (upstreamHW + (downstreamHW - upstreamHW) / (downstreamDist - upstreamDist) * (targetStopDist - upstreamDist));
                        ScheduleHeadwayByTripIDByStopID[TripID][stopID] = Math.Round(interpolatedHeadway, 1);
                    }
                    else if (nearbyStopID_DOWNSTREAM >= 0)
                    {
                        ScheduleHeadwayByTripIDByStopID[TripID][stopID] = ScheduleHeadwayByTripIDByStopID[TripID][nearbyStopID_DOWNSTREAM];
                    }
                    else if (nearbyStopID_UPSTREAM >= 0)
                    {
                        ScheduleHeadwayByTripIDByStopID[TripID][stopID] = ScheduleHeadwayByTripIDByStopID[TripID][nearbyStopID_UPSTREAM];
                    }
                    //if none of the above conditions can be met, this stop will be missed => the headway computation for this trip would be invalid and removed.
                }//end if contains key stopid
            }//end for stop
            //C-1: Assign final schedule headway, post-processed
            //get tripScheduledHeadways matrix for the trip
            List<double> thisTripScheduledHeadways = new List<double>();
            if (GpsTripTable[TripID].tripStopIDs.Count != ScheduleHeadwayByTripIDByStopID[TripID].Count)
            {
                TripIDToBeRemoved.Add(TripID);//remove trips that can't be processed
            }
            else
            {
                for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
                {
                    int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
                    thisTripScheduledHeadways.Add(Math.Round(ScheduleHeadwayByTripIDByStopID[TripID][stopID], 1));
                }//end for stop
                //UPDATE GPS TRIP VARIABLE
                GpsTripTable[TripID].tripScheduleHeadways = new List<double>(thisTripScheduledHeadways);
                //delete criteria
                List<double> tempList = new List<double>();
                tempList.AddRange(GpsTripTable[TripID].tripEstiHeadways.Where(v => v >= currentDay.getDurationSeconds()).ToList());//invalid value
                tempList.AddRange(GpsTripTable[TripID].tripEstiDelays.Where(v => v >= currentDay.getDurationSeconds()).ToList());//invalid value
                if (tempList.Count > 1)
                {
                    if (!TripIDToBeRemoved.Contains(TripID))
                    {
                        TripIDToBeRemoved.Add(TripID);
                    }
                }
            }
        }//end if contains key TripID; if this cannot be met => the headway computation for this trip would be invalid and removed.
    }//end for trip

    /*======= tripPrevTripID CALCULATIONS (AT STOPS) ===========*/

    /*======= Build PreviousTripIDByTripIDByStopID ===========*/
    Dictionary<int, Dictionary<int, int>> PreviousTripIDByTripIDByStopID = new Dictionary<int, Dictionary<int, int>>();
    foreach (int stopID in stopTimesIndexMatchWithTripID_ByStopID.Keys)
    {
        if (stopTimesIndexMatchWithTripID_ByStopID[stopID].Count > 1)
        {
            List<int> TripIDListAtStop = (from listItem in stopTimesIndexMatchWithTripID_ByStopID[stopID] select listItem.Item3).ToList();
            for (int listIndex = 0; listIndex < TripIDListAtStop.Count; listIndex++)
            {
                int curTripID = TripIDListAtStop[listIndex];
                int currentTripStartStopIndex = GpsTripTable[curTripID].tripStopIDs.IndexOf(stopID);
                int curTripNextStop = ((currentTripStartStopIndex + 1) < GpsTripTable[curTripID].tripStopIDs.Count) ? (GpsTripTable[curTripID].tripStopIDs[currentTripStartStopIndex + 1]) : (-1);//terminal
                int prevTripID = -1;//first trip has no previous trip id
                int matchIndex = listIndex - 1;
                if (listIndex > 0)
                {
                    //go over the entire trip list, and assign only if both this stop and the next stop match
                    while (prevTripID == -1 && matchIndex >= 0)//inclusive bound so the first trip at the stop is also checked
                    {
                        int potentialPrevTripID = TripIDListAtStop[matchIndex];
                        int prevTripStartStopIndex = GpsTripTable[potentialPrevTripID].tripStopIDs.IndexOf(stopID);
                        int nextStopIDForPrevTrip = ((prevTripStartStopIndex + 1) < GpsTripTable[potentialPrevTripID].tripStopIDs.Count) ? (GpsTripTable[potentialPrevTripID].tripStopIDs[prevTripStartStopIndex + 1]) : (-1);//terminal
                        if (nextStopIDForPrevTrip == curTripNextStop)
                        {
                            prevTripID = TripIDListAtStop[matchIndex];
                        }
                        matchIndex--;
                    }
                }//else no previous stop
                if (!PreviousTripIDByTripIDByStopID.ContainsKey(curTripID))
                {
                    PreviousTripIDByTripIDByStopID.Add(curTripID, new Dictionary<int, int>());
                }
                if (!PreviousTripIDByTripIDByStopID[curTripID].ContainsKey(stopID))
                {
                    PreviousTripIDByTripIDByStopID[curTripID].Add(stopID, prevTripID);
                }
            }
        }
    }//end foreach stop (trip loop not needed for this comp)
    //A-2: rearrange, then assign PreviousTripIDByTripIDByStopID in order of stop to the trip
    for (int tripIndex = 0; tripIndex < currentDayTripIDs.Count; tripIndex++)
    {
        int TripID = currentDayTripIDs[tripIndex];
        Dictionary<int, int> prevTripID_ByStopID = new Dictionary<int, int>();//previous trip ID keyed by stop
        for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
        {
            int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
            int prevTripID = PreviousTripIDByTripIDByStopID.ContainsKey(TripID) ? (PreviousTripIDByTripIDByStopID[TripID].ContainsKey(stopID) ? PreviousTripIDByTripIDByStopID[TripID][stopID] : -1) : -1;//no prev trip id if it didn't exist
            prevTripID_ByStopID.Add(stopID, prevTripID);
        }//end for
        //get tripPrevTripIDs matrix for the trip
        List<int> thisTripPrevTripIDs = new List<int>();
        for (int stopIDIndex = 0; stopIDIndex < GpsTripTable[TripID].tripStopIDs.Count; stopIDIndex++)
        {
            int stopID = GpsTripTable[TripID].tripStopIDs[stopIDIndex];
            thisTripPrevTripIDs.Add(prevTripID_ByStopID[stopID]);
        }//end for stop
        //UPDATE GPS TRIP VARIABLE
        GpsTripTable[TripID].tripPrevTripIDs = new List<int>(thisTripPrevTripIDs);
    }//end for trip

}//end foreach dateRange
    /*======= I-A2. More Changes to Trip Information (organize and process trip information) ===========*/
    //add new trips - done with ProcessAndOrganizeGPSPointsForTrip() and its returns, in both the data object and the database
    //remove these trips from the data object
    //delete these trips from the table database
    foreach (int removeTripID in TripIDToBeRemoved)
    {
        //delete these trips from the vehGPSTripTable object
        SSVehTripDataTable removedTrip = null;
        GpsTripTable.TryRemove(removeTripID, out removedTrip);
    }
    if (TripIDToBeRemoved.Count > 0)
    {
        DeleteFromTableDatabase("TTCGPSTRIPS", "TripID", TripIDToBeRemoved);
    }

    return numTripAffected != 0;
}
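Both headway routines in the listing above fill stops with no direct observation by linearly interpolating, by distance along the route, between the nearest upstream and downstream stops that do have values, with fallbacks to the upstream value in degenerate cases. The rule can be restated compactly outside of C#; the following is a minimal Python sketch (function and parameter names are illustrative, not identifiers from the listing):

```python
def interpolate_headway(target_dist, up_dist, down_dist, up_hw, down_hw):
    """Linearly interpolate a stop's headway (seconds) by distance along the route.

    up_dist/down_dist are the route distances of the nearest upstream and
    downstream stops with observed headways up_hw/down_hw. Mirrors the guard
    conditions in the C# listing: fall back to the upstream headway when the
    two headways are equal, the two distances coincide, or the target stop
    lies before the upstream stop.
    """
    if (down_hw - up_hw) == 0 or (down_dist - up_dist) == 0 or (target_dist - up_dist) < 0:
        return up_hw
    return up_hw + (down_hw - up_hw) / (down_dist - up_dist) * (target_dist - up_dist)
```

When only one neighbour exists, the listing copies that neighbour's headway directly, and a trip with any stop left unfilled is marked for removal.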
A.2.4 GPS Trip Processing Method (ProcessAndOrganizeGPSPointsForTrip)
private bool ProcessAndOrganizeGPSPointsForTrip(int gpsTripID, out List<SSVehTripDataTable> addNewTripsFromSplitTrip)
{
    addNewTripsFromSplitTrip = new List<SSVehTripDataTable>();
    bool tripNeedsRemoval = false;

    if (!GpsTripTable.ContainsKey(gpsTripID))
    {
        addNewTripsFromSplitTrip = null;
        return tripNeedsRemoval;
    }

    //estimatedScheduledStopTimesAtGPSPoints = new Dictionary<int, TimeSpan>();//index by gpsID
    int scheduleID = GpsTripTable[gpsTripID].GtfsScheduleID;

    if ((scheduleID <= 0) && (GpsTripTable[gpsTripID].GPSIDs != null))//has no matched schedule
    {
        addNewTripsFromSplitTrip = null;
        return tripNeedsRemoval;
    }

    if (GpsTripTable[gpsTripID].GPSIDs.Count <= 0)//has no points
    {
        addNewTripsFromSplitTrip = null;
        return tripNeedsRemoval;
    }

    string shapeID = GtfsScheduleTable[scheduleID].ShapeID;
    Dictionary<int, List<Tuple<int, double, TimeSpan>>> gpsIDShapedistTime_ByStopID = new Dictionary<int, List<Tuple<int, double, TimeSpan>>>();//closest right before stop o-*--<
    List<int> tripStopIDs;
    List<double> tripStopDistances;
    //1. INITIALIZE GPS POINTS DISTANCES AND GTFS STOP DISTANCES
    // Step 1: Preprocessing - find the start stop and remove any points before it with shape distance less than -500 m
    // Step 2: Add all the points passing preprocessing to gpsPointShapeDistances
    // Step 3: Calculate tripStopDistances in the same way to be consistent (result close or identical to GTFS StopDistances)
    // Step 4: Update the processed gpsPointShapeDistances GPSIDs to the vehGPSTripTable[gpsTripID].GPSIDs object for the next step (GPS POINT CLUSTERS PROCESSING)
    // (Fixed according to GTFSCoonvertSTSim operations (adjustment of distances to first stop))
    // Outputs: List<int> preprocessedNewGPSIDs
    Dictionary<int, double> gpsPointShapeDistances = new Dictionary<int, double>();//index by gpsID
    List<int> preprocessedNewGPSIDs = new List<int>();
    List<double> allShapeDist = new List<double>();
    List<int> GPSIDPointsOffShapeIndex = new List<int>();
    for (int i = 0; i < GpsTripTable[gpsTripID].GPSIDs.Count; i++)
    {
        int gpsID = GpsTripTable[gpsTripID].GPSIDs[i];
        GeoLocation gpsPoint = new GeoLocation(GpsPointTable[gpsID].Latitude, GpsPointTable[gpsID].Longitude);
        GeoLocation gpsPoint_SnappedToShape;
        double ShapeDist = AgencyShapeDistFromPointModForGPSPts(out gpsPoint_SnappedToShape, shapeID, gpsPoint, ((allShapeDist.Count > 0) ? (allShapeDist.Last() - 100) : (double.MinValue)));//allows a 100m back track from the previous point to tolerate gps errors
        if (ShapeDist == -1)//point too far off shape
        {
            GPSIDPointsOffShapeIndex.Add(i);
        }
        else
        {
            GpsPointTable[gpsID].Latitude = gpsPoint_SnappedToShape.Latitude;
            GpsPointTable[gpsID].Longitude = gpsPoint_SnappedToShape.Longitude;
            allShapeDist.Add(ShapeDist);
        }
    }
    GPSIDPointsOffShapeIndex.Sort((item1, item2) => -1 * item1.CompareTo(item2));//descending sort
    //remove off shape points - criteria is strict to screen any off-route points
    foreach (int removeIndex in GPSIDPointsOffShapeIndex)
    {
        GpsTripTable[gpsTripID].GPSIDs.RemoveAt(removeIndex);
    }
    if (GpsTripTable[gpsTripID].GPSIDs.Count <= 0)//has no points (if all points are off shape, they are removed)
    {
        addNewTripsFromSplitTrip = null;
        return tripNeedsRemoval;
    }
    //find a list of possible start stops (-500 to 500 shape distances), split them at the earliest, if needed
    List<int> possibleStartStopGPSPoints = new List<int>();
    for (int i = 0; i < GpsTripTable[gpsTripID].GPSIDs.Count; i++)
    {
        if (allShapeDist[i] > -500 && allShapeDist[i] < 500)
        {
            possibleStartStopGPSPoints.Add(i);
        }
    }
    //splitting procedure here: if the possible start stops are not close to each other (1-2 indices), then split them
    List<int> tripSplitIndexLocations = new List<int>();
    int lastStartCandidateIndex = -1;
    for (int n = 0; n < possibleStartStopGPSPoints.Count; n++)
    {
        if (n == 0)
        {
            //first one is always a possible split
            tripSplitIndexLocations.Add(possibleStartStopGPSPoints[n]);
            lastStartCandidateIndex = possibleStartStopGPSPoints[n];
        }
        if ((possibleStartStopGPSPoints[n] - lastStartCandidateIndex) > 3)
        {
            //large enough gap to warrant a split on the trip
            tripSplitIndexLocations.Add(possibleStartStopGPSPoints[n]);
            lastStartCandidateIndex = possibleStartStopGPSPoints[n];
        }
        else
        {
            lastStartCandidateIndex = possibleStartStopGPSPoints[n];
        }
    }
    //1b. SPLIT TRIP IF NEEDED
    if (tripSplitIndexLocations.Count > 1)//if any needs to be split, remove this trip, split trips and add them
    {
        gpsIDShapedistTime_ByStopID = null;
        //addNewTripsFromSplitTrip = new List<SSVehTripDataTable>();
        for (int newStartLocIndex = 0; newStartLocIndex < tripSplitIndexLocations.Count; newStartLocIndex++)
        {
            int newGPSTripID = GetNextGPSTripIDForTable() + newStartLocIndex;//note: GetNextGPSTripIDForTable() doesn't increment until a trip is added to vehGPSTripTable
            int startOfSplitGPSIndex = tripSplitIndexLocations[newStartLocIndex];
            int sizeOfSplitGPSIndex = (newStartLocIndex == (tripSplitIndexLocations.Count - 1)) ? (GpsTripTable[gpsTripID].GPSIDs.Count - tripSplitIndexLocations[newStartLocIndex]) : (tripSplitIndexLocations[newStartLocIndex + 1] - tripSplitIndexLocations[newStartLocIndex]);
            List<int> newGPSIDsForSplitTrip = GpsTripTable[gpsTripID].GPSIDs.GetRange(startOfSplitGPSIndex, sizeOfSplitGPSIndex);
            SSVehTripDataTable newSplitTrip = ObjectCopier.Clone(GpsTripTable[gpsTripID]);
            //UPDATE GPS TRIP VARIABLE - FOR NEW TRIPS
            newSplitTrip.TripID = newGPSTripID;
            newSplitTrip.GPSIDs = newGPSIDsForSplitTrip;
            newSplitTrip.startGPSTime = GpsPointTable[newSplitTrip.GPSIDs[0]].GPStime;
            addNewTripsFromSplitTrip.Add(newSplitTrip);
        }
        //ADD GPS TRIP TO DATABASE AND DATA OBJECT
        AddGPSTripToDB(addNewTripsFromSplitTrip);
        tripNeedsRemoval = true;
        return tripNeedsRemoval;
    }
    //find the closest index to the possible start stops
    double minGPSShapeDist = (from d in allShapeDist select Math.Abs(d)).ToList().Min();//allShapeDist.IndexOf(allShapeDist.Min());
    int TempMinGPSIndex = allShapeDist.Contains(minGPSShapeDist) ? allShapeDist.IndexOf(minGPSShapeDist) : allShapeDist.IndexOf(-minGPSShapeDist);
    bool earliestStartFound = false;
    while ((TempMinGPSIndex != 0) && !earliestStartFound)
    {
        if ((allShapeDist[TempMinGPSIndex - 1] > -500) && (allShapeDist[TempMinGPSIndex - 1] < 500) && (allShapeDist[TempMinGPSIndex - 1] <= allShapeDist[TempMinGPSIndex]))
        {
            TempMinGPSIndex--;
        }
        else
        {
            earliestStartFound = true;
        }
    }
    //ready to construct gpsPointShapeDistances
    double lastShapeDist = double.MinValue;
    for (int i = TempMinGPSIndex; i < GpsTripTable[gpsTripID].GPSIDs.Count; i++)
    {
        int gpsID = GpsTripTable[gpsTripID].GPSIDs[i];
        double ShapeDist = allShapeDist[i];
        if (lastShapeDist <= ShapeDist)//only construct with ascending shape distances
        {
            gpsPointShapeDistances.Add(gpsID, ShapeDist);
            lastShapeDist = ShapeDist;
        }
    }
    preprocessedNewGPSIDs = gpsPointShapeDistances.Keys.ToList();
    //UPDATE GPS TRIP VARIABLE - GPSIDs
    GpsTripTable[gpsTripID].GPSIDs = preprocessedNewGPSIDs;
    //GET already processed stop ids and distances
    tripStopIDs = new List<int>(GtfsScheduleTable[scheduleID].StopIDs);//get a copy
    //tripStopDistances = new List<double>();
    tripStopDistances = new List<double>(GtfsScheduleTable[scheduleID].StopDistances);//get a copy
    if (tripStopIDs.Count < 2)//check stop size
    {
241
tripNeedsRemoval = true; 3255 gpsIDShapedistTime_ByStopID = null; 3256 return tripNeedsRemoval; 3257 } 3258 ////Old procedure, less computationally efficient 3259 //if (numHeaders > 8) 3260 // newStopTimes.DistanceToStop = splitLine[8] != "" ? 3261 Double.Parse(splitLine[8]) * 1000 : 0; //*1000 for km to m 3262 conversion 3263 //else 3264 // newStopTimes.DistanceToStop = -1; 3265 //StopTimes.Add(newStopTimes); 3266 //// Trip stop distances compute 1: distances from shape start 3267 //for (int i = 0; i < (tripStopIDs.Count); i++) 3268 //{ 3269 // int stopID = tripStopIDs[i]; 3270 // GeoLocation stopPoint = new 3271 GeoLocation(gtfsStopTable[stopID].Latitude, gtfsStopTable[stopID].Longitude); 3272 3273 // bool notfirstOcc; 3274 // if (tripStopDistances.Count > 1)// && 3275 ListTrips[stopInfo.TripID].StopInfo[1].Stop == newSInfo.Stop) 3276 // { 3277 // notfirstOcc = true; 3278 // } 3279 // else 3280 // { 3281 // notfirstOcc = false; 3282 // } 3283 // tripStopDistances.Add(AgencyShapeDistFromPoint(shapeID, 3284 stopPoint, notfirstOcc ? tripStopDistances[i - 1] : 0)); 3285 // //newSInfo.DistFromFirstStop = stopInfo.DistanceToStop != -3286 1 ? stopInfo.DistanceToStop : 3287 AgencyShapeDistFromPoint(ListTrips[stopInfo.TripID].Shape.ShapeID, 3288 stopInfo.StopID, agencyID, notfirstOcc ? 
3289 ListTrips[stopInfo.TripID].StopInfo[stopInfo.StopSequence - 3290 1].DistFromFirstStop : 0); 3291 // //if (tripStopDistances.Count == 1 && 3292 ListTrips[stopInfo.TripID].StopInfo[1].DistFromFirstStop > 3293 newSInfo.DistFromFirstStop) 3294 // //{ 3295 // // 3296 ListTrips[stopInfo.TripID].StopInfo[1].DistFromFirstStop = 0; 3297 // //} 3298 //} 3299 int firstStopID = tripStopIDs[0]; 3300 GeoLocation FirstStopPt = new 3301 GeoLocation(GtfsStopTable[firstStopID].Latitude, 3302 GtfsStopTable[firstStopID].Longitude); 3303 int secondStopID = tripStopIDs[1]; 3304 GeoLocation secondStopPt = new 3305 GeoLocation(GtfsStopTable[secondStopID].Latitude, 3306 GtfsStopTable[secondStopID].Longitude); 3307 // Trip stop distances compute 2: Shape Distance adjustments 3308
242
    double firstStopDist = ShapeDistFromPoint(shapeID, FirstStopPt, tripStopDistances[1]);
    bool secondStopFromShapeStart = SecondStopFromShapeStart(shapeID, FirstStopPt, firstStopDist, secondStopPt, tripStopDistances[1]);
    //tripStopDistances[0] = firstStopDist;
    if (secondStopFromShapeStart == false)
    {
        for (int i = 0; i < GpsTripTable[gpsTripID].GPSIDs.Count; i++)
        {
            int gpsID = GpsTripTable[gpsTripID].GPSIDs[i];
            gpsPointShapeDistances[gpsID] += firstStopDist;
        }
        //// trip stop distances compute
        //for (int i = 1; i < (tripStopIDs.Count); i++)
        //{
        //    tripStopDistances[i] += firstStopDist;
        //}
    }
    //2. GPS POINT CLUSTERS PROCESSING
    // Group gps points into clusters to ensure points aren't redundant and/or back-tracking
    // Step 1: go through gps points ordered by GPStime. If the current gps point is very close to the previous (~10 m), add it to a cluster.
    //         If they are not close, check if the gps point has positive velocity relative to the last gps point.
    //         If yes, start a new cluster; if no, remove the gps point from the trip and update the gps trip id
    // Step 2: after all the gps points are processed into their own clusters, for each cluster with more than 2 points, find the earliest and latest points.
    //         Then, remove all other gps points from the trip and update the gps trip id.
    //         The chosen two points take the coordinates of the cluster centroid, with their original GPStimes.
    // Step 3: now that cluster processing has been conducted for all gps points in the trip, it is ready to be used for variable calculations.
    // Outputs: List<int> clusterProcessedNewGPSIDs
    //Preprocessing: find the start stop and trim the trip
    List<List<int>> gpsClusters = new List<List<int>>();
    List<int> clusterProcessedNewGPSIDs = new List<int>();
    List<int> currentGPSCluster = new List<int>();
    currentGPSCluster.Add(GpsTripTable[gpsTripID].GPSIDs[0]);
    bool isClusterNew = true;
    int startOfClusterGPSID = GpsTripTable[gpsTripID].GPSIDs[0];
    //group points into many clusters
    for (int i = 1; i < GpsTripTable[gpsTripID].GPSIDs.Count; i++)
    {
        startOfClusterGPSID = isClusterNew ? GpsTripTable[gpsTripID].GPSIDs[i - 1] : startOfClusterGPSID;
        int currentGPSID = GpsTripTable[gpsTripID].GPSIDs[i];
        double displacement = gpsPointShapeDistances[currentGPSID] - gpsPointShapeDistances[startOfClusterGPSID];
        //double deltaTime = (vehGPSPointsTable[currentGPSID].GPStime - vehGPSPointsTable[prevGPSID].GPStime);
        if (Math.Abs(displacement) < ClusterDisplacementTolerance) //close-by points
        {
            //add to current cluster
            currentGPSCluster.Add(currentGPSID);
            isClusterNew = false;
        }
        else //large enough displacement results in a new cluster
        {
            //finish adding current cluster to master gpsClusters
            int[] TempCluster = new int[currentGPSCluster.Count];
            currentGPSCluster.CopyTo(TempCluster);
            gpsClusters.Add(TempCluster.ToList());
            //start a new cluster
            isClusterNew = true;
            currentGPSCluster = new List<int>();
            currentGPSCluster.Add(currentGPSID);
        }
        //add final point/cluster
        if (i == (GpsTripTable[gpsTripID].GPSIDs.Count - 1))
        {
            //finish adding current cluster to master gpsClusters
            int[] TempCluster = new int[currentGPSCluster.Count];
            currentGPSCluster.CopyTo(TempCluster);
            gpsClusters.Add(TempCluster.ToList());
        }
    }
    //processing to remove non-revenue or back-tracking clusters
    for (int i_group = 0; i_group < (gpsClusters.Count - 1); i_group++)
    {
        //due to the nature of the comparisons, the last gpsCluster is automatically retained
        List<int> thisGPSCluster = gpsClusters[i_group];
        List<int> nextGPSCluster = gpsClusters[i_group + 1];
        int thisGPSID = thisGPSCluster[0]; //look at first pt of cluster
        int nextGPSID = nextGPSCluster[0]; //look at first pt of cluster
        double displacement = gpsPointShapeDistances[nextGPSID] - gpsPointShapeDistances[thisGPSID];
        double deltaTime = (GpsPointTable[nextGPSID].GPStime - GpsPointTable[thisGPSID].GPStime);
        //check if the cluster is decreasing in shape distance
        if ((displacement) / (deltaTime) < 0.0) //needs a positive velocity or else the cluster is removed
        {
            //likely due to displacement
            gpsClusters.RemoveAt(i_group);
            i_group--;
        }
    }
    //process all gps clusters into a list of new GPS IDs --> can potentially compute dwell time from this!
    foreach (List<int> aGPSCluster in gpsClusters)
    {
        if (aGPSCluster.Count <= 2)
        {
            clusterProcessedNewGPSIDs.AddRange(aGPSCluster);
        }
        else
        {
            clusterProcessedNewGPSIDs.Add(aGPSCluster[0]);
            clusterProcessedNewGPSIDs.Add(aGPSCluster[aGPSCluster.Count - 1]);
        } //end else
    } //end foreach
    //final checks of distances and times - ignores the first point though (cannot compare) - similar check done before on clusters checking first point
    List<int> GPSIDsToBeRemovedGPSErrors = new List<int>();
    for (int centerPtIndex = 1; centerPtIndex < clusterProcessedNewGPSIDs.Count; centerPtIndex++)
    {
        long timeLeft = GpsPointTable[clusterProcessedNewGPSIDs[centerPtIndex - 1]].GPStime;
        long timeCtr = GpsPointTable[clusterProcessedNewGPSIDs[centerPtIndex]].GPStime;
        //long timeRight = vehGPSPointsTable[newGPSIDs[centerPtIndex + 1]].GPStime;
        double shapeDistLeft = gpsPointShapeDistances[clusterProcessedNewGPSIDs[centerPtIndex - 1]];
        double shapeDistCtr = gpsPointShapeDistances[clusterProcessedNewGPSIDs[centerPtIndex]];
        //double shapeDistRight = gpsPointShapeDistances[newGPSIDs[centerPtIndex + 1]];
        //double velocity1 = (shapeDistCtr - shapeDistLeft) / (timeCtr - timeLeft) * 3.6;
        //double velocity2 = (shapeDistRight - shapeDistCtr) / (timeRight - timeCtr) * 3.6;
        if ((shapeDistCtr - shapeDistLeft) < 0 && Math.Abs(shapeDistCtr - shapeDistLeft) < 10)
        {
            //very small negative distance difference; the vehicle is probably stopped, so change its shape dist to ctr
            gpsPointShapeDistances[clusterProcessedNewGPSIDs[centerPtIndex - 1]] =
                gpsPointShapeDistances[clusterProcessedNewGPSIDs[centerPtIndex]];
        }
        if ((shapeDistCtr - shapeDistLeft) < 0 || (timeCtr - timeLeft) < 0) //case: raced ahead
        {
            GPSIDsToBeRemovedGPSErrors.Add(clusterProcessedNewGPSIDs[centerPtIndex]);
        }
        else if ((timeCtr - timeLeft) > (5 * 60)) //gps point gap too large (>5 mins); the rest of the trip's gps points may not be valid
        {
            GPSIDsToBeRemovedGPSErrors.AddRange(clusterProcessedNewGPSIDs.GetRange(centerPtIndex, clusterProcessedNewGPSIDs.Count - centerPtIndex));
            break;
        }
    }
    foreach (int removeGPS in GPSIDsToBeRemovedGPSErrors)
    {
        clusterProcessedNewGPSIDs.Remove(removeGPS);
    }
    //UPDATE GPS TRIP VARIABLE - GPSIDs
    GpsTripTable[gpsTripID].GPSIDs = clusterProcessedNewGPSIDs;
    //3. COMPUTE gpsID_DistFromStartMatchByStopID
    // Organize the gps points in relative positions to the gtfs stops, ordered by their distance from the stop downstream
    // * points can only be added to a stop IN ORDER. This prevents back-tracked points or travel in the reversed direction.
    // Outputs: List<int> stopOrderProcessedNewGPSIDs
    int currentStopIndexProcessing = -1; //may not add to a stop if it is "completed"/the next one has started
    List<int> stopOrderProcessedNewGPSIDs = new List<int>();
    for (int i = 0; i < GpsTripTable[gpsTripID].GPSIDs.Count; i++)
    {
        int gpsID = GpsTripTable[gpsTripID].GPSIDs[i];
        double distOfGPSFromStart = gpsPointShapeDistances[gpsID];
        GeoLocation gpsPoint = new GeoLocation(GpsPointTable[gpsID].Latitude, GpsPointTable[gpsID].Longitude);
        TimeSpan gpsTimeSpan = SSUtil.EpochTimeToLocalDateTime(GpsPointTable[gpsID].GPStime).DateTimeToTimeSpanFromLastLocalMidnight();
        //note: if the gps point is less than 50m past the stop, it is added. This allows better dwell time determinations for 20s data
        List<double> allResult = new List<double>(); // = (from r in tripStopDistances select ((r - distOfGPSFromStart) >= -50 ? (r - distOfGPSFromStart) : double.MaxValue)).ToList();
        for (int index = 0; index < tripStopDistances.Count; index++)
        {
            double nextStopDist = index < (tripStopDistances.Count - 1) ? tripStopDistances[index + 1] : double.MaxValue;
            double r = tripStopDistances[index];
            double stopDistDiff = (nextStopDist - r) / 2;
            //the stop boundary extends at most to the buffer distance tolerance setting
            double currentStopBoundary = (stopDistDiff / 2) <= DwellTimeBufferRadius ? (stopDistDiff / 2) : DwellTimeBufferRadius;
            allResult.Add((r - distOfGPSFromStart) >= -currentStopBoundary ? (r - distOfGPSFromStart) : double.MaxValue);
        }
        double result_posMinVal = allResult.Min();
        //get the closest stop id; if all stops returned max val, the point is beyond the last stop and may be added to the last stop's gps list
        int minStopIndex = (result_posMinVal > 10000000) ? (allResult.Count - 1) : allResult.IndexOf(result_posMinVal);
        double distOfStopFromStart = tripStopDistances[minStopIndex];
        if (minStopIndex >= currentStopIndexProcessing)
        {
            if (gpsIDShapedistTime_ByStopID.ContainsKey(tripStopIDs[minStopIndex]))
            {
                gpsIDShapedistTime_ByStopID[tripStopIDs[minStopIndex]].Add(new Tuple<int, double, TimeSpan>(gpsID, distOfGPSFromStart, gpsTimeSpan));
            }
            else
            {
                gpsIDShapedistTime_ByStopID.Add(tripStopIDs[minStopIndex], new List<Tuple<int, double, TimeSpan>>());
                gpsIDShapedistTime_ByStopID[tripStopIDs[minStopIndex]].Add(new Tuple<int, double, TimeSpan>(gpsID, distOfGPSFromStart, gpsTimeSpan));
            }
            stopOrderProcessedNewGPSIDs.Add(gpsID);
            currentStopIndexProcessing = minStopIndex;
            //GPS VARIABLE UPDATE
            GpsPointTable[gpsID].DistFromShapeStart = gpsPointShapeDistances[gpsID];
            GpsPointTable[gpsID].DistToNextStop = distOfStopFromStart - distOfGPSFromStart;
            GpsPointTable[gpsID].NextStop = tripStopIDs[minStopIndex];
            GpsPointTable[gpsID].PrevStop = (minStopIndex) > 0 ? tripStopIDs[minStopIndex - 1] : -1;
        }
    }
    bool anyPointsBeyondFirstStop = gpsPointShapeDistances.Values.ToList().Exists(v => v > 0);
    if ((stopOrderProcessedNewGPSIDs.Count < 3) || !anyPointsBeyondFirstStop) //not enough points at all, or none beyond the first stop
    {
        tripNeedsRemoval = true;
        //estimatedStopTimesAtStops = null;
        gpsIDShapedistTime_ByStopID = null;
        return tripNeedsRemoval;
    }
    //UPDATE GPS TRIP VARIABLE: GPS IDs and start GPS time
    GpsTripTable[gpsTripID].GPSIDs = stopOrderProcessedNewGPSIDs;
    GpsTripTable[gpsTripID].startGPSTime = GpsPointTable[stopOrderProcessedNewGPSIDs[0]].GPStime;
    //4. REVISE STOP SEQUENCE FOR THIS PARTICULAR TRIP, based on gps traces
    // remove edge stops (from start or end) that don't have gps points right in front of them (do not remove if it is the first stop)
    // remove case (remove first point, then stop): o----o-----x----o-x---o
    // no need to remove: o-----x----o-x---o
    // * Takes care of short-turn trips - will ignore missing stops if only 1 or 2 stops are missing at the ends
    // * Execute if the size difference between the gpsIDShapedistTime_ByStopID trace and tripStopIDs is > 2
    //if ((tripStopIDs.Count - gpsIDShapedistTime_ByStopID.Count) >= 2)
    //{
    int a = 0; //lower bound
    int b = 0; //upper bound
    List<int> indexLoc = (from r in gpsIDShapedistTime_ByStopID.Keys select tripStopIDs.IndexOf(r)).ToList();
    a = indexLoc.Min();
    b = indexLoc.Max();
    if (a == b) //if the two are the same, the trip is too short and is not valid
    {
        tripStopIDs = new List<int>();
        tripStopDistances = new List<double>();
    }
    else
    {
        //tripStopIDs = tripStopIDs.GetRange(a, b - a + 1);
        //tripStopDistances = tripStopDistances.GetRange(a, b - a + 1);
        if ((tripStopIDs.Count - 1 - b) > 3) //only trim if more than 3 need to be removed
        {
            tripStopIDs.RemoveRange(b + 1, (tripStopIDs.Count - 1 - b));
            tripStopDistances.RemoveRange(b + 1, (tripStopIDs.Count - 1 - b));
        }
        if (a > 3) //only trim if more than 3 need to be removed
        {
            tripStopIDs.RemoveRange(0, a);
            tripStopDistances.RemoveRange(0, a);
        }
    }
    //}
    //UPDATE GPS TRIP VARIABLE
    GpsTripTable[gpsTripID].tripStopIDs = tripStopIDs;
    GpsTripTable[gpsTripID].tripStopDistances = tripStopDistances;
    GpsTripTable[gpsTripID].gpsIDShapedistTime_ByStopID = gpsIDShapedistTime_ByStopID; //check order and result
    if (tripStopIDs.Count < 2) //check stop size
    {
        tripNeedsRemoval = true;
        gpsIDShapedistTime_ByStopID = null;
        return tripNeedsRemoval;
    }
    tripNeedsRemoval = false;
    return tripNeedsRemoval;
}
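The GPS-cluster processing described in the Step 1/Step 2 comments of the method above can be illustrated with a minimal, language-neutral sketch (Python here, with hypothetical names; the 10 m tolerance mirrors the role of ClusterDisplacementTolerance). This is not the thesis code, only a simplified illustration of the grouping and back-tracking removal:

```python
# Illustrative sketch only (assumed names): group GPS points into clusters by
# along-shape displacement, drop clusters that move backwards along the shape,
# then keep at most the first and last point of each cluster.

def cluster_points(points, tol=10.0):
    """points: list of (gps_id, shape_dist_m, epoch_time), sorted by time."""
    clusters = [[points[0]]]
    for pt in points[1:]:
        # displacement measured from the first point of the current cluster
        if abs(pt[1] - clusters[-1][0][1]) < tol:
            clusters[-1].append(pt)
        else:
            clusters.append([pt])
    # drop clusters whose shape distance decreases relative to the next cluster
    kept = [c for i, c in enumerate(clusters)
            if i == len(clusters) - 1 or clusters[i + 1][0][1] >= c[0][1]]
    # retain only the earliest and latest point of each multi-point cluster
    out = []
    for c in kept:
        out.extend(c if len(c) <= 2 else [c[0], c[-1]])
    return out

pts = [(1, 0.0, 0), (2, 3.0, 20), (3, 5.0, 40), (4, 120.0, 60), (5, 121.0, 80)]
print([p[0] for p in cluster_points(pts)])  # -> [1, 3, 4, 5]
```

The time gap between the two retained points of a cluster is what later yields a dwell-time estimate at the nearby stop.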
A.2.5 Arrival and Dwell Time Processing Method (EstimatedArrivalAndDwellAlongScheduleRoute)
private bool EstimatedArrivalAndDwellAlongScheduleRoute(int gpsTripID, out Dictionary<int, TimeSpan> estimatedArrivalTimesAtStops, out Dictionary<int, double> estimatedDwellTimesAtStops)
{
    estimatedArrivalTimesAtStops = new Dictionary<int, TimeSpan>();
    estimatedDwellTimesAtStops = new Dictionary<int, double>();
    bool successCalc = false;
    if (!GpsTripTable.ContainsKey(gpsTripID))
    {
        estimatedArrivalTimesAtStops = null;
        estimatedDwellTimesAtStops = null;
        return successCalc;
    }
    if (GpsTripTable[gpsTripID].GPSIDs == null || GpsTripTable[gpsTripID].tripStopIDs == null || GpsTripTable[gpsTripID].tripStopDistances == null || GpsTripTable[gpsTripID].gpsIDShapedistTime_ByStopID == null) //make sure the ProcessAndOrganizeGPSPointsForTrip() method was performed fully
    {
        //NOT COMPUTABLE
        estimatedArrivalTimesAtStops = null;
        estimatedDwellTimesAtStops = null;
        return successCalc;
    }
    else if (GpsTripTable[gpsTripID].GPSIDs.Count < (GpsTripTable[gpsTripID].tripStopDistances.Last() / 2000)) //at least one point for every 2 km, otherwise data quality is too poor to be computed
    {
        //LOW QUALITY COMPUTATIONS
        estimatedArrivalTimesAtStops = null;
        estimatedDwellTimesAtStops = null;
        return successCalc;
    }
    else
    {
        //COMPUTABLE
        List<int> stopIndexForSecondaryInterpolation = new List<int>();
        List<int> tripStopIDs = GpsTripTable[gpsTripID].tripStopIDs;
        List<double> tripStopDistances = GpsTripTable[gpsTripID].tripStopDistances;
        Dictionary<int, List<Tuple<int, double, TimeSpan>>> gpsIDShapedistTime_ByStopID = GpsTripTable[gpsTripID].gpsIDShapedistTime_ByStopID;
        List<int> stopKeyList = gpsIDShapedistTime_ByStopID.Keys.ToList();
        List<int> stopIndexList_RefStopIDs = (from r in stopKeyList select tripStopIDs.IndexOf(r)).ToList();
        double cycleSpeed = (gpsIDShapedistTime_ByStopID[stopKeyList.Last()].Last().Item2 - gpsIDShapedistTime_ByStopID[stopKeyList.First()].First().Item2) /
            (gpsIDShapedistTime_ByStopID[stopKeyList.Last()].Last().Item3.TotalSeconds - gpsIDShapedistTime_ByStopID[stopKeyList.First()].First().Item3.TotalSeconds);
        //5. COMPUTE estimatedStopTimes
        //A. Estimate stopTimes based on organized GPS data, by stopID
        //search through available gps points
        for (int i = 0; i < tripStopIDs.Count; i++)
        {
            int targetStopID = tripStopIDs[i]; //target, unique for each loop (equivalent to beforeStopID if there are gps points near the stop)
            int targetNextStopID = i < (tripStopIDs.Count - 1) ? tripStopIDs[i + 1] : -1; //the next stop after target, unique for each loop (equivalent to afterStopID if there are gps points near the stop)
            double distToTargetStop = tripStopDistances[i];
            int beforeStopID = -1;
            int afterStopID = -1;
            if (stopKeyList.Contains(targetStopID))
            {
                //A1: find nearest upstream and downstream stops with gps data
                List<int> stopIndexDiff = (from stopIndex in stopIndexList_RefStopIDs select ((stopIndex - i) <= 0 ? (stopIndex - i) : int.MinValue)).ToList();
                int closestIndex = stopIndexDiff.IndexOf(stopIndexDiff.Max()); //max of negatives
                closestIndex = closestIndex <= -1 ? 0 : closestIndex;
                beforeStopID = stopKeyList[closestIndex];
                afterStopID = (beforeStopID == targetStopID) || (stopKeyList.Last() == beforeStopID) ? beforeStopID : stopKeyList[closestIndex + 1];
                if (afterStopID == stopKeyList.Last() && beforeStopID == afterStopID && gpsIDShapedistTime_ByStopID[afterStopID].Count < 2)
                {
                    //adjustment if the gps point count of the last stop is insufficient (<2), for if afterStopID == stopKeyList.Last()
                    beforeStopID = stopKeyList[closestIndex - 1];
                }
                else if (beforeStopID == afterStopID && gpsIDShapedistTime_ByStopID[beforeStopID].Count < 2) //beforeStopID == stopKeyList.First() &&
                {
                    //adjustment if the gps point count of the before stop is insufficient (<2)
                    afterStopID = stopKeyList[closestIndex + 1];
                }
                //A2: determine index of the specific gps points at the determined stop ids/indices
                int beforeStopGPSIndex = -1;
                int afterStopGPSIndex = -1;
                if (beforeStopID == tripStopIDs.Last()) //last stop; already checked that there are enough points
                {
                    //special case - last stop, need to search for the best point since there's no bound
                    double min = double.MaxValue;
                    for (int index = 0; index < gpsIDShapedistTime_ByStopID[beforeStopID].Count; index++)
                    {
                        Tuple<int, double, TimeSpan> beforeStopGPSPoint = gpsIDShapedistTime_ByStopID[beforeStopID][index];
                        double TempMinDist = Math.Abs(distToTargetStop - beforeStopGPSPoint.Item2);
                        if (min > TempMinDist)
                        {
                            beforeStopGPSIndex = index;
                            min = TempMinDist;
                        }
                    }
                    //afterStopID = beforeStopID; //assumed
                    if (beforeStopGPSIndex == (gpsIDShapedistTime_ByStopID[beforeStopID].Count - 1))
                    {
                        afterStopGPSIndex = beforeStopGPSIndex;
                        beforeStopGPSIndex = beforeStopGPSIndex - 1;
                    }
                    else
                    {
                        afterStopGPSIndex = beforeStopGPSIndex + 1;
                    }
                }
                else if (beforeStopID == afterStopID) //default: enough points near the current stop
                {
                    beforeStopGPSIndex = gpsIDShapedistTime_ByStopID[beforeStopID].Count - 2;
                    afterStopGPSIndex = gpsIDShapedistTime_ByStopID[afterStopID].Count - 1;
                }
                else if ((stopKeyList.IndexOf(afterStopID) - stopKeyList.IndexOf(beforeStopID)) == 1) //not enough points near the current stop
                {
                    beforeStopGPSIndex = gpsIDShapedistTime_ByStopID[beforeStopID].Count - 1;
                    afterStopGPSIndex = 0;
                }
                //6. COMPUTE arrival time & dwell time
                //A3
                //Determine dwell time - only if two gps points with distances less than 50m exist close by the stop
                //Determine arrival time for dwell time NOT equal to 0
                //May be 2 points before the stop, 2 points after the stop, or 1 point before and 1 point after
                TimeSpan estiArrivalTime = new TimeSpan();
                double estiDwellTime = 0;
                TimeSpan time1;
                TimeSpan time2;
                double dist1 = -1;
                double dist2 = -1;
                int gpsPtSizeBeforeStop = gpsIDShapedistTime_ByStopID[beforeStopID].Count;
                int gpsPtSizeAfterStop = gpsIDShapedistTime_ByStopID[afterStopID].Count;
                //Check if the points have a dwell time component (based on criteria for stopped)
                //dist1 = (gpsPtSizeBeforeStop >= 1) ? gpsIDShapedistTime_ByStopID[beforeStopID][beforeStopGPSIndex].Item2 : (0); //before stop
                //dist2 = (gpsPtSizeAfterStop >= 1) ? gpsIDShapedistTime_ByStopID[afterStopID][afterStopGPSIndex].Item2 : (10000000); //after stop
                dist1 = gpsIDShapedistTime_ByStopID[beforeStopID][beforeStopGPSIndex].Item2; //before stop
                dist2 = gpsIDShapedistTime_ByStopID[afterStopID][afterStopGPSIndex].Item2; //after stop
                time1 = gpsIDShapedistTime_ByStopID[beforeStopID][beforeStopGPSIndex].Item3;
                time2 = gpsIDShapedistTime_ByStopID[afterStopID][afterStopGPSIndex].Item3;
                double segmentSpeed = (dist2 - dist1) / (time2.TotalSeconds - time1.TotalSeconds);
                if (((dist2 - dist1) < DwellTimeBufferRadius * 2) && (beforeStopID == afterStopID) && (segmentSpeed < cycleSpeed)) //points very close by, or lower than average cycle speed
                {
                    estiDwellTime = time2.Subtract(time1).TotalSeconds;
                    estiArrivalTime = time1;
                }
                else
                {
                    //no dwell time; estimate arrival time with 0 dwell. To prevent an overestimated (late) arrival time due to slow speed around stops, validate with the first point after the before stop
                    double distDiff1 = Math.Abs(dist1 - distToTargetStop);
                    double distDiff2 = Math.Abs(dist2 - distToTargetStop);
                    //no dwell case 1: polling point falls within the stop and is within distance tolerance
                    if ((distDiff1 < DwellTimeBufferRadius) && (distDiff2 < DwellTimeBufferRadius))
                    {
                        if (distDiff1 <= distDiff2)
                        {
                            estiArrivalTime = time1; //time1 fell within stop boundary
                            estiDwellTime = 0;
                        }
                        else
                        {
                            estiArrivalTime = time2; //time2 fell within stop boundary
                            estiDwellTime = 0;
                        }
                    }
                    else if (distDiff1 < DwellTimeBufferRadius)
                    {
                        estiArrivalTime = time1; //time1 fell within stop boundary
                        estiDwellTime = 0;
                    }
                    else if (distDiff2 < DwellTimeBufferRadius)
                    {
                        estiArrivalTime = time2; //time2 fell within stop boundary
                        estiDwellTime = 0;
                    }
                    double extrapolationSpeed = (segmentSpeed < cycleSpeed) ? cycleSpeed : segmentSpeed;
                    //no dwell case 2: polling point falls outside the stop, and the stop is outside the two points (dist1 > distToTargetStop or dist2 < distToTargetStop)
                    if (distToTargetStop < dist1)
                    {
                        //adjust time1 and dist1 to the one before, if possible
                        double stopTimeToTargetStop = 0;
                        if (stopKeyList.First() == beforeStopID)
                        {
                            stopTimeToTargetStop = time1.TotalSeconds + (distToTargetStop - dist1) / extrapolationSpeed;
                        }
                        else
                        {
                            beforeStopID = stopKeyList[stopKeyList.IndexOf(beforeStopID) - 1];
                            beforeStopGPSIndex = gpsIDShapedistTime_ByStopID[beforeStopID].Count - 1;
                            dist2 = dist1;
                            time2 = time1;
                            dist1 = gpsIDShapedistTime_ByStopID[beforeStopID][beforeStopGPSIndex].Item2; //before stop
                            time1 = gpsIDShapedistTime_ByStopID[beforeStopID][beforeStopGPSIndex].Item3;
                            stopTimeToTargetStop = (((dist2 - dist1) == 0) || ((time2.TotalSeconds - time1.TotalSeconds) <= 0)) ? (time1.TotalSeconds) : (time1.TotalSeconds + Convert.ToInt64((time2.TotalSeconds - time1.TotalSeconds) / (dist2 - dist1) * (distToTargetStop - dist1)));
                        }
                        estiArrivalTime = TimeSpan.FromSeconds(Math.Round(stopTimeToTargetStop, 1));
                        estiDwellTime = 0;
                    }
                    else if (distToTargetStop > dist2)
                    {
                        //adjust time2 and dist2 to the one after, if possible
                        double stopTimeToTargetStop = 0;
                        if (stopKeyList.Last() == afterStopID)
                        {
                            stopTimeToTargetStop = time2.TotalSeconds + (distToTargetStop - dist2) / extrapolationSpeed;
                        }
                        else
                        {
                            afterStopID = stopKeyList[stopKeyList.IndexOf(afterStopID) + 1];
                            afterStopGPSIndex = 0;
                            dist1 = dist2;
                            time1 = time2;
                            dist2 = gpsIDShapedistTime_ByStopID[afterStopID][afterStopGPSIndex].Item2; //after stop
                            time2 = gpsIDShapedistTime_ByStopID[afterStopID][afterStopGPSIndex].Item3;
                            stopTimeToTargetStop = (((dist2 - dist1) == 0) || ((time2.TotalSeconds - time1.TotalSeconds) <= 0)) ? (time1.TotalSeconds) : (time1.TotalSeconds + Convert.ToInt64((time2.TotalSeconds - time1.TotalSeconds) / (dist2 - dist1) * (distToTargetStop - dist1)));
                        }
                        estiArrivalTime = TimeSpan.FromSeconds(Math.Round(stopTimeToTargetStop, 1));
                        estiDwellTime = 0;
                    }
                    else
                    {
                        //no dwell case 3: polling points fall outside the stop, but the stop is within the two points
                        //search for the best estimate of stop times and validate
                        double stopTimeToTargetStop = (((dist2 - dist1) == 0) || ((time2.TotalSeconds - time1.TotalSeconds) <= 0)) ? (time1.TotalSeconds) : (time1.TotalSeconds + Convert.ToInt64((time2.TotalSeconds - time1.TotalSeconds) / (dist2 - dist1) * (distToTargetStop - dist1)));
                        estiArrivalTime = TimeSpan.FromSeconds(stopTimeToTargetStop);
                        estiDwellTime = 0;
                    }
                }
                //Add computed arrival and dwell time, if there are enough nearby points to get these values
                if (!estimatedArrivalTimesAtStops.ContainsKey(targetStopID)) //there should not be a repeated stop in a single direction; if there is, it is erroneous
                {
                    estimatedArrivalTimesAtStops.Add(targetStopID, estiArrivalTime);
                    estimatedDwellTimesAtStops.Add(targetStopID, estiDwellTime);
                } //end if
            } //end if
            else
            {
                //not enough data to compute stop times for this stop
                stopIndexForSecondaryInterpolation.Add(i);
            } //end else
        } //end for
        //B. Find stops with missing GPS points and calculate through interpolation with stopIndexForSecondaryInterpolation (estimated time when passing the stop)
        foreach (int index in stopIndexForSecondaryInterpolation)
        {
            //find nearest stops for stop time interpolation
            int targetStopID = tripStopIDs[index];
            TimeSpan EstiStopTime = new TimeSpan();
            List<int> ExistingSolIndex = (from r in estimatedArrivalTimesAtStops.Keys select tripStopIDs.IndexOf(r)).ToList();
            int possibleStartIndex;
            int possibleEndIndex;
            bool startFound = false;
            int startSearchDir;
            bool endFound = false;
            int endSearchDir;
            //initialize and set search dir
            if (index <= ExistingSolIndex.Min())
            {
                possibleStartIndex = ExistingSolIndex.Min();
                possibleEndIndex = ExistingSolIndex.Min() + 1;
                startSearchDir = -1;
                endSearchDir = 1;
            }
            else if (index >= ExistingSolIndex.Max())
            {
                possibleStartIndex = ExistingSolIndex.Max() - 1;
                possibleEndIndex = ExistingSolIndex.Max();
                startSearchDir = -1;
                endSearchDir = 1;
            }
            else
            {
                possibleStartIndex = index;
                possibleEndIndex = index;
                startSearchDir = -1;
                endSearchDir = 1;
            }
            //search
            while (possibleStartIndex >= 0 && possibleStartIndex < (tripStopIDs.Count - 1) && !startFound)
            {
                if (estimatedArrivalTimesAtStops.ContainsKey(tripStopIDs[possibleStartIndex]))
                {
                    startFound = true;
                }
                else
                {
                    possibleStartIndex += startSearchDir;
                }
            }
            possibleEndIndex = possibleStartIndex;
            while (possibleEndIndex >= 0 && possibleEndIndex <= (tripStopIDs.Count - 1) && !endFound)
            {
                if (estimatedArrivalTimesAtStops.ContainsKey(tripStopIDs[possibleEndIndex]) && possibleStartIndex != possibleEndIndex)
                {
                    endFound = true;
                }
                else
                {
                    possibleEndIndex += endSearchDir;
                }
            }
            if (startFound && endFound)
            {
                double distToTargetStop = tripStopDistances[index];
                //in case the start and end are swapped, switch the start and end index
                if (possibleStartIndex > possibleEndIndex)
                {
                    int temp = possibleStartIndex;
                    possibleStartIndex = possibleEndIndex;
                    possibleEndIndex = temp;
                }
                int upstreamStopID = tripStopIDs[possibleStartIndex];
                int downstreamStopID = tripStopIDs[possibleEndIndex];
                //check dwell time for different cases: stop after downstream, stop before upstream, etc.
                double dwellTime1 = estimatedDwellTimesAtStops.ContainsKey(upstreamStopID) ? estimatedDwellTimesAtStops[upstreamStopID] : 0;
                double time1 = estimatedArrivalTimesAtStops[upstreamStopID].TotalSeconds; //arrival time from start
                double dwellTime2 = estimatedDwellTimesAtStops.ContainsKey(downstreamStopID) ? estimatedDwellTimesAtStops[downstreamStopID] : 0;
                double time2 = estimatedArrivalTimesAtStops[downstreamStopID].TotalSeconds; //arrival time to end
                double dist1 = tripStopDistances[possibleStartIndex];
                double dist2 = tripStopDistances[possibleEndIndex];
                double stopTimeToTargetStopEpoch = 0;
                double estiSpeed = (dist2 - dist1) / (time2 - (time1 + dwellTime1)); //running speed from dep of upstreamStop to arr of downstreamStop
                if (distToTargetStop < dist1) //target stop before upstreamStop
                {
                    stopTimeToTargetStopEpoch = Math.Round((time1 - (dist1 - distToTargetStop) / estiSpeed), 1); //from arr of upstreamStop
                }
                else if (distToTargetStop > dist2) //target stop after downstreamStop
                {
                    stopTimeToTargetStopEpoch = Math.Round(((time2 + dwellTime2) + (distToTargetStop - dist2) / estiSpeed), 1); //from dep of downstreamStop
                }
                else //target stop between upstreamStop and downstreamStop
                {
                    stopTimeToTargetStopEpoch = Math.Round(((time1 + dwellTime1) + (distToTargetStop - dist1) / estiSpeed), 1); //from dep of upstreamStop
                }
                EstiStopTime = TimeSpan.FromSeconds(stopTimeToTargetStopEpoch);
                if (!estimatedArrivalTimesAtStops.ContainsKey(targetStopID)) //there should not be a repeated stop in a single direction; if there is, it is erroneous
                {
                    estimatedArrivalTimesAtStops.Add(targetStopID, EstiStopTime);
                    estimatedDwellTimesAtStops.Add(targetStopID, 0);
                }
            }
            //else
            //{
            //    int debug = 0;
            //    //stopIndexForSecondaryInterpolation.Add(index); //try again
            //}
        }
        //UPDATE GPS TRIP VARIABLE
        TimeSpan lastStopTime = new TimeSpan(0);
        GpsTripTable[gpsTripID].tripStopArrTimes = new List<TimeSpan>();
        GpsTripTable[gpsTripID].tripDwellTimes = new List<double>();
        for (int n = 0; n < GpsTripTable[gpsTripID].tripStopIDs.Count; n++)
        {
            int stopID = GpsTripTable[gpsTripID].tripStopIDs[n];
            int nextStopID = n < (GpsTripTable[gpsTripID].tripStopIDs.Count - 1) ? GpsTripTable[gpsTripID].tripStopIDs[n + 1] : stopID;
            //assess logical consistencies
            // strictly increasing stop times
            if (lastStopTime.TotalSeconds > estimatedArrivalTimesAtStops[stopID].TotalSeconds)
            {
                UpdateGUI_LogBox(String.Format("StopTime Error (incr. times). Line {0}, TripID: {1}, targetStopID: {2}", "3426", gpsTripID, stopID));
            }
            // overall travel time greater than running time or dwell time; else, set dwell at start to 0 and add dwell to travel time
            double currentDwell = estimatedDwellTimesAtStops[stopID]; //at start
            double currentTT = estimatedArrivalTimesAtStops[nextStopID].TotalSeconds - estimatedArrivalTimesAtStops[stopID].TotalSeconds;
            if ((currentTT < currentDwell) && stopID != nextStopID)
            {
                UpdateGUI_LogBox(String.Format("StopTime Error (dwell vs TT). Line {0}, TripID: {1}, targetStopID: {2}", "3433", gpsTripID, stopID));
                //estimatedDwellTimesAtStops[stopID] = 0; //set dwell to zero
            }
            //dwell times
            if (estimatedDwellTimesAtStops.ContainsKey(stopID))
            {
                GpsTripTable[gpsTripID].tripDwellTimes.Add(estimatedDwellTimesAtStops[stopID]);
259
} 4175 else 4176 { 4177 GpsTripTable[gpsTripID].tripDwellTimes.Add(0); 4178 } 4179 //stop times 4180 if (estimatedArrivalTimesAtStops.ContainsKey(stopID)) 4181 { 4182 4183 GpsTripTable[gpsTripID].tripStopArrTimes.Add(estimatedArrivalTimesAtStops[sto4184 pID]); 4185 } 4186 else 4187 { 4188 //missing stop times, trip is not valid 4189 UpdateGUI_LogBox(String.Format("missing stop times, 4190 trip is not valid. Line {0}, TripID: {1}", "2998", gpsTripID)); 4191 estimatedArrivalTimesAtStops = null; 4192 successCalc = false; 4193 return successCalc; 4194 } 4195 lastStopTime = estimatedArrivalTimesAtStops[stopID]; 4196 } 4197 successCalc = true; 4198 return successCalc; 4199 } 4200 } 4201 4202
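The interpolation above estimates a target stop's arrival time from the running speed between two stops with known times, extrapolating backward or forward when the target lies outside the known pair. The following is a hypothetical Java re-creation of just that calculation, not the thesis implementation; the class and parameter names are illustrative.

```java
// Hypothetical sketch: estimate the arrival time (in seconds) at a target stop
// from the arrival times, dwell times, and cumulative distances of an upstream
// and a downstream stop, by interpolation/extrapolation along the link.
public final class StopTimeInterpolator {
    // time1/dwell1: arrival and dwell at upstream stop (s); time2/dwell2: same for
    // downstream stop; dist1/dist2/distTarget: cumulative distances (m)
    public static double estimate(double time1, double dwell1,
                                  double time2, double dwell2,
                                  double dist1, double dist2, double distTarget) {
        // running speed from departure of upstream stop to arrival at downstream stop
        double speed = (dist2 - dist1) / (time2 - (time1 + dwell1));
        if (distTarget < dist1) {
            // target before upstream stop: extrapolate back from its arrival
            return time1 - (dist1 - distTarget) / speed;
        } else if (distTarget > dist2) {
            // target after downstream stop: extrapolate forward from its departure
            return (time2 + dwell2) + (distTarget - dist2) / speed;
        } else {
            // target between the two stops: interpolate from the upstream departure
            return (time1 + dwell1) + (distTarget - dist1) / speed;
        }
    }
}
```

For example, with a 30 s dwell at the upstream stop and a 500 m link covered in 100 s of running time (5 m/s), a target stop 250 m along the link is reached 50 s after the upstream departure.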
A.2.6 Link Data Processing Method (InitializeModelLinksToDb)
private void InitializeModelLinksToDb()//(bool newTable)
{
    //StringBuilder randomNumTestString = new StringBuilder();
    int GlobalLinkID = 1;
    int GlobalLinkDataID = 1;
    LinkObjectTable.Clear();
    LinkIDIndexByStartEndAndScheduleID.Clear();
    //1. Construct linkObjectTable (SSLinkObjectTable Dictionary, by LinkID)
    List<int> allTripIDs = GpsTripTable.Keys.ToList();
    foreach (int TripID in allTripIDs)
    {
        //int TripID = allTripIDs[i];
        List<int> stopIDs = GpsTripTable[TripID].tripStopIDs;
        for (int stopIndex = 0; stopIndex < (stopIDs.Count - 1); stopIndex++)
        {
            Tuple<int, int, int> newLinkIndex = new Tuple<int, int, int>(GpsTripTable[TripID].tripStopIDs[stopIndex], GpsTripTable[TripID].tripStopIDs[stopIndex + 1], GpsTripTable[TripID].GtfsScheduleID);
            double tempLinkDist = Math.Round(GpsTripTable[TripID].tripStopDistances[stopIndex + 1] - GpsTripTable[TripID].tripStopDistances[stopIndex], 1);
            if (!LinkIDIndexByStartEndAndScheduleID.ContainsKey(newLinkIndex))
            {
                //new link and add data
                SSLinkObjectTable newLink = new SSLinkObjectTable();
                newLink.LINKID = GlobalLinkID;
                newLink.StartStopID = GpsTripTable[TripID].tripStopIDs[stopIndex];
                newLink.EndStopID = GpsTripTable[TripID].tripStopIDs[stopIndex + 1];
                newLink.LinkDist = tempLinkDist;
                //Criteria to exclude links: short platform-to-platform links
                if ((GtfsStopTable[newLink.StartStopID].StopType == stopType.platform_bus && GtfsStopTable[newLink.EndStopID].StopType == stopType.platform_bus) && newLink.LinkDist <= 100)
                {
                    continue;//skip this link and its link data
                }
                newLink.GtfsScheduleID = GpsTripTable[TripID].GtfsScheduleID;
                newLink.IntxnIDsAll = new List<int>();//TBD
                newLink.OtherLinkVars = new List<double>();//TBD
                newLink.OtherRouteVars = new List<double>();//TBD
                //newLink.StartStopLocTyp = "NULL";//TBD
                //newLink.EndStopLocTyp = "NULL";//TBD
                newLink.LinkData = new List<SSLinkDataTable>();
                LinkObjectTable.AddOrUpdate(newLink.LINKID, newLink, (k, v) => newLink);
                LinkIDIndexByStartEndAndScheduleID.AddOrUpdate(newLinkIndex, newLink.LINKID, (k, v) => newLink.LINKID);
                GlobalLinkID++;
            }
            //fill one link data
            SSLinkDataTable singleLinkData = new SSLinkDataTable();
            singleLinkData.TripID = TripID;
            //determine y variables - eliminate infinite or high speeds
            double t1 = GpsTripTable[TripID].tripStopArrTimes[stopIndex].TotalSeconds + GpsTripTable[TripID].tripDwellTimes[stopIndex]; //departure time = arrival time + dwell time, at start stop
            double t2 = GpsTripTable[TripID].tripStopArrTimes[stopIndex + 1].TotalSeconds; //arrival time at end stop
            singleLinkData.RunningTime = (t2 >= t1) ? (t2 - t1) : ((t1 - t2) > 3600 ? (86400 - t1 + t2) : 0);//difference in arrival time
            singleLinkData.StartStopDwellTime = GpsTripTable[TripID].tripDwellTimes[stopIndex];
            singleLinkData.EndStopDwellTime = GpsTripTable[TripID].tripDwellTimes[stopIndex + 1];
            //Criteria to exclude link data: AvgSpeed
            double avgRunningSpeed = singleLinkData.getRunningSpeed(tempLinkDist);
            if (singleLinkData.RunningTime == 0 || avgRunningSpeed > 120)
            {
                continue;//skip this loop: speed too high, bad data point
            }
            singleLinkData.LINKDATAID = GlobalLinkDataID;
            singleLinkData.LINKID = LinkIDIndexByStartEndAndScheduleID[newLinkIndex];
            singleLinkData.GPSTimeAtLinkStart = Convert.ToInt64(GpsTripTable[TripID].startGPSTime.EpochTimeToLocalDateTime().Date.DateTimeToEpochTime() + (GpsTripTable[TripID].tripStopArrTimes[stopIndex].TotalSeconds));
            //Convert.ToInt64(gpsTripTable[TripID].startGPSTime.EpochTimeToLocalDateTime().Date.DateTimeToEpochTime() + ((gpsTripTable[TripID].tripEstiStopTimes[stopIndex].TotalSeconds + gpsTripTable[TripID].tripEstiStopTimes[stopIndex + 1].TotalSeconds) / 2));
            singleLinkData.TrainOrTestData = 0; // to be assigned at link level
            //determine more x variables
            singleLinkData.DelayAtStart = GpsTripTable[TripID].tripEstiDelays[stopIndex];//(gpsTripTable[TripID].tripEstiDelays[stopIndex + 1] + gpsTripTable[TripID].tripEstiDelays[stopIndex]) / 2;
            singleLinkData.HeadwayAtStart = GpsTripTable[TripID].tripEstiHeadways[stopIndex];//(gpsTripTable[TripID].tripEstiHeadways[stopIndex + 1] + gpsTripTable[TripID].tripEstiHeadways[stopIndex]) / 2;
            singleLinkData.IncidentSQID = -1;//TO-DO: add procedure to determine this
            singleLinkData.WeatherSQID = -1;//TO-DO: add procedure to determine this

            LinkObjectTable[singleLinkData.LINKID].LinkData.Add(singleLinkData);
            GlobalLinkDataID++;
        }
    }
    //File.WriteAllText(modelFolder + "randomNumTest.csv", randomNumTestString.ToString());
    //2. *** Process/Determine Geographical variable values - Modifies SSLinkObjectTable
    //3. *** Process/Determine Other Operational variable values - Modifies SSLinkDataTable
    ParallelOptions pOptions = new ParallelOptions();
    pOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
    List<int> allLinkObjKeys = LinkObjectTable.Keys.ToList();

    //2. Initialize IntxnIDsAll, some OtherLinkVars, and assign training and test set values
    //foreach (int key in linkObjectTable.Keys)
    //{
    Parallel.For(0, LinkObjectTable.Count, pOptions, i =>
    {
        int key = allLinkObjKeys[i];//linkID
        SSLinkObjectTable linkObject = LinkObjectTable[key];

        LinkObjectTable[key].OtherRouteVars = new List<double>();//initialize other route var list
        LinkObjectTable[key].OtherLinkVars = new List<double>();//initialize other link var list
        LinkObjectTable[key] = GetIntxnIDsAndInitialDataForLinkObj(LinkObjectTable[key]);

        List<double> randomNumList = new List<double>();
        List<double> testSetLinkDataIndex = new List<double>();//selection result
        for (int selectIndex = 0; selectIndex < linkObject.LinkData.Count; selectIndex++)
        {
            randomNumList.Add(RandomNumGen.NextDouble());
        }
        List<double> topNumbers = new List<double>(randomNumList);
        topNumbers.Sort();//sorted top numbers
        int numTestSetData = Convert.ToInt32(Math.Floor(topNumbers.Count() * TestSetFraction));
        for (int selectIndex = 0; selectIndex < numTestSetData; selectIndex++)
        {
            testSetLinkDataIndex.Add(randomNumList.IndexOf(topNumbers[selectIndex]));
        }
        for (int selectIndex = 0; selectIndex < linkObject.LinkData.Count; selectIndex++)
        {
            SSLinkDataTable singleLinkData = linkObject.LinkData[selectIndex];
            bool isTestSet = testSetLinkDataIndex.Contains(selectIndex) ? true : false;
            //assign data as test or train
            if (!RandomTestSample)//based on date range
            {
                if (singleLinkData.TrainOrTestData == 0)
                {
                    foreach (DateRange dateRange in TestSetDateRanges)//include first and last day
                    {
                        DateTime startDateTime = dateRange.start.ToUniversalTime();
                        DateTime endDateTime = dateRange.end.ToUniversalTime();
                        if (startDateTime.DateTimeToEpochTime() < singleLinkData.GPSTimeAtLinkStart && endDateTime.DateTimeToEpochTime() > singleLinkData.GPSTimeAtLinkStart)
                        {
                            singleLinkData.TrainOrTestData = 2;//it is test set data
                            break;
                        }
                    }
                }
                //determine if data is test or train, method 1: by date ranges
                if (singleLinkData.TrainOrTestData == 0)
                {
                    foreach (DateRange dateRange in TrainingDateRanges)//include first and last day
                    {
                        DateTime startDateTime = dateRange.start.ToUniversalTime();
                        DateTime endDateTime = dateRange.end.ToUniversalTime();
                        if (startDateTime.DateTimeToEpochTime() < singleLinkData.GPSTimeAtLinkStart && endDateTime.DateTimeToEpochTime() > singleLinkData.GPSTimeAtLinkStart)
                        {
                            singleLinkData.TrainOrTestData = 1;//it is training data
                            break;
                        }
                    }
                }
            }
            else//based on random selection
            {
                //determine if data is test or train, method 2: use random number generator (80% train, 20% test)
                if (singleLinkData.TrainOrTestData == 0)
                {
                    //double tempRndNum = randomNumGenObj.NextDouble();
                    if (!isTestSet)
                    {
                        singleLinkData.TrainOrTestData = 1;//it is training data
                    }
                    else
                    {
                        singleLinkData.TrainOrTestData = 2;//it is test set data
                    }
                }
            }
        }
    });

    Parallel.For(0, LinkObjectTable.Count, pOptions, i =>
    {
        int key = allLinkObjKeys[i];
        SSLinkObjectTable linkObject = LinkObjectTable[key];

        //2B. Get all intersection data (OtherLinkVars) and OtherRouteVars
        LinkObjectTable[key] = GetIntxnDataMissingVolAndStopLoc(LinkObjectTable[key]);
        for (int j = 0; j < LinkObjectTable[key].LinkData.Count; j++)
        {
            //3. LinkData: Incident, Weather
            if (LinkObjectTable[key].LinkData[j] != null)
            {
                LinkObjectTable[key].LinkData[j] = GetLinkDataIncidentIDs(LinkObjectTable[key].LinkData[j]);
                LinkObjectTable[key].LinkData[j] = GetLinkDataWeatherIDs(LinkObjectTable[key].LinkData[j]);
            }
            else
            {
                //if there is any null data, remove it
                LinkObjectTable[key].LinkData.RemoveAt(j);
                j--;
            }
        }
    });
    //clear calculation objects to free memory
    //IsPointOnShapeSegment_preCalcResults.Clear();
    IsPointOnShapeSegment_preCalcPaths = new ConcurrentDictionary<Tuple<int, int, int>, Tuple<List<Tuple<double, double, double>>, double>>();
    //UPDATE LINK DATA
    UpdateAllLinksToDB();
    UpdateStopDataToDB(StopLocTypStringByStopID);
}
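The random train/test assignment in the method above draws one random number per record, sorts a copy of the draws, and flags the records holding the smallest floor(n × testFraction) draws as the test set. A hypothetical Java sketch of that selection scheme (class and method names are illustrative, not from the thesis code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the rank-based random split: each record gets a random
// draw, and the records with the smallest floor(n * testFraction) draws become
// the test set, so the test share is exact rather than only exact in expectation.
public final class TrainTestSplitter {
    public static boolean[] testFlags(int n, double testFraction, Random rng) {
        List<Double> draws = new ArrayList<>();
        for (int i = 0; i < n; i++) draws.add(rng.nextDouble());
        List<Double> sorted = new ArrayList<>(draws);
        Collections.sort(sorted);//ascending copy; draws keeps the original order
        int numTest = (int) Math.floor(n * testFraction);
        boolean[] isTest = new boolean[n];
        for (int k = 0; k < numTest; k++) {
            // index of the record holding the k-th smallest draw
            isTest[draws.indexOf(sorted.get(k))] = true;
        }
        return isTest;
    }
}
```

Compared with flagging each record independently with probability testFraction, this guarantees exactly the requested test-set size per link.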
A.2.7 Intersection ID Processing Initialization Method (GetIntxnIDsAndInitialDataForLinkObj)
private SSLinkObjectTable GetIntxnIDsAndInitialDataForLinkObj(SSLinkObjectTable thisLink)//designed to be called by parallel for (thread safe)
{
    int scheduleID = LinkObjectTable[thisLink.LINKID].GtfsScheduleID;
    int gtfsGroupID = GtfsScheduleTable[LinkObjectTable[thisLink.LINKID].GtfsScheduleID].GroupID;
    string gtfsRouteCode = GtfsRouteGroupTableByGroupID[gtfsGroupID].RouteCode;

    //initialize
    thisLink.IntxnIDsAll = new List<int>();
    if (thisLink.OtherRouteVars == null)
        thisLink.OtherRouteVars = new List<double>();
    int OtherRouteVarsInitialIndex = thisLink.OtherRouteVars.Count;
    thisLink.OtherRouteVars.AddRange(Enumerable.Repeat(0.0, 2));//This method adds 2 variables to OtherRouteVars
    if (thisLink.OtherLinkVars == null)
        thisLink.OtherLinkVars = new List<double>();
    int OtherLinkVarsInitialIndex = thisLink.OtherLinkVars.Count;
    thisLink.OtherLinkVars.AddRange(Enumerable.Repeat(0.0, 10));//This method adds 10 variables to OtherLinkVars

    thisLink.OtherRouteVars[OtherRouteVarsInitialIndex + 0] = RouteList.ContainsKey(gtfsRouteCode) ? Convert.ToDouble(RouteList[gtfsRouteCode].IsStreetcar) : 0;
    thisLink.OtherRouteVars[OtherRouteVarsInitialIndex + 1] = RouteList.ContainsKey(gtfsRouteCode) ? Convert.ToDouble(RouteList[gtfsRouteCode].IsDedicatedROW) : 0;

    //Search through all intersections to find intersections on the link
    foreach (int intKey in IntxnSignalTable.Keys)
    {
        SSINTNXSignalDataTable Intxn = IntxnSignalTable[intKey];
        GeoLocation IntxnLoc = new GeoLocation(Intxn.Latitude, Intxn.Longitude);
        string turnDir = "none";
        Tuple<bool, bool, bool, bool, bool> positionResult = GetPtPosOnShapeSeg(out turnDir, scheduleID, LinkObjectTable[thisLink.LINKID].StartStopID, LinkObjectTable[thisLink.LINKID].EndStopID, IntxnLoc);
        //First Step: determine intersections, the stop type and turn type variables
        //Store stop location information to table object - stopLocTypStringByStopID
        if (Intxn.isSignalizedIntxn)//must use a signalized intersection to determine stop position
        {
            //check start stop type
            if (GtfsStopTable[thisLink.StartStopID].StopType == stopType.stop || GtfsStopTable[thisLink.StartStopID].StopType == stopType.subway_stop)
            {
                if (positionResult.Item1)// && !stopLocTypStringByStopID.ContainsKey(data.StartStopID))
                {
                    //Intxn just outside start stop: start stop is far-sided - x--o
                    StopLocTypStringByStopID.AddOrUpdate(thisLink.StartStopID, "far-side", (k, v) => v);
                }
                else if (positionResult.Item2)
                {
                    //Intxn just inside start stop: start stop is near-sided - o--x
                    StopLocTypStringByStopID.AddOrUpdate(thisLink.StartStopID, "near-side", (k, v) => v);
                }
            }
            if (GtfsStopTable[thisLink.EndStopID].StopType == stopType.stop || GtfsStopTable[thisLink.EndStopID].StopType == stopType.subway_stop)
            {
                if (positionResult.Item4)// && !stopLocTypStringByStopID.ContainsKey(data.StartStopID))
                {
                    //Intxn just inside end stop: end stop is far-sided - x--o
                    StopLocTypStringByStopID.AddOrUpdate(thisLink.EndStopID, "far-side", (k, v) => v);
                }
                else if (positionResult.Item5)
                {
                    //Intxn just outside end stop: end stop is near-sided - o--x
                    StopLocTypStringByStopID.AddOrUpdate(thisLink.EndStopID, "near-side", (k, v) => v);
                }
            }
            //Intxn is signalized and is within segment, check turning movements
            if (positionResult.Item3)//within segment
            {
                thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 0] += (turnDir == "left" ? 1 : 0);//Num_VehLtTurns
                thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 1] += (turnDir == "right" ? 1 : 0);//Num_VehRtTurns
            }
        }
        //Intxn is between start and end stop (all intersection/signal types)
        if (positionResult.Item3)//within segment
        {
            thisLink.IntxnIDsAll.Add(Intxn.pxID);//match found - note this is not an incident id, but rather a database id.
            //Note: the total number of intersections is not used as it could include signalized, ped cross or flashing beacons
        }
        if (positionResult.Item5)//just outside the end of the segment - need to count the TSP intersection but not add it to the list of intersections
        {
            thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 3] += Convert.ToDouble(1);//Num_TSP_equipped
        }
    }
    //Second Step: aggregated variables based on IntxnIDsAll array and intersection types
    //Initialize average link volumes based on intersection volumes
    double total_perApproachIntVehVol = 0.0;//Apply 2.14% Annual Growth Factor on total Veh Vols (Based on Toronto Cordon Counts 2001-2010)
    double total_perApproachIntPedVol = 0.0;//No Growth Factor Applied
    int totalIntnxCountWithVol = 0;
    double totalSignalizedIntnxApproachCount = 0;
    //Link has intersections (signalized, ped cross or flashing beacons)
    if (thisLink.IntxnIDsAll.Count > 0)
    {
        foreach (int intersectionID in thisLink.IntxnIDsAll)
        {
            int numApproach = IntxnSignalTable[intersectionID].no_of_signalized_approaches < 1 ? 1 : IntxnSignalTable[intersectionID].no_of_signalized_approaches;
            if (IntxnSignalTable[intersectionID].vehVol > 0 || IntxnSignalTable[intersectionID].pedVol > 0)//values greater than 0 are assigned volumes
            {
                double growthFactorForVeh = (DateTime.Now.Year - IntxnSignalTable[intersectionID].countDate.Year) < 100 ? (DateTime.Now.Year - IntxnSignalTable[intersectionID].countDate.Year) * (2.14 / 100) : 0;
                total_perApproachIntVehVol += IntxnSignalTable[intersectionID].vehVol * (1 + growthFactorForVeh) / numApproach;
                total_perApproachIntPedVol += IntxnSignalTable[intersectionID].pedVol / numApproach;
                totalIntnxCountWithVol++;
            }
            if (IntxnSignalTable[intersectionID].isSignalizedIntxn)
            {
                totalSignalizedIntnxApproachCount += Convert.ToDouble(numApproach);
            }
            thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 2] += Convert.ToDouble(IntxnSignalTable[intersectionID].isSignalizedIntxn ? 1 : 0);//Num_Intxn
            thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 3] += Convert.ToDouble(IntxnSignalTable[intersectionID].transit_preempt ? 1 : 0);//Num_TSP_equipped
            thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 4] += Convert.ToDouble(IntxnSignalTable[intersectionID].isPedCross ? 1 : 0);//Num_PedCross
        }
    }
    //Third Step: compute some variables from the aggregated variables of the previous steps
    // a. Change num_intxn to Num_VehThroughs: Num_VehThroughs = num_intxn - Num_VehLtTurns - Num_VehRtTurns
    double num_intxn = thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 2];//num_intxn (signalized intersections only)
    double Num_VehLtTurns = thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 0];//read from aggregated number of left turns
    double Num_VehRtTurns = thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 1];//read from aggregated number of right turns
    double Num_VehThroughs = num_intxn - Num_VehLtTurns - Num_VehRtTurns;
    thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 2] = Num_VehThroughs >= 0 ? Num_VehThroughs : 0;//Num_VehThroughs
    thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 5] = totalSignalizedIntnxApproachCount;//Sum_SigIntxnApproach
    double AvgVehVolOnLink = total_perApproachIntVehVol == 0 ? 0 : total_perApproachIntVehVol / totalIntnxCountWithVol;//Apply 2.14% Annual Growth Factor on total Veh Vols (Based on Toronto Cordon Counts 2001-2010)
    double AvgPedVolOnLink = total_perApproachIntPedVol == 0 ? 0 : total_perApproachIntPedVol / totalIntnxCountWithVol;
    thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 6] = Math.Round(AvgVehVolOnLink, 0);//AvgVehVol
    thisLink.OtherLinkVars[OtherLinkVarsInitialIndex + 7] = Math.Round(AvgPedVolOnLink, 0);//AvgPedVol

    return thisLink;
}
A.2.8 Intersection Volume Data Processing Method (GetIntxnDataMissingVolAndStopLoc)
private SSLinkObjectTable GetIntxnDataMissingVolAndStopLoc(SSLinkObjectTable thisLink)
{
    //Search for volumes upstream and downstream when the link has no
    //signalized intersection, or none of its intersections has a recorded volume
    if (thisLink.OtherLinkVars[6] == 0 && thisLink.OtherLinkVars[7] == 0)
    {
        int gtfsGroupID = GtfsScheduleTable[LinkObjectTable[thisLink.LINKID].GtfsScheduleID].GroupID;
        string gtfsRouteCode = GtfsRouteGroupTableByGroupID[gtfsGroupID].RouteCode;

        int currentScheduleIDForSearch = thisLink.GtfsScheduleID;

        //search through all other schedules with the same routeCode, if the current schedule id's link doesn't have volume data
        List<int> scheduleIDsOfRouteGroup = new List<int>(GtfsRouteGroupTableByGroupID[gtfsGroupID].RouteIDs);
        scheduleIDsOfRouteGroup.Remove(currentScheduleIDForSearch);//remove current schedule as it will be searched first anyway

        double upstreamVehVol = -1;
        double upstreamPedVol = -1;
        double downstreamVehVol = -1;
        double downstreamPedVol = -1;
        int deltaI = 1;
        int deltaJ = 1;
        int scheduleListIndex = 0;

        while (upstreamVehVol < 0 && downstreamVehVol < 0 && (scheduleListIndex <= scheduleIDsOfRouteGroup.Count - 1))//need to find at least 1
        {
            //no intersection found on link, use data from upstream and downstream links
            List<int> fullStopList = GtfsScheduleTable[currentScheduleIDForSearch].StopIDs;
            int startStopIndex = fullStopList.IndexOf(thisLink.StartStopID);
            int endStopIndex = fullStopList.IndexOf(thisLink.EndStopID);
            //fix for a link with a different start or end, but not both; if both, set indices out of range and proceed to the next schedule id
            //(note: the both-missing case is checked first; in the original listing it followed the single-missing cases and was unreachable)
            if (startStopIndex == -1 && endStopIndex == -1)
            {
                startStopIndex = -1;//results in upstreamStartStop = -1
                endStopIndex = int.MaxValue;//results in downstreamEndStop = -1
            }
            else if (startStopIndex == -1)
            {
                if (endStopIndex > 1)
                    startStopIndex = endStopIndex - 1;
                else
                    startStopIndex = -1;//results in upstreamStartStop = -1
            }
            else if (endStopIndex == -1)
            {
                if (startStopIndex < (fullStopList.Count - 2))
                    endStopIndex = startStopIndex + 1;
                else
                    endStopIndex = int.MaxValue;//results in downstreamEndStop = -1
            }
            //Initialize stop ids
            int i = startStopIndex;//upstream loc tracker
            int j = endStopIndex;//downstream loc tracker
            int upstreamStartStop = (i > 0) ? fullStopList[i - 1] : -1;
            int upstreamEndStop = (i > 0) ? fullStopList[i] : -1;
            int downstreamStartStop = (j < (fullStopList.Count - 1)) && (j > 0) ? fullStopList[j] : -1;
            int downstreamEndStop = (j < (fullStopList.Count - 1)) && (j > 0) ? fullStopList[j + 1] : -1;

            while (upstreamStartStop != -1 && upstreamVehVol < 0)//while the bound hasn't been reached and a volume hasn't been found
            {
                Tuple<int, int, int> upstreamIDSearchKey = new Tuple<int, int, int>(upstreamStartStop, upstreamEndStop, currentScheduleIDForSearch);
                int upstreamLinkID = LinkIDIndexByStartEndAndScheduleID.ContainsKey(upstreamIDSearchKey) ? LinkIDIndexByStartEndAndScheduleID[upstreamIDSearchKey] : -1;
                if (upstreamLinkID > 0)
                {
                    upstreamVehVol = LinkObjectTable[upstreamLinkID].OtherLinkVars[6];//AvgVehVol
                    upstreamPedVol = LinkObjectTable[upstreamLinkID].OtherLinkVars[7];//AvgPedVol
                }
                i--;
                upstreamStartStop = (i > 0) ? fullStopList[i - 1] : -1;
                upstreamEndStop = fullStopList[i];
            }//end while

            while (downstreamEndStop != -1 && downstreamVehVol < 0)//while the bound hasn't been reached and a volume hasn't been found
            {
                Tuple<int, int, int> downstreamIDSearchKey = new Tuple<int, int, int>(downstreamStartStop, downstreamEndStop, currentScheduleIDForSearch);
                int downstreamLinkID = LinkIDIndexByStartEndAndScheduleID.ContainsKey(downstreamIDSearchKey) ? LinkIDIndexByStartEndAndScheduleID[downstreamIDSearchKey] : -1;
                if (downstreamLinkID > 0)
                {
                    downstreamVehVol = LinkObjectTable[downstreamLinkID].OtherLinkVars[6];//AvgVehVol
                    downstreamPedVol = LinkObjectTable[downstreamLinkID].OtherLinkVars[7];//AvgPedVol
                }
                j++;
                downstreamStartStop = fullStopList[j];
                downstreamEndStop = (j < (fullStopList.Count - 1)) ? fullStopList[j + 1] : -1;
            }//end while

            deltaJ = j - endStopIndex;
            deltaI = startStopIndex - i;
            currentScheduleIDForSearch = scheduleIDsOfRouteGroup[scheduleListIndex];//assign at the end so the first schedule searched is the current link's
            scheduleListIndex++;

            ////The following code is more efficient but may not find any volume
            //int downstreamEndStop = (endStopIndex < fullStopList.Count - 1) ? fullStopList[endStopIndex + 1] : -1;
            //int upstreamStartStop = (startStopIndex > 0) ? fullStopList[startStopIndex - 1] : -1;
            ////search upstream
            //Tuple<int, int, int> upstreamIDSearchKey = new Tuple<int, int, int>(thisLink.GtfsScheduleID, upstreamStartStop, thisLink.StartStopID);
            //int upstreamLinkID = linkIDIndexByGtfsScheduleIDStartAndEndStop.ContainsKey(upstreamIDSearchKey) ? linkIDIndexByGtfsScheduleIDStartAndEndStop[upstreamIDSearchKey] : -1;
            //if (upstreamLinkID > 0)
            //{
            //    upstreamVehVol = linkObjectTable[upstreamLinkID].OtherLinkVars[6];//AvgVehVol
            //    upstreamPedVol = linkObjectTable[upstreamLinkID].OtherLinkVars[7];//AvgPedVol
            //}
            ////search downstream
            //Tuple<int, int, int> downstreamIDSearchKey = new Tuple<int, int, int>(thisLink.GtfsScheduleID, thisLink.EndStopID, downstreamEndStop);
            //int downstreamLinkID = linkIDIndexByGtfsScheduleIDStartAndEndStop.ContainsKey(downstreamIDSearchKey) ? linkIDIndexByGtfsScheduleIDStartAndEndStop[downstreamIDSearchKey] : -1;
            //if (downstreamLinkID > 0)
            //{
            //    downstreamVehVol = linkObjectTable[downstreamLinkID].OtherLinkVars[6];//AvgVehVol
            //    downstreamPedVol = linkObjectTable[downstreamLinkID].OtherLinkVars[7];//AvgPedVol
            //}
        }//end while

        //Apply 2.14% Annual Growth Factor to total Veh Vols (Based on Toronto Cordon Counts 2001-2010)
        //default: 4460 Veh Vol if none can be found on any link of the route (~500 veh/hr per approach) - 2017 Jan Open Data Toronto
        //default: 500 Ped Vol if none can be found on any link of the route (~70 ped/hr per approach) - 2017 Jan Open Data Toronto
        //Weighted Avg Vol by Delta Index = (Vj * DeltaI + Vi * DeltaJ) / (DeltaI + DeltaJ)
        double AvgVehVolOnLink = upstreamVehVol > -1 ?
            (downstreamVehVol > -1 ? ((upstreamVehVol * deltaJ) + (downstreamVehVol * deltaI)) / (deltaJ + deltaI) : upstreamVehVol) :
            (downstreamVehVol > -1 ? downstreamVehVol : 4460);//AvgVehVol
        double AvgPedVolOnLink = upstreamPedVol > -1 ?
            (downstreamPedVol > -1 ? ((upstreamPedVol * deltaJ) + (downstreamPedVol * deltaI)) / (deltaJ + deltaI) : upstreamPedVol) :
            (downstreamPedVol > -1 ? downstreamPedVol : 500);//AvgPedVol
        thisLink.OtherLinkVars[6] = Math.Round(AvgVehVolOnLink / 8, 0);//AvgVehVol: avg veh/hr/approach
        thisLink.OtherLinkVars[7] = Math.Round(AvgPedVolOnLink / 8, 0);//AvgPedVol: avg ped/hr/approach
    }

    //double check on stop locations
    //thisLink.StartStopLocTyp = stopLocTypStringByStopID.ContainsKey(data.StartStopID) ? stopLocTypStringByStopID[data.StartStopID] : "midblock";
    //thisLink.EndStopLocTyp = stopLocTypStringByStopID.ContainsKey(data.EndStopID) ? stopLocTypStringByStopID[data.EndStopID] : "midblock";
    bool isStartStopNearSided = StopLocTypStringByStopID.ContainsKey(thisLink.StartStopID) ? (StopLocTypStringByStopID[thisLink.StartStopID] == "near-side" ? true : false) : false;
    bool isEndStopFarSided = StopLocTypStringByStopID.ContainsKey(thisLink.EndStopID) ? (StopLocTypStringByStopID[thisLink.EndStopID] == "far-side" ? true : false) : false;
    thisLink.OtherLinkVars[8] = isStartStopNearSided == true ? 1 : 0;//isStartStopNearSided //Changed from startStopLocTyp
    //stopLocTypStringByStopID.ContainsKey(data.StartStopID) ? (stopLocTypStringByStopID[data.StartStopID] == "near-side" ? 1 : (stopLocTypStringByStopID[data.StartStopID] == "far-side" ? 2 : 0)) : 0;
    thisLink.OtherLinkVars[9] = isEndStopFarSided == true ? 1 : 0;//isEndStopFarSided //Changed from endStopLocTyp
    //stopLocTypStringByStopID.ContainsKey(data.EndStopID) ? (stopLocTypStringByStopID[data.EndStopID] == "near-side" ? 1 : (stopLocTypStringByStopID[data.EndStopID] == "far-side" ? 2 : 0)) : 0;

    return thisLink;
}
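The volume imputation above combines upstream and downstream values with weights inversely related to how far away each was found (the nearer value gets the larger weight), and falls back to a route-wide default when neither exists. A hypothetical Java sketch of that weighting rule, with illustrative names not taken from the thesis code:

```java
// Hypothetical sketch: impute a link volume from upstream/downstream neighbours.
// A value of -1 means "not found"; deltaI and deltaJ are the numbers of links
// searched upstream and downstream before each value was found.
public final class VolumeImputer {
    public static double weightedVolume(double upVol, int deltaI,
                                        double downVol, int deltaJ,
                                        double fallback) {
        if (upVol > -1 && downVol > -1) {
            // cross-weighting: the upstream value is weighted by the downstream
            // search distance deltaJ (and vice versa), so the nearer value dominates
            return (upVol * deltaJ + downVol * deltaI) / (double) (deltaI + deltaJ);
        } else if (upVol > -1) {
            return upVol;
        } else if (downVol > -1) {
            return downVol;
        }
        return fallback;//e.g. a route-wide default volume
    }
}
```

For instance, an upstream volume found 1 link away is weighted three times as heavily as a downstream volume found 3 links away.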
A.2.9 Stop Data Processing Method (UpdateStopDataToDB)
private int UpdateStopDataToDB(ConcurrentDictionary<int, string> stopLocTypByStopID)
{
    int numChangeBefore = DbConn_numChanges;
    if (SSUtil.GetTableSize(DbConn_Simulator, "TTCLINK_STOPDATA") > 0)
    {
        DeleteFromTableDatabase("TTCLINK_STOPDATA");//delete old data, if any
    }
    //write to table
    List<int> stopIDKeys = stopLocTypByStopID.Keys.ToList();
    stopIDKeys.Sort();
    string sql;
    using (SQLiteTransaction tr = DbConn_Simulator.BeginTransaction())
    {
        sql = "INSERT OR IGNORE INTO TTCLINK_STOPDATA (STOPID, StopLocTyp) VALUES (@STOPID, @StopLocTyp)";

        using (SQLiteCommand cmd = new SQLiteCommand(sql, DbConn_Simulator))
        {
            foreach (int stopID in stopIDKeys)
            {
                cmd.Transaction = tr;

                cmd.Parameters.Add("@STOPID", DbType.Int32).Value = stopID;
                cmd.Parameters.Add("@StopLocTyp", DbType.String).Value = StopLocTypStringByStopID[stopID];

                cmd.CommandType = CommandType.Text;

                int rowsAffected = cmd.ExecuteNonQuery();
                //no need to update
                DbConn_numChanges += rowsAffected;
            }
        }
        tr.Commit();
    }
    return DbConn_numChanges - numChangeBefore;
}
A.2.10 Link Attribute Update Method (UpdateAllLinksToDB)
private void UpdateAllLinksToDB()
{
    //clear any previous link data and drop previous tables
    DeleteFromTableDatabase("TTCLINKDATA");
    DeleteFromTableDatabase("TTCLINKOBJ");

    //4. Add linkObjectTable to the TTCLINKOBJ and TTCLINKDATA tables (note these tables are write/initializing only and are not intended to be modified)
    List<SSLinkObjectTable> allNewLinks = LinkObjectTable.Values.ToList();
    List<SSLinkDataTable> allNewLinkData = new List<SSLinkDataTable>();
    foreach (SSLinkObjectTable link in LinkObjectTable.Values.ToList())
    {
        allNewLinkData.AddRange(link.LinkData);
    }
    List<SSLinkObjectTable> duplicateLinks;
    AddLinkObjToDB(allNewLinks, out duplicateLinks);
    AddLinkDataToDB(allNewLinkData);
}
A.3 Model Estimation Algorithm
A.3.1 Model Estimation for Running Speed Method
(RModel_RunningSpeedModelEstimation)
public void RModel_RunningSpeedModelEstimation()
{
    UpdateGUI_StatusBox("Model estimation started...", 5.0, 0.0, 5.0);

    // To prevent overwriting the existing workbench, rename it if it already exists
    if (File.Exists(Sim_RModel.RDataWorkBenchRPath))
    {
        FileNameHandler.BackupThisFilename(Sim_RModel.RDataWorkBenchRPath);
    }

    bool readFromRegistry = true;
    if (readFromRegistry)
    {
        using (RegistryKey registryKey = Registry.LocalMachine.OpenSubKey(@"SOFTWARE\R-core\R"))
        {
            var envPath = Environment.GetEnvironmentVariable("PATH");
            string rBinPath = (string)registryKey.GetValue("InstallPath");
            string rVersion = (string)registryKey.GetValue("Current Version");
            rBinPath = System.Environment.Is64BitProcess ? rBinPath + "\\bin\\x64" : rBinPath + "\\bin\\i386";
            Environment.SetEnvironmentVariable("PATH", envPath + Path.PathSeparator + rBinPath);

            //Environment.SetEnvironmentVariable("R_HOME", rBinPath);
        }
    }
    else
    {
        //enter path manually
        string rPath32 = @"C:\Program Files\Microsoft\R Client\R_SERVER\bin\x64";
        string rPath64 = @"C:\Program Files\Microsoft\R Client\R_SERVER\bin\x64";
        var oldPath = System.Environment.GetEnvironmentVariable("PATH");
        var rPath = System.Environment.Is64BitProcess ? rPath64 : rPath32;

        //var rPath = @"C:\Program Files\Microsoft\R Client\R_SERVER\bin\x64";

        if (!Directory.Exists(rPath))
            throw new DirectoryNotFoundException(string.Format("R.dll not found in: {0}", rPath));

        var newPath = string.Format("{0}{1}{2}", rPath, System.IO.Path.PathSeparator, oldPath);
        System.Environment.SetEnvironmentVariable("PATH", newPath);
    }

    UpdateGUI_StatusBox("", 35.0, 0.0, 5 * 60.0);//5 minutes for next task

    //Tutorials on R.NET: http://jmp75.github.io/rdotnet/tut_basic_types/

    //REngine.SetEnvironmentVariables("R_Home");
    // There are several options to initialize the engine, but by default the following suffices:
    REngine engine = REngine.GetInstance();

    // train models
    string currentWorkingDirR = ModelFolder.Replace("\\", "/");//change file path string for R
    string setwdCmd = string.Format("setwd('{0}')", currentWorkingDirR);
    engine.Evaluate(setwdCmd);
    UpdateGUI_StatusBox("", 65.0, 0.0, 15 * 60.0);//15 minutes for next task
    //string sourceCmd_noRoute = string.Format("source('{0}/{1}')", Sim_RModel.RFolder, Sim_RModel.TaskScript_NoRoute);
    //engine.Evaluate(sourceCmd_noRoute);
    string sourceCmd = string.Format("source('{0}/{1}')", Sim_RModel.RFolder, Sim_RModel.TaskScript_Default);
    engine.Evaluate(sourceCmd);
    UpdateGUI_StatusBox("", 95.0, 0.0, 60.0);//1 minute for next task
    string saveImgCmd = string.Format("save.image('{0}')", Sim_RModel.RDataWorkBenchRPath);
    engine.Evaluate(saveImgCmd);

    // Always dispose of the REngine properly.
    // After disposing of the engine, it cannot be reinitialized or reused.
    //engine.Dispose();//dispose at program exit instead
    UpdateGUI_StatusBox("Estimation complete.", 100.0, 0.0);
}
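The R.NET sequence above amounts to: point the process at an R installation, setwd() to the model folder, source() the estimation script, and save.image() the resulting workbench. An alternative sketch drives the same steps through a child process instead of an embedded engine; the helper below only builds the command line, and all names are illustrative (it assumes an `Rscript` executable on PATH, which is not part of the thesis toolchain):

```python
def build_r_command(r_folder, task_script, workbench_path, rscript="Rscript"):
    """Compose an Rscript invocation that sources the estimation script
    and then saves the workspace image, mirroring the
    setwd()/source()/save.image() sequence above."""
    expr = (f"source('{r_folder}/{task_script}'); "
            f"save.image('{workbench_path}')")
    return [rscript, "-e", expr]

# e.g. subprocess.run(build_r_command("R", "ME_TaskHandler_LME.R", "wb.RData"), check=True)
```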
A.3.2 Model Estimation for Dwell Time Method
(Model_StopModelsEstimation)
public void Model_StopModelsEstimation(bool saveAllData = false)
{
    Sim_DwellTimeModel_ByStartStopID = new Dictionary<int, DwellTimeModel>();

    int neitherOrTrainOrTest = 1;//1 for train, 2 for test

    // Load training data for Dwell Time Distribution Models
    Dictionary<string, Dictionary<string, List<SSLinkModelBaseData>>> allTrainingDataRef_ByRouteCode_ByLinkName = new Dictionary<string, Dictionary<string, List<SSLinkModelBaseData>>>();//SimRef_TestDataRef_ByRouteCode_ByLinkName

    allTrainingDataRef_ByRouteCode_ByLinkName = ImportLinkDataForModelRef("", "", neitherOrTrainOrTest, "NULL");//get real test data from csv

    //check and load GTFS ref data
    if (GtfsScheduleTable == null)
    {
        LoadGTFSReferenceData();// read GTFS data for network initialization
    }

    //restructure Sim_Ref data; get data for terminal stations and intermediate stops
    Dictionary<int, List<SSLinkModelBaseData>> allRefData_ByStopID = new Dictionary<int, List<SSLinkModelBaseData>>();
    foreach (string key in allTrainingDataRef_ByRouteCode_ByLinkName.Keys)
    {
        foreach (string innerKey in allTrainingDataRef_ByRouteCode_ByLinkName[key].Keys)
        {
            foreach (SSLinkModelBaseData data in allTrainingDataRef_ByRouteCode_ByLinkName[key][innerKey])
            {
                //for all start stops
                int startStopID = data.StartStopID;
                if (!allRefData_ByStopID.ContainsKey(startStopID))
                {
                    allRefData_ByStopID.Add(startStopID, new List<SSLinkModelBaseData>());
                }//end if
                allRefData_ByStopID[startStopID].Add(data);
            }//end for each data
        }//end for each inner key
    }//end for each key

    /* Dwell Time Model */
    //train network-wide model
    List<double> allDwellTimeData = new List<double>();
    foreach (int startStopID in allRefData_ByStopID.Keys)
    {
        allDwellTimeData.AddRange(allRefData_ByStopID[startStopID].Select(o => o.DwellTime).ToList());
    }
    try
    {
        DwellTimeModel networkLvlDwellTimeModel = new DwellTimeModel(allDwellTimeData, true, RandomNumGen);//always save network level model training data for later use
        Sim_DwellTimeModel_ByStartStopID.Add(0, networkLvlDwellTimeModel);
    }
    catch (ArgumentException notEnoughData)
    {
        //do nothing
    }
    //train terminal stop specific models
    foreach (int startStopID in allRefData_ByStopID.Keys)
    {
        List<double> newTrainingData = allRefData_ByStopID[startStopID].Select(o => o.DwellTime).ToList();
        try
        {
            DwellTimeModel newDwellModel = new DwellTimeModel(newTrainingData, saveAllData, RandomNumGen);
            Sim_DwellTimeModel_ByStartStopID.Add(startStopID, newDwellModel);
        }
        catch (ArgumentException notEnoughData)
        {
            continue; //skip this stop
        }
    }
}
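The DwellTimeModel class used above is not reproduced in this appendix. As context for those constructor calls, a minimal sketch of a lognormal dwell time model in Python (the two-sample minimum, class layout, and all names are assumptions for illustration, not the thesis implementation):

```python
import math
import random

class DwellTimeModel:
    """Minimal lognormal dwell time model (illustrative sketch).

    Fits the mean and standard deviation of log dwell times and
    samples new dwell times from the fitted lognormal distribution.
    """
    MIN_SAMPLES = 2  # assumed threshold; the real code raises on "not enough data"

    def __init__(self, dwell_times, rng=None):
        data = [t for t in dwell_times if t > 0]  # log() requires positive times
        if len(data) < self.MIN_SAMPLES:
            raise ValueError("not enough data")
        logs = [math.log(t) for t in data]
        self.mu = sum(logs) / len(logs)
        self.sigma = math.sqrt(sum((x - self.mu) ** 2 for x in logs) / (len(logs) - 1))
        self.rng = rng or random.Random()

    def sample(self):
        """Draw one dwell time (always positive, by construction)."""
        return self.rng.lognormvariate(self.mu, self.sigma)
```

The stop-specific/network-level fallback in the C# code then reduces to: look up the stop's own model, and fall back to the model stored under key 0 when the stop had too few observations to fit.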
A.3.3 Model Estimation Script for LME in R (ME_TaskHandler_LME.R)
### Model Estimator for Network with or without RouteCode Var ###

# LME Running Speed Model used

### Set Working Directory and Data Name(s) ----
setwd("C:\\Nexus\\AppData\\MODEL\\Nexus.SSRTool\\")
dataFileNames = c("Data\\linkData.csv")
#"Data\\linkData-20170228_20170309.csv"
#"Data\\linkData-20170228_20170309-2DayDebug.csv"

### Global Model Settings ----
settings <<-
  list(
    selectRouteNumber = "0", #0-Network, 7-Bathurst, 512-StClair, 506-Carlton, 504-King
    fileLabel = "Network", #Network, Bathurst, StClair
    loadDataOption = "Global",
    modelType = "LME", #MLR, SVM, LME, CART_RT, CART_FRF
    performEDA = FALSE,
    plotAnalysis = FALSE, #FALSE
    resampleTrainTest = FALSE, # When FALSE, trainingFraction and subSampleFraction are not used
    trainingFraction = 0.8,
    subSampleFraction = 0.25,
    nTree = 100 # number of trees - applies to RF
  )
# Retrieve by: settings[['modelType']], etc.
# loadDataOption: ByRoute, Global, (separateTrainAndTestFiles-LEGACY-DoneByC#Code)
# modelType: type of model to be trained (MLR, SVM, LME, CART_RT, CART_RF, CART_FRF, URF) - accepts any in the string
# plotAnalysis: print result and generate relevant plots
# performEDA: generate scatterplots and histograms for Exploratory Data Analysis
# resampleTrainTest: TRUE will enable LoadData functions to resample data by linkname
#                    FALSE will load existing data with predetermined split
# trainingFraction: fraction of sample to be assigned as training (80% is default)
# subSampleFraction: fraction of training sample used (1 for no subsampling)

### Load Data ----
source("LoadData.R")
if (!exists("testData") || !exists("trainingData")) {
  if (settings[['loadDataOption']] == "ByRoute") {
    data <-
      Fn_LoadData(
        dataFileNames = dataFileNames,
        trainingFraction = settings[['trainingFraction']],
        subSampleFraction = settings[['subSampleFraction']],
        resampleTrainTest = settings[['resampleTrainTest']],
        separateByRoute = TRUE
      )
    trainingData <- data$trainingData
    testData <- data$testData
    routeCodes <- levels(data$routeCodes)
    rm(data, dataFileNames)
  } else if (settings[['loadDataOption']] == "Global") {
    data <-
      Fn_LoadData(
        dataFileNames = dataFileNames,
        trainingFraction = settings[['trainingFraction']],
        subSampleFraction = settings[['subSampleFraction']],
        resampleTrainTest = settings[['resampleTrainTest']],
        separateByRoute = FALSE
      )
    trainingData <- data$'trainingData'
    testData <- data$'testData'
    routeCodes <- c("0")
    rm(data, dataFileNames)
  }
}


### Variable Definitions: response, fixed and random variable(s) ----
# responseVar
yName = "RunningSpeed" #"TravelSpeed"

# fixedVars
xFixedNames = c(
  #"RunningSpeed", #Y
  "RouteCode", #Optional-Var-CANNOT USE IT WITH RandomForest - can have numerical difficulties # do not use when LinkName is used
  "HasIncident", #Optional-Var - leads to rank deficiency
  "PrevLinkRunningSpeed", #Optional-Var
  "PrevTripRunningSpeed", #Optional-Var
  "Day", #Optional-Var
  "Time_mins", # minutes since midnight
  "LinkDist",
  "TerminalDelay",
  "Delay",
  "Headway",
  #"ScheduleHeadway",
  "HeadwayRatio", # observed over schedule headway
  "TotalPptn",
  "Num_VehLtTurns",
  "Num_VehRtTurns",
  "Num_VehThroughs",
  "Num_TSP_equipped", # "hasTSP",
  "Num_PedCross",
  #"Sum_SigIntxnApproach", # do not use with num_**Turns
  "AvgVehVol",
  "AvgPedVol",
  "IsStartStopNearSided",
  "IsEndStopFarSided"
  #"IsStreetcar", # do not use when RouteCode is used
  #"IsSeparatedROW" # do not use when RouteCode is used
  #"LinkName"
  # for network implementation, assume: TotalPptn = 0, HasIncident = F.
  # the rest of the variables can be obtained from SIM-TTC.db -> TTCLINKOBJ table
)
# randomVars
xRandName = c(
  #"RouteCode" #Optional-Var-CANNOT USE IT WITH RandomForest
  #"LinkDist" #"LinkDist.bins" # Note: not as Random Var
  #"RouteCode"
  "LinkName" # do not use with RouteCode
)
# X for non-mixed effect models
xNames = xFixedNames#c(xFixedNames, xRandName)

### Load General analysis function ----
source("ME_Analysis.R")
# STANDARD MODEL OUTPUT FOR ANALYSIS RESULT:
# model = fit
# testResult = [preAdjR2, mseTest, varPTest, adjR2, mseTrain, varPTrain]

route <- 0
# i <- "0"
# for (route in routeCodes) {
## Temporarily assign trainingData and testData for current route
# if (i == "0") {
#   trainingData_All <- trainingData
#   testData_All <- testData
# }
# trainingData <- trainingData_All[[route]]
# testData <- testData_All[[route]]
# if (route == "0") {
#   rm(trainingData_All, testData_All)# no need for global model
# }

#Train on only selected route
if (settings[['selectRouteNumber']] != 0) {
  route <- settings[['selectRouteNumber']]
  trainingData <- split(trainingData, trainingData$RouteCode)# split by RouteCode, accessible by $'value'
  testData <- split(testData, testData$RouteCode)# split by RouteCode, accessible by $'value'
  trainingData <- trainingData[[settings[['selectRouteNumber']]]]
  testData <- testData[[settings[['selectRouteNumber']]]] #$'504'

  # important: relevel test data so levels are consistent between test and train
  # https://stats.stackexchange.com/questions/235764/new-factors-levels-not-present-in-training-data
  trainingData <- droplevels(trainingData) # drop unused levels in training
  testData$LinkName <- factor(testData$LinkName, levels = levels(trainingData$LinkName))
  testData$RouteCode <- factor(testData$RouteCode, levels = levels(trainingData$RouteCode))
  testDataSizeBefore <- nrow(testData)
  testData <- na.omit(testData)#, cols = c("LinkName","RouteCode"))
  testDataSizeAfter <- nrow(testData)
  if (testDataSizeAfter != testDataSizeBefore) {
    message(paste0(testDataSizeBefore - testDataSizeAfter), " rows in test data deleted due to rank deficiencies to avoid prediction warnings.")
  }
  trainingData <- na.omit(trainingData)#, cols = c("LinkName","RouteCode"))
}

### Initialize model list ----
modelCode <- "not specified"
trainedModelList <- list()
trainedModelResults <- list()

### Perform EDA ----
# WARNING: performing EDA is slow for very large data sets
if (settings[['performEDA']]) {
  source("EDA_Graphs.R")
  pngBaseFileName = paste0("EDA_Plot", "_", settings[['fileLabel']])
  plotName <- paste0("Plots\\Plots_EDA_Plots", "_", settings[['fileLabel']], ".pdf")
  EDA_Plots(trainingData, columnSize = 2, fileName = plotName, PNGBaseFileName = pngBaseFileName)
  histName <- paste0("Plots\\Plots_EDA_Histograms", "_", settings[['fileLabel']], ".pdf")
  EDA_Histograms(trainingData, columnSize = 2, fileName = histName)
  rm(plotName, histName)
}

# compare varPTest and varPTrain (use for global model)
varO_train <- var(trainingData$RunningSpeed)
varO_test <- var(testData$RunningSpeed)


### Train RouteCode Model ----

### Training Model and Get Results ----
#if (settings[['modelType']] == "MLR")
if (grepl("MLR", settings[['modelType']])) {
  modelCode <- paste0(route, "-MLR")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_MLR.R")
  ##Multiple Linear Regression
  trainedModelList[[modelCode]] <-
    Fn_ME_MLR(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  summary(trainedModelList[[modelCode]])
  anova(trainedModelList[[modelCode]])
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "MLR",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )
  # clean large data from fit
  trainedModelList[[modelCode]]$qr$qr = NULL

}

#if (settings[['modelType']] == "LME")
if (grepl("LME", settings[['modelType']])) {
  modelCode <- paste0(route, "-LME")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_LME.R")
  ##Mixed Effect Model
  trainedModelList[[modelCode]] <-
    Fn_ME_LME(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xFixedNames = xFixedNames,
      xRandName = xRandName,
      enableAnovaReport = FALSE
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "LME",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("SVM", settings[['modelType']])) {
  modelCode <- paste0(route, "-SVM")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_SVM.R")
  ##Support Vector Regression
  trainedModelList[[modelCode]] <-
    Fn_ME_SVM(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "SVM",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("CART_RT", settings[['modelType']])) {
  modelCode <- paste0(route, "-CART_RT")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_CART_RT.R")
  ##Regression Tree
  trainedModelList[[modelCode]] <-
    Fn_ME_RT(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "CART_RT",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("CART_FRF", settings[['modelType']])) {
  modelCode <- paste0(route, "-CART_FRF_", settings[['nTree']])
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_CART_FRF.R")
  ##Fast Random Forest
  trainedModelList[[modelCode]] <-
    Fn_ME_FRF(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames,
      numTree = settings[['nTree']]
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "CART_FRF",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("SVMLibLin", settings[['modelType']])) {
  modelCode <- paste0(route, "-SVMLibLin")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_SVMLibLin.R")
  ##Support Vector Regression - Linear, Faster
  trainedModelList[[modelCode]] <-
    Fn_ME_SVMLibLin(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis_SVMLibLin(
      modelCode = "SVMLibLin",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("CART_RF", settings[['modelType']])) {
  modelCode <- paste0(route, "-CART_RF")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_CART_RF.R")
  ##'Slow' Random Forest
  ##Will not take RouteCode as a variable
  trainedModelList[[modelCode]] <-
    Fn_ME_RF(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames,
      numTree = settings[['nTree']]
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis_RF(
      modelCode = "CART_RF",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

# Clear Data from Memory
#rm(testData, trainingData)
gc(verbose = FALSE)
# }
sink()

# load prepared model
fit <- trainedModelList[[modelCode]]

trainedModelList <- list()
trainedModelResults <- list()
gc(verbose = FALSE)


### Train No RouteCode & No LinkName Model ----

# Variable assignment without RouteCode for secondary no-route model
xFixedNames <- xFixedNames[xFixedNames != "RouteCode"]
xNames = xFixedNames#c(xFixedNames, xRandName)
# randomVars
xRandName = c(
  "Day"
  #"RouteCode" #Optional-Var-CANNOT USE IT WITH RandomForest
  #"LinkDist" #"LinkDist.bins" # Note: not as Random Var since it is continuous
  #"RouteCode"
  #"LinkName" # do not use with RouteCode
)

### Training Model and Get Results ----
#if (settings[['modelType']] == "MLR")
if (grepl("MLR", settings[['modelType']])) {
  modelCode <- paste0(route, "-MLR")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_MLR.R")
  ##Multiple Linear Regression
  trainedModelList[[modelCode]] <-
    Fn_ME_MLR(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  summary(trainedModelList[[modelCode]])
  anova(trainedModelList[[modelCode]])
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "MLR",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )
  # clean large data from fit
  trainedModelList[[modelCode]]$qr$qr = NULL

}

#if (settings[['modelType']] == "LME")
if (grepl("LME", settings[['modelType']])) {
  modelCode <- paste0(route, "-LME")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_LME.R")
  ##Mixed Effect Model
  trainedModelList[[modelCode]] <-
    Fn_ME_LME(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xFixedNames = xFixedNames,
      xRandName = xRandName,
      enableAnovaReport = FALSE
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "LME",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("SVM", settings[['modelType']])) {
  modelCode <- paste0(route, "-SVM")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_SVM.R")
  ##Support Vector Regression
  trainedModelList[[modelCode]] <-
    Fn_ME_SVM(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "SVM",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("CART_RT", settings[['modelType']])) {
  modelCode <- paste0(route, "-CART_RT")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_CART_RT.R")
  ##Regression Tree
  trainedModelList[[modelCode]] <-
    Fn_ME_RT(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "CART_RT",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("CART_FRF", settings[['modelType']])) {
  modelCode <- paste0(route, "-CART_FRF_", settings[['nTree']])
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_CART_FRF.R")
  ##Fast Random Forest
  trainedModelList[[modelCode]] <-
    Fn_ME_FRF(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames,
      numTree = settings[['nTree']]
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis(
      modelCode = "CART_FRF",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("SVMLibLin", settings[['modelType']])) {
  modelCode <- paste0(route, "-SVMLibLin")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_SVMLibLin.R")
  ##Support Vector Regression - Linear, Faster
  trainedModelList[[modelCode]] <-
    Fn_ME_SVMLibLin(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis_SVMLibLin(
      modelCode = "SVMLibLin",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

if (grepl("CART_RF", settings[['modelType']])) {
  modelCode <- paste0(route, "-CART_RF")
  Fn_Analysis_StartSink(modelCode = modelCode, fileLabel = settings[['fileLabel']])
  source("ME_CART_RF.R")
  ##'Slow' Random Forest
  ##Will not take RouteCode as a variable
  trainedModelList[[modelCode]] <-
    Fn_ME_RF(
      trainingData = trainingData,
      testData = testData,
      yName = yName,
      xNames = xNames,
      numTree = settings[['nTree']]
    )
  trainedModelResults[[modelCode]] <-
    Fn_Analysis_RF(
      modelCode = "CART_RF",
      trainingData = trainingData,
      testData = testData,
      fit = trainedModelList[[modelCode]],
      plotResult = settings[['plotAnalysis']]
    )

}

# Clear Data from Memory
#rm(testData, trainingData)
gc(verbose = FALSE)
# }
#sink()

# load prepared model
fitNoLinkOrRoute <- trainedModelList[[modelCode]]

# allows checks to avoid invalid prediction using fit (no RouteCode and no LinkName)
fit_RouteCodes <- levels(trainingData$RouteCode)
fit_LinkNames <- levels(trainingData$LinkName)

trainedModelList <- list()
trainedModelResults <- list()
rm(testData, trainingData)
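The releveling step in the script exists because a model trained on one set of factor levels cannot predict for link or route codes it never saw; unseen levels must be mapped to NA and dropped before prediction. The same guard, stripped to its essence in Python (illustrative helper, not part of the toolchain):

```python
def align_levels(train_values, test_values):
    """Mirror the droplevels()/factor(levels = ...)/na.omit() sequence:
    test-set categories unseen in training become None (R's NA) and are
    dropped; the count of dropped rows is reported, as in the script."""
    train_levels = set(train_values)
    aligned = [v if v in train_levels else None for v in test_values]
    kept = [v for v in aligned if v is not None]
    dropped = len(aligned) - len(kept)
    return kept, dropped
```

For example, a test row on route "512" would be dropped when only routes "504" and "505" appeared in the training split.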
A.4 Model Simulation Algorithm
A.4.1 Data Import for Simulation Method (ImportLinkDataForModelRef)
private Dictionary<string, Dictionary<string, List<SSLinkModelBaseData>>> ImportLinkDataForModelRef(string fileFolder = "", string fileName = "", int NeitherOrTrainOrTest = -1, string DayFilter = "NULL")
{
    Dictionary<string, Dictionary<string, List<SSLinkModelBaseData>>> sortedData = new Dictionary<string, Dictionary<string, List<SSLinkModelBaseData>>>();
    List<SSLinkModelBaseData> dataRows = new List<SSLinkModelBaseData>();
    //prepare file path, name and directory
    if (fileName == "" || fileName == null)
    {
        fileName = "linkData.csv"; //default
    }
    if (fileFolder == "" || fileFolder == null)
    {
        fileFolder = ModelDataFolder; //default
    }
    string filePath = fileFolder + fileName;
    if (!File.Exists(filePath))
    {
        return sortedData;
    }

    //instructions: http://joshclose.github.io/CsvHelper/
    //read records one by one manually to screen out data not needed
    using (var csv = new CsvReader(new StreamReader(filePath, Encoding.Default)))
    {
        csv.Configuration.Delimiter = ",";
        csv.Configuration.HasHeaderRecord = true;
        csv.Configuration.CultureInfo = CultureInfo.InvariantCulture;

        //dataRows.AddRange(csv.GetRecords<SSLinkModelBaseData>().ToList());
        while (csv.Read())
        {
            var record = csv.GetRecord<SSLinkModelBaseData>();
            bool addRow = true;
            if (NeitherOrTrainOrTest != -1)//if has filter
            {
                if (NeitherOrTrainOrTest != record.NeitherOrTrainOrTest)//check if fails filter
                {
                    addRow = false;
                }
            }
            if (!DayFilter.Equals("NULL"))//if has filter
            {
                if (!DayFilter.Equals(record.Day))//check if fails filter
                {
                    addRow = false;
                }
            }

            if (addRow)
            {
                dataRows.Add(record);
            }
        }
    }

    //sort
    dataRows = dataRows.OrderBy(o => o.Time_mins).ToList();

    //build dictionaries
    for (int i = 0; i < dataRows.Count; i++)
    {
        string routeCode = dataRows[i].RouteCode;
        string linkName = dataRows[i].LinkName;

        if (!sortedData.ContainsKey(routeCode))
        {
            sortedData.Add(routeCode, new Dictionary<string, List<SSLinkModelBaseData>>());
        }
        if (!sortedData[routeCode].ContainsKey(linkName))
        {
            sortedData[routeCode].Add(linkName, new List<SSLinkModelBaseData>());
        }
        sortedData[routeCode][linkName].Add(dataRows[i]);
    }

    return sortedData;
}
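The import routine reduces to four steps: stream CSV rows, apply the optional train/test and day filters, sort by time of day, and group the rows by route and then by link. A compact Python sketch of the same flow, using the column names from the listing (stdlib only; the function name and in-memory CSV input are illustrative):

```python
import csv
import io
from collections import defaultdict

def import_link_data(csv_text, train_or_test=None, day_filter=None):
    """Stream rows, apply the optional filters, sort by Time_mins,
    and group into route -> link -> list of rows."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        # filter on train/test flag (None means no filter, like -1 above)
        if train_or_test is not None and int(rec["NeitherOrTrainOrTest"]) != train_or_test:
            continue
        # filter on day of week (None means no filter, like "NULL" above)
        if day_filter is not None and rec["Day"] != day_filter:
            continue
        rows.append(rec)
    rows.sort(key=lambda r: float(r["Time_mins"]))  # sort by time of day
    sorted_data = defaultdict(lambda: defaultdict(list))
    for rec in rows:
        sorted_data[rec["RouteCode"]][rec["LinkName"]].append(rec)
    return sorted_data
```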
A.4.2 Model Prediction Method (GetPredictionForLinkDataInR)
private List<SSLinkSimModelData> GetPredictionForLinkDataInR(List<SSLinkSimModelData> inputTestData, REngine engine, string modelTyp, List<string> trainedVarListRoute, List<string> trainedVarListLink, bool updatePred = true, bool speedTest = true, bool saveRDataForRTest = false)
{
    //Step 0: separate data with trained routes and data without trained routes
    List<string> modelList = Sim_RModel.ModelObjNames;
    //Three levels of prediction potentially (for info only):
    //  use model with route parameter - "fit"
    //  use model without route parameter, but with link characteristics - "fitNoRoute"
    //  use model with operational parameters only - (not used)

    // data separation
    Dictionary<string, List<SSLinkSimModelData>> testData_ByModel = new Dictionary<string, List<SSLinkSimModelData>>();
    foreach (string modelcode in modelList)
    {
        testData_ByModel.Add(modelcode, new List<SSLinkSimModelData>());
    }
    foreach (SSLinkSimModelData dataRow in inputTestData)
    {
        if (modelTyp.Equals("FRF"))
        {
            if (trainedVarListRoute.Contains(dataRow.RouteCode))
            {
                testData_ByModel[modelList[0]].Add(dataRow);
            }
            else
            {
                testData_ByModel[modelList[1]].Add(dataRow);
            }
        }
        else if (modelTyp.Equals("LME"))
        {
            if (trainedVarListRoute.Contains(dataRow.RouteCode) && trainedVarListLink.Contains(dataRow.LinkName))
            {
                testData_ByModel[modelList[0]].Add(dataRow);
            }
            else
            {
                testData_ByModel[modelList[1]].Add(dataRow);
            }
        }
    }
    // get prediction from each model
    //foreach (string modelObject in testData_ByModel.Keys)
    for (int n = 0; n < testData_ByModel.Keys.Count; n++)
    {
        string modelObject = testData_ByModel.Keys.ToList()[n];
        List<SSLinkSimModelData> testData = new List<SSLinkSimModelData>(testData_ByModel[modelObject]);

        if (testData.Count > 0)
        {
            //Step 1: export link data to a data frame in R
            //Convert existing list of link model data to an array of columns
            IEnumerable[] columns = new IEnumerable[29];
            //Note: https://stackoverflow.com/questions/204505/preserving-order-with-linq
            columns[0] = testData.Select(o => o.RunningSpeedPred).ToArray();//"blank column" to store prediction
            columns[1] = testData.Select(o => o.LinkDist).ToArray();
            columns[2] = testData.Select(o => o.PrevLinkRunningSpeed).ToArray();
            columns[3] = testData.Select(o => o.PrevTripRunningSpeed).ToArray();
            columns[4] = testData.Select(o => o.TerminalDelay).ToArray();
            columns[5] = testData.Select(o => o.Delay).ToArray();
            columns[6] = testData.Select(o => o.Headway).ToArray();
            columns[7] = testData.Select(o => o.ScheduleHeadway).ToArray();
            columns[8] = testData.Select(o => o.HeadwayRatio).ToArray();
            columns[9] = testData.Select(o => o.TotalPptn).ToArray();
            columns[10] = testData.Select(o => o.HasIncident).ToArray();
            columns[11] = testData.Select(o => o.Day).ToArray();
            columns[12] = testData.Select(o => o.Time_mins).ToArray();
            columns[13] = testData.Select(o => o.RouteCode).ToArray();
            columns[14] = testData.Select(o => o.GtfsScheduleID).ToArray();
            columns[15] = testData.Select(o => o.LinkName).ToArray();
            columns[16] = testData.Select(o => o.StopSeq).ToArray();
            columns[17] = testData.Select(o => o.IsStreetcar).ToArray();
            columns[18] = testData.Select(o => o.IsSeparatedROW).ToArray();
            columns[19] = testData.Select(o => o.Num_VehLtTurns).ToArray();
            columns[20] = testData.Select(o => o.Num_VehRtTurns).ToArray();
            columns[21] = testData.Select(o => o.Num_VehThroughs).ToArray();
            columns[22] = testData.Select(o => o.Num_TSP_equipped).ToArray();
            columns[23] = testData.Select(o => o.Num_PedCross).ToArray();
            columns[24] = testData.Select(o => o.Sum_SigIntxnApproach).ToArray();
            columns[25] = testData.Select(o => o.AvgVehVol).ToArray();
            columns[26] = testData.Select(o => o.AvgPedVol).ToArray();
            columns[27] = testData.Select(o => o.IsStartStopNearSided).ToArray();
            columns[28] = testData.Select(o => o.IsEndStopFarSided).ToArray();

            string[] columnNames = new[] { "RunningSpeed", "LinkDist", "PrevLinkRunningSpeed", "PrevTripRunningSpeed", "TerminalDelay", "Delay", "Headway", "ScheduleHeadway", "HeadwayRatio", "TotalPptn", "HasIncident", "Day", "Time_mins", "RouteCode", "GtfsScheduleID", "LinkName", "StopSeq", "IsStreetcar", "IsSeparatedROW", "Num_VehLtTurns", "Num_VehRtTurns", "Num_VehThroughs", "Num_TSP_equipped", "Num_PedCross", "Sum_SigIntxnApproach", "AvgVehVol", "AvgPedVol", "IsStartStopNearSided", "IsEndStopFarSided" };

            DataFrame testDataCurrentDf = engine.CreateDataFrame(columns, columnNames);

            string RSymbolName = "testDataCurrentDf";
            engine.SetSymbol(RSymbolName, testDataCurrentDf);

            //Step 2: get prediction
            //Options for modelCode: CART_FRF, or empty for any other
            //double[] defaultValues = new double[testData.Count];
            //defaultValues.Populate(20);
            NumericVector predValues;

            // *Prediction in R
            //force enable speed test for debugging here
            //speedTest = true;
            //if saveRDataForRTest is true, the speed test result for R.NET should not be used - it won't be accurate
            //saveRDataForRTest = false;
            if (speedTest == true)
            {
                //RdotNetSpeedTest - comment out for deployment
                string speedTestLogFile = string.Format("{0}{1}", ModelFolder, "R_RdotNet_SpeedTest.txt");
                if (!File.Exists(speedTestLogFile))
                {
                    //File.Create(speedTestLogFile);
                    File.WriteAllText(speedTestLogFile,
"SampleSize,ModelType,R.Net_Time,R_Time\n"); 5996 } 5997 Stopwatch watch = new Stopwatch(); 5998 watch.Start(); 5999 //end of RdotNetSpeedTest 6000
298
// PREDICT 6001 predValues = 6002 engine.Evaluate(string.Format("Fn_GenericPredict(modelType = 6003 settings[['modelType']], fit = {0}, testData = {1})", modelObject, 6004 RSymbolName)).AsNumeric();//key line - do not comment out 6005 // END OF PREDICT 6006 //RdotNetSpeedTest 6007 watch.Stop(); 6008 long timeElapsed_ms = watch.ElapsedMilliseconds; 6009 double timeElapsed_secs = 6010 Convert.ToDouble((double)watch.ElapsedMilliseconds / 1000); 6011 File.AppendAllText(speedTestLogFile, 6012 string.Format("{0},{1},{2},{3}\n", predValues.Count(), modelObject, 6013 timeElapsed_secs, ""));//leave R_Time to be filled up later 6014 6015 if (saveRDataForRTest) 6016 { 6017 // remove large model objects 6018 string removeCommand = 6019 string.Format("rm({0},{1})", testData_ByModel.Keys.ToList()[0], 6020 testData_ByModel.Keys.ToList()[1]); 6021 engine.Evaluate(removeCommand); 6022 //RdotNetSpeedTest 6023 6024 // save dataframe and small objects for this 6025 iteration 6026 string saveCommand = ""; 6027 if (modelObject == 6028 testData_ByModel.Keys.ToList()[0]) 6029 { 6030 speedTestTrackerFit++; 6031 saveCommand = 6032 string.Format("save.image('{0}/{1}')", Sim_RModel.RFolder, 6033 string.Format("RSpeed_{0}{1}.RData", modelObject, 6034 speedTestTrackerFit));//RdotNetSpeedTest 6035 } 6036 else if (modelObject == 6037 testData_ByModel.Keys.ToList()[1]) 6038 { 6039 speedTestTrackerFitAlt++; 6040 saveCommand = 6041 string.Format("save.image('{0}/{1}')", Sim_RModel.RFolder, 6042 string.Format("RSpeed_{0}{1}.RData", modelObject, 6043 speedTestTrackerFitAlt));//RdotNetSpeedTest 6044 } 6045 engine.Evaluate(saveCommand); //RdotNetSpeedTest 6046 6047 // read Travel Speed Model - since it is deleted 6048 from save 6049 string loadRDataCmd = 6050 string.Format("load('{0}')", Sim_RModel.RDataWorkBenchRPath); 6051 engine.Evaluate(loadRDataCmd); 6052 } 6053 6054
299
//end of RdotNetSpeedTest - Comment out for 6055 deployment 6056 } 6057 else 6058 { 6059 predValues = 6060 engine.Evaluate(string.Format("Fn_GenericPredict(modelType = 6061 settings[['modelType']], fit = {0}, testData = {1})", modelObject, 6062 RSymbolName)).AsNumeric();//key line - do not comment out 6063 } 6064 6065 List<double> predValuesList = predValues.ToList(); 6066 6067 //Step 3A: update prediction to testDataForExport 6068 for (int i = 0; i < testData.Count; i++) 6069 { 6070 //make sure StartStop_ScheduledArrivalTime has been 6071 updated before this prediction method is called 6072 if (updatePred) 6073 { 6074 6075 testData[i].UpdatePrediction(predValuesList[i]);//comment out for performance 6076 test 6077 } 6078 } 6079 6080 testData_ByModel[modelObject] = testData; 6081 } 6082 } 6083 6084 //Step 3B: get prediction from each model 6085 List<SSLinkSimModelData> testDataForExport = new 6086 List<SSLinkSimModelData>(); 6087 foreach (string modelObject in testData_ByModel.Keys) 6088 { 6089 testDataForExport.AddRange(testData_ByModel[modelObject]); 6090 } 6091 6092 return testDataForExport; 6093 } 6094 6095
A.4.3 Model Prediction Script in R (ME_Analysis.R)
### selected script within ME_Analysis.R functions ###

## Load all possible libraries needed
modelType = settings[['modelType']]
if (grepl("CART_FRF", settings[['modelType']]))
  require(ranger)
if (grepl("MLR", settings[['modelType']]))
  require(MASS) # library for stepwise regression
if (grepl("SVM", settings[['modelType']])) {
  require(e1071) # library for SVM
  require(liquidSVM)
}
if (grepl("CART_RT", settings[['modelType']]))
  require(rpart)
if (grepl("CART_RF", settings[['modelType']]))
  require(randomForest)
if (grepl("LME", settings[['modelType']])) {
  require(lme4) # load library - fast mixed effect
  require(nlme) # load library - regular mixed effect
  require(arm)  # convenience functions for regression in R
}

## R function to generate predictions using existing libraries
Fn_GenericPredict <- function (modelType = NULL, fit, testData) {

  if (modelType == "CART_FRF") {
    predictY_test <- predict(fit, data = testData) # out of bag
    predictY_test <- predictY_test$predictions
  } else {
    predictY_test <- predict(fit, newdata = testData)
  }

  return(predictY_test)
}
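`Fn_GenericPredict` exists because the model libraries disagree on the shape of `predict()` output: `ranger` returns a prediction object whose `$predictions` field holds the numeric vector, while the other fitted models return the vector directly. The same normalization pattern, sketched in Python with hypothetical model classes (these stand-ins are not part of the Nexus code):

```python
def generic_predict(fit, test_data):
    """Normalize heterogeneous predict() outputs to a flat list,
    analogous to Fn_GenericPredict in ME_Analysis.R."""
    out = fit.predict(test_data)
    # ranger-style fits wrap results in a container; unwrap if present
    if isinstance(out, dict) and "predictions" in out:
        return list(out["predictions"])
    return list(out)

class RangerLike:
    """Stand-in for a fit whose predict() wraps its results."""
    def predict(self, x):
        return {"predictions": [1.0 for _ in x]}

class PlainLike:
    """Stand-in for a fit whose predict() returns values directly."""
    def predict(self, x):
        return [2.0 for _ in x]
```

Centralizing this unwrapping in one function keeps the C# caller (A.4.2) agnostic to which model type was trained.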
A.5 Analytic Report Algorithm
A.5.1 Report Generation Main Method (Model_GenerateReportsWithBinFileAndR)
public void Model_GenerateReportsWithBinFileAndR(string crossRefDay = "3")
{
    bool crossRefObservedTestData = true; // this is needed for filling in attributes for predictions

    // Temp code to measure time elapsed
    //DateTime previousTime = DateTime.Now;
    //UpdateGUI_StatusBox(string.Format("Time Elapsed {0}: ", DateTime.Now.Subtract(previousTime).TotalSeconds), 40, 0.0, 60.0);
    //previousTime = DateTime.Now;

    UpdateGUI_StatusBox("Report output started...", 15.0, 0.0, 50);

    int neitherOrTrainOrTest = 2;
    string day = crossRefDay;

    //Step 0: Get simulation reference data, in the form of SSLinkModelBaseData format, from the test data set
    // This is needed both for simulation and for reporting observed results for comparisons
    if (crossRefObservedTestData)
    {
        if (SimRef_TestDataRef_ByRouteCode_ByLinkName == null || SimRef_TestDataRef_ByRouteCode_ByLinkName.Count == 0)
        {
            SimRef_TestDataRef_ByRouteCode_ByLinkName = ImportLinkDataForModelRef("", "", neitherOrTrainOrTest, day);//get real test data from csv
        }
    }
    else
    {
        SimRef_TestDataRef_ByRouteCode_ByLinkName = null;
    }

    //Load Reference Data
    // check and load GTFS ref data
    if (GtfsScheduleTable == null)
    {
        LoadGTFSReferenceData();// read GTFS data for network initialization
    }

    //Step 1: Check whether a completed binary file is available and needs to be recomputed; if not, continue; if so, skip to step 4.
    // Binary file path for C# use
    string Sim_ModelDataPred_File = Sim_ModelResultBinFile;
    int serviceIDSelect = GtfsServicePeriodTable.Values.ToList().Where(o => o.TypeOfDay == dayType.weekday).ToList()[0].ServiceID;
    // Read prediction data from binary
    Sim_ModelPredOutput_ByGtfsTripID = BinaryFileHandler.ReadFile<Dictionary<int, List<SSLinkSimModelData>>>(Sim_ModelDataPred_File);

    //Get data into big lists
    // Collapse simulation output to a single list
    List<SSLinkSimModelData> collapsed_SimModelData = new List<SSLinkSimModelData>();
    foreach (int trip in Sim_ModelPredOutput_ByGtfsTripID.Keys)
    {
        collapsed_SimModelData.AddRange(Sim_ModelPredOutput_ByGtfsTripID[trip]);
    }
    // Collapse test data to a single list
    List<SSLinkModelBaseData> collapsed_TestData = new List<SSLinkModelBaseData>();
    foreach (string route in SimRef_TestDataRef_ByRouteCode_ByLinkName.Keys)
    {
        foreach (string link in SimRef_TestDataRef_ByRouteCode_ByLinkName[route].Keys)
        {
            List<SSLinkModelBaseData> sortedData = SimRef_TestDataRef_ByRouteCode_ByLinkName[route][link].OrderBy(o => o.StopSeq).ToList();
            collapsed_TestData.AddRange(sortedData);
        }
    }

    REngine engine = REngine.GetInstance();

    //Program Data Exports
    // Dwell Time model parameters
    try
    {
        UpdateGUI_StatusBox("Exporting dwell time model parameters...", 25, 0.0, 60);
        Model_StopModelsEstimation(true);// estimate all stop models with data

        string dwellParameterCSV = ModelFolder + @"CSharp_DwellParam.csv";
        StringBuilder dwellParameterText = new StringBuilder();
        dwellParameterText.AppendLine("StopCode,StopName,LogN_Mu,LogN_Sigma,trainAvg,trainStdDev,testObsAvg,testObsStdDev,testPredAvg,testPredStdDev");

        foreach (int startStopID in Sim_DwellTimeModel_ByStartStopID.Keys)
        {
            if (!GtfsStopTable.ContainsKey(startStopID))
            {
                continue;
            }

            string startStopCode = GtfsStopTable[startStopID].StopCode;
            string startStopName = GtfsStopTable[startStopID].StopName;

            //Model Estimation specific measures
            double model_logStdDev_SIGMA = Sim_DwellTimeModel_ByStartStopID[startStopID].Distribution.Sigma;
            double model_logAvg_MU = Sim_DwellTimeModel_ByStartStopID[startStopID].Distribution.Mu;

            double trainSample_avg = Sim_DwellTimeModel_ByStartStopID[startStopID].TrainingData.Average();
            double trainSample_stdDev = Sim_DwellTimeModel_ByStartStopID[startStopID].TrainingData.StandardDeviation_P();
            double model_avg = Sim_DwellTimeModel_ByStartStopID[startStopID].Distribution.Mean;//compare
            double model_stdDev = Sim_DwellTimeModel_ByStartStopID[startStopID].Distribution.StdDev;

            //Simulation specific measures
            // get test data observed dwell times
            List<double> testObservedDwellTimes = collapsed_TestData.Where(o => o.StartStopID == startStopID).Select(o => o.DwellTime).ToList();
            testObservedDwellTimes.RemoveAll(t => t <= 0);
            double testObservedSample_avg = testObservedDwellTimes.Count > 3 ? testObservedDwellTimes.Average() : -1;
            double testObservedSample_stdDev = testObservedDwellTimes.Count > 3 ? testObservedDwellTimes.StandardDeviation_P() : -1;

            // get test data predicted/simulated dwell times
            List<double> testPredictedDwellTimes = collapsed_SimModelData.Where(o => o.StartStopID == startStopID).Select(o => o.DwellTime).ToList();
            testPredictedDwellTimes.RemoveAll(t => t <= 0);
            double testPredictedSample_avg = testPredictedDwellTimes.Count > 3 ? testPredictedDwellTimes.Average() : -1;
            double testPredictedSample_stdDev = testPredictedDwellTimes.Count > 3 ? testPredictedDwellTimes.StandardDeviation_P() : -1;

            dwellParameterText.AppendLine(String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9}", startStopCode, startStopName, model_logAvg_MU, model_logStdDev_SIGMA, trainSample_avg, trainSample_stdDev, testObservedSample_avg, testObservedSample_stdDev, testPredictedSample_avg, testPredictedSample_stdDev));
        }
        File.WriteAllText(dwellParameterCSV, dwellParameterText.ToString());
        dwellParameterText.Clear();
    }
    catch
    {
        UpdateGUI_StatusBox("Unexpected Error.", 25, 0.0, 0);
    }

    //Analytics Reports
    // Time Distance Diagram for each schedule
    try
    {
        UpdateGUI_StatusBox("Producing time-distance diagrams...", 40, 0.0, 280);
        string timeDistDiagFolder = ModelFolder + @"CSharp_TimeDistDiag\";
        ExcelReports.CreateTimeDistDiagramReport(timeDistDiagFolder, collapsed_SimModelData, collapsed_TestData, GtfsStopTable, GtfsScheduleTable, GtfsRouteGroupTableByGroupID, serviceIDSelect);
    }
    catch
    {
        UpdateGUI_StatusBox("Unexpected Error.", 40, 0.0, 0);
    }

    //Route Speed Reports
    try
    {
        UpdateGUI_StatusBox("Analyzing route speeds...", 60, 0.0, 350);
        string routeReportFolder = ModelFolder + @"CSharp_RouteReport\";
        ExcelReports.CreateRouteSpeedReport(routeReportFolder, collapsed_SimModelData, collapsed_TestData, GtfsStopTable, GtfsScheduleTable, GtfsRouteGroupTableByGroupID, serviceIDSelect, engine);
    }
    catch
    {
        UpdateGUI_StatusBox("Unexpected Error.", 60, 0.0, 0);
    }

    //Stop Dwell Time Reports
    try
    {
        UpdateGUI_StatusBox("Analyzing stop dwell times...", 75, 0.0, 90);
        string stopReportFolder = ModelFolder + @"CSharp_StopDwellReport\";
        ExcelReports.CreateStopDwellReport(stopReportFolder, collapsed_SimModelData, collapsed_TestData, GtfsStopTable, GtfsScheduleTable, GtfsRouteGroupTableByGroupID, serviceIDSelect, engine);
    }
    catch
    {
        UpdateGUI_StatusBox("Unexpected Error.", 75, 0.0, 0);
    }

    //Stop Delay Reports
    try
    {
        UpdateGUI_StatusBox("Analyzing stop delays...", 90, 0.0, 90);
        string stopReportFolder = ModelFolder + @"CSharp_StopDelayReport\";
        ExcelReports.CreateStopDelayReport(stopReportFolder, collapsed_SimModelData, collapsed_TestData, GtfsStopTable, GtfsScheduleTable, GtfsRouteGroupTableByGroupID, serviceIDSelect, engine);
    }
    catch
    {
        UpdateGUI_StatusBox("Unexpected Error.", 90, 0.0, 0);
    }

    //Clean Up
    Sim_ModelPredOutput_ByGtfsTripID.Clear();
    SimRef_TestDataRef_ByRouteCode_ByLinkName.Clear();
    UpdateGUI_StatusBox("Report output complete.", 100, 0.0, 0.0);
    // always dispose of the REngine properly.
    // After disposing of the engine, it cannot be reinitialized or reused
    engine.Dispose();//dispose at program exit
}
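The dwell-parameter CSV above exports both the fitted log-space parameters (LogN_Mu, LogN_Sigma) and the training-sample mean and standard deviation, which is what makes the `Distribution.Mean`/`Distribution.StdDev` comparison meaningful: for a lognormal distribution the two pairs are related in closed form. The standard moment formulas, sketched in Python (function names are illustrative, not part of the Nexus code):

```python
import math

def lognormal_moments(mu, sigma):
    """Mean and standard deviation of LogNormal(mu, sigma),
    where mu and sigma are the log-space parameters."""
    mean = math.exp(mu + sigma ** 2 / 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1)
    return mean, sd

def lognormal_params(mean, sd):
    """Method-of-moments inverse: recover (mu, sigma) from sample moments."""
    sigma2 = math.log(1 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2, math.sqrt(sigma2)
```

The two functions are exact inverses, so a stop's (mu, sigma) fitted from training dwell times can be checked directly against the exported trainAvg/trainStdDev columns.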
A.5.2 Time Distance Diagram Report Generation Method (CreateTimeDistDiagramReport)
public static void CreateTimeDistDiagramReport(string timeDistDiagFolder, List<SSLinkSimModelData> simulationData, List<SSLinkModelBaseData> observedData, ConcurrentDictionary<int, SSimStopRef> gtfsStopTable, ConcurrentDictionary<int, SSimScheduleRef> gtfsScheduleTable, ConcurrentDictionary<int, SSimRouteGroupRef> gtfsRouteGroupTableByGroupID, int serviceIDSelect)
{
    if (!Directory.Exists(timeDistDiagFolder))
    {
        Directory.CreateDirectory(timeDistDiagFolder);
    }

    List<int> dirList = new List<int>() { 0, 1 };

    //GENERATE ROUTE LEVEL DIAGRAMS
    ParallelOptions pOptions = new ParallelOptions();
    pOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
    List<int> allGroupIDs = gtfsRouteGroupTableByGroupID.Keys.ToList();
    Parallel.For(0, allGroupIDs.Count, pOptions, pForIndex =>
    {
        int groupID = allGroupIDs[pForIndex];

        foreach (int dir in dirList)
        {
            if (gtfsRouteGroupTableByGroupID[groupID].RouteIDs.Count == 0 || gtfsRouteGroupTableByGroupID[groupID].mainRouteIDDir0ByServiceP.Count == 0 || gtfsRouteGroupTableByGroupID[groupID].mainRouteIDDir1ByServiceP.Count == 0)
            {
                continue;
            }

            int mainScheduleID;

            // Get Sim and Test Data for this route # and direction
            List<SSLinkSimModelData> select_SimModelData = new List<SSLinkSimModelData>();
            List<SSLinkModelBaseData> select_TestData = new List<SSLinkModelBaseData>();
            List<int> scheduleIDForThisDir = new List<int>();

            // NOTE: route ID = schedule ID, but =/= group ID or route code
            if (dir == 0)
            {
                scheduleIDForThisDir.AddRange(gtfsRouteGroupTableByGroupID[groupID].mainRouteIDDir0ByServiceP.Values.ToList());
                scheduleIDForThisDir.AddRange(gtfsRouteGroupTableByGroupID[groupID].subRouteIDDir0.Select(o => o.Item3).ToList());
                mainScheduleID = gtfsRouteGroupTableByGroupID[groupID].mainRouteIDDir0ByServiceP[serviceIDSelect];
            }
            else
            {
                scheduleIDForThisDir.AddRange(gtfsRouteGroupTableByGroupID[groupID].mainRouteIDDir1ByServiceP.Values.ToList());
                scheduleIDForThisDir.AddRange(gtfsRouteGroupTableByGroupID[groupID].subRouteIDDir1.Select(o => o.Item3).ToList());
                mainScheduleID = gtfsRouteGroupTableByGroupID[groupID].mainRouteIDDir1ByServiceP[serviceIDSelect];
            }
            foreach (int tempScheduleID in scheduleIDForThisDir)
            {
                select_SimModelData.AddRange(simulationData.Where(o => o.GtfsScheduleID == string.Format("{0}", tempScheduleID)).ToList());
                select_TestData.AddRange(observedData.Where(o => o.GtfsScheduleID == string.Format("{0}", tempScheduleID)).ToList());
            }

            string mainScheduleStopString = string.Join(" ", gtfsScheduleTable[mainScheduleID].StopIDs);

            if (select_SimModelData.Count > 0 && select_TestData.Count > 0)
            {
                string routeCode = gtfsRouteGroupTableByGroupID[groupID].RouteCode;
                string routeName = gtfsRouteGroupTableByGroupID[groupID].RouteName.Replace('/', ' ');
                string routeHeading = gtfsRouteGroupTableByGroupID[groupID].RouteInfo[mainScheduleID].Name;
                string sheetLabel = string.Format("{0}-{1}-{2}", routeCode, groupID, dir);
                string chartTitle = string.Format("TIME DISTANCE DIAGRAM\n({0} {1} {2})", routeCode, routeName, routeHeading.Replace(" - ", "^").Split('^').Last());
                string ExcelBookFilePath = timeDistDiagFolder + string.Format(@"TD_Diag_{0}-{1}-{2}.xlsx", routeCode, routeName, routeHeading.Replace(" - ", "^").Split('^').Last());

                // Create the file using the FileInfo object
                var file = new FileInfo(ExcelBookFilePath);

                // Create the package and make sure to wrap it in a using statement
                using (var xlPackage = new ExcelPackage(file))
                {
                    //--- Simulation Data Worksheet ---

                    // add a new worksheet to the empty workbook
                    ExcelWorksheet worksheet_sim = xlPackage.Workbook.Worksheets.Add(string.Format("SimData_{0}", sheetLabel));
                    List<int> simSheetRowBreaks = new List<int>();

                    // sort simulation data
                    select_SimModelData = select_SimModelData.OrderBy(o => o.StopSeq).ToList();
                    select_SimModelData = select_SimModelData.OrderBy(o => o.GtfsTripID).ToList();

                    // populate simulation data
                    int i = 1;//header
                    int lastGTFSTripID = -1;
                    // header row
                    worksheet_sim.Cells[i, 1].Value = "Scheduled Path";
                    worksheet_sim.Cells[i, 2].Value = "Simulation Path";
                    worksheet_sim.Cells[i, 3].Value = "Distance From Start (km)";
                    worksheet_sim.Cells[i, 4].Value = "Link Dist (m)";
                    worksheet_sim.Cells[i, 5].Value = "StopCode";
                    worksheet_sim.Cells[i, 6].Value = "GTFSTripID";
                    i++;
                    foreach (SSLinkSimModelData aRow in select_SimModelData)
                    {
                        if (aRow.GtfsTripID != lastGTFSTripID)
                        {
                            i++;//add empty row to break the lines in the scatterplot
                            lastGTFSTripID = aRow.GtfsTripID;
                        }

                        int scheduleID = Convert.ToInt32(aRow.GtfsScheduleID);
                        //check whether to use the main schedule ID's stop distances or this schedule ID's
                        string scheduleStopString = string.Join(" ", gtfsScheduleTable[scheduleID].StopIDs);
                        if (mainScheduleStopString.Contains(scheduleStopString))
                        {
                            scheduleID = mainScheduleID;
                        }//else keep the original scheduleID
                        //Note: if a really clean graph is desired, use only stops from schedules that are a subset of the main schedule stops

                        int stopIndex = gtfsScheduleTable[scheduleID].StopIDs.IndexOf(aRow.StartStopID);
                        double distFromStart = gtfsScheduleTable[scheduleID].StopDistances[stopIndex] / 1000; // to km
                        string stopCode = gtfsStopTable[aRow.StartStopID].StopCode;

                        //arrival at stop
                        worksheet_sim.Cells[i, 1].Value = aRow.StartStop_ScheduledArrivalTime;
                        worksheet_sim.Cells[i, 2].Value = aRow.StartStop_SimArrivalTime;
                        worksheet_sim.Cells[i, 3].Value = distFromStart;
                        worksheet_sim.Cells[i, 4].Value = aRow.LinkDist;
                        worksheet_sim.Cells[i, 5].Value = stopCode;
                        worksheet_sim.Cells[i, 6].Value = aRow.GtfsTripID;
                        i++;
                        //departure from stop
                        worksheet_sim.Cells[i, 1].Value = aRow.StartStop_ScheduledArrivalTime;
                        TimeSpan StartStop_SimDepTime = TimeSpan.FromSeconds(aRow.StartStop_SimArrivalTime.TotalSeconds + aRow.DwellTime);
                        worksheet_sim.Cells[i, 2].Value = StartStop_SimDepTime;
                        worksheet_sim.Cells[i, 3].Value = distFromStart;
                        worksheet_sim.Cells[i, 4].Value = aRow.LinkDist;
                        worksheet_sim.Cells[i, 5].Value = stopCode;
                        worksheet_sim.Cells[i, 6].Value = aRow.GtfsTripID;
                        i++;
                    }//end foreach
                    int total_SimWorksheetRows = i;

                    //--- Observed Data Worksheet ---
                    // add a new worksheet to the workbook
                    ExcelWorksheet worksheet_obs = xlPackage.Workbook.Worksheets.Add(string.Format("ObsData_{0}", sheetLabel));
                    List<int> obsSheetRowBreaks = new List<int>();

                    // sort observed data
                    select_TestData = select_TestData.OrderBy(o => o.StopSeq).ToList();
                    select_TestData = select_TestData.OrderBy(o => o.GpsTripID).ToList();

                    // populate observed data
                    int lastGPSTripID = -1;
                    i = 1;//header
                    // header row
                    worksheet_obs.Cells[i, 1].Value = "Observed Path";
                    worksheet_obs.Cells[i, 2].Value = "Distance From Start (km)";
                    worksheet_obs.Cells[i, 3].Value = "Link Dist (m)";
                    worksheet_obs.Cells[i, 4].Value = "StopCode";
                    worksheet_obs.Cells[i, 5].Value = "GPSTripID";
                    i++;
                    foreach (SSLinkModelBaseData aRow in select_TestData)
                    {
                        if (aRow.GpsTripID != lastGPSTripID)
                        {
                            i++;//add empty row to break the lines in the scatterplot
                            lastGPSTripID = aRow.GpsTripID;
                        }

                        int scheduleID = Convert.ToInt32(aRow.GtfsScheduleID);
                        //check whether to use the main schedule ID's stop distances or this schedule ID's
                        string scheduleStopString = string.Join(" ", gtfsScheduleTable[scheduleID].StopIDs);
                        if (mainScheduleStopString.Contains(scheduleStopString))
                        {
                            scheduleID = mainScheduleID;
                        }//else keep the original scheduleID

                        int stopIndex = gtfsScheduleTable[scheduleID].StopIDs.IndexOf(aRow.StartStopID);
                        double distFromStart = gtfsScheduleTable[scheduleID].StopDistances[stopIndex] / 1000; // to km
                        string stopCode = gtfsStopTable[aRow.StartStopID].StopCode;

                        //arrival at stop
                        TimeSpan StartStop_ObsArrTime = TimeSpan.FromMinutes(aRow.Time_mins);
                        worksheet_obs.Cells[i, 1].Value = StartStop_ObsArrTime;
                        worksheet_obs.Cells[i, 2].Value = distFromStart;
                        worksheet_obs.Cells[i, 3].Value = aRow.LinkDist;
                        worksheet_obs.Cells[i, 4].Value = stopCode;
                        worksheet_obs.Cells[i, 5].Value = aRow.GpsTripID;
                        i++;
                        //departure from stop
                        TimeSpan StartStop_ObsDepTime = TimeSpan.FromSeconds(StartStop_ObsArrTime.TotalSeconds + aRow.DwellTime);
                        worksheet_obs.Cells[i, 1].Value = StartStop_ObsDepTime;
                        worksheet_obs.Cells[i, 2].Value = distFromStart;
                        worksheet_obs.Cells[i, 3].Value = aRow.LinkDist;
                        worksheet_obs.Cells[i, 4].Value = stopCode;
                        worksheet_obs.Cells[i, 5].Value = aRow.GpsTripID;
                        i++;
                    }//end foreach
                    int total_ObsWorksheetRows = i;

                    //--- Formatting Cell Number Format ---
                    worksheet_sim.Cells[string.Format("'{0}'!${1}${2}:{1}{3}", worksheet_sim.Name, "A", 2, total_SimWorksheetRows)].Style.Numberformat.Format = "HH:mm:ss";
                    worksheet_sim.Cells[string.Format("'{0}'!${1}${2}:{1}{3}", worksheet_sim.Name, "B", 2, total_SimWorksheetRows)].Style.Numberformat.Format = "HH:mm:ss";
                    worksheet_obs.Cells[string.Format("'{0}'!${1}${2}:{1}{3}", worksheet_obs.Name, "A", 2, total_ObsWorksheetRows)].Style.Numberformat.Format = "HH:mm:ss";

                    //--- Time Distance Graph Worksheet ---
                    // add a new worksheet for the chart
                    ExcelWorksheet worksheet_Graph = xlPackage.Workbook.Worksheets.Add(string.Format("TimeDistDiag_{0}", sheetLabel));

                    //Create Chart and Format it
                    ExcelChart chart = (ExcelScatterChart)worksheet_Graph.Drawings.AddChart("TimeDistDiagChart", eChartType.XYScatterLinesNoMarkers);
                    chart.Title.Text = string.Format("{0}", chartTitle);
                    chart.Title.Font.Size = 16;
                    chart.SetPosition(1, 0, 3, 0);
                    chart.SetSize(900, 540);// ~("28 cm", "15 cm")// DPI: 166 //http://dpi.lv/
                    chart.DisplayBlanksAs = eDisplayBlanksAs.Gap;//need to display as Gap for proper visual

                    //chart.Style = eChartStyle.Style5;
                    chart.YAxis.Title.Text = "Distance from Start (km)";
                    chart.YAxis.Title.Font.Size = 12;
                    chart.XAxis.Title.Text = "Time (hh:mm:ss)";
                    chart.XAxis.Title.Font.Size = 12;
                    chart.XAxis.MinValue = (double)6 / 24;//6am
                    chart.XAxis.MaxValue = (double)9 / 24;//9am
                    chart.XAxis.MajorUnit = (double)0.5 / 24; //30 minutes
                    chart.XAxis.MinorUnit = (double)0.25 / 24; //15 minutes

                    // add schedule times
                    string cellSelectString_Y = string.Format("'{0}'!${1}${2}:{1}{3}", worksheet_sim.Name, "C", 2, total_SimWorksheetRows); //Distance From Start (km)
                    string cellSelectString_XSchedule = string.Format("'{0}'!${1}${2}:{1}{3}", worksheet_sim.Name, "A", 2, total_SimWorksheetRows); //Scheduled Path
                    var ser1 = (ExcelChartSerie)(chart.Series.Add(worksheet_Graph.Cells[cellSelectString_Y], worksheet_Graph.Cells[cellSelectString_XSchedule]));
                    ser1.Header = "Scheduled";

                    // add observed times
                    string cellSelectString_YObserved = string.Format("'{0}'!${1}${2}:${1}${3}", worksheet_obs.Name, "B", 2, total_ObsWorksheetRows); //Distance From Start (km)
                    string cellSelectString_XObserved = string.Format("'{0}'!${1}${2}:${1}${3}", worksheet_obs.Name, "A", 2, total_ObsWorksheetRows); //Observed Path
                    var ser2 = (ExcelChartSerie)(chart.Series.Add(worksheet_Graph.Cells[cellSelectString_YObserved], worksheet_Graph.Cells[cellSelectString_XObserved]));
                    ser2.Header = "Observed";

                    // add simulation times
                    //use cellSelectString_Y //Distance From Start (km)
                    string cellSelectString_XSimulated = string.Format("'{0}'!${1}${2}:${1}${3}", worksheet_sim.Name, "B", 2, total_SimWorksheetRows); //Simulation Path
                    var ser3 = (ExcelChartSerie)(chart.Series.Add(worksheet_Graph.Cells[cellSelectString_Y], worksheet_Graph.Cells[cellSelectString_XSimulated]));
                    ser3.Header = "Simulated";

                    //// break series at every trip completion to have proper chart display of trips
                    //// - for SimData sheet
                    //foreach (int emptyRowNumInsert in simSheetRowBreaks)
                    //{
                    //    worksheet_sim.InsertRow(emptyRowNumInsert, 1);
                    //}
                    //// - for ObsData sheet
                    //foreach (int emptyRowNumInsert in obsSheetRowBreaks)
                    //{
                    //    worksheet_sim.InsertRow(emptyRowNumInsert, 1);
                    //}

                    // set graph sheet as selected
                    xlPackage.Workbook.Worksheets[worksheet_Graph.Name].View.TabSelected = true;

                    //--- Save The Entire Workbook ---
                    xlPackage.Save();
                }//end using
            }//end if count > 0
        }//end foreach dir
Appendix B Software Repository
The software developed for this thesis is hosted in two separate online code repositories.
Readers may request access from the author by email at bo.wen [at] alum [dot] utoronto [dot] ca,
or by visiting the links below:
https://bitbucket.org/bowenwen/nexus.surfacesimulator
The first repository contains the Nexus Surface Simulator, written in C#. It consists of a set of
tools for data collection, data processing, and model estimation with transit data. Follow the
instructions in “README.md” and “Installations.doc” to set up the integrated development
environment.
https://bitbucket.org/bowenwen/nexus.ssrtool
The second repository contains the R scripts used by the Nexus Surface Simulator to estimate
models. Follow the instructions in “README.md” to configure the R scripts for use with the
Nexus Surface Simulator.
Appendix C TRB Paper 2016
The attached paper was submitted to the 2016 Transportation Research Board Conference.
NEXUS SURFACE SIMULATOR FOR LARGE-SCALE TRANSIT SIMULATION
USING GPS AND GTFS DATA: FRAMEWORK AND TRAVEL TIME MODEL
Bo Wen Wen, B.A.Sc.
Department of Civil Engineering
University of Toronto
35 St. George Street, Toronto, ON, Canada M5S 1A4
Tel: 647-924-1996; Email: [email protected]
Siva Srikukenthiran, Ph.D.
Department of Civil Engineering
University of Toronto
35 St. George Street, Toronto, ON, Canada M5S 1A4
Tel: 416-978-5049; Email: [email protected]
Dennis Xu Wu
Department of Civil Engineering
University of Toronto
35 St. George Street, Toronto, ON, Canada M5S 1A4
Email: [email protected]
Prof. Amer Shalaby, Ph.D., P.Eng.
Department of Civil Engineering
University of Toronto
35 St. George Street, Toronto, ON, Canada M5S 1A4
Tel: 416-978-5907; Email: [email protected]
Word count: 5415 words text + 7 tables/figures x 250 words = 7165 words
2016-08-01
ABSTRACT
Transit network simulation is a critical planning and forecasting tool for assessing transit
performance, identifying operational shortcomings and evaluating potential service changes. In the
development of Nexus, an integrated transit network simulation platform, the surface transit arrival
patterns at major subway stations must be characterized. This study proposes Nexus Surface
Simulator, a computationally efficient large-scale transit network simulator using open data from
Automatic Data Collection Systems (ADCS) and General Transit Feed Specification (GTFS). A
modelling framework of the simulator and a first-stage prototype were developed. Using Support
Vector Regression (SVR), running time and dwell time models were estimated using GPS and
GTFS data. The SVR models replicated the observed conditions and produced acceptable root-
mean-square error measures. The results of the prototype simulator demonstrated the feasibility of
producing a large-scale travel time model to accurately represent the average weekday transit travel
conditions. Further developments of the simulator will include a vehicle simulation engine based
on the output conditions produced by the travel time model, which will then be integrated into the
Nexus platform.
Keywords: Public Transportation, Data and Information Technology, Planning and Forecasting
INTRODUCTION AND MOTIVATION
In many large cities around the world, public transit networks are operating near or at
capacity, leaving little room for absorbing service variation or disruptions. Currently, transit
agencies handle service irregularities and disruptions in an ad-hoc fashion. This is due to the lack
of analytical tools capable of realistically characterizing system performance under congested
conditions (due to capacity constraints and/or disruption incidents) and analyzing the impact of
response strategies. Nexus, an integrated crowd dynamics and transit network simulation platform
currently under development, enables the seamless interfacing of a rail simulator, a station and
pedestrian simulator, and a surface transit simulator to form a large-scale multimodal transit
network simulation (1). It overlays an agent-based framework to model passenger travel behaviour,
while permitting on-the-fly modifications to transit service operation (1). Nexus is designed to
perform capacity analysis of critical transit infrastructure, transit disruption scenario modelling
and analysis, and disruption response strategy assessment at the station, line and network levels.
The modelling of large-scale multimodal transit networks, such as those in Toronto,
Chicago and New York City, will require realistic representations of surface network operations
and their interactions with the rapid transit network in Nexus. Nexus currently uses a simplified
simulator of surface transit, with vehicles travelling according to scheduled times, but this does
not realistically represent the temporal variation of surface transit travel behaviour (1). Introducing
other vehicle modes and traffic signals would impose heavy data requirements, and would require
a traffic simulator which would greatly increase simulation time (1). As this may not be appropriate
for many situations, the objective of this study is to develop an alternative method in which
publicly available transit data, such as global positioning system (GPS) data of bus and streetcar
movements, are used to produce a statistical model of transit vehicle travel times between stops.
State of Practice in Transit Modelling and Advances in Data
One of the major challenges in building a large-scale surface transit simulation is choosing
an appropriate modelling technique to realistically represent transit travel behaviour while
maintaining efficient computation. Macroscopic flow models incorporate aggregate variables such
as flow, density, and speed, but they are usually inadequate for modelling public transit because
they lack individual vehicle representations (2). At the other extreme, microsimulation software
can describe detailed interactions among transit vehicles, passengers, and traffic, which allows the
evaluation of complex public transit policies (3). However, in large networks, the simulation of
individual vehicles in microsimulation imposes high computational requirements (4). To overcome
this limitation, mesoscopic simulation models can be used since they do not explicitly model the
properties of individual vehicles such as position and velocity (5). Instead, they treat roadways as
queuing systems and track vehicle platoon flows (6). As such, mesoscopic simulation models can
produce stop-level statistics and are more suitable for large-scale transit applications.
Another major challenge in building a large-scale surface transit simulation is
characterizing the variation in transit vehicle travel. Recent studies have investigated the use of
artificial neural networks (ANN), support vector machines (SVM) and the k-nearest neighbour
algorithm (k-NN) to predict transit movement behaviour (7, 8, 9). It was found that SVM had the
strongest resistance to over-fitting (10). In addition, it demonstrated improved accuracy for
multiple route data (10). While previous studies have used statistical learning models to produce
travel time predictions for a limited number of routes, this study will use a statistical learning
model to develop a large-scale mesoscopic public transit network model with state-of-the-art GPS
and GTFS data.
Recent developments in public transit network modelling have been made possible by the
use of automatic data collection systems (ADCS) and general transit feed specification (GTFS)
(11). These advances in accessible public transit data have enabled modelling approaches that
require detailed network input data, such as new simulation methods based on a statistical learning
approach, as used in this study.
ADCS, such as automatic vehicle location systems (AVL), automatic passenger counting
systems (APC), and automatic fare collection systems (AFC), are capable of collecting real-time
data and archiving historical data (12). Bus locations collected by AVL-based GPS data are readily
available in real time. In contrast, passenger loading and passenger trip data, collected by APC and
AFC, respectively, are typically not available in real time (12). A large sample of detailed data with
measurable accuracy can be collected at a lower marginal cost by ADCS (12); this data can be used
to model transit vehicle travel and transit passenger demand.
GTFS is a standardized open data format for transit schedule data (13). Many transportation
modelling applications have been introduced as the result of the GTFS initiative. For example,
GTFS data have been used to build and update a large regional travel forecasting model (14). Map-
matching algorithms have been applied to GTFS data with GPS points from navigation to create a
multimodal transportation model (15). An intermodal shortest path algorithm has also been used
with GTFS data to develop a dynamic transit assignment model (16). Finally, transportation
simulation software packages including MATSim have used GTFS data to build microsimulation
model layouts (17).
Past studies have demonstrated the use of the ADCS and GTFS data in transit models. By
using both of these datasets, the transit travel patterns and transit network layout can be modelled
to evaluate transit network performance. The aim of this study is to develop a computationally
efficient surface transit simulator using these datasets which interfaces with the Nexus platform.
SYSTEM FRAMEWORK
The Nexus Surface Simulator (NSS) consists of a database server, a travel time model, a
transit network builder and a vehicle simulation engine, as illustrated in Figure 1.
FIGURE 1 Nexus Surface Simulator Flowchart
The database server component stores the data required for the travel time model. The
presence of GTFS data and GPS data is the minimal requirement for a travel time model, because
GTFS data is required to build the transit network layout, and GPS data is required to build the
link travel time model. To take into consideration other factors affecting link travel time, the
database server would ideally also include automated passenger counts (APC), automated fare
collection (AFC), weather and roadway conditions data.
The travel time model uses the processed data from the database server to estimate and
validate SVR models for link travel times. While the simulator can use a range of travel time
modelling methods such as ANN and k-NN, for the initial implementation of NSS, SVR has been
chosen due to its resistance to overfitting in comparison to other modelling methods (10). SVR’s
resistance to overfitting is largely attributed to the use of the v-fold cross validation procedure for
model training (18).
The network builder enables the construction of the transit network layout using GTFS
data. In particular, it converts processed GTFS data into data objects containing stop information,
link definitions, route layouts, route schedules and trip timetables. These data objects can be used
to automate the creation of large-scale surface transit networks, which can be quickly updated via
GTFS data when changes occur to the network. The network builder can also use this network
layout data to visually represent model information.
The final component, the vehicle simulation engine, simulates the movements of public
transit vehicles. The initial conditions for vehicle release are based on the starting stops and times
defined by the values stored in the GTFS trip tables. The SVR model is used to update the link
travel time periodically. The vehicle simulation engine uses this predicted link travel time to update
vehicle position, as well as the expected arrival time at downstream stops and stations. The updated
vehicle and link information is used in turn by the SVR model to compute a series of link travel
times for the next time step.
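The update cycle described above can be sketched as a simple loop that moves each vehicle along its route using the latest predicted link times. This is an illustrative Python sketch only, not the NSS C# implementation; the names `Vehicle`, `advance` and `predict_link_time` are hypothetical.

```python
# Illustrative sketch of the vehicle simulation engine's update step.
# All names here are hypothetical; the NSS itself is written in C#.
from dataclasses import dataclass

@dataclass
class Vehicle:
    trip_id: str
    n_links: int             # number of links along the route
    link_index: int = 0      # index of the current link
    position: float = 0.0    # fraction of the current link traversed (0..1)

def advance(vehicle, dt_min, predict_link_time):
    """Move a vehicle forward by dt_min minutes using predicted link times."""
    remaining = dt_min
    while remaining > 0 and vehicle.link_index < vehicle.n_links:
        # re-query the travel time model for the current link at each step
        t_link = predict_link_time(vehicle.trip_id, vehicle.link_index)
        time_left = (1.0 - vehicle.position) * t_link
        if remaining >= time_left:      # the vehicle reaches the next stop
            remaining -= time_left
            vehicle.link_index += 1
            vehicle.position = 0.0
        else:                           # the vehicle stays on this link
            vehicle.position += remaining / t_link
            remaining = 0.0
    return vehicle

# usage: a 3-link route with a constant 2-minute predicted link time
v = advance(Vehicle("trip-1", 3), 3.0, lambda trip, i: 2.0)
print(v.link_index, v.position)  # 1 0.5
```

In a full implementation, the predicted link times returned by `predict_link_time` would come from the SVR model and would be refreshed between simulation time steps.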
The conceptual design of the NSS also includes a connection interface to enable linking
the simulator to the Nexus coordination server. This allows the integration of surface transit
simulation with rail and pedestrian simulation on the Nexus platform for use in a broad range of
agent-based analyses of public transit systems (1).
METHODS
The first stage prototype of the NSS consists of a database server with GTFS and GPS data,
an SVR travel time model, and a transit network builder. Development of the first stage prototype
of the NSS involved four steps: data collection, data processing, model estimation and model
validation.
Case Study for the NSS Prototype
The objective of developing the first stage prototype was to test the accuracy of the SVR
travel time model for a large-scale surface transit network. The transit network chosen to build this
prototype was the Toronto Transit Commission (TTC) network, using available GTFS and GPS
data. The TTC network, operating within the City of Toronto, consists of 4 subway lines, 1 short
automated transit line, 11 streetcar lines, 140 bus lines, and over 10,000 transit stops. The TTC
surface transit system is well connected to the subway system with 148 out of 151 bus and streetcar
lines making 245 connections with the subway and the rapid transit line. The study period selected
for training the SVR model was from May 2, 2016 to May 6, 2016. A large-scale weekday morning
peak-period travel time model was developed for the TTC surface transit network for the study
period.
Data Collection
To collect data efficiently, web APIs were used to retrieve data from external data sources,
and store them within the NSS’s data server. This method allows for continuous training of the
travel time model with temporally varying data. The two sets of data used for the prototype NSS
were historical GTFS and GPS data. After determining the datasets required, the NSS retrieved
missing GTFS and GPS data for the study period. The GTFS and GPS data tables are summarized
in Table 1.
TABLE 1 Summary of database tables and variables in the GTFS and GPS data
The GTFS data contains schedule information for the transit service network. The GTFS
data for the TTC are updated around ten times per year. The updates to GTFS data correspond
closely to periodic service changes the TTC makes throughout the year. The GTFS data were
retrieved via the TransitFeeds website (19). The GTFS data are provided in the comma-separated
values (CSV) file format, containing information on transit service period, routes, shapes, stops,
stop times and trips. The GTFS data for the period between March 27, 2016 and May 7, 2016 were
used to construct the transit network structure for the NSS (20). The GTFS schedule information
in its raw CSV format was processed using a converter before being stored in a Nexus SQLite
database.
The GPS data were retrieved via the Connected Vehicles and Smart Transportation (CVST)
web API (21). These data contain vehicle information such as latitude, longitude, heading (bearing),
route information and vehicle identification, collected at 1 minute intervals. A set of AM peak
period (6:00 to 9:00) GPS data, from May 2, 2016 to May 6, 2016, was used as training data to
estimate the travel time model, while a set of testing data from April 25, 2016 to April 29, 2016,
AM peak period was used to validate the model. The latitude, longitude, heading, route information
and vehicle identification of the GPS points were sorted into vehicle trips and processed, as
detailed in Data Processing, before being stored in the SQLite database.
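The archiving step of this polling workflow can be sketched as follows. The table schema and record field names below are assumptions for illustration, and the real CVST endpoint and payload format are not reproduced here; the sketch is in Python rather than the simulator's C#.

```python
# Sketch of archiving one polling cycle of vehicle GPS records into SQLite.
# The schema and field names are hypothetical, not the NSS data server schema.
import sqlite3

def store_gps_records(conn, records):
    """Insert one polling cycle of parsed vehicle GPS records into the archive."""
    conn.execute("""CREATE TABLE IF NOT EXISTS gps (
        vehicle_id TEXT, route TEXT, ts INTEGER,
        lat REAL, lon REAL, heading REAL)""")
    conn.executemany(
        "INSERT INTO gps VALUES (?, ?, ?, ?, ?, ?)",
        [(r["vehicle_id"], r["route"], r["ts"], r["lat"], r["lon"], r["heading"])
         for r in records])
    conn.commit()

# A real deployment would poll the web API once per minute, e.g.:
#   with urllib.request.urlopen(FEED_URL) as resp:  # FEED_URL is hypothetical
#       records = json.load(resp)
conn = sqlite3.connect(":memory:")
store_gps_records(conn, [
    {"vehicle_id": "1234", "route": "504", "ts": 1462180800,
     "lat": 43.6532, "lon": -79.3832, "heading": 90.0},
])
print(conn.execute("SELECT COUNT(*) FROM gps").fetchone()[0])  # 1
```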
Data Limitations
No APC data was included in the prototype NSS due to the lack of network-wide data on
passenger counts for the TTC network. Although the deployment of the AFC system on the TTC is
underway, the AFC system currently does not have network-wide availability (22). In addition, a
tap-on and -off AFC system is needed to fully characterize passenger trips from the AFC system
(11). Currently, the TTC has plans to implement tap-on and -off at subway stations by 2017 (23).
Finally, while weather and roadway incident data are available on Statistics Canada and Open Data
Toronto, respectively, they were not included in the first stage prototype because of the absence of
a web application program interface (API) implementation to retrieve these data efficiently.
Due to the unavailability of these data sources, a few assumptions on weather conditions,
traffic conditions and transit service quality were made in the travel time model. Firstly, due to the
exclusion of weather data, the GPS data was assumed to have been collected on days where there
would be minimal impact from weather on transit travel time. Secondly, the effects of incidents
and congestion were assumed to be implicit in the travel time model; in particular, the variation of
travel times across the modelling period was assumed to capture the background congestion level
of the modelling period. Finally, since a large part of the TTC network during peak hours has
frequent service of 10 minutes or less, headway variation between vehicles, as opposed to schedule
delay, was assumed to be a better indicator of transit service quality (24).
Data Processing
The raw data were processed into the appropriate formats and then stored in the data server,
allowing it to be used by the travel time model.
GTFS Data
Using a GTFS data converter, a network of routed vehicles with schedule stops and times
was built from a list of stops with a series of vehicle departure times. First, the converter read into
internal data structures all of the original GTFS data tables such as calendar, routes, shapes, stops,
stop times and trips. This allows the converter to organize the tables of trip segments and stop
departure times into series of vehicle blocks. Next, stop and route information were matched to
generate a schedule table. The schedule table defines a list of stops and distances between stops
with a unique schedule for each specific route and service period. Finally, the converter generated
a trip table with stop departure times linked to the schedule table. The schedule table provides the
network structure of the transit network by route and service period while the trip table provides a
detailed transit schedule. Together, they allow for the construction of network links, which are each
defined by a start stop and an end stop.
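The link construction step above can be sketched from the GTFS `stop_times` table alone. The field names follow the GTFS specification; the in-memory rows below stand in for the CSV files, and the helper name `build_links` is illustrative.

```python
# Sketch of deriving stop-to-stop links from a GTFS stop_times sequence.
# Field names follow the GTFS spec; the rows stand in for the CSV tables.
def build_links(stop_times):
    """stop_times: rows of (trip_id, stop_sequence, stop_id, shape_dist_traveled).
    Returns one link per consecutive stop pair: (start_stop, end_stop, length_m)."""
    by_trip = {}
    for trip_id, seq, stop_id, dist in stop_times:
        by_trip.setdefault(trip_id, []).append((seq, stop_id, dist))
    links = set()
    for rows in by_trip.values():
        rows.sort()                       # order stops by stop_sequence
        for (_, s0, d0), (_, s1, d1) in zip(rows, rows[1:]):
            links.add((s0, s1, round(d1 - d0, 1)))
    return sorted(links)

rows = [
    ("t1", 1, "A", 0.0), ("t1", 2, "B", 450.0), ("t1", 3, "C", 800.0),
    ("t2", 1, "A", 0.0), ("t2", 2, "B", 450.0),   # repeat trip, same links
]
print(build_links(rows))  # [('A', 'B', 450.0), ('B', 'C', 350.0)]
```

Deduplicating links across trips, as the `set` does here, mirrors how the schedule table defines each link once per route and service period while the trip table carries the per-trip departure times.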
GPS Data
The GPS data were processed to determine the start and end stops and the estimated
speed and headway for each vehicle GPS point. First, after sorting each vehicle's GPS points into
trips, the upstream (start) and downstream (end) stops were determined for each GPS point.
was accomplished by identifying the link upon which the GPS point fell, based on the list of stops
and stop distances in the schedule table. The GPS points that did not fall on current segments in
the transit network were not considered in this study, including transit vehicles taking deadhead
trips or long detours. Next, the estimated speed and headway of the GPS points were calculated
based on the time and location of nearby GPS points. The estimated speed was calculated based
on the differences in time and distance between adjacent GPS points on the same trip, while the
estimated headway was approximated based on the differences in time between the subsequent
trips at approximately the same location. If the adjacent GPS points of a given trip did not change
position for more than 1 minute (the GPS data collection interval), a dwell time at the transit stop
was recorded. The time, speed, headway and vehicle identification data for all the GPS points
were populated into their respective GTFS-based links.
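The per-point speed and dwell estimation described above can be sketched as follows. This Python sketch is illustrative: it assumes points have already been sorted into trips and map-matched to distances along the route, it treats any stationary gap of at least one polling interval as a dwell, and it omits the cross-trip headway estimation for brevity.

```python
# Sketch of per-segment speed and dwell detection for one trip's GPS points,
# assuming points are sorted by time and map-matched to route distance.
def estimate_speed_and_dwell(points, poll_interval_s=60):
    """points: list of (timestamp_s, dist_along_route_m) for one trip.
    Returns per-segment (speed_m_s, dwelling) between adjacent GPS points."""
    out = []
    for (t0, d0), (t1, d1) in zip(points, points[1:]):
        dt = t1 - t0
        speed = (d1 - d0) / dt if dt > 0 else 0.0
        # no movement for at least one polling interval => dwell at a stop
        dwelling = (d1 == d0) and dt >= poll_interval_s
        out.append((speed, dwelling))
    return out

# one trip polled each minute; the vehicle sits still from t=60 s to t=240 s
trip = [(0, 0.0), (60, 600.0), (120, 600.0), (240, 600.0), (300, 1200.0)]
print(estimate_speed_and_dwell(trip))
# [(10.0, False), (0.0, True), (0.0, True), (10.0, False)]
```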
Model Estimation using Support Vector Machines
The theoretical framework of the travel time model and the three basic components of support
vector regression (SVR) are discussed.
Travel Time Model
The travel time model provides predictions of the running times between stops as well as
dwell times. The total travel time for a transit vehicle trip is the summation of these times across
the length of the route. As such, the SVR travel time model is a combination of a running time
model and a dwell time model, as shown in Equation 1.
$$T_{l,b} = \sum_{i=1}^{I} t_{i,l,b}^{\text{running}} + \sum_{j=1}^{J} t_{j,l,b}^{\text{dwell}} \tag{1}$$
where $T_{l,b}$ is the total travel time of a bus ($b$) along a specific route ($l$), $t_{i,l,b}^{\text{running}}$ is the
predicted running time at link $i$ of a bus ($b$) along a specific route ($l$), and $t_{j,l,b}^{\text{dwell}}$ is the predicted
dwell time at stop $j$ of a bus ($b$) along a specific route ($l$).
Soft Margin Hyperplane Optimization
SVMs are statistical learning models capable of classification and regression (25). They
are based on the idea of finding an optimal hyperplane that maximizes the separation or “margin”
between groups of data with different properties (18). The soft margin hyperplane which separates
data with minimum classification error is found using Equation 2 (26).
$$\min_{w,\,\xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to } \xi_i \ge 0 \tag{2}$$
where $w$ is a normal vector to the hyperplane, $C$ is the constant on the error terms
(regularization parameter), $l$ is the sample size, and $\xi_i$ are the estimation error terms (26).
Radial basis function kernel (RBF)
For linearly inseparable data, the kernel method is used to transform the data to higher
dimensions and then to classify the data in higher dimensions (25). To avoid the need to specify
the transformed data points explicitly, SVM uses a kernel function to compute the decision
boundaries (25). The kernel functional form and the kernel parameters determine the shape of the
hyperplane (25). A commonly used kernel function is the radial basis function (RBF) (see Equation
3). The RBF kernel has been chosen for its ability to handle nonlinear relations and its
computational efficiency with fewer hyperparameters (12).
$$K(x_i, x_j) = \exp\!\left(-\gamma \,\lVert x_i - x_j \rVert^{2}\right) \quad \text{subject to } \gamma > 0 \tag{3}$$
where $\gamma$ is the kernel parameter; $x_i$ and $x_j$ are two feature vectors containing dependent
variables such as travel time on the link or dwell time at a stop, and independent variables such as
the time value in 5-minute intervals and vehicle headway (18).
ε-insensitive loss function for SVR
To adapt SVM for regression, SVR uses an ε-insensitive error measure in which errors smaller
than a constant ε are ignored, while larger errors contribute linearly through the amount by which
the absolute residual exceeds ε (see Equation 4) (25). This makes the fitting less sensitive to outliers (25). The soft
margin hyperplane and kernel property are not limited to classification and have been demonstrated
in past studies to be integral to the properties of SVR (27). Furthermore, the kernel method for
SVR can avoid the need to evaluate a large set of functions to obtain error minimizations (25).
$$\xi_\varepsilon = \lvert r \rvert - \varepsilon \quad \text{subject to } \lvert r \rvert \ge \varepsilon \tag{4}$$
where $\varepsilon$ is the loss function parameter and $r$ is the estimation error (25); this loss function
is applied to the error terms $\xi_i$ of Equation 2 in the SVR model estimation (27).
Parameter estimation and cross validation procedure
To estimate a SVR model based on RBF kernel and ε-insensitive loss functions, three
parameters need to be specified: the regularization parameter (𝐶), the kernel parameter (𝛾) and the
loss function parameter 𝜀. A v-fold cross validation procedure using a grid-search method was used
to find the most suitable set of $C$ and $\gamma$ parameters. The v-fold cross validation procedure divides
the training data into v subsets and trains the model v times, each time validating on a different
held-out subset (18).
method of finding 𝐶 and 𝛾 prevents overfitting while minimizing the root-mean-square error of
the SVR model (18). A common 𝜀 value of 0.001 was used since it was reasonable to exclude very
small estimation errors (18). To implement SVR in the prototype NSS, the LIBSVM library was used
to estimate the travel time model (28). Using the LIBSVM library, the SVR model was trained
with a set of link running times, time values in 5-minute intervals and vehicle headways. Using
the test set data with time values and vehicle headways as inputs, the trained model produced link
running time and dwell time predictions.
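The grid-searched estimation described above can be sketched in Python using scikit-learn, whose `SVR` class wraps LIBSVM; the study itself used LIBSVM via R scripts, so this is a stand-in under assumptions. The two features mirror the inputs named in the text (time value in 5-minute intervals and vehicle headway), the data are synthetic, and the grid values are illustrative.

```python
# Sketch of v-fold grid-searched SVR estimation with an RBF kernel.
# Synthetic link running times stand in for the GPS-derived observations.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
time_bin = rng.integers(0, 36, 120)        # 5-min intervals over 6:00-9:00
headway = rng.uniform(2, 15, 120)          # headway in minutes
X = np.column_stack([time_bin, headway])
# synthetic link running time (min) with mild noise
y = 1.5 + 0.05 * time_bin + 0.1 * headway + rng.normal(0, 0.1, 120)

# v-fold cross-validated grid search over C and gamma; epsilon fixed at 0.001
grid = GridSearchCV(
    SVR(kernel="rbf", epsilon=0.001),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]},
    cv=5, scoring="neg_root_mean_squared_error")
grid.fit(X, y)
print(grid.best_params_, round(-grid.best_score_, 3))
```

The negated score reported by scikit-learn corresponds to the cross-validated RMSE that the grid search minimizes when selecting $C$ and $\gamma$.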
Validation of SVR Travel Time Model
After the final SVR models for link running time and dwell time were estimated, the trained
SVR models were validated against the test set data corresponding to the link or stop. This was
done by first using the trained SVR models to generate sets of predictions on link running time
and dwell time using the input test data. The sets of predictions were then compared against the
observed conditions in the test set data using the root-mean-square error (RMSE) measure.
$$\text{RMSE} = \sqrt{\frac{1}{l} \sum_{i=1}^{l} \left(f(x_i) - y_i\right)^{2}} \quad \text{subject to } l \ne 0 \tag{5}$$
where $l$ is the number of samples, $f(x_i)$ is the predicted value, and $y_i$ is the observed
value; the estimation error is $r = f(x_i) - y_i$ as shown in Equation 4.
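Equation 5 translates directly into code; a minimal sketch:

```python
# Root-mean-square error as defined in Equation 5.
import math

def rmse(predicted, observed):
    """RMSE over paired predicted and observed values (Equation 5)."""
    assert len(predicted) == len(observed) and len(observed) > 0
    return math.sqrt(sum((f - y) ** 2 for f, y in zip(predicted, observed))
                     / len(observed))

print(rmse([2.0, 3.0, 4.0], [2.5, 3.5, 3.0]))  # sqrt(0.5) ≈ 0.707
```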
RESULTS
The first stage prototype of the NSS produced a detailed travel time model for a large-scale
transit network. The model training and validation results for link running times and stop dwell
times are presented in this section.
Results of link running time model
Using the route and stop information from the GTFS data, a complete set of links was
constructed to represent the entire TTC network. Using the GPS data on each of the links, a running
time model of each link was estimated using the training data set and validated using the test set
data. The running time model was able to replicate the slower travel speeds and lower travel times
in the Toronto downtown area, as well as the higher transit speeds outside the downtown core (see
Figure 2). The upper map shows the observed link travel times from the test set data, while the
lower map shows the predicted values for the same period.
FIGURE 2 Above: average of observed link running times for test set data. Below: average
of predicted link running times for test set data.
Validation of results
RMSE values were produced for the running time models (see Figure 3). The average
RMSE for link running times across the entire network was 0.60 min; that is, on average, a
link's predicted running time fell within about 0.60 min of the observed running time. While
the RMSE for the Toronto downtown area was low (~1 min), some longer links in the
network showed poorer performance, such as links in the vicinity of the York University and
Scarborough Town Centre areas (>2 min). A scatterplot of RMSE versus link length, as shown in
Figure 4, demonstrated a greater spread of RMSE with increasing link length (R2 = 0.60 for all
links, R2 = 0.14 for link lengths less than 400 m).
FIGURE 3 RMSE of link running times on the TTC network.
FIGURE 4 Above: RMSE of link running time versus link length for entire data set Below:
RMSE of link running time versus link length for link length 400m or less.
Results of stop dwell times
Using the stop information from the GTFS data, a complete set of stops in the TTC network
was constructed. Then, using the GPS data arriving at each stop, dwell time models were produced.
The dwell time models were able to replicate the dwell times at the major intersections, terminal
stations and subway stations (see Figure 5). In particular, the High Park, Rosedale and Pape subway
stations had some of the highest observed and predicted dwell times, ranging from 10 to 15 minutes
(see Figure 5). A common characteristic among these high dwell time stations is the presence of
lower frequency buses (headway of 15 min or greater). Stops with higher dwell time observations
and predictions were usually terminal stops or subway stops. In contrast, lower dwell time
observations and predictions (<5 min) usually occurred at timed points or intersections with longer
than normal delays.
FIGURE 5 Above: average of observed dwell times on the TTC network. Below: average of
predicted dwell times on the TTC network.
Validation of results
RMSE values were produced for the dwell time models (see Figure 6). The average RMSE of
dwell times across the entire network was found to be 2.5 min. While the RMSE for a large part
of the transit network remained low, there were
isolated areas where RMSE of dwell time was high. The RMSEs at major subway stations and
terminal stops were generally higher while the RMSEs at intersections or timed stops were lower.
FIGURE 6 RMSE of dwell time on the TTC network
DISCUSSION AND CONCLUSION
This study presented the conceptual framework of a large-scale surface transit simulator to
be an eventual component of the Nexus integrated transit simulation platform. A first stage
prototype of the NSS, consisting of a data server, travel time model and a transit network builder,
was developed. The prototype NSS generated a detailed large-scale transit network, upon which a
travel time model was built. The travel time model used SVR to generate link running time and
stop dwell time models. The SVR models were able to replicate the observed conditions with
reasonable RMSE; however, the RMSE varied considerably across different links within the
network.
Two factors which may have influenced the performance of the running time models were
link length and roadway incidents. Longer link length was correlated with RMSE for the running
time models (R2 = 0.60 for all links). Longer links usually involve more turning movements and
intersection delays which were not captured and may have resulted in lower accuracy of predicted
link travel times. When examining shorter links where the stop distances are less than 400 m (the
TTC criterion for stop spacing is 300 to 500 m), the correlation between RMSE and link length
was weaker than that for all links (R2 = 0.60 for all links, R2 = 0.14 for link lengths less than 400 m).
This implies that link distance is less of an influence on travel time for shorter links. Roadway
conditions including congestion and roadway incidents may influence transit travel times, but they
were not included in the running time model. Roadway conditions would be needed to improve
prediction performance.
The performance of the dwell time model may have been affected by the length of the GPS
polling interval, the lack of passenger loading data and the exclusion of scheduled vehicle departure times.
Dwell times shorter than the polling time of 60 seconds would not have been captured by the model.
This may have influenced the RMSE for the dwell time models. In addition, passenger boarding
and alighting data could improve the accuracy of the dwell time predictions, especially for non-
terminal bus stops. For terminal bus stations and subway stations, scheduled departure times of the
vehicle could improve dwell time predictions. Doing so will require the implementation of a
vehicle simulation engine where the arrival and departure times of a bus at a terminal station can
be tracked.
In the continued development of the NSS, a wider range of data, such as APC, AFC,
weather and traffic condition data, will be included in the data server to produce a more accurate
travel time model. The model could also be continuously trained with real-time transit data for
other applications such as real-time trip planning and real-time transit network management. Using
the updated travel time model, a vehicle simulation engine will then be developed to produce a
large-scale transit simulation network. Finally, the NSS will be integrated into the Nexus platform
where activity-based transit simulation will be performed to quantify multimodal transit
performances at the network-level.
ACKNOWLEDGEMENTS
The authors are grateful for the resources provided by Connected Vehicles
and Smart Transportation (CVST).
REFERENCES
1. Srikukenthiran, S., and A. Shalaby. Prototyping a Scalable Agent-based Modelling
Framework for Large-Scale Simulation of Crowd & Subway Network Dynamics.
Presented at the Conference on Advanced Systems in Public Transport, Santiago, Chile,
2012.
2. Spiliopoulou, A., M. Kontorinaki, and M. Papageorgiou. Macroscopic traffic flow model
validation at congested freeway. Transportation Research Part C, Vol. 41, 2014, pp. 18-29.
3. Fernandez, R., C. Cortes, and V. Burgos. Microscopic simulation of transit operations:
policy studies with the MISTRANSIT application programming interface. Transportation
Planning and Technology, Vol. 33, No. 2, 2010, pp. 157-176.
4. Cortes, C., L. Pages, and R. Jayakrishnan. Microsimulation of Flexible Transit System
Designs in Realistic Urban Networks. In Transportation Research Record: Journal of the
Transportation Research Board, No. 1923, Transportation Research Board of the National
Academies, Washington, D.C., 2005, pp. 153-163.
5. Toledo, T., O. Cats, W. Burghout, and H. Koutsopoulos. Mesoscopic simulation for transit
operations. Transportation Research Part C: Emerging Technologies, Vol. 18, No. 6, 2010,
pp. 896-908.
6. Cats, O., W. Burghout, T. Toledo, and H. Koutsopoulos. Mesoscopic Modelling of Bus
Public Transportation. In Transportation Research Record: Journal of the Transportation
Research Board, No. 2188, Transportation Research Board of the National Academies,
Washington, D.C., 2010, pp. 9-18.
7. Chen, M., X. Liu, J. Xia, and S. Chien. A Dynamic Bus-Arrival Time Prediction Model
Based on APC Data. Computer-Aided Civil and Infrastructure Engineering, Vol. 19, No. 5,
2004, pp. 364–376.
8. Shalaby, A., and A. Farhan. Prediction Model of Bus Arrival and Departure Times Using
AVL and APC Data. Journal of Public Transportation, Vol. 7, No. 1, 2004, pp. 41-61.
9. Yu, B., Z. Yang, and B. Yao. Bus Arrival Time Prediction Using Support Vector Machines.
Journal of Intelligent Transportation Systems, Vol. 10, No. 4, 2006, pp. 151-158.
10. Yu, B., W. Lam, and M. L. Tam. Bus arrival time prediction at bus stop with multiple routes.
Transportation Research Part C: Emerging Technologies, Vol. 19, No. 6, 2011, pp. 1157-
1170.
11. Gaudette, P., R. Chapleau, and T. Spurr. Bus Network Microsimulation with GTFS and
Tap-in Only Smart Card Data. Presented at the 95th Annual Meeting of the Transportation
Research Board, Washington, D.C., 2016.
12. Munoz, J. C., and L. Paget-Seekins, eds. Restructuring Public Transport through Bus Rapid
Transit: An International and Interdisciplinary Perspective. Policy Press, Bristol, 2016.
13. Goldstein, B., and L. Dyson, eds. Beyond Transparency. Code for America Press, Portland,
2013.
14. Puchalsky, C., D. Joshi, and W. Scherr. Development of a Regional Forecasting Model
Based on Google Transit Feed. Presented at the 91st Annual Meeting of the Transportation
Research Board, Washington, D.C., 2012.
15. Perrine, K., A. Khani, and N. Ruiz-Juri. Map-Matching Algorithm for Applications in
Multimodal Transportation Network Modeling. In Transportation Research Record:
Journal of the Transportation Research Board, No. 2537, Transportation Research Board of
the National Academies, Washington, D.C., 2015, pp. 62-70.
16. Khani, A., B. Bustillos, H. Noh, Y. Chiu, and M. Hickman. Modeling Transit and
Intermodal Tours in a Dynamic Multimodal Network. In Transportation Research Record:
Journal of the Transportation Research Board, No. 2467, Transportation Research Board
of the National Academies, Washington, D.C., 2014, pp. 21-29.
17. Nicolai, T. W., and K. Nagel. Integration of agent-based transport and land use, December
2013. http://svn.vsp.tu-berlin.de/repos/public-svn/publications/vspwp/2013/13-20/chapter3-2_10dec2013.pdf. Accessed June 16, 2016.
18. Hsu, C.-W., C.-C. Chang, and C.-J. Lin. A Practical Guide to Support Vector Classification.
National Taiwan University, Taipei, 2016, pp. 1-16.
19. TransitFeeds. TTC GTFS – TransitFeeds. http://transitfeeds.com/p/ttc/33. Accessed April
30, 2016.
20. TransitFeeds. 19 March 2016 - TransitFeeds. https://transitfeeds.com/p/ttc/33/20160319.
Accessed May 20, 2016.
21. Connected Vehicles and Smart Transportation. CVST Live Traffic. http://portal.cvst.ca/.
Accessed June 21, 2016.
22. Hollis, R., and C. Upfold. Expanding Presto on TTC: Status Update.
http://www.metrolinx.com/en/docs/pdf/board_agenda/20160427/20160427_PRESTO_Im
plementation_Update_EN.pdf. Accessed July 16, 2016.
23. Borkwood, A. TTC Fare Policy. https://www.ttc.ca/About_the_TTC/Commission_reports
_and_information/Commission_meetings/2015/December_16/Reports/Presentation_Fare
_Policy_final.pdf. Accessed July 16, 2016.
24. Toronto Transit Commission. TTC introduces 10-minutes-or-better service on buses,
streetcars. https://www.ttc.ca/News/2015/June/0615_10min-service.jsp. Accessed July 17,
2016.
25. Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, Stanford, 2008.
26. Cortes, C., and V. Vapnik. Support Vector Networks. Machine Learning, Vol. 20, 1995, pp.
273-297.
27. Smola, A. J., and B. Scholkopf. A Tutorial on Support Vector Regression.
http://www.svms.org/regression/SmSc98.pdf. Accessed June 22, 2016.
28. Chang, C.-C., and C.-J. Lin. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, Vol. 27, 2011, pp. 1-27.
Appendix D TransitData 2017 Abstract
The attached abstract was accepted and presented at the 2017 TransitData Conference.
AUTOMATIC MODELLING TOOL FOR ESTIMATING TRANSIT TRAVEL TIME TO
ENABLE SIMULATION OF LARGE-SCALE SURFACE TRANSIT NETWORKS
Bo Wen Wen, B.A.Sc.
Department of Civil Engineering
University of Toronto
35 St. George Street, Toronto, ON, Canada M5S 1A4
Tel: 647-924-1996; Email: [email protected]
Siva Srikukenthiran, Ph.D.
Department of Civil Engineering
University of Toronto
35 St. George Street, Toronto, ON, Canada M5S 1A4
Tel: 416-978-5049; Email: [email protected]
Prof. Amer Shalaby, Ph.D., P.Eng.
Department of Civil Engineering
University of Toronto
35 St. George Street, Toronto, ON, Canada M5S 1A4
Tel: 416-978-5907; Email: [email protected]
Introduction
Current microscopic simulation models can track the movements of transit vehicles but the
construction of these highly-detailed models is computationally expensive. On the other hand,
traditional macroscopic and mesoscopic models lack the ability to track the precise movements of
transit vehicles. Fortunately, with the advent of big data for transit systems, an efficient transit
travel simulation can be produced using a transit travel time model based on machine learning
algorithms. Previous studies on transit travel time models have used delay, traffic volume,
passenger demand and weather as key variables to predict bus travel times (Yu, et al., 2006; Chien,
et al., 2002). A comparison study on transit travel time models using support vector machine
(SVM), artificial neural networks (ANN), k-nearest neighbor (k-NN) and linear regression (LR)
found that SVM with data from multiple routes produced the most accurate travel time models
(Yu, et al., 2011). In particular, SVM showed greater resistance to the problem of over-fitting and
performed well with large data sets (Yu, et al., 2011). This paper presents an automatic modelling
tool that generates transit link travel time models using open data and SVM, with a case study on
the Toronto Transit Commission (TTC) network. The intended use of this modelling tool is to
enable the automatic development and updating of large-scale surface transit simulation models
using open data.
Methods
The automatic modelling tool collects Automatic Vehicle Location (AVL), General Transit Feed
Specification (GTFS), road restriction, signalized intersection and weather data, from online
sources, including NextBus API, Open Data Toronto and Open Weather Map API. During data
processing, the 20-second interval AVL data were organized into trips and matched with their
corresponding surface routes in the GTFS schedules. Since the arrival of vehicles at stops was of
interest and routes have common characteristics, such as planned headway and ridership, a stop-
based travel time model was constructed for each route. This involved creating virtual stop-to-stop
transit links using GTFS route schedule data. They were populated with transit operational
attributes (e.g. observed total travel times, headways and delays), as well as environmental
attributes (e.g. weather conditions, intersection counts and the presence of road restrictions).
Finally, the attribute values on each transit link were grouped into their respective routes for model
estimation.
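The link-construction step described above can be sketched in Python. The class and field names below are illustrative only, not the thesis's actual implementation; a link holds one observation record per vehicle traversal.

```python
from dataclasses import dataclass, field

@dataclass
class TransitLink:
    """A virtual stop-to-stop link on a route (names are illustrative)."""
    route_id: str
    from_stop: str
    to_stop: str
    observations: list = field(default_factory=list)  # one record per traversal

def build_links(gtfs_stop_sequence, route_id):
    """Create consecutive stop-to-stop links from a GTFS stop sequence."""
    return [TransitLink(route_id, a, b)
            for a, b in zip(gtfs_stop_sequence, gtfs_stop_sequence[1:])]

def add_observation(link, travel_time_s, headway_s, delay_s,
                    temperature_c, n_signals, restricted):
    """Attach one traversal's operational and environmental attributes."""
    link.observations.append({
        "travel_time_s": travel_time_s,
        "headway_s": headway_s,
        "delay_s": delay_s,
        "temperature_c": temperature_c,
        "n_signals": n_signals,
        "restricted": restricted,
    })

# Usage: links along a hypothetical route with three stops
links = build_links(["stop_A", "stop_B", "stop_C"], route_id="501")
add_observation(links[0], travel_time_s=95, headway_s=300, delay_s=40,
                temperature_c=-2.0, n_signals=3, restricted=False)
```

Grouping the populated links by `route_id` then yields the per-route data sets used for model estimation.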
The automatic modelling tool uses LIBSVM, integrated software for SVM classification and
regression, to perform SVM regression model estimation (Chang & Lin, 2011). The average link
speed was used as the target (dependent) variable while headway, delay, temperature and number
of signalized intersections along the link were the features (independent variables). Blocking by
the presence of road restrictions on links was performed to separate data under normal road capacity
from data under restricted road capacity. This was followed by training a travel time model for
each transit route using a set of optimal hyperparameters obtained with a k-fold cross validation
procedure, a kernel function (e.g. linear, polynomial, radial basis function (RBF) or sigmoid) and
an epsilon-insensitive loss function. For this paper, a large-scale transit travel time model was
estimated for the TTC surface transit network using morning peak (6am to 9am) data from
December 19 to December 22, 2016. The model was evaluated using morning peak data from
December 23, 2016.
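As one way to mirror this estimation procedure, the sketch below uses scikit-learn's `SVR` (which wraps LIBSVM) with a linear kernel, an epsilon-insensitive loss, and k-fold cross-validation over hyperparameters. The synthetic data, feature ranges, parameter grid, and toy target relationship are assumptions for illustration, not the paper's actual setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in for one route's link records:
# features = [headway_s, delay_s, temperature_c, n_signals]
X = rng.uniform([120, -300, -10, 0], [900, 600, 10, 8], size=(200, 4))
# Target: average link speed (kph); a toy linear relationship plus noise
y = 25 - 0.8 * X[:, 3] - 0.005 * X[:, 1] + rng.normal(0, 1.5, 200)

# Epsilon-insensitive SVR with 5-fold CV over C and epsilon,
# loosely mirroring the hyperparameter search described above
model = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="linear")),
    param_grid={"svr__C": [0.1, 1, 10], "svr__epsilon": [0.1, 0.5, 1.0]},
    cv=5, scoring="neg_root_mean_squared_error",
)
model.fit(X, y)
rmse = -model.best_score_  # cross-validated RMSE in kph
```

Blocking by road-restriction status would correspond to fitting this pipeline separately on the restricted and unrestricted subsets of each route's records.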
Results
Past studies have suggested that the RBF kernel function produces predictions with fewer numerical
difficulties, making it the first choice (Yu, et al., 2006; Chang & Lin, 2011). However, this study
found that the linear kernel function produced predictions with lower root mean square error
(RMSE) and proper bounds, as shown in Figures 1 and 2. The RMSE generated with the linear
kernel ranged from 6 to 20 kph, with the majority of the models having less than 10 kph RMSE. Route
models with higher RMSE may have been impacted by factors not captured by the features
introduced, such as long dwell time and roadway congestion not caused by roadway restrictions.
Since the travel time model is intended for use in transit network simulation, it should
accurately reproduce the distribution of the observed values. The vast majority of route models predicted the
average link travel speeds well, particularly models with lower RMSE (see Figures 3 and 4).
While the average link travel speeds were well represented by the SVM models, the standard
deviation of the predictions poorly resembled that of the observed values, as shown in Figures 5 and 6.
This may be caused by a few limitations with the models.
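The comparison of predicted and observed distributions can be made concrete with a small helper; the observed and predicted speed values below are hypothetical, chosen to show the pattern reported in Figures 5 and 6 (similar means, compressed prediction variance).

```python
import numpy as np

def fit_metrics(y_obs, y_pred):
    """RMSE plus mean/SD comparison between observed and predicted speeds."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return {
        "rmse": float(np.sqrt(np.mean((y_obs - y_pred) ** 2))),
        "mean_obs": float(y_obs.mean()), "mean_pred": float(y_pred.mean()),
        "sd_obs": float(y_obs.std(ddof=1)), "sd_pred": float(y_pred.std(ddof=1)),
    }

# Hypothetical link speeds (kph): the predictions track the mean but
# under-represent the spread of the observations.
obs = [18.0, 25.0, 12.0, 30.0, 20.0]
pred = [20.5, 22.0, 19.0, 23.5, 20.0]
m = fit_metrics(obs, pred)
```

A model like this can score acceptably on RMSE and mean reproduction while still understating variability, which matters for simulation use.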
Conclusion
A key limitation to the models presented in this paper is the lack of representation of dwell time.
Since dwell time is a major aspect of transit operations, a travel time model that represents both
running time and dwell time is needed to accurately represent the total travel times along routes.
Unfortunately, the lack of Automated Passenger Counter (APC) or Automated Fare Collection
(AFC) data for the TTC network makes the representation and validation of dwell time difficult.
Since it is possible to compute dwell time to 20-second precision with AVL data, an estimated
dwell time could be included as a feature of the model for future work. For networks with readily
available APC or AFC data, more precise dwell times could be obtained. Another key limitation is
the lack of representation of the travel time distribution with the use of SVM. This needs to be addressed
with an alternative modelling approach, such as the support distribution machine (SDM), in future work
(Szabo, et al., 2016).
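The AVL-based dwell-time estimate suggested above could be approximated as the time a vehicle's 20-second pings remain within a stop's catchment. The sketch below assumes projected coordinates in metres and an illustrative 30 m stop radius; both are assumptions, not values from the thesis.

```python
from math import hypot

STOP_RADIUS_M = 30.0    # assumed stop catchment radius
PING_INTERVAL_S = 20.0  # AVL reporting interval

def estimate_dwell(pings, stop_xy, radius=STOP_RADIUS_M):
    """Approximate dwell time as the span of consecutive AVL pings
    within `radius` metres of the stop (20-second resolution)."""
    near = [t for t, x, y in pings
            if hypot(x - stop_xy[0], y - stop_xy[1]) <= radius]
    if not near:
        return 0.0
    # Span between first and last nearby ping, plus one reporting interval
    return (max(near) - min(near)) + PING_INTERVAL_S

# Usage: pings as (timestamp_s, x_m, y_m) in a local projected frame
pings = [(0, 0, 0), (20, 200, 0), (40, 395, 0),
         (60, 400, 0), (80, 410, 0), (100, 600, 0)]
dwell = estimate_dwell(pings, stop_xy=(400, 0))
```

The 20-second reporting interval bounds the precision of any such estimate, which is why APC or AFC data would still be preferable where available.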
This paper presented an automatic modelling tool for large-scale transit travel time modelling, with a
case study of the TTC network. The estimated model provided good representation of the average
travel times between stops for all routes in the TTC network. A large-scale transit travel time model
that can accurately represent both the average travel times and the distribution of travel times is
needed. A future research interest would be to use this model to simulate transit operations, assess
the impact of transit operational characteristics, and determine the effectiveness of transit policy
measures.
Figure 1: RMSE between prediction and observed average speeds, with RBF Kernel
Figure 2: RMSE between prediction and observed average speeds, with Linear Kernel
[Chart data omitted. Both figures plot RMSE (kph) by route number; the RMSE axis spans 0–60 kph for the RBF kernel and 0–25 kph for the linear kernel.]
Figure 3: Prediction and observed averages of 20 routes with the highest RMSE
Figure 4: Prediction and observed averages of 20 routes with the lowest RMSE
Figure 5: Prediction and observed standard deviations of 20 routes with the highest RMSE
[Chart data omitted. Figures 3–5 plot prediction vs. input (observed) averages and standard deviations (kph) by route number. Highest-RMSE routes: 84, 73, 131, 144, 60, 145, 143, 53, 24, 134, 86, 98, 51, 141, 130, 132, 45, 191, 105, 101. Lowest-RMSE routes: 82, 65, 506, 505, 504, 77, 75, 94, 91, 510, 168, 512, 6, 509, 126, 22, 31, 29, 23.]
Figure 6: Prediction and observed standard deviations of 20 routes with the lowest RMSE
Bibliography
Chang, C.-C. & Lin, C.-J., 2011. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, Volume 27, pp. 1-27.
Chien, S. I.-J., Ding, Y. & Wei, C., 2002. Dynamic Bus Arrival Time Prediction with Artificial
Neural Networks. Journal of Transportation Engineering, pp. 429-438.
Srikukenthiran, S., 2015. Integrated Microsimulation Modelling of Crowd and Subway Network
Dynamics For Disruption Management Support, Toronto, ON: Ph.D. dissertation, Graduate
Department of Civil Engineering, University of Toronto.
Szabo, Z., Sriperumbudur, B., Póczos, B. & Gretton, A., 2016. Learning Theory for Distribution
Regression. Journal of Machine Learning Research, 17(152), pp. 1-40.
Yu, B., Lam, W. & Tam, M. L., 2011. Bus arrival time prediction at bus stop with multiple routes.
Transportation Research Part C: Emerging Technologies, 19(6), pp. 1157-1170.
Yu, B., Yang, Z. & Yao, B., 2006. Bus Arrival Time Prediction Using Support Vector Machines.
Journal of Intelligent Transportation Systems, 10(4), pp. 151-158.