mining jams into pollution: how waze data helps estimating air … · 2019-11-01 · mining jams...

João Luiz Martins Carabetta, BSc

Mining jams into pollution: how Waze data helps estimating air pollution in large cities

Master’s Thesis

to achieve the university degree of

Master of Science

Master’s degree programme: Mathematical Modeling

submitted to

Fundação Getulio Vargas

Supervisor

Prof. Eduardo Fonseca Mendes

Escola de Matemática Aplicada

Rio de Janeiro, June 2019


International Cataloging-in-Publication Data (CIP)

Catalog record prepared by the Sistema de Bibliotecas/FGV

Carabetta, João Luiz Martins

Mining jams into pollution: how Waze data helps estimating air pollution in large cities / João Luiz Martins Carabetta. – 2019.

64 leaves.

Dissertation (master's) - Fundação Getulio Vargas, Escola de Matemática Aplicada.

Advisor: Eduardo Fonseca Mendes.

Includes bibliography.

1. Data modeling. 2. Traffic - Congestion. 3. Greenhouse gases. 4. Air - Pollution. I. Mendes, Eduardo Fonseca. II. Fundação Getulio Vargas. Escola de Matemática Aplicada. III. Title.

CDD – 005.75

Prepared by Kelly Ayala – CRB-7/7007


This document is set in Palatino, compiled with pdfLaTeX2e and Biber.

The LaTeX template from Karl Voit is based on KOMA script and can be found online: https://github.com/novoid/LaTeX-KOMA-template


Affidavit

I declare that I have authored this thesis independently, that I have not used other than the declared sources/resources, and that I have explicitly indicated all material which has been quoted either literally or by content from the sources used. The text document uploaded to tugrazonline is identical to the present master's thesis.

Date Signature


Abstract

Air pollution has been a growing worry of international medical organizations and governments due to its relation to a large number of respiratory diseases, among other effects. This work proposes an open, crowdsourced, scalable methodology to model spatial air pollution in cities worldwide. We use both Waze and Open Street Maps data to construct a collection of features aimed at modeling car emissions in (large) cities. Waze data carries information about all jammed road segments of a region every two minutes, and Open Street Maps (OSM) is an open source, detailed, dynamically updated spatial database of mapped features. Our model is trained using data from a 30 sq km region of Oakland, California, in the United States of America. The dependent variables are the annual concentrations of fine-grained black carbon, nitric oxide, and nitrogen dioxide. The features are aggregated in hexagons with a 173-meter edge. We notice that pollutant concentration between hexagons follows a power law and that high concentration is associated with the presence of highways. We estimate four models: simple linear regression where the only feature is the presence of a highway in the hexagon, multiple linear regression, random forest, and XGBoost. The latter yields better results in the validation set for black carbon, NO and NO2. Finally, we extrapolate the model to Montevideo, Uruguay, and observe adherence to what is expected in practice.


Contents

Abstract

1 Introduction

2 Literature Review
2.1 Monitoring Data Collection
2.2 Estimation Methods
2.2.1 Location Based
2.2.2 Interpolation
2.2.3 Dispersion
2.2.4 Land Use Regression

3 Data
3.1 Waze
3.1.1 Open Street Maps
3.2 Air Pollution
3.3 Uber Hexagons H3

4 Methodology
4.1 Features
4.1.1 Features Waze
4.1.2 Features OSM
4.1.3 Pollution (Target Variable)

5 Results
5.1 Effects of highways and dispersion on pollutant concentration
5.2 Model Results and Feature Importance
5.3 XGBoost Air Pollution Estimation for Montevideo

6 Discussion
6.1 Model Limits and Possible Extensions
6.1.1 Challenges on estimation air pollution in urban areas
6.1.2 Biases carried by this model
6.1.3 Data needed to improve model
6.1.4 Modeling decisions and improvements
6.2 Social and Health Impacts with Global Reach

7 Conclusion

Bibliography


List of Figures

1.1 World deaths by risk factor. Image (a) shows the number of deaths by risk factor in 2017. Outdoor air pollution is the 5th cause of death, with 2.94 million people affected. Image (b) shows the growth of this cause of death in the world, led by South Asia, Southeast Asia, East Asia and Oceania. The number of deaths went from 1.7 million in 1990 to 2.94 million in 2017, maintaining a stable death rate of ~30 per 100 thousand people.

1.2 Outdoor air pollution death rate by country. The image compares the death rate from outdoor (ambient) air pollution with GDP per capita. The size of the circles is proportional to the population. It shows that middle income countries are the most affected by air pollution.

3.1 Waze official map with annotations for jams and alerts data.

3.2 Number of unique OSM users over time. The dotted line in blue shows the OSM community is still growing and it has 5 million users that update the map 3 million times a day. Figure from https://wiki.openstreetmap.org/wiki/File:Osmdbstats1log.png

3.3 Evolution of the number of OSM core elements. The nodes, in pink, present a sharp increase since 2007, reaching almost 5 billion objects in January of 2019. Figure from https://wiki.openstreetmap.org/wiki/File:Osmdbstats2.png

3.4 Areas of Oakland, California, US where air pollution was estimated by Aclima, Inc. and Google. Major highways were annotated and the US Census 2010 was used for the population inset. The wind inset shows that during 85% of daytime the wind was blowing west (Figure from [18]).

3.5 Uber H3 hexagon hierarchical grid over a city section. The larger hexagons, with resolution 8, are about the size of a small neighbourhood. The smaller hexagons, with resolution 10, are about the size of a city block. Resolution 9 hexagons contain several blocks but are not large enough to represent a neighbourhood. Image from https://eng.uber.com/h3/.

3.6 Uber H3 hexagon grid over Oakland, California, US. Hexagon coverage is bounded by the availability of pollutant concentration data.

3.7 Polygon grids and centerpoint distance to neighbours. The triangle grid has 3 different neighbour centerpoint distances and neighbours at the vertices. The square grid has 2 centerpoint distances and also neighbours at the vertices. The hexagon grid has a unique centerpoint distance with no neighbours at the vertices. Images from https://eng.uber.com/h3/.

5.1 Positive correlation between highway proximity and pollutant concentration. (a) Panel from Apte et al. (2017) shows the decay of pollutant concentration from highways to roads. The left image shows the ratio of median concentration at a distance from the nearest highway. The error bars are the standard error from bootstrap resampling. An exponential unconstrained three-parameter model fits the observed decay. The right image is the distance from highway d, computed by the harmonic mean distance to the nearest portion of the four major highways. (b) Median concentration of black carbon, NO and NO2 for each hexagon region, categorized by its intersection with a jammed highway. Each point is the median concentration in a hexagon.

5.2 Hexagon regions median concentration distribution by rank. An unconstrained power model (y = a x^b) fits the medians of the first 250 hexagons. The high and medium concentration values show a high agreement with the power model, whereas for low concentration the distribution drops sharply for all pollutants.

5.3 Modeling results for Highway Linear Regression and Multiple Linear Regression. Image (a) shows the results for the linear regression C(x) = aH(x) + b, where x is the hexagon region, H(x) is an indicator function which is 1 if the hexagon intersects a highway, and a and b are unconstrained parameters. Image (b) has the results of a Multiple Linear Regression performed over all available features. Orange dots are hexagons used to train the models, whereas blue dots are used to validate them. Mean Squared Error (MSE) is at the top left of each subplot and the black line is the ideal model.

5.4 Modeling results for Random Forest and XGBoost. Image (a) shows Random Forest results, which underestimate high concentration values. The XGBoost regression, which has the best MSE, is shown in image (b). Orange dots are hexagons used to train the models, whereas blue dots are used to validate them. Mean Squared Error (MSE) is at the top left of each subplot and the black line is the ideal model.

5.5 Comparison between actual median pollutant concentration (left) and concentration estimated by the XGBoost model (right). Panels (a), (b) and (c) correspond to black carbon, NO and NO2, respectively. The concentration is binned by the Jenks natural breaks classification method. Missing hexagons in the estimated maps (right) are due to the lack of Waze data in the region.

5.6 Nitrogen Oxide (NO) concentration XGBoost estimation for Montevideo, Uruguay. The estimation used Waze data from April, 2019. Dark orange hexagon regions (> 53 ppb) are above the yearly average recommended by the United States Environmental Protection Agency [41].


1 Introduction

Outdoor air pollution is one of the major causes of premature death in the world [1]. Estimates of deaths from long-term air pollution vary from 2.94 to 8.8 million people, with over 103 million disability-adjusted life years lost [2]. This puts air pollution between the 5th and 2nd place among the leading risk factors of death, ahead of alcohol, unsafe water sources and poor sanitation (Figure 1.1 (a)). According to a conservative estimate, the number of deaths caused by air pollution rose from 1.7 million in 1990 to 2.94 million in 2017, with a stable death rate of about 30 people per 100 thousand. A notable increase in deaths is seen in underdeveloped regions, especially South, Southeast and East Asia and Oceania (Figure 1.1 (b)). In fact, the death rates from air pollution exposure in middle income countries are among the highest. Large developing countries such as China and India have death rates above the mean of 32 people per 100 thousand (Figure 1.2).

By definition, air pollution is any substance in the air that harms humans, animals, vegetation or materials [3]. There are several pollutants, which differ in their composition, source and production conditions. The most common gases are sulfur oxides, mainly SO2, nitrogen oxides (NO and NO2), reactive hydrocarbons and carbon monoxide. Particulate matter smaller than 2.5 micrometers (PM2.5) or smaller than 10 micrometers (PM10) is also considered an air pollutant.

Recent studies show that long-term air pollution exposure is responsible for health issues in the respiratory system and potentially in all organs of the body [3]. It is related to lung and heart diseases, diabetes, dementia, liver problems, bladder cancer, brittle bones and damaged skin. Fertility, foetuses and child development are also affected by air pollution. Although it affects people of all ages and sexes, more vulnerable social groups are more exposed to air pollution if they have other illnesses.


Figure 1.1: World deaths by risk factor. Image (a) shows the number of deaths by risk factor in 2017. Outdoor air pollution is the 5th cause of death, with 2.94 million people affected. Image (b) shows the growth of this cause of death in the world, led by South Asia, Southeast Asia, East Asia and Oceania. The number of deaths went from 1.7 million in 1990 to 2.94 million in 2017, maintaining a stable death rate of ~30 per 100 thousand people.


Long-term air pollution exposure is not only a major cause of death; it also affects the most vulnerable countries and social groups. In spite of these problems, most cities do not have an adequate air monitoring system [4]. The most affected continents, Asia and Africa, have poor or no air pollution monitoring systems [5]. Even high income countries, such as the US, have a low monitor density, ~2-5 monitors per million people and per 1000 km².

Several air pollution concentration estimation studies try to address this infrastructure gap [6], [7]. Intra-urban air pollution concentration data with medium or high resolution is essential to support epidemiology and public policy impact studies. With high concentration areas identified, measures can be taken to mitigate pollution and provide health support to the affected population.

Land Use Regression (LUR) is one of the adopted estimation techniques [8]. It uses pollutant concentration data from monitors along with geographic predictors to model, by regression, the pollutant surface in a region. In spite of its good predictions, models can hardly be transferred to different cities. The main constraint is the locality of most predictors. Congestion and traffic intensity, among the most important predictors, depend on local vehicle counting data. Even when available and open, this data is usually not standardized and there is no guarantee of completeness or quality.

This dissertation proposes a solution to the transferability constraints of LUR models, based on fine-grained pollution concentration data for NO, NO2 and black carbon from Oakland, California, US. First, it builds a traffic data set based on Waze data, which has global reach and instant updates. It also uses a crowdsourced database, Open Street Maps, to obtain land-use predictors. Then, it develops a LUR model over data aggregated in a hierarchical hexagon grid open sourced by Uber. Different regression models are trained: Simple Linear Regression, Multiple Linear Regression, Random Forest and XGBoost. The last model achieves results compatible with the current literature. Finally, it estimates pollution concentration for Montevideo, Uruguay, by applying the XGBoost model with Waze predictors.


Figure 1.2: Outdoor air pollution death rate by country. The image compares the death rate from outdoor (ambient) air pollution with GDP per capita. The size of the circles is proportional to the population. It shows that middle income countries are the most affected by air pollution.


2 Literature Review

2.1 Monitoring Data Collection

Monitoring air quality is essential to provide reliable and trustworthy air pollution concentration data. Initially, fixed monitoring sites were installed in specific locations of cities to comply with air quality legislation. Due to their high cost, even the best covered cities do not have a large network of monitoring sites. Although they were not built for research, several studies have used their data [9]–[11]. It is important to note that these sites are usually located in hotspots such as heavy traffic roads or industrial areas, so they can yield biased estimates of pollutant concentration.

In the wake of air quality control studies, several investigators have built their own monitoring campaigns [12]–[14]. These usually consist of several seven- or fourteen-day sampling campaigns, in which the monitors are positioned so as to optimize estimation and reduce bias. More recently, studies have focused on optimizing the placement of fixed monitoring sites given the city topography, building heights and emission locations [15].

Other researchers are testing mobile monitoring to map a large road network at high spatial resolution. Cars, bikes, trams or buses equipped with pollution sensors repeatedly travel the same roads during a specified period of time, sampling concentration levels. With a resolution of up to 20 meters, studies are able to map sharp pollutant concentration gradients and identify hotspots [16], [17]. Data from a particularly extensive study in the San Francisco Bay Area that repeatedly sampled a 30 km² area for NO, NO2 and black carbon is used in this dissertation [18]. It is the largest urban air quality data set and it is freely available.


2.2 Estimation Methods

The literature has consolidated four main methods to estimate air pollution: Location Based, Interpolation, Dispersion Models and Land Use Regression (LUR) [11]. The last method, LUR, is discussed in more detail, since it is the method applied in this dissertation. We discuss the importance of its prediction variables, model development techniques, validation, transferability concerns, performance and limitations.

2.2.1 Location Based

Location Based Methods are the oldest and simplest models to estimate air quality. They are based on what is known as the first law of geography: all things are related, but near things are more strongly related than distant ones [19]. Thus, proximity, spatial overlap, colocation and contiguity are proxies for pollutant concentration.

Although they oversimplify the intricate dynamics of pollutant dispersion, the assumptions behind Location Based Methods are solid. Pollutant concentration is expected to decay with distance from the source, as a result of dilution as pollutants are transported [18]. Hence, distance can be used as the basis of pollutant concentration estimation methods.

Computing those proxies is simple and straightforward with a Geographical Information System (GIS). Functions such as point-in-polygon, intersect and buffering are fast and scalable in a GIS. This allows the analysis of large datasets, extensive coverage areas and thousands of locations with relatively low computational power.

Studies that used point-in-polygon GIS functions assumed that data from monitoring sites can be attributed to a region. The region can be a city or study area [20] or a land use zone [21]. Buffering supposes that air quality is constant within a fixed distance from an emission source. A 100-meter distance from major roads was used by Harrison et al. [22], and English et al. used a 170-meter buffer from traffic sources [23]. Moreover, linear distance decay from the nearest road was applied by Livingstone et al. [24].


However, the reliability of such models can be called into question. They are based on an extremely generic argument that does not take into account weather, topography, wind flow or building heights.
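The buffering and distance-decay proxies of this subsection can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any cited study: the function names, the use of the haversine distance, and the 500-meter cutoff of the linear decay are assumptions made for the example. Only the default 100-meter buffer mirrors the distance quoted from Harrison et al. [22].

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_to_nearest_source_m(site, sources):
    """Distance from a (lat, lon) site to the closest (lat, lon) emission source."""
    return min(haversine_m(site[0], site[1], s[0], s[1]) for s in sources)


def in_buffer(distance_m, radius_m=100.0):
    """Buffer proxy: the site is 'exposed' if it lies within radius_m of a source."""
    return distance_m <= radius_m


def linear_decay(distance_m, cutoff_m=500.0):
    """Linear decay proxy: 1 at the source, falling to 0 at cutoff_m (hypothetical value)."""
    return max(0.0, 1.0 - distance_m / cutoff_m)
```

Both proxies depend only on distance, which is exactly why they cannot capture weather, topography, wind flow or building heights.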

2.2.2 Interpolation

Interpolation is a more powerful method of assessing pollutant concentrations than Location Based methods. With more complex methodologies, interpolation has the advantage of providing error estimates at unsampled sites. Most of its tools are also available in GIS software, but they are more computationally expensive than Location Based methods.

Hoek et al. [9], [25] used an inverse distance weighting technique to assess NO2 and black carbon concentrations. Kriging has also been used to estimate daytime ozone concentration [26].

However, interpolation relies only on data from monitoring locations, which, even for the most complete monitoring networks, are usually not extensive enough. Air pollution surfaces in urban sites are extremely complex, with steep gradients and localized hotspots. Thus, interpolation techniques that rely only on monitoring data often struggle to perform well.
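As an illustration of the inverse distance weighting idea mentioned above, the following minimal sketch estimates the concentration at an unsampled point from nearby monitors. It is not the implementation used by Hoek et al.: the function name, the planar coordinates and the exponent of 2 (a common default) are assumptions for the example.

```python
import math


def idw(x, y, samples, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    samples is a list of (xi, yi, value) tuples in planar coordinates
    (for example, meters in a projected coordinate system).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for xi, yi, value in samples:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            # The query point coincides with a monitor: return its reading.
            return value
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total
```

Because the estimate is always a weighted average of monitor readings, it can never exceed the highest measured value, which is one reason sparse networks miss localized hotspots.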

2.2.3 Dispersion

Both Interpolation and Location Based methods produce mostly static estimates. Dispersion models attempt to estimate pollutant concentration from theoretical chemical and physical interactions. Due to their complexity, dispersion models can be computationally intensive and hard to calculate. Also, unlike the previous models, they take emission sources as input. Not only do the emission sources have to be identified, but their emission patterns also have to be modelled or profiled beforehand.

Gaussian models are the usual choice to estimate dispersion, with plume and puff as the two main types. Plume models assume steady-state conditions, whereas puff models are active simulations of instantaneous emission releases [27]. Software maintained by the US Environmental Protection Agency (EPA) uses either one technique or a combination of both [28]. It can consider terrain elevation, building heights and meteorological variables of varying complexity.

Although very precise, dispersion models are hard to replicate. They are computationally demanding and require precise emission profiling. This raises the cost of infrastructure and personnel, which is a considerable barrier for cities in developing countries.
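To make the steady-state plume idea concrete, the sketch below evaluates the textbook Gaussian plume equation with ground reflection for a single point source. It is an illustration only and not necessarily the formulation used by the EPA software: the function name is hypothetical, and the dispersion coefficients sigma_y and sigma_z, which in practice depend on downwind distance and atmospheric stability, are simply passed in as inputs.

```python
import math


def gaussian_plume(q, u, y, z, h, sigma_y, sigma_z):
    """Steady-state Gaussian plume concentration with ground reflection.

    q: emission rate (g/s); u: wind speed (m/s);
    y: crosswind offset (m); z: receptor height (m); h: effective source height (m);
    sigma_y, sigma_z: dispersion coefficients (m) at the receptor's downwind distance.
    Returns concentration in g/m^3.
    """
    lateral = math.exp(-(y ** 2) / (2 * sigma_y ** 2))
    # The second exponential is the "image source" term that reflects the plume at the ground.
    vertical = (
        math.exp(-((z - h) ** 2) / (2 * sigma_z ** 2))
        + math.exp(-((z + h) ** 2) / (2 * sigma_z ** 2))
    )
    return q / (2 * math.pi * u * sigma_y * sigma_z) * lateral * vertical
```

Even this simplest form needs an emission rate, a wind speed and calibrated dispersion coefficients for every source, which illustrates the emission profiling burden discussed above.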

2.2.4 Land Use Regression

Land Use Regression (LUR) was first introduced by Briggs in 1997 with the SAVIAH (Small Area Variations in Air Quality and Health) study [29]. It was presented as an improvement on interpolation methods, obtained by introducing more features into the model. Although the name mentions only land use, other features such as altitude, traffic proxies and meteorology are often introduced in the model. Instead of interpolating the emission data, Briggs applied a regression which went through training and validation procedures. Currently, this is the most applied methodology [11] and it is the one used in this dissertation.

Beyond emission monitoring data, it demands contextual features, often obtained with a GIS. Once the data is structured, there are plenty of modeling options that can be applied. Hence, computational demands can vary greatly depending on the data volume and model complexity. LUR models have three main components: geographic predictors (features), model development and validation.

Geographic predictors, or features, are the main components of a LUR based estimation. Usually, studies use a large set of potential variables. The most common feature groups are traffic, population, land use and physical geography. Significant traffic features are distance from highway, traffic intensity and congestion [9], [11], [12], [30]. Population and housing density are also used as features. Significant land use features consist of categorizing a location as urban, industrial, commercial or open space [10], [31]–[34]. Finally, important physical geography features are altitude, country region, border distance, sea distance and geographical coordinates [35].

Most studies use Multiple Linear Regression (MLR) to develop predictive models. Some of them used a combination of MLR with different features for different spatial scales [9], [11]. Diagnostic tests of heteroscedasticity and independence of residuals are usually performed to validate the use of ordinary least squares regression [14], [36]. Moreover, some studies apply a logarithmic transformation to the concentration to improve the residual distribution [10], [34], [37].

Validation techniques are crucial to guarantee the model's predictive ability on new data. The most commonly used is leave-one-out cross-validation with varying n sizes and repetitions. Another approach is to split the monitoring sites into training and validation sets. A combination of both can also lead to better results. Lastly, some other studies used different types of monitoring sites, such as routine monitoring stations, to validate their predictions [34].

Model performance is assessed by two main metrics, R² and Root Mean Square Error (RMSE). Given the differences in dispersion and chemical reactions of each pollutant, the precision of the models varies. For NO, the R² of the validation sample varies from 0.49 to 0.70; RMSE statistics were not reported. For NO2, the R² of the validation sample is between 0.36 and 0.87, with a mean of 0.68, while RMSE varies from 1.3 to 4.5 ppb, with a mean of 2.9 ppb. For black carbon, R² ranges from 0.35 to 0.89, with a mean of 0.63 [6].
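The validation and performance steps described above can be sketched as follows. This is a simplified illustration, not the pipeline of any cited study: it fits a one-feature least squares line instead of a full MLR to stay dependency-free, and all function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_predictions(xs, ys):
    """Leave-one-out cross-validation: predict each site from a model fit on the others."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def r_squared(ys, preds):
    """Coefficient of determination of the held-out predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def rmse(ys, preds):
    """Root Mean Square Error, in the same unit as the concentration (e.g. ppb)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
```

Computing R² and RMSE on held-out predictions, rather than on the training fit, is what makes the reported ranges comparable across studies.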

Although widely used by many investigators, LUR models have one major disadvantage: the lack of standardization of geographical predictors. Traffic related data, if it exists at all, is usually hard to obtain from municipalities. The quality of demographic features varies greatly across countries and municipal institutions. Also, there is no control over the completeness or quality of the data used. These problems greatly diminish the transferability of models and techniques.


3 Data

The goal of the dissertation is to build a model that uses globally available data to estimate air pollution. In contrast with other similar models, we want a model that is not bound to regional and local data such as census, origin-destination matrices, land use or traffic congestion. This type of data is usually available only in middle- to high-income countries with solid government planning institutions. Also, there is no common standard for most of these data types, which makes it difficult to reproduce the model in other cities and regions of the world. Finally, city-scale databases are produced by the cities themselves, so a different data mining effort is needed for each city. For those reasons (unequal availability, lack of data standards and decentralization of databases), we pursue globally available databases.

Two databases are used to build features, Waze and Open Street Maps (OSM). The first was chosen because it has up-to-date information on traffic, which is one of the most important sources of air pollution [6]. OSM data has information about land use, road structure and points of interest, which were used in other studies to predict air pollution.

The target variable was built upon an empirical study conducted by Aclima, Inc. and Google, which collected air pollution data from Oakland, CA, US for a year. This database has a fine granularity of 30 m over an area of 30 km². The study measured three pollutants: NO, NO2 and black carbon (BC). Thus, it offers a wide range of aggregation possibilities.

Finally, the Uber H3 hierarchical hexagon grid was chosen to divide the city into areas. It has a wide range of hexagon resolutions and a stable, fast and maintained library.


3.1 Waze

Waze is a company with worldwide reach that provides routes for driversvia a cellphone app. Besides routes, Waze also allows users to post reportsabout road conditions, such as traffic intensity, pot holes, floods and others.This creates a community of users that provides rich and valuable dataabout the city road infrastructure network. In order to allow cities to usethis information, Waze created the Citizens Connected Program. It connectscities and research initiatives around shared and live Waze data, which isthe data used in this dissertation.

The data provided by Waze is the data shown in the Waze app. It consists of two main data types: Alerts and Jams (Figure 3.1). Alerts data locates a user report at a specific latitude and longitude pair, so its presence in the database depends on the habits of the app users in that city. Jams data, on the other hand, is passive: it does not depend on users actively engaging with the app. Jams data identifies congested roads by comparing current and historical position and speed data collected via GPS from its users. Thus, as long as the city has enough users, Jams data is a more reliable and stable data source. For that reason, we chose to use only Jams data to estimate air pollutant emissions.

Each jam has a geographical location and properties, such as speed and length, that evolve in time. Jam data is described by four categories of variables: Identifiers, Geographical, Traffic Status and Road Characteristics. Table 3.1 shows each available variable, but some considerations are necessary to understand the data. The data is updated every two minutes. Also, Waze identifies jams by comparing live speed with free flow speed, which is estimated using data from 2am to 5am. It does not count the number of vehicles on the road, so Waze does not identify roads that are loaded, only roads that are below free flow speed. Therefore, Waze only provides traffic data for roads that are jammed.
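To make the comparison between live speed and free flow speed concrete, the sketch below maps a speed ratio to the level scale listed in Table 3.1. The function name `jam_level` is hypothetical, and the treatment of values that fall exactly on a bucket boundary is an assumption, since the table leaves small gaps between buckets. It is not Waze's actual implementation.

```python
def jam_level(speed_kmh: float, free_flow_kmh: float) -> int:
    """Map a live speed to the 0-5 level scale of Table 3.1 (illustrative helper)."""
    if free_flow_kmh <= 0:
        raise ValueError("free flow speed must be positive")
    ratio = speed_kmh / free_flow_kmh
    if ratio <= 0:
        return 5  # blocked road
    # Assumption: each bucket is open below and closed above.
    if ratio > 0.8:
        return 0  # 100% to 80% of free flow speed
    if ratio > 0.6:
        return 1  # 80% to 61%
    if ratio > 0.4:
        return 2  # 60% to 41%
    if ratio > 0.2:
        return 3  # 40% to 21%
    return 4      # 20% to 1%


# A road with a free flow speed of 50 km/h observed at 15 km/h (30%).
example_level = jam_level(15.0, 50.0)
```

Under these assumptions, a road never appears in the data while its level would be 0 for long, which matches the remark that Waze only reports roads below free flow speed.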


Figure 3.1: Waze official map with annotations for jams and alerts data.

3.1.1 Open Street Maps

Open Street Maps is an open source project that creates and distributes free geographic data. Founded in 2004, the community grew exponentially in its early years and has been growing steadily since. In 2019, it reached 5 million unique users who contribute by adding and modifying geographical data (see Figure 3.2). The OSM data structure is centered on three basic elements: nodes, ways and relations. Figure 3.3 shows the creation of those elements over time. There were around 5 billion nodes in 2019, one order of magnitude more than ways, which numbered 500 million in the same period. This demonstrates that the OSM community is already an important source of geographical data.

There is a hierarchy between the elements. The node element is unique and represents a point on the Earth's surface as a latitude and longitude pair. It can represent standalone features such as ATMs and park benches. The way is an ordered collection of nodes that forms a polyline, which can represent a highway or a river. If the first node is repeated at the end of the collection, the way is closed and forms a polygon that can represent buildings, schools or forests. Finally, the relation documents the relationship between nodes, ways and other relations. It can be used to represent a collection of buildings that form a university, an interstate route made of several ways, and other types of relations. Table 3.2 describes the data structure of those elements.

Any of those basic elements can carry tags, which give meaning to it. A tag has two free-format fields, key and value, in the form key=value. For instance, highway=residential is a valid tag in which highway is the object and residential is the category. Thus, it is a tag for a road that citizens use to access their homes. Within an element, each key has to be unique: an element cannot carry both amenity=restaurant and amenity=bar.
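The key uniqueness rule can be illustrated with a small sketch in which an element is a plain dictionary using the field names of Table 3.2. The helper `add_tag` is hypothetical and is not part of any OSM library; it only shows why a second amenity tag is rejected.

```python
def add_tag(element: dict, key: str, value: str) -> None:
    """Attach key=value to an element, refusing a key that is already present."""
    tags = element.setdefault("tags", {})
    if key in tags:
        raise KeyError(f"element {element['id']} already has a '{key}' tag")
    tags[key] = value


# A node with the example identifier and coordinates of Table 3.2.
node = {"id": 595816010, "type": "node", "lat": 51.6825848, "lon": 3.8384109, "tags": {}}
add_tag(node, "amenity", "restaurant")

try:
    add_tag(node, "amenity", "bar")  # same key, different value
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True
```

Storing tags as a dictionary enforces the rule by construction, which is consistent with the description of the tags field as a dictionary of keys and values.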

Thus, Open Street Maps has important characteristics that motivate its use. It has a large worldwide community that is active and growing. This allows any analysis built upon OSM data to be replicated in other parts of the world. Also, it has a simple and uniform data structure that pinpoints features on the Earth's surface and attributes meaning to them.

3.2 Air Pollution

The air pollution data set comes from a project by Aclima, Inc. and Google to evaluate air pollution at fine street-level resolution. The companies equipped two cars with 1 Hz air pollution measurement devices that sampled three pollutants: black carbon, NO and NO2. These pollutants, which are harmful to health, are emitted by vehicular traffic, shipping, industrial combustion, cooking and heating. The cars drove within residential, industrial and commercial areas of Oakland, California, US, for one year. The study emphasized West Oakland (WO), with 10 km², East Oakland (EO), with 15 km², and Downtown, with 5 km², as seen in Figure 3.4.

The study covered 750 road-km that were divided into 30 m segments. Each segment has on average 31 days and 200 1-Hz measurement data points. Through data reduction and bootstrap resampling algorithms, the researchers computed the daytime yearly median and standard error (SE) for 21,000 road segments. Repeated sampling of the same segment in different seasons and the reduction methods allowed the researchers to obtain stable and precise (±10-20%) estimates of air pollution.

Figure 3.2: Number of unique OSM users over time. The dotted blue line shows that the OSM community is still growing; it has 5 million users who update the map 3 million times a day. Figure from https://wiki.openstreetmap.org/wiki/File:Osmdbstats1log.png

Figure 3.3: Evolution of the number of OSM core elements. The nodes, in pink, present a sharp increase since 2007, reaching almost 5 billion objects in January 2019. Figure from https://wiki.openstreetmap.org/wiki/File:Osmdbstats2.png

The data set shared by the researchers has the structure described in Table 3.3. It has the yearly median estimates for each road segment and each pollutant: BC, NO and NO2. It is a unique data set with rich, fine-grained, directly measured data that can be used as a solid basis for statistical inference.

3.3 Uber Hexagons H3

The estimation presented in this dissertation is not fine-grained, due to the complexity of the factors involved and computational constraints. Thus, we opted to estimate the air pollution density in regions instead of streets. There are several ways to partition areas of the Earth. A partition can be political, such as neighbourhoods or zip code areas, or a geometric shape that repeats over the area of interest. While political partitions are good for day-to-day decision making, their irregular geometric structure is subject to unpredictable change (https://fas.org/sgp/crs/misc/RL33488.pdf). Also, there is no unified world database of political geographical divisions, which makes the task of partitioning the world cumbersome.

We opted to use a regular geometric structure to partition the Earth's surface. The Uber H3 project developed and open sourced a hierarchical hexagon grid for the entire surface of the Earth. The grid has 15 resolutions. The coarsest resolution, 1, has hexagons with an edge length of 1000 km and an area of $4.25 \times 10^{7}$ km², whereas the finest resolution, 15, has hexagons with an edge length of $5 \times 10^{-4}$ km and an area of $9 \times 10^{-9}$ km².

Squares, triangles and hexagons are the only regular polygons that can tile a plane using a single type of polygon. The advantage of hexagons over squares and triangles is that a hexagon grid has a unique distance between the centerpoint of a hexagon and those of its neighbours (Figure 3.7). Another advantage is that the hexagon grid is the only one of the three in which no neighbours share only a vertex. This simplifies the analysis and gradient smoothing.
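The unique neighbour distance of the hexagon grid can be checked numerically. The sketch below is a plain geometric illustration on a flat plane, not a use of the H3 library: it lists the distinct centerpoint distances from a cell to its neighbours in a square grid, where neighbours share an edge or a vertex, and in a hexagon grid. The function names and the axial coordinate convention are assumptions made for the example.

```python
import math


def square_neighbour_distances(side: float = 1.0) -> list:
    """Distinct centre-to-centre distances to the 8 neighbours of a square cell."""
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    return sorted({round(math.hypot(dx * side, dy * side), 9) for dx, dy in offsets})


def hexagon_neighbour_distances(edge: float = 1.0) -> list:
    """Distinct centre-to-centre distances to the 6 neighbours of a hexagonal cell."""
    # Axial (q, r) offsets of the six neighbours in a flat-top hexagon layout.
    axial = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]
    distances = set()
    for q, r in axial:
        x = edge * 1.5 * q
        y = edge * math.sqrt(3) * (r + q / 2)
        distances.add(round(math.hypot(x, y), 9))
    return sorted(distances)


square_distances = square_neighbour_distances()    # edge and vertex neighbours differ
hexagon_distances = hexagon_neighbour_distances()  # a single distance, sqrt(3) * edge
```

The square grid yields two distances, one for edge neighbours and one for vertex neighbours, whereas every hexagon neighbour sits at the same distance.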


Figure 3.4: Areas of Oakland, California, US where air pollution was estimated by Aclima, Inc. and Google. Major highways are annotated, and the US Census 2010 was used for the population inset. The wind inset shows that the wind was blowing west during 85% of daytime (Figure from [18]).


Figure 3.5: Uber H3 hierarchical hexagon grid over a city section. The larger hexagons, with resolution 8, are about the size of a small neighbourhood. The smaller hexagons, with resolution 10, are about the size of a city block. Resolution 9 hexagons contain several blocks but are not large enough to represent a neighbourhood. Image from https://eng.uber.com/h3/.


In this project we opted for hexagons of resolution 9, with an edge of 174 meters and an area of 0.1 km². The reasoning is by exclusion (Figure 3.5). Resolution 8 hexagons are large enough to contain a small neighbourhood. If this resolution were chosen, we would lose definition in our data and there would be too few hexagons for the city area that we are working with. The same argument holds for coarser (lower) resolutions. On the other hand, if we chose a resolution 10 grid, the definition would be too high, at the scale of a city block. In this case we might not have enough information from our main sources, OSM and Waze, to estimate pollution. The same argument holds for finer resolutions (above 10). Thus, resolution 9 is the most suitable grid for our goals. Figure 3.6 shows the complete H3 grid with resolution 9 over the studied areas of Oakland, CA, US.


Figure 3.6: Uber H3 hexagon grid over Oakland, California, US. Hexagon coverage is limited by the availability of pollutant concentration data.


Figure 3.7: Centerpoint distance to neighbours in polygon grids. The triangle grid has 3 different neighbour centerpoint distances and neighbours at the vertices. The square grid has 2 centerpoint distances and also neighbours at the vertices. The hexagon grid has a unique centerpoint distance and no neighbours at the vertices. Images from https://eng.uber.com/h3/.


| Name | Category | Description | Example |
|---|---|---|---|
| uuid | Identifier | Unique identifier of the jam | 54515950 |
| pubMillis | Identifier | Milliseconds since 1st of January of 1970 | 1537581410449 |
| country | Geographic | Country name | Brasil |
| city | Geographic | City name | Rio de Janeiro |
| street | Geographic | Street name | R. Umari |
| level | Traffic Status | Current speed as a percentage of free flow speed. 0: 100% to 80% of free flow speed; 1: 80% to 61%; 2: 60% to 41%; 3: 40% to 21%; 4: 20% to 1%; 5: blocked road | 1 |
| delay | Traffic Status | Time difference between jam travel time and free flow travel time, in seconds | 12 |
| speed | Traffic Status | Current speed of the road in km/h | 34 |
| line | Traffic Status | Array of latitudes and longitudes that describe the segments that are jammed | [{"x":13.023445, "y":-34.5563}, {"x":13.02334, "y":-34.55234}] |
| length | Traffic Status | Length of the jam in meters | 23 |
| roadType | Road Characteristics | Road type code: 1 Streets, 2 Primary Street, 3 Freeways, 4 Ramps, 5 Trails, 6 Primary, 7 Secondary, 8 and 14 4X4 Trails, 9 Walkway, 10 Pedestrian, 11 Exit, 15 Ferry crossing, 16 Stairway, 17 Private road, 18 Railroads, 19 Runway/Taxiway, 20 Parking lot road, 21 Service road | 2 |

Table 3.1: Waze jam data description. Some fields that did not contain relevant information were suppressed.


| Name | Description | Example |
|---|---|---|
| id | Element unique identifier | 595816010 |
| type | Element type: node, way or relation | node |
| tags | Dictionary of keys and values that gives meaning to the element | {"highway":"residential", "amenity":"restaurant"} |
| lat | Geographical coordinate latitude | 51.6825848 |
| lon | Geographical coordinate longitude | 3.8384109 |
| nds | Ordered list of node ids that define a polyline. Used in ways. | [595816010, 595816013, 595816043] |
| members | Ordered list of element ids that define a relation | [595816023, 595816063, 595816056] |
| changeset | Changeset number in which the element was created or updated | 3404386 |
| timestamp | Date and time at which the element was created or updated | 2009-12-19 00:42:47.000 |
| uid | Unique identifier of the user that created or updated the element | 195219 |
| user | User name | 3dShapes |
| version | Edit version of the element. 1 when newly created | 2 |

Table 3.2: Open Street Map data description


| Name | Description | Example |
|---|---|---|
| Lon | Geographical coordinate longitude of the 30 meter segment | 51.6825848 |
| Lat | Geographical coordinate latitude of the 30 meter segment | 3.8384109 |
| NO Med | Median NO concentration for the 30 meter road segment, in ppm | 23.4 |
| NO2 Med | Median NO2 concentration for the 30 meter road segment, in ppm | 2.5 |
| BC Med | Median black carbon concentration for the 30 meter road segment, in ppm | 14.5 |

Table 3.3: Description of the Aclima, Inc. and Google pollution estimate data. Parts of the data set were omitted.


4 Methodology

4.1 Features

There are two main goals in pre-processing OSM and Waze data. The first is to extract variables related to those used in the literature. Other studies have already identified variables, in a diverse range of databases, that are useful to estimate air pollution, such as traffic congestion and land use. OSM contains data related to the variables used in those studies, which we exploit. The second goal is to build variables that are independent of each other by combining different raw data variables.

In the following subsections we show which data preprocessing was conducted on the Waze and OSM databases. At the end of the section, we show how the pollution data was aggregated.

4.1.1 Features Waze

Traffic features, such as congestion and intensity, are commonly among the most important model predictors. Usually, models use buffer distance or actual distance from heavy traffic spots as predictors. However, due to the hexagon aggregation and computational constraints, we opted to consider only the simplest traffic features that can be extracted from Waze. We only considered the intersection between the hexagon region and Waze data, and did not compute possibly explanatory data lying outside the hexagon.

Waze Jam data has two main indexes: the jam identifier uuid, $j$, and the time pubMillis, $t$. The main numeric variables in the data set are speed, $s$; level, $l$; and length, $d$. The output features are aggregated in a hexagon, $h$, which contains several jams. Thus, the value of a variable at a moment in time for a certain jam contained in a hexagon is $v_{htj}$, where $v$ can be any variable. For convenience, $J_{ht}$ is the set of all jams in hexagon $h$ at time $t$, and $T$ is the number of time intervals $t$ in the study period. Given that, Table 4.1 shows how each feature used in our model is computed.

There are two main feature categories: simple descriptors and jam characteristics. Pollutant emission is positively correlated with the number of lanes, i.e. the road type. Thus, simple descriptors are just counts of jams according to their road type. Jam characteristics take into account simple statistics of speed, level and length.
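As an illustration of the jam characteristics, the sketch below aggregates the jam records of a single hexagon into a few of the statistics of Table 4.1. The record layout (keys t, uuid, speed, level, length) and the function name `waze_features` are assumptions for the example, and the average speed is taken only over time steps with at least one jam, which may differ from the normalization by $T$ used in the table.

```python
from collections import defaultdict
from statistics import mean, median


def waze_features(jams):
    """Aggregate the jam records of ONE hexagon into a few Table 4.1 features."""
    # J_ht: group the jams of the hexagon by time step t.
    jams_at = defaultdict(list)
    for jam in jams:
        jams_at[jam["t"]].append(jam)

    # Per-time-step statistics over the jams j in J_ht.
    avg_speed_t = [mean(j["speed"] for j in group) for group in jams_at.values()]
    median_level_t = [median(j["level"] for j in group) for group in jams_at.values()]

    return {
        "max_count": max(len(group) for group in jams_at.values()),
        "max_length": max(j["length"] for j in jams),
        "max_avg_speed": max(avg_speed_t),
        "min_avg_speed": min(avg_speed_t),
        # Assumption: average only over time steps that have at least one jam.
        "avg_speed": mean(avg_speed_t),
        "max_median_level": max(median_level_t),
        "min_median_level": min(median_level_t),
    }


# Two time steps: two simultaneous jams at t=0, one jam at t=1.
example_jams = [
    {"t": 0, "uuid": "a", "speed": 10.0, "level": 3, "length": 100},
    {"t": 0, "uuid": "b", "speed": 20.0, "level": 2, "length": 300},
    {"t": 1, "uuid": "a", "speed": 30.0, "level": 1, "length": 50},
]
features = waze_features(example_jams)
```

In practice one such dictionary would be produced per hexagon and the rows stacked into the feature matrix.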

4.1.2 Features OSM

Another major regression feature is land use. Thus, we build OSM-based features to mimic the geographical predictors of pollutant concentration that are known to be useful.

Table 4.2 shows all features from Open Street Maps. As with Waze, due to computational constraints, only very simple features are extracted. All the features are counts of element tags inside the hexagon. They are divided into General, Road Type, Urban x Rural and POI. As the name says, the general features are broad counts of different OSM elements in the hexagon. Denser areas are expected to have a larger number of OSM elements, since there are more urban features to map. Given the importance of highways as pollution sources, the total number of highway elements is a feature in the Road Type category. The difference between urban and rural areas is not straightforward in OSM data. Thus, we use the natural tag to describe nature and the parking and place tags to describe urban areas. Finally, POIs are used to improve the description of urban areas and as a proxy for neighbourhoods with high human activity.
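The tag counts can be sketched as follows, assuming the elements that fall inside a hexagon have already been selected and follow the layout of Table 3.2. The function name `osm_features` and the short list of keys are illustrative choices; only tag keys are counted here, so value-based features of Table 4.2 such as school or restaurant are left out of the sketch.

```python
from collections import Counter

# Illustrative subset of the tag keys listed in Table 4.2.
TAG_KEYS = ("highway", "natural", "place", "parking", "amenity", "shop")


def osm_features(elements, tag_keys=TAG_KEYS):
    """Count elements, nodes and selected tag keys among the elements of one hexagon."""
    counts = Counter()
    for element in elements:
        counts["general"] += 1
        if element["type"] == "node":
            counts["node"] += 1
        for key in element.get("tags", {}):
            if key in tag_keys:
                counts[key] += 1
    names = ("general", "node") + tuple(tag_keys)
    # Counter returns 0 for keys that never appeared.
    return {name + "_count": counts[name] for name in names}


example_elements = [
    {"id": 1, "type": "node", "tags": {"amenity": "restaurant", "name": "Cafe"}},
    {"id": 2, "type": "way", "tags": {"highway": "residential"}},
    {"id": 3, "type": "node", "tags": {}},
]
hexagon_features = osm_features(example_elements)
```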


| Name | Description | Formula |
|---|---|---|
| max counta | Maximum number of unique jams at the same time | $\max_t |J_{ht}|$ |
| max length | Maximum length of a jam | $\max_{tj}(d_{htj})$ |
| avg congested prop | Average jam length | $\frac{\max_{tj}(d_{htj})}{\max_t |J_{ht}|}$ |
| max avg speed | Maximum average speed of the jams | $\max_t \left( \sum_{j \in J_{ht}} \frac{s_{htj}}{|J_{ht}|} \right)$ |
| min avg speed | Minimum average speed of the jams | $\min_t \left( \sum_{j \in J_{ht}} \frac{s_{htj}}{|J_{ht}|} \right)$ |
| avg speed | Average of the average speed | $\frac{1}{T} \sum_{t}^{T} \sum_{j \in J_{ht}} \frac{s_{htj}}{|J_{ht}|}$ |
| max median level | Maximum median level of the jams | $\max_t(\mathrm{median}_j(l_{htj}))$ |
| min median level | Minimum median level | $\min_t(\mathrm{median}_j(l_{htj}))$ |
| median level | Median of the median level | $\mathrm{median}_{tj}(l_{htj})$ |
| usual road type | Median of the road type number | |
| usual road type | Most common road type | |
| bool highway | Indicator of highway | |
| bool primary | Indicator of primary roads | |
| bool ramps | Indicator of ramps | |
| bool secondary | Indicator of secondary roads | |
| count highway | Number of jams in highways in the period | |
| count streets | Number of jams in streets in the period | |
| count primary | Number of jams in primary roads in the period | |
| count secondary | Number of jams in secondary roads in the period | |
| count primary street | Number of jams in primary streets in the period | |
| count primary ramps | Number of jams in primary ramps in the period | |

Table 4.1: Description of Waze Features


| Name | Description | Classification |
|---|---|---|
| general counta | Number of elements | General |
| node counta | Number of nodes | General |
| info counta | Number of info tags | General |
| highway counta | Number of highway tags | Road Type |
| natural counta | Number of natural tags | Urban x Rural |
| place counta | Number of place tags | Urban x Rural |
| parking counta | Number of parking tags | Urban x Rural |
| addr street counta | Number of addr:street tags | Urban x Rural |
| addr housenumber counta | Number of addr:housenumber tags | Urban x Rural |
| amenity counta | Number of amenity tags | POI |
| school counta | Number of school tags | POI |
| restaurant counta | Number of restaurant tags | POI |
| place of worship counta | Number of place of worship tags | POI |
| shop counta | Number of shop tags | POI |
| name counta | Number of name tags | POI |
| crossing counta | Number of crossing tags | POI |

Table 4.2: Description of Open Street Maps Features


| Variable | Skewness | Excess Kurtosis |
|---|---|---|
| Median BC | 0.53 | -1.74 |
| Median NO | 1.06 | -0.48 |
| Median NO2 | 0.65 | -0.98 |

Table 4.3: Mean central moment statistics of the pollution distributions within hexagons. Excess kurtosis is the kurtosis minus 3.

4.1.3 Pollution (Target Variable)

Pollution concentration in urban areas has sharp local gradients and very localized hotspots. LUR models are not designed to have high precision or to detect local emission anomalies. Thus, when aggregating pollution data in the hexagon, we opted for the central tendency statistic that is least affected by outliers.

The high-precision pollution data from Oakland, CA, US provides two central tendency statistics, mean and median, for each pollutant: black carbon, NO and NO2. In their analysis, the data set authors chose the median because it better describes long-term trends. For the second aggregation step, from road segments into hexagons, we also use the median, for the following reason. Table 4.3 shows the mean values of the central moments for the data aggregated in hexagons. Notably, all skewness values are positive, which indicates a long right tail. Also, excess kurtosis is negative for all distributions. This is a pattern of a long right tail with a concentrated peak, which indicates the presence of outliers that can affect the mean. Thus, the median of the pollution data inside a hexagon is our central tendency statistic.
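The statistics of Table 4.3 and the robustness of the median can be reproduced on toy data with the sketch below. It uses the population (biased) moment estimators, which is an assumption, since the estimator used for the table is not stated; the function names are illustrative.

```python
from statistics import fmean, median


def central_moment(values, order):
    """Population central moment of the given order."""
    centre = fmean(values)
    return fmean((v - centre) ** order for v in values)


def skewness(values):
    """Third standardized moment; positive for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Toy segment concentrations inside one hexagon, with a single hotspot.
segment_values = [1.0, 1.0, 1.0, 1.0, 10.0]
hexagon_mean = fmean(segment_values)      # pulled up by the hotspot
hexagon_median = median(segment_values)   # unaffected by the hotspot
```

In this toy hexagon the hotspot shifts the mean to 2.8 while the median stays at 1.0, which is the behaviour that motivates the choice of the median.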


5 Results

5.1 Effects of highways and dispersion on pollutant concentration

In this section we show that high pollutant concentration is related to highways, especially for NO and black carbon. Also, high distance-decay rates from highways lead to more unequal distributions of hexagon concentration. Moreover, the hexagon distributions present three different patterns depending on the pollutant concentration.

There is a known "distance-decay" concentration pattern between heavy traffic highways and residential streets [18]. Figure 5.1(a) shows an exponential decay of pollutant concentration with the distance to the nearest highway. The unconstrained exponential model, $C(d) = a + b \exp(-3d/k)$, reproduces the observed pattern well. Here, $d$, in meters, is the distance to highways; $a$ is the urban background concentration for $d \to \infty$; the near-road parameter $b$ governs the concentration increment due to highway proximity; and $k$ is the decay parameter, the distance at which the concentration has relaxed to approximately $a$ (at $d = k$ the increment has fallen to $e^{-3} \approx 5\%$ of $b$).
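A short numerical sketch of the decay model makes the role of $k$ explicit. The parameter values below are arbitrary illustrations, not fitted values from the study.

```python
import math


def decay_concentration(d, a, b, k):
    """C(d) = a + b * exp(-3 d / k): background a plus a near-road increment."""
    return a + b * math.exp(-3.0 * d / k)


# Illustrative parameters only: background 1.0, increment 2.0, decay distance 100 m.
a, b, k = 1.0, 2.0, 100.0
at_highway = decay_concentration(0.0, a, b, k)             # a + b
at_decay_distance = decay_concentration(k, a, b, k)        # a + b * exp(-3)
remaining_fraction = (at_decay_distance - a) / b           # about 5% of the increment
```

At the highway the model returns $a + b$, and at a distance $k$ only about 5% of the near-road increment remains above the background.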

As expected from atmospheric dynamics, NO presents the sharpest distance-decay relationship, followed by BC and NO2. The sources of both BC and NO are combustion engines. However, unlike BC, NO has a faster depletion rate during daytime due to its reaction with ozone (NO + O3 → NO2 + O2) [REF SEE HIGH]. On the other hand, since NO2 comes from secondary reactions, its presence is more homogeneous than that of NO and BC.

When the data is aggregated in hexagons, the distance-decay relationship is preserved. Figure 5.1(b) compares hexagons according to whether they intersect a highway.


Figure 5.1: Positive correlation between highway proximity and pollutant concentration. (a) Panel from Apte et al. (2017) [18] shows the decay of pollutant concentration from highways to roads. The left image shows the ratio of median concentration at a distance from the nearest highway. The error bars are the standard error from bootstrap resampling. An unconstrained three-parameter exponential model fits the observed decay. The right image is the distance from highway d, computed as the harmonic mean distance to the nearest portion of the four major highways. (b) Median concentration of black carbon, NO and NO2 for each hexagon region, categorized by its intersection with a jammed highway. Each point is the median concentration in a hexagon.


Hexagons that do not intersect a highway tend to have a lower concentration for all pollutants. For NO, hexagons that intersect a highway have a mean 3.5 times higher than those that do not. BC and NO2 follow with ratios of 2.09 and 1.84, respectively.

Figure 5.2: Distribution by rank of the median concentration of hexagon regions. An unconstrained power model ($y = a x^{b}$) fits the median of the first 250 hexagons. The high and medium concentration values show high agreement with the power model, whereas for low concentrations the distribution drops sharply for all pollutants.

However, there are hexagon concentrations that do not follow the central tendency. Hexagons that intersect highways but have a lower concentration than expected can be explained by traffic intensity. Not all highways present heavy traffic during daytime, which can impact BC and NO concentrations in particular. On the other hand, hexagons that do not intersect highways but present a high concentration may be neighbours of highways with intense traffic.

These distance-decay relationships affect the distribution of concentration in hexagons. A decreasing rank-size distribution of the hexagon concentrations was modeled by $C(r) = a r^{-b}$ with high fidelity ($R^2 > 0.93$) (Figure 5.2). Here, $r$ is the rank, $a$ is a scale factor and $b$ is the inequality parameter. If $b$ takes higher values, then there are fewer hexagons with high concentration. Since NO has the sharpest distance-decay, we expect a more unequal distribution and a higher $b$ value. This is the case: NO has the highest $b$ value, 0.76, whereas BC takes 0.45, followed by NO2.
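One simple way to estimate $a$ and $b$ in $C(r) = a r^{-b}$ is ordinary least squares on the logarithms, since $\log C = \log a - b \log r$. The sketch below follows that route; it is an assumption for illustration, as the fitting procedure actually used for Figure 5.2 is not described here, and it requires strictly positive concentrations.

```python
import math


def fit_rank_size(concentrations):
    """Fit C(r) = a * r**(-b) by least squares in log-log space; returns (a, b)."""
    ranked = sorted(concentrations, reverse=True)  # rank 1 = highest concentration
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic hexagon concentrations that follow the model exactly (a=5, b=0.5).
synthetic = [5.0 * rank ** -0.5 for rank in range(1, 51)]
a_hat, b_hat = fit_rank_size(synthetic)
```

A larger fitted $b$ means the concentration falls faster with rank, that is, a few hexagons concentrate most of the pollution.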


Another relevant aspect is the distribution pattern of the hexagons with very high concentration of NO2 and NO. For the 10 highest concentrations, $b$ is 0.4 for NO and 0.19 for NO2. This difference in behaviour may come from the fact that highways have a limited capacity, at which pollutant concentration is maximized. Moreover, the lowest concentration values, which are located in neighbourhoods with parks and extensive tree coverage, decrease sharply for all pollutants.

Air pollution is highly correlated with the presence of highways, especially for BC and NO, which have vehicles as their main source. The distance-decay relationship dictated by atmospheric interactions greatly changes the distribution of pollutants across the hexagon regions. NO2 is more widely spread and seems uniformly distributed. On the other hand, BC and NO, which present sharper distance-decays, are irregularly distributed. Also, there are three patterns in the distributions: the distribution is less unequal for the high concentration hexagons of NO and NO2; it drops sharply for very low concentrations; and it follows a power law for mid-range concentrations.

5.2 Model Results and Feature Importance

In this section we present the models used to estimate pollutant concentration. The simplest model, a linear regression with highway intersection as its only feature, is used as the baseline. The other models use Waze features with low correlation and had their parameters optimized. XGBoost yielded the best results, followed by Random Forest and Multiple Linear Regression. Then, a graphical comparison between the actual values and the XGBoost model is made. Finally, the mean decrease of impurity of the Random Forest model was used to assess feature importance, which greatly favors jam characteristics.

In order to evaluate the models we split the hexagons into two non-overlapping sets, training and validation. The training set contained 90% of the total set, 254 hexagons, whereas the validation set had the remaining 28 hexagons. None of the models had access to the validation set during training. The models were evaluated by calculating the Mean Squared Error (MSE) on the validation set.
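The evaluation protocol can be sketched in a few lines. The random shuffling and the seed are assumptions, since the text does not say how hexagons were assigned to each set; the function names are illustrative.

```python
import random


def train_validation_split(items, train_fraction=0.9, seed=0):
    """Shuffle the items and split them into non-overlapping training and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted concentrations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# 282 hexagon identifiers, as in the study: 254 for training, 28 for validation.
hexagon_ids = list(range(282))
train_ids, validation_ids = train_validation_split(hexagon_ids)
```

With 282 hexagons, a 90% split gives the 254 and 28 hexagons reported above.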


Figure 5.3: Modeling results for Highway Linear Regression and Multiple Linear Regression. Image (a) shows the results for the linear regression C(x) = aH(x) + b, where x is the hexagon region, H(x) is an indicator function which is 1 if the hexagon intersects a highway, and a and b are unconstrained parameters. Image (b) has the results of a Multiple Linear Regression performed over all available features. Orange dots are hexagons used to train the models, whereas blue dots are hexagons used to validate them. The Mean Squared Error (MSE) is at the top left of each subplot and the black line is the ideal model.


Based on the discussion in the previous section about the correlation between highways and air pollution concentration, we build our simplest model. It is a simple linear regression that we call Highway Linear Regression,

C(x) = aH(x) + b

where x is the hexagon, H(x) is an indicator function which is 1 when the hexagon x intersects a highway, and a and b are unconstrained parameters. As Figure 5.3(a) shows, C(x) has only two possible values for each pollutant. As expected, the model did not perform well, especially because of the hexagons that intersect highways but do not present a high concentration (see Section 5.1).
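Because H(x) is binary, the least squares solution has a closed form: b is the mean concentration of the hexagons that do not intersect a highway, and a is the difference between the two group means. The sketch below uses that property; the function names and the toy data are illustrative.

```python
from statistics import fmean


def fit_highway_regression(intersects_highway, concentration):
    """Least squares fit of C(x) = a*H(x) + b for a binary indicator H."""
    on_highway = [c for h, c in zip(intersects_highway, concentration) if h]
    off_highway = [c for h, c in zip(intersects_highway, concentration) if not h]
    b = fmean(off_highway)       # prediction when H(x) = 0
    a = fmean(on_highway) - b    # extra concentration when H(x) = 1
    return a, b


def predict_highway_regression(a, b, intersects_highway):
    return a * (1 if intersects_highway else 0) + b


# Toy data: two hexagons on a highway, two away from it.
flags = [1, 1, 0, 0]
values = [4.0, 6.0, 1.0, 3.0]
a_fit, b_fit = fit_highway_regression(flags, values)
```

This makes explicit why the model can only output two values per pollutant: the two group means.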

Features other than highway presence may explain the data better. Thus, using all the available Waze features that are not highly correlated, we fit a Multiple Linear Regression,

$C_i = b_0 + b_1 x_{i1} + \dots + b_p x_{ip} + e_i, \quad i = 1, \dots, n$

where $i$ is the hexagon, $p$ is the number of features and $n$ is the total number of hexagons. Thus, $x_{ip}$ is feature $p$ of hexagon $i$, the $b$ are unconstrained parameters, $C$ is the estimated concentration and $e$ is the error term. Table 5.1 shows that the model only achieves a lower MSE than the baseline for NO. The MSE values for BC and NO2 are slightly worse. Figure 5.3(b) shows that the model underestimates high concentration values.

All the following models have tuned hyperparameters. The method consists of a wide Random Search over the hyperparameters, followed by a narrower Grid Search around the resulting values.

The third proposed model is a Random Forest, which has been used to estimate pollutant concentration in other studies [38]. It consists of a combination of unpruned regression trees generated from bootstrapped samples. When building the trees, it selects the best split at each node among a random selection of explanatory variables. The regression prediction is made by averaging the results of those trees. At each bootstrap iteration, about one third of the data is left out, forming an out-of-bag (OOB) sample. The average of the OOB error is called the OOB estimate of the error rate [39].
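The "about one third" figure follows from sampling with replacement: the probability that a given observation is never drawn in n draws is $(1 - 1/n)^n$, which tends to $e^{-1} \approx 0.37$. The sketch below checks this analytically and by simulation for a training set of 254 observations; it is independent of any Random Forest library, and the function names are illustrative.

```python
import random


def expected_oob_fraction(n):
    """Probability that one observation is left out of a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, trials=200, seed=0):
    """Average share of observations that never appear in a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        drawn = {rng.randrange(n) for _ in range(n)}  # indices sampled with replacement
        total += (n - len(drawn)) / n
    return total / trials


analytic = expected_oob_fraction(254)
simulated = simulated_oob_fraction(254)
```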


Figure 5.4: Modeling results for Random Forest and XGBoost. Image (a) shows the Random Forest results, which underestimate high concentration values. The XGBoost regression, which has the best MSE, is shown in image (b). Orange dots are hexagons used to train the models, whereas blue dots are hexagons used to validate them. The Mean Squared Error (MSE) is at the top left of each subplot and the black line is the ideal model.


| Model | BC | NO | NO2 |
|---|---|---|---|
| Highway Linear Regression | 0.086 | 95.8 | 24.0 |
| Multiple Linear Regression | 0.089 | 78.0 | 25.6 |
| Random Forest | 0.058 | 49.6 | 18.6 |
| XGBoost | 0.047 | 39.5 | 16.0 |

Table 5.1: Mean Squared Error (MSE) of each model by pollutant.

The Random Forest had consistently better MSE for all pollutants. It decreased the MSE by 32%, 37% and 28% for BC, NO and NO2, respectively (see Table 5.1). However, the model still underestimates medium and high concentration values (see Figure 5.4(a)), especially for NO2 and NO, with residual means of -0.68 and -0.7, where the residual is the difference between the actual and estimated concentration.

Finally, the last model is XGBoost, an implementation of gradient-boosted decision trees. It is a widely used technique that achieves state-of-the-art results on regression benchmarks. Besides being fast, reliable and stable, it is also used in production environments [40]. The model consists of an ensemble of weak tree learners that are built sequentially. Each new tree is chosen to improve the objective function, which is composed of a training loss and a regularization term. We use the Mean Squared Error as our training loss.
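The sequential idea can be illustrated with a deliberately simplified sketch. This is plain gradient boosting with squared-error loss on one feature, using depth-one trees (stumps); it is not XGBoost itself, which adds a regularization term, second-order gradient information and multi-feature trees. All names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # last value excluded so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_rounds, learning_rate):
    """Fit stumps sequentially, each one on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x) for p, x in zip(predictions, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, mse_history


def predict_boosting(base, stumps, learning_rate, x):
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical toy data: one traffic feature and a concentration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.0, 3.0, 5.0, 8.0, 9.0, 12.0, 13.0]
base, stumps, mse_history = fit_boosting(xs, ys, n_rounds=20, learning_rate=0.3)
print(mse_history[0], mse_history[-1])
```

Each round lowers the training MSE a little, which is the sense in which every new tree improves on the ensemble built so far.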

XGBoost outperforms the competing models. It achieves a 47% reduction in MSE for BC, 58% for NO and 33% for NO2. Although it still underestimates high concentrations, with residual means of -0.8 and -0.32 for NO and NO2, the high concentration error dropped twofold for NO, by 30% for NO2 and by 40% for BC.

| Feature | BC | NO | NO2 | Mean |
|---|---|---|---|---|
| max length | 0.323 | 0.270 | 0.312 | 0.301 |
| avg congested prop | 0.185 | 0.239 | 0.104 | 0.176 |
| avg speed | 0.141 | 0.213 | 0.120 | 0.158 |
| count highway | 0.064 | 0.077 | 0.118 | 0.086 |
| count primary ramps | 0.052 | 0.031 | 0.055 | 0.046 |
| min median level | 0.076 | 0.015 | 0.042 | 0.044 |
| bool highway | 0.032 | 0.017 | 0.081 | 0.043 |
| bool ramps | 0.050 | 0.042 | 0.024 | 0.038 |
| count primary street | 0.021 | 0.031 | 0.037 | 0.030 |
| count streets | 0.025 | 0.015 | 0.043 | 0.028 |
| count primary | 0.011 | 0.033 | 0.035 | 0.026 |
| count secondary | 0.022 | 0.016 | 0.029 | 0.022 |

Table 5.2: Waze feature relevance according to mean decrease of impurity of the Random Forest.

A graphical comparison of the XGBoost predictions and the real values is shown in Figure 5.5. In order to produce a comparable map, the hexagon values were binned using a clustering method called Jenks natural breaks classification. It minimizes each class's average deviation from the class mean, while maximizing each class's deviation from the means of the other classes. In order to keep comparability, the bins of NO were also used for NO2. The leftmost maps come from real data, whereas the rightmost maps were produced by the XGBoost model.
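The Jenks binning mentioned above can be sketched with a small dynamic program. It assumes the common formulation that minimizes the within-class sum of squared deviations, which may differ in detail from the implementation used for the maps; the function name and the toy values are hypothetical.

```python
def jenks_breaks(values, n_classes):
    """Upper bound of each class in the optimal split of the sorted values.

    Minimizes the total within-class sum of squared deviations over all
    partitions into n_classes contiguous groups (requires len(values) >= n_classes).
    """
    data = sorted(values)
    n = len(data)

    # Prefix sums give the squared deviation of any slice data[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for v in data:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best cost of splitting data[:j] into k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i

    # Walk back through the stored split points to recover the class bounds.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]


# Hypothetical hexagon concentrations with three visible groups.
concentrations = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 10.1, 9.7]
breaks = jenks_breaks(concentrations, n_classes=3)
print(breaks)
```

Reusing the breaks computed for one pollutant on another, as done for NO and NO2, amounts to calling the function once and applying the same `breaks` list to both sets of values.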

We observe a good agreement between real values and estimates for all pollutants. The model managed to identify the highest concentration areas and, with reasonable fidelity, the rare very low NO2 concentration areas. There are also some hexagon differences that come from underestimation. This is the case for the southwest region of West Oakland, where there is a high concentration of pollutants due to the presence of a highway: XGBoost underestimates the concentration there for NO (map b) and NO2. However, in southeast Oakland, it overestimated the concentration for NO.

Therefore, it is reasonable to conclude that the XGBoost model outperformed the previous models. However, the Random Forest provides a powerful method to assess feature importance. It is called mean decrease of impurity: the impurity reduction obtained at every node that splits on a feature is accumulated and averaged over all trees. The larger the average reduction, the more relevant the feature.

Table 5.2 presents the feature importance for each pollutant and the mean feature importance. It is clear that three features are the most relevant: max length, avg congested prop and avg speed. Those are traffic characteristics that directly describe jam conditions. The first is associated with the jam length, which is only very high on highways and in heavy traffic conditions. The second is the proportion of how often the jams observed in that period get congested, which is a reliable sign of a usually congested region. However, the proportion can also be large if the region has too few roads. Finally, avg speed has a negative correlation with heavy traffic: the lower the speed, the worse the traffic.

Figure 5.5: Comparison between actual median pollutant concentration (left) and concentration estimated by the XGBoost model (right). Panels (a), (b) and (c) correspond to black carbon, NO and NO2, respectively. The concentration is binned by the Jenks natural breaks classification method. Missing hexagons in the estimated maps (right) are due to the lack of Waze data in the region.

5.3 XGBoost Air Pollution Estimation for Montevideo

Waze data from April 2019 were used with the XGBoost model to estimate the pollutant concentrations for Montevideo, Uruguay. It is a medium-sized city with 1.3 million inhabitants and the capital of the country. Oakland, in contrast, has 425 thousand inhabitants and is located in the metropolitan region of a bigger city, San Francisco, California, USA. There are major differences between the cities to consider. They have different vehicle fleets and fuel composition; Montevideo has smaller highways than Oakland; and atmospheric conditions differ, which can affect NO and NO2 estimations. However, both are at sea level and have the same urban area, 200 km2.

Assuming that all these potential variables are either constant or, on average, do not influence the estimation, we estimated the NO concentration. Figure 5.6 is a map of Montevideo with concentration hexagons, using the same bins used to validate the model. Montevideo has over 2000 hexagon regions, but several of them are isolated in countryside areas.

We notice that the highways used to enter the city through the north present medium and high NO concentration. The highway that crosses the city towards the east is also surrounded by medium and high concentration. Also, there is a medium concentration region in the city center, which is the most active part of the city during daytime. However, the highest concentration hexagons are scattered in countryside regions and far from a highway.

| Model | BC R2 | BC MSE | NO R2 | NO MSE | NO2 R2 | NO2 MSE |
|---|---|---|---|---|---|---|
| Highway Linear Regression | 0.29 | 0.086 | 0.36 | 95.8 | 0.40 | 24.0 |
| Multiple Linear Regression | 0.26 | 0.089 | 0.48 | 78.0 | 0.36 | 25.6 |
| Random Forest | 0.53 | 0.058 | 0.61 | 68.5 | 0.49 | 18.6 |
| XGBoost | 0.67 | 0.047 | 0.72 | 42.5 | 0.60 | 16.0 |


Figure 5.6: XGBoost estimation of Nitrogen Oxide (NO) concentration for Montevideo, Uruguay. The estimation used Waze data from April 2019. Dark orange hexagon regions (>53 ppb) are above the yearly average recommended by the United States Environmental Protection Agency [41].


6 Discussion

6.1 Model Limits and Possible Extensions

6.1.1 Challenges in estimating air pollution in urban areas

Estimating air pollution in urban areas is not a straightforward task. It is a complex phenomenon that depends on a wide range of variables. There are several different pollutants. This dissertation covered NO, NO2 and BC, but there are others as harmful as those, such as Sulfur Oxide, Ozone, Micro Particles, Carbon Monoxide and Carbon Dioxide. Each of those pollutants has different sources and goes through different chemical processes to be formed. The main source of NO, BC and NO2 is the burning of fossil fuels: coal, oil and gas. In urban areas, this comes from vehicle combustion, domestic fuel burning and industries.

However, each of those sources has its own variables. Traffic emission depends on the vehicle fleet composition. Different vehicle types present different emission profiles. Trucks and buses, due to their engine size, emit more pollutants per hour than a car or a motorcycle. Also, some countries have stricter regulations about engine emissions, and vehicle age plays a part in combustion efficiency. Fossil fuels can also have different compositions around the globe, which affects domestic fuel and industry emissions. Moreover, some pollutants, such as NO2, come from secondary chemical reactions. Those reactions depend on sunlight, temperature and atmospheric composition.


6.1.2 Biases carried by this model

In this case, in particular, the features used to build the model are primarily related to traffic. Thus, the emission profile is bound to the vehicle fleet characteristics of Oakland, CA, USA. In order to mitigate this bias, a rescaling factor can be introduced, provided that the emission profile of each vehicle type is known. Also, Waze data only covers congested road segments. However, there may be highways or streets with heavy traffic that do not get congested. Thus, high-emitting road segments can be overlooked by Waze data.

Climate and weather effects were attenuated because the pollution data is a year-wide sample. However, atmospheric chemical composition and daylight duration have to be taken into account as variables of distance-decay concentration rates.

Also, the urban background concentration of pollutants, which depends on sources other than traffic, may not be properly estimated by the current method.

6.1.3 Data needed to improve the model

Much of the data needed to improve the model and mitigate the biases is not fully available. First, fine-grained pollution concentration data for other urban areas is rare and usually not fully disclosed. Second, detailed, updated and complete Points of Interest (POI) maps are maintained by companies that do not share the data. Open Street Maps attempts to fill this gap, but the community is still not large enough to keep all urban areas updated. Third, fleet composition data usually exists only for broader political areas such as states or counties.

On the other hand, weather and atmospheric composition data are fully available from open sources.


6.1.4 Modeling decisions and improvements

Several decisions were made to build the model. The pollution data was collected between 2015 and 2016, while the Waze data is from March 2019. We assumed, based on other studies, that the traffic patterns did not change significantly between those periods, and that a one-month average of traffic behaviour represents a year with considerable fidelity. However, it is known that traffic can change greatly during the year depending on school vacations and holidays. Thus, matching the periods of the Waze and pollution samples could greatly improve the accuracy of the model.

When aggregating the pollution data into hexagons, there are two main measures of central tendency to consider, the mean and the median. The mean is affected by very localized emission sources in the hexagon, such as a truck garage. Localized sources like that could appear on very detailed POI maps. However, the median is chosen in this dissertation because the explanatory variables were mostly traffic related.
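A small numerical illustration of the difference between the two aggregates. The readings are hypothetical: six background measurements in one hexagon plus two taken next to a localized source.

```python
from statistics import mean, median

# Hypothetical NO readings (ppb) inside one hexagon; the last two are taken
# next to a localized source such as a truck garage.
readings = [11.0, 12.0, 10.0, 13.0, 12.0, 11.0, 95.0, 120.0]

hexagon_mean = mean(readings)
hexagon_median = median(readings)
print(hexagon_mean, hexagon_median)  # prints: 35.5 12.0
```

The two outlying readings triple the mean while leaving the median at the background level, which is the behaviour that motivates the choice of the median here.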

OSM has the potential to describe land use in a more detailed manner, which could improve the estimation of the urban background concentration [42]. The features built were individual counts, which do not provide much data variability to the estimator. However, combining these features into land-use categories can provide better and richer data.

The feature importance method, mean decrease of impurity, has known flaws [43]. It does not indicate which features decrease the predictive performance of the model. Also, there are several tests showing that random features can be ranked highly. Other, more reliable and model-agnostic methodologies can be used, such as permutation importance or drop-column importance [42], [44].
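Permutation importance, one of the alternatives named above, can be sketched in a model-agnostic way: shuffle one feature column at a time and measure how much the error grows. The fitted `model` below is a hypothetical stand-in that uses the first feature strongly, the second weakly and ignores the third; in practice the score should be computed on held-out data.

```python
import random


def mse(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats, rng):
    """Average increase in MSE when a single feature column is shuffled."""
    baseline = mse(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def model(row):
    """Hypothetical fitted model: strong use of feature 0, weak use of feature 1."""
    return 3.0 * row[0] + 0.5 * row[1]


rng = random.Random(0)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
targets = [model(r) for r in rows]

importances = permutation_importance(model, rows, targets, n_repeats=5, rng=rng)
print(importances)
```

A feature the model never uses gets an importance of exactly zero, which is the property that mean decrease of impurity does not guarantee for random features.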

Moreover, the estimation for Montevideo shows high concentration hexagons scattered in the countryside or on minor roads. This may happen due to the feature avg congested prop, which is high when there is just one segment that is always jammed. Thus, remote regions with traffic lights and intersections can be prone to be classified as high concentration. One way to mitigate that effect is to exclude hexagons with a very low number of unique jam segments.
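The proposed mitigation amounts to a simple filter before prediction. The field names, the threshold and the records below are all hypothetical; the text does not specify a cut-off value.

```python
MIN_UNIQUE_SEGMENTS = 3  # hypothetical threshold, to be chosen by validation

# Hypothetical per-hexagon features.
hexagons = [
    {"hex_id": "a", "unique_jam_segments": 1, "avg_congested_prop": 1.00},
    {"hex_id": "b", "unique_jam_segments": 14, "avg_congested_prop": 0.62},
    {"hex_id": "c", "unique_jam_segments": 2, "avg_congested_prop": 0.95},
    {"hex_id": "d", "unique_jam_segments": 8, "avg_congested_prop": 0.31},
]

# Keep only hexagons with enough distinct jam segments to make the
# congestion proportion meaningful.
kept = [h for h in hexagons if h["unique_jam_segments"] >= MIN_UNIQUE_SEGMENTS]
kept_ids = [h["hex_id"] for h in kept]
print(kept_ids)
```

Hexagons "a" and "c" have a very high congestion proportion driven by one or two segments, which is the pattern the filter is meant to remove.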


Finally, a different approach might produce more general and reproducible results. Projects like OpenAQ [45] compile air quality data from more than 10 thousand locations. Most of the data sets are from sparse fixed monitors of air pollutant concentration. However, due to the large number of different locations, standard LUR models can be trained using only Waze and Open Street Maps features. In this way, the lack of high resolution data is compensated by the volume of different locations. Different validation settings can be used, such as holding out locations or entire cities as validation data sets. This might be a solution to the transferability problem and a way to build a global intra-urban air pollution estimation method.
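Holding out entire cities can be sketched as a leave-one-city-out split. The record layout and the city and monitor names are hypothetical placeholders, not data from OpenAQ.

```python
def leave_one_city_out(records):
    """Yield (held_out_city, train, validation), holding each city out once."""
    cities = sorted({r["city"] for r in records})
    for city in cities:
        train = [r for r in records if r["city"] != city]
        validation = [r for r in records if r["city"] == city]
        yield city, train, validation


# Hypothetical monitor records; a real record would also carry features and a target.
records = [
    {"city": "Oakland", "monitor": "m1"},
    {"city": "Oakland", "monitor": "m2"},
    {"city": "Montevideo", "monitor": "m3"},
    {"city": "Montevideo", "monitor": "m4"},
    {"city": "Rio de Janeiro", "monitor": "m5"},
]

folds = list(leave_one_city_out(records))
for city, train, validation in folds:
    print(city, len(train), len(validation))
```

Because no monitor from the validation city is ever seen during training, the validation error measures how well the model transfers to an unseen city.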

6.2 Social and Health Impacts with Global Reach

This study presents a straightforward approach for estimating air pollution concentration with worldwide reach. Cities in developing countries that have poor or non-existent air quality measurements can obtain this information at a much lower cost. This can have transformative implications for public policy, environmental equity, public awareness and epidemiology.

Policy makers with continental or local reach can improve their decision process. The UN, the WHO and federal agencies can keep track of air pollution goals in large urban areas. Also, along with population density data, they can better estimate the population affected by poor air quality. It also becomes possible to estimate the air quality impact of regulatory actions or traffic changes. Moreover, local governments will be able to locate high concentration areas and take active measures to reduce them.

Further studies on environmental equity can assess the impact of air quality on underprivileged populations. Demographic data, bus routes and origin-destination data can help to understand the population flow in the city. Combined with air pollution regions, the health impact on specific population sectors can be estimated.

Also, routing companies and jogging apps can provide healthier routes to users. If the air quality is publicized, people can make better decisions on which neighbourhood to live or work in, in order to reduce their exposure.


7 Conclusion

This dissertation proposes a framework to estimate intra-urban pollutant concentration using open, crowdsourced and globally available data such as Waze and Open Street Maps. It consists of aggregating geographical data in a hexagonal grid, H3, provided by Uber. A small number of geographical traffic predictors built from Waze, regressed by an XGBoost model, managed to reach results comparable to the literature for black carbon, NO and NO2. Due to its simplicity, the model opens a broad range of improvement paths, either by better feature engineering of Waze and OSM data or by experimenting with different models.

In conclusion, this study suggests that Waze data is a good candidate as a traffic predictor for LUR models. Beyond that, it opens a new path to tackle the transferability problem. The global databases can be used in most of the world and are maintained accordingly. Plenty of other fields of study can build upon those estimations, such as epidemiology and public policy planning.


Bibliography

[1] GBD Results Tool — GHDx. [Online]. Available: http://ghdx.healthdata.org/gbd-results-tool (visited on 06/02/2019).

[2] R. Burnett, H. Chen, M. Szyszkowicz, N. Fann, B. Hubbell, C. A. Pope, J. S. Apte, M. Brauer, A. Cohen, S. Weichenthal, J. Coggins, Q. Di, B. Brunekreef, J. Frostad, S. S. Lim, H. Kan, K. D. Walker, G. D. Thurston, R. B. Hayes, C. C. Lim, M. C. Turner, M. Jerrett, D. Krewski, S. M. Gapstur, W. R. Diver, B. Ostro, D. Goldberg, D. L. Crouse, R. V. Martin, P. Peters, L. Pinault, M. Tjepkema, A. van Donkelaar, P. J. Villeneuve, A. B. Miller, P. Yin, M. Zhou, L. Wang, N. A. H. Janssen, M. Marra, R. W. Atkinson, H. Tsang, T. Quoc Thach, J. B. Cannon, R. T. Allen, J. E. Hart, F. Laden, G. Cesaroni, F. Forastiere, G. Weinmayr, A. Jaensch, G. Nagel, H. Concin, and J. V. Spadaro, “Global estimates of mortality associated with long-term exposure to outdoor fine particulate matter,” Proceedings of the National Academy of Sciences of the United States of America, vol. 115, no. 38, pp. 9592–9597, 2018, issn: 1091-6490. doi: 10.1073/pnas.1803222115.

[3] D. E. Schraufnagel, J. R. Balmes, C. T. Cowl, S. De Matteis, S.-H. Jung, K. Mortimer, R. Perez-Padilla, M. B. Rice, H. Riojas-Rodriguez, A. Sood, G. D. Thurston, T. To, A. Vanker, and D. J. Wuebbles, “Air Pollution and Noncommunicable Diseases,” Chest, vol. 155, no. 2, pp. 409–416, Feb. 2019, issn: 00123692. doi: 10.1016/j.chest.2018.10.042. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0012369218327235 (visited on 06/02/2019).

[4] J. S. Apte, J. D. Marshall, A. J. Cohen, and M. Brauer, “Addressing Global Mortality from Ambient PM2.5,” Environmental Science & Technology, vol. 49, no. 13, pp. 8057–8066, Jul. 2015, issn: 0013-936X. doi: 10.1021/acs.est.5b01236. [Online]. Available: https://doi.org/10.1021/acs.est.5b01236 (visited on 06/02/2019).


[5] H. Carvalho, “The air we breathe: Differentials in global air quality monitoring,” The Lancet Respiratory Medicine, vol. 4, no. 8, pp. 603–605, Aug. 2016, issn: 22132600. doi: 10.1016/S2213-2600(16)30180-1. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S2213260016301801 (visited on 06/02/2019).

[6] G. Hoek, R. Beelen, K. de Hoogh, D. Vienneau, J. Gulliver, P. Fischer, and D. Briggs, “A review of land-use regression models to assess spatial variation of outdoor air pollution,” Atmospheric Environment, vol. 42, no. 33, pp. 7561–7578, Oct. 2008, issn: 1352-2310. doi: 10.1016/j.atmosenv.2008.05.057. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1352231008005748 (visited on 05/31/2019).

[7] H. Forehead and N. Huynh, “Review of modelling air pollution from traffic at street-level - The state of the science,” Environmental Pollution, vol. 241, pp. 775–786, Oct. 2018, issn: 02697491. doi: 10.1016/j.envpol.2018.06.019. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0269749118300216 (visited on 06/01/2019).

[8] N. G. Best, K. Ickstadt, R. L. Wolpert, and D. J. Briggs, Combining models of health and exposure data: The SAVIAH study.

[9] G. Hoek, P. Fischer, P. Van Den Brandt, S. Goldbohm, and B. Brunekreef, “Estimation of long-term average exposure to outdoor air pollution for a cohort study on mortality,” Journal of Exposure Analysis and Environmental Epidemiology, vol. 11, no. 6, pp. 459–469, Dec. 2001, issn: 1053-4245. doi: 10.1038/sj.jea.7500189.

[10] D. K. Moore, M. Jerrett, W. J. Mack, and N. Kunzli, “A land use regression model for predicting ambient fine particulate matter across Los Angeles, CA,” Journal of environmental monitoring: JEM, vol. 9, no. 3, pp. 246–252, Mar. 2007, issn: 1464-0325. doi: 10.1039/b615795e.

[11] D. Briggs, “The Role of Gis: Coping With Space (And Time) in Air Pollution Exposure Assessment,” Journal of Toxicology and Environmental Health, Part A, vol. 68, no. 13-14, pp. 1243–1261, Jul. 2005, issn: 1528-7394, 1087-2620. doi: 10.1080/15287390590936094. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/15287390590936094 (visited on 06/01/2019).


[12] D. J. Briggs, S. Collins, P. Elliott, P. Fischer, S. Kingham, E. Lebret, K. Pryl, H. V. Reeuwijk, K. Smallbone, and A. V. D. Veen, “Mapping urban air pollution using GIS: A regression-based approach,” International Journal of Geographical Information Science, vol. 11, no. 7, pp. 699–718, Oct. 1997, issn: 1365-8816. doi: 10.1080/136588197242158. [Online]. Available: https://doi.org/10.1080/136588197242158 (visited on 05/31/2019).

[13] Z. Ross, P. B. English, R. Scalf, R. Gunier, S. Smorodinsky, S. Wall, and M. Jerrett, “Nitrogen dioxide prediction in Southern California using land use regression modeling: Potential for environmental health analyses,” Journal of Exposure Science & Environmental Epidemiology, vol. 16, no. 2, p. 106, Mar. 2006, issn: 1559-064X. doi: 10.1038/sj.jea.7500442. [Online]. Available: https://www.nature.com/articles/7500442 (visited on 05/31/2019).

[14] M. Jerrett, M. A. Arain, P. Kanaroglou, B. Beckerman, D. Crouse, N. L. Gilbert, J. R. Brook, N. Finkelstein, and M. M. Finkelstein, “Modeling the intraurban variability of ambient traffic pollution in Toronto, Canada,” Journal of Toxicology and Environmental Health. Part A, vol. 70, no. 3-4, pp. 200–212, Feb. 2007, issn: 1528-7394. doi: 10.1080/15287390600883018.

[15] S. Gupta, E. Pebesma, J. Mateu, and A. Degbelo, “Air Quality Monitoring Network Design Optimisation for Robust Land Use Regression Models,” Sustainability, vol. 10, no. 5, p. 1442, May 2018. doi: 10.3390/su10051442. [Online]. Available: https://www.mdpi.com/2071-1050/10/5/1442 (visited on 05/31/2019).

[16] J. Van den Bossche, J. Peters, J. Verwaeren, D. Botteldooren, J. Theunis, and B. De Baets, “Mobile monitoring for mapping spatial variation in urban air quality: Development and validation of a methodology based on an extensive dataset,” Atmospheric Environment, vol. 105, pp. 148–161, Mar. 2015, issn: 1352-2310. doi: 10.1016/j.atmosenv.2015.01.017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1352231015000254 (visited on 06/01/2019).

[17] J. Kerckhoffs, G. Hoek, J. Vlaanderen, E. van Nunen, K. Messier, B. Brunekreef, J. Gulliver, and R. Vermeulen, “Robustness of intra urban land-use regression models for ultrafine particles and black carbon based on mobile monitoring,” Environmental Research, vol. 159, pp. 500–508, Nov. 2017, issn: 00139351. doi: 10.1016/j.envres.2017.08.040. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0013935117310186 (visited on 06/01/2019).

[18] J. S. Apte, K. P. Messier, S. Gani, M. Brauer, T. W. Kirchstetter, M. M. Lunden, J. D. Marshall, C. J. Portier, R. C. Vermeulen, and S. P. Hamburg, “High-Resolution Air Pollution Mapping with Google Street View Cars: Exploiting Big Data,” Environmental Science & Technology, vol. 51, no. 12, pp. 6999–7008, Jun. 2017, issn: 0013-936X, 1520-5851. doi: 10.1021/acs.est.7b00891. [Online]. Available: http://pubs.acs.org/doi/10.1021/acs.est.7b00891 (visited on 06/01/2019).

[19] W. R. Tobler, “A Computer Movie Simulating Urban Growth in the Detroit Region,” Economic Geography, vol. 46, pp. 234–240, 1970, issn: 0013-0095. doi: 10.2307/143141. [Online]. Available: https://www.jstor.org/stable/143141 (visited on 06/01/2019).

[20] S. H. Moolgavkar, “Air pollution and daily mortality in three U.S. counties,” Environmental Health Perspectives, vol. 108, no. 8, pp. 777–784, Aug. 2000, issn: 0091-6765. doi: 10.1289/ehp.00108777.

[21] A. Biggeri, F. Barbone, C. Lagazio, M. Bovenzi, and G. Stanta, “Air pollution and lung cancer in Trieste, Italy: Spatial analysis of risk as a function of distance from sources,” Environmental Health Perspectives, vol. 104, no. 7, pp. 750–754, Jul. 1996, issn: 0091-6765. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1469415/ (visited on 05/31/2019).

[22] R. M. Harrison, P. L. Leung, L. Somervaille, R. Smith, and E. Gilman, “Analysis of incidence of childhood cancer in the West Midlands of the United Kingdom in relation to proximity to main roads and petrol stations,” Occupational and Environmental Medicine, vol. 56, no. 11, pp. 774–780, Nov. 1999, issn: 1351-0711. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1757680/ (visited on 05/31/2019).

[23] P. English, R. Neutra, R. Scalf, M. Sullivan, L. Waller, and L. Zhu, “Examining associations between childhood asthma and traffic flow using a geographic information system,” Environmental Health Perspectives, vol. 107, no. 9, pp. 761–767, Sep. 1999, issn: 0091-6765. doi: 10.1289/ehp.99107761.

[24] A. E. Livingstone, G. Shaddick, C. Grundy, and P. Elliott, “Do people living near inner city main roads have more asthma needing treatment? Case control study,” BMJ: British Medical Journal, vol. 312, no. 7032, pp. 676–677, Mar. 1996, issn: 0959-8138. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2350529/ (visited on 05/31/2019).

[25] G. Hoek, B. Brunekreef, S. Goldbohm, P. Fischer, and P. A. van den Brandt, “Association between mortality and indicators of traffic-related air pollution in the Netherlands: A cohort study,” Lancet (London, England), vol. 360, no. 9341, pp. 1203–1209, Oct. 2002, issn: 0140-6736. doi: 10.1016/S0140-6736(02)11280-3.

[26] L.-J. S. Liu and A. J. Rossini, “Use of kriging models to predict 12-hour mean ozone concentrations in Metropolitan Toronto - A pilot study,” Environment International, vol. 22, no. 6, pp. 677–692, Jan. 1996, issn: 0160-4120. doi: 10.1016/S0160-4120(96)00059-1. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0160412096000591 (visited on 05/31/2019).

[27] M. Fallah Shorshani, M. Andre, C. Bonhomme, and C. Seigneur, “Modelling chain for the effect of road traffic on air and water quality: Techniques, current status and future prospects,” Environmental Modelling & Software, vol. 64, pp. 102–123, Feb. 2015, issn: 1364-8152. doi: 10.1016/j.envsoft.2014.11.020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1364815214003466 (visited on 06/01/2019).

[28] O. U. EPA, Air Quality Dispersion Modeling, Policies and Guidance, Sep. 2016. [Online]. Available: https://www.epa.gov/scram/air-quality-dispersion-modeling (visited on 06/01/2019).

[29] D. J. Briggs and P. Elliott, “The use of geographical information systems in studies on environment and health,” World Health Statistics Quarterly. Rapport Trimestriel De Statistiques Sanitaires Mondiales, vol. 48, no. 2, pp. 85–94, 1995, issn: 0379-8070.


[30] M. Hochadel, J. Heinrich, U. Gehring, V. Morgenstern, T. Kuhlbusch, E. Link, H.-E. Wichmann, and U. Kramer, “Predicting long-term average concentrations of traffic-related air pollutants using GIS-based information,” Atmospheric Environment, vol. 40, no. 3, pp. 542–553, Jan. 2006, issn: 1352-2310. doi: 10.1016/j.atmosenv.2005.09.067. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1352231005009222 (visited on 05/31/2019).

[31] R. Beelen, G. Hoek, P. Fischer, P. A. v. d. Brandt, and B. Brunekreef, “Estimated long-term outdoor air pollution concentrations in a cohort study,” Atmospheric Environment, vol. 41, no. 7, pp. 1343–1358, Mar. 2007, issn: 1352-2310. doi: 10.1016/j.atmosenv.2006.10.020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1352231006010351 (visited on 06/02/2019).

[32] T. Sahsuvaroglu, A. Arain, P. Kanaroglou, N. Finkelstein, B. Newbold, M. Jerrett, B. Beckerman, J. Brook, M. Finkelstein, and N. L. Gilbert, “A Land Use Regression Model for Predicting Ambient Concentrations of Nitrogen Dioxide in Hamilton, Ontario, Canada,” Journal of the Air & Waste Management Association, vol. 56, no. 8, pp. 1059–1069, Aug. 2006, issn: 1096-2247. doi: 10.1080/10473289.2006.10464542. [Online]. Available: https://doi.org/10.1080/10473289.2006.10464542 (visited on 06/02/2019).

[33] Y. Yue, Y. Zhuang, A. G. O. Yeh, J.-Y. Xie, C.-L. Ma, and Q.-Q. Li, “Measurements of POI-based mixed use and their relationships with neighbourhood vibrancy,” International Journal of Geographical Information Science, vol. 31, no. 4, pp. 658–675, Apr. 2017, issn: 1365-8816. doi: 10.1080/13658816.2016.1220561. [Online]. Available: https://doi.org/10.1080/13658816.2016.1220561 (visited on 06/02/2019).

[34] S. B. Henderson, B. Beckerman, M. Jerrett, and M. Brauer, “Application of land use regression to estimate long-term concentrations of traffic-related nitrogen oxides and fine particulate matter,” Environmental Science & Technology, vol. 41, no. 7, pp. 2422–2428, Apr. 2007, issn: 0013-936X.

[35] G. Hoek, R. Beelen, K. de Hoogh, D. Vienneau, J. Gulliver, P. Fischer, and D. Briggs, “A review of land-use regression models to assess spatial variation of outdoor air pollution,” Atmospheric Environment, vol. 42, no. 33, pp. 7561–7578, Oct. 2008, issn: 13522310. doi: 10.1016/j.atmosenv.2008.05.057. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1352231008005748 (visited on 05/31/2019).

[36] C. Madsen, K. C. L. Carlsen, G. Hoek, B. Oftedal, P. Nafstad, K. Meliefste, R. Jacobsen, W. Nystad, K.-H. Carlsen, and B. Brunekreef, “Modeling the intra-urban variability of outdoor traffic pollution in Oslo, Norway—A GA2len project,” Atmospheric Environment, vol. 41, no. 35, pp. 7500–7511, Nov. 2007, issn: 1352-2310. doi: 10.1016/j.atmosenv.2007.05.039. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1352231007005043 (visited on 05/31/2019).

[37] P. H. Ryan and G. K. LeMasters, “A Review of Land-use Regression Models for Characterizing Intraurban Air Pollution Exposure,” Inhalation Toxicology, vol. 19, no. Suppl 1, pp. 127–133, 2007, issn: 0895-8378. doi: 10.1080/08958370701495998. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2233947/ (visited on 05/31/2019).

[38] B. Song, J. Wu, Y. Zhou, and K. Hu, “Fine-Scale Prediction of Roadside CO and NOx Concentration Based on Random Forest,” Journal of Residuals Science and Technology, vol. 11, Nov. 2014.

[39] A. Liaw and M. Wiener, “Classification and Regression by randomForest,” en, vol. 2, p. 6, 2002.

[40] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” en, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, San Francisco, California, USA: ACM Press, 2016, pp. 785–794, isbn: 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2939672.2939785 (visited on 06/02/2019).

[41] O. US EPA, NAAQS Table, en, Policies and Guidance, Apr. 2014. [Online]. Available: https://www.epa.gov/criteria-air-pollutants/naaqs-table (visited on 06/02/2019).


[42] J. Estima and M. Painho, “Exploratory analysis of OpenStreetMap for land use classification,” en, in Proceedings of the Second ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information - GEOCROWD ’13, Orlando, Florida: ACM Press, 2013, pp. 39–46, isbn: 978-1-4503-2528-8. doi: 10.1145/2534732.2534734. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2534732.2534734 (visited on 06/02/2019).

[43] P. Radivojac, Z. Obradovic, A. K. Dunker, and S. Vucetic, “Feature Selection Filters Based on the Permutation Test,” en, in Machine Learning: ECML 2004, J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, Eds., ser. Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2004, pp. 334–346, isbn: 978-3-540-30115-8.

[44] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” en, Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, Jan. 2014, issn: 0045-7906. doi: 10.1016/j.compeleceng.2013.11.024. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0045790613003066 (visited on 06/02/2019).

[45] Ambient Air Pollution Exposure Estimation for the Global Burden of Disease 2013 — Environmental Science & Technology. [Online]. Available: https://pubs.acs.org/doi/abs/10.1021/acs.est.5b03709 (visited on 06/02/2019).
