


PM10 forecasting using clusterwise regression

Jean-Michel Poggi a,c, Bruno Portier b,*

a Laboratoire de Mathématiques d'Orsay, Université Paris-Sud, 91405 Orsay, France
b Laboratoire de Mathématiques, INSA de Rouen, BP 08, Avenue de l'Université, 76800 Saint-Etienne du Rouvray, France
c Université Paris-Descartes, France

Article info

Article history:
Received 20 May 2011
Received in revised form 7 September 2011
Accepted 9 September 2011

Keywords:
Particulate matter
Forecasting
Clusterwise linear models
Generalized additive models
Random forests
Rouen

* Corresponding author. E-mail address: [email protected] (B. Portier).

1352-2310/$ – see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.atmosenv.2011.09.016

Abstract

In this paper, we are interested in the statistical forecasting of the daily mean PM10 concentration. Hourly concentrations of PM10 have been measured in the city of Rouen, in Haute-Normandie, France. Located northwest of Paris, near the south side of the English Channel, the region is heavily industrialised. We consider three monitoring stations reflecting the diversity of situations: an urban background station, a traffic station and an industrial station near the cereal harbour of Rouen. We have focused our attention on data for the months that register higher values, from December to March, over the years 2004–2009. The models are obtained from the winter days of the four seasons 2004/2005 to 2007/2008 (training data) and then the forecasting performance is evaluated on the winter days of the season 2008/2009 (test data).

We show that it is possible to accurately forecast the daily mean concentration by fitting a function of meteorological predictors and the average concentration measured on the previous day. The values of observed meteorological variables are used for fitting the models and are also considered for the test data. We have compared the forecasts produced by three different methods: persistence, generalized additive nonlinear models and clusterwise linear regression models. This last method gives very impressive results, and the end of the paper analyzes the reasons for such a good behavior.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The atmosphere consists of gases but also of particulate matter condensed in liquid or solid form. The particles have multiple origins and come from natural sources (sea salt, volcanic eruptions, forest fires, wind erosion of soils, ...) as well as from human activities (transport, heating, industry, agriculture, ...). They can also be of secondary origin, i.e., formed by combination through complex physicochemical processes. Their characteristics are extremely diverse and may vary over time. They are often classified according to a standard size or chemical composition, which often determines the intensity of their health impact.

The problem of air pollution by particulate matter, although complex, is a real public health issue requiring a response from the public authorities. New standards designed to comply with new monitoring requirements have updated the regulation of airborne particles, at the European level by the European Council and at the national level by the "tightening up" of the thresholds that trigger information, caution and warning procedures respectively. The European regulation sets that the daily PM10 concentration (particles whose diameter is less than 10 µm) cannot exceed 50 µg m⁻³ more than 35 days per year, and that the annual mean cannot exceed 40 µg m⁻³. Recently the French government adopted a national plan on particles. Airborne particulate matter such as PM10, measured by Tapered Element Oscillating Microbalance (TEOM) continuous monitors, has been routinely measured in France and in Europe for more than 10 years.

Pollution forecasting for suspended particles in the air is obviously an important issue for the Haute-Normandie region, but also for France. Indeed, for several years, the limit values for PM10 concentrations have been exceeded in several French regions. The development of a statistical forecasting technique for PM10 concentrations, aiming in particular to improve early warning procedures useful for sensitive people, would be an important tool for Air Normand, the local air quality agency, and therefore for the region of Haute-Normandie.

Many references in the literature address PM10 forecasting; see for example Grivas and Chaloulakou (2006) or the long and detailed introduction of Dong et al. (2009) highlighting methods and models. The references related to the use of statistical approaches can be distinguished according to the adopted model and the predictors involved.

A large panel of methods is available, among which we find neural networks, linear regression models, nonlinear parametric or nonparametric models, but also new strategies such as mixtures of predictions from different models or clusterwise regression models. Let us briefly review some of them.

First of all, neural networks are the most frequently used. The recent paper of Paschalidou et al. (2011) offers a synthesis about neural-network-based forecasting and compares different neural network architectures, including multilayer perceptrons as well as radial basis neural networks. But of course, due to the lack of theoretical results from the statistical viewpoint as well as the low interpretability of this class of black-box models, some alternative strategies have been considered.

Indeed, multiple linear modeling is also frequently used, as in Stadlober et al. (2008) for example, focusing on winter days in the basin areas of the Alps. The results are very satisfactory, even if low values are often overestimated while high values are frequently underestimated. This method is then often considered as a competitor against which a simpler operational scheme is compared.

For example, Slini et al. (2006) use and compare classification and regression trees (CART), neural networks, linear regression and regression on principal components in Thessaloniki, Greece. The conclusion is that CART and neural networks capture PM10 trends, while CART is better for alarm forecasting.

In Milan, Italy, where PM10 pollution is extremely important and easy to predict, Corani (2005) uses a nonparametric method based on local polynomials to estimate a nonlinear regression model, which delivers good results.

The introduction of clusterwise linear models for PM10 forecasting, promoted in this paper, is already present for example in Sfetsos and Vlachogiannis (2010), considering Athens, Helsinki and London. A simple global model is first considered and then a two-step approach is proposed: first, the days are classified; second, a linear model is used inside each class for forecasting. The benefit with respect to a single global model is significant in performance, but also in interpretability and simplicity of the models. The clusters come from a supervised classification integrating observed PM10, leading to more coherent models inside each class. With respect to this previous work, we propose in this paper to use a more global way to handle such clusterwise linear regression and to simultaneously optimize the clusters and the local models. The underlying statistical modeling framework is the so-called mixtures of linear regressions (see McLachlan and Peel, 2000). We still use in the sequel the expression "clusterwise regression" because it is more intuitive, even if it is somewhat an abuse of language since the induced clusters are designed using the response variable.

The variables used in forecasting schemes may simply combine conventional pollutants and meteorological variables, but often also include model output statistics or constructed variables.

We find of course classical variables involving the wind speed and direction, solar radiation (considered in some cases as a precursor of other pollutants), relative humidity, temperatures and temperature gradients, atmospheric pressure and rain. Sometimes, cloud cover and dew point are included in the first set of candidates, before a variable selection step. A persistence term is very often introduced through the average PM10 of the day before. For example, Chaloulakou et al. (2003a) compare a neural network and a multiple linear regression model for PM10 forecasting in Athens. They show that the lagged PM10 is an important predictor: it appears to be as informative alone as the considered set of meteorological variables.

In this paper, we consider daily mean concentrations of PM10 measured in the city of Rouen, in Haute-Normandie, France. Located northwest of Paris, near the south side of the English Channel, the Haute-Normandie region is heavily industrialised. We consider three monitoring stations of the Air Normand network, reflecting the diversity of situations: an urban background station, a traffic station and an industrial station near the cereal harbour of Rouen. We have focused our attention on data for the months that register higher values, from December to March, over the years 2004–2009. The models are obtained from the winter days of the four seasons 2004/2005 to 2007/2008 (training data) and then the forecasting performance is evaluated on the winter days of the season 2008/2009 (test data). To give some general ideas, let us note that the total numbers of exceedances of the threshold value 50 µg m⁻³ over the six years at the three sites are respectively about 16, 35 and 13 for winter days and 21, 17 and 44 for the whole period. Considering a smaller threshold value, the average numbers of exceedances of 30 µg m⁻³ per year are about 20, 42 and 22 for winter days and about twice as many for the whole year. The annual averages of PM10 are about 21, 26 and 20 µg m⁻³, while the levels in the winter season increase by about 2 µg m⁻³.

We show that it is possible to accurately forecast the daily mean concentration by fitting a function of meteorological predictors and the average concentration measured on the previous day. The values of observed meteorological variables are used for fitting the models and are also considered for the test data. We have compared the forecasts produced by three different statistical models: persistence, generalized additive models and clusterwise linear models. This last method gives impressive results. Of course, a lot of competitors could be chosen in order to assess the present proposition, including neural networks for example. But since we are primarily interested in understanding the good behavior of the forecasting model, we favor two explicit global models: the simplest one (persistence) as the basic reference and, as a second competitor, the generalized additive model, which contains the linear model and which is of intermediate complexity. Such models have been recently used in a similar pollution context by Aldrin and Haff (2005) and Barmpadimos et al. (2011) to study the influence of meteorology on air pollution and traffic volume, or on PM10 trends.

Our paper is organized as follows. Section 2 describes the data, PM10 and meteorological, and recalls some basics about the considered statistical methods. Section 3 presents the results obtained by the different models on the three considered stations, including the choice of predictors and the forecasting performance on test data as well as on training data. Section 4 analyzes more deeply the clusterwise linear model capabilities, focusing on the urban background station. Finally, some concluding remarks are collected in Section 5.

2. Materials and methods

Rouen (latitude 49°25′N, longitude 1°4′E) is located in the Haute-Normandie region of northern France, near the south side of the English Channel and northwest of Paris (about 100 km away). Haute-Normandie is heavily industrialised and more than 490,000 people live in Rouen and its agglomeration.

We have focused our attention on data for the months from December to March, over the years 2004–2009. Indeed, during these months, higher values are registered, both on average and in number of exceedances, and the data can be considered homogeneous in terms of pollution sources, which are not included in the models. The data set consists of PM10 daily mean concentrations and many meteorological data over the period 2004–2009. The data are presented in the next two subsections.

The statistical tools used in this work are not classical (more precisely, they are not widely known) but have been recently used for analysing PM10 pollution in Haute-Normandie (see Jollois et al., 2009 and Bobbia et al., 2011). We recall some elements about these methods in the third subsection.

Finally, it should be noted that published studies for PM10 forecasting in France are very difficult to find.


Table 1
Summary statistics of daily mean PM10 concentrations (in µg m⁻³) for the three stations.

                 JUS      GCM      GUI
Minimum            7        5        9
1st Quartile      16       13       19
Median            20       19       26
Mean              22.34    21.43    27.97
3rd Quartile      26       26       33
Maximum           81       88      100
SD                10.17    10.76    12.31
Missing values    10        0        2
No. of values    697      697      697


Let us mention the paper of Zolghadri and Cazaurang (2006), which considers only one station in Bordeaux and uses an extended Kalman filter.

2.1. PM10 data

The PM10 data come from three monitoring stations of the Air Normand network. They are located in Rouen city, namely Palais de Justice (JUS), Guillaume-Le-Conquérant (GUI) and Grand-Couronne (GCM). This choice reflects the diversity of situations by considering the urban background station JUS, the roadside station GUI, which is the second most polluted in the region, and the industrial station GCM, which is near the cereal harbour of Rouen, one of the most important in Europe, in order to cover as wide a panel of situations as possible. In Fig. 1, we can find the boxplots of PM10 from each monitoring station. The boxplots are based on the 697 winter days of the six years.

In addition to these boxplots, Table 1, which contains a summary of basic statistics, complements this synthetic view.

As can be seen, the three monitoring stations differ in their daily mean PM10 concentration distributions, especially GUI from the two others. In addition, we emphasize that, from our previous studies (see Jollois et al., 2009), they are also very different in terms of the intrinsic difficulty of modelling PM10 concentrations with the same predictors, namely JUS is easy, GCM is hard and GUI is medium. So we capture in this subset of PM10 monitoring stations of the Rouen network the meaningful part of the crucial difficulties for the forecasting problem.

2.2. Meteorological data

Meteorological data are provided by Météo-France (the French national meteorological service) and come from one monitoring site located near Rouen. We have retained classical meteorological indicators including the daily minimum (Tmin), mean (Tmoy) and maximum (Tmax) temperature (in °C), the daily total rain (PLsom, in mm), the daily mean atmospheric pressure (PAmoy, in hPa), the daily maximum (VVmax, in m/s) and mean (VVmoy, in m/s) wind speed, the daily most frequently observed wind direction (DVdom, in °), the wind direction associated with the daily maximum wind speed (DVmaxvv, in °) and the daily minimum (HRmin, in percentage), maximum (HRmax) and mean (HRmoy) relative humidity. In addition, we use a measure of the vertical temperature difference (GTrouen, in °C), giving us an idea of the mixing height.

Fig. 1. Boxplots of daily mean PM10 concentrations (in µg m⁻³) for the three stations.

In Table 2, a summary of basic statistics gives some general information about their distributions. The last column of the table gives the number of missing values, and we note that it is always very small.

2.3. Methodologies

We give here some elements of the statistical tools used in thispaper.

2.3.1. Variable selection using random forests

In the forecasting problem, the choice of well-suited predictors is of course crucial. To deal with this difficult problem, we use a recent, highly nonlinear and nonparametric statistical method introduced by Breiman (2001) and called random forests. A random forest is an aggregation of many binary decision trees obtained using the CART method (see Breiman et al., 1984) or some unpruned version of CART.

In this context, the CART model is a binary decision tree, very easy to understand and very general: input data do not need to be Gaussian and nonlinear relationships between the response and the explanatory variables can be handled. A model is built by first performing a growing step that recursively partitions the set of observations, choosing at each step the best split (a split is defined by a variable and a threshold) dividing the current set of observations into two subsets so as to minimize the local internal variance of the response. To prevent overfitting, a pruning step is then performed to select a convenient subtree. To cope with the instability of this method with respect to perturbations of the training data set, Breiman (2001) introduced a new method called Random Forests (RF for short). The principle of RF is to combine many binary decision trees built using several bootstrap samples drawn from the set of observations and choosing randomly, at each node, a subset of explanatory variables.

Table 2
Summary statistics for meteorological variables. The last column gives the number of missing values.

            Min.   1st Qu.  Median   Mean    3rd Qu.  Max.    SD     Missing
Tmin       -10.8    -0.9      1.7     1.7      4.6     12.1    3.87   2
Tmoy        -6.7     1.9      4.9     4.8      7.6     14.3    4.11   2
Tmax        -2.5     4.7      7.7     7.6     10.6     22      3.8    2
VVmoy        1.2     3.0      4.2     4.6      5.9     11.9    2      1
VVmax        2       5        7       7.2      9       17      2.8    1
DVdom        0      90      180     162      225      315     98.7   0
DVmaxvv     10     100      200     185      260      360     98     1
PAmoy      980    1011     1019    1018     1028     1042     11.8   2
HRmin       24      65       73      71.61    80       98      8.4   8
HRmoy       46      81.18    87      85       91       99      4.9   3
HRmax       56      92       95      93.8     97      100      2.2   3
GTrouen     -1.5    -0.3      0.8     1.3      2       14      3.8   0
PLsom        0       0        0       1.88     2       28      3.8   0


More precisely, at each node, a given number of input variables is randomly chosen and the best split is calculated only within this subset; moreover, no pruning step is performed, so all the trees are maximal trees. To evaluate the quality of the fitted model, it is common to estimate the generalization error thanks to a test set. Breiman introduced the OOB scheme: the error is estimated through the Out-Of-Bag (OOB) error, calculated along the iterations of the algorithm. The OOB error corresponds to the prediction error for the data not belonging to the bootstrap sample used to build the tree.

Admittedly, the RF method builds a black-box prediction model, difficult to interpret, but one which efficiently computes individual importance scores for the regressors.

The quantification of variable importance is an important issue. In the RF framework, the most widely used importance score of a given variable is the mean increase of the error of a tree (quantified by the mean square error) in the forest when the observed values of this variable are randomly permuted in the OOB samples. The idea is simple: if a variable is not important, then a random permutation would not degrade the prediction too much, while it deteriorates it strongly if the variable is important: the higher the importance, the stronger the variable's influence.

We perform the statistical analyses using R (http://www.r-project.org/) with the associated R package randomForest, seeLiaw and Wiener (2002).
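As an illustration, a minimal R sketch of this step could look as follows (the data frame train and its exact column names are assumptions, not taken from the paper):

```r
library(randomForest)

# Hypothetical training data frame: one row per winter day, with the daily
# mean PM10 and the candidate predictors described in Section 2.2.
# train <- read.csv("winter_days_2004_2008.csv")

set.seed(1)
rf <- randomForest(PM10 ~ PM10hier + Tmin + Tmoy + Tmax + VVmoy + VVmax +
                     DVdom + DVmaxvv + PAmoy + HRmin + HRmoy + HRmax +
                     GTrouen + PLsom,
                   data = train, ntree = 500, importance = TRUE,
                   na.action = na.omit)

# Permutation importance (%IncMSE): mean increase of a tree's MSE when the
# values of one variable are randomly permuted in the OOB samples.
imp <- importance(rf, type = 1)
print(imp[order(imp, decreasing = TRUE), , drop = FALSE])
```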

2.3.2. Nonlinear additive models

Nonlinear additive models have been introduced by Breiman and Friedman (1985) and have been widely used since the work of Hastie and Tibshirani (1990).

These models are more flexible than traditional linear models since they allow nonlinear effects to be modelled instead of simple linear ones, while the additivity property preserves ease of interpretation through the separability of the effects of each regressor. For these reasons, such models are attractive for modeling and forecasting problems (see for example Aldrin and Haff, 2005 and Barmpadimos et al., 2011).

To estimate nonlinear additive models, we use the R package mgcv developed by Wood (2006). Nonlinear functions are estimated using the backfitting algorithm based on penalized regression splines.
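A minimal sketch of such a fit with mgcv, using the five predictors selected later in Section 3.1, could look as follows (the data frames train and test and their column names are assumptions):

```r
library(mgcv)

# Generalized additive model: one smooth term (penalized regression spline)
# per predictor; the linear model is recovered when the smooths are linear.
gam_fit <- gam(PM10 ~ s(PM10hier) + s(VVmoy) + s(GTrouen) + s(Tmoy) + s(PAmoy),
               data = train)

summary(gam_fit)            # significance and effective degrees of freedom
# plot(gam_fit, pages = 1)  # estimated smooth effect of each predictor

# One-day-ahead forecasts on the test winter (2008/2009), assuming a data
# frame `test` with the same columns.
pred_gam <- predict(gam_fit, newdata = test)
```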

2.3.3. Clusterwise linear models

Mixture models, Gaussian in particular, are widely used in statistics, including classification (see McLachlan and Peel, 2000). They have recently been extended by mixing standard linear regression models. The main hypothesis is that observations come from a mixture of s components in some unknown proportions and, in each component, observations are modeled using a linear regression model. The purpose is then to estimate the parameters of each linear model and the parameters defining the components. In the clustering context, each object is supposed to be generated by one of the components of the mixture model being fitted. The partition is derived from these parameters using the maximum a posteriori (MAP) principle from the posterior probabilities for an object to belong to a component.
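In formula form, assuming Gaussian noise in each component (notation introduced here for clarity, not taken from the text), the mixture density of the response y given the predictors x, and the posterior probability of component k used for the MAP assignment, read

```latex
f(y \mid x) \;=\; \sum_{k=1}^{s} \pi_k \,
  \mathcal{N}\!\left(y;\; x^{\top}\beta_k,\; \sigma_k^{2}\right),
\qquad
\tau_k(x, y) \;=\;
  \frac{\pi_k \, \mathcal{N}\!\left(y;\; x^{\top}\beta_k,\; \sigma_k^{2}\right)}
       {f(y \mid x)},
```

where the mixing proportions satisfy π_k ≥ 0 and sum to 1.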

Finite mixture models with a fixed number of components are usually estimated, under the maximum-likelihood estimation framework, using the expectation-maximization (EM) algorithm (Dempster et al., 1977). This algorithm iteratively repeats two steps until convergence. The first step (E) computes the conditional expectation of the complete log-likelihood, and the second one (M) computes the parameters maximizing the complete log-likelihood. The number of components is generally unknown and needs to be estimated. A classical approach is to fit models with various numbers of components and to compare them using the BIC criterion (Schwarz, 1978), for example.

To compute a mixture of linear regressions, we use the flexmix R package, described in Leisch (2004) and Grün and Leisch (2007). This package has been recently extended to handle more general models in each cluster (see Grün and Leisch, 2008), but this is beyond the scope of this paper.
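A minimal sketch of such a fit with flexmix, here with 2 components on standardized training data (object and column names are assumptions; see Section 4 for the standardization step):

```r
library(flexmix)

# Fit a two-component mixture of linear regressions on standardized variables.
set.seed(1)
clm2 <- flexmix(PM10 ~ PM10hier + Tmoy + VVmoy + PAmoy + GTrouen,
                data = train_std, k = 2)

parameters(clm2)        # regression coefficients and sigma for each component
prior(clm2)             # estimated mixing proportions
head(posterior(clm2))   # posterior membership probabilities (training days)
table(clusters(clm2))   # MAP cluster assignment and cluster sizes
BIC(clm2)
```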

Instead of talking about mixtures of linear regressions, we use in the sequel the expression "clusterwise regression" because it is more intuitive, even if it is somewhat an abuse of language since the induced clusters are designed using the response variable.

3. Results and discussion

Let us now present the different results obtained for the twoclasses of models on the three stations considered. For each station,we first give the forecasting performance obtained for the trainingdata: the winter days of the seasons 2004/2005 to 2007/2008, andthen the forecasting performance on the test data: the winter daysof December 2008 to March 2009.

Usual statistical indices are used to provide indications of thequality of the prediction or the forecast with respect to theobserved PM10 concentration. We must distinguish (see forexample Chaloulakou et al., 2003b for a detailed definition of theseindicators) those based on forecast errors from those based on levelexceedances.

In the first category: the percentage of explained variance (EV), given by 1 minus the ratio of the residual variance to the observed PM10 variance, the correlation coefficient between the prediction (or forecast) and the observed PM10 concentration (R), the mean absolute percentage error (MAPE), the root mean square error (RMSE), the index of agreement (IA) and finally the skill score (SS). More precisely, IA gives the degree to which model predictions are error free (range [0, 1], with a best value of 1); SS measures the relative improvement with respect to the persistence forecast (range [-1, 1]; a value of 0.5 or more indicates a significant improvement in skill). Note that the percentage of explained variance EV does not exactly correspond to the square of the correlation coefficient R, even if this holds for linear models.

In the second category, classical indicators are the probability of detection (POD, range [0, 1] with a best value of 1), the false alarm rate (FAR, range [0, 1] with a best value of 0) and the threat score (TS, range [0, 1] with a best value of 1). Let us mention that the threshold value used for computing POD, TS and FAR is set to 30 µg m⁻³, instead of 50 µg m⁻³, in order to take into account the fact that TEOM measurements do not integrate the volatile fraction.
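For reference, a possible R implementation of these indices, following the usual definitions as we understand them (the observation, forecast and persistence vectors, as well as the exceedance convention, are assumptions):

```r
# Indices based on forecast errors
eval_errors <- function(obs, pred, pred_persist) {
  e    <- pred - obs
  EV   <- 1 - var(e) / var(obs)                    # explained variance
  R    <- cor(obs, pred)
  MAPE <- mean(abs(e) / obs)
  RMSE <- sqrt(mean(e^2))
  IA   <- 1 - sum(e^2) /
              sum((abs(pred - mean(obs)) + abs(obs - mean(obs)))^2)
  SS   <- 1 - mean(e^2) / mean((pred_persist - obs)^2)  # skill vs persistence
  c(R = R, EV = EV, IA = IA, SS = SS, MAPE = MAPE, RMSE = RMSE)
}

# Indices based on level exceedances; the threshold is 30 instead of 50
# because TEOM measurements do not integrate the volatile fraction.
eval_exceedances <- function(obs, pred, threshold = 30) {
  hits   <- sum(obs >= threshold & pred >= threshold)
  misses <- sum(obs >= threshold & pred <  threshold)
  false  <- sum(obs <  threshold & pred >= threshold)
  c(POD = hits / (hits + misses),
    FAR = false / (hits + false),
    TS  = hits / (hits + misses + false))
}
```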

3.1. Choice of predictors

The choice of the predictors used in the models is based on therandom forest variable importance. More precisely, the variablesare ranked for each station using the average importance of twentyrandom forests to minimize sampling effects.
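A possible sketch of this averaging step (formula, data frame and the number of trees are assumptions):

```r
library(randomForest)

# Average the permutation importance over 20 forests to reduce the effect of
# the random resampling inside each forest.
avg_importance <- function(formula, data, n_forests = 20, ntree = 500) {
  imps <- replicate(n_forests, {
    rf <- randomForest(formula, data = data, ntree = ntree,
                       importance = TRUE, na.action = na.omit)
    importance(rf, type = 1)[, 1]
  })
  sort(rowMeans(imps), decreasing = TRUE)
}

# avg_importance(PM10 ~ PM10hier + VVmoy + GTrouen + VVmax + PAmoy + PLsom +
#                  Tmoy + Tmin + DVdom + Tmax + DVmaxvv + HRmoy + HRmax + HRmin,
#                data = train)
```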

We first examine the problem of the choice of the lagged PM10 (denoted by PM10hier), which plays a crucial role especially when the pollution level is high. To illustrate it, we examine, for the JUS station only, the average importance of variables based on 20 random forests in three different situations, collected in Table 3. The first column involves pollutants (NO, NO2 and SO2) and meteorological variables, the second one adds the lagged PM10 (PM10hier, in italics) to this set of variables and the last column is obtained by removing the pollutants.


Fig. 3. BIC values versus number of clusters of CLM for each of the three stations.

Table 3
JUS station. Importance of variables based on 20 random forests, quantified by the mean increase of the mean square error of a tree in the forest when the observed values of this variable are randomly permuted in the OOB samples.

Pollutants and Meteo      Adding PM10hier        Removing NO, NO2, SO2
NO2         26            NO          24         PM10hier    25
NO          25            NO2         24         VVmoy       22
SO2         16            PM10hier    21         GTrouen     22
VVmoy       15            VVmoy       14         VVmax       12
PAmoy       14            SO2         13         PAmoy       12
GTrouen     14            GTrouen     12         PLsom       10
HRmoy       12            PAmoy       12         Tmoy         8
PLsom       12            PLsom       12         Tmin         8
VVmax       10            HRmoy       11         DVdom        6
DVmaxvv      9            VVmax        9         Tmax         6
Tmoy         9            Tmoy         7         DVmaxvv      5
Tmax         8            DVmaxvv      7         HRmoy        5
HRmax        8            HRmax        7         HRmax        5
Tmin         7            Tmax         7         HRmin        2
HRmin        7            Tmin         6
DVdom        6            DVdom        5
                          HRmin        4


Starting from the first model as a reference (first column of Table 3), adding the lagged variable slightly modifies the importance scores but promotes PM10hier to the third position, while removing the pollutants pushes it to the first rank, leaving the importance scores of the meteorological variables almost unchanged. This illustrates why the lagged variable is considered as very useful in the forecasting context (forecasting tomorrow's value on day d): PM10hier is available since it is measured, while NO, NO2 and SO2 are not observable and need to be replaced by forecasts, which are of medium to poor accuracy.

Let us now select the meteorological parameters. In Fig. 2, one can find a typical result coming from one forest for each station. The results are extremely homogeneous: the same top 3 variables are highlighted and the plots look similar.

For each station, the variables are selected according to the average importance of twenty random forests to minimize sampling effects. A first (huge) gap around 20, and a second one around 10, help us in the choice of a small subset of 6 variables. So, we select the same predictors for the three stations: PM10hier, GTrouen and VVmoy, which are clearly the most important, and in addition PAmoy and Tmoy. Let us remark that the daily total rain (variable PLsom) is not retained in the final set of variables. In fact, this variable is of small average importance, especially in winter, and in addition it is hard to predict, so, even if it is useful to explain exceedances, we chose to leave this variable out of the considered models. But we must remark that, while total rain is not easy to predict, some applications show that rain as a binary variable is easy to predict and can be a crucial variable in the model, and some recent studies show that PM10 concentrations decrease even when there is only a little rain.

Fig. 2. Variable importance using random forests.

In the sequel we focus on these five predictors. This selectioncaptures the most frequently retained predictors, according to theliterature.

3.2. Forecasting performances on training data

In addition to the persistence method, we consider a reference model given by a generalized additive model (GAM for short), which naturally contains the linear regression model and generally outperforms it in such a context. Then we consider a clusterwise linear model (CLM for short) with an unprescribed number of clusters. So the first step is to choose this number according to a model selection criterion: we compare the BIC values obtained for each clusterwise linear model and we choose the number of clusters leading to the smallest BIC value. We can see in Fig. 3 that 2 or 3 clusters lead to the smallest values. In particular, for the GUI station, the BIC value is the same for 2 or 3 clusters. Of course, in such situations, the parsimony argument tends to favor the smaller number, but it could be of interest, from a descriptive or explanatory perspective, to also inspect the model with 3 clusters.
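A sketch of this selection step with flexmix could rely on stepFlexmix, which refits the model for each candidate number of clusters (data and formula are assumptions, as before):

```r
library(flexmix)

# Fit clusterwise linear models with 1 to 7 clusters and several random
# restarts, then keep the number of clusters with the smallest BIC.
set.seed(1)
steps <- stepFlexmix(PM10 ~ PM10hier + Tmoy + VVmoy + PAmoy + GTrouen,
                     data = train_std, k = 1:7, nrep = 5)

print(BIC(steps))                        # BIC per number of clusters (cf. Fig. 3)
best <- getModel(steps, which = "BIC")   # model minimizing BIC
summary(best)
```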


Table 4
Forecasting errors on training data (winter days 2004/2005 to 2007/2008).

Station   Statistical indices   GAM     CLM (2 clusters)   CLM (3 clusters)
JUS       R                     0.78    0.87               0.93
          EV                    0.61    0.76               0.85
          IA                    0.86    0.92               0.96
          SS                    0.53    0.71               0.83
          MAPE                  0.23    0.18               0.14
          RMSE                  6.34    4.98               3.85
GUI       R                     0.74    0.86
          EV                    0.55    0.72
          IA                    0.83    0.91
          SS                    0.49    0.68
          MAPE                  0.25    0.20
          RMSE                  8.20    6.43
GCM       R                     0.74    0.86               0.93
          EV                    0.54    0.73               0.85
          IA                    0.83    0.91               0.95
          SS                    0.44    0.67               0.81
          MAPE                  0.29    0.21               0.17
          RMSE                  7.32    5.63               4.23

Table 6
Forecasting errors on test data (winter days 2008–2009).

Station   Statistical indices   GAM     CLM (2 clusters)   CLM (3 clusters)
JUS       R                     0.81    0.87               0.92
          EV                    0.65    0.74               0.81
          IA                    0.89    0.90               0.93
          SS                    0.66    0.73               0.80
          MAPE                  0.22    0.17               0.14
          RMSE                  6.16    5.47               4.72
GUI       R                     0.76    0.85
          EV                    0.57    0.68
          IA                    0.85    0.87
          SS                    0.64    0.73
          MAPE                  0.24    0.18
          RMSE                  8.34    7.21
GCM       R                     0.69    0.79               0.90
          EV                    0.45    0.62               0.80
          IA                    0.82    0.87               0.94
          SS                    0.50    0.66               0.83
          MAPE                  0.30    0.20               0.15
          RMSE                  7.45    6.15               4.43


So, we compare the nonlinear additive model and the two clusterwise linear models, with 2 or 3 clusters.

The forecasting performances evaluated on the winter days of the years 2004–2008 for the three sites are given in Table 4 for indices based on forecasting errors. First, the results for the three models are very good. For example, for the urban background station JUS, the worst model explains 61% of the variance and the best one 85%.

To complement this view, Table 5 contains the forecasting performances for indices based on level exceedances. The threat score (TS) is about 0.7 for the best models, which is remarkable.

Let us start the global comparison with the persistence strategy, which is the worst: bad TS (around 35% instead of the 70% considered as good), poor POD (around 55% instead of 75%) and poor FAR (around 45% instead of 10%).

The comparison between the three other models is clear: the nonlinear additive model is less accurate than the clusterwise linear model with 2 clusters, and the one with 3 clusters is, as expected on the training data, slightly better. It turns out that a global multiple linear model fitted on this data set leads to less accurate forecasts, since it is outperformed by the GAM model. However, this last one is only slightly better than persistence. Note that, even if this is of little interest on the training data, the persistence forecasting model performs even slightly better than GAM on the training data set of the GCM station (R = 0.81, EV = 0.65 and TS = 0.37). This certainly comes from the fact that, in winter, exceedances appear as sequences of consecutive observations exhibiting a locally increasing trend.

Let us remark that, at this stage focusing on simple models, wehave neglected possible weekend effects. So the comparisonbetween clusterwise regression and nonlinear additive regressionis not completely fair. Indeed, instead of clustering, one could studyweekend effects and include corresponding dummy variables.

Table 5
Forecasting performances on training data (winter days 2004–2008).

Station   Statistical indices   Persistence   GAM    CLM (2 clusters)   CLM (3 clusters)
JUS       POD                   0.52          0.56   0.77               0.80
          TS                    0.34          0.45   0.76               0.77
          FAR                   0.49          0.30   0.03               0.05
GUI       POD                   0.57          0.66   0.78
          TS                    0.40          0.53   0.73
          FAR                   0.44          0.28   0.09
GCM       POD                   0.54          0.38   0.62               0.73
          TS                    0.37          0.31   0.55               0.67
          FAR                   0.46          0.38   0.18               0.10

Some statistical studies in other regions indicate lower levels onSaturdays compared to working days, and lower levels on Sundayscompared to Saturdays, so one could study models with twodummy variables for the weekend.

3.3. Forecasting performance on test data

Let us now study the forecasting performances of the three models. We use the models estimated on the training data and analyzed in the previous section, and we evaluate their performances on the winter days of 2008–2009, in order to obtain a fair evaluation. The different performance statistics for the three sites are summarized in Table 6 for indices based on forecasting errors.

The results are remarkably stable from the training period to thetest period. The results remain of good quality: explained variancefrom 45% to 74%, depending on the station and the model.

As in the previous studies (see Jollois et al., 2009) modeling PM10 concentrations using pollutants and meteorological variables together, the stations exhibit different difficulties for forecasting, namely JUS is easy, GCM is harder and GUI is medium.

The comparison with persistence is of interest for the test data. Let us notice that the skill score SS is always greater than 0.5, which means that the considered model outperforms the persistence model. More details can be found in Table 7, giving the performance indices based on level exceedances: the performances of persistence are considerably degraded with respect to those attained on the training data set.

The results obtained for the GAM and CLM models are stable, even if the results in exceedances are based on a small number of days, and can be very satisfactory, for example for the JUS station where the TS is about 0.80 and the FAR is zero.

Table 7
Forecasting performances on test data (winter days 2008–2009).

Station   Statistical indices   Persistence   GAM    CLM (2 clusters)   CLM (3 clusters)
JUS       POD                   0.36          0.71   0.79               0.93
          TS                    0.23          0.53   0.79               0.93
          FAR                   0.62          0.33   0                  0
GUI       POD                   0.59          0.74   0.67
          TS                    0.43          0.60   0.66
          FAR                   0.39          0.24   0.03
GCM       POD                   0.25          0.45   0.45               0.45
          TS                    0.14          0.36   0.38               0.39
          FAR                   0.75          0.38   0.31               0.25


Table 8
JUS Station, winter days 2004–2008. CLM with 2 clusters. Estimated model coefficients with the p-value of the corresponding significance test.

Predictor      Cluster 1 (139 days)          Cluster 2 (415 days)
               Coefficient   p-value         Coefficient   p-value
(Intercept)     0.32         6.97e-05        -0.32         <2.2e-16
PM10hier        0.53         <2.2e-16         0.02          0.60
Tmoy           -0.08         0.172           -0.004         0.92
VVmoy          -0.25         0.00128         -0.10          0.007
PAmoy           0.09         0.267            0.14          8.2e-07
GTrouen         0.26         0.00024          0.27          1.5e-15


4. Analysis of clusterwise linear model on JUS station

In this section, we develop more carefully the clusterwise linear model, focusing on the JUS station. We first analyze, together with the cluster assignment distribution, the models built in the different clusters. We compare them and study their discriminative properties. Then, we discuss the different ways to compute the forecasts by combining the predictions given by each model (or, equivalently, each cluster).

As previously mentioned, the clusterwise linear model is obtained from all winter days of the years 2004–2008, using the R package flexmix, by varying the number of clusters from 1 to 7. Let us mention that we use the method on standardized variables, which then requires rescaling back the estimates of the PM10 concentrations. The BIC criterion is minimum for 2 or 3 clusters (see Fig. 3), so we examine first the model with 2 clusters and close the section with a short discussion about the model with 3 clusters.
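A minimal sketch of this standardization and rescaling step, as we understand it (variable names and data frames are assumptions):

```r
# Standardize response and predictors on the training winters, keep the
# centering/scaling constants, and rescale the model output back to µg/m3.
vars      <- c("PM10", "PM10hier", "Tmoy", "VVmoy", "PAmoy", "GTrouen")
centers   <- sapply(train[vars], mean, na.rm = TRUE)
scales    <- sapply(train[vars], sd,   na.rm = TRUE)
train_std <- as.data.frame(scale(train[vars], center = centers, scale = scales))
test_std  <- as.data.frame(scale(test[vars],  center = centers, scale = scales))

# After prediction on the standardized scale:
# pm10_forecast <- pred_std * scales["PM10"] + centers["PM10"]
```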

4.1. The model

Recall that a clusterwise regression model is a mixture of linearmodels, that is a way to compute the posterior probabilities ofbelonging to clusters together with the local linear models.

Let us first examine these probabilities to describe the two clusters. We will show that cluster 1 corresponds somehow to the class of polluted days. Fig. 4 shows the posterior probabilities of belonging to cluster 1 versus the observed PM10 concentrations (of course, the probabilities of belonging to cluster 2 are the complements to 1). Two horizontal lines allow one to easily distinguish unpolluted days from medium to highly polluted days. Since the assignment is based on the maximum likelihood, we adopt the following convention: cluster 1 (resp. 2) is made of the days such that the posterior probability is greater (resp. less) than 0.5. The cluster sizes on the training data are then: 139 days in cluster 1 and 415 days in cluster 2. We mark by circles the days corresponding to cluster 1 and by crosses those of cluster 2.

Fig. 4. JUS Station, winter days 2004–2008. CLM with 2 clusters. Posterior probabilities of belonging to class 1.

Cluster 2 does not contain any day with a concentration greater than 35 µg m⁻³, while all the polluted days (PM10 concentration ≥ 50 µg m⁻³) are assigned to cluster 1 with a probability equal to 1. So the interpretation is easy: cluster 1 is essentially the class of polluted days and cluster 2 the one of unpolluted days.

Let us now analyze the linear models across the 2 clusters, involving the previously selected predictors (PM10hier, VVmoy, GTrouen, Tmoy and PAmoy). In Table 8, we find the estimated coefficients with additional information: the p-value of the corresponding significance test.

A graphical counterpart of this table (where the coefficient values are explicit) is given in Fig. 5, facilitating the comparison between the two models.

Let us first notice that the two models are similar except for two main differences: the sign of the intercept and the prominent role of PM10hier in cluster 1. More precisely, the intercept is positive in cluster 1 and negative in cluster 2. Influential predictors are not the same in each cluster. The lagged PM10 variable (PM10hier) is the most important predictor in cluster 1, while it is not significant in cluster 2. Indeed, in cluster 1, Tmoy and PAmoy are not significant (p-value > 5%), while in cluster 2 PM10hier and Tmoy are not significant.

Fig. 5. JUS Station, winter days 2004–2008. CLM with 2 clusters. 95% confidence intervals and significance of the coefficients of the linear models. (Comp. stands for Cluster.)


Fig. 6. JUS Station, winter days 2004–2008. CLM with 2 clusters, (observed PM10, predicted PM10); «hard» forecast method on the left panel and «fuzzy» forecast method on the right panel.

Table 9
JUS station. Performances for forecasting winter days 2004–2008. Hard and fuzzy forecasting methods.

Statistical indices   Hard method   Fuzzy method
R                     0.87          0.87
EV                    0.75          0.76
IA                    0.92          0.92
SS                    0.70          0.71
MAPE                  0.18          0.18
RMSE                  4.98          4.98

Table 10
JUS station. Performances for forecasting winter days 2009. Hard and fuzzy forecasting methods.

Statistical indices   Hard method   Fuzzy method
R                     0.87          0.87
EV                    0.75          0.74
IA                    0.90          0.90
SS                    0.72          0.73
MAPE                  0.18          0.17
RMSE                  5.47          5.47


The previous description of the clusters helps us to interpret the discrimination. For example, the intercept can be interpreted as a difference with respect to the global PM10 mean (in fact 0 since PM10 is standardized): more polluted for cluster 1 and less polluted for cluster 2.

Fig. 7. JUS Station, winter days 2004–2008. CLM with 3 clusters. Posterior probabilities of belonging to each cluster.

4.2. How to build a forecast of PM10 concentration from a given clusterwise model?

In addition to the local linear models, the method provides a way to compute the posterior probabilities of belonging to clusters. Assignment to a class is then based on the maximum likelihood. So, two ways can be explored to build a forecast:

- to assign the new observation to a class and use the model of this class to predict (we refer to this method as hard since the assignment is hard). The hard forecast is then obtained using the linear model of the chosen cluster;

- to combine the forecasts delivered by the different models, weighted by the posterior probabilities (we refer to this method as fuzzy since the assignment is a probability distribution).

On the left panel of Fig. 6, we find the forecasts obtained by the hard strategy versus the observed PM10 concentrations. The forecasts obtained using the model of cluster 1 are represented by a circle and those coming from the model of cluster 2 by a cross. The right panel displays the forecasts obtained using the "fuzzy" strategy; all the points are represented using the same symbol since each day is forecasted using both models with different weights.

It should be noted that all PM10 concentrations above 30 µg m⁻³ are estimated using the linear model of cluster 1; see the circles on the left panel of Fig. 6.
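As an illustration of the two strategies on the training data, where the posterior probabilities are available, a hedged sketch building on a flexmix fit such as the one of Section 2.3.3 (object names are assumptions):

```r
# Design matrix of the selected predictors on the (standardized) training days
X <- model.matrix(~ PM10hier + Tmoy + VVmoy + PAmoy + GTrouen, data = train_std)

# Regression coefficients of each component (drop the sigma row returned by
# parameters()); one column of predictions per cluster.
beta <- parameters(clm2)
beta <- beta[rownames(beta) != "sigma", , drop = FALSE]
pred_by_comp <- X %*% beta                       # matrix: days x clusters

# "Hard" forecast: use the linear model of the MAP cluster of each day.
map_cluster   <- clusters(clm2)
pred_hard_std <- pred_by_comp[cbind(seq_len(nrow(X)), map_cluster)]

# "Fuzzy" forecast: weight the cluster models by the posterior probabilities.
post           <- posterior(clm2)                # matrix: days x clusters
pred_fuzzy_std <- rowSums(pred_by_comp * post)

# Both are on the standardized scale; rescale back as in the sketch of Section 4.
```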


Fig. 8. JUS Station, winter days 2004–2008. CLM with 3 clusters. 95% confidence intervals and significance of the coefficients of the linear models. (Comp. stands for Cluster.)


Even if the second approach seems more appropriate for theproblem of forecasting since the assignment to a cluster is notnecessary, comparing the two panels does not reveal significantdifferences between the two scatter plots. Table 9 contains thenumerical indices of performance and confirms that the accuracy issimilar.

Fig. 9. JUS Station, winter days 2004–2008. CLM with 3 clusters, (observed PM10, predicted PM10), «hard» forecasting method.

To end this paragraph, let us briefly comment on the results (see Table 10) obtained by the two different forecasting methods on the test data.

Unsurprisingly the results are very similar to those obtained onthe training data and also very close to each other for the twoforecasting methods.

This comes from the fact that polluted days are assigned tocluster 1 with a probability very close to 1.

4.3. A model with 3 clusters

To close this descriptive part, let us comment on the clusterwise regression model with 3 clusters, instead of only 2 as in the previous section. The cluster sizes are as follows: 223 for cluster 1, 257 for cluster 2 and 74 for cluster 3. Again, by examining the posterior probabilities of belonging to each cluster, we easily get the interpretation of the clusters. From right to left in Fig. 7: the polluted days belong to cluster 3 with a probability close to 1, the days belonging to cluster 2 with a probability greater than 0.5 are unpolluted, and cluster 1 corresponds to an intermediate class.

The linear models for the 3 clusters are given in Fig. 8. This synthetic view has to be compared with Fig. 5. It is clear that the models of the new clusters 2 and 3 are roughly the same as those of the previous clusters 2 and 1, respectively. The intercept of the model of the new cluster 1 is about zero, confirming the intermediate situation.

The forecasts obtained by combining the predictions given by the three models lead to the graph of predicted versus observed PM10 concentrations of Fig. 9. This scatter plot exhibits an exceptional concentration around the ideal diagonal, even if the model slightly underestimates the medium to high values. The explained variance reaches 86%, which is very satisfactory.

5. Conclusion and discussion

For three monitoring stations of Rouen (Haute-Normandie, France) reflecting the diversity of urban situations (background, traffic and industrial stations), we have built statistical models for daily mean PM10 concentrations from the winter days 2004/2005 to 2007/2008. We have shown that it is possible to accurately forecast the daily mean PM10 concentration by fitting a function of meteorological predictors and the average PM10 concentration measured on the previous day. We have compared the forecasting performance, evaluated on winter days 2008/2009, of three different methods: persistence, generalized additive nonlinear models and clusterwise linear regression models, and analyzed the reasons for the especially good behavior of this last one.

To discuss the future directions of this promising work, let ussketch two different questions.

The first one is about the forecasting results in the actual forecasting context. Indeed, the results obtained in the literature are often of excellent quality, especially when local particularities (geographic or meteorological) lead to easy-to-predict situations (see for example Diaz-Robles et al., 2008, about Temuco, Chile). In addition, as in the present work, a lot of papers evaluate the forecasting performance on observed meteorological variables, which should be completed by an effective evaluation of the additional uncertainty brought by the replacement of observed values by predicted ones. The perfect prognosis strategy for forecasting (see Wilks, 1995) consists of fitting the models on observed variables using training data and then using them in an actual forecasting situation by replacing unavailable variables with forecasts, generally coming from the meteorological institute. This allows one to introduce, for example, the temperature of the day to explain PM10, by considering the forecasted temperature in an actual situation.


Even if the perfect prognosis strategy is used almost everywhere, Caselli et al. (2009), for Bari (Italy), use a classical neural network fitted directly on the forecasts of meteorological variables. Nevertheless, focusing on an application in operational mode requires additional experiments to assess the quality of the forecasting procedure. Indeed, for the model considered in our paper, it may be mentioned that when forecasting PM10 for day d, e.g. in the evening of day d-1, one cannot use PM10hier, since the daily mean of day d-1 is not yet complete. Instead one may use a model with a moving average of PM10 from 16:00 on day d-2 to 16:00 on day d-1.
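A possible sketch of such a replacement predictor, assuming hypothetical hourly data with columns datetime (POSIXct) and pm10:

```r
# 24-hour moving average of hourly PM10 from 16:00 on day d-2 to 16:00 on
# day d-1, usable in the evening of day d-1 as a substitute for PM10hier.
pm10_window_mean <- function(hourly, day_d) {
  end   <- as.POSIXct(paste(day_d - 1, "16:00:00"), tz = "UTC")
  start <- end - 24 * 3600
  idx   <- hourly$datetime > start & hourly$datetime <= end
  mean(hourly$pm10[idx], na.rm = TRUE)
}

# Example: predictor for forecasting 2009-01-15, computed in the evening of
# 2009-01-14 (day_d must be a Date object).
# pm10_window_mean(hourly_jus, as.Date("2009-01-15"))
```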

The second direction is to take advantage of deterministic forecasts coming from large-scale numerical modelling. A lot of ideas can be used to organise the cooperation between the two modeling approaches. For example, in Hooyberghs et al. (2005), the considered models are very simple and the main originality is to include the mixing height. A model including only this variable and the PM10 of the last day gives satisfactory results in terms of alarms. The introduction of four classical additional variables leads to slightly improved performances. In Konovalov et al. (2009), a deterministic forecasting model is compared to statistical ones and appears to be less accurate. What is interesting is that the statistical models can be easily improved by introducing the deterministic forecast as an additional predictor. In the same line, Perez and Reyes (2006) introduce the Meteorological Potential of Atmospheric Pollution given by a deterministic model, which leads to linearized and simplified forecasting models. Finally, let us mention that Cobourn (2010), using a nonlinear parametric model, shows that adding back-trajectories is useful in terms of average error as well as in terms of alarms.

Acknowledgements

We want to thank Véronique Delmas and Michel Bobbia from Air Normand for fruitful discussions. In addition, we want to thank Air Normand and Météo-France for providing the PM10 and meteorological data, respectively. Finally, the authors thank the reviewers for constructive comments and recommendations.

References

Aldrin, M., Haff, I.H., 2005. Generalised additive modelling of air pollution, traffic volume and meteorology. Atmospheric Environment 39, 2145–2155.

Barmpadimos, I., Hueglin, C., Keller, J., Henne, S., Prévôt, A.S.H., 2011. Influence of meteorology on PM10 trends and variability in Switzerland from 1991 to 2008. Atmospheric Chemistry and Physics 11, 1813–1835.

Bobbia, M., Jollois, F.X., Poggi, J.M., Portier, B., 2011. Quantifying local and background contributions to PM10 concentrations in Haute-Normandie using random forests. Environmetrics 22, 758–768.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont.

Breiman, L., Friedman, J.H., 1985. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580–619.

Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

Caselli, V., Trizio, L., de Gennaro, G., Ielpo, P., 2009. A simple feed forward neural network for the PM10 forecasting: comparison with a radial basis function network and a multivariate linear regression model. Water, Air, & Soil Pollution 201, 365–377.

Chaloulakou, A., Grivas, G., Spyrellis, N., 2003a. Neural network and multiple regression models for PM10 prediction in Athens: a comparative assessment. Journal of the Air & Waste Management Association 53, 1183–1190.

Chaloulakou, A., Saisana, M., Spyrellis, N., 2003b. Comparative assessment of neural networks and regression models for forecasting summertime ozone in Athens. The Science of the Total Environment 313, 1–13.

Cobourn, W.G., 2010. An enhanced PM2.5 air quality forecast model based on nonlinear regression and back-trajectory concentrations. Atmospheric Environment 44, 3015–3023.

Corani, G., 2005. Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecological Modelling 185, 513–529.

Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.

Diaz-Robles, L.A., Ortega, J.C., Fu, J.S., Reed, G.D., Chow, J.C., Watson, J.G., Moncada-Herrera, J.A., 2008. A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile. Atmospheric Environment 42, 8331–8340.

Dong, M., Yang, D., Kuang, Y., He, D., Erdal, S., Kenski, D., 2009. PM2.5 concentration prediction using hidden semi-Markov model-based time series data mining. Expert Systems with Applications 36, 9046–9055.

Grivas, G., Chaloulakou, A., 2006. Artificial neural network models for prediction of PM10 hourly concentrations, in the greater area of Athens, Greece. Atmospheric Environment 40, 1216–1229.

Grün, B., Leisch, F., 2007. Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis 51, 5247–5252.

Grün, B., Leisch, F., 2008. FlexMix version 2: finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software 28, 1–35.

Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Chapman & Hall.

Hooyberghs, J., Mensink, C., Dumont, G., Fierens, F., Brasseur, O., 2005. A neural network forecast for daily average PM10 concentrations in Belgium. Atmospheric Environment 39, 3279–3289.

Jollois, F.X., Poggi, J.M., Portier, B., 2009. Three nonlinear statistical methods to analyze PM10 pollution in the Rouen area. Case Studies in Business, Industry and Government Statistics 3, 1–17.

Konovalov, I.B., Beekmann, M., Meleux, F., Dutot, A., Foret, G., 2009. Combining deterministic and statistical approaches for PM10 forecasting in Europe. Atmospheric Environment 43, 6425–6434.

Leisch, F., 2004. FlexMix: a general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11, 1–18.

Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22.

McLachlan, G., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability and Statistics, Wiley.

Paschalidou, A.K., Karakitsios, S., Kleanthous, S., Kassomenos, P.A., 2011. Forecasting hourly PM10 concentration in Cyprus through artificial neural networks and multiple regression models: implications to local environmental management. Environmental Science and Pollution Research 18, 316–327.

Perez, P., Reyes, J., 2006. An integrated neural network model for PM10 forecasting. Atmospheric Environment 40, 2845–2851.

Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464.

Sfetsos, A., Vlachogiannis, D., 2010. Time series forecasting of hourly PM10 using localized linear models. Journal of Software Engineering and Applications 3, 374–383.

Slini, T., Kaprara, A., Karatzas, K., Moussiopoulos, N., 2006. PM10 forecasting for Thessaloniki, Greece. Environmental Modelling & Software 21, 559–565.

Stadlober, E., Hörmann, S., Pfeiler, B., 2008. Quality and performance of a PM10 daily forecasting model. Atmospheric Environment 42, 1098–1109.

Wilks, D.R., 1995. Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press.

Wood, S.N., 2006. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.

Zolghadri, A., Cazaurang, F., 2006. Adaptive nonlinear state-space modelling for the prediction of daily mean PM10 concentrations. Environmental Modelling & Software 21, 885–894.