
Land use classification based on present population daily profiles from a big data source

Fernando Reis ([email protected])1, Gerdy Seynaeve2, Albrecht Wirthmann1, Freddy de Meersman2 and Marc Debusschere3

Keywords: mobile phone data, big data, land use, official statistics.

1. INTRODUCTION

1.1. Big data in official statistics

The use of high-detail exhaust data left by individuals in their daily use of many technological platforms (i.e. big data) is gaining momentum in official statistics [1]. Of the many potential big data sources, geolocation data captured by mobile phone networks (mobile phone data from now on) holds great potential for producing relevant statistics on the population and its mobility [2].

This data source offers very high temporal and spatial resolution. The high spatial resolution in particular opens the possibility of new or enhanced regional statistics, which are frequently limited when they rely on general population sample surveys.

1.2. Problem Statement

Some studies have attempted to use mobile phone data to predict existing land use classifications with supervised learning algorithms [3]. However, these classifications are hardwired to existing regional statistics. They are highly multidimensional because there is a wide spectrum of regional statistics, but they do not use characteristics of the regions which, although relevant, have no statistics available. At the same time, mobile phone data is inevitably limited in reproducing existing classifications exactly because it cannot cover all the dimensions used in building them. Therefore, the first research question is what spatial patterns mobile phone data reveals when it is analysed without reference to an existing classification (i.e. with unsupervised learning).

One fundamental difference between classifying land use with this data source and the usual practice in land use surveys is that in the latter the classification is applied to a single point in space. With mobile phone data, however, the classification is applied to a spatial area which might actually include multiple uses. Therefore, the second research question is whether we can find methods which allow us to characterise the multiple land uses in single spatial areas.

1.3. Contribution of this paper

This paper contributes to research on the use of mobile phone data for land use classification by proposing a two-step method. In the first step, spatial units are classified with a clustering algorithm applied to the relative amount of present population throughout the day, which is used to identify basic profiles. In the second step, the influence of each basic present-population profile in each spatial unit is estimated using a structural equations model. This paper also contributes to the identification of characteristics of regions which could potentially be used to improve existing classifications of land use.

2. DATA

The study used aggregated geolocation data captured by the Belgian mobile network operator Proximus for the devices connected to the network during three weekdays in March 2016. Whenever there is a communication between a device and the mobile network, the operator becomes aware of the approximate location of the device via the identification of the antenna used for the communication. Devices connect to antennas near their location, normally (although not always) the closest one. This allows us to allocate individuals communicating via a certain antenna to a spatial unit composed of the points to which that antenna is the closest one.

1 Eurostat, European Commission; 2 Proximus, Belgium; 3 Statistics Belgium

2.1. Time

The focus in most studies so far (European Commission, 2014) has been on call detail records (CDRs), which are used for billing purposes. These reveal the time and location of a mobile phone whenever it is used. Nowadays, however, network probing systems capture all signalling events, including non-billable transactions, and therefore offer much finer time granularity. In the Proximus network, the number of useful signalling events is about 10 times higher than the number of CDRs. For each device on the network a position is recorded at least once every 3 hours. With an active data connection, this interval decreases to approximately once every hour. In practice transactions are recorded even more frequently, especially for smartphones, which often connect to the network without the owner being aware of it.

2.2. Space

For each transaction on a mobile network, the mobile phone location is known down to the level of a cell identity. A mobile phone network is a cellular system which has grown ever more complex over time. Antenna sites nowadays typically contain multiple technologies (2G, 3G, 4G) and multiple cells. For the purpose of this study, a construct called TACS (Technology-Agnostic Cell Sector) was developed: the area served by all cells on a particular site with the same azimuth (direction of the antenna main lobe), irrespective of the technology used, consisting of all locations closer to the cell site than to the surrounding ones (each of which is also a TACS). The resulting polygons form a Voronoi diagram, making it possible to build a simplified model of the mobile network. This representation of TACS as Voronoi polygons is an approximation of cell coverage which disregards the complexity of the different technology layers and thus allows for fast calculations.
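As an illustration of the Voronoi construction, a location belongs to the polygon of whichever seed point is nearest to it. The following is a minimal sketch under stated assumptions: each TACS is reduced to a single seed point in planar coordinates, and the coordinates and the function name `assign_to_nearest_site` are hypothetical. How the azimuth of a sector enters the placement of the seed point is not specified here and is left out of the sketch.

```python
import numpy as np

def assign_to_nearest_site(points, sites):
    """Return, for each point, the index of the closest seed point.

    Membership of a Voronoi polygon is equivalent to nearest-seed assignment,
    so no explicit polygon geometry is needed for this lookup.
    """
    points = np.asarray(points, dtype=float)
    sites = np.asarray(sites, dtype=float)
    # Squared Euclidean distances, shape (n_points, n_sites).
    diff = points[:, None, :] - sites[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)
    return np.argmin(d2, axis=1)

# Hypothetical seed points and query locations (arbitrary planar units).
sites = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.array([[1.0, 1.0], [9.0, 2.0], [2.0, 8.0]])
cells = assign_to_nearest_site(points, sites)
```

Working with nearest-seed lookups rather than measured coverage areas is what keeps the model simple, at the cost of ignoring real propagation conditions.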

2.3. Counts

The raw data held by the mobile network operator (MNO) is composed of records of signalling events between the device and the network. The raw data was filtered in order to eliminate devices identified as not being associated with a person or as being fixed. For the purposes of this study, the location of a device was assumed to remain unchanged until a new signalling event revealed a new position. Finally, we computed the number of subscribers in each spatial unit (TACS) at 15-minute intervals, based on the antenna used at the latest signalling event.
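The counting rule described above (a device stays at its last known TACS until a new event arrives) can be sketched as follows. This is only an illustration for a single day: the input layout, the identifiers `dev_a`, `T1` and so on, the choice to sample at the start of each 15-minute slot, and the choice not to count a device before its first event of the day are all assumptions of the sketch.

```python
from bisect import bisect_right
from collections import Counter

SLOT_MINUTES = 15
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 96 slots

def counts_per_slot(events):
    """Count devices per TACS at the start of each 15-minute slot.

    events: dict mapping device id -> list of (minute_of_day, tacs_id),
            sorted by time. A device is counted at the TACS of its most
            recent event at or before the slot start.
    """
    counts = [Counter() for _ in range(SLOTS_PER_DAY)]
    for device_events in events.values():
        times = [t for t, _ in device_events]
        for slot in range(SLOTS_PER_DAY):
            slot_time = slot * SLOT_MINUTES
            # Number of events at or before slot_time.
            i = bisect_right(times, slot_time)
            if i > 0:
                counts[slot][device_events[i - 1][1]] += 1
    return counts

# Hypothetical events for two devices.
events = {
    "dev_a": [(0, "T1"), (500, "T2")],
    "dev_b": [(20, "T1")],
}
counts = counts_per_slot(events)
```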

In this study the number of subscribers was taken as a proxy for the present population in each spatial unit (TACS). We made no attempt to gross up the number of subscribers to the overall population of Belgium.

3. METHOD

3.1. Cluster analysis

The number of subscribers in each spatial unit in each 15-minute period was averaged, so that we had 96 data points for each of the around 11,000 spatial units. The averages were then standardised throughout the day so that, in each spatial unit, the 96 points had mean zero and variance one. In this way the data retained only the "shape" of the daily profile, irrespective of the absolute number of subscribers present.
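The standardisation step can be sketched as a row-wise z-score. The function name `standardise_profiles`, the use of the population variance (`ddof=0`) and the handling of perfectly flat profiles are assumptions of this sketch. The example shows that two units with the same daily shape but very different subscriber levels end up with identical standardised profiles.

```python
import numpy as np

def standardise_profiles(avg_counts):
    """Standardise each row (one spatial unit, 96 slots) to mean 0, variance 1."""
    x = np.asarray(avg_counts, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    # A flat profile has zero spread; map it to zeros instead of dividing by 0.
    safe_std = np.where(std == 0, 1.0, std)
    return (x - mean) / safe_std

# Two hypothetical units: same shape, different level and amplitude.
t = np.arange(96)
shape = np.sin(2 * np.pi * t / 96)
profiles = standardise_profiles(
    np.vstack([100 + 10 * shape, 5000 + 900 * shape])
)
```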

Finally, the k-means algorithm was used to identify clusters of spatial units. The similarity between spatial units was computed as the Euclidean distance, using the 96 time slots as dimensions. The number of clusters was chosen by analysing the within-groups sum of squares (SSW) of the k-means results for pre-specified numbers of clusters from 2 to 15.
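The procedure of running k-means for each candidate number of clusters and recording the SSW can be sketched as below. This is a plain implementation of Lloyd's iterations written for illustration, not the software used in the study; the synthetic data, the random initialisation from data points and the single run per value of k are assumptions. In practice several random restarts per k are common, since a single run may stop in a local minimum.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Basic k-means with Euclidean distance. Returns (labels, centres, ssw)."""
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct data points.
    centres = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centres = centres.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster empties
                new_centres[j] = members.mean(axis=0)
        converged = np.allclose(new_centres, centres)
        centres = new_centres
        if converged:
            break
    # Final assignment and within-groups sum of squares.
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    ssw = d2[np.arange(len(x)), labels].sum()
    return labels, centres, ssw

# Synthetic stand-in for standardised profiles: 60 units, 96 time slots.
rng = np.random.default_rng(42)
centres_true = rng.normal(size=(3, 96)) * 5
x = np.vstack([c + rng.normal(size=(20, 96)) for c in centres_true])

# SSW for each pre-specified number of clusters from 2 to 15.
ssw_by_k = {k: kmeans(x, k)[2] for k in range(2, 16)}
```

Plotting `ssw_by_k` against k gives the curve that is inspected when choosing the number of clusters.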

3.2. Structural equations model

For each cluster identified in the previous step, a characteristic profile of relative present population was computed by averaging each of the 15-minute time slots over the spatial units in the cluster (see red curve in figure 1). The contribution of each characteristic profile to the relative present population in each spatial unit was then estimated with a constrained regression model.

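$$
A = \alpha R + \beta C + \gamma W + \varepsilon, \qquad \alpha, \beta, \gamma \in [0, 1], \qquad \alpha + \beta + \gamma = 1, \qquad \varepsilon \sim N(0, 1)
$$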

where A is the vector of the observed relative present population in a cell and R, C and W are vectors with the corresponding values in the characteristic profiles. In order to improve the interpretability of the coefficients of the regression model, the characteristic profiles were scaled to values between 0 and 1, while the relative present population in each spatial unit was scaled as a percentage of its maximum value (see chart in figure 2).
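The constraints on the coefficients (each between 0 and 1, summing to 1) mean that they lie on the probability simplex. The estimation procedure is not detailed here, so the following is only one possible way to fit such a model: least squares solved by projected gradient descent, with a sort-based projection onto the simplex. The function names, the synthetic profiles and the iteration count are assumptions of the sketch, and the inputs are taken to be already scaled.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_rcw(a, profiles, n_iter=2000):
    """Least-squares fit of a ~ profiles @ w with w on the simplex.

    a: observed profile, shape (96,)
    profiles: columns R, C, W, shape (96, 3)
    """
    x = np.asarray(profiles, dtype=float)
    a = np.asarray(a, dtype=float)
    # Step size 1/L, where L is the largest eigenvalue of x.T @ x.
    step = 1.0 / np.linalg.norm(x, 2) ** 2
    w = np.full(x.shape[1], 1.0 / x.shape[1])
    for _ in range(n_iter):
        grad = x.T @ (x @ w - a)
        w = project_to_simplex(w - step * grad)
    return w

# Synthetic characteristic profiles with values between 0 and 1.
t = np.arange(96)
r = 0.5 * (1 + np.cos(2 * np.pi * t / 96))   # high at night (residential)
w_prof = 1.0 - r                              # high at midday (work)
c = np.exp(-((t - 32) ** 2) / 32.0) + np.exp(-((t - 70) ** 2) / 32.0)  # two peaks
profiles = np.column_stack([r, c, w_prof])

# A noise-free unit built as a known mixture of the three profiles.
a = profiles @ np.array([0.2, 0.5, 0.3])
coef = fit_rcw(a, profiles)
```

On this noise-free example the fitted coefficients return the mixture used to build `a`, which is a quick check that the constraints and the fit behave as intended.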

4. RESULTS

The analysis of the SSW did not show an obvious point at which it stabilised. After an initial drop, it decreased continually until the number of clusters became very large. However, with 4 clusters we already obtained more than 80% of the reduction in the SSW that we would obtain with 15 clusters. We analysed the characteristic profiles found by k-means for several pre-specified numbers of clusters and finally chose 3 clusters, which accounted for a large proportion of the reduction in the SSW.
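The share of the SSW reduction attained with a given number of clusters can be computed as in the sketch below. The baseline against which the reduction is measured is not stated, so taking the 2-cluster solution as the baseline is an assumption, as is the function name. The SSW values are made up purely to show the arithmetic and are not results from the study.

```python
def share_of_reduction(ssw_by_k, k, k_base=2, k_max=15):
    """Fraction of the SSW drop from k_base to k_max already reached at k."""
    total_drop = ssw_by_k[k_base] - ssw_by_k[k_max]
    return (ssw_by_k[k_base] - ssw_by_k[k]) / total_drop

# Made-up SSW values for illustration only.
toy_ssw = {2: 1000.0, 3: 700.0, 4: 580.0, 15: 500.0}
share_3 = share_of_reduction(toy_ssw, 3)
share_4 = share_of_reduction(toy_ssw, 4)
```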

There are basically 3 different daily profiles on a weekday (figure 1). The first one starts with low values at the beginning of the day, increases during the day and then returns to low values at the end of the day. This corresponds to the idea of spatial units where people mainly work. The second profile mirrors the first one, with high values at the beginning and end of the day and low values in the middle. This corresponds to the idea of a spatial unit where people mainly live. The last profile is characterised by two peaks, one early in the morning and another in the afternoon, and corresponds to the idea of a spatial unit where people are present while they commute.

Mapping the classification of the spatial units produces a coherent picture, with most of the territory occupied by residential areas, a few working areas, and commuting areas bridging the two.

Figure 1: TACS identified as 1: ‘work’, 2: ‘commuting’ or 3: ‘residential’

The 3 characteristic profiles identified (residential, commuting and working) were then taken as predictors in the structural equations model (R-C-W model). In this second step we could take the coefficients as three new indicators of the contribution of each characteristic profile to each spatial unit.

Spatial units classified in a particular cluster normally had the corresponding characteristic profile (the classifying profile) as the main contributor. This was not always the case, most probably because the scaling used for the clustering was not the same as the one used in the model. However, in these cases the contribution of the classifying profile was very close to the main contributor.

The estimated contributions seem to be useful: even in spatial units not classified in a particular cluster, the contribution of that cluster's profile was still significant when the unit was close to units classified in that cluster. See figure 2 for the case of the contributions of the working profile.

5. CONCLUSIONS AND NEXT STEPS

Classifying spatial units solely on the basis of mobile phone data results in clusters which are meaningful from the point of view of land use. We do not obtain all the richness, in terms of different categories, of existing classifications. However, the characteristic profiles of daily present population (residential, commuting and working) are relevant. We also identified a method to estimate the contribution of each characteristic profile to the observed pattern of daily present population. Future research may assess the extent to which each contribution can be taken as a proxy for the proportion of each type of present population (residential, commuting and working) in a spatial unit over the day, or as a proxy for the proportion of the area of the spatial unit dedicated to that type of land use.

REFERENCES

[1] European Commission (2015), ESS Big Data Action Plan and Roadmap 1.0.

[2] European Commission (2014), Feasibility Study on the Use of Mobile Positioning Data for Tourism Statistics.

[3] Ríos, Muñoz (2016), Land Use detection with cell phone data using topic models.


Figure 2: Contributions of characteristic profiles
