technical description for data merging


Technical report on GIS Analysis, Mapping and Linking of Contextual Data to the

European Social Survey

HAPPINESS project of the Cross-National and Multi-level Analysis of Human Values,

Institutions and Behaviour (HumVIB) programme

Finbarr Brereton, University College Dublin, Ireland

Mirko Moro, University of Stirling, The United Kingdom

Tine Ningal, University College Dublin, Ireland

Susana Ferreira, University of Georgia, USA

Abstract

This technical paper documents the work undertaken to link the European Social Survey, a

biennial multi-country survey that measures the attitudes, beliefs and values of individuals

living in more than 30 nations, to multi-level variables capturing the physical environment

and context of the respondents (air pollution, climate, land use, local GDP per capita,

population density, unemployment rate, etc.). The process of linking the data involved

creating a series of spatial identifiers based on the Nomenclature of Territorial Units for

Statistics (NUTS) geocodes. In addition, while the macroeconomic contextual variables are

typically available at the regional level, pollution and climate data are recorded at monitoring

stations, and Geographic Information Systems (GIS) spatial interpolation techniques need to

be applied prior to linking these to a particular respondent. GIS is increasingly used to

process, analyse and display georeferenced data effectively due to its mapping capabilities.

The resulting dataset provides a unique tool for quantitative investigation of interrelationships

at the individual, regional and national levels in Europe.

Financial support from the European Science Foundation (Cross-National and Multi-level Analysis of

Human Values, Institutions and Behaviour (HumVIB)) is gratefully acknowledged. We thank Oana Borcan and

Victor Peredo Alvarez for outstanding research assistance. Corresponding author: [email protected]



TABLE OF CONTENTS

1. Introduction

1.1 European Social Survey

1.2 Geographical Information Systems

1.3 Deliverables from the Project

2. Creating a regional identifier in the ESS for data linking

3. GIS analysis and mapping of air quality

3.1 Data

3.2 Methods

3.2.1 Importing spreadsheet data into GIS

3.2.2 Spatial Interpolation in GIS

3.2.3 Integration of air quality with NUTS data

4. GIS analysis and mapping of climate and land use data

4.1 Climate data

4.2 Land use data

5. References

6. Appendix


1. Introduction

1.1 European Social Survey (ESS)

The ESS is an academically-driven, international survey examining changing social attitudes,

beliefs and values across Europe. It has become the first ever social science project to be

granted the prestigious Descartes prize, awarded by the European Commission for

excellence in scientific research. In our project we focus on the first three waves of the

survey. The first wave was fielded in 2002/2003, the third one in 2006/2007. ESS data are

obtained using random (probability) samples, where the sampling strategies, which may vary

by country, are designed to ensure representativeness and comparability across European

countries. The three-wave cumulative dataset includes around 120,000 observations from 23

European countries.1

One of the variables collected in the survey is the region within a country where the

respondent lives. This information allows us to match the survey data spatially to a map of

Europe using Geographic Information Systems (GIS) and hence it is possible to combine

individual data with a vector of spatial amenities.2 These two datasets are combined at the

NUTS level.3 To assess the impact of changes in spatial amenities on individual variables (of

particular interest for our project, self-reported subjective well-being) in a more precise

manner, ideally, one would want to be able to match contextual factors to a particular

individual rather than a particular area. At present, however, the data do not allow this and

anonymity may preclude this in any case.

1.2 Geographic Information Systems and the Social Sciences

Adoption of Geographic Information System (GIS) and spatial modelling tools in the social

sciences is in its infancy, which is primarily due to a lack of recognition by social scientists of

the capability and capacity of such tools to support and develop new research areas and to aid

and enhance research applications. GIS offers great potential to generate innovative

approaches and advance knowledge in disciplines such as political science, economics,

archaeology, environmental studies, history, demography, anthropology, and applied social

sciences.

1 They are Austria, Belgium, Czech Republic, Switzerland, Germany, Denmark, Estonia, Spain, Finland, France,

Greece, Hungary, Ireland, Italy, Luxembourg, Netherlands, Norway, Poland, Portugal, Sweden, Slovenia,

Slovakia and the UK.

2 GIS works well when applied to static data, and less well when applied to time series analysis (Goodchild and

Haining, 2004) and hence is well-suited to the cross-sectional data employed in this project.

3 European Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing

administrative divisions of countries for statistical purposes developed by the European Union.

GIS is widely-used as a planning and analysis computing tool that allows the visual

representation of spatially referenced data and provides a powerful set of tools for spatial

analysis and modelling. It has advanced the technical ability to handle spatial data as

countable numbers of points, lines and polygons4 in two-dimensional space (Goodchild and

Haining, 2004) and to link various datasets using spatial identifiers (Bond and Devine, 1991).

It represents a solid base for spatial data analysis and provides a range of techniques for

analysis and visualisation of spatial data. It provides effective decision support through its

database management capabilities, graphical user interfaces and cartographic visualisation

(Wu et al., 2001). It provides tools for integrating, querying and analysing a wide variety of

data types, such as scientific and cultural data, satellite imagery and aerial photography, as

well as data collected by individuals, into projects, with geographic locations providing the

integral link between all the data.

With the rapid growth in the availability of geographical data in digital formats and parallel

innovations in technology to allow for the manipulation, analyses and visualisation of these

data, new types of information are being created. This underpins developments in

Participatory GIScience which provides a better understanding of the complexity of decision

situations involving human interactions with their physical environment. A recent article in

Science (Butz and Torrey, 2006) highlights the importance of the new GIScience tools in

providing the ability to analyse social behaviour across time and geographic scales. It further

points out that their adoption by social scientists is still in its infancy.

GIS methods can contribute to multi-level analysis; they can even generate new levels of

analysis and allow access to levels previously only identifiable in principle. They can also

help disseminate multi-level research findings. With a diverse range of disciplines involved

in multi-disciplinary research (sociologists, psychologists, economists, political scientists,

etc.), creating policy documents that are accessible to the research community and to the

general public can become a challenge. To this end, GIS applications allow cartographic

representations of data and results that aid in disseminating information to a wide audience:

a picture is worth a thousand words.

4 A polygon is the GIS term for any multi-sided figure.

There now exist unprecedented individual-level data resources in Europe, typified by the

European Social Survey (ESS). There also exist comprehensive system-level and contextual

data. To date, however, there have been few analyses employing individual-level data linked to

contextual data, and they typically cover a limited local area or a limited set of indicators (see,

e.g., Brereton et al. 2008; MacKerron and Mourato, 2008; Luechinger, 2009). GIS facilitates

linking contextual data (institutional, economic, environmental, etc.) to individual-level data.

While many social scientists are currently engaged in cross-national analysis, using GIS to

link data at the regional level would allow investigators to go further and engage in analysis

at the micro, meso and macro levels, using data that are comparable across a larger number of

units of analysis (regions) and this would increase the validity of multi-level analysis.

The growth in the availability of geographical data has not been accompanied by a coherent,

coordinated data collection effort at the European level. For example, the National Institute

for Regional and Spatial Analysis (NIRSA) in Ireland, the URBIS Digital Spatial Database in

University College Dublin, EDINA in the UK and Eurostat all house digital spatial data,

much of this overlapping. A goal of our project is to address the fragmentation that currently

plagues digital data archives in Europe by creating a pan-European research dataset with

environmental and other contextual data spatially referenced and linked to the ESS, and to

share the dataset and methodologies used to create it.

1.3 Deliverables from the Project

A key deliverable of this project is a pan-European dataset of environmental and other spatial

variables geo-referenced at a regional level and linked to the individuals in the ESS. The

contextual variables can be classified into four groups: air pollution concentrations, climate,

land use and macro-socioeconomic factors (Table 1).


Table 1. List of variables in spatial dataset

Category             Indicators                                       Main Source
Air Pollution        PM10 mean annual concentration (µg/m3)           EEA AirBase
                     CO mean annual concentration (mg/m3)             (http://acm.eionet.europa.eu/databases/airbase/)
                     SO2 mean annual concentration (µg/m3)
                     NO mean annual concentration (µg/m3)
                     NO2 mean annual concentration (µg/m3)
                     Benzene mean annual concentration (µg/m3)
Climate              Annual mean temperature (°C)                     ECA
                     Mean of daily max. temperature in July (°C)      (http://eca.knmi.nl/)
                     Mean of daily min. temperature in January (°C)
                     Annual mean precipitation (mm)
Land use             Residential                                      CORINE
                     Commercial and Industrial                        (http://www.eea.europa.eu/publications/COR0-landcover)
                     Mines and Dumps
                     Green Urban Spaces
                     Agricultural Land
                     Forestry
                     Natural Areas
                     Water bodies
Macro-socioeconomic  GDP per capita                                   Eurostat
                     Population change (%)                            (http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/)
                     Population density
                     Deaths from respiratory diseases
                     Unemployment rate (by age group/gender)

In the following sections, we provide a more detailed description of the construction of the

dataset, in particular for the pollution, climate and land use variables. Macro-socioeconomic

variables were already available at a NUTS 2 or NUTS 3 level (depending on the variable)

from the Eurostat database.

The NUTS is a geocode standard for referencing administrative divisions of countries for

statistical purposes developed by the European Union. Regions at NUTS level 1 are large

sub-national units (such as Scotland or Bavaria) each of which usually comprises a number of

NUTS 2 regions (examples of this level include the Autonomous Communities in Spain or

the "regions" in France). In turn, these are made up of NUTS 3 regions (such as the "Kreis" in


Germany). Although broadly very stable over time in a number of countries, the NUTS

classification has been amended several times, most recently in 1995, 1999 and 2003.5

The Cumulative ESS database does not use a consistent definition of region: in some cases the

regions are at NUTS level 1, in other cases at NUTS level 2 or NUTS level 3 (Table 2).

Table 2. NUTS levels used in ESS for each participant country

NUTS Level  Countries
1           Belgium (BE), Germany (DE), France* (FR), Luxembourg (LU), United Kingdom (UK)
2           Austria (AT), Switzerland (CH), Spain (ES), Finland (FI), France* (FR), Greece (GR),
            Hungary (HU), Italy (IT), Ireland* (IE), Norway (NO), Poland (PL), Portugal (PT),
            Sweden (SE)
3           Czech Republic (CZ), Denmark (DK), Estonia (EE), Ireland* (IE), The Netherlands (NL),
            Slovenia (SI), Slovakia (SK), Ukraine (UA)

* ESS used mixed boundaries of NUTS levels 1 & 2 for France and levels 2 & 3 for Ireland.

The spatial variables in Table 1 were linked to each respondent at the corresponding NUTS

level in Table 2. In addition, we preserved them at the finest level of spatial disaggregation

at which they were available (NUTS 3 for pollution, climate and land use data, NUTS 2-3 for

macro-socioeconomic indicators). All the datasets will be publicly available on the project

website (http://www.ucd.ie/happy/resear.html) from January 2012.

2. Creating a regional identifier in the ESS for data linking

The cumulative ESS dataset can be freely downloaded from the ESS website

(www.europeansocialsurvey.org). We appended three additional spatial identifiers in order to

facilitate the matching of the ESS data file with spatially referenced data and to carry out

spatial analysis of data: "cntry," "region" and "code_id." "cntry" is the NUTS code for each

country; "region" is the name of the region as reported in the ESS Cumulative dataset. The ESS

contains multiple variables to identify the region where the respondent lives. We merge these

into one new variable. The key new regional variable we create is "code_id": a unique

identifier equal to the NUTS code of the respondent's region, at the NUTS level used for that country in the ESS (see Table 2).

5 A detailed list of NUTS by each European country can be found at

http://ec.europa.eu/eurostat/ramon/nuts/codelist_en.cfm?list=nuts. Maps of each country and regions with

subdivision in NUTS levels can be found at http://circa.europa.eu/irc/dsis/regportraits/info/data/en/.


In the NUTS system, each country is divided following a three-level hierarchy of regions

established on the basis of existing administrative regions or groupings of these. The NUTS 1

code is composed of three alphanumeric characters. The first two refer to the country (and

they are the same as the cntry variable), while the third one is usually a number. The NUTS 2

code is composed of four alphanumeric characters, while the NUTS 3 code consists of five

alphanumeric characters. Box 1 illustrates the NUTS code hierarchy for the Spanish regions.

Therefore if code_id for a particular observation is, for example, GR11 it means that the

respondent lives in a NUTS2 level region of Greece, while UA044 stands for a NUTS 3 level

region of Ukraine, etc.

Box 1: NUTS code hierarchy for Spanish regions

ES (Spain)
  ES1 (represents NUTS 1 level identifying the North-West region)
    ES11 (represents NUTS 2 level identifying Galicia)
      ES111 (represents NUTS 3 level identifying La Coruña)
      ES112 (region at NUTS 3 level, Lugo)
    ES12 (Asturias)
      ES120 (Asturias)
    ES13 (Cantabria)
      ES130 (Cantabria)
  ES2 (NUTS 1 identifying North-East region)
    ES21 (Basque Country)
      ES211 (Álava/Araba)
      ES212 (Guipúzcoa/Gipuzkoa)
  ...
  ...
  ES7 (NUTS 1 Canarias)
    ES70 (Canary Islands)
      ES701 (Las Palmas)
      ES702 (Tenerife)
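
The length rule described above (two country characters plus one character per NUTS level) can be sketched in a few lines of Python. This is only an illustration: the function names nuts_level and nuts_parents are hypothetical and are not part of the ESS files or of the project's own scripts.

```python
def nuts_level(code_id):
    """Return the NUTS level (1, 2 or 3) implied by the length of a NUTS code."""
    code = code_id.strip()
    # Two country letters followed by one character per NUTS level.
    if len(code) not in (3, 4, 5) or not code[:2].isalpha():
        raise ValueError(f"not a NUTS 1-3 code: {code_id!r}")
    return len(code) - 2


def nuts_parents(code_id):
    """Return the country code followed by every enclosing NUTS region."""
    code = code_id.strip()
    nuts_level(code)  # validates the code before slicing it
    return [code[:n] for n in range(2, len(code))]


print(nuts_level("GR11"))      # NUTS 2 region of Greece
print(nuts_level("UA044"))     # NUTS 3 region of Ukraine
print(nuts_parents("ES111"))   # country, NUTS 1 and NUTS 2 parents of La Coruña
```

The sketch only handles plain NUTS codes; the aggregated ad-hoc codes used for a few countries would need to be split into their constituent codes first.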

In a few cases (Switzerland, France, Italy, Ireland), a new code was created to accommodate

the fact that their ESS regions were an aggregation of NUTS regions. In these cases, the

following rules were applied in creating the ad-hoc code:

- CH02-04 is an aggregation of 3 Swiss regions at NUTS 2-level (i.e., CH02, CH03,

and CH04).

- FI18,20 represents 2 Finnish regions at NUTS 2-level (i.e., FI18 and FI20), etc. Ditto

for France.

- IE022-025 represents 4 Irish regions at the third NUTS level (i.e., IE022, IE023,

IE024, IE025).
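
As an illustration of the naming rules above, the following Python sketch expands an ad-hoc code into the NUTS codes it aggregates. The function name expand_code_id is hypothetical, and the sketch assumes that the part after the country code is purely numeric whenever a range (written with "-") is used, as in the three examples listed.

```python
import re


def expand_code_id(code_id):
    """Expand an ad-hoc aggregated code into its constituent NUTS codes.

    "CH02-04" is read as a range and "FI18,20" as a list; a plain code
    such as "GR11" is returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"([A-Z]{2})(.+)", code_id.strip())
    if match is None:
        raise ValueError(f"unrecognised code: {code_id!r}")
    country, rest = match.groups()
    codes = []
    for part in rest.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # keep leading zeros, e.g. "022"
            for number in range(int(start), int(end) + 1):
                codes.append(f"{country}{number:0{width}d}")
        else:
            codes.append(country + part)
    return codes


for code in ("CH02-04", "FI18,20", "IE022-025", "GR11"):
    print(code, "->", expand_code_id(code))
```

Contextual data held at the constituent NUTS level can then be aggregated (for example, averaged) over the returned codes before being matched to the ESS region.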


The unique spatial identifier enables us to match any kind of data available at different spatial

levels to the original Cumulative ESS dataset.

3. GIS analysis and mapping of air quality

This section describes the technical procedures involved in the GIS analyses and mapping of

concentrations of PM10, SO2, NO, NO2, CO and Benzene across the 23 European countries in

the cumulative ESS R1-R3 dataset (see footnote 1) from 2001 to 2008. The air quality

datasets were obtained from AirBase - the European Air Quality Database in spreadsheet

(Excel) format. The datasets underwent several preparation, conversion, interpolation,

processing and analysis steps in spreadsheet and GIS formats. The first outcome is a GIS

database with grid cells of 5 km on a side. The data were combined with EU NUTS data

to map air quality at NUTS 3 level. The resulting database can be queried

for air quality and NUTS information at any location in the study area and mapped at both

NUTS 3 and 5 km spatial resolutions.

3.1. Air quality data

The air quality data under investigation are Particulate Matter under 10 microns (PM10),

Sulphur Dioxide (SO2), Nitrogen Monoxide (NO), Nitrogen Dioxide (NO2), Carbon Monoxide

(CO) and Benzene (C6H6). The data are annual average time series for 2001-2008 and cover

23 European countries. The air quality data were recorded by a network of monitoring

stations and submitted to the European Topic Centre for Air Pollution and Climate Change

Mitigation (ETC/ACM), an agent of the European Environment Agency (EEA). The data

and other information on air quality are hosted by the European Air Quality Database (AirBase),

where they are publicly accessible (http://acm.eionet.europa.eu/databases/airbase/).

The datasets for the 23 countries were downloaded in spreadsheet (Excel) format and the

tables were re-structured for eventual import into GIS. The structured tables were identical in

their data types and each air pollution table contains 30 variables, ranging from

administrative units to geographic coordinates (longitude and latitude) of the monitoring

station and air quality values. Since all the spreadsheet tables have an identical number of

columns and identical data types, the air quality data for the different member states were merged

into a single Microsoft Excel spreadsheet. The combined spreadsheet

file has six worksheets, each representing one of the six air pollutants.

An enumeration of the monitoring stations by country showed Germany leading with over

25% of the total air monitoring stations, followed by Spain, France and Italy (Table 3). Over

60% of the total monitoring stations in the study are concentrated in these four countries:

Table 3. Number of monitoring stations and land areas per country

3.2 Methods

The methodology employed in processing the air pollution datasets is multi-step and lengthy.

In brief, three main phases can be identified: i) importing data into GIS, ii) undertaking

interpolation, and iii) integrating NUTS with air quality data. Each of these main steps will be

discussed in the following sections. Fig.1 provides a schematic overview of the workflow.


Figure 1. Workflow showing input, processing and output of air quality data. An example demonstrating

the above workflow is shown in Fig.12 for Germany.

3.2.1 Importing spreadsheet data into GIS

In order to map air quality data in GIS, the data must be spatially referenced. The longitude

and latitude provide the necessary spatial references that define the locations of the

monitoring stations. However, these coordinates are in the World Geodetic System of 1984 (WGS84)

and must be re-projected to the European ETRS89 projection system to overlay accurately with

other GIS layers.

The air quality data in the spreadsheet are prepared for import into GIS by shortening the

column names to fewer than 10 characters and eliminating special characters in column names. The

columns are formatted as numeric, text or dates as appropriate, as ArcGIS has specific requirements

regarding table structures and data types. After the necessary preparations are made in the

spreadsheet, it is ready for import into GIS.
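
The column-name preparation can also be scripted rather than done by hand. The Python sketch below is one possible approach, not the procedure used in the project: the function name clean_column_names and the example headers are hypothetical, and replacing unsupported characters with underscores is an assumed convention.

```python
import re


def clean_column_names(names, max_len=9):
    """Return GIS-friendly column names: letters, digits and underscores
    only, at most max_len characters (fewer than 10), and unique."""
    cleaned, seen = [], set()
    for name in names:
        # Replace anything that is not a letter, digit or underscore.
        base = re.sub(r"[^A-Za-z0-9_]", "_", name.strip())
        base = re.sub(r"_+", "_", base).strip("_") or "col"
        if base[0].isdigit():
            base = "f" + base  # avoid names that start with a digit
        candidate = base[:max_len]
        suffix = 1
        # Truncation can create duplicates; disambiguate with a number.
        while candidate.lower() in seen:
            tail = str(suffix)
            candidate = base[:max_len - len(tail)] + tail
            suffix += 1
        seen.add(candidate.lower())
        cleaned.append(candidate)
    return cleaned


headers = ["Station European Code", "Station Longitude (deg)",
           "Station Latitude (deg)", "PM10 annual mean"]
print(clean_column_names(headers))
```

Keeping a lookup table from the shortened names back to the original headers is advisable, since truncated names are rarely self-explanatory.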

In ArcMap, the Add XY Data command is invoked, which opens a dialogue window for the

input of XY data from tabular data to create an event theme. In ArcMap 10, the command

to add XY data is reached via File > Add Data > Add XY Data. In the dialogue window that

follows, the relevant spreadsheet file is selected, then the columns that hold the X and Y

coordinates are selected to match the corresponding X and Y fields, and the output coordinate

system is specified (see Fig.2). The settings are checked and, if satisfactory, the selection is

confirmed with OK to execute the process of converting the spreadsheet data into GIS.

Figure 2. The Add XY data dialogue windows in ArcMap to import tabular data with coordinates to

create event themes, before filling in the details (left) and after entering the required parameters (right).

After the tabular data are successfully imported into GIS, an event theme (temporary GIS

layer) is created which places a point on each coordinate pair (see Fig.3-A). The result is

examined for spatial accuracy and, if satisfactory, a permanent copy is made by

exporting the event theme to a new GIS layer. The new GIS copy of the air quality data contains all

the data from the spreadsheet and is now ready for processing and analysis in GIS.

The air quality GIS layer is first re-projected from the WGS84 coordinate system to the ETRS

Lambert Azimuthal Equal Area projection. Adopting a common EU projection is necessary

because the countries use different local datums. On the ETRS datum, all the

air pollution data overlay accurately with each other in ArcMap.
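
To make the re-projection step concrete, the sketch below implements the forward Lambert Azimuthal Equal Area equations on the GRS80 ellipsoid in plain Python. It is not the ArcMap procedure itself. The projection parameters (origin at 52°N, 10°E, false easting 4,321,000 m, false northing 3,210,000 m) are those of the standard ETRS89-LAEA Europe definition (EPSG:3035) and are an assumption here, since the report does not list them. The sketch also treats WGS84 and ETRS89 longitude/latitude as interchangeable, which ignores a sub-metre datum difference.

```python
import math

# Assumed parameters: GRS80 ellipsoid and the EPSG:3035 projection origin.
A = 6378137.0                  # semi-major axis (m)
F = 1 / 298.257222101          # flattening
E2 = F * (2 - F)               # eccentricity squared
E = math.sqrt(E2)
LAT0, LON0 = math.radians(52.0), math.radians(10.0)
FALSE_E, FALSE_N = 4321000.0, 3210000.0


def _q(lat):
    """Authalic-latitude helper function q(lat)."""
    s = math.sin(lat)
    return (1 - E2) * (s / (1 - E2 * s * s)
                       - math.log((1 - E * s) / (1 + E * s)) / (2 * E))


def lonlat_to_laea(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to LAEA easting/northing in metres."""
    lat = math.radians(lat_deg)
    dlon = math.radians(lon_deg) - LON0
    qp = _q(math.pi / 2)
    beta = math.asin(_q(lat) / qp)      # authalic latitude of the point
    beta0 = math.asin(_q(LAT0) / qp)    # authalic latitude of the origin
    rq = A * math.sqrt(qp / 2)
    m0 = math.cos(LAT0) / math.sqrt(1 - E2 * math.sin(LAT0) ** 2)
    d = A * m0 / (rq * math.cos(beta0))
    b = rq * math.sqrt(2 / (1 + math.sin(beta0) * math.sin(beta)
                            + math.cos(beta0) * math.cos(beta) * math.cos(dlon)))
    x = FALSE_E + b * d * math.cos(beta) * math.sin(dlon)
    y = FALSE_N + (b / d) * (math.cos(beta0) * math.sin(beta)
                             - math.sin(beta0) * math.cos(beta) * math.cos(dlon))
    return x, y


# The projection origin maps onto the false easting and northing.
print(lonlat_to_laea(10.0, 52.0))
```

In practice a GIS package or a projection library would normally be used for this step; the sketch only shows what the transformation computes for each monitoring station.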

When the monitoring stations are superimposed on the EU country layer, it is evident that

some of the monitoring stations lie outside the country boundaries. This may be due to errors

in the coordinate values from the spreadsheet. Using the country boundaries as a spatial filter,


the monitoring stations that are found inside the countries are saved to a new GIS layer while

those that fall outside are excluded (Fig.3-B). The next step is then to filter and separate the

pollution types by dates into the different years they were recorded. The attribute tables are

queried and the monitoring stations are separated and saved into different years from 2001 to

2008 for all the pollution types (see Fig.4). After this process, there are 48 GIS layers

resulting from the 6 original air quality layers each having 8 layers for the different years

from 2001 to 2008. The final 48 GIS layers are now ready for interpolation in the subsequent

stage.
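
The filtering and splitting steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the ArcMap workflow: the record fields ("x", "y", "pollutant", "year", "value") are hypothetical, and the country boundary is simplified to a single polygon tested with a ray-casting point-in-polygon check.

```python
POLLUTANTS = ("PM10", "SO2", "NO", "NO2", "CO", "C6H6")
YEARS = range(2001, 2009)  # 2001 to 2008 inclusive


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon,
    given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def split_into_layers(stations, boundary):
    """Drop stations outside the boundary, then group the rest into one
    list per (pollutant, year) pair: 6 pollutants x 8 years = 48 layers."""
    layers = {(p, yr): [] for p in POLLUTANTS for yr in YEARS}
    for station in stations:
        if not point_in_polygon(station["x"], station["y"], boundary):
            continue
        key = (station["pollutant"], station["year"])
        if key in layers:
            layers[key].append(station)
    return layers


# Hypothetical example: a square "country" and three station records.
boundary = [(0, 0), (10, 0), (10, 10), (0, 10)]
stations = [
    {"x": 5, "y": 5, "pollutant": "PM10", "year": 2001, "value": 20.0},
    {"x": 15, "y": 5, "pollutant": "PM10", "year": 2001, "value": 99.0},
    {"x": 2, "y": 3, "pollutant": "CO", "year": 2008, "value": 0.4},
]
layers = split_into_layers(stations, boundary)
print(len(layers), len(layers[("PM10", 2001)]))
```

Real country boundaries consist of many polygons, some with holes, so a GIS spatial selection remains the more robust way to apply the filter.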

Figure 3. Event theme of PM10 created from the spreadsheet (A) and the permanent copy made of

monitoring stations that fall inside the countries GIS layer (B) shown in ArcMap. The permanent copy of

PM10 is for all years from 2001 to 2008.


Figure 4. The PM10 layer over the study area is separated into different years from 2001 to 2008 for

interpolation by individual years.

3.2.2 Spatial Interpolation in GIS

Ambient air concentrations are recorded at the monitoring-station level. However, due to

their uneven distributions, the concentrations between monitoring stations remain unknown.

The immediate solution is to apply spatial interpolation techniques to the available

monitoring data to provide air quality information between monitoring stations (Denbyl et al.,

2010).

Air monitoring stations measure ambient air concentrations, generally at fixed locations, and

they record changes in air concentrations. However, on their own, they are insufficient to

provide estimates for the intervening locations or to visualize continuity and variability.

Interpolation is therefore needed: its ability to create continuous

surfaces from sample data makes it both powerful and useful. From the

surface, the morphology and characteristics of the changes can be described (Childs, 2004).

There are different methods of interpolation. Each method uses a different approach and they

almost always produce different results, therefore the most appropriate method will depend

on the distribution of the sample points and the phenomena being studied (Childs, 2004).


In GIS, a surface is represented by storing x, y and z values: the x, y values define the

location of a sample, and the z value represents the characteristic measured there. These points

can be represented as contours, where lines join points of equal value to depict the surface;

alternatively, the points can be represented as a triangular irregular

network (TIN) or as a grid surface. A TIN is a vector data structure used to store and display

surface models, while a grid is a spatial data structure that defines space as an array of cells of

equal size arranged in rows and columns representing a surface. The various methods

are aimed at representing continuous surfaces through interpolation.

There are various interpolation techniques; some of the common ones available in GIS are

spline, inverse distance weighting (IDW), kriging, trend surface and Thiessen polygons.

Within ArcGIS, several spatial interpolation techniques, such as natural neighbour, spline

with barriers, topo to raster and trend, are also available. These spatial interpolation methods can

be grouped into several categories based on their basic hypotheses and

mathematical nature, such as geometric, statistical, geostatistical, stochastic

simulation, physical model simulation and combined methods (Li et al., 2000). Ultimately, the

rationale behind interpolation is to fill in the blanks between points and display a much

smoother and finer surface. Therefore, a well-distributed and sufficient number of data points in

the area under investigation would minimize uncertainties between points.

Research comparing the various spatial interpolation methods shows that there is

no absolutely optimal method; there are only methods that are relatively optimal

in particular situations (ibid). Therefore the best spatial interpolation method should be selected

on the basis of quantitative analysis of the data and repeated experiments. In

addition, the results of spatial interpolation should be strictly examined for validity (ibid).

Studies relating to air pollution that implemented spatial interpolation to map air quality at

European scale used additional datasets like land cover, elevation, meteorology and

population density to improve the methodology and reduce uncertainties in their models

(Horálek et al., 2007, Horálek et al., 2010, Smet et al., 2009). A series of technical

publications by the European Topic Centre on Air and Climate Change (ETC/ACC) deals

extensively with the topic of interpolation and air quality mapping. In particular, a paper by

Horálek and others (2007) on "Spatial Mapping of Air Quality for European Scale

Assessment" is comprehensive and incorporates most of the common interpolation techniques

in its methodologies, with supplementary data, to map air quality in urban and rural areas

across Europe. From a series of tests and models, they concluded that for air quality

assessment, kriging methods are generally preferred over IDW and, for PM10, lognormal

kriging over ordinary kriging. In addition, they favour methodologies that are

based on linear regression using supplementary data over pure interpolation methods. The

use of concurrent meteorological data is reported to give better results than climatological

data (Horálek et al., 2007). All these preferences and recommendations are based on repeated

testing of their methodologies over time. Others, like Naoum and Tsanis (2004), stated that

despite the numerous articles written about interpolation, there is little or no agreement

among the authors on the superiority of some techniques over others. They added that

judgement and experience come into play when considering which interpolation method to

use. In a personal communication (2011), Peter de Smet, a leading researcher on air

quality mapping over Europe from the European Topic Centre on Air Pollution and Climate

Change Mitigation (ETC/ACM), confirmed that their tests showed kriging to

produce accurate results and IDW usually to produce high uncertainties.

After considering the data available to us, the methodologies used in other research, the

tools at our disposal and testing several interpolation techniques, the options came down to

kriging and inverse distance weighting (IDW). Although kriging is preferred over IDW for

mapping air quality at European scale, IDW remains popular where there are fewer

data points and is suitable for rapid interpolation of in-situ air quality data. When both kriging

and IDW were tested by varying the number of monitoring stations, the differences were

acceptable and the values for IDW remained generally consistent. In addition, IDW retains a

larger range of the original data after interpolation compared to kriging.

IDW is grounded on the principle of inverse distance, where the values of the cells are based

on a linearly weighted combination of a set of sample points. The value assigned to a cell is a

function of the distance of an input point from the output cell location: the so-called distance

decay concept. In other words, its estimates are based on values at nearby locations weighted

only by distance from the interpolation location. The greater the distance, the less influence

the point has on the output value. IDW does not make assumptions about spatial relationships

except the basic assumption that nearby points ought to be more closely related than distant

points to the value at the interpolated location (Naoum and Tsanis, 2004). IDW interpolation

is preferred over kriging in this work and will be demonstrated in the following section.
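
The distance-decay idea can be written as z(x0) = sum(w_i * z_i) / sum(w_i), with weights w_i = 1 / d_i^p, where d_i is the distance from sample i to the location being estimated. The Python sketch below is a minimal illustration of that formula and not a reproduction of the ArcGIS tool: the function names are hypothetical, the power p = 2 is an assumed (though common) choice, and every sample is used for every cell, with no search radius or limit on the number of neighbours.

```python
import math


def idw_value(x, y, samples, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    samples is a list of (x, y, value) tuples; a location that coincides
    with a sample returns that sample's value exactly.
    """
    weighted_sum = weight_total = 0.0
    for sx, sy, value in samples:
        dist = math.hypot(x - sx, y - sy)
        if dist == 0.0:
            return value
        weight = 1.0 / dist ** power  # influence decays with distance
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def idw_grid(samples, xmin, ymin, ncols, nrows, cell, power=2.0):
    """Evaluate IDW at the centre of every cell of a regular grid.
    Row 0 is the southernmost row (starting at ymin)."""
    return [[idw_value(xmin + (col + 0.5) * cell,
                       ymin + (row + 0.5) * cell, samples, power)
             for col in range(ncols)]
            for row in range(nrows)]


# Hypothetical stations: (x, y, annual mean concentration).
stations = [(0.0, 0.0, 10.0), (10.0, 0.0, 30.0)]
print(idw_value(5.0, 0.0, stations))   # midway between the two stations
print(idw_grid(stations, 0.0, 0.0, 2, 1, 5.0))
```

Because the estimate is a weighted average, it can never fall outside the range of the sample values.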

Choosing an appropriate geostatistical model

The various geostatistical analysis models have advantages and drawbacks for different

applications. After consulting a number of sources and testing the available interpolation

techniques using Ireland as a case study, the inverse distance weighting (IDW) and kriging

methods appear to suit the mapping of PM10 and other pollution data. IDW is a deterministic

method but is mentioned in a number of sources as a suitable choice, while kriging, which is a

geostatistical technique, is also recommended.

A test over Germany using IDW and kriging was carried out and values compared on a

cell-by-cell basis. The difference between the two ranges between -9 and 9, and while IDW retains a

wider range of values after interpolation, kriging appears to trim the values

further, resulting in a narrower range in the result. In the test with PM10 for 2005 over

Germany, the IDW interpolated values range from 11.6 to 36.5 with a mean of 24.1, whereas

kriging ranges from 17.1 to 31 with a mean of 24.1. The standard deviation for IDW is 3.2

and for kriging is 2.6. An Excel table on the test with descriptive statistics is available on request

to help explain the differences between the two. In spite of their differences, the majority of

the values compare well (Figure 4a).
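
A cell-by-cell comparison of this kind can be sketched in Python as below. The function names and the toy cell values are hypothetical and are not the Germany results; the sketch assumes the two surfaces have been flattened into lists with cells in the same order, and it uses the population standard deviation, which may differ from the convention used in the project's Excel table.

```python
import statistics


def describe(values):
    """Minimum, maximum, mean and (population) standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "sd": statistics.pstdev(values),
    }


def compare_surfaces(idw_cells, kriging_cells):
    """Summarise two interpolated surfaces and their cell-by-cell difference."""
    if len(idw_cells) != len(kriging_cells):
        raise ValueError("surfaces must have the same number of cells")
    differences = [a - b for a, b in zip(idw_cells, kriging_cells)]
    return {
        "idw": describe(idw_cells),
        "kriging": describe(kriging_cells),
        "difference": describe(differences),
    }


# Toy values only: same mean, but the second surface has a narrower range.
summary = compare_surfaces([10.0, 20.0, 30.0], [15.0, 20.0, 25.0])
for name, stats in summary.items():
    print(name, stats)
```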

Figure 4a. IDW interpolation (A) compared to Kriging interpolation (B).


Data preparation prior to Interpolation

Prior to any interpolation, a number of steps are necessary to ensure

correct outcomes. The first step is to reproject the pollution layer from WGS84 to the ETRS 1989

projection system on the European datum. Since the air quality data are a composite of

time-series data between 2001 and 2008, the next step is to filter and separate them into

individual yearly layers, which results in 8 separate GIS layers for each pollution type. The

pollution data for all years are then loaded into ArcMap one pollutant at a time for processing.
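The filtering step, which separates the composite time series into one layer per year, can be sketched in plain Python. The record layout (`station`, `year`, `pm10`) and the function name are assumptions made for illustration; the actual work was done on GIS layers in ArcMap.

```python
from collections import defaultdict

def split_by_year(records, first_year=2001, last_year=2008):
    """Group station records into one list ("layer") per year."""
    layers = defaultdict(list)
    for record in records:
        # Keep only records inside the study period.
        if first_year <= record["year"] <= last_year:
            layers[record["year"]].append(record)
    return dict(sorted(layers.items()))

# Hypothetical composite table: one row per station and year.
records = [
    {"station": "S1", "year": 2001, "pm10": 21.0},
    {"station": "S2", "year": 2001, "pm10": 18.5},
    {"station": "S1", "year": 2002, "pm10": 22.4},
    {"station": "S1", "year": 2009, "pm10": 19.9},  # outside the study period
]

layers = split_by_year(records)
```

With a full record set covering 2001 to 2008 this would yield the 8 yearly layers per pollutant described above.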

At this stage, the geoprocessing environment is configured with the EU countries as the

maximum processing extent for interpolation. After setting the environment, the IDW

interpolation tool is invoked, which opens the IDW interpolation dialogue window

(see Fig.5). The input point features parameter takes the pollution point data, the Z value

field is the field holding the air quality data to interpolate, and the output raster is the name and

location of the interpolation result. After the dialogue options are selected and adjusted,

the interpolation is executed. The result is displayed immediately, as shown in Fig.6. This

procedure is repeated for all the years from 2001 to 2008 for each pollutant. On

completion of the interpolation processing, 48 raster layers have been created from 6 pollution

types.

Figure 5. Dialogue windows for IDW interpolation in ArcMap: default user interactive window (A) and

populated IDW window ready for processing (B).

Figure 6. The outcome of IDW Interpolation for PM10 for 2001 across the 23 EU member states.

The ideal scenario would be to query and extract information on air quality from a single

database as opposed to dealing with a number of separate layers which would involve

additional time and effort besides taking up more storage space, particularly for raster

datasets. In order to combine all the air quality data into a single GIS database, a vector grid

matrix is created as a container to store the extracted raster values from the interpolation

results. The steps involved are detailed as follows.

Fishnet vector grid matrix to store interpolation results

The fishnet function in ArcGIS can create a regularized vector grid matrix of any cell size at

any given extent. Setting the EU countries as the maximum geoprocessing extent, the fishnet

tool is implemented to create a vector grid matrix of 5x5 km cell size. The result is a vector

GIS layer that spans the extent of the EU countries and contains 534,378 grid cells (Fig.7-A).

A spatial overlay is made between the two layers and the grids that intersect with the EU

countries are saved to a new GIS layer for further processing. The resulting grid matrix has

177,645 grid cells (Fig.7-B).
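The fishnet and selection steps can be illustrated with a small plain-Python sketch. It is not the ArcGIS fishnet tool: cells are simple axis-aligned squares, and a rectangle stands in for the country polygons, which is a simplifying assumption. All names and coordinates are hypothetical.

```python
import math

def fishnet(xmin, ymin, xmax, ymax, cell=5000.0):
    """Return (cell_id, x0, y0, x1, y1) squares covering the extent."""
    ncols = math.ceil((xmax - xmin) / cell)
    nrows = math.ceil((ymax - ymin) / cell)
    cells = []
    for row in range(nrows):
        for col in range(ncols):
            x0 = xmin + col * cell
            y0 = ymin + row * cell
            cells.append((row * ncols + col, x0, y0, x0 + cell, y0 + cell))
    return cells

def intersects(cell, box):
    """True if the grid cell overlaps the rectangle (bx0, by0, bx1, by1)."""
    _, x0, y0, x1, y1 = cell
    bx0, by0, bx1, by1 = box
    return x0 < bx1 and x1 > bx0 and y0 < by1 and y1 > by0

# Hypothetical 20 km x 15 km processing extent in projected metres.
grid = fishnet(0.0, 0.0, 20000.0, 15000.0)

# A rectangle standing in for a country outline (assumption).
country_box = (2000.0, 2000.0, 9000.0, 7000.0)
selected = [cell for cell in grid if intersects(cell, country_box)]
```

The same logic, applied to the real country polygons, reduces the full grid to the subset of cells that intersect the study area.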

Figure 7. The EU countries GIS layer on the left is used to select the 5x5km vector grid that it intersects

with and the result saved to a new file as shown on the right.

Since the interpolation and vector grid layers are created from a common map extent and

their grid sizes set to 5km on the side, there is an exact match when the vector grid is

superimposed on the interpolation raster layer (see Fig.8). The attribute table of the vector

grid is restructured and 8 new fields are added with their nomenclature sequentially ranging

from PM10_2001 to PM10_2008. These fields will store the pollution values of the

corresponding years from the interpolation raster layers. One vector GIS grid matrix is

created for each pollution type resulting in 6 grids and their attribute tables restructured to

store data for all years.

Figure 8. Overlay of vector grid matrix on PM10 raster interpolation result showing the exact

match between the raster cells and the vector grid cells. The figure on the right is an inset of the

box on the figure to the left.

In the next step the vector grid is overlaid on the raster layer and the raster pixel values are

transferred to the corresponding vector grid cells. The pixel values of each yearly raster layer are

transferred to the attribute for the corresponding year in the vector GIS database. For example,

the pixel values from PM10 for 2001 are transferred to the attribute (field/column) for 2001 in the

GIS attribute table, the pixel values for PM10 for 2002 are stored in the attribute for 2002 and so on (see

Fig.10). This process is repeated for all 8 raster layers until all 8 corresponding

attributes in the vector database for PM10 are updated. The advantages of storing data in

vector format are the flexibility to query the database in various ways and the smaller storage

space that the vector data structure occupies. The same process is replicated for the remaining five pollutants

and in the end, the 48 raster layers are compacted into just 6 vector GIS layers.

Transferring raster cell values to vector attribute table

The transfer of raster cell values to the grid matrix is a two-step process. First, an

intermediate point layer is created from the polygon grid matrix layer (Fig.9-A), where the

points represent the centre (centroid) of each grid cell (Fig.9-B). Using a GIS surface overlay tool

called Extract Raster Values (Fig.9-D), the raster cell values (Fig.9-C) are extracted and

transferred to the attribute table of the corresponding points (see Fig.10). After this step is

completed, the second step is to transfer the attributes from the point GIS layer across to

the polygon grid matrix through an attribute transfer function. Since the points were created

from the grid matrix layer, they have identical feature identification record addresses and the

values in the point layer are transferred by address matching. This procedure is repeated for

all the other layers.
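The two-step transfer can be sketched in plain Python as follows. The data structures are illustrative assumptions: the raster is a list of rows with row 0 at the lower-left origin, grid cells are `(cell_id, x0, y0, x1, y1)` tuples, and the attribute table is a dictionary keyed by cell id. The function names are hypothetical and do not correspond to ArcGIS tools.

```python
def centroids(cells):
    """Step 1a: one centre point per grid cell, keyed by the cell id."""
    return {cid: ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
            for cid, x0, y0, x1, y1 in cells}

def extract_to_points(points, raster, origin, cell):
    """Step 1b: read the raster value under each point."""
    origin_x, origin_y = origin
    values = {}
    for cid, (x, y) in points.items():
        col = int((x - origin_x) // cell)
        row = int((y - origin_y) // cell)
        values[cid] = raster[row][col]
    return values

def join_by_id(table, values, field):
    """Step 2: copy point values to the polygon table by matching ids."""
    for cid, value in values.items():
        table[cid][field] = value

# Hypothetical 2 x 2 grid of 5 km cells and a matching PM10 raster for 2001.
cells = [(0, 0.0, 0.0, 5000.0, 5000.0), (1, 5000.0, 0.0, 10000.0, 5000.0),
         (2, 0.0, 5000.0, 5000.0, 10000.0), (3, 5000.0, 5000.0, 10000.0, 10000.0)]
raster_2001 = [[8.0, 9.5], [12.0, 20.5]]

table = {cid: {} for cid, *_ in cells}
points = centroids(cells)
join_by_id(table, extract_to_points(points, raster_2001, (0.0, 0.0), 5000.0), "PM10_2001")
```

Repeating the last line for each yearly raster, with field names PM10_2002 to PM10_2008, would fill the remaining columns of the same table.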

Figure 9. Raster Value Extraction procedure showing the construction of vector grid matrix (A),

creating points/centroids from the grid (B) and extracting the raster pixel values (C). The GIS

tool used to extract raster values is shown in D.

[Figure 10 panels: PM10 raster layers for 2001, 2002, 2003, 2004, 2005, 2006, 2007 and 2008]

Figure 10. The pixel values of the 8 raster layers for PM10 from 2001-2008 (top) are transferred to the

corresponding fields in the vector grid attribute table (bottom) resulting in only one file storing PM10

data for all years.

Querying and mapping the pollution data at 5x5 km grid matrix

The previous step produces 6 vector grid layers which can be queried and

mapped. The IDW interpolation results show that the range of the original air quality values is

narrowed: the lowest values are raised and the highest values are lowered. For example, the raw

data for PM10 for 2001 ranged from 6.3 to 103.4; however, the range of values after

interpolation is shortened to between 8 and 95. The computed values cannot be higher than the highest

or lower than the lowest value of the original data, a typical feature of IDW

interpolation. From one vector GIS layer, 8 time series maps for 2001 to 2008 can be

produced (Fig.11). A variety of statistical analyses and queries are possible from the

database when the data is in vector format.
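The bound on the interpolated values follows from the form of the IDW estimator. As a sketch, assuming the usual formulation with a power parameter $p$ (the value of $p$ used is not stated in this report), the estimate at an unsampled location $s_0$ from $n$ stations is

$$\hat{z}(s_0) = \sum_{i=1}^{n} w_i \, z(s_i), \qquad w_i = \frac{d_i^{-p}}{\sum_{j=1}^{n} d_j^{-p}},$$

where $d_i$ is the distance from $s_0$ to station $i$. Because every weight satisfies $w_i \ge 0$ and $\sum_{i} w_i = 1$, the estimate is a weighted average of the observations, so

$$\min_i z(s_i) \;\le\; \hat{z}(s_0) \;\le\; \max_i z(s_i).$$

This is consistent with the PM10 example for 2001, where the interpolated range of 8 to 95 lies inside the raw range of 6.3 to 103.4.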

Figure 11. Rapid outputs of annual mean PM10 from 2001 to 2008 across the study area.

3.2.3 Integration of air quality with NUTS data

The smallest mapping unit, or resolution, of the air quality maps is 5x5 km. These data can

be integrated with other datasets such as climate, demography, socio-economic and transport data

to enrich the data content and help answer questions on a broad range of

themes. One of the aims of this study is to map the air quality at NUTS3 level. This implies

an aggregation of the interpolation results to NUTS3 scale, which involves a number of steps to

produce a new dataset that integrates information on both NUTS3 and pollution.

The first step requires combining the air quality data and NUTS3 into a single dataset. This

task is carried out with the union GIS overlay operation. In a union, there is a geometric intersection

of the two GIS data layers, in which they are combined with each other to produce the equivalent of a Cartesian

product. After the union, some clean-up is necessary and the portions of grid cells that

fall outside the boundaries of the countries are eliminated.

At this stage the smallest mapping unit is 5x5 km with the NUTS3 boundaries infused,

cutting through the grid cells like a cookie cutter (see Fig.12-E). Since the desired output is a map of

air quality at NUTS3 level, the air quality data are aggregated to NUTS3 level using a

generalization tool called dissolve, which melts away the 5x5 km grid cells within each NUTS3 unit.

During the aggregation, the numeric data are averaged while the non-numeric data are

transferred by taking either the first or the last value. After the dissolve operation, the pollution data are

aggregated at NUTS3 level and can be queried and mapped.
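The averaging performed by the dissolve operation can be sketched as a group-by-mean in plain Python. The region codes, field names and values are hypothetical, and the sketch takes a simple unweighted mean of the grid pieces in each region, as the text describes; it does not weight pieces by their area.

```python
from collections import defaultdict

def dissolve_mean(pieces, zone_field, value_field):
    """Average `value_field` over all pieces sharing the same zone code."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for piece in pieces:
        zone = piece[zone_field]
        totals[zone] += piece[value_field]
        counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}

# Hypothetical output of the union: grid pieces tagged with a NUTS3 code.
pieces = [
    {"NUTS3": "N1", "PM10_2005": 20.0},
    {"NUTS3": "N1", "PM10_2005": 24.0},
    {"NUTS3": "N2", "PM10_2005": 30.0},
    {"NUTS3": "N2", "PM10_2005": 26.0},
    {"NUTS3": "N2", "PM10_2005": 28.0},
]

nuts3_pm10 = dissolve_mean(pieces, "NUTS3", "PM10_2005")
```

An area-weighted mean would give less influence to the small slivers created where NUTS3 boundaries cut through grid cells; whether that refinement matters depends on how irregular the regions are.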

The workflow in Fig.1 is illustrated in Fig.12 using Germany as an example. It shows the

steps involved in extracting air quality monitoring stations over Germany, then applying the

interpolation technique to create a surface and extracting the raster values to vector grids of

5x5 km resolution. The values are then transferred and averaged to the NUTS3 level, at which

PM10 for 2005 is mapped (Fig.12:A-G).

Figure 12. Demonstrating actual steps involved in creating PM10 map at NUTS3 scale for Germany.

Notes on Fig.12 A-G

A: Spatial distribution of monitoring stations across Germany.

B: IDW interpolation result from the monitoring stations.

C: Extract raster values from interpolation surface to 5x5 km vector grid matrix through overlay function.

D: Mapping PM10 at 5x5 km grid resolution.

E: Geometric intersection with NUTS3 GIS layer through union overlay function.

F: Dissolving the 5x5 km grid resulting in air quality data aggregated and mapped at NUTS3 scale.

G: Final PM10 map for 2005 over Germany at NUTS3 scale.

4. GIS analysis and mapping of climate and land use data

4.1 Climate data

Temperature and precipitation data were obtained from the European gridded data set of

surface temperature and precipitation for the period 1950-2008, version 4.0, produced by

the European Climate Assessment & Dataset (ECA&D). This freely available dataset

contains daily observations at meteorological stations throughout Europe and the

Mediterranean (http://eca.knmi.nl/).

The selected data files were compressed in NetCDF format, with an original resolution of

0.25 degrees on a regular latitude-longitude grid. The relevant NetCDF files were

extracted using the specialised software Climate Data Operators (CDO), developed at the

Max Planck Institute for Meteorology. CDO is a collection

of tools developed to manipulate and analyse climate and forecast model data.

The variables produced were: i) mean annual temperature, ii) maximum temperature (July),

iii) minimum temperature (January) and iv) mean annual precipitation. Mean values were

obtained from daily data with the use of CDO.
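The reduction from daily values to annual and monthly means can be illustrated in plain Python for a single grid cell. This is a stand-in for the CDO processing, not a reproduction of it: the function names and the sample values are hypothetical, and the monthly helper simply averages all days of the chosen month, which is an assumption about how the July and January variables were derived.

```python
from collections import defaultdict
from datetime import date

def annual_means(daily):
    """Mean of all daily values per calendar year: {year: mean}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, value in daily:
        totals[day.year] += value
        counts[day.year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

def month_mean(daily, year, month):
    """Mean of the daily values that fall in one month of one year."""
    values = [v for d, v in daily if d.year == year and d.month == month]
    return sum(values) / len(values)

# Hypothetical daily temperatures (degrees C) for one grid cell.
daily = [
    (date(2001, 1, 1), -2.0), (date(2001, 7, 1), 18.0),
    (date(2002, 1, 1), 0.0), (date(2002, 7, 1), 20.0),
]

means = annual_means(daily)
july_2001 = month_mean(daily, 2001, 7)
```

Applied to every cell of the 0.25 degree grid, this kind of reduction yields one map per variable and year.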

The extracted maps were re-projected from their latitude and longitude coordinates to

Lambert Azimuthal Equal Area (LAEA) and re-sampled to a 5,000 metre resolution with the

use of ArcGIS. The resulting maps show continuous temperature and precipitation data.

The main characteristic of continuous data is that values can change from one 5,000 metre cell to the next.

To obtain mean values at NUTS 3 level, the images for each factor were analysed using the

Zonal Statistics module in ArcGIS. This module calculates statistics on the values of a

raster within the zones of another dataset. In this case, the NUTS 3 dataset defined the zones

and each factor supplied the values to be aggregated.
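The idea behind zonal statistics can be sketched in a few lines of plain Python. The sketch assumes that the zones have already been rasterised onto the same grid as the values, so that each cell carries a zone code, and that missing data are marked with `None`; both are simplifying assumptions, and the function is not the ArcGIS module.

```python
def zonal_mean(value_raster, zone_raster):
    """Mean of the value cells falling in each zone: {zone: mean}."""
    totals = {}
    counts = {}
    for value_row, zone_row in zip(value_raster, zone_raster):
        for value, zone in zip(value_row, zone_row):
            if value is None or zone is None:
                continue  # skip cells with no data or outside every zone
            totals[zone] = totals.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: totals[zone] / counts[zone] for zone in totals}

# Hypothetical mean annual temperature raster and matching zone raster.
temperature = [[10.0, 12.0, 14.0],
               [11.0, 13.0, None]]
zones = [["A", "A", "B"],
         ["A", "B", "B"]]

zone_means = zonal_mean(temperature, zones)
```

Each zone code here plays the role of one NUTS 3 region, and the result is one aggregated value per region and variable.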

In total, 32 maps were produced from the period of 2001 to 2008, one for each year for each

of the four climatic variables (see Fig.13 for an example).

In order to produce attribute tables for each factor and link those to the NUTS 3 database, the

produced maps were transformed to vector files. This produced attribute tables for each

factor with the same Object ID as the NUTS 3 database, making it possible to link each database.

Finally, with the use of the Spatial Join module all attribute tables were integrated into the

NUTS 3 database. The Spatial Join module creates a table join in which fields from one

layer's attribute table are appended to another layer's attribute table based on the relative

location of the features in the two layers.

[Figure 13 panels: mean annual temperature at NUTS 3 level, one map per year from 2001 to 2008; legend values range from -30 (low) to 20 (high)]

Figure 13. Mean Annual Temperature 2001 to 2008 across the study area.

4.2 Land Use Data

The vector of spatial land use factors comes from the Coordination of Information on the

Environment (CORINE) land cover database. CORINE is a pan-European database compiled

within each European member state. It is a vector spatial dataset in which land cover is digitized

from the interpretation of medium resolution satellite imagery and assigned a land use

class based on a standardized land cover nomenclature defined by the European Environment

Agency. The minimum area mapped in the dataset is 25 hectares. Within this research, broad

land use statistics were derived from the CORINE database for the years 2000 and 2006. For

the purposes of this study, the original 44 land use categories of the CORINE nomenclature

are re-categorised into the following classes.

1. Residential

2. Commercial and Industrial

3. Mines and Dumps

4. Green Urban Spaces

5. Agricultural Land

6. Forestry

7. Natural Areas [6]

8. Waterbodies

This re-categorisation is considered more appropriate to capture the environmental typologies

of interest in this study due to the low spatial resolution of 25 hectares and for econometric

purposes (i.e. to avoid multicollinearity).

As a quantitative representation of land cover in the ESS regions we have used areas

in square meters for each of the 44 CORINE land cover classes (see Appendix). However, this

was not straightforward, as the regions used in ESS for different countries were based on the

boundaries of different NUTS levels (Table 2). Moreover, in some cases, different NUTS

levels were used even within the same country (e.g. France and Ireland). Therefore, we

have composed an ESS regions map, which includes the boundaries of the corresponding NUTS

level for each country. The Europe NUTS 1-, 2- and 3-level maps provided by ESRI were used as

a base for this composite map, which we will call the ESS Regions map hereafter.

Figure 14: Geographical coverage of CORINE for 2000 and 2006

[6] Natural Areas are EU-designated areas of outstanding natural beauty.

As mentioned above, the studied ESS rounds were implemented in 2001, 2004 and 2006.

Thus, only for 2006 do we have data from both sources (ESS and CORINE). Therefore, we have

used linear interpolation to estimate the appropriate land cover statistics for the intermediate

years of 2001 and 2004, based on the CORINE data of 2000 and 2006. Specifically, in the

first stage the ArcGIS Spatial Analyst Tabulate Area function was used to calculate areas by

land cover class for each region of the ESS Regions map based on the 2000 and 2006

CORINE raster maps. The results were then exported to MS Excel and used to estimate the

corresponding values for 2001/2 and 2004 using the following formula for each land cover class

and region:

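$$CLC_{2004}^{n,c} = CLC_{2000}^{n,c} + \frac{4}{6}\left(CLC_{2006}^{n,c} - CLC_{2000}^{n,c}\right),$$

$$CLC_{2002}^{n,c} = CLC_{2000}^{n,c} + \frac{2}{6}\left(CLC_{2006}^{n,c} - CLC_{2000}^{n,c}\right),$$

where $CLC_{y}^{n,c}$ is the area of a land cover class c in a NUTS unit n in the year y.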
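As a worked illustration of the linear interpolation between the 2000 and 2006 CORINE areas, the following sketch computes the intermediate values for one land cover class in one region. The function name and the areas are hypothetical; the fractions 2/6 and 4/6 correspond to the years 2002 and 2004 on the six-year interval.

```python
def interpolate_area(area_2000, area_2006, year):
    """Linearly interpolate a land cover area (square meters) for `year`."""
    if not 2000 <= year <= 2006:
        raise ValueError("year must lie between 2000 and 2006")
    # Fraction of the 2000-2006 interval elapsed, applied to the change.
    return area_2000 + (year - 2000) * (area_2006 - area_2000) / 6

# Hypothetical areas for one class in one region, in square meters.
area_2000 = 1_200_000.0
area_2006 = 1_500_000.0

area_2002 = interpolate_area(area_2000, area_2006, 2002)  # 2/6 of the change
area_2004 = interpolate_area(area_2000, area_2006, 2004)  # 4/6 of the change
```

The interpolation assumes that land cover changed at a constant rate between the two CORINE reference years, which is a simplification for classes that changed abruptly.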

As a result, for each region used in ESS we have obtained 44 land cover class areas in square

meters for 2000 and 2006 from the actual CORINE datasets and for 2001/2 and 2004/5 from

linear interpolation. Finally, the resulting land cover data table was joined with the ESS dataset

and the demographic data provided by the ESRI Europe NUTS maps. The ArcGIS Join attributes

from a table function was applied using the NUTS name as the linking field.

References

Bond, Derek and Devine, Paula. The Role of Geographic Information Systems in Survey

Analysis. The Statistician, 1991, 40 (2), pp. 209-216.

Brereton, F., Clinch, J.P. and Ferreira, S. (2008) Happiness, Geography and the

Environment. Ecological Economics 65, 386-396.

Butz, W.P. & Torrey, B.B. (2006) Some Frontiers in Social Science, Science, 312, 30,

1898-1900.

Childs, C. 2004. Interpolating Surfaces in ArcGIS Spatial Analyst. Developer's Corner -

ArcUser July-September 2004 ESRI.

Denby, B., Garcia, V., Holland, D. & Hogrefe, C. 2010. Integration of air quality modeling

and monitoring data for enhanced health exposure assessment. EM Magazine

[Online], Special Issue. Available:

http://www.google.ie/url?sa=t&source=web&cd=1&ved=0CBUQFjAA&url=http%3

A%2F%2Foaspub.epa.gov%2Feims%2Feimscomm.getfile%3Fp_download_id%3D4

91678&ei=9SI0TrLKHYbMhAfc3NHiCg&usg=AFQjCNFvcSq8o9Gm6ZGWASgac

Ukky3Gj-Q [Accessed 30 July 2011].

Goodchild, Michael, F. and Haining, Robert, P. GIS and Spatial Data Analysis: Converging

Perspectives. Papers in Regional Science, 2004, 83, pp. 363-385.

Horálek, J., Denby, B., Smet, P. d., Leeuw, F. d., Kurfürst, P., Swart, R. & Noije, T. v. 2007.

Spatial mapping of air quality for European scale assessment. ETC/ACC Technical

Paper 2006/6. Bilthoven: European Topic Centre on Air and Climate Change.

Horálek, J., Smet, P. d., Leeuw, F. d., Coňková, M., Denby, B. & Kurfürst, P. 2010.

Methodological improvements on interpolating European air quality maps. ETC/ACC

Technical Paper 2009/16. Bilthoven: European Topic Centre on Air and Climate

Change.

Li, X., Cheng, G. & Lu, L. 2000. Comparison of Spatial Interpolation Methods. Advances in

Earth Science, 260-265.

Luechinger, S., (2009). Valuing Air Quality Using the Life Satisfaction Approach. Economic

Journal 119, 482-515.

MacKerron, G., and S. Mourato, (2009). Life satisfaction and air quality in London,

Ecological Economics, 68(5): 1441-1453

Naoum, S. & Tsanis, I. K. 2004. Ranking Spatial Interpolation Techniques using a GIS-Based

DSS. Global Nest: The International Journal, 6, 1-20.

Smet, P. d. 2011. RE: Interpolation techniques and modelling for mapping air monitoring

values across EU. Type to Ningal, T.

Smet, P. d., Horálek, J., Coňková, M., Kurfürst, P., Leeuw, F. d. & Denby, B. 2009.

European air quality maps of ozone and PM10 for 2007 and their uncertainty analysis.

ETC/ACC Technical Paper 2009/9. Bilthoven: European Topic Centre on Air and

Climate Change.

Wu, Yi-Hwa; Miller, Harvey, J. and Hung, Ming-Chih. A GIS-based Decision Support

System for Analysis of Route Choice in Congested Urban Road Networks. Journal

of Geographical Systems, 2001, 3, pp. 3-24.

Appendix. Classes of CORINE nomenclature 3 levels

GRID CODE | LABEL1 | LABEL2 | LABEL3
1 | Artificial surfaces | Urban fabric | Continuous urban fabric
2 | Artificial surfaces | Urban fabric | Discontinuous urban fabric
3 | Artificial surfaces | Industrial, commercial and transport units | Industrial or commercial units
4 | Artificial surfaces | Industrial, commercial and transport units | Road and rail networks and associated land
5 | Artificial surfaces | Industrial, commercial and transport units | Port areas
6 | Artificial surfaces | Industrial, commercial and transport units | Airports
7 | Artificial surfaces | Mine, dump and construction sites | Mineral extraction sites
8 | Artificial surfaces | Mine, dump and construction sites | Dump sites
9 | Artificial surfaces | Mine, dump and construction sites | Construction sites
10 | Artificial surfaces | Artificial, non-agricultural vegetated areas | Green urban areas
11 | Artificial surfaces | Artificial, non-agricultural vegetated areas | Sport and leisure facilities
12 | Agricultural areas | Arable land | Non-irrigated arable land
13 | Agricultural areas | Arable land | Permanently irrigated land
14 | Agricultural areas | Arable land | Rice fields
15 | Agricultural areas | Permanent crops | Vineyards
16 | Agricultural areas | Permanent crops | Fruit trees and berry plantations
17 | Agricultural areas | Permanent crops | Olive groves
18 | Agricultural areas | Pastures | Pastures
19 | Agricultural areas | Heterogeneous agricultural areas | Annual crops associated with permanent crops
20 | Agricultural areas | Heterogeneous agricultural areas | Complex cultivation patterns
21 | Agricultural areas | Heterogeneous agricultural areas | Land principally occupied by agriculture, with significant areas of natural vegetation
22 | Agricultural areas | Heterogeneous agricultural areas | Agro-forestry areas
23 | Forest and semi natural areas | Forests | Broad-leaved forest
24 | Forest and semi natural areas | Forests | Coniferous forest
25 | Forest and semi natural areas | Forests | Mixed forest
26 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Natural grasslands
27 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Moors and heathland
28 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Sclerophyllous vegetation
29 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Transitional woodland-shrub
30 | Forest and semi natural areas | Open spaces with little or no vegetation | Beaches, dunes, sands
31 | Forest and semi natural areas | Open spaces with little or no vegetation | Bare rocks
32 | Forest and semi natural areas | Open spaces with little or no vegetation | Sparsely vegetated areas
33 | Forest and semi natural areas | Open spaces with little or no vegetation | Burnt areas
34 | Forest and semi natural areas | Open spaces with little or no vegetation | Glaciers and perpetual snow
35 | Wetlands | Inland wetlands | Inland marshes
36 | Wetlands | Inland wetlands | Peat bogs
37 | Wetlands | Maritime wetlands | Salt marshes
38 | Wetlands | Maritime wetlands | Salines
39 | Wetlands | Maritime wetlands | Intertidal flats
40 | Water bodies | Inland waters | Water courses
41 | Water bodies | Inland waters | Water bodies
42 | Water bodies | Marine waters | Coastal lagoons
43 | Water bodies | Marine waters | Estuaries
44 | Water bodies | Marine waters | Sea and ocean
48 | NODATA | NODATA | NODATA
49 | UNCLASSIFIED | UNCLASSIFIED LAND SURFACE | UNCLASSIFIED LAND SURFACE
50 | UNCLASSIFIED | UNCLASSIFIED WATER BODIES | UNCLASSIFIED WATER BODIES
255 | UNCLASSIFIED | UNCLASSIFIED | UNCLASSIFIED
