technical description for data merging
Technical report on GIS Analysis, Mapping and Linking of Contextual Data to the
European Social Survey
HAPPINESS project of the Cross-National and Multi-level Analysis of Human Values,
Institutions and Behaviour (HumVIB) programme
Finbarr Brereton, University College Dublin, Ireland
Mirko Moro, University of Stirling, The United Kingdom
Tine Ningal, University College Dublin, Ireland
Susana Ferreira, University of Georgia, USA
Abstract
This technical paper documents the work undertaken to link the European Social Survey, a
biennial multi-country survey that measures the attitudes, beliefs and values of individuals
living in more than 30 nations, to multi-level variables capturing the physical environment
and context of the respondents (air pollution, climate, land use, local GDP per capita,
population density, unemployment rate, etc.). The process of linking the data involved
creating a series of spatial identifiers based on the Nomenclature of Territorial Units for
Statistics (NUTS) geocodes. In addition, while the macroeconomic contextual variables are
typically available at the regional level, pollution and climate data are recorded at monitoring
stations, and Geographic Information Systems (GIS) spatial interpolation techniques need to
be applied prior to linking these to a particular respondent. GIS is increasingly used to
process, analyse and display georeferenced data effectively due to its mapping capabilities.
The resulting dataset provides a unique tool for quantitative investigation of interrelationships
at the individual, regional and national levels in Europe.
Financial support from the European Science Foundation (Cross-National and Multi-level Analysis of
Human Values, Institutions and Behaviour (HumVIB)) is gratefully acknowledged. We thank Oana Borcan and
Victor Peredo Alvarez for outstanding research assistance. Corresponding author: [email protected]
TABLE OF CONTENTS
1. Introduction
1.1 European Social Survey
1.2 Geographical Information Systems
1.3 Deliverables from the Project
2. Creating a regional identifier in the ESS for data linking
3. GIS analysis and mapping of air quality
3.1 Data
3.2 Methods
3.2.1 Importing spreadsheet data into GIS
3.2.2 Spatial Interpolation in GIS
3.2.3 Integration of air quality with NUTS data
4. GIS analysis and mapping of climate and land use data
4.1 Climate data
4.2 Land use data
5. References
6. Appendix
1. Introduction
1.1 European Social Survey (ESS)
The ESS is an academically-driven, international survey examining changing social attitudes,
beliefs and values across Europe. It has become the first ever social science project to be
granted the prestigious Descartes prize, awarded by the European Commission for
excellence in scientific research. In our project we focus on the first three waves of the
survey. The first wave was fielded in 2002/2003, the third one in 2006/2007. ESS data are
obtained using random (probability) samples, where the sampling strategies, which may vary
by country, are designed to ensure representativeness and comparability across European
countries. The three-wave cumulative dataset includes around 120,000 observations from 23
European countries.1
One of the variables collected in the survey is the region within a country where the
respondent lives. This information allows us to match the survey data spatially to a map of
Europe using Geographic Information Systems (GIS) and hence it is possible to combine
individual data with a vector of spatial amenities.2 These two datasets are combined at the
NUTS level.3 To assess the impact of changes in spatial amenities on individual variables (of
particular interest for our project, self-reported subjective well-being) in a more precise
manner, ideally, one would want to be able to match contextual factors to a particular
individual rather than a particular area. At present, however, the data do not allow this and
anonymity may preclude this in any case.
1 They are Austria, Belgium, Czech Republic, Switzerland, Germany, Denmark, Estonia, Spain, Finland, France,
Greece, Hungary, Ireland, Italy, Luxembourg, Netherlands, Norway, Poland, Portugal, Sweden, Slovenia,
Slovakia and the UK.
2 GIS works well when applied to static data, and less well when applied to time series analysis (Goodchild and
Haining, 2004) and hence is well suited to the cross-sectional data employed in this project.
3 European Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing
administrative divisions of countries for statistical purposes developed by the European Union.
1.2 Geographic Information Systems and the Social Sciences
Adoption of Geographic Information Systems (GIS) and spatial modelling tools in the social
sciences is in its infancy, primarily because social scientists have yet to recognise
the capability and capacity of such tools to support and develop new research areas and to aid
and enhance research applications. GIS offers great potential to generate innovative
approaches and advance knowledge in disciplines such as political science, economics,
archaeology, environmental studies, history, demography, anthropology, and applied social
sciences.
GIS is widely used as a planning and analysis computing tool that allows the visual
representation of spatially referenced data and provides a powerful set of tools for spatial
analysis and modelling. It has advanced the technical ability to handle spatial data as
countable numbers of points, lines and polygons4 in two-dimensional space (Goodchild and
Haining, 2004) and to link various datasets using spatial identifiers (Bond and Devine, 1991).
It represents a solid base for spatial data analysis and provides a range of techniques for
analysis and visualisation of spatial data. It provides effective decision support through its
database management capabilities, graphical user interfaces and cartographic visualisation
(Wu et al., 2001). It provides tools for integrating, querying and analysing a wide variety of
data types, such as scientific and cultural data, satellite imagery and aerial photography, as
well as data collected by individuals, into projects, with geographic locations providing the
integral link between all the data.
With the rapid growth in the availability of geographical data in digital formats and parallel
innovations in technology to allow for the manipulation, analysis and visualisation of these
data, new types of information are being created. This underpins developments in
Participatory GIScience which provides a better understanding of the complexity of decision
situations involving human interactions with their physical environment. A recent article in
Science (Butz and Torrey, 2006) highlights the importance of the new GIScience tools in
providing the ability to analyse social behaviour across time and geographic scales. It further
points out that their adoption by social scientists is still in its infancy.
GIS methods can contribute to multi-level analysis; they can even generate new levels of
analysis and allow access to levels previously only identifiable in principle. They can also
help disseminate multilevel research findings. With a diverse range of disciplines involved
in multi-disciplinary research (sociologists, psychologists, economists, political scientists
etc.), creating policy documents that are accessible to the research community and to the
general public can become a challenge. To this end, GIS applications allow cartographic
representations of data and results that aid in disseminating information to a wide audience:
a picture is worth a thousand words.
4 A polygon is the GIS term for any multi-sided figure.
There now exist unprecedented individual-level data resources in Europe, typified by the
European Social Survey (ESS). There also exist comprehensive system-level and contextual
data. To date, however, there have been few analyses employing individual-level data linked to
contextual data, and they typically cover a limited local area or a limited set of indicators (see,
e.g., Brereton et al., 2008; MacKerron and Mourato, 2008; Luechinger, 2009). GIS facilitates
linking contextual data (institutional, economic, environmental etc.) to individual-level data.
While many social scientists are currently engaged in cross-national analysis, using GIS to
link data at the regional level would allow investigators to go further and engage in analysis
at the micro, meso and macro levels, using data that are comparable across a larger number of
units of analysis (regions) and this would increase the validity of multi-level analysis.
The growth in the availability of geographical data has not been accompanied by a coherent,
coordinated data collection effort at the European level. For example, the National Institute
for Regional and Spatial Analysis (NIRSA) in Ireland, the URBIS Digital Spatial Database in
University College Dublin, EDINA in the UK and Eurostat all house digital spatial data,
much of this overlapping. A goal of our project is to address the fragmentation that currently
plagues digital data archives in Europe by creating a pan-European research dataset with
environmental and other contextual data spatially referenced and linked to the ESS, and to
share the dataset and methodologies used to create it.
1.3 Deliverables from the Project
A key deliverable of this project is a pan-European dataset of environmental and other spatial
variables geo-referenced at a regional level and linked to the individuals in the ESS. The
contextual variables can be classified into four groups: air pollution concentrations, climate,
land use and macro-socioeconomic factors (Table 1).
Table 1. List of variables in spatial dataset
Category              Indicators                                       Main source
Air pollution         PM10 mean annual concentration (µg/m3)           EEA AirBase
                      CO mean annual concentration (mg/m3)             (http://acm.eionet.europa.eu/databases/airbase/)
                      SO2 mean annual concentration (µg/m3)
                      NO mean annual concentration (µg/m3)
                      NO2 mean annual concentration (µg/m3)
                      Benzene mean annual concentration (µg/m3)
Climate               Annual mean temperature (°C)                     ECA
                      Mean of daily max. temperature in July (°C)      (http://eca.knmi.nl/)
                      Mean of daily min. temperature in January (°C)
                      Annual mean precipitation (mm)
Land use              Residential                                      CORINE
                      Commercial and Industrial                        (http://www.eea.europa.eu/publications/COR0-landcover)
                      Mines and Dumps
                      Green Urban Spaces
                      Agricultural Land
                      Forestry
                      Natural Areas
                      Water bodies
Macro-socioeconomic   GDP per capita                                   Eurostat
                      Population change (%)                            (http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/)
                      Population density
                      Deaths from respiratory diseases
                      Unemployment rate (by age group/gender)
In the following sections, we provide a more detailed description of the construction of the
dataset, in particular for the pollution, climate and land use variables. Macro-socioeconomic
variables were already available at a NUTS 2 or NUTS 3 level (depending on the variable)
from the Eurostat database.
The NUTS is a geocode standard for referencing administrative divisions of countries for
statistical purposes developed by the European Union. Regions at NUTS level 1 are large
sub-national units (such as Scotland or Bavaria) each of which usually comprises a number of
NUTS 2 regions (examples of this level include the Autonomous Communities in Spain or
the "regions" in France). In turn, these are made up of NUTS 3 regions (such as the "Kreis" in
Germany). Although broadly very stable over time in a number of countries, the NUTS
classification has been amended several times, most recently in 1995, 1999 and 2003.5
The Cumulative ESS database does not use a consistent definition of region; in some cases the
regions are NUTS level 1, in other cases NUTS level 2 or NUTS level 3 (Table 2).
Table 2. NUTS levels used in ESS for each participant country
NUTS level   Countries
1            Belgium (BE), Germany (DE), France* (FR), Luxembourg (LU), United Kingdom (UK)
2            Austria (AT), Switzerland (CH), Spain (ES), Finland (FI), France* (FR), Greece (GR),
             Hungary (HU), Italy (IT), Ireland* (IE), Norway (NO), Poland (PL), Portugal (PT), Sweden (SE)
3            Czech Republic (CZ), Denmark (DK), Estonia (EE), Ireland* (IE), The Netherlands (NL),
             Slovenia (SI), Slovakia (SK), Ukraine (UA)
* ESS used mixed boundaries of NUTS levels 1 & 2 for France and levels 2 & 3 for Ireland.
The spatial variables in Table 1 were linked to each respondent at the corresponding NUTS
level in Table 2. In addition, we preserved them at the highest level of spatial disaggregation
at which they were available (NUTS 3 for pollution, climate and land use data, NUTS 2-3 for
macro-socioeconomic indicators). All the datasets will be publicly available on the project
website (http://www.ucd.ie/happy/resear.html) from January 2012.
2. Creating a regional identifier in the ESS for data linking
The cumulative ESS dataset can be freely downloaded from the ESS website
(www.europeansocialsurvey.org). We appended three additional spatial identifiers in order to
facilitate the matching of the ESS data file with spatially referenced data and to carry out
spatial analysis of data: "cntry", "region" and "code_id". "cntry" is the NUTS code for each
country; "region" is the name of the region as reported in the ESS Cumulative dataset. The ESS
contains multiple variables to identify the region where the respondent lives. We merge these
into one new variable. The key new regional variable we create is "code_id": a unique
identifier equal to the NUTS code, at the level used for that country, of the region in which a
particular ESS respondent lives (see Table 2).
5 A detailed list of NUTS by each European country can be found at
http://ec.europa.eu/eurostat/ramon/nuts/codelist_en.cfm?list=nuts. Maps of each country and regions with
subdivision in NUTS levels can be found at http://circa.europa.eu/irc/dsis/regportraits/info/data/en/.
In the NUTS system, each country is divided following a three-level hierarchy of regions
established on the basis of existing administrative regions or groupings of these. The NUTS 1
code is composed of three alphanumeric characters. The first two refer to the country (and
are the same as the cntry variable), while the third is usually a number. The NUTS 2
code is composed of four alphanumeric characters, while the NUTS 3 code consists of five
alphanumeric characters. Box 1 illustrates the NUTS code hierarchy for the Spanish regions.
Therefore if code_id for a particular observation is, for example, GR11 it means that the
respondent lives in a NUTS2 level region of Greece, while UA044 stands for a NUTS 3 level
region of Ukraine, etc.
Box 1: NUTS code hierarchy for Spanish regions
ES (Spain)
  ES1 (represents NUTS 1 level identifying the North-West region)
    ES11 (represents NUTS 2 level identifying Galicia)
      ES111 (represents NUTS 3 level identifying La Coruña)
      ES112 (region at NUTS 3 level, Lugo)
    ES12 (Asturias)
      ES120 (Asturias)
    ES13 (Cantabria)
      ES130 (Cantabria)
  ES2 (NUTS 1 identifying North-East region)
    ES21 (Basque Country)
      ES211 (Álava/Araba)
      ES212 (Guipúzcoa/Gipuzkoa)
  ...
  ...
  ES7 (NUTS 1 Canarias)
    ES70 (Canary Islands)
      ES701 (Las Palmas)
      ES702 (Tenerife)
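The length rule described above can be sketched in a few lines of Python. This is only an illustration of the coding convention, not part of the project's own tooling; the function names `nuts_level` and `nuts_ancestors` are hypothetical.

```python
def nuts_level(code: str) -> int:
    """Return the NUTS level (0-3) implied by the length of a plain NUTS code."""
    code = code.strip().upper()
    # A plain code is a two-letter country prefix plus 0 to 3 alphanumeric characters.
    if not (2 <= len(code) <= 5) or not code[:2].isalpha() or not code.isalnum():
        raise ValueError(f"not a plain NUTS code: {code!r}")
    return len(code) - 2


def nuts_ancestors(code: str) -> list[str]:
    """Return the chain of codes from the country down to the code itself."""
    code = code.strip().upper()
    nuts_level(code)  # validates the code
    return [code[:n] for n in range(2, len(code) + 1)]


print(nuts_level("GR11"))       # 2
print(nuts_level("UA044"))      # 3
print(nuts_ancestors("ES111"))  # ['ES', 'ES1', 'ES11', 'ES111']
```

Because every higher-level code is a prefix of the lower-level one, data held at NUTS 3 can be aggregated to the level used by the ESS for a given country simply by truncating the code.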
In a few cases (Switzerland, Finland, France, Italy, Ireland), a new code was created to accommodate
the fact that their ESS regions were an aggregation of NUTS regions. In these cases, the following rule
has been observed in creating the ad-hoc code:
- CH02-04 is an aggregation of 3 Swiss regions at NUTS 2 level (i.e., CH02, CH03,
and CH04).
- FI18,20 represents 2 Finnish regions at NUTS 2 level (i.e., FI18 and FI20), etc. Ditto
for France.
- IE022-025 represents 4 Irish regions at NUTS 3 level (i.e., IE022, IE023,
IE024, IE025).
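When contextual data held at the original NUTS level have to be aggregated to these ad-hoc regions, the ad-hoc codes need to be expanded back into their constituent NUTS codes. The sketch below shows one way this could be done for the three patterns listed above; the function name `expand_code_id` is hypothetical, and the sketch assumes that the varying part of a range is purely numeric.

```python
def expand_code_id(code_id: str) -> list[str]:
    """Expand an ad-hoc code such as 'CH02-04' or 'FI18,20' into plain NUTS codes."""
    parts = [p.strip() for p in code_id.split(",")]
    first_code = parts[0].split("-")[0]  # a complete code, e.g. 'CH02'

    def complete(fragment: str) -> str:
        # Prepend the leading characters of the first code to a shortened fragment.
        missing = max(0, len(first_code) - len(fragment))
        return first_code[:missing] + fragment

    codes = []
    for part in parts:
        if "-" in part:
            start_frag, end_frag = part.split("-")
            start = complete(start_frag)
            width = len(end_frag)          # number of digits that vary
            stem = start[:-width]
            first_n, last_n = int(start[-width:]), int(end_frag)
            # Assumption: the range runs over consecutive, zero-padded integers.
            codes.extend(f"{stem}{n:0{width}d}" for n in range(first_n, last_n + 1))
        else:
            codes.append(complete(part))
    return codes


print(expand_code_id("CH02-04"))    # ['CH02', 'CH03', 'CH04']
print(expand_code_id("FI18,20"))    # ['FI18', 'FI20']
print(expand_code_id("IE022-025"))  # ['IE022', 'IE023', 'IE024', 'IE025']
```

A plain code such as GR11 passes through unchanged, so the same function can be applied to every value of code_id.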
The unique spatial identifier enables us to match any kind of data available at different spatial
levels to the original Cumulative ESS dataset.
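As an illustration of how the identifier is used, the following sketch attaches regional values to respondent records by code_id. It is a minimal stand-in for a left join, not the procedure used to build the dataset; the function name `link_context`, the field names other than code_id and cntry, and all values shown are hypothetical.

```python
def link_context(respondents, context_by_code, key="code_id"):
    """Attach regional context to each respondent row (a left join on `key`)."""
    linked = []
    for row in respondents:
        merged = dict(row)                               # keep the survey variables
        merged.update(context_by_code.get(row[key], {})) # add regional variables, if any
        merged["matched"] = row[key] in context_by_code  # flag unmatched regions
        linked.append(merged)
    return linked


# Hypothetical records and values, for illustration only.
respondents = [
    {"idno": 1, "cntry": "GR", "code_id": "GR11"},
    {"idno": 2, "cntry": "UA", "code_id": "UA044"},
]
context_by_code = {"GR11": {"pm10": 30.0, "pop_density": 40.0}}

for row in link_context(respondents, context_by_code):
    print(row)
```

Keeping every respondent and flagging unmatched regions makes it easy to check which ESS regions still lack contextual data. With tabular tools the same step is an ordinary left merge on code_id.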
3. GIS analysis and mapping of air quality
This section describes the technical procedures involved in the GIS analysis and mapping of
concentrations of PM10, SO2, NO, NO2, CO and Benzene across the 23 European countries in
the cumulative ESS R1-R3 dataset (see footnote 1) from 2001 to 2008. The air quality
datasets were obtained from AirBase, the European Air Quality Database, in spreadsheet
(Excel) format. The datasets underwent several preparation, conversion, interpolation,
processing and analysis steps in spreadsheet and GIS formats. The first outcome is a GIS
database with a grid cell size of 5 km on a side. The data were then combined with EU NUTS data
to map air quality at NUTS 3 level. The resulting air quality database can be queried
for air quality and NUTS information at any location in the study area and mapped at both
NUTS 3 and 5 km spatial resolutions.
3.1. Air quality data
The air quality data under investigation are Particulate Matter under 10 microns (PM10),
Sulphur Dioxide (SO2), Nitrogen Monoxide (NO), Nitrogen Dioxide (NO2), Carbon Monoxide
(CO) and Benzene (C6H6). The data are annual average time series for 2001-2008 and cover
23 European countries. The air quality data have been recorded by a network of monitoring
stations and submitted to the European Topic Centre for Air Pollution and Climate Change
Mitigation (ETC/ACM), an agent of the European Environment Agency (EEA). The data
and other information on air quality are hosted by the European Air Quality Database (AirBase),
where they are publicly accessible (http://acm.eionet.europa.eu/databases/airbase/).
The datasets for the 23 countries were downloaded in spreadsheet (Excel) format and the
tables were re-structured for eventual import into GIS. The structured tables were identical in
their data types, and each air pollution table contains 30 variables, ranging from
administrative units to the geographic coordinates (longitude and latitude) of the monitoring
station and air quality values. Since all the spreadsheet tables have an identical number of
columns and identical data types, the air quality data for the different countries were merged
into a single Microsoft Excel spreadsheet. The combined spreadsheet
file has six worksheets, each representing one of the six air pollutants.
An enumeration of the monitoring stations by country showed Germany leading with over
25% of the total air monitoring stations, followed by Spain, France and Italy (Table 3). Over
60% of the total monitoring stations in the study are concentrated in these four countries:
Table 3. Number of monitoring stations and land areas per country
3.2 Methods
The methodology employed in processing the air pollution datasets is multi-step and lengthy.
In brief, three main phases can be identified: i) importing data into GIS, ii) undertaking
interpolation, and iii) integrating NUTS with air quality data. Each of these main steps will be
discussed in the following sections. Fig.1 provides a schematic overview of the workflow.
Figure 1. Workflow showing input, processing and output of air quality data. An example demonstrating
the above workflow is shown in Fig.12 for Germany.
3.2.1 Importing spreadsheet data into GIS
In order to map air quality data in GIS, the data must be spatially referenced. The longitude
and latitude provide the necessary spatial references that define the locations of the
monitoring stations. However, these coordinates are in the World Geodetic System 1984 (WGS84)
and must be re-projected to the European ETRS89 projection system to overlay accurately with
other GIS layers.
The air quality data in the spreadsheet are prepared for import into GIS by shortening the
column names to fewer than 10 characters and eliminating special characters from column names. The
columns are formatted as numeric, text or dates accordingly, as ArcGIS has specific protocols
regarding table structures and data types. After the necessary preparations are made in the
spreadsheet, it is ready for import into GIS.
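The column-name preparation can also be scripted rather than done by hand. The sketch below is one possible approach and not the procedure used in the project: the function name `short_field_names`, the choice to replace disallowed characters with underscores, and the example column names are all assumptions.

```python
import re


def short_field_names(names, max_len=9):
    """Return unique field names of at most `max_len` characters (letters, digits, underscore)."""
    seen = set()
    result = []
    for name in names:
        clean = re.sub(r"[^0-9A-Za-z_]", "_", name.strip())  # drop spaces, brackets, etc.
        clean = re.sub(r"_+", "_", clean).strip("_") or "field"
        if clean[0].isdigit():
            clean = "f" + clean                               # avoid a leading digit
        clean = clean[:max_len]
        candidate, n = clean, 1
        while candidate.lower() in seen:                      # keep names unique after truncation
            suffix = str(n)
            candidate = clean[: max_len - len(suffix)] + suffix
            n += 1
        seen.add(candidate.lower())
        result.append(candidate)
    return result


# Hypothetical column names, for illustration only.
print(short_field_names(["Station European Code", "Station longitude (deg)", "Station latitude (deg)"]))
```

Truncation can make two long names collide, which is why the sketch appends a numeric suffix when a shortened name has already been used.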
In ArcMap, the Add XY Data command opens a dialogue window for the
input of XY data from a table to create an event theme. In ArcMap 10, the command
is reached via File > Add Data > Add XY Data. In the dialogue window that
follows, the relevant spreadsheet file is selected, then the columns that hold the X and Y
coordinates are selected to match the corresponding X and Y fields, and the output coordinate
system is specified (see Fig.2). The settings are checked and, if satisfactory, the selection is
confirmed with OK to execute the process of converting the spreadsheet data into GIS.
Figure 2. The Add XY data dialogue windows in ArcMap to import tabular data with coordinates to
create event themes, before filling in the details (left) and after entering the required parameters (right).
After the tabular data are successfully imported into GIS, an event theme (temporary GIS
layer) is created which places a point on each coordinate pair (see Fig.3-A). The result is
examined for spatial accuracy and, if satisfactory, a permanent copy is made by
exporting the event theme to a new GIS layer. The new GIS copy of air quality contains all
the data from the spreadsheet and is now ready for processing and analysis in GIS.
The air quality GIS layer is first re-projected from WGS84 coordinate system to ETRS
Lambert Azimuthal Equal Area projection. This is necessary to adopt a common EU
projection because all the countries have different local datums. On the ETRS datum, all the
air pollution data accurately overlay with each other in ArcMap.
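To make the re-projection step concrete, the sketch below converts a longitude/latitude pair to Lambert Azimuthal Equal Area coordinates using the standard ellipsoidal formulas. It assumes that the target system is ETRS89-LAEA Europe (EPSG:3035: GRS80 ellipsoid, origin 52°N 10°E, false easting 4,321,000 m, false northing 3,210,000 m) and treats WGS84 and ETRS89 coordinates as interchangeable, which ignores a sub-metre datum difference. The function name `wgs84_to_laea` is hypothetical; in ArcGIS the same step is performed by the projection tools.

```python
import math

# Assumed parameters: GRS80 ellipsoid and the ETRS89-LAEA Europe definition (EPSG:3035).
A = 6378137.0
F = 1 / 298.257222101
E2 = 2 * F - F * F
E = math.sqrt(E2)
LAT0, LON0 = math.radians(52.0), math.radians(10.0)
FALSE_E, FALSE_N = 4321000.0, 3210000.0


def _q(phi):
    """Authalic-latitude helper term of the ellipsoidal LAEA formulas."""
    s = math.sin(phi)
    return (1 - E2) * (s / (1 - E2 * s * s)
                       - (1 / (2 * E)) * math.log((1 - E * s) / (1 + E * s)))


def wgs84_to_laea(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to easting/northing in metres."""
    phi, dlam = math.radians(lat_deg), math.radians(lon_deg) - LON0
    qp = _q(math.pi / 2)
    beta0 = math.asin(_q(LAT0) / qp)   # authalic latitude of the origin
    beta = math.asin(_q(phi) / qp)     # authalic latitude of the point
    rq = A * math.sqrt(qp / 2)
    d = A * math.cos(LAT0) / (math.sqrt(1 - E2 * math.sin(LAT0) ** 2) * rq * math.cos(beta0))
    b = rq * math.sqrt(2 / (1 + math.sin(beta0) * math.sin(beta)
                            + math.cos(beta0) * math.cos(beta) * math.cos(dlam)))
    x = FALSE_E + b * d * math.cos(beta) * math.sin(dlam)
    y = FALSE_N + (b / d) * (math.cos(beta0) * math.sin(beta)
                             - math.sin(beta0) * math.cos(beta) * math.cos(dlam))
    return x, y


print(wgs84_to_laea(10.0, 52.0))  # the projection origin maps to the false easting/northing
```

An equal-area projection is a natural choice here because the later steps rely on a regular 5 km grid whose cells should cover the same ground area everywhere in the study region.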
When the monitoring stations are superimposed on the EU country layer, it is evident that
some of the monitoring stations lie outside the country boundaries. This may be due to errors
in the coordinate values from the spreadsheet. Using the country boundaries as a spatial filter,
the monitoring stations that are found inside the countries are saved to a new GIS layer while
those that fall outside are excluded (Fig.3-B). The next step is then to filter and separate the
pollution types by dates into the different years they were recorded. The attribute tables are
queried and the monitoring stations are separated and saved into different years from 2001 to
2008 for all the pollution types (see Fig.4). After this process, there are 48 GIS layers
resulting from the 6 original air quality layers each having 8 layers for the different years
from 2001 to 2008. The final 48 GIS layers are now ready for interpolation in the subsequent
stage.
Figure 3. Event theme of PM10 created from spreadsheet (A) and the permanent copy made of
monitoring stations that fall inside the countries GIS layer (B) shown in ArcMap. The permanent copy of
PM10 is for all years from 2001 to 2008.
Figure 4. The PM10 layer over the study area is separated into different years from 2001 to 2008 for
interpolation by individual years.
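The two filtering operations described in this section (keeping only stations inside the country boundaries, then separating them by pollutant and year) can be sketched as follows. This is a simplified illustration, not the ArcMap procedure: the boundary is a single hypothetical polygon, the record fields are assumptions, and the inside test is a standard ray-casting check.

```python
from collections import defaultdict


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def split_by_pollutant_and_year(stations, boundary):
    """Keep stations inside the boundary and group them into one layer per (pollutant, year)."""
    layers = defaultdict(list)
    for s in stations:
        if point_in_polygon(s["x"], s["y"], boundary):
            layers[(s["pollutant"], s["year"])].append(s)
    return dict(layers)


# Hypothetical boundary and station records, for illustration only.
boundary = [(0, 0), (10, 0), (10, 10), (0, 10)]
stations = [
    {"x": 5, "y": 5, "pollutant": "PM10", "year": 2001, "value": 25.0},
    {"x": 15, "y": 5, "pollutant": "PM10", "year": 2001, "value": 40.0},  # outside: dropped
    {"x": 2, "y": 3, "pollutant": "PM10", "year": 2002, "value": 22.0},
]
for key, rows in split_by_pollutant_and_year(stations, boundary).items():
    print(key, len(rows))
```

With six pollutants and eight years, a full run of this grouping yields up to 6 x 8 = 48 layers, matching the count given in the text.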
3.2.2 Spatial Interpolation in GIS
Ambient air concentrations are recorded at the monitoring-station level. However, because
stations are unevenly distributed, the concentrations between monitoring stations remain unknown.
The immediate solution is to apply spatial interpolation techniques to the available
monitoring data to provide air quality information between monitoring stations (Denby et al.,
2010).
Air monitoring stations measure ambient air concentrations, generally at fixed locations, and
they record changes in air concentrations. However, on their own, they are insufficient to
provide estimates for the intervening locations or to visualize continuity and variability.
This is where interpolation becomes necessary: its ability to create continuous
surfaces from sample data makes it both powerful and useful. From the
surface, the morphology and characteristics of the changes can be described (Childs, 2004).
There are different methods of interpolation. Each method uses a different approach and they
almost always produce different results, therefore the most appropriate method will depend
on the distribution of the sample points and the phenomena being studied (Childs, 2004).
In GIS, a surface is represented by storing x, y and z values: the x,y values define the
location of a sample and the z value the characteristic measured there. These points
can be represented as contours, where lines join points of equal value to depict the surface,
or alternatively as a triangular irregular
network (TIN) or as a grid surface. A TIN is a vector data structure used to store and display
surface models, while a grid is a spatial data structure that defines space as an array of cells of
equal size arranged in rows and columns representing a surface. The various methods
are aimed at representing continuous surfaces through interpolation.
There are various interpolation techniques; some of the common ones available in GIS are
spline, inverse distance weighting (IDW), kriging, trend surface and Thiessen polygons.
Within ArcGIS, further spatial interpolation techniques such as natural neighbour, spline
with barriers, topo to raster and trend are available. These spatial interpolation methods can
be grouped into several categories based on their basic hypotheses and
mathematical nature, such as geometric, statistical, geostatistical, stochastic
simulation, physical model simulation and combined methods (Li et al., 2000). Ultimately, the
rationale behind interpolation is to fill in the blanks between points and display a much
smoother and finer surface. Therefore, a sufficient number of well-distributed data points in
the area under investigation would minimize uncertainties between points.
Research comparing the various spatial interpolation methods shows that there is
no absolutely optimal method, only a relatively optimal interpolation method
for a particular situation (ibid). Therefore the best spatial interpolation method should be selected
on the basis of quantitative analysis of the data and repeated experiments. In
addition, the results of spatial interpolation should be strictly examined for validity (ibid).
Studies relating to air pollution that implemented spatial interpolation to map air quality at
European scale used additional datasets like land cover, elevation, meteorology and
population density to improve the methodology and reduce uncertainties in their models
(Horálek et al., 2007; Horálek et al., 2010; Smet et al., 2009). A series of technical
publications by the European Topic Centre on Air and Climate Change (ETC/ACC) deals
extensively with the topic of interpolation and air quality mapping. In particular, a paper by
Horálek and others (2007) on "Spatial Mapping of Air Quality for European Scale
Assessment" is comprehensive and incorporates most of the common interpolation techniques,
together with supplementary data, to map air quality in urban and rural areas
across Europe. From a series of tests and models they concluded that for air quality
assessment, kriging methods are generally preferred over IDW and, for PM10, lognormal
kriging over ordinary kriging. In addition, methodologies based on
linear regression using supplementary data are preferred over pure interpolation methods. The
use of concurrent meteorological data is reported to give better results than climatological
data (Horálek et al., 2007). All these preferences and recommendations are based on repeated
testing of their methodologies over time. Others, like Naoum and Tsanis (2004), stated that
despite the numerous articles written about interpolation, there is little or no agreement
among the authors on the superiority of some techniques over others. They added that
judgement and experience come into play when considering which interpolation method to
use. In a personal communication (2011), Peter de Smet, a leading researcher on air
quality mapping over Europe from the European Topic Centre on Air Pollution and Climate
Change Mitigation (ETC/ACM), confirmed that their tests showed the kriging method to
produce accurate results and IDW usually to produce high uncertainties.
After considering the data available to us, the methodologies used in other studies, the
tools at our disposal and testing several interpolation techniques, the options came down to
kriging and inverse distance weighting (IDW). Although kriging is preferred over IDW for
mapping air quality at European scale, IDW remains popular where there are fewer
data points and is suitable for rapid interpolation of in-situ air quality data. When both kriging
and IDW were tested by varying the number of monitoring stations, the differences were
acceptable and the values for IDW remained generally consistent. In addition, IDW retains a
larger range of the original data after interpolation than kriging.
IDW is grounded on the principle of inverse distance, where the value of each cell is
a linearly weighted combination of a set of sample points. The weight assigned to each sample point is a
function of its distance from the output cell location, the so-called distance
decay concept. In other words, its estimates are based on values at nearby locations weighted
only by distance from the interpolation location. The greater the distance, the less influence
the point has on the output value. IDW does not make assumptions about spatial relationships
except the basic assumption that nearby points ought to be more closely related than distant
points to the value at the interpolated location (Naoum and Tsanis, 2004). IDW interpolation
is preferred over kriging in this work and will be demonstrated in the following section.
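The principle can be written as z(x0) = sum(w_i * z_i) / sum(w_i), with weights w_i = 1 / d_i^p, where d_i is the distance from sample i to the location being estimated and p is the power parameter. The sketch below implements this directly. It is a minimal illustration rather than the ArcGIS implementation: it uses all sample points instead of a search neighbourhood, the power of 2 is an assumed (though commonly used) setting, and the sample values are hypothetical.

```python
import math


def idw(x, y, samples, power=2.0):
    """Inverse distance weighted estimate at (x, y) from (sx, sy, value) samples."""
    num = den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value             # exactly on a station: return its measurement
        w = 1.0 / d ** power         # weight decays with distance
        num += w * value
        den += w
    return num / den


# Hypothetical stations (x, y in km, concentration), for illustration only.
stations = [(0.0, 0.0, 10.0), (10.0, 0.0, 30.0), (0.0, 10.0, 20.0)]
print(idw(2.0, 1.0, stations))       # dominated by the nearest station at (0, 0)
```

Because the estimate is a weighted average with positive weights, it can never fall outside the range of the measured values. This is consistent with the observation above that IDW retains much of the range of the original data.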
Choosing an appropriate geostatistical model
The various geostatistical analysis models have advantages and drawbacks for different
applications. After consulting a number of sources and testing the available interpolation
techniques using Ireland as a case study, the inverse distance weighted (IDW) and kriging
methods appear to suit the mapping of PM10 and other pollution data. IDW is a deterministic
method that is mentioned in a number of sources as a suitable choice, while kriging, which is a
geostatistical technique, is also recommended.
A test over Germany using IDW and kriging was carried out and the values compared on a
cell-by-cell basis. The difference between them ranges from -9 to 9, and while IDW retains a
wider range of values after interpolation, kriging appears to trim the values
further, resulting in a narrower range. In the test with PM10 for 2005 over
Germany, the IDW interpolated values range from 11.6 to 36.5 with a mean of 24.1, whereas
the kriging values range from 17.1 to 31 with a mean of 24.1. The standard deviation is 3.2 for IDW
and 2.6 for kriging. An Excel table on the test with descriptive statistics is available on request
to help explain the differences between the two. In spite of their differences, the majority of
the values compare well (Figure 4a).
Figure 4a. IDW interpolation (A) compared to Kriging interpolation (B).
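A cell-by-cell comparison of this kind reduces to a few descriptive statistics over two equally sized lists of cell values. The sketch below shows the calculation. The function name `compare_surfaces` is hypothetical, the input values are toy numbers rather than the Germany results, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def describe(values):
    """Minimum, maximum, mean and (population) standard deviation of cell values."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "sd": statistics.pstdev(values),
    }


def compare_surfaces(idw_cells, kriging_cells):
    """Compare two interpolated surfaces cell by cell."""
    if len(idw_cells) != len(kriging_cells):
        raise ValueError("surfaces must have the same number of cells")
    diffs = [a - b for a, b in zip(idw_cells, kriging_cells)]
    return {
        "idw": describe(idw_cells),
        "kriging": describe(kriging_cells),
        "difference": describe(diffs),
    }


# Toy cell values, for illustration only.
result = compare_surfaces([12.0, 24.0, 36.0], [18.0, 24.0, 30.0])
for name, stats in result.items():
    print(name, stats)
```

In the toy example both surfaces share the same mean while the second has the narrower range and smaller standard deviation, which is the qualitative pattern reported for kriging relative to IDW.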
Data preparation prior to Interpolation
Prior to any implementation of interpolation, a number of steps are necessary to ensure the
correct outcomes. The first step is to reproject the pollution layer from WGS84 to the ETRS 1989
projection system on the European datum. Since the air quality data are a composite of
time-series data between 2001 and 2008, the next step is to filter and separate them into
individual layers, which results in 8 separate GIS layers for each pollution type. Then the
pollution data for all years are loaded into ArcMap one pollutant at a time for processing.
At this stage, the geoprocessing environment is configured to use the EU countries as the
maximum processing extent for interpolation. After setting the environment, the IDW
interpolation tool is invoked, which opens the IDW interpolation dialogue window
(see Fig.5). The input point features parameter takes the pollution point data, the Z value
field is the field holding the air quality data to interpolate, and the output raster gives
the name and location of the interpolation result. After the dialogue options are selected
and adjusted, the interpolation is executed. The result is displayed immediately, as shown
in Fig.6. This procedure is repeated for all the years from 2001 to 2008 for each pollutant.
On completion of the interpolation processing, 48 raster layers have been created from 6
pollution types.
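To make the interpolation step concrete, the following is a minimal sketch of inverse distance weighting on a regular grid. It is not the ArcMap implementation: it uses every station for every cell (no search radius or neighbour limit), assumes a power of 2, and the station coordinates and values are made up. The function names idw and idw_surface are hypothetical.

```python
import math


def idw(x, y, stations, power=2.0):
    """Estimate the value at (x, y) from (sx, sy, value) stations by IDW."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # the point lies exactly on a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def idw_surface(stations, xmin, ymin, ncols, nrows, cell_size, power=2.0):
    """Interpolate at every cell centre; row 0 is the northernmost row."""
    surface = []
    for r in range(nrows):
        yc = ymin + (nrows - r - 0.5) * cell_size
        surface.append([idw(xmin + (c + 0.5) * cell_size, yc, stations, power)
                        for c in range(ncols)])
    return surface


# Made-up stations: (x in metres, y in metres, annual mean PM10).
stations = [(0.0, 0.0, 10.0), (10000.0, 0.0, 30.0), (0.0, 10000.0, 20.0)]
surface = idw_surface(stations, xmin=0.0, ymin=0.0, ncols=2, nrows=2, cell_size=5000.0)
for row in surface:
    print([round(v, 2) for v in row])
```

Each output cell is a weighted average of the station values, with nearer stations weighted more heavily.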
Figure 5. Dialogue windows for IDW interpolation in ArcMap: default user interactive window (A) and
populated IDW window ready for processing (B).
Figure 6. The outcome of IDW Interpolation for PM10 for 2001 across the 23 EU member states.
The ideal scenario would be to query and extract information on air quality from a single
database as opposed to dealing with a number of separate layers which would involve
additional time and effort besides taking up more storage space, particularly for raster
datasets. In order to combine all the air quality data into a single GIS database, a vector grid
matrix is created as a container to store the extracted raster values from the interpolation
results. The steps involved are detailed as follows.
Fishnet vector grid matrix to store interpolation results
The fishnet function in ArcGIS can create a regular vector grid matrix of any cell size over
any given extent. With the EU countries set as the maximum geoprocessing extent, the fishnet
tool is used to create a vector grid matrix of 5x5 km cell size. The result is a vector
GIS layer that spans the extent of the EU countries and contains 534,378 grid cells (Fig.7-A).
A spatial overlay is then made between this grid and the EU countries layer, and the grid
cells that intersect with the EU countries are saved to a new GIS layer for further
processing. The resulting grid matrix has 177,645 grid cells (Fig.7-B).
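The two operations above (building the fishnet, then keeping only the cells over land) can be sketched as follows. This is an illustration under simplifying assumptions: the extent and the triangular "country" polygon are made up, and cells are kept when their centre falls inside the polygon, which is a simpler test than the intersect overlay used in the report and would drop cells that only partly overlap the boundary. All function names are hypothetical.

```python
import math


def fishnet(xmin, ymin, xmax, ymax, cell_size):
    """Return (cell_id, (x0, y0, x1, y1)) squares covering the extent, bottom row first."""
    ncols = math.ceil((xmax - xmin) / cell_size)
    nrows = math.ceil((ymax - ymin) / cell_size)
    cells = []
    for r in range(nrows):
        for c in range(ncols):
            x0 = xmin + c * cell_size
            y0 = ymin + r * cell_size
            cells.append((r * ncols + c, (x0, y0, x0 + cell_size, y0 + cell_size)))
    return cells


def point_in_polygon(x, y, polygon):
    """Ray casting test for a simple polygon given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def select_cells(cells, polygon):
    """Keep the cells whose centre lies inside the polygon (simplified overlay)."""
    selected = []
    for cell_id, (x0, y0, x1, y1) in cells:
        if point_in_polygon((x0 + x1) / 2.0, (y0 + y1) / 2.0, polygon):
            selected.append((cell_id, (x0, y0, x1, y1)))
    return selected


# Made-up extent (metres) and a triangular stand-in for a country outline.
cells = fishnet(0.0, 0.0, 20000.0, 15000.0, cell_size=5000.0)
country = [(0.0, 0.0), (20000.0, 0.0), (0.0, 15000.0)]
selected = select_cells(cells, country)
print(len(cells), "cells in the full fishnet,", len(selected), "kept")
```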
Figure 7. The EU countries GIS layer on the left is used to select the 5x5km vector grid that it intersects
with and the result saved to a new file as shown on the right.
Since the interpolation and vector grid layers are created from a common map extent and
both use a cell size of 5 km, there is an exact match when the vector grid is
superimposed on the interpolation raster layer (see Fig.8). The attribute table of the vector
grid is restructured and 8 new fields are added, named sequentially from PM10_2001 to
PM10_2008. These fields will store the pollution values of the corresponding years from the
interpolation raster layers. One vector GIS grid matrix is created for each pollution type,
resulting in 6 grids whose attribute tables are restructured to store data for all years.
Figure 8. Overlay of vector grid matrix on PM10 raster interpolation result showing the exact
match between the raster cells and the vector grid cells. The figure on the right is an inset of the
box on the figure to the left.
In the next step the vector grid is overlaid on the raster layer and the raster pixel values
are transferred to the corresponding vector grid cells. The pixel values of each yearly
raster layer are transferred to the attribute for that year in the vector GIS database. For
example, the pixel values from PM10 for 2001 are transferred to the attribute (field/column)
for 2001 in the GIS attribute table, the pixel values for PM10 for 2002 are stored in the
attribute for 2002 and so on (see Fig.10). This process is repeated for all 8 raster layers
until all the corresponding 8 attributes in the vector database for PM10 are updated. The
advantages of storing the data in vector format are the flexibility to query the database in
various ways and the smaller storage space that the vector data structure occupies. The same
process is replicated for the remaining five pollutants and, in the end, the 48 raster layers
are compacted into just 6 vector GIS layers.
Transferring raster cell values to vector attribute table
The transfer of raster cell values to the grid matrix is a two-step process. First, an
intermediate point layer is created from the polygon grid matrix layer (Fig.9-A), where each
point represents the centre (centroid) of a grid cell (Fig.9-B). Using a GIS surface overlay
tool called Extract Raster Values (Fig.9-D), the raster cell values (Fig.9-C) are extracted
and transferred to the attribute table of the corresponding points (see Fig.10). The second
step is to transfer the attributes from the point GIS layer across to the polygon grid
matrix through the attribute transfer function. Since the points were created from the grid
matrix layer, they have identical feature identifiers, and the values in the point layer are
transferred by matching these identifiers. This procedure is repeated for all the other
layers.
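The two-step transfer can be sketched as below. This is a simplified illustration, not the ArcGIS tools themselves: the 2x2 grid, the raster values and the function names are made up, rasters are plain lists of rows with row 0 at the top, and the attribute tables are ordinary dictionaries keyed by the shared cell ID.

```python
def centroids(cells):
    """Step 1a: one centre point per grid cell, keeping the cell's ID."""
    return {cid: ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
            for cid, (x0, y0, x1, y1) in cells.items()}


def sample_raster(raster, xmin, ymax, cell_size, x, y):
    """Return the raster value under point (x, y); raster row 0 is the top row."""
    col = int((x - xmin) // cell_size)
    row = int((ymax - y) // cell_size)
    return raster[row][col]


def transfer_values(cells, rasters_by_year, xmin, ymax, cell_size, prefix="PM10"):
    """Fill one field per year (e.g. PM10_2001) in the polygon attribute table."""
    points = centroids(cells)
    # Step 1b: extract the raster values into the point attribute table.
    point_table = {cid: {} for cid in points}
    for year, raster in rasters_by_year.items():
        for cid, (x, y) in points.items():
            point_table[cid][f"{prefix}_{year}"] = sample_raster(
                raster, xmin, ymax, cell_size, x, y)
    # Step 2: copy the attributes to the polygons by matching the shared IDs.
    polygon_table = {cid: {"bounds": bounds} for cid, bounds in cells.items()}
    for cid, attributes in point_table.items():
        polygon_table[cid].update(attributes)
    return polygon_table


# Made-up 2x2 grid of 5 km cells (IDs 0-3, bottom row first) and two yearly rasters.
cells = {0: (0.0, 0.0, 5000.0, 5000.0), 1: (5000.0, 0.0, 10000.0, 5000.0),
         2: (0.0, 5000.0, 5000.0, 10000.0), 3: (5000.0, 5000.0, 10000.0, 10000.0)}
rasters = {2001: [[1.0, 2.0], [3.0, 4.0]], 2002: [[5.0, 6.0], [7.0, 8.0]]}

table = transfer_values(cells, rasters, xmin=0.0, ymax=10000.0, cell_size=5000.0)
for cid, row in table.items():
    print(cid, row)
```

Because the grid and the rasters share the same extent and cell size, every centroid falls in the middle of exactly one raster cell, so no resampling is involved.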
Figure 9. Raster Value Extraction procedure showing the construction of vector grid matrix (A),
creating points/centroids from the grid (B) and extracting the raster pixel values (C). The GIS
tool used to extract raster values is shown in D.
[Figure 10 panels: PM10 2001, PM10 2002, PM10 2003, PM10 2004, PM10 2005, PM10 2006, PM10 2007, PM10 2008]
Figure 10. The pixel values of the 8 raster layers for PM10 from 2001-2008 (top) are transferred to the
corresponding fields in the vector grid attribute table (bottom), resulting in only one file storing PM10
data for all years.
Querying and mapping the pollution data at 5x5 km grid matrix
The previous step produced 6 vector grid layers which can be queried and mapped. The IDW
interpolation results show a narrower range than the original air quality values: the
minimum is raised and the maximum is lowered. For example, the raw data for PM10 for 2001
ranged from 6.3 to 103.4; after interpolation the range narrows to 8 to 95. The computed
values cannot be higher than the highest or lower than the lowest value of the original
data, a typical feature of IDW interpolation. From one vector GIS layer, 8 time series maps
for 2001 to 2008 can be produced (Fig.11). A variety of statistical analyses and queries are
possible when the data are in vector format.
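The bound on IDW values noted above follows directly from the form of the estimator. The notation below is an assumption introduced for illustration (the report does not give a formula): z_i are the observed station values, d(s_0, s_i) is the distance from the target location s_0 to station i, and p is the power parameter, whose value the report does not state.

```latex
\hat{z}(s_0) = \frac{\sum_{i=1}^{n} w_i \, z_i}{\sum_{i=1}^{n} w_i},
\qquad w_i = \frac{1}{d(s_0, s_i)^{p}}, \quad p > 0

% Every weight is positive, so \hat{z}(s_0) is a weighted average of the z_i:
\min_i z_i \;\le\; \hat{z}(s_0) \;\le\; \max_i z_i

% PM10, 2001 (values quoted in the text): 6.3 \le 8 \quad\text{and}\quad 95 \le 103.4
```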
Figure 11. Rapid outputs of annual mean PM10 from 2001 to 2008 across the study area.
3.2.3 Integration of air quality with NUTS data
The smallest mapping unit, or resolution, of the air quality maps is 5x5 km. These data can
be integrated with other datasets such as climate, demographic, socio-economic and transport
data to enrich the data content and help answer questions on a broad range of themes. One of
the aims of this study is to map air quality at NUTS3 level. This implies aggregating the
interpolation results to NUTS3 scale, which involves a number of steps to produce a new
dataset that integrates information on both NUTS3 and pollution.
The first step requires combining the air quality data and NUTS3 into a single dataset. This
task is carried out with the union GIS overlay operation. Union computes the geometric
intersection of the two GIS data layers and writes every resulting piece to the output with
the attributes of both inputs. After the union, some clean-up is necessary: the portions of
grid cells that fall outside the boundaries of the countries are eliminated.
At this stage the smallest mapping unit is 5x5 km with the NUTS3 boundaries infused,
cutting through the grid cells like a cookie cutter (see Fig.12-E). Since the desired output
is a map of air quality at NUTS3 level, the air quality data are aggregated to NUTS3 level
using a generalization tool called dissolve, which melts away the 5x5 km grid cells within
each NUTS3 unit. During the aggregation, the numeric data are averaged, while for the
non-numeric data either the first or the last value is retained. After the dissolve
operation, the pollution data are aggregated at NUTS3 level and can be queried and mapped.
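The aggregation performed by the dissolve step can be sketched as a group-by on the NUTS3 code. The example is illustrative only: the pieces and their PM10 values are made up, dissolve_mean is a hypothetical helper, and the mean is unweighted, as described above, so a small clipped fragment counts as much as a full 5x5 km cell (an area-weighted mean would be an alternative design).

```python
from collections import defaultdict


def dissolve_mean(pieces, key, numeric_fields, text_fields=()):
    """Merge pieces sharing the same key: average numeric fields, keep first text value."""
    groups = defaultdict(list)
    for piece in pieces:
        groups[piece[key]].append(piece)
    result = {}
    for unit, members in groups.items():
        row = {f: sum(m[f] for m in members) / len(members) for f in numeric_fields}
        for f in text_fields:
            row[f] = members[0][f]  # "first" rule for non-numeric attributes
        result[unit] = row
    return result


# Made-up grid pieces after the union with the NUTS3 layer.
pieces = [
    {"NUTS3": "DE111", "COUNTRY": "DE", "PM10_2005": 20.0},
    {"NUTS3": "DE111", "COUNTRY": "DE", "PM10_2005": 24.0},
    {"NUTS3": "DE112", "COUNTRY": "DE", "PM10_2005": 30.0},
]

by_nuts3 = dissolve_mean(pieces, "NUTS3", ["PM10_2005"], ["COUNTRY"])
for code, row in by_nuts3.items():
    print(code, row)
```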
The workflow in Fig.1 is illustrated in Fig.12 using Germany as an example. It shows the
steps involved in extracting air quality monitoring stations over Germany, then applying the
interpolation techniques to create a surface and extracting the raster values to vector grids
of 5x5 km resolution. The values are then transferred and averaged to the NUTS3 level, where
PM10 for 2005 is mapped (Fig.12:A-G).
Figure 12. Demonstrating actual steps involved in creating PM10 map at NUTS3 scale for Germany.
Notes on Fig.12 A-G
A: Spatial distribution of monitoring stations across Germany.
B: IDW interpolation result from the monitoring stations.
C: Extract raster values from interpolation surface to 5x5 km vector grid matrix through overlay function.
D: Mapping PM10 at 5x5 km grid resolution.
E: Geometric intersection with NUTS3 GIS layer through union overlay function.
F: Dissolving the 5x5 km grid resulting in air quality data aggregated and mapped at NUTS3 scale.
G: Final PM10 map for 2005 over Germany at NUTS3 scale.
4. GIS analysis and mapping of climate and land use data
4.1 Climate data
Temperature and precipitation data were obtained from the European gridded data set of
surface temperature and precipitation for the period 1950-2008, version 4.0, produced by
the European Climate Assessment & Dataset (ECA&D). This freely available dataset
contains daily observations at meteorological stations throughout Europe and the
Mediterranean (http://eca.knmi.nl/).
The selected data files were compressed in NetCDF format, with an original resolution of
0.25 degrees on a latitude-longitude grid. The relevant NetCDF files were extracted using
the specialised software CDO (Climate Data Operators), produced by the Max Planck Institute
for Meteorology. CDO is a collection of tools developed to manipulate and analyse climate
and forecast model data.
The factors produced were: i) mean annual temperature, ii) maximum temperature (July),
iii) minimum temperature (January) and iv) mean annual precipitation. Mean values were
obtained from the daily data with the use of CDO.
The extracted maps were re-projected from their latitude and longitude coordinates to
Lambert Azimuthal Equal Area (LAEA) and re-sampled to a 5,000 metre resolution with the
use of ArcGIS. The resulting maps show continuous temperature and precipitation data, in
which values may change from one 5,000 metre cell to the next.
To obtain mean values at NUTS 3 level, the images for each factor were analysed using the
Zonal Statistics module in ArcGIS. This module calculates statistics on the values of a
raster within the zones of another dataset. In this case, the NUTS 3 dataset defined the
zones and each factor supplied the values to be aggregated.
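The idea behind a zonal mean can be sketched with two aligned grids, one holding the climate values and one holding the zone identifier of each cell. This is an illustration only, not the ArcGIS module: the grids are made up, the zone labels "A" and "B" stand in for NUTS 3 codes, and zonal_mean is a hypothetical helper that skips cells without data.

```python
def zonal_mean(value_grid, zone_grid, nodata=None):
    """Average the values of value_grid within each zone of zone_grid."""
    totals = {}
    counts = {}
    for value_row, zone_row in zip(value_grid, zone_grid):
        for value, zone in zip(value_row, zone_row):
            if value == nodata or zone == nodata:
                continue  # ignore cells with no value or no zone
            totals[zone] = totals.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Made-up mean annual temperature grid (degrees Celsius) and matching zone grid.
temperature = [[8.0, 9.0, None],
               [10.0, 11.0, 12.0]]
zones = [["A", "A", "B"],
         ["A", "B", "B"]]

means = zonal_mean(temperature, zones)
print(means)
```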
In total, 32 maps were produced from the period of 2001 to 2008, one for each year for each
of the four climatic variables (see Fig.13 for an example).
In order to produce attribute tables for each factor and link those to the NUTS 3 database,
the maps produced were transformed to vector files. This produced attribute tables for each
factor with the same Object ID as the NUTS 3 layer, making it possible to link the databases.
Finally, with the use of the Spatial Join module, all attribute tables were integrated into
the NUTS 3 database. The Spatial Join module creates a table join in which fields from one
layer's attribute table are appended to another layer's attribute table based on the relative
location of the features in the two layers.
[Figure 13 panels: Mean Annual Temperature at NUTS 3 level, one map per year from 2001 to 2008; legend (TC): High 20, Low -30]
Figure 13. Mean Annual Temperature 2001 to 2008 across the study area.
4.2 Land Use Data
The vector of spatial land use factors comes from the Coordination of Information on the
Environment (CORINE) land cover database. CORINE is a pan-European database compiled
within each European member state. It is a vector spatial dataset in which land cover is
digitized from the interpretation of medium resolution satellite imagery and assigned a land
use class based on a standardized land cover nomenclature defined by the European Environment
Agency. The minimum area mapped in the dataset is 25 hectares. Within this research, broad
land use statistics were derived from the CORINE database for the years 2000 and 2006. For
the purposes of this study, the original 44 land use categories of the CORINE nomenclature
are re-categorised into the following classes.
1. Residential
2. Commercial and Industrial
3. Mines and Dumps
4. Green Urban Spaces
5. Agricultural Land
6. Forestry
7. Natural Areas (see footnote 6)
8. Waterbodies
This re-categorisation is considered more appropriate to capture the environmental typologies
of interest in this study due to the low spatial resolution of 25 hectares and for econometric
purposes (i.e. to avoid multicollinearity).
As a quantitative representation of land cover in the ESS regions we have used areas
in square meters for each of the 44 CORINE land cover classes (see Appendix). However, this
was not straightforward, as the regions used in ESS for different countries were based on the
boundaries of different NUTS levels (Table 2). Moreover, in some cases different NUTS
levels were used even within the same country (e.g. France and Ireland). Therefore, we
have composed an ESS regions map, which includes the boundaries of the corresponding NUTS
level for each country. Europe NUTS 1-, 2- and 3-level maps provided by ESRI were used as
a base for this composite map, which we will call the ESS Regions map hereafter.
Figure 14: Geographical coverage of CORINE for 2000 and 2006
6 Natural Areas are EU-designated as areas of outstanding natural beauty.
As mentioned above, the studied ESS rounds were implemented in 2001, 2004 and 2006.
Thus, only for 2006 do we have data from both sources (ESS and CORINE). We have therefore
used linear interpolation to estimate the land cover statistics in the intermediate
years of 2001 and 2004, based on the CORINE data of 2000 and 2006. Specifically, in the
first stage the ArcGIS Spatial Analyst Tabulate Area function was used to calculate areas by
land cover class for each region of the ESS Regions map, based on the 2000 and 2006
CORINE raster maps. The results were then exported to MS Excel and used to estimate the
corresponding values of 2001/2 and 2004 with the following formulas for each land cover class
and region:
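$$CLC^{n,c}_{2002} = CLC^{n,c}_{2000} + \frac{2}{6}\left(CLC^{n,c}_{2006} - CLC^{n,c}_{2000}\right),$$
$$CLC^{n,c}_{2004} = CLC^{n,c}_{2000} + \frac{4}{6}\left(CLC^{n,c}_{2006} - CLC^{n,c}_{2000}\right),$$
where $CLC^{n,c}_{y}$ is the area of a land cover class c in a NUTS unit n in the year y.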
As a result, for each region used in ESS we obtained 44 land cover class areas in square
meters for 2000 and 2006 from the actual CORINE datasets, and for 2001/2 and 2004/5 from
linear interpolation. Finally, the resulting land cover data table was joined with the ESS
dataset and the demographic data provided by the ESRI Europe NUTS maps. The ArcGIS Join
attributes from a table function was applied using the NUTS name as the linking field.
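The linear interpolation between the two CORINE years can be sketched as follows. This is a minimal illustration rather than the spreadsheet used for the report: the region label "R1", the class codes chosen and the areas are made up, and interpolate_area is a hypothetical helper.

```python
def interpolate_area(area_2000, area_2006, year):
    """Linearly interpolate a land cover area between the 2000 and 2006 CORINE values."""
    return area_2000 + (area_2006 - area_2000) * (year - 2000) / (2006 - 2000)


# Made-up areas in square metres, keyed by (region, CORINE class code).
areas_2000 = {("R1", 2): 1_200_000.0, ("R1", 18): 9_000_000.0}
areas_2006 = {("R1", 2): 1_500_000.0, ("R1", 18): 8_400_000.0}

estimates = {
    year: {key: interpolate_area(areas_2000[key], areas_2006[key], year)
           for key in areas_2000}
    for year in (2002, 2004)
}

for year, table in estimates.items():
    print(year, table)
```

For 2002 the estimate moves one third of the way from the 2000 value to the 2006 value, and for 2004 two thirds of the way, whether the class area is growing or shrinking.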
References
Bond, Derek and Devine, Paula. The Role of Geographic Information Systems in Survey
Analysis. The Statistician, 1991, 40 (2), pp. 209-216.
Brereton, F., Clinch, J.P. and Ferreira, S. (2008) Happiness, Geography and the
Environment. Ecological Economics 65, 386-396.
Butz, W.P. & Torrey, B.B. (2006) Some Frontiers in Social Science, Science, 312, 30,
1898-1900.
Childs, C. 2004. Interpolating Surfaces in ArcGIS Spatial Analyst. Developer's Corner -
ArcUser July-September 2004 ESRI.
Denby, B., Garcia, V., Holland, D. & Hogrefe, C. 2010. Integration of air quality modeling
and monitoring data for enhanced health exposure assessment. EM Magazine
[Online], Special Issue. Available:
http://www.google.ie/url?sa=t&source=web&cd=1&ved=0CBUQFjAA&url=http%3A%2F%2Foaspub.epa.gov%2Feims%2Feimscomm.getfile%3Fp_download_id%3D491678&ei=9SI0TrLKHYbMhAfc3NHiCg&usg=AFQjCNFvcSq8o9Gm6ZGWASgacUkky3Gj-Q
[Accessed 30 July 2011].
Goodchild, Michael, F. and Haining, Robert, P. GIS and Spatial Data Analysis: Converging
Perspectives. Papers in Regional Science, 2004, 83, pp. 363-385.
Horálek, J., Denby, B., Smet, P. d., Leeuw, F. d., Kurfürst, P., Swart, R. & Noije, T. v. 2007.
Spatial mapping of air quality for European scale assessment. ETC/ACC Technical
Paper 2006/6. Bilthoven: European Topic Centre on Air and Climate Change.
Horálek, J., Smet, P. d., Leeuw, F. d., Coňková, M., Denby, B. & Kurfürst, P. 2010.
Methodological improvements on interpolating European air quality maps. ETC/ACC
Technical Paper 2009/16. Bilthoven: European Topic Centre on Air and Climate
Change.
Li, X., Cheng, G. & Lu, L. 2000. Comparison of Spatial Interpolation Methods. Advances in
Earth Science, 260-265.
Luechinger, S., (2009). Valuing Air Quality Using the Life Satisfaction Approach. Economic
Journal 119, 482-515.
MacKerron, G., and S. Mourato, (2009). Life satisfaction and air quality in London,
Ecological Economics, 68(5): 1441-1453
Naoum, S. & Tsanis, I. K. 2004. Ranking Spatial Interpolation Techniques using a GIS-
Based DSS. Global Nest The International Journal, 6, 1-20.
Smet, P. d. 2011. RE: Interpolation techniques and modelling for mapping air monitoring
values across EU. Type to Ningal, T.
Smet, P. d., Horálek, J., Coňková, M., Kurfürst, P., Leeuw, F. d. & Denby, B. 2009.
European air quality maps of ozone and PM10 for 2007 and their uncertainty analysis.
ETC/ACC Technical Paper 2009/9. Bilthoven: European Topic Centre on Air and
Climate Change.
Wu, Yi-Hwa; Miller, Harvey, J. and Hung, Ming-Chih. A GIS-based Decision Support
System for Analysis of Route Choice in Congested Urban Road Networks. Journal
of Geographical Systems, 2001, 3, pp. 3-24.
Appendix. Classes of the CORINE nomenclature (3 levels)
GRID CODE | LABEL1 | LABEL2 | LABEL3
1 | Artificial surfaces | Urban fabric | Continuous urban fabric
2 | Artificial surfaces | Urban fabric | Discontinuous urban fabric
3 | Artificial surfaces | Industrial, commercial and transport units | Industrial or commercial units
4 | Artificial surfaces | Industrial, commercial and transport units | Road and rail networks and associated land
5 | Artificial surfaces | Industrial, commercial and transport units | Port areas
6 | Artificial surfaces | Industrial, commercial and transport units | Airports
7 | Artificial surfaces | Mine, dump and construction sites | Mineral extraction sites
8 | Artificial surfaces | Mine, dump and construction sites | Dump sites
9 | Artificial surfaces | Mine, dump and construction sites | Construction sites
10 | Artificial surfaces | Artificial, non-agricultural vegetated areas | Green urban areas
11 | Artificial surfaces | Artificial, non-agricultural vegetated areas | Sport and leisure facilities
12 | Agricultural areas | Arable land | Non-irrigated arable land
13 | Agricultural areas | Arable land | Permanently irrigated land
14 | Agricultural areas | Arable land | Rice fields
15 | Agricultural areas | Permanent crops | Vineyards
16 | Agricultural areas | Permanent crops | Fruit trees and berry plantations
17 | Agricultural areas | Permanent crops | Olive groves
18 | Agricultural areas | Pastures | Pastures
19 | Agricultural areas | Heterogeneous agricultural areas | Annual crops associated with permanent crops
20 | Agricultural areas | Heterogeneous agricultural areas | Complex cultivation patterns
21 | Agricultural areas | Heterogeneous agricultural areas | Land principally occupied by agriculture, with significant areas of natural vegetation
22 | Agricultural areas | Heterogeneous agricultural areas | Agro-forestry areas
23 | Forest and semi natural areas | Forests | Broad-leaved forest
24 | Forest and semi natural areas | Forests | Coniferous forest
25 | Forest and semi natural areas | Forests | Mixed forest
26 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Natural grasslands
27 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Moors and heathland
28 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Sclerophyllous vegetation
29 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Transitional woodland-shrub
30 | Forest and semi natural areas | Open spaces with little or no vegetation | Beaches, dunes, sands
31 | Forest and semi natural areas | Open spaces with little or no vegetation | Bare rocks
32 | Forest and semi natural areas | Open spaces with little or no vegetation | Sparsely vegetated areas
33 | Forest and semi natural areas | Open spaces with little or no vegetation | Burnt areas
34 | Forest and semi natural areas | Open spaces with little or no vegetation | Glaciers and perpetual snow
35 | Wetlands | Inland wetlands | Inland marshes
36 | Wetlands | Inland wetlands | Peat bogs
37 | Wetlands | Maritime wetlands | Salt marshes
38 | Wetlands | Maritime wetlands | Salines
39 | Wetlands | Maritime wetlands | Intertidal flats
40 | Water bodies | Inland waters | Water courses
41 | Water bodies | Inland waters | Water bodies
42 | Water bodies | Marine waters | Coastal lagoons
43 | Water bodies | Marine waters | Estuaries
44 | Water bodies | Marine waters | Sea and ocean
48 | NODATA | NODATA | NODATA
49 | UNCLASSIFIED | UNCLASSIFIED LAND SURFACE | UNCLASSIFIED LAND SURFACE
50 | UNCLASSIFIED | UNCLASSIFIED WATER BODIES | UNCLASSIFIED WATER BODIES
255 | UNCLASSIFIED | UNCLASSIFIED | UNCLASSIFIED