26 March 2015 -- 09:30–11:00
Big data in official statistics
Part II The case of mobile phone data for tourism statistics
EMOS Spring School
23 – 27 March 2015
EUROSTAT Task Force on Big Data &
Unit G-3 ‘Short-term business statistics and tourism’
Outline of the session
Feasibility study on using mobile phone data for tourism statistics
Rationale and objectives of the project
Barriers to access
Methodological challenges
Coherence
Opportunities and benefits
Conclusions and points for further discussion
Q&A
2
Why a project on using mobile phone data?
The world around us is changing
Changing geo-political environment
e.g. free movement of persons in the Schengen area: impact on border surveys
Quickly evolving technology and large-scale adoption of tools/devices
Changing working environment of official statisticians
New technologies, new techniques, new sources and
a new 'Zeitgeist' boost and stimulate a paradigm shift
in official statistics
3
Why a project on using mobile phone data?
Potential of mobile positioning data, expectations
Making collection and compilation of data more efficient: reducing
burden and improving quality?
e.g. reduction of data entry error, reduction of recall bias (short
trips, same-day visits)
Partly replace data collection on tourism flows within the EU
(domestic, outbound)?
Complete or enhance current data on domestic and outbound
tourism flows (Regulation 692/2011) with data on total inbound
tourism flows?
4
Why a project on using mobile phone data?
Potential of mobile positioning data, expectations (continued)
Further harmonisation?
e.g. use of algorithms rather than subjective opinion/memory of
the respondent
Extension to other domains?
e.g. travel, passenger mobility, migration
Information previously not available
e.g. data at more detailed regional level or destination level,
infra-monthly data (day, week, weekends)
5
Which were the main objectives of the project?
In a nutshell:
Getting answers to the many questions
raised by "doubters"/"non-believers"
(but also by "believers") in big data,
in particular mobile phone data as a
source for tourism statistics
Is this only a daydream nation or
possibly a promised land for statisticians?
"What about those who don't
use mobile phones?"
"I live near the border and sometimes
connect to a foreign network!"
"Tourists buy foreign SIM cards when travelling,
don't they?"
6
Which were the main objectives of the project?
St. Peter's Square, Vatican City, 2005,
Benedictus becomes the new pope
St. Peter's Square, Vatican City, 2013,
Franciscus becomes the new pope
7
But if the coverage is not complete, how can we use it as a
reliable basis?
We should also look at how things are
currently being done!
[Figure: Percentage of households having access to a mobile phone (2006), by country; vertical axis 0 to 100. Countries shown: IS, FI, SE, NO, LU, IT, DK, NL, CY, UK, SI, IE, AT, MT, ES, EE, PT, DE, CZ, SK, LV, EU, BE, HU, FR, LT, EL, PL, BG, RO]
Italy: 93%
Penetration rate of fixed lines for CATI
interviews by ISTAT: 49%
Which were the main objectives of the project?
"Vertical" objectives (task by task)
Assess feasibility to access databases with mobile
positioning data in European countries
Assess the feasibility to use mobile positioning data for
tourism statistics in the European context
Identify, discuss and address the main challenges for
implementation
Assess the potential impact on cost-efficiency of data
production
Assess the possibility to expand the methodology to other
domains and define joint algorithms
8
Which were the main objectives of the project?
"Horizontal" objectives (cross-cutting approach)
Mix of scientific/theoretical & practical/empirical/applied work!
Can the methodology/technology be applied to the particular
case of tourism statistics (with its specific international
definitions)?
Can it be applied across a wide group of countries in a similar
way?
Can the outcomes be generalised to all countries?
9
Who carried out the feasibility study?
A multidisciplinary, international consortium (DE, EE, FR, FI)
National statistical institutes
Tourism researchers
Academics
Data scientists
10
Where can I find the reports?
All reports are publicly available for download from the
Eurostat website
http://epp.eurostat.ec.europa.eu/portal/page/portal/tourism/methodology/projects_and_studies
1 consolidated report (50 pages, incl. 10 pages executive summary)
5 comprehensive reports:
Stock-taking
Feasibility of access
Feasibility of use (methodological issues)
Feasibility of use (coherence)
Opportunities and benefits
11
Stock-taking
Inventory of the work already done (focus on Europe)
Use of mobile positioning data for research, in particular for
statistics on tourism flows or any other field of official statistics
Institutional set-up (users involved, MNOs involved, technological
aspects)
Outcomes (success? failure?) and lessons to be learnt for this
project
31 cases with access to data were documented (in official
statistics, private or government initiatives, scientific research)
12
Access
Discussion of potential barriers
(and how to overcome these)
privacy issues (operator, national law)
technical issues
financial and business related issues
Improving access to mobile positioning data
is THE main short term challenge in order to
pave the way for a more generalised use of
this source of big data!
13
Access
First things first … what is mobile positioning data?
Stored records of activities of mobile devices by the mobile network
operator (MNO)
Types of data
Call detail records (CDR)
on average 4 events per subscriber per day
Data detail records (e.g. internet usage)
on average 200 events per subscriber per day
Location updates
on average 12 events per subscriber per day
Technical data
on average 100 events per subscriber per day
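To get a feel for the volumes these averages imply, a minimal back-of-the-envelope sketch follows. The per-subscriber rates are the averages quoted above; the subscriber count and the function name `daily_events` are hypothetical and chosen only for illustration.

```python
# Average events per subscriber per day, as quoted on the slide.
EVENTS_PER_SUBSCRIBER_PER_DAY = {
    "call_detail_records": 4,
    "data_detail_records": 200,
    "location_updates": 12,
    "technical_data": 100,
}


def daily_events(subscribers: int) -> dict[str, int]:
    """Return the expected number of events per day for each data type."""
    return {
        kind: rate * subscribers
        for kind, rate in EVENTS_PER_SUBSCRIBER_PER_DAY.items()
    }


if __name__ == "__main__":
    # Hypothetical operator with 1 million subscribers (illustrative only).
    volumes = daily_events(1_000_000)
    for kind, count in volumes.items():
        print(f"{kind}: {count:,} events/day")
    print(f"total: {sum(volumes.values()):,} events/day")
```

Even the smallest data type (call detail records) yields millions of rows per day at this scale, which is why processing volume appears among the technical issues.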
14
Access
First things first … what is mobile positioning data?
(continued)
For the purpose of the feasibility study, call detail records were used
Good quality as MNOs use this for billing purposes, but
nevertheless certain limitations (see further)
Basic information consists of
− subscriber ID
− country code
− time of the event
− type of event (call, sms, data)
− cell ID (location)
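As an illustration of the basic information listed above, one CDR event could be represented roughly as follows. This is a sketch only: the field names, types and the semicolon-separated input layout are assumptions, not the format used by any MNO or by the study.

```python
from dataclasses import dataclass
from datetime import datetime

VALID_EVENT_TYPES = {"call", "sms", "data"}


@dataclass(frozen=True)
class CallDetailRecord:
    """One CDR event with the basic fields listed on the slide.

    Field names and types are assumptions made for this sketch.
    """

    subscriber_id: str    # identifier assigned by the operator
    country_code: str     # country code as supplied by the MNO
    event_time: datetime  # time of the event
    event_type: str       # "call", "sms" or "data"
    cell_id: str          # cell (antenna) that handled the event, i.e. the location


def parse_cdr(line: str) -> CallDetailRecord:
    """Parse one semicolon-separated line (hypothetical layout) into a record."""
    subscriber_id, country_code, timestamp, event_type, cell_id = (
        line.strip().split(";")
    )
    if event_type not in VALID_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    return CallDetailRecord(
        subscriber_id=subscriber_id,
        country_code=country_code,
        event_time=datetime.fromisoformat(timestamp),
        event_type=event_type,
        cell_id=cell_id,
    )
```

Note that the location is only known at the level of the cell, not as a coordinate of the device.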
15
Access
Barrier #1: privacy protection and legislation
Relevant legislation
Data Protection Directive (Directive 95/46/EC and its
successor, the General Data Protection Regulation)
Electronic Privacy Directive (Directive 2002/58/EC)
Data Retention Directive (Directive 2006/24/EC)
Opinions of the Article 29 Data Protection Working Party
But also…
European Statistics Regulation (Regulation (EC) No 223/2009)
16
Access
Barrier #1: privacy protection and legislation
Directly or indirectly identifiable personal data
(e.g. mobile positioning data) can be used and
processed for statistics if one of the following is
true:
1. The subscriber has given his/her consent, or
2. National legislation allows the NSI to access the data and compels the MNOs to provide it, or
3. Data is processed in a fully anonymous way
17
Access
Barrier #1: privacy protection and legislation
Grey area in the interpretation of the key concept "personal data"
Because the end result of processing is, by itself, anonymous
(aggregated data), the processing of personal data for such
purpose can be interpreted as appropriate
Reluctance to grant access
Fear of public opinion
Big differences in the NSIs' rights to access data
Access to mobile positioning data can range from relatively easy
to nearly impossible
Strong need for a harmonised legal and methodological framework
for NSIs to access data from mobile network operators
18
Access
Barrier #2: technical feasibility
Complicated but possible
Not considered a hard barrier
Some possible issues
Differences in network systems
Patents
Processing (volume of the data)
Continuous data update
(processing time)
…
19
Access
Barrier #2: technical feasibility
Choice of the data compilation process:
decentralised or centralised? who pays what? quality assessment?
20
Access
Barrier #3: financial and business related aspects
MNOs are interested, provided the following are taken into account
Legal aspects and regulations
Public opinion
Business secrets (e.g. sensitive data such as share in the country's
roaming market)
Costs versus benefits (burden in terms of costs is significant:
implementation of extraction system, maintenance, human
resources, …): big data ≠ free data
MNOs need incentives / expect a mutually beneficial relation
Remuneration scheme for the provision of data, or
Ability to use the data for own purposes (internal, profit-making)
21
Methodological challenges
'Universal' issues
Data collection & compilation: sampling design, stratification, calibration
Issues that are inherent to mobile phone data
Representativeness (systematic / sampling bias?) of the technique,
assessment compared to traditional techniques for data collection?
e.g. structural bias: increase in trips or only increase in use?
overcoverage / undercoverage (> 1 SIM card, foreign SIM card)
Applying tourism statistics scope and definitions?
exclude flows within the usual environment, longitudinal data, …
Not more significant than similar shortcomings of 'traditional' sources
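The point about excluding flows within the usual environment can be illustrated with a minimal sketch. It uses a deliberately simplified heuristic (the region where a subscriber is seen on the most distinct days counts as the usual environment); this rule, the function names and the input layout are assumptions for illustration, not the algorithm used in the feasibility study.

```python
from collections import Counter, defaultdict
from datetime import date

# One event: (subscriber_id, day, region), where region is the
# administrative unit that the event's cell has been mapped to.
Event = tuple[str, date, str]


def usual_environment(events: list[Event]) -> dict[str, str]:
    """Assign each subscriber the region seen on the most distinct days."""
    day_regions: dict[str, set[tuple[date, str]]] = defaultdict(set)
    for subscriber, day, region in events:
        day_regions[subscriber].add((day, region))

    result: dict[str, str] = {}
    for subscriber, pairs in day_regions.items():
        counts = Counter(region for _, region in pairs)
        # Ties are broken alphabetically so the result is deterministic.
        result[subscriber] = min(counts, key=lambda r: (-counts[r], r))
    return result


def outside_usual_environment(events: list[Event]) -> list[Event]:
    """Keep only events located outside the subscriber's usual environment."""
    home = usual_environment(events)
    return [event for event in events if event[2] != home[event[0]]]
```

The heuristic only works with longitudinal data: with a single day of observations, every subscriber's usual environment would simply be wherever they happened to be that day.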
22
Methodological challenges
Issues that are inherent to new technologies
Continuity of data access
flexibility of changing the data requirements (e.g. new breakdown)
robustness of series if one or more MNOs drop out
contingency planning if all MNOs stop providing data
Shifts in technology and consumer behaviour
new devices and their impact on the way people communicate
new services (e.g. relevance of call detail records in 2020?)
bigger exposure to exogenous factors makes
close monitoring and constant innovation
essential conditions for using big data in official statistics
23
Methodological challenges
Location vs. antenna data: probabilistic geographical distribution
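Because an event is located at the antenna (cell) rather than at the device, counts per cell have to be spread over geographical units. A minimal sketch of such a probabilistic distribution follows; the weights (the assumed probability that a device served by a cell is in a given region) and all names are hypothetical.

```python
from collections import defaultdict


def distribute_events(
    events_per_cell: dict[str, int],
    cell_weights: dict[str, dict[str, float]],
) -> dict[str, float]:
    """Spread event counts per cell over regions using per-cell weights.

    cell_weights[cell][region] is the assumed probability that a device
    served by `cell` is located in `region`; weights per cell must sum to 1.
    """
    totals: dict[str, float] = defaultdict(float)
    for cell, count in events_per_cell.items():
        weights = cell_weights[cell]
        if abs(sum(weights.values()) - 1.0) > 1e-9:
            raise ValueError(f"weights for cell {cell!r} do not sum to 1")
        for region, weight in weights.items():
            totals[region] += count * weight
    return dict(totals)
```

In practice the weights might be derived from the overlap between a cell's coverage area and each administrative unit; that derivation is outside the scope of this sketch.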
24
Methodological challenges
Effect of using different administrative borders on usual environment
25
Using LAU-1 for defining usual environment
Using LAU-2 for defining usual environment
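The effect of the chosen administrative level can be sketched as follows: LAU-2 units (municipalities) are smaller than LAU-1 units, so the same movements produce more visits "outside the usual environment" when LAU-2 is used. The unit names and the mapping below are hypothetical.

```python
# Hypothetical mapping of LAU-2 units (municipalities) to LAU-1 units.
LAU2_TO_LAU1 = {
    "town_a": "county_x",
    "town_b": "county_x",
    "town_c": "county_y",
}


def visits_outside(home_lau2: str, visited_lau2: list[str], level: str) -> int:
    """Count visits outside the usual environment at the chosen LAU level."""
    if level == "LAU-2":
        return sum(1 for unit in visited_lau2 if unit != home_lau2)
    if level == "LAU-1":
        home_lau1 = LAU2_TO_LAU1[home_lau2]
        return sum(
            1 for unit in visited_lau2 if LAU2_TO_LAU1[unit] != home_lau1
        )
    raise ValueError(f"unsupported level: {level!r}")
```

For a subscriber living in `town_a` who is observed in `town_a`, `town_b` and `town_c`, the LAU-2 definition counts two visits outside the usual environment, the LAU-1 definition only one.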
Methodological challenges
Limitations of mobile positioning data for tourism statistics
Not entirely compatible with existing definitions and breakdowns
Mostly unknown purpose of the trip
No information on expenditure
Mostly unknown means of transport
Generally no socio-demographic breakdowns
The need for longitudinal data (to determine usual place of residence
& usual environment) is an additional barrier to getting access
26
Coherence
Analysis of the coherence of output based on mobile
positioning data versus existing tourism statistics
Domain coverage: domestic, inbound, outbound
Breakdown into tourism trips and same-day visits
Coherence with existing indicators, and reasons for deviations
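One simple way to quantify coherence between a mobile-based series and a reference series is to look at their correlation (do they move together?) and their average ratio (is there a level difference?). The sketch below is an illustration under that assumption, not the method used in the study's coherence report; the function names are hypothetical.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def mean_ratio(mobile: list[float], reference: list[float]) -> float:
    """Average period-by-period ratio of the mobile series to the reference."""
    if len(mobile) != len(reference) or not mobile:
        raise ValueError("series must be non-empty and of equal length")
    return sum(m / r for m, r in zip(mobile, reference)) / len(mobile)
```

A high correlation combined with a ratio well above 1 would suggest that the mobile data follows the same seasonal pattern while covering flows that the reference source misses; a ratio below 1 would point towards undercoverage.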
27
Coherence
28
[Figure: Inbound overnight trips (vs. accommodation statistics). Series: MOB_IN(EU-27)_OVERNIGHT and SUPPLY_EE(EU-27)_ARR; monthly, axis labelled Jan-09 to Nov-12; vertical axis 0 to 350 000]
[Figure: Inbound, outbound overnight trips (vs. ferry passengers data)]
[Figure: Outbound overnight trips (vs. demand side data). Series: MOB_OUT(EU-27)_OVERNIGHT and DEMAND_EE(EU-27)_OVERNIGHT; quarterly, Q1-09 to Q4-12; vertical axis 0 to 500 000]
[Figure: Inbound overnight trips (vs. border control data). Series: MOB_EE(RU) and BORDCONT_EE(RU); monthly, axis labelled Jan-09 to Nov-12; vertical axis 0 to 180 000]
Chart annotations: Better coverage; Less recall bias
Opportunities and benefits
Assessment of strengths and weaknesses
(as compared to the current production methodology)
Relevance
+ Completeness: better coverage, larger scope
+ New statistics, indicators, breakdowns previously not available
(e.g. finer granularity of space and time)
− Lack of socio-demographic variables and some domain-specific
variables (purpose of trips, expenditure, …)
Timeliness
+ Increased integration and automation leads to better timeliness,
up to near-real-time data (but impact on the cost!)
29
Opportunities and benefits
Accuracy
+ Absence of non-response
+ Absence of memory effects or recall bias
− Some overcoverage and undercoverage issues
− Measurement error (# observations vs. precision of location/duration)
Coherence and comparability
+ Good coherence with existing series
+ Synergies with related domains
(BOP travel, transport and urban mobility, etc.)
+ Use of joint algorithms leads to better comparability across domains
(and over time)
+ Additional calibration source for 'traditional' data
30
Opportunities and benefits
Cost and burden
+ Elimination of direct respondent burden
+ Elimination of traditional data entry (important error source!)
+ Possibly more cost-efficient than traditional surveys
− Piloting and implementation cost (start up), regular production cost
− Possibly parallel processes (big data / traditional data) in a first phase
− New skills needed
− Dependency on external data providers (in casu MNOs)
31
Opportunities and benefits
findings concerning the costs
Example: 3 MNOs (10, 5, 1 million subscribers), 15-day latency
High implementation cost, low annual running cost
Processing within the NSI is less costly (compared to decentralised)
32
Opportunities and benefits
findings concerning the costs
Impact of desired latency and of number of subscribers
Cost of data extraction from MNOs increases proportionally with the
number of subscribers and rises as the allowed latency is shortened.
Initial implementation and automation is expensive, maintaining the
system much less so.
33
Conclusions & points for further discussion #1
At present, mobile positioning data cannot replace current
statistics but can give complementary and/or faster results
However… official statisticians have to think out of the box
and leave their comfort zone
The existing scope and definitions are – besides user needs – based on
the available sources and methodologies at the time of development
Do not repeat but do better !
Use of big data necessitates a revolution of the mindset rather than a
simple evolution !
Re-thinking indicators, zero-base user need analysis instead of
incremental changes in the existing frame
34
Conclusions & points for further discussion #2
Implications for the statistical community
Follow the developments (before others take over our business!)
Explore mixed-mode solutions (e.g. large samples based on big data
+ smaller follow-up survey to collect domain-specific information)
Need for horizontal (across domains) and international cooperation in
this area (e.g. Task Force on Big Data)
Implementation will benefit from initiatives covering several countries
(and domains)
Same market structure (MNOs as contact point!)
Same methodological challenges and limitations
Same user needs (at least at European level)
35
Conclusions & points for further discussion #3
Achievements of the study
Impressive set of reports addressing many questions
"Everything I always wanted to know about using mobile
positioning data for statistics, but was afraid to ask"
This feasibility study should serve as a starting point & reference for
many projects to come
In the area of mobile positioning data, but also other types of big
data
In the area of tourism, but also in many other fields of statistics
36
Conclusions & points for further discussion #4
Next steps by the ESS/Eurostat ?
Multi-country and multi-domain project in the pipeline (tbc)
Access is a critical factor, so the number of statistical domains
analysed & assessed should be maximised (e.g. population,
balance of payments, transport & urban mobility, tourism)
Involve several countries, possibly two-speed approach
Use of data stored by Mobile Network Operators
Expected output
Partnerships with MNOs
Studying data structures and defining data access standards
Testing data compilation and assessing quality
37
38
Thank you
for your
attention!