Creating Open Data whilst maintaining confidentiality
Philip Lowthian, Caroline Tudor
Office for National Statistics
1
Main points of presentation
• What are Open Data
• Why the need for Open Data
• Steps to protect the data prior to release
• Privacy Concerns
• An Example
• The Future for Open Data
2
Definition of Open Data
“Open Data simply encompasses data that are made available by organisations, businesses and individuals for anyone to access, use and share no matter where they are and what they want to do with the data.”
Open Data Institute: Guides – What are Open Data
3
Why there is interest in Open Data
• Current release options were considered too restrictive and did not allow data to be used to their full capacity
• Government commitment to the open data agenda resulted in the Open Government Licence (OGL)
- no restrictions on use, no registration required
- disclosure risk must be reduced to low or negligible risk
4
Open data – Initial steps
• Different to other microdata releases - negligible possibility of identification
• All direct identifiers removed
• Possibly of limited use to researchers
• Sensitive variables may be recoded or removed
But
• Can be used as teaching / training datasets
5
Open data – Initial steps
• Users are allowed to publish, adapt and combine with other data as long as the information is not personal data
• Difficult balancing act between producing Open Data which are of some use and protected to a reasonable level even when combined with other data sources
6
Risk – Utility
[Figure: disclosure risk (information about confidential units) against data utility (information about legitimate items); both axes run from low to high]
7
The Risk Utility Relationship
[Figure: axis label "Data Utility: Information about legitimate items", running from low to high]
1. Assess dataset background and context
User requirements
If possible, discuss with potential users. Think about the following:
- Variables
- Level of Detail
- Geography
8
1. Assess dataset background and context
Dataset details
- Is the original dataset a survey sample, an administrative dataset or a census?
Sample survey – there is doubt as to whether any given member of the population is in the sample, which offers some protection.
Administrative dataset or census – could release a sample from the complete data
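The option of releasing a sample from the complete data can be sketched as follows. This is a minimal illustration only: the function name `draw_release_sample`, the 1% sampling fraction and the toy records are assumptions made for the example, and simple random sampling without replacement is just one possible design.

```python
import random

def draw_release_sample(records, fraction, seed=None):
    """Return a simple random sample (without replacement) of the records."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # At least one record, and never more than the file holds.
    sample_size = min(len(records), max(1, round(len(records) * fraction)))
    return rng.sample(records, sample_size)

# Toy 'census' of 1,000 records (hypothetical values, no direct identifiers).
census = [
    {"age_group": "Group %d" % (i % 8), "region": "Region %d" % (i % 10)}
    for i in range(1000)
]
sample = draw_release_sample(census, fraction=0.01, seed=42)
print(len(sample))  # 10 of the 1,000 records are released
```

Because only a fraction of records is released, a match on the released file does not establish that a particular member of the population is in it.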
9
2. Intruder Scenarios
Why might an intruder want to discover confidential information?
• Identity theft
• Gain against commercial competitors
• Self identification
• Journalist after a good ‘public interest’ story
• Sensitive information about people - salary, health
• Discredit government or GSS
• Nosy neighbour
• Database enhancement
10
2. Intruder Scenarios
• Nosy neighbour – They would know certain facts in the dataset about a neighbour or colleague. Could use this to discover private details.
• Journalist – Could use the data to find out personal information about an individual who is unique on a set of variables.
11
3. Determine the Key Variables
Variables most likely to lead to confidential information being found in a dataset
Either
• Visible variables - those that an intruder might know through observation
Or
• Sensitive variables - those which, if known by an intruder, would be likely to assist in an immediate identification
12
3. Determine the Key Variables
The choice of key variables will depend on the dataset. Typical key variables are:
• Age (individual or grouped)
• Sex
• Health indicator (more likely to be a key variable if a specific condition)
• Size / composition of household
• Income (household or individual)
• Occupation / Industry
• Ethnic Group
• Religion
• Country of birth
• Marital Status
13
3. Determine the Key Variables
Key variables unique to particular datasets
• Dwelling characteristics
• Household structure
• Education variables
‘Response’ variable for each record: an outcome that relates to the specific purpose of the collection.
• Income
• Expenditure
• Gas / Electricity Consumption
14
4. Outputs from variable combinations
• Select combination of key variables
• Look for rare combinations or uniques in these combinations
• Protect data by methods such as removal of variables or records, recoding or record swapping.
• Carry out intruder testing (internal and external). Repeat above steps
• Publish data under an Open Licence
15
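The first two steps above (selecting a combination of key variables and looking for rare or unique combinations) can be sketched in a few lines. This is a minimal illustration: the function names, the `threshold` parameter and the toy records are assumptions for the example, not a prescribed method.

```python
from collections import Counter

def combination_counts(records, key_variables):
    """Count how many records share each combination of key-variable values."""
    return Counter(tuple(r[v] for v in key_variables) for r in records)

def risky_records(records, key_variables, threshold=1):
    """Return records whose key combination occurs at most `threshold` times."""
    counts = combination_counts(records, key_variables)
    return [
        r for r in records
        if counts[tuple(r[v] for v in key_variables)] <= threshold
    ]

# Toy data with three hypothetical key variables.
records = [
    {"age_group": "25-34", "sex": "F", "marital_status": "Single"},
    {"age_group": "25-34", "sex": "F", "marital_status": "Single"},
    {"age_group": "25-34", "sex": "M", "marital_status": "Married"},
    {"age_group": "75+", "sex": "M", "marital_status": "Widowed"},
    {"age_group": "25-34", "sex": "M", "marital_status": "Married"},
]
keys = ["age_group", "sex", "marital_status"]

uniques = risky_records(records, keys, threshold=1)
print(uniques)  # only the 75+ / M / Widowed record is unique on these keys
```

Raising `threshold` widens the net from uniques to rare combinations; the flagged records are the candidates for removal, recoding or swapping.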
What is Intruder Testing?
• Use ‘Friendly Intruders’ (usually internal, for example ONS staff) to see if they can re-identify anyone in the dataset
• Discover what additional information is used by the intruders
• Discover which variables are used by intruders when attempting identification
• Determine the level of disclosure risk in the dataset empirically.
16
Privacy concerns
• Data released under OGL, although protected, will have a residual risk due to:
• The ‘mosaic’ effect. Linking different but similar datasets may help identify a record in the data. The possibility of this will increase with greater computing power and better matching software
• Access cannot be withdrawn from an open dataset
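The ‘mosaic’ effect can be illustrated with a small linkage sketch. Everything here is hypothetical: the function `link_unique_matches`, the two toy datasets and the choice of shared variables are assumptions made only to show how an external file with names could be matched to an open file on variables the two have in common.

```python
from collections import defaultdict

def link_unique_matches(open_records, external_records, shared_variables):
    """Pair each external record with an open record when exactly one open
    record shares its values on all the shared variables."""
    index = defaultdict(list)
    for rec in open_records:
        index[tuple(rec[v] for v in shared_variables)].append(rec)
    matches = []
    for ext in external_records:
        candidates = index.get(tuple(ext[v] for v in shared_variables), [])
        if len(candidates) == 1:
            matches.append((ext, candidates[0]))
    return matches

# Toy open dataset (no names) and toy external dataset (with names).
open_data = [
    {"age_group": "35-44", "region": "Wales", "occupation": "Teacher", "health": "Good"},
    {"age_group": "35-44", "region": "Wales", "occupation": "Nurse", "health": "Fair"},
    {"age_group": "35-44", "region": "Wales", "occupation": "Nurse", "health": "Good"},
]
external = [
    {"name": "Person A", "age_group": "35-44", "region": "Wales", "occupation": "Teacher"},
    {"name": "Person B", "age_group": "35-44", "region": "Wales", "occupation": "Nurse"},
]
shared = ["age_group", "region", "occupation"]

matches = link_unique_matches(open_data, external, shared)
for ext, rec in matches:
    print(ext["name"], "->", rec["health"])  # Person A -> Good
```

A unique match is only a candidate identification: if the open data are a sample, the matched record may belong to someone else in the population who shares the same values.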
17
Example: 2011 Census teaching dataset
In addition to Safeguarded and Secure Access data, a teaching dataset was introduced. It also acts as a taster for more detailed datasets
Approx 500,000 records for England and Wales
Protection is given to members of this dataset by:
Small sample size – Small likelihood of an individual being in the sample
Record swapping – Geographic perturbation
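Record swapping as geographic perturbation can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure used for the Census: the function `swap_regions`, the rule for choosing a partner (a record in another region that agrees on the chosen match variables) and the toy records are all hypothetical.

```python
import random

def swap_regions(records, risky_indices, match_variables, seed=None):
    """Swap the 'region' of each risky record with a randomly chosen partner
    from another region that agrees on the match variables.
    Returns a new list; the input records are left unchanged."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]
    used = set(risky_indices)  # risky and already-swapped records are not partners
    for i in risky_indices:
        target = swapped[i]
        partners = [
            j for j, r in enumerate(swapped)
            if j not in used
            and r["region"] != target["region"]
            and all(r[v] == target[v] for v in match_variables)
        ]
        if not partners:
            continue  # no suitable partner: leave this record unswapped
        j = rng.choice(partners)
        target["region"], swapped[j]["region"] = swapped[j]["region"], target["region"]
        used.add(j)
    return swapped

records = [
    {"region": "North East", "age_group": "65+", "sex": "F"},   # flagged as risky
    {"region": "London", "age_group": "65+", "sex": "F"},
    {"region": "Wales", "age_group": "25-34", "sex": "M"},
    {"region": "North East", "age_group": "25-34", "sex": "M"},
]
protected = swap_regions(records, risky_indices=[0],
                         match_variables=["age_group", "sex"], seed=1)
print(protected[0]["region"], protected[1]["region"])  # London North East
```

Swapping between records that agree on the match variables leaves the regional counts for those variables unchanged while introducing doubt about where any one record really belongs.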
18
Steps to follow: 2011 Census teaching dataset
• Remove all direct personal identifiers such as Name, Address and Date of Birth
• There are a large number of variables in the original data. Decide on the variables to include in the teaching dataset; these should include geography (Region) and basic demographic information
• Identify Key variables
• Create combinations of Key variables from both sample and population. Look for unique or rare combinations
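Comparing key-variable combinations in the sample with those in the population can be sketched as follows. This is illustrative only: the helper names and toy data are assumptions, and in practice the population file is available only to the data owner.

```python
from collections import Counter

def key_of(record, key_variables):
    """Return the record's values on the key variables as a tuple."""
    return tuple(record[v] for v in key_variables)

def classify_sample_uniques(sample, population, key_variables):
    """Return (sample_uniques, population_uniques): combinations that occur
    once in the sample, and the subset that also occur once in the population."""
    sample_counts = Counter(key_of(r, key_variables) for r in sample)
    population_counts = Counter(key_of(r, key_variables) for r in population)
    sample_uniques = [k for k, n in sample_counts.items() if n == 1]
    population_uniques = [k for k in sample_uniques if population_counts[k] == 1]
    return sample_uniques, population_uniques

# Toy population of six records and a sample of three of them.
population = (
    [{"age_group": "16-24", "sex": "F"}] * 3
    + [{"age_group": "25-34", "sex": "M"}] * 2
    + [{"age_group": "75+", "sex": "M"}]
)
sample = [population[0], population[3], population[5]]
keys = ["age_group", "sex"]

sample_uniques, population_uniques = classify_sample_uniques(sample, population, keys)
print(len(sample_uniques), len(population_uniques))  # 3 sample uniques, 1 population unique
```

A sample unique that is not a population unique is less of a concern, because several people in the population share that combination.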
19
Steps to follow: 2011 Census teaching dataset
• Recode some of the most identifying variables
Recode Age into 8 Categories
Recode Ethnic group from 16 to 5 categories
Additional recodes for Industry, Economic Activity and Religion
• Recreate the variable combinations. Many sample uniques but very few population uniques.
• Carry out some additional record swapping. Swap a small number of most risky records at Region level
• Release the data under OGL
20
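Recoding a detailed variable into broader categories can be sketched as below. The slides state that Age was recoded into 8 categories but do not list the bands, so the band edges and labels here are assumptions chosen only to give 8 groups.

```python
from bisect import bisect_right

# Assumed lower bounds and labels for 8 age bands (not taken from the slides).
AGE_BAND_STARTS = [0, 16, 25, 35, 45, 55, 65, 75]
AGE_BAND_LABELS = ["0-15", "16-24", "25-34", "35-44",
                   "45-54", "55-64", "65-74", "75+"]

def recode_age(age):
    """Map a single year of age to its (assumed) broad age band."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right finds how many band starts are <= age.
    return AGE_BAND_LABELS[bisect_right(AGE_BAND_STARTS, age) - 1]

print(recode_age(37))  # 35-44
print(recode_age(92))  # 75+
```

Broader bands reduce the number of distinct key combinations, so fewer records remain unique, at the cost of some detail for users.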
The Future
• Currently Open Data are often released alongside other (more detailed) licensed data.
• Confidentiality of data released under OGL is protected by law
• They will be of limited use for complex research projects and used mainly for training / teaching purposes.
• Will the data be used by more than a small group of people?
21
Any Questions?
22