creating open data whilst maintaining confidentiality philip lowthian, caroline tudor office for...

Creating Open Data whilst maintaining confidentiality

Philip Lowthian, Caroline Tudor

Office for National Statistics

Main points of presentation

• What are Open Data

• Why the need for Open Data

• Steps to protect the data prior to release

• Privacy Concerns

• An Example

• The Future for Open Data

Definition of Open Data

“Open Data simply encompasses data that are made available by organisations, businesses and individuals for anyone to access, use and share no matter where they are and what they want to do with the data.”

Open Data Institute: Guides – What are Open Data

Why there is interest in Open Data

• Current release options were considered too restrictive, not allowing data to be used to their full capacity

• Government commitment to the open data agenda resulted in the Open Government Licence (OGL)

- no restrictions on use, no registration required

- disclosure risk must be reduced to low or negligible risk

Open data – Initial steps

• Different to other microdata releases - negligible possibility of identification

• All direct identifiers removed

• Possibly of limited use to researchers

• Sensitive variables may be recoded or removed

But

• Can be used as teaching / training datasets

Open data – Initial steps

• Users are allowed to publish, adapt and combine with other data as long as the information is not personal data

• Difficult balancing act: producing Open Data which are of some use yet remain protected to a reasonable level, even when combined with other data sources

Risk – Utility

[Figure: The Risk Utility Relationship. Axes: Disclosure Risk (information about confidential units), Low to High; Data Utility (information about legitimate items), Low to High.]

1. Assess dataset background and context

User requirements

If possible, discuss with potential users. Think about the following:

- Variables

- Level of Detail

- Geography

1. Assess dataset background and context

Dataset details

- Is the original dataset a survey sample, an administrative dataset or a census?

Sample survey – there is doubt as to whether any given member of the population is in the sample.

Administrative dataset or census – could release a sample from the complete data
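
Releasing a sample from a complete dataset can be sketched as follows. This is only an illustration under stated assumptions, not a description of any ONS procedure: the `draw_sample` helper, the 1% sampling fraction and the toy records are all hypothetical.

```python
import random

def draw_sample(records, fraction, seed=None):
    """Return a simple random sample (without replacement) of the records."""
    rng = random.Random(seed)
    n = round(len(records) * fraction)
    return rng.sample(records, n)

# Toy "complete" dataset of 1,000 made-up records (no real data).
complete_data = [{"row": i, "region": f"R{i % 10}"} for i in range(1000)]

# Hypothetical 1% sampling fraction, chosen only for the example.
sample = draw_sample(complete_data, fraction=0.01, seed=42)
print(len(sample))  # 10
```

With simple random sampling each record has the same chance of inclusion, so an intruder cannot be sure that a particular person is in the released file.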

2. Intruder Scenarios

Why might an intruder want to discover confidential information?

Identity theft

Gain against commercial competitors

Self identification

Journalist after a good ‘public interest’ story

Sensitive information about people - salary, health

Discredit government or GSS

Nosy neighbour

Database enhancement

2. Intruder Scenarios

• Nosy neighbour – Would know certain facts in the dataset about a neighbour or colleague, and could use these to discover private details.

• Journalist – Could use the data to find out personal information about an individual who is unique on a set of variables.

3. Determine the Key Variables

Variables most likely to lead to confidential information being found in a dataset

Either

• Visible variables - those that an intruder might know through observation

Or

• Sensitive variables - if known by an intruder, they would be likely to assist in an immediate identification

The choice of key variables will depend on the dataset. Typical key variables are:

• Age (individual or grouped)

• Sex

• Health indicator (more likely to be a key variable if a specific condition)

• Size / composition of household

• Income (household or individual)

• Occupation / Industry

• Ethnic Group

• Religion

• Country of birth

• Marital Status

Key variables unique to particular datasets

• Dwelling characteristics

• Household structure

• Education variables

‘Response’ variable for each record: an outcome that relates to the specific purpose of the collection.

• Income

• Expenditure

• Gas / Electricity Consumption

4. Outputs from variable combinations

• Select combination of key variables

• Look for rare combinations or uniques in these combinations

• Protect data by methods such as removal of variables or records, recoding or record swapping.

• Carry out intruder testing (internal and external). Repeat above steps

• Publish data under an Open Licence
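
Looking for rare combinations or uniques amounts to counting how often each combination of key variables occurs. The sketch below is a minimal illustration, not the tooling used by the authors: the helper names, the threshold of 1 and the toy records are assumptions made for the example.

```python
from collections import Counter

def key_combination_counts(records, key_vars):
    """Count how many records share each combination of key variables."""
    return Counter(tuple(r[v] for v in key_vars) for r in records)

def risky_records(records, key_vars, threshold=1):
    """Return records whose key combination occurs at most `threshold` times."""
    counts = key_combination_counts(records, key_vars)
    return [r for r in records
            if counts[tuple(r[v] for v in key_vars)] <= threshold]

# Toy records (hypothetical values, not real data).
records = [
    {"age_group": "25-34", "sex": "F", "region": "North East"},
    {"age_group": "25-34", "sex": "F", "region": "North East"},
    {"age_group": "65-74", "sex": "M", "region": "Wales"},
    {"age_group": "25-34", "sex": "M", "region": "London"},
    {"age_group": "25-34", "sex": "M", "region": "London"},
]

uniques = risky_records(records, ["age_group", "sex", "region"])
print(uniques)  # only the single Wales record is unique on these keys
```

Raising `threshold` flags rare as well as unique combinations. Records flagged this way are the candidates for removal, recoding or swapping.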

What is Intruder Testing?

• Use ‘Friendly Intruders’ (usually internal, for example ONS staff) to see if they can re-identify anyone in the dataset

• Discover what additional information is used by the intruders

• Discover which variables are used by intruders when attempting identification

• Determine the level of disclosure risk in the dataset empirically.

Privacy concerns

• Data released under OGL, although protected, will carry a residual risk due to:

• The ‘mosaic’ effect. Linking different but similar datasets may help identify a record in the data. The possibility of this will increase with greater computing power and better matching software

• Access cannot be withdrawn from an open dataset
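
The ‘mosaic’ effect can be illustrated by linking an open dataset to a second source on the variables they share. This is a toy sketch only: the `link_on_keys` helper, both datasets and every value in them are invented for the example.

```python
from collections import defaultdict

def link_on_keys(open_data, external, keys):
    """Return (external_record, open_record) pairs that match one-to-one on keys."""
    open_index = defaultdict(list)
    for rec in open_data:
        open_index[tuple(rec[k] for k in keys)].append(rec)
    ext_index = defaultdict(list)
    for rec in external:
        ext_index[tuple(rec[k] for k in keys)].append(rec)
    # Keep only combinations that are unique in both sources.
    return [
        (ext_index[key][0], open_index[key][0])
        for key in ext_index
        if len(ext_index[key]) == 1 and len(open_index.get(key, [])) == 1
    ]

# Hypothetical open dataset: no names, but a sensitive variable.
open_data = [
    {"age_group": "35-44", "sex": "F", "region": "Wales", "health": "Good"},
    {"age_group": "35-44", "sex": "F", "region": "Wales", "health": "Fair"},
    {"age_group": "75+", "sex": "M", "region": "London", "health": "Bad"},
]
# Hypothetical external source: names, but no sensitive variable.
external = [
    {"name": "Person A", "age_group": "75+", "sex": "M", "region": "London"},
    {"name": "Person B", "age_group": "35-44", "sex": "F", "region": "Wales"},
]

matches = link_on_keys(open_data, external, ["age_group", "sex", "region"])
for ext, opn in matches:
    print(ext["name"], "->", opn["health"])  # Person A -> Bad
```

A one-to-one match is only a claimed identification. If the open data are a sample, the matched record may belong to someone else in the population who shares the same key values, which is part of the protection that sampling gives.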

Example: 2011 Census teaching dataset

A teaching dataset was introduced in addition to the Safeguarded and Secure Access data. It also acts as a taster for the more detailed datasets

Approx 500,000 records for England and Wales

Protection is given to members of this dataset by:

Small sample size – Small likelihood of an individual being in the sample

Record swapping – Geographic perturbation
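
Record swapping as geographic perturbation might look roughly like the sketch below. It is not the ONS swapping method, which the slides do not describe. The `swap_regions` helper, the choice of matching variables and the toy records are assumptions made for illustration.

```python
import random

def swap_regions(records, risky_idx, match_vars, seed=None):
    """Swap 'region' between each risky record and a random partner from a
    different region that agrees on match_vars. Returns a new list."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # copy so the input is left untouched
    used = set()
    for i in risky_idx:
        if i in used:
            continue
        candidates = [
            j for j, r in enumerate(out)
            if j != i and j not in used
            and r["region"] != out[i]["region"]
            and all(r[v] == out[i][v] for v in match_vars)
        ]
        if not candidates:
            continue  # no suitable partner: leave the record unchanged
        j = rng.choice(candidates)
        out[i]["region"], out[j]["region"] = out[j]["region"], out[i]["region"]
        used.update((i, j))
    return out

# Toy records (hypothetical). Record 0 is treated as the risky one.
records = [
    {"region": "Wales", "age_group": "25-34", "sex": "F"},
    {"region": "London", "age_group": "25-34", "sex": "F"},
    {"region": "London", "age_group": "65-74", "sex": "M"},
]

swapped = swap_regions(records, risky_idx=[0],
                       match_vars=["age_group", "sex"], seed=1)
print([r["region"] for r in swapped])  # ['London', 'Wales', 'London']
```

Because partners agree on the matching variables, the counts of region by age group and sex are unchanged. Only the link between geography and the remaining variables is perturbed.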

Steps to follow: 2011 Census teaching dataset

• Remove all direct personal identifiers such as Name, Address and Date of Birth

• There are a large number of variables in the original data. Decide on the variables to include in the teaching dataset. These are to include geography (Region) and basic demographic information

• Identify Key variables

• Create combinations of Key variables from both sample and population. Look for unique or rare combinations
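
Comparing combinations in the sample with those in the population separates sample uniques from population uniques. The sketch below assumes the data owner can see both files. The `classify_uniques` helper and the toy data are hypothetical.

```python
from collections import Counter

def classify_uniques(sample, population, key_vars):
    """Return (sample_uniques, population_uniques) as lists of key tuples."""
    def key(record):
        return tuple(record[v] for v in key_vars)
    sample_counts = Counter(key(r) for r in sample)
    pop_counts = Counter(key(r) for r in population)
    sample_uniques = [k for k, n in sample_counts.items() if n == 1]
    # A sample unique is only high risk if it is also unique in the population.
    population_uniques = [k for k in sample_uniques if pop_counts[k] == 1]
    return sample_uniques, population_uniques

# Toy population of six people (hypothetical values).
population = [
    {"age_group": "16-24", "sex": "F"},
    {"age_group": "16-24", "sex": "F"},
    {"age_group": "16-24", "sex": "F"},
    {"age_group": "25-34", "sex": "M"},
    {"age_group": "25-34", "sex": "M"},
    {"age_group": "75+", "sex": "M"},
]
# A sample of three of them.
sample = [population[0], population[3], population[5]]

sample_uniques, population_uniques = classify_uniques(
    sample, population, ["age_group", "sex"])
print(len(sample_uniques))   # 3
print(population_uniques)    # [('75+', 'M')]
```

Every record in the toy sample is a sample unique, but only one combination is unique in the population. That one is the record most in need of protection.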

Steps to follow: 2011 Census teaching dataset

• Recode some of the most identifying variables

Recode Age into 8 Categories

Recode Ethnic group from 16 to 5 categories

Additional recodes for Industry, Economic Activity and Religion

• Recreate the variable combinations. This gives many sample uniques but very few population uniques.

• Carry out some additional record swapping. Swap a small number of the most risky records at Region level

• Release the data under OGL
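
The age recode in the steps above can be sketched as a simple banding function. The slides say only that Age is recoded into 8 categories. The band boundaries and labels below are an assumption chosen to give eight groups, and `recode_age` is a hypothetical helper.

```python
from bisect import bisect_right

# Assumed lower bounds and labels for eight age bands (illustrative only).
AGE_BAND_STARTS = [0, 16, 25, 35, 45, 55, 65, 75]
AGE_BAND_LABELS = ["0-15", "16-24", "25-34", "35-44",
                   "45-54", "55-64", "65-74", "75+"]

def recode_age(age):
    """Map an age in years to its band label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts the band starts that are <= age.
    return AGE_BAND_LABELS[bisect_right(AGE_BAND_STARTS, age) - 1]

print([recode_age(a) for a in (0, 15, 16, 74, 75, 101)])
# ['0-15', '0-15', '16-24', '65-74', '75+', '75+']
```

Coarser bands put more people in each category, so fewer records are unique on the key variables, at the cost of some detail for users.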

The Future

• Currently Open Data are often released alongside other (more detailed) licensed data.

• Confidentiality of data released under OGL is protected by law

• They will be of limited use for complex research projects and used mainly for training / teaching purposes.

• Will the data be used by more than a small group of people?

Any Questions?
