User-focused Threat Identification for Anonymised Microdata
Hans-Peter Hafner, HTW Saar – Saarland University of Applied Sciences, Hans-Peter.Hafner@htwsaar.de
Felix Ritchie, University of the West of England
Rainer Lenz, Technical University of Dortmund
Conference of European Statistics Stakeholders, Rome, 24 November 2014
Motivation
Producing anonymised data sources for the scientific community is a key task of National Statistical Institutes (NSIs)
Conservative, risk-averse approach (data protection): release data only if they can be shown to be safe (defensive)
vs
Alternative, user-oriented approach: release data unless they present a disclosure risk (cooperative)
Overview
• Common approaches to anonymisation
• Critique of common perspective
    Focus on data protection
    Worst-case scenarios
• Evidence-based risk assignment (Case study: CIS 2010)
• Impact of new strategy
• Conclusion
Common approach to anonymisation
ESSNET Handbook on SDC (Statistical Disclosure Control)
• Microdata protection should be based on
    Knowledge of the use of the data
    Access requirements
    Potential to match external datasets
    Structure of the data itself
• Risk scenarios are based on
    Spontaneous recognition
    Actively searching (record linkage)
Critique 1: Focus on data protection
Assumption:
Existence of intruders who want to identify companies / persons in the data.
But:
There are no known cases of malicious misuse of data. Only a few mistakes and some attempts to circumvent procedures to make life easier are known.
The problem is not anonymisation but accreditation procedures!
Critique 2: Worst-case scenarios
Typical scenario:
Anonymised data vs. Original data (Record matching)
Not realistic:
• Large differences between official statistics and commercial databases
Total protection is not required by law:
• De facto anonymity (Germany): a re-identification risk is tolerated as long as the effort / cost of re-identification exceeds the benefit
Evidence-Based Risk Assignment: Case Study CIS 2010
CIS (Community Innovation Survey)
• Survey about the innovation activities of enterprises in countries of the European Union
• Conducted every 2 years
• For some countries a census, for others only a sample survey; but large companies are always included
• Many categorical variables, only 9 continuous attributes
Case Study CIS 2010 (continued)
Risk Scenario
Step 1: Identify user needs
Analysis of research papers + Google Scholar search
Linear and nonlinear regression are the most frequently used methods
Step 2: Identify user risks
Spontaneous recognition of outliers: no risk, since there is no disclosure to an unauthorised person
Group disclosure from categorical variables: no risk, since the focus is not on descriptive statistics
Case Study CIS 2010 (continued)
Risk Evaluation
Spontaneous recognition
Very unlikely because of large differences between data sources
Matching on categorical variables
Uncertain, since the classification of economic activity differs between the statistical business register and commercial databases (main activity vs main turnover)
Moreover: matching is prohibited by licence agreements
Remaining risks
Magnitude tables with 1 or 2 observations in a cell
Dominance of one unit in a cell / dataset
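The two remaining risks can be checked mechanically. The following is a minimal sketch in Python, not the procedure used in the case study. It assumes each record carries a cell key and one magnitude variable; the field names `cell` and `turnover`, the toy figures and the 85% dominance share are illustrative assumptions, since the slides do not state a dominance rule.

```python
from collections import defaultdict

MIN_CELL_COUNT = 3      # cells with 1 or 2 observations are flagged
DOMINANCE_SHARE = 0.85  # assumed threshold, not taken from the case study


def risky_cells(records, min_count=MIN_CELL_COUNT, share=DOMINANCE_SHARE):
    """Return {cell: reason} for cells with a small count or a dominant unit."""
    cells = defaultdict(list)
    for rec in records:
        cells[rec["cell"]].append(rec["turnover"])

    flagged = {}
    for cell, values in cells.items():
        total = sum(values)
        if len(values) < min_count:
            flagged[cell] = "small count"
        elif total > 0 and max(values) / total > share:
            flagged[cell] = "dominance"
    return flagged


# Hypothetical toy data: (country, industry) cells with turnover values.
records = [
    {"cell": ("DE", "C10"), "turnover": 120.0},
    {"cell": ("DE", "C10"), "turnover": 95.0},
    {"cell": ("DE", "C10"), "turnover": 110.0},
    {"cell": ("DE", "C20"), "turnover": 40.0},
    {"cell": ("DE", "C20"), "turnover": 55.0},
    {"cell": ("IT", "C10"), "turnover": 900.0},
    {"cell": ("IT", "C10"), "turnover": 20.0},
    {"cell": ("IT", "C10"), "turnover": 30.0},
    {"cell": ("IT", "C10"), "turnover": 10.0},
]

print(risky_cells(records))
```

In this toy input one cell is flagged for holding only two observations and another because a single unit contributes most of the cell total; the well-populated, balanced cell is left alone.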
Impact of new strategy
Consequence of risk evaluation
Small cell count (< 3) or dominance problem in cell:
Determination of records at risk in these cells
Only records at risk are perturbed (individual microaggregation of metric variables)
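Perturbing only the records at risk could look roughly like the sketch below. It is a simplified reading of individual microaggregation, treating one metric variable at a time: each flagged value is replaced by the mean of the k values closest to it, and all other records stay unchanged. In full microaggregation every member of a group would receive the group mean. The function name, the group size k = 3 and the sample values are assumptions, not details from the slides.

```python
def microaggregate_at_risk(values, at_risk, k=3):
    """Replace each at-risk value by the mean of its k nearest values.

    values  : list of floats for one metric variable
    at_risk : set of indices of records flagged as at risk
    k       : group size (assumed; the slides do not state it)
    """
    if k > len(values):
        raise ValueError("k must not exceed the number of records")
    result = list(values)
    for i in at_risk:
        # The k values closest to record i, including record i itself.
        nearest = sorted(values, key=lambda v: abs(v - values[i]))[:k]
        result[i] = sum(nearest) / k
    return result


# Hypothetical values of one metric variable; record 3 is the outlier at risk.
turnover = [10.0, 12.0, 14.0, 100.0, 13.0]
protected = microaggregate_at_risk(turnover, at_risk={3}, k=3)
print(protected)
```

Neighbours are always taken from the original values, so the result does not depend on the order in which flagged records are processed.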
Consequence for the quality of the anonymised datasets
Microaggregation was performed for less than 1% of all records
Small impact on regression coefficients
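One way to check such an impact is to fit the same regression on the original and on the anonymised data and compare the coefficients. The sketch below does this for a simple one-variable least-squares slope; the variable names and all numbers are made up for illustration and are not results from the CIS 2010 study.

```python
def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def relative_change(before, after):
    """Relative change of a coefficient after anonymisation."""
    return abs(after - before) / abs(before)


# Made-up values, for illustration only.
employees = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
rd_spend_original = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1]
rd_spend_protected = [1.0, 2.1, 2.9, 4.0, 5.0, 6.1]  # one record perturbed

b_orig = ols_slope(employees, rd_spend_original)
b_prot = ols_slope(employees, rd_spend_protected)
print(f"slope before: {b_orig:.4f}, after: {b_prot:.4f}, "
      f"relative change: {relative_change(b_orig, b_prot):.2%}")
```

A real check would use the regression models that researchers actually estimate on the data, which the user-needs analysis identifies.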
Conclusion
Change of perspective
from total data protection to a realistic user-oriented approach
that takes into account user needs, quality of external databases, accreditation procedures and statistical legislation
leads to datasets with higher analytical potential for the scientific community!
THANK YOU FOR YOUR ATTENTION