
Tel Aviv

April 29th, 2007

Disclosure Limitation from a Statistical Perspective

Natalie Shlomo

Dept. of Statistics, Hebrew University Central Bureau of Statistics

2

Topics of Discussion

1. Introduction and Motivation

2. Disclosure risk – data utility decision problem

3. Assessing disclosure risk

4. Methods for masking statistical data

- microdata

- tabular data

5. Assessing information loss

3

Statistical Data

Sources of Statistical Data:

• Census – full enumeration of the population

• Administrative – data collected by government agencies for other purposes, e.g. tax records, population register

• Survey – random sample drawn from the population. Each unit in the sample is assigned a sampling weight. Often the population is unknown.

SDC Approach – “Safe Data” versus “Safe Settings”

• Microdata Review Panels need to make informed decisions on releasing microdata and on the mode of access

4

Assessing Disclosure Risk

• Physical disclosure – disclosure from a breach of physical security, e.g. stolen questionnaires, computer hacking

• Statistical disclosure – disclosure from statistical outputs

Disclosure risk scenarios – assumptions about the information or IT tools available to an intruder that increase the probability of disclosure, e.g. matching to external files or spontaneous recognition

Key – combination of indirect identifying variables, such as sex, age, occupation, place of residence, country of birth and year of immigration, marital status, etc.

5

Types of Statistical Disclosure

Identity disclosure - an intruder identifies a data subject

confidentiality pledges and code of practice:

“…no statistics will be produced that are likely to identify an individual unless specifically agreed with them”

Individual attribute disclosure – confidential information is revealed and can be attributed to a data subject

• Identity disclosure is a necessary condition for individual attribute disclosure and should be avoided

Group attribute disclosure – learning about a group but not about a single subject. May cause harm, e.g. all adults in a village collect unemployment

6

The SDC Problem

R-U Confidentiality Map (Duncan et al., 2001)

[Figure: R-U confidentiality map. Axes: disclosure risk (probability of re-identification) and data utility (quantitative measure of the statistical quality). Marked on the map: original data, released data, no data, and the maximum tolerable risk threshold.]

7

Disclosure Risk Measures

• Frequency tables with full population counts:

- 1’s and 2’s in cells lead to disclosure

- 0’s may be disclosive if there is only one non-zero cell in a row or column

Disclosure risk is quantified by the percentage of small cells and by the probability that a high-risk cell is protected (taking into account other measurement errors, e.g. imputation rates)

• Magnitude tables:

Sensitivity measures based on the number of contributing units and the distribution of the target variable in the cell

8

Disclosure Risk Measures

• Microdata from surveys (and frequency tables):

Decisions are typically based on checklists and ad hoc decision rules regarding low frequencies in combinations of identifying key variables

In recent years, objective quantitative criteria have been developed for measuring disclosure risk when the population is unknown:

- Probability that a sample unique is a population unique

- Probability of a correct match between a record in the microdata to an external file
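The ad hoc rules about low frequencies in key combinations can be illustrated with a minimal sketch that counts how many records share each key and lists the sample uniques. The records and variable names below are hypothetical, and being unique in the sample is only a starting point: estimating the probability that such a record is also unique in the population needs a statistical model that is not shown here.

```python
from collections import Counter

def key_frequencies(records, key_vars):
    """Count how many records share each combination of key-variable values."""
    return Counter(tuple(rec[v] for v in key_vars) for rec in records)

def sample_uniques(records, key_vars):
    """Return the records whose key combination occurs exactly once in the sample."""
    freq = key_frequencies(records, key_vars)
    return [rec for rec in records
            if freq[tuple(rec[v] for v in key_vars)] == 1]

# Hypothetical toy microdata; the variable names are illustrative only.
microdata = [
    {"sex": "F", "age_group": "30-39", "occupation": "teacher"},
    {"sex": "F", "age_group": "30-39", "occupation": "teacher"},
    {"sex": "M", "age_group": "60-69", "occupation": "judge"},
    {"sex": "M", "age_group": "20-29", "occupation": "clerk"},
]
uniques = sample_uniques(microdata, ["sex", "age_group", "occupation"])
```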

9

On Definitions of Disclosure Risk

• In the statistics literature, we present examples of risk measures, but lack formal definitions of when a file is safe

• In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5), Adam and Wortmann (1989))

In some of the CS literature any data must be released with noise of magnitude $\sqrt{n}$

Adding noise of order $\sqrt{n}$ hides information on individuals and small groups, but yields meaningful information about sums of O(n) units, for which noise of order $\sqrt{n}$ is natural

10

On Definitions of Disclosure Risk

The worst-case scenario of the CS approach (for example, that the intruder has all information on everyone in the data set except the individual being snooped) simplifies definitions, and there is no need to consider other, more realistic but more complicated scenarios.

But would Statistics Bureaus and statisticians agree to adding noise to any data?

Other approaches like query restriction or query auditing do not lead to formal definitions.

11

On Definitions of Disclosure Risk

Collaboration between the CS and statistical communities, where:

1. In the statistical community, there is a need for more formal and clear definitions of disclosure risk

2. In the CS community, there is a need for statistical methods to preserve the utility of the data

- allow sufficient statistics to be released without perturbation

- methods for adding correlated noise

- sub-sampling and other methods for data masking

Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set practical but well-defined standards for disclosure risk?

12

SDC Methods for Microdata

Data Masking Techniques:

Non-perturbative methods – preserve the integrity of the data (impact on the variance)

Examples: recoding, local suppression, sub-sampling

Perturbative methods – alter the data (impact on bias)

Examples: adding noise, rounding, microaggregation, record swapping, post-randomisation method, synthetic data

13

SDC Methods for Microdata

• Additive noise

A random vector (for example, from a normal distribution) is generated (with zero mean) independently for each individual in the microdata and added to the continuous variables to be perturbed.

Use correlated noise based on the target variables in order to preserve means and the covariance matrix, and also to preserve linear balance edits, e.g. X+Y=Z

Let $x \sim iid(\mu, \sigma^2)$. Generate $\varepsilon \sim iid(\mu', \sigma^2)$, with $\mu' = \frac{1 - \sqrt{1-\delta^2}}{\delta}\,\mu$

Calculate: $x' = \sqrt{1-\delta^2}\,x + \delta\,\varepsilon$, where $\delta$ controls the amount of noise

$E(x') = \sqrt{1-\delta^2}\,E(x) + \delta\,E(\varepsilon) = E(x)$

$var(x') = \sigma^2$
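The correlated-noise construction can be sketched as follows, under assumptions of ours: the noise is drawn from a normal distribution, its mean is shifted to (1 - sqrt(1 - delta^2)) / delta times the data mean so that the perturbed mean equals the original mean in expectation, and the function name, seed and toy data are all hypothetical.

```python
import math
import random
import statistics

def add_correlated_noise(x, delta, rng):
    """Perturb values as x' = sqrt(1 - delta^2) * x + delta * eps.

    eps has the same variance as x and mean (1 - d1) / d2 * mean(x), where
    d1 = sqrt(1 - delta^2) and d2 = delta, so that the mean and variance of
    x' match those of x in expectation.
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must lie in (0, 1]")
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    d1, d2 = math.sqrt(1 - delta ** 2), delta
    mu_eps = (1 - d1) / d2 * mu
    return [d1 * xi + d2 * rng.gauss(mu_eps, sigma) for xi in x]

rng = random.Random(2007)  # fixed seed, for reproducibility only
income = [rng.gauss(50.0, 10.0) for _ in range(20000)]  # hypothetical data
masked = add_correlated_noise(income, delta=0.3, rng=rng)
```

A larger delta gives more protection but weakens the link between each masked value and its original.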

14

SDC Methods for Microdata

• PRAM (Post-randomisation method)

Misclassify categorical variables according to a transition matrix P and a random draw:

$p_{ij} = P(\text{perturbed category is } j \mid \text{original category is } i)$

For $T^*$ the vector of the perturbed frequencies, $\hat{T} = T^* P^{-1}$ is an unbiased moment estimator of the data

Condition of invariance, $TP = T$ (the vector of the original frequencies is an eigenvector of P): the perturbed file is an unbiased estimate of the original file.

Expected values of marginal distributions are preserved. Exact marginal distributions can also be ensured by controlling the selection process for changing records

Use control strata to ensure no silly combinations

15

SDC Methods for Microdata

PRAM Example: T = (25, 30, 50, 10)

- Generate a transition matrix with a minimum value on the diagonal ($p_d = 0.8$) and all other probabilities equal.

P =
0.8264  0.0579  0.0579  0.0579
0.0427  0.8718  0.0427  0.0427
0.0479  0.0479  0.8563  0.0479
0.0598  0.0598  0.0598  0.8207

- Calculate the invariant matrix R and determine $\alpha$ such that the final matrix will have the desired diagonals

$R^*_g = \alpha R_g + (1 - \alpha) I_g$

R* =
0.8478  0.0496  0.0740  0.0287
0.0413  0.8764  0.0598  0.0225
0.0370  0.0359  0.9058  0.0213
0.0716  0.0674  0.1067  0.7543

Note that $T R^* = T$
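A minimal sketch of the invariant PRAM step, under assumptions of ours: R is built as P times the Bayes back-transformation of P (one common two-stage construction), the mixing weight is called alpha, and alpha = 0.5 is chosen because it reproduces the R* values quoted above to within rounding. The function names are hypothetical.

```python
import random

T = [25, 30, 50, 10]  # original category frequencies from the example
P = [
    [0.8264, 0.0579, 0.0579, 0.0579],
    [0.0427, 0.8718, 0.0427, 0.0427],
    [0.0479, 0.0479, 0.8563, 0.0479],
    [0.0598, 0.0598, 0.0598, 0.8207],
]

def invariant_matrix(T, P, alpha):
    """Return R* = alpha * R + (1 - alpha) * I, where R = P Q satisfies T R = T."""
    n = len(T)
    # Expected perturbed frequency of each category j under P.
    col = [sum(T[k] * P[k][j] for k in range(n)) for j in range(n)]
    # Q[j][k]: probability that the original category was k given perturbed j.
    Q = [[T[k] * P[k][j] / col[j] for k in range(n)] for j in range(n)]
    R = [[sum(P[i][j] * Q[j][k] for j in range(n)) for k in range(n)]
         for i in range(n)]
    return [[alpha * R[i][k] + (1 - alpha) * (i == k) for k in range(n)]
            for i in range(n)]

def pram(categories, matrix, rng):
    """Redraw each category code i from row i of the transition matrix."""
    codes = range(len(matrix))
    return [rng.choices(codes, weights=matrix[c])[0] for c in categories]

R_star = invariant_matrix(T, P, alpha=0.5)
# Expected frequencies after perturbation: T R*, which should reproduce T.
expected = [sum(T[i] * R_star[i][j] for i in range(4)) for j in range(4)]
perturbed = pram([0] * 25 + [1] * 30 + [2] * 50 + [3] * 10, R_star, random.Random(1))
```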

16

SDC Methods for Microdata

• Synthetic Data

- Fit data to a model, e.g. using multiple imputation techniques to develop the posterior distribution of a population given the sample data

- Can be implemented on parts of the data, giving a mixture of real and synthetic data

- In practice, very difficult to capture all of the conditional relationships between variables and within sub-groups

• Microaggregation

- Identify groups of records, e.g. of size 3, and replace values by the group mean (it has been shown that this is easy to ‘unpick’ for one variable)

- Carry out on several variables at once, using clustering algorithms to reduce within-group variance
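The single-variable case of microaggregation can be sketched as below. This is an illustrative fixed-size grouping on sorted values, not the clustering-based multivariate version; the function name and the rule of merging a short final group into its neighbour are assumptions of ours.

```python
def microaggregate(values, k=3):
    """Replace each value by the mean of its group of at least k sorted neighbours."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Merge a short final group into the previous one so no group is smaller than k.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    masked = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked

original = [20, 1, 11, 2, 10, 3, 12]
masked = microaggregate(original, k=3)
```

Group means preserve the overall total, but with one variable the sorted grouping is easy to guess, which is the weakness noted above.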

17

SDC Methods for Magnitude Tables

Cell Suppression

                    Industry
Region    36210        36220    36300    36
A         2,608 (5)    11,358   4,871    18,837
B         2,562 (3)    11,631   3,652    17,845
C         2,608 (12)   11,956   3,054    17,618
D         Suppress     12,281   3,051    17,641
E         2,240 (2)    7,347    3,537    13,124
Total     12,327       54,573   18,165   85,065

18

SDC Methods for Magnitude Tables

Cell Suppression

                    Industry
Region    36210        36220    36300       36
A         2,608 (5)    11,358   4,871       18,837
B         Secondary    11,631   Secondary   17,845
C         2,608 (12)   11,956   3,054       17,618
D         Suppress     12,281   Secondary   17,641
E         2,240 (2)    7,347    3,537       13,124
Total     12,327       54,573   18,165      85,065

19

SDC Methods for Magnitude Tables

Information Available to Table User

(1) T(1)+T(2) = 12327-2608-2608-2240 = 4871

(2) T(1)+T(3) = 17845-11631 = 6214

(3) T(3)+T(4) = 18165-4871-3054-3537 = 6703

(4) T(2)+T(4) = 17641-12281 = 5360

(5) T(1)>0, (6) T(2)>0, (7) T(3)>0, (8) T(4)>0

Implied bounds:

T(2) <= 4871 from (1)

T(4) <= 6703 from (3)

T(2) >= max(5360-6703, 0) = 0 from (4), (6)

etc…

Represent as matrix equation and vector inequality

A T = b, T > 0 where

      1 0 1 0        T(1)        6214
A =   0 1 0 1   T =  T(2)   b =  5360
      1 1 0 0        T(3)        4871
      0 0 1 1        T(4)        6703

20

SDC Methods for Magnitude Tables Disclosure Protection

Determine upper and lower bounds for T(1), …, T(4) (feasibility intervals) using eight linear programming solutions.

1a maximise T(1) subject to AT=b, T>0

1b minimise T(1) subject to AT=b, T>0

2a maximise T(2) subject to AT=b, T>0

…

There must be ‘feasible’ solutions and true values of T(X) will lie within bounds.

Let bounds be T(1)L and T(1)U etc.
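For this particular example the eight linear programmes can be bypassed, because the four equations leave only one degree of freedom. The sketch below is a shortcut under that assumption, not a general LP solver: it writes every solution as a particular solution plus t times a null-space direction and finds the range of t that keeps all cells non-negative. The particular solution and direction were worked out by hand from the equations for the suppressed cells and are our own choices.

```python
def feasibility_intervals(particular, direction):
    """Bounds on each component of T = particular + t * direction, subject to T >= 0."""
    t_lo, t_hi = float("-inf"), float("inf")
    for p, d in zip(particular, direction):
        if d > 0:
            t_lo = max(t_lo, -p / d)   # p + d*t >= 0  implies  t >= -p/d
        elif d < 0:
            t_hi = min(t_hi, p / -d)   # p + d*t >= 0  implies  t <= p/(-d)
        elif p < 0:
            raise ValueError("no non-negative solution exists")
    if t_lo > t_hi:
        raise ValueError("no non-negative solution exists")
    return [tuple(sorted((p + d * t_lo, p + d * t_hi)))
            for p, d in zip(particular, direction)]

# One solution of T(1)+T(2)=4871, T(1)+T(3)=6214, T(3)+T(4)=6703, T(2)+T(4)=5360.
particular = [0, 4871, 6214, 489]
# Adding t to T(1) and T(4) while subtracting it from T(2) and T(3) keeps all sums fixed.
direction = [1, -1, -1, 1]
bounds = feasibility_intervals(particular, direction)
```

Larger tables with several degrees of freedom would need a proper linear programming solver.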

21

SDC Methods for Magnitude Tables

Disclosure Protection

                    Industry
Region    36210         36220    36300            36
A         2,608         11,358   4,871            18,837
B         [0 , 4871]    11,631   [1343 , 6214]    17,845
C         2,608         11,956   3,054            17,618
D         [0 , 4871]    12,281   [489 , 5360]     17,641
E         2,240         7,347    3,537            13,124
Total     12,327        54,573   18,165           85,065

22

SDC Methods for Magnitude Tables

Choice of Secondary Cells

Stipulate a requirement on T(X)L and T(X)U to ensure that the interval is sufficiently wide relative to a fixed percentage p, e.g. [T(X)U-T(X)L] ≥ (p/100)T(X) for all X

Employ sensitivity measure:

Require T(X)U > T(X)+(p/100)T(X)

And by symmetry T(X)L < T(X)-(p/100)T(X)

Sliding rule protection – only the width is predetermined and the interval may be skewed

23

SDC Methods for Magnitude Tables

Choice of Secondary Cells

Many possible sets of suppressed cells (including all cells!). Define a target function and minimise it subject to constraints for preserving protection intervals

Idea: Target function: Cost = information content of cell

$\sum_{\text{suppressed cells } X} C(X)$

Common choices of C(X):

a) C(X)=1 minimise number of cells suppressed

b) C(X)=N(X) minimise number of contributors suppressed

c) C(X)=T(X) minimise total value suppressed (all cells must be non-negative)

24

SDC Methods for Magnitude Tables

Choice of Secondary Cells

Hypercube method: simple but not optimal

On a k-dimensional table, choose a k-dimensional hypercube with the sensitive cell in a corner. All $2^k$ corner points are suppressed

Criteria:

• A corner can’t be zero, since structural zeros may allow other corners to be recalculated

• Feasibility intervals should be sufficiently wide (intervals are simpler to calculate on a hypercube)

• Generate possible suppression candidates and choose the one with minimal information loss (minimise cost function)

• A priori, prefer cells that were previously suppressed, to minimise information loss, by putting a large negative cost on the suppressed cells

25

SDC Methods for Frequency Tables

Rounding

Round frequencies

- deterministic, e.g. to nearest 5

- random, e.g. to a close multiple of 5

- controlled, e.g. to a multiple of 5

Deterministic and random rounding: usually interior cells and margins are rounded independently, so tables don’t add up

Controlled rounding: margins = sum of interior cells

Rounding can be implemented on only the small cells of the table, with margins added up from perturbed and non-perturbed cells

26

SDC Methods for Frequency Tables

Rounding

Example – complete census

         1    2    3    4    Total
A        0    1    0    0    1
B        5    2    2    4    13
C        6    1    0    3    10
D        4    7    0    4    15
Total    15   11   2    11   39

What types of disclosure risk are present in this table?

27

SDC Methods for Frequency Tables

Rounding

Deterministic rounding process to base 3 (original → rounded)

         1          2          3        4          Total
A        0 → 0      1 → 0      0 → 0    0 → 0      1 → 0
B        5 → 6      2 → 3      2 → 3    4 → 3      13 → 12
C        6 → 6      1 → 0      0 → 0    3 → 3      10 → 9
D        4 → 3      7 → 6      0 → 0    4 → 3      15 → 15
Total    15 → 15    11 → 12    2 → 3    11 → 12    39 → 39

The published table

         1    2    3    4    Total
A        0    0    0    0    0
B        6    3    3    3    12
C        6    0    0    3    9
D        3    6    0    3    15
Total    15   12   3    12   39
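The deterministic step is simple enough to check by hand, but a short sketch makes explicit that every cell, including the margins, is rounded on its own. The function name is ours; the data are the census example with the Total row and column included.

```python
def round_to_base(x, base):
    """Round a non-negative integer count to the nearest multiple of base."""
    return (x + base // 2) // base * base

# Original census example, last column and last row are the margins.
original = [
    [0, 1, 0, 0, 1],
    [5, 2, 2, 4, 13],
    [6, 1, 0, 3, 10],
    [4, 7, 0, 4, 15],
    [15, 11, 2, 11, 39],
]
published = [[round_to_base(cell, 3) for cell in row] for row in original]
```

In the result, row B reads 6, 3, 3, 3 with a published total of 12, so the interior cells no longer sum to the margin.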

28

SDC Methods for Frequency Tables

Rounding

Random rounding algorithm:

• Let $Floor(x)$ be the largest multiple $kb$ of the base $b$ such that $kb \le x$ for an entry $x$.

• Define $res(x) = x - Floor(x)$

• $x$ is rounded up to $Floor(x) + b$ with probability $\frac{res(x)}{b}$ and rounded down to $Floor(x)$ with probability $1 - \frac{res(x)}{b}$

• If $x$ is already a multiple of $b$, it remains unchanged

The expected value of the rounded entry is the original entry since:

$(Floor(x) - x)\left(1 - \frac{res(x)}{b}\right) + (Floor(x) + b - x)\,\frac{res(x)}{b} = 0$

Each small cell is rounded independently in the table.

Can also control the selection process to ensure additive totals in one dimension.
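The random rounding rule can be sketched in a few lines. The function name and the seeded generator are ours; the logic follows the steps listed above for non-negative integer counts.

```python
import random

def random_round(x, base, rng):
    """Unbiased random rounding of a non-negative integer count to a multiple of base."""
    floor = (x // base) * base   # largest multiple of base not exceeding x
    res = x - floor              # residual, between 0 and base - 1
    if res == 0:
        return x                 # multiples of the base are left unchanged
    # Round up with probability res/base, otherwise round down.
    return floor + base if rng.random() < res / base else floor

rng = random.Random(29)  # fixed seed, for reproducibility only
draws = [random_round(4, 3, rng) for _ in range(30000)]
average = sum(draws) / len(draws)
```

Averaged over many draws, an entry of 4 rounded to base 3 comes out close to 4, which is the unbiasedness property stated above.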

29

SDC Methods for Frequency Tables

Rounding

A typical rounding scheme (base 3)

Original    Rounded
0           0
1           0 with probability 2/3, 3 with probability 1/3
2           0 with probability 1/3, 3 with probability 2/3
3           3
4           3 with probability 2/3, 6 with probability 1/3
…

An example of random rounding (original → rounded)

         1          2         3        4          Total
A        0 → 0      1 → 3     0 → 0    0 → 0      1 → 0
B        5 → 3      2 → 0     2 → 3    4 → 6      13 → 15
C        6 → 6      1 → 0     0 → 0    3 → 3      10 → 12
D        4 → 6      7 → 9     0 → 0    4 → 3      15 → 15
Total    15 → 15    11 → 9    2 → 3    11 → 12    39 → 39

Margins rounded separately

30

SDC Methods for Frequency Tables

Rounding

Complete census in small area, after random rounding

                   Age
                   16-25   26-49   40-59   60-69   70-79   80+   Total
benefit claimed    20      15      15      5       5       0     60
not claimed        25      10      5       0       0       0     45
Total              40      30      15      0       5       5     105

- Ones and twos disappear

- Doubt cast on zeros, so disclosure prevented

- Figures don’t add up, which may allow the table to be “unpicked”

31

SDC Methods for Frequency Tables

Rounding

Controlled rounding – rounding in such a way that tables are additive

                   Age
                   16-25   26-49   40-59   60-69   70-79   80+   Total
benefit claimed    20      15      15      10      0       0     60
not claimed        20      15      5       0       5       0     45
Total              40      30      20      10      5       0     105

- Ones and twos disappear

- Doubt cast on zeros, so disclosure prevented

- Tables additive

- Zero-restricted – an entry that is an integer multiple of the base b is unchanged

32

Auditing Random Rounding

Example – random rounding to base 5

Original table

          Under 30   30-60   Over 60   Total
Male      6          7       1         14
Female    5          4       0         9
Total     11         11      1         23

Original → rounded

          Under 30   30-60      Over 60   Total
Male      6 → 5      7 → 10     1 → 0     14 → 10
Female    5 → 5      4 → 5      0 → 0     9 → 5
Total     11 → 15    11 → 15    1 → 0     23 → 20

• Feasible interval generally 8 wide (between 1 and 9), except for column 3 which is 4 wide (between 0 and 4)

• Column one and row one do not add up to totals, nor one-way margins to grand total

33

Auditing Random Rounding

Example

Restrict attention just to one-way margins and total.

          Under 30   30-60   Over 60   Total
Male                                   10
Female                                 5
Total     15         15      0         20

34

Auditing Random Rounding

• Feasible intervals

          Under 30      30-60         Over 60    Total
Male                                             10 (6-14)
Female                                           5 (1-9)
Total     15 (11-19)    15 (11-19)    0 (0-4)    20 (16-24)

35

Auditing Random Rounding

          Under 30      30-60         Over 60    Total
Male                                             10 (13-14)
Female                                           5 (8-9)
Total     15 (11-12)    15 (11-12)    0 (0-1)    20 (22-23)

• Feasible intervals are sometimes much narrower than the rounding method suggests.

• In some cases, where frequencies are low, this can result in potential disclosure.
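The narrowing of the feasible intervals can be reproduced by simple interval propagation: each published value gives an initial interval of width 2(b - 1), and the two additivity constraints (rows sum to the grand total, columns sum to the grand total) are applied repeatedly until nothing changes. This is a sketch of one possible auditing procedure under our own naming, not necessarily the one used to produce the figures above.

```python
def initial_interval(rounded, base):
    """Originals that could have been randomly rounded to this published value."""
    return [max(0, rounded - base + 1), rounded + base - 1]

def tighten_sum(parts, total):
    """Tighten intervals in place under sum(parts) == total; return True if changed."""
    before = [tuple(p) for p in parts] + [tuple(total)]
    total[0] = max(total[0], sum(p[0] for p in parts))
    total[1] = min(total[1], sum(p[1] for p in parts))
    for p in parts:
        others_lo = sum(q[0] for q in parts) - p[0]
        others_hi = sum(q[1] for q in parts) - p[1]
        p[0] = max(p[0], total[0] - others_hi)
        p[1] = min(p[1], total[1] - others_lo)
    return before != [tuple(p) for p in parts] + [tuple(total)]

base = 5
rows = [initial_interval(v, base) for v in (10, 5)]      # Male, Female
cols = [initial_interval(v, base) for v in (15, 15, 0)]  # age-group totals
grand = initial_interval(20, base)

# Apply both constraints until no interval changes ('|' runs both every pass).
while tighten_sum(rows, grand) | tighten_sum(cols, grand):
    pass
```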

36

Impact on Analysis

• Loss of information – combining categories

• Inflate or deflate variance:

$IL = \frac{Var(\hat{\theta}(data_{new}))}{Var(\hat{\theta}(data_{old}))}$

• Bias and inconsistency in the data:

$Bias(\hat{\theta}(data_{new})) = E(\hat{\theta}(data_{new})) - \theta$

Some SDC methods are transparent and users can take them into account, e.g. rounding. Other methods have hidden bias and the effects are difficult to assess, e.g. record swapping

37

Information Loss Measures

Types of Information loss measures depending on use of data:

• Distortion to distributions and totals (bias) as measured by distance metrics, entropy, average perturbation per cell, etc.

• Impact on variance of estimates

• Impact on measures of association based on chi-squared tests for independence

• Impact on goodness-of-fit criteria, regression coefficients, statistical analysis and inference

38

Information Loss Measures

Measures for Bias and Distortion

• Hellinger’s Distance

$HD(D_{orig}, D_{pert}) = \frac{1}{|OA|} \sum_{k=1}^{|OA|} \sqrt{\frac{1}{2} \sum_{c \in k} \left( \sqrt{D^k_{pert}(c)} - \sqrt{D^k_{orig}(c)} \right)^2}$

• Relative Absolute Distance

$RAD(D_{orig}, D_{pert}) = \frac{1}{|OA|} \sum_{k=1}^{|OA|} \sum_{c \in k} \frac{|D^k_{pert}(c) - D^k_{orig}(c)|}{D^k_{orig}(c)}$

• Average Absolute Distance per cell

$AAD(D_{orig}, D_{pert}) = \frac{1}{|OA|} \sum_{k=1}^{|OA|} \frac{\sum_{c \in k} |D^k_{pert}(c) - D^k_{orig}(c)|}{|k|}$

where $|k| = \sum_{c} I(c \in k)$

Method    SCA       CSCA      CRND
HD        5.272     5.279     5.416
RAD       76.804    76.824    84.641
AAD       0.629     0.630     1.021

SCA – small cell rounding

CRND – semi controlled full rounding
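The three distance measures can be sketched for tables stored as flat lists of cell counts, one list per area. Two choices below are assumptions of ours and are not stated on the slide: cells whose original count is zero are skipped in the relative distance to avoid dividing by zero, and the function names are illustrative.

```python
import math

def hellinger(orig, pert):
    """Hellinger distance between two tables given as flat lists of cell counts."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig, pert)))

def relative_absolute_distance(orig, pert):
    """Sum of |pert - orig| / orig; cells with orig == 0 are skipped (our assumption)."""
    return sum(abs(p - o) / o for o, p in zip(orig, pert) if o != 0)

def average_absolute_distance(orig, pert):
    """Average absolute perturbation per cell."""
    return sum(abs(p - o) for o, p in zip(orig, pert)) / len(orig)

def average_over_areas(measure, orig_tables, pert_tables):
    """Average a per-table measure over all areas (the 1/|OA| outer sum)."""
    return sum(measure(o, p) for o, p in zip(orig_tables, pert_tables)) / len(orig_tables)

# Hypothetical example: two areas with two cells each.
orig_tables = [[4, 9], [3, 6]]
pert_tables = [[1, 16], [3, 6]]
hd = average_over_areas(hellinger, orig_tables, pert_tables)
aad = average_over_areas(average_absolute_distance, orig_tables, pert_tables)
```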

39

Information Loss Measures

Measures for Bias and Distortion

$AD(N^k_{orig}, N^k_{pert}) = |N^k_{pert}(C) - N^k_{orig}(C)|$ for 10 consecutive OA’s

[Figure legend: R – random; R/I – random (no imputed); T – targeted]

40

Information Loss Measures

Impact on Measure of Association – Cramer’s V

On the two-way table defined by OA * Age-Sex and Economic Activity * Long-Term Illness, calculate:

$CV = \sqrt{\frac{\chi^2 / n}{\min(R-1,\, C-1)}}$

The information loss measure:

$RCV(D_{orig}, D_{pert}) = 100 \times \frac{CV(D_{orig}) - CV(D_{pert})}{CV(D_{orig})}$

SJ, Cramer’s V = 0.2021

Method      SCA      CSCA     CRND
Original    -6.8%    -6.7%    -7.8%

SJ, Cramer’s V = 0.2021

Method      1%      10%     20%
Random      0.3%    2.8%    4.8%
Rand/Imp    0.3%    2.0%    3.8%
Targeted    0.1%    1.4%    3.3%
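Cramer's V and its relative change can be computed directly from a two-way table of counts. In the sketch below, the sign convention of the relative change (original minus perturbed, as a percentage of the original) is an assumption of ours, and the tables are hypothetical.

```python
import math

def cramers_v(table):
    """Cramer's V for a two-way table given as a list of rows of counts."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # empty rows or columns contribute nothing
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / n / min(len(table) - 1, len(table[0]) - 1))

def relative_change_in_v(orig, pert):
    """Percentage change in Cramer's V; the sign convention is an assumption."""
    return 100 * (cramers_v(orig) - cramers_v(pert)) / cramers_v(orig)

# Hypothetical tables: perfect association, then the same table after perturbation.
orig = [[10, 0], [0, 10]]
pert = [[9, 1], [1, 9]]
rcv = relative_change_in_v(orig, pert)
```

With this convention a positive value means the perturbation weakened the measured association.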

41

Disclosure Control Techniques

Record Swapping

Disclosure risk measure – Percent records in small cells of the tables that were not perturbed or not imputed

Information Loss measure – Average absolute difference per cell

[Figure: Region SJ. Vertical axis: percent unperturbed in small cells (0% to 100%). Horizontal axis: average perturbation per cell (0 to 1.4). Series: Random, Rand/Imp, Targeted, with points labelled 1%, 10% and 20%.]

$AAD(D_{orig}, D_{pert}) = \frac{\sum_{c \in C} |D_{pert}(c) - D_{orig}(c)|}{|C|}$