
1

Measures of Disclosure Risk and Harm

Diane Lambert, Journal of Official Statistics, 9 (1993), pp. 313-331

Jim Lynch

NISS/SAMSI & University of South Carolina

http://cm.bell-labs.com/stat/doc/93.17.ps

2

Measures of Disclosure Risk and Harm

• Introduction-Discussion (Section 7)

• What is Disclosure?

• Risk of Perceived Identification

• Modeling the Intruder

• Risk of True Identification

• Disclosure Harm

3

Discussion (Section 7)

• It is the intruder, and not the structure of the data alone, that controls disclosure.

• When the intruder is sure enough that a released record belongs to a respondent

– There is a re-identification.

– It may be incorrect, but the intruder perceives there to be a re-identification.

4


• The risk of perceived disclosure and the risk of true disclosure cannot be measured without considering the seriousness of the threat posed by the intruder's strategy.

• The harm that follows from a re-identification

– Depends on the attributes, if any, that the intruder infers about the target

– The harm cannot be measured without considering the strategy that the intruder uses to infer sensitive attributes.

• Once the intruder's strategy is modeled, disclosure risk and harm can be evaluated

• Risk is measured in terms of probabilities

• Harm is measured in losses or costs.

5

Discussion (Section 7)

• All the agency can do to reduce disclosure risk or harm is

– to mask the data before release

– or carefully select the individuals and organizations that are given the data, or both.

• The models developed here imply that masking and releasing only a subset of records does not necessarily protect against disclosure.

• Masking may lower the risk of true re-identification

– But it may also lead to false re-identifications and false inferences about attributes.

– The fact that inferred attributes may be wrong may be little comfort to the respondent whose record is re-identified.

6


• Masking also complicates data analysis

– An agency cannot be expected to predict and minimize all the effects of masking on all the analyses of interest.

– Nor is it reasonable to expect the data analyst to describe how the data will be analyzed before the data are obtained so that the agency can verify that the conclusions will be the same for the masked data as they would have been for the original data.

– Future masking techniques may preserve more general features of the data, but for now data masked enough to preserve confidentiality can be a challenge to analyze appropriately.

7

Discussion (Section 7)

• It does seem reasonable to put some of the burden for protecting confidentiality on the researcher.

– Institutions and researchers have to abide by all sorts of conditions in experiments involving humans.

– The experience in those and other areas ought to provide some guidance on protecting respondents in agency databases from unscrupulous intruders.

– This would not necessarily remove the need for some masking, but it might reduce the need for heroic masking that severely limits the usefulness of the data.

• “Confidentiality issues for medical data miners,” Jules J. Berman, Pathology Informatics Cancer Diagnosis Program, DCTD, NCI, NIH,

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=12234715&dopt=Citation

8

Discussion (Section 7)

• One could argue that models of disclosure are hopeless because the issues are too complex and the intruder too mysterious.

• This paper, though, argues that models of disclosure are indispensable.

– They force definitions and assumptions to be stated explicitly.

– When the assumptions are realistic, models of disclosure can lead to practical measures of disclosure risk and harm.

9

What is Disclosure?

• Key Attributes

– Useful for identification but usually not sensitive

– E.g., age, location, marital status, and profession

• Sensitive Attributes

– Disease, debts, credit rating

• Scenario: A sample of records is released

– Obvious identifiers removed

– Some attributes left intact, such as marital status

– Others modified to protect confidentiality

• Incomes truncated, professions grouped more coarsely, ages on pairs of records swapped; some attributes on some records might be missing or imputed.

10

What is Disclosure?

• Two types of disclosure

– Identity Disclosure (Identification or Re-identification)

• Equivalent to inadvertent release of an identifiable record

– Attribute Disclosure

• Occurs when the intruder believes something new has been learned about the respondent.

• May occur with or without re-identification

• E.g., the intruder may narrow the list of possible target records to two with nearly the same value of a sensitive attribute. Then the attribute is disclosed although the target record is not located. Or two records may be averaged so the released record belongs to no one. Yet the debt on the averaged record may disclose something about the debt carried by the targeted individual. The agency must decide whether attribute disclosures without identifications are important.

11

What is Disclosure?

• The paper considers only disclosures that involve re-identifications, NOT attribute disclosures without re-identifications.

• Attribute disclosures that result from re-identification are considered to the extent that they harm the respondent.

• In this paper, the risk of disclosure is the risk of re-identifying a released record, and the harm from disclosure depends on what is learned from the identification.

12

What is Disclosure?

• Attribute disclosures that do not involve identification are ignored

• This assumes that all intruders first look for the record that is most likely to be correct and then take information about the targeted attribute from that record.

• Intruders with other strategies are ignored.

13

What is Disclosure?

• Includes

– true and false re-identifications and

– true and false attribute disclosures.

– Correct and incorrect inferences can be distinguished if desired (as happens with measures of harm)

• It distinguishes between

– true identification and true attribute disclosure and

– perceived identification and perceived attribute disclosure (the intruder believes the information is correct)

The former are the relevant notions when correct inferences are to be prevented, the latter when perceived inferences are to be prevented.

14

The Risk of Perceived Identification

• Basic Premise: Disclosure is limited only to the extent that the intruder is discouraged from making any inferences, correct or incorrect, about a particular target respondent.

15


• Format (Similar to Jerry’s last time)

– Population of N records denoted Z

– A random sample of n masked records X=(x1,…, xn) with k attributes

– Masking suppresses attributes in Z, adds random noise, truncates outliers, or swaps values of an attribute between records.

– Knowing this, which, if any, record in the released file should be linked to the target respondent’s record Y?

16


• Rational Intruder has two options.

– 1. Decide that one of the released records belongs to the target respondent (i.e., link the i-th released record xi to the target record Y).

– 2. Decide not to link (the null link) any released record to Y, perhaps because none of the released records is close enough to what the intruder expects for Y or perhaps because too many released records are close to what the intruder expects for Y. The decision not

• Rational intruder chooses the link (non null or null) believed most likely to be correct whenever any incorrect choice incurs the same positive loss and a correct link incurs no loss. (See Duncan and Lambert (1989) for details.)

17

The Risk of Perceived Identification

p_i = the intruder's probability that the i-th released record in X is the target's.

\[ q = 1 - \sum_{i=1}^{n} p_i \]

is the intruder's probability that the target record has not been released.

Intruder’s protocol

o If \( q > \max_i p_i \), don’t link (choose the null link)

o If \( \max_i p_i \) is large enough, then the intruder links with a released record

o \( \max_i p_i \) is called the risk of perceived re-identification

o The risk of perceived re-identification is

– not defined in terms of the intruder's expected loss or the agency's or respondent's expected loss

– defined in terms of the seriousness of the threat posed by the intruder

– dependent only on the intruder's posterior probability that the i-th released record is the correct one.
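The link or null-link rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `perceived_risk` and the `threshold` parameter (a stand-in for "large enough") are assumptions, not part of Lambert's paper.

```python
def perceived_risk(p, threshold=0.5):
    """Apply the link / null-link rule to link probabilities p[i].

    p[i] is the intruder's probability that released record i is the target's.
    `threshold` is a hypothetical stand-in for "large enough".
    Returns (risk, linked_index); linked_index is None for the null link.
    """
    q = 1.0 - sum(p)                       # probability the target was not released
    best = max(range(len(p)), key=lambda i: p[i])
    risk = p[best]                         # risk of perceived re-identification
    if q > risk or risk < threshold:
        return risk, None                  # null link
    return risk, best


print(perceived_risk([0.89, 0.11]))        # links to record 0
print(perceived_risk([0.10, 0.05]))        # null link: q = 0.85 exceeds 0.10
```

The risk itself is just the largest link probability; the threshold only affects the intruder's decision, not the measure.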

18

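The Risk of Perceived Identification

Other Measures

With \( P_{ij}(X) = P[x_i \text{ is the } j\text{th pop record} \mid X] \):

\[
\begin{aligned}
D(X) &= \max_{1 \le j \le N}\, \max_{1 \le i \le n} P[x_i \text{ is the } j\text{th pop record} \mid X] = \max_{j} \max_{i} P_{ij}(X) \\
D(X) &= \max_{i \ne j \le n} P[x_i \text{ is pop 1's record},\ x_j \text{ is pop 2's record} \mid X] \\
D_{ave}(X) &= \sum_{j=1}^{N} \max_{1 \le i \le n} P_{ij}(X) / N = D_{tot}(X)/N \\
D(X) &= \left| \{\, j \le N : \max_{1 \le i \le n} P_{ij}(X) \ \ldots \,\} \right|, \quad \text{where } |A| \text{ denotes the cardinality of } A
\end{aligned}
\]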

19

Modeling the Intruder

Example 4.1 – Pop of 2 Records: N=2=n

• One continuous attribute

• Intruder makes judgments about M(Y), the masked version of target Y

• Series of judgments leads to intruder modeling prior about M(Y) as lognormal(0,1) (prior denoted f1(x))

• Information about the other respondent, Y’, is modeled as M(Y’)~lognormal(2,1), denoted f2(x)

• E(M(Y))=1.65 and E(M(Y’))=12.2

• Released data is X=(7,20)

20

Modeling the Intruder

Example 4.1 – A “Posterior Calculation”

• p1=P(M(Y)=7|X=(7,20))=P(M(Y’)=20|X=(7,20))=f1(7)f2(20)/[f1(7)f2(20)+f2(7)f1(20)]=.89

• In the original population Y=Z1<Z2=Y’; p1 is just the probability that the order is preserved in the released data after masking. The terminology of “prior” and “posterior” doesn’t suggest that this is Bayesian; it just models the masking.

• If masking techniques require order to be preserved then p1=1 and the joint distribution of M(Y) and M(Y’) is not f1f2.
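As a check on the arithmetic, the .89 can be reproduced with the standard lognormal density. This is a minimal sketch; the helper name `lognormal_pdf` is an illustrative choice, and the second lognormal parameter is taken to be a standard deviation of 1 on the log scale.

```python
import math


def lognormal_pdf(x, mu, sigma=1.0):
    """Density of exp(N(mu, sigma^2)) at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def f1(x):
    return lognormal_pdf(x, 0.0)   # intruder's model for M(Y)


def f2(x):
    return lognormal_pdf(x, 2.0)   # intruder's model for M(Y')


order_kept = f1(7) * f2(20)        # 7 is the target's record, 20 the other's
order_flipped = f2(7) * f1(20)     # the reverse assignment
p1 = order_kept / (order_kept + order_flipped)

print(round(p1, 2))                      # 0.89
print(math.exp(0.5), math.exp(2.5))      # the two means, about 1.65 and 12.2
```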

21

Modeling the Intruder Example 4.1

\[
\begin{aligned}
D(X) &= \max_{j \le 2}\, \max_{i \le 2} P[x_i \text{ is the } j\text{th pop record} \mid X=(7,20)] \\
&= \max\{ P(M(Y)=7 \mid X=(7,20)),\ P(M(Y)=20 \mid X=(7,20)) \} \\
&= \max\{.89, .11\} = .89
\end{aligned}
\]

• Suppose only one record is released and it is x=7. Then, p1=P(M(Y) is selected and M(Y)=7|X=7)=.5f1(7)/[.5f1(7)+.5f2(7)]=.13

• In this case, D(X)=max(.13,.87)=.87
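The single-record case can be checked the same way. This sketch assumes the same lognormal(0,1) and lognormal(2,1) models as Example 4.1 and that each of the two records is equally likely to be the one released; the helper name is illustrative.

```python
import math


def lognormal_pdf(x, mu, sigma=1.0):
    """Density of exp(N(mu, sigma^2)) at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


x = 7.0
w_target = 0.5 * lognormal_pdf(x, 0.0)   # target selected and masked to 7
w_other = 0.5 * lognormal_pdf(x, 2.0)    # other respondent selected and masked to 7
p1 = w_target / (w_target + w_other)
risk = max(p1, 1.0 - p1)                 # D(X): the larger of the two link probabilities

print(round(p1, 2), round(risk, 2))      # 0.13 0.87
```

The released value 7 is far more plausible under f2 than under f1, so the perceived risk is high even though it attaches to the non-target respondent.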

22

Modeling the Intruder Example 4.2 – n of N records

• Intruder believes that the ith record in pop Z will appear as Mi=M(Yi)~ fi(x)

• The probability that the nth released record belongs to target Y1 is

pn=P(Y1 is sampled and M1=xn|X)

=P(xn is sampled from f1 and x1,…, xn-1 are sampled from f2,…, fN)/P(x1,…, xn are sampled from f1,…, fN)

23

Modeling the Intruder Example 4.2 - n=2 of N=3 records

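Non-unique Records

\[
p_1 = P(Y_1 \text{ is sampled and } M_1 = x_1 \mid X=(x_1,x_2))
= \frac{\tfrac{1}{3} f_1(x_1)\,\bigl[.5\,(f_2(x_2)+f_3(x_2))\bigr]}
       {\tfrac{1}{3} \sum_{j=1}^{3} f_j(x_1)\,\bigl[.5 \sum_{i \ne j} f_i(x_2)\bigr]}
\]

• If f1=f2=f3 then p1=1/3 (in general, p1=1/N)

• If f1=f2 then p1≤1/2 (in general, p1≤1/(N-1) if f1=f2=…=fN-1)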
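For small N the n-of-N link probability can be computed by brute force, enumerating every way the released slots could have been filled from the population. The sketch below assumes simple random sampling without replacement, equal weight for every ordered assignment, and independent masked values; the function name `link_probability` is hypothetical.

```python
from itertools import permutations
from math import prod


def link_probability(densities, released, target=0, slot=0):
    """P(the target's masked record sits in `slot` | released values).

    densities[j] is the intruder's density f_j for population record j.
    Every ordered assignment of distinct population records to the
    released slots is assumed equally likely a priori.
    """
    n = len(released)
    total = 0.0
    hit = 0.0
    for assign in permutations(range(len(densities)), n):
        weight = prod(densities[j](x) for j, x in zip(assign, released))
        total += weight
        if assign[slot] == target:
            hit += weight
    return hit / total


def flat(x):
    return 1.0   # identical densities make all records indistinguishable


print(link_probability([flat, flat, flat], [1.0, 2.0]))   # 1/3 when f1 = f2 = f3
```

Enumeration grows factorially, so this is only useful for checking small cases such as n=2 of N=3.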

24

Example 4.2 - n=1 of N=2 records

Unknown respondents may be re-identifiable

• Intruder’s priors on Z

– Y1~Unif[-4,4], Y2~N(0,1), x1=-2.25

\[
p_1 = P(Y_1 \text{ is sampled and } Y_1 = x_1 \mid X=-2.25)
= \frac{f_1(x_1)}{\sum_{j} f_j(x_1)}
= \frac{1/8}{1/8 + f_2(x_1)} = .798
\]
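The value on this slide follows directly from the two densities evaluated at the released value. A minimal sketch, with illustrative helper names, assuming each record is equally likely to be the one released:

```python
import math


def uniform_pdf(x, lo=-4.0, hi=4.0):
    """Density of Unif[lo, hi]: the intruder's vague prior for Y1."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def std_normal_pdf(x):
    """Density of N(0, 1): the intruder's prior for Y2."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


x1 = -2.25
p1 = uniform_pdf(x1) / (uniform_pdf(x1) + std_normal_pdf(x1))
print(p1)   # close to the .798 shown on the slide
```

The intruder knows little about Y1, yet a value in the tail of the N(0,1) prior for Y2 is still attributed to Y1 with high probability.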

25

Example 4.2 - n=10 of N=100 records

Sampling by itself need not protect confidentiality

• Target is thought to be the smallest in Pop

• The Priors: Y1~LogN(0,.5), Y2,…,Y100~LogN(2,.5)

• Masking is iid multiplicative LogN(0,.5)

• Uncertainty in the released records (masking+intruder prior)

M1~LogN(0,1)=f1, M2,…,M100~LogN(2,1)=f2

• X=(.05, .14, 1.5, 2.4, 3.2, 3.8, 4.6, 8.7, 10.3, 10.7)

26

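Example 4.2 - n=10 of N=100 records

Sampling by itself need not protect confidentiality

\[
\begin{aligned}
p_i = P(M_1 = x_i \mid X)
&= \frac{\dfrac{1}{100}\, f_1(x_i)\, \dfrac{\binom{99}{9}}{\binom{100}{10}} \prod_{j \ne i} f_2(x_j)}
        {\displaystyle\sum_{j=1}^{10} \dfrac{1}{100}\, f_1(x_j)\, \dfrac{\binom{99}{9}}{\binom{100}{10}} \prod_{k \ne j} f_2(x_k)
         \;+\; \dfrac{\binom{99}{10}}{\binom{100}{10}} \prod_{k=1}^{10} f_2(x_k)} \\[1ex]
&= \frac{f_1(x_i)/f_2(x_i)}{\displaystyle\sum_{j=1}^{10} f_1(x_j)/f_2(x_j) + 900}
\end{aligned}
\]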

27

Example 4.2 - n=10 of N=100 records

Sampling by itself need not protect confidentiality

Values of Pj1(X)

xj      f1(xj)/f2(xj)   90      Claim    900
0.05    147.781122      0.49    .86      0.132
0.14    52.778972       0.176   .11      0.047
1.5     4.926037        0.016   <.001    0.004
2.4     3.078773        0.01    <.001    0.003
3.2     2.30908         0.008   <.001    0.002
3.8     1.944488        0.006   <.001    0.002
4.6     1.606317        0.005   <.001    0.001
8.7     0.849317        0.003   <.001    0
10.3    0.717384        0.002   <.001    0
10.7    0.690566        0.002   <.001    0
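The last column of the table can be reproduced from the ratio column alone. The sketch below takes the ratios exactly as tabulated (it does not recompute them from the densities) and treats the additive constant in the denominator as a parameter; the function name `link_probs` is an assumption made for illustration.

```python
# Ratios f1(xj)/f2(xj), copied from the table above.
ratios = [147.781122, 52.778972, 4.926037, 3.078773, 2.30908,
          1.944488, 1.606317, 0.849317, 0.717384, 0.690566]


def link_probs(ratios, constant):
    """p_i = r_i / (sum_j r_j + constant), one value per released record."""
    denom = sum(ratios) + constant
    return [r / denom for r in ratios]


p900 = link_probs(ratios, 900)
print([round(p, 3) for p in p900[:3]])   # [0.132, 0.047, 0.004]
```

Because the constant enters only through the denominator, a smaller constant raises every link probability by the same factor, which is why the choice between the tabulated columns matters so much for the smallest record.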

28

Risk of True Identification

• The agency cannot control the intruder's perceptions and actions once the data are released.

• All it can do is count the number of true identifications for an intruder with a given set of beliefs about the target and source file.

• A reasonable measure of the risk of true identification, then, is simply the fraction of released records (or number of released records) that an intruder can correctly re identify.

29


• Distinguishes “Risk of Matching” (Spruill, 1982-4) from “Risk of True Identification” (Risk of Matching is the proportion of masked records whose closest source records are the actual source records that generated them)

• To illustrate Risk of True identification, consider the following example where N is large and n small so that we can calculate using sampling with replacement

30


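With \( p_{n1} \) the probability that the n-th released record comes from source record 1:

\[
\begin{aligned}
p_{n1} &= \frac{P(x_1,\dots,x_{n-1} \text{ is from } f_2,\dots,f_N \text{ and } x_n \text{ is from } f_1)}
              {P(x_1,\dots,x_n \text{ is from } f_1,\dots,f_N)} \\[1ex]
&= \frac{f_1(x_n) \prod_{i=1}^{n-1} P(x_i \text{ is from } f_2 \text{ or } f_3 \text{ or} \dots \text{or } f_N)}
        {\prod_{i=1}^{n} P(x_i \text{ is from } f_1 \text{ or } f_2 \text{ or} \dots \text{or } f_N)} \\[1ex]
&= \frac{f_1(x_n) \prod_{i=1}^{n-1} \sum_{j=2}^{N} f_j(x_i)}
        {\prod_{i=1}^{n} \sum_{j=1}^{N} f_j(x_i)}
\end{aligned}
\]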

31


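Thus,

\[
\begin{aligned}
p_{11} \ge p_{n1}
&\iff \frac{p_{11}}{p_{n1}}
 = \frac{f_1(x_1) \prod_{i=2}^{n} \sum_{j=2}^{N} f_j(x_i)}
        {f_1(x_n) \prod_{i=1}^{n-1} \sum_{j=2}^{N} f_j(x_i)}
 = \frac{f_1(x_1) \sum_{j=2}^{N} f_j(x_n)}
        {f_1(x_n) \sum_{j=2}^{N} f_j(x_1)} \ge 1 \\[1ex]
&\iff \frac{f_1(x_1)}{\sum_{j=2}^{N} f_j(x_1)} \ge \frac{f_1(x_n)}{\sum_{j=2}^{N} f_j(x_n)}
\end{aligned}
\]

For \( f_j \sim N(y_j, \Sigma) \), this last inequality becomes

\[
\frac{\exp\{-.5\,(x_1-y_1)'\Sigma^{-1}(x_1-y_1)\}}
     {\sum_{j=2}^{N} \exp\{-.5\,(x_1-y_j)'\Sigma^{-1}(x_1-y_j)\}}
\;\ge\;
\frac{\exp\{-.5\,(x_n-y_1)'\Sigma^{-1}(x_n-y_1)\}}
     {\sum_{j=2}^{N} \exp\{-.5\,(x_n-y_j)'\Sigma^{-1}(x_n-y_j)\}}
\]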

32

Risk of True Identification

TABLE 1

Source yj     9.8    10.8   14.1   14.6   14.7   15.0   30.0   40.7   47.1   53.2
x1=32 (15)    0.016  0.024  0.065  0.072  0.074  0.078  0.202  0.183  0.156  0.130
x2=35 (30)    0.010  0.017  0.048  0.054  0.056  0.059  0.199  0.205  0.188  0.164

The intruder's probability, pij, that the released record in the ith row comes from the source record in the jth column. Intruder knows the source values. Unknown to the intruder, 32 is the masked version of 15.0 and 35 is the masked version of 30.0. Masking fj ~ LogN(yj,.5)

• Risk of True Identification is low (zero if .078 is too low to link). Look down columns 15 and 30.

• Risk of matching is not zero for both records? Look across rows. 32 matched with 30 which is incorrect but 35 is matched with 30? (Why not 40.7?) Claimed risk of matching is ½?

• Risk of perceived re-identification? Look down all columns. If 1- the sum of the column is more than the max of the column the intruder is wasting their time. This is an assumption about the intruder that their rational decision is that the record for that column has not been released. In this example, this is true for all the columns.
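The three readings of Table 1 discussed above can be checked mechanically. The sketch below copies the table values and makes one loud assumption: "closest source record" in the risk-of-matching definition is taken to mean smallest absolute difference on the original scale, which may not be the distance Spruill used. All variable names are illustrative.

```python
sources = [9.8, 10.8, 14.1, 14.6, 14.7, 15.0, 30.0, 40.7, 47.1, 53.2]
p = {  # released value -> row of Table 1
    32: [0.016, 0.024, 0.065, 0.072, 0.074, 0.078, 0.202, 0.183, 0.156, 0.130],
    35: [0.010, 0.017, 0.048, 0.054, 0.056, 0.059, 0.199, 0.205, 0.188, 0.164],
}
truth = {32: 15.0, 35: 30.0}   # true sources, unknown to the intruder

# Source record with the highest probability in each row.
most_probable = {x: sources[row.index(max(row))] for x, row in p.items()}

# Source record closest in absolute distance (assumed meaning of "closest").
closest = {x: min(sources, key=lambda y: abs(x - y)) for x in p}

matching_risk = sum(closest[x] == truth[x] for x in p) / len(p)
true_id_rate = sum(most_probable[x] == truth[x] for x in p) / len(p)

# Column check: is "not released" more probable than the best link?
null_link_everywhere = all(1 - sum(col) > max(col) for col in zip(*p.values()))

print(most_probable)          # {32: 30.0, 35: 40.7}
print(closest)                # {32: 30.0, 35: 30.0}
print(matching_risk, true_id_rate, null_link_everywhere)   # 0.5 0.0 True
```

Under the absolute-distance assumption, 35 is matched to 30.0 even though 40.7 carries the higher probability in the table, which would account for a risk of matching of one half alongside a risk of true identification of zero.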

33

Disclosure Harm

• Just considers harm to the respondent (not to agencies, researchers, etc.) whose released record has been re-identified or perceived to have been re-identified

• Scenario

– Masked data is released: X=(x1,…,xn), where xi=(xi1,…,xik) and xi1 is a binary attribute of interest. Assume that the target record is Y1 and that the intruder has linked Y1 to x1. Let x-11=(x12,…,x1k) and X-1=(x-11,x2,…,xn)

– Because of masking the intruder believes, independent of everything else, that

• x11= Y11with probability q

• x11= 1-Y11with probability 1- q

34

Disclosure Harm

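Let \( x_{11} = 1 \). Then

\[
\begin{aligned}
P(Y_{11}=1 \mid X) &= P(Y_{11}=1 \mid x_{11}=1,\ x_{-11}) \\[1ex]
&= \frac{P(x_{11}=1 \mid Y_{11}=1,\ x_{-11})\, P(Y_{11}=1 \mid x_{-11})}
        {P(x_{11}=1 \mid Y_{11}=1,\ x_{-11})\, P(Y_{11}=1 \mid x_{-11}) + P(x_{11}=1 \mid Y_{11}=0,\ x_{-11})\, P(Y_{11}=0 \mid x_{-11})} \\[1ex]
&= \frac{q\, P(Y_{11}=1 \mid x_{-11})}
        {q\, P(Y_{11}=1 \mid x_{-11}) + (1-q)\, P(Y_{11}=0 \mid x_{-11})}
\end{aligned}
\]

Similarly, when \( x_{11} = 0 \),

\[
P(Y_{11}=1 \mid X) = P(Y_{11}=1 \mid x_{11}=0,\ x_{-11})
= \frac{(1-q)\, P(Y_{11}=1 \mid x_{-11})}
       {(1-q)\, P(Y_{11}=1 \mid x_{-11}) + q\, P(Y_{11}=0 \mid x_{-11})}
\]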
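Both cases of this Bayes update fit in one small function. The sketch below is illustrative only: `posterior_y11` and its argument names are assumptions, and `prior` stands for P(Y11=1 | x-11), however the intruder obtains it.

```python
def posterior_y11(x11, q, prior):
    """P(Y11 = 1 | x11, x-11) when the released bit equals Y11 with probability q.

    x11   : released (possibly masked) binary attribute, 0 or 1
    q     : probability that masking left the attribute unchanged
    prior : P(Y11 = 1 | x-11), the belief before seeing x11
    """
    like_if_one = q if x11 == 1 else 1.0 - q      # P(x11 | Y11 = 1)
    like_if_zero = 1.0 - q if x11 == 1 else q     # P(x11 | Y11 = 0)
    num = like_if_one * prior
    return num / (num + like_if_zero * (1.0 - prior))


print(posterior_y11(1, 0.9, 0.2))   # released 1 raises the belief above the prior
print(posterior_y11(0, 0.9, 0.2))   # released 0 lowers it
print(posterior_y11(1, 0.5, 0.2))   # q = .5: the released bit is uninformative
```

With q=.5 the released value carries no information and the posterior equals the prior, which is the sense in which heavy masking protects the attribute.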

35

Disclosure Harm-Logistic Regression

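Assume

\[
\log \frac{P(Y_{i1}=1 \mid x_{-i1})}{P(Y_{i1}=0 \mid x_{-i1})} = \beta_0 + \sum_{j=2}^{k} \beta_j x_{ij}
\]

Then the contribution to the log likelihood of \( x_{i1} \), where \( x_{i1} \) is 0 or 1, is

\[
\log P(x_{i1} \mid x_{-i1}) = \log\bigl\{ q^{x_{i1}} (1-q)^{1-x_{i1}}\, P(Y_{i1}=1 \mid x_{-i1}) + (1-q)^{x_{i1}} q^{1-x_{i1}}\, P(Y_{i1}=0 \mid x_{-i1}) \bigr\}
\]

Obtain the mle's \( \hat\beta_0, \hat\beta_2, \dots, \hat\beta_k \) and estimate \( P(Y_{11}=1 \mid x_{-11}) \) by

\[
\frac{\exp\{\hat\beta_0 + \sum_{j=2}^{k} \hat\beta_j x_{1j}\}}{1 + \exp\{\hat\beta_0 + \sum_{j=2}^{k} \hat\beta_j x_{1j}\}}
\]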

36

Disclosure Harm-Measures of Harm

• Harm H(Y11,X) is a variable that takes on various values depending on the action that the intruder takes based on their strategy

• These values are losses and are

– 0 if record is not identified

– cFN if re-identification is incorrect and Y11 is not inferred

– cTN if re-identification is correct and Y11 is not inferred

– and, when Y11 is inferred:

                     Infer
Re-ID        Correct    Incorrect
Correct      CTT        CTF
Incorrect    CFT        CFF
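The loss structure on this slide amounts to a lookup keyed by whether the re-identification is correct and whether an inference about Y11 is made and correct. In the sketch below the function `harm`, the key names, and the numeric costs are all hypothetical; only the shape of the table comes from the slide.

```python
def harm(identified, reid_correct, inferred, inference_correct, costs):
    """Return the loss for one outcome of the intruder's actions.

    costs maps two-letter keys to losses: first letter is T/F for a
    correct/incorrect re-identification, second is T/F for a
    correct/incorrect inference, or N when Y11 is not inferred.
    """
    if not identified:
        return 0.0                                   # no link, no harm
    first = "T" if reid_correct else "F"
    if not inferred:
        return costs[first + "N"]                    # cTN or cFN
    second = "T" if inference_correct else "F"
    return costs[first + second]                     # CTT, CTF, CFT or CFF


# Purely illustrative numbers; the slide assigns no values to the costs.
example_costs = {"TN": 1.0, "FN": 0.5, "TT": 10.0, "TF": 4.0, "FT": 6.0, "FF": 3.0}

print(harm(True, True, True, True, example_costs))      # CTT
print(harm(True, False, False, False, example_costs))   # cFN
print(harm(False, False, False, False, example_costs))  # 0.0
```

Averaging this loss over the intruder's probabilities for each outcome would give an expected harm, but the probabilities have to come from a model of the intruder's strategy.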

37

Disclosure Harm

Some Possibly Delusionary Closing Comments

• Think of the source data, Y, as the parameter

• The released data, X, is the sample

• This is somewhat like a two person game where the agency plays the role of Mother Nature and the intruder is the other person

• The agency controls the way it generates the released data

38

Disclosure Harm

Some Possibly Delusionary Closing Comments

• When we describe the mechanism/structure/model that is used to generate released data we are specifying somewhat the model X|Y.

– Are we totally specifying this?

– There are at the very least some nuisance parameters regarding weights, e.g.

• Is there a meaningful interpretation in randomizing over the parameter from the agency's perspective?

• Perhaps we should reverse the roles of the agency and the intruder. The parameter is then the intruder’s strategy. In any event Lambert is suggesting that we need to model the intruder strategy and formulate the problem from a decision theory standpoint.