Anonymity for Continuous Data Publishing

Benjamin C. M. Fung

Concordia University Montreal, QC, Canada

http://www.ciise.concordia.ca/~fung

Ke Wang

Simon Fraser University Burnaby, BC, Canada

Ada Wai-Chee Fu

The Chinese University of Hong Kong

Jian Pei

Simon Fraser University Burnaby, BC, Canada

The 11th International Conference on Extending Database Technology (EDBT 2008)

Privacy-Preserving Data Publishing

k-anonymity [SS98]

Raw patient table (Hospital)

Quasi-Identifier (QID)      Sensitive
Birthplace   Job            Disease
UK           Engineer       Flu
UK           Lawyer         Diabetes
France       Engineer       Diabetes
France       Lawyer         Flu

2-anonymous patient table

Quasi-Identifier (QID)      Sensitive
Birthplace   Job            Disease
UK           Professional   Flu
UK           Professional   Diabetes
France       Professional   Diabetes
France       Professional   Flu

Privacy Requirement

k-anonymity [SS98]: Every QID group contains at least k records.

Confidence bounding [WFY05, WFY07]: Bound the confidence of QID → sensitive value within h%.

l-diversity [MGKV06]: Every QID group contains l well-represented distinct sensitive values.
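The first two requirements can be checked mechanically. The following Python sketch is illustrative only: the function names `is_k_anonymous` and `max_confidence` and the tuple layout (birthplace, job, disease) are assumptions, and the data is the raw and 2-anonymous patient table from the earlier slide.

```python
from collections import Counter, defaultdict

def is_k_anonymous(records, k, qid_len=2):
    """True if every QID group (first qid_len fields) has at least k records."""
    groups = Counter(r[:qid_len] for r in records)
    return all(size >= k for size in groups.values())

def max_confidence(records, qid_len=2):
    """Largest fraction of any QID group that shares one sensitive value."""
    by_group = defaultdict(Counter)
    for r in records:
        by_group[r[:qid_len]][r[qid_len]] += 1
    return max(max(c.values()) / sum(c.values()) for c in by_group.values())

raw_table = [
    ("UK", "Engineer", "Flu"),
    ("UK", "Lawyer", "Diabetes"),
    ("France", "Engineer", "Diabetes"),
    ("France", "Lawyer", "Flu"),
]
anonymous_table = [
    ("UK", "Professional", "Flu"),
    ("UK", "Professional", "Diabetes"),
    ("France", "Professional", "Diabetes"),
    ("France", "Professional", "Flu"),
]

print(is_k_anonymous(raw_table, 2))        # every raw QID group has one record
print(is_k_anonymous(anonymous_table, 2))  # every generalized group has two
print(max_confidence(anonymous_table))     # compare against the bound h
```

A confidence-bounding requirement of h% would then amount to `max_confidence(table) <= h / 100`.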

Patient table

QID Sensitive


UK Professional Flu








Continuous Data Publishing Model

At time T1, collect a set of raw data records D1 and publish a k-anonymous version of D1, denoted release R1.

At time T2, collect a new set of raw data records D2. We want to publish all data collected so far, so we publish a k-anonymous version of D1 ∪ D2, denoted release R2.

R1 (covers D1; raw values in parentheses)

(a1)   Europe (UK)       Lawyer                  Flu
...
(a4)   Europe (France)   Lawyer                  Diabetes
(a5)   Europe (France)   Lawyer                  Diabetes

R2 (covers D1 and D2; raw values in parentheses)

(b1)   UK       Professional (Lawyer)   Flu
...
(b4)   France   Professional (Lawyer)   Diabetes
...
(b7)   France   Professional (Doctor)   Flu
(b8)   France   Professional (Doctor)   Flu
(b9)   UK       Professional (Doctor)   Diabetes
(b10)  UK       Professional (Lawyer)   Diabetes

Correspondence Attacks

An attacker could “crack” the k-anonymity by comparing R1 and R2.

Background knowledge:
• QID of a target victim (e.g., Alice was born in France and is a lawyer).
• Timestamp of a target victim.

Correspondence knowledge:
• Every record in R1 has a corresponding record in R2.
• Every record timestamped T2 has a record in R2, but not in R1.

Our Contributions

• What exactly are the records that can be excluded (cracked) based on R1 and R2? We systematically characterize the set of records cracked by correspondence attacks.
• Propose the notion of BCF-anonymity to measure anonymity after excluding the cracked records.
• Develop an efficient algorithm to identify a BCF-anonymized R2, and study its data quality.
• Extend the proposed approach to deal with more than two releases and other privacy notions.

Problem Statements

Detection problem: Determine the number of cracked records in the worst case by applying the correspondence knowledge to the k-anonymized R1 and R2.

Anonymization problem: Given R1, D1 and D2, we want to generalize R2 = D1 ∪ D2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible with respect to a specified information metric.

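Forward-Attack (F-Attack)

Alice: {France, Lawyer} with timestamp T1. Attempt to identify her record in R1.

R1
       Birthplace   Job            Disease
(a1)   Europe       Lawyer         Flu
(a2)   Europe       Lawyer         Flu
(a3)   Europe       Lawyer         Flu
(a4)   Europe       Lawyer         Diabetes
(a5)   Europe       Lawyer         Diabetes

R2
       Birthplace   Job            Disease
(b1)   UK           Professional   Flu
(b2)   UK           Professional   Flu
(b3)   UK           Professional   Flu
(b4)   France       Professional   Diabetes
(b5)   France       Professional   Diabetes
(b6)   France       Professional   Diabetes
(b7)   France       Professional   Flu
(b8)   France       Professional   Flu
(b9)   UK           Professional   Diabetes
(b10)  UK           Professional   Diabetes

a1, a2, a3 cannot all originate from [France, Lawyer]. Otherwise, R2 would have at least three [France, Professional, Flu] records.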


















F-Attack

CG(qid1, qid2) = {(g1, g2), (g1', g2')}

Here g1 and g1' are the groups of qid1 in R1, and g2 and g2' are the groups of qid2 in R2 with the same sensitive values.

Crack size of g1 wrt P:
c = |g1| - min(|g1|, |g2|) = 3 - min(3, 2) = 1

Crack size of g1' wrt P:
c = |g1'| - min(|g1'|, |g2'|) = 2 - min(2, 3) = 0

F(P, qid1, qid2) = Σ c over all (g1, g2) in CG(qid1, qid2)
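The F-Attack crack size can be reproduced on the running example with a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: the name `f_crack`, the tuple layout (birthplace, job, disease), and the rule that a sensitive value missing from qid2 contributes its whole g1 group are assumptions, and rows a2, a3, b2, b3, b5, b6 are filled in from the group definitions given on the attack slides.

```python
from collections import Counter

# Generalized releases of the running example: (birthplace, job, disease).
R1 = [("Europe", "Lawyer", "Flu")] * 3 + [("Europe", "Lawyer", "Diabetes")] * 2
R2 = (
    [("UK", "Professional", "Flu")] * 3
    + [("France", "Professional", "Diabetes")] * 3
    + [("France", "Professional", "Flu")] * 2
    + [("UK", "Professional", "Diabetes")] * 2
)

def sensitive_counts(release, qid):
    """Number of records per sensitive value inside one QID group."""
    return Counter(r[2] for r in release if r[:2] == qid)

def f_crack(r1, r2, qid1, qid2):
    """Sum of |g1| - min(|g1|, |g2|) over groups sharing a sensitive value."""
    g1_sizes = sensitive_counts(r1, qid1)
    g2_sizes = sensitive_counts(r2, qid2)
    return sum(n1 - min(n1, g2_sizes[s]) for s, n1 in g1_sizes.items())

# Alice: {France, Lawyer}, so qid1 = [Europe, Lawyer], qid2 = [France, Professional].
print(f_crack(R1, R2, ("Europe", "Lawyer"), ("France", "Professional")))
```

The Flu group contributes 3 - min(3, 2) = 1 and the Diabetes group contributes 2 - min(2, 3) = 0, matching the two crack sizes on the slide.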

Definition: F-Anonymity

F(qid1, qid2) denotes the maximum F(P, qid1, qid2) for any target P that matches (qid1, qid2).

F(qid1) denotes the maximum F(qid1, qid2) for all qid2 in R2.

The F-anonymity of (R1, R2), denoted by FA(R1, R2), is the minimum of (|qid1| - F(qid1)) over all qid1 in R1.
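As a rough illustration of how the definition aggregates crack sizes, the sketch below takes the per-pair values F(qid1, qid2) as given and computes FA. The function name `f_anonymity` and the input dictionaries are hypothetical. The value 1 for the [France, Professional] pair comes from the F-Attack slide. The value 0 for the [UK, Professional] pair and the group size 5 are derived from the running example under the same formula and should be read as assumptions.

```python
def f_anonymity(group_sizes, crack_sizes):
    """FA = min over qid1 of (|qid1| - F(qid1)).

    group_sizes: qid1 -> |qid1| (number of records of that QID group in R1)
    crack_sizes: (qid1, qid2) -> F(qid1, qid2)
    """
    worst = {qid1: 0 for qid1 in group_sizes}
    for (qid1, _qid2), c in crack_sizes.items():
        worst[qid1] = max(worst[qid1], c)  # F(qid1) = max over qid2
    return min(size - worst[qid1] for qid1, size in group_sizes.items())

europe_lawyer = ("Europe", "Lawyer")
group_sizes = {europe_lawyer: 5}
crack_sizes = {
    (europe_lawyer, ("France", "Professional")): 1,
    (europe_lawyer, ("UK", "Professional")): 0,
}
print(f_anonymity(group_sizes, crack_sizes))
```

CA and BA follow the same min-of-(size minus worst crack) pattern, with the groups taken from R2 instead of R1.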


















Cross-Attack (C-Attack)

Alice: {France, Lawyer} with timestamp T1.

At least one of b4, b5, b6 must have timestamp T2. Otherwise, R1 would have at least three [Europe, Lawyer, Diabetes] records.


















C-Attack

Crack size of g2 wrt P:
c = |g2| - min(|g1|, |g2|) = 2 - min(3, 2) = 0

Crack size of g2' wrt P:
c = |g2'| - min(|g1'|, |g2'|) = 3 - min(2, 3) = 1

C(P, qid1, qid2) = Σ c over all (g1, g2) in CG(qid1, qid2)
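The C-Attack computation mirrors the F-Attack one, but counts cracked records on the R2 side. A minimal sketch, assuming the group sizes are already known: the name `c_crack` and the dictionaries keyed by sensitive value are illustrative, and the sizes are the ones used in the worked numbers above.

```python
def c_crack(g1_sizes, g2_sizes):
    """Sum of |g2| - min(|g1|, |g2|) over groups sharing a sensitive value.

    g1_sizes: sensitive value -> size of that group in qid1 (R1)
    g2_sizes: sensitive value -> size of that group in qid2 (R2)
    """
    return sum(n2 - min(g1_sizes.get(s, 0), n2) for s, n2 in g2_sizes.items())

# qid1 = [Europe, Lawyer] in R1, qid2 = [France, Professional] in R2.
g1_sizes = {"Flu": 3, "Diabetes": 2}
g2_sizes = {"Flu": 2, "Diabetes": 3}
print(c_crack(g1_sizes, g2_sizes))
```

Only the Diabetes group is cracked here: R2 holds three such records but R1 can account for at most two of them, so at least one must carry timestamp T2.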

Definition: C-Anonymity

C(qid1, qid2) denotes the maximum C(P, qid1, qid2) for any target P that matches (qid1, qid2).

C(qid2) denotes the maximum C(qid1, qid2) for all qid1 in R1.

The C-anonymity of (R1, R2), denoted by CA(R1, R2), is the minimum of (|qid2| - C(qid2)) over all qid2 in R2.


















Backward-Attack (B-Attack)

Alice: {UK, Lawyer} with timestamp T2.

At least one of b1, b2, b3 must have timestamp T1. Otherwise, one of a1, a2, a3 would have no corresponding record in R2.


















B-Attack

Target person P: {UK, Lawyer} with timestamp T2.

Crack size of g2 wrt P:
c = max(0, |G1| - (|G2| - |g2|))

g2 = {b1, b2, b3}
G1 = {a1, a2, a3}
G2 = {b1, b2, b3, b7, b8}
c = max(0, 3 - (5 - 3)) = 1

Crack size of g2' wrt P:
c = max(0, |G1'| - (|G2'| - |g2'|))

g2' = {b9, b10}
G1' = {a4, a5}
G2' = {b4, b5, b6, b9, b10}
c = max(0, 2 - (5 - 2)) = 0

B(P, qid2) = Σ c over all g2 in qid2.
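The B-Attack arithmetic can be checked with a short sketch. The name `b_crack` and the (|G1|, |G2|, |g2|) triple layout are assumptions made for illustration. The sizes are those of the two worked cases for the target {UK, Lawyer} with timestamp T2.

```python
def b_crack(groups):
    """Sum of max(0, |G1| - (|G2| - |g2|)) over the groups g2 of qid2.

    groups: iterable of (size_G1, size_G2, size_g2) triples, one per g2.
    """
    return sum(max(0, n_G1 - (n_G2 - n_g2)) for n_G1, n_G2, n_g2 in groups)

# qid2 = [UK, Professional]:
#   Flu:      G1 = {a1,a2,a3}, G2 = {b1,b2,b3,b7,b8},  g2  = {b1,b2,b3}
#   Diabetes: G1'= {a4,a5},    G2'= {b4,b5,b6,b9,b10}, g2' = {b9,b10}
uk_professional = [(3, 5, 3), (2, 5, 2)]
print(b_crack(uk_professional))
```

The intuition: the |G2| - |g2| records outside g2 can absorb at most that many of the R1 records in G1, so any remaining R1 records must correspond to records in g2, which therefore have timestamp T1 and cannot belong to a T2 target.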

Definition: B-Anonymity

B(qid2) denotes the maximum B(P, qid2) for any target P that matches qid2.

The B-anonymity of (R1, R2), denoted by BA(R1, R2), is the minimum of (|qid2| - B(qid2)) over all qid2 in R2.

In brief…

Cracked records either do not originate from Alice's QID or do not have Alice's timestamp.

Such cracked records are not related to Alice, so excluding them allows the attacker to focus on a smaller set of candidate records.

Definition: BCF-Anonymity

A BCF-anonymity requirement states that all of BA(R1, R2) ≥ k, CA(R1, R2) ≥ k, and FA(R1, R2) ≥ k hold, where k is a user-specified threshold.
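Once the three anonymity values are known, the requirement reduces to a single comparison. A minimal sketch, in which the function name `satisfies_bcf` and the sample numbers are placeholders rather than results from the study:

```python
def satisfies_bcf(ba, ca, fa, k):
    """True if BA, CA and FA are all at least the threshold k."""
    return min(ba, ca, fa) >= k

# Placeholder values for illustration only.
print(satisfies_bcf(ba=4, ca=4, fa=4, k=4))
print(satisfies_bcf(ba=4, ca=4, fa=3, k=4))
```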

We now present an algorithm for anonymizing R2 = D1 ∪ D2.

BCF-Anonymizer

1. generalize every value for Aj ∈ QID in R2 to ANYj;
2. let candidate list contain all ANYj;
3. sort candidate list by Score in descending order;
4. while the candidate list is not empty do
5.     if the first candidate w in candidate list is valid then
6.         specialize w into {w1,…,wz} in R2;
7.         compute Score for all wi, and add them to candidate list;
8.         sort the candidate list by Score in descending order;
9.     else
10.        remove w from the candidate list;
11.    end if
12. end while
13. output R2

Taxonomy example: ANY → {Europe, …, America}; Europe → {France, UK, …}
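The control flow of this top-down specialization loop can be sketched in Python. This is a simplified illustration, not the authors' implementation: it tracks only the current set of generalized values (the "cut") rather than rewriting records in R2, and `is_valid` and `score` are hypothetical callbacks standing in for the BCF-anonymity check and the Score function. The toy validity rule below exists only to make the example runnable.

```python
def bcf_anonymizer_sketch(taxonomy, roots, is_valid, score):
    """Top-down specialization over a taxonomy (value -> list of children)."""
    cut = set(roots)                  # step 1: everything generalized to ANY_j
    candidates = list(roots)          # step 2
    while candidates:                 # step 4
        candidates.sort(key=score, reverse=True)   # steps 3 and 8
        w = candidates.pop(0)         # first candidate; leaves the list either way
        children = taxonomy.get(w, [])
        trial = (cut - {w}) | set(children)
        if children and is_valid(trial):           # step 5
            cut = trial               # step 6: specialize w into its children
            candidates.extend(children)            # step 7
        # otherwise w is simply dropped (step 10)
    return cut                        # step 13

taxonomy = {"ANY": ["Europe", "America"], "Europe": ["France", "UK"]}
result = bcf_anonymizer_sketch(
    taxonomy,
    roots=["ANY"],
    is_valid=lambda cut: len(cut) <= 2,             # toy stand-in, not BCF-anonymity
    score=lambda v: len(taxonomy.get(v, [])),       # toy stand-in for Score
)
print(sorted(result))
```

In this toy run, ANY is specialized into Europe and America, while specializing Europe is rejected by the stand-in validity rule, so the loop stops at a cut that cannot be specialized further.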

Anti-Monotonicity of BCF-Anonymity

Theorem: Each of FA, CA and BA is non-increasing with respect to a specialization on R2.

This guarantees that the produced BCF-anonymized R2 is maximally specialized (suboptimal): any further specialization leads to a violation.

Empirical Study

• Study the threat of correspondence attacks.
• Evaluate the information usefulness of a BCF-anonymized R2.

Adult dataset (US Census data):
• 8 categorical attributes
• 30,162 records in training set
• 15,060 records in testing set

Experiment Settings

D1 contains all records in testing set.

Three cases of D2 at timestamp T2:
• 200D2: D2 contains the first 200 records in the training set, modelling a small set of new records at T2.
• 2000D2: D2 contains the first 2000 records in the training set, modelling a medium set of new records at T2.
• allD2: D2 contains all 30,162 records in the training set, modelling a large set of new records at T2.
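The three D2 cases amount to taking prefixes of the training set. A small sketch, assuming the training records are held in a list in their original order: the function name `make_d2` and the stand-in data are hypothetical, and loading the Adult dataset itself is not shown.

```python
def make_d2(training_records, case):
    """Return D2 for one of the three experiment cases."""
    sizes = {"200D2": 200, "2000D2": 2000, "allD2": len(training_records)}
    return training_records[: sizes[case]]

# Stand-in for the 30,162 training records.
training_records = list(range(30162))
for case in ("200D2", "2000D2", "allD2"):
    print(case, len(make_d2(training_records, case)))
```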


Violations of BCF-Anonymity

Anonymization

• BCF-Anonymized R2: Our method.
• k-Anonymized R2: Not safe from correspondence attacks.
• k-Anonymized D2: Anonymize D2 separately from D1.

Related Work

Byun et al. (VLDB-SDM06) is an early study on the continuous data publishing scenario. Their anonymization relies on delaying the release of records, and the delay can be unbounded. In our method, records collected at timestamp Ti are always published in the corresponding release Ri without delay.

Xiao and Tao (SIGMOD07) present the first study to address both record insertions and deletions in data re-publication. Their anonymization relies on generalization and adding counterfeit records.

Wang and Fung (SIGKDD06) study the problem of anonymizing sequential releases, where each subsequent release publishes a different subset of attributes for the same set of records.

[Figure: releases R1 and R2 publishing different subsets of the attributes A, B, C, D]

Conclusion & Contributions

• Systematically characterize different types of correspondence attacks and concisely compute their crack size.
• Define BCF-anonymity requirement.
• Present an anonymization algorithm to achieve BCF-anonymity while preserving information usefulness.
• Extendable to multiple releases.

For more information: http://www.ciise.concordia.ca/~fung

Acknowledgement:
• Reviewers of EDBT
• Concordia University Faculty Start-up Grants
• Natural Sciences and Engineering Research Council of Canada (NSERC): Discovery Grants, PGS Doctoral Award

References

[BSBL06] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In VLDB Workshop on Secure Data Management (SDM), 2006.

[MGKV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, Atlanta, GA, April 2006.

[PXW07] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In SSDBM, Banff, Canada, 2007.

[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, March 1998.

[WF06] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, Philadelphia, PA, August 2006, pages 414-423.

[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466-473, November 2005.

[WFY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems: An International Journal (KAIS), 11(3):345-368, April 2007.

[XY07] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM SIGMOD, June 2007.
