anonymity for continuous data publishing

Anonymity for Continuous Data Publishing Benjamin C. M. Fung Concordia University Montreal, QC, Canada http://www.ciise.concordia.ca/ ~fung Ke Wang Simon Fraser University Burnaby, BC, Canada Ada Wai-Chee Fu The Chinese University of Hong Kong Jian Pei Simon Fraser University Burnaby, BC, Canada The 11 th International Conference on Extending Database Technology (EDBT 2008)


Page 1: Anonymity for Continuous Data Publishing

Anonymity for Continuous Data Publishing

Benjamin C. M. Fung

Concordia University Montreal, QC, Canada

http://www.ciise.concordia.ca/~fung

Ke Wang

Simon Fraser University Burnaby, BC, Canada

Ada Wai-Chee Fu

The Chinese University of Hong Kong

Jian Pei

Simon Fraser University Burnaby, BC, Canada

The 11th International Conference on Extending Database Technology (EDBT 2008)

Page 2: Anonymity for Continuous Data Publishing

Privacy-Preserving Data Publishing

k-anonymity [SS98]

Raw patient table (Hospital)

Quasi-Identifier (QID)    Sensitive
Birthplace  Job           Disease
UK          Engineer      Flu
UK          Lawyer        Diabetes
France      Engineer      Diabetes
France      Lawyer        Flu

2-anonymous patient table

Birthplace  Job           Disease
UK          Professional  Flu
UK          Professional  Diabetes
France      Professional  Diabetes
France      Professional  Flu

Page 3: Anonymity for Continuous Data Publishing

Privacy Requirement

k-anonymity [SS98]: Every QID group contains at least k records.

Confidence bounding [WFY05, WFY07]: Bound the confidence of the inference QID → sensitive value within h%.

l-diversity [MGKV06]: Every QID group contains l well-represented distinct sensitive values.

Patient table

QID                       Sensitive
Birthplace  Job           Disease
UK          Professional  Flu
UK          Professional  Diabetes
UK          Professional  Diabetes
UK          Professional  Diabetes
France      Professional  Diabetes
France      Professional  Diabetes
France      Professional  Flu
France      Professional  Flu
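The three requirements can be checked mechanically on the patient table above. The following is a minimal Python sketch, not code from the presentation. The function names are illustrative, and l-diversity is simplified to counting distinct sensitive values per QID group, which is only one possible reading of "well-represented".

```python
from collections import Counter, defaultdict

# Patient table from the slide: (Birthplace, Job, Disease).
patients = [
    ("UK", "Professional", "Flu"),
    ("UK", "Professional", "Diabetes"),
    ("UK", "Professional", "Diabetes"),
    ("UK", "Professional", "Diabetes"),
    ("France", "Professional", "Diabetes"),
    ("France", "Professional", "Diabetes"),
    ("France", "Professional", "Flu"),
    ("France", "Professional", "Flu"),
]

def qid_groups(rows):
    """Map each QID (all columns but the last) to its list of sensitive values."""
    groups = defaultdict(list)
    for *qid, sensitive in rows:
        groups[tuple(qid)].append(sensitive)
    return groups

def is_k_anonymous(rows, k):
    """Every QID group contains at least k records."""
    return all(len(v) >= k for v in qid_groups(rows).values())

def max_confidence(rows):
    """Largest share of a single sensitive value inside any QID group."""
    return max(max(Counter(v).values()) / len(v) for v in qid_groups(rows).values())

def is_distinct_l_diverse(rows, l):
    """Simplified l-diversity: at least l distinct sensitive values per group."""
    return all(len(set(v)) >= l for v in qid_groups(rows).values())

print(is_k_anonymous(patients, 4))         # both QID groups hold 4 records
print(max_confidence(patients))            # [UK, Professional] -> Diabetes: 3 of 4
print(is_distinct_l_diverse(patients, 2))  # each group has Flu and Diabetes
```

Under these definitions the table is 4-anonymous and 2-diverse, yet an attacker can infer Diabetes for a UK professional with 75% confidence, which is the gap that confidence bounding targets.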

Page 4: Anonymity for Continuous Data Publishing

Continuous Data Publishing Model

At time T1:
Collect a set of raw data records D1.
Publish a k-anonymous version of D1, denoted release R1.

At time T2:
Collect a new set of raw data records D2.
Want to publish all data collected so far.
Publish a k-anonymous version of D1 ∪ D2, denoted release R2.

Page 5: Anonymity for Continuous Data Publishing

Continuous Data Publishing Model

R1, the release of D1 (raw values in parentheses)

      Birthplace       Job     Disease
(a1)  Europe (UK)      Lawyer  Flu
(a2)  Europe (UK)      Lawyer  Flu
(a3)  Europe (UK)      Lawyer  Flu
(a4)  Europe (France)  Lawyer  Diabetes
(a5)  Europe (France)  Lawyer  Diabetes

R2, the release of D1 and D2 (raw values in parentheses)

       Birthplace  Job                    Disease
(b1)   UK          Professional (Lawyer)  Flu
(b2)   UK          Professional (Lawyer)  Flu
(b3)   UK          Professional (Lawyer)  Flu
(b4)   France      Professional (Lawyer)  Diabetes
(b5)   France      Professional (Lawyer)  Diabetes
(b6)   France      Professional (Lawyer)  Diabetes
(b7)   France      Professional (Doctor)  Flu
(b8)   France      Professional (Doctor)  Flu
(b9)   UK          Professional (Doctor)  Diabetes
(b10)  UK          Professional (Lawyer)  Diabetes

Page 6: Anonymity for Continuous Data Publishing

Correspondence Attacks

An attacker could “crack” the k-anonymity by comparing R1 and R2.

Background knowledge:
QID of a target victim (e.g., Alice was born in France and is a lawyer).
Timestamp of a target victim.

Correspondence knowledge:
Every record in R1 has a corresponding record in R2.
Every record timestamped T2 has a record in R2, but not in R1.

Page 7: Anonymity for Continuous Data Publishing

Our Contributions

What exactly are the records that can be excluded (cracked) based on R1 and R2?

Systematically characterize the set of cracked records by correspondence attacks.

Propose the notion of BCF-anonymity to measure anonymity after excluding the cracked records.

Develop an efficient algorithm to identify a BCF-anonymized R2, and study its data quality.

Extend the proposed approach to deal with more than two releases and other privacy notions.

Page 8: Anonymity for Continuous Data Publishing

Problem Statements

Detection problem: Determine the number of cracked records in the worst case by applying the correspondence knowledge on the k-anonymized R1 and R2.

Anonymization problem: Given R1, D1 and D2, we want to generalize R2 = D1 ∪ D2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible with respect to a specified information metric.

Page 9: Anonymity for Continuous Data Publishing

Forward-Attack (F-Attack)

R1
      Birthplace  Job     Disease
(a1)  Europe      Lawyer  Flu
(a2)  Europe      Lawyer  Flu
(a3)  Europe      Lawyer  Flu
(a4)  Europe      Lawyer  Diabetes
(a5)  Europe      Lawyer  Diabetes

R2
       Birthplace  Job           Disease
(b1)   UK          Professional  Flu
(b2)   UK          Professional  Flu
(b3)   UK          Professional  Flu
(b4)   France      Professional  Diabetes
(b5)   France      Professional  Diabetes
(b6)   France      Professional  Diabetes
(b7)   France      Professional  Flu
(b8)   France      Professional  Flu
(b9)   UK          Professional  Diabetes
(b10)  UK          Professional  Diabetes

Alice: {France, Lawyer} with timestamp T1.

Attempt to identify her record in R1.

a1, a2, a3 cannot all originate from [France, Lawyer].

Otherwise, R2 would have at least three [France, Professional, Flu].

Page 10: Anonymity for Continuous Data Publishing

F-Attack

qid1 = [Europe, Lawyer] in R1
qid2 = [France, Professional] in R2

g1 = {a1, a2, a3} and g2 = {b7, b8} (Flu)
g1' = {a4, a5} and g2' = {b4, b5, b6} (Diabetes)

CG(qid1, qid2) = {(g1, g2), (g1', g2')}

Page 11: Anonymity for Continuous Data Publishing

F-Attack

Crack size of g1 wrt P:
c = |g1| - min(|g1|, |g2|)
c = 3 - min(3, 2) = 1

Crack size of g1' wrt P:
c = |g1'| - min(|g1'|, |g2'|)
c = 2 - min(2, 3) = 0

F(P, qid1, qid2) = Σ c, summed over all (g1, g2) in CG(qid1, qid2)

For the example, F(P, qid1, qid2) = 1 + 0 = 1.
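The F-attack computation can be reproduced from the records themselves. The sketch below is illustrative Python, not the authors' implementation. It assumes that a pair (g1, g2) in CG(qid1, qid2) is formed by the records of qid1 and qid2 that share a sensitive value, and that F is the sum of the per-pair crack sizes; the helper name `f_attack` is hypothetical.

```python
from collections import Counter

# Sensitive values of the records in the two QID groups of the example.
qid1 = {"a1": "Flu", "a2": "Flu", "a3": "Flu",
        "a4": "Diabetes", "a5": "Diabetes"}            # [Europe, Lawyer] in R1
qid2 = {"b4": "Diabetes", "b5": "Diabetes", "b6": "Diabetes",
        "b7": "Flu", "b8": "Flu"}                      # [France, Professional] in R2

def f_attack(qid1, qid2):
    """Return per-sensitive-value crack sizes of qid1 and their sum."""
    n1, n2 = Counter(qid1.values()), Counter(qid2.values())
    # One (g1, g2) pair per sensitive value present in both QID groups.
    cracks = {s: n1[s] - min(n1[s], n2[s]) for s in n1.keys() & n2.keys()}
    return cracks, sum(cracks.values())

cracks, total = f_attack(qid1, qid2)
print(cracks, total)
```

Here one of the three Flu records in R1 is cracked: R2 offers only two [France, Professional, Flu] records, so at most two of a1, a2, a3 can belong to a French lawyer.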

Page 12: Anonymity for Continuous Data Publishing

Definition: F-Anonymity

F(qid1, qid2) denotes the maximum F(P, qid1, qid2) for any target P that matches (qid1, qid2).

F(qid1) denotes the maximum F(qid1, qid2) for all qid2 in R2.

The F-anonymity of (R1, R2), denoted by FA(R1, R2), is the minimum of (|qid1| - F(qid1)) for all qid1 in R1.

Page 13: Anonymity for Continuous Data Publishing

Cross-Attack (C-Attack)

R1
      Birthplace  Job     Disease
(a1)  Europe      Lawyer  Flu
(a2)  Europe      Lawyer  Flu
(a3)  Europe      Lawyer  Flu
(a4)  Europe      Lawyer  Diabetes
(a5)  Europe      Lawyer  Diabetes

R2
       Birthplace  Job           Disease
(b1)   UK          Professional  Flu
(b2)   UK          Professional  Flu
(b3)   UK          Professional  Flu
(b4)   France      Professional  Diabetes
(b5)   France      Professional  Diabetes
(b6)   France      Professional  Diabetes
(b7)   France      Professional  Flu
(b8)   France      Professional  Flu
(b9)   UK          Professional  Diabetes
(b10)  UK          Professional  Diabetes

Alice: {France, Lawyer} with timestamp T1.

Attempt to identify her record in R2.

At least one of b4, b5, b6 must have timestamp T2.

Otherwise, R1 would have at least three records [Europe, Lawyer, Diabetes].

Page 14: Anonymity for Continuous Data Publishing

C-Attack

Crack size of g2 wrt P:
c = |g2| - min(|g1|, |g2|)
c = 2 - min(3, 2) = 0

Crack size of g2' wrt P:
c = |g2'| - min(|g1'|, |g2'|)
c = 3 - min(2, 3) = 1

C(P, qid1, qid2) = Σ c, summed over all (g1, g2) in CG(qid1, qid2)

For the example, C(P, qid1, qid2) = 0 + 1 = 1.
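The F-attack and C-attack formulas differ only in which release loses records: F cracks the surplus of g1 over g2, and C cracks the surplus of g2 over g1. A small Python sketch makes the symmetry explicit. It works on group sizes only, and the function name `crack_sizes` is hypothetical; both totals are taken as sums over the pairs in CG(qid1, qid2).

```python
def crack_sizes(cg):
    """cg: list of (|g1|, |g2|) size pairs, one per pair in CG(qid1, qid2).

    Returns (F, C): records cracked in qid1 by the F-attack and
    records cracked in qid2 by the C-attack.
    """
    f = sum(n1 - min(n1, n2) for n1, n2 in cg)  # surplus on the R1 side
    c = sum(n2 - min(n1, n2) for n1, n2 in cg)  # surplus on the R2 side
    return f, c

# Example: (g1, g2) has sizes (3, 2) for Flu, (g1', g2') has sizes (2, 3) for Diabetes.
print(crack_sizes([(3, 2), (2, 3)]))
```

When every pair has |g1| = |g2|, both attacks crack nothing, because each record in one release can be matched with a record in the other.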

Page 15: Anonymity for Continuous Data Publishing

Definition: C-Anonymity

C(qid1, qid2) denotes the maximum C(P, qid1, qid2) for any target P that matches (qid1, qid2).

C(qid2) denotes the maximum C(qid1, qid2) for all qid1 in R1.

The C-anonymity of (R1, R2), denoted by CA(R1, R2), is the minimum of (|qid2| - C(qid2)) for all qid2 in R2.

Page 16: Anonymity for Continuous Data Publishing

Backward-Attack (B-Attack)

R1
      Birthplace  Job     Disease
(a1)  Europe      Lawyer  Flu
(a2)  Europe      Lawyer  Flu
(a3)  Europe      Lawyer  Flu
(a4)  Europe      Lawyer  Diabetes
(a5)  Europe      Lawyer  Diabetes

R2
       Birthplace  Job           Disease
(b1)   UK          Professional  Flu
(b2)   UK          Professional  Flu
(b3)   UK          Professional  Flu
(b4)   France      Professional  Diabetes
(b5)   France      Professional  Diabetes
(b6)   France      Professional  Diabetes
(b7)   France      Professional  Flu
(b8)   France      Professional  Flu
(b9)   UK          Professional  Diabetes
(b10)  UK          Professional  Diabetes

Alice: {UK, Lawyer} with timestamp T2.

Attempt to identify her record in R2.

At least one of b1, b2, b3 must have timestamp T1.

Otherwise, one of a1, a2, a3 would have no corresponding record in R2.

Page 17: Anonymity for Continuous Data Publishing

B-Attack

Target person P: {UK, Lawyer} with timestamp T2.

Crack size of g2 wrt P:
c = max(0, |G1| - (|G2| - |g2|))

g2 = {b1, b2, b3}
G1 = {a1, a2, a3}
G2 = {b1, b2, b3, b7, b8}

c = max(0, 3 - (5 - 3)) = 1

Page 18: Anonymity for Continuous Data Publishing

B-Attack

Crack size of g2' wrt P:
c = max(0, |G1'| - (|G2'| - |g2'|))

g2' = {b9, b10}
G1' = {a4, a5}
G2' = {b4, b5, b6, b9, b10}

c = max(0, 2 - (5 - 2)) = 0

B(P, qid2) = Σ c, summed over all g2 in qid2.
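The B-attack bound follows from a counting argument: the |G1| records of R1 all need a counterpart in G2, and at most |G2| - |g2| of those counterparts can lie outside g2, so the remainder must sit inside g2 with timestamp T1. The Python sketch below evaluates the formula on the example sets; the function name `b_crack` is hypothetical, and B is taken as the sum of the crack sizes over the groups of qid2.

```python
def b_crack(g2, G1, G2):
    """Crack size of g2 for a target with timestamp T2.

    g2: records of the target's QID group in R2 sharing one sensitive value
    G1: records in R1 that could correspond to a record in g2
    G2: records in R2 that could correspond to a record in G1
    """
    return max(0, len(G1) - (len(G2) - len(g2)))

# Target P = {UK, Lawyer} with timestamp T2, so qid2 = [UK, Professional].
flu = b_crack(g2={"b1", "b2", "b3"},
              G1={"a1", "a2", "a3"},
              G2={"b1", "b2", "b3", "b7", "b8"})
diabetes = b_crack(g2={"b9", "b10"},
                   G1={"a4", "a5"},
                   G2={"b4", "b5", "b6", "b9", "b10"})

print(flu, diabetes, flu + diabetes)
```

The `max(0, ...)` guard matters for the Diabetes group: three records outside g2' are enough to absorb both records of G1', so nothing in g2' is forced to carry timestamp T1.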

Page 19: Anonymity for Continuous Data Publishing


Definition: B-Anonymity

B(qid2) denotes the maximum B(P, qid2) for any target P that matches qid2.

B-anonymity of (R1,R2), denoted by BA(R1,R2), is the minimum(|qid2| - B(qid2)) for all qid2 in R2.

Page 20: Anonymity for Continuous Data Publishing

In brief…

Cracked records either do not originate from Alice's QID or do not have Alice's timestamp.

Such cracked records are not related to Alice, so excluding them allows the attacker to focus on a smaller set of candidate records.

Page 21: Anonymity for Continuous Data Publishing

Definition: BCF-Anonymity

A BCF-anonymity requirement states that all of BA(R1,R2) ≥ k, CA(R1,R2) ≥ k, and FA(R1,R2) ≥ k hold, where k is a user-specified threshold.

We now present an algorithm for anonymizing R2 = D1 ∪ D2.
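Before turning to the algorithm, the requirement can be checked on the running example. The Python sketch below is illustrative only. The crack sizes of 1 are the ones worked out on the attack slides; the crack sizes of 0 are not stated in the slides and were obtained here by applying the same formulas to the remaining QID groups, so they should be read as an assumption. The helper names are hypothetical.

```python
def anonymity(group_sizes, cracks):
    """Minimum over QID groups of |qid| minus the worst-case crack size."""
    return min(group_sizes[q] - cracks[q] for q in group_sizes)

# QID group sizes in the example releases.
r1_sizes = {("Europe", "Lawyer"): 5}
r2_sizes = {("UK", "Professional"): 5, ("France", "Professional"): 5}

# Worst-case crack sizes per QID group (zeros are derived, see lead-in).
F = {("Europe", "Lawyer"): 1}
C = {("UK", "Professional"): 0, ("France", "Professional"): 1}
B = {("UK", "Professional"): 1, ("France", "Professional"): 0}

FA = anonymity(r1_sizes, F)
CA = anonymity(r2_sizes, C)
BA = anonymity(r2_sizes, B)

def satisfies_bcf(k):
    """BCF-anonymity holds when all three measures reach the threshold k."""
    return min(BA, CA, FA) >= k

print(FA, CA, BA, satisfies_bcf(5), satisfies_bcf(4))
```

Under these assumptions R1 and R2 are each 5-anonymous on their own, yet the pair only meets a BCF-anonymity requirement with k = 4.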

Page 22: Anonymity for Continuous Data Publishing

BCF-Anonymizer

1. generalize every value for Aj ∈ QID in R2 to ANYj;
2. let candidate list contain all ANYj;
3. sort candidate list by Score in descending order;
4. while the candidate list is not empty do
5.   if the first candidate w in candidate list is valid then
6.     specialize w into {w1,…,wz} in R2;
7.     compute Score for all wi, and add them to candidate list;
8.     sort the candidate list by Score in descending order;
9.   else
10.    remove w from the candidate list;
11.  end if
12. end while
13. output R2

[Figure: taxonomy tree for Birthplace. ANY specializes into Europe, ……, America; Europe specializes into France, UK, ……]
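The top-down specialization loop can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the taxonomies are hypothetical (the slide shows only part of the Birthplace tree), the Score is replaced by a simple record count, and the validity test is plain k-anonymity standing in for the BCF-anonymity check against R1.

```python
from collections import Counter

# Hypothetical taxonomies (parent -> children); the Job tree is assumed.
TAXONOMY = {
    "ANY_Birthplace": ["Europe", "America"],
    "Europe": ["France", "UK"],
    "ANY_Job": ["Professional"],
    "Professional": ["Lawyer", "Doctor"],
}
PARENT = {child: parent for parent, kids in TAXONOMY.items() for child in kids}

def generalize(value, cut):
    """Climb the taxonomy from a raw value up to its ancestor in the cut."""
    while value not in cut:
        value = PARENT[value]
    return value

def release(raw, cut):
    """Generalize the QID part of every record; keep the sensitive value."""
    return [tuple(generalize(v, cut) for v in qid) + (s,) for *qid, s in raw]

def covered(raw, cut, w):
    """Stand-in Score: number of QID values currently generalized to w."""
    return sum(generalize(v, cut) == w for *qid, _ in raw for v in qid)

def is_k_anonymous(rows, k=2):
    """Stand-in validity test; the real test checks BCF-anonymity against R1."""
    return all(n >= k for n in Counter(r[:-1] for r in rows).values())

def anonymize(raw, is_valid=is_k_anonymous, score=covered):
    cut = {"ANY_Birthplace", "ANY_Job"}                      # step 1
    candidates = sorted(cut)                                 # step 2
    while candidates:                                        # step 4
        candidates.sort(key=lambda w: score(raw, cut, w), reverse=True)  # steps 3, 8
        w = candidates.pop(0)
        trial = (cut - {w}) | set(TAXONOMY[w])
        if is_valid(release(raw, trial)):                    # step 5
            cut = trial                                      # step 6
            candidates += [c for c in TAXONOMY[w] if c in TAXONOMY]  # step 7
        # otherwise w stays generalized and leaves the list (step 10)
    return release(raw, cut)                                 # step 13

# Raw records of D1 and D2 from the running example.
raw = (
    [("UK", "Lawyer", "Flu")] * 3
    + [("France", "Lawyer", "Diabetes")] * 3
    + [("France", "Doctor", "Flu")] * 2
    + [("UK", "Doctor", "Diabetes"), ("UK", "Lawyer", "Diabetes")]
)
r2 = anonymize(raw)
print(Counter(row[:2] for row in r2))
```

With these stand-ins the loop specializes Birthplace down to UK and France but stops Job at Professional, because splitting Professional would leave a single [UK, Doctor] record. Passing a different `is_valid` callable is where a BCF-anonymity check would be plugged in.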

Page 23: Anonymity for Continuous Data Publishing

Anti-Monotonicity of BCF-Anonymity

Theorem: Each of FA, CA and BA is non-increasing with respect to a specialization on R2.

This guarantees that the produced BCF-anonymized R2 is maximally specialized (suboptimal): any further specialization leads to a violation.

Page 24: Anonymity for Continuous Data Publishing

Empirical Study

Study the threat of correspondence attacks.

Evaluate the information usefulness of a BCF-anonymized R2.

Adult dataset (US Census data):
8 categorical attributes
30,162 records in training set
15,060 records in testing set

Page 25: Anonymity for Continuous Data Publishing

Experiment Settings

D1 contains all records in the testing set.

Three cases of D2 at timestamp T2:

200D2: D2 contains the first 200 records in the training set, modelling a small set of new records at T2.

2000D2: D2 contains the first 2000 records in the training set, modelling a medium set of new records at T2.

allD2: D2 contains all 30,162 records in the training set, modelling a large set of new records at T2.

Page 26: Anonymity for Continuous Data Publishing


Violations of BCF-Anonymity

Page 27: Anonymity for Continuous Data Publishing

Anonymization

• BCF-Anonymized R2: Our method.
• k-Anonymized R2: Not safe from correspondence attacks.
• k-Anonymized D2: Anonymize D2 separately from D1.

Page 28: Anonymity for Continuous Data Publishing

Related Work

Byun et al. (VLDB-SDM06) is an early study on the continuous data publishing scenario.
Anonymization relies on delaying the release of records, and the delay can be unbounded.
In our method, records collected at timestamp Ti are always published in the corresponding release Ri without delay.

Xiao and Tao (SIGMOD07) present the first study to address both record insertions and deletions in data re-publication.
Anonymization relies on generalization and adding counterfeit records.

Page 29: Anonymity for Continuous Data Publishing

Related Work

Wang and Fung (SIGKDD06) study the problem of anonymizing sequential releases where each subsequent release publishes a different subset of attributes for the same set of records.

[Figure: releases R1 and R2 over the attributes A, B, C, D]

Page 30: Anonymity for Continuous Data Publishing

Conclusion & Contributions

Systematically characterize different types of correspondence attacks and concisely compute their crack size.

Define BCF-anonymity requirement.

Present an anonymization algorithm to achieve BCF-anonymity while preserving information usefulness.

Extendable to multiple releases.

Page 31: Anonymity for Continuous Data Publishing

For more information: http://www.ciise.concordia.ca/~fung

Acknowledgement:
Reviewers of EDBT
Concordia University Faculty Start-up Grants
Natural Sciences and Engineering Research Council of Canada (NSERC): Discovery Grants, PGS Doctoral Award

Page 32: Anonymity for Continuous Data Publishing

References

[BSBL06] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In VLDB Workshop on Secure Data Management (SDM), 2006.

[MGKV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, Atlanta, GA, April 2006.

[PXW07] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In SSDBM, Banff, Canada, 2007.

Page 33: Anonymity for Continuous Data Publishing

References

[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, March 1998.

[WF06] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, Philadelphia, PA, August 2006, pp. 414-423.

[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pp. 466-473, November 2005.

Page 34: Anonymity for Continuous Data Publishing

References

[WFY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems: An International Journal (KAIS), 11(3):345-368, April 2007.

[XY07] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM SIGMOD, June 2007.