# [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Mondrian Multidimensional K-Anonymity

Post on 08-Dec-2016

213 views

Embed Size (px)

TRANSCRIPT

Mondrian Multidimensional K-Anonymity

Kristen LeFevre David J. DeWitt Raghu RamakrishnanUniversity of Wisconsin, Madison

Abstract

K-Anonymity has been proposed as a mechanism for pro-tecting privacy in microdata publishing, and numerous re-coding models have been considered for achieving k-anonymity. This paper proposes a new multidimensionalmodel, which provides an additional degree of exibility notseen in previous (single-dimensional) approaches. Oftenthis exibility leads to higher-quality anonymizations, asmeasured both by general-purpose metrics and more spe-cic notions of query answerability.

Optimal multidimensional anonymization is NP-hard(like previous optimal k-anonymity problems). However,we introduce a simple greedy approximation algorithm,and experimental results show that this greedy algorithmfrequently leads to more desirable anonymizations thanexhaustive optimal algorithms for two single-dimensionalmodels.

1. IntroductionA number of organizations publish microdata for purposessuch as demographic and public health research. In orderto protect individual privacy, known identiers (e.g., Nameand Social Security Number) must be removed. In addi-tion, this process must account for the possibility of com-bining certain other attributes with external data to uniquelyidentify individuals [15]. For example, an individual mightbe re-identied by joining the released data with another(public) database on Age, Sex, and Zipcode. Figure 1 showssuch an attack, where Ahmeds medical information is de-termined by joining the released patient data with a publicvoter registration list.

K-anonymity has been proposed to reduce the risk ofthis type of attack [12, 13, 15]. The primary goal of k-anonymization is to protect the privacy of the individuals towhom the data pertains. However, subject to this constraint,it is important that the released data remain as useful aspossible. Numerous recoding models have been proposedin the literature for k-anonymization [8, 9, 13, 17, 10], andoften the quality of the published data is dictated by themodel that is used. The main contributions of this paper area new multidimensional recoding model and a greedy algo-

Voter Registration DataName Age Sex ZipcodeAhmed 25 Male 53711Brooke 28 Female 55410Claire 31 Female 90210Dave 19 Male 02174Evelyn 40 Female 02237

Patient DataAge Sex Zipcode Disease25 Male 53711 Flu25 Female 53712 Hepatitis26 Male 53711 Brochitis27 Male 53710 Broken Arm27 Female 53712 AIDS28 Male 53711 Hang Nail

Figure 1. Tables vulnerable to a joining attack

rithm for k-anonymization, an approach with several impor-tant advantages:1

The greedy algorithm is substantially more efcientthan proposed optimal k-anonymization algorithms forsingle-dimensional models [2, 9, 12]. The time com-plexity of the greedy algorithm is O(nlogn), while theoptimal algorithms are exponential in the worst case.

The greedy multidimensional algorithm often pro-duces higher-quality results than optimal single-dimensional algorithms (as well as the many existingsingle-dimensional heuristic [6, 14, 16] and stochasticsearch [8, 18] algorithms).

1.1. Basic Denitions

Quasi-Identier Attribute Set A quasi-identifer is a min-imal set of attributesX1, ..., Xd in table T that can be joinedwith external information to re-identify individual records.We assume that the quasi-identier is understood based onspecic knowledge of the domain.Equivalence Class A table T consists of a multiset of tu-ples. An equivalence class for T with respect to attributesX1, ..., Xd is the set of all tuples in T containing identi-cal values (x1, ..., xd) for X1, ..., Xd. In SQL, this is like aGROUP BY query on X1, ..., Xd.

1The visual representation of such recodings reminded us of the workof artist Piet Mondrian (1872-1944).

Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

K-Anonymity Property Table T is k-anonymous withrespect to attributes X1, ..., Xd if every unique tuple(x1, ..., xd) in the (multiset) projection of T on X1, ..., Xdoccurs at least k times. That is, the size of each equivalenceclass in T with respect to X1, ..., Xd is at least k.

K-Anonymization A view V of relation T is said to be ak-anonymization if the view modies or generalizes the dataof T according to some model such that V is k-anonymouswith respect to the quasi-identier.

1.2. General-Purpose Quality Metrics

There are a number of notions of microdata quality [2, 6,8, 10, 12, 13, 14, 15, 16], but intuitively, the anonymiza-tion process should generalize or perturb the original data aslittle as is necessary to satisfy the k-anonymity constraint.Here we consider some simple general-purpose quality met-rics, but a more targeted approach to quality measurementbased on query answerability is described in Section 5.

The simplest kind of quality measure is based on the sizeof the equivalence classes E in V . Intuitively, the discern-ability metric (CDM ), described in [2], assigns to each tu-ple t in V a penalty, which is determined by the size of theequivalence class containing t.

CDM =

EquivClasses E |E|2

As an alternative, we also propose the normalized aver-age equivalence class size metric (CAVG).

CAVG = ( total recordstotal equiv classes )/(k)

1.3. Paper Overview

The rst contribution of this paper is a new multidimen-sional model for k-anonymization (Section 2). Like previ-ous optimal k-anonymity problems [1, 10], optimal multi-dimensional k-anonymization is NP-hard. However, for nu-meric data, we nd that under reasonable assumptions theworst-case maximum size of equivalence classes is O(k) inthe multidimensional case, while in the single-dimensionalmodel, this bound can grow linearly with the total numberof records. For a simple variation of the multidimensionalmodel, this bound is 2k (Section 3).

Using the multidimensional recoding model, we intro-duce a simple and efcient greedy algorithm that can be ap-plied to both categorical and numeric data (Section 4). Fornumeric data, the results are a constant-factor approxima-tion of optimal, as measured by the general-purpose qualitymetrics described in the previous section.

General-purpose quality metrics are a good starting pointwhen the ultimate use of the published data is unknown.However, in some cases, quality might be more appropri-ately measured by the application consuming the publisheddata. The second main contribution of this paper is a more

sophisticated notion of quality measurement, based on aworkload of aggregate queries (Section 5).

Using general-purpose metrics and a simple query work-load, our experimental evaluation (Section 6) indicates thatthe quality of the anonymizations obtained by our greedy al-gorithm are often superior to those obtained by exhaustiveoptimal algorithms for two single-dimensional models.

The paper concludes with discussions of related and fu-ture work (Sections 7 and 8).

2. Multidimensional Global RecodingIn a relational database, each attribute has some domain ofvalues. We use the notation DX to denote the domain of at-tribute X . A global recoding achieves anonymity by map-ping the domains of the quasi-identier attributes to gener-alized or altered values [17].

Global recoding can be further broken down into twosub-classes [9]. A single-dimensional global recoding isdened by a function i : DXi D for each attribute Xiof the quasi-identier. An anonymization V is obtained byapplying each i to the values of Xi in each tuple of T .

Alternatively, a multidimensional global recoding is de-ned by a single function : DX1 ... DXn D,which is used to recode the domain of value vectors asso-ciated with the set of quasi-identier attributes. Under thismodel, V is obtained by applying to the vector of quasi-identier values in each tuple of T .

Multidimensional recoding can be applied to categor-ical data (in the presence of user-dened generalizationhierarchies) or to numeric data. For numeric data, andother totally-ordered domains, (single-dimensional) par-titioning models have been proposed [2, 8]. A single-dimensional interval is dened by a pair of endpoints p, v DXi such that p v. (The endpoints of such an intervalmay be open or closed to handle continuous domains.)

Single-dimensional Partitioning Assume there is a totalorder associated with the domain of each quasi-identierattribute Xi. A single-dimensional partitioning denes, foreach Xi, a set of non-overlapping single-dimensional in-tervals that cover DXi . i maps each x DXi to somesummary statistic for the interval in which it is contained.

The released data will include simple statistics that sum-marize the intervals they replace. For now, we assume thatthese summary statistics are min-max ranges, but we dis-cuss some other possibilities in Section 5.

This partitioning model is easily extended to multidi-mensional recoding. Again, assume a total order for eachDXi . A multidimensional region is dened by a pair of d-tuples (p1, ..., pd), (v1, ..., vd) DX1 ...DXd such thati, pi vi. Conceptually, each region is bounded by a d-dimensional rectangular box, and each edge and vertex ofthis box may be either open or closed.

Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

Age Sex Zipcode Disease[25-28] Male [53710-53711] Flu[25-28] Female 53712 Hepatitis[25-28] Male [53710-53711] Brochitis[25-28] Male [53710-53711] Broken Arm[25-28] Female 53712 AIDS[25-28] Male [53710-53711] Hang Nail

Figure 2. Single-dimensional anonymization

Age Sex Zipcode Disease[25-26] Male 53711 Flu[25-27] Female 53712 Hepatitis[25-26] Male 53711 Brochitis[27-28] Male [53710-53711] Broken Arm[25-27] Female 53712 AIDS[27-28] Male [53710-53711] Hang Nail

Figure 3. Multidimensional anonymization

Strict Multidimensional Partitioning A strict multidi-mensional partitioning denes a set of non-overlappingmultidimensional regions that cover DX1 ... DXd . maps each tuple (x1, ..., xd) DX1 ...DXd to a sum-mary statistic for the region in which it is contained.

When is applied to table T (assuming each region ismapped to a unique vector of summary statistics), the tupleset in each non-empty region forms an equivalence class inV . For simplicity, we again assume that these summarystatistics are ranges, and further discussion is provided inSection 5.

Sample 2-anonymizations of Patients, using single-dimensional and strict multidimensional partitioning, areshown in Figures 2 and 3. Notice that the anonymization ob-tained using the multidimensional model is not permissibleunder the single-dimensional model because the domains ofAge and Zipcode are not recoded to a single set of intervals(e.g., Age 25 is mapped to either [25-26] or [25-27], de-pending on the values of Zipcode and Sex). However, thesingle-dimensional recoding is also valid under the multidi-mensional model.

Proposition 1 Every single-dimensional partitioning forquasi-identier attributes X1, ..., Xd can be expressed as astrict multidimensional partitioning. However, when d 2 and i, |DXi | 2, there exists a strict multidimen-sional partitioning that cannot be expressed as a single-dimensional partitioning.

It is intuitive that the optimal strict multidimensionalpartitioning must be at least as good as the optimalsingle-dimensional partitioning. However, the optimal k-anonymous multidimensional partitioning problem is NP-hard (Section 2.2). For this reason, we consider theworst-case upper bounds on equivalence class size forsingle-dimensional and multidimensional partitioning (Sec-tion 2.3).

2.1. Spatial Representation

Throughout the paper, we use a convenient spatial repre-sentation for quasi-identiers. Consider table T with quasi-identier attributes X1, ..., Xd, and assume that there is atotal ordering for each domain. The (multiset) projectionsof T on X1, ..., Xd can then be represented as a multisetof points in d-dimensional space. For example, Figure 4(a)shows the two-dimensional representation of Patients fromFigure 1, for quasi-identier attributes Age and Zipcode.

Similar models have been considered for rectangular par-titioning in 2 dimensions [11]. In this context, the single-dimensional and multidimensional partitioning models areanalogous to the p p and arbitrary classes of tilings,respectively. However, to the best of our knowledge, noneof the previous optimal tiling problems have included con-straints requiring minimum occupancy.

2.2. Hardness Result

There have been previous hardness results for optimal k-anonymization under the attribute- and cell-suppressionmodels [1, 10]. The problem of optimal strict k-anonymousmultidimensional partitioning (nding the k-anonymousstrict multidimensional partitioning with the smallest CDMor CAVG) is also NP-hard, but this result does not followdirectly from the previous results.

We formulate the following decision problem usingCAVG.2 Here, a multiset of points P is equivalently rep-resented as a set of distinct (point, count) pairs.

Decisional K-Anonymous Multidimensional PartitioningGiven a set P of unique (point, count) pairs, with pointsin d-dimensional space, is there a strict multidimen-sional partitioning for P such that for every resultingmultidimensional region Ri,

pRi count(p) k or

pRi count(p) = 0, and CAVG positive constant c?Theorem 1 Decisional k-anonymous multidimensionalpartitioning is NP-complete.

Proof The proof is by reduction from Partition:Partition Consider a set A of n positive integers{a1, ..., an}. Is there some A A, such that

aiA ai =

ajAA aj ?

For each ai A, construct a (point, count) pair. Let thepoint be dened by (01, ..., 0, 1i, 0, ..., 0d) (the ith coordi-nate is 1, and all other coordinates are 0), which resides ina d-dimensional unit-hypercube, and let the count equal ai.Let P be the union of all such pairs.

We claim that the partition problem for A can be re-duced to the following: Let k =

ai2 . Is there a k-

anonymous strict multidimensional partitioning for P such2Though stated for CAV G, the result is similar for CDM .

Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

28

27

26

25

537125371153710

(a) Patients

28

27

26

25

537125371153710

(b) Single-Dimensional

28

27

26

25

537125371153710

(c) Strict Multidimensional

Figure 4. Spatial representation of Patients and partitionings (quasi-identiers Zipcode and Age)

that CAVG 1? To prove this claim, we show that there isa solution to the k-anonymous multidimensional partition-ing problem for P if and only if there is a solution to thepartition problem for A.

Suppose there exists a k-anonymous multidimen-sional partitioning for P . This partitioning must de-ne two multidimensional regions, R1 and R2, such that

pR1 count(p) =

pR2 count(p) = k =

ai2 , and

possibly some number of empty regions. By the strictnessproperty, these regions must not overlap. Thus, the sum ofcounts for the two non-empty regions constitute the sum ofintegers in two disjoint complementary subsets of A,...

Recommended