# IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA...

Mondrian Multidimensional K-Anonymity

Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan
University of Wisconsin, Madison

Abstract

K-anonymity has been proposed as a mechanism for protecting privacy in microdata publishing, and numerous recoding models have been considered for achieving k-anonymity. This paper proposes a new multidimensional model, which provides an additional degree of flexibility not seen in previous (single-dimensional) approaches. Often this flexibility leads to higher-quality anonymizations, as measured both by general-purpose metrics and more specific notions of query answerability.

Optimal multidimensional anonymization is NP-hard (like previous optimal k-anonymity problems). However, we introduce a simple greedy approximation algorithm, and experimental results show that this greedy algorithm frequently leads to more desirable anonymizations than exhaustive optimal algorithms for two single-dimensional models.

1. Introduction

A number of organizations publish microdata for purposes such as demographic and public health research. In order to protect individual privacy, known identifiers (e.g., Name and Social Security Number) must be removed. In addition, this process must account for the possibility of combining certain other attributes with external data to uniquely identify individuals [15]. For example, an individual might be re-identified by joining the released data with another (public) database on Age, Sex, and Zipcode. Figure 1 shows such an attack, where Ahmed's medical information is determined by joining the released patient data with a public voter registration list.
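The joining attack can be sketched in a few lines of Python. This is only an illustration: the rows are copied from Figure 1, while the tuple layout and the helper name `join_attack` are assumptions of the sketch and do not come from the paper.

```python
# Rows copied from Figure 1. The tuple layout is an assumption of this sketch.
voters = [
    ("Ahmed", 25, "Male", "53711"),
    ("Brooke", 28, "Female", "55410"),
    ("Claire", 31, "Female", "90210"),
    ("Dave", 19, "Male", "02174"),
    ("Evelyn", 40, "Female", "02237"),
]
patients = [
    (25, "Male", "53711", "Flu"),
    (25, "Female", "53712", "Hepatitis"),
    (26, "Male", "53711", "Bronchitis"),
    (27, "Male", "53710", "Broken Arm"),
    (27, "Female", "53712", "AIDS"),
    (28, "Male", "53711", "Hang Nail"),
]


def join_attack(voters, patients):
    """Hypothetical helper: link names to diseases via (Age, Sex, Zipcode)."""
    # Index the released patient data by its quasi-identifier values.
    by_qi = {}
    for age, sex, zipcode, disease in patients:
        by_qi.setdefault((age, sex, zipcode), []).append(disease)
    # Equi-join the public voter list against that index.
    return {
        name: by_qi[(age, sex, zipcode)]
        for name, age, sex, zipcode in voters
        if (age, sex, zipcode) in by_qi
    }


linked = join_attack(voters, patients)
print(linked)  # {'Ahmed': ['Flu']}
```

Only Ahmed's quasi-identifier values appear in both tables, and they match exactly one patient record, so his diagnosis is disclosed.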

Voter Registration Data

| Name | Age | Sex | Zipcode |
|---|---|---|---|
| Ahmed | 25 | Male | 53711 |
| Brooke | 28 | Female | 55410 |
| Claire | 31 | Female | 90210 |
| Dave | 19 | Male | 02174 |
| Evelyn | 40 | Female | 02237 |

Patient Data

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| 25 | Male | 53711 | Flu |
| 25 | Female | 53712 | Hepatitis |
| 26 | Male | 53711 | Bronchitis |
| 27 | Male | 53710 | Broken Arm |
| 27 | Female | 53712 | AIDS |
| 28 | Male | 53711 | Hang Nail |

Figure 1. Tables vulnerable to a joining attack

K-anonymity has been proposed to reduce the risk of this type of attack [12, 13, 15]. The primary goal of k-anonymization is to protect the privacy of the individuals to whom the data pertains. However, subject to this constraint, it is important that the released data remain as useful as possible. Numerous recoding models have been proposed in the literature for k-anonymization [8, 9, 13, 17, 10], and often the quality of the published data is dictated by the model that is used. The main contributions of this paper are a new multidimensional recoding model and a greedy algorithm for k-anonymization, an approach with several important advantages:¹

- The greedy algorithm is substantially more efficient than proposed optimal k-anonymization algorithms for single-dimensional models [2, 9, 12]. The time complexity of the greedy algorithm is $O(n \log n)$, while the optimal algorithms are exponential in the worst case.

- The greedy multidimensional algorithm often produces higher-quality results than optimal single-dimensional algorithms (as well as the many existing single-dimensional heuristic [6, 14, 16] and stochastic search [8, 18] algorithms).

1.1. Basic Definitions

Quasi-Identifier Attribute Set. A quasi-identifier is a minimal set of attributes $X_1, \ldots, X_d$ in table T that can be joined with external information to re-identify individual records. We assume that the quasi-identifier is understood based on specific knowledge of the domain.

Equivalence Class. A table T consists of a multiset of tuples. An equivalence class for T with respect to attributes $X_1, \ldots, X_d$ is the set of all tuples in T containing identical values $(x_1, \ldots, x_d)$ for $X_1, \ldots, X_d$. In SQL, this is like a GROUP BY query on $X_1, \ldots, X_d$.
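As a rough illustration of the GROUP BY analogy, the following Python sketch groups the patient rows of Figure 1 by a chosen set of attributes. The function name `equivalence_classes` and the positional `qi_indexes` parameter are assumptions made for this example.

```python
from collections import defaultdict


def equivalence_classes(table, qi_indexes):
    """Group the tuples of `table` by their values at positions `qi_indexes`.

    Mirrors a SQL GROUP BY on the chosen attributes. Hypothetical helper.
    """
    classes = defaultdict(list)
    for row in table:
        key = tuple(row[i] for i in qi_indexes)
        classes[key].append(row)
    return dict(classes)


# Patient rows from Figure 1: (Age, Sex, Zipcode, Disease).
patients = [
    (25, "Male", "53711", "Flu"),
    (25, "Female", "53712", "Hepatitis"),
    (26, "Male", "53711", "Bronchitis"),
    (27, "Male", "53710", "Broken Arm"),
    (27, "Female", "53712", "AIDS"),
    (28, "Male", "53711", "Hang Nail"),
]

# Equivalence classes with respect to (Sex, Zipcode).
classes = equivalence_classes(patients, qi_indexes=(1, 2))
sizes = {key: len(rows) for key, rows in classes.items()}
print(sizes)
# {('Male', '53711'): 3, ('Female', '53712'): 2, ('Male', '53710'): 1}
```

Because T is a multiset, duplicate tuples are kept in each class rather than collapsed.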

¹ The visual representation of such recodings reminded us of the work of artist Piet Mondrian (1872-1944).

Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

K-Anonymity Property. Table T is k-anonymous with respect to attributes $X_1, \ldots, X_d$ if every unique tuple $(x_1, \ldots, x_d)$ in the (multiset) projection of T on $X_1, \ldots, X_d$ occurs at least k times. That is, the size of each equivalence class in T with respect to $X_1, \ldots, X_d$ is at least k.

K-Anonymization. A view V of relation T is said to be a k-anonymization if the view modifies or generalizes the data of T according to some model such that V is k-anonymous with respect to the quasi-identifier.
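The k-anonymity property reduces to a counting check, sketched below in Python. The helper `is_k_anonymous` is hypothetical. The `generalized` rows use the quasi-identifier values of the single-dimensional anonymization in Figure 2, and `original` uses the unmodified patient data of Figure 1.

```python
from collections import Counter


def is_k_anonymous(qi_rows, k):
    """True if every distinct quasi-identifier tuple occurs at least k times."""
    counts = Counter(qi_rows)
    return all(count >= k for count in counts.values())


# Quasi-identifier projections (Age, Sex, Zipcode) only.
original = [
    (25, "Male", "53711"),
    (25, "Female", "53712"),
    (26, "Male", "53711"),
    (27, "Male", "53710"),
    (27, "Female", "53712"),
    (28, "Male", "53711"),
]
generalized = [
    ("[25-28]", "Male", "[53710-53711]"),
    ("[25-28]", "Female", "53712"),
    ("[25-28]", "Male", "[53710-53711]"),
    ("[25-28]", "Male", "[53710-53711]"),
    ("[25-28]", "Female", "53712"),
    ("[25-28]", "Male", "[53710-53711]"),
]

print(is_k_anonymous(original, 2))     # False: every tuple is unique
print(is_k_anonymous(generalized, 2))  # True: class sizes are 4 and 2
print(is_k_anonymous(generalized, 3))  # False: the Female class has 2 rows
```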

1.2. General-Purpose Quality Metrics

There are a number of notions of microdata quality [2, 6, 8, 10, 12, 13, 14, 15, 16], but intuitively, the anonymization process should generalize or perturb the original data as little as is necessary to satisfy the k-anonymity constraint. Here we consider some simple general-purpose quality metrics, but a more targeted approach to quality measurement based on query answerability is described in Section 5.

The simplest kind of quality measure is based on the size of the equivalence classes E in V. Intuitively, the discernability metric ($C_{DM}$), described in [2], assigns to each tuple t in V a penalty, which is determined by the size of the equivalence class containing t.

$$C_{DM} = \sum_{\text{EquivClasses } E} |E|^2$$

As an alternative, we also propose the normalized average equivalence class size metric ($C_{AVG}$).

$$C_{AVG} = \left( \frac{\text{total records}}{\text{total equiv classes}} \right) / k$$
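Both metrics depend only on the sizes of the equivalence classes, so they are straightforward to compute. The Python sketch below uses hypothetical function names and two illustrative partitions of six records with k = 2: one with class sizes 4 and 2, and one with three classes of size 2.

```python
def discernability(class_sizes):
    """C_DM: each tuple is penalized by the size of its class, so sum |E|^2."""
    return sum(size * size for size in class_sizes)


def normalized_avg_class_size(class_sizes, k):
    """C_AVG: (total records / total equivalence classes) / k."""
    total_records = sum(class_sizes)
    return (total_records / len(class_sizes)) / k


coarse = [4, 2]   # two classes covering six records
fine = [2, 2, 2]  # three classes covering the same six records

print(discernability(coarse), normalized_avg_class_size(coarse, k=2))  # 20 1.5
print(discernability(fine), normalized_avg_class_size(fine, k=2))      # 12 1.0
```

Lower values are better for both metrics. A $C_{AVG}$ of 1.0 means the classes are, on average, no larger than the minimum size k.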

1.3. Paper Overview

The first contribution of this paper is a new multidimensional model for k-anonymization (Section 2). Like previous optimal k-anonymity problems [1, 10], optimal multidimensional k-anonymization is NP-hard. However, for numeric data, we find that under reasonable assumptions the worst-case maximum size of equivalence classes is $O(k)$ in the multidimensional case, while in the single-dimensional model, this bound can grow linearly with the total number of records. For a simple variation of the multidimensional model, this bound is $2k$ (Section 3).

Using the multidimensional recoding model, we introduce a simple and efficient greedy algorithm that can be applied to both categorical and numeric data (Section 4). For numeric data, the results are a constant-factor approximation of optimal, as measured by the general-purpose quality metrics described in the previous section.

General-purpose quality metrics are a good starting point when the ultimate use of the published data is unknown. However, in some cases, quality might be more appropriately measured by the application consuming the published data. The second main contribution of this paper is a more sophisticated notion of quality measurement, based on a workload of aggregate queries (Section 5).

Using general-purpose metrics and a simple query workload, our experimental evaluation (Section 6) indicates that the quality of the anonymizations obtained by our greedy algorithm is often superior to that obtained by exhaustive optimal algorithms for two single-dimensional models.

The paper concludes with discussions of related and future work (Sections 7 and 8).

2. Multidimensional Global Recoding

In a relational database, each attribute has some domain of values. We use the notation $D_X$ to denote the domain of attribute X. A global recoding achieves anonymity by mapping the domains of the quasi-identifier attributes to generalized or altered values [17].

Global recoding can be further broken down into two sub-classes [9]. A single-dimensional global recoding is defined by a function $\phi_i : D_{X_i} \rightarrow D'$ for each attribute $X_i$ of the quasi-identifier. An anonymization V is obtained by applying each $\phi_i$ to the values of $X_i$ in each tuple of T.

Alternatively, a multidimensional global recoding is defined by a single function $\phi : D_{X_1} \times \cdots \times D_{X_n} \rightarrow D'$, which is used to recode the domain of value vectors associated with the set of quasi-identifier attributes. Under this model, V is obtained by applying $\phi$ to the vector of quasi-identifier values in each tuple of T.
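The difference between the two models can be made concrete with a small Python sketch over the two attributes (Age, Zipcode). The functions `phi_age`, `phi_zip`, and `phi_multi`, and the specific regions they encode, are hypothetical choices made for illustration only.

```python
def phi_age(age):
    """Single-dimensional recoding of Age: looks at the age value alone."""
    return "[25-28]" if 25 <= age <= 28 else str(age)


def phi_zip(zipcode):
    """Single-dimensional recoding of Zipcode: looks at the zipcode alone."""
    return "[53710-53711]" if zipcode in ("53710", "53711") else zipcode


def recode_single(row):
    """Apply one function per attribute, independently."""
    age, zipcode = row
    return (phi_age(age), phi_zip(zipcode))


def phi_multi(row):
    """Multidimensional recoding: one function over the whole value vector."""
    age, zipcode = row
    if zipcode == "53712" and 25 <= age <= 27:
        return ("[25-27]", "53712")
    if zipcode == "53711" and 25 <= age <= 26:
        return ("[25-26]", "53711")
    if zipcode in ("53710", "53711") and 27 <= age <= 28:
        return ("[27-28]", "[53710-53711]")
    raise ValueError(f"no region covers {row!r} in this toy example")


# Single-dimensional: zipcode 53711 always maps to the same generalized value.
print(recode_single((25, "53711")))  # ('[25-28]', '[53710-53711]')
print(recode_single((28, "53711")))  # ('[25-28]', '[53710-53711]')

# Multidimensional: the recoding of zipcode 53711 depends on the age as well.
print(phi_multi((25, "53711")))  # ('[25-26]', '53711')
print(phi_multi((28, "53711")))  # ('[27-28]', '[53710-53711]')
```

The extra flexibility comes from the last two calls: no pair of independent per-attribute functions can map the same zipcode to two different outputs.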

Multidimensional recoding can be applied to categorical data (in the presence of user-defined generalization hierarchies) or to numeric data. For numeric data, and other totally-ordered domains, (single-dimensional) partitioning models have been proposed [2, 8]. A single-dimensional interval is defined by a pair of endpoints $p, v \in D_{X_i}$ such that $p \le v$. (The endpoints of such an interval may be open or closed to handle continuous domains.)

Single-dimensional Partitioning. Assume there is a total order associated with the domain of each quasi-identifier attribute $X_i$. A single-dimensional partitioning defines, for each $X_i$, a set of non-overlapping single-dimensional intervals that cover $D_{X_i}$. $\phi_i$ maps each $x \in D_{X_i}$ to some summary statistic for the interval in which it is contained.

The released data will include simple statistics that summarize the intervals they replace. For now, we assume that these summary statistics are min-max ranges, but we discuss some other possibilities in Section 5.
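A single-dimensional partitioning with min-max summaries might be sketched as follows. The sketch assumes an integer domain, closed intervals, and that the summary is the pair of interval endpoints. The factory name `make_phi` and its parameters are hypothetical.

```python
from bisect import bisect_right


def make_phi(cut_points, domain_min, domain_max):
    """Build a recoding function for one integer attribute.

    The sorted `cut_points` split [domain_min, domain_max] into
    non-overlapping intervals that cover the domain. Each value is
    mapped to the (min, max) range of the interval containing it.
    """
    # bounds[i] is the first value of interval i; the last entry is a sentinel.
    bounds = [domain_min] + list(cut_points) + [domain_max + 1]

    def phi(x):
        if not domain_min <= x <= domain_max:
            raise ValueError(f"{x} is outside the domain")
        i = bisect_right(bounds, x) - 1
        return (bounds[i], bounds[i + 1] - 1)

    return phi


# Split the Age domain 25..28 into the intervals [25-26] and [27-28].
phi_age = make_phi(cut_points=[27], domain_min=25, domain_max=28)
recoded = [phi_age(age) for age in (25, 26, 27, 28)]
print(recoded)  # [(25, 26), (25, 26), (27, 28), (27, 28)]
```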

This partitioning model is easily extended to multidimensional recoding. Again, assume a total order for each $D_{X_i}$. A multidimensional region is defined by a pair of d-tuples $(p_1, \ldots, p_d), (v_1, \ldots, v_d) \in D_{X_1} \times \cdots \times D_{X_d}$ such that $\forall i,\ p_i \le v_i$. Conceptually, each region is bounded by a d-dimensional rectangular box, and each edge and vertex of this box may be either open or closed.
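A region of this kind can be represented by its two corner tuples. The Python sketch below assumes closed boxes over numeric attributes (the open-boundary case is omitted), and the helper names `make_region` and `in_region` are hypothetical.

```python
def make_region(lower, upper):
    """Validate and return a region given corner tuples (p_1..p_d), (v_1..v_d)."""
    if len(lower) != len(upper):
        raise ValueError("corner tuples must have the same dimension")
    if any(p > v for p, v in zip(lower, upper)):
        raise ValueError("each p_i must not exceed v_i")
    return (tuple(lower), tuple(upper))


def in_region(point, region):
    """Closed-box membership: p_i <= x_i <= v_i in every dimension i."""
    lower, upper = region
    return all(p <= x <= v for x, p, v in zip(point, lower, upper))


# A two-dimensional region over (Age, Zipcode), with zipcodes as integers.
region = make_region(lower=(27, 53710), upper=(28, 53711))

print(in_region((28, 53711), region))  # True
print(in_region((26, 53711), region))  # False: age is below p_1
```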


| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [25-28] | Male | [53710-53711] | Flu |
| [25-28] | Female | 53712 | Hepatitis |
| [25-28] | Male | [53710-53711] | Bronchitis |
| [25-28] | Male | [53710-53711] | Broken Arm |
| [25-28] | Female | 53712 | AIDS |
| [25-28] | Male | [53710-53711] | Hang Nail |

Figure 2. Single-dimensional anonymization

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [25-26] | Male | 53711 | Flu |
| [25-27] | Female | 53712 | Hepatitis |
| [25-26] | Male | 53711 | Bronchitis |
| [27-28] | Male | [53710-53711] | Broken Arm |
| [25-27] | Female | 53712 | AIDS |
| [27-28] | Male | [53710- | |
