
GENERALIZED APPROACH FOR DATA ANONYMIZATION USING MAP REDUCE

ON CLOUD K.R.VIGNESH,

M.Tech CSE,

SRM University,

Kattankulathur,

Chennai, India.

P.SARANYA,

Asst. Professor, Dept. of CSE,

SRM University,

Kattankulathur,

Chennai, India

ABSTRACT— Data anonymization is an extensively studied and widely adopted method for privacy preservation in data publishing and sharing scenarios. It hides the identity and/or sensitive attributes in a data owner's records so that individuals cannot be re-identified, while aggregate information can still be shared with data users for data analysis and data mining. The proposed method is a generalized approach to data anonymization using MapReduce on cloud, based on Two-Phase Top-Down Specialization. In the first phase, the original data set is partitioned into a group of smaller data sets, which are anonymized in parallel to produce intermediate results. In the second phase, the intermediate results are further anonymized to achieve a consistent data set, and the data is presented in generalized form using the generalized approach.

Keywords: Cloud computing, Data Anonymization, Map Reduce, Privacy Preserving.

1. INTRODUCTION:

Cloud computing, a disruptive trend at present, attracts significant attention from the current IT industry and from research organizations. Cloud computing provides massive storage and computation capacity, enabling users to deploy applications cost-effectively without heavy investment in infrastructure. However, privacy preservation is one of the major concerns in a cloud environment. Some privacy issues are not new, for example when personal health records are shared with research organizations for data analysis, as with Microsoft HealthVault, an online health cloud service.

Data anonymization is a widely used method for privacy preservation in non-interactive data publishing scenarios. It refers to hiding the identity and/or sensitive attributes in data owners' records. The privacy of individuals can be effectively preserved while some aggregate information is shared for data analysis and mining. A variety of anonymization algorithms with different operations have been proposed [3,4,5,6]. However, data set sizes have increased tremendously in the big data trend [1,7], which has become a challenge for data set anonymization. To process such large data sets, we use MapReduce integrated with cloud to provide high computational capability for the application.

2. RELATED WORK:

Recently, data privacy preservation has been extensively studied and investigated [2]. LeFevre et al. addressed the scalability of anonymization algorithms by introducing scalable decision trees and sampling techniques, and Iwuchukwu et al. [8] proposed an R-tree-based index approach that builds a spatial index over data sets, achieving high efficiency. However, these approaches aim at multidimensional generalization [6], which fails to work in Top-Down Specialization (TDS).

Fung et al. [2, 9, 10] proposed TDS approaches that produce anonymized data sets without the data exploration problem. A data structure, Taxonomy Indexed PartitionS (TIPS), is exploited to improve the efficiency of TDS, but the approach is centralized, leading to inadequacy for large data sets.

Several distributed algorithms have been proposed to preserve the privacy of multiple data sets retained by multiple parties. Jiang et al. [12] proposed a distributed algorithm to anonymize vertically partitioned data. However, the above algorithms mainly address secure anonymization and integration, while our aim is the scalability issue of TDS anonymization.

Further, Zhang et al. [13] leveraged MapReduce itself to automatically partition a computing job in terms of the security levels of the data it handles, so that large-scale data can be anonymized before being further processed by other MapReduce jobs, arriving at privacy preservation.

3. Top-Down Specialization:

Generally, Top-Down Specialization (TDS) is an iterative process starting from the topmost domain values in the taxonomy trees of the attributes. Each round of iteration consists of three main steps, namely, finding the best specialization, performing the specialization, and updating the values of the search metric for the next round [3]. Such a process is repeated until k-anonymity is violated, in order to expose the maximum data utility. The goodness of a specialization is measured by a search metric. We adopt the Information Gain per Privacy Loss (IGPL), a trade-off metric that considers both the privacy and information requirements, as the search metric in our approach. The specialization with the highest IGPL value is regarded as the best one and is selected in each round. We briefly describe how to calculate the IGPL value below; interested readers can refer to [11] for more details.

Given a specialization spec: p → child(p), the IGPL of the specialization is calculated by

IGPL(spec) = IG(spec) / (PL(spec) + 1). (1)

The term IG(spec) is the information gain after performing spec, and PL(spec) is the privacy loss. IG(spec) and PL(spec) can be computed via statistical information derived from the data sets. Let Rx denote the set of original records containing attribute values that can be generalized to x, and let |Rx| be the number of data records in Rx. Let I(Rx) be the entropy of Rx. Then, IG(spec) is calculated by

IG(spec) = I(Rp) − Σ_{c ∈ child(p)} (|Rc| / |Rp|) · I(Rc), (2)

Let |(Rx, sv)| denote the number of data records with sensitive value sv in Rx. I(Rx) is computed by

I(Rx) = − Σ_{sv ∈ SV} (|(Rx, sv)| / |Rx|) · log2(|(Rx, sv)| / |Rx|). (3)

The anonymity of a data set is defined as the minimum group size out of all QI-groups, denoted as A, i.e., A = min_{qid ∈ QID} |QID(qid)|, where |QID(qid)| is the size of the group QID(qid). Let Ap(spec) denote the anonymity before performing spec, and Ac(spec) the anonymity after performing spec. The privacy loss caused by spec is calculated by

PL(spec) = Ap(spec) − Ac(spec).
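To make the search metric concrete, the per-round computation of Eqs. (1)-(3) can be sketched in Python. This is an illustrative sketch, not the paper's MapReduce implementation: the record format (a dict with an "sv" field holding the sensitive value) and the precomputed anonymity values passed in are assumptions of the example.

```python
from collections import Counter
from math import log2

def entropy(records):
    """I(Rx): entropy of the sensitive-value distribution in a record set, Eq. (3)."""
    counts = Counter(r["sv"] for r in records)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def igpl(parent_records, child_partitions, anonymity_before, anonymity_after):
    """IGPL(spec) = IG(spec) / (PL(spec) + 1), per Eqs. (1)-(2).

    child_partitions maps each child value c to its record set Rc."""
    n = len(parent_records)
    ig = entropy(parent_records) - sum(
        (len(rc) / n) * entropy(rc) for rc in child_partitions.values()
    )
    pl = anonymity_before - anonymity_after
    return ig / (pl + 1)
```

For example, a specialization that splits four records (two sensitive values, evenly mixed) into two pure child groups has IG = 1 bit; if the anonymity drops from 4 to 2, PL = 2 and IGPL = 1/3.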

4. Two-Phase Top-Down Specialization:

In the Two-Phase Top-Down Specialization (TPTDS) approach, the given data set is first partitioned and anonymized to produce intermediate results; in the second phase, the intermediate results are further anonymized and stored in the database. The TPTDS approach has three components, namely, data partition, anonymization-level merging, and data specialization.

4.1 Sketch of Two-Phase Top-Down Specialization:

We propose a Two-Phase Top-Down Specialization (TPTDS) approach to conduct the computation required in TDS in a highly scalable and efficient fashion. The two phases of our approach are based on the two levels of parallelization provisioned by MapReduce on cloud. Basically, MapReduce on cloud has two levels of parallelization, i.e., job level and task level. Job-level parallelization means that multiple MapReduce jobs can be executed simultaneously to make full use of cloud infrastructure resources. Combined with cloud, MapReduce becomes more powerful and elastic, as cloud can offer infrastructure resources on demand, e.g., the Amazon Elastic MapReduce service [11]. Task-level parallelization means that multiple mapper/reducer tasks in a MapReduce job are executed simultaneously over data splits. To achieve high scalability, we parallelize multiple jobs on data partitions in the first phase, but the resultant anonymization levels are not identical. To obtain finally consistent anonymous data sets, the second phase is necessary to integrate the intermediate results and further anonymize the entire data set.

In the first phase, an original data set D is partitioned into smaller ones. Let Di, 1 ≤ i ≤ p, denote the data sets partitioned from D, where p is the number of partitions, D = ∪_{i=1..p} Di, and Di ∩ Dj = ∅ for 1 ≤ i < j ≤ p. Then, we run a subroutine over each of the partitioned data sets in parallel to make full use of the job-level parallelization of MapReduce. The subroutine is a MapReduce version of centralized TDS (MRTDS), which concretely conducts the computation required in TPTDS. MRTDS anonymizes the data partitions to generate intermediate anonymization levels. An intermediate anonymization level is one from which further specialization can still be performed without violating k-anonymity. MRTDS only leverages the task-level parallelization of MapReduce. Formally, let the function MRTDS(D, k, AL) → AL′ represent an MRTDS routine that anonymizes data set D to satisfy k-anonymity, advancing from anonymization level AL to AL′. AL0 is the initial anonymization level, i.e., AL0 = ({TOP1}, {TOP2}, …, {TOPm}), where TOPj, 1 ≤ j ≤ m, is the topmost domain value in taxonomy tree TTj. AL′i is the resultant intermediate anonymization level for partition Di.

In the second phase, all intermediate anonymization levels are merged into one. The merged anonymization level is denoted as AL^I. The merging process is formally represented as the function merge(AL′1, AL′2, …, AL′p) → AL^I. Then, the whole data set D is further anonymized based on AL^I, finally achieving k-anonymity, i.e., MRTDS(D, k, AL^I) → AL*, where AL* denotes the final anonymization level. Ultimately, D is concretely anonymized according to AL*. Algorithm 1 depicts the sketch of the two-phase TDS approach.


5. Data Partition:

When D is partitioned into Di, 1 ≤ i ≤ p, it is required that the distribution of data records in each Di is similar to that in D. A data record can be treated as a point in an m-dimensional space, where m is the number of attributes. Thus, the intermediate anonymization levels derived from the Di, 1 ≤ i ≤ p, will be more similar to one another, so that we can obtain a better merged anonymization level. A random-sampling technique is adopted to partition D, which satisfies the above requirement. Specifically, a random number Rand, 1 ≤ Rand ≤ p, is generated for each data record, and the record is assigned to the partition D_Rand. Algorithm 2 shows the MapReduce program for data partition. Note that the number of Reducers should be equal to p, so that each Reducer handles one value of Rand, producing exactly p resultant files. Each file contains a random sample of D. Once the partitioned data sets Di, 1 ≤ i ≤ p, are obtained, we run MRTDS(Di, k′, AL0) on these data sets in parallel to derive the intermediate anonymization levels AL′i, 1 ≤ i ≤ p.
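The random-sampling partition described above can be simulated in a few lines of Python. This is a single-process sketch, not a real MapReduce job: the dictionary of lists standing in for the p Reducer output files, and the fixed seed, are assumptions of the example.

```python
import random
from collections import defaultdict

def partition(dataset, p, seed=0):
    """Simulate the data-partition job: the Map phase emits (rand, record)
    with 1 <= rand <= p; the Reduce phase collects the record list for each
    rand, so each of the p outputs is a random sample of the data set."""
    rng = random.Random(seed)
    mapped = [(rng.randint(1, p), record) for record in dataset]  # Map phase
    partitions = defaultdict(list)
    for rand, record in mapped:                                   # Reduce phase
        partitions[rand].append(record)
    return [partitions[i] for i in range(1, p + 1)]
```

Because assignment is uniform at random, each partition approximates the record distribution of D, which is exactly the property the merge step relies on.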

6. Data Specialization:

An original data set D is concretely specialized for anonymization in a one-pass MapReduce job. After obtaining the merged intermediate anonymization level AL^I, we run MRTDS(D, k, AL^I) on the entire data set D and obtain the final anonymization level AL*. Then, the data set D is anonymized by replacing the original attribute values in D with the corresponding domain values in AL*.

Details of the Map and Reduce functions of the data-specialization MapReduce job are described in Algorithm 3. The Map function emits an anonymous record and its count. The Reduce function simply aggregates these anonymous records and counts their number. An anonymous record and its count represent a QI-group. The QI-groups constitute the final anonymous data set.

Algorithm 1: Two-Phase Top-Down Specialization:

Input: Data set D, anonymity parameters k and k′, and the number of partitions p.

Output: Anonymous data set D*.

1. Partition D into Di, 1 ≤ i ≤ p.

2. Execute MRTDS(Di, k′, AL0) → AL′i, 1 ≤ i ≤ p, in parallel as multiple MapReduce jobs.

3. Merge all intermediate anonymization levels into one: merge(AL′1, AL′2, …, AL′p) → AL^I.

4. Execute MRTDS(D, k, AL^I) → AL* to achieve k-anonymity.

5. Specialize D according to AL*, and output D*.
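The five steps above can be sketched as a driver function. Everything here is schematic: `partition`, `mrtds`, `merge`, and `specialize` are caller-supplied stand-ins for the MapReduce subroutines described in the text, and `None` stands in for the initial anonymization level AL0.

```python
def tptds(dataset, k, k_intermediate, p, partition, mrtds, merge, specialize):
    """Sketch of Algorithm 1 (TPTDS); the four callables stand in for the
    paper's MapReduce subroutines, which are assumed interfaces here."""
    parts = partition(dataset, p)                                      # Step 1
    al0 = None                                                         # initial level AL0
    intermediate = [mrtds(d_i, k_intermediate, al0) for d_i in parts]  # Step 2
    al_merged = merge(intermediate)                                    # Step 3
    al_final = mrtds(dataset, k, al_merged)                            # Step 4
    return specialize(dataset, al_final)                               # Step 5
```

In a real deployment, the Step 2 list comprehension would instead submit p MapReduce jobs concurrently, which is where the job-level parallelization comes from.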

Algorithm 3: Data Specialization:

Input: Data record (IDr, r), r ∈ D; anonymization level AL*.

Output: Anonymous record (r*, count).

Map: Construct the anonymous record r* = (p1, p2, …, pm, sv), where pi, 1 ≤ i ≤ m, is the parent of a specialization in the current AL and is also an ancestor of vi in r; emit (r*, count).

Reduce: For each r*, sum ← Σ count; emit (r*, sum).
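The Map and Reduce steps of Algorithm 3 can be sketched as follows. The taxonomy representation (a child-to-parent dictionary per attribute) and the record layout (a dict with one key per quasi-identifier plus an "sv" sensitive value) are assumptions of this sketch, not the paper's data format.

```python
from collections import Counter

def generalize(value, taxonomy, level):
    """Walk up the taxonomy (a child -> parent map) until reaching a
    domain value retained in the anonymization level."""
    while value not in level:
        value = taxonomy[value]
    return value

def specialize_map(record, taxonomies, level):
    """Map: build the anonymous record r* = (p1, ..., pm, sv); emit (r*, 1)."""
    r_star = tuple(generalize(record[a], taxonomies[a], level)
                   for a in sorted(taxonomies))
    return r_star + (record["sv"],), 1

def specialize_reduce(pairs):
    """Reduce: sum the counts per anonymous record; each entry is a QI-group."""
    groups = Counter()
    for r_star, count in pairs:
        groups[r_star] += count
    return dict(groups)
```

Records that generalize to the same domain values collapse into one key, so the Reduce output directly enumerates the QI-groups and their sizes.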

Algorithm 2: Data Partition Map & Reduce:

Input: Data record (IDr, r), r ∈ D; parameter p.

Output: Di, 1 ≤ i ≤ p.

Map: Generate a random number rand, where 1 ≤ rand ≤ p; emit (rand, r).

Reduce: For each rand, emit (null, list(r)).


7. MRTDS Driver:

Usually, a single MapReduce job is inadequate to accomplish a complex task in many applications. Thus, a group of MapReduce jobs is orchestrated in a driver program to achieve such an objective. MRTDS consists of the MRTDS Driver and two types of jobs, i.e., IGPL Initialization and IGPL Update. The driver arranges the execution of the jobs. Algorithm 4 frames the MRTDS Driver, in which a data set is anonymized by TDS. It is the algorithmic design of the function MRTDS(D, k, AL) → AL′. Note that we leverage the anonymization level to manage the process of anonymization. Step 1 initializes the values of information gain and privacy loss for all specializations, which is done by the job IGPL Initialization. Step 2 is iterative. First, the best specialization is selected from the valid specializations in the current anonymization level, as described in Step 2.1. A specialization spec is valid if it satisfies two conditions: its parent value is not a leaf, and the anonymity Ac(spec) > k, i.e., the data set is still k-anonymous if spec is performed. Then, the current anonymization level is modified by performing the best specialization in Step 2.2, i.e., removing the old specialization and inserting the new ones derived from it. In Step 2.3, the information gain of the newly added specializations and the privacy loss of all specializations are recomputed, which is accomplished by the job IGPL Update. The iteration continues until all specializations become invalid, achieving the maximum data utility. MRTDS produces the same anonymous data as the centralized TDS in [12], because they follow the same steps. MRTDS mainly differs from centralized TDS in how IGPL values are calculated. Calculating IGPL values dominates the scalability of TDS approaches, as it requires TDS algorithms to count statistical information of the data sets iteratively. MRTDS exploits MapReduce on cloud to make the computation of IGPL parallel and scalable.
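The driver's iterative loop can be sketched as follows. The interface is entirely assumed for illustration: `igpl_of` and `anonymity_after` stand in for the statistics produced by the IGPL jobs, and `expand` performs a specialization and returns the new level plus the newly derived candidate specializations.

```python
def mrtds_driver(level, k, candidates, igpl_of, anonymity_after, expand):
    """Sketch of the MRTDS Driver loop: repeatedly pick the valid
    specialization with the highest IGPL and perform it, until no
    valid specialization remains."""
    while True:
        # A spec is valid if performing it keeps the data set k-anonymous
        # (Ac(spec) > k); the leaf-parent check from the text is assumed
        # to be folded into the candidate set.
        valid = [s for s in candidates if anonymity_after(s) > k]
        if not valid:
            return level
        best = max(valid, key=igpl_of)                  # Step 2.1
        level, new_specs = expand(level, best)          # Step 2.2
        candidates = (candidates - {best}) | new_specs  # Step 2.3 (IGPL Update)
```

In the real system, each iteration of this loop launches an IGPL Update MapReduce job to refresh the statistics behind `igpl_of` and `anonymity_after`; here they are plain callables.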

REFERENCE:

[1] S. Chaudhuri, "What Next?: A Half-Dozen Data Management Research Goals for Big Data and the Cloud," Proc. 31st Symp. Principles of Database Systems (PODS'12), pp. 1-4, 2012.

[2] B.C.M. Fung, K. Wang, R. Chen and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Comput. Surv., vol. 42, no. 4, pp. 1-53, 2010.

[3] B.C.M. Fung, K. Wang and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowl. Data Eng., vol. 19, no. 5, pp. 711-725, 2007.

[4] X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB'06), pp. 139-150, 2006.

Algorithm 4: MRTDS Driver:

Input: Data set D, anonymization level AL, and k-anonymity parameter k.

Output: Anonymization level AL′.

1. Initialize the values of the search metric IGPL, i.e., for each specialization spec ∈ ∪_{j=1..m} Cutj, the IGPL value of spec is computed by the job IGPL Initialization.

2. While ∃ a valid spec ∈ ∪_{j=1..m} Cutj:

2.1 Find the best specialization specBest from ALi.

2.2 Update ALi to ALi+1 by performing specBest.

2.3 Update the information gain of the new specializations in ALi+1, and the privacy loss of every specialization, by the job IGPL Update.

end while

AL′ ← ALi

[5] K. LeFevre, D.J. DeWitt and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. 2005 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD'05), pp. 49-60, 2005.

[6] K. LeFevre, D.J. DeWitt and R. Ramakrishnan, "Mondrian Multidimensional K-Anonymity," Proc. 22nd Int'l Conf. Data Engineering (ICDE'06), article 25, 2006.

[7] V. Borkar, M.J. Carey and C. Li, "Inside "Big Data Management": Ogres, Onions, or Parfaits?," Proc. 15th Int'l Conf. Extending Database Technology (EDBT'12), pp. 3-14, 2012.

[8] T. Iwuchukwu and J.F. Naughton, "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB'07), pp. 746-757, 2007.

[9] N. Mohammed, B. Fung, P.C.K. Hung and C.K. Lee, "Centralized and Distributed Anonymization for High-Dimensional Healthcare Data," ACM Trans. Knowl. Discov. Data, vol. 4, no. 4, article 18, 2010.

[10] B. Fung, K. Wang, L. Wang and P.C.K. Hung, "Privacy-Preserving Data Publishing for Cluster Analysis," Data Knowl. Eng., vol. 68, no. 6, pp. 552-575, 2009.

[11] Amazon Web Services, "Amazon Elastic MapReduce," http://aws.amazon.com/elasticmapreduce/, accessed on: Jan. 05, 2013.

[12] W. Jiang and C. Clifton, "A Secure Distributed Framework for Achieving k-Anonymity," VLDB J., vol. 15, no. 4, pp. 316-333, 2006.

[13] K. Zhang, X. Zhou, Y. Chen, X. Wang and Y. Ruan, "Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds," Proc. 18th ACM Conf. Computer and Communications Security (CCS'11), pp. 515-526, 2011.

BIOGRAPHY

K. R. VIGNESH is an M.Tech student in Computer Science and Engineering at SRM University, India. His main area of interest is cloud computing.

P. SARANYA is an Assistant Professor in the Department of Computer Science and Engineering, Kattankulathur Campus, SRM University, India. Her main areas of interest are Data Mining and Web Mining.