a review on anonymization approach to preserve privacy of ... · a review on anonymization approach...

A Review on anonymization approach to preserve privacy of

Published data through record elimination

Isha K. Gayki

P.R.Patil COE &T

Computer science & Engg.Dept, Amravati

[email protected]

Prof.Arvind S.Kapse

P.R.Patil COE &T

Computer science & Engg.Dept, Amravati

[email protected]

Abstract

Data mining is the process of analyzing data. Data

Privacy is collection of data and dissemination of

data. Privacy issues arise in different area such as

health care, intellectual property, biological data,

financial transaction etc. It is very difficult to protect

the data when there is transfer of data. Sensitive

information must be protected. There are two kinds of

major attacks against privacy namely record linkage

and attribute linkage attacks. Research have

proposed some methods namely k-anonymity, ℓ-

diversity, t-closeness for data privacy. K-anonymity

method preserves the privacy against record linkage

attack alone. It is unable to prevent address attribute

linkage attack. ℓ-diversity method overcomes the

drawback of k-anonymity method. But it fails to

prevent identity disclosure attack and attribute

disclosure attack. t-closeness method preserves the

privacy against attribute linkage attack but not

identity disclosure attack. A proposed method used to

preserve the privacy of individuals sensitive data

from record and attribute linkage attacks. In the

proposed method, privacy preservation is achieved

through generalization by setting range values and

through record elimination. A proposed method

overcomes the drawback of both record linkage

attack and attribute linkage attack.

1. Introduction 1.1. Data Mining

Data mining (sometimes called data or knowledge

discovery) is the process of analyzing data from

different perspectives and summarizing it into useful

information - information that can be used to increase

revenue, cuts costs, or both. Data mining software is

one of a number of analytical tools for analyzing

data. It allows users to analyze data from many

different dimensions or angles, categorize it, and

summarize the relationships identified. Technically,

data mining is the process of finding correlations or

patterns among dozens of fields in large relational

databases. Data mining is an interdisciplinary subfield of computer science, is the computational

process of discovering patterns in large data

sets involving methods at the intersection of artificial

intelligence, machine learning, statistics,

and database systems. The overall goal of the data

mining process is to extract information from a data

set and transform it into an understandable structure

for further use. The actual data mining task is the

automatic or semi-automatic analysis of large

quantities of data to extract previously unknown

interesting patterns such as groups of data records,

unusual records and dependencies. Data mining uses

information from past data to analyze the outcome of

a particular problem or situation that may arise. Data

mining works to analyze data stored in data

warehouses that are used to store that data that is

being analyzed. Data mining interprets its data into real time analysis that can be used to increase sales,

promote new product, or delete product that is not

value-added to the company. Government and private

sectors are publishing micro data to facilitate pure

research. Individuals’ privacy should be safeguarded.

1.2. Published Data

Published scientific data sets provide a great

opportunity for instructors who want to get their

students working with data. Using available data can

make it quicker and easier to get rolling. Allowing

Isha K Gayki et al , Int.J.Computer Technology & Applications,Vol 4 (6),986-989

IJCTA | Nov-Dec 2013 Available [email protected]

986

ISSN:2229-6093

students to explore real questions with the best

scientific data available adds excitement to course

and enhances student motivation for learning. Other

reasons to use published data resources from

scientific projects include:

Ease of Use.

Data Quality:- Quality control of the data

Spatial Coverage.

Temporal Coverage.

Focus on Process.

Visualization.

1.3. Data Privacy

Most exciting developments in privacy research

in recent years has been the emergence of a

workable, formal definition of privacy-preserving

data access, along with algorithms that can provide a

mathematical proof of privacy preservation in certain

cases. Information privacy, or data privacy, is the

relationship between collection and dissemination

of data, technology, the public expectation of

privacy, and the legal and political issues surrounding

them Privacy concerns exist wherever personally

identifiable information is collected and stored – in

digital form or otherwise. Improper or non-existent

disclosure control can be the root cause for privacy

issues. Data privacy issues can arise in response to

information from a wide range of sources, such as:

Healthcare records

Criminal justice investigations and proceedings

Financial institutions and transactions

Biological traits, such as genetic material

Residence and geographic records

Ethnicity

Privacy breach

Location-based service and geolocation

The challenge in data privacy is to share data while

protecting personally identifiable information. The

fields of data security and information security design

and utilize software, hardware and human resources

to address this issue.

Data mining involves six common classes of tasks:

1)Anomaly detection (Outlier/change/deviation

detection) – The identification of unusual data

records.2)Association rule learning (Dependency

modeling) – Searches for relationships between variables..3)Clustering – is the task of discovering

groups and structures in the data that are in some way

or another "similar", without using known structures

in the data.4)Classification – is the task of

generalizing known structure to apply to new data.

For example, an e-mail program might attempt to

classify an e-mail as "legitimate" or as

"spam".5)Regression – Attempts to find a function

which models the data with the least error. 6)Summarization – providing a more compact

representation of the data set, including visualization

and report generation.7)Sequential pattern mining –

Sequential pattern mining finds sets of data items that

occur together frequently in some sequences.

Sequential pattern mining, which extracts frequent

subsequences from a sequence database e.g. web user

analysis, stock trend prediction, DNA sequence

analysis, finding language or linguistic patterns from

natural language texts, and using the history of

symptoms to predict certain kind of disease.

2. Literature review

Existing techniques find solution for privacy

problem to some extent. k-anonymity[7] can prevent

the identity disclosure attack but not attribute

disclosure attack. Another method, ℓ-diversity[9]

method preserves the privacy against attribute

disclosure attack. But, not identity disclosure attack.

t-closeness method[9] is good at attribute disclosure

attack. It is computationally complex. but, it fail to

protect the privacy against attribute disclosure attack

P sensitive k-anonymity model[7], the modified

micro data table T* satisfies (p+, α)-sensitive k-

anonymity property if it satisfies k-anonymity, and

each Q-I group has at least p distinct categories of the sensitive attribute and its total weight is at least α.

This method significantly reduces the possibility of

Similarity Attack and incurs less distortion ratio

compared to p-sensitive k-anonymity method.

Tamir Tassa [2] proposed an alternative model of

k-type anonymity. It is reduce the information loss

than k-anonymity and obtained anonymized table by

less generalization. It preserves the privacy against

identity disclosure alone. Qian Wang[4] proposed the

model k-anonymity in protection of attribute

disclosure. It can prevent attribute disclosure by

controlling average leakage probability and

probability difference of sensitive attribute value

Mahesh, Meyyappan[11] proposed a new method to anonymize the dataset by setting range values in

Quasi identifiers. If the Quasi identifier consists of

same attribute values in any class.

In t-closeness method[9], an equivalence class is

said to have t-closeness if the distance between the



987

ISSN:2229-6093

http://en.wikipedia.org/wiki/Data

http://en.wikipedia.org/wiki/Technology

http://en.wikipedia.org/wiki/Expectation_of_privacy



http://en.wikipedia.org/wiki/Legal

http://en.wikipedia.org/wiki/Political

http://en.wikipedia.org/wiki/Privacy

http://en.wikipedia.org/wiki/Personally_identifiable_information



http://en.wikipedia.org/wiki/Genetic_material

http://en.wikipedia.org/wiki/Privacy_breach

http://en.wikipedia.org/wiki/Geolocation

http://en.wikipedia.org/wiki/Data_security

http://en.wikipedia.org/wiki/Information_security

distribution of a sensitive attribute in this class and

the distribution of the attribute in the whole table is

no more than a threshold t. A table is said to have t-

closeness if all equivalence classes have tcloseness. It

preserves the privacy against homogeneity and

background knowledge attacks.

In (α,k)-Anonymity [6] model, a view of the table

is said to be an (α, k)-anonymization, if the

modification of the table satisfies both k-anonymity

and α-deassociation properties with respect to the

quasi-identifier. It does not address the identity

disclosure attack. The proposed method provides new

anonymization technique comprising record

elimination and generalization the proposed method

reduces the information loss compared to existing

methods.

Versatile publishing method [8] preserves the

privacy by splitting the anonymized table T* by

framing the privacy rules. Privacy can be breached by

applying the conditional probability in published

table.

Mahesh, Meyyappan[1] proposed a method to

anonymize the dataset by setting range values in

Quasi

identifiers. If the Quasi identifier consists of same

attribute values in any class, but fails to preserve the privacy.

3. Methods for privacy preservation

3.1. k-anonymity when attributes are suppressed or generalized

until each row is identical with at least k-1 other rows

then that method is called as a k-anonymity it

prevents definite database linkages and also

guarantees that the data released is accurate. but it

has some limitation it does not hide individual

identity, Unable to protect against attacks based on background knowledge

k-anonymity cannot be applied to high-

dimensional data without complete loss of utility also

some special methods are required for

anonymization and published data more than once.

3.2. l- diversity Method overcomes the drawbacks of k-anonymity

but fails to preserve the privacy against skewness

and similarity attacks.

3.3. t-closeness Method is called as t-closeness when the distance

between the distribution of a sensitive attribute in

same class and the distribution of the attribute in the

whole table is no more than a threshold t. A table is

said to have t-closeness if all equivalence classes

have t-closeness. It preserves the privacy against

homogeneity and background knowledge attacks.

4. Proposed work Database of any organization,company’s,medical

is a confidential database such database must be

preserved so that no confidential information can be

loss. the proposed method provides new

anonymization technique comprising record

elimination and generalization. An application will be

develop which will have one graphical user

presentation and one database. This database will

have patient database. Data is pass to the algorithm

and generalized database will be passed to research

Proposed method reduces the information loss

compared to existing system.

5. Conclusion Information about age or any confidential data

published in web pages are growing enormously

every year. When we are utilizing the data for

research purpose, privacy of the individuals whose

data is going to be published should not be

challenged. The proposed method attempts at static

micro data only which contain numeric quasi

identifiers. Future we are going to study different

data publishing scenarios such as multiple view

publishing, Anonymizing sequential release with new

attributes and incrementally update data records as

well as non-numeric quasi identifiers.

REFERENCES

[1] Mahesh R, Meyyappan T, “Anonymization technique

through record elimination to Preserve Privacy of

Published data”, International workshop on pattern

recognization, Informatics and mobile engineering, proceedings,,978-1-4673-5845-3,2013

[2]Tamir Tassa,Arnon Mazza and Aristides Gionis,“k-

Concealment: An Alternative Model of k-Type

Anonymity”, TRANSACTIONS ON DATA PRIVACY 5, 2012, pp189–222

[3] Xin Jin,Mingyang Zhang,Nan Zhang and Gautam Das,

“Versatile Publishing For Privacy Preservation”,

2010,KDD10,ACM [4] Qiang Wang,Zhiwei Xu and Shengzhi Qu,“An

Enhanced K-Anonymity Model against Homogeneity

Attack”, Journal of software,2011, Vol. 6, No.10, October

2011;1945-1952 [5] Benjamin C.M.Fung,KE Wang,Ada Wai-Chee Fu and Philip S. Yu, Introduction to Privacy-

Preserving Data Publishing Concepts and techniques,

ISBN:978-1-4200-9148-9,2010

[6] Raymond Wong, Jiuyong Li,Ada Fu and Ke wang, “(α,k)-anonymous data publishing”, Journal Intelligent

Information System, 2009,pp209- 234.

[7] Xiaoxun Sun, Hua Wang, Jiuyong Li and Traian

Marius Truta, “Enhanced P-Sensitive K-Anonymity Models for privacy Preserving Data Publishing”,

Transactions On Data Privacy, 2008,pp53-66



988

ISSN:2229-6093

[8] B.C.M. Fung, Ke Wang and P.S.Yu, “Anonymizing

classification data for privacy preservation”, IEEE Transactions on Knowledge and Data Engineering(TKDE),

2007,pp711-725

[9] Ninghui Li, Tiancheng Li, Suresh

Vengakatasubramaniam,“t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity”, International Conference on

Data Engineering, 2007, pp106-115

[10] X. Xiao and Y. Tao,“Personalized privacy

preservation”, In Proceedings of ACM Conference on Management of Data (SIGMOD’06”),2006,pp229-240

[11] Mahesh R, Meyyappan T, “A New Method for

Preserving Privacy in Data Publishing”,International

workshop on cryptography and Information Security, CS&IT proceedings,2012,pp 261-266

[12]Neha V. Mogre, Girish Agarwal, Pragati Patil:“A

Review On Data Anonymization Technique For Data

Publishing” Proc. International Journal of Engineering Research &

Technology (IJERT) Vol. 1 Issue 10, December- 2012

ISSN: 2278-0181

[13]Y. Xu, K. Wang, A.W.-C. Fu, and P.S. Yu, “Anonymizing Transaction Databases for Publication,”

Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery

and Data Mining (KDD), pp. 767-775, 2008.

[14]N. Li, T. Li, and S. Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and „-Diversity,” Proc.

IEEE 23rd Int’l Conf. Data Eng. (ICDE), pp. 106-115,

2007.

[15]T. M. Truta and V. Bindu,“Privacy Protection: “p-

sensitive k-anonymity property”, International Workshop

of Privacy Data Management (PDM2006), In Conjunction

with 22th International Conference of Data Engineering (ICDE),2006, pp94

[16] X. Xiao and Y. Tao,“Personalized privacy

preservation”, In Proceedingsof ACM Conference on

Management of Data (SIGMOD’06”),2006,pp229-240 [17]L. Sweeney,“An Achieving k-Anonymity Privacy

Protection Using Generalization and Suppression”,

International Journal of Uncertainty, Fuzziness and

Knowledge-Based System,2002,pp571-588



989

ISSN:2229-6093

a review on anonymization approach to preserve privacy of ... · a review on anonymization approach...

Documents