
Research Article
Journal of Soft Computing and Applications, Volume 2017, Issue 1 (2017), Article ID jsca-00073, Pages 44-52
doi:10.5899/2017/jsca-00073
Available online at www.ispacs.com/jsca

Using a blocking approach to preserve privacy in classification rules by inserting dummy transactions

Doryaneh Hossien Afshari (1), Farsad Zamani Boroujeni (1)*

(1) Department of Computer Engineering, Khorasgan (Isfahan) Branch, Islamic Azad University, Isfahan, Iran
* Corresponding Author. Email address: [email protected]; Tel: +98-313-5354001

Copyright 2017 © Doryaneh Hossien Afshari and Farsad Zamani Boroujeni. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
The increasing rate of data sharing among organizations raises the risk of leaking sensitive knowledge, which in turn raises the importance of privacy preservation in the data-sharing process. This study focuses on privacy preservation in classification rule mining, a common data mining technique. We propose a blocking algorithm for hiding sensitive classification rules. In this solution, rules are hidden by editing the set of transactions that satisfy the sensitive classification rules, and the approach additionally tries to deceive and block adversaries by inserting dummy transactions. Finally, the solution is evaluated and compared with other available solutions. The results show that limiting the number of attributes blocked in each sensitive rule decreases both the number of lost rules and the production rate of ghost rules.

Keywords: Blocking Approach, Classification Rule Hiding, Dummy Transactions, Privacy Preserving.

1 Introduction

Different data mining techniques and algorithms used to mine useful and hidden knowledge from databases have their own upsides and downsides.


One of the disadvantages of using these techniques is the disclosure of sensitive information; this issue jeopardizes the privacy and security of data and can bring disastrous consequences upon their owners. To address this risk, Privacy Preserving Data Mining (PPDM) was proposed. Applying privacy-preservation and data-sanitization techniques while still extracting new, useful patterns minimizes the possibility of gaining access to classified information. Privacy preservation has been studied across different data mining techniques; the present paper concerns privacy preservation in the mining of classification rules. In the proposed solution, classification rules are first mined from the main database and a number of them are selected as sensitive rules; the hiding process is then run using a blocking approach, editing a limited number of attribute values and inserting some dummy transactions into the database. This study aims to hide all sensitive classification rules while leaving the minimum possible side effects on the insensitive rules.

The fundamental concepts of privacy preservation and related works are presented in Sections 2, 3 and 4. Section 5 describes the proposed algorithm, and Sections 6 and 7 present the discussion and the conclusion.

2 Theoretical background

Privacy-preservation techniques prevent sensitive information from being mined in a data mining process by making changes to the main database. The main purpose of preserving privacy in data mining is to sanitize sensitive knowledge with the minimum possible changes to the main database, in such a way that the sensitive knowledge can no longer be mined once the data mining process is run. Generally, data sanitization (or data censoring) is carried out in one of the two following ways:

Data Reconstruction: in this method, the main database does not directly undergo changes [1].

Data Modification: here, changes are made directly in the main database. This method includes two different techniques:

Definition 2.1. Data Distortion, which is carried out by changing a data value from 1 to 0 or vice versa [2],[3].

Definition 2.2. Data Blocking, which is carried out by replacing a data value with the unknown value "?" [2],[3]. A minimal sketch of both techniques follows.
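As a concrete illustration, the following minimal Python sketch (the transaction and attribute names are hypothetical, chosen only for this example) applies each of the two techniques to a single binary transaction:

    # Hypothetical binary transaction: attribute -> value.
    transaction = {"A": 1, "B": 0, "C": 1}

    def distort(t, attr):
        """Data distortion (Definition 2.1): flip the value 1 -> 0 or 0 -> 1."""
        t = dict(t)  # copy, so the main database entry is left intact
        t[attr] = 1 - t[attr]
        return t

    def block(t, attr):
        """Data blocking (Definition 2.2): replace the value with the unknown '?'."""
        t = dict(t)
        t[attr] = "?"
        return t

    print(distort(transaction, "A"))  # {'A': 0, 'B': 0, 'C': 1}
    print(block(transaction, "A"))    # {'A': '?', 'B': 0, 'C': 1}

Blocking leaves an unknown value rather than a plausible but wrong one, which is why it is preferred for sensitive databases such as medical ones.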

One downside of the data-distortion technique is the placement of wrong values in the database. In some databases, such as medical databases, data distortion cannot be used, because excluding some elements can be highly dangerous and placing wrong values can lead to terrible consequences [4]. Data sanitization, however, has its own problems: among its side effects, hiding failure, lost rules, and ghost rules are the most common ones (see [5],[6] for a detailed description). Various attempts at sanitizing sensitive knowledge have been made so far, and different algorithms have been introduced for hiding sensitive association rules or sensitive itemsets under various approaches, such as heuristic, border-based, exact, or hybrid approaches [7-9].

This paper concerns classification rule hiding. A number of previously conducted studies in this field are presented next.

3 Related works

Methods introduced for classification rule hiding fall into three general categories: statistical-based methods, reconstruction-based methods, and perturbation-based methods.


In [10], a reconstruction-based algorithm for hiding sensitive classification rules in categorical databases is presented. In this method, to generate a sanitized database, a decision tree containing only insensitive rules is built and used to reconstruct the database. Reference [11] presents another reconstruction-based algorithm (named LSA) that uses insensitive rules mined from the main database to reconstruct the sanitized database. LSA edits the transactions supporting both sensitive and insensitive rules so that they only support the insensitive ones, which decreases the amount of side effects in the sanitized database.

Reference [12] includes another algorithm for hiding classification rules; it is based on a data-reduction approach and performs the hiding process by completely deleting the selected tuples.

The solution proposed in [13] is based on data perturbation: the ROD algorithm is introduced to hide sensitive classification rules in categorical databases. In ROD, after the sensitive and insensitive rules are separated, the tuples supporting the insensitive rules enter the sanitized database directly. Then, by replacing some attribute values, every tuple related to a sensitive rule is changed so that it looks as if it supports an insensitive rule.

One of the obstacles of perturbation-based approaches is the substitution of wrong values for correct values in the main database (changing 1 to 0 and vice versa). As mentioned earlier, such substitution can have severe consequences under certain conditions, such as in medical databases [14]. Hence, a blocking-based approach was introduced in [15]: all values of the attributes appearing in sensitive rules are substituted with the unknown value "?" in all transactions supporting those rules.

This method inspired our proposed solution for hiding classification rules. Our solution performs the hiding process by substituting a limited number of sensitive attribute values in sensitive transactions, and it additionally tries to deceive and block adversaries by inserting dummy transactions containing unknown values.

4 Problem definition

Suppose that $R_c$ is a classification rule mined from the main database D:

$(A_{C_1} = V_{A_{C_1}}) \land \dots \land (A_{C_n} = V_{A_{C_n}}) \rightarrow (C = V_C)$ (4.1)

The left-hand side of the rule consists of attributes and their values, where $A_{C_1}, A_{C_2}, \dots, A_{C_n} \neq C$, and the right-hand side consists of the class attribute and its value. Every classification rule has a degree of support, which is the number of transactions supporting the rule. A transaction in a database supports a classification rule if all attributes of the rule appear in the transaction with exactly the values given by the rule.

Accordingly, the classification rule hiding problem is stated as follows: if R is the set of classification rules mined from the main database D and $R_s \subseteq R$ is the set of sensitive rules, the goal is to generate a sanitized database from which the sensitive rules cannot be mined while the insensitive rules $R - R_s$ remain minable as far as possible.
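To make this support notion concrete, here is a minimal Python sketch (a hypothetical representation, not the paper's implementation) in which a rule is a pair of an antecedent and a class assignment:

    def supports(transaction, rule):
        """A transaction supports a rule if every attribute of the rule,
        including the class attribute, appears in it with the rule's value."""
        antecedent, (class_attr, class_val) = rule
        return (all(transaction.get(a) == v for a, v in antecedent.items())
                and transaction.get(class_attr) == class_val)

    def degree_of_support(database, rule):
        """The rule's degree of support: how many transactions support it."""
        return sum(supports(t, rule) for t in database)

With the example database of Section 5.1, degree_of_support would return 2 for rule No. 2 (transactions 6 and 14).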

5 The proposed algorithm (BCR)

The proposed solution hides sensitive classification rules by substituting an unknown value for a certain number of attribute values in the transactions supporting those rules, so that the sensitive rules can no longer be mined from the final database. The method aims both at hiding the sensitive rules and at editing the database in such a way that adversaries cannot detect the values that existed before editing; to this end, transactions with unknown values are inserted for a number of sensitive rules.

In the BCR method, after the classification rules are mined from the database and the sensitive rules are selected, the sensitive rules are sorted in descending order by their number of attributes. The database is then scanned for every sensitive rule, and the transactions supporting the rule are selected for editing. To hide a sensitive rule, not all of its attributes but only a certain number of them are blocked; Formula (5.2) determines the number of attributes to be blocked:


$X = \lceil N_A / 2 \rceil$ (5.2)

where $N_A$ is the number of attributes of the sensitive rule, the class attribute included.

In the blocking process, the class attribute is always blocked, and the remaining $X - 1$ victims are chosen randomly from the attributes on the left-hand side. The blocking process for each sensitive rule is driven by the rule's degree of support: main values in the transactions supporting the sensitive rule are substituted with unknown values until the rule is finally hidden. A minimal sketch of this selection step follows.
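The helper below (our name, reusing the hypothetical rule representation assumed earlier) blocks one supporting transaction:

    import math
    import random

    def block_transaction(transaction, rule):
        """Block X = ceil(NA / 2) attribute values of a supporting transaction:
        the class attribute always, plus X - 1 random left-hand-side attributes."""
        antecedent, (class_attr, _) = rule
        na = len(antecedent) + 1          # NA counts the class attribute as well
        x = math.ceil(na / 2)             # Formula (5.2)
        victims = [class_attr] + random.sample(sorted(antecedent), x - 1)
        blocked = dict(transaction)
        for attr in victims:
            blocked[attr] = "?"
        return blocked

For rule No. 2 of the worked example in Section 5.1 (three attributes), this gives X = 2: the class attribute plus one of the two left-hand-side attributes.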

During the blocking process, dummy transactions are inserted for a number of the sensitive rules. Following the sorted order, $\lceil n/2 \rceil$ dummy transactions are inserted for the first sensitive rule (the one with the highest number of attributes), $\lceil n/2 \rceil - 1$ for the second sensitive rule, $\lceil n/2 \rceil - 2$ for the third, and so on. The first transaction supporting a sensitive rule is the first to receive a dummy transaction, and dummy transactions are inserted only up to the allotted number: one dummy transaction is inserted for every real transaction in which some values have been changed to unknown, as long as the budget lasts. (If a sensitive rule has a degree of support of 5, hence 5 supporting transactions, but only three dummy transactions may be inserted for it, insertion begins from the first main transaction and the last two transactions receive no dummy transactions.) Formula (5.3) gives the total number of dummy transactions that can be inserted:

$\begin{cases} \sum_{i=0}^{n-1} (n - i), & n = 1, 2 \\ \sum_{i=0}^{\lceil n/2 \rceil - 1} \left( \lceil n/2 \rceil - i \right), & n > 2 \end{cases}$ (5.3)

where n is the number of sensitive rules. The insertion process begins from the first transaction supporting the first sensitive rule: a dummy transaction is inserted that is similar to the main transaction but contains unknown values, and the unknown values are displaced with respect to the attributes of the sensitive rule, which prevents the repetition of transactions. Since inserting dummy transactions and displacing unknown values is difficult for sensitive rules with few attributes, the sensitive rules are sorted in descending order by their number of attributes and the dummy transactions are inserted in the same order. Pseudocode for the proposed solution is shown in Figure 1, and a simplified sketch of the overall procedure follows; Section 5.1 then walks through an example.
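The sketch below ties the pieces together under the same assumptions as the earlier snippets (it reuses supports and block_transaction; the names dummy_budget and bcr are ours, this is a simplified reading of BCR rather than the authors' code, and it approximates the displacement of unknown values in dummies by an independent random blocking):

    import math

    def dummy_budget(i, n):
        """Dummy transactions allotted to the i-th (0-based) sorted sensitive
        rule, following the per-rule terms of Formula (5.3)."""
        base = n if n <= 2 else math.ceil(n / 2)
        return max(base - i, 0)

    def bcr(database, sensitive_rules):
        # Sort sensitive rules by attribute count (class included), descending.
        rules = sorted(sensitive_rules, key=lambda r: len(r[0]) + 1, reverse=True)
        n = len(rules)
        sanitized = list(database)
        for i, rule in enumerate(rules):
            budget = dummy_budget(i, n)
            # Snapshot so dummies appended below are not re-scanned here.
            for j, t in enumerate(list(sanitized)):
                if supports(t, rule):
                    # Block every supporting transaction, so the rule is hidden.
                    sanitized[j] = block_transaction(t, rule)
                    if budget > 0:
                        # One dummy per edited real transaction while the budget
                        # lasts; blocking the original again with a fresh random
                        # choice displaces the unknown values.
                        sanitized.append(block_transaction(t, rule))
                        budget -= 1
        return sanitized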

5.1. An example of the proposed algorithm's performance

Table 1 shows the initial database. Every transaction in this table has five attributes, with Approval result as the class attribute. The classification rules were mined with the RIPPER algorithm available in WEKA 3.7 (weka.classifiers.rules.JRip). All parameters of the JRip algorithm were left in default mode except the last one, Use Pruning, which was changed from True to False. Table 2 shows the mined rules.

Rule No. 2 is considered the sensitive rule. Since only a single rule is selected, rule sorting is not performed. The transactions supporting the sensitive rule are selected by scanning the initial database; in this example, transactions No. 6 and 14 support it. The blocking process is then performed according to Formula (5.2) with X = 2: of the rule's three attributes, two are blocked, one of which is the class attribute. According to Formula (5.3), one dummy transaction is inserted for the sensitive rule; this dummy transaction, in which the unknown values are displaced, is similar to transaction No. 6. The sensitive rule is now hidden: after running the JRip algorithm again with the same parameters, the only rule mined is rule No. 1. Table 3 shows the sanitized database.


Table 1: The initial database.

No. | Years at current work | Marriage status | Gender | Black list | Approval result
1   | Short  | Married | Female | No  | No
2   | Short  | Married | Female | Yes | No
3   | Long   | Married | Female | No  | Yes
4   | Medium | Divorce | Female | No  | Yes
5   | Medium | Single  | Male   | No  | Yes
6   | Medium | Single  | Male   | Yes | No
7   | Long   | Single  | Male   | Yes | Yes
8   | Short  | Divorce | Female | No  | No
9   | Short  | Single  | Male   | No  | Yes
10  | Medium | Divorce | Male   | No  | Yes
11  | Short  | Divorce | Male   | Yes | Yes
12  | Long   | Divorce | Female | Yes | Yes
13  | Long   | Married | Male   | No  | Yes
14  | Medium | Divorce | Female | Yes | No

Table 2: Classification rules mined from the initial database.

No. | Classification rule
1   | (gender = female) and (years at current work = short) => approval result = No (3.0/0.0)
2   | (years at current work = medium) and (black list = yes) => approval result = No (2.0/0.0)

Table 3: The sanitized database.

No. | Years at current work | Marriage status | Gender | Black list | Approval result
1   | Short  | Married | Female | No  | No
2   | Short  | Married | Female | Yes | No
3   | Long   | Married | Female | No  | Yes
4   | Medium | Divorce | Female | No  | Yes
5   | Medium | Single  | Male   | No  | Yes
6   | Medium | Single  | Male   | ?   | ?
D*  | ?      | Single  | Male   | Yes | ?
7   | Long   | Single  | Male   | Yes | Yes
8   | Short  | Divorce | Female | No  | No
9   | Short  | Single  | Male   | No  | Yes
10  | Medium | Divorce | Male   | No  | Yes
11  | Short  | Divorce | Male   | Yes | Yes
12  | Long   | Divorce | Female | Yes | Yes
13  | Long   | Married | Male   | No  | Yes
14  | ?      | Divorce | Female | Yes | No

(D* denotes the inserted dummy transaction.)

6 Comparative evaluation of the proposed algorithm

To evaluate its performance, both the proposed BCR algorithm and the ROD algorithm [13] were run on three real databases from the UCI repository. Table 4 shows their specifications.


Table 4: Specifications of the databases.

Dataset       | Instances | Attributes | Rules
Breast-Cancer | 286       | 10         | 14
Mushroom      | 8124      | 23         | 9
Car           | 1728      | 7          | 49

Classification rules were mined using the RIPPER algorithm, and two or three sensitive rules were randomly selected. Three factors were considered to evaluate the algorithms: hiding failure, lost rules, and ghost rules.

Hiding failure counts the sensitive rules that can still be mined from the database after the sanitization process.

Lost rules counts the insensitive rules that can no longer be mined from the database after the sanitization process.

Ghost rules counts the rules that do not exist in the initial database but can be mined from the sanitized database. The sketch below computes all three factors.
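All three factors can be computed as set differences between the rules mined before and after sanitization; here is a minimal sketch (rules represented as any hashable values, e.g. the rule strings of Table 2; the function name is ours):

    def side_effects(original_rules, sanitized_rules, sensitive_rules):
        """Return (hiding failure, lost rules, ghost rules) counts."""
        original, sanitized = set(original_rules), set(sanitized_rules)
        sensitive = set(sensitive_rules)
        insensitive = original - sensitive
        hiding_failure = len(sensitive & sanitized)  # sensitive rules still minable
        lost = len(insensitive - sanitized)          # insensitive rules lost
        ghost = len(sanitized - original)            # new rules not in the original
        return hiding_failure, lost, ghost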

Results of running the algorithms are shown in Figures 1, 2 and 3.

Figure 1: Comparative evaluation on the Mushroom dataset. The chart compares the number of side effects (hiding failure, lost rules, ghost rules) of BCR and ROD for 2 and 3 sensitive rules.

Figure 2: Comparative evaluation on the Breast Cancer dataset. The chart compares the number of side effects (hiding failure, lost rules, ghost rules) of BCR and ROD for 2 and 3 sensitive rules.


Figure 3: Comparative evaluation on the Car dataset. The chart compares the number of side effects (hiding failure, lost rules, ghost rules) of BCR and ROD for 2 and 3 sensitive rules.

As the figures above show, the proposed algorithm hid the sensitive rules with the minimum possible side effects. Selecting a limited number of attributes for the hiding process causes fewer changes in the database and consequently decreases the number of lost rules and the generation rate of ghost rules. Sorting the rules by their number of attributes and inserting dummy transactions in a hierarchical manner, while leaving some sensitive rules free of dummy transactions, makes the blocking process harder to detect and reduces the generation rate of ghost rules.

The results also show that in the ROD algorithm the number of side effects increases with the number of sensitive rules, whereas this is not the case in the proposed algorithm. Although, as Figure 3 illustrates, inserting transactions has produced some ghost rules, the proposed algorithm still shows better results than ROD; this is due to the limits defined for the insertion of dummy transactions. Because the insertion of dummy transactions is subtractive and hierarchical, the production of ghost rules is kept under control. The ROD algorithm, by changing all attributes including those supporting the non-sensitive rules, produces more ghost rules than the proposed method; this problem gets even worse when the sensitive and non-sensitive rules have equal numbers of attributes. The more attributes the sensitive rules have, the better the proposed algorithm performs compared with ROD, and the more efficient it is in maintaining data quality. In general, changing only a limited number of attributes is enough to hide the sensitive rules; further changes only produce more side effects, which is why the proposed algorithm yields better results than ROD in all the cases above. Another factor contributing to the better results of the proposed method is the difference in rule sorting: since ROD sorts rules by their number of supports, data sanitization may face problems if there is no non-sensitive rule with a lower support.

6.1. Limitations of the proposed algorithm

One limitation of the proposed algorithm is its time complexity on large databases: because transactions are inserted, the larger the database, the more costly the sanitization becomes. Furthermore, since the victim attributes are chosen randomly, if the same attribute is present in many different non-sensitive rules, the number of side effects such as ghost rules or lost rules increases.

7 Conclusion and future works

The present study introduced a blocking-based algorithm for hiding sensitive classification rules. The proposed solution performs the hiding process as follows:


the sensitive rules are sorted in descending order by their number of attributes, and hiding starts from the sensitive rule with the highest number of attributes. To hide a sensitive rule, the class attribute and the values of a limited number of randomly chosen attributes on its left-hand side are substituted with the "?" value.

The proposed method inserts dummy transactions containing unknown values for a certain number of sensitive rules in order to disturb the database. To control the outbreak of side effects resulting from these insertions, the transactions are inserted hierarchically and in decreasing numbers, and only for a limited number of sensitive rules and transactions. The results show that the proposed method is free of hiding failure and is more efficient, in terms of lost rules and ghost rules, than the algorithm introduced in [15]. Since the proposed algorithm performs the sanitization with fewer adverse effects, it is more successful in maintaining data quality.

The proposed solution is capable of further development and can be used with other classification methods such as nearest-neighbor classifiers and decision trees. In addition, the manner of choosing the victim attribute could be changed so that it is based on the non-sensitive rules, and this could be used in conjunction with transaction insertion, so that dummy transactions are inserted according to the non-sensitive rules.

References

[1] Y. Guo, Reconstruction-based association rule hiding, SIGMOD 2007 Ph.D. Workshop on Innovative Database Research, (2007) 51-56.

[2] H. Wang, Association Rule: From Mining to Hiding, Applied Mechanics and Materials, 321 (2013) 2570-2573. https://doi.org/10.4028/www.scientific.net/AMM.321-324.2570

[3] V. S. Verykios, Association rule hiding methods, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3 (2013) 28-36. https://doi.org/10.1002/widm.1082

[4] V. Garg, A. Singh, D. Singh, A Survey of Association Rule Hiding Algorithms, Fourth International Conference on Communication Systems and Network Technologies (CSNT), (2014) 404-407. https://doi.org/10.1109/CSNT.2014.86

[5] H. Wang, Quality Measurements for Association Rules Hiding, AASRI Procedia, 5 (2013) 228-234. https://doi.org/10.1016/j.aasri.2013.10.083

[6] A. Gkoulalas-Divanis, J. Haritsa, M. Kantarcioglu, Privacy Issues in Association Rule Mining, in: Frequent Pattern Mining, Springer, (2014) 369-401. https://doi.org/10.1007/978-3-319-07821-2_15

[7] G. V. Moustakides, V. S. Verykios, A MaxMin approach for hiding frequent itemsets, Data & Knowledge Engineering, 65 (2008) 75-89. https://doi.org/10.1016/j.datak.2007.06.012

[8] T.-P. Hong, C.-W. Lin, K.-T. Yang, S.-L. Wang, Using TF-IDF to hide sensitive itemsets, Applied Intelligence, 38 (2013) 502-510. https://doi.org/10.1007/s10489-012-0377-5


[9] X. Sun, P. S. Yu, A border-based approach for hiding sensitive frequent itemsets, Fifth IEEE International Conference on Data Mining, (2005) 8 pp. https://doi.org/10.1109/ICDM.2005.2

[10] J. Natwichai, X. Li, M. E. Orlowska, A reconstruction-based algorithm for classification rules hiding, 17th Australasian Database Conference, 49 (2006) 49-58.

[11] A. Katsarou, A. Gkoulalas-Divanis, V. S. Verykios, Reconstruction-based Classification Rule Hiding through Controlled Data Modification, in: Artificial Intelligence Applications and Innovations III, Springer, (2009) 449-458. https://doi.org/10.1007/978-1-4419-0221-4_53

[12] J. Natwichai, X. Sun, X. Li, Data reduction approach for sensitive associative classification rule hiding, Nineteenth Australasian Database Conference, 75 (2008) 23-30.

[13] A. Delis, V. S. Verykios, A. A. Tsitsonis, A data perturbation approach to sensitive classification rule hiding, ACM Symposium on Applied Computing, (2010) 605-609. https://doi.org/10.1145/1774088.1774216

[14] Y. Saygin, V. S. Verykios, C. Clifton, Using unknowns to prevent discovery of association rules, ACM SIGMOD Record, 30 (2001) 45-54. https://doi.org/10.1145/604264.604271

[15] A. Parmar, U. P. Rao, D. R. Patel, Blocking based approach for classification Rule hiding to Preserve the Privacy in Database, International Symposium on Computer Science and Society (ISCCS), (2011) 323-326. https://doi.org/10.1109/isccs.2011.103