* Corresponding Author. Email address: [email protected]; Tel: +98-313-5354001
Using a blocking approach to preserve privacy in classification rules
by inserting dummy transactions
Doryaneh Hossien Afshari1, Farsad Zamani Boroujeni1*
(1) Department of Computer Engineering, Khorasgan (Isfahan) Branch, Islamic Azad University, Isfahan, Iran
Copyright 2017 © Doryaneh Hossien Afshari and Farsad Zamani Boroujeni. This is an open access article distributed
under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Abstract
The increasing rate of data sharing among organizations raises the risk of leaking sensitive knowledge, which in turn makes privacy preservation an important part of the data-sharing process. This study focuses on privacy preservation in classification rule mining, a common data mining technique. We propose a blocking algorithm for hiding sensitive classification rules. In this solution, rules are hidden by editing a set of transactions that satisfy the sensitive classification rules, and the approach further tries to deceive and block adversaries by inserting dummy transactions. Finally, the solution is evaluated and compared with other available solutions. Results show that limiting the number of attributes edited in each sensitive rule decreases both the number of lost rules and the production rate of ghost rules.
Keywords: Blocking Approach, Classification Rule Hiding, Dummy Transactions, Privacy Preserving.
1 Introduction
Different data mining techniques and algorithms used to mine useful and hidden knowledge from
databases have their own upsides and downsides.
Production and hosting by ISPACS GmbH.
Journal of Soft Computing and Applications 2017 No.1 (2017) 44-52
Available online at www.ispacs.com/jsca
Volume 2017, Issue 1, Year 2017 Article ID jsca-00073, 9 Pages
doi:10.5899/2017/jsca-00073
Research Article
One of the disadvantages of using these techniques is the disclosure of sensitive information; this issue jeopardizes the privacy and security of data and can bring disastrous consequences upon its owners. Privacy Preserving Data Mining (PPDM) was proposed to address this risk. Applying privacy-preserving and data-sanitization techniques while extracting new useful patterns minimizes the possibility of gaining access to classified information. Privacy preservation has been studied across different data mining techniques. The present paper concerns privacy preservation in the mining of classification rules. In the proposed solution, after classification rules are mined from the main database and a number of them are selected as sensitive rules, the hiding process is run using a blocking approach: a limited number of attribute values are edited and some dummy transactions are inserted into the database. This study aims to hide all sensitive classification rules while leaving the minimum possible side effects on the insensitive rules.
The fundamental concepts of privacy preservation and the related works are presented in Sections 2, 3 and 4. Section 5 presents the proposed algorithm, and Sections 6 and 7 present the discussion and the conclusion.
2 Theoretical backgrounds
Privacy-preserving techniques prevent sensitive information from being mined after the data mining process by making changes in the main database. The main purpose of preserving privacy in data mining is to sanitize sensitive knowledge with the minimum possible changes to the main database, in such a manner that the sensitive knowledge cannot be mined after the data mining process is run. Generally, data sanitization (or data censoring) is carried out in one of the two following ways:
Data Reconstruction: in this method, the main database does not directly undergo changes [1].
Data Modification: here, changes are made directly in the main database. This method includes two different techniques:
Definition 2.1. Data Distortion, which is carried out by changing a data value from 1 to 0 or vice versa [2],[3].
Definition 2.2. Data Blocking, which is carried out by replacing a data value with the unknown value “?” [2],[3].
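To make the two definitions concrete, here is a minimal illustrative sketch (not from the paper; the transaction and item names are hypothetical):

```python
transaction = {"bread": 1, "milk": 1, "beer": 0}

def distort(t, item):
    """Data distortion (Definition 2.1): flip the stored binary value (1 -> 0 or 0 -> 1)."""
    t = dict(t)  # work on a copy; the main database row is untouched
    t[item] = 1 - t[item]
    return t

def block(t, item):
    """Data blocking (Definition 2.2): replace the stored value with the unknown value '?'."""
    t = dict(t)
    t[item] = "?"
    return t

print(distort(transaction, "milk"))  # {'bread': 1, 'milk': 0, 'beer': 0}
print(block(transaction, "milk"))    # {'bread': 1, 'milk': '?', 'beer': 0}
```

Distortion writes a plausible but wrong value, while blocking makes the uncertainty explicit, which is why blocking is preferred for databases such as medical ones.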
One of the downsides of the data distortion technique is the placement of wrong values in the database. In some databases, such as medical databases, data distortion cannot be used because excluding some elements can be highly dangerous, and placing wrong values can lead to terrible consequences [4]. Data sanitization, however, has its own problems. Among its side effects, hiding failure, lost rules, and ghost rules are the most common; a detailed description is given in [5],[6]. Up to now, various attempts at sensitive knowledge sanitization have been made, and different algorithms have been introduced for hiding sensitive association rules or sensitive itemsets under various approaches such as heuristic, border-based, exact, and hybrid approaches [7-9].
This paper concerns classification rule hiding. In what follows, a number of previously conducted studies in this field are presented.
3 Related works
The methods introduced for classification rule hiding fall into three general categories: statistical-based, reconstruction-based, and perturbation-based methods.
In [10], a reconstruction-based algorithm for hiding sensitive classification rules in categorical databases is presented. In this method, to generate a sanitized database, a decision tree including only the insensitive rules is built and used to reconstruct the database. Reference [11] presents another reconstruction-based algorithm (titled LSA) which uses the insensitive rules mined from the main database to reconstruct the sanitized database. LSA edits the transactions supporting both sensitive and insensitive rules so that they only support the insensitive ones, which decreases the amount of side effects in the sanitized database.
Reference [12] includes another algorithm for hiding classification rules. The method in this article is based on a data reduction approach and performs the hiding process by completely deleting the selected tuples.
The solution proposed in [13] is based on data perturbation. The ROD algorithm is introduced to hide sensitive classification rules in categorical databases. In ROD, tuples supporting the insensitive rules enter the sanitized database after the separation of sensitive and insensitive rules. Then, by replacing some attribute values in every tuple related to sensitive rules, these tuples are changed so that they look the same as tuples of insensitive rules.
One of the obstacles of data perturbation based approaches is the substitution of correct values in the main database with wrong values (changing 1 to 0 and vice versa). As mentioned earlier, such substitution under certain conditions, such as in medical databases, may have severe consequences [14]. Hence, a data blocking based approach was introduced in [15]. In this approach, the values of all attributes of the sensitive rules are substituted with the unknown value “?” in all transactions supporting the sensitive rules.
This method inspired our proposed solution for hiding classification rules. Our solution performs the hiding process by substituting only a limited number of sensitive attribute values in sensitive transactions. Also, it tries to deceive and block adversaries by inserting dummy transactions that include unknown values.
4 Problem definition
Suppose that Rc is a classification rule mined from the main database D:
(A_C1 = V_AC1) ∧ … ∧ (A_Cn = V_ACn) → (C = V_C) (4.1)
The left side of the rule contains attributes and their values, where A_C1, A_C2, …, A_Cn ≠ C, and the right side contains the class attribute and its value. Every classification rule has a degree of support, which is the number of transactions supporting the rule. A transaction in a database supports a classification rule if all the attributes of the rule appear in the transaction with the same values as in the rule.
Accordingly, the classification rule hiding problem is stated as follows: given the set R of classification rules mined from the main database D and the set Rs of sensitive rules (Rs ⊆ R), the purpose is to generate a sanitized database from which the sensitive rules cannot be mined, while the insensitive rules R − Rs remain minable as much as possible.
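The support test described above can be sketched as follows (a minimal illustration, not the authors' code; attribute names follow the running example of Section 5.1):

```python
def supports(transaction, rule):
    """A transaction supports a rule if every antecedent attribute/value
    pair and the class attribute/value pair appear in it unchanged."""
    antecedent, (class_attr, class_val) = rule
    return (all(transaction.get(a) == v for a, v in antecedent.items())
            and transaction.get(class_attr) == class_val)

# Rule 2 of the example: (years = medium) and (black list = yes) => approval = no
rule = ({"years": "medium", "black list": "yes"}, ("approval", "no"))
t6 = {"years": "medium", "marriage": "single", "gender": "male",
      "black list": "yes", "approval": "no"}
t5 = {"years": "medium", "marriage": "single", "gender": "male",
      "black list": "no", "approval": "yes"}
print(supports(t6, rule), supports(t5, rule))  # True False
```

The degree of support of a rule is then simply the number of transactions in D for which `supports` returns `True`.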
5 Main section: The proposed algorithm(BCR)
The proposed solution substitutes a certain number of attribute values with an unknown value in the transactions supporting sensitive rules, in a manner that the sensitive rules cannot be mined from the final database. The method aims both to hide the sensitive rules and to edit the database in such a way that adversaries cannot detect the values that existed before editing. For this purpose, transactions with unknown values are inserted for a number of sensitive rules.
In the BCR method, after mining the classification rules from the database and selecting the sensitive rules, the sensitive rules are sorted in descending order by their number of attributes. The database is then scanned for every sensitive rule, and the transactions supporting the rule are selected for editing. To hide a sensitive rule, not all attributes of the rule but only a certain number of them are blocked. Formula (5.2) determines the number of attributes to block:
X = ⌈NA / 2⌉ (5.2)
where NA is the number of attributes of the sensitive rule.
In the blocking process, the class attribute is always blocked, and the remaining X − 1 attributes to block are chosen randomly from the attributes on the left side of the rule. The blocking process for every sensitive rule is conducted according to the rule's degree of support: substituting the main values with unknown values in the transactions supporting the sensitive rule continues until the rule is hidden.
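This blocking step can be sketched as follows (a hedged illustration under the stated assumptions; attribute names are taken from the worked example, and `rng` is an explicit random generator only to make the choice reproducible):

```python
import math
import random

def block_transaction(transaction, rule, rng):
    """Block X = ceil(NA/2) attribute values of one supporting transaction:
    the class attribute always, plus X-1 randomly chosen left-side attributes."""
    antecedent, (class_attr, _) = rule
    x = math.ceil((len(antecedent) + 1) / 2)   # NA includes the class attribute
    t = dict(transaction)
    t[class_attr] = "?"                        # class attribute is always blocked
    for attr in rng.sample(sorted(antecedent), x - 1):
        t[attr] = "?"                          # remaining X-1 picked at random
    return t

rule = ({"years": "medium", "black list": "yes"}, ("approval", "no"))
t6 = {"years": "medium", "marriage": "single", "gender": "male",
      "black list": "yes", "approval": "no"}
blocked = block_transaction(t6, rule, random.Random(0))
print(blocked["approval"], sum(v == "?" for v in blocked.values()))  # ? 2
```

For the 3-attribute rule of the example, X = 2, so exactly the class value and one randomly chosen left-side value become unknown.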
During the blocking process, dummy transactions are inserted for a number of the sensitive rules. Following the sorted order, ⌈n/2⌉ dummy transactions are inserted for the first sensitive rule (the one with the highest number of attributes), ⌈n/2⌉ − 1 for the second sensitive rule, ⌈n/2⌉ − 2 for the third, and so on. The first transactions supporting a sensitive rule are the first to receive dummy transactions, and dummy transactions are inserted only up to a certain number: for every real transaction whose values have been changed to unknown, one dummy transaction is inserted. For example, if a sensitive rule has a degree of support of 5 (five supporting transactions) but only three dummy transactions can be inserted for it, insertion begins from the first main transaction and the last two transactions receive no dummy transactions. Formula (5.3) gives the total number of dummy transactions that can be inserted:
Total = Σ_{i=0}^{n−1} (n − i), for n = 1, 2
Total = Σ_{i=0}^{⌈n/2⌉−1} (⌈n/2⌉ − i), for n > 2 (5.3)
where n is the number of sensitive rules. The insertion process begins from the first transactions supporting the first sensitive rule. A dummy transaction containing unknown values, but otherwise similar to the main transaction, is inserted; the unknown values are displaced with respect to the attributes of the sensitive rule, which prevents the repetition of transactions. Since inserting dummy transactions and displacing unknown values is difficult for sensitive rules with a small number of attributes, the sensitive rules are sorted in descending order by number of attributes, and dummy transactions are inserted in the same order. The pseudo-code of the proposed solution is shown in Figure 1. The following subsection shows an example of the proposed solution in operation.
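Formula (5.3) can be computed directly, as in this short sketch:

```python
import math

def total_dummy_transactions(n):
    """Total number of dummy transactions that may be inserted for n
    sensitive rules, following Formula (5.3)."""
    if n <= 2:
        return sum(n - i for i in range(n))        # case n = 1, 2
    half = math.ceil(n / 2)
    return sum(half - i for i in range(half))      # case n > 2

print([total_dummy_transactions(n) for n in (1, 2, 3, 4)])  # [1, 3, 3, 3]
```

For a single sensitive rule the total is 1, which matches the one dummy transaction inserted in the worked example of Section 5.1.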
5.1. An example of the proposed algorithm's performance
Table 1 shows the main database. Every transaction in this table has 5 attributes, and Approval result is the class attribute. To mine the classification rules, the RIPPER algorithm available in WEKA 3.7 (weka.classifiers.rules.JRip) was used. All parameters of the JRip algorithm were left in default mode except the last one, Use Pruning, which was changed from True to False. Table 2 shows the mined rules.
Rule No. 2 is considered the sensitive rule. Since only one rule is selected as sensitive, rule sorting is not performed. The transactions supporting the sensitive rule are selected by scanning the initial database; in this example, transactions No. 6 and 14 support the sensitive rule. The blocking process is then performed with regard to Formula (5.2) and the value X = 2: two of the rule's 3 attributes, one of which is the class attribute, are blocked. According to Formula (5.3), one dummy transaction is inserted for the sensitive rule. This dummy transaction, in which the unknown values are displaced, is similar to transaction No. 6. The sensitive rule is now hidden: after running the JRip algorithm again with the same parameters, the only rule mined is rule No. 1. Table 3 shows the sanitized database.
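Reading Tables 1 and 3 together, the edit to transaction No. 6 and its dummy neighbour D* can be written out as follows (an illustrative sketch with shortened attribute names):

```python
# Transaction No. 6 before sanitization (Table 1)
t6 = {"years": "Medium", "marriage": "Single", "gender": "Male",
      "black list": "Yes", "approval": "No"}

# After blocking (Table 3): the class attribute plus "black list" become unknown
t6_blocked = dict(t6, **{"black list": "?", "approval": "?"})

# Dummy transaction D* (Table 3): the unknown values are displaced onto the
# other attribute of the sensitive rule, so the two rows are not identical
dummy = dict(t6, **{"years": "?", "approval": "?"})

print(t6_blocked != dummy)  # True
```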
Table 1: The initial database.
No. | Years at current work | Marriage status | Gender | Black list | Approval result
1 | Short | Married | Female | No | No
2 | Short | Married | Female | Yes | No
3 | Long | Married | Female | No | Yes
4 | Medium | Divorce | Female | No | Yes
5 | Medium | Single | Male | No | Yes
6 | Medium | Single | Male | Yes | No
7 | Long | Single | Male | Yes | Yes
8 | Short | Divorce | Female | No | No
9 | Short | Single | Male | No | Yes
10 | Medium | Divorce | Male | No | Yes
11 | Short | Divorce | Male | Yes | Yes
12 | Long | Divorce | Female | Yes | Yes
13 | Long | Married | Male | No | Yes
14 | Medium | Divorce | Female | Yes | No
Table 2: Classification rules mined from the initial database.
No. | Classification rule
1 | (gender = female) and (years at current work = short) => approval result = No (3.0/0.0)
2 | (years at current work = medium) and (black list = yes) => approval result = No (2.0/0.0)
Table 3: The sanitized database.
No. | Years at current work | Marriage status | Gender | Black list | Approval result
1 | Short | Married | Female | No | No
2 | Short | Married | Female | Yes | No
3 | Long | Married | Female | No | Yes
4 | Medium | Divorce | Female | No | Yes
5 | Medium | Single | Male | No | Yes
6 | Medium | Single | Male | ? | ?
D* | ? | Single | Male | Yes | ?
7 | Long | Single | Male | Yes | Yes
8 | Short | Divorce | Female | No | No
9 | Short | Single | Male | No | Yes
10 | Medium | Divorce | Male | No | Yes
11 | Short | Divorce | Male | Yes | Yes
12 | Long | Divorce | Female | Yes | Yes
13 | Long | Married | Male | No | Yes
14 | ? | Divorce | Female | Yes | ?
6 Comparative evaluation of the proposed algorithm
To evaluate the performance of the algorithm, both the proposed BCR algorithm and the ROD algorithm [13] were run on 3 real databases provided by the UCI repository. Table 4 shows the specifications of the 3 databases.
Table 4: Specifications of the databases.
Dataset | Instances | Attributes | Rules
Breast-Cancer | 286 | 10 | 14
Mushroom | 8124 | 23 | 9
Car | 1728 | 7 | 49
Classification rules were mined using the RIPPER algorithm, and 2 and 3 sensitive rules were randomly selected. Three factors were considered to evaluate the algorithms: hiding failure, lost rules, and ghost rules.
Hiding failure is the number of sensitive rules that can still be mined from the database after the sanitization process.
Lost rules is the number of insensitive rules that can no longer be mined from the database after the sanitization process.
Ghost rules is the number of rules that did not exist in the initial database but can be mined from the sanitized database.
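These three counts can be sketched as set operations over rule sets (an illustration, not the authors' code; rules are represented as hashable strings):

```python
def side_effects(original_rules, sensitive_rules, sanitized_rules):
    """Hiding failure, lost rules, and ghost rules as set differences."""
    nonsensitive = original_rules - sensitive_rules
    hiding_failure = len(sensitive_rules & sanitized_rules)  # still mined
    lost = len(nonsensitive - sanitized_rules)               # no longer mined
    ghost = len(sanitized_rules - original_rules)            # newly appearing
    return hiding_failure, lost, ghost

# The worked example of Section 5.1: rule 2 hidden, rule 1 kept, no new rules
print(side_effects({"r1", "r2"}, {"r2"}, {"r1"}))  # (0, 0, 0)
```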
Results for running the proposed algorithm are shown in Figures 1, 2 and 3.
Figure 1: Comparative evaluation in the Mushroom database (number of side effects for BCR and ROD, with 2 and 3 sensitive rules).
Figure 2: Comparative evaluation in the Breast Cancer database (number of side effects for BCR and ROD, with 2 and 3 sensitive rules).
Figure 3: Comparative evaluation in Car database.
As can be seen in the diagrams above, the proposed algorithm hides the sensitive rules with the minimum possible side effects. Selecting a limited number of attributes for the hiding process causes fewer changes in the database and, consequently, leads to a decrease in the number of lost rules and the generation rate of ghost rules. Sorting the rules by their number of attributes and inserting dummy transactions in a hierarchical manner, while leaving some sensitive rules free of dummy transactions, makes the blocking process harder to detect and results in a fall in the generation rate of ghost rules.
The results also show that in the ROD algorithm the number of side effects increases with the number of sensitive rules, which is not the case in the proposed algorithm. Although, as illustrated in Figure 3, inserting transactions has produced some ghost rules, the proposed algorithm still shows better results than ROD. This is due to the limitations defined on the insertion of dummy transactions: since the insertion is subtractive and hierarchical, the production of ghost rules is controlled. By changing all the attribute values that support the non-sensitive rules, the ROD algorithm produces more ghost rules than the proposed method; this problem gets even worse when the numbers of attributes of the sensitive and non-sensitive rules are equal. The more attributes the sensitive rules have, the better the results of the proposed algorithm compared to ROD, and the more efficient the proposed method is in maintaining data quality. In general, to hide sensitive rules it is enough to change only a limited number of attribute values; making more changes only produces more of the side effects mentioned above. Therefore, the results of the proposed algorithm are better than those of ROD in all the cases above. Another factor contributing to the better results of the proposed method is the difference in rule sorting: since ROD sorts rules by number of supports, data sanitization may face problems when there are no non-sensitive rules with a lower number of supports.
6.1. Limitations of the proposed algorithm
One limitation of the proposed algorithm is its time complexity on large databases: because of the inserted transactions, the larger the database, the slower the sanitization becomes. Further, because the proposed method chooses the blocked attributes randomly, if the same attribute is present in many different non-sensitive rules, the number of side effects such as ghost rules and lost rules will increase.
7 Conclusion and future works
The present study introduced an algorithm based on a blocking approach for hiding sensitive classification rules. The proposed solution performs the hiding process as follows: the sensitive rules are sorted in descending order by their number of attributes, and hiding starts from the sensitive rule with the highest number of attributes. To hide a sensitive rule, in addition to the class attribute, the values of a limited number of attributes on the left side of the rule are randomly substituted with the “?” value.
The proposed method also inserts, for a certain number of sensitive rules, dummy transactions containing unknown values in order to disturb the database. To control the outbreak of side effects resulting from the inserted transactions, the insertion is done hierarchically and in descending order, and only for a limited number of sensitive rules and transactions. The results show that the proposed method is free of hiding failure, and in the numbers of lost rules and ghost rules it is more efficient than the algorithm introduced in [15]. Given that the proposed algorithm performs the sanitization with fewer adverse effects, it is more successful in maintaining data quality.
The proposed solution is capable of further development and can be used with other classification methods such as the nearest neighbor classifier and the decision tree. In addition, the manner of choosing the victim attribute could be changed so that it is based on the non-sensitive rules, and this choice could be combined with the insertion of transactions, so that the insertion is carried out according to the non-sensitive rules.
References
[1] Y. Guo, Reconstruction-based association rule hiding, SIGMOD2007 Ph. D. Workshop on Innovative
Database Research, (2007) 51-56.
[2] H. Wang, Association Rule: From Mining to Hiding, Applied Mechanics and Materials, 321 (2013)
2570-2573.
https://doi.org/10.4028/www.scientific.net/AMM.321-324.2570
[3] V. S. Verykios, Association rule hiding methods, Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, 3 (2013) 28-36.
https://doi.org/10.1002/widm.1082
[4] V. Garg, A. Singh, D. Singh, A Survey of Association Rule Hiding Algorithms, Communication
Systems and Network Technologies (CSNT), 2014 Fourth International Conference on, (2014) 404-
407.
https://doi.org/10.1109/CSNT.2014.86
[5] H. Wang, Quality Measurements for Association Rules Hiding, AASRI Procedia, 5 (2013) 228-234.
https://doi.org/10.1016/j.aasri.2013.10.083
[6] A. Gkoulalas-Divanis, J. Haritsa, M. Kantarcioglu, Privacy Issues in Association Rule Mining,
Frequent Pattern Mining, ed: Springer, (2014) 369-401.
https://doi.org/10.1007/978-3-319-07821-2_15
[7] G. V. Moustakides, V. S. Verykios, A MaxMin approach for hiding frequent itemsets, Data &
Knowledge Engineering, 65 (2008) 75-89.
https://doi.org/10.1016/j.datak.2007.06.012
[8] T.-P. Hong, C.-W. Lin, K.-T. Yang, S.-L. Wang, Using TF-IDF to hide sensitive itemsets, Applied
intelligence, 38 (2013) 502-510.
https://doi.org/10.1007/s10489-012-0377-5
[9] X. Sun, P. S. Yu, A border-based approach for hiding sensitive frequent itemsets, Data Mining,
Fifth IEEE International Conference, (2005) 8.
https://doi.org/10.1109/ICDM.2005.2
[10] J. Natwichai, X. Li, M. E. Orlowska, A reconstruction-based algorithm for classification rules hiding,
17th Australasian Database Conference, 49 (2006) 49-58.
[11] A. Katsarou, A. Gkoulalas-Divanis, V. S. Verykios, Reconstruction-based Classification Rule Hiding through
Controlled Data Modification, Artificial Intelligence Applications and Innovations III, ed: Springer,
(2009) 449-458.
https://doi.org/10.1007/978-1-4419-0221-4_53
[12] J. Natwichai, X. Sun, X. Li, Data reduction approach for sensitive associative classification rule
hiding, Nineteenth conference on Australasian database, 75 (2008) 23-30.
[13] A. Delis, V. S. Verykios, A. A. Tsitsonis, A data perturbation approach to sensitive classification rule
hiding, ACM Symposium on Applied Computing, (2010) 605-609.
https://doi.org/10.1145/1774088.1774216
[14] Y. Saygin, V. S. Verykios, C. Clifton, Using unknowns to prevent discovery of association rules,
ACM SIGMOD Record, 30 (2001) 45-54.
https://doi.org/10.1145/604264.604271
[15] A. Parmar, U. P. Rao, D. R. Patel, Blocking based approach for classification Rule hiding to Preserve
the Privacy in Database, International Symposium on Computer Science and Society (ISCCS), (2011)
323-326.
https://doi.org/10.1109/isccs.2011.103