Privacy-preserving Anonymization of Set Value Data
Manolis Terrovitis, Institute for the Management of Information Systems (IMIS), RC Athena
Nikos Mamoulis, University of Hong Kong (HKU)
Panos Kalnis, King Abdullah University of Science and Technology (KAUST)
2
Motivation
Attacker can see up to m items
Any m items
No distinction between sensitive and non-sensitive items
[Figure: Helen's purchases: Beer, 0% Milk, Pregnancy test]
3
Motivation (cont.)
Database:
Helen: Beer, 0% Milk, Pregnancy test
John: Cola, Cheese
Tom: 2% Milk, Coffee
…
Mary: Wine, Beer, Full-fat Milk

Published:
t1: Beer, 0% Milk, Pregnancy test
t2: Cola, Cheese
t3: 2% Milk, Coffee
…
tn: Wine, Beer, Full-fat Milk

Attacker: Find all transactions that contain Beer & 0% Milk

t1: Beer, Milk, Pregnancy test
t2: Cola, Cheese
t3: Milk, Coffee
…
tn: Wine, Beer, Milk
4
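k^m-anonymity
Set of items: $I = \{o_1, o_2, \ldots, o_{|I|}\}$
Transaction: $t \subseteq I$
Database: $D = \{t_1, t_2, \ldots, t_{|D|}\}$
Query terms: $qs \subseteq I$, $|qs| \le m$
Query result: $res = \{t \mid t \in D \wedge qs \subseteq t\}$
k^m-anonymity: $|res| = 0 \lor |res| \ge k$ for every such $qs$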
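
The definition above can be checked by brute force on a small database. The following is a minimal sketch, assuming transactions are given as Python sets of item names; the function name `is_km_anonymous` and the sample data are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(D, k, m):
    """Return True if every itemset of size 1..m that occurs in D
    occurs in at least k transactions."""
    support = Counter()
    for t in D:
        items = sorted(set(t))
        for size in range(1, m + 1):
            # Each distinct subset of t is counted once per transaction.
            support.update(combinations(items, size))
    # Itemsets with support 0 never enter the counter, so only the
    # "at least k" half of the condition needs to be tested.
    return all(count >= k for count in support.values())

# Illustrative data: the first database has a unique (Beer, 0% Milk) pair.
original = [{"Beer", "0% Milk"}, {"Beer", "2% Milk"}, {"Cola"}, {"Cola"}]
generalized = [{"Beer", "Milk"}, {"Beer", "Milk"}, {"Cola"}, {"Cola"}]
print(is_km_anonymous(original, k=2, m=2))
print(is_km_anonymous(generalized, k=2, m=2))
```

The check enumerates every itemset of up to m items, so it is only practical for small m and short transactions.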
5
Related Work: K-Anonymity [Swe02]
[Swe02] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557-570, 2002.

(a) Microdata
Age    ZipCode    Disease
42     25000      Flu
46     35000      AIDS
50     20000      Cancer
54     40000      Gastritis
48     50000      Dyspepsia
56     55000      Bronchitis
Quasi-identifier: Age, ZipCode

(b) 2-anonymous microdata
Age      ZipCode        Disease
42-46    25000-35000    Flu
42-46    25000-35000    AIDS
50-54    20000-40000    Cancer
50-54    20000-40000    Gastritis
48-56    50000-55000    Dyspepsia
48-56    50000-55000    Bronchitis
NOT suitable for high-dimensional data
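
The 2-anonymity of the generalized table can be verified mechanically by grouping rows on the quasi-identifier. This is a small sketch using the rows from the slide; the function name `is_k_anonymous` and the dictionary layout are assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """True if every combination of quasi-identifier values is shared
    by at least k rows."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(size >= k for size in groups.values())

QI = ("Age", "ZipCode")

microdata = [
    {"Age": "42", "ZipCode": "25000", "Disease": "Flu"},
    {"Age": "46", "ZipCode": "35000", "Disease": "AIDS"},
    {"Age": "50", "ZipCode": "20000", "Disease": "Cancer"},
    {"Age": "54", "ZipCode": "40000", "Disease": "Gastritis"},
    {"Age": "48", "ZipCode": "50000", "Disease": "Dyspepsia"},
    {"Age": "56", "ZipCode": "55000", "Disease": "Bronchitis"},
]

generalized = [
    {"Age": "42-46", "ZipCode": "25000-35000", "Disease": "Flu"},
    {"Age": "42-46", "ZipCode": "25000-35000", "Disease": "AIDS"},
    {"Age": "50-54", "ZipCode": "20000-40000", "Disease": "Cancer"},
    {"Age": "50-54", "ZipCode": "20000-40000", "Disease": "Gastritis"},
    {"Age": "48-56", "ZipCode": "50000-55000", "Disease": "Dyspepsia"},
    {"Age": "48-56", "ZipCode": "50000-55000", "Disease": "Bronchitis"},
]

print(is_k_anonymous(microdata, QI, 2))
print(is_k_anonymous(generalized, QI, 2))
```

With set-valued data every item can act as part of the quasi-identifier, so the number of columns to group on grows with the item domain, which is the high-dimensionality problem noted above.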
6
Related Work: L-diversity in Transactions
[GTK08] G. Ghinita, Y. Tao, P. Kalnis. On the Anonymization of Sparse High-Dimensional Data. ICDE, 2008.
Requires knowledge of (non)-sensitive attributes
7
Our Approach: Employs Generalization
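[Figure: generalization hierarchy, with items $a_1, a_2$ generalized to $A$]
Information loss:
$NCP(p) = \begin{cases} 0, & p \text{ is a leaf node} \\ |u_p| / |I|, & \text{otherwise} \end{cases}$
k=2, m=2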
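
The information-loss measure can be illustrated on a toy hierarchy. The sketch below assumes NCP(p) is 0 for a leaf and otherwise the number of leaves under the generalized node divided by the total number of items |I|; the hierarchy `CHILDREN` and the function names are hypothetical and not taken from the slides.

```python
# Hypothetical hierarchy: ALL -> A, B; A -> a1, a2; B -> b1, b2.
CHILDREN = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}

def leaf_count(node, children=CHILDREN):
    """Number of leaf items in the subtree rooted at node."""
    kids = children.get(node, [])
    if not kids:
        return 1
    return sum(leaf_count(c, children) for c in kids)

def ncp(p, children=CHILDREN, root="ALL"):
    """Loss for publishing an item as hierarchy node p."""
    if p not in children:          # leaf node: the item is published as-is
        return 0.0
    return leaf_count(p, children) / leaf_count(root, children)

for node in ("a1", "A", "ALL"):
    print(node, ncp(node))
```

Under this assumption, generalizing to the root costs 1.0 and leaving an item untouched costs 0, so the measure rewards cuts that stay low in the hierarchy.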
8
Lattice of Generalizations
9
Optimal Algorithm
10
Count Tree
[Figure: count tree built from example transactions over items a1, a2, b1, b2 and their generalizations A, B, with a support count at each node]
All generalized forms of the paths reside in the tree
We can easily find which anonymizations are needed
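
A count tree of this kind can be sketched as a trie whose paths are itemsets in a fixed order. The code below is an illustrative approximation rather than the authors' data structure: it assumes each transaction is extended with the ancestors of its items and that itemsets mixing an item with its own ancestor are skipped. The hierarchy `PARENT` and all names are hypothetical.

```python
from itertools import combinations

# Hypothetical hierarchy: child -> parent.
PARENT = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

class Node:
    def __init__(self):
        self.count = 0        # transactions containing the itemset on this path
        self.children = {}

def ancestors(item, parent=PARENT):
    out = []
    while item in parent:
        item = parent[item]
        out.append(item)
    return out

def add_transaction(root, t, m, parent=PARENT):
    """Insert every itemset of size 1..m from t and its generalized forms."""
    extended = set(t)
    for o in t:
        extended.update(ancestors(o, parent))
    for size in range(1, m + 1):
        for combo in combinations(sorted(extended), size):
            # Skip itemsets that contain an item together with its ancestor.
            if any(a in ancestors(b, parent) for a in combo for b in combo):
                continue
            node = root
            for item in combo:
                node = node.children.setdefault(item, Node())
            node.count += 1

def support(root, itemset):
    node = root
    for item in sorted(itemset):
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count

root = Node()
for t in [{"a1", "b1"}, {"a2", "b2"}]:
    add_transaction(root, t, m=2)
print(support(root, ["a1", "b1"]), support(root, ["A", "B"]))
```

In this example the original pair has support 1 while its generalized form (A, B) has support 2, which is the kind of lookup that tells us which generalization would satisfy k=2.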
11
Apriori-based Anonymization
Global optimal vs. local optimal solution for each path
We examine the paths
  By size (Apriori principle)
  Paths with invalid nodes are skipped
12
Apriori-based Anonymization
1. Initialize gen_map
2. For i := 1 to m do
   1. For all t ∈ D do
      1. Extend t according to gen_map
      2. Add all i-subsets of extended t to count-tree
   2. Check all paths in count-tree and update gen_map
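
The loop structure above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: a flat `Counter` stands in for the count tree, transactions are rewritten with `gen_map` instead of being extended with all generalized forms, and the path check is replaced by a greedy rule that lifts one item of a rare itemset to its parent. The hierarchy `PARENT` and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical hierarchy: child -> parent, with "ALL" as the root.
PARENT = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "A": "ALL", "B": "ALL"}

def ancestors(item, parent):
    """Yield the ancestors of an item, nearest first."""
    while item in parent:
        item = parent[item]
        yield item

def apply_map(t, gen_map):
    """Replace each item by its current generalization (set semantics)."""
    return frozenset(gen_map.get(o, o) for o in t)

def apriori_anonymize(D, k, m, parent=PARENT):
    items = set().union(*D)
    gen_map = {}                                   # 1. initialize gen_map
    for i in range(1, m + 1):                      # 2. for i := 1 to m
        while True:
            counts = Counter()                     # stand-in for the count tree
            for t in D:
                gt = apply_map(t, gen_map)
                counts.update(combinations(sorted(gt), i))  # all i-subsets
            rare = [s for s, c in counts.items() if c < k]
            if not rare:
                break
            # Greedy stand-in for the path check: lift one item one level.
            target = next((o for o in rare[0] if o in parent), None)
            if target is None:
                raise ValueError("k cannot be reached with this hierarchy")
            new = parent[target]
            for o in items:                        # global recoding of the subtree
                if new in ancestors(o, parent):
                    gen_map[o] = new
    return gen_map, [apply_map(t, gen_map) for t in D]

D = [{"a1", "b1"}, {"a2", "b1"}, {"a1", "b2"}, {"a2", "b2"}]
gen_map, published = apriori_anonymize(D, k=2, m=2)
print(gen_map)
```

Processing itemsets by increasing size means that generalizations found for small itemsets are already applied when larger itemsets are counted, which keeps the number of counted combinations down. The greedy rule here gives no guarantee on information loss.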
13
Small Datasets (2-15K, BMS-WebView2)
|I|=40..60, k=100, m=3
14
Small Datasets (BMS-WebView2)
|D|=10K, k=100, m=1..4
15
Apriori Anonymization for Large Datasets
[Figure: running times, annotated 500 sec, 10 sec, and 100 sec]
|D|     |I|
515K    1657
59K     497
77K     3340
k=5, m=3
16
Points to Remember
Anonymization of transactional data
  Attacker knows m items
  Any m items can be the quasi-identifier
Global recoding method
  Optimal solution: too slow
  Apriori Anonymization: fast and low information loss
Extensions (VLDBJ 2010)
  Local recoding (sort by Gray order and partition)
  Global recoding (by partitioning the data domain)