Introduction to Data Mining (Emory University, CS570 Spring 2011), 2011-04-03
Introduction to Data Mining
Privacy preserving data mining
4/3/2011
Privacy preserving data mining
Li Xiong
Slides credits:
Chris Clifton
Agrawal and Srikant
Privacy Preserving Data Mining
- Privacy concerns about personal data
  - AOL query log release
  - Netflix challenge
  - Data scraping
A race to the bottom: privacy ranking of Internet service companies
- A study done by Privacy International into the privacy practices of key Internet-based companies
- Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube
A Race to the Bottom: Methodologies
- Corporate administrative details
- Data collection and processing
- Data retention
- Openness and transparency
- Customer and user control
- Privacy-enhancing innovations and privacy-invasive innovations
A race to the bottom: interim results revealed
Why Google
- Retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limits on subsequent use or disclosure
- Maintains records of all search strings with associated IP addresses and time stamps for at least 18-24 months
- Collects additional personal information from user profiles in Orkut
- Uses an advanced profiling system for ads
Remember, they are always watching …
Some advice from privacy campaigners …
- Use cash when you can.
- Do not give out your phone number, Social Security number, or address unless you absolutely have to.
- Do not fill in questionnaires or respond to telemarketers.
- Demand that credit and data-marketing firms produce all information they have on you, correct errors, and remove you from marketing lists.
- Check your medical records often.
- Block caller ID on your phone, and keep your number unlisted.
- Never leave your mobile phone on; your movements can be traced.
- Do not use store credit or discount cards.
- If you must use the Internet, encrypt your e-mail, reject all "cookies", and never give your real name when registering at websites.
- Better still, use somebody else's computer.
Privacy-Preserving Data Mining
- Data obfuscation (non-interactive model): Original Data → Anonymization → "Sanitized" Data → Miner
- Output perturbation (interactive model): Original Data → Access Interface → "Perturbed" Results → Miner
Classes of Solutions
- Methods
  - Input obfuscation
    - Perturbation
    - Generalization
  - Output perturbation
    - Differential privacy
- Metrics
  - Privacy vs. Utility
Data Perturbation
- Data randomization
  - Randomization (additive noise)
  - Geometric perturbation and projection (multiplicative noise)
  - Randomized response technique (categorical data)
Randomization Based Decision Tree Learning (Agrawal and Srikant ’00)
- Basic idea: perturb data with value distortion
  - User provides x_i + r instead of x_i
  - r is a random value
    - Uniform: uniform distribution on [-α, α]
    - Gaussian: normal distribution with μ = 0 and standard deviation σ
- Hypothesis
  - Miner doesn't see the real data or can't reconstruct real values
  - Miner can reconstruct enough information to identify patterns
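A minimal sketch of this value-distortion step (function names and the sample values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def distort_uniform(x, alpha):
    """Return x + r with r drawn uniformly from [-alpha, alpha]."""
    return x + rng.uniform(-alpha, alpha, size=len(x))

def distort_gaussian(x, sigma):
    """Return x + r with r drawn from N(0, sigma^2)."""
    return x + rng.normal(0.0, sigma, size=len(x))

ages = np.array([30.0, 50.0, 42.0])
print(distort_uniform(ages, alpha=20.0))   # each age shifted by at most 20
print(distort_gaussian(ages, sigma=10.0))  # each age shifted by Gaussian noise
```

Each user runs the distortion locally, so the miner only ever sees the perturbed values.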
Classification using Randomization Data
[Figure: each record is randomized before leaving the client, e.g. "30 | 70K | ..." becomes "65 | 20K | ...": Alice's age 30 becomes 65 (30 + 35) after a random number is added to Age. Can the classification algorithm still build a model from such randomized records?]
Output: A Decision Tree for “buys_computer”
[Figure: the tree is rooted at age? with branches <=30, 31..40, and >40; the <=30 branch splits on student? (no → no, yes → yes), the 31..40 branch predicts yes, and the >40 branch splits on credit rating? (excellent → no, fair → yes).]
February 12, 2008 Data Mining: Concepts and Techniques
Attribute Selection Measure: Gini index (CART)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

- Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)

- The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node
Randomization Approach Overview
[Figure: records are randomized as before (Alice's age 30 becomes 65 = 30 + 35); the server first reconstructs the distribution of Age and the distribution of Salary from the randomized records, then runs the classification algorithm on the reconstructed distributions to build the model.]
Original Distribution Reconstruction
- x_1, x_2, ..., x_n are the n original data values
  - Drawn from n iid random variables with distribution X
- Using value distortion, the given values are w_1 = x_1 + y_1, w_2 = x_2 + y_2, ..., w_n = x_n + y_n
  - The y_i's are from n iid random variables with distribution Y
- Reconstruction problem: given F_Y and the w_i's, estimate F_X
Original Distribution Reconstruction: Method
- Bayes' theorem for continuous distributions
- The estimated density function (minimum mean square error estimator):

  f_X'(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}

- Iterative estimation:

  f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}

  - The initial estimate for f_X at j = 0: uniform distribution
  - Stopping criterion: the difference between successive iterations is small
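A discretized sketch of this iterative estimator, assuming the density is approximated on a finite grid (all names and the synthetic data are illustrative, not from the paper):

```python
import numpy as np

def reconstruct(w, f_Y, grid, n_iter=100, tol=1e-6):
    """Iteratively re-estimate the density f_X on a discrete grid from
    perturbed values w_i = x_i + y_i, given the noise density f_Y."""
    da = grid[1] - grid[0]
    fX = np.full(grid.size, 1.0 / (grid.size * da))   # uniform initial estimate
    K = f_Y(w[:, None] - grid[None, :])               # K[i, a] = f_Y(w_i - a)
    for _ in range(n_iter):
        denom = (K * fX).sum(axis=1) * da             # integral of f_Y(w_i - z) f_X(z) dz
        new = fX * (K / denom[:, None]).mean(axis=0)  # the update rule above
        new /= new.sum() * da                         # renormalize to a density
        if np.abs(new - fX).max() < tol:
            fX = new
            break
        fX = new
    return fX

# demo: standard-normal data with uniform noise on [-2, 2]
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500)
w = x + rng.uniform(-2.0, 2.0, 500)
grid = np.linspace(-6.0, 6.0, 121)
f_uniform = lambda u: (np.abs(u) <= 2.0) / 4.0
fX_hat = reconstruct(w, f_uniform, grid)
```

Running until full convergence is known to produce spiky estimates, which is why a stopping criterion on the change between iterations matters in practice.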
Reconstruction of Distribution
[Figure: number of people (0-1200) vs. age, comparing the Original, Randomized, and Reconstructed distributions.]
Original Distribution Reconstruction for Decision Trees
- When are the distributions reconstructed?
  - Global
    - Reconstruct for each attribute once at the beginning
    - Build the decision tree using the reconstructed data
  - ByClass
    - First split the training data
    - Reconstruct for each class separately
    - Build the decision tree using the reconstructed data
  - Local
    - First split the training data
    - Reconstruct for each class separately
    - Reconstruct at each node while building the tree
Accuracy vs. Randomization Level
[Figure: accuracy (40%-100%) vs. randomization level (10 to 200) for function Fn 3, comparing the Original, Randomized, and ByClass curves.]
More Results
- Global performs worse than ByClass and Local
- ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy
- Overall, all are much better than the Randomized accuracy
Privacy metrics
- Privacy metrics for random additive data perturbation
4/3/2011 Data Mining: Principles and Algorithms 24
Unfortunately
- Random additive data perturbation is subject to data reconstruction attacks
- Original data can be estimated using spectral filtering techniques
  - H. Kargupta and S. Datta. On the privacy preserving properties of random data perturbation techniques. ICDM 2003.
Estimating distribution and data values
Follow-up Work
- Multiplicative randomization
- Geometric randomization
- Also subject to data reconstruction attacks!
  - Known input-output attacks
  - Known-sample attacks
Data Perturbation
- Data randomization
  - Randomization (additive noise)
  - Geometric perturbation and projection (multiplicative noise)
  - Randomized response technique (categorical data)
Data Collection Model
Data cannot be shared directly because of privacy concerns
Background:Randomized Response
Do you smoke?
- Flip a biased coin with P(Head) = θ (θ ≠ 0.5)
- Head: give the true answer; Tail: give the opposite answer
- E.g. if the true answer is "Yes": Head → "Yes", Tail → "No"

  P'(Yes) = P(Yes) ⋅ θ + P(No) ⋅ (1−θ)
  P'(No) = P(Yes) ⋅ (1−θ) + P(No) ⋅ θ
Decision Tree Mining using Randomized Response
- Multiple attributes encoded in bits
- Flip a biased coin with P(Head) = θ (θ ≠ 0.5)
  - Head: send the true answer, e.g. E: 110
  - Tail: send the false (complemented) answer, !E: 001
- Column distributions can be estimated for learning a decision tree

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
Generalization for Multi-Valued Categorical Data
- A true value S_i is reported as S_i, S_{i+1}, S_{i+2}, or S_{i+3} with probabilities q_1, q_2, q_3, and q_4 respectively
- The observed distribution relates to the true one through the RR matrix M:

  [P'(s_1)]   [q_1 q_4 q_3 q_2] [P(s_1)]
  [P'(s_2)] = [q_2 q_1 q_4 q_3] [P(s_2)]
  [P'(s_3)]   [q_3 q_2 q_1 q_4] [P(s_3)]
  [P'(s_4)]   [q_4 q_3 q_2 q_1] [P(s_4)]
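The matrix relation above suggests estimating the true distribution by solving P' = M·P for P. A sketch with illustrative q values (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# q[k]: probability that true value s_j is reported as s_{(j+k) mod 4}
q = np.array([0.7, 0.1, 0.1, 0.1])
M = np.array([[q[(i - j) % 4] for j in range(4)] for i in range(4)])

p_true = np.array([0.4, 0.3, 0.2, 0.1])
truth = rng.choice(4, size=200_000, p=p_true)
shift = rng.choice(4, size=truth.size, p=q)   # randomized-response shift
reported = (truth + shift) % 4

p_obs = np.bincount(reported, minlength=4) / reported.size
p_est = np.linalg.solve(M, p_obs)             # invert P' = M P
print(np.round(p_est, 3))
```

Because M is a circulant matrix with a dominant diagonal here, it is well conditioned and the inversion is stable.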
A Generalization
- RR matrices [Warner 65], [R. Agrawal 05], [S. Agrawal 05]
- The RR matrix can be arbitrary:

  M = [a_11 a_12 a_13 a_14]
      [a_21 a_22 a_23 a_24]
      [a_31 a_32 a_33 a_34]
      [a_41 a_42 a_43 a_44]

- Can we find optimal RR matrices?
OptRR:Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
What is an optimal matrix?
- Which of the following is better?

  M1 = [ 1   0   0 ]     M2 = [ 1/3  1/3  1/3 ]
       [ 0   1   0 ]          [ 1/3  1/3  1/3 ]
       [ 0   0   1 ]          [ 1/3  1/3  1/3 ]

- Privacy: M2 is better. Utility: M1 is better.
So, what is an optimal matrix?
Optimal RR Matrix
- An RR matrix M is optimal if no other RR matrix's privacy and utility are both better than M's (i.e., no other matrix dominates M)
- Privacy and utility metrics
  - Privacy quantification: how accurately one can estimate individual info
  - Utility quantification: how accurately we can estimate aggregate info
Optimization algorithm
- Evolutionary Multi-Objective Optimization (EMOO)
- The algorithm:
  - Start with a set of initial RR matrices
  - Repeat the following steps in each iteration:
    - Mating: select two RR matrices in the pool
    - Crossover: exchange several columns between the two RR matrices
    - Mutation: change some values in an RR matrix
    - Meet the privacy bound: filter out resulting matrices that violate it
    - Evaluate the fitness value for the new RR matrices
- Note: the fitness value is defined in terms of the privacy and utility metrics
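The dominance notion in the optimality definition can be sketched with a naive O(n²) Pareto filter (the (privacy, utility) scores are made up for illustration):

```python
def dominates(p, r):
    """p dominates r if p is at least as good in both objectives and strictly
    better in at least one (objectives: (privacy, utility), both maximized)."""
    return p[0] >= r[0] and p[1] >= r[1] and p != r

def pareto_front(points):
    """Keep only the candidates not dominated by any other candidate."""
    return [p for p in points if not any(dominates(r, p) for r in points)]

# hypothetical scores for five candidate RR matrices
candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
print(pareto_front(candidates))  # [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9)]
```

EMOO maintains such a non-dominated set across generations rather than collapsing the two objectives into a single score.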
Output of Optimization
The optimal set is often plotted in the objective space as a Pareto front.
[Figure: candidate RR matrices M1-M8 plotted in the privacy-utility plane, with "Better" and "Worse" directions marked; the non-dominated matrices form the Pareto front.]
Classes of Solutions
- Methods
  - Input obfuscation
    - Perturbation
    - Generalization
  - Output perturbation
    - Differential privacy
- Metrics
  - Privacy vs. Utility
Data Re-identification
[Figure: a medical data set with attributes (Zip, Birthdate, Sex, Disease) linked to a public record containing Name via the shared quasi-identifiers Zip, Birthdate, and Sex.]
k-anonymity & l-diversity
Privacy preserving data mining
- Generalization principles
  - k-anonymity, l-diversity, ...
- Methods
  - Optimal
  - Greedy
    - Top-down vs. bottom-up
Mondrian: Greedy Partitioning Algorithm
- Problem
  - Need an algorithm to find multidimensional partitions
  - Optimal k-anonymous strict multidimensional partitioning is NP-hard
- Solution
  - Use a greedy algorithm based on k-d trees
  - Complexity O(n log n)
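A toy sketch of such a greedy, k-d-tree-style median partitioning (simplified: it uses raw attribute ranges rather than the paper's normalized ranges, and the data and names are illustrative):

```python
def mondrian(records, dims, k):
    """Recursively split at the median of the widest quasi-identifier dimension,
    keeping at least k records on each side; stop when no allowable cut exists."""
    # choose the dimension with the widest value range
    dim = max(dims, key=lambda d: max(r[d] for r in records) - min(r[d] for r in records))
    vals = sorted(r[dim] for r in records)
    median = vals[len(vals) // 2]
    left = [r for r in records if r[dim] < median]
    right = [r for r in records if r[dim] >= median]
    if len(left) < k or len(right) < k:
        return [records]                      # no allowable cut: emit this partition
    return mondrian(left, dims, k) + mondrian(right, dims, k)

table = [{"Age": 25, "Zip": 53711}, {"Age": 25, "Zip": 53712},
         {"Age": 26, "Zip": 53711}, {"Age": 27, "Zip": 53710},
         {"Age": 41, "Zip": 53720}, {"Age": 47, "Zip": 53721}]
parts = mondrian(table, ["Age", "Zip"], k=2)
print([len(p) for p in parts])  # every partition holds at least k = 2 records
```

Each resulting partition would then be generalized to the bounding box of its quasi-identifier values.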
Example
k = 2; quasi-identifiers: Age, Zipcode
What should the splitting criterion be?
[Figure: the patient data table and its multi-dimensional partitioning.]
Unfortunately
- Generalization-based principles and methods are subject to attacks
  - Background-knowledge sensitive
  - Attack dependent
Classes of Solutions
- Methods
  - Input obfuscation
    - Perturbation
    - Generalization
  - Output perturbation
    - Differential privacy
- Metrics
  - Privacy vs. Utility
Differential Privacy
- Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set
[Figure: a user issues query Q through a differentially private interface; with Bob's record in the data (D1) the answer is A(D1) = Q(D1) + Y1, and with Bob's record removed (D2) it is A(D2) = Q(D2) + Y2.]
- Differential privacy via the Laplace mechanism
  - Return Q(D) + Y, where Y is drawn from a Laplace distribution whose scale is calibrated to the query sensitivity: Y ~ Lap(ΔQ/ε)
  - Query sensitivity ΔQ: the maximum change in Q between data sets that differ in one record
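A minimal sketch of the Laplace mechanism for a count query, which has sensitivity ΔQ = 1 (the data and function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(data, predicate, epsilon):
    """Answer a count query under epsilon-differential privacy: a count has
    sensitivity 1 (one record changes it by at most 1), so Lap(1/epsilon)
    noise suffices."""
    true_count = sum(1 for r in data if predicate(r))
    return true_count + rng.laplace(0.0, 1.0 / epsilon)

ages = [23, 35, 41, 58, 62, 29]
print(laplace_count(ages, lambda a: a >= 40, epsilon=0.5))  # true count 3 plus noise
```

Smaller ε means stronger privacy but noisier answers, since the Laplace scale is 1/ε.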
Coming up
- Data mining algorithms using differential privacy
  - Decision tree learning (Data Mining with Differential Privacy, SIGKDD 10)
  - Frequent itemset mining (Discovering Frequent Patterns in Sensitive Data, SIGKDD 10)
Midterm Exam
- Adjusted mean: 85.3
- Adjusted max: 101
- Your favorite topics: clustering, frequent itemset mining, decision trees
- Your favorite assignment: Apriori
- Your least favorite: SOM, Weka analysis