Privacy preserving data mining – multiplicative perturbation techniques
Li Xiong
CS573 Data Privacy and Anonymity
Outline
- Review and critique of randomization approaches (additive noise)
- Multiplicative data perturbations
  - Rotation perturbation
  - Geometric data perturbation
  - Random projection
- Comparison
Additive noise (randomization)
The Database reveals its entire contents x1…xn, but randomizes the entries: the User receives x1+ε1, …, xn+εn.
- Add random noise εi to each database entry xi
- For example, if the distribution of the noise has mean 0, the user can compute the average of the xi
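A minimal sketch of this setup in Python (the age-like values, noise scale, and sample size are illustrative assumptions, not from the slides):

```python
# Additive-noise randomization: each entry x_i is released as x_i + eps_i
# with zero-mean noise, so aggregate statistics such as the mean survive.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(20, 70, size=10_000)                 # original entries, e.g. ages
eps = rng.normal(loc=0.0, scale=15.0, size=x.size)   # zero-mean random noise
released = x + eps                                   # what the database reveals

print(x.mean())         # true average (hidden from the user)
print(released.mean())  # user's estimate; close to it because E[eps] = 0
```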
Learning decision tree on randomized data
(Figure: original records such as "30 | 70K | …" and "50 | 40K | …" pass attribute-wise through a Randomizer – adding a random number to Age, Alice's age 30 becomes 65 (30+35) – yielding randomized records such as "65 | 20K | …" and "25 | 60K | …"; the distributions of Age and Salary are then reconstructed and fed to the classification algorithm to build the model.)
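The "reconstruct distribution" step can be sketched as an iterative Bayesian update in the spirit of the Agrawal–Srikant randomization approach; the synthetic ages, noise scale, binning, and iteration count below are illustrative assumptions, not the exact published procedure:

```python
# Given randomized values w_i = x_i + eps_i and the known noise density,
# iteratively re-estimate the distribution of the original x_i.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(40, 8, size=5000)               # hidden originals (e.g. Age)
sigma = 15.0
w = x + rng.normal(0, sigma, size=x.size)      # published randomized values

bins = np.linspace(0, 80, 81)                  # discretized support for Age
centers = (bins[:-1] + bins[1:]) / 2
p = np.full(centers.size, 1.0 / centers.size)  # start from a uniform guess

for _ in range(50):                            # fixed-point iteration
    like = norm.pdf(w[:, None] - centers[None, :], scale=sigma)  # f_eps(w_i - a_k)
    post = like * p                            # unnormalized posterior per record
    post /= post.sum(axis=1, keepdims=True)
    p = post.mean(axis=0)                      # updated estimate of P(X = a_k)

# p now approximates the Age distribution; the decision tree is trained
# against such reconstructed per-attribute distributions.
```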
Summary on additive perturbations
Benefits:
- Easy to apply – applied separately to each data point (record)
- Low cost
- Can be used for both the web model and the corporate model
(Figure, web model: users 1…n hold private info x1…xn and each submits a randomized value, so the web apps only ever see x1+ε1, …, xn+εn.)
Additive perturbations – privacy
- Need to publish the noise distribution
- The column distribution is disclosed
- Subject to data value attacks!
On the Privacy Preserving Properties of Random Data Perturbation Techniques, Kargupta, 2003a:
- The spectral filtering technique can be used to estimate the original data
- The spectral filtering technique can perform poorly when there is an inherent random component in the original data
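A simplified sketch of the spectral-filtering idea – eigen-decompose the covariance of the perturbed data and project onto the dominant "signal" eigenvectors. The synthetic rank-2 data, the assumed-known noise variance, and the eigenvalue threshold are illustrative choices, not the exact procedure of the paper:

```python
# Spectral filtering sketch: separate "signal" eigenvectors of the perturbed
# covariance from noise-level eigenvalues, then project back to estimate X.
import numpy as np

rng = np.random.default_rng(2)
n, m = 5000, 10
base = rng.normal(size=(n, 2)) @ rng.normal(size=(2, m))  # correlated original data
noise = rng.normal(scale=0.5, size=(n, m))                # additive perturbation
perturbed = base + noise

centered = perturbed - perturbed.mean(axis=0)
cov = centered.T @ centered / n
vals, vecs = np.linalg.eigh(cov)

signal = vecs[:, vals > 2 * 0.5**2]        # keep eigenvalues well above noise variance
estimate = centered @ signal @ signal.T + perturbed.mean(axis=0)

# The filtered estimate is closer to the originals than the raw perturbed data:
print(np.abs(estimate - base).mean(), np.abs(perturbed - base).mean())
```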
Randomization – data utility
- Only preserves the column distribution
- Need to redesign/modify existing data mining algorithms
- Limited data mining applications: decision tree and naïve Bayes classifier
Randomization approaches
(Figure: privacy guarantee and data utility/model accuracy sit on the two sides of a balance, with a "?" in between.)
- Difficult to balance the two factors
- Low data utility
- Subject to attacks
More thoughts about perturbation
1. Preserve privacy
- Hide the original data: it should not be easy to estimate the original values from the perturbed data
- Protect from data reconstruction techniques: the attacker may have prior knowledge about the published data
2. Preserve data utility for tasks
- Single-dimensional properties (column distribution, etc.): decision tree, Bayesian classifier
- Multi-dimensional properties (covariance matrix, distance, etc.): SVM classifier, kNN classification, clustering
Multiplicative perturbations
Preserving multidimensional data properties:
- Geometric data perturbation (GDP) [Chen '07]: rotation perturbation, translation perturbation, noise addition
- Random projection perturbation (RPP) [Liu '06]

Chen, K. and Liu, L. Towards attack-resilient geometric data perturbation. SDM, 2007.
Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. TKDE, 2006.
Rotation Perturbation
G(X) = R*X
- R (m×m): an orthonormal matrix (RᵀR = RRᵀ = I)
- X (m×n): the original data set of n m-dimensional data points
- G(X) (m×n): the rotated data set
Key features:
- Preserves Euclidean distance and inner product of data points
- Preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space
Example:

Original X:
ID    001   002
age    30    25
rent 1350  1000
tax  4230  3320

Rotation matrix R:
[ .83  -.40   .40 ]
[ .20   .86   .46 ]
[ .53   .30  -.79 ]

Perturbed G(X) = R*X:
ID     001    002
age   1176    948
rent  3112   2392
tax  -2920  -2309
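A minimal sketch of rotation perturbation in Python, drawing a random orthonormal R via QR decomposition (a common construction, though not necessarily the one used in GDP) and checking distance preservation:

```python
# Rotation perturbation G(X) = R*X: R orthonormal, so pairwise Euclidean
# distances between the n column-points of X are preserved exactly.
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 100
X = rng.normal(size=(m, n))                   # n points, m dimensions (columns)

Q, _ = np.linalg.qr(rng.normal(size=(m, m)))  # random orthonormal matrix: Q.T @ Q = I
GX = Q @ X                                    # rotated data set

d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_pert = np.linalg.norm(GX[:, 0] - GX[:, 1])
assert np.isclose(d_orig, d_pert)             # Euclidean distance preserved
```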
Illustration of multiplicative data perturbation
Preserving distances while perturbing each individual dimension
Data properties
A model is "invariant" to geometric perturbation if distance plays an important role: class/cluster members and decision boundaries are correlated in terms of distance, not the concrete locations.
2D example (figure): the classification boundary between Class 1 and Class 2 is unchanged under rotation and translation, and only slightly changed under distance perturbation (noise addition).
Applicable DM algorithms
Models "invariant" to GDP (see the sketch below):
- All Euclidean-distance-based clustering algorithms
- Classification algorithms: k nearest neighbors, kernel methods, linear classifiers, support vector machines
- Most regression models
- And potentially more…
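As a quick illustration of such invariance, the sketch below (using scikit-learn's iris data and a kNN classifier, an illustrative choice) shows that accuracy is unchanged when the data is rotated and translated:

```python
# kNN depends only on Euclidean distances, so a rotated+translated copy of
# the data yields exactly the same cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)             # rows are points here
rng = np.random.default_rng(4)
R, _ = np.linalg.qr(rng.normal(size=(X.shape[1],) * 2))
T = rng.normal(size=X.shape[1])
GX = X @ R.T + T                              # G(x) = R x + T applied row-wise

knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X, y, cv=5).mean())    # original accuracy
print(cross_val_score(knn, GX, y, cv=5).mean())   # identical: distances unchanged
```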
When to Use Multiplicative Data Perturbation
(Figure: the Data Owner computes G(X) = RX + T + D and sends G(X) to the Service Provider/data user, who returns mined models/patterns F(G(X), …).)
Good for the corporate model or dataset publishing.
Major issue: curious service providers/data users may try to break G(X). Attacks!
Three levels of attacker knowledge:
1. Knows nothing → naïve estimation
2. Knows column distributions → Independent Component Analysis (ICA)
3. Knows specific points (original points and their images in the perturbed data) → distance inference
Attack 1: naïve estimation
Estimate the original points purely based on the perturbed data.
If using "random rotation" only:
- The intensity of the perturbation matters
- Points around the origin are barely moved
(Figure: 2D plots of Class 1, Class 2, and the classification boundary in the original X–Y space and after rotations of different intensity.)
Countering naïve estimation
- Maximize intensity: based on a formal analysis of "rotation intensity"; the Fast_Opt algorithm in GDP is a method to maximize it
- "Random translation" T: hides the origin and increases the difficulty of attacking – R must be estimated first in order to find T
Attack 2: ICA-based attacks
Independent Component Analysis (ICA): try to separate R and X from Y = R*X.
Characteristics of ICA:
1. Ordering of dimensions is not preserved
2. Intensity (value range) is not preserved
Conditions for an effective ICA attack:
1. Knowing the column distributions
2. Knowing the value ranges
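A hedged sketch of such an attack using scikit-learn's FastICA; the independent non-Gaussian sources are an assumption chosen so that ICA can work at all, and the result illustrates the two characteristics above (components come back in arbitrary order and scale):

```python
# ICA attack sketch: given only Y = R*X, try to separate R and X.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
n = 2000
X = np.vstack([rng.laplace(size=n),           # independent non-Gaussian rows
               rng.uniform(-1, 1, size=n),
               rng.exponential(size=n)])
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = R @ X                                     # what the attacker observes

ica = FastICA(n_components=3, random_state=0)
X_hat = ica.fit_transform(Y.T).T              # estimated sources, rows = components

# X_hat matches the rows of X only up to permutation, sign, and scaling, so
# the attacker still needs column distributions / value ranges to align them.
```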
Countering ICA attacks
Weaknesses of the ICA attack:
- Needs a certain amount of knowledge
- Cannot effectively handle dependent columns
In reality, most datasets have correlated columns, and we can find an optimal rotation perturbation maximizing the difficulty of ICA attacks.
Attack 3: distance-inference attack
(Figure: original vs. perturbed data, highlighting a known point and its image.)
With only rotation/translation perturbation, the attacker may know a set of original points and their mapping to perturbed images…
How is the attack done?
Knowing points and their images:
- Find the exact images of the known points: enumerate pairs by matched distances (less effective for large data); we assume the pairs are successfully identified
- Estimation (see the sketch below):
  1. Cancel the random translation T from the pairs (x, x')
  2. Calculate R from the pairs: Y = RX ⇒ R = Y*X⁻¹
  3. Calculate T with R and the known pairs
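A sketch of this estimation with NumPy, assuming the attacker has already matched known points to their images and that there is no noise component D:

```python
# Distance-inference estimation: recover R and T from known (x, y) pairs
# with y = R x + T (rotation + translation only).
import numpy as np

rng = np.random.default_rng(6)
m = 3
R, _ = np.linalg.qr(rng.normal(size=(m, m)))
T = rng.normal(size=(m, 1))
X_known = rng.normal(size=(m, m + 1))        # m+1 known points (columns)
Y_known = R @ X_known + T                    # their identified images

# 1. cancel T by centering both point sets
Xc = X_known - X_known.mean(axis=1, keepdims=True)
Yc = Y_known - Y_known.mean(axis=1, keepdims=True)
# 2. solve Yc = R_hat @ Xc in the least-squares sense
R_hat = Yc @ np.linalg.pinv(Xc)
# 3. recover T from the means using the estimated R
T_hat = Y_known.mean(axis=1, keepdims=True) - R_hat @ X_known.mean(axis=1, keepdims=True)

assert np.allclose(R_hat, R) and np.allclose(T_hat, T)
```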
Countering distance inference: noise addition
- Noise brings enough variance into the estimation of R and T
- Can the noise be easily filtered? One would need to know the noise distribution and the distribution of RX + T; neither distribution is published
- Note: this is very different from the attacks on additive-noise data perturbation [Kargupta03]
Attackers with more knowledge?
What if attackers know a large number of original records?
- They can accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data
- Methods such as PCA can then be used
What do we do – stop releasing any kind of data at all?
Benefits of Geometric Data Perturbation
Privacy guarantee and data utility/model accuracy are decoupled.
- Applicable to many DM algorithms: distance-based clustering; classification: linear, KNN, kernel, SVM, …
- Makes optimization and balancing easier: model accuracy is almost fully preserved, so we optimize privacy only
A randomized perturbation optimization algorithm
Start with a random rotation. Goal: pass tests on simulated attacks. Not simply random – a hill-climbing method (sketched below):
1. Iteratively determine R: test against naïve estimation (Fast_Opt) and against ICA (2nd level), keeping any better rotation R
2. Append a random translation component
3. Append an appropriate noise component
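A hedged sketch of the overall procedure; the privacy score and local search step below are simple stand-ins for the Fast_Opt and ICA tests, not the published GDP algorithm:

```python
# Hill-climbing over rotations: keep candidate rotations that score better
# on a simulated-attack test, then append translation and noise components.
import numpy as np

rng = np.random.default_rng(7)

def privacy_score(R, X):
    # Placeholder attack test: weakest per-column std of the difference
    # between the naive estimate (the perturbed data) and the original.
    return np.min(np.std(R @ X - X, axis=1))

def small_step(R, rng, step=0.1):
    # Perturb the current rotation by a small random rotation (local move).
    m = R.shape[0]
    Q, _ = np.linalg.qr(np.eye(m) + step * rng.normal(size=(m, m)))
    return Q @ R

def optimize_rotation(X, iters=200):
    m = X.shape[0]
    R, _ = np.linalg.qr(rng.normal(size=(m, m)))  # 1. start with a random rotation
    best = privacy_score(R, X)
    for _ in range(iters):
        cand = small_step(R, rng)
        score = privacy_score(cand, X)
        if score > best:                          # keep rotations that test better
            best, R = score, cand
    return R

X = rng.normal(size=(3, 500))
R = optimize_rotation(X)
T = rng.normal(size=(3, 1))                       # 2. append a random translation
D = rng.normal(scale=0.1, size=X.shape)           # 3. append a small noise component
GX = R @ X + T + D
```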
Privacy guarantee: GDP
In terms of naïve estimation and ICA-based attacks, using only the random rotation and translation components (R*X + T).
(Figure: privacy guarantees for the worst perturbation (no optimization), a perturbation optimized for naïve estimation only, and a perturbation optimized for both attacks.)
Privacy guarantee: GDP
In terms of distance-inference attacks, using all three components (R*X + T + D):
- Noise D: Gaussian N(0, σ²)
- Assume the (original, image) pairs are identified by the attacker
(Figure: with no noise addition (σ = 0) the privacy guarantee is 0; the privacy guarantee is already considerably high at the small perturbation σ = 0.1.)
Data utility: GDP with noise addition
Noise addition vs. model accuracy, with noise N(0, 0.1²): Boolean data is more sensitive to distance perturbation.
Random Projection Perturbation
Random projection projects a set of data points from a high-dimensional space to a lower-dimensional subspace:
F(X) = P*X
- X is an m×n matrix: n m-dimensional data points as columns
- P is a k×m random matrix, k ≤ m
Johnson–Lindenstrauss lemma: there is a random projection F() such that, for a small ε < 1,
(1−ε)‖x−y‖ ≤ ‖F(x)−F(y)‖ ≤ (1+ε)‖x−y‖
i.e., distance is approximately preserved.
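A minimal sketch of RPP with a Gaussian projection matrix scaled by 1/√k (one standard Johnson–Lindenstrauss construction; the dimensions are illustrative assumptions):

```python
# Random projection perturbation F(X) = P*X: project m-dimensional points
# down to k dimensions while approximately preserving pairwise distances.
import numpy as np

rng = np.random.default_rng(8)
m, n, k = 100, 50, 40                        # original dims, points, reduced dims
X = rng.normal(size=(m, n))
P = rng.normal(size=(k, m)) / np.sqrt(k)     # E[||Px - Py||^2] = ||x - y||^2
FX = P @ X

d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_proj = np.linalg.norm(FX[:, 0] - FX[:, 1])
print(d_orig, d_proj)                        # close, within the (1±ε) JL bounds w.h.p.
```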
Data utility: RPP
Reduced number of dimensions vs. model accuracy, for KNN classifiers and SVMs.
Random projection vs. geometric perturbation
Privacy preservation:
- Subject to similar kinds of attacks
- RPP is more resilient to distance-based attacks
Utility preservation (model accuracy):
- GDP: R and T exactly preserve distances; the effect of D needs experimental evaluation
- RPP: approximately preserves distances; trade-off between the number of projected dimensions and utility
Coming up
- Output perturbation
- Cryptographic protocols
Methodology of attack analysis
An attack is an estimate of the original data.
Original O = (x1, x2, …, xn) vs. estimate P = (x'1, x'2, …, x'n): how similar are these two series?
One effective method is to evaluate the variance/standard deviation of the difference [Rakesh00]: Var(P−O) or std(P−O), where P is the estimated series and O the original.
Two multi-column privacy metrics
qi: the privacy guarantee for column i, qi = std(Pi − Oi), where Oi are the normalized original column values and Pi the estimated column values.
- Min privacy guarantee (the weakest link of all columns): min{ qi, i = 1..d }
- Avg privacy guarantee (overall privacy guarantee): (1/d) Σi qi
Both are sketched below.
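A small sketch of both metrics in NumPy; normalizing both series by the original column statistics is an assumption about the normalization step, and the synthetic "attack" estimates are illustrative:

```python
# Multi-column privacy metrics: q_i = std(P_i - O_i) on normalized columns,
# then the min (weakest link) and the average over the d columns.
import numpy as np

def privacy_guarantees(O, P):
    """O, P: d x n arrays of original and estimated column series."""
    mu = O.mean(axis=1, keepdims=True)
    sd = O.std(axis=1, keepdims=True)
    q = np.std((P - mu) / sd - (O - mu) / sd, axis=1)  # per-column guarantee q_i
    return q.min(), q.mean()                           # min and avg guarantees

rng = np.random.default_rng(9)
O = rng.normal(size=(3, 1000))
P = O + rng.normal(scale=[[0.1], [0.5], [1.0]], size=(3, 1000))  # an "attack"
print(privacy_guarantees(O, P))   # weakest link roughly 0.1, average roughly 0.53
```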