Privacy preserving data mining – multiplicative perturbation techniques Li Xiong CS573 Data Privacy and Anonymity


Page 1:

Privacy preserving data mining – multiplicative perturbation techniques

Li Xiong

CS573 Data Privacy and Anonymity

Page 2:

Outline

- Review and critique of randomization approaches (additive noise)
- Multiplicative data perturbations
  - Rotation perturbation
  - Geometric data perturbation
  - Random projection
- Comparison

Page 3:


Additive noise (randomization)

Database (x1 … xn) → reveal the entire database, but randomize the entries → user receives x1+ε1 … xn+εn

- Add random noise εi to each database entry xi
- For example, if the distribution of the noise has mean 0, the user can still compute the average of the xi
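A minimal numpy sketch of this scheme (the column values and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(20, 70, size=10_000).astype(float)   # original entries x1 ... xn

eps = rng.normal(loc=0.0, scale=15.0, size=x.shape)   # mean-0 noise eps_i
released = x + eps                                    # published entries x_i + eps_i

# Individual entries are hidden, but since E[eps] = 0 the released column's
# mean is still an unbiased estimate of the true mean of the x_i.
print(x.mean(), released.mean())
```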

Page 4:

Learning decision tree on randomized data

Original records:    50 | 40K | …    30 | 70K | …   (Alice's age: 30)

Add a random number to Age: e.g., 30 becomes 65 (30 + 35)

Randomized records:  65 | 20K | …    25 | 60K | …

Randomizer → reconstruct distribution of Age / reconstruct distribution of Salary → classification algorithm → model
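A discretized sketch of the distribution-reconstruction step, in the spirit of the iterative Bayesian procedure of [Rakesh00] (Gaussian noise and the grid discretization are my assumptions, not the paper's exact code):

```python
import numpy as np
from scipy.stats import norm

def reconstruct_distribution(w, noise_std, bins, n_iters=20):
    """Iteratively estimate the original value distribution from randomized
    values w_i = x_i + noise, on a discrete grid of candidate values `bins`."""
    fx = np.full(len(bins), 1.0 / len(bins))       # start from a uniform estimate
    for _ in range(n_iters):
        # like[i, j]: likelihood of observing w[i] if the original were bins[j]
        like = norm.pdf(w[:, None] - bins[None, :], scale=noise_std)
        post = like * fx[None, :]                  # Bayes: posterior over bins per record
        post /= post.sum(axis=1, keepdims=True)
        fx = post.mean(axis=0)                     # averaged posteriors -> new estimate
    return fx                                      # estimated density of Age over `bins`
```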

Page 5:

Summary on additive perturbations

Benefits:
- Easy to apply: applied separately to each data point (record)
- Low cost
- Can be used for both the web model and the corporate model

(Diagram: users 1…n each hold private info x1 … xn and send the perturbed values x1+ε1 … xn+εn to the web application, which collects the data.)

Page 6:

Additive perturbations - privacy

- Need to publish the noise distribution
- The column distribution is disclosed

Subject to data value attacks!

Kargupta et al., On the Privacy Preserving Properties of Random Data Perturbation Techniques, 2003.

Page 7:

The spectral filtering technique can be used to estimate the original data
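A rough numpy reconstruction of the spectral filtering idea (assuming the attacker knows the noise variance; a sketch, not the exact procedure from the paper):

```python
import numpy as np

def spectral_filter(Y, noise_var):
    """Estimate the original data X from Y = X + E.

    Y: (n_points, n_dims) perturbed data; noise_var: variance of the i.i.d.
    additive noise E, assumed known to the attacker."""
    mu = Y.mean(axis=0)
    Yc = Y - mu                                  # center the perturbed data
    cov = np.cov(Yc, rowvar=False)               # sample covariance of Y
    evals, evecs = np.linalg.eigh(cov)           # eigen-decomposition
    # Directions with eigenvalues above the noise level are dominated by the
    # signal; the remaining directions are dominated by noise and discarded.
    V = evecs[:, evals > noise_var]
    return Yc @ V @ V.T + mu                     # project onto the signal subspace
```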

Page 8:

Page 9:

The spectral filtering technique can perform poorly when there is an inherent random component in the original data

Page 10:

Randomization – data utility

- Only preserves the column distribution
- Existing data mining algorithms need to be redesigned/modified
- Limited data mining applications: decision tree and naïve Bayes classifiers

Page 11:

Randomization approaches

(Diagram: privacy guarantee vs. data utility/model accuracy, pulling against each other.)

- Difficult to balance the two factors
- Low data utility
- Subject to attacks

Page 12:

More thoughts about perturbation

1. Preserve privacy
   - Hide the original data: it should not be easy to estimate the original values from the perturbed data
   - Protect against data reconstruction techniques: the attacker may have prior knowledge about the published data
2. Preserve data utility for tasks
   - Single-dimensional properties (column distribution, etc.): decision tree, Bayesian classifier
   - Multi-dimensional properties (covariance matrix, distance, etc.): SVM classifier, kNN classification, clustering

Page 13:

Multiplicative perturbations

Preserving multidimensional data properties:
- Geometric data perturbation (GDP) [Chen '07]
  - Rotation data perturbation
  - Translation data perturbation
  - Noise addition
- Random projection perturbation (RPP) [Liu '06]

Chen, K. and Liu, L. Towards attack-resilient geometric data perturbation. SDM, 2007.
Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. TKDE, 2006.

Page 14:

Rotation Perturbation

G(X) = R*X

Key features:
- Preserves the Euclidean distance and inner product of data points
- Preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space

Notation:
- R (m×m): an orthonormal matrix (R^T R = R R^T = I)
- X (m×n): the original data set, with n m-dimensional data points as columns
- G(X) (m×n): the rotated data set

Example: G(X) = R * X

  X (original):              G(X) (rotated):
  ID     001    002          ID     001    002
  age     30     25          age   1176    948
  rent  1350   1000          rent  3112   2392
  tax   4230   3320          tax  -2920  -2309

  R = [ .83  -.40   .40
        .20   .86   .46
        .53   .30  -.79 ]
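A short numpy sketch of rotation perturbation on this example (drawing R via QR decomposition of a Gaussian matrix is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data: 2 points (columns) with dimensions age, rent, tax.
X = np.array([[30.0, 25.0],
              [1350.0, 1000.0],
              [4230.0, 3320.0]])

# A random orthonormal matrix R (R.T @ R = I) from the QR decomposition
# of a random Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
GX = R @ X                                   # rotated data G(X) = R * X

# The pairwise Euclidean distance between the two points is preserved:
print(np.linalg.norm(X[:, 0] - X[:, 1]))
print(np.linalg.norm(GX[:, 0] - GX[:, 1]))
```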

Page 15:

Illustration of multiplicative data perturbation

Preserving distances while perturbing each individual dimension.

Page 16:

Data properties

A model is invariant to geometric perturbation if:
- Distance plays an important role
- Class/cluster membership and decision boundaries are correlated in terms of distance, not concrete locations

2D example (figure): the classification boundary between Class 1 and Class 2 is preserved exactly under rotation and translation, and only slightly changed under distance perturbation (noise addition).

Page 17:

Applicable DM algorithms

Models "invariant" to GDP:
- All Euclidean distance based clustering algorithms
- Classification algorithms:
  - k-nearest neighbors
  - Kernel methods
  - Linear classifiers
  - Support vector machines
- Most regression models
- And potentially more …

Page 18:

When to Use Multiplicative Data Perturbation

(Diagram: the data owner computes G(X) = RX + T + D and releases G(X); the service provider/data user applies a mining function F to G(X) and returns the mined models/patterns.)

Good for the corporate model or dataset publishing. Major issue: curious service providers/data users may try to break G(X)!
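A minimal numpy sketch of the full perturbation G(X) = RX + T + D (how R and T are drawn here is an illustrative assumption):

```python
import numpy as np

def geometric_perturbation(X, noise_sigma=0.1, rng=np.random.default_rng(0)):
    """G(X) = R X + T + D: random rotation R, random translation T
    (the same vector added to every data point), and Gaussian noise D."""
    m, n = X.shape
    R, _ = np.linalg.qr(rng.standard_normal((m, m)))    # rotation component
    t = rng.random((m, 1))                              # translation vector
    D = rng.normal(0.0, noise_sigma, size=(m, n))       # noise component
    return R @ X + t + D                                # t broadcasts over columns
```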

Page 19:

Attacks!

Three levels of knowledge:
1. Know nothing → naïve estimation
2. Know column distributions → Independent Component Analysis (ICA)
3. Know specific points (original points and their images in the perturbed data) → distance inference

Page 20:

Attack 1: naïve estimation

Estimate the original points purely based on the perturbed data. If using "random rotation" only:
- The intensity of the perturbation matters
- Points around the origin stay close to their original positions (rotation fixes the origin)

(Figure: 2D examples of Class 1, Class 2, and the classification boundary under rotations of different intensity.)

Page 21:

Countering naïve estimation

- Maximize intensity
  - Based on a formal analysis of "rotation intensity"
  - Method to maximize intensity: the Fast_Opt algorithm in GDP
- "Random translation" T
  - Hides the origin and increases the difficulty of attack
  - The attacker needs to estimate R first in order to find T

Page 22:

Attack 2: ICA-based attacks

Independent Component Analysis (ICA): try to separate R and X from Y = R*X.
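An illustrative sketch of such an attack with scikit-learn's FastICA (the toy data generation is hypothetical):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Toy setup: 3 statistically independent, non-Gaussian columns (rows of X),
# 500 data points as columns, rotated by an orthonormal R unknown to the attacker.
X = rng.exponential(size=(3, 500))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Y = R @ X                                   # all the attacker sees

ica = FastICA(n_components=3, random_state=0)
S = ica.fit_transform(Y.T).T                # estimated independent rows

# S recovers the rows of X only up to permutation, sign, and scale -- which
# is exactly why the attacker also needs column distributions / value ranges.
```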

Page 23:

Characteristics of ICA

1. Ordering of dimensions is not preserved.
2. Intensity (value range) is not preserved.

Conditions for an effective ICA attack:
1. Knowing the column distributions.
2. Knowing the value ranges.

Page 24:

Countering the ICA attack

Weaknesses of the ICA attack:
- Needs a certain amount of knowledge
- Cannot effectively handle dependent columns

In reality…
- Most datasets have correlated columns
- We can find an optimal rotation perturbation that maximizes the difficulty of ICA attacks

Page 25:

Attack 3: distance-inference attack

With only rotation/translation perturbation, if the attacker knows a set of original points and their mapping…

(Figure: original vs. perturbed data, with a known point and its image.)

Page 26:

How is the attack done?

Knowing points and their images:
- Find the exact images of the known points: enumerate pairs by matched distances
- This is less effective for large data; we assume the pairs are successfully identified

Estimation:
1. Cancel the random translation T from the pairs (x, x')
2. Calculate R from the pairs: Y = RX, so R = Y * X^{-1}
3. Calculate T with R and the known pairs
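A numpy sketch of the estimation steps above (assuming the attacker has already matched k > m pairs):

```python
import numpy as np

def distance_inference(known_X, known_Y):
    """Estimate R and T from k known pairs (x, x') with x' = R x + t.

    known_X, known_Y: (m, k) arrays, one known point per column, k > m."""
    mx = known_X.mean(axis=1, keepdims=True)
    my = known_Y.mean(axis=1, keepdims=True)
    # Step 1: cancel the translation by centering both point sets.
    Xc, Yc = known_X - mx, known_Y - my
    # Step 2: solve Yc = R Xc for R (pseudo-inverse handles k > m).
    R_hat = Yc @ np.linalg.pinv(Xc)
    # Step 3: recover the translation from the means: my = R mx + t.
    t_hat = my - R_hat @ mx
    return R_hat, t_hat
```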

Page 27:

Countering distance inference: noise addition

- Noise brings enough variance into the estimation of R and T
- Can the noise be easily filtered?
  - The attacker would need to know the noise distribution and the distribution of RX + T
  - Neither distribution is published, however

Note: this setting is very different from the attacks on additive-noise data perturbation [Kargupta03].

Page 28:

Attackers with more knowledge?

What if attackers know a large number of the original records?
- They can accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data
- Methods such as PCA can then be used

What do we do? Stop releasing any kind of data at all!

Page 29:

Benefits of Geometric Data Perturbation

Privacy guarantee and data utility/model accuracy are decoupled.

- Applicable to many DM algorithms: distance-based clustering; classification (linear, kNN, kernel, SVM, …)
- Makes optimization and balancing easier: model accuracy is almost fully preserved, so we optimize privacy only

Page 30:

A randomized perturbation optimization algorithm

- Start with a random rotation
- Goal: passing tests on simulated attacks
- Not simply random – a hill-climbing method (see the sketch below)

1. Iteratively determine R:
   - Test against naïve estimation (Fast_Opt)
   - Test against ICA (2nd level)
   - Find a better rotation R
2. Append a random translation component
3. Append an appropriate noise component
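A hypothetical sketch of the hill-climbing loop (the local-move construction and the stand-in privacy score are my assumptions; the real optimizer scores R by simulating the naïve-estimation and ICA attacks):

```python
import numpy as np

rng = np.random.default_rng(0)

def nearby_rotation(R, step=0.1):
    """A small random move in rotation space: QR of a near-identity matrix."""
    m = R.shape[0]
    A = rng.standard_normal((m, m)) * step
    Q, _ = np.linalg.qr(np.eye(m) + (A - A.T) / 2)   # skew part -> near-identity rotation
    return Q @ R

def hill_climb_rotation(X, privacy_score, n_iters=200):
    """Keep proposing nearby rotations; accept a candidate only if the
    simulated-attack score improves."""
    m = X.shape[0]
    R, _ = np.linalg.qr(rng.standard_normal((m, m)))  # random starting rotation
    best = privacy_score(R, X)
    for _ in range(n_iters):
        cand = nearby_rotation(R)
        score = privacy_score(cand, X)
        if score > best:                              # hill climbing: keep improvements
            R, best = cand, score
    return R

# Stand-in score: deviation between original and rotated entries, a crude
# proxy for resilience to naive estimation (not the real Fast_Opt metric).
naive_score = lambda R, X: float(np.std(R @ X - X))
```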

Page 31:

Privacy guarantee: GDP

In terms of naïve estimation and ICA-based attacks, using only the random rotation and translation components (R*X + T).

(Chart comparing: the worst perturbation (no optimization), a perturbation optimized against naïve estimation only, and a perturbation optimized against both attacks.)

Page 32:

Privacy guarantee: GDP

In terms of distance-inference attacks, using all three components (R*X + T + D):
- Noise D: Gaussian N(0, σ²)
- Assume pairs of (original, image) points are identified by the attackers

(Chart: with no noise addition the privacy guarantee is 0; the privacy guarantee is already considerably high at the small perturbation σ = 0.1.)

Page 33:

Data utility: GDP with noise addition

Noise addition vs. model accuracy, with noise N(0, 0.1²).

Boolean data is more sensitive to distance perturbation.

Page 34:

Random Projection Perturbation

Random projection projects a set of data points from a high-dimensional space to a lower-dimensional subspace: F(X) = P*X
- X is an m×n matrix, with n m-dimensional data points as columns
- P is a k×m random matrix, k <= m

Johnson-Lindenstrauss Lemma: there is a random projection F() with ε a small number (< 1) such that

  (1-ε)||x-y|| <= ||F(x)-F(y)|| <= (1+ε)||x-y||

i.e., distance is approximately preserved.
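A minimal numpy sketch of RPP (the 1/sqrt(k) scaling of a Gaussian P is one standard choice, assumed here for illustration):

```python
import numpy as np

def random_projection(X, k, rng=np.random.default_rng(0)):
    """F(X) = P X: project n m-dimensional points (columns of X) down to
    k <= m dimensions. Scaling the Gaussian entries by 1/sqrt(k) makes
    squared distances correct in expectation, giving JL-style (1 ± ε)
    distance preservation for moderately large k."""
    m = X.shape[0]
    P = rng.standard_normal((k, m)) / np.sqrt(k)   # k x m random matrix
    return P @ X
```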

Page 35:

Data utility: RPP

(Charts: reduced number of dimensions vs. model accuracy, for kNN classifiers and for SVMs.)

Page 36:

Random projection vs. geometric perturbation

Privacy preservation:
- Subject to similar kinds of attacks
- RPP is more resilient to distance-based attacks

Utility preservation (model accuracy):
- GDP: R and T exactly preserve distances; the effect of D needs experimental evaluation
- RPP: approximately preserves distances; trade-off between the number of projected dimensions and utility

Page 37:

Coming up

- Output perturbation
- Cryptographic protocols

Page 38:

Methodology of attack analysis

An attack is an estimate of the original data.

- Original O = (x1, x2, …, xn) vs. estimate P = (x'1, x'2, …, x'n): how similar are these two series?
- One effective method is to evaluate the variance/standard deviation of the difference [Rakesh00]:

  Var(P – O) or std(P – O), where P is the estimate and O the original

Page 39:

Two multi-column privacy metrics

- qi: privacy guarantee for column i; qi = std(Pi – Oi), where Oi are the normalized original column values and Pi the estimated column values
- Min privacy guarantee (the weakest link over all columns): min { qi, i = 1..d }
- Avg privacy guarantee (overall privacy guarantee): (1/d) Σ qi
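A small numpy sketch of these two metrics (normalizing both series with the original columns' statistics is my reading of the normalization step):

```python
import numpy as np

def privacy_guarantees(O, P):
    """O, P: (d, n) arrays of original and estimated values, one row per column.

    Returns (min, avg) privacy guarantee over the d columns."""
    mu = O.mean(axis=1, keepdims=True)
    sd = O.std(axis=1, keepdims=True)
    O_n, P_n = (O - mu) / sd, (P - mu) / sd     # normalize columns
    q = np.std(P_n - O_n, axis=1)               # q_i = std(P_i - O_i)
    return q.min(), q.mean()                    # min and (1/d) * sum of q_i
```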