
Page 1: Multiplicative Data Perturbations

Multiplicative Data Perturbations

Page 2: Multiplicative Data Perturbations

Outline
- Introduction
- Multiplicative data perturbations: rotation perturbation, geometric data perturbation, random projection
- Understanding distance preservation
- Perturbation-invariant models
- Attacks: privacy evaluation model, background knowledge and attack analysis, attack-resilient optimization
- Comparison

Page 3: Multiplicative Data Perturbations

Summary of additive perturbations

Problems
- Weak against various attacks: the noise distribution needs to be published, and the column distributions are known.
- Data mining algorithms need to be developed or revised to utilize the perturbed data; so far, we have only seen that decision tree and naïve Bayes classifiers can utilize additive perturbation.

Benefits
- Can be applied to both the Web model and the collaborative data pooling model
- Low cost

Page 4: Multiplicative Data Perturbations

More thoughts about perturbation

1. Preserve privacy
- Hide the original data: it should not be easy to estimate the original values from the perturbed data.
- Protect against data reconstruction techniques: the attacker may have prior knowledge about the published data.

2. Preserve data utility for tasks
- Single-dimensional information: column data distributions, etc.
- Multi-dimensional information: covariance matrix, distances, etc.

Page 5: Multiplicative Data Perturbations

For most PP approaches…

The trade-off between privacy guarantee and data utility/model accuracy:
- Difficult to balance the two factors
- Subject to attacks
- May need new DM algorithms: randomization, cryptographic approaches

Page 6: Multiplicative Data Perturbations

Multiplicative perturbations
- Geometric data perturbation (GDP): rotation data perturbation + translation data perturbation + noise addition
- Random projection perturbation (RPP)
- Sketch approach

Page 7: Multiplicative Data Perturbations

Definition of Geometric Data Perturbation

G(X) = R*X + T + D
- R: random rotation
- T: random translation
- D: random noise, e.g., Gaussian noise

Characteristics: R and T preserve distances exactly; D slightly perturbs distances.

Example (two records):

Original data X:
ID   age   rent   tax
001  30    1350   4230
002  25    1000   3320

Rotation R:
 .83  -.40   .40
 .20   .86   .46
 .53   .30  -.79

Translation T (applied to every record): (12, 18, 30)

Noise D (one column per record):
 -.4    .29
-1.7   -1.1
 .13    1.2

Perturbed data G(X) = R*X + T + D:
ID   age    rent   tax
001  1158   3143   -2919
002  943    2424   -2297

* Each component has its use to enhance the resilience to attacks!
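Below is a minimal NumPy sketch of the G(X) = R*X + T + D construction, with records stored as columns as in the definition above. The random seed, the translation range, and the noise_std value are illustrative assumptions, and drawing a random orthogonal matrix via QR decomposition is one common choice rather than anything prescribed by the slides.

```python
import numpy as np

def geometric_data_perturbation(X, noise_std=0.1, rng=None):
    """Perturb X (d x n, one record per column) as G(X) = R*X + T + D."""
    rng = np.random.default_rng(rng)
    d, n = X.shape
    # Random orthogonal (distance-preserving) matrix R via QR decomposition.
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))
    # Random translation T, broadcast across all n records.
    T = rng.uniform(-1.0, 1.0, size=(d, 1))
    # Gaussian noise D, slightly perturbing distances.
    D = rng.normal(0.0, noise_std, size=(d, n))
    return R @ X + T + D, (R, T)

# The two records from the example above (rows: age, rent, tax; columns: records).
X = np.array([[30.0, 25.0],
              [1350.0, 1000.0],
              [4230.0, 3320.0]])
G_X, (R, T) = geometric_data_perturbation(X, noise_std=0.1, rng=0)
```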

Page 8: Multiplicative Data Perturbations

Benefits of Geometric Data Perturbation

With GDP, the privacy guarantee and the data utility/model accuracy are decoupled, which makes optimization and balancing easier: model accuracy is almost fully preserved, so we optimize privacy only.

Applicable to many DM algorithms: distance-based clustering; classification (linear classifiers, KNN, kernel methods, SVM, …).

Resilient to attacks (the result of the attack research).

Page 9: Multiplicative Data Perturbations

Limitations

Multiplicative perturbations are mostly used in outsourcing (cloud computing); they can also be applied to multiparty collaborative computing in some cases.

The Web model does not fit: the perturbation parameters cannot be published.

Page 10: Multiplicative Data Perturbations

Definition of Random Projection Perturbation

F(X) = P*X
- X is an m*n matrix: m rows (dimensions) and n columns (records)
- P is a k*m random matrix, with k <= m

Johnson-Lindenstrauss lemma: there is a random projection F() such that, for a small e < 1,

(1 - e)||x - y|| <= ||F(x) - F(y)|| <= (1 + e)||x - y||

i.e., distances are approximately preserved.
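A minimal sketch of F(X) = P*X with a quick empirical distance check. The 1/sqrt(k) scaling of Gaussian entries is an assumption (a standard Johnson-Lindenstrauss-style construction); the slides do not fix a particular P, and the data sizes are illustrative.

```python
import numpy as np

def random_projection_perturbation(X, k, rng=None):
    """F(X) = P*X with P a k x m random matrix (k <= m), records as columns."""
    rng = np.random.default_rng(rng)
    m, _ = X.shape
    # Gaussian entries scaled by 1/sqrt(k) so distances are roughly preserved.
    P = rng.standard_normal((k, m)) / np.sqrt(k)
    return P @ X

# Rough empirical check of approximate distance preservation.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))               # 50 dimensions, 100 records
F_X = random_projection_perturbation(X, k=30, rng=1)
before = np.linalg.norm(X[:, 0] - X[:, 1])
after = np.linalg.norm(F_X[:, 0] - F_X[:, 1])
print(before, after)                             # close, within a (1 +/- e) factor
```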

Page 11: Multiplicative Data Perturbations

Comparison between GDP and RPP

Privacy preservation
- Both are subject to similar kinds of attacks.
- RPP is more resilient to distance-based attacks.

Utility preservation (model accuracy)
- GDP preserves distances well.
- RPP only approximately preserves distances, so model accuracy is not guaranteed.

Page 12: Multiplicative Data Perturbations

Illustration of multiplicative data perturbation

Preserving distances while perturbing each individual dimension

Page 13: Multiplicative Data Perturbations

A model "invariant" to GDP … if distance plays an important role

Class/cluster membership and decision boundaries are defined in terms of distances, not the concrete locations of the points.

2D example (figure, three panels): the original classification boundary between Class 1 and Class 2; after rotation and translation, the boundary is unchanged relative to the classes; after distance perturbation (noise addition), the classification boundary is only slightly changed.

Page 14: Multiplicative Data Perturbations

Applicable DM algorithms

Modeling methods that depend on Euclidean geometric properties are "invariant" to GDP:
- All Euclidean-distance-based clustering algorithms
- Classification algorithms: k-nearest neighbors, kernel methods, linear classifiers, support vector machines
- Most regression models
- And potentially more …
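A quick sketch of this invariance for a distance-based classifier, assuming scikit-learn is available. The synthetic dataset, the number of neighbors, and the translation range are illustrative assumptions; the point is only that rotation + translation leaves KNN predictions unchanged (ties and floating-point effects aside).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Rotation + translation only (no noise): pairwise distances are unchanged.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
T = rng.uniform(-1.0, 1.0, size=5)
X_pert = X @ Q.T + T                      # records are rows here, so R acts on the right

knn_orig = KNeighborsClassifier(n_neighbors=5).fit(X, y)
knn_pert = KNeighborsClassifier(n_neighbors=5).fit(X_pert, y)

# New points, perturbed with the same R and T, get the same predictions.
X_new = rng.standard_normal((10, 5))
print(np.array_equal(knn_orig.predict(X_new),
                     knn_pert.predict(X_new @ Q.T + T)))
```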

Page 15: Multiplicative Data Perturbations

When to Use Multiplicative Data Perturbation

Outsourcing workflow: the data owner perturbs the data as G(X) = RX + T + D and sends G(X) to the service provider/data user, who mines models/patterns F(G(X), …) and applies F to G(Xnew).

Good for the outsourcing model. Major issue: curious service providers/data users may try to break G(X).

Page 16: Multiplicative Data Perturbations

Major issue: attacks!

Many existing privacy-preserving methods are found to be much less effective once attacks are considered, e.g., the various data reconstruction algorithms against the random noise addition approach [Huang05][Guo06].

Prior knowledge: the service provider Y has "PRIOR KNOWLEDGE" about X's domain, and nothing stops Y from using it to infer information from the sanitized data.

Page 17: Multiplicative Data Perturbations

Knowledge used to attack GDP

Three levels of knowledge:
- Know nothing: naïve estimation
- Know column distributions: Independent Component Analysis (ICA)
- Know specific input-output records (original points and their images in the perturbed data): distance inference

Page 18: Multiplicative Data Perturbations

Methodology of attack analysis

An attack produces an estimate of the original data: original O = (x1, x2, …, xn) vs. estimate P = (x'1, x'2, …, x'n).

How similar are these two series? One effective method is to evaluate the MSE of the estimation: VAR(P - O) or STD(P - O).

Page 19: Multiplicative Data Perturbations

Two multi-column privacy metrics

qi: privacy guarantee for column i, defined as qi = std(Pi - Oi), where Oi are the normalized original column values and Pi the estimated column values.

Min privacy guarantee, the weakest link over all columns: min { qi, i = 1..d }

Avg privacy guarantee, the overall guarantee: (1/d) * Σ qi
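A small sketch of both metrics. The function name and the choice to normalize each column by its original standard deviation are assumptions made so that columns on different scales are comparable; the slides only say the columns are normalized.

```python
import numpy as np

def privacy_guarantees(O, P):
    """Column-wise privacy guarantee q_i = std(P_i - O_i) on normalized columns.

    O: original data, P: attacker's estimate, both (n_records, d) arrays.
    Returns the minimum and the average guarantee over the d columns.
    """
    O = np.asarray(O, dtype=float)
    P = np.asarray(P, dtype=float)
    scale = O.std(axis=0)
    scale[scale == 0] = 1.0                 # avoid dividing by zero
    q = ((P - O) / scale).std(axis=0)       # q_i for each column
    return q.min(), q.mean()
```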

Page 20: Multiplicative Data Perturbations

Alternative metric

Based on Agrawal's information-theoretic measure, the loss of privacy is PI = 1 - 2^(-I(X; X^)), where X^ is the estimate of X and I(X; X^) = H(X) - H(X|X^) = H(X) - H(estimation error).
- Exact estimation: H(X|X^) = 0, so PI = 1 - 2^(-H(X))
- Random estimation: I(X; X^) = 0, so PI = 0
- The measure is already normalized across different columns
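A rough sketch of how this measure could be estimated from samples of one column. The histogram discretization, the bin count, and the use of H(X - X^) for the estimation-error entropy are assumptions for illustration; proper differential-entropy estimation is more delicate than this.

```python
import numpy as np

def entropy_bits(values, bins=32):
    """Histogram-based entropy estimate in bits (bin count is an assumption)."""
    counts, _ = np.histogram(np.asarray(values, dtype=float), bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def loss_of_privacy(x, x_hat, bins=32):
    """PI = 1 - 2^(-I(X; X^)), with I(X; X^) approximated by H(X) - H(X - X^)."""
    mi = entropy_bits(x, bins) - entropy_bits(np.asarray(x) - np.asarray(x_hat), bins)
    return 1.0 - 2.0 ** (-max(mi, 0.0))
```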

Page 21: Multiplicative Data Perturbations

Attack 1: naïve estimation

Estimate the original points purely from the perturbed data (if "random rotation" only is used).
- The intensity of the perturbation matters, particularly for points around the origin.

2D illustration (figure): the classification boundary between Class 1 and Class 2 under rotations of different intensities.

Page 22: Multiplicative Data Perturbations

Counter naïve estimation: maximize intensity
- Based on a formal analysis of "rotation intensity"; the Fast_Opt algorithm in GDP is a method to maximize the intensity.
- "Random translation" T hides the origin and increases the difficulty of attacking: the attacker needs to estimate R first in order to find T.

Page 23: Multiplicative Data Perturbations

Attack 2: ICA-based attacks

Independent Component Analysis (ICA) tries to separate R and X from Y = R*X.
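A sketch of the attack idea using scikit-learn's FastICA; the particular ICA implementation, the Laplace-distributed synthetic data, and the data sizes are assumptions for illustration (the slides only name ICA).

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Synthetic "original" data with independent, non-Gaussian columns
# (the favorable case for ICA); 3 dimensions, records as columns.
d, n = 3, 2000
X = rng.laplace(size=(d, n))

# Rotation-only perturbation: Y = R * X.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
Y = R @ X

# The attacker runs ICA on the perturbed data (FastICA expects samples as rows).
ica = FastICA(n_components=d, random_state=0)
X_est = ica.fit_transform(Y.T).T

# X_est matches X only up to permutation, sign, and scaling of the rows,
# which is why knowing column distributions and value ranges matters (next slide).
```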

Page 24: Multiplicative Data Perturbations

Characteristics of ICA

1. The ordering of dimensions is not preserved.
2. The intensity (value range) is not preserved.

Conditions for an effective ICA attack:
1. Knowing the column distributions
2. Knowing the value ranges

Page 25: Multiplicative Data Perturbations

Counter ICA attack

Weaknesses of the ICA attack: it needs a certain amount of knowledge, and it cannot effectively handle dependent columns.

In reality, most datasets have correlated columns, and we can find an optimal rotation perturbation that maximizes the difficulty of ICA attacks.

Page 26: Multiplicative Data Perturbations

Attack 3: distance-inference attack

If only rotation/translation perturbation is used, and the attacker knows a set of original points and their mapping (figure: a known point in the original data and its image in the perturbed data) …

Page 27: Multiplicative Data Perturbations

How is the attack done …

Knowing points and their images:
- Find the exact images of the known points by enumerating pairs with matched distances. This is less effective for large data; here we assume the pairs are successfully identified.

Estimation (a sketch follows below):
1. Cancel the random translation T from the pairs (x, x').
2. Calculate R from the pairs: Y = RX, so R = Y*X^(-1).
3. Calculate T using R and the known pairs.
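A minimal sketch of these three steps, assuming the attacker holds d+1 matched (original, image) pairs and that no noise component was added; function and variable names are illustrative.

```python
import numpy as np

def recover_rotation_translation(X_known, Y_known):
    """Recover R and T from known (original, image) pairs when G(x) = R*x + T.

    X_known, Y_known: (d, d+1) arrays, one pair per column; the pairwise
    differences must span the d-dimensional space for the inverse to exist.
    """
    # 1. Cancel T by differencing every pair against the last one.
    dX = X_known[:, :-1] - X_known[:, -1:]
    dY = Y_known[:, :-1] - Y_known[:, -1:]
    # 2. Solve R * dX = dY, i.e. R = dY * dX^-1.
    R_est = dY @ np.linalg.inv(dX)
    # 3. Recover T from any single pair.
    T_est = Y_known[:, :1] - R_est @ X_known[:, :1]
    return R_est, T_est
```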

Page 28: Multiplicative Data Perturbations

Counter distance-inference: Noise addition

The noise brings enough variance into the estimation of R and T: the attacker now has to use regression to estimate R, and then use the approximate R to estimate T, which increases the uncertainty.

Regression:
1. Cancel the random translation T from the pairs (x, x').
2. Estimate R from the pairs: Y = RX, so R ≈ (Y*X^T)(X*X^T)^(-1) (see the sketch after this list).
3. Use the estimated R and the known pairs to estimate T.
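A sketch of the regression step. Cancelling T by subtracting the per-row means of each set of pairs is an assumption (it removes T up to the zero-mean noise); solving via least squares gives the same estimate as the explicit (Y*X^T)(X*X^T)^(-1) formula but is numerically safer.

```python
import numpy as np

def estimate_R_least_squares(X_pairs, Y_pairs):
    """Least-squares estimate R ~ (Y*X^T)(X*X^T)^-1 from noisy known pairs.

    X_pairs, Y_pairs: (d, k) arrays of known originals and their noisy images,
    with the translation already cancelled (e.g. by mean-subtraction).
    """
    # lstsq solves X^T * R^T ~ Y^T, which is the same normal-equation solution.
    R_T, *_ = np.linalg.lstsq(X_pairs.T, Y_pairs.T, rcond=None)
    return R_T.T
```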

Page 29: Multiplicative Data Perturbations

Discussion

Can the noise be easily filtered?
- Filtering requires knowing the noise distribution and the distribution of RX + T; neither distribution is published, however.
- The attack analysis therefore differs from that for additive noise perturbation.
- Will PCA-based noise filtering [Huang05] be effective?

What is the best estimate the attacker can get?
- If we treat the attack problem as a learning problem: the minimum variance of the learner's error gives an upper bound on the "loss of privacy".

Page 30: Multiplicative Data Perturbations

Attackers with more knowledge?

What if attackers know a large number of original records? They can then accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data.

Methods such as PCA and AK_ICA can be used.

What do we do? If you have already released that much original information, stop releasing data.

Page 31: Multiplicative Data Perturbations

A randomized perturbation optimization algorithm

- Start with a random rotation.
- Goal: pass tests on simulated attacks.
- Not simply random: a hill-climbing method (a schematic sketch follows below).
  1. Iteratively determine R: test against naïve estimation (Fast_Opt) and against ICA (the 2nd knowledge level) to find a better rotation R.
  2. Append a random translation component.
  3. Append an appropriate noise component.
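A schematic sketch of the loop, under the assumption that the simulated-attack tests are available as scoring functions passed in by the caller; Fast_Opt itself and the real test procedures are not reproduced here, and the scoring, iteration count, and noise scale are illustrative.

```python
import numpy as np

def optimize_rotation(X, attack_scores, n_iters=50, rng=None):
    """Schematic hill-climbing over random rotations.

    attack_scores: callables mapping (R, X) to a privacy guarantee under a
    simulated attack (e.g. naive estimation, ICA); placeholders here.
    The rotation whose worst-case (minimum) score is highest is kept.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[0]

    def random_rotation():
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        return Q

    def worst_case(R):
        return min((score(R, X) for score in attack_scores), default=0.0)

    best_R = random_rotation()
    best_score = worst_case(best_R)
    for _ in range(n_iters):                      # 1. iteratively determine R
        R = random_rotation()
        score = worst_case(R)
        if score > best_score:                    # keep only improvements
            best_R, best_score = R, score
    T = rng.uniform(-1.0, 1.0, size=(d, 1))       # 2. append a random translation
    noise_std = 0.1                               # 3. illustrative noise scale
    return best_R, T, noise_std, best_score
```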

Page 32: Multiplicative Data Perturbations

Comparison of methods

Privacy preservation
- In general, RPP should be better than GDP.
- Evaluating the effect of attacks on GDP: the ICA and distance-inference attacks need experimental evaluation.

Utility preservation
- GDP: R and T exactly preserve distances; the effect of D needs experimental evaluation.
- RPP: number of perturbed dimensions vs. utility.

Datasets: 12 datasets from the UCI Data Repository.

Page 33: Multiplicative Data Perturbations

Privacy guarantee: GDP

In terms of naïve estimation and ICA-based attacks, using only the random rotation and translation components (R*X + T).

Compared settings (results chart): the worst perturbation (no optimization), the perturbation optimized for naïve estimation only, and the perturbation optimized for both attacks.

Page 34: Multiplicative Data Perturbations

Privacy guarantee: GDP

In terms of distance-inference attacks, using all three components (R*X + T + D), with Gaussian noise D ~ N(0, σ²), and assuming pairs of (original, image) points have been identified by the attackers.

With no noise addition, the privacy guarantee is 0; even with a small perturbation (σ = 0.1), the privacy guarantee is already considerably high.

Page 35: Multiplicative Data Perturbations

Data utility: GDP with noise addition

Noise addition vs. model accuracy, with noise N(0, 0.1²).

Page 36: Multiplicative Data Perturbations

Data Utility: RPP

Reduced number of dimensions vs. model accuracy: KNN classifiers, SVMs

Page 37: Multiplicative Data Perturbations

Perceptrons