Randomization in Privacy Preserving Data Mining


Upload: bishop

Post on 25-Feb-2016


TRANSCRIPT

Page 1: Randomization in Privacy Preserving Data Mining

Randomization in Privacy Preserving Data Mining

Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00

The following slides include material from this paper.

Page 2: Randomization in Privacy Preserving Data Mining

Privacy-Preserving Data Mining

• Problem: How do we publish data without compromising individual privacy?

• Solution: randomization, anonymization

Page 3: Randomization in Privacy Preserving Data Mining

Randomization

• Adding random noise to the original dataset

• Challenge: is the data still useful for further analysis?

Page 4: Randomization in Privacy Preserving Data Mining

Randomization

• Model: data is distorted by adding random noise

• Original data X = {x1 . . . xN}; for each record xi ∈ X, a random value yi from Y = {y1 . . . yN} is added, so the new data is denoted by Z = {z1 . . . zN}, with zi = xi + yi.

• yi is a random value drawn from one of:
– Uniform, [-α, +α]
– Gaussian, N(0, σ²)
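The additive model above can be sketched in a few lines of Python. This is only an illustration: the function name `randomize`, its parameters, and the sample ages are hypothetical and not taken from the slides or the paper.

```python
import random

def randomize(values, method="uniform", alpha=1.0, sigma=1.0, seed=None):
    """Return z_i = x_i + y_i, where each y_i is drawn independently of x_i.

    method="uniform"  draws y_i from [-alpha, +alpha]
    method="gaussian" draws y_i from N(0, sigma^2)
    """
    rng = random.Random(seed)
    if method == "uniform":
        noise = [rng.uniform(-alpha, alpha) for _ in values]
    elif method == "gaussian":
        noise = [rng.gauss(0.0, sigma) for _ in values]
    else:
        raise ValueError(f"unknown method: {method}")
    return [x + y for x, y in zip(values, noise)]

# Hypothetical example: perturb four ages with uniform noise, alpha = 10.
ages = [23, 35, 41, 58]
perturbed = randomize(ages, method="uniform", alpha=10.0, seed=0)
```

Only `perturbed` would be published; the noise distribution (but not the individual yi) is assumed to be public.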

Page 5: Randomization in Privacy Preserving Data Mining

Reconstruction

• Perturbed data hides the original data distribution, which needs to be reconstructed before data mining

• Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y

• Estimate the probability distribution of X

Clifton AusDM‘11

Page 6: Randomization in Privacy Preserving Data Mining

Reconstruction

• Bayes' rule is used to estimate the cumulative distribution function of X, from which the density is derived

• Reconstruction algorithm
1. fX⁰ := uniform distribution
2. Repeat the update until the stopping criterion is met
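A minimal sketch of the iterative reconstruction, assuming a discretized domain: the range [lo, hi] is cut into equal-width intervals, the estimate starts uniform, and each pass reweights every interval by the Bayes posterior of the observed values zi = xi + yi. The function names, bin count, tolerance, and test data are all illustrative choices, not values from the slides.

```python
import math
import random

def gaussian_pdf(v, sigma):
    """Density of N(0, sigma^2) at v."""
    return math.exp(-0.5 * (v / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def reconstruct(z, noise_pdf, lo, hi, bins=10, max_iter=100, tol=1e-3):
    """Estimate interval probabilities of X from perturbed values z."""
    width = (hi - lo) / bins
    mids = [lo + (k + 0.5) * width for k in range(bins)]
    p = [1.0 / bins] * bins                      # step 1: uniform start
    # like[i][k] = f_Y(z_i - midpoint_k), fixed across iterations
    like = [[noise_pdf(zi - m) for m in mids] for zi in z]
    for _ in range(max_iter):                    # step 2: repeat update
        new_p = [0.0] * bins
        for row in like:
            denom = sum(l * pk for l, pk in zip(row, p))
            if denom == 0.0:
                continue                         # z_i unexplained by any bin
            for k in range(bins):
                new_p[k] += row[k] * p[k] / denom
        total = sum(new_p)
        if total == 0.0:
            break
        new_p = [v / total for v in new_p]
        change = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if change < tol:                         # stopping criterion
            break
    return mids, p

# Hypothetical data: two clusters near 0.25 and 0.75, Gaussian noise.
rng = random.Random(42)
x = [rng.choice((0.25, 0.75)) + rng.uniform(-0.1, 0.1) for _ in range(500)]
z = [xi + rng.gauss(0.0, 0.25) for xi in x]
mids, probs = reconstruct(z, lambda v: gaussian_pdf(v, 0.25), 0.0, 1.0)
```

The stopping rule used here (total change below `tol`) is one simple option; the slide does not say which criterion is used.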

Page 7: Randomization in Privacy Preserving Data Mining

[Figure panels: original vs. randomized vs. reconstructed distributions; panel labels: N(0, 0.25) and (-0.5, 0.5)]

Page 8: Randomization in Privacy Preserving Data Mining

Privacy Metric

• If a value x can be estimated to lie in the interval [α, β] with c% confidence, then the interval width (β-α) defines the amount of privacy at the c% confidence level.

• Example: Age in the range 20-40 (width 20), 95% confidence, 50% privacy with uniform noise: 2α = 20*0.5/0.95 ≈ 10.5

Confidence   50%         95%          99.9%
Uniform      0.5 × 2α    0.95 × 2α    0.999 × 2α
Gaussian     1.34 × σ    3.92 × σ     6.8 × σ
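The table entries and the age example can be reproduced with a short calculation. The sketch below assumes the privacy width is the central interval of the noise distribution that holds the given confidence mass; the function names are illustrative.

```python
from statistics import NormalDist

def uniform_privacy(alpha, confidence):
    """Privacy width for uniform noise on [-alpha, +alpha]."""
    return confidence * 2 * alpha

def gaussian_privacy(sigma, confidence):
    """Width of the central interval of N(0, sigma^2) holding `confidence` mass."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma

def uniform_alpha_for(privacy_width, confidence):
    """Alpha needed so that uniform noise gives the requested privacy width."""
    return privacy_width / (2 * confidence)

# Slide example: range width 20, 50% privacy (width 10) at 95% confidence.
two_alpha = 2 * uniform_alpha_for(0.5 * 20, 0.95)   # about 10.5
gauss_95 = gaussian_privacy(1.0, 0.95)              # about 3.92 (in units of sigma)
```

Under this assumption the 50% and 95% Gaussian entries match the table; the same calculation at 99.9% gives roughly 6.58 × σ, a little below the 6.8 × σ listed.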

Page 9: Randomization in Privacy Preserving Data Mining

Decision Tree

Page 10: Randomization in Privacy Preserving Data Mining

Training Decision Tree

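• Split points: interval boundaries

• Reconstruction algorithm
– Global: reconstruct each attribute once from the whole training set
– ByClass: reconstruct each attribute separately for each class, once at the root
– Local: as ByClass, but repeated at every tree node

• Dataset
– Synthetic dataset: training set of 100,000 records and testing set of 5,000 records, equally split into two classes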
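Choosing a split point among interval boundaries can be sketched as follows. The example assumes per-class interval counts are already available (for instance from a ByClass reconstruction) and uses the gini index as the impurity measure; the slide names neither the measure nor any of these function names, so treat both as assumptions.

```python
def gini(counts):
    """Gini impurity of a list of per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def best_split(boundaries, class_hists):
    """Pick the interval boundary with the lowest weighted gini.

    boundaries:  the k-1 inner boundaries between k intervals
    class_hists: one list of k interval counts per class
    Returns (boundary, weighted_gini).
    """
    k = len(class_hists[0])
    n = sum(sum(h) for h in class_hists)
    best = None
    for j in range(1, k):
        left = [sum(h[:j]) for h in class_hists]    # per-class counts below boundary
        right = [sum(h[j:]) for h in class_hists]   # per-class counts above boundary
        score = (sum(left) * gini(left) + sum(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (boundaries[j - 1], score)
    return best

# Hypothetical reconstructed counts over intervals [0,10), [10,20), [20,30), [30,40).
hist_a = [40, 30, 5, 0]
hist_b = [0, 5, 30, 40]
split, score = best_split([10, 20, 30], [hist_a, hist_b])
```

Because only interval counts are known after reconstruction, candidate splits are restricted to interval boundaries rather than individual record values.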

Page 11: Randomization in Privacy Preserving Data Mining
Page 12: Randomization in Privacy Preserving Data Mining

[Figure: comparison of the original, randomized, global, ByClass, and local approaches]

Page 13: Randomization in Privacy Preserving Data Mining
Page 14: Randomization in Privacy Preserving Data Mining

Extended Work

• ‘02: proposed a method to quantify information loss
– Mutual information

• ‘07: evaluated randomization in combination with public information
– Gaussian noise performs better than uniform noise
– Datasets with an inherent cluster pattern improve randomization performance
– Varying density and outliers decrease performance

Page 15: Randomization in Privacy Preserving Data Mining

Multiplicative Randomization

• Rotation randomization
– Data is distorted by an orthogonal matrix

• Projection randomization
– Projects a high-dimensional dataset into a low-dimensional space

• Both preserve Euclidean distances (only approximately in the case of projection), so they can be applied with distance-based classification (KNN, SVM) and clustering (K-means)
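A small check of the distance-preserving property for the rotation case, using a 2-D rotation as the orthogonal matrix. The points and angle are made up for illustration.

```python
import math
from itertools import combinations

def rotate(points, theta):
    """Apply the 2-D rotation matrix [[cos, -sin], [sin, cos]] to each point."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Hypothetical records with two numeric attributes.
records = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5), (2.5, -1.0)]
rotated = rotate(records, theta=1.1)

# Largest change in any pairwise Euclidean distance after rotation.
max_diff = max(
    abs(math.dist(records[i], records[j]) - math.dist(rotated[i], rotated[j]))
    for i, j in combinations(range(len(records)), 2)
)
```

`max_diff` stays at floating-point rounding level, which is why a distance-based method such as KNN or K-means gives the same result on the rotated data as on the original.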

Page 16: Randomization in Privacy Preserving Data Mining

Summary

• Pros: noise is independent of the data, so it can be applied at data collection time; useful for stream data

• Cons: information loss, curse of dimensionality

Page 17: Randomization in Privacy Preserving Data Mining

Questions?