Introduction to Data Mining (Emory University, CS570 Spring 2011; lecture slides, 2011-04-03)


Introduction to Data Mining

Privacy preserving data mining

4/3/2011

Li Xiong

Slide credits:

Chris Clifton

Agrawal and Srikant

Privacy Preserving Data Mining

• Privacy concerns about personal data

  • AOL query log release

  • Netflix challenge

  • Data scraping

A race to the bottom: privacy ranking of Internet service companies

• A study done by Privacy International into the privacy practices of key Internet-based companies

• Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube

A Race to the Bottom: Methodologies

• Corporate administrative details

• Data collection and processing

• Data retention

• Openness and transparency

• Customer and user control

• Privacy enhancing innovations and privacy invasive innovations

A race to the bottom: interim results revealed

Why Google

• Retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limitation on subsequent use or disclosure

• Maintains records of all search strings with associated IPs and time stamps for at least 18-24 months

• Collects additional personal information from user profiles in Orkut

• Uses an advanced profiling system for ads

Remember, they are always watching …

Some advice from privacy campaigners …

• Use cash when you can.

• Do not give out your phone number, social-security number or address, unless you absolutely have to.

• Do not fill in questionnaires or respond to telemarketers.

• Demand that credit and data-marketing firms produce all information they have on you, correct errors and remove you from marketing lists.

• Check your medical records often.

• Block caller ID on your phone, and keep your number unlisted.

• Never leave your mobile phone on; your movements can be traced.

• Do not use store credit or discount cards.

• If you must use the Internet, encrypt your e-mail, reject all "cookies" and never give your real name when registering at websites.

• Better still, use somebody else's computer.

Privacy-Preserving Data Mining

• Data obfuscation (non-interactive model)

[Figure: Original Data → Anonymization → "Sanitized" Data → Miner]

• Output perturbation (interactive model)

[Figure: Miner → queries via Access Interface over the Original Data → "Perturbed" Results]

Classes of Solutions

• Methods

  • Input obfuscation

    • Perturbation

    • Generalization

  • Output perturbation

    • Differential privacy

• Metrics

  • Privacy vs. Utility

Data Perturbation

• Data randomization

  • Randomization (additive noise)

  • Geometric perturbation and projection (multiplicative noise)

  • Randomized response technique (categorical data)

Randomization Based Decision Tree Learning (Agrawal and Srikant ’00)

• Basic idea: perturb data with value distortion

  • User provides x_i + r instead of x_i

  • r is a random value

    • Uniform: uniform distribution between [-α, α]

    • Gaussian: normal distribution with µ = 0 and standard deviation σ

• Hypothesis

  • Miner doesn't see the real data or can't reconstruct real values

  • Miner can reconstruct enough information to identify patterns
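The value-distortion step above can be sketched in a few lines of Python (function names are mine, not from the paper): each user submits x_i + r, with r drawn from either noise distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_uniform(x, alpha):
    # user reports x_i + r, with r ~ Uniform[-alpha, alpha]
    return x + rng.uniform(-alpha, alpha, size=len(x))

def perturb_gaussian(x, sigma):
    # user reports x_i + r, with r ~ Normal(mu = 0, sigma)
    return x + rng.normal(0.0, sigma, size=len(x))

ages = np.array([30.0, 50.0, 25.0, 65.0])
noisy_ages = perturb_gaussian(ages, sigma=20.0)   # e.g., age 30 may become ~65
```

Individual reports are badly distorted, but since the noise has mean zero, aggregate statistics over many users remain estimable.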

Classification using Randomized Data

[Figure: each record (e.g., 30 | 70K, 50 | 40K) passes through a randomizer before reaching the classification algorithm, which builds its model from the perturbed records (e.g., 65 | 20K, 25 | 60K). Alice's age 30 becomes 65 after adding the random value 35. Can a model still be learned?]

Output: A Decision Tree for "buys_computer"

[Figure: decision tree, from Data Mining: Concepts and Techniques (February 12, 2008). The root splits on age: <=30 leads to student? (no → no, yes → yes); 31..40 predicts yes; >40 leads to credit rating? (excellent → no, fair → yes).]

Attribute Selection Measure: Gini index (CART)

• If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D

• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

• Reduction in impurity:

  Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node
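The two measures above translate directly into code (a sketch; function names are mine):

```python
import numpy as np

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(d1, d2):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(gini(["yes", "yes", "no", "no"]))          # 0.5
print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0: a pure split, maximum impurity reduction
```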

Randomization Approach Overview

[Figure: the same pipeline as before, with an added reconstruction step. The randomized records (65 | 20K, 25 | 60K, ...) are first used to reconstruct the distribution of each attribute (Age, Salary); the classification algorithm then builds its model from the reconstructed distributions. Individual values (e.g., the 30 that became 65) are never recovered.]

Original Distribution Reconstruction

• x1, x2, …, xn are the n original data values

  • Drawn from n iid random variables with distribution X

• Using value distortion, the given values are w1 = x1 + y1, w2 = x2 + y2, …, wn = xn + yn

  • The yi's are from n iid random variables with distribution Y

• Reconstruction problem: given F_Y and the wi's, estimate F_X

Original Distribution Reconstruction: Method

• Bayes' theorem for continuous distributions

• The estimated density function (minimum mean square error estimator):

  f′_X(a) = (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X(a) / ∫_{−∞}^{∞} f_Y(w_i − z) f_X(z) dz ]

• Iterative estimation

  • The initial estimate for f_X at j = 0: the uniform distribution

  • Iteration step:

    f_X^{j+1}(a) = (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X^j(a) / ∫_{−∞}^{∞} f_Y(w_i − z) f_X^j(z) dz ]

  • Stopping criterion: the difference between successive iterations is small
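A discretized sketch of the iterative estimator (names are mine; the integrals become sums over a fixed grid, and normalization absorbs the grid spacing):

```python
import numpy as np

def reconstruct_fx(w, noise_pdf, grid, max_iter=200, tol=1e-6):
    """Iteratively estimate f_X on a grid, given perturbed values
    w_i = x_i + y_i and the known noise density f_Y."""
    fx = np.full(len(grid), 1.0 / len(grid))     # initial estimate: uniform
    fy = noise_pdf(w[:, None] - grid[None, :])   # f_Y(w_i - a) for all i, a
    for _ in range(max_iter):
        denom = fy @ fx                          # discretized integral of f_Y(w_i - z) f_X(z) dz
        new = fx * (fy / denom[:, None]).mean(axis=0)
        new /= new.sum()
        if np.abs(new - fx).sum() < tol:         # successive estimates agree: stop
            return new
        fx = new
    return fx
```

For example, feeding it values drawn from a two-mode distribution plus Gaussian noise recovers a density whose mean tracks the original.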

Reconstruction of Distribution

[Figure: number of people (0 to 1200) vs. age (20 to 60) for the Original, Randomized, and Reconstructed distributions; the reconstructed curve closely tracks the original, while the randomized one is much flatter.]

Original Distribution Reconstruction for Decision Trees

• When are the distributions reconstructed?

• Global

  • Reconstruct each attribute once at the beginning

  • Build the decision tree using the reconstructed data

• ByClass

  • First split the training data by class

  • Reconstruct for each class separately

  • Build the decision tree using the reconstructed data

• Local

  • First split the training data by class

  • Reconstruct for each class separately

  • Reconstruct at each node while building the tree

Accuracy vs. Randomization Level

[Figure: accuracy (40 to 100) vs. randomization level (10 to 200) on test function Fn 3, comparing the Original, Randomized, and ByClass approaches.]

More Results

• Global performs worse than ByClass and Local

• ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy

• Overall, all are much better than the Randomized accuracy

Privacy metrics

• Privacy metrics of random additive data perturbation

4/3/2011 Data Mining: Principles and Algorithms 24

Unfortunately

• Random additive data perturbation is subject to data reconstruction attacks

• Original data can be estimated using spectral filtering techniques

• H. Kargupta, S. Datta. On the privacy preserving properties of random data perturbation techniques, ICDM 2003


Estimating distribution and data values


Follow-up Work

• Multiplicative randomization

• Geometric randomization

• Also subject to data reconstruction attacks!

  • Known input-output attacks

  • Known sample attacks


Data Perturbation

• Data randomization

  • Randomization (additive noise)

  • Geometric perturbation and projection (multiplicative noise)

  • Randomized response technique (categorical data)

Data Collection Model

Data cannot be shared directly because of privacy concerns

Background: Randomized Response

"Do you smoke?" Suppose the true answer is "Yes".

Biased coin: P(Head) = θ (θ ≠ 0.5)

• Head → report the true answer ("Yes")

• Tail → report the opposite ("No")

P'(Yes) = P(Yes) ⋅ θ + P(No) ⋅ (1−θ)
P'(No) = P(Yes) ⋅ (1−θ) + P(No) ⋅ θ

Decision Tree Mining using Randomized Response

• Multiple attributes encoded in bits

• Biased coin: P(Head) = θ (θ ≠ 0.5)

  • Head → report the true answer E: 110

  • Tail → report the false answer !E: 001

• Column distributions can be estimated for learning a decision tree

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003

Generalization for Multi-Valued Categorical Data

• True value S_i: report S_i with probability q1, S_{i+1} with q2, S_{i+2} with q3, and S_{i+3} with q4

| P'(s1) |   | q1 q4 q3 q2 |   | P(s1) |
| P'(s2) | = | q2 q1 q4 q3 | ⋅ | P(s2) |
| P'(s3) |   | q3 q2 q1 q4 |   | P(s3) |
| P'(s4) |   | q4 q3 q2 q1 |   | P(s4) |
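Since the observed distribution is P' = M·P and M is known, the true distribution can be estimated by solving the linear system. A sketch with illustrative probabilities:

```python
import numpy as np

# report the true value with prob. q1, the next (cyclically) with q2, etc.
q = [0.7, 0.1, 0.1, 0.1]
M = np.array([[q[(i - j) % 4] for j in range(4)] for i in range(4)])

p_true = np.array([0.4, 0.3, 0.2, 0.1])
p_observed = M @ p_true                    # distribution of the reported values
p_estimated = np.linalg.solve(M, p_observed)
```

In practice p_observed comes from empirical frequencies, so the estimate carries sampling noise; the closer M is to singular, the more that noise is amplified.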

A Generalization

• RR matrices [Warner 65], [R. Agrawal 05], [S. Agrawal 05]

• The RR matrix M can be arbitrary:

M =
| a11 a12 a13 a14 |
| a21 a22 a23 a24 |
| a31 a32 a33 a34 |
| a41 a42 a43 a44 |

• Can we find optimal RR matrices?

OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008

What is an optimal matrix?

• Which of the following is better?

M1 =
| 1 0 0 |
| 0 1 0 |
| 0 0 1 |

M2 =
| 1/3 1/3 1/3 |
| 1/3 1/3 1/3 |
| 1/3 1/3 1/3 |

• Privacy: M2 is better. Utility: M1 is better.

• So, what is an optimal matrix?

Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix's privacy and utility are both better than M's (i.e., no other matrix dominates M)

• Privacy and utility quantification

  • Privacy: how accurately one can estimate individual info

  • Utility: how accurately we can estimate aggregate info

Optimization algorithm

• Evolutionary Multi-Objective Optimization (EMOO)

• The algorithm:

  • Start with a set of initial RR matrices

  • Repeat the following steps in each iteration:

    • Mating: select two RR matrices from the pool

    • Crossover: exchange several columns between the two RR matrices

    • Mutation: change some values in an RR matrix

    • Meet the privacy bound: filter the resultant matrices

    • Evaluate the fitness value for the new RR matrices

Note: the fitness value is defined in terms of the privacy and utility metrics
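The dominance test underlying the "optimal set" can be sketched in a few lines (a sketch with names of my choosing; privacy and utility are treated as scores to maximize):

```python
def dominates(a, b):
    # a dominates b: at least as good on both objectives, strictly better on one
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(scores):
    # keep the candidates that no other candidate dominates
    return [s for s in scores if not any(dominates(o, s) for o in scores)]

# (privacy, utility) scores for candidate RR matrices
candidates = [(1, 3), (2, 2), (3, 1), (1, 1)]
front = pareto_front(candidates)   # (1, 1) is dominated by (2, 2) and drops out
```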

Output of Optimization

[Figure: candidate matrices M1 through M8 plotted in the objective space, with utility on one axis and privacy on the other; higher on both axes is better, lower is worse.]

The optimal (non-dominated) set is often plotted in the objective space as the Pareto front.

Classes of Solutions

• Methods

  • Input obfuscation

    • Perturbation

    • Generalization

  • Output perturbation

    • Differential privacy

• Metrics

  • Privacy vs. Utility

Data Re-identification

[Figure: linking attack. A released table (Disease, Birthdate, Sex, Zip) is joined with a public list (Name, Birthdate, Sex, Zip) on the shared attributes, re-identifying individuals.]

k-anonymity & l-diversity

Privacy preserving data mining

• Generalization principles

  • k-anonymity, l-diversity, …

• Methods

  • Optimal

  • Greedy

    • Top-down vs. bottom-up

Mondrian: Greedy Partitioning Algorithm

• Problem

  • Need an algorithm to find multi-dimensional partitions

  • Optimal k-anonymous strict multi-dimensional partitioning is NP-hard

• Solution

  • Use a greedy algorithm

  • Based on k-d trees

  • Complexity O(n log n)

Example

k = 2; quasi-identifiers: Age, Zipcode

What should the splitting criterion be?

[Figure: the patient data table and its multi-dimensional partitioning.]
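A toy sketch of the greedy top-down cut (illustrative names only; the real Mondrian algorithm normalizes attribute ranges and handles categorical values): split on the widest quasi-identifier at its median, as long as both halves keep at least k records.

```python
def mondrian(records, qids, k):
    # pick the quasi-identifier with the widest range of values
    dim = max(qids, key=lambda q: max(r[q] for r in records) - min(r[q] for r in records))
    values = sorted(r[dim] for r in records)
    median = values[len(values) // 2]
    left = [r for r in records if r[dim] < median]
    right = [r for r in records if r[dim] >= median]
    if len(left) >= k and len(right) >= k:   # allowable cut: both halves keep k rows
        return mondrian(left, qids, k) + mondrian(right, qids, k)
    return [records]                         # no allowable cut: emit one group

patients = [{"age": 25, "zip": 53711}, {"age": 25, "zip": 53712},
            {"age": 50, "zip": 53711}, {"age": 55, "zip": 53710}]
groups = mondrian(patients, ["age", "zip"], k=2)
```

Each returned group is then generalized to the ranges it spans, giving a k-anonymous release.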

Unfortunately

• Generalization-based principles and methods are subject to attacks

  • Background knowledge sensitive

  • Attack dependent


Classes of Solutions

• Methods

  • Input obfuscation

    • Perturbation

    • Generalization

  • Output perturbation

    • Differential privacy

• Metrics

  • Privacy vs. Utility

Differential Privacy

• Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set

[Figure: a user issues query Q through a differentially private interface; with Bob in the data set (D1) the released answer is A(D1) = Q(D1) + Y1, with Bob out (D2) it is A(D2) = Q(D2) + Y2, and the two answers are formally indistinguishable.]

• Laplace mechanism: release Q(D) + Y, where Y is drawn from the Laplace distribution Lap(ΔQ/ε)

• Query sensitivity ΔQ: the maximum change in Q between two data sets differing in a single record
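A sketch of the Laplace mechanism (assuming a counting query, whose sensitivity is 1: adding or removing Bob's record changes the count by at most 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # release Q(D) + Y, with Y ~ Lap(sensitivity / epsilon)
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

# a count query over the data set; smaller epsilon means more noise
noisy_count = laplace_mechanism(true_answer=1234, sensitivity=1.0, epsilon=0.5)
```

Over many releases the noise averages out for the analyst, while any single answer reveals almost nothing about whether Bob's record is present.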

Coming up

• Data mining algorithms using differential privacy

  • Decision tree learning (Data Mining with Differential Privacy, SIGKDD 10)

  • Frequent itemset mining (Discovering Frequent Patterns in Sensitive Data, SIGKDD 10)


Midterm Exam

• Adjusted mean: 85.3

• Adjusted max: 101

• Your favorite topics: clustering, frequent itemset mining, decision trees

• Your favorite assignment: Apriori

• Your least favorites: SOM, Weka analysis

