Introduction to Data Mining (Emory University, CS570 Spring 2011; lecture slides, 2011-04-03)


Introduction to Data Mining

Privacy preserving data mining

4/3/2011

Li Xiong

Slide credits:

Chris Clifton

Agrawal and Srikant

Privacy Preserving Data Mining

• Privacy concerns about personal data

  • AOL query log release

  • Netflix challenge

  • Data scraping

A race to the bottom: privacy ranking of Internet service companies

• A study done by Privacy International into the privacy practices of key Internet-based companies

• Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube

A Race to the Bottom: Methodologies

• Corporate administrative details

• Data collection and processing

• Data retention

• Openness and transparency

• Customer and user control

• Privacy enhancing innovations and privacy invasive innovations

A race to the bottom: interim results revealed

Why Google

• Retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limitation on subsequent use or disclosure

• Maintains records of all search strings with associated IPs and time stamps for at least 18-24 months

• Collects additional personal information from user profiles in Orkut

• Uses an advanced profiling system for ads

Remember, they are always watching …

Some advice from privacy campaigners …

• Use cash when you can.

• Do not give out your phone number, social-security number or address, unless you absolutely have to.

• Do not fill in questionnaires or respond to telemarketers.

• Demand that credit and data-marketing firms produce all information they have on you, correct errors and remove you from marketing lists.

• Check your medical records often.

• Block caller ID on your phone, and keep your number unlisted.

• Never leave your mobile phone on; your movements can be traced.

• Do not use store credit or discount cards.

• If you must use the Internet, encrypt your e-mail, reject all "cookies" and never give your real name when registering at websites.

• Better still, use somebody else's computer.

Privacy-Preserving Data Mining

• Data obfuscation (non-interactive model)

[Figure: Original Data → Anonymization → "Sanitized" Data → Miner]

• Output perturbation (interactive model)

[Figure: Miner → queries via Access Interface over the Original Data → "Perturbed" Results]

Classes of Solutions

• Methods

  • Input obfuscation

    • Perturbation

    • Generalization

  • Output perturbation

    • Differential privacy

• Metrics

  • Privacy vs. Utility

Data Perturbation

• Data randomization

  • Randomization (additive noise)

  • Geometric perturbation and projection (multiplicative noise)

  • Randomized response technique (categorical data)

Randomization Based Decision Tree Learning (Agrawal and Srikant ’00)

• Basic idea: perturb data with value distortion

  • User provides x_i + r instead of x_i

  • r is a random value

    • Uniform: uniform distribution between [-α, α]

    • Gaussian: normal distribution with µ = 0 and standard deviation σ

• Hypothesis

  • Miner doesn't see the real data or can't reconstruct real values

  • Miner can reconstruct enough information to identify patterns
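The value-distortion step above can be sketched in a few lines of Python (function names are mine, not from the paper): each user submits x_i + r, with r drawn from either noise distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_uniform(x, alpha):
    # user reports x_i + r, with r ~ Uniform[-alpha, alpha]
    return x + rng.uniform(-alpha, alpha, size=len(x))

def perturb_gaussian(x, sigma):
    # user reports x_i + r, with r ~ Normal(mu = 0, sigma)
    return x + rng.normal(0.0, sigma, size=len(x))

ages = np.array([30.0, 50.0, 25.0, 65.0])
noisy_ages = perturb_gaussian(ages, sigma=20.0)   # e.g., age 30 may become ~65
```

Individual reports are badly distorted, but since the noise has mean zero, aggregate statistics over many users remain estimable.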

Classification using Randomized Data

[Figure: each record (e.g., 30 | 70K, 50 | 40K) passes through a randomizer before reaching the classification algorithm, which builds its model from the perturbed records (e.g., 65 | 20K, 25 | 60K). Alice's age 30 becomes 65 after adding the random value 35. Can a model still be learned?]

Output: A Decision Tree for "buys_computer"

[Figure: decision tree, from Data Mining: Concepts and Techniques (February 12, 2008). The root splits on age: <=30 leads to student? (no → no, yes → yes); 31..40 predicts yes; >40 leads to credit rating? (excellent → no, fair → yes).]

Attribute Selection Measure: Gini index (CART)

• If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D

• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

• Reduction in impurity:

  Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node
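The two measures above translate directly into code (a sketch; function names are mine):

```python
import numpy as np

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(d1, d2):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(gini(["yes", "yes", "no", "no"]))          # 0.5
print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0: a pure split, maximum impurity reduction
```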

Randomization Approach Overview

[Figure: the same pipeline as before, with an added reconstruction step. The randomized records (65 | 20K, 25 | 60K, ...) are first used to reconstruct the distribution of each attribute (Age, Salary); the classification algorithm then builds its model from the reconstructed distributions. Individual values (e.g., the 30 that became 65) are never recovered.]

Original Distribution Reconstruction

• x1, x2, …, xn are the n original data values

  • Drawn from n iid random variables with distribution X

• Using value distortion, the given values are w1 = x1 + y1, w2 = x2 + y2, …, wn = xn + yn

  • The yi's are from n iid random variables with distribution Y

• Reconstruction problem: given F_Y and the wi's, estimate F_X

Original Distribution Reconstruction: Method

• Bayes' theorem for continuous distributions

• The estimated density function (minimum mean square error estimator):

  f′_X(a) = (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X(a) / ∫_{−∞}^{∞} f_Y(w_i − z) f_X(z) dz ]

• Iterative estimation

  • The initial estimate for f_X at j = 0: the uniform distribution

  • Iteration step:

    f_X^{j+1}(a) = (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X^j(a) / ∫_{−∞}^{∞} f_Y(w_i − z) f_X^j(z) dz ]

  • Stopping criterion: the difference between successive iterations is small
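A discretized sketch of the iterative estimator (names are mine; the integrals become sums over a fixed grid, and normalization absorbs the grid spacing):

```python
import numpy as np

def reconstruct_fx(w, noise_pdf, grid, max_iter=200, tol=1e-6):
    """Iteratively estimate f_X on a grid, given perturbed values
    w_i = x_i + y_i and the known noise density f_Y."""
    fx = np.full(len(grid), 1.0 / len(grid))     # initial estimate: uniform
    fy = noise_pdf(w[:, None] - grid[None, :])   # f_Y(w_i - a) for all i, a
    for _ in range(max_iter):
        denom = fy @ fx                          # discretized integral of f_Y(w_i - z) f_X(z) dz
        new = fx * (fy / denom[:, None]).mean(axis=0)
        new /= new.sum()
        if np.abs(new - fx).sum() < tol:         # successive estimates agree: stop
            return new
        fx = new
    return fx
```

For example, feeding it values drawn from a two-mode distribution plus Gaussian noise recovers a density whose mean tracks the original.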

Reconstruction of Distribution

[Figure: number of people (0 to 1200) vs. age (20 to 60) for the Original, Randomized, and Reconstructed distributions; the reconstructed curve closely tracks the original, while the randomized one is much flatter.]

Original Distribution Reconstruction for Decision Trees

• When are the distributions reconstructed?

• Global

  • Reconstruct each attribute once at the beginning

  • Build the decision tree using the reconstructed data

• ByClass

  • First split the training data by class

  • Reconstruct for each class separately

  • Build the decision tree using the reconstructed data

• Local

  • First split the training data by class

  • Reconstruct for each class separately

  • Reconstruct at each node while building the tree

Accuracy vs. Randomization Level

[Figure: accuracy (40 to 100) vs. randomization level (10 to 200) on test function Fn 3, comparing the Original, Randomized, and ByClass approaches.]

More Results

• Global performs worse than ByClass and Local

• ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy

• Overall, all are much better than the Randomized accuracy

Privacy metrics

• Privacy metrics of random additive data perturbation

4/3/2011 Data Mining: Principles and Algorithms 24

Unfortunately

• Random additive data perturbation is subject to data reconstruction attacks

• Original data can be estimated using spectral filtering techniques

• H. Kargupta, S. Datta. On the privacy preserving properties of random data perturbation techniques, ICDM 2003


Estimating distribution and data values


Follow-up Work

• Multiplicative randomization

• Geometric randomization

• Also subject to data reconstruction attacks!

  • Known input-output attacks

  • Known sample attacks


Data Perturbation

• Data randomization

  • Randomization (additive noise)

  • Geometric perturbation and projection (multiplicative noise)

  • Randomized response technique (categorical data)

Data Collection Model

Data cannot be shared directly because of privacy concerns

Background: Randomized Response

"Do you smoke?" Suppose the true answer is "Yes".

Biased coin: P(Head) = θ (θ ≠ 0.5)

• Head → report the true answer ("Yes")

• Tail → report the opposite ("No")

P'(Yes) = P(Yes) ⋅ θ + P(No) ⋅ (1−θ)
P'(No) = P(Yes) ⋅ (1−θ) + P(No) ⋅ θ

Decision Tree Mining using Randomized Response

• Multiple attributes encoded in bits

• Biased coin: P(Head) = θ (θ ≠ 0.5)

  • Head → report the true answer E: 110

  • Tail → report the false answer !E: 001

• Column distributions can be estimated for learning a decision tree

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003

Generalization for Multi-Valued Categorical Data

• True value S_i: report S_i with probability q1, S_{i+1} with q2, S_{i+2} with q3, and S_{i+3} with q4

| P'(s1) |   | q1 q4 q3 q2 |   | P(s1) |
| P'(s2) | = | q2 q1 q4 q3 | ⋅ | P(s2) |
| P'(s3) |   | q3 q2 q1 q4 |   | P(s3) |
| P'(s4) |   | q4 q3 q2 q1 |   | P(s4) |
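Since the observed distribution is P' = M·P and M is known, the true distribution can be estimated by solving the linear system. A sketch with illustrative probabilities:

```python
import numpy as np

# report the true value with prob. q1, the next (cyclically) with q2, etc.
q = [0.7, 0.1, 0.1, 0.1]
M = np.array([[q[(i - j) % 4] for j in range(4)] for i in range(4)])

p_true = np.array([0.4, 0.3, 0.2, 0.1])
p_observed = M @ p_true                    # distribution of the reported values
p_estimated = np.linalg.solve(M, p_observed)
```

In practice p_observed comes from empirical frequencies, so the estimate carries sampling noise; the closer M is to singular, the more that noise is amplified.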

A Generalization

• RR matrices [Warner 65], [R. Agrawal 05], [S. Agrawal 05]

• The RR matrix M can be arbitrary:

M =
| a11 a12 a13 a14 |
| a21 a22 a23 a24 |
| a31 a32 a33 a34 |
| a41 a42 a43 a44 |

• Can we find optimal RR matrices?

OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008

What is an optimal matrix?

• Which of the following is better?

M1 =
| 1 0 0 |
| 0 1 0 |
| 0 0 1 |

M2 =
| 1/3 1/3 1/3 |
| 1/3 1/3 1/3 |
| 1/3 1/3 1/3 |

• Privacy: M2 is better. Utility: M1 is better.

• So, what is an optimal matrix?

Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix's privacy and utility are both better than M's (i.e., no other matrix dominates M)

• Privacy and utility quantification

  • Privacy: how accurately one can estimate individual info

  • Utility: how accurately we can estimate aggregate info

Optimization algorithm

• Evolutionary Multi-Objective Optimization (EMOO)

• The algorithm:

  • Start with a set of initial RR matrices

  • Repeat the following steps in each iteration:

    • Mating: select two RR matrices from the pool

    • Crossover: exchange several columns between the two RR matrices

    • Mutation: change some values in an RR matrix

    • Meet the privacy bound: filter the resultant matrices

    • Evaluate the fitness value for the new RR matrices

Note: the fitness value is defined in terms of the privacy and utility metrics
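The dominance test underlying the "optimal set" can be sketched in a few lines (a sketch with names of my choosing; privacy and utility are treated as scores to maximize):

```python
def dominates(a, b):
    # a dominates b: at least as good on both objectives, strictly better on one
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(scores):
    # keep the candidates that no other candidate dominates
    return [s for s in scores if not any(dominates(o, s) for o in scores)]

# (privacy, utility) scores for candidate RR matrices
candidates = [(1, 3), (2, 2), (3, 1), (1, 1)]
front = pareto_front(candidates)   # (1, 1) is dominated by (2, 2) and drops out
```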

Output of Optimization

[Figure: candidate matrices M1 through M8 plotted in the objective space, with utility on one axis and privacy on the other; higher on both axes is better, lower is worse.]

The optimal (non-dominated) set is often plotted in the objective space as the Pareto front.

Classes of Solutions

• Methods

  • Input obfuscation

    • Perturbation

    • Generalization

  • Output perturbation

    • Differential privacy

• Metrics

  • Privacy vs. Utility

Data Re-identification

[Figure: linking attack. A released table (Disease, Birthdate, Sex, Zip) is joined with a public list (Name, Birthdate, Sex, Zip) on the shared attributes, re-identifying individuals.]

k-anonymity & l-diversity

Privacy preserving data mining

• Generalization principles

  • k-anonymity, l-diversity, …

• Methods

  • Optimal

  • Greedy

    • Top-down vs. bottom-up

Mondrian: Greedy Partitioning Algorithm

• Problem

  • Need an algorithm to find multi-dimensional partitions

  • Optimal k-anonymous strict multi-dimensional partitioning is NP-hard

• Solution

  • Use a greedy algorithm

  • Based on k-d trees

  • Complexity O(n log n)

Example

k = 2; quasi-identifiers: Age, Zipcode

What should the splitting criterion be?

[Figure: the patient data table and its multi-dimensional partitioning.]
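A toy sketch of the greedy top-down cut (illustrative names only; the real Mondrian algorithm normalizes attribute ranges and handles categorical values): split on the widest quasi-identifier at its median, as long as both halves keep at least k records.

```python
def mondrian(records, qids, k):
    # pick the quasi-identifier with the widest range of values
    dim = max(qids, key=lambda q: max(r[q] for r in records) - min(r[q] for r in records))
    values = sorted(r[dim] for r in records)
    median = values[len(values) // 2]
    left = [r for r in records if r[dim] < median]
    right = [r for r in records if r[dim] >= median]
    if len(left) >= k and len(right) >= k:   # allowable cut: both halves keep k rows
        return mondrian(left, qids, k) + mondrian(right, qids, k)
    return [records]                         # no allowable cut: emit one group

patients = [{"age": 25, "zip": 53711}, {"age": 25, "zip": 53712},
            {"age": 50, "zip": 53711}, {"age": 55, "zip": 53710}]
groups = mondrian(patients, ["age", "zip"], k=2)
```

Each returned group is then generalized to the ranges it spans, giving a k-anonymous release.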

Unfortunately

• Generalization-based principles and methods are subject to attacks

  • Background knowledge sensitive

  • Attack dependent


Classes of Solutions

• Methods

  • Input obfuscation

    • Perturbation

    • Generalization

  • Output perturbation

    • Differential privacy

• Metrics

  • Privacy vs. Utility

Differential Privacy

• Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set

[Figure: a user issues query Q through a differentially private interface; with Bob in the data set (D1) the released answer is A(D1) = Q(D1) + Y1, with Bob out (D2) it is A(D2) = Q(D2) + Y2, and the two answers are formally indistinguishable.]

• Laplace mechanism: release Q(D) + Y, where Y is drawn from the Laplace distribution Lap(ΔQ/ε)

• Query sensitivity ΔQ: the maximum change in Q between two data sets differing in a single record
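A sketch of the Laplace mechanism (assuming a counting query, whose sensitivity is 1: adding or removing Bob's record changes the count by at most 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # release Q(D) + Y, with Y ~ Lap(sensitivity / epsilon)
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

# a count query over the data set; smaller epsilon means more noise
noisy_count = laplace_mechanism(true_answer=1234, sensitivity=1.0, epsilon=0.5)
```

Over many releases the noise averages out for the analyst, while any single answer reveals almost nothing about whether Bob's record is present.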

Coming up

• Data mining algorithms using differential privacy

  • Decision tree learning (Data Mining with Differential Privacy, SIGKDD 10)

  • Frequent itemset mining (Discovering Frequent Patterns in Sensitive Data, SIGKDD 10)


Midterm Exam

• Adjusted mean: 85.3

• Adjusted max: 101

• Your favorite topics: clustering, frequent itemset mining, decision trees

• Your favorite assignment: Apriori

• Your least favorites: SOM, Weka analysis

