OptRR: Optimizing Randomized Response Schemes For Privacy-Preserving Data Mining

Zhengli Huang and Wenliang (Kevin) Du
Department of EECS
Syracuse University

Data Mining/Analysis

Step 1: Data Collection (Individual Data → Data Publisher)

Step 2: Data Publishing (Data Publisher → Data Miner)

Data cannot be published directly because of privacy concerns.

Background: Randomized Response

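Do you smoke?

Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$

The true answer is “Yes”:
• Head: answer “Yes”
• Tail: answer “No”

So a true “Yes” is reported as “Yes” with probability $\theta$ ($\theta \neq 0.5$).

$$P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta)$$

$$P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta$$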
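The two equations for P'(Yes) and P'(No) can be inverted to recover P(Yes) from the reported answers. The following is a minimal sketch of that idea, not code from the talk: the function names, the value θ = 0.8, and the simulated population (30% true "Yes") are all assumptions chosen for illustration.

```python
import random

def randomize(true_answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the truth on Head (probability theta), the opposite on Tail."""
    return true_answer if rng.random() < theta else not true_answer

def estimate_p_yes(reported_yes_rate: float, theta: float) -> float:
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta) for P(Yes)."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the reported answers uninformative")
    return (reported_yes_rate - (1 - theta)) / (2 * theta - 1)

# Hypothetical population: 10,000 respondents, 30% of whom truly answer "Yes".
rng = random.Random(0)
theta = 0.8
true_answers = [i < 3000 for i in range(10000)]
reports = [randomize(a, theta, rng) for a in true_answers]

p_reported = sum(reports) / len(reports)
p_estimated = estimate_p_yes(p_reported, theta)
print(f"reported Yes rate: {p_reported:.3f}, estimated P(Yes): {p_estimated:.3f}")
```

The guard on θ = 0.5 reflects the condition θ ≠ 0.5: with a fair coin the denominator 2θ − 1 vanishes and the reported rate carries no information about P(Yes).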

RR for Categorical Data

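True value $s_i$ is reported as $s_i$, $s_{i+1}$, $s_{i+2}$, $s_{i+3}$ with probabilities $q_1$, $q_2$, $q_3$, $q_4$, respectively (indices wrap around).

$$
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
$$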
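As a rough illustration of the matrix form P' = M P, the sketch below builds the circulant matrix M from q1..q4, applies it to a true distribution, and then solves M P = P' to recover P. The values of q and P are made up, and the helper names (`circulant_rr_matrix`, `apply`, `solve`) are hypothetical; the solver assumes M is invertible.

```python
def circulant_rr_matrix(q):
    """M[i][j] = q[(i - j) mod n]: probability of reporting s_i given true value s_j."""
    n = len(q)
    return [[q[(i - j) % n] for j in range(n)] for i in range(n)]

def apply(M, p):
    """Return the reported distribution P' = M P."""
    return [sum(m * x for m, x in zip(row, p)) for row in M]

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes M is invertible; a singular M would cause a division by zero.
    """
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

q = [0.7, 0.1, 0.1, 0.1]       # illustrative q1..q4, summing to 1
p_true = [0.4, 0.3, 0.2, 0.1]  # illustrative true distribution P

M = circulant_rr_matrix(q)
p_reported = apply(M, p_true)      # what the data miner observes
p_recovered = solve(M, p_reported) # estimate of P from P'
print(p_reported)
print(p_recovered)
```

In practice P' is estimated from a finite sample, so the recovered P carries sampling error that depends on M; that error is what the utility metric is meant to capture.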

A Generalization

• Several RR matrices have been proposed
  – [Warner 65]
  – [R. Agrawal et al. 05], [S. Agrawal et al. 05]

• An RR matrix can be arbitrary

• Can we find optimal RR matrices?

$$
M = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}
$$

What is an optimal matrix?

• Which of the following is better?

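$$
M_1 = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\qquad
M_2 = \begin{bmatrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3}
\end{bmatrix}
$$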

Privacy: M2 is better

Utility: M1 is better

So, what is an optimal matrix?
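One way to see the trade-off is to apply both matrices to different true distributions. The sketch below is only an illustration with made-up distributions: the identity matrix M1 reports the data unchanged (full utility, no privacy), while the uniform matrix M2 maps every input to the same uniform output (full privacy, no utility).

```python
def apply(M, p):
    """Return the reported distribution P' = M P."""
    return [sum(m * x for m, x in zip(row, p)) for row in M]

M1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
M2 = [[1 / 3] * 3 for _ in range(3)]

# Two made-up true distributions over three categories.
p_a = [0.7, 0.2, 0.1]
p_b = [0.1, 0.1, 0.8]

print("M1 on p_a:", apply(M1, p_a))  # unchanged: the miner sees the true data
print("M2 on p_a:", apply(M2, p_a))  # uniform
print("M2 on p_b:", apply(M2, p_b))  # uniform again: p_a and p_b are indistinguishable
```

Since M2 sends every distribution to the same output, it cannot be inverted, so no aggregate information survives.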

Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix’s privacy and utility are both better than M’s (i.e., no other matrix dominates M).
  – Privacy Quantification
  – Utility Quantification

• A number of privacy and utility metrics have been proposed. We use the following:
  – Privacy: how accurately one can estimate individual info.
  – Utility: how accurately we can estimate aggregate info.
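The dominance test behind this definition can be sketched as follows. This is an assumption-laden illustration: the `Score` type and its numbers are hypothetical, both scores are taken to be "higher is better", and the code uses the usual Pareto convention (no worse in both objectives, strictly better in at least one), which is slightly broader than "both better".

```python
from typing import NamedTuple

class Score(NamedTuple):
    privacy: float  # hypothetical score, higher is better
    utility: float  # hypothetical score, higher is better

def dominates(a: Score, b: Score) -> bool:
    """True if a is no worse than b in both objectives and better in at least one."""
    no_worse = a.privacy >= b.privacy and a.utility >= b.utility
    strictly_better = a.privacy > b.privacy or a.utility > b.utility
    return no_worse and strictly_better

def is_optimal(candidate: Score, others: list) -> bool:
    """A candidate is optimal if no other score dominates it."""
    return not any(dominates(o, candidate) for o in others)

# Made-up scores for four candidate matrices.
scores = {
    "identity": Score(privacy=0.0, utility=1.0),
    "uniform": Score(privacy=1.0, utility=0.0),
    "balanced": Score(privacy=0.5, utility=0.5),
    "weak": Score(privacy=0.4, utility=0.4),
}

for name, s in scores.items():
    others = [o for n, o in scores.items() if n != name]
    print(name, "optimal" if is_optimal(s, others) else "dominated")
```

Neither extreme dominates the other, so both count as optimal under this definition; only "weak", which "balanced" beats on both objectives, is dominated.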

Optimization Methods

• Approach 1: Weighted sum: w1 · Privacy + w2 · Utility

• Approach 2
  – Fix Privacy, find M with the optimal Utility.
  – Fix Utility, find M with the optimal Privacy.
  – Challenge: Difficult to generate M with a fixed privacy or utility.

• Our Approach: Multi-Objective Optimization
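To make Approach 1 concrete, the sketch below scores a few candidates with w1 · Privacy + w2 · Utility. The candidate names and scores are invented, and both objectives are assumed to be "higher is better". It suggests one practical limitation of a weighted sum: each choice of weights returns a single trade-off point, so the weights must be chosen before the search.

```python
def weighted_sum_best(scores, w1, w2):
    """Return the candidate name maximizing w1 * privacy + w2 * utility."""
    return max(scores, key=lambda name: w1 * scores[name][0] + w2 * scores[name][1])

# Made-up (privacy, utility) scores, both higher-is-better.
scores = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.6),
    "C": (0.2, 0.9),
}

for w1, w2 in [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8)]:
    print(f"w1={w1}, w2={w2} -> {weighted_sum_best(scores, w1, w2)}")
```

A multi-objective search instead returns the whole set of non-dominated candidates and leaves the choice of trade-off to the user.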

Evolutionary Multi-Objective Optimization (EMOO)

• Genetic algorithms have difficulty dealing with multiple objectives.

• We use an EMOO algorithm.
• We use SPEA2.

Our SPEA2-based algorithm

EMOO

• Evolution
  – Crossover
  – Mutation

• Fitness Assignment (SPEA2)
  – Strength Value S(M): the number of matrices dominated by M.
  – Raw fitness F’(M): the sum of the strengths of the RR matrices that dominate M. The lower the better.
  – Density d(M): discriminates among matrices with the same fitness.
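The three fitness quantities can be sketched on a handful of points in objective space. This follows the published SPEA2 description (strength, raw fitness, and a k-th-nearest-neighbour density term 1 / (distance + 2)) and is not necessarily the authors' implementation; the points are made up, both objectives are assumed "higher is better", and real SPEA2 computes these over the population plus an archive.

```python
import math

def dominates(a, b):
    """Pareto dominance for tuples of objectives where higher is better."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def spea2_fitness(points, k=1):
    """Return (strength, raw fitness, final fitness) lists; lower fitness is better."""
    n = len(points)
    # Strength S: how many points each point dominates.
    strength = [sum(dominates(p, q) for q in points) for p in points]
    # Raw fitness F': sum of the strengths of the points that dominate this one.
    raw = [
        sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
        for i in range(n)
    ]
    # Density d: based on the distance to the k-th nearest neighbour (needs k <= n - 1).
    fitness = []
    for i in range(n):
        dists = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        density = 1.0 / (dists[k - 1] + 2.0)
        fitness.append(raw[i] + density)
    return strength, raw, fitness

# Made-up (privacy, utility) scores for five candidate matrices.
points = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
strength, raw, fitness = spea2_fitness(points)
print(strength)
print(raw)
print([round(f, 3) for f in fitness])
```

The density term is always below 1, so it only breaks ties between matrices with equal raw fitness, favouring those in less crowded regions; this is what keeps the population diverse.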

Diversity

[Figure: RR matrices M1 to M5 plotted in the objective space. Axes: Privacy and Utility, with the Worse and Better directions marked.]

The Output of Optimization

• Pareto Fronts
  – The optimal set is often plotted in the objective space and the plot is called the Pareto front.
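Extracting the Pareto front from a set of evaluated matrices amounts to filtering out every dominated point. The sketch below assumes each point is a (privacy, error) pair in which higher privacy and lower error are better; that orientation and the numbers are assumptions for illustration only.

```python
def dominates(a, b):
    """a, b are (privacy, error) pairs: higher privacy and lower error are better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(points):
    """Return the non-dominated points, sorted by privacy."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(front)

# Made-up (privacy, error) evaluations of five candidate matrices.
points = [(0.1, 0.02), (0.5, 0.10), (0.9, 0.40), (0.4, 0.20), (0.8, 0.45)]
front = pareto_front(points)
print(front)
```

Along the resulting front, gaining privacy always costs some utility (more error), which is the trade-off curve the plot displays.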

[Figure: Pareto front. Axes: Privacy and Utility (error).]

Experiments

[Figure: results for normal distributions with different δ]

[Figure: results for the first attribute of the Adult data]

[Figure: results for a normal distribution (δ = 0.75)]

Summary

• We use an evolutionary multi-objective optimization technique to search for optimal RR matrices.

• The evaluation shows that our scheme achieves better performance than the existing RR schemes.
