OptRR: Optimizing Randomized Response Schemes For Privacy-Preserving Data Mining




Page 1: OptRR: Optimizing Randomized Response Schemes For Privacy-Preserving Data Mining

Zhengli Huang and Wenliang (Kevin) Du
Department of EECS, Syracuse University

Page 2: Data Mining/Analysis

Individual Data → Data Publisher → Data Miner

• Step 1: Data Collection (individuals to the data publisher)
• Step 2: Data Publishing (data publisher to the data miner)
• Data cannot be published directly because of privacy concerns.

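Page 3: Background: Randomized Response

Question: "Do you smoke?" (in this example, the true answer is "Yes")

Biased coin: $P(\text{Head}) = \theta$, with $\theta \neq 0.5$
– Head: answer "Yes"
– Tail: answer "No"

So this respondent answers "Yes" with probability $\theta$: $P(\text{Yes}) = \theta$ ($\theta \neq 0.5$).

Over all respondents, with $P$ the true proportions and $P'$ the reported proportions:

$$P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta)$$

$$P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta$$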
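The two relations between P' and P can be inverted whenever θ ≠ 0.5, which is what lets a data miner recover the aggregate proportion without learning any individual's true answer. The following is a minimal sketch of that idea, not the authors' code; the function names `randomize`, `estimate_true_yes`, and `simulate` are hypothetical, and it assumes every respondent tells the truth with probability θ.

```python
import random


def randomize(true_answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, the opposite otherwise."""
    return true_answer if rng.random() < theta else not true_answer


def estimate_true_yes(reported_yes: float, theta: float) -> float:
    """Solve P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta) for P(Yes)."""
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 makes the relation non-invertible")
    return (reported_yes - (1 - theta)) / (2 * theta - 1)


def simulate(p_yes: float, theta: float, n: int, seed: int = 0) -> float:
    """Disguise n simulated answers and estimate the true 'Yes' proportion."""
    rng = random.Random(seed)
    reports = [randomize(rng.random() < p_yes, theta, rng) for _ in range(n)]
    reported_yes = sum(reports) / n
    return estimate_true_yes(reported_yes, theta)


if __name__ == "__main__":
    # The estimate should land close to the true proportion 0.3.
    print(simulate(p_yes=0.3, theta=0.8, n=50_000))
```

The closer θ is to 0.5, the smaller the denominator 2θ − 1 becomes, so sampling noise in the reported proportion is amplified more strongly in the estimate.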

Page 4: RR for Categorical Data

True value: $S_i$. It is reported as $S_i$, $S_{i+1}$, $S_{i+2}$, or $S_{i+3}$ with probabilities $q_1$, $q_2$, $q_3$, and $q_4$, respectively (indices wrap around).

$$
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
$$
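The matrix M on this slide is circulant: entry (r, c) is the probability of reporting s_r when the true value is s_c, which depends only on (r − c) mod 4. A small sketch of how such a matrix could be built and applied follows; the helper names `circulant_rr_matrix` and `disguise` are illustrative and do not come from the slides.

```python
def circulant_rr_matrix(q):
    """Build M with M[r][c] = q[(r - c) mod n], i.e. P(report s_r | true s_c)."""
    n = len(q)
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("the probabilities q must sum to 1")
    return [[q[(r - c) % n] for c in range(n)] for r in range(n)]


def disguise(M, p):
    """Return the reported distribution P' = M * P."""
    n = len(p)
    return [sum(M[r][c] * p[c] for c in range(n)) for r in range(n)]


if __name__ == "__main__":
    M = circulant_rr_matrix([0.4, 0.3, 0.2, 0.1])
    for row in M:
        print(row)
    print(disguise(M, [0.25, 0.25, 0.25, 0.25]))
```

With q = (q1, q2, q3, q4), the first row comes out as (q1, q4, q3, q2), matching the matrix above.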

Page 5: A Generalization

• Several RR matrices have been proposed
– [Warner 65]
– [R. Agrawal et al. 05], [S. Agrawal et al. 05]
• The RR matrix can be arbitrary
• Can we find optimal RR matrices?

$$
M = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}
$$
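"Arbitrary" still has to mean a valid set of conditional probabilities. Under the column convention used for the circulant matrix earlier (column c holds the distribution of reported values for true value s_c), each column must be non-negative and sum to 1. The sketch below checks that condition and disguises a single value using a general matrix; both function names are hypothetical, and the column convention is an assumption carried over from the earlier slide.

```python
import random


def is_valid_rr_matrix(M, tol=1e-9):
    """True if M is square, non-negative, and every column sums to 1."""
    n = len(M)
    if any(len(row) != n for row in M):
        return False
    if any(x < 0 for row in M for x in row):
        return False
    return all(abs(sum(M[r][c] for r in range(n)) - 1.0) <= tol for c in range(n))


def randomize_value(M, true_index, rng):
    """Draw the reported category index from column `true_index` of M."""
    column = [row[true_index] for row in M]
    return rng.choices(range(len(M)), weights=column, k=1)[0]


if __name__ == "__main__":
    M = [[0.5, 0.2, 0.1],
         [0.3, 0.7, 0.1],
         [0.2, 0.1, 0.8]]
    rng = random.Random(0)
    print(is_valid_rr_matrix(M))
    print([randomize_value(M, 0, rng) for _ in range(10)])
```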

Page 6: What is an optimal matrix?

• Which of the following is better?

$$
M_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\qquad
M_2 = \begin{bmatrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3}
\end{bmatrix}
$$

Privacy: M2 is better
Utility: M1 is better

So, what is an optimal matrix?
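One way to make the M1 versus M2 comparison concrete is to compute, for each matrix, the reported distribution and the accuracy of an adversary who always guesses the most likely true value given the report. This accuracy measure is only an illustrative stand-in for a privacy metric; it is an assumption of this sketch, not necessarily the metric used by the authors. The prior `p` is likewise made up for the example.

```python
from fractions import Fraction


def disguise(M, p):
    """Reported distribution P' = M * P."""
    n = len(p)
    return [sum(M[r][c] * p[c] for c in range(n)) for r in range(n)]


def adversary_accuracy(M, p):
    """Probability that the most likely true value, given the report, is correct."""
    n = len(p)
    return sum(max(M[r][c] * p[c] for c in range(n)) for r in range(n))


third = Fraction(1, 3)
M1 = [[Fraction(int(r == c)) for c in range(3)] for r in range(3)]
M2 = [[third] * 3 for _ in range(3)]
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]  # hypothetical true distribution

if __name__ == "__main__":
    for name, M in (("M1", M1), ("M2", M2)):
        print(name, disguise(M, p), adversary_accuracy(M, p))
```

With M1 the reported distribution equals the true one and the adversary is always right. With M2 every true distribution maps to the uniform distribution, so the aggregate cannot be recovered, and the adversary can do no better than guessing the most common value.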

Page 7: Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix's privacy and utility are both better than M's (i.e., no other matrix dominates M).
– Privacy quantification
– Utility quantification
• A number of privacy and utility metrics have been proposed. We use the following:
– Privacy: how accurately one can estimate individual information.
– Utility: how accurately we can estimate aggregate information.
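The optimality definition above is a dominance test between pairs of matrices. The sketch below assumes each matrix has already been scored as a pair (privacy, utility error), with higher privacy and lower error being better; that encoding and the names `Score`, `dominates`, and `is_optimal` are assumptions for illustration. It also uses the standard Pareto convention (no worse in both objectives, strictly better in at least one), which is slightly broader than the slide's "both better" wording.

```python
from typing import Iterable, NamedTuple


class Score(NamedTuple):
    privacy: float        # higher is better (assumed convention)
    utility_error: float  # lower is better (assumed convention)


def dominates(a: Score, b: Score) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    no_worse = a.privacy >= b.privacy and a.utility_error <= b.utility_error
    strictly_better = a.privacy > b.privacy or a.utility_error < b.utility_error
    return no_worse and strictly_better


def is_optimal(candidate: Score, others: Iterable[Score]) -> bool:
    """A candidate is optimal if no other score dominates it."""
    return not any(dominates(other, candidate) for other in others)


if __name__ == "__main__":
    a, b, c = Score(0.9, 0.10), Score(0.5, 0.50), Score(0.2, 0.05)
    print(dominates(a, b), dominates(a, c), is_optimal(c, [a, b]))
```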

Page 8: Optimization Methods

• Approach 1: Weighted sum: w1 Privacy + w2 Utility
• Approach 2
– Fix Privacy, find M with the optimal Utility.
– Fix Utility, find M with the optimal Privacy.
– Challenge: difficult to generate M with a fixed privacy or utility.
• Our Approach: Multi-Objective Optimization
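To illustrate how the weighted-sum approach (Approach 1) behaves, the sketch below scores a few candidates with w1 · Privacy + w2 · Utility and picks the best one. The candidate names and scores are invented for the example, and both objectives are assumed to be scaled so that higher is better.

```python
def weighted_sum_pick(scores, w1, w2):
    """Return the candidate name maximizing w1 * privacy + w2 * utility."""
    return max(scores, key=lambda name: w1 * scores[name][0] + w2 * scores[name][1])


# Hypothetical (privacy, utility) scores, both "higher is better".
scores = {"A": (0.9, 0.2), "B": (0.6, 0.6), "C": (0.1, 0.9)}

if __name__ == "__main__":
    for w1, w2 in ((1.0, 0.0), (0.5, 0.5), (0.0, 1.0)):
        print((w1, w2), weighted_sum_pick(scores, w1, w2))
```

Each weight setting returns a single matrix, so covering the whole privacy-utility trade-off this way means choosing and rerunning many weight pairs. A multi-objective search instead returns a set of non-dominated matrices in one run.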

Page 9: Evolutionary Multi-Objective Optimization (EMOO)

• Genetic algorithms have difficulty dealing with multiple objectives.
• We use an EMOO algorithm.
• We use SPEA2.

Page 10: Our SPEA2-based algorithm

Page 11: EMOO

• Evolution
– Crossover
– Mutation
• Fitness Assignment (SPEA2)
– Strength value S(M): the number of matrices dominated by M.
– Raw fitness F'(M): the sum of the strengths of the RR matrices that dominate M. The lower, the better.
– Density d(M): discriminates between matrices with the same fitness.
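The three fitness quantities listed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each matrix is represented only by its (privacy, utility error) pair, higher privacy and lower error are taken as better, and the density term uses the k-th nearest-neighbour form 1 / (σ_k + 2) commonly given for SPEA2, which the slides do not spell out. A full SPEA2 would compute these over the population together with its archive.

```python
import math


def dominates(a, b):
    """a, b are (privacy, utility_error); higher privacy, lower error are better."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])


def spea2_fitness(points, k=1):
    """Return (strength, raw_fitness, fitness) lists for the given points."""
    n = len(points)
    # Strength S(i): how many points i dominates.
    strength = [sum(dominates(points[i], points[j]) for j in range(n))
                for i in range(n)]
    # Raw fitness F'(i): sum of the strengths of the points that dominate i.
    raw = [sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
           for i in range(n)]
    # Density d(i): 1 / (distance to the k-th nearest neighbour + 2), below 1,
    # so it only breaks ties between points with equal raw fitness.
    density = []
    for i in range(n):
        dists = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        sigma_k = dists[min(k, len(dists)) - 1] if dists else 0.0
        density.append(1.0 / (sigma_k + 2.0))
    fitness = [r + d for r, d in zip(raw, density)]
    return strength, raw, fitness


if __name__ == "__main__":
    pts = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.6), (0.2, 0.05)]
    print(spea2_fitness(pts))
```

Non-dominated points have raw fitness 0, so after adding the density term their total fitness stays below 1, while every dominated point scores at least 1.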

Page 12: Diversity

[Figure: matrices M1 to M5 plotted in the objective space. Axes: Privacy, Utility. Direction labels: Worse, Better.]

Page 13: The Output of Optimization

• Pareto Fronts
– The optimal set is often plotted in the objective space, and the plot is called the Pareto front.

[Figure: Pareto front. Axes: Privacy, Utility (error); origin at 0.]
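Extracting the Pareto front from a set of scored matrices amounts to keeping the points that no other point dominates. The sketch below assumes each matrix is summarized as a (privacy, utility error) pair with higher privacy and lower error preferred; the function name `pareto_front` and the sample points are illustrative.

```python
def pareto_front(points):
    """Return the non-dominated (privacy, utility_error) points, sorted."""
    def dominates(a, b):
        # Higher privacy and lower utility error are better (assumed convention).
        return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(front)


if __name__ == "__main__":
    pts = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.6), (0.2, 0.05)]
    print(pareto_front(pts))
```

This pairwise check is quadratic in the number of points, which is fine for the population sizes typical of an evolutionary search.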

Page 14: Experiments

For normal distributions with different δ

Page 15: For the first attribute of the Adult data

Page 16: For normal distribution (δ = 0.75)

Page 17: Summary

• We use an evolutionary multi-objective optimization technique to search for optimal RR matrices.
• The evaluation shows that our scheme achieves better performance than the existing RR schemes.