
Data Privacy and Security

Master Degree in Data Science

Sapienza University of Rome

Academic Year 2017-2018

Instructor: Daniele Venturi (Some slides from a talk by Cynthia Dwork)


Part V: Differential Privacy

Data Exploitation


• Availability of lots of data

– Social networks, financial data, medical records…

• All these data are an asset

– We would like to exploit them

[Diagram: data collection — individuals' records (x, y, …) flow into a central database]

Examples


• Finding statistical correlations

– Genotype/phenotype association

– Correlating medical outcomes with risk factors

• Publishing aggregate statistics

• Noticing events/outliers

– Intrusion detection

• Data mining/learning

– Update strategies based on customer data

AOL Search Debacle


• Back in 2006, AOL published research statistics of over 3 million users over a period of 3 months

• Data were anonymized in order to remove personally identifiable information

– E.g., names, social security numbers, …

• Yet after some time many people were identified

Lessons to be Learned


• Privacy is a concern when publishing datasets

• Wait: This does not apply to me!

– Don’t make the entire dataset available

– Only publish statistics

• Even if only data aggregations are published, privacy can be broken

• Overly accurate estimates of too many statistics is blatantly non-private

Data Analysis


[Diagram: Data Analyst sends queries q₁, q₂ to the Database and receives answers a₁, a₂]

• How to define privacy?

– Intuitively, we want published statistics not to undermine the privacy of individuals

– After all, statistics are just aggregated data about the overall population

The Statistics Masquerade


• Differential attack

– How many people in the room had XYZ last night?

– How many people, other than the speaker, had XYZ last night?

• Needle in a haystack

– Determine the presence of an individual's genomic data in a GWAS case group

• The big bang attack

– Reconstruct sensitive attributes given statistics from multiple overlapping releases

Privacy-Preserving Data Analysis?


[Diagram: Data Analyst sends queries q₁, q₂ to the Database and receives answers a₁, a₂]

• Can’t learn anything new about Alice?

– Reminiscent of semantic security for encryption

– Then what is the point?

• Ideally: Learn same thing if Alice is replaced by a random member of the population

Differential Privacy


• The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset

– Alice goes away, Bob joins, Alice is replaced by Bob (i.e., small perturbations do not matter)

• Note that instead if we completely change the dataset we get completely different answers

More Formally…


[Diagram: Data Analyst queries the Database through a Randomized Mechanism 𝓜, sending q₁, q₂ and receiving a₁, a₂]

Mechanism 𝓜 gives ε-differential privacy if for all pairs of adjacent datasets x, y, and for all events S:

Pr[𝓜(x) ∈ S] ≤ e^ε · Pr[𝓜(y) ∈ S]

Notes on the Definition


• Worst-case guarantee

– It holds for all datasets

– It holds even against unbounded adversaries

– Probability over the randomness of the algorithm, not over the choice of the dataset

• The roles of 𝑥 and 𝑦 can be flipped

• Randomness is in the hands of the good guys

Properties


• Immune to auxiliary information

– Current and future side information

• Automatically yields group privacy

– Privacy loss kε for groups of size k

• Composition

– Can bound the cumulative privacy loss over multiple analyses (the epsilons add up)

– Can combine a few differentially private mechanisms to solve complex analytical tasks (see the sketch after this list)
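
A minimal budget-accounting sketch in Python (an illustration of basic composition under the rule above; the class and its names are hypothetical, not from the slides):

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic composition:
    running an eps_i-DP analysis adds eps_i to the total loss."""

    def __init__(self, total_eps: float):
        self.total_eps = total_eps   # overall budget we refuse to exceed
        self.spent = 0.0

    def charge(self, eps: float) -> None:
        if self.spent + eps > self.total_eps:
            raise RuntimeError("privacy budget exhausted")
        self.spent += eps

budget = PrivacyBudget(total_eps=1.0)
budget.charge(0.3)   # first differentially private analysis
budget.charge(0.5)   # second one; cumulative loss is now 0.8
```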

Did you XYZ Last Night?


• Flip a coin

– Heads: Flip again and return YES if heads, NO otherwise

– Tails: Answer honestly

• Pr[YES | Truth = YES] / Pr[YES | Truth = NO] = (3/4)/(1/4) = 3

• Pr[NO | Truth = NO] / Pr[NO | Truth = YES] = (3/4)/(1/4) = 3

• So ε = ln 3 ≈ 1.098

• Absolute error ≈ 1/√n for a single fractional query
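
A minimal sketch of this protocol in Python (numpy is assumed; the debiasing step uses Pr[YES] = 1/4 + p/2, where p is the true fraction):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(truth: bool) -> bool:
    """One respondent: heads -> answer at random, tails -> answer honestly."""
    if rng.random() < 0.5:           # first flip comes up heads
        return rng.random() < 0.5    # second flip decides YES/NO
    return truth                     # tails: honest answer

def estimate_fraction(answers: np.ndarray) -> float:
    """Invert Pr[YES] = 1/4 + p/2 to recover the population fraction p."""
    return 2.0 * answers.mean() - 0.5

# Hypothetical poll: 10,000 people, 30% truly did XYZ last night
truths = rng.random(10_000) < 0.3
answers = np.array([randomized_response(bool(t)) for t in truths])
print(estimate_fraction(answers))    # ~0.30, error on the order of 1/sqrt(n)
```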

Real-Valued Functions


• Want to compute 𝑞(𝑥)

• Replacing x by an adjacent y pulls the answer to q(y)

– Add random noise to obscure the difference q(x) vs q(y)

Sensitivity


• Noise depends on Δq and ε, but not on the dataset

– Smaller Δq: Less distortion

– Smaller ε: More distortion

• Privacy in the land of plenty (i.e., not a tool for tiny datasets)

To achieve ε-differential privacy it suffices to add noise scaled to Δq/ε

Δq = max over adjacent x, y of |q(x) − q(y)|

The Laplace Mechanism


Theorem: Adding noise Lap(Δq/ε) yields ε-differential privacy

Δq = max over adjacent x, y of |q(x) − q(y)|

Laplace density with scale b: p(z) = e^(−|z|/b) / (2b)

Why does it work?


Theorem: Adding noise Lap(Δq/ε) yields ε-differential privacy

Δq = max over adjacent x, y of |q(x) − q(y)|

For any output t, with b = Δq/ε:

Pr[𝓜(x) = t] / Pr[𝓜(y) = t] = e^((|t − q(y)| − |t − q(x)|)/b) ≤ e^(|q(x) − q(y)|/b) ≤ e^ε

Example: Histogram Queries


Δq = max over adjacent x, y of |q(x) − q(y)| = 1

So it suffices to add noise Lap(1/ε) to each count
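
A sketch of the Laplace mechanism for histogram queries in Python (numpy assumed; the bucket counts are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity: float, eps: float):
    """Release q(x) + Lap(sensitivity/eps) noise, element-wise on arrays."""
    scale = sensitivity / eps
    return true_answer + rng.laplace(loc=0.0, scale=scale,
                                     size=np.shape(true_answer))

# Histogram query: one individual changes exactly one bucket by 1, so Δq = 1
counts = np.array([120, 45, 303, 12])          # hypothetical bucket counts
print(laplace_mechanism(counts, sensitivity=1.0, eps=0.5))
```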

Accuracy for a Set of Queries

• A mechanism 𝓜 has (α, β)-accuracy wrt a set of queries Q if for all x: with probability 1 − β, the outcome 𝓜(x) yields for each q ∈ Q an approximate value in [q(x) − α, q(x) + α]

• Accuracy of Laplace mechanism for 𝑘 queries

– Fact: Pr[|Lap(b)| ≥ t·b] = e^(−t) (for us b = k/ε)

– Union bound: want k · Pr[|Lap(k/ε)| ≥ t·k/ε] ≤ β

– Thus: Pr[|Lap(k/ε)| ≥ t·k/ε] ≤ β/k, i.e., e^(−t) = β/k

– So t = ln(k/β) and α = (k/ε) · ln(k/β)


(α ≈ 1/ε for k = 1)
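
A quick Monte Carlo check of this bound (a sketch, numpy assumed): with b = k/ε and α = (k/ε)·ln(k/β), the largest of k Laplace errors should exceed α with probability at most β.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, beta = 50, 1.0, 0.05
b = k / eps                      # per-query noise scale under basic composition
alpha = b * np.log(k / beta)     # accuracy bound derived above

trials = 10_000
errors = np.abs(rng.laplace(scale=b, size=(trials, k)))
max_err = errors.max(axis=1)     # worst error among the k answers
print((max_err > alpha).mean())  # empirically stays below beta = 0.05
```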

Noisy ArgMax

• Compute g₁(x), …, g_m(x)

– How many people play ping pong, how many go running, etc.

– We want to know the most popular hobby

• Add Lap(2 · max_i Δg_i / ε) to each value

– Do not release the noisy outcomes, but report only the index of the function with the largest noisy outcome

– Works as long as there is a gap between the two most popular choices

– Compute much more than what is released
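
A sketch of this report-noisy-max rule in Python (numpy assumed; the hobby counts are hypothetical, and Δg_i = 1 since these are counting queries):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_argmax(values, eps: float, sensitivity: float = 1.0) -> int:
    """Add Lap(2*sensitivity/eps) to each value; release only the winner's index."""
    noise = rng.laplace(scale=2.0 * sensitivity / eps, size=len(values))
    return int(np.argmax(np.asarray(values) + noise))

hobbies = ["ping pong", "running", "chess"]
counts = [830, 812, 455]                        # hypothetical popularity counts
print(hobbies[noisy_argmax(counts, eps=0.5)])   # noisy counts stay secret
```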


Generalization of Noisy ArgMax

• Target notions of utility

– E.g., the mechanism outputs a classifier and utility is the smallness of the classification error

• Applications where adding noise makes no sense

• Goal: Maximize 𝑢(𝑥, ξ)

– Utility of ξ on database 𝑥


The Exponential Mechanism

• q(x) ∈ Ξ = {ξ₁, …, ξ_k}

– Strings, prices, etc.

– Each ξ ∈ Ξ has utility u(x, ξ) for x


Intuition: Output ξ with probability ∝ e^(ε·u(x, ξ)/Δu)

e^(ε·u(x, ξ)/Δu) / e^(ε·u(y, ξ)/Δu) = e^(ε·(u(x, ξ) − u(y, ξ))/Δu) ≤ e^ε
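
A sketch of this sampling rule in Python (numpy assumed; the candidates, utilities, and Δu = 1 are made-up illustration values). Subtracting the maximum utility before exponentiating leaves the distribution unchanged and avoids overflow:

```python
import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(candidates, utilities, eps: float, delta_u: float):
    """Sample ξ with probability proportional to exp(eps * u(x, ξ) / Δu)."""
    u = np.asarray(utilities, dtype=float)
    weights = np.exp(eps * (u - u.max()) / delta_u)  # shift for numerical stability
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Toy example (previewing the next slide): pick a price, utility = revenue
prices = [0.25, 0.50, 0.75, 1.00]
revenues = [40.0, 55.0, 48.0, 30.0]   # hypothetical revenue at each price
print(exponential_mechanism(prices, revenues, eps=0.1, delta_u=1.0))
```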

Unlimited Supply Auctions

• Bidders have demand curves, describing for each price p ∈ [0,1] the number of goods they wish to purchase at p

– Total budget: p · b_i(p) ≤ 1

• Auctioneer picks the p maximizing revenue p · Σ_i b_i(p)

• Select the price using exponential mechanism

– Utility is the revenue

– Approximately truthful: An individual can't influence the price choice by changing her bid

– Resilient to collusion (for small coalitions)!


Efficiency?

• Generating synthetic data can be hard

• Consider the following database

– Choose a single sign/verify key pair (vk*, sk*)

– Database is n rows: (m_i, Sign(sk*, m_i), vk*) for random messages m_i

– One query for each vk: What fraction of rows are valid signatures wrt vk?

• An efficient curator cannot generate a synthetic database (yielding correct answers wrt vk*) without leaking rows


Efficient Synopsis Generation?

• Trivial to find a synopsis with same functionality

– Simply publish 𝑣𝑘∗

• Maintain functionality of the database wrt a set 𝑄 of queries

• Hide presence/absence of any individual

• We will argue that there are hard-to-sanitize distributions

– Assuming so-called traitor tracing schemes

– Converse is also true (but we won’t talk about it)


Traitor Tracing Schemes


[Diagram: a sender encrypts a bit b under the broadcast key bk, c ←$ Enc(bk, b); the same ciphertext c is broadcast to users 1, …, n, and each user i holds a personal key sk_i with Dec(sk_i, c) = b]

Stateful Pirates

• What if some users try to resell the content?

• Some users in the coalition will be traced!


[Diagram: a Tracer holding a tracing key tk feeds ciphertexts c₁, …, c_t to a Pirate Decoder (built from some users' keys sk₁, …, sk_n), observes its outputs b₁, …, b_t, and accuses user i]

Intuition for the Lower Bound

• Assume traitor tracing scheme

• One universe element for each private key

– Database is a collection of n randomly chosen keys

• One query for each ciphertext c

– For what fraction of rows i does Dec(sk_i, c) = 1?

• For any synopsis (i.e., the pirate):

– Answer will be 0 or 1, i.e., the decryption of c

– Tracing reveals ≥ 1 key and never falsely accuses

– Violates differential privacy!


More Lower Bounds

• Theorem: Assuming one-way functions exist, differentially private algorithms for the following tasks require exponential time:

• Synthetic data for 2-way marginals

– Proof relies on digital signatures

• Synopsis release for > n² arbitrary counting queries

– Proof relies on traitor tracing schemes


Approximate Differential Privacy


[Diagram: Data Analyst queries the Database through a Randomized Mechanism 𝓜, sending q₁, q₂ and receiving a₁, a₂]

Mechanism 𝓜 gives (ε, δ)-differential privacy if for all pairs of adjacent datasets x, y, and for all events S:

Pr[𝓜(x) ∈ S] ≤ e^ε · Pr[𝓜(y) ∈ S] + δ

Benefits of the Relaxation

• Advanced composition

– Can answer k queries with cumulative loss √k · ε instead of kε

• Can use cryptography to simulate a trusted center

• Gaussian noise

– Leading to better accuracy


Gaussian Noise


N(μ, σ²) density: p(z) = (1/(σ√(2π))) · e^(−(z−μ)²/(2σ²))

The Gaussian Mechanism


Theorem: Adding noise Lap(Δ₁/ε) yields (ε, 0)-differential privacy

Δ₁ = max over adjacent x, y of ‖q(x) − q(y)‖₁

Theorem: Adding noise N(0, 2·ln(2/δ)·(Δ₂/ε)²) yields (ε, δ)-differential privacy

Δ₂ = max over adjacent x, y of ‖q(x) − q(y)‖₂

For k counting queries: Δ₁ = k, but Δ₂ = √k
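
A sketch of the Gaussian mechanism in Python (numpy assumed), with σ² = 2·ln(2/δ)·(Δ₂/ε)² as in the theorem above:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(true_answer, l2_sensitivity: float, eps: float, delta: float):
    """Release q(x) + N(0, σ²) with σ = sqrt(2·ln(2/δ)) · Δ₂ / ε."""
    sigma = np.sqrt(2.0 * np.log(2.0 / delta)) * l2_sensitivity / eps
    return true_answer + rng.normal(loc=0.0, scale=sigma,
                                    size=np.shape(true_answer))

# k = 100 counting queries: Δ₁ = 100 but Δ₂ = √100 = 10, so per-answer
# Gaussian noise grows like √k where Laplace noise would grow like k
answers = np.full(100, 500.0)   # hypothetical true counts
print(gaussian_mechanism(answers, l2_sensitivity=10.0, eps=1.0, delta=1e-5)[:3])
```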

Incentives

• Until now: Goal was designing differentially private mechanisms, but the data is assumed to be already there

• But why should someone participate in the computation?

• Why would they give their true data?

• Do we need compensation? How much?

• Any connection with game theory?


Game Theory and Mechanism Design

• Goal: Solve optimization problem

• Catch: No access to inputs

– Inputs held by self-interested agents

• Design incentives and a choice of solution (mechanism) that incentivizes truth-telling

– No need for participants to strategize

– Simple to predict what will happen

– Often a non-truth-telling mechanism can be replaced by one where the coordinator does the lying on behalf of the participants


Good News

• Composition: Approximate truthfulness is still satisfied under composition!

• Collusion resistance: O(kε)-approximate dominant strategy, even for coalitions of k agents

• Both properties are not immediate in game-theoretic mechanism design

• All done without money!


Bad News

• Not only truthful reporting gives an approximate dominant strategy: any report does so

– Even malicious ones

• How do we actually get people to truthfully participate?

– Perhaps need compensation

– Much harder to achieve


Differential Privacy as a Tool

• Nash equilibrium: An assignment of players to strategies so that no player would benefit by changing strategy, given how everyone else is playing

• Correlated equilibrium: Players have access to correlated signals (e.g., a traffic light)

– Every Nash equilibrium is a correlated equilibrium, but not vice versa

• Differential privacy has applications to mechanism design with correlated equilibria


The Issue of Verification

• Challenging to strictly incentivize truth-telling in differentially private mechanism design

• Exceptions:

– Responses are verifiable

– Agents care about outcome

• Challenge: No observed outcome

– What is the prevalence of drug use?

– Are students cheating in class?


Bayesian Setting

• Bit-cost pairs (b_i, c_i) drawn from a joint distribution

• If b_i says I am a cheater, I tend to believe we are in a world where people cheat

• But the cost c_i gives no additional information beyond what b_i gives

– Privacy costs are arbitrary, but upper bounded by a cost linear in c_i · ε

– Utility model: c_i · ε − p_i


Privacy & Game Theory

• Asymptotic truthfulness, some new mechanism design and equilibrium selection results

• Interesting challenge of modeling the costs of privacy

• In order to design privacy for humans, do we need to understand:

– How people currently value or should value it?

– What are the right promises to give?


Not a Panacea

• The fundamental law still applies!


Overly accurate estimates of too many statistics is blatantly non-private