TRANSCRIPT
Summer @ Census, 8/15/2013 1
Tuning Privacy-Utility Tradeoffs in Statistical Databases using Policies
Ashwin Machanavajjhala (ashwin@cs.duke.edu)
Collaborators: Daniel Kifer (PSU), Bolin Ding (MSR), Xi He (Duke)
Overview of the talk
• An inherent trade-off between privacy (confidentiality) of individuals and utility of statistical analyses over data collected from individuals.
• Differential privacy has revolutionized how we reason about privacy
– Nice tuning knob ε for trading off privacy and utility
Overview of the talk
• However, differential privacy only captures a small part of the privacy-utility trade-off space
– No Free Lunch Theorem
– Differentially private mechanisms may not ensure sufficient utility
– Differentially private mechanisms may not ensure sufficient privacy
Overview of the talk
• I will present a new privacy framework that allows data publishers to more effectively trade off privacy for utility
– Better control on what to keep secret and who the adversaries are
– Can ensure more utility than differential privacy in many cases
– Can ensure privacy where differential privacy fails
Outline
• Background
– Differential privacy
• No Free Lunch [Kifer-M SIGMOD '11]
– No "one privacy notion to rule them all"
• Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
• Blowfish: Practical privacy using policies [ongoing work]
Data Privacy Problem
[Figure: Individuals 1 through N each contribute a record r1, …, rN to a server's database DB.]
• Utility: statistical analyses over the collected data
• Privacy: no breach about any individual
Data Privacy in the real world
Application            | Data Collector | Third Party (adversary) | Private Information    | Function (utility)
Medical                | Hospital       | Epidemiologist          | Disease                | Correlation between disease and geography
Genome analysis        | Hospital       | Statistician/Researcher | Genome                 | Correlation between genome and disease
Advertising            | Google/FB/Y!   | Advertiser              | Clicks/Browsing        | Number of clicks on an ad by age/region/gender …
Social Recommendations | Facebook       | Another user            | Friend links / profile | Recommend other users or ads to users based on social network
Many definitions & several attacks
Definitions:
• K-Anonymity [Sweeney et al., IJUFKS '02]
• L-diversity [Machanavajjhala et al., TKDD '07]
• T-closeness [Li et al., ICDE '07]
• E-Privacy [Machanavajjhala et al., VLDB '09]
• Differential Privacy [Dwork et al., ICALP '06]
Attacks:
• Linkage attack
• Background knowledge attack
• Minimality / Reconstruction attack
• de Finetti attack
• Composition attack
Differential Privacy
For every pair of inputs D1, D2 that differ in one value, and for every output O, the adversary should not be able to distinguish between D1 and D2 based on O:
| log( Pr[A(D1) = O] / Pr[A(D2) = O] ) | < ε    (ε > 0)
Algorithms
• No deterministic algorithm guarantees differential privacy.
• Random sampling does not guarantee differential privacy.
• Randomized response satisfies differential privacy.
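To see why randomized response satisfies differential privacy: each respondent reports their true bit with probability e^ε / (1 + e^ε), so the ratio of report probabilities under any two true values is at most e^ε. Below is a minimal Python sketch of this idea; the function names are mine, not from the talk.

```python
import math
import random

def randomized_response(true_bit, epsilon):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    The ratio P[report r | bit = 1] / P[report r | bit = 0] is at most
    e^epsilon for either report r, which is exactly epsilon-differential
    privacy for a single binary record.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if random.random() < p_truth else 1 - true_bit

def debias_count(reports, epsilon):
    """Unbiased estimate of the number of 1s from the noisy reports."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    n = len(reports)
    return (sum(reports) - n * (1.0 - p)) / (2.0 * p - 1.0)
```

The aggregator can still estimate population counts from the noisy reports, which is the utility side of the trade-off.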
Laplace Mechanism
[Figure: Laplace distribution Lap(λ), with density h(η) ∝ exp(−|η| / λ); mean 0, variance 2λ².]
A researcher poses query q to database D; instead of the true answer q(D), the mechanism returns q(D) + η, where η ~ Lap(λ).
Privacy depends on the λ parameter.
Laplace Mechanism
Thm [Dwork et al., TCC 2006]: If the sensitivity of the query is S, then adding noise η ~ Lap(λ) with λ = S/ε guarantees ε-differential privacy.
Sensitivity: the smallest number S(q) such that for any D, D' differing in one entry, || q(D) – q(D') ||1 ≤ S(q).
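A minimal Python sketch of the Laplace mechanism (my own naming, assuming a numeric query answer); it draws Lap(λ) noise as the difference of two exponentials:

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release true_answer + eta, with eta ~ Lap(lambda), lambda = S/epsilon.

    Per the theorem above, this guarantees epsilon-differential privacy
    for any query with the given L1 sensitivity.
    """
    scale = sensitivity / epsilon  # lambda = S / epsilon
    # The difference of two i.i.d. Exponential(mean = scale) samples is
    # distributed as Lap(scale): mean 0, variance 2 * scale**2.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_answer + noise
```

For example, a single counting query has sensitivity 1, so laplace_mechanism(count, 1, epsilon) suffices.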
Contingency tables
D is a 2×2 contingency table of counts:
  2  2
  2  8
Each tuple takes one of k = 4 possible values; each cell Count(·,·) reports the number of tuples with that combination of attribute values.
Laplace Mechanism for Contingency Tables
Each cell of D is released with independent noise:
  2 + Lap(2/ε)  2 + Lap(2/ε)
  2 + Lap(2/ε)  8 + Lap(2/ε)
Sensitivity = 2: changing one tuple's value decreases one cell count by 1 and increases another by 1.
The noisy bottom-right cell has Mean 8 and Variance 8/ε².
Composition Property
If algorithms A1, A2, …, Ak use independent randomness and each Ai satisfies εi-differential privacy, then outputting all the answers together satisfies differential privacy with
ε = ε1 + ε2 + … + εk
This total ε is the privacy budget.
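The composition property is what makes a privacy budget operational: to answer k sensitivity-1 queries under a total budget ε, give each query ε/k and pay with proportionally larger noise. A small sketch (my own naming, not from the talk):

```python
import random

def laplace_noise(scale):
    # Lap(scale) sample as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def answer_queries(true_answers, total_epsilon):
    """Answer k sensitivity-1 queries under one total privacy budget.

    Each query gets epsilon_i = total_epsilon / k; by the composition
    property the whole release satisfies total_epsilon-differential
    privacy.  The cost: per-query noise scale grows linearly in k.
    """
    k = len(true_answers)
    eps_i = total_epsilon / k
    return [a + laplace_noise(1.0 / eps_i) for a in true_answers]
```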
Differential Privacy
• Privacy definition that is independent of the attacker's prior knowledge.
• Tolerates many attacks that other definitions are susceptible to.
– Avoids composition attacks
– Claimed to be tolerant against adversaries with arbitrary background knowledge
• Allows simple, efficient and useful privacy mechanisms
– Used in LEHD's OnTheMap [M et al ICDE '08]
Outline
• Background
– Differential privacy
• No Free Lunch [Kifer-M SIGMOD '11]
– No "one privacy notion to rule them all"
• Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
• Blowfish: Practical privacy using policies [ongoing work]
Differential Privacy & Utility
• Differentially private mechanisms may not ensure sufficient utility for many applications.
• Sparse data: the integrated mean squared error of the Laplace mechanism can be worse than that of returning a random contingency table, for typical values of ε (around 1).
• Social networks [M et al PVLDB 2011]
Differential Privacy & Privacy
• Differentially private algorithms may not limit the ability of an adversary to learn sensitive information about individuals when records in the data are correlated.
• Correlations across individuals occur in many ways:
– Social networks
– Data with pre-released constraints
– Functional dependencies
Laplace Mechanism and Correlations
Suppose the row and column marginals of D are also published exactly:
  2 + Lap(2/ε)  2 + Lap(2/ε)  | 4
  2 + Lap(2/ε)  8 + Lap(2/ε)  | 10
  ----------------------------
  4             10
Does the Laplace mechanism still guarantee privacy?
Auxiliary marginals are published for the following reasons:
1. Legal: 2002 Supreme Court case Utah v. Evans
2. Contractual: Advertisers must know exact demographics at coarse granularities
Laplace Mechanism and Correlations
Using the exact marginals, every noisy cell can be turned into an estimate of the same count. For the bottom-right cell:
  Count(·,·) = 8 + Lap(2/ε)   (the cell itself)
  Count(·,·) = 8 – Lap(2/ε)   (row marginal 10 minus the noisy bottom-left cell)
  Count(·,·) = 8 – Lap(2/ε)   (column marginal 10 minus the noisy top-right cell)
  Count(·,·) = 8 + Lap(2/ε)   (total minus both marginals plus the noisy top-left cell)
Averaging k such estimates gives Mean 8 and Variance 8/(kε²).
Laplace Mechanism and Correlations
With a domain of k values, the published marginals yield k noisy estimates of each cell; averaging them drives the variance down to 8/(kε²), so the adversary can reconstruct the table with high precision for large k.
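The averaging attack is easy to simulate: averaging k independent Lap(2/ε) estimates of the same count shrinks the variance from 8/ε² to 8/(kε²). A sketch (the simulation parameters are mine):

```python
import random

def laplace_noise(scale):
    # Lap(scale) sample as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def averaging_attack(true_count, epsilon, k):
    """Average k independent Lap(2/eps) estimates of the same count.

    Models an adversary who uses exact marginals to derive k noisy
    views of one cell; the variance of the average is 8 / (k * eps**2).
    """
    estimates = [true_count + laplace_noise(2.0 / epsilon) for _ in range(k)]
    return sum(estimates) / k
```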
No Free Lunch Theorem [Kifer-M SIGMOD '11] [Dwork-Naor JPC '10]
It is not possible to guarantee any utility in addition to privacy without making assumptions about
• the data generating distribution
• the background knowledge available to an adversary
To sum up …
• Differential privacy only captures a small part of the privacy-utility trade-off space
– No Free Lunch Theorem
– Differentially private mechanisms may not ensure sufficient privacy
– Differentially private mechanisms may not ensure sufficient utility
Outline
• Background
– Differential privacy
• No Free Lunch [Kifer-M SIGMOD '11]
– No "one privacy notion to rule them all"
• Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
• Blowfish: Practical privacy using policies [ongoing work]
Pufferfish Framework
Pufferfish Semantics
• What is being kept secret?
• Who are the adversaries?
• How is information disclosure bounded? (similar to ε in differential privacy)
Sensitive Information
• Secrets: let S be a set of potentially sensitive statements
– "individual j's record is in the data, and j has cancer"
– "individual j's record is not in the data"
• Discriminative pairs: mutually exclusive pairs of secrets
– ("Bob is in the table", "Bob is not in the table")
– ("Bob has cancer", "Bob has diabetes")
Adversaries
• We assume a Bayesian adversary who can be completely characterized by his/her prior information about the data
– We do not assume computational limits
• Data evolution scenarios: the set of all probability distributions that could have generated the data (… think adversary's prior)
– No assumptions: all probability distributions over data instances are possible
– I.I.D.: the set of all f such that P(data = {r1, r2, …, rk}) = f(r1) × f(r2) × … × f(rk)
Information Disclosure
• Mechanism M satisfies ε-Pufferfish(S, Spairs, D) if, for every output w, every discriminative pair (s, s') ∈ Spairs, and every θ ∈ D under which both s and s' have nonzero probability,
  Pr[M(Data) = w | s, θ] ≤ e^ε · Pr[M(Data) = w | s', θ]
Pufferfish Semantic Guarantee
For every discriminative pair (s, s') and prior θ, the posterior odds of s versus s' after seeing the output are within a factor of e^ε of the prior odds:
  e^(–ε) ≤ [ P(s | M(Data) = w, θ) / P(s' | M(Data) = w, θ) ] / [ P(s | θ) / P(s' | θ) ] ≤ e^ε
Pufferfish & Differential Privacy
• Spairs: all pairs (six, siy), where six is the secret "record i takes the value x" and x, y range over the domain
• Attackers should not be able to significantly distinguish between any two values from the domain for any individual record.
Pufferfish & Differential Privacy
• Data evolution: all θ = [f1, f2, f3, …, fk] with P(data = {r1, …, rk} | θ) = f1(r1) × f2(r2) × … × fk(rk)
• Adversary's prior may be any distribution that makes records independent
Pufferfish & Differential Privacy
• Spairs: all pairs (six, siy), where six is "record i takes the value x"
• Data evolution: all θ = [f1, f2, f3, …, fk] that make records independent
A mechanism M satisfies ε-differential privacy if and only if it satisfies ε-Pufferfish instantiated using this Spairs and this set of distributions {θ}.
Summary of Pufferfish
• A semantic approach to defining privacy
– Enumerates the information that is secret and the set of adversaries
– Bounds the odds ratio of pairs of mutually exclusive secrets
• Helps understand the assumptions under which privacy is guaranteed
– Differential privacy is one specific choice of secret pairs and adversaries
• How should a data publisher use this framework?
• Algorithms?
Outline
• Background
– Differential privacy
• No Free Lunch [Kifer-M SIGMOD '11]
– No "one privacy notion to rule them all"
• Pufferfish Privacy Framework [Kifer-M PODS '12]
– Navigating the space of privacy definitions
• Blowfish: Practical privacy using policies [ongoing work]
Blowfish Privacy
• A special class of Pufferfish instantiations
(Both pufferfish and blowfish are marine fish of the family Tetraodontidae.)
Blowfish Privacy
• A special class of Pufferfish instantiations
• Extends differential privacy using policies
– Specification of sensitive information: allows more utility
– Specification of publicly known constraints in the data: ensures privacy in correlated data
• Satisfies the composition property
Sensitive Information (recap)
• Secrets: let S be a set of potentially sensitive statements
– "individual j's record is in the data, and j has cancer"
– "individual j's record is not in the data"
• Discriminative pairs: mutually exclusive pairs of secrets
– ("Bob is in the table", "Bob is not in the table")
– ("Bob has cancer", "Bob has diabetes")
Sensitive Information in Differential Privacy
• Spairs: all pairs (six, siy), where six is the secret "record i takes the value x" and x, y range over the domain
• Attackers should not be able to significantly distinguish between any two values from the domain for any individual record.
Other Notions of Sensitive Information
• Medical data
– OK to infer whether an individual is healthy or not
– E.g., ("Bob is healthy", "Bob has diabetes") is not a discriminative pair of secrets for any individual
• Partitioned sensitive information: discriminative pairs are drawn only from within a partition of the domain (e.g., between two diseases, but not between healthy and sick)
Other Notions of Sensitive Information
• Geospatial data
– Do not want the attacker to distinguish between "close-by" points in the space
– May distinguish between "far-away" points
• Distance-based sensitive information
Generalization as a graph
• Consider a graph G = (V, E), where V is the set of values that an individual's record can take
• E encodes the set of discriminative pairs (the same for all records)
Blowfish Privacy + "Policy of Secrets"
• A mechanism M satisfies Blowfish privacy w.r.t. policy G if
– for every set S of outputs of the mechanism, and
– for every pair of datasets D1, D2 that differ in one record, with values x and y s.t. (x, y) ∈ E:
  Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S]
Blowfish Privacy + "Policy of Secrets"
• The guarantee extends to any x and y in the domain:
  Pr[M(D1) ∈ S] ≤ e^(ε · dG(x, y)) · Pr[M(D2) ∈ S]
where dG(x, y) is the shortest distance between x and y in G.
Blowfish Privacy + "Policy of Secrets"
• The adversary is allowed to distinguish between values x and y that appear in different disconnected components of G (where dG(x, y) = ∞).
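The protection between non-adjacent values is governed by the shortest-path distance dG(x, y) in the policy graph, and values in different components get no protection relative to each other. A small BFS sketch of dG (my own code, not from the talk):

```python
from collections import deque

def policy_distance(edges, x, y):
    """Shortest distance d_G(x, y) in an undirected policy graph G.

    Returns infinity when x and y lie in different connected components,
    i.e. when the adversary is allowed to fully distinguish them.
    """
    # Build an adjacency map from the edge list of discriminative pairs.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Breadth-first search from x until y is reached.
    seen, frontier = {x}, deque([(x, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == y:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return float("inf")
```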
Algorithms for Blowfish
• Consider an ordered 1-D attribute with Dom = {x1, x2, x3, …, xd} (e.g., ranges of Age, Salary, etc.)
• Suppose our policy is: the adversary should not distinguish whether an individual's value is xj or xj+1
  [Policy graph G: the line x1 — x2 — x3 — … — xd]
Algorithms for Blowfish
• Suppose we want to release the histogram privately: the count C(xi) of individuals in each range
• Any differentially private algorithm also satisfies Blowfish
– Can use the Laplace mechanism (with sensitivity 2): release each C(xi) + Lap(2/ε)
Ordered Mechanism
• We can answer a different set of queries to get a different private estimator for the histogram: the prefix sums S1, S2, …, Sd, where Si = C(x1) + C(x2) + … + C(xi)
Ordered Mechanism
• We can answer each Si using the Laplace mechanism …
• … and the sensitivity of the entire set of queries is only 1
– Changing one tuple from x2 to x3 turns C(x2) into C(x2) – 1 and C(x3) into C(x3) + 1, which changes only S2
Ordered Mechanism
• Each Si can therefore be answered with Lap(1/ε) noise instead of Lap(2/ε), a factor of 2 improvement; the histogram is recovered by differencing, C(xi) = Si – Si–1
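A sketch of the ordered mechanism on the line-graph policy (my own code; the paper's algorithm may differ in details): answer noisy prefix sums with Lap(1/ε) and recover the histogram by differencing.

```python
import random

def laplace_noise(scale):
    # Lap(scale) sample as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def ordered_mechanism(counts, epsilon):
    """Release noisy prefix sums S_i = C(x_1) + ... + C(x_i) + Lap(1/eps).

    Under the line-graph policy, moving one tuple from x_j to x_{j+1}
    changes only S_j, so the whole query set has sensitivity 1.
    """
    noisy_prefix, running = [], 0
    for c in counts:
        running += c
        noisy_prefix.append(running + laplace_noise(1.0 / epsilon))
    # Recover a noisy histogram by differencing adjacent prefix sums.
    noisy_counts = [noisy_prefix[0]] + [
        noisy_prefix[i] - noisy_prefix[i - 1] for i in range(1, len(noisy_prefix))
    ]
    return noisy_prefix, noisy_counts
```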
Ordered Mechanism
• In addition, the true prefix sums satisfy the constraint S1 ≤ S2 ≤ … ≤ Sd.
• However, the noisy counts may not satisfy this constraint.
• We can post-process the noisy counts to ensure this constraint.
Ordered Mechanism
• Post-processing the noisy counts to enforce the ordering constraint yields an order-of-magnitude improvement in accuracy for large d.
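The constraint in question is that true prefix sums are non-decreasing (S1 ≤ S2 ≤ … ≤ Sd). One standard way to project noisy prefix sums onto that constraint is isotonic regression via pool adjacent violators; the sketch below is my illustration of that idea, and the paper's exact post-processing estimator may differ.

```python
def isotonic_fit(noisy_prefix):
    """Least-squares projection of noisy prefix sums onto the monotone
    cone S_1 <= S_2 <= ... <= S_d (pool adjacent violators)."""
    # Each block stores [sum, count] of pooled values; block means must
    # be non-decreasing from left to right.
    blocks = []
    for v in noisy_prefix:
        blocks.append([v, 1])
        # Merge neighboring blocks while their means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    # Expand each block back to one fitted value per position.
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted
```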
Ordered Mechanism
• By leveraging the weaker sensitive information in the policy, we can provide significantly better utility.
• Extends to more general policy specifications.
• Ordered mechanisms and other Blowfish algorithms are being tested on the synthetic data generator for the LODES data product.
Blowfish Privacy & Correlations
• Differentially private mechanisms may not ensure privacy when correlations exist in the data.
• Blowfish can handle correlations in the form of publicly known constraints:
– Well-known marginal counts in the data
– Other dependencies
• The privacy definition is similar to differential privacy, with a modified notion of neighboring tables.
Other Instantiations of Pufferfish
• All Blowfish instantiations are extensions of differential privacy using
– Weaker notions of sensitive information
– Knowledge of constraints about the data
– All Blowfish mechanisms satisfy the composition property
• We can instantiate Pufferfish with other "realistic" adversary notions
– Only prior distributions that are similar to the expected data distribution
– Open question: which definitions satisfy the composition property?
Summary
• Differential privacy (and the tuning knob ε) is insufficient for trading off privacy for utility in many applications
– Sparse data, social networks, …
• The Pufferfish framework allows more expressive privacy definitions
– Can vary sensitive information, adversary priors, and ε
• Blowfish shows one way to create more expressive definitions
– Can provide useful, composable mechanisms
• There is an opportunity to correctly tune privacy by using these expressive privacy frameworks
Thank you
[M et al PVLDB'11] A. Machanavajjhala, A. Korolova, A. Das Sarma, "Personalized Social Recommendations – Accurate or Private?", PVLDB 4(7), 2011.
[Kifer-M SIGMOD'11] D. Kifer, A. Machanavajjhala, "No Free Lunch in Data Privacy", SIGMOD 2011.
[Kifer-M PODS'12] D. Kifer, A. Machanavajjhala, "A Rigorous and Customizable Framework for Privacy", PODS 2012.
[ongoing work] A. Machanavajjhala, B. Ding, X. He, "Blowfish Privacy: Tuning Privacy-Utility Trade-offs using Policies", in preparation.