a general methodology for masking output from remote analysis … · 2013-11-13 · output from...
A GENERAL METHODOLOGY FOR MASKING
OUTPUT FROM REMOTE ANALYSIS SYSTEMS
Krish Muralidhar
Christine O’Keefe
Rathindra Sarathy
REMOTE ANALYSIS SYSTEM
O’Keefe and Chipperfield (in press)
[Diagram: remote analysis system with components Query, Dataset, Data Transformations, Analysis, Output, and Output for publication]
FOCUS OF THIS PAPER
Responses to statistical queries involving
numerical variables
We explicitly do not consider tabular data release
DATA-BASED CONFIDENTIALIZATION MEASURES
FOR REMOTE ANALYSIS
Input Perturbation and Data Subsetting
Restrictions on Data Transformations
ANALYSIS-BASED CONFIDENTIALIZATION
MEASURES
Refusal to answer risky queries
Output checking
OUTPUT CONFIDENTIALIZATION
Modify output prior to release
EFFECTIVE OUTPUT MASKING
Respond to a diverse set of queries
Meaningful responses to queries
Robust
Control disclosure risk
Automated
OUTPUT MASKING MECHANISMS
Additive Perturbation
Including differential privacy
In our opinion, the applicability of differential privacy for
statistical analyses involving numerical variables is open
to question. We do not consider differential privacy
further
Multiplicative perturbation
A SIMPLE ILLUSTRATION
Query: “What is the variance of a particular
subset of the data (n = 100)?”
True response: 3.81
RESPONSE DISTRIBUTION - ADDITIVE NOISE
But which one?
RESPONSE DISTRIBUTION - MULTIPLICATIVE
But which one?
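The question on the two slides above can be made concrete with a minimal sketch. It perturbs the true response of 3.81 with additive and multiplicative noise at several noise scales. The normal noise model, the function names, and the specific scales are assumptions for illustration, not values from the talk.

```python
import random

TRUE_RESPONSE = 3.81  # true variance from the slide


def additive_mask(value, noise_sd, rng):
    """Return value + e, with e ~ Normal(0, noise_sd) (assumed noise model)."""
    return value + rng.gauss(0.0, noise_sd)


def multiplicative_mask(value, noise_sd, rng):
    """Return value * m, with m ~ Normal(1, noise_sd) (assumed noise model)."""
    return value * rng.gauss(1.0, noise_sd)


rng = random.Random(2013)  # fixed seed so the sketch is reproducible
for noise_sd in (0.1, 0.5, 1.0):  # hypothetical noise scales
    add = additive_mask(TRUE_RESPONSE, noise_sd, rng)
    mult = multiplicative_mask(TRUE_RESPONSE, noise_sd, rng)
    print(f"noise_sd={noise_sd}: additive={add:.2f}, multiplicative={mult:.2f}")
```

Each noise scale yields a different response distribution, and nothing in the query itself says which scale is appropriate. With a large scale, additive noise can even return a negative value for a variance.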
DRAW FROM THE SAMPLING DISTRIBUTION
Use the chi-square distribution to approximate the sampling distribution of the sample variance. Draw the response from this distribution.
ROBUST?
The chi-square approximation is sensitive to the normality assumption and is not very robust. The data in this case are heavily skewed.
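The chi-square draw described above can be sketched as follows. The sketch assumes normally distributed data, so that (n - 1)s²/σ² follows a chi-square distribution with n - 1 degrees of freedom, and it plugs the observed variance in for the unknown σ². The function name is hypothetical; the chi-square variate is generated as a gamma variate with shape (n - 1)/2 and scale 2.

```python
import random


def chi_square_masked_variance(sample_variance, n, rng):
    """Draw a masked variance from the chi-square approximation.

    Under normality, (n - 1) * s^2 / sigma^2 ~ chi-square(n - 1).
    Assumption: the observed s^2 stands in for the unknown sigma^2.
    """
    df = n - 1
    chi_square_draw = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df)
    return sample_variance * chi_square_draw / df


rng = random.Random(2013)
masked = chi_square_masked_variance(3.81, 100, rng)
print(f"masked variance: {masked:.2f}")
```

For heavily skewed data, the true sampling distribution of the variance can be much wider than this approximation suggests, which is the robustness concern raised on the slide.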
AN IDEAL MASKING MECHANISM
For any query, select a random sample from the
relevant population (not the database),
compute the value of the statistic, and release
this value
Practically infeasible
ALTERNATIVE MECHANISM
For any query, derive the sampling distribution
of the statistic. Randomly draw a value from
this distribution. Release this value
May be feasible for some simple statistics (like the
sample mean), but as our variance example
illustrates, may not be possible for others
Theoretically infeasible
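For the sample mean, where this mechanism may be feasible, a minimal sketch could draw the response from a normal approximation to the sampling distribution, centred on the observed mean with the estimated standard error as its spread. The central limit theorem approximation, the function name, and the example data are assumptions for illustration.

```python
import math
import random
import statistics


def masked_mean_from_sampling_distribution(values, rng):
    """Draw a masked mean from Normal(mean, s / sqrt(n)).

    Assumption: the sampling distribution of the mean is approximately
    normal (central limit theorem), with s the sample standard deviation.
    """
    n = len(values)
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return rng.gauss(mean, standard_error)


rng = random.Random(2013)
data = [rng.expovariate(1.0) for _ in range(100)]  # hypothetical skewed query set
print(f"true mean:   {statistics.fmean(data):.3f}")
print(f"masked mean: {masked_mean_from_sampling_distribution(data, rng):.3f}")
```

No comparably simple closed form is available for many other statistics, which is why this route does not generalize.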
A FEASIBLE APPROACH
Selecting a value from the sampling
distribution of the statistic always provides an
appropriate masked response
Problem – how do we obtain an approximation of the
sampling distribution of the statistic that is
both accurate and robust?
Solution – THE STATISTICAL BOOTSTRAP
THE STATISTICAL BOOTSTRAP (EFRON 1979)
Draw a bootstrap sample of size n, with replacement, from the original sample also of size n.
Compute value of statistic from the bootstrap sample
Repeat process of selecting bootstrap samples
The standard deviation of the values of the statistic from the bootstrap samples provides a good approximation of the standard error of the statistic
The distribution of $\hat{\theta}^* - \hat{\theta}$ provides a good
approximation of the distribution of $\hat{\theta} - \theta$
$\theta$ – Parameter; $\hat{\theta}$ – Statistic; $\hat{\theta}^*$ – Bootstrap statistic
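The steps above can be sketched as follows, using the sample variance as the statistic. The number of bootstrap replications (1000), the function name, and the exponential example data are assumptions for illustration.

```python
import random
import statistics


def bootstrap_distribution(values, statistic, n_boot, rng):
    """Return n_boot values of the statistic, one per bootstrap sample."""
    n = len(values)
    results = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = rng.choices(values, k=n)
        results.append(statistic(resample))
    return results


rng = random.Random(2013)
data = [rng.expovariate(0.5) for _ in range(100)]  # hypothetical skewed sample
theta_hat = statistics.variance(data)

boot = bootstrap_distribution(data, statistics.variance, 1000, rng)
standard_error = statistics.stdev(boot)  # bootstrap estimate of the standard error
print(f"sample variance: {theta_hat:.2f}, bootstrap SE: {standard_error:.2f}")
```

No distributional assumption enters the resampling step, so the same function works unchanged for other statistics passed in as `statistic`.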
BACK TO OUR EXAMPLE
APPROPRIATE MASKED RESPONSE
Since the bootstrap distribution of the statistic
closely approximates the sampling distribution
of the statistic, choosing a value randomly from
the bootstrap distribution is a close
approximation of choosing a value randomly
from the true sampling distribution of the
statistic
Close equivalent to drawing an independent sample
from the population
CHOOSING FROM THE BOOTSTRAP
DISTRIBUTION
Only a single realization from the bootstrap
distribution is required
A single realization from the bootstrap
distribution is the result of selecting a single
bootstrap sample
No need to construct the entire bootstrap
distribution!
ACTUAL MASKING PROCEDURE
From the original query set, select one
bootstrap sample of the same size as the
original set, with replacement.
Compute the value of the statistic for this
bootstrap sample.
Release the value of this statistic as the
masked response.
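The three steps above can be sketched in a few lines. The function name and the example query set are hypothetical; the statistic is passed in as a callable so that the same routine can serve different queries.

```python
import random
import statistics


def bootstrap_masked_response(query_set, statistic, rng):
    """Mask a query response with a single bootstrap sample.

    1. Resample the query set with replacement, keeping the same size.
    2. Compute the statistic on that one bootstrap sample.
    3. Return the value as the masked response.
    """
    resample = rng.choices(query_set, k=len(query_set))
    return statistic(resample)


rng = random.Random(2013)
query_set = [rng.expovariate(0.5) for _ in range(100)]  # hypothetical query set
true_value = statistics.variance(query_set)
masked_value = bootstrap_masked_response(query_set, statistics.variance, rng)
print(f"true variance:   {true_value:.2f}")
print(f"masked variance: {masked_value:.2f}")
```

Passing a different callable, for example `statistics.median`, would mask a different query with the same routine.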
CHARACTERISTICS OF THE BOOTSTRAP METHOD
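The distribution of $\hat{\theta}^*$ closely approximates the
sampling distribution of $\hat{\theta}$,
If $\hat{\theta}$ is an unbiased estimator, then $E(\hat{\theta}^*) = \theta$,
and
Variance of $\hat{\theta}^* = \sigma_{\hat{\theta}}^2$, the variance of $\hat{\theta}$.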
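As a worked check for one familiar statistic, take the sample mean of $x_1, \dots, x_n$ and let $\bar{x}^*$ be the mean of a single bootstrap sample. The choice of the mean as the example is an assumption made here, not part of the slides. Conditional on the observed data, standard bootstrap algebra gives:

```latex
\begin{aligned}
E\left(\bar{x}^* \mid x_1,\dots,x_n\right) &= \bar{x}, \\
\operatorname{Var}\left(\bar{x}^* \mid x_1,\dots,x_n\right)
  &= \frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2
   = \frac{n-1}{n}\cdot\frac{s^2}{n}.
\end{aligned}
```

Taking expectations over samples then gives $E(\bar{x}^*) = E(\bar{x}) = \mu$, and for moderately large $n$ the conditional variance is close to $s^2/n$, the usual estimate of the variance of $\bar{x}$.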
PERFORMANCE OF THE BOOTSTRAP METHOD
Easy implementation
Usefulness: $\hat{\theta}^*$ is a random value chosen from a distribution that closely approximates the
sampling distribution of $\hat{\theta}$
Disclosure risk: Noise addition approximately
equal to the standard error of the statistic $\hat{\theta}$
Robust (no assumptions)
Easily automated and programmed without the need for ongoing human intervention.
FUTURE RESEARCH
Tabular data
Multiple imputation using the bootstrap
Compare with Rubin’s Bayesian bootstrap
Relationship between the bootstrap and
smooth sensitivity
QUESTIONS OR COMMENTS?
Thank you