a general methodology for masking output from remote analysis … · 2013-11-13 · output from...

A GENERAL METHODOLOGY FOR MASKING OUTPUT FROM REMOTE ANALYSIS SYSTEMS

Krish Muralidhar

Christine O’Keefe

Rathindra Sarathy

REMOTE ANALYSIS SYSTEM

O’Keefe and Chipperfield (in press)

[Figure: components of a remote analysis system: Query, Dataset, Analysis, Output, Data Transformations, Output for publication]

FOCUS OF THIS PAPER

Responses to statistical queries involving numerical variables

We explicitly do not consider tabular data release

DATA-BASED CONFIDENTIALIZATION MEASURES FOR REMOTE ANALYSIS

Input Perturbation and Data Subsetting

Restrictions on Data Transformations

[Figure: components of a remote analysis system: Query, Dataset, Analysis, Output, Data Transformations, Output for publication]

ANALYSIS-BASED CONFIDENTIALIZATION MEASURES

Refusal to answer risky queries

Output checking

[Figure: components of a remote analysis system: Query, Dataset, Analysis, Output, Data Transformations, Output for publication]

OUTPUT CONFIDENTIALIZATION

Modify output prior to release

[Figure: components of a remote analysis system: Query, Dataset, Analysis, Output, Data Transformations, Output for publication]

EFFECTIVE OUTPUT MASKING

Respond to a diverse set of queries

Meaningful responses to queries

Robust

Control disclosure risk

Automated

OUTPUT MASKING MECHANISMS

Additive Perturbation

Including differential privacy

In our opinion, the applicability of differential privacy to statistical analyses involving numerical variables is open to question. We do not consider differential privacy further.

Multiplicative perturbation

A SIMPLE ILLUSTRATION

Query: “What is the variance of a particular subset of the data (n = 100)?”

True response: 3.81

RESPONSE DISTRIBUTION - ADDITIVE NOISE

But which one?

RESPONSE DISTRIBUTION - MULTIPLICATIVE

But which one?
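The "which one?" question can be made concrete with a small sketch. The Gaussian noise model, the function names, and the three noise levels below are assumptions chosen for illustration; the slides do not specify them. The point is only that each choice of noise level gives a different response distribution around the true value 3.81, and nothing in the query itself says which level is appropriate.

```python
import random

def additive_mask(true_value, noise_sd, rng):
    # Additive perturbation: response = truth + zero-mean noise.
    return true_value + rng.gauss(0.0, noise_sd)

def multiplicative_mask(true_value, noise_sd, rng):
    # Multiplicative perturbation: response = truth * noise centred on 1.
    return true_value * rng.gauss(1.0, noise_sd)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
true_variance = 3.81     # true response from the illustration

for noise_sd in (0.1, 0.5, 2.0):  # assumed candidate noise levels
    add = additive_mask(true_variance, noise_sd, rng)
    mult = multiplicative_mask(true_variance, noise_sd, rng)
    print(f"noise_sd={noise_sd}: additive={add:.2f}, multiplicative={mult:.2f}")
```

With Gaussian noise, a large enough noise level can produce a negative "variance", which is one reason an arbitrary noise distribution may not give a meaningful response.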

DRAW FROM THE SAMPLING DISTRIBUTION

Use Chi-Square distribution to approximate the sampling distribution of the sample variance. Draw the response from this distribution.

ROBUST? The chi-square approximation is sensitive to the normality assumption and is not very robust. The data in this case are heavily skewed.
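A minimal sketch of the chi-square draw, under stated assumptions: if the data were normal, (n − 1)S²/σ² would follow a chi-square distribution with n − 1 degrees of freedom, so a response can be drawn as s² · χ²/(n − 1), with the observed s² plugged in for σ². The function name and the exponential test data are hypothetical, not taken from the slides.

```python
import random
import statistics

def chi_square_masked_variance(data, rng):
    """Draw a masked variance using the normal-theory chi-square approximation."""
    n = len(data)
    s2 = statistics.variance(data)  # sample variance (n - 1 denominator)
    # A chi-square variate with k degrees of freedom is Gamma(shape=k/2, scale=2).
    chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
    return s2 * chi2 / (n - 1)

rng = random.Random(0)
# Assumed stand-in for a heavily skewed query set of size 100.
skewed = [rng.expovariate(1.0) for _ in range(100)]
print("sample variance:", round(statistics.variance(skewed), 3))
print("masked response:", round(chi_square_masked_variance(skewed, rng), 3))
```

The code runs on any data, but the spread of the draws is only correct when the normality assumption holds. For skewed data like the set above, the true sampling distribution of the variance is typically wider than the chi-square approximation suggests.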

AN IDEAL MASKING MECHANISM

For any query, select a random sample from the relevant population (not the database), compute the value of the statistic, and release this value

Practically infeasible

ALTERNATIVE MECHANISM

For any query, derive the sampling distribution of the statistic. Randomly draw a value from this distribution. Release this value.

May be feasible for some simple statistics (like the sample mean) but, as our variance example illustrates, may not be possible for others

Theoretically infeasible
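For the sample mean, the alternative mechanism can be sketched as follows. This assumes the usual large-sample normal approximation, with the mean distributed roughly as N(x̄, s²/n); the function name and the test data are hypothetical.

```python
import random
import statistics

def normal_masked_mean(data, rng):
    """Draw a masked mean from the approximate sampling distribution N(mean, s^2 / n)."""
    n = len(data)
    mean = statistics.mean(data)
    std_error = statistics.stdev(data) / n ** 0.5  # estimated standard error of the mean
    return rng.gauss(mean, std_error)

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100)]  # assumed example query set
print("true mean:      ", round(statistics.mean(data), 3))
print("masked response:", round(normal_masked_mean(data, rng), 3))
```

No equally simple closed form is available for most other statistics, which is the limitation this slide points to.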

A FEASIBLE APPROACH

Selecting a value from the sampling distribution of the statistic always provides an appropriate masked response

Problem – how do we obtain an approximation of the sampling distribution of the statistic that is both accurate and robust?

Solution – THE STATISTICAL BOOTSTRAP

THE STATISTICAL BOOTSTRAP (EFRON 1979)

Draw a bootstrap sample of size n, with replacement, from the original sample, which is also of size n.

Compute the value of the statistic from the bootstrap sample

Repeat the process many times, drawing a new bootstrap sample each time

The standard deviation of the values of the statistic across the bootstrap samples provides a good approximation of the standard error of the statistic

The distribution of $\hat{\theta}^* - \hat{\theta}$ provides a good approximation of the distribution of $\hat{\theta} - \theta$

$\theta$ – Parameter; $\hat{\theta}$ – Statistic; $\hat{\theta}^*$ – Bootstrap statistic
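The bootstrap steps above can be sketched in a few lines. The function names, the number of bootstrap replications, and the exponential test data are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_values(data, statistic, n_boot, rng):
    """Return the statistic computed on n_boot bootstrap samples of the data."""
    n = len(data)
    # rng.choices samples with replacement; each resample has the original size n.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]

def bootstrap_standard_error(data, statistic, n_boot, rng):
    """Approximate the standard error by the spread of the bootstrap values."""
    return statistics.stdev(bootstrap_values(data, statistic, n_boot, rng))

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100)]  # assumed skewed sample, n = 100
se = bootstrap_standard_error(data, statistics.variance, 2000, rng)
print("sample variance:         ", round(statistics.variance(data), 3))
print("bootstrap standard error:", round(se, 3))
```

Because `statistic` is passed in as a function, the same code applies unchanged to the mean, the variance, or any other statistic that can be computed from a list of values.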

BACK TO OUR EXAMPLE

APPROPRIATE MASKED RESPONSE

Since the bootstrap distribution of the statistic closely approximates the sampling distribution of the statistic, choosing a value randomly from the bootstrap distribution is a close approximation of choosing a value randomly from the true sampling distribution of the statistic

Close equivalent to drawing an independent sample from the population

CHOOSING FROM THE BOOTSTRAP DISTRIBUTION

Only a single realization from the bootstrap distribution is required

A single realization from the bootstrap distribution is the result of selecting a single bootstrap sample

No need to construct the entire bootstrap distribution!

ACTUAL MASKING PROCEDURE

From the original query set, select one bootstrap sample of the same size as the original set, with replacement.

Compute the value of the statistic for this bootstrap sample.

Release the value of this statistic as the masked response.
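The three steps of the masking procedure can be sketched as a single function. The function name and the example query set are hypothetical; the slides describe the procedure but do not give an implementation.

```python
import random
import statistics

def bootstrap_masked_response(query_set, statistic, rng):
    """Mask a query response using one bootstrap sample of the query set."""
    # Step 1: one resample, with replacement, of the same size as the query set.
    resample = rng.choices(query_set, k=len(query_set))
    # Steps 2 and 3: compute the statistic on the resample and release that value.
    return statistic(resample)

rng = random.Random(0)
query_set = [rng.expovariate(1.0) for _ in range(100)]  # assumed skewed data, n = 100
print("true variance:  ", round(statistics.variance(query_set), 3))
print("masked response:", round(
    bootstrap_masked_response(query_set, statistics.variance, rng), 3))
```

Only one resample is drawn per query, so the cost of masking is roughly the cost of computing the statistic once.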

CHARACTERISTICS OF THE BOOTSTRAP METHOD

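The distribution of $\hat{\theta}^*$ closely approximates the sampling distribution of $\hat{\theta}$,

If $\hat{\theta}$ is an unbiased estimator, then $E(\hat{\theta}^*) = \theta$, and

Variance of $\hat{\theta}^* = \sigma_{\hat{\theta}}^2$, the variance of $\hat{\theta}$.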
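As a worked special case, these properties can be checked directly for the sample mean. The derivation below conditions on the observed query set $x_1, \dots, x_n$ and writes $E^*$ and $\mathrm{Var}^*$ for expectation and variance over the bootstrap resampling only; that notation is an assumption introduced here, not taken from the slides.

```latex
% Each bootstrap observation x_i^* is drawn uniformly from {x_1, ..., x_n}.
\begin{align*}
\bar{x}^* &= \frac{1}{n}\sum_{i=1}^{n} x_i^* \\
E^*\!\left(\bar{x}^*\right) &= \frac{1}{n}\sum_{i=1}^{n} E^*\!\left(x_i^*\right) = \bar{x} \\
\mathrm{Var}^*\!\left(\bar{x}^*\right)
  &= \frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
   = \frac{n-1}{n}\cdot\frac{s^2}{n}
\end{align*}
```

So, given the data, the bootstrap mean is centred on the sample mean, and its variance equals the usual estimate $s^2/n$ of the variance of $\bar{x}$ up to the factor $(n-1)/n$, which is close to 1 for n = 100.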

PERFORMANCE OF THE BOOTSTRAP METHOD

Easy implementation

Usefulness: $\hat{\theta}^*$ is a random value chosen from a distribution that closely approximates the sampling distribution of $\hat{\theta}$

Disclosure risk: the noise added is approximately equal to the standard error of the statistic $\hat{\theta}$

Robust (no distributional assumptions)

Easily automated and programmed without the need for ongoing human intervention.

FUTURE RESEARCH

Tabular data

Multiple imputation using the bootstrap

Compare with Rubin’s Bayesian bootstrap

Relationship between the bootstrap and smooth sensitivity

QUESTIONS OR COMMENTS?

Thank you
