
Using Data Privacy for Better Adaptive Predictions

Vitaly Feldman IBM Research – Almaden

Foundations of Learning Theory, 2014

Joint work with: Cynthia Dwork (MSR SVC), Moritz Hardt (IBM Almaden), Omer Reingold (MSR SVC), Aaron Roth (UPenn, CS)

Statistical inference

Genome Wide Association Studies
Given: DNA sequences with medical records
Discover:
• Find SNPs associated with diseases
• Predict chances of developing some condition
• Predict drug effectiveness
• Hypothesis testing

Given: $n$ samples drawn i.i.d. from an unknown distribution $P$ over a domain $X$. Output a solution of value $v$ with a guarantee that, w.p. $\geq 1-\beta$ (over the choice of the samples), $v$ is close to the true value on $P$.

Existing approaches

Theoretical ML
• Uniform convergence bounds for the solution class: for every $\phi$ in a class of complexity $d$, w.p. $\geq 1-\beta$, $|\mathbf{E}_S[\phi] - \mathbf{E}_P[\phi]| \leq O\big(\sqrt{(d + \log(1/\beta))/n}\big)$
• Output stability-based bounds
• But often too loose in practice (do not exploit additional structure) and complicated to derive

Practical ML
• Cross-validation

Statistics
• Model-specific fit and significance tests
• Bootstrapping etc.

Real world is interactive

Outcomes of analyses inform future manipulations on the same data:
• Exploratory data analysis
• Model selection
• Feature selection
• Hyper-parameter tuning
• Public data: findings inform others

Samples are no longer i.i.d.!

Is the issue real?

Freedman’s paradox (1983):
Data: $n$ points with $d$ explanatory variables and a response, all drawn as independent noise.
Throw away the variables that appear uncorrelated with the response.
Perform least squares regression over the remaining variables: the selected variables now look highly significant.

“Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.”
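To make the effect concrete, here is a minimal simulation in the spirit of Freedman's experiment; all names and sizes are illustrative, and the data is pure noise by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d = 100, 50                          # pure noise: no variable truly explains y
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Step 1: screen out variables that look uncorrelated with the response.
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(d)])
Xs = X[:, pvals < 0.25]

# Step 2: ordinary least squares on the surviving variables only.
Xd = np.column_stack([np.ones(n), Xs])
beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta
df = n - Xd.shape[1]
se = np.sqrt(resid @ resid / df * np.diag(np.linalg.inv(Xd.T @ Xd)))
p = 2 * stats.t.sf(np.abs(beta / se), df)

# Several coefficients now pass conventional significance tests,
# even though y is independent of every column of X.
print(f"kept {Xs.shape[1]} of {d} variables; "
      f"{(p[1:] < 0.05).sum()} look significant at the 5% level")
```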

Kaggle competitions

[Diagram: competition data is split into a public and a private part; submissions receive a public score on the public part, while the final standing uses the private score on the private part.]

http://www.rouli.net/2013/02/five-lessons-from-kaggles-event.html

“If you based your model solely on the data which gave you constant feedback, you run the danger of a model that overfits to the specific noise in that data.” –Kaggle FAQ.

Adaptive statistical queries

An SQ is a function $\phi: X \to [0,1]$; $\tau$ is the tolerance of the query. With probability $\geq 1-\beta$, for all $i$: $|v_i - \mathbf{E}_{x\sim P}[\phi_i(x)]| \leq \tau$.

[Diagram: learning algorithm(s) adaptively send queries $\phi_1, \phi_2, \ldots, \phi_t$ to an SQ oracle [K93, FGRVX13] holding the data set $S$ and receive answers $v_1, v_2, \ldots, v_t$.]

SQs can measure error/performance and test hypotheses, and can be used in place of samples in most algorithms (see the sketch after the list below)!

SQ algorithms:
• PAC learning algorithms (except parities)
• Convex optimization (Ellipsoid, iterative methods)
• Expectation maximization (EM)
• SVM (with kernel)
• PCA
• ICA
• ID3
• k-means
• Method of moments
• MCMC
• Naïve Bayes
• Neural networks (backprop)
• Perceptron
• Nearest neighbors
• Boosting
[K 93, BDMN 05, CKLYBNO 06, FPV 14]
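As a minimal illustration of the interface (the class name and the noise model are placeholders, not the mechanisms analyzed later), an SQ oracle exposes only noisy expectations, never raw samples:

```python
import numpy as np

class SQOracle:
    """Answers statistical queries phi: X -> [0,1] to within tolerance tau.
    The empirical mean plus bounded noise stands in for whatever
    mechanism actually provides the guarantee."""

    def __init__(self, samples, tau, seed=0):
        self.samples = samples
        self.tau = tau
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        empirical = float(np.mean([phi(z) for z in self.samples]))
        return empirical + self.rng.uniform(-self.tau, self.tau)

# E.g., measuring a hypothesis' error rate through the oracle alone:
#   err = oracle.query(lambda xy: float(h(xy[0]) != xy[1]))
```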

For a query $\phi$, respond with the empirical mean $\phi(S) = \frac{1}{n}\sum_{x\in S}\phi(x)$. How many samples are needed to answer $t$ queries? If $\phi_1,\ldots,\phi_t$ are fixed in advance then $n = O(\log(t/\beta)/\tau^2)$ suffices.

With 1 round of adaptivity (constant $\tau$ and $\beta$)?

Naïve answering

Chernoff bound + union bound: each fixed query is answered to within $\tau$ w.p. $\geq 1-\beta/t$ from $n = O(\log(t/\beta)/\tau^2)$ samples. Once a query may depend on earlier answers, the union bound over a fixed set of queries no longer applies.
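The following small simulation (sizes illustrative) shows the guarantee failing once a query may depend on earlier answers: each individual query is harmless, but a single adaptively aggregated query overfits badly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 1000, 500
X = rng.choice([-1.0, 1.0], size=(n, t))   # feature i defines query phi_i = x_i * y
y = rng.choice([-1.0, 1.0], size=n)        # label independent of every feature

# Answer each SQ with the exact empirical mean; every true mean is 0.
answers = (X * y[:, None]).mean(axis=0)

# Adaptive step: aggregate features in the direction of the answers received.
phi_final = np.sign(answers)
empirical = float(((X @ phi_final) * y).mean() / np.sqrt(t))

# True mean of the aggregated query is still 0, but its empirical mean
# concentrates around sqrt(2*t/(pi*n)) ~ 0.56 here, far beyond the
# ~1/sqrt(n) ~ 0.03 deviation of an honestly estimated fresh query.
print(f"empirical mean of the adaptive query: {empirical:.3f}")
```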

Our result

There exists an algorithm that can answer $t$ adaptively chosen SQs such that with probability $\geq 1-\beta$ the answers are $\tau$-valid, using $n = \mathrm{poly}(\log t, 1/\tau, \log(1/\beta), \log|X|)$ samples; in particular, $t$ can be exponential in $n$.

The algorithm runs in time $\mathrm{poly}(n, |X|)$ per query.

Also: this cannot be achieved efficiently; there is a lower bound for poly-time algorithms under cryptographic assumptions [HU14].

Key idea: a data set analyzed differentially privately behaves like fresh samples.

Privacy-preserving data analysis

How to get utility from data while preserving the privacy of individuals?


Each sample point is created from the personal data of an individual, e.g. (GTTCACG…TC, “YES”).

Differential Privacy [DMNS06]

A (randomized) algorithm $A$ is $\epsilon$-differentially private if for any two data sets $S, S'$ that differ in a single element, and for every set of outcomes $Z$:
$$\Pr[A(S) \in Z] \leq e^{\epsilon} \cdot \Pr[A(S') \in Z].$$
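A minimal example satisfying the definition (not from the slides): randomized response on a single bit, which is $\epsilon$-DP because flipping one individual's bit changes the probability of any output by at most a factor of $e^{\epsilon}$.

```python
import math
import random

def randomized_response(bit: int, eps: float = math.log(3)) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    With eps = ln 3 this means answering truthfully 3/4 of the time."""
    p_true = math.exp(eps) / (1 + math.exp(eps))
    return bit if random.random() < p_true else 1 - bit
```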

Properties of DP

• Privacy has a price: the minimum data set size usually scales as $1/\epsilon$.

• Composable adaptively: if $A_1$ is $\epsilon_1$-DP and $A_2$ is $\epsilon_2$-DP then their (adaptive) composition is $(\epsilon_1+\epsilon_2)$-DP. Or better [DRV 10]: for every $\epsilon$ and $\delta > 0$, the composition of $k$ $\epsilon$-DP algorithms is $\big(\sqrt{2k\ln(1/\delta)}\,\epsilon + k\epsilon(e^{\epsilon}-1),\ \delta\big)$-DP.

DP implies generalization

Let $L$ be a loss function with range $[0,B]$ and $A$ an $\epsilon$-DP algorithm. Then for all distributions $P$ over $X$:
$$\mathbf{E}_{A,\,S\sim P^n}[L(A(S), S)] \leq e^{\epsilon} \cdot \mathbf{E}_{A,\,S\sim P^n}[L(A(S), P)],$$
where $L(h,S)$ is the empirical loss on $S$ and $L(h,P) = \mathbf{E}_{x\sim P}[L(h,x)]$.

DP composition implies that DP algorithms can reuse data adaptively.

Proof

For $i \in [n]$ and $x \in X$, let $S^{i \leftarrow x}$ denote $S$ with its $i$-th element replaced by $x$. By $\epsilon$-DP, for every $z \in [0,B]$:
$$\Pr_A[L(A(S^{i \leftarrow x}), x) \geq z] \leq e^{\epsilon} \cdot \Pr_A[L(A(S), x) \geq z].$$
Integrating over $z$ from $0$ to $B$:
$$\mathbf{E}_A[L(A(S^{i \leftarrow x}), x)] = \int_0^B \Pr_A[L(A(S^{i \leftarrow x}), x) \geq z]\, dz \leq e^{\epsilon} \cdot \mathbf{E}_A[L(A(S), x)].$$
Taking expectation over $S \sim P^n$, $x \sim P$, and uniformly random $i \in [n]$: the left side equals $\mathbf{E}[L(A(S), S)]$ (after renaming), and the right side equals $e^{\epsilon}\cdot\mathbf{E}[L(A(S), P)]$.

Counting queries

Counting query on a data set $S$: for a function $\phi: X \to [0,1]$, the value $\phi(S) = \frac{1}{n}\sum_{x \in S}\phi(x)$. DP algorithms for approximately answering counting queries have been actively studied for ~10 years.
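The simplest such algorithm is the Laplace mechanism, sketched below: a counting query has sensitivity $1/n$ (changing one element of $S$ moves $\phi(S)$ by at most $1/n$), so Laplace noise of scale $1/(\epsilon n)$ gives $\epsilon$-DP.

```python
import numpy as np

def laplace_counting_query(S, phi, eps, seed=None):
    """Answer the counting query phi(S) = (1/n) * sum of phi over S
    with the Laplace mechanism: eps-DP via sensitivity 1/n."""
    rng = np.random.default_rng(seed)
    n = len(S)
    return float(np.mean([phi(x) for x in S])) + rng.laplace(scale=1.0 / (eps * n))
```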

[Diagram: data analyst(s) adaptively send counting queries $\phi_1, \phi_2, \ldots, \phi_t$ to a query release algorithm holding $S$ and receive answers $v_1, v_2, \ldots, v_t$, where $\phi_i: X \to [0,1]$ and $|v_i - \phi_i(S)| \leq \tau$.]

From private counting to SQs

Let $Q$ be an (adaptive) query-asking strategy.

Let $A$ be an algorithm that answers the counting queries of $Q$ s.t.:
1. $A$ is $\epsilon$-DP;
2. for any data set $S$, w.p. $\geq 1-\beta$ the answers are $\tau$-accurate.

Then for any distribution $P$ over $X$, $A$ applied to $S \sim P^n$ w.p. $\geq 1 - O(\beta)$ outputs $O(\tau)$-valid responses to the SQs of $Q$, provided that $\epsilon = O(\tau)$ and $n = \Omega(\log(t/\beta)/\tau^2)$.

Can be extended to $(\epsilon, \delta)$-DP with sufficiently small $\delta$.

Proof I

For $i \in [t]$, let $\phi_i$ denote the $i$-th query asked by $Q$; it depends on $S$ and the randomness of $A$ (and nothing else). Fix a query $\psi$ and let $\mathbf{\Phi}$ denote the event that $\psi$ is asked. It suffices to show
$$\Pr_{A,\,S\sim P^n}[\,|\psi(P) - \psi(S)| \geq 2\tau\,] \leq \beta/t;$$
a union bound over the $t$ queries together with the accuracy of $A$ on the sample then completes the argument. The probability bound follows from:
• the moment bound $\mathbf{E}_{A,\,S\sim P^n}[\psi(S)^k \mid \mathbf{\Phi}] \leq e^{k\epsilon} \cdot \mathbf{E}_{S\sim P^n}[\phi(S)^k]$, where $\phi$ is the same function treated as a fixed query;
• concentration of $\phi(S)$ around $\phi(P)$;
• Markov's inequality for a suitable $k$.

Proof II: moment bound

It suffices to prove, for all $k \geq 1$,
$$\mathbf{E}_{A,\,S\sim P^n}[\psi(S)^k \mid \mathbf{\Phi}] \leq e^{k\epsilon} \cdot \mathbf{E}_{S\sim P^n}[\phi(S)^k].$$
The key step is that $\epsilon$-DP approximately preserves conditional expectations.

Corollaries

There exists an algorithm that can answer $t$ adaptively chosen SQs such that with probability $\geq 1-\beta$ the answers are $\tau$-valid, using $n = \mathrm{poly}(\log t, 1/\tau, \log(1/\beta), \log|X|)$ random samples. The algorithm runs in time $\mathrm{poly}(n, |X|)$ per query.

[HR10] There exists an $\epsilon$-DP algorithm that can answer $t$ (adaptive) counting queries such that with probability $\geq 1-\beta$ the answers are $\tau$-accurate, provided that $n = \Omega\big(\log|X| \cdot \log(t/\beta)/(\epsilon\tau^3)\big)$. The algorithm runs in time $\mathrm{poly}(n, |X|)$ per query. Also $(\epsilon, \delta)$-DP with an improved dependence on $\tau$.

MWU + Sparse Vector

• Initialize $D$ = uniform distribution over $X$.
• For each query $\phi_i$:
  – if $|\phi_i(D) - \phi_i(S)| \leq \tau$ (comparison made with Laplace noise), answer with $\phi_i(D)$;
  – else update $D$ via a multiplicative-weights step toward $\phi_i(S)$ and answer with $\phi_i(S)$ + Laplace noise.

At most $O(\log|X|/\tau^2)$ MWU updates. Sparse Vector Technique [DNRRV09]: privacy loss is incurred only when the approximate comparison with the threshold fails.
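A compact sketch of the loop above, with the domain identified with $\{0,\ldots,|X|-1\}$ and each query given as its value vector on the domain; the noise scales and the update step size are illustrative rather than calibrated for a formal privacy analysis:

```python
import numpy as np

def private_mw(S, queries, eps, tau, domain_size, seed=0):
    """Multiplicative weights over a public distribution D, with a
    sparse-vector-style noisy comparison deciding when to update."""
    rng = np.random.default_rng(seed)
    n = len(S)                               # S: array of domain indices
    D = np.full(domain_size, 1.0 / domain_size)
    noisy_T = tau + rng.laplace(scale=1.0 / (eps * n))   # noisy threshold
    answers = []
    for phi in queries:                      # phi: values of the query on X
        synth = float(D @ phi)               # phi(D), from public state only
        real = float(np.mean(phi[S]))        # phi(S), touches the data
        if abs(real + rng.laplace(scale=2.0 / (eps * n)) - synth) <= noisy_T:
            answers.append(synth)            # comparison passed: no update
        else:                                # comparison failed: pay privacy,
            noisy = real + rng.laplace(scale=1.0 / (eps * n))
            sign = 1.0 if noisy > synth else -1.0
            D *= np.exp(0.5 * tau * sign * phi)   # MW step toward agreement
            D /= D.sum()
            answers.append(noisy)
            noisy_T = tau + rng.laplace(scale=1.0 / (eps * n))
    return answers
```

The multiplicative-weights regret bound is what caps the number of updates at roughly $\log|X|/\tau^2$, and the sparse vector technique ensures that only those updates, not the passed comparisons, consume the privacy budget.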

Threshold validation queries

Threshold SQ: a query $\phi$ together with a threshold $T$. A $\tau$-valid response is either YES or a value $v$: YES is valid only if $\phi(P)$ is within $\tau$ of the appropriate side of $T$, and a value $v$ is valid only if $|v - \phi(P)| \leq \tau$.

There exists an algorithm that can answer $t$ adaptively chosen threshold SQs such that with probability $\geq 1-\beta$ the answers are $\tau$-valid, as long as at most $m$ comparisons fail, using $n = \tilde{O}(\sqrt{m}\,\log t/\tau^2)$ random samples. The algorithm runs in time $O(n)$ per query.

Applications

[Diagram: the data is split into a working set and a validation set. Learning algorithm(s) send ordinary SQs ($\phi_1, \ldots, \phi_t$) and threshold validation queries (e.g. $\phi_2, T_2$) to the SQ oracle; ordinary queries are answered from the working set, threshold queries from the validation set; answers $v_1, v_2, \ldots, v_t$ are returned.]
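One plausible way to wire the split shown above (the function name and noise scale are illustrative, not the talk's exact mechanism): the working set is used freely, while the validation set only ever answers a noisy threshold comparison and reveals a value only when overfitting is detected.

```python
import numpy as np

def validated_estimate(phi, working, holdout, tau, rng):
    """Return phi's working-set estimate unless it disagrees with the
    holdout by more than a noisy threshold; only then touch the holdout."""
    work_val = float(np.mean([phi(z) for z in working]))
    hold_val = float(np.mean([phi(z) for z in holdout]))
    if abs(work_val - hold_val) <= tau + rng.laplace(scale=tau / 4):
        return work_val                                # no evidence of overfitting
    return hold_val + rng.laplace(scale=tau / 4)       # overfitting: noisy holdout value
```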

Conclusions

• Adaptive data manipulations can cause overfitting/false discovery

• Theoretical model of the problem based on SQs
• Using exact empirical means is risky
• DP provably preserves “freshness” of samples: adding noise can provably prevent overfitting
• In applications not all data must be used with DP

Future work

• Better (direct) bounds on the sample complexity?
• Is differential privacy necessary?
• Better dependence on the tolerance $\tau$?
• Efficient algorithms for special cases?
• Practical implementations?

THANKS!