
arXiv:0912.0071v1 [cs.LG] 1 Dec 2009

Differentially Private Support Vector Machines

Anand Sarwate, ITA Center, UC San Diego

Kamalika Chaudhuri, CSE Department, UC San Diego

Claire Monteleoni, CCLS, Columbia University


Abstract

This paper addresses the problem of practical privacy-preserving machine learning: how to detect patterns in massive, real-world databases of sensitive personal information, while maintaining the privacy of individuals. Chaudhuri and Monteleoni (2008) recently provided privacy-preserving techniques for learning linear separators via regularized logistic regression. With the goal of handling large databases that may not be linearly separable, we provide privacy-preserving support vector machine algorithms.

We address general challenges left open by past work, such as how to release a kernel classifier without releasing any of the training data, and how to tune algorithm parameters in a privacy-preserving manner. We provide general, efficient algorithms for linear and nonlinear kernel SVMs, which guarantee $\epsilon$-differential privacy, a very strong privacy definition due to Dwork et al. (2006). We also provide learning generalization guarantees. Empirical evaluations reveal promising performance on real and simulated data sets.

1 Introduction

An emerging problem at the interface of statistics and machine learning is how to learn from statistical databases that contain sensitive information. Statisticians for health insurance or credit-card companies, and epidemiologists studying disease risk, are faced with the task of performing data analysis on medical or financial records of private individuals. Machine learning algorithms applied to the data of many individuals can discover novel population-wide patterns, but the results of such algorithms may reveal certain individuals' sensitive information, thereby violating their privacy. This is a growing concern, due to the massive increase in personal information stored in electronic databases. Our goal is to provide learning algorithms whose outputs are provably guaranteed to preserve privacy.

The support vector machine (SVM) algorithm is one of the most widely used machine learning algorithms in practice. In this paper we develop privacy-preserving SVM algorithms for linear and nonlinear kernels. We also develop a method for privacy-preserving parameter tuning for general machine learning algorithms.

Our work is inspired by the approach of Chaudhuri and Monteleoni [Chaudhuri and Monteleoni, 2008]. They provide algorithms for regularized logistic regression that are privacy-preserving with respect to the $\epsilon$-differential privacy model [Dwork et al., 2006], a strong, cryptographically-motivated definition of privacy that has recently received a significant amount of research attention for its robustness to known attacks. The privacy-preserving technique proposed by [Chaudhuri and Monteleoni, 2008], for machine learning algorithms involving convex optimization, is to solve a perturbed optimization problem. For the case of regularized logistic regression, they provide a differentially private algorithm, with learning performance guarantees. However, their analysis does not apply directly to the more widely used support vector machine algorithm. In this paper we take a step towards more practical solutions for privacy-preserving machine learning by studying SVMs. Our main contributions are the following:[1]

• We provide the first differentially private SVM algorithm. This involves applying ideas from [Chaudhuri and Monteleoni, 2008] to develop a privacy-preserving linear SVM algorithm, and also entails the following challenges.

• We pose the general question: is it possible to learn a private version of a kernel classifier? That is, for kernel methods such as SVMs with nonlinear kernel functions, the optimal classifier is a linear combination of kernel functions centered at the training points. This form is inherently non-private because it reveals the training data. We answer this question in the affirmative for SVMs by adapting a random projection method due to Rahimi and Recht [Rahimi and Recht, 2007, Rahimi and Recht, 2008b] to develop privacy-preserving kernel-SVM algorithms, and we give generalization guarantees.

• We provide a privacy-preserving tuning algorithm that is applicable to general machine learning algorithms. Our technique makes use of a randomized selection procedure [McSherry and Talwar, 2007], and is the first privacy-preserving tuning algorithm.[2]

• Finally we provide experimental results on real and synthetic data that illustrate the tradeoffs between privacy, generalization error, and training size.

It is important to note that imposing privacy constraints on a machine learning algorithm generally comes with a cost in learning performance. This is the case in earlier privacy methods, such as random sampling, in which only a fraction of the data set is used, and k-anonymity [Sweeney, 2002], in which the data set is clustered and only the cluster centers are released, so that only a $1/k$ fraction of the data set is used. In general, for classification, the error rate increases as the privacy requirements are made more stringent. Our learning performance guarantees formalize this "price of privacy," and we demonstrate empirically that our methods better manage the tradeoff between privacy and learning performance than in previous work.

[1] Full proofs are omitted from the paper due to space constraints, and provided in the supplementary materials.

[2] To our knowledge.

1.1 Related work

There has been a significant amount of literature on privacy-preserving data mining [Agrawal and Srikant, 2000, Evfimievski et al., 2003, Sweeney, 2002, Machanavajjhala et al., 2006], which works with privacy models other than differential privacy. However, many of the models used in these works have been shown to be susceptible to attacks when the adversary has some reasonable amount of prior knowledge [Ganta et al., 2008]. In particular, [Mangasarian et al., 2008] consider the problem of privacy-preserving SVM classification when separate agents have to share private data, and provide a solution that uses random kernels. However, their work does not provide any formal privacy guarantee.

Differential privacy, the formal privacy definition used in our paper, was proposed by the seminal work of Dwork et al. [Dwork et al., 2006], and has since been used in numerous works on privacy – see, for example, [McSherry and Talwar, 2007, Nissim et al., 2007, Barak et al., 2007, Chaudhuri and Monteleoni, 2008, Machanavajjhala et al., 2008]. Unlike many other privacy definitions, such as those mentioned above, differential privacy has been shown to be resistant to side-information attacks [Ganta et al., 2008]. The literature on learning with differential privacy includes the work of [Kasiviswanathan et al., 2008, Blum et al., 2008], which presents a general method for PAC-learning a concept class, and [Dwork and Lei, 2009], which presents an algorithm for privacy-preserving regression using ideas from robust statistics. However, these algorithms have running time exponential in the VC-dimension of the concept class and the data dimension, respectively.

In the machine learning literature, the work most related to ours is that of [Chaudhuri and Monteleoni, 2008]. First they show that one can extend the sensitivity method, a technique due to [Dwork et al., 2006], to classification algorithms. The sensitivity method makes an algorithm differentially private by adding noise to its output. The noise protects the privacy of the training data, but increases the prediction error. Chaudhuri and Monteleoni then show that by adding noise to the objective function of regularized logistic regression, the same level of privacy can be guaranteed with lower error than the sensitivity method. Their method also works for other strongly-convex optimization problems as long as the objective function is differentiable and the effect of one person's private value on the gradient is low. Their method does not directly apply to SVMs.

Another line of work [Zhan and Matwin, 2007, Laur et al., 2006] considers computing privacy-preserving SVMs in the Secure Multiparty Communication setting – when the sensitive data is split across multiple hostile databases, and the goal is to design a distributed protocol to learn a classifier. In contrast, our work deals with a setting where the algorithm has access to the entire dataset.


2 Problem statement

In this paper we develop privacy-preserving algorithms that learn a classifier from labeled examples. Our algorithms take as input a database $\mathcal{D} = \{(x(i), y(i)) : i = 1, 2, \ldots, n\}$ of $n$ data-label pairs. The samples are $x(i) \in \mathcal{X} \subset \mathbb{R}^d$ and the labels are $y(i) \in \{-1, +1\}$. We will use $\|x\|_2$, $\|x\|_\infty$, and $\|x\|_{\mathcal{H}}$ to denote the $\ell_2$-norm, $\ell_\infty$-norm, and norm in a Hilbert space $\mathcal{H}$, respectively. We will assume throughout that $\mathcal{X}$ is the unit ball so that $\|x(i)\|_2 \le 1$. For an integer $K$ we will use $[K]$ to denote the set $\{1, 2, \ldots, K\}$. In this paper we are interested in support vector machines (SVMs), which are a class of algorithms that construct a classifier $f : \mathcal{X} \to \mathbb{R}$ which makes a prediction $\mathrm{sgn}(f(x))$ on a point $x$. The standard SVM algorithm computes the classifier that minimizes the regularized empirical loss:

$$J(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x(i)), y(i)) + \frac{\Lambda}{2} \|f\|_{\mathcal{H}}^2 , \qquad (1)$$

where $\ell(\cdot, \cdot)$ is a loss function and $\|\cdot\|_{\mathcal{H}}$ is the norm of $f$ in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ generated by a kernel function $k(a, b)$. The loss function $\ell(t, y)$ could technically be anything, but for support vector machines it is usually set to the hinge loss. In this paper we will use the Huber loss, which is an approximation to the hinge loss [Chapelle, 2007] that depends on a small parameter $h$:

$$\ell_h(t, y) = \begin{cases} 0 & \text{if } yt > 1 + h \\ \frac{1}{4h}(1 + h - yt)^2 & \text{if } |1 - yt| \le h \\ 1 - yt & \text{if } yt < 1 - h \end{cases} \qquad (2)$$

As $h \to 0$ this loss function approaches the hinge loss. The reason for choosing the Huber loss is that it is differentiable, which is important for the proofs below (see the remark after Theorem 1).
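As a concrete illustration of (2), the following is a minimal Python sketch of the Huber loss; the function name and the default smoothing parameter $h$ are our own choices, not taken from the paper's code.

```python
def huber_loss(t, y, h=0.5):
    """Huber approximation to the hinge loss, eq. (2); h > 0 is the smoothing parameter."""
    yt = y * t
    if yt > 1.0 + h:
        return 0.0
    if yt < 1.0 - h:
        return 1.0 - yt
    return (1.0 + h - yt) ** 2 / (4.0 * h)   # quadratic region where |1 - yt| <= h

# As h -> 0 this approaches the hinge loss: huber_loss(0.3, 1.0, h=1e-6) is about 0.7.
```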

Privacy Model. We are interested in producing a classifier in a manner that preserves the privacy of individual entries of the database $\mathcal{D}$ that is used in the training. The notion of privacy we use is the $\epsilon$-differential privacy model, developed by Dwork et al. [Dwork et al., 2006, Dwork, 2006]. In this model, the training algorithm is randomized. Let $\mathcal{M}$ denote the training algorithm, let $\mathcal{M}(\mathcal{D})$ be the random variable that is the classifier, and let $\mu$ denote the density of $\mathcal{M}$. The algorithm $\mathcal{M}$ for producing a classifier is said to preserve $\epsilon$-differential privacy if the likelihood that $\mathcal{M}$ produces a classifier $f$ on database $\mathcal{D}$ is close to the likelihood of it producing $f$ on any database $\mathcal{D}'$ that differs from $\mathcal{D}$ in one entry. That is, any single entry of the database does not affect the results of the algorithm by much; dually, this means that an adversary cannot gain any extra information about any single entry of the database by observing the output of the algorithm.

Definition 1. An algorithm $\mathcal{M}$ provides $\epsilon$-differential privacy if for any two databases $\mathcal{D}$ and $\mathcal{D}'$ that differ in a single entry and for any $f$,

$$\log \frac{\mu\!\left(\mathcal{M}(\mathcal{D}) = f \mid \mathcal{D}\right)}{\mu\!\left(\mathcal{M}(\mathcal{D}) = f \mid \mathcal{D}'\right)} \le \epsilon . \qquad (3)$$


Note that an SVM solution is not intrinsically private. This is because an SVM solution is a linear combination of support vectors, which are selected training samples. If a support vector changes, the SVM solution will also change. Regularization helps by bounding the $L_2$ norm of the change, but does not completely solve the problem, as the direction of a low-norm classifier can still change when a single training sample changes.

Performance Measures. In this paper we construct privacy-preserving SVM training algorithms. The first question is how to approximate the SVM while guaranteeing $\epsilon_p$-differential privacy. The second question is how much we pay in terms of generalization performance. The measure of performance for a classifier $f$ is the expected loss over a data distribution $P$ on $\mathcal{X} \times \{-1, +1\}$:

$$L(f) = \mathbb{E}_P[\ell_h(f(x), y)] . \qquad (4)$$

Assuming the database $\mathcal{D}$ is drawn i.i.d. according to $P$, we are interested in how large the number of entries $n$ has to be in order to guarantee a bounded gap $L(f_{\mathrm{priv}}) - L(f) \le \epsilon_g$. In particular, we are interested in how the number of samples $n$ behaves as a function of the privacy level $\epsilon_p$, generalization loss $\epsilon_g$, and dimension $d$.

3 Linear support vector machines

The simplest case of a support vector machine classifier is a linear SVM, in which the classifier $f(x)$ takes the inner product of the data point $x$ with a fixed vector $f$. In this setting the $\|f\|_{\mathcal{H}}$ in (1) becomes the Euclidean norm $\|f\|_2$. Previous work has shown that an efficient way of guaranteeing privacy in logistic regression is to solve a perturbed objective function [Chaudhuri and Monteleoni, 2008]. We show how to provide a privacy-preserving Huber SVM by taking a similar approach; in order to guarantee $\epsilon_p$-differential privacy, we minimize the following objective function:

$$J_{\mathrm{priv}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell_h(f^T x(i), y(i)) + \frac{\Lambda}{2} \|f\|_2^2 + \frac{1}{n} b^T f , \qquad (5)$$

where $b$ has density

$$G(b) = \frac{1}{Z} e^{-(\beta/2)\|b\|} , \qquad (6)$$

$Z$ is a normalizing constant, and $\beta = \epsilon_p$. Recall that $\ell_h$ here is the Huber loss. Algorithm 1 is our privacy-preserving algorithm for the case of linear SVMs based on (5).

Similar to logistic regression, the sensitivity method of [Dwork et al., 2006] can be applied to privacy-preserving SVMs as well. The sensitivity method is a general procedure for computing privacy-preserving approximations to functions. The idea is to add noise to the output of a function proportional to its sensitivity. For the SVM objective (1), the sensitivity is at most $2/(n\Lambda)$. To achieve $\epsilon_p$-differential privacy, the sensitivity method outputs $c + \arg\min J(f)$, where $c$ has density given by (6) with $\beta = n\epsilon_p\Lambda$. For more details on this analysis, see [Chaudhuri and Monteleoni, 2008].


Algorithm 1 Privacy-preserving linear SVM classifier training

input: Data $\{Z_i\}$, parameters $\epsilon_p$, $\Lambda$.
output: Linear classifier $f_{\mathrm{priv}}$.

Draw a vector $b$ according to (6).
Compute $f_{\mathrm{priv}} = \arg\min J_{\mathrm{priv}}(f)$, defined in (5).
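The following Python sketch illustrates Algorithm 1 under stated assumptions: it draws $b$ from the density (6) by sampling a uniform direction and a norm from a Gamma$(d, 2/\epsilon_p)$ distribution (one standard way to realize a density proportional to $e^{-(\epsilon_p/2)\|b\|}$ in $\mathbb{R}^d$), and then minimizes (5) with a generic solver. This is not the authors' implementation (they used liblbfgs); all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def huber_margin_loss(z, h=0.5):
    """Vectorized Huber loss of eq. (2) evaluated at margins z = y * (f . x)."""
    return np.where(z > 1 + h, 0.0,
           np.where(z < 1 - h, 1.0 - z, (1 + h - z) ** 2 / (4 * h)))

def private_linear_svm(X, y, eps_p, Lam, h=0.5, seed=0):
    """Sketch of Algorithm 1: objective perturbation for the linear Huber SVM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Sample b with density proportional to exp(-(eps_p / 2) * ||b||):
    # uniform direction times a Gamma(shape=d, scale=2/eps_p) norm.
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    b = rng.gamma(shape=d, scale=2.0 / eps_p) * direction

    def J_priv(f):  # eq. (5)
        margins = y * (X @ f)
        return huber_margin_loss(margins, h).mean() + 0.5 * Lam * f @ f + (b @ f) / n

    return minimize(J_priv, np.zeros(d), method="L-BFGS-B").x  # f_priv
```

By contrast, the sensitivity method discussed above would minimize the unperturbed objective (1) and then add noise drawn from (6) with $\beta = n\epsilon_p\Lambda$ to the resulting $f$.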

3.1 Privacy guarantees

Theorem 1. Given data $\{(x(i), y(i)) : i \in [n]\}$, the output $f_{\mathrm{priv}}$ of Algorithm 1 guarantees $\epsilon_p$-differential privacy.

We use the Huber loss instead of the hinge loss because the proof of Theorem 1 depends on the loss function being differentiable. To see why a differentiable loss function matters for our algorithm, suppose that $\ell(t, y) = (1 - ty)_+$ and consider the case where $d = 1$. Fix two databases $\mathcal{D}$ and $\mathcal{D}'$ which have only one entry each. That is, suppose $\mathcal{D} = \{(+1, 1)\}$ and $\mathcal{D}'$ has $\{(-1, 1)\}$, and suppose $f^* = 1$ was given by the algorithm. Then under database $\mathcal{D}$ we have that

$$J_{\mathrm{priv}}(f) = (1 - f)_+ + \frac{\Lambda}{2} f^2 + bf \qquad (7)$$

is minimized by $f = f^* = 1$. If we set $b$ to be any number in the interval $[-\Lambda, -\Lambda + 1]$, the subgradient set of $J_{\mathrm{priv}}$ at $f = 1$ contains $0$; this means that for this entire range of values of $b$, $f = 1$ is the minimizer. Now let us look at $\mathcal{D}'$, for which

$$J_{\mathrm{priv}}(f) = (1 + f)_+ + \frac{\Lambda}{2} f^2 + bf . \qquad (8)$$

At $f = 1$ the gradient exists, and by setting it equal to $0$ we get $b = -(\Lambda + 1)$, which means that there is only one value of $b$ for which $f = 1$ is the minimizer. Therefore, from the log likelihood ratio in the definition of $\epsilon$-differential privacy, the likelihood of $\mathcal{D}$ is far greater than that of $\mathcal{D}'$, so this protocol cannot be $\epsilon$-differentially private.

3.2 Generalization bounds

To provide learning guarantees we assume the data $\{(x(i), y(i))\}$ consists of independent and identically distributed samples from a distribution $P(X, Y)$.

Lemma 1. Given a regularized SVM with regularization parameter $\Lambda$, let $f_0$ be the classifier that minimizes $J$ in (1) and $f_{\mathrm{priv}}$ be the classifier that minimizes $J_{\mathrm{priv}}$ in (5). Then

$$\mathbb{P}_P\left( J(f_{\mathrm{priv}}) \le J_{\mathrm{priv}}(f_0) + \frac{8 d^2 \log^2(d/\delta)}{\Lambda n^2 \epsilon_p^2} \right) \ge 1 - \delta ,$$

where the probability is over i.i.d. sampling from $P$.

Proof. The proof is identical to Lemma 4 in [Chaudhuri and Monteleoni, 2008].


Theorem 2. Let $f^*$ be a classifier with expected loss $L(f^*)$ and norm $\|f^*\|_2$. Then for

$$n = \Omega\left( \max\left\{ \frac{\|f^*\|_2^2}{\epsilon_g^2} ,\; \frac{\|f^*\|_2 \, d \log(d/\delta)}{\epsilon_p \epsilon_g} \right\} \right) , \qquad (9)$$

the classifier $f_{\mathrm{priv}}$ produced by Algorithm 1 satisfies

$$\mathbb{P}\left( L(f_{\mathrm{priv}}) \le L(f^*) + \epsilon_g \right) \ge 1 - \delta . \qquad (10)$$

The algorithm for linear SVMs is a natural modification of the algorithm presented in [Chaudhuri and Monteleoni, 2008], and the analysis is quite similar. The central idea is to perturb the objective function of a convex optimization program rather than the output of the optimization. Indeed, this algorithm can be generalized to general finite-dimensional convex optimization programs.

4 Nonlinear kernels

In this section, we show how to use SVMs to generate nonlinear decision boundaries for classification in a privacy-preserving way. Such boundaries are typically found by using a nonlinear kernel function $k(x, x')$. The kernel function is associated with a reproducing kernel Hilbert space (RKHS) in which the classifier lies. The representer theorem [Kimeldorf and Wahba, 1970] shows that the regularized empirical risk in (1) is minimized by a function $f(x)$ that is given by a linear combination of kernel functions centered at the data points:

$$f^*(x) = \sum_{i=1}^{n} a_i k(x(i), x) . \qquad (11)$$

This elegant result is appealing from a theoretical standpoint. To use it in practice, one releases the values $a_i$ corresponding to the $f$ that minimizes the empirical risk, along with the data points $x(i)$; the user classifies a new $x$ by evaluating the function in Equation (11).

Our first contribution in this section is to observe that this release process is problematic from a privacy point of view. This is because it directly releases the private values $x(i)$ of individuals in the training set. Thus, even if the classifier is computed in a privacy-preserving way, any classifier released by this process will inherently violate privacy. Our second contribution is to provide a solution that avoids this problem; we do this by using the approximation method of Rahimi and Recht [Rahimi and Recht, 2007, Rahimi and Recht, 2008b], which uses random projections to approximate the kernel function in a higher-dimensional space.

4.1 The algorithm

Our algorithm for privacy-preserving SVM classification using general kernels is given in Algorithm 2. Here we will describe the algorithm for the Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|_2^2)$, but we can approximate many different kernels of interest [Rahimi and Recht, 2008b]. For $j \in [D]$ we draw a sample $\theta_j = (\omega_j, \psi_j)$ in $\mathbb{R}^d \times \mathbb{R}$ according to the distribution $p(\theta)$ given by $\mathcal{N}(0, 2\gamma I_d) \times \mathrm{Uniform}[-\pi, \pi]$. We then map each vector $x(i) \in \mathbb{R}^d$ to a vector $v(i) \in \mathbb{R}^D$ by evaluating the functions $\phi(x(i); \theta_j) = \cos(\omega_j^T x(i) + \psi_j)$ for $j \in [D]$. Finally, we use Algorithm 1 to train a privacy-preserving classifier for the projected data $v(i)$. In order to compute the classifier's label on a new point $x$, the classifier first projects $x$ into a vector $v$ via the map $\phi(x; \theta_j)$ and then uses the linear classifier given by Algorithm 1.

Algorithm 2 Private SVM classification for nonlinear kernels

input: Data $\{(x(i), y(i)) : i \in [n]\}$, positive definite kernel function $k(\cdot, \cdot)$, $p(\theta)$, parameters $\epsilon_p$, $\Lambda$, $D$.
output: Classifier $f_{\mathrm{priv}}$ and pre-filter $\{\theta_j : j \in [D]\}$.

Draw $\theta_j$, $j = 1, 2, \ldots, D$, i.i.d. according to $p(\theta)$.
Set $v(i) = \sqrt{2/D}\,[\phi(x(i); \theta_1) \cdots \phi(x(i); \theta_D)]^T$ for each $i$.
Run Algorithm 1 with data $\{(v(i), y(i))\}$ and parameters $\epsilon_p$, $\Lambda$.
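A Python sketch of the random feature map used by Algorithm 2 for the Gaussian kernel, which can be fed to the linear trainer sketched after Algorithm 1; the variable names are ours and the snippet is illustrative rather than the authors' code.

```python
import numpy as np

def gaussian_random_features(X, D, gamma, seed=0):
    """Draw theta_j = (omega_j, psi_j) ~ N(0, 2*gamma*I_d) x Uniform[-pi, pi] and map each
    row x of X to v = sqrt(2/D) * cos(Omega x + psi), as in Algorithm 2."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Omega = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(D, d))  # rows are the omega_j
    psi = rng.uniform(-np.pi, np.pi, size=D)
    V = np.sqrt(2.0 / D) * np.cos(X @ Omega.T + psi)
    return V, (Omega, psi)

# V, prefilter = gaussian_random_features(X, D=100, gamma=0.5)
# f_priv = private_linear_svm(V, y, eps_p=0.1, Lam=1e-2)
# A new point x is classified by applying the same prefilter and taking sign(f_priv @ v).
```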

4.2 Privacy guarantees

Because the workhorse of Algorithm 2 is Algorithm 1, and the points $\{\theta_j : j \in [D]\}$ are independent of the data, the privacy guarantees for Algorithm 2 follow trivially from Theorem 1.

Theorem 3. Given data $\{(x(i), y(i)) : i = 1, 2, \ldots, n\}$ with $\|x(i)\| \le 1$, the outputs $(f_{\mathrm{priv}}, \{\theta_j : j \in [D]\})$ of Algorithm 2 guarantee $\epsilon_p$-differential privacy.

4.3 Generalization bounds

In this section we prove bounds on the generalization performance of the privacy-preserving classifier produced by Algorithm 2. We will compare this generalization performance against arbitrary classifiers $f^*$ whose "norm" is bounded in some sense. That is, given an $f^*$ with some properties, we will choose the regularization parameter $\Lambda$, dimension $D$, and number of samples $n$ so that the classifier $f_{\mathrm{priv}}$ has expected loss close to that of $f^*$. In particular, our results will say that as long as $n$ is greater than some quantity, then $L(f_{\mathrm{priv}}) - L(f^*) \le \epsilon_g$.

Our first generalization result is the simplest, since it assumes a strong condition that gives easy guarantees on the projections. We would like our privatized classifier to be competitive against a classifier $f^*$ such that

$$f^*(x) = \int_{\Theta} a^*(\theta) \phi(x; \theta) p(\theta) \, d\theta , \qquad (12)$$

and $|a^*(\theta)| \le C$ (see [Rahimi and Recht, 2008b]).


Lemma 2. Let $f^*$ be a classifier such that $|a^*(\theta)| \le C$, where $a^*(\theta)$ is given by (12). Then for any distribution $P$, if

$$n = \Omega\left( \frac{C^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^2} \cdot \log \frac{C \log(1/\delta)}{\epsilon_g \delta} \right) , \qquad (13)$$

then $\Lambda$ and $D$ can be chosen such that

$$\mathbb{P}\left( L(f_{\mathrm{priv}}) - L(f^*) \le \epsilon_g \right) \ge 1 - 4\delta . \qquad (14)$$

The proof uses generalization results from [Rahimi and Recht, 2008b] together with results on perturbed optimization from [Sridharan et al., 2008, Shalev-Shwartz and Srebro, 2008]. We can adapt the proof procedure to show that we can be competitive against any classifier $f^*$ with a given bound on $\|f^*\|_\infty$. It can be shown that, for some constant $\zeta$, $|a^*(\theta)| \le \mathrm{Vol}(\mathcal{X})\,\zeta\,\|f^*\|_\infty$. Then we can set this as $C$ in (13) to obtain our next result.

Theorem 4. Let $f^*$ be a classifier with norm $\|f^*\|_\infty$. Then for any distribution $P$, if

$$n = \Omega\Bigg( \frac{\|f^*\|_\infty^2 \, \zeta^2 \, \mathrm{Vol}(\mathcal{X})^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^2} \qquad (15)$$
$$\cdot \log \frac{\|f^*\|_\infty \, \mathrm{Vol}(\mathcal{X})\, \zeta \log(1/\delta)}{\epsilon_g \delta \, \Gamma(\tfrac{d}{2} + 1)} \Bigg) , \qquad (16)$$

then $\Lambda$ and $D$ can be chosen such that $\mathbb{P}(L(f_{\mathrm{priv}}) - L(f^*) \le \epsilon_g) \ge 1 - 4\delta$.

Finally, we derive a generalization result with respect to an optimal classifier which has bounded $\|f^*\|_{\mathcal{H}}$.

Theorem 5. Let $f^*$ be a classifier with norm $\|f^*\|_{\mathcal{H}}$. Then for any distribution $P$, if

$$n = \Omega\Bigg( \frac{\|f^*\|_{\mathcal{H}}^4 \, \zeta^2 \, \mathrm{Vol}(\mathcal{X})^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^4} \qquad (17)$$
$$\cdot \log \frac{\|f^*\|_{\mathcal{H}} \, \mathrm{Vol}(\mathcal{X})\, \zeta \log(1/\delta)}{\epsilon_g \delta \, \Gamma(\tfrac{d}{2} + 1)} \Bigg) , \qquad (18)$$

then $\Lambda$ and $D$ can be chosen such that $\mathbb{P}(L(f_{\mathrm{priv}}) - L(f^*) \le \epsilon_g) \ge 1 - 4\delta$.

5 Privacy-preserving tuning

We now present a privacy-preserving parameter tuning technique that is generally applicable to other machine learning algorithms, in addition to SVMs. In practice, one typically tunes the SVM's regularization parameter as follows: using data held out for validation, learn an SVM for multiple values of $\Lambda$, and select the one which provides the best empirical performance. However, even though the output of Algorithm 1 preserves $\epsilon_p$-differential privacy for a fixed $\Lambda$, using a $\Lambda$ which itself is a function of the private data may break the $\epsilon_p$-differential privacy guarantees. That is, if the procedure that provides $\Lambda$ is not private, then from the output of our algorithm, an adversary may gain some knowledge about the value of $\Lambda$, allowing her to infer something about private values in the database.

We suggest two ways of resolving this issue. First, if we have access to some publicly available data from the same distribution, then we can use this as a holdout set to tune $\Lambda$. This $\Lambda$ can be subsequently used to train a classifier on the private data. Since the value of $\Lambda$ does not depend on the values in the private data set, this procedure will still preserve the privacy of individuals in the private data.

If no such public data is available, then we need a differentially private tuning procedure. We provide such a procedure below. The main idea is to train for different values of $\Lambda$ on separate datasets, so that the total training procedure still maintains $\epsilon_p$-differential privacy. We then select a $\Lambda$ using a randomized privacy-preserving comparison procedure [McSherry and Talwar, 2007]. The last step is needed to guarantee $\epsilon_p$-differential privacy for individuals in the validation set.

Algorithm 3 Privacy-preserving parameter tuning

input: Database $\mathcal{D}$, parameters $\Lambda_1, \ldots, \Lambda_m$, $\epsilon_p$.
output: Classifier $f_{\mathrm{priv}}$.

Divide $\mathcal{D}$ into $m + 1$ equal portions $\mathcal{D}_1, \ldots, \mathcal{D}_{m+1}$, each of size $\frac{|\mathcal{D}|}{m+1}$.
For each $i = 1, 2, \ldots, m$, apply a privacy-preserving learning algorithm (e.g. Algorithm 1) on $\mathcal{D}_i$ with parameters $\Lambda_i$ and $\epsilon_p$ to get output $f_i$.
Evaluate $z_i$, the number of mistakes made by $f_i$ on $\mathcal{D}_{m+1}$. Set $f_{\mathrm{priv}} = f_i$ with probability

$$q_i = \frac{e^{-\epsilon_p z_i / 2}}{\sum_{i=1}^{m} e^{-\epsilon_p z_i / 2}} . \qquad (19)$$
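A Python sketch of Algorithm 3, assuming the private_linear_svm sketch from Section 3 as the base learner; the train_fn interface and all names are our own.

```python
import numpy as np

def private_tune(X, y, lambdas, eps_p, train_fn, seed=0):
    """Train one classifier per Lambda_i on its own data slice, count mistakes z_i on the
    held-out slice, and pick classifier i with probability proportional to exp(-eps_p*z_i/2)."""
    rng = np.random.default_rng(seed)
    m = len(lambdas)
    parts = np.array_split(rng.permutation(len(y)), m + 1)
    holdout = parts[-1]
    classifiers, mistakes = [], []
    for part, lam in zip(parts[:-1], lambdas):
        f_i = train_fn(X[part], y[part], eps_p, lam)
        classifiers.append(f_i)
        mistakes.append(int(np.sum(np.sign(X[holdout] @ f_i) != y[holdout])))
    z = np.array(mistakes, dtype=float)
    weights = np.exp(-eps_p * (z - z.min()) / 2.0)   # shifting by z.min() only rescales
    return classifiers[rng.choice(m, p=weights / weights.sum())]

# f_priv = private_tune(X, y, lambdas=[1e-1, 1e-2, 1e-3], eps_p=0.1,
#                       train_fn=private_linear_svm)
```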

We note that the list of potential $\Lambda$ values input to this procedure should not be a function of the private dataset. It can also be shown that the empirical error on $\mathcal{D}_{m+1}$ of the classifier output by this procedure is close to the empirical error of the best classifier in the set $\{f_1, \ldots, f_m\}$ on $\mathcal{D}_{m+1}$, provided $|\mathcal{D}|$ is high enough.

Theorem 6. The output of the tuning procedure of Algorithm 3 is $\epsilon_p$-differentially private.

The following theorem shows that the empirical error on $\mathcal{D}_{m+1}$ of the classifier output by the tuning procedure is close to the empirical error of the best classifier in the set $\{f_1, \ldots, f_m\}$. The proof of this theorem follows from Lemma 7 of [McSherry and Talwar, 2007].

Theorem 7. Let $z_{\min} = \min_i z_i$, and let $z$ be the number of mistakes made on $\mathcal{D}_{m+1}$ by the classifier output by our tuning procedure. Then, with probability $1 - \delta$,

$$z \le z_{\min} + \frac{1}{\epsilon_p} \log(m/\delta) . \qquad (20)$$


6 Experiments

We validated our theoretical results in three different scenarios. We tested the private linear SVM (Algorithm 1) on two different real data sets, the Adult dataset from the UCI Machine Learning Repository [Asuncion and Newman, 2007] and the KDDCup99 dataset [Hettich and Bay, 1999]. We validated the theoretical results for our kernel SVM method (Algorithm 2) using the Gaussian kernel on a toy problem using synthetic data. The convex optimization was done using a standard convex optimization library [Okazaki, 2009].

We chose the moderately-sized Adult dataset because demographic data is relevant for privacy; it contains demographic information about approximately 47,000 individuals, and the classification task was to predict whether the annual income of an individual is below or above $50,000, based on variables such as age, sex, occupation, and education. The average fraction of positive labels was 0.2478. In the larger KDDCup99 dataset [Hettich and Bay, 1999] the task was to predict whether a network connection is a denial-of-service attack, based on several attributes. The dataset includes about 5,000,000 instances, subsampled down to about 500,000 instances, with 19% positive labels, and was large enough to use our privacy-preserving tuning Algorithm 3 to tune the regularization parameter.[3] In the experiments with real data, we compared Algorithm 1 with the non-private (Huber) SVM, the privacy-preserving logistic regression classifier of [Chaudhuri and Monteleoni, 2008], and the Sensitivity-method, the straightforward application of the method of [Dwork et al., 2006] applied to the Huber SVM. On the real data sets, each value is an average over 10-fold cross-validation, and for the private algorithms, each value is additionally averaged over 50 random choices of $b$.

Imposing privacy requirements necessarily degrades classifier performance, and Figure 1(a) shows classification error versus privacy level for the Adult data set. As the privacy parameter $\epsilon_p$ increases, the error gap between the private (solid) and the non-private (dashed) methods shrinks. For Adult, Algorithm 1 strictly outperforms the logistic regression and sensitivity methods. The Adult dataset was not sufficiently large to allow us to do the privacy-preserving tuning of Algorithm 3, so instead we ran experiments over a matrix of $\Lambda$ and $\epsilon$ values. The best error rate achieved by the non-private algorithm on this data set is 15.362% ($\Lambda = 10^{-6}$), and the best for Algorithm 1 is 17.62% ($\Lambda = 10^{-3}$), for $\epsilon_p = 0.2$. The performance loss is approximately 2.3% and increases as $\epsilon_p$ decreases. For a data set of fixed size, these results illustrate the tradeoff between generalization $\epsilon_g$ and privacy $\epsilon_p$.

Our experiments suggest that for stronger privacy guarantees, the privacy-preserving SVM may be a better choice than logistic regression. For KDDCup99 the error versus $\epsilon_p$ curves for a random subset of size 60,000 points are shown in Figure 1(b). We see that when $\Lambda$ is tuned for all algorithms for $0.05$-differential privacy, Algorithm 1 (with $\Lambda = 10^{-2}$) performs only slightly more than 1.5% worse than the non-private SVM (with $\Lambda = 10^{-6}$), whereas the sensitivity-method is 7.5% worse (with $\Lambda = 10^{-1}$).[4] For small $\epsilon_p$, Algorithm 1 is superior to logistic regression, but the curves cross as the privacy requirement is weakened and $\epsilon_p$ increases.

[3] When using privacy-preserving tuning, we used 7 values of $\Lambda$, so 87.5% of the data was used for tuning.

[4] The tuned values for $\epsilon_p = 0.1$ are $\Lambda = 10^{-3}$ for Algorithm 1, and $\Lambda = 10^{-2}$ for the Sensitivity-method.


Figure 1: Epsilon curves: error rate versus privacy parameter $\epsilon_p$ for (a) linear kernel, UCI Adult data, (b) KDDCup99 data, (c) Gaussian kernel, nested-balls simulation data. Learning curves for KDDCup99: error rate versus training set size for (d) linear kernel, KDDCup99 data, $\epsilon_p = 0.05$, (e) $\epsilon_p = 0.005$, (f) Gaussian kernel, nested-balls generated data, $\epsilon_p = 0.1$.


(a)
Algorithm            | Adult ($\epsilon_p = 0.2$) | Adult ($\epsilon_p = 0.1$) | KDD ($\epsilon_p = 0.1$) | KDD ($\epsilon_p = 0.05$)
Non-private SVM      | 0.1536 | 0.1536 | 0.0063 | 0.0063
Sensitivity-method   | 0.2469 | 0.2456 | 0.0207 | 0.0784
Logistic regression  | 0.1819 | 0.1988 | 0.0148 | 0.0216
Algorithm 1          | 0.1762 | 0.1853 | 0.0179 | 0.0181

(b)
Algorithm                               | Simulation ($\epsilon_p = 0.1$)
Non-private Gaussian SVM (hinge loss)   | < 0.0506
Non-private RR-Gaussian SVM             | 0.0508
Algorithm 2                             | 0.1141

Figure 2: Table of errors for (a) linear and (b) Gaussian kernel SVM methods.

We tested the Gaussian kernel version[5] of Algorithm 2 on 240,000 data points in $\mathbb{R}^5$ sampled from a "nested balls" distribution with mass 0.45 uniformly in the ball of radius 0.1 with label $+1$, mass 0.45 uniformly in the annulus from 0.2 to 0.5 with label $-1$, and mass 0.1 uniformly in the annulus from 0.1 to 0.2 with equiprobable labels $\pm 1$. Since the sensitivity method cannot be used for nonlinear kernels, we compared Algorithm 2 against a kernel SVM with hinge loss using SVMlight [Joachims, 1999] and a non-private linear SVM using the projections of [Rahimi and Recht, 2008b]. We averaged over 5 random projections and 50 drawings of $b$.
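The simulated data described above can be generated as in the following sketch; uniform sampling in a ball or annulus in $\mathbb{R}^d$ uses a uniform direction and a radius obtained by inverting the radial CDF. The function name and seed are our own.

```python
import numpy as np

def sample_nested_balls(n, d=5, seed=0):
    """Nested-balls data: 45% uniform in the ball of radius 0.1 (label +1), 45% uniform in
    the annulus [0.2, 0.5] (label -1), 10% uniform in the annulus [0.1, 0.2] (random label)."""
    rng = np.random.default_rng(seed)

    def uniform_shell(k, r_min, r_max):
        u = rng.standard_normal((k, d))
        u /= np.linalg.norm(u, axis=1, keepdims=True)                  # uniform directions
        r = rng.uniform(r_min ** d, r_max ** d, size=k) ** (1.0 / d)   # radial CDF inversion
        return u * r[:, None]

    comp = rng.choice(3, size=n, p=[0.45, 0.45, 0.10])
    X, y = np.empty((n, d)), np.empty(n)
    for c, (r_min, r_max, label) in enumerate([(0.0, 0.1, 1.0), (0.2, 0.5, -1.0), (0.1, 0.2, 0.0)]):
        mask = comp == c
        X[mask] = uniform_shell(int(mask.sum()), r_min, r_max)
        y[mask] = label
    mixed = comp == 2
    y[mixed] = rng.choice([-1.0, 1.0], size=int(mixed.sum()))          # equiprobable labels
    return X, y
```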

Figure 2 gives tabular results. Loss in generalization performance is dominated by the cost of privacy; e.g., in (b), the linear SVM on random projections has error very close to that of the kernel SVM. The degradation due to privacy is about double, but the error rate, 11.4%, is still reasonable.

Learning curves. Figure 1(d)-(f) presents learning curves for KDDCup99 at $\epsilon_p = 0.05$, $\epsilon_p = 0.005$, and for the kernel simulation at $\epsilon_p = 0.1$.[6] For KDDCup99, for each data set size $N$ we did non-private tuning of the Huber-SVM, and used the private tuning procedure in Algorithm 3 for the private SVMs.[7] For the kernels, we compare to the Huber SVM with the random projection kernel method. These experiments show that achieving small generalization error requires large data sets, and decreasing $\epsilon_p$ requires even more data. In summary, our differentially private SVM makes significant reductions in error compared to the sensitivity-method, the only other differentially private SVM.

[5] The kernel parameter $\gamma = 0.5$ was tuned on a holdout set of 10,000 points using SVMlight.

[6] We omit the Adult dataset because it is not large enough to perform privacy-preserving tuning.

[7] See footnote 3.

References

[Agrawal and Srikant, 2000] Agrawal, R. and Srikant, R. (2000). Privacy-preserving data mining. SIGMOD Rec., 29(2):439–450.

[Asuncion and Newman, 2007] Asuncion, A. and Newman, D. (2007). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.

[Ball, 1997] Ball, K. (1997). An elementary introduction to modern convex geometry. In Levy, S., editor, Flavors of Geometry, volume 31 of Mathematical Sciences Research Institute Publications, pages 1–58. Cambridge University Press.

[Barak et al., 2007] Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., and Talwar, K. (2007). Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, pages 273–282.

[Blum et al., 2008] Blum, A., Ligett, K., and Roth, A. (2008). A learning theory approach to non-interactive database privacy. In Ladner, R. E. and Dwork, C., editors, STOC, pages 609–618. ACM.

[Chapelle, 2007] Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19(5):1155–1178.

[Chaudhuri and Monteleoni, 2008] Chaudhuri, K. and Monteleoni, C. (2008). Privacy-preserving logistic regression. In Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, Canada.

[Dwork, 2006] Dwork, C. (2006). Differential privacy. In Bugliesi, M., Preneel, B., Sassone, V., and Wegener, I., editors, ICALP (2), volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer.

[Dwork and Lei, 2009] Dwork, C. and Lei, J. (2009). Differential privacy and robust statistics. In STOC.

[Dwork et al., 2006] Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In 3rd IACR Theory of Cryptography Conference, pages 265–284.

[Evfimievski et al., 2003] Evfimievski, A., Gehrke, J., and Srikant, R. (2003). Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211–222.

[Ganta et al., 2008] Ganta, S. R., Kasiviswanathan, S. P., and Smith, A. (2008). Composition attacks and auxiliary information in data privacy. In KDD, pages 265–273.

[Hettich and Bay, 1999] Hettich, S. and Bay, S. (1999). The UCI KDD Archive. University of California, Irvine, Department of Information and Computer Science.

[Joachims, 1999] Joachims, T. (1999). Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning.

[Kasiviswanathan et al., 2008] Kasiviswanathan, S. A., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. (2008). What can we learn privately? In Proc. of Foundations of Computer Science.

[Kimeldorf and Wahba, 1970] Kimeldorf, G. and Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495–502.

[Laur et al., 2006] Laur, S., Lipmaa, H., and Mielikainen, T. (2006). Cryptographically private support vector machines. In KDD, pages 618–624.

[Machanavajjhala et al., 2006] Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. (2006). l-diversity: Privacy beyond k-anonymity. In Proc. of ICDE.

[Machanavajjhala et al., 2008] Machanavajjhala, A., Kifer, D., Abowd, J. M., Gehrke, J., and Vilhuber, L. (2008). Privacy: Theory meets practice on the map. In ICDE, pages 277–286.

[Mangasarian et al., 2008] Mangasarian, O. L., Wild, E. W., and Fung, G. (2008). Privacy-preserving classification of vertically partitioned data via random kernels. TKDD, 2(3).

[McSherry and Talwar, 2007] McSherry, F. and Talwar, K. (2007). Mechanism design via differential privacy. In FOCS, pages 94–103.

[Nissim et al., 2007] Nissim, K., Raskhodnikova, S., and Smith, A. (2007). Smooth sensitivity and sampling in private data analysis. In Johnson, D. S. and Feige, U., editors, STOC, pages 75–84. ACM.

[Okazaki, 2009] Okazaki, N. (2009). liblbfgs: a library of limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS). http://www.chokkan.org/software/liblbfgs/index.html.

[Rahimi and Recht, 2007] Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In NIPS, Vancouver, Canada.

[Rahimi and Recht, 2008a] Rahimi, A. and Recht, B. (2008a). Uniform approximation of functions with random bases. In Proceedings of the 46th Allerton Conference on Communication, Control, and Computing.

[Rahimi and Recht, 2008b] Rahimi, A. and Recht, B. (2008b). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, Vancouver, Canada.

[Shalev-Shwartz and Srebro, 2008] Shalev-Shwartz, S. and Srebro, N. (2008). SVM optimization: Inverse dependence on training set size. In ICML, Helsinki, Finland.

[Sridharan et al., 2008] Sridharan, K., Srebro, N., and Shalev-Shwartz, S. (2008). Fast rates for regularized objectives. In NIPS, Vancouver, Canada.

[Sweeney, 2002] Sweeney, L. (2002). k-anonymity: a model for protecting privacy. Int. J. on Uncertainty, Fuzziness and Knowledge-Based Systems.

[Zhan and Matwin, 2007] Zhan, J. Z. and Matwin, S. (2007). Privacy-preserving support vector machine classification. IJIIDS, 1(3/4):356–385.

7 Proofs

This section contains supplementary material and the proofs for our main results.

Random projections and kernel approximation

A kernel function $k(\cdot, \cdot)$ is associated to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ of functions from $\mathcal{X} \to \mathbb{R}$. We assume that the kernel function is bounded and that we can write it as an integral over a space $\Theta$ with finite measure $p(\theta)$ as follows:

$$k(x, x') = \int_{\Theta} \phi(x; \theta) \phi(x'; \theta) p(\theta) \, d\theta , \qquad (21)$$

where the feature function $\phi(x; \theta)$ can be thought of as the mapping of $x$ into the higher-dimensional space whose dimensions are indexed by $\theta \in \Theta$. We will assume that $p(\theta)$ is normalized so that it is a probability density over $\Theta$. We will assume that the functions $\phi(x; \theta)$ are bounded:

$$|\phi(x; \theta)| \le \zeta \quad \forall x \in \mathcal{X}, \ \forall \theta \in \Theta . \qquad (22)$$


Armed with this representation, we can write functions in $\mathcal{H}$ in the following way:

$$f(x) = \int_{\Theta} a(\theta) \phi(x; \theta) p(\theta) \, d\theta . \qquad (23)$$

This representation can arise quite naturally, as shown by Rahimi and Recht [Rahimi and Recht, 2007]. For example, for kernel functions $k(x, x') = k(x - x')$, Fourier features of the form $\phi(x; (\omega, \psi)) = \cos(\omega^T x + \psi)$ can be used, where $\theta = (\omega, \psi)$. The density $p(\theta)$ can be interpreted as a Fourier transform of the kernel function. Other kernels can be handled using different feature functions (see [Rahimi and Recht, 2008a] for more examples). For a constant $C$, we can define a class of functions

$$\mathcal{F}_p(C) = \left\{ f(x) = \int_{\Theta} a(\theta) \phi(x; \theta) p(\theta) \, d\theta \ \Big|\ |a(\theta)| < C \right\} . \qquad (24)$$

Rahimi and Recht [Rahimi and Recht, 2008a, Proposition 4.1] showed that the class of functions $\mathcal{F}_p(\infty)$ is dense in the RKHS $\mathcal{H}$ and that the completion of $\mathcal{F}_p(\infty)$ is equal to $\mathcal{H}$.
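As a quick numerical check of the representation (21) with these Fourier features, the Monte Carlo average of $2\cos(\omega^T x + \psi)\cos(\omega^T x' + \psi)$ over $\omega \sim \mathcal{N}(0, 2\gamma I_d)$ and $\psi \sim \mathrm{Uniform}[-\pi, \pi]$ should approach the Gaussian kernel $\exp(-\gamma\|x - x'\|^2)$; the factor $2$ corresponds to the $\sqrt{2/D}$ scaling in Algorithm 2. A small sketch, with values and names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, num_samples = 3, 0.5, 200_000
x, x2 = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)

omega = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(num_samples, d))
psi = rng.uniform(-np.pi, np.pi, size=num_samples)

monte_carlo = np.mean(2.0 * np.cos(omega @ x + psi) * np.cos(omega @ x2 + psi))
exact = np.exp(-gamma * np.sum((x - x2) ** 2))
print(monte_carlo, exact)   # the two values should agree to roughly two decimal places
```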

We need to show that bounded classifiers $f$ induce bounded functions $a(\theta)$. Given a function $f$ in an RKHS, we can write the evaluation functional as an inner product (this is the reproducing kernel property) and substitute the representation of the kernel function in (21):

$$f(x) = \int_{\mathcal{X}} f(x') k(x, x') \, dx' \qquad (25)$$

$$= \int_{\mathcal{X}} f(x') \int_{\Theta} \phi(x; \theta) \phi(x'; \theta) p(\theta) \, d\theta \, dx' \qquad (26)$$

$$= \int_{\Theta} \left( \int_{\mathcal{X}} f(x') \phi(x'; \theta) \, dx' \right) \phi(x; \theta) p(\theta) \, d\theta . \qquad (27)$$

Thus we have

$$a(\theta) = \int_{\mathcal{X}} f(x') \phi(x'; \theta) \, dx' \qquad (28)$$

$$|a(\theta)| \le \mathrm{Vol}(\mathcal{X}) \cdot \zeta \cdot \|f\|_\infty . \qquad (29)$$

This shows that $a(\theta)$ is bounded uniformly over $\Theta$ when $f(x)$ is bounded uniformly over $\mathcal{X}$. The volume of the unit ball decreases for large $d$, and in particular [Ball, 1997]:

$$\mathrm{Vol}(\mathcal{X}) = \frac{\pi^{d/2}}{\Gamma(\tfrac{d}{2} + 1)} . \qquad (30)$$

For large $d$ this is $\left(\sqrt{\tfrac{2\pi e}{d}}\right)^d$ by Stirling's formula. Furthermore, we have

$$\|f\|_{\mathcal{H}}^2 = \int_{\Theta} a(\theta)^2 p(\theta) \, d\theta . \qquad (31)$$


Proof of Theorem 1

Proof. The proof follows the argument of [Chaudhuri and Monteleoni, 2008]. Consider two data sets differing in only one point:

$$\mathcal{D} = \{(x(i), y(i)) : i \in [n]\} \qquad (32)$$

$$\mathcal{D}' = \{(x(i), y(i)) : i \ne j\} \cup \{(x'(j), y'(j))\} . \qquad (33)$$

Now suppose that the algorithm outputs the same classifier for both data sets, so $f_{\mathrm{priv}}(\cdot; \mathcal{D}) = f_{\mathrm{priv}}(\cdot; \mathcal{D}') = f$. The only randomness in the process is the realization of the noise added to the objective. Let $u$ and $u'$ denote the realization of $b$ for $\mathcal{D}$ and $\mathcal{D}'$, respectively.

Since $f$ minimizes the objective, the gradient of the objective is $0$ at $f$ for both $u$ and $u'$:

$$\nabla J_{\mathrm{priv}}(f \mid \mathcal{D}) = \frac{1}{n} \sum_{i \ne j} \left.\partial_t \ell_h(y(i), t)\right|_{t = f^T x(i)} x(i) + \frac{1}{n}\left.\partial_t \ell_h(y(j), t)\right|_{t = f^T x(j)} x(j) + \Lambda f + \frac{1}{n} u \qquad (34)$$

$$\nabla J_{\mathrm{priv}}(f \mid \mathcal{D}') = \frac{1}{n} \sum_{i \ne j} \left.\partial_t \ell_h(y(i), t)\right|_{t = f^T x(i)} x(i) + \frac{1}{n}\left.\partial_t \ell_h(y'(j), t)\right|_{t = f^T x'(j)} x'(j) + \Lambda f + \frac{1}{n} u' . \qquad (35)$$

By canceling common terms we obtain:

$$\left.\partial_t \ell_h(y(j), t)\right|_{t = f^{*T} x(j)} x(j) + u = \left.\partial_t \ell_h(y'(j), t)\right|_{t = f^{*T} x'(j)} x'(j) + u' . \qquad (36)$$

The derivative of the Huber loss is:

$$\partial_t \ell_h(y, t) = \begin{cases} 0 & \text{if } yt > 1 + h \\ -\frac{y}{2h}(1 + h - yt) & \text{if } |1 - yt| \le h \\ -y & \text{if } yt < 1 - h \end{cases} \qquad (37)$$

The derivative is always in the interval $[-1, 1]$, so

$$\|u - u'\| \le 2 , \qquad (38)$$

and therefore $\|u'\|$ is within $2$ of $\|u\|$, or $\|u\| - 2 \le \|u'\| \le \|u\| + 2$. Now, the log likelihood ratio for seeing the classifier $f$ given the two data sets $\mathcal{D}$ and $\mathcal{D}'$ is

$$\log \frac{P(f_{\mathrm{priv}}(\cdot; \mathcal{D}) = f \mid \mathcal{D})}{P(f_{\mathrm{priv}}(\cdot; \mathcal{D}') = f \mid \mathcal{D}')} = \log \frac{G(u)}{G(u')} \qquad (39)$$

$$= \log \exp\!\left( -(\epsilon_p/2)(\|u\| - \|u'\|) \right) \qquad (40)$$

$$\in [-\epsilon_p, \epsilon_p] . \qquad (41)$$

Therefore the algorithm preserves $\epsilon_p$-differential privacy.


Proof of Theorem 2

Let $L(f)$ be the expected loss for classifier $f$:

$$L(f) = \mathbb{E}_P\left[ \ell_h(Y, f^T X + \mu) \right] , \qquad (42)$$

and let $J$ be the regularized true risk (expected loss):

$$J(f) = L(f) + \frac{\Lambda}{2}\|f\|^2 . \qquad (43)$$

Proof. Let $f_{\mathrm{rtr}} = \arg\min_f J(f)$ be the classifier that minimizes the regularized true risk, $f_0$ be the classifier that minimizes the non-private SVM objective $J(f)$, and $f_{\mathrm{priv}}$ be the classifier that minimizes $J_{\mathrm{priv}}(f)$. Then we can write:

$$J(f_{\mathrm{priv}}) = J(f^*) + (J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}})) + (J(f_{\mathrm{rtr}}) - J(f^*)) .$$

From this we can see that (c.f. [Shalev-Shwartz and Srebro, 2008]):

$$L(f_{\mathrm{priv}}) = L(f^*) + (J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}})) + (J(f_{\mathrm{rtr}}) - J(f^*)) + \frac{\Lambda}{2}\|f^*\|^2 - \frac{\Lambda}{2}\|f_{\mathrm{priv}}\|^2 . \qquad (44)$$

Now from Lemma 1, with probability at least $1 - \delta$,

$$J(f_{\mathrm{priv}}) - J(f_0) \le \frac{8 d^2 \log^2(d/\delta)}{\Lambda n^2 \epsilon_p^2} . \qquad (45)$$

Now using [Sridharan et al., 2008, Corollary 2], we can bound the term $(J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}}))$ by twice the gap in the regularized empirical risk difference (45) plus an additional term. That is, with probability $1 - \delta$:

$$J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}}) \le \frac{16 d^2 \log^2(d/\delta)}{\Lambda n^2 \epsilon_p^2} + O\left( \frac{1}{\Lambda n} \right) . \qquad (46)$$

Since $f_{\mathrm{rtr}}$ minimizes $J(f)$, we know $J(f_{\mathrm{rtr}}) \le J(f^*)$, so

$$L(f_{\mathrm{priv}}) \le L(f^*) + \frac{\Lambda}{2}\|f^*\|^2 + \frac{16 d^2 \log^2(d/\delta)}{\Lambda n^2 \epsilon_p^2} + O\left( \frac{1}{\Lambda n} \right) .$$

The last thing to do is set $\Lambda$ and the bound on the number of samples. If $\Lambda = \epsilon_g / \|f_0\|^2$ then $\frac{\Lambda}{2}\|f^*\|^2 < \epsilon_g/2$. For a suitably large constant $C$, if $n > C\|f_0\|^2/\epsilon_g^2$, then the $O(1/\Lambda n)$ term can be upper bounded by $\epsilon_g/4$. Similarly, if $n > C\|f_0\| \, d \log(d/\delta)/\epsilon_p\epsilon_g$ then the penultimate term is upper bounded by $\epsilon_g/4$. Thus the total loss can be bounded:

$$L(f_{\mathrm{priv}}) \le L(f_0) + \epsilon_g . \qquad (47)$$


Proof of Theorem 3

Proof. The pre-filter values $\{\theta_j : j \in [D]\}$ are i.i.d. and independent of the data, so with some abuse of notation,

$$P\left( f = \bar{f}, \{\theta_j : j \in [D]\} = \{\bar{\theta}_j : j \in [D]\} \,\Big|\, \{(x(i), y(i))\} \right) = P\left( f = \bar{f} \,\Big|\, \{(x(i), y(i))\}, \{\theta_j : j \in [D]\} = \{\bar{\theta}_j : j \in [D]\} \right) \prod_{j=1}^{D} p(\bar{\theta}_j) . \qquad (48)$$

Since this holds for all $\{\bar{\theta}_j : j \in [D]\}$, the terms $\prod_{j=1}^{D} p(\bar{\theta}_j)$ cancel in the log likelihood ratio in (3). Since Algorithm 1 guarantees $\epsilon_p$-differential privacy by Theorem 1, the outputs of Algorithm 2 also guarantee $\epsilon_p$-differential privacy.

Proof of Lemma 2

Proof. The assumptions guarantee that $f^* \in \mathcal{F}_p(C)$ as defined in (24), so [Rahimi and Recht, 2008b, Theorem 1] shows that with probability $1 - 2\delta$ there exists an $f_0 \in \mathbb{R}^D$ such that

$$L(f_0) \le L(f^*) + O\left( \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{D}} \right) C \sqrt{\log\frac{1}{\delta}} \right) . \qquad (49)$$

We will choose $D$ later to make this loss small. Furthermore, $f_0$ is guaranteed to have $\|f_0\|_\infty \le C/D$, so

$$\|f_0\|_2^2 \le \frac{C^2}{D} . \qquad (50)$$

Now, given such an $f_0$, we must show that $f_{\mathrm{priv}}$ will have true risk close to that of $f_0$ as long as there are enough data points. This can be shown using the techniques in [Shalev-Shwartz and Srebro, 2008]. Let

$$J(f) = L(f) + \frac{\Lambda}{2}\|f\|_2^2 \qquad (51)$$

$$f_{\mathrm{rtr}} = \arg\min_{f \in \mathbb{R}^D} J(f) , \qquad (52)$$

where $f_{\mathrm{rtr}}$ minimizes the regularized true risk. Then

$$J(f_{\mathrm{priv}}) = J(f_0) + (J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}})) + (J(f_{\mathrm{rtr}}) - J(f_0)) . \qquad (53)$$

Now, since $J(\cdot)$ is minimized by $f_{\mathrm{rtr}}$, the last term is negative and we can disregard it. Then we have

$$L(f_{\mathrm{priv}}) - L(f_0) \le (J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}})) + \frac{\Lambda}{2}\|f_0\|_2^2 - \frac{\Lambda}{2}\|f_{\mathrm{priv}}\|_2^2 . \qquad (54)$$


From Lemma 1, with probability at least $1 - \delta$:

$$J(f_{\mathrm{priv}}) - J\left( \arg\min_f J(f) \right) \le \frac{8 D^2 \log^2(D/\delta)}{\Lambda n^2 \epsilon_p^2} . \qquad (55)$$

Now using [Sridharan et al., 2008, Corollary 2], we can bound the term $(J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}}))$ by twice the gap in the regularized empirical risk difference (55) plus an additional term. That is, with probability $1 - \delta$:

$$J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}}) \le 2\left( J(f_{\mathrm{priv}}) - J(f_{\mathrm{rtr}}) \right) + O\left( \frac{\log(1/\delta)}{\Lambda n} \right) \qquad (56)$$

$$\le \frac{16 D^2 \log^2(D/\delta)}{\Lambda n^2 \epsilon_p^2} + O\left( \frac{\log(1/\delta)}{\Lambda n} \right) . \qquad (57)$$

Plugging (57) into (54), discarding the negative term involving $\|f_{\mathrm{priv}}\|_2^2$, and setting $\Lambda = \epsilon_g/\|f_0\|^2$ gives

$$L(f_{\mathrm{priv}}) - L(f_0) \le \frac{16\|f_0\|_2^2 D^2 \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{\|f_0\|_2^2 \log\frac{1}{\delta}}{n \epsilon_g} \right) + \frac{\epsilon_g}{2} . \qquad (58)$$

Now we have, using (49) and (58), that with probability $1 - 4\delta$:

$$L(f_{\mathrm{priv}}) - L(f^*) \le (L(f_{\mathrm{priv}}) - L(f_0)) + (L(f_0) - L(f^*)) \qquad (59)$$

$$\le \frac{16\|f_0\|_2^2 D^2 \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{\|f_0\|_2^2 \log(1/\delta)}{n \epsilon_g} \right) + \frac{\epsilon_g}{2} + O\left( \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{D}} \right) C \sqrt{\log\frac{1}{\delta}} \right) . \qquad (60)$$

Now, from (50) we can substitute:

$$L(f_{\mathrm{priv}}) - L(f^*) \le \frac{16 C^2 D \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{C^2 \log(1/\delta)}{D n \epsilon_g} \right) + \frac{\epsilon_g}{2} + O\left( \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{D}} \right) C \sqrt{\log\frac{1}{\delta}} \right) . \qquad (61)$$

To set the parameters, we will choose $D < n$ so that

$$L(f_{\mathrm{priv}}) - L(f^*) \le \frac{16 C^2 D \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{C^2 \log(1/\delta)}{D n \epsilon_g} \right) + \frac{\epsilon_g}{2} + O\left( \frac{C\sqrt{\log(1/\delta)}}{\sqrt{D}} \right) .$$

Again focusing on the last term, we set $D = O(C^2 \log(1/\delta)/\epsilon_g^2)$ to make this last term $\epsilon_g/6$:

$$L(f_{\mathrm{priv}}) - L(f^*) \le \frac{16 C^2 D \log^2(D/\delta)}{n^2 \epsilon_p^2 \epsilon_g} + O\left( \frac{C^2 \log(1/\delta)}{D n \epsilon_g} \right) + \frac{2\epsilon_g}{3} \qquad (62)$$

$$\le O\left( \frac{C^4 \log\frac{1}{\delta} \, \log^2 \frac{C^2\log(1/\delta)}{\epsilon_g^2 \delta}}{n^2 \epsilon_p^2 \epsilon_g^3} \right) + O\left( \frac{\epsilon_g}{n} \right) + \frac{2\epsilon_g}{3} . \qquad (63)$$


Now we set

$$n = \Omega\left( \frac{C^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^2} \cdot \log\frac{C \log(1/\delta)}{\epsilon_g \delta} \right) \qquad (64)$$

to get

$$L(f_{\mathrm{priv}}) - L(f^*) \le \epsilon_g . \qquad (65)$$

Proof of Theorem 4

Proof. Substituting $C = \mathrm{Vol}(\mathcal{X})\,\zeta\,\|f^*\|_\infty$ in (13) from Lemma 2 and using (30), we get the result.

Proof of Theorem 5

Let $f^*$ be a classifier with norm $\|f^*\|_{\mathcal{H}}$ and expected loss $L(f^*)$. Now consider

$$f_{\mathrm{rtr}} = \arg\min_f \left\{ L(f) + \frac{\Lambda_{\mathrm{rtr}}}{2}\|f\|_{\mathcal{H}}^2 \right\} . \qquad (66)$$

We will first need a bound on $\|f_{\mathrm{rtr}}\|_\infty$ in order to use our previous sample complexity results. Note that since $f_{\mathrm{rtr}}$ is a minimizer, we can take the derivative of the regularized expected loss and set it to $0$ to get:

$$f_{\mathrm{rtr}}(x') = -\frac{1}{\Lambda_{\mathrm{rtr}}} \, \partial_f L(f) \qquad (67)$$

$$= -\frac{1}{\Lambda_{\mathrm{rtr}}} \left( \partial_f \int_{\mathcal{X}} \ell(f(x), y) \, dP(x, y) \right) \qquad (68)$$

$$= -\frac{1}{\Lambda_{\mathrm{rtr}}} \left( \int_{\mathcal{X}} \left( \partial_{f(x')} \ell(f(x), y) \right) \cdot \left( \partial_{f(x')} f(x) \right) dP(x, y) \right) , \qquad (69)$$

where $P(x, y)$ is a distribution on pairs $(x, y)$. Now, using the representer theorem:

$$\partial_{f(x')} f(x) = k(x', x) . \qquad (70)$$

And the kernel function is bounded. Furthermore, the derivative of the loss is always upper bounded by $1$, so the integrand can be upper bounded by a constant. Since $P(x, y)$ is a probability distribution, we have for all $x'$:

$$|f_{\mathrm{rtr}}(x')| = O\left( \frac{1}{\Lambda_{\mathrm{rtr}}} \right) . \qquad (71)$$

Now we set $\Lambda_{\mathrm{rtr}} = \epsilon_g/\|f^*\|_{\mathcal{H}}^2$ to get

$$\|f_{\mathrm{rtr}}\|_\infty = O\left( \frac{\|f^*\|_{\mathcal{H}}^2}{\epsilon_g} \right) . \qquad (72)$$


We now have two cases to consider, depending on whether $L(f^*) < L(f_{\mathrm{rtr}})$ or $L(f^*) > L(f_{\mathrm{rtr}})$.

Case 1: Suppose that $L(f^*) < L(f_{\mathrm{rtr}})$. Then by the definition of $f_{\mathrm{rtr}}$,

$$L(f_{\mathrm{rtr}}) + \frac{\epsilon_g}{2} \cdot \frac{\|f_{\mathrm{rtr}}\|_{\mathcal{H}}^2}{\|f^*\|_{\mathcal{H}}^2} \le L(f^*) + \frac{\epsilon_g}{2} . \qquad (73)$$

Since $\frac{\epsilon_g}{2} \cdot \frac{\|f_{\mathrm{rtr}}\|_{\mathcal{H}}^2}{\|f^*\|_{\mathcal{H}}^2} \ge 0$, we have

$$L(f_{\mathrm{rtr}}) - L(f^*) \le \frac{\epsilon_g}{2} . \qquad (74)$$

Case 2: Suppose that $L(f^*) > L(f_{\mathrm{rtr}})$. Then the regularized classifier has better generalization performance than the original, so we have trivially that

$$L(f_{\mathrm{rtr}}) - L(f^*) \le \frac{\epsilon_g}{2} . \qquad (75)$$

Therefore in both cases we have a bound on $\|f_{\mathrm{rtr}}\|_\infty$ and a generalization gap of $\epsilon_g/2$. We can now apply our previous generalization bound to get that for

$$n = \Omega\left( \frac{\|f^*\|_{\mathcal{H}}^4 \, \zeta^2 \, \mathrm{Vol}(\mathcal{X})^2 \sqrt{\log(1/\delta)}}{\epsilon_p \epsilon_g^4} \cdot \log \frac{\|f^*\|_{\mathcal{H}}^2 \, \zeta \, \mathrm{Vol}(\mathcal{X}) \log(1/\delta)}{\epsilon_g^2 \delta \, \Gamma(\tfrac{d}{2} + 1)} \right) , \qquad (76)$$

we have

$$\mathbb{P}\left( L(f_{\mathrm{priv}}) - L(f^*) \le \epsilon_g \right) \ge 1 - 4\delta . \qquad (77)$$

Proof of Theorem 6

Proof. Let $\mathcal{A}(\mathcal{D})$ denote the output of the tuning algorithm when the input is database $\mathcal{D}$, and let $\mathcal{D} = \bar{\mathcal{D}} \cup \{(x(n), y(n))\}$ and $\mathcal{D}' = \bar{\mathcal{D}} \cup \{(x'(n), y'(n))\}$ be two databases which differ in the private value of one person. Moreover, let $\mathbf{F} = (f_1, \ldots, f_m)$ denote the classifiers computed in Step 2 of the tuning algorithm. Note that, for any $f$ and $\mathbf{F}$,

$$P(\mathcal{A}(\mathcal{D}) = f) = \int_{\mathbf{F}} P(\mathcal{A}(\mathcal{D}) = f \mid \mathbf{F}) \, P(\mathbf{F} \mid \mathcal{D}) \, d\mu(\mathbf{F}) \qquad (78)$$

$$= P(\mathcal{A}(\mathcal{D}) = f \mid \mathbf{F}, \mathcal{D}_{m+1}) \cdot \prod_i \int_{f_i} P(f_i \mid \mathcal{D}_i) \, d\mu(f_i) , \qquad (79)$$

where $\mu(\mathbf{F})$ is the uniform measure on $\mathbf{F}$, and $\mu(f_i)$ is the uniform measure on $f_i$. Here, the last line follows from the fact that for each $i$, the random choices made in execution $i$ in Step 2, as well as the random choices in Step 3, are independent.

If $(x(n), y(n)) \in \mathcal{D}_j$, for $j = 1, \ldots, m$, then, by Theorem 1,

$$P(f_j \mid \mathcal{D}_j) \le e^{\epsilon_p} P\left( f_j \mid \mathcal{D}'_j \right) \qquad (80)$$

$$P(f_i \mid \mathcal{D}_i) = P(f_i \mid \mathcal{D}'_i), \quad i \ne j . \qquad (81)$$


Moreover, $P(\mathcal{A}(\mathcal{D}) = f \mid \mathbf{F}, \mathcal{D}_{m+1})$ is also the same for the two databases. Therefore

$$P(\mathcal{A}(\mathcal{D}) = f) \le e^{\epsilon_p} P(\mathcal{A}(\mathcal{D}') = f) . \qquad (82)$$

On the other hand, if $(x(n), y(n)) \in \mathcal{D}_{m+1}$ then $\epsilon_p$-differential privacy follows from Theorem 6 of [McSherry and Talwar, 2007].


[Supplementary figures: only the panel titles and legends are recoverable.]

Error versus $\Lambda$ for non-private and private algorithms ($\epsilon_p = 0.1$): non-private Huber-SVM, Algorithm 1, private LR.

Error versus $\Lambda$ for non-private and private algorithms ($\epsilon_p = 0.05$): non-private Huber-SVM, Algorithm 1, logistic regression, sensitivity.

Error versus training set size for KDDCup99 ($\epsilon_p = 0.01$): non-private SVM, Algorithm 1, sensitivity.