
In the 39th IEEE Symposium on Security and Privacy, May 2018, San Francisco, CA

Stealing Hyperparameters in Machine Learning

Binghui Wang, Neil Zhenqiang Gong
ECE Department, Duke University
{binghui.wang, neil.gong}@duke.edu

Abstract—Hyperparameters are critical in machine learning, as different hyperparameters often result in models with significantly different performance. Hyperparameters may be deemed confidential because of their commercial value and the confidentiality of the proprietary algorithms that the learner uses to learn them. In this work, we propose attacks on stealing the hyperparameters that are learnt by a learner. We call our attacks hyperparameter stealing attacks. Our attacks are applicable to a variety of popular machine learning algorithms such as ridge regression, logistic regression, support vector machine, and neural network. We evaluate the effectiveness of our attacks both theoretically and empirically. For instance, we evaluate our attacks on Amazon Machine Learning. Our results demonstrate that our attacks can accurately steal hyperparameters. We also study countermeasures. Our results highlight the need for new defenses against our hyperparameter stealing attacks for certain machine learning algorithms.

I. INTRODUCTION

Many popular supervised machine learning (ML) algorithms–such as ridge regression (RR) [19], logistic regression (LR) [20], support vector machine (SVM) [13], and neural network (NN) [16]–learn the parameters in a model via minimizing an objective function, which is often in the form of loss function + λ × regularization term. The loss function characterizes how well the model performs over the training dataset, the regularization term is used to prevent overfitting [7], and λ balances between the two. Conventionally, λ is called a hyperparameter. Note that there could be multiple hyperparameters if the ML algorithm adopts multiple regularization terms. Different ML algorithms use different loss functions and/or regularization terms.

Hyperparameters are critical for ML algorithms. For the same training dataset, with different hyperparameters, an ML algorithm might learn models that have significantly different performance on the testing dataset, e.g., see our experimental results about the impact of hyperparameters on different ML classifiers in Figure 16 in the Appendix. Moreover, hyperparameters are often learnt through a computationally expensive cross-validation process, which may be implemented by proprietary algorithms that could vary across learners. Therefore, hyperparameters may be deemed confidential.

Our work: In this work, we formulate the research problem of stealing hyperparameters in machine learning, and we provide the first systematic study on hyperparameter stealing attacks as well as their defenses.

Hyperparameter stealing attacks. We adopt a threat model in which an attacker knows the training dataset, the ML algorithm (characterized by an objective function), and (optionally) the learnt model parameters. Our threat model is motivated by the emerging machine-learning-as-a-service (MLaaS) cloud platforms, e.g., Amazon Machine Learning [1] and Microsoft Azure Machine Learning [25], in which the attacker could be a user of an MLaaS platform. When the model parameters are unknown, the attacker can use model parameter stealing attacks [54] to learn them. As a first step towards studying the security of hyperparameters, we focus on hyperparameters that are used to balance between the loss function and the regularization terms in the objective function. Many popular ML algorithms–such as ridge regression, logistic regression, and SVM (please refer to Table I for more ML algorithms)–rely on such hyperparameters. It would be interesting future work to study the security of hyperparameters for other ML algorithms, e.g., the hyperparameter K for KNN, as well as network architecture, dropout rate [49], and mini-batch size for deep neural networks. However, as we will demonstrate in our experiments, an attacker (e.g., a user of an MLaaS platform) can already significantly financially benefit from stealing the hyperparameters in the objective function.

We make a key observation that the model parameters learnt by an ML algorithm are often minima of the corresponding objective function. Roughly speaking, a data point is a minimum of an objective function if the objective function has larger values at the nearby data points. This implies that the gradient of the objective function at the model parameters is close to 0 (0 is a vector whose entries are all 0). Our attacks are based on this key observation. First, we propose a general attack framework to steal hyperparameters. Specifically, in our framework, we compute the gradient of the objective function at the model parameters and set it to 0, which gives us a system of linear equations about the hyperparameters. This linear system is overdetermined since the number of equations (i.e., the number of model parameters) is usually larger than the number of unknown variables (i.e., hyperparameters). Therefore, we leverage the linear least square method [30], a widely used method to derive an approximate solution of an overdetermined system, to estimate the hyperparameters. Second, we demonstrate how we can apply our framework to steal hyperparameters for a variety of ML algorithms.

Theoretical and empirical evaluations. We evaluate our attacks both theoretically and empirically. Theoretically, we show that 1) when the learnt model parameters are an exact minimum of the objective function, our attacks can obtain the exact hyperparameters; and 2) when the model parameters deviate from their closest minimum of the objective function with a small difference, then our estimation error is a linear function of the difference. Empirically, we evaluate the effectiveness of our attacks using six real-world datasets. Our results demonstrate that our attacks can accurately estimate the hyperparameters on all datasets for various ML algorithms. For instance, for various regression algorithms, the relative estimation errors are less than 10^−4 on the datasets.



Moreover, via simulations and evaluations on Amazon Machine Learning, we show that a user can use our attacks to learn a model via MLaaS at a much lower economic cost, while not sacrificing the model's testing performance. Specifically, the user samples a small fraction of the training dataset, learns model parameters via MLaaS, steals the hyperparameters using our attacks, and re-learns model parameters using the entire training dataset and the stolen hyperparameters via MLaaS.

Rounding as a defense. One natural defense against our attacks is to round model parameters, so that attackers obtain obfuscated model parameters. We note that rounding was proposed to obfuscate confidence scores of model predictions to mitigate model inversion attacks [14] and model stealing attacks [54]. We evaluate the effectiveness of rounding using the six real-world datasets.

First, our results show that rounding increases the relative estimation errors of our attacks, which is consistent with our theoretical evaluation. However, for some ML algorithms, our attacks are still effective. For instance, for LASSO (a popular regression algorithm) [53], the relative estimation errors are still less than around 10^−3 even if we round the model parameters to one decimal. Our results highlight the need to develop new countermeasures for hyperparameter stealing attacks. Second, since different ML algorithms use different regularization terms, one natural question is which regularization term has a better security property. Our results demonstrate that the L2 regularization term can more effectively defend against our attacks than the L1 regularization term using rounding. This implies that an ML algorithm should use L2 regularization in terms of security against hyperparameter stealing attacks. Third, we also compare different loss functions in terms of their security property, and we observe that cross-entropy loss and square hinge loss can more effectively defend against our attacks than regular hinge loss using rounding. The cross-entropy loss function is adopted by logistic regression [20], while square and regular hinge loss functions are adopted by support vector machine and its variants [21].

In summary, our contributions are as follows:

• We provide the first study on hyperparameter stealing attacks to machine learning. We propose a general attack framework to steal the hyperparameters in the objective functions.

• We evaluate our attacks both theoretically and empirically. Our empirical evaluations on several real-world datasets demonstrate that our attacks can accurately estimate hyperparameters for various ML algorithms. We also show the success of our attacks on Amazon Machine Learning.

• We evaluate rounding model parameters as a defense against our attacks. Our empirical evaluation results show that our attacks are still effective for certain ML algorithms, highlighting the need for new countermeasures. We also compare different regularization terms and different loss functions in terms of their security against our attacks.

II. RELATED WORK

Existing attacks to ML can be roughly classified into four categories: poisoning attacks, evasion attacks, model inversion attacks, and model extraction attacks. Poisoning attacks and evasion attacks are also called causative attacks and exploratory attacks [2], respectively. Our hyperparameter stealing attacks are orthogonal to these attacks.

Poisoning attacks: In poisoning attacks, an attacker aims to pollute the training dataset such that the learner produces a bad classifier, which would mislabel malicious content or activities generated by the attacker at testing time. In particular, the attacker could insert new instances, edit existing instances, or remove existing instances in the training dataset [38]. Existing studies have demonstrated poisoning attacks to worm signature generators [34], [41], [35], spam filters [31], [32], anomaly detectors [43], [24], SVMs [5], face recognition methods [4], as well as recommender systems [27], [57].

Evasion attacks: In these attacks [33], [22], [3], [51], [52], [17], [26], [23], [37], [56], [9], [44], [28], [36], an attacker aims to inject carefully crafted noise into a testing instance (e.g., an email spam, a social spam, a malware, or a face image) such that the classifier predicts a different label for the instance. The injected noise often preserves the semantics of the original instance (e.g., a malware with injected noise preserves its malicious behavior) [51], [56], is human imperceptible [52], [17], [37], [28], or is physically realizable [9], [44]. For instance, Xu et al. [56] proposed a general evasion attack to search for a malware variant that preserves the malicious behavior of the malware but is classified as benign by the classifier (e.g., PDFrate [47] or Hidost [50]). Szegedy et al. [52] observed that deep neural networks would misclassify an image after we inject a small amount of noise that is imperceptible to humans. Sharif et al. [44] showed that an attacker can inject human-imperceptible noise into a face image to evade recognition or impersonate another individual, and the noise can be physically realized by the attacker wearing a pair of customized eyeglass frames. Moreover, evasion attacks can even be black-box, i.e., when the attacker does not know the classification model. This is because an adversarial example optimized for one model is highly likely to be effective for other models, which is known as transferability [52], [28], [36].

We note that Papernot et al. [39] proposed a distillation technique to defend against evasion attacks to deep neural networks. However, Carlini and Wagner [10] demonstrated that this distillation technique is not as secure to new evasion attacks as we thought. Cao and Gong [8] found that adversarial examples, especially those generated by the attacks proposed by Carlini and Wagner, are close to the classification boundary. Based on this observation, they proposed region-based classification, which ensembles information in the neighborhoods around a testing example (normal or adversarial) to predict its label. Specifically, for a testing example, they sample some examples around the testing example in the input space and take a majority vote among the labels of the sampled examples as the label of the testing example. Such region-based classification significantly enhances the robustness of deep neural networks against various evasion attacks, without sacrificing classification accuracy on normal examples at all. In particular, an evasion attack needs to add much larger noise in order to construct adversarial examples that successfully evade region-based classifiers.

Model inversion attacks: In these attacks [15], [14], an attacker aims to leverage model predictions to compromise user privacy. For instance, Fredrikson et al. [15] demonstrated that model inversion attacks can infer an individual's private genotype information. Furthermore, via considering confidence scores of model predictions [14], model inversion attacks can estimate whether a respondent in a lifestyle survey admitted to cheating on its significant other and can recover recognizable images of people's faces given their name and access to the model. Several studies [45], [48], [42] demonstrated even stronger attacks, i.e., an attacker can infer whether a particular instance was in the training dataset or not.

Model extraction attacks: These attacks aim to steal parameters of an ML model. Stealing model parameters compromises the intellectual property and algorithm confidentiality of the learner, and also enables an attacker to perform evasion attacks or model inversion attacks subsequently [54]. Lowd and Meek [29] presented efficient algorithms to steal model parameters of linear classifiers when the attacker can issue membership queries to the model through an API. Tramèr et al. [54] demonstrated that model parameters can be more accurately and efficiently extracted when the API also produces confidence scores for the class labels.

III. BACKGROUND AND PROBLEM DEFINITION

A. Key Concepts in Machine Learning

We introduce some key concepts in machine learning (ML). In particular, we discuss supervised learning, which is the focus of this work. We will represent vectors and matrices as bold lowercase and uppercase symbols, respectively. For instance, x is a vector while X is a matrix. x_i denotes the ith element of the vector x. We assume all vectors are column vectors in this paper. x^T (or X^T) is the transpose of x (or X).

Decision function: Supervised ML aims to learn a decision function f, which takes an instance as input and produces a label for the instance. The instance is represented by a feature vector; the label can be a continuous value (i.e., regression problem) or a categorical value (i.e., classification problem). The decision function is characterized by certain parameters, which we call model parameters. For instance, for a linear regression problem, the decision function is f(x) = w^T x, where x is the instance and w is the model parameters. For a kernel regression problem, the decision function is f(x) = w^T φ(x), where φ is a kernel mapping function that maps an instance to a point in a high-dimensional space. Kernel methods are often used to make a linear model nonlinear.

Learning model parameters in a decision function: An ML algorithm is a computational procedure to determine the model parameters in a decision function from a given training dataset. Popular ML algorithms include ridge regression [19], logistic regression [20], SVM [13], and neural network [16].

Training dataset. Suppose the learner is given n instances X = {x_i}_{i=1}^n ∈ R^{m×n}. For each instance x_i, we have a label y_i, where i = 1, 2, ..., n. y_i takes a continuous value for regression problems and a categorical value for classification problems. For convenience, we denote y = {y_i}_{i=1}^n. X and y form the training dataset.

Objective function. Many ML algorithms determine the model parameters via minimizing a certain objective function over the training dataset. An objective function often has the following forms:

Non-kernel algorithms: L(w) = L(X, y, w) + λR(w)
Kernel algorithms: L(w) = L(φ(X), y, w) + λR(w),

where L is called the loss function, R is called the regularization term, φ is a kernel mapping function (i.e., φ(X) = {φ(x_i)}_{i=1}^n), and λ > 0 is called the hyperparameter, which is used to balance between the loss function and the regularization term. Non-kernel algorithms include linear algorithms and nonlinear neural network algorithms. In ML theory, the regularization term is used to prevent overfitting. Popular regularization terms include L1 regularization (i.e., R(w) = ||w||_1 = Σ_i |w_i|) and L2 regularization (i.e., R(w) = ||w||_2^2 = Σ_i w_i^2). We note that some ML algorithms use more than one regularization term and thus have multiple hyperparameters. Although we focus on ML algorithms with one hyperparameter in the main text of this paper for conciseness, our attacks are applicable to more than one hyperparameter and we show an example in Appendix B.

An ML algorithm minimizes the above objective function for a given training dataset and a given hyperparameter to get the model parameters w, i.e., w = argmin L(w). The learnt model parameters are a minimum of the objective function; w is a minimum if the objective function has larger values at the points near w. Different ML algorithms adopt different loss functions and different regularization terms. For instance, ridge regression uses the least-square loss function and the L2 regularization term. Table I lists the loss function and regularization term used by the popular ML algorithms that we consider in this work. As we will demonstrate, these different loss functions and regularization terms have different security properties against our hyperparameter stealing attacks.

TABLE I: Loss functions and regularization terms of various ML algorithms we study in this paper.

Category            | ML Algorithm   | Loss Function      | Regularization
--------------------|----------------|--------------------|---------------
Regression          | RR             | Least Square       | L2
Regression          | LASSO          | Least Square       | L1
Regression          | ENet           | Least Square       | L2 + L1
Regression          | KRR            | Least Square       | L2
Logistic Regression | L2-LR          | Cross Entropy      | L2
Logistic Regression | L1-LR          | Cross Entropy      | L1
Logistic Regression | L2-KLR         | Cross Entropy      | L2
Logistic Regression | L1-KLR         | Cross Entropy      | L1
SVM                 | SVM-RHL        | Regular Hinge Loss | L2
SVM                 | SVM-SHL        | Square Hinge Loss  | L2
SVM                 | KSVM-RHL       | Regular Hinge Loss | L2
SVM                 | KSVM-SHL       | Square Hinge Loss  | L2
Neural Network      | Regression     | Least Square       | L2
Neural Network      | Classification | Cross Entropy      | L2

For kernel algorithms, the model parameters are in the form w = Σ_{i=1}^n α_i φ(x_i). In other words, the model parameters are a linear combination of the kernel mapping of the training instances. Equivalently, we can represent the model parameters using the parameters α = {α_i}_{i=1}^n for kernel algorithms.

Learning hyperparameters via cross-validation: The hyperparameter is a key parameter in a machine learning system; a good hyperparameter makes it possible to learn a model that has good generalization performance on the testing dataset. In practice, hyperparameters are often determined via cross-validation [21]. A popular cross-validation method is called K-fold cross-validation. Specifically, we can divide the training dataset into K folds. Suppose we are given a hyperparameter. For each fold, we learn the model parameters using the remaining K−1 folds as a training dataset and test the model performance on the fold. Then, we average the performance over the K folds. The hyperparameter is determined in a search process such that the average performance in the cross-validation is maximized. Learning hyperparameters is much more computationally expensive than learning model parameters with a given hyperparameter because the former involves many trials of learning model parameters. Figure 1 illustrates the process to learn hyperparameters and model parameters.

[Figure] Fig. 1: Key concepts of a machine learning system (training dataset → algorithm for learning hyperparameters → hyperparameters → algorithm for learning model parameters → model parameters).
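As a concrete illustration of the cross-validation process described above, the following is a minimal sketch of K-fold hyperparameter search. The candidate grid, the use of ridge regression, and the MSE scoring are illustrative assumptions, not the proprietary search an MLaaS platform would use; X and y follow scikit-learn's (instances × features) layout.

```python
# Minimal sketch of K-fold cross-validation for selecting the hyperparameter lambda.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def select_hyperparameter(X, y, candidates, K=5):
    """Return the candidate lambda with the best average validation MSE."""
    kf = KFold(n_splits=K, shuffle=True, random_state=0)
    best_lam, best_mse = None, np.inf
    for lam in candidates:
        fold_mse = []
        for train_idx, val_idx in kf.split(X):
            model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
            fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
        avg = np.mean(fold_mse)          # average validation performance over the K folds
        if avg < best_mse:
            best_lam, best_mse = lam, avg
    return best_lam

# Example usage: lam = select_hyperparameter(X, y, candidates=10.0 ** np.arange(-3, 4))
```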

Testing performance of the decision function: We often use a testing dataset to measure the performance of the learnt model parameters. Suppose the testing dataset consists of {x_i^test}_{i=1}^{n_test}, whose labels are {y_i^test}_{i=1}^{n_test}, respectively. For regression, the performance is often measured by mean square error (MSE), which is defined as MSE = (1/n_test) Σ_{i=1}^{n_test} (y_i^test − f(x_i^test))^2. For classification, the performance is often measured by accuracy (ACC), which is defined as ACC = (1/n_test) Σ_{i=1}^{n_test} I(y_i^test = f(x_i^test)), where I is 1 if y_i^test = f(x_i^test) and 0 otherwise. A smaller MSE or a higher ACC implies better model parameters.
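The two metrics above translate directly into code; a small sketch (assuming true labels and predictions are given as NumPy arrays) is:

```python
# MSE and ACC as defined above, for NumPy arrays of true and predicted values.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def acc(y_true, y_pred):
    return np.mean(y_true == y_pred)
```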

B. Problem Definition

Threat model: We assume the attacker knows the training dataset, the ML algorithm, and (optionally) the learnt model parameters. Our threat model is motivated by machine-learning-as-a-service (MLaaS) [6], [1], [18], [25]. MLaaS is an emerging technology to aid users, who have limited computing power and machine learning expertise, to learn an ML model over large datasets. Specifically, a user uploads the training dataset to an MLaaS platform and specifies an ML algorithm. The MLaaS platform uses proprietary algorithms to learn the hyperparameters, then learns the model parameters, and finally certain MLaaS platforms (e.g., BigML [6]) allow the user to download the model parameters to use them locally. Attackers could be such users. We stress that when the model parameters are unknown, our attacks are still applicable, as we demonstrate in Section V-B3. Specifically, the attacker can first use model parameter stealing attacks [54] to learn them and then perform our attacks. We note that various MLaaS platforms–such as Amazon Machine Learning and Microsoft Azure Machine Learning–make the ML algorithm public. Moreover, for black-box MLaaS platforms such as Amazon Machine Learning and Microsoft Azure Machine Learning, prior model parameter stealing attacks [54] are applicable.

We define hyperparameter stealing attacks as follows:

Definition 1 (Hyperparameter Stealing Attacks): Suppose an ML algorithm learns model parameters via minimizing an objective function that is in the form of loss function + λ × regularization term. Given the ML algorithm, the training dataset, and (optionally) the learnt model parameters, hyperparameter stealing attacks aim to estimate the hyperparameter value in the objective function.

Application scenario: One application of our hyperparameter stealing attacks is that a user can use our attacks to learn a model via MLaaS with much less computation (and thus a much lower economic cost), while not sacrificing the model's testing performance. Specifically, the user can sample a small fraction of the training dataset, learn the model parameters through MLaaS, steal the hyperparameter using our attacks, and re-learn the model parameters via MLaaS using the entire training dataset and the stolen hyperparameter. We will demonstrate this application scenario in Section V-B via simulations and Amazon Machine Learning.

IV. HYPERPARAMETER STEALING ATTACKS

First, we introduce our general attack framework. Second, we use several regression and classification algorithms as examples to illustrate how we can use our framework to steal hyperparameters for specific ML algorithms; we show results for more algorithms in Appendix A.

A. Our Attack Framework

Our goal is to steal the hyperparameters in an objective function. For an ML algorithm that uses such hyperparameters, the learnt model parameters are often a minimum of the objective function (see the background knowledge in Section III-A). Therefore, the gradient of the objective function at the learnt model parameters should be 0, which encodes the relationships between the learnt model parameters and the hyperparameters. We leverage this key observation to steal hyperparameters.

Non-kernel algorithms: We compute the gradient of the objective function at the model parameters w and set it to be 0. Then, we have

∂L(w)/∂w = b + λa = 0,   (1)

where the vectors b and a are defined as follows:

b = [∂L(X,y,w)/∂w_1, ∂L(X,y,w)/∂w_2, ..., ∂L(X,y,w)/∂w_m]^T,
a = [∂R(w)/∂w_1, ∂R(w)/∂w_2, ..., ∂R(w)/∂w_m]^T.   (2)

First, Eqn. 1 is a system of linear equations about the hyperparameter λ. Second, in this system, the number of equations is larger than the number of unknown variables (i.e., the hyperparameter in our case). Such a system is called an overdetermined system in statistics and mathematics. We adopt the linear least square method [30], a popular method to find an approximate solution to an overdetermined system, to solve for the hyperparameter in Eqn. 1. More specifically, we estimate the hyperparameter as follows:

λ̂ = −(a^T a)^{−1} a^T b.   (3)
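Eqn. 3 is a one-line computation once a and b are available; a minimal sketch (the function name is ours, and a and b are assumed to be the gradient vectors defined in Eqn. 2) is:

```python
# Least-squares estimate of the hyperparameter from Eqn. 3:
# lambda_hat = -(a^T a)^{-1} a^T b.
import numpy as np

def estimate_lambda(a, b):
    """Estimate the hyperparameter from the gradient vectors a and b."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return -a.dot(b) / a.dot(a)
```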

Kernel algorithms: Recall that, for kernel algorithms, the model parameters are a linear combination of the kernel mapping of the training instances, i.e., w = Σ_{i=1}^n α_i φ(x_i). Therefore, the model parameters can be equivalently represented by the parameters α = {α_i}_{i=1}^n. We replace the variable w with Σ_{i=1}^n α_i φ(x_i) in the objective function, compute the gradient of the objective function with respect to α, and set the gradient to 0. Then, we will obtain an overdetermined system. After solving the system with the linear least square method, we again estimate the hyperparameter using Eqn. 3 with the vectors b and a re-defined as follows:

b = [∂L(φ(X),y,w)/∂α_1, ∂L(φ(X),y,w)/∂α_2, ..., ∂L(φ(X),y,w)/∂α_n]^T,
a = [∂R(w)/∂α_1, ∂R(w)/∂α_2, ..., ∂R(w)/∂α_n]^T.   (4)

Addressing non-differentiability: Using Eqn. 3 still faces two more challenges: 1) the objective function might not be differentiable at certain dimensions of the model parameters w (or α), and 2) the objective function might not be differentiable at certain training instances for the learnt model parameters. For instance, objective functions with L1 regularization are not differentiable at the dimensions where the learnt model parameters are 0, while the objective functions in SVMs might not be differentiable for certain training instances. We address the challenges via constructing the vectors a and b using the dimensions and training instances at which the objective function is differentiable. Note that using fewer dimensions of the model parameters is equivalent to using fewer equations in the overdetermined system shown in Eqn. 1. Once we have at least one dimension of the model parameters and one training instance at which the objective function is differentiable, we can estimate the hyperparameter.

Attack procedure: We summarize our hyperparameter stealing attacks in the following two steps:

• Step I. The attacker computes the vectors a and b for a given training dataset, a given ML algorithm, and the learnt model parameters.

• Step II. The attacker estimates the hyperparameter using Eqn. 3.

More than one hyperparameter: We note that, for conciseness, we focus on ML algorithms whose objective functions have a single hyperparameter in the main text of this paper. However, our attack framework is applicable and can be easily extended to ML algorithms that use more than one hyperparameter. Specifically, we can still estimate the hyperparameters using Eqn. 3 with the vector a expanded to be a matrix, where each column corresponds to the gradient of a regularization term with respect to the model parameters. We use an example ML algorithm, i.e., Elastic Net [58], with two hyperparameters to illustrate our attacks in Appendix B.
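A minimal sketch of this multi-hyperparameter extension follows; the function name is ours, and A is assumed to be the matrix whose k columns are the gradients of the k regularization terms, so that the stationarity condition reads b + Aλ = 0.

```python
# Least-squares estimate for several hyperparameters at once:
# solve b + A @ lam ≈ 0, i.e., lam = -(A^T A)^{-1} A^T b.
import numpy as np

def estimate_lambdas(A, b):
    A = np.asarray(A, dtype=float)           # shape (m, k): one column per regularization term
    b = np.asarray(b, dtype=float).ravel()   # shape (m,)
    lam, *_ = np.linalg.lstsq(A, -b, rcond=None)
    return lam                               # shape (k,): one estimate per hyperparameter
```

With k = 1 this reduces to Eqn. 3.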

Next, we use various popular regression and classification algorithms to illustrate our attacks. In particular, we will discuss how we can compute the vectors a and b. We will focus on linear and kernel ML algorithms for simplicity, and we will show results on neural networks in Appendix C. We note that the ML algorithms we study are widely deployed by MLaaS. For instance, logistic regression is deployed by Amazon Machine Learning, Microsoft Azure Machine Learning, BigML, etc.; SVM is employed by Microsoft Azure Machine Learning, Google Cloud Platform, and PredictionIO.

B. Attacks to Regression Algorithms

1) Linear Regression Algorithms: We demonstrate our attacks to popular linear regression algorithms including Ridge Regression (RR) [19] and LASSO [53]. Both algorithms use the least square loss function, and their regularization terms are L2 and L1, respectively. Due to limited space, we show attack details for RR; the details for LASSO are shown in Appendix A. The objective function of RR is L(w) = ||y − X^T w||_2^2 + λ||w||_2^2. We compute the gradient of the objective function with respect to w, and we have ∂L(w)/∂w = −2Xy + 2XX^T w + 2λw. By setting the gradient to be 0, we can estimate λ using Eqn. 3 with a = w and b = X(X^T w − y).
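As an end-to-end illustration of the RR attack, the sketch below fits ridge regression with a known λ via scikit-learn and then recovers it. The synthetic data are only for illustration; X_sk follows scikit-learn's (instances × features) layout, so the paper's X corresponds to X_sk.T, and the intercept is disabled to match the objective above.

```python
# Steal the ridge regression hyperparameter: a = w, b = X(X^T w - y).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_sk = rng.randn(200, 10)                              # instances x features
y = X_sk.dot(rng.randn(10)) + 0.1 * rng.randn(200)

true_lambda = 0.1
w = Ridge(alpha=true_lambda, fit_intercept=False).fit(X_sk, y).coef_

a = w
b = X_sk.T.dot(X_sk.dot(w) - y)                        # X (X^T w - y) in the paper's notation
lambda_hat = -a.dot(b) / a.dot(a)                      # Eqn. 3
print(abs(lambda_hat - true_lambda) / true_lambda)     # relative estimation error
```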

2) Kernel Regression Algorithms: We use kernel ridge regression (KRR) [55] as an example to illustrate our attacks. Similar to linear RR, KRR uses the least square loss function and L2 regularization. After we represent the model parameters w using α, the objective function of KRR is L(α) = ||y − Kα||_2^2 + λα^T Kα, where the matrix K = φ(X)^T φ(X), whose (i, j)th entry is φ(x_i)^T φ(x_j). In machine learning, K is called the gram matrix and is positive definite. By computing the gradient of the objective function with respect to α and setting it to be 0, we have K(λα + Kα − y) = 0. K is invertible as it is positive definite. Therefore, we multiply both sides of the above equation by K^{−1}. Then, we can estimate λ using Eqn. 3 with a = α and b = Kα − y. Our attacks are applicable to any kernel function. In our experiments, we will adopt the widely used Gaussian kernel.

C. Attacks to Classification Algorithms

1) Linear Classification Algorithms: We demonstrate our attacks to four popular linear classification algorithms: support vector machine with regular hinge loss function (SVM-RHL), support vector machine with squared hinge loss function (SVM-SHL), L1-regularized logistic regression (L1-LR), and L2-regularized logistic regression (L2-LR). These four algorithms enable us to compare different regularization terms and different loss functions. For simplicity, we show attack details for L1-LR and defer details for the other algorithms to Appendix A. L1-LR enables us to illustrate how we address the challenge where the objective function is not differentiable at certain dimensions of the model parameters.

We focus on binary classification, since multi-class classification is often transformed into multiple binary classification problems via the one-vs-all paradigm. However, our attacks are also applicable to multi-class classification algorithms such as multi-class support vector machine [11] and multi-class logistic regression [11] that use hyperparameters in their objective functions. For binary classification, each training instance has a label y_i ∈ {1, 0}.


The objective function of L1-LR is L(w) = L(X,y,w) + λ||w||_1, where L(X,y,w) = −Σ_{i=1}^n (y_i log h_w(x_i) + (1 − y_i) log(1 − h_w(x_i))) is called the cross entropy loss function and h_w(x) is defined to be 1/(1 + exp(−w^T x)). The gradient of the objective function is ∂L(w)/∂w = X(h_w(X) − y) + λ sign(w), where h_w(X) = [h_w(x_1); h_w(x_2); ...; h_w(x_n)] and the ith entry sign(w_i) of the vector sign(w) is defined as follows:

sign(w_i) = ∂|w_i|/∂w_i = −1 if w_i < 0; 0 if w_i = 0; 1 if w_i > 0.   (5)

|w_i| is not differentiable when w_i = 0, so we define the derivative at w_i = 0 as 0, which means that we do not use the model parameters that are 0 to estimate the hyperparameter. Via setting the gradient to be 0, we can estimate λ using Eqn. 3 with a = sign(w) and b = X(h_w(X) − y).
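A sketch of the L1-LR attack follows. It assumes scikit-learn's liblinear solver, whose L1-regularized objective is ||w||_1 + C × (cross-entropy), so the hyperparameter in our formulation corresponds to λ = 1/C; the synthetic data and the intercept-free setting are illustrative assumptions, and the recovered value is approximate because the solver is iterative.

```python
# Steal the L1-LR hyperparameter: a = sign(w) (zero dims drop out), b = X(h_w(X) - y).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_sk = rng.randn(500, 20)
y = (X_sk[:, 0] + X_sk[:, 1] > 0).astype(float)

true_lambda = 0.1
clf = LogisticRegression(penalty="l1", C=1.0 / true_lambda,
                         fit_intercept=False, solver="liblinear").fit(X_sk, y)
w = clf.coef_.ravel()

h = 1.0 / (1.0 + np.exp(-X_sk.dot(w)))     # h_w(X)
a = np.sign(w)                             # sign(0) = 0, so dimensions with w_i = 0 are ignored
b = X_sk.T.dot(h - y)
lambda_hat = -a.dot(b) / a.dot(a)          # Eqn. 3
print(abs(lambda_hat - true_lambda) / true_lambda)
```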

2) Kernel Classification Algorithms: We demonstrate our attacks to the kernel version of the above four linear classification algorithms: kernel support vector machine with regular hinge loss function (KSVM-RHL), kernel support vector machine with squared hinge loss function (KSVM-SHL), L1-regularized kernel LR (L1-KLR), and L2-regularized kernel LR (L2-KLR). We show attack details for KSVM-RHL and defer details for the other algorithms to Appendix A. KSVM-RHL enables us to illustrate how we can address the challenge where the objective function is non-differentiable for certain training instances. Again, we focus on binary classification.

The objective function of KSVM-RHL is L(α) = Σ_{i=1}^n L(φ(x_i), y_i, α) + λα^T Kα, where L(φ(x_i), y_i, α) = max(0, 1 − y_i α^T k_i) is called the regular hinge loss function. k_i is the ith column of the gram matrix K = φ(X)^T φ(X). The gradient of the loss function with respect to α is:

∂L(φ(x_i), y_i, α)/∂α = −y_i k_i if y_i α^T k_i < 1, and 0 if y_i α^T k_i > 1,   (6)

where L(φ(x_i), y_i, α) is non-differentiable when k_i satisfies y_i α^T k_i = 1. We estimate λ using the k_i that satisfy y_i α^T k_i < 1. Specifically, via setting the gradient of the objective function to be 0, we estimate λ using Eqn. 3 with a = 2Kα and b = Σ_{i=1}^n −y_i k_i 1_{y_i α^T k_i < 1}, where 1_{y_i α^T k_i < 1} is an indicator function with value 1 if y_i α^T k_i < 1 and 0 otherwise.
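The estimation step for KSVM-RHL can be sketched as follows. The function name is ours; it assumes the gram matrix K, labels y mapped to {−1, +1}, and the learnt coefficients α are given (however they were obtained, e.g., via a model parameter stealing attack), and it only illustrates the formulas above rather than a full training pipeline.

```python
# KSVM-RHL estimation: a = 2 K alpha, b = sum over {i : y_i alpha^T k_i < 1} of -y_i k_i, then Eqn. 3.
import numpy as np

def steal_lambda_ksvm_rhl(K, y, alpha):
    margins = y * K.dot(alpha)            # y_i * alpha^T k_i for each training instance i
    active = margins < 1                  # instances where the hinge loss is differentiable and nonzero
    a = 2.0 * K.dot(alpha)
    b = -K[:, active].dot(y[active])      # sum_i -y_i k_i over the active instances
    return -a.dot(b) / a.dot(a)
```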

V. EVALUATIONS

A. Theoretical Evaluations

We aim to evaluate the effectiveness of our hyperparameter stealing attacks theoretically. In particular, we show that 1) when the learnt model parameters are an exact minimum of the objective function, our attacks can obtain the exact hyperparameter value, and 2) when the model parameters deviate from their closest minimum of the objective function with a small difference, then the estimation error of our attacks is a linear function of the small difference. Specifically, our theoretical analysis can be summarized in the following theorems.

Theorem 1: Suppose an ML algorithm learns model parameters via minimizing an objective function which is in the form of loss function + λ × regularization term, λ is the true hyperparameter value, and the learnt model parameters w (or α for kernel algorithms) are an exact minimum of the objective function. Then, our attacks can obtain the exact true hyperparameter value, i.e., λ̂ = λ.

Proof: See Appendix D.

TABLE II: Datasets.

Dataset   | #Instances | #Features | Type
----------|------------|-----------|---------------
Diabetes  | 442        | 10        | Regression
GeoOrig   | 1059       | 68        | Regression
UJIIndoor | 19937      | 529       | Regression
Iris      | 100        | 4         | Classification
Madelon   | 4400       | 500       | Classification
Bank      | 45210      | 16        | Classification

Theorem 2: Suppose an ML algorithm learns model parameters via minimizing an objective function which is in the form of loss function + λ × regularization term, λ is the true hyperparameter value, the learnt model parameters are w (or α for kernel algorithms), and w* (or α*) is the minimum of the objective function that is closest to w (or α). We denote ∆w = w − w* and ∆α = α − α*. Then, when ∆w → 0 or ∆α → 0, the difference between the estimated hyperparameter λ̂ and the true hyperparameter can be bounded as follows:

Non-kernel algorithms: ∆λ = λ̂ − λ = ∆w^T ∇λ(w*) + O(||∆w||_2^2)   (7)
Kernel algorithms: ∆λ = λ̂ − λ = ∆α^T ∇λ(α*) + O(||∆α||_2^2),   (8)

where ∇λ(w*) is the gradient of λ at w* and ∇λ(α*) is the gradient of λ at α*.

Proof: See Appendix E.

B. Empirical Evaluations

1) Experimental Setup: We use several real-world datasets to evaluate the effectiveness of our hyperparameter stealing attacks on the machine learning algorithms we studied. We obtained these datasets from the UCI Machine Learning Repository,1 and their statistics are summarized in Table II. We note that our datasets have significantly different numbers of instances and features, which represent different application scenarios. We use each dataset as a training dataset.

Implementation: We use the scikit-learn package [40], which implements various machine learning algorithms, to learn model parameters. All experiments are conducted on a laptop with a 2.7GHz CPU and 8GB memory. We predefine a set of hyperparameters which span a wide range, i.e., 10^−3, 10^−2, 10^−1, 10^0, 10^1, 10^2, 10^3, in order to evaluate the effectiveness of our attacks for a wide range of hyperparameters. Note that λ > 0, so we do not explore negative values for λ. For each hyperparameter and for each learning algorithm, we learn the corresponding model parameters using the scikit-learn package. For kernel algorithms, we use the Gaussian kernel, where the parameter σ in the kernel is set to be 10. We implemented our attacks in Python.
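The per-hyperparameter evaluation loop can be sketched as follows; ridge regression is used as the example algorithm, the function name is ours, data loading is omitted, and X_sk follows scikit-learn's (instances × features) layout.

```python
# For each true hyperparameter in the grid: learn the model, run the attack,
# and record the relative estimation error.
import numpy as np
from sklearn.linear_model import Ridge

def evaluate_rr_attack(X_sk, y, grid=10.0 ** np.arange(-3, 4)):
    errors = {}
    for lam in grid:
        w = Ridge(alpha=lam, fit_intercept=False).fit(X_sk, y).coef_
        a, b = w, X_sk.T.dot(X_sk.dot(w) - y)
        lam_hat = -a.dot(b) / a.dot(a)                 # Eqn. 3
        errors[lam] = abs(lam_hat - lam) / lam         # relative estimation error (Eqn. 9)
    return errors
```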

Evaluation metric: We evaluate the effectiveness of our attacks using the relative estimation error, which is formally defined in Eqn. 9.

1https://archive.ics.uci.edu/ml/datasets.html


[Figure] Fig. 2: Effectiveness of our hyperparameter stealing attacks for regression algorithms (RR, LASSO, KRR). Panels: (a) Diabetes, (b) GeoOrigin, (c) UJIIndoor. x-axis: true hyperparameter value (log10); y-axis: relative estimation error (log10).

[Figure] Fig. 3: Effectiveness of our hyperparameter stealing attacks for logistic regression classification algorithms (L2-LR, L1-LR, L2-KLR, L1-KLR). Panels: (a) Iris, (b) Madelon, (c) Bank. x-axis: true hyperparameter value (log10); y-axis: relative estimation error (log10).

[Figure] Fig. 4: Effectiveness of our hyperparameter stealing attacks for SVM classification algorithms (SVM-RHL, SVM-SHL, KSVM-RHL, KSVM-SHL). Panels: (a) Iris, (b) Madelon, (c) Bank. x-axis: true hyperparameter value (log10); y-axis: relative estimation error (log10).

The relative estimation error is formally defined as follows:

ε = |λ̂ − λ| / λ,   (9)

where λ̂ and λ are the estimated hyperparameter and the true hyperparameter, respectively.

2) Experimental Results for Known Model Parameters: We first show results for the scenario where an attacker knows the training dataset, the learning algorithm, and the model parameters. Figure 2 shows the relative estimation errors for different regression algorithms on the regression datasets. Figure 3 shows the results for logistic regression algorithms on the classification datasets. Figure 4 shows the results for SVM algorithms on the classification datasets. Figure 5 shows the results for three-layer neural networks for regression and classification. In each figure, the x-axis represents the true hyperparameter in a particular algorithm, and the y-axis represents the relative estimation error of our attacks at stealing the hyperparameter. For better illustration, we set the relative estimation errors to be 10^−10 when they are smaller than 10^−10. Note that learning algorithms with L1 regularization require the hyperparameter to be smaller than a maximum value λ_max in order to learn meaningful model parameters (please refer to Appendix A for more details). Therefore, in the figures, the data points are missing for such algorithms when the hyperparameter gets larger than λ_max, which is different for different training datasets and algorithms. We did not show results on kernel LASSO because it is not widely used. Moreover, we did not find open-source implementations to learn model parameters in kernel LASSO, and implementing kernel LASSO is out of the scope of this work. However, our attacks are applicable to kernel LASSO.


[Figure] Fig. 5: Effectiveness of our hyperparameter stealing attacks for a) a three-layer neural network regression algorithm (Diabetes, GeoOrig, UJIIndoor) and b) a three-layer neural network classification algorithm (Iris, Madelon, Bank). x-axis: true hyperparameter value (log10); y-axis: relative estimation error (log10).

[Figure] Fig. 6: Effectiveness of our hyperparameter stealing attacks for RR when the model parameters deviate from the optimal ones. x-axis: ∆w; y-axis: ∆λ.

We have two key observations. First, our attacks can accurately estimate the hyperparameter for all learning algorithms we studied and for a wide range of hyperparameter values. Second, we observe that our attacks can more accurately estimate the hyperparameter for Ridge Regression (RR) and Kernel Ridge Regression (KRR) than for other learning algorithms. This is because RR and KRR have analytical solutions for model parameters, and thus the learnt model parameters are the exact minima of the objective functions. In contrast, the other learning algorithms we studied do not have analytical solutions for model parameters, and their learnt model parameters are relatively further away from the corresponding minima of the objective functions. Therefore, our attacks have larger estimation errors for these learning algorithms.

In practice, a learner may use approximate solutions for RR and KRR because computing the exact optimal solutions may be computationally expensive. We evaluate the impact of such approximate solutions on the accuracy of hyperparameter stealing attacks, and compare the results with those predicted by our Theorem 2. Specifically, we use the RR algorithm, adopt the Diabetes dataset, and set the true hyperparameter to be 1. We first compute the optimal model parameters for RR. Then, we modify a model parameter by ∆w and estimate the hyperparameter by our attack. Figure 6 shows the estimation error ∆λ as a function of ∆w (we show the absolute estimation error instead of the relative estimation error in order to compare the results with Theorem 2). We observe that when ∆w is very small, the estimation error ∆λ is a linear function of ∆w. As ∆w becomes larger, ∆λ increases quadratically with ∆w. Our observation is consistent with Theorem 2, which shows that the estimation error is linear in the difference between the learnt model parameters and the minimum closest to them when the difference is very small.

[Figure] Fig. 7: Effectiveness of our hyperparameter stealing attacks when model parameters are unknown but stolen by model parameter stealing attacks. (a) Regression algorithms (RR, LASSO, KRR) on Diabetes and (b) classification algorithms (L2-LR, L1-LR, SVM-RHL, SVM-SHL) on Iris. x-axis: true hyperparameter value (log10); y-axis: relative estimation error (log10).

3) Experimental Results for Unknown Model Parameters: Our hyperparameter stealing attacks are still applicable when the model parameters are unknown to an attacker, e.g., for black-box MLaaS platforms such as Amazon Machine Learning. Specifically, the attacker can first use the equation-solving-based model parameter stealing attacks proposed in [54] to learn the model parameters and then perform our hyperparameter stealing attacks. Our Theorem 2 bounds the estimation error of hyperparameters with respect to the difference between the stolen model parameters and the closest minimum of the objective function of the ML algorithm.

We also empirically evaluate the effectiveness of our attacks when model parameters are unknown. For instance, Figure 7 shows the relative estimation errors of hyperparameters for regression algorithms and classification algorithms when the model parameters are unknown but stolen by the model parameter stealing attacks [54]. For simplicity, we only show results on the Diabetes dataset for regression algorithms and on the Iris dataset for classification algorithms, but results on other datasets are similar. Note that LASSO requires the hyperparameter to be smaller than a certain threshold as we discussed above, and thus some data points are missing for LASSO. We find that we can still accurately steal the hyperparameters. The reason is that the model parameter stealing attacks can accurately steal the model parameters.

4) Summary: Via empirical evaluations, we have the following observations. First, our attacks can accurately estimate the hyperparameter for all ML algorithms we studied. Second, our attacks can more accurately estimate the hyperparameter for ML algorithms that have analytical solutions of the model parameters. Third, via combining with model parameter stealing attacks, our attacks can accurately estimate the hyperparameter even if the model parameters are unknown.

C. Implications for MLaaS

We show that a user can use our hyperparameter stealing attacks to learn an accurate model through a machine-learning-as-a-service (MLaaS) platform at a much lower cost. While different MLaaS platforms have different paradigms, we consider an MLaaS platform (e.g., Amazon Machine Learning [1], Microsoft Azure Machine Learning [25]) that charges a user according to the amount of computation that the MLaaS platform performed to learn the model, and supports two protocols for a user to learn a model. In the first protocol (denoted as Protocol I), the user uploads a training dataset to the MLaaS platform and specifies a learning algorithm; the MLaaS platform learns the hyperparameter using proprietary algorithms and learns the model parameters with the learnt hyperparameter; and then (optionally) the model parameters are sent back to the user. When the model parameters are not sent back to the user, the MLaaS is called black-box. The MLaaS platform (e.g., a black-box platform) does not share the learnt hyperparameter value with the user, considering intellectual property and algorithm confidentiality.

In Protocol I, learning the hyperparameter is often the most time-consuming and costly part, because it more or less involves cross-validation. In practice, some users might already have appropriate hyperparameters through domain knowledge. Therefore, the MLaaS platform provides a second protocol (denoted as Protocol II), in which the user uploads a training dataset to the MLaaS platform, defines a hyperparameter value, and specifies a learning algorithm, and then the MLaaS platform produces the model parameters for the given hyperparameter. Protocol II helps users learn models at lower economic costs when they already have good hyperparameters. We note that Amazon Machine Learning and Microsoft Azure Machine Learning support the two protocols.

1) Learning an Accurate Model with Less Costs: We demonstrate that a user can use our hyperparameter stealing attacks to learn a model through MLaaS at a much lower economic cost without sacrificing model performance. In particular, we assume the user does not yet have a good hyperparameter. We compare the following three methods to learn a model through MLaaS. By default, we assume the MLaaS shares the model parameters with the user. If not, the user can use model parameter stealing attacks [54] to steal them.

Method 1 (M1): The user leverages Protocol I supported by the MLaaS platform to learn the model. Specifically, the user uploads the training dataset to the MLaaS platform and specifies a learning algorithm. The MLaaS platform learns the hyperparameter and then learns the model parameters using the learnt hyperparameter. The user then downloads the model parameters.

Method 2 (M2): In order to save economic costs, the user samples p% of the training dataset uniformly at random and then uses Protocol I to learn a model over the sampled subset of the training dataset. We expect that this method is less computationally expensive than M1, but it may sacrifice performance of the learnt model.

Method 3 (M3): In this method, the user uses our hyperparameter stealing attacks. Specifically, the user first samples q% of the training dataset uniformly at random. Second, the user learns a model over the sampled training dataset through the MLaaS via Protocol I. We note that, for big data, even a very small fraction (e.g., 1%) of the training dataset could be too large for the user to process locally, so we consider that the user uses the MLaaS. Third, the user estimates the hyperparameter learnt by the MLaaS using our hyperparameter stealing attacks. Fourth, the user re-learns a model over the entire training dataset through the MLaaS via Protocol II. We call this strategy "Train-Steal-Retrain".
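A minimal sketch of the Train-Steal-Retrain strategy follows. The functions mlaas_train_protocol_i, mlaas_train_protocol_ii, and compute_a_b are hypothetical placeholders: the first two stand in for the two MLaaS protocols described above, and the third is the algorithm-specific step from Section IV (e.g., a = w and b = X(X^T w − y) for ridge regression).

```python
# Sketch of M3 ("Train-Steal-Retrain") with hypothetical MLaaS client functions.
import numpy as np

def train_steal_retrain(X_sk, y, q, compute_a_b,
                        mlaas_train_protocol_i, mlaas_train_protocol_ii):
    n = X_sk.shape[0]
    idx = np.random.choice(n, size=int(q / 100.0 * n), replace=False)  # sample q% of the training set
    w_small = mlaas_train_protocol_i(X_sk[idx], y[idx])   # MLaaS learns hyperparameter + model parameters
    a, b = compute_a_b(X_sk[idx], y[idx], w_small)        # Step I of our attack
    lambda_hat = -a.dot(b) / a.dot(a)                     # Step II (Eqn. 3)
    return mlaas_train_protocol_ii(X_sk, y, lambda_hat)   # re-learn on the full dataset via Protocol II
```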

2) Comparing the Three Methods Empirically: We first show simulation results of the three methods. For these simulation results, we assume model parameters are known to the user. In the next subsection, we compare the three methods on Amazon Machine Learning, a real-world MLaaS platform.

Setup: For each dataset in Table II, we randomly split it into two halves, which are used as the training dataset and the testing dataset, respectively. We consider that the MLaaS learns the hyperparameter through 5-fold cross-validation on the training dataset. We measure the performance of the learnt model through mean square error (MSE) (for regression models) or accuracy (ACC) (for classification models). MSE and ACC are formally defined in Section III-A. Specifically, we use M1 as a baseline; then we measure the relative MSE (or ACC) error of M2 and M3 over M1. For example, the relative MSE error of M3 is defined as |MSE_M3 − MSE_M1| / MSE_M1. Moreover, we also measure the speedup of M2 and M3 over M1 with respect to the overall amount of computation required to learn the model. Note that we also include the computation required to steal the hyperparameter for M3.

M3 vs. M1: Figure 8 compares M3 with M1 with respect to model performance (measured by the relative performance of M3 over M1) and speedup as we sample a larger fraction of the training dataset (i.e., as q gets larger), where the regression algorithm is RR and the classification algorithm is SVM-SHL. Other learning algorithms and datasets have similar results, so we omit them for conciseness.

We observe that M3 can learn a model that is as accurate as the model learnt by M1, while saving a significant amount of computation. Specifically, for RR on the dataset UJIndoorLoc, when we sample 3% of the training dataset, M3 learns a model that has almost 0 relative MSE error over M1, but M3 is around 8 times faster than M1. This means that the user can learn an accurate model using M3 at a much lower cost when the MLaaS platform charges the user according to the amount of computation. For the SVM-SHL algorithm on the Bank dataset, M3 can learn a model that has almost 0 relative ACC error over M1 and is around 15 times faster than M1, when we sample 1% of the training dataset. The reason why M3 and M1 can learn models with similar performance is that learning the hyperparameter using a subset of the training dataset changes it only slightly, and the learning algorithms are relatively robust to small variations of the hyperparameter.

Moreover, we observe that the speedup of M3 over M1 is more significant when the training dataset becomes larger. Figure 9 shows the speedup of M3 over M1 on binary-class training datasets with different sizes, where each class is synthesized via a Gaussian distribution with 10 dimensions. The entries of the mean vectors of the two Gaussian distributions are all 1's and all -1's, respectively, and the entries of the covariance matrix are generated from the standard Gaussian distribution (see the sketch below). We select the parameter q% in M3 such that the relative ACC error is smaller than 0.1%, i.e., M3 learns a model as accurately as M1. The speedup of M3 over M1 is more significant as the training dataset gets larger. This is because the process of learning the hyperparameter has a computational complexity that is higher than linear. M1 learns the hyperparameter over the entire training dataset, while M3 learns it on a sampled subset of the training dataset.
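For reference, a minimal sketch of how such a synthetic binary-class dataset can be generated is shown below. Since entries drawn from the standard Gaussian do not directly form a valid covariance matrix, we symmetrize them via A Aᵀ, which is our assumption, and the dataset size is illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
d = 10
n_per_class = 10 ** 4           # Figure 9 varies the total size from 10^3 to 10^7

A = rng.randn(d, d)             # covariance entries from the standard Gaussian
cov = A @ A.T                   # made positive semi-definite (our assumption)

X_pos = rng.multivariate_normal(mean=np.ones(d), cov=cov, size=n_per_class)
X_neg = rng.multivariate_normal(mean=-np.ones(d), cov=cov, size=n_per_class)
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
```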



Fig. 8: M3 vs. M1. (a) Relative MSE error and speedup of M3 over M1 for RR on the dataset UJIndoorLoc. (b) Relative ACC error and speedup of M3 over M1 for SVM-SHL on the dataset Bank.


Fig. 9: Speedup of M3 over M1 for SVM-SHL as the training dataset size gets larger.

As a result, the speedup is more significant for larger training datasets. This implies that a user can benefit more by using M3 when the user has a larger training dataset, which is often the case in the era of big data.

M3 vs. M2: Figure 10 compares M3 with M2 with respect to their relative performance over M1 as we sample more of the training dataset for M2 (i.e., as we increase p). For M3, we set q% such that the relative MSE (or ACC) error of M3 over M1 is smaller than 0.1%. In particular, q% = 3% and q% = 1% for RR on the UJIndoorLoc dataset and SVM-SHL on the Bank dataset, respectively. We observe that when M3 and M2 achieve the same speedup over M1, the model learnt by M3 is more accurate than that learnt by M2. For instance, for RR on the UJIndoorLoc dataset, M2 has the same speedup as M3 when sampling 10% of the training dataset, but M2 has around 4% relative MSE error while M3's relative MSE error is almost 0. For SVM-SHL on the Bank dataset, M2 has the same speedup as M3 when sampling 4% to 5% of the training dataset, but M2's relative ACC error is much larger than M3's.

The reason is that M2 learns both the hyperparameter and the model parameters using a subset of the training dataset. According to Figure 1, the unrepresentativeness of the subset is "doubled" because 1) it directly influences the model parameters, and 2) it influences the hyperparameter, through which it indirectly influences the model parameters. In contrast, in M3, such unrepresentativeness only influences the hyperparameter, and the learning algorithms are relatively robust to small variations of the hyperparameter.


Fig. 10: M3 vs. M2. (a) Relative MSE error and speedup of M3 and M2 over M1 for RR on the dataset UJIndoorLoc. (b) Relative ACC error and speedup of M3 and M2 over M1 for SVM-SHL on the dataset Bank.

3) Attacking Amazon Machine Learning: We also evaluate the three methods using Amazon Machine Learning [1]. Amazon Machine Learning is a black-box MLaaS platform, i.e., it discloses neither model parameters nor hyperparameters to users. However, the ML algorithm is known to users; e.g., the default algorithm is logistic regression. In our experiments, we use Amazon Machine Learning to learn a logistic regression model (with L2 regularization) for the Bank dataset. We leverage the SigOpt API [46], a hyperparameter tuning service for Amazon Machine Learning, to learn the hyperparameter. We obtained a free API token from SigOpt.

We split the Bank dataset into two halves, one for training and the other for testing. For M2 and M3, we sampled 15% and 3% of the training dataset, respectively, i.e., p% = 15% and q% = 3% (we selected these settings such that M2 and M3 have around the same overall training costs). Since Amazon Machine Learning is black-box, we use the model parameter stealing attack [54] to steal model parameters in our M3. Specifically, in M3, we first used 3% of the training dataset to learn a logistic regression model. Amazon discloses the prediction API of the learnt model. Second, we queried the prediction API for 200 testing examples and used the equation-solving-based attack [54] to steal the model parameters. Third, we used our hyperparameter stealing attack to estimate the hyperparameter. Fourth, we used the entire training dataset and the stolen hyperparameter to re-train a logistic regression model. We also evaluated the accuracy of the three models learnt by the three methods on the testing data via their prediction APIs.

The overall training costs for M1, M2, and M3 (including the cost of querying the prediction API for stealing model parameters) are $1.02, $0.15, and $0.16, respectively. The cost per query of the prediction API is $0.0001. The relative ACC error of M2 over M1 is 5.1%, while the relative ACC error of M3 over M1 is 0.92%. Therefore, compared to M1, M3 saves training costs significantly with little accuracy loss. When M2 and M3 have around the same training costs, M3 is much more accurate than M2.

4) Summary: Through empirical evaluations, we have the following key observations. First, M3 (i.e., the Train-Steal-Retrain strategy) can learn a model that is as accurate as that learnt by M1 with much lower computational cost. This implies that, for the considered MLaaS platforms, a user can use our attacks to learn an accurate model while saving a large amount of economic cost. Second, M3 has a bigger speedup over M1 when the training dataset is larger. Third, M3 is more accurate than M2 when they have the same speedup over M1.

VI. ROUNDING AS A DEFENSE

According to our Theorem 2, the estimation error of the hyperparameter is linear in the difference between the learnt model parameters and the minimum of the objective function that is closest to them. This theorem implies that we could defend against our hyperparameter stealing attacks by increasing this difference. Therefore, we propose that the learner rounds the learnt model parameters before sharing them with the end user. For instance, suppose a model parameter is 0.8675342; rounding it to one decimal and two decimals results in 0.9 and 0.87, respectively. We note that this rounding technique was also used by Fredrikson et al. [14] and Tramer et al. [54] to obfuscate confidence scores of model predictions to defend against model inversion attacks and model stealing attacks, respectively.

Next, we perform experiments to empirically evaluate the effectiveness of the rounding technique at defending against our hyperparameter stealing attacks.

A. Evaluations

1) Setup: We use the datasets listed in Table II. Specifically, for each dataset, we first randomly split the dataset into a training dataset and a testing dataset of equal size. Second, for each ML algorithm we consider, we learn a hyperparameter using the training dataset via 5-fold cross-validation, and learn the model parameters via the learnt hyperparameter and the training dataset. Third, we round each model parameter to a certain number of decimals (we explored from 1 decimal to 5 decimals). Fourth, we estimate the hyperparameter using the rounded model parameters.
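The following is a minimal sketch of this pipeline for RR on synthetic data (an illustrative stand-in for the datasets in Table II; the hyperparameter, which would be learnt by cross-validation, is fixed here for clarity).

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
y = X @ rng.randn(10) + 0.1 * rng.randn(1000)

lam = 0.5                                                  # assumed learnt hyperparameter
w = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)   # exact RR minimum

def steal_lambda_rr(X, y, w):
    # Eqn. 3 with a = 2w and b = -2 X^T y + 2 X^T X w.
    a = 2.0 * w
    b = -2.0 * X.T @ y + 2.0 * X.T @ (X @ w)
    return -(a @ b) / (a @ a)

for k in range(1, 6):                      # round to 1, 2, ..., 5 decimals
    w_rounded = np.round(w, decimals=k)    # what the learner would share
    lam_hat = steal_lambda_rr(X, y, w_rounded)
    print(k, abs(lam_hat - lam) / lam)     # relative estimation error (Eqn. 9)
```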

Evaluation metrics: Similar to evaluating the effectiveness of our attacks, the first metric we adopt is the relative estimation error of the hyperparameter value, which is formally defined in Eqn. 9. We say rounding is an effective defense for an ML algorithm if rounding makes the relative estimation error larger. Moreover, we say one ML algorithm can more effectively defend against our attacks than another ML algorithm using rounding if the relative estimation error of the former algorithm increases more than that of the latter.

However, the relative estimation error alone is insufficient because it only measures security, while ignoring the testing performance of the rounded model parameters. Specifically, severely rounding the model parameters could make the ML algorithm secure against our hyperparameter stealing attacks, but the testing performance of the rounded model parameters might also be affected significantly. Therefore, we also consider a metric to measure the testing-performance loss that results from rounding the model parameters. In particular, suppose the unrounded model parameters have a testing MSE (or ACC for classification algorithms), and the rounded model parameters have a testing MSE_r (or ACC_r) on the same testing dataset. Then, we define the relative MSE error and relative ACC error as |MSE − MSE_r| / MSE and |ACC − ACC_r| / ACC, respectively. Note that the relative MSE error and the relative ACC error used in this section are different from those used in Section V-C. A larger relative estimation error and a smaller relative MSE (or ACC) error indicate a better defense strategy.

2) Results: Figures 11, 12, 13, and 14 illustrate defense results for regression, logistic regression, SVM, and three-layer neural network algorithms, respectively. Since we use a log scale in the figures, we set the relative MSE (or ACC) errors to be 10^-10 when they are 0.

Rounding is not effective enough for certain ML algorithms: Rounding has a small impact on the testing performance of the models. For instance, when we keep one decimal, all ML algorithms have relative MSE (or ACC) errors smaller than 2%. Moreover, rounding model parameters increases the relative estimation errors of our attacks for all ML algorithms. However, for certain ML algorithms, the relative estimation errors are still very small even when the model parameters are rounded significantly, implying that our attacks remain effective. For instance, for LASSO, our attacks have relative estimation errors that are consistently smaller than around 10^-3 across the datasets, even if we round the model parameters to one decimal. These results highlight the need for new countermeasures for certain ML algorithms.

Comparing regularization terms: L2 regularization is more effective than L1 regularization: Different ML algorithms could use different regularization terms, so one natural question is which regularization term can more effectively defend against our attacks using rounding. All the SVM classification algorithms that we studied use an L2 regularization term. Therefore, we use the results on regression algorithms and logistic regression classification algorithms (i.e., Figure 11 and Figure 12) to compare regularization terms. In particular, we use three pairs: RR vs. LASSO, L2-LR vs. L1-LR, and L2-KLR vs. L1-KLR. The two algorithms in each pair have the same loss function and use L2 and L1 regularization, respectively.

We observe that, with rounding, L2 regularization can more effectively defend against our attacks than L1 regularization. Specifically, the relative estimation errors of RR (or L2-LR or L2-KLR) increase faster than those of LASSO (or L1-LR or L1-KLR) as we round the model parameters to fewer decimals. For instance, when we round the model parameters to one decimal, the relative estimation errors increase by 10^11 and 10^2 for RR and LASSO on the Diabetes dataset, respectively, compared to those without rounding.

These observations are consistent with our Theorem 2. In particular, Appendix F shows our approximations to the gradient ∇λ(w*) in Theorem 2 for RR, LASSO, L2-LR, L2-KLR, L1-LR, and L1-KLR. For an algorithm with L2 regularization, the magnitude of the gradient at the exact model parameters is inversely proportional to the L2 norm of the model parameters. However, if the algorithm has L1 regularization, the magnitude of the gradient is inversely proportional to the L2 norm of the sign of the model parameters. For algorithms with L2 regularization, the learnt model parameters are often small numbers, and thus the L2 norm of the model parameters is smaller than that of the sign of the model parameters. As a result, the magnitude of the gradient for an algorithm with L2 regularization is larger than that for an algorithm with L1 regularization. Therefore, according to Theorem 2, when we round model parameters to fewer decimals, the estimation errors of an algorithm with L2 regularization increase more than those of an algorithm with L1 regularization.
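The following toy check illustrates the norm comparison above with hypothetical small model parameters.

```python
import numpy as np

w = np.array([0.03, -0.12, 0.07, 0.002, -0.41])   # hypothetical small parameters
print(np.linalg.norm(w))            # ~0.43: L2 norm of the parameters
print(np.linalg.norm(np.sign(w)))   # ~2.24: L2 norm of their signs
# Since ||w||_2 < ||sign(w)||_2, the gradient magnitude in Theorem 2 is larger
# under L2 regularization, so rounding perturbs its estimate more.
```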



Fig. 11: Defense results of the rounding technique for regression algorithms (RR, LASSO, and KRR). (a) Diabetes. (b) GeoOrigin. (c) UJIIndoor.


Fig. 12: Defense results of the rounding technique for logistic regression classification algorithms (L2-LR, L1-LR, L2-KLR, and L1-KLR). (a) Iris. (b) Madelon. (c) Bank.


Fig. 13: Defense results of the rounding technique for SVM classification algorithms (SVM-RHL, SVM-SHL, KSVM-RHL, and KSVM-SHL). (a) Iris. (b) Madelon. (c) Bank.


Fig. 14: Defense results of the rounding technique for (a) the neural network regression algorithm (Diabetes, GeoOrig, and UJIIndoor) and (b) the neural network classification algorithm (Iris, Madelon, and Bank).


Comparing loss functions: cross entropy and square hinge loss can more effectively defend against our attacks than regular hinge loss: We also compare the defense effectiveness of different loss functions. Since all regression algorithms we studied have the same loss function, we use classification algorithms to compare loss functions. Specifically, we use two triples: (L2-LR, SVM-SHL, SVM-RHL) and (L2-KLR, KSVM-SHL, KSVM-RHL). The three algorithms in each triple use the cross entropy loss, square hinge loss, and regular hinge loss, respectively, while all using L2 regularization.

We find that cross entropy and square hinge loss have similar defense effectiveness against our attacks, and both can more effectively defend against our attacks than regular hinge loss when rounding is used.



Fig. 15: (a) Effectiveness of the rounding technique for different loss functions (L2-LR, SVM-SHL, and SVM-RHL) on the dataset Madelon. (b) Relative ACC error of M3 over M1 for SVM-SHL on the dataset Bank, with and without rounding.

For instance, Figure 15a compares the relative estimation errors of the triple (L2-LR, SVM-SHL, SVM-RHL) on the dataset Madelon when we use rounding. The relative estimation errors of L2-LR and SVM-SHL increase at a similar speed, but both increase faster than those of SVM-RHL as we round the model parameters to fewer decimals. For instance, when we round the model parameters to one decimal, the relative estimation errors increase by 10^5, 10^6, and 10^2 for L2-LR, SVM-SHL, and SVM-RHL on the dataset Madelon, respectively, compared to those without rounding.

B. Implications for MLaaS

Recall that, in Section V-C, we demonstrated that a user can use M3, i.e., the Train-Steal-Retrain strategy, to learn a model through an MLaaS platform at a much lower cost without sacrificing the model's testing performance. We aim to study whether M3 is still effective if the MLaaS platform rounds the model parameters. We follow the same experimental setup as in Section V-C, except that the MLaaS platform rounds the model parameters to one decimal before sharing them with the user. Figure 15b compares M3 with M1 with respect to the relative ACC error of M3 over M1. Note that the speedups of M3 over M1 are the same as those in Figure 8, so we do not show them again. We observe that M3 can still save substantial costs, although rounding reduces the savings. Specifically, when we sample 10% of the training dataset, the relative ACC error of M3 is less than around 0.1% in Figure 15b, while M3 is 6 times faster than M1 (see Figure 8).

C. Summary

Through empirical evaluations, we have the following observations. First, rounding model parameters is not effective enough to prevent our attacks for certain ML algorithms. Second, L2 regularization can more effectively defend against our attacks than L1 regularization. Third, cross entropy and square hinge loss have similar defense effectiveness, and both can more effectively defend against our attacks than regular hinge loss. Fourth, the Train-Steal-Retrain strategy can still save substantial costs when the MLaaS platform adopts rounding.

VII. DISCUSSIONS AND LIMITATIONS

Assumptions for our Train-Steal-Retrain strategy: A user of an MLaaS platform can benefit from our Train-Steal-Retrain strategy when the following assumptions hold: 1) the hyperparameters can be accurately learnt using a small fraction of the training dataset; 2) the user does not have enough computational resources or ML expertise to learn the hyperparameters locally; and 3) training both the hyperparameters and the model parameters using a small fraction of the training dataset does not lead to an accurate model. The validity of the first and third assumptions is data-dependent. We note that Train-Steal-Retrain requires ML expertise, but an attacker can develop it as a service for non-expert users.

ML algorithm is unknown: When the ML algorithm is unknown, the problem becomes jointly stealing the ML algorithm and the hyperparameters. Our current attack is defeated in this scenario. In fact, jointly stealing the ML algorithm and the hyperparameters may be impossible in some cases. For instance, suppose logistic regression with a hyperparameter A produces a model MA, and, on the same training dataset, SVM with a hyperparameter B produces a model MB. If the model parameters MA and MB are the same, we cannot distinguish between logistic regression with hyperparameter A and SVM with hyperparameter B. It is interesting future work to study jointly stealing the ML algorithm and hyperparameters, e.g., to show when it is possible and when it is not.

Other types of hyperparameters: As a first step towards stealing hyperparameters in machine learning, our work is limited to stealing the hyperparameters that are used to balance between the loss function and the regularization terms in an objective function. Many ML algorithms (please refer to Table I) rely on such hyperparameters. We note that some ML algorithms use other types of hyperparameters. For instance, K is a hyperparameter for K Nearest Neighbor; the number of trees is a hyperparameter for random forest; and architecture, dropout rate, learning rate, and mini-batch size are important hyperparameters for deep convolutional neural networks. Modern deep convolutional neural networks use dropout [49] instead of conventional L1/L2-norm regularization. We believe it is interesting future work to study stealing these hyperparameters.

Other countermeasures: It is interesting future work to explore defenses other than rounding model parameters. For instance, like differentially private ML algorithms [12], we could add noise to the objective function.

VIII. CONCLUSION AND FUTURE WORK

We demonstrate that various ML algorithms are vulnerable to hyperparameter stealing attacks. Our attacks encode the relationships between the hyperparameters, the model parameters, and the training dataset into a system of linear equations, which is derived by setting the gradient of the objective function to 0. Via both theoretical and empirical evaluations, we show that our attacks can accurately steal hyperparameters. Moreover, we find that rounding model parameters can increase the estimation errors of our attacks with negligible impact on the testing performance of the model. However, for certain ML algorithms, our attacks still achieve very small estimation errors, highlighting the need for new countermeasures. Future work includes studying the security of other types of hyperparameters and new countermeasures.

Acknowledgement: We thank the anonymous reviewers for their constructive comments. We also thank SigOpt for sharing a free API token.


REFERENCES

[1] Amazon ML Services. (2017, May). [Online]. Available: https://aws.amazon.com/cn/machine-learning
[2] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar, "Can machine learning be secure?" in ACM ASIACCS, 2006.
[3] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Srndic, P. Laskov, G. Giacinto, and F. Roli, "Evasion attacks against machine learning at test time," in ECML-PKDD. Springer, 2013.
[4] B. Biggio, L. Didaci, G. Fumera, and F. Roli, "Poisoning attacks to compromise face templates," in IEEE ICB, 2013.
[5] B. Biggio, B. Nelson, and P. Laskov, "Poisoning attacks against support vector machines," in ICML, 2012.
[6] BigML. (2017, May). [Online]. Available: https://www.bigml.com
[7] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[8] X. Cao and N. Z. Gong, "Mitigating evasion attacks to deep neural networks via region-based classification," in ACSAC, 2017.
[9] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, "Hidden voice commands," in USENIX Security Symposium, 2016.
[10] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE S & P, 2017.
[11] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM TIST, 2011.
[12] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, "Differentially private empirical risk minimization," JMLR, pp. 1069-1109, 2011.
[13] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, 1995.
[14] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in ACM CCS, 2015.
[15] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart, "Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing," in USENIX Security Symposium, 2014.
[16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[17] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv, 2014.
[18] Google Cloud Platform. (2017, May). [Online]. Available: https://cloud.google.com/prediction
[19] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, 1970.
[20] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant, Applied Logistic Regression. John Wiley & Sons, 2013.
[21] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., "A practical guide to support vector classification," Preprint, 2003.
[22] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar, "Adversarial machine learning," in ACM AISec, 2011.
[23] L. Ke, B. Li, and Y. Vorobeychik, "Behavioral experiments in email filter evasion," in AAAI, 2016.
[24] M. Kloft and P. Laskov, "Online anomaly detection under adversarial impact," in AISTATS, 2010.
[25] Microsoft Azure Machine Learning. (2017, May). [Online]. Available: https://azure.microsoft.com/sevices/machine-learning
[26] B. Li and Y. Vorobeychik, "Feature cross-substitution in adversarial classification," in NIPS, 2014.
[27] B. Li, Y. Wang, A. Singh, and Y. Vorobeychik, "Data poisoning attacks on factorization-based collaborative filtering," in NIPS, 2016.
[28] Y. Liu, X. Chen, C. Liu, and D. Song, "Delving into transferable adversarial examples and black-box attacks," in ICLR, 2017.
[29] D. Lowd and C. Meek, "Adversarial learning," in ACM SIGKDD, 2005.
[30] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis. John Wiley & Sons, 2015.
[31] B. Nelson, M. Barreno, F. J. Chi, A. D. Joseph, B. I. P. Rubinstein, U. Saini, C. Sutton, J. D. Tygar, and K. Xia, "Exploiting machine learning to subvert your spam filter," in LEET, 2008.
[32] B. Nelson, M. Barreno, F. J. Chi, A. D. Joseph, B. I. P. Rubinstein, U. Saini, C. Sutton, J. D. Tygar, and K. Xia, "Misleading learners: Co-opting your spam filter," in Machine Learning in Cyber Trust: Security, Privacy, Reliability, 2009.
[33] B. Nelson, B. I. Rubinstein, L. Huang, A. D. Joseph, S.-h. Lau, S. J. Lee, S. Rao, A. Tran, and J. D. Tygar, "Near-optimal evasion of convex-inducing classifiers," in AISTATS, 2010.
[34] J. Newsome, B. Karp, and D. Song, "Polygraph: Automatically generating signatures for polymorphic worms," in IEEE S & P, 2005.
[35] J. Newsome, B. Karp, and D. Song, "Paragraph: Thwarting signature learning by training maliciously," in RAID Workshop. Springer, 2006.
[36] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, "Practical black-box attacks against machine learning," in AsiaCCS, 2017.
[37] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in EuroS&P, 2016.
[38] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman, "Towards the science of security and privacy in machine learning," arXiv, 2016.
[39] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in IEEE S & P, 2016.
[40] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," JMLR, 2011.
[41] R. Perdisci, D. Dagon, W. Lee, P. Fogla, and M. Sharif, "Misleading worm signature generators using deliberate noise injection," in IEEE S & P, 2006.
[42] A. Pyrgelis, C. Troncoso, and E. D. Cristofaro, "Knock knock, who's there? Membership inference on aggregate location data," in NDSS, 2018.
[43] B. I. Rubinstein, B. Nelson, L. Huang, A. D. Joseph, S.-h. Lau, S. Rao, N. Taft, and J. Tygar, "Antidote: Understanding and defending against poisoning of anomaly detectors," in ACM IMC, 2009.
[44] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in ACM CCS, 2016.
[45] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in IEEE S & P, 2017.
[46] SigOpt. (2017, December). [Online]. Available: https://sigopt.com/
[47] C. Smutz and A. Stavrou, "Malicious PDF detection using metadata and structural features," in ACSAC, 2012.
[48] C. Song, T. Ristenpart, and V. Shmatikov, "Machine learning models that remember too much," in CCS, 2017.
[49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," JMLR, 2014.
[50] N. Srndic and P. Laskov, "Detection of malicious PDF files based on hierarchical document structure," in NDSS, 2013.
[51] N. Srndic and P. Laskov, "Practical evasion of a learning-based classifier: A case study," in IEEE S & P, 2014.
[52] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv, 2013.
[53] R. Tibshirani, "Regression shrinkage and selection via the lasso," JRSSB, 1996.
[54] F. Tramer, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in USENIX Security Symposium, 2016.
[55] V. Vovk, "Kernel ridge regression," in Empirical Inference. Springer, 2013.
[56] W. Xu, Y. Qi, and D. Evans, "Automatically evading classifiers: A case study on PDF malware classifiers," in NDSS, 2016.
[57] G. Yang, N. Z. Gong, and Y. Cai, "Fake co-visitation injection attacks to recommender systems," in NDSS, 2017.
[58] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," JRSSB, 2005.



Fig. 16: Testing accuracy vs. hyperparameter (log10 scale) of classification algorithms (L2-LR, L1-LR, SVM-RHL, SVM-SHL, and NN) on Madelon in Table II.

APPENDIX A
ATTACKS TO OTHER LEARNING ALGORITHMS

LASSO: The objective function of LASSO is:

$$L(w) = \|y - X^T w\|_2^2 + \lambda \|w\|_1, \qquad (10)$$

whose gradient is:

$$\frac{\partial L(w)}{\partial w} = -2Xy + 2XX^T w + \lambda\,\mathrm{sign}(w),$$

where $|w_i|$ is not differentiable when $w_i = 0$, so we define the derivative at $w_i = 0$ as 0, which means that we do not use the model parameters that are 0 to estimate the hyperparameter. By setting the gradient to 0, we can estimate $\lambda$ using Eqn. 3 with $a = \mathrm{sign}(w)$ and $b = -2Xy + 2XX^T w$.

We note that if $\lambda \ge \lambda_{\max} = \|Xy\|_\infty$, then $w = 0$. In such cases, we cannot estimate the exact hyperparameter. However, in practice, $\lambda < \lambda_{\max}$ must hold in order to learn meaningful model parameters.
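A minimal sketch of this LASSO estimator is shown below, with scikit-learn producing the model parameters. Note that scikit-learn's Lasso minimizes $(1/2n)\|y - Xw\|_2^2 + \alpha\|w\|_1$, so the $\lambda$ in Eqn. 10 corresponds to $2n\alpha$; the data and the value of $\alpha$ are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, d = 1000, 20
X = rng.randn(n, d)
y = X @ rng.randn(d) + 0.1 * rng.randn(n)

alpha = 0.05
w = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y).coef_

nz = w != 0                                    # skip parameters that are exactly 0
a = np.sign(w[nz])
b = (-2.0 * X.T @ y + 2.0 * X.T @ (X @ w))[nz]
lam_hat = -(a @ b) / (a @ a)                   # Eqn. 3 via linear least squares

print(lam_hat, 2 * n * alpha)                  # the two should approximately match
```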

L2-regularized LR (L2-LR): Its objective function is:

$$L(w) = L(X, y, w) + \lambda \|w\|_2^2, \qquad (11)$$

where $L(X, y, w) = -\sum_{i=1}^n \big(y_i \log h_w(x_i) + (1 - y_i)\log(1 - h_w(x_i))\big)$ is called the cross entropy loss function and $h_w(x) = \frac{1}{1 + \exp(-w^T x)}$. The gradient of the objective function with respect to $w$ is:

$$\frac{\partial L(w)}{\partial w} = X\big(h_w(X) - y\big) + 2\lambda w,$$

where $h_w(X) = [h_w(x_1); h_w(x_2); \cdots; h_w(x_n)]$ is a vector. Via setting the gradient to 0, we can estimate $\lambda$ using Eqn. 3 with $a = 2w$ and $b = X(h_w(X) - y)$.
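A minimal sketch of the L2-LR estimator follows. To avoid any dependence on a particular solver's parameterization, we obtain the model parameters by minimizing Eqn. 11 directly with scipy on illustrative data (labels are 0/1, and the value of $\lambda$ is an assumption).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
n, d = 500, 10
X = rng.randn(n, d)
y = (X @ rng.randn(d) + 0.2 * rng.randn(n) > 0).astype(float)
lam = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w):
    h = sigmoid(X @ w)
    eps = 1e-12                                # numerical guard for the logs
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)) + lam * w @ w

w = minimize(objective, np.zeros(d), method="L-BFGS-B").x    # learnt parameters

# Eqn. 3 with a = 2w and b = X^T (h_w(X) - y).
a = 2.0 * w
b = X.T @ (sigmoid(X @ w) - y)
print(-(a @ b) / (a @ a), lam)                 # stolen lambda vs. true lambda
```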

SVM with regular hinge loss (SVM-RHL): The objective function of SVM-RHL is:

$$L(w) = \sum_{i=1}^n L(x_i, y_i, w) + \lambda \|w\|_2^2, \qquad (12)$$

where $L(x_i, y_i, w) = \max(0, 1 - y_i w^T x_i)$ is called the regular hinge loss function. The gradient with respect to $w$ is:

$$\frac{\partial L}{\partial w} = \begin{cases} -y_i x_i & \text{if } y_i w^T x_i < 1 \\ 0 & \text{if } y_i w^T x_i > 1, \end{cases}$$

where $L(x_i, y_i, w)$ is non-differentiable at the point where $y_i w^T x_i = 1$. We estimate $\lambda$ using only training instances $x_i$ that satisfy $y_i w^T x_i < 1$. Specifically, we estimate $\lambda$ using Eqn. 3 with $a = 2w$ and $b = \sum_{i=1}^n -y_i x_i \mathbf{1}_{y_i w^T x_i < 1}$, where $\mathbf{1}_{y_i w^T x_i < 1}$ is an indicator function with value 1 if $y_i w^T x_i < 1$ and 0 otherwise.

SVM with square hinge loss (SVM-SHL): The objective function of SVM-SHL is:

$$L(w) = \sum_{i=1}^n L(x_i, y_i, w) + \lambda \|w\|_2^2, \qquad (13)$$

where $L(x_i, y_i, w) = \max(0, 1 - y_i w^T x_i)^2$ is called the square hinge loss function. The gradient with respect to $w$ is:

$$\frac{\partial L}{\partial w} = \begin{cases} -2 y_i x_i (1 - y_i w^T x_i) & \text{if } y_i w^T x_i \le 1 \\ 0 & \text{if } y_i w^T x_i > 1. \end{cases}$$

Therefore, we estimate $\lambda$ using Eqn. 3 with $a = w$ and $b = \sum_{i=1}^n -y_i x_i (1 - y_i w^T x_i)\,\mathbf{1}_{y_i w^T x_i \le 1}$.
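Similarly, a minimal sketch of the SVM-SHL estimator is given below, again minimizing the objective in Eqn. 13 directly with scipy on illustrative data with ±1 labels.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
n, d = 500, 10
X = rng.randn(n, d)
y = np.sign(X @ rng.randn(d) + 0.2 * rng.randn(n))
lam = 0.5

def objective(w):
    slack = np.maximum(0.0, 1.0 - y * (X @ w))      # square hinge loss terms
    return np.sum(slack ** 2) + lam * w @ w

w = minimize(objective, np.zeros(d), method="L-BFGS-B").x

# Eqn. 3 with a = w and b = sum_i -y_i x_i (1 - y_i w^T x_i) over the instances
# with y_i w^T x_i <= 1.
slack = 1.0 - y * (X @ w)
active = slack >= 0.0
b = -(X[active] * (y[active] * slack[active])[:, None]).sum(axis=0)
a = w
print(-(a @ b) / (a @ a), lam)                       # stolen lambda vs. true lambda
```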

L1-regularized kernel LR (L1-KLR): Its objective function is:

$$L(\alpha) = L(X, y, \alpha) + \lambda \|K\alpha\|_1, \qquad (14)$$

where $L(X, y, \alpha) = -\sum_{i=1}^n \big(y_i \log h_\alpha(x_i) + (1 - y_i)\log(1 - h_\alpha(x_i))\big)$ and $h_\alpha(x) = \frac{1}{1 + \exp(-\sum_{j=1}^n \alpha_j \phi(x)^T \phi(x_j))}$. The gradient of the objective function with respect to $\alpha$ is:

$$\frac{\partial L(\alpha)}{\partial \alpha} = K\big(h_\alpha(X) - y + \lambda t\big),$$

where $h_\alpha(X) = [h_\alpha(x_1); h_\alpha(x_2); \cdots; h_\alpha(x_n)]$ and $t = \mathrm{sign}(K\alpha)$. Via setting the gradient to 0 and considering that $K$ is invertible, we can estimate $\lambda$ using Eqn. 3 with $a = t$ and $b = h_\alpha(X) - y$.

L2-regularized kernel LR (L2-KLR): Its objective function is:

$$L(\alpha) = L(X, y, \alpha) + \lambda \alpha^T K \alpha, \qquad (15)$$

where $L(X, y, \alpha) = -\sum_{i=1}^n \big(y_i \log h_\alpha(x_i) + (1 - y_i)\log(1 - h_\alpha(x_i))\big)$ is a cross entropy loss function and $h_\alpha(x) = \frac{1}{1 + \exp(-\sum_{j=1}^n \alpha_j \phi(x)^T \phi(x_j))}$. The gradient of the objective function with respect to $\alpha$ is:

$$\frac{\partial L(\alpha)}{\partial \alpha} = K\big(h_\alpha(X) - y + 2\lambda \alpha\big),$$

where $h_\alpha(X) = [h_\alpha(x_1); h_\alpha(x_2); \cdots; h_\alpha(x_n)]$. Via setting the gradient to 0 and considering that $K$ is invertible, we can estimate $\lambda$ using Eqn. 3 with $a = 2\alpha$ and $b = h_\alpha(X) - y$.

Kernel SVM with square hinge loss (KSVM-SHL): The objective function of KSVM-SHL is:

$$L(\alpha) = \sum_{i=1}^n L(x_i, y_i, \alpha) + \lambda \alpha^T K \alpha, \qquad (16)$$

where $L(x_i, y_i, \alpha) = \max(0, 1 - y_i \alpha^T k_i)^2$. Following the same methodology we used for SVM-SHL, we estimate $\lambda$ using Eqn. 3 with $a = K\alpha$ and $b = \sum_{i=1}^n -y_i k_i (1 - y_i \alpha^T k_i)\,\mathbf{1}_{y_i \alpha^T k_i \le 1}$.



Fig. 17: Attack and defense results of ENet on the regression datasets Diabetes, GeoOrig, and UJIIndoor. (a) Attack. (b) Defense (both λ1 and λ2 are estimated from the rounded model parameters).

APPENDIX B
MORE THAN ONE HYPERPARAMETER

We use a popular regression algorithm called Elastic Net (ENet) [58] as an example to illustrate how we can apply our attacks to learning algorithms with more than one hyperparameter. The objective function of ENet is:

$$L(w) = \|y - X^T w\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2, \qquad (17)$$

where the loss function is least square and the regularization term is a combination of L2 regularization and L1 regularization. We compute the gradient as follows:

$$\frac{\partial L(w)}{\partial w} = -2Xy + 2XX^T w + \lambda_1 \mathrm{sign}(w) + 2\lambda_2\, w \odot |\mathrm{sign}(w)|,$$

where $w \odot |\mathrm{sign}(w)| = [w_1 |\mathrm{sign}(w_1)|; \cdots; w_m |\mathrm{sign}(w_m)|]$. Similar to LASSO, we do not use the model parameters that are 0 to estimate the hyperparameters. Via setting the gradient to 0 and using linear least squares to solve the overdetermined system, we have:

$$\lambda = -\big(A^T A\big)^{-1} A^T b, \qquad (18)$$

where $\lambda = [\lambda_1; \lambda_2]$, $A = [\mathrm{sign}(w);\ 2 w \odot |\mathrm{sign}(w)|]$, and $b = -2Xy + 2XX^T w$.
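A minimal sketch of this two-hyperparameter estimation is shown below, with scikit-learn's ElasticNet producing the model parameters. scikit-learn minimizes $(1/2n)\|y - Xw\|_2^2 + \alpha\rho\|w\|_1 + \frac{\alpha(1-\rho)}{2}\|w\|_2^2$, so $\lambda_1 = 2n\alpha\rho$ and $\lambda_2 = n\alpha(1-\rho)$; the data and the $(\alpha, \rho)$ values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
n, d = 1000, 20
X = rng.randn(n, d)
y = X @ rng.randn(d) + 0.1 * rng.randn(n)

alpha, rho = 0.05, 0.5
w = ElasticNet(alpha=alpha, l1_ratio=rho, fit_intercept=False,
               tol=1e-10, max_iter=100000).fit(X, y).coef_

nz = w != 0                                          # drop parameters that are exactly 0
A = np.column_stack([np.sign(w[nz]), 2.0 * w[nz]])   # |sign(w)| = 1 on nonzero entries
b = (-2.0 * X.T @ y + 2.0 * X.T @ (X @ w))[nz]
lam1_hat, lam2_hat = np.linalg.lstsq(A, -b, rcond=None)[0]   # Eqn. 18

print(lam1_hat, 2 * n * alpha * rho)                 # should approximately match
print(lam2_hat, n * alpha * (1 - rho))
```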

Figure 17 shows the attack and defense results for ENet on the three regression datasets. We observe that our attacks are effective for learning algorithms with more than one hyperparameter, and rounding is also an effective defense.

APPENDIX C
NEURAL NETWORK (NN)

We evaluate attack and defense on a three-layer neural network (NN) for both regression and classification.

Regression: The objective function of the three-layer NN for regression is defined as

$$L(W_1, w_2) = \|y - \hat{y}\|_2^2 + \lambda\big(\|W_1\|_F^2 + \|w_2\|_2^2\big), \qquad (19)$$

where $\hat{y} = \mathrm{sig}(X^T W_1 + b_1)\, w_2 + b_2$; $\mathrm{sig}(A) = \frac{1}{1 + \exp(-A)}$ is the logistic function; $W_1 \in \mathbb{R}^{m \times d}$ is the weight matrix of the input layer to the hidden layer and $w_2 \in \mathbb{R}^d$ is the weight vector of the hidden layer to the output layer; $d$ is the number of hidden units; $b_1$ and $b_2$ are the bias terms of the two layers.

To perform our hyperparameter stealing attack, we can leverage the gradient of $L(W_1, w_2)$ with respect to either $W_1$ or $w_2$. For simplicity, we use $w_2$. Specifically, the gradient of the objective function with respect to $w_2$ is:

$$\frac{\partial L(W_1, w_2)}{\partial w_2} = 2\,\mathrm{sig}\big(X^T W_1 + b_1\big)^T (\hat{y} - y) + 2\lambda w_2.$$

By setting the gradient to 0, we can estimate $\lambda$ using Eqn. 3 with $a = w_2$ and $b = \mathrm{sig}\big(X^T W_1 + b_1\big)^T (\hat{y} - y)$.

Classification: The objective function of the three-layer NN for binary classification is defined as

$$L(W_1, w_2) = -\sum_{i=1}^n \big(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big) + \frac{\lambda}{2}\big(\|W_1\|_F^2 + \|w_2\|_2^2\big), \qquad (20)$$

where $\hat{y}_i = \mathrm{sig}\big(w_2^T\,\mathrm{sig}(W_1^T x_i + b_1) + b_2\big)$. Similarly to regression, we use $w_2$ to steal the hyperparameter $\lambda$. The gradient of the objective function with respect to $w_2$ is:

$$\frac{\partial L(W_1, w_2)}{\partial w_2} = -\sum_{i=1}^n (y_i - \hat{y}_i)\,\mathrm{sig}\big(W_1^T x_i + b_1\big) + \lambda w_2.$$

Setting the gradient to 0, we can estimate $\lambda$ via Eqn. 3 with $a = w_2$ and $b = -\sum_{i=1}^n (y_i - \hat{y}_i)\,\mathrm{sig}\big(W_1^T x_i + b_1\big)$.

APPENDIX D
PROOF OF THEOREM 5.1

When $w$ is an exact minimum of the objective function, we have $b = -\lambda a$. Therefore, the estimate in Eqn. 3 satisfies:

$$\hat{\lambda} = -\big(a^T a\big)^{-1} a^T b = -\big(a^T a\big)^{-1} a^T (-\lambda a) = \lambda \big(a^T a\big)^{-1} a^T a = \lambda.$$


APPENDIX E
PROOF OF THEOREM 5.2

We prove the theorem for linear learning algorithms, as the proof for kernel learning algorithms is similar. We treat our estimated hyperparameter $\lambda$ as a function of the model parameters and expand $\lambda(w^\star + \Delta w)$ at $w^\star$ using a Taylor expansion:

$$\begin{aligned}
\lambda(w^\star + \Delta w) &= \lambda(w^\star) + \Delta w^T \nabla\lambda(w^\star) + \frac{1}{2}\Delta w^T \nabla^2\lambda(w^\star)\Delta w + \cdots \\
&= \lambda(w^\star) + \Delta w^T \nabla\lambda(w^\star) + O(\|\Delta w\|_2^2) \\
&= \lambda + \Delta w^T \nabla\lambda(w^\star) + O(\|\Delta w\|_2^2).
\end{aligned}$$

Therefore, $\Delta\lambda = \lambda(w^\star + \Delta w) - \lambda = \Delta w^T \nabla\lambda(w^\star) + O(\|\Delta w\|_2^2)$.

APPENDIX F
APPROXIMATIONS OF GRADIENTS

We approximate the gradient $\nabla\lambda(w^\star)$ in Theorem 2 for RR, LASSO, L2-LR, L2-KLR, L1-LR, and L1-KLR. According to the definition of the gradient, we have:

$$\nabla\lambda(w^\star) = \lim_{\Delta w \to 0} \frac{\lambda(w^\star + \Delta w) - \lambda(w^\star)}{\Delta w},$$

where the division and the limit are component-wise for $\Delta w$.

RR: We approximate the gradient as follows:

$$\begin{aligned}
\nabla\lambda_{RR}(w^\star) &= \lim_{\Delta w \to 0} \frac{\lambda_{RR}(w^\star + \Delta w) - \lambda_{RR}(w^\star)}{\Delta w} \\
&= \lim_{\Delta w \to 0} \frac{1}{\Delta w}\left(\frac{(w^\star + \Delta w)^T\big(Xy - XX^T(w^\star + \Delta w)\big)}{(w^\star + \Delta w)^T(w^\star + \Delta w)} - \frac{(w^\star)^T\big(Xy - XX^T w^\star\big)}{(w^\star)^T w^\star}\right) \\
&\approx \lim_{\Delta w \to 0} \frac{1}{\Delta w}\left(\frac{\Delta w^T\big(Xy - 2XX^T w^\star\big) - \Delta w^T XX^T \Delta w}{(w^\star)^T w^\star}\right) \\
&\approx \frac{X\big(y - 2X^T w^\star\big)}{\|w^\star\|_2^2},
\end{aligned}$$

where in the third and fourth (approximate) equalities, we use $(w^\star + \Delta w)^T(w^\star + \Delta w) \approx (w^\star)^T w^\star$ and $\Delta w^T XX^T \Delta w \approx 0$ for sufficiently small $\Delta w$, respectively.

LASSO: We approximate the gradient as follows:

$$\begin{aligned}
\nabla\lambda_{LASSO}(w^\star) &= \lim_{\Delta w \to 0} \frac{\lambda_{LASSO}(w^\star + \Delta w) - \lambda_{LASSO}(w^\star)}{\Delta w} \\
&= \lim_{\Delta w \to 0} \frac{1}{\Delta w}\left(\frac{2\,\mathrm{sign}(w^\star + \Delta w)^T\big(Xy - XX^T(w^\star + \Delta w)\big)}{\mathrm{sign}(w^\star + \Delta w)^T \mathrm{sign}(w^\star + \Delta w)} - \frac{2\,\mathrm{sign}(w^\star)^T\big(Xy - XX^T w^\star\big)}{\mathrm{sign}(w^\star)^T \mathrm{sign}(w^\star)}\right) \\
&\approx \lim_{\Delta w \to 0} \frac{1}{\Delta w}\,\frac{\mathrm{sign}(w^\star)^T XX^T \Delta w}{\|\mathrm{sign}(w^\star)\|_2^2}
\approx \lim_{\Delta w \to 0} \frac{1}{\Delta w}\,\frac{\Delta w^T XX^T \mathrm{sign}(w^\star)}{\|\mathrm{sign}(w^\star)\|_2^2}
\approx \frac{XX^T \mathrm{sign}(w^\star)}{\|\mathrm{sign}(w^\star)\|_2^2},
\end{aligned}$$

where in the third (approximate) equality, we use $\mathrm{sign}(w^\star + \Delta w) \approx \mathrm{sign}(w^\star)$.

L2-LR: We approximate the gradient as follows:

$$\begin{aligned}
\nabla\lambda_{L2\text{-}LR}(w^\star) &= \lim_{\Delta w \to 0} \frac{\lambda_{L2\text{-}LR}(w^\star + \Delta w) - \lambda_{L2\text{-}LR}(w^\star)}{\Delta w} \\
&= \lim_{\Delta w \to 0} \frac{1}{\Delta w}\left(\frac{(w^\star + \Delta w)^T X\big(y - h_{w^\star + \Delta w}(X)\big)}{(w^\star + \Delta w)^T(w^\star + \Delta w)} - \frac{(w^\star)^T X\big(y - h_{w^\star}(X)\big)}{(w^\star)^T w^\star}\right) \\
&\approx \lim_{\Delta w \to 0} \frac{1}{\Delta w}\left(\frac{\Delta w^T X\big(y - h_{w^\star + \Delta w}(X)\big)}{(w^\star)^T w^\star} - \frac{(w^\star)^T X\big(h_{w^\star + \Delta w}(X) - h_{w^\star}(X)\big)}{(w^\star)^T w^\star}\right) \\
&\approx \frac{X\big(y - h_{w^\star}(X)\big)}{\|w^\star\|_2^2},
\end{aligned}$$

where in the third and fourth (approximate) equalities, we use $(w^\star + \Delta w)^T(w^\star + \Delta w) \approx (w^\star)^T w^\star$ and $h_{w^\star + \Delta w}(X) \approx h_{w^\star}(X)$ for sufficiently small $\Delta w$, respectively.

L2-KLR: Similar to L2-LR, we approximate the gradient as:

$$\nabla\lambda_{L2\text{-}KLR}(\alpha^\star) \approx \frac{K\big(y - h_{\alpha^\star}(K)\big)}{\|\alpha^\star\|_2^2}.$$

L1-LR: We approximate the gradient as follows:

$$\begin{aligned}
\nabla\lambda_{L1\text{-}LR}(w^\star) &= \lim_{\Delta w \to 0} \frac{\lambda_{L1\text{-}LR}(w^\star + \Delta w) - \lambda_{L1\text{-}LR}(w^\star)}{\Delta w} \\
&= \lim_{\Delta w \to 0} \frac{1}{\Delta w}\left(\frac{\mathrm{sign}(w^\star + \Delta w)^T X\big(y - h_{w^\star + \Delta w}(X)\big)}{\mathrm{sign}(w^\star + \Delta w)^T \mathrm{sign}(w^\star + \Delta w)} - \frac{\mathrm{sign}(w^\star)^T X\big(y - h_{w^\star}(X)\big)}{\mathrm{sign}(w^\star)^T \mathrm{sign}(w^\star)}\right) \\
&\approx \lim_{\Delta w \to 0} \frac{\big(h_{w^\star + \Delta w}(X) - h_{w^\star}(X)\big)^T}{\Delta w}\,\frac{X^T \mathrm{sign}(w^\star)}{\|\mathrm{sign}(w^\star)\|_2^2} \\
&\approx \frac{\nabla h_{w^\star}(X)\, X^T \mathrm{sign}(w^\star)}{\|\mathrm{sign}(w^\star)\|_2^2}.
\end{aligned}$$

L1-KLR: Similar to L1-LR, we approximate the gradient as:

$$\nabla\lambda_{L1\text{-}KLR}(\alpha^\star) \approx \frac{\nabla h_{\alpha^\star}(K)\, K^T \mathrm{sign}(\alpha^\star)}{\|\mathrm{sign}(\alpha^\star)\|_2^2}.$$