


Adversarial Dropout for Supervised and Semi-Supervised Learning

Sungrae Park and Jun-Keon Park and Su-Jin Shin and Il-Chul Moon
Department of Industrial and System Engineering

KAIST, Daejeon, South Korea

{sungraepark, alex3012, sujin.shin, icmoon}@kaist.ac.kr

Abstract

Recently, training with adversarial examples, which are generated by adding a small but worst-case perturbation on input examples, has improved the generalization performance of neural networks. In contrast to perturbing individual inputs to enhance generality, this paper introduces adversarial dropout, which is a minimal set of dropouts that maximize the divergence between 1) the training supervision and 2) the outputs from the network with the dropouts. The identified adversarial dropouts are used to automatically reconfigure the neural network in the training process, and we demonstrated that the simultaneous training on the original and the reconfigured network improves the generalization performance of supervised and semi-supervised learning tasks on MNIST, SVHN, and CIFAR-10. We analyzed the trained model to find the reasons for the performance improvement. We found that adversarial dropout increases the sparsity of neural networks more than the standard dropout. Finally, we also proved that adversarial dropout is a regularization term with a rank-valued hyperparameter, which differs from a continuous-valued parameter specifying the strength of the regularization.

Introduction

Deep neural networks (DNNs) have demonstrated significant improvements on benchmark performances in a wide range of applications. As neural networks become deeper, the model complexity also increases quickly, and this complexity leads DNNs to potentially overfit a training data set. Several techniques (Hinton et al. 2012; Poole, Sohl-Dickstein, and Ganguli 2014; Bishop 1995b; Lasserre, Bishop, and Minka 2006) have emerged over the past years to address this challenge, and dropout has become one of the dominant methods due to its simplicity and effectiveness (Hinton et al. 2012; Srivastava et al. 2014).

Dropout randomly disconnects neural units during training as a method to prevent feature co-adaptation (Baldi and Sadowski 2013; Wager, Wang, and Liang 2013; Wang and Manning 2013; Li, Gong, and Yang 2016). The earlier work by Hinton et al. (2012) and Srivastava et al. (2014) interpreted dropout as an extreme form of model combination, a.k.a. a model ensemble, by sharing extensive parameters on neural networks. They proposed learning the model combination through minimizing an expected loss of models perturbed by dropout. They also pointed out that the output of dropout is the geometric mean of the outputs from the model ensemble with the shared parameters. Extending the weight-sharing perspective, several studies (Baldi and Sadowski 2013; Chen et al. 2014; Jain et al. 2015) analyzed the ensemble effects from dropout.

The recent work by Laine and Aila (2016) enhanced the ensemble effect of dropout by adding a self-ensembling term. The self-ensembling term is constructed as a divergence between two neural networks sampled from the dropout. By minimizing the divergence, the sampled networks learn from each other, and this practice is similar to the working mechanism of the ladder network (Rasmus et al. 2015), which builds a connection between an unsupervised and a supervised neural network. Our method also follows the principles of self-ensembling, but we apply the adversarial training concept to the sampling of neural network structures through dropout.

While the community developed dropout, adversarial training became another focus of the community. Szegedy et al. (2013) showed that a certain neural network is vulnerable to a very small perturbation in the training data set if the noise direction is sensitive to the model's label assignment y given x, even when the perturbation is so small that human eyes cannot discern the difference. They empirically proved that robustly training models against adversarial perturbation is effective in reducing test errors. However, their method of identifying adversarial perturbations contains a computationally expensive inner loop. To compensate for this, Goodfellow, Shlens, and Szegedy (2014) suggested an approximation method, through the linearization of the loss function, that is free from the loop. Adversarial training can be conducted on supervised learning because the adversarial direction can be defined when true y labels are known. Miyato et al. (2015) proposed a virtual adversarial direction to apply adversarial training to semi-supervised learning, which may not assume the true y value. To date, adversarial perturbations have been defined as a unit vector of additive noise imposed on the input or the embedding spaces (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy 2014; Miyato et al. 2015).

Our proposed method, adversarial dropout, can be viewed from both the dropout and the adversarial training perspectives.



Figure 1: Diagram description of the loss functions from the Π model (Laine and Aila 2016), adversarial training (Miyato et al. 2015), and our adversarial dropout.

Adversarial dropout can be interpreted as dropout masks whose direction is optimized adversarially to the model's label assignment. However, it should be noted that adversarial dropout and traditional adversarial training with additive perturbation are different, because adversarial dropout induces a sparse structure of the neural network, while the latter does not directly change the structure of the neural network.

Figure 1 describes the proposed loss function construction of adversarial dropout compared to 1) the recent dropout model, which is the Π model (Laine and Aila 2016), and 2) adversarial training (Goodfellow, Shlens, and Szegedy 2014; Miyato et al. 2015). When we compare adversarial dropout to the Π model, both divergence terms are similarly computed from two different dropped networks, but adversarial dropout uses an optimized dropped network to adapt the concept of adversarial training. When we compare adversarial dropout to adversarial training, the divergence term of adversarial training is computed from one network structure with two training examples: a clean and an adversarial example. On the contrary, the divergence term of adversarial dropout is defined with two network structures: a randomly dropped network and an adversarially dropped network.

Our experiments demonstrated that 1) adversarial dropout improves the performance on MNIST supervised learning compared to the dropout suggested by the Π model, and 2) adversarial dropout showed the state-of-the-art performance on the semi-supervised learning tasks on SVHN and CIFAR-10 when compared to the most recent techniques of dropout and adversarial training. Following the performance comparison, we visualize the neural network structure from adversarial dropout to illustrate that adversarial dropout enables a sparse structure compared to the neural network of standard dropout. Finally, we theoretically show an original characteristic of adversarial dropout: it specifies the strength of the regularization effect by a rank-valued parameter, while adversarial training specifies the strength with a conventional continuous-valued scale.

Preliminaries

Before introducing adversarial dropout, we briefly introduce stochastic noise layers for deep neural networks. Afterwards, we review adversarial training and temporal ensembling, or the Π model, because the two methods are closely related to adversarial dropout.

Noise Layers

Corrupting training data with noise is a well-known method to stabilize prediction (Bishop 1995a; Maaten et al. 2013; Wager, Wang, and Liang 2013). This section describes two types of noise injection techniques: additive Gaussian noise and dropout noise.

Let h^(l) denote the lth hidden variables in a neural network; this layer can be replaced with a noisy version h̃^(l). We can vary the noise types as follows.

• Additive Gaussian noise: h̃^(l) = h^(l) + γ, where γ ∼ N(0, σ²I_{d×d}) with the parameter σ² restricting the degree of the noise.

• Dropout noise: h̃^(l) = h^(l) ⊙ ε, where ⊙ is the element-wise product of two vectors, and the elements of the noise vector are ε_i ∼ Bernoulli(1 − p) with the parameter p. Simply put, ε_i = 0 with probability p and ε_i = 1 with probability (1 − p).

Both additive Gaussian noise and dropout noise are regularization techniques that increase the generality of the trained model, but they have different properties. Additive Gaussian noise increases the margin of decision boundaries, while dropout noise drives a model to be sparse (Srivastava et al. 2014). These noise layers can be easily included in a deep neural network. For example, there can be a dropout layer between two convolutional layers. Similarly, a layer of additive Gaussian noise can be placed on the input layer.
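To make the two layer types concrete, the following is a minimal numpy sketch of both noise layers; the function names and default parameter values are ours, chosen for illustration only.

```python
import numpy as np

def gaussian_noise_layer(h, sigma=0.15):
    """Additive Gaussian noise: h + gamma, with gamma ~ N(0, sigma^2 I)."""
    return h + sigma * np.random.randn(*h.shape)

def dropout_layer(h, p=0.5):
    """Dropout noise: element-wise product with a Bernoulli(1 - p) mask,
    so each unit is zeroed with probability p."""
    eps = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * eps
```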

Self-Ensembling Model

The recently reported self-ensembling (SE) (Laine and Aila 2016), or Π model, constructs a loss function that minimizes the divergence between two outputs from two sampled dropout neural networks with the same input stimulus. Their suggested regularization term can be interpreted as the following:

$$\mathcal{L}_{SE}(x; \theta) := D[f_\theta(x, \epsilon^1),\, f_\theta(x, \epsilon^2)] \qquad (1)$$

where ε¹ and ε² are randomly sampled dropout noises in a neural network f_θ, whose parameters are θ. Also, D[y, y′] is a non-negative function that represents the distance between two output vectors, y and y′. For example, D can be the cross-entropy function, $D[\mathbf{y}, \mathbf{y}'] = -\sum_i y_i \log y'_i$, where y and y′ are vectors whose ith elements represent the probability of the ith class. Because the divergence is calculated between two outputs of two different structures given the same, possibly unlabeled, input, this regularization can be used for semi-supervised learning. The Π model is based on the principle of the Γ model, which is the ladder network by Rasmus et al. (2015). Our proposed method, adversarial dropout, can be seen as a special case of the Π model when one dropout neural network is adversarially sampled.
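As a sketch of Eq. (1), the regularizer can be computed with two independent stochastic forward passes through the same network; here `f` is a placeholder for f_θ that is assumed to draw a fresh dropout mask internally on every call.

```python
import numpy as np

def cross_entropy(y, y_prime, eps=1e-8):
    """D[y, y'] = -sum_i y_i log y'_i for two class-probability vectors."""
    return -np.sum(y * np.log(y_prime + eps), axis=-1)

def self_ensembling_loss(f, x):
    """L_SE(x; theta) of Eq. (1): divergence between two dropout samples."""
    y1 = f(x)  # forward pass under a sampled dropout mask eps_1
    y2 = f(x)  # forward pass under another sampled mask eps_2
    return np.mean(cross_entropy(y1, y2))
```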

Adversarial Training

Adversarial dropout follows the training mechanism of adversarial training, so we briefly introduce a generalized formulation of adversarial training. The basic concept of adversarial training (AT) is the incorporation of adversarial examples in the training process. The additional loss function from including adversarial examples (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy 2014; Miyato et al. 2015) can be defined in a generalized form:

$$\mathcal{L}_{AT}(x, y; \theta, \delta) := D[g(x, y, \theta),\, f_\theta(x + \gamma^{adv})] \qquad (2)$$

where $\gamma^{adv} := \arg\max_{\gamma;\, \|\gamma\|_\infty \le \delta} D[g(x, y, \theta),\, f_\theta(x + \gamma)]$.

Here, θ is a set of model parameters, and δ is a hyperparameter controlling the intensity of the adversarial perturbation γ^adv. The function f_θ(x) is the output distribution of the neural network to be learned. Adversarial training can be diversified by differentiating the definition of g(x, y, θ), as follows.

• Adversarial training (AT) (Goodfellow, Shlens, and Szegedy 2014; Kurakin, Goodfellow, and Bengio 2016) defines g(x, y, θ) as g(y), ignoring x and θ, so g(y) is a one-hot encoding vector of y.

• Virtual adversarial training (VAT) (Miyato et al. 2015; Miyato, Dai, and Goodfellow 2016) defines g(x, y, θ) as f_θ̂(x), where θ̂ is the current estimated parameter. This training method does not use any information from y in the adversarial part of the loss function. This enables the adversarial part to be used as a regularization term for semi-supervised learning.
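For reference, the inner maximization of Eq. (2) has a well-known linearized solution under the L∞ constraint. The sketch below assumes the input gradient of D has already been computed (e.g., via autodiff); the function names and the `divergence` argument are ours.

```python
import numpy as np

def fgsm_perturbation(grad_x, delta):
    """Linearized solution of argmax over ||gamma||_inf <= delta in Eq. (2)."""
    return delta * np.sign(grad_x)

def adversarial_training_loss(f, g_target, x, grad_x, delta, divergence):
    """L_AT of Eq. (2). For AT, g_target is the one-hot label g(y);
    for VAT, g_target is f(x) under the current parameters (no label needed)."""
    x_adv = x + fgsm_perturbation(grad_x, delta)
    return divergence(g_target, f(x_adv))
```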

Method

This section presents our adversarial dropout, which combines the ideas of adversarial training and dropout. First, we formally define adversarial dropout. Second, we propose a training algorithm that finds the instantiations of adversarial dropout with a fast approximation method.

General Expression of Adversarial Dropout

Now, we propose adversarial dropout (AdD), an adversarial training method that determines the dropout condition to which the model's label assignment is most sensitive. We use f_θ(x, ε) as the output distribution of a neural network with a dropout mask. Below is the description of the additional loss function from incorporating adversarial dropout.

$$\mathcal{L}_{AdD}(x, y, \epsilon^s; \theta, \delta) := D[g(x, y, \theta),\, f_\theta(x, \epsilon^{adv})] \qquad (3)$$

where $\epsilon^{adv} := \arg\max_{\epsilon;\, \|\epsilon^s - \epsilon\|_2 \le \delta H} D[g(x, y, \theta),\, f_\theta(x, \epsilon)]$.

Here, D[·, ·] indicates a divergence function; g(x, y, θ) represents an adversarial target function that can be diversified by its definition; ε^adv is an adversarial dropout mask under the function f_θ, where θ is a set of model parameters; ε^s is a sampled random dropout mask instance; δ is a hyperparameter controlling the intensity of the noise; and H is the dropout layer dimension.

We introduce the boundary condition ‖ε^s − ε‖₂ ≤ δH, which restricts the number of differences between the two dropout conditions. An adversarial dropout mask should be only minimally different from the random dropout mask. Without this constraint, the network with adversarial dropout might become a neural network layer without any connections. By restricting the adversarial dropout with the random dropout, we prevent finding such a degenerate layer, which would not support backpropagation. We also found that the distance between two ε vectors can be measured by the graph edit distance or the Jaccard distance. In the supplementary material, we prove that the graph edit distance and the Jaccard distance can be abstracted as Euclidean distances between two ε vectors.

In the general form of adversarial training, the key point is the existence of the linear perturbation γ^adv. We can interpret the input with the adversarial perturbation as the adversarial noise input x^adv = x + γ^adv. From this perspective, the authors of adversarial training limited the adversarial direction only to the space of the additive Gaussian noise x̃ = x + γ⁰, where γ⁰ is a sampled Gaussian noise on the input layer. In contrast, adversarial dropout can be considered a noise space generated by masking hidden units, h^adv = h ⊙ ε^adv, where h is the hidden units and ε^adv is an adversarially selected dropout condition. Whereas the Gaussian additive perturbation on the input in adversarial training is linear in nature, adversarial dropout can be a non-linear perturbation when it is imposed upon multiple layers.

Supervised Adversarial Dropout

Supervised adversarial dropout (SAdD) defines g(x, y, θ) as y, so g is a one-hot vector of y, as in a typical neural network. The divergence term from Formula 3 can be converted as follows:

$$\mathcal{L}_{SAdD}(x, y, \epsilon^s; \theta, \delta) := D[g(y),\, f_\theta(x, \epsilon^{adv})] \qquad (4)$$

where $\epsilon^{adv} := \arg\max_{\epsilon;\, \|\epsilon^s - \epsilon\|_2 \le \delta H} D[g(y),\, f_\theta(x, \epsilon)]$.

In this case, the divergence term can be seen as the pure loss function for supervised learning with a dropout regularization. However, ε^adv is selected to maximize the divergence between the true label and the output from the dropped network, so ε^adv eventually becomes the mask on the most contributing features. This adversarial mask provides a learning opportunity to neurons, the so-called dead filters, that were considered less informative.

Virtual Adversarial Dropout

Virtual adversarial dropout (VAdD) defines g(x, y, θ) = f_θ(x, ε^s). This uses the loss function as a regularization term for semi-supervised learning. The divergence term in Formula 3 can be represented as below:

$$\mathcal{L}_{VAdD}(x, y, \epsilon^s; \theta, \delta) := D[f_\theta(x, \epsilon^s),\, f_\theta(x, \epsilon^{adv})] \qquad (5)$$

where $\epsilon^{adv} := \arg\max_{\epsilon;\, \|\epsilon^s - \epsilon\|_2 \le \delta H} D[f_\theta(x, \epsilon^s),\, f_\theta(x, \epsilon)]$.


VAdD is a special case of a self-ensembling model with two dropouts: 1) a dropout, ε^s, sampled from a random distribution with a hyperparameter, and 2) a dropout, ε^adv, composed to maximize the divergence function of the learner, which is the noise-injection concept of virtual adversarial training. The two dropouts create a regularization analogous to virtual adversarial training, and the inference procedure optimizes the parameters to reduce the divergence between the random dropout and the adversarial dropout. This optimization triggers the self-ensemble learning of (Laine and Aila 2016). However, adversarial dropout differs from the previous self-ensembling because one dropout is induced by the adversarial setting, not by random sampling.
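A sketch of Eq. (5) in terms of the decomposition f_θ(x, ε) = f_upper(h(x) ⊙ ε) made explicit later in the paper; `f_upper` and `h` are placeholders for the sub-networks above and below the dropout layer.

```python
import numpy as np

def vadd_loss(f_upper, h, eps_s, eps_adv, divergence):
    """L_VAdD of Eq. (5): divergence between the randomly dropped and the
    adversarially dropped networks, which share all parameters."""
    y_random = f_upper(h * eps_s)   # f_theta(x, eps_s)
    y_adv = f_upper(h * eps_adv)    # f_theta(x, eps_adv)
    return np.mean(divergence(y_random, y_adv))
```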

Learning with Adversarial Dropout

The full objective function for learning with adversarial dropout is given by

$$l(y, f_\theta(x, \epsilon^s)) + \lambda \mathcal{L}_{AdD}(x, y, \epsilon^s; \theta, \delta) \qquad (6)$$

where l(y, f_θ(x, ε^s)) is the negative log-likelihood of y given x under the sampled dropout instance ε^s. There are two scalar hyperparameters: (1) a trade-off parameter, λ, controlling the impact of the proposed regularization term, and (2) a constraint, δ, specifying the intensity of adversarial dropout.

Combining Adversarial Dropout and Adversarial Training

Additionally, it should be noted that adversarial training and adversarial dropout are not mutually exclusive training methods. A neural network can be trained by imposing the input perturbation with Gaussian additive noise and by enabling the adversarially chosen dropouts, simultaneously. Formula 7 specifies the loss function for simultaneously utilizing adversarial dropout and adversarial training.

$$l(y, f_\theta(x, \epsilon^s)) + \lambda_1 \mathcal{L}_{AdD}(x, y, \epsilon^s) + \lambda_2 \mathcal{L}_{AT}(x, y) \qquad (7)$$

where λ₁ and λ₂ are trade-off parameters controlling the impact of the regularization terms.
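Putting the pieces together, the total objective is a weighted sum of the supervised loss and the regularizers; a trivial sketch of Formulas (6) and (7), with all terms assumed precomputed:

```python
def full_objective(nll_term, l_add, l_at=0.0, lam1=1.0, lam2=0.0):
    """Formula (7); setting lam2 = 0 recovers Formula (6).

    nll_term : l(y, f_theta(x, eps_s)), the supervised loss
    l_add    : L_AdD (either the SAdD or the VAdD variant)
    l_at     : optional adversarial training term L_AT
    """
    return nll_term + lam1 * l_add + lam2 * l_at
```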

Fast Approximation Method for Finding Adversarial Dropout Condition

Once the adversarial dropout, ε^adv, is identified, the evaluation of L_AdD simply becomes the computation of the loss and the divergence functions. However, the inference on ε^adv is difficult for three reasons. First, we cannot obtain a closed-form solution on the exact adversarial noise value, ε^adv. Second, the feasible space for ε^adv is restricted by ‖ε^s − ε^adv‖₂ ≤ δH, which becomes a constraint in the optimization. Third, ε^adv is a binary-valued vector rather than a continuous-valued vector, because ε^adv indicates the activation of neurons. This discrete nature requires an optimization technique like integer programming.

To mitigate this difficulty, we approximated the objective function, L_AdD, with the first order of the Taylor expansion by relaxing the domain space of ε^adv. This Taylor expansion of the objective function was used in the earlier works of adversarial training (Goodfellow, Shlens, and Szegedy 2014; Miyato et al. 2015). After the approximation, we found an adversarial dropout condition by solving an integer programming problem.

To define a neural network with a dropout layer, we separate the output function into two neural sub-networks, $f_\theta(x, \epsilon) = f^{upper}_{\theta_1}(h(x) \odot \epsilon)$, where $f^{upper}_{\theta_1}$ is the sub-network above the dropout layer and $h(x) = f^{under}_{\theta_2}(x)$ is the sub-network below it. Our objective is optimizing the adversarial dropout noise ε^adv by maximizing the following divergence function under the constraint ‖ε^s − ε^adv‖₂ ≤ δH:

$$D(x, \epsilon; \theta, \epsilon^s) = D[g(x, y, \theta, \epsilon^s),\, f^{upper}_{\theta_1}(h(x) \odot \epsilon)] \qquad (8)$$

where ε^s is a sampled dropout mask and θ is a parameter of the neural network model. We approximate the above divergence function by deriving the first order of the Taylor expansion, relaxing the domain space of ε from the binary space {0, 1}^H to the real-valued space [0, 1]^H. This relaxation is a common step in integer programming research (Hemmecke et al. 2010):

$$D(x, \epsilon; \theta, \epsilon^s) \approx D(x, \epsilon^0; \theta, \epsilon^s) + (\epsilon - \epsilon^0)^T J(x, \epsilon^0) \qquad (9)$$

where J(x, ε⁰) is the Jacobian vector given by $J(x, \epsilon^0) := \nabla_\epsilon D(x, \epsilon; \theta, \epsilon^s)\big|_{\epsilon = \epsilon^0}$, with ε⁰ = 1 indicating no noise injection. The above Taylor expansion provides a linearized optimization objective controlled by ε. Therefore, we reorganize the Taylor expansion with respect to ε as below:

$$D(x, \epsilon; \theta, \epsilon^s) \propto \sum_i \epsilon_i J_i(x, \epsilon^0) \qquad (10)$$

where J_i(x, ε⁰) is the ith element of J(x, ε⁰). Since we cannot proceed further with the given formula, we introduce an alternative Jacobian formula that further specifies the dropout mechanism via ⊙ and h(x), as below:

$$J(x, \epsilon^0) \approx h(x) \odot \nabla_{h(x)} D(x, \epsilon^0; \theta, \epsilon^s) \qquad (11)$$

where h(x) is the output vector of the sub-network below the adversarial dropout layer.

The control variable, ε, is a binary vector whose elements are either one or zero. Under this approximate divergence, finding a maximal point of ε can be viewed as the 0/1 knapsack problem (Kellerer, Pferschy, and Pisinger 2004), one of the most popular integer programming problems.

To find ε^adv under the constraint, we propose Algorithm 1, based on dynamic programming approaches to the 0/1 knapsack problem. In the algorithm, ε^adv is initialized with ε^s, and ε^adv changes its values in order of how much each change increases the objective divergence, until ‖ε^s − ε^adv‖₂ ≤ δH would be violated or there is no further increment in the divergence. After running the algorithm, we obtain an ε^adv that maximizes the divergence under the constraint, and we evaluate the loss function L_AdD.

We should note that the expansion point of the Taylor expansion is not ε^s, but ε⁰. In the case of virtual adversarial dropout, whose divergence is formed as D[f_θ(x, ε^s), f_θ(x, ε)], ε^s is a minimal point whose gradient is zero, because the two compared distributions are identical there. This zero gradient would reduce the approximated divergence term to


Algorithm 1: Finding Adversarial Dropout Condition

Input: ε^s, the current sampled dropout mask
Input: δ, a hyper-parameter for the boundary
Input: J, the Jacobian vector
Input: H, the layer dimension
Output: ε^adv

1:  begin
2:      z ← |J|                      // absolute values of the Jacobian
3:      i ← argsort of z in descending order, so that z_{i1} ≥ ... ≥ z_{iH}
4:      ε^adv ← ε^s
5:      d ← 1
6:      while ‖ε^s − ε^adv‖₂ ≤ δH and d ≤ H do
7:          if ε^adv_{id} = 0 and J_{id} > 0 then
8:              ε^adv_{id} ← 1
9:          else if ε^adv_{id} = 1 and J_{id} < 0 then
10:             ε^adv_{id} ← 0
11:         end if
12:         d ← d + 1
13:     end while
14: end

Table 1: Test performance with 1,000 labeled (semi-supervised) and 60,000 labeled (supervised) examples on MNIST. Each setting is repeated eight times.

                       Error rate (%) with # labels
Method                 | 1,000         | All (60,000)
Plain (only dropout)   | 2.99 ± 0.23   | 0.53 ± 0.03
AT                     | -             | 0.51 ± 0.03
VAT                    | 1.35 ± 0.14   | 0.50 ± 0.01
Π model                | 1.00 ± 0.08   | 0.50 ± 0.02
SAdD                   | -             | 0.46 ± 0.01
VAdD (KL)              | 0.99 ± 0.07   | 0.47 ± 0.01
VAdD (QE)              | 0.99 ± 0.09   | 0.46 ± 0.02

zero. To avoid the zero gradients, we set the expansion point of the Taylor expansion to ε⁰.

This zero-gradient situation does not occur when the model function, f_θ, contains additional stochastic layers, because f_θ(x, ε^s, ρ¹) ≠ f_θ(x, ε^s, ρ²) when ρ¹ and ρ² are independently sampled noises from other stochastic layers.
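The following is a direct numpy transcription of Eq. (11) and Algorithm 1; it assumes the gradient of D with respect to h(x) has already been computed, and it counts the boundary as the number of differing mask entries (for binary masks, the squared distance ‖ε^s − ε^adv‖² equals that count).

```python
import numpy as np

def approx_jacobian(h, grad_h):
    """Eq. (11): J(x, eps0) ~= h(x) * (gradient of D w.r.t. h(x))."""
    return h * grad_h

def find_adversarial_dropout(eps_s, J, delta):
    """Algorithm 1: greedily flip mask entries, largest |J_i| first,
    until at most delta * H entries differ from the sampled mask eps_s."""
    H = eps_s.size
    eps_adv = eps_s.copy()
    for idx in np.argsort(-np.abs(J)):             # descending |J_i|
        if np.sum(eps_s != eps_adv) >= delta * H:  # boundary reached
            break
        if eps_adv[idx] == 0 and J[idx] > 0:
            eps_adv[idx] = 1                       # re-activating increases D
        elif eps_adv[idx] == 1 and J[idx] < 0:
            eps_adv[idx] = 0                       # dropping increases D
    return eps_adv
```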

Experiments

This section evaluates the empirical performance of adversarial dropout for supervised and semi-supervised classification tasks on three benchmark datasets: MNIST, SVHN, and CIFAR-10. In every presented task, we compared adversarial dropout, the Π model, and adversarial training. We also performed additional experiments to analyze the sparsity induced by adversarial dropout.

Supervised and Semi-Supervised Learning on the MNIST Task

In the first set of experiments, we benchmark our method on the MNIST dataset (LeCun et al. 1998), which consists of 70,000 handwritten digit images of size 28 × 28, where 60,000 images are used for training and the rest for testing.

Our basic structure is a convolutional neural network (CNN) containing three convolutional layers, with 32, 64, and 128 filters respectively, and three max-pooling layers sized 2 × 2. Adversarial dropout was applied only to the final hidden layer. The structure details and the hyperparameters are described in Appendix B.1.

We conducted both supervised and semi-supervised learning to compare the performances of standard dropout, the Π model, and adversarial training models utilizing linear perturbations on the input space. Supervised learning used 60,000 instances for training with full labels. Semi-supervised learning used 1,000 randomly selected instances with their labels and 59,000 instances with only their input images. Table 1 shows the test error rates, including the baseline models. Over all experiment settings, SAdD and VAdD further reduce the error rate of the Π model, which had the best performance among the baseline models. In the table, KL and QE indicate Kullback-Leibler divergence and quadratic error, respectively, specifying the divergence function D[y, y′].

Supervised and Semi-Supervised Learning on SVHN and CIFAR-10

We evaluated the performance of the supervised and semi-supervised tasks on the SVHN (Netzer et al. 2011) and CIFAR-10 (Krizhevsky and Hinton 2009) datasets, which consist of 32 × 32 color images in ten classes. For these experiments, we used the large-CNN (Laine and Aila 2016; Miyato et al. 2017). The details of the structure and the settings are described in Appendix B.2.

Table 2 shows the reported performances of a close family of CNN-based classifiers for supervised and semi-supervised learning. We did not consider the recently advanced architectures, such as ResNet (He et al. 2016) and DenseNet (Huang et al. 2016), because we intend to compare the performance increment from dropout and other training techniques.

In the supervised learning tasks using all labeled training data, adversarial dropout models achieved the top performance compared to the results of the baseline models, such as the Π model and VAT, on both datasets. When applying adversarial dropout and adversarial training together, there were further improvements in the performances.

Additionally, we conducted experiments on semi-supervised learning with randomly selected labeled data and unlabeled images. In SVHN, 1,000 labeled and 72,257 unlabeled data were used for training. In CIFAR-10, 4,000 labeled and 46,000 unlabeled data were used. Table 2 lists the performance of the semi-supervised learning models, and our implementations with both VAdD and VAT achieved the top performance compared to the results of (Sajjadi, Javanmardi, and Tasdizen 2016).

Our experiments demonstrate that VAT and VAdD are complementary. When applying VAT and VAdD together by simply adding their divergence terms to the loss function (see Formula 7), we achieved the state-of-the-art performances on semi-supervised learning on both datasets:


Table 2: Test performances of semi-supervised and supervised learning on SVHN and CIFAR-10. Each setting is repeated five times. KL and QE indicate Kullback-Leibler divergence and quadratic error, respectively, specifying the divergence function D[y, y′].

                                                | SVHN with # labels           | CIFAR-10 with # labels
Method                                          | 1,000       | 73,257 (All)   | 4,000        | 50,000 (All)
Π model (Laine and Aila 2016)                   | 4.82        | 2.54           | 12.36        | 5.56
Temporal ensembling (Laine and Aila 2016)       | 4.42        | 2.74           | 12.16        | 5.60
Sajjadi et al. (Sajjadi, Javanmardi, and Tasdizen 2016) | -   | -              | 11.29        | -
VAT (Miyato et al. 2017)                        | 3.86        | -              | 10.55        | 5.81
Π model (our implementation)                    | 4.35 ± 0.04 | 2.53 ± 0.05    | 12.62 ± 0.29 | 5.77 ± 0.11
VAT (our implementation)                        | 3.74 ± 0.09 | 2.69 ± 0.04    | 11.96 ± 0.10 | 5.65 ± 0.17
SAdD                                            | -           | 2.46 ± 0.05    | -            | 5.46 ± 0.16
VAdD (KL)                                       | 4.16 ± 0.08 | 2.31 ± 0.01    | 11.68 ± 0.19 | 5.27 ± 0.10
VAdD (QE)                                       | 4.26 ± 0.14 | 2.37 ± 0.03    | 11.32 ± 0.11 | 5.24 ± 0.12
VAdD (KL) + VAT                                 | 3.55 ± 0.05 | 2.23 ± 0.03    | 10.07 ± 0.11 | 4.40 ± 0.12
VAdD (QE) + VAT                                 | 3.55 ± 0.07 | 2.34 ± 0.05    | 9.22 ± 0.10  | 4.73 ± 0.04

3.55% test error on SVHN, and 10.04% and 9.22% test errors on CIFAR-10. Additionally, VAdD alone achieved a better performance than the self-ensemble model (Π model). This indicates that considering an adversarial perturbation on dropout layers enhances the self-ensembling effect.

Effect on Features and Sparsity from Adversarial Dropout

Dropout prevents co-adaptation between the units in a neural network and thereby decreases the dependency between hidden units (Srivastava et al. 2014). To compare adversarial dropout and standard dropout, we analyzed the co-adaptations by visualizing the features of autoencoders on the MNIST dataset. The autoencoder consists of one hidden layer, whose dimension is 256, with the ReLU activation. When training the autoencoder with standard dropout, we set the dropout with p = 0.5 and used the reconstruction error between the input data and the output layer as the loss function to update the weight values. For the autoencoder with adversarial dropout, the adversarial dropout error was additionally considered when updating the weight values, with the parameters λ = 0.2 and δ = 0.3. The trained autoencoders showed similar reconstruction errors on the test dataset.

Figure 2 shows the visualized features of the autoencoders. Two differences can be identified from the visualization: 1) adversarial dropout prevents the learned weight matrix from containing black boxes, or dead filters, which may be all zero for many different inputs, and 2) adversarial dropout tends to standardize the other features, except for localized features viewed as black dots, while standard dropout tends to ignore the neighborhoods of the localized features. This shows that adversarial dropout standardizes the other features while preserving the characteristics of the localized features from standard dropout. These could be the main reasons for the better generalization performance.

An important side effect of standard dropout is the sparse activation of the hidden units (Hinton et al. 2012).

Figure 2: Features of one-hidden-layer autoencoders trained on MNIST; standard dropout (left) and adversarial dropout (right).

To analyze the sparse activations from adversarial dropout, we compared the activation values of the autoencoder models with no dropout, standard dropout, and adversarial dropout on the MNIST test dataset. A sparse model should have only a few highly activated units, and the average activation of any unit across data instances should be low (Hinton et al. 2012). Figure 3 plots the distribution of the activation values and their means across the test dataset. We found that adversarial dropout has fewer highly activated units compared to the others. Moreover, the mean activation values of adversarial dropout were the lowest. These results indicate that adversarial dropout improves the sparsity of the model more than standard dropout does.

Discussion

Previous studies proved that adversarial noise injection is an effective regularizer (Goodfellow, Shlens, and Szegedy 2014). In order to investigate the distinct properties of adversarial dropout, we explore a very simple case: applying adversarial training and adversarial dropout to linear regression.


Figure 3: Histograms of the activation values and the mean activation values from a hidden layer of autoencoders on 1,000 MNIST test images. All values are converted to the log scale for comparison.

Linear Regression with Adversarial Training

Let x_i ∈ R^D be a data point and y_i ∈ R be a target, where i ∈ {1, ..., N}. The objective of linear regression is to find the w ∈ R^D that minimizes $l(w) = \sum_i \|y_i - x_i^T w\|^2$.

To express adversarial examples, we denote by x̃_i = x_i + r_i^adv the adversarial example of x_i, where $r_i^{adv} = \delta\,\mathrm{sign}(\nabla_{x_i} l(w))$ utilizes the fast gradient sign method (FGSM) (Goodfellow, Shlens, and Szegedy 2014) and δ is a control parameter representing the degree of adversarial noise. With the adversarial examples, the objective function of adversarial training can be viewed as follows:

$$l_{AT}(w) = \sum_i \|y_i - (x_i + r_i^{adv})^T w\|^2 \qquad (12)$$

The above equation is translated into the formula below by isolating the terms with r_i^adv as the additive noise:

$$l(w) + \sum_{ij} |\delta \nabla_{x_{ij}} l(w)| + \delta^2 w^T \Gamma_{AT}\, w \qquad (13)$$

where $\Gamma_{AT} = \sum_i \mathrm{sign}(\nabla_{x_i} l(w))^T\, \mathrm{sign}(\nabla_{x_i} l(w))$.

The second term is an L1 regularization scaled by the degree of the adversarial noise, δ, at each data point. The third term is an L2 regularization with Γ_AT, which scales w by the gradient direction differences over all data points. Both penalty terms are closely related to the hyperparameter δ. When δ approaches zero, the regularization terms disappear because the inputs are no longer adversarial examples. For a large δ, the regularization constant grows larger than the original loss function, and learning becomes infeasible. Previous studies proved that the adversarial objective function based on the FGSM is an effective regularizer; here we see that training a linear regression with adversarial examples yields the two regularization terms of the above equation.
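The identity in Eq. (13) is exact for the squared loss and can be checked numerically; the sketch below compares the direct FGSM objective of Eq. (12) against the decomposed form (all function names are ours).

```python
import numpy as np

def input_gradients(w, X, y):
    """Per-example gradient of l(w) w.r.t. x_i: -2 (y_i - x_i^T w) w."""
    return -2.0 * (y - X @ w)[:, None] * w[None, :]

def l_at_direct(w, X, y, delta):
    """Eq. (12): squared loss on the FGSM-perturbed inputs."""
    X_adv = X + delta * np.sign(input_gradients(w, X, y))
    return np.sum((y - X_adv @ w) ** 2)

def l_at_decomposed(w, X, y, delta):
    """Eq. (13): l(w) + L1 term + delta^2 w^T Gamma_AT w."""
    g = input_gradients(w, X, y)
    s = np.sign(g)
    return (np.sum((y - X @ w) ** 2)
            + np.sum(np.abs(delta * g))
            + delta ** 2 * w @ (s.T @ s) @ w)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
assert np.isclose(l_at_direct(w, X, y, 0.1), l_at_decomposed(w, X, y, 0.1))
```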

Linear Regression with Adversarial Dropout

Now, we turn to the case of applying adversarial dropout to linear regression. To represent adversarial dropout, we denote by x̃_i = ε_i^adv ⊙ x_i the adversarially dropped input of x_i, where $\epsilon_i^{adv} = \arg\max_{\epsilon_i;\, \|\epsilon_i - \mathbf{1}\|_2 \le k} \|y_i - (\epsilon_i \odot x_i)^T w\|^2$ with the hyperparameter k controlling the degree of adversarial dropout. For simplification, we used the vector of ones as the sampled dropout, ε^s, of the adversarial dropout. Applying Algorithm 1, the adversarial dropout can be defined as follows:

$$\epsilon_{ij}^{adv} = \begin{cases} 0 & \text{if } x_{ij} \nabla_{x_{ij}} l(w) \le \min\{s_{ik}, 0\} \\ 1 & \text{otherwise} \end{cases} \qquad (14)$$

where $s_{ik}$ is the kth lowest element of $x_i \odot \nabla_{x_i} l(w)$.

This solution satisfies the constraint ‖ε_i − ε^s‖₂ ≤ k. With this adversarial dropout condition, the objective function of adversarial dropout can be defined as below:

lAdD(w) =∑i

‖yi − (εadvi � xi)Tw‖2 (15)

When we isolate the terms with ε^adv, the above equation is translated into the formula below:

$$l(w) + \sum_i \sum_{j \in S_i} |x_{ij} \nabla_{x_{ij}} l(w)| + w^T \Gamma_{AdD}\, w \qquad (16)$$

where $S_i = \{j \mid \epsilon_{ij}^{adv} = 0\}$ and $\Gamma_{AdD} = \sum_i ((1 - \epsilon_i^{adv}) \odot x_i)^T ((1 - \epsilon_i^{adv}) \odot x_i)$.

The second term is an L1 regularization over the k largest loss changes from the features of each data point. The third term is an L2 regularization with Γ_AdD. These two penalty terms are related to the hyperparameter k controlling the degree of adversarial dropout, because k bounds the number of elements in each set S_i. When k becomes zero, the two penalty terms disappear because the constraint on ε allows no dropout.
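For concreteness, the mask of Eq. (14) can be computed feature-wise per data point; the sketch below is a literal transcription with hypothetical names, assuming the per-example input gradient is given.

```python
import numpy as np

def adversarial_dropout_mask(x_i, grad_i, k):
    """Eq. (14): zero out features whose products x_ij * grad_ij fall at or
    below min{s_ik, 0}, where s_ik is the k-th lowest such product, so only
    loss-increasing drops are made and at most ~k features are removed."""
    scores = x_i * grad_i                 # x_ij * d l(w) / d x_ij
    s_k = np.sort(scores)[k - 1]          # k-th lowest element
    return np.where(scores <= min(s_k, 0.0), 0.0, 1.0)
```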

There are two differences between adversarial dropout and adversarial training. First, the regularization terms of adversarial dropout depend on the scale of the features of each data point. In the L1 regularization, the gradients of the loss function are re-scaled by the data points. In the L2 regularization, the data points affect the scales of the weight costs. In contrast, the penalty terms of adversarial training depend on the degree of the adversarial noise, δ, which is static across the instances because δ is a single-valued hyperparameter given in the training process. Second, the penalty terms of adversarial dropout are selectively activated by the degree of the loss changes, while the penalty terms of adversarial training are always activated.

Conclusion

The key contribution of this paper is combining the ideas of adversarial training and dropout. The existing methods of adversarial training control a linear, additive perturbation only on the input layer. In contrast, we combined the concept of the perturbation with the dropout properties on hidden layers. An adversarially dropped structure becomes a poor ensemble model for the label assignment


even when very few nodes are changed. However, by learning with this poor structure, the model prevents over-fitting to a few effective features. The experiments showed that the generalization performances are improved by applying our adversarial dropout. Additionally, our approach achieved state-of-the-art performances of 3.55% on SVHN and 9.22% on CIFAR-10 by applying VAdD and VAT together for semi-supervised learning.

References

Baldi, P., and Sadowski, P. J. 2013. Understanding dropout. In Advances in Neural Information Processing Systems, 2814–2822.
Bishop, C. M. 1995a. Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1):108–116.
Bishop, C. M. 1995b. Regularization and complexity control in feed-forward networks.
Chen, N.; Zhu, J.; Chen, J.; and Zhang, B. 2014. Dropout training for support vector machines. arXiv preprint arXiv:1404.4171.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. In European Conference on Computer Vision, 630–645. Springer.
Hemmecke, R.; Köppe, M.; Lee, J.; and Weismantel, R. 2010. Nonlinear integer programming. In 50 Years of Integer Programming 1958–2008. Springer. 561–618.
Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.
Jain, P.; Kulkarni, V.; Thakurta, A.; and Williams, O. 2015. To drop or not to drop: Robustness, consistency and differential privacy properties of dropout. arXiv preprint arXiv:1503.02031.
Kellerer, H.; Pferschy, U.; and Pisinger, D. 2004. Introduction to NP-completeness of knapsack problems. In Knapsack Problems. Springer. 483–493.
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
Kurakin, A.; Goodfellow, I.; and Bengio, S. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
Laine, S., and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
Lasserre, J. A.; Bishop, C. M.; and Minka, T. P. 2006. Principled hybrids of generative and discriminative models. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, 87–94. IEEE.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. In Artificial Intelligence and Statistics, 562–570.
Li, Z.; Gong, B.; and Yang, T. 2016. Improved dropout for shallow and deep learning. In Advances in Neural Information Processing Systems, 2523–2531.
Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. arXiv preprint arXiv:1312.4400.
Maaten, L.; Chen, M.; Tyree, S.; and Weinberger, K. Q. 2013. Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 410–418.
Miyato, T.; Maeda, S.-i.; Koyama, M.; Nakae, K.; and Ishii, S. 2015. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677.
Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2017. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976.
Miyato, T.; Dai, A. M.; and Goodfellow, I. 2016. Virtual adversarial training for semi-supervised text classification. stat 1050:25.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, 5.
Poole, B.; Sohl-Dickstein, J.; and Ganguli, S. 2014. Analyzing noise in autoencoders and deep networks. arXiv preprint arXiv:1406.1831.
Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; and Raiko, T. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 3546–3554.
Sajjadi, M.; Javanmardi, M.; and Tasdizen, T. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, 1163–1171.
Salimans, T., and Kingma, D. P. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 901–901.
Salimans, T.; Goodfellow, I. J.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. CoRR abs/1606.03498.
Sanfeliu, A., and Fu, K.-S. 1983. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics (3):353–362.
Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. 2014. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
Springenberg, J. T. 2015. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390.
Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Wager, S.; Wang, S.; and Liang, P. S. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, 351–359.
Wang, S., and Manning, C. 2013. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 118–126.

Appendix A. Distance between Two Dropout Conditions

In this section, we describe the process of deriving the boundary condition from the constraint distance(ε, ε^s) < δ. We applied two distance metrics, the graph edit distance (GED) and the Jaccard distance (JD), and proved that restricting the upper bounds of these two metrics is equivalent to limiting the Euclidean distance:

$$GED(\epsilon^1, \epsilon^2) \propto JD(\epsilon^1, \epsilon^2) \propto \|\epsilon^1 - \epsilon^2\|_2 \qquad (17)$$

The following subsections state the corresponding propositions.

A.1. Graph Edit Distance

When we consider a neural network as a graph, we can apply the graph edit distance (Sanfeliu and Fu 1983) to measure the relative difference between two networks, g¹ and g², dropped by the dropout masks ε¹ and ε². The following is the definition of the graph edit distance (GED) between two networks:

$$GED(g^1, g^2) = \min_{(e_1, \ldots, e_k) \in P(g^1, g^2)} \sum_{i=1}^{k} c(e_i) \qquad (18)$$

where P(g¹, g²) denotes the set of edit paths transforming g¹ into g², c(e) ≥ 0 is the cost of each graph edit operation e, and k is the number of edit operations required to change g¹ into g². For simplification, we only considered edge insertion and deletion operations, each with cost 1. When a hidden node (vertex) is dropped, the GED cost is N_l + N_u, where N_l is the number of lower-layer nodes and N_u is the number of upper-layer nodes. If a hidden node (vertex) is revived, the change in GED is likewise N_l + N_u. This leads to the following proposition.

Proposition 1. Given two networks, g¹ and g², generated by two dropout masks ε¹ and ε², with all graph edit costs equal to c(e) = 1, the graph edit distance between the two dropout masks can be interpreted as:

$$GED(g^1, g^2) = (N_l + N_u)\, \|\epsilon^1 - \epsilon^2\|_2^2 \qquad (19)$$

Since ε¹ and ε² are binary masks, their squared Euclidean distance gives the number of differently dropped nodes.

A.2. Jaccard Distance

When we consider a dropout condition ε as a set of selected hidden nodes, we can apply the Jaccard distance to measure the difference between two dropout masks, ε¹ and ε². The following equation is the definition of the Jaccard distance:

$$JD(\epsilon^1, \epsilon^2) = \frac{|\epsilon^1 \cup \epsilon^2| - |\epsilon^1 \cap \epsilon^2|}{|\epsilon^1 \cup \epsilon^2|} \qquad (20)$$

Since ε¹ and ε² are binary vectors, |ε¹ ∩ ε²| can be converted to ‖ε¹ ⊙ ε²‖₂² and |ε¹ ∪ ε²| can be viewed as ‖ε¹ + ε² − ε¹ ⊙ ε²‖₂². This leads to the following proposition.

Proposition 2. Given two dropout masks ε¹ and ε², which are binary vectors, the Jaccard distance between them can be written as:

$$JD(\epsilon^1, \epsilon^2) = \frac{\|\epsilon^1 - \epsilon^2\|_2^2}{\|\epsilon^1 + \epsilon^2 - \epsilon^1 \odot \epsilon^2\|_2^2} \qquad (21)$$
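Because the masks are binary indicator vectors, the set quantities above reduce to simple mask arithmetic; a small sketch (assuming a non-empty union):

```python
import numpy as np

def num_changed(e1, e2):
    """Squared Euclidean distance of binary masks = # of differing units."""
    return np.sum((e1 - e2) ** 2)

def jaccard_distance(e1, e2):
    """Proposition 2, written directly in mask arithmetic."""
    inter = np.sum(e1 * e2)               # |e1 AND e2|
    union = np.sum(e1 + e2 - e1 * e2)     # |e1 OR e2|
    return (union - inter) / union
```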

Appendix B. Detailed Experiment Set-up

This section describes the network architectures and settings for the experimental results in this paper. The TensorFlow implementations for reproducing these results can be obtained from https://github.com/sungraepark/Adversarial-Dropout.

B.1. MNIST: Convolutional Neural Networks

Table 3: The CNN architecture used on MNIST.

Name   | Description
input  | 28 × 28 image
conv1  | 32 filters, 1 × 1, pad='same', ReLU
pool1  | Maxpool 2 × 2 pixels
drop1  | Dropout, p = 0.5
conv2  | 64 filters, 1 × 1, pad='same', ReLU
pool2  | Maxpool 2 × 2 pixels
drop2  | Dropout, p = 0.5
conv3  | 128 filters, 1 × 1, pad='same', ReLU
pool3  | Maxpool 2 × 2 pixels
adt    | Adversarial dropout, p = 0.5, δ = 0.005
dense1 | Fully connected 2048 → 625
dense2 | Fully connected 625 → 10
output | Softmax

The MNIST dataset (LeCun et al. 1998) consists of 70,000 handwritten digit images of size 28 × 28, where 60,000 images are used for training and the rest for testing. The CNN architecture is described in Table 3. All networks were trained using Adam (Kingma and Ba 2014) with a learning rate of 0.001 and momentum parameters of β₁ = 0.9 and β₂ = 0.999. In all implementations, we trained the model for 100 epochs with a minibatch size of 128.

For the constraint of adversarial dropout, we set δ = 0.005, which corresponds to 10 (2048 × 0.005) adversarial changes from the randomly selected dropout mask. In all training, we ramped up the trade-off parameter, λ, for the proposed regularization term, L_AdD. During the first 30 epochs, we used a Gaussian ramp-up curve exp[−5(1 − T)²], where T advances linearly from zero to one during the ramp-up period. The maximum value, λ_max, is 1.0 for VAdD (KL) and VAT, and 30.0 for VAdD (QE) and the Π model.
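A sketch of this ramp-up schedule (function names are ours):

```python
import numpy as np

def rampup(epoch, rampup_epochs=30):
    """Gaussian ramp-up exp[-5 (1 - T)^2]; T goes linearly from 0 to 1
    over the ramp-up period and stays at 1 afterwards."""
    t = min(max(epoch / float(rampup_epochs), 0.0), 1.0)
    return float(np.exp(-5.0 * (1.0 - t) ** 2))

# e.g., the trade-off weight for VAdD (KL) at epoch 10, with lambda_max = 1.0
lam = 1.0 * rampup(10)
```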

B.2. SVHN and CIFAR-10: Supervised and Semi-Supervised Learning

Table 4: The network architecture used on SVHN and CIFAR-10.

Name   | Description
input  | 32 × 32 RGB image
noise  | Additive Gaussian noise σ = 0.15
conv1a | 128 filters, 3 × 3, pad='same', LReLU(α = 0.1)
conv1b | 128 filters, 3 × 3, pad='same', LReLU(α = 0.1)
conv1c | 128 filters, 3 × 3, pad='same', LReLU(α = 0.1)
pool1  | Maxpool 2 × 2 pixels
drop1  | Dropout, p = 0.5
conv2a | 256 filters, 3 × 3, pad='same', LReLU(α = 0.1)
conv2b | 256 filters, 3 × 3, pad='same', LReLU(α = 0.1)
conv2c | 256 filters, 3 × 3, pad='same', LReLU(α = 0.1)
pool2  | Maxpool 2 × 2 pixels
conv3a | 512 filters, 3 × 3, pad='valid', LReLU(α = 0.1)
conv3b | 256 filters, 1 × 1, LReLU(α = 0.1)
conv3c | 128 filters, 1 × 1, LReLU(α = 0.1)
pool3  | Global average pool (6 × 6 → 1 × 1 pixels)
add    | Adversarial dropout, p = 1.0, δ = 0.05
dense  | Fully connected 128 → 10
output | Softmax

Both datasets, SVHN (Netzer et al. 2011) and CIFAR-10 (Krizhevsky and Hinton 2009), consist of 32 × 32 colour images in ten classes. For these experiments, we used the CNN of (Laine and Aila 2016; Miyato et al. 2017) described in Table 4. In all layers, we applied batch normalization for SVHN and mean-only batch normalization (Salimans and Kingma 2016) for CIFAR-10, with momentum 0.999. All networks were trained using Adam (Kingma and Ba 2014) with momentum parameters β₁ = 0.9 and β₂ = 0.999 and a maximum learning rate of 0.003. We ramped up the learning rate during the first 80 epochs using a Gaussian ramp-up curve exp[−5(1 − T)²], where T advances linearly from zero to one during the ramp-up period. Additionally, we annealed the learning rate to zero and the Adam parameter β₁ to 0.5 during the last 50 epochs. The total number of epochs is 300. These learning settings are the same as in (Laine and Aila 2016).

For adversarial dropout, we set the maximum value of the regularization weight, λ_max, to 1.0 for VAdD (KL) and 25.0 for VAdD (QE). We also ramped up the weight using the Gaussian ramp-up curve during the first 80 epochs. Additionally, we set δ to 0.05 and the dropout probability p to 1.0, which corresponds to dropping 6 of the 128 hidden units. We set the minibatch size to 100 for supervised learning, and to 32 labeled and 128 unlabeled examples for semi-supervised learning.

Appendix C. Definition of Notation

In this section, we describe the notations used throughout this paper.

Table 5: The notation used throughout this paper.

Notation        | Description
x               | An input of a neural network
y               | A true label
θ               | A set of parameters of a neural network
γ               | A noise vector of an additive Gaussian noise layer
ε               | A binary noise vector of a dropout layer
δ               | A hyperparameter controlling the intensity of the adversarial perturbation
λ               | A trade-off parameter controlling the impact of a regularization term
D[y, y′]        | A non-negative function that represents the distance between two output vectors: cross entropy (CE), KL divergence (KL), and quadratic error (QE)
f_θ(x)          | An output vector of a neural network with parameters (θ) and an input (x)
f_θ(x, ρ)       | An output vector of a neural network with parameters (θ), an input (x), and noise (ρ)
f^upper_{θ1}    | The upper part of the neural network f_θ(x, ε), above the adversarial dropout layer, where θ = {θ1, θ2}
f^under_{θ2}    | The under part of the neural network f_θ(x, ε), below the adversarial dropout layer

Appendix D. Performance Comparison with Other Models

D.1. CIFAR-10: Supervised Classification Results with Additional Baselines

We compared the reported performances of an additional close family of CNN-based classifiers for supervised learning. As mentioned in the paper, we did not consider the recent advanced architectures, such as ResNet (He et al. 2016) and DenseNet (Huang et al. 2016).

D.2. CIFAR-10: Semi-Supervised Classification Results with Additional Baselines

We compared the reported performances of additional baseline models for semi-supervised learning. Our implementations closely reproduced the reported results and showed the performance improvement from adversarial dropout.


Table 6: Supervised learning performance on CIFAR-10. Each setting is repeated five times.

Method                         | Error rate (%)
Network in Network (2013)      | 8.81
All-CNN (2014)                 | 7.25
Deeply-Supervised Net (2015)   | 7.97
Highway Network (2015)         | 7.72
Π model (2016)                 | 5.56
Temporal ensembling (2016)     | 5.60
VAT (2017)                     | 5.81
Π model (our implementation)   | 5.77 ± 0.11
VAT (our implementation)       | 5.65 ± 0.17
AdD                            | 5.46 ± 0.16
VAdD (KL)                      | 5.27 ± 0.10
VAdD (QE)                      | 5.24 ± 0.12
VAdD (KL) + VAT                | 4.40 ± 0.12
VAdD (QE) + VAT                | 4.73 ± 0.04

Table 7: Semi-supervised learning task on CIFAR-10 with 4,000 labeled examples. Each setting is repeated five times.

Method                             | Error rate (%)
Ladder network (2015)              | 20.40
CatGAN (2015)                      | 19.58
GAN with feature matching (2016)   | 18.63
Π model (2016)                     | 12.36
Temporal ensembling (2016)         | 12.16
Sajjadi et al. (2016)              | 11.29
VAT (2017)                         | 10.55
Π model (our implementation)       | 12.62 ± 0.29
VAT (our implementation)           | 11.96 ± 0.10
VAdD (KL)                          | 11.68 ± 0.19
VAdD (QE)                          | 11.32 ± 0.11
VAdD (KL) + VAT                    | 10.07 ± 0.11
VAdD (QE) + VAT                    | 9.22 ± 0.10

Appendix E. Proof of Linear Regression Regularization

In this section, we show the detailed derivation of the regularization terms from adversarial training and adversarial dropout.

Linear Regression with Adversarial Training

Let x_i ∈ R^D be a data point and y_i ∈ R be a target, where i ∈ {1, ..., N}. The objective of linear regression is to find the w ∈ R^D that minimizes $l(w) = \sum_i \|y_i - x_i^T w\|^2$.

To express adversarial examples, we denote by x̃_i = x_i + r_i^adv the adversarial example of x_i, where $r_i^{adv} = \delta\,\mathrm{sign}(\nabla_{x_i} l(w))$ utilizes the fast gradient sign method (FGSM) (Goodfellow, Shlens, and Szegedy 2014) and δ is a control parameter representing the degree of adversarial noise. With the adversarial examples, the objective function of adversarial training can be viewed as follows:

$$l_{AT}(w) = \sum_i \|y_i - (x_i + r_i^{adv})^T w\|^2 \qquad (22)$$

This can be expanded to

$$l_{AT}(w) = l(w) - 2\sum_i (y_i - x_i^T w)\, w^T r_i^{adv} + \sum_i w^T (r_i^{adv})^T (r_i^{adv})\, w \qquad (23)$$

where l(w) is the loss function without adversarial noise. Note that the gradient is $\nabla_{x_i} l(w) = -2(y_i - x_i^T w)\,w$, and that $a^T \mathrm{sign}(a) = \|a\|_1$. The above equation can then be transformed into:

$$l_{AT}(w) = l(w) + \sum_{ij} |\delta \nabla_{x_{ij}} l(w)| + \delta^2 w^T \Gamma_{AT}\, w \qquad (24)$$

where $\Gamma_{AT} = \sum_i \mathrm{sign}(\nabla_{x_i} l(w))^T\, \mathrm{sign}(\nabla_{x_i} l(w))$.

Linear Regression with Adversarial Dropout

To represent adversarial dropout, we denote by x̃_i = ε_i^adv ⊙ x_i the adversarially dropped input of x_i, where $\epsilon_i^{adv} = \arg\max_{\epsilon_i;\, \|\epsilon_i - \mathbf{1}\|_2 \le k} \|y_i - (\epsilon_i \odot x_i)^T w\|^2$ with the hyperparameter k controlling the degree of adversarial dropout. For simplification, we used the vector of ones as the base condition of the adversarial dropout. Applying our proposed algorithm, the adversarial dropout can be defined as follows:

$$\epsilon_{ij}^{adv} = \begin{cases} 0 & \text{if } x_{ij} \nabla_{x_{ij}} l(w) \le \min\{s_{ik}, 0\} \\ 1 & \text{otherwise} \end{cases} \qquad (25)$$

where $s_{ik}$ is the kth lowest element of $x_i \odot \nabla_{x_i} l(w)$. This solution satisfies the constraint ‖ε_i − 1‖₂ ≤ k. With this adversarial dropout condition, the objective function of adversarial dropout can be defined as the following:

$$l_{AdD}(w) = \sum_i \|y_i - (\epsilon_i^{adv} \odot x_i)^T w\|^2 \qquad (26)$$

This can be expanded to

$$l_{AdD}(w) = l(w) + 2\sum_i (y_i - x_i^T w)\, ((1 - \epsilon_i^{adv}) \odot x_i)^T w + \sum_i w^T ((1 - \epsilon_i^{adv}) \odot x_i)^T ((1 - \epsilon_i^{adv}) \odot x_i)\, w \qquad (27)$$

The second term of the right-hand side can be viewed as

$$2\sum_i (y_i - x_i^T w) \sum_j (1 - \epsilon_{ij}^{adv})\, x_{ij} w_j \qquad (28)$$

By defining the set $S_i = \{j \mid \epsilon_{ij}^{adv} = 0\}$, the second term can be transformed as follows:

$$-\sum_i \sum_{j \in S_i} -2(y_i - x_i^T w)\, w_j x_{ij} \qquad (29)$$

Note that the gradient is $\nabla_{x_{ij}} l(w) = -2(y_i - x_i^T w)\, w_j$, and that $x_{ij} \nabla_{x_{ij}} l(w)$ is always negative when j ∈ S_i. The second term can therefore be rewritten as:

$$\sum_i \sum_{j \in S_i} |x_{ij} \nabla_{x_{ij}} l(w)| \qquad (30)$$


Finally, the objective function of adversarial dropout is reorganized as

$$l_{AdD}(w) = l(w) + \sum_i \sum_{j \in S_i} |x_{ij} \nabla_{x_{ij}} l(w)| + w^T \Gamma_{AdD}\, w \qquad (31)$$

where $S_i = \{j \mid \epsilon_{ij}^{adv} = 0\}$ and $\Gamma_{AdD} = \sum_i ((1 - \epsilon_i^{adv}) \odot x_i)^T ((1 - \epsilon_i^{adv}) \odot x_i)$.