TRANSCRIPT
Defending Against Adversarial Attacks
Ross Clarke & Elre Oldewage
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge
7th November 2018
1/52
Outline
1 Introduction to Adversarial Examples
  Motivation
  Generating Adversarial Examples
2 Defending against Adversarial Samples
  Adversarial Training
  Ensembling
  Defensive Distillation
3 Defeating the Defences
  Black-Box Adversarial Attacks
4 Immunity to Adversarial Attacks
  Theoretical Conditions
  Empirical Results
2/52
Introduction to Adversarial Examples
Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy: Explaining and Harnessing Adversarial Examples (2015)
3/52
Motivation
4/52
What are adversarial examples?
Adversarial/Wild examples: input patterns that have been modified so that ML models misclassify them.
• Imperceptible
• Transferable
  • Attacks transfer between models with different architectures
  • Attacks can be generated using data from a distribution different to the training data
5/52
What are adversarial examples?
Adversarial examples indicate failure to learn underlying concepts.
ML models have built a Potemkin village that works for naturally occurring data, but is exposed as fake on adversarial points.
6/52
Generating Adversarial Examples
7/52
Adversarial Examples
Given an input x to a classifier F, construction of an adversarial example solves the minimization problem

    min_η ‖η‖   subject to   F(x + η) ≠ F(x)

Let the adversarial example x + η be denoted by x̃.
8/52
Fast Gradient Sign Method
Consider a classifier F. Let y denote the target associated with input x and C(F, x, y) be the cost used to train the model.
The Fast Gradient Sign Method (FGSM) computes the perturbation as:

    η = ε sign(∇x C(F, x, y))

where ε controls the size of the perturbation. Let x̃ = x + η.
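The perturbation is a single gradient computation. Below is a minimal sketch of FGSM, assuming a PyTorch classifier trained with cross-entropy; the names (model, x, y, epsilon) are illustrative, not from the paper's code.

import torch
import torch.nn.functional as nnF

def fgsm_perturb(model, x, y, epsilon):
    # Returns x~ = x + eps * sign(grad_x C(F, x, y)) for a cross-entropy cost C.
    x = x.clone().detach().requires_grad_(True)
    loss = nnF.cross_entropy(model(x), y)   # C(F, x, y)
    loss.backward()
    eta = epsilon * x.grad.sign()           # eta = eps * sign(grad_x C)
    return (x + eta).detach()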
9/52
Why do adversarial examples arise?
Hypotheses:
• Extreme non-linearity of NNs?
• Insufficient regularization?
• Insufficient model averaging?
Answer: Linear behaviour
10/52
Adversarial Examples
Digital precision is limited: if δ is the precision of the features and ‖η‖∞ < δ, then

    F(x + η) = F(x)

Thus the perturbations must have some minimum size in order to have an effect on the model's predictions.
11/52
Linearity in DNNs
Many small, imperceptible changes add up to a large change in activation due to linearity.
Consider a simple neuron with weights w. The dot product of the weights and the adversarial pattern is:

    wᵀx̃ = wᵀx + wᵀη

If
• each component of η has magnitude at least ε, and
• sign(η) = sign(w), and
• the average magnitude of w is m,
then the perturbation increases the activation by at least εmn, where n is the input dimensionality.
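As a sanity check of this argument, the toy snippet below (an illustration, not from the paper) compares the activation change from an ε-sized perturbation aligned with the weights against an equally sized random perturbation.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                   # input dimensionality
w = rng.normal(size=n)                       # neuron weights
eps = 0.01

eta_adv = eps * np.sign(w)                   # aligned with the weights
eta_rand = eps * rng.choice([-1.0, 1.0], n)  # same max-norm, random signs

print(w @ eta_adv)                  # grows like eps * m * n (large)
print(w @ eta_rand)                 # close to zero on average
print(eps * np.abs(w).mean() * n)   # the eps * m * n value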
12/52
Defending against Adversarial Samples
13/52
Adversarial Training
Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy:Explaining and Harnessing Adversarial Examples (2015)
14/52
Adversarial Training
• According to the universal approximation theorem, deep neural networks must be able to represent functions that are more robust to adversarial examples, if such functions exist.
• We expect such functions to exist (e.g. biological vision)
• But whether such functions can be discovered is a different question.
Idea: Incorporate robustness into the objective function
15/52
Adversarial Objective Function
The new objective function balances accuracy on the training set with adversarial robustness:

    C̃(F, x, y) = α C(F, x, y) + (1 − α) C(F, x̃, y)

where α ∈ [0, 1] and x̃ = x + ε sign(∇x C(F, x, y)) is generated using the FGSM.
We may already suspect that the adversarial robustness term will have a regularizing effect.
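A sketch of this objective, reusing the FGSM helper above (illustrative names; this is not the authors' implementation):

import torch.nn.functional as nnF

def adversarial_training_loss(model, x, y, alpha=0.5, epsilon=0.25):
    # C~(F, x, y) = alpha * C(F, x, y) + (1 - alpha) * C(F, x~, y)
    x_adv = fgsm_perturb(model, x, y, epsilon)    # x~ = x + eps * sign(grad_x C)
    clean = nnF.cross_entropy(model(x), y)        # C(F, x, y)
    adv = nnF.cross_entropy(model(x_adv), y)      # C(F, x~, y)
    return alpha * clean + (1 - alpha) * adv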
16/52
Results
                                      Maxout    Adversarial Training
Test Error Rate (%)                    1.14      0.78
Adversarial Error Rate (%)            89.4      17.9
Average Confidence on
Misclassified Inputs (%)              97.6      81.4

Maxout network with no adversarial training compared to maxout network with adversarial training (α = 0.5, ε = 0.25)
17/52
Adversarial training in perspective
• Active learning - the algorithm interactively queries a human labeler or other information source to obtain the labels of new data points. Here the model obtains the label of "new" perturbed inputs, where the correct label of a perturbed input is equal to the label of its unperturbed version
• Playing an adversarial game - minimize the upper bound of the expected cost over noisy samples
18/52
Isn’t it all just noise?
Instead of training with adversarial patterns, why not just train with more noise? Regularize the model to be insensitive to features smaller than ε by training on noisy patterns where the noise has max-norm ε.
• Unclear what ε-value the attacker is using
• Very inefficient - the expected dot product between such a noise vector and any other vector is zero, so there is very little overall effect
Adversarial training does hard example mining to find noise that makes a pattern difficult to classify
19/52
Control Experiment with Noise
Train maxout network with noisy inputs
• N1 randomly adds ±ε to every pixel
• N2 adds noise in U(−ε, ε) to every pixel
                                  M      A      N1     N2
Adversarial Error Rate (%)       89.4   17.9   86.2   90.4
Average Confidence on
Misclassified Inputs (%)         97.6   81.4   97.3   97.8

Error rates and confidence for models with no adversarial training (M), with adversarial training (A) and with noisy input training (N1, N2)
20/52
Ensembling
Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy: Explaining and Harnessing Adversarial Examples (2015)
21/52
Leveraging Ensembles
Recall the alternative hypothesis:
Adversarial examples arise due to insufficient model averaging.
The hypothesis can be tested by experimenting with an ensemble.
                                         M      A      Ensemble
Adversarial Error Rate (%)              89.4   17.9   91.1
Adversarial Error (Single Target) (%)    —      —     87.9

Error rates of an ensemble of 12 maxout networks for adversarial attacks targeting all the models in the ensemble (top) and targeting only one of the models (bottom).
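One way to target the whole ensemble at once (a sketch assuming all members are white-box PyTorch models; not the paper's code) is to apply FGSM to the summed cost over the members:

import torch
import torch.nn.functional as nnF

def fgsm_ensemble(models, x, y, epsilon):
    # Perturb against the joint cost of every ensemble member at once.
    x = x.clone().detach().requires_grad_(True)
    loss = sum(nnF.cross_entropy(m(x), y) for m in models)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()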
22/52
Defensive Distillation
Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha & Ananthram Swami:
Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks (2016)
23/52
A Better Approach
Ideally, we seek a defence which:
• Minimally modifies network architecture
• Maintains network accuracy
• Maintains prediction (not necessarily training) speed
• Handles adversarial samples close to training data
Defensive Distillation satisfies all these conditions.
24/52
Defensive Distillation
Design a suitable network architecture with a softmax output layer, then:
1 Train this network using one-hot target probability vectors
2 Relabel the training data with the probability vectors predicted by the first network
3 Train a second network of the same architecture using the relabelled training data
Ideas:
• Output probability vectors encode information about class relations absent in one-hot vectors
• This knowledge can be exploited to improve generalisation, hence resistance to perturbations
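A compact sketch of the procedure (assuming PyTorch; Net and loader are assumed placeholders for the chosen architecture and training data, and the temperature-T softmax used by the paper, e.g. T = 20 in the later results, is made explicit):

import torch
import torch.nn.functional as nnF

T = 20.0  # distillation temperature; predictions are made at T = 1

def soft_xent(logits, target_probs):
    # Cross-entropy against a probability vector, with a temperature-scaled softmax.
    return -(target_probs * nnF.log_softmax(logits / T, dim=1)).sum(dim=1).mean()

def fit(net, targets_fn, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = soft_xent(net(x), targets_fn(x, y))
            opt.zero_grad(); loss.backward(); opt.step()

# 1. First network: one-hot targets.
teacher = Net()
fit(teacher, lambda x, y: nnF.one_hot(y, num_classes=10).float())

# 2 & 3. Second network of the same architecture: targets are the first
#        network's temperature-T probability vectors.
student = Net()
fit(student, lambda x, y: torch.softmax(teacher(x) / T, dim=1).detach())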
25/52
Evaluation of Defensive Distillation
We might ask three questions about defensive distillation:
1 Does it harm classification accuracy?
2 Does it reduce the sensitivity of a network to its inputs?
3 Does it lead to more robust networks?
Define a robustness ρ for our network with classifier output F(·):

    ρ(F) = E_{p(x)}[Δ_min],   Δ_min = arg min { ‖η‖ : F(x + η) ≠ F(x) }

i.e. the expected smallest perturbation necessary to induce misclassification.
The authors use the 0-norm, i.e. the number of features changed.
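A crude way to estimate ρ(F) empirically (a sketch only: it grows an FGSM perturbation until the label flips and averages the smallest size found, whereas the authors measure the 0-norm):

import torch

def estimate_rho(model, xs, ys, eps_grid):
    sizes = []
    for x, y in zip(xs, ys):
        x, y = x.unsqueeze(0), y.unsqueeze(0)
        pred = model(x).argmax(dim=1)
        for eps in eps_grid:                        # smallest eps that flips the prediction
            x_adv = fgsm_perturb(model, x, y, eps)
            if model(x_adv).argmax(dim=1) != pred:
                sizes.append(eps)
                break
    return sum(sizes) / max(len(sizes), 1)          # empirical proxy for E[Delta_min]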
26/52
Evaluation: Setup
Evaluate using:
• MNIST and CIFAR10 datasets
• Typical Convolutional Neural Networks, with momentum, parameter decay and dropout
This achieves accuracy comparable to the state-of-the-art (99.51% on MNIST and 80.95% on CIFAR10¹).
To craft adversarial examples:
• Compute the output Jacobian (with respect to the input)
• Rank perturbation directions by sensitivity
• Move the most sensitive directions to their maximum permitted values, up to 112 altered features
Samples succeeded 97% of the time on MNIST and 92.78% on CIFAR10.
¹ For unaugmented datasets
27/52
Evaluation
1) Does it harm classification accuracy?
• Randomly select 100 samples from the test set
• Create a minimally perturbed version of each item which is misclassified as each other category

                                     MNIST    CIFAR10
Initial Accuracy (%)                 99.51    80.95
Distilled Accuracy (%)               99.05    81.39
Initial Adversarial Success (%)      95.89    89.9
Distilled Adversarial Success (%)     1.34    16.76
28/52
Evaluation
2) Does it reduce the sensitivity of a network to its inputs?
Consider mean gradient amplitudes from the CIFAR10 test set
• Network behaviour is smoother around training samples
• Adversarial samples thus require larger perturbations
29/52
Evaluation
3) Does it lead to more robust networks?
Compute robustness ρ(F) over the test set
• At T = 20, the MNIST robustness of 13.79% forces adversarial examples to be obvious or genuinely change class
• In e.g. text-based applications, changing many input features is difficult
30/52
Summary of Distillation
• Distillation achieves our specification and desires for a defence:
  • Minimally modifies network architecture and speed
  • Handles adversarial samples near training data
  • Increased resilience to adversarial samples without harming accuracy
  • Reduced input-output sensitivity
  • Increased network robustness
• Predicted probability vectors contain valuable information which can aid generalisation
• This information is demonstrably more valuable than setting all incorrect classification probabilities to the same, finite value
• Not all networks produce probability vectors or have sufficient capacity to defend against adversarial samples
31/52
Defeating the Defences
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik & Ananthram Swami:
Practical Black-Box Attacks against Machine Learning (2017)
32/52
A Gap in the Defences
Most adversarial attacks assume some knowledge of the internals of the target network.
Suppose we frame a more challenging adversarial problem:
• No information about the structure or parameters of the target network
• No access to a large training set
• Only able to observe the output classification (not the probability vector) given for inputs of our choice
We thus treat the target like an oracle, or a black box.
We seek the minimum perturbation required to misclassify our input.
33/52
Black-Box Attack
Idea: Train a local copy of the target network, then attack that, exploitingtransferability of adversarial samples between architectures:
1 Heuristically generate synthetic inputs and query the target network for its output classifications
2 Use these data pairs to train a local substitute network
3 Craft adversarial samples which defeat the substitute network
4 Transfer those samples to the target network
This poses two main challenges:
• We must choose a substitute architecture without knowing the target's architecture
• We must limit the number of queries submitted to the target network to ensure tractability and covertness
34/52
Substitute Design
Choice of architecture need not be a major hurdle:
• We know the form of the inputs and output
• We know what architectural features are generally used for each data type
• The precise configuration of layer sizes and numbers doesn't greatly affect attack success
Forming a training dataset is harder:
• Gaussian noise is insufficient for learning, likely because it is unrepresentative of the input distribution
• Instead, focus inputs on the directions in which the output is most sensitive: Jacobian-based Dataset Augmentation
• Let O(·) and F(·) be the classification functions of the oracle and substitute, respectively
• Perturb based on the sign of the Jacobian elements corresponding to the current class, λ sign(J_F(O(x)))
35/52
Substitute Training
Train the substitute as follows:
1 Collect a small number of representative inputs, not necessarily from the training distribution
2 Select a suitable architecture
3 Iteratively train more accurate substitutes using training set S_r:
  1 Label each unlabelled training point x ∈ S_r by y = O(x)
  2 Apply classical training techniques using the now-complete dataset
  3 Augment the training set by:

      S_{r+1} = S_r ∪ { x + λ sign(J_F(y)) | x ∈ S_r }
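A sketch of one augmentation round (illustrative names; substitute is the local model F, oracle_label is a call to the black-box target O, and lam is the step size λ):

import torch

def jacobian_augment(substitute, S_r, oracle_label, lam=0.1):
    S_next = list(S_r)
    for x in S_r:
        y = oracle_label(x)                                # y = O(x)
        x = x.clone().detach().requires_grad_(True)
        # Row of the input Jacobian J_F(x) selected by the oracle's label y.
        substitute(x.unsqueeze(0))[0, y].backward()
        S_next.append((x + lam * x.grad.sign()).detach())  # x + lambda * sign(J_F(y))
    return S_next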
36/52
Crafting Adversarial Samples
Create adversarial samples using two methods:
Fast Gradient Sign Method (Goodfellow et al.)
• Let the substitute network have a cost function C(F, x, y)
• Perturb the sample by η_x = ε sign(∇x C(F, x, y))
Papernot et al. Algorithm
• Let input component x_i have saliency for a target class t:

    S(x, t; i) = 0                                            if ∂F_t/∂x_i (x) < 0  or  Σ_{j≠t} ∂F_j/∂x_i (x) > 0
    S(x, t; i) = ∂F_t/∂x_i (x) · |Σ_{j≠t} ∂F_j/∂x_i (x)|      otherwise

• Add component perturbations to η_x in order of decreasing saliency until the input is misclassified
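A sketch of the saliency computation (illustrative; model is assumed to return the class probabilities F(x) for a single flattened input, and t is the target class):

import torch

def saliency_map(model, x, t):
    # Jacobian dF_j/dx_i of the class probabilities with respect to the input.
    jac = torch.autograd.functional.jacobian(
        lambda inp: model(inp.unsqueeze(0))[0], x)   # shape: (num_classes, *x.shape)
    dFt = jac[t]                                     # dF_t/dx_i
    d_other = jac.sum(dim=0) - dFt                   # sum over j != t of dF_j/dx_i
    S = dFt * d_other.abs()                          # dF_t/dx_i * |sum_{j!=t} dF_j/dx_i|
    S[(dFt < 0) | (d_other > 0)] = 0.0               # zero out the excluded cases
    return S

Components are then perturbed in decreasing order of S(x, t; i) until the substitute misclassifies the input.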
37/52
Validation: Creating the Substitute (MNIST)
Use the MNIST dataset and classification task on MetaMind
• The service provides automated black-box training and prediction without revealing network internals or permitting customisation
• Training accuracy was 94.97%
To craft attacks, try two initial training sets:
• 150 test-set MNIST characters
• 100 handcrafted handwritten digits in MNIST format
Final substitute accuracy was 81.20% for MNIST, 67.00% for handwritten digits
• We seek not to maximise the substitute's accuracy, but to mimic the target's decision boundaries
38/52
Validation: Attacking the Substitute (MNIST)
Use the Fast Gradient Sign Method, with perturbation magnitude ε. Define:
• Success rate: the proportion of adversarial samples misclassified by the substitute network
• Transferability: the proportion of misclassifications by the oracle/target
39/52
Validation: Creating the Substitute (GTSRB)
Use the German Traffic Sign Recognition Benchmark (GTSRB)
Train various network architectures with various test set subsets
• 71.42% accuracy for 1 000 training points
• 60.12% accuracy for 500 training points
40/52
Validation: Attacking the Substitute (GTSRB)
Use the Fast Gradient Sign Method.
• More transferable than MNIST; higher dimensionality
• No strong correlation between substitute accuracy and transferability
41/52
Defensive Distillation - Just Gradient Masking
Though the oracle's gradient is smoothed or masked and thus more difficult to exploit directly, the substitute model's gradients are not masked.
42/52
Transferability Across Architectures
DNN ID   Accuracy (ρ = 2) (%)   Accuracy (ρ = 6) (%)   Transferability (ρ = 6) (%)
A        30.50                  82.81                  75.74
F        68.67                  79.19                  64.28
G        72.88                  78.31                  61.17
H        56.70                  74.67                  63.44
I        57.68                  71.25                  43.48
J        64.39                  68.99                  47.03
K        58.53                  70.75                  54.45
L        67.73                  75.43                  65.95
M        62.64                  76.04                  62.00

Accuracy and transferability of models that differ from the oracle in layer type, number and size. Some models did not even have convolution layers or used sigmoids instead of ReLUs.
43/52
FGSM - ε and Transferability
The effect of ε on transferability when using the Fast Gradient Sign Method
44/52
Summary
Adversarial attacks can succeed under very weak assumptions:
• No knowledge of the target’s architecture
• No access to the target’s training data
• Limited training set for substitute model
by using the target as an oracle and generating synthetic data.
Largely due to the transferability of adversarial patterns.
45/52
Immunity to Adversarial Attacks
Yarin Gal & Lewis Smith:
Sufficient Conditions for Idealised Models to Have No Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks (2018)
46/52
Immunity to Adversarial Attacks
Idealised Network
1 An architecture which is invariant to the same transformations under which the data distribution is invariant (improves generalisability)
2 The ability to indicate whether an input is "valid"; that is, close to the training points
Idea: Either:
• An input is near the space occupied by the training data, so 1) gives that the correct classification should be made
• An input is far removed from this space, so 2) gives that we will be able to identify it as invalid
So idealised networks are immune to adversarial attacks. Consider Bayesian Neural Networks (BNNs) throughout.
47/52
A New Dataset
Introduce a dataset with known ground truth, Manifold MNIST:
• Train a VAE on MNIST, enforcing a 2D latent space
• Retain only 0, 1 and 4 samples, placing a small Gaussian on each to give an analytical density
• Discard MNIST latents; let the remainder be our ground truth
• Draw and decode 5 000 samples to use as a training set
The dataset violates the lack-of-ambiguity assumption.
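A sketch of the construction (illustrative; vae is an assumed pre-trained VAE with a 2D latent space exposing encode/decode, mnist_014 yields the retained 0, 1 and 4 images, and sigma is an assumed Gaussian width):

import torch

sigma = 0.05   # width of the small Gaussian placed on each retained latent (assumed value)
with torch.no_grad():
    means = torch.stack([vae.encode(x) for x in mnist_014])  # 2D latent means
    # Ground truth: an equally weighted mixture of these Gaussians, with an
    # analytical density; the original MNIST images are discarded.
    idx = torch.randint(len(means), (5_000,))
    z = means[idx] + sigma * torch.randn(5_000, 2)            # 5 000 latent samples
    train_images = vae.decode(z)                              # decoded training set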
48/52
Adversarial Attacks on Manifold MNIST
Making samples adversarial decreases their likelihood under the data distribution:
49/52
Adversarial Attacks on Manifold MNIST
With a LeNet BNN using Hamiltonian Monte Carlo (HMC) inference, uncertainty correlates with training set density:
49/52
Resistance of a HMC BNN to Adversarial Attacks
• Use the first-place attack from a NIPS 2017 competition: the Momentum Iterative Method (MIM)
• Sample new BNN weights each time, approximating the ensemble of possible BNNs
• Compare performance to a random noise attack and a deterministic (non-Bayesian) neural network

                         HMC BNN                          Deterministic NN
Perturbation     Adversarial      Noise            Adversarial      Noise
Magnitude ε      Success (%)      Success (%)      Success (%)      Success (%)
0.1              14 ± 3           10 ± 1           52 ± 0           3.0 ± 0.1
0.2              32 ± 2           23 ± 1           97 ± 0           3.0 ± 0.2
50/52
HMC BNN with Non-Idealised Data
Using the full MNIST dataset, and the same encoder, compare real-world inference (dropout) to near-perfect inference (HMC):
Key issue: approximate inference increases uncertainty too slowly. Observations for dropout are used to propose a new adversarial attack (not shown).
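A sketch of the dropout ("real-world") inference used for the comparison (illustrative; model is any network with dropout layers): predictions are averaged over stochastic forward passes and the predictive entropy serves as the uncertainty estimate.

import torch

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                          # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)               # averaged predictive probabilities
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=1)  # predictive entropy
    return mean, entropy                   # attacks succeed where entropy stays low off the data manifold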
51/52
Papers Discussed
1 Explaining and Harnessing Adversarial Examples
  Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy (2015)
2 Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks
  Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha & Ananthram Swami (2016)
3 Practical Black-Box Attacks against Machine Learning
  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik & Ananthram Swami (2017)
4 Sufficient Conditions for Idealised Models to Have No Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks
  Yarin Gal & Lewis Smith (2018)
52/52