TRANSCRIPT
Defending Against Adversarial Attacks
Ross Clarke & Elre Oldewage
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge
7th November 2018
1/52
Outline
1 Introduction to Adversarial Examples
  Motivation
  Generating Adversarial Examples
2 Defending against Adversarial Samples
  Adversarial Training
  Ensembling
  Defensive Distillation
3 Defeating the Defences
  Black-Box Adversarial Attacks
4 Immunity to Adversarial Attacks
  Theoretical Conditions
  Empirical Results
2/52
Introduction to Adversarial Examples
Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy: Explaining and Harnessing Adversarial Examples (2015)
3/52
Motivation
4/52
What are adversarial examples?
Adversarial/Wild examples: input patterns that have been modified so that ML models misclassify them.
• Imperceptible
• Transferable
  • Attacks transfer between models with different architectures
  • Attacks can be generated using data from a distribution different to the training data
5/52
What are adversarial examples?
Adversarial examples indicate failure to learn underlying concepts.
ML models have built a Potemkin village that works for naturally occurring data, but is exposed as fake on adversarial points.
6/52
Generating Adversarial Examples
7/52
Adversarial Examples
Given an input x to a classifier F, construction of an adversarial example solves the minimization problem

    min_η ‖η‖   subject to   F(x + η) ≠ F(x)

Let the adversarial example x + η be denoted by x̃.
8/52
Fast Gradient Sign Method
Consider a classifier F. Let y denote the target associated with input x and C(F, x, y) be the cost used to train the model.
The Fast Gradient Sign Method (FGSM) computes the perturbation as:

    η = ε sign(∇x C(F, x, y))

where ε controls the size of the perturbation. Let x̃ = x + η.
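The perturbation is a single gradient computation. Below is a minimal sketch of FGSM, assuming a PyTorch classifier trained with cross-entropy; the names (model, x, y, epsilon) are illustrative, not from the paper's code.

import torch
import torch.nn.functional as nnF

def fgsm_perturb(model, x, y, epsilon):
    # Returns x~ = x + eps * sign(grad_x C(F, x, y)) for a cross-entropy cost C.
    x = x.clone().detach().requires_grad_(True)
    loss = nnF.cross_entropy(model(x), y)   # C(F, x, y)
    loss.backward()
    eta = epsilon * x.grad.sign()           # eta = eps * sign(grad_x C)
    return (x + eta).detach()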
9/52
Why do adversarial examples arise?
Hypotheses:
• Extreme non-linearity of NNs?
• Insufficient regularization?
• Insufficient model averaging?
Answer: Linear behaviour
10/52
Adversarial Examples
Digital precision is limited: if δ is the precision of the features and ‖η‖∞ < δ, then

    F(x + η) = F(x)

Thus the perturbations must have some minimum size in order to have an effect on the model's predictions.
11/52
Linearity in DNNs
Many small, imperceptible changes add up to a large change in activation due to linearity.
Consider a simple neuron with weights w. The dot product of the weights and the adversarial pattern is:

    wᵀx̃ = wᵀx + wᵀη

If
• each component of η has magnitude at least ε, and
• sign(η) = sign(w), and
• the average magnitude of w is m,
then the perturbation increases the activation by at least εmn, where n is the input dimensionality.
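As a sanity check of this argument, the toy snippet below (an illustration, not from the paper) compares the activation change from an ε-sized perturbation aligned with the weights against an equally sized random perturbation.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                   # input dimensionality
w = rng.normal(size=n)                       # neuron weights
eps = 0.01

eta_adv = eps * np.sign(w)                   # aligned with the weights
eta_rand = eps * rng.choice([-1.0, 1.0], n)  # same max-norm, random signs

print(w @ eta_adv)                  # grows like eps * m * n (large)
print(w @ eta_rand)                 # close to zero on average
print(eps * np.abs(w).mean() * n)   # the eps * m * n value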
12/52
Defending against Adversarial Samples
13/52
Adversarial Training
Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy:Explaining and Harnessing Adversarial Examples (2015)
14/52
Adversarial Training
• According to the universal approximation theorem, deep neural networks must be able to represent functions that are more robust to adversarial examples, if such functions exist.
• We expect such functions to exist (e.g. biological vision)
• But whether such functions can be discovered is a different question.
Idea: Incorporate robustness into the objective function
15/52
Adversarial Objective Function
The new objective function balances accuracy on the training set with adversarial robustness:

    C̃(F, x, y) = α C(F, x, y) + (1 − α) C(F, x̃, y)

where α ∈ [0, 1] and x̃ = x + ε sign(∇x C(F, x, y)) is generated using the FGSM.
We may already suspect that the adversarial robustness term will have a regularizing effect.
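A sketch of this objective, reusing the FGSM helper above (illustrative names; this is not the authors' implementation):

import torch.nn.functional as nnF

def adversarial_training_loss(model, x, y, alpha=0.5, epsilon=0.25):
    # C~(F, x, y) = alpha * C(F, x, y) + (1 - alpha) * C(F, x~, y)
    x_adv = fgsm_perturb(model, x, y, epsilon)    # x~ = x + eps * sign(grad_x C)
    clean = nnF.cross_entropy(model(x), y)        # C(F, x, y)
    adv = nnF.cross_entropy(model(x_adv), y)      # C(F, x~, y)
    return alpha * clean + (1 - alpha) * adv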
16/52
Results
                                      Maxout    Adversarial Training
Test Error Rate (%)                    1.14      0.78
Adversarial Error Rate (%)            89.4      17.9
Average Confidence on
Misclassified Inputs (%)              97.6      81.4

Maxout network with no adversarial training compared to maxout network with adversarial training (α = 0.5, ε = 0.25)
17/52
Adversarial training in perspective
• Active learning - the algorithm interactively queries a human labeler or other information source to obtain the labels of new data points. Here the model obtains the label of "new" perturbed inputs, where the correct label of a perturbed input is equal to the label of its unperturbed version
• Playing an adversarial game - minimize the upper bound of the expected cost over noisy samples
18/52
Isn’t it all just noise?
Instead of training with adversarial patterns, why not just train with more noise? Regularize the model to be insensitive to features smaller than ε by training on noisy patterns where the noise has max-norm ε.
• Unclear what ε-value the attacker is using
• Very inefficient - the expected dot product between such a noise vector and any other vector is zero, so there is very little overall effect
Adversarial training does hard example mining to find noise that makes a pattern difficult to classify
19/52
Control Experiment with Noise
Train maxout network with noisy inputs
• N1 randomly adds ±ε to every pixel
• N2 adds noise in U(−ε, ε) to every pixel
                                  M      A      N1     N2
Adversarial Error Rate (%)       89.4   17.9   86.2   90.4
Average Confidence on
Misclassified Inputs (%)         97.6   81.4   97.3   97.8

Error rates and confidence for models with no adversarial training (M), with adversarial training (A) and with noisy input training (N1, N2)
20/52
Ensembling
Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy: Explaining and Harnessing Adversarial Examples (2015)
21/52
Leveraging Ensembles
Recall the alternative hypothesis:
Adversarial examples arise due to insufficient model averaging.
The hypothesis can be tested by experimenting with an ensemble.
                                         M      A      Ensemble
Adversarial Error Rate (%)              89.4   17.9   91.1
Adversarial Error (Single Target) (%)    —      —     87.9

Error rates of an ensemble of 12 maxout networks for adversarial attacks targeting all the models in the ensemble (top) and targeting only one of the models (bottom).
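One way to target the whole ensemble at once (a sketch assuming all members are white-box PyTorch models; not the paper's code) is to apply FGSM to the summed cost over the members:

import torch
import torch.nn.functional as nnF

def fgsm_ensemble(models, x, y, epsilon):
    # Perturb against the joint cost of every ensemble member at once.
    x = x.clone().detach().requires_grad_(True)
    loss = sum(nnF.cross_entropy(m(x), y) for m in models)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()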
22/52
Defensive Distillation
Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha & Ananthram Swami:
Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks (2016)
23/52
A Better Approach
Ideally, we seek a defence which:
• Minimally modifies network architecture
• Maintains network accuracy
• Maintains prediction (not necessarily training) speed
• Handles adversarial samples close to training data
Defensive Distillation satisfies all these conditions.
24/52
Defensive Distillation
Design a suitable network architecture with a softmax output layer, then:
1 Train this network using one-hot target probability vectors
2 Relabel the training data with the probability vectors predicted by the first network
3 Train a second network of the same architecture using the relabelled training data
Ideas:
• Output probability vectors encode information about class relations absent in one-hot vectors
• This knowledge can be exploited to improve generalisation, hence resistance to perturbations
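A compact sketch of the procedure (assuming PyTorch; Net and loader are assumed placeholders for the chosen architecture and training data, and the temperature-T softmax used by the paper, e.g. T = 20 in the later results, is made explicit):

import torch
import torch.nn.functional as nnF

T = 20.0  # distillation temperature; predictions are made at T = 1

def soft_xent(logits, target_probs):
    # Cross-entropy against a probability vector, with a temperature-scaled softmax.
    return -(target_probs * nnF.log_softmax(logits / T, dim=1)).sum(dim=1).mean()

def fit(net, targets_fn, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = soft_xent(net(x), targets_fn(x, y))
            opt.zero_grad(); loss.backward(); opt.step()

# 1. First network: one-hot targets.
teacher = Net()
fit(teacher, lambda x, y: nnF.one_hot(y, num_classes=10).float())

# 2 & 3. Second network of the same architecture: targets are the first
#        network's temperature-T probability vectors.
student = Net()
fit(student, lambda x, y: torch.softmax(teacher(x) / T, dim=1).detach())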
25/52
Evaluation of Defensive Distillation
We might ask three questions about defensive distillation:
1 Does it harm classification accuracy?
2 Does it reduce the sensitivity of a network to its inputs?
3 Does it lead to more robust networks?
Define a robustness ρ for our network with classifier output F(·):

    ρ(F) = E_{p(x)}[Δ_min],   Δ_min = arg min { ‖η‖ : F(x + η) ≠ F(x) }

i.e. the expected smallest perturbation necessary to induce misclassification.
The authors use the 0-norm, i.e. the number of features changed.
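A crude way to estimate ρ(F) empirically (a sketch only: it grows an FGSM perturbation until the label flips and averages the smallest size found, whereas the authors measure the 0-norm):

import torch

def estimate_rho(model, xs, ys, eps_grid):
    sizes = []
    for x, y in zip(xs, ys):
        x, y = x.unsqueeze(0), y.unsqueeze(0)
        pred = model(x).argmax(dim=1)
        for eps in eps_grid:                        # smallest eps that flips the prediction
            x_adv = fgsm_perturb(model, x, y, eps)
            if model(x_adv).argmax(dim=1) != pred:
                sizes.append(eps)
                break
    return sum(sizes) / max(len(sizes), 1)          # empirical proxy for E[Delta_min]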
26/52
Evaluation: Setup
Evaluate using:
• MNIST and CIFAR10 datasets
• Typical Convolutional Neural Networks, with momentum, parameter decay and dropout
This achieves accuracy comparable to the state-of-the-art (99.51% on MNIST and 80.95% on CIFAR10¹).
To craft adversarial examples:
• Compute the output Jacobian (with respect to the input)
• Rank perturbation directions by sensitivity
• Move the most sensitive directions to their maximum permitted values, up to 112 altered features
Samples succeeded 97% of the time on MNIST and 92.78% on CIFAR10.
¹ For unaugmented datasets
27/52
Evaluation
1) Does it harm classification accuracy?
• Randomly select 100 samples from the test set
• Create a minimally perturbed version of each item which is misclassified as each other category

                                     MNIST    CIFAR10
Initial Accuracy (%)                 99.51    80.95
Distilled Accuracy (%)               99.05    81.39
Initial Adversarial Success (%)      95.89    89.9
Distilled Adversarial Success (%)     1.34    16.76
28/52
Evaluation
2) Does it reduce the sensitivity of a network to its inputs?
Consider mean gradient amplitudes from the CIFAR10 test set
• Network behaviour is smoother around training samples
• Adversarial samples thus require larger perturbations
29/52
Evaluation
3) Does it lead to more robust networks?
Compute robustness ρ(F) over the test set
• At T = 20, the MNIST robustness of 13.79% forces adversarial examples to be obvious or genuinely change class
• In e.g. text-based applications, changing many input features is difficult
30/52
Summary of Distillation
• Distillation achieves our specification and desires for a defence:
  • Minimally modifies network architecture and speed
  • Handles adversarial samples near training data
  • Increased resilience to adversarial samples without harming accuracy
  • Reduced input-output sensitivity
  • Increased network robustness
• Predicted probability vectors contain valuable information which can aid generalisation
• This information is demonstrably more valuable than setting all incorrect classification probabilities to the same, finite value
• Not all networks produce probability vectors or have sufficient capacity to defend against adversarial samples
31/52
Defeating the Defences
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik & Ananthram Swami:
Practical Black-Box Attacks against Machine Learning (2017)
32/52
A Gap in the Defences
Most adversarial attacks assume some knowledge of the internals of the target network.
Suppose we frame a more challenging adversarial problem:
• No information about the structure or parameters of the target network
• No access to a large training set
• Only able to observe the output classification (not the probability vector) given for inputs of our choice
We thus treat the target like an oracle, or a black box.
We seek the minimum perturbation required to misclassify our input.
33/52
Black-Box Attack
Idea: Train a local copy of the target network, then attack that, exploitingtransferability of adversarial samples between architectures:
1 Heuristically generate synthetic inputs and query the target network for its output classifications
2 Use these data pairs to train a local substitute network
3 Craft adversarial samples which defeat the substitute network
4 Transfer those samples to the target network
This poses two main challenges:
• We must choose a substitute architecture without knowing the target's architecture
• We must limit the number of queries submitted to the target network to ensure tractability and covertness
34/52
Substitute Design
Choice of architecture need not be a major hurdle:
• We know the form of the inputs and output
• We know what architectural features are generally used for each data type
• The precise configuration of layer sizes and numbers doesn't greatly affect attack success
Forming a training dataset is harder:
• Gaussian noise is insufficient for learning, likely because it is unrepresentative of the input distribution
• Instead, focus inputs on the directions in which the output is most sensitive: Jacobian-based Dataset Augmentation
• Let O(·) and F(·) be the classification functions of the oracle and substitute, respectively
• Perturb based on the sign of the Jacobian elements corresponding to the current class, λ sign(J_F(O(x)))
35/52
Substitute Training
Train the substitute as follows:
1 Collect a small number of representative inputs, not necessarily from the training distribution
2 Select a suitable architecture
3 Iteratively train more accurate substitutes using training set S_r:
  1 Label each unlabelled training point x ∈ S_r by y = O(x)
  2 Apply classical training techniques using the now-complete dataset
  3 Augment the training set by:

      S_{r+1} = S_r ∪ { x + λ sign(J_F(y)) | x ∈ S_r }
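A sketch of one augmentation round (illustrative names; substitute is the local model F, oracle_label is a call to the black-box target O, and lam is the step size λ):

import torch

def jacobian_augment(substitute, S_r, oracle_label, lam=0.1):
    S_next = list(S_r)
    for x in S_r:
        y = oracle_label(x)                                # y = O(x)
        x = x.clone().detach().requires_grad_(True)
        # Row of the input Jacobian J_F(x) selected by the oracle's label y.
        substitute(x.unsqueeze(0))[0, y].backward()
        S_next.append((x + lam * x.grad.sign()).detach())  # x + lambda * sign(J_F(y))
    return S_next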
36/52
Crafting Adversarial Samples
Create adversarial samples using two methods:
Fast Gradient Sign Method (Goodfellow et al.)
• Let the substitute network have a cost function C(F, x, y)
• Perturb the sample by η_x = ε sign(∇x C(F, x, y))
Papernot et al. Algorithm
• Let input component x_i have saliency for a target class t:

    S(x, t; i) = 0                                            if ∂F_t/∂x_i (x) < 0  or  Σ_{j≠t} ∂F_j/∂x_i (x) > 0
    S(x, t; i) = ∂F_t/∂x_i (x) · |Σ_{j≠t} ∂F_j/∂x_i (x)|      otherwise

• Add component perturbations to η_x in order of decreasing saliency until the input is misclassified
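A sketch of the saliency computation (illustrative; model is assumed to return the class probabilities F(x) for a single flattened input, and t is the target class):

import torch

def saliency_map(model, x, t):
    # Jacobian dF_j/dx_i of the class probabilities with respect to the input.
    jac = torch.autograd.functional.jacobian(
        lambda inp: model(inp.unsqueeze(0))[0], x)   # shape: (num_classes, *x.shape)
    dFt = jac[t]                                     # dF_t/dx_i
    d_other = jac.sum(dim=0) - dFt                   # sum over j != t of dF_j/dx_i
    S = dFt * d_other.abs()                          # dF_t/dx_i * |sum_{j!=t} dF_j/dx_i|
    S[(dFt < 0) | (d_other > 0)] = 0.0               # zero out the excluded cases
    return S

Components are then perturbed in decreasing order of S(x, t; i) until the substitute misclassifies the input.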
37/52
Validation: Creating the Substitute (MNIST)
Use the MNIST dataset and classification task on MetaMind
• The service provides automated black-box training and prediction without revealing network internals or permitting customisation
• Training accuracy was 94.97%
To craft attacks, try two initial training sets:
• 150 test-set MNIST characters
• 100 handcrafted handwritten digits in MNIST format
Final substitute accuracy was 81.20% for MNIST, 67.00% for handwritten digits
• We seek not to maximise the substitute's accuracy, but to mimic the target's decision boundaries
38/52
Validation: Attacking the Substitute (MNIST)
Use the Fast Gradient Sign Method, with perturbation magnitude ε. Define:
• Success rate: the proportion of adversarial samples misclassified by the substitute network
• Transferability: the proportion of misclassifications by the oracle/target
39/52
Validation: Creating the Substitute (GTSRB)
Use the German Traffic Sign Recognition Benchmark (GTSRB)
Train various network architectures with various test set subsets
• 71.42% accuracy for 1 000 training points
• 60.12% accuracy for 500 training points
40/52
Validation: Attacking the Substitute (GTSRB)
Use the Fast Gradient Sign Method.
• More transferable than MNIST; higher dimensionality
• No strong correlation between substitute accuracy and transferability
41/52
Defensive Distillation - Just Gradient Masking
Though the oracle's gradient is smoothed or masked and thus more difficult to exploit directly, the substitute model's gradients are not masked.
42/52
Transferability Across Architectures
DNN ID   Accuracy (ρ = 2) (%)   Accuracy (ρ = 6) (%)   Transferability (ρ = 6) (%)
A        30.50                  82.81                  75.74
F        68.67                  79.19                  64.28
G        72.88                  78.31                  61.17
H        56.70                  74.67                  63.44
I        57.68                  71.25                  43.48
J        64.39                  68.99                  47.03
K        58.53                  70.75                  54.45
L        67.73                  75.43                  65.95
M        62.64                  76.04                  62.00

Accuracy and transferability of models that differ from the oracle in layer type, number and size. Some models did not even have convolution layers or used sigmoids instead of ReLUs.
43/52
FGSM - ε and Transferability
The effect of ε on transferability when using the Fast Gradient Sign Method
44/52
Summary
Adversarial attacks can succeed under very weak assumptions:
• No knowledge of the target’s architecture
• No access to the target’s training data
• Limited training set for substitute model
by using the target as an oracle and generating synthetic data.
Largely due to the transferability of adversarial patterns.
45/52
Immunity to Adversarial Attacks
Yarin Gal & Lewis Smith:
Sufficient Conditions for Idealised Models to Have No Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks (2018)
46/52
Immunity to Adversarial Attacks
Idealised Network
1 An architecture which is invariant to the same transformations under which the data distribution is invariant (improves generalisability)
2 The ability to indicate whether an input is "valid"; that is, close to the training points
Idea: Either:
• An input is near the space occupied by the training data, so 1) gives that the correct classification should be made
• An input is far removed from this space, so 2) gives that we will be able to identify it as invalid
So idealised networks are immune to adversarial attacks. Consider Bayesian Neural Networks (BNNs) throughout.
47/52
A New Dataset
Introduce a dataset with known ground truth, Manifold MNIST:
• Train a VAE on MNIST, enforcing a 2D latent space
• Retain only 0, 1 and 4 samples, placing a small Gaussian on each to give an analytical density
• Discard MNIST latents; let the remainder be our ground truth
• Draw and decode 5 000 samples to use as a training set
The dataset violates the lack-of-ambiguity assumption.
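A sketch of the construction (illustrative; vae is an assumed pre-trained VAE with a 2D latent space exposing encode/decode, mnist_014 yields the retained 0, 1 and 4 images, and sigma is an assumed Gaussian width):

import torch

sigma = 0.05   # width of the small Gaussian placed on each retained latent (assumed value)
with torch.no_grad():
    means = torch.stack([vae.encode(x) for x in mnist_014])  # 2D latent means
    # Ground truth: an equally weighted mixture of these Gaussians, with an
    # analytical density; the original MNIST images are discarded.
    idx = torch.randint(len(means), (5_000,))
    z = means[idx] + sigma * torch.randn(5_000, 2)            # 5 000 latent samples
    train_images = vae.decode(z)                              # decoded training set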
48/52
Adversarial Attacks on Manifold MNIST
Making samples adversarial decreases their likelihood under the data distribution:
49/52
Adversarial Attacks on Manifold MNIST
With a LeNet BNN using Hamiltonian Monte Carlo (HMC) inference, uncertainty correlates with training set density:
49/52
Resistance of a HMC BNN to Adversarial Attacks
• Use the first-place attack from a NIPS 2017 competition: the Momentum Iterative Method (MIM)
• Sample new BNN weights each time, approximating the ensemble of possible BNNs
• Compare performance to a random noise attack and a deterministic (non-Bayesian) neural network

                         HMC BNN                          Deterministic NN
Perturbation     Adversarial      Noise            Adversarial      Noise
Magnitude ε      Success (%)      Success (%)      Success (%)      Success (%)
0.1              14 ± 3           10 ± 1           52 ± 0           3.0 ± 0.1
0.2              32 ± 2           23 ± 1           97 ± 0           3.0 ± 0.2
50/52
HMC BNN with Non-Idealised Data
Using the full MNIST dataset, and the same encoder, compare real-world inference (dropout) to near-perfect inference (HMC):
Key issue: approximate inference increases uncertainty too slowly. Observations for dropout are used to propose a new adversarial attack (not shown).
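A sketch of the dropout ("real-world") inference used for the comparison (illustrative; model is any network with dropout layers): predictions are averaged over stochastic forward passes and the predictive entropy serves as the uncertainty estimate.

import torch

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                          # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)               # averaged predictive probabilities
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=1)  # predictive entropy
    return mean, entropy                   # attacks succeed where entropy stays low off the data manifold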
51/52
Papers Discussed
1 Explaining and Harnessing Adversarial Examples
  Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy (2015)
2 Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks
  Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha & Ananthram Swami (2016)
3 Practical Black-Box Attacks against Machine Learning
  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik & Ananthram Swami (2017)
4 Sufficient Conditions for Idealised Models to Have No Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks
  Yarin Gal & Lewis Smith (2018)
52/52