TRANSCRIPT
Adversarial Attacks on Deep-learning based NLP
Tutorial @ ICONIP’20
Dr. Wei (Emma) Zhang
21 November 2020
Self-Introduction – Wei (Emma) Zhang
• Short Bio:
– Lecturer, School of Computer Science, The University of Adelaide, Australia. July 2019 – now
– Research fellow, Department of Computing, Macquarie University. Mar 2017- June 2019
– Ph.D. in Computer Science from the University of Adelaide, Australia, Aug 2013 – Feb 2017.
• Research:
– Text mining, Natural Language Processing
– Internet of Things Applications
2
Motivation of this Tutorial
• Deep neural networks (DNNs) have gained significant popularity in many Artificial Intelligence (AI) applications
• DNNs are vulnerable to strategically modified samples, named adversarial examples
• Most attention has been devoted to generating adversarial examples for Computer Vision applications
• Relatively few works target Natural Language Processing DNN models, but they show a promising increasing trend
3
Expected Goal of this Tutorial
• Develop a shared vocabulary for talking about adversarial attacks on textual DNNs
• Understand adversarial attacks on textual DNNs and how they differ from attacks on images
• Perform black-box and white-box attacks
• Adopt defence strategies
4
Adversarial Examples- (very) Brief History
• History
– L-BFGS [Szegedy et al. ICLR’14]
• Coined the term “adversarial examples”: worst-case inputs for a model
• Finds the minimum-distance perturbation of an original point that incorrectly changes the output (label)
– FGSM [Goodfellow et al. ICLR’15]: Fast Gradient Sign Method
• Linear explanation
• Fast computation
– [Jia and Liang EMNLP’17]: first work in NLP
• Most of the papers are on computer vision – more than three times as many as on NLP.
5
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• (a 5 mins break)
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
6
Wei Emma Zhang, Quan Z. Sheng, Ahoud Abdulrahmn F. Alhazmi, Chenliang Li: Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 11(3): 24:1-24:41, 2020
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
7
An Example
• Paragraph: “The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined.”
• Question: “The number of new Huguenot colonists declined after what year?”
• Correct Answer: “1700”
8
Model used: BiDAF Ensemble (Seo et al., 2016)
Robin Jia and Percy Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP’17.
An Example
• Paragraph: “The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675.”
• Question: “The number of new Huguenot colonists declined after what year?”
• Correct Answer: “1700”
• Predicted Answer: “1675”
9
Robin Jia and Percy Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP’17.
Model used: BiDAF Ensemble (Seo et al., 2016)
Adversarial Examples
10
Formal Definition
Given:
– A DNN model 𝑓
– An allowed perturbation set 𝑆 with certain constraints
An adversarial example for 𝑥 is a point 𝑥′ = 𝑥 + 𝜂 for 𝜂 ∈ 𝑆, 𝑠.𝑡.
𝑓(𝑥 + 𝜂) ≠ 𝑓(𝑥) (untargeted), or 𝑓(𝑥 + 𝜂) = 𝑦′ (targeted)
11
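A minimal sketch of this definition in code (illustrative, not from the slides): here f is any classifier that returns a label, and the allowed set S is an L∞ ball of radius eps.

```python
import numpy as np

# Illustrative check of the definition above: x_adv = x + eta is adversarial
# if eta stays inside the allowed set S (here an L-infinity ball of radius eps)
# and the prediction changes (untargeted) or hits a chosen label y' (targeted).
def is_adversarial(f, x, x_adv, eps, target=None):
    eta = (x_adv - x).ravel()
    if np.linalg.norm(eta, ord=np.inf) > eps:
        return False                      # perturbation not in the allowed set S
    if target is None:
        return f(x_adv) != f(x)           # untargeted: any label change counts
    return f(x_adv) == target             # targeted: must equal the chosen y'
```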
Terminology
12
• Adversarial examples
– Perturbed examples.
• Adversarial attack (Evasion Attack)
– A method for generating adversarial examples
• Adversarial Machine Learning
– Technique that attempts to fool models by supplying deceptive input.
Terminology
13
• Adversarial Training
– The process of introducing adversarial examples into model training to make the model more robust.
• Generative Adversarial Nets – a different concept, not to be confused!
– Non-cooperative Game
Why do adversarial attacks matter?
14
[Song et al. ICML’18]
Why do adversarial attacks matter?
15
Samuel G. Finlayson et al. Science 2019;363:1287-1289
Why do adversarial attacks matter?
16
Samuel G. Finlayson et al. Science 2019;363:1287-1289
17
Why do adversarial attacks matter?
Why do adversarial attacks matter?
18
To study adversarial attacks
• Test the robustness of the model against worst-case examples.
19
• Discern how a model actually understands its input.
• Improve models through training and optimization with adversarial examples (adversarial training).
How to attack?
20
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
21
Prepare for an attack
• Black-box or White-box
• Targeted or Untargeted
• Character-level, Word-level, Sentence-level, or Subword-level
• Single Modal or Multi-Modal
22
Prepare for an attack
24
Wei Emma Zhang, Quan Z. Sheng, Ahoud Abdulrahmn F. Alhazmi, Chenliang Li: Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 11(3): 24:1-24:41, 2020
General Steps
• Attack Positions
– Which word/subword/character?
– The whole sentence
– Representation in latent space
25
• Attack Strategies
insert, delete, switch, replace with:
– Similar token
– Synonyms, paraphrase
– Token within a constraint distance
General Steps
• Perturbation Control: a way to measure the size of the perturbation, so that it can be controlled to keep fooling the victim DNN while remaining barely perceivable.
26
– Edit-based measurement (on the original text)
• e.g., Levenshtein Distance
– Jaccard similarity coefficient (on the original text)
• Checks token overlap
– Semantic-preserving measurement (on both)
– Norm-based measurement (L-p) (on the vectorized representation)
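A small, plain-Python sketch of two of these measures (illustrative, not tied to any particular attack):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit-based measurement: minimum number of character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity coefficient: token overlap between the two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

print(levenshtein("film", "flim"))                           # 2
print(jaccard("the film was great", "the film was awful"))   # 0.6
```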
General Steps
27
• Language Control
– Valid word or word embedding
– Grammar checker
– Language model (perplexity; see the sketch below)
– Paraphrases
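The perplexity check is often implemented with a pretrained causal language model; a hedged sketch using GPT-2 via Hugging Face transformers (an illustrative choice, not specified in the slides). Lower perplexity means the candidate reads as more natural text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss      # mean negative log-likelihood per token
    return float(torch.exp(loss))

# A perturbed candidate with much higher perplexity than the original can be rejected.
print(perplexity("The movie was surprisingly good."))
print(perplexity("The movie was surprsingly god."))
```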
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
28
Black-Box Attack
29
[Diagram] Original Example → Attack → Adversarial Example → Application DNN → Output
Black-Box Attack
• Edit Adversaries
• Paraphrase Adversaries
• GAN-based Adversaries
• BERT-based Adversaries
30
Jia and Liang EMNLP’17
• Generate concatenative adversaries
– Append distracting text to the paragraph
– Must ensure that added text does not actually answer the question
31
Paragraph: “The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675.”
Question: “The number of new Huguenot colonists declined after what year?”
Correct Answer: “1700”
Predicted Answer: “1675”
32
Jia and Liang EMNLP’17: AddSent
Step 1 – Change entities, numbers, antonyms:
“What city did Tesla move to in 1880?” (answer: Prague) → “What city did Tadakatsu move to in 1881?”
Step 2 – Generate a fake answer with the same NER/POS tag: “Chicago”
Step 3 – Convert to a declarative sentence: “Tadakatsu moved the city of Chicago to in 1881.”
Step 4 – Have crowd workers fix errors:
“Tadakatsu moved to the city of Chicago in 1881.”
“Tadakatsu moved to Chicago in 1881.”
“In 1881, Tadakatsu moved to the city of Chicago.”
The model fails if it is distracted by any of these appended (concatenated) sentences.
Jia and Liang EMNLP’17: AddSent
• F1 score
33
Edit Adversaries
• DeepWordBug [Gao et al. SP’18]
– Scoring function
– Character-level transforms are applied to the highest-ranked tokens
– Minimize the edit distance of the perturbation
34
Measure word importance according to its effect on the model prediction (a simplified sketch follows below).
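A simplified sketch of the idea (a generic leave-one-out importance score plus one character transform; not DeepWordBug's exact temporal scoring functions): `predict_proba` stands for the black-box victim model and is an assumed interface.

```python
import random

def word_importance(predict_proba, words, label):
    """Score each word by how much removing it lowers the probability of `label`."""
    base = predict_proba(" ".join(words))[label]
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append(base - predict_proba(reduced)[label])
    return scores

def swap_adjacent_chars(word):
    """One DeepWordBug-style transform: swap two neighbouring characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

# Perturb only the highest-scoring word to keep the edit distance of the attack small.
```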
Edit Adversaries
• Probability Weighted Word Saliency [Ren et al. ACL’19]
35
– Saliency
– Use the synonym that maximizes the change in the prediction output as the substitute word.
Edit Adversaries
• Linguistic information
– POS-Tags [Alhazmi et al. IJCNN’20]
36
Ahoud Abdulrahmn F. Alhazmi, Wei Emma Zhang, Quan Z. Sheng, Abdulwahab Aljubairy: Analyzing the Sensitivity of Deep Neural Networks for Sentiment Analysis: A Scoring Approach. IJCNN 2020
Ahoud Abdulrahmn F. Alhazmi, Wei Emma Zhang, Quan Z. Sheng, Abdulwahab Aljubairy: Are Modern Deep Learning Models for Sentiment Analysis Brittle: An Examination on Part-of-Speech. IJCNN 2020
Edit Adversaries
• Semantic relatedness [Alzantot et al. EMNLP’18]
– Nearest neighbours in the embedding space (GloVe) are candidate substitute words (a sketch of this step follows below)
– Post-process the GloVe embeddings to ensure that the nearest neighbours are synonyms
– Use language model to filter out words that do not fit within the context surrounding the word
– Pick the one that will maximize the target label prediction probability when it replaces the word.
37
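A hedged sketch of the candidate-generation step (illustrative names and interface, not the authors' code): nearest neighbours of a word under cosine similarity in an embedding space.

```python
import numpy as np

def nearest_neighbours(word, embeddings, k=8, min_sim=0.5):
    """embeddings: dict mapping word -> unit-normalised vector (e.g. GloVe)."""
    v = embeddings[word]
    sims = {w: float(v @ u) for w, u in embeddings.items() if w != word}
    ranked = sorted(sims.items(), key=lambda x: -x[1])[:k]
    return [w for w, s in ranked if s >= min_sim]

# In the attack, these candidates are then filtered with a language model and the
# one that most increases the target label's probability replaces the original word.
```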
Edit Adversaries – Summary
• Sentence-level
• Character-level and word-level
– Swap with neighboring character/word
– Delete character/word
– Insert distracting character/word
– Replace words with their synonym or paraphrases
– Change a verb to the wrong tense or form
– Negate the root verb of the source input
– Change verbs, adjectives, or adverbs to their antonyms
– …
38
Paraphrase-based Adversaries
• SCPN [Iyyer et al. NAACL’18]
– Produces a paraphrase of the given sentence with a desired syntax; the output is a targeted (syntactically controlled) paraphrase of the original sentence.
39
• Use the pre-trained PARANMT-50M corpus from Wieting and Gimpel (2017): 50 million paraphrases obtained by backtranslating the Czech side of a Czech–English parallel corpus.
• Parse the backtranslated paraphrases using the Stanford parser: ⟨s1, s2⟩ → ⟨p1, p2⟩
• Relax the target syntactic form to a parse template (top two levels of the linearized parse tree): p2 → t2
• Given a paraphrase pair ⟨s1, s2⟩ and corresponding syntax trees ⟨p1, p2⟩, an encoder-decoder paraphrase model is trained on: ⟨s1, p2⟩ → s2
• Given syntax trees ⟨p1, p2⟩ and templates ⟨t1, t2⟩, a parse generator is trained on: ⟨p1, t2⟩ → p2
Paraphrase-based Adversaries
• SCPN
– A: s1,p2->s2
– B: p1,t2 -> p2
40
Paraphrase-based Adversaries
• SCPN: On sentiment analysis
41
GAN-based Adversaries
• [Zhao et al. ICLR’18] Search in the space of the latent dense representation z of the input x to find an adversarial z*, then map z* back to x*.
• Generator G on X: G: z → x
• Inverter I: x → z
• Adversarial example x*: the perturbation is applied to the dense representation z
• Training:
42
GAN-based Adversaries
• Generating:
– Utilize the inverter to obtain the latent vector z = I(x), feed perturbations in the neighbourhood of z to the generator to generate natural samples x̃ = G(z̃), and check the prediction on x̃.
– Incrementally increase the search range within which the perturbations are randomly sampled until some generated samples change the prediction. Among these, choose the one whose latent code is closest to the original z as the adversarial example x* (see the sketch below).
43
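A sketch of this search procedure (illustrative stubs for the generator G, inverter I and classifier; not the authors' implementation): widen the sampling radius around z = I(x) until a generated sample flips the prediction, then keep the closest one.

```python
import numpy as np

def latent_search(x, classify, G, I, step=0.05, samples=64, max_r=1.0):
    z, y = I(x), classify(x)                 # latent code and original prediction
    r = step
    while r <= max_r:
        deltas = np.random.uniform(-r, r, size=(samples, z.shape[0]))
        hits = [(np.linalg.norm(d), z + d) for d in deltas
                if classify(G(z + d)) != y]  # samples that change the prediction
        if hits:
            _, z_star = min(hits, key=lambda t: t[0])
            return G(z_star)                 # adversarial example x*
        r += step                            # no flip found: enlarge the search range
    return None
```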
GAN-based Adversaries
• For text:
– Use a regularized autoencoder to get a continuous representation of the text
– Two MLPs serve as the generator and the inverter
44
BERT-based
• BAE [Garg et al. EMNLP’20]
– A black-box attack based on a language model
– Perturbations of the input sentence are achieved by masking a part of the input and using a LM to fill in the mask.
45
The authors use BERT-MLM to predict masked tokens in the text for generating adversarial examples. The [MASK] token replaces a word (BAE-R attack) or is inserted to the left/right of a word (BAE-I).
BERT-based
• Compute token importance by examining the change in the model prediction – this decides which word to perturb.
• For each token, in descending order of importance:
– Predict the top-k tokens for the mask, ranked by sentence similarity (see the sketch below)
– For BAE-R, filter out candidate tokens with a different POS tag
– If some candidates change the prediction, choose the token from the most similar sentence
– If none, choose the one that decreases the prediction probability the most
– Stop when the prediction changes (success) or all tokens have been checked (fail)
46
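A hedged sketch of the mask-and-fill step with the Hugging Face fill-mask pipeline (my illustration, not the authors' implementation); the filtering and victim-model queries then follow the steps listed above.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

sentence = "the movie was absolutely wonderful"
# BAE-R style: replace one word with the mask token and let BERT propose substitutes.
masked = sentence.replace("wonderful", fill.tokenizer.mask_token)
for cand in fill(masked, top_k=5):
    print(cand["token_str"], round(cand["score"], 3))
# BAE-I would instead insert the mask token to the left or right of the word.
```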
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• (a 5 mins break)
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
47
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• (a 5 mins break)
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
48
White-Box Attack
49
[Diagram] Original Example → Attack → Adversarial Example → Application DNN → Output
Fast Gradient Sign Method (FGSM)
• FGSM [Goodfellow et al. ICLR’15]
50
FGSM
• Linearize the loss around the input: 𝐽(𝜃, 𝑥 + 𝜂, 𝑦) ≈ 𝐽(𝜃, 𝑥, 𝑦) + 𝜂ᵀ∇ₓ𝐽(𝜃, 𝑥, 𝑦)
• Maximize the linearized loss subject to ‖𝜂‖∞ ≤ 𝜖, which gives 𝜂 = 𝜖 · sign(∇ₓ𝐽(𝜃, 𝑥, 𝑦))
(∇ₓ𝐽: gradient on x; 𝜖: perturbation size; 𝐽: loss)
51
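A minimal FGSM sketch in PyTorch (an illustrative assumption, not code from the tutorial): `model` is any differentiable classifier and x a continuous input, e.g. an image or an embedded sentence.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(theta, x, y)
    loss.backward()
    eta = eps * x.grad.sign()                # eps * sign(grad_x J)
    return (x + eta).detach()                # adversarial input x + eta
```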
FGSM
• FGSM hypothesizes that the design of modern DNNs intentionally encourages linear behaviour for computational gains.
52
White-Box Attack
53
• Gradient-based
• Optimisation-based
• Attention-based
• Adversarial Reprogramming
• Invariance Attack
Gradient-based Textual Attack
• TextFool [Liang et al. IJCAI’18]
54
TextFool
• Identify text items (hot phrases) that are important for classification according to their cost gradients
– Word-level: word vectors that possess the highest gradient magnitude
– Character-level: hot characters contain the dimensions with the highest magnitude → hot words contain ≥ 3 hot characters
• Then leverage these items to insert / modify / remove
55
Combination of the three strategies (from 83.7% “Building” to 95.7% “Means of Transportation”).
TextFool
• Modification
– The modification should follow the direction of the cost gradient, and go against the direction of …
56
TextFool attacks CNN-based models.
HotFlip [Ebrahimi et al. ACL’17]
• Generate adversarial examples with character “flips”.
– Flip: given one-hot representation of the input, a character flip in the jth character of the ith word (a→b) can be represented by the following vector:
57
• The flip vector sets −1 at the one-hot position of character a and +1 at the position of character b (e.g., “aid” → “bid”)
• Maximize the first-order approximation of the change in loss: the directional derivative of the loss along the flip vector
HotFlip
58
• Inserts and deletes can be treated as a sequence of character flips
• Multiple Changes: <20% flips.
• Word-level: derivatives with respect to one-hot word vectors + semantics-preserving constraints
• Efficiency: rather than query-based methods that need multiple forward and backward passes, HotFlip only requires one forward and one backward pass (a sketch of the flip score follows below).
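A sketch of the first-order flip score (illustrative; it assumes `grad` holds the gradient of the loss with respect to the one-hot character inputs, shape seq_len × vocab_size): the estimated loss increase from flipping position j from character a to b is grad[j, b] − grad[j, a].

```python
import numpy as np

def best_flip(grad, char_ids):
    """char_ids: current character id at each position of the input."""
    positions = np.arange(len(char_ids))
    current = grad[positions, char_ids]            # grad[j, a] for the current chars
    gains = grad - current[:, None]                # grad[j, b] - grad[j, a]
    gains[positions, char_ids] = -np.inf           # exclude "flips" to the same char
    j, b = np.unravel_index(np.argmax(gains), gains.shape)
    return j, b, gains[j, b]                       # position, new char id, estimated gain
```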
Seq2Sick [Cheng et al. AAAI’20]
59
• A machine translation example
– Non-overlapping attack
– Targeted keywords: “Hund Sitzt”
Seq2Sick
• Non-overlapping attack
60
Seq2Sick
• Targeted keyword attack
– The positions of the keywords in the output are not specified
61
Seq2Sick
• Overall objective function:
62
L_keyword or L_non-overlapping + group lasso regularization to avoid large changes (perturbation control) + a regularization term that penalizes a large distance to the nearest point in the word embedding space
Attention-based
• [Blohm et al. CoNLL’18]
– For Question Answering
– Two white-box attention-based attack
• Word-level:
– The authors leveraged the model’s internal attention distribution to find the pivotal sentence, i.e., the sentence the model assigns the largest weight to when deriving the correct answer.
– Then they exchanged the words that received the most attention with randomly chosen words from a known vocabulary.
• Sentence-level:
– Remove the whole sentence that gets the highest attention.
64
Reprogramming
• Adversarial Reprogramming [Elsayed et al., ICLR’19] is a new class of adversarial attacks where the adversary wishes to repurpose an existing neural network for a new task chosen by the attacker, without the need for the attacker to compute the specific desired output.
• Adversarial reprogramming shares the same basic idea as adversarial examples: the attack changes the behavior of a deep learning model by making changes to its input.
65
Reprogramming [Neekhara EMNLP’19]
[Diagram] The original task (input s, classifier C, label space ls) is repurposed for the adversary’s task (input t, classifier C′, label space lt)
Reprogramming
67
• Context-based vocabulary remapping model: a trainable 3D matrix
• Generate the adversarial sequence s: …
Reprogramming
• As sᵢ is discrete, the optimization problem is non-differentiable.
• The Gumbel-Softmax trick is used to smooth s and make training differentiable (see the sketch below).
69
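A minimal sketch of the Gumbel-Softmax relaxation using PyTorch's built-in function (illustrative, not the authors' training code): the soft samples are differentiable, so the remapping matrix can be trained with gradient descent.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 100, requires_grad=True)      # 5 positions, vocabulary of 100
soft = F.gumbel_softmax(logits, tau=0.5)               # soft, differentiable samples
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)    # one-hot forward, soft backward
print(soft.shape, hard.sum(dim=-1))                    # each sampled row sums to 1
```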
Reprogramming
70
Invariance Attack
• Invariance Attack [Chaturvedi et al. arXiv’20]
– Contrary to most adversarial attacks, this method looks to provide a model with maximally perturbed inputs that result in no change to the model’s output (i.e. invariance):
𝑓(𝑥′′) = 𝑦
where 𝑥′′ is a maximally perturbed input.
– Apply gradient to choose the position for replacement
– Choose replacement words from the vocabulary that do not appear in the input sentence – pick the ones that keep the loss minimized.
71
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via Adversarial Examples
• Future Perspective
72
Attacks on Multi-Modal Applications
• From a continuous data space (images) to a discrete data space (text)
73
Show-and-Fool [Chen et al. ACL’18]
• White-box optimization-based attack on CNN+RNN
• Targeted caption and targeted keyword
74
Show-and-Fool
• Targeted caption strategy:
– Given targeted caption
75
Show-and-Fool
• Targeted keywords strategy
– Given targeted keywords
76
Show-and-Fool
77
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via AdversarialExamples
• Future Perspective
78
Empirical Defences
• Defences that seem to work in practice, but lack theoretical proof.
• Many defences have been broken soon after new attacks were published.
• One notable exception is adversarial training
– A heuristic defence: it has no theoretical proof, but is effective in most cases.
79
[Madry et al. ’18]
Adversarial Training
• Adversarial training is the process of training a model to correctly classify both unmodified examples and adversarial examples.
– Data augmentation extends the original training set with the generated adversarial examples
– Model Regularization: uses the generated adversarial examples as a regularizer (the form is shown later).
80
Adversarial Training
• Adversarial training is the process of training a model to correctly classify both unmodified examples and adversarial examples.
– Data augmentation extends the original training set with the generated adversarial examples
81
[Jia and Liang. EMNLP’17]
Adversarial Training
• Adversarial training is the process of training a model to correctly classify both unmodified examples and adversarial examples.
– Model Regularization: model regularization uses the generated adversarial examples as a regularizer and follows the form
𝐽̃(𝜃, 𝑥, 𝑦) = 𝛼 𝐽(𝜃, 𝑥, 𝑦) + (1 − 𝛼) 𝐽(𝜃, 𝑥 + 𝜂, 𝑦)
where 𝜂 is the adversarial perturbation. In FGSM: 𝜂 = 𝜖 · sign(∇ₓ𝐽(𝜃, 𝑥, 𝑦))
82
Adversarial Training [Miyato et al. ICLR’17]
• Applies perturbations to the word embeddings in a recurrent neural network (a training-step sketch follows below)
83
The model with perturbed embeddings
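A hedged sketch of one adversarial-training step on word embeddings in PyTorch (illustrative names `embedding`, `encoder`; not the authors' code): the perturbation is taken along the loss gradient in embedding space, and the clean and adversarial losses are mixed.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(embedding, encoder, optimizer, token_ids, y,
                              eps=0.1, alpha=0.5):
    emb = embedding(token_ids)                              # embedded input
    clean_loss = F.cross_entropy(encoder(emb), y)

    # Perturb embeddings along the (normalised) gradient of the clean loss.
    grad = torch.autograd.grad(clean_loss, emb, retain_graph=True)[0]
    eta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv_loss = F.cross_entropy(encoder(emb + eta.detach()), y)

    loss = alpha * clean_loss + (1 - alpha) * adv_loss      # regularized objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```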
[Sato et al. ACL’20]
• Applies the aforementioned method to NMT
85
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via Adversarial Examples
• Future Perspective
86
Future Perspectives
• Perceivability vs Attack Effectiveness
87
• Invariance-based Attack
• Transferability [Yuan et al. ’19]
– same architecture with different data
– different architectures with the same application
– different architectures with different data
• Increase the robustness of the NLP models
• More applications
Thanks!
Q&A
88
Please refer to our paper: Wei Emma Zhang, Quan Z. Sheng, Ahoud Abdulrahmn F. Alhazmi, Chenliang Li: Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 11(3): 24:1-24:41, 2020