TRANSCRIPT
Adversarial Attacks on Deep-learning based NLP
Tutorial @ ICONIP’20
Dr. Wei (Emma) Zhang
21 November 2020
Self-Introduction – Wei (Emma) Zhang
• Short Bio:
– Lecturer, School of Computer Science, The University of Adelaide, Australia. July 2019 – now
– Research fellow, Department of Computing, Macquarie University. Mar 2017- June 2019
– Ph.D. in Computer Science from the University of Adelaide, Australia, Aug 2013 – Feb 2017.
• Research:
– Text mining, Natural Language Processing
– Internet of Things Applications
2
Motivation of this Tutorial
• Deep neural networks (DNNs) have gained significant popularity in many Artificial Intelligence (AI) applications
• DNNs are vulnerable to strategically modified samples, named adversarial examples
• Most attention has been devoted to generating adversarial examples for Computer Vision applications
• Relatively few works target Natural Language Processing DNN models, but they show a promising increasing trend
3
Expected Goal of this Tutorial
• Develop a shared vocabulary for talking about adversarial attacks on textual DNNs
• Understand adversarial attacks on textual DNNs and how they differ from attacks on images
• Perform black-box and white-box attacks
• Adopt defence strategies
4
Adversarial Examples- (very) Brief History
• History
– L-BFGS [Szegedy et al. ICLR’14]
• Coined the term “adversarial examples”: worst-case inputs for a model
• Finds the minimum-distance perturbation of an original point that incorrectly changes the output (label)
– FGSM [Goodfellow et al. ICLR’15]: Fast Gradient Sign Method
• Linear explanation
• Fast computation
– [Jia and Liang EMNLP’17]: first work in NLP
• Most of the papers are on computer vision – more than three times as many as on NLP.
5
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• (a 5 mins break)
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
6
Wei Emma Zhang, Quan Z. Sheng, Ahoud Abdulrahmn F. Alhazmi, Chenliang Li: Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 11(3): 24:1-24:41, 2020
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
7
An Example
• Paragraph: “The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined.”
• Question: “The number of new Huguenot colonists declined after what year?”
• Correct Answer: “1700”
8
Model used: BiDAF Ensemble (Seo et al., 2016)
Robin Jia and Percy Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP’17.
An Example
• Paragraph: “The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675.”
• Question: “The number of new Huguenot colonists declined after what year?”
• Correct Answer: “1700”
• Predicted Answer: “1675”
9
Robin Jia and Percy Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP’17.
Model used: BiDAF Ensemble (Seo et al., 2016)
Adversarial Examples
10
Formal Definition
Given:
– A DNN model 𝑓
– An allowed perturbation set 𝑆 with certain constraints
An adversarial example for 𝑥 is a point 𝑥′ = 𝑥 + 𝜂 for 𝜂 ∈ 𝑆, 𝑠.𝑡.
𝑓(𝑥 + 𝜂) ≠ 𝑓(𝑥) (untargeted), or 𝑓(𝑥 + 𝜂) = 𝑦′ (targeted)
11
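A minimal sketch of this definition in code (illustrative, not from the slides): here f is any classifier that returns a label, and the allowed set S is an L∞ ball of radius eps.

```python
import numpy as np

# Illustrative check of the definition above: x_adv = x + eta is adversarial
# if eta stays inside the allowed set S (here an L-infinity ball of radius eps)
# and the prediction changes (untargeted) or hits a chosen label y' (targeted).
def is_adversarial(f, x, x_adv, eps, target=None):
    eta = (x_adv - x).ravel()
    if np.linalg.norm(eta, ord=np.inf) > eps:
        return False                      # perturbation not in the allowed set S
    if target is None:
        return f(x_adv) != f(x)           # untargeted: any label change counts
    return f(x_adv) == target             # targeted: must equal the chosen y'
```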
Terminology
12
• Adversarial examples
– Perturbed examples.
• Adversarial attack (Evasion Attack)
– A method for generating adversarial examples
• Adversarial Machine Learning
– Technique that attempts to fool models by supplying deceptive input.
Terminology
13
• Adversarial Training
– The process of introducing adversarial examples into model training to make the model more robust.
• Generative Adversarial Nets – a different concept, not to be confused!
– Non-cooperative Game
Why do adversarial attacks matter?
14
[Song et al. ICML’18]
Why do adversarial attacks matter?
15
Samuel G. Finlayson et al. Science 2019;363:1287-1289
Why do adversarial attacks matter?
16
Samuel G. Finlayson et al. Science 2019;363:1287-1289
17
Why do adversarial attacks matter?
Why do adversarial attacks matter?
18
To study adversarial attacks
• Test the robustness of the model against worst-case examples.
19
• Discern how a model actually understands its input.
• Improve models through training and optimization with adversarial examples (adversarial training).
How to attack?
20
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
21
Prepare for an attack
• Black-box or White-box
• Targeted or Untargeted
• Character-level, Word-level, Sentence-level, or Subword-level
• Single Modal or Multi-Modal
22
Prepare for an attack
24
Wei Emma Zhang, Quan Z. Sheng, Ahoud Abdulrahmn F. Alhazmi, Chenliang Li: Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 11(3): 24:1-24:41, 2020
General Steps
• Attack Positions
– Which word/subword/character?
– The whole sentence
– Representation in latent space
25
• Attack Strategies
insert, delete, switch, replace with:
– Similar token
– Synonyms, paraphrase
– Token within a constraint distance
General Steps
• Perturbation Control: a way to measure the size of the perturbation, so that it can be controlled to keep fooling the victim DNN while remaining barely perceivable.
26
– Edit-based measurement (on the original text)
• e.g., Levenshtein Distance
– Jaccard similarity coefficient (on the original text)
• Checks token overlap
– Semantic-preserving measurement (on both)
– Norm-based measurement (L-p) (on the vectorized representation)
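A small, plain-Python sketch of two of these measures (illustrative, not tied to any particular attack):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit-based measurement: minimum number of character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity coefficient: token overlap between the two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

print(levenshtein("film", "flim"))                           # 2
print(jaccard("the film was great", "the film was awful"))   # 0.6
```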
General Steps
27
• Language Control
– Valid word or word embedding
– Grammar checker
– Language model (perplexity; see the sketch below)
– Paraphrases
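The perplexity check is often implemented with a pretrained causal language model; a hedged sketch using GPT-2 via Hugging Face transformers (an illustrative choice, not specified in the slides). Lower perplexity means the candidate reads as more natural text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss      # mean negative log-likelihood per token
    return float(torch.exp(loss))

# A perturbed candidate with much higher perplexity than the original can be rejected.
print(perplexity("The movie was surprisingly good."))
print(perplexity("The movie was surprsingly god."))
```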
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
28
Black-Box Attack
29
[Diagram] Original Example → Attack → Adversarial Example → Application DNN → Output
Black-Box Attack
• Edit Adversaries
• Paraphrase Adversaries
• GAN-based Adversaries
• BERT-based Adversaries
30
Jia and Liang EMNLP’17
• Generate concatenative adversaries
– Append distracting text to the paragraph
– Must ensure that added text does not actually answer the question
31
Paragraph: “The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675.”
Question: “The number of new Huguenot colonists declined after what year?”
Correct Answer: “1700”
Predicted Answer: “1675”
32
Jia and Liang EMNLP’17: AddSent
Step 1 – Change entities, numbers, antonyms:
“What city did Tesla move to in 1880?” (answer: Prague) → “What city did Tadakatsu move to in 1881?”
Step 2 – Generate a fake answer with the same NER/POS tag: “Chicago”
Step 3 – Convert to a declarative sentence: “Tadakatsu moved the city of Chicago to in 1881.”
Step 4 – Have crowd workers fix errors:
“Tadakatsu moved to the city of Chicago in 1881.”
“Tadakatsu moved to Chicago in 1881.”
“In 1881, Tadakatsu moved to the city of Chicago.”
The model fails if it is distracted by any of these appended (concatenated) sentences.
Jia and Liang EMNLP’17: AddSent
• F1 score
33
Edit Adversaries
• DeepWordBug [Gao et al. SP’18]
– Scoring function
– Character-level transforms are applied to the highest-ranked tokens
– Minimize the edit distance of the perturbation
34
Measure word importance according to its effect on the model prediction (a simplified sketch follows below).
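A simplified sketch of the idea (a generic leave-one-out importance score plus one character transform; not DeepWordBug's exact temporal scoring functions): `predict_proba` stands for the black-box victim model and is an assumed interface.

```python
import random

def word_importance(predict_proba, words, label):
    """Score each word by how much removing it lowers the probability of `label`."""
    base = predict_proba(" ".join(words))[label]
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append(base - predict_proba(reduced)[label])
    return scores

def swap_adjacent_chars(word):
    """One DeepWordBug-style transform: swap two neighbouring characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

# Perturb only the highest-scoring word to keep the edit distance of the attack small.
```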
Edit Adversaries
• Probability Weighted Word Saliency [Ren et al. ACL’19]
35
– Saliency
– Use the synonym that maximizes the change in the prediction output as the substitute word.
Edit Adversaries
• Linguistic information
– POS-Tags [Alhazmi et al. IJCNN’20]
36
Ahoud Abdulrahmn F. Alhazmi, Wei Emma Zhang, Quan Z. Sheng, Abdulwahab Aljubairy: Analyzing the Sensitivity of Deep Neural Networks for Sentiment Analysis: A Scoring Approach. IJCNN 2020
Ahoud Abdulrahmn F. Alhazmi, Wei Emma Zhang, Quan Z. Sheng, Abdulwahab Aljubairy: Are Modern Deep Learning Models for Sentiment Analysis Brittle: An Examination on Part-of-Speech. IJCNN 2020
Edit Adversaries
• Semantic relatedness [Alzantot et al. EMNLP’18]
– Nearest neighbours in the embedding space (GloVe) are candidate substitute words (a sketch of this step follows below)
– Post-process the GloVe embeddings to ensure that the nearest neighbours are synonyms
– Use language model to filter out words that do not fit within the context surrounding the word
– Pick the one that will maximize the target label prediction probability when it replaces the word.
37
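A hedged sketch of the candidate-generation step (illustrative names and interface, not the authors' code): nearest neighbours of a word under cosine similarity in an embedding space.

```python
import numpy as np

def nearest_neighbours(word, embeddings, k=8, min_sim=0.5):
    """embeddings: dict mapping word -> unit-normalised vector (e.g. GloVe)."""
    v = embeddings[word]
    sims = {w: float(v @ u) for w, u in embeddings.items() if w != word}
    ranked = sorted(sims.items(), key=lambda x: -x[1])[:k]
    return [w for w, s in ranked if s >= min_sim]

# In the attack, these candidates are then filtered with a language model and the
# one that most increases the target label's probability replaces the original word.
```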
Edit Adversaries – Summary
• Sentence-level
• Character-level and word-level
– Swap with neighboring character/word
– Delete character/word
– Insert distracting character/word
– Replace words with their synonym or paraphrases
– Change a verb to the wrong tense or form
– Negate the root verb of the source input
– Change verbs, adjectives, or adverbs to their antonyms
– …
38
Paraphrase-based Adversaries
• SCPN [Iyyer et al. NAACL’18]
– Produces a paraphrase of the given sentence with a desired syntax; the output is a targeted (syntactically controlled) paraphrase of the original sentence.
39
• Use the pre-trained PARANMT-50M corpus from Wieting and Gimpel (2017): 50 million paraphrases obtained by backtranslating the Czech side of a Czech–English parallel corpus.
• Parse the backtranslated paraphrases using the Stanford parser: ⟨s1, s2⟩ → ⟨p1, p2⟩
• Relax the target syntactic form to a parse template (top two levels of the linearized parse tree): p2 → t2
• Given a paraphrase pair ⟨s1, s2⟩ and corresponding syntax trees ⟨p1, p2⟩, an encoder-decoder paraphrase model is trained on: ⟨s1, p2⟩ → s2
• Given syntax trees ⟨p1, p2⟩ and templates ⟨t1, t2⟩, a parse generator is trained on: ⟨p1, t2⟩ → p2
Paraphrase-based Adversaries
• SCPN
– A: s1,p2->s2
– B: p1,t2 -> p2
40
Paraphrase-based Adversaries
• SCPN: On sentiment analysis
41
GAN-based Adversaries
• [Zhao et al. ICLR’18] Search in the space of the latent dense representation z of the input x to find an adversarial z*, then map z* back to x*.
• Generator G on X: G: z → x
• Inverter I: x → z
• Adversarial example x*: the perturbation is applied to the dense representation z
• Training:
42
GAN-based Adversaries
• Generating:
– Utilize the inverter to obtain the latent vector z = I(x), feed perturbations in the neighbourhood of z to the generator to generate natural samples x̃ = G(z̃), and check the prediction on x̃.
– Incrementally increase the search range within which the perturbations are randomly sampled until some generated samples change the prediction. Among these, choose the one whose latent code is closest to the original z as the adversarial example x* (see the sketch below).
43
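A sketch of this search procedure (illustrative stubs for the generator G, inverter I and classifier; not the authors' implementation): widen the sampling radius around z = I(x) until a generated sample flips the prediction, then keep the closest one.

```python
import numpy as np

def latent_search(x, classify, G, I, step=0.05, samples=64, max_r=1.0):
    z, y = I(x), classify(x)                 # latent code and original prediction
    r = step
    while r <= max_r:
        deltas = np.random.uniform(-r, r, size=(samples, z.shape[0]))
        hits = [(np.linalg.norm(d), z + d) for d in deltas
                if classify(G(z + d)) != y]  # samples that change the prediction
        if hits:
            _, z_star = min(hits, key=lambda t: t[0])
            return G(z_star)                 # adversarial example x*
        r += step                            # no flip found: enlarge the search range
    return None
```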
GAN-based Adversaries
• For text:
– Use a regularized autoencoder to get a continuous representation of the text
– Two MLPs serve as the generator and the inverter
44
BERT-based
• BAE [Garg et al. EMNLP’20]
– A black-box attack based on a language model
– Perturbations of the input sentence are achieved by masking a part of the input and using a LM to fill in the mask.
45
The authors use BERT-MLM to predict masked tokens in the text for generating adversarial examples. The [MASK] token replaces a word (BAE-R attack) or is inserted to the left/right of a word (BAE-I).
BERT-based
• Compute token importance by examining the change in the model prediction – this decides which word to perturb.
• For each token, in descending order of importance:
– Predict the top-k tokens for the mask, ranked by sentence similarity (see the sketch below)
– For BAE-R, filter out candidate tokens with a different POS tag
– If some candidates change the prediction, choose the token from the most similar sentence
– If none, choose the one that decreases the prediction probability the most
– Stop when the prediction changes (success) or all tokens have been checked (fail)
46
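A hedged sketch of the mask-and-fill step with the Hugging Face fill-mask pipeline (my illustration, not the authors' implementation); the filtering and victim-model queries then follow the steps listed above.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

sentence = "the movie was absolutely wonderful"
# BAE-R style: replace one word with the mask token and let BERT propose substitutes.
masked = sentence.replace("wonderful", fill.tokenizer.mask_token)
for cand in fill(masked, top_k=5):
    print(cand["token_str"], round(cand["score"], 3))
# BAE-I would instead insert the mask token to the left or right of the word.
```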
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• (a 5 mins break)
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
47
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• (a 5 mins break)
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via adversarial examples
• Future Perspective
48
White-Box Attack
49
[Diagram] Original Example → Attack → Adversarial Example → Application DNN → Output
Fast Gradient Sign Method (FGSM)
• FGSM [Goodfellow et al. ICLR’15]
50
FGSM
• Linearize the loss around the input: 𝐽(𝜃, 𝑥 + 𝜂, 𝑦) ≈ 𝐽(𝜃, 𝑥, 𝑦) + 𝜂ᵀ∇ₓ𝐽(𝜃, 𝑥, 𝑦)
• Maximize the linearized loss subject to ‖𝜂‖∞ ≤ 𝜖, which gives 𝜂 = 𝜖 · sign(∇ₓ𝐽(𝜃, 𝑥, 𝑦))
(∇ₓ𝐽: gradient on x; 𝜖: perturbation size; 𝐽: loss)
51
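A minimal FGSM sketch in PyTorch (an illustrative assumption, not code from the tutorial): `model` is any differentiable classifier and x a continuous input, e.g. an image or an embedded sentence.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(theta, x, y)
    loss.backward()
    eta = eps * x.grad.sign()                # eps * sign(grad_x J)
    return (x + eta).detach()                # adversarial input x + eta
```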
FGSM
• FGSM hypothesizes that the design of modern DNNs intentionally encourages linear behaviour for computational gains.
52
White-Box Attack
53
• Gradient-based
• Optimisation-based
• Attention-based
• Adversarial Reprogramming
• Invariance Attack
Gradient-based Textual Attack
• TextFool [Liang et al. IJCAI’18]
54
TextFool
• Identify text items (hot phrases) that are important for classification according to their cost gradients
– Word-level: word vectors that possess the highest gradient magnitude
– Character-level: hot characters contain the dimensions with the highest magnitude → hot words contain ≥ 3 hot characters
• Then leverage these items to insert / modify / remove
55
Combination of the three strategies (from 83.7% “Building” to 95.7% “Means of Transportation”).
TextFool
• Modification
– The modification should follow the direction of the cost gradient, and go against the direction of …
56
TextFool attacks CNN-based models.
HotFlip [Ebrahimi et al. ACL’17]
• Generate adversarial examples with character “flips”.
– Flip: given one-hot representation of the input, a character flip in the jth character of the ith word (a→b) can be represented by the following vector:
57
• The flip vector sets −1 at the one-hot position of character a and +1 at the position of character b (e.g., “aid” → “bid”)
• Maximize the first-order approximation of the change in loss: the directional derivative of the loss along the flip vector
HotFlip
58
• Inserts and deletes can be treated as a sequence of character flips
• Multiple Changes: <20% flips.
• Word-level: derivatives with respect to one-hot word vectors + semantics-preserving constraints
• Efficiency: rather than query-based methods that need multiple forward and backward passes, HotFlip only requires one forward and one backward pass (a sketch of the flip score follows below).
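A sketch of the first-order flip score (illustrative; it assumes `grad` holds the gradient of the loss with respect to the one-hot character inputs, shape seq_len × vocab_size): the estimated loss increase from flipping position j from character a to b is grad[j, b] − grad[j, a].

```python
import numpy as np

def best_flip(grad, char_ids):
    """char_ids: current character id at each position of the input."""
    positions = np.arange(len(char_ids))
    current = grad[positions, char_ids]            # grad[j, a] for the current chars
    gains = grad - current[:, None]                # grad[j, b] - grad[j, a]
    gains[positions, char_ids] = -np.inf           # exclude "flips" to the same char
    j, b = np.unravel_index(np.argmax(gains), gains.shape)
    return j, b, gains[j, b]                       # position, new char id, estimated gain
```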
Seq2Sick [Cheng et al. AAAI’20]
59
• A machine translation example
– Non-overlapping attack
– Targeted keywords: “Hund Sitzt”
Seq2Sick
• Non-overlapping attack
60
Seq2Sick
• Targeted keyword attack
– The positions of the keywords in the output are not specified
61
Seq2Sick
• Overall objective function:
62
L_keyword or L_non-overlapping + group lasso regularization to avoid large changes (perturbation control) + a regularization term that penalizes a large distance to the nearest point in the word embedding space
Attention-based
• [Blohm et al. CoNLL’18]
– For Question Answering
– Two white-box attention-based attack
• Word-level:
– The authors leveraged the model’s internal attention distribution to find the pivotal sentence, i.e., the sentence the model assigns the largest weight to when deriving the correct answer.
– Then they exchanged the words that received the most attention with randomly chosen words from a known vocabulary.
• Sentence-level:
– Remove the whole sentence that gets the highest attention.
64
Reprogramming
• Adversarial Reprogramming [Elsayed et al., ICLR’19] is a new class of adversarial attacks where the adversary wishes to repurpose an existing neural network for a new task chosen by the attacker, without the need for the attacker to compute the specific desired output.
• Adversarial reprogramming shares the same basic idea as adversarial examples: the attack changes the behavior of a deep learning model by making changes to its input.
65
Reprogramming [Neekhara EMNLP’19]
[Diagram] The original task (input s, classifier C, label space ls) is repurposed for the adversary’s task (input t, classifier C′, label space lt)
Reprogramming
67
• Context-based vocabulary remapping model: a trainable 3D matrix
• Generate the adversarial sequence s: …
Reprogramming
• As sᵢ is discrete, the optimization problem is non-differentiable.
• The Gumbel-Softmax trick is used to smooth s and make training differentiable (see the sketch below).
69
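A minimal sketch of the Gumbel-Softmax relaxation using PyTorch's built-in function (illustrative, not the authors' training code): the soft samples are differentiable, so the remapping matrix can be trained with gradient descent.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 100, requires_grad=True)      # 5 positions, vocabulary of 100
soft = F.gumbel_softmax(logits, tau=0.5)               # soft, differentiable samples
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)    # one-hot forward, soft backward
print(soft.shape, hard.sum(dim=-1))                    # each sampled row sums to 1
```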
Reprogramming
70
Invariance Attack
• Invariance Attack [Chaturvedi et al. arXiv’20]
– Contrary to most adversarial attacks, this method looks to provide a model with maximally perturbed inputs that result in no change to the model’s output (i.e. invariance):
𝑓(𝑥′′) = 𝑦
where 𝑥′′ is a maximally perturbed input.
– Apply gradient to choose the position for replacement
– Choose replacement words from the vocabulary that do not appear in the input sentence – pick the ones that keep the loss minimized.
71
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via Adversarial Examples
• Future Perspective
72
Attacks on Multi-Modal Applications
• From a continuous data space (images) to a discrete data space (text)
73
Show-and-Fool [Chen et al. ACL’18]
• White-box optimization-based attack on CNN+RNN
• Targeted caption and targeted keyword
74
Show-and-Fool
• Targeted caption strategy:
– Given targeted caption
75
Show-and-Fool
• Targeted keywords strategy
– Given targeted keywords
76
Show-and-Fool
77
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via AdversarialExamples
• Future Perspective
78
Empirical Defences
• Defences that seem to work in practice, but lack theoretical proof.
• Many defences have been broken soon after new attacks were published.
• One notable exception is adversarial training
– A heuristic defence: it has no theoretical proof, but is effective in most cases.
79
[Madry et al. ’18]
Adversarial Training
• Adversarial training is the process of training a model to correctly classify both unmodified examples and adversarial examples.
– Data augmentation extends the original training set with the generated adversarial examples
– Model Regularization: uses the generated adversarial examples as a regularizer (the form is shown later).
80
Adversarial Training
• Adversarial training is the process of training a model to correctly classify both unmodified examples and adversarial examples.
– Data augmentation extends the original training set with the generated adversarial examples
81
[Jia and Liang. EMNLP’17]
Adversarial Training
• Adversarial training is the process of training a model to correctly classify both unmodified examples and adversarial examples.
– Model Regularization: model regularization uses the generated adversarial examples as a regularizer and follows the form
𝐽̃(𝜃, 𝑥, 𝑦) = 𝛼 𝐽(𝜃, 𝑥, 𝑦) + (1 − 𝛼) 𝐽(𝜃, 𝑥 + 𝜂, 𝑦)
where 𝜂 is the adversarial perturbation. In FGSM: 𝜂 = 𝜖 · sign(∇ₓ𝐽(𝜃, 𝑥, 𝑦))
82
Adversarial Training [Miyato et al. ICLR’17]
• Applies perturbations to the word embeddings in a recurrent neural network (a training-step sketch follows below)
83
The model with perturbed embeddings
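A hedged sketch of one adversarial-training step on word embeddings in PyTorch (illustrative names `embedding`, `encoder`; not the authors' code): the perturbation is taken along the loss gradient in embedding space, and the clean and adversarial losses are mixed.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(embedding, encoder, optimizer, token_ids, y,
                              eps=0.1, alpha=0.5):
    emb = embedding(token_ids)                              # embedded input
    clean_loss = F.cross_entropy(encoder(emb), y)

    # Perturb embeddings along the (normalised) gradient of the clean loss.
    grad = torch.autograd.grad(clean_loss, emb, retain_graph=True)[0]
    eta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv_loss = F.cross_entropy(encoder(emb + eta.detach()), y)

    loss = alpha * clean_loss + (1 - alpha) * adv_loss      # regularized objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```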
[Sato et al. ACL’20]
• Applies the aforementioned method to NMT
85
Content of this Tutorial
• Introduction to Adversarial Examples
• Attack considerations
• Black-box Attack
• White-box Attack
• Attack on Multi-modal Applications
• Adversarial Training via Adversarial Examples
• Future Perspective
86
Future Perspectives
• Perceivability vs Attack Effectiveness
87
• Invariance-based Attack
• Transferability [Yuan et al. ’19]
– same architecture with different data
– different architectures with the same application
– different architectures with different data
• Increase the robustness of the NLP models
• More applications
Thanks!
Q&A
88
Please refer to our paper: Wei Emma Zhang, Quan Z. Sheng, Ahoud Abdulrahmn F. Alhazmi, Chenliang Li: Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 11(3): 24:1-24:41, 2020