grammar: five analysis methods investigating bert’s

40
Investigating BERT’s Knowledge of Grammar: Five analysis methods with NPIs Alex Warstadt*, Yu Cao*, Ioana Grosu*, Wei Peng*, Hagen Blix*, Yining Nie*, Anna Alsop*, Shikha Bordia*, Haokun Liu*, Alicia Parrish*, Sheng-Fu Wang*, Jason Phang*, Anhad Mohananey*, Phu Mon Htut*, Paloma Jeretic*, and Samuel R. Bowman New York University *=equal contribution EMNLP 2019 November 6, 2019 Slides:

Upload: others

Post on 29-Dec-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Grammar: Five analysis methods Investigating BERT’s

Investigating BERT’s Knowledge of Grammar: Five analysis methods with NPIsAlex Warstadt*, Yu Cao*, Ioana Grosu*, Wei Peng*, Hagen Blix*, Yining Nie*, Anna Alsop*, Shikha Bordia*, Haokun Liu*, Alicia Parrish*, Sheng-Fu Wang*, Jason Phang*, Anhad Mohananey*, Phu Mon Htut*, Paloma Jeretic*, and Samuel R. Bowman

New York University *=equal contribution

EMNLP 2019November 6, 2019

Slides:

Page 2: Grammar: Five analysis methods Investigating BERT’s

Recent research asks:

BERT (and other self-supervised models) clearly knows a lot about language, but what does it know about linguistic phenomenon X?

(Linzen et al., 2016; Wilcox et al., 2018; Warstadt and Bowman, 2019)

Page 3: Grammar: Five analysis methods Investigating BERT’s

But first…

What analysis methods should you use to test for phenomenon-specific knowledge?

Page 4: Grammar: Five analysis methods Investigating BERT’s

Our approach:

One phenomenonNegative Polarity Items (NPIs)

Five analysis methods- Probing tasks- Masked language modeling- Acceptability judgments (3 ways)

Page 5: Grammar: Five analysis methods Investigating BERT’s

Conclusions

No one analysis method can show what a model “knows” a phenomenon. A variety of experimental methods is necessary to tell a full story.

Page 6: Grammar: Five analysis methods Investigating BERT’s

Negative Polarity Items (NPIs)

Page 7: Grammar: Five analysis methods Investigating BERT’s

Negative Polarity Items

Expressions (e.g. any, ever) that only appear in a “negative” environment.

(1) Mary has ever been to France. ✗(2) Mary hasn’t ever been to France. ✓

Page 8: Grammar: Five analysis methods Investigating BERT’s

Why NPIs?

Page 9: Grammar: Five analysis methods Investigating BERT’s

Why NPIs?

They’re just really complicated!

Page 10: Grammar: Five analysis methods Investigating BERT’s

Why NPIs?

They’re just really complicated!

...and extensively studied in theoretical linguistics.

Fauconnier (1975), Ladusaw (1979), Linebarger (1980), Kadmon & Landman (1993), Giannakidou (1998), Chierchia (2013), and many others

Page 11: Grammar: Five analysis methods Investigating BERT’s

Why NPIs?

They’re just really complicated!

...and extensively studied in theoretical linguistics.

Fauconnier (1975), Ladusaw (1979), Linebarger (1980), Kadmon & Landman (1993), Giannakidou (1998), Chierchia (2013), and many others

And in NLP: Marvin & Linzen, (2018), Wilcox et al., (2019)

Page 12: Grammar: Five analysis methods Investigating BERT’s

1. The class of licensing environments is heterogeneous.

Page 13: Grammar: Five analysis methods Investigating BERT’s

2. The syntactic scope of the licensor matters.

Page 14: Grammar: Five analysis methods Investigating BERT’s

Our Data

>136k sentences generated by hand-crafted grammars and automatically labeled with boolean acceptability (grammaticality) judgments.

Page 15: Grammar: Five analysis methods Investigating BERT’s

Our Data

>136k sentences generated by hand-crafted grammars and automatically labeled with boolean acceptability (grammaticality) judgments.

Vocab of >1000 items encoding fine-grained selectional information.

Page 16: Grammar: Five analysis methods Investigating BERT’s

Our Data

>136k sentences generated by hand-crafted grammars and automatically labeled with boolean acceptability (grammaticality) judgments.

Vocab of >1000 items encoding fine-grained selectional information.

Scales up similar methods by Ettinger et al (2016), Marvin & Linzen (2018).

Page 17: Grammar: Five analysis methods Investigating BERT’s

Our Data

>136k sentences generated by hand-crafted grammars and automatically labeled with boolean acceptability (grammaticality) judgments.

Vocab of >1000 items encoding fine-grained selectional information.

Scales up similar methods by Ettinger et al (2016), Marvin & Linzen (2018).

9 sub-datasets. 1 per licensing environment.

Page 18: Grammar: Five analysis methods Investigating BERT’s

Our Data

>136k sentences generated by hand-crafted grammars and automatically labeled with boolean acceptability (grammaticality) judgments.

Vocab of >1000 items encoding fine-grained selectional information.

Scales up similar methods by Ettinger et al (2016), Marvin & Linzen (2018).

9 sub-datasets. 1 per licensing environment.

A crowd-worker validation gives >82% agreement with our boolean labels.

Page 19: Grammar: Five analysis methods Investigating BERT’s

Zero-Shot LearningTraining and evaluating on data generated from the same grammar is too easy.

We train on one sub-dataset (licensing environment), evaluate on the others, to test how easily BERT generalizes.

Page 20: Grammar: Five analysis methods Investigating BERT’s

Five Analysis Methods

Page 21: Grammar: Five analysis methods Investigating BERT’s

Analysis Method 1:Probing Tasks

Train a classifier on top of BERT without fine-tuning.

Common method in NLP analysis (Adi et al., 2016; Ettinger et al., 2016; Conneau & Kiela, 2018; Tenney et al., 2019)

Page 22: Grammar: Five analysis methods Investigating BERT’s

Probing tasks

Question: Does BERT know the syntactic scope of NPI licensors?

Page 23: Grammar: Five analysis methods Investigating BERT’s

Results: Probing Tasks

Page 24: Grammar: Five analysis methods Investigating BERT’s

Analysis method 2:Masked Language Modeling

We can use BERT’s pre-training task to test it in a totally unsupervised setting.

Similar methods used by Linzen et al (2016), Wilcox et al (2018).

Page 25: Grammar: Five analysis methods Investigating BERT’s

Masked Language Modeling

Question: Does BERT assign higher probability to grammatical completions?

John knows <MASK> Betsy has ever been to France.

whether

that☞

Page 26: Grammar: Five analysis methods Investigating BERT’s

Two Kinds of Minimal Pairs

John knows Betsy has ever been to France that☞ whether

Licensor-Presence

John knows that Betsy has been to France ever☞ often

NPI-Presence

Page 27: Grammar: Five analysis methods Investigating BERT’s

Results:Masked Language Modeling

Page 28: Grammar: Five analysis methods Investigating BERT’s

Analysis method 3:Boolean acceptability classifier

Same task is used by Linzen et al. (2016), CoLA dataset (Warstadt et al., 2019), Kann et al. (2019)

Classifier

Mary hasn’t eaten any cookies.

Mary has eaten any cookies.

Acceptable (grammatical)

Unacceptable (ungrammatical)

Page 29: Grammar: Five analysis methods Investigating BERT’s

Results:Boolean acceptability classification

Page 30: Grammar: Five analysis methods Investigating BERT’s

Analysis methods 4&5: Boolean/Gradient minimal pairs

A synthesis of the acceptability classification method and the masked language model method.Question: Is a supervised acceptability classifier sensitive to minimal differences between sentences?

Page 31: Grammar: Five analysis methods Investigating BERT’s

Classifier

Only Mary has ever eaten the cake.

Mary has ever eaten only the cake.

Acceptable

Unacceptable

Correct

Classifier

Acceptable

Acceptable

Incorrect

Only Mary has ever eaten the cake.

Mary has ever eaten only the cake.

Boolean minimal pair:The classifier must label both sentences correctly.

Page 32: Grammar: Five analysis methods Investigating BERT’s

Classifier

Only Mary has ever eaten the cake.

Mary has ever eaten only the cake.

0.4

0.1

Correct

Classifier

0.1

0.4

Incorrect

Only Mary has ever eaten the cake.

Mary has ever eaten only the cake.

Gradient minimal pair:The classifier must prefer the grammatical sentence.

Page 33: Grammar: Five analysis methods Investigating BERT’s

Results: Scope detection with minimal pairs

Page 34: Grammar: Five analysis methods Investigating BERT’s

Results: Scope detection with minimal pairs

Page 35: Grammar: Five analysis methods Investigating BERT’s

ResultsScope detection by environment

Page 36: Grammar: Five analysis methods Investigating BERT’s

Recap: If we had only used...

Probing, Masked language modeling, or Boolean acceptability: - BERT encodes properties of NPIs well, but imperfectly.

Boolean minimal pairs- BERT has pretty weak knowledge of NPIs

Gradient minimal pairs- BERT has totally systematic knowledge of NPIs.

Page 37: Grammar: Five analysis methods Investigating BERT’s

Conclusions:What does BERT know about NPIs?

● BERT does have systematic knowledge about NPIs.● But this knowledge is gradient, not discrete.● And it can only be observed after giving BERT some amount of

supervision.

Page 38: Grammar: Five analysis methods Investigating BERT’s

Summary:What we learn about methodology

● Relying on one method can give misleading results when trying to probe for knowledge of a phenomenon.

● Researchers should compare boolean and gradient measures.● And compare supervised and unsupervised evaluation methods.

Page 39: Grammar: Five analysis methods Investigating BERT’s

Resources

Generation Code:https://github.com/alexwarstadt/data_generation

Experiments:https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/bert_npi

NPI Data:https://alexwarstadt.files.wordpress. com/2019/08/npi_lincensing_data.zip

Page 40: Grammar: Five analysis methods Investigating BERT’s

Thank you!

This material is based upon work supported by the National Science Foundation under Grant No. 1850208. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This project has also benefited from financial support to SB by Samsung Research under the project Improving Deep Learning using Latent Structure and from the donation of a Titan V GPU by NVIDIA Corporation.