The Uphill Battles of Doing NLP without Psychology and Theoretical Machine Learning

7th Swedish Language Technology Conference

Anders Søgaard, University of Copenhagen, Dept. of Computer Science

NLP WHEN I STARTED | NLP TODAY

NLP WHEN I STARTED + NAÏVE EMPIRICISM + LESS INTUITIVE METHODS

Outline

Three examples of psychologically naïve NLP:

1 The Dot Plot (syntactic parsing)

2 The Bad Fortune Speller (word recognition)

3 Attention without Attention (sentiment)

Two examples of machine learningly naïve NLP:

4 Flavors of Failure (dictionary induction)

5 Flavors of Success (multi-modal MT)

MODELS INTERPRETATION

Psychologically naïve NLP

Ur Example

Gold (1967) showed that even regular languages are unlearnable from finite samples.

That is, unlearnable in the sense of inducing the exact grammar that generated the observed strings.

Problem: We observe more than strings, the strings are generated by many grammars, and we do not need to induce identical grammars.

Gold (1967). Language Identification in the Limit. Information and Control 10(5): 447-474.

The Dot Plot

[Pierre Vinken]NP , [61 years old]ADJP , [will join the board]VP .

Modern parsers are trained to analyze sentences written by journalists.

Parsers learn to rely on near-perfect punctuation, a give-away of the syntactic analysis.

Poor performance on texts with non-standard punctuation.

Søgaard, de Lhoneux & Augenstein (2018). Nightmare at Test Time. BlackBox@EMNLP.

The Dot Plot

Humans are not sensitive to non-standard punctuation (Baldwin and Coady, 1978).

Hypothesis: Punctuation prevents parsers from learning deeper generalisations.

Baldwin and Coady (1978). Psycholinguistic approaches to a theory of punctuation. Journal of Literacy Research 10(4): 363-376.

Søgaard, de Lhoneux & Augenstein (2018). Nightmare at Test Time. BlackBox@EMNLP.
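To make the perturbation concrete, here is a minimal sketch of punctuation corruption and removal on a tokenised sentence. The function names are made up, and reading the chart's d/c settings as deletion/insertion probabilities is an assumption, not the paper's definition.

```python
import random
import string

def perturb_punctuation(tokens, delete_prob=0.05, insert_prob=0.05, seed=0):
    """Randomly drop existing punctuation tokens and inject spurious commas.
    Illustrative only; the d/c settings in the chart below are assumed to be
    deletion/insertion rates of this kind, not the authors' exact procedure."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in string.punctuation and rng.random() < delete_prob:
            continue                      # delete this punctuation mark
        out.append(tok)
        if rng.random() < insert_prob:
            out.append(",")               # insert a spurious comma
    return out

def strip_punctuation(tokens):
    """The NOPUNCT condition: remove all punctuation before training/parsing."""
    return [t for t in tokens if t not in string.punctuation]

sentence = "Pierre Vinken , 61 years old , will join the board .".split()
print(perturb_punctuation(sentence, delete_prob=0.5, insert_prob=0.1))
print(strip_punctuation(sentence))
```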

The Dot Plot

Baldwin and Coady (1978). Psycholinguistic approaches to a theory of punctuation. Journal of Literacy Research 10 (4): 363-376.

[Bar chart: labeled attachment scores (50-100) for UUParser and MaltParser across punctuation-perturbation settings d=0,c=0; nopunct; d=.01,c=.01; d=.01,c=.05; d=.05,c=.01; d=.05,c=.05; d=.1,c=.1, with minimum and maximum scores per setting ranging from 67.5 to 91.8; NOPUNCT training reaches 89.8.]

The Bad Fortune Speller

Modern NLP is often trained on sentences written by journalists.

This includes near-perfect spelling, making word recognition easy.

Poor performance on texts with non-standard spelling.

The Bad Fortune Speller

Huamns are not snesitvie to non-stnadard spellign (Forster

et al., 1987).

Hypothesis: Near-perfect spelling prevents models

from learning deeper generalisations.

Forster et al. (1987). Masked priming with graphemically related words. The Quarterly Journal of Experimental Psychology 32(4): 211-251.

Didn’t character-based RNNs solve this?

The Bad Fortune Speller

Heigold et al. (2018). How robust are character-based word embeddings in tagging and MT against wrod scramlbing or random nouse? AMTA.

[Bar chart (Heigold et al., 2018): POS tagging and MT scores (25-100) under noise settings s=0,f=0; s=0.1/0.05,f=0; s=0,f=0.1/0.05; POS scores are 94, 84 and 82 and MT scores 30.7, 25 and 21.7 across the three settings.]

The Bad Fortune Speller (semi-character RNNs)

Huamns are not snesitvie to non-stnadard spellign (Forster et al., 1987).

Sakaguchi et al. (2017). Robsut Wrod Reocginiton via semi-Character RNN. AAAI.

Forster et al. (1987). Masked priming with graphemically related words. The Quarterly Journal of Experimental Psychology 32(4): 211-251.
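For concreteness, below is a minimal sketch of the semi-character word encoding in the spirit of Sakaguchi et al. (2017): a one-hot vector for the first character, a bag of interior characters, and a one-hot vector for the last character, which is why scrambled interiors leave the representation unchanged. The helper name and the lowercase-ASCII alphabet are simplifications, not the authors' code.

```python
import string
from collections import Counter

ALPHABET = string.ascii_lowercase

def semi_character_vector(word):
    """Semi-character encoding: one-hot first character + bag of interior
    characters + one-hot last character (3 * 26 dims here). A minimal sketch
    restricted to lowercase ASCII."""
    word = word.lower()
    first = [1.0 if c == word[0] else 0.0 for c in ALPHABET]
    last = [1.0 if c == word[-1] else 0.0 for c in ALPHABET]
    interior = Counter(word[1:-1])
    middle = [float(interior.get(c, 0)) for c in ALPHABET]
    return first + middle + last

# Scrambling the interior leaves the encoding unchanged, which is why an RNN
# over these vectors is robust to "snesitvie"-style misspellings:
assert semi_character_vector("snesitvie") == semi_character_vector("sensitive")
print(len(semi_character_vector("spellign")))  # 78
```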

Attention without Attention

LSTMs with attention functions are popular models.

Attention functions are latent variables with no direct supervision.

Problem: Attention functions are prone to over-fitting (Rei and Søgaard, 2018).

Rei & Søgaard (2018). Zero-shot sequence labeling. NAACL.

Attention without Attention

Gaze data reflects word-level human attention, or relevance (Loboda et al., 2011).

Hypothesis: Gaze can be used to regularize neural attention (Barrett et al., 2018).

Loboda et al. (2011). Inferring word relevance from eye-movements of readers. IUI.

Barrett, Bingel, Hollenstein, Rei & Søgaard (2018). Sentence classification with human attention. CoNLL.

[Bar chart (Barrett et al., 2018): scores (40-90) on Sentiment, Error and Abusive tasks for learned vs. human attention; one series scores 72.45, 84.28 and 52.37, the other 71.23, 83.84 and 50.05.]
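One way to read "regularize neural attention with gaze" is as an extra loss term pulling the model's attention distribution towards normalized fixation durations. The sketch below shows that idea with a KL penalty; the weighting, the KL direction and the function name are assumptions, not Barrett et al.'s implementation.

```python
import numpy as np

def gaze_regularized_loss(attention, gaze_durations, task_loss, lam=0.1):
    """Task loss plus a KL penalty pulling the model's attention distribution
    towards normalized human fixation durations. A sketch only."""
    attention = np.asarray(attention, dtype=float)
    gaze = np.asarray(gaze_durations, dtype=float)
    gaze = gaze / gaze.sum()                                   # gaze as a distribution
    eps = 1e-9
    kl = float(np.sum(gaze * np.log((gaze + eps) / (attention + eps))))
    return task_loss + lam * kl

# Toy example: 4 tokens, model attention vs. total fixation durations (ms).
attn = [0.10, 0.60, 0.20, 0.10]
gaze_ms = [180.0, 220.0, 650.0, 120.0]
print(gaze_regularized_loss(attn, gaze_ms, task_loss=0.35))
```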

Summary

Human reading is insensitive to punctuation.

Human reading is insensitive to (a lot of) spelling variation.

Human attention during reading is partial and systematic.

Ignore punctuation. Semi-character RNNs. Regularize attention.

Machine learningly naïve NLP

Flavors of Failure

Conneau et al. (2018) show GANs with linear generators align cross-lingual embedding spaces.

Søgaard et al. (2018) show their approach is challenged by some language pairs.

Conneau et al. (2018). Word translation without parallel data. ICLR.

Søgaard et al. (2018). On the limitations of unsupervised bilingual dictionary induction. ACL.
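The alignment these papers study is a single linear map between monolingual embedding spaces; once a (pseudo-)dictionary pairs up words, the map is typically refined with orthogonal Procrustes. The sketch below shows only that refinement step on synthetic data; the adversarial induction of the initial dictionary is omitted, and the names are illustrative.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimizing ||XW - Y||_F, the standard refinement step
    once a (pseudo-)dictionary pairs rows of X with rows of Y. The adversarial
    dictionary induction of Conneau et al. (2018) is omitted here."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check on synthetic data: recover a random orthogonal map exactly.
rng = np.random.default_rng(0)
d = 50
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # ground-truth orthogonal map
X = rng.normal(size=(1000, d))                 # "source-language" embeddings
Y = X @ Q                                      # "target-language" embeddings
W = procrustes_align(X, Y)
print(np.allclose(W, Q))                       # True
```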

[Bar chart (Søgaard et al., 2018): unsupervised alignment performance (0-90) for English paired with es, et, fi, el, hu, pl and tr; reported values include 82, 33, 47 and 45, with 0 for three of the pairs.]

Flavors of Failure

Søgaard et al. (2018) show their approach is challenged by some language pairs. It’s the morphology, stupid!

Problem: How do we know when to blame morphology?

Søgaard et al. (2018). On the limitations of unsupervised bilingual dictionary induction. ACL.

Flavors of Failure

Chocolate consumption correlates with Nobel laureates.

Morphology correlates with siestas (Roberts and Winters, 2013).

Galton’s problem: geographical diffusion caused by borrowing and common ancestors (https://en.wikipedia.org/wiki/Galton%27s_problem).

Problem: How do we know when to blame morphology? Control for it?

Roberts & Winters (2013). Linguistic Diversity and Traffic Accidents. PLOS ONE.
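One simple way to "control for it" is a partial correlation: regress both variables on the suspected confound (shared ancestry or geography, as in Galton's problem) and correlate the residuals. The toy sketch below is purely illustrative and is not the analysis in Roberts & Winters (2013).

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out a confound z
    (e.g. shared ancestry/geography). A toy sketch."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    Z = np.column_stack([np.ones_like(z), z])           # intercept + confound
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # residualize x on z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # residualize y on z
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy data: "morphology" and "siestas" both driven by a latent areal factor.
rng = np.random.default_rng(1)
areal = rng.normal(size=200)
morphology = areal + 0.5 * rng.normal(size=200)
siestas = areal + 0.5 * rng.normal(size=200)
print(np.corrcoef(morphology, siestas)[0, 1])    # large spurious correlation
print(partial_corr(morphology, siestas, areal))  # close to zero
```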

Flavors of Failure

Hartmann et al. (2018) show failure cases for English-English.

Hartmann et al. (2019) show the loss landscapes of these cases.

[Figure (Hartmann et al., 2019): discriminator performance over the loss landscape, with a local minimum and the global minimum marked.]

Diagnostic for loss functions based on Wasserstein or Sinkhorn.

Hartmann et al. (2018). Why is unsupervised alignment of English embeddings from different algorithms so hard? EMNLP.

Hartmann et al. (2019). Alignability of word vector spaces. arXiv.
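A diagnostic "based on Wasserstein or Sinkhorn" presumably compares two embedding clouds with an optimal-transport cost. Below is a generic entropy-regularized (Sinkhorn) distance between point clouds of the kind such a check could be built on; it is a textbook implementation with made-up parameters, not Hartmann et al.'s code.

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=5.0, n_iters=200):
    """Entropy-regularized optimal-transport cost between two embedding clouds.
    Generic Sinkhorn iterations, not Hartmann et al.'s implementation."""
    n, m = len(X), len(Y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)        # uniform point weights
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)     # squared Euclidean costs
    K = np.exp(-C / reg)                                    # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                                # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                         # transport plan
    return float((P * C).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = rng.normal(size=(100, 10)) + 1.0   # shifted cloud: higher cost than X vs. X
print(sinkhorn_distance(X, X), sinkhorn_distance(X, Y))
```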

Flavors of Success

Caglayan et al. (2017) use an attentive encoder-decoder for image-aware MT.

Example caption: Man with Mardi Gras beads around his neck holding pole with banner.

Caglayan et al. (2017). LIUM-CVC Submissions for WMT17 Multimodal Translation Task. WMT.

They report a 1.2 point METEOR improvement from adding related images.

… but in what sense is the system aware of the related image?

Flavors of Success

Elliott (2018) shows that pairing texts with random images obtains the same improvements.

Hypothesis: The images simply help the optimizer.

Rationale: Over-parameterisation and skip connections help the optimizer.

Elliott (2018). Adversarial evaluation of multimodal machine translation. EMNLP.

Li et al. (2017). Visualizing the loss landscapes of neural nets. ICLR.
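Elliott's adversarial evaluation boils down to re-running the test set with the text-image pairing broken and checking whether the score drops. The sketch below spells that out; `translate` and `score` stand in for a multimodal MT system and a metric such as METEOR, and the whole thing is an illustrative mock-up, not Elliott's code.

```python
import random

def congruence_delta(test_set, translate, score, seed=0):
    """Awareness probe in the spirit of Elliott (2018): translate with the
    congruent image and with a randomly reassigned (incongruent) image,
    and compare corpus-level scores."""
    rng = random.Random(seed)
    sources = [ex["src"] for ex in test_set]
    references = [ex["ref"] for ex in test_set]
    images = [ex["image"] for ex in test_set]
    shuffled = images[:]
    rng.shuffle(shuffled)                                  # break the text-image pairing
    congruent = [translate(s, i) for s, i in zip(sources, images)]
    incongruent = [translate(s, i) for s, i in zip(sources, shuffled)]
    return score(congruent, references) - score(incongruent, references)

# Dummy demo with a "multimodal" model that never looks at the image: delta is 0.
def ignore_image(src, img):
    return src

def exact_match(hyps, refs):
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)

data = [{"src": "ein Mann", "ref": "a man", "image": "img1.jpg"},
        {"src": "eine Frau", "ref": "a woman", "image": "img2.jpg"}]
print(congruence_delta(data, ignore_image, exact_match))  # 0.0 => image not used
```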

Summary

The Nobel-Chocolate Fallacy: Is it really the morphology?

Are our models really aware of external information?

Summary

Understand the limitations of models and optimizers.

In high-dimensional space, we have limited intuitions.

Bingel & Søgaard (2017), e.g., show that linguistic intuitions are a poor predictor of multi-task gains.

Bingel & Søgaard (2017). Identifying beneficial task relations for multi-task learning in deep neural networks. EACL.

What’s wrong with being naïve?

The scientific dance

Simplification ↔ Anomaly

Common simplification: Random finite samples are representative.

Alternative simplification: Controlled/synthetic language.

Hegel’s holiday?

Controlled language | Finite samples

NLP WHEN I STARTED | NLP TODAY | NLP TOMORROW

Questions?

Supplementary slides

John Dewey (1910)

A: “It will probably rain tomorrow.” B: “Why do you think so?” A: “Because the sky was lowering at sunset.” B: “What has that to do with it?” A: “I do not know, but it generally does rain after such a sunset.”

John Dewey (1910)

[The scientific] method of proceeding is by varying conditions one by one so far as possible, and noting just what happens when a given condition is eliminated. There are two methods for varying conditions. The first […] consists in comparing very carefully the results of a great number of observations which have occurred under accidentally different conditions. […]  [This] method […] is, however, badly handicapped; it can do nothing until it is presented with a certain number of diversified cases. […] The method is passive and dependent upon external accidents. Hence the superiority of the active or experimental method. Even a small number of observations may suggest an explanation - a hypothesis or theory. Working upon this suggestion, the scientist may then intentionally vary conditions and note what happens.

• NON-SCIENTIFIC METHOD: Extract regularities from available data (Penn Treebank).

• EMPIRICAL METHOD: Get more data (Web Treebank).

• EXPERIMENTAL METHOD: Create data (punctuation injection, etc.).
